Parsing Domain From URL In PHP

0 votes
asked Nov 9, 2008 by zuk1

I need to build a function which parses the domain from a URL.

So, with

http://google.com/dhasjkdas/sadsdds/sdda/sdads.html

or

http://www.google.com/dhasjkdas/sadsdds/sdda/sdads.html

it should return google.com

with

http://google.co.uk/dhasjkdas/sadsdds/sdda/sdads.html

it should return google.co.uk.

18 Answers

0 votes
answered Jan 9, 2008 by greg

Check out parse_url()

0 votes
answered Nov 9, 2008 by owen

Check out parse_url():

$url = 'http://google.com/dhasjkdas/sadsdds/sdda/sdads.html';
$parse = parse_url($url);
echo $parse['host']; // prints 'google.com'

parse_url() doesn't handle really badly mangled URLs very well, but it's fine if you generally expect decent URLs.
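
For example, with no scheme the host ends up under 'path' rather than 'host' (which the next answer works around):

var_dump(parse_url('google.com/abc'));
// yields only a 'path' element ("google.com/abc"); there is no 'host' key at all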

0 votes
answered Nov 29, 2009 by philfreo

From http://us3.php.net/manual/en/function.parse-url.php#93983

For some odd reason, parse_url() returns the host (e.g. example.com) as the path when no scheme is provided in the input URL. So I've written a quick function to get the real host:

function getHost($Address) {
    $parseUrl = parse_url(trim($Address));
    // With no scheme, parse_url() puts the host into 'path', so fall back to everything before the first '/'
    $pathParts = explode('/', isset($parseUrl['path']) ? $parseUrl['path'] : '', 2);
    return trim(!empty($parseUrl['host']) ? $parseUrl['host'] : $pathParts[0]);
}

getHost("example.com"); // Gives example.com 
getHost("http://example.com"); // Gives example.com 
getHost("www.example.com"); // Gives www.example.com 
getHost("http://example.com/xyz"); // Gives example.com 
0 votes
answered Nov 29, 2009 by alix-axel
$domain = str_ireplace('www.', '', parse_url($url, PHP_URL_HOST));

This would return google.com for both http://google.com/... and http://www.google.com/...
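
Note that str_ireplace() removes every occurrence of 'www.' in the host, not just a leading one. If you only want to strip a leading www, a minimal variant (same $url as above) would be:

$host = parse_url($url, PHP_URL_HOST);
$domain = preg_replace('/^www\./i', '', $host); // only strips "www." at the start of the host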

0 votes
answered Jan 23, 2011 by luka

Here is the code I made that 100% finds only the domain name, since it takes the Mozilla sub-TLD list into account. The only thing you have to check is how you cache that file, so you don't query Mozilla every time.

For some strange reason, domains like co.uk are not in the list, so you have to hack around it and add them manually. It's not the cleanest solution, but I hope it helps someone.

//=====================================================
static function domain($url)
{
    $slds = "";
    $url = strtolower($url);

    $address = 'http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1';
    if(!$subtlds = @kohana::cache('subtlds', null, 60)) 
    {
        $content = file($address);
        foreach($content as $num => $line)
        {
            $line = trim($line);
            if($line == '') continue;
            if(substr($line, 0, 2) == '//') continue; // skip comment lines
            $line = @preg_replace("/[^a-zA-Z0-9\.]/", '', $line);
            if($line == '') continue;  //$line = '.'.$line;
            if(@$line[0] == '.') $line = substr($line, 1);
            if(!strstr($line, '.')) continue;
            $subtlds[] = $line;
            //echo "{$num}: '{$line}'"; echo "<br>";
        }
        $subtlds = array_merge(Array(
            'co.uk', 'me.uk', 'net.uk', 'org.uk', 'sch.uk', 'ac.uk', 
            'gov.uk', 'nhs.uk', 'police.uk', 'mod.uk', 'asn.au', 'com.au',
            'net.au', 'id.au', 'org.au', 'edu.au', 'gov.au', 'csiro.au',
            ),$subtlds);

        $subtlds = array_unique($subtlds);
        //echo var_dump($subtlds);
        @kohana::cache('subtlds', $subtlds);
    }


    preg_match('/^(https?:\/\/)?([^\/]+)/i', $url, $matches);
    //preg_match("/^(http:\/\/|https:\/\/|)[a-zA-Z-]([^\/]+)/i", $url, $matches);
    $host = @$matches[2];
    //echo var_dump($matches);

    preg_match("/[^\.\/]+\.[^\.\/]+$/", $host, $matches);
    foreach($subtlds as $sub) 
    {
        if (preg_match("/{$sub}$/", $host, $xyz))
        preg_match("/[^\.\/]+\.[^\.\/]+\.[^\.\/]+$/", $host, $matches);
    }

    return @$matches[0];
}

0 votes
answered Nov 27, 2011 by shaun

The code that was meant to work 100% didn't seem to cut it for me. I patched the example a little, but found code that wasn't helping and problems with it, so I changed it into a couple of functions (to save asking Mozilla for the list all the time, and to remove the cache system). This has been tested against a set of 1000 URLs and seemed to work.

function domain($url)
{
    global $subtlds;
    $slds = "";
    $url = strtolower($url);

    $host = parse_url('http://'.$url, PHP_URL_HOST);

    preg_match("/[^\.\/]+\.[^\.\/]+$/", $host, $matches);
    foreach($subtlds as $sub){
        if (preg_match('/\.'.preg_quote($sub).'$/', $host, $xyz)){
            preg_match("/[^\.\/]+\.[^\.\/]+\.[^\.\/]+$/", $host, $matches);
        }
    }

    return @$matches[0];
}

function get_tlds(){
    $address = 'http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1';
    $content = file($address);
    foreach($content as $num => $line){
            $line = trim($line);
            if($line == '') continue;
            if(substr($line, 0, 2) == '//') continue; // skip comment lines
            $line = @preg_replace("/[^a-zA-Z0-9\.]/", '', $line);
            if($line == '') continue;  //$line = '.'.$line;
            if(@$line[0] == '.') $line = substr($line, 1);
            if(!strstr($line, '.')) continue;
            $subtlds[] = $line;
            //echo "{$num}: '{$line}'"; echo "<br>";
    }

    $subtlds = array_merge(array(
            'co.uk', 'me.uk', 'net.uk', 'org.uk', 'sch.uk', 'ac.uk', 
            'gov.uk', 'nhs.uk', 'police.uk', 'mod.uk', 'asn.au', 'com.au',
            'net.au', 'id.au', 'org.au', 'edu.au', 'gov.au', 'csiro.au'
            ),$subtlds);

    $subtlds = array_unique($subtlds);

    return $subtlds;    
}

Then use it like:

$subtlds = get_tlds();
echo domain('www.example.com');    // outputs: example.com
echo domain('www.example.uk.com'); // outputs: example.uk.com
echo domain('www.example.fr');     // outputs: example.fr

I know I should have turned this into a class, but didn't have time.
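
For reference, a minimal sketch of how the two functions above might be wrapped in a class (the DomainParser name is an assumption, not part of the original answer):

class DomainParser
{
    private $subtlds;

    public function __construct(array $subtlds)
    {
        // Expects the list produced by get_tlds() above
        $this->subtlds = $subtlds;
    }

    public function domain($url)
    {
        $url  = strtolower($url);
        $host = parse_url('http://' . $url, PHP_URL_HOST);

        // Default: keep the last two labels of the host
        preg_match("/[^\.\/]+\.[^\.\/]+$/", $host, $matches);
        foreach ($this->subtlds as $sub) {
            // If the host ends in a known second-level TLD, keep three labels instead
            if (preg_match('/\.' . preg_quote($sub) . '$/', $host)) {
                preg_match("/[^\.\/]+\.[^\.\/]+\.[^\.\/]+$/", $host, $matches);
            }
        }
        return isset($matches[0]) ? $matches[0] : '';
    }
}

// Usage, same behaviour as the functions above:
// $parser = new DomainParser(get_tlds());
// echo $parser->domain('www.example.co.uk'); // example.co.uk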

0 votes
answered Jan 29, 2012 by will

parse_url() didn't work for me. It only returned the path. Switching to basics using PHP 5.3+:

$url  = str_replace('http://', '', strtolower( $s->website));
if (strpos($url, '/'))  $url = strstr($url, '/', true);
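
Note that this only strips http://. A variant along the same back-to-basics lines that also copes with https:// and a port (still using the answer's $s->website input) might be:

$url = preg_replace('#^https?://#i', '', strtolower($s->website));
if (strpos($url, '/') !== false) $url = strstr($url, '/', true);
list($url) = explode(':', $url, 2); // drop a trailing :port if present
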
0 votes
answered Jan 4, 2014 by wonderland

Here is my crawler based on the above answers.

  1. Class implementation (I like OOP :)
  2. It uses cURL, so we can use HTTP auth if required
  3. It only crawls links that belong to the start URL's domain
  4. It prints the HTTP response code (useful for checking problems on a site)

CRAWL CLASS CODE

class crawler
{
    protected $_url;
    protected $_depth;
    protected $_host;

    public function __construct($url, $depth = 5)
    {
        $this->_url = $url;
        $this->_depth = $depth;
        $parse = parse_url($url);
        $this->_host = $parse['host'];
    }

    public function run()
    {
        $this->crawl_page($this->_url, $this->_depth);
    }

    public function crawl_page($url, $depth = 5)
    {
        static $seen = array();
        if (isset($seen[$url]) || $depth === 0) {
            return;
        }
        $seen[$url] = true;
        list($content, $httpcode) = $this->getContent($url);

        $dom = new DOMDocument('1.0');
        @$dom->loadHTML($content);
        $this->processAnchors($dom, $url, $depth);

        ob_end_flush();
        echo "CODE::$httpcode, URL::$url <br>";
        ob_start();
        flush();
        // echo "URL:", $url, PHP_EOL, "CONTENT:", PHP_EOL, $dom->saveHTML(), PHP_EOL, PHP_EOL;
    }

    public function processAnchors($dom, $url, $depth)
    {
        $anchors = $dom->getElementsByTagName('a');
        foreach ($anchors as $element) {
            $href = $element->getAttribute('href');
            if (0 !== strpos($href, 'http')) {
                $path = '/' . ltrim($href, '/');
                if (extension_loaded('http')) {
                    $href = http_build_url($url, array('path' => $path));
                } else {
                    $parts = parse_url($url);
                    $href = $parts['scheme'] . '://';
                    if (isset($parts['user']) && isset($parts['pass'])) {
                        $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                    }
                    $href .= $parts['host'];
                    if (isset($parts['port'])) {
                        $href .= ':' . $parts['port'];
                    }
                    $href .= $path;
                }
            }
            // Crawl only link that belongs to the start domain
            if (strpos($href, $this->_host) !== false)
                $this->crawl_page($href, $depth - 1);
        }
    }

    public function getContent($url)
    {
        $handle = curl_init($url);
        curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);

        /* Get the HTML or whatever is linked in $url. */
        $response = curl_exec($handle);

        /* Check for 404 (file not found). */
        $httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
        if ($httpCode == 404) {
            /* Handle 404 here. */
        }

        curl_close($handle);
        return array($response, $httpCode);
    }
}

// USAGE
$startURL = 'http://YOUR_START_URL';
$depth = 2;
$crawler = new crawler($startURL, $depth);
$crawler->run();

0 votes
answered Jan 10, 2014 by t-brian-jones

This will generally work very well if the input URL is not total junk. It removes the subdomain.

$host = parse_url( $Row->url, PHP_URL_HOST );
$parts = explode( '.', $host );
$parts = array_reverse( $parts );
$domain = $parts[1].'.'.$parts[0];

Example

Input: http://www2.website.com:8080/some/file/structure?some=parameters

Output: website.com
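
Because it keeps only the last two labels, an input like http://www.google.co.uk would come out as co.uk. Continuing from the reversed $parts array above, a small guard for known two-part TLDs (the list here is only an illustrative assumption) could be:

$secondLevel = array('co.uk', 'org.uk', 'com.au', 'co.nz');
$domain = $parts[1] . '.' . $parts[0];
if (in_array($domain, $secondLevel) && isset($parts[2])) {
    $domain = $parts[2] . '.' . $domain; // e.g. google.co.uk instead of co.uk
}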

0 votes
answered Jan 17, 2014 by notfound-life

I have edited for you:

function getHost($Address) {
    $parseUrl = parse_url(trim($Address));
    // With no scheme, parse_url() puts the host into 'path', so fall back to everything before the first '/'
    $pathParts = explode('/', isset($parseUrl['path']) ? $parseUrl['path'] : '', 2);
    $host = trim(!empty($parseUrl['host']) ? $parseUrl['host'] : $pathParts[0]);

    $parts = explode('.', $host);
    $num_parts = count($parts);

    $h = '';
    // Skip a leading "www", keep every other label
    $start = ($parts[0] == "www") ? 1 : 0;
    for ($i = $start; $i < $num_parts; $i++) {
        $h .= $parts[$i] . '.';
    }
    return substr($h, 0, -1);
}

Any URL with a leading www (www.domain.ltd, www.domain.ltd/path) will result in domain.ltd; hosts without it (e.g. sub1.subn.domain.ltd) are returned unchanged.
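
A quick check of what the edited function returns:

echo getHost("www.domain.ltd/some/path");        // domain.ltd
echo getHost("http://sub1.subn.domain.ltd/xyz"); // sub1.subn.domain.ltd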


...