Curl redirect: why doesn't FOLLOWLOCATION follow the right domain?

0 votes
asked Sep 5, 2010 by user3042509

I'm trying to scrape a website, but the page I'm scraping redirects to another page. I set the FOLLOWLOCATION option on curl, but I end up on a URL like http://localhost/....pageredirected.php and so on.

The redirect itself works, but the domain is wrong: it is my own domain, not that of the scraped page. Here is the code:

<?php
// create a new CURL resource
$ch = curl_init();

// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://voli.govolo.it/etape1.cfm?ref=2008052701&destination=484&Provenance=320&Date_Depart=11/9/2010&Date_Retour=18/9/2010&AllerRetour=1&Adultes=1&ENFANTS=0&BEBES=0&dated=110910&dater=180910&TypeClasse=0&langue=it");
curl_setopt($ch, CURLOPT_HEADER, true);
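curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
// return the response as a string instead of printing it directly
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

// fetch the URL (following HTTP redirects) and store the response
$esito = curl_exec($ch);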
print_r(curl_getinfo($ch));
echo $esito;
// close CURL resource, and free up system resources
curl_close($ch);
?>

The page should redirect from etape1.cfm to etape2.cfm, but I get a 404 error because the browser ends up at http://localhost/scraping/etape2.cfm?... instead of http://voli.govolo.it/etape2.cfm?...

Why doesn't FOLLOWLOCATION follow the right domain (http://voli.govolo.it)?

1 Answer

0 votes
answered Sep 5, 2010 by marc-b

The problem isn't curl. Part of what that first URL sends back is this:

<script language="JavaScript" type="text/javascript">
<!--

    function historyDeleteAndRedirect()
    {

        window.location.replace('etape2.cfm?ref=2008052701&destination=484&Provenance=320&Date_Depart=11/9/2010&Date_Retour=18/9/2010&AllerRetour=1&Adultes=1&ENFANTS=0&BEBES=0&dated=110910&dater=180910&TypeClasse=0&langue=it');


    //alert(window.location.href);
    //alert(document.referrer);
    }

//-->
</script>

Since you're not accessing the site in a normal manner, this JavaScript breaks. curl only follows HTTP redirects (Location headers); it does not execute JavaScript. Remember, curl runs on your server, and your script then echoes the fetched page to your browser, so as far as the browser is concerned the page came from "http://localhost/etape1.cfm?......". Since the URL passed to .replace() is relative rather than absolute, your browser is doing the correct thing and resolving it against localhost.
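One possible workaround is to pull the redirect target out of the returned HTML yourself and resolve it against the original site before requesting it. The following is only a rough sketch: extractJsRedirect, resolveUrl and fetchFollowingJsRedirect are hypothetical helper names (not part of curl), and it assumes the page always redirects through a window.location.replace('...') call with a plain string literal. The URL resolution covers only the simple cases (absolute URL, root-relative path, same-directory path).

```php
<?php
// Hypothetical helper: return the argument of window.location.replace('...'),
// or null if the page contains no such call.
function extractJsRedirect(string $html): ?string
{
    if (preg_match('/window\.location\.replace\(\s*([\'"])(.*?)\1\s*\)/', $html, $m)) {
        return $m[2];
    }
    return null;
}

// Hypothetical helper: resolve a redirect target against the URL of the page
// that contained it. Simplified; does not handle "../" segments.
function resolveUrl(string $base, string $target): string
{
    if (preg_match('#^https?://#i', $target)) {
        return $target; // already absolute
    }
    $p = parse_url($base);
    $origin = $p['scheme'] . '://' . $p['host'] . (isset($p['port']) ? ':' . $p['port'] : '');
    if ($target !== '' && $target[0] === '/') {
        return $origin . $target; // root-relative path
    }
    $path = $p['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1); // directory of the base page
    return $origin . $dir . $target;
}

// Hypothetical helper: fetch a page and, if it contains a JavaScript redirect,
// fetch the redirect target from the original site instead.
function fetchFollowingJsRedirect(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = (string) curl_exec($ch);
    // Use the final URL after any HTTP redirects as the base for resolution.
    $effective = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL) ?: $url;

    $target = extractJsRedirect($html);
    if ($target === null) {
        return $html;
    }
    curl_setopt($ch, CURLOPT_URL, resolveUrl($effective, $target));
    return (string) curl_exec($ch);
}
```

With helpers like these, calling fetchFollowingJsRedirect() with the etape1.cfm URL from the question would request etape2.cfm on voli.govolo.it rather than on localhost. Note that this handles a single JavaScript redirect only; a site that chains several of them, or builds the URL dynamically, would need more work.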
