How do you scrape AJAX pages?

0 votes
asked Nov 4, 2008 by xxxxxxx

The title says it all. Please advise how to scrape AJAX pages.

9 Answers

0 votes
answered Nov 4, 2008 by sblundy

If you can get at it, try examining the DOM tree. Selenium does this as a part of testing a page. It also has functions to click buttons and follow links, which may be useful.
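
For illustration, here is a minimal sketch using the selenium-webdriver package for Node (the URL and CSS selector are placeholders, not from the original answer):

// Hypothetical sketch: let a real browser run the page's JavaScript, then read the DOM.
const { Builder, By, until } = require('selenium-webdriver');

(async () => {
    const driver = await new Builder().forBrowser('firefox').build();
    try {
        await driver.get('http://example.com/ajax-page'); // placeholder URL
        // Wait until the element populated by AJAX shows up in the DOM tree.
        const el = await driver.wait(until.elementLocated(By.css('#result')), 10000);
        console.log(await el.getText());
        // The same API can click buttons and follow links, e.g.:
        // await driver.findElement(By.linkText('Next')).click();
    } finally {
        await driver.quit();
    }
})();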

0 votes
answered Nov 4, 2008 by wonderchook

It depends on the AJAX page. The first part of screen scraping is determining how the page works. Is there some sort of variable you can iterate through to request all the data from the page? Personally, I've used Web Scraper Plus for a lot of screen-scraping tasks because it is cheap, not difficult to get started with, and non-programmers can get it working relatively quickly.
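
A minimal sketch of that iteration idea in JavaScript (Node 18+ with built-in fetch; the endpoint, parameter name, and page count are hypothetical):

// Sketch: iterate over a page parameter to request all the data.
(async () => {
    for (let page = 1; page <= 10; page++) {
        const res = await fetch(`http://example.com/data?page=${page}`); // hypothetical endpoint
        console.log(await res.text());
    }
})();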

Side note: the site's Terms of Use are probably worth checking before doing this. Depending on the site, iterating through everything may raise some flags.

0 votes
answered Nov 4, 2008 by brian-r-bondy

Overview:

All screen scraping first requires a manual review of the page you want to extract resources from. When dealing with AJAX, you usually need to analyze a bit more than just the HTML.

When dealing with AJAX, this just means that the value you want is not in the initial HTML document you requested, but that JavaScript will be executed which asks the server for the extra information you want.

You can therefore usually simply analyze the JavaScript, see which request it makes, and call that URL yourself from the start.


Example:

Take this as an example: assume the page you want to scrape has the following script:

<script type="text/javascript">
function ajaxFunction()
{
var xmlHttp;
try
  {
  // Firefox, Opera 8.0+, Safari
  xmlHttp=new XMLHttpRequest();
  }
catch (e)
  {
  // Internet Explorer
  try
    {
    xmlHttp=new ActiveXObject("Msxml2.XMLHTTP");
    }
  catch (e)
    {
    try
      {
      xmlHttp=new ActiveXObject("Microsoft.XMLHTTP");
      }
    catch (e)
      {
      alert("Your browser does not support AJAX!");
      return false;
      }
    }
  }
  xmlHttp.onreadystatechange=function()
    {
    if(xmlHttp.readyState==4)
      {
      document.myForm.time.value=xmlHttp.responseText;
      }
    }
  xmlHttp.open("GET","time.asp",true);
  xmlHttp.send(null);
  }
</script>

Then all you need to do is make an HTTP request directly to time.asp on the same server. (Example from w3schools.)
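
As a rough sketch of that direct-request idea, assuming a modern Node runtime (18+, which ships a built-in fetch) and a placeholder host name:

// Hypothetical sketch: call the AJAX endpoint directly instead of loading the page.
// The host name is an assumption; substitute the real server.
(async () => {
    const response = await fetch('http://example.com/time.asp');
    const time = await response.text(); // the text the page would put into the form field
    console.log(time);
})();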


Advanced scraping with C++:

For complex usage, and if you're using C++, you could also consider using SpiderMonkey, the Firefox JavaScript engine, to execute the JavaScript on a page.

Advanced scraping with Java:

For complex usage, and if you're using Java, you could also consider using Rhino, Mozilla's JavaScript engine for Java.

Advanced scraping with .NET:

For complex usage, and if you're using .NET, you could also consider using the Microsoft.Vsa assembly, recently replaced by ICodeCompiler/CodeDOM.

0 votes
answered Nov 11, 2011 by alex

As a low-cost solution you can also try SWExplorerAutomation (SWEA). The program creates an automation API for any web application developed with HTML, DHTML, or AJAX.

0 votes
answered Nov 26, 2011 by deepan-prabhu-babu

Another good tool that lets you write scrapers against a live DOM is Solvent from MIT: http://simile.mit.edu/wiki/Solvent

Also promising is Envjs, at http://www.envjs.com/doc/guides

If your intention is to scrape at a high scale, you might need to introduce delays and mimic human behavior to avoid being banned by the service being scraped.
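
A minimal sketch of that pacing idea (the URLs and the 2-5 second delay range are illustrative assumptions):

// Sketch: randomized delays between requests to mimic human pacing.
const urls = ['http://example.com/page1', 'http://example.com/page2']; // hypothetical
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

(async () => {
    for (const url of urls) {
        const res = await fetch(url); // Node 18+ built-in fetch
        console.log(url, res.status);
        await sleep(2000 + Math.random() * 3000); // wait 2-5 seconds between requests
    }
})();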

0 votes
answered Nov 9, 2013 by sw

The best way to scrape web pages that use AJAX, or JavaScript in general, is with a browser itself or a headless browser (a browser without a GUI). Currently PhantomJS is a well-promoted headless browser using WebKit. An alternative that I used with success is HtmlUnit (in Java, or in .NET via IKVM), which is a simulated browser. Another known alternative is using a web automation tool like Selenium.
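
For instance, a minimal PhantomJS script might look like the sketch below (the target URL, element id, and fixed wait are illustrative assumptions; run it with phantomjs script.js):

// Hypothetical sketch: load an AJAX page in PhantomJS, wait, then read the rendered DOM.
var page = require('webpage').create();

page.open('http://example.com/ajax-page', function (status) {
    if (status !== 'success') {
        console.log('failed to load page');
        phantom.exit(1);
        return;
    }
    // Give the page's AJAX calls a moment to finish before reading the DOM.
    setTimeout(function () {
        var text = page.evaluate(function () {
            return document.getElementById('result').innerText; // hypothetical element id
        });
        console.log(text);
        phantom.exit();
    }, 3000);
});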

I wrote many articles about this subject, like web scraping AJAX and JavaScript sites and automated browserless OAuth authentication for Twitter. At the end of the first article there are a lot of extra resources that I have been compiling since 2011.

0 votes
answered Nov 14, 2013 by michael

I think Brian R. Bondy's answer is useful when the source code is easy to read. I prefer an easier way: using tools like Wireshark or HttpAnalyzer to capture the packet and get the URL from the "Host" field and the "GET" field.

For example, I captured a packet like the following:

GET /hqzx/quote.aspx?type=3&market=1&sorttype=3&updown=up&page=1&count=8&time=164330 HTTP/1.1
Accept: */*
Referer: http://quote.hexun.com/stock/default.aspx
Accept-Language: zh-cn
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
Host: quote.tool.hexun.com
Connection: Keep-Alive

Then the URL is:

http://quote.tool.hexun.com/hqzx/quote.aspx?type=3&market=1&sorttype=3&updown=up&page=1&count=8&time=164330
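
Requesting that URL directly returns the same data the page loads via AJAX. A quick sketch (Node 18+ built-in fetch; the Referer header mirrors the captured request and may or may not be required by the server):

// Sketch: hit the reconstructed URL directly.
const url = 'http://quote.tool.hexun.com/hqzx/quote.aspx' +
    '?type=3&market=1&sorttype=3&updown=up&page=1&count=8&time=164330';

(async () => {
    const res = await fetch(url, {
        headers: { Referer: 'http://quote.hexun.com/stock/default.aspx' }, // from the capture
    });
    console.log(await res.text());
})();
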
0 votes
answered Nov 9, 2014 by mattspain

In my opinion the simplest solution is to use CasperJS, a framework based on the WebKit headless browser PhantomJS.

The whole page is loaded, and it's very easy to scrape any AJAX-related data. You can check this basic tutorial to learn Automating & Scraping with PhantomJS and CasperJS.

You can also take a look at this example code on how to scrape Google suggest keywords:

/*global casper:true*/
var casper = require('casper').create();
var suggestions = [];
var word = casper.cli.get(0); // first positional command-line argument

if (!word) {
    casper.echo('please provide a word').exit(1);
}

casper.start('http://www.google.com/', function() {
    // Type the word into the search box to trigger the suggestion dropdown.
    this.sendKeys('input[name=q]', word);
});

casper.waitFor(function() {
    // Wait until a suggestion starting with the word appears.
    return this.fetchText('.gsq_a table span').indexOf(word) === 0;
}, function() {
    // Collect the text of every suggestion node from the rendered DOM.
    suggestions = this.evaluate(function() {
        var nodes = document.querySelectorAll('.gsq_a table span');
        return [].map.call(nodes, function(node) {
            return node.textContent;
        });
    });
});

casper.run(function() {
    this.echo(suggestions.join('\n')).exit();
});
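
Assuming the script is saved as suggestions.js, it can be run with something like casperjs suggestions.js scraping; the word to query is read from the first command-line argument.
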
0 votes
answered Nov 16, 2015 by ttt

I like PhearJS, but that might be partially because I built it.

That said, it's a service you run in the background that speaks HTTP(S) and renders pages as JSON for you, including any metadata you might need.
