Scraping AJAX sites with Scrapy - IAMTOP1982 - ChinaUnix Blog

Havvy Tech Field (havytech.blog.chinaunix.net)

Scraping AJAX sites with Scrapy

Category: LINUX

2011-06-29 15:57:01

This post was contributed by Ismael Carnales.

A common question in the mailing list is how to scrape AJAX sites. As Mark Ellul noted in the mailing list, there are two basic types of AJAX requests that web sites make use of: "static" requests, whose parameters (URL, POST data) don't change, and "dynamic" requests, which use variables based on properties of the current page.

The general approach for "static" AJAX requests is to add their URLs to the start_urls attribute like any "normal" URL. To deal with "dynamic" ones, we try to generate the same requests from Scrapy.
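As a rough illustration of the difference, the sketch below treats a "static" request as a fixed URL and a "dynamic" one as a URL rebuilt from values found on the current page. The example.com addresses, the item_id and page parameters, and the helper name build_dynamic_url are all made up for illustration; a real site dictates its own endpoint and parameters.

```python
from urllib.parse import urlencode, urljoin

# "Static" AJAX request: the URL never changes, so it can be listed as-is
# (for example in a spider's start_urls). Hypothetical address.
STATIC_AJAX_URLS = ["http://example.com/catalog/data.xml"]


def build_dynamic_url(page_url, item_id, page=1):
    """Rebuild a "dynamic" AJAX URL from values taken from the current page.

    The endpoint name ("ajax/items") and the query parameters are
    assumptions made for this sketch only.
    """
    query = urlencode({"item_id": item_id, "page": page})
    # Resolve the endpoint relative to the page that triggered the request.
    return urljoin(page_url, "ajax/items") + "?" + query


if __name__ == "__main__":
    print(build_dynamic_url("http://example.com/catalog/index.html", "42", page=2))
```

In a spider, the values passed to such a helper would typically be extracted from the response being parsed, and the resulting URL handed to a new request.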

To help us in this task, we'll use a Firefox add-on called Firebug. This add-on comes with a Net panel that lets us monitor the requests being sent to the server and their responses.

We will scrape the NASA Image of the Day Gallery. When loading the site, we can see that the page loads the gallery information from another source. To find out which one, we launch Firebug, go to the Net panel and reload the page.

In the Net panel, we see each request (and its response) made to load the entire page contents. Here we can filter the requests and look for XMLHttpRequests in the XHR tab.

In the XHR tab, we see that two requests are made, one to iotdxml.xml and one to image_feature_NUMBER.xml. If we look at the response of the first one (clicking on it and then going to the Response tab), we see that it holds the gallery slider data.

Now, if we navigate to another photograph by clicking on its slider link, we'll see that a new request has been made. This request points to image_feature_NUMBER.xml, which looks suspiciously similar to the second request we got when loading the page for the first time (that request fetched the first image in the gallery). And if we look at the iotdxml.xml file, we'll find that the URL holding each image's complete data is stored in an ap element (the XPath used later is //ig/ap).
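To make the link between the two files concrete, here is a minimal sketch that pulls ap values out of a feed and turns them into absolute URLs, using only the standard library. The sample XML is invented: only the ig/ap nesting and the iotdxml.xml address come from this post, and the image_urls helper name is hypothetical. Appending ".xml" follows what the post's spider does.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

FEED_URL = "http://www.nasa.gov/multimedia/imagegallery/iotdxml.xml"

# Invented sample feed: the real file has more fields and possibly a
# different root element. Only the ig/ap nesting is taken from the post.
SAMPLE_FEED = """\
<gallery>
  <ig><ap>image_feature_1001</ap></ig>
  <ig><ap>image_feature_1002</ap></ig>
</gallery>
"""


def image_urls(feed_xml, feed_url=FEED_URL):
    """Return one absolute .xml URL per ap value found under an ig element."""
    root = ET.fromstring(feed_xml)
    urls = []
    for ap in root.findall(".//ig/ap"):
        if ap.text:
            # Resolve relative to the feed location, then add the suffix.
            urls.append(urljoin(feed_url, ap.text.strip()) + ".xml")
    return urls


if __name__ == "__main__":
    for url in image_urls(SAMPLE_FEED):
        print(url)
```

This can be run without Scrapy or network access, which makes it a convenient way to check the URL-building logic against a saved copy of the feed.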

So, to scrape this site, we add the iotdxml.xml URL to the spider's start_urls, parse it, and make a request for each individual image (mimicking the requests the browser makes when clicking on images).

Here's a simple spider to illustrate this:

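# Written for Python 2 and the Scrapy API of the time (BaseSpider, XmlXPathSelector).
from urlparse import urljoin

from scrapy.http import Request
from scrapy.selector import XmlXPathSelector
from scrapy.spider import BaseSpider


class NasaImagesSpider(BaseSpider):
    name = "nasa.gov"
    # The "static" AJAX request that holds the gallery slider data.
    start_urls = (
        'http://www.nasa.gov/multimedia/imagegallery/iotdxml.xml',
    )

    def parse(self, response):
        xxs = XmlXPathSelector(response)
        # Each ig/ap value identifies one image_feature_NUMBER document.
        urls = xxs.select('//ig/ap/text()').extract()
        for url in urls:
            # Resolve against the feed URL and add the .xml suffix,
            # mimicking the request the browser makes for each image.
            abs_url = urljoin(self.start_urls[0], url) + '.xml'
            yield Request(abs_url, callback=self.parse_image)

    def parse_image(self, response):
        # parse individual images here
        pass


SPIDER = NasaImagesSpider()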

You can run this spider quickly (without creating a project) by saving it into a nasaspider.py file and running:

scrapy runspider nasaspider.py
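The parse_image callback is left empty above. One possible starting point, sketched here with the standard library only, is to flatten the image document into a dictionary of tag names and text. The sample XML and its field names (title, credit, description) are invented, since the post does not show the contents of image_feature_NUMBER.xml, and extract_fields is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Invented sample: the real image_feature_NUMBER.xml layout is not shown
# in the post, so these tags are placeholders.
SAMPLE_IMAGE_XML = """\
<image>
  <title>Example title</title>
  <credit>Example credit</credit>
  <description>  Example description.  </description>
</image>
"""


def extract_fields(image_xml):
    """Map each leaf element's tag to its stripped text.

    If a tag occurs more than once, the last occurrence wins.
    """
    root = ET.fromstring(image_xml)
    fields = {}
    for element in root.iter():
        has_children = len(element) > 0
        text = (element.text or "").strip()
        if not has_children and text:
            fields[element.tag] = text
    return fields


if __name__ == "__main__":
    print(extract_fields(SAMPLE_IMAGE_XML))
```

Inside parse_image, a function like this could be given response.body. Once the real document structure is known, explicit XPath queries for the wanted fields would likely be a better fit.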
