Category: LINUX

2011-06-29 15:57:01

This post was contributed by Ismael Carnales.

A common question is how to scrape AJAX sites. As Mark Ellul pointed out in the mailing list, web sites make two basic types of AJAX requests: "static" requests, whose parameters (URL, POST data) don't change, and "dynamic" requests, which use variables based on properties of the current page.

The general approach for "static" AJAX requests is to add their URLs to the spider's start_urls attribute, just like "normal" URLs. For "dynamic" ones, we try to generate the same requests from Scrapy.
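As a rough illustration of the difference, here is a minimal sketch in plain Python 3 (no Scrapy involved). The example.com endpoints, the id and page parameters, and both helper names are hypothetical and only serve to show the idea:

```python
from urllib.parse import urlencode

# Hypothetical endpoints, used only for illustration.
STATIC_FEED_URL = "http://example.com/gallery/feed.xml"
DYNAMIC_ENDPOINT = "http://example.com/gallery/image"


def static_request_urls():
    """A "static" request never changes, so its URL can be listed up front."""
    return [STATIC_FEED_URL]


def dynamic_request_url(image_id, page=1):
    """A "dynamic" request is rebuilt from values found on the current page."""
    query = urlencode({"id": image_id, "page": page})
    return DYNAMIC_ENDPOINT + "?" + query


if __name__ == "__main__":
    print(static_request_urls())
    print(dynamic_request_url(42))
```

The static list plays the role of start_urls, while the dynamic helper stands in for whatever logic the page's JavaScript uses to build its requests.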

To help with this task, we'll use a Firefox add-on called Firebug. This add-on comes with a Net panel that lets us monitor the requests sent to the server and their responses.

We will scrape the NASA Image of the Day Gallery. When loading the site, we can see that the page loads the gallery information from another source. To find out which one, we launch Firebug, go to the Net panel and reload the page.

In the Net panel, we see each request (and its response) made to load the entire page contents. Here we can filter the requests and look for XMLHttpRequests in the XHR tab.

In the XHR tab, we see that two requests are made: one to iotdxml.xml and one to image_feature_NUMBER.xml. If we look at the response of the first one (by clicking on it and then going to the Response tab), we see that it holds the gallery slider data.

Now, if we navigate to another photo by clicking on its slider link, we'll see that a new request has been made. This request points to image_feature_NUMBER.xml, which looks suspiciously similar to the second request we got when loading the page for the first time (that request fetched the first image in the gallery). So if we look at the iotdxml.xml file, we'll find that the URL holding each image's complete data is stored in an ap element.

So, to scrape this site, we add the iotdxml.xml URL to the spider's start_urls, parse it, and make a request for each individual image (mimicking the requests the browser makes when clicking on images).
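The parse-and-follow step can be tried outside Scrapy with the standard library. The sketch below assumes Python 3 and a made-up sample feed: the gallery root element and the image numbers are invented, and the real layout of iotdxml.xml is assumed to nest ap elements inside ig elements. The image_urls helper is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

FEED_URL = "http://www.nasa.gov/multimedia/imagegallery/iotdxml.xml"

# Made-up sample; the real feed's layout is assumed, not verified.
SAMPLE_FEED = """\
<gallery>
  <ig><ap>/multimedia/imagegallery/image_feature_0001</ap></ig>
  <ig><ap>/multimedia/imagegallery/image_feature_0002</ap></ig>
</gallery>
"""


def image_urls(feed_xml, feed_url=FEED_URL):
    """Return the absolute .xml URL for every ig/ap entry in the feed."""
    root = ET.fromstring(feed_xml)
    urls = []
    for ap in root.findall(".//ig/ap"):
        path = (ap.text or "").strip()
        if not path:
            continue  # skip empty entries
        # Resolve against the feed URL, then add the .xml suffix
        urls.append(urljoin(feed_url, path) + ".xml")
    return urls


if __name__ == "__main__":
    for url in image_urls(SAMPLE_FEED):
        print(url)
```

In Scrapy, the same two steps (extract the ap values, then build one request per image) happen inside the spider's parse callback.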

Here's a simple spider to illustrate this:

    from urlparse import urljoin  # Python 2 stdlib (urllib.parse in Python 3)

    from scrapy.http import Request
    from scrapy.selector import XmlXPathSelector
    from scrapy.spider import BaseSpider


    class NasaImagesSpider(BaseSpider):
        name = "nasa.gov"
        # The "static" AJAX request that holds the gallery slider data
        start_urls = (
            'http://www.nasa.gov/multimedia/imagegallery/iotdxml.xml',
        )

        def parse(self, response):
            xxs = XmlXPathSelector(response)
            # Each ap element holds the location of one image's data
            urls = xxs.select('//ig/ap/text()').extract()
            for url in urls:
                # Mimic the request the browser makes when an image is clicked
                abs_url = urljoin(self.start_urls[0], url) + '.xml'
                yield Request(abs_url, callback=self.parse_image)

        def parse_image(self, response):
            # parse individual images here
            pass


    SPIDER = NasaImagesSpider()

You can run this spider quickly (without creating a project) by saving it into a nasaspider.py file and running:

    scrapy runspider nasaspider.py
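The parse_image callback above is left empty. One possible way to fill it in is to collect the text of every leaf element into a dictionary. The sketch below does this with the Python 3 standard library instead of Scrapy selectors. The sample document and its element names (image, title, caption) are invented, since the layout of image_feature_NUMBER.xml is not shown here, and leaf_fields is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Invented sample; the real per-image XML may look quite different.
SAMPLE_IMAGE_XML = """\
<image>
  <title>Example title</title>
  <caption>Example caption</caption>
  <empty/>
</image>
"""


def leaf_fields(image_xml):
    """Map each non-empty leaf element's tag to its stripped text.

    If a tag occurs more than once, the last occurrence wins.
    """
    root = ET.fromstring(image_xml)
    fields = {}
    for element in root.iter():
        has_children = len(element) > 0
        text = (element.text or "").strip()
        if not has_children and text:
            fields[element.tag] = text
    return fields


if __name__ == "__main__":
    print(leaf_fields(SAMPLE_IMAGE_XML))
```

A real callback would more likely pick specific fields by name once the actual response has been inspected in Firebug.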
