Repost: An Introduction to Web Page Scraping and JS Parsing with Python - runningsparrow - ChinaUnix Blog

Category: Python/Ruby

2015-07-26 08:19:43

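Because modern web development makes heavy use of AJAX, JavaScript, and CSS, important data on some sites is generated dynamically by Ajax or JavaScript and cannot be obtained just by parsing the HTML of the page (for example by using …).

Similar software also exists for other languages: …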

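1. Use cases

For details on Selenium, see its documentation. Here Python + Selenium Remote Control (RC) + Firefox is used to implement the following typical tasks:

1) Screen scraping, that is, having the program automatically save the image of a visited page, as rendered in the browser, to a picture file, like the page thumbnails on Digg-style sites. There are two kinds of screen scraping: capturing only the visible area of the current browser page (for example the google.com home page), and capturing the complete page (when the page scrolls, for example …).

2) Getting content generated by JavaScript

For example, to have a program automatically crawl and download …:

a) Go to the Baidu New Songs TOP100 page … (.*)" class="search"></a> or use …

b) On the search results page, get the address of the first result with <a href="(.*)" title="(.*)</a>, and follow it to the page holding the mp3's actual download address.

c) On the song's actual download page, parsing the HTML content shows that the mp3's actual download address is empty: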
```
   <a id="urla" href="" onmousedown="sd(event,0)" target="_blank"></a>
```
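The actual download address is set by a JavaScript script:
```
                    var encurl = "…", newurl = "";
                    var urln_obj = G("urln"), urla_obj = G("urla");
                    newurl = decode(encurl);
                    urln_obj.href = urla_obj.href = song_1287289709 = newurl;
```
where the function G(str) is:
```
           function G(str){
                        return document.getElementById(str);
                };
```
So parsing the page directly cannot yield the download address. Python has to drive a browser engine that executes the JavaScript, and only then can the resulting download address be read.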
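To make the problem concrete, here is a minimal sketch that parses the anchor shown above with Python's standard-library html.parser, without running any script. The class AnchorHrefParser and the function static_href are hypothetical names introduced only for this illustration.

```python
from html.parser import HTMLParser


class AnchorHrefParser(HTMLParser):
    """Collect the href of every <a> tag with a given id (hypothetical helper)."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("id") == self.target_id:
            self.hrefs.append(attr_map.get("href"))


def static_href(html, target_id):
    """Return the href found in the raw HTML, or None if the tag is absent."""
    parser = AnchorHrefParser(target_id)
    parser.feed(html)
    return parser.hrefs[0] if parser.hrefs else None


# The anchor exactly as it appears in the downloaded page source.
PAGE = '<a id="urla" href="" onmousedown="sd(event,0)" target="_blank"></a>'

print(repr(static_href(PAGE, "urla")))
```

The static parser sees only an empty href, because the value is assigned later by the page's script. This is the gap that a real browser engine fills.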

2. Selenium RC basics

Selenium RC's operating mechanism and architecture are described in … Selenium RC consists of two main parts, the Selenium Server and the Client Libraries:
- The Selenium Server which launches and kills browsers, interprets and runs the Selenese commands passed from the test program, and acts as an HTTP proxy, intercepting and verifying HTTP messages passed between the browser and the AUT.
The Selenium Server corresponds to the selenium-server-xx directory in the Selenium RC package, where xx is the version number.
- Client libraries which provide the interface between each programming language and the Selenium-RC Server.
Selenium RC provides client drivers for languages including Java, Python, Ruby, Perl, .NET, and PHP, as follows:

selenium-dotnet-client-driver-xx

selenium-java-client-driver-xx

selenium-perl-client-driver-xx

selenium-php-client-driver-xx

selenium-python-client-driver-xx

selenium-ruby-client-driver-xx
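Languages such as Python issue browser commands (for example, opening a specified URL) by calling the client driver, which passes each command on to the Selenium Server. The Selenium Server receives, parses, and executes the Selenium commands sent by the client, translates them into commands for the particular browser, and then calls the corresponding browser API to carry out the actual operation.

In effect, the Selenium Server acts as an HTTP proxy between the client program and the browser.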
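The hand-off from client driver to server can be pictured with a small sketch. It is illustrative only: it assumes that the client driver sends each command to the server as a form-encoded HTTP body of the shape cmd=...&1=...&sessionId=..., and the function name build_command_body is hypothetical, not part of the Selenium client library.

```python
from urllib.parse import quote_plus


def build_command_body(verb, args, session_id=None):
    """Encode one Selenium command as a form-encoded request body.

    Assumed layout: cmd=<verb>&1=<arg1>&2=<arg2>...&sessionId=<id>
    """
    parts = ["cmd=" + quote_plus(verb)]
    for position, arg in enumerate(args, start=1):
        parts.append("%d=%s" % (position, quote_plus(str(arg))))
    if session_id is not None:
        parts.append("sessionId=" + quote_plus(str(session_id)))
    return "&".join(parts)


# An "open this path" command as it might travel from client driver to server.
print(build_command_body("open", ["/"], session_id="abc123"))
```

The point of the sketch is the division of labour: the client only serializes a command name and its arguments, while everything browser-specific happens on the server side.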

3. Example

1) Download Selenium RC
2) Unpack selenium-remote-control-1.0.3.zip
3) Run the Selenium Server

cd selenium-remote-control-1.0.3\selenium-server-1.0.3

java -jar selenium-server.jar

By default the Selenium Server listens on port 4444, which is set in org.openqa.selenium.server.RemoteControlConfiguration.
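Before running any client code it can help to confirm that something is actually listening on that port. The following is a minimal sketch using only the standard library. The function name server_is_listening is hypothetical, and a successful TCP connection only shows that the port is open, not that the listener is really a Selenium Server.

```python
import socket


def server_is_listening(host="localhost", port=4444, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if server_is_listening():
        print("Something is listening on port 4444")
    else:
        print("Nothing is listening on port 4444; start the server first")
```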

4) Test code
```
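#coding=gbk
# Needs the Selenium RC Python client driver (selenium.py) on the import path
# and a Selenium Server running on localhost:4444.
from selenium import selenium


def selenium_init(browser, url, para):
    """Start a browser session for base URL `url` and open the path `para`."""
    sel = selenium('localhost', 4444, browser, url)
    sel.start()
    # Set the timeout (in milliseconds) before open() so it covers the page load.
    sel.set_timeout(60000)
    sel.open(para)
    sel.window_focus()
    sel.window_maximize()
    return sel


def selenium_capture_screenshot(sel):
    # Saves what is currently visible on screen.
    sel.capture_screenshot("d:\\singlescreen.png")


def selenium_get_value(sel):
    # Read the anchor after the page's JavaScript has filled in its href.
    anchor = "this.browserbot.getCurrentWindow().document.getElementById('urla')"
    innertext = sel.get_eval(anchor + ".innerHTML")
    url = sel.get_eval(anchor + ".href")
    print("The innerHTML is :" + innertext + "\n")
    print("The url is :" + url + "\n")


def selenium_capture_entire_page_screenshot(sel):
    # Saves the whole page, including the part that needs scrolling.
    sel.capture_entire_page_screenshot("d:\\entirepage.png", "background=#CCFFDD")


if __name__ == "__main__":
    # The second argument is the site's base URL; it is empty in the source
    # post and has to be supplied before this script can run.
    sel1 = selenium_init('*firefox3', '', '/m?word=mp3,[%B1%A7%BD%F4%C4%E3+%CF%F4%D1%C7%D0%F9]&ct=134217728&tn=baidusg,%B1%A7%BD%F4%C4%E3%20%20&si=%B1%A7%BD%F4%C4%E3;;%CF%F4%D1%C7%D0%F9;;0;;0&lm=16777216&sgid=1')
    selenium_get_value(sel1)
    selenium_capture_screenshot(sel1)
    sel1.stop()
    sel2 = selenium_init('*firefox3', '', '/')
    selenium_capture_entire_page_screenshot(sel2)
    sel2.stop()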
```
A few notes:

1) selenium-remote-control-1.0.3/selenium-python-client-driver-1.0.1/doc/selenium.selenium-class.html documents the commands Selenium supports and is worth spending some time reading.

2) In __init__(self, host, port, browserStartCommand, browserURL), browserStartCommand selects the browser to use. The values Selenium currently supports are:
*firefox
*mock
*firefoxproxy
*pifirefox
*chrome
*iexploreproxy
*iexplore
*firefox3
*safariproxy
*googlechrome
*konqueror
*firefox2
*safari
*piiexplore
*firefoxchrome
*opera
*iehta
*custom

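3) capture_entire_page_screenshot currently supports only Firefox and IE.

With Firefox, capture_entire_page_screenshot is straightforward: no special setup is needed and Selenium handles it automatically. Firefox is therefore the recommended browser for capture_entire_page_screenshot.
IE must run in non-HTA mode (browserStartCommand value: *iexploreproxy), and also requires installing …

The administrator edited this article on August 13, 2009.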
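The note above can be turned into a small guard that picks a screenshot call from the start command. This is only a sketch: the set FULL_PAGE_CAPABLE and the function pick_screenshot_method are hypothetical, and the set's membership is an assumption read off the note (Firefox start commands, plus IE in non-HTA mode), not a list taken from Selenium itself.

```python
# Start commands assumed to work with capture_entire_page_screenshot,
# based only on the note above (Firefox, or IE in non-HTA mode).
FULL_PAGE_CAPABLE = {"*firefox", "*firefox2", "*firefox3", "*iexploreproxy"}


def pick_screenshot_method(browser_start_command):
    """Return the name of the screenshot call to use for a start command."""
    if browser_start_command in FULL_PAGE_CAPABLE:
        return "capture_entire_page_screenshot"
    return "capture_screenshot"


print(pick_screenshot_method("*firefox3"))
print(pick_screenshot_method("*opera"))
```

Falling back to capture_screenshot keeps a script usable with other browsers, at the cost of capturing only what is currently visible.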
