Scraping the Douban Movie Top 250 with urllib2 and BeautifulSoup - MingZznet - ChinaUnix Blog



Category: Python/Ruby

2013-07-24 12:47:14

Original post: Scraping the Douban Movie Top 250 with urllib2 and BeautifulSoup. Author: chinaboywg

Fetching and parsing web pages with urllib2 and BeautifulSoup

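Take the Douban Movie Top 250 ranking as an example. I first assumed I would have to write a crawler, but the ranking turned out to offer a plain-text list view that shows all 250 films on a single page. So it is enough to download that page with urllib2 and then parse it with BeautifulSoup.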
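As a rough sketch of the download step on Python 3, where the functionality of urllib2 lives in urllib.request: the helper name `download` and the 10-second timeout are illustrative assumptions, not something taken from the original post.

```python
import urllib.request


def download(url, timeout=10):
    """Return the raw bytes served at `url` (hypothetical helper)."""
    # urlopen() returns a file-like response object; the with-block
    # closes it once the body has been read.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

The bytes returned by `download` are what would be handed to BeautifulSoup in the next step.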

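Beautiful Soup is a third-party library that has to be downloaded and installed separately. It is simple to use; see its reference documentation for details. One thing to note: Beautiful Soup converts whatever encoding it encounters to Unicode internally and writes UTF-8 on output, so when displaying the parsed results you have to keep the encoding of the original page in mind, otherwise the text will not display correctly.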
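A small sketch of the encoding point, written for Python 3. The GBK-encoded sample page is made up for illustration, and passing `from_encoding` is only needed when Beautiful Soup cannot work out the source encoding by itself.

```python
from bs4 import BeautifulSoup

# Made-up sample: a tiny page encoded as GBK rather than UTF-8.
raw = u"<html><head><title>豆瓣电影250</title></head></html>".encode("gbk")

# Tell Beautiful Soup the source encoding; it decodes the bytes to Unicode.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")
title = soup.title.get_text()

# When serialising a tag back to bytes, the output encoding is chosen
# explicitly here (UTF-8 is also the default).
utf8_bytes = soup.title.encode("utf-8")
```

If the wrong source encoding is assumed, `title` comes out garbled, which is the display problem described above.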

First, look at the source of the Douban Movie Top 250 page. The main markup that lists the ranking is as follows:

<table class="list_view" summary="豆瓣电影250: 序号影片名评分评价人数">
<caption>豆瓣电影250</caption>
<thead>
<tr>
<th id="m_order" width="20"></th>
<th id="m_name" width="460"></th>
<th id="m_rating_score" width="39">评分</th>
<th id="m_rating_num">评价人数</th>
</tr>
</thead>
<tbody>
<tr class="item">
<td headers="m_order" class="m_order">
1
</td>
<td headers="m_name">
<a href="">肖申克的救赎 / The Shawshank Redemption</a>
<span class="year">1994</span>
</td>
<td headers="m_rating_score">
<em>9.5</em>
</td>
<td headers="m_rating_num">
445262
</td>
</tr>

Parsing this markup to get the ranking information takes only a short Python script:

# -*- coding: utf-8 -*-
import urllib2
from bs4 import BeautifulSoup


def crawl(url):
    # Download the page and read the raw bytes.
    page = urllib2.urlopen(url)
    contents = page.read()
    # Name the parser explicitly so the result does not depend on
    # which parsers happen to be installed.
    soup = BeautifulSoup(contents, 'html.parser')
    # Header: rank, title, rating, number of ratings.
    print(u'豆瓣电影250: 序号 \t影片名\t 评分 \t评价人数')
    # Each film is one <tr class="item"> row of the table.
    for tag in soup.find_all('tr', class_='item'):
        m_order = int(tag.find('td', class_='m_order').get_text())
        m_name = tag.a.get_text()
        m_year = tag.span.get_text()
        m_rating_score = float(tag.em.get_text())
        m_rating_num = int(tag.find(headers="m_rating_num").get_text())
        print(u"%s %s %s %s %s" % (m_order, m_name, m_year, m_rating_score, m_rating_num))


if __name__ == '__main__':
    crawl('')
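To try the parsing logic without touching the network, the same lookups can be run against a string. The following is a minimal Python 3 sketch under a few assumptions: `SAMPLE` is one row adapted from the excerpt above (with closing tags added), and the helper name `parse_rows` and its dictionary keys are hypothetical.

```python
from bs4 import BeautifulSoup

# One row adapted from the excerpt, so no network access is needed.
SAMPLE = u"""
<table class="list_view">
<tbody>
<tr class="item">
<td headers="m_order" class="m_order">
1
</td>
<td headers="m_name">
<a href="">肖申克的救赎 / The Shawshank Redemption</a>
<span class="year">1994</span>
</td>
<td headers="m_rating_score">
<em>9.5</em>
</td>
<td headers="m_rating_num">
445262
</td>
</tr>
</tbody>
</table>
"""


def parse_rows(html):
    """Return one dict per <tr class="item"> row (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tag in soup.find_all("tr", class_="item"):
        rows.append({
            # int() and float() ignore the surrounding whitespace.
            "order": int(tag.find("td", class_="m_order").get_text()),
            "name": tag.a.get_text(strip=True),
            "year": tag.span.get_text(strip=True),
            "score": float(tag.em.get_text()),
            "num": int(tag.find("td", headers="m_rating_num").get_text()),
        })
    return rows


rows = parse_rows(SAMPLE)
```

Returning dictionaries instead of printing makes the parsing step easy to check on its own before pointing a script at the live page.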

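The original post showed the result as a screenshot.

Building on that script, my own modified version, which also prints each film's link, is as follows: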
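# -*- coding: utf-8 -*-

import urllib2
from bs4 import BeautifulSoup


def crawl(url):
    # Download the page and read the raw bytes.
    page = urllib2.urlopen(url)
    contents = page.read()
    # Name the parser explicitly so the result does not depend on
    # which parsers happen to be installed.
    soup = BeautifulSoup(contents, 'html.parser')
    # Header: rank, title, rating, number of ratings, link.
    print(u'               豆瓣电影TOP250:\n 序号 \t影片名\t 评分 \t评价人数 \t 链接 ')
    # Each film is one <tr class="item"> row of the table.
    for tag in soup.find_all('tr', class_='item'):
        m_order = int(tag.find('td', class_='m_order').get_text())
        m_name = tag.a.get_text()
        m_year = tag.span.get_text()
        m_rating_score = float(tag.em.get_text())
        m_rating_num = int(tag.find(headers="m_rating_num").get_text())
        # Read the link from the href attribute instead of splitting
        # the tag's string form on quote characters.
        m_url = tag.a['href']
        print(u"%s %s %s %s %s %s " % (m_order, m_name, m_year, m_rating_score, m_rating_num, m_url))


if __name__ == '__main__':
    crawl('')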

Output:

               豆瓣电影TOP250:
序号    影片名   评分    评价人数    链接
1 肖申克的救赎 / The Shawshank Redemption 1994 9.5 471051
2 这个杀手不太冷 / Léon 1994 9.4 445356
3 阿甘正传 / Forrest Gump 1994 9.3 404984
4 霸王别姬 1993 9.4 314060
5 盗梦空间 / Inception 2010 9.2 451323
6 海上钢琴师 / La leggenda del pianista sull'oceano 1998 9.1 352935
7 美丽人生 / La vita è bella 1997 9.4 216487
8 三傻大闹宝莱坞 / 3 Idiots 2009 9.1 358053
9 辛德勒的名单 / Schindler's List 1993 9.3 206974
10 放牛班的春天 / Les choristes 2004 9.1 249464
11 龙猫 / となりのトトロ 1988 9.1 226835
12 搏击俱乐部 / Fight Club 1999 9.1 232290
13 泰坦尼克号 / Titanic 1997 8.9 362303
14 教父 / The Godfather 1972 9.2 182709
15 天堂电影院 / Nuovo Cinema Paradiso 1988 9.1 178064
16 忠犬八公的故事 / Hachi: A Dog's Tale 2009 9.1 207328
17 千与千寻 / 千と千尋の神隠し 2001 9.0 341993
18 罗马假日 / Roman Holiday 1953 8.9 227359
19 乱世佳人 / Gone with the Wind 1939 9.2 159744
20 大话西游之大圣娶亲 / 西遊記大結局之仙履奇緣 1995 8.9 215545
21 天使爱美丽 / Le fabuleux destin d'Amélie Poulain 2001 8.8 305080
22 当幸福来敲门 / The Pursuit of Happyness 2006 8.8 310963
23 楚门的世界 / The Truman Show 1998 8.9 220642
24 怦然心动 / Flipped 2010 8.8 257920
25 两杆大烟枪 / Lock, Stock and Two Smoking Barrels 1998 9.1 148561
26 飞越疯人院 / One Flew Over the Cuckoo's Nest 1975 9.0 152383
27 指环王3：王者无敌 / The Lord of the Rings: The Return of the King 2003 9.0 157985
28 七宗罪 / Se7en 1995 8.7 260277
29 闻香识女人 / Scent of a Woman 1992 8.8 180207
30 让子弹飞 2010 8.8 395990
31 情书 / Love Letter 1995 8.7 214434
32 海豚湾 / The Cove 2009 9.4 119388
33 大话西游之月光宝盒 / 西遊記第一百零一回之月光寶盒 1995 8.8 185762
34 剪刀手爱德华 / Edward Scissorhands 1990 8.7 298530
35 无间道 / 無間道 2002 8.7 207719
36 少年派的奇幻漂流 / Life of Pi 2012 9.0 338297
37 美丽心灵 / A Beautiful Mind 2001 8.8 170725
38 鬼子来了 2000 9.1 116185
39 指环王1：魔戒再现 / The Lord of the Rings: The Fellowship of the Ring 2001 8.8 169109
40 阿凡达 / Avatar 2009 8.7 340480
41 低俗小说 / Pulp Fiction 1994 8.8 180731
42 勇敢的心 / Braveheart 1995 8.7 192747
43 指环王2：双塔奇兵 / The Lord of the Rings: The Two Towers 2002 8.8 149774
44 机器人总动员 / WALL·E 2008 9.3 290827
45 飞屋环游记 / Up 2009 8.8 298769
46 蝙蝠侠：黑暗骑士 / The Dark Knight 2008 8.8 161024
47 活着 1994 8.9 132871
48 窃听风暴 / Das Leben der Anderen 2006 9.0 107988
49 死亡诗社 / Dead Poets Society 1989 8.8 146513
50 入殓师 / おくりびと 2008 8.7 199862
