Scraping the Douban Movie Top 250 with urllib2 and BeautifulSoup - chinaboywg - ChinaUnix blog

Category: Python/Ruby

2013-07-06 01:30:46

Fetching and parsing web pages with urllib2 and BeautifulSoup

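Take the Douban Movie Top 250 ranking as an example. I originally expected to need a crawler, but then found that the ranking offers a plain-text list view that shows all 250 films on a single page. So it is enough to download that page with urllib2 and parse it with BeautifulSoup.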
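Note that urllib2 exists only in Python 2; in Python 3 the same functionality lives in urllib.request. A minimal sketch of the download step in Python 3 might look like the following. The function names `build_request` and `fetch`, the User-Agent string, and the timeout are assumptions made for illustration and are not part of the original script. The URL is left for the reader to supply.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0 (example)"):
    """Create a Request carrying a browser-like User-Agent header.

    The header value is an assumed placeholder; some sites reject the
    default urllib User-Agent, so it is often set explicitly.
    """
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Download a page and return its raw bytes (undecoded)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as page:
        return page.read()
```

`fetch` returns bytes rather than text, so the parser (or an explicit `decode` call) still decides how the page's encoding is handled.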

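Beautiful Soup is a third-party library that has to be downloaded and installed separately. It is simple to use; see its reference documentation for details. One thing to note: Beautiful Soup converts every document it reads to Unicode and writes output as UTF-8 by default, so when displaying the parsed results you must pay attention to the original page's encoding for the text to display correctly.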
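To make the encoding point concrete, here is a small standard-library sketch in Python 3. It assumes a page served as GBK, which is only an example encoding and not a claim about how Douban serves its pages. Decoding with the page's real encoding recovers the text, while decoding with the wrong codec does not.

```python
# -*- coding: utf-8 -*-
title = "豆瓣电影250"

# Pretend this is the raw body of a page served in GBK (assumed for illustration).
raw = title.encode("gbk")

# Decoding with the correct source encoding recovers the text ...
text = raw.decode("gbk")

# ... which can then be written out as UTF-8.
utf8_bytes = text.encode("utf-8")

# Decoding the same bytes as UTF-8 yields replacement characters instead.
wrong = raw.decode("utf-8", errors="replace")

print(text == title)
print(wrong == title)
```

With Beautiful Soup itself, the corresponding hook is the `from_encoding` argument of the `BeautifulSoup` constructor, and the encoding it detected is available afterwards as `soup.original_encoding`.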

First, look at the source of the Douban Movie Top 250 page. The main markup that lists the ranking is as follows:

<table class="list_view" summary="豆瓣电影250: 序号影片名评分评价人数">
<caption>豆瓣电影250</caption>
<thead>
<tr>
<th id="m_order" width="20"></th>
<th id="m_name" width="460"></th>
<th id="m_rating_score" width="39">评分</th>
<th id="m_rating_num">评价人数</th>
</tr>
</thead>
<tbody>
<tr class="item">
<td headers="m_order" class="m_order">
1
</td>
<td headers="m_name">
<a href="">肖申克的救赎 / The Shawshank Redemption</a>
<span class="year">1994</span>
</td>
<td headers="m_rating_score">
<em>9.5</em>
</td>
<td headers="m_rating_num">
445262
</td>
</tr>
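The selectors can be tried offline against this excerpt before fetching anything. The sketch below is written for Python 3 with the bs4 package and the standard-library "html.parser" backend. The names `SAMPLE` and `parse_rows` and the dictionary keys are hypothetical, and the sample row is copied from the excerpt with the closing tags added so that it forms a complete table.

```python
from bs4 import BeautifulSoup

# One row copied from the excerpt above, closed off to form a complete table.
SAMPLE = """
<table class="list_view">
<tbody>
<tr class="item">
<td headers="m_order" class="m_order">
1
</td>
<td headers="m_name">
<a href="">肖申克的救赎 / The Shawshank Redemption</a>
<span class="year">1994</span>
</td>
<td headers="m_rating_score">
<em>9.5</em>
</td>
<td headers="m_rating_num">
445262
</td>
</tr>
</tbody>
</table>
"""


def parse_rows(html):
    """Return one dict per <tr class="item"> row found in the markup."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tag in soup.find_all("tr", class_="item"):
        rows.append({
            # int() and float() ignore the surrounding whitespace and newlines.
            "order": int(tag.find("td", class_="m_order").get_text()),
            "name": tag.a.get_text(strip=True),
            "year": tag.find("span", class_="year").get_text(strip=True),
            "score": float(tag.em.get_text()),
            "votes": int(tag.find("td", headers="m_rating_num").get_text()),
        })
    return rows


print(parse_rows(SAMPLE))
```

If the markup contains no `tr` elements with class `item`, `parse_rows` returns an empty list, so a script built on it would print its header and nothing else.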

Based on this markup, a simple Python script that extracts the ranking information looks like this:

# -*- coding: utf-8 -*-
# Python 2 script: urllib2 is not available in Python 3.
import urllib2
from bs4 import BeautifulSoup


def crawl(url):
    # Download the page and parse it with the standard-library HTML parser.
    page = urllib2.urlopen(url)
    contents = page.read()
    soup = BeautifulSoup(contents, 'html.parser')
    # Header: rank, title, score, number of ratings.
    print(u'豆瓣电影250: 序号 \t影片名\t 评分 \t评价人数')
    # Each film is one <tr class="item"> row.
    for tag in soup.find_all('tr', class_='item'):
        m_order = int(tag.find('td', class_='m_order').get_text())
        m_name = tag.a.get_text()
        m_year = tag.span.get_text()
        m_rating_score = float(tag.em.get_text())
        m_rating_num = int(tag.find(headers="m_rating_num").get_text())
        print(u"%s %s %s %s %s" % (m_order, m_name, m_year, m_rating_score, m_rating_num))


if __name__ == '__main__':
    # Pass the URL of the Top 250 list page here.
    crawl('')

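The original post showed the script's output as a screenshot here. My own modified version of the script, which also prints each film's link, is as follows: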
# -*- coding: utf-8 -*-
# Python 2 script: urllib2 is not available in Python 3.
import urllib2
from bs4 import BeautifulSoup


def crawl(url):
    # Download the page and parse it with the standard-library HTML parser.
    page = urllib2.urlopen(url)
    contents = page.read()
    soup = BeautifulSoup(contents, 'html.parser')
    # Header: rank, title, score, number of ratings, link.
    print(u'               豆瓣电影TOP250:\n 序号 \t影片名\t 评分 \t评价人数 \t 链接 ')
    for tag in soup.find_all('tr', class_='item'):
        m_order = int(tag.find('td', class_='m_order').get_text())
        m_name = tag.a.get_text()
        m_year = tag.span.get_text()
        m_rating_score = float(tag.em.get_text())
        m_rating_num = int(tag.find(headers="m_rating_num").get_text())
        # Read the link from the href attribute instead of splitting str(tag) on quotes.
        m_url = tag.a.get('href', '')
        print(u"%s %s %s %s %s %s " % (m_order, m_name, m_year, m_rating_score, m_rating_num, m_url))


if __name__ == '__main__':
    # Pass the URL of the Top 250 list page here.
    crawl('')

Output (header columns: rank, title, score, number of ratings, link):

               豆瓣电影TOP250:
序号    影片名   评分    评价人数    链接
1 肖申克的救赎 / The Shawshank Redemption 1994 9.5 471051
2 这个杀手不太冷 / Léon 1994 9.4 445356
3 阿甘正传 / Forrest Gump 1994 9.3 404984
4 霸王别姬 1993 9.4 314060
5 盗梦空间 / Inception 2010 9.2 451323
6 海上钢琴师 / La leggenda del pianista sull'oceano 1998 9.1 352935
7 美丽人生 / La vita è bella 1997 9.4 216487
8 三傻大闹宝莱坞 / 3 Idiots 2009 9.1 358053
9 辛德勒的名单 / Schindler's List 1993 9.3 206974
10 放牛班的春天 / Les choristes 2004 9.1 249464
11 龙猫 / となりのトトロ 1988 9.1 226835
12 搏击俱乐部 / Fight Club 1999 9.1 232290
13 泰坦尼克号 / Titanic 1997 8.9 362303
14 教父 / The Godfather 1972 9.2 182709
15 天堂电影院 / Nuovo Cinema Paradiso 1988 9.1 178064
16 忠犬八公的故事 / Hachi: A Dog's Tale 2009 9.1 207328
17 千与千寻 / 千と千尋の神隠し 2001 9.0 341993
18 罗马假日 / Roman Holiday 1953 8.9 227359
19 乱世佳人 / Gone with the Wind 1939 9.2 159744
20 大话西游之大圣娶亲 / 西遊記大結局之仙履奇緣 1995 8.9 215545
21 天使爱美丽 / Le fabuleux destin d'Amélie Poulain 2001 8.8 305080
22 当幸福来敲门 / The Pursuit of Happyness 2006 8.8 310963
23 楚门的世界 / The Truman Show 1998 8.9 220642
24 怦然心动 / Flipped 2010 8.8 257920
25 两杆大烟枪 / Lock, Stock and Two Smoking Barrels 1998 9.1 148561
26 飞越疯人院 / One Flew Over the Cuckoo's Nest 1975 9.0 152383
27 指环王3：王者无敌 / The Lord of the Rings: The Return of the King 2003 9.0 157985
28 七宗罪 / Se7en 1995 8.7 260277
29 闻香识女人 / Scent of a Woman 1992 8.8 180207
30 让子弹飞 2010 8.8 395990
31 情书 / Love Letter 1995 8.7 214434
32 海豚湾 / The Cove 2009 9.4 119388
33 大话西游之月光宝盒 / 西遊記第一百零一回之月光寶盒 1995 8.8 185762
34 剪刀手爱德华 / Edward Scissorhands 1990 8.7 298530
35 无间道 / 無間道 2002 8.7 207719
36 少年派的奇幻漂流 / Life of Pi 2012 9.0 338297
37 美丽心灵 / A Beautiful Mind 2001 8.8 170725
38 鬼子来了 2000 9.1 116185
39 指环王1：魔戒再现 / The Lord of the Rings: The Fellowship of the Ring 2001 8.8 169109
40 阿凡达 / Avatar 2009 8.7 340480
41 低俗小说 / Pulp Fiction 1994 8.8 180731
42 勇敢的心 / Braveheart 1995 8.7 192747
43 指环王2：双塔奇兵 / The Lord of the Rings: The Two Towers 2002 8.8 149774
44 机器人总动员 / WALL·E 2008 9.3 290827
45 飞屋环游记 / Up 2009 8.8 298769
46 蝙蝠侠：黑暗骑士 / The Dark Knight 2008 8.8 161024
47 活着 1994 8.9 132871
48 窃听风暴 / Das Leben der Anderen 2006 9.0 107988
49 死亡诗社 / Dead Poets Society 1989 8.8 146513
50 入殓师 / おくりびと 2008 8.7 199862


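像少年啦啦啦飞驰 2015-06-04 16:19:20

Hello, I used your code and found that it only prints the header line; the loop never runs. Do you know what the problem might be?
Also, I don't quite understand the last two lines. Could you explain them? I'm a complete beginner and have only just started learning.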
