chinaboy小宝 (chinaboy007.blog.chinaunix.net)

Scraping OJ problems with Python (1): analyzing OJ page elements with BeautifulSoup

Category: Python/Ruby

2013-07-06 01:07:56

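Finally done, so here are some notes.

My task covered two judges, HDOJ and TOJ, though in practice it was one job: HDOJ took about four days, and TOJ was finished in a single morning. So I will set TOJ aside and describe everything in terms of HDOJ, namely how to pull the text and images off an OJ and store them in a database. To check that the result is correct, Django is too convenient to leave out.

By the way, I finally understand what Django's static files are about; more on that later.

First open the HDU (Hangzhou Dianzi University) site and go to Problem Archive. There are a lot of problems; click any one, for example 1056 (which gave me a lot of trouble) or 1057. The first thing to do is analyze the elements of the page. Why? Everything here will eventually go into a database, so I need to know which columns the table will have, and which parts are common across problem numbers so that one function can handle them all. So open Firebug in Firefox (or the equivalent tool in Chrome) and you will see the page's HTML structure.

Looking at the page, a problem seems to contain: 1. title 2. limit des 3. problem des 4. input 5. output 6. sample input 7. sample output 8. hint 9. author 10. source 11. recommend 12. images. At first I assumed the first five were always present: surely every problem has a title, a limit description, a problem description, input, and output. Then my first version ran into the odd problem 1056, which has no Input or Output section at all. I was scraping from problem 1000 to 2000, and every time it reached 1056 Python raised an exception and the run died. I searched everywhere for the cause before finally looking at 1056 itself. Oh, so that was it.

The lesson: never assume that a field is always there.

Later I asked a senior student who had built a similar OJ scraper, and he gave me a very useful diagram, too good to keep to myself, so I am posting it here. For my current task only the problem column matters.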

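In the end I found that every HDU problem page has the same address, ending in = followed by a four-digit problem number; the OJs presumably keep their problems in a database too.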
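Since only the problem number changes, the list of pages to fetch can be generated up front. Below is a minimal sketch in Python 3 (the code in this post is Python 2). BASE_URL is a made-up placeholder, not the real HDU address, and problem_urls is a name invented for this example.

```python
# Hypothetical placeholder: substitute the real problem-page address,
# which ends in "=" and is followed by the problem number.
BASE_URL = "http://oj.example.com/problem?id="


def problem_urls(first, last, base_url=BASE_URL):
    """Return the page URL for every problem number in [first, last]."""
    return [base_url + str(pid) for pid in range(first, last + 1)]


urls = problem_urls(1000, 1002)
print(urls)
```

Each URL in the list can then be handed to the catch function shown below.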
Now let's walk through the code to see how BeautifulSoup parses the page.
Here is the code first:
1 #! -*- encoding:utf-8 -*-
2 import urllib2
3 import traceback
4 from BeautifulSoup import BeautifulSoup
5 from sqlalchemy import *
6 from sqlalchemy.orm import *
7
8 def catch(url=None, pro_image='/images/hdoj/'):
9     # Returns a list of 12 items:
10     #   1. title  2. limit des  3. problem des  4. input  5. output
11     #   6. sample input  7. sample output  8. hint  9. author
12     #   10. source  11. recommend  12. images
13     # The last item is itself a list of the original image URLs.
14     content_stream = urllib2.urlopen(url)
15     content = content_stream.read()
16     print 'catching: ' + url
17     soup = BeautifulSoup(content)
18     table = soup.table
19
20     #images the real url
21     images_src = table.findAll('img')[1:]
22     images = []
23
24     len_img = len(images_src)
25
26     for i in range(len_img):
27         image = str(images_src[i].attrs[0][1])
28         images.append(image)
29
30     # now we change the images url
31
32     for i in range(len_img):
33         images_src[i]['src'] = pro_image + images_src[i].attrs[0][1].split('/')[-1]
34
35     #title
36     table_title = table.find('h1')
37     table_title.hidden = True
38     #title below limits description
39     table_limit_des = table_title.findNext('span')
40     table_limit_des.hidden = True
41     # problem description, input, output, sample input, sample output
42     try:
43         table_problem_des = table.find(text='Problem Description').findNext('div', {'class':'panel_content'})
44         table_problem_des.hidden = True
45     except Exception as e:
46         table_problem_des = None
47
48     #input
49     try:
50         table_input = table.find(text='Input').findNext('div', {'class':'panel_content'})
51         table_input.hidden = True
52     except Exception as e:
53         table_input = None
54     #output
55     try:
56         table_output = table.find(text='Output').findNext('div', {'class':'panel_content'})
57         table_output.hidden = True
58     except Exception as e:
59         table_output = None
60     #sample input
61     try:
62         table_sample_input = table.find(text='Sample Input').findNext('div', {'class':'panel_content'})
63         table_sample_input.hidden = True
64     except Exception as e:
65         table_sample_input = None
66     #sample output
67     try:
68         table_sample_output = table.find(text='Sample Output').findNext('div', {'class':'panel_content'})
69         table_sample_output.hidden = True
70     except Exception as e:
71         table_sample_output = None
72
73     # hint
74     try:
75         table_hint = table_sample_output.i.next.next
76     except Exception as e:
77         table_hint = None
78     try:
79         table_sample_output = table_sample_output.i.previous.previous.previous
80     except Exception as e:
81         pass
82
83     # source
84     try:
85         table_source = table.find(text='Source').findNext('div', {'class':'panel_content'})
86         table_source.hidden = True
87     except Exception as e:
88        # print e
89         table_source = None
90
91     #recommend
92     try:
93         table_recommend = table.find(text='Recommend').findNext('div', {'class':'panel_content'})
94         table_recommend.hidden = True
95     except Exception as e:
96       # print e
97         table_recommend = None
98
99     # author
100     try:
101         table_author = table.find(text='Author').findNext('div', {'class':'panel_content'})
102         table_author.hidden = True
103     except Exception as e:
104       # print e
105         table_author = None
106
107
108
109
110     info = []
111
112
113     info.append(str(table_title))
114     info.append(str(table_limit_des))
115     info.append(str(table_problem_des))
116     info.append(str(table_input))
117     info.append(str(table_output))
118     info.append(str(table_sample_input))
119     info.append(str(table_sample_output))
120     info.append(str(table_hint))
121     info.append(str(table_author))
122     info.append(str(table_source))
123     info.append(str(table_recommend))
124     info.append(images)
125
126     return info

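Line 2, import urllib2, imports a module from Python's standard library.
With it imported, line 14, content_stream = urllib2.urlopen(url), opens the page,
and line 15, content = content_stream.read(), reads the page's HTML.
You can even print content and check that it matches what Firebug shows for that page.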
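The code in this post is Python 2, where urllib2 exists. As a rough sketch of the same fetch step on Python 3, the corresponding call is urllib.request.urlopen. The fetch helper and the data: URL are stand-ins invented here so that the example runs offline; a real run would pass a problem page URL instead, and the page encoding is an assumption.

```python
from urllib.parse import quote
from urllib.request import urlopen


def fetch(url, encoding="utf-8"):
    """Open url, read the raw bytes, and decode them to text."""
    with urlopen(url) as stream:
        return stream.read().decode(encoding, errors="replace")


# A data: URL carries its own content, so no network access is needed.
sample_url = "data:text/html," + quote("<h1>A + B Problem</h1>")
content = fetch(sample_url)
print(content)
```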

Line 4, from BeautifulSoup import BeautifulSoup, imports the BeautifulSoup class, which line 17 uses to parse content.

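Everything on the page has been fetched, and it really is the same as what Firebug shows. Note that both what Firebug displays and what BeautifulSoup parses are local copies in memory, not the live page on the server, which is why BS (BeautifulSoup) can modify tags directly (in particular, rewrite image paths).

Now BS takes the stage. First, prettier output: the two lines in the figure below need no explanation, since the names say it all.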

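For full details on BS, see the Chinese documentation: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html
I will just explain my own code.
First, line 18, table = soup.table, takes the first <table> tag in the page. After some analysis, everything I need on HDU is inside that <table> tag, so all later searches start from table.
Next come the images. Why handle them first? Two reasons: (1) the images have to be saved, so I need their real original URLs; (2) in the saved HTML, each image's src must point to a local path. In other words, if the original src is "/data/images/whatever", I need to change it to "/images/hdoj/whatever" before storing the HTML in the database. So I first record the original image URLs, then use BS to rewrite the in-memory copy, and only then carry on with the rest. (This is what I meant by BS working on a local copy rather than the live page: we cannot change what is on the server.)
The image-related code: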
1     #images the real url
2     images_src = table.findAll('img')[1:]
3     images = []
4
5     len_img = len(images_src)
6
7     for i in range(len_img):
8         image = str(images_src[i].attrs[0][1])
9         images.append(image)
10
11     # now we change the images url
12
13     for i in range(len_img):
14         images_src[i]['src'] = pro_image + images_src[i].attrs[0][1].split('/')[-1]

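2   images_src = table.findAll('img')[1:] takes every <img> tag in the table except the first. The result is a list-like collection, and each images_src[i] is a BeautifulSoup.Tag.
3   images = [] will hold the real HDU image URLs that end up in info (they are needed to download the images).
8   image = str(images_src[i].attrs[0][1]) is the tricky line. In the BS documentation I found that BeautifulSoup.Tag has an attrs attribute; printed out, it looks like [(u'src', u'http://www.cnblogs.com/../data/images/1828-1.jpg')], a list containing one tuple whose second element is exactly the image URL. So attrs[0] is the tuple (u'src', u'http://www.cnblogs.com/../data/images/1828-1.jpg'), and attrs[0][1] is the image URL.
14  images_src[i]['src'] = pro_image + images_src[i].attrs[0][1].split('/')[-1] rewrites the URL in the in-memory copy. With pro_image='/images/hdoj/', the src in the saved HTML matches my local image path, which makes it easy for Django to display later.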
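Indexing attrs[0][1] relies on src being the first attribute of the tag. A slightly more defensive sketch, in plain Python 3 with no BeautifulSoup needed, looks the attribute up by name and builds the local path the same way line 14 does. The helper names get_src and localize and the sample attribute lists are made up for illustration; the list-of-tuples shape matches what is printed for attrs above.

```python
import posixpath


def get_src(attrs):
    """Return the src value from a list of (name, value) pairs, or None."""
    return dict(attrs).get("src")


def localize(src, prefix="/images/hdoj/"):
    """Keep only the file name of src and put it under the local prefix."""
    return prefix + posixpath.basename(src)


# Same shape as the attrs list shown above, plus a case where src is not first.
attrs_a = [("src", "../../data/images/1828-1.jpg")]
attrs_b = [("alt", "figure"), ("src", "/data/images/sample-2.jpg")]

for attrs in (attrs_a, attrs_b):
    original = get_src(attrs)   # the original URL, needed to download the image
    print(original, "->", localize(original))
```

With a real tag, images_src[i]['src'] reads the attribute by name as well; it is the same subscript form that line 14 already uses for the assignment.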
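With the images done, the text is next. After many trials, the title and the limit information really are always present, hence the following code: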
1     #title
2     table_title = table.find('h1')
3     table_title.hidden = True
4     #title below limits description
5     table_limit_des = table_title.findNext('span')
6     table_limit_des.hidden = True

　
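2   table_title = table.find('h1') finds the <h1> tag inside table, because on HDU that is where the title lives.
Line 5 works the same way, so I will not explain it.
6   table_limit_des.hidden = True hides the tag. What does that mean? When the element is converted to a string, only its contents are output, without the surrounding tag itself.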



Everything after that follows the same pattern as the code below, so I will only explain the first one. The code:
1      try:
2          table_problem_des = table.find(text='Problem Description').findNext('div', {'class':'panel_content'})
3          table_problem_des.hidden = True
4      except Exception as e:
5          table_problem_des = None

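Why catch exceptions? As I said earlier, anything can happen on an OJ; even Input can be missing, so the problem description, input, output, and so on must all be wrapped in exception handling. For example, if there is no problem description, table.find(text='Problem Description') returns None, and the following call to findNext('div', {'class':'panel_content'}) on None raises an exception. The simple approach is to keep a list, info, holding these 12 fields, and to set any field that is missing to None.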
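The same idea, a lookup that yields None for a missing section instead of crashing, can be sketched without BeautifulSoup. The example below is not the code from this post: it swaps in the html.parser module from Python 3's standard library so that it runs with no extra packages. The SectionParser class, the extract_sections function, and the sample HTML are invented for illustration; only the panel_content class name and the heading texts come from the code above.

```python
from html.parser import HTMLParser

HEADINGS = ("Problem Description", "Input", "Output",
            "Sample Input", "Sample Output")


class SectionParser(HTMLParser):
    """Collect the text of the panel_content div that follows each heading."""

    def __init__(self):
        super().__init__()
        self.sections = {name: None for name in HEADINGS}
        self.pending = None   # heading seen, waiting for its content div
        self.current = None   # heading whose content div is being read
        self.depth = 0        # nesting level of divs inside that content div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.current is not None:
            self.depth += 1
        elif self.pending and dict(attrs).get("class") == "panel_content":
            self.current, self.pending = self.pending, None
            self.sections[self.current] = ""
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.current is not None:
            self.depth -= 1
            if self.depth == 0:
                self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.sections[self.current] += data
        elif data.strip() in self.sections:
            self.pending = data.strip()


def extract_sections(html):
    """Return {heading: text or None}; absent sections stay None."""
    parser = SectionParser()
    parser.feed(html)
    parser.close()
    return {name: text.strip() if text is not None else None
            for name, text in parser.sections.items()}


# Made-up page with no Input or Output section.
page = """
<div>Problem Description</div>
<div class="panel_content">Add two numbers.</div>
<div>Sample Input</div>
<div class="panel_content">1 2</div>
<div>Sample Output</div>
<div class="panel_content">3</div>
"""

sections = extract_sections(page)
print(sections["Problem Description"])
print(sections["Input"])   # None, because the section is absent
```

Starting every field at None and filling in only what is found means no try/except is needed per section, at the cost of writing the matching logic by hand.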
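Finally, all the fields are appended to the info list and returned, with images as the last element. Note that by this point the src values inside the problem HTML stored in info already differ from the URLs in images.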
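One detail worth noting: because every field is passed through str() before being appended, a field that was set to None ends up in info as the string 'None'. The sketch below shows an alternative layout (not what the code above does) that keeps real None values and labels each field by name instead of by position. FIELDS and build_info are names invented here, and the values are stand-ins.

```python
# Field names in the same order the post appends them to info.
FIELDS = ("title", "limit_des", "problem_des", "input", "output",
          "sample_input", "sample_output", "hint", "author",
          "source", "recommend")


def build_info(values, images):
    """Pair each field name with its value; missing fields stay None."""
    if len(values) != len(FIELDS):
        raise ValueError("expected %d values" % len(FIELDS))
    info = {name: (None if value is None else str(value))
            for name, value in zip(FIELDS, values)}
    info["images"] = list(images)
    return info


# Stand-in values: a problem with a title and limits but nothing else.
values = ["Sample Title", "Time Limit: 1000 MS"] + [None] * 9
info = build_info(values, ["/data/images/1.jpg"])

print(str(None) == "None")     # the pitfall: str() hides the missing value
print(info["input"] is None)   # here the gap is still detectable
```

A dictionary also makes the later storage step less error-prone, since a column is looked up by name rather than by remembering that index 3 means input.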
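That wraps up part one, catch. For detailed BeautifulSoup usage you really do need the documentation, but it is not much trouble. Over the next few days I will try to write up store (storage with SQLAlchemy) and Django (so that is how static files are used). Working on it.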
