A Small Web Crawler in Python - 有经验的网管 - ChinaUnix Blog

A Small Web Crawler in Python

Category: Python/Ruby

2014-05-11 22:13:14

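1. Purpose
Scrape the last few days of data from a web page that reports daily pig prices, save it to a local file, then remove duplicate lines and blank lines from that file.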
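The clean-up step described above (drop duplicates, drop blank lines) can be sketched as a single pass over the lines. This is only an illustration: `clean_lines` is a hypothetical helper name, and unlike the listing in the next section it compares lines after stripping surrounding whitespace, which is an assumption about what should count as a duplicate.

```python
def clean_lines(lines):
    """Drop blank lines and repeated lines, keeping first occurrences in order."""
    seen = set()
    kept = []
    for line in lines:
        text = line.strip()  # ignore trailing '\r\n' and stray spaces
        if not text or text in seen:
            continue  # skip blank lines and anything already kept
        seen.add(text)
        kept.append(text)
    return kept
```

To apply it to the scraped file, read the lines of zhujia.txt, pass them through `clean_lines`, and write the result to the output file joined with newlines.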
2. Code

# coding: utf8
import re
import time
import urllib.request

# Work out the current month number from the asctime() string,
# e.g. 'Sun May 11 22:13:14 2014' -> 'May' -> '5'.
ti = time.asctime()
print(ti)
t = ti[4:7]
if t == 'Jan': t = '1'
elif t == 'Feb': t = '2'
elif t == 'Mar': t = '3'
elif t == 'Apr': t = '4'
elif t == 'May': t = '5'
elif t == 'Jun': t = '6'
elif t == 'Jul': t = '7'
elif t == 'Aug': t = '8'
elif t == 'Sep': t = '9'
elif t == 'Oct': t = '10'
elif t == 'Nov': t = '11'
elif t == 'Dec': t = '12'

# Matches a link whose title is one of this month's Hunan pig price reports.
find_re0 = re.compile('"h.{53} title="2014年%s月.{1,2}日湖南.{0,2}生猪价格' % t, re.DOTALL)
# Strips the surrounding text from such a match, leaving only the report URL.
aaa = re.compile(r'h.{52}')
# Matches one price line; the second pattern covers the alternate layout.
find_re = re.compile('湖南省.{3} .{2,3} 生猪价格.{4} %s月.{1,2}日 .{9}' % t, re.DOTALL)
find_re_alt = re.compile('湖南省 .{2,3} 生猪价格.{4} %s月.{1,2}日 .{9}' % t, re.DOTALL)

with open('zhujia.txt', 'a', encoding='utf-8') as out:
    # Scan the first three index pages for links to the reports.
    for page in range(1, 4):
        # NOTE: the index-page address is missing from the listing as published;
        # only the page-number placeholder survives. Fill it in before running.
        url = '%s' % page
        # The index pages are decoded as GBK.
        html = urllib.request.urlopen(url).read().decode('GBK')
        for x in find_re0.findall(html):
            for yy in aaa.findall(x):
                detail_url = yy
                print(detail_url)
                # Download the first page of the report.
                detail_html = urllib.request.urlopen(detail_url).read().decode('gb2312')
                rows = find_re.findall(detail_html)
                if rows == []:
                    rows = find_re_alt.findall(detail_html)
                for row in rows:
                    # Write the matched line to the file.
                    out.write(row + '\n')
                    print(row)
                # Sleep 2 seconds between requests.
                time.sleep(2)
                # The following pages are named <base>_2.html, <base>_3.html, ...,
                # so drop the '.html' suffix and substitute the page number.
                base = detail_url[:-5]
                for y in range(2, 5):
                    url_1 = '%s_%s.html' % (base, y)
                    print(url_1)
                    page_html = urllib.request.urlopen(url_1).read().decode('gb2312')
                    time.sleep(2)
                    # Search the page just downloaded (the original listing
                    # searched the previous page's HTML here by mistake).
                    rows = find_re.findall(page_html)
                    if rows == []:
                        rows = find_re_alt.findall(page_html)
                    for row in rows:
                        out.write(row + '\n')
                        print(row)
print('download complete!')

# Remove duplicate lines, keeping the first occurrence of each.
lines_seen = set()
with open('zhujia.txt', 'r', encoding='utf-8') as infile, \
        open('2.txt', 'w', encoding='utf-8') as outfile:
    for line in infile:
        if line not in lines_seen:
            outfile.write(line)
            lines_seen.add(line)

# Remove blank lines.
with open('2.txt', 'r', encoding='utf-8') as infp, \
        open('1.txt', 'w', encoding='utf-8') as outfp:
    for lin in infp:
        if lin.split():
            outfp.write(lin)
print('file cleaned up')
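As a side note on the date handling at the top of the listing: the month number can also be read directly from `time.localtime()` instead of mapping the month abbreviation by hand. The sketch below is one possible shape for that; `month_number` and `report_date_prefix` are hypothetical names, and deriving the year as well assumes the report titles keep the same "2014年5月" format in later years.

```python
import time


def month_number(now=None):
    """Return the month as an unpadded string, e.g. '5' for May."""
    if now is None:
        now = time.localtime()
    return str(now.tm_mon)


def report_date_prefix(now=None):
    """Return the year-and-month prefix used in the report titles."""
    if now is None:
        now = time.localtime()
    # Assumption: titles keep the '<year>年<month>月' form with an unpadded month.
    return "%d年%d月" % (now.tm_year, now.tm_mon)
```

Passing an explicit `struct_time` (for example from `time.strptime`) makes it possible to rebuild the patterns for an earlier month without changing the system clock.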

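3. Issues
a. The site does not restrict crawlers. At first I set no delay between requests, and the rapid looping seems to have brought the other side's server down for a while.
b. By analyzing the HTML, the key data can be matched with regular expressions, and the analysis also reveals the site's habits in how it writes up its data.
c. The code contains a lot of repetition that could be factored into classes and functions; I was short on time and did not optimize it further.
d. I wrote it one small feature at a time, and stitched the pieces together once each one produced correct output.
e. I am still a beginner with limited skill, so please excuse the rough work.
f. I plan to use reportlab to plot the data as a price-trend chart in a PDF.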
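Points a and c suggest two small helpers: one that always pauses after a request, and one that tries several patterns in order. The following is a minimal sketch under stated assumptions, not a drop-in replacement for the listing. The names `fetch` and `first_matches`, the `opener` parameter (there only so the function can be exercised without a network), and the choice of `errors="replace"` when decoding are all illustrative.

```python
import re
import time
import urllib.request


def fetch(url, encoding="gb2312", delay=2.0, opener=urllib.request.urlopen):
    """Download one page, decode it, then pause so requests stay spaced out."""
    with opener(url) as response:
        # Assumption: undecodable bytes are replaced rather than raising.
        html = response.read().decode(encoding, errors="replace")
    time.sleep(delay)  # the pause applies to every request, not just some
    return html


def first_matches(patterns, text):
    """Return the matches of the first pattern that finds anything, else []."""
    for pattern in patterns:
        found = re.findall(pattern, text, re.DOTALL)
        if found:
            return found
    return []
```

With helpers like these, the first page and the follow-up pages of a report could share one code path: fetch the page, call `first_matches` with the primary and alternate price patterns, and write whatever comes back.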
