A Python program for fetching newly updated novel chapters - maonx

A Python program for fetching newly updated novel chapters

Category: Python/Ruby

2010-03-07 23:45:02

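I spent a weekend, two full days, putting together a program that fetches newly updated chapters of the novels I am reading. It is my first attempt at this kind of thing, so the modules are a bit messy and I did not plan the structure properly, but it does manage to run. Each run finds the chapters updated since the previous run and writes their contents to a single file (tempnovel.txt). Two files must also exist in the same directory: novelName, which holds the book number and title of each novel you follow (book numbers can be looked up on 快眼小说), and readChapter, which records the last chapter read for each book and may start out empty.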
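Both files use one `key,value` pair per line. The following is a minimal Python 3 sketch (the script itself is Python 2) of how that state can be loaded and used to pick out unread chapters. The helper names `load_pairs` and `unread_chapters` and the sample IDs are made up for illustration, and the sketch assumes the update list has already been reduced to plain chapter IDs, newest first.

```python
def load_pairs(text):
    """Parse 'key,value' lines (the novelName / readChapter layout) into a dict."""
    pairs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # an empty readChapter file simply yields {}
        key, value = line.split(",", 1)
        pairs[key] = value
    return pairs


def unread_chapters(latest, last_read, limit=5):
    """Return the chapter IDs newer than last_read, oldest first.

    latest is ordered newest first. If last_read is None or is not among
    the newest `limit` entries, all of those entries count as unread.
    """
    window = latest[:limit]
    if last_read in window:
        window = window[:window.index(last_read)]
    return list(reversed(window))


# Hypothetical sample data, not taken from the real site.
names = load_pairs("1001,Some Novel\n1002,Another Novel\n")
read = load_pairs("1001,c3\n")
latest = ["c5", "c4", "c3", "c2", "c1"]
print(unread_chapters(latest, read.get("1001")))  # ['c4', 'c5']
print(unread_chapters(latest, read.get("1002")))  # ['c1', 'c2', 'c3', 'c4', 'c5']
```

Returning the chapters oldest first means they can be appended to the output file in reading order.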
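This is probably not a general-purpose version. It runs on my own machine, but I do not know how it will behave on anyone else's. It shells out to Linux commands (rm, wget, clear), so it will not work under Windows.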
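The Linux-only parts are the shell calls. As a rough sketch of how they could be swapped for standard-library calls that also work on Windows, here is a Python 3 version of the cleanup and download steps. The function names `remove_old_output` and `download_image` are invented for this example, and saving each image under the last component of its URL is an assumption about what `wget` would have produced.

```python
import glob
import os
import urllib.request


def remove_old_output(directory="."):
    """Delete tempnovel.txt and any *_*.gif files, like the script's two rm calls."""
    removed = []
    targets = [os.path.join(directory, "tempnovel.txt")]
    targets += glob.glob(os.path.join(directory, "*_*.gif"))
    for path in targets:
        if os.path.isfile(path):
            os.remove(path)
            removed.append(os.path.basename(path))
    return sorted(removed)


def download_image(url, directory="."):
    """Save url into directory under its last path component (stand-in for wget)."""
    filename = os.path.join(directory, url.rsplit("/", 1)[-1])
    urllib.request.urlretrieve(url, filename)
    return filename
```

Unlike `rm ... 2>/dev/null`, `remove_old_output` reports which files it deleted, which makes it easier to check what a run cleaned up.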
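The program works and otherwise raises no errors, but the first run often just hangs and then reports a connection error. If I press Ctrl+C and run it again right away, or interrupt and rerun it once more, it completes normally. I do not know what causes this and cannot spot it myself. If you read this and know the answer, please leave a comment. Thanks in advance!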
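This is not a diagnosis, only a possible workaround: the script calls `urllib.urlopen` without any timeout, so a stalled connection can block for a long time. A Python 3 sketch that automates the manual "interrupt and rerun" with a timeout and a few retries might look like this. The name `fetch_with_retry` and its default values are assumptions, and the `opener` parameter exists only so the function can be exercised without a network.

```python
import time
import urllib.request


def fetch_with_retry(url, attempts=3, timeout=10, delay=1.0,
                     opener=urllib.request.urlopen):
    """Return the body of url, retrying on connection errors and timeouts."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            response = opener(url, timeout=timeout)
            try:
                return response.read()
            finally:
                response.close()
        except OSError as err:
            # urllib.error.URLError and socket timeouts are OSError subclasses.
            last_error = err
            if attempt < attempts:
                time.sleep(delay)  # brief pause before the next attempt
    raise last_error
```

If every attempt fails, the last error is raised again, so a site that is really down still shows up as an error instead of an endless loop.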

Code: =======================

#!/usr/bin/python
# -*- coding=utf8 -*-
# Checks the 天天中文 novel site for newly updated chapters (Python 2).
# This is my first program of this kind, so it is a bit messy and the
# modules were not planned well.
# It relies on two external files, readChapter and novelName.
# readChapter may be empty, in which case the five most recent chapters of
# each book in novelName are loaded. novelName holds "book number,book name"
# pairs; the book numbers were looked up on 快眼看书.
#
from sgmllib import SGMLParser
import urllib
import os


class htmlParser(SGMLParser):
    # Home-made page filter: only <script> and <frame> tags are handled.
    def reset(self):
        SGMLParser.reset(self)
        self.data = []
        self.process = 0
        # Index of the <script> block to capture (e.g. 2 = second script).
        # Callers set it after construction; 0 matches nothing, which avoids
        # an AttributeError in handle_data when it was never set.
        self.num = 0
        self.src = None
        self.countNum = 0

    def start_frame(self, attrs):
        self.src = [v for k, v in attrs if k == 'src'][0]

    def start_script(self, attrs):
        self.process = 1
        self.countNum += 1

    def handle_data(self, text):
        if self.process == 1 and self.countNum == self.num:
            self.data.append(text)

    def end_script(self):
        self.process = 0


def newChapterList(bookId):
    # List any newly updated chapters for one book.
    url = "" + bookId + "&SiteID=167&Level=0&History=6"
    urlfd = urllib.urlopen(url)
    parser = htmlParser()
    parser.num = 2
    parser.feed(urlfd.read())
    newIndex = parser.data[0]
    fd = open('readChapter')
    hadRead = fd.read()
    newIndex = newIndex[3:-3].split('(')[1:6]
    #print newIndex
    print '\n' + 10*'*' + namelist[bookId] + 10*'*'
    getNovelList[bookId] = []
    if len(hadRead):
        hadRead = hadRead.replace('\n', ',').split(',')[:-1]
        for i in range(0, len(hadRead), 2):
            readlist[hadRead[i]] = hadRead[i+1]
        lastChapter = readlist[bookId]
        # i ends up as the position of the last chapter read.
        for i in range(len(newIndex)):
            if newIndex[i].find(lastChapter) != -1:
                break
        if i == 0:
            print '\n No new chapters!'
        else:
            readlist[bookId] = newIndex[0].split(',')[0]
            print readlist[bookId]
    else:
        i = 6
    #print i
    #print lastChapter
    if i:
        # Keep only the unread entries and process them oldest first.
        newIndex = newIndex[:i]
        newIndex.reverse()
        for i in range(len(newIndex)):
            newIndex[i] = newIndex[i].split(',')[:2]
            print newIndex[i][1][1:-1]
            # print newIndex[i][0]
            readlist[bookId] = newIndex[i][0]
            getNovelList[bookId].append(newIndex[i][0])
    #print readlist[bookId]
    fd.close()
    urlfd.close()
    #return url


def getNovel(bookId, chapterId, filefd):
    # Fetch one updated chapter and append it to tempnovel.txt.
    url = "" + bookId + "&ChapterID=" + chapterId
    urlfd = urllib.urlopen(url)
    parser = htmlParser()
    parser.feed(urlfd.read())
    url = parser.src            # the chapter body lives in a frame
    urlfd.close()
    urlfd = urllib.urlopen(url)
    parser = htmlParser()
    parser.num = 5
    parser.feed(urlfd.read())
    s = parser.data[0]
    #print s
    s = s.split('"')
    # print s
    if s[0][:9] == 'outputTxt':
        # Text chapter: download it and convert from GBK to UTF-8.
        txturl = "" + s[1]
        txturlfd = urllib.urlopen(txturl)
        stxt = txturlfd.read()
        stxt = stxt.replace(' ', '\n')
        stxt = stxt.decode('gbk').encode('utf-8')
        #print stxt
        stxt = '\n**********start**********\n' + '\n' + stxt[16:-3] + '\n'
        txturlfd.close()
    else:
        # Image chapter: download every image with wget.
        s = s[:-1]
        for i in range(1, len(s), 2):
            imageurl = "wget " + s[i]
            os.system(imageurl)
        stxt = '\n**********start**********\n' + '\n This chapter is image-only; the images have been downloaded\n'
    #print stxt
    filefd.write(stxt)
    urlfd.close()


##########################main###################
try:
    os.system('rm tempnovel.txt 2>/dev/null')
    os.system('rm *_*.gif 2>/dev/null')
    os.system('clear')
except:
    pass
readlist = {}       # last chapter read, per book
namelist = {}       # book number -> book name
getNovelList = {}   # chapters to fetch, per book
fd = open('novelName')
filefd = open('tempnovel.txt', 'a')   # append mode ('wa' is not a valid mode)
name = fd.read()
name = name.replace('\n', ',').split(',')[:-1]
for i in range(0, len(name), 2):
    namelist[name[i]] = name[i+1]
for bookId in namelist.keys():
    newChapterList(bookId)
    for chapterId in getNovelList[bookId]:
        getNovel(bookId, chapterId, filefd)
chapterfd = open('readChapter', 'w')
for bookId in readlist.keys():
    chapterfd.write(bookId + ',' + readlist[bookId] + '\n')
chapterfd.close()
filefd.close()
fd.close()
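The `sgmllib` module used above was removed in Python 3. For anyone trying this on a newer interpreter, here is a minimal sketch of the same filter built on the standard `html.parser` module. The class name `ScriptFrameParser` and the sample page are invented for illustration; the real pages may need different handling.

```python
from html.parser import HTMLParser


class ScriptFrameParser(HTMLParser):
    """Collect the text of the num-th <script> and the src of the last <frame>."""

    def __init__(self, num):
        super().__init__()
        self.num = num          # 1-based index of the <script> block to capture
        self.data = []          # text chunks found inside that block
        self.src = None         # src attribute of the most recent <frame>
        self._in_script = False
        self._count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self._count += 1
        elif tag == "frame":
            self.src = dict(attrs).get("src", self.src)

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, text):
        if self._in_script and self._count == self.num:
            self.data.append(text)


# Hypothetical page, only to exercise the parser.
page = ('<html><script>var a=1;</script>'
        '<script>list("c5","c4")</script>'
        '<frameset><frame src="chapter.html"></frameset></html>')
parser = ScriptFrameParser(num=2)
parser.feed(page)
parser.close()
print("".join(parser.data))
print(parser.src)
```

In the script above the update list sits in the second script block (`num=2`) and the chapter text reference in the fifth (`num=5`), so the same class would be created with those values.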
