Category: Python/Ruby

2012-04-28 15:16:18


#!/usr/bin/env python3
import re
import urllib.request


def downURL(url, filename):
    """Download url and save the response body to filename."""
    print(filename)
    try:
        fp = urllib.request.urlopen(url)
    except (OSError, ValueError):
        # URLError is a subclass of OSError; ValueError covers malformed URLs.
        print('download exception')
        return 0
    with fp, open(filename, "wb") as op:
        op.write(fp.read())
    return 1


def getURL(url):
    """Return the sina.com.cn .shtml links found in the page at url."""
    try:
        fp = urllib.request.urlopen(url)
    except (OSError, ValueError):
        print('get url exception')
        return []

    # Match on raw bytes so the page's character encoding does not matter.
    pattern = re.compile(rb'https?://[\w.-]+\.sina\.com\.cn/[\w./%-]*\.shtml')
    with fp:
        s = fp.read()
    return [u.decode('ascii') for u in pattern.findall(s)]


def spider(startURL, times):
    """Crawl breadth-first from startURL, saving pages as 0.htm, 1.htm, ..."""
    urls = [startURL]
    i = 0
    # Pages are numbered 0 through times, as in the original loop.
    while urls and i <= times:
        url = urls.pop(0)
        print(url, len(urls), i)
        downURL(url, str(i) + '.htm')
        i = i + 1
        # Only fetch more links while the queue is shorter than the page limit.
        if len(urls) < times:
            for link in getURL(url):
                if link not in urls:
                    urls.append(link)
    return 0


# downURL('', 'http.log')
# The start URL below is as given in the post; urlopen needs a full URL including the scheme.
spider('.cn', 30)
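The link pattern used by getURL is easiest to check on its own, without any network access. A glob-style string such as `*.sina.com.cn.*.shtml` is not a valid regular expression, because a leading `*` has nothing to repeat. The sketch below, written for Python 3, demonstrates that and tries one possible replacement on a made-up HTML fragment. The replacement pattern, the helper name `find_links`, and the sample URLs are assumptions for illustration, not taken from the post.

```python
import re

# A glob-style string is not a regex: a leading '*' has nothing to repeat.
try:
    re.compile("*.sina.com.cn.*.shtml")
    glob_is_valid = True
except re.error:
    glob_is_valid = False

# Assumed replacement: any host under sina.com.cn, path ending in .shtml.
LINK_PATTERN = re.compile(r'https?://[\w.-]+\.sina\.com\.cn/[\w./%-]*\.shtml')


def find_links(html):
    """Return matching links in first-seen order, without duplicates."""
    links = []
    for link in LINK_PATTERN.findall(html):
        if link not in links:
            links.append(link)
    return links


# Made-up sample markup, used only to exercise the pattern.
sample = (
    '<a href="http://news.sina.com.cn/c/2012-04-28/1.shtml">a</a>'
    '<a href="http://sports.sina.com.cn/nba/index.shtml">b</a>'
    '<a href="http://news.sina.com.cn/c/2012-04-28/1.shtml">a again</a>'
    '<a href="http://example.com/page.shtml">other site</a>'
)

print(glob_is_valid)
for link in find_links(sample):
    print(link)
```

The character class `[\w./%-]` deliberately excludes quotes and angle brackets, so a match cannot run past the end of an `href` attribute the way a greedy `.*` would.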

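The crawl loop in spider is a breadth-first traversal over a queue of URLs. Because the queue is only checked for duplicates among URLs still waiting, a page that has already been popped can be queued and downloaded again. A minimal sketch of one way to avoid that, assuming a separate `seen` set alongside a `deque`: the names `crawl` and `fetch_links` and the in-memory link graph are hypothetical, and the link fetcher is passed in as a parameter so the loop can be tried without network access.

```python
from collections import deque


def crawl(start_url, max_pages, fetch_links):
    """Visit up to max_pages URLs breadth-first; return them in visit order.

    fetch_links is any callable mapping a URL to a list of linked URLs.
    """
    queue = deque([start_url])
    seen = {start_url}          # every URL ever queued, visited or not
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()   # O(1), unlike list.pop(0)
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical link graph standing in for real pages; note the cycle a <-> b.
graph = {
    'a': ['b', 'c'],
    'b': ['a', 'd'],
    'c': ['d'],
    'd': [],
}

order = crawl('a', 10, lambda url: graph.get(url, []))
print(order)
```

A function like getURL could be passed as `fetch_links`, provided it returns a list (an empty one on failure) rather than a number.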