Category: Python/Ruby

2017-08-22 17:31:01

Version: Python 3.5


# -*- coding: utf-8 -*-
import re
import os
import socket
import urllib.parse
import urllib.request

# Give up on slow connections instead of hanging forever.
socket.setdefaulttimeout(5)

keyword = ""
i = 0  # running index used to number the saved files


def downloadPic(html, word):
    """Extract every objURL from a Baidu image-search result page and save the .jpg files."""
    global i
    pic_urls = re.findall('"objURL":"(.*?)",', html, re.S)
    print('keyword: ' + keyword + ' downloading...')
    save_dir = os.path.join('pictures', keyword)
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)
    for each in pic_urls:
        if not each.endswith('.jpg'):
            print('not a jpg picture: ' + each)
            continue
        print('Downloading index(' + str(i + 1) + '), url: ' + each)
        try:
            filename = os.path.join(save_dir, word + '_' + str(i) + '.jpg')
            print(filename)
            urllib.request.urlretrieve(each, filename)
        except Exception:
            print('failed to download the current picture')
            continue
        i += 1


if __name__ == '__main__':
    word = input('Input key word: ')
    keyword = word
    word = urllib.parse.quote(word)  # percent-encode the keyword for the query string
    page = 0
    while True:  # keep fetching result pages until interrupted (Ctrl+C)
        url = ('http://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word='
               + word + '&pn=' + str(page) + '&ct=&ic=0&lm=-1&width=0&height=0')
        print('download url: ' + url)
        result = urllib.request.urlopen(url).read()
        result = result.decode('utf-8')
        downloadPic(result, word)
        page += 20  # Baidu returns 20 results per page
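The original source also carried commented-out calls to the third-party requests library (requests.get with timeout=10, then writing pic.content to a file) as an alternative download path. Below is a minimal sketch of that variant, assuming requests is installed (pip install requests) and reusing the same objURL regex; the helper name download_jpgs and its parameters are illustrative, not part of the original code.

# Hedged sketch of the requests-based download path hinted at in the original comments.
import os
import re

import requests  # third-party; not part of the standard library


def download_jpgs(html, save_dir, prefix):
    """Save every .jpg objURL found in a Baidu result page under save_dir."""
    os.makedirs(save_dir, exist_ok=True)
    count = 0
    for url in re.findall('"objURL":"(.*?)",', html, re.S):
        if not url.endswith('.jpg'):
            continue
        try:
            pic = requests.get(url, timeout=10)
            pic.raise_for_status()
        except requests.RequestException:
            print('failed to download: ' + url)
            continue
        path = os.path.join(save_dir, prefix + '_' + str(count) + '.jpg')
        with open(path, 'wb') as fp:
            fp.write(pic.content)  # write the raw image bytes
        count += 1
    return count

Writing the response bytes with open(path, 'wb') replaces the cp936 filename workaround mentioned in the original comments, which is no longer needed on Python 3, where open() accepts Unicode filenames directly.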

The code above is adapted from code by voidsky:
http://blog.csdn.net/hk2291976/article/details/51188728
Thanks to voidsky, whose blog has more introductory material on web-crawling techniques and is recommended reading.