Fetching a Web Page by URL and Storing It Compressed with Python - alertx - ChinaUnix Blog


Category: Python/Ruby

2010-09-21 00:36:00


2010-06-25 14:18

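It has been a long time since I updated my GAE site. One small feature I built for it earlier: when a bookmark is added, the page at its URL is fetched, compressed, and saved into a GAE model. The implementation is roughly as follows:

1. Fetch the page by URL: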

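# Call site: fetch and compress the page behind a bookmark URL
content = service.getContent(url)

# service module (Python 2, as used on GAE at the time)
import zlib
from random import choice
from urllib import URLopener

def getContent(url):
    """Fetch url and return the zlib-compressed body, or None on failure."""
    try:
        urlopen = MyOpener()
        fp = urlopen.open(url)
        try:
            content = fp.read()
        finally:
            fp.close()
        # Level 9 is the strongest (and slowest) compression level
        content = zlib.compress(content, 9)
    except Exception:
        content = None

    return content

user_agents = [
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
    'Opera/9.25 (Windows NT 5.1; U; en)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
    'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
    'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
    'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9'
]

class MyOpener(URLopener, object):
    # URLopener sends `version` as the User-Agent header.
    # choice() runs once, when the class is defined, not on every request.
    version = choice(user_agents)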

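The service module provides getContent(url), which fetches the content through urlopen.open. Here urlopen is an instance of a custom class, a subclass of URLopener. Its only job is to set version, so that the request carries a browser-like User-Agent and gets past sites that block programmatic fetching. I once used URLopener directly and could not fetch the JavaEye forum. Of course, if a site uses stricter anti-scraping measures, this trick will not help either.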
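For readers on Python 3, where the old URLopener class is deprecated, the same idea can be sketched with urllib.request. This is only an illustration under stated assumptions, not the code the site ran: the names build_request and fetch are hypothetical, and the user-agent list is shortened. Unlike the class attribute above, this version picks a user agent on every call.

```python
import random
import urllib.request

# Shortened list for illustration; any browser-like strings would do.
USER_AGENTS = [
    "Opera/9.25 (Windows NT 5.1; U; en)",
    "Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9",
]


def build_request(url):
    """Return a Request whose User-Agent is picked anew on every call."""
    agent = random.choice(USER_AGENTS)
    return urllib.request.Request(url, headers={"User-Agent": agent})


def fetch(url, timeout=10):
    """Fetch url and return the raw body bytes (performs a network call)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as fp:
        return fp.read()
```

Splitting request construction from the network call keeps the header logic easy to check without touching the network.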

The fetched content is compressed with zlib.compress and then stored.

2. Display the fetched content:
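As a rough illustration of what the compression step buys, the sketch below compresses a made-up HTML page at level 9 and compares sizes. It is written for Python 3, and the helper name pack and the sample page are assumptions for the example only.

```python
import zlib


def pack(content):
    """Compress page bytes with the strongest zlib level (9)."""
    return zlib.compress(content, 9)


# A made-up, highly repetitive page; real pages compress less dramatically.
page = b"<html><body>" + b"<p>hello bookmark</p>" * 200 + b"</body></html>"
packed = pack(page)

# The compressed blob is what would be stored; it restores to the original.
saved_bytes = len(page) - len(packed)
restored = zlib.decompress(packed)
```

Markup tends to be repetitive, which is why HTML usually compresses well before being written to storage.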

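import zlib
from google.appengine.ext import webapp

class ShowContent(webapp.RequestHandler):
    def get(self, id):
        # CoolBookmark is the datastore model that holds the compressed page
        bookmark = CoolBookmark.get_by_id(int(id))
        if bookmark and bookmark.zipcontent:
            self.response.headers['Content-Type'] = "text/html"
            # Undo zlib.compress before writing the page to the response
            decomp = zlib.decompressobj()
            content = decomp.decompress(bookmark.zipcontent)
            self.response.out.write(content)
        else:
            # Unknown id, or the fetch failed and nothing was stored
            self.error(404)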

The only thing to watch here is to decompress the stored content before writing it out through response.out.
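A decompression object can also be fed the stored blob piece by piece, which may help with large pages. The following Python 3 sketch shows that pattern; the helper name unpack and the chunk_size parameter are hypothetical and not part of the original handler.

```python
import zlib


def unpack(blob, chunk_size=4096):
    """Decompress a zlib blob incrementally and return the original bytes."""
    decomp = zlib.decompressobj()
    parts = []
    for start in range(0, len(blob), chunk_size):
        parts.append(decomp.decompress(blob[start:start + chunk_size]))
    # flush() returns any output still buffered inside the object
    parts.append(decomp.flush())
    return b"".join(parts)
```

For a blob that already sits fully in memory, a single zlib.decompress(blob) call gives the same result with less code.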
