Python HTTP Compression - yaoshiyan - ChinaUnix Blog

Python HTTP Compression

Category: Python/Ruby

2011-08-14 14:30:31

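A friend of mine is learning to scrape web pages with Python. He used urllib2, which ships with Python, and tested it by fetching the Sohu home page: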

import urllib2

resp = urllib2.urlopen('')
resp.read()

What came back was not HTML but compressed data. I recommended the httplib2 module to him, and the problem went away on its own.
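A quick way to confirm that the bytes really are compressed rather than HTML is to look at the first two bytes: a gzip stream always starts with 0x1f 0x8b. Below is a minimal sketch in Python 3 (the post itself uses Python 2). The helper name `looks_gzipped` and the sample page are made up for illustration.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def looks_gzipped(data):
    """Return True if `data` starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


# Hypothetical sample data standing in for a fetched page.
page = b"<html><body>hello</body></html>"
print(looks_gzipped(page))                 # plain HTML
print(looks_gzipped(gzip.compress(page)))  # gzip-compressed copy
```

This only detects gzip. A deflate body has no such fixed signature, so the Content-Encoding header is still the thing to trust.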

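I had a look at the httplib2 source. For compressed content, there is an internal function that decompresses it automatically (I added a few line breaks to make it easier to display; the function lives in httplib2's __init__.py):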

def _decompressContent(response, new_content):
    content = new_content
    try:
        encoding = response.get('content-encoding', None)
        if encoding in ['gzip', 'deflate']:
            if encoding == 'gzip':
                content = gzip.GzipFile(
                    fileobj=StringIO.StringIO(new_content)).read()
            if encoding == 'deflate':
                content = zlib.decompress(content)
            response['content-length'] = str(len(content))
            # Record the historical presence of the encoding in a way the won't
            # interfere.
            response['-content-encoding'] = response['content-encoding']
            del response['content-encoding']
    except IOError:
        content = ""
        raise FailedToDecompressContent(
            _("Content purported to be compressed "
              "with %s but failed to decompress.") % response.get('content-encoding'),
            response, content)
    return content

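Of the two arguments passed in, response is really a dictionary that holds the headers of the actual response, and new_content is the body read from the actual response. The function checks the content-encoding header to decide whether the content is compressed, and whether it was compressed with deflate or gzip. For deflate, it decompresses directly with zlib. For gzip, it first wraps the data in StringIO so that it looks like a file, then reads it through gzip.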
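The same logic can be sketched as a standalone function. This is my own Python 3 approximation and not httplib2's code: `decompress_body` is a made-up name, `io.BytesIO` stands in for `StringIO.StringIO`, and the headers are assumed to be a plain dict with lowercase keys. Unlike the original, it works on a copy of the headers and lets decompression errors propagate.

```python
import gzip
import io
import zlib


def decompress_body(headers, body):
    """Return (headers, body) with gzip/deflate content decoded.

    `headers` is assumed to be a plain dict with lowercase keys;
    `body` is the raw bytes read from the response.
    """
    headers = dict(headers)  # work on a copy, leave the caller's dict alone
    encoding = headers.get("content-encoding")
    if encoding == "gzip":
        # GzipFile wants a file-like object, so wrap the bytes in BytesIO
        body = gzip.GzipFile(fileobj=io.BytesIO(body)).read()
    elif encoding == "deflate":
        body = zlib.decompress(body)
    else:
        return headers, body  # not compressed (or unknown): pass through
    # The body length changed, so update the header to match
    headers["content-length"] = str(len(body))
    # Keep a record of the original encoding under a different key
    headers["-content-encoding"] = headers.pop("content-encoding")
    return headers, body
```

For example, `decompress_body({"content-encoding": "gzip"}, gzip.compress(b"<html></html>"))` gives back the original bytes along with the rewritten headers.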

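If you really have to read the page with urllib2.urlopen, you can follow httplib2's code to check for compression and decompress the content yourself. Otherwise, I would still recommend using httplib2.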
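For the do-it-yourself route, here is a rough sketch using Python 3's urllib.request, the successor to urllib2. The helper names `read_decoded` and `fetch` are made up for illustration, and I am assuming the server labels the body correctly in its Content-Encoding header.

```python
import gzip
import urllib.request
import zlib


def read_decoded(resp):
    """Read a response-like object and undo gzip/deflate if it was applied."""
    body = resp.read()
    encoding = (resp.headers.get("Content-Encoding") or "").lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        return zlib.decompress(body)
    return body  # not compressed, or an encoding this sketch does not handle


def fetch(url):
    """Hypothetical helper: fetch `url` and return the decoded body bytes."""
    req = urllib.request.Request(
        url, headers={"Accept-Encoding": "gzip, deflate"})
    with urllib.request.urlopen(req) as resp:
        return read_decoded(resp)
```

Keeping the decoding in `read_decoded` means it can be tried out on any object with a `read()` method and a `headers` mapping, without touching the network.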
