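chardet.detect often reports gb2312, and the page itself may declare charset="gb2312",
but the actual encoding is frequently GBK or GB18030. Both extend GB2312 (GB18030 is the widest of the three), so decoding with either one is safe:
txt = c.content.decode("gbk")
txt = c.content.decode("GB18030")
Example: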
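import chardet
import requests
from bs4 import BeautifulSoup

c = requests.get(url)
print(chardet.detect(c.content))   # often reports GB2312
txt = c.content.decode("GB18030")  # GB18030 also covers GBK and GB2312
# txt is already a text string, so no re-encoding or from_encoding is needed
soup = BeautifulSoup(txt, 'lxml')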
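The same idea can be wrapped in a small helper. This is only a sketch: the name decode_gb_aware and the GB_FAMILY set are made up for illustration and are not part of chardet or requests. It assumes you pass in whatever encoding name the detector (or the page's charset attribute) reported, and it widens any GB-family label to GB18030 before decoding.

```python
# Hypothetical helper; the names below are illustrative only.
GB_FAMILY = {"gb2312", "gbk", "gb18030"}

def decode_gb_aware(raw, detected):
    """Decode bytes, treating any GB-family label as GB18030."""
    name = (detected or "utf-8").lower()
    if name in GB_FAMILY:
        # GB18030 extends GBK, which extends GB2312, so it can decode
        # data labelled with either of the narrower names.
        name = "gb18030"
    return raw.decode(name, errors="replace")

# The character 镕 exists in GBK but not in GB2312,
# so a strict gb2312 decode of these bytes fails.
sample = u"朱镕基".encode("gbk")
try:
    sample.decode("gb2312")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

text = decode_gb_aware(sample, "GB2312")
```

With real pages, the second argument would probably be something like chardet.detect(c.content)["encoding"], which can be None when detection fails; the helper falls back to UTF-8 in that case, which is an assumption and may not suit every site.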