探测python中字符集模块chardet-laoliulaoliu-ChinaUnix博客

miraclemiracle.blog.chinaunix.net

首页　| 　博文目录　| 　关于我

laoliulaoliu

博客访问： 4608615
博文数量： 1214
博客积分： 13195
博客等级：上将
技术积分： 9105
用户组：普通用户
注册时间： 2007-01-19 14:41

个人简介

C++,python,热爱算法和机器学习

文章分类

全部博文（1214）

cloud（3）
operation（9）
tornado（4）
mac_os（1）
golang（4）
架构（13）
git（4）
security（29）
shell（1）
macbook（1）
ruby（13）
javascript（15）
design（3）
testing（1）
mac（1）
bigdata（69）
nosql（46）
R（9）
gcj/acm（6）
NLP（10）
小说（3）
matlab（4）
web（44）
java（66）
product（7）
c#（1）
language（4）
machine learning（76）
science（4）
opencourse（2）
windows（3）
search（33）
algorithm（65）
database（51）
compiler（11）
ACE（5）
poem（1）
programming（29）
python（140）
assembly（1）
linux（49）
C++（16）
book（2）
cate（1）
phliosophy（3）
mental（30）
Science fiction（1）
Software（5）
c（23）
network（65）
CS（15）
thinking（10）
BSD（13）
solaris10（2）
life（57）
Debian（16）
economy（7）
Mathematics（57）
OS（8）
ibm（2）
gentoo（32）
未分配的博文（8）

文章存档

2021年（13）

2020年（49）

2019年（14）

2018年（27）

2017年（69）

2016年（100）

2015年（106）

2014年（240）

2013年（5）

2012年（193）

2011年（155）

2010年（93）

2009年（62）

2008年（51）

2007年（37）

我的朋友

相关博文

探测python中字符集模块chardet

分类： Python/Ruby

2012-04-26 20:42:48

文章来源：http://blog.csdn.net/yatere/article/details/6654860

文档：

其实还有一个：

isinstance(s, str) 用来判断是否为一般字符串

isinstance(s, unicode) 用来判断是否为unicode

下面才是正文:

Basic usage

The easiest way to use the Universal Encoding Detector library is with the detect function.

Example: Using the detect function

The detect function takes one argument, a non-Unicode string. It returns a dictionary containing the auto-detected character encoding and a confidence level from 0 to 1.

[python] view plain copy

import urllib
rawdata = urllib.urlopen('').read()
mport chardet
print (chardet.detect(rawdata))
#{'encoding': 'EUC-JP', 'confidence': 0.99}

Advanced usage

If you’re dealing with a large amount of text, you can call the Universal Encoding Detector library incrementally, and it will stop as soon as it is confident enough to report its results.

Create a UniversalDetector object, then call its feed method repeatedly with each block of text. If the detector reaches a minimum threshold of confidence, it will set detector.done to True.

Once you’ve exhausted the source text, call detector.close(), which will do some final calculations in case the detector didn’t hit its minimum confidence threshold earlier. Then detector.result will be a dictionary containing the auto-detected character encoding and confidence level (the same as ).

Example: Detecting encoding incrementally

[python] view plain copy

import urllib
from chardet.universaldetector import UniversalDetector
usock = urllib.urlopen('')
detector = UniversalDetector()
for line in usock.readlines():
detector.feed(line)
if detector.done: break
detector.close()
usock.close()
print (detector.result)
#{'encoding': 'EUC-JP', 'confidence': 0.99}

If you want to detect the encoding of multiple texts (such as separate files), you can re-use a single UniversalDetector object. Just call detector.reset() at the start of each file, call detector.feed as many times as you like, and then call detector.close() and check the detector.result dictionary for the file’s results.

Example: Detecting encodings of multiple files

[python] view plain copy

import glob
from chardet.universaldetector import UniversalDetector
detector = UniversalDetector()
for filename in glob.glob('*.xml'):
print (filename.ljust(60))
detector.reset()
for line in file(filename, 'rb'):
detector.feed(line)
if detector.done: break
detector.close()
print (detector.result)

阅读(2019) | 评论(0) | 转发(0) |

上一篇：Python deque用法介绍

下一篇：Closures in Python

给主人留下些什么吧！~~

感谢所有关心和支持过ChinaUnix的朋友们

16024965号-6