Handling Chinese Characters in Python - chengxuyonghu - ChinaUnix Blog

Handling Chinese Characters in Python

Category: System Operations

2017-02-16 18:31:28

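In Python 2, an ordinary string literal is a byte string of type str, and the default codec for implicit conversions is ASCII. ASCII cannot represent Chinese characters, so processing them this way causes errors.

Python's internal text representation is Unicode. Prefixing a string literal with u declares a unicode string directly; for example, u'hello' is of type unicode.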
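To make the difference between text and encoded bytes concrete, here is a minimal sketch. It is written for Python 3 (an assumption on my part, since the article targets Python 2), where str plays the role of Python 2's unicode and bytes the role of Python 2's str.

```python
# Python 3 sketch: str is Unicode text, bytes is encoded data.
text = u'中文'                      # the u prefix is still accepted in Python 3.3+
utf8_bytes = text.encode('utf-8')   # encode the text into UTF-8 bytes

print(type(text).__name__)          # str
print(len(text))                    # 2 characters
print(type(utf8_bytes).__name__)    # bytes
print(len(utf8_bytes))              # 6 bytes, 3 per character in UTF-8
```

The same two characters have different lengths depending on whether you count characters or bytes, which is why the two types should not be mixed.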

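If a string contains non-ASCII characters, convert it to unicode to avoid errors. The relevant methods are:

decode(): converts a byte string in some other encoding to unicode. For example, str1.decode('gb2312') converts the gb2312-encoded string str1 to unicode.

encode(): converts a unicode string to a byte string in some other encoding. For example, str2.encode('gb2312') converts the unicode string str2 to gb2312.

unicode(): does the same as decode(). For example, unicode(str3, 'gb2312') converts the gb2312-encoded string str3 to unicode.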
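The three calls above can be sketched as a round trip. This is a Python 3 rendering (an assumption; the article's calls are Python 2), in which bytes.decode(), str.encode() and the str(bytes, encoding) constructor correspond to decode(), encode() and unicode().

```python
# Python 3 sketch of decode(), encode() and the constructor form, using gb2312.
gb_bytes = b'\xc4\xe3\xba\xc3'          # the word "你好" encoded as gb2312

text = gb_bytes.decode('gb2312')        # bytes -> Unicode text
same_text = str(gb_bytes, 'gb2312')     # constructor form, like unicode(str3, 'gb2312')
back = text.encode('gb2312')            # Unicode text -> bytes

print(text == '你好')                   # True
print(same_text == text)                # True
print(back == gb_bytes)                 # True
```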

When transcoding, first establish which encoding the string str is actually in, then decode it to unicode, and only then encode it to the target encoding.
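Why the source encoding has to be known first can be shown with a small Python 3 sketch (an illustration under that assumption, not code from the article). A wrong guess either fails loudly or, worse, succeeds and produces garbled text.

```python
# Python 3 sketch: what a wrong guess about the source encoding does.
original = '你好'
gbk_bytes = original.encode('gbk')
utf8_bytes = original.encode('utf-8')

# GBK bytes are not valid UTF-8, so this guess fails loudly.
try:
    gbk_bytes.decode('utf-8')
    wrong_guess_failed = False
except UnicodeDecodeError:
    wrong_guess_failed = True

# These UTF-8 bytes happen to be valid GBK, so this guess fails silently
# and yields different (garbled) text.
garbled = utf8_bytes.decode('gbk')

print(wrong_guess_failed)        # True
print(garbled == original)       # False
```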

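Also, calling decode() on a string that is already unicode fails when it contains non-ASCII characters, because Python 2 first tries to encode it implicitly with ASCII. So when you do not know what kind of string you have, check whether it is already unicode with isinstance(str, unicode) before decoding.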
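A guard of this kind might look like the sketch below. It is written for Python 3, and ensure_text is a hypothetical helper name, not something from the article; in Python 2 the check would be isinstance(value, unicode) instead.

```python
def ensure_text(value, encoding='utf-8'):
    """Return value as Unicode text, decoding only when it is still bytes."""
    if isinstance(value, str):      # already text (Python 2: isinstance(value, unicode))
        return value
    return value.decode(encoding)

print(ensure_text('中文') == '中文')                          # True, returned unchanged
print(ensure_text('中文'.encode('gbk'), 'gbk') == '中文')     # True, decoded from GBK
```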

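This applies beyond Chinese. Whenever you process a string containing non-ASCII characters, follow these steps:

1. Determine the encoding of the source string, say utf8.

2. Convert it to unicode with unicode() or decode(), for example str1.decode('utf8') or unicode(str1, 'utf8').

3. After processing, encode the string into the required format with encode().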
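The three steps can be sketched as one function. This is a Python 3 illustration under my own assumptions: truncate_text is a hypothetical helper, and taking the first few characters stands in for whatever processing is needed.

```python
def truncate_text(data, src_encoding, dst_encoding, max_chars):
    """Decode, keep the first max_chars characters, then encode again."""
    text = data.decode(src_encoding)        # steps 1 and 2: known source encoding -> Unicode
    shortened = text[:max_chars]            # process characters, not bytes
    return shortened.encode(dst_encoding)   # step 3: Unicode -> target encoding

utf8_data = '处理中文字符'.encode('utf-8')
result = truncate_text(utf8_data, 'utf-8', 'gbk', 2)

print(result == '处理'.encode('gbk'))        # True
print(len(result))                           # 4, two bytes per character in GBK
```

Doing the slicing on the decoded text matters: slicing the raw UTF-8 bytes at an arbitrary offset could cut a three-byte character in half.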

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2 script: list the files under srcPath whose content (by MD5)
# does not appear anywhere under destPath.

import os
import hashlib

# This source file is UTF-8, so these byte-string literals hold UTF-8 bytes.
destPath = r'h:\路径A\测试'
srcPath = r'h:\路径B\测试'
rstPath = r'h:\路径C\rst.txt'

#----------------------------------------------------------------------
def to_unicode(path):
    '''Decode a UTF-8 byte string to unicode; leave unicode untouched.'''
    if isinstance(path, unicode):
        return path
    return path.decode('utf8')

#----------------------------------------------------------------------
def find_all_files(path):
    '''Recursively collect the paths of all files under path, as unicode.'''
    path = to_unicode(path)
    fileslist = []
    # A unicode argument makes os.listdir() return unicode names.
    for ff in os.listdir(path):
        ffPath = os.path.join(path, ff)
        if os.path.isfile(ffPath):
            print ffPath, 'file'
            fileslist.append(ffPath)
        elif os.path.isdir(ffPath):
            print ffPath, 'dir'
            fileslist += find_all_files(ffPath)
        else:
            print 'parse error!', '\t', ffPath
    return fileslist

#----------------------------------------------------------------------
def md5_list(path):
    '''Map the MD5 hex digest of every file under path to that file's path.'''
    filesMd5 = {}
    for ff in find_all_files(path):
        try:
            m = hashlib.md5()
            # The with block closes the file even if reading fails.
            with open(ff, 'rb') as fp:
                while True:
                    strRead = fp.read(8096)
                    if not strRead:
                        break
                    m.update(strRead)
            filesMd5[m.hexdigest()] = ff
        except IOError as ex:
            print ex
    return filesMd5

if __name__ == '__main__':
    print 'Begin.......'
    srcFilesMd5 = md5_list(srcPath)
    destFilesMd5 = md5_list(destPath)
    rst = ''
    for key in srcFilesMd5:
        if key not in destFilesMd5:
            # File names are unicode; encode them to UTF-8 bytes before writing.
            rst += srcFilesMd5[key].encode('utf8') + '\n'
    fp = open(to_unicode(rstPath), 'w')
    fp.write(rst)
    fp.close()
    print '\nRun Over......'
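For comparison, the same idea might be sketched in Python 3 as follows. This is my own illustration, not the author's code: md5_by_digest and missing_in_dest are hypothetical names, and os.walk replaces the hand-written recursion. Because Python 3 paths are already Unicode str, no decode() call is needed for non-ASCII file names.

```python
import hashlib
import os

def md5_by_digest(root):
    """Map the MD5 hex digest of each file under root to its path."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)   # str, so no decode() is needed
            md5 = hashlib.md5()
            with open(full_path, 'rb') as fp:
                # Read in chunks so large files are not loaded into memory at once.
                for chunk in iter(lambda: fp.read(8192), b''):
                    md5.update(chunk)
            digests[md5.hexdigest()] = full_path
    return digests

def missing_in_dest(src_root, dest_root):
    """Return paths under src_root whose content has no match under dest_root."""
    src = md5_by_digest(src_root)
    dest = md5_by_digest(dest_root)
    return sorted(path for digest, path in src.items() if digest not in dest)
```

To save the result, opening the output file with an explicit encoding, for example open(result_path, 'w', encoding='utf-8'), lets the file object do the encoding step.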

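# In Python 2.x the default encoding is ASCII. Even if you declare UTF-8 as the source file encoding, UTF-8 cannot be converted to GBK directly; Unicode has to serve as the intermediate step.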

#!/usr/bin/env python
#-*- coding:utf-8 -*-
#author luotianshuai

import chardet
tim = '你好'

print chardet.detect(tim)

# Decode to Unicode first, then encode from Unicode to GBK

new_tim = tim.decode('UTF-8').encode('GBK')

print chardet.detect(new_tim)

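# Result (the second guess is wrong for these four GBK bytes, and its confidence is correspondingly low)
'''
{'confidence': 0.75249999999999995, 'encoding': 'utf-8'}
{'confidence': 0.35982121203616341, 'encoding': 'TIS-620'}
'''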
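One plausible reason for a guess like TIS-620 is that a four-byte sample is a valid byte sequence in several encodings at once. The sketch below uses only the Python 3 standard library rather than chardet, and decodes_as is a hypothetical helper; it shows which codecs accept the GBK bytes without error, not how chardet itself works.

```python
def decodes_as(data, encoding):
    """Return True if data is a valid byte sequence in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

gbk_bytes = '你好'.encode('gbk')      # b'\xc4\xe3\xba\xc3'

for name in ('utf-8', 'gbk', 'tis_620'):
    print(name, decodes_as(gbk_bytes, name))
```

The bytes are rejected as UTF-8 but accepted as both GBK and TIS-620, so a detector has very little evidence to choose between them. Detection is a guess; when the real encoding is known, pass it explicitly.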

# In Python 3, strings are Unicode by default, so no decode step is needed

#!/usr/bin/env python
#-*- coding:utf-8 -*-
#author luotianshuai

tim = '天帅'

# Encode as UTF-8
print(tim.encode('UTF-8'))

# Encode as GBK
print(tim.encode('GBK'))

# Encode as ASCII (raises UnicodeEncodeError, because the ASCII table does not contain the characters '天帅')
print(tim.encode('ASCII'))
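If the ASCII failure needs to be handled rather than allowed to stop the script, a Python 3 sketch along these lines is one option. The errors argument of str.encode() is standard; which policy suits a given program is a judgment call.

```python
tim = '天帅'

try:
    tim.encode('ascii')
    failed = False
except UnicodeEncodeError:
    failed = True                                    # ASCII has no code for these characters

replaced = tim.encode('ascii', errors='replace')     # each unencodable character becomes '?'
ignored = tim.encode('ascii', errors='ignore')       # unencodable characters are dropped

print(failed, replaced, ignored)                     # True b'??' b''
```

Both handlers lose information, so they are only appropriate when the target really must be ASCII.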

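Python has two very handy methods, decode() and encode().

decode('utf-8') converts from UTF-8 to unicode; the argument can of course also be 'gbk'.

encode('gbk') encodes unicode as GBK; the argument can of course also be 'utf-8'.

Suppose I know that a byte string s is encoded in UTF-8. How do I convert it to GBK?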

s.decode('utf-8').encode('gbk')
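The same decode-then-encode chain applies to whole files. In the Python 3 sketch below (my own illustration; convert_file and the file names are hypothetical), open() performs the decoding and encoding when given an encoding argument.

```python
import os
import tempfile

def convert_file(src_path, dst_path, src_encoding='utf-8', dst_encoding='gbk'):
    """Rewrite a text file from src_encoding to dst_encoding."""
    # newline='' keeps the original line endings untouched.
    with open(src_path, 'r', encoding=src_encoding, newline='') as src:
        text = src.read()                   # decoded to Unicode by open()
    with open(dst_path, 'w', encoding=dst_encoding, newline='') as dst:
        dst.write(text)                     # encoded to the target encoding by open()

# Usage with temporary files
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, 'utf8.txt')
    dst_path = os.path.join(tmp, 'gbk.txt')
    with open(src_path, 'wb') as f:
        f.write('你好\n'.encode('utf-8'))
    convert_file(src_path, dst_path)
    with open(dst_path, 'rb') as f:
        converted = f.read()

print(converted == '你好\n'.encode('gbk'))   # True
```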
