This function converts HTML entities and character references to ordinary characters. - runningdark - ChinaUnix Blog

Category: Python/Ruby

2012-03-14 14:34:37

From:

import re, htmlentitydefs

##
# Removes HTML or XML character references and entities from a text string.
#
# @param text The HTML (or XML) source text.
# @return The plain text, as a Unicode string, if necessary.

def unescape(text):
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    return re.sub(r"&#?\w+;", fixup, text)
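
The snippet above targets Python 2 (`htmlentitydefs` and `unichr` no longer exist in Python 3). As a rough sketch of how the same idea might look on Python 3, assuming `html.entities.name2codepoint` and `chr` as the replacements, something like the following should behave the same way. The name `unescape_py3` is made up for this example and is not part of the original code.

```python
import re
from html.entities import name2codepoint

# Matches numeric references (&#65; or &#x41;) and named entities (&amp;).
_ENTITY_RE = re.compile(r"&#?\w+;")


def unescape_py3(text):
    """Hypothetical Python 3 port of unescape(); returns a str."""

    def fixup(match):
        token = match.group(0)
        if token[:2] == "&#":
            # Numeric character reference, hexadecimal or decimal.
            try:
                if token[:3] == "&#x":
                    return chr(int(token[3:-1], 16))
                return chr(int(token[2:-1]))
            except (ValueError, OverflowError):
                pass  # malformed or out-of-range number
        else:
            # Named entity such as &amp; or &lt;.
            try:
                return chr(name2codepoint[token[1:-1]])
            except KeyError:
                pass  # unknown entity name
        return token  # leave as is

    return _ENTITY_RE.sub(fixup, text)
```

For example, `unescape_py3("&lt;p&gt; &amp; &#65;&#x42; &bogus;")` gives `<p> & AB &bogus;`: unknown names are left untouched, as in the original. On Python 3.4 and later the standard library also provides `html.unescape`, which is probably the simpler choice when a hand-rolled function is not required.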
