The Bored One -- nothing but tech, and more tech, you know
Category: Python/Ruby
2011-08-17 19:02:31
7.2. Case Study: Street Addresses

This series of examples was inspired by a real-life
problem I had in my day job several years ago, when I needed to scrub and
standardize street addresses exported from a legacy system before importing
them into a newer system. (See, I don't just make this stuff up; it's actually
useful.) This example shows how I approached the problem.

Example 7.1. Matching at the End of a String

My goal is to standardize a street
address so that 'ROAD' is always abbreviated as 'RD.'. At
first glance, I thought this was simple enough that I could just use the
string method replace. After all, all the data was already uppercase, so
case mismatches would not be a problem. And the search string, 'ROAD',
was a constant. And in this deceptively simple
example, s.replace does indeed work.

Life, unfortunately, is full of counterexamples,
and I quickly discovered this one. The problem here is
that 'ROAD' appears twice in the address, once as part of the
street name 'BROAD' and once as its own word.
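The failure is easy to reproduce in a short sketch. The two sample addresses below are assumptions made up for illustration; they are not data from the legacy system.

```python
# Illustrative addresses (assumed for this sketch, not real data).
simple = '100 NORTH MAIN ROAD'
tricky = '100 NORTH BROAD ROAD'

# str.replace swaps every occurrence of the substring, wherever it appears.
print(simple.replace('ROAD', 'RD.'))   # 100 NORTH MAIN RD.
print(tricky.replace('ROAD', 'RD.'))   # 100 NORTH BRD. RD.
```

The first address comes out as intended; in the second, the street name BROAD is mangled along with the designation.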
The replace method sees these two occurrences and blindly replaces
both of them; meanwhile, I see my addresses getting destroyed.

To solve the problem of addresses with
more than one 'ROAD' substring, you could resort to something like
this: only search and replace 'ROAD' in the last four characters of
the address (s[-4:]), and leave the rest of the string alone (s[:-4]). But you can see
that this is already getting unwieldy. For example, the pattern is dependent
on the length of the string you're replacing (if you were
replacing 'STREET' with 'ST.', you would need to
use s[:-6] and s[-6:].replace(...)). Would you like to come
back in six months and debug this? I know I wouldn't.

It's time to move up to regular
expressions. In Python, all functionality related to regular expressions
is contained in the re module.

Take a look at the first
parameter passed to re.sub: 'ROAD$'. This is a simple regular expression that
matches 'ROAD' only when it occurs at the end of a string.
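As a minimal sketch, the slicing workaround and the 'ROAD$' pattern can be run side by side. The address is an assumption chosen for illustration; re.sub is the substitution function from the standard re module.

```python
import re

s = '100 NORTH BROAD ROAD'  # illustrative address, assumed for this sketch

# Slicing workaround: only touch the last four characters of the string.
sliced = s[:-4] + s[-4:].replace('ROAD', 'RD.')

# Regular expression: 'ROAD$' matches ROAD only at the end of the string.
anchored = re.sub('ROAD$', 'RD.', s)

print(sliced)    # 100 NORTH BROAD RD.
print(anchored)  # 100 NORTH BROAD RD.
```

Both give the same result here, but the regular expression does not need to know how long the search string is.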
The $ means “end of the string”. (There is a corresponding
character, the caret ^, which means “beginning of the string”.)

Using the re.sub function,
you search the string s for the regular
expression 'ROAD$' and replace it with 'RD.'. This matches
the ROAD at the end of the string s, but does not match
the ROAD that's part of the word BROAD, because that's in the
middle of s.

Continuing with my story of scrubbing addresses, I soon
discovered that the previous example, matching 'ROAD' at the end of the
address, was not good enough, because not all addresses included a street
designation; some just ended with the street name. Most of the time, I got
away with it, but if the street name was 'BROAD', then the regular expression
would match 'ROAD' at the end of the string as part of the word 'BROAD',
which is not what I wanted.

Example 7.2. Matching Whole Words

What I really wanted
was to match 'ROAD' when it was at the end of the string and it
was its own whole word, not a part of some larger word. To express this in a
regular expression, you use \b, which means “a word boundary must occur
right here”. In Python, this is complicated by the fact that
the '\' character in a string must itself be escaped. This is
sometimes referred to as the backslash plague, and it is one reason why
regular expressions are easier in Perl than in Python. On the
down side, Perl mixes regular expressions with other syntax, so if you
have a bug, it may be hard to tell whether it's a bug in syntax or a bug in
your regular expression.

To work around the backslash plague,
you can use what is called a raw string, by prefixing the string with the
letter r. This tells Python that nothing in this string should
be escaped; '\t' is a tab character, but r'\t' is really
the backslash character \ followed by the letter t. I
recommend always using raw strings when dealing with regular expressions;
otherwise, things get too confusing too quickly (and regular expressions get
confusing quickly enough all by themselves).

*sigh* Unfortunately, I soon found more cases that
contradicted my logic. In this case, the street address contained the word 'ROAD' as
a whole word by itself, but it wasn't at the end, because the address had an
apartment number after the street designation. Because 'ROAD' isn't
at the very end of the string, it doesn't match, so the entire call to re.sub ends
up replacing nothing at all, and you get the original string back, which is
not what you want.

To solve this problem, I removed
the $ character and added another \b. Now the regular
expression reads “match 'ROAD' when it's a whole word by itself
anywhere in the string,” whether at the end, the beginning, or somewhere in
the middle.
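The progression from 'ROAD$' to r'\bROAD$' to r'\bROAD\b' can be sketched as follows. The addresses are assumptions picked to trigger each case; they are not taken from the original data.

```python
import re

# Illustrative addresses, assumed for this sketch.
plain = '100 BROAD'
apartment = '100 BROAD ROAD APT. 3'

# 'ROAD$' alone still matches inside BROAD when BROAD ends the string.
print(re.sub('ROAD$', 'RD.', plain))          # 100 BRD.

# r'\bROAD$' requires a word boundary first, so BROAD is left alone...
print(re.sub(r'\bROAD$', 'RD.', plain))       # 100 BROAD
# ...but it misses ROAD when an apartment number follows it.
print(re.sub(r'\bROAD$', 'RD.', apartment))   # 100 BROAD ROAD APT. 3

# r'\bROAD\b' matches ROAD as a whole word anywhere in the string.
print(re.sub(r'\bROAD\b', 'RD.', apartment))  # 100 BROAD RD. APT. 3

# Without the r prefix, each backslash has to be doubled to get the same pattern.
same = re.sub('\\bROAD\\b', 'RD.', apartment) == re.sub(r'\bROAD\b', 'RD.', apartment)
print(same)                                   # True
```

The last comparison shows why the raw-string form is easier to read: the non-raw pattern needs '\\b' wherever the raw one has \b.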