无聊之人 -- nothing but technology, and more technology, you know
Category: Python/Ruby
2011-08-19 18:20:11
7.4. Using the {n,m} Syntax
In the previous section, you were dealing with a pattern where the same character could be repeated up to three times. There is another way to express this in regular expressions, which some people find more readable. First, look at the method already used in the previous example.
Example 7.5. The Old Way: Every Character Optional
With 'M', this matches the start of the string, and then the first optional M, but not the second and third M (but that's okay because they're optional), and then the end of the string.
With 'MM', this matches the start of the string, and then the first and second optional M, but not the third M (but that's okay because it's optional), and then the end of the string.
With 'MMM', this matches the start of the string, and then all three optional M, and then the end of the string.
With 'MMMM', this matches the start of the string, and then all three optional M, but then does not match the end of the string (because there is still one unmatched M), so the pattern does not match and the search returns None.
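The listing for this example is not reproduced here, so the following is only a minimal sketch of the four cases described above. It assumes the pattern from the previous section was '^M?M?M?$'; the names OLD_PATTERN and matches are illustrative and not part of the original text.

```python
import re

# Assumption: the "old way" pattern is three individually optional M
# characters, anchored at the start (^) and end ($) of the string.
OLD_PATTERN = '^M?M?M?$'


def matches(pattern, numeral):
    """Return True if re.search finds a match, False if it returns None."""
    return re.search(pattern, numeral) is not None


# One, two, and three M characters match; a fourth M is left over
# before the end of the string, so the last search fails.
for numeral in ('M', 'MM', 'MMM', 'MMMM'):
    print(numeral, matches(OLD_PATTERN, numeral))
```

Wrapping re.search in a helper that returns a boolean just makes the output easier to read than the raw match objects.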
Example 7.6. The New Way: From n to m
This pattern, written with M{0,3}, says: “Match the start of the string, then anywhere from zero to three M characters, then the end of the string.” The 0 and 3 can be any numbers; if you want to match at least one but no more than three M characters, you could say M{1,3}.
With 'M', this matches the start of the string, then one M out of a possible three, then the end of the string.
With 'MM', this matches the start of the string, then two M out of a possible three, then the end of the string.
With 'MMM', this matches the start of the string, then three M out of a possible three, then the end of the string.
With 'MMMM', this matches the start of the string, then three M out of a possible three, but then does not match the end of the string. The regular expression allows for at most three M characters before the end of the string, but you have four, so the pattern does not match and the search returns None.
Python's re module gives you no direct way to determine that two regular expressions are equivalent. In practice, the best you can do is write a lot of test cases to make sure they behave the same way on all relevant inputs. You'll learn more about writing test cases later in this book.
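As a rough illustration of that advice, the sketch below runs the old and new forms over the same handful of inputs and checks that they agree. The exact pattern strings ('^M?M?M?$' and '^M{0,3}$') and the helper name matches are assumptions reconstructed from the descriptions above, and a few samples are of course not a proof of equivalence.

```python
import re

OLD_PATTERN = '^M?M?M?$'   # assumed: every character optional
NEW_PATTERN = '^M{0,3}$'   # assumed: zero to three M characters


def matches(pattern, text):
    """Return True if re.search finds a match, False if it returns None."""
    return re.search(pattern, text) is not None


# A few relevant inputs, including ones that should not match.
samples = ['', 'M', 'MM', 'MMM', 'MMMM', 'MX', 'XM']

for text in samples:
    old = matches(OLD_PATTERN, text)
    new = matches(NEW_PATTERN, text)
    print(repr(text), old, new)
    assert old == new, 'patterns disagree on %r' % text

# M{1,3} requires at least one M, so the empty string no longer matches.
print(matches('^M{1,3}$', ''), matches('^M{1,3}$', 'MMM'))
```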
7.4.1. Checking for Tens and Ones
Now let's expand the Roman numeral regular expression to cover the tens and ones places. This example shows the check for tens.
Example 7.7. Checking for Tens
With 'MCMXL', this matches the start of the string, then the first optional M, then CM, then XL, then the end of the string. Remember, the (A|B|C) syntax means “match exactly one of A, B, or C”. You match XL, so you ignore the XC and L?X?X?X? choices, and then move on to the end of the string. MCMXL is the Roman numeral representation of 1940.
With 'MCML', this matches the start of the string, then the first optional M, then CM, then L?X?X?X?. Of the L?X?X?X?, it matches the L and skips all three optional X characters. Then you move to the end of the string. MCML is the Roman numeral representation of 1950.
With 'MCMLX', this matches the start of the string, then the first optional M, then CM, then the optional L and the first optional X, skips the second and third optional X, then the end of the string. MCMLX is the Roman numeral representation of 1960.
With 'MCMLXXX', this matches the start of the string, then the first optional M, then CM, then the optional L and all three optional X characters, then the end of the string. MCMLXXX is the Roman numeral representation of 1980.
With 'MCMLXXXX', this matches the start of the string, then the first optional M, then CM, then the optional L and all three optional X characters, then fails to match the end of the string because there is still one more X unaccounted for. So the entire pattern fails to match, and the search returns None. MCMLXXXX is not a valid Roman numeral.
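The listing for Example 7.7 is missing from this copy. A minimal sketch of the five cases described above might look like the following; it assumes the tens-checking pattern is the full pattern shown later in this section with the ones group left off, and the names TENS_PATTERN and matches are illustrative only.

```python
import re

# Assumption: thousands and hundreds as before, plus a tens group that is
# exactly one of XC (90), XL (40), or an optional L followed by up to
# three optional X characters.
TENS_PATTERN = '^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)$'


def matches(pattern, numeral):
    """Return True if re.search finds a match, False if it returns None."""
    return re.search(pattern, numeral) is not None


# 1940, 1950, 1960, and 1980 are valid; MCMLXXXX has one X too many.
for numeral in ('MCMXL', 'MCML', 'MCMLX', 'MCMLXXX', 'MCMLXXXX'):
    print(numeral, matches(TENS_PATTERN, numeral))
```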
The expression for the ones place follows the same pattern. I'll spare you the details and show you the end result.
>>> pattern = '^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'
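To see the ones group at work, here is a small sketch that tries this pattern on a few numerals. The sample inputs and the helper name matches are chosen for illustration and do not come from the original text.

```python
import re

# The full pattern from the line above: thousands, hundreds, tens, ones.
FULL_PATTERN = '^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'


def matches(pattern, numeral):
    """Return True if re.search finds a match, False if it returns None."""
    return re.search(pattern, numeral) is not None


# The ones group accepts IX (9), IV (4), or an optional V followed by up
# to three optional I characters, so a fourth I is rejected.
for numeral in ('MCMXCIV', 'MMXIV', 'VIII', 'IIII', 'MCMXCIIII'):
    print(numeral, matches(FULL_PATTERN, numeral))
```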
So what does that look like using this alternate {n,m} syntax? This example shows the new syntax.
Example 7.8. Validating Roman Numerals with {n,m}
With 'MDLV', this matches the start of the string, then one of a possible four M characters, then D?C{0,3}. Of that, it matches the optional D and zero of three possible C characters. Moving on, it matches L?X{0,3} by matching the optional L and zero of three possible X characters. Then it matches V?I{0,3} by matching the optional V and zero of three possible I characters, and finally the end of the string. MDLV is the Roman numeral representation of 1555.
With 'MMDCLXVI', this matches the start of the string, then two of a possible four M characters, then the D?C{0,3} with a D and one of three possible C characters; then L?X{0,3} with an L and one of three possible X characters; then V?I{0,3} with a V and one of three possible I characters; then the end of the string. MMDCLXVI is the Roman numeral representation of 2666.
With 'MMMMDCCCLXXXVIII', this matches the start of the string, then four out of four M characters, then D?C{0,3} with a D and three out of three C characters; then L?X{0,3} with an L and three out of three X characters; then V?I{0,3} with a V and three out of three I characters; then the end of the string. MMMMDCCCLXXXVIII is the Roman numeral representation of 4888, and it's the longest Roman numeral you can write without extended syntax.
Watch closely. (I feel like a magician. “Watch closely, kids, I'm going to pull a rabbit out of my hat.”) With 'I', this matches the start of the string, then zero out of four M, then matches D?C{0,3} by skipping the optional D and matching zero out of three C, then matches L?X{0,3} by skipping the optional L and matching zero out of three X, then matches V?I{0,3} by skipping the optional V and matching one out of three I. Then the end of the string. Whoa.
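The listing for Example 7.8 is also missing here. The sketch below assumes the {n,m} pattern is assembled from the pieces named above (M{0,4}, D?C{0,3}, L?X{0,3}, V?I{0,3}) with the same CM/CD, XC/XL, and IX/IV alternatives as the earlier pattern. The to_roman helper is hypothetical, added only to generate test inputs for comparing the two forms, in the spirit of the earlier advice about test cases.

```python
import re

# The every-character-optional form shown earlier in this section.
OLD_PATTERN = '^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'
# Assumed {n,m} form, built from the pieces the text describes.
NEW_PATTERN = '^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'

# Value/symbol pairs from largest to smallest, including subtractive forms.
ROMAN_DIGITS = (
    (1000, 'M'), (900, 'CM'), (500, 'D'), (400, 'CD'),
    (100, 'C'), (90, 'XC'), (50, 'L'), (40, 'XL'),
    (10, 'X'), (9, 'IX'), (5, 'V'), (4, 'IV'), (1, 'I'),
)


def matches(pattern, numeral):
    """Return True if re.search finds a match, False if it returns None."""
    return re.search(pattern, numeral) is not None


def to_roman(n):
    """Hypothetical helper: convert an integer in 1..4999 to a Roman numeral."""
    parts = []
    for value, symbol in ROMAN_DIGITS:
        while n >= value:
            parts.append(symbol)
            n -= value
    return ''.join(parts)


# The four inputs walked through above.
for numeral in ('MDLV', 'MMDCLXVI', 'MMMMDCCCLXXXVIII', 'I'):
    print(numeral, matches(NEW_PATTERN, numeral))

# Both forms should accept every numeral from 1 to 4999 ...
for n in range(1, 5000):
    numeral = to_roman(n)
    assert matches(OLD_PATTERN, numeral) and matches(NEW_PATTERN, numeral)

# ... and agree on a few strings that are not valid numerals.
for text in ('MMMMM', 'IIII', 'VX', 'MCMLXXXX'):
    assert matches(OLD_PATTERN, text) == matches(NEW_PATTERN, text)
    print(text, matches(NEW_PATTERN, text))
```

Agreement on these inputs does not prove the two patterns are equivalent, but it covers every valid numeral in the range the patterns are meant to accept.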
If you followed all that and understood it on the first try, you're doing better than I did. Now imagine trying to understand someone else's regular expressions, in the middle of a critical function of a large program. Or even imagine coming back to your own regular expressions a few months later. I've done it, and it's not a pretty sight.
In the next section you'll explore an alternate syntax that can help keep your expressions maintainable.