Category: Python/Ruby

2011-08-18 19:58:28

7.3. Case Study: Roman Numerals

You've most likely seen Roman numerals, even if you didn't recognize them. You may have seen them in copyrights of old movies and television shows (“Copyright MCMXLVI” instead of “Copyright 1946”), or on the dedication walls of libraries or universities (“established MDCCCLXXXVIII” instead of “established 1888”). You may also have seen them in outlines and bibliographical references. It's a system of representing numbers that really does date back to the ancient Roman empire (hence the name).

In Roman numerals, there are seven characters that are repeated and combined in various ways to represent numbers.

  • I = 1
  • V = 5
  • X = 10
  • L = 50
  • C = 100
  • D = 500
  • M = 1000

The following are some general rules for constructing Roman numerals:

  • Characters are additive. I is 1, II is 2, and III is 3. VI is 6 (literally, “5 and 1”), VII is 7, and VIII is 8.
  • The tens characters (I, X, C, and M) can be repeated up to three times. At 4, you need to subtract from the next highest fives character. You can't represent 4 as IIII; instead, it is represented as IV (“1 less than 5”). The number 40 is written as XL (10 less than 50), 41 as XLI, 42 as XLII, 43 as XLIII, and then 44 as XLIV (10 less than 50, then 1 less than 5).
  • Similarly, at 9, you need to subtract from the next highest tens character: 8 is VIII, but 9 is IX (1 less than 10), not VIIII (since the I character can not be repeated four times). The number 90 is XC; 900 is CM.
  • The fives characters can not be repeated. The number 10 is always represented as X, never as VV. The number 100 is always C, never LL.
  • Roman numerals are always written highest to lowest, and read left to right, so the order of the characters matters very much. DC is 600; CD is a completely different number (400, which is 100 less than 500). CI is 101; IC is not even a valid Roman numeral (because you can't subtract 1 directly from 100; you would need to write it as XCIX, for 10 less than 100, then 1 less than 10).
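The rules above can be sketched as a small conversion routine. This is only an illustration, not code from the original text: the names ROMAN_TABLE and to_roman are made up for this example, and the 1..3999 range is an assumption that follows from M being repeatable at most three times.

```python
# Illustrative sketch only: ROMAN_TABLE and to_roman are hypothetical names.
# Each subtractive pair (CM, CD, XC, XL, IX, IV) is listed explicitly,
# so the loop never needs to "subtract" anything itself.
ROMAN_TABLE = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]

def to_roman(n):
    """Convert an integer in 1..3999 to a Roman numeral string."""
    if not 0 < n < 4000:
        raise ValueError("number out of range (must be 1..3999)")
    result = []
    # Walk from the highest value to the lowest, as the rules require.
    for value, numeral in ROMAN_TABLE:
        while n >= value:
            result.append(numeral)
            n -= value
    return "".join(result)

print(to_roman(1946))  # MCMXLVI
print(to_roman(44))    # XLIV
```

Because the table is ordered highest to lowest, the output is always written highest to lowest as well, which is the property the validation patterns below rely on.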

7.3.1. Checking for Thousands

What would it take to validate that an arbitrary string is a valid Roman numeral? Let's take it one digit at a time. Since Roman numerals are always written highest to lowest, let's start with the highest: the thousands place. For numbers 1000 and higher, the thousands are represented by a series of M characters.

Example 7.3. Checking for Thousands

  >>> import re
  >>> pattern = '^M?M?M?$'         # (1)
  >>> re.search(pattern, 'M')      # (2)
  <SRE_Match object at 0106FB58>
  >>> re.search(pattern, 'MM')     # (3)
  <SRE_Match object at 0106C290>
  >>> re.search(pattern, 'MMM')    # (4)
  <SRE_Match object at 0106AA38>
  >>> re.search(pattern, 'MMMM')   # (5)
  >>> re.search(pattern, '')       # (6)
  <SRE_Match object at 0106F4A8>

1

This pattern has three parts:

  • ^ to match what follows only at the beginning of the string. If this were not specified, the pattern would match no matter where the M characters were, which is not what you want. You want to make sure that the M characters, if they're there, are at the beginning of the string.
  • M? to optionally match a single M character. Since this is repeated three times, you're matching anywhere from zero to three M characters in a row.
  • $ to match what precedes only at the end of the string. When combined with the ^ character at the beginning, this means that the pattern must match the entire string, with no other characters before or after the M characters.
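To see why both anchors matter, it may help to compare the pattern with and without them. The sketch below is an illustration written for Python 3; the variable names unanchored and anchored are made up for this example.

```python
import re

unanchored = 'M?M?M?'      # same pattern with ^ and $ removed
anchored = '^M?M?M?$'

# Without anchors, search succeeds on any string at all, because
# "zero M characters" is an acceptable match at position 0.
print(repr(re.search(unanchored, 'XYZ').group()))   # ''

# Without $, a fourth M is simply left over and the match still succeeds.
print(re.search(unanchored, 'MMMM').group())        # MMM

# With both anchors, the whole string has to be zero to three M characters.
print(re.search(anchored, 'XYZ'))                   # None
print(re.search(anchored, 'MMMM'))                  # None
```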

2

The essence of the re module is the search function, which takes a regular expression (pattern) and a string ('M') to try to match against the regular expression. If a match is found, search returns an object which has various methods to describe the match; if no match is found, search returns None, the Python null value. All you care about at the moment is whether the pattern matches, which you can tell by just looking at the return value of search. 'M' matches this regular expression, because the first optional M matches and the second and third optional M characters are ignored.

3

'MM' matches because the first and second optional M characters match and the third M is ignored.

4

'MMM' matches because all three M characters match.

5

'MMMM' does not match. All three M characters match, but then the regular expression insists on the string ending (because of the $ character), and the string doesn't end yet (because of the fourth M). So search returns None.

6

Interestingly, an empty string also matches this regular expression, since all the M characters are optional.
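Since only the presence or absence of a match matters here, the check can be wrapped in a function that returns a plain boolean. This is a minimal sketch for Python 3; THOUSANDS and is_valid_thousands are hypothetical names, not part of the original example.

```python
import re

THOUSANDS = '^M?M?M?$'   # the pattern from Example 7.3

def is_valid_thousands(s):
    """Return True if s consists of zero to three M characters and nothing else."""
    # re.search returns a match object on success and None on failure.
    return re.search(THOUSANDS, s) is not None

for s in ['M', 'MM', 'MMM', 'MMMM', '']:
    print(repr(s), is_valid_thousands(s))
```

Note that the empty string is reported as valid, exactly as in the interactive session; a real validator would probably want to reject it separately.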

7.3.2. Checking for Hundreds

The hundreds place is more difficult than the thousands, because there are several mutually exclusive ways it could be expressed, depending on its value.

  • 100 = C
  • 200 = CC
  • 300 = CCC
  • 400 = CD
  • 500 = D
  • 600 = DC
  • 700 = DCC
  • 800 = DCCC
  • 900 = CM

So there are four possible patterns:

  • CM
  • CD
  • Zero to three C characters (zero if the hundreds place is 0)
  • D, followed by zero to three C characters

The last two patterns can be combined:

  • an optional D, followed by zero to three C characters
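The three remaining alternatives (CM, CD, and an optional D followed by zero to three C characters) can be checked against the table of hundreds values. The sketch below tests the hundreds group in isolation, which is an assumption made for illustration; HUNDREDS and hundreds_numerals are made-up names.

```python
import re

# Hundreds group on its own, anchored so it must match the whole string.
HUNDREDS = '^(CM|CD|D?C?C?C?)$'

# The table of hundreds values from the text, plus 0 (no characters at all).
hundreds_numerals = {
    0: '', 100: 'C', 200: 'CC', 300: 'CCC', 400: 'CD',
    500: 'D', 600: 'DC', 700: 'DCC', 800: 'DCCC', 900: 'CM',
}

for value, numeral in hundreds_numerals.items():
    matched = re.search(HUNDREDS, numeral) is not None
    print(value, repr(numeral), matched)

# A few strings that break the rules and should be rejected.
for bad in ['CCCC', 'DD', 'CDC', 'DCM']:
    print(repr(bad), re.search(HUNDREDS, bad) is not None)
```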

This example shows how to validate the hundreds place of a Roman numeral.

Example 7.4. Checking for Hundreds

  >>> import re
  >>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)$'   # (1)
  >>> re.search(pattern, 'MCM')              # (2)
  <SRE_Match object at 01070390>
  >>> re.search(pattern, 'MD')               # (3)
  <SRE_Match object at 01073A50>
  >>> re.search(pattern, 'MMMCCC')           # (4)
  <SRE_Match object at 010748A8>
  >>> re.search(pattern, 'MCMC')             # (5)
  >>> re.search(pattern, '')                 # (6)
  <SRE_Match object at 01071D98>

1

This pattern starts out the same as the previous one, checking for the beginning of the string (^), then the thousands place (M?M?M?). Then it has the new part, in parentheses, which defines a set of three mutually exclusive patterns, separated by vertical bars: CM, CD, and D?C?C?C? (which is an optional D followed by zero to three optional C characters). The regular expression parser checks for each of these patterns in order (from left to right) and takes the first one that matches, falling back to a later alternative only if the rest of the pattern then fails to match.

2

'MCM' matches because the first M matches, the second and third M characters are ignored, and the CM matches (so the CD and D?C?C?C? patterns are never even considered). MCM is the Roman numeral representation of 1900.

3

'MD' matches because the first M matches, the second and third M characters are ignored, and the D?C?C?C? pattern matches D (each of the three C characters is optional and is ignored). MD is the Roman numeral representation of 1500.

4

'MMMCCC' matches because all three M characters match, and the D?C?C?C? pattern matches CCC (the D is optional and is ignored). MMMCCC is the Roman numeral representation of 3300.

5

'MCMC' does not match. The first M matches, the second and third M characters are ignored, and the CM matches, but then the $ does not match because you're not at the end of the string yet (you still have an unmatched C character). The C does not match as part of the D?C?C?C? pattern, because the mutually exclusive CM pattern has already matched.

6

Interestingly, an empty string still matches this pattern, because all the M characters are optional and ignored, and the empty string matches the D?C?C?C? pattern where all the characters are optional and ignored.
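One way to see which part of the string the parenthesized alternatives consumed is to ask the match object for group 1. This is a sketch for Python 3, and the helper name hundreds_part is made up for this illustration.

```python
import re

pattern = '^M?M?M?(CM|CD|D?C?C?C?)$'   # the pattern from Example 7.4

def hundreds_part(s):
    """Return the text matched by the hundreds group, or None if s does not match."""
    match = re.search(pattern, s)
    # group(1) is whatever the first parenthesized group matched.
    return match.group(1) if match else None

print(hundreds_part('MCM'))     # CM
print(hundreds_part('MD'))      # D
print(hundreds_part('MMMCCC'))  # CCC
print(hundreds_part('MCMC'))    # None
print(repr(hundreds_part('')))  # ''
```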

Whew! See how quickly regular expressions can get nasty? And you've only covered the thousands and hundreds places of Roman numerals. But if you followed all that, the tens and ones places are easy, because they're exactly the same pattern. But let's look at another way to express the pattern.
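If the tens and ones places really do follow the same shape as the hundreds, the full pattern might look like the sketch below. The tens and ones groups here are an extrapolation from the hundreds group (substituting X, L, C and I, V, X for C, D, M), not something spelled out in the text above, and FULL and is_roman are hypothetical names.

```python
import re

# Assumed extension: one group per place, each built like the hundreds group.
FULL = ('^M?M?M?'              # thousands: 0 to 3 M characters
        '(CM|CD|D?C?C?C?)'     # hundreds: 900, 400, or 0-300 / 500-800
        '(XC|XL|L?X?X?X?)'     # tens: 90, 40, or 0-30 / 50-80
        '(IX|IV|V?I?I?I?)$')   # ones: 9, 4, or 0-3 / 5-8

def is_roman(s):
    """Return True if s matches the assumed full pattern."""
    return re.search(FULL, s) is not None

for s in ['MCMXLVI', 'MDCCCLXXXVIII', 'XLIV', 'XCIX']:
    print(s, is_roman(s))      # all True

for s in ['IC', 'IIII', 'VV', 'MMMM']:
    print(s, is_roman(s))      # all False
```

As with the shorter patterns, the empty string still matches, since every piece is optional.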

 
