Category: Python/Ruby

2012-03-22 16:20:46

Reference:

The Regexp class:
1. Create a regexp with a /.../ or %r{...} literal, or with Regexp.new

  /hay/ =~ 'haystack' #=> 0   # returns the index where the match starts, or nil
  /y/.match('haystack') #=> #<MatchData "y">   # returns a MatchData object, or nil
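The three construction styles and the two matching calls can be compared side by side. This is only a small sketch; the variable names are made up for illustration.

```ruby
# Three ways to build the same pattern.
literal   = /hay/
percent_r = %r{hay}
from_new  = Regexp.new('hay')

# Regexp#== compares the pattern source and options.
same = (literal == percent_r) && (percent_r == from_new)

position = literal =~ 'haystack'       # index of the match, or nil
match    = literal.match('haystack')   # MatchData object, or nil
no_match = /needle/ =~ 'haystack'      # nil, because nothing matches

puts same, position, match[0], no_match.inspect
```

=~ is handy when only the position (or a true/false answer) is needed; match is the one to use when the captured text is needed afterwards.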

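2. The metacharacters are ( ) [ ] { } . ? + * and each must be escaped with a backslash to be matched literally.

3. A pattern behaves like a double-quoted string, so it can contain escape sequences:

  /\s\u{6771 4eac 90fd}/.match("Go to 東京都")
      #=> #<MatchData " 東京都">
It can also embed #{...} interpolation:

  place = "東京都"
  /#{place}/.match("Go to 東京都")
      #=> #<MatchData "東京都">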
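One thing to watch with #{...}: the interpolated text is treated as pattern source, so any metacharacters in it keep their special meaning. A minimal sketch, assuming a made-up price string, of how Regexp.escape avoids that:

```ruby
# `price` is a hypothetical value used only for illustration.
price = "3.67"

loose  = /#{price}/                  # the "." still means "any character"
strict = /#{Regexp.escape(price)}/   # the "." must now match a literal dot

loose_on_typo  = loose  =~ "3x67"       # matches: "x" satisfies the bare "."
strict_on_typo = strict =~ "3x67"       # no match
strict_on_real = strict =~ "cost 3.67"  # matches at the "3"

puts loose_on_typo.inspect, strict_on_typo.inspect, strict_on_real.inspect
```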
4. Character classes such as [0-9a-f] support the && operator, which takes the intersection of two sets:

  /[a-w&&[^c-g]z]/ # ([a-w] AND ([^c-g] OR z))
  # is equivalent to
  /[abh-w]/
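The equivalence can be checked by running every lowercase letter through both classes. This is just a sanity-check sketch, not part of the original notes:

```ruby
intersect = /[a-w&&[^c-g]z]/
simple    = /[abh-w]/

letters = ('a'..'z').to_a

# Keep the letters each class accepts.
accepted_by_intersect = letters.select { |ch| ch =~ intersect }
accepted_by_simple    = letters.select { |ch| ch =~ simple }

puts accepted_by_intersect.join
puts accepted_by_intersect == accepted_by_simple
```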
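5. Shorthand character classes:
  /./ - any character except a newline
  /./m - any character, including a newline; like Perl's s modifier (Ruby's m modifier lets . match across line breaks, so multi-line text is treated as one line)
  /\w/ - a word character ([a-zA-Z0-9_])
  /\W/ - the opposite of \w ([^a-zA-Z0-9_])
  /\d/ - a digit ([0-9])
  /\D/ - a non-digit ([^0-9])
  /\h/ - a hexadecimal digit ([0-9a-fA-F])
  /\H/ - a non-hexadecimal digit ([^0-9a-fA-F])
  /\s/ - a whitespace character, including carriage return and newline: /[ \t\r\n\f]/
  /\S/ - a non-whitespace character: /[^ \t\r\n\f]/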
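A few of these shorthands in action. The sample strings are invented for the sketch:

```ruby
text = "line one\nline two"

without_m = text[/one.line/]    # nil: "." refuses to match the "\n"
with_m    = text[/one.line/m]   # "one\nline": with m, "." matches "\n" too

# \h picks out hexadecimal digits only.
hex_digits = "0x1F and zz".scan(/\h/)

# \w+ grabs runs of word characters; "," "!" and spaces are not included.
words = "snake_case, CamelCase!".scan(/\w+/)

puts without_m.inspect, with_m.inspect, hex_digits.inspect, words.inspect
```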
6. POSIX bracket expressions (the ranges shown are the ASCII equivalents):
  /[[:alnum:]]/ - alphanumeric character, [0-9a-zA-Z]
  /[[:alpha:]]/ - alphabetic character, [a-zA-Z]
  /[[:blank:]]/ - space or tab
  /[[:cntrl:]]/ - control character
  /[[:digit:]]/ - digit, [0-9]
  /[[:graph:]]/ - non-blank character (excludes spaces, control characters, and similar)
  /[[:lower:]]/ - lowercase letter, [a-z]
  /[[:print:]]/ - like [:graph:], but includes the space character
  /[[:punct:]]/ - punctuation character
  /[[:space:]]/ - whitespace character ([:blank:], newline, carriage return, etc.)
  /[[:upper:]]/ - uppercase letter, [A-Z]
  /[[:xdigit:]]/ - hexadecimal digit, [0-9a-fA-F]
  /[[:word:]]/ - a character in one of the following Unicode general categories: Letter, Mark, Number, Connector_Punctuation
  /[[:ascii:]]/ - a character in the ASCII character set
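The ASCII ranges above are what these classes match on ASCII text. On UTF-8 strings the POSIX brackets are Unicode-aware, while \w and \d stay ASCII-only. A short sketch with sample strings chosen for illustration:

```ruby
word = "caf\u00e9"   # "café", written with an escape so the file encoding does not matter

ascii_letters   = word.scan(/\w/).join           # \w stops at the accented letter
unicode_letters = word.scan(/[[:alpha:]]/).join  # [[:alpha:]] includes it

arabic_three = "\u0663"                          # ARABIC-INDIC DIGIT THREE
posix_digit  = arabic_three =~ /[[:digit:]]/     # matches
ascii_digit  = arabic_three =~ /\d/              # does not match

puts ascii_letters, unicode_letters, posix_digit.inspect, ascii_digit.inspect
```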
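7. Repetition (quantifiers)

  * - zero or more times
  + - one or more times
  ? - zero or one time
  {n} - exactly n times
  {n,} - n or more times
  {,m} - m or fewer times
  {n,m} - at least n and at most m times (both bounds are inclusive)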
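The bounds are easy to check against a short run of identical characters. A small sketch:

```ruby
sample = "aaaa"

exactly_two   = sample[/a{2}/]     # takes two characters
two_or_more   = sample[/a{2,}/]    # greedy: takes all four
at_most_three = sample[/a{,3}/]    # greedy: takes three
one_to_three  = sample[/a{1,3}/]   # greedy: takes three

# {2,3} accepts exactly 2 or 3 repetitions once the pattern is anchored.
in_range = %w[a aa aaa aaaa].select { |str| str =~ /\Aa{2,3}\z/ }

puts exactly_two, two_or_more, at_most_three, one_to_three, in_range.inspect
```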
8. Capturing, using (..)
a.
  # 'at' is captured by the first group of parentheses, then referred to
  # later with \1
  /[csh](..) [csh]\1 in/.match("The cat sat in the hat")
      #=> #<MatchData "cat sat in" 1:"at">
  # Regexp#match returns a MatchData object which makes the captured
  # text available with its #[] method.
  /[csh](..) [csh]\1 in/.match("The cat sat in the hat")[1] #=> 'at'
b.
Name a capture with (?<name>) or (?'name'):
  /\$(?<dollars>\d+)\.(?<cents>\d+)/.match("$3.67")
      #=> #<MatchData "$3.67" dollars:"3" cents:"67">
  /\$(?<dollars>\d+)\.(?<cents>\d+)/.match("$3.67")[:dollars] #=> "3"
c.
Refer back to a named capture with \k<name>:

  /(?<vowel>[aeiou]).\k<vowel>.\k<vowel>/.match('ototomy')
      #=> #<MatchData "ototo" vowel:"o">
Note: a regexp cannot use named and numbered backreferences at the same time, i.e. \k<name> and \1 cannot be mixed in one pattern.

d. When a regexp literal with named groups is on the left-hand side of the =~ operator, Ruby also assigns the captured text to local variables of the same names, which can then be used directly:

  /\$(?<dollars>\d+)\.(?<cents>\d+)/ =~ "$3.67" #=> 0
  dollars #=> "3"
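Putting the named-capture pieces together, here is a sketch that turns a price into cents. The names price_pattern, whole and frac are invented for the example:

```ruby
price_pattern = /\$(?<dollars>\d+)\.(?<cents>\d+)/

m = price_pattern.match('Total: $3.67')
names       = m.names                                     # group names, in order
total_cents = m[:dollars].to_i * 100 + m[:cents].to_i     # captures are strings

# A regexp *literal* on the left of =~ also creates local variables.
assigned = nil
if /\$(?<whole>\d+)\.(?<frac>\d+)/ =~ '$12.05'
  assigned = [whole, frac]
end

puts names.inspect, total_cents, assigned.inspect
```

The local variables are only created for a literal on the left-hand side; a pattern stored in a variable (like price_pattern here) goes through MatchData instead.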

9. Grouping
a.
(..) groups a subpattern, and a quantifier can follow the group.

  # The pattern below matches a vowel followed by 2 word characters:
  # 'aen'
  /[aeiou]\w{2}/.match("Caenorhabditis elegans") #=> #<MatchData "aen">
  # Whereas the following pattern matches a vowel followed by a word
  # character, twice, i.e. [aeiou]\w[aeiou]\w: 'enor'.
  /([aeiou]\w){2}/.match("Caenorhabditis elegans")
      #=> #<MatchData "enor" 1:"or">
b.
(?:) groups without capturing.


  # The first group of parentheses captures 'n' and the second 'ti'. The
  # second group is referred to later with the backreference \2
  /I(n)ves(ti)ga\2ons/.match("Investigations")
      #=> #<MatchData "Investigations" 1:"n" 2:"ti">
  # The first group of parentheses is now made non-capturing with '?:',
  # so it still matches 'n', but doesn't create the backreference. Thus,
  # the backreference \1 now refers to 'ti'.
  /I(?:n)ves(ti)ga\1ons/.match("Investigations")
      #=> #<MatchData "Investigations" 1:"ti">
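MatchData#captures makes the difference between the two kinds of group visible. A brief sketch reusing the patterns above:

```ruby
capturing     = /I(n)ves(ti)ga\2ons/.match("Investigations")
non_capturing = /I(?:n)ves(ti)ga\1ons/.match("Investigations")

both_groups = capturing.captures       # one entry per capturing group
only_second = non_capturing.captures   # the (?:n) group leaves no entry

# With a quantified group, only the last repetition is kept in the capture.
repeated = /([aeiou]\w){2}/.match("Caenorhabditis elegans")

puts both_groups.inspect, only_second.inspect, repeated[0], repeated[1]
```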

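c. Atomic grouping

I did not really follow the documentation's example at first:
  # The " in the pattern below matches the first character of
  # the string, then .* matches Quote". This causes the
  # overall match to fail, so the text matched by .* is
  # backtracked by one position, which leaves the final character of the
  # string available to match "
  /".*"/.match('"Quote"') #=> #<MatchData "\"Quote\"">
  # If .* is grouped atomically, it refuses to backtrack
  # Quote", even though this means that the overall match fails
  /"(?>.*)"/.match('"Quote"') #=> nil
Reference: http://blog.donews.com/maverick/archive/2005/11/28/641232.aspx
(I only just noticed that this address is the blog of the translator of Mastering Regular Expressions. I still have not finished that book, which is embarrassing!)

That article explains it clearly. In short, the main purpose of atomic grouping is to cut off backtracking and so improve efficiency. If the overall match succeeds, an atomic group behaves just like an ordinary group; but once the group has matched, all the backtracking states saved inside it are thrown away, so a later failure cannot reopen them.

A quantifier in a regular expression is normally either greedy or non-greedy, and when the match does not succeed the engine backtracks, trying each remaining branch in turn. Atomic grouping treats the pat in (?>pat) as a single atomic operation (greedy or not): it either succeeds or fails as a whole, a one-shot deal, with no backtracking into it.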
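The effect is easiest to see with three variants of the quoted-string pattern. The third variant is my own addition, a sketch of how an atomic group can still succeed when its contents cannot swallow the closing quote:

```ruby
quoted = '"Quote"'

# Greedy .* overshoots, then backtracks one character to free the final quote.
greedy = /".*"/.match(quoted)

# Atomic .* overshoots too, but may not give anything back, so the match fails.
atomic_greedy = /"(?>.*)"/.match(quoted)

# Atomic [^"]* never consumes the closing quote, so no backtracking is needed.
atomic_narrow = /"(?>[^"]*)"/.match(quoted)

puts greedy[0], atomic_greedy.inspect, atomic_narrow[0]
```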

10. Subexpression calls
The \g<name> syntax calls the subexpression of the group (?<name>...) again: it re-runs that group's pattern at the current position, rather than matching the text the group captured earlier (which is what \k<name> does). A group can also be called by number, as in \g<1>.

  # Matches a ( character and assigns it to the paren
  # group, tries to call the paren sub-expression again
  # but fails, then matches a literal ).
  /\A(?<paren>\(\g<paren>*\))*\z/ =~ '()'

  /\A(?<paren>\(\g<paren>*\))*\z/ =~ '(())' #=> 0
  # ^1
  #  ^2
  #            ^3
  #             ^4
  #  ^5
  #            ^6
  #             ^7
  #                        ^8
  #                        ^9
  #                            ^10

  1. Matches at the beginning of the string, i.e. before the first character.
  2. Enters a named capture group called paren.
  3. Matches a literal (, the first character in the string.
  4. Calls the paren group again, i.e. recurses back to the second step.
  5. Re-enters the paren group.
  6. Matches a literal (, the second character in the string.
  7. Tries to call paren a third time, but fails because doing so would prevent an overall successful match.
  8. Matches a literal ), the third character in the string. This marks the end of the second recursive call.
  9. Matches a literal ), the fourth character in the string.
  10. Matches the end of the string.
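The recursive pattern works as a balanced-parentheses checker, and a tiny second pattern shows how a call (\g) differs from a backreference (\k). This is a sketch; BALANCED and balanced? are names invented here:

```ruby
BALANCED = /\A(?<paren>\(\g<paren>*\))*\z/

# True when every "(" has a matching ")" and nothing else is in the string.
def balanced?(text)
  !(BALANCED =~ text).nil?
end

results = ['()', '(())', '(()())', '(()', '())('].map { |t| balanced?(t) }

# \g<d> re-runs the pattern \d, so any digit will do after the hyphen.
call_match   = "1-2" =~ /(?<d>\d)-\g<d>/
# \k<d> needs the same text that group d captured.
backref_miss = "1-2" =~ /(?<d>\d)-\k<d>/
backref_hit  = "1-1" =~ /(?<d>\d)-\k<d>/

puts results.inspect, call_match.inspect, backref_miss.inspect, backref_hit.inspect
```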
11. Alternation
Two expressions joined by | match if either one of them matches. Examples:

  /\w(and|or)\w/.match("Feliformia") #=> #<MatchData "form" 1:"or">
  /\w(and|or)\w/.match("furandi") #=> #<MatchData "randi" 1:"and">
  /\w(and|or)\w/.match("dissemblance") #=> nil
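The parentheses in the examples above matter: | has the lowest precedence, so without a group it splits the whole pattern, anchors included. A sketch with made-up patterns:

```ruby
unscoped = /\Acat|dog\z/        # read as (\Acat) OR (dog\z)
scoped   = /\A(?:cat|dog)\z/    # the whole string must be "cat" or "dog"

unscoped_prefix = unscoped =~ "catalog"   # matches: the string starts with "cat"
unscoped_suffix = unscoped =~ "hotdog"    # matches: the string ends with "dog"
scoped_prefix   = scoped   =~ "catalog"   # no match
scoped_exact    = scoped   =~ "dog"       # matches

puts unscoped_prefix.inspect, unscoped_suffix.inspect,
     scoped_prefix.inspect, scoped_exact.inspect
```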

12. Character properties
There are too many to list, so here are just a few examples:

  /\p{Alnum}/ - Alphabetic and numeric character
  /\p{Alpha}/ - Alphabetic character
  /\p{Blank}/ - Space or tab
  /\p{Cntrl}/ - Control character
  /\p{Digit}/ - Digit
  /\p{Graph}/ - Non-blank character (excludes spaces, control characters, and similar)
  /\p{Lower}/ - Lowercase alphabetical character
  /\p{Print}/ - Like \p{Graph}, but includes the space character
  /\p{Punct}/ - Punctuation character
  /\p{Space}/ - Whitespace character ([:blank:], newline, carriage return, etc.)
  /\p{Upper}/ - Uppercase alphabetical character
  /\p{XDigit}/ - Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F)
  /\p{Word}/ - A member of one of the following Unicode general categories: Letter, Mark, Number, Connector_Punctuation
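A few property lookups, including a script property (\p{Han}) and the negated form \p{^...}, neither of which is in the list above. The sample strings are only for illustration:

```ruby
# "Go to 東京都", written with escapes so the file encoding does not matter.
sentence = "Go to \u6771\u4eac\u90fd"

han_chars  = sentence.scan(/\p{Han}/)          # characters in the Han script
uppercase  = "Hello World".scan(/\p{Upper}/)   # uppercase letters only
non_digits = "a1b2".scan(/\p{^Digit}/)         # "^" negates the property

puts han_chars.size, uppercase.inspect, non_digits.inspect
```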

13. Anchors

^ - Matches the beginning of a line
$ - Matches the end of a line
\A - Matches the beginning of the string
\Z - Matches the end of the string; if the string ends with a newline, it matches just before that newline
\z - Matches the end of the string
\G - Matches the point where the last match finished
\b - Matches word boundaries when outside brackets; backspace (0x08) when inside brackets
\B - Matches non-word boundaries
(?=pat) - Positive lookahead assertion: ensures that the following characters match pat, but doesn't include those characters in the matched text
(?!pat) - Negative lookahead assertion: ensures that the following characters do not match pat, but doesn't include those characters in the matched text
(?<=pat) - Positive lookbehind assertion: ensures that the preceding characters match pat, but doesn't include those characters in the matched text
(?<!pat) - Negative lookbehind assertion: ensures that the preceding characters do not match pat, but doesn't include those characters in the matched text

  # If a pattern isn't anchored it can begin at any point in the string
  /real/.match("surrealist") #=> #<MatchData "real">
  # Anchoring the pattern to the beginning of the string forces the
  # match to start there. 'real' doesn't occur at the beginning of the
  # string, so now the match fails
  /\Areal/.match("surrealist") #=> nil
  # The match below fails because although 'Demand' contains 'and', the
  # pattern does not occur at a word boundary.
  /\band/.match("Demand") #=> nil
  # Whereas in the following example 'and' has been anchored to a
  # non-word boundary so instead of matching the first 'and' it matches
  # from the fourth letter of 'demand' instead
  /\Band.+/.match("Supply and demand curve") #=> #<MatchData "and curve">
  # The pattern below uses positive lookahead and positive lookbehind to
  # match text appearing in <b></b> tags without including the tags in the
  # match
  /(?<=<b>)\w+(?=<\/b>)/.match("Fortune favours the <b>bold</b>")
      #=> #<MatchData "bold">
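The line anchors and the string anchors only differ on multi-line input, so a two-line string shows them best. A sketch with invented sample text, plus one negative lookbehind:

```ruby
text = "first\nsecond\n"

line_words    = text.scan(/^\w+$/)      # ^ and $ work on every line
string_start  = text =~ /\Asecond/      # \A only matches at offset 0
line_start    = text =~ /^second/       # ^ also matches after a newline
before_final  = text =~ /second\Z/      # \Z tolerates one trailing newline
very_end      = text =~ /second\z/      # \z does not

bold = "Fortune favours the <b>bold</b>"[/(?<=<b>)\w+(?=<\/b>)/]

# Digits that are NOT directly preceded by "b".
not_after_b = "a1 b2 c3".scan(/(?<!b)\d/)

puts line_words.inspect, string_start.inspect, line_start.inspect,
     before_final.inspect, very_end.inspect, bold, not_after_b.inspect
```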
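14. Modifiers

/pat/i - ignore case
/pat/m - allow . to match a newline; similar to Perl's s modifier
/pat/x - ignore whitespace and comments in the pattern, which makes it possible to write patterns that are neater and easier to read
/pat/o - perform #{} interpolation only once, the first time the literal is evaluated; I have not yet worked out exactly when to use it

The i, m, and x modifiers can also be applied to a subexpression. They are switched on and off with the (?on-off) and (?on-off:subexp) syntax.

For example, (?i:b) here makes only the character b case-insensitive:
  /a(?i:b)c/.match('aBc') #=> #<MatchData "aBc">
  /a(?i:b)c/.match('abc') #=> #<MatchData "abc">
An example of the x modifier: whitespace and # comments inside the pattern are ignored, so the regexp can be laid out more readably.

  # A contrived pattern to match a number with optional decimal places
  float_pat = /\A
      [[:digit:]]+ # 1 or more digits before the decimal point
      (\. # Decimal point
          [[:digit:]]+ # 1 or more digits after the decimal point
      )? # The decimal point and following digits are optional
  \Z/x
  float_pat.match('3.14') #=> #<MatchData "3.14" 1:".14">
Note: under the x modifier, a pattern that needs to match whitespace must use an escape such as \s or \p{Space}.
Without the x modifier, a comment can still be added with (?#comment).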
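A sketch pulling these modifiers together: a commented date pattern in x mode, the whitespace rule, an inline (?#...) comment, and switching i off for part of a pattern. The pattern and names are made up for the example:

```ruby
date_pat = /\A
  (?<year>\d{4})  - # four-digit year, then a literal hyphen
  (?<month>\d{2}) - # two-digit month, then a literal hyphen
  (?<day>\d{2})     # two-digit day
\z/x

m = date_pat.match("2012-03-22")
parts = [m[:year], m[:month], m[:day]]

spaces_ignored = "ab"  =~ /a b/x     # the space in the pattern is dropped
space_missing  = "a b" =~ /a b/x     # so a real space in the text is not matched
space_required = "a b" =~ /a\sb/x    # \s survives x mode

inline_comment = "ab" =~ /a(?#the letter a)b/   # a comment without x

case_locked_hit  = "AbC" =~ /a(?-i:b)c/i   # i is off for "b" only
case_locked_miss = "ABC" =~ /a(?-i:b)c/i

puts parts.inspect, spaces_ignored.inspect, space_missing.inspect,
     space_required.inspect, inline_comment.inspect,
     case_locked_hit.inspect, case_locked_miss.inspect
```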

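Also: a regexp is assumed to use the same encoding as your source file, but this can be overridden with one of the following modifiers:

  /pat/u - UTF-8
  /pat/e - EUC-JP
  /pat/s - Windows-31J
  /pat/n - ASCII-8BIT
A regexp can be matched against a string when the two share an encoding, or when the regexp's encoding is US-ASCII and the string's encoding is ASCII-compatible.

If the encodings are incompatible, an Encoding::CompatibilityError exception is raised.

Passing Regexp::FIXEDENCODING as the second argument to Regexp.new fixes the regexp's encoding explicitly:

  r = Regexp.new("a".force_encoding("iso-8859-1"),Regexp::FIXEDENCODING)
  r =~"a\u3042"
     #=> Encoding::CompatibilityError: incompatible encoding regexp match
          (ISO-8859-1 regexp with UTF-8 string)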
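The same failure can be caught and inspected rather than left to crash the script. A sketch under the assumption that the source file is UTF-8; the variable names are invented, and String.new is used only so that the literal is not modified in place:

```ruby
latin1_source = String.new("a").force_encoding(Encoding::ISO_8859_1)
fixed = Regexp.new(latin1_source, Regexp::FIXEDENCODING)

utf8_text = "a\u3042"

# Matching a fixed ISO-8859-1 regexp against non-ASCII UTF-8 text raises.
outcome =
  begin
    fixed =~ utf8_text
  rescue Encoding::CompatibilityError => e
    e.class
  end

# An ordinary ASCII-only literal has no fixed encoding and matches fine.
plain_result = /a/ =~ utf8_text

puts fixed.encoding, fixed.fixed_encoding?, outcome, /a/.fixed_encoding?, plain_result
```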
15. Performance

Some pathological patterns perform extremely badly:

  s = 'a' * 25 + 'd' 'a' * 4 + 'c'   # the adjacent literals 'd' 'a' join first, giving 'da' * 4
      #=> "aaaaaaaaaaaaaaaaaaaaaaaaadadadadac"

  # The following patterns match instantly, as you would expect

  /(b|a)/ =~ s #=> 0
  /(b|a+)/ =~ s #=> 0
  /(b|a+)*/ =~ s #=> 0

  # The following pattern, however, takes noticeably longer
  /(b|a+)*c/ =~ s #=> 32
This happens because an atom in the regexp is quantified by both an immediate + and an enclosing * with nothing to differentiate which is in control of any particular character. The nondeterminism that results produces super-linear performance. (Consult Mastering Regular Expressions (3rd ed.), p. 222, by Jeffrey Friedl, for an in-depth analysis.) This particular case can be fixed by use of atomic grouping, which prevents the unnecessary backtracking:

  (start = Time.now) && /(b|a+)*c/ =~ s && (Time.now - start)
     #=> 24.702736882
  (start = Time.now) && /(?>b|a+)*c/ =~ s && (Time.now - start)
     #=> 0.000166571
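The comparison can be repeated with Benchmark from the standard library. The sketch below deliberately uses a shorter run of a's (15 instead of 25) so that it finishes quickly on any Ruby; the timings will differ from the ones quoted above, and newer Ruby versions may not show a large gap at all.

```ruby
require 'benchmark'

# Same shape as the string above, but with a shorter run of a's.
short = 'a' * 15 + 'da' * 4 + 'c'

plain_index  = nil
atomic_index = nil

plain_time  = Benchmark.realtime { plain_index  = /(b|a+)*c/ =~ short }
atomic_time = Benchmark.realtime { atomic_index = /(?>b|a+)*c/ =~ short }

# Both patterns find the same match; only the amount of backtracking differs.
puts "plain:  index #{plain_index}, #{plain_time} s"
puts "atomic: index #{atomic_index}, #{atomic_time} s"
```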
Another bad example, which took a full 60 seconds to run:

  # Match a string of 29 a's against a pattern of 29 optional
  # a's followed by 29 mandatory a's.
  Regexp.new('a?' * 29 + 'a' * 29) =~ 'a' * 29

The 29 optional a's match the string, but this prevents the 29 mandatory a's that follow from matching. Ruby must then backtrack repeatedly so as to satisfy as many of the optional matches as it can while still matching the mandatory 29. It is plain to us that none of the optional matches can succeed, but this fact unfortunately eludes Ruby.

One approach for improving performance is to anchor the match to the beginning of the string, thus significantly reducing the amount of backtracking needed.


  Regexp.new('\A' 'a?' * 29 + 'a' * 29).match('a' * 29)
      #=> #<MatchData "aaaaaaaaaaaaaaaaaaaaaaaaaaaaa">
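One detail of the anchored snippet is worth checking before drawing conclusions from its timing. Adjacent string literals are joined before * is applied, so '\A' 'a?' * 29 repeats the anchor together with the optional a. Each repeated \A can only succeed at offset 0, which forces every optional a before it to match nothing, and that may account for much of the speed-up. The sketch below uses 3 repetitions instead of 29 to show the two pattern sources; the variable names are invented.

```ruby
# As written in the snippet: the two literals are concatenated first.
as_written    = '\A' 'a?' * 3 + 'a' * 3
# With an explicit +, the anchor appears only once.
anchored_once = '\A' + 'a?' * 3 + 'a' * 3

puts as_written      # the anchor is repeated with every optional a
puts anchored_once   # a single leading anchor

# Both sources still match the same text.
both_match = [as_written, anchored_once].map do |src|
  Regexp.new(src).match('aaa')[0]
end
puts both_match.inspect
```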




horsley 2012-03-26 12:42:45

十七岁的回忆: 元字符.....
The English term is "metacharacters"; 元字符 is a reasonable translation of it.

十七岁的回忆 2012-03-25 22:42:24

元字符 (metacharacters)

夏冰软件 2012-03-23 15:26:47

Nicely written, thumbs up.