Really advanced Perl RegEx reference - snowtty - ChinaUnix blog

* Samples
Swap two items
s/(\S+)\s+(\S+)/$2 $1/

Search for a C identifier
m/[_A-Za-z][_A-Za-z0-9]*/
m/[_[:alpha:]][_[:alnum:]]*/

Empty Line
/^$/

Word
\b\w+\b
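
The samples above can be tried out quickly. What follows is a minimal sketch in Python's re module (an assumption of this example, chosen because its syntax matches Perl for these patterns). The input strings are made up, and the replacement is written \2 \1 where Perl writes $2 $1.

```python
import re

# Swap the first two whitespace-separated items (Perl: s/(\S+)\s+(\S+)/$2 $1/).
# count=1 mirrors the absence of the g flag.
swapped = re.sub(r"(\S+)\s+(\S+)", r"\2 \1", "hello world", count=1)

# Find the first C identifier in a string.
ident = re.search(r"[_A-Za-z][_A-Za-z0-9]*", "42 + foo_bar1").group(0)

# An empty line is one where start and end coincide.
is_empty = re.search(r"^$", "") is not None

# Collect every word; \b marks the edge between \w and non-\w characters.
words = re.findall(r"\b\w+\b", "swap two items")

print(swapped)   # world hello
print(ident)     # foo_bar1
print(is_empty)  # True
print(words)     # ['swap', 'two', 'items']
```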

* Questions

* Reference
perlre (bytes and utf8)
regex.h (regcomp regexec regfree regerror) (single byte only)
java (unicode only)
python (bytes and unicode)
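
The bytes versus unicode distinction noted for Python can be seen directly. This is a small sketch with a made-up sample word. It assumes default flags (no LOCALE), where a bytes pattern treats \w as ASCII-only.

```python
import re

word = "caf\u00e9"  # "cafe" with an acute accent on the e

# str pattern: \w follows Unicode rules, so the accented letter is a word character.
unicode_words = re.findall(r"\w+", word)

# bytes pattern: \w is ASCII-only, so the two UTF-8 bytes of the accented letter are skipped.
byte_words = re.findall(rb"\w+", word.encode("utf-8"))

print(unicode_words)  # ['café']
print(byte_words)     # [b'caf']
```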

* Basic Structure

* Syntax
m/regex/ismx
s/regex/replacement/ismxg

* Flags
i case-insensitive
s single-line or dot-match-all (only affects .)
m multi-line (only affects ^ and $)
x allows whitespace and comments in the pattern (Perl extension)
g global substitution
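
A rough illustration of the i, s, m and x flags, sketched in Python's re module, where they are spelled re.I, re.S, re.M and re.X. The sample text is invented for the example. Python has no g flag: re.sub replaces every match unless count is given.

```python
import re

text = "first line\nSECOND line"

# i: case-insensitive matching.
has_second = re.search(r"second", text, re.I) is not None

# s: without it . stops at a newline; with it . crosses the newline.
dot_default = re.search(r"first.*", text).group(0)
dot_all = re.search(r"first.*", text, re.S).group(0)

# m: ^ and $ also match at line boundaries inside the string.
line_starts = re.findall(r"^\w+", text, re.M)

# x: whitespace and # comments inside the pattern are ignored.
verbose = re.search(r"""
    (\w+)   # first word
    \s+     # separator
    (\w+)   # second word
""", text, re.X).groups()

print(has_second)         # True
print(repr(dot_default))  # 'first line'
print(repr(dot_all))      # 'first line\nSECOND line'
print(line_starts)        # ['first', 'SECOND']
print(verbose)            # ('first', 'line')
```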

* Alternations
m/ABC|XYZ/

* Sequence
m/ABC/

* Repetition
(greedy)
a? 0 or 1
a* 0 or more
a+ 1 or more
a{m} m
a{m,} m or more
a{m,n} m to n (inclusively)

(lazy)
a??
a*?
a+?
a{m}?
a{m,}?
a{m,n}?

Matching against "aa":
(a?)(a*) $1 => "a", $2 => "a"
(a??)(a*) $1 => "", $2 => "aa"
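
The "aa" case can be checked with a short sketch, written here in Python's re module as an assumption of the example (the quantifier syntax is the same as in Perl). The second half uses an invented HTML-like string to show where the difference usually matters.

```python
import re

# Greedy a? takes one "a" first; lazy a?? takes nothing and leaves both to a*.
greedy = re.match(r"(a?)(a*)", "aa").groups()
lazy = re.match(r"(a??)(a*)", "aa").groups()

# A more typical case: stop at the first closing tag rather than the last.
html = "<b>one</b><b>two</b>"
greedy_span = re.search(r"<b>.*</b>", html).group(0)
lazy_span = re.search(r"<b>.*?</b>", html).group(0)

print(greedy)       # ('a', 'a')
print(lazy)         # ('', 'aa')
print(greedy_span)  # <b>one</b><b>two</b>
print(lazy_span)    # <b>one</b>
```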

* Atoms
Character = a b c
Character Class
Escape = \ + non-alpha character, such as \\, \+, \( (except back-references such as \1)
Meta Escape = \ + alpha [a-zA-Z]
Groups = (...)

* Character Class
[abc] [a-b] [^abc] [^abc0-9]
A - or ] placed first in the class is taken literally
[-a] = - or a
[^\-] = anything except -

[[] = [
[]] = ]
[ ] = space
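
A few of these class rules, sketched in Python's re module (an assumption of the example) with made-up inputs. The [[] form is left out only because Python prints a "possible nested set" warning for it, although it still matches a literal [.

```python
import re

# A leading - is literal, so [-a] matches "-" or "a".
dash_or_a = re.findall(r"[-a]", "a-b")

# ] is literal when it is the first character of the class.
close_bracket = re.findall(r"[]]", "x[y]z")

# Negated classes: everything except the listed characters and ranges.
not_dash = re.findall(r"[^\-]", "a-b")
not_abc_or_digit = re.findall(r"[^abc0-9]", "a1x-c9")

print(dash_or_a)         # ['a', '-']
print(close_bracket)     # [']']
print(not_dash)          # ['a', 'b']
print(not_abc_or_digit)  # ['x', '-']
```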

* Posix Character Class
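[[.a.]] collating symbol (POSIX regex.h; recognized but not supported by Perl)
[[=a=]] equivalence class (POSIX regex.h; recognized but not supported by Perl)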
[[:alpha:]]

* Meta
. anything except newlines (normal mode)
. anything (s mode, singleline, dotall)
^ start of string, or start of line (m mode)
$ end of string (or just before a final newline), or end of line (m mode)
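
The behaviour of $ around a trailing newline is easy to get wrong, so here is a small sketch in Python's re module (an assumption of the example; Perl behaves the same way for these cases). The sample text is invented.

```python
import re

text = "one\ntwo\n"  # 8 characters, ending in a newline

# Without m, $ matches just before the final newline and at the very end.
end_positions = [m.start() for m in re.finditer(r"$", text)]

# So "two$" succeeds even though a newline follows "two".
tail_matches = re.search(r"two$", text) is not None

# With m, $ also matches before every embedded newline.
line_end_positions = [m.start() for m in re.finditer(r"$", text, re.M)]
last_words = re.findall(r"\w+$", text, re.M)

print(end_positions)       # [7, 8]
print(tail_matches)        # True
print(line_end_positions)  # [3, 7, 8]
print(last_words)          # ['one', 'two']
```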

* Meta Escape
\t \n \r \f \a \e
\0nn \xnn
\cA control character, computed as ch ^ 0x40 (\cA = 0x01)
\cM = 0x0D (carriage return)
\N{name}
\l lowercase next char
\u uppercase next char
\L...\E lowercase until \E
\U...\E uppercase until \E
\Q...\E quote until \E
\w \W word char, non-word char
\s \S whitespace, non-whitespace
\d \D digit, non-digit
\b \B word boundary, non-boundary
\p{property}
\P{property}
\X combining character sequence
\C single byte (perl)
\< start of word (emacs)
\> end of word (emacs)
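
Some of these escapes, sketched in Python. Python's re module has no \c, \Q...\E, \l or \u, so the ch ^ 0x40 rule is written out as a hypothetical helper named control_char, and re.escape stands in for \Q...\E. Both substitutions are assumptions of this example, not Perl features.

```python
import re

def control_char(letter):
    # The \cX rule: flip bit 0x40 of the upper-cased character.
    return chr(ord(letter.upper()) ^ 0x40)

# re.escape plays the role of \Q...\E: the dot is matched literally.
literal = re.escape("a.b")
quoted_hit = re.search(literal, "axb a.b").start()

# \d and \S with invented sample input.
digits = re.findall(r"\d+", "tel 555 0199")
non_space = re.findall(r"\S+", "a  b\tc")

# \b matches at a word edge, \B everywhere else.
sample = "cat concat cats"
whole_word_at = [m.start() for m in re.finditer(r"\bcat\b", sample)]
inner_at = [m.start() for m in re.finditer(r"\Bcat", sample)]

print(repr(control_char("A")))  # '\x01'
print(repr(control_char("M")))  # '\r'
print(quoted_hit)               # 4
print(digits)                   # ['555', '0199']
print(non_space)                # ['a', 'b', 'c']
print(whole_word_at)            # [0]
print(inner_at)                 # [7]
```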

* Groups
(abc) for capture group

* Special group
(?#comment)
(?imsx-imsx) embedded flags
(?:pattern) for non-capture
(?imsx-imsx:pattern) subpattern
(?=pattern) positive look ahead
(?!pattern) negative look ahead
(?<=pattern) positive look behind
(?<!pattern) negative look behind
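
A sketch of the special groups in use, written in Python's re module (an assumption of the example; the group syntax is the same as in Perl). The price string is invented, and (?<!...) is the standard negative look-behind counterpart of (?<=...).

```python
import re

prices = "cost: $30, fee: 15 USD, tax: $7"

# (?=...)  look-ahead: numbers followed by " USD".
usd = re.findall(r"\d+(?= USD)", prices)
# (?!...)  negative look-ahead: whole numbers not followed by " USD".
not_usd = re.findall(r"\b\d+\b(?! USD)", prices)
# (?<=...) look-behind: numbers preceded by a dollar sign.
dollar = re.findall(r"(?<=\$)\d+", prices)
# (?<!...) negative look-behind: whole numbers not preceded by a dollar sign.
no_dollar = re.findall(r"(?<!\$)\b\d+", prices)

# (?:...) groups without capturing, so only (c) shows up in groups().
non_capture = re.match(r"(?:ab)+(c)", "ababc").groups()
# (?i:...) applies the i flag to the subpattern only.
scoped_flag = re.match(r"(?i:ab)c", "ABc") is not None

print(usd)          # ['15']
print(not_usd)      # ['30', '7']
print(dollar)       # ['30', '7']
print(no_dollar)    # ['15']
print(non_capture)  # ('c',)
print(scoped_flag)  # True
```

The \b anchors in the negative cases matter: without them the engine backtracks to a shorter run of digits and the assertion succeeds on a partial number.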

* Reference for capture
m/(x)\1/
s/(x)/$1$1/
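
Both kinds of reference, sketched in Python's re module as an assumption of the example. Inside the pattern the syntax is \1 as in Perl. In the replacement Python writes \1 where Perl writes $1. The sample strings are made up.

```python
import re

# \1 inside the pattern: the same text that group 1 matched must repeat.
doubled = re.search(r"(\w)\1", "hello").group(0)

# Reference in the replacement (Perl: s/(x)/$1$1/), first match only.
duplicated = re.sub(r"(x)", r"\1\1", "axbx", count=1)

print(doubled)     # ll
print(duplicated)  # axxbx
```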

* Traditional vs Extended
\{m,n\} vs {m,n}
\(xxx\) vs (xxx)
Emacs still uses traditional (basic) regular expressions

* Special extension
\< start of word (emacs)
\> end of word (emacs)