Category: LINUX
2010-02-06 22:22:54
One issue we haven't discussed yet is the question "how much text matches?" Really, there are two questions: the second is "where does the match start?" When doing simple text searches, such as with grep or egrep, both questions are irrelevant. All you want to know is whether a line matched, and if so, to see the line. Where in the line the match starts, and how far it extends, doesn't matter.
However, knowing the answer to these questions becomes vitally important when doing text substitution with sed or programs written in awk. (Understanding this is also important for day-to-day use when working inside a text editor, although we don't cover text editing in this book.)
The answer to both questions is that a regular expression matches the longest, leftmost substring of the input text that can match the entire expression. In addition, a match of the null string is considered to be longer than no match at all. (Thus, as we explained earlier, given the regular expression ab*c, matching the text ac, the b* successfully matches the null string between a and c.) Furthermore, the POSIX standard states: "Consistent with the whole match being the longest of the leftmost matches, each subpattern, from left to right, shall match the longest possible string." (Subpatterns are the parts enclosed in parentheses in an ERE. GNU programs often apply the same rule to \(...\) subpatterns in BREs too.)
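One way to see the rule in action is to have sed mark the matched text rather than replace it: in the replacement, & stands for whatever the regular expression matched. The following is a small sketch; the input strings and variable names are made up for illustration and are not from the examples in this section.

```shell
#!/bin/sh
# Leftmost beats longer: abbbc starts earlier in the line than the
# longer candidate abbbbbc, so abbbc is the text that gets marked.
leftmost=$(echo 'xabbbcabbbbbc' | sed 's/ab*c/[&]/')
echo "$leftmost"     # x[abbbc]abbbbbc

# Longest among matches with the same starting point: .* extends to
# the last X on the line, not the first one.
longest=$(echo 'aXbXc' | sed 's/a.*X/[&]/')
echo "$longest"      # [aXbX]c

# A null match counts as a match: b* matches the empty string
# between a and c.
nullmatch=$(echo 'ac' | sed 's/ab*c/[&]/')
echo "$nullmatch"    # [ac]
```

Without the g flag, sed stops after the first (leftmost) match on each line, which is why the second candidate in the first example is left untouched.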
If sed is going to be replacing the text matched by a regular expression, it's important to be sure that the regular expression doesn't match too little or too much text. Here's a simple example:
    $ echo Tolstoy writes well | sed 's/Tolstoy/Camus/'      # Use fixed strings
    Camus writes well
Of course, sed can use full regular expressions. This is where understanding the "longest leftmost" rule becomes important:
    $ echo Tolstoy is worldly | sed 's/T.*y/Camus/'          # Try a regular expression
    Camus                                                    # What happened?
The apparent intent was to match just Tolstoy. However, since the match extends over the longest possible amount of text, it went all the way to the y in worldly! What's needed is a more refined regular expression:
    $ echo Tolstoy is worldly | sed 's/T[[:alpha:]]*y/Camus/'
    Camus is worldly
In general, and especially if you're still learning the subtleties of regular expressions, when developing scripts that do lots of text slicing and dicing, you'll want to test things very carefully, and verify each step as you write it.
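One way to do that kind of checking is to preview what a regular expression matches before committing to a replacement. The sketch below uses a hypothetical helper, show_match (our own name, not a standard utility), which brackets the first match on each line. It assumes the regular expression contains no / characters, since / is used as the sed delimiter.

```shell
#!/bin/sh
# show_match REGEX: hypothetical helper that wraps the first match on
# each input line in brackets instead of replacing it.
# Assumption: REGEX contains no '/' characters.
show_match() {
    sed "s/$1/[&]/"
}

# The greedy pattern swallows the whole line ...
greedy=$(echo 'Tolstoy is worldly' | show_match 'T.*y')
echo "$greedy"       # [Tolstoy is worldly]

# ... while the refined pattern stops at the end of the first word.
refined=$(echo 'Tolstoy is worldly' | show_match 'T[[:alpha:]]*y')
echo "$refined"      # [Tolstoy] is worldly
```

Once the brackets surround exactly the text you intended, swap [&] for the real replacement.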
Finally, as we've seen, it's possible to match the null string when doing text searching. This is also true when doing text replacement, allowing you to insert text:
    $ echo abc | sed 's/b*/1/'                               # Replace first match
    1abc
    $ echo abc | sed 's/b*/1/g'                              # Replace all matches
    1a1c1
Note how b* matches the null string at the front and at the end of abc.
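The same idea is behind a few common insertion idioms. The examples below are illustrative sketches rather than part of the original text; the last result is what we would expect from GNU sed, and other implementations may treat repeated null matches differently.

```shell
#!/bin/sh
# ^ and $ match the null string at the start and end of the line, so
# substituting for them inserts text without removing anything.
prefixed=$(echo 'abc' | sed 's/^/> /')
echo "$prefixed"     # > abc

suffixed=$(echo 'abc' | sed 's/$/!/')
echo "$suffixed"     # abc!

# There is no x in abc, so x* can only match the null string. With the
# g flag it does so before each character and at the end of the line.
between=$(echo 'abc' | sed 's/x*/-/g')
echo "$between"      # -a-b-c-
```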