2004-12-16 11:50:32
v1.0.9 / chapter 2 of 3 / 01 oct 04 / greg goebel / public domain
* This chapter gives a description of the precise syntax of Awk.
* Awk is invoked as follows:
   awk [ -F<ch> ] {pgm} | { -f <pgm file> } [ <vars> ] [ - | <data file> ]

-- where:

   ch:          Field-separator character.
   pgm:         Awk command-line program.
   pgm file:    File containing an Awk program.
   vars:        Awk variable initializations.
   data file:   Input data file.

An Awk program has the general form:

   BEGIN              {<initializations>}
   <search pattern 1> {<program actions>}
   <search pattern 2> {<program actions>}
   ...
   END                {<final actions>}

If the Awk program is written on the command line, it should be enclosed in single quotes ('{pgm}') instead of double quotes ("{pgm}") to prevent the shell from interpreting characters within the program as special shell characters. Please remember that the PC COMMAND.COM shell does not allow use of single quotes in this way. Naturally, if such interpretation is desired, double quotes can be used. Those special shell characters in the Awk program that the shell should not interpret should be preceded with a "\".
* This syntax diagram should be easily understood by anyone who has read the first chapter, with a few comments.
First, the data file is optional. If it isn't specified, Awk takes data from standard input, with input terminated by a CTRL-D. However, if you are initializing variables on the command line, a matter to be explained shortly, you must specify standard input by using "-" as a parameter.
Multiple data files can also be specified. Awk will scan each in turn and generate a continuous output from the contents of the multiple files.
Second, notice the "-F" option. This allows you to change Awk's "field separator" character. As noted in the previous chapter, Awk regards each line of input data as composed of multiple "fields", which are essentially words separated by blank spaces. A blank space (or a tab character) is the default "field separator".
In some cases, the input data may be divided by another character, for example, a ":", and it would be nice to be able to tell Awk to use a different field separator. This is what the "-F" option does. To invoke Awk and specify a ":" as the field separator, you write:
   awk -F: ...

This can also be done by changing one of Awk's built-in variables; again, more on this later.
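As a minimal sketch of the "-F" option, the following splits colon-separated records and prints the first and third fields. The two input lines are made-up sample data, not taken from any real file:

```shell
# Hypothetical colon-separated records, fed to Awk on standard input.
out=$(printf 'root:x:0:0\nalice:x:1000:1000\n' | awk -F: '{ print $1, $3 }')
printf '%s\n' "$out"
```

With the default field separator, each of these lines would have been a single field, since they contain no blanks.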
Third, it is also possible to initialize Awk variables on the command line. This is obviously only useful if the Awk program is stored in a file or is an element in a shell script, as any initial values needed in a script written on the command-line can be written as part of the program text.
Consider the program example in the previous chapter to compute the value of a coin collection. The current prices for silver and gold were embedded in the program, which means that the program would have to be modified every time the price of either metal changed. It would be much simpler to specify the prices when the program is invoked.
The main part of the original program was written as:
   /gold/    { num_gold++; wt_gold += $2 }
   /silver/  { num_silver++; wt_silver += $2 }
   END { val_gold = 485 * wt_gold
         val_silver = 16 * wt_silver
         ...

The prices of gold and silver could be specified by variables, say, "pg" and "ps":

   END { val_gold = pg * wt_gold
         val_silver = ps * wt_silver
         ...

-- and then the program would be invoked with variable initializations in the command line as follows:

   awk -f summary.awk pg=485 ps=16 coins.txt

-- with the same results as before. Notice that the variable initializations are listed as "pg=485" and "ps=16", and not "pg = 485" and "ps = 16"; spaces must not be included, since the shell would split "pg = 485" into three separate arguments.
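The same idea can be tried without any files. This is a sketch only: the two-line coin list is invented, the program is a cut-down version of the one above, and "-" tells Awk to read standard input after the variable initializations:

```shell
# Hypothetical coin list: metal name, then weight in ounces.
out=$(printf 'gold 2\nsilver 10\n' |
  awk '/gold/   { wt_gold += $2 }
       /silver/ { wt_silver += $2 }
       END      { print pg * wt_gold, ps * wt_silver }' pg=485 ps=16 -)
printf '%s\n' "$out"
```

Changing a price now means changing only the command line, not the program text.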
* The simplest kind of search pattern that can be specified is a simple string, enclosed in forward-slashes ("/"). For example:

   /The/

-- searches for any line that contains the string "The". This will not match "the" as Awk is "case-sensitive", but it will match words like "There" or "Them".

This is the crudest sort of search pattern. Awk defines special characters or "metacharacters" that can be used to make the search more specific. For example, preceding the string with a "^" tells Awk to search for the string at the beginning of the input line. For example:

   /^The/

-- matches any line that begins with the string "The". Similarly, following the string with a "$" matches any line that ends with "The", for example:

   /The$/

But what if you actually want to search the text for a character like "^" or "$"? Simple, just precede the character with a backslash ("\"). For example:

   /\$/

-- matches any line with a "$" in it.
* Such a pattern-matching string is known as a "regular expression". There are many different characters that can be used to specify regular expressions. For example, it is possible to specify a set of alternative characters using square brackets ("[]"):
   /[Tt]he/

This example matches the strings "The" and "the". A range of characters can also be specified. For example:

   /[a-z]/

-- matches any character from "a" to "z", and:

   /[a-zA-Z0-9]/

-- matches any letter or number.

A range of characters can also be excluded, by preceding the range with a "^". For example:

   /^[^a-zA-Z0-9]/

-- matches any line whose first character is not a letter or digit.

A "|" allows regular expressions to be logically ORed. For example:

   /(^Germany)|(^Netherlands)/

-- matches lines that start with the word "Germany" or the word "Netherlands". Notice how parentheses are used to group the two expressions.
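A quick way to check an ORed pattern is to feed it a few lines and see which ones survive. The sample lines here are invented for the sketch; note that "New Germany" is rejected because of the "^" anchor:

```shell
# Hypothetical input; only lines that START with one of the two names match.
out=$(printf 'Germany 82\nFrance 60\nNetherlands 16\nNew Germany 1\n' |
  awk '/(^Germany)|(^Netherlands)/ { print $1 }')
printf '%s\n' "$out"
```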
* The "." special character allows "wildcard" matching, meaning it can be used to specify any arbitrary character. For example:

   /wh./

-- matches "who", "why", and any other string that has the characters "wh" and any following character.

The "." behaves like the "?" wildcard of the UN*X shell, which matches any single character, but awk interprets the "*" wildcard in a subtly different way. In the UN*X shell, the "*" substitutes for a string of arbitrary characters of any length, including zero, while in awk the "*" simply matches zero or more repetitions of the previous character or expression. For example, "a*" would match an empty string, "a", "aa", "aaa", and so on. That means that ".*" will match any string of characters.
There are other characters that allow matches against repeated characters or expressions. A "?" matches zero or one occurrences of the previous regular expression, while a "+" matches one or more occurrences of the previous regular expression. For example:

   /^[-+]?[0-9]+$/

-- matches any line that consists only of a (possibly signed) integer number. This is a somewhat confusing example and it is helpful to break it down by parts:

   /^                 Find string at beginning of line.
   /^[-+]?            Specify possible "-" or "+" sign for number.
   /^[-+]?[0-9]+      Specify one or more digits "0" through "9".
   /^[-+]?[0-9]+$/    Specify that the line ends with the number.
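The integer pattern can be exercised against a handful of candidate lines. The inputs below are arbitrary test values chosen for this sketch; only the three signed or unsigned integers should be printed:

```shell
# "3.14", "12abc" and the empty line fail the pattern and are dropped.
out=$(printf '42\n-7\n+15\n3.14\n12abc\n\n' | awk '/^[-+]?[0-9]+$/ { print }')
printf '%s\n' "$out"
```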
* There is more to Awk's string-searching capabilities. The search can be constrained to a single field within the input line. For example:
   $1 ~ /^France$/

-- searches for lines whose first field ("$1" -- more on "field variables" later) is the word "France", while:

   $1 !~ /^Norway$/

-- searches for lines whose first field is not the word "Norway".

It is possible to search for an entire series or "block" of consecutive lines in the text, using one search pattern to match the first line in the block and another search pattern to match the last line in the block. For example:

   /^Ireland/,/^Summary/

-- matches a block of text whose first line begins with "Ireland" and whose last line begins with "Summary".
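A block pattern is easiest to understand by watching which lines it selects. In this sketch the five input lines are invented, and each selected line is printed with its line number:

```shell
# Lines 2 through 4 form the block; "Intro" and "Footer" fall outside it.
out=$(printf 'Intro\nIreland data\nrow 1\nSummary of Ireland\nFooter\n' |
  awk '/^Ireland/,/^Summary/ { print NR ": " $0 }')
printf '%s\n' "$out"
```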
* There is no need for the search pattern to be a regular expression. It can be a wide variety of other expressions as well. For example:
   NR == 10

-- matches line 10. NR is, as explained in the overview, a count of the lines searched by Awk; and "==" is the "equality" operator. Similarly:

   NR == 10,NR == 20

-- matches lines 10 through 20 in the input file. Awk supports search patterns using a full range of comparison operations:

   <     Less than.
   <=    Less than or equal.
   ==    Equal.
   !=    Not equal.
   >=    Greater than or equal to.
   >     Greater than.

For example:

   NF == 0

-- matches all blank lines, or those whose number of fields is zero.

   $1 == "France"

-- is a string comparison that matches any line whose first field is the string "France". The astute reader may notice that this example seems to do the same thing as the previous example:

   $1 ~ /^France$/

In fact, both examples do the same thing, but in the example immediately above the "^" and "$" metacharacters had to be used in the regular expression to specify a match with the entire first field; without them, it would match such strings as "FranceFour", "NewFrance", and so on. The string expression matches only "France".

* It is also possible to combine several search patterns with the "&&" (AND) and "||" (OR) operators. For example:

   ((NR >= 30) && ($1 == "France")) || ($1 == "Norway")

-- matches any line from the 30th onward whose first field is "France", or any line whose first field is "Norway".
* One class of pattern-matching that wasn't listed above is performing a numeric comparison on a field variable. It can be done, of course; for example:
   $1 == 100

-- matches any line whose first field has a numeric value equal to 100. This is a simple thing to do and it will work fine. However, suppose you want to perform:

   $1 < 100

This will generally work fine, but there's a nasty catch to it, which requires some explanation. The catch is that if the first field of the input can be either a number or a text string, this sort of numeric comparison can give crazy results, matching on some text strings that aren't equivalent to a numeric value.
This is because awk is a "weakly-typed" language. Its variables can store a number or a string, with awk performing operations on each appropriately. In the case of the numeric comparison above, if $1 contains a numeric value, awk will perform a numeric comparison on it, as expected; but if $1 contains a text string, awk will perform a text comparison between the text string in $1 and the three-letter text string "100". This will work fine for a simple test of equality or inequality, since the numeric and string comparisons will give the same results, but it will give crazy results for a "less than" or "greater than" comparison.
Awk is not broken; it is doing what it is supposed to do in this case. If this problem comes up, it is possible to add a second test to the comparison to determine if the field contains a numeric value or a text string. This second test has the form:
   (( $1 + 0 ) == $1 )

If $1 contains a numeric value, the left-hand side of this expression will add 0 to it, and awk will perform a numeric comparison that will always be true.

If $1 contains a text string that doesn't look like a number, for want of anything better to do awk will interpret its value as 0. This means the left-hand side of the expression will evaluate to zero; since there is a non-numeric text string in $1, awk will perform a string comparison that will always be false. This leads to a more workable comparison:

   ((( $1 + 0 ) == $1 ) && ( $1 > 100 ))

The same test could be modified to check for a text string instead of a numeric value:

   (( $1 + 0 ) != $1 )

It is worthwhile to remember this trickery for the rare occasions it is needed. Weakly-typed languages are convenient, but in some unusual cases they can turn around and sink their fangs into your hand.
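The difference between the plain comparison and the guarded one shows up as soon as a text field sneaks into the input. This sketch uses three invented input values; the names "naive" and "guarded" are just shell variables chosen for the illustration:

```shell
# Plain comparison: "abc" is compared as a string against "100" and slips through.
naive=$(printf '250\nabc\n99\n' | awk '$1 > 100 { print $1 }')

# Guarded comparison: the extra test rejects fields that do not look numeric.
guarded=$(printf '250\nabc\n99\n' |
  awk '(($1 + 0) == $1) && ($1 > 100) { print $1 }')

printf 'naive:   %s\n' $naive
printf 'guarded: %s\n' $guarded
```

The plain test reports both "250" and "abc", while the guarded test reports only "250".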
* Incidentally, if you're not sure how awk is handling a particular sort of data, it is simple to run tests to find out for sure. For example, I wanted to see if my version of Awk could handle a hexadecimal value as would be specified in C -- for example, "0xA8" -- and so I simply typed in the following at the command prompt:
   awk 'BEGIN {tv="0xA8"; print tv,tv+0}'

This printed "0xA8 0", which meant awk thought that the data was strictly a string. This little example consists only of a BEGIN clause, allowing an Awk program to be run without specifying an input file, which is convenient when playing with examples. If you're not sure what awk is doing in some case, just ask it; you won't break anything.
* Numbers can be expressed in Awk as either decimal integers or floating-point quantities. For example:
   789  3.141592654  +67  +4.6E3  -34  -2.1e-2

There is no provision for specifying values in other bases, such as hex or octal, though, as will be shown later, it is possible to output them from Awk in hex or octal format.

Strings are expressed in double-quotes. For example:

   "All work and no play makes Jack a homicidal maniac!"
   "1987A1"
   "do re mi fa so la ti do"

Awk also supports null strings, which are represented by empty quotes: "".
There are various "special" characters that can be embedded into strings:

   \n    Newline (line feed).
   \t    Horizontal tab.
   \b    Backspace.
   \r    Carriage return.
   \f    Form feed.

A double-quote (") can be embedded in a string by preceding it with a "\", and a "\" can be embedded in a string by typing it in twice: "\\". If a backslash is used with other characters (say, "\m"), it is simply treated as a normal character.

It is possible in the C programming language to specify a character by its three-digit octal code, preceded by a "\", but this is not possible in every version of Awk; POSIX-conforming implementations do accept such octal escapes.
* As already mentioned, Awk supports both user-defined variables and its own predefined variables. Any string beginning with a letter, defined as consisting of alphanumeric characters or underscores ("_"), and which does not conflict with Awk's reserved words can be used as a variable name. Beware that using a reserved word is a common bug when building Awk programs, so if your program blows up on a seemingly inoffensive word, try changing it to something more unusual and see if the problem goes away.
There is no need to declare variables, and in fact you can't, though it is a good idea in an elaborate Awk program to initialize variables in the BEGIN clause to make them obvious and to make sure they have proper initial values. Relying on default values is a bad habit in any programming language. The fact that variables aren't declared in awk can also lead to some odd bugs, for example by misspelling the name of a variable and not realizing that this has created a second, different variable that is out of the loop in the rest of the program.
Also as mentioned, awk is weakly typed. Variables have no data type, and can be used to store either string or numeric values; string operations on variables will give a string result and numeric operations will give a numeric result, with a text string that doesn't look like a number simply being regarded as 0 in a numeric operation. Awk will follow its own rules in this issue and so it is important for the programmer to remember it and avoid possible traps. For example:
   var = 1776

-- is the same as:

   var = "1776"

-- both loading the value 1776 into the variable "var". This can be treated as a numeric value in calculations in either case, and string operations can be performed on it as well. If "var" is loaded up with a text string of the form:

   var = "somestring"

-- string operations can be performed on it, but it will evaluate to a 0 in numeric operations. If this example is changed as follows:

   var = somestring

-- this will always evaluate to 0 in numeric operations and to an empty string in string operations -- because awk thinks "somestring" without quotes is the name of an uninitialized variable. Incidentally, an uninitialized variable can be tested for a value of 0:

   var == 0

This tests "true" if "var" hasn't been initialized; but, oddly, if you try to "print" an uninitialized variable, you get nothing. For example:

   print var

-- simply prints a blank line, while:

   var = 0; print var

-- prints a "0".
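The double life of an uninitialized variable can be confirmed directly. In this sketch, square brackets are printed around the value only to make an empty string visible:

```shell
out=$(awk 'BEGIN {
  if (var == 0)  print "numeric test: equals 0"
  if (var == "") print "string test: equals empty string"
  print "[" var "]"     # uninitialized: nothing between the brackets
  var = 0
  print "[" var "]"     # explicitly set to 0
}')
printf '%s\n' "$out"
```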
* Unlike many other languages, an Awk string variable is not represented as a one-dimensional array of characters. However, it is possible to use the "substr()" function, more on this later, to access characters or substrings of a string.
* Awk's built-in variables include the field variables -- $1, $2, $3, and so on ($0 is the entire line) -- that give the text or values in the individual text fields in a line, and a number of variables with specific functions:
By the way, values can be loaded into field variables; they aren't read-only. For example:
   $2 = "NewText"

-- changes the second text field in the input line to "NewText". I once saw someone use this trick to perform a modification on the lines of an input file and then simply print the lines using "print" without any parameters.
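That trick looks like this in practice. The input line is invented for the sketch; "print" with no parameters prints the whole, modified line:

```shell
# Overwrite the second field, then print the rebuilt line.
out=$(printf 'alpha beta gamma\n' | awk '{ $2 = "NewText"; print }')
printf '%s\n' "$out"
```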
* Awk also permits the use of arrays. The naming convention is the same as it is for variables, and, as with variables, the array does not have to be declared. Awk arrays can only have one dimension; the first index is 1. Array elements are identified by an index, contained in square brackets. For example:
   some_array[1], some_array[2], some_array[3] ...

One interesting feature of Awk arrays is that the indexes can also be strings, which allows them to be used as a sort of "associative memory". For example, an array could be used to tally the money your friends owe you, as follows:

   debts["Kimmie"], debts["Michael"], debts["Hugh"] ...
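A tally of this kind might be built as follows. This is only a sketch: the input format (a name followed by an amount on each line) and the amounts themselves are assumptions made for the illustration:

```shell
# Each input line adds an amount to the running total for that name.
out=$(printf 'Kimmie 5\nMichael 12\nKimmie 7\n' |
  awk '{ debts[$1] += $2 }
       END { print "Kimmie", debts["Kimmie"]
             print "Michael", debts["Michael"] }')
printf '%s\n' "$out"
```

The array elements spring into existence the first time they are used, starting from 0, so no setup is needed before the "+=".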
* Awk's relational operations ("<" "<=" "==" "!=" ">=" ">") have already been discussed. Note that in some older versions of Awk, unlike some languages, relational expressions do not return a value. They only evaluate to a true condition or a false condition. That means that in such versions an Awk program like:

   BEGIN {a=1; print (a==1)}

-- doesn't print anything at all, and trying to use relational expressions as part of an arithmetic expression causes an error. POSIX-conforming implementations evaluate a relational expression to 1 (true) or 0 (false), so there the program prints 1.
Awk uses the standard four arithmetic operators:

   +   addition
   -   subtraction
   *   multiplication
   /   division

All computations are performed in floating-point. There is also a modulo-division ("remainder") operator:

   %   mod

For example, "13 % 8" yields 5, "20 % 6" yields 2, "3 % 5" yields 3, and so on.
There are increment and decrement operators:
   ++   Increment.
   --   Decrement.

The position of these operators with respect to the variable they operate on is important. If "++" precedes a variable, that variable is incremented before it is used in some other operation. For example:

   BEGIN {x=3; print ++x}

-- prints: 4. If "++" follows a variable, that variable is incremented after it is used in some other operation. For example:

   BEGIN {x=3; print x++}

-- prints: 3. Similar remarks apply to "--". Of course, if the variable being incremented or decremented is not part of some other operation at that time, it makes no difference where the operator is placed.
Awk also allows the following shorthand operations for modifying the value of a variable:
   x += 2  -- is the same as:  x = x + 2
   x -= 2  -- is the same as:  x = x - 2
   x *= 2  -- is the same as:  x = x * 2
   x /= 2  -- is the same as:  x = x / 2
   x %= 2  -- is the same as:  x = x % 2

* There is only one unique string operation: concatenation. All that you have to do to concatenate two strings is place them consecutively on the same line. For example:

   BEGIN {string = "Super" "power"; print string}

-- prints:

   Superpower
* Awk includes a number of predefined functions. The simplest function is "length()", which returns the length of its parameter. If no parameter is specified, it returns the length of the input line in number of characters. For example:
   {print length, $0}

-- prints each input line, preceded by its length. When provided with a string parameter, "length()", obviously, returns the length of the string. When provided with an arithmetic parameter, "length()" returns the length of the numeric string that "print" would have printed by default, as defined by default output format, if given the same arithmetic parameter.
* There are several predefined arithmetic functions:
   sqrt()   Square root.
   log()    Base-e log.
   exp()    Power of e.
   int()    Integer part of argument.

The "exp()" function can be used to derive powers of numbers besides e. Given that "^" is an exponentiation operator:

   2^x

-- then if:

   2 = e^k

-- where "k" is the log to the base e of 2:

   k = log(2)

-- then:

   2^x = (e^k)^x = (e^log(2))^x = e^(x * log(2))

So, to let Awk compute the 20th power of 2, you would give it the commands:

   BEGIN {log_two = log(2); print exp(log_two * 20)}

Sine and cosine are also supported by some versions of Awk.
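The identity can be checked against Awk's own "^" operator, assuming a version of Awk that supports it. Because "exp()" and "log()" work in floating-point, this sketch rounds both results with "%.0f" rather than relying on the default output format:

```shell
# Both expressions should give the 20th power of 2.
out=$(awk 'BEGIN { printf("%.0f %.0f\n", exp(log(2) * 20), 2^20) }')
printf '%s\n' "$out"
```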
* Awk, not surprisingly, includes a set of string-processing operations:
   substr()   As mentioned, extracts a substring from a string.
   split()    Splits a string into its elements and stores them in an array.
   index()    Finds the starting point of a substring within a string.

The "substr()" function has the syntax:

   substr(<string>,<start of substring>,<max length of substring>)

For example, to extract and print the word "get" from "unforgettable":

   BEGIN {print substr("unforgettable",6,3)}

Please be aware that the first character of the string is numbered "1", not "0". To extract a substring of at most ten characters, starting from position 6 of the first field variable, you use:

   substr($1,6,10)

The "split()" function has the syntax:

   split(<string>,<array>,[<field separator>])

This function takes a string with n fields and stores the fields into array[1], array[2], ... , array[n]. If the optional field separator is not specified, the value of FS (normally "white space", the space and tab characters) is used. For example, suppose we have a field of the form:

   joe:frank:harry:bill:bob:sil

We could use "split()" to break it up and print the names as follows:

   my_string = "joe:frank:harry:bill:bob:sil";
   split(my_string,names,":");
   print names[1];
   print names[2];
   ...

The "index()" function has the syntax:

   index(<target string>,<search string>)

-- and returns the position at which the search string begins in the target string (remember, the initial position is "1"). For example:

   index("gorbachev","bach")        returns: 4
   index("superficial","super")     returns: 1
   index("sunfire","fireball")      returns: 0
   index("aardvark","z")            returns: 0
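The three string functions can be tried together in a single BEGIN clause. This sketch reuses the strings from the examples above; the variable "n" holds the number of elements that "split()" returns:

```shell
out=$(awk 'BEGIN {
  print substr("unforgettable", 6, 3)                 # characters 6 through 8
  n = split("joe:frank:harry:bill:bob:sil", names, ":")
  print n, names[1], names[n]                         # count, first, last
  print index("gorbachev", "bach"), index("sunfire", "fireball")
}')
printf '%s\n' "$out"
```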
* Awk supports control structures similar to those used in C, including:
   if ... else
   while
   for

The syntax of "if ... else" is:

   if (<condition>) <action 1> [else <action 2>]

The "else" clause is optional. The "condition" can be any expression discussed in the section on pattern matching, including matches with regular expressions. For example, consider the following Awk program:

   {if ($1=="green") print "GO";
    else if ($1=="yellow") print "SLOW DOWN";
    else if ($1=="red") print "STOP";
    else print "SAY WHAT?";}

By the way, for test purposes this program can be invoked as:

   echo "red" | awk -f pgm.txt

-- where "pgm.txt" is a text file containing the program.
The "action" clauses can consist of multiple statements, contained by curly brackets ("{}").
The syntax for "while" is:
   while (<condition>) <action>

The "action" is performed as long as the "condition" tests true, and the "condition" is tested before each iteration. The conditions are the same as for the "if ... else" construct. For example, since by default an Awk variable has a value of 0, the following Awk program could print the numbers from 1 to 20:

   BEGIN {while(++x<=20) print x}

* The "for" loop is more flexible. It has the syntax:

   for (<initial action>;<condition>;<end-of-loop action>) <action>

For example, the following "for" loop prints the numbers 10 through 20 in increments of 2:

   BEGIN {for (i=10; i<=20; i+=2) print i}

This is equivalent to:

   i=10
   while (i<=20) {
      print i;
      i+=2;}

The C programming language has a similar "for" construct, with an interesting feature in that multiple actions can be taken in both the initialization and end-of-loop actions, simply by separating the actions with a comma. Most implementations of Awk, unfortunately, do not support this feature.
The "for" loop has an alternate syntax, used when scanning through an array:
   for (<variable> in <array>) <action>

If you recall the example:

   my_string = "joe:frank:harry:bill:bob:sil";
   split(my_string, names, ":");

-- then the names could be printed with the following statement:

   for (idx in names) print idx, names[idx];

This yields:

   2 frank
   3 harry
   4 bill
   5 bob
   6 sil
   1 joe

Notice that the names are not printed in the proper order. One of the characteristics of this type of "for" loop is that the array is not scanned in a predictable order.
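When the order does matter, one workaround is to count through the indexes with an ordinary "for" loop instead. This sketch relies on "split()" returning the number of elements it stored, held here in a variable named "n":

```shell
out=$(awk 'BEGIN {
  n = split("joe:frank:harry:bill:bob:sil", names, ":")
  for (i = 1; i <= n; i++) print i, names[i]   # indexes 1..n, in order
}')
printf '%s\n' "$out"
```

This only works for arrays with consecutive numeric indexes; arrays indexed by strings still need the "for ... in" form.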
* Awk defines four unconditional control statements: "break", "continue", "next", and "exit". "Break" and "continue" are strictly associated with the "while" and "for" loops:
"Next" and "exit" control Awk's input scanning:
* The simplest output statement is the by-now familiar "print" statement. There's not too much to it:
* The "printf()" (formatted print) function is much more flexible, and trickier. It has the syntax:
   printf(<string>,<expression list>)

The "string" can be a normal string of characters:

   printf("Hi, there!")

This prints "Hi, there!" to the display, just like "print" would, with one slight difference: the cursor remains at the end of the text, instead of skipping to the next line, as it would with "print". A "newline" code ("\n") has to be added to force "printf()" to skip to the next line:

   printf("Hi, there!\n")

So far, "printf()" looks like a step backward from "print", and if you use it to do dumb things like this, it is. However, "printf()" is useful when you want precise control over the appearance of the output.

The trick is that the string can contain format or "conversion" codes to control the results of the expressions in the expression list. For example, the following program:

   BEGIN {x = 35; printf("x = %d decimal, %x hex, %o octal.\n",x,x,x)}

-- prints:

   x = 35 decimal, 23 hex, 43 octal.

The format codes in this example include: "%d" (specifying decimal output), "%x" (specifying hexadecimal output), and "%o" (specifying octal output). The "printf()" function substitutes the three variables in the expression list for these format codes on output.
* The format codes are highly flexible and their use can be a bit confusing. The "d" format code prints a number in decimal format. The output is an integer, even if the number is a real, like 3.14159. Trying to print a string with this format code results in a "0" output. For example:

   x = 35;     printf("x = %d\n",x)    yields:  x = 35
   x = 3.1415; printf("x = %d\n",x)    yields:  x = 3
   x = "TEST"; printf("x = %d\n",x)    yields:  x = 0

* The "o" format code prints a number in octal format. Other than that, this format code behaves exactly as does the "%d" format specifier. For example:

   x = 255; printf("x = %o\n",x)       yields:  x = 377

* The "x" format code prints a number in hexadecimal format. Other than that, this format code behaves exactly as does the "%d" format specifier. For example:

   x = 197; printf("x = %x\n",x)       yields:  x = c5

* The "c" format code prints a character, given its numeric code. For example, the following statement outputs all the printable characters:

   BEGIN {for (ch=32; ch<128; ch++) printf("%c   %c\n",ch,ch+128)}

* The "s" format code prints a string. For example:

   x = "jive"; printf("string = %s\n",x)   yields:  string = jive

* The "e" format code prints a number in exponential format, in the default format:

   [-]D.DDDDDDe[+/-]DDD

For example:

   x = 3.1415; printf("x = %e\n",x)    yields:  x = 3.141500e+000

The number of exponent digits varies by system; many print "e+00" here.

* The "f" format code prints a number in floating-point format, in the default format:

   [-]D.DDDDDD

For example:

   x = 3.1415; printf("x = %f\n",x)    yields:  x = 3.141500

* The "g" format code prints a number in exponential or floating-point format, whichever is shortest.
* A numeric string may be inserted between the "%" and the format code to specify greater control over the output format. For example:
   %3d
   %5.2f
   %08s
   %-8.4s

This works as follows:
For example, consider the output of a string:
   x = "Baryshnikov"
   printf("[%3s]\n",x)       yields:  [Baryshnikov]
   printf("[%16s]\n",x)      yields:  [     Baryshnikov]
   printf("[%-16s]\n",x)     yields:  [Baryshnikov     ]
   printf("[%.3s]\n",x)      yields:  [Bar]
   printf("[%16.3s]\n",x)    yields:  [             Bar]
   printf("[%-16.3s]\n",x)   yields:  [Bar             ]
   printf("[%016s]\n",x)     yields:  [00000Baryshnikov]
   printf("[%-016s]\n",x)    yields:  [Baryshnikov     ]

-- or an integer:

   x = 312
   printf("[%2d]\n",x)       yields:  [312]
   printf("[%8d]\n",x)       yields:  [     312]
   printf("[%-8d]\n",x)      yields:  [312     ]
   printf("[%.1d]\n",x)      yields:  [312]
   printf("[%08d]\n",x)      yields:  [00000312]
   printf("[%-08d]\n",x)     yields:  [312     ]

-- or a floating-point number:

   x = 251.673209
   printf("[%2f]\n",x)       yields:  [251.673209]
   printf("[%16f]\n",x)      yields:  [      251.673209]
   printf("[%-16f]\n",x)     yields:  [251.673209      ]
   printf("[%.3f]\n",x)      yields:  [251.673]
   printf("[%16.3f]\n",x)    yields:  [         251.673]
   printf("[%016.3f]\n",x)   yields:  [000000000251.673]
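Since padding is easy to miscount by eye, it can help to let Awk produce a few of these lines itself. This sketch runs a selection of the format strings from the examples above inside a BEGIN clause:

```shell
out=$(awk 'BEGIN {
  printf("[%8d]\n", 312)            # right-justified in 8 columns
  printf("[%-8d]\n", 312)           # left-justified in 8 columns
  printf("[%08d]\n", 312)           # zero-padded to 8 columns
  printf("[%16.3f]\n", 251.673209)  # 3 decimals, 16 columns wide
  printf("[%.3s]\n", "Baryshnikov") # at most 3 characters of the string
}')
printf '%s\n' "$out"
```

The square brackets are ordinary text in the format string; they are there only to make leading and trailing blanks visible.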
* While "sprintf()" is a string function, it was not discussed with the other string functions, since its syntax is virtually identical to that of "printf()". In fact, "sprintf()" acts in exactly the same way as "printf()", except that "sprintf()" assigns its output to a variable, not standard output. For example:
   BEGIN {var = sprintf("[%8.3f]",3.141592654); print var}

-- yields:

   [   3.142]
* The output-redirection operator ">" can be used in Awk output statements. For example:
   print 3 > "tfile"

-- creates a file named "tfile" containing the number "3". If "tfile" already exists, its contents are overwritten. The "append" redirection operator (">>") can be used in exactly the same way. For example:

   print 4 >> "tfile"

-- tacks the number "4" to the end of "tfile". If "tfile" doesn't exist, it is created and the number "4" is appended to it.

Output redirection can be used with "printf" as well. For example:

   BEGIN {for (x=1; x<=50; ++x) {printf("%3d\n",x) >> "tfile"}}

-- dumps the numbers from 1 to 50 into "tfile".

* The output can also be "piped" into another utility with the "|" ("pipe") operator. As a trivial example, I could pipe output to the "tr" ("translate") utility to convert it to upper-case:

   print "This is a test!" | "tr [a-z] [A-Z]"

This yields:

   THIS IS A TEST!
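When piped output is mixed with ordinary output, the order in which the two appear can be surprising, because the utility on the other end of the pipe runs on its own schedule. One way to keep things in order, sketched here, is to call "close()" on the command string before printing anything further. The variable name "cmd" is arbitrary, and the ranges are written without square brackets so that the shell has nothing to expand:

```shell
out=$(awk 'BEGIN {
  cmd = "tr a-z A-Z"
  print "This is a test!" | cmd
  close(cmd)              # wait for tr to finish before going on
  print "done"
}')
printf '%s\n' "$out"
```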