- General structure of awk (Aho, Weinberger, and Kernighan)
- awk, oawk, nawk, gawk, mawk
- The original version, written in 1977, was called awk
- The expanded version described in the book The AWK Programming
Language (1988) became nawk
- Unices usually ship with three different names for awk: oawk,
nawk, and awk; either oawk=awk or nawk=awk.
- gawk is the FSF version.
- mawk is a speedier rewrite which does a partial compilation
- The awk command line is:
awk [program|-f programfile] [flags/variables] [files]
- Command line flags
- -f file -- Read the awk script from the specified file rather than the
command line
- -F re -- Use the given regular expression re as the field separator
rather than the default "white space"
- variable=value -- Initialize the awk variable with the specified
value
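A minimal sketch of the three flags, assuming a colon-separated sample record; the variable name prefix and the temporary program file are made up for illustration:

```shell
# -F sets the field separator; prefix=user: is a command line variable assignment.
out=$(printf 'root:x:0:0\n' | awk -F: '{ print prefix $1 }' prefix=user:)
echo "$out"

# -f reads the program from a file instead of the command line.
prog=$(mktemp)
printf '{ print $2 }\n' > "$prog"
out2=$(printf 'a:b\n' | awk -F: -f "$prog")
rm -f "$prog"
echo "$out2"
```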
- An awk program consists of one or more awk commands separated by
either \n or semicolons.
- The structure of awk commands
- Each awk command consists of a selector and/or an action; both
may not be omitted in the same command. Braces surround the
action.
- selector [only] -- action is print
- {action}[only] -- selector is every line
- selector {action} -- perform action on each line where selector is
true
- Each action may have multiple statements separated from each
other by semicolons or \n
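The three command forms can be tried on a small made-up data set; this is only a sketch, and the sample lines are assumptions:

```shell
data='alpha 1
beta 22
gamma 3'

# selector only: the action defaults to print
sel=$(printf '%s\n' "$data" | awk '$2 > 2')
# action only: applied to every line
act=$(printf '%s\n' "$data" | awk '{ print $1 }')
# selector {action}, with two statements separated by a semicolon
both=$(printf '%s\n' "$data" | awk '/^b/ { n = $2 * 2; print $1, n }')
printf '%s\n' "$sel" "$act" "$both"
```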
- Line selection
- A selector consists of zero, one, or two selection criteria; in the latter
case the criteria are separated by a comma
- A selection criterion may be either an RE or a boolean expression
(BE) which evaluates to true or false
- Commands which have no selection criteria are applied to each
line of the input data set
- Commands which have one selection criterion are applied to every
line which matches or makes true the criterion depending upon
whether the criterion is an RE or a BE
- Commands which have two selection criteria are applied to the
first line which matches the first criterion, the next line which
matches the second criterion and all the lines between them.
- Unless a prior applied command has a next in it, every selector is
tested against every line of the input data set.
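A sketch of each kind of selector, run against five made-up lines; the last command shows how a next in an earlier command keeps later selectors from seeing a line:

```shell
data='one
START
two
END
three'

re=$(printf '%s\n' "$data" | awk '/t/')                 # one RE criterion
be=$(printf '%s\n' "$data" | awk 'length($0) > 3')      # one boolean criterion
range=$(printf '%s\n' "$data" | awk '/START/,/END/')    # two criteria
skip=$(printf '%s\n' "$data" | awk '/START/,/END/ { next } { print }')
printf '%s\n' "$re" "$be" "$range" "$skip"
```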
- Processing
- The BEGIN block(s) is(are) run (variables set with -v are assigned first)
- Command line variables are assigned
- For each line in the input data set
- It is read and NR, NF, $i, etc. are set
- For each command, its criteria are evaluated
- If the criteria are true or match, the command is executed
- After the input data set is exhausted, the END block(s) is(are) run
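The order of processing can be observed with a small sketch. It assumes an awk that accepts -v (POSIX awks do); the variable names a and b and the two-line input file are invented for the demonstration:

```shell
tmp=$(mktemp)
printf 'r1\nr2\n' > "$tmp"
# a is set with -v, so BEGIN can see it; b is a command line assignment,
# which is only made once awk starts working through the file arguments.
out=$(awk -v a=1 '
BEGIN { print "BEGIN a=" a " b=[" b "]" }
      { print "record " NR " NF=" NF " b=" b }
END   { print "END NR=" NR }' b=2 "$tmp")
rm -f "$tmp"
echo "$out"
```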
- Elementary awk programming
- Constants
- Strings are enclosed in quotes (")
- Numbers are written in the usual decimal way; non-integer values
are indicated by including a period (.) in the representation.
- REs are delimited by /
- Variables
- Need not be declared
- May contain any type of data, their data type may change over the
life of the program
- Are named as any token beginning with a letter and continuing
with letters, digits and underscores
- As in C, case matters; since all the built-in variables are all
uppercase, avoid this form.
- Some of the commonly used built-in variables are:
- NR -- The current line's sequential number
- NF -- The number of fields in the current line
- FS -- The input field separator; defaults to whitespace and
is reset by the -F command line parameter
- Fields
- Each record is separated into fields named $1, $2, etc
- $0 is the entire record
- NF contains the number of fields in the current line
- FS contains the field separator RE; it defaults to a single blank, which
awk treats as runs of blanks and tabs, roughly the RE /[ \t]+/
- Fields may be accessed either by $n or by $var where var contains
a value between 0 and NF
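A short sketch of field access, using made-up records: $n with a variable, $NF for the last field, default whitespace splitting, and a changed FS.

```shell
f1=$(printf 'a b c d\n' | awk '{ n = 2; print NF, $n, $NF }')
f2=$(printf '  p \t q  \n' | awk '{ print NF, $1, $2 }')       # leading blanks and tabs are skipped
f3=$(printf 'x,y z,w\n' | awk 'BEGIN { FS = "," } { print $2 }')
printf '%s\n' "$f1" "$f2" "$f3"
```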
- print/printf
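- print with no arguments prints $0 and then a \n onto stdout; if a field
has been assigned to, $0 is rebuilt from $1 through $NF separated by
OFS, whose default value is a blank
- print value, value ... prints the value(s) in order, separated by OFS, and
then puts out a \n onto stdout; values written without commas between
them are concatenated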
- printf(format,value,value,...) prints the value(s) using the format
supplied onto stdout, just like C. There is no default \n for each
printf so multiples can be used to build a line. There must be as
many values in the list as there are item descriptors in the format.
- Values in print or printf may be constants, variables, or
expressions in any order
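A sketch of the difference between print with commas, print without commas, and printf, on one made-up record:

```shell
out=$(printf 'x y\n' | awk '{
  print $1, $2          # comma: values are separated by OFS
  print $1 $2           # no comma: values are concatenated
  printf("%s-", $1)     # printf adds no newline of its own
  printf("%d\n", 42)    # so this finishes the same output line
}')
echo "$out"
```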
- Operators - awk has many of the same as C, excepting the bit
operators. It also adds some text processing operators.
- Built-in functions
- substr(s,p,l) -- The substring of s starting at p and continuing for l
characters
- index(s1,s2) -- The first location of s2 within s1; 0 if not found
- length(e) -- The length of e, converted to character string if
necessary, in bytes
- sin, cos -- Standard C trig functions (there is no tan; use sin(x)/cos(x))
- atan2(y,x) -- Standard quadrant oriented arctangent function of y/x
- exp, log -- Standard C exponential functions
- srand(s), rand() -- Random number seed and access functions
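A quick sketch of the string and math functions; the sample string is arbitrary, and atan2 is called as atan2(y, x):

```shell
out=$(awk 'BEGIN {
  s = "hello world"
  print substr(s, 7, 5)     # world
  print index(s, "wor")     # 7
  print index(s, "xyz")     # 0, not found
  print length(s)           # 11
  print atan2(0, -1)        # pi, printed using the default OFMT
}')
echo "$out"
```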
- Elementary examples and uses
- length($0)>72 -- print all of the lines whose length exceeds 72 bytes
- {$2="";print} -- blank out the second field of each line (its separators
remain)
- {print $2} -- print only the second field of each line
- /Ucast/{print $1 "=" $NF} -- for each line which contains the
string 'Ucast' print the first field, an equal sign and the last field
(awk code to create awk code; a common trick)
- BEGIN{FS="/"};NF<4 -- using '/' as a field separator, print only those
records with fewer than four fields; when applied to the output of du, gives a
two level summary
- {n++;t+=$4};END{print n " " t} -- when applied to the
output of an ls -l command provides a count and total size of the listed
files; I use it as part of an alias for dir. Depending on your flavor of UNIX,
the $4 may need to be changed to $5.
- $0==prv{ct++;next}{if(NR>1)printf("%8d %s\n",ct,prv);ct=1;prv=$0}END{if(NR>0)printf("%8d %s\n",ct,prv)}
-- prints each unique record with a count of the number of
occurrences of it; presumes input is sorted
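Two of these are easy to try on made-up input. The counting program here is a sketch that compares whole records, ends each output line with \n and flushes the last group in END; treat those details as assumptions about what the one-liner is meant to do:

```shell
# Blanking a field leaves both of its separators behind.
blank=$(printf 'a b c\n' | awk '{ $2 = ""; print }')

# Count runs of identical records in sorted input.
counts=$(printf 'a\na\nb\nc\nc\nc\n' | awk '
$0 == prv { ct++; next }
NR > 1    { printf("%8d %s\n", ct, prv) }
          { ct = 1; prv = $0 }
END       { if (NR > 0) printf("%8d %s\n", ct, prv) }')
printf '%s\n' "$blank" "$counts"
```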
- Advanced awk programming
- Program structure (if, for, while, etc.)
- if(boolean) statement1 else statement2 -- if the boolean expression
evaluates to true execute statement1, otherwise execute statement2
- for(v=init;boolean;v change) statement -- Standard C for loop; assigns v
the value of init, then while the boolean expression is true executes the
statement followed by the v change
- for(v in array) statement -- Assigns to v each of the values of the
subscripts of array, not in any particular order, then executes statement
- while(boolean) statement -- While the boolean expression is true, execute
the statement
- do statement while(boolean) -- execute statement, evaluate the boolean
expression and if true, repeat
- statement in any of the above constructs may be either a simple statement
or a series of statements enclosed in {}, again like C; a further requirement
is that the opening { must be on the line with the beginning keyword (if,
for, while, do) either physically or logically via \ .
- break -- exit from an enclosing for or while loop
- continue -- restart the enclosing for or while loop from the top
- next -- stop processing the current record, read the next record and begin
processing with the first command
- exit -- terminate all input processing and, if present, execute the END
command
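A sketch exercising the loop constructs, break, continue and exit; all variable names are arbitrary:

```shell
loops=$(awk 'BEGIN {
  for (i = 1; i <= 5; i++) {
    if (i == 2) continue      # skip 2
    if (i == 5) break         # stop before 5
    s = s i
  }
  print s                     # 134
  n = 3
  while (n > 0) { t = t n; n-- }
  print t                     # 321
  do { k++ } while (k < 4)
  print k                     # 4
}')

# exit stops input processing, but the END command still runs.
early=$(printf '1\n2\n3\n' | awk '$1 == 2 { exit } { print } END { print "end" }')
printf '%s\n' "$loops" "$early"
```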
- Arrays
- All awk arrays are associative, but it is convenient to think of two styles
of use - standard and generalized
- Standard arrays take the usual integer subscripts, conventionally starting
at 1 (as split does) and going up; multidimensional arrays are allowed and
behave as expected
- Generalized arrays take any type of variable(s) as subscripts, but the
subscript(s) are treated as one long string expression.
- The use of for(a in x) on a generalized array will return all of the valid
subscripts in some order, not necessarily the one you wished.
- The subscript separator is called SUBSEP and has a default value of
"\034"; the commas in a subscript list are replaced by it
- Elements can be deleted from an array via the delete array[subscript]
statement
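A sketch of how subscripts behave; the array names and values are arbitrary. It shows that a numeric subscript and its string form name the same element, and that a subscript list is joined with SUBSEP:

```shell
out=$(awk 'BEGIN {
  b[1] = "one"
  print b["1"]                  # subscripts are strings, so this is the same element
  a["x", "y"] = 5               # stored under the single subscript "x" SUBSEP "y"
  for (k in a) { split(k, p, SUBSEP); print p[1], p[2], a[k] }
  if (("x", "y") in a) print "present"
  delete a["x", "y"]
  if (!(("x", "y") in a)) print "deleted"
}')
echo "$out"
```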
- Built-in variables
- FILENAME -- The name of the file currently being processed
- OFS -- Output Field Separator default ' '
- RS -- Input Record Separator default \n
- ORS -- Output Record Separator default \n
- FNR -- Current line's number with respect to the current file
- OFMT -- Output format for printed numbers default %.6g
- RSTART -- The location of the data matched using the match built-in
function
- RLENGTH -- The length of the data matched using the match built-in
function
- Built-in functions
- gsub(re,sub,str) -- replace, in str, each occurrence of the regular
expression re with sub; return the number of substitutions performed
- int(expr) -- return the value of expr with all fractional parts removed
- match(str,re) -- return the location in str where the regular expression re
occurs and set RSTART and RLENGTH; if re is not found return 0
- split(str,arrname,sep) -- split str into pieces using sep as the separator
and assign the pieces in order to the elements from 1 up of arrname; use
FS if sep is not given
- sprintf(format,value,value,...) -- write the values, as the format
indicates, into a string and return that string
- sub(re,sub,str) -- replace, in str, the first occurrence of the regular
expression re with sub; return 1 if successful, 0 otherwise
- system(command) -- pass command to the local operating system to
execute and return the exit status code returned by the operating system
- tolower(str) -- return a string similar to str with all capital letters changed
to lower case
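A sketch of these functions on arbitrary sample strings; match is called in its own statement so that RSTART and RLENGTH are set before they are printed:

```shell
out=$(awk 'BEGIN {
  s = "foo bar foo"
  n = gsub(/foo/, "X", s); print n, s        # 2 X bar X
  t = "foo bar foo"
  r = sub(/foo/, "Y", t);  print r, t        # 1 Y bar foo
  m = match("abcdef", /cd+/)
  print m, RSTART, RLENGTH                   # 3 3 2
  c = split("10:20:30", p, ":")
  print c, p[1], p[3]                        # 3 10 30
  print sprintf("%03d|%-4s|", 7, "ab")       # 007|ab  |
  print int(3.9), int(-3.9)                  # 3 -3
  print tolower("MiXeD")                     # mixed
}')
echo "$out"
```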
- Other file I/O
- print and printf may have > (or >>) filename or | command appended
and the output will be sent to the named file or command; once a file is
opened, it remains open until explicitly closed
- getline var < filename will read the next line from filename into var.
Again, once a file is opened, it remains so until it is explicitly closed
- close(filename) explicitly closes the file named by the filename
expression
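A sketch of output redirection, getline and close, assuming mktemp and sort are available; the file names passed in as f and s are invented for the example:

```shell
dir=$(mktemp -d)
out=$(awk -v f="$dir/out.txt" -v s="$dir/sorted.txt" 'BEGIN {
  print "first"  > f
  print "second" > f          # the file is already open, so this adds a second line
  close(f)                    # close before reading the file back
  while ((getline line < f) > 0) n++
  close(f)
  print n, line               # 2 second

  cmd = "sort > " s           # output piped to a command
  print "b" | cmd
  print "a" | cmd
  close(cmd)                  # wait for sort to finish
  getline first < s
  print first                 # a
}')
rm -rf "$dir"
echo "$out"
```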
- Writing your own functions
- A function begins with a function header of the form:
function name(argument(s), localvar(s)) {
- and ends with the matching }
- The value of the function is returned via a statement of the form:
return value
- Functions do not have to return a value and the value returned by a
function (either built-in or written locally) may be ignored by just placing
the function with its arguments as a whole, separate statement
- The local variables indicated in the localvars of the heading replace the
global variables of the same name until the function completes, at which
time the globals are restored
- Functions may have side effects such as updating global variables, doing
I/O or running other functions with side effects; beware the frumious
bandersnatch
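A sketch of a function with a local variable and one with a side effect; max and bump are made-up names. The extra spaces before m in the header are only a convention for marking locals:

```shell
out=$(awk '
function max(a, b,    m) {      # m is local; callers pass only a and b
  m = (a > b) ? a : b
  return m
}
function bump() { calls++ }     # updates a global and returns nothing
BEGIN {
  m = "global"
  print max(3, 7)               # 7
  bump(); bump()                # return values ignored
  print calls, m                # 2 global: the global m was not disturbed
}')
echo "$out"
```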
- Advanced examples and uses
{ split($1,t,":")
$1 = (t[1]*60+t[2])*60+t[3]
print
}
Replaces an HH:MM:SS time stamp in the first field with a seconds since midnight value which
can be more easily plotted, computed with, etc.
{ for(i = 1; i<=NF; i++) ct[$i] += 1 }
END { for(w in ct) {
printf("%6d %s\n",ct[w],w)
}
}
This reads a file of text and creates a file containing each unique word along with the number of
occurrences of the word in the text.
NR==1 { t0=$1; tp = $1; for(i=1;i<=nv;i++) dp[i] = $(i+1);next}
{ dt=$1-tp;
tp = $1
printf("%d ",$1-t0)
for(i=1;i<=nv;i++) {
printf("%d ",($(i+1)-dp[i])/dt)
dp[i] = $(i+1)
}
printf("\n")
}
Take a set of time stamped data and convert the data from absolute time and counts to relative
time and average counts. The data is presumed to be all amenable to treatment as integers. If not,
formats better than %d must be used. nv, the number of data columns, must be set on the awk
command line.
BEGIN{ printf("set term postscript\n") > "plots"
printf("set output '|lpr -Php'\n") > "plots" }
{ if(system("test -s " $1 ".r") == 0) {   # test exits with 0 when the file exists and is not empty
print "process1 " $1 ".r " $2
printf("plot '%s.data' using 2:5 title '%s'\n",\
$1,$3) > "plots"
}
}
END { print "gnuplot < plots" }
Write a pair of set lines to a file called plots. For each input line, if a file whose name is the first
field on the line with a .r appended exists, write a command to the stdout file containing the file
name and the second field from the line; also write a plot statement to a file called plots using the
third field from the input line. After the file has been processed, add a gnuplot command to the
stdout file. If all of the output is passed to sh or csh through a pipe, the commands will be
executed.
BEGIN { l[1]=25; l[2]=20; l[3]=50 }
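/^[ABC]/ {
i = index("ABC", substr($0,1,1))
a = sprintf("%-50s", $0)   # pad with blanks to the longest target length
print substr(a,1,l[i])
next                       # keep the rule that follows from printing the line again
}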
{ print }
Make lines whose first characters are 'A', 'B', or 'C' have lengths of 25, 20, and 50 bytes
respectively, changing no other lines.
/^\+/ { hold = hold "\r" substr($0,2); next}
{ if( unfirst ) print hold
hold =""
}
/^1/ { hold = "\f" }
/^0/ { hold = "\n" }
/^-/ { hold = "\n\n" }
{ unfirst = 1
hold = hold substr($0,2)
}
END { if(unfirst) print hold }
This routine will take FORTRAN-type output with leading ANSI vertical motion indicators and
convert it to a stream with ASCII printer control sequences in it.
BEGIN { b=""; if(ll==0) ll=72 }
NF==0 { print b; b=""; print ""; next }
{ if(substr(b,length(b),1)=="-") {
b=substr(b,1,length(b)-1) $0 }
else b=b " " $0
while(length(b)>ll) {
i = ll
while(substr(b,i,1)!=" ") i--
print substr(b,1,i-1)
b = substr(b,i+1)
}
}
END { print b; print "" }
This will take an arbitrary stream of text (where paragraphs are indicated by consecutive \n) and
make all the lines approximately the same length. The default output line length is 72, but it may
be set via a parameter on the awk command line. Both long and short lines are taken care of but
extra spaces/tabs within the text are not correctly handled.
BEGIN { FS = "\t" # make tab the field separator
printf("%10s %6s %5s %s\n\n",
"COUNTRY", "AREA", "POP", "CONTINENT")
}
{ printf("%10s %6d %5d %s\n", $1, $2, $3, $4)
area = area +$2
pop = pop + $3
}
END { printf("\n%10s %6d %5d\n", "TOTAL", area, pop) }
This will take a variable width table of data with four tab separated fields and print it as a fixed
length table with headings and totals.
- Important things which will bite you
- $1 inside the awk script is not $1 of the shell script; use variable assignment
on the command line to move data from the shell to the awk script.
- Actions are within {}, not selections
- Every selection is applied to each input line after the previously selected
actions have occurred; this means that a previous action can cause unexpected
selections or selection misses.
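Two of these pitfalls in a small sketch; the variable name v and the sample records are invented:

```shell
# The shell $1 and the awk $1 are unrelated; pass shell data in as a variable.
set -- shellarg
pass=$(printf 'awkfield rest\n' | awk '{ print $1, v }' v="$1")

# An earlier action changes what later selectors see.
later=$(printf 'a b\n' | awk '
/a/ { $1 = "z" }              # rewrites the record to "z b"
/a/ { print "still a" }       # no longer matches
/z/ { print "now z" }')
printf '%s\n' "$pass" "$later"
```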
Operators
" "                The blank is the concatenation operator
+ - * / %          All of the usual C arithmetic operators: add, subtract,
                   multiply, divide and mod
== != < <= > >=    All of the usual C relational operators: equal, not equal,
                   less than, less than or equal, greater than, and greater
                   than or equal
&& ||              The C boolean operators and and or
= += -= *= /= %=   The C assignment operators
~ !~               Matches and doesn't match
?:                 C conditional value operator
^                  Exponentiation
++ --              Variable increment/decrement
Note the absence of the C bit operators &, |, << and >>
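A sketch touching most of the operators, with arbitrary values. Comparisons are evaluated in assignments rather than directly in print, because a bare > inside print means output redirection:

```shell
out=$(awk 'BEGIN {
  x = 7; y = 2
  print x y                   # juxtaposition concatenates: 72
  print x % y, x ^ y          # 1 49
  big = (x > y && y != 0) ? "yes" : "no"
  print big                   # yes
  m = ("abc" ~ /^a/); n = ("abc" !~ /^a/)
  print m, n                  # 1 0
  x += 3; x++
  print x                     # 11
}')
echo "$out"
```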
[s]printf format items
Format strings in the printf statement and sprintf function consist of three different types of items:
literal characters, escaped literal characters and format items. Literal characters are just that:
characters which will print as themselves. Escaped literal characters begin with a backslash (\)
and are used to represent control characters; the common ones are: \n for new line, \t for tab and
\r for return. Format items are used to describe how program variables are to be printed.
All format items begin with a percent sign (%). The next part is an optional length and precision
field. The length is an integer indicating the minimum field width of the item, negative if the data
is to be left-justified in the field. If the length field begins with a zero (0), then instead of
padding the value with leading blanks, the item will be padded with leading 0s. The precision is
a decimal point followed by the number of decimal digits to be displayed for various floating point
representations. Next is an optional source field size modifier, usually 'l' (ell). The last item is
the actual source data type, commonly one of the list below:
d    Integer
f    Floating point in fixed point format
e    Floating point in exponential format
g    Floating point in "best fit" format; integer, fixed point, or
     exponential, depending on exact value
s    Character string
c    Integer to be interpreted as a character
x    Integer to be printed as hexadecimal
Examples:
%-20s    Print a string in the left portion of a 20 character field
%d       Print an integer in however many spaces it takes
%6d      Print an integer in at least 6 spaces; used to format pretty output
%9ld     Print a long integer in at least 9 spaces
%09ld    Print a long integer in at least 9 spaces with leading 0s, not blanks
%.6f     Print a float with 6 digits after the decimal and as many before it
         as needed
%10.6f   Print a float in a 10 space field with 6 digits after the decimal
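A sketch that prints one value with each of several format items, with brackets added so the padding is visible. The 'l' size modifier is left out here since not every awk is guaranteed to accept it:

```shell
out=$(awk 'BEGIN {
  printf("[%-20s]\n", "left")
  printf("[%d]\n", 42)
  printf("[%6d]\n", 42)
  printf("[%06d]\n", 42)
  printf("[%.6f]\n", 3.14159265)
  printf("[%10.6f]\n", 3.14159265)
  printf("[%e]\n", 1234.5)
  printf("[%x]\n", 255)
  printf("[%c]\n", 65)
}')
echo "$out"
```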