2006-09-13 10:35:42

  • Outline

  • General structure of awk (Aho, Weinberger, and Kernighan)
    • awk, oawk, nawk, gawk, mawk
      • The original version (1977) was called awk
      • The revised version, described in the book The AWK Programming Language (1988), led to nawk
      • Unices usually ship with three different names for awk: oawk, nawk, and awk; either oawk=awk or nawk=awk.
      • gawk is the FSF version.
      • mawk is a speedier rewrite which does a partial compilation
    • The awk command line is:

awk [-F re] [program | -f programfile] [variable=value ...] [files]

  • Command line flags
    • -f file -- Read the awk script from the specified file rather than the command line
    • -F re -- Use the given regular expression re as the field separator rather than the default "white space"
    • variable=value -- Initialize the awk variable with the specified value
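
As a minimal sketch of these flags, the script below uses -F to split passwd-style records on colons and a variable=value assignment to pass a value into the program. The function name users_with_shell, the variable name want, and the sample records are made up for illustration.

```shell
# users_with_shell SHELL: read passwd-style records on stdin and print
# the login names whose last field equals SHELL (hypothetical helper).
users_with_shell() {
    # -F: makes ":" the field separator; want=... is assigned before input is read
    awk -F: '$NF == want { print $1 }' want="$1"
}

printf 'root:x:0:0:root:/root:/bin/sh\nalice:x:1000:1000::/home/alice:/bin/ksh\n' |
    users_with_shell /bin/ksh
```

Passing the value as an awk variable keeps the shell's quoting out of the awk program text.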

  • An awk program consists of one or more awk commands separated by either \n or semicolons.

  • The structure of awk commands
    • Each awk command consists of a selector and/or an action; both may not be omitted in the same command. Braces surround the action.
    • selector [only] -- action is print
    • {action}[only] -- selector is every line
    • selector {action} -- perform action on each line where selector is true
    • Each action may have multiple statements separated from each other by semicolons or \n
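
The three command forms can be sketched as follows, assuming a small made-up data set; each awk program holds exactly one command.

```shell
# Three one-command programs run over the same sample input (made-up data).
data='alpha 1
beta 22
gamma 333'

# selector only: the action defaults to print
long_lines=$(printf '%s\n' "$data" | awk 'length($0) > 7')

# action only: the selector defaults to every line
first_fields=$(printf '%s\n' "$data" | awk '{ print $1 }')

# selector and action: act only on the lines the selector picks
big_names=$(printf '%s\n' "$data" | awk '$2 > 10 { print $1 }')

printf '%s\n' "$long_lines" "$first_fields" "$big_names"
```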

  • Line selection
    • A selector is either zero, one, or two selection criteria; in the latter case the criteria are separated by commas
    • A selection criterion may be either an RE or a boolean expression (BE) which evaluates to true or false
    • Commands which have no selection criteria are applied to each line of the input data set
    • Commands which have one selection criterion are applied to every line which matches or makes true the criterion depending upon whether the criterion is an RE or a BE
    • Commands which have two selection criteria are applied to the first line which matches the first criterion, the next line which matches the second criterion and all the lines between them.
    • Unless a prior applied command has a next in it, every selector is tested against every line of the input data set.
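
A short sketch of a two-criterion selector and of next follows. The marker strings BEGIN_DATA and END_DATA and both function names are assumptions chosen for the example.

```shell
# print_section: print from the first line matching /^BEGIN_DATA/ through
# the next line matching /^END_DATA/ (marker names are made up for this sketch).
print_section() {
    awk '/^BEGIN_DATA/,/^END_DATA/'
}

# skip_comments: "next" keeps the later command from seeing comment lines.
skip_comments() {
    awk '/^#/ { next }
         { print NR ": " $0 }'
}

printf 'x\nBEGIN_DATA\n1\n2\nEND_DATA\ny\n' | print_section
printf '# note\nkeep\n# more\nalso\n' | skip_comments
```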

  • Processing
    • The BEGIN block(s) is(are) run (mawk's -v runs first)
    • Command line variables are assigned
    • For each line in the input data set
      • It is read and NR, NF, $i, etc. are set
      • For each command, its criteria are evaluated
      • If the criteria are true/match, the command is executed
    • After the input data set is exhausted, the END block(s) is(are) run
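
The processing order can be observed with a small experiment. This is a sketch; the variable name v and the two input lines are made up.

```shell
# v is assigned on the command line, so it is still empty while BEGIN runs
# but is set by the time the first line is read (v is a made-up name).
order=$(printf 'a\nb\n' |
    awk 'BEGIN { print "begin v=[" v "]" }
               { print "line " NR " v=[" v "]" }
         END   { print "end after " NR " lines" }' v=7)
printf '%s\n' "$order"
```

The BEGIN line shows an empty v, each input line shows v=[7], and END still sees the final value of NR.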

  • Elementary awk programming
    • Constants
      • Strings are enclosed in quotes (")
      • Numbers are written in the usual decimal way; non-integer values are indicated by including a period (.) in the representation.
      • REs are delimited by /

  • Variables
    • Need not be declared
    • May contain any type of data, their data type may change over the life of the program
    • Are named as any token beginning with a letter and continuing with letters, digits and underscores
    • As in C, case matters; since the built-in variables are all uppercase, avoid all-uppercase names for your own variables.
    • Some of the commonly used built-in variables are:
      • NR -- The current line's sequential number
      • NF -- The number of fields in the current line
      • FS -- The input field separator; defaults to whitespace and is reset by the -F command line parameter
  • Fields
    • Each record is separated into fields named $1, $2, etc
    • $0 is the entire record
    • NF contains the number of fields in the current line
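    • FS contains the field separator RE; it defaults to a single blank, which awk treats specially: fields are separated by runs of white space, roughly the RE /[ \t]+/, and leading white space is ignored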
    • Fields may be accessed either by $n or by $var where var contains a value between 0 and NF
  • print/printf
    • print with no values prints the current record ($0) followed by ORS (a \n by default) onto stdout; if a field has been assigned to, $0 is first rebuilt from $1 through $NF separated by OFS, whose default value is a blank
    • print value, value, ... prints the value(s) in order, separated by OFS, and then puts out a \n onto stdout; values written without commas between them are concatenated
    • printf(format,value,value,...) prints the value(s) using the format supplied onto stdout, just like C. There is no default \n for each printf so multiples can be used to build a line. There must be as many values in the list as there are item descriptors in format.
    • Values in print or printf may be constants, variables, or expressions in any order
  • Operators - awk has many of the same as C, excepting the bit operators. It also adds some text processing operators.
  • Built-in functions
    • substr(s,p,l) -- The substring of s starting at p and continuing for l characters
    • index(s1,s2) -- The first location of s2 within s1; 0 if not found
    • length(e) -- The length of e, converted to character string if necessary, in bytes
    • sin, cos -- Standard C trig functions (there is no tan; use sin(x)/cos(x))
    • atan2(y,x) -- Standard quadrant oriented arctangent function
    • exp, log -- Standard C exponential functions
    • srand(s), rand() -- Random number seed and access functions
  • Elementary examples and uses
    • length($0)>72 -- print all of the lines whose length exceeds 72 bytes
    • {$2="";print} -- blank out the second field of each line (the separators on either side of it remain)
    • {print $2} -- print only the second field of each line
    • /Ucast/{print $1 "=" $NF} -- for each line which contains the string 'Ucast' print the first field, an equal sign and the last field (awk code to create awk code; a common trick)
    • BEGIN{FS="/"};NF<4 -- using '/' as a field separator, print only those records with fewer than four fields; when applied to the output of du, gives a two level summary
    • {n++;t+=$4};END{print n " " t} -- when applied to the output of an ls -l command provides a count and total size of the listed files; I use it as part of an alias for dir. Depending on your flavor of UNIX, the $4 may need to be changed to $5.
    • $0==prv{ct++;next}{if(NR>1)printf("%8d %s\n",ct,prv);ct=1;prv=$0}END{if(NR)printf("%8d %s\n",ct,prv)} -- prints each unique record with a count of the number of occurrences of it; presumes input is sorted
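
Fields, printf and the elementary string functions can be combined as in the sketch below. The three-column record layout (name, department, salary) is an assumption made up for the example.

```shell
# Sample record layout (made up): name, department, salary.
report=$(printf 'ann sales 1200.5\nbob ops 980\n' | awk '
    {
        last = $NF                  # NF holds the index of the last field
        tag  = substr($2, 1, 2)     # first two characters of field 2
        # %-5s left-justifies the name; %.2f shows two decimal places
        printf("%-5s|%s|%.2f|%d|%d\n", $1, tag, last, length($0), index($0, $2))
    }')
printf '%s\n' "$report"
```

Each output line carries the padded name, the two-letter tag, the salary, the record length and the position at which field 2 starts.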

  • Advanced awk programming
    • Program structure (if, for, while, etc.)
      • if(boolean) statement1 else statement2 -- if the boolean expression evaluates to true execute statement1, otherwise execute statement2
      • for(v=init;boolean;v change) statement -- Standard C for loop, assigns v the value of init then while the boolean expression is true executes the statement then the v change
      • for(v in array) statement -- Assigns to v each of the values of the subscripts of array, not in any particular order, then executes statement
      • while(boolean) statement -- While the boolean expression is true, execute the statement
      • do statement while(boolean) -- execute statement, evaluate the boolean expression and if true, repeat
      • statement in any of the above constructs may be either a simple statement or a series of statements enclosed in {}, again like C; a further requirement is that the opening { must be on the line with the beginning keyword (if, for, while, do) either physically or logically via \ .
      • break -- exit from an enclosing for or while loop
      • continue -- restart the enclosing for or while loop from the top
      • next -- stop processing the current record, read the next record and begin processing with the first command
      • exit -- terminate all input processing and, if present, execute the END command
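
The control structures above read much like C. As a sketch, the hypothetical helper row_stats below sums the unsigned-integer fields of each line and classifies the sum.

```shell
# row_stats: for each line print the sum of its unsigned-integer fields and
# whether that sum is even or odd (function name is made up).
row_stats() {
    awk '{
        sum = 0
        for (i = 1; i <= NF; i++) {     # C-style for loop over the fields
            if ($i !~ /^[0-9]+$/)       # skip fields that are not unsigned integers
                continue
            sum += $i
        }
        if (sum % 2 == 0)
            kind = "even"
        else
            kind = "odd"
        print sum, kind
    }'
}

printf '1 2 3\n4 x 7\n' | row_stats
```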

  • Arrays
    • All awk arrays are associative, but they are commonly used in two ways - standard and generalized
    • Standard use takes the usual integer subscripts; awk does not fix the starting index, though split fills elements from 1 up; multidimensional subscripts such as a[i,j] are allowed
    • Generalized use takes any type of value(s) as subscripts, but the subscript(s) are treated as one long string expression.
    • The use of for(a in x) on a generalized array will return all of the valid subscripts in some order, not necessarily the one you wished.
    • The subscript separator is called SUBSEP and has a default value of "\034", a control character unlikely to appear in data
    • Elements can be deleted from an array via the delete array[subscript] statement
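
A sketch of a two-subscript array follows, assuming made-up region/product/quantity records; it shows SUBSEP, delete and the in test working together.

```shell
array_demo=$(printf 'east apples 3\nwest pears 5\neast apples 4\n' | awk '
    { total[$1, $2] += $3 }             # the subscript is $1 SUBSEP $2
    END {
        delete total["west", "pears"]   # remove one element
        for (k in total) {              # the order of k is unspecified
            split(k, part, SUBSEP)      # recover the two subscripts
            print part[1], part[2], total[k]
        }
        if (("west", "pears") in total)
            print "still there"
        else
            print "deleted"
    }')
printf '%s\n' "$array_demo"
```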

  • Built-in variables
    • FILENAME -- The name of the file currently being processed
    • OFS -- Output Field Separator default ' '
    • RS -- Input Record Separator default \n
    • ORS -- Output Record Separator default \n
    • FNR -- Current line's number with respect to the current file
    • OFMT -- Output format for printed numbers default %.6g
    • RSTART -- The location of the data matched using the match built-in function
    • RLENGTH -- The length of the data matched using the match built-in function
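
The difference between NR and FNR shows up as soon as awk reads more than one file. The sketch below uses two temporary files whose names (one.txt, two.txt) are made up.

```shell
# Create two small input files in a scratch directory.
tmp=$(mktemp -d)
printf 'a\nb\n' > "$tmp/one.txt"
printf 'c\n'    > "$tmp/two.txt"

# OFS is placed between the comma-separated values of print.
listing=$(cd "$tmp" && awk 'BEGIN { OFS = ":" } { print FILENAME, FNR, NR, $0 }' one.txt two.txt)
rm -rf "$tmp"
printf '%s\n' "$listing"
```

FNR restarts at 1 for two.txt while NR keeps counting across both files.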

  • Built-in functions
    • gsub(re,sub,str) -- replace, in str, each occurrence of the regular expression re with sub; return the number of substitutions performed
    • int(expr) -- return the value of expr with all fractional parts removed
    • match(str,re) -- return the location in str where the regular expression re occurs and set RSTART and RLENGTH; if re is not found return 0
    • split(str,arrname,sep) -- split str into pieces using sep as the separator and assign the pieces in order to the elements from 1 up of arrname; use FS if sep is not given
    • sprintf(format,value,value,...) -- write the values, as the format indicates, into a string and return that string
    • sub(re,sub,str) -- replace, in str, the first occurrence of the regular expression re with sub; return 1 if successful, 0 otherwise
    • system(command) -- pass command to the local operating system to execute and return the exit status code returned by the operating system
    • tolower(str) -- return a string similar to str with all capital letters changed to lower case
  • Other file I/O
    • print and printf may have > (or >>) filename or | command appended and the output will be sent to the named file or command; once a file is opened, it remains open until explicitly closed
    • getline var < filename will read the next line from filename into var. Again, once a file is opened, it remains so until it is explicitly closed
    • close(filename) explicitly closes the file named by the filename expression
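
The sketch below exercises several of these functions on one made-up log line, then writes a file, closes it and reads it back with getline. The variable names (out, stamp, ymd) and the sample data are assumptions for illustration.

```shell
# 1. String functions applied to a single made-up log line.
string_demo=$(echo 'ERROR 2006-09-13 disk  full' | awk '{
    n = gsub(/ +/, " ", $0)                 # squeeze runs of blanks; n = substitutions made
    match($0, /[0-9]+-[0-9]+-[0-9]+/)       # sets RSTART and RLENGTH
    date = substr($0, RSTART, RLENGTH)
    split(date, ymd, "-")                   # ymd[1]=year, ymd[2]=month, ymd[3]=day
    stamp = sprintf("%02d/%02d/%04d", ymd[3], ymd[2], ymd[1])
    sub(/^[A-Z]+/, tolower($1), $0)         # replace only the first match
    print n, RSTART, RLENGTH, stamp, $0
}')
printf '%s\n' "$string_demo"

# 2. Write records to a file, close it, then read it back with getline.
tmp=$(mktemp)
io_demo=$(printf 'x\ny\n' | awk '
    { print NR, $0 > out }                  # the file stays open across records
    END {
        close(out)                          # close it before reading it back
        while ((getline line < out) > 0)
            n++
        close(out)
        print n " lines read back, last: " line
    }' out="$tmp")
rm -f "$tmp"
printf '%s\n' "$io_demo"
```

Without the first close(out) the data may still sit in an output buffer when getline tries to read the file.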

  • Writing your own functions
    • A function begins with a function header of the form:
function name(argument(s), localvar(s)) {

    • and ends with the matching }
    • The value of the function is returned via a statement of the form:
return value

    • Functions do not have to return a value and the value returned by a function (either built-in or written locally) may be ignored by just placing the function with its arguments as a whole, separate statement
    • The local variables indicated in the localvars of the heading replace the global variables of the same name until the function completes, at which time the globals are restored
    • Functions may have side effects such as updating global variables, doing I/O or running other functions with side effects; beware the frumious bandersnatch
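
As a sketch of the local-variable convention, the hypothetical function max_field below declares i and best as extra parameters that the caller never passes, so the global i is left alone.

```shell
fn_demo=$(printf '3 9 4\n7 2\n' | awk '
    # max_field(n): largest numeric value among $1..$n.
    # i and best are locals: extra parameters the caller does not supply.
    function max_field(n,    i, best) {
        best = $1 + 0
        for (i = 2; i <= n; i++)
            if ($i + 0 > best)
                best = $i + 0
        return best
    }
    { i = "global"; print max_field(NF), i }    # the global i is untouched
')
printf '%s\n' "$fn_demo"
```

The extra spaces before the locals in the header are only a convention that makes the split between arguments and locals visible.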

  • Advanced examples and uses



{  split($1,t,":")
$1 = (t[1]*60+t[2])*60+t[3]
print
}

Replaces an HH:MM:SS time stamp in the first field with a seconds since midnight value which can be more easily plotted, computed with, etc.



    { for(i = 1; i<=NF; i++) ct[$i] += 1 }   # count each word
END { for(w in ct) {
printf("%6d %s\n",ct[w],w)   # one word per output line
}
}

This reads a file of text and prints each unique word along with the number of occurrences of the word in the text.



NR==1 { t0=$1; tp=$1; for(i=1;i<=nv;i++) dp[i]=$(i+1); next }   # first record is the baseline
{ dt=$1-tp   # nv, the number of data columns, is set on the command line
tp = $1
printf("%d ",$1-t0)
for(i=1;i<=nv;i++) {
printf("%d ",($(i+1)-dp[i])/dt)
dp[i] = $(i+1)
}
printf("\n")
}

Take a set of time stamped data and convert the data from absolute time and counts to relative time and average counts. The data is presumed to be all amenable to treatment as integers. If not, formats better than %d must be used.



BEGIN{  printf("set term postscript\n") > "plots"
printf("set output '|lpr -Php'\n") > "plots" }
{ if(system("test -s " $1 ".r") == 0) {   # exit status 0: the file exists and is not empty
print "process1 " $1 ".r " $2
printf("plot '%s.data' using 2:5 title '%s'\n",
$1,$3) > "plots"   # the file is already open, so > keeps appending
}
}
END { print "gnuplot < plots" }

Write a pair of set lines to a file called plots. For each input line, if a file whose name is the first field on the line with a .r appended exists and is not empty, write a command to the stdout file containing the file name and the second field from the line; also write a plot statement to a file called plots using the third field from the input line. After the file has been processed, add a gnuplot command to the stdout file. If all of the output is passed to sh or csh through a pipe, the commands will be executed.



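BEGIN  { l[1]=25; l[2]=20; l[3]=50 }
/^[ABC]/ {
i = index("ABC", substr($0,1,1))
a = sprintf("%-50s", $0)   # pad with blanks to the longest target length
print substr(a,1,l[i])
next   # do not fall through to the print below
}
{ print }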

Make lines whose first characters are 'A', 'B', or 'C' have lengths of 25, 20, and 50 bytes respectively, changing no other lines.



/^\+/ { hold = hold "\r" substr($0,2); next}
{ if( unfirst ) print hold
hold =""
}
/^1/ { hold = "\f" }
/^0/ { hold = "\n" }
/^-/ { hold = "\n\n" }
{ unfirst = 1
hold = hold substr($0,2)
}
END { if(unfirst) print hold }

This routine will take FORTRAN-type output with leading ANSI vertical motion indicators and convert it to a stream with ASCII printer control sequences in it.



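BEGIN  { b=""; if(ll==0) ll=72 }
NF==0 { print b; b=""; print ""; next }
{ if(substr(b,length(b),1)=="-") {
b=substr(b,1,length(b)-1) $0 }
else if(b=="") b=$0
else b=b " " $0
while(length(b)>ll) {
i = ll
while(i>0 && substr(b,i,1)!=" ") i--   # back up to the last blank at or before column ll
if(i==0) i = index(b," ")   # no blank that early: break after the long word instead
if(i==0) break   # a single unbroken word; wait for more input
print substr(b,1,i-1)
b = substr(b,i+1)
}
}
END { print b; print "" }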

This will take an arbitrary stream of text (where paragraphs are indicated by consecutive \n) and make all the lines approximately the same length. The default output line length is 72, but it may be set via a parameter on the awk command line. Both long and short lines are taken care of but extra spaces/tabs within the text are not correctly handled.



BEGIN {	FS = "\t"   # make tab the field separator
printf("%10s %6s %5s %s\n\n",
"COUNTRY", "AREA", "POP", "CONTINENT")
}
{ printf("%10s %6d %5d %s\n", $1, $2, $3, $4)
area = area +$2
pop = pop + $3
}
END { printf("\n%10s %6d %5d\n", "TOTAL", area, pop) }

This will take a variable width table of data with four tab separated fields and print it as a fixed length table with headings and totals.



  • Important things which will bite you
    • $1 inside the awk script is not $1 of the shell script; use variable assignment on the command line to move data from the shell to the awk script.
    • Actions are within {}, not selections
    • Every selection is applied to each input line after the previously selected actions have occurred; this means that a previous action can cause unexpected selections or selection misses.

Operators

" "                The blank is the concatenation operator
+ - * / %          All of the usual C arithmetic operators: add, subtract,
                   multiply, divide and mod
== != < <= > >=    All of the usual C relational operators: equal, not
                   equal, less than, less than or equal, greater than,
                   and greater than or equal
&& ||              The C boolean operators "and" and "or"
= += -= *= /= %=   The C assignment operators
~ !~               Matches and doesn't match
?:                 C conditional value operator
^                  Exponentiation
++ --              Variable increment/decrement

Note the absence of the C bit operators &, |, << and >>
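
A few of these operators in one expression-heavy sketch; the input line and the variable names are made up.

```shell
ops_demo=$(echo 'report.txt 7' | awk '{
    name = $1 "." "bak"                         # adjacent values concatenate
    kind = ($1 ~ /\.txt$/) ? "text" : "other"   # match operator inside ?:
    print name, kind, $2 % 4, 2 ^ 10, ($2 > 5 && $2 != 8)
}')
printf '%s\n' "$ops_demo"
```

A boolean expression used as a value yields 1 for true and 0 for false.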

[s]printf format items

Format strings in the printf statement and sprintf function consist of three different type of items: literal characters, escaped literal characters and format items. Literal characters are just that: characters which will print as themselves. Escaped literal characters begin with a backslash (\) and are used to represent control characters; the common ones are: \n for new line, \t for tab and \r for return. Format items are used to describe how program variables are to be printed.

All format items begin with a percent sign (%). The next part is an optional length and precision field. The length is an integer indicating the minimum field width of the item, negative if the data is to be left-justified within the field. If the length field begins with a zero (0), then instead of padding the value with leading blanks, the item will be padded with leading 0s. The precision is a decimal point followed by the number of decimal digits to be displayed for various floating point representations. Next is an optional source field size modifier, usually 'l' (ell). The last item is the actual source data type, commonly one of the list below:

     d     Integer
     f     Floating point in fixed point format
     e     Floating point in exponential format
     g     Floating point in "best fit" format; integer, fixed
           point, or exponential, depending on exact value
     s     Character string
     c     Integer to be interpreted as a character
     x     Integer to be printed as hexadecimal

Examples:

   %-20s   Print a string in the left portion of a 20 character
           field
   %d      Print an integer in however many spaces it takes
   %6d     Print an integer in at least 6 spaces; used to format
           pretty output
   %9ld    Print a long integer in at least 9 spaces
   %09ld   Print a long integer in at least 9 spaces with leading
           0s, not blanks
   %.6f    Print a float with 6 digits after the decimal and as
           many before it as needed
   %10.6f  Print a float in a 10 space field with 6 digits after
           the decimal
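
The format items can be tried directly from a BEGIN block, which needs no input. In this sketch the brackets are only there to make the padding visible, and the values are made up.

```shell
fmt_demo=$(awk 'BEGIN {
    printf("[%-8s]\n", "left")      # left-justified in 8 columns
    printf("[%6d]\n", 42)           # right-justified, padded with blanks
    printf("[%06d]\n", 42)          # padded with leading zeros
    printf("[%.3f]\n", 3.14159)     # 3 digits after the decimal point
    printf("[%10.4f]\n", 3.14159)   # 10 columns wide, 4 digits after the point
    printf("[%x]\n", 255)           # hexadecimal
}')
printf '%s\n' "$fmt_demo"
```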
 