awk programming: August 2008

Friday, 22 August 2008

Gnu Awk - gawk - Manual

This is a very useful resource - although it's specifically talking about gawk most of it will be applicable to other variations of awk:

http://www.gnu.org/software/gawk/manual/

Thursday, 21 August 2008

awk: terms and descriptions

I found this article by IBM that very clearly describes the different components of an awk program:

http://www.ibm.com/developerworks/library/l-awk1.html

EXAMPLES # is the comment character for awk. 'field' means 'column'

# Print first two fields in opposite order:
awk '{ print $2, $1 }' file

# Print lines longer than 72 characters:
awk 'length > 72' file

# Print length of string in 2nd column
awk '{print length($2)}' file

# Add up first column, print sum and average:
{ s += $1 }
END { print "sum is", s, " average is", s/NR }

# Print fields in reverse order:
awk '{ for (i = NF; i > 0; --i) print $i }' file

# Print the last line
{line = $0}
END {print line}

# Print the total number of lines that contain the word Pat
/Pat/ {nlines = nlines + 1}
END {print nlines}

# Print all lines between start/stop pairs:
awk '/start/, /stop/' file

# Print all lines whose first field is different from previous one:
awk '$1 != prev { print; prev = $1 }' file

# Print column 3 if column 1 > column 2:
awk '$1 > $2 {print $3}' file

# Print line if column 3 > column 2:
awk '$3 > $2' file

# Count number of lines where col 3 > col 1
awk '$3 > $1 {print i + "1"; i++}' file

# Print sequence number and then column 1 of file:
awk '{print NR, $1}' file

# Print every line after erasing the 2nd field
awk '{$2 = ""; print}' file

# Print hi 28 times
yes | head -28 | awk '{ print "hi" }'

# Print hi.0010 to hi.0099 (NOTE IRAF USERS!)
yes | head -90 | awk '{printf("hi.00%2.0f\n", NR+9)}'

# Print out 4 random numbers between 0 and 1
yes | head -4 | awk '{print rand()}'

# Print out 40 random integers modulo 5
yes | head -40 | awk '{print int(100*rand()) % 5}'

# Replace every field by its absolute value
{ for (i = 1; i <= NF; i=i+1) if ($i < 0) $i = -$i; print }

# Print lprm commands for job numbers 875 down to 834
BEGIN {
for (i=875;i>833;i--){
printf "lprm -Plw %d\n", i
} exit
}

Formatted printouts are of the form printf( "format\n", value1, value2, ... valueN)
e.g. printf("howdy %-8s What it is bro. %.2f\n", $1, $2*$3)
%s = string
%-8s = 8 character string left justified
%.2f = number with 2 places after .
%6.2f = field 6 chars with 2 chars after .
\n is newline
\t is a tab
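
To see those format codes side by side, here's a small sketch. The sample values are made up for illustration, and the square brackets are only there to make the padding visible:

```shell
awk 'BEGIN {
    printf("[%s]\n", "abc")        # plain string
    printf("[%-8s]\n", "abc")      # left justified in 8 characters
    printf("[%.2f]\n", 3.14159)    # 2 places after the point
    printf("[%6.2f]\n", 3.14159)   # 6 characters wide, 2 after the point
}'
```

This should print [abc], [abc     ], [3.14] and [  3.14], one per line.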

# Print frequency histogram of column of numbers
$2 <= 0.1 {na=na+1}
($2 > 0.1) && ($2 <= 0.2) {nb = nb+1}
($2 > 0.2) && ($2 <= 0.3) {nc = nc+1}
($2 > 0.3) && ($2 <= 0.4) {nd = nd+1}
($2 > 0.4) && ($2 <= 0.5) {ne = ne+1}
($2 > 0.5) && ($2 <= 0.6) {nf = nf+1}
($2 > 0.6) && ($2 <= 0.7) {ng = ng+1}
($2 > 0.7) && ($2 <= 0.8) {nh = nh+1}
($2 > 0.8) && ($2 <= 0.9) {ni = ni+1}
($2 > 0.9) {nj = nj+1}
END {print na, nb, nc, nd, ne, nf, ng, nh, ni, nj, NR}

# Find maximum and minimum values present in column 1
NR == 1 {m=$1 ; p=$1}
$1 >= m {m = $1}
$1 <= p {p = $1}
END { print "Max = " m, " Min = " p }

# Example of defining variables, multiple commands on one line
NR == 1 {prev=$4; preva = $1; prevb = $2; n=0; sum=0}
$4 != prev {print preva, prevb, prev, sum/n; n=0; sum=0; prev = $4; preva = $1; prevb = $2}
$4 == prev {n++; sum=sum+$5/$6}
END {print preva, prevb, prev, sum/n}

# Example of defining and using a function, inserting values into an array
# and doing integer arithmetic mod(n). This script finds the number of days
# elapsed since Jan 1, 1901. (from http://www.netlib.org/research/awkbookcode/ch3)
function daynum(y, m, d, days, i, n) { # 1 == Jan 1, 1901
split("31 28 31 30 31 30 31 31 30 31 30 31", days)
# 365 days a year, plus one for each leap year
n = (y-1901) * 365 + int((y-1901)/4)
if (y % 4 == 0) # leap year from 1901 to 2099
days[2]++
for (i = 1; i < m; i++)
n += days[i]
return n + d
}
{ print daynum($1, $2, $3) }

# Example of using substrings
# substr($2,9,7) picks out characters 9 thru 15 of column 2
{print "imarith", substr($2,1,7) " - " $3, "out."substr($2,5,3)}
{print "imarith", substr($2,9,7) " - " $3, "out."substr($2,13,3)}
{print "imarith", substr($2,17,7) " - " $3, "out."substr($2,21,3)}
{print "imarith", substr($2,25,7) " - " $3, "out."substr($2,29,3)}

awk: more simple examples

First, suppose you have a file called 'file1' that has 2 columns of numbers, and you want to make a new file called 'file2' that has columns 1 and 2 as before, but also adds a third column which is the ratio of the numbers in columns 1 and 2. Suppose you want the new 3-column file (file2) to contain only those lines with column 1 smaller than column 2. Either of the following two commands does what you want:

awk '$1 < $2 {print $0, $1/$2}' file1 > file2

-- or --

cat file1 | awk '$1 < $2 {print $0, $1/$2}' > file2

Let's look at the second one. You all know that 'cat file1' prints the contents of file1 to your screen. The | (called a pipe) directs the output of 'cat file1', which normally goes to your screen, to the command awk. Awk considers the input from 'cat file1' one line at a time, and tries to match the 'pattern'. The pattern is whatever is between the first ' and the {; in this case the pattern is $1 < $2, so we have asked awk to check if the first column is less than the second. If the pattern is false, awk goes on to the next line. If the pattern is true, awk does whatever is in the {}. If there is no pattern, awk assumes the pattern is true, and goes on to the action contained in the {}.

What is the action? Almost always it is a print statement of some sort. In this case we want awk to print the entire line, i.e. $0, and then print the ratio of columns 1 and 2, i.e. $1/$2. We close the action with a }, and close the awk command with a '. Finally, to store the final 3-column output into file2 (otherwise it prints to the screen), we add a '> file2'.

As a second example, suppose you have several thousand files you want to move into a new directory and rename by appending a .dat to the filenames. You could do this one by one (several hours), or use vi to make a decent command file to do it (several minutes), or use awk (several seconds). Suppose the files are named junk* (* is wildcard for any sequence of characters), and need to be moved to ../iraf and have a '.dat' appended to the name. To do this type

ls junk* | awk '{print "mv "$0" ../iraf/"$0".dat"}' | csh

ls junk* lists the filenames, and this output is piped into awk instead of going to your screen. There is no pattern (nothing between the ' and the {), so awk proceeds to print something for each line. For example, if the first two lines from 'ls junk*' produced junk1 and junk2, respectively, then awk would print:

mv junk1 ../iraf/junk1.dat
mv junk2 ../iraf/junk2.dat

At this point the mv commands are simply printed to the screen. To execute the command we take the output of awk and pipe it back into the operating system (the C-shell). Hence, to finish the statement we add a ' | csh'.

More complex awk scripts need to be run from a file. The syntax for such cases is:

cat file1 | awk -f a.awk > file2

where file1 is the input file, file2 is the output file, and a.awk is a file containing awk commands. Examples below that contain more than one line of awk need to be run from files.

Some useful things awk defines for you are the variables NF (number of columns on the current line) and NR (the number of the line that awk is working on), the special patterns END (matches once awk reaches the EOF) and BEGIN (matches before awk reads anything), and the function length (number of characters in a line or a string). There is also looping capability, a search (/) command, a substring command (extremely useful), and formatted printing available. There are logical operators || (or) and && (and) that can be used in 'pattern'. You can define and manipulate your own user defined variables. Examples are outlined below. The only bug I know of is that Sun's version of awk won't do trig functions, though it does do logs. There is something called gawk (a Gnu product), which does a few more things than Sun's awk, but they are basically the same. Note the use of the 'yes' command below. Coupled with 'head' and 'awk' you save an hour of typing if you have a lot of files to analyze or rename.
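
Here's a small sketch that puts BEGIN, END, NF, NR and length together. The two input lines are made up for illustration:

```shell
printf 'a b c\nd e\n' | awk '
BEGIN { print "start" }               # runs once, before any input is read
      { print NR, NF, length($0) }    # runs for every line
END   { print "lines read:", NR }     # runs once, after the last line
'
```

It should print "start", then "1 3 5" and "2 2 3" (line number, number of columns, number of characters), then "lines read: 2".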

awk: appending fields from one row to the preceding row

How to append fields from one row on to the end of the preceding row - the awk one isn't completely tested yet - this gives the output the other way round...

Try...

paste -d" " - - < infile > outfile

Or...

awk -v RS="" '{$1=$1};1' infile > outfile

If anyone knows how to improve this please let me know! ...Also I'm not sure why the '-v' variable declaration is required?
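
On the '-v' question, as far as I can tell it isn't special to this command: -v just sets an awk variable before the program starts, so -v RS="" does the same job as BEGIN { RS = "" }. With RS set to the empty string awk treats each block of lines separated by blank lines as one record, and $1=$1 makes awk rebuild the record with single spaces between the fields, so each block comes out as one line. If the goal is to join each pair of lines instead, something like this might do (a sketch using made-up sample data):

```shell
printf 'a 1\nb 2\nc 3\nd 4\n' | awk '
NR % 2 { printf "%s ", $0; next }   # odd line: print it without a newline
       { print }                    # even line: finish the output line
'
```

For the four sample lines this should give "a 1 b 2" and "c 3 d 4".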

Tuesday, 19 August 2008

awk: numerical addition and grouping

The following script adds up the numerical data in a column ($3), grouped by the value of $1.

awk -F'#' 'BEGIN {}
{
sum[$1] += $3
} END {
for ( i in sum ) print i" : "sum[i]
}' $1

Notice how I'm making the value of the array sum[$1] the addition of the numerical values found at field three ($3):

sum[$1] += $3

Essentially every time awk reads a line it adds the value of $3 to the array element whose index is that line's $1.
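
A quick way to try the idea out, using made-up '#' separated data. The pipe into sort is only there because awk doesn't promise any particular order for 'for ( i in sum )':

```shell
printf 'a#x#1\nb#x#2\na#x#3\n' |
awk -F'#' '{ sum[$1] += $3 } END { for ( i in sum ) print i" : "sum[i] }' |
sort
```

That should print "a : 4" and "b : 2".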

awk: printing from a specified field onwards

OK, this is way too complicated when a simple substr() on $0 will suffice but I like it!

awk '{for(i=1;i<4;i++)$i="";sub(/^ */,"");print}' error.txt | sort -u

Notice that it's piped into a sort because awk doesn't have its own sort command.

All this command is doing is blanking out fields one to three and printing everything from field four onwards.

Below is a modification of the above command that prints only the unique lines (based on all fields from field four onwards):

awk '{for(i=1;i<4;i++)$i=""; sub(/^ */,""); err[$0] = $0 } END { for ( u in err ) print err[u] }' error.txt

Notice that $0 is the modified $0 without fields one thru three.
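
To check which fields survive, here's the same command run on a single made-up line:

```shell
echo "a b c d e" | awk '{ for (i = 1; i < 4; i++) $i = ""; sub(/^ */, ""); print }'
```

The loop empties fields one to three, the sub() strips the leading spaces they leave behind, and what should be printed is "d e".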

awk: working out thresholds

The following code was written to work out threshold breaches - the input is standard sar data.
The interesting part is the alert.awk script at the bottom.

The shell script wrapper defines the following variables that are passed to the awk code:

${i} - is the system name
amb - is the amber alert threshold value
count - is the number of breaches in a row that has to be exceeded before an alert is reported

count=`grep ${i} alert_cpu.conf|awk -F, '{ print $4 }'`

amb=`grep ${i} alert_cpu.conf|awk -F, '{ print $2 }'`

awk -v val=${amb} -v count=${count} -f alert.awk cpu2.dat >>detailed_alert_report_${DAY}.txt

alert.awk script:

$(NF)>val {c++; o=o $0 ORS; next}

c>count {printf "%s",o; C++}

{c=0; o=""}

END {if (c>count) {printf "%s",o; C++} print C+0}
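
Here's a sketch of how alert.awk behaves, with the script inlined and a few made-up lines standing in for the sar data (the last field is the value being checked; val=80 and count=2 are arbitrary):

```shell
printf 'h 1 90\nh 2 95\nh 3 85\nh 4 10\nh 5 90\nh 6 20\n' |
awk -v val=80 -v count=2 '
$(NF) > val { c++; o = o $0 ORS; next }   # breach: add the line to the current run
c > count   { printf "%s", o; C++ }       # run just ended and was long enough
            { c = 0; o = "" }             # any non-breach line resets the run
END         { if (c > count) { printf "%s", o; C++ } print C+0 }
'
```

The first three lines form a run of three breaches, which is more than count, so they should be printed when line four ends the run. The single breach on line five is too short to report. The last line of output is the number of runs reported, here 1.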

awk: putting blank lines between non-alike lines

I had some output that was really difficult to read and so I wanted to split the output into blocks that were similar - where they had the same $1 value. This fab little awk command puts a blank line between each block (multiple lines) where $1 changes.

(don't ask me how it works tho because I got it off a really nice chap at www.unix.com)

awk 'x[$1]++||$0=NR==1?$0:RS $0' test2

If anyone can explain it to me please feel free! :)
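
For what it's worth, here is my reading of that one-liner; treat it as a sketch rather than the final word. awk appears to treat it as x[$1]++ || ($0 = (NR==1 ? $0 : RS $0)). If the $1 value has been seen before, x[$1]++ is non-zero, the pattern is true and the line is printed untouched. The first time a $1 value turns up, x[$1]++ is zero, so awk goes on to the assignment, which puts a newline (RS) in front of the line unless it is the very first line of the file. The non-empty result counts as true, so the line is printed with a blank line ahead of it. A longer version that should behave the same way on grouped input (the array name 'seen' and the sample data are mine):

```shell
printf 'a 1\na 2\nb 3\nb 4\nc 5\n' | awk '
NR > 1 && !($1 in seen) { print "" }   # new $1 value (not on line 1): blank line first
{ seen[$1] = 1; print }                # remember the value, then print the line
'
```

With the sample data this should print the two 'a' lines, a blank line, the two 'b' lines, a blank line and then the 'c' line.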

awk: handy one-liners

This is a great resource for simple command line awk:

http://student.northpark.edu/pemente/awk/awk1line.txt

awk: getting data based on previous lines

Here's a little awk script that I wrote to collect data from a field where its heading is in the line above (both are at $4 - space separated). The input file looks like this:

ARTMGA01_usage_07Mar07-1500:# CPU SUMMARY
ARTMGA01_usage_07Mar07-1500-# USER SYS IDLE WAIT INTR SYSC CS RUNQ AVG5 AVG30 AVG60 FORK VFORK
ARTMGA01_usage_07Mar07-1500- 4 6 90 0 1002 7214 9093 0 0.21 0.18 0.17 2.73 0.59
ARTMGA01_usage_07Mar07-1500:# CPU SUMMARY
ARTMGA01_usage_07Mar07-1500-# USER SYS IDLE WAIT INTR SYSC CS RUNQ AVG5 AVG30 AVG60 FORK VFORK
ARTMGA01_usage_07Mar07-1500- 4 9 87 0 905 7552 8530 1 0.32 0.21 0.18 2.20 0.40
ARTMGA01_usage_07Mar07-2100:# CPU SUMMARY
ARTMGA01_usage_07Mar07-2100-# USER SYS IDLE WAIT INTR SYSC CS RUNQ AVG5 AVG30 AVG60 FORK VFORK
ARTMGA01_usage_07Mar07-2100- 4 5 90 0 1052 7777 9492 0 0.19 0.18 0.18 2.65 0.59

Here's the actual code:

BEGIN {}
{
if ($4 ~ /IDLE/) {
getline
idleval = idleval + $4
count++
}
}
END {
print "Total: " idleval
print "Count: " count
print "Average: " idleval / count
}

There's nothing new in this code that we haven't already covered but it's a good 'simple' example of how awk's getline function can be used. We essentially do a search for 'IDLE' then skip to the next line with getline and collect the value at $4 (field four). The 'count' value is incremented for every line that matches 'IDLE' which is how we can then work out the average:

print "Average: " idleval / count

('idleval' divided by 'count').
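
The same totals can be had without getline by setting a flag on the heading line and picking up the value on the line that follows. This is only a sketch: the variable name 'grab' is mine, the input is a cut-down made-up version of the file above, and the average is guarded so that nothing is divided by zero when there are no matches:

```shell
printf 'x:# CPU SUMMARY\nx-# USER SYS IDLE WAIT\nx- 4 6 90 0\nx-# USER SYS IDLE WAIT\nx- 4 9 87 0\n' |
awk '
grab         { idleval += $4; count++; grab = 0; next }   # the line after a heading
$4 == "IDLE" { grab = 1 }                                 # heading: flag the next line
END {
    print "Total: " idleval
    print "Count: " count
    if (count) print "Average: " idleval / count
}'
```

For the sample input this should report a total of 177, a count of 2 and an average of 88.5.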

awk: first, last and count of a given string

This script is a modification of the code used in the previous post. Below is the code for an awk shell wrapper script (awkscript) that finds the first, last and count of occurrences for a given string. The script would be run on the command line like this:

$ awkscript input.file string

Here's the script (actually a shell script remember and not a pure awk script):

awk -F, '

/'"$2"'/ {
if ( min == "" ) {
min = $0
line = NR
}
lines[last] = $0
total++
lastline = NR
}
END {
print " "
print "First Occurrence: " line
print " "
print min
print " "
print "Last Occurrence: " lastline
print " "
print lines[last]
print " "
print "Total Matches: "total
print " "
}' $1

The interesting thing to notice with this script is how we pass the shell variable $2 (not to be confused with the awk variable $2) to the awk command. To get it inside the awk pattern match '//' we close the single-quoted awk program, put the shell variable inside double quotes, and then reopen the single quotes: /'"$2"'/. The shell drops the search string into the program text before awk ever sees it.
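
Another way to get a shell value into awk is the -v flag together with the ~ match operator, which avoids the quote juggling. The sketch below is not the script above: the variable names (pat, first, firstnr and so on) and the sample data are mine, and "foo" stands in for the shell's "$2":

```shell
printf 'a,foo\nb,bar\nc,foo\nd,foo\n' |
awk -F, -v pat="foo" '
$0 ~ pat {
    if (!total) { first = $0; firstnr = NR }   # remember the first match only
    last = $0; lastnr = NR                     # overwritten on every match
    total++
}
END {
    print "First Occurrence: " firstnr " " first
    print "Last Occurrence: " lastnr " " last
    print "Total Matches: " total
}'
```

This should report line 1 (a,foo) as the first occurrence, line 4 (d,foo) as the last, and 3 matches in total. One thing to watch with -v is that awk processes backslash escapes in the value, so patterns containing backslashes may need extra care.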

awk: finding the last occurrence of a string

OK, these awk commands use the same theory as when I introduced arrays. Here are some examples for finding the last occurrence of a string:

awk -F, '/SMPP Enquire Link timeout/ { lines[last] = $0 } END { print lines[last] }' 070403.OLG

awk -F, '/No linkset found for rk_pc/ { lines[last] = $0 } END { print lines[last] }' messages

awk -F, '/Verification Tag mismatch... packet dropped/ { lines[last] = $0 } END { print lines[last] }' messages

The above was based on this which gets the last distinct line based on field $2 (which we're using as our array index - the value of the array component is set to $0):

{
lines[$2] = $0
}
END {
for (i in lines)
print lines[i]
}

This awk script gets distinct lines from a file based on field $2.

awk: pipes and file stuff

Here I'm using the output of a standard *nix 'ls' command and piping it into various awk commands to give me totals for specific fields such as file size. Notice also that I'm using the printf command to help better display the output.

How to get the total size of all the files in your current directory and directories within
your current directory (recursive):

ls -lrtR | awk '{sum += $5;} END {print sum;}'

As an extension to the above, this version formats the output (notice that we're actually defining the BEGIN statement here - you don't actually have to do this but it makes the code easier to read):

ls -lrtR alg | awk 'BEGIN { printf "Directory\t : Size\n" } {sum += $5;} END {printf "alg\t\t : " sum"\n";}'

Which provides output that looks like this:

Directory : Size
alg : 120467

And as another extension to this; here's how you script it so that it can take more than one
directory:

printf "Directory\t : Size\n"

for arg
do

ls -lrtR "$arg" | awk '{sum += $5;} END {printf "'"$arg"'\t\t : " sum"\n";}'

done

$ awkdir alg alg1 fred
Directory : Size
alg : 120467
alg1 : 123209
fred : 123209

awk: finding the count of a field

OK, so I had this file that had different types of events in it and I wanted to find the count of how many occurrences there were for each event type. The file was comma separated and the events were at $1 (the first field) and looked like this:

TRAF:5
TRAF:8
TRAF:3

Here's the awk command that got the result I was after:

awk -F, '{ te[$1]++ } END { for ( i in te ) print i" : " te[i] }' traf.test

Here's what it's doing:

The file delimiter flag '-F,' we're already familiar with. This tells awk that the file is comma separated.

Now, in the next bit we're introducing awk arrays for the first time: "te[$1]++"

We're creating an array called 'te', you can call this anything you like. We're creating an index in our array based on the contents of $1 (the first field) which is our traffic event type. The double-plus signs are saying that the value of te[$1] is to be incremented. So, what happens is that an array index is created for every unique value found in $1. That means that we've captured all the different possibilities of $1 without doing much work at all. When awk finds another example of that same index value ($1) it increments the value of that array component.

Once we've gone through the whole file we get to the END section of the code. Here we're seeing a for loop for the first time:

for (i in te) print i" : " te[i]

This loop iterates through the array and prints out the index of the array (i) and then the value of that component. We end up with a print out of each unique value found at $1 and then a count of the occurrences of that value.
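
Here's the command run against a few made-up lines so you can see the counts come out. The second field on each line is invented, and the sort on the end is only there because awk doesn't guarantee the order in which 'for (i in te)' visits the array:

```shell
printf 'TRAF:5,x\nTRAF:8,y\nTRAF:5,z\n' |
awk -F, '{ te[$1]++ } END { for ( i in te ) print i" : " te[i] }' |
sort
```

This should print "TRAF:5 : 2" and "TRAF:8 : 1".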

Monday, 18 August 2008

awk: find a pattern then return the previous line

Here's an awk script (it's a bit long to call it a command) that looks for a pattern and then returns the previous line. I attempted this when someone told me it couldn't be done with awk - I just did it to prove a point and spent way too much time figuring it out!

Command line:

awk -v N=PATTERN -f getprevline1 testinput.dat

Contents of getprevline1:

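BEGIN {
# read the first line of the input here, so the main loop starts at line 2
getline line1
}
{
# the second read of the same file runs one line behind the main input
if ((getline line < ARGV[1]) > 0 && $0 ~ N ) {
print line
}
}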

You can see that we're using the -v flag again to define a variable called N with the value 'PATTERN'. We're also using the -f (lowercase) flag for the first time. This flag calls the awk command file getprevline1.

We're also bombarded with the getline awk command which is something that shouldn't be dabbled with lightly. In one of the awk books I have it essentially says don't use getline until you've mastered everything else!
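
For comparison, the same job can be done without getline by remembering the previous line in a variable. This is a sketch: the variable name 'prev' and the sample input are mine:

```shell
printf 'one\ntwo\nPATTERN here\nfour\n' |
awk -v N=PATTERN '
$0 ~ N && NR > 1 { print prev }   # match: print the line remembered last time round
                 { prev = $0 }    # remember this line for the next one
'
```

With the sample input it should print "two", the line before the one containing PATTERN.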

awk: find a field and replace it

Sometimes you'll need to search for a specific field and replace it with something else. Here's some awk code written out as it would appear within a script:

# find a field and replace it - then print the first space delimited field.
/447782000913/ {
gsub(/447782000913/, $2)
}
{ print $1 }

Here's the same command written out in one line as we're used to:

awk '/447782000913/ {gsub(/447782000913/, $2) }{ print $1 }' file.txt

We're introduced to a new built-in function of awk that appears really useful but in reality I've actually only used it a couple of times - gsub().

What's happening in the command is that we're searching for a pattern match. Anything between two forward-slashes '/' is pattern matched. When this match is made the gsub() command is carried out. This gsub() command takes every occurrence in the current line of the pattern between its two forward slashes and replaces it with what it finds in field two ($2).
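
Here's the command run on two made-up lines, one that matches and one that doesn't:

```shell
printf '447782000913 abc\nxyz def\n' |
awk '/447782000913/ { gsub(/447782000913/, $2) } { print $1 }'
```

On the first line the number is replaced by "abc" (the contents of $2), so the first field becomes "abc". The second line doesn't match and is left alone, so the output should be "abc" followed by "xyz".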

awk: finding a missing field

You may come across a situation where not every field in a file has been populated as you'd expect. Now if this file is big you'll want a quick way to check to see if any records have missing fields. Here's a really simple example:

Input file:

44777123123,447786647774,447788772233
44777798982,,448879878873
4477334499882,44998878788,447887818733

awk -F',' '!$2 { print $0 }' test.txt

44777798982,,448879878873

As you can see we have a comma separated input file and we tell awk about it with the -F',' flag. Then we perform a test on every line of the file to determine if the 2nd field is empty (!$2). If awk matches this pattern then it outputs the whole record ($0). Be aware that !$2 is also true when the 2nd field is the number 0.
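
If a 0 in the 2nd field is valid data in your file, comparing against the empty string is a stricter test. A small sketch with made-up input showing the difference:

```shell
# !$2 treats both an empty field and a 0 as missing
printf '1,2,3\n4,,6\n7,0,9\n' | awk -F',' '!$2 { print $0 }'

# $2 == "" only matches the field that really is empty
printf '1,2,3\n4,,6\n7,0,9\n' | awk -F',' '$2 == "" { print $0 }'
```

The first command should print both "4,,6" and "7,0,9", the second only "4,,6".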

awk variables

The awk programming language starts to get really useful when you start building some logic into it. Although the examples given here are simple one-line commands they contain some of the building blocks with which you can really start to build complex awk programs.

Take a look at this code:

awk -v x=0 'NF != 6 { ++x } END { print x, NR }' file.txt

Firstly we're presented with a new flag '-v'. This tells awk that the next parameter on the command line is going to be a variable that we want to pass into the awk command. In this case we're defining x to be zero.

The next part of the awk command says that whenever we find a line that doesn't have six fields increment x by one (++x).

Then we see END which we've not come across before. It splits the command into two separate parts. Everything before the END is performed on every line in the input file. The action after the END runs once, after the last line has been read. So, the END statement in this command says print the value of x after every line has been checked against our test (NF != 6) and then print NR. NR is also new to us; it merely means Number-of-Records, the number of the record (line) awk is currently on, so in the END part it holds the number of the last line in the file.

So, if you're in need of an awk command that will give you a count of how many lines there are in a file that don't have a specific number of fields and you wish to know how many lines (records) there are in that file then this is the command for you! ;)
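
A quick check of the command with three made-up lines, only one of which has the wrong number of fields:

```shell
printf 'a b c d e f\n1 2 3\n1 2 3 4 5 6\n' |
awk -v x=0 'NF != 6 { ++x } END { print x, NR }'
```

It should print "1 3": one line without six fields, out of three lines in total.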

simple awk

You've got to start somewhere so here's the most simple awk command:

$ awk '{ print $0 }' file.txt

So, what does it do? Well it prints out every line of the input file 'file.txt'. Not very impressive I know but if you're presented with a file that has many columns such as the one below and you're only interested in one field then the above command can be easily modified to show just that one field.

Example file data (file.txt):

one two three four five six
1 2 3 4 5 6
one two three four five six
1 2 3 4 5 6
one two three four five six

New awk command:

awk '{ print $3 }' file.txt

three
3
three
3
three

What this new command has done (by changing the $0 to a $3) has just selected the third column for printing.

So, $0 (that's a zero) refers to the whole record or line (we can actually change the definition of record so that it means something else but more on that later) and $3 refers specifically to the third field.

If you wish to extract more than one field or change the order of the fields you can do that. In the following example I extract two fields and change their order, I also change the way that awk presents the resultant output.

awk '{ print $5"-"$3 }' file.txt

five-three
5-3
five-three
5-3
five-three

Our input file has columns that are separated by spaces. Whitespace (spaces or tabs) is the default column delimiter so when dealing with files like this we don't need to tell awk anything about the delimiter. But if there was a different delimiter then we'd have to tell awk about it in order to get the same results.

So, our new file is comma separated:

one,two,three,four,five,six
1,2,3,4,5,6
one,two,three,four,five,six
1,2,3,4,5,6
one,two,three,four,five,six
1,2,3,4,5,6

Our new awk command would have to look like this:

awk -F, '{ print $5"-"$3 }' file.txt
or
awk -F',' '{ print $5"-"$3 }' file.txt

So, from this we learn that the -F flag is for defining our delimiter. This flag can be easily confused with the lowercase -f flag which is used to call an awk file but more on that later.

One of the most useful tools in awk is the substring function - substr()

Here's an example of the substr() function in use:

From this file (dial.txt):

33410031 40377873 2c9ff4bafe6732d800fe78920f5b13ea 07861232635
33397056 40330917 2c9ff4bafcedd10900fe66e3342f5da7 07869110438

To this:

447861232635
447869110438

Using this command:

awk '{print "44"substr($4,2,12)}' dial.txt

Firstly you can see that no delimiter has been defined so spaces are assumed.

Secondly a "44" has been prepended to the parsed data. The data actually being manipulated and printed is $4, or the fourth field. What we're telling awk to output is a substring (or component of) field four. In plain English the substr($4,2,12) component is telling awk to print up to 12 characters of $4, starting from the 2nd character. The third argument is a length, not an end position.

As fewer than 12 characters are left after the leading 0, you can omit that part of the command and write it like this: substr($4,2)
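
Here's a small sketch of how the arguments to substr() behave, using one of the phone numbers from dial.txt as the only field:

```shell
echo "07861232635" | awk '{
    print substr($1, 2, 3)     # 3 characters starting at character 2
    print substr($1, 2, 12)    # asks for 12, but only 10 are left
    print substr($1, 2)        # no length: everything from character 2 on
}'
```

This should print "786" followed by "7861232635" twice.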

A simple test:

awk supports testing like any other programming language. Here's a very simple example:

awk 'NF != 8 {print $0}' file.txt

All the above example is doing is printing lines from file.txt that don't have 8 fields. NF is an internal awk variable that contains the number of fields on each line.

an introduction to awk

awk is a great *nix command line tool for extracting information from files - it's also so much more than that; it's a programming language in its own right and if you know it well it's a great weapon to have in your armory.

Here I'll be explaining the very basics of the language so that you can build up a good solid understanding of how to use it. I'm no expert so what's displayed herein may not be the best method to reach each desired result and because of that I invite anyone to correct me.

I hope that all the code I show is equally useful to those that use nawk, gawk etc.