awk programming

Monday, 10 November 2008

awk: outputting the last ten minutes of a log file

OK, I've got this log file that has the date/time stamp in field one ($1) and I just wanted to extract the last ten minutes worth of data. Using the gawk functions mktime() and strftime() I wrote the following solution:

Getting the last ten minutes (600 seconds) of a log file:

awk -F, '{ tenago = strftime("%H:%M:%S", mktime(strftime("%Y %m %d %H %M %S", systime() - 600))); if ( substr($1,14) > tenago ) print $0 }' logfile.txt

Getting Hours:Minutes:Seconds timestamp for ten minutes ago was accomplished using:

tenago = strftime("%H:%M:%S", mktime(strftime("%Y %m %d %H %M %S", systime() - 600)))

It makes sense if you read it from the inside out. The systime() function, because it hasn't been supplied any parameters returns the current datetime minus (-) 600 seconds. It's formatted as a string using: "%Y %m %d %H %M %S" and passed to the mktime() function where it's converted to a string again... I think you may be able to simplify this - let me know if you can! :)

EDIT: Here we go, I knew there was a way to simplify the above command! Here's my latest version:

awk -F, '{ tenago = strftime("%H:%M:%S", systime() - 600); if ( substr($1,14) > tenago ) print $0 }'

Wednesday, 15 October 2008

awk script to find the maximum value of many returned rows

Below is a complete script that pulls out the maximum value for two specific criteria (disk name and time) - more than one row is returned for each so for the value in $3 we had to find the maximum.

for FILE in $( ls /var/test/perf/archive/measure/input/*$1*disk2.dat.gz )
do

DATFILE=$( echo ${FILE} | awk -F'/' '{ print substr($8,1,length($8)-3) }' )
echo ${DATFILE}

cp ${FILE} .
gunzip *.gz

DATE=$( echo ${DATFILE} | awk -F'-' '{ print $4"/"$3"/"$2 }' )

FILENAME=$( echo ${DATFILE} | awk -F'-' '{ print $1 }' )

awk -v D=${DATE} -F',' '{
if ( $2 ~ /D0201/ && $3 ~ /20:00:00/) {
max201[$3] = ( max201[$3] > $6 ? max201[$3] : $6 )
}
if ( $2 ~ /D0211/ && $3 ~ /20:00:00/) {
max211[$3] = ( max211[$3] > $6 ? max211[$3] : $6 )
}
} END {
for (i in max201) print D","i",D0201,"max201[i];
for (i in max211) print D","i",D0211,"max211[i];
}' ${DATFILE} >> ${FILENAME}-disk2.csv

rm ${DATFILE}

done

---

The clever bit (for me anyway) is this:

max211[$3] = ( max211[$3] > $6 ? max211[$3] : $6 )

The bit between the brackets is saying if the value in $6 is greater than what's already stored in max211[$3] (the value of $6 indexed by column $3) then pass back the new value of $6 - which then updates max211[$3] to the new value.

To put it another way; the bit above between the brackets could be written like this:

if ( max211[$3] > $6 )
max211[$3] = max211[$3]
else
max211[$3] = $6

Or as my book tells me:

Awk provides a conditional operator that is found in the C programming language.
Its form is:

expr ? action1 : action2

Wednesday, 3 September 2008

awk: using the length() function

OK, take the following code snippit as an example:

$ echo this.is.test-for-awk | awk -F'.' '{ printf substr($3,6,length($3)-9)"\nLength of $3: "length($3)"\n" }'

Output:

for
Length of $3: 12

All the above awk command is doing is accepting the input of "this.is.test-for-awk" from an echo command and splitting it into it's component parts as described within the body of the awk command.

As you can see the input has two different types of field separators - which can be quite common. So I thought I'd start with the '.' separator and I've defined this by using the -F'.' flag.

I then wanted to split down what has now been defined as $3 (as the separator is '.'). This I have done using the substring and length functions that are built into awk. Strictly I didn't need to use the length function as I could have just put in the number '3' so this component of the command would've looked like this:

substr($3,6,3)

The above substr() command translates to: strip $3 (field 3) from digit number 6 to digit number 3. But I wanted to show how the length function could be used so this component of the command looked like this:

substr($3,6,length($3)-9)

Now, what the above is doing is splitting down $3 based starting from the 6th digit and then getting the length of the field and taking away 9. As you can see the length of $3 is 12 digits - including the separators. So by taking away 9 we're left with 3. The result is that we have the output 'for'.

The last part of the command proves that the length function is reading the input text correctly by showing us the length of field 3.

"\nLength of $3: "length($3)"\n"

Because I've used the printf function I can format the output to make it more readable by including \n to print new-lines.

Friday, 22 August 2008

Gnu Awk - gawk - Manual

This is a very useful resource - although it's specifically talking about gawk most of it will be applicable to other variations of awk:

http://www.gnu.org/software/gawk/manual/

Thursday, 21 August 2008

awk: terms and descriptions

I found this article by IBM that very clearly describes the different components of an awk program:

http://www.ibm.com/developerworks/library/l-awk1.html

awk: more complex examples

EXAMPLES # is the comment character for awk. 'field' means 'column'

# Print first two fields in opposite order:
awk '{ print $2, $1 }' file

# Print lines longer than 72 characters:
awk 'length > 72' file

# Print length of string in 2nd column
awk '{print length($2)}' file

# Add up first column, print sum and average:
{ s += $1 }
END { print "sum is", s, " average is", s/NR }

# Print fields in reverse order:
awk '{ for (i = NF; i > 0; --i) print $i }' file

# Print the last line
{line = $0}
END {print line}

# Print the total number of lines that contain the word Pat
/Pat/ {nlines = nlines + 1}
END {print nlines}

# Print all lines between start/stop pairs:
awk '/start/, /stop/' file

# Print all lines whose first field is different from previous one:
awk '$1 != prev { print; prev = $1 }' file

# Print column 3 if column 1 > column 2:
awk '$1 > $2 {print $3}' file

# Print line if column 3 > column 2:
awk '$3 > $2' file

# Count number of lines where col 3 > col 1
awk '$3 > $1 {print i + "1"; i++}' file

# Print sequence number and then column 1 of file:
awk '{print NR, $1}' file

# Print every line after erasing the 2nd field
awk '{$2 = ""; print}' file

# Print hi 28 times
yes | head -28 | awk '{ print "hi" }'

# Print hi.0010 to hi.0099 (NOTE IRAF USERS!)
yes | head -90 | awk '{printf("hi00%2.0f \n", NR+9)}'

# Print out 4 random numbers between 0 and 1
yes | head -4 | awk '{print rand()}'

# Print out 40 random integers modulo 5
yes | head -40 | awk '{print int(100*rand()) % 5}'

# Replace every field by its absolute value
{ for (i = 1; i <= NF; i=i+1) if ($i < i =" -$i" 2="="" i="875;i">833;i--){
printf "lprm -Plw %d\n", i
} exit
}

Formatted printouts are of the form printf( "format\n", value1, value2, ... valueN)
e.g. printf("howdy %-8s What it is bro. %.2f\n", $1, $2*$3)
%s = string
%-8s = 8 character string left justified
%.2f = number with 2 places after .
%6.2f = field 6 chars with 2 chars after .
\n is newline
\t is a tab

# Print frequency histogram of column of numbers
$2 <= 0.1 {na=na+1} ($2 > 0.1) && ($2 <= 0.2) {nb = nb+1} ($2 > 0.2) && ($2 <= 0.3) {nc = nc+1} ($2 > 0.3) && ($2 <= 0.4) {nd = nd+1} ($2 > 0.4) && ($2 <= 0.5) {ne = ne+1} ($2 > 0.5) && ($2 <= 0.6) {nf = nf+1} ($2 > 0.6) && ($2 <= 0.7) {ng = ng+1} ($2 > 0.7) && ($2 <= 0.8) {nh = nh+1} ($2 > 0.8) && ($2 <= 0.9) {ni = ni+1} ($2 > 0.9) {nj = nj+1}
END {print na, nb, nc, nd, ne, nf, ng, nh, ni, nj, NR}

# Find maximum and minimum values present in column 1
NR == 1 {m=$1 ; p=$1}
$1 >= m {m = $1}
$1 <= p {p = $1} END { print "Max = " m, " Min = " p } # Example of defining variables, multiple commands on one line NR == 1 {prev=$4; preva = $1; prevb = $2; n=0; sum=0} $4 != prev {print preva, prevb, prev, sum/n; n=0; sum=0; prev = $4; preva = $1; prevb = $2} $4 == prev {n++; sum=sum+$5/$6} END {print preva, prevb, prev, sum/n} # Example of defining and using a function, inserting values into an array # and doing integer arithmetic mod(n). This script finds the number of days # elapsed since Jan 1, 1901. (from http://www.netlib.org/research/awkbookcode/ch3) function daynum(y, m, d, days, i, n) { # 1 == Jan 1, 1901 split("31 28 31 30 31 30 31 31 30 31 30 31", days) # 365 days a year, plus one for each leap year n = (y-1901) * 365 + int((y-1901)/4) if (y % 4 == 0) # leap year from 1901 to 2099 days[2]++ for (i = 1; i < m; i++) n += days[i] return n + d } { print daynum($1, $2, $3) } # Example of using substrings # substr($2,9,7) picks out characters 9 thru 15 of column 2 {print "imarith", substr($2,1,7) " - " $3, "out."substr($2,5,3)} {print "imarith", substr($2,9,7) " - " $3, "out."substr($2,13,3)} {print "imarith", substr($2,17,7) " - " $3, "out."substr($2,21,3)} {print "imarith", substr($2,25,7) " - " $3, "out."substr($2,29,3)}

awk: more simple examples

First, suppose you have a file called 'file1' that has 2 columns of numbers, and you want to make a new file called 'file2' that has columns 1 and 2 as before, but also adds a third column which is the ratio of the numbers in columns 1 and 2. Suppose you want the new 3-column file (file2) to contain only those lines with column 1 smaller than column 2. Either of the following two commands does what you want:

awk '$1 < $2 {print $0, $1/$2}' file1 > file2

-- or --

cat file1 | awk '$1 < $2 {print $0, $1/$2}' > file2

Let's look at the second one. You all know that 'cat file1' prints the contents of file1 to your screen. The | (called a pipe) directs the output of 'cat file1', which normally goes to your screen, to the command awk. Awk considers the input from 'cat file1' one line at a time, and tries to match the 'pattern'. The pattern is whatever is between the first ' and the {, in this case the pattern is $1 < $2. If the pattern is false, awk goes on to the next line. If the pattern is true, awk does whatever is in the {}. In this case we have asked awk to check if the first column is less than the second. If there is no pattern, awk assumes the pattern is true, and goes onto the action contained in the {}.

What is the action? Almost always it is a print statement of some sort. In this case we want awk to print the entire line, i.e. $0, and then print the ratio of columns 1 and 2, i.e. $1/$2. We close the action with a }, and close the awk command with a '. Finally, to store the final 3-column output into file2 (otherwise it prints to the screen), we add a '> file2'.

As a second example, suppose you have several thousand files you want to move into a new directory and rename by appending a .dat to the filenames. You could do this one by one (several hours), or use vi to make a decent command file to do it (several minutes), or use awk (several seconds). Suppose the files are named junk* (* is wildcard for any sequence of characters), and need to be moved to ../iraf and have a '.dat' appended to the name. To do this type

ls junk* | awk '{print "mv "$0" ../iraf/"$0".dat"}' | csh

ls junk* lists the filenames, and this output is piped into awk instead of going to your screen. There is no pattern (nothing between the ' and the {), so awk proceeds to print something for each line. For example, if the first two lines from 'ls junk*' produced junk1 and junk2, respectively, then awk would print:

mv junk1 ../iraf/junk1.dat
mv junk2 ../iraf/junk2.dat

At this point the mv commands are simply printed to the screen. To execute the command we take the output of awk and pipe it back into the operating system (the C-shell). Hence, to finish the statement we add a ' | csh'.

More complex awk scripts need to be run from a file. The syntax for such cases is:

cat file1 | awk -f a.awk > file2

where file1 is the input file, file2 is the output file, and a.awk is a file containing awk commands. Examples below that contain more than one line of awk need to be run from files.

Some useful awk variables defined for you are NF (number of columns), NR (the current line that awk is working on), END (true if awk reaches the EOF), BEGIN (true before awk reads anything), and length (number of characters in a line or a string). There is also looping capability, a search (/) command, a substring command (extremely useful), and formatted printing available. There are logical variables || (or) and && (and) that can be used in 'pattern'. You can define and manipulate your own user defined variables. Examples are outlined below. The only bug I know of is that Sun's version of awk won't do trig functions, though it does do logs. There is something called gawk (a Gnu product), which does a few more things than Sun's awk, but they are basically the same. Note the use of the 'yes' command below. Coupled with 'head' and 'awk' you save an hour of typing if you have a lot of files to analyze or rename.