Thursday 21 August 2008

awk: more simple examples

First, suppose you have a file called 'file1' that has 2 columns of numbers, and you want to make a new file called 'file2' that has columns 1 and 2 as before, but also adds a third column which is the ratio of the numbers in columns 1 and 2. Suppose you want the new 3-column file (file2) to contain only those lines with column 1 smaller than column 2. Either of the following two commands does what you want:

awk '$1 < $2 {print $0, $1/$2}' file1 > file2

-- or --

cat file1 | awk '$1 < $2 {print $0, $1/$2}' > file2

Let's look at the second one. You all know that 'cat file1' prints the contents of file1 to your screen. The | (called a pipe) directs the output of 'cat file1', which normally goes to your screen, to the command awk. Awk considers the input from 'cat file1' one line at a time, and tries to match the 'pattern'. The pattern is whatever is between the first ' and the {, in this case the pattern is $1 < $2. If the pattern is false, awk goes on to the next line. If the pattern is true, awk does whatever is in the {}. In this case we have asked awk to check if the first column is less than the second. If there is no pattern, awk assumes the pattern is true, and goes onto the action contained in the {}.

What is the action? Almost always it is a print statement of some sort. In this case we want awk to print the entire line, i.e. $0, and then print the ratio of columns 1 and 2, i.e. $1/$2. We close the action with a }, and close the awk command with a '. Finally, to store the final 3-column output into file2 (otherwise it prints to the screen), we add a '> file2'.

As a second example, suppose you have several thousand files you want to move into a new directory and rename by appending a .dat to the filenames. You could do this one by one (several hours), or use vi to make a decent command file to do it (several minutes), or use awk (several seconds). Suppose the files are named junk* (* is wildcard for any sequence of characters), and need to be moved to ../iraf and have a '.dat' appended to the name. To do this type

ls junk* | awk '{print "mv "$0" ../iraf/"$0".dat"}' | csh

ls junk* lists the filenames, and this output is piped into awk instead of going to your screen. There is no pattern (nothing between the ' and the {), so awk proceeds to print something for each line. For example, if the first two lines from 'ls junk*' produced junk1 and junk2, respectively, then awk would print:

mv junk1 ../iraf/junk1.dat
mv junk2 ../iraf/junk2.dat

At this point the mv commands are simply printed to the screen. To execute the command we take the output of awk and pipe it back into the operating system (the C-shell). Hence, to finish the statement we add a ' | csh'.

More complex awk scripts need to be run from a file. The syntax for such cases is:

cat file1 | awk -f a.awk > file2

where file1 is the input file, file2 is the output file, and a.awk is a file containing awk commands. Examples below that contain more than one line of awk need to be run from files.

Some useful awk variables defined for you are NF (number of columns), NR (the current line that awk is working on), END (true if awk reaches the EOF), BEGIN (true before awk reads anything), and length (number of characters in a line or a string). There is also looping capability, a search (/) command, a substring command (extremely useful), and formatted printing available. There are logical variables || (or) and && (and) that can be used in 'pattern'. You can define and manipulate your own user defined variables. Examples are outlined below. The only bug I know of is that Sun's version of awk won't do trig functions, though it does do logs. There is something called gawk (a Gnu product), which does a few more things than Sun's awk, but they are basically the same. Note the use of the 'yes' command below. Coupled with 'head' and 'awk' you save an hour of typing if you have a lot of files to analyze or rename.

No comments: