Monday, 18 August 2008

simple awk

You've got to start somewhere, so here's the simplest awk command:

awk '{ print $0 }' file.txt

So, what does it do? It prints every line of the input file 'file.txt'. Not very impressive, I know, but if you're presented with a file that has many columns, such as the one below, and you're only interested in one field, the command can easily be modified to show just that field.

Example file data (file.txt):

one two three four five six
1 2 3 4 5 6
one two three four five six
1 2 3 4 5 6
one two three four five six

New awk command:

awk '{ print $3 }' file.txt

three
3
three
3
three

By changing the $0 to a $3, the new command selects just the third column for printing.

So, $0 (that's a zero) refers to the whole record or line (we can actually change the definition of record so that it means something else but more on that later) and $3 refers specifically to the third field.

If you wish to extract more than one field or change the order of the fields, you can do that. In the following example I extract two fields and change their order. I also change the way that awk presents the resulting output.

awk '{ print $5"-"$3 }' file.txt

five-three
5-3
five-three
5-3
five-three
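
For comparison, here is a minimal sketch of the two ways print can join fields. A comma between the fields makes awk insert its output separator (a single space by default), while fields and strings written next to each other are concatenated exactly as given. The sample line is fed in with printf purely to keep the example self-contained, and the variable names are made up for illustration.

```shell
# One sample line in the same shape as the lines of file.txt.
line='one two three four five six'

# A comma between fields: awk joins them with a single space by default.
with_comma=$(printf '%s\n' "$line" | awk '{ print $5, $3 }')

# Adjacent fields and strings are concatenated exactly as written.
with_dash=$(printf '%s\n' "$line" | awk '{ print $5 "-" $3 }')

echo "$with_comma"   # five three
echo "$with_dash"    # five-three
```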

Our input file has columns that are separated by spaces. Whitespace (any run of spaces or tabs) is the default column delimiter, so when dealing with files like this we don't need to tell awk anything about the delimiter. But if the file used a different delimiter, we'd have to tell awk about it in order to get the same results.
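
A quick way to see the default splitting at work is to feed awk a deliberately untidy line. This is only a sketch with made-up input: the line starts with spaces, uses a tab between the first two columns and several spaces before the third.

```shell
# Leading spaces, a tab and a run of spaces all act as plain separators,
# so the third field is still found without any extra options.
third=$(printf '  one\ttwo     three\n' | awk '{ print $3 }')

echo "$third"   # three
```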

So, our new file is comma separated:

one,two,three,four,five,six
1,2,3,4,5,6
one,two,three,four,five,six
1,2,3,4,5,6
one,two,three,four,five,six
1,2,3,4,5,6

Our new awk command would have to look like this:

awk -F, '{ print $5"-"$3 }' file.txt
or
awk -F',' '{ print $5"-"$3 }' file.txt

So, from this we learn that the -F flag defines our delimiter. It is easily confused with the lowercase -f flag, which is used to run an awk program from a file, but more on that later.
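
As a side note, -F is a shortcut for setting awk's built-in FS (field separator) variable. The sketch below assumes a single comma-separated line, fed in with printf so it runs on its own, and shows that both forms give the same result. The BEGIN block runs before any input is read, which is why FS is set there.

```shell
# One sample line in the same shape as the comma-separated file.
csv='one,two,three,four,five,six'

# Setting the delimiter with -F on the command line...
from_flag=$(printf '%s\n' "$csv" | awk -F, '{ print $5 "-" $3 }')

# ...has the same effect as assigning FS before the first line is read.
from_begin=$(printf '%s\n' "$csv" | awk 'BEGIN { FS = "," } { print $5 "-" $3 }')

echo "$from_flag"    # five-three
echo "$from_begin"   # five-three
```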

One of the most useful tools in awk is the substring function - substr()

Here's an example of the substr() function in use:

From this file (dial.txt):

33410031 40377873 2c9ff4bafe6732d800fe78920f5b13ea 07861232635
33397056 40330917 2c9ff4bafcedd10900fe66e3342f5da7 07869110438

To this:

447861232635
447869110438

Using this command:

awk '{print "44"substr($4,2,12)}' dial.txt

Firstly, you can see that no delimiter has been defined, so whitespace is assumed.

Secondly, a "44" has been prepended to the extracted data. The data actually being manipulated and printed is $4, the fourth field. What we're telling awk to output is a substring (a part) of field four. In plain English, substr($4,2,12) tells awk to print $4 starting from the 2nd character and continuing for at most 12 characters. The third argument is a length, not an end position. The field only has 11 characters, so we get everything from the 2nd character to the end, which drops the leading 0.

Because the requested length already runs past the end of the field, you can omit it and write substr($4,2), which returns everything from the 2nd character onwards.
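
To make the meaning of the arguments concrete, here is a small sketch that runs substr() three ways on one of the phone numbers from dial.txt. The number is passed in on its own with printf, so it is $1 here rather than $4, and the variable names are only for illustration.

```shell
# 11 characters, as in the fourth field of dial.txt.
number='07861232635'

# Start at character 2 and take up to 12 characters (only 10 remain).
full=$(printf '%s\n' "$number" | awk '{ print substr($1, 2, 12) }')

# No length given: take everything from character 2 to the end.
rest=$(printf '%s\n' "$number" | awk '{ print substr($1, 2) }')

# A length shorter than the remainder cuts the result off early.
short=$(printf '%s\n' "$number" | awk '{ print substr($1, 2, 3) }')

echo "$full"    # 7861232635
echo "$rest"    # 7861232635
echo "$short"   # 786
```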

A simple test:

Like most programming languages, awk supports conditional tests. Here's a very simple example:

awk 'NF != 8 {print $0}' file.txt

All the above example does is print the lines of file.txt that don't have exactly 8 fields. NF is a built-in awk variable that holds the number of fields in the current line.
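
The same kind of test can be tried on a few made-up lines. In this sketch the expected count is 6 rather than 8, to match the shape of the earlier sample data, and NF is printed in front of each offending line so you can see why it was picked out.

```shell
# Three sample lines with 6, 4 and 6 fields; only the short one is reported.
odd=$(printf 'a b c d e f\n1 2 3 4\nu v w x y z\n' | awk 'NF != 6 { print NF ": " $0 }')

echo "$odd"   # 4: 1 2 3 4
```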
