Originally published on January 16, 2006 at linux.com
When it comes to slicing and dicing text, few tools are as powerful, or
as underutilized, as awk. The name AWK was coined from the initials of
its authors, Aho, Weinberger, and Kernighan. Yes, the same Kernighan
of the famous K&R C Programming Language book. In the Linux world,
most distributions include the GNU version, gawk (/bin/awk is usually a
symbolic link to /bin/gawk). The GNU version has a few more features
than the original. In this article, when I reference awk, I am really
using gawk. The focus of the article is on the core features common
among POSIX-compliant awks.
Basic command line usage
The awk utility is a small program that executes awk language
scripts, often one-liners, but just as adeptly larger programs saved in
a text file. For example, to execute an awk script saved in the file
prg1.awk, and have it process the file, data1, you could use this command:
awk -f prg1.awk data1
The result is written to standard output, so it is usually redirected to a
result file.
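For quick jobs, the program can also be passed directly on the command line instead of being saved in a file. The following is a minimal sketch; the input data and the shell variable name "swapped" are made up for illustration.

```sh
# Made-up input: a name and a number on each line.
# The quoted string is the entire awk program: print field 2, then field 1.
swapped=$(printf 'alice 30\nbob 25\n' | awk '{ print $2, $1 }')
printf '%s\n' "$swapped"
```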
Another commonly used command line parameter is -F, which changes the
field separator from its default of whitespace (runs of spaces and tabs).
The field separator can also be changed within an awk program. To tell awk
how to split data into fields from a comma-separated values (CSV) file, you
would use:
awk -F"," -f prg1.awk data1
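As a rough illustration, the two ways of setting the separator should give the same result. The CSV data and the variable names below are assumptions made for this sketch.

```sh
# Made-up CSV input, used only for illustration.
csv='red,apple,3
yellow,banana,7'

# Set the separator on the command line with -F ...
with_flag=$(printf '%s\n' "$csv" | awk -F"," '{ print $2 }')

# ... or inside the program, in a BEGIN block.
with_begin=$(printf '%s\n' "$csv" | awk 'BEGIN { FS = "," } { print $2 }')

printf '%s\n' "$with_flag"
```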
You may also include more than one data file to process and awk will keep
running until it runs out of data:
awk -F"," -f prg1.awk data1 data2 data3 data4 data5
If you want to assign a value to a variable before execution of the program,
use the -v option:
awk -v AMOUNT=100 -f prg1.awk data1
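A small sketch of how a variable set with -v might be used inside a program. The input numbers and the variable name "over_limit" are made up; AMOUNT is the variable from the command above.

```sh
# AMOUNT is assigned before the program starts, so every pattern can use it.
# Print only the input values that are greater than AMOUNT.
over_limit=$(printf '50\n150\n200\n' | awk -v AMOUNT=100 '$1 > AMOUNT { print $1 }')
printf '%s\n' "$over_limit"
```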
Behold the power
The power of awk comes from how much it does automatically for you when
crunching text files, and from the simple elegance of the language.
When you feed awk a text file, it does the following things for you:
- Opens and reads all input files listed on the command line
- Handles memory management for all variables
- Parses each line and splits it into fields using the field separator (the
default is whitespace, but it can be changed)
- Presents each line of text to your program as variable $0
- Presents each field from each line in predefined variables, starting with
$1, $2, ... $N
- Maintains many internal variables for your use such as (but not limited to):
  - RS = record separator
  - FS = field separator
  - NF = number of fields in the current record
  - NR = number of records processed so far
- Automatically handles conversion between internal data types
(string, floating point, array)
- Executes the BEGIN block before processing any records (a good place to
initialize variables)
- Executes the END block after processing all records (a good place to
calculate report totals)
- Closes all input files listed on the command line
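Several of the items above can be seen together in one short program. This is only a sketch: the input lines and the variable names "report" and "total" are invented for the example.

```sh
# Made-up input: an item name followed by a quantity.
report=$(printf 'widget 4\ngadget 10\n' | awk '
    BEGIN { total = 0 }                       # runs once, before any input
    {
        print NR ": " $1 " (" NF " fields)"   # runs for every record
        total += $2                           # field 2 is used as a number
    }
    END { print "total = " total }            # runs once, after all input
')
printf '%s\n' "$report"
```

Note that the program never opens a file, allocates memory, or splits a line itself; awk does all of that before the first block runs.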
The awk language uses only three internal data types: strings, floating
point numbers, and arrays. Variables do not have to be defined before
they are used. Awk handles converting data from one type to another as
necessary. If you add two strings together using the addition
operator (+) and they contain numeric values, you get a numeric result.
If a string is used in an arithmetic operation but can't be converted to
a number, it is converted to zero. Usually, awk does what you want when
handling data conversion.
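The conversion rules described above can be checked with a program that has only a BEGIN block, so it needs no input. The variable names are arbitrary.

```sh
conversions=$(awk 'BEGIN {
    a = "3"; b = "4.5"
    print a + b        # both strings look numeric, so this is addition
    print a b          # no operator between them means concatenation
    print "abc" + 1    # a non-numeric string counts as zero
}')
printf '%s\n' "$conversions"
```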
It can open, read, and write to more files than those listed on the
command line by using the getline function or redirecting output from
within a program. It has access to a set of internal functions that
include math, string manipulation, formatted printing (similar to the C
language printf), and miscellaneous functions like pseudo-random
numbers. You can also create your own functions or function libraries
that can be used in several programs. All of this is packed into an
executable usually about 500k in size. Programmers can typically become
proficient in awk within a day. Complete references are available in a
single book. You don't need a "bookshelf" of dead trees and CDs to
master awk.
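As a brief sketch of a user-defined function combined with formatted printing: the function name "cost", the input data, and the output wording are all hypothetical.

```sh
# Made-up input: item, quantity, unit price.
formatted=$(printf 'widget 4 2.5\ngadget 10 1.25\n' | awk '
    # Hypothetical helper: quantity multiplied by unit price.
    function cost(qty, price) { return qty * price }

    { printf "%s costs %.2f\n", $1, cost($2, $3) }
')
printf '%s\n' "$formatted"
```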
Implementations or ports of awk are available on nearly every platform,
making your scripts reasonably portable.
Awk in the real world
Here is a short example of a recent awk application I created to import
a list of email addresses and names from Novell Groupwise to PHPList (a
mailing list manager). The list was exported from Groupwise in vCard
File format (VCF), a text based format. Here is an example entry from
the VCF file:
BEGIN:VCARD
VERSION:2.1
X-GWTYPE:USER
FN:Bar, Foo
ORG:;GREEN
EMAIL;WORK;PREF:[e-mail address obscured]
N:Bar;Foo
X-GWUSERID:foobar
X-GWADDRFMT:0
X-GWIDOMAIN:yahoo.com
X-GWTARGET:TO
END:VCARD
The target format was a CSV file that PHPList could import into an
existing mailing list. I needed to extract the name (from the record
that starts with "FN") and the email address (from the record that starts
with "EMAIL"). These records are easy to identify and a small awk script
does the job nicely.
I started construction of the script by setting up a custom field
separator and a block of code to handle each record type. I saved the
script in a text file called extract-emails.awk. Note that the
.awk file extension is just a convention; the file containing awk
commands can be named anything. This was the beginning of the script:
BEGIN { FS = ":" }
/^FN/ {
    # handle name here
}
/^EMAIL/ {
    # handle email address here
}
The BEGIN block is run once before any records are read. It sets
the field separator to a colon so awk will split the fields of the file
when a colon is encountered.
The regular expressions /^FN/ and /^EMAIL/ tell awk to look for the
characters "FN" or "EMAIL" at the start of a record, and if a match is
found, run the associated block of code between the curly braces. This
kind of regular expression match is common in awk but not required. A
block of code with no match expression is run for every record processed
by awk. I added a couple of comments (lines starting with "#") to
document what each part of the script does.
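To see the difference between a block with a pattern and a block without one, here is a small sketch. The three input lines imitate the vCard data, the address someone@example.com is a placeholder, and the variable names are made up.

```sh
matched=$(printf 'FN:Bar, Foo\nORG:;GREEN\nEMAIL;WORK;PREF:someone@example.com\n' | awk '
    /^FN/    { print "name record:  " $0 }
    /^EMAIL/ { print "email record: " $0 }
             { count++ }                     # no pattern: runs for every record
    END      { print count " records read" }
')
printf '%s\n' "$matched"
```

The ORG line matches neither pattern, so it is only counted.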
Looking at the VCF data, I noticed that the "FN" record always precedes
the "EMAIL" record, so I ordered the code blocks to process the records
that way. For each record, awk tests the patterns in the order they appear
in the script and runs every block that matches. Many times, the order of
the code will not matter, but in this case it does. The name is related to
the email and I needed to retain that relationship as the file is read, so
I saved the name in an internal variable, then wrote both the email address
and name to standard output while processing the email record.
Getting back to the task, let's complete the name section. The goal is
to reformat the name from "lastname, firstname" into "firstname lastname",
removing the comma. Here was my solution for the name:
/^FN/ {
    # handle name here
    fullname = tolower($2)
    split(fullname, names, ",")
    name = names[2] " " names[1]
}
Knowing that awk has split up the incoming records into fields using a
colon as the field separator, the field variables for the example "FN"
record contain the following:
$1 = "FN"
$2 = "Bar, Foo"
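The field values can be confirmed by asking awk to print them. This is a quick check rather than part of the final script; the variable name "fields" is arbitrary.

```sh
# Split one sample record on colons and show what lands in each field.
fields=$(printf 'FN:Bar, Foo\n' | awk 'BEGIN { FS = ":" } { print "$1 = " $1; print "$2 = " $2 }')
printf '%s\n' "$fields"
```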
Working with the $2 variable, I used a built-in awk function,
tolower(), to convert the names to lowercase and stored the result in a
variable called "fullname". Next, I used the split() function to break
the name into first and last name parts, with the result stored in an
array called "names". Finally, I glued the name back together in the
desired order, without the comma, and stored that result in a variable
called "name".
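The intermediate values can be inspected one step at a time. In this sketch the name is hard-coded instead of read from $2, and the square brackets are added only to make whitespace visible.

```sh
steps=$(awk 'BEGIN {
    fullname = tolower("Bar, Foo")       # "bar, foo"
    n = split(fullname, names, ",")      # split returns the number of pieces
    print n
    print "[" names[1] "]"
    print "[" names[2] "]"
}')
printf '%s\n' "$steps"
```

The second piece keeps the space that followed the comma, which is why a space also appears after the comma in the final output.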
There is very little to do inside the email code block. Awk provides
the email address to us in the $2 variable (note that $2 in the "EMAIL"
record is different from $2 in the "FN" record). For consistency, I
converted it to lowercase, then used the print statement to write both
the email address and name to standard output, with a comma separating the
values. Here is the complete script:
BEGIN { FS = ":" }
/^FN/ {
    # handle name here
    fullname = tolower($2)
    split(fullname, names, ",")
    name = names[2] " " names[1]
}
/^EMAIL/ {
    # handle email address here
    mail = tolower($2)
    print mail "," name
}
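To try the logic without a Groupwise export, a condensed version can be run against a hand-made record. This is a sketch under assumptions: the address Someone@Example.com is a placeholder, the shell variable names are invented, and the two name parts are joined with a single space.

```sh
# A hand-made vCard record; the e-mail address is a placeholder.
vcard='BEGIN:VCARD
FN:Bar, Foo
EMAIL;WORK;PREF:Someone@Example.com
N:Bar;Foo
END:VCARD'

csv_line=$(printf '%s\n' "$vcard" | awk '
    BEGIN    { FS = ":" }
    /^FN/    { split(tolower($2), names, ","); name = names[2] " " names[1] }
    /^EMAIL/ { print tolower($2) "," name }
')
printf '%s\n' "$csv_line"
```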
A sprinkle of shell glue
To pull it all together, we need a little shell glue. A small shell
script allows us to call awk with the command line parameters we want
and to easily redirect the output to a file. It is also handy to run a shell
script when you are testing.
#!/bin/sh
# Extract e-mail addresses from VCF file for PHPList.
awk -f extract-emails.awk groupwise.vcf > phplist-emails.txt
Awk can be used as an intermediate step in a larger shell script where
the output is fed into another utility such as sort, grep, or another
awk script.
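A pipeline of that kind might look like the following sketch. The two input lines stand in for the output of the extraction script, and the addresses are placeholders.

```sh
# Made-up "email, name" lines: sort them, then number the addresses with awk.
sorted=$(printf 'zed@example.com, zed last\namy@example.com, amy first\n' |
    sort |
    awk -F"," '{ print NR ": " $1 }')
printf '%s\n' "$sorted"
```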
Finally, here is a sample of the output:
[e-mail address obscured], foo bar
[e-mail address obscured], bar baz
Where awk falls short
There are certain tasks that are beyond the capabilities of awk. For
instance, if you need to do anything that communicates using network
sockets, awk is not your best bet. Similarly, if you need to process
binary files, awk falls short. The latest version of GNU awk does have
some rudimentary network capabilities, but Perl, PHP, and Ruby are much
better equipped for those tasks.
A million household uses
Awk is an expert tool for text processing, and the roots of Perl are
clear in its design. It is powerful enough to handle almost any kind
of text crunching or reporting, while being very easy to learn and use.
There is a lot of competition and many choices when it comes to
scripting languages, but I still find awk the best choice for many
problems. Although awk is employed most often for smaller problems, it
can be used for large applications. I worked on a 12,000-line awk
application used to adjudicate dental claims. This application was the
core system for a successful million-dollar business. Despite being a
common utility on every Linux system, awk remains relatively obscure.
If you take the time to learn it, the rewards will last a lifetime.

This work is licensed under a
Creative Commons Attribution-NonCommercial 2.5 License.