Originally published on February 23, 2006 at linux.com
In a previous article, I
covered the basics of awk and presented a small application to reformat
address book data. Now, I would like to present a few ways to improve
the performance of awk programs.
Basic profiling
If your program is not running at an acceptable speed, the first thing
to do is look for a logic error. You may have created a loop that is
executing more often than it should or may be processing a set of records
more than once. To help you find these kinds of programming bugs, the
GNU version of awk includes a profiler. It can usually be found in
/bin/pgawk.
To use the profiler, call pgawk instead of gawk when running your
program, or use the --profile option with gawk. By default,
the profiler will create a file in the current directory called
"awkprof.out" that contains a copy of the program with execution counts
for each statement in the left margin. For example, here is the
profiling output for the program from my previous article
(extract-emails.awk):
# gawk profile, created Tue Jan 17 13:38:25 2006
# BEGIN block(s)
BEGIN {
1 FS = ":"
}
# Rule(s)
1120 /^FN/ { # 2
84 fullname = tolower($2)
84 split(fullname, names, ",")
84 name = (names[2] names[1])
}
1120 /^EMAIL/ { # 2
82 mail = tolower($2)
82 print ((mail ",") name)
}
The numbers in the left margin indicate how many times each statement
was executed. The statement in the BEGIN block was executed once as
expected. Each regular expression test was executed 1120 times, once for
each record in the input file. Each block was executed only when a
record matched the corresponding regular expression test.
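To reproduce a listing like this on your own system, something along the following lines should work. This is only a sketch: the program is retyped from the profile above, sample.vcf is a made-up two-entry input, and it assumes a gawk new enough that --profile itself records execution counts (installations that still ship pgawk would call that binary instead).

```shell
# The program, retyped from the profile listing above.
cat > extract-emails.awk <<'EOF'
BEGIN { FS = ":" }
/^FN/ {
    fullname = tolower($2)
    split(fullname, names, ",")
    name = names[2] names[1]
}
/^EMAIL/ {
    mail = tolower($2)
    print mail "," name
}
EOF

# Hypothetical input invented for this example: one contact, one group.
cat > sample.vcf <<'EOF'
BEGIN:VCARD
FN:Doe,Jane
EMAIL:Jane.Doe@example.com
END:VCARD
BEGIN:VCARD
X-GWTYPE:GROUP
FN:GREEN
END:VCARD
EOF

if command -v gawk >/dev/null 2>&1; then
    # Writes the annotated program to awkprof.out in the current directory.
    result=$(gawk --profile=awkprof.out -f extract-emails.awk sample.vcf)
    cat awkprof.out
else
    # No gawk here: just run the program with whatever awk is installed.
    result=$(awk -f extract-emails.awk sample.vcf)
fi
echo "$result"
```

With this input, the statements inside the /^FN/ block should show a count of 2 and those inside the /^EMAIL/ block a count of 1, a small-scale version of the mismatch in the real profile.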
Notice that the /^FN/ code block was executed two more times
than the /^EMAIL/ block. To understand why, I had to examine
the data. I found two entries in the input data that contained a
full name line, but no corresponding email address:
BEGIN:VCARD
VERSION:2.1
X-GWTYPE:GROUP
FN:GREEN
N:GREEN
X-DL:GREEN(82 items)
END:VCARD
Because output is printed only inside the email block, these entries did
not cause any problem with the results. They did bring to light something
interesting about the data: the existence of these two "group" records.
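A mismatch like this can also be tracked down with awk itself. The following is a minimal sketch, not the method used for this article: it assumes every entry is delimited by BEGIN:VCARD and END:VCARD lines, and sample.vcf is a made-up file standing in for the real address book.

```shell
# Hypothetical input: one ordinary contact and one group without an email.
cat > sample.vcf <<'EOF'
BEGIN:VCARD
VERSION:2.1
FN:Doe,Jane
EMAIL:jane.doe@example.com
END:VCARD
BEGIN:VCARD
VERSION:2.1
X-GWTYPE:GROUP
FN:GREEN
N:GREEN
END:VCARD
EOF

# Print the full name of every entry that has an FN line but no EMAIL line.
missing=$(awk -F: '
    /^BEGIN:VCARD/ { fn = ""; has_email = 0 }   # start of an entry: reset state
    /^FN/          { fn = $2 }                  # remember the full name
    /^EMAIL/       { has_email = 1 }            # this entry has an address
    /^END:VCARD/   { if (fn != "" && !has_email) print fn }
' sample.vcf)
echo "$missing"
```

Run against the sample above, this prints GREEN, the one entry with a name and no address.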
Complete cross reference
For larger programs, where you may pull in several awk script libraries,
deeper bugs may be hunted by creating a cross reference of all functions
and variables. Phil Bewig created XREF, a public domain awk program, to
take valid awk programs as input and write a cross reference to standard
out. It is executed as follows:
awk -f xref.awk [ file ... ]
For ordinary variables and array variables, a line of the form
count var(func) lines ...
is printed, where "count" is the number of times the variable is used,
"var" is the name of the variable, "func" is the function name to which
the variable is local (a null "func" indicates that the variable is
global), and "lines" is the number of each line where the variable
appears. A similar line is printed for each function.
Here is the output of XREF when run against the sample program:
1 FS() 1
2 fullname() 8 9
2 mail() 15 16
2 name() 10 16
3 names() 9 10
2 tolower(0) 8 15
It shows how many times each variable or function was used and the
line numbers where each appears.
XREF is invaluable for spotting variable scoping issues or a function
name defined in two places.
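To make the scoping problem concrete, here is a small sketch of the kind of bug a cross reference helps expose. The function names sum_bad and sum_ok are invented for this example. In awk, a variable used inside a function is global unless it is listed as an extra parameter.

```shell
out=$(awk '
# i and total are not declared, so awk treats them as globals.
function sum_bad(n) {
    total = 0
    for (i = 1; i <= n; i++) total += i
    return total
}
# Extra parameters (after the conventional wide gap) act as locals.
function sum_ok(n,    total, i) {
    total = 0
    for (i = 1; i <= n; i++) total += i
    return total
}
BEGIN {
    i = 10
    sum_ok(3);  print i    # the global i is untouched
    sum_bad(3); print i    # the global i was overwritten by the loop
}')
echo "$out"
```

The first print shows 10 and the second shows 4. Going by the output format described above, the variables in sum_bad would presumably be listed with an empty "func" (global), while those in sum_ok would be listed as local to sum_ok.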
Bolting on a turbo
Most awk applications run quickly with no performance tuning. But if
you are crunching enormous text files or doing lengthy calculations,
things may slow down. To squeeze the most performance out
of awk, you can use Awka to
translate your program into ANSI C and compile it.
Awka is a GPL'ed program distributed as source code, so you'll need to
install it with the ./configure, make, make install routine.
Compiling awka creates the awka binary and the libawka.so shared library
that is used with each generated program. Note: You may need to add
/usr/local/lib to /etc/ld.so.conf and run ldconfig before the system is
able to find libawka.so.
Typical performance gains are 50% or more over native gawk, though it
depends on the mix of operations and data. Arrays, nested loops, and
user defined functions show great speed increases. A performance comparison
chart of various operations is available on the Awka web site.
I used awka to create a binary version of the sample program above
with this command:
awka -X -o extract-emails -f extract-emails.awk
Then, I ran my own comparison test. The compiled awka version ran about
three times as fast as native gawk, though with such a small program and
small data set, the wall clock time difference was negligible. To see
if the results scaled, I increased the data set first by a factor of
100, then by 1000. With 100 times the data, awka was still three times
as fast, and with 1000 times the data, awka was four times as fast (an
average of 0.932 seconds for awka vs. 3.811 seconds for gawk). Your
mileage will vary.
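A comparison along these lines can be scripted. The sketch below is not the harness used for the numbers above. The input file, the time_runs helper, and the run count are all invented for illustration, and it assumes a date command that supports +%s. The compiled binary is timed only if the awka command shown earlier has already produced ./extract-emails.

```shell
# Program retyped from the profile listing earlier in the article.
cat > extract-emails.awk <<'EOF'
BEGIN { FS = ":" }
/^FN/    { fullname = tolower($2); split(fullname, names, ","); name = names[2] names[1] }
/^EMAIL/ { mail = tolower($2); print mail "," name }
EOF

# Hypothetical five-line input, invented for this example.
cat > sample.vcf <<'EOF'
BEGIN:VCARD
VERSION:2.1
FN:Doe,Jane
EMAIL:jane.doe@example.com
END:VCARD
EOF

# Build a data set 100 times larger by repeating the sample.
: > big.vcf
n=0
while [ "$n" -lt 100 ]; do
    cat sample.vcf >> big.vcf
    n=$((n + 1))
done

# Run a command several times and report the total wall-clock seconds.
# date +%s only has one-second resolution, so use plenty of runs or data.
time_runs() {
    runs=$1; shift
    start=$(date +%s)
    k=0
    while [ "$k" -lt "$runs" ]; do
        "$@" > /dev/null
        k=$((k + 1))
    done
    end=$(date +%s)
    echo "$((end - start))s for $runs runs: $*"
}

time_runs 5 awk -f extract-emails.awk big.vcf
# The compiled binary is only timed if awka has already produced it.
if [ -x ./extract-emails ]; then
    time_runs 5 ./extract-emails big.vcf
fi
```

Dividing each total by the number of runs gives a rough average. On input this small both totals will likely read 0s, which matches the observation that the difference only becomes visible on large data sets.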
The Awka project has not been updated in several years, but still
works on modern distributions. You may never need it, but it is a good
tool to have available if you want to use the simple awk language and need
the raw speed of a compiled program.

This work is licensed under a
Creative Commons Attribution-NonCommercial 2.5 License.