Linux Box Admin
Trusted Remote Administration
Turbocharged awk
Thursday, 23 February 2006
originally published on February 23, 2006 at linux.com

In a previous article, I covered the basics of awk and presented a small application to reformat address book data. Now, I would like to present a few ways to improve the performance of awk programs.

Basic profiling

If your program is not running at an acceptable speed, the first thing to do is look for a logic error. You may have created a loop that is executing more often than it should or may be processing a set of records more than once. To help you find these kinds of programming bugs, the GNU version of awk includes a profiler. It can usually be found in /bin/pgawk.

To use the profiler, call pgawk instead of gawk when running your program, or use the --profile option with gawk. By default, the profiler will create a file in the current directory called "awkprof.out" that contains a copy of the program with execution counts for each statement in the left margin. For example, here is the profiling output for the program from my previous article (extract-emails.awk):

      # gawk profile, created Tue Jan 17 13:38:25 2006

      # BEGIN block(s)

      BEGIN {
   1          FS = ":"
      }

      # Rule(s)

1120  /^FN/   { # 2
  84          fullname = tolower($2)
  84          split(fullname, names, ",")
  84          name = (names[2] names[1])
      }

1120  /^EMAIL/        { # 2
  82          mail = tolower($2)
  82          print ((mail ",") name)
      }

The numbers in the left margin indicate how many times each statement was executed. The statement in the BEGIN block was executed once, as expected. Each regular expression test was executed 1120 times, once for each record in the input file. Each block was executed only when a record matched the corresponding regular expression test.
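
As a rough illustration, the program can be reconstructed from the profiler listing above and run against a few lines of vCard-style input. This is only a sketch: the sample data and the temporary directory are made up, and the program text is inferred from the profile rather than copied from the original script. The profiling step is skipped when gawk is not installed, and it assumes a gawk whose --profile option accepts a file name.

```shell
workdir=$(mktemp -d)

# Made-up sample input, not the address book used in the article.
cat > "$workdir/sample.vcf" <<'EOF'
BEGIN:VCARD
FN:DOE,JOHN
EMAIL:John.Doe@Example.com
END:VCARD
BEGIN:VCARD
FN:ROE,JANE
EMAIL:jane@example.com
END:VCARD
EOF

# Program text inferred from the profiler listing.
cat > "$workdir/extract-emails.awk" <<'EOF'
BEGIN { FS = ":" }

/^FN/ {
    fullname = tolower($2)
    split(fullname, names, ",")
    name = names[2] names[1]
}

/^EMAIL/ {
    mail = tolower($2)
    print mail "," name
}
EOF

# Plain run: one output line per EMAIL record.
result=$(awk -f "$workdir/extract-emails.awk" "$workdir/sample.vcf")
echo "$result"

# Optional profiling run, only attempted when gawk is available.
if command -v gawk >/dev/null 2>&1; then
    gawk --profile="$workdir/awkprof.out" \
         -f "$workdir/extract-emails.awk" "$workdir/sample.vcf" >/dev/null
fi
```

With this sample, both rules are tested against all eight input lines, and each block runs twice because every card has both a name and an email address.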

Notice that the /^FN/ block was executed two more times than the /^EMAIL/ block. To understand this, I had to examine the data. I found two entries in the input data that contained a full name field but no corresponding email address:

BEGIN:VCARD
VERSION:2.1
X-GWTYPE:GROUP
FN:GREEN
N:GREEN
X-DL:GREEN(82 items)
END:VCARD

Since output is only printed inside the email block, these entries did not cause any problem with the results. They did bring to light something interesting about the data: the existence of these two "group" records.
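
A check like the one described above can also be scripted instead of done by eye. The following is a minimal sketch, assuming each entry is delimited by BEGIN:VCARD and END:VCARD lines as in the excerpt; the variable names and the sample data are hypothetical.

```shell
workdir=$(mktemp -d)

# Made-up input: one ordinary card and one group card without an email.
cat > "$workdir/cards.vcf" <<'EOF'
BEGIN:VCARD
VERSION:2.1
FN:DOE,JOHN
EMAIL:jdoe@example.com
END:VCARD
BEGIN:VCARD
VERSION:2.1
X-GWTYPE:GROUP
FN:GREEN
N:GREEN
X-DL:GREEN(82 items)
END:VCARD
EOF

# Print the full name of every card that has an FN line but no EMAIL line.
missing=$(awk -F: '
/^BEGIN:VCARD/ { fn = ""; has_mail = 0 }   # reset state for each card
/^FN/          { fn = $2 }
/^EMAIL/       { has_mail = 1 }
/^END:VCARD/   { if (fn != "" && !has_mail) print fn }
' "$workdir/cards.vcf")

echo "$missing"
```

Run against the sample, this prints only the name of the group card.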

Complete cross reference

For larger programs, where you may pull in several awk script libraries, deeper bugs may be hunted by creating a cross reference of all functions and variables. Phil Bewig created XREF, a public domain awk program, to take valid awk programs as input and write a cross reference to standard out. It is executed as follows:

awk -f xref.awk [ file ... ]

For ordinary variables and array variables, a line of the form
count var(func) lines ...
is printed, where "count" is the number of times the variable is used, "var" is the name of the variable, "func" is the name of the function to which the variable is local (a null "func" indicates that the variable is global), and "lines" lists the number of each line where the variable appears. A similar line is printed for each function.

Here is the output of XREF when run against the sample program:

    1 FS() 1
    2 fullname() 8 9
    2 mail() 15 16
    2 name() 10 16
    3 names() 9 10
    2 tolower(0) 8 15

It shows the number of times each variable or function was used and the line numbers where they appear.

XREF is invaluable as a tool to see if you have any variable scoping issues or a function name defined in two places.
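
To make the scoping problem concrete, here is a small sketch of the kind of bug a cross reference helps expose. In awk, a variable used inside a function is global unless it is listed as an extra parameter. The function sum_to and its variables are invented for this illustration and are not part of the sample program.

```shell
# Version 1: i and total are not declared, so they are global.
leaky=$(awk '
function sum_to(n) {
    total = 0
    for (i = 1; i <= n; i++)
        total += i
    return total
}
BEGIN {
    i = 42
    sum_to(3)
    print i        # the loop counter has overwritten the global i
}')

# Version 2: extra parameters (after the gap) act as local variables.
scoped=$(awk '
function sum_to(n,    i, total) {
    total = 0
    for (i = 1; i <= n; i++)
        total += i
    return total
}
BEGIN {
    i = 42
    sum_to(3)
    print i        # the global i is untouched
}')

echo "leaky=$leaky scoped=$scoped"
```

In a cross reference, the first version would list i with an empty function name, which marks it as global, while the second would attribute it to sum_to.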

Bolting on a turbo

Most awk applications run quickly with no performance tuning. But if you are crunching enormous text files or doing lengthy calculations, things may slow down. To squeeze the maximum amount of performance out of awk, you can use Awka to translate your program into ANSI C and compile it.

Awka is a GPL'ed program distributed as source code, so you'll need to install it with the ./configure, make, make install routine. Compiling awka creates the awka binary and the libawka.so shared library that is used with each generated program. Note: You may need to add /usr/local/lib to /etc/ld.so.conf and run ldconfig before the system is able to find libawka.so.

Typical performance gains are 50% or more over native gawk, though the gain depends on the mix of operations and data. Arrays, nested loops, and user-defined functions show great speed increases. A performance comparison chart of various operations is available on the Awka web site.

I used awka to create a binary version of the sample program above with this command:

awka -X -o extract-emails -f extract-emails.awk

Then I ran my own comparison test. The compiled awka version ran about three times as fast as native gawk, though with such a small program and small data set, the difference in wall clock time was negligible. To see whether the results scaled, I increased the data set first by a factor of 100, then by 1000. With 100 times the data, awka was still three times as fast, and with 1000 times the data, awka was about four times as fast (0.932 seconds on average vs. 3.811 seconds for gawk). Your mileage will vary.
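
One way to build the larger data sets for such a test is to repeat the input file a fixed number of times. The sketch below is an assumption about how that could be done, not the method used for the figures above; the scale helper, the file names, and the tiny base file are all hypothetical.

```shell
workdir=$(mktemp -d)

# Tiny made-up base data set: one name line and one email line.
printf 'FN:DOE,JOHN\nEMAIL:jdoe@example.com\n' > "$workdir/base.vcf"

# scale FILE FACTOR: print FACTOR back-to-back copies of FILE.
scale() {
    awk -v n="$2" '
        { line[NR] = $0 }                  # buffer the whole file
        END {
            for (copy = 1; copy <= n; copy++)
                for (row = 1; row <= NR; row++)
                    print line[row]
        }' "$1"
}

scale "$workdir/base.vcf" 100  > "$workdir/x100.vcf"
scale "$workdir/base.vcf" 1000 > "$workdir/x1000.vcf"

# Confirm the scaled files have the expected number of lines.
lines_100=$(awk 'END { print NR }' "$workdir/x100.vcf")
lines_1000=$(awk 'END { print NR }' "$workdir/x1000.vcf")
echo "$lines_100 $lines_1000"
```

Each scaled file can then be fed to both the interpreted and the compiled program under the shell's time command, for example `time gawk -f extract-emails.awk x1000.vcf > /dev/null`, repeating each run a few times and averaging the results.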

The Awka project has not been updated in several years, but still works on modern distributions. You may never need it, but it is a good tool to have available if you want to use the simple awk language and need the raw speed of a compiled program.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 2.5 License.
 

Copyright © 2006,2007 Linux Box Admin.

 