exclusive content published on May 16, 2007
In my organization, squid runs on the firewall as a transparent proxy and all
web traffic goes through it, while squidGuard filters content. Earlier, I spent some time talking about
how to use squid and squidGuard in these roles. The next step was to set up a
reporting tool to create useful reports from the log files.
Choosing a squid log file analysis tool
On the official squid web site, there
are twenty or so reporting and analysis tools listed that have been developed
by third parties. Selecting the right reporting tool requires understanding
your needs. Some tools focus on the technical performance of squid, while
others focus on how the Internet is being used and where each user is going.
For measuring the performance aspect, I decided to rely on the SAR system
performance package and squid's own cache reporting program, squidclient. If
you want to get deeper into squid performance, there are many solid third party
applications to help you fine tune it.
For looking at user activity, I chose Squidalyser because of
its flexible interface for creating ad hoc reports. Instead of producing
canned reports, it allows you to create reports on one or a group of users,
showing all activity or limited to certain criteria. In our organization, the
main purpose of looking at user activity is in response to a department manager
request. When that happens, we are usually looking for a pattern of abuse and
don't need reports on other users. Squidalyser excels at this kind of
research.
As a set of Perl scripts, Squidalyser is not hard to install. Like most
non-trivial Perl applications, I had to download a few Perl modules. Usually,
it is a routine exercise, but there were a couple of modules that failed during
installation. The modules I had trouble with were the GD graphic library
modules. According to the installation notes, they are optionally used
to create graphs, and Squidalyser will work without them.
To get the GD modules installed, I had to install the gd-devel and the
libpng-devel packages. I discovered this by pouring through the Perl
compilation errors and seeing which header files were missing. When Perl is
compiling the GD module, it asks questions about how libgd was compiled.
Sorting through the details may be more effort than some people want to endure.
As a final note, while I found Squidalyser to be the best reporting tool in my
environment, the code has not been updated in several years.
I am not sure it is still being developed and the code may grow less
compatible with future releases of Perl.
Building a bridge to squidGuard
Squidalyser includes a script that parses the squid log files and stores data in
a MySQL database for later reporting. The report front end is a web based CGI
script that reads the database and displays the results.
SquidGuard uses a different log file format than squid, and Squidalyser was not
designed to handle it. My solution was to make the least invasive changes to
Squidalyser and reuse as much of it as possible.
Entries in the squidGuard log file are created when a URL is blocked for some
reason. To keep the data isolated, I decided to store the squidGuard log
entries in a separate table. I created the table using the same schema as the
one that stores regular squid entries so I could reuse the Squidalyser
reporting and filtering procedures. Then, I copied the parsing script, called
squidparse.pl to guardparse.pl and made the necessary changes. The final step
was to modify the report CGI script to show a new option and read data from the
new table.

new squidGuard report option added
|
If you are interested in my modified version, let me know through the contact
form and I'll send you the source.
Volume speaks volumes
Squidalyser uses the squidparse.pl script to read new log records and load the
database. Similarly, the guardparse.pl script mentioned above is used to read
the squidGuard log and load the database. Both are scheduleid to run via cron
early each morning. This is the only regular maintenance required.
Both parsing scripts read the "expire" entry in the squidalyser.conf
configuration file to determine how long to keep records in the database. Our
policy is six months (24 weeks), leading to some new concerns.
Because our web volume is significant, the database grows about 13 million
records a month. At the time of this writing, the database contained 28 million records.
When it is fully populated, it should contain about 80 million. So far, MySQL
has not had any trouble managing this number of records, but the way
Squidalyser starts up causes the report CGI to load slowly. When the report
interface is loaded, it looks for unique IDs in the database to display them in a
drop down box. This means that each week, it takes longer for the initial web
page to open.
To reduce the start up time, I pulled the SQL statement that finds unique
names out of Squidalyser and ran it as a separate process after the parsing
scripts each night. The results are stored in a plain text file and I
modified Squidalyser to read that file
during start up. That change reduced the start up time from a
couple of minutes to a couple of seconds.
Meta squid
Ultimately, reporting tools like Squidalyser close the business loop to make
squid/squidGuard a complete product. In addition to the proxy services and
content filtering, it offers an easy to use ad hoc reporting tool to research
usage and pull information out of the prodigious log files.

This work is licensed under a
Creative Commons Attribution-NonCommercial 2.5 License.