Content filtering for the web can be a messy proposition. There are several open source and commercial options available to fit a variety of scenarios. A business may only need to block the most objectionable web sites, while schools may be required by law to follow a more thorough process. This article examines one solution built with only open source pieces: squid, squidGuard, and blacklists.
Squid proxy and cache
The squid server acts as an intermediary between a web browser and web server. As a proxy, it receives a URL request from the browser, connects to the server on behalf of the browser, downloads the content, then provides it to the browser. In addition, it saves the content to disk so it can provide it more quickly to another browser if the same URL is requested in the near future. Generally, this leads to more efficient utilization of an Internet connection and faster response times for web browsers.
A typical hardware setup uses two physical network cards, one connected to the internal network where squid listens for incoming HTTP requests (on the default port 3128), and one connected to the Internet where it downloads content.
Squid is a complex piece of software, but is available for most Linux distributions as a standard package. My production system uses Red Hat Linux and I was able to get squid running with sane defaults by simply installing the RPM and setting a few options in the /etc/squid/squid.conf configuration file:
visible_hostname your-server-name
acl our_networks src 192.168.0.0/16
http_access allow our_networks
http_access deny all
The visible_hostname option tells squid the name of the server. The acl line defines an access control list, which the http_access rule then uses to allow internal clients to connect to squid. For security reasons, it is important to ensure that users outside your network can't use squid. This is achieved by the final "http_access deny all" rule: squid checks the http_access rules in order and applies the first one that matches, so any client not explicitly allowed above is refused.
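The first-match behavior of the http_access rules can be sketched in a few lines of Python. This is only a model of the four-line excerpt above, not squid's actual implementation; the function name, the rule table, and the fallback when nothing matches are assumptions made for illustration.

```python
import ipaddress

# Hypothetical model of the squid.conf excerpt: each rule is
# (action, network), and "all" is modelled as 0.0.0.0/0.
OUR_NETWORKS = ipaddress.ip_network("192.168.0.0/16")
ALL = ipaddress.ip_network("0.0.0.0/0")

HTTP_ACCESS = [
    ("allow", OUR_NETWORKS),  # http_access allow our_networks
    ("deny", ALL),            # http_access deny all
]


def http_access(client_ip: str) -> str:
    """Return the action of the first rule whose network contains client_ip."""
    address = ipaddress.ip_address(client_ip)
    for action, network in HTTP_ACCESS:
        if address in network:
            return action
    return "deny"  # conservative fallback chosen for this sketch


print(http_access("192.168.10.25"))  # an internal client
print(http_access("203.0.113.7"))    # a client outside the network
```

Swapping the two rules in the table would refuse every client, which is why the deny rule belongs at the bottom.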
Tell the browsers
Most web browsers are proxy aware and behave a little differently when they know they are talking to a proxy server. In Firefox 2.0, you enter proxy settings under Tools / Options (Firefox / Preferences on Mac) / Advanced section / Network tab, then click the Settings button under Connection.
Once the browser is configured, it should make requests and get responses from squid. Another way to use squid is in transparent proxy mode. Transparent proxies are often used to force web traffic through the proxy regardless of how each browser is configured. Doing so requires some network trickery to hijack outgoing HTTP requests and also requires additional tweaks to squid. Configuring a transparent proxy is beyond the scope of this article, but useful guides are available.
Redirectors
With no additional configuration, squid faithfully fetches and returns each URL requested of it. To filter the content, squid has a feature called a redirector. A redirector is a separate program called by squid that examines each URL and either tells squid to proceed as usual or rewrites the URL so squid returns something else instead. Most often, redirectors rewrite banned URLs, returning the URL of a custom error page that explains why the requested URL was not honored.
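To make the idea concrete, here is a minimal sketch of a redirector in Python. It assumes the classic line-based interface, where squid writes one request per line (URL, client address, ident, and method, separated by spaces) and expects one reply line per request: a replacement URL, or a blank line to leave the request alone. The banned host and the function names are hypothetical, and the input is simulated rather than read from a live squid.

```python
import io
from urllib.parse import urlsplit

# Hypothetical values for illustration only.
BANNED_HOSTS = {"bad.example.com"}
BLOCK_PAGE = "http://webserver.com/blocked.html"


def rewrite(request_line: str) -> str:
    """Return a replacement URL, or "" to leave the request unchanged."""
    fields = request_line.split()
    if not fields:
        return ""
    host = urlsplit(fields[0]).hostname or ""
    return BLOCK_PAGE if host.lower() in BANNED_HOSTS else ""


def serve(requests, replies) -> None:
    """Answer each request line with exactly one reply line."""
    for line in requests:
        replies.write(rewrite(line) + "\n")
        replies.flush()  # the caller waits for each answer, so do not buffer


# Simulate two requests instead of reading from a live squid.
fake_input = io.StringIO(
    "http://bad.example.com/page 192.168.0.5/- - GET\n"
    "http://good.example.org/ 192.168.0.5/- - GET\n"
)
fake_output = io.StringIO()
serve(fake_input, fake_output)
print(repr(fake_output.getvalue()))
```

A real redirector would call serve(sys.stdin, sys.stdout) and keep running for as long as squid keeps the pipe open.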
Several third-party redirectors have been written, including squirm and squidGuard. Both squirm and squidGuard are C language programs that need to be compiled from source. Squirm operates using regular expression rules, while squidGuard uses a database of domains/URLs to make decisions. I have not done any performance testing on redirectors, but squidGuard has a reputation for scaling quite well as the size of the blacklist increases. In my experience, squidGuard has performed extremely well on networks up to a thousand users.
Installing squidGuard 1.2.0
The squidGuard redirector is installed using the familiar "configure, make, make install" routine. One requirement that may not be installed on your system is the Berkeley DB library (now owned by Oracle). Berkeley DB is known for its speed and squidGuard uses it to store blacklist domains and URLs.
After running "make install" using the squidGuard source, I discovered that some directories were not created. I manually created the following directories:
/usr/local/squidGuard/ - for configuration files
/usr/local/squidGuard/log/ - for log files
/usr/local/squidGuard/db/ - for blacklist files
Next, I copied the sample configuration file to /usr/local/squidGuard/squidGuard.conf. We'll come back to the squidGuard configuration shortly.
To make squid aware of squidGuard, add these options to /etc/squid/squid.conf:
redirect_program /usr/local/bin/squidGuard -c \
/usr/local/squidGuard/squidGuard.conf
redirect_children 8
redirector_bypass on
The redirect_program option points to the redirector binary and configuration file. The redirect_children option controls how many redirector processes to start. The redirector_bypass option tells squid to ignore the redirector if it becomes unavailable for some reason. If you do not set this option and squidGuard crashes or gets overloaded, squid will quit with a fatal error, perhaps ending all web access.
Using a blacklist
To be effective as a filter, squidGuard needs a list of domains and URLs that should be blocked. Building and maintaining your own blacklist would require a huge investment of time. Fortunately, you can download a quality list and refresh it as it gets updated. One of the largest and most popular blacklists is maintained by Shalla Security Services.
The Shalla list contains over one million entries, categorized by subject such as pornography, gambling, warez, etc. You can use all or any part of the list. The list is free for non-commercial use. For commercial use, a one-page agreement needs to be signed and returned to Shalla, but there is no cost to use the list (unless it is embedded and resold in another product). Additional free and non-free blacklists are available, but the Shalla list is a good place to start.
To use it, download and unpack it in a temporary directory. It will create a directory called BL with subject subdirectories below. Copy the directory tree below BL to the /usr/local/squidGuard/db/ directory. When you are done, the db directory should contain the subject subdirectories.
The blacklist itself is a set of plain text files named domains and urls. To allow squidGuard to use them, the text files must be loaded into Berkeley DB format. Before running the conversion process, return to the squidGuard.conf file and define which files you want to use.
Following is a basic squidGuard.conf configuration:
#
# CONFIG FILE FOR SQUIDGUARD
#
dbhome /usr/local/squidGuard/db
logdir /usr/local/squidGuard/log
# DESTINATIONS
dest spy {
    domainlist spyware/domains
    urllist spyware/urls
    log /usr/local/squidGuard/log/blocked.log
}
# ACCESS CONTROL LISTS
acl {
    default {
        pass !spy !in-addr all
        redirect http://webserver.com/blocked.html
    }
}
The dest block defines lists of domains and urls, used later in the access control section. The example defines a "spy" destination using the spyware blacklist files defined with relative paths to the files in the db directory. It also uses the log option to write records to the blocked.log file when a match is found. The name and location of the log file can be changed.
The acl block defines what squidGuard does with requests passed to it from squid. The example instructs squidGuard to allow all requests that do not match the "spy" destination and are not IP addresses. The redirect option defines what URL to return if a request does not pass. So, if a request matches our blacklist, it gets redirected to the blocked.html page. It is also possible to set up a CGI script that can collect and report additional information such as the user, source IP, and URL of the request.
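The decision made by the "pass !spy !in-addr all" line can be modelled roughly as follows. This sketch is not squidGuard's code: the domain set stands in for the spyware/domains file, the urls file is ignored for brevity, and it assumes that a listed domain also covers its subdomains. All function names are hypothetical.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical stand-in for the spyware/domains list.
SPY_DOMAINS = {"spyware.example"}
REDIRECT_URL = "http://webserver.com/blocked.html"


def in_domain_list(host: str, domains: set) -> bool:
    """True if host equals a listed domain or is a subdomain of one."""
    labels = host.split(".")
    return any(".".join(labels[i:]) in domains for i in range(len(labels)))


def is_ip_address(host: str) -> bool:
    """True if the host part of the URL is a literal IP address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def decide(url: str) -> str:
    """Model 'pass !spy !in-addr all': "" means pass, otherwise the redirect URL."""
    host = urlsplit(url).hostname or ""
    if in_domain_list(host, SPY_DOMAINS):  # !spy
        return REDIRECT_URL
    if is_ip_address(host):                # !in-addr
        return REDIRECT_URL
    return ""                              # all


print(decide("http://www.spyware.example/toolbar"))
print(decide("http://10.1.2.3/"))
print(repr(decide("http://example.org/")))
```

Each term in the pass line is checked from left to right, so adding another blacklist category to the configuration amounts to adding one more check before the final "all".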
The squidGuard configuration can be arbitrarily complex. I strongly recommend starting out with a simple configuration and slowly adding to it and testing it until it meets your requirements.
Returning to the blacklist, it is time to run the Berkeley DB load process, using squidGuard to create the database files. This command starts the conversion process:
/usr/local/bin/squidGuard -C all
With this command, squidGuard looks at its configuration file and only converts the files defined. In the example, it would only convert the spyware lists, creating the files spyware/domains.db and spyware/urls.db. The loading process can take a while, especially on older hardware.
I ran into an issue with file permissions on the blacklist databases. If the files did not have permissions of 777, squidGuard was not able to use them. Even though the squidGuard processes ran as user squid and the files were owned by user squid with permissions of 755, squidGuard did not work as expected. In my setup, this was not a big problem because squidGuard was running on a standalone firewall. However, on a multi-user system, it would be a serious concern.
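The pieces fit together as follows: a browser sends its request to squid; squid hands the URL to a squidGuard process, which checks it against the Berkeley DB blacklists; squidGuard either leaves the URL alone or rewrites it to the blocked page; squid then fetches whichever URL it ends up with and returns the content to the browser.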
Using a whitelist
There are a couple of approaches to setting up a whitelist. One option is to create a whitelist directory under the squidGuard db directory and manage the whitelist using squidGuard ACLs. Another option is to create a file, such as /etc/squid/whitelist, and manage the exceptions with squid. Both options are effective, but I decided to manage the exceptions in squid for two reasons. First, it eliminates a call to squidGuard, and second, it is faster to modify. If the whitelist were maintained by squidGuard, squid would have to be restarted to make the changes active. With the whitelist maintained by squid, a much faster squid reload (re-reading the configuration file) is all that is required.
To configure the whitelist in squid, two extra options are needed in /etc/squid/squid.conf:
acl white dstdomain "/etc/squid/whitelist"
redirector_access deny white
The first option defines an access control list using the whitelist file. The whitelist file contains domain names (e.g., .youtube.com), one per line. The second option tells squid to skip the call to squidGuard if the requested domain is in the whitelist. Note that the options must appear in the order shown, because an ACL must be defined before it is used.
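The leading dot in a whitelist entry matters. With squid's domain-matching ACL type (dstdomain), an entry such as .youtube.com covers youtube.com itself and every subdomain, while an entry without the dot matches only that exact host. The following sketch models that rule; the function names are hypothetical and the parsing is deliberately simple (one entry per line, blank lines skipped).

```python
def load_whitelist(text: str) -> list:
    """Parse one domain per line, skipping blank lines."""
    return [line.strip().lower() for line in text.splitlines() if line.strip()]


def is_whitelisted(host: str, entries: list) -> bool:
    """Apply the leading-dot rule to decide whether host is whitelisted."""
    host = host.lower()
    for entry in entries:
        if entry.startswith("."):
            # ".youtube.com" covers youtube.com and any subdomain of it
            if host == entry[1:] or host.endswith(entry):
                return True
        elif host == entry:
            return True
    return False


entries = load_whitelist(".youtube.com\n\nintranet.example\n")
print(is_whitelisted("www.youtube.com", entries))
print(is_whitelisted("notyoutube.com", entries))
```

Note that notyoutube.com is not matched: the comparison keeps the dot, so only true subdomains qualify.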
Debugging and Tuning
Both squid and squidGuard create useful log files. The primary squid log file is /var/log/squid/cache.log. Squid is very clear when certain problems arise with the redirector. For example, these messages appeared in the squid log during the first full day of production using squidGuard:
WARNING: All redirector processes are busy.
WARNING: 5 pending requests queued
Consider increasing the number of redirector processes in your
config file.
The setting in squid.conf for the number of redirectors is redirect_children, so correcting this was straightforward. Other issues may be more subtle. Squid provides excellent internal diagnostic reports through squidclient, a program included with the squid package. Use the following command on the machine where squid is installed to get general statistics:
squidclient mgr:info
Use this command to see a report on the performance of the redirectors:
squidclient mgr:redirector
When squidGuard has a problem, it may not be as precise. A common error you may see in the squidGuard log is "going into emergency mode". There may be additional helpful messages in the log file, but emergency mode usually means that squidGuard has stopped working. Often, there is a syntax error in the configuration file, but it could be a permissions issue or something else. You can test a squidGuard configuration from the command line before committing changes. Simply feed a list of URLs to squidGuard on its standard input, using your test configuration file, and see if it returns the expected result. A blank line means squidGuard did not change the URL, while any other result means the URL was rewritten.
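The command-line test can be wrapped in a small helper. The sketch below is an assumption-laden illustration: it uses the binary path and -c option shown earlier, assumes each input line has the form "URL client-address ident method", and swaps in a fake runner for the demonstration so it works on a machine without squidGuard. The helper names and the sample URLs are hypothetical.

```python
import subprocess

SQUIDGUARD = "/usr/local/bin/squidGuard"  # path used earlier in the article


def squidguard_runner(config_path: str):
    """Build a callable that pipes request text through squidGuard -c CONFIG."""
    def run(request_text: str) -> str:
        result = subprocess.run(
            [SQUIDGUARD, "-c", config_path],
            input=request_text, capture_output=True, text=True, check=True,
        )
        return result.stdout
    return run


def check_urls(urls, run, client="192.168.0.5/-"):
    """Return {url: reply}; an empty reply means the URL was left unchanged."""
    request_text = "".join(f"{url} {client} - GET\n" for url in urls)
    replies = run(request_text).split("\n")
    return {url: reply.strip() for url, reply in zip(urls, replies)}


# Stand-in runner so the sketch can be tried without squidGuard installed.
def fake_runner(request_text: str) -> str:
    return "http://webserver.com/blocked.html\n\n"


results = check_urls(
    ["http://bad.example.com/", "http://good.example.org/"], fake_runner
)
for url, reply in results.items():
    print(url, "->", reply or "(unchanged)")
```

Against a real installation, pass squidguard_runner("/path/to/test.conf") in place of fake_runner and compare the replies with what the test configuration should produce.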
DansGuardian
While squidGuard is a fast way to handle blacklists, it may not be robust enough for some applications. Another popular content filter is DansGuardian, capable of phrase matching and PICS filtering in addition to blacklists and whitelists.
DansGuardian does not function as a squid redirector. It works more like a proxy, receiving the initial URL request from the browser, then forwarding it to squid and doing further analysis of the result. Because of its design, it may be more difficult to get it working as a transparent proxy.
DansGuardian is licensed under the GPL and free for non-commercial use, but is not free for commercial use. Also, it is not free if the person installing it charges for their services. These are additional considerations when evaluating DansGuardian, but few products, open source or commercial, can match its feature set.
The long arm of the squid
Squid and squidGuard offer a reliable, fast platform for web content filtering. If squidGuard doesn't meet your needs, additional redirectors are available or you can roll your own. In addition to blacklisting, the redirector interface can be used to remove advertising, replace images, and do other creative things. Content filtering with squid can be as coarse or as fine-grained as your needs require.

This work is licensed under a
Creative Commons Attribution-NonCommercial 2.5 License.