Update - 8-Apr-2004

I have updated patches for newer versions of Squid available for download.
Squid 2.5-STABLE5 patches.
Squid 2.5-STABLE3 patches.
For the record, I don't know why anybody would want the STABLE3 patches when they could be running STABLE5, but I am making them available anyway.

Transparent HTTP Caching with Squid and BSD/OS 4.2

In 2000, on the bsdi-users mailing list, a question was asked about how to do transparent HTTP caching/proxying using BSD/OS. At the time, I hinted at using the SO_BINDANY socket option in BSD/OS, along with the Squid cache (http://www.squid-cache.org) to implement this function. Recently, I was asked the specifics of how to do this, so I spent a little time making this technology go. What follows is a short writeup of the work I did to make transparent Web caching work.

Transparent HTTP Caching/Proxying

Transparent HTTP caching/proxying is the attempt to make a user's HTTP application (typically a Web browser) use an HTTP cache or proxy without having to modify the configuration of the application. In some cases, the application cannot be directly configured to use a proxy. Transparent caching is a useful tactic when all HTTP traffic must be run through a single machine, either for routing or security purposes. It is also useful when you are attempting to reduce network bandwidth requirements by aggressively caching Web content close to the data consumers.

How To Make It Work

At the Squid website, in one of the FAQs, there is a list of the four basic actions that need to be accomplished to enable transparent caching/proxying:
  1. Compile/run a version of Squid which accepts connections for other IP addresses.
  2. Configure Squid to accept and process the connections.
  3. Get your cache server to accept the packets.
  4. Get the packets to your cache server.
For a more complete version of these actions and exactly what they entail, visit the Squid FAQ pages at http://www.squid-cache.org/Doc/FAQ/FAQ.html (at the time of this writing, FAQ 17 was the relevant section). This writeup spells out the specific tasks needed to accomplish each of these actions on BSD/OS.

BSD/OS 4.2 To The Rescue

BSD/OS 4.2 ships with a fairly recent, stable version of Squid installed in /usr/contrib/bin/squid. The source code for that version of squid is on the "Contributed Sources" CD-ROM that ships with both the binary and source releases of BSD/OS. I didn't try to make transparent caching work with any version of Squid other than the version that is shipped with BSD/OS 4.2, since that version of Squid is relatively recent. I didn't know of any good reason to upgrade to a newer release of Squid.

Step One:

The first thing that needs to be done is a small set of changes to Squid to make it accept HTTP connections for any destination IP address. This is fairly easy to implement under BSD/OS, using its proprietary SO_BINDANY socket option. This socket option allows an application to bind to and accept connections for any IP address that gets routed to localhost for processing. To that end, I created a small set of patches for Squid that turns on the SO_BINDANY option inside of Squid. In order to apply these patches to Squid, you will need to find your "Contributed Sources" CD-ROM and put it in your CD-ROM drive to retrieve the Squid source code.
# mount /cdrom
# cd /var/tmp
# gzcat < /cdrom/contrib/squid.tar.gz | tar xvf -
# umount /cdrom

Apply the patches in the squid.patch file.
# patch -p0 < squid.patch

Build the (now patched) Squid distribution.
# gmake all

First, save the original binary, in case you need to revert back. Then copy the squid binary to the installation directory.
# mv /usr/contrib/bin/squid /usr/contrib/bin/squid.FCS
# cp src/squid /usr/contrib/bin/squid

You have now completed the first step of the process -- your squid binary will now set the socket option SO_BINDANY on the sockets it creates for accepting connections.

Step Two:

You need to configure Squid to accept the connections from the hosts on your network. Configuring Squid to accept connections from a given set of hosts or networks is fairly well documented in the Squid FAQs, but it does require careful reading. I've included the diff of the BSD/OS installed squid.conf.default and the configuration file that is being used on my proxy server.

Next you need to create a configuration file and apply the patches in the squid.conf.patch file.

# cd /var/www/squid/conf
# cp squid.conf.default squid.conf
# patch -p0 < squid.conf.patch
# vi squid.conf

Note: You WILL need to add an ACL for your network/hosts that allows them to connect to the proxy. I've called my ACL "dummy" and it allows anybody on 192.168.1.0/24 to connect to the proxy. You MUST edit this line of the configuration file for it to work properly on your network. There are lots of other settings that you might want or need to tweak to have Squid do what you want.
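
After editing, it can be worth confirming that the ACL and the rule that uses it both made it into the file. The following is only a sketch: show_acl is a hypothetical helper name, and it assumes the usual squid.conf layout in which an ACL is declared on an "acl" line and granted access on an "http_access" line.

```shell
# show_acl FILE NAME: print the "acl" line that defines NAME and every
# "http_access" line that mentions it (show_acl is a hypothetical helper).
show_acl() {
    awk -v name="$2" '
        $1 == "acl" && $2 == name { print; next }
        $1 == "http_access" {
            for (i = 3; i <= NF; i++)
                if ($i == name || $i == ("!" name)) { print; break }
        }
    ' "$1"
}
```

Running "show_acl /var/www/squid/conf/squid.conf dummy" should list the definition of the ACL and the http_access rule that allows it; the exact lines depend on what squid.conf.patch and your own edits put there. Squid evaluates http_access rules in order, so the allow rule has to appear before any catch-all deny rule.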

Now, start the Squid process:

# /var/www/squid/bin/start-squid 

Step Three:

Getting the cache server to actually accept the packets is really easy under BSD/OS, assuming you have left the IPFW option on in the kernel. BSD/OS 4.2 ships with this facility turned on, so unless you turned it off explicitly in the kernel you are running, you should have it in your current kernel. You can verify that this facility is turned on in your kernel by examining the kernel configuration file that was used to build your kernel and making sure that a line like the following is in the configuration file:
options         IPFW                    # IP Filtering
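
A quick way to make that check is to grep the kernel configuration file for the option. This is a minimal sketch: has_ipfw_option is a hypothetical helper name, and the location of the configuration file is passed in as an argument because it depends on how your kernel was built.

```shell
# has_ipfw_option FILE: succeed if FILE has an uncommented "options IPFW" line.
# The path of the kernel configuration file is site-specific, so it is passed in.
has_ipfw_option() {
    grep -Eq '^[[:space:]]*options[[:space:]]+IPFW([[:space:]]|$)' "$1"
}
```

For example, "has_ipfw_option /path/to/kernel/config && echo IPFW is enabled" prints the message only when the option is present and not commented out.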

You need to install a small filter at the pre-input location in the kernel. Here's a filter similar to what is installed on my router host:

tcp && srcaddr(192.168.1.25) && dstport(service(http/tcp)) {
        forcelocal;
        accept;
}

This filter only turns on the transparent proxying for a single host (192.168.1.25) -- which is all that I needed for demonstrating that the proxying was working. In a normal situation, you will need to change the filter to allow your entire netblock(s) to connect to the service. This could be done by modifying the srcaddr(...) part in the above example to srcaddr(192.168.1.0/24). If you have multiple netblocks that you want to allow, you can list them like this: srcaddr(192.168.1.0/24, 192.168.2.0/24)
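
If the list of netblocks changes from time to time, the filter file can be generated rather than edited by hand. The sketch below uses only the filter syntax shown above; write_preinput_filter is a hypothetical helper name, and the netblocks are passed as one comma-separated string, exactly as they should appear inside srcaddr(...).

```shell
# write_preinput_filter NETBLOCKS: print a pre-input filter for the given
# source netblocks (one comma-separated string, as srcaddr() expects).
write_preinput_filter() {
    cat <<EOF
tcp && srcaddr($1) && dstport(service(http/tcp)) {
        forcelocal;
        accept;
}
EOF
}
```

For example: write_preinput_filter "192.168.1.0/24, 192.168.2.0/24" > /path/to/pre-input.filter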

To install the filter, you will need to run the following commands.

# ipfwcmp -o /var/run/ipfw.pre-input /path/to/pre-input.filter
# ipfw pre-input -replace /var/run/ipfw.pre-input

Don't forget to put these commands in your startup files so this filter will get installed each time your machine is rebooted! I would suggest putting these commands at the end of the /etc/netstart file.

The first command compiles the ASCII representation of the filter into the binary format that the ipfw command uses. The second command will actually download the filter into the running kernel, replacing any existing pre-input filter. If you need to make changes to the filter, you can execute these commands again (after editing the filter file) and implement the changes to the filtering rules in the running kernel without having to reboot.
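
For a startup file, the two commands can be chained so that the running filter is replaced only when the new one compiled cleanly. This is a sketch under one assumption that I have not verified: that ipfwcmp exits with a non-zero status when the filter fails to compile. install_preinput_filter is a hypothetical helper name; the ipfwcmp and ipfw invocations are the same ones shown above.

```shell
# install_preinput_filter SRC [OBJ]: compile SRC with ipfwcmp and load the
# result only if compilation succeeded. Assumes ipfwcmp exits non-zero on
# a filter that fails to compile.
install_preinput_filter() {
    src=$1
    obj=${2:-/var/run/ipfw.pre-input}
    ipfwcmp -o "$obj" "$src" && ipfw pre-input -replace "$obj"
}
```

For example: install_preinput_filter /path/to/pre-input.filter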

Step Four:

There is no step four, at least not for my network setup. In my configuration, there is a BSD/OS 4.2 machine that acts as the gateway to the Internet. This host terminates the Frame Relay connection directly on a serial interface and has multiple ethernet interfaces. It does the filtering for the network and now it runs the Squid cache too. If you are planning to run the Squid cache on a machine that doesn't have all the outbound network traffic already flowing through it, you will need to investigate the FAQs at the Squid homepage and see how to do that. They have notes about configuring some popular brands of routers to do just this.

Performance Tuning and Further Patching

After I wrote up all the above information, it was pointed out that there were several bugs in this version of Squid (2.3STABLE4) for which patches have been posted. I looked at all the patches, and while some were gratuitous (in that they fixed code that wasn't enabled in the BSD/OS configuration of Squid) they were all very easy to apply. None of the posted patches at the Squid website conflict with the patches that I wrote. For your reference, the bugs, their descriptions and patches for them are located at http://www.squid-cache.org/Versions/v2/2.3/bugs/. It is probably worth the effort to get and apply these patches for the reported bugs.

Performance Tuning

After getting the transparent HTTP caching working with BSD/OS and Squid, I used the system on and off for the next day and a half. My perception was that browsing the Internet without the Squid cache was definitely faster than browsing with the cache enabled. This was obviously not the best possible solution -- there is little point in running a cache if the perceived speed of the network connection goes down.

I examined the log files as they were written by the Squid proxy. Whenever a new Web site was visited, there was a long pause before the first log entry for the new Web site would be written into the access file. All the subsequent log entries would be written quite rapidly. This problem resembled a DNS lookup problem. The Squid cache was making the end user wait while it resolved the name of the new Web host. This is not acceptable!
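
A pause like this should also show up in the log itself. In Squid's native access.log format the second field is the elapsed time of the request in milliseconds and the seventh field is the URL, so the slow entries can be picked out with awk. This is a rough sketch: slow_requests is a hypothetical helper name, the log file is passed in because its location is site-specific, and the field positions assume the native log format rather than the httpd-emulation format.

```shell
# slow_requests LIMIT_MS FILE: print "elapsed-ms URL" for each request in a
# native-format Squid access.log that took longer than LIMIT_MS milliseconds.
slow_requests() {
    awk -v limit="$1" '($2 + 0) > (limit + 0) { print $2, $7 }' "$2"
}
```

For example, "slow_requests 1000 /path/to/access.log" lists every request that took more than one second. If the slow entries are mostly the first request to each new host, name resolution is a likely suspect.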

Reading more of the Web pages at the Squid Web server, I stumbled across an extremely important piece of information. With the 2.3 release of Squid, the default nameserver lookup routines used were the internal proxy routines. In other words, all the nameservice lookups were being done internally by the program using the internal Squid resolver code.

While there should be nothing wrong with the Squid resolver, it does have one significant misfeature that may or may not affect your installation. I have been given a patch from the Squid development team to resolve this problem. The misfeature is this -- the Squid internal resolver routines open and parse the /etc/resolv.conf file to retrieve a list of nameservers to query during normal operation. This isn't a bad method to use to get a list of nameservers, except that the parsing code doesn't know that it needs to handle the following nameserver entry specially:

nameserver 0.0.0.0

Since the dawn of time (well, OK, at least since the dawn of the BIND resolver code, in the late 1980s) this has been a legal mechanism to signal the resolver to query the local running named process.

Squid dutifully sends DNS requests to this address, which get delivered to the local DNS server. The DNS server then sends back the response from one of the IP addresses it has bound to on the machine, but never from the address 0.0.0.0. Because the response comes from an IP address that isn't on the list of nameservers it will believe, Squid tosses out the answer and then queries the next nameserver on the list. What Squid should do when the special nameserver 0.0.0.0 is listed in the /etc/resolv.conf file is query the machine for all local IP addresses bound to its interfaces and accept DNS answers from any of those IP addresses.

Because of this misfeature in Squid, it appeared as if every nameservice request was failing. So while the proxy waited for a nameserver to fail on every request, it wasn't relaying other HTTP traffic, and the end user had to wait while the DNS information was resolved from a distant nameserver.

If you apply the above patch that works around this problem, you should not need the following workaround. To implement the workaround, you will need to rebuild the squid executable with the --disable-internal-dns option. This flag forces Squid to use the external dnsserver program, which uses the system resolver routines and will happily accept answers from the local nameserver's IP addresses. After restarting the Web proxy with this workaround in place, browsing through the proxy did not seem noticeably slower than without the proxy.

The very small patch for the Makefile to specify this flag is available. You probably don't need this patch if you don't have 0.0.0.0 listed in your /etc/resolv.conf file as a nameserver!
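
To find out whether your machine is affected, check the resolver configuration for that entry. A minimal sketch, in which lists_wildcard_nameserver is a hypothetical helper name and commented-out entries are ignored:

```shell
# lists_wildcard_nameserver [FILE]: succeed if FILE (default /etc/resolv.conf)
# has a "nameserver 0.0.0.0" entry.
lists_wildcard_nameserver() {
    grep -Eq '^[[:space:]]*nameserver[[:space:]]+0\.0\.0\.0([[:space:]]|$)' "${1:-/etc/resolv.conf}"
}
```

For example: lists_wildcard_nameserver && echo "0.0.0.0 is listed as a nameserver"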

If you choose to implement this workaround, you will want to rebuild from scratch and reinstall the resulting binary:

# gmake clean
# gmake all
# cp src/squid /usr/contrib/bin/squid

Don't forget to kill and restart the squid daemon for the change to take effect!

Instrumenting dnsserver

If you have determined that you need to run the external dnsserver program (and you probably do not need to do this), the following section describes what you need to do to collect some more information about how many instances of the dnsserver program to run.

The notes on the Squid homepage that describe using the external dnsserver program say that you should always try to run at least as many copies of the dnsserver program as the cache will have nameservice requests outstanding, and then run two more copies of the program for good measure. However, it doesn't appear that Squid keeps track of how many requests each of the dnsserver instances has handled, so figuring out when there are enough dnsserver processes running is not as simple as it could be.

Hacking a little code into the dnsserver program to do this counting seemed like the right solution. So, after a little work on the code to put in a call to setproctitle(), you can now look at the dnsserver processes with ps and see how many requests each of the dnsserver processes has handled. The patches for the dnsserver program are available.

# ps -auxw -U www | grep dnsserver
www        829  0.0  1.7  1396  492  ??  Is   10:30PM    0:00.06 (16 requests) (dnsserver)
www        830  0.0  1.7  1396  492  ??  Is   10:30PM    0:00.04 (1 requests) (dnsserver)
www        831  0.0  0.5  1060  144  ??  Is   10:30PM    0:00.02 (0 requests) (dnsserver)
www        832  0.0  0.5  1060  144  ??  Is   10:30PM    0:00.02 (0 requests) (dnsserver)
www        833  0.0  0.5  1060  144  ??  Is   10:30PM    0:00.02 (0 requests) (dnsserver)

This is much more useful than the default listing in the process table for dnsserver, at least in my opinion. If you patch dnsserver, you will need to recompile everything and reinstall at least the dnsserver binary. You should probably save the original copy of the program, in case you need to revert back to it for some reason.

# gmake clean
# gmake all
# mv /var/www/squid/bin/dnsserver /var/www/squid/bin/dnsserver.FCS
# install -c -o bin -g bin src/dnsserver /var/www/squid/bin/dnsserver

It's not completely obvious, but you will need to kill and restart the squid daemon for the new version of dnsserver to be started. This is necessary because squid starts up all the copies of the dnsserver program when it first starts and uses them until the squid daemon is stopped.
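
With the patched dnsserver running, the per-process counts in the ps listing shown earlier can be totalled with a little awk. This is a sketch only: dnsserver_request_counts is a hypothetical helper name, and it assumes the process titles look exactly like the sample listing, that is "(N requests) (dnsserver)".

```shell
# dnsserver_request_counts: read "ps" output on stdin and summarize the
# "(N requests) (dnsserver)" process titles set by the patched dnsserver.
dnsserver_request_counts() {
    awk '/\(dnsserver\)/ && match($0, /\([0-9]+ requests\)/) {
        n = substr($0, RSTART + 1, RLENGTH - 1) + 0   # leading digits only
        total += n
        if (n > 0) busy++
        procs++
    }
    END { printf "%d processes, %d busy, %d requests\n", procs, busy, total }'
}
```

For example: ps -auxw -U www | dnsserver_request_counts. If several processes still show zero requests after a period of normal use, there are probably enough of them; if every process has handled requests, it may be worth running more.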

Thanks

Many thanks go to Paul Borman (of the BSD/OS engineering team) for explaining to me what I didn't understand about the way that the SO_BINDANY socket option works in BSD/OS. Also, my thanks go to Adrian Chadd for pointing out that the Squid resolver routines are asynchronous and really ought to be used, now that the resolver misfeature in Squid is fixed.
Copyright 2001,2003,2004 Kurt Lidl. (lidl@pix.net) All Rights Reserved.
See http://www.pix.net/software/squid/ for any updates to this file.
Last Update: $Date: 2004/04/08 15:21:14 $