
Squid Proxy Cache


If you are installing across a network and have not previously configured a proxy cache, install one on the first machine you configure. The Squid package, which is part of all three extant debian distributions, can be configured in a straightforward manner and I suggest its use. At a minimum, it will save substantially on the bandwidth you use as you install on the machines that will be part of the cluster. Admittedly, the base installation of a cluster workstation may be no more than 25-50 MB, but if you are setting up five workstations in the manner suggested you will have reduced the bandwidth load by somewhere between 150 and 300 MB, because virtually all of the packages installed for the first machine after cache configuration will be locally available for the remaining installations. If your organization is like most others, you hardly have bandwidth to spare, and the effect on a shared medium of removing even that relatively modest volume can be significant. Further, that local availability can significantly shorten your installation time. Packages pulled from the cache on our local segment typically arrive three to five times faster than they do over our T1 pipe at 6:30 a.m. In times of network congestion, that differential can grow by an order of magnitude.


Beyond the current context, squid can substantially curtail on-going bandwidth utilization, even if the only user at your end of the pipe is you. As the number of users using the cache at your site grows, the utility of the cache grows, often in non-linear fashion. It is my personal belief that if you are responsible for almost any number of users who connect to the net over a shared medium, a proxy cache is of such utility that, if you are not currently using one, you should put the mosix project to the side for the moment and set one up. It does not have to be a powerful machine, and indeed if you have several hundred users across several subnets you would be best served by building a hierarchy of servers connecting to a root cache server, which would be the most muscular of the machines in the cache hierarchy. The most important factor affecting cache performance is not processor speed, but RAM and disk subsystem performance. A hierarchy of three, four, or five pentium-class caches with 128-256 MB of RAM and two or three 2 GB scsi drives configured as RAID0, feeding into a root cache serviced by a 300 MHz pentium II with 256 MB of RAM and three or four similar drives configured as RAID0, would serve 700-1000 users very well. You could probably buy the hardware for the root cache for around $200-250 USD. A T1 connection, last I checked, costs around $1000 USD per month. Do the math ... sound cost-effective?


Regardless, my discussion here will be limited to a quick squid setup to hold installation packages. If you pursue setting up a more serious squid configuration, the section on raid configuration should be of interest to you. Interestingly, mosix clusters themselves have been discussed as squid proxy cache servers, but I think these discussions have been primarily concerned with use by ISPs and other high-volume network focus points. The operating requirements for a squid server capable of responsively handling a high volume of requests exceed the capabilities of the architecture of the Ralphzilla cluster substantially, but I may briefly discuss such a setup in alternative configurations.


In any event, to configure a box as a cache server, do a quick debian installation on a machine with the room required to house a small cache. While 200-250 MB would probably suffice to hold the cache, if you have a scsi adapter and drive available, put them in this machine. The cache is not going to be any more responsive than the drive on which it is located, and there is no reason to make yourself wait on a slow drive if you don't have to.


Once you have installed debian and squid (which can be installed either during the initial installation or afterward by typing "apt-get install squid"), you will need to edit the file /etc/squid.conf to tell squid where the cache should be located, its maximum size, the range of addresses allowed to access the cache, and the maximum size for objects retained in the cache. As you go through the squid.conf file, the first of these items you come to is maximum object size, around line 400. (In the copy of the file installed on my machine, this setting is on line 404.) The default value for this setting is 4 MB, and if your cache is configured for general-purpose use that may be the setting you wish to retain for day-to-day operation. However, a number of packages in debian are larger than that, so it is worthwhile to increase the value, if only for the period during which you are configuring machines. I set the value to 24 MB for my cache, which is probably larger than necessary but will store the largest package in our configuration.

maximum_object_size 24576 KB
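
Since the setting is given in KB while it is easier to think about package sizes in MB, a small helper can do the conversion. This is only a sketch; the function name mb_to_kb is my own and is not part of squid.

```shell
# Convert a size in megabytes to the kilobyte figure that
# maximum_object_size expects (1 MB = 1024 KB).
# mb_to_kb is a hypothetical helper name, not a squid tool.
mb_to_kb() {
    echo $(( $1 * 1024 ))
}

mb_to_kb 24   # prints 24576
mb_to_kb 4    # prints 4096, the default mentioned above
```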


The next item you will be able to configure is the cache directory: the actual location of the cache, the maximum size of the cache, and the number of first and second level directories in the cache. This setting is on or about line 457 of the squid.conf file. The default location of the cache is in the directory /var/cache, but if you intend to extend your cache to general-purpose use you should, at a minimum, dedicate a drive to the cache, to minimize contention with other processes accessing the disk. Ideally, of course, you should dedicate a couple of higher-speed drives to the cache and configure them as a RAID0 array. If you have such a dedicated space, specify the mount point of that space here. If you have installed a vanilla debian machine to hold the cache of installation packages and are using the default value of /var/cache, make sure that you have at least 100 MB available on the drive which holds the /var mount point. That should be in addition to whatever space will be taken up by the growth of logs and by any other processes that use space on that drive. A more reasonable choice for maximum cache size is 150 MB. That should more than suffice to hold a wide variety of packages if you have machines within the cluster that will play specialized roles like database server. In contexts in which a drive or array is dedicated to the cache I have seen it suggested that no more than 75-80% of the available drive space be devoted to the cache, to allow squid space to perform its housekeeping operations. Although I didn't know that at the time I configured my cache, I allocated 3 GB of the total available 4 GB on the striped two-volume disk set to squid, which is in line with this recommendation.

cache_dir /squid/cache 3072 60 312
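
The 75-80% guideline is simple arithmetic, and the following sketch works it out for a dedicated volume. The function name cache_size_mb and its default of 75 percent are my own assumptions for illustration; the result is the size in MB that goes into the cache_dir line.

```shell
# Suggest a cache size (in MB) that leaves squid some headroom.
# $1 = total size of the dedicated volume in MB
# $2 = percentage to hand to squid (assumed default: 75)
# cache_size_mb is a hypothetical helper name, not a squid tool.
cache_size_mb() {
    total_mb=$1
    percent=${2:-75}
    echo $(( total_mb * percent / 100 ))
}

cache_size_mb 4096      # prints 3072, the figure used above
cache_size_mb 4096 80   # prints 3276 (integer arithmetic rounds down)

# The total size of a mounted volume in MB can be read with, e.g.:
#   df -Pk /squid | awk 'NR==2 { print int($2 / 1024) }'
```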


Finally, you must specify the set of ip addresses which will be allowed to access the cache you are configuring. The configuration of my server is a good example of the basic configuration, because it services a well-defined range of ip addresses on one subnet. First, you must define that range as an access control list (acl). My subnet is assigned the class C network address 10.1.93.0 on the private network of state agencies. That translates in a straightforward manner to an acl definition, occupying line 1197 of my /etc/squid.conf file, as follows:

acl labnet src 10.1.93.0/255.255.255.0
This simply defines the acl labnet as being all of the machines on that class C network range.
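
If you want to convince yourself which client addresses a network/netmask pair like this one covers, the comparison can be sketched in a few lines of shell. This is only an illustration of the masking arithmetic, not how squid itself is implemented, and the names ip_to_int and in_subnet are hypothetical.

```shell
# Convert a dotted-quad address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address on dots into $1..$4
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed if address $1 lies in network $2 under netmask $3.
in_subnet() {
    addr=$(ip_to_int "$1")
    net=$(ip_to_int "$2")
    mask=$(ip_to_int "$3")
    [ $(( addr & mask )) -eq $(( net & mask )) ]
}

in_subnet 10.1.93.17 10.1.93.0 255.255.255.0 && echo "10.1.93.17 is in labnet"
in_subnet 10.1.94.5  10.1.93.0 255.255.255.0 || echo "10.1.94.5 is not in labnet"
```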


Once the acl is defined, you can specify the access allowed for that range in a manner similar to this:

http_access allow labnet

The default action squid takes when a request matches none of the "access" lines is determined by the last of those lines. The squid configuration documentation therefore suggests that, as a safety measure, an explicit final line be included to define the default for the server. As the primary goal of the cache I set up was to reduce the amount of web-related traffic traversing the T1 line that connects this location to the wider state network, and everyone at our location is in the range defined in labnet, this is not overly pertinent here. Whatever benefit someone outside our local network could gain by accessing our cache would be cancelled out by the latency involved in hitting it across the T1 line. Regardless, there is no point in letting their traffic bog down our connection. Therefore, the final line for this section is:
http_access deny all
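
Because squid checks the http_access lines in the order in which they appear and stops at the first match, it is worth looking at the active lines together after editing. The sketch below lists them with their line numbers; show_access_rules is a hypothetical helper name, and the path you pass it is whatever configuration file your installation actually uses.

```shell
# Print the uncommented acl and http_access lines of a squid
# configuration file, with line numbers, in file order.
# show_access_rules is a hypothetical helper name.
show_access_rules() {
    grep -nE '^[[:space:]]*(acl|http_access)[[:space:]]' "$1"
}

# Example (path as used in this document):
#   show_access_rules /etc/squid.conf
```

The "http_access allow labnet" line should appear before "http_access deny all" in the output; if the order were reversed, the deny line would match first and nobody would be allowed in.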


Once you've made the appropriate changes to /etc/squid.conf you will need to create the cache directories at the appropriate location and (re)start the squid process. First, stop squid, if it is running, by typing "squid -k shutdown". If you created a squid id under which the process runs, everything will be easier if you assign the ownership of the squid directory to the squid user: "chown squid /squid". (You may note that I assigned that ownership to the entire mount point, rather than just the /squid/cache directory. I did that just to head off any problems associated with squid housekeeping. There probably aren't any, but since I had created a separate mount point for the squid directory I had the freedom to do that without messing with something else.) Now su to the squid account ("su squid") and create the cache structure by typing "squid -z". If you are running squid as root, as I have, all you need to do is create the cache structure. Running as root might be considered a security problem on a widely-shared machine, but on a dedicated server with only one or two user accounts beyond the root account I do not see it as a serious problem. Anyway, once the cache directory has been created you can restart the server by issuing the command "/etc/init.d/squid start".
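
The steps above can be collected into a short script. This is a sketch under a few assumptions: a "squid" user exists, the cache lives under the mount point you pass in, and the init script accepts a "start" argument. The names run and rebuild_cache are hypothetical, and by default the script only prints what it would do; set DRY_RUN=0 (as root) to actually execute the commands.

```shell
# Dry-run by default: print commands instead of executing them.
DRY_RUN=${DRY_RUN:-1}

# Run a command, or just show it when DRY_RUN is 1.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Stop squid, fix ownership, create the cache structure, start squid.
# $1 = mount point that holds the cache (e.g. /squid)
rebuild_cache() {
    cache_root=$1
    run squid -k shutdown            # squid may take a little while to exit
    run chown squid "$cache_root"    # only needed when running as the squid user
    run su squid -c "squid -z"       # create the cache directory structure
    run /etc/init.d/squid start
}

rebuild_cache /squid
```

If you run squid as root, as I do, the chown and su steps are unnecessary and a plain "squid -z" takes their place.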


Your configuration of squid as a repository for packages need be no more difficult than this ... when you actually step through it, it shouldn't take you more than 5 or 10 minutes to configure. Squid can, however, be used in a much wider range of circumstances. You could, for example, use squid to restrict access from an acl to certain times of the day or to force redirection of requests to another cache that only accesses certain sites. It is important to remember, however, that for that to be effective the operating system on which the client browser runs must be able to restrict users from changing browser settings. In the Windows world, therefore, the client operating system must be Windows NT or one of its descendants, Windows 2000 or Windows XP, appropriately configured.