
Steps in Adding a Machine - Quick Reference

This short segment is intended to be a guide to handling the steps involved in adding a machine to the cluster in a reasonably efficient manner. It assumes that you only recently acquired the machine, and that it came to you from the equivalent of an organizational surplus bin. As elsewhere, there are elements of this discussion that represent my own personal opinion.



-- Make sure the machine will boot, both to a diskette and its own hard drive.

-- Enter the system BIOS and check for installed peripherals; in particular, check whether the machine has an integrated network interface. If the system has an integrated interface that is not reasonably generic (e.g., NE2000, Intel, 3Com, SMC), disable it, unless you are short on network cards. Even then, given the low cost of common network cards, it may be cost-effective to purchase a relatively generic card, since this will substantially simplify kernel administration. If you are going to use an integrated interface, make a note of the type of interface and its settings (I/O address, IRQ number).

-- Open the box, clean out accumulated dust with a vacuum or compressed air, and make sure all connections are snug.

-- If the machine has one small IDE drive of, say, 500-850 MB, it has sufficient space to be used as a cluster machine. However, if you have another drive like that, put it in the box and split your file system between the two drives. The performance of that machine, and that of the cluster as a whole, will benefit as a result.

-- Similarly, if the machine has room for RAM and you have available RAM in the appropriate form factor, go ahead and add it. However, if you have a limited supply of RAM, keep in mind that there are other machines in the cluster. For example, if you have a RAM-constrained machine serving as a database server and data access is an important part of the cluster's functionality, the marginal utility of adding a chunk of RAM to that server is much higher than that of adding it to a generic cluster machine. Similarly, adding 16 MB of RAM to a cluster machine that has 32 MB will almost always have greater marginal utility than adding the same 16 MB to a cluster machine with 64 MB. If you keep in mind the aggregate pool of resources available to the cluster as you configure each machine, the sum of the marginal gains you make by allocating those resources where they return the largest benefit can have a significant impact on the performance of the cluster as a whole.
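
As a rough illustration of the marginal-utility argument above, the following shell sketch computes the relative increase in memory that the same 16 MB module gives each machine. Relative growth is only a crude stand-in for marginal utility, and the helper name pct_gain is made up for this example.

```shell
# pct_gain INSTALLED_MB ADDED_MB
# Prints the percentage increase in RAM (integer arithmetic).
# Hypothetical helper for illustration; not part of any Mosix or Debian tool.
pct_gain() {
    echo $(( $2 * 100 / $1 ))
}

small=$(pct_gain 32 16)   # machine with 32 MB installed
large=$(pct_gain 64 16)   # machine with 64 MB installed
echo "32 MB machine: +${small}%"
echo "64 MB machine: +${large}%"
```

The same module doubles its relative effect on the smaller machine, which is the intuition behind giving scarce RAM to the most constrained box first.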

-- Boot to the rescue disk and begin the installation. Repartition your hard drive(s) and create the file system keeping in mind the considerations described earlier.

-- Install the stable Debian distribution. If you are installing across the network and you installed a squid proxy cache as described earlier, make sure that you specify its address when prompted. Virtually all, if not all, of the packages you are loading should already be stored in that cache. There is no need to install a large set of packages, but your potential for trouble will be minimized if you add the C development packages. Samba should also be installed, as it will allow configuration files to be edited from Windows machines. Although it is less relevant here than on the cluster controller, if you have the space you may also benefit from installing the X Window environment, at least at the minimum level at which all of the support libraries are in place.

-- Once you have a stable potato environment, update to the Mosix-enabled kernel. If you followed the instructions for building the kernel, this should be no more difficult than getting a copy of the appropriate deb from the machine on which you compiled the kernel and installing it (e.g., "dpkg -i kernel-image-2.4.13_mosix.1.0_i386.deb"). When asked, don't worry about creating a boot disk; the kernel is likely too large to fit on one. While the installation script should make a copy of your old kernel, you may want to make a backup of your own before you do the installation (e.g., "cp /vmlinuz /vmlinuz.1").

-- Reboot the machine. If it loads Linux along with support for the hardware installed in the machine, and everything comes up OK, fine. If not, boot to a boot or rescue diskette, rename the old kernel to vmlinuz, run lilo, reboot the system under the old kernel, and try to figure out what went wrong. One of the best ways to do that is to examine the record for the failed boot in /var/log/syslog. Look for hardware components that would not load for some reason; in my experience such errors are the primary culprits for system failure after a kernel upgrade. If you are using multiple versions of the upgraded kernel, make sure you copied the right one from the build machine by checking that it was attempting to load the drivers associated with the hardware in the machine you are configuring. If the kernel is built appropriately, and a driver loading as a module is the source of the problem, you should probably try building a version of the kernel that compiles support for the device directly into the kernel. As long as driver resources are specified appropriately, in my experience that almost always works. If it doesn't, try replacing the offending hardware with something different. It is not impossible to hit a combination of kernel, system hardware, and peripheral that remains problematic, but this should be very rare.
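
The log check described above can be sketched as a small shell helper. This is a minimal sketch, not a definitive diagnostic: the function name scan_boot_log, the keyword list, and the sample log lines are all assumptions made for illustration, and real driver messages vary from kernel to kernel.

```shell
# scan_boot_log FILE
# Prints lines from FILE that look like driver or hardware load failures.
# The keyword list is an assumption for illustration; adjust it to the
# messages your own kernel and drivers actually emit.
scan_boot_log() {
    grep -i -E 'error|fail|not found|unresolved|no such device' "$1"
}

# Try it on a small made-up log so the sketch can run anywhere.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Nov  3 10:01:02 node5 kernel: Linux version 2.4.13 (root@build)
Nov  3 10:01:05 node5 kernel: eth0: device probe failed, no such device
Nov  3 10:01:06 node5 kernel: hda: 1664MB, CHS=845/64/63
EOF
hits=$(scan_boot_log "$sample")
echo "$hits"
rm -f "$sample"

# On the real machine (as root):
#   scan_boot_log /var/log/syslog | less
```

A keyword filter like this only narrows the search; reading the surrounding lines in /var/log/syslog is still the way to see which driver was being loaded when the failure occurred.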

-- If the machine boots with the new kernel, the next step is to upgrade to the unstable distribution. You can move the machine into the cluster now if you wish, as long as you first make the network configuration changes described in the step that follows this one, or you can move the machine after the distribution upgrade. For me, it depends on whether I need the space the machine occupies and whether the upgrade is the highest priority item on my agenda at the time. If it is, it can be handy to have the machine available for local login. If you have other things going on, however, you can just as well move the box into the cluster, telnet to the cluster controller and from there into the subject machine, and start the upgrade running in a window on your desktop.

In any event, to effect the upgrade to the unstable distribution, simply edit the /etc/apt/sources.list file, replacing the word "stable" with "unstable". (If you've moved the machine into the cluster, you'll need to change the file further to reflect the relevant directory on the cluster controller as the source, along with the method you are using to access it.) Then run "apt-get update" to refresh the package lists and "apt-get dist-upgrade" to install the packages from the unstable distribution. That should pretty much do it, but it is not at all unusual for the installation to bomb out somewhere along the line, usually because a package is being installed well in advance of other packages on which it depends. In general, these difficulties can be remedied by running "apt-get -f install" to fix the missing dependencies. You will generally have to finish the upgrade explicitly by issuing "apt-get dist-upgrade" again after fixing the dependency problem.
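
The edit and upgrade sequence described above might look roughly like the following. This is a sketch under stated assumptions: the mirror URL is a placeholder, and the substitution is run on a scratch copy rather than on the real /etc/apt/sources.list so that it can be tried safely.

```shell
# Work on a scratch copy so the sketch can be tried without touching /etc.
# The mirror URL is hypothetical; your sources.list will name your own source.
work=$(mktemp -d)
cat > "$work/sources.list" <<'EOF'
deb http://mirror.example.org/debian stable main contrib non-free
deb-src http://mirror.example.org/debian stable main
EOF

# Replace the distribution field only; matching on the surrounding spaces
# keeps an already-converted "unstable" entry from being touched again.
sed 's/ stable / unstable /' "$work/sources.list" > "$work/sources.list.new"
cat "$work/sources.list.new"

# On the real machine (as root), after editing /etc/apt/sources.list:
#   apt-get update
#   apt-get dist-upgrade || { apt-get -f install && apt-get dist-upgrade; }
```

The commented final line mirrors the recovery path in the text: if the first dist-upgrade stops partway, fix the dependencies and then run the upgrade again.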

-- You can now move the machine into the cluster. Before you do, however, you need to change three key files that configure your network environment: /etc/hosts, /etc/networks, and /etc/network/interfaces. The changes to these files should be self-evident: they just need to correspond to the machine's address in the cluster network segment. Assuming that the cluster controller is already in place, you can copy the files from that machine, remove the lines that have to do with the network interface that connects to the network cloud, change the address in /etc/network/interfaces to reflect the address of the machine you are adding, make sure that the interface is specified as "eth0", and check that the machine is listed accurately in /etc/hosts. The machine should then be functional on the network that houses the cluster, if not yet part of the cluster. That's the next step.
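
For orientation, the shell sketch below writes out what the edited files might contain for a machine on a small private segment. Every address and host name here is hypothetical, the files are written to a scratch directory rather than to /etc, and /etc/networks is left out for brevity.

```shell
# All addresses and host names below are hypothetical; substitute the
# values for your own cluster segment.
work=$(mktemp -d)

# Static configuration for the single cluster-facing interface, eth0.
cat > "$work/interfaces" <<'EOF'
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
    address 192.168.1.12
    netmask 255.255.255.0
    network 192.168.1.0
    broadcast 192.168.1.255
EOF

# Matching /etc/hosts entries: the new machine plus the cluster controller.
cat > "$work/hosts" <<'EOF'
127.0.0.1      localhost
192.168.1.1    controller
192.168.1.12   node12
EOF

cat "$work/interfaces" "$work/hosts"
```

Note that the controller's second interface, the one facing the outside network, has no counterpart here; a generic cluster machine carries only the eth0 stanza.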

-- Before the machines in the cluster can swap processes with each other, you have to enable that level of communication, as I discussed earlier. If you recall, that is accomplished through the use of secure shell clients and servers. There are three options available to you for installing ssh. (If, at the time you read this, there is an ssh package available in the Debian unstable distribution, that of course represents a fourth option, but I will proceed according to the current state of affairs.) First, if the ssh package has been installed on another machine in the cluster, you can simply copy the binaries as discussed in that earlier section. Second, you can download the source and build the package on the target machine. This should probably be considered when setting up the first machine in the cluster. The third option, intermediate between the first two in difficulty, is to download the compiled Red Hat binary from openssh.org, run it through the alien utility to convert it to a deb package, and then install it using dpkg. Using alien to convert packages is straightforward; "man alien" will display its man page if you are not familiar with it.

Once the ssh client and the sshd daemon are installed on the machine, make the changes to the /etc/ssh/sshd_config file that enable rhosts authentication. Next, make sure that the .rhosts file in the /root directory of the subject machine lists all of the machines in the cluster and that the .rhosts files on the other machines in the cluster list the subject machine. Finally, stop and start the sshd daemon to activate the configuration changes.
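
Keeping the .rhosts files consistent is easier if they are generated from a single list of hosts. The following is a minimal sketch of that idea; the host names are invented for the example, and the file is built in a scratch directory rather than in /root.

```shell
# Host names are hypothetical; list every machine in your own cluster,
# including the one being added.
cluster_hosts="controller node10 node11 node12"

# Build the file in a scratch directory; on the real machine it would be
# /root/.rhosts, and the same list would go on every other cluster member.
work=$(mktemp -d)
rhosts="$work/.rhosts"
: > "$rhosts"
for h in $cluster_hosts; do
    # Each line is "host user": allow root logins from that host.
    echo "$h root" >> "$rhosts"
done

# rhosts files that are writable by other users are typically ignored.
chmod 600 "$rhosts"
cat "$rhosts"
```

Because the same list belongs on every member, adding a machine means regenerating the file once and copying it around, rather than editing each copy by hand.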


-- The final step is to install and configure Mosix. As stated in the Mosix Configuration section, the command "apt-get install mosixview" should bring down mosixview and all of the Mosix packages. Unfortunately, in the week since I wrote that, the mosixview package seems to have been dropped from the Debian unstable distribution. (In any event, should the package become available again, you would likely not want to install it on a space-constrained system, because it would pull down all of the base X Window packages.) Simply getting the mosix package ("apt-get install mosix") should be sufficient to install Mosix as required for a basic cluster machine. The /etc/default/mosix and /etc/inittab files from other machines in the cluster should work fine, as long as the machine from which the files originate does not play a specialized role in the cluster, such as the one played by Ralphzilla-raider, the cluster database server / file storage machine. The mosix.map file in the /etc directory should be common across the entire cluster, so any mosix.map file already created for the cluster will work on the newly created machine. Details on creating these files can be found in the configuration section.

-- Once these files are in place, issue the command "/etc/init.d/mosix stop" to make sure any daemons that might somehow have been started previously are stopped, and then "/etc/init.d/mosix start" to start them again. The machine should now be functioning within the cluster. If you use the command "mosctl whois machine-name" (substituting the name of the machine), the mosctl utility should tell you the Mosix number of the machine. You should now try to mount the mfs file system ("mount -t mfs mount-point", substituting your own mount point). After verifying that this works (and why wouldn't it?), you should add a line to the /etc/fstab file to mount the filesystem automatically at system boot.
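
The fstab step can be sketched as follows. The entry itself is an assumption on my part (device name "mfs_mnt", mount point /mfs, option dfsa=1, as commonly seen in Mosix setups), so check it against the documentation for your Mosix version before using it. The helper name add_mfs_entry is made up for this example, and the demonstration runs against a scratch file rather than the real /etc/fstab.

```shell
# The fstab entry below is an assumption (device name "mfs_mnt", mount
# point /mfs, option dfsa=1); confirm it against your Mosix documentation.
mfs_line="mfs_mnt /mfs mfs dfsa=1 0 0"

# add_mfs_entry FILE: append the entry only if FILE has no mfs line yet,
# so running the step twice does not duplicate it.
add_mfs_entry() {
    grep -q '[[:space:]]mfs[[:space:]]' "$1" || echo "$mfs_line" >> "$1"
}

# Demonstrate on a scratch copy rather than the real /etc/fstab.
fstab=$(mktemp)
echo "/dev/hda1 / ext2 defaults 0 1" > "$fstab"
add_mfs_entry "$fstab"
add_mfs_entry "$fstab"   # second call is a no-op
cat "$fstab"

# On the real machine (as root):
#   /etc/init.d/mosix stop
#   /etc/init.d/mosix start
#   add_mfs_entry /etc/fstab
```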