This is a description of how to create a Single System Image (SSI) cluster of virtual User-Mode Linux (UML) machines. After explaining how to use the pre-built SSI/UML binaries, this document demonstrates what an SSI cluster can do. Then it shows more advanced users how to build their own SSI/UML kernels, ramdisks and root images. Following that, it provides an overview of how to move to a hardware-based SSI cluster. It concludes with a set of links and an invitation to contribute to the SSI Clustering project.
An SSI cluster is a collection of computers that work together as if they are a single highly-available supercomputer. There are at least three reasons to create an SSI cluster of virtual UML machines.
The raison d'être of the SSI Clustering project is to provide a full, highly available SSI environment for Linux. Goals for this project include availability, scalability and manageability, using standard servers. Technology pieces include: membership, single root and single init, single process space and process migration, load leveling, single IPC, device and networking space, and single management space.
The SSI project was seeded with HP's NonStop Clusters for UnixWare (NSC) technology. It also leverages other open source technologies, such as Cluster Infrastructure (CI), Global File System (GFS), keepalive/spawndaemon, Linux Virtual Server (LVS), and the Mosix load-leveler, to create the best general-purpose clustering environment on Linux.
The CI project is developing a common infrastructure for Linux clustering by extending the Cluster Membership Subsystem (CLMS) and Internode Communication Subsystem (ICS) from HP's NonStop Clusters for Unixware (NSC) code base.
GFS is a parallel physical file system for Linux. It allows multiple computers to simultaneously share a single drive. The SSI Clustering project uses GFS for its single, shared root. GFS was originally developed and open-sourced by Sistina Software. Later they decided to close the GFS source, which prompted the creation of the OpenGFS project to maintain a version of GFS that is still under the GPL.
keepalive is a process monitoring and restart daemon that was ported from HP's Non-Stop Clusters for UnixWare (NSC). It offers significantly more flexibility than the respawn feature of init.
spawndaemon provides a command-line interface for keepalive. It's used to control which processes keepalive monitors, along with various other parameters related to monitoring and restart.
Keepalive/spawndaemon is currently incompatible with the GFS shared root. keepalive makes use of shared writable memory mapped files, which OpenGFS does not yet support. It's only mentioned for the sake of completeness.
LVS allows you to build highly scalable and highly available network services over a set of cluster nodes. LVS offers various ways to load-balance connections (e.g., round-robin, least connection, etc.) across the cluster. The whole cluster is known to the outside world by a single IP address.
The SSI project will become more tightly integrated with LVS in the future. An advantage will be greatly reduced administrative overhead, because SSI kernels have the information necessary to automate most LVS configuration. Another advantage will be that the SSI environment allows much tighter coordination among server nodes.
LVS support is turned off in the current binary release of SSI/UML. To experiment with it you must build your own kernel as described in Section 4.
The Mosix load-leveler provides automatic load-balancing within a cluster. Using the Mosix algorithms, the load of each node is calculated and compared to the loads of the other nodes in the cluster. If it's determined that a node is overloaded, the load-leveler chooses a process to migrate to the best underloaded node.
Only the load-leveling algorithms have been taken from Mosix. The SSI Clustering project is using its own process migration model, membership mechanism and information sharing scheme.
The Mosix load-leveler is turned off in the current binary release of SSI/UML. To experiment with it you must build your own kernel as described in Section 4.
User-Mode Linux (UML) allows you to run one or more virtual Linux machines on a host Linux system. It includes virtual block, network, and serial devices to provide an environment that is almost as full-featured as a hardware-based machine.
The following are various cluster types found in use today. If you use or intend to use one of these cluster types, you may want to consider SSI clustering as an alternative or addition.
For more information about how SSI clustering compares to the cluster types above, read Bruce Walker's Introduction to Single System Image Clustering.
To create an SSI cluster of virtual UML machines, you need an Intel x86-based computer running any Linux distribution with a 2.2.15 or later kernel. About two gigabytes of available hard drive space are needed for each node's swap space, the original disk image, and its working copy.
A reasonably fast processor and sufficient memory are necessary to ensure good performance while running several virtual machines. The systems I've used so far have not performed well.
One was a 400 MHz PII with 192 MB of memory running Sawfish as its window manager. Bringing up a three node cluster was quite slow and sometimes failed, maybe due to problems with memory pressure in either UML or the UML port of SSI.
Another was a two-way 200 MHz Pentium Pro with 192 MB of memory that used a second machine as its X server. A three node cluster booted more quickly and failed less often, but performance was still less than satisfactory.
More testing is needed to know what the appropriate system requirements are. User feedback would be most useful, and can be sent to <firstname.lastname@example.org>.
The latest version of this HOWTO will always be made available on the SSI project website, in a variety of formats:
Feedback is most certainly welcome for this document. Please send your additions, comments and criticisms to the following email address: <email@example.com>.
This document is copyrighted © 2002 Hewlett-Packard Company and is distributed under the terms of the Linux Documentation Project (LDP) license, stated below.
Unless otherwise stated, Linux HOWTO documents are copyrighted by their respective authors. Linux HOWTO documents may be reproduced and distributed in whole or in part, in any medium physical or electronic, as long as this copyright notice is retained on all copies. Commercial redistribution is allowed and encouraged; however, the author would like to be notified of any such distributions.
All translations, derivative works, or aggregate works incorporating any Linux HOWTO documents must be covered under this copyright notice. That is, you may not produce a derivative work from a HOWTO and impose additional restrictions on its distribution. Exceptions to these rules may be granted under certain conditions; please contact the Linux HOWTO coordinator at the address given below.
In short, we wish to promote dissemination of this information through as many channels as possible. However, we do wish to retain copyright on the HOWTO documents, and would like to be notified of any plans to redistribute the HOWTOs.
If you have any questions, please contact <firstname.lastname@example.org>
No liability for the contents of this document can be accepted. Use the concepts, examples and other content at your own risk. As this is a new edition of this document, there may be errors and inaccuracies that could be damaging to your system. Proceed with caution; although problems are highly unlikely, the author(s) do not take any responsibility for them.
All copyrights are held by their respective owners, unless specifically noted otherwise. Use of a term in this document should not be regarded as affecting the validity of any trademark or service mark.
Naming of particular products or brands should not be seen as endorsements.
You are strongly advised to make a backup of your system before major installations, and to back up at regular intervals.
This section is a quick start guide for installing and running an SSI cluster of virtual UML machines. The most time-consuming part of this procedure is downloading the root image.
First you need to download an SSI-ready root image. The compressed image weighs in at over 150 MB, which will take more than six hours to download over a 56K modem, or about 45 minutes over a 500 Kbps broadband connection.
The image is based on Red Hat 7.2. This means the virtual SSI cluster will be running Red Hat, but it does not matter which distribution you run on the host system. A more advanced user can make a new root image based on another distribution. This is described in Section 5.
After downloading the root image, extract and install it.
Download the UML utilities. Extract, build, and install them.
Download the SSI/UML utilities. Extract, build, and install them.
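The exact file names depend on the versions you download, but for both utility packages the pattern is the familiar extract, build, and install cycle, sketched here with placeholder names:

    $ tar xzf <package>.tar.gz
    $ cd <package directory>
    $ make
    $ su -c 'make install'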
Assuming X Windows is running or the DISPLAY variable is set to an available X server, start a two node cluster with
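    $ ssi-start 2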
This command boots nodes 1 and 2. It displays each console in a new xterm. The nodes run through their early kernel initialization, then seek each other out and form an SSI cluster before booting the rest of the way. If you're anxious to see what an SSI cluster can do, skip ahead to Section 3.
You'll probably notice that two other consoles are started. One is the lock server node, which is an artefact of how the GFS shared root is implemented at this time. The console is not a node in the cluster, and it won't give you a login prompt. For more information about the lock server, see Section 7.3. The other console is for the UML virtual networking switch daemon. It won't give you a prompt, either.
Note that only one SSI/UML cluster can be running on a host at a time. It can, however, be run as a non-root user.
The argument to ssi-start is the number of nodes that should be in the cluster. It must be a number between 1 and 15. If this argument is omitted, it defaults to 3. The fifteen node limit is arbitrary, and can be easily increased in future releases.
To substitute your own SSI/UML files for the ones in /usr/local/lib and /usr/local/bin, provide your pathnames in ~/.ssiuml/ssiuml.conf. Values to override are KERNEL, ROOT, CIDEV, INITRD, and INITRD_MEMEXP. This feature is only needed by an advanced user.
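A minimal ~/.ssiuml/ssiuml.conf might look like the following. The variable names are the ones listed above; the pathnames are only placeholders for wherever you keep your own files:

    KERNEL=~/linux/linux
    ROOT=~/ssiuml/root_fs
    CIDEV=~/ssiuml/cidev
    INITRD=~/ssiuml/initrd-ssi.img
    INITRD_MEMEXP=~/ssiuml/initrd-memexp.img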
Add nodes 3 and 5 to the cluster with
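    $ ssi-add 3 5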
The arguments taken by ssi-add are an arbitrary list of node numbers. The node numbers must be between 1 and 15, and at least one must be provided. If a node is already up, ssi-add ignores it and moves on to the next argument in the list.
Simulate a crash of node 3 with
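    $ ssi-rm 3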
Note that this command does not inform the other nodes about the crash. They must discover it through the cluster's node monitoring mechanism.
The arguments taken by ssi-rm are an arbitrary list of node numbers. At least one node number must be provided.
You can take down the entire cluster at once with
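    $ ssi-stop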
If ssi-stop hangs, interrupt it and kill all the linux-ssi processes before trying again.
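One way to do that from the host is:

    $ killall -9 linux-ssi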
Eventually, it should be possible to take down the cluster by running shutdown as root on any one of its consoles. This does not work just yet.
Bring up a three node cluster with ssi-start. Log in to all three consoles as root. The initial password is root, but you'll be forced to change it the first time you log in.
The following demos should familiarize you with what an SSI cluster can do.
Start dbdemo on node 1.
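With the prepackaged root image, where dbdemo and its data file live in /root/dbdemo, this looks something like the following (the node1# prompt just indicates which console to type at):

    node1# cd /root/dbdemo
    node1# ./dbdemo alphabet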
The dbdemo program "processes" records from the file given as an argument. In this case, it's alphabet, which contains the ICAO alphabet used by aviators. For each record, dbdemo writes the data to its terminal device and spins in a busy loop for a second to simulate an intensive calculation.
The dbdemo program is also listening on its terminal device for certain command keys.
Table 1. Command Keys for dbdemo
Move dbdemo to different nodes. Note that it continues to send output to the console where it was started, and that it continues to respond to keypresses from that console. This demonstrates that although the process is running on another node, it can remotely read and write the device it had open.
Also note that when a process moves, it preserves its file offsets. After moving, dbdemo continues processing records from alphabet as if nothing had happened.
To confirm that the process moved to a new node, get its PID and use where_pid. You can do this on any node.
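For example, assuming ps reports dbdemo's PID as 1234 (substitute whatever PID you actually see):

    node2# ps -ef | grep dbdemo
    node2# where_pid 1234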
If you like, you can download the source for dbdemo. It's also available as a tarball in the /root/dbdemo directory.
3.2. Clusterwide PIDs, Distributed Process Relationships and Access, Clusterwide Job Control and Single Root
From node 1's console, start up vi on node 2. The onnode command uses the SSI kernel's rexec system call to remotely execute vi.
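A sketch of what that might look like, assuming onnode takes a node number followed by the command to run, and using /root/ssi-test as an arbitrary file name on the shared root:

    node1# onnode 2 vi /root/ssi-test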
Confirm that it's on node 2 with where_pid. You need to get its PID first.
Type some text and save your work. On node 3, cat the file to see the contents. This demonstrates the single root file system.
From node 3, kill the vi session running on node 2. You should see control of node 1's console given back to the shell.
Make a FIFO on the shared root.
echo something into the FIFO on node 1.
cat the FIFO on node 2.
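Put together, the sequence might look like this (the FIFO name /fifo-test is arbitrary):

    node1# mkfifo /fifo-test
    node1# echo hello > /fifo-test
    node2# cat /fifo-test
    hello

The echo blocks until cat opens the FIFO on node 2; when it does, hello appears on node 2's console and both commands return.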
This demonstrates that FIFOs are clusterwide and remotely accessible.
On node 3, write "Hello World" to the console of node 1.
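The device path below is only a guess at the layout; substitute whatever node-specific subdirectory of /devfs your release actually uses for node 1's console:

    node3# echo "Hello World" > /devfs/1/console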
This shows that devices can be remotely accessed from anywhere in the cluster. Eventually, the node-specific subdirectories of /devfs will be merged together into a single device tree that can be mounted on /dev without confusing non-cluster aware applications.
Building your own kernel and ramdisk is necessary if you want to
Otherwise, feel free to skip this section.
SSI source code is available as official release tarballs and through CVS. The CVS repository contains the latest, bleeding-edge code. It can be less stable than the official release, but it has features and bugfixes that the release does not have.
The latest SSI release can be found at the top of this release list. At the time of this writing, the latest release is 0.6.5.
Download the latest release. Extract it.
Determine the corresponding kernel version number from the release name. It appears before the SSI version number. For the 0.6.5 release, the corresponding kernel version is 2.4.16.
Follow these instructions to do a CVS checkout of the latest SSI code. The modulename is ssic-linux.
You also need to check out the latest CI code. Follow these instructions to do that. The modulename is ci-linux.
Determine the corresponding kernel version with
In this case, the corresponding kernel version is 2.4.16. If you're paranoid, you might want to make sure the corresponding kernel version for CI is the same.
They will only differ when I'm merging them up to a new kernel version. There is a window between checking in the new CI code and the new SSI code. I'll do my best to minimize that window. If you happen to see it, wait a few hours, then update your sandboxes.
Extract the source. This will take a little time.
Follow the appropriate instructions, based on whether you downloaded an official SSI release or did a CVS checkout.
Apply the patch in the SSI source tree.
Apply the UML patch from either the CI or SSI sandbox. It will fail on patching Makefile. Don't worry about this.
Copy CI and SSI code into place.
Apply the GFS patch from the SSI sandbox.
Apply any other patches from ssic-linux/3rd-party at your discretion. They have received little or no testing in the UML environment, and the KDB patch is rather useless there.
Configure the kernel with the provided configuration file. The following commands assume you are still in the kernel source directory.
Build the kernel image and modules.
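Roughly, assuming the SSI source tree provides a UML configuration file (the config file name and location shown here are only placeholders; check your sandbox for the real one):

    $ cp ../ssic-linux/config.uml .config
    $ make oldconfig ARCH=um
    $ make dep ARCH=um
    $ make linux modules ARCH=um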
To install the kernel you must be able to loopback mount the GFS root image. You need to do a few things to the host system to make that possible.
Apply the appropriate kernel patches from the kernel_patches directory to your kernel source tree. Make sure you enable the /dev filesystem, but do not have it automatically mount at boot. (When you configure the kernel, select 'File systems -> /dev filesystem support' and deselect 'File systems -> /dev filesystem support -> Automatically mount at boot'.) Build the kernel as usual, install it, rewrite your boot block and reboot.
Configure, build and install the GFS modules and utilities.
Configure two aliases for one of the host's network devices. The first alias should be 192.168.50.1, and the other should be 192.168.50.101. Both should have a netmask of 255.255.255.0.
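Assuming the device you want to alias is eth0 (substitute your own interface name):

    # ifconfig eth0:0 192.168.50.1 netmask 255.255.255.0 up
    # ifconfig eth0:1 192.168.50.101 netmask 255.255.255.0 up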
cat the contents of /proc/partitions. Select two device names that you're not using for anything else, and make two loopback devices with their names. For example:
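    # cat /proc/partitions
    # mknod /dev/hdc1 b 7 1
    # mknod /dev/hdc2 b 7 2

Here hdc1 and hdc2 are only examples of names that happen to be unused on one particular host; major number 7 is the loopback block driver, and the minor numbers select two unused loop devices.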
Finally, load the necessary GFS modules and start the lock server daemon.
Your host system now has GFS support.
Loopback mount the shared root.
Install the modules into the root image.
You have to repeat some of the steps you did in Section 4.5. Extract another copy of the OpenGFS source. Call it opengfs-uml. Add the following line to make/modules.mk.in.
Configure, build and install the GFS modules and utilities for UML.
Change root into the loopback mounted root image, and use the --uml argument to cluster_mkinitrd to build a ramdisk.
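A sketch of that step, assuming the root image is mounted on /mnt; the ramdisk file name is just an example, and the argument order for cluster_mkinitrd is modeled on the standard mkinitrd (image file, then kernel version), so check the script's usage message for the real syntax:

    # chroot /mnt
    # cluster_mkinitrd --uml initrd-ssi.img <kernel version>
    # exit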
Move the new ramdisk out of the root image, and assign ownership to the appropriate user. Wrap things up.
Pass the new kernel and ramdisk images into ssi-start with the appropriate pathnames for KERNEL and INITRD in ~/.ssiuml/ssiuml.conf. An example for KERNEL would be ~/linux/linux. An example for INITRD would be ~/initrd-ssi.img.
Stop the currently running cluster and start again.
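    $ ssi-stop
    $ ssi-start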
You should see a three-node cluster booting with your new kernel. Feel free to take it through the exercises in Section 3 to make sure it's working correctly.
Building your own root image is necessary if you want to use a distribution other than Red Hat 7.2. Otherwise, feel free to skip this section.
These instructions describe how to build a Red Hat 7.2 image. At the end of this section is a brief discussion of how other distributions might differ. Building a root image for another distribution is left as an exercise for the reader.
Extract the image.
Loopback mount the image.
Make a blank GFS root image. You also need to create an accompanying lock table image. Be sure you've added support for GFS to your host system by following the instructions in Section 4.5.
Enter the following pool information into a file named pool0cidev.cf.
Enter the following pool information into a file named pool0.cf.
Write the pool information to the loopback devices.
Create the pool devices.
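With the OpenGFS pool tools, these two steps are roughly as follows; check the OpenGFS documentation for the exact invocations on your version:

    # ptool pool0cidev.cf
    # ptool pool0.cf
    # passemble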
Enter the following lock table into a file named gfscf.cf.
Write the lock table to the cidev pool device.
Format the root disk image.
Mount the root image.
Copy the ext2 root to the GFS image.
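Assuming the ext2 image is mounted on /mnt and you mount the new GFS root on /gfs (both mount points are arbitrary, the pool device name comes from pool0.cf, and the mount options needed for a memexp-locked GFS file system may vary), the copy might look like:

    # mkdir -p /gfs
    # mount -t gfs /dev/pool/pool0 /gfs
    # cp -a /mnt/. /gfs
    # umount /gfs /mnt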
Cluster Tools source code is available as official release tarballs and through CVS. The CVS repository contains the latest, bleeding-edge code. It can be less stable than the official release, but it has features and bugfixes that the release does not have.
The latest release can be found at the top of the Cluster-Tools section of this release list. At the time of this writing, the latest release is 0.6.5.
Download the latest release. Extract it.
Otherwise, mount the old root image and copy the modules directory from /mnt/lib/modules. Then remount the new root image and copy the modules into it.
Remake the ubd devices. At some point, the UML team changed the device numbering scheme: it used to be 98,1 for dev/ubd/1, 98,2 for dev/ubd/2, and so on; now it is 98,16 for dev/ubd/1, 98,32 for dev/ubd/2, and so on.
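With the new root image mounted on /mnt, that means recreating the nodes with the new minor numbers, along these lines (extend the pattern for however many ubd devices the image has; each device's minor number is 16 times its device number):

    # rm -f /mnt/dev/ubd/1 /mnt/dev/ubd/2
    # mknod /mnt/dev/ubd/1 b 98 16
    # mknod /mnt/dev/ubd/2 b 98 32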
Comment and uncomment the appropriate lines in /mnt/etc/inittab.ssi. Search for the phrase 'For UML' to see which lines to change. Basically, you should disable the DHCP daemon, and change the getty to use tty0 rather than tty1.
You may want to strip down the operating system so that it boots quicker. For the prepackaged root image, I removed the following files.
You might also want to copy dbdemo and its associated alphabet file into /root/dbdemo. This lets you run the demo described in Section 3.1.
Cluster Tools has make rules for Caldera and Debian, in addition to Red Hat. Respectively, the rules are install_ssi_caldera and install_ssi_debian.
The main difference between the distributions is the /etc/inittab.ssi installed. It is the inittab used by the clusterized init.ssi program. It is based on the distribution's /etc/inittab, but has some cluster-specific enhancements that are recognized by init.ssi.
There is also some logic in the /etc/rc.d/rc.nodeup script to detect which distribution it's on. This script is run whenever a node joins the cluster, and it needs to do different things for different distributions.
Finally, there are some modifications to the networking scripts to prevent them from tromping on the cluster interconnect configuration. They're a short-term hack, and they've only been implemented for Red Hat so far. The modified files are /etc/sysconfig/network-scripts/ifcfg-eth0 and /etc/sysconfig/network-scripts/network-functions.
If you plan to use SSI clustering in a production system, you probably want to move to a hardware-based cluster. That way you can take advantage of the high-availability and scalability that a hardware-based SSI cluster can offer.
Hardware-based SSI clusters have significantly higher availability. If a UML host kernel panics, or the host machine has a hardware failure, its UML-based SSI cluster goes down. On the other hand, if one of the SSI kernels panics, or one of the hardware-based nodes has a failure, the cluster continues to run. Centralized kernel services can fail over to a new node, and critical user-mode programs can be restarted by the application monitoring and restart daemon.
Hardware-based SSI clusters also have significantly higher scalability. Each node has one or more CPUs that truly work in parallel, whereas a UML-based cluster merely simulates having multiple nodes by time-sharing on the host machine's CPUs. Adding nodes to a hardware-based cluster increases the volume of work it can handle, but adding nodes to a UML-based cluster bogs it down with more processes to run on the same number of CPUs.
You can build hardware-based SSI clusters with x86 or Alpha machines. More architectures, such as IA64, may be added in the future. Note that an SSI cluster must be homogeneous. You cannot mix architectures in the same cluster.
The cluster interconnect must support TCP/IP networking. 100 Mbps ethernet is acceptable. For security reasons, it should be a private network. Each node should have a second network interface for external traffic.
Right now, the most expensive requirement of an SSI cluster is the shared drive, required for the shared GFS root. This will no longer be a requirement when CFS, which is described below, is available. The typical configuration for the shared drive is a hardware RAID disk cabinet attached to all nodes with a Fibre Channel SAN. For a two-node cluster, it is also possible to use shared SCSI, but it is not directly supported by the current cluster management tools.
The GFS shared root also requires one Linux machine outside of the cluster to be the lock server. It need not be the same architecture as the nodes in the cluster. It just has to run memexpd, a user-mode daemon. Eventually, GFS will work with a Distributed Lock Manager (DLM). This would eliminate the need for the external lock server, which is a single point of failure. It could also free up the machine to be another node in your cluster.
In the near future, the Cluster File System (CFS) will be an option for the shared root. It is a stateful NFS that uses a token mechanism to provide tight coherency guarantees. With CFS, the shared root can be stored on the internal disk of one of the nodes. The on-disk format can be any journalling file system, such as ext3 or ReiserFS.
The initial version of CFS will not provide high availability. Future versions of CFS will allow the root to be mirrored across the internal disks of two nodes. A technology such as the Distributed Replicated Block Device (DRBD) would be used for this. This is a low-cost solution for the shared root, although it has a performance penalty.
Future versions will also allow the root to be stored on a disk shared by two or more nodes, but not necessarily shared by all nodes. If the CFS server node crashes, its responsibilities would fail over to another node attached to the shared disk.
Start with the installation instructions for SSI.
If you'd like to install SSI from CVS code, follow these instructions to check out modulename ssic-linux, and these instructions to check out modulenames ci-linux and cluster-tools. Read the INSTALL and INSTALL.cvs files in both the ci-linux and ssic-linux sandboxes. Also look at the README file in the cluster-tools sandbox.
For more information, read Section 7.
Here are some links to information on SSI clusters, CI clusters, GFS, UML, and other clustering projects.
If you are working from a CVS sandbox, you may also want to sign up for the ssic-linux-checkins mailing list to receive checkin notices. You can do that through this web form.
If you are working from a CVS sandbox, you may also want to sign up for the ci-linux-checkins mailing list to receive checkin notices. You can do that through this web form.
SSI clustering currently depends on the Global File System (GFS) to provide a single root. The open-source version of GFS is maintained by the OpenGFS project. They also have a SourceForge project summary page.
Right now, GFS requires either a DMEP-equipped shared drive or a lock server outside the cluster. The lock server is the only software solution for coordinating disk access, and it is not truly HA. There are plans to make OpenGFS support IBM's Distributed Lock Manager (DLM), which would distribute the lock server's responsibilities across all the nodes in the cluster. If any node fails, the locks it managed would fail over to other nodes. This would be a true HA software solution for coordinating disk access.
If you'd like to contribute to the SSI project, you can do so by testing it, writing documentation, fixing bugs, or working on new features.
While using the SSI clustering software, you may run into bugs or features that don't work as well as they should. If so, browse the SSI and CI bug databases to see if someone has seen the same problem. If not, either post a bug yourself or post a message to <email@example.com> to discuss the issue further.
It is important to be as specific as you can in your bug report or posting. Simply saying that the SSI kernel doesn't boot or that it panics is not enough information to diagnose your problem.
There is already some documentation for SSI and CI, but more would certainly be welcome. If you'd like to write instructions for users or internals documentation for developers, post a message to <firstname.lastname@example.org> to express your interest.
Debugging is a great way to get your feet wet as a developer. Browse the SSI and CI bug databases to see what problems need to be fixed. If a bug looks interesting, but is assigned to a developer, contact them to see if they are actually working on it.
After fixing the problem, send your patch to <email@example.com> or <firstname.lastname@example.org>. If it looks good, a developer will check it into the repository. After submitting a few patches, you'll probably be invited to become a developer yourself. Then you'll be able to check in your own work.
After fixing a bug or two, you may be inclined to work on enhancing or adding an SSI feature. You can look over the SSI and CI project lists for ideas, or you can suggest something of your own. Before you start working on a feature, discuss it first on <email@example.com> or <firstname.lastname@example.org>.