Network Computing Laboratory, Columbia University Home 
People 
Publications  
Research Projects
Resources
Lab Tour

CRAK: Linux Checkpoint/Restart As a Kernel Module

Hua Zhong and Jason Nieh
Department of Computer Science
Columbia University
Technical Report CUCS-014-01
November 2001

Abstract

Process checkpoint/restart is a very useful technology for process migration, load balancing, crash recovery, rollback transaction, job controlling and many other purposes. Although process migration has not yet been widely used and is not widely available commercial systems, the growing shift of computing facilities from supercomputers to networked workstations and distributed systems is increasing the importance and demand for migration technologies.

In this paper, we describe the design and implementation of CRAK, an innovative transparent checkpoint/restart package for Linux. CRAK provides transparent migration of Linux networked applications and computing environments without modifying, recompiling, or relinking applications or the operating system. CRAK is the first system for Unix/Linux that provides transparent checkpoint/restart with the following properties: (1) it does not require any modifications of existing operating system or application code and (2) it supports migrating network sockets. Prototype implementations are available for Linux 2.2 and Linux 2.4 kernels.

1. Introduction

A process is an entity that is actually running in the operating system. A process has an identifier (PID), a register set, an address space and/or other certain resources such as opened files. In modern operating systems a process runs in its own address space and is separate from other processes. A process interacts with the operating system, or the kernel, by system calls; it can also communicate with other processes by inter-process communication mechanisms provided by the operating system.

Process checkpoint/restart has two phases. The first phase is to save the running state of a process. This usually includes register set, address space, allocated resources, and other related process private data. The second phase is to re-construct the original running process from the saved image and resume the execution from exactly the interrupted point.

There are several problems with existing checkpoint/restart systems. First, all of them require modification of existing code, either the kernel or user applications. Second, except some written-from-scratch process migration operating systems (such as Sprite), they can not preserve opened network connections. Third, general-purpose operating systems such as Unix was not designed to support process migration, so checkpoint/restart systems built on top of existing OSes usually only support a limited set of applications.

In this project we designed and implemented a Linux checkpoint/restart system as a kernel module. The primary contributions are:

This paper is organized as follows. Section 2 briefly describes the motivation and application of checkpoint/restart. Section 3 describes related work. Section 4 describes the design of CRAK. Section 5 describes our CRAK kernel module implementation in Linux. Section 6 discusses network socket support in CRAK. Section 7 describes some experimental results. Section 8 presents conclusions and directions for future work.

2. Motivation and Application

Checkpoint/restart has various applications:

Although checkpoint/restart is a useful technology, it is still mainly a research subject and has not come to production use. The reasons are:

3. Related Work

In this section we describe some well-known process migration and checkpoint/restart systems with different design and complexity. For more information, Dejan S. Milojicic and his colleagues have written an excellent survey on process migration [1]. Current migration systems can be grouped into one of the following categories:

Clustering Operating Systems with Single System Image

There are a number of kernel level systems which require substantial kernel modifications or are written from scratch. Examples are MOSIX, Sprite, V Kernel, Amoeba and Solaris MC. They provide automatic load balancing and allow workstations and servers to run cooperatively as if they were a single multiprocessor computer by means of transparent process migration. Usually each process still has a "home node" where it originated. A process appears to be always running at its home node even after it is migrated to another node. These systems cannot suspend processes and save them offline.

Sprite was developed at U. C. Berkeley starting in 1984 [2]. It provided a transparent process migration environment for a cluster of workstations and a Unix-like programming interface. Each process had a home machine and appeared to run on that machine even if it had been migrated. Kernels talked to each other via inter-kernel RPC. IPC and TCP were implemented by a shared network file system so that processes at different locations can access the same resources. The project terminated in 1994 and never became widely used.

MOSIX is a popular distributed operating system [3]. Many versions of the system have been developed on various Unix systems, and the latest version 7 runs on Linux. MOSIX was designed as a scalable system. Nodes are independent of each other and join and leave the cluster dynamically. In MOSIX each process is tied to a Unique Home Node (UHN) where the user logins in. All the processes in user's session have the same UHN. Processes migrated to other nodes use local resources whenever possible, but interact with the user's environment through the UHN. If the UHN dies, all processes tied to it die. MOSIX uses TCP/IP to implement global interprocess communication.

Virtual Machine Systems

Virtual machine systems provide a software-emulated hardware environment. Different virtual machines are separated from one another. The image of an entire virtual machine can be saved to disk and resumed later. Examples are Disco, IBM VM/370 and Vmware.

These systems can only do checkpointing on a per-VM basis, which is too coarse-grain. The entire virtual machine image has to be saved (typically several hundreds of megabytes) even if only a small part of the memory is actually being used. This results in very expensive migration operations.

Kernel Support for Checkpointing

Examples of kernel support for checkpointing include Compute Capsules, Epckpt, Fluke, KeyKOS and Discount Checking. Compute Capsules and Epckpt support checkpointing and migrating a single process or a group of processes. Fluke provides very limited support for checkpointing and does not have much practical use. KeyKOS and Discount Checking can record the entire memory image to stable storage, very similar to the hibernation functionality commonly available on PC operating systems.

Compute capsule is a kernel-level process migration system developed at Stanford University [4]. A compute capsule contains a group of active processes and their associated environments. Compute capsules virtualize the application interface to the operating system and repartition state ownership so that all machine-dependent information is moved into capsules from the kernel. Each capsule has its own private view of the system. A set of new system calls are implemented to support capsule management, checkpoint and restart, which adds some overhead. The kernel is also heavily patched to support this new level of abstraction. A prototype system has been implemented in Solaris 7.

Epckpt is a Linux kernel patch developed by Eduardo Pinheiro at Federal University of Rio de Janeiro UFRJ [5]. It installs a default signal handler in the kernel. Checkpointing a process is as easy as sending a signal. The signal handler then will be activated and checkpoint the current process.

It has the following features:

It also has a few drawbacks. One big problem is that it adds excessive information logging to the kernel. For instance, in order to know which files are opened it patches open and close system calls. It also patches mmap, fork, exit, etc and introduces a lot of overhead and makes the implementation rather complex. To minimize the overhead, it even adds one more system call, collect_data. By default no information is logged. If you want to checkpoint one process, however, you have to call this system call before it starts to initiate the logging. Without this logging the process can't be checkpointed. That is to say, you have to know in advance whether you will checkpoint a process before you launch it. This is very inconvenient and inefficient. Epckpt is also not very flexible in supporting parallel processes.

User-level Checkpointing and Process Migration

Adding kernel-level process migration support to industry standard operating systems is difficult as they were not designed for this. When changing the kernel is not possible, the only choice is to do it in user-space. User-level checkpointing is not transparent. It needs to modify, recompile, or relink user code. It doesn't support legacy applications, and usually requires that applications only use a limited set of system calls. Examples are Condor [6], CoCheck [7], and Libckpt [8].

The typical approach is to Install a signal handler to do checkpointing. It's similar to what Epckpt does: send a signal to the process and the signal handler will checkpoint itself, except that the signal handler is in user space. Condor uses this approach. Libckpt uses a timer handler (which is also a kind of signal handler) to periodically and incrementally checkpoint itself, but this approach is only suitable for rollback recovery, not for process migration.

There are also user-level systems that provide process migration on a programming language or middleware level. Examples are Legion, Emerald, Rover and Abacus. They support checkpointing as a basic concept of object-oriented languages. Of course they require that the application be written in these languages.

In summary, there are two basic categories of checkpointing or migration systems: kernel level and user level. MOSIX, Compute Capsule and Epckpt are all kernel level, while Condor, CoCheck and Libckpt are user level. They have different levels of transparency, complexity, portability and performance. Transparency means whether user applications need to be modified, recompiled or relinked. Generally speaking, adding support to kernel leads to better transparency, but more implementation complexity and less portability. CRAK is a kernel-level system.

4. Design

In this section we describe the design of CRAK. Our design is not limited to Linux, but we will take Linux as an example to explain CRAK design concepts.

4.1 Goals

In this project, our goal is to design and implement a light-weight general-purpose transparent checkpoint/restart package for existing operating systems.

Our design goals include:

Since we wanted to build a kernel-level system, we considered current kernel-level systems, especially MOSIX, Compute Capsules and Epckpt. We adopted many of Epckpt's ideas because it doesn't change the kernel as much as the first two, and thus was more suitable for transforming into a kernel module.

4.2 Assumptions

The basic assumptions of CRAK are:

4.3 Basic Approach

CRAK does process migration by checkpointing and restarting. The basic procedure is:

  1. Install and load kernel modules on all machines. Modules run in kernel space but does not require kernel modification.
  2. Processes communicate via virtual network address translation.
  3. Request a process or a group of processes to be checkpointed. The request can be local or remote, and can be issued by user explicitly or by an automatic migration system.
  4. Checkpoint command is sent to the kernel module on the local machine.
  5. Kernel module stops processes to be checkpointed.
  6. Kernel module saves the process state to file. The original process is then killed.
  7. Kernel module takes care to save appropriate socket and file state.
  8. A new process is created on a new machine where the process is to be migrated.
  9. The new process is filled with checkpointed process state.
  10. The network addresses are translated for new machine by the kernel module.
  11. Other processes that communicated with the migrated process are notified and updated.

The CRAK user interface is:

  1. Checkpoint: the user should be able to specify which process(s) to checkpoint and where to send the checkpointed image (save to disk or send over network). The user should also be able to control the checkpointing behavior, such as whether to kill the process after the checkpoint, whether to checkpoint the child processes, and whether to turn on some optimization switches, etc.
  2. Restart: the user should be able to pass a checkpointed image and restart the previous process(s) from it.

An automatic process migration system can be built on top of the checkpoint/restart capability and do things like load-balancing without human intervention.

4.4 Checkpoint Mechanism

4.4.1 Save Entire State

All the process private data need to be saved and recovered during a migration. This includes: address space, register set, opened files/pipes/sockets, System V IPC structures, current working directory, signal handlers, timers, terminal settings, user identities (uid, gid, etc), process identities (pid, pgrp, sid, etc), rlimit, and any other data that need to be saved.

Address space and register set are the two mandatory components we have to save and recover to be able to migrate a process, while others can be optional depending on the type of application.

Some migration systems use a paging technique that only transfers a checkpoint image page to the new node only when the page is needed. We don't use this technique and simply save/transfer everything at one time for the following reasons:

4.4.2 Optimizations of Saving Address Space

An address space is made up of a few sections, each being a continuous memory block with a start address, an end address and an access attribute (read/write/execute and private/shared). For example, a section with access 'rwp' means that it's readable, writable and private.

Saving the address space is straightforward. We simply walk through all the sections, and for each one we save its position, access attribute and content to the checkpoint image file.

The question is the overhead. Even a very simple process may have a total image size of several megabytes, and large applications like Netscape could have 50M. Can we reduce the image size? The answer is yes. For example, code sections are read-only, which means it's always the same as the original executable file on the disk. We can then find the content directly from there when we do restart instead of duplicating it in each checkpoint image file. Of course, this assumes the same binary exists on all the machines. This is true if it's on a global network file system or all the machines have the same file installed locally.

So which sections can we choose not to save? We can skip a section if we know it's identical as in the original disk file:

Typically those sections include code sections of the program and shared libraries.

4.4.3 Local Parallel Processes Are Migrated Together

If a group of processes communicate with each other by means of local IPC, they have to be migrated together. If they use global IPC such as network sockets or named piles on a global file system, they don't have to be migrated together.

Some migrating systems reimplement local IPCs with global IPCs. For example, Sprite implements IPC by a shared file system. This beyond the scope of our paper.

4.4.4 Atomicity and Consistency

To ensure that we save consistent state of the running processes, we stop the target processes before we do checkpointing. This ensures that they are not running and their states are not changed during checkpointing.

Stopping a process doesn't necessarily mean that its state will not change. For example, if a signal is sent to a stopped process, its signal pending state will be changed. There are several reasons we don't need to worry about this problem:

4.4.5 Kernel Module

All the other migration systems have one common drawback: they require modification of existing code (either kernel or user applications), thus are difficult to deploy. For example, MOSIX and Compute Capsule patch kernel heavily. Moreover, it's not possible to patch the kernel if the operating system doesn't have source code available.

Fortunately, most operating systems support the concept of dynamically loadable kernel modules. Kernel modules are typically loaded at boot time or on demand when certain functionality is needed. Once the module is loaded, it becomes part of the kernel address space and runs in privileged mode.

One of CRAK's key design concepts is to do checkpoint/restart as a kernel module, which is the most challenging problem in building CRAK. Although a kernel module runs in the same privileged mode, it can only cooperate with the kernel instead of changing the kernel behavior at will. For example, we can not change the scheduling or swapping behavior in a module if the kernel doesn't provide this kind of hookup.

Many packages, like Condor and Epckpt, use Unix signal handling mechanism to do checkpoint. When a process needs to be checkpointed, a special signal is sent to it. The real work is done by the signal handler. In another word, the process checkpoints itself. In this approach, the checkpoint can be done in the the target process' context. It's then very easy to access the address space and register set.

However, we can not add a special signal and a default signal handler with a module. As a result, we can not checkpoint a process in its own context; instead, we have to checkpoint it directly from another process (the 'checkpoint' program). The basic problem is then how to access an arbitrary process' private data from a kernel module, especially the address space and register set. This will be discussed in 5.2 and 5.3. On the other hand, most checkpointing operations such as saving opened files have no much difference between a kernel patch or a module.

Restart is very similar to the system call execve: the kernel loads the checkpoint image into current process' (the 'restart' program) address space, and the current process is completely replaced. It's basically the same to a kernel patch or a module.

Another problem is the kernel usually only provides a limited set of APIs forkernel modules. In a kernel patch we can call any global kernel functions, but in a module we can only call the "exported" functions. If we strictly comply with the APIs, there are only two ways to overcome this constraint:

We will discuss how we solve this problem on Linux in Section 5.7.

4.4.6 Process Cleanup

The old process needs to be cleaned up and recourses be reclaimed when we do a migration. Fortunately, this can be done by the kernel automatically. We simply kill the old process, and the kernel will take care of the rest, such as reclaiming allocated memory, closing all the opened files and flushing dirty pages, etc.

4.5 Restart Mechanism

Restart recovers processes from saved checkpoint images. It's relatively simple and straightforward as it's very similar to the system call "execve". What it does is:

Unfortunately, not all the data can be guaranteed to recover. For example, if the PID number is already used, it can't be recovered. If the file to open is already opened and locked by another process, we can not open it again.

4.6 Special Issues for Networked Processes

Special steps need to be taken for maintaining open network connections. After the process is migrated, its network addresses need to be translated correctly. In particular, the migrated socket is mapped to the new physical address. The other end of the connection should also know this new mapping so it continues to talk to the new node instead of the old one.

4.7 User Space or Kernel Space Solutions

Although CRAK is a kernel level system, some operations can also be done in user space. An operation can be done in user space if there is a system call to do the same job. User space implementation is usually preferred because it's easier to do (right) and more portable. The basic philosophy is that an operating system may have very different internal data structures and algorithms across different versions, but it always has a (reasonably) stable application programming interface.

For example, to restore a file descriptor we could directly mock with the file table in the kernel, but this may not work if in a later version the OS changes the way it uses file tables. On the contrary, if we do it in user space as described in 5.4, as long as the OS still provides "open" and "dup2" system calls (which is highly possible since they are standard APIs), our code will continue to work.

However, sometimes we are still in favor of kernel space because it's usually faster. For example, to restore a file seek pointer, we could use the "lseek" system call in user space, but it requires a context switch. Its kernel implementation is actually very simple - just a one-liner. In this case we just do it in kernel.

5. Implementation

In this section we describe our implementation of CRAK as a kernel module on Linux 2.2.x/i386. The basic picture of checkpoint/restart is, needless to say, composed of two phases: checkpoint and restart. Checkpoint means that we save the running state of a process to a image file. Restart then restarts the process based on the image file. So the basic questions are 1) what states we need to save; 2) how to save the states; and 3) how to recover the states.

We first give an overview of the implementation, and then describe the key components we need to save and recover.

5.1 Overview

There are basically two ways of using a kernel module from a user application. The first is using a device file (usually in /dev). The kernel module registers itself with a device file when it starts up, and user applications can interact with the module by standard file operations (open, close, read, write, ioctl, etc). The other way is through the /proc interface. The module can register entries in the /proc pseudo file system, and interacts with user applications by reads and writes. We chose to use the first approach.

A device file interface, /dev/ckpt, is implemented. It provides IOCTL interface for checkpoint and restart. We encapsulate this device file in a helper library (ckptlib.c) to provide the following application programming interfaces.

int checkpoint (int fd, int pid, int flags);

Three parameters are passed: descriptor of the file the checkpoint image should be written to; pid of the target process; the flags.

Flags can be or-ed by the following:

These flags are taken from Epckpt. Their meanings were discussed in Section 4.5. We also learned another thing from Epckpt that we pass a file descriptor instead of a path name to the checkpoint routine. The kernel then writes checkpoint image to this descriptor. In this way the image is not necessarily saved to local disk. Instead, it can be sent directly to a remote machine via a socket. This significantly reduces the checkpoint overhead for process migration.

int restart (const char * filename, int pid, int flags);

Three parameters are also passed: file name of the checkpoint image; a pid; flags. This system call loads the image, then replaces current process by the checkpointed process.

The last two parameters are mainly for group checkpoint/restart:

As mentioned before, we pass a file descriptor to the checkpoint routine, and can directly send checkpoint image to remote machine. Can we do the same thing for restart: receive checkpoint images from remote machine and directly restart it without saving to the disk? Currently we can't: mmap requires a concrete file anyway.

We also developed a suite of user utilities:

5.2 Address Space

On Linux all the processes have an address space from 0 to 4GB, but only a small part of it is actually used. Each process has a "task_struct" stucture which holds all the private process data. From the structure we can walk through all the memory sections (called Virtual Memory Area, or VMA).

During the checkpoint we need to access the target process' address space from the kernel. Common C functions like strcpy wouldn't work because pointers in kernel and user space have completely different meanings. Instead, the Linux provides a set of helper functions (copy_from_user(), copy_to_user(), etc) that allow data transfer between kernel and the current process. But now we want to access a process's address space that is not the current one, so we can't use these functions either.

Each process has its own page directory and page tables that are initialized in a fork and switched to in a context switch. All the processes have an address space from 0 to 4GB, but mapped to different physical memory regions due to their different page structures.

The kernel also has its own page directory and page tables, but it is special because it just maps the physical address to itself. For instance, a pointer of 0x00004000 in a process may actually points to a physical address of 0x01234000, but in the kernel a virtual address is exactly its physical address.

Now suppose we want to access the address addr in a process p. The physical address of addr is a function of both p and addr. Since this physical address is also the virtual address in the kernel, we can then directly access it (using strcpy or direct pointer dereference). A function, get_kernel_address(), is implemented. It takes p and addr as parameters and returns its physical address by using p's page directory and page tables.

Another problem is that this addr may have been swapped out. In this case, it has an invalid page table entry, and we need to manually swap it in so that it can have a physical address.

5.3 Register Set

To be able to restart a process, we have to make sure that all registers have the same content as the checkpointed point. In a context switch, the currently running process' register set is saved and the kernel switches to another process selected to run, and we need to save it during a checkpoint. As a result, the process we want to checkpoint must not be running, otherwise the saved register set is not in sync with the run-time register set.

To recover we simply replaces the register set in the current stack with our checkpointed values. When we return from the restart call, they will be loaded into the registers.

The only problem is where it is. On Linux, a process' task structure is within the same 8KB frame with its kernel stack, and the register set is saved at the top of the stack, so there is a simple connection between the locations of process' task structure and its register set location. Assuming struct task_struct *p is a pointer to the task structure, then the corresponding register set location is:

struct pt_regs *regs = ((struct pt_regs *)(2*PAGE_SIZE + (unsigned long)p)) - 1;

5.4 Opened Files

There are basically three types of files: named files, pipes and sockets. They have different forms of identity/. Named files are identified by path names. Pipes are anonymous and directly identified by inode numbers. The identity of sockets is protocol dependent but typically is a tuple of network addresses and ports. We need to save the identity of a file based on its type. For all of them we also need to save the values of file handle so we make sure that when we restart the process all the file descriptors refer to the original files.

In the case of named files, if we have a global file system (like NFS), we can just save the path name. If the file is not on a local file system (e.g., /tmp), or it has different path names on different machines, we have to save the entire content of this file instead of just the path name.

Then how to restore them? Since after the restart the process continues to refer to these files by their file descriptors, we have to ensure that these files are opened with exactly the same file descriptors as before. The answer is using dup2(). dup2() duplicates a file descriptor and both old and new file descriptors refer to exactly the same file. Because the code is fairly simple, we attach it here:

int open_force (int fd, char * filename, int flags, int mode) {
int ret = open(filename, flags, mode); // open the file
if (ret < 0 || ret == fd) return ret; // if error or already the desired descriptor just return
if (dup2(ret, fd) < 0) return -1; // dup the descriptor, note this should not fail
close(ret); // close the original descriptor
return fd;
}

This function forces the file to be opened with the specified file descriptor. We just call this function in user space for each descriptor we saved. Since descriptors are unique, there are no conflicts.

For pipes it's more complex because it involves a group of processes, but the idea is the same. We'll discuss parallel processes in Section 6.6.

For network socket we'll discuss in Section 7.

Currently our current implementation only supports saving the pathnames of opened files, not their contents. It works if we have a global file system as discussed in Section 4.5.

5.5 Other States

We also checkpoint and restart the following states:

5.6 Parallel Processes

Very few packages support checkpoint/restart for parallel processes.

Epckpt supports it by passing a flag to the system call to tell it to checkpoint not only the specified process but also all its children. In another word, it is implemented in the kernel. There are two major drawbacks:

We made significant effort to improve it. Instead of semaphores we use signals to do synchronization and almost all the support is done in user space:

5.7 Kernel Symbols

As we discussed in Section 4.2, sometimes we need to access kernel functions or variables that are not exported by the kernel to our module.

There is a dirty trick: /boot/System.map. This file is generated each time a new kernel is compiled. It contains absolute addresses for all kernel symbols (even those not exported to modules), and we can directly use them.

For example, if there is an unexported kernel function tcp_v4_rehash() we want to use, we can grep in the System.map file and find the following line:

c0173e00 t __tcp_v4_rehash

then we can use this function by directly calling the address c0173e00.

6. Network Socket

Network socket migration is the most challenging part in process migration. It's difficult because it involves two machines. In this project we implemented a prototype of TCP/IPv4 socket migration, and have been successfully migrated certain network applications.

6.1 Approaches to Socket Migration

When a process is migrated, we have to find a way to keep the previous network connections so that it can continue to talk to its peers, which are typically at a remote site.

One approach is that we keep a "stub" process at the original node, whose responsibility is to keep the original connection and forward the traffic to and from the new node. The other end doesn't need to know about it. The goodness is that all the machines that need to support migration are under our control.

However, there is one big problem in this approach: each process is thus tied to the first node it runs on (which is usually called a home node). If the home node dies, the process dies. This weakens the usefulness of process migration.

In this project we adopted a different, more direct approach. We do not use stub processes. Instead, when we migrate a process, we save the network stack and restart a new connection on the new node. We also notify the remote peer of the migration so it can talk to the new address.

In this way when the process is migrated it no longer depends on the original node. There is a drawback, however, that now we have to be able to access the remote peer and change its socket information.

6.2 The Sock Structure

In the Linux kernel, all sorts of sockets (inet, unix) have a struct "socket". For inet family, socket has a pointer to the struct "sock", which has all information of a TCP/IP socket.

A "sock" is identified by a unique tuple (protocol, saddr, sport, daddr, dport), where "s" means "source", or "local", and "d" means "destination", or "remote". The following shows how a sock gets its identity and how it relates to the BSD socket API:

socket() -> create the socket structure in the kernel, and set protocol.

bind() -> get sport (if saddr is not zero it also gets saddr)

connect(), accept() -> get full identity (protocol, saddr, sport, daddr, dport)

During the socket migration, we need to change the sock information, especially addresses and ports. For example, for the migrated process we need to change its saddr and possibly sport, and for the remote process we need to change its daddr and possibly dport. It's not just as simple as to change the values, because the sock structure has interactions with various kernel data structures.

Socks are put in various hash tables to make lookup more efficient. The most important ones are the bind hash (tcp_bhash), listening hash (tcp_listening_hash) and established hash (tcp_ehash). For example, when an incoming packet comes in, and if it's for an established connection, the kernel will lookup in the tcp_ehash to find the corresponding sock structure based on the hash key (the full address tuple in the packet).

The following is a summary of how the sock is linked in the hash tables.

Description Hash key Sock field Relevant functions
Protocol* all socks bound to a local port   sklist_next, sklist_prev add_to_prot_sklist, del_from_prot_sklist
Bind all socks bound to a port sport bind_next, bind_pprev tcp_v4_get_port, tcp_put_port
Listening all socks in listening state sport next, pprev tcp_v4_hash, tcp_v4_unhash, tcp_v4_rehash
Connected all socks in connected state saddr, sport, daddr, dport

When we change its address/port, we need to make sure the sock is still in the correct hash tables.

Another thing is the routing table. Each sock structure has a 'rtable' that caches the information as where to send the packet, thus avoids looking up in the global routing table (in Linux it's called Forwarding Information Base, or FIB) each time it sends out a packet. If we change the remote address, we have to rebuild the cache information.

[*] in kernel 2.4 the single protocol linked list no longer exists.

6.3 Three Steps to Migrate a Socket

To be simple, we call the process we checkpoint A, the other end B, and the new node C.

The approach to migrate the socket is very straightforward: firstly, we checkpoint A and leave B still open; secondly, we restart A on C and recover the socket to "established" state which points to B again; thirdly, we change B to point to C, thus recover the whole connection.

6.3.1 Half-close the Connection

When A is checkpointed, all the open files will be closed. For a TCP socket, a FIN packet is sent to B and the connection is then closed.

To be able to migrate the connection without disturbing B, when we checkpoint A we manually set the sock state to TCP_TIME_WAIT, then it won't send FIN packet to B anymore when it's closed.

We also need to save the TCP stack information. Most importantly the sequence numbers so later we can resume the connection. For simplicity, we save the whole TCP stack and restore everything except certain pointer fields that are apparently non-migratible.

6.3.2 Half-recover the Connection

When A is restarted, we rebuild all the sockets it previously had. The idea is to establish a connection at our end without calling connect() or sending any packet out to the wire.

This is done both in user and kernel space.

In user space, we call socket() and bind() to create the socket. In kernel space, we fill in remote address, remote port and other TCP stack information we saved. Since we change its remote address and port, we need to rehash it. Then we fresh the rtable cache. At last we manually set the state to TCP_ESTABLISHED.

By the way, listening ports can be simply recovered by calling listen() in the user space. Since it doesn't involve the other end, it's trivial.

6.3.3 Fully Recover the Connection

The last thing left is that the other end, B, still knows nothing about the new location C. We need to change the daddr and dport pf B's socket to C.

When A is restarted, restart will invoke chsock on B. Currently the remote execution is done by rsh or ssh, but it can be replaced by our own protocol.

7. Results

We report some experimental measurements using CRAK. The measurements were done using Gateway PCs with Intel Celeron 433MHz CPUs and 128 MB RAM running Redhat 6.1 with Linux kernel 2.2.14. All client machines were on a 100MB Fast Ethernet network. The tests were done over NFS v3. The NFS server was a dual-processor Sun4u Sparc running Solaris 7 with 256 MB RAM. The file system involved in the test was on a seagate 4.2GB SCSI disk (there were several other disks hosting other file systems on the server). The Linux client ran with NFS v3 UDP support.

7.1 General Test

We tested our CRAK prototype on some common Unix applications and it worked fine for most of them, such as cat, find, grep, ls, more, emacs, vi and for parallel processes like grep|cat. However, it still fails to restart or produced unexpected results for a few applications. For example, top always segfaults after the restart.

For network applications, we have been able to migrate telnet, ssh, and some X applications. Theoretically we could also restart daemons such as telnetd, but because telnetd forks login and the latter forks a shell, and our current implementation of the user utility "restart" can't deal with 3-layer hierarchy yet, we can't do this now. but it's just lack of user level support. All the kernel work has been done.

The basic procedure of testing networking applications is as follows:

All operations are done as root. We run the application (e.g., telnet) from machine A to connect to B. We run "ck pid image" on A to checkpoint A. Then we do "restart image" on machine C, then C can continue to talk to B over the recovered connection. To be able to run rsh on B, we need to have A and C in B's ~root/.rhosts.

7.2 Overhead

To be able to support checkpoint/restart, Epckpt does a lot of information logging. Compute Capsule also has the extra capsule management overhead. For example, a capsule creation operation costs 74.6us.

On the contrary, CRAK has no overhead unless you actually issue the checkpoint/restart commands. The primary overhead is saving the checkpoint image to disk.

Below is a simple test on three cases: find, grep, and parallel processes of grep|cat. In each case we checkpointed with and without the CKPT_NO_BINARY_FILE and CKPT_NO_SHARED_LIBRARIES optimization (the first and second rows respectively). Time was measured by inserting gettime functions in the beginning and end of kernel routines. We can see that:

Application Checkpoint Time (ms) Restart Time (ms) Image Size (KB)
find 218.3 4.1 166
2182.0 4.4 1246
grep 504.7 2.1 406
2365.9 4.5 1522
grep | cat 558.7 9.5 523
4347.8 10.0 2719

[*] these tests were run several times and averages were calculated.

8. Conclusion and Future Work

We have demonstrated that it is possible to implement a general-purpose transparent checkpoint/restart package for a popular operating system like Linux. We also demonstrated that we could make it without patching either kernel or user applications. This level of transparency has never been achieved by any other packages.

References

[1] Dejan S. Milojicic, Fred Douglas, Yves Paindaveine, Richard Wheeler and Songnian Zhou, Process Migration, HP Labs, AT&T Labs-Research, TOG Research Institute, EMC, and University of Toronto and Platform Computing.

[2] Ousterhout, John K., Cherenson, Andrew R., Douglis, Frederick, Nelson, Michael N., and Welch, Brent B., The Sprite Network Operating System, IEEE Computer, 21(2), February, 1988.

[3] Ammon Barak and Oren La'sdan, The MOSIX Multicomputer Operating System for High performance Cluster Computing. Institute of Computer Science, The Hebrew University of Jerusalem.

[4] Brian K. Schmidt, Supporting Ubiquitous Computing with Stateless Consoles and Computation Caches (PhD Thesis), Stanford University (http://suif.stanford.edu/~bks/).

[5] Eduardo Pinheiro, Truly-Transparent Checkpointing of Parallel Applications (Working Draft), Federal University of Rio de Janeiro UFRJ

[6] M. Litzkow, T. Tannenbaum, J. Basney, M. Livny. Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System. University of Wisconsin Madison.

[7] J. Pruyne, M. Livny. Managing Checkpoints for Parallel Programs. University of Wisconsin Madison.

[8] J. Plank, M. Beck, G. Kingsley. Libckpt: Transparent Checkpointing under Unix. Princeton University.