System and method for hybrid kernel- and user-space incremental and full checkpointing
11573868 · 2023-02-07
CPC classification
G06F11/1448 (PHYSICS)
G06F11/1482 (PHYSICS)
G06F2201/84 (PHYSICS)
Abstract
A multi-process application runs on primary hosts and is checkpointed by a checkpointer comprised of at least one of a kernel-mode checkpointer module and one or more user-space interceptors providing at least one of barrier synchronization, a checkpointing thread, resource flushing, and an application virtualization space. Checkpoints may be written to storage and the application restored from said stored checkpoint at a later time. Checkpointing may be incremental using Page Table Entry (PTE) pages and Virtual Memory Area (VMA) information. Checkpointing is transparent to the application and requires no modification to the application, operating system, networking stack or libraries. In an alternate embodiment the kernel-mode checkpointer is built into the kernel.
Claims
1. A system, comprising: one or more Central Processing Units (CPUs) operatively connected to computer system memory and configured to execute one or more multi-process applications on a host with a host operating system; a kernel-space checkpointer configured to provide incremental kernel-space checkpointing of the multi-process applications; and one or more user-space interceptors preloaded into an address space of each application process, wherein said user-space interceptors comprise a barrier; wherein said incremental checkpointing is comprised of tracking dirty pages for page table entry (PTE) pages within the address space of the one or more multi-process applications; wherein a dirty page bit is repurposed to track dirty pages for incremental checkpointing.
2. The system according to claim 1 wherein said host operating system is one of Windows, Linux, Solaris, Android, MacOS, iOS, or UNIX.
3. The system according to claim 1 wherein said kernel-space checkpointer is implemented as one of a kernel module, a loadable kernel module, or kernel loadable module.
4. The system according to claim 1 wherein said kernel-space checkpointer is compiled into the kernel.
5. The system according to claim 1 wherein said kernel-space checkpointer is implemented as one of a character device or block device.
6. The system according to claim 1, comprising at least one page table configured to map between process virtual addresses and physical memory addresses.
7. The system according to claim 1, wherein said user-space interceptors comprise creation of a per-process checkpointing thread and an application virtualization space providing a private resource name space.
8. The system according to claim 1, wherein a page table entry with the page dirty bit set is included in an incremental checkpoint, and a page table entry with the page dirty bit cleared is excluded from the incremental checkpoint.
9. The system according to claim 8, wherein the page table entry dirty bit is cleared after taking an incremental checkpoint.
10. The system according to claim 1, wherein virtual memory areas (VMAs) of the one or more multi-process applications are used to identify memory pages for multi-process applications.
11. The system according to claim 10, wherein for each virtual memory page identified by said virtual memory areas, a page table is traversed and a corresponding page table entry is identified.
12. The system according to claim 10, wherein checkpoint information for each page contains virtual memory area page information and corresponding page table entry information.
13. The system according to claim 10, wherein said memory pages are identified by traversing virtual memory areas (VMAs) for each application process, and, for each virtual memory area, identifying the pages comprising said virtual memory area.
14. The system according to claim 1, further comprising a checkpointer configured to: first take one full checkpoint comprising memory pages in the multi-process applications, and clear all page dirty bits for said memory pages; take one or more incremental checkpoints, and merge said incremental checkpoints into said full checkpoint; wherein pages included in said incremental checkpoints replace said pages in the full checkpoint, and other pages in said full checkpoint remain unmodified.
15. The system according to claim 14, wherein said merged checkpoints are written to one of memory, local storage, network storage, or remote storage.
16. The system according to claim 15, further comprising compressing said checkpoints before writing to one of memory, local storage, network storage, or remote storage.
17. The system according to claim 14, wherein: a full checkpoint is comprised of global state for application processes, process hierarchy, shared state, and individual process checkpoints; and said individual process checkpoints are comprised of one or more page information blocks comprised of virtual memory areas page information and page table entry page information.
18. The system according to claim 14, wherein a page in said full checkpoint is left unmodified if said incremental checkpoints do not contain said page.
19. The system according to claim 1, wherein said incremental checkpoints are transferred from kernel space to user space using one of copy_to_user( ) or memory-mapped pages.
20. A method, comprising: executing one or more multi-process applications on a host with a host operating system; wherein one or more user-space interceptors are located in an address space of each application process, wherein the one or more user-space interceptors comprise a barrier; incrementally checkpointing, by a kernel-space checkpointer, the processes of the multi-process applications; wherein said incremental checkpointing is comprised of tracking dirty pages for page table entry pages within the address space of said multi-process applications; wherein a dirty page bit is repurposed to track dirty pages for incremental checkpointing.
Description
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING(S)
(1) The invention will be more fully understood by reference to the following drawings, which are for illustrative purposes only.
DETAILED DESCRIPTION OF THE INVENTION
(17) Referring more specifically to the drawings, for illustrative purposes the present invention will be disclosed in relation to the accompanying figures.
0. INTRODUCTION
(18) The context in which this invention is disclosed is an application running on a primary server. Without affecting the general case of multiple primary applications, the following disclosures often depict and describe just one primary application. Multiple primary applications are handled in a similar manner.
(19) Likewise, the disclosures generally describe applications with one or two processes; any number of processes is handled in a similar manner. Finally, the disclosures generally describe one or two threads per process; any number of threads is handled in a similar manner.
1. OVERVIEW
(21) System resources, such as CPUs 46, I/O devices 44, Network interfaces 42 and storage 40 are accessed using the operating system 38 and the loadable kernel modules 30. Devices accessing remote resources use some form of transport network 48. By way of example, system networking 42 may use TCP/IP over Ethernet transport, Storage 40 may use Fibre Channel or Ethernet transport, and I/O 44 may use USB.
(22) The architecture for the backup is identical to the primary.
(23) In the preferred embodiment storage 40 is external and accessible by both primary and backups over a network 48.
(25) In an alternate embodiment the functionality of the kernel module 30 is built into, i.e. compiled into, the kernel. This eliminates the need to load the kernel module, at the expense of requiring a custom kernel. The preferred embodiment disclosed herein provides checkpointing services as a kernel module, but it is obvious to anyone with ordinary skill in the art that the kernel module functionality could be compiled into the kernel as disclosed for the alternate embodiment.
(27) Using a similar architecture, the backup server 82 contains the backup application 84 comprised of process A 86 and process B 88, each with two threads and an interception layer. System libraries 90 are interposed between the application 84 and the operating system kernel 96. Loadable kernel modules 92 are loaded by the kernel 96 as needed.
(28) Primary and backup communicate over one or more network connections 98.
2. KERNEL STRUCTURE
(29) Linux and Windows are two dominant operating systems with different kernel architectures. Linux is built around a monolithic kernel with loadable kernel modules, while Windows is built around a microkernel with a core executive. As with Linux, Windows supports loadable kernel modules and kernel-mode drivers. Both Windows and Linux thus enable drivers and modules to run in kernel space.
(30) 2.1 Memory
(31) On the Linux operating system applications running in user space communicate with the kernel through defined kernel interfaces. The kernel interfaces are generally concealed by the system libraries as far as the applications in user space are concerned. On Windows, the Windows system libraries exemplified by the Win32 and Posix subsystems likewise manage the interfaces to kernel functionality. Generally, a user-space application doesn't need to concern itself with kernel behavior; it simply requests kernel services through the system libraries.
(32) Both the Linux and Windows kernels and system libraries are well documented and only the elements relevant for the present invention will be further described.
(33) On a 32-bit operating system an application can generally address the full 4 GB of memory corresponding to the 32 bits of address space, even if the underlying hardware provides less than 4 GB of memory. As a practical matter, Linux reserves the upper 1 GB of the address space for the operating system, while Windows generally reserves the upper 1 or 2 GB. The available physical memory is first mapped into the address space of the kernel, and the kernel then provides a flat virtual 4 GB address space to the application. For 64-bit operating systems and hardware similar considerations apply, other than the address space being 64 bits instead of 32. The kernel maintains said mapping between application virtual addresses and physical memory addresses. The virtual address space is broken into equal-sized portions called pages by the kernel. Physical memory is likewise broken into similarly sized pages, called page frames. On IA32 systems the page size is 4 KB (4096 bytes). Other architectures may have different page sizes. In the following, PAGE_SIZE is used to designate the page size on all platforms. The present invention relies on this mapping to locate and extract the application memory image as part of checkpointing while running within the context of the kernel. Generally, the address space when viewed from an application process's perspective is called process virtual memory or the process address space, while the underlying physical memory of the host system is called physical memory.
(35) 2.2 Resources and Processes
(36) Each process has resources attached. By way of example, on the Linux operating system, information related to a running process is captured and stored in the task_struct structure or can be retrieved via said task_struct. The data kept in the task_struct includes process execution status, process hierarchy, allocated memory, process credentials, memberships, signals, resources in use and thread information. As the task_struct is a kernel data structure, the information stored within is generally not accessible by applications running in user space. Applications can access some of the elements using system calls abstracted out within the system libraries. Data within the task_struct, however, is accessible when running in kernel mode, a fact the present invention builds on.
(37) The kernel also maintains a list of all running processes called the kernel task list. In newer versions of the Linux kernel the kernel task list is private and can therefore not be directly accessed. The present invention therefore does not rely on the global kernel task list; rather, it relies only on the local task_struct for the process currently being accessed and uses a user-space library to coordinate access to different processes.
(38) 2.3 Device Drivers
(39) Device drivers provide implementation of hardware specific interfaces. Both Linux and Windows provide generic device support and impose certain architectural constraints on drivers in order to provide universal support for a large number of drivers. By way of example, operating systems provide device drivers for access to hard disks, CD ROM drives, USB, internal and external peripherals. By way of example, when an application opens a file, a device driver is ultimately responsible for retrieving or writing data to said file.
(40) Device drivers thus enable access to very specific resources using well-defined mechanisms offered by the kernel and abstracted out by the system libraries. Device drivers run as part of the kernel either built-in or dynamically loaded as a loadable kernel module. As such, device drivers have access to the kernel internals and can inspect and manipulate kernel structures. The present invention builds on these facts to offer checkpointing services, where key elements of said checkpointing services are provided as a checkpointing device driver.
(42) As illustrated in the example embodiment 120 on
3 CHECKPOINTING DEVICE DRIVER INTRODUCTION
(43) In the following disclosures we follow the Linux naming convention, where device drivers are installed in the “/dev” directory. In the following disclosure the checkpointer device driver is named “ckpt” and the full pathname is thus “/dev/ckpt”. The actual device name is not important as long as it's different from other device names supported by the kernel and available in the /dev or other directory.
(44) By way of example, /dev/ckpt is implemented as a character device. The disclosed functionality of /dev/ckpt could also be implemented as a block device, as is obvious to anyone with ordinary skill in the art. Therefore, the preferred embodiment uses a character device while an alternate implementation uses a block device. Character devices and block devices are both known in the art and the following disclosures will thus focus on aspects relevant to the present invention.
(45) A character device is identified by its major device number. By way of example, the following teachings use a major device number of CKPT_MAJOR_NUM. For testing purposes the locally reserved numbers in the range 240 to 254 can be used. By way of example, on the Linux operating system the device node is created as
(46) # mknod /dev/ckpt c CKPT_MAJOR_NUM 0
(47) 3.1 File Operations
(48) The internal structure of a device driver must declare the file_operations supported by the device. By way of example, ckpt file_operations is defined as follows:
(49) struct file_operations ckpt_fops = {
(50) .owner=THIS_MODULE,
(51) .open=ckpt_open,
(52) .release=ckpt_release,
(53) .read=ckpt_read,
(54) .write=ckpt_write,
(55) .ioctl=ckpt_ioctl,
(56) .seek=ckpt_seek,
(57) };
(58) Here each of the device driver functions ckpt_open, ckpt_release, ckpt_read, ckpt_write, ckpt_ioctl and ckpt_seek is assigned to its respective file_operations field.
(59) ckpt_open is used to initialize the device driver, and ckpt_release to release all resources associated with the device before taking the device down. ckpt_read and ckpt_write are used to read and write bytes of data, respectively; ckpt_ioctl is used to query the device driver for special status information; and ckpt_seek is used to reset the current position in the device. Finally, THIS_MODULE identifies the module.
(60) In addition to the open/read/write/release calls, /dev/ckpt also contains kernel module initialization. The module initialization and termination functions are often called module_init( ) and module_exit( ) in reference to the commonly found declarations in kernel modules. By way of example, the following teachings use the terminology ckpt_module_init( ) and ckpt_module_exit( ) for said functions within the context of the present invention.
(61) In the preferred embodiment the ckpt_module_init( ) call registers the device with the operating system.
(62) static int __init ckpt_module_init(void)
(63) {
(64)     int ret;
         if ((ret = register_chrdev(CKPT_MAJOR_NUM, "ckpt", &ckpt_fops)) < 0)
             printk(KERN_ERR "ckpt_module_init: %d\n", ret);
         return ret;
     }
(65) The register_chrdev call provides the registration information that allows the operating system to map the /dev/ckpt to the actual device driver.
(66) Similarly, ckpt_module_exit( ) unregisters the module, with a preferred embodiment as
(67) static void __exit ckpt_module_exit(void)
(68) {
         unregister_chrdev(CKPT_MAJOR_NUM, "ckpt");
(69) }
(70) 3.2 Conceptual Use of /dev/ckpt
(71) By way of example, the /dev/ckpt checkpointer module may be loaded using the standard Linux command insmod, and the /dev/ckpt checkpointer may be unloaded using the Linux command rmmod. Both commands are well known in the art and will not be described further herein.
(72) The /dev/ckpt checkpointer is accessed from user space like any other file or device. In the preferred embodiment the checkpointer is opened, ignoring error handling, using code similar to
(73) pCkpt=open("/dev/ckpt", O_RDWR);
(74) and the checkpointer is closed using code like
(75) close(pCkpt);
(76) In the preferred embodiment, a checkpoint is created by reading from /dev/ckpt using code like
(77) readCount=read(pCkpt,buffer,size);
(78) and a restore from checkpoint is accomplished by writing to /dev/ckpt using code like
(79) writtenCount=write(pCkpt,buffer,size);
(81) The finer details of the checkpointing process and its interaction with the user-space interceptors are disclosed in the next sections.
4.0 USER-SPACE SHARED LIBRARY
(82) On the Linux operating system new processes are created by the combined use of fork( ) and exec( ). On Windows, processes may be created by using the CreateProcess( ) call. As described above, the disclosures of the present invention use the terminology fork( ) and exec( ) to designate the process-creation interfaces on all operating systems.
(83) The user-space functionality is pre-loaded as a library as part of loading the initial application process and later every time a new process is created. The user-space library provides the following functionality:
(84) a. Creation of a dedicated checkpointing thread for each process
(85) b. Preparation of Barrier for process and threads within process
(86) c. Collection of checkpoint data by calling /dev/ckpt
(87) d. Flushing of system resources
(88) e. Resource virtualization, Application Virtualization Space (AVS).
(89) As part of checkpointing, the present invention deterministically halts execution of the application being checkpointed. Halting of a multi-process multi-threaded application is achieved using barrier synchronization (as defined in "Brief Summary of Invention"). The use of barrier synchronization for halting a multi-process multi-threaded application is taught in U.S. Pat. No. 7,293,200 to Neary et al., which is included in its entirety by reference. The use of a barrier within the context of the present invention is disclosed below.
(90) Where Neary et al. provides the actual checkpointing facility in a "check point library" running in user space, the present invention provides the core checkpointing facility as a kernel module. The present invention utilizes the barrier from Neary and eliminates the need for the checkpointer to run in user space. Moving the checkpointer from user space to kernel space has several advantages: 1) it essentially removes the dependency on the system libraries found in Neary, 2) it eliminates the need to closely track application and library interactions, as all checkpointing is done in the kernel sitting under the system libraries, and 3) it automatically includes all memory allocated in the address space of the application without the need to special-case stacks, code, text and shared memory.
(91) The present invention creates a checkpointing thread per process as taught in Neary. However, contrary to Neary, the checkpointer thread calls the /dev/ckpt to collect the checkpoint and does not itself create the checkpoint for its parent process.
(93) In contrast to Neary and Backensto, the user-space library of the present invention contains only the barrier, checkpointer thread, flushing, AVS and collection of checkpoint data. There is no need to keep track of memory allocations, including those hidden within the system libraries, as all checkpointing data is collected in the kernel. This is a dramatic simplification over the user-space checkpointer and does not require customized system libraries.
(94) 4.1 Flushing of TCP/IP
(95) If, by way of example, an application uses TCP/IP with the socket-programming model, select TCP and socket state and data are implicitly included in the application state, even though the data may reside in the TCP stack or associated device driver. As previously disclosed, the present invention does not attempt to access kernel structures for non-application processes and thus approaches the problem of the TCP and socket state from the application layer. While application and socket buffers generally are included in the application's memory image, that is not the case for kernel and device buffers related to the lower levels of the TCP/IP stack and the network device itself. In the preferred embodiment the approach taken is to flush all TCP and socket communication in order to move data through the buffers, thereby ensuring that the buffers outside the application are in a consistent state with respect to the application. Flushing of TCP sockets may be achieved in a number of ways, including but not limited to closing the socket, which forces a flush, and then reopening the socket; setting the TCP_NODELAY socket option; or through application-specific flush operations. In an alternate embodiment the TCP and device kernel buffers are extracted in the kernel module without requiring any flushing operations.
(96) Resources accessible from the task_struct can be flushed directly by the kernel module checkpointer and therefore do not need flushing from user space. Only resources with kernel functionality outside the task_struct, such as TCP/IP, require flushing from user space.
(97) Generally, resources are flushed as part of the checkpointing process in order to move data out of low-level buffers located outside the address space of application processes, through the operating system and onto or into the devices.
(98) 4.2 Application Virtualization Space—AVS
(99) Instantiated resources generally are only meaningful in the context of the operating system where the application is running. By way of example, the operating system assigns each process a unique process ID (PID). At any point in time each process on a particular node has a unique process ID. If the process is stopped and re-started it may not get the same PID. Likewise, if the process is started on a different node, the PID may be different. Some applications query the operating system for their own PID and use the PID for a variety of purposes. For a checkpoint image to be meaningful, it is thus important that the PID is presented to applications with the value it had at the time of taking the checkpoint.
(100) A structure called the Application Virtualization Space (AVS) is used, to continue the example, to store the PID at the time the process is created and, when the PID is accessed at a later time through the defined APIs, to retrieve it from the AVS. At the time the process, and thus the PID, is first created, the PID is recorded within the AVS for the particular process and kept associated with the process as long as it is running. When the process calls getpid( ) to get the associated PID, the value stored in the AVS is returned, as opposed to the PID provided by the underlying operating system. This ensures that dependencies on PIDs are virtualized and preserved across checkpoints. Similar teachings apply to thread IDs (TIDs), and other system constructs such as IPC, signals, and named semaphores.
(101) Files are handled in a similar manner. By way of example, the file name, path, and attributes are stored in the AVS. As part of restoring an application from a checkpoint, the contents of the AVS are used to recreate the data structures used by the operating system to represent the file, i.e. the file's name, path, and attributes. Current file state, such as file position, is contained within the checkpoint and does not need to be included in the AVS; the AVS only needs the minimal information required to re-create the resource, and the checkpoint then contains the current state.
(102) Similarly, network state such as state of sockets, may be included in the AVS. As with files, only the data relevant to re-creation of the resource is required, while current state is contained within the checkpoint.
(103) The AVS entries are created when resources are created or opened, and removed from the AVS when the resource is closed. By way of example, this means that the AVS entry for a file is created when the file is opened, and removed when the file is closed. The AVS is generally not involved with individual file operations, such as reading or writing data, only in creating file pointers that are meaningful in the context of the operating system and the associated checkpoint.
(104) The AVS thus contains an application-specific private name space for resources, and maintains mappings between resources within the initial instances of application resources and their possibly different values when running in a different environment at another time. In the example embodiment, the AVS mappings are implemented using a hash table. In other embodiments the AVS is implemented using a database or custom coding.
5. CHECKPOINTING IN KERNEL MODULE /dev/ckpt
(105) As disclosed in section 3, the read( ) method of /dev/ckpt is called to retrieve checkpoint data from the kernel. As disclosed in section 4, read( ) is called within the barrier, so the underlying process is not executing application code.
(106) As taught in section 2, all state information for a particular process is stored in or can be found from the process's task_struct structure. By way of example, on the Linux operating system the task_struct for the current process can be found via the current macro. In continuation of section 3, an example embodiment, without error handling, of the read function for /dev/ckpt is
(107) static ssize_t ckpt_read(struct file *filp, char *ubuff, size_t count, loff_t *f_pos)
(108) {
          // pMem is a pointer to memory as seen from the kernel
          // Set pMem to the current address in filp
          // copy to user space
          eCount = copy_to_user(ubuff, pMem, count);
          nBytes = (eCount == 0) ? count : (count - eCount);
          *f_pos += nBytes;
          return count;
(109) }
(110) copy_to_user( ) returns zero if all count bytes have been copied, or eCount, the number of bytes that could not be copied. nBytes represents the number of bytes actually transferred to user space. In an alternate embodiment a page is memory-mapped into user space and the checkpoint data is copied directly into said memory-mapped page.
(111) Generally, attempting to access a memory location not allocated to the process causes a page fault. However, copy_to_user( ) specifically deals with this issue by validating addresses before the copy operation. If the page is invalid, i.e. the Linux function access_ok( ) returns zero, copy_to_user( ) does not attempt to access the page, and the function returns without attempting to copy data. It is thus not necessary to check every page before attempting to copy.
(112) In an alternate preferred embodiment, the page data is transferred to user-space using memory mapped pages.
(113) Each of these steps is now disclosed in further detail.
(114) 5.1 Collecting Memory State
(115) The process's task_struct contains a data structure mm_struct which contains the relevant information regarding said process's virtual memory. The mm_struct contains the kernel data structures necessary to determine which memory pages are used by the process.
(116) By way of example, the start and end of code may be found using the start_code and end_code variables, the start and end of data with the start_data and end_data, the start and end of the heap with start_brk and brk. Similar variables may exist for stack, environment and arguments.
(117) As memory is organized in pages within the kernel, the most efficient operation of /dev/ckpt is in units of memory pages, as opposed to units of bytes. However, since the core device read( ) operation works in units of bytes, the calling process preferably aligns calls to read( ) on page boundaries and requests blocks of bytes the size of a page. In the preferred embodiment, a call to read( ) may thus look like read(filp,ubuff,PAGE_SIZE,p_pos). The read( ) function returns the number of bytes read as a means of communicating success or error. If, e.g., zero (0) is returned, it means that zero bytes were actually read into ubuff.
(118) By way of example embodiment, and in continuation of the example in section 3.2, the calling process may call /dev/ckpt with pseudo code like
(119) unsigned char ubuff[PAGE_SIZE];
(120) // open device
(121) // set/reset file position to zero
(122) seek(fp, 0);
(123) for (int i = 0; i < maxMem; i += PAGE_SIZE)
(124) {
          count = read(fp, ubuff, PAGE_SIZE);
          if (count > 0)
              save checkpoint of page stored in ubuff
(125) }
(126) The /dev/ckpt driver may optimize the checkpointing process by using the previously disclosed information about the start and end of code, text, stack etc. The checkpointer may thus check the address of the current read request to verify whether it falls outside the pages used by the process and, if that is the case, immediately return a value of zero to indicate that no application process data was available for the requested read( ) operation. A related optimization is that on some operating systems, such as Linux, a certain number of pages in low memory are left unused to catch null pointers. Similarly, the operating system kernel may, by way of example, reside in the upper or lower 1 GB of the memory address space and may thus be skipped by the checkpointer. Each operating system is slightly different, and may offer opportunities for such simple optimizations of the checkpointing process.
(127) In the example, the user-space buffer ubuff is fixed and pre-allocated, and thus automatically included in the checkpoint as well. However, since ubuff is used in a transient manner with data flowing through, its content does not need to be preserved as part of a checkpoint.
(128) 5.2 Flushing of System Resources
(129) In the preferred embodiment disclosed above, TCP/IP connections are flushed from the user-space library. Other system resources, such as IPC, are generally contained within the task_struct and can thus be captured by /dev/ckpt by accessing resources within the address space of the current process. Examples include, but are not limited to, the System V semaphores sysvmem, open files identified by *files, and the state of signal handlers. In these cases /dev/ckpt can access the resource internals and flush buffers as supported by the device.
(130) 5.3 Integration with User-Space Library
(131) As disclosed above, the kernel module checkpointer /dev/ckpt is called by the checkpointing thread for a process to collect a specified number of memory bytes, typically a page, for said process. The /dev/ckpt checkpointer thus does not need to access the address space of any other process, nor does it need to know about other processes.
(132) Conversely, the user-space library is responsible for coordination across all processes comprising an application. The barrier contained within the user-space library brings every application process and thread into a stable locked state where the application processes and threads do not execute. With the application essentially halted, the checkpointer thread for each process can call /dev/ckpt and assemble the memory pages comprising the application image.
(133) 5.4 Deterministic Halting at the Barrier
(134) The inner working of the barrier as it relates to the checkpointing process is now disclosed.
(135) Initially the application process 202 is running A 210 with the checkpointer thread 204 and the two process threads 206, 208 operating. At some point a checkpoint is triggered 224. The checkpoint trigger activates the barrier and transitions into waiting B 212 for T0 and T1 to stop at the barrier. When both T0 and T1 are stopped at the barrier, the checkpointing thread 204 starts assembling the checkpoint C 214. As disclosed previously, the checkpoint assembly comprises collecting memory and kernel state for the process 202. Kernel buffers are flushed and relevant state information saved along with the memory state. Said checkpoint state is then written to disk D 216. In the preferred embodiment disclosed previously, states C 214 and D 216 are interleaved, in that one or more memory pages are repeatedly collected and then written to disk. That loop is indicated with the callout 226 next to states C and D. If a kernel device had to be halted as part of flushing, the device is activated again E 218, and the checkpointer transitions to state F 220 where the individual threads T0 and T1 are released in reverse order from the barrier. Upon being released, the application process 202 runs again G 222.
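For illustration only, the sequence of states just disclosed can be sketched as a small state machine in user-space C. The enum names (mirroring labels A through G), the ckpt_step function, and the pages_left parameter are assumptions made for this sketch, not part of the disclosed implementation:

```c
#include <assert.h>

/* Illustrative states mirroring labels A..G in the disclosure above. */
enum ckpt_state {
    ST_RUN_A,          /* application running */
    ST_WAIT_B,         /* waiting for threads to stop at the barrier */
    ST_ASSEMBLE_C,     /* collecting memory and kernel state */
    ST_WRITE_D,        /* writing checkpoint state to disk */
    ST_REACTIVATE_E,   /* re-activating halted kernel devices */
    ST_RELEASE_F,      /* releasing threads from the barrier */
    ST_RUN_G           /* application running again */
};

/* One transition of the checkpointer; the interleaved C/D loop repeats
 * while pages remain to be collected and written. */
static enum ckpt_state ckpt_step(enum ckpt_state s, int pages_left)
{
    switch (s) {
    case ST_RUN_A:        return ST_WAIT_B;      /* checkpoint triggered */
    case ST_WAIT_B:       return ST_ASSEMBLE_C;  /* T0 and T1 stopped */
    case ST_ASSEMBLE_C:   return ST_WRITE_D;     /* page(s) collected */
    case ST_WRITE_D:      return pages_left ? ST_ASSEMBLE_C : ST_REACTIVATE_E;
    case ST_REACTIVATE_E: return ST_RELEASE_F;   /* devices re-activated */
    case ST_RELEASE_F:    return ST_RUN_G;       /* threads released */
    default:              return ST_RUN_G;
    }
}
```

The sketch makes explicit that the checkpointer only leaves the C/D loop once no pages remain, after which the release path E, F, G runs exactly once.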
(136) 5.5 Triggering of Checkpoints
(137) Checkpoints may be triggered in a variety of ways. By way of example, a checkpoint may be triggered after a certain elapsed time, if the CPU temperature exceeds a certain level, if a memory threshold is met, if a storage threshold is met, if a network traffic threshold is met, if a new server is added, after changes in configuration, in response to an external script or program, or manually by the administrator. The teachings of the present invention are independent of how the checkpoints are triggered, and triggering will therefore not be further described.
6. MANAGEMENT OF MEMORY USING PAGE TABLES AND PTES
(138) As taught above, memory is divided into pages and presented to an individual process as a virtual address space.
(139) Page tables are used to establish a mapping between a user-space process's virtual address space and the physical memory of the system. The use of page tables is well known in the art (http://en.wikipedia.org/wiki/Page_table) and will only be described to the extent required to fully disclose the present invention. Page tables are often multi-level; by way of example, Linux generally uses four levels to fully break down an address across page table levels. Modern CPUs, such as x32/x64/AMD and SPARC, have memory management units (MMUs) that traverse the page tables to make memory addressing efficient and fast.
(140) The levels of the page table hierarchy are often given names. By way of example, the Linux operating system provides a family of functions to access and work with said PTE page table. Some of the functions relevant for the present teachings are:
(141) pte_present( ): is the page present in the system
(142) pte_read( ): can page be read
(143) pte_write( ): can page be written
(144) pte_dirty( ): is page dirty
(145) The entire family of pte calls is documented in the Linux kernel sources at www.kernel.org.
(146) As a process executes, the CPU continuously updates the _PAGE_ACCESSED and _PAGE_DIRTY bits. Most modern CPUs provide _PAGE_ACCESSED and _PAGE_DIRTY flags, or they are implemented in software by the operating system. In the following, _PAGE_ACCESSED and _PAGE_DIRTY are used to designate said functionality on all CPUs, independently of whether the CPU provides native support or the support is provided by the underlying operating system. These teachings are combined with section 9 to further optimize the checkpointing process.
(147) To find the page corresponding to a virtual memory address, the page table needs to be traversed. An example embodiment of traversing the page table, without error handling, is given in the following pseudo code:

(148) static struct page *find_page(struct mm_struct *mm, unsigned long address)
{
    pgd = pgd_offset(mm, address);
    pmd = pmd_offset(pgd, address);
    pte = pte_offset(pmd, address);
    return pte_page(pte);
}
(151) Where pgd, pmd, and pte designate the Page Global Directory, Page Middle Directory, and Page Table Entry introduced above. The incremental checkpointer collects and uses state from the individual pages. The previously disclosed algorithm is used in a preferred embodiment to find the page associated with a particular virtual memory address.
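To make the address breakdown concrete, a two-level version of the index arithmetic can be sketched in user-space C. The 10/10/12 bit split below is an assumption chosen for illustration (Linux generally uses four levels, as noted above), and the function names merely echo the kernel's naming:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative two-level split of a 32-bit virtual address: a 10-bit
 * directory index, a 10-bit table index, and a 12-bit offset within a
 * 4 KiB page. */
#define PAGE_SHIFT 12   /* 4 KiB pages */
#define PT_BITS    10   /* 1024 entries per level */

static uint32_t pgd_index(uint32_t addr)
{
    return addr >> (PAGE_SHIFT + PT_BITS);                /* top 10 bits */
}

static uint32_t pte_index(uint32_t addr)
{
    return (addr >> PAGE_SHIFT) & ((1u << PT_BITS) - 1);  /* middle 10 bits */
}

static uint32_t page_offset(uint32_t addr)
{
    return addr & ((1u << PAGE_SHIFT) - 1);               /* low 12 bits */
}
```

In a real traversal, pgd_index selects the entry in the top-level directory, pte_index selects the PTE within the table that entry points to, and page_offset locates the byte within the page, which is exactly the walk the find_page pseudo code above performs via the kernel helpers.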
7. STRUCTURE OF CHECKPOINT
(152) The checkpoint of a multi process application is comprised of state global to the operating system, state shared across all processes within the application, and state for each process including the state of all threads within said process.
(155) By way of example, the page info for the first page in the checkpoint 262 is followed by the data contained in said first page 264. This is followed by the page info for the second page 266 in the checkpoint and the data contained in said second page 268. The same layout is followed until the last page info 270 is reached and the data contained in said last page 272.
(156) The layout of the individual process checkpoints in the just disclosed example embodiment allows for saving of only those pages used by the application, and thus supports the checkpointer optimizations previously disclosed in section 5.1.
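A minimal sketch of this interleaved page info/page data layout can be written in user-space C. The struct fields, buffer-based record functions, and names below are assumptions for illustration, not the actual checkpoint format:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Assumed per-page header: a fixed-size info block immediately followed
 * by the page's data, repeated for each page in the checkpoint. */
struct page_info {
    unsigned long vaddr;   /* virtual address of the page */
    unsigned long flags;   /* saved VMA/PTE attributes */
};

/* Append one (info, data) record at offset pos in buf; returns the
 * offset just past the record. */
static size_t ckpt_append(unsigned char *buf, size_t pos,
                          const struct page_info *info,
                          const unsigned char *data)
{
    memcpy(buf + pos, info, sizeof *info);
    pos += sizeof *info;
    memcpy(buf + pos, data, PAGE_SIZE);
    return pos + PAGE_SIZE;
}

/* Read the record at offset pos into info/data; returns the offset of
 * the next record. */
static size_t ckpt_next(const unsigned char *buf, size_t pos,
                        struct page_info *info, unsigned char *data)
{
    memcpy(info, buf + pos, sizeof *info);
    pos += sizeof *info;
    memcpy(data, buf + pos, PAGE_SIZE);
    return pos + PAGE_SIZE;
}
```

Because each record carries its own virtual address, only pages actually used by the application need to appear in the file, which is the optimization noted above.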
(157) Methods for storing the hash tables as used by AVS are well known in the art and will therefore not be discussed in further detail. Likewise, methods for storing trees, as used by the process hierarchy, are well known in the art and will therefore not be discussed further.
8. CHECKPOINT OPTIMIZATIONS USING VMAS
(158) Within the virtual address space of a process, Linux uses the technique of Virtual Memory Areas (VMAs), also called Regions, to manage memory and further refine the access rights to individual pages or groups of pages. In the disclosures the terms Region and VMA are used interchangeably to represent an interval of address space that is a multiple of the underlying page size. VMAs do not overlap and all pages within a VMA have the same attributes.
(160) By way of example, on Linux the mm_struct data structure contains a list of vm_area_structs (called *mmap), with each element in the list representing a VMA and its underlying component pages. This hierarchical layout makes it easy to assemble a checkpoint comprised of all process pages.
(161) By way of example, the relevant element of the mm_struct data structure is:

(162) struct mm_struct {
    struct vm_area_struct *mmap;
    . . .
};
(164) While the relevant elements of the vm_area_struct are:

(165) struct vm_area_struct {
    struct mm_struct *vm_mm;
    unsigned long vm_start;              // starting address
    unsigned long vm_end;                // end address + 1
    struct vm_area_struct *vm_next;      // next element
    . . .
};
(168) Based on the disclosures above, checkpointing may thus be optimized by traversing the list of vm_area_structs found from the mm_struct (*mm) located within the task_struct. Contrary to the teachings in section 5, by traversing the VMAs each set of pages is known to be in use by the process and will thus not cause page faults that need masking. This, however, means that the user-space checkpointing disclosed in section 5.1 is no longer adequate, as the calling read( ) function can no longer pre-determine the address of the page being read( ), i.e. checkpointed. The following disclosures describe a preferred embodiment of /dev/ckpt and an associated user-space calling method to support dynamic and interactive checkpointing optimized by using the VMAs.
(169) By way of example, we examine the example embodiment 340 on
(170) An example embodiment in pseudo code is:
(171) static ssize_t ckpt_read(struct file *filp, char *ubuff, size_t count, loff_t *f_pos)
{
    if (*f_pos == 0) {                  // new checkpoint
        // find start page for first vma
        pMem = address of first page in first vma;
    } else {
        pMem = address given by *f_pos;
    }
    if (*f_pos indicates end of address space)
        return 0;                       // we're done
    // read page and copy to user space
    eCount = copy_to_user(ubuff, pMem, count);
    nBytes = (eCount == 0) ? count : (count - eCount);
    *f_pos += nBytes;                   // move to start of next page or end address
    return count;
}
(173) The just disclosed read( ) function requires a more sophisticated user-space caller to accommodate the different interpretation of the return values:
(174) unsigned char ubuff[PAGE_SIZE];
int fd;

// set/seek to first page
currPageAddr = lseek(fd, 0, SEEK_SET);  // set file pos to zero
boolean done = false;
while (!done) {
    count = read(fd, ubuff, PAGE_SIZE);
    if (count > 0) {
        Save the page data (ubuff) for currPageAddr;
        // save addr of next page to be read
        currPageAddr = lseek(fd, 0, SEEK_CUR);
    } else {
        done = true;                    // at the end of a checkpoint
    }
}

In continuation of section 3, an example embodiment, without error handling, of the seek( ) for /dev/ckpt is:

static loff_t ckpt_seek(struct file *filp, loff_t offset, int type)
{
    loff_t newPos = filp->f_pos;
    switch (type) {
    case SEEK_SET: newPos = offset; break;
    case SEEK_CUR: newPos = filp->f_pos + offset; break;
    case SEEK_END: newPos = SIZE - offset; break;
    default:       newPos = -EINVAL;
    }
    filp->f_pos = newPos;
    return newPos;
}
(175) 8.1 Further Optimizations Using PTEs
(176) The checkpointing process may be further optimized by keeping track of which pages have been modified between successive checkpoints. Referring to
(177) By way of example, the Linux operating system keeps track of page access using the _PAGE_ACCESSED and _PAGE_DIRTY bits as previously disclosed in section 6. The bits are generally set automatically by the CPU hardware as the program is executing. The present invention repurposes the _PAGE_DIRTY bit in the following way:
(178) After the completion of a checkpoint, the current _PAGE_DIRTY bit is copied into one of the unused bits in the page table entry and thereby saved for future reference. In the following disclosures _PAGE_DIRTY_COPY is used as a shorthand for said copy of the _PAGE_DIRTY bit at the time of the most recent checkpoint. As the _PAGE_DIRTY bit is used by several low-level functions of the kernel, said low-level functions are updated to use _PAGE_DIRTY_COPY instead of the original _PAGE_DIRTY bit. The benefit is thus that the _PAGE_DIRTY bit, which is automatically set by the CPU hardware, becomes the tracking mechanism for which pages need to be included in the next incremental checkpoint, and that no further data structures are necessary in order to maintain complete tracking of changed pages. Upon completing the checkpoint, and after copying _PAGE_DIRTY to _PAGE_DIRTY_COPY, the _PAGE_DIRTY bit is cleared.
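The bit bookkeeping just described can be sketched in user-space C. The bit positions below are assumptions chosen for illustration and do not correspond to any real PTE layout:

```c
#include <assert.h>

/* Illustrative bit positions; not the actual Linux PTE layout. */
#define PAGE_DIRTY       (1u << 6)
#define PAGE_DIRTY_COPY  (1u << 9)   /* an otherwise-unused PTE bit */

/* Run at checkpoint completion: copy _PAGE_DIRTY into _PAGE_DIRTY_COPY
 * for future reference, then clear _PAGE_DIRTY so the CPU hardware
 * starts tracking modifications for the next incremental checkpoint. */
static unsigned int ckpt_cycle_pte(unsigned int pte)
{
    pte &= ~PAGE_DIRTY_COPY;         /* drop the stale copy */
    if (pte & PAGE_DIRTY)
        pte |= PAGE_DIRTY_COPY;      /* preserve the current dirty state */
    pte &= ~PAGE_DIRTY;              /* re-arm hardware dirty tracking */
    return pte;
}

/* Low-level code that previously consulted _PAGE_DIRTY would instead
 * consult the saved copy. */
static int pte_dirty_copy(unsigned int pte)
{
    return (pte & PAGE_DIRTY_COPY) != 0;
}
```

After ckpt_cycle_pte runs, any subsequent write to the page sets _PAGE_DIRTY again in hardware, which is precisely what marks the page for inclusion in the next incremental checkpoint.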
(180) The first check 404 is to determine if this is the first checkpoint. If it is the first checkpoint, there is no prior checkpoint and thus no _PAGE_DIRTY reference available. The checkpointer checkpoints the current page 406, copies 408 the _PAGE_DIRTY bit into _PAGE_DIRTY_COPY, and clears 410 the _PAGE_DIRTY bit. Finally, nBytes is set 412 to the number of bytes transferred in copy_to_user( ) as previously disclosed. If this is the second or later checkpoint 404, the page table entry flag is checked to see if the _PAGE_DIRTY bit is set. If _PAGE_DIRTY is set, the page has been modified and needs to be checkpointed, which is accomplished with the just described set of steps 406, 408, 410 and 412. If _PAGE_DIRTY is not set, the data on the page is unmodified and the page does not need to be checkpointed. The number of bytes, nBytes, is thus set 416 to zero. Following setting the number of bytes transferred 412, 416, the current file pointer is updated 417 to point to the start of the next page or the end of the last page, as previously disclosed. Finally, the checkpointer exits 418 as the current page has been processed.
(181) The just-disclosed incremental checkpointing offers several improvements over Neary and Backensto: where Neary and Backensto marked pages "read-only" and relied on a segment fault to capture changed pages, the present invention relies exclusively on the _PAGE_DIRTY bit, which is automatically updated by the CPU. In Neary, by primarily addressing the incremental checkpointing from user space, each segment fault causes a transition from kernel-space to user-space to handle the error condition. Said transition is generally expensive and slows application execution. The present invention deals with the access issue exclusively in kernel space and requires no page fault handling.
(182) 8.2 Merging Full and Incremental Checkpoints
(183) When using incremental checkpointing, the incremental checkpoints are merged into a full checkpoint. The checkpointing process starts with one full checkpoint, i.e. a checkpoint that includes all process memory pages, and is followed by checkpoints where only the changed pages are included. A new incremental checkpoint contains only those pages that have changed since the last checkpoint, and the update to the full checkpoint proceeds as follows: every page in the incremental checkpoint is written to the full checkpoint, thereby updating the checkpoint with the most recent checkpoint data. The process of merging an incremental checkpoint into a full checkpoint is called merging throughout the disclosures of the present invention.
(184) By way of example, if an incremental checkpoint contains page 300, the page 300 in the full checkpoint is updated. If the full checkpoint contains a page 301, and the incremental checkpoint does not contain page 301, said page 301 in the full checkpoint is left unchanged.
(185) Upon the completion of a checkpoint merge, the full checkpoint is up-to-date with the latest incremental changes and may be used as a checkpoint for a full process restore. Incremental checkpoints on their own are incomplete, as they only contain changed pages, and therefore cannot directly be used for a process restore.
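Following the page 300/301 example, the merge can be sketched in user-space C. Modeling a checkpoint as an array indexed by page number, with NULL meaning "page not present in this checkpoint", is an assumption for illustration:

```c
#include <assert.h>
#include <stddef.h>

#define NPAGES 8

/* Merge an incremental checkpoint into a full checkpoint: every page
 * present in the incremental checkpoint overwrites the corresponding
 * page in the full checkpoint; absent pages stay unchanged. */
static void ckpt_merge(const char *full[NPAGES], const char *incr[NPAGES])
{
    for (int i = 0; i < NPAGES; i++)
        if (incr[i] != NULL)
            full[i] = incr[i];   /* changed page replaces the old copy */
}
```

After the loop completes, the full checkpoint is up to date and restorable, while the incremental checkpoint on its own remains incomplete, exactly as stated above.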
(187) 8.3 Storing Checkpoints and Compression
(188) Each time an incremental checkpoint is merged into the previous checkpoint a new full checkpoint is formed. It differs from the previous checkpoint by the pages in the incremental checkpoint. Said full checkpoints are written to memory, local storage, networked store or other remote storage for potential use later. The application may be fully restored from a full checkpoint, as disclosed in the following section.
(189) In a preferred embodiment, the checkpoints are written to memory or storage uncompressed; in an alternate preferred embodiment the checkpoints are compressed prior to being written to storage. Conversely, upon reading a compressed checkpoint during restore, the checkpoint is decompressed.
(190) 8.4 Deletion of Pages
(191) As an application executes it allocates and de-allocates memory. As the application allocates memory it is automatically reflected in the VMAs as additional pages and is thus automatically picked up by the disclosed incremental checkpointer.
(192) When an application de-allocates memory, the operating system and libraries remove the allocation from the application process that had allocated the memory. Depending on the size of the de-allocated memory and the memory allocator, the corresponding memory page may be de-allocated and removed from the virtual memory area. When and if memory is returned to the operating system is not predictable, and the present invention addresses the issue of de-allocating memory as follows:
(193) As previously disclosed, the checkpointer starts by taking a full checkpoint, which contains a complete memory image and then takes one or more incremental checkpoints. After a set number of incremental checkpoints, a full checkpoint is taken again, and the cycle restarts. The full checkpoints fully account for allocated and de-allocated memory and every full checkpoint thus ensures that de-allocated memory has been properly accounted for in the checkpoint.
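The cycle just described can be sketched in user-space C. The function name and the idea of a fixed cycle-length tuning parameter are assumptions for illustration:

```c
#include <assert.h>

/* Checkpoint 0 is full, the next cycle-1 checkpoints are incremental,
 * then the cycle restarts with another full checkpoint (which also
 * reconciles any memory the application de-allocated in the meantime). */
enum ckpt_kind { CKPT_FULL, CKPT_INCREMENTAL };

static enum ckpt_kind ckpt_kind_for(unsigned long seq, unsigned long cycle)
{
    return (seq % cycle == 0) ? CKPT_FULL : CKPT_INCREMENTAL;
}
```

With a cycle length of 5, for example, checkpoints 0, 5, 10, ... are full and the rest are incremental, bounding how long de-allocated pages can linger in the merged checkpoint.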
9. RESTORING FROM CHECKPOINT
(195) The restoration of individual process checkpoints is now disclosed in further detail.
(196) As disclosed above, the kernel uses Virtual Memory Areas (VMAs) to keep track of memory. Part of restoring an application's memory image from a checkpoint therefore requires re-generation of the VMA data structures. Specifically, the vm_area_struct embedded within the mm_struct must be rebuilt to match the memory pages included within the checkpoint.
(197) To restore from a checkpoint requires said checkpoint to be a full checkpoint, as opposed to an incremental checkpoint. Since a full checkpoint contains all memory pages for the process, rebuilding the VMA infrastructure is comprised of first deleting all VMA entries in the vm_area_struct followed by adding each page one at a time.
(198) 9.1 Restoring VMAs and PTEs.
(199) The virtual memory areas are process private and can thus be rebuilt to match the checkpoint exactly. The page table and PTEs are part of the underlying operating system and must therefore be treated differently than process-private structures.
(200) In the preferred embodiment, restoring a page from a checkpoint is performed as follows: First a vm_area_struct is allocated to represent the page, the page is allocated with do_mmap( ), the vm_area_struct is updated to match the just allocated page, the vm_area_struct attributes are updated with the attributes stored in the page info block in the checkpoint, and the vm_area_struct is merged into the VMA infrastructure for the process using vma_merge( ). The just-disclosed embodiment thus restores a page from the checkpoint to the same virtual address it occupied within the address space of the original process where the checkpoint originated. Associated with the page in the process' address space is the physical PTE and its associated protection and state. Using the traversal algorithm disclosed previously, the PTE is identified for each page, and said PTE is updated with the PTE attributes stored in the page info block in the checkpoint of said page.
(201) The entire process is restored by first clearing all vm_area_structs from the mm_struct, and then restoring each page as just disclosed. The repeated restoration of every page in the checkpoint rebuilds the process' memory image to match the image at the time of the checkpoint, and by also updating the associated physical PTE, the operating system's view and the process' internal state is reconstructed to match the original process.
(202) The kernel functions to manipulate VMAs and PTEs are well documented and will therefore not be further described herein.
(203) 9.2 Remapping AVS Entries.
(204) Finally, after all processes have been recreated and restored, the resource information in the AVS is remapped against the new image. Resources, such as open files, have components both in user-space and in kernel-space, and must be rebuilt to be functional after a restore. By way of example, a file is re-opened from user space, which triggers a recreation of its kernel state. This last step updates the internals of the resource and makes it a functional resource on the new host operating system.
(205) After remapping of AVS entries the application is ready to run.
10. DEPLOYMENT SCENARIOS
(207) In one embodiment, the invention is configured with a central file server 302, a primary server 304, and a backup server 306. The primary server 304 runs the primary application and the backup serves as backup. The primary 304 and backup 306 are connected to each other and to the storage device 302 via a network 308. The network is connected to the internet 316 for external access. In another embodiment the primary server 304 has two backup servers: backup 306 and backup-2 305. In yet another embodiment the primary 304 runs in the data center, while the backup 317 runs off site, accessed over the internet.
(208) In one embodiment a PC client 312 on the local network 308 is connected to the primary application while the backup application is prepared to take over in the event of a fault. In another embodiment a PC 314 is configured to access the primary application server 304 over the public internet 316. In a third embodiment a cell phone or PDA 310 accesses the primary application 304 over wireless internet 316, 318. The present invention is configured to serve all clients simultaneously, independently of how they connect into the application server; and in all cases the backup server is prepared to take over in the event of a fault.
(209) Finally, as the interceptors and kernel module are components implemented outside the application, the operating system and system libraries, the present invention provides checkpointing without requiring any modifications to the application, operating system and system libraries.
(210) The just-illustrated example embodiments should not be construed as limiting the scope of the invention, but as merely providing illustrations of some of the exemplary embodiments of this invention.
11. CONCLUSION
(211) In the embodiments described herein, an example programming environment, systems and configurations were disclosed for which one or more embodiments according to the invention were taught. It should be appreciated that the present invention can be implemented by one of ordinary skill in the art using different program organizations and structures, different data structures, different configurations, different systems, and of course any desired naming conventions without departing from the teachings herein. In addition, the invention can be ported, or otherwise configured for, use across a wide-range of operating system environments.
(212) Although the description above contains many details, these should not be construed as limiting the scope of the invention but as merely providing illustrations of some of the exemplary embodiments of this invention. Therefore, it will be appreciated that the scope of the present invention fully encompasses other embodiments which may become obvious to those skilled in the art, and that the scope of the present invention is accordingly to be limited by nothing other than the appended claims, in which reference to an element in the singular is not intended to mean “one and only one” unless explicitly so stated, but rather “one or more.” All structural and functional equivalents to the elements of the above-described preferred embodiment that are known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the present claims. Moreover, it is not necessary for a device or method to address each and every problem sought to be solved by the present invention, for it to be encompassed by the present claims. Furthermore, no element, component, or method step in the present disclosure is intended to be dedicated to the public regardless of whether the element, component, or method step is explicitly recited in the claims. No claim element herein is to be construed under the provisions of 35 U.S.C. 112, sixth paragraph, unless the element is expressly recited using the phrase “means for.”