Patent classifications
G06F11/1407
METHOD AND APPARATUS FOR INSTRUCTION CHECKPOINTING IN A DATA PROCESSING DEVICE POWERED BY AN UNPREDICTABLE POWER SOURCE
A computer-implemented method comprises generating computer executable code as one or more code portions; detecting a number of processing operations required to reach one or more predetermined stages in execution of each code portion; and associating with each code portion one or more progress indicators, each representing a respective execution stage of the one or more predetermined stages within execution of that code portion. The code portions are executed by a processor powered by an unpredictable power source. When the processor detects an energy condition indicating that no more than a reserve quantity of electrical energy is available, the progress indicators are used to determine whether or not to perform a checkpoint.
Cross cluster replication
Methods and systems for cross cluster replication are provided. Exemplary methods include: periodically requesting by a follower cluster history from a leader cluster, the history including at least one operation and sequence number pair, the operation having changed data in a primary shard of the leader cluster; receiving history and a first global checkpoint from the leader cluster; when a difference between the first global checkpoint and a second global checkpoint exceeds a user-defined value, concurrently making multiple additional requests for history from the leader cluster; and when a difference between the first global checkpoint and the second global checkpoint is less than a user-defined value, executing the at least one operation, the at least one operation changing data in a primary shard of the follower cluster, such that an index of the follower cluster replicates an index of the leader cluster.
System and method for hybrid kernel- and user-space incremental and full checkpointing
A system includes a multi-process application that runs. A multi-process application runs on primary hosts and is checkpointed by a checkpointer comprised of at least one of a kernel-mode checkpointer module and one or more user-space interceptors providing at least one of barrier synchronization, checkpointing thread, resource flushing, and an application virtualization space. Checkpoints may be written to storage and the application restored from said stored checkpoint at a later time. Checkpointing may be incremental using Page Table Entry (PTE) pages and Virtual Memory Areas (VMA) information. Checkpointing is transparent to the application and requires no modification to the application, operating system, networking stack or libraries. In an alternate embodiment the kernel-mode checkpointer is built into the kernel.
CONTROL STATE PRESERVATION DURING TRANSACTIONAL EXECUTION
A method includes saving a control state for a processor in response to commencing a transactional processing sequence, wherein saving the control state produces a saved control state. The method also includes permitting updates to the control state for the processor while executing the transactional processing sequence. Examples of updates to the control state include key mask changes, primary region table origin changes, primary segment table origin changes, CPU tracing mode changes, and interrupt mode changes. The method also includes restoring the control state for the processor to the saved control state in response to encountering a transactional error during the transactional processing sequence. In some embodiments, saving the control state comprises saving the current control state to memory corresponding to internal registers for an unused thread or another level of virtualization. A corresponding computer system and computer program product are also disclosed herein.
ELASTICALLY MANAGING WORKERS OF MULTI-WORKER WORKLOADS ON ACCELERATOR DEVICES
The disclosure herein describes elastically managing the execution of workers of multi-worker workloads on accelerator devices. A first worker of a workload is executed on an accelerator device during a first time interval. A first context switch point is identified when the first worker is in a first worker state. At the identified context switch point, a first memory state of the first worker is stored in a host memory and the accelerator device is configured to a second memory state of the second worker. The second worker is executed during a second time interval and a second context switch point is identified at the end of the second time interval when the second worker is in a state that is equivalent to the first worker state. During the intervals, collective communication operations between the workers are accumulated and, at the second context switch point, the accumulated operations are performed.
SECURELY EXECUTING SOFTWARE BASED ON CRYPTOGRAPHICALLY VERIFIED INSTRUCTIONS
Securely executing instructions of software on a computerized device by accessing a software of a computerized device, wherein the software includes a plurality of instructions and respective reference message authentication codes (MACs), generating a cryptographic key based at least in part on a key derivation function, wherein arguments of the key derivation function are based at least in part on a unique identifier of the computerized device and a value extended from a measurement of a content of the software of an extension mechanism of a platform configuration register of the computerized device, verifying an instruction of the plurality of instructions of the software based at least in part on the cryptographic key and a reference MAC of the respective reference MACs, and in response to verifying the instruction of the plurality of instructions of the software, executing the instruction.
Proactive cluster compute node migration at next checkpoint of cluster upon predicted node failure
While scheduled checkpoints are being taken of a cluster of active compute nodes distributively executing an application in parallel, a likelihood of failure of the active compute nodes is periodically and independently predicted. Responsive to the likelihood of failure of a given active compute node exceeding a threshold, the given active compute node is proactively migrated to a spare compute node of the cluster at a next scheduled checkpoint. Another spare compute node of the cluster can perform prediction and migration. Prediction can be based on both hardware events and software events regarding the active compute nodes.
System and method for hybrid kernel and user-space checkpointing using a character device
A system, method, and computer readable medium for hybrid kernel-mode and user-mode checkpointing of multi-process applications using a character device. The computer readable medium includes computer-executable instructions for execution by a processing system. A multi-process application runs on primary hosts and is checkpointed by a checkpointer comprised of a kernel-mode checkpointer module and one or more user-space interceptors providing barrier synchronization, checkpointing thread, resource flushing, and an application virtualization space. Checkpoints may be written to storage and the application restored from said stored checkpoint at a later time. Checkpointing is transparent to the application and requires no modification to the application, operating system, networking stack or libraries. In an alternate embodiment the kernel-mode checkpointer is built into the kernel.
ERROR CONTAINMENT FOR ENABLING LOCAL CHECKPOINT AND RECOVERY
Various embodiments include a parallel processing computer system that detects memory errors as a memory client loads data from memory and disables the memory client from storing data to memory, thereby reducing the likelihood that the memory error propagates to other memory clients. The memory client initiates a stall sequence, while other memory clients continue to execute instructions and the memory continues to service memory load and store operations. When a memory error is detected, a specific bit pattern is stored in conjunction with the data associated with the memory error. When the data is copied from one memory to another memory, the specific bit pattern is also copied, in order to identify the data as having a memory error.
Data recovery using bitmap data structure
Examples of the present disclosure describe implementing bitmap-based data replication when a primary form of data replication between a source device and a target device cannot be used. According to one example, a temporal identifier may be received from the target device. If the source device determines that the primary replication method is unable to be used to replicate data associated with the temporal identifier, a secondary replication method may be initiated. The secondary replication method may utilize a recovery bitmap identifying data blocks that have changed on the source device since a previous event.