G06F11/2043

Handling migration in a virtualization environment
10713132 · 2020-07-14

In one embodiment, a system for migrating virtual machines in a virtualization environment includes a plurality of host machines implementing the virtualization environment and a migration controller. Each of the host machines includes a hypervisor, one or more user virtual machines (UVMs) and a virtual machine controller. The system further implements a virtual disk comprising a plurality of storage devices, the virtual disk being accessible by the virtual machine controllers, which conduct I/O transactions with the virtual disk. The migration controller determines a segment size and, for each host machine, determines a number of required segments for the UVMs running on the host machine. The controller computes a number of reserved segments based on a total number of host machines and a largest one of the numbers of required segments. The reserved segments are then assigned among the plurality of host machines.
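
The segment accounting described above can be sketched as follows. This is a minimal illustration, not the claimed method: the abstract does not give the formula, so the sketch assumes that the reserve equals the largest per-host requirement (enough to absorb the UVMs of any single host) and that the reserve is spread as evenly as possible across all hosts. The function names, the per-UVM sizes in GB, and the even-spread policy are all hypothetical.

```python
import math


def required_segments(uvm_sizes_gb, segment_size_gb):
    """Segments needed by one host: each UVM is rounded up to whole segments."""
    return sum(math.ceil(size / segment_size_gb) for size in uvm_sizes_gb)


def plan_reserved_segments(hosts, segment_size_gb):
    """hosts maps a host name to a list of UVM sizes (GB) running on that host.

    Returns (required per host, total reserved, reserved segments per host).
    """
    required = {
        name: required_segments(sizes, segment_size_gb)
        for name, sizes in hosts.items()
    }
    # Assumption: reserve as many segments as the most heavily loaded host needs.
    largest = max(required.values())
    # Assumption: spread the reserve evenly; the first `extra` hosts get one more.
    base, extra = divmod(largest, len(hosts))
    assignment = {
        name: base + (1 if i < extra else 0)
        for i, name in enumerate(sorted(hosts))
    }
    return required, largest, assignment
```

With three hosts whose UVMs need 3, 2, and 3 segments respectively, this sketch reserves 3 segments in total and places one on each host.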

Processor Repair
20200201704 · 2020-06-25

A processor comprises a plurality of processing units, wherein there is a fixed transmission time for transmitting a message from a sending processing unit to a receiving processing unit, based on the physical positions of the sending and receiving processing units in the processor. The processing units are arranged in a column, and the fixed transmission time depends on the position of a processing circuit in the column. An exchange fabric is provided for exchanging messages between sending and receiving processing units, the columns being arranged with respect to the exchange fabric such that the fixed transmission time depends on the distances of the processing circuits with respect to the exchange fabric. The processor comprises at least one delay stage for each processing circuit and switching circuitry for selectively switching the delay stage into or out of a communication path involved in message exchange. For processing circuits up to a defective processing circuit in the column, the delay stage is switched into the communication path, and for processing circuits above the defective processing circuit in the column, including a repairing processing circuit which repairs the defective processing circuit, the delay stage is switched out of the communication path, whereby the fixed transmission time of the processing circuits is preserved in the event of a repair of the column.
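
A toy timing model can show why switching the delay stage out above the defect preserves latency. The model is an assumption made for illustration: each position in the column costs one `hop` of wire delay towards the exchange fabric, the delay stage adds `stage_delay`, and a repaired column shifts every logical circuit at or above the defect up by one physical position. None of these names or numbers come from the abstract.

```python
def logical_latencies(num_logical, defective=None, hop=1, stage_delay=1):
    """Latency seen by each logical circuit in one column.

    defective is the physical index of the faulty circuit, or None when the
    column needs no repair. The physical column is assumed to hold one spare.
    """
    latencies = []
    for logical in range(num_logical):
        if defective is None or logical < defective:
            # Below the defect (or no defect): same position, delay stage in.
            physical, delay_in = logical, True
        else:
            # At or above the defect: shifted up one position, delay stage out.
            physical, delay_in = logical + 1, False
        latencies.append(physical * hop + (stage_delay if delay_in else 0))
    return latencies
```

In this model the latencies match only when `stage_delay` equals `hop`, that is, when the delay stage is sized to one position's worth of wire delay. Under that assumption, `logical_latencies(4)` and `logical_latencies(4, defective=2)` return the same list.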

Capturing snapshots of offload applications on many-core coprocessors

Methods are provided. A method includes capturing a snapshot of an offload process being executed by one or more many-core processors. The offload process is in signal communication with a host process being executed by a host processor. At least the offload process is in signal communication with a monitor process. The method further includes terminating the offload process on the one or more many-core processors, by the monitor process, responsive to a communication between the monitor process and the offload process being disrupted. The snapshot includes a respective predetermined minimum set of information required to restore a same state of the process as when the snapshot was taken.

METHOD AND APPARATUS FOR BACKUP COMMUNICATION

Embodiments of the present disclosure relate to a method and an apparatus for backup communication. The method comprises: detecting a failure of a management interface between a processor and a baseboard management controller; in response to detecting the failure of the management interface, performing backup communication between the processor and the baseboard management controller using a control interface, wherein the baseboard management controller can obtain a physical parameter of the processor via the control interface; and transmitting a packet between the processor and the baseboard management controller via the control interface.

VALIDATION OF DATA WRITTEN VIA TWO DIFFERENT BUS INTERFACES TO A DUAL SERVER BASED STORAGE CONTROLLER

A first server of a storage controller is configured to communicate with a host via a first bus interface, and a second server of the storage controller is configured to communicate with the host via a second bus interface. Data is written from the host via the first bus interface to a cache of the first server and via the second bus interface to a non-volatile storage of the second server. The data stored in the cache of the first server is periodically compared to the data stored in the non-volatile storage of the second server.
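
The comparison step can be sketched as below. This is an illustrative assumption about how such a check might look, not the controller's actual logic: the cache and the non-volatile storage are modelled as plain dictionaries keyed by a hypothetical track identifier, and a scheduler (not shown) would call the function periodically.

```python
def find_mismatches(cache, nvs):
    """Return the sorted track ids whose data differs between the two copies.

    cache: data written over the first bus interface (first server's cache).
    nvs:   data written over the second bus interface (second server's
           non-volatile storage).
    A track present in only one of the two copies also counts as a mismatch.
    """
    tracks = set(cache) | set(nvs)
    return sorted(t for t in tracks if cache.get(t) != nvs.get(t))
```

An empty result means both write paths delivered identical data for every track; a non-empty result identifies the tracks that need attention.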

Resource coordination method, apparatus, and system for database cluster
10642822 · 2020-05-05

A resource coordination method, apparatus, and system for a database cluster, in which an active coordinator node obtains status information corresponding to each of multiple processing nodes, where the status information indicates an operating load status of the processing node; determines, according to that status information, whether the active coordinator node has an idle resource whose capacity is a preset threshold X; and, if it does, instructs each processing node to upload subsequently generated clean page data to the active coordinator node.

PROGRAMMING MODEL AND FRAMEWORK FOR PROVIDING RESILIENT PARALLEL TASKS
20200110676 · 2020-04-09

Exemplary embodiments herein describe programming models and frameworks for providing parallel and resilient tasks. Tasks are created in accordance with predetermined structures. Defined tasks are stored as data objects in a shared pool of memory that is made up of disaggregated memory communicatively coupled via a high performance interconnect that supports atomic operations as described herein. Heterogeneous compute nodes are configured to execute tasks stored in the shared memory. When compute nodes fail, they do not impact the shared memory, the tasks or other data stored in the shared memory, or the other non-failing compute nodes. The non-failing compute nodes can take on the responsibility of executing tasks owned by other compute nodes, including tasks of a compute node that fails, without needing a centralized manager or schedule to re-assign those tasks. Task processing can therefore be performed in parallel and without impact from node failures.
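
The decentralized takeover described above can be sketched with a compare-and-swap on a per-task owner field. Everything here is a stand-in chosen for illustration: a Python dictionary plays the shared memory pool, a `threading.Lock` emulates the interconnect's atomic operation, and the `failed_nodes` argument assumes some separate failure detector that the abstract does not describe.

```python
import threading


class TaskPool:
    """Stand-in for tasks stored as data objects in shared memory."""

    def __init__(self, tasks):
        self._lock = threading.Lock()  # emulates a hardware atomic operation
        self.owner = {task: None for task in tasks}
        self.done = set()

    def compare_and_swap(self, task, expected, new):
        """Atomically take ownership if the task is unfinished and owned by `expected`."""
        with self._lock:
            if task not in self.done and self.owner[task] == expected:
                self.owner[task] = new
                return True
            return False

    def mark_done(self, task):
        with self._lock:
            self.done.add(task)


def run_node(pool, node, failed_nodes=()):
    """Claim unowned tasks, and tasks owned by failed nodes, then execute them."""
    executed = []
    for task in list(pool.owner):
        claimed = pool.compare_and_swap(task, None, node)
        if not claimed:
            # No central scheduler: the surviving node steals the task itself.
            claimed = any(
                pool.compare_and_swap(task, dead, node) for dead in failed_nodes
            )
        if claimed:
            pool.mark_done(task)  # placeholder for actually running the task
            executed.append(task)
    return executed
```

Because ownership only ever changes through the atomic swap, two surviving nodes that race for the same orphaned task cannot both win it.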

Failure indication in shared memory

In some examples, a node of a computing system may include a failure identification engine and a failure response engine. The failure identification engine may identify a failure condition for a system function of the node and the failure response engine may store a failure indication in a shared memory to trigger takeover of the system function by a different node of the computing system.
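
A minimal sketch of this handshake follows, assuming a dictionary as a stand-in for the shared memory and a polling peer. The `Node` class, its method names, and the layout of the failure indication (function name mapped to the failing node's name) are hypothetical; the abstract does not specify them.

```python
class Node:
    def __init__(self, name, shared, functions):
        self.name = name
        self.shared = shared  # stand-in for the shared memory region
        self.functions = set(functions)

    def report_failure(self, function):
        """Failure response: store a failure indication for the function."""
        self.functions.discard(function)
        self.shared[function] = self.name

    def poll_and_take_over(self):
        """Take over every function that another node has flagged as failed."""
        taken = []
        for function, failed_node in list(self.shared.items()):
            if failed_node != self.name:
                del self.shared[function]  # clear the indication once handled
                self.functions.add(function)
                taken.append(function)
        return taken
```

A node never takes over a function it flagged itself, so the indication stays in place until a different node picks it up.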

Restoring distributed shared memory data consistency within a recovery process from a cluster node failure

A DSM component is organized as a matrix of pages. The data structure of a set of data structures occupies a column in the matrix of pages. A recovery file is maintained in a persistent storage. The recovery file consists of entries, and each one of the entries corresponds to a column in the matrix of pages by a location of each one of the entries. The set of data structures is stored in the DSM component and in the persistent storage. Incorporated into each one of the plurality of entries in the recovery file is an indication of whether an associated column in the matrix of pages is assigned with the data structure of the set of data structures; and additionally incorporated into each one of the plurality of entries in the recovery file are identifying key properties of the data structure of the set of data structures.
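
The positional layout of the recovery file can be sketched with fixed-size records, so that the entry for column `c` sits at byte offset `c * entry_size`. The record format (a one-byte assigned flag plus a 16-byte key) and the use of an in-memory `io.BytesIO` in place of persistent storage are assumptions made for illustration only.

```python
import io
import struct

# Hypothetical entry layout: assigned flag (bool) + 16-byte key, null padded.
ENTRY = struct.Struct("<?16s")


def new_recovery_file(num_columns):
    """Create a file with one unassigned entry per column of the page matrix."""
    return io.BytesIO(ENTRY.pack(False, b"") * num_columns)


def write_entry(f, column, assigned, key=b""):
    """Record whether `column` holds a data structure and its identifying key."""
    f.seek(column * ENTRY.size)
    f.write(ENTRY.pack(assigned, key))


def read_entry(f, column):
    """Return (assigned, key) for the entry located at the column's offset."""
    f.seek(column * ENTRY.size)
    assigned, key = ENTRY.unpack(f.read(ENTRY.size))
    return assigned, key.rstrip(b"\x00")
```

During recovery, a scan over all entries would tell which columns were assigned and which data structure each one held, without any per-entry column number stored in the file.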

Microcontroller and electronic control unit
10592356 · 2020-03-17

A microcontroller includes two processing blocks, each having a Central Processing Unit (CPU) and a peripheral circuit. Access to the peripheral circuit in each processing block, that is, to a Read-Only Memory (ROM) or a Pulse Width Modulator (PWM) signal generator, is limited to the CPU disposed in the same processing block. The fail-safe functionality of the microcontroller is thereby improved.