Patent classifications
G06F11/2043
Node device, recovery operation control method, and non-transitory computer readable medium storing recovery operation control program
When a node device (10-1) detects a system failure in a cluster system (1), it determines whether it is an avoidance priority device. When the node device (10-1) determines that it is an avoidance priority device, it transmits a request signal to a node device (10-2) other than itself. The request signal requests a report on whether the node device (10-2) is in a normal state or an abnormal state. Based on the report from the node device (10-2), the node device (10-1) then determines whether to execute or to avoid executing its own recovery operation.
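The decision flow in this abstract can be illustrated with a minimal Python sketch. The abstract does not say how the report maps to a decision, so the policy below (avoid recovery when any peer reports an abnormal state, so that not every node goes into recovery at once) is an assumption, as are the names `NodeDevice`, `decide_recovery`, and the `"normal"`/`"abnormal"` strings.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class NodeDevice:
    name: str
    avoidance_priority: bool  # hypothetical configuration flag


def decide_recovery(node: NodeDevice, peers: Dict[str, Callable[[], str]]) -> str:
    """Return "recover" or "avoid" for the node's own recovery operation."""
    if not node.avoidance_priority:
        # Assumption: nodes without avoidance priority always recover.
        return "recover"
    # Each callable stands in for sending the request signal to one peer
    # and receiving its report ("normal" or "abnormal").
    reports = {name: report() for name, report in peers.items()}
    # Assumed policy: avoid recovery if any peer is abnormal.
    if any(state == "abnormal" for state in reports.values()):
        return "avoid"
    return "recover"
```

For example, `decide_recovery(NodeDevice("10-1", True), {"10-2": lambda: "abnormal"})` returns `"avoid"` under this assumed policy.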
MEMORY SCANNING OPERATION IN RESPONSE TO COMMON MODE FAULT SIGNAL
An apparatus comprises a plurality of redundant processing units (4) to perform data processing redundantly in lockstep; common mode fault detection circuitry (6, 22) to detect an event indicative of a potential common mode fault affecting each of the plurality of redundant processing units; a memory (10) shared between the plurality of redundant processing units; and memory checking circuitry (30) to perform a memory scanning operation to scan at least part of the memory for errors; in which the memory checking circuitry (30) performs the memory scanning operation in response to a common mode fault signal generated by the common mode fault detection circuitry (6, 22) indicating that the event indicative of a potential common mode fault has been detected.
Processor repair
A processor comprises a plurality of processing units, wherein there is a fixed transmission time for transmitting a message from a sending processing unit to a receiving processing unit, based on the physical positions of the sending and receiving processing units in the processor. The processing units are arranged in a column, and the fixed transmission time depends on the position of a processing circuit in the column. An exchange fabric is provided for exchanging messages between sending and receiving processing units, the columns being arranged with respect to the exchange fabric such that the fixed transmission time depends on the distances of the processing circuits with respect to the exchange fabric. The processor comprises at least one delay stage for each processing circuit and switching circuitry for selectively switching the delay stage into or out of a communication path involved in message exchange. For processing circuits up to a defective processing circuit in the column, the delay stage is switched into the communication path, and for processing circuits above the defective processing circuit in the column, including a repairing processing circuit which repairs the defective processing circuit, the delay stage is switched out of the communication path, whereby the fixed transmission time of processing circuits is preserved in the event of a repair of the column.
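A small Python model may help show why switching the delay stage preserves the fixed transmission time. It assumes one time unit per physical position in the column, a delay stage that adds exactly one unit, and a single spare circuit at the far end of the column. None of these figures come from the abstract, and `column_latencies` is a hypothetical name.

```python
from typing import List, Optional


def column_latencies(num_logical: int, defective: Optional[int] = None) -> List[int]:
    """Latency seen by each logical processing circuit in one column."""
    latencies = []
    for logical in range(num_logical):
        if defective is None or logical < defective:
            # Below the defect: same physical slot, delay stage switched in.
            physical, delay_in = logical, True
        else:
            # At or above the defect: shifted one slot further from the
            # exchange fabric, so the delay stage is switched out.
            physical, delay_in = logical + 1, False
        latencies.append(physical + (1 if delay_in else 0))
    return latencies
```

In this model every logical circuit keeps the same latency wherever the defect sits, because the extra unit of distance replaces the unit contributed by the delay stage.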
Method and apparatus for backup communication
Embodiments of the present disclosure relate to a method and an apparatus for backup communication. The method comprises: detecting a failure of a management interface between a processor and a baseboard management controller; in response to detecting the failure of the management interface, performing backup communication between the processor and the baseboard management controller using a control interface, wherein the baseboard management controller can obtain a physical parameter of the processor via the control interface; and transmitting a packet between the processor and the baseboard management controller via the control interface.
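The failover described here can be sketched in Python as follows. The `Interface` class, the `send_with_backup` function, and the use of `ConnectionError` to signal a failed management interface are illustrative assumptions, not details from the abstract.

```python
class Interface:
    """Stand-in for a link between the processor and the BMC."""

    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name = name
        self.healthy = healthy

    def send(self, packet: bytes) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} interface failed")
        # Report which interface carried the packet and how many bytes.
        return f"{self.name}:{len(packet)}"


def send_with_backup(management: Interface, control: Interface, packet: bytes) -> str:
    """Use the management interface, falling back to the control interface."""
    try:
        return management.send(packet)
    except ConnectionError:
        # Failure detected: perform backup communication over the control interface.
        return control.send(packet)
```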
REMOTE COPY SYSTEM AND REMOTE COPY MANAGEMENT METHOD
A first storage system that provides a primary site and a second storage system that provides a secondary site are provided to quickly and easily switch between the storage systems. A storage controller of the storage system performs remote copy from a first data volume of the first storage system to a second data volume of the second storage system; after a failover is performed from the primary site to the secondary site, accumulates the data and operations processed at the secondary site in a journal volume of the second storage system as a secondary site journal; and restores the first data volume using the secondary site journal when the primary site is recovered.
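As a rough illustration of restoring the first data volume from the secondary site journal, the Python sketch below replays journal entries in sequence order. The entry layout `(sequence, operation, block, data)` and the two operations `"write"` and `"delete"` are assumptions made for this example only.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical journal entry: (sequence number, operation, block id, data)
JournalEntry = Tuple[int, str, int, Optional[bytes]]


def restore_primary(volume: Dict[int, bytes], journal: List[JournalEntry]) -> Dict[int, bytes]:
    """Replay the secondary site journal onto a copy of the primary volume."""
    restored = dict(volume)
    # Replay strictly in sequence order, whatever order entries arrived in.
    for _seq, op, block, data in sorted(journal, key=lambda entry: entry[0]):
        if op == "write":
            restored[block] = data
        elif op == "delete":
            restored.pop(block, None)
    return restored
```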
TECHNIQUES FOR MEMORY ERROR ISOLATION
Apparatuses, systems, and techniques to detect memory errors and isolate or migrate partitions on a parallel processing unit using an application programming interface to facilitate parallel computing, such as CUDA. In at least one embodiment, interrupts are intercepted and processed on a graphics processing unit indicating a memory error for one or more partitions, and a policy is applied to isolate that memory error from other partitions.
DISTRIBUTED STORAGE SYSTEM AND STORAGE CONTROL METHOD
Provided are: one or more storage units including a plurality of physical storage devices (PDEVs); and a plurality of computers connected to the one or more storage units via a communication network. Two or more computers execute storage control programs (hereinafter, control programs), respectively. Two or more control programs share a plurality of storage areas provided by the plurality of PDEVs and metadata regarding the plurality of storage areas. When a control program fails, another control program sharing the metadata accesses data stored in a storage area. When a PDEV fails, the control program restores data of the failed PDEV using redundant data stored in another PDEV that has not failed.
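The abstract does not specify the redundancy scheme. As one possible illustration, the Python sketch below assumes single XOR parity across equally sized blocks, with the parity block held on a PDEV that has not failed. The function names are hypothetical.

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_failed(surviving: List[bytes], parity: bytes) -> bytes:
    """Recover the single missing block from the survivors and the parity block."""
    return xor_blocks(surviving + [parity])
```

With parity computed as `xor_blocks` over all data blocks, passing the surviving blocks and the parity to `rebuild_failed` yields the block of the failed PDEV.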
METHOD OF USING A SECURE PRIVATE NETWORK TO ACTIVELY CONFIGURE THE HARDWARE OF A COMPUTER OR MICROCHIP
A method for a computer or microchip with one or more inner hardware-based access barriers or firewalls that establish one or more private units disconnected from a public unit or units having connection to the public Internet and one or more of the private units have a connection to one or more non-Internet-connected private networks for private network control of the configuration of the computer or microchip using active hardware configuration, including field programmable gate arrays (FPGA). The hardware-based access barriers include a single out-only bus and/or another in-only bus with a single on/off switch.
Log management for a multi-node data processing system
A computer-readable medium comprises instructions which, upon execution by a node in a multi-node data processing system, enable the node to serve as a first leader node by receiving system log data from multiple compute nodes in a first cluster of the multi-node data processing system, and by saving the system log data in shared storage that is also used by second and third leader nodes to save system log data for compute nodes in second and third clusters of the multi-node data processing system. The instructions further enable the node to respond to failure of either of the second and third leader nodes by automatically assuming system logging duties for the compute nodes in the cluster that was associated with the failed leader node. The instructions may also enable the node to serve as a console bridge and to save console log data in the shared storage.
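The takeover of logging duties can be sketched in Python as below. The sketch models the shared storage as a single dictionary referenced by every leader. The class `LeaderNode` and its methods are hypothetical names, and failure detection itself is left out.

```python
from typing import Dict, List, Set


class LeaderNode:
    def __init__(self, name: str, cluster: str, shared_storage: Dict[str, List[str]]) -> None:
        self.name = name
        self.clusters: Set[str] = {cluster}  # clusters this leader logs for
        self.storage = shared_storage        # same object shared by all leaders

    def receive_log(self, cluster: str, line: str) -> bool:
        """Save a system log line if this leader is responsible for the cluster."""
        if cluster not in self.clusters:
            return False
        self.storage.setdefault(cluster, []).append(line)
        return True

    def take_over(self, failed: "LeaderNode") -> None:
        """Assume logging duties for the clusters of a failed leader."""
        self.clusters |= failed.clusters
        failed.clusters = set()
```

Because the log data lives in shared storage, lines saved by the failed leader remain available after the surviving leader takes over.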
METHOD AND APPARATUS FOR PERFORMING NODE INFORMATION EXCHANGE MANAGEMENT OF ALL FLASH ARRAY SERVER
A method and apparatus for performing node information exchange management of an all flash array (AFA) server are provided. The method may include: utilizing a hardware manager module among multiple program modules running on any node of multiple nodes of the AFA server to control multiple hardware components in a hardware layer of the any node, for establishing a Board Management Controller (BMC) path between the any node and a remote node among the multiple nodes; utilizing at least two communications paths to exchange respective node information of the any node and the remote node, to control a high availability (HA) architecture of the AFA server according to the respective node information of the any node and the remote node, for continuously providing a service to a user of the AFA server; and in response to malfunction of any communications path, utilizing remaining communications path(s) to exchange the node information.
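The use of remaining communications paths can be illustrated with a short Python sketch. Each path is modelled as a callable that raises `ConnectionError` when it malfunctions. This signalling convention and the name `exchange_node_info` are assumptions for illustration.

```python
from typing import Callable, Dict, List


def exchange_node_info(paths: Dict[str, Callable[[dict], None]], info: dict) -> List[str]:
    """Send node information over every working path; return the paths used."""
    used = []
    for name, send in paths.items():
        try:
            send(info)
        except ConnectionError:
            # This path malfunctioned; keep going with the remaining path(s).
            continue
        used.append(name)
    if not used:
        raise ConnectionError("all communications paths failed")
    return used
```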