Patent classifications
G06F11/2046
Computer cluster with adaptive quorum rules
The fail-over computer cluster enables multiple computing devices to operate using adaptive quorum rules to dictate which nodes are in the fail-over cluster at any given time. The adaptive quorum rules provide requirements for communications between nodes and connections with voting file systems. The adaptive quorum rules include particular recovery rules for unplanned changes in node configuration, such as those due to a disruptive event. Such recovery quorum rules enable the fail-over cluster to continue operating with various changed configurations of its node members as a result of the disruptive event. In the changed configuration, access to voting file systems may not be required for a majority-group subset of nodes. If no majority-group subset remains, nodes may need direct or indirect access to voting file systems.
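The recovery rule in the last two sentences can be illustrated with a minimal sketch. It is one possible reading of the abstract, not the claimed method: the function name `surviving_group`, the representation of groups as sets of node names, and the tie-break among minority groups (largest group first, then alphabetical order) are all assumptions made for illustration. "Indirect" access is modeled as having at least one group member that can reach a voting file system.

```python
def surviving_group(partitions, cluster_size, voting_access):
    """Pick the group of nodes allowed to stay in the cluster (hypothetical sketch).

    partitions:    list of sets, each a group of mutually reachable nodes
    cluster_size:  number of nodes in the cluster before the disruptive event
    voting_access: set of nodes that can reach a voting file system directly
    """
    # Rule 1: a strict majority group survives without needing voting files.
    for group in partitions:
        if 2 * len(group) > cluster_size:
            return set(group)

    # Rule 2: no majority remains, so a group needs voting file access,
    # either directly or indirectly through one of its own members.
    eligible = [set(g) for g in partitions if set(g) & set(voting_access)]
    if not eligible:
        return set()

    # Assumed tie-break (not from the abstract): largest group, then by name.
    return min(eligible, key=lambda g: (-len(g), sorted(g)))
```

For example, with five nodes split three and two, the three-node group survives regardless of voting file access; if one node is lost and the rest split two and two, only a group containing a node with voting file access can continue.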
SYSTEMS, DEVICES, AND METHODS FOR CONTROLLER DEVICES HANDLING FAULT EVENTS
A controller chip includes a first cluster including one or more first controller units, a first power supply grid, a first clock tree structure to supply one or more clock signals, and at least a first power supply input. A second cluster includes one or more second controller units, a second power supply grid, a second clock tree structure to supply one or more clock signals, and at least a second power supply input. A monitoring cluster includes a monitoring circuit configured to: monitor the power supply and the clock signal supply of each of the first cluster and second cluster, and in the event of determining at least one of a power supply failure or a clock signal supply failure in one cluster of the first cluster or the second cluster, indicate the failure to the other cluster to take one or more actions.
COMPUTER-IMPLEMENTED RUNTIME SYSTEM, HEALTHCARE NETWORK, METHOD AND COMPUTER PROGRAM
A computer-implemented runtime system is operable to provide a continuous product execution runtime environment for an application via a healthcare network. The system includes a focus machine and an action plan repository, to provide an autonomous runtime environment by at least: monitoring a running use case of at least one application on at least one device; taking over responsibility for the running use case of the at least one application, upon an error state being detected for the monitored running use case; analyzing the detected error state of the running use case; obtaining at least one suitable substitution action out of a plurality of actions deposited in the action plan repository, based on the analyzed error state of the running use case; and terminating and completing at least a part of the running use case, by employing the at least one substitution action obtained, on the at least one application.
Distributed computing in a process control environment
High availability and data migration in a distributed process control computing environment. Allocation algorithms distribute data and applications among available compute nodes, such as controllers in a process control system. In the process control system, an input/output device, such as a fieldbus module, can be used by any controller. Databases store critical execution information for immediate takeover by a backup compute element. The compute nodes are configured to execute algorithms for mitigating dead time in the distributed computing environment.
One-sided reliable remote direct memory operations
Techniques are provided to allow more sophisticated operations to be performed remotely by machines that are not fully functional. Operations that can be performed reliably by a machine that has experienced a hardware and/or software error are referred to herein as Remote Direct Memory Operations or “RDMOs”. Unlike RDMAs, which typically involve trivially simple operations such as the retrieval of a single value from the memory of a remote machine, RDMOs may be arbitrarily complex. The techniques described herein can help applications run without interruption when there are software faults or glitches on a remote system with which they interact.
DISTRIBUTED WORKLOAD REASSIGNMENT FOLLOWING COMMUNICATION FAILURE
A generation identifier is employed with various systems and methods in order to identify situations where a workload has been reassigned to a new node and where a workload is still being processed by an old node during a failure between nodes. A master node may assign a workload to a worker node. The worker node sends a request to access target data. The request may be associated with a generation identifier and workload identifier that identifies the node and workload. At some point, a failure occurs between the master node and worker node. The master node reassigns the workload to another worker node. The new worker node accesses the target data with a different generation identifier, indicating to the storage system that the workload has been reassigned. The old worker node receives an indication from the storage system that the workload has been reassigned and stops processing the workload.
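The fencing behavior described above can be sketched as follows. This is an illustrative sketch, not the patented implementation: the class name `StorageSystem`, the method `access`, and the assumption that generation identifiers are integers that increase with each reassignment are all hypothetical. The abstract only says the new worker uses a "different" generation identifier.

```python
class StorageSystem:
    """Tracks the newest generation seen per workload (hypothetical sketch)."""

    def __init__(self):
        # workload_id -> highest generation identifier observed so far
        self._latest = {}

    def access(self, workload_id, generation):
        """Return True if access is granted, False if the workload was reassigned."""
        latest = self._latest.get(workload_id, generation)
        if generation < latest:
            # A newer worker has already presented a higher generation, so the
            # caller is the old worker and should stop processing the workload.
            return False
        self._latest[workload_id] = generation
        return True
```

A short usage example: the old worker accesses the data with generation 1, the new worker takes over with generation 2, and the old worker's next request is refused.

```python
storage = StorageSystem()
storage.access("workload-7", 1)   # old worker: granted
storage.access("workload-7", 2)   # new worker after reassignment: granted
storage.access("workload-7", 1)   # old worker again: refused
```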
High availability for persistent memory
Techniques for implementing high availability for persistent memory are provided. In one embodiment, a first computer system can detect an alternating current (AC) power loss/cycle event and, in response to the event, can save data in a persistent memory of the first computer system to a memory or storage device that is remote from the first computer system and is accessible by a second computer system. The first computer system can then generate a signal for the second computer system subsequent to initiating or completing the save process, thereby allowing the second computer system to restore the saved data from the memory or storage device into its own persistent memory.
VIRTUALIZED FILE SERVER
In one embodiment, a system for managing communication connections in a virtualization environment includes a plurality of host machines implementing a virtualization environment, wherein each of the host machines includes a hypervisor, at least one user virtual machine (user VM), and a distributed file server that includes file server virtual machines (FSVMs) and associated local storage devices. Each FSVM and associated local storage device are local to a corresponding one of the host machines, and the FSVMs conduct I/O transactions with their associated local storage devices based on I/O requests received from the user VMs. Each of the user VMs on each host machine sends each of its representative I/O requests to an FSVM that is selected by one or more of the FSVMs for each I/O request based on a lookup table that maps a storage item referenced by the I/O request to the selected one of the FSVMs.
SYSTEMS AND METHODS FOR HIERARCHICAL FAILOVER GROUPS
A logical grouping of subgroups of server clusters forms a failover super-cluster. A logical grouping of groups of servers provides high availability by, upon failure of an entire group (site), failing over an entire subgroup to a different subgroup. Yet within each subgroup local failovers continue to maintain application high availability during instances in which the site remains operational.
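The two-level decision described above (local failover while the site is up, site-level failover when the whole group fails) can be sketched for a single workload. This is a simplified illustration under stated assumptions, not the method of the patent: the function name `failover_target`, the `sites` dictionary layout, and the alphabetical choice of target are hypothetical, and the sketch places one workload rather than moving an entire subgroup.

```python
def failover_target(sites, site_name, failed_server):
    """Choose where a workload from failed_server should run (hypothetical sketch).

    sites: dict mapping site name -> dict mapping server name -> healthy (bool)
    Returns ("local", site, server), ("site", site, server), or None.
    """
    # Level 1: the site is still operational, so fail over locally.
    local = sites[site_name]
    healthy_local = sorted(
        name for name, ok in local.items() if ok and name != failed_server
    )
    if healthy_local:
        return ("local", site_name, healthy_local[0])

    # Level 2: the entire site has failed, so fail over to a different site.
    for other in sorted(sites):
        if other == site_name:
            continue
        healthy_remote = sorted(name for name, ok in sites[other].items() if ok)
        if healthy_remote:
            return ("site", other, healthy_remote[0])

    # No healthy server anywhere in the super-cluster.
    return None
```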
CPU hot-swapping
There is disclosed in one example a multi-core computing system configured to provide a hot-swappable CPU0, including: a first CPU in a first CPU socket and a second CPU in a second CPU socket; a switch including a first media interface to the first CPU socket and a second media interface to the second CPU socket; and one or more mediums including non-transitory instructions to detect a hot swap event of the first CPU, designate the second CPU as CPU0, determine that a new CPU has replaced the first CPU, operate the switch to communicatively couple the new CPU to a backup initialization code store via the first media interface, initialize the new CPU, and designate the new CPU as CPUN, wherein N≠0.