G06F11/2028

METHOD AND APPARATUS FOR RECONFIGURING AN AUTONOMOUS VEHICLE IN THE EVENT OF A FAULT

Technologies and techniques for reconfiguring an autonomous vehicle in the event of a fault, wherein application entities are executed in a distributed manner across a plurality of computing nodes in accordance with a predefined configuration. A fault in an application entity, an operating system, and/or a piece of hardware is detected via at least one monitoring device. The detected fault is isolated via a switching device by switching to application entities that are redundant with respect to the affected application entities. Predefined redundancy conditions and/or segregation conditions for the application entities are restored by a reconfiguration of the configuration, carried out by an application placement device, such that the number of application entities switched to other computing nodes to establish the predefined redundancy conditions and/or segregation conditions is minimized.
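
As a rough illustration only (the entity names, node names, and replica model below are hypothetical and not taken from the abstract), a minimal-move reconfiguration could be sketched as: keep every replica that still sits on a healthy node and add only the missing replicas, so that as few application entities as possible are switched to other computing nodes.

```python
# Illustrative sketch: restore redundancy with as few entity moves as possible.
# Model: each entity must have `replicas` copies, each on a distinct healthy node.

from collections import defaultdict

def reconfigure(placement, healthy_nodes, replicas=2):
    """placement: dict entity -> set of nodes currently hosting it.
    Returns a new placement satisfying the replica count on healthy nodes,
    adding as few new entity instances as possible."""
    new_placement = {}
    load = defaultdict(int)                      # instances per node, for balance
    for entity, nodes in placement.items():
        kept = {n for n in nodes if n in healthy_nodes}   # keep what already complies
        for n in kept:
            load[n] += 1
        while len(kept) < replicas:              # add only the missing replicas
            candidates = [n for n in healthy_nodes if n not in kept]
            target = min(candidates, key=lambda n: load[n])
            kept.add(target)
            load[target] += 1
        new_placement[entity] = kept
    return new_placement

if __name__ == "__main__":
    current = {"planner": {"node1", "node2"}, "perception": {"node2", "node3"}}
    print(reconfigure(current, healthy_nodes={"node1", "node2", "node4"}))
```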

METHOD FOR USING BMC AS PROXY NVMEOF DISCOVERY CONTROLLER TO PROVIDE NVM SUBSYSTEMS TO HOST
20220365683 · 2022-11-17

A management device that may communicate with at least one device is disclosed. The management device may include communication logic to communicate with the devices over communication channels regarding data associated with the devices. The management device may also include reception logic that may receive a query from a host. The query may request information from the management device about the devices. The management device may also include transmission logic to send the data about the devices to the host. The host may be configured to send a message to the devices.
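
A loose sketch of the proxy idea, assuming the management device caches device data gathered over a side channel and answers host queries from that cache; the ManagementDevice/Device classes and the "list_subsystems" query are illustrative stand-ins, not the patent's actual interfaces.

```python
# Hypothetical sketch of a management device acting as a discovery proxy:
# it gathers data about attached devices over a side channel and serves
# that data to a host that queries it.

class Device:
    def __init__(self, device_id, capacity_gb):
        self.device_id = device_id
        self.capacity_gb = capacity_gb

    def describe(self):
        return {"id": self.device_id, "capacity_gb": self.capacity_gb}


class ManagementDevice:
    def __init__(self, devices):
        self.devices = devices          # handles to the managed devices
        self.inventory = {}             # cached data about each device

    def poll_devices(self):
        # "communication logic": talk to each device over its channel
        for dev in self.devices:
            self.inventory[dev.device_id] = dev.describe()

    def handle_query(self, query):
        # "reception logic" + "transmission logic": answer a host query
        if query == "list_subsystems":
            return list(self.inventory.values())
        raise ValueError(f"unsupported query: {query}")


bmc = ManagementDevice([Device("nvme0", 960), Device("nvme1", 1920)])
bmc.poll_devices()
print(bmc.handle_query("list_subsystems"))   # data the host would receive
```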

Micro-level network node failover system
11588927 · 2023-02-21

Described herein is an improved core network that can monitor micro-level issues, identify specific services of specific nodes that may be causing an outage, and perform targeted node failovers in a manner that does not cause unnecessary disruptions in service. For example, the improved core network can include a failover and isolation server (FIS) system. The FIS system can obtain service-specific KPIs from the various nodes in the core network. The FIS system can then compare the obtained KPI values of each service with the corresponding threshold values. If a KPI value exceeds its corresponding threshold value, the FIS system may preliminarily determine that the service of the node associated with the KPI value is responsible for a service outage. The FIS system can then initiate a failover operation, which causes the node to re-route any received requests corresponding to the service potentially responsible for the service outage to a redundant node.
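
A minimal sketch of the KPI check described above, assuming a simple routing table keyed by (node, service); the threshold names and the maybe_failover helper are hypothetical, not taken from the patent.

```python
# Illustrative sketch of the KPI check: if a service-specific KPI on a node
# exceeds its threshold, flag that (node, service) and trigger a failover
# that re-routes the service's requests to a redundant node.

KPI_THRESHOLDS = {"registration_failure_rate": 0.05, "latency_ms": 250}

def check_kpis(kpis, thresholds=KPI_THRESHOLDS):
    """Return the list of KPIs that breached their threshold."""
    return [name for name, value in kpis.items()
            if name in thresholds and value > thresholds[name]]

def maybe_failover(node_id, service, kpis, redundant_node, router):
    breached = check_kpis(kpis)
    if breached:
        # preliminary determination: this service on this node may be the cause
        router[(node_id, service)] = redundant_node   # re-route new requests
        return {"failover": True, "breached_kpis": breached, "target": redundant_node}
    return {"failover": False}

routing_table = {}
print(maybe_failover("mme-3", "registration",
                     {"registration_failure_rate": 0.12, "latency_ms": 80},
                     redundant_node="mme-7", router=routing_table))
```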

High reliability fault tolerant computer architecture

A fault tolerant computer system and method are disclosed. The system may include: a plurality of CPU nodes, each including a processor and a memory; at least two IO domains, wherein at least one of the IO domains is designated an active IO domain performing communication functions for the active CPU nodes; and a switching fabric connecting each CPU node to each IO domain. One CPU node is designated a standby CPU node and the remainder are designated as active CPU nodes. If a failure, a beginning of a failure, or a predicted failure occurs in an active CPU node, the state and memory of the active CPU node are transferred to the standby CPU node, which becomes the new active CPU node. If a failure occurs in an active IO domain, the communication functions performed by the failing active IO domain are transferred to the other IO domain.
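
The active-to-standby handover might be pictured roughly as below, under the simplifying assumption that a node's state and memory can be represented as a plain snapshot; the CpuNode class and fail_over function are illustrative only.

```python
# Sketch of the active-to-standby handover: on a detected or predicted failure,
# copy the active CPU node's state and memory image to the standby node and
# promote the standby to be the new active node.

import copy

class CpuNode:
    def __init__(self, name):
        self.name = name
        self.role = "active"
        self.state = {}
        self.memory = bytearray()

def fail_over(active, standby):
    """Transfer state/memory from a failing active node to the standby node."""
    standby.state = copy.deepcopy(active.state)
    standby.memory = bytearray(active.memory)
    standby.role, active.role = "active", "failed"
    return standby          # the new active CPU node

node_a, node_s = CpuNode("cpu0"), CpuNode("cpu-standby")
node_a.state = {"pc": 0x4000, "jobs": ["db", "web"]}
new_active = fail_over(node_a, node_s)
print(new_active.name, new_active.role, new_active.state)
```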

Pod migration across nodes of a cluster

Example techniques for pod migration across nodes of a cluster are described. In an example, in response to receiving a request to migrate a pod from a first region of a cloud computing platform to a second region of the cloud computing platform, the pod may be migrated from a first node in the first region to a second node in the second region. The first node and the second node may each be a part of a cluster of nodes.
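A toy sketch of the migration step, assuming a cluster modeled as a dictionary of nodes (each tagged with a region) and pods; the migrate_pod function is a hypothetical stand-in for an orchestrator's scheduling machinery.

```python
# Minimal sketch of cross-region pod migration: given a migration request,
# pick a node in the target region and move the pod there.

def migrate_pod(cluster, pod_name, target_region):
    pod = cluster["pods"][pod_name]
    source_node = pod["node"]
    # choose any node of the same cluster located in the requested region
    candidates = [n for n, meta in cluster["nodes"].items()
                  if meta["region"] == target_region and n != source_node]
    if not candidates:
        raise RuntimeError(f"no node available in region {target_region}")
    target_node = candidates[0]
    pod["node"] = target_node          # reschedule the pod onto the new node
    return source_node, target_node

cluster = {
    "nodes": {"node-a": {"region": "us-east"}, "node-b": {"region": "eu-west"}},
    "pods": {"checkout-pod": {"node": "node-a"}},
}
print(migrate_pod(cluster, "checkout-pod", "eu-west"))
```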

Methods, devices and systems for writer pre-selection in distributed data systems

A computer-implemented method may comprise receiving proposals to mutate data stored in a distributed and replicated file system coupled to a network, the distributed and replicated file system comprising a plurality of nodes, each comprising a server. A metadata service maintains and updates a replica of a namespace of the distributed and replicated file system and coordinates updates to the data by generating an ordered set of agreements corresponding to the received proposals, the ordered set of agreements specifying an order in which the nodes are to mutate data stored in data nodes and cause corresponding changes to the state of the namespace. For each agreement in the generated ordered set of agreements, a corresponding writers list may be provided that comprises an ordered list of nodes to execute the agreement and make corresponding changes to the namespace. The ordered set of agreements may then be sent to the plurality of nodes along with, for each agreement in the ordered set of agreements, the corresponding writers list or a pre-generated index thereto, and each of the plurality of nodes may be configured to only execute agreements for which it is the first-listed node on the received writers list.
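
The first-listed-writer rule could be sketched as follows, assuming every node receives the same ordered agreements and per-agreement writers lists; the data shapes are illustrative.

```python
# Sketch of the writer pre-selection rule: every node receives the same ordered
# set of agreements plus a writers list per agreement, but a node executes an
# agreement only if it is the first-listed writer for that agreement.

def execute_agreements(node_id, agreements):
    """agreements: ordered list of (agreement, writers_list) pairs."""
    applied = []
    for agreement, writers in agreements:
        if writers and writers[0] == node_id:      # only the first-listed writer acts
            applied.append(agreement)              # mutate data / update namespace here
    return applied

ordered_agreements = [
    ({"op": "create", "path": "/a"}, ["node2", "node1", "node3"]),
    ({"op": "append", "path": "/b"}, ["node1", "node3", "node2"]),
]
for node in ("node1", "node2", "node3"):
    print(node, "executes", execute_agreements(node, ordered_agreements))
```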

Data ingestion replication and disaster recovery

Described herein are techniques for improving disaster recovery, in particular disaster recovery pertaining to data transfer requests. The data transfer request can be received by each of multiple deployments; however, only a primary deployment can process the request. The data transferred by the primary deployment may be replicated in the secondary deployments. In response to a failover event, one of the secondary deployments can be designated as the new primary deployment and continue the data transfer based on the data transfer request and the replication information received from the old primary deployment prior to the failover.
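
A schematic sketch of the failover hand-off, under the simplifying assumption that replication progress can be summarized as a single counter; the Deployment class and failover helper are hypothetical.

```python
# Sketch of the failover hand-off: all deployments receive the transfer request,
# only the primary processes it and shares replication info with the secondaries;
# on failover a secondary is promoted and resumes from the replicated progress.

class Deployment:
    def __init__(self, name, primary=False):
        self.name, self.primary = name, primary
        self.request = None
        self.progress = 0                 # e.g. last ingested file index

    def receive_request(self, request):
        self.request = request            # every deployment stores the request

    def replicate_from(self, primary):
        self.progress = primary.progress  # copy replication info

def failover(old_primary, secondary):
    old_primary.primary = False
    secondary.primary = True
    return secondary                      # continues the transfer from `progress`

p, s = Deployment("us-east", primary=True), Deployment("us-west")
for d in (p, s):
    d.receive_request({"files": 100})
p.progress = 42                           # primary's progress before the failure
s.replicate_from(p)
new_primary = failover(p, s)
print(new_primary.name, "resumes at file", new_primary.progress)
```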

Live recovery of virtual machines in a public cloud computing environment

Live recovery generates a new “recovery VM” that operates as an ongoing “live” production platform. A previously created non-cloud-native backup copy is the data source for the recovery VM. Live recovery restores data blocks from the backup copy on backup media directly to cloud-based virtual disk(s) assigned to the recovery VM. As a result, the cloud-based recovery VM can become fully operational in the cloud computing environment on a going-forward basis. The advantage of live recovery over a traditional restore is that live recovery provides a cloud-based VM that begins operating well before the backup copy is fully restored. This is accomplished by temporarily mounting a “temp-mounted VM” in the cloud while the backup copy is methodically restored in the background. VM reads and writes begin issuing from the temp-mounted VM and writes are retained on completion. Downtime is minimized when switching from the temp-mounted VM to the recovery VM.
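
One way to picture the temp-mounted-VM I/O path, under the simplifying assumption that the disk is a plain block map: reads fall back to the backup copy until a block has been restored, writes are retained on the new disk, and the background restore never overwrites a newer write. The class below is illustrative, not the product's implementation.

```python
# Sketch of the "temp-mounted VM" idea: serve reads from already-restored blocks
# (or fall back to the backup copy), retain writes, while a background pass
# copies the remaining blocks from backup media to the cloud virtual disk.

class LiveRecoveryDisk:
    def __init__(self, backup_blocks):
        self.backup = backup_blocks           # block index -> bytes (backup copy)
        self.disk = {}                        # cloud virtual disk being populated

    def read(self, block):
        # reads work immediately: use restored data if present, else the backup
        return self.disk.get(block, self.backup.get(block))

    def write(self, block, data):
        self.disk[block] = data               # writes are retained on the new disk

    def restore_in_background(self):
        for block, data in self.backup.items():
            self.disk.setdefault(block, data) # don't clobber newer writes

vm_disk = LiveRecoveryDisk({0: b"boot", 1: b"data"})
vm_disk.write(1, b"updated")                  # VM is already live and writing
vm_disk.restore_in_background()
print(vm_disk.read(0), vm_disk.read(1))       # b'boot' b'updated'
```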

Interactive Graphical User Interface for Monitoring Computer Models
20220357821 · 2022-11-10 ·

A computing system establishes a hierarchy for monitoring model(s). The hierarchy comprises an association between each of multiple measures of a measure level of the hierarchy and intermediate level(s) of the hierarchy. An intermediate level comprises one or more of a measurement category or analysis type. The hierarchy comprises an association between the intermediate level(s) and at least one model. The system monitors the model(s) by generating health measurements. Each of the health measurements corresponds to one of the multiple measures. Each of the health measurements indicates a performance of a monitored model according to a measurement category or analysis type associated in the hierarchy with the respective measure of the multiple measures. The system generates a visualization in a graphical user interface. The visualization comprises a graphical representation of an indication of a health measurement for each of the measure(s), and of the associations in the hierarchy.
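
A compact sketch of such a hierarchy and the per-measure health measurements, with made-up measure names and categories; the structure is illustrative, not the patent's data model.

```python
# Sketch of the monitoring hierarchy: measures roll up into intermediate levels
# (measurement category or analysis type), which are associated with a model.
# Health measurements are generated per measure and grouped for visualization.

hierarchy = {
    "model": "credit_risk_model",
    "intermediate_levels": {
        "data_drift": ["psi", "ks_statistic"],   # measurement category -> measures
        "performance": ["auc", "accuracy"],      # analysis type -> measures
    },
}

def health_measurements(metric_values, hierarchy):
    """Produce one health measurement per measure, grouped by intermediate level."""
    report = {}
    for level, measures in hierarchy["intermediate_levels"].items():
        report[level] = {m: metric_values.get(m) for m in measures}
    return report

print(health_measurements({"psi": 0.08, "ks_statistic": 0.12,
                           "auc": 0.91, "accuracy": 0.88}, hierarchy))
```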

PARALLEL PROCESSING SYSTEM RUNTIME STATE RELOAD
20230102197 · 2023-03-30

A parallel processing system includes at least three parallel processors, state monitoring circuitry, and state reload circuitry. The state monitoring circuitry couples to the at least three parallel processors and is configured to monitor runtime states of the at least three parallel processors and identify a first processor of the at least three parallel processors having at least one runtime state error. The state reload circuitry couples to the at least three parallel processors and is configured to select a second processor of the at least three parallel processors for state reload, access a runtime state of the second processor, and load the runtime state of the second processor into the first processor. Monitoring and reload may be performed only on sub-systems of the at least three parallel processors. During reload, clocks and supply voltages of the processors may be altered. The state reload may relate to sub-systems.
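
A simplified sketch of the monitor-then-reload flow, assuming processor runtime state can be modeled as a dictionary of sub-system states; the structures and helper names are illustrative, not the hardware's actual mechanism.

```python
# Sketch of runtime-state reload across parallel processors: detect a processor
# whose runtime state has an error, pick a healthy peer, and copy that peer's
# runtime state (or one sub-system's state) into the faulty processor.

import copy

def find_faulty(processors):
    # "state monitoring": identify the first processor reporting a state error
    return next((p for p in processors if p["state_error"]), None)

def reload_state(processors, subsystem=None):
    faulty = find_faulty(processors)
    if faulty is None:
        return None
    donor = next(p for p in processors if not p["state_error"])   # healthy peer
    if subsystem:                                  # reload only one sub-system
        faulty["state"][subsystem] = copy.deepcopy(donor["state"][subsystem])
    else:
        faulty["state"] = copy.deepcopy(donor["state"])
    faulty["state_error"] = False
    return faulty["id"], donor["id"]

procs = [
    {"id": 0, "state_error": False, "state": {"dsp": [1, 2], "ctrl": [7]}},
    {"id": 1, "state_error": True,  "state": {"dsp": [9, 9], "ctrl": [0]}},
    {"id": 2, "state_error": False, "state": {"dsp": [1, 2], "ctrl": [7]}},
]
print(reload_state(procs, subsystem="dsp"))   # (1, 0): processor 1 reloaded from 0
```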