G06F11/2023

Failover of virtual devices in a scalable input/output (I/O) virtualization (S-IOV) architecture

Examples include a method of performing failover of in an I/O architecture by allocating a first set of resources, associated with a first port of a physical device, to a virtual device, allocating a second set of resources, associated with a second port of the physical device, to the virtual device, assigning the virtual device to a virtual machine, activating the first set of resources, and transferring data between the virtual machine and the first port using the virtual device and the first set of resources. The method further includes detecting an error in the first set of resources, deactivating the first set of resources and activating the second set of resources, and transferring data between the virtual machine and the second port using the virtual device and the second set of resources.

Failure shield

An example graphics system can include a first portion including a graphics driver and graphics hardware and a second portion communicatively coupled to the first portion. The second portion can include a display system communicatively coupled to a GUI application and a shim layer to shield the second portion from failure responsive to failure of the first portion.

System recovery using a failover processor

Techniques for system recovery using a failover processor are disclosed. A first processor, with a first instruction set, is configured to execute operations of a first type; and a second processor, with a second instruction set different from the first instruction set, is configured to execute operations of a second type. A determination is made that the second processor has failed to execute at least one operation of the second type within a particular period of time. Responsive to determining that the second processor has failed to execute at least one operation of the second type within the particular period of time, the first processor is configured to execute both the operations of the first type and the operations of the second type.

Self-descriptive orchestratable modules in software-defined industrial systems

Various systems and methods are provided for implementing a software defined industrial system. In an example, self-descriptive control applications and software modules are provided in the context of orchestratable distributed systems. The self-descriptive control applications may be executed by an orchestrator or like control device, configured to: identify available software modules adapted to perform functional operations in a control system environment; identify operational characteristics that identify characteristics of execution of the available software modules that are available to implement a control system application; select a software module for execution based on the operational configuration and the operational characteristics identified in the manifest; and cause the execution of the selected software module in the control system environment based on an application specification for the control system application.

Method of using a single controller (ECU) for a fault-tolerant/fail-operational self-driving system

In a self-driving autonomous vehicle, a controller architecture includes multiple processors within the same box. Each processor monitors the others and takes appropriate safe action when needed. Some processors may run dormant or low priority redundant functions that become active when another processor is detected to have failed. The processors are independently powered and independently execute redundant algorithms from sensor data processing to actuation commands using different hardware capabilities (GPUs, processing cores, different input signals, etc.). Intentional hardware and software diversity improves fault tolerance. The resulting fault-tolerant/fail-operational system meets ISO26262 ASIL-D specifications based on a single electronic controller unit platform that can be used for self-driving vehicles.

Cluster recovery manager to remediate failovers

Example implementations relate to management of clusters. A cluster recovery manager may comprise a processing resource; and a memory resource storing machine-readable instructions to cause the processing resource to: adjust, based on a monitored degree of performance of a controller of a controller cluster, a state of the controller to one of a first state and a second state; and reassign a corresponding portion of a plurality of APs managed by the controller periodically to a different controller until the state of the controller is determined to be adjustable to the first state. The reassignment can be triggered responsive to a state adjustment of the controller from the first state to the second state.

Distributed application orchestration management in a heterogeneous distributed computing environment

Distributed application orchestration management is provided. A first passive member of a set of passive members sends a notification message to other members indicating that the first passive member is initiating start of a distributed application in response to the first passive member validating that a self-restart by a leader member failed. The first passive member compares timestamps associated with an attempt to start the distributed application by other passive members in the set of passive members. The first passive member stops a particular attempt to start the distributed application in response to the first passive member determining that a timestamp associated with the particular attempt to start the distributed application by the first passive member is newer than another timestamp of another passive member. The first passive member designates the other passive member having an older timestamp as a new leader member to continue starting the distributed application.

Event-driven system failover and failback
11636013 · 2023-04-25 · ·

A system determines that a primary event processor, included in a primary data center, is associated with a failure. The primary event processor is included in the primary data center and configured to process first events stored in a main event store of the primary data center. The system identifies a secondary event processor, in a secondary data center, that is to process one or more first events based on the failure. The primary event processor and the secondary event processor are configured to process a same type of event. The system causes, based on a configuration associated with the primary or secondary event processor, the one or more first events to be retrieved from one of the main event store or a replica event store. The replica event store is included in the secondary data center and mirrors the main event store of the primary data center.

Utilizing machine learning to streamline telemetry processing of storage media
11474986 · 2022-10-18 · ·

Data associated with storage media utilized by one or more storage systems is received. The data is provided as an input to a machine learning model executed by a processing device. The machine learning model identifies one or more deterministic characteristics from the data. The one or more deterministic characteristics associated with the storage media are received from the machine learning model. A data structure comprising the one or more deterministic characteristics is generated for use in a telemetry process to qualify types of storage media.

SYSTEMS AND METHODS FOR PERFORMING A TECHNICAL RECOVERY IN A CLOUD ENVIRONMENT

A computer-implemented method for testing failover may include: determining one or more cross-regional dependencies and traffic flow of an application in a first region of a cloud environment, wherein the one or more cross-regional dependencies include a dependency of the application in the first region of the cloud environment to one or more applications in at least one other region of the cloud environment; determining a risk score associated with performing failover of the application to a second region of the cloud environment at least based on the determined one or more cross-regional dependencies and traffic flow of the application; comparing the determined risk score with a predetermined risk score; in response to determining that the determined risk score is lower than the predetermined risk score, performing failover of the application to the second region of the cloud environment; isolating the second region of the cloud environment from the first region of the cloud environment for a predetermined period of time; and monitoring operation of the application in the second region of the cloud environment during the predetermined period of time.