G06F11/1438

Methods and systems for restarting one or more components of a network device based on conditions

The present invention discloses a method, carried out at a network device, for restarting one or more components of the network device. The method comprises monitoring whether at least one condition has been satisfied at any component of the network device. If at least one condition is satisfied, one or more components of the network device are restarted based, at least in part, on the at least one condition. When the network device has been restarted, a management server is informed that one or more components of the network device have been restarted. According to the present invention, it is determined whether a configuration for the network device has been received from a user or administrator at the management server. When a configuration has been received, the configuration is retrieved from the management server, and the network device is then configured with the retrieved configuration.
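The monitor, restart, report, and reconfigure sequence described above can be sketched roughly as follows. This is an illustrative sketch only, not the claimed implementation: the names `ManagementServer`, `NetworkDevice`, `report_restart`, `fetch_config`, and `check_and_restart` are hypothetical, and the restart itself is reduced to a placeholder.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class ManagementServer:
    """Hypothetical stand-in for the management server."""
    pending_config: Optional[dict] = None
    restart_reports: List[List[str]] = field(default_factory=list)

    def report_restart(self, components: List[str]) -> None:
        # Record that the device restarted these components.
        self.restart_reports.append(components)

    def fetch_config(self) -> Optional[dict]:
        # Return the configuration entered by a user or administrator, if any.
        return self.pending_config


@dataclass
class NetworkDevice:
    # Component name -> predicate that is True when its restart condition is met.
    conditions: Dict[str, Callable[[], bool]]
    config: dict = field(default_factory=dict)
    restarted: List[str] = field(default_factory=list)

    def check_and_restart(self, server: ManagementServer) -> List[str]:
        to_restart = [name for name, met in self.conditions.items() if met()]
        if not to_restart:
            return []
        self.restarted.extend(to_restart)      # placeholder for the real restart
        server.report_restart(to_restart)      # inform the management server
        new_config = server.fetch_config()     # has a configuration been received?
        if new_config is not None:
            self.config = new_config           # configure the device with it
        return to_restart
```

A device built with `{"modem": lambda: True, "wifi": lambda: False}` would restart only the modem, report that restart, and adopt whatever configuration the server holds.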

OBJECT DATA BACKUP AND RECOVERY IN CLUSTERS MANAGING CONTAINERIZED APPLICATIONS

An object data backup and restore method and system include receiving a request to restore a target object to a first point-in-time version; identifying a first snapshot of the cluster corresponding to the first point-in-time version; generating a second snapshot of the cluster upon receiving the request to restore the target object; determining data changes associated with the target object based on the first snapshot and the second snapshot; scanning all objects associated with the cluster to determine one or more additional objects that are affected by restoring the target object due to object dependencies defined by a cluster configuration; generating a relationship graph for the one or more additional objects and the target object; and restoring the target object based on the data changes, the first snapshot, and the relationship graph.
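As a rough illustration of the steps above, the sketch below compares two snapshots, scans a dependency map for affected objects, builds a small relationship graph, and restores the target. The data shapes (snapshots as dictionaries, dependencies as a name-to-names map) and all function names are assumptions made for this example; the abstract does not specify them.

```python
from typing import Dict, List, Set, Tuple

Snapshot = Dict[str, dict]            # object name -> object data (assumed shape)
Dependencies = Dict[str, List[str]]   # object name -> names it depends on


def affected_objects(target: str, deps: Dependencies) -> Set[str]:
    """Scan all objects for those depending on target, directly or transitively."""
    affected: Set[str] = set()
    frontier = [target]
    while frontier:
        current = frontier.pop()
        for obj, needs in deps.items():
            if current in needs and obj != target and obj not in affected:
                affected.add(obj)
                frontier.append(obj)
    return affected


def relationship_graph(target: str, deps: Dependencies) -> Dict[str, List[str]]:
    """Dependency edges restricted to the target and the affected objects."""
    nodes = affected_objects(target, deps) | {target}
    return {n: [d for d in deps.get(n, []) if d in nodes] for n in nodes}


def data_changes(target: str, first: Snapshot, second: Snapshot) -> Dict[str, Tuple]:
    """Fields of target that differ, as field -> (current value, value to restore)."""
    old, new = first.get(target, {}), second.get(target, {})
    return {k: (new.get(k), old.get(k))
            for k in old.keys() | new.keys() if old.get(k) != new.get(k)}


def restore(target: str, first: Snapshot, second: Snapshot, deps: Dependencies):
    """Return the restored cluster state and the graph of objects to re-check."""
    graph = relationship_graph(target, deps)
    restored = {name: dict(data) for name, data in second.items()}
    if data_changes(target, first, second):
        restored[target] = dict(first[target])  # roll target back to the first snapshot
    return restored, graph
```

The returned graph lists which dependants of the target may need attention after the rollback; how they are handled is left open here.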

Procedure for managing a failure in a network of nodes based on a local strategy

Disclosed is a failure management method in a network of nodes, including, for each considered node: first, a step of locally saving the state of the considered node to a storage medium of that node. Then, if the considered node has failed, a step of retrieving the local backup of the state of the considered node by redirecting the link between the considered node and its storage medium so as to connect the storage medium to an operational node other than the considered node, the operational node already being in the process of carrying out the calculation. The local backups of the considered nodes used for the retrieving steps are coherent with each other so as to correspond to the same state of the calculation. If a considered node has failed, the local backup for that node is returned to a new additional node added to the network at the time of the failure.
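The local-save and link-redirection idea can be modelled very loosely as below. This is a toy sketch under stated assumptions: the class `NodeNetwork`, its methods, and the use of an integer checkpoint id to express "coherent" backups are all hypothetical, and the physical link redirection is represented only by updating a dictionary.

```python
from typing import Any, Dict, Iterable, Tuple


class NodeNetwork:
    def __init__(self, nodes: Iterable[str]):
        # Each storage medium (keyed by its owner) starts attached to that owner.
        self.attached_to: Dict[str, str] = {n: n for n in nodes}
        # Storage medium (keyed by owner) -> (checkpoint id, saved state).
        self.storage: Dict[str, Tuple[int, Any]] = {}

    def save_local(self, node: str, checkpoint: int, state: Any) -> None:
        """Save the node's state to its own storage medium."""
        self.storage[node] = (checkpoint, state)

    def backups_coherent(self) -> bool:
        """True when all local backups belong to the same checkpoint."""
        return len({cp for cp, _ in self.storage.values()}) <= 1

    def recover(self, failed: str, operational: str) -> Tuple[int, Any]:
        """Redirect the failed node's storage link to an operational node and read the backup."""
        if failed == operational:
            raise ValueError("the operational node must differ from the failed node")
        self.attached_to[failed] = operational
        return self.storage[failed]
```

The backup returned by `recover` could then be handed to a replacement node; that hand-off is not modelled here.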

Failure shield

An example graphics system can include a first portion including a graphics driver and graphics hardware and a second portion communicatively coupled to the first portion. The second portion can include a display system communicatively coupled to a GUI application and a shim layer to shield the second portion from failure responsive to failure of the first portion.
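One way to picture a shim layer of this kind is a wrapper that absorbs failures of the driver side and keeps serving the display side. The sketch below is an interpretation, not the described system: `GraphicsShim`, `driver_render`, and the "reuse the last good frame" policy are assumptions chosen for illustration.

```python
from typing import Callable, Optional


class GraphicsShim:
    """Sits between the display side and a driver call that may fail."""

    def __init__(self, driver_render: Callable[[str], str]):
        self._driver_render = driver_render
        self.last_good_frame: Optional[str] = None
        self.degraded = False

    def render(self, scene: str) -> Optional[str]:
        try:
            self.last_good_frame = self._driver_render(scene)
            self.degraded = False
        except Exception:
            # Driver-side failure: keep the display side alive with the last good frame.
            self.degraded = True
        return self.last_good_frame
```

The display side calls `render` and never sees the driver exception; it can inspect `degraded` to decide whether to show a notice.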

Cluster recovery manager to remediate failovers

Example implementations relate to management of clusters. A cluster recovery manager may comprise a processing resource; and a memory resource storing machine-readable instructions to cause the processing resource to: adjust, based on a monitored degree of performance of a controller of a controller cluster, a state of the controller to one of a first state and a second state; and reassign a corresponding portion of a plurality of APs managed by the controller periodically to a different controller until the state of the controller is determined to be adjustable to the first state. The reassignment can be triggered responsive to a state adjustment of the controller from the first state to the second state.
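The periodic reassignment loop might look something like the following. This is a simplified sketch: the `measure` callback, the numeric threshold, the batch size, and the state labels `"first"` and `"second"` as strings are all assumptions, since the abstract does not say how performance is measured or how large each reassigned portion is.

```python
from typing import Callable, List, Tuple


def remediate(aps: List[str],
              measure: Callable[[int], float],
              threshold: float,
              batch: int = 1,
              max_rounds: int = 10) -> Tuple[str, List[str], List[str]]:
    """Move APs away from a struggling controller, one batch per period.

    measure(n) is a hypothetical performance score for the controller
    while it manages n APs; a score at or above threshold means the
    controller can be put back in the first state.
    """
    kept, moved = list(aps), []
    for _ in range(max_rounds):
        if measure(len(kept)) >= threshold:
            return "first", kept, moved
        if not kept:
            break
        moved.extend(kept[:batch])   # reassign this portion to a different controller
        kept = kept[batch:]
    state = "first" if measure(len(kept)) >= threshold else "second"
    return state, kept, moved
```

With six APs, a score of `100 - 10 * n`, and a threshold of 60, the loop moves two APs and then reports the first state.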

Distributed application orchestration management in a heterogeneous distributed computing environment

Distributed application orchestration management is provided. A first passive member of a set of passive members sends a notification message to other members indicating that the first passive member is initiating start of a distributed application in response to the first passive member validating that a self-restart by a leader member failed. The first passive member compares timestamps associated with an attempt to start the distributed application by other passive members in the set of passive members. The first passive member stops a particular attempt to start the distributed application in response to the first passive member determining that a timestamp associated with the particular attempt to start the distributed application by the first passive member is newer than another timestamp of another passive member. The first passive member designates the other passive member having an older timestamp as a new leader member to continue starting the distributed application.
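The timestamp comparison among passive members can be sketched as below. The `StartAttempt` type and the `resolve` function are hypothetical names, and breaking exact timestamp ties by member name is an assumption added so the example is deterministic; the abstract does not address ties.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class StartAttempt:
    member: str
    timestamp: float   # when this member began trying to start the application


def resolve(own: StartAttempt, others: List[StartAttempt]) -> Tuple[bool, str]:
    """Return (keep_own_attempt, new_leader).

    The member with the oldest attempt continues and becomes the leader;
    a member whose attempt is newer stops its own attempt.
    """
    oldest = min([own, *others], key=lambda a: (a.timestamp, a.member))
    return oldest.member == own.member, oldest.member
```

Each passive member can run `resolve` locally on the notifications it has received; members that see the same set of attempts reach the same leader.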

Apparatus and method for detecting and correcting problems, failures, and anomalies in managed computer systems such as kiosks and informational displays

In a kiosk or informational display, an apparatus for detecting and remediating problems, failures, and anomalies includes: a data collection agent configured to collect original data over time associated with components, operation, and configuration of the managed computer system; a monitoring and learning module configured to process the original data and generate a historic record that includes time-based data, such as one or more time-based lists; and an alert detection system that includes a sensor having associated therewith one of the time-based lists. The sensor is activated when its sensor condition(s) are met, which includes evaluating the sensor condition(s) using at least the time-based list and a current-time value of the components, operation, and configuration of the managed computer system. The apparatus also includes a remediation action module configured to effect at least one of a plurality of predetermined actions when the sensor is activated.
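A sensor that evaluates a current value against a time-based list could, for instance, use a simple deviation test. The sketch below is one possible reading only: the rule "more than k standard deviations from the historic mean" and the names `sensor_activated` and `remediate` are assumptions, not something the abstract specifies.

```python
from statistics import mean, stdev
from typing import Callable, List, Optional


def sensor_activated(time_based_list: List[float], current: float, k: float = 3.0) -> bool:
    """True when the current value deviates strongly from the historic record."""
    if len(time_based_list) < 2:
        return False                      # not enough history to judge
    spread = stdev(time_based_list)
    return abs(current - mean(time_based_list)) > k * spread


def remediate(time_based_list: List[float],
              current: float,
              action: Callable[[], str]) -> Optional[str]:
    """Run a predetermined action only when the sensor is activated."""
    if sensor_activated(time_based_list, current):
        return action()
    return None
```

Here `action` stands for any predetermined remediation, such as restarting a kiosk application.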

Re-initiation of microservices utilizing context information provided via service calls

An apparatus comprises a processing device configured to identify, at a first microservice, a service call that is to be transmitted to a second microservice, and to modify the service call to include context information, the context information characterizing a current state of execution of one or more tasks by one of the first microservice and the second microservice. The processing device is further configured to provide, from the first microservice to the second microservice, the modified service call including the context information. The context information enables re-initiation of said one of the first microservice and the second microservice to continue execution of the one or more tasks from the current state.
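Carrying execution context inside a service call can be illustrated with a very small sketch. The call format (a dictionary with a `context` key), the `completed` counter, and the function names are assumptions made for this example; the abstract does not define the shape of the context information.

```python
from typing import Dict, List


def attach_context(call: Dict, context: Dict) -> Dict:
    """Return a copy of the service call with context information added."""
    modified = dict(call)
    modified["context"] = dict(context)
    return modified


def remaining_tasks(call: Dict, tasks: List[str]) -> List[str]:
    """Tasks still to run, judged from the context carried by the call.

    A re-initiated microservice can use this to continue from the
    recorded state instead of starting over.
    """
    completed = call.get("context", {}).get("completed", 0)
    return tasks[completed:]
```

If the first microservice records that two of four tasks have finished, a restarted instance that receives the modified call resumes with the third task.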

RECOVERY OF A SOFTWARE-DEFINED DATA CENTER

Examples described herein include systems and methods for backing up and recovering a software-defined data center (“SDDC”). In one example, entities of the SDDC, such as virtual machines, hosts, and clusters, can coexist with corresponding entity stores. The entity stores can store current state information for each SDDC entity. For example, an identifier or name of a virtual machine can be stored in that virtual machine's corresponding entity store. When recovery of a controller is needed, the controller can rebuild state information that has changed after the controller was backed up, by retrieving state information from entity stores of the various SDDC entities.
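The recovery step, in which the controller starts from its backup and fills in newer state from the entity stores, can be sketched as follows. The field names `updated_at` and `state`, the use of numeric timestamps, and the function name are illustrative assumptions.

```python
from typing import Any, Dict


def recover_controller(backup: Dict[str, Any],
                       backup_time: float,
                       entity_stores: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Rebuild controller state from a backup plus per-entity stores.

    Any entity whose store was updated after the backup was taken
    overrides (or adds to) what the backup recorded.
    """
    state = dict(backup)
    for entity_id, store in entity_stores.items():
        if store["updated_at"] > backup_time:
            state[entity_id] = store["state"]
    return state
```

Entities created after the backup appear in the recovered state because their stores carry a later timestamp.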

Virtual Machines Recoverable From Uncorrectable Memory Errors
20230121338 · 2023-04-20

The disclosed technology provides techniques, systems, and apparatus for containing and recovering from uncorrectable memory errors in a distributed computing environment. An aspect of the disclosed technology includes a hypervisor or virtual machine manager that receives signaling of an uncorrectable memory error detected by a host machine. The virtual machine manager then uses information received via the signaling to identify virtual memory addresses or memory pages associated with the corrupted memory element so as to allow for containment of and recovery from the error.
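Mapping a reported error address to the affected guest pages might be sketched as below. The 4 KiB page size, the `frame_to_guest` lookup table, and both function names are assumptions for illustration; a real hypervisor would consult its own memory-mapping structures.

```python
from typing import Dict, List, Set, Tuple

PAGE_SIZE = 4096  # assumed page size in bytes

# Host page frame number -> list of (vm id, guest page number) backed by it.
FrameMap = Dict[int, List[Tuple[str, int]]]


def pages_hit_by_error(error_phys_addr: int, frame_to_guest: FrameMap) -> List[Tuple[str, int]]:
    """Guest pages backed by the host frame that contains the error address."""
    frame = error_phys_addr // PAGE_SIZE
    return list(frame_to_guest.get(frame, []))


def contain(error_phys_addr: int, frame_to_guest: FrameMap, quarantined: Set[int]) -> List[str]:
    """Quarantine the bad host frame and return the ids of the VMs it affects."""
    hit = pages_hit_by_error(error_phys_addr, frame_to_guest)
    quarantined.add(error_phys_addr // PAGE_SIZE)
    return sorted({vm for vm, _ in hit})
```

Only the virtual machines returned by `contain` need recovery action; the rest of the host can keep running.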