TAG CHECKING APPARATUS AND METHOD

An apparatus has tag checking circuitry responsive to a target address to: identify a guard tag stored in a memory system in association with a block of one or more memory locations, the block containing a target memory location identified by the target address, perform a tag check based on the guard tag and an address tag associated with the target address, and in response to detecting a mismatch in the tag check, perform an error response action. The apparatus also has tag mapping storage circuitry to store mapping information indicative of a mapping between guard tag values and corresponding address tag values. The tag checking circuitry remaps at least one of the guard tag and the address tag based on the mapping information stored by the tag mapping storage circuitry to generate a remapped tag for use in the tag check.
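
For illustration only, a minimal Python sketch of the described check, assuming 4-bit tags, 16-byte tag granules, an address tag carried in the top bits of a 64-bit pointer, and a plain table standing in for the tag mapping storage circuitry (none of these parameters come from the abstract):

```python
GRANULE = 16                                  # bytes covered by one guard tag (assumed)
TAG_BITS = 4                                  # guard/address tag width (assumed)
TAG_MASK = (1 << TAG_BITS) - 1

guard_tags = {}                               # granule index -> guard tag (stands in for tag memory)
tag_mapping = {t: t for t in range(1 << TAG_BITS)}   # mapping information (identity by default)

def set_guard_tag(addr, tag):
    guard_tags[addr // GRANULE] = tag & TAG_MASK

def address_tag(pointer):
    return (pointer >> 60) & TAG_MASK         # address tag assumed in the pointer's top bits

def checked_access(pointer):
    target = pointer & ((1 << 60) - 1)        # target address without the tag
    guard = guard_tags.get(target // GRANULE, 0)
    remapped = tag_mapping[guard]             # remap the guard tag before the comparison
    if remapped != address_tag(pointer):
        raise MemoryError(f"tag mismatch at {target:#x}")   # error response action
    return target

set_guard_tag(0x1000, 0x3)
checked_access((0x3 << 60) | 0x1000)          # passes: remapped guard tag matches address tag
```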

NODE HEALTH PREDICTION BASED ON FAILURE ISSUES EXPERIENCED PRIOR TO DEPLOYMENT IN A CLOUD COMPUTING SYSTEM

To improve the reliability of nodes that are utilized by a cloud computing provider, information about the entire lifecycle of nodes can be collected and used to predict when nodes are likely to experience failures based at least in part on early lifecycle errors. In one aspect, a plurality of failure issues experienced by a plurality of production nodes in a cloud computing system during a pre-production phase can be identified. A subset of the plurality of failure issues can be selected based at least in part on correlation with service outages for the plurality of production nodes during a production phase. A comparison can be performed between the subset of the plurality of failure issues and a set of failure issues experienced by a pre-production node during the pre-production phase. A risk score for the pre-production node can be calculated based at least in part on the comparison.
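
A hypothetical Python sketch of how such a risk score might be computed; the issue names, correlation values, and threshold below are invented for illustration:

```python
def select_predictive_issues(production_issues, outage_correlation, threshold=0.3):
    """Keep failure issues whose pre-production occurrence correlates with
    later production outages at or above the threshold."""
    return {i for i in production_issues if outage_correlation.get(i, 0.0) >= threshold}

def risk_score(node_issues, predictive_issues, outage_correlation):
    """Score a pre-production node by its overlap with the outage-correlated
    subset, weighted by each issue's correlation."""
    return sum(outage_correlation[i] for i in node_issues & predictive_issues)

correlation = {"disk_timeout": 0.7, "ecc_error": 0.5, "nic_flap": 0.1}
predictive = select_predictive_issues({"disk_timeout", "ecc_error", "nic_flap"}, correlation)
print(risk_score({"disk_timeout", "nic_flap"}, predictive, correlation))   # 0.7
```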

Addressing Storage Device Performance

Improving storage device performance, including: initiating, on a storage device, execution of a rehabilitative action from a set of rehabilitative actions that can be performed on the storage device; determining that the storage device is operating outside of a defined range of expected operating parameters after the rehabilitative action has been executed; and initiating execution of a higher-level rehabilitative action responsive to determining that the higher-level rehabilitative action exists.
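
One way to read the escalation logic, as a hedged Python sketch; the action names, their ordering, and the health check are assumptions:

```python
class Device:                                  # stand-in for a real storage device
    def __init__(self):
        self.executed = []
    def execute(self, action):
        self.executed.append(action)

REHAB_ACTIONS = ("soft_reset", "firmware_reload", "full_reformat")   # assumed low -> high

def rehabilitate(device, within_expected_range):
    """Run rehabilitative actions lowest level first, escalating only while
    the device still operates outside its expected parameters."""
    for action in REHAB_ACTIONS:
        device.execute(action)
        if within_expected_range(device):
            return action                      # device recovered; stop escalating
    return None                                # no higher-level action exists

dev = Device()
print(rehabilitate(dev, lambda d: len(d.executed) >= 2))   # firmware_reload
```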

Failure Prediction Using Informational Logs and Golden Signals

Embodiments relate to a computer platform to support processing of informational logs and corresponding performance data to detect and mitigate occurrence of anomalous behavior. Metrics are extracted from the informational logs and correlated with performance data, in an exemplary embodiment with golden signal metrics. A window or block of the logs is classified as a potential candidate or indicator of anomalous behavior, which in an embodiment is indicative of a potential failure or service outage. A control signal is dynamically issued to an operatively coupled device associated with the window or block of logs. The control signal is configured to selectively control a state of a physical device or a process controlled by software, with the control directed at mitigating or eliminating the effect(s) of the anomalous behavior.
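
A minimal sketch, assuming a simple log format and two illustrative thresholds, of how a window of logs might be correlated with a golden-signal metric and used to issue a control signal:

```python
from collections import Counter

def window_metrics(log_lines):
    """Extract a simple metric (error count) from a window of log lines."""
    levels = Counter(line.split()[0] for line in log_lines if line.strip())
    return {"error_count": levels.get("ERROR", 0)}

def is_anomalous(window, golden_signals, error_threshold=5, latency_threshold=500.0):
    """Classify a window as a potential indicator of anomalous behavior when
    log-derived errors and golden-signal latency both exceed thresholds."""
    return (window_metrics(window)["error_count"] >= error_threshold
            and golden_signals.get("latency_ms", 0.0) >= latency_threshold)

def control_loop(window, golden_signals, actuator):
    if is_anomalous(window, golden_signals):
        actuator("throttle")                  # control signal toward the coupled device

control_loop(["ERROR timeout"] * 6, {"latency_ms": 750.0}, print)   # prints: throttle
```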

MECHANISM FOR INTEGRATING I/O HYPERVISOR WITH A COMBINED DPU AND SERVER SOLUTION

A combined data processing unit (DPU) and server solution with DPU operating system (OS) integration is described. A DPU OS is executed on a DPU or other computing device, where the DPU OS exercises secure calls provided by the DPU's trusted firmware component; these calls may be invoked by DPU OS components to abstract DPU vendor-specific and server vendor-specific integration details. An invocation of one of the secure calls, made on the DPU to communicate with its associated server computing device, is identified. When one of the secure calls is invoked, the invoked call is translated into a call or request specific to the architecture of the server computing device, and the call is performed, which may include sending a signal to the server computing device in a format interpretable by the server computing device.
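
For illustration, a Python sketch of the translation step, with invented call names and server architectures standing in for vendor-specific details:

```python
SERVER_DIALECTS = {   # invented architectures and request formats
    "arch_a": {"power_cycle": "IPMI chassis power cycle", "health_query": "Redfish GET /Systems/1"},
    "arch_b": {"power_cycle": "vendor RESET opcode", "health_query": "vendor STATUS opcode"},
}

def handle_secure_call(call_name, server_arch, send):
    """Translate an abstract DPU OS secure call into a request specific to
    the server's architecture and send it in a format it can interpret."""
    try:
        request = SERVER_DIALECTS[server_arch][call_name]
    except KeyError:
        raise ValueError(f"unsupported call {call_name!r} for {server_arch!r}")
    send(request)

handle_secure_call("power_cycle", "arch_a", print)   # IPMI chassis power cycle
```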

METHOD AND SYSTEM FOR IDENTIFYING ROOT CAUSE OF A HARDWARE COMPONENT FAILURE

In general, embodiments relate to a method for identifying hardware component failures, comprising: obtaining system logs that show a transition of device states for a device; using a normalization and filtering module to process the system logs and extract relevant data and important keywords for the device; creating a device state path for the device, from a healthy device state to an unhealthy device state, using the extracted relevant data; obtaining the device state path for the device from storage, along with a current device state of the device; predicting a next device state of the device based on the current device state using an analysis module; generating a device state chain using the device state path, the current device state, and the next device state; and identifying a root cause of a hardware component failure using the device state chain.
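
A hedged sketch of the state-path and state-chain idea, with hypothetical device states and a frequency-based next-state predictor standing in for the analysis module:

```python
from collections import Counter, defaultdict

transitions = defaultdict(Counter)             # state -> Counter of observed next states

def learn_path(path):
    for a, b in zip(path, path[1:]):
        transitions[a][b] += 1

def predict_next(state):
    nxt = transitions[state]
    return nxt.most_common(1)[0][0] if nxt else None

# Hypothetical historical device state paths, healthy -> unhealthy:
learn_path(["healthy", "correctable_errors", "link_degraded", "failed"])
learn_path(["healthy", "correctable_errors", "failed"])

current = "correctable_errors"
chain = ["healthy", current, predict_next(current)]   # path + current + predicted next
print(chain)   # the chain localizes the transition implicating the failing component
```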

TECHNIQUES FOR IMPLEMENTING ROLLBACK OF INFRASTRUCTURE CHANGES IN A CLOUD INFRASTRUCTURE ORCHESTRATION SERVICE

Techniques for implementing rollback of infrastructure changes in an infrastructure orchestration service are described. In certain examples, an infrastructure orchestration service is disclosed that manages both provisioning and deploying of infrastructure assets within a cloud environment. The service receives a plan comprising a set of instructions associated with a set of infrastructure assets of an execution target and identifies a first state of the set of infrastructure assets. The service executes the set of instructions in the plan to achieve a second state for the set of infrastructure assets. Based in part on the executing, the service receives a trigger for rolling back the plan to restore the set of infrastructure assets in the plan to the first state and executes a rollback plan for the plan. The service then transmits a result associated with the execution of the rollback plan.
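
A minimal Python sketch of executing a plan and rolling back to the recorded first state; the asset model is a plain dictionary and the rollback trigger is simplified to an execution failure, both assumptions:

```python
def execute_plan(assets, instructions):
    """Apply a plan's instructions to the assets; on a rollback trigger
    (simplified here to any failure), restore the recorded first state."""
    first_state = dict(assets)               # identify the first state
    try:
        for op, key, value in instructions:
            if op == "set":
                assets[key] = value          # drive assets toward the second state
            else:
                raise ValueError(f"unknown instruction {op!r}")
    except Exception as exc:                 # trigger for rolling back the plan
        assets.clear()
        assets.update(first_state)           # rollback plan restores the first state
        return {"status": "rolled_back", "reason": str(exc)}
    return {"status": "applied"}

assets = {"vm_count": 2}
print(execute_plan(assets, [("set", "vm_count", 4), ("boom", None, None)]))
print(assets)                                # {'vm_count': 2} -- first state restored
```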

SYSTEMS AND METHODS FOR AUTOMATICALLY APPLYING CONFIGURATION CHANGES TO COMPUTING CLUSTERS

A system includes a memory and a processor. The processor is configured to access one or more configuration logs, comprising a plurality of log messages, generated by a computing cluster. The processor is further configured to determine, by analyzing the one or more configuration logs, a particular service running on the computing cluster that has generated a plurality of errors within the plurality of log messages. The processor is further configured to determine whether a particular error of the plurality of errors has previously occurred. The processor is further configured to, in response to determining that the particular error has previously occurred, generate and send one or more commands to the computing cluster. The one or more commands are operable to change a current configuration value for the particular service running on the computing cluster to a new configuration value. The new configuration value is based on a historical value stored in a database of historical configuration errors.
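
A sketch of that remediation loop, with an invented log format and a dictionary standing in for the database of historical configuration errors:

```python
import re

HISTORY = {  # (service, error signature) -> known-good configuration value (assumed schema)
    ("indexer", "heap exhausted"): {"indexer.heap_gb": 16},
}

def find_failing_service(log_messages):
    errors = [m for m in log_messages if " ERROR " in m]
    match = re.search(r"service=(\w+)", errors[0]) if errors else None
    return match.group(1) if match else None

def remediate(log_messages, send_commands):
    service = find_failing_service(log_messages)
    for (svc, signature), new_config in HISTORY.items():
        if svc == service and any(signature in m for m in log_messages):
            send_commands(service, new_config)   # commands changing the current value
            return new_config
    return None                                  # error not seen before; no auto-change

logs = ["2023-07-27 ERROR service=indexer heap exhausted"]
remediate(logs, lambda svc, cfg: print("reconfigure", svc, cfg))
```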

Method for identifying and evaluating common cause failures of system components

Provided is a method and system for identifying and evaluating common cause failures of system components, wherein at least one analytical artifact and machine-readable system-related data, comprising at least one of spatial data, topological data, and machine-readable system-related lifecycle data, are processed to automatically analyze a susceptibility of system components to common cause failure based on common cause failure influencing factors.
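
Illustration only: a Python sketch that scores a component pair's common cause failure susceptibility from shared influencing factors; the factors and weights are assumptions:

```python
FACTOR_WEIGHTS = {"rack": 0.5, "supplier": 0.3, "firmware": 0.2}   # assumed factors/weights

def ccf_susceptibility(comp_a, comp_b):
    """Weighted share of common cause failure influencing factors the two
    components have in common (spatial, topological, or lifecycle data)."""
    return sum(w for f, w in FACTOR_WEIGHTS.items() if comp_a.get(f) == comp_b.get(f))

pump_a = {"rack": "R1", "supplier": "S1", "firmware": "v2"}
pump_b = {"rack": "R1", "supplier": "S2", "firmware": "v2"}
print(ccf_susceptibility(pump_a, pump_b))   # 0.7 -> high shared susceptibility
```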

Network system fault resolution via a machine learning model

Disclosed are embodiments for automatically resolving faults in a complex network system. Some embodiments monitor one or more of system operational parameter values and message exchanges between network components. A machine learning model detects a fault in the complex network system, and an action is selected based on a cause of the fault. After the action is applied to the complex network system, additional monitoring is performed to determine either that the fault has been resolved or that additional actions are to be applied to further resolve the fault.
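
A hedged sketch of the monitor/detect/act loop; the model interface and action catalog below are placeholders, not the claimed machine learning model:

```python
ACTIONS_BY_CAUSE = {"link_down": "restart_interface", "config_drift": "resync_config"}

def resolve(monitor, detect_cause, apply_action, max_rounds=3):
    """Monitor, let the model diagnose a cause, act, then re-monitor until
    the fault clears or the round budget runs out."""
    for _ in range(max_rounds):
        telemetry = monitor()                       # parameter values + message exchanges
        cause = detect_cause(telemetry)             # a trained model stands in here
        if cause is None:
            return "resolved"
        apply_action(ACTIONS_BY_CAUSE.get(cause, "escalate"))
    return "unresolved"                             # further actions still needed

faults = ["link_down", None]                        # canned model output for the demo
print(resolve(lambda: {}, lambda t: faults.pop(0), print))   # restart_interface, resolved
```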