G06F11/004

Datacenter IoT-triggered preemptive measures using machine learning

One example method includes performing a machine learning process that involves assessing the state of a computing system. The assessment includes analyzing information generated by an IoT edge sensor in response to a sensed physical condition in the computing system, and identifying an entity in the computing system potentially impacted by an event associated with the physical condition. The example method further includes identifying preemptive recovery actions and associating them with the entity, where each preemptive recovery action, when performed, reduces or eliminates an impact of the event on the entity; determining a cost associated with implementing each preemptive recovery action; evaluating those costs and identifying the preemptive recovery action with the lowest associated cost; implementing the preemptive recovery action with the lowest associated cost; and repeating part of the machine learning process.
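The cost-evaluation step can be illustrated with a minimal Python sketch. The `RecoveryAction` type, the action names, and the cost values are all hypothetical and are not taken from the abstract; the sketch only shows the selection of the lowest-cost candidate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryAction:
    name: str    # hypothetical label for the action
    entity: str  # entity the action is associated with
    cost: float  # assumed cost of implementing the action (arbitrary units)


def cheapest_action(actions):
    """Return the candidate with the lowest cost, or None if there are none."""
    return min(actions, key=lambda a: a.cost, default=None)


# Hypothetical candidates for one potentially impacted entity.
candidates = [
    RecoveryAction("snapshot_vm", "vm-17", 4.0),
    RecoveryAction("migrate_vm", "vm-17", 9.5),
    RecoveryAction("replicate_to_cloud", "vm-17", 22.0),
]

best = cheapest_action(candidates)
print(best.name)
```

In a learning loop, the outcome of the chosen action would presumably feed back into the next assessment; that part is left out here.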

Integrated circuit chip with cores asymmetrically oriented with respect to each other

An integrated circuit (IC) chip can include a given core at a position in the IC chip that defines a given orientation, wherein the given core is designed to perform a particular function. The IC chip can include another core designed to perform the particular function. The other core can be flipped and rotated by 180 degrees relative to the given core such that the other core is asymmetrically oriented with respect to the given core. The IC chip can also include a compare unit configured to compare outputs of the given core and the other core to detect a fault in the IC chip.
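The role of the compare unit can be sketched in software, under the assumption that a fault shows up as a mismatch between the two cores' outputs. The functions `core` and `faulty_core` below are hypothetical stand-ins for the hardware cores; the physical orientation of the cores is not modeled.

```python
def core(x: int) -> int:
    # Stand-in for the "particular function" that both cores implement.
    return (x * 3 + 1) & 0xFF


def faulty_core(x: int) -> int:
    # Same function with one output bit flipped to simulate a fault.
    return core(x) ^ 0x04


def compare_outputs(out_a: int, out_b: int) -> bool:
    """Return True when the outputs disagree, i.e. a fault is detected."""
    return out_a != out_b


fault_free = compare_outputs(core(10), core(10))
fault_seen = compare_outputs(core(10), faulty_core(10))
print(fault_free, fault_seen)
```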

Resiliency Schemes for Distributed Storage Systems
20230214956 · 2023-07-06

A plurality of computing devices are communicatively coupled to each other via a network, and each of the plurality of computing devices is operably coupled to one or more of a plurality of storage devices. A plurality of failure resilient stripes is distributed across the plurality of storage devices such that each of the plurality of failure resilient stripes spans a plurality of the storage devices. A graphics processing unit is operable to access data files from the failure resilient stripes, while bypassing a kernel page cache. Furthermore, these data files may be accessed in parallel by the graphics processing unit.

Machine learning to predict container failure for data transactions in distributed computing environment

Inflight transactions having predictable pod failure in distributed computing environments are managed by integrating a transaction manager into pods having containers running applications in a distributed computing environment, wherein the transaction manager records a transaction log having data indicative of historical pod failure. A pod health checker that is also integrated into the pods determines predictive pod failure scenarios from the data of historical pod failure in the transaction log. Pod health can be tracked using the pod health checker by matching the predictive pod failure scenarios to transaction calls. Calls may be sent to a load balancer for recovery from pod failure when transaction calls match the predictive pod failure scenarios. Pods can be configured to recover from the predicted pod failure.

Reconfiguration rate-control

A state management server applies configuration information to a set of virtual computer system instances in accordance with one or more limitations specified by an administrator. In an embodiment, the limitations include a velocity parameter that limits the number of virtual computer system instances to which the configuration may be applied concurrently. In an embodiment, the limitations include an error threshold that stops the application of the configuration if the number of configuration failures meets or exceeds the error threshold. In an embodiment, the set of virtual computer systems is identified by providing a list of the individual virtual computer system instances, or by specifying one or more tags that are associated with the virtual computer systems in the set. In an embodiment, the administrator is able to specify that an association be applied according to a predetermined schedule.
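The interaction between the velocity parameter and the error threshold can be sketched as a batched rollout. This is a simplified illustration: the `rollout` function, its return shape, and the `apply_config` callback are assumptions, and each batch is processed sequentially here where a real system would presumably apply it concurrently.

```python
def rollout(instances, apply_config, velocity, error_threshold):
    """Apply a configuration in batches of at most `velocity` instances.

    Stops once the number of failures meets or exceeds `error_threshold`.
    Returns (applied_instances, failure_count, completed).
    """
    if velocity < 1:
        raise ValueError("velocity must be at least 1")
    applied, failures = [], 0
    for start in range(0, len(instances), velocity):
        for inst in instances[start:start + velocity]:
            if apply_config(inst):
                applied.append(inst)
            else:
                failures += 1
        # The threshold is checked after each batch finishes.
        if failures >= error_threshold:
            return applied, failures, False
    return applied, failures, True


# Hypothetical fleet in which two instances reject the configuration.
instances = [f"i-{n}" for n in range(6)]
bad = {"i-1", "i-2"}
applied, failures, completed = rollout(
    instances, lambda inst: inst not in bad, velocity=2, error_threshold=2
)
print(applied, failures, completed)
```

With these inputs the rollout halts after the second batch, so the last two instances are never touched.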

Dynamic object policy reconfiguration mechanism for object storage system

A method for dynamic storage object configuration in a datacenter is provided. Embodiments include determining a number of fault domains in a storage cluster that have sufficient storage capacity for creating a storage object. Embodiments include applying a dynamic fault tolerance policy to the number of fault domains that have sufficient capacity for creating the storage object in order to determine a number of host failures to tolerate for the storage object, the dynamic fault tolerance policy specifying a manner of determining, for any respective storage object, a respective number of host failures to tolerate for storing the respective storage object in a respective storage cluster based on at least a respective number of fault domains of the respective storage cluster. Embodiments include implementing the storage object on the storage cluster based on the number of host failures to tolerate for the storage object.
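As an illustration of how a dynamic policy might map fault domains to a tolerance level, the sketch below assumes a mirroring-style rule in which tolerating n host failures requires 2n + 1 fault domains, capped at a maximum. That rule, the function name, and the capacity figures are assumptions made for the example; the abstract does not specify the policy's formula.

```python
def host_failures_to_tolerate(free_capacity_by_domain, object_size, max_ftt=3):
    """Derive a failures-to-tolerate value from the eligible fault domains.

    Assumed policy: tolerating n failures needs 2n + 1 fault domains that
    each have room for the object; the result is capped at `max_ftt`.
    """
    eligible = sum(
        1 for free in free_capacity_by_domain.values() if free >= object_size
    )
    return min(max_ftt, max(0, (eligible - 1) // 2))


# Hypothetical cluster: free capacity per fault domain, in GB.
domains = {"fd1": 500, "fd2": 500, "fd3": 50, "fd4": 500, "fd5": 500, "fd6": 500}
ftt = host_failures_to_tolerate(domains, object_size=100)
print(ftt)
```

Here five of the six fault domains have enough capacity, so the assumed rule yields a tolerance of two host failures.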

System management method, non-transitory computer-readable storage medium for storing system management program, and system management device

A method includes: acquiring, based on status information, a failure risk of each of a plurality of devices including physical devices and virtual machines, each of the virtual machines being operated on any of the physical devices, the status information indicating the statuses of the plurality of devices; acquiring an influence range based on route information indicating a link in a range affected by a failure; acquiring a first influence risk based on a failure risk acquired for a first physical device, the first physical device being any of the physical devices; acquiring a second influence risk based on a failure risk of a second physical device, the second influence risk indicating a possibility of a target device being affected by a failure in another device; and determining the second physical device as a destination candidate of the target device when the second influence risk is lower than the first influence risk.
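The comparison of influence risks can be sketched as follows. The abstract does not say how an influence risk is computed, so this example assumes it is the probability that at least one device in a host's influence range fails, with failures treated as independent. The device names, risk values, and both function names are hypothetical.

```python
from math import prod


def influence_risk(host, failure_risk, influence_range):
    """Probability that at least one device in the host's range fails.

    Assumes independent failures; this formula is an illustrative choice.
    """
    return 1.0 - prod(1.0 - failure_risk[d] for d in influence_range[host])


def destination_candidates(current, failure_risk, influence_range):
    """Hosts whose influence risk is lower than that of the current host."""
    current_risk = influence_risk(current, failure_risk, influence_range)
    return [
        host
        for host in influence_range
        if host != current
        and influence_risk(host, failure_risk, influence_range) < current_risk
    ]


# Hypothetical failure risks and influence ranges derived from route information.
failure_risk = {
    "host-a": 0.20, "host-b": 0.05, "host-c": 0.30,
    "switch-1": 0.10, "switch-2": 0.02,
}
influence_range = {
    "host-a": ["host-a", "switch-1"],
    "host-b": ["host-b", "switch-2"],
    "host-c": ["host-c", "switch-2"],
}

candidates = destination_candidates("host-a", failure_risk, influence_range)
print(candidates)
```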

Aging protection techniques for power switches

The present disclosure provides techniques for predicting failure of power switches and taking action based on the predictions. In an example, a method can include controlling at least two parallel-connected power switches via a first driver and a second driver, the first and second drivers responsive to a single command signal, measuring a failure characteristic of a first power switch, and disabling the first driver of the first power switch when the failure characteristic exceeds a failure precursor threshold.
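The threshold check can be sketched in a few lines of Python. The switch names, the measured values, and the threshold are hypothetical, and the failure characteristic is treated as a generic scalar because the abstract does not say which quantity is measured.

```python
def update_drivers(measurements, threshold, enabled):
    """Return the new driver-enable map.

    A driver stays enabled only if it was already enabled and its switch's
    measured failure characteristic does not exceed the threshold.
    """
    return {
        switch: enabled[switch] and measurements[switch] <= threshold
        for switch in enabled
    }


# Two parallel-connected switches, both drivers initially enabled.
enabled = {"sw1": True, "sw2": True}
measurements = {"sw1": 1.8, "sw2": 1.1}  # hypothetical readings
state = update_drivers(measurements, threshold=1.5, enabled=enabled)
print(state)
```

After the update, the remaining enabled driver continues to respond to the shared command signal while the aging switch is taken out of service.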

Resolving erred IO flows
11544139 · 2023-01-03

A method for resolving an erred input/output (IO) flow. The method may include (i) sending over a path a remote direct write request associated with a certain address range, wherein the path is formed between a compute node of a storage system and a storage drive of the storage system; (ii) receiving by the compute node an error message related to the remote direct write request, wherein the error message does not indicate whether an execution of the remote direct write request failed or is only temporarily delayed; (iii) responding by the compute node to the error message by (a) refraining from sending one or more IO requests through the path, (b) refraining from sending at least one IO request aimed at the certain address range, and (c) requesting, using a management communication link, to force an execution of pending IO requests that are related to the path; and (iv) reusing the path, by the compute node, following an indication that there are no pending IO requests that are related to the path.

Data recovery validation test

Techniques are described for a data recovery validation test. In examples, a processor receives a command to be included in the validation test that is configured to validate performance of an activity by a server prior to a failure to perform the activity by the server. The processor stores the validation test including the command on a memory device, and prior to the failure of the activity by the server, executes the validation test including the command responsive to an input. The processor receives results of the validation test corresponding to the command and indicating whether the server performed the activity in accordance with a standard for the activity during the validation test. The processor provides the results of the validation test in a user interface.