G06F11/0751

MULTIPLE BLOCK ERROR CORRECTION IN AN INFORMATION HANDLING SYSTEM

An information handling system includes a first memory and a baseboard management controller. The first memory stores a first firmware partition and a second firmware partition. The baseboard management controller includes a second memory. The baseboard management controller begins execution of a DM-Verity daemon, and performs periodic patrol reads of the first firmware partition. The baseboard management controller detects one or more block failures in the first firmware partition, and stores information associated with the one or more block failures in a message box of the second memory. In response to the entire first firmware partition being scanned, the baseboard management controller switches a boot partition from the first firmware partition to the second firmware partition, and initiates a reboot of the information handling system.
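
A minimal sketch of the mechanism described above, assuming hypothetical names (patrol_read, MessageBox, switch_boot_partition); it is not the patented implementation, only an illustration of verifying each block against expected hashes, recording failures, and failing over after a full scan.

```python
# Minimal sketch (not the patented implementation): a patrol-read loop that
# verifies each block of a firmware partition against precomputed hashes,
# records failures in a "message box" structure, and switches the boot
# partition once the whole partition has been scanned. All names here
# (patrol_read, MessageBox, switch_boot_partition) are hypothetical.
import hashlib

BLOCK_SIZE = 4096

class MessageBox:
    """Stands in for the region of BMC memory that holds failure records."""
    def __init__(self):
        self.failed_blocks = []

    def record(self, block_index, digest):
        self.failed_blocks.append({"block": block_index, "digest": digest})

def patrol_read(partition: bytes, expected_hashes: list, box: MessageBox) -> bool:
    """Scan every block; return True if any block fails verification."""
    failed = False
    for i, expected in enumerate(expected_hashes):
        block = partition[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest != expected:
            box.record(i, digest)
            failed = True
    return failed

def handle_scan(partition, expected_hashes, box, switch_boot_partition, reboot):
    # Once the entire partition has been scanned, fail over to the second
    # firmware partition and reboot if any block-level corruption was found.
    if patrol_read(partition, expected_hashes, box):
        switch_boot_partition("firmware_b")
        reboot()
```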

Virtualized file server

In one embodiment, a system for managing communication connections in a virtualization environment includes a plurality of host machines implementing a virtualization environment, wherein each of the host machines includes a hypervisor, at least one user virtual machine (user VM), and a distributed file server that includes file server virtual machines (FSVMs) and associated local storage devices. Each FSVM and associated local storage device are local to a corresponding one of the host machines, and the FSVMs conduct I/O transactions with their associated local storage devices based on I/O requests received from the user VMs. Each of the user VMs on each host machine sends each of its respective I/O requests to an FSVM that is selected by one or more of the FSVMs for each I/O request based on a lookup table that maps a storage item referenced by the I/O request to the selected one of the FSVMs.
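
A small sketch, under assumed table contents and helper names, of the routing idea described above: an I/O request is sent to the FSVM that a lookup table maps to the referenced storage item.

```python
# Illustrative sketch only: routing a user VM's I/O request to the file
# server VM (FSVM) selected from a lookup table that maps storage items
# (e.g. shares or folders) to FSVMs. Table contents and route_io are
# assumptions for the example, not the patented design.
from dataclasses import dataclass

@dataclass
class FSVM:
    name: str
    host: str  # host machine the FSVM and its local storage device live on

# Lookup table: storage item referenced by an I/O request -> owning FSVM.
LOOKUP_TABLE = {
    "/share/projects": FSVM("fsvm-1", "host-a"),
    "/share/home":     FSVM("fsvm-2", "host-b"),
}

def route_io(storage_item: str, payload: bytes) -> str:
    """Select the FSVM that owns the storage item and 'send' the request."""
    fsvm = LOOKUP_TABLE[storage_item]
    # In the described system the selected FSVM then performs the I/O
    # transaction against its associated local storage device.
    return f"{fsvm.name}@{fsvm.host} handled {len(payload)} bytes for {storage_item}"

print(route_io("/share/home", b"write me"))
```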

Role-based failure response training for distributed systems

Methods, systems, and computer-readable media for role-based failure response training for distributed systems are disclosed. A failure response training system determines a failure mode associated with an architecture for a distributed system comprising a plurality of components. The training system generates a scenario based at least in part on the failure mode. The scenario comprises an initial state of the distributed system which is associated with one or more initial metrics indicative of a failure. The training system provides, to a plurality of users, data describing the initial state. The training system solicits user input representing modification of a configuration of the components. The training system determines a modified state of the distributed system based at least in part on the input. The performance of the distributed system in the modified state is indicated by one or more modified metrics differing from the one or more initial metrics.
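
A toy sketch of the scenario loop described above, with invented failure modes, metrics, and a simplified remediation rule; it only illustrates generating an initial failing state, applying a user's configuration change, and recomputing the metrics.

```python
# Illustrative sketch only: generate a training scenario from a failure
# mode, apply a user's configuration change, and recompute the metrics.
# The failure mode, metrics, and remediation rule are invented here.
def generate_scenario(failure_mode: str) -> dict:
    if failure_mode == "cache_fleet_down":
        return {"config": {"cache_enabled": True, "db_replicas": 2},
                "metrics": {"p99_latency_ms": 4500, "error_rate": 0.12}}
    raise ValueError("unknown failure mode")

def apply_user_input(state: dict, changes: dict) -> dict:
    modified = {"config": {**state["config"], **changes},
                "metrics": dict(state["metrics"])}
    # Toy rule: extra database replicas absorb the read load the failed
    # cache fleet no longer serves, improving the failure metrics.
    extra = modified["config"]["db_replicas"] - state["config"]["db_replicas"]
    modified["metrics"]["p99_latency_ms"] = max(200, 4500 - 1500 * extra)
    modified["metrics"]["error_rate"] = max(0.0, 0.12 - 0.05 * extra)
    return modified

initial = generate_scenario("cache_fleet_down")
print(apply_user_input(initial, {"db_replicas": 4}))
```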

Saving page retire information persistently across operating system reboots

Examples described herein include systems and methods for retaining information about bad memory pages across an operating system reboot. An example method includes detecting, by a first instance of an operating system, an error in a memory page of a non-transitory storage medium of a computing device executing the operating system. The operating system can tag the memory page as a bad memory page, indicating that the memory page should not be used by the operating system. The operating system can also store tag information indicating memory pages of the storage medium that are tagged as bad memory pages. The example method can also include receiving an instruction to reboot the operating system, booting a second instance of the operating system, and providing the tag information to the second instance of the operating system. The operating system can use the tag information to avoid using the bad memory pages.
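
A rough sketch under assumed names of the tag-and-persist idea above: the first instance tags a bad page and stores the tag information, and the second instance reloads it and skips the retired pages. A JSON file stands in for whatever persistent store the real system uses.

```python
# Rough sketch, assumed names only: tag bad memory pages, persist the tag
# information, reload it in a new OS "instance", and avoid those pages.
import json, os, tempfile

TAG_FILE = os.path.join(tempfile.gettempdir(), "retired_pages.json")

def load_tags() -> set:
    try:
        with open(TAG_FILE) as f:
            return set(json.load(f))
    except FileNotFoundError:
        return set()

def tag_bad_page(page_number: int) -> None:
    pages = load_tags()
    pages.add(page_number)
    with open(TAG_FILE, "w") as f:
        json.dump(sorted(pages), f)

def allocate_page(candidates, retired: set) -> int:
    # The second instance consults the persisted tags and skips bad pages.
    return next(p for p in candidates if p not in retired)

tag_bad_page(42)                                  # first instance detects an error
retired = load_tags()                             # second instance reads the tags
print(allocate_page(range(40, 50), retired))      # never returns 42
```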

Anti-pattern detection in extraction and deployment of a microservice

Disclosed are various embodiments for anti-pattern detection in extraction and deployment of a microservice. A software modernization service is executed to analyze a computing application to identify various application components. When one or more of the application components are specified to be extracted as an independently deployable subunit, anti-patterns associated with deployment of the independently deployable subunit are determined prior to extraction. Anti-patterns may include increases in execution time, bandwidth, network latency, central processing unit (CPU) usage, and memory usage, among other anti-patterns. The independently deployable subunit is selectively deployed separate from the computing application based on the identified anti-patterns.
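
A hedged sketch of the decision described above, with invented thresholds and estimate structure: a candidate component is deployed separately only if its estimated increases in latency, CPU, or memory do not exceed a limit.

```python
# Hedged sketch, not the actual modernization service: score a candidate
# component for extraction by checking estimated anti-patterns (added
# latency, CPU, memory) against thresholds that are assumptions here.
THRESHOLDS = {"latency_ms": 50, "cpu_pct": 20, "memory_mb": 256}

def detect_anti_patterns(estimated_increase: dict) -> list:
    return [metric for metric, limit in THRESHOLDS.items()
            if estimated_increase.get(metric, 0) > limit]

def maybe_extract(component: str, estimated_increase: dict) -> str:
    anti_patterns = detect_anti_patterns(estimated_increase)
    if anti_patterns:
        return f"keep {component} in the monolith (anti-patterns: {anti_patterns})"
    return f"deploy {component} as an independent microservice"

print(maybe_extract("billing", {"latency_ms": 12, "cpu_pct": 5, "memory_mb": 128}))
print(maybe_extract("reporting", {"latency_ms": 180, "cpu_pct": 8}))
```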

Apparatuses, methods, and computer program products for triggering component workflows within a multi-component system

Methods, apparatuses, or computer program products provide for triggering component workflows within a multi-component system. An update to one or more component metadata records of a component metadata vector associated with a first component identifier may be received. The component metadata vector may include a plurality of records. Each record of the plurality of records may include a unique component metadata record identifier and a component metadata value. The component metadata vector associated with the first component identifier may be traversed after updating the one or more component metadata records. Based at least in part on detecting a component metadata condition associated with a component workflow trigger associated with the first component identifier, a first component workflow action of a first component workflow action series comprising a plurality of component workflow actions may be executed. Furthermore, a component workflow trigger notification may be transmitted to a first computing device.
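
An illustrative sketch of the trigger flow described above, with hypothetical record shapes, condition, and notify helper: a metadata record is updated, the vector is traversed, and the first action of a workflow series runs when the trigger condition is met.

```python
# Illustrative sketch only: update a component metadata vector, traverse it,
# and fire the first action of a workflow when a trigger condition is met.
# Record shapes, the condition, and notify() are hypothetical.
def update_record(vector: list, record_id: str, value) -> None:
    for record in vector:
        if record["id"] == record_id:
            record["value"] = value

def traverse_and_trigger(vector: list, trigger: dict, actions: list, notify) -> None:
    # The trigger fires when the named record reaches the expected value.
    for record in vector:
        if record["id"] == trigger["record_id"] and record["value"] == trigger["value"]:
            actions[0]()                      # first action of the workflow series
            notify(f"workflow triggered by {record['id']}")
            return

vector = [{"id": "status", "value": "in_progress"}, {"id": "owner", "value": "team-a"}]
update_record(vector, "status", "done")
traverse_and_trigger(vector,
                     trigger={"record_id": "status", "value": "done"},
                     actions=[lambda: print("running first workflow action")],
                     notify=print)
```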

Prevention and mitigation of corrupt database data

Embodiments of the present disclosure may provide a data protection system that performs identification of errors from queries on a database. The data protection system can further identify corrupted data from additional errors that are difficult to detect and that occur between layers of data in the database system. The data protection system can perform corrections of the erroneous data by rebuilding database data or removing the corrupted data.
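
A simplified sketch of the cross-layer idea above, with assumed layer names: corruption that a normal query would miss is found by comparing two layers holding the same data, then repaired by rebuilding from the base layer or removing what cannot be reconciled.

```python
# Simplified sketch (not the vendor's system): detect corruption between
# two layers of the same data (base rows vs. an index) and repair it by
# rebuilding from the base layer or removing unrecoverable entries.
def find_cross_layer_corruption(base_rows: dict, index_rows: dict) -> list:
    # A key present in only one layer, or with differing values, indicates
    # corruption between layers rather than in a single query result.
    keys = set(base_rows) | set(index_rows)
    return [k for k in keys if base_rows.get(k) != index_rows.get(k)]

def repair(base_rows: dict, index_rows: dict) -> dict:
    for key in find_cross_layer_corruption(base_rows, index_rows):
        if key in base_rows:
            index_rows[key] = base_rows[key]   # rebuild from the base layer
        else:
            index_rows.pop(key, None)          # remove unrecoverable data
    return index_rows

print(repair({"r1": "a", "r2": "b"}, {"r1": "a", "r2": "stale", "r9": "orphan"}))
```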

Cloud-based providing of one or more corrective measures for a storage system

An illustrative method includes detecting, by a cloud-based storage system services provider based on a problem signature, that a storage system has experienced a problem that is associated with the problem signature; and deploying, without user intervention, one or more corrective measures that modify the storage system to resolve the problem.
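
A minimal sketch of the signature-matching flow above, assuming invented signature and remediation structures: telemetry from a storage system is matched against known problem signatures, and the associated corrective measure is deployed automatically.

```python
# Minimal sketch with invented signatures and measures: match storage-system
# telemetry against known problem signatures and deploy the corrective
# measure automatically, without user intervention.
PROBLEM_SIGNATURES = [
    {"name": "write-latency-regression",
     "matches": lambda t: t.get("write_latency_ms", 0) > 20 and t.get("firmware") == "1.4.2",
     "corrective_measure": "rollback_firmware_to_1.4.1"},
]

def detect_and_remediate(telemetry: dict, deploy):
    for sig in PROBLEM_SIGNATURES:
        if sig["matches"](telemetry):
            deploy(sig["corrective_measure"])   # modify the storage system
            return sig["name"]
    return None

print(detect_and_remediate({"write_latency_ms": 35, "firmware": "1.4.2"},
                           deploy=lambda m: print("deploying:", m)))
```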

System and method for diagnosing resistive shorts in an information handling system
11592891 · 2023-02-28

An information handling system includes resistive short detection circuitry that measures a first amount of power provided by a power supply system, and measures a second amount of power drawn by components. The resistive short detection circuitry compares the first amount of power with the second amount of power. In response to the first amount of power being greater than the second amount of power, the resistive short detection circuitry determines that a short exists within the information handling system.
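
A hedged sketch of the comparison described above: supplied power is compared with the sum of component draws, and a surplus beyond a tolerance suggests a resistive short sinking the difference. The tolerance value is an assumption for the example.

```python
# Hedged sketch: compare power reported by the supply with the total drawn
# by known components; a surplus beyond a tolerance suggests a resistive
# short. The tolerance is an assumed value, not from the patent.
TOLERANCE_W = 2.0   # allowance for measurement error and unmodeled losses

def short_detected(supplied_w: float, component_draws_w: list) -> bool:
    drawn = sum(component_draws_w)
    return (supplied_w - drawn) > TOLERANCE_W

print(short_detected(450.0, [120.0, 180.0, 95.0, 40.0]))   # 15 W unaccounted -> True
print(short_detected(436.0, [120.0, 180.0, 95.0, 40.0]))   # within tolerance -> False
```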

System and method for autonomous data center operation and healing
11704189 · 2023-07-18

Methods and systems for autonomous computing include processing historical data to analyze past performance, collecting data from a plurality of connected devices over a network, and synchronizing the collected data from the plurality of connected devices with the processed historical data. Based on the synchronized data, the methods and systems detect an alert (error/fault) condition in one or more of the plurality of connected devices, trigger delivery of the detected alert condition to an automated network operations center (NOC), and match the detected alert condition to a historical alert condition at the network operations center. Based on the matching, the methods and systems determine a corrective action and, based on the determined corrective action, assign a virtual self-healing module from a plurality of virtual self-healing modules. Finally, performance of the determined corrective action by the assigned virtual self-healing module is triggered. A sketch of this flow follows.
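
The sketch below is illustrative only, with invented alert names, actions, and modules: a detected alert is matched to a historical alert, the corrective action is looked up, and a virtual self-healing module is assigned and triggered to perform it.

```python
# Illustrative sketch only: match a detected alert to a historical alert,
# look up the corrective action, and assign a virtual self-healing module
# to perform it. Alert names, actions, and modules are invented.
HISTORICAL_ALERTS = {
    "fan_speed_fault": "replace_fan_profile",
    "disk_smart_warning": "migrate_and_retire_disk",
}
SELF_HEALING_MODULES = {
    "replace_fan_profile": "thermal-healer-vm",
    "migrate_and_retire_disk": "storage-healer-vm",
}

def heal(detected_alert: str, perform) -> str:
    action = HISTORICAL_ALERTS[detected_alert]      # match to a historical alert
    module = SELF_HEALING_MODULES[action]           # assign a virtual self-healing module
    perform(module, action)                         # trigger the corrective action
    return f"{module} assigned to {action}"

print(heal("disk_smart_warning",
           perform=lambda module, action: print(module, "executing", action)))
```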