G06F11/0793

Remote debug for scaled computing environments

Techniques and apparatus for remotely accessing debugging resources of a target system are described. A target system including physical compute resources, such as processors and a chipset, can be coupled to a controller that is remotely accessible over a network. The controller can be arranged to facilitate remote access to debug resources of the physical compute resources. The controller can be coupled to debug pins, such as those of a debug port, and arranged to assert control signals on the pins to access the debug resources. The controller can also be arranged to exchange information elements with a remote debug host, including indications of debug operations and/or debug results.

Detecting shingled overwrite errors

Systems and methods are disclosed for detecting shingled overwrite errors. When a read error is encountered while reading from shingled recording tracks, a processor may determine whether the read error was caused by shingled overwriting. The processor may make this determination by measuring the read signal quality of one or more sectors preceding the read error, for example based on a bit error count or bit error ratio (BER), and comparing the read signal quality to a threshold value. The processor may determine that the read error is caused by shingled overwriting when the read signal quality value is lower than the threshold.
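
The threshold comparison described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the `SectorRead` record, the choice of -log10(BER) as the quality metric, the BER floor, and all function names are assumptions introduced here.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SectorRead:
    """Hypothetical per-sector read statistics (not defined in the source)."""
    bit_errors: int  # bit errors counted while reading the sector
    bits_read: int   # total bits read from the sector


def read_signal_quality(sectors: Sequence[SectorRead], ber_floor: float = 1e-12) -> float:
    """Return -log10(BER) over the given sectors; higher means a cleaner signal.

    Using -log10(BER) as the quality value is an assumption made for this sketch.
    """
    if not sectors:
        raise ValueError("need at least one preceding sector")
    total_errors = sum(s.bit_errors for s in sectors)
    total_bits = sum(s.bits_read for s in sectors)
    if total_bits <= 0:
        raise ValueError("sectors must contain at least one bit")
    ber = max(total_errors / total_bits, ber_floor)  # floor avoids log10(0)
    return -math.log10(ber)


def is_shingled_overwrite_error(preceding: Sequence[SectorRead], quality_threshold: float) -> bool:
    """Attribute the read error to shingled overwriting when the quality of the
    sectors preceding the error is lower than the threshold."""
    return read_signal_quality(preceding) < quality_threshold
```

With a quality threshold of 2.0 (a BER of 1e-2), preceding sectors with a few hundred bit errors per 32768 bits would be classified as overwritten, while sectors with only a handful of errors would not.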

Managing machine failure

A method, computer program product, and computer system are provided. A message storage area of an adjunct processor (AP) crypto adapter is filled with a plurality of command request messages sufficient to maximize utilization and performance of the AP crypto adapter. In response to detecting an error during execution of one of the plurality of command request messages, an AP crypto adapter command reply message that includes the error is generated. In response to the error being a non-recoverable failure, a state of the command request message is determined, wherein the state comprises an in-process state or a request-pending state. The AP crypto adapter command reply message is formatted, and the formatted message is stored in a message queue in the AP crypto adapter pending completion of machine failure recovery. The AP crypto adapter is recovered.

Apparatuses, methods, and computer program products for triggering component workflows within a multi-component system

Methods, apparatuses, or computer program products provide for triggering component workflows within a multi-component system. An update to one or more component metadata records of a component metadata vector associated with a first component identifier may be received. The component metadata vector may include a plurality of records. Each record of the plurality of records may include a unique component metadata record identifier and a component metadata value. The component metadata vector associated with the first component identifier may be traversed after updating the one or more component metadata records. Based at least in part on detecting a component metadata condition associated with a component workflow trigger associated with the first component identifier, a first component workflow action of a first component workflow action series comprising a plurality of component workflow actions may be executed. Furthermore, a component workflow trigger notification may be transmitted to a first computing device.
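
The update, traversal, and trigger sequence described above might look roughly like the sketch below. The dictionary representation of the metadata vector, the `WorkflowTrigger` class, and the `notify` callback are hypothetical names chosen for illustration; the source does not specify any of them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class WorkflowTrigger:
    """Hypothetical trigger: fires when one record's value satisfies a condition."""
    record_id: str                               # metadata record to watch
    condition: Callable[[float], bool]           # the component metadata condition
    action_series: List[Callable[[str], str]]    # ordered workflow actions


def update_and_evaluate(
    component_id: str,
    vector: Dict[str, float],    # record identifier -> metadata value
    updates: Dict[str, float],
    trigger: WorkflowTrigger,
    notify: Callable[[str], None],
) -> Optional[str]:
    """Apply updates, traverse the vector, and run the first action if the trigger fires."""
    vector.update(updates)
    for record_id, value in vector.items():      # traversal after the update
        if record_id == trigger.record_id and trigger.condition(value):
            outcome = trigger.action_series[0](component_id)  # first action of the series
            notify(f"workflow trigger fired for {component_id}: {outcome}")
            return outcome
    return None
```

Only the first action of the series is executed here, mirroring the abstract; how later actions are scheduled is left open.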

Prevention and mitigation of corrupt database data

Embodiments of the present disclosure may provide a data protection system that identifies errors from queries on a database. The data protection system can further identify corrupted data from additional errors that are difficult to detect and that occur between layers of data in the database system. The data protection system can correct the error data by rebuilding database data or removing the corrupted data.

Out-of-bounds recovery circuit

Out-of-bounds recovery circuits are configured to detect an out-of-bounds violation in an electronic device and to cause the electronic device to transition to a predetermined safe state when an out-of-bounds violation is detected. The out-of-bounds recovery circuits include detection logic configured to detect that an out-of-bounds violation has occurred when a processing element of the electronic device has fetched an instruction from an unallowable memory address range for the current operating state of the electronic device; and transition logic configured to cause the electronic device to transition to a predetermined safe state when an out-of-bounds violation has been detected by the detection logic.
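
Although the abstract describes hardware, the detection and transition logic can be modelled in software to show the idea. The sketch below is a behavioural model only; the operating states, the address ranges, and the function names are all invented for illustration.

```python
from enum import Enum
from typing import Dict, List


class State(Enum):
    BOOT = "boot"
    NORMAL = "normal"
    SAFE = "safe"      # the predetermined safe state


# Hypothetical allowable instruction-fetch ranges per operating state.
ALLOWED_FETCH_RANGES: Dict[State, List[range]] = {
    State.BOOT: [range(0x0000, 0x1000)],
    State.NORMAL: [range(0x1000, 0x8000)],
}


def is_out_of_bounds(state: State, fetch_address: int) -> bool:
    """Detection logic: the fetch lies outside every range allowed for this state."""
    allowed = ALLOWED_FETCH_RANGES.get(state, [])
    return not any(fetch_address in r for r in allowed)


def on_fetch(state: State, fetch_address: int) -> State:
    """Transition logic: move to the safe state on a violation, else keep the state."""
    if state is State.SAFE:
        return State.SAFE  # modelling assumption: the safe state is terminal
    return State.SAFE if is_out_of_bounds(state, fetch_address) else state
```

In this model a fetch from `0x2000` is legal in `NORMAL` but forces a transition to `SAFE` when it happens during `BOOT`.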

Systems and methods for self-healing and/or failure analysis of information handling system storage
11593191 · 2023-02-28

Systems and methods are provided that may be implemented to perform failure analysis and/or self-healing of information handling system storage. In one example, an information handling system may perform self-recovery actions to self-heal system storage issues when there is an OS boot failure due to a failure to detect a system storage drive, by determining one or more possible recovery actions based on a current system storage drive status retrieved by an embedded controller (EC) or other programmable integrated circuit of the information handling system. In another example, manufacturing quality control analysis may be performed on boot failure information that is collected at a remote server from multiple failed information handling systems.

Cloud-based providing of one or more corrective measures for a storage system

An illustrative method includes detecting, by a cloud-based storage system services provider based on a problem signature, that a storage system has experienced a problem that is associated with the problem signature; and deploying, without user intervention, one or more corrective measures that modify the storage system to resolve the problem.
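
One way to picture signature-based detection followed by automatic remediation is the sketch below. It assumes the storage system is represented as a plain dictionary of reported settings; the `ProblemSignature` class, the example signature, and the corrective measure are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ProblemSignature:
    """Hypothetical signature: a predicate over reported system state plus its fixes."""
    name: str
    matches: Callable[[Dict[str, str]], bool]
    corrective_measures: List[Callable[[Dict[str, str]], None]]


def detect_and_remediate(system: Dict[str, str], signatures: List[ProblemSignature]) -> List[str]:
    """Apply, without user intervention, the measures of every signature that matches.

    Returns the names of the signatures whose measures were deployed.
    """
    deployed = []
    for signature in signatures:
        if signature.matches(system):
            for measure in signature.corrective_measures:
                measure(system)  # each measure modifies the storage system state
            deployed.append(signature.name)
    return deployed


def _set_write_through(system: Dict[str, str]) -> None:
    """Illustrative corrective measure: switch the cache mode."""
    system["cache_mode"] = "write-through"


# Illustrative signature: an invented firmware/cache-mode combination.
EXAMPLE_SIGNATURE = ProblemSignature(
    name="fw-1.0-write-back",
    matches=lambda s: s.get("firmware") == "1.0" and s.get("cache_mode") == "write-back",
    corrective_measures=[_set_write_through],
)
```

Because the measure changes the state that the signature matches on, running the check a second time deploys nothing, which is a convenient property for unattended remediation.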

System and method for autonomous data center operation and healing
11704189 · 2023-07-18

Methods and systems for autonomous computing comprise processing historical data to analyze past performance, collecting data from a plurality of connected devices over a network, and synchronizing the collected data from the plurality of connected devices with the processed historical data. Based on the synchronized data, the disclosed methods and systems include detecting an alert (error/fault) condition in one or more of the plurality of connected devices; based on the detected alert condition, triggering delivery of the detected alert condition to an automated network operations center (NOC); and matching, by the NOC, the detected alert condition to a historical alert condition. Based on the matching, the methods and systems include determining a corrective action and, based on the determined corrective action, assigning a virtual self-healing module from a plurality of virtual self-healing modules. Finally, a trigger for performance of the determined corrective action by the assigned virtual self-healing module is initiated.

Implementing coherent accelerator function isolation for virtualization

A method, system and computer program product are provided for implementing coherent accelerator function isolation for virtualization in an input/output (IO) adapter in a computer system. A coherent accelerator provides accelerator function units (AFUs), and each AFU is adapted to operate independently of the other AFUs to perform a computing task that can be implemented within application software on a processor. The AFU has access to system memory bound to the application software and is adapted to make copies of that memory within AFU memory-cache in the AFU. As part of this memory coherency domain, each of the AFU memory-cache and processor memory-cache is adapted to be aware of changes to data commonly in either cache as well as data changed in memory of which the respective cache contains a copy.