G06F11/2242

Method and sequencer for detecting a malfunction occurring in a high performance computer

A method for monitoring the operation of an IT infrastructure including a plurality of calculation nodes includes: selecting calculation nodes for performing a calculation; performing the calculation via the selected calculation nodes; attributing, via a sequencer, a score to each of the calculation nodes that participated in the calculation, each score reflecting a difference between a measured operating parameter of that calculation node and a reference operating parameter of that calculation node; and verifying the operation of the calculation nodes that participated in the calculation, the verification being carried out using the scores attributed to those calculation nodes.
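The scoring and verification steps can be illustrated with a minimal sketch. The abstract does not specify how the difference is computed or how scores are used for verification, so the relative deviation, the 0.2 threshold, and all names (`NodeReading`, `attribute_scores`, `verify_nodes`) are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class NodeReading:
    node_id: str
    measured: float   # measured operating parameter (e.g. runtime in seconds)
    reference: float  # reference value of the same parameter for this node


def attribute_scores(readings):
    """Score each node by its relative deviation from its own reference (assumed metric)."""
    return {r.node_id: abs(r.measured - r.reference) / r.reference for r in readings}


def verify_nodes(scores, threshold=0.2):
    """Return the ids of nodes whose score exceeds the hypothetical threshold."""
    return sorted(node for node, score in scores.items() if score > threshold)


readings = [
    NodeReading("node-a", measured=10.1, reference=10.0),
    NodeReading("node-b", measured=14.0, reference=10.0),
    NodeReading("node-c", measured=9.8, reference=10.0),
]
scores = attribute_scores(readings)
suspects = verify_nodes(scores)
print(suspects)
```

Only nodes that took part in the calculation appear in `readings`, so the check is limited to participants, as in the abstract.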

Automated system-level failure and recovery

Systems and methods for automated system-level failure and recovery are described. In some embodiments, an Information Handling System (IHS) includes a processor and a memory, the memory having program instructions stored thereon that, upon execution by the processor, cause the IHS to execute a selected process configured to participate in an inter-process communication (IPC) with at least one other process, invoke an error handling process by simulating a fault in the IPC, and determine if the error handling process successfully handles the fault.
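As a rough illustration of the idea, the sketch below injects a one-shot fault into a stand-in IPC channel and checks whether a retry-based error handler recovers. The channel is simulated in-process, and `FaultyChannel`, `send_with_recovery`, and the retry policy are hypothetical; the abstract does not say which IPC mechanism or recovery strategy is used.

```python
class FaultyChannel:
    """Stand-in for an IPC channel that can be told to fail on the next send."""

    def __init__(self):
        self.delivered = []
        self.inject_fault = False

    def send(self, message):
        if self.inject_fault:
            self.inject_fault = False  # the simulated fault is one-shot
            raise ConnectionError("simulated IPC fault")
        self.delivered.append(message)


def send_with_recovery(channel, message, retries=1):
    """Error-handling path under test: retry the send after a failure (assumed policy)."""
    for _ in range(retries + 1):
        try:
            channel.send(message)
            return True
        except ConnectionError:
            continue
    return False


def recovery_succeeds(retries):
    """Simulate a fault, run the handler, and report whether the message got through."""
    channel = FaultyChannel()
    channel.inject_fault = True
    handled = send_with_recovery(channel, "ping", retries=retries)
    return handled and channel.delivered == ["ping"]


print(recovery_succeeds(retries=1), recovery_succeeds(retries=0))
```

With one retry the handler absorbs the injected fault; with none it does not, which is the pass/fail determination the abstract describes.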

MULTI-CORE PROCESSOR DEBUGGING SYSTEMS AND METHODS
20240338288 · 2024-10-10 ·

The present invention facilitates efficient and effective information storage device operations. In one embodiment, a system comprises a plurality of processing cores configured to process information and a debug system coupled to the plurality of processing cores. The processing cores are configured to perform respective test operations on themselves. The debug system is configured to gather results of the test operations on a flexible compaction basis: a compacted indication of a passing test result is available on a debug cluster basis, while a compacted indication of a failed test result on a debug cluster basis is further resolved to identify the failing processing core within the cluster. The processing cores are organized in clusters, wherein a set comprising more than one of the plurality of processing cores and fewer than all of the processing cores is considered a cluster. The processing cores can be organized in levels, wherein the number of processing cores in a cluster differs at different levels. The different levels correspond to a debug hierarchy.
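A single-level version of this compaction can be sketched as follows. It assumes that a cluster's compacted indication is the logical AND of its cores' pass flags and that clusters are fixed-size, contiguous groups; the abstract states neither, and the function names are hypothetical.

```python
def compact(results):
    """Compacted indication for a cluster: True only if every core passed."""
    return all(results)


def find_failing_cores(core_results, cluster_size):
    """Check one compacted bit per cluster; inspect individual cores only on failure."""
    failing = []
    for start in range(0, len(core_results), cluster_size):
        cluster = core_results[start:start + cluster_size]
        if compact(cluster):
            continue  # passing cluster: no per-core inspection needed
        failing.extend(start + i for i, ok in enumerate(cluster) if not ok)
    return failing


# Eight cores in two clusters of four; core 5 fails its test.
core_results = [True, True, True, True, True, False, True, True]
failing = find_failing_cores(core_results, cluster_size=4)
print(failing)
```

A multi-level debug hierarchy could repeat the same step with a different `cluster_size` at each level.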

SYSTEMS AND METHODS FOR USAGE OF SPARE CORES IN CONNECTION WITH IN-FIELD TESTS OF OPERATIONAL CORES
20240296102 · 2024-09-05 ·

The technology generally relates to systems and methods for performing in-field testing of processing cores within a system-on-chip (SoC), so as to identify faults, including those associated with silent data corruption. For example, an SoC may contain operational cores and spare cores. An operational core may be selected for testing while a spare core is used to replace the tested core. In addition, a spare core may be used to replace an operational core that has been determined to be corrupted.
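The swap between operational and spare cores might look like the sketch below. The `CorePool` class, the `run_test` callback, and the choice to return a healthy tested core to the spare set are illustrative assumptions, not details taken from the abstract.

```python
class CorePool:
    """Tracks which cores are operational, spare, or retired (hypothetical model)."""

    def __init__(self, operational, spares):
        self.operational = list(operational)
        self.spares = list(spares)
        self.retired = []

    def test_in_field(self, core, run_test):
        """Swap a spare in for `core`, test `core`, then keep it as a spare or retire it."""
        if not self.spares:
            raise RuntimeError("no spare core available")
        spare = self.spares.pop()
        index = self.operational.index(core)
        self.operational[index] = spare   # the spare takes over the workload
        if run_test(core):
            self.spares.append(core)      # healthy: the tested core becomes a spare
        else:
            self.retired.append(core)     # faulty: the spare replaces it permanently
        return spare


pool = CorePool(operational=["c0", "c1"], spares=["s0"])
pool.test_in_field("c1", run_test=lambda core: False)  # pretend c1 shows corruption
print(pool.operational, pool.spares, pool.retired)
```

Rotating a healthy core into the spare set keeps one core free for the next test; a real system could equally swap the tested core back into service.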

Vehicle Control Device

In the present invention, computational efficiency degradation is suppressed when diagnosing a shared storage area in a vehicle control device in which a plurality of computing units are employed. This vehicle control device suppresses computational efficiency degradation by changing an access destination in a storage device while diagnosing a shared storage area that the storage device has.

IN-FIELD SELF-TEST CONTROLLER FOR SAFETY CRITICAL AUTOMOTIVE USE CASES

A self-test controller includes a memory configured to store test patterns, configuration registers, and a memory data component. The test patterns are encoded in the memory using various techniques in order to save storage space. Using the configuration parameters, the memory data component is configured to decode the test patterns and perform multiple built-in self-tests on a multitude of test cores. The described techniques allow built-in self-tests to be performed dynamically while using less space in the memory.
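The abstract does not name the encoding techniques. Purely as an example of how an encoded pattern could save space and be decoded before a self-test, the sketch below uses run-length encoding; that choice, the `(value, run_length)` layout, and the modelling of a core as a function that echoes its input are all assumptions.

```python
def rle_decode(encoded):
    """Expand (value, run_length) pairs into the full test pattern."""
    pattern = []
    for value, count in encoded:
        pattern.extend([value] * count)
    return pattern


def run_bist(core, pattern):
    """Apply the pattern to a core (modelled as a function) and compare with the expected echo."""
    return core(pattern) == pattern


def healthy_core(bits):
    return list(bits)


def faulty_core(bits):
    """Hypothetical defect: the first bit is always read back inverted."""
    return [1 - bits[0]] + list(bits[1:])


encoded = [(0, 3), (1, 2), (0, 1)]  # three pairs stored instead of six bits
pattern = rle_decode(encoded)
print(pattern, run_bist(healthy_core, pattern), run_bist(faulty_core, pattern))
```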

SYSTEM-ON-CHIP INCLUDING CPU OPERATING AS DEBUG HOST AND METHOD OF OPERATING THE SAME
20180224504 · 2018-08-09 ·

Provided is a method of operating a system-on-chip (SoC) including a plurality of CPUs. The method includes: receiving a debug request by a first CPU of the CPUs; outputting a first signal to the CPUs by the first CPU in response to the debug request; selecting a second CPU from the CPUs to control the debugging based on the first signal; and performing a debug operation by selecting a debug target block by the second CPU.

DEBUGGING METHOD, MULTI-CORE PROCESSOR, AND DEBUGGING DEVICE

Embodiments of the present invention relate to the field of computer technologies. The embodiments of the present invention provide a debugging method, including: stopping running, by a core A of a multi-core processor, and sending a running stop signal to other cores in a process of stopping running; after receiving a first stop termination instruction and resuming running, executing a debugging information collection function and stopping running after completing the execution of the debugging information collection function; after receiving a second stop termination instruction and resuming running, sending a running resumption instruction to the other cores; and hitting the pending breakpoint in a process of running an operation object of the preset event, so as to enter a debugging state. According to the technical solutions provided in the embodiments of the present invention, kernel mode code and user mode code can be debugged on a same debugging platform.

DEBUGGING METHOD, MULTI-CORE PROCESSOR AND DEBUGGING DEVICE

Embodiments of the present invention relate to the field of computer technologies. The embodiments of the present invention provide a debugging method, including: starting, by a core A of a multi-core processor after completing execution of a preset event processing routine, to stop running, and sending a running stop signal to other cores in a process of stopping running; after receiving a first stop termination instruction and resuming running, executing a debugging information collection function to collect debugging information of the preset event, and stopping running after completing the execution of the debugging information collection function; and after receiving a second stop termination instruction and resuming running, sending a running resumption instruction to the other cores. By means of the technical solutions provided in the embodiments of the present invention, kernel mode code and user mode code can be debugged on a same debugging platform.

PERIODIC NON-INTRUSIVE DIAGNOSIS OF LOCKSTEP SYSTEMS
20180203778 · 2018-07-19 ·

Aspects disclosed herein relate to periodic non-intrusive diagnosis of lockstep systems. An exemplary method includes comparing execution of a program on a first processing system of the plurality of processing systems and execution of the program on a second processing system of the plurality of processing systems using a first comparator circuit, comparing the execution of the program on the first processing system and the execution of the program on the second processing system using a second comparator circuit, and running a diagnosis program on the second comparator circuit while the comparing using the first comparator circuit is ongoing.
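A software model of the dual-comparator arrangement is sketched below. The real mechanism is a hardware circuit running concurrently; here the two activities are simply evaluated in one function call. The `Comparator` class, its `stuck_equal` fault model, and the two-probe diagnosis are hypothetical and only illustrate why a second comparator allows diagnosis without interrupting the lockstep check.

```python
class Comparator:
    """Model of a comparator circuit that flags differing outputs."""

    def __init__(self, stuck_equal=False):
        self.stuck_equal = stuck_equal  # hypothetical fault: always reports a match

    def mismatch(self, a, b):
        return False if self.stuck_equal else a != b


def diagnose(comparator):
    """Feed known-equal and known-different inputs; a healthy comparator flags only the latter."""
    return (not comparator.mismatch(1, 1)) and comparator.mismatch(1, 2)


def lockstep_step(out_a, out_b, active, under_test):
    """The active comparator keeps checking both systems while the other one is diagnosed."""
    return active.mismatch(out_a, out_b), diagnose(under_test)


# Outputs agree and the comparator under diagnosis is healthy.
print(lockstep_step(7, 7, Comparator(), Comparator()))
# Outputs diverge and the comparator under diagnosis is stuck at "equal".
print(lockstep_step(7, 8, Comparator(), Comparator(stuck_equal=True)))
```

Swapping the roles of the two comparators on the next diagnosis period would let each be checked in turn while the comparison never pauses.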