G06F11/0724

SOFTWARE VISIBLE AND CONTROLLABLE LOCK-STEPPING WITH CONFIGURABLE LOGICAL PROCESSOR GRANULARITIES

A processor is described. The processor includes model specific register space that is visible to software above a BIOS level. The model specific register space is to specify a granularity of a processing entity of a lock-step group. The processor also includes logic circuitry to support dynamic entry/exit of the lock-step group's processing entities to/from lock-step mode including: i) termination of lock-step execution by the processing entities before the program code to be executed in lock-step is fully executed; and, ii) as part of the exit from the lock-step mode, restoration of a state of a shadow processing entity of the processing entities as the state existed before the shadow processing entity entered the lock-step mode and began lock-step execution of the program code.

SYSTEM AND METHOD FOR N-MODULAR REDUNDANT COMMUNICATION

A fault tolerant consensus generation and communication system and method is described. Each processing node in the system receives a plurality of measurements from a sensor, calculates a consolidated value for the received plurality of measurements, transmits the consolidated value to other processing nodes, receives consolidated values from the other processing nodes, calculates a consensus value based on the calculated consolidated value and the received one or more consolidated values, transmits the calculated consensus value to the other processing nodes, receives consensus values from the other processing nodes, generates a consensus message based on the calculated consensus value, the received one or more consensus values, and a predefined criterion, and, in a case where the consensus message is not present in a consensus queue, adds the consensus message to the consensus queue.
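
The exchange described in this abstract can be illustrated with a minimal sketch. It assumes the median as both the consolidation and the consensus function, and a simple quorum count as the "predefined criterion"; the abstract names none of these, so every function name and the quorum rule below are hypothetical.

```python
from collections import Counter
from statistics import median


def consolidate(measurements):
    """Collapse one node's sensor readings into a single value (median is assumed)."""
    return median(measurements)


def consensus_value(own, received):
    """Combine the local consolidated value with those received from peers."""
    return median([own, *received])


def consensus_message(own, received, quorum):
    """Return the value backed by at least `quorum` nodes, else None (assumed criterion)."""
    value, count = Counter([own, *received]).most_common(1)[0]
    return value if count >= quorum else None


def enqueue_if_absent(queue, message):
    """Append the message only when it is not already in the queue."""
    if message is None or message in queue:
        return False
    queue.append(message)
    return True


# Three nodes; node "c" has a faulty sensor.
readings = {"a": [10.0, 10.2, 9.8], "b": [10.1, 10.0, 9.9], "c": [55.0, 54.0, 56.0]}
consolidated = {node: consolidate(r) for node, r in readings.items()}

# Each node derives a consensus value from its own and its peers' consolidated values.
consensus = {
    node: consensus_value(own, [v for peer, v in consolidated.items() if peer != node])
    for node, own in consolidated.items()
}

# Node "a" builds its consensus message and tries to queue it twice.
queue = []
peer_values = [v for peer, v in consensus.items() if peer != "a"]
message = consensus_message(consensus["a"], peer_values, quorum=2)
first_add = enqueue_if_absent(queue, message)
second_add = enqueue_if_absent(queue, message)
```

With these inputs the faulty node's outlier is outvoted at the consensus step, and the second enqueue attempt is rejected because the message is already present.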

Apparatus and method for fault handling of an offload transaction
11372711 · 2022-06-28

An apparatus and method for fault handling of an offload transaction are described. For example, one embodiment of a processor comprises: a plurality of cores; an interconnect coupling the plurality of cores; and offload circuitry to transfer work from a first core of the plurality of cores to a second core of the plurality of cores without operating system (OS) intervention, the work comprising a plurality of instructions; the second core comprising first fault management logic to determine an action to take responsive to a fault condition, wherein responsive to detecting a first type of fault condition, the first fault management logic is to cause the first core to be notified of the fault condition, the first core comprising second fault management logic to attempt to resolve the fault condition.

Methods and systems for reducing downtime from system management mode in a computer system

A system and method for shortening the time spent in system management mode when a fault occurs in a hardware component of a computer system are disclosed. The computer system has hardware components that may have faults. Notification of an error in one of the hardware components is received through RAS silicon on a processing unit. The error is detected from the hardware component by a system management interrupt handler executed by a bootstrap processor core. The error data is logged into a system error log via a system control interrupt handler executed by the processing unit. System management mode is avoided during the logging of the error data, which prevents the other processor cores from being suspended by system management mode.

Method and apparatus for self-diagnosis of RAM error detection logic of powertrain controller
11347582 · 2022-05-31

A method for the self-diagnosis of RAM error detection logic of a powertrain controller includes: idling, by a first core, an operation of a second core; testing an error correction code (ECC) module corresponding to a RAM operated by the second core; idling, by the second core, an operation of a core of a plurality of untested cores; and testing an ECC module corresponding to a RAM operated by the core of the plurality of untested cores.
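
The hand-off order in this method can be sketched as a chain in which each core, once tested, becomes the tester of the next untested core. This is a simulation only: `run_chain` and the `test_ecc` callback are hypothetical names, and the abstract does not say how the first core's own ECC module is covered, so the sketch leaves it out.

```python
def run_chain(cores, test_ecc):
    """Walk the cores in order; each tested core then idles and tests the next one.

    cores    -- ordered core names; cores[0] starts as the tester
    test_ecc -- hypothetical callback returning True if the ECC module of the
                given core's RAM detects an injected error
    """
    results = {}
    tester = cores[0]
    for target in cores[1:]:
        # The tester idles the target core, then exercises the ECC module of its RAM.
        results[target] = (tester, test_ecc(target))
        # The core that was just tested takes over as tester.
        tester = target
    return results


# Simulated ECC modules: core2's detection logic is assumed to be broken.
ecc_ok = {"core1": True, "core2": False, "core3": True}
report = run_chain(["core0", "core1", "core2", "core3"], lambda core: ecc_ok[core])
```

Each entry in `report` records which core ran the test and whether the ECC logic passed.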

METHOD FOR ENCODED DIAGNOSTICS IN A FUNCTIONAL SAFETY SYSTEM
20230273851 · 2023-08-31

A method includes: storing a set of valid codewords including a first valid functional codeword representing a functional state of a controller subsystem; a first valid fault codeword representing a fault state of the controller subsystem and characterized by a minimum Hamming distance from the first valid functional codeword; a second valid functional codeword representing a functional state of a controller; and a second valid fault codeword representing a fault state of the controller; in response to detecting functional operation of the controller subsystem, storing the first valid functional codeword in a first memory; in response to detecting a match between contents of the first memory and the first valid functional codeword, outputting the second valid functional codeword; and, in response to detecting a mismatch between contents of the first memory and every codeword in the set of valid codewords, outputting the second valid fault codeword.
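
A minimal sketch of the codeword check follows. The 8-bit codeword values and the minimum distance of 4 are illustrative assumptions, not values from the abstract. The abstract also does not state what is output when the memory holds the subsystem's fault codeword; the sketch assumes the controller fault codeword in that case too.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two codewords differ."""
    return bin(a ^ b).count("1")


# Hypothetical 8-bit codewords chosen for illustration only.
SUBSYSTEM_OK = 0b10100101
SUBSYSTEM_FAULT = 0b01011010
CONTROLLER_OK = 0b11000011
CONTROLLER_FAULT = 0b00111100
VALID_CODEWORDS = {SUBSYSTEM_OK, SUBSYSTEM_FAULT, CONTROLLER_OK, CONTROLLER_FAULT}

MIN_DISTANCE = 4  # assumed minimum separation between functional and fault codewords
assert hamming_distance(SUBSYSTEM_OK, SUBSYSTEM_FAULT) >= MIN_DISTANCE


def controller_output(first_memory: int) -> int:
    """Map the contents of the first memory to the controller-level codeword."""
    if first_memory == SUBSYSTEM_OK:
        return CONTROLLER_OK
    # Covers both a corrupted value (matches no valid codeword) and, by
    # assumption, the subsystem fault codeword.
    return CONTROLLER_FAULT


corrupted = SUBSYSTEM_OK ^ 0b1  # a single bit flip in memory
```

Because the functional and fault codewords are far apart, a single bit flip yields a value that matches no valid codeword and is reported as a fault rather than silently read as "functional".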

Method and system for fault collection and reaction in system-on-chip

A fault collection and reaction system on a system-on-chip (SoC) includes a plurality of reaction cores assigned to a plurality of applications being executed by a plurality of processor cores on the SoC, at least one look-up table (LUT), and a controller. The at least one LUT stores therein a first mapping between the plurality of reaction cores and corresponding plurality of domain identifiers, and a second mapping between a plurality of faults and a set of reaction combinations. The controller receives a fault indication and a first domain identifier in response to occurrence of a first fault and selects from the plurality of reaction cores, a first reaction core mapped to the first domain identifier, and from the set of reaction combinations, a first reaction combination mapped to the first fault. The first reaction core responds to the fault indication with a reaction based on the selected reaction combination.
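
The two look-up table mappings can be sketched as plain dictionaries. The domain identifiers, fault names, and reaction names below are hypothetical placeholders; the abstract specifies only that one mapping goes from domain identifier to reaction core and the other from fault to reaction combination.

```python
# First mapping: domain identifier -> reaction core (hypothetical values).
CORE_BY_DOMAIN = {
    0x1: "reaction_core_0",
    0x2: "reaction_core_1",
}

# Second mapping: fault -> reaction combination (hypothetical values).
REACTIONS_BY_FAULT = {
    "ecc_uncorrectable": ("raise_interrupt", "reset_domain"),
    "watchdog_timeout": ("raise_interrupt",),
}


def handle_fault(fault: str, domain_id: int):
    """Select the reaction core for the domain and the reactions for the fault."""
    core = CORE_BY_DOMAIN[domain_id]
    reactions = REACTIONS_BY_FAULT[fault]
    return core, reactions


selected = handle_fault("ecc_uncorrectable", 0x2)
```

Keeping the two mappings separate means the same fault can be routed to different reaction cores depending on which domain reported it.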

Method and system for detecting GPU-related factors of multi-mode distributed cluster

A method for detecting comprehensive GPU-related factors of a distributed cluster, the method including: (1) checking whether there is a configuration file content of an operating node; (2) reading a mode parameter in an environment variable of the operating node, and correspondingly switching an operating mode according to the mode parameter; (3) reading a timer frequency value from the environment variable of the operating node so as to set a time period for reading a GPU information parameter according to the timer frequency value; (4) calculating the maximum value of the GPU information parameter of the operating node, and storing the maximum value into a GPU information list cache; and (5) initializing the transmitted information and determining whether there is a GPU in the GPU information list cache of the operating node.

Multicore system for determining processor state abnormality based on a comparison with a separate checker processor
11327853 · 2022-05-10

A multicore system according to one or more embodiments is disclosed, which may include processors that execute processing different from each other, a selector that selects one of the processors, a checker processor, a comparator that compares an external state of the processor selected by the selector with an external state of the checker processor, or compares an internal state of the processor selected by the selector with an internal state of the checker processor, and a controller that determines that the selected processor or the checker processor is abnormal in response to the external states or the internal states not matching each other based on comparison results obtained by the comparator.
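
The selector and comparator arrangement can be sketched in software as follows. It models each processor's "external state" as the output of a function and simulates a fault by flipping an output bit; both are simplifying assumptions, and the names `TASKS`, `make_processor`, and `is_abnormal` are hypothetical.

```python
# Each processor runs different processing (illustrative tasks).
TASKS = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]


def make_processor(task, flip_mask=0):
    """Return a processor; a nonzero flip_mask simulates a fault by flipping output bits."""
    return lambda x: task(x) ^ flip_mask


def is_abnormal(processors, selected, checker_task, x):
    """Compare the selected processor's output with the checker's for the same input."""
    return processors[selected](x) != checker_task(x)


processors = [
    make_processor(TASKS[0]),
    make_processor(TASKS[1], flip_mask=1),  # simulated faulty processor
    make_processor(TASKS[2]),
]

# The selector visits each processor in turn; the checker repeats that processor's task.
results = [is_abnormal(processors, i, TASKS[i], 5) for i in range(len(processors))]
```

A mismatch alone does not say which side failed, which is consistent with the abstract's wording that either the selected processor or the checker processor is determined to be abnormal.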

Hardware-based fault scanner to detect faults in homogeneous processing units

Apparatuses, systems, and techniques to detect faults in processing pipelines are described. One accelerator circuit includes a fixed-function circuit that performs an operation corresponding to a layer of a neural network. The fixed-function circuit includes a set of homogeneous processing units and a fault scanner circuit. The fault scanner circuit includes an additional homogeneous processing unit to scan each processing unit of the set for functional faults in a sequence.
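
The sequential scan can be sketched as one spare unit repeating each unit's operation and comparing results. The multiply-accumulate operation, the `Unit` class, and the way a fault is simulated are all assumptions for illustration; the abstract does not describe the comparison mechanism at this level.

```python
def mac(a, b, acc):
    """Multiply-accumulate, standing in for the operation every unit performs."""
    return acc + a * b


class Unit:
    """One of the homogeneous processing units; `faulty` simulates a functional fault."""

    def __init__(self, faulty=False):
        self.faulty = faulty

    def run(self, a, b, acc):
        result = mac(a, b, acc)
        return result + 1 if self.faulty else result


def scan(units, scanner, operands):
    """Check each unit in sequence against the additional scanner unit."""
    faulty_indices = []
    for index, unit in enumerate(units):
        if unit.run(*operands) != scanner.run(*operands):
            faulty_indices.append(index)
    return faulty_indices


units = [Unit(), Unit(), Unit(faulty=True), Unit()]
found = scan(units, Unit(), (2, 3, 1))
```

Because the units are homogeneous, a single extra unit of the same design is enough to serve as the reference for all of them.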