G06F11/2242

Fault tolerant computer systems and methods establishing consensus for which processing system should be the prime string

Systems and methods for fault tolerant computing in accordance with various embodiments of the invention are disclosed. Fault tolerant computer systems in accordance with a number of embodiments of the invention include multiple processing systems supervised by a Fault Management Unit (FMU). The FMU can build a representation of the state of all of the multiple processing systems and then determine which of the processing systems to utilize to perform a particular function based upon this state representation.
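The selection step described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the disclosure: the state representation is modeled as a table of health observations (`votes[observer][subject]`), a processing system counts as healthy when a strict majority of observers report it healthy, and the function names `healthy_systems` and `select_prime` are hypothetical.

```python
from typing import Optional


def healthy_systems(votes: dict[str, dict[str, bool]]) -> list[str]:
    """Return the systems that a strict majority of observers report as healthy.

    votes[observer][subject] is True when `observer` sees `subject` as healthy.
    A missing observation is counted as "not healthy" (an assumption).
    """
    observers = list(votes)
    subjects = sorted({s for view in votes.values() for s in view})
    healthy = []
    for subject in subjects:
        yes = sum(1 for o in observers if votes[o].get(subject, False))
        if 2 * yes > len(observers):  # strict majority
            healthy.append(subject)
    return healthy


def select_prime(votes: dict[str, dict[str, bool]],
                 current_prime: Optional[str] = None) -> Optional[str]:
    """Pick the system to use as prime from the aggregated state.

    Keeps the current prime while it is still healthy, to avoid needless
    switchovers; otherwise falls back to the first healthy system by name.
    Returns None when no system is considered healthy.
    """
    healthy = healthy_systems(votes)
    if current_prime in healthy:
        return current_prime
    return healthy[0] if healthy else None
```

For example, if systems A and B each report A and B healthy and C faulty, while C reports only B and itself healthy, the sketch keeps B as prime when B is already prime and moves from C to A when C was the prime.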

Critical path failure analysis using hardware instruction injection

Critical path failure analysis using hardware instruction injection may include providing, by an instruction microcontroller, to a plurality of processor cores, one or more test instruction sequences, wherein the instruction microcontroller is coupled to, for each of the plurality of processor cores: a first multiplexer providing an input to an instruction queue, and a second multiplexer receiving an input from the instruction queue and providing an output to an execution pathway; performing, by the instruction microcontroller, based on one or more test instruction sequences, one or more of a scan-in last pass (SLP) analysis or a scan-in cycle offset (SCO) analysis; and determining, based on one or more of the SLP analysis or the SCO analysis, one or more of a critical instruction sequence or a critical component path associated with the plurality of processor cores.

Fault Isolation and Recovery of CPU Cores for Failed Secondary Asymmetric Multiprocessing Instance
20210173753 · 2021-06-10

According to certain embodiments, a system includes one or more processors and one or more computer-readable non-transitory storage media comprising instructions that, when executed by the one or more processors, cause one or more components to perform operations including executing a software process of a secondary instance, the secondary instance running in parallel with a primary instance and associated with a plurality of cores including a bootstrap core, registering a non-maskable interrupt for the bootstrap core in the secondary instance, determining whether the secondary instance is in a fault state, wherein, if the secondary instance is in the fault state, halting the plurality of cores associated with the secondary instance, without impact to the primary instance, and recovering the bootstrap core by switching a context of the bootstrap core from the secondary instance to the primary instance via the non-maskable interrupt.

HARDWARE RELIABILITY DIAGNOSTICS AND FAILURE DETECTION VIA PARALLEL SOFTWARE COMPUTATION AND COMPARE
20210165730 · 2021-06-03

Methods, apparatus, and software for hardware reliability diagnostics and failure detection via parallel software computation and compare. Parallel testing is performed on hardware resources such as processor cores, accelerators, and Other Processing Units (XPUs) using test algorithms such as encryption/decryption. The results of the testing (the algorithm outputs) are compared to detect errant hardware. Comparison may be across cores (via execution of software-based algorithms), across accelerators/XPUs (via algorithms implemented in hardware), or between cores and accelerators/XPUs. Techniques are disclosed to enable all cores to be tested while a platform is performing a workload, such as in a data center environment, wherein unused cores are used for testing, with workloads being migrated between cores between tests.
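The compute-and-compare idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the disclosed implementation: a SHA-256 digest stands in for the test algorithm, a thread pool stands in for dispatch to specific cores or accelerators (no core pinning is performed), the majority output is treated as the reference, and the names `run_test_algorithm`, `find_errant_units`, and `compare_units` are hypothetical.

```python
import hashlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def run_test_algorithm(unit_id: int, payload: bytes) -> str:
    """Stand-in for a deterministic test algorithm run on one hardware unit."""
    return hashlib.sha256(payload).hexdigest()


def find_errant_units(results: dict[int, str]) -> list[int]:
    """Return the IDs of units whose output differs from the majority output."""
    if not results:
        return []
    majority_output, _ = Counter(results.values()).most_common(1)[0]
    return sorted(uid for uid, out in results.items() if out != majority_output)


def compare_units(unit_ids: Iterable[int], payload: bytes,
                  runner: Callable[[int, bytes], str] = run_test_algorithm) -> list[int]:
    """Run the same test on every unit in parallel and compare the outputs."""
    with ThreadPoolExecutor() as pool:
        futures = {uid: pool.submit(runner, uid, payload) for uid in unit_ids}
        results = {uid: fut.result() for uid, fut in futures.items()}
    return find_errant_units(results)
```

A majority reference only works when most units agree; with two units, or an even split, a mismatch shows that something is wrong but not which unit is at fault, so a real design would need a trusted reference or a tie-breaking rule.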

Validation of multiprocessor hardware component

A method, apparatus and computer program product to be employed by a hardware component under validation, wherein the hardware component has a plurality of processing units each belonging to one of at least two types, such that one of the at least two types of processing units is less error-prone than a remainder of the at least two types. The method comprises: designating one of the processing units of the hardware component under validation that belongs to the less error-prone type as a manager processing unit; initiating execution of a tester program code for testing processing units, by processing units of the hardware component other than the manager processing unit; and monitoring, by the manager processing unit, the status of the processing units during execution of the tester program code.

Fault isolation and recovery of CPU cores for failed secondary asymmetric multiprocessing instance

According to certain embodiments, a system includes one or more processors and one or more computer-readable non-transitory storage media comprising instructions that, when executed by the one or more processors, cause one or more components to perform operations including executing a software process of a secondary instance, the secondary instance running in parallel with a primary instance and associated with a plurality of cores including a bootstrap core, registering a non-maskable interrupt for the bootstrap core in the secondary instance, determining whether the secondary instance is in a fault state, wherein, if the secondary instance is in the fault state, halting the plurality of cores associated with the secondary instance, without impact to the primary instance, and recovering the bootstrap core by switching a context of the bootstrap core from the secondary instance to the primary instance via the non-maskable interrupt.

METHODS AND SYSTEMS FOR PROACTIVE MANAGEMENT OF NODE FAILURE IN DISTRIBUTED COMPUTING SYSTEMS

Embodiments for managing distributed computing systems are provided. Information associated with operation of a computing node within a distributed computing system is collected. A reliability score for the computing node is calculated based on the collected information. The calculating of the reliability score is performed utilizing the computing node. A remedial action associated with the operation of the computing node is caused to be performed based on the calculated reliability score.
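A minimal sketch of the collect, score, and act flow is shown below. The abstract does not specify which information is collected or how the score is computed, so the metric fields, the weights, the thresholds, and the action names ("drain", "warn", "none") are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class NodeMetrics:
    """Hypothetical operating information collected on the node itself."""
    corrected_errors: int
    uptime_hours: float
    failed_health_checks: int
    total_health_checks: int


def reliability_score(m: NodeMetrics) -> float:
    """Return a score in [0, 1]; higher means more reliable.

    The 0.6/0.4 weights and the 10-errors-per-hour cap are arbitrary
    illustrative choices, not values from the source.
    """
    if m.total_health_checks == 0:
        check_pass_rate = 1.0
    else:
        check_pass_rate = 1.0 - m.failed_health_checks / m.total_health_checks
    errors_per_hour = m.corrected_errors / max(m.uptime_hours, 1.0)
    error_penalty = min(errors_per_hour / 10.0, 1.0)
    return round(0.6 * check_pass_rate + 0.4 * (1.0 - error_penalty), 3)


def remedial_action(score: float, drain_below: float = 0.5,
                    warn_below: float = 0.8) -> str:
    """Map a reliability score to a remedial action name."""
    if score < drain_below:
        return "drain"  # e.g. migrate workloads off the node
    if score < warn_below:
        return "warn"   # e.g. alert an operator
    return "none"
```

In this sketch both functions would run on the node being scored, matching the statement that the calculation is performed utilizing the computing node itself.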

System, apparatus and method for in-field self testing in a diagnostic sleep state

In one embodiment, a processor includes at least one core and an interface circuit to interface the at least one core to additional circuitry of the processor. In response to an in-field self test instruction, at least one core may save state to a low power memory, enter into a diagnostic sleep state and execute an in-field self test in the diagnostic sleep state in which the at least one core appears to be inactive. Other embodiments are described and claimed.

Systems and methods for validation of test results in network testing
10936396 · 2021-03-02

A network testing system includes one or more test devices each including a media-specific testing module and a processing device with a network interface, wherein the processing device is configured to test a network with the media-specific testing module; one or more servers configured to receive test results from the test of the network either directly from the one or more test devices or an intermediate data source communicatively coupled to the one or more test devices; and a validator module executed on the one or more servers configured to perform automated post-processing on the test results to compare the test results to a pre-defined Method of Procedure (MOP), to auto-correct one or more errors in the test results, and to provide a report based on the comparison.
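The validator's post-processing can be sketched roughly as below. The data shapes are assumptions made for illustration: the MOP is modeled as a mapping from test name to a maximum allowed value, each test result is a dictionary with `test` and `value` keys, and auto-correction is limited to normalizing test names and coercing values to numbers. The function name `validate` is hypothetical.

```python
def validate(results: list[dict], mop: dict[str, float]) -> dict:
    """Compare test results against a MOP and return a report.

    mop maps a test name to the maximum allowed measured value (assumed).
    """
    corrected: dict[str, float] = {}
    fixes: list[str] = []
    for r in results:
        # Auto-correct simple entry errors: stray whitespace and letter case.
        name = r["test"].strip().lower()
        if name != r["test"]:
            fixes.append(f"renamed {r['test']!r} to {name!r}")
        corrected[name] = float(r["value"])  # accept numeric strings

    report = {"passed": [], "failed": [], "missing": [], "auto_corrected": fixes}
    for name, limit in mop.items():
        if name not in corrected:
            report["missing"].append(name)   # required by the MOP, not run
        elif corrected[name] <= limit:
            report["passed"].append(name)
        else:
            report["failed"].append(name)
    return report
```

Tests that appear in the results but not in the MOP are ignored in this sketch; a fuller validator might flag them as unexpected.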

DEBUG FOR MULTI-THREADED PROCESSING
20230418718 · 2023-12-28

A system to implement debugging for a multi-threaded processor is provided. The system includes a hardware thread scheduler configured to schedule processing of data, and a plurality of schedulers, each configured to schedule a given pipeline for processing instructions. The system further includes a debug control configured to control at least one of the plurality of schedulers to halt, step, or resume the given pipeline of the at least one of the plurality of schedulers for the data to enable debugging thereof. The system further includes a plurality of hardware accelerators configured to implement a series of tasks in accordance with a schedule provided by a respective scheduler in accordance with a command from the debug control. Each of the plurality of hardware accelerators is coupled to at least one of the plurality of schedulers to execute the instructions for the given pipeline and to a shared memory.