G06F11/24

Method and System for Intelligent Failure Diagnosis Center for Burn-In Devices Under Test

A mechanism is provided for automatically detecting, diagnosing, transporting, and repairing devices that fail during burn-in testing. Embodiments provide a system that monitors devices undergoing burn-in testing and detects when a device or a component within a device fails the burn-in test. Embodiments can then alert burn-in-rack monitor personnel of the device failure. Embodiments can concurrently determine the nature of the failure by applying a machine learning-based prediction model to log files associated with the failed device. The diagnosis, along with a recommended repair strategy, can be provided to the repair center to accelerate the repair process. In addition, the diagnosis can be used to order parts for the repair from a parts depot. In this manner, embodiments can reduce the time for detection, diagnosis, and repair of the failed device.

Information handling system and methods to detect power rail failures and test other components of a system motherboard
11194684 · 2021-12-07

Embodiments of information handling systems (IHSs) and methods are provided herein to automatically detect failure(s) on one or more power rails provided on a system motherboard of an IHS. One embodiment of such a method may include determining if a power rail test should be performed each time an information handling system (IHS) is powered on or rebooted. If a power rail test is performed, the method may perform a current measurement for each of the power rails separately to obtain actual current values for each power rail, compare the actual current values obtained for each power rail to expected current values stored for each power rail, and detect a failure on at least one of the power rails if the actual current value obtained for the at least one power rail differs from the expected current value stored for the at least one power rail by more than a predetermined percentage or amount.
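
The comparison step described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function name, the rail names, the current values, and the 10% tolerance are all hypothetical, and only the percentage form of the threshold is shown.

```python
def detect_rail_failures(actual, expected, tolerance_pct=10.0):
    """Return the rails whose measured current deviates from the stored
    expected current by more than tolerance_pct percent.

    All names and values here are assumptions made for illustration.
    """
    failed = []
    for rail, expected_amps in expected.items():
        measured_amps = actual[rail]
        deviation_pct = abs(measured_amps - expected_amps) / expected_amps * 100.0
        if deviation_pct > tolerance_pct:
            failed.append(rail)
    return failed


# Hypothetical expected and measured currents, in amperes.
expected = {"12V": 2.0, "5V": 1.0, "3V3": 0.5}
actual = {"12V": 2.1, "5V": 1.4, "3V3": 0.5}
print(detect_rail_failures(actual, expected))
```

Each rail is checked separately, which matches the per-rail measurement in the abstract; an absolute-amount threshold could be substituted for the percentage check.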

Regulating core and un-core processor frequencies of computing node clusters

A technique includes receiving, by a cluster management controller, data representing results of a plurality of performance tests that are conducted on a plurality of processors that are associated with a plurality of computing nodes of a cluster. The technique includes, based on the data representing the results, selecting a lead processor of the plurality of processors; and communicating, by the cluster management controller, with the plurality of computing nodes to regulate core and un-core operating frequencies of the plurality of processors. Communicating with the plurality of computing nodes includes communicating data representing a core operating frequency of the lead processor and an un-core operating frequency of the lead processor.

Method and apparatus for performing test for CPU, and electronic device

A method and an apparatus for performing a test for a CPU, and an electronic device. A decay command in a SETWP test and a command-executing duration corresponding to each command subsequent to the decay command can be automatically deployed. The SETWP test is thereby performed correctly for the CPU to obtain a test result, without relying on manual adjustment of a delay parameter for each command.

HIGH SPEED DEBUG-DELAY COMPENSATION IN EXTERNAL TOOL

A testing tool includes a clock generation circuit generating a test clock and outputting the test clock via a test clock output pad, data processing circuitry clocked by the test clock, and data output circuitry receiving data output from the data processing circuitry and outputting the data via an input/output (IO) pad, the data output circuitry being clocked by the test clock. The testing tool also includes a programmable delay circuit generating a delayed version of the test clock, and data input circuitry receiving data input via the IO pad, the data input circuitry clocked by the delayed version of the test clock. The delayed version of the test clock is delayed to compensate for delay between transmission of a pulse of the test clock via the test clock output pad to an external computer and receipt of the data input from the external computer via the IO pad.

RATING MEMORY DEVICES BASED ON PERFORMANCE METRICS FOR VARIOUS TIMING MARGIN PARAMETER SETTINGS
20220137854 · 2022-05-05

An operation timing condition associated with a memory device to be installed at a memory sub-system is determined. The memory device can include a cross-point array of non-volatile memory cells. The operation timing condition corresponds to a first operation delay timing margin setting for the cross-point array of non-volatile memory cells. A first set of memory access operations is performed at the cross-point array of non-volatile memory cells according to a second operation delay timing margin setting that is lower than the first operation delay timing margin setting. A first number of errors that occurred during performance of the first set of memory access operations is determined. In response to a determination that the first number of errors satisfies an error condition, a first quality rating is assigned for the memory device. In response to a determination that the first number of errors does not satisfy the error condition, further testing is performed for the cross-point array of non-volatile memory cells based on one or more power level settings.
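
The decision flow above can be sketched as follows, under stated assumptions. The abstract does not say what "satisfies an error condition" means, so this sketch assumes it means the error count is at or below a threshold; the callback `run_access_ops`, the rating label, and the shape of the returned dictionary are hypothetical.

```python
def rate_memory_device(run_access_ops, reduced_margin, max_errors, power_levels):
    """Rate a device from its error count at a reduced timing margin.

    run_access_ops(margin, power_level) is a hypothetical callback that
    performs a set of memory access operations and returns the error count.
    """
    errors = run_access_ops(reduced_margin, None)
    # Assumption: the error condition is "errors <= max_errors".
    if errors <= max_errors:
        return {"rating": "first", "errors": errors}
    # Condition not satisfied: perform further testing per power level setting.
    per_level = {level: run_access_ops(reduced_margin, level) for level in power_levels}
    return {"rating": None, "errors": errors, "power_level_errors": per_level}


def fake_ops(margin, power_level):
    # Hypothetical stand-in for real memory access operations.
    return 2 if power_level is None else 0


print(rate_memory_device(fake_ops, reduced_margin=5, max_errors=3, power_levels=["low", "high"]))
```

How the results of the further power-level testing map to other quality ratings is not stated in the abstract, so the sketch only collects the per-level error counts.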

DETECTING SILENT DATA CORRUPTIONS WITHIN A LARGE SCALE INFRASTRUCTURE

Systems, apparatuses, and methods provide technology for conducting silent data corruption (SDC) testing in a network including a fleet of production servers. The technology includes generating a first SDC test selected from a repository of SDC tests and submitting the first SDC test for execution on a plurality of servers selected from the fleet of production servers, wherein, for each respective server of the plurality of servers, the first SDC test is executed as a test workload in co-location with a production workload executed on the respective server. The technology further includes determining a result of the first SDC test performed on a first server of the plurality of servers and, upon determining that the result is a test failure, removing the first server from a production status and entering the first server in a quarantine process to investigate and mitigate the test failure.
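
The fail-then-quarantine step can be illustrated with a small sketch. It is not the described system: the function name, the server names, and the `sdc_test` callable are hypothetical, and co-location with a production workload is not modeled.

```python
def run_sdc_test(sdc_test, servers, production, quarantine):
    """Run one SDC test on each selected server and quarantine failures.

    sdc_test(server) is a hypothetical callable that returns True on a pass.
    production and quarantine are sets of server names, updated in place.
    """
    for server in servers:
        if not sdc_test(server):
            production.discard(server)  # remove from production status
            quarantine.add(server)      # hold for investigation and mitigation
    return quarantine


# Hypothetical fleet in which one server fails the test.
production = {"srv-a", "srv-b", "srv-c"}
quarantine = set()
run_sdc_test(lambda s: s != "srv-b", sorted(production), production, quarantine)
print(sorted(production), sorted(quarantine))
```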

Detecting execution hazards in offloaded operations

Detecting execution hazards in offloaded operations is disclosed. A second offload operation is compared to a first offload operation that precedes the second offload operation. It is determined whether the second offload operation creates an execution hazard on an offload target device based on the comparison of the second offload operation to the first offload operation. If the execution hazard is detected, an error handling operation may be performed. In some examples, the offload operations are processing-in-memory operations.

Smart overclocking method conducted in basic input/output system (BIOS) of computer device
20210365269 · 2021-11-25

The present invention provides a smart overclocking method for a computer device with a multi-core CPU and a basic input/output system (BIOS) in which an overclocking database is stored. The method comprises: booting the computer device, entering the BIOS, and performing an overclocking function; acquiring overclocking parameters from the overclocking database; adjusting and setting the clock rate and the voltage of the multi-core CPU based on the overclocking parameters; conducting Heavy Load Testing (HLT) on the multi-core CPU; and reading out working result data of the multi-core CPU and determining whether any of it has exceeded limits. Hence, overclocking can be completed within 10 minutes or less, without causing the computer device to shut down, and without causing the working temperature or working voltage of the multi-core CPU to exceed 90° C. or 1500 mV, respectively, during HLT.
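
A rough sketch of the acquire, apply, test, and check loop follows. Only the 90° C. and 1500 mV limits come from the abstract; the function names, the database layout, the order in which entries are tried, and the simulated test results are assumptions made for illustration.

```python
TEMP_LIMIT_C = 90.0        # temperature limit stated in the abstract
VOLTAGE_LIMIT_MV = 1500.0  # voltage limit stated in the abstract


def within_limits(result):
    """Check heavy-load test readings against the two limits."""
    return result["temp_c"] <= TEMP_LIMIT_C and result["voltage_mv"] <= VOLTAGE_LIMIT_MV


def smart_overclock(database, apply_settings, heavy_load_test):
    """Try each database entry in order; return the first one within limits.

    apply_settings and heavy_load_test are hypothetical callbacks standing in
    for the BIOS clock/voltage setup and the Heavy Load Testing step.
    """
    for params in database:
        apply_settings(params)
        if within_limits(heavy_load_test(params)):
            return params
    return None


# Hypothetical overclocking database and a simulated heavy-load test.
database = [
    {"clock_mhz": 5200, "voltage_mv": 1550},
    {"clock_mhz": 5000, "voltage_mv": 1450},
]
applied = []


def fake_hlt(params):
    return {"temp_c": 85.0, "voltage_mv": params["voltage_mv"]}


chosen = smart_overclock(database, applied.append, fake_hlt)
print(chosen)
```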