IPIQ

G06F11/2257

PREEMPTIVE DEEP DIAGNOSTICS AND HEALTH CHECKING OF RESOURCES IN DISAGGREGATED DATA CENTERS

20200097347 · 2020-03-26 ·

International Business Machines Corporation

Embodiments for preemptive deep diagnostics of resources in a disaggregated computing environment. Responsive to detecting a threshold breach of a recurrent event associated with a first resource of a first resource type executing a workload, an alert is generated; and responsive to receiving the alert, the execution of the workload on the first resource is ceased. Health check diagnostics are identified and invoked on the first resource based on the alert and a server telemetry. Results of the health check diagnostics are mapped to a set of learned failure patterns; and a potential failure of the first resource is predicted based on the mapping.
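The first step in this abstract, detecting a threshold breach of a recurrent event, can be illustrated with a small sliding-window counter. This is only a sketch under stated assumptions: the class name `RecurrentEventMonitor`, the time window, and the threshold values are hypothetical and do not come from the patent.

```python
from collections import deque


class RecurrentEventMonitor:
    """Hypothetical helper: counts one kind of event inside a sliding time window."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold            # assumed: max tolerated events per window
        self.window_seconds = window_seconds  # assumed: window length in seconds
        self.timestamps = deque()

    def record(self, timestamp):
        """Record one event; return True when the windowed count exceeds the threshold."""
        self.timestamps.append(timestamp)
        # Drop events that have fallen out of the window.
        while timestamp - self.timestamps[0] > self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps) > self.threshold


# Illustrative event times (seconds); a True entry is where an alert would be raised.
monitor = RecurrentEventMonitor(threshold=3, window_seconds=60)
alerts = [monitor.record(t) for t in [0, 10, 20, 30, 200]]
```

In the flow described by the abstract, a `True` result would correspond to generating the alert that causes the workload to be halted and diagnostics to be invoked.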

AUTOMATIC ROOT CAUSE ANALYSIS USING TERNARY FAULT SCENARIO REPRESENTATION

20240028445 · 2024-01-25 ·

David R. Cheriton

A plurality of potential fault scenarios are accessed, wherein a given potential fault scenario of the plurality of potential fault scenarios has at least one corresponding root cause, and a representation of the given potential fault scenario comprises a don't care value. An actual fault scenario from telemetry received from a monitored system is generated. The actual fault scenario is matched against the plurality of potential fault scenarios. One or more matched causes are output as one or more probable root cause failures of the monitored system.
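Matching an actual fault scenario against potential scenarios that contain don't care values might look like the sketch below. The encoding is an assumption made for illustration: each scenario is a tuple of symptom values, `None` stands for "don't care", and the symptom and root-cause names are hypothetical.

```python
DONT_CARE = None  # assumed encoding of the "don't care" value


def matches(potential, actual):
    """True when every symptom in `potential` is either don't-care or equal to `actual`."""
    if len(potential) != len(actual):
        return False
    return all(p is DONT_CARE or p == a for p, a in zip(potential, actual))


def probable_root_causes(potential_scenarios, actual):
    """Return the root causes whose potential scenario matches the actual scenario."""
    return [cause for cause, scenario in potential_scenarios.items()
            if matches(scenario, actual)]


# Hypothetical symptom order: (power_alarm, link_down, over_temperature)
potential_scenarios = {
    "power_supply_failure": (True, True, DONT_CARE),
    "link_cable_fault": (False, True, False),
    "fan_failure": (False, DONT_CARE, True),
}
actual = (False, True, True)  # as it might be derived from telemetry
causes = probable_root_causes(potential_scenarios, actual)
```

Here only `fan_failure` matches, because its scenario ignores the link state and agrees with the actual scenario on the other two symptoms.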

ABNORMALITY DETECTION DEVICE, ABNORMALITY DETECTION METHOD, AND STORAGE MEDIUM

20200057939 · 2020-02-20 ·

An abnormality detection device according to an embodiment includes a detector, a remover, and a learner. The detector detects first abnormal data in detection target data which is an abnormality detection target by inputting the detection target data to a first autoencoder which performed learning based on first learning target data which is a learning target. The remover removes data associated with the first abnormal data from the first learning target data to generate second learning target data by inputting the first learning target data to a second autoencoder which performed learning based on the first abnormal data detected by the detector. The learner causes the first autoencoder to perform learning based on the second learning target data generated by the remover.

Using cognitive technologies to identify and resolve issues in a distributed infrastructure

10565077 · 2020-02-18 ·

International Business Machines Corporation

A mechanism is provided in a data processing system for identifying and resolving issues in a distributed infrastructure. A log error processor monitors error logs of a plurality of data processing nodes within the distributed infrastructure. In response to the log error processor detecting an error in a given node within the distributed infrastructure, the log error processor provides error data for the error to a machine learning model and receives from the machine learning model a set of potential solutions and associated confidence values. An operation extraction component extracts from each potential solution in the set of potential solutions a set of operations to resolve the error. A classifier component maps each set of operations to a set of executable operations that are executable by the given node. A solution scorer component determines whether to perform automatic resolution using a selected potential solution and its corresponding set of executable operations. In response to the solution scorer component determining to perform automatic resolution, an operation execution engine executes the corresponding set of executable operations on the given node.

Method and system for monitoring and correcting defects of a network device

10484256 · 2019-11-19 ·

Arista Networks, Inc.

A method for determining that a defect applies to a network device includes receiving, at a monitoring module, network device information from the network device. The network device information includes state information for the network device and does not include hardware and software version information. The method includes storing, in a network device database, the network device information from the network device and receiving, at the monitoring module, defect information about a defect. The defect information includes network device criteria specifying what state information is required for a network device to be affected by the defect. The method includes storing the defect information in a defect database, determining that the defect applies to the network device based on analyzing the network device information and the defect information from their respective databases, and, based on the determination, informing a defect alert recipient that the defect applies to the network device.
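The comparison of stored state information against defect criteria can be sketched as a lookup across two in-memory dictionaries standing in for the two databases. Everything named here (`device_db`, `defect_db`, the state keys, and the identifiers) is hypothetical, and the exact-match rule is an assumption; the abstract does not specify how criteria are evaluated.

```python
def defect_applies(device_state, criteria):
    """True when the device state satisfies every criterion (assumed: exact match)."""
    return all(key in device_state and device_state[key] == value
               for key, value in criteria.items())


# Stand-in for the network device database: state only, no version information.
device_db = {
    "switch-1": {"bgp_enabled": True, "mlag_enabled": False},
    "switch-2": {"bgp_enabled": True, "mlag_enabled": True},
}

# Stand-in for the defect database: criteria a device must meet to be affected.
defect_db = {
    "DEFECT-A": {"bgp_enabled": True, "mlag_enabled": True},
}

# Pairs that would be reported to a defect alert recipient.
affected = [(defect_id, device_id)
            for defect_id, criteria in defect_db.items()
            for device_id, state in device_db.items()
            if defect_applies(state, criteria)]
```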

INTELLIGENT SCORE BASED OOM TEST BASELINE MECHANISM

20240126667 · 2024-04-18 ·

Dell Products L.P.

Huijuan Fan

Methods, systems, and non-transitory processor-readable storage media for an Out of Memory test baseline system are provided herein. An example method includes executing a plurality of test cases on a system. A test score calculation module calculates a test case score for each of the executed test cases in a subset of the plurality of test cases. An Out of Memory (OOM) test baseline configuration system trains a machine learning system, using the subset test scores, to predict a baseline test score for an unexecuted test case. A test case score prediction module predicts the baseline test score for the unexecuted test case. A test case configuration tuning module tunes the unexecuted test case to determine a baseline configuration for the unexecuted test case, to identify OOM issues when the unexecuted test case is executed on a test system.

PREDICTION METHOD AND APPARATUS FOR FAULTY GPU, ELECTRONIC DEVICE AND STORAGE MEDIUM

20240118984 · 2024-04-11 ·

The present disclosure provides a prediction method and an apparatus for a faulty GPU, an electronic device and a storage medium. The method includes: acquiring parameter information of each GPU in a plurality of GPUs to obtain a parameter information set; inputting the parameter information set into a plurality of pre-trained prediction models to obtain a prediction result corresponding to each prediction model; and determining a faulty GPU from the plurality of GPUs according to the prediction result.
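One way to picture feeding a parameter information set into several prediction models and then deciding on a faulty GPU is shown below. The abstract does not say how the per-model results are combined, so majority voting is an assumption here, and the simple threshold functions are hypothetical stand-ins for pre-trained models. The parameter names are also illustrative.

```python
def predict_faulty_gpus(parameter_set, models):
    """Return GPU ids flagged as faulty by a majority of models (assumed rule)."""
    votes = {gpu_id: 0 for gpu_id in parameter_set}
    for model in models:
        for gpu_id, params in parameter_set.items():
            if model(params):  # each model returns True for "predicted faulty"
                votes[gpu_id] += 1
    return sorted(gpu_id for gpu_id, count in votes.items()
                  if count > len(models) / 2)


# Hypothetical parameter information set, one entry per GPU.
parameter_set = {
    "gpu0": {"temp_c": 65, "ecc_errors": 0, "power_w": 250},
    "gpu1": {"temp_c": 92, "ecc_errors": 14, "power_w": 240},
}

# Threshold rules standing in for pre-trained prediction models.
models = [
    lambda p: p["temp_c"] > 85,
    lambda p: p["ecc_errors"] > 10,
    lambda p: p["power_w"] > 300,
]

faulty = predict_faulty_gpus(parameter_set, models)
```

With these illustrative values, two of the three stand-in models flag `gpu1`, so it is the only GPU reported.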

AUTOMATED TERMINAL PROBLEM IDENTIFICATION AND RESOLUTION

20190303258 · 2019-10-03 ·

Surisetty Pradeep Kumar

A transaction terminal reports information regarding operation of the terminal to a server-based analyzer. The analyzer labels the information and normalizes the labeled information into a model format. The analyzer reports the model format to a problem identifier/resolver. The problem identifier/resolver identifies a closest likely problem and a resolution for that closest likely problem based on the labeled information in the model format and reports the likely problem and resolution back to the analyzer for resolution on the transaction terminal.

FAULT TREE ANALYSIS FOR TECHNICAL SYSTEMS

20190278647 · 2019-09-12 ·

Peter Bakucz

A method for fault tree analysis of a technical system, which includes a plurality of functional units, the technical system being modeled as a tree-like logical linkage of causative events, which may culminate in an undesirable event, and the causative events including malfunctions of individual functional units, a tree-like logical linkage having a self-similar structure being selected. An associated computer program is described. A surroundings detection system and/or a control system for an at least partially automated driving vehicle, including a plurality of functional units having mutual dependencies, which link the functional units in a tree-like structure in such a way that an undesirable event occurs if a logical linkage of causative events is true, the causative events including malfunctions of individual functional units, the tree-like structure being self-similar.
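The idea of a tree-like logical linkage of causative events can be illustrated by evaluating a small fault tree built from AND and OR gates. This sketch covers only the generic evaluation step and does not model the self-similar structure the abstract claims; the tuple encoding and the unit names are hypothetical.

```python
def evaluate(node, failed_units):
    """Return True when the (sub)tree's event occurs given the set of failed units."""
    kind = node[0]
    if kind == "unit":
        # Leaf: a malfunction of one functional unit.
        return node[1] in failed_units
    results = [evaluate(child, failed_units) for child in node[1]]
    if kind == "and":
        return all(results)
    if kind == "or":
        return any(results)
    raise ValueError(f"unknown node kind: {kind}")


# Undesirable event: the camera fails, or both redundant radars fail.
tree = ("or", [
    ("unit", "camera"),
    ("and", [("unit", "radar_a"), ("unit", "radar_b")]),
])

single_radar_loss = evaluate(tree, {"radar_a"})
double_radar_loss = evaluate(tree, {"radar_a", "radar_b"})
```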

METHOD OF PERFORMING FAULT MANAGEMENT IN AN ELECTRONIC APPARATUS

20190272475 · 2019-09-05 ·

Siemens Healthcare Gmbh

A method is provided for performing fault management in an electronic apparatus. In an embodiment, the method includes transferring machine data of the electronic apparatus to a remote support center; analyzing the machine data of the electronic apparatus in the remote support center; providing at least one of a diagnostic workflow and a corrective workflow in response to an anomaly detected in the machine data; and operating the electronic apparatus from the remote support center to execute the at least one of the diagnostic workflow and the corrective workflow.
