Patent classifications
G06F11/0793
Efficient Fault Prevention and Repair in Complex Systems
A method of supervising a complex system includes acquiring and storing failure data and repair resource information regarding the complex system, and identifying failure networks and structures of the complex system. Failure types associated with the failure networks of the complex system are determined. The method includes generating a plurality of failure prevention and repair (FPR) sequences, wherein each FPR sequence is associated with the failure networks and the failure types. The generated FPR sequences are analyzed to select a set of FPR sequences and associated repair resources. The method further comprises applying the selected FPR sequences to the complex system, thereby managing the complex system.
SYSTEM-ON-CHIP FOR SHARING GRAPHICS PROCESSING UNIT THAT SUPPORTS MULTIMASTER, AND METHOD FOR OPERATING GRAPHICS PROCESSING UNIT
A system-on-a-chip sharing a graphics processing unit supporting multi-master is provided. A system on chip (SoC) comprises a plurality of central processing units (CPUs) for executing at least one operating system, a graphics processing unit (GPU) that is connected to each of the plurality of CPUs via a bus interface and communicates with each of the plurality of CPUs, and at least one state monitoring device that is selectively connected to at least one CPU among the plurality of CPUs and transmits execution state information of at least one operating system executed in the connected CPU to the GPU. The GPU is shared by at least one operating system and controls a sharing operation by the at least one operating system based on the execution state information of the at least one operating system.
FAILURE DIAGNOSIS DEVICE, FAILURE DIAGNOSIS SYSTEM, HOUSEHOLD ELECTRICAL APPLIANCE, SENSOR UNIT, AND FAILURE DIAGNOSIS METHOD
A failure diagnosis device includes a communication unit, a data comparison unit, and an eligibility determination unit. The communication unit acquires first physical quantity data and second physical quantity data of a type different from that of the first physical quantity data that are used for performing failure diagnosis of a home appliance acquired by a sensor unit, and first control information related to the second physical quantity data acquired by the home appliance. The data comparison unit compares the second physical quantity data with the first control information. The eligibility determination unit determines whether or not the first physical quantity data is eligible as data used for the failure diagnosis based on a comparison result obtained by the data comparison unit.
FAULT RECOVERY SYSTEM FOR FUNCTIONAL CIRCUITS
A fault recovery system includes various fault management circuits that form a hierarchical structure. One fault management circuit detects a fault in a functional circuit and executes a recovery operation to recover the functional circuit from the fault. When the fault management circuit fails to recover the functional circuit from the fault within a predetermined time duration, a fault management circuit that is in a higher hierarchical level executes another recovery operation to recover the functional circuit from the fault. Such a fault management circuit is required to execute the corresponding recovery operation within another predetermined time duration to successfully recover the functional circuit from the fault. The fault recovery system thus implements the hierarchical structure of fault management circuits to recover the functional circuit from the fault.
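The escalation described above can be illustrated with a minimal Python sketch. It is not the patented circuit: the `Level` tuple, the `recover` function, the level names, and the idea that each recovery action reports its own elapsed time are all assumptions made for illustration. Real hardware would enforce the durations with timers.

```python
from typing import Callable, List, Tuple

# Hypothetical model: each level is (name, recovery_action, time_budget).
# A recovery action returns (recovered, elapsed_time).
Level = Tuple[str, Callable[[], Tuple[bool, float]], float]


def recover(levels: List[Level]) -> str:
    """Try each level from lowest to highest.

    Returns the name of the first level that recovers the circuit within
    its own time budget, or "unrecovered" if every level fails.
    """
    for name, action, budget in levels:
        recovered, elapsed = action()
        if recovered and elapsed <= budget:
            return name
        # Otherwise escalate to the next (higher) level.
    return "unrecovered"


levels = [
    ("local", lambda: (False, 5.0), 2.0),    # fails to recover the circuit
    ("cluster", lambda: (True, 3.0), 4.0),   # recovers within its budget
    ("system", lambda: (True, 1.0), 10.0),   # never reached in this run
]
result = recover(levels)
```

Here `result` names the level that completed the recovery. A recovery that succeeds only after its budget has expired is treated as a failure and escalated.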
METHOD AND SYSTEM FOR PREDICTIVE MAINTENANCE OF HIGH PERFORMANCE SYSTEMS (HPC)
State-of-the-art predictive maintenance systems that generate predictions with respect to maintenance of High Performance Computing (HPC) systems have the disadvantage that they either are reactive, or the predictions are affected by quality issues associated with the data being collected from the HPC systems. The disclosure herein generally relates to predictive maintenance, and, more particularly, to a method and system for predictive maintenance of High Performance Computing (HPC) systems. The system performs abstraction and cleansing on performance data collected from the HPC systems, and generates cleansed performance data, on which a Machine Learning (ML) prediction is applied to generate predictions with respect to maintenance of the HPC systems.
AUTOMATED CROSS-SERVICE DIAGNOSTICS FOR LARGE SCALE INFRASTRUCTURE CLOUD SERVICE PROVIDERS
Example aspects include techniques for employing cross-service diagnostics for cloud service providers. These techniques may include dynamically generating a workflow of one or more diagnostic modules based on relationship information between an origin service experiencing an incident and one or more related services that the origin service depends on, and executing the workflow of one or more diagnostic modules to determine a root cause of the incident, each of the one or more diagnostic modules implemented by an individual service of the one or more related services in accordance with a schema. In addition, the techniques may include determining a diagnostic action based on the root cause, and transmitting, based on the diagnostic action, an engagement notification to a responsible entity.
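A minimal Python sketch of the idea, under stated assumptions: the dependency map, the service names, and the module schema (each diagnostic module is a callable returning `None` when healthy or a short finding string otherwise) are hypothetical and not taken from the patent. The workflow is built by a breadth-first walk from the origin service, which is one plausible ordering among several.

```python
from typing import Callable, Dict, List, Optional


def build_workflow(origin: str, depends_on: Dict[str, List[str]]) -> List[str]:
    """List every service the origin depends on, nearest dependencies first."""
    order: List[str] = []
    seen = {origin}
    queue = [origin]
    while queue:
        service = queue.pop(0)
        for dep in depends_on.get(service, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order


def run_workflow(workflow: List[str],
                 modules: Dict[str, Callable[[], Optional[str]]]) -> Optional[str]:
    """Run each service's diagnostic module; return the first finding, if any."""
    for service in workflow:
        finding = modules[service]()
        if finding is not None:
            return f"{service}: {finding}"
    return None


# Hypothetical relationship information and diagnostic modules.
depends_on = {"web": ["auth", "storage"], "auth": ["db"], "storage": ["db"]}
modules = {
    "auth": lambda: None,
    "storage": lambda: None,
    "db": lambda: "connection pool exhausted",
}
workflow = build_workflow("web", depends_on)
root_cause = run_workflow(workflow, modules)
```

The diagnostic action and the engagement notification would then be chosen from `root_cause`; that step is omitted here because the abstract does not say how the mapping is made.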
METHOD AND SYSTEM FOR AUTOMATED HEALING OF HARDWARE RESOURCES IN A COMPOSED INFORMATION HANDLING SYSTEM
In general, the invention relates to providing computer implemented services using information handling systems. One or more embodiments include, after a hardware resource is allocated to a composed information handling system of the composed information handling systems: monitoring health of the hardware resource of the composed information handling system, making a determination, based on the monitoring of the health of the hardware resource, that the hardware resource is in a compromised state, and based on the determination, initiating a hardware replacement operation using replacement option information (ROI) for the hardware resource and replacement conditions for the hardware resource.
AUTOMATED ISSUE DETECTION AND REMEDIATION ACROSS MULTIPLE USER SYSTEMS USING HEALING-AS-A-SERVICE TECHNIQUES
Methods, apparatus, and processor-readable storage media for automated issue detection and remediation across multiple user systems using healing-as-a-service techniques are provided herein. An example computer-implemented method includes obtaining system configuration data from at least a portion of multiple user systems within a network; obtaining an alert pertaining to an issue attributed to a first of the user systems; training a machine learning model related to user system issue detection using at least a portion of the system configuration data and data related to the alert; determining user system configuration adjustments related to remedying at least a portion of the issue, by processing the data related to the alert using the trained machine learning model; automatically performing the user system configuration adjustments in connection with the first user system; and sharing, using at least one healing-as-a-service component, the trained machine learning model with the user systems in the network.
Adaptively Uploading Data Center Asset Data for Analysis
A system, method, and computer-readable medium are disclosed for performing a data center monitoring and management operation. The data center monitoring and management operation includes: identifying data center asset data to monitor; collecting data center asset data; and, performing an adaptive update scheduling operation, the adaptive update scheduling operation adaptively adjusting a prioritization and frequency of data center asset data collection to provide adapted data center asset data.
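One way to picture adaptive update scheduling is the Python sketch below. The abstract does not specify a policy, so everything here is an assumption: the `adapt_interval` and `prioritize` names, the halve-or-double rule, the change threshold, and the interval bounds are illustrative only.

```python
from typing import Dict, List


def adapt_interval(interval_s: float, change_ratio: float,
                   threshold: float = 0.1,
                   min_s: float = 60.0, max_s: float = 3600.0) -> float:
    """Collect more often when asset data changes a lot, less often otherwise.

    change_ratio is an assumed measure of how much the asset's data changed
    since the last collection (0.0 means no change).
    """
    if change_ratio > threshold:
        interval_s /= 2
    else:
        interval_s *= 2
    # Keep the interval inside the allowed bounds.
    return max(min_s, min(max_s, interval_s))


def prioritize(intervals: Dict[str, float]) -> List[str]:
    """Order assets so that the shortest collection interval comes first."""
    return sorted(intervals, key=lambda asset: (intervals[asset], asset))


# Hypothetical assets, both starting at a 10-minute collection interval.
intervals = {"server-a": 600.0, "switch-b": 600.0}
observed_change = {"server-a": 0.5, "switch-b": 0.0}
intervals = {a: adapt_interval(s, observed_change[a]) for a, s in intervals.items()}
collection_order = prioritize(intervals)
```

After one round, the volatile asset is collected more frequently and moves to the front of the collection order, while the stable asset is collected less often.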
Systems And Methods For Self-Healing And/Or Failure Analysis Of Information Handling System Storage
Systems and methods are provided that may be implemented to perform failure analysis and/or self-healing of information handling system storage. In one example, an information handling system may perform self-recovery actions to self-heal system storage issues when there is an OS boot failure due to a failure to detect a system storage drive by determining one or more possible recovery actions based on a current system storage drive status retrieved by an embedded controller (EC) or other programmable integrated circuit of the information handling system. In another example, manufacturing quality control analysis may be performed on boot failure information that is collected at a remote server from multiple failed information handling systems.