G06F11/0793

ATTRIBUTING ERRORS TO INPUT/OUTPUT PERIPHERAL DRIVERS
20230236917 · 2023-07-27

A process includes determining, by an operating system agent of a computer system, a first profile that is associated with an input/output (I/O) peripheral of the computer system. The first profile is associated with an error register of the I/O peripheral, and the first profile represents a configuration of the computer system that is associated with the I/O peripheral. The process includes, responsive to a notification of an error being associated with the I/O peripheral, determining, by the operating system agent, a second profile that is associated with the I/O peripheral. The second profile is associated with the error register. Moreover, responsive to the notification of the error, the process includes comparing, by a baseboard management controller of the computer system, the second profile to the first profile. Based on the comparison, the process includes determining, by the baseboard management controller, whether the error is attributable to a driver for the I/O peripheral.
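
The comparison step can be illustrated with a minimal Python sketch. The abstract does not state how the two profiles are compared or what rule decides attribution, so everything here is an assumption for illustration: profiles are modeled as plain dictionaries, the field names (`error_reg`, `max_payload`, `link_width`) are hypothetical, and the rule "the error is attributable to the driver if a driver-controlled field changed" is one plausible policy, not the claimed one.

```python
from typing import Any, Dict, Iterable, Set


def changed_fields(first: Dict[str, Any], second: Dict[str, Any]) -> Set[str]:
    """Return the names of fields whose values differ between two profiles."""
    keys = first.keys() | second.keys()
    return {k for k in keys if first.get(k) != second.get(k)}


def attributable_to_driver(
    first: Dict[str, Any],
    second: Dict[str, Any],
    driver_fields: Iterable[str],
) -> bool:
    """Hypothetical rule: blame the driver if any field it controls changed
    between the baseline profile and the profile taken after the error."""
    return bool(changed_fields(first, second) & set(driver_fields))


# Hypothetical profiles: a baseline and one captured after the error notification.
baseline = {"error_reg": 0x0, "max_payload": 256, "link_width": 8}
after_error = {"error_reg": 0x4000, "max_payload": 128, "link_width": 8}
```

With these sample profiles, `changed_fields(baseline, after_error)` contains `error_reg` and `max_payload`, so the hypothetical rule attributes the error to the driver only when `max_payload` is listed among the driver-controlled fields.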

METHOD AND SYSTEM FOR PROVIDING SOLUTIONS TO A HARDWARE COMPONENT FAILURE USING A CONTEXT-AWARE SEARCH

In general, embodiments relate to a method for providing solutions to hardware component failures, comprising: creating a device state chain using a device state path, a current device state, and a next device state for a device; identifying a root cause of a hardware component failure in the device using the device state chain; performing a context-aware search in a shared storage using the root cause of the hardware component failure; and obtaining, in response to the context-aware search, a result specifying a proposed solution for the hardware component failure.

ANOMALY DETECTION ON DYNAMIC SENSOR DATA

Methods and systems for anomaly detection include determining whether a system is in a stable state or a dynamic state based on input data from one or more sensors in the system, using reconstruction errors from a respective stable model and dynamic model. It is determined that the input data represents anomalous operation of the system, responsive to a determination that the system is in a stable state, using the reconstruction errors. A corrective operation is performed on the system responsive to a determination that the input data represents anomalous operation of the system.
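
The two-stage decision described above can be sketched in Python. The abstract does not give the decision rules, so this sketch assumes two things: the system is treated as stable when the stable model reconstructs the input at least as well as the dynamic model, and a stable-state input is flagged as anomalous when the stable model's reconstruction error exceeds a threshold. The names `classify`, `Verdict`, and `anomaly_threshold` are hypothetical, and the reconstruction errors are taken as already computed.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    state: str       # "stable" or "dynamic"
    anomalous: bool  # only ever True in the stable state


def classify(stable_err: float, dynamic_err: float, anomaly_threshold: float) -> Verdict:
    """Pick the operating state from the two reconstruction errors, then
    test for an anomaly only if the state is stable (assumed rules)."""
    # Assumption: the model with the lower reconstruction error names the state.
    state = "stable" if stable_err <= dynamic_err else "dynamic"
    # Assumption: a large stable-model error in the stable state means anomaly.
    anomalous = state == "stable" and stable_err > anomaly_threshold
    return Verdict(state, anomalous)
```

Under these assumptions, `classify(0.8, 0.9, 0.5)` reports a stable state with an anomaly, while `classify(0.9, 0.2, 0.5)` reports a dynamic state and does not flag the input, which mirrors the abstract's condition that the anomaly determination follows a stable-state determination.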

MONITORING DEVICE AND MONITORING METHOD
20230236924 · 2023-07-27

According to one embodiment, there is provided a monitoring device for a terminal including a connection unit to which at least one device is able to be connected and mounted with a container file storing application software for controlling a business to be performed by using the device connected to the connection unit. The monitoring device includes a detection unit, an acquisition unit, and a processing unit. The detection unit detects an abnormality in the terminal. The acquisition unit acquires the container file mounted on the terminal in which the abnormality is detected. The processing unit installs the acquired container file on an alternative terminal.

MACHINE LEARNING ASSISTED REMEDIATION OF NETWORKED COMPUTING FAILURE PATTERNS

Disclosed are techniques for automatically determining whether a new disruption of service alert corresponds to a pattern of failures and automatically applying remedies based on the determined pattern. Datasets of historical disruption of service alerts on networked computing clusters are used to train a machine learning algorithm to identify patterns between alerts. When a new disruption of service alert is received, historical disruption of service alerts for the originating networked computing cluster are also received and provided as input to the machine learning model. The machine learning model then automatically determines whether the new alert fits a pattern with the historical alerts from the same cluster, and when a fit is found, remedial actions are sourced from the alerts that fit the pattern to be applied automatically to the originating networked computing cluster.

Intelligent network operation platform for network fault mitigation

Embodiments of the present disclosure provide systems, methods, and computer-readable storage media that leverage artificial intelligence and machine learning to identify, diagnose, and mitigate occurrences of network faults or incidents within a network. Historical network incidents may be used to generate a model that may be used to evaluate real-time occurring network incidents, such as to identify a cause of the network incident. Clustering algorithms may be used to identify portions of the model that share similarities with a network incident and then actions taken to resolve similar network incidents in the past may be identified and proposed as candidate actions that may be executed to resolve the cause of the network incident. Execution of the candidate actions may be performed under control of a user or automatically based on execution criteria and the configuration of the fault mitigation system.

Time clock quality determination

In some examples, an electronic device records, in an entry of a time-state data structure that includes a plurality of entries to store respective times, a time in response to invocation of a time-lapse process that lasts a predefined time duration independently of a time clock of the electronic device. The electronic device determines whether times in successive entries of the plurality of entries of the time-state data structure are within a threshold of one another, the threshold based on the predefined time duration. Based on the determining, the electronic device sets a parameter representing a quality of the time clock.
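
A minimal Python sketch of the successive-entry check follows. The abstract says only that the threshold is based on the predefined time duration, so the formula used here (the duration plus a fractional `slack`) is an assumption, as are the function name `clock_quality` and the quality labels. The time-state data structure is modeled as a simple list of recorded times in seconds.

```python
from typing import List


def clock_quality(times: List[float], duration: float, slack: float = 0.1) -> str:
    """Return a quality label for the time clock from recorded times.

    `times` holds one clock reading per invocation of the time-lapse process,
    which is assumed to last `duration` seconds regardless of the clock.
    """
    if len(times) < 2:
        return "unknown"  # nothing to compare yet
    # Assumption: the threshold is the duration plus a fractional allowance.
    threshold = duration * (1 + slack)
    for earlier, later in zip(times, times[1:]):
        # A jump larger than the threshold suggests the clock was reset or drifted.
        if abs(later - earlier) > threshold:
            return "low"
    return "high"
```

For example, readings of `[0.0, 10.0, 20.1]` with a 10-second duration stay within the assumed 11-second threshold, whereas `[0.0, 10.0, 500.0]` does not, so the second case would set the quality parameter to the low value.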

Image processing apparatus and control method for image processing apparatus for error reduction
11570330 · 2023-01-31

An image processing apparatus includes a processing section and a control section configured to instruct operation of the processing section. The control section executes, before sending, to the processing section, a first instruction corresponding to an instruction that caused an error in the past, an error avoidance operation based on instruction history information and operation state history information acquired from a storing section that stores the instruction history information and the operation state history information, the instruction history information indicating an instruction given to the processing section by the control section, the operation state history information indicating an operation state of the processing section caused by the instruction.
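
The history-based check can be sketched in Python as follows. This is an illustrative model only: the class name `ControlSection`, the string-valued instructions and states, and the `avoid:` marker standing in for the error avoidance operation are all hypothetical, and the abstract does not say what the avoidance operation actually does.

```python
from typing import List, Tuple


class ControlSection:
    """Toy control section that consults stored history before sending."""

    def __init__(self) -> None:
        # Paired instruction history and operation state history (assumed layout).
        self.history: List[Tuple[str, str]] = []
        # Everything issued toward the processing section, in order.
        self.sent: List[str] = []

    def record(self, instruction: str, state: str) -> None:
        """Store an instruction together with the operation state it caused."""
        self.history.append((instruction, state))

    def caused_error_before(self, instruction: str) -> bool:
        """True if this instruction is paired with an error state in the history."""
        return any(i == instruction and s == "error" for i, s in self.history)

    def send(self, instruction: str) -> None:
        """Run the (placeholder) avoidance operation first when history warrants it."""
        if self.caused_error_before(instruction):
            self.sent.append("avoid:" + instruction)
        self.sent.append(instruction)
```

After recording `("scan", "ok")` and `("print", "error")`, sending `scan` and then `print` yields the sequence `scan`, `avoid:print`, `print`: only the instruction with a recorded error is preceded by the avoidance step.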

Troubleshooting for a distributed storage system by cluster wide correlation analysis
11714701 · 2023-08-01

A troubleshooting technique provides faster and more efficient troubleshooting of issues in a distributed system, such as a distributed storage system provided by a virtualized computing environment. The distributed system includes a plurality of hosts arranged in a cluster. The troubleshooting technique uses cluster-wide correlation analysis to identify potential causes of a particular issue in the distributed system, and executes workflows to remedy the particular issue.

Compute cluster preemption within a general-purpose graphics processing unit

Embodiments described herein provide techniques that enable a graphics processor to continue processing operations during the reset of a compute unit that has experienced a hardware fault. Threads and associated context state for a faulted compute unit can be migrated to another compute unit of the graphics processor, and the faulting compute unit can be reset while processing operations continue.