G06F11/0766

ENHANCED PERFORMANCE DIAGNOSIS IN A NETWORK COMPUTING ENVIRONMENT

Embodiments provide enhanced performance diagnosis in a network computing environment. In response to an occurrence of a performance issue for a node while under operating conditions, common logs for applications on the node are analyzed. The applications are respectively registered in advance for diagnosis services. The applications each register rules in advance for the diagnosis services. At a time of the performance issue, debug programs are automatically issued to generate debug level logs respectively for the applications. Debug level logs are analyzed according to the rules to determine a root cause of the performance issue. A potential solution to the root cause of the performance issue is determined using the rules, without having to recreate the operating conditions occurring during the performance issue. The potential solution to rectify the root cause of the performance issue is executed without having to recreate the operating conditions occurring during the performance issue.
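
The rule-driven analysis described above can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the abstract: the `Rule` fields, the `register` and `diagnose` helpers, and the choice of regular expressions as the rule format. The idea shown is only that each application registers rules in advance, and that debug level logs captured at the time of the issue are matched against those rules to yield a root cause and a potential solution.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Rule:
    pattern: str      # assumed rule format: a regex matched against log lines
    root_cause: str   # root cause reported when the pattern matches
    solution: str     # potential solution associated with that root cause


# Hypothetical registry: application name -> rules registered in advance.
registry: Dict[str, List[Rule]] = {}


def register(app: str, rules: List[Rule]) -> None:
    """Register an application's rules for the diagnosis service."""
    registry[app] = list(rules)


def diagnose(app: str, debug_log_lines: List[str]) -> Optional[Rule]:
    """Return the first registered rule that matches any debug log line."""
    for rule in registry.get(app, []):
        if any(re.search(rule.pattern, line) for line in debug_log_lines):
            return rule
    return None  # no registered rule explains the captured logs


register("db", [Rule(r"connection pool exhausted", "pool too small", "increase pool size")])
match = diagnose("db", ["DEBUG connection pool exhausted after 30s"])
```

Because the rules run against logs captured when the issue occurred, nothing in this sketch requires recreating the operating conditions afterwards.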

Managing machine failure

A method, computer program product, and computer system are provided. A message storage area of an adjunct processor (AP) crypto adapter is filled with a plurality of command request messages sufficient to maximize utilization and performance of the AP crypto adapter. In response to detecting an error during execution of one of the plurality of command request messages, an AP crypto adapter command reply message is generated. The AP crypto adapter command reply message includes the error. In response to the error being a non-recoverable failure, a state of the command request message is determined, wherein the state of the command request message comprises an in-process state or a request-pending state. The AP crypto adapter command reply message is formatted, and the formatted AP crypto adapter command reply message is stored in a message queue in the AP crypto adapter pending completion of machine failure recovery. The AP crypto adapter is recovered.

TWO-LEVEL SYSTEM MAIN MEMORY
20180004432 · 2018-01-04

Embodiments of the invention describe a system main memory comprising two levels of memory that include cached subsets of system disk level storage. This main memory includes “near memory” made of volatile memory, and “far memory” comprising volatile or nonvolatile memory storage that is larger and slower than the near memory.

The far memory is presented as “main memory” to the host OS while the near memory is a cache for the far memory that is transparent to the OS, thus appearing to the OS the same as prior art main memory solutions. The management of the two-level memory may be done by a combination of logic and modules executed via the host CPU. Near memory may be coupled to the host system CPU via high bandwidth, low latency means for efficient processing. Far memory may be coupled to the CPU via low bandwidth, high latency means.
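
The relationship between the two levels can be sketched as a small cache in front of a larger store. This is a toy model under stated assumptions, not the management logic of the embodiments: the class name, the LRU eviction policy, and the write-back-on-eviction behavior are all choices made for illustration, since the abstract does not specify them.

```python
from collections import OrderedDict


class TwoLevelMemory:
    """Toy model: near memory caches far memory; callers see one address space."""

    def __init__(self, near_capacity: int) -> None:
        self.near: OrderedDict = OrderedDict()  # small, fast level (the cache)
        self.far: dict = {}                     # large, slow level (the "main memory")
        self.near_capacity = near_capacity

    def _install(self, addr: int, value) -> None:
        # Place the line in near memory and mark it most recently used.
        self.near[addr] = value
        self.near.move_to_end(addr)
        if len(self.near) > self.near_capacity:
            # Assumed policy: evict the least recently used line to far memory.
            old_addr, old_value = self.near.popitem(last=False)
            self.far[old_addr] = old_value

    def read(self, addr: int):
        if addr in self.near:            # near-memory hit
            self.near.move_to_end(addr)
            return self.near[addr]
        value = self.far.get(addr, 0)    # miss: fetch from far memory
        self._install(addr, value)
        return value

    def write(self, addr: int, value) -> None:
        self._install(addr, value)


mem = TwoLevelMemory(near_capacity=2)
mem.write(1, "a")
mem.write(2, "b")
mem.write(3, "c")      # evicts address 1 to far memory
value = mem.read(1)    # miss, refilled from far memory; evicts address 2
```

The caller only ever uses `read` and `write`, which mirrors the point that the near memory is transparent to the OS.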

METHOD AND SYSTEM FOR IMPLEMENTING AN OPERATING SYSTEM HOOK IN A LOG ANALYTICS SYSTEM

Disclosed is a system, method, and computer program product for implementing a log analytics method and system that can configure, collect, and analyze log records in an efficient manner. An improved approach is provided for identifying log files that have undergone a change in status that requires retrieval of their log data, by including a module directly in the operating system that allows the log collection component to be reactively notified of any changes to pertinent log files.

DAMAGE SENSORS FOR A MOBILE COMPUTING DEVICE
20180007164 · 2018-01-04

In general, this disclosure is directed to techniques for utilizing sensors within a computing device to detect a hazardous event and notify a central server that the computing device is potentially damaged. One or more sensors of a computing device may detect the hazardous event to the computing device. Responsive to detecting the hazardous event, the sensors may measure a magnitude of a damage measurand associated with the hazardous event to the computing device. The computing device may determine that the magnitude of the damage measurand exceeds a threshold damage value for the computing device. Responsive to determining that the magnitude of the damage measurand exceeds the threshold damage value, the computing device may send, to a server device, a message indicating the computing device is potentially damaged.
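
The threshold check and notification described above reduce to a few lines. The sketch below is illustrative only: the function name, the message fields, and the `send` callback standing in for the transport to the server device are assumptions, not part of the disclosure.

```python
from typing import Callable, Dict, List


def report_if_damaged(
    device_id: str,
    magnitude: float,
    threshold: float,
    send: Callable[[Dict], None],
) -> bool:
    """Send a 'potentially damaged' message when the measurand exceeds the threshold."""
    if magnitude > threshold:
        # Hypothetical message shape; the abstract only says a message is sent.
        send({"device_id": device_id, "status": "potentially_damaged", "magnitude": magnitude})
        return True
    return False


outbox: List[Dict] = []  # stand-in for the connection to the server device
reported = report_if_damaged("dev-1", magnitude=9.5, threshold=5.0, send=outbox.append)
ignored = report_if_damaged("dev-1", magnitude=1.2, threshold=5.0, send=outbox.append)
```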

Computer system and method for presenting asset insights at a graphical user interface

A computing system is configured to derive insights related to asset operation and present these insights via a GUI. To these ends, the computing system (a) receives data related to the operation of assets, (b) based on this data, derives a plurality of insights related to the operation of at least a subset of the assets, (c) from the insights, defines a given subset of insights to be presented to a user, (d) defines at least one aggregated insight representative of one or more individual insights in the given subset of insights that are related to a common underlying problem, and (e) causes the user's client station to display a visualization of the given subset of insights including (i) an insights pane that provides a high-level overview of the subset of insights and (ii) a details pane that provides additional details regarding a selected one of the subset of insights.
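
Step (d) above, defining an aggregated insight for individual insights that share a common underlying problem, can be sketched as a simple grouping. The dictionary keys (`problem`, `asset_id`) and the shape of the aggregated record are hypothetical; the abstract does not say how a common underlying problem is identified, so this sketch assumes each insight already carries a problem label.

```python
from collections import defaultdict
from typing import Dict, List


def aggregate_insights(insights: List[dict]) -> List[dict]:
    """Group individual insights by their (assumed) underlying-problem label."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for insight in insights:
        groups[insight["problem"]].append(insight)
    # One aggregated insight per problem, summarizing the affected assets.
    return [
        {
            "problem": problem,
            "count": len(members),
            "asset_ids": sorted(m["asset_id"] for m in members),
        }
        for problem, members in groups.items()
    ]


insights = [
    {"asset_id": "pump-2", "problem": "bearing wear"},
    {"asset_id": "pump-1", "problem": "bearing wear"},
    {"asset_id": "fan-7", "problem": "overheating"},
]
aggregated = aggregate_insights(insights)
```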

METHODS AND SYSTEMS THAT AUTOMATICALLY PREDICT DISTRIBUTED-COMPUTER-SYSTEM PERFORMANCE DEGRADATION USING AUTOMATICALLY TRAINED MACHINE-LEARNING COMPONENTS

The current document is directed to methods and systems that automatically generate training data for machine-learning-based components used by a metric-data processing-and-analysis component of a distributed computer system, a subsystem within a distributed computer system, or a standalone metric-data processing-and-analysis system. The training data sets are labeled using categorical KPI values. The machine-learning-based components are applied to metric data both for predicting anomalous operational behaviors and problems within the distributed computer system and for determination of potential causes of anomalous operational behaviors and problems within the distributed computer system. Training of machine-learning-based components is carried out concurrently and asynchronously with respect to other metric-data collection, aggregation, processing, storage, and analysis tasks.

Message Cloud

A method for error management is provided. The method comprises receiving a message call request regarding an error event generated by a software application. The message call request comprises a message ID associated with an error type. In response to the call request a message cache is searched for the message ID. If the ID is in the cache, an error message associated with the ID is returned. The error message provides a description of the error and suggested remedial action. If the message ID is not in the cache, the error message is fetched from a message repository that contains error messages corresponding to respective message IDs. The fetched error message is loaded into the cache and returned. Message call request data is stored in a metrics repository. The message call request data comprises frequency metrics that describe how often the message ID is received.
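
The lookup flow above follows a cache-aside pattern, which can be sketched as follows. The class and attribute names are hypothetical, and plain in-memory dictionaries stand in for the message cache, the message repository, and the metrics repository.

```python
from collections import Counter
from typing import Dict


class MessageService:
    """Hypothetical cache-aside lookup of error messages by message ID."""

    def __init__(self, repository: Dict[str, str]) -> None:
        self.repository = repository        # stand-in for the message repository
        self.cache: Dict[str, str] = {}     # stand-in for the message cache
        self.metrics: Counter = Counter()   # stand-in for the metrics repository

    def get_message(self, message_id: str) -> str:
        # Record the call so per-ID frequency can be reported later.
        self.metrics[message_id] += 1
        if message_id in self.cache:
            return self.cache[message_id]
        # Cache miss: fetch from the repository, load into the cache, return.
        # An unknown ID raises KeyError in this sketch.
        message = self.repository[message_id]
        self.cache[message_id] = message
        return message


service = MessageService({"E100": "Disk full. Free space or extend the volume."})
first = service.get_message("E100")    # miss: loaded from the repository
second = service.get_message("E100")   # hit: served from the cache
```

After these two calls, `service.metrics["E100"]` holds the frequency metric for that message ID.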

PREDICTIVE BATCH JOB FAILURE DETECTION AND REMEDIATION

Systems, methods, and computer programming products are provided for predicting, preventing, and remediating failures of batch jobs being executed and/or queued for processing at a future scheduled time. Batch job parameters, messages, and system logs are stored in knowledge bases and/or inputted into AI models for analysis. Using predictive analytics and/or machine learning, batch job failures are predicted before the failures occur. Mappings of processes used by each batch job, historical data from previous batch jobs, and data identifying the success or failure thereof build an archive that can be refined over time through active learning feedback and AI modeling to predictively recommend actions that have historically prevented or remediated failures. Recommended actions are reported to the system administrator or automatically applied. As job failures occur over time, mappings of the current system log to logs for the unsuccessful batch jobs help the root cause analysis become simpler and more automated.

ERROR INFORMATION PROCESSING METHOD AND DEVICE, AND STORAGE MEDIUM
20230009868 · 2023-01-12

An error information processing method includes, in response to a memory error triggering an interrupt, collecting error information of the memory error that includes a first memory area where the memory error occurs, obtaining a second memory area for writing log information, determining whether the second memory area contains the first memory area, and, in response to determining that the second memory area contains the first memory area, skipping a process of writing the log information into the second memory area.
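
The containment check at the heart of this method can be sketched as below. The `Region` type, the half-open address convention, and the use of a Python list as the log target are assumptions made for illustration; the abstract only states that writing is skipped when the log area contains the faulty area.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    start: int  # inclusive start address
    end: int    # exclusive end address


def contains(outer: Region, inner: Region) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return outer.start <= inner.start and inner.end <= outer.end


def handle_memory_error(error_area: Region, log_area: Region, log: List[str]) -> bool:
    """Write log information unless the log area contains the faulty area.

    Returns True if the log entry was written, False if writing was skipped.
    """
    if contains(log_area, error_area):
        # Writing here would touch the memory that just failed, so skip it.
        return False
    log.append(f"memory error at [{error_area.start:#x}, {error_area.end:#x})")
    return True


log: List[str] = []
log_area = Region(0x0000, 0x2000)
skipped = handle_memory_error(Region(0x1000, 0x1010), log_area, log)  # inside the log area
written = handle_memory_error(Region(0x3000, 0x3010), log_area, log)  # outside the log area
```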