G06F11/2257

Evaluation method and evaluation apparatus

A calculation unit calculates, for each of a plurality of systems in which a countermeasure has been taken, a maturity index indicating the degree of operational stability of that system, based on values related to the system's non-functional requirements. An evaluation unit evaluates the usefulness of the countermeasure for a particular system based on the similarity of configuration between the particular system and each of those systems, the timing at which the countermeasure was taken, the effects of the countermeasure, and the calculated maturity index.
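The scoring described above might be sketched as follows. This is a hypothetical illustration, not the patent's method: the function names, the Jaccard similarity measure, and the equal weighting of maturity, similarity, and effect are all assumptions.

```python
# Hypothetical sketch: score a countermeasure's usefulness for a target
# system from configuration similarity to, and maturity of, each system
# where the countermeasure was already taken.

def maturity_index(metrics):
    """Average of normalized non-functional metrics (each in 0..1),
    e.g. availability or incident-free rate. Higher = more stable."""
    return sum(metrics.values()) / len(metrics)

def config_similarity(a, b):
    """Jaccard similarity between two sets of configuration items."""
    return len(a & b) / len(a | b)

def usefulness(target_config, deployments):
    """Weight each deployment's observed effect by its configuration
    similarity to the target and its maturity index, then average."""
    score = 0.0
    for d in deployments:
        sim = config_similarity(target_config, d["config"])
        score += sim * d["effect"] * maturity_index(d["metrics"])
    return score / len(deployments)
```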

Generating dependency graphs for analyzing program behavior

An analysis management system (AMS) is described that analyzes the in-field behavior of a program resource installed on a collection of computing devices, such as mobile telephone devices or the like. In operation, the AMS can instruct different devices to collect data regarding different observation points associated with the program resource, thus spreading the reporting load among the devices. Based on the data that is collected, the AMS can update a dependency graph that describes dependencies among the observation points associated with the program resource. The AMS can then generate new directives based on the updated dependency graph. The AMS can also use the dependency graph and the collected data to infer information regarding observation points that is not directly supplied by the collected data.
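The inference step above, in which the AMS fills in observation points not directly supplied by collected data, could be sketched as a propagation over the dependency graph. This is a hypothetical simplification; the edge semantics ("a implies b") and the state-inheritance rule are assumptions.

```python
# Hypothetical sketch: infer states for unobserved observation points by
# propagating collected data along a dependency graph, where an edge
# (a -> b) means point b's state is implied by point a's state.

def infer(dependencies, observed):
    """dependencies: dict mapping a point to the points it implies.
    observed: dict of directly collected point states.
    Returns observed states plus states reachable by propagation."""
    known = dict(observed)
    frontier = list(observed)
    while frontier:
        point = frontier.pop()
        for implied in dependencies.get(point, []):
            if implied not in known:
                known[implied] = known[point]  # inherit implied state
                frontier.append(implied)
    return known
```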

Log analytics for problem diagnosis

In a set of problem log entries from a computing system, a subset that pertains to a failed request is identified. The subset is compared to a reference model, which defines the expected log entries per request type under a healthy state of the computing system, to identify the portion of the subset that deviates from the corresponding log entries in the reference model. Within that portion, at least one high-value log entry is identified and output.
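The comparison against a healthy-state reference model could be sketched as a per-request-type set difference. This is a hypothetical illustration; the reference-model shape and the "first deviation is high-value" heuristic are assumptions.

```python
# Hypothetical sketch: find problem-log entries that deviate from a
# reference model of expected entries per request type, then surface
# one deviation as the high-value entry.

def deviating_entries(problem_logs, reference_model, request_type):
    """Entries in the problem subset absent from the healthy model."""
    expected = set(reference_model[request_type])
    return [e for e in problem_logs if e not in expected]

def high_value_entry(problem_logs, reference_model, request_type):
    """Pick the first deviating entry as the high-value one."""
    deviations = deviating_entries(problem_logs, reference_model, request_type)
    return deviations[0] if deviations else None
```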

User-directed diagnostics and auto-correction

A method, system, and computer program product for performing user-initiated logging and auto-correction in hardware/software systems. Embodiments commence upon identifying a set of test points and respective instrumentation components, then determining the logging capabilities of those instrumentation components. The nature and extent of the capabilities and configuration of the components aid in generating labels to describe the various logging capabilities. The labels are then used in a user interface to obtain user-configurable settings, which are also used in determining auto-correction actions. A measurement taken at a test point may result in detection of an occurrence of a certain condition, and auto-correction steps can be taken by retrieving a rulebase comprising conditions corresponding to one or more measurements and corrective actions corresponding to those conditions. Detection of a condition can automatically invoke any number of processes to apply a corrective action and/or emit a recommendation.
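A rulebase of the kind described, mapping measurement conditions to corrective actions, might be sketched as below. The rule contents (temperature and disk thresholds) and the action names are hypothetical.

```python
# Hypothetical sketch: a rulebase pairing condition predicates over a
# measurement with corrective actions, evaluated on detection.

RULEBASE = [
    # (condition predicate on a measurement, corrective action name)
    (lambda m: m["temp_c"] > 90, "throttle_cpu"),
    (lambda m: m["disk_free_pct"] < 5, "rotate_logs"),
]

def auto_correct(measurement):
    """Return the corrective actions triggered by the measurement,
    in rulebase order."""
    return [action for cond, action in RULEBASE if cond(measurement)]
```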

Detecting datacenter mass outage with near real-time/offline using ML models

The present embodiments relate to data center outage detection and alert generation. An outage detection service as described herein can ingest near real-time data from various sources in a data center and process the data using a model to determine one or more projected sources of a detected outage. The model can include one or more machine learning models incorporating a series of rules to process near real-time data and offline data and determine one or more projected sources of an outage. An alert message can be generated to provide the projected sources of the outage and other data relevant to the outage.
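The rule-driven combination of near real-time and offline data could be sketched with a toy scoring rule. This is a hypothetical stand-in for the patent's ML models: the error-rate threshold, the recent-change boost, and the data shapes are all assumptions.

```python
# Hypothetical sketch: rank projected outage sources by combining
# near real-time error rates with an offline recent-change inventory.

def projected_sources(realtime, offline):
    """realtime: dict of component -> observed error rate.
    offline: dict of component -> recently-changed flag.
    Components over an error-rate threshold are candidates; a recent
    change boosts the suspicion score. Returns components ranked by
    descending score."""
    scores = {}
    for component, error_rate in realtime.items():
        if error_rate > 0.1:                   # anomaly threshold
            score = error_rate
            if offline.get(component, False):  # recently changed
                score += 0.5
            scores[component] = score
    return sorted(scores, key=scores.get, reverse=True)
```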

Machine learning model monitoring

Technologies for monitoring the performance of a machine learning model include receiving, by an unsupervised anomaly detection function, digital time series data for a feature metric, where the feature metric is computed for a feature extracted from an online system over a time interval, and where the machine learning model produces output relating to one or more users' use of the online system. Using the unsupervised anomaly detection function, anomalies are detected in the time series data. A subset of the detected anomalies is labeled when the deviation of a time-series prediction model from a predicted baseline model exceeds a predicted deviation criterion. Digital output is created that identifies the feature as associated with the labeled subset of anomalies and, in response to that output, a modification of the machine learning model is caused.
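A minimal version of baseline-deviation anomaly detection on a feature-metric series might look like the following. This is a hypothetical sketch: the trailing-mean baseline and the fixed deviation criterion are simplifications, not the patent's prediction model.

```python
# Hypothetical sketch: flag points in a feature-metric time series
# whose value deviates from a trailing-mean baseline by more than a
# fixed deviation criterion.

def detect_anomalies(series, window=3, criterion=2.0):
    """Return indices whose value differs from the mean of the
    previous `window` points by more than `criterion`."""
    anomalies = []
    for i in range(window, len(series)):
        baseline = sum(series[i - window:i]) / window
        if abs(series[i] - baseline) > criterion:
            anomalies.append(i)
    return anomalies
```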

Software Application Diagnostic Aid
20210390010 · 2021-12-16

A diagnostics tool aids in the efficient collection of relevant error data for addressing faults in connected software systems. Context information is collected from a software system that is being displayed. The configuration of the software system is collected and used to identify relevant connected software systems. Error data is collected via respective log interfaces from the error logs of the software system being displayed and the relevant connected systems. The context, configuration, and error data are stored in a database. Based at least upon the configuration information, a query is formulated and posed to the database. A corresponding query result is received and processed to return an error report to a user interface, for inspection (e.g., by a user or a support staff member). Certain embodiments may further generate an appropriate recommendation based upon the query result. The recommendation may be generated with reference to a stored ruleset and/or artificial intelligence.
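The store-then-query step could be sketched against an in-memory database. This is a hypothetical illustration; the table schema, the list of connected systems as a query parameter, and the single-table layout are assumptions.

```python
# Hypothetical sketch: store collected error data in a database, then
# formulate a query from the configuration-derived list of connected
# systems to pull the relevant error entries for the report.
import sqlite3

def build_report(rows, connected_systems):
    """rows: (system, message) error entries. connected_systems: the
    systems identified from configuration. Returns matching rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE errors (system TEXT, message TEXT)")
    db.executemany("INSERT INTO errors VALUES (?, ?)", rows)
    placeholders = ",".join("?" * len(connected_systems))
    query = f"SELECT system, message FROM errors WHERE system IN ({placeholders})"
    return db.execute(query, connected_systems).fetchall()
```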

Automatic root cause analysis using ternary fault scenario representation
11372708 · 2022-06-28

A plurality of potential fault scenarios are accessed, wherein a given potential fault scenario of the plurality of potential fault scenarios has at least one corresponding root cause, and a representation of the given potential fault scenario comprises a don't care value. An actual fault scenario from telemetry received from a monitored system is generated. The actual fault scenario is matched against the plurality of potential fault scenarios. One or more matched causes are output as one or more probable root cause failures of the monitored system.
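The ternary matching described above, where a potential fault scenario may contain a don't-care value, can be sketched directly. The use of `None` as the don't-care marker and the tuple-of-flags encoding are assumptions for illustration.

```python
# Hypothetical sketch: match an actual fault scenario (a tuple of 0/1
# symptom flags) against potential scenarios whose representation may
# contain a don't-care value, returning the matched root causes.

DONT_CARE = None

def matches(potential, actual):
    """A potential scenario matches when every non-don't-care position
    equals the actual scenario's value."""
    return all(p is DONT_CARE or p == a for p, a in zip(potential, actual))

def probable_root_causes(potential_scenarios, actual):
    """potential_scenarios: list of (pattern, root_cause) pairs.
    Returns the root causes of all matching patterns."""
    return [cause for pattern, cause in potential_scenarios
            if matches(pattern, actual)]
```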

Method and system for assisting troubleshooting of a complex system

A system and a method for assisting with troubleshooting a complex system are disclosed, in which the troubleshooting procedure is modeled by a Markov decision process. The fault tree technique is combined with the Markov decision process to determine, in an optimal manner, the sequence of troubleshooting actions that will most quickly address the consequences of a failure and ensure maintainability of the complex system.
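A heavily simplified stand-in for such action sequencing is the greedy cost-over-success-probability ordering, which is optimal for sequencing independent diagnostic checks. This is not the patent's fault-tree/MDP formulation; the action names, costs, and probabilities below are hypothetical.

```python
# Hypothetical sketch: order troubleshooting actions by ascending
# cost / success-probability ratio, a greedy policy for sequential
# diagnosis with independent checks.

def action_sequence(actions):
    """actions: dict name -> (cost, probability the action isolates
    the fault). Returns action names in recommended order."""
    return sorted(actions, key=lambda a: actions[a][0] / actions[a][1])

def best_action(actions):
    """The single next action to try under the same policy."""
    return action_sequence(actions)[0]
```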

A disaster recovery system and method
20230273868 · 2023-08-31

A disaster recovery (DR) system and method configured to test and evaluate system readiness and ability to recover, while providing various management tools that can assist an administrator operating the DR system. The DR system and method further enable automated fixing and testing procedures while maintaining real-time, reliable, and up-to-date backup solutions.