H04L41/0636

SYSTEM AND METHOD FOR ANOMALY DETECTION WITH ROOT CAUSE IDENTIFICATION

A computer device may include a processor configured to obtain key performance indicator (KPI) values for KPI parameters associated with at least one device and compute a set of historical statistical values for the obtained KPI values associated with the at least one device. The processor may be further configured to provide the KPI values and the computed set of historical statistical values to an anomaly detection model to identify potential anomalies; filter the identified potential anomalies based on a designated desirable behavior for a particular KPI parameter to identify at least one anomaly; and send an alert that includes information identifying the at least one anomaly to a management system or a repair system associated with the device. The computer device may further determine a root cause KPI parameter for the identified at least one anomaly and include information identifying the determined root cause KPI parameter in the alert.
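As a rough illustration of the flow described in this abstract, the sketch below computes historical statistics per KPI, flags large deviations, and then drops deviations that move in the designated desirable direction. The z-score test is only a stand-in for the unspecified anomaly detection model, and the names `detect_anomalies`, `desirable`, and `threshold` are assumptions made for this example, not terms from the patent.

```python
from statistics import mean, stdev

def detect_anomalies(history, current, desirable, threshold=3.0):
    """Return (kpi, z_score) pairs for undesirable deviations.

    history:   dict mapping KPI name -> list of past values
    current:   dict mapping KPI name -> latest value
    desirable: dict mapping KPI name -> "low" or "high" (assumed encoding
               of the "designated desirable behavior")
    """
    anomalies = []
    for kpi, value in current.items():
        past = history[kpi]
        mu, sigma = mean(past), stdev(past)  # historical statistical values
        if sigma == 0:
            continue  # no variation in history, z-score is undefined
        z = (value - mu) / sigma
        if abs(z) < threshold:
            continue  # not a potential anomaly
        # Filter step: ignore deviations in the desirable direction,
        # e.g. latency that drops or throughput that rises.
        direction = desirable.get(kpi)
        if direction == "low" and z < 0:
            continue
        if direction == "high" and z > 0:
            continue
        anomalies.append((kpi, z))
    return anomalies
```

With this filter, an unusually low latency reading would be discarded while an unusually low throughput reading would be kept as an anomaly. Root cause determination is not covered by this sketch.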

METHOD AND APPARATUS FOR MANAGING DEVICE FAILURE
20170269983 · 2017-09-21

Embodiments of the present disclosure relate to a method and apparatus for managing a failure of a device. The method comprises detecting whether a failure occurs in a device and, in response to the failure occurring, generating a failure report for the failure. The method further comprises querying a device object repository with the failure report, where the device object repository stores historical failure information associated with the device and a fix solution corresponding to the historical failure information. The method further comprises obtaining the fix solution from the device object repository based on a comparison between the failure report and the historical failure information. Embodiments of the present disclosure can manage the failure of the device more effectively.
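The repository lookup described here can be pictured with a small sketch. The abstract does not say how the failure report is compared with historical failure information, so the example below assumes a report is a set of symptom strings and uses Jaccard similarity as a placeholder comparison. The record layout and the names `find_fix` and `min_similarity` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of symptom strings."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def find_fix(report_symptoms, repository, min_similarity=0.5):
    """Return the fix of the most similar historical failure, or None.

    repository: list of dicts with "symptoms" and "fix" keys
                (an assumed layout for the device object repository).
    """
    best = max(
        repository,
        key=lambda entry: jaccard(report_symptoms, entry["symptoms"]),
        default=None,
    )
    if best is None:
        return None
    if jaccard(report_symptoms, best["symptoms"]) < min_similarity:
        return None  # nothing in the history is close enough
    return best["fix"]
```

A threshold is used so that a report with no close historical match returns nothing instead of an unrelated fix.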

Forming root cause groups of incidents in clustered distributed system through horizontal and vertical aggregation
11252014 · 2022-02-15

A system and method for the aggregation and grouping of previously identified, causally related abnormal operating conditions that are observed in a monitored environment is disclosed. Agents are deployed to the monitored environment to capture data describing structural aspects of the monitored environment, as well as data describing activities performed on it, such as the execution of distributed transactions. The data describing structural aspects is aggregated into a topology model that describes individual components of the monitored environment, their communication activities, and their resource dependencies, and that also identifies and groups components that serve the same purpose, e.g., processes executing the same code. Activity-related monitoring data is constantly monitored to identify abnormal operating conditions. Data describing abnormal operating conditions is analyzed in combination with topology data to identify networks of causally related abnormal operating conditions. Causally related abnormal operating conditions are then grouped using known topological resource and same-purpose dependencies. Identified groups are analyzed to determine their root cause relevance.

Internet last-mile outage detection using IP-route clustering

Techniques for internet last-mile outage detection are disclosed herein. The techniques include methods for monitoring, by a network appliance associated with a network, a plurality of network nodes, detecting, by the network appliance, that a network node of the plurality of network nodes in a last mile of the network has disconnected from the network, overlaying, by the network appliance, the network node over a network model for at least a portion of the network including the network node to generate a model overlay, and determining, by the network appliance, a last mile outage source associated with a disconnection of the network node by identifying a lowest common ancestor node of the network node from the model overlay. Systems and computer-readable media are also provided.
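The lowest common ancestor step mentioned in this abstract can be sketched on a simple tree. The example assumes the network model is available as a child-to-parent map and that several disconnected nodes are considered together. Both are assumptions for illustration, as are the node names and the function names.

```python
def path_to_root(parent, node):
    """Return the list of nodes from `node` up to the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def lowest_common_ancestor(parent, nodes):
    """Return the deepest node shared by the root paths of all `nodes`.

    parent: dict mapping each non-root node to its parent
    nodes:  iterable of disconnected nodes overlaid on the model
    """
    # Reverse each path so that every list starts at the root.
    paths = [path_to_root(parent, n)[::-1] for n in nodes]
    lca = None
    for level in zip(*paths):
        if all(x == level[0] for x in level):
            lca = level[0]  # still on the shared part of the paths
        else:
            break
    return lca

# Hypothetical topology: one core router, two aggregation nodes, three CPEs.
parent = {"agg1": "core", "agg2": "core",
          "cpe1": "agg1", "cpe2": "agg1", "cpe3": "agg2"}
print(lowest_common_ancestor(parent, ["cpe1", "cpe2"]))  # agg1
print(lowest_common_ancestor(parent, ["cpe1", "cpe3"]))  # core
```

In this toy topology, two failed nodes under the same aggregation node point at that aggregation node, while failures spread across both branches point at the core.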

Bayesian-based event grouping

Techniques for Bayesian-based event grouping are provided. One technique includes determining a group of alarm events from received alarm events; in response to the group of alarm events matching a group of historical alarm events, determining a first correlation, wherein the group of historical alarm events comprises correlated events associated with a same entity; and determining a root cause of the group of alarm events based on the first correlation.

Network issue tracking and resolution system

In one embodiment, an issue analysis service determines that an issue exists with a device in a network. The service searches a decision tree for a solution to the issue, wherein branch nodes of the decision tree comprise diagnostic checks. The service clusters, based on a determination that a solution to the issue does not exist in the decision tree, telemetry for the device with telemetry for one or more other devices that also experienced the issue. The service uses a neural network to identify a difference between the clustered telemetry and telemetry from one or more devices for which the issue was resolved. The service adds a leaf node to the decision tree with the identified difference as a solution to the issue.

A METHOD AND SYSTEM FOR DETECTING A SERVER FAULT
20210377102 · 2021-12-02

A method for detecting a server fault includes: collecting sample monitoring data of a plurality of servers, the sample monitoring data signifying operating states of the plurality of servers; performing training, based on the sample monitoring data, to obtain a fault detection model for the plurality of servers; and collecting current monitoring data of a target server, and inputting the current monitoring data into the fault detection model to determine an operating fault corresponding to the current monitoring data.

AUTOMATIC CORRELATION OF DYNAMIC SYSTEM EVENTS WITHIN COMPUTING DEVICES
20220206889 · 2022-06-30

Systems and methods are described herein for logging system events within an electronic machine using an event log structured as a collection of tree-like cause and effect graphs. An event to be logged may be received. A new event node may be created within the event log for the received event. One or more existing event nodes within the event log may be identified as having possibly caused the received event. One or more causal links may be created within the event log between the new event node and the one or more identified existing event nodes. The new event node may be stored as an unattached root node in response to not identifying an existing event node that may have caused the received event.
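A minimal sketch of such an event log follows. How existing events are identified as possible causes is not stated in the abstract, so the example takes a caller-supplied predicate. The class names, the `may_have_caused` parameter, and the rule table are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class EventNode:
    event_id: int
    name: str
    causes: list = field(default_factory=list)  # causal links to earlier nodes

class EventLog:
    """Event log kept as a collection of cause-and-effect graphs."""

    def __init__(self):
        self.nodes = []  # every logged event, in arrival order
        self.roots = []  # events with no identified cause

    def log(self, name, may_have_caused):
        """Create a node for a received event and link it to possible causes.

        may_have_caused(existing_node, new_name) -> bool is supplied by the
        caller; it is a placeholder for whatever causality test is used.
        """
        node = EventNode(len(self.nodes), name)
        node.causes = [n for n in self.nodes if may_have_caused(n, name)]
        if not node.causes:
            self.roots.append(node)  # stored as an unattached root node
        self.nodes.append(node)
        return node

# Hypothetical rule table: which earlier events can cause which later ones.
RULES = {"link_down": {"power_dip"}}

def by_rules(existing, new_name):
    return existing.name in RULES.get(new_name, set())
```

Logging `power_dip` and then `link_down` with `by_rules` would link the second event to the first, while an unrelated event would become a new root.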

Forming Root Cause Groups Of Incidents In Clustered Distributed System Through Horizontal And Vertical Aggregation
20220210004 · 2022-06-30

A system and method for the aggregation and grouping of previously identified, causally related abnormal operating conditions that are observed in a monitored environment is disclosed. Agents are deployed to the monitored environment to capture data describing structural aspects of the monitored environment, as well as data describing activities performed on it, such as the execution of distributed transactions. The data describing structural aspects is aggregated into a topology model that describes individual components of the monitored environment, their communication activities, and their resource dependencies, and that also identifies and groups components that serve the same purpose, e.g., processes executing the same code. Activity-related monitoring data is constantly monitored to identify abnormal operating conditions. Data describing abnormal operating conditions is analyzed in combination with topology data to identify networks of causally related abnormal operating conditions. Causally related abnormal operating conditions are then grouped using known topological resource and same-purpose dependencies. Identified groups are analyzed to determine their root cause relevance.

VIDEO TRANSPORT STREAM STABILITY PREDICTION
20220191087 · 2022-06-16

A method of measuring video stream visual stability, the method including receiving a first set of network packets carrying data of the video stream, determining network performance metrics for a session associated with the first set of network packets, retrieving priority fault errors from a packet header of at least one network packet of the first set of the network packets, adding the priority fault errors and the network performance metrics to time series data, and applying a machine learning model to the time series data to obtain a visual stability score for the first set of network packets.