ALERT CORRELATING USING SEQUENCE MODEL WITH TOPOLOGY REINFORCEMENT SYSTEMS AND METHODS
20230133541 · 2023-05-04
Assignee
Inventors
- Jiayi Gu HOFFMAN (Sunnyvale, CA, US)
- Mahesh RAMACHANDRAN (San Jose, CA, US)
- Bhanu Pratap SINGH (Fremont, CA, US)
CPC classification
H04L41/0631
ELECTRICITY
G06F9/542
PHYSICS
International classification
Abstract
Alert correlation helps reduce the number of alerts that IT staff have to act upon. The methods include a computer program product that applies a machine-driven deep learning model to correlate alerts caused by a common root cause. The correlation provides the user with the context of the root cause for the alerts, helping the user quickly identify, understand, and resolve the problem, thereby reducing the mean time to identification and resolution. Alerts caused by the same root cause are therefore grouped together.
Claims
1. A method of correlating a plurality of alerts in a network environment including multiple computing devices coupled through one or more networks comprising: receiving a plurality of alerts from one or more applications operating in the network environment; analyzing the plurality of alerts in a time sequence; correlating the plurality of alerts via an alert correlation module comprising one or more of a sequence model, a topology reinforcement module, and a similarity reinforcement module; clustering the plurality of alerts attributable to a common triggering event; and alert sequence training of a recurrent neural network, the recurrent neural network including a first long short-term memory layer.
2. The method of claim 1, further comprising: converting one or more raw alerts into one or more normalized alerts for analysis.
3. The method of claim 2, wherein the raw alerts are normalized for analysis and provided to a data pipeline.
4. The method of claim 1, wherein the sequence model is trainable using historical alert sequences on a neural network.
5. The method of claim 1, wherein the topology reinforcement is created through a network discovery.
6. The method of claim 1, wherein the similarity reinforcement is based on a natural language process.
7. The method of claim 1, wherein a first alert in an alert sequence is used to invoke a sequence model.
8. The method of claim 1, wherein the recurrent neural network includes a second long short-term memory layer in connection with the first long short-term memory layer.
9. The method of claim 8, wherein the recurrent neural network includes a dropout layer for regularization, the dropout layer in connection with the second long short-term memory layer.
10. The method of claim 1, further comprising: taking information from an input alert at a first timestep; calculating an alert sequence; and predicting a time interval for which a simulation will progress.
11. The method of claim 1, further comprising alert embedding.
12. The method of claim 1, further comprising running a training workload as a scheduled batch job on a training node.
13. The method of claim 1, wherein the plurality of alerts are not analyzed individually or in alert pairs.
14. The method of claim 1, wherein, when the alert correlation module includes the sequence model and the topology reinforcement module, the topology reinforcement module correlates the plurality of alerts based on a correlation result from the sequence model.
15. One or more computer-readable storage media storing computer-executable instructions for causing a computer to perform a method, the method comprising: receiving a plurality of alerts from one or more applications operating in a network environment; analyzing the plurality of alerts in a time sequence; correlating the plurality of alerts via an alert correlation module comprising one or more of a sequence model, a topology reinforcement module, and a similarity reinforcement module; clustering the plurality of alerts attributable to a common triggering event; and alert sequence training of a recurrent neural network, the recurrent neural network including a first long short-term memory layer.
16. The computer-readable storage media of claim 15, the method further comprising: converting one or more raw alerts into one or more normalized alerts for analysis.
17. The computer-readable storage media of claim 16, wherein the raw alerts are normalized for analysis and provided to a data pipeline.
18. The computer-readable storage media of claim 15, wherein the sequence model is trainable using historical alert sequences on a neural network.
19. The computer-readable storage media of claim 15, wherein the topology reinforcement is created through a network discovery.
20. The computer-readable storage media of claim 15, wherein the similarity reinforcement is based on a natural language process.
21. The computer-readable storage media of claim 15, wherein a first alert in an alert sequence is used to invoke a sequence model.
22. The computer-readable storage media of claim 15, wherein the recurrent neural network includes a second long short-term memory layer in connection with the first long short-term memory layer.
23. The computer-readable storage media of claim 22, wherein the recurrent neural network includes a dropout layer for regularization, the dropout layer in connection with the second long short-term memory layer.
24. The computer-readable storage media of claim 15, further comprising: taking information from an input alert at a first timestep; calculating an alert sequence; and predicting a time interval for which a simulation will progress.
25. The computer-readable storage media of claim 15, further comprising alert embedding.
26. The computer-readable storage media of claim 15, further comprising running a training workload as a scheduled batch job on a training node.
27. The computer-readable storage media of claim 15, wherein the plurality of alerts are not analyzed individually or in alert pairs.
28. The computer-readable storage media of claim 15, wherein, when the alert correlation module includes the sequence model and the topology reinforcement module, the topology reinforcement module correlates the plurality of alerts based on a correlation result from the sequence model.
29. A system for correlating alerts in a computing environment including multiple computing devices coupled through one or more networks comprising: an alert processing service comprising an alert correlation module having one or more of a sequence model, topology reinforcement module, and similarity reinforcement module; and an alert normalization engine, wherein the sequence model is trainable using historical alert sequences on a recurrent neural network, the recurrent neural network including a first long short-term memory layer.
30. The system of claim 29, wherein the system is configurable to convert one or more raw alerts received by the alert processing service into one or more normalized alerts for analysis.
31. The system of claim 29, wherein the recurrent neural network includes a second long short-term memory layer in connection with the first long short-term memory layer.
32. The system of claim 31, wherein the recurrent neural network includes a dropout layer for regularization, the dropout layer in connection with the second long short-term memory layer.
33. The system of claim 29, wherein the topology reinforcement is created through a network discovery.
34. The system of claim 29, wherein the sequence model is configured to provide a correlation result to the topology reinforcement module.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0026] The novel features of the invention are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention are utilized, and the accompanying drawings of which:
DETAILED DESCRIPTION
[0033] As shown, the computing environment 100 can have a distributed service 104 (web app 1), which may depend on another software application 103, a server 102, and a VPN gateway 101. The distributed service 104 may provide functionalities to another service 105, such as web app 2. As will be appreciated by those skilled in the art, everything in the environment 100 can be interconnected. Therefore, a failure at one location in the computing environment 100 can affect a plurality of components downstream from the failure location.
[0034] The flows between service 105 and storage service 106, between service 105 and document database 107, and between service 105 and service 104 illustrate the interdependent nature of the IT environment. Service 105, a web application, depends on storage service 106 and document database 107 for data persistence and consumes services provided by service 104.
[0035] For example, as illustrated in
[0036] In this scenario, each of the four alerts 111, 112, 113, 114 is tied to the same root cause, which is a failure or performance event on server 102. Using the disclosed alert processing service 140, the four alerts 111, 112, 113, 114 are communicated to an alert normalization engine 142 for processing. As will be appreciated by those skilled in the art, with a high number of devices and applications operating in an environment 100, and various types of IT resources, operation monitoring is highly complex and noisy. For example, an unrelated alert, alert x 115, i.e., one unrelated to the four alerts 111, 112, 113, 114, may be created for application 108 around the same time as the four alerts 111, 112, 113, 114 and communicated to the alert normalization engine 142. Correlating alerts in such an environment using traditional similarity-based approaches is difficult because alerts are highly dynamic.
[0037] In a centralized alert processing service 140 according to the disclosure, two or more alerts are collected from various sources and analyzed together in time sequence; alerts are not analyzed individually or in alert pairs. As noted above, a single triggering event can set off a chain reaction that results in a plurality of alerts for the same triggering event from different locations within the environment 100. Alerts resulting from the same triggering event have sequence patterns. Analyzing alerts using a deep learning sequence model identifies alerts that occur together (e.g., clustered), such as alerts 111, 112, 113, 114. Analyzing and clustering alerts provides the user with the context of the root cause for the alerts and with clarity to determine the impacted services and resources. The analyzed alerts help the user better understand the nature of the triggering event, allowing a faster response and a reduction in the time to resolution of the problem.
[0039] The correlation process correlates related alerts by the following method:
[0040] The alert normalization engine 142 converts a plurality of raw alerts received from a plurality of sources in the environment 100 into normalized alerts for the subsequent alert analysis;
[0041] The normalized alerts are provided to a data pipeline 207;
[0042] The alert stream is turned into sequences using time gaps and provided to the alert correlation module 220.
For example, alerts separated in time by more than a defined interval, e.g., 2 minutes, belong to different sequences.
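The time-gap windowing described above can be sketched as follows. This is a minimal illustration, not the patented implementation; the function name, the (timestamp, alert) representation, and the 2-minute threshold taken from the example are assumptions.

```python
from datetime import datetime, timedelta

# Hypothetical gap threshold, matching the 2-minute example above.
GAP = timedelta(minutes=2)

def split_into_sequences(alerts, gap=GAP):
    """Split a time-ordered list of (timestamp, alert) pairs into
    sequences wherever consecutive alerts are more than `gap` apart."""
    sequences = []
    current = []
    for ts, alert in alerts:
        # A gap larger than the threshold starts a new sequence.
        if current and ts - current[-1][0] > gap:
            sequences.append(current)
            current = []
        current.append((ts, alert))
    if current:
        sequences.append(current)
    return sequences
```

Each resulting sequence would then be handed to the alert correlation module as one candidate group of related alerts.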
[0043] The alert correlation module 220 has three parts:
[0044] a sequence model 208 that is trained using historical alert sequences on a recurrent neural network;
[0045] a topology reinforcement 209 created through network discovery; and
[0046] a similarity reinforcement 210 based on natural language processing (“NLP”) technology.
[0047] The first alert 211 in an alert sequence is used to invoke the sequence model 208 to generate alert sequences. The sequence model 208 can map an input sequence to an output sequence, where the lengths of the input and output may differ. Because the sequence of alerts 211, 212, 213, 214 has been seen multiple times before, the sequence model 208 generates alerts 212, 213, 214. Alert 216 is a random alert, so it is not included in the generated sequence and is ruled out immediately. Alert 215 is the same type of web-app-impacted alert as alert 214, so alert 215 is kept as a clustering candidate for the moment and is moved to the next phase along with the other alerts attributed to the same root cause.
[0048] According to the embodiment, the alert correlation module 220 has:
[0049] a sequence model 208,
[0050] a topology reinforcement 209, and
[0051] a similarity reinforcement 210 module.
Topology reinforcement 209 receives data from the sequence model 208 and performs the next phase of processing the plurality of alerts received. According to the discovered topology, alerts 211, 212, 213, 214 are connected to each other, whereas alert 215 is not; therefore, using topology reinforcement 209, alert 215 is ruled out of the correlation. The final correlated alerts 211, 212, 213, 214 are attributed to the same root cause.
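The topology check above can be sketched as a reachability filter over the discovered network graph. This is an illustrative sketch only; the patent does not specify the data structures, so the edge-list representation and all names here are hypothetical.

```python
from collections import deque

def topology_filter(candidates, seed_node, edges):
    """Keep only candidate alerts whose source nodes are reachable
    from the seed alert's node in the discovered topology graph."""
    # Build an undirected adjacency map from the discovered links.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Breadth-first search from the seed node.
    reachable = {seed_node}
    queue = deque([seed_node])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in reachable:
                reachable.add(nxt)
                queue.append(nxt)
    # Drop candidates, like alert 215, that sit on unconnected nodes.
    return [(alert, node) for alert, node in candidates if node in reachable]
```

A candidate kept by the sequence model but raised on a node with no topological path to the root-cause node would be ruled out at this stage.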
[0053] This allows a unidirectional RNN to take information from the input alert 301 at a first timestep and predict the values of the future timesteps 309, 310, 311, 312, i.e., the time interval for which the simulation will progress. The alert sequence generation is supervised learning in which the output y at a given timestep equals the input x at the next timestep:
y<i>=x<i+1>.
[0054] For example, first prediction 308 should be close in value to input alert 304.
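The shifted input/target relation y<i> = x<i+1> can be illustrated directly. Per the claims, the network itself would stack two LSTM layers followed by a dropout layer; the shifting of inputs into training targets is framework-independent and is sketched below with hypothetical alert tokens.

```python
def make_training_pairs(sequence):
    """Build next-alert prediction pairs from one alert sequence:
    the target y at timestep i is the input x at timestep i+1."""
    xs = sequence[:-1]   # inputs x<1> .. x<n-1>
    ys = sequence[1:]    # targets y<i> = x<i+1>
    return list(zip(xs, ys))
```

Training on such pairs teaches the model that, given the first alert of a known sequence, the remaining alerts should follow, which is exactly how the sequence model generates candidate alerts from alert 211.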
[0056] For example, after learning, the model finds that the CPU utilization alert 401 and CPU stats alert 402 are interchangeable. Therefore, both CPU alerts 401, 402 are used when generating a CPU-related alert sequence. Load alert 403 is similar to the CPU alerts; consequently, the CPU alerts and the load alert are located in close proximity in the vector space. System ping alert 406 and Cassandra server down alert 407 are similar to each other but different from the other alerts. Therefore, the system ping alert 406 and Cassandra server down alert 407 are close to each other in the vector space but far away from other alerts, such as CPU utilization alert 401 and Cassandra write request alert 405. When generating Cassandra down alerts, for example, a system ping alert 406 might be generated along with Cassandra server down alert 407 due to their close proximity in the vector space, but a Tomcat threads busy alert 404 would not.
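Proximity in the embedding vector space is typically measured with cosine similarity. The sketch below uses hypothetical 3-dimensional embeddings (real alert embeddings would be learned and much higher-dimensional) to show how the two CPU alerts sit close together while the ping alert sits far from them.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two alert-embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical toy embeddings, for illustration only.
cpu_utilization = [0.9, 0.1, 0.0]   # alert 401
cpu_stats       = [0.8, 0.2, 0.1]   # alert 402
system_ping     = [0.0, 0.1, 0.9]   # alert 406
```

With these toy vectors, `cosine_similarity(cpu_utilization, cpu_stats)` is close to 1, while `cosine_similarity(cpu_utilization, system_ping)` is close to 0, mirroring the near/far relationships described in the paragraph above.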
[0058] Inference nodes 510 run a real-time processing workload. Alert data 520 from various sources is injected into one or more data pipelines 506 to form alert streams 512. The alert stream 512 passes through the alert correlation module 507 and creates correlated alerts at insight 508. The correlated alerts at insight 508 provide the user insight into the problem experienced in the computing environment. Meanwhile, inference nodes 510 send new alerts into the alert repository 501 for the next retraining job to analyze and create new patterns from the newly received alerts. The training workload can work in parallel with the real-time processing workload to provide continuous learning and continuous insight into the operation of the IT system.
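The hand-off between the inference workload and the scheduled batch retraining job can be sketched as below. This is a minimal stand-in, with all class and function names hypothetical; a real deployment would use a job scheduler and a persistent alert repository rather than an in-memory list.

```python
class AlertRepository:
    """Minimal stand-in for the alert repository: inference nodes append
    new alerts; the scheduled batch job drains them for retraining."""
    def __init__(self):
        self._alerts = []

    def append(self, alert):
        self._alerts.append(alert)

    def drain_for_training(self):
        # Hand the accumulated alerts to the trainer and reset.
        batch, self._alerts = self._alerts, []
        return batch

def retraining_job(repository, train_fn):
    """Scheduled batch job on a training node: retrain on newly received
    alerts so the sequence model keeps learning new patterns."""
    new_alerts = repository.drain_for_training()
    if new_alerts:
        train_fn(new_alerts)
    return len(new_alerts)
```

Because the repository decouples the two workloads, inference can keep appending alerts while each scheduled run of `retraining_job` consumes whatever has accumulated since the previous run.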
[0059] In engaging the systems and methods according to aspects of the disclosed subject matter, a user may engage in one or more use sessions. A use session may include a training session for the user.
[0060] The systems and methods according to aspects of the disclosed subject matter may utilize a variety of computer and computing systems, communications devices, networks and/or digital/logic devices for operation. Each may, in turn, be configurable to utilize a suitable computing device that can be manufactured with, loaded with and/or fetch from some storage device, and then execute, instructions that cause the computing device to perform a method according to aspects of the disclosed subject matter.
[0061] A computing device can include without limitation a mobile user device such as a mobile phone, a smart phone and a cellular phone, a personal digital assistant (“PDA”), such as an iPhone®, a tablet, a laptop and the like. In at least some configurations, a user can execute a browser application over a network, such as the internet, to view and interact with digital content, such as screen displays. A display includes, for example, an interface that allows a visual presentation of data from a computing device. Access could be over or partially over other forms of computing and/or communications networks. A user may access a web browser, e.g., to provide access to applications and data and other content located on a website or a webpage of a website.
[0062] A suitable computing device may include a processor to perform logic and other computing operations, e.g., a stand-alone computer processing unit (“CPU”), or hard wired logic as in a microcontroller, or a combination of both, and may execute instructions according to its operating system and the instructions to perform the steps of the method, or elements of the process. The user's computing device may be part of a network of computing devices and the methods of the disclosed subject matter may be performed by different computing devices associated with the network, perhaps in different physical locations, cooperating or otherwise interacting to perform a disclosed method. For example, a user's portable computing device may run an app alone or in conjunction with a remote computing device, such as a server on the Internet. For purposes of the present application, the term “computing device” includes any and all of the above discussed logic circuitry, communications devices and digital processing capabilities or combinations of these.
[0063] Certain embodiments of the disclosed subject matter may be described for illustrative purposes as steps of a method that may be executed on a computing device executing software, and illustrated, by way of example only, as a block diagram of a process flow. Such may also be considered as a software flow chart. Such block diagrams and like operational illustrations of a method performed or the operation of a computing device and any combination of blocks in a block diagram, can illustrate, as examples, software program code/instructions that can be provided to the computing device or at least abbreviated statements of the functionalities and operations performed by the computing device in executing the instructions. Some possible alternate implementations may involve the function, functionalities and operations noted in the blocks of a block diagram occurring out of the order noted in the block diagram, including occurring simultaneously or nearly so, or in another order, or not occurring at all. Aspects of the disclosed subject matter may be implemented in parallel or seriatim in hardware, firmware, software or any combination(s) of these, co-located or remotely located, at least in part, from each other, e.g., in arrays or networks of computing devices, over interconnected networks, including the Internet, and the like.
[0064] The instructions may be stored on a suitable “machine readable medium” within a computing device or in communication with or otherwise accessible to the computing device. As used in the present application a machine readable medium is a tangible storage device and the instructions are stored in a non-transitory way. At the same time, during operation, the instructions may at times be transitory, e.g., in transit from a remote storage device to a computing device over a communication link. However, when the machine readable medium is tangible and non-transitory, the instructions will be stored, for at least some period of time, in a memory storage device, such as a random access memory (RAM), read only memory (ROM), a magnetic or optical disc storage device, or the like, arrays and/or combinations of which may form a local cache memory, e.g., residing on a processor integrated circuit, a local main memory, e.g., housed within an enclosure for a processor of a computing device, a local electronic or disc hard drive, a remote storage location connected to a local server or a remote server access over a network, or the like. When so stored, the software will constitute a “machine readable medium,” that is both tangible and stores the instructions in a non-transitory form. At a minimum, therefore, the machine readable medium storing instructions for execution on an associated computing device will be “tangible” and “non-transitory” at the time of execution of instructions by a processor of a computing device and when the instructions are being stored for subsequent access by a computing device.
[0065] As will be appreciated by those skilled in the art, the systems and methods disclosed are configurable to send a variety of messages when alerts are generated. Messages include, for example, SMS and email.
[0066] While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. It is intended that the claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.