Automatic root cause diagnosis in networks based on hypothesis testing

Abstract

An embodiment may involve obtaining a set of data records including features characterizing operational aspects of a communication network. Each data record may include a feature vector and performance metrics of the communication network. Each feature vector may include multiple elements corresponding to feature-value pairs. A first statistical analysis may be applied to the set of data records and their performance metrics to identify major contributors to degraded network performance. A second statistical analysis may be applied to identify elements that negatively influence the major contributors, and to discriminate between additive effects and incompatibilities as the source of negative influence. For each major contributor, a hierarchical dependency tree may be constructed with the major contributor as the root node and influencer elements as other nodes. Redundant dependencies may be removed, mutually dependent influencer elements grouped, and only the longest edges retained, in order to create a dependency graph.

Claims

1. A computer-implemented method comprising: obtaining a set of data records including features that characterize operational aspects of a communication network, wherein each given data record comprises a feature vector and one or more performance metrics characterizing operational performance of the communication network, wherein each feature vector of the set comprises a plurality of elements, e.sub.i, i=1, . . . , n, each made up of a feature-value pair, (f.sub.i,v.sub.k), k=1, . . . , m.sub.j, for each j=i, that identifies f.sub.j with a particular one of n operational aspects of the communication network and assigns to v.sub.k one of m.sub.j values of the particular operational aspect, wherein e.sub.i is a set of elements indexed by integer index i, f.sub.j is a set of features indexed by integer index j, v.sub.k is a set of values assigned to f.sub.j and indexed by integer index k, wherein i ranges from 1 to upper limit n, wherein for each e.sub.i, j=i, wherein k ranges from 1 to upper limit m.sub.j, wherein m.sub.j is the jth upper limit, and wherein the operational aspects characterized by features correspond to hardware, software, operational, or functional components related to the network operations; applying a first statistical analysis to the set of data records and the performance metrics of the data records to generate one or more data subsets, each comprising a respective subset of feature vectors that each contains a respective, particular inefficient element, wherein the respective, particular inefficient element is associated with a statistically significant negative contribution to network performance; respectively applying a second statistical analysis to each respective subset of feature vectors to (i) identify for the respective, particular inefficient element a respective set of influencer elements representing elements associated with statistically significant negative influence on causing the negative contribution to network performance 
associated with the respective, particular inefficient element, and (ii) where the respective set of influencer elements is non-empty, discriminate between those influencer elements associated with additive negative influence and those associated with an incompatibility with the particular operational aspect of the communication network and assigned value identified with the respective, particular inefficient element; for each respective subset of feature vectors, analyzing each possible pair of elements of the influencer elements of the respective influencer set of the respective, particular inefficient element to determine a dependency relationship based on a co-occurrence of inefficiency associated with both pair members, wherein all determined dependency relationships from all pairs represent a respective hierarchical dependency tree in which influencer elements of the respective influencer set correspond to nodes; for each determined dependency relationship in each respective hierarchical dependency tree, applying a metric-based rule to identify redundant dependencies of the respective hierarchical dependency tree, and removing at least one element of each of the redundant dependencies from each respective hierarchical dependency tree; grouping mutually dependent influencer elements of each respective hierarchical dependency tree and retaining only a longest of any multiple paths between remaining nodes to generate a respective dependency graph; and displaying at least one respective dependency graph in a display device of the system.

2. The computer-implemented method of claim 1, wherein applying a first statistical analysis to the set of data records and the performance metrics of the data records to generate the one or more data subsets, each comprising the respective subset of feature vectors that each contains a respective, particular inefficient element, comprises: for each pair (f.sub.j,f.sub.k), j≠k, in the set of data records, applying a χ.sup.2 test to identify redundant pairs for which f.sub.j and f.sub.k provide redundant information with respect to respectively associated metrics, and for each redundant pair, marking one of the pair members as excluded from consideration in further analysis; for each f.sub.j=t not marked as excluded from consideration, subdividing the set of data records into one or more respective feature subsets each having the same respective value v.sub.i, and applying, with respect to respectively associated metrics, a variance test to the respective feature subsets to determine whether f.sub.j=t represents a discriminating feature, and if not, marking f.sub.j=t as excluded from consideration in further analysis, wherein t is an index of f.sub.j in the range defined for index j; and for each particular element of the set of data records having (i) an identical value and corresponding to a feature f.sub.j=s that is not excluded from consideration by either the χ.sup.2 test or the variance test, wherein s is an index of f.sub.j in the range defined for index j, and (ii) a statistically significant negative contribution to network performance with respect to an associated performance metric, retaining as one of the generated one or more data subsets a respective collection of feature vectors each containing the particular element, wherein the particular element is the particular inefficient element.

3. The computer-implemented method of claim 1, wherein applying a first statistical analysis to the set of data records and the performance metrics of the data records to generate the one or more data subsets, each comprising the respective subset of feature vectors that each contains a respective, particular inefficient element, comprises: computing a first mean of an associated performance metric for a first data set that includes the given particular element; computing a second mean of the associated performance metric for a second data set that excludes the given particular element; and based on a comparison of the first and second computed means, determining that the given particular element has a statistically significant negative contribution to network performance with respect to the associated performance metric.

4. The computer-implemented method of claim 1, wherein applying a first statistical analysis to the set of data records and the performance metrics of the data records to generate the one or more data subsets, each comprising the respective subset of feature vectors that each contains a respective, particular inefficient element, comprises: ranking the respective, particular inefficient elements from among the respective subsets into a list according to increasing negative contribution to network performance; and retaining only a threshold number of list elements in ranked order.

5. The computer-implemented method of claim 1, wherein applying the second statistical analysis to each respective subset of feature vectors to identify for the respective, particular inefficient element a respective set of influencer elements comprises: for each respective subset of feature vectors, identifying every given element, excluding the particular inefficient element, having (i) an identical value, and (ii) a statistically significant negative contribution to network performance with respect to an associated performance metric; and including the given element in the respective set of influencer elements.

6. The computer-implemented method of claim 1, wherein applying the second statistical analysis to each respective subset of feature vectors to discriminate between those influencer elements associated with additive negative influence and those associated with an incompatibility with the particular operational aspect of the communication network and assigned value identified with the respective, particular inefficient element, comprises: for each respective subset of feature vectors, determining for each given influencer element of the respective influencer set, a respective influencer subset of feature vectors corresponding to those containing the given influencer element; and for each respective influencer subset: (i) determining a first intersection with the respective subset of feature vectors; (ii) determining a second intersection with a complementary set of respective subset of feature vectors; (iii) applying a T-test to compare the first and second intersections; and (iv) if the T-test comparison yields a statistically significant difference, then marking the given influencer element as an incompatibility element, otherwise marking the given influencer element as an additive element.

7. The computer-implemented method of claim 1, wherein analyzing each possible pair of elements of the influencer elements of the respective influencer set of the respective, particular inefficient element to determine a dependency relationship based on a co-occurrence of inefficiency associated with both pair members comprises: for each possible pairing of the influencer elements of the respective influencer set, determining a joint distribution of the respective associated features (f.sub.r,f.sub.t), wherein r and t are indices of f.sub.j in the range defined for index j; determining a co-occurrence of the influencer elements of the respective influencer set based on the joint distribution; and determining for each co-occurrence whether it is directional or two-way.

8. The computer-implemented method of claim 1, wherein, for each determined dependency relationship in each respective hierarchical dependency tree, applying a metric-based rule to identify redundant dependencies of the respective hierarchical dependency tree, and removing at least one element of each of the redundant dependencies from each respective hierarchical dependency tree comprises: for each hierarchical dependency between a dependent parent element, e.sub.p, and a dependent child element e.sub.c, computing a first metric mean with respect to an associated performance metric of each of e.sub.p and e.sub.c for a first particular set containing e.sub.p and not containing e.sub.c, wherein p and c are indices of e.sub.i in the range defined for index i; computing a second metric mean with respect to the associated performance metric for second particular set containing the respective, particular inefficient element; comparing the first and second metric means; if the first metric mean is smaller than the second metric mean, then removing the parent element from the respective hierarchical dependency tree; and if the second metric mean is smaller than the first metric mean by more than a threshold amount, then removing the child element from the respective hierarchical dependency tree.

9. The computer-implemented method of claim 1, wherein grouping mutually dependent influencer elements of each respective hierarchical dependency tree and retaining only a longest of any multiple paths between remaining nodes to generate a respective dependency graph comprises: for each node of each respective hierarchical dependency tree, grouping mutually dependent elements; for each pair of nodes connected with multiple paths, applying a depth first search (DFS) algorithm to determine the longest of the multiple paths; and removing all of multiple paths that are not the longest paths.

10. The computer-implemented method of claim 1, wherein the set of data records comprises log records of operations in the communication network, and wherein each log record is one of a voice call, a session detail record for a data session, a performance record, or a status or health check record for at least one of a network device, a network service, network operation system, or network monitoring system.

11. A system comprising: one or more processors; and memory configured for storing instructions that, when executed by the one or more processors, cause the system to carry out operations including: obtaining a set of data records including features that characterize operational aspects of a communication network, wherein each given data record comprises a feature vector and one or more performance metrics characterizing operational performance of the communication network, wherein each feature vector of the set comprises a plurality of elements, e.sub.i, i=1, . . . , n, each made up of a feature-value pair, (f.sub.i,v.sub.k), k=1, . . . , m.sub.j, for each j=i, that identifies f.sub.j with a particular one of n operational aspects of the communication network and assigns to v.sub.k one of m.sub.j values of the particular operational aspect, wherein e.sub.i is a set of elements indexed by integer index i, f.sub.j is a set of features indexed by integer index j, v.sub.k is a set of values assigned to f.sub.j and indexed by integer index k, wherein i ranges from 1 to upper limit n, wherein for each e.sub.i, j=i, wherein k ranges from 1 to upper limit m.sub.j, wherein m.sub.j is the jth upper limit, and wherein the operational aspects characterized by features correspond to hardware, software, operational, or functional components related to the network operations; applying a first statistical analysis to the set of data records and the performance metrics of the data records to generate one or more data subsets, each comprising a respective subset of feature vectors that each contains a respective, particular inefficient element, wherein the respective, particular inefficient element is associated with a statistically significant negative contribution to network performance; respectively applying a second statistical analysis to each respective subset of feature vectors to (i) identify for the respective, particular inefficient element a respective set of influencer
elements representing elements associated with statistically significant negative influence on causing the negative contribution to network performance associated with the respective, particular inefficient element, and (ii) where the respective set of influencer elements is non-empty, discriminate between those influencer elements associated with additive negative influence and those associated with an incompatibility with the particular operational aspect of the communication network and assigned value identified with the respective, particular inefficient element; for each respective subset of feature vectors, analyzing each possible pair of elements of the influencer elements of the respective influencer set of the respective, particular inefficient element to determine a dependency relationship based on a co-occurrence of inefficiency associated with both pair members, wherein all determined dependency relationships from all pairs represent a respective hierarchical dependency tree in which influencer elements of the respective influencer set correspond to nodes; for each determined dependency relationship in each respective hierarchical dependency tree, applying a metric-based rule to identify redundant dependencies of the respective hierarchical dependency tree, and removing at least one element of each of the redundant dependencies from each respective hierarchical dependency tree; grouping mutually dependent influencer elements of each respective hierarchical dependency tree and retaining only a longest of any multiple paths between remaining nodes to generate a respective dependency graph; and displaying at least one respective dependency graph in a display device.

12. The system of claim 11, wherein applying a first statistical analysis to the set of data records and the performance metrics of the data records to generate the one or more data subsets, each comprising the respective subset of feature vectors that each contains a respective, particular inefficient element, comprises: for each pair (f.sub.j,f.sub.k), j≠k, in the set of data records, applying a χ.sup.2 test to identify redundant pairs for which f.sub.j and f.sub.k provide redundant information with respect to respectively associated metrics, and for each redundant pair, marking one of the pair members as excluded from consideration in further analysis; for each f.sub.j=t not marked as excluded from consideration, subdividing the set of data records into one or more respective feature subsets each having the same respective value v.sub.i, and applying, with respect to respectively associated metrics, a variance test to the respective feature subsets to determine whether f.sub.j=t represents a discriminating feature, and if not, marking f.sub.j=t as excluded from consideration in further analysis, wherein t is an index of f.sub.j in the range defined for index j; and for each particular element of the set of data records having (i) an identical value and corresponding to a feature f.sub.j=s that is not excluded from consideration by either the χ.sup.2 test or the variance test, wherein s is an index of f.sub.j in the range defined for index j, and (ii) a statistically significant negative contribution to network performance with respect to an associated performance metric, retaining as one of the generated one or more data subsets a respective collection of feature vectors each containing the particular element, wherein the particular element is the particular inefficient element.

13. The system of claim 11, wherein applying a first statistical analysis to the set of data records and the performance metrics of the data records to generate the one or more data subsets, each comprising the respective subset of feature vectors that each contains a respective, particular inefficient element, comprises: ranking the respective, particular inefficient elements from among the respective subsets into a list according to increasing negative contribution to network performance; and retaining only a threshold number of list elements in ranked order.

14. The system of claim 11, wherein applying the second statistical analysis to each respective subset of feature vectors to identify for the respective, particular inefficient element a respective set of influencer elements comprises: for each respective subset of feature vectors, identifying every given element, excluding the particular inefficient element, having (i) an identical value, and (ii) a statistically significant negative contribution to network performance with respect to an associated performance metric; and including the given element in the respective set of influencer elements.

15. The system of claim 11, wherein applying the second statistical analysis to each respective subset of feature vectors to discriminate between those influencer elements associated with additive negative influence and those associated with an incompatibility with the particular operational aspect of the communication network and assigned value identified with the respective, particular inefficient element, comprises: for each respective subset of feature vectors, determining for each given influencer element of the respective influencer set, a respective influencer subset of feature vectors corresponding to those containing the given influencer element; and for each respective influencer subset: (i) determining a first intersection with the respective subset of feature vectors; (ii) determining a second intersection with a complementary set of respective subset of feature vectors; (iii) applying a T-test to compare the first and second intersections; and (iv) if the T-test comparison yields a statistically significant difference, then marking the given influencer element as an incompatibility element, otherwise marking the given influencer element as an additive element.

16. The system of claim 11, wherein analyzing each possible pair of elements of the influencer elements of the respective influencer set of the respective, particular inefficient element to determine a dependency relationship based on a co-occurrence of inefficiency associated with both pair members comprises: for each possible pairing of the influencer elements of the respective influencer set, determining a joint distribution of the respective associated features (f.sub.r,f.sub.t), wherein r and t are arbitrary indices of f.sub.j in the range defined for index j; determining a co-occurrence of the influencer elements of the respective influencer set based on the joint distribution; and determining for each co-occurrence whether it is directional or two-way.

17. The system of claim 11, wherein, for each determined dependency relationship in each respective hierarchical dependency tree, applying a metric-based rule to identify redundant dependencies of the respective hierarchical dependency tree, and removing at least one element of each of the redundant dependencies from each respective hierarchical dependency tree comprises: for each hierarchical dependency between a dependent parent element, e.sub.p, and a dependent child element e.sub.c, computing a first metric mean with respect to an associated performance metric of each of e.sub.p and e.sub.c for a first particular set containing e.sub.p and not containing e.sub.c, wherein p and c are indices of e.sub.i in the range defined for index i; computing a second metric mean with respect to the associated performance metric for second particular set containing the respective, particular inefficient element; comparing the first and second metric means; if the first metric mean is smaller than the second metric mean, then removing the parent element from the respective hierarchical dependency tree; and if the second metric mean is smaller than the first metric mean by more than a threshold amount, then removing the child element from the respective hierarchical dependency tree.

18. The system of claim 11, wherein grouping mutually dependent influencer elements of each respective hierarchical dependency tree and retaining only a longest of any multiple paths between remaining nodes to generate a respective dependency graph comprises: for each node of each respective hierarchical dependency tree, grouping mutually dependent elements; for each pair of nodes connected with multiple paths, applying a depth first search (DFS) algorithm to determine the longest of the multiple paths; and removing all of multiple paths that are not the longest paths.

19. The system of claim 11, wherein the set of data records comprises log records of operations in the communication network, and wherein each log record is one of a voice call, a session detail record for a data session, a performance record, or a status or health check record for at least one of a network device, a network service, network operation system, or network monitoring system.

20. An article of manufacture including a non-transitory computer-readable medium, having stored thereon program instructions that, when executed by one or more processors of a system, cause the system to carry out operations including: obtaining a set of data records including features that characterize operational aspects of a communication network, wherein each given data record comprises a feature vector and one or more performance metrics characterizing operational performance of the communication network, wherein each feature vector of the set comprises a plurality of elements, e.sub.i, i=1, . . . , n, each made up of a feature-value pair, (f.sub.i,v.sub.k), k=1, . . . , m.sub.j, for each j=i, that identifies f.sub.j with a particular one of n operational aspects of the communication network and assigns to v.sub.k one of m.sub.j values of the particular operational aspect, wherein e.sub.i is a set of elements indexed by integer index i, f.sub.j is a set of features indexed by integer index j, v.sub.k is a set of values assigned to f.sub.j and indexed by integer index k, wherein i ranges from 1 to upper limit n, wherein for each e.sub.i, j=i, wherein k ranges from 1 to upper limit m.sub.j, wherein m.sub.j is the jth upper limit, and wherein the operational aspects characterized by features correspond to hardware, software, operational, or functional components related to the network operations; applying a first statistical analysis to the set of data records and the performance metrics of the data records to generate one or more data subsets, each comprising a respective subset of feature vectors that each contains a respective, particular inefficient element, wherein the respective, particular inefficient element is associated with a statistically significant negative contribution to network performance; respectively applying a second statistical analysis to each respective subset of feature vectors to (i) identify for the respective, particular inefficient
element a respective set of influencer elements representing elements associated with statistically significant negative influence on causing the negative contribution to network performance associated with the respective, particular inefficient element, and (ii) where the respective set of influencer elements is non-empty, discriminate between those influencer elements associated with additive negative influence and those associated with an incompatibility with the particular operational aspect of the communication network and assigned value identified with the respective, particular inefficient element; for each respective subset of feature vectors, analyzing each possible pair of elements of the influencer elements of the respective influencer set of the respective, particular inefficient element to determine a dependency relationship based on a co-occurrence of inefficiency associated with both pair members, wherein all determined dependency relationships from all pairs represent a respective hierarchical dependency tree in which influencer elements of the respective influencer set correspond to nodes; for each determined dependency relationship in each respective hierarchical dependency tree, applying a metric-based rule to identify redundant dependencies of the respective hierarchical dependency tree, and removing at least one element of each of the redundant dependencies from each respective hierarchical dependency tree; grouping mutually dependent influencer elements of each respective hierarchical dependency tree and retaining only a longest of any multiple paths between remaining nodes to generate a respective dependency graph; and displaying at least one respective dependency graph in a display device.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 is a simplified block diagram showing components of a system for automatic root cause analysis, in accordance with example embodiments.

(2) FIG. 2 is a flow chart illustrating an example method for automatic root cause analysis, in accordance with example embodiments.

(3) FIG. 3 is a flow chart illustrating a simplified, example method for determining the major contributors to network inefficiencies, in accordance with example embodiments.

(4) FIG. 4 is a flow chart illustrating an example sub-method of identification of major contributors, including removal of redundant features, in accordance with example embodiments.

(5) FIG. 5 is a flow chart illustrating an example sub-method of identification of major contributors, including selection of relevant features, in accordance with example embodiments.

(6) FIG. 6 is a flow chart illustrating an example sub-method of identification of major contributors, including identification of major inefficient elements, in accordance with example embodiments.

(7) FIG. 7 is a flow chart illustrating a simplified, example method for establishing hierarchical relationships and mutual dependencies between inefficient elements, in accordance with example embodiments.

(8) FIG. 8 is a flow chart illustrating a simplified, example method for determining the influence between influencer elements, in accordance with example embodiments.

(9) FIG. 9 is a flow chart illustrating a simplified, example method for grouping mutually dependent inefficient elements and pruning dependencies, in accordance with example embodiments.

(10) FIGS. 10A, 10B, and 10C depict a conceptual display for an example application which provides the hierarchical graph produced by a system for automatic root cause analysis, in accordance with example embodiments.

(11) FIG. 11 shows example source code with implementation considerations of steps of a method for the automatic root cause analysis, in accordance with example embodiments.

(12) FIG. 12 shows a simplified, example environment in which an example system and method for the automatic root cause analysis may be implemented, in accordance with example embodiments.

(13) FIG. 13 is a block diagram of an example system for the automatic root cause analysis, in accordance with example embodiments.

(14) FIG. 14 is a flow chart of an example method, in accordance with example embodiments.

DETAILED DESCRIPTION

(15) Example methods, devices, and systems are described herein. It should be understood that the words “example” and “exemplary” are used herein to mean “serving as an example, instance, or illustration.” Any embodiment or feature described herein as being an “example” or “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments or features unless stated as such. Thus, other embodiments can be utilized and other changes can be made without departing from the scope of the subject matter presented herein.

(16) Accordingly, the example embodiments described herein are not meant to be limiting. It will be readily understood that the aspects of the present disclosure, as generally described herein, and illustrated in the figures, can be arranged, substituted, combined, separated, and designed in a wide variety of different configurations.

(17) Further, unless context suggests otherwise, the features illustrated in each of the figures may be used in combination with one another. Thus, the figures should be generally viewed as component aspects of one or more overall embodiments, with the understanding that not all illustrated features are necessary for each embodiment.

(18) Additionally, any enumeration of elements, blocks, or steps in this specification or the claims is for purposes of clarity. Thus, such enumeration should not be interpreted to require or imply that these elements, blocks, or steps adhere to a particular arrangement or are carried out in a particular order.

I. EXAMPLE ANALYTICAL FORMULATION AND IMPLEMENTATION

A. Example Data, Notation, and Overview

(19) The automatic root cause analysis for service performance degradation and outages in telecommunications networks, together referred to as network inefficiencies, may be configured to exploit data collected by the monitoring entities (e.g., physical and virtual probes, logging systems, etc.) within telecommunication networks. The data provided by the monitoring entities form a dataset that may be used for root cause analysis. As described herein, a dataset is a collection of feature vectors, where a feature vector is a list of feature-value pairs. Each feature-value pair is also referred to as an element of the feature vector. A feature refers to a measurable property, such as utilization or load, or may refer to a tag or label identifying an operational component, such as a device, service, program, or IP address of the network. A value is either categorical (e.g., a device brand or model) or numerical (e.g., a measured parameter value or Boolean value). In practice, there may be a plurality of features, and a respective plurality of possible values for each feature. In the discussion here, features will be denoted as f.sub.j, j=1, . . . , n, where n specifies the number of features, and values will be denoted by v.sub.k, k=1, . . . , m.sub.j, where m.sub.j specifies the number of values for feature f.sub.j. A feature-value pair is also referred to as an element, e.sub.i, where e.sub.i=(f.sub.i,v.sub.k), k=1, . . . , m.sub.j, for each j=i.
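The notation above may be illustrated with a minimal Python sketch. It is not taken from the disclosure: the class name Record, the helper records_with, the feature names ("device", "service"), and all values and response times are hypothetical and chosen only to show how a feature vector, its elements, and an associated performance metric could be represented and selected.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

# A value may be categorical (str) or numerical/Boolean.
Value = Union[str, float, bool]
# An element e_i is a feature-value pair (f_i, v_k).
Element = Tuple[str, Value]


@dataclass
class Record:
    """One data record: a feature vector plus its performance metrics."""
    features: Dict[str, Value]   # feature name -> assigned value
    metrics: Dict[str, float]    # metric name -> measured value

    def elements(self) -> List[Element]:
        # The feature vector viewed as a list of (feature, value) elements.
        return list(self.features.items())

    def contains(self, element: Element) -> bool:
        # True if this record's feature vector includes the given element.
        feature, value = element
        return feature in self.features and self.features[feature] == value


def records_with(dataset: List[Record], element: Element) -> List[Record]:
    """Select the subset of records whose feature vector contains the element."""
    return [r for r in dataset if r.contains(element)]


# Hypothetical dataset, for illustration only.
dataset = [
    Record({"device": "model_a", "service": "voice"}, {"response_time_ms": 120.0}),
    Record({"device": "model_b", "service": "voice"}, {"response_time_ms": 95.0}),
    Record({"device": "model_a", "service": "data"}, {"response_time_ms": 210.0}),
]

subset = records_with(dataset, ("device", "model_a"))
```

Here subset holds the records whose feature vectors contain the hypothetical element ("device", "model_a"). Selecting records by element in this way is the kind of data selection on which statistical analyses of the performance metrics may then operate.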

(20) Table 1 shows a simplified example of a dataset, where the features describe the attributes of the parties involved in mobile communications, such as calls or sessions. In this example, there are six features (n=6). The number of possible values for each feature is not necessarily indicated, but it may be seen that there are at least two values for each feature. Each row of the table includes a feature vector followed by an associated performance metric, which, for purposes of example, is a response time. There could be different and/or additional performance metrics logged for each record. Each row of the table may also correspond to a record of a database or dataset of performance data that may be obtained by one or more monitoring devices in or of a communication network. The ellipses in the last row indicate that there may be more entries in the table (i.e., more records containing feature vectors). In particular, the statistical analyses described are generally applied to the performance metrics. As such, it may generally be assumed that there are sufficient numbers of records to help ensure the validity and/or accuracy of the statistical analyses. In practice, this may typically be the case, as the number of call records, session logs, performance logs, and the like usually stretches into the hundreds, thousands, or more over typical collection time spans. The four records shown in Table 1 thus serve to illustrate concepts of analysis relating to various data selection criteria, with the assumption that the number of actual records involved may be much larger.

(21) For convenience in the discussion herein, each row is labeled with a record number (“Rec No.” in the table), although this label may not necessarily be included in an actual implementation of the table. It should be understood that the form and content of Table 1 is an example for illustrative purposes of the discussion herein, and should not be interpreted as limiting with respect to example embodiments herein.

(22) TABLE 1

         Features                                                                                         Metrics
Rec No.  Content Provider  Service Type  Service     Content Category    Host IP Address  Server IP Address  Response Time
         (f.sub.1)         (f.sub.2)     (f.sub.3)   (f.sub.4)           (f.sub.5)        (f.sub.6)          (ms)
1        other             unknown       undetected  other               80.12.32.235     80.12.32.235       1.0
2        teamspeak         VoIP          TeamSpeak   VoIP and Messaging  31.214.227.112   31.214.227.112     10.0
3        teamspeak         VoIP          TeamSpeak   VoIP and Messaging  149.202.129.60   149.202.129.60     12.0
4        bittorrent        P2P           BitTorrent  P2P                 41.251.70.198    41.251.70.198      2.0
. . .    . . .             . . .         . . .       . . .               . . .            . . .              . . .

(23) The organization of records containing feature vectors and performance metrics into a table, such as Table 1, may serve to describe certain aspects of the analysis described below. Specifically, it may be seen that each feature corresponds to a column in the table, and that the features of each row correspond to feature vectors. The entries in the feature columns correspond to values, and the column heading (i.e., the feature) plus a given value corresponds to an element. For example, the pair (Service Type, VoIP) is an element that is present in both the second and third rows or data records. In later descriptions, when the term “feature” is used, it will usually refer to an entire column. Reference to a set of data containing only a specific element will be used to mean a subset of records, each of which contains a feature vector having the specific feature-value pair corresponding to that element. For example, a subset of the data containing only the element (Service Type, VoIP) would be a subset of only the second and third records. In addition, subsets of data need not necessarily be separate from Table 1. Rather, they may be viewed as Table 1 with ancillary information specifying which rows and/or columns are under consideration.

(24) Also, reference below to “removing” features or elements may be viewed as identifying columns or elements that are ignored or disregarded in subsequent computations. For example, if a feature is determined to be “redundant” (as explained below) and removed, then the column corresponding to that feature may be omitted from consideration in a subsequent computation that is applied to features that are not removed. However, Table 1 as a whole may remain intact throughout the computation. Finally, the above description is intended for the convenience of the discussion of various analytical steps, operations, and algorithms herein. It should be understood that there may be other specific implementations that achieve automatic root cause analysis in accordance with example embodiments. For example, subsets of data may be algorithmically implemented as separate sub-tables.

(25) In the discussion herein, a set of data records, such as those illustrated in Table 1, is referred to as a “data set E.” A data set may be further distinguished according to what subset of E is under consideration. For example, E(e.sub.i) may be used to refer to a subset of E that includes only feature vectors containing the specific element e.sub.i for a given pair of i and k. Adopting the notation of Set Theory, a complementary data set, designated as E\E(e.sub.i), refers to a subset of E that contains all feature vectors except those that contain e.sub.i. Referring again to Table 1, for the example in which e.sub.i for a given pair of i and k corresponds to (Service Type, VoIP), E(e.sub.i) would again be a subset of only the second and third records, and E\E(e.sub.i) would be a subset of only the first and fourth records. Note that this example momentarily ignores other possible table entries represented by the ellipses.
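By way of a non-limiting illustration, the subset notation above can be sketched in Python. The record layout (one dictionary per row), the key names, and the helper names `subset` and `complement` are assumptions introduced for this example only; just two of the six features of Table 1 are reproduced.

```python
# Four records modeled on Table 1; keys are hypothetical feature names and
# "response_ms" holds the performance metric (response time in ms).
records = [
    {"content_provider": "other", "service_type": "unknown", "response_ms": 1.0},
    {"content_provider": "teamspeak", "service_type": "VoIP", "response_ms": 10.0},
    {"content_provider": "teamspeak", "service_type": "VoIP", "response_ms": 12.0},
    {"content_provider": "bittorrent", "service_type": "P2P", "response_ms": 2.0},
]


def subset(E, element):
    """Return E(e): the records whose feature vector contains element e."""
    feature, value = element
    return [r for r in E if r.get(feature) == value]


def complement(E, element):
    """Return the complement of E(e): the records that do not contain e."""
    feature, value = element
    return [r for r in E if r.get(feature) != value]


e = ("service_type", "VoIP")      # the element (Service Type, VoIP)
in_e = subset(records, e)         # second and third records
not_e = complement(records, e)    # first and fourth records
```

Here the "removal" of rows is expressed by filtering, so the original list of records remains intact, in keeping with the description of Table 1 above.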

(26) In accordance with example embodiments, automatic root cause analysis may be accomplished by determining which feature-value pairs (elements) are most associated with network inefficiencies, where, as noted, “network inefficiency” is a term used herein to describe degradation (including possible failure) of one or more aspects of network performance below some threshold or statistically significant level. A network inefficiency may also be described as a statistically significant negative contribution to one or more aspects of network performance. Thus, a feature-value pair (element) is considered to be inefficient if it causes or is associated with a statistically significant negative contribution to one or more aspects of network performance.

(27) In order to carry out automatic root cause analysis or diagnosis based on analysis of inefficient elements, feature-value pairs may first be isolated or identified as “major contributors” to network inefficiencies based on how they contribute to degradation of network performance metrics; specifically, how significantly they negatively impact network performance. The isolation step may also include eliminating the effects of possibly redundant information provided by different features that may contribute to the same degradation of network performance. Inefficient elements may be grouped together when associated with the same network inefficiency, and statistically analyzed to determine other elements that may “influence” the inefficient elements. Identifying such “influencer elements” may help discover elements and their associated features that may play a role in causing the detected inefficiencies. Dependencies between inefficient elements may then be analyzed and organized into a hierarchy based on derived feature dependencies. This makes it possible to isolate which feature-value pairs comprise a root cause. The basic analytical approach just outlined is described in detail below. In order to put the analytical technique in context, an example automatic root cause analysis system is first described.

B. Example System Architecture

(28) FIG. 1 is a simplified block diagram showing components of a system 100 for automatic root cause analysis, in accordance with example embodiments. As shown, system 100 includes a Data Processor 106, a Major Contributors Identifier 108, an Influence Analyzer 110, a Dependency Analyzer 112, and a Graph Generator 114. In operation together, these components may produce the hierarchical graph of potential root causes of network inefficiencies, which could be presented on a User Interface 116, for example. The Networks 102 are supervised by monitoring entities 104. The network monitoring data D may be processed by the Data Processor 106 to produce the dataset E. Then, the major contributors to network inefficiencies, E(e.sub.i . . . e.sub.k), may be identified by the Major Contributors Identifier 108. Then, the elements that contribute to inefficiency within each subset E(e.sub.i) to E(e.sub.k) (i.e., the influencer elements) may be identified by the Influence Analyzer 110. Then, the hierarchical relationship and mutual dependencies between inefficient elements may be established by the Dependency Analyzer 112. The resulting directional edges (e.sub.ij,e.sub.il), (e.sub.km,e.sub.kn), . . . may then be pruned by the Graph Generator 114.

(29) In FIG. 1, the automatic root cause analysis system 100 may be or include a computing device or system. By way of example, a computing device may include a processing unit which includes at least functional components 106, 108, 110, 112, 114. Functional components 106, 108, 110, 112, 114 can be software instructions executed on the processing unit for automatic root cause analysis. The processing unit can be a digital device that, in terms of hardware architecture, generally includes a processor, input/output (I/O) interfaces including user interface 116, a network interface, and memory. The processing unit can be one or multiple virtual instances running over a virtualization layer abstracting computing resources. It should be appreciated by those of ordinary skill in the art that FIG. 1 depicts the processing unit in an oversimplified manner, and a practical embodiment may include additional components and suitably configured processing logic to support known or conventional operating features that are not described in detail herein. Also, the processing device can be a stand-alone server, group of servers, etc. for executing the automatic root cause analysis.

(30) When the processing unit is a digital device, the components 106, 108, 110, 112, 114 may be communicatively coupled via a local interface. The local interface can be, for example, but not limited to, one or more buses or other wired or wireless connections, as is known in the art. The local interface can have additional elements, which are omitted for simplicity, such as controllers, buffers (caches), drivers, repeaters, and receivers, among many others, to enable communications. Further, the local interface may include address, control, and/or data connections to enable appropriate communications among the components.

(31) The network interface may be used to enable the processing device to communicate on a network, such as the Internet. The network interface may include, for example, an Ethernet card or adapter (e.g., 10BaseT, Fast Ethernet, Gigabit Ethernet, 10 GbE) or a wireless local area network (WLAN) card or adapter (e.g., 802.11a/b/g/n/ac). The network interface may include address, control, and/or data connections to enable appropriate communications on the network.

(32) A processor is used as a hardware device for executing software instructions within processing device 100. The processor can be any custom made or commercially available processor, a central processing unit (CPU), an auxiliary processor among several processors associated with the processing device, a semiconductor-based microprocessor (in the form of a microchip or chip set), or generally any device for executing software instructions. When the processing device is in operation, the processor is configured to execute software stored within the memory, to communicate data to and from the memory, and to generally control operations of the processing device pursuant to the software instructions. In an exemplary embodiment, the processor may include a mobile-optimized processor, such as one optimized for power consumption and mobile applications.

(33) The I/O interfaces, including user interface 116, can be used to receive user input and/or to provide system output. User input can be provided via, for example, a keypad, a touch screen, a scroll ball, a scroll bar, buttons, and the like. System output can be provided via a display device such as a liquid crystal display (LCD), touch screen, and the like, and may also be provided via a printer. The I/O interfaces can also include, for example, a serial port, a parallel port, a small computer system interface (SCSI), an infrared (IR) interface, a radio frequency (RF) interface, a universal serial bus (USB) interface, and the like. The I/O interfaces can include a graphical user interface (GUI) that enables a user to interact with the processing device 100.

(34) The data store may be used to store data. The data store may include any of volatile memory elements (e.g., random access memory (RAM, such as DRAM, SRAM, SDRAM, and the like)), nonvolatile (non-transitory computer-readable media) memory elements (e.g., ROM, hard drive, tape, CDROM, and the like), and combinations thereof. Moreover, the data store may incorporate electronic, magnetic, optical, and/or other types of storage media.

(35) The memory may include any of volatile memory elements (e.g., random access memory (RAM, such as DRAM, SRAM, SDRAM, etc.)), nonvolatile memory elements (e.g., ROM, hard drive, etc.), and combinations thereof. Moreover, the memory may incorporate electronic, magnetic, optical, and/or other types of storage media. Note that the memory may have a distributed architecture, where various components are situated remotely from one another but can be accessed by the processor.

(36) The software in memory can include one or more software programs, each of which includes an ordered listing of executable instructions for implementing logical functions. In the example of FIG. 1, the software in the memory includes a suitable operating system (O/S) and programs. The operating system may control the execution of other computer programs and provides scheduling, input-output control, file and data management, memory management, and communication control and related services. The programs may include various applications, add-ons, etc. configured to provide end-user functionality with the processing device. In accordance with example embodiments, system 100 may include a non-transitory computer-readable medium storing instructions thereon that, when executed by one or more processors of a system, cause the system to carry out operations described herein.

(37) The processing device can be incorporated in test equipment or be in communication with test equipment. The test equipment can include different physical media test modules. The physical media test modules include ports and connectors to interface to networks for monitoring and troubleshooting. In an embodiment, a mobile device can execute an application which communicates with the test equipment. The mobile device can communicate with the test equipment via Bluetooth, Wi-Fi, wired Ethernet, USB, combinations thereof, or the like. The mobile device is configured to communicate with the Internet via cellular, Wi-Fi, etc.

(38) Still referring to FIG. 1, when the processing unit is running over a virtualization layer, the components 106, 108, 110, 112, 114 may run inside one or multiple virtual machine or container instances. When distributed over multiple instances, the components typically exchange information among themselves via a bridge network. The bridge network can be, for example, but not limited to, a virtual switch. Some of the components might also communicate on a network, such as the Internet. In this case, a virtual NIC is associated with each virtual machine, while the containers are connected to the bridge network of the host system providing access to the network.

C. Example Analysis Procedures

(39) Example procedures for automatic root cause analysis are shown in FIGS. 2-9, which are arranged to describe increasing levels of detail for certain aspects of the analyses. FIG. 2 is a flow chart illustrating an example method 200 for automatic root cause analysis at a top level, in accordance with example embodiments. Example method 200 includes a number of “main” steps to produce the hierarchical graph of potential root causes of network inefficiencies. First, at step 202, data from monitoring entities may be collected, followed, at step 204, by cleaning, formatting, and restructuring to form a dataset, such as that illustrated in Table 1. Next, at step 206, major contributors to network inefficiencies may be identified. At step 208, the influence between major contributors may be analyzed. Then, at step 210, hierarchical relationships and mutual dependencies between inefficient elements may be established. Finally, at step 212, the mutually dependent inefficient elements may be grouped and the dependencies may be pruned. FIGS. 3-9 then provide elaboration of the main steps and some of the sub-steps.

(40) Data Pre-Processing

(41) The data pre-processing, performed by the Data Processor 106, may involve various processes of data cleaning, formatting, and restructuring. In step 204, metrics having extreme values may be removed to suppress the effect of outliers, which may otherwise skew the metric distributions and degrade the statistical analyses when left uncorrected. The result of data pre-processing is a data set E of records, such as those of Table 1. As such, the records collectively include features that characterize operational aspects of a communication network. Initially, E may contain all of the records generated by the Data Processor 106. Various subsets of E may be formed along the way in subsequent analysis steps described below.
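The description above does not specify how extreme values are detected. As one possible, non-limiting sketch, records whose metric falls outside a percentile band could be dropped. The function name `drop_extreme_metrics` and the 1st/99th percentile bounds are assumptions made for illustration, not part of the described method.

```python
import statistics


def drop_extreme_metrics(E, metric, lower_pct=1, upper_pct=99):
    """Keep only records whose metric lies within the given percentile band.

    The percentile bounds are hypothetical defaults; any other outlier
    criterion could be substituted. Requires at least two records.
    """
    values = [r[metric] for r in E]
    # quantiles(..., n=100) returns the 99 percentile cut points.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [r for r in E if low <= r[metric] <= high]


# Usage: 100 ordinary response times plus one extreme measurement.
raw = [{"response_ms": float(v)} for v in range(1, 101)]
raw.append({"response_ms": 10_000.0})
cleaned = drop_extreme_metrics(raw, "response_ms")
```

Trimming both tails is a design choice of this sketch; for a metric such as response time, an implementation might trim only the upper tail.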

(42) Identification of Major Contributors

(43) As noted, a major contributor is an element that is determined to have a statistically significant negative contribution to one or more metrics associated with the data record (or feature vector) that contains the element. Identification of major contributors also involves removing redundant information from subsequent analyses, selecting relevant features, and evaluating the significance of elements that may be associated with inefficiency. Major contributor identification is performed by the Major Contributors Identifier 108 of FIG. 1. A major contributor is an element e.sub.i=(f.sub.i,v.sub.k) from dataset E that has a statistically significant negative impact on a network performance metric. For convenience in the discussion herein, a specific index value for i will generally be used in referring to data sets and elements. It should be understood that the analysis is not limited by or restricted to the specific index values used in the discussion. When the distribution of a metric for the subset of the dataset containing e.sub.1 (i.e., for i=1), E(e.sub.1), is statistically significantly different from the distribution of the metric for the complementary subset not containing e.sub.1, E\E(e.sub.1), the element e.sub.1 is considered to contribute to the performance metric degradation if the directionality of the result indicates degradation.

(44) FIG. 3 is a flow chart illustrating a simplified, example method 206 (corresponding to step 206 in FIG. 2) for determining the major contributors to network inefficiencies, in accordance with example embodiments. The identification of major contributors includes three “main” steps: removing redundant features 301, selecting relevant features 303, identifying major inefficient elements 305. These steps are further detailed in FIGS. 4-6.

(45) FIG. 4 is a flow chart illustrating an example sub-method 301 (corresponding to step 301 in FIG. 3) for identification of major contributors, including removal of redundant features. Redundant features are removed from the dataset E to reduce the dimensionality of the data that need to be considered in the analysis. Advantageously, this may help optimize performance of automated root cause analysis, because features with a large amount of co-occurrence may typically represent the same or redundant information. Thus, it may not be necessary to compute the later computations on removed features separately. One feature from each dependency tuple is then removed as redundant. As mentioned above and described below, “removal” of a feature refers to omitting further consideration of that feature in a subsequent analysis step applied to features of the data set E. Features removed at this step are referred to as “redundant features,” where redundancy refers to statistical information about one or more metrics associated with E that may be obtained by analyzing the features in question.

(46) For each pair of features (f.sub.1,f.sub.2), the method 301 starts at step 401 by creating a crosstab, which is a joint distribution of pairs of features with respect to the associated metric. Then, at 403, a Chi-squared test of independence of the distributions of the metric is applied to compare the distributions of all features against all other features. This statistical test determines whether features are highly associated with each other. The Chi-squared test produces two values: the test statistic χ.sup.2 and the P value. The test statistic χ.sup.2 quantifies the dependence between the two features. The P value quantifies the statistical significance of the test statistic, i.e., the probability that a statistic at least this large would arise by chance alone if the features were independent. A typical threshold for the P value, applied at 405, is 0.05 (or 0.01 for a more stringent criterion; other threshold values may be used as well). If the P value is higher than the significance level set (e.g., 0.05), the two features are considered to be non-redundant (i.e., independent) and, at step 407, both are retained in the data set.

(47) Redundant features are those that may be considered mutually dependent. To check the degree of dependence of f.sub.1 on f.sub.2 and of f.sub.2 on f.sub.1, weights are calculated at step 409 by dividing χ.sup.2(f.sub.1,f.sub.2) by χ.sup.2(f.sub.1,f.sub.1) and by χ.sup.2(f.sub.2,f.sub.2), respectively.
χ.sup.2.sub.1=χ.sup.2(f.sub.1,f.sub.2)/χ.sup.2(f.sub.1,f.sub.1) (1)
χ.sup.2.sub.2=χ.sup.2(f.sub.1,f.sub.2)/χ.sup.2(f.sub.2,f.sub.2) (2)

(48) At step 411, if most of the information included in f.sub.1 can be deduced from f.sub.2 and vice versa (e.g., χ.sup.2.sub.1>0.9 and χ.sup.2.sub.2>0.9), then f.sub.1 and f.sub.2 are considered redundant and one of the features is removed at 413. As noted, “removal” of features may correspond to marking the features for omission from subsequent analysis steps, and not necessarily to actual removal from E (e.g., from Table 1). Thus “removing” f.sub.2 from E may correspond to omitting the column “service type” from subsequent computations. In practice, omitting a feature (column) or an element may be achieved by “marking” the removed feature as “omitted” or with some other tag that may be used to control subsequent computational operations. The result of this step is the data set E with redundant features removed (or marked as such).
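The weights of equations (1) and (2) can be illustrated with a short Python sketch that computes the Pearson Chi-squared statistic directly from the crosstab of two feature columns. This is a sketch under stated assumptions: the function names are hypothetical, the P value check of step 405 is not reproduced (it requires a Chi-squared distribution function, which the Python standard library does not provide), and only the 0.9 threshold of step 411 is applied.

```python
from collections import Counter


def chi2_statistic(xs, ys):
    """Pearson Chi-squared statistic of the crosstab of two categorical columns."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # observed cell counts of the crosstab
    row, col = Counter(xs), Counter(ys)
    stat = 0.0
    for x in row:
        for y in col:
            expected = row[x] * col[y] / n
            stat += (joint.get((x, y), 0) - expected) ** 2 / expected
    return stat


def redundancy_weights(f1, f2):
    """Equations (1) and (2): chi2(f1,f2) over chi2(f1,f1) and chi2(f2,f2)."""
    cross = chi2_statistic(f1, f2)
    self1, self2 = chi2_statistic(f1, f1), chi2_statistic(f2, f2)
    if self1 == 0 or self2 == 0:   # a constant column carries no information
        return 0.0, 0.0
    return cross / self1, cross / self2


def are_redundant(f1, f2, threshold=0.9):
    """Step 411: both weights must exceed the threshold (0.9 in the example)."""
    w1, w2 = redundancy_weights(f1, f2)
    return w1 > threshold and w2 > threshold


# Usage with two columns modeled on Table 1 (perfectly co-occurring values).
provider = ["other", "teamspeak", "teamspeak", "bittorrent"]
service_type = ["unknown", "VoIP", "VoIP", "P2P"]
weights = redundancy_weights(provider, service_type)
```

In this example each provider value maps to exactly one service type value, so both weights equal 1 and one of the two columns could be marked as omitted.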

(49) FIG. 5 is a flow chart illustrating an example sub-method 303 (corresponding to step 303 in FIG. 3) for identification of major contributors, including selection of relevant features. Isolation of the most relevant features narrows the search space to identify features most associated with negative network performance metrics and essential in determining the root cause of a network inefficiency.

(50) At step 501, for each feature f.sub.i, the dataset E, now with redundant features removed, is split into subsets, one for each value v.sub.k of f.sub.i, such that all records of a subset have the same value. Then at step 503, a one-way analysis of variance (ANOVA) test is performed on the distribution of the network performance metric for all subsets of a feature f.sub.i. If there are no statistically significant differences between any of the subsets of a particular feature f.sub.i, then the feature is not discriminant for the particular metric measured and, thus, will not contain any feature-value pair candidates for network inefficiency. Thus, non-discriminant f.sub.i are not processed in the next steps; this may be achieved by marking the non-discriminant features as such. In the case where f.sub.i is discriminant, f.sub.i is carried forward to the next stage of the method.

(51) A one-way ANOVA test computes an F value and a P value. The F value is the ratio of the variance of the metric means across the different subsets (the between-group variance) to the variance of the metric within the subsets (the within-group variance). A high value of F indicates that the associated metric is not identically distributed in all subsets; i.e., at least two groups are different from each other. The P value determines the robustness of the F value.

(52) At step 505, the P value is compared to an upper threshold p. If the P value exceeds the threshold (e.g., P>0.05, or 0.01 for a more stringent criterion; other threshold values may be used as well), the F value is not considered and the feature f.sub.i is rejected and so removed. Otherwise, the F value is evaluated for statistical significance.

(53) At step 509, the F value is evaluated by calculating the critical F value, F.sub.c. F.sub.c is the maximum F value expected when all the subsets have the same distribution (with a probability of type I error α=0.05). At step 511, if F>F.sub.c, then the performance metric is not identically distributed in the different subsets. Thus, the feature f.sub.i is discriminant and warrants further exploration for root cause analysis; f.sub.i is kept at step 513 and used in the next steps of the method. When this criterion is not met, f.sub.i is not used in the next steps of the method.
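The feature selection test can be sketched as follows, as a minimal illustration rather than a definitive implementation. The F value is computed as the between-group mean square divided by the within-group mean square, which is the standard one-way ANOVA formulation. The function names are hypothetical, and the critical value F.sub.c is assumed to be supplied by the caller (for instance from an F-distribution table), since the Python standard library has no F-distribution quantile function; the P value check of step 505 is likewise omitted.

```python
from collections import defaultdict
from statistics import mean


def split_by_value(E, feature, metric):
    """Step 501: group the metric values of E by the value of one feature."""
    groups = defaultdict(list)
    for r in E:
        groups[r[feature]].append(r[metric])
    return list(groups.values())


def one_way_anova_f(groups):
    """Return the one-way ANOVA F value, or None when it is undefined."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or n <= k:
        return None
    grand_mean = mean([v for g in groups for v in g])
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    if ss_within == 0:
        return None
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def is_discriminant(groups, f_critical):
    """Steps 509-513: keep the feature only if F exceeds the critical value."""
    f_value = one_way_anova_f(groups)
    return f_value is not None and f_value > f_critical


# Usage with the response times of Table 1 grouped by service type.
E = [
    {"service_type": "unknown", "response_ms": 1.0},
    {"service_type": "VoIP", "response_ms": 10.0},
    {"service_type": "VoIP", "response_ms": 12.0},
    {"service_type": "P2P", "response_ms": 2.0},
]
groups = split_by_value(E, "service_type", "response_ms")
f_value = one_way_anova_f(groups)
```

With only four records the result is of no statistical value; the example merely shows the arithmetic. In practice the test would be applied to the much larger record sets discussed in connection with Table 1.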

(54) FIG. 6 is a flow chart illustrating an example sub-method 305 (corresponding to step 305 in FIG. 3) for identification of major contributors, including identification of major inefficient elements. The identification of inefficient elements 305 aims to identify the main contributors to network inefficiency. At step 601, all possible elements of relevant features are considered. Then, in step 603, for each element e.sub.i=(f.sub.i,v.sub.k), the dataset E, with redundant and non-discriminant features removed, is split into two subsets: the subset containing e.sub.i, E(e.sub.i), and its complementary subset, E\E(e.sub.i). At step 605, the distributions of the network performance metric in these two subsets are compared using the paired T-test. The T-test produces the T statistic and the P value. At step 607, if T>0, indicating that the network performance metric is higher in E(e.sub.i) than in its complement E\E(e.sub.i), and P<0.05 (or 0.01 for a more stringent criterion; other threshold values could be used as well), then e.sub.i is considered an inefficient element and kept at 609. Otherwise, it is removed at step 611, where, again, removal refers to omission of the removed element from consideration in subsequent analysis steps. Once all the inefficient elements have been identified at step 609, they are sorted by the T statistic at step 613. At step 615, the top N elements, i.e., those with the largest T statistic, are selected (e.g., N=50). This selection narrows the search space for a probable root cause of inefficiency, and the selected elements are put forward as major contributors to network inefficiency based on their association with worse metric scores. Note that “N” in this context should not be confused with “n” introduced above as referring to the number of features.
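The selection of major contributors may be sketched as shown below. Two substitutions are made in this sketch and are assumptions, not the method as described: the T statistic is computed with Welch's unequal-variance two-sample formula rather than a paired test, and the P value is approximated from a standard normal distribution, which is only reasonable for large subsets. All function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def approx_p_value(t):
    """Two-sided P value from a normal approximation (assumes large samples)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


def major_contributors(E, features, metric, alpha=0.05, top_n=50):
    """Steps 601-615: keep elements with T > 0 and P < alpha, ranked by T."""
    scored = []
    for feature in features:
        for value in sorted({r[feature] for r in E}):
            inside = [r[metric] for r in E if r[feature] == value]
            outside = [r[metric] for r in E if r[feature] != value]
            if len(inside) < 2 or len(outside) < 2:
                continue  # sample variance is undefined
            std_err = sqrt(variance(inside) / len(inside)
                           + variance(outside) / len(outside))
            if std_err == 0:
                continue
            t = (mean(inside) - mean(outside)) / std_err
            if t > 0 and approx_p_value(t) < alpha:
                scored.append((t, (feature, value)))
    scored.sort(key=lambda item: item[0], reverse=True)  # largest T first
    return [element for _, element in scored[:top_n]]


# Usage: VoIP sessions have clearly higher response times than P2P sessions.
E = [{"service_type": "VoIP", "response_ms": v} for v in (10.0, 12.0, 11.0, 13.0)]
E += [{"service_type": "P2P", "response_ms": v} for v in (1.0, 2.0, 1.5, 2.5)]
contributors = major_contributors(E, ["service_type"], "response_ms")
```

The sketch treats a higher metric value as worse, which matches a response time; for a metric where higher is better, the sign condition on T would be reversed.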

(55) Influence Analysis

(56) FIG. 7 is a flow chart illustrating a simplified, example method for establishing hierarchical relationships and mutual dependencies between inefficient elements, in accordance with example embodiments. The influence analysis 208 (corresponding to step 208 in FIG. 2) may be performed by the Influence Analyzer 110 of FIG. 1. For each element identified as a major contributor (e.sub.i to e.sub.k) from the dataset E (at step 701), the method determines whether, in the subset including the element e.sub.i, E(e.sub.i), there exist other sub-elements e.sub.ij to e.sub.il that contribute to performance metric degradation. Consequently, this strategy highlights the contribution of sub-elements, e.sub.ij to e.sub.il, to a particular inefficient element's metric results. Sub-elements that meet these criteria are referred to as “influencer elements” or just “influencers.” The subset of influencer elements e.sub.ij to e.sub.il of a particular inefficient element e.sub.i is referred to herein as an “influencer set” of e.sub.i.

(57) In cases where such sub-elements exist (i.e., where an influencer set exists), their influence may be caused by an additive effect of independent faulty elements or by incompatibilities between elements causing network inefficiencies. Identification of a possible influencer set for each inefficient element is carried out at step 703, and may be accomplished by again applying step 305, as detailed in FIG. 6, individually to each subset E(e.sub.i) to E(e.sub.k) to find, for each subset, possible inefficient sub-elements; e.g., e.sub.ij to e.sub.il for E(e.sub.i), e.sub.kj to e.sub.kl for E(e.sub.k), and so on. In accordance with example embodiments, this step may be carried out for each e.sub.i to e.sub.k by separately taking each e.sub.i to e.sub.k to be the “current major contributor,” and then selecting only those feature vectors of the data set E containing the current major contributor.

(58) The effect that a given influencer may have on a particular inefficient element in causing or amplifying its inefficiency (i.e., its statistically significant negative contribution to an associated performance metric) may be additive, or may be due to an incompatibility between the feature-value pairs that correspond to the particular inefficient element and the given influencer. To determine whether the influence entails an additive effect or an incompatibility between the two elements, an additional step 705 may be taken once an influential sub-element is discovered. Given an element e.sub.i and the sub-element e.sub.ij derived from E(e.sub.i), the intersection of subsets E(e.sub.i)∩E(e.sub.ij) may be taken to perform an additional comparison. A T-test calculation may then be performed comparing the distribution of metric scores for the set E(e.sub.i)∩E(e.sub.ij) against the complement distribution E\(E(e.sub.i)∩E(e.sub.ij)) to distinguish between an additive effect and an incompatibility.

(59) In cases where e.sub.i and e.sub.ij appear independently, their impact may be weighted against intersectional cases in the complement distribution at step 707. If there is a statistically significant difference between the distribution of metric scores for E(e.sub.i)∩E(e.sub.ij) and for E\(E(e.sub.i)∩E(e.sub.ij)), then the combination of elements may be deemed an incompatibility in accordance with the P-value set at 709. Conversely, when no incompatibility is found, an additive effect may be assumed at step 711.
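The distinction between an incompatibility and an additive effect can be sketched in Python as follows. As assumptions of this sketch only, the comparison uses Welch's two-sample T statistic with a normal approximation of the P value (suitable only for large samples), and the function name `classify_influence` and its return labels are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def classify_influence(E, e_i, e_ij, metric, alpha=0.05):
    """Compare the records containing both elements against all other records."""
    def has(record, element):
        return record.get(element[0]) == element[1]

    both = [r[metric] for r in E if has(r, e_i) and has(r, e_ij)]
    rest = [r[metric] for r in E if not (has(r, e_i) and has(r, e_ij))]
    if len(both) < 2 or len(rest) < 2:
        return "undetermined"      # too few records for a variance estimate
    std_err = sqrt(variance(both) / len(both) + variance(rest) / len(rest))
    if std_err == 0:
        return "undetermined"
    t = (mean(both) - mean(rest)) / std_err
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))   # normal approximation
    return "incompatibility" if p < alpha else "additive"


def make_records(combined_times):
    """Build a small data set; only the (teamspeak, A) rows vary."""
    rows = [("teamspeak", "A", v) for v in combined_times]
    rows += [("teamspeak", "B", v) for v in (10.0, 11.0, 12.0)]
    rows += [("other", "A", v) for v in (10.0, 12.0, 11.0)]
    rows += [("other", "B", v) for v in (9.0, 10.0, 11.0)]
    return [{"provider": p, "host": h, "response_ms": v} for p, h, v in rows]


e_i, e_ij = ("provider", "teamspeak"), ("host", "A")
slow_together = classify_influence(make_records((50.0, 52.0, 51.0)), e_i, e_ij, "response_ms")
no_difference = classify_influence(make_records((10.0, 12.0, 11.0)), e_i, e_ij, "response_ms")
```

In the first data set the response time is high only when the two elements occur together, so the combination is labeled an incompatibility; in the second, the intersection does not differ significantly from the remaining records, so an additive effect is assumed.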

(60) Dependency Analysis

(61) FIG. 8 is a flow chart illustrating a simplified, example method for determining the influence between influencer elements, in accordance with example embodiments. The major contributor dependency analysis 210 (corresponding to step 210 in FIG. 2) may be performed by the Dependency Analyzer 112 of FIG. 1. The method may be used to establish hierarchical relationships and mutual dependencies between inefficient elements. Again adopting Set Theory, hierarchical dependency is a reflexive and transitive relation between two elements. The hierarchy forms a graph, establishing parent-child relationships between associated elements. To form this hierarchy, the element set of a child element must be a subset of the element set of the parent. For example, the content provider (e.g., Gmail®) is a child of content category (e.g., email). A mutual dependency is a bidirectional hierarchical dependency. It occurs when the element sets of two elements are equal, in which case two elements are said to be mutually dependent or equivalent.

(62) Co-occurrence may be used to establish hierarchies, where a co-occurrence refers to two inefficient elements that occur in pairs within the same feature vectors. To determine co-occurrences, for each pair of inefficient elements (which include sub-elements from influence analysis), the method 210 starts at step 801 by creating a crosstab, which is a joint distribution of pairs of features. Then, at step 803, when a one-way co-occurrence threshold (e.g., 90%) is met, meaning that when the inefficient element e.sub.1=(f.sub.1,v.sub.1) occurs, in at least the threshold number of cases e.sub.2=(f.sub.2,v.sub.2) also occurs (e.sub.1 is the parent and e.sub.2 the child in this scenario), a hierarchical relationship is formed with a directional edge e.sub.1→e.sub.2. When a two-way co-occurrence is established (i.e., the threshold is satisfied such that e.sub.1→e.sub.2 and e.sub.1←e.sub.2), then a mutual dependency is formed, placing both elements in the same node and removing the parent-child relationship.
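As a non-limiting sketch, steps 801 and 803 might be expressed in Python as follows. The function name, the representation of a feature vector as a set of (feature, value) tuples, and the sample data are assumptions made for the example. The edge direction follows the convention stated above: the element whose occurrence implies the other is treated as the parent.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(vectors, elements, threshold=0.9):
    """Derive directional edges and mutual dependencies from co-occurrence.

    vectors: list of sets of (feature, value) elements, one set per record.
    elements: the inefficient elements to compare pairwise.
    Returns (edges, mutual): edges are (parent, child) tuples and mutual is a
    list of frozensets, each holding two equivalent elements.
    """
    single = Counter()
    joint = Counter()  # crosstab restricted to the elements of interest
    for vec in vectors:
        present = [e for e in elements if e in vec]
        single.update(present)
        joint.update(frozenset(pair) for pair in combinations(present, 2))

    edges, mutual = [], []
    for a, b in combinations(elements, 2):
        n_ab = joint[frozenset((a, b))]
        a_implies_b = single[a] > 0 and n_ab / single[a] >= threshold
        b_implies_a = single[b] > 0 and n_ab / single[b] >= threshold
        if a_implies_b and b_implies_a:
            mutual.append(frozenset((a, b)))  # two-way: same node
        elif a_implies_b:
            edges.append((a, b))              # a is the parent, b the child
        elif b_implies_a:
            edges.append((b, a))
    return edges, mutual


# Hypothetical elements: e1 always appears with e2; e3 and e4 always together.
e1, e2, e3, e4 = ("f1", "v1"), ("f2", "v2"), ("f3", "v3"), ("f4", "v4")
vectors = [{e1, e2}] * 10 + [{e2}] * 10 + [{e3, e4}] * 10
edges, mutual = cooccurrence_edges(vectors, [e1, e2, e3, e4])
print(edges, mutual)
```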

(63) At step 805, redundant dependencies are removed as follows. For each hierarchical dependency between dependent elements (a parent e.sub.p and a child e.sub.c), the mean of an associated network performance metric is computed for the set containing e.sub.p, E(e.sub.p), and not containing e.sub.c, E(e.sub.p)/E(e.sub.c). Then, a pair of rules is applied to identify which elements are deemed inherently inefficient and which are deemed the mere result of the dependency on an inefficient element.

(64) Rule 1: If the mean of the metric for the set containing the parent but excluding the child, E(e.sub.p)/E(e.sub.c), is lower than the mean of the metric in the associated major contributor dataset E(e.sub.i), then the child is considered the root of the inefficiency of the parent. In this case the parent is pruned.

(65) Rule 2: If the mean of the metric for the set containing the child, E(e.sub.c), is smaller than the mean of the metric for the set containing the parent, E(e.sub.p), by more than an additional margin factor (e.g., 5% to 10%), then the parent is considered the root of the inefficiency of the child. In this case the child is pruned.
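The two rules may be illustrated by the following Python sketch, offered as one possible reading rather than a definitive implementation. The function and helper names, the record layout, and the sample data are hypothetical. The sketch further assumes that a higher metric value indicates worse performance (e.g., call setup time) and that the margin factor of Rule 2 is applied multiplicatively.

```python
from statistics import mean


def prune_decision(records, parent, child, contributor, metric="metric", margin=0.05):
    """Return 'prune_parent', 'prune_child', or 'keep' for one dependency."""
    def has(rec, pair):
        return rec.get(pair[0]) == pair[1]

    parent_only = [r[metric] for r in records if has(r, parent) and not has(r, child)]
    with_parent = [r[metric] for r in records if has(r, parent)]
    with_child = [r[metric] for r in records if has(r, child)]
    with_contrib = [r[metric] for r in records if has(r, contributor)]

    # Rule 1: parent-without-child mean is below the major contributor's mean.
    if parent_only and with_contrib and mean(parent_only) < mean(with_contrib):
        return "prune_parent"
    # Rule 2: child mean is below the parent mean by more than the margin
    # (assumption: the margin scales the parent mean).
    if with_child and with_parent and mean(with_child) < mean(with_parent) * (1 - margin):
        return "prune_child"
    return "keep"


def rows(service_type, service, value, count=10):
    """Hypothetical records, all belonging to major contributor SGW35."""
    return [{"gateway": "SGW35", "service_type": service_type,
             "service": service, "metric": value}] * count


contributor = ("gateway", "SGW35")
parent, child = ("service_type", "Tunneling"), ("service", "SSL")

# Degradation follows the child: tunneling without SSL performs well.
case1 = rows("Tunneling", "SSL", 9.0) + rows("Tunneling", "Other", 4.0) + rows("Web", "HTTP", 5.0)
# Degradation follows the parent: tunneling without SSL performs badly.
case2 = rows("Tunneling", "SSL", 5.0) + rows("Tunneling", "Other", 9.0) + rows("Web", "HTTP", 5.0)
print(prune_decision(case1, parent, child, contributor))
print(prune_decision(case2, parent, child, contributor))
```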

(66) Graphing and Pruning

(67) FIG. 9 is a flow chart illustrating a simplified, example method for grouping mutually dependent inefficient elements and pruning dependencies, in accordance with example embodiments. Graphing and pruning 212 (corresponding to step 212 in FIG. 2) may be performed by the Graph Generator 114 of FIG. 1. The resulting graph presents all elements identified as major contributors with their influential elements. At step 901, the elements that are mutually dependent are grouped in one node. At step 903, multiple paths are removed. Specifically, for each pair of connected nodes, all the paths connecting them are evaluated using a depth-first search (DFS) algorithm. Then, only the longest path between these two nodes is kept. This method may simplify the graph by reducing the number of edges without any information loss. At step 905, the nodes representing the pruned elements are removed, as these nodes are not the root causes of network inefficiencies.
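Steps 901 and 903 might be sketched in Python as follows. This is an illustrative interpretation only: the function names and sample graph are hypothetical, nodes are represented as frozensets of grouped elements, and "keeping only the longest path" is read as dropping a direct edge whenever a longer path joins the same two nodes.

```python
def merge_mutual(edges, mutual):
    """Step 901: collapse mutually dependent elements into one node."""
    group = {}
    for a, b in mutual:
        merged = group.get(a, frozenset([a])) | group.get(b, frozenset([b]))
        for member in merged:
            group[member] = merged

    def node(element):
        return group.get(element, frozenset([element]))

    # Edges inside a merged node disappear; the rest connect the new nodes.
    return {(node(p), node(c)) for p, c in edges if node(p) != node(c)}


def longest_path(adj, src, dst, visited=()):
    """Depth-first search for the longest simple path from src to dst."""
    if src == dst:
        return [src]
    visited = visited + (src,)
    best = None
    for nxt in adj.get(src, ()):
        if nxt in visited:
            continue
        tail = longest_path(adj, nxt, dst, visited)
        if tail is not None and (best is None or len(tail) + 1 > len(best)):
            best = [src] + tail
    return best


def keep_longest_paths(edges):
    """Step 903: drop a direct edge when a longer path joins the same nodes."""
    adj = {}
    for p, c in edges:
        adj.setdefault(p, set()).add(c)
    kept = set()
    for p, c in edges:
        if len(longest_path(adj, p, c)) == 2:  # the direct edge is the longest route
            kept.add((p, c))
    return kept


# Hypothetical graph: A -> B -> C plus shortcut A -> C; C and D are equivalent.
raw_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
graph = keep_longest_paths(merge_mutual(raw_edges, [("C", "D")]))
print(sorted((sorted(p), sorted(c)) for p, c in graph))
```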

Simplified Example

(68) A practical example implementation for the methods and systems described herein is shown in FIGS. 10A, 10B, 10C, and 11. An example application provides a user interface 116 for displaying the graph generated by the automatic root cause analysis method and system of the present disclosure to a user.

(69) FIGS. 10A, 10B, and 10C represent results of a simplified example of a graph highlighting the influence of elements on major contributors. The top nodes 1002, 1022, and 1042 in FIGS. 10A, 10B, and 10C represent major contributors as determined based on a direct association with network performance degradation determined in step 108. All other nodes in the graph result from the influence analysis step 110. Dependency arrangements, such as the combination of several elements in a single node and the edges between nodes, are resolved during the dependency analysis step 112, and the graph is finally subjected to the pruning operations at step 114.

(70) By way of example, the three top nodes 1002, 1022, 1042 illustrate different possible scenarios of influence of elements on major contributors. The top node 1002 indicates that the LTE serving gateway number 35 (Node2 Name:SGW35) is a major contributor to the network performance metric degradation (e.g., mobile network call setup time). The influential elements to this major contributor are the service type Tunneling (node 1004), the SSL service (node 1006) and the TLS service (node 1008). They have been identified as the most influential elements to metric degradation for the major contributor Node2 Name:SGW35 (node 1002). Two hierarchical parent-child relationships (edges 1012, 1014) have been established between influential elements. The edges 1012 and 1014 indicate a one-way co-occurrence between, respectively, the influential elements Service:SSL and Service:TLS (nodes 1006 and 1008) and the influential element Service Type:Tunneling (node 1004). The directional edge indicates that when Service Type:Tunneling occurs, in at least the threshold number of cases Service:SSL or Service:TLS also occurs. The edge 1010 attaches the parent Service Type:Tunneling (node 1004) to its corresponding major contributor node 1002. Then, at step 703 by rule 2, the influential element Service Type:Tunneling (node 1004) has been identified as less explanatory than its children Service:SSL and Service:TLS (nodes 1006 and 1008). Consequently, this dependency has been pruned (as signified by the diagonal hatch marks). Finally, the influence qualification has determined that Service:SSL (node 1006) is explainable as an incompatibility with the major contributor Node2 Name:SGW35 (node 1002), as outlined by a dashed line 1016, and that Service:TLS (node 1008) is explainable as an additive effect as it did not pass the incompatibility assessment. The combination of Service:SSL (node 1006) and Node2 Name:SGW35 (node 1002) represents a possible root cause of the network performance metric degradation. Service:TLS and Node2 Name:SGW35 may also represent independent compounding root causes.

(71) By way of example, the top node 1022 indicates that the handset type Phone B (Handset Type:Phone B) is also a major contributor to the network performance metric degradation. The influential elements to this major contributor are the handset manufacturer Vendor A (node 1024), the service type Gaming (node 1026), the content category Gaming (node 1026), the content provider Game (node 1028) and the service Game A (node 1028). They have been identified in this example as the most influential elements to metric degradation for the major contributor Handset Type:Phone B (node 1022). Mutual dependencies have been identified between Service Type:Gaming and Content Category:Gaming. Consequently, they have been grouped into a single node 1026. Similarly, Content Provider:Game and Service:Game A have been grouped into a single node 1028. Two hierarchical relationships (edges 1032 and 1034) have been established between influential elements. The influential elements Service Type:Gaming and Content Category:Gaming (node 1026) are the children of the influential element Handset Manufacturer:Vendor A (node 1024), and Content Provider:Game and Service:Game A (node 1028) are the children of the influential elements Service Type:Gaming and Content Category:Gaming (node 1026). The edge 1030 attaches the parent Handset Manufacturer:Vendor A (node 1024) to its corresponding major contributor node 1022. Then, at step 703 by rule 2, the influential element Handset Manufacturer:Vendor A (node 1024) has been identified as less explanatory than its children Service Type:Gaming and Content Category:Gaming (node 1026). Consequently, this dependency has been pruned (as signified by the diagonal hatch marks). Also, at step 703 by rule 1, the influential elements Content Provider:Game and Service:Game A (node 1028) have been identified as less explanatory than their parents Service Type:Gaming and Content Category:Gaming (node 1026). Consequently, this dependency has been pruned (as signified by the diagonal hatch marks). Finally, the influence qualification has determined that Service Type:Gaming and Content Category:Gaming (node 1026) are explainable as an additive effect as they did not pass the incompatibility assessment. The combination of Service Type:Gaming and Content Category:Gaming (node 1026) with Handset Type:Phone B (node 1022) represents a possible root cause of the network performance metric degradation.

(72) By way of example, the top node 1042 indicates that the service type Email (Service Type:Email) is also a major contributor to the network performance metric degradation. In this example, no other elements were found during the influence analysis. Thus, the graph would comprise only the major contributor as a single node. The Service Type:Email (node 1042) represents a possible root cause of the network performance metric degradation.

(73) FIG. 11 shows an example source code in Python for the Major Contributors Identifier 108, the Influence Analyzer 110, the Dependency Analyzer 112 and the Graph Generator 114.

II. EXAMPLE OPERATING ENVIRONMENT AND EXAMPLE SYSTEM

(74) FIG. 12 shows a block diagram of the environment 1200 in which the example operates. Multiple monitoring entities 1208 are deployed in Radio Access Network 1202, Mobile core network 1204, internet service provider (ISP) network 1206, or in other types of networks. Monitoring systems 1210 collect the network monitoring data and feed the automatic root cause analysis system 1212 to identify the possible root causes of network performance degradation. The possible root causes are presented on the user interface 1214, which could be standalone or integrated into network operator Operations Support Systems.

(75) FIG. 13 is a block diagram of an example system 1300 for the automatic root cause analysis, in accordance with example embodiments. The example system 1300 may be or include elements or components of a computing device or system, such as that described above. As such, system 1300 may include a processor 1308, memory 1306, and a user interface 1304. Also shown is an automatic root cause analysis module 1312, which may be implemented as separate hardware, software, firmware, or on a virtual machine or container. In an example embodiment, the automatic root cause analysis module 1312 may be instructions stored in the memory 1306 that, when executed by the processor 1308, cause the analysis system 1302 to carry out operations described herein. The analysis system 1302 may also include a network interface 1310 with a communicative connection to a monitoring system 1314 of a network. Interconnections between the elements within the analysis system 1302, indicated by double arrows, may represent one or more buses or other interconnection structures.

III. EXAMPLE OPERATIONS

(76) FIG. 14 is a flow chart illustrating an example embodiment of a method 1400. The method illustrated by FIG. 14 may be carried out by a computing device, such as analysis system 1302 or as described above in connection with FIG. 1. However, the process can be carried out by other types of devices or device subsystems. For example, the process could be carried out by a portable computer, such as a laptop or a tablet device.

(77) The embodiments of FIG. 14 may be simplified by the removal of any one or more of the features shown therein. Further, these embodiments may be combined with features, aspects, and/or implementations of any of the previous figures or otherwise described herein.

(78) The example method 1400 may also be embodied as instructions executable by one or more processors of the one or more server devices of the system or virtual machine or container. For example, the instructions may take the form of software and/or hardware and/or firmware instructions. In an example embodiment, the instructions may be stored on a non-transitory computer readable medium. When executed by one or more processors of the one or more servers, the instructions may cause the one or more servers to carry out various operations of the example method.

(79) Block 1402 of example method 1400 may involve obtaining a set of data records including features that characterize operational aspects of a communication network. Each given data record may be made up of at least a feature vector and one or more performance metrics characterizing operational performance of the communication network. In particular, each feature vector of the set may include a plurality of elements, e.sub.i, i=1, . . . , n, each made up of a feature-value pair, (f.sub.i,v.sub.k), k=1, . . . , m.sub.j, for each j=i, that identifies f.sub.j with a particular one of n operational aspects of the communication network and that assigns to v.sub.k one of m.sub.j values of the particular operational aspect. As discussed above, the operational aspects that are characterized by features may correspond to hardware, software, operational, or functional components related to the network operations. In an example embodiment, the set of data records may be or include log records of operations in the communication network. Each log record may be a voice call, a session detail record for a data session, a performance record, or a status or health check record for at least one of a network device, a network service, network operation system, or network monitoring system, for example.

(80) Block 1404 may involve applying a first statistical analysis to the set of data records and their performance metrics to generate one or more data subsets of major contributors to one or more inefficiencies. More particularly, each of the data subsets may include a respective subset of feature vectors that each contains a respective, particular inefficient element. As discussed above, an inefficient element is one that is associated with a statistically significant negative contribution to network performance.

(81) Block 1406 may involve respectively applying a second statistical analysis to each respective subset of feature vectors to identify influencer elements of the particular inefficient element, and to classify influencer elements as being either additive or associated with an incompatibility with the particular inefficient element. More specifically, for each respective subset of feature vectors, the second statistical analysis may be used to first identify a respective set of influencer elements representing elements associated with statistically significant negative influence on causing the negative contribution to network performance associated with the respective, particular inefficient element. Then, if the respective set of influencer elements is non-empty, the analysis may discriminate between those influencer elements associated with additive negative influence and those associated with an incompatibility with the particular operational aspect of the communication network and assigned value identified with the respective, particular inefficient element.

(82) Block 1408 may involve analyzing the influencer elements to determine dependencies. In accordance with example embodiments, for each respective subset of feature vectors, each possible pair of elements of the influencer elements of the respective influencer set of the respective, particular inefficient element may be analyzed to determine a dependency relationship, based on a co-occurrence of inefficiency associated with both pair members. By doing so, all determined dependency relationships from all pairs may be represented in a respective hierarchical dependency tree in which influencer elements of the respective influencer set correspond to nodes.

(83) Block 1410 may involve identifying redundant dependencies and removing them from the dependency tree. Specifically, for each determined dependency relationship in each respective hierarchical dependency tree, a metric-based rule may be applied to identify redundant dependencies of the respective hierarchical dependency tree. Then, at least one element of each of the redundant dependencies may be removed from each respective hierarchical dependency tree.

(84) Block 1412 may involve grouping mutually dependent influencer elements of each respective hierarchical dependency tree and retaining only a longest of any multiple paths between remaining nodes to generate a respective dependency graph. This results in a “pruned” dependency tree in which the remaining edges have been effectively clarified to reveal and diagnose root causes of major contributors to network degradation.

(85) Finally, block 1414 may involve displaying at least one respective dependency graph in a display device of the system. Advantageously, the displayed graph may provide a graphical rendering of the relationships between major contributors to network degradation and their root causes.

(86) In accordance with example embodiments, the first statistical analysis may include a χ.sup.2 test to identify redundant features, followed by filtering to select only features that provide discriminating information. After selection, a means test may be applied to determine which elements are major contributors. In an example embodiment, these operations may entail applying a χ.sup.2 test for each pair (f.sub.j,f.sub.k), j≠k, in the set of data records to identify redundant pairs for which f.sub.j and f.sub.k provide redundant information with respect to respectively associated metrics. One of the pair members of each redundant pair may then be marked as excluded from consideration in further analysis. Next, for each f.sub.j=t not marked as excluded from consideration, the set of data records may be subdivided into one or more respective feature subsets each having the same respective value v.sub.i, after which a variance test with respect to the associated metric may be applied to the respective feature subsets to determine whether f.sub.j=t represents a discriminating feature. If not, f.sub.j=t may be marked as excluded from consideration in further analysis. Finally, elements and features of the remaining data records may be analyzed for major contributors. Specifically, each particular element of the remaining set of data records that is analyzed is one that has an identical value and corresponds to a feature f.sub.j=s that is not excluded from consideration by either the χ.sup.2 test or the variance test. Statistics of the performance metric(s) associated with each such element are computed, and if a statistically significant negative contribution to network performance with respect to an associated performance metric is determined, the associated element is deemed a major contributor, and identified as an inefficient element. Thus, the respective data subset associated with each inefficient element, namely a respective collection of feature vectors each containing the particular element, may be retained as one of the generated one or more data subsets, wherein the particular element is the particular inefficient element.
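The redundancy screening might be illustrated by the following Python sketch, which computes a Pearson χ.sup.2 statistic over the crosstab of two features. The sketch makes several simplifying assumptions that are not drawn from the disclosure: it tests the association between the two features directly rather than with respect to their associated metrics, it compares the statistic against tabulated critical values at the 0.05 level for one to three degrees of freedom, and the function names and sample data are hypothetical.

```python
from collections import Counter

# Standard chi-square critical values at the 0.05 level, keyed by degrees of freedom.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815}


def chi_square(records, f_j, f_k):
    """Pearson chi-square statistic and degrees of freedom for two features."""
    joint = Counter((r[f_j], r[f_k]) for r in records)
    row = Counter(r[f_j] for r in records)
    col = Counter(r[f_k] for r in records)
    n = len(records)
    stat = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n
            stat += (joint[(a, b)] - expected) ** 2 / expected
    dof = (len(row) - 1) * (len(col) - 1)
    return stat, dof


def is_redundant(records, f_j, f_k):
    """Treat a pair as redundant when the features are significantly associated."""
    stat, dof = chi_square(records, f_j, f_k)
    return dof in CRITICAL_05 and stat > CRITICAL_05[dof]


# Hypothetical records: service type and content category always agree,
# while the handset is spread evenly across both service types.
records = [{"service_type": s, "content_category": s, "handset": h}
           for s in ("Gaming", "Email") for h in ("A", "B") for _ in range(5)]
print(chi_square(records, "service_type", "content_category"))
print(is_redundant(records, "service_type", "handset"))
```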

(87) In further accordance with example embodiments, determining major contributors may be based on comparing means of performance metrics of “candidate” inefficient elements with means of all elements. Specifically, considering a given particular element, a first mean of an associated performance metric for a first data set that includes the given particular element may be computed. Then a second mean of the associated performance metric for a second data set that excludes the given particular element may be computed. Finally, based on a comparison of the first and second computed means, it may be determined that the given particular element has a statistically significant negative contribution to network performance with respect to the associated performance metric.

(88) In accordance with example embodiments, generating the one or more data subsets by applying a first statistical analysis to the set of data records and their performance metrics may entail ranking the respective, particular inefficient elements among the respective subsets into a list according to increasing negative contribution to network performance, and then retaining only a threshold number of list elements in ranked order. For example, only the top 50 major contributors, based on how negatively they contribute to network performance, may be retained for root cause analysis. Other threshold numbers (e.g., besides 50) may be used as well.
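The comparison of means and the retention of a threshold number of major contributors might be sketched in Python as follows. The function name, record layout, and sample data are hypothetical, and a higher metric value is assumed to indicate worse performance. For brevity the sketch ranks candidates by the raw gap between the two means and lists the worst first; a significance test on that gap, which the embodiments call for, is omitted here.

```python
from statistics import mean


def rank_contributors(records, metric="metric", top_n=50):
    """Rank elements by how much worse the metric is with them than without."""
    elements = {(f, v) for r in records for f, v in r.items() if f != metric}
    scored = []
    for f, v in elements:
        with_e = [r[metric] for r in records if r.get(f) == v]      # first data set
        without_e = [r[metric] for r in records if r.get(f) != v]   # second data set
        if with_e and without_e:
            scored.append((mean(with_e) - mean(without_e), (f, v)))
    scored.sort(reverse=True)  # largest degradation first
    # Keep at most top_n candidates, and only those that actually degrade the metric.
    return [(element, gap) for gap, element in scored[:top_n] if gap > 0]


# Hypothetical records: gateway SGW35 shows a much higher metric than SGW12.
records = ([{"gateway": "SGW35", "handset": "A", "metric": 9.0}] * 5
           + [{"gateway": "SGW12", "handset": "A", "metric": 5.0}] * 5
           + [{"gateway": "SGW12", "handset": "B", "metric": 5.0}] * 5)
print(rank_contributors(records, top_n=1))
```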

(89) In accordance with example embodiments, applying the second statistical analysis may be used to identify influencer elements. Specifically, for each respective subset of feature vectors, every given element—except for the particular inefficient element—that has an identical value and has a statistically significant negative contribution to network performance with respect to an associated performance metric may be identified as an influencer element, and thus included in the respective set of influencer elements for the particular inefficient element.

(90) In accordance with example embodiments, applying the second statistical analysis may further discriminate between additive influences and those caused by incompatibilities. Specifically, for each given influencer element of the respective, particular inefficient element (major contributor), a respective influencer subset of feature vectors corresponding to those containing the given influencer element is first determined. Then, for each respective influencer subset the following operations may be carried out: (i) determining a first intersection with the respective subset of feature vectors; (ii) determining a second intersection with a complementary set of the respective subset of feature vectors; (iii) applying a T-test to compare the first and second intersections. Finally (iv), if the T-test comparison yields a statistically significant difference, then marking the given influencer element as an incompatibility element, otherwise marking the given influencer element as an additive element.

(91) In accordance with example embodiments, analyzing each possible pair of elements of the influencer elements of respective influencer sets to determine a dependency relationship may entail determining a joint distribution of the respective associated features (f.sub.r,f.sub.t) for each possible pairing of the influencer elements of each respective influencer set. A co-occurrence of the influencer elements of the respective influencer set may then be determined based on the joint distribution. Further, for each co-occurrence, it may be determined whether it is directional or two-way.

(92) In accordance with example embodiments, identifying redundancies and removing them from the hierarchical dependency tree may entail, for each hierarchical dependency between a dependent parent element, e.sub.p, and a dependent child element e.sub.c, computing a first metric mean with respect to an associated performance metric of each of e.sub.p and e.sub.c for a first particular set containing e.sub.p and not containing e.sub.c. A second metric mean with respect to the associated performance metric may be computed for a second particular set containing the respective, particular inefficient element, and the first and second metric means may be compared. If the first metric mean is smaller than the second metric mean, then the parent element may be removed from the respective hierarchical dependency tree. If the second metric mean is smaller than the first metric mean by more than a threshold amount, then the child element may be removed from the respective hierarchical dependency tree.

(93) In accordance with example embodiments, grouping mutually dependent influencer elements of each respective hierarchical dependency tree and retaining only a longest of any multiple paths between remaining nodes to generate a respective dependency graph may entail, for each node of each respective hierarchical dependency tree, grouping mutually dependent elements. Then, for each pair of nodes connected with multiple paths, a depth-first search (DFS) algorithm may be applied to determine the longest of the multiple paths. Finally, all of the multiple paths that are not the longest paths may be removed.

IV. CONCLUSION

(94) The present disclosure is not to be limited in terms of the particular embodiments described in this application, which are intended as illustrations of various aspects. Many modifications and variations can be made without departing from its scope, as will be apparent to those skilled in the art. Functionally equivalent methods and apparatuses within the scope of the disclosure, in addition to those described herein, will be apparent to those skilled in the art from the foregoing descriptions. Such modifications and variations are intended to fall within the scope of the appended claims.

(95) The above detailed description describes various features and operations of the disclosed systems, devices, and methods with reference to the accompanying figures. The example embodiments described herein and in the figures are not meant to be limiting. Other embodiments can be utilized, and other changes can be made, without departing from the scope of the subject matter presented herein. It will be readily understood that the aspects of the present disclosure, as generally described herein, and illustrated in the figures, can be arranged, substituted, combined, separated, and designed in a wide variety of different configurations.

(96) With respect to any or all of the message flow diagrams, scenarios, and flow charts in the figures and as discussed herein, each step, block, and/or communication can represent a processing of information and/or a transmission of information in accordance with example embodiments. Alternative embodiments are included within the scope of these example embodiments. In these alternative embodiments, for example, operations described as steps, blocks, transmissions, communications, requests, responses, and/or messages can be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved. Further, more or fewer blocks and/or operations can be used with any of the message flow diagrams, scenarios, and flow charts discussed herein, and these message flow diagrams, scenarios, and flow charts can be combined with one another, in part or in whole.

(97) A step or block that represents a processing of information can correspond to circuitry that can be configured to perform the specific logical functions of a herein-described method or technique. Alternatively or additionally, a step or block that represents a processing of information can correspond to a module, a segment, or a portion of program code (including related data). The program code can include one or more instructions executable by a processor for implementing specific logical operations or actions in the method or technique. The program code and/or related data can be stored on any type of computer readable medium such as a storage device including RAM, a disk drive, a solid state drive, or another storage medium.

(98) The computer readable medium can also include non-transitory computer readable media such as computer readable media that store data for short periods of time like register memory and processor cache. The computer readable media can further include non-transitory computer readable media that store program code and/or data for longer periods of time. Thus, the computer readable media may include secondary or persistent long term storage, like ROM, optical or magnetic disks, solid state drives, compact-disc read only memory (CDROM), for example. The computer readable media can also be any other volatile or non-volatile storage systems. A computer readable medium can be considered a computer readable storage medium, for example, or a tangible storage device.

(99) Moreover, a step or block that represents one or more information transmissions can correspond to information transmissions between software and/or hardware modules in the same physical device. However, other information transmissions can be between software modules and/or hardware modules in different physical devices.

(100) The particular arrangements shown in the figures should not be viewed as limiting. It should be understood that other embodiments can include more or less of each element shown in a given figure. Further, some of the illustrated elements can be combined or omitted. Yet further, an example embodiment can include elements that are not illustrated in the figures.

(101) While various aspects and embodiments have been disclosed herein, other aspects and embodiments will be apparent to those skilled in the art. The various aspects and embodiments disclosed herein are for purpose of illustration and are not intended to be limiting, with the true scope being indicated by the following claims.


CPC classification: G06F16/212 (PHYSICS); G06F16/2246 (PHYSICS); H04L41/0631 (ELECTRICITY); H04L41/0636 (ELECTRICITY); H04L41/142 (ELECTRICITY)

International classification: G06F16/00 (PHYSICS); G06F16/21 (PHYSICS); H04L12/24 (ELECTRICITY); G06F16/22 (PHYSICS)
