METHOD FOR DETECTING ANOMALIES IN COMMUNICATIONS, AND CORRESPONDING DEVICE AND COMPUTER PROGRAM PRODUCT

Abstract

Described herein are solutions for detecting anomalies in communications exchanged through a communication network between a respective source and a respective destination. For this purpose, a computer (40a) generates pre-processed data (PD) that comprise one or more tokens for the respective communication. Next, the computer divides the monitoring interval (MI) into a training interval (TI) and a verification interval (VI).

The computer then generates (1008) a first list of features (F.sub.SRC,TI) for the connections of a given source (SRC) in the training interval (TI). For this purpose, the computer determines, for the connections of the given source (SRC), the unique destination identifiers and, for each token, the respective unique values. Next, the computer determines a first set of enumeration rules and associates, by means of the first set of enumeration rules, to each connection a respective enumerated destination identifier and one or more respective enumerated tokens. Likewise, the computer generates, by means of a second set of enumeration rules, a second list of features (F.sub.GRP,TI) for the connections of a set of devices (GRP) to which the given source (SRC) belongs.

The computer then generates (1010) a first set of Bayesian networks by training, for each feature of the first list of features (F.sub.SRC,TI), a respective Bayesian network using the data of the other features of the first list of features (F.sub.SRC,TI), and generates a second set of Bayesian networks by training, for each feature of the second list of features (F.sub.GRP,TI), a respective Bayesian network using the data of the other features of the second list of features (F.sub.GRP,TI).

Next, the computer generates, for the connections of the given source (SRC) in the verification interval (VI), a third list of features (F.sub.SRC,VI) using the first set of enumeration rules and a fourth list of features (F.sub.GRP,VI) using the second set of enumeration rules. Consequently, by means of the first set of Bayesian networks and the second set of Bayesian networks, the computer can classify each value of the third list of features (F.sub.SRC,VI) and of the fourth list of features (F.sub.GRP,VI), respectively, as normal or anomalous. In various embodiments, the classification can also use one or more SVMs (Support Vector Machines).

Claims

1. A method of detecting anomalies in communications exchanged via a communication network between a respective source and a respective destination, comprising the steps of: obtaining metadata for a plurality of communications in a monitoring interval, wherein said metadata include for each communication an identifier of said source, an identifier of said destination, and data extracted from an application protocol of the respective communication; processing said extracted data to obtain preprocessed data comprising one or more tokens for the respective communication, wherein each token comprises a string; dividing said monitoring interval into a training interval and a verification interval; obtaining the identifier of a given source and generating a first list of a plurality of features (F.sub.SRC,TI) for connections of said given source in said training interval via the following steps: selecting the connections of said given source in said training interval, determining for said connections of said given source in the said training interval the univocal destination identifiers and for each token the respective univocal values, determining a first set of enumeration rules by enumerating said univocal destination identifiers and for each token the respective univocal values, and associating by means of said first set of enumeration rules with each connection of said source in said training interval a respective enumerated destination identifier and one or more respective enumerated tokens, wherein said first list of features comprises for each connection of said given source in said training interval the respective enumerated destination identifier and the respective one or more enumerated tokens; obtaining the identifier of a group of devices to which said given source belongs and generating a second list of a plurality of features for the connections of the devices belonging to said group of devices in said training interval via the following steps: selecting 
the connections of said group of devices in said training interval, determining for said connections of said group of devices in said training range the univocal destination identifiers and for each token the respective univocal values, determining a second set of enumeration rules by enumerating said univocal destination identifiers and for each token the respective univocal values, and associating by means of said second set of enumeration rules with each connection of said group of devices in said training interval a respective enumerated destination identifier and one or more respective enumerated tokens, wherein said second list of features comprises for each connection of said group of devices in said training interval the respective enumerated destination identifier and the respective one or more enumerated tokens; generating a first set of Bayesian networks by training for each feature of said first list of features a respective Bayesian network using the data of the other features of said first list of features (F.sub.SRC,TI), and generating a second set of Bayesian networks by training for each feature of said second list of features a respective Bayesian network using the data of the other features of said second list of features, generating a third list of a plurality of features for the connections of said given source in said verification interval via the following steps: selecting the connections of said given source in said verification interval, and associating by means of said first set of enumeration rules with each connection of said given source in said verification interval a respective enumerated destination identifier and one or more respective enumerated tokens, wherein said third list of features comprises for each connection of said given source in said verification interval the respective enumerated destination identifier and the respective one or more respective enumerated tokens; generating a fourth list of a plurality of features for 
connections of said given source in said verification interval via the following steps: selecting the connections of said given source in said verification interval, associating by means of said second set of enumeration rules with each connection of said given source in said verification interval a respective enumerated destination identifier and one or more respective enumerated tokens, wherein said fourth list of features comprises for each connection of said given source in said verification interval the respective enumerated destination identifier and the respective one or more respective enumerated tokens; repeating the following steps for each connection of said given source in said verification interval: determining based on the values of the features of said third list of features associated with the respective connection of said given source for each feature of said third list of features the respective most probable value by using said first set of Bayesian networks, classifying each value of the features of said third list of features associated with the respective connection of said given source via the following steps: in case the value of a feature of said third list of features corresponds to the respective most probable value, classifying the value of the feature of said third list of features as normal, and in case the value of a feature of said third list of features does not correspond to the respective most probable value: a) determining for the value of said feature of said third list of features the respective probability of occurrence by using said first set of Bayesian networks, and b) classifying the value of said feature of said third list of features as normal if the respective probability of occurrence is greater than a first threshold, and c) classifying the value of said feature of said third list of features as anomalous if the respective probability of occurrence is smaller than said first threshold; and determining based on the 
values of the feature values of said fourth list of features associated with the respective connection of said given source for each feature of said fourth list of features the respective most probable value by using said second set of Bayesian networks, and classifying each value of the features of said fourth list of features associated with the respective connection of said given source via the following steps: in case the value of a feature of said fourth list of features corresponds to the respective most probable value, classifying the value of the feature of said fourth list of features as normal, and in case the value of a feature of said fourth list of features does not correspond to the respective most probable value: a) determining for the value of said feature of said fourth list of features the respective probability of occurrence by using said second set of Bayesian networks, and b) classifying the value of said feature of said fourth list of features as normal if the respective probability of occurrence is greater than a second threshold, and c) classifying the value of said feature of said fourth list of features as anomalous if the respective probability of occurrence is smaller than said second threshold.

2. The method according to claim 1, comprising: repeating the following steps for each connection of said given source in said verification interval: determining a first number of values of the features of said third list of features associated with the respective connection of said given source that are classified as anomalous, determining a second number of values of the features of said fourth list of features associated with the respective connection of said given source that are classified as anomalous, and classifying the connection of said given source as anomalous if the first number and/or the second number is greater than a third threshold.

3. The method according to claim 1, comprising: repeating the following steps for each connection of said given source in said verification interval: determining a first average value of the probabilities of occurrence of the values of the features in said third feature list associated with the respective connection of said given source that are classified as anomalous, determining a second average value of the probability of occurrence of the values of the features of said fourth feature list associated with the respective connection of said given source that are classified as anomalous, and classifying the connection of said given source as anomalous if the first average value and/or the second average value is smaller than a fourth threshold.

4. The method according to claim 2, comprising: training a first single-class Support Vector Machine, SVM, by using said first list of features, repeating the following steps for each connection of said given source in said verification interval: classifying the values of the features of said third list of features associated with the respective connection of said given source as normal or anomalous by using said first SVM, and classifying the connection of said given source as suspicious if the connection of said given source is classified as anomalous and the values of the features of said third list of features associated with the respective connection of said given source are classified as anomalous by said first SVM.

5. The method according to claim 2, comprising: training a second single-class SVM by using said second list of features, repeating the following steps for each connection of said given source in said verification interval: classifying the values of the features of said fourth list of features associated with the respective connection of said given source as normal or anomalous by using said second SVM, and classifying the connection of said given source as suspicious if the connection of said given source is classified as anomalous and the values of the features of said fourth list of features associated with the respective connection of said given source are classified as anomalous by said second SVM.

6. The method according to claim 1, comprising: repeating the following steps for each source of a plurality of sources: obtaining a respective identifier of the respective source and generating a respective first list of a plurality of features for the connections of the respective source in said training interval, calculating for each feature of the respective first list of a plurality of features a respective average value, thereby generating a fifth list of features comprising for each source the respective average values of the features of the respective first list of a plurality of features, and generating groups of devices by applying a clustering algorithm, preferably a k-means clustering algorithm, to said fifth list of features.

7. The method according to claim 1, wherein said communications comprise Hypertext Transfer Protocol, HTTP communications, and wherein said one or more tokens are selected from: the HTTP method, the host, the mime type and/or one or more tokens extracted from the user agent field and/or the referrer field; and/or wherein said communications comprise Server Message Block, SMB communications, and wherein said one or more tokens are chosen from: the relative or absolute path to the file and/or one or more tokens extracted from the path to the file.

8. The method according to claim 1, wherein said first list of a plurality of features, said second list of a plurality of features, said third list of a plurality of features and said fourth list of a plurality of features further comprise at least one of: an enumerated value generated for a destination port of the Transmission Control Protocol, TCP, or the User Datagram Protocol, UDP, of the respective communication, a numerical value identifying the duration of the connection, and a numeric value identifying the amount of data exchanged.

9. The method according to claim 1, comprising: discretizing one or more of the features of said first list of a plurality of features, said second list of a plurality of features, said third list of a plurality of features, and said fourth list of a plurality of features by means of a clustering algorithm, preferably a k-means clustering algorithm.

10. The method according to claim 1, comprising: managing a database comprising for each source of a plurality of sources a respective list, wherein each list comprises metadata and/or preprocessed data of a subset of the connections of the respective source in said training interval, wherein said managing a database comprises: deleting data that are older than said training interval, receiving for a given source a list of metadata and/or preprocessed data of the connections of the respective source in said verification interval, selecting the list associated with said source and determining a first number of connections saved in said selected list, determining the number of connections of the respective source in said verification interval, determining a second number of connections as a function of a maximum number of connections, said first number of connections saved in said selected list and said number of connections of the respective source in said verification interval, randomly selecting said second connection number from said connections of the respective source in said verification interval and inserting the metadata and/or preprocessed data of said selected connections into said selected list, and possibly randomly deleting connections from said selected list if the number of connections saved in said selected list exceeds said maximum number of connections.

11. A device configured to implement the method according to claim 1.

12. A computer-program product that can be loaded into the memory of at least one processor and comprises portions of software code for implementing the steps of the method according to claim 1.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0051] The embodiments of the present disclosure will now be described with reference to the annexed drawings, which are provided purely by way of non-limiting example and in which:

[0052] FIG. 1 shows an example of a communication system comprising a network-security monitoring platform;

[0053] FIG. 2 shows an embodiment of operation of an SNMP, where the platform detects possible anomalies in the communications;

[0054] FIG. 3 shows an embodiment of a step of selection of a training and testing dataset that can be used with the platform of FIG. 2;

[0055] FIG. 4 shows an embodiment of a feature-extraction step that can be used during the selection step of FIG. 3;

[0056] FIG. 5 shows an embodiment of a learning and testing step that can be used with the platform of FIG. 2; and

[0057] FIG. 6 shows an embodiment for management of a database that comprises the data that can be used for the learning step.

DETAILED DESCRIPTION

[0058] In the ensuing description, numerous specific details are provided to enable an in-depth understanding of the embodiments. The embodiments may be implemented without one or more of the specific details, or with other methods, components, materials, etc. In other cases, operations, materials, or structures that are well known are not represented or described in detail so that the aspects of the embodiments will not be obscured.

[0059] Reference throughout this description to “an embodiment” or “one embodiment” means that a particular characteristic, distinctive element, or structure described with reference to the embodiment is comprised in at least one embodiment. Hence, the use of the phrases “in an embodiment” or “in one embodiment” in various parts of this description does not necessarily refer to one and the same embodiment. Moreover, the particular characteristics, distinctive elements, or structures may be combined in any way in one or more embodiments.

[0060] The references appearing herein are provided only for convenience and do not define the sphere of protection or the scope of the embodiments.

[0061] In the ensuing FIGS. 2 to 6, the parts, elements, or components that have already been described with reference to FIG. 1 are designated by the same references as those used previously in this figure; the description of these elements that have been described previously will not be repeated hereinafter in order not to overburden the present detailed description.

[0062] As mentioned in the previous section, an SNMP makes available to security analysts a set of analytics capable of identifying sequences of suspect events that, in probabilistic terms, are such as to indicate the occurrence of an attack.

[0063] In this context, the present description regards a module configured for analyzing the network traffic, preferably also at an application layer. For instance, in various embodiments the module can analyze one or more of the following protocols: HTTP (HyperText Transfer Protocol), POP (Post-Office Protocol), in particular version 3 (POP3), IMAP (Internet Message Access Protocol), SMTP (Simple Mail Transfer Protocol), and SMB (Server Message Block).

[0064] In particular, this module analyzes the connections set up by the machines monitored by the SNMP, for example between the clients DEV and a local server SRVL and/or between the clients DEV and the WAN, for detecting inconsistencies with respect to their usual behavior. Consequently, the corresponding function can be implemented within a computer 40a configured to implement an SNMP, for example via software code and/or hardware components. For a general description of such an SNMP reference may be made to the description of FIG. 1.

[0065] FIG. 2 shows an embodiment of operation of the computer 40a for analyzing communications.

[0066] After a starting step 1000, the computer 40a receives (in step 1002) data packets DP from one or more data-traffic sensors. For instance, as explained previously, the above data packets DP can be provided by a SPAN port 402 of a switch 100, a router and/or a firewall 20, a TAP 404, etc. In general, the computer 40a may also be directly integrated in one of the data-traffic sensors, for example within a firewall with sufficient computing capacity.

[0067] For instance, with reference to data packets DP in accordance with the IP, each IP packet (IPv4 or IPv6) includes a header comprising an IP source address and an IP destination address. Moreover, each IP packet may include data of a transport protocol, which comprise a payload and possibly further routing information, for example a port for the protocols TCP or UDP. In fact, these transport protocols are used to identify connections between a first device, such as a client DEV, and a second device, such as a server SRV.

[0068] Consequently, in step 1004, the computer 40a can process the data packet DP and extract data characterizing the data packet DP. In particular, in various embodiments, the computer 40a can extract from these headers routing information, such as:

[0069] in the case where the data packet DP comprises an IP packet, the IP source address and the IP destination address; and

[0070] in the case where the data packet DP comprises a TCP or UDP packet (possibly included in an IP packet), the respective port.
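The header extraction of step 1004 can be illustrated by the following minimal sketch, which assumes a raw IPv4 packet without link-layer framing. The function name `extract_routing_info` and the keys of the returned dictionary are hypothetical and are not taken from the description; IPv6 and IP options beyond the header-length field are not handled.

```python
import socket
import struct


def extract_routing_info(packet: bytes) -> dict:
    """Return the IP addresses and, for TCP/UDP, the ports of a raw IPv4 packet."""
    if len(packet) < 20:
        raise ValueError("packet shorter than a minimal IPv4 header")
    if packet[0] >> 4 != 4:
        raise ValueError("only IPv4 is handled in this sketch")
    header_length = (packet[0] & 0x0F) * 4  # IHL field, expressed in bytes
    protocol = packet[9]                    # 6 = TCP, 17 = UDP
    info = {
        "src_ip": socket.inet_ntoa(packet[12:16]),
        "dst_ip": socket.inet_ntoa(packet[16:20]),
        "protocol": protocol,
    }
    # TCP and UDP both start with source port and destination port (16 bits each).
    if protocol in (6, 17) and len(packet) >= header_length + 4:
        src_port, dst_port = struct.unpack(
            "!HH", packet[header_length:header_length + 4]
        )
        info["src_port"] = src_port
        info["dst_port"] = dst_port
    return info
```

In this sketch the destination port is the value that would later be used as the "(destination) port" field of the metadata MD.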

[0071] Moreover, by analyzing individual communications at the transport layer, the computer 40a can also determine the duration of the connection and/or the amount of data exchanged. In general, the amount of data exchanged may correspond to a first datum that indicates the amount of data sent by the first device to the second device, a second datum that indicates the amount of data sent by the second device to the first device, and/or a cumulative value of the first and second data.

[0072] In various embodiments, the computer 40a can also determine a so-called hostname and/or domain name associated to the first device and/or to the second device. In addition, by analyzing the communications at the link layer, the computer 40a can determine also the MAC address of the devices of the LAN, for example the MAC addresses of the clients DEV and/or of the local server SRVL.

[0073] Consequently, in various embodiments, the computer 40a generates, for the communication exchanged between two devices, for example between a client DEV and a server SRV, respective metadata MD, which may comprise: data that identify the first device (e.g., a client DEV), for instance, the respective IP address and/or MAC address; data that identify the second device (e.g., a server SRV), for instance, the respective IP address and/or the respective domain name; data that identify the UDP or TCP (destination) port of the communication; data that identify the duration of the connection, for example expressed in seconds; and/or data that identify the amount of data exchanged, for example the number of bytes.
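By way of illustration only, the metadata MD of a single communication could be represented by a record such as the one sketched below. The class name `ConnectionMetadata` and its field names are hypothetical; the description only lists the kinds of data that may be included.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConnectionMetadata:
    """Basic metadata MD of one communication (hypothetical field names)."""
    src_id: str                            # e.g., IP and/or MAC address of the first device
    dst_id: str                            # e.g., IP address or domain name of the second device
    dst_port: int                          # UDP or TCP destination port
    duration_s: Optional[float] = None     # duration of the connection, in seconds
    bytes_sent: Optional[int] = None       # first device -> second device
    bytes_received: Optional[int] = None   # second device -> first device

    @property
    def bytes_total(self) -> Optional[int]:
        """Cumulative amount of data exchanged, if both directions are known."""
        if self.bytes_sent is None or self.bytes_received is None:
            return None
        return self.bytes_sent + self.bytes_received
```

Keeping the two directions separate and deriving the cumulative value on demand allows any of the three variants of "amount of data exchanged" mentioned above to be used as a parameter.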

[0074] In various embodiments, the computer 40a analyzes in step 1004 also the data of the packets DP at the application layer, i.e., the data that are included in the payload of the transport protocols, for example, in the TCP packets. For instance, by analyzing the payload and/or the port number, the computer 40a can detect the type of application protocol and analyze one or more data of the respective protocol. In general, this analysis is well known for non-encrypted (or only partially encrypted) protocols, such as: HTTP (HyperText Transfer Protocol), POP (Post-Office Protocol), in particular version 3 (POP3), IMAP (Internet Message Access Protocol), SMTP (Simple Mail Transfer Protocol), and SMB (Server Message Block).

[0075] For instance, with reference to the HTTP, the computer 40a can determine the host name, i.e., the registered domain name of the target or the corresponding IP address, contained in the URI (Uniform Resource Identifier) of the HTTP request. Optionally, the computer 40a can also extract one or more data from the HTTP request sent, such as: the HTTP method used for the request; the user agent, i.e., a field normally used to describe the agent generating the request so as to provide information that improves interoperability between systems; and/or the referrer, which contains, in the case of redirection of the request, the URI that caused the redirection.

[0076] Instead, by analyzing the HTTP response, the computer 40a can determine the HTTP status code, i.e., the code that indicates the status of the request, received from the destination.

[0077] In various embodiments, the computer 40a can determine also data that identify a type of file, such as a mime type, in particular with reference to the file/payload requested and/or contained in the data of the connection.

[0078] Likewise, if the communication regards a resource-sharing protocol, such as SMB, the computer 40a can determine the relative or absolute path of the resource accessed. Moreover, the computer 40a can determine the size of the file exchanged.

[0079] The person skilled in the art will appreciate that there also exist encrypted protocols, for example HTTPS (HyperText Transfer Protocol over Secure Socket Layer/Transport Layer Security). In this case, it is practically impossible to analyze the contents of the communication; however, the computer 40a can analyze the handshake step of the TLS or SSL protocols. For this purpose, reference may be made to the Italian patent application 102021000015782, the contents of which are incorporated herein by reference.

[0080] In various embodiments, the computer 40a is configured for storing the aforesaid metadata MD in a memory or database 408. In general, the metadata MD may be extracted in real time, or the computer 40a can store at least in part the data packets DP and process the data packets DP periodically or upon request by an operator. In general, on the basis of the characteristics of the metadata MD, the respective metadata MD can also be stored in different databases, in different tables of the same database, or simply in different log files. For instance, as illustrated schematically in FIG. 2, the computer 40a can manage for each application protocol a respective table or a respective log file, such as:

[0081] a table or a log 4080 for the HTTP communications;

[0082] a table or a log 4082 for the SMB communications; and

[0083] optionally one or more tables or log files 4084 for the communications associated to the e-mail protocols, such as POP3, IMAP, and SMTP.
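The per-protocol storage described above can be sketched as follows. This is only an illustration under stated assumptions: the in-memory dictionary of lists stands in for the tables or log files 4080, 4082, and 4084, and the names `TABLE_BY_PROTOCOL` and `store_metadata` are hypothetical.

```python
from collections import defaultdict

# Mapping from application protocol to the reference of the table or log
# file used in FIG. 2 (the string keys are only labels in this sketch).
TABLE_BY_PROTOCOL = {
    "HTTP": "4080",
    "SMB": "4082",
    "POP3": "4084",
    "IMAP": "4084",
    "SMTP": "4084",
}


def store_metadata(tables, protocol: str, record: dict) -> bool:
    """Append a metadata record to the table of its protocol.

    Returns False, without storing anything, for protocols that have no table.
    """
    table = TABLE_BY_PROTOCOL.get(protocol.upper())
    if table is None:
        return False
    tables[table].append(record)
    return True


tables = defaultdict(list)
store_metadata(tables, "HTTP", {"host": "www.aizoon.it", "method": "GET"})
```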

[0084] In general, the computer 40a could also receive directly the metadata MD from one or more data-traffic sensors, and consequently step 1002 is purely optional.

[0085] In various embodiments, the computer 40a then analyzes, in step 1006, the metadata MD of the various application protocols (HTTP, SMB, etc.) to extract the metadata MD that regard a given monitoring interval MI. For instance, the monitoring interval MI may be defined by an operator or may be determined automatically. For instance, in various embodiments, step 1006 can be started once a day, for example during night-time hours, and the computer 40a can obtain the metadata MD for a period that corresponds to a given number of days, preferably longer than 15 days, for example a month, counting back from the current date.
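A minimal sketch of the selection of step 1006 is given below. It assumes that each metadata record carries a `timestamp` field holding a `datetime`; this field name, the function name, and the default of 30 days are illustrative assumptions.

```python
from datetime import datetime, timedelta


def select_monitoring_interval(records, now: datetime, days: int = 30):
    """Keep only the records whose timestamp lies in the last `days` days."""
    start = now - timedelta(days=days)
    return [r for r in records if start <= r["timestamp"] <= now]
```

When run once a day, `now` would simply be the time at which step 1006 is started.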

[0086] In various embodiments, the computer 40a then processes (in step 1008) the metadata MD of the monitoring interval MI to generate pre-processed data PD and/or to extract a set of features F for each communication. In various embodiments, a portion of these pre-processed data PD and/or of these features F, in particular the data that can be determined by processing individually the metadata MD of a single communication, can also be stored in the memory 408 to prevent these data from having to be calculated each time. Hence, a part of the processing operations of step 1008 may also be implemented in step 1004 to save the pre-processed data PD and/or the features F, associated to a given communication already present in the memory 408.

[0087] Consequently, the computer 40a can process, in step 1010, the aforesaid features F to determine possible anomalies for a given source device SRC. For instance, the source device SRC may be provided by an operator. As will be described in greater detail hereinafter, to detect possible anomalies, the computer 40a is configured to learn, in step 1010, what is the usual network activity of the device within the network being monitored.

[0088] Consequently, in step 1012, the computer 40a can verify whether any anomalies have been detected. In the case where the computer 40a has detected an anomalous behavior (output “Y” from the verification step 1012), the computer 40a proceeds to a step 1014, where it signals the event to an operator, for example by sending a notification to the terminal 406 (see FIG. 1). Next, the computer 40a returns to step 1002 or 1004 to receive new data packets DP and/or metadata MD, or to step 1010 to analyze the communications of another source device SRC. For instance, in this way, the computer 40a could automatically repeat steps 1010-1014 for all the devices that have been detected in the monitored network. Instead, in the case where the computer 40a has not detected any anomaly (output “N” from the verification step 1012), the computer 40a can return directly to step 1002 or step 1004.

[0089] FIG. 3 shows an embodiment of pre-processing in step 1008.

[0090] In particular, as explained previously, by analyzing the data packets DP, the computer 40a is able to determine two sets of metadata MD. The first set of basic metadata MD comprises the characteristic data that can be determined on the basis of the IP and of the transport protocol, for example TCP or UDP. For instance, as mentioned previously, for a given source device, said metadata MD may comprise for each communication:

[0091] an identifier of the source machine, for example the IP address and/or MAC address and/or the hostname of the source machine;

[0092] an identifier of the destination machine, for example the IP destination address and/or the hostname of the target machine;

[0093] the TCP or UDP destination port;

[0094] optionally the duration of the connection, for example expressed in seconds; and

[0095] optionally the amount of data exchanged, for example expressed in bytes received at input and sent at output or else evaluated as the sum of the two.

[0096] To this first set of basic metadata there may then be added other “specific” metadata that depend upon the application protocol used.

[0097] In various embodiments, the computer 40a is configured for filtering, in step 1030, one or more of these data to extract important information, the so-called tokens, discarding information of little importance.

[0098] For instance, in various embodiments, the computer 40a can filter, in step 1030, the user-agent field of the HTTP. For instance, a typical user-agent field may have the following content:

“Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36”
where the first string “Mozilla/5.0” represents a token that identifies the program used for sending the HTTP request, whereas the following part of the string comprises additional comments. Consequently, the computer 40a can filter the user-agent field to extract just the token, i.e., the string up to the first space.
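The filtering of the user-agent field can be sketched in a few lines. The function name `user_agent_token` is hypothetical; the rule itself (keep the string up to the first space) is the one stated above.

```python
def user_agent_token(user_agent: str) -> str:
    """Return the part of the user-agent field that precedes the first space."""
    return user_agent.strip().split(" ", 1)[0]


example = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"
)
token = user_agent_token(example)  # "Mozilla/5.0"
```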

[0099] Likewise, the computer 40a can process (in step 1030) the URL contained in the referrer field of an HTTP request. For instance, in various embodiments, the computer 40a is configured for analyzing the respective string to extract a token chosen from:

[0100] in the case where the field comprises an IP address, the respective IP address and/or the first token of the path of the URL;

[0101] in the case where the field comprises an FQDN (Fully Qualified Domain Name), the first-level and second-level domain and optionally the first token of the path of the URL.

[0102] For instance, in various embodiments, the pre-processing step applied, respectively, to “http://192.168.0.10/aizoon/aizoon.php?action=dashboard.view” and “http://www.aizoon.it/” makes it possible to obtain the token “aizoon” and “aizoon.it”.

[0103] In various embodiments, the computer 40a can remove possible information on the ports, contained in the URL of the referrer.
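By way of a hedged sketch only, the extraction of a token from the referrer URL may be illustrated as follows. The function name `referrer_token` is hypothetical. The choice of preferring the first token of the path over the IP address, and of returning only the last two domain labels for an FQDN, is an assumption made for this example; the description also allows other combinations.

```python
import ipaddress
from urllib.parse import urlsplit


def referrer_token(url: str) -> str:
    """Extract one token from a referrer URL (illustrative assumption)."""
    parts = urlsplit(url)
    # `hostname` excludes any ":port" suffix, so port information is dropped.
    host = parts.hostname or ""
    if not host:
        return "missing"
    path_tokens = [p for p in parts.path.split("/") if p]
    try:
        ipaddress.ip_address(host)
    except ValueError:
        # FQDN: keep the first-level and second-level domain only.
        return ".".join(host.split(".")[-2:])
    # IP address: use the first token of the path if present, else the IP.
    return path_tokens[0] if path_tokens else host
```

Applied to the two URLs of the example above, this sketch returns “aizoon” and “aizoon.it”, respectively.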

[0104] Likewise, also with reference to the path of an SMB communication, the computer can extract, in step 1030, one or more tokens from the path of the file. For instance, in various embodiments, the computer 40a could enable specification of a number of subfile levels, and the computer can extract, in step 1030, a first token that comprises only the path up to this number of subfile levels. In addition or as an alternative, the computer can extract, in step 1030, a second token that comprises only the name of the file (without path).

[0105] In various embodiments, the pre-processing module 1030 may also manage possible empty fields, for example, assigning a special value “missing”. Consequently, the previous tokens can be determined individually for each communication and represent pre-processed data PD.
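The extraction of the two tokens from the path of an SMB communication may be sketched, in a purely illustrative way, as follows. The function name `smb_path_tokens`, the parameter `levels`, and the sample path are hypothetical; whether the server and share names count as path levels is an assumption of this example.

```python
from typing import Optional, Tuple


def smb_path_tokens(path: Optional[str], levels: int = 2) -> Tuple[str, str]:
    """Return (path prefix limited to `levels` levels, file name)."""
    if not path:
        return ("missing", "missing")
    # Accept both backslash and slash separators.
    parts = [p for p in path.replace("\\", "/").split("/") if p]
    if not parts:
        return ("missing", "missing")
    directories, filename = parts[:-1], parts[-1]
    # First token: the path truncated to the configured number of levels.
    prefix = "/".join(directories[:levels]) or "missing"
    # Second token: the name of the file without path.
    return (prefix, filename)


tokens = smb_path_tokens(r"\\server\share\docs\2023\report.docx", levels=2)
```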

[0106] Consequently, each connection comprises a plurality of parameters, where each parameter corresponds to a respective field of the metadata MD and/or to a respective token of the pre-processed data PD extracted from the metadata MD. In this context, each parameter may comprise a string or a numeric value.

[0107] Hence, by analyzing these parameters, the computer 40a can extract (in step 1032) for the source device SRC one or more features from each parameter, thus generating a set of features F.sub.SRC.

[0108] FIG. 4 shows a possible embodiment of step 1032.

[0109] In particular, in step 1060, the computer 40a filters the connections obtained in step 1006, for selecting only the connections that comprise, as identifier of the source machine, the source device SRC, for example identified via an IP address and/or MAC address, and/or a hostname. Consequently, step 1060 selects only the connections generated by the source device SRC. Moreover, the computer 40a selects a first parameter.

[0110] In a verification step 1062, the computer 40a verifies whether the parameter comprises strings or numeric values, i.e., whether the parameter is a numeric parameter x.sub.N or a parameter x.sub.C with non-numeric values. In the case where the parameter comprises strings (output “Y” from the verification step 1062), i.e., a parameter x.sub.C, the computer 40a proceeds to a step 1064 to convert the parameter x.sub.C into a respective numeric parameter x′.sub.N.

[0111] In particular, in various embodiments, to convert the strings into numeric values, the computer 40a considers each unique string of the parameter x.sub.C (i.e., of the respective field of the metadata MD or of a token) as a categorical value and assigns to each unique categorical value (unique string) a respective numeric identifier; i.e., the value of the respective converted numeric parameter x′.sub.N corresponds to enumeration of the categorical value of the respective parameter x.sub.C.

[0112] For instance, in various embodiments, the computer 40a uses, as identifier of the destination machine, the concatenation of the hostname strings and of the respective IP, for example “www.aizoon.it/192.168.0.10” in such a way as to prevent collisions between machines with the same hostname and/or to take into consideration servers that manage different hostnames through a single IP address. In the embodiment considered, this conversion is hence carried out by enumeration of the distinct categorical values observed in the specific monitoring window. For instance, in this way, also the following data can be converted into numeric parameters x′.sub.N: [0113] the IP destination address; [0114] for HTTP communications, the HTTP method, the host, the tokens extracted from the user-agent field and possibly those extracted from the referrer field, and the mime type; and [0115] for SMB communications, the relative or absolute path of the file and/or a token extracted from the path.

[0116] In general, depending on the use of the network, the unique values of a data field/token x.sub.C may be numerous. Consequently, in various embodiments, the computer 40a is configured to extract for each field/token x.sub.C only a given number n of most frequent categorical values. For instance, for this purpose, the computer 40a can generate a list of the unique values that are detected for the parameter x.sub.C, and determine for each unique value a respective number of occurrences. Next, the computer 40a can determine the value of the respective numeric parameter x′.sub.N by enumerating exclusively the n most frequent categorical values and assigning a generic value “others” to all the other (less frequent) values detected. Preferably, the enumeration also follows the occurrence of the unique values; for example, the most frequent value could have the enumeration “1”, the least frequent value still considered individually could have the enumeration “n”, and the generic value “others” could correspond to the enumeration “n+1”. In various embodiments, the value n may be programmable, possibly also individually for each field/token x.sub.C. Alternatively, the computer can also determine the value n automatically, for example by incrementing the value n until the cumulative frequency of occurrence of the unique values enumerated individually exceeds a given threshold, for example 80%. This operation hence also reduces the number of distinct values of the categorical parameters and consequently the computational cost of the subsequent analysis, which, in the case of excessively high dimensionality, could become complex and slow.
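As a minimal sketch under stated assumptions, the enumeration by frequency with the generic value “others” may be illustrated as follows. The names `build_enumeration`, `enumerate_values`, and `coverage` are hypothetical; ties between equally frequent values are resolved here by order of first appearance, which is a choice of this example only.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional


def build_enumeration(values: Iterable[str], n: Optional[int] = None,
                      coverage: float = 0.8) -> Dict[str, int]:
    """Map the n most frequent values to 1..n, most frequent first."""
    counts = Counter(values)
    ranked = [value for value, _ in counts.most_common()]
    if n is None:
        # Automatic choice: increment n until the cumulative frequency
        # of the individually enumerated values exceeds the threshold.
        total = sum(counts.values())
        covered = 0
        n = 0
        while n < len(ranked) and covered <= coverage * total:
            covered += counts[ranked[n]]
            n += 1
    return {value: code for code, value in enumerate(ranked[:n], start=1)}


def enumerate_values(values: Iterable[str], rules: Dict[str, int]) -> List[int]:
    """Apply the rules; every other value receives the code n+1 ("others")."""
    others = len(rules) + 1
    return [rules.get(value, others) for value in values]


observed = ["a"] * 6 + ["b"] * 3 + ["c"]
rules = build_enumeration(observed, n=2)
codes = enumerate_values(["a", "c", "b"], rules)
```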

[0117] Consequently, after conversion of a categorical parameter x.sub.C into a respective numeric parameter x′.sub.N, the computer 40a can proceed to a step 1066, where the computer 40a can process the values of the numeric parameter x′.sub.N and/or extract one or more features from the values of the numeric parameter x′.sub.N. Likewise, also in the case where the parameter is already a numeric parameter x.sub.N (output “N” from the verification step 1062), the computer 40a can proceed to step 1066. For instance, as mentioned previously, the purely numeric parameters x.sub.N may be the duration of the connection, the volume of data exchanged during the connection, and/or the size of a file exchanged through an SMB communication. In general, the computer 40a can manage the HTTP status code and the TCP/UDP destination port as numeric parameter x.sub.N or preferably as categorical parameter x.sub.C, with the corresponding operation of enumeration of the unique values (step 1064).

[0118] Consequently, step 1066 receives for a given numeric parameter x.sub.N or enumerated parameter x′.sub.N, generically denoted hereinafter as parameter x, a respective sequence of values x.sub.1, x.sub.2, x.sub.3, etc., where each value of the parameter x is associated to a respective communication.

[0119] In various embodiments, the computer 40a is configured for processing, in a step 1068, said sequence of values through a statistical analysis to determine data characterizing the distribution of the values, for example the respective mean value x̄ and optionally the standard deviation σ(x). In general, in the case of an enumerated parameter x′.sub.N, the mean value x̄ and the standard deviation σ(x) can be calculated using the number of occurrences of each enumerated value.

[0120] In various embodiments, these statistical data of a numeric parameter x may also be used for normalizing the values of the parameter x. For instance, in various embodiments, the computer 40a is configured for computing a normalized parameter x′ by centering the values of the respective numeric parameter x around their mean x̄, i.e., subtracting from them the value of the mean x̄ itself, and then dividing the result by their standard deviation σ(x) in order to render more evident the information regarding the variation of the feature with respect to its absolute value; namely,

[00001] $x' = \frac{x - \overline{x}}{\sigma(x)}$

[0121] In general, as will be described in greater detail hereinafter, this normalization is purely optional and is preferably applied only for the parameters x.sub.N that were already originally numeric.
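The normalization of equation [00001] may be sketched, by way of example only, as follows. The function name `normalize` is hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the description does not specify which one is used.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def normalize(values: Sequence[float]) -> List[float]:
    """Return (x - mean) / standard deviation for each value x."""
    m = mean(values)
    s = pstdev(values)  # assumption: population standard deviation
    if s == 0:
        # All values are identical: the parameter carries no information.
        raise ValueError("zero standard deviation")
    return [(x - m) / s for x in values]


z = normalize([2.0, 4.0, 6.0])
```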

[0122] In various embodiments, the computer 40a can discretize, in a step 1070, the values of a numeric parameter x.sub.N, of an enumerated parameter x′.sub.N, and/or of a normalized parameter x′, also in this case generically denoted as parameter x.

[0123] For instance, in the simplest case, the computer 40a can determine for the parameter x a range of values between a minimum value x.sub.min and a maximum value x.sub.max. Next, the computer 40a can divide the range of values into a number M of sub-intervals, for example of the same size, where associated to each sub-interval is a respective index. For instance, the number M or the size of the sub-intervals may be programmable. Hence, the computer can generate a respective discretized parameter x.sub.D by determining, for each value of the parameter x, the respective sub-interval and using as value for the discretized parameter x.sub.D the index of the respective sub-interval.
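The discretization into M sub-intervals of the same size may be sketched as follows, under the assumptions that the indices start from 0 and that the maximum value x.sub.max is assigned to the last sub-interval. The function name `discretize` is hypothetical.

```python
from typing import List, Sequence


def discretize(values: Sequence[float], m: int) -> List[int]:
    """Return, for each value, the index (0..m-1) of its sub-interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Degenerate range: every value falls in the same sub-interval.
        return [0] * len(values)
    width = (hi - lo) / m
    # The maximum value would fall at index m; clamp it to m - 1.
    return [min(int((x - lo) / width), m - 1) for x in values]


indices = discretize([0.0, 2.5, 9.9, 10.0], 4)
```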

[0124] Instead, in other embodiments, the computer 40a is configured to generate, in step 1070, the values of the discretized parameter x.sub.D by means of a clustering algorithm, such as a Lloyd clustering algorithm, for example the k-means clustering algorithm. For instance, in various embodiments, the computer 40a is configured for estimating the number K of clusters with the Calinski (Calinski-Harabasz) criterion, a method that typically yields with high likelihood a value of K equal or close to the optimal number of clusters. For instance, for this purpose, also the Wikipedia webpage “Determining the number of clusters in a dataset” may be cited. In general, the aforesaid clustering algorithms are per se well known, and among these there may be cited also the paper by Volkovich, Zeev & Toledano-Kitai, D. & Weber, Gerhard-Wilhelm “Self-learning K-means clustering: A global optimization approach”, 2013, Journal of Global Optimization 56, DOI: 10.1007/s10898-012-9854-y.

[0125] In various embodiments, the computer 40a can also determine, in a step 1072, whether the values of a (numeric, enumerated, normalized, or discretized) parameter are all the same and eliminate the respective parameter x in so far as it does not provide any kind of information. For instance, in the case where the computer 40a has determined, in step 1068, the standard deviation σ(x), it can determine, in step 1072, whether the standard deviation σ(x), or likewise the variance, is zero. For instance, in this way, there may be eliminated features that make reference to the second set of “specific” metadata, which in turn refer to application protocols that are never used by the device being monitored.
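A minimal sketch of the elimination carried out in step 1072 is given below. The function name `drop_constant_parameters` and the parameter names used in the example are hypothetical; the check on the number of distinct values is equivalent to verifying that the variance is zero.

```python
from typing import Dict, List


def drop_constant_parameters(params: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Keep only the parameters whose values are not all identical."""
    return {name: vals for name, vals in params.items() if len(set(vals)) > 1}


kept = drop_constant_parameters({
    "duration": [1.0, 2.0, 1.5],
    "smb_file_size": [0.0, 0.0, 0.0],  # protocol never used by the device
})
```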

[0126] Consequently, in a step 1074, the computer 40a can verify whether there are other parameters to be processed. In particular, in the case where there are other parameters to be processed (output “Y” from the verification step 1074), the computer 40a can proceed, in a step 1076, to selecting another parameter, and the procedure returns to step 1062 to process the new parameter selected. Instead, in the case where there are no other parameters to be processed (output “N” from the verification step 1074), the pre-processing step 1032 terminates, at an end step 1078.

[0127] Consequently, as also illustrated in FIG. 3, the pre-processing procedure 1032 makes it possible to obtain, for the connections of a given source device SRC, respective numeric parameters x.sub.N and enumerated parameters x′.sub.N, possibly normalized into parameters x′.sub.N and/or discretized into parameters x.sub.D, which represent a set F.sub.SRC of features that can be used for analyzing the connections of the device SRC.

[0128] In various embodiments, the computer 40a can also determine, in a step 1036, a second set F.sub.GRP of features for the connections of a set/group of devices to which the device SRC belongs.

[0129] For instance, for this purpose, the computer 40a can determine, in a step 1034, a set of devices GRP for the device SRC. In general, the set GRP may comprise all the devices of the network, in particular the client devices DEV, or just a subset of the devices of the network. For instance, an operator could specify manually the devices that belong to one and the same set GRP, for example assigning all the computers of a given department to a respective set GRP.

[0130] Instead, in various embodiments, the computer may also automatically determine the devices that belong to given sets GRP. For instance, for this purpose, the computer 40a may once again use a k-means clustering algorithm. The number of clusters K may be programmable. For instance, the number K may be chosen between 5 and 20 classes of devices, preferably between 5 and 10.

[0131] In various embodiments, to speed up processing, instead of analyzing all the communications of all the devices, the computer 40a is configured to generate for each device a subset of features, extracting from one or more features the corresponding statistical data that identify the distribution of the respective features, for example the mean value and/or the variance.

[0132] For instance, in various embodiments, the computer 40a is configured to determine the mean value x̄ for at least each of the following parameters: [0133] an enumerated parameter x′.sub.N determined for the data that identify the destination device; [0134] an enumerated parameter x′.sub.N determined for the UDP or TCP destination ports; [0135] one or more enumerated parameters x′.sub.N determined for the respective tokens extracted from the metadata MD at the application-protocol layer; and [0136] optionally a numeric parameter x.sub.N that corresponds to the duration of the connection, and/or one or more numeric parameters x.sub.N that identify the number of bytes exchanged.

[0137] In various embodiments, to determine these data for all the devices SRC that are monitored by the computer 40a, for example for all the devices DEV, step 1032 may hence be repeated for each device SRC. Consequently, in various embodiments, the clustering algorithm receives, in step 1034, for each source device SRC of the network monitored, data that identify the source device, for example the respective IP address and/or MAC address, and the respective sets of mean values x̄ calculated for the features determined for the connections of the source device. On the basis of these data, the clustering algorithm then divides the devices monitored into different clusters, which hence correspond to respective sets GRP.

[0138] Consequently, the computer 40a can determine, in step 1036, the connections of all the devices of the set GRP to which the device SRC selected belongs, and then determine the second set F.sub.GRP of features for these connections. For instance, for this purpose, the computer can follow, in step 1036, the same steps described with reference to step 1032 (see FIG. 4), to obtain, in step 1060, only the connections of the devices that belong to the set GRP. Finally, step 1008 terminates at an end step 1038.

[0139] Consequently, in various embodiments, the computer 40a is configured to obtain, for a given device SRC and a given monitoring interval MI: a first set of features F.sub.SRC for the connections of the device SRC and, optionally, a second set of features F.sub.GRP for the connections of the devices that belong to the set GRP of the device SRC. In general, in the case where the computer 40a is configured for analyzing all the devices, the computer 40a can hence determine a first set of features F.sub.SRC for the connections of each device SRC, and optionally a second set of features F.sub.GRP for the connections of each set of devices GRP. In this context, the features F.sub.GRP may hence also be determined only once for all the devices of one and the same set GRP.

[0140] In various embodiments, the computer 40a can then process, in step 1010, the features F.sub.SRC of the device SRC and optionally the features F.sub.GRP of the respective set of devices GRP in order to detect anomalies in the connections of the device SRC.

[0141] FIG. 5 shows a possible embodiment of the learning and classification step 1010.

[0142] In the embodiment considered, the computer 40a is configured to divide the list of the features into a first set associated to the connections in a first time interval, referred to hereinafter as training interval TI, and a second set associated to the connections in a second interval, referred to hereinafter as verification interval VI. In particular, in this case, the monitoring interval MI comprises both of the intervals TI and VI. For instance, in various embodiments, the verification interval VI may correspond to the last day (or likewise to the last week) comprised in the monitoring interval MI, and the training interval TI may correspond to the remaining days of the monitoring interval MI. In general, the duration of the verification interval VI may be constant or variable. For instance, in the case where step 1000 is started up more or less periodically, the verification interval VI may comprise the data from the last startup of the procedure. Consequently, the verification interval VI may even be shorter and correspond approximately to one hour.

[0143] Consequently, once step 1010 has been started up, the computer 40a can obtain, in a step 1090, the list of features F.sub.SRC of the device SRC for the training interval TI, and use these data, referred to hereinafter as list of features of the training dataset F.sub.SRC,TI, for training one or more classifiers and/or estimators. In general, as described previously, the set of features F.sub.SRC may comprise a plurality of the parameters described previously, in particular, for each field or token considered, the respective numeric parameter x.sub.N or enumerated parameter x′.sub.N, possibly normalized (x′.sub.N) and/or discretized (x.sub.D). In particular, in the case of enumerated features, the computer can generate the list of features F.sub.SRC,TI only for the connections of the device SRC in the training interval TI. For this purpose, the computer hence selects the connections of the device SRC in the training interval. Next, as described previously, the computer determines, for the aforesaid connections of the device SRC, at least the unique destination identifiers and, for each token, the respective unique values and determines a first set of enumeration rules, enumerating the unique destination identifiers and, for each token, the respective unique values. Consequently, in various embodiments, the computer can associate, by means of the first set of enumeration rules, to each connection of the device SRC in the training interval, a respective enumerated destination identifier and one or more respective enumerated tokens; i.e., the list of features F.sub.SRC,TI comprises, for each connection of the device SRC in the training interval, the respective enumerated destination identifier and the respective one or more enumerated tokens.

[0144] For instance, in various embodiments, the computer 40a is configured for training, in a step 1092, a Bayesian engine and/or, in a step 1094, an SVM (Support Vector Machine). Classifiers based upon a Bayesian network are per se well known and may be implemented, for example, with R program libraries for statistical data processing. For instance, for this purpose, there may be cited the paper by Mihaljevic, Bojan et al., “bnclassify: Learning Bayesian Network Classifiers”, January 2019, The R Journal, 10:455, DOI 10.32614/RJ-2018-073. For instance, in various embodiments, the Bayesian engine constructs, in step 1092, NBN (Naïve Bayesian Network) and/or TAN (Tree-Augmented Naïve Bayes) models. For instance, for this purpose, there may be cited the paper by C. K. Chow et al., “Approximating discrete probability distributions with dependence trees”, IEEE Transactions on Information Theory, 14(3):462-467, 1968.

[0145] For instance, in various embodiments, assuming that the set of features F.sub.SRC comprises m features f, the computer 40a can train a Bayesian network for each feature f by calculating the conditional probability P(f.sub.i|f) of each of the other features f.sub.i of the system given f, with i={1, . . . , m}, f.sub.i≠f.

[0146] For instance, in various embodiments, the Bayesian network uses features that have discrete values; namely, the features f and f.sub.i used by the Bayesian engine 1092 are chosen exclusively from among the enumerated parameters x′.sub.N and the discretized parameters x.sub.D. For instance, in this case, the conditional probability P(f.sub.i|f) for a given value of the feature f can be calculated by determining the occurrence of each possible value of the feature f.sub.i. Consequently, in this way, the distribution of the conditional probability P(f.sub.i|f) can likewise be determined for each possible value of the feature f. For instance, in various embodiments, the computer 40a is configured to use the following set of features F.sub.SRC: [0147] the enumerated parameter x′.sub.N determined for the data that identify the destination device; [0148] the enumerated parameter x′.sub.N determined for the UDP or TCP destination ports; [0149] one or more enumerated parameters x′.sub.N determined for the respective tokens extracted from the metadata MD at the application-protocol layer; and [0150] optionally a discretized parameter x.sub.D determined for the duration of the connection, and/or one or more discretized parameters x.sub.D determined for the number of bytes exchanged.

[0151] In the testing step, it is hence possible to estimate the posterior probability P(f| f.sub.i) of a given value of the feature f on the basis of a set of observed features f.sub.i using Bayes' theorem:

[00002] $P(f \mid f_{i}) = h \cdot P(f) \cdot \prod_{i}^{n} P(f_{i} \mid f), \quad f_{i} \neq f$

where h is a normalization constant. For instance, for this purpose, there may be cited the paper Langley et al., “An Analysis of Bayesian Classifiers”, June 1998. For instance, by repeating the calculation of the posterior probability P(f| f.sub.i) for all the values of the feature f, the computer 40a can determine the most likely value of the feature f for a given combination of observed features f.sub.i.
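Purely as an illustrative sketch, the estimation of equation [00002] for one target feature f may be written as follows for the naïve (NBN) case. The class name `NaiveBayesForFeature`, the feature names, and the additive smoothing term `alpha` are assumptions of this example and are not taken from the present description; the smoothing only avoids zero probabilities for combinations never observed during training.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable

Row = Dict[str, int]  # one connection: feature name -> discrete value


class NaiveBayesForFeature:
    """Estimate P(f | f_i) for one target feature f by counting occurrences."""

    def __init__(self, target: str, alpha: float = 1.0) -> None:
        self.target = target
        self.alpha = alpha                 # assumption: additive smoothing
        self.prior = Counter()             # occurrences of each value of f
        self.cond = defaultdict(Counter)   # (f_i name, value of f) -> counts
        self.domain = defaultdict(set)     # f_i name -> values seen in training

    def fit(self, rows: Iterable[Row]) -> "NaiveBayesForFeature":
        for row in rows:
            t = row[self.target]
            self.prior[t] += 1
            for name, value in row.items():
                if name != self.target:
                    self.cond[(name, t)][value] += 1
                    self.domain[name].add(value)
        return self

    def posterior(self, observed: Row) -> Dict[int, float]:
        total = sum(self.prior.values())
        scores = {}
        for t, n_t in self.prior.items():
            p = n_t / total                                 # P(f)
            for name, value in observed.items():
                if name == self.target:
                    continue
                k = len(self.domain[name] | {value})
                count = self.cond[(name, t)][value]
                p *= (count + self.alpha) / (n_t + self.alpha * k)  # P(f_i | f)
            scores[t] = p
        h = 1.0 / sum(scores.values())                      # normalization constant
        return {t: h * p for t, p in scores.items()}


training = [{"dst": 1, "port": 1}] * 3 + [{"dst": 2, "port": 2}]
model = NaiveBayesForFeature("dst").fit(training)
post = model.posterior({"port": 1})
most_likely = max(post, key=post.get)
```

In this sketch one such model would be trained for each of the m features, each time taking a different feature as the target f.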

[0152] To do this, it is possible to use either an NBN (Naïve Bayesian Network) or else the TAN (Tree-Augmented Naïve Bayes) algorithm. An NBN can be represented via an AG (Acyclic Graph), where corresponding to each node is a feature f.sub.i and corresponding to each arc is a conditional probability between two features. In addition, in the NBN there is assumed a strong independence between all the features f.sub.i, and for this reason the AG is a star graph with a feature f (referred to as “root node”) at the center and arcs e.sub.i=P(f.sub.i|f). Unlike NBNs, in TANs the hypothesis of strong independence between the attributes is abandoned, and there is envisaged introduction of an additional arc for each node (with exclusion of the root node) that will connect it to the feature that affects it most.

[0153] As shown in FIG. 5, in various embodiments, the computer 40a can obtain, in a step 1096, the list of features F.sub.GRP of the set GRP of the device SRC for the training interval TI, and use these data, referred to hereinafter as features F.sub.GRP,TI, for training, in a step 1098, a second set of k Bayesian networks, for example NBNs or preferably TANs.

[0154] Also in this case, in the case of enumerated features, the computer can then generate the list of features F.sub.GRP,TI only for the connections of the set of devices GRP in the training interval TI. For this purpose, the computer hence selects the connections of the set of devices GRP in the training interval. Next, as described previously, the computer determines, for the aforesaid connections of the set of devices GRP, at least the unique destination identifiers and, for each token, the respective unique values, and determines a second set of enumeration rules, enumerating the unique destination identifiers and, for each token, the respective unique values. Consequently, in various embodiments, the computer can associate, by means of the second set of enumeration rules, to each connection of the set of devices GRP in the training interval, a respective enumerated destination identifier and one or more respective enumerated tokens; namely, the list of features F.sub.GRP,TI comprises, for each connection of the set of devices GRP in the training interval, the respective enumerated destination identifier and the respective one or more enumerated tokens.

[0155] Consequently, in various embodiments, step 1092 is used for training m Bayesian networks using the data of the individual device SRC, and step 1098 is used for training k Bayesian networks using the data of the set of devices GRP. In this way, there may hence be generated similar sets of Bayesian networks also for the other monitored devices SRC and possibly for the other sets of devices GRP.

[0156] As mentioned previously, in various embodiments, the computer 40a can also train an SVM in step 1094. In particular, in various embodiments, the SVM is a one-class classifier. In this case, the computer 40a uses, in step 1094, a training algorithm that seeks a hypersphere that best circumscribes all the instances of the training dataset, i.e., the list of the data F.sub.SRC,TI. In fact, through the appropriate adjustment of its hyperparameters, it is possible to exclude the trivial solutions, for example, the one represented by the hypersphere of infinite radius. Consequently, in step 1094, the computer 40a trains an SVM to classify the data F.sub.SRC,TI as a normal behavior, which makes it possible to verify whether a given set of features F.sub.SRC represents a normal behavior (it lies inside the hypersphere) or an anomalous behavior (it lies outside the hypersphere).

[0157] Likewise, in various embodiments, the computer 40a can train, in a step 1100, an SVM using as training dataset the list of the data F.sub.GRP,TI, i.e., the data of the respective set GRP. In this way, similar SVMs can then be trained for the other monitored devices SRC and possibly for the other sets of devices GRP. As mentioned previously, in various embodiments, the Bayesian networks use discretized parameters. Instead, the SVM 1094 and optionally the SVM 1100 can use the numeric parameters x.sub.N and/or the respective normalized values x′.sub.N.

[0158] Consequently, the computer 40a can obtain, in a step 1102, the list of features F.sub.SRC of the device SRC for the verification/testing interval VI and use these data, referred to hereinafter as list of features or testing dataset F.sub.SRC,VI, to verify whether each set of features F.sub.SRC of the dataset F.sub.SRC,VI represents a normal behavior or an anomalous behavior.

[0159] In particular, in the case of enumerated features, to generate the list of features F.sub.SRC,VI for the connections of the device SRC in the verification interval, the computer can select the connections of the device SRC in the verification interval and associate, by means of the first set of enumeration rules described previously, to each connection of the device SRC a respective enumerated destination identifier and one or more respective enumerated tokens, where the list of features F.sub.SRC,VI comprises, for each connection of the device SRC in the verification interval, the respective enumerated destination identifier and the respective one or more enumerated tokens.

[0160] Consequently, in various embodiments, the computer 40a can use a given set of features F.sub.SRC of the testing dataset F.sub.SRC,VI to determine, in step 1104, through the Bayesian networks trained in step 1092, whether the respective combination of values indicates a normal behavior or a suspect/anomalous behavior. For instance, in various embodiments, the computer 40a estimates for this purpose the value of a feature f under observation using as prediction method the so-called likelihood weighting. In this case, the computer 40a can estimate, for each possible value of a feature f, the respective probability P(f| f.sub.i) using the other features f.sub.i of the set of features F.sub.SRC. Next, the computer can select the most likely value and compare this estimated value with the actual value of the respective feature f in the given set of features F.sub.SRC. In the case where the values correspond, the value of the feature f is considered normal. Instead, if the actual value is different from the most likely value, the computer 40a compares the probability P(f| f.sub.i) of the actual value with a (preferably configurable) threshold. In various embodiments, the computer 40a can compute the threshold also as a function of the probability of the most likely estimated value.

[0161] Consequently, in the case where the value of the feature f is different from the most likely value and the probability of the actual value is below the threshold, the computer 40a can classify the value of the feature f as suspect/anomalous. Hence, the NBN or TAN models yield a measurement of coherence for each of the observed values of the features f with respect to the usual activity of the machine SRC analyzed. In fact, the lower the confidence calculated by a TAN model for a specific observed value of the feature f, the higher the inconsistency of the observed value in the network traffic.
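The decision described above for a single feature f may be sketched as follows. The function name `classify_feature` and the default threshold of 0.1 are hypothetical; the input `posterior` is assumed to be a mapping from each possible value of f to its estimated probability P(f| f.sub.i).

```python
from typing import Dict


def classify_feature(posterior: Dict[int, float], observed_value: int,
                     threshold: float = 0.1) -> str:
    """Classify the observed value of one feature as normal or anomalous."""
    most_likely = max(posterior, key=posterior.get)
    if observed_value == most_likely:
        return "normal"
    # Values absent from the estimate are treated as having probability zero.
    p_observed = posterior.get(observed_value, 0.0)
    return "anomalous" if p_observed < threshold else "normal"


estimate = {1: 0.90, 2: 0.08, 3: 0.02}
```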

[0162] In this context, the training step takes into consideration a given set of features F.sub.SRC, which typically comprises enumerated values, whilst the features with zero variance can be eliminated (step 1072). Consequently, only the testing features that are present also during training may be considered. To solve this problem, each new value of a categorical variable never seen during training can be set at the value “others”. In this case, irrespective of the corresponding likelihood, the computer 40a can consider the value “others” always as being anomalous.
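By way of a hedged illustration, the determination of a set of enumeration rules on the training interval and its application to the verification interval, including the handling of values never seen during training, may be sketched as follows. The function names and the sample destination identifiers are hypothetical, and the enumeration here follows the order of first appearance, which is an assumption of this example.

```python
from typing import Dict, Iterable, List, Tuple


def fit_enumeration_rules(training_values: Iterable[str]) -> Dict[str, int]:
    """Enumerate the unique values observed in the training interval."""
    rules: Dict[str, int] = {}
    for value in training_values:
        if value not in rules:
            rules[value] = len(rules) + 1
    return rules


def apply_enumeration_rules(values: Iterable[str],
                            rules: Dict[str, int]) -> List[Tuple[int, bool]]:
    """Return (enumerated value, always_anomalous) for each value."""
    others_code = len(rules) + 1
    result = []
    for value in values:
        if value in rules:
            result.append((rules[value], False))
        else:
            # Never seen during training: set to "others", flagged anomalous.
            result.append((others_code, True))
    return result


rules = fit_enumeration_rules(
    ["a.example/10.0.0.1", "b.example/10.0.0.2", "a.example/10.0.0.1"])
verification = apply_enumeration_rules(
    ["b.example/10.0.0.2", "c.example/203.0.113.5"], rules)
```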

[0163] The inventors have noted that there are likely to be incoherences in individual features also with a normal data traffic. Consequently, in various embodiments, the computer classifies a given connection, as identified through the respective set of features F.sub.SRC, as suspect only if the number of features identified as suspect exceeds a given threshold, which for example may correspond to m/2; i.e., a connection is judged suspect if and only if there are encountered incoherences for the majority of the features.
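The majority criterion described above may be sketched as follows, where each entry of the input indicates whether the respective feature has been identified as suspect. The function name `connection_is_suspect` is hypothetical.

```python
from typing import Sequence


def connection_is_suspect(feature_flags: Sequence[bool]) -> bool:
    """Return True only if more than m/2 of the m features are suspect."""
    m = len(feature_flags)
    return sum(feature_flags) > m / 2
```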

[0164] In addition or as an alternative, the computer 40a can compute a mean value for a given connection as a function of the incoherences of all the features and/or of only the features that are different from the respective most likely value. Consequently, in this case, the computer 40a can compare this mean value with a threshold that makes it possible to set a percentage of incoherence below which the entire datum is considered coherent (this can be done to eliminate possible data with low incoherence that could be false positives).

[0165] Consequently, the computer 40a determines in step 1104, by means of the Bayesian networks trained in step 1092 for the training dataset F.sub.SRC,TI, the connections of the testing dataset F.sub.SRC,VI that are suspect. In various embodiments, the computer 40a can then also determine in a step 1106, by means of the Bayesian networks trained in step 1098 for the training dataset F.sub.GRP,TI, the connections of the testing dataset F.sub.GRP,VI that are suspect.

[0166] In particular, in the case of enumerated features, the computer can generate for this purpose a list of features F.sub.GRP,VI for the connections of the device SRC in the verification interval. To generate this list of features F.sub.GRP,VI, the computer can select the connections of the device SRC in the verification interval and, this time by means of the second set of enumeration rules described previously, associate to each connection of the device SRC a respective enumerated destination identifier and one or more respective enumerated tokens, where the list of features F.sub.GRP,VI comprises, for each connection of the device SRC in the verification interval, the respective enumerated destination identifier and the respective one or more enumerated tokens. Consequently, whereas step 1104 verifies the list of features F.sub.SRC,VI that comprises enumerated features with the enumeration rules determined for the individual device SRC, step 1106 verifies the list of features F.sub.GRP,VI that comprises enumerated features with the enumeration rules determined for the set of devices GRP.

[0167] In various embodiments, the computer 40a can then also determine in a step 1108, by means of the SVM trained in step 1094 for the training dataset F.sub.SRC,TI, the connections of the testing dataset F.sub.SRC,VI that are suspect, i.e., classified as anomalous. In various embodiments, the computer 40a can also determine in a step 1110, by means of the SVM trained in step 1100 for the training dataset F.sub.GRP,TI, the connections of the testing dataset F.sub.GRP,VI that are suspect. In general, the computer can classify all the connections by means of the SVM 1094 and/or the SVM 1100. As an alternative, the computer 40a can classify, in step 1108, only the connections that have been identified as suspect in step 1104, and likewise the computer 40a can classify, in step 1110, only the connections that have been identified as suspect in step 1106.

[0168] Consequently, in a step 1112, the computer can combine the results supplied by the various estimators, and the procedure terminates at an end step 1114. For instance, in the simplest case, the computer 40a can classify a connection as suspect if:

[0169] the first Bayesian engine indicates, in step 1104, that the respective connection is suspect, and (when used) also the first SVM indicates, in step 1108, that the same connection is suspect; or

[0170] optionally, the second Bayesian engine indicates, in step 1106, that the respective connection is suspect, and (when used) also the second SVM indicates, in step 1110, that the same connection is suspect.
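The combination rule of the simplest case can be sketched as follows. The function combine_estimators and its argument names are hypothetical; the sketch assumes that each estimator yields a boolean verdict per connection, with None standing for an estimator that is not used.

```python
def combine_estimators(bayes_src, svm_src=None, bayes_grp=None, svm_grp=None):
    """Combine per-connection verdicts (True = suspect, None = not used).

    A branch fires when its Bayesian engine flags the connection and,
    if the corresponding SVM is used, the SVM flags it as well.
    The connection is suspect when either branch fires.
    """
    def branch(bayes, svm):
        if bayes is None:
            return False
        return bool(bayes) and (svm is None or bool(svm))

    return branch(bayes_src, svm_src) or branch(bayes_grp, svm_grp)


print(combine_estimators(True))                     # first Bayesian engine only
print(combine_estimators(True, False))              # first SVM disagrees
print(combine_estimators(False, None, True, True))  # group branch fires
```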

[0171] Some additional embodiments are described in what follows. As explained previously, in various embodiments, the computer acquires the data of a given machine SRC, and possibly of a respective set GRP, for a monitoring interval MI in which the connection data are divided into a training dataset TI and a testing dataset VI.

[0172] However, the data to be processed may amount to a very large number. For instance, as explained previously, to limit the number of possible values of numeric parameters x.sub.N or of enumerated parameters x′.sub.N, the number of possible values of each feature can be reduced by applying, in step 1070, a clustering algorithm, such as a k-means clustering algorithm. This hence enables reduction of the computational costs of the Bayesian networks 1092 and 1098.
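As a purely illustrative sketch of how clustering reduces the number of distinct values of a numeric feature, a one-dimensional k-means (Lloyd's algorithm) can be written as follows. The deterministic initialization, the function names kmeans_1d and quantize, and the sample values are assumptions made for this example and are not taken from the description.

```python
def kmeans_1d(values, k, iterations=20):
    """Lloyd's algorithm on scalar values; returns the sorted centroids."""
    ordered = sorted(values)
    # Assumed initialization: evenly spaced picks from the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        updated = [sum(c) / len(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)


def quantize(value, centroids):
    """Replace a raw value by the index of its nearest centroid."""
    return min(range(len(centroids)), key=lambda j: abs(value - centroids[j]))


sizes = [1, 2, 3, 100, 101, 102, 1000, 1001]
centers = kmeans_1d(sizes, k=3)
print(centers)
print([quantize(v, centers) for v in (5, 99, 900)])
```

With this reduction, a feature that could take arbitrarily many numeric values is represented by at most k cluster indices.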

[0173] However, frequently a large number of connections must be processed. In order to limit the amount of data gathered and balance them as much as possible for each machine, the learning and classification module 1010 can execute a sampling operation.

[0174] In particular, FIG. 6 shows a possible embodiment of steps 1006 and 1008. In particular, as explained previously, the computer 40a determines, in step 1006, only the connections that regard a given monitoring interval MI. Moreover, step 1008, and in particular step 1030, can be used to generate pre-processed data PD. For instance, in the embodiment considered, the data of the connections that are then used to generate the features are stored in a database (for example, the database 408), for instance in the form of a table or list 4086.

[0175] In particular, in the embodiment considered, the procedure of analysis (step 1000) is started, periodically or manually, after a given time interval with respect to a previous starting step. In particular, in various embodiments, this time interval corresponds to the testing interval VI. However, an operator could also specify a different testing interval VI, or the duration of the testing interval VI could be pre-set.

[0176] In the embodiment considered, the computer 40a then obtains, in a step 1120, the duration of the monitoring interval MI. For instance, the duration of the monitoring interval MI may be pre-set or may be calculated with respect to a given date and time selected. For instance, the date can be determined automatically (for example, by selecting the current date minus 30 days) and the time can be pre-set (for example, in the 0-to-24 hour range of the day selected).
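The example of a start date computed as the current date minus 30 days, with a pre-set time at the start of the selected day, can be sketched as follows. The function names and the record layout (a timestamp paired with a payload) are hypothetical.

```python
from datetime import datetime, timedelta


def monitoring_interval(now, days_back=30):
    """Return (start, end) of the monitoring interval MI.

    The start date is the current date minus days_back; the start time
    is assumed to be pre-set to 00:00 of that day.
    """
    start_day = (now - timedelta(days=days_back)).date()
    start = datetime.combine(start_day, datetime.min.time())
    return start, now


def drop_older_than(records, start):
    """Keep only (timestamp, payload) records inside the interval."""
    return [(ts, payload) for ts, payload in records if ts >= start]


start, end = monitoring_interval(datetime(2024, 3, 31, 15, 30))
records = [(datetime(2024, 2, 20), "old"), (datetime(2024, 3, 10), "recent")]
print(start)
print(drop_older_than(records, start))
```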

[0177] Consequently, in various embodiments, the computer 40a can erase, in step 1120, from the database 4086 also the data of the connections that are older than the monitoring interval MI.

[0178] In various embodiments, the computer 40a then determines, in a step 1122, for a given machine SRC the number of connections n.sub.of saved in the table/list 4086. For instance, for this purpose, the data of each machine SRC can be saved in a respective table/list. Moreover, the computer 40a determines, in a step 1124, the number n.sub.nf of the new connections for the machine SRC extracted from the database 408, i.e., the tables/lists 4080, 4082, 4084.

[0179] Consequently, in various embodiments, the computer 40a can use as training dataset, for example, F.sub.SRC,TI, the respective data of the n.sub.of connections of a given machine (or the respective features of the machines of the set GRP) and as testing dataset, for example, F.sub.SRC,VI, the respective data of the n.sub.nf connections of the machine SRC.

[0180] In various embodiments, the computer 40a updates, in a step 1126, also the data stored in the database 4086 on the basis of the data of the new connections. In particular, for this purpose, the computer uses a (constant or programmable) parameter that indicates the maximum number n.sub.f of connections to be saved for each machine SRC.

[0181] Consequently, the computer 40a can compute a number n.sub.s of new connections to be entered in the database 4086. For instance, for this purpose, the computer 40a can compute the number of connections that are lacking to reach the “maximum” number n.sub.f, while always entering at least a certain minimum number (fixed, for example, at 0.5% of n.sub.f), this minimum being at least equal to 1. In addition, if n.sub.nf does not reach n.sub.s, the computer 40a adds only the new connection data that are available, namely

n.sub.s=max(n.sub.f−n.sub.of, 0.005·n.sub.f, 1)

n.sub.s=min(n.sub.s, n.sub.nf)
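The computation of n.sub.s can be sketched as follows, with the result capped at the number n.sub.nf of new connections that are actually available, as stated in the text. The function name records_to_insert is hypothetical, and rounding the 0.5% share up to an integer is an assumption, since the description does not state how a fractional value is handled.

```python
import math


def records_to_insert(n_f, n_of, n_nf, min_fraction=0.005):
    """Number n_s of new connections to store for one machine.

    n_f:  maximum number of connections kept per machine
    n_of: connections already stored
    n_nf: new connections available
    """
    # Fill up to n_f, but insert at least min_fraction of n_f and at least 1.
    # Rounding up the fractional share is an assumption of this sketch.
    n_s = max(n_f - n_of, math.ceil(min_fraction * n_f), 1)
    # Never insert more than the new connections that exist.
    return min(n_s, n_nf)


print(records_to_insert(n_f=10000, n_of=9990, n_nf=500))
print(records_to_insert(n_f=10000, n_of=2000, n_nf=300))
```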

[0182] The computer 40a can then possibly erase, in step 1126, from the database 4086 some connections to free the space to add the n.sub.s new records, in particular in the case where n.sub.of+n.sub.s>n.sub.f. Preferably, the computer randomly selects the connections to be erased and/or to be entered.
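A minimal sketch of this update with random selection is given below. The function update_store is hypothetical, and the stored connections are represented by plain list entries for the purposes of the example.

```python
import random


def update_store(stored, new, n_f, n_s, rng=None):
    """Insert up to n_s randomly chosen new records, evicting at random.

    stored: records already in the table/list
    new:    newly extracted records
    n_f:    maximum number of records to keep
    """
    if rng is None:
        rng = random.Random()
    incoming = rng.sample(new, min(n_s, len(new)))
    overflow = len(stored) + len(incoming) - n_f
    if overflow > 0:
        # Free space by keeping a random subset of the stored records.
        keep = rng.sample(stored, max(len(stored) - overflow, 0))
    else:
        keep = list(stored)
    return keep + incoming


stored = list(range(100))
new = list(range(100, 120))
result = update_store(stored, new, n_f=100, n_s=10, rng=random.Random(0))
print(len(result))  # the table stays at the maximum size n_f
```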

[0183] Consequently, the computer 40a can save, in step 1126, at least a part of the metadata MD of the n.sub.s connections in the database 4086. For instance, in various embodiments, the computer saves only the metadata MD that are then used during the subsequent step of feature extraction in steps 1032-1036. In fact, as shown in FIG. 6, preferably the computer 40a stores also at least one part of the pre-processed data PD for each of the n.sub.s new connections in the database 4086. For instance, in this way, the computer 40a can already store in the database 4086 the tokens described previously, whilst the enumeration of the categorical parameters and/or the statistical analyses can be carried out in real time. Consequently, instead of extracting all the data of the connections from the database 408 when the procedure is started up, the computer can cyclically update the data of the connections stored in the database 4086.

[0184] Furthermore, as explained previously, step 1010 identifies anomalous/suspect connections. Once the results on the inconsistency of the individual data and/or connections are obtained, the computer 40a can also aggregate, in step 1112, the results in order to obtain a value of inconsistency that is no longer by packet/connection but by single machine and/or for the entire network under analysis.

[0185] For instance, in a currently preferred embodiment, the computer 40a is configured for obtaining, for a given interval VI, the mean value of inconsistencies described previously for all the connections classified as anomalous. Next, the computer 40a selects the respective mean values of inconsistencies of each machine SRC and calculates a respective overall mean value on the basis of the data of the machine. Consequently, in this way, the computer obtains for each verification interval VI a respective overall mean value for each machine. In general, the computer 40a could also divide the data of the testing interval VI into sub-intervals (for example, of one hour) and compute, for each verification sub-interval, a respective overall mean value for each machine.

[0186] In various embodiments, the computer 40a also determines the respective number of inconsistencies of each machine SRC for the interval VI (or for each sub-interval).

[0187] Consequently, in this way, the computer can determine a list/table that comprises:

[0188] the identifier of the machine SRC;

[0189] an identifier of the interval VI or of the respective sub-interval;

[0190] the overall mean value calculated for the anomalous connections of the machine SRC in the interval VI or the respective sub-interval; and

[0191] optionally the number of anomalous connections of the machine SRC in the interval VI or respective sub-interval.
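The construction of such a list/table can be sketched as follows. The function summarize, the dictionary keys and the input layout (machine identifier, interval identifier, mean inconsistency of one anomalous connection) are hypothetical choices made for this example.

```python
from collections import defaultdict


def summarize(anomalous):
    """Aggregate anomalous connections per machine and interval.

    anomalous: iterable of (machine_id, interval_id, mean_inconsistency),
               one entry per connection classified as anomalous.
    """
    groups = defaultdict(list)
    for machine, interval, score in anomalous:
        groups[(machine, interval)].append(score)
    return [
        {
            "machine": machine,
            "interval": interval,
            "overall_mean": sum(scores) / len(scores),
            "count": len(scores),
        }
        for (machine, interval), scores in sorted(groups.items())
    ]


table = summarize([("A", 0, 0.5), ("A", 0, 1.0), ("B", 0, 0.25)])
for row in table:
    print(row)
```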

[0192] In various embodiments, the computer 40a then generates a first list of overall mean values and/or numbers of the anomalous connections, selecting the data of the various intervals/sub-intervals of a given machine SRC. Additionally, the computer 40a generates a second list of overall mean values and/or numbers of anomalous connections, selecting the data, associated to a given interval VI (or sub-interval), of the different machines SRC.
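The two selections can be illustrated as follows, assuming one row per machine and interval with hypothetical keys "machine", "interval", "overall_mean" and "count"; the sample rows are invented for the example.

```python
def first_list(rows, machine):
    """Rows of one machine across the intervals, ordered by interval."""
    picked = [r for r in rows if r["machine"] == machine]
    return sorted(picked, key=lambda r: r["interval"])


def second_list(rows, interval):
    """Rows of one interval across the machines, ordered by machine."""
    picked = [r for r in rows if r["interval"] == interval]
    return sorted(picked, key=lambda r: r["machine"])


rows = [
    {"machine": "A", "interval": 1, "overall_mean": 0.7, "count": 4},
    {"machine": "A", "interval": 0, "overall_mean": 0.6, "count": 2},
    {"machine": "B", "interval": 0, "overall_mean": 0.9, "count": 7},
]
print(first_list(rows, "A"))
print(second_list(rows, 0))
```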

[0193] The first and/or second list can then be analyzed statistically, for example to understand the statistical distribution of the data.

[0194] For instance, in various embodiments, the computer 40a determines a first histogram for the overall mean values of the first list. Additionally, the computer 40a determines a second histogram for the numbers of the connections of the first list.

[0195] Likewise, the computer 40a can determine a third histogram for the overall mean values of the second list, and/or a fourth histogram for the numbers of the connections of the second list.

[0196] For instance, to generate a histogram, the computer can first discretize the respective overall mean values, and possibly also the numbers of the connections.

[0197] Consequently, associated to each value is the number of occurrences o of the respective discretized value in the respective list. Likewise, an analysis could be carried out for so-called quantiles. Hence, this analysis provides a list of discretized values and the respective values of occurrence o.

[0198] In various embodiments, the values of occurrence o are then weighted via a non-linear function, which weights more the values of occurrence o that are higher.

[0199] For instance, in various embodiments, the computer obtains weighted values o′ using the following equation:

o′=o.sup.c

where c is a coefficient and for example corresponds to e.

[0200] Consequently, in various embodiments, the computer can obtain, for each of the histograms, a single representative value, for example by calculating the mean value of the weighted values of occurrence o′, i.e., mean(o′). For instance, such final values (for example, mean(o′)) can be used to compare the behavior of the network over time.
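The discretization, the weighting o′=o.sup.c and the final mean can be sketched together as follows. The function name representative_value, the fixed bin width used for the discretization and the sample values are assumptions made for this example.

```python
import math
from collections import Counter


def representative_value(values, bin_width, c=math.e):
    """Single representative value of a histogram.

    values:    overall mean values (or connection counts) of one list
    bin_width: width of the discretization bins (assumed fixed)
    c:         exponent of the non-linear weighting o' = o ** c
    """
    if not values:
        raise ValueError("at least one value is required")
    # Discretize and count the occurrences o of each discretized value.
    occurrences = Counter(math.floor(v / bin_width) for v in values)
    # Weight higher occurrences more strongly, then average.
    weighted = [o ** c for o in occurrences.values()]
    return sum(weighted) / len(weighted)


means = [0.05, 0.07, 0.31, 0.33, 0.35, 0.92]
print(representative_value(means, bin_width=0.1))       # c = e
print(representative_value(means, bin_width=0.1, c=2))  # c = 2
```

Because the exponent c is greater than 1, a histogram dominated by a few heavily populated bins yields a larger representative value than one whose occurrences are spread evenly.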

[0201] Of course, without prejudice to the underlying principles of the invention, the details of implementation and the embodiments may vary widely with respect to what has been described and illustrated herein purely by way of example, without thereby departing from the scope of the present invention, as defined in the annexed claims.

CPC classification

G06N20/10 (PHYSICS)
G06N7/01 (PHYSICS)
H04L41/16 (ELECTRICITY)
H04L63/1408 (ELECTRICITY)
H04L43/04 (ELECTRICITY)

International classification

H04L41/16 (ELECTRICITY)
G06F18/23213 (PHYSICS)
G06N20/10 (PHYSICS)
H04L43/04 (ELECTRICITY)
