Traffic analysis method, common service traffic attribution method, and corresponding computer system

Abstract

This application provides a traffic analysis method and apparatus, and a computer system. The method includes: obtaining a plaintext feature and a ciphertext feature of a packet in traffic, where the ciphertext feature includes a length feature of an encrypted field in the packet; and analyzing the traffic based on the plaintext feature and the ciphertext feature, to identify a service or an application to which the traffic belongs. The method may be used for service identification or application identification. The ciphertext feature is introduced in traffic analysis, so that traffic identification accuracy is improved in a packet encryption scenario. In addition, this application further provides a common service traffic attribution method and apparatus, and a computer system.

Claims

1. A common service traffic attribution method, comprising: determining, according to an identification rule, a maximum quantity of incoming packets required for a traffic analysis, wherein the identification rule is obtained based on a feature by using a machine learning algorithm to identify different services in traffic; filtering the traffic based on the maximum quantity of incoming packets; obtaining a feature of a packet in the traffic, wherein the feature comprises a ciphertext feature having one or more of a sequence, a length, or a transmission direction of an encrypted packet; analyzing the traffic based on the feature, to identify a start service, an exclusive service, and a common service in the traffic, wherein the start service is a service invoked in an application startup phase, the exclusive service is a service invoked by only one application, and the common service is a service invoked by a plurality of applications; and attributing traffic of a common service whose identification time is between a first identification time of a start service A and a second identification time of a start service B to an application that invokes an exclusive service whose identification time is between the first identification time and the second identification time, wherein the start service A is any identified start service, and the start service B is a first start service whose identification time is after the first identification time.

2. The method according to claim 1, before the obtaining the feature, further comprising: filtering the traffic based on Internet Protocol (IP) information of the traffic.

3. The method according to claim 1, wherein analyzing the traffic comprises: performing matching between the feature and each of a first identification rule, a second identification rule, and a third identification rule to identify the start service, the exclusive service, and the common service in the traffic, wherein the first identification rule, the second identification rule, and the third identification rule are obtained based on the feature by using a machine learning algorithm.

4. The method according to claim 1, wherein attributing traffic of a common service comprises: determining the application based on the exclusive service and correspondence information, wherein the correspondence information comprises a correspondence between the exclusive service and an application that invokes the exclusive service.

5. The method according to claim 1, wherein the feature further comprises a plaintext feature, and the plaintext feature comprises a feature comprising a character and/or a digit that can be directly obtained from the packet through parsing.

6. A common service traffic attribution method, comprising: determining, according to an identification rule, a maximum quantity of incoming packets required for a traffic analysis, wherein the identification rule is obtained based on a feature by using a machine learning algorithm to identify different services in traffic; filtering the traffic based on the maximum quantity of incoming packets; obtaining a feature of a packet in the traffic, wherein the feature comprises a ciphertext feature, and the ciphertext feature comprises any one or more of a sequence, a length, or a transmission direction of an encrypted packet; analyzing the traffic based on the feature, to identify an exclusive service and a common service in the traffic, wherein the exclusive service is a service invoked by only one application, and the common service is a service invoked by a plurality of applications; and attributing traffic of a common service whose identification time is between an identification time of an exclusive service A and an identification time of an exclusive service B to an application, wherein the application is an application that invokes the exclusive service A, the exclusive service A is any identified exclusive service, and the exclusive service B is a first exclusive service whose identification time is after the identification time of the exclusive service A.

7. The method according to claim 6, wherein the analyzing the traffic based on the feature, to identify an exclusive service and a common service in the traffic comprises: performing matching between the feature and each of a second identification rule and a third identification rule to identify the exclusive service and the common service in the traffic, wherein the second identification rule and the third identification rule are obtained based on the feature by using a machine learning algorithm.

8. A computer system, comprising a memory and a processor, wherein the memory is configured to store a computer readable instruction, which when executed by the processor, causes the processor to perform a common service traffic attribution method, the method comprising: determining, according to an identification rule, a maximum quantity of incoming packets required for a traffic analysis, wherein the identification rule is obtained based on a feature by using a machine learning algorithm to identify different services in traffic; and filtering the traffic based on the maximum quantity of incoming packets; obtaining a feature of a packet in the traffic, wherein the feature comprises a ciphertext feature, and the ciphertext feature comprises any one or more of a sequence, a length, or a transmission direction of an encrypted packet; analyzing the traffic based on the feature, to identify a start service, an exclusive service, and a common service in the traffic, wherein the start service is a service invoked in an application startup phase, the exclusive service is a service invoked by only one application, and the common service is a service invoked by a plurality of applications; and attributing traffic of a common service whose identification time is between a first identification time of a start service A and a second identification time of a start service B to an application that invokes an exclusive service whose identification time is between the first identification time and the second identification time, the start service A is any identified start service, and the start service B is a first start service whose identification time is after the first identification time.

9. The computer system according to claim 8, wherein analyzing the traffic comprises: performing matching between the feature and each of a first identification rule, a second identification rule, and a third identification rule to identify the start service, the exclusive service, and the common service in the traffic, wherein the first identification rule, the second identification rule, and the third identification rule are obtained based on the feature by using a machine learning algorithm.

10. The computer system according to claim 8, wherein attributing traffic of a common service comprises: determining the application based on the exclusive service and correspondence information, wherein the correspondence information comprises a correspondence between the exclusive service and an application that invokes the exclusive service.

11. A computer system, comprising a memory and a processor, wherein the memory is configured to store a computer readable instruction, which when executed by the processor, causes the processor to perform a common service traffic attribution method, comprising: determining, according to an identification rule, a maximum quantity of incoming packets required for a traffic analysis, wherein the identification rule is obtained based on a feature by using a machine learning algorithm, and the identification rule is used to identify different services in traffic; filtering the traffic based on the maximum quantity of incoming packets; obtaining a feature of a packet in the traffic, wherein the feature comprises a ciphertext feature, and the ciphertext feature comprises any one or more of a sequence, a length, or a transmission direction of an encrypted packet; analyzing the traffic based on the feature, to identify an exclusive service and a common service in the traffic, wherein the exclusive service is a service invoked by only one application, and the common service is a service invoked by a plurality of applications; and attributing traffic of a common service whose identification time is between an identification time of an exclusive service A and an identification time of an exclusive service B to an application that invokes the exclusive service A, the exclusive service A is any identified exclusive service, and the exclusive service B is a first exclusive service whose identification time is after the identification time of the exclusive service A.

12. The computer system according to claim 11, wherein the analyzing the traffic based on the feature, to identify an exclusive service and a common service in the traffic comprises: performing matching between the feature and each of a second identification rule and a third identification rule to identify the exclusive service and the common service in the traffic, wherein the second identification rule and the third identification rule are obtained based on the feature by using a machine learning algorithm.

13. A non-transitory computer-readable medium storing computer instructions for common service traffic attribution, that when executed by one or more processors, cause the one or more processors to perform a method, which comprises: determining, according to an identification rule, a maximum quantity of incoming packets required for a traffic analysis, wherein the identification rule is obtained based on a feature by using a machine learning algorithm to identify different services in traffic; and filtering the traffic based on the maximum quantity of incoming packets; obtaining a feature of a packet in the traffic, wherein the feature comprises a ciphertext feature, and the ciphertext feature comprises any one or more of a sequence, a length, or a transmission direction of an encrypted packet; analyzing the traffic based on the feature, to identify a start service, an exclusive service, and a common service in the traffic, wherein the start service is a service invoked in an application startup phase, the exclusive service is a service invoked by only one application, and the common service is a service invoked by a plurality of applications; and attributing traffic of a common service whose identification time is between a first identification time of a start service A and a second identification time of a start service B to an application that invokes an exclusive service whose identification time is between the first identification time and the second identification time, wherein the start service A is any identified start service, and the start service B is a first start service whose identification time is after the first identification time.

14. The medium according to claim 13, wherein the analyzing the traffic comprises: performing matching between the feature and each of a first identification rule, a second identification rule, and a third identification rule to identify the start service, the exclusive service, and the common service in the traffic, wherein the first identification rule, the second identification rule, and the third identification rule are obtained based on the feature by using a machine learning algorithm.

15. The medium according to claim 13, wherein attributing traffic of a common service comprises: determining the application based on the exclusive service and correspondence information, wherein the correspondence information comprises a correspondence between the exclusive service and an application that invokes the exclusive service.

Description

BRIEF DESCRIPTION OF DRAWINGS

(1) To describe the technical solutions provided in this application more clearly, the following briefly describes the accompanying drawings. The accompanying drawings in the following descriptions show merely some embodiments of this application.

(2) FIG. 1 is a hierarchical schematic diagram of traffic;

(3) FIG. 2 shows an example of an HTTP request packet and an HTTP response packet;

(4) FIG. 3 is a schematic diagram of a TLS handshake process;

(5) FIG. 4 is a schematic diagram of a logical structure of a traffic analysis apparatus according to an embodiment of this application;

(6) FIG. 5 is a schematic flowchart of a traffic analysis method according to an embodiment of this application;

(7) FIG. 6 is a schematic principle diagram of a traffic attribution method according to an embodiment of this application;

(8) FIG. 7 is a schematic diagram of a logical structure of a traffic analysis apparatus according to an embodiment of this application;

(9) FIG. 8 is a schematic flowchart of a traffic feature construction method according to an embodiment of this application;

(10) FIG. 9 is a schematic flowchart of learning a service identification rule or an application identification rule according to an embodiment of this application;

(11) FIG. 10 is a schematic flowchart of a traffic filtering method according to an embodiment of this application;

(12) FIG. 11 is a schematic flowchart of identifying three service types according to an embodiment of this application;

(13) FIG. 12 is a schematic flowchart of a common service traffic attribution method according to an embodiment of this application; and

(14) FIG. 13 is a schematic diagram of a logical structure of a computer system according to an embodiment of this application.

DESCRIPTION OF EMBODIMENTS

(15) To help understand the technical solutions proposed in this application, some elements introduced in the descriptions of this application are first described herein. It should be understood that the following descriptions are merely intended to help understand these elements, so as to understand content of the embodiments, but do not necessarily cover all possible cases.

(16) Traffic: Network communication packets are generated when devices connected through a network interact with each other, and these packets are referred to as traffic. "Traffic" is used herein as a general term.

(17) Data stream: A data packet generated in a complete communication process (from establishment of a connection to an end of the connection) between a server and a client is referred to as a data stream of the connection. In an application use process, interaction is usually performed for a plurality of times. Therefore, a plurality of data streams are generated to form application traffic.

(18) For example, the data stream is traffic generated during a session starting from TLS handshake establishment and ending with a Transmission Control Protocol (TCP) FIN (finish) packet. The data stream represents a process of interaction between two subjects, for example, interaction between an application process and the server.
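The notion of a data stream can be illustrated with a minimal sketch that groups packets into streams and closes a stream when a TCP FIN packet is seen. The Packet type, its field names, and the rule of closing a stream on the first FIN in either direction are assumptions made for illustration only; an actual TCP connection is closed by a FIN exchange in both directions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Packet:
    # Hypothetical, simplified packet record (not a format from this application).
    src: str
    sport: int
    dst: str
    dport: int
    fin: bool = False  # True if the TCP FIN flag is set


def stream_key(p: Packet) -> Tuple:
    """Direction-independent key: both directions of a connection map to one stream."""
    a, b = (p.src, p.sport), (p.dst, p.dport)
    return (a, b) if a <= b else (b, a)


def split_into_streams(packets: List[Packet]) -> List[List[Packet]]:
    """Group packets into data streams; a FIN packet ends the current stream."""
    open_streams: Dict[Tuple, List[Packet]] = {}
    finished: List[List[Packet]] = []
    for p in packets:
        key = stream_key(p)
        open_streams.setdefault(key, []).append(p)
        if p.fin:
            # Simplification: the first FIN closes the stream.
            finished.append(open_streams.pop(key))
    # Streams that never saw a FIN are returned as still-open streams.
    finished.extend(open_streams.values())
    return finished
```

A plurality of such streams, collected over one application use process, would together form the application traffic described above.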

(19) Common service: A service that is deployed on a server, exposed through an API (application programming interface), and invoked by a plurality of application programs. A common service publicly provides functions such as map navigation, cloud storage, and video transmission.

(20) Traffic analysis: A network communication packet is obtained through listening, capturing, copying, or the like, and original communication content of the network communication packet is restored through parsing, reassembling, segmentation, or the like, so as to understand instant statuses of two network communication parties.

(21) Plaintext feature: A plaintext feature is a feature including a character and/or a digit that can be directly obtained from a packet through parsing, and is different from a ciphertext feature.

(22) FIG. 1 is a schematic diagram of a hierarchical structure of traffic. In FIG. 1, a mobile application Facebook® is used as an example. Traffic of the application may be divided into three layers in a hierarchical structure. A first layer is a data stream layer, to be specific, traffic generated during a session starting from TLS handshake establishment and ending with a TCP FIN packet. The data stream layer indicates interaction between an application process and a server. A second layer is a service layer, to be specific, a submodule that interacts with the server in the application. All traffic generated when a process corresponding to the service layer interacts with the server is traffic of the service module, such as a cloud storage service or a message service of Facebook®. A third layer is an application layer, to be specific, the application program Facebook®. Facebook® further includes common services, such as a login service, a cloud service, and a message push service. The common services in Facebook® may be invoked by other application programs. It means that traffic belonging to the common services does not necessarily all belong to Facebook®. After new traffic arrives and after a traffic analysis module identifies a common service, the traffic analysis module further needs to attribute, by using a specific method, traffic of the common service to an application program to which the traffic should be attributed. In this way, the traffic of the application program can be accurately calculated.
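As an illustration of such attribution, the following minimal sketch follows the variant recited in the claims in which traffic of a common service identified between an exclusive service A and the next exclusive service B is attributed to the application that invokes the exclusive service A. The names Identified, attribute_common, and service_to_app, the byte counter, and the sample service names are hypothetical and are not taken from this application.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Identified:
    # Hypothetical identification result for one piece of traffic.
    time: float       # identification time
    kind: str         # "exclusive" or "common"
    service: str      # service name
    nbytes: int = 0   # traffic volume (illustrative unit)


def attribute_common(events: List[Identified],
                     service_to_app: Dict[str, str]) -> Dict[str, int]:
    """Attribute common-service bytes to the application of the preceding exclusive service."""
    totals: Dict[str, int] = {}
    current_app: Optional[str] = None
    for ev in sorted(events, key=lambda e: e.time):
        if ev.kind == "exclusive":
            # A new exclusive service starts a new attribution interval.
            current_app = service_to_app[ev.service]
        elif ev.kind == "common" and current_app is not None:
            totals[current_app] = totals.get(current_app, 0) + ev.nbytes
        # Assumption: common traffic seen before any exclusive service stays unattributed.
    return totals
```

In this sketch, the correspondence between an exclusive service and the application that invokes it is a plain dictionary; how that correspondence is obtained is outside the sketch.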

(23) Currently, traffic identification technology mainly focuses on traffic identification at the application layer, and traffic identification at the service layer is basically not performed. However, common service traffic currently accounts for at least 60% of total traffic in the application market, and applications that use a common service module account for at least 95% of all applications. The most prominent service identification problem is identification of Google®-type services. For example, an identification conflict for common service traffic, such as Google® map traffic, occurs for all application programs that use a Google® map service, which seriously affects an operator's services. In actual application, a service cannot be accurately identified by using an application-layer traffic identification technology, and consequently a relatively high false identification rate results.

(24) An existing widely-used traffic analysis solution is a plaintext feature identification method in which traffic is identified by using a plaintext feature of a Hypertext Transfer Protocol (HTTP) packet and a plaintext feature of a TLS handshake message. The HTTP packet includes a request packet and a response packet. FIG. 2 shows an example of an HTTP request packet (a) and an HTTP response packet (b). The HTTP packet includes three parts: a starting row, a message header, and a body. Table 1 shows a possible action of the starting row.

(25) TABLE 1
Action | Meaning
GET | Request to obtain a resource identified by a URI.
POST | Add new data after a resource identified by a URI.
HEAD | Request to obtain a response message header of a resource identified by a URI.
PUT | Request a server to store a resource and use a URI as an identifier of the resource.
DELETE | Request a server to delete a resource identified by a URI.
TRACE | Request a server to return received request information mainly for testing or diagnosis.
CONNECT | Reserved for future use.
OPTIONS | Request to query performance of a server, or query an option and a requirement that are related to a resource.

(26) In traffic analysis, the interaction being performed between the client and the server may be determined through the foregoing actions. For example, interaction content may be determined by using the resource identified by the uniform resource identifier (URI), and a host field in the message header may be used to determine whether the packet belongs to an application. Therefore, in a plaintext feature analysis technology, these character or digit features that can be parsed are usually used directly to infer the statuses of the two network communication parties. After encryption was introduced into network communication protocols, the plaintext feature analysis technology remains usable for only a small part of traffic, namely, unencrypted traffic.
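A minimal sketch of extracting such plaintext features from an unencrypted HTTP request is given below. The function name parse_http_request, the returned dictionary keys, and the sample request are illustrative assumptions; the sketch handles only a well-formed request whose starting row and message header are fully present.

```python
from typing import Dict, Optional


def parse_http_request(raw: bytes) -> Dict[str, Optional[str]]:
    """Extract the action, the URI, and the Host header from an HTTP request."""
    head, _, _body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    # Starting row: "<action> <URI> <version>"
    method, uri, version = lines[0].split(" ", 2)
    headers: Dict[str, str] = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:
            # Header names are case-insensitive, so normalize to lower case.
            headers[name.strip().lower()] = value.strip()
    return {
        "method": method,
        "uri": uri,
        "version": version,
        "host": headers.get("host"),
    }
```

For example, for a request whose starting row is "GET /maps/api/tile?x=1 HTTP/1.1" and whose message header carries "Host: maps.example.com", the sketch returns the action GET, the URI, and the host name, which are the character features that a plaintext rule could match.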

(27) Due to application of the protocol encryption technology, all plaintext feature fields of an original HTTP packet are encrypted into Hypertext Transfer Protocol Secure (HTTPS)-based fields. At least 90% of current network traffic is based on the HTTPS protocol. A structure of the HTTPS protocol is that a TLS protocol layer is encapsulated on the original HTTP packet. A handshake process of the TLS protocol is shown in FIG. 3, and is similar to a three-way handshake process of the TCP protocol. As shown in FIG. 3, a TLS protocol client first sends ClientHello to the server; the server returns ServerHello and a certificate; the client receives the certificate, generates a public key for encryption, and sends the public key and an encryption algorithm to the server; and a handshake process ends after confirmation by the server. Then the two parties start to send an encrypted application data packet. When protocol encryption is performed, a plaintext feature includes a feature of a TLS handshake message, and a ciphertext feature includes a feature of encrypted application data. In the prior art, only a plaintext feature in traffic is used to perform application identification.

(28) TLS handshake messages mainly include 10 basic types (and other extended types). A feature of a TLS handshake message is constructed below mainly based on one or more of the 10 types of packets. The 10 types of packets include (1) to (5), and (7) (equivalent to (9)) that are shown in FIG. 3, and further include HelloRequest, ServerKeyExchange, CertificateRequest, and CertificateVerify that are not shown in FIG. 3. The following briefly describes the 10 types of packets in Table 2. Some of the packets in Table 2 are required by the server or the client, and are not mandatory in all scenarios.

(29) TABLE 2
Packet type | Meaning or function
HelloRequest | Handshake actively initiated by a server. This is not common and is mainly used in the following case: A session has lasted for a long time, and the server reestablishes a new connection to a client to reduce security risks.
ClientHello | Hello message sent by a client to a server, including a session ID.
ServerHello | Hello message sent by a server to a client, including an encryption algorithm and a compression algorithm that are selected by the server.
Certificate | Certificate chain sent by a server to a client.
ServerKeyExchange | Message received by a client from a server, carrying a parameter for establishing symmetric encryption. The parameter is optional and is not required in all key exchange algorithms.
CertificateRequest | A server requests a client to provide a certificate. This is not common in a web server.
ServerHelloDone | Hello done message.
ClientKeyExchange | Responsible for sending the following three pieces of information to a server: a random number, which is encrypted by using a public key of the server to prevent eavesdropping; a code change notification, indicating that subsequent information is sent by using an encryption method and a key that are negotiated by both parties; and a client handshake end notification, indicating that a handshake phase of a client ends. The notification is also a hash value of all previously sent content, and is used for verification by the server.
CertificateVerify | A client needs to verify whether a certificate of a server is issued by a trusted authority, whether a domain name in the certificate is consistent with an actual domain name, or whether the certificate expires. If verification on the certificate succeeds, the client fetches a public key of the server from the certificate of the server.
Finished | When this message is sent, the message is already encrypted, because negotiation has ended, a ChangeCipherSpec message has been sent, and encrypted communication between two parties has been activated.

(30) It should be noted that the ChangeCipherSpec protocol is not a part of a handshake protocol, and sending the ChangeCipherSpec protocol indicates that encryption statuses of the two parties are ready. In subsequent communication, ciphertext encryption communication negotiated by the two parties is used, and details are not described in this application. In addition, the Finished packet herein indicates that a handshake process ends, and is not the foregoing TCP FIN packet. A communication process between the client and the server is actually as follows: A TCP handshake is first established at the TCP layer; then the TLS handshake message shown in FIG. 3 is transmitted by using the TCP protocol; then a service packet is transmitted; and finally current interaction ends by using the TCP FIN packet.
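The distinction between handshake messages and the ChangeCipherSpec protocol can be made concrete with a short sketch that classifies a single TLS record. The numeric codes below are those defined for the TLS 1.2 record layer and handshake protocol; the function name classify_record and the returned strings are illustrative assumptions. Because the Finished message is already encrypted when it is sent, its type byte generally cannot be read from the wire, and the sketch does not attempt to detect that case.

```python
import struct

# TLS record-layer content types (TLS 1.2).
CONTENT_TYPES = {20: "ChangeCipherSpec", 21: "Alert", 22: "Handshake", 23: "ApplicationData"}

# TLS 1.2 handshake message types: the 10 basic types.
HANDSHAKE_TYPES = {
    0: "HelloRequest", 1: "ClientHello", 2: "ServerHello", 11: "Certificate",
    12: "ServerKeyExchange", 13: "CertificateRequest", 14: "ServerHelloDone",
    15: "CertificateVerify", 16: "ClientKeyExchange", 20: "Finished",
}


def classify_record(record: bytes) -> str:
    """Name the content of one TLS record from its unencrypted header bytes."""
    if len(record) < 5:
        raise ValueError("a TLS record header is 5 bytes long")
    # Record header: content type (1 byte), version (2 bytes), length (2 bytes).
    content_type, _version, _length = struct.unpack("!BHH", record[:5])
    name = CONTENT_TYPES.get(content_type, "Unknown")
    if name != "Handshake" or len(record) < 6:
        return name
    # First byte of a plaintext handshake message is its type.
    # Caveat: for an encrypted handshake record this byte is ciphertext.
    return HANDSHAKE_TYPES.get(record[5], "Handshake (encrypted or unknown)")
```

Note that ChangeCipherSpec is reported through its own content type (20) rather than through the handshake type table, which matches the statement that it is not a part of the handshake protocol.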

(31) In an existing solution, one or more of the foregoing TLS handshake messages may be used to construct features, the features are converted into machine-readable rules, for example in XML (extensible markup language), and the rules are stored. After network traffic is parsed, these rules are read for traffic filtering in a corresponding protocol format. A filtering manner may be sequential filtering. A full matching rule starting from the ClientHello packet and ending with the Finished packet is established (that is, all plaintext fields in the packets are input). After filtering is completed, traffic obtained after filtering is sent to a service logic matching module, an application to which the traffic belongs is identified based on an application ID corresponding to the rule, and a matching result is output.
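The existing rule-based solution may be sketched as follows. The XML schema (rule, field, app_id), the field names server_name and cipher_suite, and the sample values are purely hypothetical, because the rule format is not specified herein; only the idea of storing handshake features as XML rules and requiring every field of a rule to match is taken from the foregoing description.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional, Tuple

# Hypothetical rule file: each rule lists plaintext handshake fields and an application ID.
RULES_XML = """
<rules>
  <rule app_id="app-1">
    <field name="server_name">maps.example.com</field>
    <field name="cipher_suite">0xc02f</field>
  </rule>
  <rule app_id="app-2">
    <field name="server_name">chat.example.org</field>
  </rule>
</rules>
"""


def load_rules(xml_text: str) -> List[Tuple[str, Dict[str, str]]]:
    """Read (application ID, required fields) pairs from the XML rule text."""
    rules = []
    for rule in ET.fromstring(xml_text).findall("rule"):
        fields = {f.get("name"): (f.text or "").strip() for f in rule.findall("field")}
        rules.append((rule.get("app_id"), fields))
    return rules


def match_app(features: Dict[str, str],
              rules: List[Tuple[str, Dict[str, str]]]) -> Optional[str]:
    """Sequential filtering: return the ID of the first rule whose fields all match."""
    for app_id, fields in rules:
        if all(features.get(name) == value for name, value in fields.items()):
            return app_id
    return None
```

Because every field of a rule must match, two applications whose handshake features are identical would be indistinguishable to such rules.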

(32) However, applications of a same type are highly similar in terms of some features (such as certificates), and therefore cannot be distinguished when a rule is established by using only the features of the foregoing TLS handshake messages. In addition, traffic of different services in a same application cannot be identified by using only these features. In particular, common traffic generated when different applications use a same service is identified as traffic of a single application, and a large amount of false identification is generated especially when a nested service exists inside the service. These plaintext features cannot be used to subdivide service traffic, and identification cannot be completed when common service traffic is generated. Therefore, after a common service occurs, common traffic of a next application or a previous application is usually counted toward a current application during traffic statistics collection. Consequently, the false identification rate is relatively high.

(33) Herein, applications of a same type are applications that invoke a same or similar service. Because the server issues a same type of certificate to a same type of service, identification cannot be performed by using only the TLS handshake messages. The applications of the same type may be applications comprising a same service, for example, two map applications of a same company or different companies; or may be applications that are of different types of a same company and that invoke a same service.

(34) FIG. 4 is a schematic diagram of a logical structure of a traffic analysis apparatus 400 according to an embodiment. The apparatus includes a feature learning module 410, a service identification module 420, and a traffic attribution module 430.

(35) Further, the traffic analysis apparatus may be connected to a traffic parsing apparatus 300. The traffic parsing apparatus 300 is configured to: parse received traffic, and then output a result obtained through parsing to the traffic analysis apparatus 400. In a traffic parsing process, range information of a field is extracted (specifically extracted by a parsing module in FIG. 4) step by step according to a protocol format. Specifically, the prior art may be used, and details are not described in this embodiment.

(36) Further, the traffic analysis apparatus 400 may include a traffic filtering module 440, configured to: filter, according to all or some of rules obtained by the feature learning module 410, the result that is output by the traffic parsing apparatus 300; and input, to the service identification module 420, traffic obtained through filtering, so as to reduce an amount of processing by the service identification module 420 and improve processing efficiency. The parsing process may be further implemented in combination with hardware. For example, the parsing process is accelerated in combination with a hardware acceleration apparatus.

(37) A plurality of modules in FIG. 4 may be deployed on a same physical machine, or may be deployed on different physical machines.

(38) The traffic analysis apparatus 400 is used as an example. The following describes a traffic analysis method provided in this application. The traffic analysis method belongs to some or all functions provided by the traffic analysis apparatus 400.

(39) FIG. 5 is a schematic flowchart of a traffic analysis method according to an embodiment.

(40) S501. A feature learning module 410 performs machine learning based on collected historical traffic data or traffic data obtained in another manner, and obtains an application-service rule of each application through machine learning.

(41) In a machine learning process, a feature of a packet needs to be extracted. The feature of the packet herein includes either or both of a plaintext feature and a ciphertext feature of the packet. The plaintext feature includes a feature including a character and/or a digit that can be directly obtained from the packet through parsing. The ciphertext feature includes any one or more of a sequence, a length, and a transmission direction of an encrypted packet.
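A ciphertext feature of this kind may be sketched as a short vector built from the first few encrypted packets of a data stream. The encoding below (a signed length, positive for the client-to-server direction and negative for the opposite direction, in arrival order) and the names ciphertext_feature and max_records are assumptions chosen for illustration; this application does not prescribe a particular encoding.

```python
from typing import List, Tuple


def ciphertext_feature(records: List[Tuple[str, int]], max_records: int = 8) -> List[int]:
    """Encode sequence, length, and direction of encrypted packets as signed lengths.

    records: (direction, length) pairs in arrival order, where direction is
    "up" (client to server) or "down" (server to client).
    """
    feature = []
    for direction, length in records[:max_records]:
        sign = 1 if direction == "up" else -1
        # Position in the list carries the sequence; the sign carries the direction.
        feature.append(sign * length)
    return feature
```

Such a vector could be fed to a machine learning algorithm together with, or instead of, the plaintext features.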

(42) An application-service rule of an application includes identification rules of three services invoked by the application. The three services include a start service, an application exclusive service, and a common service. The application-service rule is used to perform service identification. In addition, because the three rules are associated with a specific application, an application to which identified traffic belongs may be learned according to the rules. Start services and common services of two or more different applications may be partially or completely the same, so that identification rules obtained through learning may be partially repeated.

(43) The machine learning process may be performed offline, in other words, not in real time; or may be performed in real time. Some traffic data may be periodically obtained when the machine learning process is performed in real time, and an application-service rule is generated or updated through machine learning.

(44) In some other embodiments, a manager may manage, by using a management configuration module (not shown in the figure), the rules obtained by the feature learning module 410. For example, the manager may add, delete, modify, or view these rules.

(45) S502. After traffic arrives, a traffic parsing apparatus 300 reads a packet in the traffic from a storage (for example, a memory), parses the packet according to a protocol format of the packet, and transmits, to a traffic filtering module 440, a packet (or referred to as traffic) obtained through parsing.

(46) A protocol above a transport layer, namely, a TCP/IP layer, is used in a parsing process, for example, the TLS protocol. A TLS protocol-based packet may be divided into a TLS handshake part and a TLS record part according to a format. In this embodiment, the handshake part mainly includes 10 types of data packets, including ClientHello, ServerHello, Certificate, and the like. As mentioned above, not all the 10 types of data packets are used.

(47) S503. The traffic filtering module 440 receives the traffic from the traffic parsing apparatus 300, obtains the application-service rule from the feature learning module 410, filters a received packet according to the application-service rule, and sends, to a service identification module 420, a packet obtained through filtering.

(48) In one embodiment, the feature learning module 410 stores the application-service rule in the memory by using a file or in another form. After reading the application-service rule from the memory, the traffic filtering module 440 filters the traffic according to the application-service rule.

(49) The traffic filtering module 440 is mainly configured to preprocess the traffic before service identification, such as filtering or offloading, so as to reduce system overheads and improve processing efficiency of the service identification module 420. The traffic filtering module 440 can support performing parsing based on different fields in different packets such as HTTP and TLS packets, and can also support a custom regular filter mode.

(50) In some other embodiments, the traffic filtering module 440 may not be required.

(51) S504. The service identification module 420 receives, from the traffic filtering module 440, the traffic obtained through filtering, obtains the application-service rule from the feature learning module 410, performs, according to the application-service rule, service identification on the traffic obtained through filtering, and obtains an identification result. The identification result includes a “location” of each service and a type of a service to which the traffic belongs: a start service, an application exclusive service, or a common service. Finally, the identification result is sent to a traffic attribution module 430.

(52) The “location” of the service herein does not mean a geographical location. Location information of a service can be understood as a mark or an indication, and is used to indicate a sequence of a time for identifying the service relative to another service. For example, the location information of the service may be a time point at which the service is identified, or a digit that may reflect a sequence.

(53) For example, if it is determined that a feature of a data stream S1 matches a feature of a start service of an application, traffic of the data stream S1 belongs to the start service, and then a correspondence between the data stream S1, a start service, and a service location is recorded in the memory.

(54) S505. The traffic attribution module 430 receives the identification result sent by the service identification module 420, and determines, based on a start service and an exclusive service (or based only on the exclusive service), an application to which traffic of a common service belongs.

(55) In one embodiment, the service identification module 420 records the identification result in the memory, and the memory may be a cache, or may be another type of memory. Then the traffic attribution module 430 reads the identification result from the memory.

(56) In one embodiment, an application identification time (that is, a location of a start service) does not need to be considered. When an exclusive service is identified, an application (for example, an application ID) corresponding to the exclusive service is recorded in the memory, and traffic of a common service that appears after the time point belongs to the application. When a next exclusive service is subsequently identified, a new application (which may be the same as the previous application because a same application may have two or more exclusive services) is recorded. This method is applicable to a scenario in which there is no traffic between a start service and an exclusive service, and the exclusive service is equivalent to a start service.
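The embodiment above, in which only exclusive services are considered, may be sketched as follows. The event tuples and the names `OS_a`, `PS_1`, and so on are hypothetical; the sketch assumes that events arrive in the order in which services are identified.

```python
from typing import List, Optional, Tuple

# Each event is (kind, service_name, app_id); app_id is None for common services.
Event = Tuple[str, str, Optional[str]]


def attribute_by_exclusive(events: List[Event]) -> List[Tuple[str, Optional[str]]]:
    """Attribute each common service to the most recently recorded application."""
    current_app: Optional[str] = None  # application recorded "in the memory"
    attributed = []
    for kind, name, app_id in events:
        if kind == "exclusive":
            current_app = app_id  # a new exclusive service replaces the record
        elif kind == "common":
            attributed.append((name, current_app))
    return attributed


events = [
    ("exclusive", "OS_a", "A"),
    ("common", "PS_1", None),
    ("exclusive", "OS_b", "B"),
    ("common", "PS_2", None),
]
result = attribute_by_exclusive(events)
print(result)
```

A common service that appears before any exclusive service has been identified is left unattributed (`None`) in this sketch; how such traffic should be handled is not specified above.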

(57) In one embodiment, a start service is first identified, an application identification time is determined, and the identification time is stored in the memory. It should be noted that the “time” herein is not necessarily a time value. When an exclusive service is identified, an application corresponding to the exclusive service is recorded in the memory, and traffic of a common service that appears after the time point belongs to the application. After a next start service is subsequently identified, updating the application recorded in the memory is considered.

(58) In the foregoing two embodiments, to save storage space of the memory, an aging time of stored content, a quantity of stored content entries, or the like may be set during implementation of the method.

(59) The following uses the second of the foregoing two embodiments as an example for description. There is only a slight difference between the first embodiment and the second embodiment. With reference to the second embodiment, a person skilled in the art may learn how to implement the first embodiment.

(60) First, currently received traffic is segmented based on location information of all identified start services. For example, a first segment ranges from a start service SS.sub.a to a start service SS.sub.b, and a second segment ranges from the start service SS.sub.b to a start service SS.sub.c.

(61) Then an application corresponding to a segment is determined based on location information of an exclusive service. For example, if an exclusive service OS.sub.b is in the second segment, and the exclusive service OS.sub.b is exclusive to an application B, it is determined that the second segment corresponds to the application B. It should be understood that segments and applications are not in a one-to-one correspondence. The second segment corresponds to the application B, but it does not mean that traffic of the application B exists only in the second segment. The application B may be started for a plurality of times.

(62) Finally, an application to which the common service belongs is determined based on the location information of the common service and the application corresponding to the segment. For example, if a common service PS.sub.a is in the second segment, and it is learned that the second segment corresponds to the application B, traffic of the common service PS.sub.a belongs to the application B.

(63) S502 to S505 are usually a real-time processing process.

(64) For ease of understanding, FIG. 6 is a schematic diagram illustrating a process of attributing common service traffic. In the figure, an arrow is used to represent a data stream, and also represent a service. A service location is a start location of the arrow. Blocks on the arrow represent an uplink packet and a downlink packet, and a plurality of blocks are combined to form different packet features. As shown in FIG. 6, it is assumed that three start services SS.sub.a, SS.sub.b, and SS.sub.c, two exclusive services OS.sub.a and OS.sub.b, and two common services PS.sub.a and PS.sub.b have been identified in step S504.

(65) The exclusive service OS.sub.b exists after the start service SS.sub.b and before a next start service SS.sub.c, and it is learned that OS.sub.b is exclusive to the application B. Therefore, it may be determined that the start service SS.sub.b is a start service of the application B. Further, it may be determined that a start time of the application B is approximately a time indicated by a location of the start service SS.sub.b. Likewise, the exclusive service OS.sub.a is exclusive to an application A. Therefore, it may be determined that the start service SS.sub.a is a start service of the application A.

(66) The common service PS.sub.a is in the second segment, and appears after the application B is started. Therefore, traffic of the common service PS.sub.a should belong to the application B. By contrast, although arrival time points of most data streams of the other common service PS.sub.b coincide with the second segment, it is learned from the figure that the initial location (the location at which the common service is identified) of PS.sub.b is in the first segment, at which point the application B has not yet been started. Therefore, the traffic of PS.sub.b belongs to the application A instead of the application B.
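The segment-based attribution in the FIG. 6 example may be sketched as follows. The integer locations are assumed values chosen only to reproduce the ordering described above (SS_a, OS_a, PS_b, SS_b, OS_b, PS_a, SS_c); the function names are hypothetical. If a segment contained exclusive services of several applications, this sketch would keep the first one, which is a simplifying assumption.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple


def segment_index(start_locations: List[int], location: int) -> int:
    """Index of the segment containing `location`; -1 if before the first start service.

    `start_locations` must be sorted in ascending order.
    """
    return bisect_right(start_locations, location) - 1


def attribute_common_services(
    start_locations: List[int],
    exclusives: List[Tuple[int, str]],  # (location, app_id)
    commons: List[Tuple[str, int]],     # (service_name, location)
) -> Dict[str, Optional[str]]:
    # Step 1: map each segment to the application of an exclusive service inside it.
    segment_app: Dict[int, str] = {}
    for location, app_id in exclusives:
        idx = segment_index(start_locations, location)
        if idx >= 0:
            segment_app.setdefault(idx, app_id)
    # Step 2: attribute each common service to the application of its segment.
    return {
        name: segment_app.get(segment_index(start_locations, location))
        for name, location in commons
    }


starts = [0, 3, 6]                    # assumed locations of SS_a, SS_b, SS_c
exclusives = [(1, "A"), (4, "B")]     # OS_a in the first segment, OS_b in the second
commons = [("PS_a", 5), ("PS_b", 2)]  # PS_b is identified in the first segment
attribution = attribute_common_services(starts, exclusives, commons)
print(attribution)
```

Only the location at which a common service is identified matters here, so PS_b is attributed to the application A even if most of its packets arrive later.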

(67) It should be noted that a time at which a service is identified (that is, a time indicated by a location of the service) is not an exact time at which the application is started or the service is started. However, a sequence in which services are identified is usually consistent with a sequence in which the services run.

(68) The solutions are collectively described above. The following uses a Google® application (for example, Google Map) as an example to describe a service identification method and a service traffic attribution method in detail, and the foregoing steps are specifically implemented. In a current technology, accuracy of identifying traffic of the Google® application is relatively low, and attribution of common service traffic cannot be correctly determined, thereby affecting a normal traffic identification service of an operator. Therefore, in this application, the Google® application is used as an example to describe a traffic analysis method.

(69) An objective of the method to be described below is to determine attribution of traffic of a Google common service, so as to improve traffic identification accuracy of the Google® application.

(70) A general process of the method is similar to that in FIG. 5, and includes the following: First, an application-service rule is obtained by using a technology of constructing a feature of encrypted traffic and a feature learning technology. The application-service rule specifically includes three types of rules: a first identification rule used to identify a start service, a second identification rule used to identify an exclusive service, and a third identification rule used to identify a common service (for a specific rule learning process, refer to the following descriptions). Then an application-service rule filtering technology is used to reduce to-be-matched traffic, dynamically set a quantity of incoming packets, and the like, so as to reduce system performance overheads. Then the three types of services are identified by using the application-service rule, and an application to which a common service belongs is determined based on locations of the different types of services.

(71) FIG. 7 is a schematic diagram of a logical structure of a traffic analysis apparatus 700 according to an embodiment. The traffic analysis apparatus 700 receives, from a traffic parsing apparatus 800, traffic obtained through parsing, and analyzes the traffic. Specifically, the traffic analysis apparatus 700 includes a feature learning module 710, a service identification module 720, a traffic attribution module 730, and a traffic filtering module 740. The following describes the apparatus with reference to a detailed method.

(72) FIG. 8 shows a method for determining a feature vector. The method is performed by a constructor 711 of the feature learning module 710. First, the constructor 711 constructs a feature matrix (S801), and each column is a feature.

(73) The feature matrix may be constructed by using one or more of the following three methods. Method 1: The feature matrix is constructed based on a plaintext of a packet. For example, an SNI (server name indication) field in a ClientHello packet is used as a column of features. Method 2: The feature matrix is constructed based on a ciphertext feature of a protocol, for example, a length of a first data packet of uplink application data and/or a length of a downlink data packet, and ciphertext content does not need to be obtained. Method 3: The feature matrix is constructed by combining a plaintext and a ciphertext. The feature matrix may be manually constructed for the first time. In a subsequent step, the feature matrix may be adjusted based on a learned feature value range.

(74) After the feature matrix is obtained, the feature vector is generated (S802). Specifically, a feature of each data stream in application traffic is checked. If the data stream includes the feature in a corresponding feature column, the data stream is marked as 1; or if the feature does not appear, the data stream is marked as 0. In this way, a feature matrix of all data streams can be finally obtained, and each row of the matrix represents a feature vector of a data stream. For example, if application traffic of Google Map includes 20 data streams and there are 30 constructed feature columns, a 20×30 feature matrix including 0 and 1 is output.
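The marking step described above may be sketched as follows. Each feature column is modeled as a predicate over a data stream; the stream dictionaries, their keys, and the two example columns are assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple

Stream = Dict[str, Any]
Column = Tuple[str, Callable[[Stream], bool]]  # (column name, predicate)


def build_feature_matrix(streams: List[Stream], columns: List[Column]) -> List[List[int]]:
    """Return one 0/1 feature vector (row) per data stream."""
    return [
        [1 if predicate(stream) else 0 for _, predicate in columns]
        for stream in streams
    ]


columns: List[Column] = [
    ("sni == clients4.google.com",
     lambda s: s.get("sni") == "clients4.google.com"),
    ("first uplink length in [848, 849]",
     lambda s: 848 <= s.get("c_app_d_1", -1) <= 849),
]
streams = [
    {"sni": "clients4.google.com", "c_app_d_1": 848},
    {"sni": "www.googleapis.com", "c_app_d_1": 53},
]
matrix = build_feature_matrix(streams, columns)
print(matrix)
```

With 20 data streams and 30 feature columns, the same function would return the 20×30 matrix mentioned above.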

(75) FIG. 9 shows a method for obtaining an application-service rule based on a feature vector by using a machine learning algorithm. The method is performed by a learner 712 of the feature learning module 710. The learner 712 obtains the feature vector from the constructor 711, searches, based on the machine learning algorithm, for the feature vector that can be used to distinguish between services, searches for a feature column and a feature value that correspond to the feature vector of the service, and converts a search result into a rule (or referred to as a service identification rule) used to identify the service (S901). Specifically, three types of identification rules are found: the first identification rule, the second identification rule, and the third identification rule, and the three types of identification rules respectively correspond to a start service identification rule, an exclusive service identification rule, and a common service identification rule mentioned in the foregoing embodiment.

(76) When the learner 712 finds a feature vector that can be used to distinguish between services (S902), the learner 712 outputs an identification rule corresponding to the feature vector, and combines the service identification rules learned for a same type of application into the application-service rule of the application (S903). When the learner 712 does not find such a feature vector (S902), the learner 712 sends, to the constructor 711, a request for reconstructing the feature matrix (S904). Referring to FIG. 8, after the constructor 711 determines that the request is received (S803), the constructor 711 reconstructs the feature matrix by using a predetermined method (S804). For example, ciphertext features (such as numeric features) are segmented in equal lengths, then the feature matrix is reconstructed based on a segmentation result, and the feature vector is re-output. The steps shown in FIG. 8 and FIG. 9 are iterated until the application-service rule is output.
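One possible reading of the equal-length segmentation of a numeric ciphertext feature is sketched below: a packet length is mapped to one of several equal-width bins, and each bin becomes a 0/1 feature column. The range 0 to 1300 and the bin count of 13 are assumptions chosen for the example; this application does not fix these values.

```python
from typing import List, Optional


def bin_index(value: float, low: float, high: float, n_bins: int) -> Optional[int]:
    """Index of the equal-width bin that contains `value`, or None if out of range."""
    if not low <= value <= high:
        return None
    width = (high - low) / n_bins
    # The upper bound itself is placed in the last bin.
    return min(int((value - low) / width), n_bins - 1)


def one_hot_length(value: float, low: float, high: float, n_bins: int) -> List[int]:
    """Turn a numeric length into n_bins feature columns, at most one of which is 1."""
    idx = bin_index(value, low, high, n_bins)
    return [1 if i == idx else 0 for i in range(n_bins)]


row = one_hot_length(254, 0, 1300, 13)  # bins of width 100
print(row)
```

If the learner still cannot separate the services, the bin count could be changed and the matrix rebuilt, which corresponds to another iteration of the loop described above.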

(77) In this embodiment, a machine learning algorithm such as a decision tree algorithm, an artificial neural network algorithm, a support vector machine algorithm, a clustering algorithm, a Bayes classification algorithm, a Markov chain algorithm, or a probabilistic graphical model algorithm may be used.

(78) The rule includes three types: a first identification rule, a second identification rule, and a third identification rule. As shown in Table 3 to Table 5 below, a rule includes one or more fields.

(79) It should be noted that the “field” in Table 3 to Table 5 indicates a field in the rule and is customized. “Location” is a field in an actual data packet. The field is usually agreed on by an Internet Protocol team, and is visible in a Request For Comments (RFC) document of a corresponding protocol and is a consensus in the art. A value may be obtained by using the field, to match a preset value of the field in the rule.

(80) TABLE 3: First identification rule

| Field | Location | Description | Example |
| --- | --- | --- | --- |
| SNI | TLS handshake | The field is a server name. | “clients4.google.com” |
| TLS record length | TLS record length | Packet length feature | For example, a first packet record length 254 may be determined as a start of Google Map. |

(81) An example of the first identification rule is as follows:

(82) SNI=www.googleapis.com && TLS record=512

(83) When the rule is used, a value is obtained from the SNI field in the TLS handshake of a received data packet, and a value is obtained from the TLS record length field, and the two values are matched against the identification rule. That is, it is determined whether the two obtained values are respectively www.googleapis.com and 512. If yes, the matching succeeds; if no, the matching fails. The other rules described below are used in a similar manner, and details are not described again.
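The matching of the example first identification rule may be sketched as follows. The dictionary keys `sni` and `tls_record_length` are hypothetical names for the values obtained from the packet; only the values www.googleapis.com and 512 come from the example rule.

```python
from typing import Dict, Union

Value = Union[str, int]

# Example first identification rule: SNI=www.googleapis.com && TLS record=512
FIRST_RULE: Dict[str, Value] = {
    "sni": "www.googleapis.com",
    "tls_record_length": 512,
}


def matches_rule(fields: Dict[str, Value], rule: Dict[str, Value]) -> bool:
    """The matching succeeds only if every field in the rule has the expected value."""
    return all(fields.get(name) == expected for name, expected in rule.items())


print(matches_rule({"sni": "www.googleapis.com", "tls_record_length": 512}, FIRST_RULE))
print(matches_rule({"sni": "www.googleapis.com", "tls_record_length": 254}, FIRST_RULE))
```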

(84) TABLE 4: Second identification rule

| Field | Location | Description | Example |
| --- | --- | --- | --- |
| SNI | TLS handshake | The field is a server name. | “clients4.google.com” |
| CertCommonName | Certificate | Certificate alias | “blackberry.com” |
| UserAgent | HTTP head | Browser and system name | “com.google.android.youtube” (single-packet identification) |
| UDP-UserAgent | HTTP head | Browser and system name | “com.google.android.youtube” (single-packet identification) |
| Client application data (cAppD) | TLS record length (Considering packet fragmentation and performance, the field may be replaced with TCP.length.) | Data sent by a client to a server side | 0-1300 (sequential matching in a same direction, and supporting TCP and TLS packets) |
| Server application data (sAppD) | TLS record length (Considering packet fragmentation and performance, the field may be replaced with TCP.length.) | Data sent by a server side to a client | 0-1300 (a maximum of four packets matched in this direction, sequential matching in a same direction, and TCP and TLS packets) |
| Other | TLS handshake | Another possible handshake feature | Existing TLS handshake identification (fingerprint) rule |

(85) An example of the second identification rule is as follows:

(86) iOS® system: SNI=clients4.google.com && sAppD[1]==62 && sAppD[2]==42 && sAppD[3]==38 && sAppD[4]>=242 && sAppD[4]<=243 && cAppD[1]==53 && cAppD[2]==50 && cAppD[3]>=301 && cAppD[3]<=308; and

(87) Android® system: SNI=clients4.google.com && sAppD[1]==376 && nCAppD>=1 && cAppD[1]>=848 && cAppD[1]<=849, where

(88) sAppD[x] indicates a length of an x.sup.th application data packet sent by the server side to the client side, and cAppD[x] indicates a length of an x.sup.th application data packet sent by the client side to the server side.
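The iOS® example of the second identification rule may be evaluated as sketched below. The rule is encoded as inclusive length ranges keyed by the 1-based packet index x; this encoding and the function name are assumptions, while the numeric values are taken from the example rule.

```python
from typing import Dict, List, Tuple

Ranges = Dict[int, Tuple[int, int]]  # 1-based packet index -> (min, max), inclusive

IOS_RULE = {
    "sni": "clients4.google.com",
    "sAppD": {1: (62, 62), 2: (42, 42), 3: (38, 38), 4: (242, 243)},
    "cAppD": {1: (53, 53), 2: (50, 50), 3: (301, 308)},
}


def match_second_rule(sni: str, s_app_d: List[int], c_app_d: List[int], rule: dict) -> bool:
    """Check the SNI, then the length of each indexed packet in each direction."""
    if sni != rule["sni"]:
        return False
    for name, lengths in (("sAppD", s_app_d), ("cAppD", c_app_d)):
        ranges: Ranges = rule[name]
        for index, (low, high) in ranges.items():
            if index > len(lengths):  # the packet has not been observed
                return False
            if not low <= lengths[index - 1] <= high:
                return False
    return True


print(match_second_rule("clients4.google.com", [62, 42, 38, 242], [53, 50, 305], IOS_RULE))
```

Because only packet lengths and their order per direction are compared, the encrypted application data never needs to be decrypted.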

(89) TABLE 5: Third identification rule

| Field | Location | Description | Example |
| --- | --- | --- | --- |
| SNI | TLS handshake | The field is a server name. | “clients4.google.com” (single-packet identification) |
| CertCommonName | Certificate | Certificate alias | “blackberry.com” (single-packet identification) |
| Other | TLS handshake | Another possible handshake feature | Existing TLS handshake identification (fingerprint) rule |

(90) An example of the third identification rule is as follows:

(91) #SNI_googleadservices.com

(92) #SNI_www.googleapis.com

(93) #CertCommonName_google-analytics.com
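The third identification rule entries above may be read as field and value pairs, as sketched below. Interpreting `#Field_value` as "field equals value", and treating a match on any one entry as identifying a common service, are assumptions about the notation and are not stated in this application.

```python
from typing import Dict, List, Tuple


def parse_entry(entry: str) -> Tuple[str, str]:
    """Split an entry such as '#SNI_www.googleapis.com' into (field, value)."""
    field, _, value = entry.lstrip("#").partition("_")
    return field, value


THIRD_RULE: List[Tuple[str, str]] = [
    parse_entry(entry)
    for entry in (
        "#SNI_googleadservices.com",
        "#SNI_www.googleapis.com",
        "#CertCommonName_google-analytics.com",
    )
]


def is_common_service(fields: Dict[str, str], rule: List[Tuple[str, str]]) -> bool:
    """Single-packet identification: one matching field is sufficient."""
    return any(fields.get(field) == value for field, value in rule)


print(THIRD_RULE)
print(is_common_service({"SNI": "www.googleapis.com"}, THIRD_RULE))
```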

(94) The foregoing is a process of obtaining a service identification rule, and the process is performed offline. The following describes a real-time traffic analysis process. In the real-time traffic analysis process, the following processes such as a traffic obtaining process, a traffic filtering process, a service identification process, and a process of attributing common service traffic are sequentially performed in real time.

(95) FIG. 10 shows a traffic filtering method. The method is optional, but can be used to reduce to-be-matched traffic and improve processing efficiency. The method is performed by a domain filtering module 741 in a traffic filtering module 740. Input of the module 741 has two parts. One part is a packet (that is, to-be-filtered traffic) obtained by parsing network traffic by a traffic parsing apparatus 800, and the other part is an application-service rule that is output by a learner 712. Output of the module 741 is traffic obtained through filtering.

(96) In one embodiment, after the application-service rule and the to-be-filtered traffic are received, a maximum quantity of incoming packets required when the rule is used to identify a service is determined according to the application-service rule (S1001). In addition, an ASN domain of Google is calculated based on IP information of the to-be-filtered traffic (S1001). The traffic is filtered based on the determining result and the maximum quantity of incoming packets (S1002), and the traffic obtained through filtering belongs to the ASN domain of Google and meets a requirement for the maximum quantity of incoming packets.

(97) The maximum quantity of incoming packets herein is the maximum quantity of packets that the traffic analysis apparatus 700 reads from a data stream. For example, if the maximum quantity of incoming packets is 5, at most five packets are read from the data stream; after five packets have been read, no further packet is read. In other words, when the traffic is filtered, all data packets other than the first five packets are filtered out.
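The filtering based on the ASN domain and the maximum quantity of incoming packets may be sketched as follows. The `packets_needed` rule attribute, the packet dictionary keys, and the address prefix (a documentation range, not a real Google prefix) are all assumptions for illustration; how the ASN domain is actually derived from IP information is not detailed above.

```python
import ipaddress
from collections import defaultdict
from typing import Dict, List


def filter_traffic(packets: List[Dict], rules: List[Dict], asn_prefixes: List[str]) -> List[Dict]:
    """Keep packets inside the given prefixes, up to the per-stream packet limit."""
    # The limit is the largest packet count that any rule needs for identification.
    limit = max(rule["packets_needed"] for rule in rules)
    networks = [ipaddress.ip_network(prefix) for prefix in asn_prefixes]
    read_count: Dict[str, int] = defaultdict(int)
    kept = []
    for packet in packets:
        address = ipaddress.ip_address(packet["server_ip"])
        if not any(address in network for network in networks):
            continue  # outside the ASN domain of interest
        if read_count[packet["stream_id"]] >= limit:
            continue  # enough packets of this stream have been read
        read_count[packet["stream_id"]] += 1
        kept.append(packet)
    return kept


rules = [{"packets_needed": 3}, {"packets_needed": 5}]
packets = (
    [{"stream_id": "s1", "server_ip": "203.0.113.7"} for _ in range(7)]
    + [{"stream_id": "s2", "server_ip": "198.51.100.9"} for _ in range(2)]
)
kept = filter_traffic(packets, rules, ["203.0.113.0/24"])
print(len(kept))
```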

(98) FIG. 11 shows a method for performing service identification on traffic obtained through filtering. The method is performed by a service identification module 720. Input is a result of filtering current network traffic by a domain filtering module 741 and an application-service rule that is output by a feature learning module 710; and output is a service classification identification result. First, a single-user identification module 721 distinguishes between application traffic of a single user based on an IP, a session ID, a device ID, a user ID, or other identity identification information in the traffic obtained through filtering, and inputs the application traffic of the single user to a service classification module 722 (S1101). The service classification module 722 identifies a start service, an exclusive service, and a common service of each application in the traffic of the single user according to the application-service rule (S1102), and sends an identification result to a traffic attribution module 730. In the identification process, a packet feature in the traffic of the single user may be extracted for performing matching with an application-service rule one by one. If the matching succeeds, a matching process ends, and a service type and an application that correspond to a rule with which the matching succeeds are output.
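Steps S1101 and S1102 may be sketched as follows: the traffic is first grouped by user, and each data stream is then matched against the rules one by one until the first successful match. The stream dictionaries, the `user_id` key, and the rule layout (a `match` predicate plus a service type and an application) are assumptions for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple

Stream = Dict[str, str]


def split_by_user(streams: List[Stream]) -> Dict[str, List[Stream]]:
    """S1101: separate the traffic of each single user."""
    by_user: Dict[str, List[Stream]] = defaultdict(list)
    for stream in streams:
        by_user[stream["user_id"]].append(stream)
    return dict(by_user)


def classify(stream: Stream, rules: List[Dict]) -> Optional[Tuple[str, str]]:
    """S1102: return (service type, application) of the first rule that matches."""
    for rule in rules:
        match: Callable[[Stream], bool] = rule["match"]
        if match(stream):
            return rule["service_type"], rule["app"]
    return None


rules = [
    {"service_type": "exclusive", "app": "B",
     "match": lambda s: s.get("sni") == "clients4.google.com"},
    {"service_type": "common", "app": "any",
     "match": lambda s: s.get("sni") == "www.googleapis.com"},
]
streams = [
    {"user_id": "u1", "sni": "clients4.google.com"},
    {"user_id": "u2", "sni": "www.googleapis.com"},
    {"user_id": "u1", "sni": "example.com"},
]
per_user = split_by_user(streams)
labels = [classify(stream, rules) for stream in per_user["u1"]]
print(labels)
```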

(99) It should be noted that, in some other embodiments, the single-user identification module 721 and an execution process of the single-user identification module 721 are not necessary. For example, traffic originally comes from one user, or traffic comes from a plurality of users, but a requirement for a solution does not include distinguishing between traffic of different users.

(100) FIG. 12 shows a method for attributing traffic to an application. The method is performed by a traffic attribution module 730. Input of the module is a service identification result for a single user, and output is an application to which traffic of a common service belongs.

(101) In one embodiment, a location of a start service is obtained (S1201), and traffic of a single user is segmented by using the location (S1202). An exclusive service in the segment (namely, a current segment) is obtained, and an application to which traffic in the segment belongs is obtained (S1203). The application is an application that invokes the exclusive service. Then a cache table is established, and information recorded in the cache table includes an application ID, a user ID, and a location of a start service that correspond to the segment (S1204).

(102) To save storage space, only application IDs, user IDs, and location information of start services that correspond to a previous segment and the current segment are stored in the cache table.

(103) It should be understood that the cache table is a table stored in a cache in a form of a table. In some other embodiments, the information may also be stored in another storage space in another form.

(104) If a previous module identifies a common service, a location of the identified common service is obtained (S1205). It is determined, based on the location of the common service, whether the common service belongs to the current segment (S1206); and if the common service belongs to the current segment, an application to which the common service belongs is output (S1207); or if the common service does not belong to the current segment, the cache table is queried for application information of a corresponding location by using the location information of the user (S1208), and the application to which the common service belongs is output. Alternatively, the cache table is directly queried for application information of a corresponding location based on the location of the common service, and an application to which the common service belongs is output.
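The cache table and the lookup in S1205 to S1208 may be sketched as follows. The entry layout and function names are hypothetical; the sketch keeps only the previous and the current segment, and treats a location as a number that increases in identification order.

```python
from typing import Any, Dict, List, Optional

Entry = Dict[str, Any]


def record_segment(cache: List[Entry], app_id: str, user_id: str, start_location: int) -> None:
    """S1204: store the segment, keeping only the previous and the current segment."""
    cache.append({"app_id": app_id, "user_id": user_id, "start_location": start_location})
    del cache[:-2]


def lookup_application(cache: List[Entry], common_location: int) -> Optional[str]:
    """S1206 to S1208: check the current segment first, then the previous one."""
    for entry in reversed(cache):
        if common_location >= entry["start_location"]:
            return entry["app_id"]
    return None  # the location is older than anything still cached


cache: List[Entry] = []
record_segment(cache, "A", "u1", 0)  # first segment starts at location 0
record_segment(cache, "B", "u1", 3)  # second (current) segment starts at location 3
print(lookup_application(cache, 5))  # in the current segment
print(lookup_application(cache, 2))  # falls back to the previous segment
```

Limiting the table to two entries trades storage for coverage: a common service whose location precedes the previous segment can no longer be attributed in this sketch.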

(105) It should be noted that an ID of an entry in this embodiment is information used to identify the entry, and may be a digit, a text, code, or information of another type. In this embodiment, a location of a service is a time at which the service is identified. Refer to a start location of an arrow that indicates a service in FIG. 6.

(106) Any method provided in the foregoing embodiments may be implemented on one or more physical computers. The apparatus proposed in the foregoing embodiments may be deployed on one or more physical computers. Unit module division inside the apparatus is merely shown as an example, and all unit modules may be deployed on a same physical computer, or may be deployed on different physical computers.

(107) FIG. 13 is a schematic diagram of a logical structure of a computer system according to an embodiment. The computer system may be any type of computer system, such as a network device (for example, a DPI device), a server, a mobile terminal, a personal computer, or an in-vehicle computer. The computer system 1300 includes components such as a processor 1310, a memory 1320, and a network interface 1330 (which is also referred to as a network interface card, a network adapter, or the like). The computer system and another device may be interconnected to implement more functions, for example, traffic charging.

(108) The processor 1310 may be a single-core processor or a multi-core processor. When the processor 1310 is the multi-core processor, the method provided in this application may run on one core, or may run on different cores in a distributed manner. There may be one or more processors 1310, and the plurality of processors may be of a same type or different types. The processor types include a central processing unit (CPU), a graphics processing unit, a microprocessor, a coprocessor, and the like.

(109) The network interface 1330 is configured to connect to another network device, and the connection includes a wireless connection and a wired connection. In this embodiment, the network interface 1330 may be configured to obtain traffic from a network to perform traffic parsing or traffic analysis.

(110) The memory 1320 includes a volatile memory and a nonvolatile memory. Usually, the nonvolatile memory stores a computer readable instruction of a traffic analysis apparatus 1322 and/or a traffic parsing apparatus 1321 provided in this application, and may further store a computer readable instruction of another program module 1323 (for example, an operating system). After these computer readable instructions are read and run by the processor 1310, any one or more methods provided in the foregoing embodiments of this application may be implemented. For specific implementation of the traffic analysis apparatus 1322 and the traffic parsing apparatus 1321, refer to the foregoing embodiments. In another embodiment, the traffic analysis apparatus 1322 and the traffic parsing apparatus 1321 may be separately deployed on different physical computers.

(111) The foregoing components are connected by using a bus 1340. There may be one or more buses 1340. The bus 1340 includes an advanced microcontroller bus architecture (AMBA) bus, an industry standard architecture (ISA) bus, a micro channel architecture (MCA) bus, an extended ISA (extended-ISA) bus, a Video Electronics Standards Association (VESA) local bus, a peripheral component interconnect (PCI) bus, and the like.

(112) The traffic analysis method provided in this application is different from a prior-art TLS handshake solution used only for application identification, and this application provides more fine-grained service identification. A ciphertext feature of a packet is used in a service identification process, thereby improving service identification accuracy. Correspondingly, in a rule learning process, ciphertext feature learning is added. Under impact of a ciphertext feature (for example, a length, a sequence, or a transmission direction of an application data packet) on service identification, a feature matrix is constructed, a feature vector is learned, and finally an application-service rule is generated, so that an identification granularity is increased, thereby resolving a problem that some TLS handshake features are insufficient to distinguish between and identify traffic. Further, according to the traffic analysis method provided in this application, a feature of an encrypted HTTP session part is combined with a TLS handshake plaintext feature, and the feature vector is learned by using an adaptive binning method that combines a numeric feature and a symbol feature, so as to identify application or service traffic, and improve identification accuracy and precision.

(113) According to the common service traffic attribution method provided in this application, an attribution problem is resolved through collaboration of three services; a traffic segment is located by using a start service; an application label is obtained by using an exclusive service; and common service traffic is attributed by using segment information, thereby resolving a problem that common service traffic cannot be attributed to an application.

(114) This application further provides a filtering method that is based on a maximum quantity of incoming packets and an ASN domain of traffic, so as to reduce traffic that needs to be analyzed. In addition, in a rule generation process, efficiency is considered, redundant rules are combined, and a quantity of determining times is reduced. Therefore, a problem that rule complexity is excessively high and performance seriously deteriorates is resolved. In a TLS handshake rule, a full procedure field of a certificate needs to be parsed, and a large amount of memory is consumed. A single field cannot be accurately matched, and consequently identification overheads are increased. A parsed field needs to be optimized, and rule complexity needs to be reduced. An effect of the filtering method provided in this application lies in that a filtering policy is adaptively adjusted based on a parameter provided by an identification rule; impact imposed by a redundant rule on performance is reduced; a filtering module is designed; a quantity of reading times and performance overheads are reduced; a disadvantage of a full-field feature establishment rule in a current technical solution is overcome; and a high-speed real-time traffic identification environment is adapted.

(115) In a high-speed environment of a backbone core network, the quantity of packets available for traffic identification is greatly limited. Therefore, no full-traffic feature is applied in the description of this application. However, if hardware technologies progress or a special deployment environment can support such a feature learning manner, this application can be naturally extended to that traffic identification environment. The core identification step remains similar to that in the foregoing embodiments of this application, and any difference can be readily worked out by a person skilled in the art. In addition, random packaging of the TLS protocol, of the lower-level TCP protocol, or of a manually constructed proprietary protocol may partially change a feature value during identification, and such a solution still falls within the protection scope of this application.

(116) The technical solutions provided in this application may be applied to a policy and charging control scenario of an operator, and may be further applied to a video key quality indicator (KQI) scenario, for example, a content delivery network (CDN) traffic distinguishing scenario. In this scenario, common traffic is generated for a reason similar to that in the foregoing embodiments, and the attribution of common traffic used by different applications in a CDN may be basically identified and distinguished according to the method provided in the foregoing embodiments, so as to accurately meet a video KQI statistics collection requirement. More broadly, the solutions provided in this application are applicable to any scenario in which common traffic generated by a common service needs to be distinguished.

(117) It should be noted that the module or unit division in the foregoing embodiments is only shown as an example, and functions of the described modules are merely described as an example. This application is not limited thereto. A person of ordinary skill in the art may combine functions of two or more modules according to a requirement, or divide functions of one module to obtain more modules with a finer granularity, or there may be other variants.

(118) For identical or similar parts of the embodiments described above, the embodiments may be cross-referenced.

(119) The described apparatus embodiments are merely examples. The modules described as separate parts may or may not be physically separated, and parts shown as modules may or may not be physical modules, may be located in one position, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments. In addition, in the accompanying drawings of the apparatus embodiments provided in this application, connection relationships between modules indicate that the modules have communication connections to each other, and may be specifically implemented as one or more communications buses or signal cables. A person of ordinary skill in the art may understand and implement the embodiments of this application without creative efforts.

(120) The foregoing descriptions are merely specific implementations of this application, but are not intended to limit the protection scope of this application.
