Traffic analysis method, common service traffic attribution method, and corresponding computer system
11425047 · 2022-08-23
CPC classification
H04L63/0428 (ELECTRICITY)
H04L47/32 (ELECTRICITY)
H04L47/2441 (ELECTRICITY)
International classification
G01R31/08 (PHYSICS)
G08C15/00 (PHYSICS)
H04L47/32 (ELECTRICITY)
H04L47/2441 (ELECTRICITY)
Abstract
This application provides a traffic analysis method and apparatus, and a computer system. The method includes: obtaining a plaintext feature and a ciphertext feature of a packet in traffic, where the ciphertext feature includes a length feature of an encrypted field in the packet; and analyzing the traffic based on the plaintext feature and the ciphertext feature, to identify a service or an application to which the traffic belongs. The method may be used for service identification or application identification. The ciphertext feature is introduced in traffic analysis, so that traffic identification accuracy is improved in a packet encryption scenario. In addition, this application further provides a common service traffic attribution method and apparatus, and a computer system.
Claims
1. A common service traffic attribution method, comprising: determining, according to an identification rule, a maximum quantity of incoming packets required for a traffic analysis, wherein the identification rule is obtained based on a feature by using a machine learning algorithm to identify different services in traffic; filtering the traffic based on the maximum quantity of incoming packets; obtaining a feature of a packet in the traffic, wherein the feature comprises a ciphertext feature having one or more of a sequence, a length, or a transmission direction of an encrypted packet; analyzing the traffic based on the feature, to identify a start service, an exclusive service, and a common service in the traffic, wherein the start service is a service invoked in an application startup phase, the exclusive service is a service invoked by only one application, and the common service is a service invoked by a plurality of applications; and attributing traffic of a common service whose identification time is between a first identification time of a start service A and a second identification time of a start service B to an application that invokes an exclusive service whose identification time is between the first identification time and the second identification time, wherein the start service A is any identified start service, and the start service B is a first start service whose identification time is after the first identification time.
2. The method according to claim 1, before the obtaining the feature, further comprising: filtering the traffic based on Internet Protocol (IP) information of the traffic.
3. The method according to claim 1, wherein analyzing the traffic comprises: performing matching between the feature and each of a first identification rule, a second identification rule, and a third identification rule to identify the start service, the exclusive service, and the common service in the traffic, wherein the first identification rule, the second identification rule, and the third identification rule are obtained based on the feature by using a machine learning algorithm.
4. The method according to claim 1, wherein attributing traffic of a common service comprises: determining the application based on the exclusive service and correspondence information, wherein the correspondence information comprises a correspondence between the exclusive service and an application that invokes the exclusive service.
5. The method according to claim 1, wherein the feature further comprises a plaintext feature, and the plaintext feature comprises a feature comprising a character and/or a digit that can be directly obtained from the packet through parsing.
6. A common service traffic attribution method, comprising: determining, according to an identification rule, a maximum quantity of incoming packets required for a traffic analysis, wherein the identification rule is obtained based on a feature by using a machine learning algorithm to identify different services in traffic; filtering the traffic based on the maximum quantity of incoming packets; obtaining a feature of a packet in the traffic, wherein the feature comprises a ciphertext feature, and the ciphertext feature comprises any one or more of a sequence, a length, or a transmission direction of an encrypted packet; analyzing the traffic based on the feature, to identify an exclusive service and a common service in the traffic, wherein the exclusive service is a service invoked by only one application, and the common service is a service invoked by a plurality of applications; and attributing traffic of a common service whose identification time is between an identification time of an exclusive service A and an identification time of an exclusive service B to an application, wherein the application is an application that invokes the exclusive service A, the exclusive service A is any identified exclusive service, and the exclusive service B is a first exclusive service whose identification time is after the identification time of the exclusive service A.
7. The method according to claim 6, wherein the analyzing the traffic based on the feature, to identify an exclusive service and a common service in the traffic comprises: performing matching between the feature and each of a second identification rule and a third identification rule to identify the exclusive service and the common service in the traffic, wherein the second identification rule and the third identification rule are obtained based on the feature by using a machine learning algorithm.
8. A computer system, comprising a memory and a processor, wherein the memory is configured to store a computer readable instruction, which when executed by the processor, causes the processor to perform a common service traffic attribution method, the method comprising: determining, according to an identification rule, a maximum quantity of incoming packets required for a traffic analysis, wherein the identification rule is obtained based on a feature by using a machine learning algorithm to identify different services in traffic; and filtering the traffic based on the maximum quantity of incoming packets; obtaining a feature of a packet in the traffic, wherein the feature comprises a ciphertext feature, and the ciphertext feature comprises any one or more of a sequence, a length, or a transmission direction of an encrypted packet; analyzing the traffic based on the feature, to identify a start service, an exclusive service, and a common service in the traffic, wherein the start service is a service invoked in an application startup phase, the exclusive service is a service invoked by only one application, and the common service is a service invoked by a plurality of applications; and attributing traffic of a common service whose identification time is between a first identification time of a start service A and a second identification time of a start service B to an application that invokes an exclusive service whose identification time is between the first identification time and the second identification time, the start service A is any identified start service, and the start service B is a first start service whose identification time is after the first identification time.
9. The computer system according to claim 8, wherein analyzing the traffic comprises: performing matching between the feature and each of a first identification rule, a second identification rule, and a third identification rule to identify the start service, the exclusive service, and the common service in the traffic, wherein the first identification rule, the second identification rule, and the third identification rule are obtained based on the feature by using a machine learning algorithm.
10. The computer system according to claim 8, wherein attributing traffic of a common service comprises: determining the application based on the exclusive service and correspondence information, wherein the correspondence information comprises a correspondence between the exclusive service and an application that invokes the exclusive service.
11. A computer system, comprising a memory and a processor, wherein the memory is configured to store a computer readable instruction, which when executed by the processor, causes the processor to perform a common service traffic attribution method, comprising: determining, according to an identification rule, a maximum quantity of incoming packets required for a traffic analysis, wherein the identification rule is obtained based on a feature by using a machine learning algorithm, and the identification rule is used to identify different services in traffic; filtering the traffic based on the maximum quantity of incoming packets; obtaining a feature of a packet in the traffic, wherein the feature comprises a ciphertext feature, and the ciphertext feature comprises any one or more of a sequence, a length, or a transmission direction of an encrypted packet; analyzing the traffic based on the feature, to identify an exclusive service and a common service in the traffic, wherein the exclusive service is a service invoked by only one application, and the common service is a service invoked by a plurality of applications; and attributing traffic of a common service whose identification time is between an identification time of an exclusive service A and an identification time of an exclusive service B to an application that invokes the exclusive service A, the exclusive service A is any identified exclusive service, and the exclusive service B is a first exclusive service whose identification time is after the identification time of the exclusive service A.
12. The computer system according to claim 11, wherein the analyzing the traffic based on the feature, to identify an exclusive service and a common service in the traffic comprises: performing matching between the feature and each of a second identification rule and a third identification rule to identify the exclusive service and the common service in the traffic, wherein the second identification rule and the third identification rule are obtained based on the feature by using a machine learning algorithm.
13. A non-transitory computer-readable medium storing computer instructions for common service traffic attribution, that when executed by one or more processors, cause the one or more processors to perform a method, which comprises: determining, according to an identification rule, a maximum quantity of incoming packets required for a traffic analysis, wherein the identification rule is obtained based on a feature by using a machine learning algorithm to identify different services in traffic; and filtering the traffic based on the maximum quantity of incoming packets; obtaining a feature of a packet in the traffic, wherein the feature comprises a ciphertext feature, and the ciphertext feature comprises any one or more of a sequence, a length, or a transmission direction of an encrypted packet; analyzing the traffic based on the feature, to identify a start service, an exclusive service, and a common service in the traffic, wherein the start service is a service invoked in an application startup phase, the exclusive service is a service invoked by only one application, and the common service is a service invoked by a plurality of applications; and attributing traffic of a common service whose identification time is between a first identification time of a start service A and a second identification time of a start service B to an application that invokes an exclusive service whose identification time is between the first identification time and the second identification time, wherein the start service A is any identified start service, and the start service B is a first start service whose identification time is after the first identification time.
14. The medium according to claim 13, wherein the analyzing the traffic comprises: performing matching between the feature and each of a first identification rule, a second identification rule, and a third identification rule to identify the start service, the exclusive service, and the common service in the traffic, wherein the first identification rule, the second identification rule, and the third identification rule are obtained based on the feature by using a machine learning algorithm.
15. The medium according to claim 13, wherein attributing traffic of a common service comprises: determining the application based on the exclusive service and correspondence information, wherein the correspondence information comprises a correspondence between the exclusive service and an application that invokes the exclusive service.
Description
BRIEF DESCRIPTION OF DRAWINGS
(1) To describe the technical solutions provided in this application more clearly, the following briefly describes the accompanying drawings. Clearly, the accompanying drawings in the following descriptions show merely some embodiments of this application.
DESCRIPTION OF EMBODIMENTS
(15) To help understand the technical solutions proposed in this application, some elements introduced in the descriptions of this application are first described herein. It should be understood that the following descriptions are merely intended to help understand these elements, so as to understand content of the embodiments, but do not necessarily cover all possible cases.
(16) Traffic: Network communication packets are generated when devices connected through a network interact with each other, and these packets are referred to as traffic. The term is used here in its general sense.
(17) Data stream: The data packets generated in a complete communication process (from establishment of a connection to the end of the connection) between a server and a client are referred to as the data stream of the connection. When an application is used, interaction usually occurs a plurality of times, so a plurality of data streams are generated and together form the application traffic.
(18) For example, the data stream is traffic generated during a session starting from TLS handshake establishment and ending with a Transmission Control Protocol (TCP) FIN (finish) packet. The data stream represents a process of interaction between two subjects, for example, interaction between an application process and the server.
(19) Common service: A service that is publicly provided through an API (application programming interface) deployed on a server and invoked by a plurality of application programs, and that completes certain functions, for example, map navigation, cloud storage, and video transmission.
(20) Traffic analysis: A network communication packet is obtained through listening, capturing, copying, or the like, and the original communication content is restored through parsing, reassembling, segmentation, or the like, so as to understand the current states of the two communicating parties.
(21) Plaintext feature: A plaintext feature is a feature including a character and/or a digit that can be directly obtained from a packet through parsing, and is different from a ciphertext feature.
(22)
(23) Currently, traffic identification technology mainly focuses on identification at the application layer; identification at the service layer is essentially not performed. However, common service traffic currently accounts for at least 60% of total traffic in the application market, and applications that use a common service module account for at least 95% of all applications. The most prominent service identification problem is the identification of Google®-type services. For example, common service traffic such as Google® map traffic is identified inconsistently, because every application program that uses the Google® map service generates it. Consequently, an operator's services are seriously affected. In actual application, a service cannot be accurately identified by using an application-layer traffic identification technology, which results in a relatively high false identification rate.
(24) An existing widely-used traffic analysis solution is a plaintext feature identification method in which traffic is identified by using a plaintext feature of a Hypertext Transfer Protocol (HTTP) packet and a plaintext feature of a TLS handshake message. The HTTP packet includes a request packet and a response packet.
(25) TABLE 1
Action | Meaning
GET | Request to obtain a resource identified by a URI.
POST | Add new data after a resource identified by a URI.
HEAD | Request to obtain a response message header of a resource identified by a URI.
PUT | Request a server to store a resource and use a URI as an identifier of the resource.
DELETE | Request a server to delete a resource identified by a URI.
TRACE | Request a server to return received request information mainly for testing or diagnosis.
CONNECT | Reserved for future use.
OPTIONS | Request to query performance of a server, or query an option and a requirement that are related to a resource.
(26) In traffic analysis, the interaction being performed between the client and the server may be determined from the foregoing actions. For example, interaction content may be determined by using the resource identified by the uniform resource identifier (URI), and the Host field in the header may be used to determine whether the packet belongs to an application. Therefore, plaintext feature analysis usually uses these parsable character or digit features directly to infer the states of the two communicating parties. However, since encryption was introduced into network communication protocols, only the small part of traffic that remains unencrypted can still be analyzed in this way.
(27) Because protocol encryption is applied, all plaintext feature fields of an original HTTP packet are encrypted in Hypertext Transfer Protocol Secure (HTTPS) traffic. At least 90% of current network traffic is based on the HTTPS protocol. In HTTPS, a TLS protocol layer encapsulates the original HTTP packet. A handshake process of the TLS protocol is shown in
(28) TLS handshake messages mainly include 10 basic types (and other extended types). A feature of a TLS handshake message is constructed below mainly based on one or more of the 10 types of packets. The 10 types of packets include (1) to (5), and (7) (equivalent to (9)) that are shown in
(29) TABLE 2
Packet type | Meaning or function
HelloRequest | Handshake actively initiated by a server. This is not common and is mainly used in the following case: A session has lasted for a long time, and the server reestablishes a new connection to a client to reduce security risks.
ClientHello | Hello message sent by a client to a server, including a session ID.
ServerHello | Hello message sent by a server to a client, including an encryption algorithm and a compression algorithm that are selected by the server.
Certificate | Certificate chain sent by a server to a client.
ServerKeyExchange | Message received by a client from a server, carrying a parameter for establishing symmetric encryption. The parameter is optional and is not required in all key exchange algorithms.
CertificateRequest | A server requests a client to provide a certificate. This is not common in a web server.
ServerHelloDone | Hello done message.
ClientKeyExchange | Responsible for sending the following three pieces of information to a server: a random number: The random number is encrypted by using a public key of the server, to prevent eavesdropping; a code change notification: indicating that subsequent information is sent by using an encryption method and a key that are negotiated by both parties; and a client handshake end notification: indicating that a handshake phase of a client ends. The notification is also a hash value of all previously sent content, and is used for verification by the server.
Certificate Verify | A client needs to verify whether a certificate of a server is issued by a trusted authority, whether a domain name in the certificate is consistent with an actual domain name, or whether the certificate expires. If verification on the certificate succeeds, the client fetches a public key of the server from the certificate of the server.
Finished | When this message is sent, the message is already encrypted, because negotiation has ended, a ChangeCipherSpec message has been sent, and encrypted communication between two parties has been activated.
(30) It should be noted that the ChangeCipherSpec protocol is not a part of a handshake protocol, and sending the ChangeCipherSpec protocol indicates that encryption statuses of the two parties are ready. In subsequent communication, ciphertext encryption communication negotiated by the two parties is used, and details are not described in this application. In addition, the Finished packet herein indicates that a handshake process ends, and is not the foregoing TCP FIN packet. A communication process between the client and the server is actually as follows: A TCP handshake is first established at the TCP layer; then the TLS handshake message shown in
(31) In an existing solution, one or more of the foregoing TLS handshake messages may be used to construct features; the features are converted into machine-readable rules, for example in XML (extensible markup language), and the rules are stored. After network traffic is parsed, these rules are read to filter the traffic in the corresponding protocol format. Filtering may be sequential: a full matching rule starting from the ClientHello packet and ending with the Finished packet is established (that is, all plaintext fields in the packets are input). After filtering is completed, the filtered traffic is sent to a service logic matching module, the application to which the traffic belongs is identified based on the application ID corresponding to the rule, and a matching result is output.
(32) However, applications of a same type are highly similar in some features (such as certificates), so they cannot be distinguished when a rule is established by using only the features of the foregoing TLS handshake messages. In addition, traffic of different services in a same application cannot be identified by using only these features. In particular, common traffic generated when different applications use a same service is identified as traffic of a single application, and when a nested service exists inside the service, a large amount of false identification results. These plaintext features cannot be used to subdivide service traffic, and identification cannot be completed when common service traffic is generated. Therefore, when a common service occurs, its traffic, which may actually belong to the next or the previous application, is usually counted toward the current application during traffic statistics collection. Consequently, the false identification rate is relatively high.
(33) Herein, applications of a same type are applications that invoke a same or similar service. Because the server issues a same type of certificate to a same type of service, identification cannot be performed by using only the TLS handshake messages. The applications of the same type may be applications comprising a same service, for example, two map applications of a same company or different companies; or may be applications that are of different types of a same company and that invoke a same service.
(34)
(35) Further, the traffic analysis apparatus may be connected to a traffic parsing apparatus 300. The traffic parsing apparatus 300 is configured to: parse received traffic, and then output a result obtained through parsing to the traffic analysis apparatus 400. In a traffic parsing process, range information of a field is extracted (specifically extracted by a parsing module in
(36) Further, the traffic analysis apparatus 400 may include a traffic filtering module 440, configured to: filter, according to all or some of rules obtained by the feature learning module 410, the result that is output by the traffic parsing apparatus 300; and input, to the service identification module 420, traffic obtained through filtering, so as to reduce an amount of processing by the service identification module 420 and improve processing efficiency. The parsing process may be further implemented in combination with hardware. For example, the parsing process is accelerated in combination with a hardware acceleration apparatus.
(37) A plurality of modules in
(38) The traffic analysis apparatus 400 is used as an example. The following describes a traffic analysis method provided in this application. The traffic analysis method belongs to some or all functions provided by the traffic analysis apparatus 400.
(39)
(40) S501. A feature learning module 410 performs machine learning based on collected historical traffic data or traffic data obtained in another manner, and obtains an application-service rule of each application through machine learning.
(41) In a machine learning process, a feature of a packet needs to be extracted. The feature of the packet herein includes either or both of a plaintext feature and a ciphertext feature of the packet. The plaintext feature includes a feature including a character and/or a digit that can be directly obtained from the packet through parsing. The ciphertext feature includes any one or more of a sequence, a length, and a transmission direction of an encrypted packet.
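The ciphertext feature described above can be illustrated with a minimal sketch. It assumes a hypothetical `EncryptedRecord` structure and a hypothetical encoding in which one tuple of signed integers carries all three properties; the source does not prescribe any particular encoding.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class EncryptedRecord:
    """One encrypted packet observed in a data stream (hypothetical structure)."""
    length: int        # length of the encrypted field, in bytes
    from_client: bool  # True if sent from the client to the server


def ciphertext_feature(records: List[EncryptedRecord],
                       max_records: int = 8) -> Tuple[int, ...]:
    """Encode sequence, length, and direction as a tuple of signed lengths.

    Assumed encoding: the position in the tuple carries the sequence, the
    absolute value carries the length, and the sign carries the direction
    (positive for client-to-server, negative for server-to-client).
    """
    return tuple(r.length if r.from_client else -r.length
                 for r in records[:max_records])


stream = [
    EncryptedRecord(517, True),
    EncryptedRecord(1400, False),
    EncryptedRecord(93, True),
]
feature = ciphertext_feature(stream)
print(feature)
```

A feature of this shape can be compared against learned rules without decrypting any payload, which is the property the ciphertext feature relies on.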
(42) An application-service rule of an application includes identification rules of three services invoked by the application. The three services include a start service, an application exclusive service, and a common service. The application-service rule is used to perform service identification. In addition, because the three rules are associated with a specific application, an application to which identified traffic belongs may be learned according to the rules. Start services and common services of two or more different applications may be partially or completely the same, so that identification rules obtained through learning may be partially repeated.
(43) The machine learning process may be performed offline, in other words, not in real time; or may be performed in real time. Some traffic data may be periodically obtained when the machine learning process is performed in real time, and an application-service rule is generated or updated through machine learning.
(44) In some other embodiments, a manager may manage, by using a management configuration module (not shown in the figure), the rules obtained by the feature learning module 410. For example, the manager may add, delete, modify, or view these rules.
(45) S502. After traffic arrives, a traffic parsing apparatus 300 reads a packet in the traffic from a storage (for example, a memory), parses the packet according to a protocol format of the packet, and transmits, to a traffic filtering module 440, a packet (or referred to as traffic) obtained through parsing.
(46) The parsing process uses a protocol above the transport layer (the TCP/IP layer), for example, the TLS protocol. According to its format, a TLS protocol-based packet may be divided into a TLS handshake part and a TLS record part. In this embodiment, the handshake part mainly includes seven types of data packets, including ClientHello, ServerHello, Certificate, and the like. As mentioned above, not all 10 types of data packets are used.
(47) S503. The traffic filtering module 440 receives the traffic from the traffic parsing apparatus 300, obtains the application-service rule from the feature learning module 410, filters a received packet according to the application-service rule, and sends, to a service identification module 420, a packet obtained through filtering.
(48) In one embodiment, the feature learning module 410 stores the application-service rule in the memory by using a file or in another form. After reading the application-service rule from the memory, the traffic filtering module 440 filters the traffic according to the application-service rule.
(49) The traffic filtering module 440 is mainly configured to preprocess the traffic before service identification, such as filtering or offloading, so as to reduce system overheads and improve processing efficiency of the service identification module 420. The traffic filtering module 440 can support performing parsing based on different fields in different packets such as HTTP and TLS packets, and can also support a custom regular filter mode.
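One filtering criterion named in the claims is the maximum quantity of incoming packets that the identification rules require. The following is a minimal sketch of that idea only. It assumes, hypothetically, that each rule is a tuple of signed packet lengths and that the rule length equals the number of packets the rule needs; the rule names and packet values are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Packet = Tuple[str, int]  # (flow_id, payload_length); hypothetical minimal packet


def max_packets_needed(rules: Dict[str, Tuple[int, ...]]) -> int:
    """Assumption: the longest rule decides how many packets per flow are inspected."""
    return max((len(pattern) for pattern in rules.values()), default=0)


def filter_traffic(packets: Iterable[Packet], limit: int) -> List[Packet]:
    """Keep only the first `limit` packets of each flow and drop the rest."""
    seen: Dict[str, int] = defaultdict(int)
    kept: List[Packet] = []
    for flow_id, length in packets:
        if seen[flow_id] < limit:
            seen[flow_id] += 1
            kept.append((flow_id, length))
    return kept


# Hypothetical rules; the values are illustrative only.
rules = {"start": (517, -1400), "exclusive": (230, -96, 410)}
limit = max_packets_needed(rules)

packets = [("f1", 230), ("f1", 96), ("f2", 517),
           ("f1", 410), ("f1", 1200), ("f2", 1400)]
kept = filter_traffic(packets, limit)
print(limit, kept)
```

Packets beyond the limit never reach the service identification step, which is how filtering of this kind reduces the processing load.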
(50) In some other embodiments, the traffic filtering module 440 may not be required.
(51) S504. The service identification module 420 receives, from the traffic filtering module 440, the traffic obtained through filtering, obtains the application-service rule from the feature learning module 410, performs, according to the application-service rule, service identification on the traffic obtained through filtering, and obtains an identification result. The identification result includes a “location” of each service and a type of a service to which the traffic belongs: a start service, an application exclusive service, or a common service. Finally, the identification result is sent to a traffic attribution module 430.
(52) The “location” of the service herein does not mean a geographical location. Location information of a service can be understood as a mark or an indication, and is used to indicate a sequence of a time for identifying the service relative to another service. For example, the location information of the service may be a time point at which the service is identified, or a digit that may reflect a sequence.
(53) For example, if it is determined that a feature of a data stream S1 matches a feature of a start service of an application, traffic of the data stream S1 belongs to the start service, and then a correspondence between the data stream S1, a start service, and a service location is recorded in the memory.
(54) S505. The traffic attribution module 430 receives the identification result sent by the service identification module 420, and determines, based on a start service and an exclusive service (or based only on the exclusive service), an application to which traffic of a common service belongs.
(55) In one embodiment, the service identification module 420 records the identification result in the memory, and the memory may be a cache, or may be another type of memory. Then the traffic attribution module 430 reads the identification result from the memory.
(56) In one embodiment, an application identification time (that is, a location of a start service) does not need to be considered. When an exclusive service is identified, an application (for example, an application ID) corresponding to the exclusive service is recorded in the memory, and traffic of a common service that appears after the time point belongs to the application. When a next exclusive service is subsequently identified, a new application (which may be the same as the previous application because a same application may have two or more exclusive services) is recorded. This method is applicable to a scenario in which there is no traffic between a start service and an exclusive service, and the exclusive service is equivalent to a start service.
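The embodiment above, in which only exclusive services are used, can be sketched as follows. The event representation, the function name `attribute_by_exclusive`, and the service and application names are assumptions made for illustration; the events are taken to arrive in identification order.

```python
from typing import Dict, List, Optional, Tuple

Event = Tuple[str, str]  # (service type, service name), in identification order


def attribute_by_exclusive(
    events: List[Event],
    exclusive_to_app: Dict[str, str],
) -> List[Tuple[str, Optional[str]]]:
    """Attribute each common service to the most recently recorded application."""
    current_app: Optional[str] = None
    result: List[Tuple[str, Optional[str]]] = []
    for kind, name in events:
        if kind == "exclusive":
            # Record the application that owns this exclusive service.
            current_app = exclusive_to_app[name]
        elif kind == "common":
            # Common traffic after that point belongs to the recorded application.
            result.append((name, current_app))
    return result


events = [
    ("exclusive", "OS_a"), ("common", "PS_1"),
    ("exclusive", "OS_b"), ("common", "PS_2"), ("common", "PS_1"),
]
owners = {"OS_a": "A", "OS_b": "B"}  # hypothetical correspondence information
attributed = attribute_by_exclusive(events, owners)
print(attributed)
```

A common service seen before any exclusive service is left unattributed (`None`) in this sketch; how such traffic should be handled is not stated in the source.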
(57) In one embodiment, a start service is first identified, an application identification time is determined, and the identification time is stored in the memory. It should be noted that the “time” herein is not necessarily a time value. When an exclusive service is identified, an application corresponding to the exclusive service is recorded in the memory, and traffic of a common service that appears after the time point belongs to the application. After a next start service is subsequently identified, updating the application recorded in the memory is considered.
(58) In the foregoing two embodiments, to save storage space of the memory, an aging time of stored content, a quantity of stored content entries, or the like may be set during implementation of the method.
(59) The following uses the second embodiment as an example for description. There is only a slight difference between the first embodiment and the second embodiment. With reference to the second embodiment, a person skilled in the art may learn how to implement the first embodiment.
(60) First, currently received traffic is segmented based on location information of all identified start services. For example, a first segment ranges from a start service SS.sub.a to a start service SS.sub.b, and a second segment ranges from the start service SS.sub.b to a start service SS.sub.c.
(61) Then an application corresponding to a segment is determined based on location information of an exclusive service. For example, if an exclusive service OS.sub.b is in the second segment, and the exclusive service OS.sub.b is exclusive to an application B, it is determined that the second segment corresponds to the application B. It should be understood that segments and applications are not in a one-to-one correspondence. The second segment corresponds to the application B, but it does not mean that traffic of the application B exists only in the second segment. The application B may be started for a plurality of times.
(62) Finally, an application to which the common service belongs is determined based on the location information of the common service and the application corresponding to the segment. For example, if a common service PS.sub.a is in the second segment, and it is learned that the second segment corresponds to the application B, traffic of the common service PS.sub.a belongs to the application B.
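The three steps above (segmenting by start services, labeling a segment by an exclusive service, and attributing a common service by its segment) may be sketched as follows. This is a minimal sketch under stated assumptions: locations are modeled as numbers, the start-service locations are sorted in ascending order, and the function names `segment_index` and `attribute_common` are hypothetical.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple


def segment_index(start_locations: List[float], location: float) -> int:
    """Return the index of the segment that contains `location`.

    Segment i spans from start_locations[i] up to (not including)
    start_locations[i + 1]. A result of -1 means the location precedes
    every identified start service.
    """
    return bisect_right(start_locations, location) - 1


def attribute_common(
    start_locations: List[float],
    exclusive_services: List[Tuple[float, str]],  # (location, application)
    common_services: List[Tuple[str, float]],     # (service name, location)
) -> Dict[str, Optional[str]]:
    # Step 2: label each segment with the application of an exclusive
    # service identified inside it (assumption: a later one overwrites).
    segment_app: Dict[int, str] = {}
    for location, app in exclusive_services:
        idx = segment_index(start_locations, location)
        if idx >= 0:
            segment_app[idx] = app
    # Step 3: attribute each common service by the segment of its location.
    return {
        name: segment_app.get(segment_index(start_locations, location))
        for name, location in common_services
    }


starts = [0.0, 10.0, 20.0]               # SS_a, SS_b, SS_c
exclusive = [(3.0, "A"), (12.0, "B")]    # OS_a, OS_b
common = [("PS_a", 15.0), ("PS_b", 8.0)]
result = attribute_common(starts, exclusive, common)
```

With these illustrative values, PS_a falls in the second segment and is attributed to the application B, while PS_b is identified in the first segment and is attributed to the application A.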
(63) S502 to S505 are usually a real-time processing process.
(64) For ease of understanding,
(65) The exclusive service OS.sub.b exists after the start service SS.sub.b and before a next start service SS.sub.c, and it is learned that OS.sub.b is exclusive to the application B. Therefore, it may be determined that the start service SS.sub.b is a start service of the application B. Further, it may be determined that a start time of the application B is approximately a time indicated by a location of the start service SS.sub.b. Likewise, the exclusive service OS.sub.a is exclusive to an application A. Therefore, it may be determined that the start service SS.sub.a is a start service of the application A.
(66) The common service PS.sub.a is in the second segment, and appears after the application B is started. Therefore, traffic of the common service PS.sub.a should belong to the application B. In contrast, although arrival time points of most data streams of the other common service PS.sub.b coincide with the second segment, it is learned from the figure that an initial location (a location at which the common service is identified) of the common service PS.sub.b is in the first segment, at which point the application B has not been started. Therefore, the traffic of PS.sub.b belongs to the application A instead of the application B.
(67) It should be noted that a time at which a service is identified (that is, a time indicated by a location of the service) is not an exact time at which the application is started or the service is started. However, a sequence in which services are identified is usually consistent with a sequence in which the services run.
(68) The solutions are collectively described above. The following uses a Google® application (for example, Google Map) as an example to describe a service identification method and a service traffic attribution method in detail, and to show how the foregoing steps are specifically implemented. In a current technology, accuracy of identifying traffic of the Google® application is relatively low, and attribution of common service traffic cannot be correctly determined, thereby affecting a normal traffic identification service of an operator. Therefore, in this application, the Google® application is used as an example to describe a traffic analysis method.
(69) An objective of the method to be described below is to determine attribution of traffic of a Google common service, so as to improve traffic identification accuracy of the Google® application.
(70) A general process of the method is similar to that in
(71)
(72)
(73) The feature matrix may be constructed by using one or more of the following three methods. Method 1: The feature matrix is constructed based on a plaintext of a packet. For example, an SNI (server name indication) field in a ClientHello packet is used as a column of features. Method 2: The feature matrix is constructed based on a ciphertext feature of a protocol, for example, a length of a first data packet of uplink application data and/or a length of a downlink data packet, and ciphertext content does not need to be obtained. Method 3: The feature matrix is constructed by combining a plaintext and a ciphertext. The feature matrix may be manually constructed for the first time. In a subsequent step, the feature matrix may be adjusted based on a learned feature value range.
(74) After the feature matrix is obtained, the feature vector is generated (S802). Specifically, a feature of each data stream in application traffic is checked. If the data stream includes the feature in a corresponding feature column, the entry of the data stream in that column is marked as 1; or if the feature does not appear, the entry is marked as 0. In this way, a feature matrix of all data streams can be finally obtained, and each row of the matrix represents a feature vector of a data stream. For example, if application traffic of Google Map includes 20 data streams and there are 30 constructed feature columns, a 20×30 feature matrix including 0 and 1 is output.
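As an illustrative sketch of S802, each feature column may be modeled as a predicate over a data stream, and each data stream yields one row of 0 and 1 entries. The stream field names (`sni`, `first_uplink_len`, `first_downlink_len`) and the three example columns are assumptions made for this sketch only.

```python
from typing import Any, Callable, Dict, List

Stream = Dict[str, Any]
FeatureColumn = Callable[[Stream], bool]

# Hypothetical feature columns: one plaintext feature and two ciphertext
# (length) features. Real columns would come from the constructor.
feature_columns: List[FeatureColumn] = [
    lambda s: s.get("sni") == "clients4.google.com",
    lambda s: s.get("first_uplink_len") == 254,
    lambda s: 0 <= s.get("first_downlink_len", -1) <= 1300,
]


def build_feature_matrix(
    streams: List[Stream], columns: List[FeatureColumn]
) -> List[List[int]]:
    """Return one 0/1 feature vector (row) per data stream."""
    return [[1 if column(s) else 0 for column in columns] for s in streams]


streams = [
    {"sni": "clients4.google.com", "first_uplink_len": 254, "first_downlink_len": 62},
    {"sni": "www.googleapis.com", "first_uplink_len": 512},
]
matrix = build_feature_matrix(streams, feature_columns)
```

Here two data streams and three feature columns produce a 2×3 matrix, in the same way that 20 data streams and 30 feature columns would produce a 20×30 matrix.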
(75)
(76) When the learner 712 finds a feature vector used to distinguish between services (S902), the learner 712 outputs an identification rule corresponding to the feature vector, and combines a service identification rule learned for a same type of application into the application-service rule of the application (S903). When the learner 712 does not find a feature vector used to distinguish between services (S902), the learner 712 sends, to the constructor, a request for reconstructing the feature matrix (S904), to request to reconstruct the feature matrix. Referring to
(77) In this embodiment, a machine learning algorithm such as a decision tree algorithm, an artificial neural network algorithm, a support vector machine algorithm, a clustering algorithm, a Bayes classification algorithm, a Markov chain algorithm, or a probabilistic graphical model may be used.
(78) The rule includes three types: a first identification rule, a second identification rule, and a third identification rule. As shown in Table 3 to Table 5 below, a rule includes one or more fields.
(79) It should be noted that the “field” in Table 3 to Table 5 indicates a field in the rule and is customized. “Location” is a field in an actual data packet. The field is usually agreed on by an Internet Protocol team, and is visible in a Request For Comments (RFC) document of a corresponding protocol and is a consensus in the art. A value may be obtained by using the field, to match a preset value of the field in the rule.
(80) TABLE 3
Rule | Field | Location | Description | Example
First identification rule | SNI | TLS handshake | The field is a server name. | “clients4.google.com”
First identification rule | TLS record length | TLS record length | Packet length feature | For example, a first packet record length 254 may be determined as a start of Google Map.
(81) An example of the first identification rule is as follows:
(82) SNI=www.googleapis.com && TLS record=512
(83) When the rule is used, a value is obtained from a TLS handshake field of a received data packet, and a value is obtained from a TLS record length field, to perform matching between the two values and the identification rule. It is determined whether the two obtained values are respectively www.googleapis.com and 512. If yes, the matching succeeds; or if no, the matching fails. A method for using another rule in the following is similar to that for the foregoing rule, and details are not described.
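The matching described above may be sketched as follows. This is a minimal sketch that assumes the two values have already been extracted from the received data packet into a dictionary. The class name `FirstRule`, the function name `matches_first_rule`, and the dictionary keys are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class FirstRule:
    sni: str                 # expected server name from the TLS handshake
    tls_record_length: int   # expected TLS record length


def matches_first_rule(rule: FirstRule, packet_fields: Dict[str, Any]) -> bool:
    """Both obtained values must equal the preset values in the rule."""
    return (
        packet_fields.get("sni") == rule.sni
        and packet_fields.get("tls_record_length") == rule.tls_record_length
    )


rule = FirstRule(sni="www.googleapis.com", tls_record_length=512)
hit = matches_first_rule(rule, {"sni": "www.googleapis.com", "tls_record_length": 512})
miss = matches_first_rule(rule, {"sni": "www.googleapis.com", "tls_record_length": 254})
```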
(84) TABLE 4
Rule | Field | Location | Description | Example
Second identification rule | SNI | TLS handshake | The field is a server name. | “clients4.google.com”
Second identification rule | CertCommonName | Certificate | Certificate alias | “blackberry.com”
Second identification rule | UserAgent | HTTP head | Browser and system name (single-packet identification) | “com.google.android.youtube”
Second identification rule | UDP-UserAgent | HTTP head | Browser and system name (single-packet identification) | “com.google.android.youtube”
Second identification rule | Client application data (cAppD) | TLS record length (Considering packet fragmentation and performance, the field may be replaced with TCP.length.) | Data sent by a client to a server side | 0-1300 (sequential matching in a same direction, and supporting TCP and TLS packets)
Second identification rule | Server application data (sAppD) | TLS record length (Considering packet fragmentation and performance, the field may be replaced with TCP.length.) | Data sent by a server side to a client | 0-1300 (a maximum of four packets matched in this direction, sequential matching in a same direction, and TCP and TLS packets)
Second identification rule | Other TLS handshake feature | TLS handshake | Another possible (fingerprint) | Existing identification rule
(85) An example of the second identification rule is as follows:
(86) iOS® system: SNI=clients4.google.com && sAppD[1]==62 && sAppD[2]==42 && sAppD[3]==38 && sAppD[4]>=242 && sAppD[4]<=243 && cAppD[1]==53 && cAppD[2]==50 && cAppD[3]>=301 && cAppD[3]<=308; and
(87) Android® system: SNI=clients4.google.com && sAppD[1]==376 && nCAppD>=1 && cAppD[1]>=848 && cAppD[1]<=849, where
(88) sAppD[x] indicates a length of an x.sup.th application data packet sent by the server side to the client side, and cAppD[x] indicates a length of an x.sup.th application data packet sent by the client side to the server side.
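For illustration, a rule of this type may be evaluated by checking the SNI and then checking each indexed length against an inclusive range. The following minimal sketch encodes the iOS® example above. The dictionary layout and the function names `lengths_match` and `matches_second_rule` are assumptions, and indices are 1-based to follow the sAppD[x] and cAppD[x] notation.

```python
from typing import Dict, List, Tuple

Range = Tuple[int, int]  # inclusive (low, high) bounds on a packet length


def lengths_match(lengths: List[int], constraints: Dict[int, Range]) -> bool:
    """Check 1-based indexed length constraints for one direction."""
    for x, (low, high) in constraints.items():
        # The x-th packet must exist and its length must be in range.
        if x > len(lengths) or not low <= lengths[x - 1] <= high:
            return False
    return True


def matches_second_rule(
    sni: str, s_app_d: List[int], c_app_d: List[int], rule: dict
) -> bool:
    return (
        sni == rule["sni"]
        and lengths_match(s_app_d, rule["sAppD"])
        and lengths_match(c_app_d, rule["cAppD"])
    )


# Encoding of the iOS example rule; an equality is a range with equal bounds.
IOS_RULE = {
    "sni": "clients4.google.com",
    "sAppD": {1: (62, 62), 2: (42, 42), 3: (38, 38), 4: (242, 243)},
    "cAppD": {1: (53, 53), 2: (50, 50), 3: (301, 308)},
}

ok = matches_second_rule(
    "clients4.google.com", [62, 42, 38, 243], [53, 50, 305], IOS_RULE
)
```

Because only packet lengths, order, and direction are inspected, ciphertext content does not need to be obtained in this sketch.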
(89) TABLE 5
Rule | Field | Location | Description | Example
Third identification rule | SNI | TLS handshake | The field is a server name (single-packet identification). | “clients4.google.com”
Third identification rule | CertCommonName | Certificate | Certificate alias (single-packet identification) | “blackberry.com”
Third identification rule | Other TLS handshake feature | TLS handshake | Another possible (fingerprint) | Existing identification rule
(90) An example of the third identification rule is as follows:
(91) #SNI_googleadservices.com
(92) #SNI_www.googleapis.com
(93) #CertCommonName_google-analytics.com
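As a minimal sketch of single-packet matching against rules of this form, it is assumed here that each rule line reads as "#", a field name, an underscore, and a preset value. This reading of the rule syntax, the function names, and the dictionary keys are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def parse_third_rule(line: str) -> Tuple[str, str]:
    """Split a rule such as '#SNI_www.googleapis.com' into (field, value)."""
    field, value = line.lstrip("#").split("_", 1)
    return field, value


def matches_any(rules: List[Tuple[str, str]], packet_fields: Dict[str, str]) -> bool:
    """A single packet matches if any one rule's field equals its preset value."""
    return any(packet_fields.get(field) == value for field, value in rules)


RULES = [
    parse_third_rule(line)
    for line in (
        "#SNI_googleadservices.com",
        "#SNI_www.googleapis.com",
        "#CertCommonName_google-analytics.com",
    )
]

matched = matches_any(RULES, {"SNI": "www.googleapis.com"})
```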
(94) The foregoing is a process of obtaining a service identification rule, and the process is performed offline. The following describes a real-time traffic analysis process. In the real-time traffic analysis process, the following processes such as a traffic obtaining process, a traffic filtering process, a service identification process, and a process of attributing common service traffic are sequentially performed in real time.
(95)
(96) In one embodiment, after the application-service rule and the to-be-filtered traffic are received, a maximum quantity of incoming packets required when the rule is used to identify a service is determined according to the application-service rule (S1001). In addition, an ASN domain of Google is calculated based on IP information of the to-be-filtered traffic (S1001). The traffic is filtered based on the determining result and the maximum quantity of incoming packets (S1002), and the traffic obtained through filtering belongs to the ASN domain of Google and meets a requirement for the maximum quantity of incoming packets.
(97) The maximum quantity of incoming packets herein is a maximum quantity of packets that are read by a traffic analysis apparatus 700 from a data stream. For example, if the maximum quantity of incoming packets is 5, a quantity of read packets is less than or equal to 5. After five packets have been read, no further packet is read from the data stream. In other words, when the traffic is filtered, data packets other than the five packets are filtered out.
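The filtering may be sketched as follows, assuming that each rule declares how many packets it needs and that each data stream carries an ASN value derived from its IP information. The key names `packets_needed`, `asn`, and `packets`, the function names, and the ASN values (taken from the range reserved for documentation) are placeholders and not part of this application.

```python
from typing import Any, Dict, List, Set


def max_incoming_packets(rules: List[Dict[str, int]]) -> int:
    """The largest packet count that any identification rule needs."""
    return max(rule["packets_needed"] for rule in rules)


def filter_traffic(
    streams: Dict[str, Dict[str, Any]], limit: int, allowed_asns: Set[int]
) -> Dict[str, List[bytes]]:
    """Keep streams in the target ASN domain, truncated to `limit` packets."""
    kept: Dict[str, List[bytes]] = {}
    for stream_id, stream in streams.items():
        if stream["asn"] not in allowed_asns:
            continue  # traffic outside the ASN domain is filtered out
        kept[stream_id] = stream["packets"][:limit]  # later packets are not read
    return kept


rules = [{"packets_needed": 1}, {"packets_needed": 5}]
limit = max_incoming_packets(rules)

streams = {
    "s1": {"asn": 64500, "packets": [b"p%d" % i for i in range(7)]},
    "s2": {"asn": 64501, "packets": [b"q0"]},
}
kept = filter_traffic(streams, limit, allowed_asns={64500})
```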
(98)
(99) It should be noted that, in some other embodiments, the single-user identification module 721 and an execution process of the single-user identification module 721 are not necessary. For example, traffic originally comes from one user, or traffic comes from a plurality of users, but a requirement for a solution does not include distinguishing between traffic of different users.
(100)
(101) In one embodiment, a location of a start service is obtained (S1201), and traffic of a single user is segmented by using the location (S1202). An exclusive service in the segment (namely, a current segment) is obtained, and an application to which traffic in the segment belongs is obtained (S1203). The application is an application that invokes the exclusive service. Then a cache table is established, and information recorded in the cache table includes an application ID, a user ID, and a location of a start service that correspond to the segment (S1204).
(102) To save storage space, only application IDs, user IDs, and location information of start services that correspond to a previous segment and the current segment are stored in the cache table.
(103) It should be understood that the cache table is a table stored in a cache in a form of a table. In some other embodiments, the information may also be stored in another storage space in another form.
(104) If a previous module identifies a common service, a location of the identified common service is obtained (S1205). It is determined, based on the location of the common service, whether the common service belongs to the current segment (S1206); and if the common service belongs to the current segment, an application to which the common service belongs is output (S1207); or if the common service does not belong to the current segment, the cache table is queried for application information of a corresponding location by using the location information of the user (S1208), and the application to which the common service belongs is output. Alternatively, the cache table is directly queried for application information of a corresponding location based on the location of the common service, and an application to which the common service belongs is output.
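The cache table and the query in S1205 to S1208 may be sketched as follows. This is a minimal sketch that follows the alternative of directly querying the cache table by the location of the common service, and that keeps only the previous segment and the current segment for each user. The class names `CacheEntry` and `CacheTable`, and the modeling of a location as a number, are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CacheEntry:
    app_id: str            # application that invokes the exclusive service
    user_id: str
    start_location: float  # location of the start service opening the segment


class CacheTable:
    """Stores, per user, only the previous and the current segment."""

    def __init__(self) -> None:
        self._entries: Dict[str, List[CacheEntry]] = {}

    def add_segment(self, entry: CacheEntry) -> None:
        kept = self._entries.setdefault(entry.user_id, [])
        kept.append(entry)
        del kept[:-2]  # drop everything older than the previous segment

    def lookup(self, user_id: str, location: float) -> Optional[str]:
        # Check the current segment first, then the previous segment.
        for entry in reversed(self._entries.get(user_id, [])):
            if entry.start_location <= location:
                return entry.app_id
        return None


table = CacheTable()
table.add_segment(CacheEntry("app_A", "u1", 0.0))
table.add_segment(CacheEntry("app_B", "u1", 10.0))
in_current = table.lookup("u1", 15.0)
in_previous = table.lookup("u1", 8.0)
```

A common service whose location precedes both stored segments cannot be attributed from the cache in this sketch and yields `None`.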
(105) It should be noted that an ID of an entry in this embodiment is information used to identify the entry, and may be a digit, a text, code, or information of another type. In this embodiment, a location of a service is a time at which the service is identified. Refer to a start location of an arrow that indicates a service in
(106) Any method provided in the foregoing embodiments may be implemented on one or more physical computers. The apparatus proposed in the foregoing embodiments may be deployed on one or more physical computers. Unit module division inside the apparatus is merely shown as an example, and all unit modules may be deployed on a same physical computer, or may be deployed on different physical computers.
(107)
(108) The processor 1310 may be a single-core processor or a multi-core processor. When the processor 1310 is the multi-core processor, the method provided in this application may run on one core, or may run on different cores in a distributed manner. There may be one or more processors 1310, and the plurality of processors may be of a same type or different types. The processor types include a central processing unit (CPU), a graphics processing unit, a microprocessor, a coprocessor, and the like.
(109) The network interface 1330 is configured to connect to another network device, and the connection includes a wireless connection and a wired connection. In this embodiment, the network interface 1330 may be configured to obtain traffic from a network to perform traffic parsing or traffic analysis.
(110) The memory 1320 includes a volatile memory and a nonvolatile memory. Usually, the nonvolatile memory stores a computer readable instruction of a traffic analysis apparatus 1322 and/or a traffic parsing apparatus 1321 provided in this application, and may further store a computer readable instruction of another program module 123 (for example, an operating system). After these computer readable instructions are read and run by the processor 1310, any one or more methods provided in the foregoing embodiments of this application may be implemented. For specific implementation of the traffic analysis apparatus 1322 and the traffic parsing apparatus 1321, refer to the foregoing embodiments. In another embodiment, the traffic analysis apparatus 1322 and the traffic parsing apparatus 1321 may be separately deployed on different physical computers.
(111) The foregoing components are connected by using a bus 140. There may be one or more buses 140. The bus 140 includes an advanced microcontroller bus architecture (AMBA) bus, an industry standard architecture (ISA) bus, a micro channel architecture (MCA) bus, an extended ISA (extended-ISA) bus, a Video Electronics Standards Association (VESA) local bus, a peripheral component interconnect (PCI) bus, and the like.
(112) The traffic analysis method provided in this application is different from a prior-art TLS handshake solution used only for application identification, and this application provides more fine-grained service identification. A ciphertext feature of a packet is used in a service identification process, thereby improving service identification accuracy. Correspondingly, in a rule learning process, ciphertext feature learning is added. Under impact of a ciphertext feature (for example, a length, a sequence, or a transmission direction of an application data packet) on service identification, a feature matrix is constructed, a feature vector is learned, and finally an application-service rule is generated, so that an identification granularity is increased, thereby resolving a problem that some TLS handshake features are insufficient to distinguish between and identify traffic. Further, according to the traffic analysis method provided in this application, a feature of an encrypted HTTP session part is combined with a TLS handshake plaintext feature, and the feature vector is learned by using an adaptive binning method that combines a numeric feature and a symbol feature, so as to identify application or service traffic, and improve identification accuracy and precision.
(113) According to the common service traffic attribution method provided in this application, an attribution problem is resolved through collaboration of three services; a traffic segment is located by using a start service; an application label is obtained by using an exclusive service; and common service traffic is attributed by using segment information, thereby resolving a problem that common service traffic cannot be attributed to an application.
(114) This application further provides a filtering method that is based on a maximum quantity of incoming packets and an ASN domain of traffic, so as to reduce traffic that needs to be analyzed. In addition, in a rule generation process, efficiency is considered, redundant rules are combined, and a quantity of determining times is reduced. Therefore, a problem that rule complexity is excessively high and performance seriously deteriorates is resolved. In a TLS handshake rule, a full procedure field of a certificate needs to be parsed, and a large amount of memory is consumed. A single field cannot be accurately matched, and consequently identification overheads are increased. A parsed field needs to be optimized, and rule complexity needs to be reduced. An effect of the filtering method provided in this application lies in that a filtering policy is adaptively adjusted based on a parameter provided by an identification rule; impact imposed by a redundant rule on performance is reduced; a filtering module is designed; a quantity of reading times and performance overheads are reduced; a disadvantage of a full-field feature establishment rule in a current technical solution is overcome; and a high-speed real-time traffic identification environment is adapted.
(115) In a high-speed environment of a backbone core network, a quantity of packets required for traffic identification is greatly limited. Therefore, in the description process in this application, no full-traffic feature is applied. However, if hardware technologies progress or any special construction environment can support this feature learning manner, this application can be naturally extended to this traffic identification environment. A core identification step is still similar to that in the foregoing embodiments of this application, and a difference is readily figured out by a person skilled in the art. In addition, random packaging of the TLS protocol, or the lower-level TCP protocol, or a manually constructed proprietary protocol may partially change a feature value during identification, and this solution still falls within the protection scope of this application.
(116) The technical solutions provided in this application may be applied to a policy and charging control scenario of an operator, and may be further applied to a video key quality indicator (KQI) scenario, for example, a content delivery network (CDN) traffic distinguishing scenario. In this scenario, common traffic is generated for a reason similar to that in the foregoing embodiments, and attribution of common traffic used by different applications in a CDN may be basically identified and distinguished according to the method provided in the foregoing embodiments, so as to accurately meet a video KQI statistics collection requirement. More broadly, the solutions provided in this application are applicable to any scenario in which common traffic generated by a common service needs to be distinguished.
(117) It should be noted that the module or unit division in the foregoing embodiments is only shown as an example, and functions of the described modules are merely described as an example. This application is not limited thereto. A person of ordinary skill in the art may combine functions of two or more modules according to a requirement, or divide functions of one module to obtain more modules with a finer granularity, or there may be other variants.
(118) For same or similar parts of the embodiments described above, mutual reference may be made to the embodiments.
(119) The described apparatus embodiments are merely examples. The modules described as separate parts may or may not be physically separated, and parts shown as modules may or may not be physical modules, may be located in one position, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments. In addition, in the accompanying drawings of the apparatus embodiments provided in this application, connection relationships between modules indicate that the modules have communication connections to each other, and may be specifically implemented as one or more communications buses or signal cables. A person of ordinary skill in the art may understand and implement the embodiments of this application without creative efforts.
(120) The foregoing descriptions are merely specific implementations of this application, but are not intended to limit the protection scope of this application.