AUTOMATIC EVENT GRAPH CONSTRUCTION METHOD AND DEVICE FOR MULTI-SOURCE VULNERABILITY INFORMATION
20230035121 · 2023-02-02
Inventors
- Ying WEI (Jiangsu, CN)
- Xiaobing Sun (Jiangsu, CN)
- Lili BO (Jiangsu, CN)
- Bin Li (Jiangsu, CN)
- Xingqi CHENG (Jiangsu, CN)
Cpc classification
G06N7/01
PHYSICS
Y02D10/00
GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
G06N3/0442
PHYSICS
International classification
G06F21/57
PHYSICS
Abstract
Provided is an automatic event graph construction method for multi-source vulnerability information. The method includes the following steps. A vulnerability report is crawled from a vulnerability database, a cause of vulnerability is taken as an event trigger word, and a vulnerability type is determined through the cause of vulnerability. An attacker, consequence, location and other information in a description are identified by named-entity recognition, and information completion is performed. An explicit relation between events is extracted by using text information, an implicit relation between events is extracted by using text similarity, and vulnerability-related code representation is performed. Obtained vulnerability event information is visualized into an event graph through a visualization tool.
Claims
1. An automatic event graph construction method for multi-source vulnerability information, comprising: step 1, crawling a vulnerability report from a vulnerability database according to a Common Vulnerabilities and Exposures identifier (CVE-ID), and constructing a vulnerability report data set; step 2, taking a cause of vulnerability as an event trigger word, constructing a vulnerability event trigger word label set, extracting a trigger word of a vulnerability event, and determining a vulnerability type through the trigger word; step 3, extracting vulnerability event elements from description information of a vulnerability by named-entity recognition, and performing information completion; step 4, extracting an explicit relation between vulnerability events by using text information, and extracting an implicit relation between vulnerability events by using text similarity; step 5, performing vulnerability-related code representation; and step 6, visualizing vulnerability event information obtained from steps 2 to 5 into a vulnerability event graph; wherein the graph comprises vulnerability event related elements and a relation between vulnerability events, and the vulnerability event is associated with the vulnerability type through the event trigger word.
2. The automatic event graph construction method for the multi-source vulnerability information according to claim 1, wherein the step 1 comprises: collecting a vulnerability report from vulnerability databases Common Vulnerabilities and Exposures (CVE), National Vulnerability Database (NVD) and IBM X-Force Exchange according to the CVE-ID; and acquiring description information, a release date, a Common Vulnerability Scoring System (CVSS) score, a Common Weakness Enumeration (CWE) category and a related link in the report to obtain a vulnerability report data set.
3. The automatic event graph construction method for the multi-source vulnerability information according to claim 1, wherein the step 2 comprises: performing sequence labeling task training on a Bidirectional Encoder Representations from Transformers (BERT) model by using a vulnerability event trigger word label set, and extracting trigger words by using the trained model; and classifying the extracted trigger words by using a softmax classifier, wherein the vulnerability type comprises at least one of a time-related vulnerability, a configuration vulnerability, an input validation vulnerability, a memory vulnerability, a logic resource vulnerability, a numeric vulnerability or an unknown vulnerability.
4. The automatic event graph construction method for the multi-source vulnerability information according to claim 2, wherein the vulnerability event elements extracted in the step 3 comprise a triggering operation, an occurrence situation, an attacker, an affected version, a consequence, and a location; and the step 3 comprises: training a sequence labeling task of a BERT model by using a constructed vulnerability event element label set, and performing event element extraction by connecting a Bi-Directional Long Short-Term Memory (BiLSTM) layer and a Conditional Random Field (CRF) layer using the trained model.
5. The automatic event graph construction method for the multi-source vulnerability information according to claim 2, wherein the step 3 comprises: in a case where part of the event elements are missed in CVE and NVD descriptions, completing the event elements by using a description in IBM X-Force Exchange.
6. The automatic event graph construction method for the multi-source vulnerability information according to claim 1, wherein the step 4 comprises: extracting an explicit relation between vulnerability events through a sentence pattern template, wherein an explicit vulnerability relation type comprises at least one of a similar relation, a causal relation, a brother relation, a regression relation, an included relation or a dependency relation.
7. The automatic event graph construction method for the multi-source vulnerability information according to claim 1, wherein the step 4 comprises: extracting a vulnerability implicit similar relation by calculating a vectorized cosine similarity of vulnerability description information.
8. The automatic event graph construction method for the multi-source vulnerability information according to claim 1, wherein the step 5 comprises: representing a vulnerability code as at least one of an abstract syntax tree (AST), a control-flow graph (CFG) or a program dependence graph (PDG).
9. An automatic event graph construction device for multi-source vulnerability information, comprising: a memory, a processor, and a computer program stored on the memory and executable by the processor, wherein the computer program, when loaded into the processor, performs the following steps: step 1, crawling a vulnerability report from a vulnerability database according to a Common Vulnerabilities and Exposures identifier (CVE-ID), and constructing a vulnerability report data set; step 2, taking a cause of vulnerability as an event trigger word, constructing a vulnerability event trigger word label set, extracting a trigger word of a vulnerability event, and determining a vulnerability type through the trigger word; step 3, extracting vulnerability event elements from description information of a vulnerability by named-entity recognition, and performing information completion; step 4, extracting an explicit relation between vulnerability events by using text information, and extracting an implicit relation between vulnerability events by using text similarity; step 5, performing vulnerability-related code representation; and step 6, visualizing vulnerability event information obtained from steps 2 to 5 into a vulnerability event graph; wherein the graph comprises vulnerability event related elements and a relation between vulnerability events, and the vulnerability event is associated with the vulnerability type through the event trigger word.
10. The device according to claim 9, wherein the step 1 comprises: collecting a vulnerability report from vulnerability databases Common Vulnerabilities and Exposures (CVE), National Vulnerability Database (NVD) and IBM X-Force Exchange according to the CVE-ID; and acquiring description information, a release date, a Common Vulnerability Scoring System (CVSS) score, a Common Weakness Enumeration (CWE) category and a related link in the report to obtain a vulnerability report data set.
11. The device according to claim 9, wherein the step 2 comprises: performing sequence labeling task training on a Bidirectional Encoder Representations from Transformers (BERT) model by using a vulnerability event trigger word label set, and extracting trigger words by using the trained model; and classifying the extracted trigger words by using a softmax classifier, wherein the vulnerability type comprises at least one of a time-related vulnerability, a configuration vulnerability, an input validation vulnerability, a memory vulnerability, a logic resource vulnerability, a numeric vulnerability or an unknown vulnerability.
12. The device according to claim 10, wherein the vulnerability event elements extracted in the step 3 comprise a triggering operation, an occurrence situation, an attacker, an affected version, a consequence, and a location; and the step 3 comprises: training a sequence labeling task of a BERT model by using a constructed vulnerability event element label set, and performing event element extraction by connecting a Bi-Directional Long Short-Term Memory (BiLSTM) layer and a Conditional Random Field (CRF) layer using the trained model.
13. The device according to claim 10, wherein the step 3 comprises: in a case where part of the event elements are missed in CVE and NVD descriptions, completing the event elements by using a description in IBM X-Force Exchange.
14. The device according to claim 9, wherein the step 4 comprises: extracting an explicit relation between vulnerability events through a sentence pattern template, wherein an explicit vulnerability relation type comprises at least one of a similar relation, a causal relation, a brother relation, a regression relation, an included relation or a dependency relation.
15. The device according to claim 9, wherein the step 4 comprises: extracting a vulnerability implicit similar relation by calculating a vectorized cosine similarity of vulnerability description information.
16. The device according to claim 9, wherein the step 5 comprises: representing a vulnerability code as at least one of an abstract syntax tree (AST), a control-flow graph (CFG) or a program dependence graph (PDG).
17. A non-transitory computer-readable storage medium, which is configured to store a computer program, wherein a processor is configured to, when executing the computer program, perform the following steps: step 1, crawling a vulnerability report from a vulnerability database according to a Common Vulnerabilities and Exposures identifier (CVE-ID), and constructing a vulnerability report data set; step 2, taking a cause of vulnerability as an event trigger word, constructing a vulnerability event trigger word label set, extracting a trigger word of a vulnerability event, and determining a vulnerability type through the trigger word; step 3, extracting vulnerability event elements from description information of a vulnerability by named-entity recognition, and performing information completion; step 4, extracting an explicit relation between vulnerability events by using text information, and extracting an implicit relation between vulnerability events by using text similarity; step 5, performing vulnerability-related code representation; and step 6, visualizing vulnerability event information obtained from steps 2 to 5 into a vulnerability event graph; wherein the graph comprises vulnerability event related elements and a relation between vulnerability events, and the vulnerability event is associated with the vulnerability type through the event trigger word.
Description
DETAILED DESCRIPTION
[0039] To make the objects, solutions and advantages of the present application more apparent, a more detailed description is given hereinafter to illustrate the present application in conjunction with drawings and embodiments. It is to be understood that the embodiments described herein are intended to explain the present application and not to limit the present application.
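[0040] In an embodiment, in conjunction with the accompanying drawings, an automatic event graph construction method for multi-source vulnerability information is provided. The method includes the following steps.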
[0041] In step 1, a vulnerability report data set is constructed.
[0042] In step 2, an event trigger word is extracted, and a vulnerability type is determined through the trigger word.
[0043] In step 3, vulnerability event elements are extracted, and information completion is performed. In step 4, a relation between vulnerability events is extracted.
[0044] In step 5, vulnerability-related code representation is performed.
[0045] In step 6, the vulnerability event information obtained in the above steps is visualized into a vulnerability event graph.
[0046] Further, in an embodiment, the process of constructing the vulnerability report data set in step 1 includes steps 1-1 and 1-2.
[0047] In step 1-1, a vulnerability report is collected from the vulnerability databases CVE, NVD and IBM X-Force Exchange according to a CVE-ID; screenshots of vulnerability reports from the three databases are shown in the accompanying drawings.
[0048] In step 1-2, the extracted vulnerability report is preprocessed to remove redundant information, and the description information, issue date, CVSS score, CWE category and related links in the report are acquired to obtain the vulnerability report data set.
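Steps 1-1 and 1-2 can be illustrated with a minimal sketch that merges already-downloaded records from the three sources into one entry per CVE-ID. Crawling itself is omitted. The record field names (`source`, `cve_id`, `description`, and so on), the `VulnerabilityReport` class and the `build_dataset` function are assumptions made for illustration and are not part of the disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class VulnerabilityReport:
    """One merged entry of the vulnerability report data set (hypothetical layout)."""
    cve_id: str
    descriptions: Dict[str, str] = field(default_factory=dict)  # source name -> description
    release_date: Optional[str] = None
    cvss_score: Optional[float] = None
    cwe_category: Optional[str] = None
    links: List[str] = field(default_factory=list)


def build_dataset(raw_records):
    """Group raw per-source records by CVE-ID and keep only the fields named in step 1-2."""
    dataset = {}
    for rec in raw_records:
        report = dataset.setdefault(rec["cve_id"], VulnerabilityReport(rec["cve_id"]))
        # Preprocessing stand-in: collapse redundant whitespace in the description.
        description = " ".join(rec.get("description", "").split())
        if description:
            report.descriptions[rec["source"]] = description
        # Keep the first non-empty value seen for each scalar field.
        report.release_date = report.release_date or rec.get("release_date")
        if report.cvss_score is None:
            report.cvss_score = rec.get("cvss_score")
        report.cwe_category = report.cwe_category or rec.get("cwe_category")
        for link in rec.get("links", []):
            if link not in report.links:
                report.links.append(link)
    return dataset
```

Keeping one description per source, rather than a single merged text, preserves the IBM X-Force Exchange description for the later information completion step.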
[0049] Further, the process of extracting the trigger word of the vulnerability event and determining the vulnerability type through the trigger word in step 2 includes steps 2-1, 2-2, and 2-3.
[0050] In step 2-1, the description in the vulnerability report is manually labeled by BIO labeling, and the cause of a vulnerability event is taken as the event trigger word. As shown in FIG. 3, “Double free vulnerability” is the cause (that is, the trigger word) in the description sentence, so the three words are labeled as “B-Trigger”, “I-Trigger”, and “I-Trigger”, respectively. All other words in the sentence are labeled as “O”. In this manner, a vulnerability event trigger word label set is constructed; 80% of the description data is randomly selected as a training set, and the remaining 20% is taken as a test set.
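The BIO labeling and the 80%/20% split of step 2-1 can be sketched as follows. The helper names `bio_label` and `split_dataset`, the fixed random seed, and the example sentence beyond its first three words are assumptions for illustration.

```python
import random


def bio_label(tokens, trigger_span):
    """Label tokens with BIO tags; trigger_span is (start, end) with end exclusive."""
    start, end = trigger_span
    labels = ["O"] * len(tokens)
    for i in range(start, end):
        labels[i] = "B-Trigger" if i == start else "I-Trigger"
    return labels


def split_dataset(samples, train_ratio=0.8, seed=0):
    """Randomly split labeled samples into a training set and a test set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical sentence; only the first three words come from the example in the text.
tokens = ["Double", "free", "vulnerability", "in", "an", "example", "function"]
labels = bio_label(tokens, (0, 3))
```

Here `labels` holds `B-Trigger`, `I-Trigger`, `I-Trigger` for the first three tokens and `O` for the rest, matching the labeling described for “Double free vulnerability”.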
[0051] In step 2-2, sequence labeling task training is performed on a BERT model by using the vulnerability event trigger word label set constructed in step 2-1, and trigger words are extracted by using the trained model, as shown in the accompanying drawings.
[0052] In step 2-3, the trigger word extracted in step 2-2 is classified by a softmax classifier, and the categories are shown in Table 1. In multinomial logistic regression and linear discriminant analysis, the input of the softmax function is the result of K different linear functions, and the probability that the sample vector x belongs to the j-th category is:
$$P(y = j \mid x) = \frac{e^{x^{T} W_{j}}}{\sum_{k=1}^{K} e^{x^{T} W_{k}}}$$
[0053] In the above function, y denotes the category, x is the sample vector, $x^{T}$ is the transpose of the sample vector, and W is the weight parameter, with $W_{j}$ denoting the weights of the j-th linear function. The exponential function in the numerator maps a real-valued output to the range from zero to positive infinity, and the denominator sums the results over all categories to normalize them. The probabilities that the sample vector x belongs to each category therefore sum to 1, and the category with the highest probability is selected as the category of the sample.
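The softmax classification of step 2-3 can be sketched in plain Python: each of the K categories has a linear score, the scores are exponentiated and normalized, and the category with the highest probability is chosen. The function names, the toy weights and the two-dimensional feature vector are assumptions; in the described method the input would come from the trained model.

```python
import math

# Type abbreviations from Table 1.
VUL_TYPES = ["TIM", "CON", "INP", "MEM", "LOG", "NUM", "UNK"]


def softmax_probabilities(x, W):
    """Return P(y = j | x) for every category j, given one weight vector per category."""
    scores = [sum(xi * wi for xi, wi in zip(x, w_j)) for w_j in W]
    # Subtracting the maximum score leaves the result unchanged but avoids overflow.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(x, W, labels=VUL_TYPES):
    """Pick the category with the highest softmax probability."""
    probs = softmax_probabilities(x, W)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs
```

Subtracting the maximum score before exponentiation is a standard numerical-stability step and does not change the probabilities defined by the softmax function.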
TABLE 1 Trigger word type table of vulnerability event
Trigger word type of vulnerability event | Type abbreviation | Related description
Time-related Vul | TIM | Security vulnerabilities that can be exploited for attack under specific race conditions or other strict time requirements (such as thread deadlock and thread concurrency).
Configuration Vul | CON | Security vulnerabilities exploited by an attacker through certain security configuration errors (such as improper access control, default settings, and authority management).
Input Validation Vul | INP | Security vulnerabilities caused by improper input validation (such as SQL injection, cross-site scripting (XSS), format string attacks, and HTTP response splitting).
Memory Vul | MEM | Attacks on a software system by consuming a large amount of physical resources (such as memory) without releasing them, or by manipulating such resources in the wrong way (such as buffer overflows, memory leaks, and stack depletion).
Logic Resource Vul | LOG | Security vulnerabilities that can be exploited by repeatedly leaking logic resources without releasing them (such as computing time and infinite loops).
Numeric Vul | NUM | Security vulnerabilities that can be exploited by accumulating numeric errors (such as integer overflows).
Unknown Vul | UNK | Security vulnerabilities whose category cannot be analyzed due to insufficient description information in the security vulnerability report.
[0054] Further, the process of extracting vulnerability event elements and performing information completion in step 3 includes steps 3-1, 3-2, 3-3, and 3-4.
[0055] In step 3-1, the event elements of the vulnerability event are defined, including a triggering operation, an occurrence situation, an attacker, an affected version, a consequence, and a location, whose descriptions are shown in Table 2.
TABLE 2 Vulnerability event element description table
Vulnerability event element type | Element type abbreviation | Related description
Triggering Operation | Ope | operations through which the vulnerability is triggered, such as a crafted application that uses the fork system call
Occurrence Situation | Sit | situations where a vulnerability occurs, for example, Secure Sockets Layer (SSL) is used
Attacker | Atk | vulnerability attackers, such as remote attacker and local user
Affected Version | Ver | product versions affected by the vulnerability, such as Linux kernel before 2.6.34
Consequence | Con | consequences caused by the vulnerability, such as execution of arbitrary commands
Location | Loc | locations where the vulnerability occurs, such as dispatch_cmd function in prot.c
[0056] In step 3-2, the description in the vulnerability report is manually labeled to construct the vulnerability event element label set.
[0057] In step 3-3, sequence labeling task training is performed on a BERT model by using the vulnerability event element label set constructed in step 3-2, and the trained model is connected to a BiLSTM layer and a CRF layer to perform event element extraction, as shown in the accompanying drawings.
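The output of a sequence labeling model such as the one in step 3-3 is one tag per token, which still has to be grouped into event elements. The sketch below shows that grouping only, not the BERT+BiLSTM+CRF model itself. It assumes BIO tags built from the Table 2 abbreviations (for example `B-Atk`, `I-Con`); the tag naming and the function `decode_elements` are illustrative assumptions.

```python
def decode_elements(tokens, tags):
    """Group BIO-tagged tokens into (element_type, text) pairs in order of appearance."""
    elements = []
    current_type, current_tokens = None, []

    def flush():
        # Store the element collected so far, if any.
        if current_type is not None:
            elements.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and tag[2:] == current_type:
            current_tokens.append(token)
        else:
            # "O" tags and stray "I-" tags end the current element.
            flush()
            current_type, current_tokens = None, []
    flush()
    return elements


tokens = ["remote", "attackers", "execute", "arbitrary", "commands"]
tags = ["B-Atk", "I-Atk", "O", "B-Con", "I-Con"]
elements = decode_elements(tokens, tags)
```

For this input, `elements` contains an attacker element "remote attackers" and a consequence element "arbitrary commands".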
[0058] In step 3-4, since some event elements may be missing from the CVE and NVD descriptions, event elements are also extracted from the description in IBM X-Force Exchange by using the BERT+BiLSTM+CRF model of steps 3-1 to 3-3, and the missing elements are completed with the extracted result.
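Information completion can be sketched as a fallback merge: elements extracted from the CVE and NVD descriptions are kept, and any element type they lack is filled from the elements extracted from the IBM X-Force Exchange description. The dictionary layout and the function name `complete_elements` are assumptions for illustration.

```python
# Element type abbreviations from Table 2.
ELEMENT_TYPES = ("Ope", "Sit", "Atk", "Ver", "Con", "Loc")


def complete_elements(primary, supplementary):
    """Fill element types missing from `primary` (CVE/NVD) using `supplementary` (X-Force)."""
    completed = {}
    for element_type in ELEMENT_TYPES:
        value = primary.get(element_type) or supplementary.get(element_type)
        if value:
            completed[element_type] = value
    return completed


cve_nvd = {"Atk": "remote attackers", "Con": "execution of arbitrary commands"}
x_force = {"Atk": "a remote attacker", "Loc": "dispatch_cmd function in prot.c"}
merged = complete_elements(cve_nvd, x_force)
```

Values already present in the CVE and NVD descriptions take precedence, so the supplementary source only contributes the missing location here.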
[0059] Further, the process of extracting the relation between vulnerability events in step 4 includes steps 4-1 and 4-2.
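[0060] In step 4-1, an explicit relation between vulnerability events is extracted from the text information through sentence pattern templates. The explicit relation types and their related sentence patterns are shown in Table 3.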
TABLE 3 Vulnerability explicit relation type table
Vulnerability explicit relation type | Relation type abbreviation | Related sentence pattern
Similar Relation | Sim | a similar issue to . . . ; this issue is very similar to . . .
Causal Relation | Cau | this vulnerability exists because of an incorrect/incomplete/insufficient fix for . . .
Brother Relation | Bro | a different vulnerability than . . . ; this issue was SPLIT from . . .
Regression Relation | Reg | this issue exists because of a . . . regression
Included Relation | Inc | the scope of . . . is limited to the . . . product; this is an even more permissive variant of . . .
Dependency Relation | Dep | this can be leveraged/exploited by . . . attackers using . . .
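The sentence pattern templates of Table 3 can be approximated with regular expressions. The sketch below assumes that the elided part of each pattern that names the related vulnerability is a CVE-ID; this assumption, the exact regular expressions, and the function name `extract_explicit_relations` are illustrative and are not taken from the disclosure. The CVE-IDs in the example are placeholders.

```python
import re

CVE_ID = r"(CVE-\d{4}-\d{4,})"

# (relation abbreviation from Table 3, regular expression approximating the template)
TEMPLATES = [
    ("Sim", r"a similar issue to\s+" + CVE_ID),
    ("Sim", r"this issue is very similar to\s+" + CVE_ID),
    ("Cau", r"exists because of an (?:incorrect|incomplete|insufficient) fix for\s+" + CVE_ID),
    ("Bro", r"a different vulnerability than\s+" + CVE_ID),
    ("Bro", r"this issue was SPLIT from\s+" + CVE_ID),
    ("Reg", r"exists because of a\s+" + CVE_ID + r"\s+regression"),
    ("Inc", r"the scope of\s+" + CVE_ID + r"\s+is limited to"),
    ("Inc", r"an even more permissive variant of\s+" + CVE_ID),
    ("Dep", r"can be (?:leveraged|exploited) by\b.*?\battackers using\s+" + CVE_ID),
]


def extract_explicit_relations(cve_id, description):
    """Return (source CVE-ID, relation abbreviation, target CVE-ID) triples."""
    relations = []
    for relation, pattern in TEMPLATES:
        for match in re.finditer(pattern, description, flags=re.IGNORECASE):
            relations.append((cve_id, relation, match.group(1).upper()))
    return relations


text = ("Buffer overflow in an example parser. NOTE: this vulnerability exists "
        "because of an incomplete fix for CVE-0000-0002.")
relations = extract_explicit_relations("CVE-0000-0001", text)
```

For the example text, the only template that matches is the causal one, giving a single `Cau` relation from CVE-0000-0001 to CVE-0000-0002.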
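[0061] In step 4-2, an implicit similar relation between vulnerabilities is extracted by vectorizing the vulnerability description information and calculating the cosine similarity between the resulting vectors.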
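The implicit similar relation of step 4-2 rests on the cosine similarity of vectorized descriptions. The text does not fix the vectorization method or a decision threshold, so the sketch below assumes a simple bag-of-words term count and a threshold of 0.8; both choices, and the function names, are illustrative only.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (an assumed, deliberately simple vectorization)."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def implicit_similar_relations(descriptions, threshold=0.8):
    """Return (id_a, "Sim", id_b, score) for every pair at or above the threshold."""
    ids = sorted(descriptions)
    vectors = {cve_id: vectorize(descriptions[cve_id]) for cve_id in ids}
    relations = []
    for index, id_a in enumerate(ids):
        for id_b in ids[index + 1:]:
            score = cosine_similarity(vectors[id_a], vectors[id_b])
            if score >= threshold:
                relations.append((id_a, "Sim", id_b, score))
    return relations
```

A learned sentence embedding could replace `vectorize` without changing the rest of the sketch, since only the cosine computation depends on the vectors.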
[0062] Further, the process of performing the vulnerability-related code representation in step 5 includes: all data in the vulnerability data set is subjected to code representation, and vulnerability code is represented as an AST, a CFG and a PDG with the tool Joern. The AST, the CFG and the PDG together represent the code as a composite graph structure, in which directed edges between graph nodes express the data flow and control flow of the code.
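The composite graph structure can be pictured as one node set shared by three kinds of directed edges. The sketch below is a hand-built toy structure and does not call Joern or reproduce its output format; the class `CompositeCodeGraph`, the node numbering and the three-statement example are assumptions for illustration.

```python
from collections import defaultdict


class CompositeCodeGraph:
    """Nodes shared by AST, CFG and PDG edges, each stored as directed (src, dst) pairs."""

    EDGE_TYPES = ("AST", "CFG", "PDG")

    def __init__(self):
        self.nodes = {}                # node id -> code text
        self.edges = defaultdict(set)  # edge type -> set of (src, dst)

    def add_node(self, node_id, code):
        self.nodes[node_id] = code

    def add_edge(self, edge_type, src, dst):
        if edge_type not in self.EDGE_TYPES:
            raise ValueError(f"unknown edge type: {edge_type}")
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges[edge_type].add((src, dst))

    def successors(self, node_id, edge_type):
        return sorted(dst for src, dst in self.edges[edge_type] if src == node_id)


# Toy double-free example: the same pointer is released twice.
graph = CompositeCodeGraph()
graph.add_node(0, "function body")
graph.add_node(1, "p = malloc(n)")
graph.add_node(2, "free(p)")
graph.add_node(3, "free(p)")
for child in (1, 2, 3):
    graph.add_edge("AST", 0, child)   # syntactic containment
graph.add_edge("CFG", 1, 2)           # execution order
graph.add_edge("CFG", 2, 3)
graph.add_edge("PDG", 1, 2)           # both calls depend on the value of p
graph.add_edge("PDG", 1, 3)
```

Storing the edge type separately keeps the three views distinguishable while letting a query follow, for instance, only data dependences from the allocation to both `free` calls.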
[0063] Further, the process of visualizing the vulnerability event information obtained in the above steps into a vulnerability event graph in step 6 includes: the event trigger words, event elements, event relations and code representations obtained in steps 2, 3, 4 and 5 are visualized with the tool Neo4j to form a vulnerability event graph.
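One possible way to load the extracted information into Neo4j is to turn each vulnerability event into parameterized Cypher statements. The sketch below only builds the statements and does not connect to a database. The node labels (`Vulnerability`, `Element`), the relationship names, the property names and the event dictionary layout are all hypothetical choices, not a schema given in the disclosure.

```python
# Hypothetical relationship names for the relation abbreviations of Table 3.
RELATION_TYPES = {
    "Sim": "SIMILAR", "Cau": "CAUSAL", "Bro": "BROTHER",
    "Reg": "REGRESSION", "Inc": "INCLUDED", "Dep": "DEPENDENCY",
}


def event_to_cypher(event):
    """Return a list of (cypher_query, parameters) pairs for one vulnerability event."""
    cve_id = event["cve_id"]
    statements = [(
        "MERGE (v:Vulnerability {cve_id: $cve_id}) "
        "SET v.trigger = $trigger, v.vul_type = $vul_type",
        {"cve_id": cve_id, "trigger": event["trigger"], "vul_type": event["vul_type"]},
    )]
    for kind, text in sorted(event.get("elements", {}).items()):
        statements.append((
            "MATCH (v:Vulnerability {cve_id: $cve_id}) "
            "MERGE (e:Element {kind: $kind, text: $text}) "
            "MERGE (v)-[:HAS_ELEMENT]->(e)",
            {"cve_id": cve_id, "kind": kind, "text": text},
        ))
    for abbreviation, target in event.get("relations", []):
        # The relationship type is looked up in a fixed table before being
        # placed in the query text; unknown abbreviations raise KeyError.
        relation_type = RELATION_TYPES[abbreviation]
        statements.append((
            "MATCH (a:Vulnerability {cve_id: $src}) "
            "MERGE (b:Vulnerability {cve_id: $dst}) "
            f"MERGE (a)-[:{relation_type}]->(b)",
            {"src": cve_id, "dst": target},
        ))
    return statements


event = {
    "cve_id": "CVE-0000-0001",  # placeholder identifier
    "trigger": "Double free vulnerability",
    "vul_type": "MEM",
    "elements": {"Atk": "remote attackers", "Con": "execution of arbitrary commands"},
    "relations": [("Cau", "CVE-0000-0002")],
}
statements = event_to_cypher(event)
```

Using `MERGE` rather than `CREATE` keeps repeated loads from duplicating nodes, and passing values as parameters avoids quoting problems in description text.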
[0064] The present disclosure provides an automatic event graph construction technology and system for multi-source vulnerability information in the field of vulnerability data mining. The technology integrates a variety of information related to vulnerabilities, identifies the information, and finally visualizes the information by using visualization tools. First, the automatic construction of the vulnerability event graph enables developers to understand vulnerability events and the factors related to them more intuitively, reduces the manpower and time that developers spend manually analyzing and understanding vulnerability data, and improves the effectiveness and efficiency of software maintenance. Second, the constructed vulnerability event graph can serve researchers as a foundation for vulnerability analysis and repair: by using the knowledge formed from the large amount of data in the graph, vulnerabilities can be analyzed and repaired more quickly and accurately, thereby reducing the potential safety hazards and economic losses caused by vulnerabilities.
[0065] Based on the same inventive concept, in an embodiment, the present disclosure provides an automatic event graph construction system for multi-source vulnerability information. The system includes a data set construction module, a trigger word extraction module, a vulnerability event element identification module, a vulnerability event relation identification module, a vulnerability code representation module, and a visualization module. The data set construction module is configured to crawl a vulnerability report from a vulnerability database according to a CVE-ID and construct a vulnerability report data set. The trigger word extraction module is configured to take a cause of vulnerability as an event trigger word, construct a vulnerability event trigger word label set, extract a trigger word of a vulnerability event, and determine a vulnerability type through the trigger word. The vulnerability event element identification module is configured to extract vulnerability event elements from description information of a vulnerability by named-entity recognition and perform information completion. The vulnerability event relation identification module is configured to extract an explicit relation between vulnerability events by using text information and extract an implicit relation between vulnerability events by using text similarity. The vulnerability code representation module is configured to perform vulnerability-related code representation. The visualization module is configured to visualize the obtained vulnerability event information into a vulnerability event graph, where the graph includes vulnerability event related elements and a relation between vulnerability events, and vulnerability events are associated with vulnerability types through the event trigger word.
For the implementation details of each module, reference may be made to the above-mentioned automatic event graph construction method for multi-source vulnerability information, which will not be repeated here.
[0066] Based on the same inventive concept, in an embodiment, the present disclosure provides an automatic event graph construction device for multi-source vulnerability information. The device includes a memory, a processor, and a computer program stored on the memory and executable by the processor, where the computer program, when loaded into the processor, performs the above-mentioned automatic event graph construction method for multi-source vulnerability information.
[0067] The above illustrates and describes the basic principles, main features and advantages of the present disclosure. It is to be understood by those skilled in the art that the present disclosure is not limited to the above embodiments. The above embodiments and description merely illustrate the principle of the present disclosure. Various modifications and improvements may be made to the present disclosure without departing from its spirit and scope, and these modifications and improvements all fall within the scope of the present disclosure. The scope of the present disclosure is defined by the appended claims and equivalents thereof.