System and method of synthesizing potential malware for predicting a cyberattack

20230205877 · 2023-06-29

    Abstract

    A system and method for malware classification using machine learning models trained using synthesized feature sets based on features extracted from samples of known malicious objects and known safe objects. The synthesized feature sets act as virtual samples for training a machine learning classifier to recognize new objects in the wild that are likely to be malicious.

    Claims

    1. A method for malware detection in a computer system comprising the following steps: a. extracting static and dynamic features of a known malware sample; b. extracting static and dynamic features of a known clean object sample; c. preparing a synthetic malware feature dataset and a clean objects feature dataset; d. training a malware classification machine learning model based on the synthetic malware feature dataset and the clean objects feature dataset; e. obtaining an unknown system object for malware analysis; and f. classifying the unknown system object, wherein the result of classification includes one or more of the following: a rate of conformity with at least one class of objects, a determination if the file is malicious or not, and a determination of malware type.

    2. The method of claim 1, wherein the step of preparing a synthetic malware feature dataset and clean objects feature dataset further comprises the steps of: a. grouping features in datasets by the type of feature; and b. synthesizing new feature sets in the synthetic malware feature dataset, wherein each new feature set is a combination of the extracted feature sets related to a first known malware sample and the result of a substitution of at least one feature related to a first known malware sample with at least one feature related to a second known malware sample.

    3. The method of claim 2, wherein the substitution of at least one feature is performed for features from the same group.

    4. The method of claim 3, wherein the step of synthesizing new feature sets further comprises the step of selecting feature sets related to malware samples of one class of objects.

    5. The method of claim 4, wherein the one class of objects is defined using at least one of static analysis, dynamic analysis, sample execution log analysis, and malware classification based on static and dynamic analysis.

    6. The method of claim 5, wherein the step of synthesizing new feature sets in the malware feature dataset further comprises filtering out features corresponding to known clean object samples.

    7. A system for malware detection in a networked computer system comprising: a. static and dynamic feature extractors for identifying feature vectors in a known malware sample; b. static and dynamic feature extractors for identifying feature vectors of a known clean object sample; c. a synthetic malware feature dataset and a clean objects feature dataset prepared from the features extracted by the static and dynamic feature extractors; and d. a malware classification machine learning model trained on the synthetic malware feature dataset and the clean objects feature dataset.

    8. The system of claim 7, wherein the synthetic malware feature dataset and clean objects feature dataset includes grouped features in datasets by the type of feature.

    9. The system of claim 8, wherein the synthetic malware feature dataset is a combination of the extracted feature sets related to a first known malware sample and the result of a substitution of at least one feature related to a first known malware sample with at least one feature related to a second known malware sample.

    10. The system of claim 9, wherein the substitution of at least one feature is performed for features from the same group.

    11. The system of claim 10, wherein the synthetic malware dataset comprises feature sets related to malware samples of one class of objects.

    12. The system of claim 11, wherein the one class of objects is defined using at least one of static analysis, dynamic analysis, sample execution log analysis, and malware classification based on static and dynamic analysis.

    13. The system of claim 12, wherein the synthetic malware feature dataset has been filtered to remove feature vectors corresponding to known clean object samples.

    14. A method for malware detection in a computer system comprising the following steps: a. loading a malware feature dataset and a clean objects feature dataset, wherein the datasets include static and dynamic feature sets; b. grouping features in datasets by the type of features; c. selecting feature sets related to malware samples of one class of objects; d. synthesizing new feature sets in the malware feature dataset; e. training a malware classification machine learning model based on the static and dynamic features from the malware feature dataset extended with the new feature sets and the clean objects feature dataset; f. preparing a synthetic malware feature dataset and a clean objects feature dataset; g. obtaining an unknown system object for malware analysis; and h. classifying the unknown system object, wherein the result of classification includes one or more of the following: a rate of conformity with at least one class of objects, a determination if the file is malicious or not, and a determination of malware type.

    15. The method of claim 14, wherein the class of objects in step (c) is defined using at least one of static analysis, dynamic analysis, sample execution log analysis, and malware classification based on static and dynamic analysis.

    16. The method of claim 15, wherein each new feature set is a combination of the feature set from a selected feature set related to a first known malware sample and a result of substitution of at least one feature from the selected feature set related to the first known malware sample with at least one feature from the selected feature set related to a second known malware sample.

    17. The method of claim 16, wherein the substitution of at least one feature is performed for features from the same group.

    18. The method of claim 16, wherein the step of synthesizing new feature sets further comprises filtering out features corresponding to known clean object samples.

    Description

    DESCRIPTION OF THE DRAWINGS

    [0011] FIG. 1 shows a system for training a machine learning malware classification model.

    [0012] FIG. 2 shows the system for training a machine learning malware classification model like FIG. 1 but with additional details related to feature synthesis.

    [0013] FIG. 3 shows a system for static and dynamic analysis of object samples.

    [0014] FIG. 4 shows an example of building a sample of vectors corresponding to a certain class of malicious objects where the attributes of the sample are mixed.

    [0015] FIG. 5 shows an example of building a sample of vectors including a filtered feature set and feature substitution.

    [0016] FIG. 6 shows a method of training a malware classification machine learning model and classifying malware by synthesizing feature sets from malware and clean collections.

    [0017] FIG. 7 shows a method of training a malware classification machine learning model and classifying malware by creating a synthesized dataset from static and dynamic features, where the features in the datasets are grouped by feature type.

    [0018] FIG. 8 shows a method of training a malware classification machine learning model and classifying malware by creating a synthesized dataset from selected features related to malware samples of one class of objects.

    DETAILED DESCRIPTION

    [0019] The invention comprises a system and method for training and using machine learning malware classification models. Synthetic datasets are created and used for training a machine learning malware classifier. These synthetic datasets improve the ability of machine learning models to accurately detect and classify malware. These synthetic datasets act as virtual samples that allow machine learning classifiers to be trained to detect previously unknown malware. The invention improves machine learning malware classifiers by increasing classification accuracy and reducing false positives. Increased accuracy by a malware classifier improves the efficiency of a computer system by protecting it from new malware threats, while reducing false positives ensures the usefulness of the computer system for its intended tasks. The improved malware classifier can also be used for penetration testing. Synthetic malware datasets can be used to create hypothetical “new” malware objects for testing purposes. These new objects can be used to test the detection capabilities of existing computer security systems to rate the

    [0020] In machine learning, a feature is an input variable used in making predictions or classifications. Feature engineering is the process of determining which features might be useful in training a machine learning model, and then converting raw data from log files and other sources into those features. Feature extraction aims to reduce the number of features in a dataset by creating new features from the existing ones (and then discarding the original features).
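The conversion of raw log data into features described in paragraph [0020] can be illustrated with a minimal sketch. The log line layout (`<timestamp> <event_type> <detail>`), the event names, and the function `log_to_features` are assumptions made for illustration only; the source does not specify a log format or a feature encoding.

```python
from collections import Counter

# Hypothetical activity-log lines in the assumed layout:
# "<timestamp> <event_type> <detail>"
LOG_LINES = [
    "0.01 api_call CreateFileW",
    "0.02 api_call WriteFile",
    "0.03 file_write C:\\Temp\\a.tmp",
    "0.04 api_call WriteFile",
    "0.05 net_connect 203.0.113.7:443",
]


def log_to_features(lines):
    """Count events per type and per API name to form a feature dictionary."""
    counts = Counter()
    for line in lines:
        _timestamp, event_type, detail = line.split(maxsplit=2)
        counts[f"count:{event_type}"] += 1
        if event_type == "api_call":
            counts[f"api:{detail}"] += 1
    return dict(counts)


features = log_to_features(LOG_LINES)
```

Each key in `features` is one named feature, and the dictionary as a whole plays the role of a feature set for one sample.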

    [0021] Malicious processes in computer systems can be detected using dynamic analysis and static analysis. Dynamic analysis, also called “behavior analysis,” focuses on how an untrusted file or process acts. Static analysis, on the other hand, is concerned with what can be known about an untrusted file or process before runtime.

    [0022] FIG. 1 shows how the machine learning classification model is trained by extracting static and dynamic features from a malware collection and a clean objects collection. The system 100 comprises malware collection 102 and clean objects collection 104. These collections 102, 104 communicate with static analysis feature extractor 106 and dynamic analysis feature extractor 108. In turn, static analysis feature extractor 106 and dynamic analysis feature extractor 108 pass extracted dataset features to malware feature dataset 110 and clean objects feature dataset 112. These datasets 110, 112 interact with malware classification machine learning module 114.

    [0023] Module 114 comprises a file with functions for training malware classification machine learning model 116. For example, in a Python environment, module 114 contains variables of different types, such as arrays, dictionaries, and objects, and is saved in a file with the .py extension.

    [0024] Machine learning model 116 is a file that has been trained to recognize certain types of patterns in a given dataset. Pattern recognition is achieved by way of functions and algorithms provided by module 114.

    [0025] The system of FIG. 2 resembles FIG. 1 but shows additional details related to feature synthesis, including a feature synthesizing unit and a synthesized feature dataset. System 200 comprises malware collection 202 and clean objects collection 204. These collections communicate with static analysis and dynamic analysis feature extractors 206, 208.

    [0026] Feature synthesis is accomplished through the interaction of malware feature dataset 210, synthesized feature dataset 212, feature synthesizing unit 214, and clean objects feature dataset 216. The extractors 206, 208 pass extracted dataset features to both malware feature dataset 210 and clean objects feature dataset 216. Feature synthesizing unit 214 is passed feature data from malware feature dataset 210 and clean objects feature dataset 216. Feature synthesizing unit 214 mixes features from datasets 210 and 216 and passes the resulting mixed features to synthesized feature dataset 212.

    [0027] Malware classification machine learning module 218 comprises a file with functions for training malware classification machine learning model 220. For example, in a Python environment, module 218 contains variables of different types, such as arrays, dictionaries, and objects, and is saved in a file with the .py extension.

    [0028] Machine learning model 220 is a file that has been trained to recognize certain types of patterns in a given dataset. Pattern recognition is achieved by way of functions and algorithms provided by module 218. In this configuration, module 218 is passed synthesized feature dataset 212 and clean objects feature dataset 216. Thus, model 220 is trained from “virtual” malware data rather than from known malware samples.

    [0029] FIG. 3 shows system 300 for static and dynamic analysis of an object collection 302 comprising object samples 304. Threat analysis server 306 is configured for dynamic analysis of sample 304 by way of running the sample as application 308. Activity monitor 310 records information about the activity of application 308 during runtime. Monitor 310 passes features identified during runtime to dynamic feature extractor 312. Object sample 304 is also passed to static feature extractor 314 for static feature extraction. The dynamic and static feature extractors 312, 314 pass extracted features to malware feature dataset 316, synthesized feature dataset 318, and clean objects feature dataset 320. Extracted static and dynamic features are passed to malware feature dataset 316 or clean objects feature dataset 320 depending on the nature of object collection 302 from which sample 304 was obtained.
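As an illustration of the static side of FIG. 3, the sketch below computes a few features from an object's raw bytes without executing it. The chosen features (size, byte entropy, share of printable bytes) and the function names are assumptions; the source does not list the static features that extractor 314 produces.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the byte-level Shannon entropy in bits (0.0 for empty input)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


def extract_static_features(data: bytes) -> dict:
    """Derive illustrative static features from the raw bytes of an object."""
    printable = sum(1 for b in data if 32 <= b < 127)
    return {
        "size": len(data),
        "entropy": shannon_entropy(data),
        "printable_ratio": printable / len(data) if data else 0.0,
    }


# Hypothetical sample: every byte value exactly once.
sample = bytes(range(256))
static_features = extract_static_features(sample)
```

Dynamic features would come from a separate step that observes the running sample, which this sketch does not attempt.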

    [0030] Activity monitor 310 also passes features identified during runtime to sample execution log 322. Log data from execution log 322 is then passed to feature synthesizing unit 324. Feature synthesizing unit 324 interacts with the malware, synthesized, and clean objects feature datasets 316, 318, and 320. The mixing of features among various feature datasets, such as malware, synthesized, and clean objects feature datasets 316, 318, and 320, is shown in detail in FIGS. 4 and 5.

    [0031] The output of the mixed datasets 316, 318, and 320 is passed to malware classification machine learning training unit 326, which trains malware classification machine learning model 328. In an embodiment, malware classification machine learning model 328 passes threat detection updates 330 to protected computer systems 332.

    [0032] FIG. 4 shows example 400 of building synthetic feature vectors corresponding to a certain class of malicious objects where the attributes of the sample are mixed. A feature vector is the list of feature values representing a row of a dataset. Known labeled malware objects 402 include object samples 1, 2, 3, . . . K (404). These object samples include feature sets 1, 2, 3, . . . K (406) and the feature sets are used to create synthesized malware objects 410 comprising feature sets x, x+1, x+2, and x+3 (412). Static features 414 are represented by the prefix A and dynamic features 416 are represented by the prefix B. For example, feature set 1 (406) comprises a given number of static features A11, A12, . . . A1n and a given number of dynamic features B11, B12, . . . B1m. Feature set 2 (406) likewise comprises static features A21, A22, . . . A2n and dynamic features B21, B22, . . . B2m. Feature sets 3 through K (406) follow this pattern, where the last static feature is represented by n and the last dynamic feature is represented by m.

    [0033] Synthesized feature sets x, x+1, x+2, and x+3 (412) comprise mixed static and dynamic features taken from the static features 414 and dynamic features 416 from feature sets 3 and K. For example, feature set x (412) comprises static features AK1, A32, . . . A3n and dynamic features B31, B32, . . . B3m.
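The mixing shown in FIG. 4 can be sketched as follows, assuming each feature set is a dictionary keyed by feature position and that each position of a synthetic set is drawn from one of two parent sets of the same object class. The random choice, the seed, and the name `mix_feature_sets` are illustrative assumptions; the source does not say how the mixed positions are selected.

```python
import random


def mix_feature_sets(first, second, rng):
    """Build one synthetic feature set by taking each position from either parent."""
    if first.keys() != second.keys():
        raise ValueError("feature sets must share the same feature names")
    return {name: rng.choice((first[name], second[name])) for name in first}


# Placeholder values named after the labels used in FIG. 4.
feature_set_3 = {"A1": "A31", "A2": "A32", "B1": "B31", "B2": "B32"}
feature_set_K = {"A1": "AK1", "A2": "AK2", "B1": "BK1", "B2": "BK2"}

rng = random.Random(0)  # fixed seed so repeated runs give the same sets
synthetic_sets = [mix_feature_sets(feature_set_3, feature_set_K, rng) for _ in range(4)]
```

Every value in a synthetic set comes from one of the two parents at the same position, so static features stay in static positions and dynamic features stay in dynamic positions.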

    [0034] Static features 414 and dynamic features 416 are divided into one group of features 420 and one type of feature 422. A group of features comprises, for example, stack traces, API call sequences, operations with files, or operations with a register or network. A group of features may also include file modifications or file reads. Feature sets 3 and K (406) and feature sets x through x+3 (412) comprise an object class 430 of features from known labeled malware objects and synthesized malware objects. The static features and dynamic features found in the known labeled malware objects 402 in feature sets 3 and K (432) comprise object class 432. Class-defining features 440 are the features in object class 432 that are mixed and used to populate the static and dynamic features for synthetic feature sets x, x+1, x+2, and x+3.

    [0035] FIG. 5 shows example 500 of using known labeled malware objects and known labeled clean objects to create synthesized malware objects. Known malware objects 502 include object samples 1-K (504) with corresponding feature sets 1-K (506). Synthesized malware objects 510 have corresponding feature sets x, x+1, x+2, and x+3 (512). Feature sets 506 and 512 comprise static features 514 and dynamic features 516. The static and dynamic features 514, 516 in feature sets 1-K (506) and feature sets 512 are grouped into a first feature group 520 and a second feature group 522. These groups 520, 522 are used as parameters for feature substitution. For example, features within first group 520 and second group 522 are substituted for other static and dynamic features in the same group. In the example shown in FIG. 5, the substituted static features are A11 with AK2 and AK1 with A12. The substituted dynamic features are B11 with BK2 and BK1 with B12. These substitutions take place between feature sets 1 and K (512).
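The same-group substitution of FIG. 5 can be sketched as below. The mapping of feature positions to groups (`GROUPS`), the group labels, and the function `substitute` are hypothetical; the source states only that substitution is performed between features of the same group.

```python
def substitute(first, second, groups, target, source):
    """Copy `first`, replacing feature `target` with the value of `second[source]`.

    The substitution is rejected unless both feature names belong to the
    same group, which is the constraint this sketch is meant to show.
    """
    if groups[target] != groups[source]:
        raise ValueError(f"{target!r} and {source!r} are in different groups")
    synthetic = dict(first)
    synthetic[target] = second[source]
    return synthetic


# Hypothetical assignment of feature positions to the two groups.
GROUPS = {"A1": "group_520", "A2": "group_520", "B1": "group_522", "B2": "group_522"}

feature_set_1 = {"A1": "A11", "A2": "A12", "B1": "B11", "B2": "B12"}
feature_set_K = {"A1": "AK1", "A2": "AK2", "B1": "BK1", "B2": "BK2"}

# Replace A11 with AK2, one of the substitutions named in the FIG. 5 example.
new_set = substitute(feature_set_1, feature_set_K, GROUPS, target="A1", source="A2")
```

Restricting substitution to one group keeps the synthetic set plausible: a value is only ever replaced by another value of the same kind.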

    [0036] A filtered feature set 524 corresponding to feature set x+2 (512) is defined in relation to known labeled clean objects 530. These known labeled clean objects 530 have corresponding feature sets 1, 2, 3, . . . L (534). Feature sets 1-L comprise static features 536 and dynamic features 538. For feature set 1, the static features are C11, C12, . . . C1n and the dynamic features are D11, D12, . . . D1m. For feature set 2, the static features are C21, C22, . . . C2n and the dynamic features are D21, D22, . . . D2m. Feature set 3 has static features A11, AK2, . . . AKn and dynamic features BK1, B12, . . . B3m. This feature set (A11, AK2, . . . AKn and BK1, B12, . . . B3m) also appears in synthesized malware objects feature set x+2, where it is identified as filtered feature set 524.
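A minimal sketch of the filtering step follows, assuming that a synthesized feature set is filtered out when it exactly matches the feature set of a known clean object. Exact matching and the name `filter_synthetic` are assumptions; the source does not state the matching criterion, and a looser similarity test would be equally consistent with it.

```python
def filter_synthetic(synthetic_sets, clean_sets):
    """Split synthetic feature sets into (kept, filtered) by exact match with clean sets."""
    clean_keys = {tuple(sorted(fs.items())) for fs in clean_sets}
    kept, filtered = [], []
    for fs in synthetic_sets:
        key = tuple(sorted(fs.items()))
        (filtered if key in clean_keys else kept).append(fs)
    return kept, filtered


# Placeholder values named after the labels used in FIG. 5.
clean_sets = [
    {"A1": "C11", "A2": "C12", "B1": "D11", "B2": "D12"},
    {"A1": "A11", "A2": "AK2", "B1": "BK1", "B2": "B12"},  # clean feature set 3
]
synthetic_sets = [
    {"A1": "AK2", "A2": "A12", "B1": "B11", "B2": "B12"},
    {"A1": "A11", "A2": "AK2", "B1": "BK1", "B2": "B12"},  # coincides with a clean set
]

kept, filtered = filter_synthetic(synthetic_sets, clean_sets)
```

Removing such sets keeps the classifier from being trained to label a feature combination as malicious when a known clean object already exhibits it.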

    [0037] FIG. 6 shows a method 600 for training a malware classification machine learning model and classifying malware by synthesizing feature sets from malware and clean collections. At step 602 static and dynamic features are extracted from known malware samples from a malware collection. Then at step 604 static and dynamic features are extracted from known clean object samples from a clean objects collection. At step 606, a malware feature dataset and a clean objects feature dataset are prepared for machine learning analysis. A malware classification machine learning model is trained at step 608 based on static and dynamic features from the malware feature dataset and the clean objects feature dataset. An unknown system object for malware analysis is obtained at step 610. The object is then classified with the malware classification machine learning model at step 612. The result of the classification includes one or more of the following: determining a rate of conformity to at least one class of objects, determining if the file is malicious or clean, and determining malware type if malicious.
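The train-then-classify flow of method 600 can be sketched with a deliberately simple stand-in model. The source does not name a learning algorithm, so a nearest-centroid classifier is swapped in here purely for illustration; the numeric feature vectors, the labels, and the way the rate of conformity is computed (normalized inverse distance) are all assumptions.

```python
import math


def centroid(vectors):
    """Return the component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def train(malware_vectors, clean_vectors):
    """Train the stand-in model: one centroid per class."""
    return {"malicious": centroid(malware_vectors), "clean": centroid(clean_vectors)}


def classify(model, vector):
    """Return (verdict, rates), where rates maps each class to a conformity rate."""
    distances = {label: math.dist(vector, c) for label, c in model.items()}
    # Closer centroids get larger weights; the small constant avoids division by zero.
    weights = {label: 1.0 / (d + 1e-9) for label, d in distances.items()}
    total = sum(weights.values())
    rates = {label: w / total for label, w in weights.items()}
    return max(rates, key=rates.get), rates


# Hypothetical two-dimensional feature vectors.
malware_vectors = [[0.9, 0.8], [0.8, 0.9], [1.0, 0.7]]
clean_vectors = [[0.1, 0.2], [0.2, 0.1], [0.0, 0.3]]

model = train(malware_vectors, clean_vectors)
verdict, rates = classify(model, [0.85, 0.75])
```

Here `verdict` corresponds to the malicious-or-clean determination and `rates` to the rate of conformity with each class; determining a malware type would need one centroid per malware class rather than a single "malicious" centroid.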

    [0038] FIG. 7 shows a method 700 of training a malware classification machine learning model and classifying malware by creating a synthesized dataset from static and dynamic features. A malware feature dataset and a clean objects feature dataset are loaded for machine-learning data analysis at step 702. The loaded datasets include static and dynamic feature sets. At step 704, the features in these datasets are grouped by feature type. Then new feature sets are synthesized in a malware feature dataset at step 706. Each new feature set is a combination of the loaded feature set related to a first known malware sample and a result of substitution of at least one feature related to the first known malware sample with at least one feature related to a second malware sample. The substitution is preferably performed for features from the same group. At step 708, the malware classification machine learning model is trained on static and dynamic features from the malware feature dataset, extended with the new synthesized feature sets, and from the clean objects feature dataset. At step 710 an unknown system object is obtained for malware analysis. The object is classified with the trained malware classification machine learning model at step 712. The result of classification includes at least one of the following: determining a rate of conformity to at least one class of objects, determining if the file is malicious or clean, and determining the type of malware if the file is malicious.

    [0039] FIG. 8 shows a method 800 of training a malware classification machine learning model and classifying malware by creating a synthesized dataset from selected features related to malware samples of one class of objects. A malware feature dataset and a clean objects feature dataset are loaded for machine-learning data analysis at step 802. The loaded datasets include static and dynamic feature sets. At step 804, the features in these datasets are grouped by feature type. Feature sets related to malware samples of one class of objects are selected at step 806. The class of objects is defined using at least one of static analysis, dynamic analysis, sample execution log analysis, and malware classification based on static and dynamic analysis. Then new feature sets are synthesized in a malware feature dataset at step 810.

    [0040] Each new feature set is a combination of the selected feature set related to a first known malware sample and a result of substitution of at least one feature related to the first known malware sample with at least one feature related to a second malware sample. The substitution is preferably performed for features from the same group. At step 812, the malware classification machine learning model is trained on static and dynamic features from the malware feature dataset, extended with the new feature sets, and from the clean objects feature dataset. At step 814 an unknown system object is obtained for malware analysis. The object is classified with the trained malware classification machine learning model at step 816. The result of classification includes at least one of the following: determining a rate of conformity to at least one class of objects, determining if the file is malicious or clean, and determining the type of malware if the file is malicious.
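The class selection that distinguishes method 800 can be sketched as follows: feature sets are first restricted to one class of objects, and only then are new sets synthesized and appended to the dataset. The record layout, the class labels, and the choice to swap a single named feature across every ordered pair are assumptions made for illustration.

```python
from itertools import permutations


def select_class(dataset, label):
    """Return the feature sets of all samples labeled with the given class."""
    return [row["features"] for row in dataset if row["class"] == label]


def synthesize_within_class(feature_sets, feature_name):
    """For each ordered pair, copy the first set with one feature taken from the second."""
    new_sets = []
    for first, second in permutations(feature_sets, 2):
        candidate = {**first, feature_name: second[feature_name]}
        # Skip candidates that duplicate an existing or already synthesized set.
        if candidate not in feature_sets and candidate not in new_sets:
            new_sets.append(candidate)
    return new_sets


# Hypothetical labeled malware feature dataset.
DATASET = [
    {"class": "ransomware", "features": {"A1": 1, "B1": 10}},
    {"class": "ransomware", "features": {"A1": 2, "B1": 20}},
    {"class": "trojan", "features": {"A1": 9, "B1": 90}},
]

selected = select_class(DATASET, "ransomware")
extended = selected + synthesize_within_class(selected, "A1")
```

Because synthesis runs only over `selected`, no feature from the other class leaks into the extended dataset for the chosen class.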