System and method of synthesizing potential malware for predicting a cyberattack
20230205877 · 2023-06-29
Inventors
- Sergey Ulasen (Saint Petersburg, RU)
- Vladimir Strogov (Singapore, SG)
- Serguei Beloussov (Singapore, SG)
- Stanislav Protasov (Singapore, SG)
Cpc classification
G06F21/56
PHYSICS
G06F21/566
PHYSICS
International classification
Abstract
A system and method for malware classification using machine learning models trained using synthesized feature sets based on features extracted from samples of known malicious objects and known safe objects. The synthesized feature sets act as virtual samples for training a machine learning classifier to recognize new objects in the wild that are likely to be malicious.
Claims
1. A method for malware detection in a computer system comprising the following steps: a. extracting static and dynamic features of a known malware sample; b. extracting static and dynamic features of a known clean object sample; c. preparing a synthetic malware feature dataset and a clean objects feature dataset; d. training a malware classification machine learning model based on the synthetic malware feature dataset and the clean objects feature dataset; e. obtaining an unknown system object for malware analysis; and f. classifying the unknown system object, wherein the result of classification includes one or more of the following: a rate of conformity with at least one class of objects, a determination if the file is malicious or not, and a determination of malware type.
2. The method of claim 1, wherein the step of preparing a synthetic malware feature dataset and clean objects feature dataset further comprises the steps of: a. grouping features in datasets by the type of feature; and b. synthesizing new feature sets in the synthetic malware feature dataset, wherein each new feature set is a combination of the extracted feature sets related to a first known malware sample and the result of a substitution of at least one feature related to a first known malware sample with at least one feature related to a second known malware sample.
3. The method of claim 2, wherein the substitution of at least one feature is performed for features from the same group.
4. The method of claim 3, wherein the step of synthesizing new feature sets further comprises the step of selecting feature sets related to malware samples of one class of objects.
5. The method of claim 4, wherein the one class of objects is defined using at least one of static analysis, dynamic analysis, sample execution log analysis, and malware classification based on static and dynamic analysis.
6. The method of claim 5, wherein the step of synthesizing new feature sets in the malware feature dataset further comprises filtering out features corresponding to known clean object samples.
7. A system for malware detection in a networked computer system comprising: a. static and dynamic feature extractors for identifying feature vectors in a known malware sample; b. static and dynamic feature extractors for identifying feature vectors of a known clean object sample; c. a synthetic malware feature dataset and a clean objects feature dataset prepared from the features extracted by the static and dynamic feature extractors; and d. a malware classification machine learning model trained on the synthetic malware feature dataset and the clean objects feature dataset.
8. The system of claim 7, wherein the synthetic malware feature dataset and clean objects feature dataset includes grouped features in datasets by the type of feature.
9. The system of claim 8, wherein the synthetic malware feature dataset is a combination of the extracted feature sets related to a first known malware sample and the result of a substitution of at least one feature related to a first known malware sample with at least one feature related to a second known malware sample.
10. The system of claim 9, wherein the substitution of at least one feature is performed for features from the same group.
11. The system of claim 10, wherein the synthetic malware dataset comprises feature sets related to malware samples of one class of objects.
12. The system of claim 11, wherein the one class of objects is defined using at least one of static analysis, dynamic analysis, sample execution log analysis, and malware classification based on static and dynamic analysis.
13. The system of claim 12, wherein the synthetic malware feature dataset has been filtered to remove feature vectors corresponding to known clean object samples.
14. A method for malware detection in a computer system comprising the following steps: a. loading a malware feature dataset and a clean objects feature dataset, wherein the datasets include static and dynamic feature sets; b. grouping features in datasets by the type of features; c. selecting feature sets related to malware samples of one class of objects; d. synthesizing new feature sets in the malware feature dataset; e. training a malware classification machine learning model based on the static and dynamic features from the malware feature dataset extended with the new feature sets and the clean objects feature dataset; f. preparing a synthetic malware feature dataset and a clean objects feature dataset; g. obtaining an unknown system object for malware analysis; and h. classifying the unknown system object, wherein the result of classification includes one or more of the following: a rate of conformity with at least one class of objects, a determination if the file is malicious or not, and a determination of malware type.
15. The method of claim 14, wherein the class of objects in step (c) is defined using at least one of static analysis, dynamic analysis, sample execution log analysis, and malware classification based on static and dynamic analysis.
16. The method of claim 15, wherein each new feature set is a combination of the feature set from a selected feature set related to a first known malware sample and a result of substitution of at least one feature from the selected feature set related to the first known malware sample with at least one feature from the selected feature set related to a second known malware sample.
17. The method of claim 16, wherein the substitution of at least one feature is performed for features from the same group.
18. The method of claim 16, wherein the step of synthesizing new feature sets further comprises filtering out features corresponding to known clean object samples.
Description
DESCRIPTION OF THE DRAWINGS
[0011]
[0012]
[0013]
[0014]
[0015]
[0016]
[0017]
[0018]
DETAILED DESCRIPTION
[0019] The invention comprises a system and method for training and using machine learning malware classification models. Synthetic datasets are created and used for training a machine learning malware classifier. These synthetic datasets act as virtual samples that allow machine learning classifiers to be trained to detect previously unknown malware, which improves the ability of the models to accurately detect and classify malware. The invention improves machine learning malware classifiers by increasing classification accuracy and reducing false positives. Increased classifier accuracy improves the efficiency of a computer system by protecting it from new malware threats, while reducing false positives preserves the usefulness of the computer system for its intended tasks. The improved malware classifier can also be used for penetration testing. Synthetic malware datasets can be used to create hypothetical “new” malware objects for testing purposes. These new objects can be used to test the detection capabilities of existing computer security systems to rate the
[0020] In the context of machine learning, a feature is an input variable used in making predictions or classifications. Feature engineering is the process of determining which features might be useful in training a machine learning model, and then converting raw data from log files and other sources into those features. Feature extraction aims to reduce the number of features in a dataset by creating new features from the existing ones (and then discarding the original features).
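As an illustration of converting raw log data into features, the following minimal sketch counts how often each event name occurs in an execution log. The log line layout (timestamp, event name, argument) and the name log_to_features are assumptions made for this example and are not taken from the disclosure.

```python
from collections import Counter
from typing import Dict, Iterable


def log_to_features(lines: Iterable[str]) -> Dict[str, int]:
    """Count occurrences of each event name in a hypothetical execution log.

    Assumed line layout: "<timestamp> <event_name> <argument>".
    Lines with fewer than two fields are skipped.
    """
    counts: Counter = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            counts[parts[1]] += 1
    return dict(counts)


# Hypothetical log lines, invented for illustration only.
log = [
    "0.01 open_file a.txt",
    "0.02 write_file a.txt",
    "0.03 open_file b.txt",
    "",
]
features = log_to_features(log)
print(features)  # {'open_file': 2, 'write_file': 1}
```

Each count can then serve as one input variable for a model; other encodings (presence flags, sequences) would be equally plausible.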
[0021] Malicious processes in computer systems can be detected using dynamic analysis and static analysis. Dynamic analysis, also called “behavior analysis,” focuses on how an untrusted file or process acts. Static analysis, on the other hand, is concerned with what can be known about an untrusted file or process before runtime.
[0022]
[0023] Module 114 comprises a file with functions for training malware classification machine learning model 116. For example, in a Python environment, module 114 contains variables of different types, such as arrays, dictionaries, and objects, and is saved in a file with the .py extension.
[0024] Machine learning model 116 is a file that has been trained to recognize certain types of patterns in a given dataset. Pattern recognition is achieved by way of functions and algorithms provided by module 114.
[0025] The system of
[0026] Feature synthesis is accomplished through the interaction of malware feature dataset 210, synthesized feature dataset 212, feature synthesizing unit 214, and clean objects feature dataset 216. The extractors 206, 208 pass extracted dataset features to both malware feature dataset 210 and clean objects feature dataset 216. Feature synthesizing unit 214 is passed feature data from malware feature dataset 210 and clean objects feature dataset 216. Feature synthesizing unit 214 mixes features from datasets 210 and 216 and passes the resulting mixed features to synthesized feature dataset 212.
[0027] Malware classification machine learning module 218 comprises a file with functions for training malware classification machine learning model 220. For example, in a Python environment, module 218 contains variables of different types, such as arrays, dictionaries, and objects, and is saved in a file with the .py extension.
[0028] Machine learning model 220 is a file that has been trained to recognize certain types of patterns in a given dataset. Pattern recognition is achieved by way of functions and algorithms provided by module 218. In this configuration, module 218 is passed a synthesized feature dataset 212 and a clean objects feature dataset. Thus, model 220 is trained from “virtual” malware data rather than from known malware samples.
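The training arrangement described above, in which the model sees synthesized feature vectors and clean feature vectors, can be sketched as follows. The disclosure does not specify a learning algorithm; a nearest-centroid classifier is swapped in here purely to keep the example short and dependency-free. The binary feature vectors and the names train and classify are assumptions for illustration.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def centroid(rows: List[Vector]) -> List[float]:
    """Column-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def train(synthesized: List[Vector], clean: List[Vector]) -> Dict[str, List[float]]:
    """Build one centroid per label from the two training datasets."""
    return {"malicious": centroid(synthesized), "clean": centroid(clean)}


def classify(model: Dict[str, List[float]], vector: Vector) -> str:
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def distance(center: List[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(vector, center))
    return min(model, key=lambda label: distance(model[label]))


# Hypothetical binary feature vectors (1 = feature present).
synthesized_dataset = [[1, 1, 0], [1, 0, 1]]  # "virtual" malware samples
clean_dataset = [[0, 0, 1], [0, 1, 0]]

model = train(synthesized_dataset, clean_dataset)
print(classify(model, [1, 1, 1]))  # malicious
print(classify(model, [0, 0, 0]))  # clean
```

Any supervised classifier could take the place of the centroid rule; the point of the sketch is only that the malicious class is learned from synthesized rather than directly observed samples.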
[0029]
[0030] Activity monitor 310 also passes features identified during runtime to sample execution log 322. Log data from execution log 322 is then passed to feature synthesizing unit 324. Feature synthesizing unit 324 interacts with the malware, synthesized, and clean objects feature datasets 316, 318, and 320. The mixing of features among various feature datasets, such as malware, synthesized, and clean objects feature datasets 316, 318, and 320, is shown in detail in
[0031] The output of the mixed datasets 316, 318, and 320 is passed to malware classification machine learning training unit 326, which trains malware classification machine learning model 328. In an embodiment, malware classification machine learning model 328 passes threat detection updates 330 to protected computer systems 332.
[0032]
[0033] Synthesized feature sets x, x+1, x+2, and x+3 (412) comprise mixed static and dynamic features taken from the static features 414 and dynamic features 416 from feature sets 3 and K. For example, feature set x (412) comprises static features AK1, A32, . . . A3n and dynamic features B31, B32, . . . B3m.
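The construction of feature set x can be sketched as a single substitution: start from feature set 3 and replace its first static feature with the corresponding feature from feature set K. The FeatureSet container, the function name substitute_static, and the choice of n = m = 3 are assumptions for illustration; the feature labels follow the notation used above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FeatureSet:
    """Hypothetical container for one sample's static and dynamic features."""
    static: Tuple[str, ...]
    dynamic: Tuple[str, ...]


def substitute_static(first: FeatureSet, second: FeatureSet, index: int) -> FeatureSet:
    """Copy `first`, replacing the static feature at `index` with the
    static feature at the same position in `second`."""
    mixed = list(first.static)
    mixed[index] = second.static[index]
    return FeatureSet(static=tuple(mixed), dynamic=first.dynamic)


# Feature sets 3 and K, shortened to three features each for the example.
set_3 = FeatureSet(static=("A31", "A32", "A33"), dynamic=("B31", "B32", "B33"))
set_k = FeatureSet(static=("AK1", "AK2", "AK3"), dynamic=("BK1", "BK2", "BK3"))

set_x = substitute_static(set_3, set_k, index=0)
print(set_x.static)   # ('AK1', 'A32', 'A33')
print(set_x.dynamic)  # ('B31', 'B32', 'B33')
```

Varying the index, or applying the same idea to the dynamic features, would yield further synthesized sets such as x+1 through x+3.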
[0034] Static features 414 and dynamic features 416 are divided into one group of features 420 and one type of feature 422. A group of features comprises, for example, stack traces, API call sequences, operations with files, or operations with a register or network. Group features may also include file modifications or reading files. Feature sets 3 and K (406) and feature sets x through x+3 (412) comprise an object class 430 of features from known labeled malware objects and synthesized malware objects. The static features and dynamic features found in the known labeled malware objects 402 in feature sets 3 and K (432) comprise object class 432. Class-defining features 440 are the features in object class 432 that are mixed and used to populate the static and dynamic features for synthetic feature sets x, x+1, x+2, and x+3.
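Grouping features by type makes it possible to restrict substitution to features of the same group. The sketch below represents each sample as a mapping from group name to feature values and swaps one whole group between two samples. The group names, the feature values, and the function substitute_group are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

FeatureGroups = Dict[str, List[str]]


def substitute_group(first: FeatureGroups, second: FeatureGroups, group: str) -> FeatureGroups:
    """Return a copy of `first` in which the features of `group` are taken
    from `second`. Both samples must contain the group, so a feature is
    only ever replaced by a feature of the same group."""
    if group not in first or group not in second:
        raise KeyError(f"group {group!r} must be present in both samples")
    mixed = {name: list(values) for name, values in first.items()}
    mixed[group] = list(second[group])
    return mixed


# Hypothetical grouped features for two malware samples of one class.
sample_3 = {
    "api_call_sequences": ["call_a", "call_b"],
    "file_operations": ["read:config", "modify:document"],
}
sample_k = {
    "api_call_sequences": ["call_c", "call_d"],
    "file_operations": ["read:registry_export"],
}

mixed = substitute_group(sample_3, sample_k, "api_call_sequences")
print(mixed["api_call_sequences"])  # ['call_c', 'call_d']
print(mixed["file_operations"])     # ['read:config', 'modify:document']
```

Swapping at the level of a whole group is one possible granularity; substituting individual features within a group would follow the same pattern.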
[0035]
[0036] A filtered feature set 524 corresponding to feature set x+2 (512) is defined in relation to known labeled clean objects 530. These known labeled clean objects 530 have corresponding feature sets 1, 2, 3, . . . L (534). Feature sets 1-L comprise static features 536 and dynamic features 538. For feature set 1, the static features are C11, C12, . . . C1n and the dynamic features are D11, D12, . . . D1m. For feature set 2, the static features are C21, C22, . . . C2n and the dynamic features are D21, D22, . . . D2m. Feature set 3 has static features A11, AK2, . . . AKn and dynamic features BK1, B12, . . . B3m. This feature set (A11, AK2, . . . AKn and BK1, B12, . . . B3m) also appears in synthesized malware objects feature set x+2, where it is identified as filtered feature set 524.
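One reading of this filtering step is that a synthesized feature set is discarded when it coincides with the feature set of a known clean object. The sketch below implements that reading with an exact-match test. The tuple representation, the function name filter_known_clean, and the shortened two-feature sets are assumptions for illustration; a practical filter might use a looser similarity criterion.

```python
from typing import List, Tuple

# A feature set is modeled as (static features, dynamic features).
FeatureSet = Tuple[Tuple[str, ...], Tuple[str, ...]]


def filter_known_clean(synthesized: List[FeatureSet], clean: List[FeatureSet]) -> List[FeatureSet]:
    """Drop every synthesized feature set that exactly matches a feature
    set of a known clean object, preserving the original order."""
    clean_lookup = set(clean)
    return [feature_set for feature_set in synthesized if feature_set not in clean_lookup]


# Shortened, hypothetical feature sets using the labels from the text.
clean_sets = [
    (("C11", "C12"), ("D11", "D12")),  # clean feature set 1
    (("A11", "AK2"), ("BK1", "B12")),  # clean feature set 3
]
synthesized_sets = [
    (("AK1", "A32"), ("B31", "B32")),  # feature set x
    (("A11", "AK2"), ("BK1", "B12")),  # feature set x+2, matches a clean set
]

kept = filter_known_clean(synthesized_sets, clean_sets)
print(len(kept))  # 1
print(kept[0])    # (('AK1', 'A32'), ('B31', 'B32'))
```

Removing such overlaps keeps feature sets that also describe clean objects out of the malware training data, which is consistent with the stated goal of reducing false positives.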
[0037]
[0038]
[0039]
[0040] Each new feature set is a combination of the selected feature set related to a first known malware sample and a result of substitution of at least one feature related to the first known malware sample with at least one feature related to a second malware sample. The substitution is preferably performed for features from the same group. The training of a malware machine learning model takes place at step 812. The model is trained on static and dynamic features from the malware feature dataset, extended with the new feature sets, and from the clean objects feature dataset. At step 814 an unknown system object is obtained for malware analysis. The object is classified with the trained malware classification machine learning model at step 816. The result of classification includes at least one of the following: determining a rate of conformity to at least one class of objects, determining if the file is malicious or clean, and determining the type of malware if the file is malicious.
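The three possible outputs of the classification step can be carried in one result record. The following sketch assumes the classifier produces a conformity score per class and derives the other two outputs from the highest-scoring class. The ClassificationResult type, the summarize function, the class labels, and the scores are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ClassificationResult:
    """Hypothetical record of the classification outputs."""
    conformity: Dict[str, float]   # rate of conformity per class of objects
    is_malicious: bool             # malicious or clean
    malware_type: Optional[str]    # set only when the object is malicious


def summarize(conformity: Dict[str, float], clean_label: str = "clean") -> ClassificationResult:
    """Derive the verdict and malware type from per-class conformity scores.

    Assumption: the class with the highest score decides the verdict.
    """
    best = max(conformity, key=lambda label: conformity[label])
    is_malicious = best != clean_label
    return ClassificationResult(
        conformity=dict(conformity),
        is_malicious=is_malicious,
        malware_type=best if is_malicious else None,
    )


# Hypothetical scores for two unknown objects.
suspicious = summarize({"clean": 0.1, "ransomware": 0.7, "trojan": 0.2})
benign = summarize({"clean": 0.8, "ransomware": 0.1, "trojan": 0.1})

print(suspicious.is_malicious, suspicious.malware_type)  # True ransomware
print(benign.is_malicious, benign.malware_type)          # False None
```

A deployed system would likely apply a decision threshold rather than a plain maximum; that choice is outside what the description specifies.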