MULTIVARIATE MALWARE DETECTION METHODS AND SYSTEMS
20220366047 · 2022-11-17
Cpc classification
G06F21/566
Abstract
Methods and systems for detecting whether an executable file comprises malware are disclosed. The methods and systems rely on various feature extraction and feature representation processes to allow patterns associated with Portable Executable (PE) files to be analyzed in an improved representation space. In one example, six different feature sets are extracted from a PE file and represented in six different feature spaces, before being input into a multivariate ensemble deep neural network-based model.
Claims
1. A multivariate malware detection method comprising the steps of: receiving an executable file; extracting a plurality of feature sets from the executable file, to generate a plurality of extracted feature sets, the plurality of feature sets relating to characteristics of the executable file; representing the plurality of extracted feature sets in one or more corresponding feature spaces to generate a plurality of represented feature sets; inputting the plurality of represented feature sets into the inputs of a corresponding plurality of deep neural networks; combining the plurality of deep neural networks into a multivariate ensemble deep neural network; and detecting the presence of malware in the executable file based on the output of the multivariate ensemble deep neural network.
2. The multivariate malware detection method of claim 1, wherein one of the plurality of feature sets comprises header information relating to the parameters of the executable file.
3. The multivariate malware detection method of claim 2, wherein: the extracting step includes extracting a parameter feature set containing executable file parameters found in the header of the executable file to generate an extracted parameter feature set; the representing step includes representing the extracted parameter feature set using one-hot-encoding to generate a represented parameter feature set; and the inputting step comprises inputting the represented parameter feature set into the inputs of a multilayer perceptron.
4. The multivariate malware detection method of claim 3, wherein the executable file is a portable executable (PE) file and the parameters include one or more of the following PE file parameters: MajorSubsystemVersion, Machine, MajorOperatingSystemVersion, MinorLinkerVersion, and Subsystem.
5. The multivariate malware detection method of claim 1, wherein one of the plurality of feature sets comprises imported functions and libraries listed as being used by the executable file.
6. The multivariate malware detection method of claim 5, wherein: the extracting step includes extracting an import feature set containing a list of imported functions and libraries listed in the executable file to generate an extracted import feature set; the representing step includes representing the extracted import feature set as a list of imported function and library pairs; and the inputting step comprises inputting the represented import feature set into the inputs of a multilayer perceptron.
7. The multivariate malware detection method of claim 6, wherein the executable file is a portable executable (PE) file and the import feature set is at least partially extracted from a DIRECTORY_ENTRY_IMPORT object.
8. The multivariate malware detection method of claim 1, wherein one of the plurality of feature sets comprises the value of the bytes located in the section containing the entry point of the executable file.
9. The multivariate malware detection method of claim 8, wherein: the extracting step includes extracting an entry point feature set containing the value of the bytes located in the section containing the entry point of the executable file to generate an entry point feature set; the representing step includes representing each byte of the extracted entry point feature set as a pixel of color space and cropping the resulting pixels into an n by m pixel image; and the inputting step comprises inputting the n by m pixel image into the inputs of a two dimensional (2D) convolutional neural network.
10. The multivariate malware detection method of claim 9, wherein the color space is 8-bit grayscale and the image contains 32×32 pixels.
11. The multivariate malware detection method of claim 1, wherein one of the plurality of feature sets comprises characteristics of the assembly language instructions of the entry function of the executable file.
12. The multivariate malware detection method of claim 11, wherein: the extracting step includes extracting an entry function feature set containing characteristics of the assembly language instructions of the entry function of the executable file to generate an entry function feature set; the representing step includes representing the characteristics using a min-wise independent permutations locality sensitive hashing of ngram models of the entry function feature set; and the inputting step comprises inputting the represented entry function feature set into the inputs of a multilayer perceptron.
13. The multivariate malware detection method of claim 12, wherein the ngram model is a 1, 2 and 3 grams model.
14. The multivariate malware detection method of claim 1, wherein one of the plurality of feature sets comprises section characteristics of the executable file.
15. The multivariate malware detection method of claim 14, wherein: the extracting step includes extracting a section information feature set containing section characteristics of the executable file to generate an extracted section information feature set; the representing step includes representing the extracted section information feature set as a min-wise independent permutations locality sensitive hashing of a binary table representing the section characteristics to create a represented section information feature set; and the inputting step comprises inputting the represented section information feature set into the inputs of a multilayer perceptron.
16. The multivariate malware detection method of claim 15, wherein the executable file is a portable executable (PE) file and the section information feature set includes section name, section position, pointer to raw data, relative virtual address (RVA), size of raw data, virtual size, whether the Entry Point is within the section, and code, readable, writeable and executable flags.
17. The multivariate malware detection method of claim 1, wherein one of the plurality of feature sets comprises a plurality of printable strings and associated locations of each of the plurality of printable strings in the executable file.
18. The multivariate malware detection method of claim 17, wherein: the extracting step includes extracting a string feature set containing a plurality of printable strings and associated locations of each of the plurality of printable strings in the executable file to generate an extracted string feature set; the representing step includes representing the extracted string feature set as a vector of string statistics to generate a represented string feature set; and the inputting step comprises inputting the represented string feature set into the inputs of a multilayer perceptron.
19. The multivariate malware detection method of claim 1, wherein the step of combining comprises the step of concatenating the last hidden layer from each of the plurality of deep neural networks.
20. A multivariate detection system for detecting whether an executable file comprises malware, the system comprising: a processor; and at least one non-transitory memory containing instructions which when executed by the processor cause the system to: receive an executable file; extract a plurality of feature sets from the executable file, to generate a plurality of extracted feature sets, the plurality of feature sets relating to characteristics of the executable file; represent the plurality of extracted feature sets in one or more corresponding feature spaces to generate a plurality of represented feature sets; input the plurality of represented feature sets into the inputs of a corresponding plurality of deep neural networks; combine the plurality of deep neural networks into a multivariate ensemble deep neural network; and detect the presence of malware in the executable file based on the output of the multivariate ensemble deep neural network.
Description
DRAWINGS
[0021] In order that the claimed subject matter may be more fully understood, reference will be made to the accompanying drawings, in which:
DESCRIPTION OF VARIOUS EMBODIMENTS
[0039] It will be appreciated that, for simplicity and clarity of illustration, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements or steps. Numerous specific details are set forth in order to provide a thorough understanding of the exemplary embodiments of the subject matter described herein.
[0040] However, it will be understood by those of ordinary skill in the art that the embodiments described herein may be practiced without these specific details. In other instances, well-known methods, procedures and components have not been described in detail so as not to obscure the present subject matter. Furthermore, this description is not to be considered as limiting the scope of the subject matter in any way but rather as illustrating the various embodiments.
[0041] As used herein, an “executable file”, “executable program” or “executable” is defined as a file that can cause a computing device to perform indicated tasks according to encoded instructions.
[0042] As used herein, the term “Portable Executable (PE)” or “PE” is defined as a file format for various files, including but not limited to executable files, object code and Dynamic-Link Library (DLL), used in the Windows™ operating systems. The structure, characteristics, parameters and contents of portable executable files are well known to the skilled reader and are not included herein for the sake of brevity.
[0043] As used herein, the term “feature” is an individual measurable property or characteristic of an executable file which can be used to train a machine learning model. A feature can include, but is not limited to, information included in or referenced in the file header of an executable file, information included in or referenced in the section headers of an executable file and/or information included in or referenced in the sections of an executable file. As used herein, the term “feature set” is a set of one or more features.
[0044] As used herein, the term “feature space” is an n-dimensional reference space in which features can be represented. Feature representation is a technique used because machine learning models require inputs that are mathematically and computationally convenient to process.
[0045] As used herein, a “deep neural network” is a type of artificial neural network comprising an input layer, an output layer and a number of hidden layers between the input layer and the output layer.
[0046] In addition, as used herein, the wording “and/or” is intended to represent an inclusive-or. That is, “X and/or Y” is intended to mean X or Y or both, for example. As a further example, “X, Y, and/or Z” is intended to mean X or Y or Z or any combination thereof.
[0047]
[0048] Accordingly, when one or more portable executable files are received, each branch extracts a feature set, represents the extracted feature set in a feature space and inputs the represented feature set into a deep neural network that has been trained to detect patterns in the feature space, the patterns being associated with malware. The outputs of each branch are combined together to form a multivariate ensemble deep neural network architecture. The deep neural network architecture is said to be “multivariate” because it uses a plurality of feature sets, each containing one or more features. As such, the deep neural network uses multiple variables (i.e., features) as inputs. The deep neural network architecture is said to be an “ensemble” deep neural network architecture because it relies on ensemble machine learning, which combines the predictions from multiple neural network models in order to reduce variance of predictions and reduce generalization error.
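The combination of branches described above can be illustrated with a minimal sketch in plain Python. It assumes, as recited in the claims, that the last hidden layer of each branch is concatenated before a final output unit. The layer widths, the random weights standing in for trained parameters, and all function names are hypothetical and chosen only for illustration.

```python
import math
import random


def relu(v):
    return max(0.0, v)


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def dense(inputs, weights, biases, activation):
    """Fully connected layer: one output per row of `weights`."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def random_layer(n_in, n_out, rng):
    """Hypothetical stand-in for trained weights and biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def concatenate_hidden(branch_inputs, branch_layers):
    """Run each branch's hidden layer and concatenate the activations."""
    fused = []
    for x, (w, b) in zip(branch_inputs, branch_layers):
        fused.extend(dense(x, w, b, relu))
    return fused


def ensemble_score(branch_inputs, branch_layers, head):
    """Apply a single sigmoid output unit to the concatenated hidden layers."""
    head_w, head_b = head
    fused = concatenate_hidden(branch_inputs, branch_layers)
    return dense(fused, head_w, head_b, sigmoid)[0]


rng = random.Random(0)
input_sizes = [5, 128, 64]   # illustrative input widths for three branches
hidden_units = 8             # illustrative width of each last hidden layer
branch_layers = [random_layer(n, hidden_units, rng) for n in input_sizes]
head = random_layer(hidden_units * len(input_sizes), 1, rng)
branch_inputs = [[rng.random() for _ in range(n)] for n in input_sizes]
score = ensemble_score(branch_inputs, branch_layers, head)  # value between 0 and 1
```

In this sketch the fused vector has one entry per hidden unit of every branch, so adding or removing a branch only changes the input width of the final unit.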
[0049] While six branches are shown in the example of
[0050]
[0051] Processor 21 may comprise one or more processors for performing processing operations that implement functionality of the malware detection system 20. A processor of processor 21 may be a general-purpose processor executing program code stored in memory component 23 to which it has access. Alternatively, a processor of processor 21 may be a specific-purpose processor comprising one or more preprogrammed hardware or firmware elements (e.g., application-specific integrated circuits (ASICs), electrically erasable programmable read-only memories (EEPROMs), etc.) or other related elements.
[0052] Memory component 23 comprises one or more memories for storing program code executed by processor 21 and/or data used during operation of processor 21. A memory of memory component 23 may be a semiconductor medium (including, for example, a solid-state memory), a magnetic storage medium, an optical storage medium, and/or any other suitable type of memory. A memory of memory component 23 may be read-only memory (ROM) and/or random-access memory (RAM), for example.
[0053] In some embodiments, two or more elements of processor 21 may be implemented by devices that are physically distinct from one another and may be connected to one another via data bus 26 or via a communication link. In other embodiments, two or more elements of processor 21 may be implemented by a single integrated device. As will be appreciated by the skilled reader, the hardware and software components of malware detection system 20 may be implemented in any other suitable way in other embodiments.
[0054] With reference to
[0055]
[0056] [AddressOfEntryPoint, Machine, SizeOfOptionalHeader, Characteristics, MajorLinkerVersion, MinorLinkerVersion, SizeOfCode, SizeOfInitializedData, SizeOfUninitializedData, BaseOfCode, ImageBase, SectionAlignment, FileAlignment, MajorOperatingSystemVersion, MinorOperatingSystemVersion, MajorImageVersion, MinorImageVersion, MajorSubsystemVersion, MinorSubsystemVersion, SizeOfImage, SizeOfHeaders, CheckSum, Subsystem, DllCharacteristics, SizeOfStackReserve, SizeOfStackCommit, SizeOfHeapReserve, SizeOfHeapCommit, LoaderFlags, NumberOfRvaAndSizes, SectionsNb, SectionsMeanEntropy, SectionsMinEntropy, SectionsMaxEntropy, SectionsMeanRawsize, SectionsMinRawsize, SectionsMaxRawsize, SectionsMeanVirtualsize, SectionsMinVirtualsize, SectionMaxVirtualsize, ImportsNbDLL, ImportsNb, ImportsNbOrdinal, ExportNb, ResourcesNb, ResourcesMeanEntropy, ResourcesMinEntropy, ResourcesMaxEntropy, ResourcesMeanSize, ResourcesMinSize, ResourcesMaxSize, LoadConfigurationSize, VersionInformationSize, FileFlags, FileOS, FileType, FileVersionLS, ProductVersionLS, Signature, StrucVersion]
[0057] Some PE file header information is more predictive than other PE file header information. As such, not all header information is equally valuable in helping to predict the likelihood of malware. In some embodiments, the most valuable parameters extracted from the PE file header include:
[0058] [ResourcesMaxEntropy, Characteristics, MajorSubsystemVersion, SectionsMaxEntropy, Machine, ResourcesMeanEntropy, ResourcesMinEntropy, DllCharacteristics, SectionsMeanEntropy, ImageBase, SectionsMinEntropy, MinorLinkerVersion, Subsystem, MajorOperatingSystemVersion]
[0059] In some embodiments, the extracted feature set for the first exemplary branch of the multivariate DNN-based architecture includes: [MajorSubsystemVersion, Machine, MajorOperatingSystemVersion, MinorLinkerVersion, Subsystem]
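As a minimal sketch of how these five parameters might be read, the following uses only the Python standard library and the published PE header layout; a parsing tool such as pefile could equally be used. The function names and the example vocabulary passed to the one-hot helper are assumptions made for illustration, and the one-hot step reflects the representation recited in the claims rather than a specific disclosed implementation.

```python
import struct

# Offsets of selected fields relative to the start of the optional header.
# These offsets are the same for PE32 and PE32+ images.
OPTIONAL_HEADER_FIELDS = {
    "MinorLinkerVersion": (3, "<B"),
    "MajorOperatingSystemVersion": (40, "<H"),
    "MajorSubsystemVersion": (48, "<H"),
    "Subsystem": (68, "<H"),
}


def extract_header_parameters(data: bytes) -> dict:
    """Read the five header parameters from the raw bytes of a PE file."""
    if data[:2] != b"MZ":
        raise ValueError("missing DOS signature")
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)  # e_lfanew
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    coff = pe_offset + 4      # COFF file header follows the signature
    optional = coff + 20      # the COFF file header is 20 bytes long
    params = {"Machine": struct.unpack_from("<H", data, coff)[0]}
    for name, (offset, fmt) in OPTIONAL_HEADER_FIELDS.items():
        params[name] = struct.unpack_from(fmt, data, optional + offset)[0]
    return params


def one_hot(value, vocabulary):
    """Encode `value` against a fixed, ordered vocabulary (hypothetical)."""
    return [1 if value == known else 0 for known in vocabulary]
```

For example, `one_hot(params["Machine"], [0x14C, 0x8664])` would distinguish 32-bit x86 images from x64 images; values outside the vocabulary encode as all zeros in this sketch.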
[0060] In order to continue preparing the inputs to the multilayer perceptron DNN of
[0061] As shown in
[0062] In some embodiments, a dataset of sample executable files can be used to train and test the first exemplary model. In some of such embodiments, the architecture of the first exemplary model shown in
[0063] With reference to
[0064]
[0065] Suitable parsing tools include, but are not limited to, the Portable Executable reader module (pefile). Then, if an object is found, the system can iterate through every entry in the object and list all the DLLs and corresponding imported functions. If a function is imported by ordinal, a lookup table can be used to find the API functions associated with certain DLLs. In some embodiments, the extracted feature sets will comprise a list of DLL (also known as LIB) and API pairs. In the exemplary branch shown in
[0066] [shlwapi.dll:ColorHLSToRGB, shlwapi.dll:ColorRGBToHLS, shlwapi.dll:ord176, shlwapi.dll:SHAutoComplete, shlwapi.dll:UrlUnescapeW, iphlpapi.dll:GetExtendedTcpTable, . . . ]
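A list of this form might be assembled as in the sketch below. It assumes the import table has already been parsed (for instance from the entries of pefile's DIRECTORY_ENTRY_IMPORT, each of which exposes a DLL name and a list of imports with a name and an ordinal) into plain tuples. The tuple layout, the lower-casing of DLL names, and the `ord<N>` fallback for unresolved ordinals are assumptions inferred from the example list, not a disclosed implementation.

```python
def build_import_pairs(import_table, ordinal_lookup=None):
    """Build LIB:API strings from a parsed import table.

    import_table: iterable of (dll_name, entries), where entries is a list of
        (function_name_or_None, ordinal_or_None) tuples (hypothetical layout).
    ordinal_lookup: optional dict mapping (dll_name, ordinal) to an API name.
    """
    ordinal_lookup = ordinal_lookup or {}
    pairs = []
    for dll_name, entries in import_table:
        lib = dll_name.lower()
        for name, ordinal in entries:
            if name is None:
                # Imported by ordinal: resolve through the lookup table if
                # possible, otherwise keep a placeholder such as "ord176".
                name = ordinal_lookup.get((lib, ordinal), f"ord{ordinal}")
            pairs.append(f"{lib}:{name}")
    return pairs
```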
[0067] In order to represent the extracted feature set, the method starts at step 41 by extracting a feature set comprising LIB:API pairs. Then, at step 42, the method includes the step of generating 1, 2 and 3 grams (shingles) of the LIB:API pairs. Finally, at step 43, the method includes the step of generating a list of min-wise independent permutations (MinHash) using 128 permutations. As will be appreciated by the skilled reader, MinHash is one of many locality sensitive hashing schemes that can be used in accordance with the systems and methods disclosed herein for estimating how similar two sets are.
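Steps 41 to 43 can be sketched in plain Python as follows. This is one common way to approximate min-wise independent permutations, using 128 random linear hash functions over a large prime; it is not necessarily the scheme used in any particular embodiment. The sketch assumes that the n-grams are formed over consecutive LIB:API pairs, and all names and constants are illustrative.

```python
import hashlib
import random

MERSENNE_PRIME = (1 << 61) - 1
MAX_HASH = (1 << 32) - 1


def shingles(tokens, sizes=(1, 2, 3)):
    """Return the set of 1-, 2- and 3-grams over consecutive tokens."""
    out = set()
    for n in sizes:
        for i in range(len(tokens) - n + 1):
            out.add(" ".join(tokens[i:i + n]))
    return out


def _base_hash(shingle):
    """Map a shingle to a 32-bit integer."""
    digest = hashlib.sha1(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little")


def make_permutations(num_perm=128, seed=1):
    """Draw the (a, b) coefficients of num_perm linear hash functions."""
    rng = random.Random(seed)
    return [(rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
            for _ in range(num_perm)]


def minhash(shingle_set, permutations):
    """Signature: the minimum permuted hash value for each permutation."""
    hashes = [_base_hash(s) for s in shingle_set]
    if not hashes:
        return [MAX_HASH] * len(permutations)
    return [min(((a * h + b) % MERSENNE_PRIME) & MAX_HASH for h in hashes)
            for a, b in permutations]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard similarity."""
    return sum(1 for x, y in zip(sig_a, sig_b) if x == y) / len(sig_a)
```

The same permutation coefficients must be reused for every file so that signatures remain comparable; the resulting 128 values form a fixed-length input vector regardless of how many functions a file imports.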
[0068] The represented feature set of the second exemplary branch can then be input into the second exemplary model shown in
[0069] In some embodiments, a dataset of sample executable files can be used to train and test the second exemplary model. In some of such embodiments, the architecture of the second exemplary model shown in
[0070] With reference to
[0071]
[0072] The data relating to this third exemplary branch comprises an extraction of the section where the Entry Point (EP) lies. As will be appreciated by the skilled reader, the entry point of an executable file is where the execution of instructions of a program begins. This is performed in part to verify whether the address of the EP is within the boundaries of the determined section. Typically, the EP will be situated in the “.code” or “.text” sections of the PE file. An EP that lies in a different section of a PE file could in itself be suspicious. The feature extraction is performed by first extracting the EP address from the PE file. Then, the system can iterate through every section to check whether the EP address is within the boundaries of the section in question. When the section in which the EP is located is determined, the system can collect all raw bytes (i.e., the value of the bytes) from the section. As such, the raw extracted feature will be the value of the bytes of the section in which the EP is located.
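The section lookup described above, together with the 32×32, 8-bit grayscale representation recited in the claims, might be sketched as follows. The `Section` record is a hypothetical stand-in for whatever a PE parser returns, and zero-padding of sections shorter than 1024 bytes is an assumption, since the source only mentions cropping.

```python
from dataclasses import dataclass


@dataclass
class Section:
    """Hypothetical, simplified view of one parsed PE section."""
    name: str
    virtual_address: int   # RVA at which the section starts
    virtual_size: int
    raw_data: bytes


def find_entry_point_section(entry_point_rva, sections):
    """Return the section whose boundaries contain the entry point, if any."""
    for section in sections:
        start = section.virtual_address
        end = start + section.virtual_size
        if start <= entry_point_rva < end:
            return section
    return None


def to_grayscale_grid(raw_bytes, width=32, height=32):
    """Treat each byte as one 8-bit pixel and crop to width x height.

    Shorter inputs are padded with zeros (an assumption of this sketch).
    """
    wanted = width * height
    pixels = list(raw_bytes[:wanted])
    pixels.extend([0] * (wanted - len(pixels)))
    return [pixels[row * width:(row + 1) * width] for row in range(height)]
```

A result of `None` from `find_entry_point_section` would indicate an entry point outside every section, which a caller may wish to treat as a separate signal.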
[0073] As shown in
[0074] The represented feature set (i.e., grayscale representation) of the third exemplary branch can then be input into the third exemplary model shown in
[0075] In some embodiments, a dataset of sample executable files can be used to train and test the third exemplary model. In some of such embodiments, the architecture of the third exemplary model shown in
[0076] With reference to
[0077] In a non-limiting example, this process could include first determining the file type using a file type library such as the Python-magic™ library. Then, the Capstone™ Architecture and Mode can be initialized using the file type determined in the previous step. The file can then be opened using the r2pipe module in Radare2™. In some embodiments, an analysis timeout of 30 seconds can be used to limit the analysis time, before analyzing the file using Radare2™. The raw bytes found in the Entry Point function can then be extracted and disassembled using Capstone™. Finally, the order, address, size, raw byte, mnemonic and operand of every instruction in the entry function of the executable file can be collected. The raw extracted feature set can be a list of dictionaries, each containing the order, address, size, raw byte, mnemonic and operand of one instruction.
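The raw extracted feature set described above can be pictured as a list of per-instruction dictionaries. The sketch below assumes the disassembly has already been performed and shows only how the mnemonics might be ordered and grouped into the 1, 2 and 3 grams recited in the claims. The dictionary keys and the sample instructions are illustrative assumptions.

```python
def mnemonic_ngrams(instructions, sizes=(1, 2, 3)):
    """Return n-grams of mnemonics, following the recorded instruction order."""
    ordered = sorted(instructions, key=lambda ins: ins["order"])
    mnemonics = [ins["mnemonic"] for ins in ordered]
    grams = []
    for n in sizes:
        for i in range(len(mnemonics) - n + 1):
            grams.append(" ".join(mnemonics[i:i + n]))
    return grams


# Illustrative records for a three-instruction entry function.
entry_function = [
    {"order": 0, "address": 0x401000, "size": 1, "raw_bytes": "55",
     "mnemonic": "push", "operand": "ebp"},
    {"order": 1, "address": 0x401001, "size": 2, "raw_bytes": "8bec",
     "mnemonic": "mov", "operand": "ebp, esp"},
    {"order": 2, "address": 0x401003, "size": 1, "raw_bytes": "c3",
     "mnemonic": "ret", "operand": ""},
]

grams = mnemonic_ngrams(entry_function)
```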
[0078] Similarly to the representation method used in respect of the second exemplary branch, the feature representation method relating to the fourth exemplary branch comprises a MinHash of 1, 2 and 3 grams of each mnemonic, as shown in steps 81, 82 and 83 of
[0079] The represented feature set of the fourth exemplary branch can then be input into the fourth exemplary model shown in
[0080] In some embodiments, a dataset of sample executable files can be used to train and test the fourth exemplary model. In some of such embodiments, the architecture of the fourth exemplary model shown in
[0081] With reference to
[0082] In some embodiments, the extracted feature set can then be represented by first converting the section characteristic table of
[0083] The represented feature vectors of the fifth exemplary branch can then be input into the fifth exemplary model shown in
[0084] In some embodiments, a dataset of sample executable files can be used to train and test the fifth exemplary model. In some of such embodiments, the architecture of the fifth exemplary model shown in
[0085] With reference to
[0086] In some embodiments, each of the General strings shown in
[0087] #string: the total number of strings.
[0088] #noise: the total number of noise strings (strings with special symbols such as ‘D$I’). The regular expression regex=re.compile(‘[@!#$%^&*()“\′<>,\′?∧|}{~:†=\+\-\[†]]’) is used to search for noise strings.
[0089] #English sentence: the total number of English sentences. enchant.checker.SpellChecker is applied to search for English sentences. In some embodiments, other languages could be used.
[0090] #repeated character: the total number of strings that consist of a single repeated character, such as “00000000”. Since the entropy of each string will be calculated later, the strings with an entropy of 0 can be defined as repeated characters.
[0091] #file extension: the total number of strings which have file extensions, such as *.dll, *.exe
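Several of the general statistics listed above can be computed with the standard library alone, as in the following sketch. The minimum string length of four characters, the symbol class used to flag noise (an illustrative class, not the expression quoted above), and the restriction to the two file extensions named in the text are all assumptions. The English-sentence count is omitted because it depends on an external spell checker.

```python
import re

# Runs of at least four printable ASCII characters (the length is assumed).
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")
# Illustrative class of special symbols used to flag noise strings.
NOISE_SYMBOLS = re.compile(r"[@!#$%^&*()<>?/\\|{}~:;=+\[\]-]")
# Only the two extensions named in the text are matched here.
FILE_EXTENSION = re.compile(r"\.(dll|exe)$", re.IGNORECASE)


def extract_strings(data):
    """Return (offset, string) pairs for every printable run in the file."""
    return [(m.start(), m.group().decode("ascii"))
            for m in PRINTABLE_RUN.finditer(data)]


def general_statistics(strings):
    """Count the general string features over a list of strings."""
    return {
        "#string": len(strings),
        "#noise": sum(1 for s in strings if NOISE_SYMBOLS.search(s)),
        "#repeated character": sum(1 for s in strings if len(set(s)) == 1),
        "#file extension": sum(1 for s in strings if FILE_EXTENSION.search(s)),
    }
```

Keeping the offset alongside each string preserves the location information that accompanies the printable strings in the string feature set.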
[0092] In some embodiments, each of the Domain knowledge strings shown in
[0093] The constants.json file from the open source tool stringsifter (https://github.com/fireeye/stringsifter) can be used as the domain knowledge dictionary to get the total number of specific strings.
[0094] #winApi: the total number of Windows API strings. In the dictionary, there are 28307 items related to the Windows API such as ‘ACUIProviderInvokeUI’, ‘ADSIAbandonSearch’, ‘ADSICloseDSObject’. Each string in the PE file can be matched to the items to get the total number of Windows API strings. This method can be applied to the following features, but with different items.
[0095] #dll: the total number of DLL file strings.
[0096] #common dll: the total number of DLL file strings which are in the common DLL dictionary. There are 32 items in the common DLL dictionary such as ‘wowarmhw’, ‘xtajit’, ‘advapi32’, ‘advapi’, ‘clbcatq’, ‘combase’.
[0097] #malware dll: the total number of DLL file strings which are in the malware DLL dictionary. There are nine items in the malware DLL dictionary such as ‘wininet’, ‘bypassuacdll’, ‘dnsapi’.
[0098] #cpp: the total number of strings which are related to cpp. There are 236 items in the cpp dictionary such as ‘get_file_size’, ‘.xdata$x’, ‘Cast to smaller type causing loss of data’.
[0099] #fun_mal: the total number of important functions which may be related to malware. There are 330 items in the fun_mal dictionary such as ‘AdjustTokenPrivileges’, ‘CallNextHookEx’, ‘CheckRemoteDebuggerPresent’.
[0100] #pe_arti: the total number of strings related to PE artifacts. There are 12 items in the pe_artifacts dictionary such as ‘ProductVersion’, ‘VS_VERSION_INFO’, ‘!This program cannot be run in DOS mode.’.
[0101] #language: the total number of language strings such as “English-United States” and “German”. There are 245 items in the language dictionary.
[0102] #date: the total number of strings related to dates such as “Sunday” and “May”. There are 33 items in the date dictionary.
[0103] #blacklist: the total number of strings which are in the blacklist dictionary. There are 280 items in the blacklist dictionary such as ‘project.thisdocument’, ‘microsoft office’, ‘microsoft word’, ‘worddocument’, ‘xmlhttp’, ‘summaryinformation’.
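The dictionary matching used for the domain knowledge features above can be sketched as a simple set lookup. The small in-memory dictionaries below reuse a few items quoted in the text and stand in for the full domain knowledge dictionary. The exact, case-insensitive matching rule and the function name are assumptions, since the source does not specify how strings are compared.

```python
def count_dictionary_matches(strings, dictionaries):
    """Count, per feature, how many strings appear in that feature's dictionary.

    dictionaries: mapping of feature name to an iterable of dictionary items.
    Matching is exact and case-insensitive (an assumption of this sketch).
    """
    counts = {}
    for feature, items in dictionaries.items():
        lookup = {item.lower() for item in items}
        counts[feature] = sum(1 for s in strings if s.lower() in lookup)
    return counts


# Tiny illustrative subsets of the dictionaries described in the text.
example_dictionaries = {
    "#winApi": ["ACUIProviderInvokeUI", "ADSIAbandonSearch", "ADSICloseDSObject"],
    "#fun_mal": ["AdjustTokenPrivileges", "CallNextHookEx", "CheckRemoteDebuggerPresent"],
    "#pe_arti": ["ProductVersion", "VS_VERSION_INFO"],
}
```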
[0104] In some embodiments, each of the Entropy strings shown in
[0105] scipy.stats.entropy can be used to calculate the entropy of each string. pandas.DataFrame.quantile is used to get Quantile 10 or 100.
[0106] Avg: the average of the strings' entropy.
[0107] Max: the maximum value of the strings' entropy.
[0108] Min: the minimum value of the strings' entropy.
[0109] Quantile 10 or 100: the decile or the percentile of the strings' entropy.
[0110] In some embodiments, each of the Length strings shown in
[0111] Avg: the average of the strings' length.
[0112] Max: the maximum value of the strings' length.
[0113] Min: the minimum value of the strings' length.
[0114] Quantile 10 or 100: the decile or the percentile of the strings' length.
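The entropy and length statistics listed above can be approximated with the standard library, as sketched below, in place of the SciPy and pandas calls named in the text. The base-2 logarithm, the use of character frequencies within each string, and the `inclusive` quantile method (intended to be comparable to linear interpolation) are assumptions of this sketch. Note that `statistics.quantiles` needs at least two values on most Python versions.

```python
import math
import statistics
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy (in bits) of the character distribution of a string."""
    total = len(text)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(text).values())


def summarize(values, n_quantiles=10):
    """Average, maximum, minimum and quantile cut points of a list of values."""
    summary = {
        "avg": statistics.fmean(values),
        "max": max(values),
        "min": min(values),
    }
    cuts = statistics.quantiles(values, n=n_quantiles, method="inclusive")
    for i, cut in enumerate(cuts, start=1):
        summary[f"q{i}"] = cut
    return summary


def entropy_and_length_statistics(strings, n_quantiles=10):
    """Summaries of per-string entropy and per-string length."""
    return {
        "entropy": summarize([shannon_entropy(s) for s in strings], n_quantiles),
        "length": summarize([len(s) for s in strings], n_quantiles),
    }
```

Passing `n_quantiles=100` yields percentiles instead of deciles; a string made of a single repeated character has an entropy of 0 under this definition.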
[0115] An example string statistics vector representation of the above-described example is shown in
[0116] As will be appreciated, the aforementioned feature representation provides a great deal of flexibility to add more string statistics features as more domain knowledge is acquired, thereby expanding the feature space even further.
[0117] The represented feature set of the sixth exemplary branch can then be input into the sixth exemplary model shown in
[0118] In some embodiments, a dataset of sample executable files can be used to train and test the sixth exemplary model. In some of such embodiments, the architecture of the sixth exemplary model shown in
[0119] As will now be described with reference to
[0120]
[0121] In some embodiments, a dataset of sample executable files can be used to train and test the ensemble model of
[0122] The variety of feature sets associated with the ensemble model allows exploration of patterns in a comparatively large representation space. As such, the ensemble model enables better space representation of a PE file. As will also be appreciated by the skilled reader, different feature extraction and feature representation processes that enable the ensemble DNN-based model of
[0123] As will be appreciated by the skilled reader, any combination of two or more of the exemplary branches described herein can be combined to form one or more embodiments of the multivariate ensemble deep neural network methods and systems in accordance with the present disclosure.
[0124] Moreover, a person of skill in the art will readily recognize that steps of various aforementioned methods can be performed by programmed computers. Herein, some embodiments are also intended to cover program storage devices, e.g., digital data storage media, which are machine or computer readable and encode machine-executable or computer-executable programs of instructions, wherein said instructions perform some or all of the steps of said above-described methods. The program storage devices may be, e.g., digital memories, magnetic storage media such as magnetic disks and magnetic tapes, hard drives, or optically readable digital data storage media. The embodiments are also intended to cover computers programmed to perform said steps of the above-described methods.
[0125] The description and drawings merely illustrate the principles of the invention. It will thus be appreciated that those skilled in the art will be able to devise various arrangements that, although not explicitly described or shown herein, embody the principles of the invention and are included within the scope of the appended claims. Furthermore, all examples recited herein are principally intended expressly to be only for pedagogical purposes to aid the reader in understanding the principles of the invention and the concepts contributed by the inventors to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions. Moreover, all statements herein reciting principles, aspects, and embodiments of the invention, as well as specific examples thereof, are intended to encompass equivalents thereof.
[0126] The functions of the various elements shown in
[0127] It should be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative software and/or circuitry embodying the principles of the invention. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in computer readable medium and so executed by a computer or processor whether or not such computer or processor is explicitly shown.