SYSTEMS AND METHODS OF MALWARE DETECTION
20210250364 · 2021-08-12
Inventors
Cpc classification
H04L69/322
ELECTRICITY
H04L63/145
ELECTRICITY
G06F17/18
PHYSICS
G06F21/566
PHYSICS
H04L63/1466
ELECTRICITY
International classification
G06F17/18
PHYSICS
G06F21/56
PHYSICS
Abstract
Systems and methods for detecting suspicious malware by analyzing data such as transfer protocol data or logs from a host within an enterprise are provided. The systems and methods include a database for storing current data and historical data obtained from the network, a detection module, and an optional display. The embodiments herein extract information from non-encrypted transfer protocol metadata, determine a plurality of features, utilize an outlier detection model that is based on historical behaviors, calculate a suspiciousness score, and create alerts for analysis by users when the score exceeds a threshold. In doing so, the systems and methods of the present invention improve the ability to identify suspicious outliers or potential malware on an iterative basis over time.
Claims
1. A system for detecting malware in a network comprising: a database for storing current data and historical data obtained from the network; and a detection module adapted and configured to perform the steps of: loading the current data from the database; filtering the current data to obtain filtered data based on at least one criterion; saving the filtered data to the database and loading a previously filtered data; determining values of a plurality of features; computing an outlier score for each of the values of the plurality of features; and merging the outlier scores to obtain an output score used to detect malware.
2. The system of claim 1, wherein the at least one criterion is one of a file path information, a file name, a content type, a content length, and a file extension type.
3. The system of claim 1, wherein the step of computing an outlier score includes performing at least one of a Z-score and a p-value calculation of each of the plurality of features.
4. The system of claim 1, wherein the step of merging the outlier scores includes using one of a sum, a weighted average, and a supervised machine learning model algorithm to obtain the output score.
5. The system of claim 1, wherein the current data and the historical data includes metadata of a transfer protocol.
6. The system of claim 5, wherein the transfer protocol is one of HTTP, FTP, SMB, and SMTP.
7. The system of claim 1, further comprising at least one sensor for parsing out the metadata from the current data and the historical data.
8. The system of claim 1, wherein the detection module is further adapted and configured to perform the step of creating an alert for each of the output scores at or above a predetermined threshold.
9. The system of claim 8, further comprising a display for displaying the alert received from the detection module.
10. The system of claim 1, wherein the plurality of features includes at least one of: a count of a number of times downloads are made from an observed protocol host over a time interval, a count of a number of times an observed transfer protocol path is downloaded over a time interval, an amount by which the value of one feature within the plurality of features is abnormal relative to other file downloads with a same extension as the one feature, and a determination of how strongly a downloaded file name within the current data correlates with a list of known malware file names.
11. A method for detecting malware in a network comprising the steps of: storing current data and historical data obtained from the network in a database; loading the current data from the database; filtering the current data to obtain filtered data based on at least one criterion; saving the filtered data to the database and loading a previously filtered data; determining values of a plurality of features; computing an outlier score for each of the values of the plurality of features; and merging the outlier scores to obtain an output score used to detect malware.
12. The method of claim 11, wherein the at least one criterion is one of a file path information, a file name, a content type, a content length, and a file extension type.
13. The method of claim 11, wherein the step of computing an outlier score includes performing at least one of a Z-score and a p-value calculation of each of the plurality of features.
14. The method of claim 11, wherein the step of merging the outlier scores includes using one of a sum, a weighted average, and a supervised machine learning model algorithm to obtain the output score.
15. The method of claim 11, wherein the current data and the historical data includes metadata of a transfer protocol.
16. The method of claim 15, wherein the transfer protocol is one of HTTP, FTP, SMB, and SMTP.
17. The method of claim 15, further comprising the step of parsing out the metadata from the current data and the historical data using a sensor.
18. The method of claim 11, further comprising the step of creating an alert for each of the output scores at or above a predetermined threshold.
19. The method of claim 18, further comprising displaying the alert in a display.
20. The method of claim 18, wherein the plurality of features includes at least one of: a count of a number of times downloads are made from an observed protocol host over a time interval, a count of a number of times an observed transfer protocol path is downloaded over a time interval, an amount by which the value of one feature within the plurality of features is abnormal relative to other file downloads with a same extension, and a determination of how strongly a downloaded file name within the current data correlates with a list of known malware file names.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0009] The present disclosure is illustrated and described herein with reference to the various drawings, in which like reference numbers are used to denote like system components, as appropriate, and in which:
[0010]
[0011]
[0012]
[0013]
DETAILED DESCRIPTION OF THE DISCLOSURE
[0014] Embodiments of the invention provide systems and methods for detecting suspicious malware by analyzing data of transfer protocols from a host within an enterprise. Malware as referenced herein includes malware downloads, file records, file transfers, emails, attachments, etc. Embodiments of the invention also include devices such as user interfaces for alerting network analysts of suspicious incidents, based on the systems and methods employed herein. The embodiments herein extract information from internet transfer protocols, including non-encrypted transfer protocols such as HTTP, FTP, SMB, SMTP, etc. The HTTP protocol is utilized herein as an exemplary internet transfer protocol. However, the systems and methods of the embodiments of the invention herein are not limited to analysis of only HTTP protocol data.
[0015] There are many benefits and advantages of the embodiments of the present invention. For example, the embodiments herein perform malware detection in a scalable and inexpensive way, which allows for implementations at a very large scale. Therefore, the embodiments herein are highly beneficial for networks and enterprises with high volumes of traffic, such as HTTP traffic. In addition, the embodiments of the methods of the present invention utilize an outlier detection model that is based on historical behaviors. As such, the embodiments herein “learn” common protocol behaviors that are specific to the network being analyzed. In doing so, the systems and methods of the present invention improve the ability to identify suspicious outliers or potential malware on an iterative basis over time. Additional benefits and advantages of the embodiments herein are described below and illustrated with respect to the figures.
System Architecture for Detecting Malware
[0016]
[0017] The sensors 104 serve the purpose of mirroring network traffic or data sent to corresponding access switches 105, parsing out the metadata (e.g., HTTP protocol data), and saving the data to a database 106. The content of the parsed data can vary depending on application specifics and the desires of network analysts. In an embodiment, the data includes at least Path values from HTTP headers. The Path field of the HTTP header represents the string of characters that follows the Host in a URL. An example of an HTTP Path is “/document/d/1VWC/my_song.mp3”. The HTTP Path is utilized to determine which records represent file downloads. The data could also contain other fields from HTTP headers, like Host, Content-Length, User-Agent, etc. An example of an HTTP Host is “www.google.com.” For HTTP downloads, the Content-Length in the header of the HTTP response typically represents the size of the downloaded file. For HTTP downloads, the User-Agent in the header of the HTTP request typically contains some information about the platform requesting the download (e.g., “Mozilla/5.0”). The sensors 104 can consist of one or more hardware components. Sensors 104 may also include or be comprised entirely of software modules that run on the individual devices 110a-n, 111a-n within the access switch layer 102.
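By way of illustration only, the metadata parsing attributed to the sensors 104 might be sketched as follows. The function name `parse_http_request`, the record layout, and the use of the file extension of the Path as a download indicator are assumptions made for this sketch and are not prescribed by the disclosure. The sketch handles only the request side; the Content-Length of a download would be read from the corresponding HTTP response.

```python
import posixpath
from urllib.parse import urlsplit


def parse_http_request(raw: str) -> dict:
    """Extract the Path and selected header fields from a raw HTTP request.

    Hypothetical helper: the returned field names are illustrative only.
    """
    head = raw.split("\r\n\r\n", 1)[0]          # drop any message body
    lines = head.split("\r\n")
    method, target, _version = lines[0].split(" ", 2)

    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    path = urlsplit(target).path                # strip any query string
    extension = posixpath.splitext(path)[1].lstrip(".").lower()
    return {
        "method": method,
        "host": headers.get("host", ""),
        "path": path,
        "extension": extension,                 # empty string if none
        "user_agent": headers.get("user-agent", ""),
    }


record = parse_http_request(
    "GET /document/d/1VWC/my_song.mp3 HTTP/1.1\r\n"
    "Host: www.google.com\r\n"
    "User-Agent: Mozilla/5.0\r\n"
    "\r\n"
)
```

In this sketch, a record whose `extension` field is non-empty could be treated as a candidate file download, although other criteria are equally possible.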
[0018] In
[0019] The detection module 108 executes a series of steps at a regular frequency in order to identify suspicious downloads in the recently acquired data. The run frequency can be configured depending on the network data volume and the desires of network analysts. Each time the process, or process software, runs within the detection module 108, it identifies suspicious downloads that occurred in the time interval since the previous run. This interval is referred to as the “test interval,” and the metadata written to database 106 during the test interval are referred to as the “test data.” As an example, the detection module 108 could be configured to run daily. In this case, the software would identify suspicious downloads in test data collected by sensors 104 during the 24-hour test interval prior to each run.
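As a minimal sketch of the test interval concept, assuming each metadata record carries a `timestamp` field (a hypothetical name) and that the detection module runs daily, the test data for a given run might be selected as follows. The helper names are illustrative and do not appear in the disclosure.

```python
from datetime import datetime, timedelta


def interval_for_run(run_time, run_frequency=timedelta(days=1)):
    """Return (start, end) of the test interval that ends at run_time."""
    return run_time - run_frequency, run_time


def select_test_data(records, run_time, run_frequency=timedelta(days=1)):
    """Keep only records written during the test interval before this run."""
    start, end = interval_for_run(run_time, run_frequency)
    return [r for r in records if start <= r["timestamp"] < end]


run_time = datetime(2021, 8, 12, 0, 0)
records = [
    {"path": "/a.exe", "timestamp": datetime(2021, 8, 11, 9, 30)},
    {"path": "/b.zip", "timestamp": datetime(2021, 8, 10, 23, 59)},
]
test_data = select_test_data(records, run_time)
```

The half-open interval (start inclusive, end exclusive) is a design choice of this sketch so that consecutive runs neither skip nor double-count a record.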
Methods for Malware Detection
[0020]
[0021] In step 204, values of a plurality of suspiciousness features 1-N are computed or determined for the currently filtered test data and the historically filtered test data. Suspiciousness features are properties or characteristics associated with potential malware downloads, such as those related to file type, domain, an outdated User-Agent, or an outdated Content-Type. Features 1-N may be designed or pre-selected by a user or analyst, and may include, for example, information about how many users have accessed a particular domain; the fraction of days within a time period with a given SLD, Path, Referer, or User-Agent; the fraction of the same-extension downloads in a given period of time with the same Content-Type or a similar Content-Length; and other file name-related suspiciousness. In some embodiments, features 1-N are determined based on historical network data. In sum, features 1-N are defined to correlate with the suspicious behavior exhibited by malware.
[0022] As noted above, a plurality of features 1-N may be used in the detection of malware according to the embodiments herein, and the invention is not limited to a specific feature set or number of features. In an exemplary embodiment, features 1-4 are described and calculated as follows. Feature 1 is a count of the number of times downloads are made from the observed HTTP Host on the network over an interval (e.g., 30 days) preceding the test interval, multiplied by negative one (−1) so that unpopular hosts are represented by higher values. Feature 2 is a count of the number of times the observed HTTP Path was downloaded on the network over the 30-day interval preceding the test interval, multiplied by negative one so that unpopular Paths are represented by higher values. Feature 3 is a quantitative measure of outlierness (i.e., the amount by which the feature value is abnormal with respect to a group of other feature value data points) of the observed file size relative to other file downloads with the same extension that occurred on the network over the 30-day interval preceding the test interval. This could be quantified, for example, using the “Local Outlier Factor” algorithm. Feature 4 is a quantitative measure of how strongly the downloaded file name correlates with a list of known malware file names. This is measured, for example, using a supervised machine learning model like a “Long Short-Term Memory” network.
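Features 1 and 2 described above lend themselves to a short sketch. The following assumes that the history for the interval preceding the test interval is available as a list of records with `host` and `path` fields; these field names and the function name `popularity_features` are hypothetical. Features 3 and 4 are not covered here.

```python
from collections import Counter


def popularity_features(history, test_records):
    """Attach feature 1 (negated Host count) and feature 2 (negated Path
    count) to each test record, counting over the supplied history."""
    host_counts = Counter(r["host"] for r in history)
    path_counts = Counter(r["path"] for r in history)
    return [
        {
            **r,
            # Negation makes rarely seen hosts and paths score higher.
            "feature_1": -host_counts[r["host"]],
            "feature_2": -path_counts[r["path"]],
        }
        for r in test_records
    ]


history = [
    {"host": "www.google.com", "path": "/document/d/1VWC/my_song.mp3"},
    {"host": "www.google.com", "path": "/document/d/1VWC/my_song.mp3"},
    {"host": "www.google.com", "path": "/other"},
]
test_records = [
    {"host": "www.google.com", "path": "/document/d/1VWC/my_song.mp3"},
    {"host": "rare.example", "path": "/payload.exe"},
]
featured = popularity_features(history, test_records)
```

A host or path never seen in the history receives a count of zero, which is the highest possible value of these negated features.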
[0023] In step 205, after the feature 1-N values of the currently filtered test data are acquired, they are saved to the database 106. Then, a set of historical feature values, saved during previous runs, is loaded from the database 106. In an embodiment, the amount of historical feature values loaded in step 205 is configurable based on: 1) the resources available in the computation engine 107; and/or 2) a predetermined amount selected by a user or network analyst. In an exemplary embodiment, the detection module 108 is configured with a daily test interval, and the historical feature values that were saved over the seven days preceding the test interval are loaded from the database 106. In an embodiment in which feature 1 is used, the computation proceeds in two parts. First, a table or list containing the count of each Host value in the currently filtered test data is written to the database 106. Second, a selected length of time (e.g., 30 days) of the historical summary tables from previous runs is loaded and used to compute features 1-N for each download in the currently filtered test data. This aspect provides the advantage of computational efficiency, in part because steps 201 and 202 only need to be executed once in each test interval.
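The summary-table approach for feature 1 can be sketched as follows, assuming one table of Host counts is stored per run and that the tables for the selected historical period are merged at scoring time. The function names are hypothetical, and the database 106 is replaced by an in-memory list for the purposes of this sketch.

```python
from collections import Counter


def summarize_hosts(filtered_test_data):
    """First part: build the per-run table of Host counts to be saved."""
    return Counter(r["host"] for r in filtered_test_data)


def merge_summaries(summaries):
    """Second part: combine the saved per-run tables (e.g., 30 days)."""
    total = Counter()
    for summary in summaries:
        total.update(summary)   # adds counts key by key
    return total


def feature_1(host, merged_counts):
    """Negated historical download count for the observed Host."""
    return -merged_counts[host]


todays_table = summarize_hosts(
    [{"host": "www.google.com"}, {"host": "www.google.com"}]
)
saved_tables = [
    Counter({"www.google.com": 40, "rare.example": 1}),
    Counter({"www.google.com": 35}),
]
merged = merge_summaries(saved_tables)
```

Because each run stores only a compact table of counts, the raw records for the historical period do not need to be reloaded and refiltered at every run.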
[0024] In step 206, an algorithm such as a Z-score, p-value, or similar is used to calculate an outlier score for the values of features 1-N. In an embodiment, the historical feature values are used to calculate the outlier score for each feature value in the filtered test data. As illustrated in exemplary
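A minimal sketch of the outlier scoring of step 206 is given below. It assumes the Z-score is taken against the mean and population standard deviation of the historical values of the same feature, and that the p-value is the upper-tail probability under a normal distribution; both choices are assumptions of this sketch, as the disclosure does not fix the exact formulas.

```python
import math
import statistics


def z_score(value, historical_values):
    """Number of standard deviations by which value exceeds the
    historical mean; returns 0.0 when the history has no spread."""
    mean = statistics.fmean(historical_values)
    std = statistics.pstdev(historical_values)
    if std == 0:
        return 0.0
    return (value - mean) / std


def upper_tail_p_value(z):
    """P(Z >= z) for a standard normal variable (normality assumed)."""
    return 0.5 * math.erfc(z / math.sqrt(2))


historical = [2, 4, 4, 4, 5, 5, 7, 9]   # mean 5, population std 2
z = z_score(9, historical)
p = upper_tail_p_value(z)
```

A large positive Z-score, or equivalently a small upper-tail p-value, marks a feature value as unusually high relative to the behavior previously observed on the network.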
[0025] After outlier scores are computed for each of features 1-N in step 206, a table or list of filtered test data with features and Z-scores is created and passed to step 207. In step 207, the Z-scores for each row of the filtered test data are merged into a single “suspiciousness score” or output score. In an embodiment, the output score indicates how likely a file download is to qualify as malware. The Z-score computation and merging steps are illustrated in
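Two of the merging options recited in the claims, a sum and a weighted average, can be sketched as follows. The function names and the example weights are hypothetical; the third option, a supervised machine learning model, is not sketched here.

```python
def merge_sum(z_scores):
    """Output score as the plain sum of the per-feature Z-scores."""
    return sum(z_scores)


def merge_weighted_average(z_scores, weights):
    """Output score as a weighted average of the per-feature Z-scores."""
    if len(z_scores) != len(weights):
        raise ValueError("one weight is required per Z-score")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(z * w for z, w in zip(z_scores, weights)) / total_weight


row_z_scores = [1.0, 3.0, 0.5, 2.5]          # one Z-score per feature
score_sum = merge_sum(row_z_scores)
score_avg = merge_weighted_average(row_z_scores, [1, 1, 1, 1])
score_weighted = merge_weighted_average(row_z_scores, [0, 2, 0, 2])
```

A weighted average allows an analyst to emphasize features that have proven more indicative on a particular network, whereas a plain sum treats all features equally.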
[0026] In step 208, output scores 402 are compared to a predetermined or selected threshold, and records whose output scores fall below the threshold are removed from the currently filtered test data. Alerts may be generated from the surviving records, i.e., file records associated with output scores 402 at or above the threshold, and these file records are treated as suspicious downloads of potential malware, subject to further analysis. Each alert may contain illustrations and/or data describing the suspicious indicators, behaviors, and underlying metadata for a single suspicious download. In an embodiment, illustrated in step 209, the alerts are sent from the computation engine 107 to a display device 109 where they can be reviewed and analyzed by network analysts. In an embodiment, the display device 109 has additional analytical capabilities such that a user may triage the alerts, provide feedback and/or additional algorithms via the display device 109, and send these back to the computation engine 107 so that details of the algorithm may be adjusted accordingly.
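The thresholding and alert creation of step 208 might be sketched as below. The sketch keeps records whose output score is at or above the threshold, following the wording of claims 8 and 18; the alert layout, the field names, and the function name `create_alerts` are assumptions made for illustration.

```python
def create_alerts(scored_records, threshold):
    """Drop records scoring below the threshold and build one alert per
    surviving record, most suspicious first."""
    survivors = [r for r in scored_records if r["output_score"] >= threshold]
    survivors.sort(key=lambda r: r["output_score"], reverse=True)
    return [
        {
            "host": r["host"],
            "path": r["path"],
            "output_score": r["output_score"],
            "summary": (
                f"Suspicious download of {r['path']} from {r['host']} "
                f"(score {r['output_score']:.2f})"
            ),
        }
        for r in survivors
    ]


scored = [
    {"host": "www.google.com", "path": "/other", "output_score": 1.2},
    {"host": "rare.example", "path": "/payload.exe", "output_score": 7.5},
    {"host": "odd.example", "path": "/tool.zip", "output_score": 3.0},
]
alerts = create_alerts(scored, threshold=3.0)
```

Sorting the alerts by descending score is a convenience for triage on the display device and is not required by the method.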
[0027] The subject matter described herein can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structural means disclosed in this specification and structural equivalents thereof, or in combinations of them. The subject matter described herein can be implemented as one or more computer program products, such as one or more computer programs tangibly embodied in an information carrier (e.g., in a machine readable storage device), or embodied in a propagated signal, for execution by, or to control the operation of, data processing apparatus (e.g., a programmable processor, a computer, or multiple computers). A computer program (also known as a program, software, software application, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file. A program can be stored in a portion of a file that holds other programs or data, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network.
[0028] The processes and logic flows described in this specification, including the method steps of the subject matter described herein, can be performed by one or more programmable processors executing one or more computer programs to perform functions of the subject matter described herein by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus of the subject matter described herein can be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
[0029] Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic disks, magneto optical disks, or optical disks. Information carriers suitable for embodying computer program instructions and data include all forms of nonvolatile memory, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices); magnetic disks (e.g., internal hard disks or removable disks); magneto optical disks; and optical disks (e.g., CD and DVD disks). The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
[0030] The subject matter described herein can be implemented in a computing system that includes a back end component (e.g., a data server), a middleware component (e.g., an application server), or a front end component (e.g., a client computer, mobile device, or wearable device having a graphical user interface or a web browser through which a user can interact with an implementation of the subject matter described herein), or any combination of such back end, middleware, and front end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.
[0031] It is to be understood that the disclosed subject matter is not limited in its application to the details of construction and to the arrangements of the components set forth in the following description or illustrated in the drawings. The disclosed subject matter is capable of other embodiments and of being practiced and carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting. As such, those skilled in the art will appreciate that the conception, upon which this disclosure is based, may readily be utilized as a basis for the designing of other structures, methods, and systems for carrying out the several purposes of the disclosed subject matter. It is important, therefore, that the claims be regarded as including such equivalent constructions insofar as they do not depart from the spirit and scope of the disclosed subject matter.
[0032] Although the disclosed subject matter has been described and illustrated in the foregoing exemplary embodiments, it is understood that the present disclosure has been made only by way of example, and that numerous changes in the details of implementation of the disclosed subject matter may be made without departing from the spirit and scope of the disclosed subject matter, which is limited only by the claims which follow.