Apparatus and methods for network-based line-rate detection of unknown malware
11349852 · 2022-05-31
Assignee
Inventors
- Hongwen Zhang (Calgary, CA)
- Mark Koob (Calgary, CA)
- Kevin Chmilar (Calgary, CA)
- Husam Kinawi (Calgary, CA)
Cpc classification
H04L63/145
ELECTRICITY
G06F21/53
PHYSICS
H04L12/22
ELECTRICITY
International classification
H04L9/32
ELECTRICITY
G06F21/56
PHYSICS
G06F21/53
PHYSICS
G06F21/55
PHYSICS
Abstract
A network-based line-rate method and apparatus for detecting and managing potential malware, utilizing a blacklist of possible malware to scan content and detect potential malware content based upon characteristics that match the preliminary signature. The undetected content is then subjected to inference-based processes and methods to determine whether the undetected content is safe for release. As is typical of inference-based processes and methods, the verdict is a numerical value within a predetermined range, outside of which content is not safe. The network content is released if the verdict is within the safe range; otherwise, the apparatus provides various options for handling such presumably unsafe content, including soliciting user input on whether to release, block, or subject the content to further offline behavioral analysis.
Claims
1. An apparatus for network-based, malware analysis comprising: a processor; and a memory configured to store computer program code; wherein the processor, memory, and computer program code are configured for in-line inspection of network-based content and to provide: a signature scanner configured to scan at network line-rates incoming network-based content with signature-based scanning which comprises comparing a signature of the incoming network-based content with previously identified signatures to identify if the incoming network-based content is a known threat content; an Artificial Intelligent (AI) scanner configured to scan at network-line rates and previously trained and configured to: read the code of the network-based content which has not been identified as a known threat in the signature-based scanning step without executing the code; use machine learned characteristics of malicious code to assign a risk value to the network-based content based on the read code of the network-based content; identify network-based content having been assigned a risk score below a safe threshold value as safe content; identify network-based content having been assigned a risk score above a threat threshold value as threat content; and identify network-based content having been assigned a risk score above the safe threshold value and below the threat threshold value as suspicious content; and a controller configured, based on the scans: to allow safe content; and to block threat content; and a behavioural scanner configured to run suspicious content in an isolated virtual environment to determine whether the suspicious content contains threat content or safe content; wherein the controller is configured to notify a user of identified suspicious content and prompt the user for input regarding how to process the identified suspicious content.
2. The apparatus according to claim 1, the network-based content includes Multipurpose Internet Mail Extension (MIME) objects.
3. The apparatus according to claim 1, the network-based content includes one or more attachments.
4. The apparatus according to claim 1, wherein the AI scanner is configured to scan the content in parallel.
5. The apparatus according to claim 1, wherein the controller is configured to: receive information from other users of the apparatus regarding how to process suspicious content; and provide this information to the user to allow the user to base their input on information received from other users.
6. The apparatus according to claim 1, wherein the content is processed through the signature and AI scanners at a rate of at least 1 Gbps.
7. The apparatus according to claim 1, wherein the apparatus is configured to identify different types of malware identified as threat content.
8. The apparatus according to claim 2, the network-based content includes one or more attachments.
9. The apparatus according to claim 1, wherein suspicious content from the behavioural scanner is passed to a machine learning algorithm as identified content and is identified to the AI scanner as either comprising malware or as not comprising malware; wherein the AI scanner is configured to scan the identified content to refine the characteristics which it uses to identify content as a threat based on the identified content received.
10. The apparatus according to claim 1, wherein the AI scanner is configured to prioritize scanning unidentified content which has not been identified as being blocked or allowed over identified content which has already been identified as being blocked or allowed.
11. The apparatus according to claim 1, wherein the apparatus is configured to assign priority and allocate apparatus resources based on both the size of each inspection task and the time taken to complete each content inspection task.
12. The apparatus according to claim 1 wherein the network-line rates are between 100 Mbits/s and 1 Gbits/s.
13. The apparatus according to claim 1 wherein the network-line rates are between 1 Gbits/s and 10 Gbits/s.
14. The apparatus according to claim 1 wherein the network-line rates are over 100 Gbits/s.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) Various objects, features and advantages of the invention will be apparent from the following description of particular embodiments of the invention, as illustrated in the accompanying drawings. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of various embodiments of the invention. Similar reference numerals indicate similar components.
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
(13)
(14)
DETAILED DESCRIPTION
(15) Introduction
(16) Various aspects of the invention will now be described with reference to the figures. A number of possible alternative features are introduced during the course of this description. It is to be understood that, according to the knowledge and judgment of persons skilled in the art, such alternative features may be substituted in various combinations to arrive at different embodiments of the present invention.
(17) A problem with using signature-based scanning is that it requires the lengthy process of finding malware, getting a sample, analyzing it, generating a signature and adding it to the repository that is propagated to users as anti-virus updates.
(18) Another problem is that some malware (e.g. polymorphic and metamorphic malware) is able to mutate its code, making signature generation difficult. A polymorphic virus, for example, may make use of a decrypter/encrypter stub. This makes the creation of a signature more difficult because a different key is used each time the virus propagates to a different computer, so the body of the virus is completely different every time. Metamorphic viruses may be able to ‘evolve’, i.e., re-code themselves. This ability makes the creation of a signature difficult.
(19) Machine learning techniques, on the other hand, may identify (or assign a risk to) malware based on its characteristics rather than a pre-determined signature, even if the malware has not been detected elsewhere or analyzed by researchers.
(20) Once the ‘intelligent’ malware detection system has been trained using known malware activity the system gains the ability to detect previously unknown malware. However, using such ‘intelligent’ malware detection systems alone may generate an unworkable quantity of false positives, as characteristics of safe files may be incorrectly identified as corresponding to malware.
(21) The present technology relates to a network-based line-rate method and system for detecting and managing potential malware, utilizing a blacklist of possible malware to scan content and detect potential malware content based upon characteristics that match the preliminary signature. The undetected content (e.g. content which does not correspond to previously identified malware) is then subjected to inference-based or behavioural-based processes and methods to determine whether the undetected content is safe for release. As is typical of inference-based processes and methods, the verdict is a numerical value within a predetermined range, outside of which content is not safe. The network content is released if the verdict is within the safe range; otherwise, the system provides various options for handling such presumably unsafe content, including soliciting user input on whether to release, block, or subject said content to further offline behavioral analysis.
(22) That is, the present technology uses a multi-staged analysis which may offer better accuracy during line-rate content-based scanning. Offline behavioral analysis may be used in conjunction with the present technology. For example, after-the-fact remediation may be used if it was deduced that malware is present.
(23) With the advent of mobile devices and cloud computing, where the accuracy of network-based solutions has to be high, embodiments use multiple scanners (e.g. arranged sequentially), in the order of signature scanning first, then the inference AI scanning techniques. In this way, a higher accuracy of detecting unknown malware may be achieved.
(24) Overview
(25)
(26) In this case, failure to be mapped to a known-malware signature does not mean that the content is not malware, and hence the undetected content 110 is subjected to a second stage comprising AI scanning 102. In other embodiments, the second stage may comprise behavioural, sandboxing analysis and/or inference-based techniques. Because this second stage takes place after the content has been scanned against known malware, the processing times and accuracy of the inference-based second stage can be greatly enhanced.
(27) In some embodiments, content that is still uncategorized after the two stages of scanning can either be blocked 113 or subjected to the user's input to determine whether the content should be blocked, released or subjected to full behavioral scanning.
(28) The scanning apparatus may be configured to scan received content and/or content for transmission.
(29) The scanning apparatus may be configured to scan content before or after a firewall. Configuring the scans to take place before the firewall helps relieve firewalls and other systems that may be fatigued.
(30) The content identified as safe can then be delivered to, for example, applications, Operating Systems, a Network or connected infrastructure (e.g. comprising clients, servers) or used in conjunction with the Internet of Things.
(31) The first two stages can be performed at network line rates (e.g. one or more of: up to 100 Mbits/s; between 100 Mbits/s and 1 Gbits/s; between 1 Gbits/s and 10 Gbits/s; between 10 Gbits/s and 100 Gbits/s; and over 100 Gbits/s).
(32) To give an indication of how the present system may be used, in a US school district with 120,000 active internet users, 3.2% are attacked on average, generating 3,840 events daily. Signature-based scanning is typically able to stop 3,540 of these. Without AI scanning, the remaining 300 files were submitted to a sandbox company. Among these submitted files, only 2 (1 to 3) are APTs (Advanced Persistent Threats). The second stage is hence expected to handle around 295 of these 300. Percentage-wise, signature scanning accuracy will typically be around 92.1%, with AI this may increase to 99.8% and with sandboxing the accuracy may approach 100%. It will be appreciated that these accuracy figures are for guidance only.
(33) Signature scanning scans against known malware, and hence provides a base blacklist. AI scanning scans content, analyzing its structure and looking for patterns, or sequences thereof, that are common across malware. AI scanning helps protect against malware for which signatures are yet to be developed (unknown malware or custom tailored malware). Because malware is typically made of executable computer instructions, AI scanning techniques can result in false positives, especially when scanning ‘safe’ computer instructions. AI scanning techniques may also be computationally ‘expensive’ compared with signature scanning.
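By way of illustration only, the arithmetic behind the figures in the school district example can be reproduced as follows. The variable names are chosen for this sketch; the numbers are those quoted above.

```python
# Figures quoted in the school district example (for guidance only).
active_users = 120_000
attack_rate = 0.032            # 3.2% of users attacked on average
stopped_by_signature = 3_540   # events stopped by signature scanning
handled_by_ai = 295            # remaining events expected to be handled by AI scanning

daily_events = round(active_users * attack_rate)
remaining_after_signature = daily_events - stopped_by_signature

# Fraction of daily events resolved after each stage.
signature_accuracy = stopped_by_signature / daily_events
combined_accuracy = (stopped_by_signature + handled_by_ai) / daily_events

print(daily_events, remaining_after_signature)
print(f"{signature_accuracy:.2%} {combined_accuracy:.2%}")
```

Truncated to one decimal place, the two ratios agree with the quoted 92.1% and 99.8%.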
(34) Signature Based Scanning
(35) The first stage of scanning in the embodiment of
(36) In this embodiment, the scan is performed according to the scheme outlined in Morishta et al. (U.S. Pat. No. 7,630,379). For example, the signature scanner may be configured to carry out the following steps:
a) subjecting each newly arriving data payload to content recognition to determine if the newly arriving data payload content has been previously inspected, has not been inspected or is currently under inspection;
b) allowing a newly arriving data payload recognized as previously inspected to be delivered without content inspection;
c) subjecting a newly arriving data payload recognized as not been inspected to content inspection to produce a new payload inspection result whereby the newly arriving data payload becomes a newly inspected data payload;
d) storing a message digest for the newly inspected data payload with the new payload inspection result in a content history lookup table
wherein content recognition includes the steps of:
i) subjecting each newly arriving data payload to a one way hash function to calculate a message digest of the newly arriving data payload;
ii) comparing the message digest of the newly arriving data payload to previously stored message digests in the content history lookup table wherein each previously stored message digest has an associated inspection result; and wherein
iii) if the message digest of the newly arriving data payload from step ii) is identical to a previously stored message digest determining:
   a. if the previously stored message digest is flagged as inspected then i. determining a policy action based on the inspection result; or
   b. if the previously stored message digest is flagged as under-inspection then i. waiting a pre-determined time period before repeating step ii).
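The content recognition steps above may be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the scheme of the referenced patent: the names `history`, `recognize` and `store_result` are hypothetical, SHA-256 is chosen only for the example, and marking a payload as under inspection on first sight is an assumption of the sketch.

```python
import hashlib
from enum import Enum


class State(Enum):
    UNDER_INSPECTION = "under inspection"
    INSPECTED = "inspected"


# Content history lookup table: message digest -> (state, inspection result).
history = {}


def digest_of(payload: bytes) -> str:
    # One-way hash of the payload (SHA-256 is an assumption of this sketch).
    return hashlib.sha256(payload).hexdigest()


def recognize(payload: bytes):
    """Return ("inspect", None), ("wait", None) or ("recognized", result)."""
    key = digest_of(payload)
    record = history.get(key)
    if record is None:
        # Not seen before: flag as under inspection so duplicates can wait.
        history[key] = (State.UNDER_INSPECTION, None)
        return ("inspect", None)
    state, result = record
    if state is State.INSPECTED:
        # Previously inspected: a policy action can be based on the stored result.
        return ("recognized", result)
    # Same content is currently being inspected: the caller retries later.
    return ("wait", None)


def store_result(payload: bytes, result: str) -> None:
    # Step d): store the digest together with the new inspection result.
    history[digest_of(payload)] = (State.INSPECTED, result)


first = recognize(b"example payload")    # new content, sent for inspection
second = recognize(b"example payload")   # duplicate arrives during inspection
store_result(b"example payload", "clean")
third = recognize(b"example payload")    # duplicate arrives after inspection
print(first, second, third)
```

In this sketch only the first of the three identical payloads would reach the content inspection stage.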
(37) In this way, the accuracy of the verdict can be improved by referring to verdicts of previously scanned content.
(38) In this case, the signature scanning is configured to block known malware 113a by identifying known malware as threat content 113; to forward all other traffic 110 to the next scanner 102; and provide coverage for Legacy File Types.
(39) AI Risk Assignment
(40) The second scanning stage 102 in this case is an AI (Artificial Intelligence) scanning stage which is configured to identify characteristics common to malware but different from safe content.
(41) Artificial intelligence may use algorithms to classify the behavior of a file as malicious or benign according to a series of file features that are manually extracted from the file itself. The manufacturer or user may configure which parameters, variables or features the machine looks at in order to make the decision. In some cases machine learning solutions are used to identify a suspicious situation, but the final decision as to what to do about it is left to a user.
(42) Artificial Intelligence scanning may use machine learning. Once a machine learns what malicious code looks like, it may be configured to identify unknown code as malicious or benign in real time (by assigning a risk value). Then a policy can be applied to delete or quarantine the file or perform some other specified action.
(43) AI scanning may be performed in a way similar to CylancePROTECT® software (e.g. as described in US 2015/227741).
(44) AI scanning may comprise analyzing and classifying many (e.g. thousands of) characteristics per file to determine a risk value.
(45) Analysis of a file may be based on a previous analysis of a large number (e.g. hundreds of millions) of files of specific types (executables, PDFs, Microsoft Word® documents, Java, Flash, etc.).
(46) The goal of this pre-analysis is to ensure that: there is a statistically significant sample size; the sample files cover a wide range of possible file types and file authors (or author groups such as Microsoft, Adobe, etc.); and the sample is evenly distributed across the specific file types.
(47) The pre-analysis involves classifying the files into three categories: known and verified valid, known and verified malicious, and unknown.
(48) The next phase in the machine learning process is the extraction of attributes which are characteristic of known malware. These attributes or characteristics can be as basic as the PE (Portable Executable) file size or the compiler used and as complex as a review of the first logic leap in the binary. The identified characteristics may be dependent on its type (.exe, .dll, .com, .pdf, .java, .doc, .xls, .ppt, etc.).
(49) Identifying multiple characteristics may also substantially increase the difficulty for an attacker to create a piece of malware that is not detected by the AI scanner.
(50) In this case, the AI scanner 102 is configured to analyze all Executables; provide risk scores; block threat content 113b (content with a risk value above a threat threshold); allow safe content 111a (content with a risk value below a safe threshold); identify suspicious content 112 (content with a risk value between the safe threshold and the threat threshold); and pass suspicious content 112 to the behavioural scanner.
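The two-threshold classification described above may be sketched as follows. The threshold values and the 0 to 1 risk scale are hypothetical, and treating a score exactly equal to a threshold as suspicious is an assumption of this sketch.

```python
SAFE_THRESHOLD = 0.2     # hypothetical value
THREAT_THRESHOLD = 0.8   # hypothetical value


def triage(risk: float) -> str:
    """Map a risk value to one of the three categories used by the AI scanner."""
    if risk < SAFE_THRESHOLD:
        return "safe"        # allowed (safe content 111a)
    if risk > THREAT_THRESHOLD:
        return "threat"      # blocked (threat content 113b)
    return "suspicious"      # passed on for further handling (suspicious content 112)


for score in (0.05, 0.5, 0.95):
    print(score, triage(score))
```

Widening the gap between the two thresholds sends more content to the later stage; narrowing it sends less, at the cost of more direct allow or block decisions.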
(51) In some cases, the behavioural scanner may be omitted. It will be appreciated that in some embodiments, the malware analysis apparatus may prompt the user to determine how to process suspicious content. For example, the apparatus may be configured to ask whether suspicious content should be allowed or blocked (e.g. prevented from running). The apparatus may be configured to remember the user's choices and to prompt with or apply them for suspicious content identified in future (e.g. based on source, risk score).
(52) Behavioural Scanner
(53) In this case, the embodiment comprises a behavioural scanner 103 for scanning suspicious content 112. The behavioural scanner comprises a malware sandbox. A malware sandbox is a controlled testing environment (sandbox) which is isolated from other computers in which an item of suspicious code can be executed in a controlled way to monitor and analyze its behaviour.
(54) The malware sandbox may comprise an emulator. An emulator may be considered to be a software program that simulates the functionality of another program or a piece of hardware.
(55) The malware sandbox may use virtualization. With virtualization, the suspicious content may be executed on the underlying hardware. The virtualization may be configured to only control and mediate the accesses of different programs to the underlying hardware. In this fashion, the sandbox may be isolated from the underlying hardware.
(56) The behavioural scanner in this case is configured to minimize false positives from the AI-based scanner by providing a safe outlet for content identified as suspicious by the AI scanner.
(57) The behavioural scanner is configured, based on the behaviour of the suspicious content 112, to identify the content as either threat content 113c or as safe content 111b.
(58) Content identified as safe content (e.g. by the AI or behavioural scanning) is allowed. Depending on the context, allowed may mean that the content can continue to its destination on the network, or that the content can be executed. Content identified as threat content (e.g. by the signature, AI or behavioural scanning) is blocked.
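The overall flow through the three stages may be sketched as follows. The three scanners are passed in as plain functions; their names and signatures are hypothetical stand-ins for this illustration, and no real detection logic is shown.

```python
from typing import Callable


def inspect(content: bytes,
            is_known_malware: Callable[[bytes], bool],
            ai_verdict: Callable[[bytes], str],
            sandbox_is_threat: Callable[[bytes], bool]) -> str:
    """Return "blocked" or "allowed" after at most three stages."""
    # Stage 1: signature scanning against known malware.
    if is_known_malware(content):
        return "blocked"
    # Stage 2: AI scanning of content not identified in stage 1.
    verdict = ai_verdict(content)   # expected: "safe", "threat" or "suspicious"
    if verdict == "threat":
        return "blocked"
    if verdict == "safe":
        return "allowed"
    # Stage 3: behavioural scanning of suspicious content only.
    return "blocked" if sandbox_is_threat(content) else "allowed"


# Stub scanners for demonstration only.
result = inspect(b"sample",
                 is_known_malware=lambda c: False,
                 ai_verdict=lambda c: "suspicious",
                 sandbox_is_threat=lambda c: True)
print(result)
```

Each later stage is only called when the earlier one could not reach a verdict, which is the property the staged arrangement relies on.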
(59) Interaction Between Stages
(60) In embodiments described herein, because the first step is a signature-based analysis, the threshold for identifying a particular file as a risk may be higher in the AI-scanning stage. This may reduce the instances of false-positives.
(61) Once a file is identified as being a threat in the signature-based scan, this file may be passed to the machine learning algorithm already identified as being a threat for learning (in addition to being blocked). Similarly, in embodiments with a sandbox stage, any files identified as being safe or as being a threat may be passed to the AI algorithm to allow the AI algorithm to learn from this sandbox analysis. This may allow the machine learning algorithm to refine the characteristics which it uses to identify content as a threat.
(62) Because NBCI systems have a finite number of system resources, by filtering out known malware in a first stage, more of these resources may be dedicated to detecting unknown content using AI scanning. This may result in a more robust and stable NBCI apparatus. In addition, to maintain line rate scanning speeds, it is important that the AI scanning stage is dedicated to unknown threats. That is, because AI can be slower than signature scanning, it may be important to ensure that AI scanning is used only on content which cannot be identified as a threat using the more rapid signature scanning.
Second Embodiment
(63)
(64) That is,
(65) Interaction with Network
(66)
(67) The malware analysis apparatus is configured to scan content passing between the internal and external portions of the network to filter threat content.
(68) As noted above, the malware analysis apparatus may be configured to scan content before or after a firewall. Configuring the scans to take place before the firewall helps relieve firewalls and other systems that may be fatigued.
(69) Controlling Computation
(70) Methods for optimizing the computation required to perform content inspection on concurrently received network data packet payloads are now described. In the context of this description, “concurrently” means data payloads received by a computer network within a short time period such that the system resources consider the data payloads to have been effectively received at the same time or within a short time period.
(71) With reference to
(72) In this case, a policy module 15 applies a set of operations, such as the downstream delivery of the recognized payload 14a or modification of the payload, based on e.g. business specific policies. An inspected payload 14c and inspection result 14d are returned to the CRM 12 in order that subsequent receipt of a similar payload does not pass through the content inspection module 16. Generally, an inspection result is one or more markers that indicate that the content has been inspected and/or classified, and that enable other functions to be performed on the payload according to pre-determined policies.
(73) With reference to
(74) After inspection, the newly inspected content 14c is passed through the one-way hash-function to calculate a message digest 20a of the newly inspected content 14c. A CIH record 42b is inserted into the CIHL Table 24. This entry has the message digest 20a, the Inspection State “Inspected”, the Inspection Result 14d, and optionally other supplementary information as will be explained in greater detail below.
(75) If the comparison returns a matching CIH record 42 with the Inspection State field 43 being “Under Inspection” (Step 29), meaning a previous payload carrying the same content is currently being inspected, the processing of the latter payload content will wait for a period of time 26 before continuing. When the system determines that the inspection state of the previous payload content (
(76) If the comparison (step 28) returns a matching CIH record 42 with the Inspection State field 43 being “Inspected”, meaning that the digest corresponds to the message digest of previously inspected content, the payload by-passes the content inspection module 16 as recognized payload 14a.
(77) The one-way hash function may be a known hashing algorithm, such as a Secure Hash Algorithm (e.g. SHA-1), a Message Digest algorithm (MD2, MD4, MD5), variations thereof or other hashing algorithms as known to those skilled in the art.
(78) With reference to
(79) In this embodiment, the payload is decomposed into logical portions 30 and each portion is evaluated to determine if it has been inspected. If the algorithm determines that there are un-inspected portions (step 31), a message digest (step 32) is calculated for the un-inspected portions. Each message digest is then searched within the CIHL table as described above.
(80) Decomposition may be achieved by breaking down a payload into logical portions such as by attachment within an email, or the file content within a zip file.
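As a rough illustration of decomposition, the sketch below breaks a zip payload into its member files and calculates one message digest per portion, so that only un-inspected portions need further work. The function names are hypothetical, SHA-256 is an assumption of the example, and the set of inspected digests stands in for the CIHL table.

```python
import hashlib
import io
import zipfile


def portion_digests(zip_bytes: bytes) -> dict:
    """Map each file name inside a zip payload to the digest of its content."""
    digests = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories carry no content to inspect
            digests[info.filename] = hashlib.sha256(archive.read(info)).hexdigest()
    return digests


def uninspected(digests: dict, inspected: set) -> list:
    """Names of the portions whose digests are not yet in the inspected set."""
    return [name for name, digest in digests.items() if digest not in inspected]


# Build a small zip payload in memory for demonstration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("a.txt", b"hello")
    archive.writestr("b.txt", b"world")

digests = portion_digests(buffer.getvalue())
already_inspected = {hashlib.sha256(b"hello").hexdigest()}
pending = uninspected(digests, already_inspected)
print(pending)
```

Here the portion whose content was seen before is skipped, and only the remaining member would be passed on for inspection.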
(81) Scheduling Manager
(82) In a preferred embodiment, scheduling the content inspection of multiple inspection tasks is conducted to prevent system resource exhaustion in the event of the rapid or simultaneous arrival of many different data payloads, many instances of the same content, or in the event of a denial-of-service attack. Scheduling of one or more of the AI and signature scanning stages will ensure that the system resources are efficiently utilized to complete content inspection and are spent on applying the content inspection algorithms to only one instance of any multiple instances. This is achieved by giving much lower priority to time-consuming or system resource demanding content processing tasks. Scheduling is accomplished by utilizing the content inspection state (i.e. un-inspected, under-inspection or inspected) together with information relating to the number of required inspection tasks, the time of receipt of an inspection task and the size of the inspection task.
(83)
(84) As shown in
(85) As content inspection tasks are being registered and un-registered from the TQD CIP manager 90, the TQD CIP manager 90 continuously loops through each of the registered content inspection tasks and reviews and updates the status or priority of each content inspection task.
(86) With reference to
(87) If the TTL is not less than zero, the TQD CIP manager 90 will reduce the priority for the ith inspection task (step 106) by a pre-determined value.
(88) Once the priority has been adjusted or the ith CIP has been aborted, the TQD CIP manager determines if there are any remaining registered inspection tasks and either waits for a period of time (step 94) to check for registered inspection tasks or continues reviewing and adjusting the status of other registered inspection tasks.
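One pass of such a review loop may be sketched as follows. This is only an illustration under assumptions: the `InspectionTask` fields, the `PRIORITY_STEP` value, the way the TTL is decreased by elapsed time, and aborting a task once its TTL falls below zero are all choices made for the sketch rather than details taken from the description.

```python
from dataclasses import dataclass

PRIORITY_STEP = 1  # hypothetical pre-determined value


@dataclass
class InspectionTask:
    name: str
    ttl: float            # remaining time-to-live (units are an assumption)
    priority: int
    aborted: bool = False


def review(tasks: list, elapsed: float) -> None:
    """Review every registered task once and update its status or priority."""
    for task in tasks:
        if task.aborted:
            continue
        task.ttl -= elapsed  # assumed update rule for the TTL
        if task.ttl < 0:
            task.aborted = True              # out of time: abort the task
        else:
            task.priority -= PRIORITY_STEP   # still running: lower its priority


tasks = [InspectionTask("small", ttl=5.0, priority=10),
         InspectionTask("large", ttl=1.0, priority=10)]
review(tasks, elapsed=2.0)
print(tasks)
```

Lowering the priority on each pass means that tasks which take a long time gradually yield resources to newer or quicker tasks.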
(89) As an example of a possible scheduling scenario, 5 content inspection tasks may have been registered with the TQD CIP manager 90. These registered inspection tasks may include 3 small files (e.g. 3 kb each), 1 medium size file (e.g. 10 Mb) and 1 large file (e.g. 100 Mb) received in any particular order. In processing these inspection tasks, the manager will seek to balance the content inspection in order to maintain efficiency for a desired level of service. For example, scheduling manager parameters may be set to ensure that priority is assigned to inspection of the smaller files first regardless of the time of receipt. Alternatively, scheduling manager parameters may be set to ensure that priority is assigned strictly based on the time of arrival regardless of size. As illustrated in
(90) It is understood by those skilled in the art that the determination of priority and the allocation of system resources to effectively manage content inspection based on content size, and time-to-complete an inspection task may be accomplished by a variety of algorithms and that the methodologies described above are only a limited number of examples of such algorithms.
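The two parameter settings mentioned in the scheduling scenario above (smallest files first, or strict order of arrival) may be contrasted with a short sketch. The `Task` type, the arrival order and the interpretation of the example sizes as bytes are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    size_bytes: int
    arrival: int   # order of receipt (0 = first)


# Five registered tasks: three small, one medium and one large file.
tasks = [
    Task("large", 100_000_000, 0),
    Task("small-1", 3_000, 1),
    Task("medium", 10_000_000, 2),
    Task("small-2", 3_000, 3),
    Task("small-3", 3_000, 4),
]

# Policy A: smaller files first, ties broken by time of receipt.
by_size = sorted(tasks, key=lambda t: (t.size_bytes, t.arrival))

# Policy B: strictly by time of arrival, regardless of size.
by_arrival = sorted(tasks, key=lambda t: t.arrival)

print([t.name for t in by_size])
print([t.name for t in by_arrival])
```

Under policy A the large file is inspected last even though it arrived first; under policy B it is inspected first and the small files wait behind it.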
(91) Classification of Inspection Results
(92) In various embodiments, the content of a data payload, as a recognized payload 14a or an inspected payload 14c can be associated with further information as described below allowing the system to take particular actions with respect to the payload based on the inspection result (
(93) a) Classification of Content
(94) The inspection result can be classified on the basis of content. For example, it can be a marker indicating that the content is spam, spyware or a virus.
(95) b) Content Instructions
(96) The inspection result can include a set of instructions to alter the content. In this case, the policy module 15 may use these instructions to take further steps with respect to the payload. For example, if the content is marked as a virus, the instructions may be to warn the recipient that the payload contains a virus and should not be opened. In other examples, the instructions may be to prevent the delivery of the payload, but to send information indicating that the delivery has been denied.
(97) c) Supplementary Data
(98) The inspection result can be associated with supplementary data. Supplementary data provides further functionality including enhanced security to the methods of the invention.
(99) For example, supplementary data may include the time of creation 44 of the message digest, which may be used to provide enhanced security. It is known that, given enough time, an attacker can achieve a collision with the commonly used one-way hash algorithms. By adding time information as supplementary data, a message digest can be retired if it is older than a pre-determined value.
(100) In another embodiment supplementary data may also or alternatively include the size 45 of the payload wherein the size information can be used to provide finer granularity to also reduce the possibility of a hash code collision. In this example, when conducting the CIHL table search function within the lookup table, both the message digest and the size have to match those of the payload.
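The use of creation time and payload size as supplementary data may be sketched as follows. The table layout, the function names, the `MAX_AGE_SECONDS` value and the use of SHA-256 are assumptions of this illustration.

```python
import hashlib

MAX_AGE_SECONDS = 3600.0   # hypothetical pre-determined retirement age

# Lookup table keyed by (message digest, payload size) -> (creation time, result).
table = {}


def key_of(payload: bytes) -> tuple:
    # Both the digest and the size must match for a record to be found.
    return (hashlib.sha256(payload).hexdigest(), len(payload))


def remember(payload: bytes, result: str, now: float) -> None:
    table[key_of(payload)] = (now, result)


def lookup(payload: bytes, now: float):
    """Return the stored result, or None if absent or retired due to age."""
    key = key_of(payload)
    record = table.get(key)
    if record is None:
        return None
    created_at, result = record
    if now - created_at > MAX_AGE_SECONDS:
        del table[key]   # retire the digest; the content is inspected again
        return None
    return result


remember(b"payload", "clean", now=0.0)
fresh = lookup(b"payload", now=10.0)
stale = lookup(b"payload", now=4000.0)
print(fresh, stale)
```

Retiring old records bounds the time available to an attacker to find a colliding payload, and matching on size as well as digest narrows what counts as a match.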
(101) Deployment
(102) The system may be deployed as an adaptive external module of an existing content inspection system, or as an embedded module within a content inspection system.
(103) In one embodiment, the system is implemented with the content recognition module interacting with an existing content inspection co-processor as shown in
(104) In another embodiment, the system is a software component embedded into a content inspection module which is a software module, a co-processor or a computer system.
(105) In a further embodiment, and in order to leverage the computation spent on content inspection, the message digests along with the inspection results can be shared among several instances of content recognition/inspection systems as shown in
(106) The preceding description is intended to provide an illustrative description of the invention. It is understood that variations in the examples of deployment of the invention may be realized without departing from the spirit of the invention.
CONCLUSIONS
(107) As outlined above, the present technology may be better able to reach a verdict using signature and AI-based scanning in real time (i.e. at line rates). This may be at least partially because of the subsonic scanning (and hashing of previous caches) as well as the Time Quantum based scheduling described in the applicant's previous U.S. Pat. No. 7,630,379. These techniques may be used in both the signature and the AI based scanners.
(108) Another aspect is that the present technology may be used inline with traffic. Most previous network solutions typically use tap mode because they cannot ‘cope’ with the network throughput; being in tap mode means they can pass on some network traffic without slowing the network. However, disadvantages of such tap mode (not inline) solutions may include that they cannot participate in the handshakes of ‘encryption’ (SSL for example) and hence cannot ‘see’ encrypted content. With encryption becoming more the norm, increasing at nearly 16% year over year (nearly 64% now), the usefulness of tap mode solutions may be diminished.
(109) Although the present invention has been described and illustrated with respect to preferred embodiments and preferred uses thereof, it is not to be so limited since modifications and changes can be made therein which are within the full, intended scope of the invention as understood by those skilled in the art.