Malware detection quality control
20220414215 · 2022-12-29
Inventors
- Andrey Kulaga (Moscow, RU)
- Nikolay Balakin (Abakan City, RU)
- Nikolay Grebennikov (Singapore, SG)
- Serguei Beloussov (Singapore, SG)
- Stanislav Protasov (Singapore, SG)
Cpc classification
G06F21/56
PHYSICS
International classification
Abstract
A method of continuous development of an internal threat scan engine based on an iterative quality assessment includes iteratively performing a dynamic assessment of a quality of a threat detection with a frequency defined for each of objects in an object collection, wherein a result of the dynamic assessment includes internal and external scan results of the objects and a consistency verdict of the internal and external scan results of the objects, changing a frequency of scanning iteration of the objects based on the consistency verdict of the external and internal scan results of the objects, classifying the objects based on the result of the dynamic assessment, and creating a development task including the internal and external scan results of the objects, meta-data of the objects, and automated test results to provide details for developing a software to fix inconsistency of the internal and external scan results.
Claims
1. A method of continuous development of an internal threat scan engine based on an iterative quality assessment, the method comprising: iteratively performing a dynamic assessment of a quality of a threat detection with a frequency defined for each of objects in an object collection, wherein a result of the dynamic assessment includes internal and external scan results of the objects and a consistency verdict of the internal and external scan results of the objects; changing a frequency of scanning iteration of the objects based on the consistency verdict of the external and internal scan results of the objects; classifying the objects based on the result of the dynamic assessment; creating a development task including the internal and external scan results of the objects, meta-data of the objects, and automated test results to provide details for developing a software to fix inconsistency of the internal and external scan results of the objects; controlling the dynamic assessment in accordance with a dynamic of implementation of the development task; and maintaining the quality of the threat detection on a given level based on the controlled dynamic assessment and a priority of the development task.
2. The method according to claim 1, wherein the frequency of scanning iteration of the objects changes when the objects include new files.
3. The method according to claim 2, wherein the frequency of scanning iteration of the objects further changes when the internal and external scan results of the objects differ.
4. The method according to claim 1, wherein, in the changing of the frequency of scanning iteration of the objects, when the objects are new files and internal and external scan results of the objects differ, the frequency of scanning iteration of the objects increases.
5. The method according to claim 1, wherein, when the internal and external scan results of the objects differ, the frequency of scanning iteration of the objects is more than the frequency of scanning iteration of the objects when the internal and external scan results of the objects are consistent.
6. The method according to claim 1, further comprising: before the changing of the frequency of scanning iteration of the objects, receiving information on the objects about malicious and clean files from various sources.
7. The method according to claim 1, wherein the internal and external scan results of the objects include file appearances, last scan dates, datasets, and historical information.
8. The method according to claim 1, wherein the objects are from collections of various sources with supported products and scanning engines.
9. The method according to claim 1, wherein the classifying of the objects includes comparing verdicts of the internal and external scan results of the objects with information of third-party scanning services about the objects.
10. The method according to claim 1, wherein the iteratively performing of the dynamic assessment continues by repetition to achieve a predetermined value for the quality of threat detection.
11. A system for continuous development of an internal threat scan engine based on an iterative quality assessment, the system comprising: a processor coupled to a memory storing instructions, the processor being configured to: iteratively perform a dynamic assessment of a quality of a threat detection with a frequency defined for each of objects in an object collection, wherein a result of the dynamic assessment includes internal and external scan results of the objects and a consistency verdict of the internal and external scan results of the objects; change a frequency of scanning iteration of the objects based on the consistency verdict of the external and internal scan results of the objects; classify the objects based on the result of the dynamic assessment; create a development task including the internal and external scan results of the objects, meta-data of the objects, and automated test results to provide details for developing a software to fix inconsistency of the internal and external scan results of the objects; control the dynamic assessment in accordance with a dynamic of implementation of the development task; and maintain the quality of the threat detection on a given level based on the controlled dynamic assessment and a priority of the development task.
12. The system according to claim 11, wherein the processor changes the frequency of scanning iteration of the objects when the objects include new files.
13. The system according to claim 12, wherein the processor further changes the frequency of scanning iteration of the objects when the internal and external scan results of the objects differ.
14. The system according to claim 11, wherein, in the changing of the frequency of scanning iteration of the objects, when the objects are new files and internal and external scan results of the objects differ, the processor increases the frequency of scanning iteration of the objects.
15. The system according to claim 11, wherein, when the internal and external scan results of the objects differ, the frequency of scanning iteration of the objects is more than the frequency of scanning iteration of the objects when the internal and external scan results of the objects are consistent.
16. The system according to claim 11, wherein, before the changing of the frequency of scanning iteration of the objects, the processor receives information on the objects about malicious and clean files from various sources.
17. The system according to claim 11, wherein the internal and external scan results of the objects include file appearances, last scan dates, datasets, and historical information.
18. The system according to claim 11, wherein the objects are from collections of various sources with supported products and scanning engines.
19. The system according to claim 11, wherein the classifying of the objects includes comparing verdicts of the internal and external scan results of the objects with information of third-party scanning services about the objects.
20. The system according to claim 11, wherein the processor continues the iteratively performing of the dynamic assessment by repetition to achieve a predetermined value for the quality of threat detection.
Description
SUMMARY OF FIGURES
[0016] The exemplary aspects of the invention will be better understood from the following detailed description of the exemplary embodiments of the invention with reference to the drawings:
[0017]
[0018]
[0019]
[0020]
DETAILED DESCRIPTION
[0021] Exemplary embodiments of the invention will now be described with reference to the drawings.
[0022] As shown exemplarily in
[0023] In step 105, the external scan result and object metadata, which include at least the time of the first detection of the object, are obtained, and in step 106, the internal scan result and object metadata are obtained.
[0024] Metadata may, for example, be related to VirusTotal, which includes a full list of engines in use, a list of existing privileges, etc. Metadata can be used for object classification, since the metadata is included in a scan result fully or partly.
[0025] In step 107, a composite verdict including the results of internal and external scans and object meta-data from internal and external scan engines is obtained.
[0026] Step 108 includes updating the object history information with the composite verdict of iteration, and the composite verdict is assessed at step 109.
[0027] When the composite verdict is determined to be consistent in step 110, a task “A” is scheduled for the next iteration in step 112, a list of object sources that provided the object is obtained in step 118, and the trust level of object source, based on the completed iteration, is updated in step 119.
[0028] The composite verdict is determined to be inconsistent when some pairs of verdict attributes are not equal. For example, the scanning verdicts (e.g., malicious or black, suspicious or grey, and white) differ, or the security ratings (a number) differ, or the class of threat is different (trojan, virus, or ransomware), or the meta-data contain additional data, etc.
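As a non-limiting illustration, the pairwise comparison of verdict attributes described above may be sketched as follows. The `ScanResult` container, its field names, and the helper functions are hypothetical names chosen for this example and do not appear in the embodiments.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ScanResult:
    """Hypothetical container for the verdict attributes named in the text."""
    verdict: str                        # "black", "grey", or "white"
    rating: Optional[int] = None        # numeric security rating, if reported
    threat_class: Optional[str] = None  # e.g. "trojan", "virus", "ransomware"


def inconsistent_attributes(internal: ScanResult, external: ScanResult) -> List[str]:
    """Return the names of the attribute pairs that are not equal."""
    pairs = {
        "verdict": (internal.verdict, external.verdict),
        "rating": (internal.rating, external.rating),
        "threat_class": (internal.threat_class, external.threat_class),
    }
    return [name for name, (a, b) in pairs.items() if a != b]


def is_consistent(internal: ScanResult, external: ScanResult) -> bool:
    """The composite verdict is consistent only when every pair matches."""
    return not inconsistent_attributes(internal, external)
```

In this sketch, a missing attribute on one side (for example, a rating reported only by the external engine) is treated as a mismatch, which is one possible reading of "meta-data contain additional data".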
[0029] When the composite verdict is determined to be inconsistent in step 110, the object is classified with other objects within the collection based on the composite verdict in step 111, automated tests to verify the inconsistency of the result for the class of objects are conducted in step 113, and in step 114, the object history information is updated with the automated test results.
[0030] When the internal verdict is determined to be correct in step 115, the process proceeds to step 112 to schedule the task “B” for the next iteration. The scheduled time, the scope of external and internal engines, and the object collections are set for tasks “A” and “B” independently.
[0031] When the internal verdict is determined to be incorrect in step 115, a scan engine development task with the composite verdict and automated test results is created in step 116, a task is scheduled for the next iteration in step 117, and a list of object sources that provided the object is then obtained in step 118.
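The branching among steps 110, 115, 112, and 116 described above may be summarized by the following minimal sketch. The `NextAction` enumeration and the `decide_next_action` function are hypothetical names used only for illustration; the step numbers in the comments refer to the steps recited above.

```python
from enum import Enum
from typing import Optional


class NextAction(Enum):
    TASK_A = "schedule task A"             # consistent verdict (steps 110, 112)
    TASK_B = "schedule task B"             # inconsistent, internal verdict correct (steps 115, 112)
    DEV_TASK = "create development task"   # inconsistent, internal verdict incorrect (step 116)


def decide_next_action(consistent: bool,
                       internal_correct: Optional[bool] = None) -> NextAction:
    """Map the outcome of one iteration to the next action.

    `internal_correct` stands for the outcome of the automated tests
    (steps 113 to 115) and is only needed for an inconsistent verdict.
    """
    if consistent:
        return NextAction.TASK_A
    if internal_correct is None:
        raise ValueError("automated test outcome is required for an inconsistent verdict")
    return NextAction.TASK_B if internal_correct else NextAction.DEV_TASK
```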
[0032]
[0033] As shown in
[0034] The iteration scheduler 206 interacts with the external scan manager 203, the internal scan manager 204, and the automated tests 219.
[0035] More specifically, the iteration scheduler 206 controls the flow of the scanning during the iteration and schedules the next iteration based on the composite verdict.
[0036] The internal scan manager 204 communicates with security applications 216 and 217, and internal scan engine 218 by sending commands or objects to internal scan engines or products to get current scan results, using API, command line, or other interfaces.
[0037] The external scan manager 203 communicates with external scan engines 201 and 202 by sending commands or objects to third-party tools and engines to get current scan results, using API, command line, or other interfaces.
[0038] The object 209 is provided from the object collections 211, which include object collections 212 and 213 communicating with object sources 214 and 215. The object 209 may include files (e.g., scripts, exe files, documents, webpages, etc.), URLs, IP addresses, domain names, etc.
[0039] In an exemplary aspect, the system shown in
[0040] A result of the dynamic assessment includes internal and external scan results of the objects, by internal scan engine 1 and external scan engine 2, respectively, and a consistency verdict of the internal and external scan results of the objects 209.
[0041] The system changes a frequency of scanning iteration of the objects 209 based on the consistency verdict of the external and internal scan results of the objects assessed by the composite verdict analyzer 205.
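By way of a hedged example, the change of scanning frequency may be expressed as an adjustment of the interval between iterations. The multipliers, the interval bounds, and the function name `next_scan_interval` below are assumptions made for this sketch; the embodiments only require that objects with differing results, and in particular new files, be scanned more frequently than objects with consistent results.

```python
from datetime import timedelta


def next_scan_interval(current: timedelta,
                       consistent: bool,
                       is_new_file: bool,
                       min_interval: timedelta = timedelta(hours=1),
                       max_interval: timedelta = timedelta(days=30)) -> timedelta:
    """Return the interval until the next scanning iteration of an object.

    A shorter interval means a higher scanning frequency.
    The factors 2 and 4 are illustrative assumptions.
    """
    if consistent:
        proposed = current * 2   # results agree: scan less often
    elif is_new_file:
        proposed = current / 4   # new file with differing results: scan much more often
    else:
        proposed = current / 2   # differing results: scan more often
    # Keep the interval within the configured bounds.
    return max(min_interval, min(max_interval, proposed))
```

For example, an object scanned daily whose internal and external results differ would, under these assumed factors, be rescanned after 12 hours, or after 6 hours if it is a new file.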
[0042] The system further classifies the objects 209 based on the result of the dynamic assessment and creates a development task 208 including the internal and external scan results of the objects 209, meta-data of the objects, and automated test results to provide details for developing software to fix the inconsistency of the internal and external scan results of the objects.
[0043] The system also controls the dynamic assessment in accordance with a dynamic of implementation of the development task 208 and maintains the quality of the threat detection on a given level based on the controlled dynamic assessment and a priority of the development task 208.
[0044]
[0045] In an exemplary aspect of the present invention, a system automatically receives information about malicious and clean files from various sources (e.g., VirusTotal feed, MalShare, workstations, etc.). The system automatically classifies files into categories, for example, black, white, and gray, and makes it possible to obtain a confusion matrix, and its dependence on time, for various verdict providers on different datasets.
[0046] Aggregation by file appearance date, last scan date, dataset, and any combination of these parameters is available, and all historical information (for example, scan results) is collected and stored. The system allows visualization of the metrics and works in near real time (i.e., lagging only by the time required for classification).
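For illustration only, a confusion matrix for one verdict provider may be computed from pairs of a reference category and the category reported by the provider. The function name, the tuple layout, and the use of a reference category (for example, a final verdict) are assumptions of this sketch; the time dependence could be obtained by calling the function separately for each scan date.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

LABELS = ("black", "grey", "white")


def confusion_matrix(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, int]]:
    """Count (reference, predicted) category pairs.

    Returns a nested mapping: matrix[reference][predicted] -> number of files.
    """
    counts = Counter(pairs)
    return {ref: {pred: counts[(ref, pred)] for pred in LABELS} for ref in LABELS}
```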
[0047] In the above-described process, as shown in the exemplary
[0048] The entities shown in the exemplary
[0049] The data source has two attributes, id and name. It produces files (e.g., only hashes are used) and fills in all required fields. One data source produces many files. At the same time, one file can be produced by different data sources (i.e., the source ids vary).
[0050] A file has properties. All required fields are filled in by the data source; optional fields can be updated by verdicts. A file can be scanned by different verdict providers.
[0051] A verdict provider has two attributes, id and name, and provides verdicts. The temporal dependence of verdicts is taken into account: one file can be scanned several times to get several verdicts.
[0052] A verdict has four attributes that are filled in by the verdict provider: id, file_id, scan_date, and result.
[0053] The task controls the whole system. First, new files are produced by data sources. Then, among all files, the files of interest are selected (e.g., new files that were received recently). After that, the selected files are rescanned by verdict providers and new verdicts are received.
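The three stages of the task (produce, select, rescan) may be sketched as follows. This is a simplified illustration under stated assumptions: data sources and verdict providers are modeled as plain callables, files as dictionaries with `sha256` and `first_submission` keys, and the selection rule as a hypothetical seven-day window.

```python
from datetime import datetime, timedelta
from typing import Callable, Dict, List


def select_recent(files: List[dict], now: datetime, max_age: timedelta) -> List[dict]:
    """Keep the files of interest: those received within `max_age` of `now`."""
    return [f for f in files if now - f["first_submission"] <= max_age]


def run_task(sources: List[Callable[[], List[dict]]],
             providers: Dict[str, Callable[[str], str]],
             now: datetime,
             max_age: timedelta = timedelta(days=7)) -> List[dict]:
    """Run one pass of the workflow and return the new verdicts."""
    # 1. New files are produced by the data sources.
    files = [f for produce in sources for f in produce()]
    # 2. The files of interest are selected.
    selected = select_recent(files, now, max_age)
    # 3. The selected files are rescanned by every verdict provider.
    verdicts = []
    for name, scan in providers.items():
        for f in selected:
            verdicts.append({"file": f["sha256"], "provider": name,
                             "scan_date": now, "result": scan(f["sha256"])})
    return verdicts
```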
[0054] The entity relationship diagram shown in
[0055]
[0056] Data Sources
[0057] Contains attributes of data sources.
TABLE-US-00001
| Name | Data type | Description |
|------|-----------|-------------|
| id | INTEGER | Data source identifier, primary key, unique. |
| name | VARCHAR | A required unique field: data source name. |
[0058] Files
[0059] Contains attributes of files.
TABLE-US-00002
| Name | Data type | Description |
|------|-----------|-------------|
| id | INTEGER | File identifier, primary key, unique. |
| file_type | VARCHAR | A required field. Describes the file type. |
| source_id | INTEGER | A required field, source id; refers to the id field of the table data_sources. |
| first_submission | DATETIME | A required field, the time the file was received, set by the data source. |
| md5 | VARCHAR | A required field, the MD5 hash of the file. |
| sha1 | VARCHAR | A required field, the SHA-1 hash of the file. |
| sha256 | VARCHAR | A required field, the SHA-256 hash of the file. |
| first_appearance | DATETIME | An optional field, first appearance of the file in the VirusTotal system. |
| final_verdict | INTEGER | An optional field, the final verdict, based on the decision of the system. |
[0060] Verdict Providers
[0061] Contains attributes of verdict providers.
TABLE-US-00003
| Name | Data type | Description |
|------|-----------|-------------|
| id | INTEGER | Verdict provider identifier, primary key, unique. |
| name | VARCHAR | A required unique field: verdict provider name. |
[0062] Verdicts
[0063] Contains attributes of verdicts.
TABLE-US-00004
| Name | Data type | Description |
|------|-----------|-------------|
| id | INTEGER | Verdict identifier, primary key, unique. |
| file_id | INTEGER | A required field, file id; refers to the id field of the table files. |
| scan_date | DATETIME | A required field, scan time. |
| verdict_provider_name | INTEGER | An optional field containing the verdict. Each verdict provider has its own column. |
[0064] In an exemplary aspect, when creating the database, 4 empty tables are created. The table verdicts is created with 3 columns (id, file_id, and scan_date). When adding a record to the verdict_providers table, a column in the verdicts table with the corresponding name is created.
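One possible realization of this database creation is sketched below using SQLite. The choice of SQLite, the `NOT NULL` and `REFERENCES` constraints, and the identifier check are assumptions of this example; the function names follow the Detect DB method names but are written here as free functions taking a connection.

```python
import re
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS data_sources (
    id INTEGER PRIMARY KEY,
    name VARCHAR NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS files (
    id INTEGER PRIMARY KEY,
    file_type VARCHAR NOT NULL,
    source_id INTEGER NOT NULL REFERENCES data_sources(id),
    first_submission DATETIME NOT NULL,
    md5 VARCHAR NOT NULL,
    sha1 VARCHAR NOT NULL,
    sha256 VARCHAR NOT NULL,
    first_appearance DATETIME,
    final_verdict INTEGER
);
CREATE TABLE IF NOT EXISTS verdict_providers (
    id INTEGER PRIMARY KEY,
    name VARCHAR NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS verdicts (
    id INTEGER PRIMARY KEY,
    file_id INTEGER NOT NULL REFERENCES files(id),
    scan_date DATETIME NOT NULL
);
"""


def create_DB(conn: sqlite3.Connection) -> None:
    """Create the four empty tables if they do not exist."""
    conn.executescript(SCHEMA)


def add_new_verdict_provider(conn: sqlite3.Connection, name: str) -> None:
    """Register a provider and add its own column to the verdicts table."""
    # Column names cannot be bound as SQL parameters, so the name is
    # restricted to a plain identifier before it is interpolated.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        raise ValueError(f"invalid provider name: {name!r}")
    conn.execute("INSERT INTO verdict_providers (name) VALUES (?)", (name,))
    conn.execute(f'ALTER TABLE verdicts ADD COLUMN "{name}" INTEGER')
    conn.commit()
```

Adding one column per provider keeps every verdict for a scan in a single row, at the cost of a schema change whenever a provider is registered.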
[0065] With regard to the component model, MDQCS includes the following components:
1. DataSources
2. VerdictProviders
3. DetectDB
4. Tasks
[0066] The following tables describe the responsibility of each component:
TABLE-US-00005
| Component name | Responsibility | Input | Output |
|----------------|----------------|-------|--------|
| DataSources | Gets a list of new files from a data source | Valid authorization data (if required) | List of received file hashes and receiving times |
| VerdictProviders | Scans specified files and provides reports | List of hashes | Reports |
| DetectDB | Responsible for storing all information (files and reports) | List of files or verdicts | - |
| Tasks | Launches other modules in the desired sequence and controls the entire workflow | Instances of the above modules | - |
[0067] Data Sources
[0068] Methods
TABLE-US-00006
| Method | Description |
|--------|-------------|
| .produse( ) | Get a list of new files |
| .get_error( ) | Get errors |
[0069] Attributes
TABLE-US-00007
| Attribute | Description |
|-----------|-------------|
| name | Data source name |
[0070] Verdict Providers
[0071] Methods
TABLE-US-00008
| Method | Description |
|--------|-------------|
| .scan( ) | Scans specified files and provides reports |
| .get_errors( ) | Get errors |
[0072] Attributes
TABLE-US-00009
| Attribute | Description |
|-----------|-------------|
| name | Verdict provider name |
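The data source and verdict provider components may, for example, be expressed as abstract interfaces. The sketch below keeps the method names as spelled in the method tables; the return types, the `StaticSource` and `ConstantProvider` stub classes, and their constructor parameters are hypothetical and serve only to show how a concrete component could plug in.

```python
from abc import ABC, abstractmethod
from datetime import datetime
from typing import Dict, List, Tuple


class DataSource(ABC):
    def __init__(self, name: str) -> None:
        self.name = name  # data source name

    @abstractmethod
    def produse(self) -> List[Tuple[str, datetime]]:
        """Get a list of new files as (hash, receiving time) pairs."""

    @abstractmethod
    def get_error(self) -> List[str]:
        """Get errors."""


class VerdictProvider(ABC):
    def __init__(self, name: str) -> None:
        self.name = name  # verdict provider name

    @abstractmethod
    def scan(self, hashes: List[str]) -> Dict[str, str]:
        """Scan the specified files and return a report per hash."""

    @abstractmethod
    def get_errors(self) -> List[str]:
        """Get errors."""


class StaticSource(DataSource):
    """Hypothetical stub that returns a fixed list of hashes."""

    def __init__(self, name: str, hashes: List[str], received_at: datetime) -> None:
        super().__init__(name)
        self._hashes = list(hashes)
        self._received_at = received_at

    def produse(self) -> List[Tuple[str, datetime]]:
        return [(h, self._received_at) for h in self._hashes]

    def get_error(self) -> List[str]:
        return []


class ConstantProvider(VerdictProvider):
    """Hypothetical stub that reports the same verdict for every file."""

    def __init__(self, name: str, verdict: str) -> None:
        super().__init__(name)
        self._verdict = verdict

    def scan(self, hashes: List[str]) -> Dict[str, str]:
        return {h: self._verdict for h in hashes}

    def get_errors(self) -> List[str]:
        return []
```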
[0073] Detect DB
[0074] Methods
TABLE-US-00010
| Method | Description |
|--------|-------------|
| .open( ) | Open DB |
| .commit( ) | Commit changes |
| .create_DB( ) | Create DB if it does not exist |
| .add_new_data_source( ) | Add new entry in data sources table |
| .add_new_verdict_provider( ) | Add new entry in verdict_providers table |
| .add_files( ) | Add new entries in files table |
| .add_verdicts( ) | Add new entries in verdicts table |
| .close( ) | Close connection |
[0075] With regard to the metrics, the time of appearance can be taken as either the time of receipt by the system (i.e., first submission) or the time of appearance in the VirusTotal (VT) system (i.e., first appearance).
[0076] All metrics are calculated on:
TABLE-US-00011
| Scope | Label |
|-------|-------|
| on various datasets | {dataset=‘dataset_name’} |
| on average on all datasets | {dataset=‘average’} |
| weighted average for all datasets | {dataset=‘weighted’} |
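As an illustrative sketch, the three aggregation scopes may be computed from per-dataset metric values. Weighting by the number of files in each dataset is an assumption of this example, as are the function name and the input layout; the input is assumed to be non-empty.

```python
from typing import Dict, Tuple


def aggregate(per_dataset: Dict[str, Tuple[float, int]]) -> Dict[str, float]:
    """Aggregate a metric over datasets.

    `per_dataset` maps a dataset name to (metric value, number of files).
    The result holds one entry per dataset plus 'average' and 'weighted'.
    """
    result = {name: value for name, (value, _) in per_dataset.items()}
    values = [value for value, _ in per_dataset.values()]
    total_files = sum(count for _, count in per_dataset.values())
    # Plain average: every dataset counts equally.
    result["average"] = sum(values) / len(values)
    # Weighted average: datasets count in proportion to their size (assumed weight).
    result["weighted"] = sum(v * n for v, n in per_dataset.values()) / total_files
    return result
```

With two datasets of 100 and 300 files and detection rates of 0.5 and 1.0, the plain average is 0.75 while the size-weighted average is 0.875.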
[0077] With regard to detection quality dependence on time, the detection quality is calculated on the day the file appears, the next day, and so on, up to the limit that is set.
[0078] The previous metric may also be averaged over a specified period, giving the average detection quality for 0-day files, for files on the next day, and so on.
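The dependence of detection quality on time may be illustrated as the share of scans that detect a file, grouped by the number of days elapsed since the file appeared. The function name, the input layout, and the use of a simple detection rate as the quality measure are assumptions of this sketch.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Tuple


def detection_rate_by_day(appearance: Dict[int, date],
                          scans: Iterable[Tuple[int, date, bool]],
                          limit: int) -> Dict[int, float]:
    """Return {day offset: detection rate} for offsets 0..limit.

    `appearance` maps a file id to its date of appearance;
    `scans` yields (file id, scan date, detected) tuples.
    Offsets without any scan are omitted from the result.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for file_id, scan_date, detected in scans:
        offset = (scan_date - appearance[file_id]).days
        if 0 <= offset <= limit:
            totals[offset] += 1
            hits[offset] += int(detected)
    return {day: hits[day] / totals[day] for day in sorted(totals)}
```

Restricting `appearance` to files that appeared within a given period yields the period-averaged variant of the metric.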
[0079] The average detection quality is also calculated for files received in a specified period. In addition, how many files from one source are known to another, and how quickly they appear there, is evaluated for the following pairs of sources:
[0080] Local (white) files and BD cleanset;
[0081] MalShare and VT; and
[0082] Local files and VT.
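The comparison between two sources listed above may be sketched as follows. The function name, the mapping from hash to first-seen time, and the use of the median delay as the measure of "how quickly" are assumptions made for this example.

```python
from datetime import datetime
from statistics import median
from typing import Dict, Optional, Tuple


def source_overlap(first_seen_a: Dict[str, datetime],
                   first_seen_b: Dict[str, datetime]) -> Tuple[float, Optional[float]]:
    """Compare two sources by file hash.

    Returns (share of A's files also known to B,
             median delay in seconds between appearance in A and in B).
    The delay is None when the sources have no files in common.
    """
    common = first_seen_a.keys() & first_seen_b.keys()
    share = len(common) / len(first_seen_a) if first_seen_a else 0.0
    delays = [(first_seen_b[h] - first_seen_a[h]).total_seconds() for h in common]
    return share, (median(delays) if delays else None)
```

A negative delay in this sketch indicates that the second source saw the files earlier than the first.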
[0083] The descriptions of the various exemplary embodiments of the present invention have been presented for purposes of illustration but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
[0084] Further, Applicant's intent is to encompass the equivalents of all claim elements, and no amendment to any claim of the present application should be construed as a disclaimer of any interest in or right to an equivalent of any element or feature of the amended claim.