Scoring domains and IPS using domain resolution data to identify malicious domains and IPS
11533293 · 2022-12-20
Assignee
Inventors
- Swapna Buccapatnam Tirumala (Holmdel, NJ, US)
- Fei Wu (Jersey City, NJ, US)
- Carolyn Roche Johnson (Holmdel, NJ)
Cpc classification
H04L41/22
ELECTRICITY
H04L63/0236
ELECTRICITY
H04L63/1475
ELECTRICITY
International classification
Abstract
Domains and IPs are scored using domain resolution data to identify malicious domains and IPs. A domain and IP resolution graph for a set of domains and IPs in a system. A seed set of known malicious domains and known malicious IPs is selected from a malicious domain and malicious IP database. A graphical probabilistic propagation inference from the domain and IP resolution graph and the seed set of known malicious domains and known malicious IPs is generated. A malicious score is calculated for each domain in the set of domains and each IP in the set of IPs, and the malicious domain and malicious IP database is updated.
Claims
1. A method for discovering malicious domains and IP addresses (IPs) in a network having a set of domains and a set of IPs comprising: accessing a domain name system query database; building a domain and IP resolution graph for the set of domains based on the accessing of the domain name system query database; accessing a malicious domain and malicious IP database; selecting a seed set of known malicious domains and known malicious IPs from the malicious domain and malicious IP database based on the accessing of the malicious domain and malicious IP database; generating a graphical probabilistic propagation inference from the domain and IP resolution graph and the seed set of known malicious domains and known malicious IPs, wherein the generating of the graphical probabilistic propagation inference is based on applying a respective known malicious domain and malicious IP of the seed set to each member of a plurality of members of a graphical inference component, and wherein each member computes in parallel and independently malicious scores for other domains and IPs based on the domain and IP resolution graph; calculating a malicious score for each domain in the set of domains and each IP in the set of IPs based on the malicious scores computed by the members; and updating the malicious domain and malicious IP database based on the calculating of the malicious score.
2. The method of claim 1 wherein generating the graphical probabilistic propagation inference comprises generating a graphical inference from each domain in the set of domains and each IP in the set of IPs.
3. The method of claim 2 further comprising creating a set of combined inferences by combining each graphical inference from each domain in the set of domains and each IP in the set of IPs.
4. The method of claim 3 wherein computing the malicious score for each domain in the set of domains and each IP in the set of IPs comprises computing the malicious score from each combined inference in the set of combined inferences.
5. The method of claim 4 wherein computing the malicious score for each domain in the set of domains and each IP in the set of IPs comprises computing the malicious score for each domain in the set of domains and each IP in the set of IPs by layers.
6. The method of claim 5 wherein computing the malicious score for each domain in the set of domains and the malicious score for each IP in the set of IPs by layers comprises computing the malicious score for each domain in the set of domains and the malicious score for each IP in the set of IPs starting from a layer depth value d, where d is equal to zero.
7. The method of claim 6 further comprising: incrementing d by one; computing the malicious score for each domain in the set of domains and each IP in the set of IPs in a layer depth where d is equal to d plus one to create a set of malicious scores; and if d is less than a threshold value repeating incrementing d by one and computing the malicious score for each domain in the set of domains and each IP in the set of IPs if d is equal to the threshold value, returning the set of malicious scores to the malicious domain and malicious IP database.
8. A system for discovering malicious domains and IP addresses (IPs) in a network having a set of domains and a set of IPs comprising: a storage device storing a domain name system query database; a storage device storing a malicious domain and malicious IP database; a processor; a non-volatile computer memory for storing computer instructions coupled to the processor, wherein the processor, responsive to executing the computer instructions, performs operations comprising: accessing the domain name system query database; building a domain and IP resolution graph for the set of domains based on the accessing of the domain name system query database; accessing the malicious domain and malicious IP database; selecting a seed set of known malicious domains and known malicious IPs from the malicious domain and malicious IP database based on the accessing of the malicious domain and malicious IP database; generating a graphical probabilistic propagation inference from the domain and IP resolution graph and the seed set of known malicious domains and known malicious IPs, wherein the generating of the graphical probabilistic propagation inference is based on applying a respective known malicious domain and malicious IP of the seed set to each member of a plurality of members of a graphical inference component, and wherein each member computes in parallel and independently malicious scores for other domains and IPs based on the domain and IP resolution graph; calculating a malicious score for each domain in the set of domains and each IP in the set of IPs based on the malicious scores computed by the members; and updating the malicious domain and malicious IP database based on the calculating of the malicious score.
9. The system of claim 8 wherein generating the graphical probabilistic propagation inference performed by the processor comprises generating a graphical inference from each domain in the set of domains and each IP in the set of IPs.
10. The system of claim 9 wherein the operations performed by the processor further comprise creating a set of combined inferences by combining each graphical inference from each domain in the set of domains and each IP in the set of IPs.
11. The system of claim 10 wherein computing the malicious score for each domain in the set of domains and each IP in the set of IPs performed by the processor comprises computing the malicious score from each combined inference in the set of combined inferences.
12. The system of claim 11 wherein computing the malicious score for each domain in the set of domains and each IP in the set of IPs performed by the processor comprises computing the malicious score for each domain in the set of domains and each IP in the set of IPs by layers.
13. The system of claim 12 wherein computing the malicious score for each domain in the set of domains and the malicious score for each IP in the set of IPs by layers performed by the processor comprises computing the malicious score for each domain in the set of domains and the malicious score for each IP in the set of IPs starting from a layer depth value d, where d is equal to zero.
14. The system of claim 13 wherein the operations performed by the processor further comprise incrementing d by one; computing the malicious score for each domain in the set of domains and each IP in the set of IPs in a layer depth where d is equal to d plus one to create a set of malicious scores; and if d is less than a threshold value repeating incrementing d by one and computing the malicious score for each domain in the set of domains and each IP in the set of IPs if d is equal to the threshold value, returning the set of malicious scores to the malicious domain and malicious IP database.
15. A non-transitory, tangible computer-readable medium having computer-executable instructions stored thereon which, when executed by a computer, cause the computer to perform a method for discovering malicious domains and IP addresses (IPs) in a network having a set of domains and a set of IPs comprising: accessing a domain name system query database; building a domain and IP resolution graph for the set of domains based on the accessing of the domain name system query database; accessing a malicious domain and malicious IP database; selecting a seed set of known malicious domains and known malicious IPs from the malicious domain and malicious IP database based on the accessing of the malicious domain and malicious IP database; generating a graphical probabilistic propagation inference from the domain and IP resolution graph and the seed set of known malicious domains and known malicious IPs, wherein the generating of the graphical probabilistic propagation inference is based on applying a respective known malicious domain and malicious IP of the seed set to each member of a plurality of members of a graphical inference component, and wherein each member computes in parallel and independently malicious scores for other domains and IPs based on the domain and IP resolution graph; calculating a malicious score for each domain in the set of domains and each IP in the set of IPs based on malicious scores computed by the members; and updating the malicious domain and malicious IP database based on the calculating of the malicious score.
16. The non-transitory, tangible computer-readable medium of claim 15 wherein generating the graphical probabilistic propagation inference comprises generating a graphical inference from each domain in the set of domains and each IP in the set of IPs.
17. The non-transitory, tangible computer-readable medium of claim 16 wherein the method performed by the computer further comprises creating a set of combined inferences by combining each graphical inference from each domain in the set of domains and each IP in the set of IPs.
18. The non-transitory, tangible computer-readable medium of claim 17 wherein computing the malicious score for each domain in the set of domains and each IP in the set of IPs comprises computing the malicious score from each combined inference from the set of combined inferences.
19. The non-transitory, tangible computer-readable medium of claim 18 wherein computing the malicious score for each domain in the set of domains and each IP in the set of IPs comprises computing the malicious score for each domain in the set of domains and each IP in the set of IPs by layers.
20. The non-transitory, tangible computer-readable medium of claim 19 wherein computing the malicious score for each domain in the set of domains and the malicious score for each IP in the set of IPs by layers comprises: computing the malicious score for each domain in the set of domains and the malicious score for each IP in the set of IPs starting from a layer depth value d, where d is equal to zero to create a set of malicious scores; incrementing d by one; computing the malicious score for each domain in the set of domains and each IP in the set of IPs in a layer depth where d is equal to d plus one; and if d is less than a threshold value repeating incrementing d by one and computing the malicious score for each domain in the set of domains and each IP in the set of IPs if d is equal to the threshold value, returning the set of malicious scores to the malicious domain and malicious IP database.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS
Glossary
(13) “Graphical model” is a type of probabilistic network that use graphs to represent and manipulate joint probability distributions. A graphical model has both a structural component encoded by the pattern of edges in the graph, and a parametric component encoded by numerical potentials associated with sets of edges in the graph.
(14) “Inference algorithms” allow statistical quantities (such as likelihoods and conditional probabilities) and information-theoretic quantities (such as mutual information and conditional entropies) to be computed efficiently.
(15) Probabilistic inference is the task of deriving the probability of one or more random variables taking a specific value or set of values.
(16)
(17)
(18)
ϕ(ν,u).
(19)
(20)
(21)
(22)
(23) Illustrated in
(24) The system 800 includes a DNS query database 801 and a malicious domain and IP database 803.
(25) The system 800 includes a resolution graph module 805 that is responsible for building a Domain/IP resolution graph. The resolution graph module 805 accesses the DNS query database to get the domain-IP resolution history, i.e., which set of domains are resolved to which set of IPs, for a given period of time, e.g., one day. Then the resolution graph module 805 constructs a bipartite graph G(V,E) as follows. We use V to denote the set of Domains/IPs and E to denote the set of edges, where an edge exists between a Domain and an IP if the Domain had been resolved to the IP in the given period of time.
(26) The system 800 also includes a seeding module 807 that selects a seed set of known malicious domains/IPs.
(27) The system 800 includes an inference module 809 that provides a graphical probabilistic propagation inference based on the input from the resolution graph module 805 and the seeding module 807 and is responsible for assigning malicious scores to the Domains/IPs.
(28) The output of the inference module 809 is provided to malicious score module 811 which feeds that output to the malicious Domain/IP database 803 which would be updated accordingly.
(29)
(30)
(31) An embodiment of a concrete implementation of graphical inference component 902 is included below. A bipartite graph G(V,E) is defined as follows. V to denotes the set of Domains/IPs and E denotes the set of edges, where an edge exists between a Domain and an IP if the Domain had been resolved to the IP in the given period of time. We use S.sub.seed to denote the seed set of Domains/IPs that are known malicious. {τ.sub.ν}.sub.ν∈V represent the initial malicious scores of domain/IP before a graphical inference is made. A Propagation/Link function for a pair of Domains/IPs u,ν as the probability that u is malicious because of v, given that v is malicious.
ϕ(ν,u)P(u is malicious because of ν|νis malicious)
Finally, the final malicious scores of the Domains/IPs are denoted as
Γ{γ.sub.ν}ν∈ν
The graphical inference method which corresponds to graphical inference component 902 may be described as follows.
(32) TABLE-US-00001 Algorithm 1 - Probabilistic Propagation Algorithm. Input: Domain-IP Graph G(ν, ε), prior probabilities {τ.sub.ν }ν ∈ ν, and propagation function ϕ(ν, u), f or ν, u ∈ ν. Output: Probabilistic scores Γ {γ.sub.ν}ν ∈ ν. 1: Initialization: Λ.sup.ν = {.Math.} for ∀ν ∈ ν (where Λ.sup.ν is denotes the set of probabilities that a domain/IP v is malicious because of another domain/IP. {.Math.} means that initially, it is set to be an empty set for all domain/IP v.) 2: - - - - - - - - Iterate through evidences - - - - - - - 3: for all ν ∈ ν such that τ.sub.ν > 0 do 4: Construct a tree rooted at ν with layers of nodes as {L.sub.0.sup.(ν) = {ν}, . . . , L.sub.k.sup.(ν)} . (Where L.sub.d.sup.(ν) denotes the set of Domains/IPs that has of depth d starting from ν. Initially, d = 0 and L.sub.d.sup.(ν)={v}. In other words, starting from the node ν that is focused on, the nodes may be sorted as follows. ν is considered as a root node of depth 0; the nodes that have an edge with ν are considered to be of depth 1; the nodes that have an edge to the nodes of depth 1 are of depth 2; and so on.) 5. S.sup.(ν,ν) = {τ.sub.ν} and S.sup.(ν,u) = Ø for ∀u ∈ ν . (Where S.sup.(ν,u) denote the set of probabilities that u being malicious because of the nodes in the last layer of the tree rooted at v) 6. for l = 0, 1, . . . . , k do 7. for all u ∈ L.sub.1.sup.(ν) do 8. if S.sup.(ν,u) ≠ Ø then 9. - - Calculate inference from ν to u - - - 10. Λ.sup.(u)[ν] = 1 − Π.sub.δ∈S.sub.
{γ.sub.ν}ν ∈ ν. 17. return Γ = {γ.sub.ν}ν ∈ ν
Specifically, lines 1-3 correspond to the malicious domain/IP assignment component 901, lines 4-13 correspond to the graphical inference component 902, and lines 14-16 correspond to the score computation component 909.
(33) Illustrated in
(34) In step 1101 the method 1100 accesses a DNS query database and extracts information necessary to build a domain/IP resolution graph for a domain set.
(35) In step 1103, the method 1100 builds a domain/IP resolution graph.
(36) In step 1105, the method 1100 accesses a malicious domain/IP database that contains a listing of malicious domains and IPs.
(37) In step 1107, the method 1100 selects a seed set of malicious domains/IPs
(38) In step 1109, the method 1100 generates graphical probabilistic inferences for the domains/IPs.
(39) In step 1111, the method 1100 calculates a malicious score for each domain/IP.
(40) In step 1113, the method 1100 updates the malicious domain/IP database with a listing of newly identified malicious domains/IPs.
(41) Illustrated in
(42) In step 1201 the method 1200 assigns different known malicious domains/IPs.
(43) In step 1203, the method 1200 computes the malicious scores for other domains/IPs.
(44) In step 1205, the method 1200 combines the malicious scores.
(45) In step 1207, the method 1200 computes the final malicious score for each domain/IP.
(46) As used in some contexts in this application, in some embodiments, the terms “component,” “system” and the like are intended to refer to, or comprise, a computer-related entity or an entity related to an operational apparatus with one or more specific functionalities, wherein the entity can be either hardware, a combination of hardware and software, software, or software in execution. As an example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, computer-executable instructions, a program, and/or a computer. By way of illustration and not limitation, both an application running on a server and the server can be a component. One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers. In addition, these components can execute from various computer readable media having various data structures stored thereon. The components may communicate via local and/or remote processes such as in accordance with a signal having one or more data packets (e.g., data from one component interacting with another component in a local system, distributed system, and/or across a network such as the Internet with other systems via the signal). As another example, a component can be an apparatus with specific functionality provided by mechanical parts operated by electric or electronic circuitry, which is operated by a software or firmware application executed by a processor, wherein the processor can be internal or external to the apparatus and executes at least a part of the software or firmware application. As yet another example, a component can be an apparatus that provides specific functionality through electronic components without mechanical parts, the electronic components can comprise a processor therein to execute software or firmware that confers at least in part the functionality of the electronic components. While various components have been illustrated as separate components, it will be appreciated that multiple components can be implemented as a single component, or a single component can be implemented as multiple components, without departing from example embodiments.
(47) Further, the various embodiments can be implemented as a method, apparatus or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware or any combination thereof to control a computer to implement the disclosed subject matter. The term “article of manufacture” as used herein is intended to encompass a non-transitory computer program accessible from any computer-readable device or computer-readable storage/communications media. For example, computer readable storage media can include, but are not limited to, magnetic storage devices (e.g., hard disk, floppy disk, magnetic strips), optical disks (e.g., compact disk (CD), digital versatile disk (DVD)), smart cards, and flash memory devices (e.g., card, stick, key drive). Of course, those skilled in the art will recognize many modifications can be made to this configuration without departing from the scope or spirit of the various embodiments.
(48) In addition, the words “example” is used herein to mean serving as an instance or illustration. Any embodiment or design described herein as an “example” is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word example is intended to present concepts in a concrete fashion. As used in this application, the term “or” is intended to mean an inclusive “or” rather than an exclusive “or”. That is, unless specified otherwise or clear from context, “X employs A or B” is intended to mean any of the natural inclusive permutations. That is, if X employs A; X employs B; or X employs both A and B, then “X employs A or B” is satisfied under any of the foregoing instances. In addition, the articles “a” and “an” as used in this application and the appended claims should generally be construed to mean “one or more” unless specified otherwise or clear from context to be directed to a singular form.
(49) As employed herein, the term “processor” can refer to substantially any computing processing unit or device comprising, but not limited to comprising, single-core processors; single-processors with software multithread execution capability; multi-core processors; multi-core processors with software multithread execution capability; multi-core processors with hardware multithread technology; parallel platforms; and parallel platforms with distributed shared memory. Additionally, a processor can refer to an integrated circuit, an application specific integrated circuit (ASIC), a digital signal processor (DSP), a field programmable gate array (FPGA), a programmable logic controller (PLC), a complex programmable logic device (CPLD), a discrete gate or transistor logic, discrete hardware components or any combination thereof designed to perform the functions described herein. Processors can exploit nano-scale architectures such as, but not limited to, molecular and quantum-dot based transistors, switches and gates, in order to optimize space usage or enhance performance of user equipment. A processor can also be implemented as a combination of computing processing units.
(50) As used herein, terms such as “data storage,” data storage,” “database,” and substantially any other information storage component relevant to operation and functionality of a component, refer to “memory components,” or entities embodied in a “memory” or components comprising the memory. It will be appreciated that the memory components or computer-readable storage media, described herein can be either volatile memory or nonvolatile memory or can include both volatile and nonvolatile memory.
(51) What has been described above includes mere examples of various embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing these examples, but one of ordinary skill in the art can recognize that many further combinations and permutations of the present embodiments are possible. Accordingly, the embodiments disclosed and/or claimed herein are intended to embrace all such alterations, modifications and variations that fall within the spirit and scope of the appended claims. Furthermore, to the extent that the term “includes” is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term “comprising” as “comprising” is interpreted when employed as a transitional word in a claim.
(52) In addition, a flow diagram may include a “start” and/or “continue” indication. The “start” and “continue” indications reflect that the steps presented can optionally be incorporated in or otherwise used in conjunction with other routines. In this context, “start” indicates the beginning of the first step presented and may be preceded by other activities not specifically shown. Further, the “continue” indication reflects that the steps presented may be performed multiple times and/or may be succeeded by other activities not specifically shown. Further, while a flow diagram indicates a particular ordering of steps, other orderings are likewise possible provided that the principles of causality are maintained.
(53) As may also be used herein, the term(s) “operably coupled to”, “coupled to”, and/or “coupling” includes direct coupling between items and/or indirect coupling between items via one or more intervening items. Such items and intervening items include, but are not limited to, junctions, communication paths, components, circuit elements, circuits, functional blocks, and/or devices. As an example of indirect coupling, a signal conveyed from a first item to a second item may be modified by one or more intervening items by modifying the form, nature or format of information in a signal, while one or more elements of the information in the signal are nevertheless conveyed in a manner than can be recognized by the second item. In a further example of indirect coupling, an action in a first item can cause a reaction on the second item, as a result of actions and/or reactions in one or more intervening items.
(54) Although specific embodiments have been illustrated and described herein, it should be appreciated that any arrangement which achieves the same or similar purpose may be substituted for the embodiments described or shown by the subject disclosure. The subject disclosure is intended to cover any and all adaptations or variations of various embodiments. Combinations of the above embodiments, and other embodiments not specifically described herein, can be used in the subject disclosure. For instance, one or more features from one or more embodiments can be combined with one or more features of one or more other embodiments. In one or more embodiments, features that are positively recited can also be negatively recited and excluded from the embodiment with or without replacement by another structural and/or functional feature. The steps or functions described with respect to the embodiments of the subject disclosure can be performed in any order. The steps or functions described with respect to the embodiments of the subject disclosure can be performed alone or in combination with other steps or functions of the subject disclosure, as well as from other embodiments or from other steps that have not been described in the subject disclosure. Further, more than or less than all of the features described with respect to an embodiment can also be utilized.