BRAND SQUATTING DOMAIN DETECTION SYSTEMS AND METHODS
20220201036 · 2022-06-23
Inventors
Cpc classification
G06F18/214
PHYSICS
H04L61/302
ELECTRICITY
H04L63/20
ELECTRICITY
H04L63/1483
ELECTRICITY
International classification
Abstract
The present application provides a system for detecting brand squatting domains with a three-stage detection pipeline having three different classifiers. The provided system helps predict whether an unknown domain will be malicious. The first classifier detects abusive brand squatting domains, such as those that impersonate exact popular brand names, as soon as the domains are registered. The second classifier detects abusive brand squatting domains when hosting information becomes available, in combination with the information available for the first classifier. The third classifier detects abusive brand squatting domains when certificate information associated with domains is available, in combination with the information available for the first and second classifiers. The performance of each classifier improves from the first to the second to the third with the first classifier making determinations with the least information and the third classifier making determinations with the most information.
Claims
1. A system for detecting brand squatting domains comprising: a memory; and a processor in communication with the memory, the processor configured to: receive or acquire newly registered domain information including a plurality of domain names, determine, using at least one first model, a first likelihood of whether a first domain name of the plurality of domain names is a brand squatting domain based on the first domain name, receive or acquire hosting information for at least some of the plurality of domain names including the first domain name, determine, using at least one second model, a second likelihood of whether the first domain name is a brand squatting domain based on the hosting information of the first domain name, receive or acquire certificate information for at least some of the plurality of domain names including the first domain name, and determine, using at least one third model, a third likelihood of whether the first domain name is a brand squatting domain based on the certificate information of the first domain name.
2. The system of claim 1, wherein the at least one first model is trained to detect brand squatting domains based on a dataset of abusive and non-abusive domain names.
3. The system of claim 1, wherein the at least one second model is trained to detect brand squatting domains based on hosting information of abusive and non-abusive domain names.
4. The system of claim 1, wherein the at least one third model is trained to detect brand squatting domains based on certificate information of abusive and non-abusive domain names.
5. The system of claim 1, wherein the second likelihood of whether the first domain name is a brand squatting domain is determined further based on the first domain name.
6. The system of claim 1, wherein the third likelihood of whether the first domain name is a brand squatting domain is determined further based on the first domain name and the hosting information of the first domain name.
7. The system of claim 1, wherein the at least one first model, the at least one second model, and the at least one third model are each random forest classifiers.
8. The system of claim 1, wherein the at least one first model is trained on at least features included in the group consisting of a plurality of suspicious keywords, a length of a domain name, a quantity of minus signs in a domain name, whether a top-level domain is a previously known top-level domain with low reputation, a position of a brand in a domain name, and a quantity of generic top-level domains present within a domain name.
9. The system of claim 1, wherein the at least one first model is trained on at least features included in the group consisting of a quantity of days a domain registration is valid from a last update date to a registration expiration date, a WHOIS name of a domain registrar, whether a domain is parked, whether a top-level domain of a name server is suspicious, whether a domain is re-registered, and whether a domain and NS 2LD are matching.
10. The system of claim 1, wherein the at least one second model is trained on at least features included in the group consisting of a quantity of authoritative name servers for all domains belonging to a given apex, whether at least one name server domain is a suspicious top-level domain, a quantity of IPs on which the domains belonging to the apex are hosted, a quantity of start of authority domains for all domains belonging to a given apex, and whether a name server 2LD matches with an apex domain.
11. The system of claim 1, wherein the at least one third model is trained on at least features included in the group consisting of an average number of levels of all subdomains belonging to a given apex domain, an average length of domains belonging to a given apex domain, an average number of brands included across all domains for a given apex domain, and an average number of minus signs included across all domains for a given apex domain.
12. The system of claim 1, wherein the at least one third model is trained on at least features included in the group consisting of a quantity of certificates related to all domains belonging to a given apex domain, a quantity of star domains across all related certificates for a given domain, a mean of certificate validity duration, a standard deviation of the certificate validity duration, a minimum certificate validity duration, a maximum certificate validity duration, a mean of a quantity of domains in certificates, a standard deviation of the quantity of domains in certificates, a minimum quantity of domains in certificates, a maximum quantity of domains in certificates, a mean of a quantity of apex domains in certificates, a standard deviation of the quantity of apex domains in certificates, a minimum quantity of apex domains in certificates, and a maximum quantity of apex domains in certificates.
13. A method for detecting brand squatting domains comprising: receiving or acquiring newly registered domain information including a plurality of domain names; determining, using at least one first model, a first likelihood of whether a first domain name of the plurality of domain names is a brand squatting domain based on the first domain name; receiving or acquiring hosting information for at least some of the plurality of domain names including the first domain name; determining, using at least one second model, a second likelihood of whether the first domain name is a brand squatting domain based on the hosting information of the first domain name; receiving or acquiring certificate information for at least some of the plurality of domain names including the first domain name; and determining, using at least one third model, a third likelihood of whether the first domain name is a brand squatting domain based on the certificate information of the first domain name.
14. The method of claim 13, wherein the second likelihood is determined subsequent in time to the first likelihood being determined.
15. The method of claim 13, wherein the third likelihood is determined subsequent in time to both the first and second likelihoods being determined.
16. The method of claim 13, wherein the certificate information is received or acquired subsequent in time to the hosting information being received or acquired, which is subsequent in time to the newly registered domain information being received or acquired.
17. The method of claim 13, wherein the newly registered domain information is included in a WHOIS record.
18. The method of claim 13, wherein the hosting information is included in a pDNS database.
19. A non-transitory, computer-readable medium storing instructions, which when executed by a processor, cause the processor to: receive or acquire newly registered domain information including a plurality of domain names; determine, using at least one first model, a first likelihood of whether a first domain name of the plurality of domain names is a brand squatting domain based on the first domain name; receive or acquire hosting information for at least some of the plurality of domain names including the first domain name; determine, using at least one second model, a second likelihood of whether the first domain name is a brand squatting domain based on the hosting information of the first domain name; receive or acquire certificate information for at least some of the plurality of domain names including the first domain name; and determine, using at least one third model, a third likelihood of whether the first domain name is a brand squatting domain based on the certificate information of the first domain name.
20. The non-transitory, computer-readable medium storing instructions of claim 19, wherein the certificate information is included in a certificate for the first domain name of a CT log feed.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0010]
[0011]
[0012]
[0013]
[0014]
[0015]
[0016]
[0017]
[0018]
[0019]
[0020]
[0021]
[0022]
[0023]
DETAILED DESCRIPTION
[0024] The present application relates generally to abusive domain detection. More specifically, the present application provides a system for detecting brand squatting domains with a three-stage detection pipeline having three different classifiers. The provided system helps predict whether an unknown domain will be malicious. The first classifier, the NRD (newly registered domains) classifier, detects abusive brand squatting domains, such as those that impersonate exact popular brand names, as soon as the domains are registered. For example, an impersonating domain name may include a brand name such as CompanyA in apex domains (e.g., companyA-best.com, companyA-com.com, companyA.io, etc.) or in subdomains (e.g., companyA.com-evil.com, companyA.evil.com). Registered domains are then hosted either at the registrar itself or at another hosting provider, at which point a domain is associated with additional attributes related to its hosting infrastructure.
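As a non-limiting illustration of the apex and subdomain examples above, the following sketch reports where a brand appears in a domain name. The function names are hypothetical, and the apex is assumed to be the last two labels of the name; a practical implementation would more likely consult the Public Suffix List.

```python
from typing import Tuple


def split_apex(fqdn: str) -> Tuple[str, str]:
    """Split a domain name into (subdomain, apex).

    Assumption: the apex is the last two labels. Multi-label suffixes
    such as co.uk are not handled by this simplification.
    """
    labels = fqdn.lower().rstrip(".").split(".")
    apex = ".".join(labels[-2:])
    subdomain = ".".join(labels[:-2])
    return subdomain, apex


def brand_location(fqdn: str, brand: str) -> str:
    """Return 'apex', 'subdomain', or 'none' for where the brand appears."""
    subdomain, apex = split_apex(fqdn)
    brand = brand.lower()
    # Only the registrable label of the apex is checked, not the TLD.
    apex_label = apex.split(".")[0]
    if brand in apex_label:
        return "apex"
    if brand in subdomain:
        return "subdomain"
    return "none"
```

Under these assumptions, companyA-best.com and companyA.io are reported as apex matches, while companyA.com-evil.com and companyA.evil.com are reported as subdomain matches.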
[0025] The second classifier, the hosting classifier, detects abusive brand squatting domains when hosting information becomes available. The hosting classifier utilizes both the information available at the time of registration and the hosting information to detect additional abusive brand squatting domains.
[0026] With time, most domains obtain a TLS certificate, so many abusive domains also obtain certificates. The third classifier, or TLS classifier, detects abusive brand squatting domains when certificate information associated with domains is available. For example, an initiative by the Google Chrome® browser requires certificate authorities to log newly issued certificates in a distributed database for improved security. The TLS classifier considers all previous features along with TLS certificate features to either detect additional abusive domains or improve the confidence of the previously detected domains. Each classifier's performance (e.g., precision, recall, and false positive rate (FPR), which is the fraction of all negative samples in a test that are incorrectly classified as positive) progressively improves from the first to the third as more information becomes available to the later classifiers.
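As a non-limiting illustration of the performance measures named above, the following sketch computes precision, recall, and FPR from the counts of a confusion matrix using their standard definitions. The function name and the example counts are hypothetical and are not results of the provided system.

```python
def classifier_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute precision, recall, and false positive rate.

    tp/fp/fn/tn are the true positive, false positive, false negative,
    and true negative counts from a labeled test set.
    """
    return {
        # Share of flagged domains that are actually abusive.
        "precision": tp / (tp + fp),
        # Share of abusive domains that were flagged.
        "recall": tp / (tp + fn),
        # Share of non-abusive domains that were incorrectly flagged.
        "fpr": fp / (fp + tn),
    }


# Hypothetical counts, for illustration only.
example = classifier_metrics(tp=90, fp=10, fn=30, tn=870)
```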
[0027] In view of the above, the NRD classifier detects abusive brand squatting domains with the least amount of information, whereas the TLS classifier has the most information of the three detection engines. Hence, with more information, one can make more confident decisions with the last classifier, but it takes the longest time to reach a detection. It is tempting to delay detection until domain certificate information is available, as the classifier at this stage provides the highest performance. However, running the first two classifiers can be beneficial for detecting abusive domains and taking necessary action early to reduce or mitigate the damage brand squatting domains cause. Abusive EBS domains are utilized for a short time period, and by the time all the information is available, some of the attacks may already have been carried out. Browser-based blacklists help warn users of malicious domains, but they take time to propagate submitted malicious domains. Detecting these domains early and submitting them to the major browser vendors helps browsers warn users about these malicious domains by the time the users access them. In at least one example, a user of the provided system can treat the results from the first engine with caution (e.g., build a suspicious list that is used to warn users) and, as more details emerge, take aggressive actions (e.g., block highly malicious domains) based on the results from the other two engines.
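As a non-limiting illustration of treating early results with caution and later results more aggressively, the following sketch maps a stage and a likelihood to an action. The stage labels, the threshold values, and the function name are hypothetical and are chosen only for illustration.

```python
WARN_THRESHOLD = 0.5   # hypothetical value, not taken from the disclosure
BLOCK_THRESHOLD = 0.9  # hypothetical value, not taken from the disclosure


def recommended_action(stage: str, likelihood: float) -> str:
    """Return 'allow', 'warn', or 'block' for a classifier result.

    Results from the first (NRD) stage only ever produce a warning,
    while the later stages may escalate to blocking.
    """
    if stage not in ("nrd", "hosting", "tls"):
        raise ValueError(f"unknown stage: {stage}")
    if likelihood < WARN_THRESHOLD:
        return "allow"
    if stage == "nrd":
        # Least information available: add to a suspicious list only.
        return "warn"
    return "block" if likelihood >= BLOCK_THRESHOLD else "warn"
```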
[0028]
[0029] The brand squatting domain detection system 102 may be in communication over a network 108 with sources of information (e.g., external servers) for use in abusive domain detection. For example, the brand squatting domain detection system 102 may be in communication with a domain registrar 110 that stores information on registered domains. For instance, the domain registrar 110 may store a domain name for each registered domain, and may continually update the data each time a new domain is registered. In some aspects, the brand squatting domain detection system 102 may obtain hosting information from the domain registrar 110 (e.g., if a registered domain is hosted at the domain registrar 110 itself). In other aspects, the brand squatting domain detection system 102 may obtain hosting information from a hosting provider 120 that hosts a particular domain. In another example, the brand squatting domain detection system 102 may be in communication with a certificate authority 130 that grants TLS certificates to domains and stores information in a CT log. The network 108 can include, for example, the Internet or some other data network, including, but not limited to, any suitable wide area network or local area network.
[0030] The processor of the brand squatting domain detection system 102 is configured to determine whether domain names are likely to be abusive using machine learning models trained to do so. In at least some aspects, the brand squatting domain detection system 102 may use three separate classifiers to determine a likelihood that a domain name is abusive based on different information for each classifier. Each classifier may be implemented by a machine learning model trained on the features available at the stage of the respective classifier. Each of the respective machine learning models may include one or more supervised learning models, unsupervised learning models, or other suitable types of machine learning models. For instance, the brand squatting domain detection system 102 may include an NRD classifier implemented by a machine learning model trained on abusive and non-abusive domain names to detect domain names likely to be abusive upon their registration. In various examples, the NRD classifier may be a random forest classifier (e.g., with five-fold cross validation). The brand squatting domain detection system 102 may also include a hosting classifier implemented by a machine learning model trained on the abusive and non-abusive domain names and also on hosting information of abusive and non-abusive domains to detect domain names likely to be abusive. In various examples, the hosting classifier may be a random forest classifier (e.g., with five-fold cross validation). Additionally, the brand squatting domain detection system 102 may include a TLS classifier implemented by a machine learning model trained on the abusive and non-abusive domain names, the hosting information of abusive and non-abusive domains, and certificate information of abusive and non-abusive domains to detect domain names likely to be abusive. In various examples, the TLS classifier may be a random forest classifier (e.g., with five-fold cross validation).
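As a non-limiting illustration of how each classifier may operate on the features available at its stage, the following sketch merges the feature groups cumulatively and reports which stage the merged set corresponds to. The function name, the stage labels, and the dictionary representation of features are assumptions; the models themselves are omitted.

```python
from typing import Dict, Optional, Tuple

Features = Dict[str, float]


def stage_features(
    registration: Features,
    hosting: Optional[Features] = None,
    certificate: Optional[Features] = None,
) -> Tuple[str, Features]:
    """Return (stage, merged features) for the most informed stage reachable.

    Certificate features are only merged once hosting features exist,
    mirroring the order in which the information becomes available.
    """
    merged = dict(registration)
    stage = "nrd"
    if hosting is not None:
        merged.update(hosting)
        stage = "hosting"
        if certificate is not None:
            merged.update(certificate)
            stage = "tls"
    return stage, merged
```

In this sketch, a domain observed only at registration is routed to the NRD stage, and the same domain is re-evaluated with a larger feature set as hosting and certificate information arrive.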
[0031]
[0032] The example method 200 may include receiving or acquiring newly registered domain information (block 202). The newly registered domain information includes multiple domain names. When a domain is registered with a domain registrar (e.g., the domain registrar 110), a WHOIS record is created and made available. With increased utilization of privacy protection services, as well as new privacy regulations such as GDPR, WHOIS records are mostly stripped of registrant information. Even without the registrant information, WHOIS records, which may be seen as thin WHOIS records, can be a useful first line of defense in identifying malicious domains early. There are many third-party organizations that make the thin WHOIS information of NRDs available. In one example, the NRD feed from WhoisXMLAPI may be utilized. This data may be utilized to extract features for the NRD classifier.
[0033] It may then be determined, using at least one first model (e.g., the NRD classifier), a first likelihood of whether a first domain name of the received or acquired domain names is a brand squatting domain based on the first domain name (block 204). In one example, to train the NRD classifier, top brands from Alexa top 1 million 1-year domains and most phished domains from Phishtank were identified. The NRD feed can be filtered for domains that contain at least one of these brands. The filtered domains may be referred to as EBS domains. Then, abusive and non-abusive ground truth was collected from the EBS domains utilizing VirusTotal scan reports. Further, the domains may be manually verified to confirm that they are in fact abusive. Abusive EBS domains either demonstrate malicious intent or impersonate the brand in the domain. Then, WHOIS and lexical features (e.g., the features in the table of
[0034] An important consideration in identifying brand impersonation attacks is to identify which brands to monitor. Some brands such as ge, att, sc and aa are quite short and may lead to ambiguous attributions. Further, some brands such as business, live, and mail are very popular English words and may result in many incorrect attributions. To reduce the brand ambiguity, the following example filtering pipeline can be followed. The Alexa Top 1 million domains consistently seen through the last year (e.g., 14,422 2LDs) and also the Phishtank top 100 phished brands (e.g., 100 2LDs) can be considered. Then, the unique domains can be taken from these 2LDs, which results in 13,230 domain names. Short domain names having 4 or fewer characters may be pruned. This results in 11,390 domain names. Further pruning may be done to exclude domain names that are in the top 10,000 most popular English words and those having a disproportionately high number of matches (e.g., games, services, homes). All discarded brands may be inspected so as to add back the popular brands. This includes the brands apple, oracle, delta, orange, chase, discover, telegraph and adobe. After pruning, 11,152 brands are considered in total.
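As a non-limiting illustration, the example filtering pipeline above can be sketched as follows. The function and parameter names are hypothetical, the word list is supplied by the caller, and the pruning of brands with a disproportionately high number of matches is omitted because it depends on match counts against a domain feed.

```python
from typing import Iterable, List, Set


def filter_brands(
    candidates: Iterable[str],
    common_words: Set[str],
    allow_list: Set[str],
    min_length: int = 5,
) -> List[str]:
    """Prune ambiguous brand names from a candidate list.

    - duplicates are removed,
    - names with 4 or fewer characters are dropped (min_length=5),
    - names that are popular English words are dropped,
    - names on the allow list are added back regardless of the above.
    """
    kept = []
    for brand in sorted({c.lower() for c in candidates}):
        if brand in allow_list:
            kept.append(brand)
        elif len(brand) >= min_length and brand not in common_words:
            kept.append(brand)
    return kept


# Small illustrative inputs; "paypal" is a hypothetical candidate.
brands = filter_brands(
    ["ge", "att", "mail", "apple", "paypal", "business", "Apple"],
    common_words={"mail", "apple", "business"},
    allow_list={"apple"},
)
```

Here the short names ge, att, and mail are pruned, business is pruned as a popular English word, and apple is retained because it is on the allow list.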
[0035]
[0036] The inventors profiled historical malicious domains and identified a list of TLDs that are frequently associated with malicious activities. The table illustrated in
[0037] The WHOIS features are gathered from thin WHOIS records. The feature duration corresponds to the time difference from the registration date to the expiration date. The inventors observed that non-abusive domains are more likely to have a duration greater than 1 year compared to abusive EBS domains. The feature whoisServer identifies the registrar, as each registrar has a unique WHOIS server. The inventors observed that non-abusive EBS domains are more likely to register with reputed registrars such as MarkMonitor compared to abusive EBS domains. The feature is_parked identifies whether the domain under consideration is parked. The inventors observed that abusive EBS domains are more likely to be parked before they are used compared to non-abusive EBS domains.
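As a non-limiting illustration, the WHOIS features described above can be sketched as follows. The feature names duration, whoisServer, and is_parked follow the description; the function name, the parameter names, and the example WHOIS server host are hypothetical, and how parking is detected is left to the caller.

```python
from datetime import date


def whois_features(
    registered: date, expires: date, whois_server: str, is_parked: bool
) -> dict:
    """Build the thin-WHOIS feature group for one domain."""
    return {
        # Days between the registration date and the expiration date.
        "duration": (expires - registered).days,
        # The WHOIS server host serves as an identifier of the registrar.
        "whoisServer": whois_server.strip().lower(),
        # Supplied by the caller; parking detection is out of scope here.
        "is_parked": bool(is_parked),
    }


# Hypothetical record, for illustration only.
features = whois_features(
    registered=date(2021, 1, 1),
    expires=date(2022, 1, 1),
    whois_server="WHOIS.registrar.example",
    is_parked=True,
)
```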
[0038] Returning to the method 200 of
[0039] It may then be determined, using at least one second model (e.g., the hosting classifier), a second likelihood of whether the first domain name is a brand squatting domain based on the first domain name and the hosting information of the first domain name (block 208). In one example, the hosting classifier may be trained in the same manner described above for the NRD classifier, except that the hosting classifier utilizes additional hosting features (e.g., features from passive DNS).
[0040] The feature #ns captures the number of authoritative name servers utilized with all domains belonging to a given apex. The inventors observed that non-abusive EBS domains utilize fewer authoritative name servers than abusive EBS domains. One reason for this behavior is that abusive domains may host their services with different hosting providers in order to make their attack infrastructure resilient to takedown. The feature is_ns_sus_tld is similar to suspicious_tld, but it checks the name server domains. The feature #ip counts the number of IPs on which the domains belonging to a given apex are hosted. The inventors observed that non-abusive domains are hosted on fewer IPs than abusive domains. One reason for this observation is that some abusive EBS domains utilize fast fluxing to frequently change IP address to evade takedown or blacklisting. The feature #soa measures the number of start of authority (SOA) domains for all domains belonging to a given apex domain. The feature ns matching checks whether the 2LD of at least one of the name servers matches the apex domain. The inventors observed that non-abusive EBS domains demonstrate more matches than abusive EBS domains. One reason for this behavior is that non-abusive domains set up their own recursive name servers in order to improve DNS security, whereas many abusive EBS domains utilize the name servers assigned by hosting providers.
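As a non-limiting illustration, the hosting features described above can be sketched over a list of passive DNS records for one apex. The record layout of (name, record type, value), the use of the SOA primary name server as the SOA domain, the two-label 2LD simplification, and the placeholder suspicious TLD are all assumptions made for this sketch.

```python
from typing import Iterable, Set, Tuple

Record = Tuple[str, str, str]  # (queried name, record type, value)


def two_ld(hostname: str) -> str:
    """Return the last two labels of a host name (simplified 2LD)."""
    labels = hostname.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def hosting_features(
    apex: str, records: Iterable[Record], suspicious_tlds: Set[str]
) -> dict:
    """Aggregate hosting features over all records seen under one apex."""
    records = list(records)
    ns = {v.lower().rstrip(".") for _, t, v in records if t == "NS"}
    ips = {v for _, t, v in records if t in ("A", "AAAA")}
    soa = {two_ld(v) for _, t, v in records if t == "SOA"}
    return {
        "#ns": len(ns),    # distinct authoritative name servers
        "#ip": len(ips),   # distinct hosting IP addresses
        "#soa": len(soa),  # distinct SOA domains
        "is_ns_sus_tld": any(n.rsplit(".", 1)[-1] in suspicious_tlds for n in ns),
        "ns_matching": any(two_ld(n) == apex.lower() for n in ns),
    }


# Hypothetical records using reserved names and documentation addresses.
pdns = [
    ("evil-brand.test", "NS", "ns1.host.test."),
    ("evil-brand.test", "NS", "ns2.other.invalid"),
    ("evil-brand.test", "A", "192.0.2.1"),
    ("www.evil-brand.test", "A", "192.0.2.2"),
    ("evil-brand.test", "A", "192.0.2.1"),
    ("evil-brand.test", "SOA", "ns1.host.test"),
]
hosting = hosting_features("evil-brand.test", pdns, suspicious_tlds={"invalid"})
```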
[0041] Returning to the method 200 of
[0042] It may then be determined, using at least one third model (e.g., the TLS classifier), a third likelihood of whether the first domain name is a brand squatting domain based on the first domain name, the hosting information of the first domain name, and the certificate information of the first domain name (block 212). In one example, the TLS classifier may be trained in the same manner described above for the NRD and hosting classifiers, except that the input data for the TLS classifier is drawn from CT logs and the TLS classifier utilizes additional features extracted from pDNS and CT log feeds. In at least some aspects, the certificates from a CT log feed may be used to train the TLS classifier.
[0043]
[0044] The features ct_duration_mean, ct_duration_std, ct_duration_min, and ct_duration_max capture first and second order statistics of certificate duration. The inventors observed that non-abusive EBS domains are more likely to have a higher variation in these measurements compared to abusive EBS domains. One reason for this observation is that reputed organizations behind non-abusive EBS domains have long-lived trusted certificates for their parent domains, whereas they use short-lived free certificates, such as those issued by Let's Encrypt, for experimental subdomains.
[0045] The features #domain_mean, #domain_std, #domain_min, and #domain_max measure first and second order statistics of the domains in both the CN (common name) and SAN (subject alternative name) lists of a certificate. The features #2ld_mean, #2ld_std, #2ld_min, and #2ld_max measure first and second order statistics of apex domains. The inventors observed that certificates related to abusive EBS domains are more likely to have a high variation in the domains and apexes involved compared to the non-abusive case. In one example, the TLS classifier may be trained with the lexical and WHOIS features described above for the NRD classifier, with the hosting features described above, and with the lexical features described for the TLS classifier and the CT log features. In another example, the TLS classifier may be trained with only the lexical features described for the TLS classifier and the CT log features.
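As a non-limiting illustration, the certificate statistics described above can be sketched as follows. The certificate representation (a validity duration in days and a combined CN and SAN name list), the feature names #certs and #star, the use of the population standard deviation, and the two-label apex simplification are assumptions made for this sketch.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def two_ld(hostname: str) -> str:
    """Return the last two labels of a host name (simplified apex)."""
    labels = hostname.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def summarize(prefix: str, values: Sequence[float]) -> Dict[str, float]:
    """Mean, standard deviation, minimum, and maximum of a value list."""
    return {
        f"{prefix}_mean": mean(values),
        f"{prefix}_std": pstdev(values),  # population std is an assumption
        f"{prefix}_min": min(values),
        f"{prefix}_max": max(values),
    }


def ct_features(certs: List[dict]) -> Dict[str, float]:
    """Aggregate CT log features over all certificates of one apex."""
    if not certs:
        raise ValueError("at least one certificate is required")
    feats: Dict[str, float] = {
        "#certs": len(certs),
        # Wildcard ("star") names across all related certificates.
        "#star": sum(n.startswith("*.") for c in certs for n in c["names"]),
    }
    feats.update(summarize("ct_duration", [c["validity_days"] for c in certs]))
    feats.update(summarize("#domain", [len(set(c["names"])) for c in certs]))
    feats.update(
        summarize("#2ld", [len({two_ld(n) for n in c["names"]}) for c in certs])
    )
    return feats


# Hypothetical certificates using reserved names, for illustration only.
ct = ct_features([
    {"validity_days": 90, "names": ["a.brand.test", "*.brand.test"]},
    {"validity_days": 365,
     "names": ["brand.test", "shop.other.test", "www.other.test"]},
])
```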
[0046] The inventors validated the classifiers of the provided brand squatting domain detection system 102 as shown by
[0047]
[0048]
[0049]
[0050] As demonstrated, the performance progressively improved with each classifier (e.g., the NRD to the hosting to the TLS classifier) as additional information about the domains was available.
[0051] Without further elaboration, it is believed that one skilled in the art can use the preceding description to utilize the claimed inventions to their fullest extent. The examples and aspects disclosed herein are to be construed as merely illustrative and not a limitation of the scope of the present disclosure in any way. It will be apparent to those having skill in the art that changes may be made to the details of the above-described examples without departing from the underlying principles discussed. In other words, various modifications and improvements of the examples specifically disclosed in the description above are within the scope of the appended claims. For instance, any suitable combination of features of the various examples described is contemplated.