Domain name recognition method and domain name recognition device
10931714 ยท 2021-02-23
Assignee
Inventors
- Pin-Cyuan Lin (New Taipei, TW)
- Yu-Chun Wu (New Taipei, TW)
- Ming-Kung Sun (Taipei, TW)
- Zong-Cyuan Jhang (Taipei, TW)
- Yi-Chung Tseng (TAIPEI, TW)
- Chiung-Ying Huang (Taipei, TW)
Cpc classification
G06F17/16
PHYSICS
G06F16/9566
PHYSICS
International classification
G06F16/955
PHYSICS
G06F17/16
PHYSICS
Abstract
The disclosure provides a domain name recognition method and a domain name recognition device. The domain name recognition method includes the following steps. A first string of a first domain name and a second string of a second domain name are obtained. Multiple characters of the first string and the second string are classified into multiple clusters. Multiple vectors corresponding to the clusters are generated, wherein each of the characters corresponds to one of the vectors. A first vector set corresponding to the first string and a second vector set corresponding to the second string are generated. A similarity of the first vector set and the second vector set is calculated.
Claims
1. A domain name recognition method performed by hardware electronic device with a processor and memory, comprising: obtaining a first string of a first domain name and a second string of a second domain name; classifying a plurality of characters of the first string and the second string into a plurality of clusters, and generating a plurality of vectors corresponding to the plurality of clusters, wherein each of the plurality of characters corresponds to one of the plurality of vectors; generating a first vector set corresponding to the first string and a second vector set corresponding to the second string; and calculating a similarity between the first vector set and the second vector set using an algorithm, wherein the algorithm is a dynamic time warping (DTW) algorithm.
2. The domain name recognition method according to claim 1, wherein lengths of the first string and the second string are not necessarily the same.
3. The domain name recognition method according to claim 1, wherein each of the plurality of vectors is a unit vector and different ones of the clusters correspond to different ones of the vectors.
4. The domain name recognition method according to claim 1, wherein the algorithm generates a matrix corresponding to a length of the first string and a length of the second string, establishes a shortest distance path of a bottom leftmost element to a top rightmost element in the matrix, calculates a distance of one of the first vector sets on and one of the second vector sets corresponding to each of the elements on the shortest distance path, and calculates a similarity according to a sum of each of the distances on the shortest distance path.
5. The domain name recognition method according to claim 4, wherein a value of each of the elements of the matrix is a sum of the distance of each of the elements plus a smallest value of a value of an element of a left element, a bottom element, and a bottom left element of each of the elements, and the shortest distance path is generated by selecting an element with a smallest element value in the left element, the bottom element, and the bottom left element of the top rightmost element from the top rightmost element of the matrix.
6. A domain name recognition device, comprising: a processor; and a memory coupled to the processor, wherein the processor obtains a first string of a first domain name and a second string of a second domain name; classifies a plurality of characters of the first string and the second string into a plurality of clusters, and generates a plurality of vectors corresponding to the plurality of clusters, wherein each of the plurality of characters corresponds to one of the plurality of vectors; generates a first vector set corresponding to the first string and a second vector set corresponding to the second string; and calculates a similarity between the first vector set and the second vector set using an algorithm, wherein the algorithm is a dynamic time warping (DTW) algorithm.
7. The domain name recognition device according to claim 6, wherein lengths of the first string and the second string are not necessarily the same.
8. The domain name recognition device according to claim 6, wherein each of the plurality of vectors is a unit vector and different ones of the clusters correspond to different ones of the vectors.
9. The domain name recognition device according to claim 6, wherein the algorithm generates a matrix corresponding to a length of the first string and a length of the second string, establishes a shortest distance path of a bottom leftmost element to a top rightmost element in the matrix, calculates a distance of one of the first vector sets on and one of the second vector sets corresponding to each of the elements on the shortest distance path, and calculates a similarity according to a sum of each of the distances on the shortest distance path.
10. The domain name recognition device according to claim 9, wherein a value of each of the elements of the matrix is a sum of the distance of each of the elements plus a smallest value of a value of an element of a left element, a bottom element, and a bottom left element of each of the elements, and the shortest distance path is generated by selecting an element with a smallest element value in the left element, the bottom element, and the bottom left element of the top rightmost element from the top rightmost element of the matrix.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1)
(2)
(3)
(4)
DETAILED DESCRIPTION OF DISCLOSED EMBODIMENTS
(5) In an embodiment, the Euclidean distance, the Hamming distance, the Edit distance, and the Cosine similarity may be used to perform the sequence similarity calculation. The Euclidean distance and the Hamming distance limit that the lengths of the two string vectors to be compared must be the same. However, in the context of domain name comparison, the lengths of various domain names are not the same most of the time. The Edit distance compares the similarity between two strings by calculating the minimum number of edit(s) required to convert one string into the other string. However, the minimum number of edit(s) does not effectively reflect the similarity of the domain names (for example, the minimum number of edit to convert google into oogle is 1 while the minimum number of edits to convert google into g00g1e is 3, but in fact the latter is a malicious website relatively more difficult to be detected by the user). The Cosine similarity requires the two strings to be first converted into the same length through a specific vectorization method. However, the intermediate process during vectorization may easily cause the order of the domain names to be distorted.
(6)
(7) Referring to
(8) In an embodiment, the processor 110 can find a malicious website disguised as a normal secure website by comparing the domain names and assist the user in detecting that he or she has been directed into a high-risk website before the user is victimized. Specifically, when a list of secure domain names (for example, a list of domain names of 500 global websites or a whitelist) is given, the processor 110 can compare the similarity between the string of domain name of a specific website and the strings of domain names of the whitelist before the user enters the specific website. If the similarity between the string of domain name of the specific website and the strings of domain names of the whitelist is too high, but the domain name of the specific website is not a domain name in the whitelist, the processor 110 can generate a warning notification to the user to alert the user that the website being visited may be a malicious website.
(9) In an embodiment, the processor 110 can classify characters in the string of domain name into multiple clusters. Multiple characters, which may be mistaken from one another by human eye, may be included in each of the clusters. Table 1 is an example of classifying characters into multiple clusters and Table 1 can be recorded in the memory 120.
(10) TABLE-US-00001 TABLE 1 Cluster Character C0 other symbols C1 o, 0 C2 P, q, g, 9 C3 i, 1, j, 1
(11) Taking goo.gl as an example, each of the characters of goo.gl will be converted into the following code under the above classification:
(12) TABLE-US-00002 g o o . g l C2 C1 C1 C0 C2 C3
(13) In addition, each of the clusters can also correspond to a unit vector, as shown in
(14) The following will illustrate how to compare the similarity of strings of two domain names.
(15)
(16) Referring to
(17) In particular, the processor 110 can generate a matrix 300 with dimensions corresponding to the lengths of the first string and the second string (i.e. a 23 matrix). In the matrix 300, each of the element values is calculated by calculating the distance of two vectors corresponding to each of the elements plus the smallest value in the left element, the bottom element, and the bottom left element of the element. For example, since an element 301 does not have a left element, a bottom element, and a bottom left element, the value of the element 301 is the distance 2 of C1 and C0 (the distance of two clusters is the sum of absolute value of subtraction of all corresponding elements of the two clusters). The value of an element 302 is the sum 2 of the distance 0 of C1 and C1 plus the left element value 2 of the element 302, as the element 302 does not have a bottom element and a bottom left element. Since the smallest value of a left element 304, the bottom element 302, and the bottom left element 301 of an element 303 is 2, and the distance of C1 and C2 corresponding to the element 303 is 2, the value of the element 303 is 2 plus 2 equals to 4. The values of other elements may be deduced so on and so forth.
(18) Referring to
(19) Referring to
(20) Finally, processor 110 sums up the recalculated values of all of the elements on the shortest distance path and divides by the sum of lengths of the first string and the second string to obtain a final value. For example, the final value=(2+0+0)/(2+3)=0.4. The smaller the final value, the higher the similarity of the two strings. Therefore, the processor 110 can issue a warning notification to alert the user when the final value above is less than a threshold value.
(21)
(22) Referring to
(23) In Step S402, multiple characters of the first string and the second string are classified into multiple clusters, and multiple vectors corresponding to the clusters are generated, wherein each of the characters corresponds to one of the vectors.
(24) In Step S403, a first vector set corresponding to the first string and a second vector set corresponding to the second string are generated.
(25) In Step S404, a similarity of the first vector set and the second vector set is calculated using an algorithm.
(26) In summary, the domain name recognition method and the domain name recognition device of the disclosure divide the characters of the domain name into multiple clusters and generate multiple vectors corresponding to the clusters, and further generate vector sets of the two domain names. Finally, the similarity of the two vector sets corresponding to the two domain names is calculated. When the similarity is too high, a warning notification can be issued to alert the user. The disclosure adopts the DTW algorithm for comparing the similarity of the strings. Since the algorithm is designed based on dynamic programming, the time taken for comparison can be significantly reduced. In addition, the method of the disclosure relative to the blacklist of domain names may also prevent the situation whereby the blacklist is generated after the user has been victimized from happening.
(27) Although the disclosure has been disclosed in the above embodiments, the embodiments are not intended to limit the disclosure. It will be apparent to persons skilled in the art that various modifications and variations can be made to the disclosed embodiments without departing from the scope or spirit of the disclosure. In view of the foregoing, it is intended that the disclosure covers modifications and variations provided that they fall within the scope of the following claims and their equivalents.