Computer-implemented method of performing a search using signatures

Abstract

A computer-implemented method of processing a query vector and a data vector), comprising: generating a set of masks and a first set of multiple signatures and a second set of multiple signatures by applying the set of masks to the query vector and the data vector, respectively, and generating candidate pairs, of a first signature and a second signature, by identifying matches of a first signature and a second signature. The set of masks comprises a configuration of the elements that is a Hadamard code; a permutation of a Hadamard code; or a code that deviates from a Hadamard code or a permutation of a Hadamard code in less than 40% of its elements.

Claims

1. A computer-implemented method of processing a first vector representing a query, and a second vector, representing data stored in a digital database, the method comprising: receiving the first vector from an input to a computer or from a digital database and the second vector from a digital database or from an input to a computer; generating, using the computer, a set of masks; wherein the set of masks comprises first elements and second elements; generating, using the computer, a first set of multiple signatures and a second set of multiple signatures by applying the set of masks to the first vector and the second vector, respectively; generating, using the computer, candidate pairs, of a first signature and a second signature, by identifying matches of a first signature and a second signature so that every pair of the first and second vectors whose mismatch is smaller than a predefined threshold becomes a candidate pair; wherein the set of masks is arranged as multiple partitions so that each mask spans a fraction of the first vector or the second vector and the masks collectively span the full size of the first vector or the second vector, wherein a number oft masks are generated, wherein t is expressed as: t=(2.sup.k/b+1−1)±d where b is a number of partitions, k is a number of allowed mismatches, and d is a relative deviation selected from the group of: 5%, 10%, 15% and 20%, and wherein the set of masks comprises a configuration of the elements that is one or more of the following: a Hadamard code; a permutation of a Hadamard code; a code that deviates from a Hadamard code or a permutation of a Hadamard code in less than 40% of its elements.

2. A computer-implemented method according claim 1, wherein the first vector and the second vector agree in size; and wherein each mask agrees in size with the first vector and the second vector.

3. A computer-implemented method according to claim 1, wherein partitions are arranged with mutual overlaps such that for elements in the first vector and/or the second vector, each element is represented in multiple partitions, but not in all partitions.

4. A computer-implemented method according to claim 1, wherein the set of masks comprises a configuration of the elements that is a code that deviates from a Hadamard code or a permutation of a Hadamard code in accordance with one of the following: in less than 40% of its elements for k≧18, in less than 30% of its elements for k≧8, in less than 20% of its elements for k≧5, in less than 10% of its elements for k≧4; wherein k is a number of allowed mismatches.

5. A computer-implemented method according to claim 1, wherein the set of masks and the first vector and the second vector are arranged in respective multiple partitions; and wherein each or at least a majority of the partitions of the set of masks comprises a configuration of the elements that is one or more of the following: a Hadamard code; a permutation of a Hadamard code; a code that deviates from a Hadamard code or a permutation of a Hadamard code in less than 40% of its elements.

6. A computer-implemented method according to claim 1, comprising the steps of: computing a distance measure for respective candidate pairs; looking up the first vector and the second vector that produced the signatures of the respective candidate pair; wherein the distance measure is computed to represent the distance between the looked-up the first vector and the second vector that produced the signatures of the respective candidate pair.

7. A computer-implemented method according to claim 1; wherein the second vector is a data vector comprised by multiple data vectors that constitute a dataset; comprising: computing an index that links a signature to the vector that was applied to a mask to generate the signature; and using the index to identify the vector that was applied to a mask to generate the signature in an identified candidate pair.

8. A computer-implemented method according to claim 1, comprising: generating a predefined permutation scheme for elements in either the query vector or the data vector; and generating a permutation of the query vector and generating a permutation of the data vector using the same predefined permutation scheme.

9. A computer-implemented method according to claim 1 wherein the first vector is received via a user interface such as a user interface accessible via the Internet.

10. A computer system loaded with a computer program configured to perform the computer-implemented method as claimed in claim 1.

11. A non-transitory computer-readable medium carrying a program configured to perform the computer-implemented method as claimed in claim 1 when run on a computer.

12. A non-transitory computer-readable medium carrying data structure configured to store the signatures generated by the computer-implemented method of claim 1.

13. A computer configured in one or both of hardware or software for processing a first vector, representing a query, and a second vector, representing data stored in a digital database, the computer comprising one or more components configured to: receive the first vector from an input to the computer or from a digital database and the second vector from a digital database or from an input to the computer; generate a set of masks; wherein each mask agrees in size with the first vector and the second vector; and wherein the set of masks comprises first elements and second elements; generate a first set of multiple signatures and a second set of multiple signatures by applying the set of masks to the first vector and the second vector, respectively; generate candidate pairs, of a first signature and a second signature, by identifying matches of a first signature and a second signature so that every pair of the first and second vectors whose mismatch is smaller than a predefined threshold becomes a candidate pair; wherein the set of masks is arranged as multiple partitions so that each mask spans a fraction of the first vector or the second vector and the masks collectively span the full size of the first vector or the second vector, wherein a number oft masks are generated, wherein t is expressed as: t=(2.sup.k/b+1−1)±d where b is a number of partitions, k is a number of allowed mismatches, and d is a relative deviation selected from the group of: 5%, 10%, 15% and 20%, and wherein the set of masks comprises a configuration of the elements that is one or more of the following: a Hadamard code; a permutation of a Hadamard code; a code that deviates from a Hadamard code or a permutation of a Hadamard code in less than 40% of its elements.

Description

BRIEF DESCRIPTION OF THE FIGURES

(1) A more detailed description follows below with reference to the drawing, in which:

(2) FIG. 1 shows an example flow of computing similarity joins;

(3) FIG. 2a shows a performance curve showing an improvement over a signature-based method using partitioning;

(4) FIGS. 2b and 2c show the number of expected matches as a function of the number of allowed mismatches;

(5) FIG. 3 shows a flowchart of processing a query; and

(6) FIG. 4 shows the structure of a system for processing a query.

DETAILED DESCRIPTION

(7) FIG. 1 shows an example flow of performing a search using signatures. In this example flow the first vector 101 has a length d=15 and comprises 15 elements holding 15 characters. The second vector 102 also has a length d=15 and comprises 15 elements holding 15 characters. The first vector and the second vector then agree in size.

(8) Albeit a simple example, the first vector 101 is considered a query vector and the second vector 102 is considered a data vector. The vectors 101 and 102 contain in this example the following characters:

(9) TABLE-US-00001 Pos: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 101: # m a # l i s h e # # m p l e

(10) TABLE-US-00002 Pos: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 102: s m a l l i s h e x a m p l e

(11) The ‘#’ character explicitly represents the positions of mismatches between the query vector and the data vector, in order to make it easier to follow the example. We note that such a representation is also convenient if the positions of where the mismatches may occur are known. If the positions of the mismatches are not known, use of special characters for that purpose can be dispensed with.

(12) Typically the second vector 102 is comprised by multiple data vectors that constitute a dataset. Such a data set may comprise any number of vectors such as hundreds, thousands, millions, billions or an even higher number of data vectors.

(13) The query vector and the data vector are partitioned in step 103 into smaller partitions. In this example the number of partitions is b=2, which means that the query vector and the data vector each are partitioned into b=2 partitions. Thus the first vector 101 is partitioned into partitions 101a and 101b, and the second vector 102 is partitioned into partitions 102a and 102b.

(14) Additionally, a predefined permutation scheme is used for generating a permutation of the query vector and generating a permutation of the data vector using the same predefined permutation scheme. In this example case, the predefined permutation scheme is represented in the following way:

(15) TABLE-US-00003 1 .fwdarw. 3 6 .fwdarw. 14 6 .fwdarw. 14 2 .fwdarw. 9 7 .fwdarw. 1 7 .fwdarw. 1 3 .fwdarw. 5 8 .fwdarw. 13 8 .fwdarw. 13 4 .fwdarw. 2 9 .fwdarw. 4 9 .fwdarw. 4 5 .fwdarw. 11 10 .fwdarw. 7 10 .fwdarw. 7

(16) In this representation of the permutation scheme, arrows indicate that the content of a first element position on its left side is repositioned to a second element position as indicated on the right side of the arrow; e.g. element number 1 in the query vector or the data vector is repositioned to be element number 3.

(17) Using this permutation scheme the content of the above vectors 101 and 102 becomes:

(18) TABLE-US-00004 Pos: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 101: s # s e a p # e m # l # h 3

(19) TABLE-US-00005 Pos: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 102: s l s e a p x e m m l l h i a
The partitioning then partitions the vectors by splitting them into b=2 partitions, 101a and 101b:

(20) TABLE-US-00006 Pos: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 101a: s # s e a p #

(21) TABLE-US-00007 Pos: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 101b: e m # l # h 3
and 102a and 102b:

(22) TABLE-US-00008 Pos: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 102a: s l s e a p x

(23) TABLE-US-00009 Pos: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 102b: e m m l l h i a

(24) The partitions 101a, 101b and 102a, 102b are then input to an enumeration step, which expands the vectors into respective sets of signatures 104 and 105 by means of a set of t masks (not shown). The set of masks may conform to a Hadamard code definition, wherein zeros represent elements not to transfer from a vector and onto the signature, and ones represent elements to transfer from a vector and onto the signature, such that the element becomes part of the signature.

(25) A mask in the set of masks may span the full length of the vectors or span the full length of a partition; it is shown that a mask spans the full length of each of the b=2 partitions. In this example the set of masks count t=7 masks. For the sake clarity, elements in the signature represented by the character ‘^’ (the escape character) represents blank elements, which are elements corresponding to masking elements in a respective mask i.e. elements which the mask didn't transfer from the vector to the signature.

(26) The enumeration step 103 is illustrated below for one of the partitions; wherein the data vector partition 102a is shown as the topmost square, followed by the mask and bottommost: the signature.

(27) TABLE-US-00010 Data vector 102a: s l s e a p x Mask: 1 0 1 0 1 0 1 0 1 1 0 0 1 1 1 1 0 0 1 1 0 0 0 0 1 1 1 1 1 0 1 1 0 1 0 0 1 1 1 1 0 0 1 1 0 1 0 0 1 Signature: s {circumflex over ( )} s {circumflex over ( )} a {circumflex over ( )} x {circumflex over ( )} L s {circumflex over ( )} {circumflex over ( )} p x s l {circumflex over ( )} {circumflex over ( )} a p {circumflex over ( )} {circumflex over ( )} {circumflex over ( )} {circumflex over ( )} e a p x s {circumflex over ( )} s e {circumflex over ( )} p {circumflex over ( )} {circumflex over ( )} l s e a {circumflex over ( )} {circumflex over ( )} s l {circumflex over ( )} e {circumflex over ( )} {circumflex over ( )} x

(28) Reverting to FIG. 1, in this example, the signature partitions 104a and 104b from respective partitions of the query vector are arranged in continuation of each other however, it is clear that various representations on a computer are foreseeable.

(29) The sets of signatures 104 and 105 comprising any partitions thereof may be stored in a dedicated memory and/or data storage.

(30) The step 106 of performing candidate generation uses the sets of signatures 104 and 105 to perform exact searches for one or more matching pairs of signatures from the first set of signatures 104 and the second set of signatures 105. A match identified by the step 106 of performing candidate generation is, in this example, is illustrated by two ellipses connected by the line designated Mtc. However, it should be noted that multiple matches may be identified since a predefined level of mismatches is allowed. Thus, multiple matches resulting from the one and same data vector or resulting from multiple data vectors may be identified. The matching signatures are shown as 107 and 108.

(31) Distance computation step 109 tracks the first vector 101 and the second vector 102 from the candidate pair, i.e. the matching signatures 107 and 108, and computes a distance measure between the two vectors. The distance measure may be a Hamming distance or another distance measure. In this simple example it is trivial to identify the first vector 101 and second vector 102 from the respective signatures 107 and 108. However, when the computer-implemented method operates on a multitude of vectors it can be convenient to compute an index for the multitude of data vectors. The index may be computed and stored at the same time or in the same process when the signatures are computed. Consequently, a data vector corresponding to a candidate pair can be looked up quickly. In some embodiments the index is a hash table, wherein the index values are computed from by a hash function or an alternative index generating function. Especially, in connection with distance computation in step 109, the addition of an index greatly speeds up the process of retrieving the data vector associated with signatures of a candidate pair, for computing the distance measure.

(32) The distance computation in step 109 may complete the computer-implemented method and point to those data vectors that have a distance measure relative to the query vector below a threshold value or point to the data vectors for which there is identified a candidate pair in a ranked order of distance e.g. in increasing or decreasing order.

(33) The above-mentioned set of masks can be chosen as a Hadamard code. The k.sup.th Hadamard code is the collection H.sub.k={x.sub.v|v in {0,1}.sup.k} of 2.sup.k vectors, wherein the i.sup.th bit of x.sub.v is 1 if and only if p=i & v has an odd number of bits set to 1, wherein i is interpreted as a k-bit binary number and & is the bitwise conjunction operator (AND-operator). To compute the i.sup.th bit of the Hadamard code word indexed by the vector v, the bitwise conjunction operator (AND operator) is applied to v and i, and the number of bits set to 1 is computed using a so-called popcnt( ) instruction. The bit is then extracted as the least significant bit of the count. In programming languages such as C this entire computation can be expressed as: popcnt(v & i) & 1, wherein popcnt( ) is a function that counts number of ones (1). In connection therewith it is noted that in some embodiments a mask containing only zeros (only masking elements) of a set of masks that conforms to a Hamming code is omitted i.e. effectively not included in the set of masks. The reason is that such a mask produces identical signatures for all vectors irrespective of their content. Further, in some embodiments, the very first column (0.sup.th coordinate in each vector) in the set of Hamming codes H.sub.k is effectively omitted from the set of masks since it may be 0 in all vectors of H.sub.k (confer below).

(34) In some embodiments, the vectors and the masks are configured to have the same length. Then the number of masks needed is 2.sup.k+1 wherein k is the number of errors accepted in a partition; that is

(35) $\frac{k}{b}$
errors are accepted in each partition, wherein b is the number of partitions; thus the number of masks needed is 2.sup.k+1.

(36) In some embodiments the vectors, be it a data vector and/or a query vector, or at least one of them, are shorter than the mask, and padding elements are added to the at least one vector to make its length agree with the length of the mask.

(37) In some embodiments, the vectors be it a data vector and/or a query vector, or at least one of them, are longer than mask and each mask element or at least one of them is then configured to handle multiple elements of the data vector and/or query vector. That is, each mask element may mask out or transfer multiple vector elements.

(38) In case, the vectors and the masks do not agree in length, any combination of the above embodiments may be employed to accommodate vectors and masks that do not agree in length.

(39) FIG. 2a shows a performance curve showing an improvement over a signature-based method using partitioning. The Cartesian coordinate system shows a first performance curve 201 representing a prior art method as suggested in Arasu et al. in U.S. Pat. No. 7,865,505 and a second performance curve 202 representing an improved method as suggested herein. The abscissa (x-axis) shows a normalized filtering probability, P, and the ordinate (y-axis) shows the logarithm to the number of signatures generated.

(40) The filtering probability, P, can be expressed by the following the approximation:

(41) $P = {(1 - \frac{1}{2 b})}^{d (q, x)}$
which expresses the probability per mask P of finding a match i.e. a candidate pair, wherein b is a parameter representing a number of partitions and d (q,x) is the number of mismatches between the query vector, q, and the data vector, x, that generated the signature in the candidate pair via the mask. The number of mismatches may be measured as the Hamming distance. The plot in FIG. 2 is shown for a fixed distance measured by d(q,x).

(42) The slightly curved, right hand side pointing arrow 203 shows the progression of the performance curves 202 and 203 as the number of partitions increases.

(43) The dashed line 204 shows the improvement of the method disclosed herein in terms of the number of signatures to be generated.

(44) The improved method yields an improved trade-off between the number of masks and the filtering efficiency that improves the one achieved by Arasu et al. In general, the trade-off is controlled by the number of partitions in the partitioning step: More partitions decrease the number of masks required, but also decreases the filtering efficiency.

(45) In a given application to a data set, the number of partitions should be chosen to balance the cost of generating all signatures (which is proportional to the number of masks) and the cost of false positives (determined by the filtering efficiency).

(46) It should be noted that the filtering probability decreases exponentially with the distance d(q,x) between the first vector x and the second vector q. The rate of decrease (i.e. with respect to FIG. 2a: towards 0.00 along the abscissa, x-axis) depends on the number of partitions—the fewer partitions the faster decrease.

(47) The effect of changing the number of partitions represented by b (and hence number of signatures) on filtering level is that the rate of decrease is slower for larger values of b (more partitions).

(48) FIGS. 2b and 2c show the number of expected matches as a function of the number of allowed mismatches. Both in FIG. 2b and FIG. 2c the abscissa indicates the number of allowed mismatches, k, and the ordinate indicates the number of expected matches (i.e. candidate pairs) per data vector.

(49) The curves 205 and 207 show the performance of the PartEnum prior art approach by Arasu et al., whereas the curves 206 and 208 show the performance of the improved method.

(50) In FIG. 2b the curve 206 for the improved method is drawn up using a set of masks that fully comply with a Hadamard code configuration or permutations of a Hadamard code. As can be seen the improved method performs better for a broad range of mismatches since generally fewer candidate pairs are generated per data vector.

(51) In FIG. 2c the curve 208 for the improved method is drawn up using a set of masks that comply with a code that deviates from a Hadamard code or a permutation of a Hadamard code in about 30% of its elements, since 30% of the non-masking elements are replaced by masking elements. As can be seen the improved method performs better for a broad range of mismatches for k≧8 since fewer candidate pairs are generated per data vector. For a deviation of 30% and for k<8 the improved method performs only marginally poorer than Arasu et al. when their masks are similarily modified by reducing the number of non-masking elements by 30%. The improved method is therefore a good alternative.

(52) FIG. 3 shows a flowchart of processing a query. Before processing a query a set of masks conforming to a Hadamard code are generated and stored in step 304. Step 303 then performs generation of signatures by applying the set of masks to data vectors stored in a data repository 302 e.g. in the form of a database. Optionally, data may be partitioned in step 301 as explained in connection with FIG. 1. While signatures are generated or as a subsequent step an index that links a signature to the vector that was applied to a mask in the set of masks to generate the signature is generated in step 305. The index is stored in a repository 306 in connection with the data repository 302.

(53) A query is received in step 307 and in case partitioning is applied the query is partitioned in step 308 to conform to the partitioning applied to the data in step 301. The set of masks generated in step 304 are applied to the query or the partitioned query to generate query signatures in step 309. In connection therewith or as a subsequent step 310 computes a hash value of the signature.

(54) In step 311 data signatures generated in step 303 are matched with query signatures generated in step 309. Matching signature pairs of data signatures and query signatures are identified.

(55) In step 312 the data vector and the query vector that generated the signature via mask is looked up via the computed index. Further, a distance measure is computed for the data vector and the query vector.

(56) Data vectors that generated a signature in a candidate pair may be presented in an order according to a value of the distance measure. In some embodiments data vectors are filtered such that data vectors with a distance measure relative to the query vector that satisfied as threshold criterion are presented separately.

(57) FIG. 4 shows the structure of a system for processing a query. The system is accessed via a so-called Application Programmable Interface, API, 408. The API is configured to:

(58) receive a query vector,

(59) initiate candidate generation,

(60) initiate distance computation,

(61) present or retrieve candidates comprising signature candidates, and

(62) present or retrieve data vectors comprising optionally ranking the data vectors according to the computed distance measure,

(63) The API is also configured to:

(64) initiate generation of a set of masks,

(65) initiate generation of an index,

(66) configure a set of masks,

(67) configure partitions,

(68) Additionally, the API may be configured to set up and initiate connection to a database 402 wherein the data vectors are stored.

(69) The API accesses a collection of software components that are configured to perform the operations described in connection with the flowcharts. The software components may be configured in according to an object oriented structure. The software components comprise a partitioning component 402, a signature generator 403, an index generator 405, a candidate generator 406, a distance computing component 407, and a storage 404 for storing the set of masks.

(70) A database 402 stores the data vectors and the index 406.

(71) In some embodiments the computer-implemented method is implemented in a general purpose computer such as a general purpose server computer or a server computer dedicated to database and search operations.

(72) Although a software based approach has been described in the above, it should be noted that portions or the entire system may be implemented in hardware.

Computer-implemented method of performing a search using signatures

Assignee

Inventors

Cpc classification

Classification Explorer

G06F16/215

PHYSICS

Classification Explorer

G06F16/24554

PHYSICS

Classification Explorer

H03M13/1575

ELECTRICITY

Classification Explorer

G06F16/24544

PHYSICS

International classification

Classification Explorer

G06F17/30

PHYSICS

Classification Explorer

G06F17/00

PHYSICS

Abstract

Claims

Description