Scalable binning for big data deduplication

11204707 · 2021-12-21

Abstract

Fast record deduplication is accomplished by providing as an input, data records having multiple attributes, and local similarity functions of individual attributes with local similarity thresholds. Bin IDs are then generated based on the local similarity functions and the local similarity thresholds. The Bin IDs are unique identifiers of a respective bin of records, and the bin of records is a set of records that are possibly pairwise similar. Local candidate pairs are identified based on data records that share Bin IDs. The local candidate pairs are aggregated to produce a set of global candidate pairs. The set of global candidate pairs are filtered by deciding whether a pair of data records represents a duplicate.

Claims

1. A computer implemented method for use in fast record deduplication comprising: (a) inputting, into software running on one or more computer processors, data records having multiple attributes; (b) inputting, into the software, local similarity functions of individual attributes with local similarity thresholds; (c) generating, by the software, Bin IDs based on the local similarity functions and the local similarity thresholds, wherein the Bin IDs are unique identifiers of a respective bin of records, and wherein all similar record pairs appear in the same bin; (d) identifying local candidate pairs based on data records that share Bin IDs; (e) aggregating, by the software, the local candidate pairs to produce a set of global candidate pairs; and (f) filtering, by the software, the set of global candidate pairs by deciding whether a pair of data records represents a duplicate.

2. The method of claim 1 wherein the candidate pairs are identified by a Cartesian product of all data records sharing Bin IDs.

3. A system for performing fast record deduplication comprising at least one non-transitory computer-readable medium containing computer program instructions that when executed by at least one computer processor cause the at least one computer processor to perform the steps of: (a) inputting data records having multiple attributes; (b) inputting local similarity functions of individual attributes with local similarity thresholds; (c) generating Bin IDs based on the local similarity functions and the local similarity thresholds, wherein the Bin IDs are unique identifiers of a respective bin of records, and wherein all similar record pairs appear in the same bin; (d) identifying local candidate pairs based on data records that share Bin IDs; (e) aggregating the local candidate pairs to produce a set of global candidate pairs; and (f) filtering the set of global candidate pairs by deciding whether a pair of data records represents a duplicate.

4. The system of claim 3 wherein the candidate pairs are identified by a Cartesian product of all data records sharing Bin IDs.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) In the drawings, closely related figures and items have the same number but different alphabetic suffixes. Processes, states, statuses, and databases are named for their respective functions.

(2) FIG. 1 depicts the search space for an example record containing tokens {A, B, C, D}.

(3) FIG. 2 depicts the space partitioning of a numeric attribute domain into overlapping bins.

(4) FIG. 3 shows an experimental comparison with previous art.

DETAILED DESCRIPTION, INCLUDING THE PREFERRED EMBODIMENT

(5) In the following detailed description, reference is made to the accompanying drawings which form a part hereof, and in which are shown, by way of illustration, specific embodiments which may be practiced. It is to be understood that other embodiments may be used, and structural changes may be made without departing from the scope of the present disclosure.

Terminology

(6) The terminology and definitions of the prior art are not necessarily consistent with the terminology and definitions of the current invention. Where there is a conflict, the following definitions apply.

(7) Local Similarity Function: a function to measure the similarity of two values that belong to the same attribute.

(8) False Negatives Budget: The maximum number of similar record pairs that can be absent from the generated candidate pairs.

(9) Disjunctive Normal Form: A specific form of Boolean formula written as an OR of multiple clauses, each of which is an AND of atomic predicates.

(10) Local Candidate Pairs: A set of record pairs that have local similarity function above a certain threshold.

(11) Global Candidate Pairs: A set of record pairs that are obtained by aggregating multiple local candidate pair sets.

(12) Bin of Records: A set of records that are possibly pairwise similar.

(13) Bin ID: A unique identifier of a bin of records.

Operation

(14) Binning is the process of assigning the input records to (possibly overlapping) bins such that all similar record pairs appear in the same bin. All record pairs that belong to different bins can be pruned.

Input

(15) A set of values D={v_1, . . . v_n} where each value v_i is represented as a (multi) set of elements denoted {e_i1, . . . , e_in}. Possible methods to split each value v_i to elements include tokenization and q-gramming. The exact split method to use depends on the desired granularity of the similarity computation. For example, if word-level similarity is required, tokenization should be used. Alternatively, if character-level similarity is required, q-gramming should be used.
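As an illustrative sketch (not part of the patent's implementation), the two split methods can be contrasted as follows; the whitespace tokenizer, lowercasing, and the choice q = 3 are assumptions:

```python
def tokenize(value: str) -> set:
    """Word-level split: one element per whitespace-delimited token."""
    return set(value.lower().split())

def q_grams(value: str, q: int = 3) -> set:
    """Character-level split: all overlapping substrings of length q."""
    return {value.lower()[i:i + q] for i in range(len(value) - q + 1)}

# Word-level similarity sees these strings as mostly different,
# while character-level similarity still finds heavy overlap.
a, b = "data deduplication", "data de-duplication"
print(tokenize(a) & tokenize(b))            # shared tokens
print(len(q_grams(a) & q_grams(b)))         # count of shared 3-grams
```

This is why the granularity of the desired similarity drives the choice of split method: a single inserted character disrupts a whole token but only a few q-grams.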

(16) A similarity function sim(v_i, v_j). For example, cosine similarity, Jaccard similarity, Dice similarity.

(17) A similarity threshold T in range [0, 1].

Output

(18) For each value v_i, the corresponding bin IDs, denoted B(v_i), are output such that sim(v_i, v_j) >= T implies B(v_i) ∩ B(v_j) is not empty.

(19) Let O={e_1, . . . , e_m} be a total ordering on all unique elements in v_1 ∪ . . . ∪ v_n. Although the total ordering does not affect the correctness of the algorithm (i.e., no false negatives), it does affect the pruning power of the binning criteria. One efficient ordering is based on the increasing frequency of element occurrences in the data set (ties are broken arbitrarily).

(20) A search state S_i is a mapping from each element in O to {0, 1, ?}. An element mapped to 0 indicates the absence of that element in the intersection of two sets v_i, v_j. Mapping to 1 indicates the presence of such an element. Mapping to ? indicates either absence or presence of the element. P(S_i) is the elements in S_i that are mapped to 1. N(S_i) indicates the elements in S_i that are mapped to 0. D(S_i) are the elements on S_i that are mapped to ?.

(21) For a given element set v_i, the benefit of a state S_k is the best similarity between v_i and any other element set v_j ⊆ O such that v_i ∩ v_j = P(S_k). That is, benefit(v_i, S_k) = max over all v_j with v_i ∩ v_j = P(S_k) of sim(v_i, v_j).

(22) A goal state satisfies benefit(v_i, S_k)>=T.

(23) A benefit upper bound, denoted benefit_ub, is obtained by assuming that all elements in D(S_k) also belong to v_i ∩ v_j: benefit_ub(v_i, S_k) = max over all v_j with v_i ∩ v_j = P(S_k) ∪ D(S_k) of sim(v_i, v_j).

(24) The benefit function and its upper bound can easily be computed for various monotone similarity functions such as cosine similarity and Jaccard similarity. For example, for cosine similarity, assume that each element e in v_i is associated with a normalized weight w(e). Additionally, each element e is associated with mw(e), which is the maximum weight associated with term e across all values v_1, . . . , v_n.

(25) A value v_i and a state S_k have the following lower and upper bounds on benefit:

benefit(v_i, S_k) = min((Σ_{e ∈ v_i ∩ P(S_k)} w(e)^2)^{1/2}, Σ_{e ∈ v_i ∩ P(S_k)} w(e)·mw(e))

benefit_ub(v_i, S_k) = min((Σ_{e ∈ v_i ∩ (P(S_k) ∪ D(S_k))} w(e)^2)^{1/2}, Σ_{e ∈ v_i ∩ (P(S_k) ∪ D(S_k))} w(e)·mw(e))
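The two cosine bounds can be sketched in runnable form; this is an illustrative transcription, and the function name, dictionary-based weights, and sample values are our assumptions:

```python
import math

def benefit_bounds(v, P, D, w, mw):
    """Lower/upper benefit bounds for cosine similarity (sketch).

    v: set of elements of the value; P / D: state elements mapped to 1 / '?';
    w: normalized per-element weights of v; mw: maximum weight of each
    element across all values. All names are illustrative.
    """
    def bound(elems):
        inter = v & elems
        # Norm of v's projection on the shared elements bounds the cosine
        # from one side; sum of w(e)*mw(e) bounds the partner's side.
        norm = math.sqrt(sum(w[e] ** 2 for e in inter))
        cross = sum(w[e] * mw[e] for e in inter)
        return min(norm, cross)

    lower = bound(P)        # assume only the P elements are shared
    upper = bound(P | D)    # optimistically add all '?' elements
    return lower, upper

v = {"a", "b", "c"}
w = {"a": 0.6, "b": 0.6, "c": 0.52}
mw = {"a": 0.9, "b": 0.9, "c": 0.9}
lo, hi = benefit_bounds(v, {"a"}, {"b"}, w, mw)
print(lo, hi)  # the upper bound never drops below the lower bound
```

Monotonicity of the similarity function guarantees that adding '?' elements to the assumed intersection can only raise the bound, which is what makes the A*-search pruning below sound.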

(26) State S is represented as a triple (P, N, D) where P is the set of elements mapped to 1, N is the set of elements mapped to 0, and D is the set of elements mapped to ?. Since the union of P, N, and D is always equal to the set of all elements {e_1, . . . , e_m}, the set N can be omitted and S represented using (P, D) only.

(27) The FIG. 1 tree shows the search space for a value containing elements {A, B, C, D}, assuming that the global ordering O is {A, B, C, D, E, F}.

(28) For example, (AC, D) represents a state where A and C are mapped to 1, B, E, and F are mapped to 0, and D is mapped to ?. Also, (BD, Ø) represents a state where B and D are mapped to 1, A, C, E, and F are mapped to 0, and nothing is mapped to ?.

(29) Pseudo-code for binning a value (A*-search):

(30) Algorithm 1: bin_text_value(v_i, O, benefit, benefit_ub, τ)

Require: v_i: the value to be binned, consisting of elements (T_1, . . . , T_k)
Require: O: a total order on all terms
Require: benefit: a function to compute the benefit of a state
Require: benefit_ub: a function to compute a benefit upper bound of a state

 1: B(v_i) ← Ø
 2: let PQ be a priority queue that contains search states ordered descendingly on benefit_ub
 3: push state S = (Ø, v_i) into PQ with score benefit_ub(v_i, S) = 1
 4: while PQ is not empty do
 5:   pop the top state S from PQ
 6:   if (benefit_ub(v_i, S) < τ) then
 7:     break
 8:   else if (benefit(v_i, S) ≥ τ) then
 9:     add S to B(v_i)
10:   else
11:     for each element T_i ∈ D(S) such that the index i is greater than the index of the last term in P(S) based on O do
12:       S′ ← (P(S) ∪ {T_i}, {T_{i+1}, . . . , T_k})
13:       push S′ into PQ with score benefit_ub(v_i, S′)
14:     end for
15:   end if
16: end while
17: return B(v_i)
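The A*-search above can be sketched in runnable form. This is not the patent's implementation: the Jaccard-flavoured benefit functions (benefit = |P|/|v|, upper bound = (|P|+|D|)/|v|) and the sample data are illustrative assumptions:

```python
import heapq

def bin_value(v, order, tau):
    """A*-style binning of one value (sketch).

    v: the value's elements; order: the global element ordering O.
    Returns the list of emitted (P, D) states whose benefit reaches tau.
    """
    v = sorted(set(v), key=order.index)
    n = len(v)
    benefit = lambda P, D: len(P) / n            # partner is exactly P
    benefit_ub = lambda P, D: (len(P) + len(D)) / n

    bins, counter = [], 0
    # Min-heap on negated upper bound = priority queue ordered descendingly.
    pq = [(-benefit_ub((), tuple(v)), counter, (), tuple(v))]
    while pq:
        neg_ub, _, P, D = heapq.heappop(pq)
        if -neg_ub < tau:
            break                      # no remaining state can reach tau
        if benefit(P, D) >= tau:
            bins.append((P, D))        # goal state: emit as a bin
        else:
            for i, e in enumerate(D):  # expand only elements after P's last
                counter += 1           # (counter breaks ties in the heap)
                P2, D2 = P + (e,), D[i + 1:]
                heapq.heappush(pq, (-benefit_ub(P2, D2), counter, P2, D2))
    return bins

order = ["A", "B", "C", "D", "E", "F"]
# For tau = 0.75 this emits (ABC, D), (ABD, Ø), (ACD, Ø), and (BCD, Ø),
# mirroring the FIG. 1 search space.
print(bin_value(["A", "B", "C", "D"], order, tau=0.75))
```

Representing D as the suffix of the ordered element list is what enforces line 11 of the pseudo-code: a child state may only add elements that come after the last element of P(S) in O, so no state is generated twice.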

(31) Algorithm 1 provides the smallest number of distinct candidates (i.e., after unioning the candidates coming from all bins). However, the number of candidates before the union step can be large. This is mainly because the algorithm emits the longest states with benefit>=T, which might be deep in the search space. The number of these states can be very large, and a given candidate pair is going to appear in a possibly large number of bins.

(32) In order to reduce the number of candidates, another stopping criterion is introduced: freq(S_i) < freq_threshold, where freq(.) indicates the co-occurrence frequency of the elements of S_i in the data set, and freq_threshold is a relatively small value (e.g., 0.1%). Computing the exact value of freq(S_i) is prohibitively expensive, especially if S_i is arbitrarily large. Possible heuristics are:

1. freq(S_i) = Π_{e ∈ S_i} freq(e) (assuming independence)
2. freq(S_i) = min_{e ∈ S_i} freq(e) (an upper bound)
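The two heuristics can be sketched directly; the per-element relative frequencies are assumed to be precomputed, and the names and sample numbers are illustrative:

```python
from functools import reduce

def freq_independence(state, freq):
    """Heuristic 1: product of element frequencies (independence assumption)."""
    return reduce(lambda acc, e: acc * freq[e], state, 1.0)

def freq_upper_bound(state, freq):
    """Heuristic 2: frequency of the rarest element (an upper bound)."""
    return min(freq[e] for e in state)

freq = {"A": 0.02, "B": 0.5}   # illustrative relative frequencies
print(freq_independence({"A", "B"}, freq))  # 0.01
print(freq_upper_bound({"A", "B"}, freq))   # 0.02
```

Heuristic 1 prunes more aggressively (the product shrinks quickly), while heuristic 2 never underestimates freq(S_i), so it is the safer choice when the cut-off must not fire too early.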

(33) Another heuristic is to limit the size of P(S_i) to, e.g., 3 elements.

(34) The algorithm is modified to:

(35) Algorithm 2: bin_text_value_prune(v_i, O, benefit, benefit_ub, τ, freq_threshold)

Require: v_i: the value to be binned, consisting of elements (T_1, . . . , T_k)
Require: O: a total order on all terms
Require: benefit: a function to compute the benefit of a state
Require: benefit_ub: a function to compute a benefit upper bound of a state
Require: freq_threshold: a cut-off threshold for freq(S)

 1: B(v_i) ← Ø
 2: let PQ be a priority queue that contains search states ordered descendingly on benefit_ub
 3: push state S = (Ø, v_i) into PQ with score benefit_ub(v_i, S) = 1
 4: while PQ is not empty do
 5:   pop the top state S from PQ
 6:   if (benefit_ub(v_i, S) < τ) then
 7:     break
 8:   else if (freq(S) < freq_threshold) OR (benefit(v_i, S) ≥ τ) then
 9:     add S to B(v_i)
10:   else
11:     for each element T_i ∈ D(S) such that the index i is greater than the index of the last term in P(S) based on O do
12:       S′ ← (P(S) ∪ {T_i}, {T_{i+1}, . . . , T_k})
13:       push S′ into PQ with score benefit_ub(v_i, S′)
14:     end for
15:   end if
16: end while
17: return B(v_i)

(36) For example, in FIG. 1, assume that (ABC, D) and (ABD, Ø) satisfy benefit>=T. Assuming that elements A and B occur rarely in the data set, emit (AB, CD) instead of (ABC, D) and (ABD, Ø). Note that freq(S_i) depends only on the state and it does not depend on the element being binned. This ensures that the bin IDs are emitted in a consistent way among all element sets and hence no false negatives are introduced.

Approximate Binning

(37) Algorithms 1 and 2 are guaranteed to have zero false negatives (i.e., no missed matching pair). However, a user may be able to tolerate a small number of false negatives (e.g., <1% of true candidates) in return for faster computation of candidates. This can be achieved by multiplying benefit(v_i, S) and the score benefit_ub(v_i, S) by a constant c < 1.0 (e.g., 0.95). Alternatively, redefine mw(e) to be the 95th percentile of weights assigned to e across all values instead of the maximum weight. These two relaxations resulted in less than 1% false negatives when applied on various data sets (e.g., DBLP, Twitter, US spending).

Binning Numeric Values

(38) Given a threshold T, the numeric binner divides the attribute domain into two sets of bins, where the length of each bin is equal to 2T, the bins within each set are disjoint, and the bins of the two sets are interleaved with an overlap equal to T. Specifically, the first set of bins is { . . . , [0, 2T), [2T, 4T), [4T, 6T), . . . }. The second set of bins is { . . . , [T, 3T), [3T, 5T), [5T, 7T), . . . }. Each bin in the two sets is given a unique ID. For example, the bins in the first set are assigned IDs { . . . , 0, 2, 4, . . . }, and the bins in the second set are assigned IDs { . . . , 1, 3, 5, . . . }.

(39) Each value is mapped to exactly two bin IDs: one from each set.

(40) The following equations compute the two bin IDs for each numeric value V_i:

B_1(V_i, T) = 2·floor(V_i / (2T))
B_2(V_i, T) = 2·floor((V_i + T) / (2T)) + 1

FIG. 2 depicts an example of partitioning the space based on these equations.
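A direct transcription of the two equations (the function name is ours, and the sample threshold is illustrative):

```python
import math

def numeric_bins(v, t):
    """Map a numeric value to its two bin IDs per the equations above."""
    b1 = 2 * math.floor(v / (2 * t))             # even IDs: bins [2kT, (2k+2)T)
    b2 = 2 * math.floor((v + t) / (2 * t)) + 1   # odd IDs: bins shifted by T
    return b1, b2

# Two values at distance <= T always share at least one bin ID:
# 1.9 and 2.9 fall in different even bins but the same odd bin.
t = 1.0
print(numeric_bins(1.9, t))  # (0, 3)
print(numeric_bins(2.9, t))  # (2, 3)
```

Because the two bin sets are offset by T, a pair within distance T that straddles a boundary of one set necessarily falls inside a single bin of the other set.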

(41) Any two values with distance<=T will necessarily belong to the same bin.

Use in Improved Computing Systems

(42) The binning algorithms detailed above may be implemented in computing systems to enable improved and computationally faster data deduplication. A computing system may be any single or multiple processor machine, or multiple network connected machines, with data input and output capabilities. Input can be direct, such as through disk, keyboard, or mouse, or indirect, such as over a network through an application programming interface (API) or webpage driven interface. Output may similarly be direct, such as display on a connected screen, or indirect such as written to disk or database for later or remotely connected access. A computing system may also be a virtual computing environment operating on top of physical hardware, such as within cloud platform computing.

(43) The computing system receives as input data records with multiple attributes. One example is US census data which contains hundreds of millions of records each of which contains tens of fields. Another example is a publications database such as DBLP, which contains tens of millions of publications, each of which contains tens of fields. The data records can come from any source, such as text files, databases, or other data repositories, as long as they are resolved into individual records having multiple attributes.

(44) Data deduplication is removal of duplicate or equivalent records within the set of input data records. Deduplication software running on the computer system scans the input data records and generates different Bin IDs using statistics and pre-analysis such as inverse document frequency (IDF), which is the logarithm of the total number of records over the number of records containing a specific term. The Bin IDs cover a list of building blocks based on the data records. For example, building blocks may be, but are not limited to, terms, q-grams, or ranges, based on the type of content in the data records. Choosing which building blocks to use depends on the granularity of the similarity function that the user needs. For example, if the user is interested in word-level similarity, the building blocks are the words occurring in each value. If the user is interested in character-level similarity function, the building blocks are q-grams of the values. Data records may be input to the deduplication software by selecting or otherwise identifying a source, such as a file location or database, through a graphical or textual user interface, input by file, or input from another program through an API.

(45) The deduplication software also takes local similarity functions (L_1, . . . , L_k), local thresholds (T_1, . . . , T_k), and a global similarity aggregation function (A). For example, one may union two local candidate pair sets coming from (first name similarity>0.9) and (last name similarity>0.8) to obtain a set of global candidate pairs. These inputs can be typed or selected through a graphical user interface, input through one or more files read by the deduplication software, or input through an API. Each local similarity function takes an attribute and a local threshold as parameters.

(46) The deduplication software generates all candidate pairs, called Global Candidate Pairs. The list of Global Candidate Pairs is sent later for a final filtering step of deciding on duplicate pairs by running candidate pairs through classifiers. Such classifiers are (potentially computationally expensive) methods that determine with high accuracy whether a candidate pair is a true duplicate or not. The overall time consumed by the classifiers is tractable since the system only needs to classify Global Candidate Pairs, which are strictly less than all possible pairs of records.

(47) The deduplication software generates sets of Global Candidates by aggregating pairs of records called Local Candidate Pairs, which are candidate pairs according to a local similarity function (based on one attribute). For example, if the aggregation function A is the union operator, the software obtains local candidate pairs from L_1>T_1, . . . , L_k>T_k, and unions the resulting sets to obtain Global Candidate Pairs. In general, multiple similarity criteria can be combined using a Boolean formula in disjunctive normal form (DNF), where each atomic predicate represents a single similarity criterion. One example is (first name similarity>0.9 AND last name similarity>0.8) OR (age similarity>0.7 AND social security number similarity>0.99). DNF aggregation is evaluated by obtaining local candidate pairs, intersecting local sets within each clause, and unioning the sets resulting from all clauses.
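DNF aggregation reduces to plain set operations over the local candidate pair sets. A minimal sketch, where the sample pair sets and the function name are illustrative assumptions:

```python
from functools import reduce

def evaluate_dnf(dnf):
    """dnf: list of clauses; each clause is a list of local candidate-pair
    sets, one per atomic predicate. AND = intersection, OR = union."""
    clause_sets = [reduce(set.intersection, clause) for clause in dnf]
    return reduce(set.union, clause_sets)

# (first-name sim > 0.9 AND last-name sim > 0.8)
#   OR (age sim > 0.7 AND ssn sim > 0.99)
first = {(1, 2), (3, 4)}
last  = {(1, 2), (5, 6)}
age   = {(3, 4), (7, 8)}
ssn   = {(3, 4)}
print(evaluate_dnf([[first, last], [age, ssn]]))  # {(1, 2), (3, 4)}
```

Record pair (1, 2) survives via the first clause and (3, 4) via the second; all other pairs are pruned before the expensive classification step.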

(48) In order to generate local candidate pairs, the software bins the values using Algorithm 1 or 2. Then, the software performs candidate generation. Two values v_i and v_j are candidates if they are binned to two bins S_i and S_j respectively, such that P(S_i) ⊆ P(S_j) ∪ D(S_j) and P(S_j) ⊆ P(S_i) ∪ D(S_i). This can be done by expanding each entry (P, D) into multiple bin IDs by concatenating P with each subset of D. The resulting bin IDs can be indexed (e.g., using a hash index). The index can then be used to obtain all pairs of values having the same bin ID. Alternatively, the software can create a Trie index where each (state, value) pair is stored with key P(state) ∪ D(state). Then, for each (state, value) pair, the software probes the index looking for keys that are supersets of P(state), which corresponds to the condition P(S_i) ⊆ P(S_j) ∪ D(S_j). Then, the software verifies the second condition P(S_j) ⊆ P(S_i) ∪ D(S_i) for the obtained results. Local candidate pairs consist of all record pairs that satisfy both conditions. Since a pair might be generated multiple times as it appears in various bins, duplicate pairs are removed from the local candidate pair set.
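The hash-index variant above can be sketched as follows. This is an illustrative sketch only, practical when D is small (e.g., after the size limit on P and the frequency cut-off); the function names and the two-record example are our assumptions:

```python
from itertools import chain, combinations
from collections import defaultdict

def expand_bin_ids(P, D):
    """All bin IDs for state (P, D): P concatenated with each subset of D."""
    subsets = chain.from_iterable(combinations(sorted(D), r)
                                  for r in range(len(D) + 1))
    return {tuple(sorted(P)) + s for s in subsets}

def local_candidates(binned):
    """binned: {record_id: list of (P, D) states}. Records sharing any
    expanded bin ID become local candidate pairs (deduplicated)."""
    index = defaultdict(set)           # hash index: bin ID -> record IDs
    for rid, states in binned.items():
        for P, D in states:
            for bid in expand_bin_ids(P, D):
                index[bid].add(rid)
    pairs = set()
    for rids in index.values():
        for a in rids:                 # all pairs within each bin
            for b in rids:
                if a < b:
                    pairs.add((a, b))  # set membership removes duplicates
    return pairs

binned = {1: [({"A", "B"}, {"C"})], 2: [({"A", "B", "C"}, set())]}
print(local_candidates(binned))  # {(1, 2)}
```

Sharing an expanded ID K means P(S_i) ⊆ K ⊆ P(S_j) ∪ D(S_j) and vice versa, so the index lookup enforces both subset conditions at once.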

Experiments

(49) The deduplication system was tested using 400,000 records from the Digital Bibliography and Library Project (DBLP) dataset. The algorithms were performed by the computer processors on data records in computer memory. The following approaches were tested:

AP: the algorithm described in "Scaling up all pairs similarity search."
L2AP: a variant of the algorithm described in "L2AP: Fast Cosine Similarity Search With Prefix L-2 Norm Bounds."
BF1: binning filter using Algorithm 1 with a frequency threshold of 0.01 (1%). In this method, the frequency of a particular combination of terms (a state) was approximated by the frequency of its least frequent term.
BF2: binning filter using Algorithm 2 with a frequency threshold of 0.000078125. In this method, the frequency of a particular combination of terms was approximated by multiplying the individual term frequencies together, assuming independence.

(50) FIG. 3 shows experimental results indicating the improved performance using BF1 and BF2. All methods were implemented as C++ programs. Experiments were executed on a machine with the following specs: Architecture: x86_64; CPU op-mode(s): 32-bit, 64-bit; CPU(s): 12; Thread(s) per core: 1; Core(s) per socket: 6; Socket(s): 2; NUMA node(s): 2; Vendor ID: GenuineIntel; CPU family: 6; Model: 63; Stepping: 2; CPU MHz: 1200.187; BogoMIPS: 3201.22; Virtualization: VT-x; L1d cache: 32K; L1i cache: 32K; L2 cache: 256K; L3 cache: 15360K; NUMA node0 CPU(s): 0-5; NUMA node1 CPU(s): 6-11.

OTHER EMBODIMENTS

(51) Given example matches, an optimal candidate generation criterion (e.g., a DNF of local candidate generation functions) is determined such that the maximum number of false negatives is less than a threshold N, and the number of generated candidates is minimal. This is essentially a supervised machine learning approach to deriving the candidate generation criterion. Alternatively, a user can specify a DNF directly if no matching examples are available and/or maximum predictability of candidate generation results is desired.

(52) Global Candidate Pairs may be generated holistically without generating the local candidate pairs according to each local similarity function. Multiple local Bin IDs for each record may be generated according to the local similarity functions. The local Bin IDs are then used to calculate Global Bin IDs (e.g., by concatenating the local Bin IDs). The Global Candidate Pairs may be generated from each Global Bin (e.g., the Cartesian Product of all records in a Global Bin).

(53) It is to be understood that the above description is intended to be illustrative, and not restrictive. Many other embodiments will be apparent to those of skill in the art upon reviewing the above description. The scope of invention should, therefore, be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.