Satisfiability filter and query tool and method of building a satisfiability filter and query tool
09753484 · 2017-09-05
Inventors
- Victor W. Marek (Lexington, KY)
- Andrew J. Mayer (San Diego, CA, US)
- Katrina J. Ray (Odenton, MD, US)
- Sean A. Weaver (Baltimore, MD, US)
Cpc classification
International classification
Abstract
A SAT filter builder and method for building a SAT filter is provided. Hash functions are utilized to map elements of a domain subset to a set of literals which are in turn used to create a set of equations. These equations are conjoined to provide a satisfiability instance. The satisfiability instance is provided to an equation solver and the solution is provided to an array to build the filter. A query tool is also provided which allows the filter built by the SAT filter to test an element for set membership. The query tool utilizes the same hash functions used by the SAT filter builder to map the element to be tested to a set of literals. These literals are used to create an equation. The solution identified by the SAT builder tool is then provided to the equation created by the query tool and set membership is determined by determining whether the equation provided by the query tool is satisfied by the solution provided by the SAT filter builder.
Claims
1. A satisfiability filter builder implemented by a computer including memory for storing information comprising: a domain subset identified for identifying element of a domain subset; a hash function selector for selecting a set of hash functions, wherein variables produced by said hash functions range from −n to n; a mapper in communication with said domain subset identifier and said hash function selector, wherein said mapper maps each of said identified elements of said domain subset based upon said selected set of hash functions to a set of literals, wherein the size of said set of literal is k; an equation creator in communication with sad mapper, said equation creator combines each said set of literals to create a plurality of equations wherein the number of equations created is identified as m and wherein each equation of said plurality of equations relates to an identified element of said domain subset; and a conjoiner in communication with said equation creator for conjoining said plurality of equations to provide a satisfiability instance, wherein said satisfiability instance is stored in 32*m*k bits of memory.
2. The satisfiability filter builder of claim 1 further comprising: an array builder in communication with an equation solver, wherein said array builder receives a solution to said satisfiability instance from said equation solver and creates an array based upon said solution and wherein said solution is stored in the array, said array having a length of n bits.
3. The satisfiability filter builder of claim 2, further comprising: a transposer in communication with said array builder for transposing the array provided by said array builder.
4. The satisfiability filter builder of claim 1, further comprising: a Boolean operation selector in communication with said equation creator and for selecting a Boolean operation; and wherein said plurality of equations crated by said equation creator are based upon said selected Boolean operation.
5. A method of building a satisfiability filter comprising the steps of: identifying a subset of a domain; identifying the elements in the subset; selecting a set of hash functions; mapping each element of the subset to a set of literals, utilizing the selected set of hash functions; combining each set of literals using a Boolean function to create a set of equations; conjoining the equations of the set of equations to provide a satisfiability instance; creating an array based on a solution to the satisfiability instances; identifying an element of a domain to be tested for set membership; selecting an element for which set membership is to be tested; utilizing the selected set of hash functions to map the selected element to a set of literals; combining the literals to create a test equation; identifying a solution associated with the satisfiability filter; determining whether said test equation is satisfied by the solution associated with the satisfiability filter; and providing an indication of set membership based upon the determination of whether the test equation is satisfied.
6. The method of claim 5, wherein said Boolean function is disjunctive OR and said step of conjoining the equations is performed using the conjunctive normal form (CNF).
7. The method of claim 5, wherein said Boolean function is exclusive OR and said step of conjoining the equations is performed using a sparse matrix format.
8. The method of claim 5, wherein said Boolean function is NOT ALL EQUAL and said step of conjoining the equations is performed using the conjunctive normal form (CNF).
9. The method of claim 5, wherein said step of creating an array includes creating an array based on a plurality of solutions to said satisfiability instance.
10. The method of claim 9, wherein said multiple solutions are disparate.
11. The method of claim 10, further including the steps of: calculating the observed false positive rate associated with each solution; and adjusting the equations if the observed false positive rate does not meet an identified false positive rate threshold.
12. The method of claim 9, further including the step of adjusting the false positive rate provided by the filter by adjusting the number of solutions on which the array is based.
13. The method of claim 9, further including the step of transposing the array.
14. The method of claim 5, further comprising the step of: selecting an additional set of hash functions; mapping each element of the subset to an additional set of literals utilizing the additional set of hash functions; combining each additional set of literals using the Boolean function to create an additional set of equations; conjoining the equations of the additional set of equations to provide an additional satisfiability instance; and creating an additional array based on a solution to the additional satisfiability instance.
15. The method of claim 14, further comprising the step of: adjusting the false positive rate of the satisfiability filter by adjusting the number of set of hash functions utilized to provide satisfiability instances.
16. The method of claim 5, wherein said step of identifying a solution is repeated to identify multiple solutions associated with the satisfiability filter and said step of determining whether said equation is satisfied is performed for each solution identified.
17. The method of claim 16, wherein said step of identifying a set of hash functions is repeated to identify multiple set of hash functions; said steps of utilizing, combining, identifying a solution and determining are repeated for each set of hash functions; and wherein said step of providing an indication of set membership is based upon the determination of whether the equation is satisfied for each set of hash functions.
18. A satisfiability filter query tool implemented by a computer for querying a satisfiability filter to determine membership of an element in a set, the computer including memory for storing information and the tool comprising: a hash function identifier for identifying the hash functions used to build the satisfiability filter wherein variables produced by said hash functions range from −n to n; a mapper in communication with said hash function identifier for mapping the element to a set of literals utilizing the identified hash functions; an equation creator in communication with said mapper wherein said equation creator creates an equation based upon said literals; a solution identifier, for receiving the satisfiability filter and identifying a solution to the satisfiability filter stored in memory as an array, said array having a length of n bits; a membership checker in communication with said solution identifier and said equation creator; wherein said membership checker applies said solution to said equation created by said equation creator and determines whether said solution satisfies said equation; and an indicator for providing an indication to user indicating whether said element is a member of the set based upon said determination of said membership checker.
19. The tool of claim 18, further comprising: a Boolean operation identifier in communication with said equation creator, wherein said equation creator creates said equations based upon said Boolean operation identified by said Boolean operation identifier.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The organization and manner of the structure and operation of the invention, together with object and advantages thereof, may best be understood by reference to the following description, taken in connection with the accompanying drawings, wherein like reference numerals identify like elements in which:
(2)
(3)
(4)
(5)
(6)
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
(7) While the invention may be susceptible to embodiment in different forms, there is shown in the drawings, and herein will be described in detail, a specific embodiment with the understanding that the present disclosure is to be considered an exemplification of the principles of the invention, and is not intended to limit the invention to that as illustrated and described herein.
(8) Building the Sat Filter
(9) Phase 1—Creating a Conjunction of Equations χ.sub.Y
(10) The method of building a SAT filter 100, F.sub.Y, is illustrated in
(11) At step 108, the Boolean function B which is to be used for creation of the SAT filter, F.sub.Y, is identified.
(12) At step 110, the integers n and k are selected. The integer n represents the number of variables provided by each SAT instance and thus the amount of memory available to store the SAT filter. The integer k represents the number of literals associated with each equation. Selection of the integers n and k is discussed below in further detail.
(13) At step 112, at least one set of k hash functions, {h.sub.1, h.sub.2, . . . h.sub.k} to be associated with each element, y.sub.i∈Y, is selected.
(14) At step 114, each hash function is used to map each element, y.sub.i∈Y, to a set of literals {l.sub.1, l.sub.2, . . . l.sub.k}. The hash functions are selected to map each element of Y uniformly at random to the set of literals where k represents the number of hash functions associated with each element and therefore the resulting number of literals associated with each element. For example, the set of hash functions {h.sub.1, h.sub.2, . . . h.sub.k} are associated with element y.sub.1 to provide the set of literals {l.sub.1,1, l.sub.1,2, l.sub.1,k} and the same set of hash functions {h.sub.1, h.sub.2, . . . h.sub.k} are associated with element y.sub.2 to provide the set of literals {l.sub.2,1, l.sub.2,2, . . . l.sub.2,k}. Each literal l is provided by a Boolean variable or its negation (NOT). A pair of literals is said to be complementary if both are the same variable but have different signs, i.e., x.sub.i and
(15)
The literals are represented by integers ranging from −n to n, exclusive of 0.
(16) At step 116, for each y.sub.i∈Y, the set of literals is provided to the Boolean function selected at step 108 to create an equation C.sub.i.
C.sub.i=B(l.sub.i,1, . . . , l.sub.i,k.sub.
(17) TABLE-US-00002 Algorithm 1 ELEMENTTOSETOFLITERALS(y ε D , n, k, h.sub.1, . . . , h.sub.k) k is the number of literals per clause h.sub.1, . . . h.sub.k are functions that map elements of D to [−n, n]\{0} 1: repeat 2: C: { } empty set of literals 3: for i := 1 to k nonce i : = 0 do 4: l := h.sub.i (y , nonce) 5: C := C ∪ {l} 6: end for 7: nonce := nonce + 1 8: until all literals of C are distinct and no pair is complementary 9: return C
(18) Thus, the hash functions map each element of Y uniformly at random into a set of literals that when provided to the Boolean function constitute a random equation of width k. For example, if the identified Boolean function is logical disjunction, the literals are provided to the disjunction function to provide the equation C.sub.i as follows:
C.sub.i=l.sub.i,1vl.sub.i,2 . . . vl.sub.i,k.sub.
for each C.sub.i, 1≦i≦m, where the symbol V represents logical disjunction (OR). When applying the logical disjunction function to the literals, for example, the set of literals relating to element, y.sub.1, is provided to the function to provide an equation C.sub.1 and the set of literals relating to element, y.sub.2, is provided to the function to provide an equation C.sub.2. In this manner, a set of equations C.sub.Y including the equations, C.sub.1, C.sub.2, . . . C.sub.m is associated with the set of elements y.sub.1, y.sub.2 . . . y.sub.m.
(19) Another Boolean function which may be selected at step 108 is EXCLUSIVE OR (⊕). When providing the literals to the EXCLUSIVE OR function, the resulting equation from step 116, C.sub.i, may be expressed as follows:
C.sub.i=l.sub.i,1⊕l.sub.i,2 . . . ⊕l.sub.i,k.sub.
(20) Yet another Boolean function which may be selected at step 108 is NOT ALL EQUAL (NAE). When providing literals to the NAE function, the equation, C.sub.i, provided at step 118 may be expressed as follows:
C.sub.i=(l.sub.i,1vl.sub.i,2 . . . vl.sub.i,k.sub.
(21) Once the set of equations, C.sub.Y, has been determined, at step 118 the equations of the set C.sub.Y are conjoined utilizing the Conjunctive Normal Form (CNF), or an equivalent formula, to provide a random k-SAT instance, χ.sub.Y.
χ.sub.Y=C.sub.1^ . . . ^C.sub.m,
where the symbol ^ represents logical conjunction (AND) and 1≦i≦m. A random k-SAT instance is a conjunction of equations drawn uniformly, independently, and with replacement from the set of all width k equations. Because |Y|=m, each SAT instance provides m number of equations.
(22) As described above, the Bloom filter is also created utilizing hash functions, however, construction of the Bloom filter is significantly different than construction of the SAT filter of the present invention. The Bloom filter utilizes hash functions to map each element of Y to a series of integers. Each integer is associated with a position in an array. If an element of Y is mapped to a particular position, the bit at that position is changed from 0 to 1 to designate set membership. In the SAT filter of the present invention, however, hash functions are utilized to map each element to a set of literals. The literals are then used to create a set of equations associated with each element of Y. These equations are then conjoined to create a SAT instance.
(23) Phase 2—Finding a Solution for χ.sub.Y
(24) Once the SAT filter has been built, at step 122 a satisfying assignment (also called a “solution”) for χ.sub.Y is found. An assignment v is a function from the set of variables Vars (|Vars|=n) into the set Bool, i.e. {0, 1}. An assignment v satisfies a variable x.sub.i if v(x.sub.i)=1 and v satisfies
(25) For example, if the selected Boolean function is logical disjunction:
C.sub.i=l.sub.i,1v . . . vl.sub.i,k.sub.
where the symbol v represents logical disjunction (OR), then an assignment v satisfies C.sub.i if for some j, 1≦j≦k.sub.i, v(l.sub.i,j)=1.
(26) A solution to the conjunction of equations may be found using an equation solver. The particular equation solver used is irrelevant, so long as the equation solver is capable of solving moderate-to-large satisfiable random k-SAT instances. For example, the equation solver may be a SAT solver such as, for example, SBSAT, Dimetheus, and WalkSAT.
(27) Algorithm 2 is an example of an algorithm which may be used to build a SAT filter including a satisfiable k-SAT instance, χ.sub.Y.
(28) TABLE-US-00003 Algorithm 2 BUILDSATFILTER(Y ⊂ D, n, k, B, h.sub.1, . . . , h.sub.k) n is the amount of memory available to store the SAT filter F.sub.Y, namely |F.sub.Y| k is the number of literals per equation B is a Boolean function h.sub.1, . . . , h.sub.k are functions that map elements of D to [−n, n ]\{0} 1: m := |Y| 2: C.sub.Y := { }, the empty formula 3: for each element y ∈ Y do 4: C.sub.y := B(ELEmENTTOSETOFLITERALS(y, n, k, h.sub.1, . . ., h.sub.k)) 5: C.sub.Y := Cy ∪ {Cy} 6: end for 7: X.sub.Y = CY 8: if the random k-SAT instance X.sub.Y is unsatisfiable then 9: return maybe 10: else 11: Let F.sub.Y be a solution to X.sub.y 12: return F.sub.Y 13: end if
(29) The false positive rate of the filter can be improved by utilizing the equation solver to find multiple (S.sub.1) solutions to the SAT instance. At step 122 it is determined whether multiple solutions to the SAT instance are desired. If multiple solutions to the SAT instance are desired, the process returns to step 120 and the equation solver is again used to determine a new solution. Once the desired number of solutions to the SAT instance have been found, the process proceeds to step 124.
(30) The false positive rate of the filter may be improved by utilizing multiple sets of hash functions to generate multiple (s.sub.2) SAT instances to be satisfied. At step 124 a determination is made as to whether an additional set of k hash functions is to be utilized. If an additional set of k hash functions is to be utilized the process returns to step 112 and an additional set of k hash functions is selected. When s.sub.2 sets of hash functions are used, s.sub.2 set of equations are generated, resulting in s.sub.2 SAT instances. The equation solver is then used to find s.sub.1 solutions to each of the s.sub.2 SAT instances. Thus, the total number of solutions, s, used to define the filter can be found with:
s=s.sub.1*s.sub.2
(31) When querying the SAT filter to test for set membership, as described below, the equation C.sub.x must be satisfied by each of the s.sub.i solutions to the corresponding SAT instance in order to provide a result of maybe. Therefore, each of the solutions is stored in long term memory in order to perform the query. Thus, the amount of long term storage required to store the solutions will be n*s. The benefit of using multiple solutions to improve the false positive rate is that query time is not negatively impacted. Because only s.sub.2 equations are generated per query, each equation can be checked against all s.sub.1 solutions in parallel using bit-packing and word-level operations. To accomplish parallel checking, however, the solutions need to be stored such that all s.sub.i solution bits corresponding to a variable are stored together.
(32) When it is determined at step 124 that the desired number of sets of k hash functions have been selected and utilized, the process proceeds to step 126.
(33) Phase 3—Creating the Filter F.sub.Y
(34) At step 126, a filter based upon the s solution(s) is created. Specifically, the s.sub.1 solution(s) to each of the s.sub.2 k-SAT instance(s) χ.sub.Y discovered by the equation solver are provided as an array of bits, F.sub.Y. If multiple k-SAT instances are provided, an array will be associated with each k-SAT instance. The array(s), F.sub.Y, represents the SAT filter of the present invention.
(35) Next, at step 128, the array(s) of bits associated with F.sub.Y is stored. The array of bits may be stored such that the value of x.sub.1 in solution 1 is stored at index 1, the value of x.sub.2 in solution 1 is stored at index 2, and so on. Alternatively, the array of bits may be transposed, i.e., the first s.sub.1 bits of the array correspond to the values of variable x.sub.1 (either 1 or 0) in each of the s.sub.1 solutions, and the next s.sub.1 bits correspond to the values of variable x.sub.2, and so on. The advantage of providing a transposed array is that the bit-packed values may be queried in parallel using word level operations thereby significantly decreasing query time.
(36) It is noted that providing multiple solutions to a single SAT instance may not decrease the false positive rate if the solutions are not independent, i.e. the improvement to the false positive rate depends on how different the solutions are from one another. Thus, if multiple solutions to the SAT instance are found in order to improve the false positive rate, the specific equation solver used must be capable of finding disparate solutions to the same SAT instance. If at step 122 it is determined that multiple solutions to the SAT instance are to be found, the process may include steps for determining whether the solutions are disparate. These optional steps 130, 132, 134 are illustrated in
(37)
(38) The domain subset identifier 302, identifies the subset Y of a domain D and provides this subset Y on an output of the identifier 302.
(39) The hash function selector 304 includes an input and an output. The input of the selector 304 receives information from a user regarding the hash functions to be applied to the subset Y of the domain D. The hash function selector 304 further includes an output on which a set of hash functions, {h.sub.1, h.sub.2, . . . h.sub.k}, is provided.
(40) The mapper 306 includes a first input, a second input and an output. The first input of the mapper 306 is in communication with the output of the domain subset identifier 302 for receiving the domain subset Y from the domain subset identifier 302. The second input is in communication with the hash function selector 304 and receives the set of hash functions provided by the hash function selector 304. The mapper 306 maps elements of the subset Y of the domain D to a set of literals utilizing the identified set of hash functions. The literals are provided at the output of the mapper 306.
(41) The Boolean operation selector 310 includes an input from which a user may define the Boolean operation to be implemented by the SAT filter builder 300 and an output on which the selected Boolean operation is output.
(42) The equation creator 308 includes a first input, a second input and an output. The first input is in communication with the output of the mapper 306 and receives the literals provided by the mapper 306. The second input is in communication with the Boolean operation selector 310 and receives identification of the Boolean operation to be applied to the literals. The equation creator 308 combines the literals utilizing the selected Boolean operation to provide a set of equations, C.sub.Y at the output of the equation creator 308.
(43) The conjoiner 312 includes an input in communication with the output of the equation creator 308. The conjoiner 312 receives the set of equations C.sub.Y and conjoins the equations of the set to provide a SAT instance χ.sub.Y at the output of the conjoiner 312.
(44) The SAT filter builder 300 optionally includes a false positive rate calculator 314. Solutions to the SAT instance χ.sub.Y may be provided to the false positive rate calculator 314 for the purpose of determining whether multiple solutions provided by the equation solver 320 are sufficiently disparate as described above.
(45) The SAT filter builder 300 is in communication with the equation solver 320 via the output of the conjoiner 312. The equation solver 320 includes an input in communication with the output of the conjoiner 312. The equation solver 320 receives the SAT instance χ.sub.Y and finds a solution to the SAT instance χ.sub.Y. The solution to the SAT instance is provided to the SAT filter builder 300 via the array builder 316.
(46) The array builder 316 includes an input and an output. The input of the array builder 316 is in communication with the output of the equation solver 320 and receives the solution from the equation solver 320. The array builder 316 builds an array based upon the solution provided by the equation solver and provides an array on the output of the array builder 316 representing the filter F.
(47) The optional transposer 318 includes an input in communication with the output of the array builder 316 and an output. The transposer receives the array from the array builder 316 and transposes the array. The transposed array is provided on the output of the transposer 318 as the filter, F.sub.Y in the transposed form [F.sub.Y].sup.T. Transposition of the array provided by the array builder 316 is not required, therefore, the transposer 318 has been identified as optional.
(48) It is noted that the set of equations C.sub.Y and the SAT instance(s) χ.sub.Y, can be discarded once a solution has been found. Hence, the amount of long term storage required to store a solution is n*s.sub.1*s.sub.2 bits. In general, this amount of memory is small compared to the memory required to fully specify the original data set.
(49) An asymptotic analysis of the SAT filter may be performed. During this analysis we assume that the SAT filter is using disjunction as its Boolean function, and note that this analysis can be successfully performed using other Boolean functions.
(50) For a given element x∈D, to pass a SAT filter comprised of the s.sub.1 solutions to s.sub.2 random-k SAT instances, where the Boolean function B is disjunction, means that at least one of the k literals in each of x's corresponding s.sub.2 equations is set to 1. For each literal in the equation there is a ½ chance that a random solution provides that the filter is set to 1. So the false positive rate, i.e. the probability of passing x is
p=(1−2.sup.−k).sup.s.
(51) A SAT filter uses n*s bits of long term memory. Therefore, the efficiency of such a SAT filter is
(52)
(53) As noted above, the selection of the variable n, and k along with the selection of the Boolean function to be applied to the literals will impact the performance of the filter. In order for the filter to function appropriately, the range of integers, n, used to represent the size of the filter and the number of literals associated with each hash function, k, must be chosen appropriately based upon the size m of the domain Y. If n is too small or k is too large, all the bits associated with the filter F.sub.Y could essentially become set to 1. As a result, the filter will rarely reject when queried (i.e. the filter will have a high false positive rate), rendering F.sub.Y useless.
(54) As previously described in connection with SAT instances, the ratio m/n (i.e. α.sub.χ.sub.
m/n<2.sup.k ln 2−O(k),
then as k, n, m.fwdarw.∞, a random k-SAT instance with n variables and m equations is almost surely satisfiable, and unsatisfiable otherwise. That is, as k gets large, a large random k-SAT instance with m/n<2.sup.k ln 2−O(k) is satisfiable. Applying this to the present invention, in order to maximize the efficiency of a SAT filter, therefore, m/n should be as large as possible while keeping the random k-SAT instance satisfiable. Thus, a lower boundary is defined, i.e., if m/n=2.sup.k ln 2−k, it is probable that the equation is satisfiable. Utilizing this calculation for m/n in the above efficiency formula provides the following estimate of the efficiency of a SAT filter:
(55)
A simple calculus argument shows that the expression on the right tends to 1 as k tends to infinity. Therefore, such SAT filters can achieve the information-theoretic limit.
(56) Since the first derivative of this SAT filter efficiency function is always positive, SAT filter efficiency is a monotonically increasing function. Thus, efficiency increases as k increases. Theoretically achievable SAT filter efficiency for various values of k are provided in the table below. As also illustrated in Table 2, as k increases, the rate at which the efficiency increases is reduced. As a result, there is not a specific value of k that will maximize SAT filter efficiency, but as k grows there is a diminishing return on the efficiency provided. Thus, the value of k need only be relatively small (e.g. five or six) to achieve near optimal efficiency.
(57) TABLE-US-00004 TABLE 2 Theoretically achievable disjunctive SAT filter efficiency for various k. k 3 4 5 6 7 α.sub.k 4.26 9.93 21.11 43.37 87.79 Ε 0.82 0.92 0.96 0.98 0.99
(58) Parameters of the SAT filter, F.sub.y, may be defined to meet specific performance requirements. The size of the data set m=|Y| is typically assumed to be given. Thus, the remaining SAT filter parameters will be defined based upon the filter requirements. For example, in some scenarios a user may be presented with limited memory for storing the filter. In other scenarios, the user may be presented with a false positive rate threshold which must be accomplished.
(59) In some instances, filter requirements may dictate a value for k, i.e., the number of hash functions to be applied to each element to provide a set of literals. As discussed below, keeping the value of k low is beneficial because as k increases, the amount of time required to query the SAT filter increases. As stated earlier, a small k (say five or six) is sufficient to achieve near perfect efficiency.
(60) With a value for k selected and the value for m known, a value for n (i.e., the range of integers associated with each literal) is determined. In determining a value for n, the user must balance the desire to reduce n (thereby reducing the size of the long term memory needed to store the filter), with the need to provide a high probability of the SAT instance being satisfiable (i.e., as n increases, the probability of satisfying the SAT instance increases but as n is increased more long term memory is required to store the filter). As α.sub.χ.sub.
(61)
(62) In addition to determining values for n, both s and p should be selected. These parameters bring about a trade-off between the amount of long term storage (n*s) and the false positive rate. The lower the false positive rate, the higher the amount of long term storage. The lower the amount of long term storage, the higher the false positive rate.
(63) If the amount of long term storage is known, the false positive rate, p, can be determined as follows,
p=(1−2.sup.−k).sup.s.
If, instead, the false positive rate is known, the amount of long term storage needed can be determined by first determining the total number of solutions, s, needed as follows,
(64)
(65) The amount of one-time memory needed (in bits) to build all SAT instances simultaneously for a given filter is (assuming 64-bit integers are used to represent literals)
64 kms.
If enough memory is available to hold all the SAT instances, they can be solved in parallel, otherwise they can be solved sequentially. Although the previous equation shows the memory necessary to store the equations for all SAT instances, there is typically additional memory needed for other SAT solver data structures (such as an occurrence list or incidence graph) that can cause this number to grow by an order of magnitude while a SAT solver is at work.
(66) Note that, like a Bloom filter, the every-time work for a SAT filter (i.e., the query work) can be performed rather quickly. Unlike a Bloom filter, however, SAT filters can be queried quickly while achieving high efficiency. Thus, although building a SAT filter requires more one-time work than a Bloom filter, this is typically not critical as many common applications spend far more time querying than building.
(67) It is noted that although utilizing additional (s.sub.2) SAT instances improves the false positive rate, query speed is somewhat decreased because k*s.sub.2 hashes are computed. It is further noted, however, that all but one of the hash functions may be simple. For example, a set of hashes may include h.sub.i(x)=h.sub.1(x)+ih.sub.2(x) for 3≦i≦k.
(68) Examples of filters built utilizing the present invention are detailed in Table 3. Specifically, four dictionaries were built consisting of 2.sup.14, 2.sup.15, 2.sup.16, and 2.sup.17 random strings. To ensure that random k-SAT instances were generated, the strings were transformed into disjunctive clauses using the MurmurHash3 hash function combined with a technique described by Adam Kirsch and Michael Mitzenmacher in ‘Less Hashing, Same Performance: Building a Better Bloom Filter’. Specifically, the set of hashes h.sub.i(x)=h.sub.1(x)+ih.sub.2(x) for 3≦i≦k were used without increasing the false positive rate of the filter. Next, utilizing the present invention, disjunctive SAT filters were built for k=3 . . . 6 with p≈¼ and α.sub.χ.sub.
(69) As noted earlier, the Bloom filter achieves an efficiency of approximately 0.693. The results illustrated in Table 3 demonstrate that SAT filters can be built with an efficiency near the information-theoretic limit (1.0). Although SAT filter
(70) TABLE-US-00005 TABLE 3 Seconds taken to build SAT filters for p ≈ ¼ ξ 0.75 0.77 0.79 0.81 0.83 0.85 0.87 0.89 0.91 0.93 0.95 0.97 k = 3, s.sub.2 = 11 .fwdarw. p ≈ 23.02% m = 2.sup.14 1 1 2 137 — — — — — — — — m = 2.sup.15 1 3 15 101 — — — — — — — — m = 2.sup.16 2 6 67 101 — — — — — — — — m = 2.sup.17 7 17 116 120 — — — — — — — — k = 4, s.sub.2 = 22 .fwdarw. p ≈ 24.18% m = 2.sup.14 1 1 1 1 1 2 6 39 — — — — m = 2.sup.15 1 1 1 2 2 9 19 62 — — — — m = 2.sup.16 1 2 2 4 10 26 36 71 — — — — m = 2.sup.17 4 5 7 11 25 62 74 144 — — — — k = 5, s.sub.2 = 44 .fwdarw. p ≈ 24.74% m = 2.sup.14 1 1 1 1 4 33 183 285 1436 — — — m = 2.sup.15 1 1 2 4 21 96 97 117 349 — — — m = 2.sup.16 3 3 5 8 71 211 219 222 548 — — — m = 2.sup.17 7 9 14 23 471 484 499 517 626 — — — k = 6, s.sub.2 = 89 .fwdarw. p ≈ 24.62% m = 2.sup.14 1 2 6 29 54 68 115 235 — — — — m = 2.sup.15 3 11 105 105 110 123 167 389 — — — — m = 2.sup.16 7 79 250 251 258 268 324 671 — — — — m = 2.sup.17 21 555 567 569 579 604 640 1283 — — — —
build times appear large (as compared to Bloom filters), achieving higher efficiency (less long term storage) at the expense of more one-time work is often acceptable.
(71) The SAT filters built were evaluated to determine whether the desired false positive rate was achieved. Specifically, each SAT filter was queried with 2.sup.27 random elements and the realized false positive rate was compared with the desired false positive rate. This evaluation illustrated that the SAT filters achieved the desired false positive rate.
(72) Table 4 provides a comparison between the sizes of the smallest SAT filters built versus optimal Bloom filters built using the same data set and false positive rate.
(73) Table 4 provides a comparison between pairs of Bloom filters and SAT filters. Each pair was built with optimal parameters given the same dictionaries and false positive rate. As illustrated in Table 3, a SAT filter with a false positive rate of p≈¼ and k=5 needs s=44 total solutions to achieve an efficiency of 0.91 utilizing
(74) TABLE-US-00006 TABLE 4 Bloom filter size versus SAT filter size (in bits). m = 2.sup.14 m = 2.sup.15 m = 2.sup.16 m = 2.sup.17 Bloom SAT Bloom SAT Bloom SAT Bloom SAT k = 3, s = 11 .fwdarw. p ≈ 23.02% 50090 42856 100179 85734 200356 171446 400715 342914 k = 4, s = 22 .fwdarw. p ≈ 24.18% 48419 37708 96837 75416 193674 150832 387348 301664 k = 5, s = 44 .fwdarw. p ≈ 24.74% 47638 36256 95275 72556 190550 145112 381099 290268 k = 6, s = 89 .fwdarw. p ≈ 24.62% 47797 37202 95593 74404 191186 148897 382371 297794 The s and k parameters are only applicable to the SAT filters.
current equation solvers. If m=2.sup.16, the SAT filter will use n=145112 bits of long term storage. In comparison, an equivalent optimal Bloom filter requires n=190550 bits of longer term storage. Thus, an equivalent SAT filter with these parameters achieves a 23 percent reduction in size relative to the Bloom filter.
(75) Filters requiring a small amount of long term storage, i.e. those having a small value for n will have a value for α.sub.χ closer to this phase transition than those with a larger n. These filters, therefore, provide greater efficiency.
(76) Querying the SAT Filter
(77) Once the SAT filter, F.sub.Y, has been created, it can be queried to determine if element x is contained within the set Y. Each time an element is tested for membership in the data set, the filter must be queried. A method 200 of querying the SAT Filter is illustrated in
(78) The method 200 begins at step 202 by identifying the domain D. Next at step 204, element x, of the domain D is identified. Next at step 206, the filter F.sub.Y defined at step 126 in
(79) Next, at step 210, the sets of hash functions associated with the identified SAT instance χ.sub.Y are identified. At step 212, the identified set of hash functions are utilized to map the element x to a set of literals. Thus, the same hash functions utilized to map the elements, y, to a set of literals in step 114 of the method for building a filter illustrated in
(80) At step 214, the literals are combined to create an equation C.sub.x. The Boolean operation utilized to create the equation C.sub.x is the same Boolean operation utilized to create the equation C.sub.y at step 118 of the method for building a filter illustrated in
(81) Next, at step 216, s.sub.1 solutions associated with the SAT instance, χ.sub.Y, are identified. The solutions identified are based upon the filter defined at step 126 of the method for building a filter illustrated in
(82) At step 218, it is determined whether the equation C.sub.x is satisfied by the identified solutions. If at step 218 it is determined that the equation C.sub.x is not satisfied by the identified solutions, at step 220 the filter provides an indication of no (i.e. x is not a member of the set Y). Otherwise, if at step 218 it is determined that the equation C.sub.x is satisfied by the identified solutions, the method proceeds to step 226.
(83) At step 226, it is determined whether there are additional solutions to be tested. If at step 226 it is determined that there are additional SAT instances to be tested, the process returns to step 212 where the set of hash functions associated with the next identified solutions is identified. If at step 226 it is determined that there are no additional solutions to be tested, the process proceeds to step 228 and the indication of maybe is returned. As noted above, the indication of maybe indicates that x may be a member of the set Y.
(84) As noted above the array(s) defining the filter may be transposed. In the event the array(s) has not been transposed, each of the s.sub.1 solutions will need to be tested against the equation C.sub.x to determine if the equation C.sub.x is satisfied by the identified solution.
(85) Algorithm 3 below provides an example of an algorithm used to query the SAT filter. The Boolean function utilized to create the equation C.sub.x in Algorithm 3 is disjunction.
(86) A SAT filter query tool of the present invention is illustrated in
(87) The hash function identifier 402 identifies the hash function utilized by the SAT filter builder in building each SAT instance χ.sub.Y. The hash function identifier includes an output on which the identified hash functions, {h.sub.1, h.sub.2, . . . h.sub.k} are
(88) TABLE-US-00007 Algorithm 3 QUERYSATFILTER(S.sub.Y, x, n, k, h.sub.1, . . . , h.sub.k) n is the amount of memory available to store the SAT filter S.sub.Y, namely |S.sub.Y| k is the number of literals per clause h.sub.1, . . . , h.sub.k are functions that map elements of D to [−n, n ]\{0} 1: nonce := 0 2: repeat 3: C.sub.x := { }, the empty clause 4: for i := 1 to k do 5: lit := h.sub.i (x , nonce) 6: C.sub.x := C.sub.x ∪{lit} 7: end for 8: nonce := nonce + 1 9: until all literals of C.sub.x are distinct and no pair is complementary 10: for each lit in C.sub.x do 11: if S.sub.Y maps lit to 1 then 12: return maybe 13: end if 14: end for 15: return no
provided.
(89) The mapper 404 includes a first input for receiving an element x to be checked for set membership. The mapper 404 includes a second input in communication with the output of the hash function identifier which receives the set of hash functions {h.sub.1, h.sub.2, . . . h.sub.k}. The set of hash functions {h.sub.1, h.sub.2, . . . h.sub.k} identified by the hash function identifier 404 are the same hash functions selected by the hash function selector 304 of the SAT filter builder illustrated in
(90) The Boolean operation identifier 408 identifies the Boolean operation selected by the Boolean operation selector 310 of the SAT filter builder 300 and provides this identification at its output.
(91) The equation creator 406 includes a first input in communication with the output of the mapper 404. The first input receives the literals from the mapper 404. The equation creator 406 further includes a second input in communication with the Boolean operation identifier 408 and receives identification of the Boolean operation at the second input. The equation creator 406 further includes an output. The equation creator creates an equation C.sub.x by combining the literals utilizing the identified Boolean operator and provides the equation C.sub.x at its output.
(92) The solution identifier 412 includes an input and an output. The filter identifier 412 receives the filter, F.sub.Y or optionally the transposed filter provided by the SAT filter builder 300 and identifies a solution based upon this filter. The solution identifier provides the identified solution on the output of the solution identifier 412.
(93) The membership checker 410 includes a first input, a second input and an output. The first input is in communication with the output of the equation creator 406 and receives the equation C.sub.x. The second input is in communication with the solution identifier 412 to receive the identified solution. The membership checker 410 determines whether the identified solution satisfies the equation C.sub.x and provides this determination at its output.
(94) The indicator 414 includes an input and an output. The input of the indicator 404 is in communication with the output of the membership checker 410. The indicator receives the output of the membership checker 410 and based upon this result provides an indication of set membership as maybe or no at the output of the indicator 414. More specifically, if the membership checker 410 determines that the identified solution satisfies the equation C.sub.x, the indicator 414 provides an indication of maybe at its output and if the membership checker 410 determines that the identified solution does not satisfy the equation C.sub.x, the indicator 414 provides an indication of no at its output. In sum, the query tool provides that if the clause C.sub.x is falsified by the SAT filter associated with Y, then by construction, x is definitely not in Y. If on the other hand, C.sub.x is satisfied by the SAT filter, then x is potentially in Y.
(95) The query times for filters referenced in Table 4 were measured and provided below in Table 5. Table 5 provides a comparison between the query times of the most efficient SAT filters (i.e. those filters with an efficiency closest to 1) versus optimal Bloom filters built using the same data set and false positive rate. Specifically, 2.sup.27 random elements were provided to each of the filters and the time taken to query the filter and determine whether each element of the set at issue was measured.
(96) Table 5 illustrates that a SAT filter of the present invention where, k=5 can
(97) TABLE-US-00008 TABLE 5 Seconds taken to query the Bloom and SAT filters described in Table 4 with 2.sup.27 random elements m = 2.sup.14 m = 2.sup.15 m = 2.sup.16 m = 2.sup.17 Bloom SAT Bloom SAT Bloom SAT Bloom SAT k = 3, s = 11 .fwdarw. p ≈ 23.02% 30 84 30 84 30 84 30 85 k = 4, s = 22 .fwdarw. p ≈ 24.18% 31 186 31 186 31 185 31 185 k = 5, s = 44 .fwdarw. p ≈ 24.74% 31 415 31 410 31 401 31 421 k = 6, s = 89 .fwdarw. p ≈ 24.62% 31 1034 31 1031 31 1028 31 1011 The s and k parameters are only applicable to the SAT filters.
perform approximately 300,000 queries per second (i.e.,
(98)
Although this is approximately 14 times slower than some Bloom filters, the SAT filter query speed can be increased by either decreasing k (which may increase long term storage, i.e. decrease efficiency) or by storing the solutions transposed and making use of optimization techniques such as bit-packing and word-level instructions. This second option produces SAT filters that can be queried within an order of magnitude of Bloom filters. It is recognized that the use of multiple solutions requires more one-time work in order to find disparate solutions to a single random k-SAT instance.
(99) The results of building and querying a set SAT filters are presented below in Table 6. The desired false positive rate for the filters to be built was ¼. As in the filters of Table 5, each filter was queried with 2.sup.27 random elements.
(100) TABLE-US-00009 TABLE 6 Build time, filter size, query time, and achieved false positive rate for single instance SAT filters built with E = 0.75 and m, = 2.sup.14. Build Time Size Query Time SAT Filter k = 4, s = 22 .fwdarw. p ≈ 24.18% 20802 44748 47 k = 5, s = 44 .fwdarw. p ≈ 24.74% 610 44000 51 k = 6, s = 89 .fwdarw. p ≈ 24.62% 643 44144 61 Bloom Filter k = 2 p ≈ 24.18% 1 48219 31 p ≈ 24.74% 1 47638 31 p ≈ 24.62% 1 47797 31
(101) Table 6 illustrates that SAT filters can be built that use less long term storage than Bloom filters and also have comparable query speed and false positive rates.
(102) Thus the SAT filter of the present invention is a simple filter construction that is efficient with respect to the amount of long-term memory storage required and with respect to query time; i.e. SAT filters achieve the information-theoretic limit and support fast queries. SAT filter can therefore be used effectively for testing set membership in large families of sets, providing significant improvement over current techniques such as the standard Bloom filter and the compressed Bloom filter.
(103) While preferred embodiments of the present invention are shown and described, it is envisioned that those skilled in the art may devise various modifications of the present invention without departing from the spirit and scope of the appended claims.