Method and system to estimate the cardinality of sets and set operation results from single and multiple HyperLogLog sketches
11561954 · 2023-01-24
Assignee
Inventors
Cpc classification
G06N7/01
PHYSICS
G06F16/2379
PHYSICS
G06F17/18
PHYSICS
International classification
G06F17/18
PHYSICS
Abstract
A system and method for the estimation of the cardinality of large sets of transaction trace data is disclosed. The estimation is based on HyperLogLog data sketches that are capable to store cardinality relevant data of large sets with low and fixed memory requirements. The disclosure contains improvements to the known analysis methods for HyperLogLog data sketches that provide improved relative error behavior by eliminating a cardinality range dependent bias of the relative error. A new analysis method for HyperLogLog data structures is shown that uses maximum likelihood analysis methods on a Poisson based approximated probability model. In addition, a variant of the new analysis model is disclosed that uses multiple HyperLogLog data structured to directly provide estimation results for set operations like intersections or relative complement directly from the HyperLogLog input data.
Claims
1. A computer-implemented method of estimating cardinality of data sets, comprising: receiving a first data set recorded in a first sketching data structure in accordance with a HyperLogLog method where the first sketching data structure represents set A; receiving a second data set recorded in a second sketching data structure in accordance with a HyperLogLog method, where the second sketching data structure represents set B; defining a statistical model with model parameters λ.sub.a, λ.sub.b, and λ.sub.x, where λ.sub.a indicates the cardinality of elements in set A that are not in set B, λ.sub.b indicates the cardinality of elements in set B that are not in set A, and λ.sub.x indicates the cardinality of elements that are both in set A and set B; and the statistical models describes probability distributions of values stored in the first sketching data structure and the second sketching data structure; determining an initial value for the model parameters using an inclusion-exclusion principle; and starting with the initial value for the model parameters, determining optimized values for the model parameters that maximize a likelihood function using data in the first sketching data structure and the second sketching data structure as fixed input of the likelihood function, where the likelihood function determines an estimate probability value for a given set of model parameters; and iteratively determining the optimized values for the model parameters by changing the values of the model parameters to find optimized values of the model parameters until an optimization condition is satisfied, where the optimized values for the model parameters occur when output of the likelihood function has highest value.
2. The method of claim 1 wherein the likelihood function is further defined as a log likelihood function.
3. The method of claim 1 further comprises iteratively determining the optimized values for the model parameters using Broyden-Fletcher-Goldfarb-Shanno method.
4. The method of claim 1 further comprises creating a delta record with data extracted from the first data set and the second data set, where the delta record includes a first histogram, a second histogram, a third histogram, a fourth histogram and a fifth histogram, each histogram has a bin for each possible value recorded in the first and second sketching data structures, wherein the bins in the first histogram contain a count of occurrences that a register value in the first sketching data structure equals corresponding register value in the second sketching data structure, wherein the bins in the second histogram contain a count of occurrences that a register value in the first sketching data structure is greater than corresponding register value in the second sketching data structure, wherein the bins in the third histogram contain a count of occurrences that a register value in the first sketching data structure is less than corresponding register value in the second sketching data structure, wherein the bins in the fourth histogram contain a count of occurrences that a register value in the second sketching data structure is greater than corresponding register value in the first sketching data structure, wherein the bins in the fifth histogram contain a count of occurrences that a register value in the second sketching data structure is less than corresponding register value in the first sketching data structure; and iteratively determining the optimized values for the model parameters using values of the delta record.
5. The method of claim 1 wherein the first data set and the second data set represent users of a computer application in a distributed computing environment.
6. The method of claim 1 wherein the first data set and the second data set are derived from a plurality of transactions executed in a distributed computing environment.
7. A computer-implemented method for estimating the cardinality of set operations on data sets, comprising: receiving a first data set recorded in a first sketching data structure in accordance with a HyperLogLog method, where the first sketching data structure represents set A; receiving a second data set recorded in a second sketching data structure in accordance with a HyperLogLog method, where the second sketching data structure represents set B; calculating a delta record that describes differences between the first sketching data structure and the second sketching data structure by comparing a value for a register at a specific register address from the first sketching data structure with a value of a register at the same specific register address from the second data structure and recording results of this comparison in the delta record; defining a statistical model with model parameters, where the model parameters indicate cardinality of values which result from set operations performed on set A and set B; and determining values for the model parameters by maximizing a likelihood function for the statistical model in relation to the delta record, where the likelihood function determines an estimate probability value for a given set of model parameters.
8. The method of claim 7 wherein the set operations for which the result cardinality is estimated is one of the intersection of set A and set B, the relative complement of set A with respect to set B, and the relative complement of set B with respect to set A.
9. The method of claim 7 further comprises defining the statistical model with model parameters λ.sub.a, λ.sub.b, and λ.sub.x, where λ.sub.a indicates the cardinality of elements in set A that are not in set B, λ.sub.b indicates the cardinality of elements in set B that are not in set A, and λ.sub.x indicates the cardinality of elements that are both in set A and set B.
10. The method of claim 7 wherein determining values for the model parameters further comprises determining an initial value for the model parameters; starting with the initial value for the model parameters, determining an optimized value for the model parameter that maximizes the likelihood function; and iteratively determining the optimized value for the model parameter by changing the value of the model parameters to find an optimized value of the model parameter until an optimization condition is satisfied, where the optimized value of the model parameter occur when the likelihood function has highest value.
11. The method of claim 10 further comprises determining an initial value for the model parameters using an inclusion-exclusion principle.
12. The method of claim 7 wherein the likelihood function is further defined as a log likelihood function.
13. The method of claim 12 further comprises iteratively determining the optimized values for the model parameters using Broyden-Fletcher-Goldfarb-Shanno method.
14. The method of claim 7 further comprises storing differences between the first and the second sketching data structure in form of histograms, where each possible register value is represented by a bin of one of the histograms, and the value of the bin is based on the result of the comparison of the values of two registers from first and second sketching data structure having the same register address.
15. The method of claim 7 further comprises creating in the delta record a first histogram, a second histogram, a third histogram, a fourth histogram and a fifth histogram, each histogram has a bin for each possible value recorded in the first and second sketching data structures, wherein the bins in the first histogram contain a count of occurrences that a register value in the first sketching data structure equals corresponding register value in the second sketching data structure, wherein the bins in the second histogram contain a count of occurrences that a register value in the first sketching data structure is greater than corresponding register value in the second sketching data structure wherein the bins in the third histogram contain a count of occurrences that a register value in the first sketching data structure is less than corresponding register value in the second sketching data structure, wherein the bins in the fourth histogram contain a count of occurrences that a register value in the second sketching data structure is greater than corresponding register value in the first sketching data structure, wherein the bins in the fifth histogram contain a count of occurrences that a register value in the second sketching data structure is less than corresponding register value in the first sketching data structure; and iteratively determining the optimized values for the model parameters using values of the delta record.
16. The method of claim 7 further comprises iteratively determining the optimized value for the model parameter by determining current values for the model parameters that maximizes the likelihood function using data in the sketching data structure; computing differences between current values of the model parameters with corresponding most recent estimate of the model parameters; and continuing iterating until the largest computed difference is less than a minimum change threshold, where the minimum change threshold is set as
17. The method of claim 7 wherein the first data set and the second data set represent users of a computer application in a distributed computing environment.
18. The method of claim 7 wherein the first data set and the second data set are derived from a plurality of transactions executed in a distributed computing environment.
Description
DRAWINGS
(1) The drawings described herein are for illustrative purposes only of selected embodiments and not all possible implementations, and are not intended to limit the scope of the present disclosure.
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
(13)
(14)
(15) Corresponding reference numerals indicate corresponding parts throughout the several views of the drawings.
DETAILED DESCRIPTION
(16) Example embodiments will now be described more fully with reference to the accompanying drawings.
(17) The described embodiments are directed to monitoring systems that are capable to provide set cardinality estimations for large sets of transaction trace data in real-time or near real-time, with small and predictable memory consumption caused by the cardinality estimation processing.
(18) An enhancement to existing HLL data evaluation method is presented, that first summarizes the observations presented by a HLL data structure by creating a histogram of the data representing frequency of individual register values. The histogram data is used to calculate both a cardinality estimation value and correction terms for cardinalities in the extreme high or low value ranges.
(19) Further, an alternative evaluation method for HLL data records is presented that uses maximum likelihood method combined with a HLL data interpretation based on a Poisson approximation model to get bias free estimation results combined with a slightly better estimation accuracy. A variant of the maximum likelihood based cardinality estimation method is shown that uses differential histogram data extracted out of two or more HLL data records describing different sets to calculate cardinality estimations for set operation results directly from the differential histogram data. The present cardinality estimation method for set operation results does not require post-estimation calculations on estimation results that potentially increase the estimation error.
(20) Although the disclosed enhanced interpretation methods for HLL data records are exemplary presented in the scope of the interpretation and analysis of sets of transaction trace records provided by agent based transaction monitoring systems, those enhancements are independent of this specific domain and may be practiced in various other areas without departing from the spirit and scope of the present invention. Exemplary areas on which the disclosed technologies may be used include the determination of unique search queries e.g. processed by an internet search server, number of unique SQL statements processed by a database server or the determination of unique log lines in log files.
(21) Referring now to
(22) The created end-to-end transaction trace data may be analyzed by a transaction classifier 107 to identify transaction categories and to mark end-to-end transaction trace data with data identifying the categories to which the transactions match, to create categorized end-to-end transaction trace data 108. The transaction classifier may operate according to the teachings of U.S. patent application Ser. No. 15/227,029 “Method and System for Real-Time, Load-Driven, Multidimensional and Hierarchical Classification of Monitored Transaction Executions for Visualization and Analysis Tasks Like Statistical Anomaly Detection” by Greifeneder et al. which is incorporated herein by reference in its entirety. The categorized end-to-end transaction trace data may be stored in transaction repository 109 which may be accessed by further analysis and visualization modules to e.g. identify anomalies in transaction executions and to determine potential causal relationships between identified anomalies.
(23) The categorized end-to-end transaction trace data may further be forwarded to a HLL sketch generator 113 of a cardinality estimator module 112. The HLL sketch generator 113 may receive a categorized end-to-end transaction trace, use the category data of the received end-to-end transaction record to determine the observed sets for which the received end-to-end transaction is relevant further fetch the HLL sketch data records corresponding to those sets and use data extracted from the received end-to-end transaction trace record to update those HLL sketch data records. The HLL sketch generator may further determine if HLL sketch data records 116 are completed and move 120 completed HLL sketches to a completed HLL sketch repository 115. The determination of the completion status of a HLL sketch may e.g. be performed based on an elapsed observation time or the number of observed transactions corresponding to the set described by HLL sketch.
(24) A HLL sketch evaluator 114, which may also be a part of the cardinality estimator module 112, may receive cardinality estimation requests 118 and may in response to such requests determine and fetch the HLL sketch records corresponding to the sets for which cardinality estimations are requested. The HLL sketch evaluator may then evaluate the fetched HLL sketch records to calculate a cardinality estimation result which may then be returned by the HLL sketch evaluator in form of a cardinality estimation result 119.
(25) Coming now to
(26) A block diagram showing the internal components of a HLL sketch generator 113 is shown in
(27) A HLL sketch updater module 302 of the HLL sketch generator 113 receives both the incoming categorized end-to-end transaction trace data 108 and the corresponding hash value created by the hash value generator 301 and uses the category data of the received transaction trace data to fetch a matching HLL sketch record 116 from a working HLL sketch repository 305 or to create a new HLL sketch record with matching category data 202 if no one is available in the working HLL sketch repository 305. Afterwards, the HLL sketch updater 302 updates the fetched or created HLL sketch according to the received hash value and stores the updated HLL sketch in the working HLL sketch repository 305. For a detailed description of the HLL sketch record update process, please see
(28) A HLL sketch completeness check executed by the HLL sketch completeness checker 308 is cyclically performed on HLL sketch records existing in the working HLL sketch repository 305. The HLL sketch completeness check uses data stored in HLL sketch records 116, like e.g. the specified observation period 203 to determine if HLL sketch records are finished and should not receive further updates. The HLL sketch completeness checker moves such finished HLL sketch records from the working HLL sketch repository 305 to the completed HLL sketch repository 115 to make them available for cardinality estimation requests 118.
(29) Flowcharts of processes performed by the HLL sketch updater and the HLL sketch completeness checker are depicted in
(30) Some variants of the present embodiments, which maintain NLZ histogram generation while the HLL sketch update process is ongoing, may further execute a step 405, which initializes a HLL histogram 220 of the new created HLL sketch record 116. Step 405 initializes the HLL histogram with q+2 bins 221 to cover the whole NLZ register value range. The register values 222 of those bins are set to values from 0 to q+1, to get one bin for each possible register value. Further, the count 223 of bin with register value 0 is set to 2.sup.p and the counts of all other bins are set to 0 to represent the current state of the NLZ registers which were all initialized with the value 0. The process then ends with step 406.
(31) Referring now to
(32) Coming now to
(33) The calculation of a NLZ histogram from a completed HLL sketch record is described in the flow chart shown in
(34)
(35) Step 604 extracts the least significant q bits from the received hash value and determines the NLZ (number of leading zero bits) of those q bits. Following decision step 605 checks if the determined NLZ is greater or equal than the NLZ threshold. In case the determined NLZ is smaller than the NLZ threshold, the process ends with step 613. The NLZ threshold is set to the minimal value stored in a NLZ register (at the beginning of the recording this is 0, therefore the NLZ threshold is initialized with 0). A determined NLZ that is smaller than the NLZ threshold could therefore not cause an update to any NLZ register and can consequently be ignored.
(36) In case the determined NLZ is greater or equal than the NLZ threshold, the processing continues with step 606, which extracts the most significant p from the received hash value, interprets those p bits as address of an NLZ register and selects the addressed NLZ register. Following decision step 607 determines if the value 213 of the selected NLZ register is smaller than 1+the NLZ determined in step 604. In case the register value is not smaller, the process ends with step 613. Otherwise the process continues with step 608, which selects the bin 221 of the NLZ histogram representing calculated update value (1+the NLZ determined in step 604) and increments the count 223 of this bin and subsequent step 609 selects the bin with register value equal to the current value of the selected NLZ register and decrements its count. Following decision step 610 checks if the count value reached 0 after the decrement. In case the count is now 0, step 611 is executed, otherwise the process continues with step 612. Step 611 determines the first bin (i.e. the bin representing the lowest register value) that has a count >0. At the beginning, all NLZ registers have the value of 0, the count of the bin corresponding to register value 0 is 2.sup.p and the count of all other bins is 0. With continuous HLL sketch updates, the values of all registers receive updates and therefore also the counts of the bins in the histogram will change, until bins representing value 0 or 1 etc. will get 0. If the histogram indicates that all registers have a value higher than a specific value (i.e. all bins from the bin representing the value 0 to the bin representing the value n have a count of 0), then register updates having a lower value than n cannot change the value of any register and can therefore be ignored. After the first bin with a count >0 has been determined by step 611, the NLZ threshold is set to the register value represented by this bin.
(37) Subsequent step 612 updates the value of the selected NLZ register to 1+the NLZ determined in step 604. The process ends with step 613.
(38) Referring now to
(39) In case step 702 determines that a cardinality estimation is requested for the result of a set operation based on multiple sets, step 706 is executed which fetches the HLL sketch records 116 representing those sets that represent the input for the set operation. Following step 707 creates a register delta record, which summarizes the differences between the fetched HLL sketches in form of a set of delta histograms. An example for such a register delta record for two HLL sketch records is shown in
(40) Coming now to
(41)
(42) The value “m” in equation 1 represents the number of registers (i.e. 2.sup.p) and count.sub.0 (i.e. the number of NLZ registers with value 0). The value of count.sub.0 is divided by m and the result is used as input for equation 2 and the result of equation 2 is multiplied by m to create the low cardinality range correction corr.sub.low.
(43) The weighted sum is calculated by step 804 according to equation 3.
(44)
(45) The high cardinality correction value is calculated by step 805 according equation 4 and equation 5, where an input value for equation 5 is first calculated by dividing count.sub.q+1 by m and then subtracting this value from 1. The result of equation 5 is then multiplied by m and 2.sup.−q to get a correction factor for high cardinalities.
(46)
(47) After the three terms have been calculated by steps 803 to 805, subsequent step 806 uses equation 6 to calculate a final estimation value. The process then ends with step 807.
(48)
(49) α.sub.∞ may be calculated as 1/(2ln2). For the theoretical motivation of above equations, the mathematical proof showing their validity, simulation results showing their ability to provide correct results and possible numerical optimizations, the reader is kindly referred to the research paper of the inventor available via the GitHub project “oertl/hyperloglog-sketch-estimation-paper” and via the Cornell University Library preprint publication Otmar Ertl, “New cardinality estimation algorithms for HyperLogLog sketches”, arXiv preprint:1702.01284 (2017).
(50) Referring now to
(51)
(52) The idea is to use a maximum likelihood approach to calculate an estimate for the mean λ of the assumed Poisson distribution and use it as estimate for the cardinality. The maximum likelihood method uses a set of observations and an assumed parameterized probability model to find those values for the parameters of the probability model that best match the observations and use those values as estimates for the parameters. In this case, the maximum likelihood approach would use a likelihood function which may be constructed by using the probability mass function as given in equation 7 with a fixed NLZ register set and a variable λ. Typically, the process to estimate a parameter value for a probability model starts with an initial value for the parameter and then successively adapts the parameter value in order to maximize the likelihood function. The process continues until the difference between two subsequent steps is below a certain threshold. Equation 8 shows a variant of a likelihood function derived from equation 7 that may be used to maximize the mean parameter λ for an observed set of register counts. The depicted likelihood function is the log-likelihood function which is a logarithmized version of the original likelihood function. Applying the logarithm does not change the maximum point of the likelihood function (i.e. the parameter value that maximizes the value of the function), but it provides an equation that is easier to process numerically, e.g. because the logarithm translates exponents into multipliers or multiplications into additions. Equation 8 also exploits the fact that likelihood function can be expressed in terms of the compacted histogram data. This dramatically reduces the number of terms that need to be evaluated (e.g. for p=12 and q=20, 22 histogram bins instead of 2.sup.12 NLZ registers). See e.g. the first part of equation 8 where the log likelihood function for λ for a given NLZ register set is set equal to the log likelihood function for λ for a given set of histogram bins (details can be found in “oertl/hyperloglog-sketch-estimation-paper” and in arXiv preprint:1702.01284 (2017)).
(53)
(54) Consequently, following step 904 calculates an initial estimate for A as starting point for the maximum likelihood optimization. The determination of the initial estimate may evaluate the number of NLZ m (=2.sup.p), a term “A” depending on the number of NLZ registers with a value smaller than q+1 and a term “B” depending on the number of register values greater than 0. Equation 9 may be used to calculate term “A” and equation 10 may be used to calculate term “B”.
(55)
(56) Further, step 904 may determine if “B” is smaller by a certain factor (e.g. 1.5) than “A”, and in this case, use equation 11 to calculate an initial estimate, otherwise it may use equation 12. Alternatively or additionally, step 904 may use 0 as initial estimate.
(57)
(58) Following step 905 calculates a termination condition parameter for the maximum likelihood optimization in form of a minimum relative change between two iterations. A value change below this minimum change indicates that no substantial estimation improvement can be expected from further iterations and that the optimization process can be terminated. Step 905 may consider a desired maximum estimation error in form of a factor ε and a factor describing the quality of the input data, like e.g. the number of NLZ registers to calculate the termination condition. Equation 13 shows an exemplary way to calculate this termination condition parameter.
(59)
(60) After an initial value and a termination condition for the optimization process have been determined by steps 904 and 905, an iterative optimization method that is applicable for the optimization problem is chosen. Exemplary applicable iterative optimization methods include but are not limited to the secant method and Newton-Raphson method. Afterwards, the process continues with step 906 which executes a step of the chosen iterative optimization method to calculate an updated value for the estimate of λ that is more likely according to the likelihood function given the observations of the HLL Sketch Record. Following step 907 evaluates if the optimization termination condition is met, e.g. by determining if the relative difference of the estimate is smaller than min.sub.change calculated according to equation 13. In case the termination condition is not met, step 912 is executed which makes the new estimate available as initial estimate and starts a new iteration cycle with step 906.
(61) In case the termination condition is met, step 909 is executed which returns the current estimate for λ as cardinality estimate. The process then ends with step 910.
(62) Next to the cardinality estimation of individual sets, the estimation of the cardinality of the result of set operations are important input for various analysis tasks. The common analysis methods for HLL sketch records provide good support for the cardinality estimation for set unions, e.g. by using the per register address maximum NLZ register value of HLL sketch records that represent the input sets for the set union operation, but they do not provide support for other set operations like intersection or relative complement. The current approach to calculate cardinality estimates for such set operations is based on the inclusion-exclusion principle that uses the cardinality estimate of the input sets and the cardinality estimate for the result of the union operation to calculate cardinality estimates for intersections and the like.
(63)
(64) One reasons for the large estimation error is that the inclusion-exclusion approach is based on multiple individual cardinality estimations and the estimation error is accumulated. Another reason is that this approach only uses data from incoming HLL sketch relevant to determine the cardinality of an individual set, but the combined observations of two or more HLL sketches may also be interpreted to determine the cardinality of set elements contained by all, multiple or only one of the input sets. As an example, NLZ registers with the same address showing the same value in multiple HLL registers may be interpreted as indicators for elements of an intersection of the sets described by the HLL sketches. Further, a HLL sketch for a set A 1001 may be interpreted as the union-combined HLL sketch (using max value of each input NLZ register pair) of a set containing the elements of A\B and a set containing the elements of A∩B. This and other observations and findings (details can be found in “oertl/hyperloglog-sketch-estimation-paper” and in arXiv preprint:1702.01284 (2017)) may be used to extract histogram data from multiple HLL sketches that contain statistics that provide input data for the estimation of the cardinality for various set operation results.
(65) An exemplary data record that may be used for statistics extracted from two HLL sketch records as basis for the estimation of the intersection and relative complements of the sets described by the HLL sketch records is shown in
(66) A register delta record 1101 as depicted in
(67) Coming now to
(68) The process starts with step 1201 when a pair of HLL records containing sketch A and sketch B is available. Subsequent step 1202 creates a new register delta record 1101 and sets all counts 223 of all histograms (1102 to 1106) to 0. Following step 1203 selects the first register address (i.e. address 0) and subsequent step 1204 fetches the NLZ value of the selected register address from HLL sketch A as value X and from HLL sketch B as value Y. Afterwards, decision step 1205 determines if X is greater than Y. If X is greater than Y, step 1209 is executed which increments the count of delta histogram A>B 1103 for the bin representing value X and which further increments the count of delta histogram B<A 1106 for the bin representing value Y. In case decision step 1205 determines that X is not greater than Y, following decision step 1206 determines if Y is greater than X. In case Y is greater than X, step 1207 is executed which increments the count of delta histogram A<B 1104 for the bin representing value X and which further increments the count of delta histogram B>A 1105 for the bin representing value Y. In case Y is not greater than X (and not smaller, which means that X is equal to Y), step 1208 is executed which increments the count of delta histogram A=B 1102 for the bin representing value X.
(69) Decision step 1211 is executed after step 1207, 1208 or 1209 and determines if a next register address is available. If a next register address is available, this address is selected in step 1210 and the processing of the next register pair continues with step 1204. If no more register address is available, the creation of the register delta record 1101 is finished and the process ends with step 1212. A register delta record 1101 generated out of two HLL records represents a sufficient statistic for the analysis task to determine cardinality estimates for the intersection and the relative complements of the sets described by the two HLL records.
(70) Referring now to
(71) Afterwards, the process continues with step 1303 which interprets the received observation data as result obtained from a process that follows a model where the cardinalities for |A∩B|, |A\B| and |B\A| are not fixed but distributed according to a Poisson distribution. As described earlier, this is a tolerable approximation. The assumption of a Poisson process further provides a more relaxed probability model (independent probability functions for individual NLZ registers) that provides a mathematically simple way (multiplication of individual probability functions) to generate an overall probability function for all NLZ registers. The generated overall probability function interprets the observation data of the received register delta record as generated by a Poisson process, where the Poisson process is parameterized by three parameters λ.sub.a, λ.sub.b and λ.sub.x. Those parameters also describe the cardinality of the intersection of both input sets and the cardinalities of their relative complements.
(72) Following step 1304 uses the above explained inclusion-exclusion principle as described above to create initial cardinality values for the following optimization set, and subsequent step 1305 determines a value for a termination condition of the optimization process. Similar to the maximum likelihood-based evaluation for one HLL sketch, the value termination condition may be calculated using equation 13. However, as there are now three parameters that are optimized, the evaluation of the termination condition also must consider all three parameter changes.
(73) Following step 1306 executes one step of an appropriate optimization methodology (e.g. the Broyden-Fletcher-Goldfarb-Shanno algorithm for solving nonlinear optimization problems) that evaluates the overall probability function under the observations received with the register delta and with the initial approximation of the current three parameter values to calculate three updated parameter values that are more likely explain the received observation data under the assumed overall probability function. The optimization methodology may operate on a log-likelihood function derived from the overall probability function. Equation 14 shows the log-likelihood function that may be used. In equation 14, “hist.sub.A=B” represents counts from the delta histogram A=B 1102, “A>B” the counts from delta histogram A>B 1102 etc. and “count(A>B).sub.k” denotes the count of the bin for register value k of delta histogram A>B 1102. Equation 14 also expresses the fact that the original observation data provided by the two analyzed HLL records A and B in form of NLZ register sets NLZ.sub.A and NLZ.sub.B is equivalent to the data of a corresponding register delta record 1101 for the task to calculate cardinality estimates for |A∩B|, |A\B| and |B\A|.
(74)
(75) Subsequent step 1307 determines if the optimization termination condition is met, e.g. by determining if the relative change of all three parameter values is below the optimization termination value calculated in step 1305. If the termination condition is not met, following decision step 1308 continues the process with step 1309 which uses the new parameter values as initial parameter values and afterwards starts a new optimization iteration with step 1306.
(76) If otherwise the optimization termination condition is met, the process continues with step 1310 which uses the optimized estimate for parameter λ.sub.x as estimate for |A∩B|, the estimate for parameter λ.sub.a as estimate for |A\B| and the estimate for λ.sub.b as estimate for |B\A|. The estimates for the cardinalities |A∩B|, |A\B| and |B\A| are returned and the process ends with step 1311.
(77) The techniques described herein may be implemented by one or more computer programs executed by one or more processors. The computer programs include processor-executable instructions that are stored on a non-transitory tangible computer readable medium. The computer programs may also include stored data. Non-limiting examples of the non-transitory tangible computer readable medium are nonvolatile memory, magnetic storage, and optical storage.
(78) Some portions of the above description present the techniques described herein in terms of algorithms and symbolic representations of operations on information. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. These operations, while described functionally or logically, are understood to be implemented by computer programs. Furthermore, it has also proven convenient at times to refer to these arrangements of operations as modules or by functional names, without loss of generality.
(79) Unless specifically stated otherwise as apparent from the above discussion, it is appreciated that throughout the description, discussions utilizing terms such as “processing” or “computing” or “calculating” or “determining” or “displaying” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system memories or registers or other such information storage, transmission or display devices.
(80) Certain aspects of the described techniques include process steps and instructions described herein in the form of an algorithm. It should be noted that the described process steps and instructions could be embodied in software, firmware or hardware, and when embodied in software, could be downloaded to reside on and be operated from different platforms used by real time network operating systems.
(81) The present disclosure also relates to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored on a computer readable medium that can be accessed by the computer. Such a computer program may be stored in a tangible computer readable storage medium, such as, but is not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, application specific integrated circuits (ASICs), or any type of media suitable for storing electronic instructions, and each coupled to a computer system bus. Furthermore, the computers referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability.
(82) The algorithms and operations presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may also be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatuses to perform the required method steps. The required structure for a variety of these systems will be apparent to those of skill in the art, along with equivalent variations. In addition, the present disclosure is not described with reference to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present disclosure as described herein.
(83) The foregoing description of the embodiments has been provided for purposes of illustration and description. It is not intended to be exhaustive or to limit the disclosure. Individual elements or features of a particular embodiment are generally not limited to that particular embodiment, but, where applicable, are interchangeable and can be used in a selected embodiment, even if not specifically shown or described. The same may also be varied in many ways. Such variations are not to be regarded as a departure from the disclosure, and all such modifications are intended to be included within the scope of the disclosure.