Estimating Accuracy of Privacy-Preserving Data Analyses

20230205916 · 2023-06-29

Abstract

Systems and methods for estimating the accuracy, in the form of confidence intervals, of data released under Differential Privacy (DP) mechanisms and their aggregation. Reasoning about the accuracy of aggregated released data can be improved by combining the use of probabilistic bounds like union and Chernoff bounds. Some probabilistic bounds, e.g., Chernoff bounds, rely on detecting statistical independence of random variables, which in this case corresponds to sources of statistical noise of DP mechanisms. To detect such independence, and provide accuracy calculations, provenance of statistical noise sources as well as information flows of random variables are tracked within data analyses, i.e., where, within data analyses, randomly generated statistical noise propagates and how it gets manipulated.

Claims

1. A method, in an electronic device, for providing an accuracy estimation of a data analysis using differential privacy, DP, mechanism, the method comprising steps of: receiving information about a data set to be analyzed, a data analysis, and receiving at least one scenario parameter set related to the data analysis, the scenario parameter set comprising one of: I. a DP-parameter set for a given mechanism together with a wanted confidence parameter β for accuracy calculations; or II. a wanted confidence parameter β and confidence interval; applying taint analysis using the information about the data set and in the taint analysis attaching provenance tags to the generation of noise values, wherein the tags comprise: a) an identifier indicating a distribution from where statistical noise will be sampled by the DP mechanism; b) a parametrization of the distribution; and c) identifiers denoting a statistical dependency on other noisy values; computing the provenance tags, for the result of aggregating noisy values, based on the provenance tags attached to the noisy values received by an aggregation operation as well as the aggregation operation itself; estimating accuracy as a narrowest confidence interval provided by concentration bounds when provenance tags indicate statistical independence among noisy values and determining a scenario response in relation to the received scenario parameter set; and providing the scenario response, wherein the scenario response is one of: I. the accuracy estimation as confidence interval for the data analysis; or II. a DP-parameter to be used by the received data analysis in order to achieve a wanted accuracy with the received confidence parameter β and received confidence interval.

2. The method according to claim 1, further comprising deciding which concentration bounds are applicable from at least one of union and Chernoff bounds.

3. The method according to claim 1, wherein computing provenance tags, when noisy values are aggregated, comprise at least one of sum, scalar multiplication, and negation of noisy values.

4. The method according to claim 1, further comprising computing provenance tags when calculating $\ell_{\infty}$, $\ell_{2}$, $\ell_{1}$ norms based on noisy values.

5. The method according to claim 1, wherein the information about the data analysis comprises at least one of structure of data set, data-set, query to run, or differential privacy parameter.

6. The method according to claim 1, wherein determining when tags indicate statistical independence comprises inspecting for all tags the identifier indicating a distribution from where statistical noise will be sampled by the DP mechanism and the identifier denoting a statistical dependency on other noisy values.

7. The method according to claim 1, wherein the differential privacy, DP, mechanism is at least one of Laplace and Gaussian mechanisms.

8. The method according to claim 1, wherein the electronic device receives information about the data analysis from a remote device via a digital communications network and providing the scenario response via the digital communications network.

9. The method according to claim 1, further comprising injecting noise into a result of a data analysis according to a chosen DP-mechanism and providing the result with injected noise.

10. An electronic device for providing estimations of an accuracy, the electronic device comprising: one or more processors, at least one memory storing one or more programs configured to be executed by the one or more processors, the one or more programs including instructions for performing the method of claim 1.

11. A computer-readable storage medium storing one or more programs configured to be executed by one or more processors of an electronic device, the one or more programs including instructions for performing the method of claim 1.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0019] In the following the invention will be described in a non-limiting way and in more detail with reference to exemplary embodiments illustrated in the enclosed drawings, in which:

[0020] FIG. 1 is a schematic block diagram illustrating an example system;

[0021] FIG. 2 is a schematic block diagram illustrating an exemplary device;

[0022] FIG. 3 is a schematic block diagram illustrating an exemplary method;

[0023] FIG. 4 is a schematic block diagram illustrating an exemplary method;

[0024] FIG. 5 is a schematic block diagram illustrating an exemplary method;

[0025] FIG. 6 is a schematic block diagram illustrating an exemplary method;

[0026] FIG. 7 is a schematic block diagram illustrating an exemplary method; and

[0027] FIG. 8 is a schematic block diagram illustrating an exemplary method.

DETAILED DESCRIPTION OF THE DRAWINGS

[0028] In FIG. 1, reference numeral 100 generally denotes a system for determining and providing an estimation of accuracy of a statistical data analysis to be performed. The system comprises an electronic device 101 arranged to perform calculations, to operate data analyses on data sets, and/or to estimate the accuracy of data analyses on data sets and data structures. Data analyses may for instance be statistical analyses and machine learning analyses, but other types of analyses incorporating differential privacy mechanisms may be performed as well. The electronic device is optionally connected to a display 102 for interacting with a user and displaying settings and results from provided functionality. The electronic device 101 may be arranged to receive information about data analyses to be performed, information about data sets, data sets, data structure information, or parameters relating to data analyses from a remote query device 110 communicating with the electronic device via a digital communications network 120 and network communication lines 115. Furthermore, the electronic device may be arranged to transmit results to the remote device in the same manner. It should be noted that the electronic device may receive data sets, data analyses, and/or data structure information using other means, such as portable memory modules, e.g., universal serial bus (USB) modules or similar. The network communication may be based on Ethernet or other communication protocols using wired or wireless technologies as physical transmission media. Using a communications interface, the electronic device may receive relevant information for performing accuracy estimations according to the present solution from remote devices and can optionally provide the accuracy estimations as a service to different entities, while the remote entities perform the actual (statistical) data analyses. However, the electronic device may also be arranged to perform the data analyses together with the accuracy estimations, thus providing a complete analysis package.

[0029] As can be seen in FIG. 2, the electronic device 101 comprises one or more processors or processing units 210, one or more memories 211 for storing data and/or instruction sets for operating functionality, at least one communication interface 215, and optionally a user interface (UI) 216. The processing unit comprises one or several modules for operating different types of functionality, such as an instruction set operation module 220 arranged to operate calculations and other functionality of the processing unit and a communication module 230 for handling receiving and transmitting data via the digital communications network 120. Furthermore, the processing unit 210 may comprise a user interface module for handling user interface functionality such as displaying data and functionality on a display 102 and/or receiving user instructions from a keyboard, mouse or other user interface devices (not shown).

[0030] The one or more processors 210 may comprise any suitable processor or combination of processors arranged to operate instruction sets for operating software functions. For example, the processing unit may be a central processing unit (CPU), microprocessor, digital signal processor (DSP), a graphical processing unit (GPU), a field programmable gate array (FPGA), application specific integrated circuit (ASIC), or any other similar device arranged to operate processing functionality and calculations.

[0031] Memory 211 of electronic device 101 can include one or more non-transitory computer-readable storage mediums, for storing computer-executable instructions, which, when executed by one or more computer processors 210, for example, can cause the computer processors to perform the techniques described below. A computer-readable storage medium can be any medium that can tangibly contain or store computer-executable instructions for use by or in connection with the instruction execution system, apparatus, or device. In some examples, the storage medium is a transitory computer-readable storage medium. In some examples, the storage medium is a non-transitory computer-readable storage medium. The non-transitory computer-readable storage medium can include, but is not limited to, magnetic, optical, and/or semiconductor storages. Examples of such storage include magnetic disks, optical discs based on CD, DVD, or Blu-ray technologies, as well as persistent solid-state memory such as flash, solid-state drives, and the like. The computer-readable storage medium stores one or more programs configured to be executed by the one or more processors of an electronic device, the one or more programs including instructions or instruction sets for performing functions and methods as described in this document.

[0032] The electronic device 101 is arranged to operate instruction sets and functionality for operating data analyses and accuracy estimation methods as will be described below.

[0033] Differential privacy (DP) is a quantitative notion of privacy that bounds how much a single individual's private data can affect the result of a data analysis. Formally, differential privacy is a property of a randomized query $\tilde{Q}(\cdot)$ representing the data analysis, as follows.

[0034] Definition (Differential Privacy)

[0035] A randomized query $\tilde{Q}(\cdot) : db \to \mathbb{R}$ satisfies (ε, δ)-differential privacy if and only if for any two datasets $D_{1}$ and $D_{2}$ in db, which differ in one row, and for every output set $S \subseteq \mathbb{R}$, it holds that $\Pr[\tilde{Q}(D_{1}) \in S] \leq e^{ε} \cdot \Pr[\tilde{Q}(D_{2}) \in S] + δ$.

[0036] In the definition above, the parameters (ε, δ) determine a bound on the distance between the distributions induced by $\tilde{Q}(\cdot)$ when adding or removing an individual from the dataset.

[0037] When the parameter δ=0, the definition above is referred to as pure-DP; when δ>0, it is called approximated-DP.

[0038] To protect all the different ways in which an individual's data can affect the result of a query, the noise needs to be calibrated to the maximal change that the result of the query can have when changing an individual's data. This is formalized through the notion of sensitivity.

[0039] Definition (Sensitivity)

[0040] The (global) sensitivity of a query $Q : db \to \mathbb{R}$ is the quantity $Δ_{Q} = \max\{|Q(D_{1}) - Q(D_{2})|\}$ for $D_{1}$, $D_{2}$ differing in one row.

[0041] A standard way to achieve DP is to add carefully calibrated noise to the result of a query; the choice of the kind of noise to add is also important. A standard approach to achieve pure-DP is based on the addition of noise sampled from the Laplace distribution.

[0042] Theorem (Laplace Mechanism)

[0043] Let $Q : db \to \mathbb{R}$ be a deterministic query with sensitivity $Δ_{Q}$. Let $\tilde{Q}(\cdot) : db \to \mathbb{R}$ be a randomized query defined as $\tilde{Q}(D) = Q(D) + η$, where η is sampled from the Laplace distribution with mean μ=0 and scale

[00001] $b = \frac{Δ_{Q}}{ε} .$

Then, $\tilde{Q}(\cdot)$ is (ε, 0)-differentially private, or simply ε-differentially private.

[0044] A standard approach to achieve approximated-DP is based on the addition of noise sampled from the Gaussian distribution.

[0045] Theorem (Gaussian Mechanism)

[0046] Let $Q : db \to \mathbb{R}$ be a deterministic query with sensitivity $Δ_{Q}$. Let ε and δ be values in the interval (0,1). Let $\tilde{Q}(\cdot) : db \to \mathbb{R}$ be a randomized query defined as $\tilde{Q}(D) = Q(D) + η$, where η is sampled from the Gaussian distribution with mean μ=0 and standard deviation

[00002] $σ = Δ_{Q} \cdot \frac{\sqrt{2 \cdot \log (\frac{1.25}{δ})}}{ε} .$

Then, $\tilde{Q}(\cdot)$ is (ε, δ)-differentially private. In general, the notion of accuracy using confidence intervals can be defined as follows.

[0047] Definition (Accuracy)

[0048] Given an (ε, δ)-differentially private query $\tilde{Q}(\cdot)$, a target deterministic query $Q(\cdot)$, a distance function $d(\cdot)$, a bound α, and a probability β, $\tilde{Q}(\cdot)$ is ($d(\cdot)$, α, β)-accurate with respect to $Q(\cdot)$ if and only if for every dataset D, it holds that $\Pr[d(\tilde{Q}(D), Q(D)) > α] \leq β$.

[0049] This definition allows one to express data-independent error statements such as: with probability at least 1−β, the result of query $\tilde{Q}(D)$ diverges from the result of $Q(D)$, in terms of the distance $d(\cdot)$, by at most α. We will refer to α as the error, β as the confidence probability, and (−α, α) as the confidence interval. For the rest of the document, the considered distance function is that on real numbers: d(x,y)=|x−y|. There are known results about the accuracy for queries using the Laplace and Gaussian Mechanisms.

[0050] Definition (Accuracy for the Laplace Mechanism)

[0051] Given an ε-differentially private query $\tilde{Q}(\cdot) : db \to \mathbb{R}$ implemented with the Laplace Mechanism, it holds that

[00003] $\Pr [|\tilde{Q}(D) - Q(D)| > \log (\frac{1}{β}) \cdot \frac{Δ_{Q}}{ε}] \leq β .$

[0052] Definition (Accuracy for the Gaussian Mechanism)

[0053] Given an (ε, δ)-differentially private query $\tilde{Q}(\cdot) : db \to \mathbb{R}$ implemented with the Gaussian Mechanism, it holds that

[00004] $\Pr [|\tilde{Q}(D) - Q(D)| > σ \cdot \sqrt{2 \cdot \log (\frac{2}{β})}] \leq β .$
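The two accuracy statements above can be evaluated numerically. The following is a minimal sketch, assuming natural logarithms and using hypothetical function names (`laplace_alpha`, `gaussian_sigma`, `gaussian_alpha`) that do not appear in the source; it only transcribes the stated formulas.

```python
import math


def laplace_alpha(sensitivity: float, epsilon: float, beta: float) -> float:
    """Error bound alpha for the Laplace Mechanism: log(1/beta) * sensitivity / epsilon."""
    return math.log(1.0 / beta) * sensitivity / epsilon


def gaussian_sigma(sensitivity: float, epsilon: float, delta: float) -> float:
    """Standard deviation used by the Gaussian Mechanism for given (epsilon, delta)."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def gaussian_alpha(sigma: float, beta: float) -> float:
    """Error bound alpha for the Gaussian Mechanism: sigma * sqrt(2 * log(2/beta))."""
    return sigma * math.sqrt(2.0 * math.log(2.0 / beta))


if __name__ == "__main__":
    # A counting query (sensitivity 1) released with epsilon = 0.5 at 95% confidence.
    print("Laplace alpha:", laplace_alpha(1.0, 0.5, 0.05))
    sigma = gaussian_sigma(1.0, 0.5, 1e-5)
    print("Gaussian alpha:", gaussian_alpha(sigma, 0.05))
```

With β=0.05, the returned α is the half-width of the 95% confidence interval (−α, α) around the noisy result.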

[0054] There are two known concentration bounds for random variables which are useful to reason about the aggregation of released data.

[0055] Definition (Union Bound)

[0056] Given n≥2 random variables $V_{j}$ with their respective inverse cumulative distribution functions ${iCDF}_{j}$, where j=1, . . . , n and

[00005] $α_{j} = {iCDF}_{j} (\frac{β}{n}),$

then the addition $Z = \sum_{j=1}^{n} V_{j}$ has the following accuracy: $\Pr[|Z| > \sum_{j=1}^{n} α_{j}] \leq β$.

[0057] The union bound makes no assumption about the distribution of the random variables $V_{j}$, j=1, . . . , n. In contrast, the Chernoff bound often provides a tighter error estimation than the commonly used union bound when adding several statistically independent random variables.

[0058] Definition (Chernoff Bound for Laplace Distributions)

[0059] Given n≥2 random variables $V_{j}$ whose distribution is Laplace with mean μ=0 and scale $b_{j}$, where j=1, . . . , n, $b_{M} = \max\{b_{j}\}_{j=1,\ldots,n}$, and

[00006] $v > \max \{\sqrt{\sum_{j = 1}^{n} b_{j}^{2}}, b_{M} \cdot \sqrt{\log (\frac{2}{β})}\},$

then the addition $Z = \sum_{j=1}^{n} V_{j}$ has the following accuracy:

[00007] $\Pr [|Z| > v \cdot \sqrt{8 \cdot \log (\frac{2}{β})}] \leq β .$

[0060] Definition (Chernoff Bound for Gaussian Distributions)

[0061] Given n≥2 random variables $V_{j}$ whose distribution is Gaussian with mean μ=0 and standard deviation $σ_{j}$, where j=1, . . . , n, then the addition $Z = \sum_{j=1}^{n} V_{j}$ has the following accuracy:

[00008] $\Pr [|Z| > \sqrt{2 \cdot \sum_{j = 1}^{n} σ_{j}^{2} \cdot \log (\frac{1}{β})}] \leq β .$

[0062] FIG. 3 illustrates an overall view of a method 300 for error estimations for a data analysis for scenarios where (i) DP-parameters are given and accuracy needs to be provided and (ii) a wanted accuracy is given and DP-parameters are to be provided. For these two scenarios, we also consider the possibility of injecting noise according to the contemplated DP-parameters and providing a noisy result.

[0063] An exemplary method 300 according to the present solution will now be discussed in relation to FIG. 3. The method is performed in an electronic device as discussed previously in this document. The method provides one or several accuracy estimations of a data analysis using differential privacy, DP, mechanisms. The method comprises a number of steps for operating different functions:

[0064] In a first step, the electronic device receives 301 information about a data set to be analyzed, information about the data analysis, and at least one scenario parameter set related to the data analysis. Depending on the scenario for which the accuracy is to be determined, the scenario parameter set may comprise one of:

[0065] 301a) a DP-parameter set (ε, δ) for a given mechanism together with a wanted confidence parameter β for accuracy calculations; or

[0066] 301b) a wanted confidence parameter β and confidence interval.

[0067] In a next step, the electronic device applies 302 taint analysis using the information about the data set and in the taint analysis attaching provenance tags to the generation of noise values, wherein the tags comprise:

[0068] a) an identifier indicating a distribution from where statistical noise will be sampled by the DP mechanism;

[0069] b) a parametrization of the distribution; and

[0070] c) identifiers denoting a statistical dependency on other noisy values.

[0071] The electronic device computes 303 the provenance tags, for the result of aggregating noisy values, based on the provenance tags attached to the noisy values received by an aggregation operation as well as the aggregation operation itself.

[0072] The electronic device then estimates 304 accuracy as a narrowest confidence interval provided by concentration bounds when provenance tags indicate statistical independence among noisy values and determining a scenario response in relation to the received scenario parameter set.

[0073] The scenario response is provided 305, wherein the scenario response is one of:

[0074] 305a) the accuracy estimation as confidence interval (−α, α) for the data analysis; or

[0075] 305b) a DP-parameter to be used by the received data analysis in order to achieve a wanted accuracy with the received confidence parameter β and received confidence interval.

[0076] The scenario response is provided either internally or to an external device.
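The text does not spell out how the DP-parameter of scenario (ii) is derived. For the simplest case of a single release under the Laplace Mechanism, one plausible reading is to invert the accuracy bound α = log(1/β)·Δ_Q/ε for ε. The sketch below illustrates only that special case; the function name `laplace_epsilon_for` is hypothetical, and analyses that aggregate several noisy values would need the bounds discussed later in this document.

```python
import math


def laplace_epsilon_for(alpha: float, beta: float, sensitivity: float) -> float:
    """Epsilon for which a single Laplace release has error at most alpha with probability 1 - beta.

    Obtained by solving alpha = log(1/beta) * sensitivity / epsilon for epsilon.
    This covers one released value only (an assumption of this sketch).
    """
    if alpha <= 0 or sensitivity <= 0 or not 0 < beta < 1:
        raise ValueError("alpha and sensitivity must be positive, beta must lie in (0, 1)")
    return math.log(1.0 / beta) * sensitivity / alpha


if __name__ == "__main__":
    # Wanted: error at most 3 with 95% confidence for a sensitivity-1 query.
    print(laplace_epsilon_for(alpha=3.0, beta=0.05, sensitivity=1.0))
```

A smaller wanted α or a smaller β yields a larger ε, i.e., weaker privacy, which reflects the usual accuracy and privacy trade-off.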

[0077] The method may further comprise deciding which concentration bounds are applicable from at least one of union and Chernoff bounds.

[0078] Computing provenance tags, for when noisy values are aggregated, may comprise at least one of sum, scalar multiplication, and negation of noisy values.

[0079] The method may further comprise computing provenance tags when calculating $\ell_{\infty}$, $\ell_{2}$, $\ell_{1}$ norms based on noisy values.

[0080] The information about the data analysis may comprise at least one of structure of data set, data-set, query to run, or differential privacy parameter.

[0081] The step of determining when tags indicate statistical independence may comprise inspecting for all tags the identifier indicating a distribution from where statistical noise will be sampled by the DP mechanism and the identifier denoting a statistical dependency on other noisy values.

[0082] The differential privacy, DP, mechanism may be at least one of Laplace and/or Gaussian mechanisms.

[0083] The electronic device may receive information about the data analysis from a remote device via a digital communications network and provide the scenario response via the digital communications network to the remote device or some other entity.

[0084] The method may further comprise steps of injecting noise into a result of a data analysis according to a chosen DP-mechanism and providing the result with injected noise.

[0085] The different functional parts will now be discussed in more detail below.

[0086] The solution provides a static analysis capable of computing the accuracy of DP data analyses which aggregate released data. From now on, released data and any aggregation of it are referred to as noisy data or noisy values. The accuracy analysis does not execute the DP analysis but rather inspects its components and sub-components, looking for where noisy data is generated and aggregated. The solution follows the principle of improving accuracy calculations by detecting statistical independence. For that, it applies a taint analysis which, for each noisy value, tracks (i) information about the distribution used for noise generation, (ii) the parametrization of the distribution, and (iii) identifiers denoting statistical dependence on other noise sources. The taint analysis uses provenance tags, comprising (i), (ii), and (iii), which are associated with noisy values and propagated along the different operations found in the considered data analysis. Based on such tags, the solution may use an inverse Cumulative Distribution Function (iCDF) which, given a 0≤β≤1, returns the corresponding (theoretical) confidence interval. For instance, if function f is an iCDF and f(0.05)=7, the data released (with noise) by a given DP mechanism is no more than distance 7 from the noise-free version of the released data with a confidence of 95%; the confidence in percentage is calculated as (1−β)*100.

[0087] In order to perform the accuracy calculations, the solution scrutinizes the operations which constitute a data analysis and proceeds to perform error calculations on them based on the following cases: A) where noisy data is generated, B) how such noisy data gets subsequently aggregated, C) negated, and D) scaled; and if the analysis calculates E) norms.

[0088] A. Released data: For each component of the data analysis where the underlying DP mechanism will inject 404 noise (using a noise source 402) into a value produced by the data analysis 403, the accuracy analysis generates accuracy information comprising a freshly generated 405 provenance tag and a corresponding iCDF based on the privacy parameters (ε, δ) of the underlying DP mechanism; see FIG. 4, illustrating a data analysis 400 under a DP mechanism which returns 406 the provenance tag and corresponding iCDF. The tag is a set which comprises a 3-tuple with the following components:

[0089] i. an identifier indicating the distribution of the statistical noise used by the underlying DP mechanism;

[0090] ii. a parametrization of such distribution; and

[0091] iii. a freshly generated identifier.

[0092] Below, the tag and iCDF are instantiated for the Laplace and Gaussian mechanisms.

[0093] For a data analysis $\tilde{Q}(\cdot) : db \to \mathbb{R}$ using the Laplace Mechanism with privacy parameter ε and sensitivity $Δ_{Q}$:

[00009] $tag = \{(L, \frac{Δ_{Q}}{ε}, \{p\})\},$

where L indicates that statistical noise is drawn from the Laplace distribution with location parameter μ=0 and scale parameter

[00010] $b = \frac{Δ_{Q}}{ε},$

and the singleton set {p} comprises a freshly generated identifier p. The corresponding iCDF is

[00011] $iCDF (β) = \log (\frac{1}{β}) \cdot \frac{Δ_{Q}}{ε},$

which indicates that the confidence interval for a given β is characterized by

[00012] $α = \log (\frac{1}{β}) \cdot \frac{Δ_{Q}}{ε} .$

[0094] For a data analysis $\tilde{Q}(\cdot) : db \to \mathbb{R}$ using the Gauss Mechanism with privacy parameters ε, δ and sensitivity $Δ_{Q}$:

[0095] $tag = \{(G, σ^{2}, \{p\})\}$, where G indicates that statistical noise is drawn from the Gauss distribution with mean μ=0 and standard deviation

[00013] $σ = Δ_{Q} \cdot \frac{\sqrt{2 \cdot \log (\frac{1.25}{δ})}}{ε}$

and the singleton set {p} comprises a freshly generated identifier p. The corresponding iCDF is

[00014] $iCDF (β) = σ \cdot \sqrt{2 \cdot \log (\frac{2}{β})}$

which indicates that the confidence interval for a given β is characterized by

[00015] $α = σ \cdot \sqrt{2 \cdot \log (\frac{2}{β})} .$
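The tag and iCDF instantiations for released data can be sketched in code as follows. This is an illustrative encoding only: representing a tag as a `frozenset` of `(label, parameter, identifiers)` tuples, drawing fresh identifiers from a counter, and the function names `release_laplace` and `release_gaussian` are all assumptions of this sketch rather than details given in the text.

```python
import itertools
import math
from typing import Callable, FrozenSet, Tuple

ICDF = Callable[[float], float]
# One tag entry: (distribution label, parameter, set of noise-source identifiers).
TagEntry = Tuple[str, float, FrozenSet[int]]
Tag = FrozenSet[TagEntry]  # the empty set stands for "distribution unknown"

_fresh_ids = itertools.count()  # hypothetical generator of fresh identifiers p


def release_laplace(sensitivity: float, epsilon: float) -> Tuple[Tag, ICDF]:
    """Tag {(L, b, {p})} and iCDF for a value released by the Laplace Mechanism."""
    b = sensitivity / epsilon
    tag = frozenset({("L", b, frozenset({next(_fresh_ids)}))})
    return tag, lambda beta: math.log(1.0 / beta) * b


def release_gaussian(sensitivity: float, epsilon: float, delta: float) -> Tuple[Tag, ICDF]:
    """Tag {(G, sigma^2, {p})} and iCDF for a value released by the Gauss Mechanism."""
    sigma = sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon
    tag = frozenset({("G", sigma ** 2, frozenset({next(_fresh_ids)}))})
    return tag, lambda beta: sigma * math.sqrt(2.0 * math.log(2.0 / beta))


if __name__ == "__main__":
    tag, icdf = release_laplace(sensitivity=1.0, epsilon=0.5)
    print(tag, icdf(0.05))
```

Each call produces a distinct identifier, so two separately released values carry disjoint identifier sets, which is what the later independence check inspects.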

[0096] B. Aggregation: When the accuracy analysis finds an instruction to add noisy values, it proceeds to find corresponding provenance tags and iCDFs 501 . . . 502 for the operands—see FIG. 5 illustrating an aggregation method 500. The tag for the aggregated (noisy) value 503 is computed based on the tags of the operands as determined by a function addTag. This tag reflects—if possible—the parametrization of the distribution characterizing the total noise injected to the result of the aggregation. The analysis also uses the information of the operands' tags to select the concentration bound (i.e., union or Chernoff) which yields the narrowest confidence interval for a given confidence parameter β—sometimes union bound provides tighter estimations when aggregating few noisy values.

[0097] The accuracy analysis uses the information in the tags to (a) determine 504 that no operand can influence the value of another, i.e., they are independent, which is a check performed by a function independent, and (b) to determine if all the operands got injected with statistical noise coming from the same distribution but with possibly different parameters—a check that is also performed by such function. If one of the conditions (a) or (b) is not fulfilled 506, the analysis computes the iCDF of the aggregation by using the union bound with the iCDFs of the operands. On the other hand, if (a) and (b) are fulfilled 505, the analysis creates an iCDF which, given a confidence parameter β, compares and selects the narrowest confidence interval yielded by the union and Chernoff bounds.

[0098] A function ${iCDF}_{UB}$ calculates the iCDF of the aggregation using the union bound, which makes no assumption about the statistical noise injected in the operands. More specifically, function ${iCDF}_{UB}$ takes the iCDFs of the corresponding operands and returns an iCDF as follows.

[00016] ${iCDF}_{UB} ({iCDF}_{1}, \ldots, {iCDF}_{n}) (β) = \sum_{j = 1}^{n} {iCDF}_{j} (\frac{β}{n})$

[0099] In contrast, function ${iCDF}_{CF}$ calculates the iCDF of the aggregation using the Chernoff bound according to the chosen underlying DP mechanism. Finally, function ${iCDF}_{min}$ takes two iCDFs and generates an iCDF which, when given a confidence parameter β, chooses the narrowest confidence interval of the iCDFs given as arguments. More formally, we have the following definition:

${iCDF}_{min} ({iCDF}_{1}, {iCDF}_{2}) (β) = \min \{{iCDF}_{1} (β), {iCDF}_{2} (β)\}$
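The union-bound and minimum combinators are higher-order functions over iCDFs, which the following sketch makes concrete. The Python names `icdf_ub` and `icdf_min` are hypothetical stand-ins for the functions defined above, and an iCDF is modelled simply as a callable from β to α.

```python
import math
from typing import Callable

ICDF = Callable[[float], float]


def icdf_ub(*icdfs: ICDF) -> ICDF:
    """Union bound: split beta evenly among the n operands and add their error bounds."""
    n = len(icdfs)
    return lambda beta: sum(f(beta / n) for f in icdfs)


def icdf_min(icdf_1: ICDF, icdf_2: ICDF) -> ICDF:
    """For each beta, pick the narrower of the two confidence intervals."""
    return lambda beta: min(icdf_1(beta), icdf_2(beta))


if __name__ == "__main__":
    # Two Laplace operands with scale 1 (iCDF(beta) = log(1/beta)).
    lap: ICDF = lambda beta: math.log(1.0 / beta)
    print(icdf_ub(lap, lap)(0.05))
```

Because both combinators return a function of β, the choice between bounds is deferred until a concrete confidence parameter is supplied.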

[0100] Below, functions addTag, independent, and ${iCDF}_{CF}$ are instantiated for the Laplace and Gaussian DP mechanisms.

[0101] For a data analysis using the Laplace Mechanism:

[0102] $addTag (tag_{1}, \ldots, tag_{n}) = \emptyset$, which sets the tag of the result of the aggregation as empty. This reflects that the scale and the distribution of the noise of the addition are unknown: adding two Laplace-distributed values does not yield a Laplace-distributed value.

[0103] Let the inputs from FIG. 5 be $tag_{j}$, ${iCDF}_{j}$, j=1, . . . , n; then we define $independent (tag_{1}, \ldots, tag_{n}) = (\forall j \in \{1, \ldots, n\} \cdot tag_{j} \neq \emptyset) \wedge (\bigcap_{j = 1, \ldots, n} P_{j} = \emptyset)$

[0104] where $tag_{j} = \{(L, s_{j}, P_{j})\}$ for j ∈ {1, . . . , n}. This function evaluates to true (YES in FIG. 5) when none of the tags is empty, each tag consists of a single element (as indicated above), and the identifiers of the operands are disjoint. In any other case, it returns false 506 (NO in FIG. 5).

[0105] Let the inputs from FIG. 5 be $tag_{j}$, ${iCDF}_{j}$, j=1, . . . , n; then we define

[00017] ${iCDF}_{CF} ({iCDF}_{1}, \ldots, {iCDF}_{n}) (β) = v \cdot \sqrt{8 \cdot \log (\frac{2}{β})}$

where $tag_{j} = \{(L, s_{j}, P_{j})\}$ for j ∈ {1, . . . , n}, $b_{M} = \max\{s_{j}\}_{j = 1, \ldots, n}$, given a τ>0, and v=

[00018] $\max \{\sqrt{\sum_{j = 1}^{n} s_{j}^{2}}, b_{M} \cdot \sqrt{\log (\frac{2}{β})}\} + τ .$

Any positive value of τ can be used in this formula, but the smaller, the better, e.g., τ=0.00001.
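The Laplace instantiation can be sketched as follows, assuming tags are encoded as `frozenset`s of `(label, scale, identifiers)` tuples (an encoding chosen for this sketch, not given in the text). Note one deliberate choice: `independent` below checks that the identifier sets are pairwise disjoint, which is the stricter reading of "disjoint" in the prose and implies the intersection condition in the formula.

```python
import math
from typing import Callable, FrozenSet, Sequence, Tuple

Tag = FrozenSet[Tuple[str, float, FrozenSet[int]]]


def add_tag_laplace(tags: Sequence[Tag]) -> Tag:
    """The sum of Laplace noise has no tracked distribution: return the empty tag."""
    return frozenset()


def independent(tags: Sequence[Tag], dist: str) -> bool:
    """True when every tag is a single `dist` entry and no identifier is shared."""
    seen = set()
    for tag in tags:
        if len(tag) != 1:
            return False
        ((label, _param, ids),) = tag
        if label != dist or seen & ids:
            return False
        seen |= ids
    return True


def icdf_cf_laplace(tags: Sequence[Tag], tau: float = 1e-5) -> Callable[[float], float]:
    """Chernoff-bound iCDF for a sum of independent Laplace noises with scales s_j."""
    scales = [next(iter(tag))[1] for tag in tags]
    b_max = max(scales)

    def icdf(beta: float) -> float:
        log_term = math.log(2.0 / beta)
        v = max(math.sqrt(sum(s * s for s in scales)), b_max * math.sqrt(log_term)) + tau
        return v * math.sqrt(8.0 * log_term)

    return icdf


if __name__ == "__main__":
    tags = [frozenset({("L", 1.0, frozenset({j}))}) for j in range(4)]
    if independent(tags, "L"):
        print(icdf_cf_laplace(tags)(0.05))
```

In a full analysis, the value returned by `icdf_cf_laplace` would still be compared against the union bound for the same β, since the union bound can be narrower when only a few values are aggregated.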

[0106] For a data analysis using the Gauss Mechanism:

[0107] $addTag (tag_{1}, \ldots, tag_{n}) = \{(G, \sum_{j = 1}^{n} s_{j}, \bigcup_{j = 1, \ldots, n} P_{j})\}$, where $tag_{j} = \{(G, s_{j}, P_{j})\}$ for j ∈ {1, . . . , n}. In this case, the produced tag reflects the fact that the addition of statistical noise arising from Gaussian distributions results in statistical noise under a Gaussian distribution (note the label G in the resulting tag), whose variance is $\sum_{j = 1}^{n} s_{j}$, i.e., the sum of the variances of the Gauss distributions associated with the noisy operands. Furthermore, the provenance of the noise consists of all the operands' identifiers, i.e., $\bigcup_{j = 1, \ldots, n} P_{j}$, since all of them contributed to the noise injected into the result.

[0108] Let the inputs from FIG. 5 be $tag_{j}$, ${iCDF}_{j}$, j=1, . . . , n; then $independent (tag_{1}, \ldots, tag_{n}) = (\forall j \in \{1, \ldots, n\} \cdot tag_{j} \neq \emptyset) \wedge (\bigcap_{j = 1, \ldots, n} P_{j} = \emptyset)$, where $tag_{j} = \{(G, s_{j}, P_{j})\}$ for j ∈ {1, . . . , n}. This function evaluates 505 to true (YES in FIG. 5) when none of the tags is empty, each tag consists of a single element (as indicated above), and the sets of identifiers of the operands are disjoint. In any other case, it returns false (NO in FIG. 5).

[0109] Let the inputs from FIG. 5 be $tag_{j}$, ${iCDF}_{j}$, j=1, . . . , n; then

[00019] ${iCDF}_{CF} ({iCDF}_{1}, \ldots, {iCDF}_{n}) (β) = \sqrt{2 \cdot \sum_{j = 1}^{n} s_{j} \cdot \log (\frac{1}{β})},$

where $tag_{j} = \{(G, s_{j}, P_{j})\}$ for j ∈ {1, . . . , n}. Observe that in this case, the Chernoff bound does not use the iCDFs of the operands.
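To illustrate why tracking Gaussian provenance pays off, the sketch below implements the Gaussian tag addition and Chernoff iCDF and compares the result with the union bound for ten operands. The tag encoding (a `frozenset` holding one `(label, variance, identifiers)` tuple) and the names `add_tag_gauss` and `icdf_cf_gauss` are assumptions of this sketch.

```python
import math
from typing import Callable, FrozenSet, Sequence, Tuple

Tag = FrozenSet[Tuple[str, float, FrozenSet[int]]]


def add_tag_gauss(tags: Sequence[Tag]) -> Tag:
    """Sum of Gaussian noises: variances add up and identifier sets are united."""
    variance = 0.0
    ids: FrozenSet[int] = frozenset()
    for tag in tags:
        ((_label, s, p),) = tag  # each tag holds exactly one (G, variance, ids) entry
        variance += s
        ids = ids | p
    return frozenset({("G", variance, ids)})


def icdf_cf_gauss(tags: Sequence[Tag]) -> Callable[[float], float]:
    """Chernoff-bound iCDF: sqrt(2 * (sum of variances) * log(1/beta))."""
    total_variance = sum(next(iter(tag))[1] for tag in tags)
    return lambda beta: math.sqrt(2.0 * total_variance * math.log(1.0 / beta))


if __name__ == "__main__":
    n, sigma, beta = 10, 1.0, 0.05
    tags = [frozenset({("G", sigma ** 2, frozenset({j}))}) for j in range(n)]
    # Union bound built from the single-release Gaussian iCDF of each operand.
    union = sum(sigma * math.sqrt(2.0 * math.log(2.0 / (beta / n))) for _ in range(n))
    chernoff = icdf_cf_gauss(tags)(beta)
    print("union:", union, "chernoff:", chernoff, "selected:", min(union, chernoff))
```

The union bound grows linearly in the number of operands, whereas the Chernoff bound grows with the square root of the summed variances, so the gap widens as more independent values are aggregated.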

[0110] C. Negation: Negating a noisy value, i.e., multiplying it by −1, becomes useful when implementing subtractions of noisy data. If the given data analysis performs such an operation, the accuracy analysis associates 601 the same tag and iCDF to the result as to the input; see FIG. 6, illustrating a negation function 600. Multiplying a noisy value by −1 does not affect the size of the estimated confidence intervals for a given confidence parameter β; thus, after the negation 602, the analysis keeps the same iCDF associated with the input. Similarly, the tag associated with the negation of the input is the same as that of the input itself. This relies on the fact that distributions characterizing statistical noise in DP mechanisms involve both negative and positive noise (e.g., the Laplace and Gauss mechanisms). Therefore, the noise of the result is negated but drawn from the same distribution as the input, and the tag is unchanged. The function returns 603 the tag and iCDF of the negated value.

[0111] D. Scalar: This case deals with calculating the accuracy of noisy values 701 which get multiplied 705 by a non-noisy constant n 702—see FIG. 7 illustrating a scalar function 700. The resulting tag 710 depends on the statistical noise distribution of the input. More specifically, there are three cases to consider: [0112] scalarTag(tag)=ø, where tag=ø. This case triggers when it is unknown the distribution from where the noise was injected in the input noisy value. Thus, the resulting tag is also the empty set ø. [0113] scalarTag(tag)=ø, where tag={(L, s, P)} and n≠0. This case considers a noisy value generated by the Laplace mechanism. The resulting tag is ø indicating that the distribution which characterize the noise in the result (i.e., scaled value) is unknown—this arises from the fact that multiplying a constant value by a Laplace distribution is not necessarily a Laplace distribution. [0114] scalarTag(tag)={(G, n.Math.σ.sup.2, P)}, where tag={(G, σ.sup.2, P)} and n≠0. This case considers a noisy value with statistical noise drawn from a Gaussian distribution. The resulting tag changes the variance of the distribution to indicate that the noise has been multiplied by the constant n.
The iCDF of the resulting noisy value simply scales the confidence interval by the absolute value of constant n:

$\mathrm{iCDF}_{\mathrm{scalar}}(n, \mathrm{iCDF})(\beta) = \mathrm{iCDF}(\beta) \cdot |n|$
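The scalar rules can be sketched as follows. This is an illustrative sketch under stated assumptions: the names `scalar_tag`, `scalar_icdf`, and `gaussian_icdf` are hypothetical, tags are modeled as frozen sets of (distribution, parameter, provenance) triples, and the Gaussian iCDF is taken to be the exact two-sided quantile σ·Φ⁻¹(1 − β/2), which may differ from the bound used elsewhere in the analysis. The Gaussian tag uses the standard identity Var(n·X) = n²·Var(X). Tags with more than one entry and the case n = 0 are not specified in the text, so the sketch treats them conservatively.

```python
import math
from statistics import NormalDist
from typing import Callable, FrozenSet, Tuple

Tag = FrozenSet[Tuple[str, float, FrozenSet[str]]]  # hypothetical encoding
ICDF = Callable[[float], float]
UNKNOWN: Tag = frozenset()  # the empty tag: distribution unknown


def scalar_tag(n: float, tag: Tag) -> Tag:
    """Tag of n * (noisy value), following the three cases in the text."""
    if not tag:
        return UNKNOWN                      # unknown stays unknown
    if n == 0:
        raise ValueError("the text only covers n != 0")
    if len(tag) != 1:
        return UNKNOWN                      # assumption: not specified
    (dist, param, prov), = tag
    if dist == "G":
        # Var(n * X) = n^2 * Var(X) for Gaussian noise with variance param
        return frozenset({("G", n ** 2 * param, prov)})
    return UNKNOWN                          # Laplace case: tag is dropped


def scalar_icdf(n: float, icdf: ICDF) -> ICDF:
    """Scale the confidence-interval half-width by |n|."""
    return lambda beta: icdf(beta) * abs(n)


def gaussian_icdf(sigma: float) -> ICDF:
    """Assumed exact two-sided Gaussian half-width for confidence beta."""
    return lambda beta: NormalDist(0.0, sigma).inv_cdf(1.0 - beta / 2.0)


prov = frozenset({"q1"})
scaled_tag = scalar_tag(3.0, frozenset({("G", 4.0, prov)}))
scaled_width = scalar_icdf(-3.0, gaussian_icdf(2.0))(0.05)
```

As a sanity check, `scaled_width` agrees with `gaussian_icdf(6.0)(0.05)`, the half-width for a Gaussian whose standard deviation is |n|·σ, which matches the variance recorded in `scaled_tag`.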

[0115] E. Norms: It becomes useful for data analyses to use norms (e.g., the standard $\ell_{\infty}$, $\ell_{2}$, and $\ell_{1}$ norms) to aggregate the accuracy calculations of many released values 801, 802, . . . 803 into a single measure 810, which is a useful tool when dealing with vectors. When the accuracy analysis finds an instruction to compute a norm, it proceeds to find the corresponding provenance tags and iCDFs for the elements in the vector. Then, it creates the tag and iCDF for the vector as follows: (i) it determines the tag (provenance) of the vector as empty (see function normTag) and (ii) it calculates the iCDF for the vector based on the chosen norm and the iCDFs of the elements in the vector. In FIG. 8, the function normICDF performs 805 such calculation.

[0116] Below, the same definitions for functions normTag and normICDF work for both the Laplace and Gaussian mechanisms.

[0117] normTag($\mathrm{tag}_{1}$, . . . , $\mathrm{tag}_{n}$)=ø, which indicates that the noise found in the norm calculation cannot be characterized by a distribution.

[0118] Let us define the inputs from FIG. 5 as $\mathrm{tag}_{j}$, $\mathrm{iCDF}_{j}$, for j=1, . . . , n.

[0119] The calculation of the iCDF for the $\ell_{\infty}$-norm (L-infinity):

[00020] $\mathrm{normICDF}(\mathrm{iCDF}_{1}, \mathrm{iCDF}_{2}, \ldots, \mathrm{iCDF}_{n})(\beta) = \max_{j = 1, \ldots, n} \mathrm{iCDF}_{j}\left(\frac{\beta}{n}\right)$

[0120] The calculation of the iCDF for the $\ell_{2}$-norm (L2):

[00021] $\mathrm{normICDF}(\mathrm{iCDF}_{1}, \mathrm{iCDF}_{2}, \ldots, \mathrm{iCDF}_{n})(\beta) = \sqrt{\sum_{j = 1}^{n} \mathrm{iCDF}_{j}\left(\frac{\beta}{n}\right)^{2}}$

[0121] The calculation of the iCDF for the $\ell_{1}$-norm (L1):

[00022] $\mathrm{normICDF}(\mathrm{iCDF}_{1}, \mathrm{iCDF}_{2}, \ldots, \mathrm{iCDF}_{n})(\beta) = \sum_{j = 1}^{n} \left|\mathrm{iCDF}_{j}\left(\frac{\beta}{n}\right)\right|$
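The three norm formulas share one pattern: each element's iCDF is evaluated at β/n (a union-bound split of the failure probability across the n elements), and the resulting half-widths are combined with the chosen norm. A minimal Python sketch follows, assuming an iCDF is a function from β to an interval half-width; the names `norm_tag`, `norm_icdf`, and `laplace_icdf`, as well as the string labels for the norms, are hypothetical.

```python
import math
from typing import Callable, FrozenSet, Sequence

ICDF = Callable[[float], float]  # hypothetical encoding: beta -> half-width


def norm_tag(*tags: FrozenSet) -> FrozenSet:
    """The tag of a norm is always empty: no known distribution."""
    return frozenset()


def norm_icdf(icdfs: Sequence[ICDF], norm: str) -> ICDF:
    """Combine per-element iCDFs, each evaluated at beta / n."""
    n = len(icdfs)

    def combined(beta: float) -> float:
        widths = [icdf(beta / n) for icdf in icdfs]
        if norm == "inf":
            return max(widths)
        if norm == "2":
            return math.sqrt(sum(w * w for w in widths))
        if norm == "1":
            return sum(abs(w) for w in widths)
        raise ValueError(f"unsupported norm: {norm}")

    return combined


def laplace_icdf(scale: float) -> ICDF:
    """Half-width t with P(|X| > t) = beta for Laplace noise of given scale."""
    return lambda beta: scale * math.log(1.0 / beta)


elements = [laplace_icdf(1.0), laplace_icdf(2.0)]
width_inf = norm_icdf(elements, "inf")(0.1)
width_l2 = norm_icdf(elements, "2")(0.1)
width_l1 = norm_icdf(elements, "1")(0.1)
```

With two Laplace-noised elements of scales 1 and 2 and β = 0.1, each element is evaluated at β/2 = 0.05, so the per-element half-widths are ln 20 and 2·ln 20, and the three results are 2·ln 20, √5·ln 20, and 3·ln 20 respectively.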

[0122] It should be noted that the word “comprising” does not exclude the presence of other elements or steps than those listed and the words “a” or “an” preceding an element do not exclude the presence of a plurality of such elements. It should further be noted that any reference signs do not limit the scope of the claims, that the invention may be at least in part implemented by means of both hardware and software, and that several “means” or “units” may be represented by the same item of hardware.

[0123] The above mentioned and described embodiments are only given as examples and should not be limiting to the present invention. Other solutions, uses, objectives, and functions within the scope of the invention as claimed in the below described patent embodiments should be apparent for the person skilled in the art.

Cpc classification

G06F16/24556 (PHYSICS)

G06F21/6245 (PHYSICS)

International classification

G06F21/62 (PHYSICS)

G06F16/2455 (PHYSICS)
