Systems and methods for optimal privacy-preserving information revelation
11574076 · 2023-02-07
Inventors
CPC classification
G06F21/6254
PHYSICS
International classification
Abstract
The present system relates to a platform for addressing the optimal privacy-accuracy trade-off in the revelation of a user's valuable information to a third party. Specifically, the present system formalizes the privacy-accuracy trade-off in a precise mathematical framework, wherein the mathematical formalization captures the user's privacy preference with a single parameter. The system provides a revelation method for user data that is optimal, in the sense of abiding by the user's privacy preference while providing the most accurate description to the third party subject to the aforementioned privacy preference constraint.
Claims
1. A method of preserving privacy in a data set used to estimate information configured to be used by a third party, the method comprising the steps of: receiving an initial information data set and a user's privacy setting, the user's privacy setting including one or more privacy instructions defining conditions for sharing regarding the initial information data set; using the initial information data set and the user's privacy setting as input, producing an adjusted information data set and a privacy-preserving stochastic map describing a mechanism used to produce the adjusted information data set; and using the adjusted information data set and the privacy-preserving stochastic map as inputs, applying a stochastic inference algorithm to produce an estimate of the initial information data set and an estimation error value; wherein the estimate of the initial information data set is an estimate of the user's likelihood of having a given genetic trait, the first data set is the user's DNA, and the initial information data set is inferred from the user's DNA and produced using a DNA sequencer.
2. The method of claim 1, wherein the adjusted information data set is constrained to meet every requirement included within the user's privacy setting.
3. The method of claim 1, further including the step of extracting an initial information data set from a first data source.
4. The method of claim 3, further including the step of producing extraction noise statistics describing the noise introduced into the initial information data set by the extraction.
5. The method of claim 4, wherein the extraction noise statistics is an additional input used to produce the adjusted information data set and the privacy-preserving stochastic map describing the mechanism used to produce the adjusted information data set.
6. The method of claim 1, further including the step of updating a prior knowledge data set based on information derived from one or more elements of a first data source.
7. The method of claim 6, wherein the updated prior knowledge data set is an additional input used to produce the adjusted information data set and the privacy-preserving stochastic map describing the mechanism used to produce the adjusted information data set.
8. The method of claim 1, wherein the initial information data set is time-variant.
9. The method of claim 1, wherein the initial information data set is time-invariant.
10. The method of claim 1, wherein the initial information data set is directly available.
11. The method of claim 1, wherein the initial information data set is inferred from a first data set.
12. The method of claim 11, wherein the estimate of the initial information data set is an estimate of the user's likelihood of taking a given action.
13. The method of claim 12, wherein the first data set is a web browsing history.
14. The method of claim 13, wherein the user's likelihood of taking a given action is the user's likelihood of purchasing a given product within a given timeframe.
15. The method of claim 1, wherein the estimate of the initial information data set is an estimate of the user's likelihood of being in a given location within a given timeframe and is inferred from a history of a user's locations.
16. The method of claim 1, wherein the step of producing the adjusted information data set and the privacy-preserving stochastic map describing the mechanism used to produce the adjusted information data set includes producing the privacy-preserving stochastic map using an updated prior knowledge data set, the user's privacy setting, and extraction noise statistics as inputs and then applying the privacy-preserving stochastic map to an extracted initial information data set to produce the adjusted information data set.
17. A method of preserving privacy in a data set used to estimate information configured to be used by a third party, the method comprising the steps of: receiving an initial information data set and a user's privacy setting, the user's privacy setting including one or more privacy instructions defining conditions for sharing regarding the initial information data set; using the initial information data set and the user's privacy setting as input, producing an adjusted information data set and a privacy-preserving stochastic map describing a mechanism used to produce the adjusted information data set; and using the adjusted information data set and the privacy-preserving stochastic map as inputs, applying a stochastic inference algorithm to produce an estimate of the initial information data set and an estimation error value; wherein the step of producing the adjusted information data set and the privacy-preserving stochastic map describing the mechanism used to produce the adjusted information data set includes producing the privacy-preserving stochastic map using an updated prior knowledge data set, the user's privacy setting, and extraction noise statistics as inputs and then applying the privacy-preserving stochastic map to an extracted initial information data set to produce the adjusted information data set; and wherein: the updated prior knowledge data set is a probability distribution over a set from which the initial information data set can take its values; the user's privacy setting includes a condition enabling the user to control a statistical distance between the adjusted information data set and the initial information data set; the stochastic inference algorithm minimizes an estimation error, wherein the estimation error is defined by an expected value of a distance between the initial information data set and the estimate of the initial information data set, where the distance is measured with respect to a given loss function; and the privacy-preserving stochastic map minimizes the stochastic inference algorithm's estimation error subject to the user's privacy setting.
18. The method of claim 17, wherein: the statistical distance is defined as a convex functional of the conditional distribution between the adjusted information data set and the initial information data set; a loss function defining an induced expected distance between the initial information data set and the estimate of the initial information data set is a concave functional of the conditional distribution between the adjusted information data set and the initial information data set; and the privacy-preserving stochastic map is computed by a two-step procedure comprising: a polyhedral set of all possible maps is partitioned into sub-polyhedral regions; and a convex maximization algorithm is carried out over each convex set defined by an intersection of every sub-polyhedral region with a set of maps that satisfy user's privacy setting.
19. A system for preserving privacy in a data set used to estimate information configured to be used by a third party, the system comprising: an initial information data set; a user's privacy setting including one or more privacy instructions defining conditions for sharing regarding the initial information data set; and a processor including memory storing instructions that, when executed, cause the processor to: produce an adjusted information data set and a privacy-preserving stochastic map describing a mechanism used to produce the adjusted information data set using the initial information data set and the user's privacy setting as inputs; and apply a stochastic inference algorithm to produce an estimate of the initial information data set and an estimation error value using the adjusted information data set and the privacy-preserving stochastic map as inputs; wherein the estimate of the initial information data set is an estimate of the user's likelihood of having a given genetic trait, the first data set is the user's DNA, and the initial information data set is inferred from the user's DNA and produced using a DNA sequencer.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION OF THE DRAWINGS
(9) The meta data source 101 contains the valuable information of the user. Valuable information extractor 102 takes this as input and processes it to produce extracted valuable information 103, along with statistics of the noise introduced in this process as extraction noise statistics 104 and updated prior knowledge 105.
(10) In some examples of the system, meta data source 101 is the readily available valuable information itself; hence unit 102 performs no processing. Accordingly, 104 is an identity mapping, i.e., no noise is introduced in the extraction process, and 105 is the initial prior knowledge, i.e., the prior knowledge of the valuable information is not updated. Examples of such embodiments include susceptibility to a certain disease due to family history.
(11) In some embodiments of the system, the valuable information is not readily available in 101; hence 102 performs a process to distill the valuable information from the meta data source 101. For example, if the valuable information is the susceptibility to a certain disease due to inherited genetic mutations in certain genes, then 102 needs to perform a measurement operation that is prone to imperfections, i.e., noise. Thus, in such embodiments extraction noise statistics 104 is not an identity mapping but the estimated statistics of the introduced noise, whereas 105 is still the initial prior knowledge, since the sought-after valuable information here is time-invariant by definition.
(12) In some embodiments of the system, the valuable information is neither readily available in 101 nor time-invariant. For example, a user's likelihood of purchasing a given product varies over time and is not directly observable. As such, for the embodiments of the system in which the valuable information is the aforementioned likelihood, this information should be inferred from a relevant meta data source, such as the user's web history, by using a certain fraction of this meta data source, such as past web activity, as a training set. Therefore, in such examples, the system outputs both the statistics of the noise introduced in this process, i.e., 104, and the updated prior knowledge, i.e., 105, gained through this training process.
(13) In light of the above examples, it will be clear to those skilled in the art that the operation of 102, and hence its outputs, are directly dictated by the type of the valuable information, and hence by the meta data source 101 containing it. For example, if the valuable information is whether the user has genetic mutations in certain genes, which, in turn, increase her susceptibility to certain diseases, then 102 might be a DNA sequencing system, along with its data processing pipeline. Extracted valuable information 103 is a noisy observation of whether the user has these genetic mutations. Hence, the outputs of the system are this noisy observation, along with the statistics of the noise introduced by the aforementioned measurement process. Note that prior knowledge on whether the user has these genetic mutations is agnostic to the aforementioned extraction process.
(14) As another example, if the valuable information is the user's likelihood of purchasing a given product, then 102 might be a data mining procedure implemented on a computer or a mobile device on which the meta data source is generated and/or stored. By using a certain amount of the user's past web activity, this procedure might be trained to obtain a potentially noisy observation of the valuable information, to estimate the parameters of an underlying dynamical model that governs the valuable information, which is not directly observable, and to estimate the statistics of the observation noise 104. Due to the dynamic nature of this procedure, the system also produces updated prior knowledge 105 about the valuable information as one of its outputs.
(15) There are three relevant valuable information types: (i) directly available; (ii) not directly available and time-invariant; and (iii) not directly available and time-varying. Note that if the valuable information is directly available, whether it is time-invariant is irrelevant. (i) In the first case, the only non-trivial output of 102 is 103, since no noise is introduced and the prior knowledge need not be updated. (ii) In the second case, the system introduces observation noise, so 103 and 104 are the non-trivial outputs of 102. The time-invariant nature of the valuable information ensures that the prior knowledge does not need to be updated. (iii) Finally, in the third case, 103, 104, and updated prior knowledge 105 are all non-trivial outputs of 102.
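The three cases above can be sketched as a toy extractor. This is a minimal illustration, not the disclosed implementation: the function name, binary alphabet, flip probability, and case labels are all assumptions introduced here for concreteness.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract(v, case, p_flip=0.1):
    """Toy sketch of unit 102: returns (extracted info 103,
    noise statistics 104, updated prior 105) for the three cases."""
    if case == "direct":                      # (i) directly available
        return v, np.eye(2), None             # 104 is an identity mapping, 105 trivial
    P_YgV = np.array([[1 - p_flip, p_flip],   # binary symmetric observation channel
                      [p_flip, 1 - p_flip]])  # playing the role of statistics 104
    y = int(rng.random() < P_YgV[v, 1])       # noisy reading of v
    if case == "static":                      # (ii) time-invariant: prior not updated
        return y, P_YgV, None
    prior = np.array([0.5, 0.5])              # (iii) time-varying: prior re-estimated
    return y, P_YgV, prior                    # (placeholder for a trained estimate)
```

In case (i) the noise statistics degenerate to an identity mapping and the prior is untouched; in case (iii) all three outputs are non-trivial, matching the enumeration above.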
(16) Privacy-adjusted valuable information computing unit 107 performs the core operation of the system by taking extracted valuable information 103, extraction noise statistics 104, updated prior knowledge 105, and user's privacy preference 106 as inputs and creating privacy-adjusted valuable information 108, along with the optimal privacy-preserving stochastic mapping 109. The optimal privacy-preserving stochastic mapping 109 is also utilized in the system to create 108, as explained below.
(18) It should be noted that the privacy-adjusting operation of 107 is optimal, as reflected by the optimality of 109, in the sense that it creates a version of extracted valuable information 103 that abides by the user's privacy preference while rendering the inference of the valuable information based on the revelation as accurate as possible subject to that privacy preference constraint. The notions of privacy and accuracy mentioned above are formalized in terms of precise mathematical definitions disclosed herein.
(19) Finally, in
(20) It should be noted that in some examples of the system, the third party may want to perform the statistical inference step herself. In those scenarios, the outputs of the system would be 104, 105, 108, and 109, so that the operation of valuable information inference unit 110 can be performed by the third party herself.
(21) Description of Optimal Privacy-preserving Stochastic Mapping Computing Unit. The optimal privacy-preserving stochastic mapping 109, which is the output of optimal privacy-preserving stochastic mapping computing unit 201, is used to create the privacy-adjusted revelation of the extracted valuable information of the user. The optimality is in the sense of achieving the optimal trade-off between the user's privacy and the third party's utility, i.e., accuracy. In order to formalize the trade-off between privacy and accuracy, one needs to have a precise way of quantifying both privacy and accuracy. Quantification of these notions, however, necessitates a rigorous mathematical framework. To this end, the system includes a statistical framework to formalize privacy and accuracy.
(22) Statistical Model. The optimal privacy-preserving stochastic mapping 109 depends on both the statistics of the noise introduced 104 in the extraction of the valuable information and the estimation procedure employed in the valuable information inference unit 110. Therefore, any attempt to formalize the optimal privacy-accuracy trade-off should account for this interplay. In order to capture this interplay in the system depicted in
(23) User data 301, which will be denoted by V henceforth, is assumed to be distributed according to a probability mass function (p.m.f.) P.sub.V, which will be called the prior distribution of user data henceforth. It is assumed that V takes values in the set 𝒱:={1, . . . , K.sub.V}, where K.sub.V ∈{2, 3, . . . }. A noisy observation of user data 303, which will be denoted by Y henceforth, is available. It is assumed that Y takes values in the set 𝒴:={1, . . . , K.sub.Y}, where K.sub.Y ∈{2, 3, . . . }. Further, Y is the output of a stochastic mapping P.sub.Y|V: 𝒱→𝒴 with V as the input of this mapping. In other words, Y is the output of a noisy observation channel 302 with input V, where the channel noise is induced by the stochastic mapping P.sub.Y|V. The third party's observation regarding user data is 305, which will be denoted by Z henceforth. It is assumed that Z takes values in the set 𝒵:={1, . . . , K.sub.Z}, where K.sub.Z ∈{2, 3, . . . }. Similar to Y, Z is the output of a stochastic mapping P.sub.Z|Y: 𝒴→𝒵 with Y as the input of this mapping. In other words, Z is the output of a privacy-preserving noisy channel 304, which performs the main operation of the model. The third party's estimate of user data 307, which will be denoted by {circumflex over (V)}(Z) henceforth, is generated by the estimator 306, which is a (potentially stochastic) mapping denoted by {circumflex over (V)}: 𝒵→𝒮, where 𝒮 is a subset of the real numbers that is a superset of 𝒱, i.e., of the set in which V takes values. It should be noted that while it is customary to restrict the range of an estimator to the set in which the random variable to be estimated takes values, the aforementioned generalization is advantageous in that it enables a wider range of loss functions to be handled, as will be evident in what follows.
(24) User data 301, i.e., V, corresponds to the valuable information of the user, which is contained in her meta data source 101. The valuable information may not be readily available in meta data source, and hence may need to be extracted. The transformation of V to Y by means of P.sub.Y|V captures this extraction process. Specifically, Y, i.e., noisy observation of user data 303, represents the extracted valuable information 103 and P.sub.Y|V, i.e., noisy observation channel 302, corresponds to the extraction noise statistics 104 in Section 2. P.sub.V, which encapsulates the prior knowledge regarding user data, corresponds to updated prior knowledge 105 on the valuable information. P.sub.Z|Y, i.e., privacy-preserving noisy channel 304, represents the statistics of the mapping carried out in privacy-adjusted valuable information computing unit 107. Z, i.e., third party's observation 305, corresponds to the privacy-adjusted valuable information 108. The operation of valuable information inference unit 110 is captured by the estimator 306 in
(26)
P.sub.VYZ=P.sub.VP.sub.Y|VP.sub.Z|Y (1)
(27) It is assumed that prior distribution of user data 301, i.e., P.sub.V, statistics of the noisy observation channel 302, i.e., P.sub.Y|V, and the statistics of the privacy-preserving noisy channel 304, i.e., P.sub.Z|Y, are known to both system designer and third party.
(28) A brief explanation of the operational meanings of the distributions on the right side of (1) is next. P.sub.V encapsulates the prior knowledge of both the system designer and the third party regarding user data 301. As such, one cannot expect either of the said parties to be able to control it. P.sub.Y|V models the imperfection(s) of the system's observation of user data; in particular, it accounts for the noise in the system's observation process of user data. P.sub.V and P.sub.Y|V can be assumed to be an arbitrary but fixed distribution and channel, respectively. Finally, P.sub.Z|Y introduces a certain amount of noise to 303, i.e., Y, according to the user's privacy preference to create the third party's observation 305, i.e., Z. Recalling the main goal of the model, i.e., to reveal a version of user data to the third party such that the privacy loss the user experiences satisfies her privacy preference while the third party's observation is as accurate as possible, the statistics of this privacy-preserving noisy channel 304, i.e., P.sub.Z|Y, are the main design variable the system designer chooses. As such, by optimally choosing P.sub.Z|Y, the system designer needs to achieve the following competing goals: (1) the user's privacy preference is satisfied; and (2) the third party's estimate 307, when it is generated by an estimator 306 that is optimal in a precise mathematical sense, is as accurate as possible.
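A minimal numerical sketch of the factorization in (1), using the stochastic-matrix view that the disclosure later formalizes in (32). The binary alphabets and all numbers are illustrative assumptions, not values from the disclosure.

```python
import numpy as np

# Toy instance of the chain V -> Y -> Z behind equation (1).
P_V   = np.array([0.3, 0.7])                  # prior P_V on user data
P_YgV = np.array([[0.9, 0.1], [0.2, 0.8]])    # observation channel P_{Y|V}
P_ZgY = np.array([[0.8, 0.2], [0.25, 0.75]])  # privacy channel P_{Z|Y} (design variable)

P_ZgV = P_YgV @ P_ZgY                         # end-to-end channel P_{Z|V}
P_VZ  = P_V[:, None] * P_ZgV                  # joint P_VZ(v, z) = P_V(v) P_{Z|V}(z|v)

assert np.allclose(P_VZ.sum(), 1.0)           # a valid joint distribution
```

The matrix product `P_YgV @ P_ZgY` is exactly the composition of the two channels; the third party's inference operates on the resulting joint `P_VZ`.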
(29) In the following sections, the notions of optimality in system designer's choice of privacy-preserving noisy channel 304, and in third party's estimate 307, are formalized.
(30) Optimum privacy-preserving noise channel. The characterization of the optimum privacy-preserving noisy channel, which will be denoted by P*.sub.Z|Y henceforth, is accomplished by imposing the privacy of the user as a constraint and optimizing the accuracy of the third party's estimate subject to this constraint. In mathematical terms, P*.sub.Z|Y is a maximizer of the following optimization problem:
(31)
P*.sub.Z|Y∈arg max ƒ.sub.acc(P.sub.V,P.sub.Y|V,P.sub.Z|Y) subject to ƒ.sub.priv(P.sub.V,P.sub.Y|V,P.sub.Z|Y)≤r.sub.o (2)
where the maximum is over all noisy channels from 𝒴 to 𝒵, the set of which will be denoted by 𝒫(𝒵|𝒴) henceforth, ƒ.sub.priv(P.sub.V, P.sub.Y|V, P.sub.Z|Y) is a normalized measure of privacy, ƒ.sub.acc(P.sub.V, P.sub.Y|V, P.sub.Z|Y) is a measure of accuracy, and r.sub.o ∈[0,1] is the privacy preference of the user, which captures the privacy loss the user can tolerate. Specifically, r.sub.o=1 means the user is indifferent to privacy loss, while r.sub.o=0 means she does not want any privacy loss at all.
(32) From a practical perspective, arguably the most relevant embodiments of the model depicted in
(33) It should be noted that (2) formalizes the privacy-accuracy trade-off for the system depicted in
(34) It will be evident to those skilled in the art that the choices of privacy and accuracy measures, i.e., ƒ.sub.priv and ƒ.sub.acc, in (2) determine the specific form of the optimization problem. Thus, in order to devise a system that computes P*.sub.Z|Y, operationally meaningful choices of accuracy and privacy are required, which will be explained in the following sections.
(35) Statistical measures of accuracy. Typically, the performance of a statistical inference procedure in the Bayesian framework, in which the model described herein falls, is measured by means of a loss function ℓ(.,.), which is a non-negative mapping defined on the cartesian product of the sets in which V and {circumflex over (V)} take their values, i.e., ℓ: 𝒱×𝒮→[0, ∞), where 𝒮 denotes the set in which {circumflex over (V)} takes its values. Thus, for a given loss function ℓ(.,.), the accuracy of an estimator {circumflex over (V)}(.) is measured in terms of the following quantity:
(36)
E[ℓ(V,{circumflex over (V)}(Z))] (3)
where E[.] denotes the expectation operator and P.sub.VZ denotes the joint distribution of V and Z, defined by:
(37)
P.sub.VZ(v,z):=Σ.sub.y∈𝒴P.sub.V(v)P.sub.Y|V(y|v)P.sub.Z|Y(z|y) (4)
(38) From a practical point of view, arguably the most important case is when the estimator 306 the third party uses to form her estimate 307 is optimal. As such, said estimator will be assumed to be optimal henceforth.
(39) In light of the above discussion on the practical merit of using an optimal estimator in 306, the following is an operationally meaningful choice for an accuracy measure, given a loss function:
(40)
ƒ.sub.acc(P.sub.V,P.sub.Y|V,P.sub.Z|Y):=c−min.sub.{circumflex over (V)}(.)E[ℓ(V,{circumflex over (V)}(Z))] (5)
for a constant c whose choice is a part of the definition of the accuracy measure.
(41) The minimum in (5) can depend on the choice of the loss function. Thus, a complete characterization of the accuracy measure defined in (5) necessitates particularizing a loss function. Two widely used loss functions in the statistics literature are the zero-one and squared-error loss functions, defined as:
(42)
ℓ.sub.0-1(v,{circumflex over (v)}):=1{v≠{circumflex over (v)}} (6)
ℓ.sub.MSE(v,{circumflex over (v)}):=(v−{circumflex over (v)}).sup.2 (7)
respectively. The next step in the complete characterization of the accuracy measure in (5) with loss function choices of either (6) or (7) is the characterization of the corresponding minima on the right side of (5) with these choices, which is explained next.
(43) First, it is well known that for the loss function in (6), the optimal estimator is the maximum a posteriori probability (MAP) estimator, defined as:
(44)
{circumflex over (V)}.sub.MAP(z):=arg max.sub.v∈𝒱P.sub.V|Z(v|z) (8)
where P.sub.V|Z, i.e., conditional distribution of V given Z, is defined according to Bayes' rule:
(45)
P.sub.V|Z(v|z)=P.sub.VZ(v,z)/P.sub.Z(z) (9)
where P.sub.Z(z):=Σ.sub.v∈𝒱P.sub.VZ(v,z) (10)
and the ties in (8) are broken uniformly at random, i.e., if the maximizer in (8) is unique, then it is declared as the outcome of the estimator; else, one of the maximizers is chosen uniformly at random and declared as the outcome of the estimator. This randomization in the presence of multiple maximizers is fairly standard in MAP estimation theory, since any optimizer of (8) is exactly as good as any other optimizer in terms of minimizing the cost function in (6). It will be evident to those skilled in the art that the average correct decision probability of the MAP rule, which will be denoted by Γ.sub.MAP(P.sub.V, P.sub.Y|V, P.sub.Z|Y) henceforth, can be written as follows:
(46)
Γ.sub.MAP(P.sub.V,P.sub.Y|V,P.sub.Z|Y)=E[1{{circumflex over (V)}.sub.MAP(Z)=V}]=Σ.sub.z∈𝒵max.sub.v∈𝒱P.sub.VZ(v,z) (11)
where 1{·} is the standard indicator function. Equation (11), along with the choice of c=1 in (5), gives the following accuracy measure:
ƒ.sub.acc,MAP(P.sub.V,P.sub.Y|V,P.sub.Z|Y):=Γ.sub.MAP(P.sub.V,P.sub.Y|V,P.sub.Z|Y) (12)
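As an illustration of (11)-(12), the sketch below computes Γ.sub.MAP on toy binary alphabets: for each z it keeps the largest joint mass P.sub.VZ(v, z), i.e., the probability that the MAP rule decides correctly. All alphabets and numbers are illustrative assumptions.

```python
import numpy as np

def gamma_map(P_V, P_YgV, P_ZgY):
    """Accuracy measure (12): MAP correct-decision probability,
    Gamma_MAP = sum_z max_v P_VZ(v, z), per equation (11)."""
    P_VZ = P_V[:, None] * (P_YgV @ P_ZgY)   # joint of (V, Z)
    return float(P_VZ.max(axis=0).sum())    # a-posteriori-best v for each z

P_V   = np.array([0.4, 0.6])
P_YgV = np.array([[0.9, 0.1], [0.1, 0.9]])
eye   = np.eye(2)                           # no privacy noise at all
full  = np.full((2, 2), 0.5)                # fully private channel (Z independent of V)

acc_clear = gamma_map(P_V, P_YgV, eye)      # ≈ 0.9: MAP accuracy on Y itself
acc_priv  = gamma_map(P_V, P_YgV, full)     # ≈ 0.6: falls back to guessing the prior mode
```

The two extreme channels bracket the trade-off: any admissible P.sub.Z|Y yields an accuracy between these values.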
(47) Secondly, it is well known that for the loss function in (7), the optimal estimator is the conditional mean, i.e.,
{circumflex over (V)}.sub.MSE(z):=E[V|Z=z] (13)
with the corresponding minimum mean-square error (MMSE):
(48)
min.sub.{circumflex over (V)}(.)E[(V−{circumflex over (V)}(Z)).sup.2]=E[Var [V|Z]]
where Var [V|Z] denotes the conditional variance of V given Z. Hence, picking c=0 in (5) gives the following accuracy measure:
ƒ.sub.acc,MSE(P.sub.V,P.sub.Y|V,P.sub.Z|Y):=−E[Var [V|Z]] (14)
Equations (12) and (14) complete the characterization of (5) for the loss functions given in (6) and (7).
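Similarly, the accuracy measure in (14) can be sketched numerically: compute the posterior by Bayes' rule, then the negated expected conditional variance −E[Var[V|Z]]. Taking V to be valued in {0, 1} and the channels below are illustrative assumptions for concreteness.

```python
import numpy as np

def f_acc_mse(P_V, P_YgV, P_ZgY, values=None):
    """Accuracy measure (14): negated MMSE of the conditional-mean
    estimator (13), i.e., -E[Var[V|Z]]."""
    v = np.arange(len(P_V)) if values is None else np.asarray(values)
    P_VZ = P_V[:, None] * (P_YgV @ P_ZgY)
    P_Z = P_VZ.sum(axis=0)
    P_VgZ = P_VZ / P_Z                                     # posterior columns P_{V|Z}
    cond_mean = (v[:, None] * P_VgZ).sum(axis=0)           # E[V | Z=z]
    cond_var = ((v[:, None] - cond_mean) ** 2 * P_VgZ).sum(axis=0)
    return float(-(P_Z * cond_var).sum())                  # -E[Var[V|Z]]
```

With noiseless channels the posterior is degenerate and the measure attains its maximum 0; with a fully private channel it drops to minus the prior variance.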
(49) Statistical measures of privacy. In general, it is not a straightforward task to quantify an abstract notion like privacy. Nevertheless, one can gain some insight by recalling the extremes of privacy preferences in the framework, where r.sub.o=1 and r.sub.o=0. Specifically, r.sub.o=1, which means that the user is indifferent to privacy loss, implies that the user allows the possibility of the following scenario: Y and Z convey an equal amount of information about V. Conversely, r.sub.o=0, which means that the user does not want any privacy loss, necessitates the following: V and Z are statistically independent, because otherwise one cannot guarantee that knowledge of Z provides no additional information compared with the prior knowledge about V. In light of these observations, decreasing the information content of Z about V compared to the information content of Y about V is a way of reducing the privacy loss the user experiences. Alternatively, the amount of statistical dependence of V and Z compared to the amount of statistical dependence of V and Y can be an indicator of the amount of privacy loss the user experiences. Thus, we have the following two broad avenues to formalize a statistical measure of privacy: (1) the information Z conveys about V compared to the information Y conveys about V; and (2) the amount of statistical dependence between V and Z compared to the amount of statistical dependence between V and Y.
(50) In order to proceed with the first alternative above, one needs to choose a measure of information one random variable conveys about another random variable, i.e., information measure. Among many information measures in the literature, Rényi information is noteworthy thanks to its generality and operational significance in various practical settings. Specifically, for any λ>1 and pair of random variables (V, Y) with joint distribution P.sub.VY, Rényi information of order λ between V and Y, denoted by I.sub.λ(V;Y), is defined as follows:
(51)
I.sub.λ(V;Y):=(λ/(λ−1))log Σ.sub.y∈𝒴(Σ.sub.v∈𝒱P.sub.V(v)P.sub.Y|V(y|v).sup.λ).sup.1/λ (15)
It is well known that I.sub.λ(V;Y) is increasing in λ, and also satisfies:
(52)
lim.sub.λ↓1I.sub.λ(V;Y)=I(V;Y) (16)
where the right side is the well-known mutual information, which is arguably the most relevant information measure from a practical perspective, defined as:
(53)
I(V;Y):=Σ.sub.v∈𝒱Σ.sub.y∈𝒴P.sub.VY(v,y)log(P.sub.VY(v,y)/(P.sub.V(v)P.sub.Y(y))) (17)
with P.sub.Y denoting the marginal distribution of Y. Thus, (16) ensures that Rényi information recovers mutual information as a limiting case. Further, Rényi information also satisfies the data processing inequality, i.e.,
I.sub.λ(V;Z)≤I.sub.λ(V;Y) (18)
which, in turn, aids us in putting forward the following measure of privacy:
(54)
ƒ.sub.priv,inf(P.sub.V,P.sub.Y|V,P.sub.Z|Y):=I.sub.λ(V;Z)/I.sub.λ(V;Y)∈[0,1] (19)
where the last assertion follows from (18) and the non-negativity of (15).
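The Rényi-information privacy measure can be sketched as below. The closed form used here is Sibson's version of Rényi information, a common choice consistent with the properties cited above (monotone in λ, recovers mutual information as λ→1, and satisfies data processing); whether it matches the disclosure's exact definition is an assumption, and the alphabets and numbers are illustrative.

```python
import numpy as np

def renyi_info(P_A, P_BgA, lam):
    """Sibson's Renyi information of order lam > 1 (assumed form):
    lam/(lam-1) * log sum_b (sum_a P_A(a) P_{B|A}(b|a)^lam)^(1/lam)."""
    inner = (P_A[:, None] * P_BgA ** lam).sum(axis=0) ** (1.0 / lam)
    return float(lam / (lam - 1.0) * np.log(inner.sum()))

def f_priv_inf(P_V, P_YgV, P_ZgY, lam=2.0):
    """Privacy measure (19): the ratio I_lam(V;Z) / I_lam(V;Y) in [0, 1]."""
    return renyi_info(P_V, P_YgV @ P_ZgY, lam) / renyi_info(P_V, P_YgV, lam)

P_V   = np.array([0.5, 0.5])
P_YgV = np.array([[0.9, 0.1], [0.1, 0.9]])
ratio_clear = f_priv_inf(P_V, P_YgV, np.eye(2))           # 1: Z as informative as Y
ratio_priv  = f_priv_inf(P_V, P_YgV, np.full((2, 2), 0.5))  # 0: Z independent of V
```

The two endpoints reproduce the operational reading of r.sub.o: a ratio of 1 corresponds to no privacy protection, a ratio of 0 to full statistical independence of V and Z.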
(55) A classical way of measuring statistical dependence is to quantify the “distance” between the joint distribution of random variables and the product of their marginal distributions, which would have been the joint distribution if they were independent. Divergences provide such a tool, one that also has an operational meaning in terms of various statistical inference methods. A broad family of divergences is the ƒ-divergence. In particular, for a given convex function ƒ such that ƒ(1)=0, the ƒ-divergence between P.sub.VY and P.sub.VP.sub.Y is defined as follows:
(56)
D.sub.f(P.sub.VY∥P.sub.VP.sub.Y):=Σ.sub.v∈𝒱Σ.sub.y∈𝒴P.sub.V(v)P.sub.Y(y)ƒ(P.sub.VY(v,y)/(P.sub.V(v)P.sub.Y(y))) (20)
Various well-known divergences, such as relative entropy, chi-squared divergence, Hellinger divergence, Rényi divergence, and so on, can be shown to be an ƒ-divergence for some convex function ƒ. Further,
D.sub.f(P.sub.VY∥P.sub.VP.sub.Y)≥0 (21)
with equality if and only if V and Y are independent, and it also satisfies the data processing inequality, i.e.,
D.sub.f(P.sub.VZ∥P.sub.VP.sub.Z)≤D.sub.f(P.sub.VY∥P.sub.VP.sub.Y) (22)
which, in turn, aids us in putting forward the following measure of privacy:
(57)
ƒ.sub.priv,div(P.sub.V,P.sub.Y|V,P.sub.Z|Y):=D.sub.f(P.sub.VZ∥P.sub.VP.sub.Z)/D.sub.f(P.sub.VY∥P.sub.VP.sub.Y)∈[0,1] (23)
where the last assertion follows from (21) and (22).
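The divergence-ratio privacy measure can be instantiated with any ƒ-divergence; the sketch below uses the chi-squared divergence (ƒ(t)=(t−1).sup.2), chosen here only because it has a simple closed form. The alphabets and numbers are illustrative assumptions.

```python
import numpy as np

def chi2_dependence(P_A, P_BgA):
    """Chi-squared f-divergence D_f(P_AB || P_A x P_B) with f(t) = (t-1)^2,
    i.e., sum over (a, b) of (P_AB - P_A*P_B)^2 / (P_A*P_B)."""
    P_AB = P_A[:, None] * P_BgA
    prod = np.outer(P_A, P_AB.sum(axis=0))   # product of the marginals
    return float(((P_AB - prod) ** 2 / prod).sum())

def f_priv_div(P_V, P_YgV, P_ZgY):
    """Privacy measure (23): a divergence ratio lying in [0, 1]."""
    return chi2_dependence(P_V, P_YgV @ P_ZgY) / chi2_dependence(P_V, P_YgV)

P_V   = np.array([0.4, 0.6])
P_YgV = np.array([[0.85, 0.15], [0.1, 0.9]])
assert abs(f_priv_div(P_V, P_YgV, np.eye(2)) - 1.0) < 1e-12   # no extra noise
assert f_priv_div(P_V, P_YgV, np.full((2, 2), 0.5)) < 1e-12   # independence: ratio 0
```

By (21) and the data processing inequality (22), the ratio is guaranteed to lie in [0, 1] for any choice of ƒ, which is what makes it a normalized privacy measure.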
(58) Analysis. The choices for measuring the accuracy and privacy give a special structure to the constrained optimization problem in (2), which, in turn, is employed to devise an efficient method to compute the optimum value and an optimizer in (2).
(59) Theorem 1. Particularization of (2) with:
ƒ.sub.acc←ƒ.sub.acc,acc-meas (24)
ƒ.sub.priv←ƒ.sub.priv,priv-meas (25)
where priv-meas ∈{inf, div} and acc-meas ∈{MAP, MSE} is a convex maximization problem over a convex set.
Proof. We prove that with the aforementioned choices, the feasible set as well as the cost function is convex and begin with the feasible set, i.e., privacy measure. For notational convenience, let:
(60)
S(P.sub.V,P.sub.Y|V,λ,r.sub.o):={P.sub.Z|Y∈𝒫(𝒵|𝒴): I.sub.λ(V;Z)≤r.sub.oI.sub.λ(V;Y)}
S(P.sub.V,P.sub.Y|V,ƒ,r.sub.o):={P.sub.Z|Y∈𝒫(𝒵|𝒴): D.sub.f(P.sub.VZ∥P.sub.VP.sub.Z)≤r.sub.oD.sub.f(P.sub.VY∥P.sub.VP.sub.Y)}
where λ>1, ƒ is a convex function with ƒ(1)=0, and r.sub.o ∈[0, 1].
Lemma 1. S(P.sub.V, P.sub.Y|V, λ, r.sub.o) is a convex set, i.e.,
αP.sub.Z|Y+(1−α)Q.sub.Z|Y∈S(P.sub.V,P.sub.Y|V,λ,r.sub.o) (26)
for any P.sub.Z|Y,Q.sub.Z|Y ∈S(P.sub.V,P.sub.Y|V,λ,r.sub.o) and α∈(0,1).
Proof. First, the condition that P.sub.Z|Y ∈𝒫(𝒵|𝒴) induces a set of linear equality constraints, and the Rényi information ratio term induces a non-linear inequality constraint. In order to represent the non-linear constraint in a more convenient way, define the following function, denoted by g henceforth, for a given P.sub.A, P.sub.B|A pair and λ>1:
(61)
g(P.sub.A,P.sub.B|A,λ):=Σ.sub.b(Σ.sub.aP.sub.A(a)P.sub.B|A(b|a).sup.λ).sup.1/λ (27)
Equations (15) and (27) imply that:
(62)
I.sub.λ(V;Z)=(λ/(λ−1))log g(P.sub.V,P.sub.Y|VP.sub.Z|Y,λ) (28)
which, along with the monotonicity of log(·), implies that:
(63)
P.sub.Z|Y∈S(P.sub.V,P.sub.Y|V,λ,r.sub.o) if and only if g(P.sub.V,P.sub.Y|VP.sub.Z|Y,λ)≤exp(((λ−1)/λ)r.sub.oI.sub.λ(V;Y)) (29)
Hence, in order to conclude the proof, it suffices to prove that g(P.sub.V, P.sub.Y|VP.sub.Z|Y, λ) is a convex function of P.sub.Z|Y. Note that if we define the conditional distribution of Z given V, i.e., P.sub.Z|V, as:
(64)
P.sub.Z|V(z|v):=Σ.sub.y∈𝒴P.sub.Z|Y(z|y)P.sub.Y|V(y|v) (30)
then, it is easy for those skilled in the art to verify that, as a function of P.sub.Z|V,
g(P.sub.V,P.sub.Z|V,λ)=g(P.sub.V,P.sub.Y|VP.sub.Z|Y,λ) (31)
is convex. Further, if one views P.sub.Y|V and P.sub.Z|Y as stochastic matrices, i.e., each row containing P.sub.Y|V(·|v) and P.sub.Z|Y(·|y), which are probability distributions on 𝒴 and 𝒵, respectively, then (30) can be written as a standard matrix multiplication, i.e.,
P.sub.Z|V=P.sub.Y|VP.sub.Z|Y (32)
One can conclude the proof by capitalizing on the linearity of the right side of (32), along with the aforementioned convexity of g(P.sub.V, ·, λ). QED
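The central fact in the proof above can be checked numerically: with g denoting the function defined in (27) (the closed form below is assumed from the Rényi information definition), the map P.sub.Z|Y ↦ g(P.sub.V, P.sub.Y|VP.sub.Z|Y, λ) lies under its chords along any segment between two channels. All distributions here are illustrative toy values.

```python
import numpy as np

def g(P_A, P_BgA, lam):
    """The function in (27): sum_b (sum_a P_A(a) P_{B|A}(b|a)^lam)^(1/lam)."""
    return float(((P_A[:, None] * P_BgA ** lam).sum(axis=0) ** (1.0 / lam)).sum())

rng   = np.random.default_rng(1)
P_V   = np.array([0.3, 0.7])
P_YgV = np.array([[0.9, 0.1], [0.2, 0.8]])
P = rng.dirichlet(np.ones(2), 2)      # two random 2x2 privacy channels
Q = rng.dirichlet(np.ones(2), 2)      # (each row is a distribution over Z)

lam = 2.0
for a in np.linspace(0.0, 1.0, 11):   # convexity: function value <= chord value
    mix = a * P + (1.0 - a) * Q
    lhs = g(P_V, P_YgV @ mix, lam)
    rhs = a * g(P_V, P_YgV @ P, lam) + (1.0 - a) * g(P_V, P_YgV @ Q, lam)
    assert lhs <= rhs + 1e-12
```

The check passes because each term of g is a weighted λ-norm of a column of the end-to-end channel (convex for λ>1), composed with the linear map in (32), exactly as Lemma 1 argues.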
Lemma 2. S(P.sub.V, P.sub.Y|V, ƒ, r.sub.o) is a convex set, i.e.,
αP.sub.Z|Y+(1−α)Q.sub.Z|Y ∈S(P.sub.V,P.sub.Y|V,ƒ,r.sub.o) (33)
for any P.sub.Z|Y, Q.sub.Z|Y ∈S(P.sub.V, P.sub.Y|V, ƒ, r.sub.o) and α∈(0,1).
Proof. The proof follows via arguments similar to Lemma 1, by replacing the convexity of the function defined in (27) with the convexity of D.sub.f(P.sub.VZ∥P.sub.VP.sub.Z) as a function of P.sub.Z|Y, which is an easy consequence of the facts that D.sub.f(·∥·) is jointly convex in its arguments, that the operations in (32) are linear, and that the marginalization of a joint distribution is linear. QED
Lemma 3. ƒ.sub.acc,MAP(P.sub.V, P.sub.Y|V, .Math.) is a convex function over noisy channels from 𝒴 to 𝒵.
Proof. One can write Γ.sub.MAP(P.sub.V, P.sub.Y|V, P.sub.Z|Y) as follows:
(65)
In light of (34), one can conclude the proof by recalling the facts that the sum of convex functions is also a convex function and that the pointwise maximum of a family of affine functions is a convex function. QED
Lemma 4. ƒ.sub.acc,MSE(P.sub.V, P.sub.Y|V, .Math.) is a convex function over noisy channels from 𝒴 to 𝒵.
Proof. It is easy to show that −E[Var[V|Z]] is convex as a function of the joint distribution P.sub.VZ. This, along with the linearity of the operation in (32), allows us to conclude the proof. QED Lemmas 1-4 together imply Theorem 1. QED
(66) Equipped with Theorem 1, a high-level description of a system to compute the statistics of an optimal privacy-preserving noisy channel is given in
(67) Computation of the optimal privacy-preserving noisy channel. All the maximizer(s) of a convex function over a convex set occur(s) at the extreme points of the feasible set. Thus, finding a global maximizer of a convex function over a convex set typically necessitates evaluating the cost function at all extreme points, which, in turn, makes it computationally demanding. Further, when the feasible convex set is not a polyhedron (a polyhedron is a convex set defined by the intersection of a finite number of half-spaces; a bounded polyhedron is called a convex polytope, or simply a polytope), its extreme points may form a continuum, i.e., the set of extreme points of the feasible set is uncountably infinite, and it becomes extremely difficult to devise an algorithm that computes the global maximum exactly in finite time. Thus, it is customary to devise iterative methods that compute upper and lower bounds on this value that approach each other in every iteration. Although one such methodology, in the spirit of the outer approximations in the convex maximization literature, can be directly applied to implement the system 404, it is possible to get a significant performance improvement by exploiting the particular properties of the problem in (2) beyond its convexity.
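The vertex-evaluation principle above can be illustrated with a minimal sketch, using a hypothetical convex cost over a toy polytope (not the actual objective in (2)):

```python
# Maximizing a convex function over a polytope: the maximum is attained at
# an extreme point, so it suffices to evaluate the cost at every vertex.
# Toy convex cost f(x, y) = x^2 + y^2 over the unit square.

def f(x, y):
    return x * x + y * y

# Vertices (extreme points) of the unit square [0, 1] x [0, 1].
vertices = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]

best_vertex = max(vertices, key=lambda v: f(*v))
best_value = f(*best_vertex)
```

Any interior point of the square attains a strictly smaller value than the best vertex, which is why vertex enumeration, however costly, is unavoidable in exact convex maximization.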
(68) Method for computing the optimal privacy-preserving noisy channel.
(69) The offline method consists of two sub-methods, the initial polyhedron generator 504 and the pre-processing method 507. The pre-processing method, which will be explained separately in a following section, has two inputs, the depth of partitioning 506 and the initial polyhedron 505. The initial polyhedron 505, denoted by S.sub.o, is generated by the method 504, whose only input 503 is the pair (K.sub.Y, K.sub.Z), i.e., the cardinalities of the input and output alphabets of the privacy-preserving noisy channels. The output of the offline method is a non-redundant partition of the initial polyhedron 508.
(70) The main operation of the online method is carried out by 510, i.e., the purging partitioned polyhedron (P.P.P.) method, which will be explained separately in a following section. The P.P.P. method has two inputs, a non-redundant partition of the initial polyhedron 508 and a 5-tuple of parameters 509, which includes P.sub.V, i.e., the prior distribution of user information; P.sub.Y|V, i.e., the statistics of the noisy observation channel; r.sub.o, i.e., the user's privacy preference; acc-meas ∈{MAP, MSE}, i.e., the accuracy measure choice; and priv-meas ∈{inf, div}, i.e., the privacy measure choice. The output of the online method 511, which is also the output of the overall method, consists of an approximation of the optimum value of the optimization problem in (2) and a channel achieving this value.
(71) Unlike the pre-processing method 507 and the P.P.P. method 510, the operation of the initial polyhedron generating method 504 will be evident to those skilled in the art, and it is explained here for the sake of completeness. To this end, first define the following mapping from a given stochastic matrix of size K.sub.Y×K.sub.Z to a K.sub.YK.sub.Z-length vector:
{right arrow over (x)}(P.sub.Z|Y):=[P.sub.Z|Y(.Math.|1),P.sub.Z|Y(.Math.|2), . . . ,P.sub.Z|Y(.Math.|K.sub.Y)] (37)
where P.sub.Z|Y(.Math.|i) is a probability distribution on 𝒵, corresponding to the transition probabilities when the input to the channel is i ∈{1, . . . , K.sub.Y}. Let P.sub.Z|Y({right arrow over (x)}) denote its inverse mapping from a given K.sub.YK.sub.Z-length vector to a matrix of size K.sub.Y×K.sub.Z. The fact that P.sub.Z|Y is a transition probability matrix can be captured by imposing the following linear constraints on {right arrow over (x)}(P.sub.Z|Y):
(72)
where the subscript i denotes the i-th element of a vector. Equations (38) and (39) suffice to characterize the linear constraints of the feasible set. Let A.sub.o denote the matrix that succinctly summarizes (38) and (39). In particular, let:
S.sub.o:={{right arrow over (x)}: A.sub.o{right arrow over (x)}≤{right arrow over (b)}.sub.o} (40)
denote the polyhedron corresponding to (38) and (39). Here, {right arrow over (b)}.sub.o represents the vector that collects the right sides of (38) and (39), the inequality is understood element-wise, and we convert the equality constraints in (39) into pairs of inequalities by the usual methodology exemplified below:
[a=b]⇔[a≥b and −a≥−b] (41)
for any a, b ∈ℝ. Finally, note that:
S.sub.o={{right arrow over (x)}(P.sub.Z|Y): P.sub.Z|Y∈𝒫(𝒵|𝒴)} (42)
which is 505, i.e., the set of all noisy channels from 𝒴 to 𝒵 in vectorized form, in the sense of (37).
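A minimal sketch of the constraint construction in (38)-(42) follows, under assumed alphabet sizes; the variable names and sizes are illustrative, not the system's implementation:

```python
# Sketch of the vectorization in (37) and the polyhedron S_o in (40), for
# hypothetical alphabet sizes K_Y = 2, K_Z = 3.
K_Y, K_Z = 2, 3
n = K_Y * K_Z

A_o, b_o = [], []

# Non-negativity of every entry, written as -x_k <= 0 (K_Y*K_Z rows).
for k in range(n):
    row = [0.0] * n
    row[k] = -1.0
    A_o.append(row)
    b_o.append(0.0)

# Each row of P_{Z|Y} sums to 1; per (41), a = b becomes a <= b and -a <= -b.
for i in range(K_Y):
    for sign in (1.0, -1.0):
        row = [0.0] * n
        for j in range(K_Z):
            row[i * K_Z + j] = sign
        A_o.append(row)
        b_o.append(sign * 1.0)

def satisfies(x):
    """Check A_o x <= b_o element-wise (with a small tolerance)."""
    return all(sum(a * v for a, v in zip(row, x)) <= b + 1e-12
               for row, b in zip(A_o, b_o))

# Vectorization x(P_{Z|Y}) of a valid channel: concatenate the rows.
P = [[0.2, 0.3, 0.5], [1.0, 0.0, 0.0]]
x = [p for row in P for p in row]
```

Every row-stochastic matrix, once vectorized, satisfies A_o x ≤ b_o, while matrices with negative entries or off-unit row sums violate some constraint.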
(73) Pre-processing method. Pre-processing method 507, which is depicted in
(74) It is important to note that the redundancy removal routine 603 explained in this section does not depend on P.sub.V, P.sub.Y|V, r.sub.o, acc-meas, priv-meas. As such, it can be computed offline and used for any of the problem instances with the corresponding (K.sub.Y, K.sub.Z) values.
(75) Polyhedron partitioning routine. Consider an arbitrary but fixed polyhedron S⊂ℝ.sup.n. Let {F.sub.1, . . . , F.sub.N} be the facets, i.e., the (n−1)-dimensional faces, of S, and assume that N>2. Further, let {V.sub.1, . . . , V.sub.N} denote the V-representations of the facets, i.e., for each i ∈{1, . . . , N}, V.sub.i={{right arrow over (v)}.sub.i,1, . . . , {right arrow over (v)}.sub.i,M.sub.i} is the vertex set of the facet F.sub.i, which is a polyhedron by definition. Let {right arrow over (x)}.sub.o be the center of mass of S, which can be found as the average of the vertices of S. To be precise, let V.sub.o={{right arrow over (v)}.sub.o,1, . . . , {right arrow over (v)}.sub.o,M.sub.o} denote the set of vertices of S; then:
(76) {right arrow over (x)}.sub.o=(1/M.sub.o)Σ.sub.k{right arrow over (v)}.sub.o,k
For every i ∈{1, . . . , N}, define the polyhedron:
S.sub.o,i:=conv(V.sub.i∪{{right arrow over (x)}.sub.o}) (60)
i.e., the convex hull of the points {{right arrow over (v)}.sub.i,1, . . . , {right arrow over (v)}.sub.i,M.sub.i} together with {right arrow over (x)}.sub.o.
(77) One can use this procedure in the polyhedron partitioning routine 601 as many times as specified by the depth of partitioning 506, by forming a tree of polyhedra, where at each step each polyhedron is partitioned by the aforementioned procedure to form the nodes of the next step, and the root of the tree is S.sub.o, i.e., 505. The aforementioned steps result in a partitioning of the initial polyhedron S.sub.o, i.e., 602.
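The partitioning step can be sketched in two dimensions, where facets are edges and each sub-polytope conv(facet ∪ {x_o}) is a triangle with apex at the center of mass; the square used here is an illustrative assumption:

```python
# Sketch of the facet-centroid partitioning in 2D: split a polygon into
# triangles, one per facet (edge), each with apex at the center of mass.
# Hypothetical example polytope: the unit square.

def shoelace_area(pts):
    """Area of a polygon given its vertices in order (shoelace formula)."""
    n = len(pts)
    s = 0.0
    for i in range(n):
        x1, y1 = pts[i]
        x2, y2 = pts[(i + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

vertices = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

# Center of mass x_o: the average of the vertices, as in (76).
x_o = (sum(p[0] for p in vertices) / 4, sum(p[1] for p in vertices) / 4)

# Facets of a 2D polytope are its edges; each S_{o,i} = conv(facet, x_o).
facets = [(vertices[i], vertices[(i + 1) % 4]) for i in range(4)]
pieces = [[a, b, x_o] for (a, b) in facets]

# The pieces partition the original polygon: their areas sum to its area.
total = sum(shoelace_area(t) for t in pieces)
```

The area check confirms that the sub-polytopes cover the original set without overlap of positive measure, which is exactly what a partition requires.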
(78) Redundancy removal routine. Consider the above partitioning scheme with S←S.sub.o, where S.sub.o is given in (42). Based on the descriptions provided herein, it will be evident to those skilled in the art that it is possible to determine an H-representation of the facets of S.sub.o by imposing each inequality constraint as an equality constraint sequentially, since the H-representation of S.sub.o is in non-redundant form. Further, the center of mass is simply:
(79) {right arrow over (x)}.sub.o=(1/K.sub.Z){right arrow over (1)}
where {right arrow over (1)} denotes the vector of 1s. Hence, applying the procedure in Section 3.2.1 results in the set of sub-polyhedra:
{S.sub.o,1, . . . ,S.sub.o,K.sub.Y.sub.K.sub.Z} (61)
since there are K.sub.YK.sub.Z inequality constraints in the H-representation of S.sub.o.
(80) One can apply the same procedure to every S.sub.o,i to further partition the original set. However, as is argued below, some of the polyhedra in (61) might be “column permutations of each other”, a notion whose precise definition is given below, which, in turn, makes them redundant. Thus, filtering out these redundant sub-polyhedra as a starting point of the next level of partitioning significantly improves performance without resulting in a loss of optimality.
(81) Consider two arbitrary elements S.sub.o,i, S.sub.o,j of the set of polyhedra in (61). Let:
V.sub.o,i={{right arrow over (v)}.sub.o,i,1, . . . , {right arrow over (v)}.sub.o,i,M.sub.i}
V.sub.o,j={{right arrow over (v)}.sub.o,j,1, . . . , {right arrow over (v)}.sub.o,j,M.sub.j}
denote V-representations of S.sub.o,i and S.sub.o,j, respectively. If: (i) |V.sub.o,i|=|V.sub.o,j|; and (ii) for all k ∈{1, . . . , M.sub.i},
P.sub.Z|Y({right arrow over (v)}.sub.o,j,k)=P.sub.Z|Y({right arrow over (v)}.sub.o,i,k)Q.sub.P (64)
for an arbitrary but fixed permutation matrix Q.sub.P of size K.sub.Z×K.sub.Z, then S.sub.o,i and S.sub.o,j are equivalent as far as the optimization problem in (2) goes, since for any {right arrow over (x)}∈S.sub.o,i (resp. {right arrow over (y)}∈S.sub.o,j), there exists some {right arrow over (y)}∈S.sub.o,j (resp. {right arrow over (x)}∈S.sub.o,i) with exactly the same cost and constraint function values; the proof follows from routine calculations that can be carried out by those skilled in the art.
(82) Thus, one can define an equivalence class of sub-polyhedra in terms of the property defined by the conditions in items (i) and (ii) above. In each equivalence class, it is sufficient to keep only one representative for either further partitioning or evaluating the optimum value.
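The equivalence test in items (i) and (ii) can be sketched by brute force over column permutations; the vertex channels below are hypothetical and small enough for exhaustive search:

```python
from itertools import permutations

# Sketch of the column-permutation equivalence test in (64): two vertex
# lists of channels are equivalent if one fixed column permutation maps
# every vertex of the first list to the corresponding vertex of the second.

def permute_columns(M, perm):
    """Apply the column permutation `perm` to matrix M (list of rows)."""
    return [[row[j] for j in perm] for row in M]

def are_permutation_equivalent(A_list, B_list):
    """Check conditions (i) and (ii): same cardinality, and one single
    permutation Q_P with B_k = A_k Q_P for all k (brute force)."""
    if len(A_list) != len(B_list):
        return False  # condition (i) fails
    K_Z = len(A_list[0][0])
    for perm in permutations(range(K_Z)):
        if all(permute_columns(A, perm) == B
               for A, B in zip(A_list, B_list)):
            return True
    return False

# Hypothetical vertex channels with K_Y = 1, K_Z = 3; B applies the same
# column permutation (2, 0, 1) to both vertices of A.
A_list = [[[1.0, 0.0, 0.0]], [[0.5, 0.5, 0.0]]]
B_list = [[[0.0, 1.0, 0.0]], [[0.0, 0.5, 0.5]]]
```

Brute force over K_Z! permutations is only a sketch; in the method, one representative per equivalence class is kept, which shrinks the next level of partitioning.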
(83) Purging partitioned polyhedron (P.P.P.) method. The inputs of this method are the non-redundant partition of the initial polyhedron 508 and the 5-tuple of parameters 509, i.e., the partition of S.sub.o and the problem parameters, respectively.
(84) Before proceeding further, define the following quantities: For any given polyhedron S⊂S.sub.o such that its intersection with the feasible set, i.e.,
S.sub.priv-meas(P.sub.V,P.sub.Y|V,r.sub.o):={{right arrow over (x)}(P.sub.Z|Y): P.sub.Z|Y∈𝒫(𝒵|𝒴), ƒ.sub.priv,priv-meas(P.sub.V,P.sub.Y|V,P.sub.Z|Y)≤r.sub.o} (67)
is not empty, ū(k,S) and u(k,S) denote the outer and inner approximation values of the optimization problem:
(85)
with at most k iterations of the simple approximation routine, which is explained in Section 3.2.3.2.
(86) Returning to the description, the following steps are repeated n times, where I.sub.m−1 denotes the set of indices of the sub-polyhedra at the beginning of step m; m=1, i.e., 704, is the initial value for the iteration counter; and {k.sub.1, . . . , k.sub.n}, i.e., 705, determines the max-iter for the simple inner approximation at each step, as well as the total number of times the following steps are repeated, i.e., n: 1. For every i∈I.sub.m−1, compute ū(k.sub.m, S.sub.o,i(m−1)), u(k.sub.m, S.sub.o,i(m−1)), and the updated polyhedron S.sub.o,i(m) via the simple approximation procedure, i.e., perform the operation 706, and output these quantities, i.e., 707. 2. Eliminate all sub-optimal polyhedra 708 with
(87)
(88) since they cannot contain a global maximizer. In other words, compute:
I.sub.m:={i∈I.sub.m−1: ū(k.sub.m,S.sub.o,i(m−1))≥max.sub.j∈I.sub.m−1u(k.sub.m,S.sub.o,j(m−1))}
and output the 4-tuple (I.sub.m, {S.sub.o,i(m)}.sub.i∈I.sub.m, {ū(k.sub.m,S.sub.o,i(m−1))}.sub.i∈I.sub.m, {u(k.sub.m,S.sub.o,i(m−1))}.sub.i∈I.sub.m) for the next iteration, i.e., perform the operation 709. 3. Check whether the current iteration m exceeds n+1, i.e., perform 710. If not, increment the counter by 1, i.e., perform 711.
(89) Finally, once I.sub.n, {S.sub.o,i(n)}.sub.i∈I.sub.n, and the corresponding outer and inner approximation values are computed, a final application of the simple approximation routine, i.e., 713, is used to produce the output of the method, i.e., 511.
(90) The following points regarding the practical implementation of the above procedure should be noted. First, the applications of the simple approximation routine in 706 and 713 can be carried out in parallel. Thus, the computationally most demanding part of the method can be accomplished in a parallelized fashion. Second, in practice, as the granularity of the initial partition increases, the number of cuts needed to get close enough inner and outer approximations in the simple approximation procedure appears to decrease. Hence, the computation time for each sub-polyhedron appears to decrease as the granularity of the partition increases.
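The purging step can be sketched as follows, with hypothetical bound values; this illustrates the elimination rule (discard a region whose upper bound falls below the best lower bound), not the patented implementation itself:

```python
# Sketch of the purging step: a sub-polyhedron can be discarded when its
# outer (upper) approximation value falls below the best inner (lower)
# approximation value found so far, since it then cannot contain a global
# maximizer. The bound values below are hypothetical.

def purge(bounds):
    """bounds: dict index -> (upper, lower). Return surviving indices."""
    best_lower = max(lo for (_, lo) in bounds.values())
    return {i for i, (up, _) in bounds.items() if up >= best_lower}

# Hypothetical (upper, lower) values for four sub-polyhedra.
bounds = {0: (1.0, 0.4), 1: (0.6, 0.5), 2: (0.3, 0.1), 3: (0.9, 0.7)}
survivors = purge(bounds)
```

Here the best lower bound is 0.7 (region 3), so regions 1 and 2, whose upper bounds are 0.6 and 0.3, are eliminated before the next round of partitioning.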
(91) Feasibility Check via Alternating Projections. Consider a set of polyhedra {S.sub.o,i}.sub.i=1.sup.N such that S.sub.o,i⊂S.sub.o for all i∈{1, . . . , N}. Recall that the feasible set, i.e.,
S.sub.priv-meas(P.sub.V,P.sub.Y|V,r.sub.o)={{right arrow over (x)}(P.sub.Z|Y): P.sub.Z|Y∈𝒫(𝒵|𝒴), ƒ.sub.priv,priv-meas(P.sub.V,P.sub.Y|V,P.sub.Z|Y)≤r.sub.o} (67)
is a convex set. Hence, the well-known alternating projection algorithm can be used to decide whether S.sub.o,i and S.sub.priv-meas(P.sub.V, P.sub.Y|V, r.sub.o), are disjoint. Specifically, this algorithm either finds a point in their intersection, if it is not empty, or converges to two points in each set closest to each other, if the intersection is empty.
(92) For completeness, a summary of the alternating projection procedure is given as follows: Start with a point in S.sub.priv-meas(P.sub.V, P.sub.Y|V, r.sub.o), e.g., {right arrow over (x)}.sub.o, i.e., center of mass of S.sub.o. At j-th step (j≥1), compute the following quantities:
{right arrow over (y)}.sub.j: projection of {right arrow over (x)}.sub.j onto S.sub.o,i (68)
{right arrow over (x)}.sub.j+1: projection of {right arrow over (y)}.sub.j onto S.sub.priv-meas(P.sub.V,P.sub.Y|V,r.sub.o) (69)
(93) At each step, if {right arrow over (x)}.sub.j+1={right arrow over (y)}.sub.j, terminate and declare that the intersection is not empty, which, in turn, implies that S.sub.o,i is feasible. Continue for a maximum number of iterations (in practice, 5 to 10 iterations appear to suffice) and, if {right arrow over (x)}.sub.j+1≠{right arrow over (y)}.sub.j after the final iteration, declare that the sets are disjoint, which is equivalent to saying S.sub.o,i is infeasible.
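A minimal sketch of alternating projections follows, using a disk and a half-plane as stand-ins for the two convex sets (in the method, the sets are S.sub.o,i and the feasible set; the geometry here is an illustrative assumption):

```python
import math

# Sketch of alternating projections between two convex sets in the plane:
# a unit disk and the half-plane {x >= 0.5}. If consecutive iterates
# coincide, a common point has been found, so the intersection is non-empty.

def project_disk(p, center=(0.0, 0.0), r=1.0):
    """Euclidean projection of p onto the disk of radius r."""
    dx, dy = p[0] - center[0], p[1] - center[1]
    d = math.hypot(dx, dy)
    if d <= r:
        return p
    return (center[0] + r * dx / d, center[1] + r * dy / d)

def project_halfplane(p):
    """Euclidean projection onto {x >= 0.5}: clamp the first coordinate."""
    return (max(p[0], 0.5), p[1])

x = (3.0, 0.0)  # starting point inside the half-plane
for _ in range(10):  # 5 to 10 iterations typically suffice in practice
    y = project_disk(x)
    x = project_halfplane(y)
    if x == y:
        break
```

Since the disk and the half-plane intersect, the iterates coincide after the first round; with disjoint sets, the iterates would instead converge to the closest pair of points.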
(94) Simple Approximation Routine. Recall that S.sub.priv-meas(P.sub.V, P.sub.Y|V, r.sub.o) is a subset of S.sub.o, and the containment is strict unless r.sub.o=1. However, since S.sub.o is a polyhedron, it has a finite number of extreme points (equivalently, vertices), and there are known algorithms to characterize them, such as Fourier-Motzkin elimination and linear programming-based methods. Thus, if one can “peel off” S.sub.o by successive cuts so that the maximum of the cost function over the reduced polyhedron strictly improves, i.e., gets lower, one should progress toward a (local) optimum value of the original problem. Depending on the cutting methodology, it is well known to those skilled in the art that outer approximations in the spirit of the above ideas will lead to a global optimum of the original problem.
(95) One such procedure is explained below where
priv-meas ∈{inf,div} (43)
acc-meas ∈{MAP,MSE} (44) 1. Initialize with S.sub.o and i=0. 2. Repeat the following steps until convergence, where S.sub.i denotes the polyhedron at the i-th iteration. 2.1. Compute the i-th outer approximation point:
(96)
(97)
where ∥.∥.sub.2 denotes the Euclidean norm. 2.3. Compute the supporting hyperplane of S.sub.priv-meas(P.sub.V, P.sub.Y|V, r.sub.o) at {right arrow over (y)}.sub.i and update A.sub.i by adding the equation of this hyperplane, i.e., cut S.sub.i with the aforementioned supporting hyperplane, to get A.sub.i+1 and {right arrow over (b)}.sub.i+1 to deduce:
S.sub.i+1={{right arrow over (x)}: A.sub.i+1{right arrow over (x)}≤{right arrow over (b)}.sub.i+1} (48)
(98) In what follows, we list some remarks regarding the implementation of the aforementioned routine.
(99) Computing the maximizer of the relaxed problem. Recall the problem:
(100)
where S.sub.i={{right arrow over (x)}: A.sub.i{right arrow over (x)}≤{right arrow over (b)}.sub.i}, i.e., the polyhedron that includes the feasible set. As noted above, any maximizer of this problem is a vertex of S.sub.i, and one needs to evaluate the cost function over all these vertices to find the global maximizer. Computing the cost function itself is not computationally demanding. Although there are well-known numerical routines to compute the vertices of a given polyhedron, such as the double description method based on Fourier-Motzkin elimination and methods based on linear programming, vertex enumeration is computationally more demanding than the remaining parts, especially as the number of iterations, i.e., the number of cuts, grows beyond a moderate number that depends on the dimensionality of the channels, i.e., K.sub.V, K.sub.Y, and K.sub.Z. As such, an efficient implementation necessitates limiting the number of cuts.
(101) Computing the hyperplane cuts. An implementation of the cutting-plane procedure is the following: Given {right arrow over (y)}.sub.i on the boundary of the feasible set, a supporting hyperplane can be found by evaluating the gradient of ƒ.sub.priv,priv-meas(P.sub.V, P.sub.Y|V, .Math.) at this point, which will be the normal vector of the supporting hyperplane. In particular, the half-space defined by this supporting hyperplane is given by:
H.sub.i:={{right arrow over (x)}: ⟨∇ƒ.sub.priv,priv-meas(P.sub.V,P.sub.Y|V,P.sub.Z|Y({right arrow over (y)}.sub.i)),{right arrow over (y)}.sub.i−{right arrow over (x)}⟩≥0} (50)
where ∇ and ⟨.Math.,.Math.⟩ denote the gradient and inner product operators, respectively. One can augment the inequality constraint in (50) into A.sub.i to get the H-representation of the updated polytope S.sub.i+1 for the (i+1)-th step.
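The half-space construction in (50) can be sketched for a hypothetical constraint function g(x)=x.sub.1.sup.2+x.sub.2.sup.2 with r.sub.o=1 (the unit disk); the gradient and the membership test are illustrative assumptions, not the system's actual privacy measure:

```python
# Sketch of the cut in (50): given a boundary point y_i of the feasible set
# {x : g(x) <= r_o}, the gradient of g at y_i is the normal vector of a
# supporting hyperplane, and the retained half-space is
#   H_i = {x : <grad g(y_i), y_i - x> >= 0}.
# Hypothetical constraint: g(x) = x_1^2 + x_2^2 with r_o = 1.

def grad_g(x):
    return [2.0 * x[0], 2.0 * x[1]]  # gradient of x_1^2 + x_2^2

y_i = [1.0, 0.0]          # boundary point: g(y_i) = 1 = r_o
normal = grad_g(y_i)      # normal vector of the supporting hyperplane

def in_halfspace(x):
    """Membership in H_i, i.e., the side containing the feasible set."""
    inner = sum(n_k * (y_k - x_k) for n_k, y_k, x_k in zip(normal, y_i, x))
    return inner >= -1e-12

kept = in_halfspace([0.0, 0.0])      # interior feasible point is retained
cut = not in_halfspace([2.0, 0.0])   # point beyond the hyperplane is cut
```

By the convexity of g, every feasible point lies in H_i, so appending this inequality to A_i shrinks the outer polytope without discarding any feasible channel.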
(102) Thresholds for termination. The routine outlined above can be shown to converge to a global maximizer as the number of iterations grows unboundedly. Yet, for practical applications, one needs to impose some termination conditions. To this end, define the following quantities:
(103)
where .sub.outer(i) (resp.
.sub.inner(i) are the outer (resp. inner) approximation values after i iterations, and υ.sub.outer(0) and
.sub.inner(0) are the initial values of these quantities, provided as part of the initial conditions of the routine.
Let δ.sub.outer-app-improv, δ.sub.outer-inner-app and δ.sub.const-qual denote the tolerance on the outer approximation improvement, inner-outer approximation distance, and distance to the feasible region, respectively. Also, let max-iter denote the maximum number of iterations allowed.
(104) The iteration is terminated if any of the following four conditions is satisfied:
|υ.sub.outer(i+1)−υ.sub.outer(i)|≤δ.sub.outer-app-improv (55)
|υ.sub.outer(i)−υ.sub.inner(i)|≤δ.sub.outer-inner-app (56)
|ƒ.sub.priv,priv-meas(P.sub.V,P.sub.Y|V,P.sub.Z|Y ({right arrow over (x)}*.sub.i))−r.sub.o|≤δ.sub.const-qual (57)
i≥max-iter (58)
(105) With these termination conditions, we conclude the remarks regarding the practical implementation of the simple approximation routine.
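The four termination tests (55)-(58) can be sketched as a single predicate; the tolerance values and the numerical inputs below are hypothetical:

```python
# Sketch of the termination conditions (55)-(58). The routine stops when
# the outer bound stops improving, the outer-inner gap closes, the
# constraint is met to tolerance, or the iteration budget is exhausted.

def should_terminate(v_outer_prev, v_outer, v_inner, g_at_opt, r_o, i,
                     tol_improv=1e-6, tol_gap=1e-4, tol_const=1e-4,
                     max_iter=50):
    return (abs(v_outer - v_outer_prev) <= tol_improv    # (55)
            or abs(v_outer - v_inner) <= tol_gap         # (56)
            or abs(g_at_opt - r_o) <= tol_const          # (57)
            or i >= max_iter)                            # (58)
```

For example, with no change in the outer bound, condition (55) alone triggers termination; with a stalled run at the iteration budget, condition (58) does.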
(106) In conclusion, we would like to reiterate the differences the present system offers compared to conventional systems, and to give a detailed comparison with the existing work in the literature that is arguably the most relevant to the present system.
(107) The present system offers the following differences as compared to conventional systems. First, in conventional systems, accuracy is based on the closeness of Z to Y, measured either in terms of an average distortion measure or an information measure, rather than the performance of the best statistical inference algorithm in inferring V based on the observation Z, as in our case. Second, conventional systems are “dual” to our formulation, i.e., accuracy is guaranteed and privacy is optimized; in our case, it is the other way around. Third, the implementation of conventional systems involves a convex minimization over a convex set, rather than a convex maximization over a convex set as in our case. Fourth, in conventional systems, the privacy measure is not normalized, as opposed to our case.
(108) Privacy-preserving Data Mapping Under a Privacy-Utility Trade-off. In their paper, entitled “Privacy against statistical inference,” in 50th Annual Allerton Conference on Communication, Control, and Computing, Monticello, Ill., 2012, F. du Pin Calmon and N. Fawaz consider the following setting, which is reproduced here in the notation we have used above: Given the Markov chain V→Y→Z,
(109)
where
(110)
and C(V, Q.sub.V) is a loss function from the cartesian product of 𝒱 and the probability distributions defined on it to the real numbers, d: 𝒴×𝒵→ℝ.sub.+ is a distortion measure and
(111)
The optimization variables of the minimization problems in (59) and (60) are stochastic mappings from Y to Z, i.e., P.sub.Z|Y. Note that the above formulation has been shown to be stronger than differential privacy, and hence, in a sense, a generalization of differential privacy. It will be evident to those skilled in the art that the aforementioned setting is fundamentally different from the privacy-preserving sufficient statistic. The following is a list of some of the most notable differences. First, accuracy in the aforementioned formulation is based on the closeness of Z to Y, measured in terms of an average distortion measure; whereas in the privacy-preserving sufficient statistic, it is based on the accuracy of an optimal statistical inference algorithm's inference of V, for a given loss function, based on the observation Z. Second, the aforementioned formulation is the “dual” of the formulation of the privacy-preserving sufficient statistic, i.e., in the former the accuracy of the third party's observation is guaranteed to exceed a certain amount and the privacy loss of the user is minimized; whereas in the latter, the privacy loss the user experiences is guaranteed not to exceed a certain amount and the accuracy of the third party's inference is maximized. Third, the particularization of the aforementioned formulation with the self-information cost function reduces to a convex minimization over a convex set; whereas the particularizations of the privacy-preserving sufficient statistic with the privacy and accuracy measures mentioned above reduce to a convex maximization over a convex set. It will be evident to those skilled in the art that these two types of optimization problems are fundamentally different as far as the methods to compute their respective optimizers are concerned. Fourth, in the aforementioned formulation, the privacy measure is not normalized, whereas in the privacy-preserving sufficient statistic the privacy measure is normalized.
(112) Information Bottleneck/Privacy Funnel. The so-called information bottleneck, introduced by N. Tishby and his co-workers in their paper “The information bottleneck method,” arXiv:physics/0004057, 2000, and its dual, the privacy funnel, introduced by A. Makhdoumi and his co-workers in their paper “From the information bottleneck to the privacy funnel,” in 2014 IEEE Information Theory Workshop (ITW 2014), 2014, are two noteworthy attempts to formalize the privacy-utility trade-off, each of which is incorporated herein by reference in its entirety. In our notation, they read as follows: Given the Markov chain V→Y→Z and λ>0,
(113)
where (61) (resp. (62)) is called the information bottleneck (resp. privacy funnel). Note that in both cases, the second optimization problem is typically interpreted as a Lagrange multiplier version of the first one, claimed to be an equivalent formulation. There are also ƒ-divergence versions, to which the following comments also apply. Among the many differences with the formulation presented herein, the most notable ones follow. First, privacy is measured in terms of the information Z contains about Y, yet in our case we use a normalized version of the information Z contains about the user preference V. Second, accuracy is measured in terms of the mutual information, i.e., an information measure, whereas in our case we use the accuracy of the best statistical inference algorithm's performance in inferring V based on the observation Z, and these two notions do not have a direct mapping between each other. Third, for the constrained versions, algorithms are given only to compute the boundary of the set {[I(Y;Z), I(V;Z)]: V→Y→Z}, which does not give an optimizer but rather provides, in a sense, a subset of the feasible set that includes the global optimizer(s), whereas we have an algorithm that computes a global optimizer.
(114) An Example of the Implementation of Valuable Information Extractor, i.e., 102, and Clarification on the Role of Updated Prior Information. Consider the following scenario: valuable information is a rating of a user's likelihood of purchasing a product at a given time instance. A particular example can be constructed akin to stock ratings in which an analyst summarizes her belief of a stock's future potential in discretized options such as buy, hold, sell. Specifically, the aforementioned rating of a user's likelihood of purchasing a product at a given time instance can consist of three possible values: (i) Interested; (ii) Neutral; and (iii) Not interested.
(115) In order to extract this valuable information, the method can use the totality of a user's web activity, such as websites visited, search queries, shopping carts in e-commerce websites, etc., which represents the meta data source in the above discussion. In any practical embodiment of the aforementioned scenario, it should be clear that the valuable information is not readily available to the method and needs to be extracted from the meta data source. Next, we outline one way to implement the method in such a scenario and outline the role prior knowledge plays in this implementation.
(116) Let {X.sub.1, . . . , X.sub.T} be a sequence of the ratings of a user's likelihood of purchasing a given product, where X.sub.i is the said rating at time i, and assume the method's ultimate goal is to extract an estimate of X.sub.T, i.e., the said rating at time T. One way to achieve this in a dynamic manner is to extract estimates of X.sub.i at each time instance, i.e., iteratively refine the estimate of the rating in light of further information gathered at each time. Let {Y.sub.1, . . . , Y.sub.T} represent the sequence of such iterative estimates. At the beginning of the procedure, i.e., the operation to produce Y.sub.1, the method would use all the prior knowledge about the aforementioned rating, along with all the available meta data at that point in time.
(117) One can succinctly summarize this prior knowledge with a probability distribution over the possible values this rating can take. For example, going back to the initial example, if no prior information were available, then one could simply assume that all three possibilities are equally likely, i.e., each possibility has a probability of ⅓. After the extraction operation at each time i, the method has potentially updated information, which is due not only to the availability of more meta data, but also to the fact that the extraction operation distills more refined information regarding X.sub.i. As such, in order to accomplish the extraction during the next time instance, the prior knowledge to be used will be different and, hence, needs to be updated. In particular, for the ultimate step, in which Y.sub.T is produced, all the previous estimates, i.e., {Y.sub.1, . . . , Y.sub.T−1}, are available to the system, and hence the prior knowledge, i.e., the probability distribution of X.sub.T, to be used will be a function of these values. Therefore, the prior knowledge used in the ultimate step is a potentially different, i.e., more refined, version of the prior knowledge at the very beginning of the procedure.
(118) We continue with a couple of remarks regarding the above discussion. The updated prior knowledge has two roles in the aforementioned implementation.
(119) First, as depicted in
(120) Second, during the next epoch of the extraction procedure, in which the goal is to extract an estimate of X.sub.2T, one will start the extraction procedure with, as the prior knowledge, the posterior distribution computed by using the prior distribution representing this updated prior knowledge, along with the extraction noise statistics and Bayes' theorem.
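The prior update described above can be sketched with Bayes' theorem for the three-valued rating; the likelihood values standing in for the extraction noise statistics are hypothetical:

```python
# Sketch of the Bayesian prior update for the three-valued rating
# {Interested, Neutral, Not interested}. The likelihoods below are
# hypothetical stand-ins for the extraction noise statistics.

def bayes_update(prior, likelihood_row):
    """Posterior ∝ prior * likelihood of the observed estimate."""
    unnorm = [p * l for p, l in zip(prior, likelihood_row)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# No prior information: uniform over the three ratings, as in the text.
prior = [1 / 3, 1 / 3, 1 / 3]

# Hypothetical P(Y = "Interested" | X = rating) for each true rating.
likelihood_interested = [0.7, 0.2, 0.1]

# Observing the estimate "Interested" refines the prior for the next epoch.
posterior = bayes_update(prior, likelihood_interested)
```

The refined posterior then plays the role of the prior knowledge at the start of the next extraction epoch, exactly as the paragraph above describes.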
(121) The operation in the preceding paragraph, is not necessarily reflected in
(122) The statistical framework used in
(123) It should be noted that various changes and modifications to the embodiments described herein will be apparent to those skilled in the art. Such changes and modifications may be made without departing from the spirit and scope of the present invention and without diminishing its attendant advantages. For example, various embodiments of the systems and methods may be provided based on various combinations of the features and functions from the subject matter provided herein.