METHOD AND SYSTEM FOR CREATING WORD-LEVEL DIFFERENTIAL PRIVACY USING FEATURE HASHING TECHNIQUES
20200382281 ยท 2020-12-03
Inventors
Cpc classification
G06N7/01
PHYSICS
H04L9/3239
ELECTRICITY
H04L9/0618
ELECTRICITY
International classification
H04L9/06
ELECTRICITY
Abstract
The present invention discloses a method of creating word-level differential privacy with the hashing trick to protect confidentiality of a textual data, the method comprising: receiving a list of a plurality of hashes with a weight (or weights) associated with each of the plurality of hashes; Updating said list with new hashes that are within the range of allowable hash values but not included in said received list of hashes; Updating said list with a new weight to each of said plurality of hashes that are missing said weight; Fitting a probability distribution to said list of said weights of said plurality of hashes; and generating said new weights and said adjusted weights based on sampling of said probability distribution.
Claims
1. A method of creating word-level differential privacy with the hashing trick to protect confidentiality of a textual data, the method comprising: receiving, by a differential privacy creating device, a list of a plurality of hashes with a weight (or weights) associated with each of the plurality of hashes; updating, by said differential privacy creating device, said list with new hashes that are within the range of allowable hash values but not included in said received list of hashes; updating, by said differential privacy creating device, said list with a new weight to each of said plurality of hashes that are missing said weight, wherein in cases where said received list of hashes is a newer version of a previously-existing list of hashes, updating said list with adjusted weights with respect to said hashes weights in said previously-existing list of hashes; fitting, by said differential privacy creating device, a probability distribution to said list of said weights of said plurality of hashes; and generating, by said differential privacy creating device, said new weights and said adjusted weights based on sampling of said probability distribution.
2. The method as claimed in claim 1, wherein a corpus of textual data is converted to said plurality of hashes.
3. The method as claimed in claim 1, wherein said probability distribution is continuous when said weight of said plurality of hashes are continuous, and wherein said probability distribution is discrete value when said weight of said plurality of hashes are discrete values.
4. The method as claimed in claim 1, wherein said probability distribution is multi-dimensional when number of types of said weights is more than one.
5. The method as claim in claim 4, wherein said probability distribution is 2d dimensional when type of said weight is d dimensional and previously-existing weights are being adjusted.
6. A non-transitory computer-readable medium storing computer-executable instructions for generating strategy and roadmap for end-to-end information technology (IT) infrastructure cloud implementation, the computer-executable instructions configured for: receiving a list of a plurality of hashes with a weight (or weights) associated with each of the plurality of hashes; updating said list with new hashes that are within the range of allowable hash values but not included in said received list of hashes; updating said list with a new weight to each of said plurality of hashes that are missing said weight, wherein in cases where said received list of hashes is a newer version of a previously-existing list of hashes, updating said list with adjusted weights with respect to said hashes weights in said previously-existing list of hashes; fitting a probability distribution to said list of said weights of said plurality of hashes; and generating said new weights and said adjusted weights based on sampling of said probability distribution.
7. The non-transitory computer-readable medium as claimed in claim 6, wherein said probability distribution is continuous when said weight of said plurality of hashes are continuous, and wherein said probability distribution is discrete value when said weight of said plurality of hashes are discrete values.
8. The non-transitory computer-readable medium as claimed in claim 6, wherein said probability distribution is multi-dimensional when the number of types of said weights is more than one.
9. The non-transitory computer-readable medium as claimed in claim 6, wherein said probability distribution is 2d dimensional when type of said weight is d dimensional and previously-existing weights are being adjusted.
Description
(4) BRIEF DESCRIPTION OF THE DRAWINGS
[0023] The invention will be better understood and objects other than those set forth above will become apparent when consideration is given to the following detailed description thereof. Such description makes reference to the annexed drawings wherein:
[0024]
[0025]
[0026]
(5) DETAILED DESCRIPTION OF THE INVENTION
[0027] In the following detailed description, reference is made to the accompanying drawings which form a part hereof, and in which is shown by way of illustration specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention, and it is to be understood that the embodiments may be combined, or that other embodiments may be utilized and that structural and logical changes may be made without departing from the scope of the present invention, which is defined solely by the claims that follow the detailed description.
[0028] References will now be made in detail to the exemplary embodiment of the present disclosure.
[0029]
[0030] The system 100 may implement in a differential privacy creating device, in accordance with some embodiments of the present disclosure. The differential privacy creating device, may create word-level differential privacy with the hashing trick to protect confidentiality of textual data. In particular, the system 100 may include a differential privacy creating device (for example, server, desktop, laptop, notebook, netbook, tablet, smartphone, mobile phone, or any other computing device having a processor) that may create word-level differential privacy.
[0031] As will be described in greater detail in conjunction with
[0032] The system 100 may include one or more processors 101, a computer-readable medium (for example, a memory) 102, and a display 103. The computer-readable storage medium 102 may store instructions that, when executed by the one or more processors 101, cause the one or more processors 101 to validate the document, in accordance with aspects of the present disclosure. The computer-readable storage medium 102 may also store various data that may be captured, processed, and/or required by the system 100. The system 100 may interact with a user via a user interface 104 accessible via the display 103. The system 100 may also interact with one or more external devices 105 over a communication network 106 for sending or receiving various data. The external devices 105 may include, but may not be limited to, a remote server, a digital device, or another computing system.
[0033]
[0034] The database 201 may store a list of a plurality of hashes and a weight associated with each of the plurality of hashes. The internal architecture model 200 may build a hashing algorithm that reads a corpus of textual data and may convert each words into hash. The same word will always be converted to the same hash. But, it may be a possibility that multiple words may correspond with the same hash. In this case, the internal architecture model 200 may share that hash, and from that point on the computer algorithm may treat the multiple words as identical. During training, various weights (for example- parameters, values) may be assigned to the hashes, based on how useful the internal architecture may find each of the plurality of hashes to be when building a model.
[0035] Further, the missing hash adding module 202 may add the plurality of hashes with missing hash in the list of the plurality of hashes. Hence, for each hash with missing weight inside the maximum hash range which may not appear in the list of the plurality of hashes, the hash adding module 202 may add to the list. Further, for each new hash added, the hash adding module may generate a set of weights for it by sampling from the probability distribution and when the distribution includes dimensions for the update differences, the hash adding module 202 may ignore them.
[0036] Further, for each pre-existing hash with missing weights, missing weight generation module 203 may generate values for the missing weights that may be correlated with the pre-existing weights. Moreover, for each of the weight of hashes that may not be updated, the pre-existing hash updating module 205 may generate an update that may be correlated with the pre-existing weights and any pre-existing updates, and add it to the weight.
[0037] Further, a probability distribution module 204 may perform the probability distribution to the weights of the updated plurality of hashes. It should be noted that when there may be more than one type of weight then the probability distribution may be multi-dimensional. By way of an example, when there may be d types of weight then the probability distribution may be of 2d dimensional. Moreover, when the weights of the plurality of hashes may be continuous number then the probability distribution may be continuous and when the weights of the plurality of hashes may be discrete values then the probability distribution may be discrete values. The module 204 is in communication and extracts information from the pre-existing database 206.
[0038] Additionally, when the distribution of weights may be shaped like a known distribution such as the Normal (Gaussian) Distribution or Laplace Distribution, then the shape can be described with a small number of parameters and a parametric technique can be used. When the distribution of weights may have an unusual shape that can't be easily described with a few parameters, a non-parametric technique needs to be used, such as Kernel Density Estimation. It should be noted that how the distribution may be fitted, may not affect the overall process, but it may affect how good the confidentiality guarantee may be, and how well the model may perform.
[0039]
[0040] A hashing trick may also be known as feature hashing, the hashing trick involves using a hash function to assign indices to features. It differs from traditional hashing in that that instead of using each hash as a key mapped to some value, the trick is that the hash itself is the value. This has two main advantages: it is computationally fast and space-efficient; and it maps a potentially infinite number of features into a known, bounded hash range.
[0041] The hashing function is deterministic; a feature (which may just be a word in scenario of the present invention) will always be hashed to the same value h. Hashes cannot easily be reverse-engineered back to the original text string thoughthe hashing process involves overwriting and shifting the bit string so many times, that reversing the process leads to many trillions of possible original text strings.
[0042] It is possible for multiple words to collide to the same hash number. In these cases, all occurrences of the colliding words or features will be mapped to the same hash, and share the same weight(s) 0. The hashing process is often random enough to be considered uniform sampling with replacement.
[0043] When using the hashing trick, the universe of possible outputs U is finite, and known. For a particular problem, the output distribution H will only use a subset of the universe, H.Math.U, from which the training data x and testing data z are then drawn from. For an adequately large x and z, one can expect minimal covariate shift; the hash set outputted by a hashing function g(x) is assumed to be close to the hash set outputted by g(z):
Based on empirical experiments, this sampling assumption holds true in practice.
[0044] Moreover, differential privacy is a tractable, mathematically-rigorous definition of privacy, where a data holder makes the following promise to each userYou will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources, are available.
[0045] It has since been adopted as the de facto privacy standard by companies like Apple and Google, and has been applied to numerous ML algorithms. While other privacy definitions such as k-anonymity and 1-diversity exist, they do not protect against auxiliary information, or offer any provable guarantees. For purposes, one can think of each user as being a unique word, and define differential privacy as follows:
[0046] Expressed mathematically in example 1, an algorithm f().fwdarw.M may be (, )-differentially private if for all possible outputs in the universe M.Math.U, for all possible adjacent corpora x and x that differ only by all occurrences of one word:
PR(F(x)=M)e.sup.Pr(f(x)=M)+(1)
where may measure the maximum multiplicative change in output probabilities, and measures the maximum additive change (often thought of as the failure rate of the privacy guarantee).
[0047] For example, for a common value such as 0.1, the probability of any particular output should not change by more than 10% when a word in x may be added or removed (note that example 1 may be symmetrical for x, x). Basically, the removal or addition of a data point should only have a small chance of affecting a function's output.
[0048] User-Level Privacy: Traditionally, differential privacy compares two neighboring data sets x and x that differ by a single data point, such that |x||x|=1. This treats each data point independently. User-level privacy is a variation that takes into account that the same user may appear in x multiple times, and that one can want to totally hide the presence of the user, rather than just one of their data points. Two data sets are the to be adjacent to one another if they differ by all occurrences of a single user, such that |x||x|=k, k1.
[0049] This definition of adjacency matches the scenario of the present invention, in which one can hide the presence or absence of all occurrences of each word. Since the concept of a user is an ill fit for application, instead one can call it word-level privacy.
[0050] Composition: If a hashing function g maps words to hashes rather than features to hashes, then for any adjacent corpora x and x, g(x).fwdarw.H and g(x).fwdarw.H will be neighboring hash sets. That is, while |x||x|=k fork k1, one can can know that |H||H|=1 if g performs word hashing. If g instead performs feature hashing, then |H||H|=n for some n1. In this scenario, the privacy cost of hiding all occurrences of a word is the composition of the privacy cost of hiding the n corresponding hashes. Fortunately, differential privacy is well-equipped to handle composition.
[0051] By way of an example 2, Let f.sub.i(x) be (.sub.i, .sub.i)differentially private. Then f.sub.i=1 . . . n=(f.sub.1(x), . . . , f.sub.n(x)) may be
(.sub.i.sup.n.sub.i, .sub.i.sup.n.sub.i)
which may be differentially private.
[0052] In order to preserve the confidentiality of each word when the words themselves may be the features, one can need to design quite a unique differentially private solution. Unlike most differential privacy techniques, there is no aggregation one can employ when it comes to the hash setall hashes are equally distant from each other, and adding noise to a hash equates to entirely destroying it. The solution needs to account for the fact that if the attacker has hashing function, they can guess words to hash and look for them in the hash set. Solution provided in the present invention needs to release a hash set containing legitimate hashes, while simultaneously not allowing an attacker to detect which hashes are legitimate.
[0053] For any word that an attacker does not already know is in x, there are three ways that an attacker might be able to infer its presence or absence: a) by observing the presence or absence of a hash hH that the attacker knows maps to the target word, and that is unlikely to appear by chance; b) by observing parameters .sub.h paired with a new hash that are unlikely to have naturally occurred; and c) by observing parameters .sub.h paired with an already known hash that do not match the parameters the attacker expected to see (likely meaning that a collision occurred).
[0054] Note that goal is only to protect the words; the goal is not to prevent an attacker from learning parameter values in . One can assume that in the worst-case scenario, an attacker possesses: the entire corpus x except for one word; the hashing algorithm g; the training algorithm f; and that therefore they can potentially rebuild M in its entirety, minus the hash-parameter pairs related to the one word. As far as word-level confidentiality is concerned, the values paired with a hash h only affect confidentiality insofar as they reveal something about the legitimacy of h. One can use output perturbation to provide confidentiality; the anonymization process is performed after M is outputted by f(x), but before it is publicly released. One can refer to the anonymization of H as *(H).fwdarw.H*, and the modification of as *().fwdarw.*. The final anonymized output is then M*={(h*, *.sub.h); hH*}. Only M* is ever publicly released.
[0055] In the present invention, first presenting the strategy for *(H) and *() and prove that these parts are (0, 0)-differentially private and (,)-differentially private respectively.
[0056] Anonymizing H: Considering all possible hash values 1, . . . , R, one can note the hashes that do not appear in H. The function *(H) fills in these remaining hashes to produce H* , so that |H*|1=R. The resulting hash set H* is (, )-differentially private, with =0, =0.
[0057] Proof: Observing that H is the output of g(x), one can can consider two adjacent corpora x and x as in example 1. Function *(H) always outputs H*, no matter what x is. Thus P r(*(g(x))=H*)=100% for all possible x. Then, using equation 1, one can have
Pr(*(g(x))=H*)=Pr(*(g(x))=H*)
Pr(*(g(x))=H*)e.sup.Pr(*(g(x))=H*)+
e.sup.=1, =0
=0, =0
[0058] Not only does *(H) hide which hashes were originally in H, but it also hides the original size of H. This in turn hides the collision rate of hashes in H. There is no way for the attacker to learn how many words in x were likely mapped to the same hash by only observing H*.
[0059] Anonymizing is about ensuring each h*H* has a full set of plausible parameters *.sub.h. This includes filling any gaps in in preexisting hashes hH: a worst-case attacker will know which parameters each h should legitimately have, and which parameters they should have missing (if any). For each of the synthetic hashes h* (i.e., the hashes that are not in H), the function *() generates values for the d elements in . One can treat as a d-dimensional continuous space, where d=max(||; M). For preexisting hashes h, *() generates values for any missing elements in that are correlated with the preexisting values.
[0060] One way to match the distribution would be to simply sample from the pre-existing values, however this would unnecessarily expose the set of possible 's and their frequencies. If an attacker expected to see all of the 's except one, then they immediately learn that one of the hashes with the new 0 is legitimate. If there is only one such hash, confidentiality is breached. In other words, one can want the statistical difference between * and to be close, but not so close that any single has the potential to noticeably change *
[0061] To generate appropriate values, *() first fits a nonparametric joint probability density function (PDF) to {circumflex over ()}.Math., where {circumflex over ()} contains only 's with no missing elements. *() can use Kernel Density Estimation (KDE) to perform the fit. The dimensionality of the joint PDF is d. New values can then be sampled from this PDF; either full 's to be randomly assigned to the each H*, or the missing values of preexisting 's that are correlated to the preexisting values.
[0062] The question then becomes, how different is the final distribution * from all possible initial neighboring distributions {circumflex over ()}!(531, )-indistinguishability provides us a way to measure the statistical distance between the two distributions in a way that cleanly translates to differential privacy:
[0063] By way of an example 3, two probability distributions p and q on the same space are called (, )-indistinguishable if:
Pr(KL(pq))1
Pr(KL(qp))1
[0064] where KL() is the Kullback-Leibler divergence:
[0065] If p=D(x) and q=D(x) are (, )-indistinguishable for all neighboring x, x, then D is (, )-differentially private.
[0066] Anonymizing : Using the process described above, the output * of the function *() is (, )-differentially private, where
=max(KL(D({circumflex over ()})D({circumflex over ()})), KL(D({circumflex over ()})D({circumflex over ()})));
xs.t.|{circumflex over ()}||{circumflex over ()}|=1
[0067] =Pr(g(x.sub.i)=g(x.sub.j)); x.sub.i not equal to x.sub.i; and where x.sub.i and x.sub.j are different words in the corpus.
[0068] Proof: The calculation of is a direct application of Definition 3.
[0069] However since preexisting values are not perturbed, there is a risk of an attacker learning about a collision if a preexisting does not have the values that would have been outputted by f(g(x)) if there had not been a collision. captures this risk of a hash collision.
[0070] If the number of collisions C has been tracked by the data holder, =C/R. Otherwise if g is a sufficiently random hashing function that uniformly randomly outputs hashes, E[C] can be calculated by subtracting |H| from the total number of features T, where T can be found by solving:
R(1((R1)/R).sup.T)=|H|
[0071] For example if the number of hashes is |H|=106 and the hash range is R=232, then one can expect there were C=120 collisions, for a collision rate (and therefore ) of 0.00012.
[0072] If one wished to avoid >0 and instead increase , this could be achieved by adding noise to each of the genuine hashes' parameters. However because hashes are connected to individual words (as opposed to being aggregated linear queries or some other smoothed output), the amount of noise that would need to be added to each hash's parameters to hide collisions is prohibitively large.
[0073] In the present invention, it is easy to apply it in real-world scenarios will require care to ensure an acceptable and is achieved. If KDE is used to fit {circumflex over ()}'s distribution, the bandwidth parameter can be adjusted to raise or lower (with corresponding increases and decreases in model performance). One can also reduce by resampling some number of genuine hashes with new values from {circumflex over ()}'s PDF, effectively making them synthetic:
[0074] Reducing : The probability of finding a genuine collision, Pr(g(x.sub.i)=g(x.sub.j)); x.sub.i not equal to x.sub.j, can be diluted by resampling 0 for s genuine hashes in x, reducing 6 by a factor of s such that anonymizing of becomes (, /s)differentially private.
[0075] Composition: Using composition from example 2, one can trivially see that the combination of anonymizing H and composes to become (0+0+)-differentially private. This encapsulates the scenario where each word maps to a single hash. If the hashes instead represent features, and words can be part of multiple features, one can account for this with additional applications of example 2:
[0076] Accounting for n: If a single occurrence of a word is part of features that map to n1 hashes, then anonymizing H remains (0, 0)-differentially private and anonymizing becomes (n, n)-differentially private, where E and 6 are defined by anonymizing . One can then need to account for adjacent data sets, where removing all instances of a word results in |x||x|=k:
[0077] Accounting for k: If each occurrence of a word maps to n1 hashes, and a word appears k1 times in x, then anonymizing H remains (0, 0)-differentially private and anonymizing becomes (, )-differentially private, where and are defined by anonymizing .
[0078] Note that multiplying and by is the theoretical upper bound, where each occurrence of a word results in a unique set of features, with no overlap or repetition. In settings where the data holder can observe the maximum number of features generated by all occurrences of each word, can likely be reduced.
[0079] Additionally, is measuring the probability of at least one collision, while in reality if only one feature out of many collided, the attacker's confidence that the colliding feature belongs to a specific word, and not a different word, is essentially 1/. Multiple collisions would need to match the features of a particular word before the attacker's confidence would grow.
[0080] When calculating E, accounting for n and accounting for k may be considering two neighboring distributions {circumflex over ()} and {circumflex over ()} and then composing the privacy loss for distances larger than one. For , one can also measure the privacy loss using adjacency, where one can measure the maximum KL-divergence between two distributions differing by data points:
[0081] Using adjacency: When measuring the maximum KL-divergence for anonymizing , is measured using neighboring distributions, |{circumflex over ()}||{circumflex over ()}|=1. One can can instead consider two adjacent distributions |{circumflex over ()}||{circumflex over ()}|=:
=max(KL(({circumflex over ()})D({circumflex over ()})), KL(D({circumflex over ()})D({circumflex over ()})));
xs.t.|{circumflex over ()}||{circumflex over ()}|=nk
[0082] This may result in a tighter bound on the privacy loss, if <
[0083] If the number of features a word can appear in is proportional to how frequently the word appears in the corpus, one can can consider each word's frequency separately when assessing its level of confidentiality. Each word can be considered to have its own k, with rarer words being more heavily protected.
[0084] Model Performance: Recall from Observation 1: For any particular scenario being modelled, the output distribution H will only use a subset of the universe, H.Math.U, from which the training data and future data are both then drawn from.
[0085] The (h, ) tuples in M are untouched by *(H). Only tuples that do not appear in the training data are affected, which by definition are outside of H. When the same h's are seen again in future data, they are correctly assigned the unperturbed values contained in . The rarity of observing new hashes outside the training distribution means that the addition of the fake hashes h* does not overly distort M's predictions.
[0086] Note that this sampling assumption weakens as the training size decreases; a small x{circumflex over ()} may not have converged to E[x] as well as a larger x{circumflex over ()} would have. This also holds true when considering the size of batches in online learning.
[0087] Online Learning: In the online learning setting, a model M.sub.v is updated with a new batch of documents x.sub.b+1. This can be repeated any number of times; the model is a living model, adapting to new data as it becomes available. So far one can have only considered a single model M.sub.0 being trained on x.sub.0. For b>0, the training function takes the form f.sub.b(x.sub.b, H.sub.b).
[0088] In order to anonymize models after the first, M.sub.b>0, the anonymization function *(.sub.b) fits a distribution to all .sub.b1's that were updated by f.sub.b. The update amounts (i.e., the element-wise differences between .sub.b1 and .sub.b) are considered as additional dimensions in the joint PDF, for a total of 2d dimensions. *(.sub.b) then samples updates for each .sub.b that was not updated by f.sub.b, with the updates being correlated with .sub.b1's weights.
[0089] Anonymizing M.sub.b>0. For executions of f.sub.b>0, performing *(H.sub.b) is not necessary. *(*.sub.b) is still required in order to anonymize all 's that were not updated by f.sub.b. For each b, {circumflex over ()}.sub.b is the set of 's that were updated by f.sub.b. Assuming an attacker could have access to all previous batches, *.sub.b is
(.sub.i.sup.bnk.sub.i.sub.i, .sub.i.sup.bnk.sub.i.sub.i)
which may be differentially private.
[0090] Proof: *(H.sub.b) is not necessary because all hash values have already been filled in; H.sub.b=H*.sub.0. The proof for *(*b) follows the same process as the proof for anonymizing , using example 2 for composition as seen in the accounting for k. Observe that n is a product of the hashing process g and does not change between batches, while k and E are unique to each batch.
[0091] In online settings, where the same word may appear in multiple update cycles, the cost of querying the same word needs to be paid each time. Depending on what amount of risk is considered acceptable, and the size of k.sub.b, .sub.b and .sub.b in each iteration, a upper bound may need to be placed on b. Once the limit is reached, no more updates are allowed without exceeding the accepted risk threshold.
[0092] Experiments performed for the present invention, one can measure both the performance of classification models and the corresponding privacy cost . Further, one can may compare the performance of traditional (non-confidential) and confidential models trained with static offline learning.
[0093] The documents used in the present invention have been collected by Kira Systems as part of business operations. Overall, experiments involve hundreds of legal documents containing tens of millions of words. Nineteen models are built in both the offline and online scenarios, with each model trained to extract different text segments. For each model, text segments are labelled as either relevant or not relevant; these are two class labels, with between 0.7% and 0.01% of the corpus being labeled as relevant. Details for each model are provided in conjunction with table I.
[0094] The model f one can use for experiments is a Conditional Random
[0095] Field trained with the Passive-Aggressive algorithm. One can use a proprietary Go implementation of the model based on the implementation used by the CRF suite software.
[0096] To feature the text, Kira Systems uses a purpose-built variation of the punkt algorithm. One can use binary n-grams and skip-grams among other features. One can found stemming, case normalization and part-of-speech tagging did not provide a meaningful performance improvement. For any arbitrary word in x, the maximum number n of features that can be produced by featurization process is n=6. If that word appears k times in x, then using the accounting for k, one can have a per-word scaled with =6k.
TABLE-US-00001 TABLE 1 DETAILS OF THE 20 MODELS TRAINED IN OUR EXPERIMENTS. Unique Word Word Hash Count Count Count Model Document (millions, (millions, (millions, ID Count approx.) approx.) approx.) 1086 60 8 0.053 1.165 1238 100 20 0.108 1.601 1239 300 24 0.135 1.790 1240 300 30 0.140 1.805 1242 300 24 0.134 1.790 1243 280 19 0.120 1.738 1244 240 18 0.118 1.707 1245 350 28 0.140 1.827 1262 280 20 0.120 1.740 1300 360 28 0.150 1.862 1444 240 20 0.071 1.545 1460 180 15 0.083 1.501 1500 230 14 0.098 1.601 1509 350 20 0.122 1.777 1512 350 20 0.122 1.777 1520 360 21 0.128 1.793 1524 160 9 0.061 1.366 1551 300 24 0.124 1.754 1601 190 17 0.084 1.458 1611 310 25 0.142 1.846 Average 273 20 0.115 1.679
[0097] From the table-I, one can can see that most unique words are less than 0.7% of the total words, so one can use that as approximation for k, giving us 0.042||0.042|{circumflex over ()}|.
[0098] The hashing strategy g one can use is MurmurHash3. Due to memory constraints one can use a rather small hash range of R =2.sup.21 2.097 million compared to average hash count in the table-I, leading to =0.5 on average, which one can reduce to =0.01 for all models using in the reducing S. This is still larger than would be naively acceptable in most settings, however one can note that only represents the risk of at least one collided hash being detected, not of a word being leaked. Upon detecting all the collisions (some of which are synthetic) the attacker would then need to guess words that map to all the collided hashes, and trust that their guess, and not some other word, is what caused the perhaps-genuine collision. When combined with the fact that to even detect a collision, the attacker requires hashing code and training code of the present invention, and every word in the corpus except for one, one can find the practical risk of a leak to be much lower than the worst-case theoretical leak described by .
[0099] Each hash tuple (h, ) has two elements in , corresponding to one weight value per label. Therefore the dimensionality of the joint PDF for *() is d=2 in the offline case (and for M in the online case), and d=4 when updating M.sub.b>0. only contains a weight for labels that f(x) has observed h having. One can find that in practice due to the high label imbalance, in at least 97% of cases if 0 contains a weight for the relevant label, it also contains one for the not relevant label. Thus the size of {circumflex over ()} in the present invention is approximately 0.97c||, where c is the prevalence of the relevant label.
[0100] One can fit a joint PDF to {circumflex over ()} using Kernel Density Estimation (KDE) with a Guassian kernel. Performing KDE is where the most computation time is required to execute functions of the present invention, *(H) and *(), however in all experiments it required several orders of magnitude less time than the training process f(x). One can experiment with three kernel bandwidths: w=0.01, w=0.001 and w=0.0001. The bandwidth affects the smoothness of the fit, and controls the impact that each data point makes to the distribution. Increasing the bandwidth therefore decreases the privacy cost, but also decreases the model performance as will be shown below.
[0101] Initial testing found that was always smaller than E, and so one can use adjacency to measure the privacy loss in experiments. Rather than compute the KL-divergence for all possible adjacent x, x combinations (which would likely require checking every {circumflex over ()} that is distance from {circumflex over ()}), one can observe that weights are heavily focused around the mean and approximate the maximum KL-divergence by removing the =0.042|{circumflex over ()}| hashes with weights furthest from the mean and treating that as x. Initial experiments confirmed that this gave the largest KL-divergence within one order of magnitude, with all other divergence values being between 100 and 10-12 times smaller.
[0102] One can use the F-measure to assess the performance of the models, with the default =1, commonly referred to as the F1 score. Using the F1 score is appropriate for use-case due to the high label imbalance.
Offline Learning Results
[0103]
TABLE-US-00002 TABLE II Traditional Private Model Model Model w = 0.01 w = 0.001 w = 0.0001 ID F1 F1
F1
F1 1086 0.933 0.000040 0.903 0.0
6 0.935 0.689 0.931 1238 0.949 0.000054 0.271 0.0288 0.946 0.620 0.953 1239 0.866 0.000010 0.869 0.0136 0.867 0.213 0.867 1240 0.964 0.000002 0.963 0.0001 0.964 0.001 0.964 1242 0.928 0.000014 0.920 0.0159 0.9
6 0.275 0.927 1243 0.767 0.000015 0.743 0.0124 0.768 0.208 0.767 1244 0.688 0.000067 0.648 0.0356 0.685 0.496 0.686 1245 0.873 0.000002 0.867 0.0057 0.872 0.104 0.873 1262 0.841 0.000003 0.827 0.0059 0.837 0.092 0.840 1300 0.893 0.000001 0.888 0.0025 0.893 0.040 0.893 1444 0.875 0.000006 0.866 0.0050 0.874 0.089 0.875 1460 0.818 0.000124 0.771 0.0707 0.821 1.073 0.8
0 1500 0.933 0.000029 0.934 0.018
0.
33 0.276 0.
150
0.868 0.000124 0.872 0.0357 0.871 0.535 0.868 1512 0.878 0.000043 0.782 0.032
0.877 0.575 0.873 1520 0.931 0.001083 0.904 0.1751 0.936 2.383 0.936 1524 0.866 0.000020 0.853 0.0
0.867 0.315 0.869 1551 0.865 0.000188 0.680 0.0
0.84
1.063 0.855 1601 0.978 0.000008 0.970 0.0114 0.978 0.190 0.976 1611 0.9
3 0.00014
0.510 0.0621 0.978 1.204 0.982 Average 0.885 0.000099 0.798 0.0315 0.884 0.522 0.885
indicates data missing or illegible when filed
[0104] In the present invention, five-fold cross-validation and report the average F1 scores in results of the present invention. The privacy cost E is also included. Also provided results for three KDE bandwidth values, w=0.01, w=0.001 and w=0.0001.
[0105] Starting broadly, one can can see that both privacy costs and F1 scores increase as the bandwidth is reduced. This is expected, given that the smaller the bandwidth, the more emphasis that is placed on each data point and the tighter the PDF fits to the data. This causes the spread of the sampled weights to decrease, meaning that hashes in the test data not appearing in the training data are more likely to have small weights, and therefore be less likely to skew the results.
[0106] Looking at the average results in the bottom row of table-II one can see that an increase in the magnitude of the bandwidth results in a decrease in the magnitude of the privacy cost. Interestingly, the average F1 score when w=0.001 is almost exactly equal to the average performance of the traditional models, at 0.884. In fact many of the private models are within 1% of the F1 scores of the corresponding non-private models at the larger w=0.01, highlighted in bold.
[0107] Only one model, Model 1551, requires a smaller bandwidth to reach a performance comparable to the traditional model, costing E=1.063 in privacy. The next highest bolded privacy cost is E=0.1751, for Model 1520. Comparatively, the smallest bolded privacy cost is for Model 1300, which only costs =10 to reach F1=0.888 compared to F1=0.893 for the non-private model.
[0108] Even without further fine-tuning of the bandwidth, when selecting the best-fitting bandwidth from the three presented in table-II (i.e. the results in bold), the average privacy cost is E=0.077. With Dwork often recommending that the privacy cost remain below 0.1 [8], and state-of-the-art machine learning algorithms going as high as =1, =2, 4, 8 or even =8.6, spending as little as 0.077 is a very strong privacy guarantee.
Online Learning Results
[0109]
TABLE-US-00003 TABLE III Number Traditional Private Model Model of Batches Model w = 0.01 w = 0.001 w = 0.0001 ID b F1 F1
F1
F1 1086 6 0.891 0.0003 0.707 1.36 0.844 4.01 0.900 1238 19 0.890 0.0017 0.587 4.13 0.844 1
.66 0.810 1239 30 0.747 0.0004 0.671 4.85 0.718 11.08 0.755 1240 3
0.84
0.0005 0.775 3.08 0.833 9.
6 0.852 1242 30 0.772 0.0002 0.721 3.35 0.769 9.57 0.770 1243 28 0.586 0.0120 0.510 19.65 0.561 48.76 0.517 1244 24 0.636 0.0031 0.457 7.35 0.553 21.32 0.576 1245 3
0.760 0.0001 0.700 1.61 0.757 4.
3 0.739 1262 28 0.698 0.0001 0.647 2.12 0.705 5.59 0.688 1300 36 0.853 0.0001 0.798 1.26 0.844
.50 0.850 1444 24 0.796 0.0001 0.729 0.34 0.768 6.64 0.802 14
0 18 0.695 0.0022 0.559 10.74 0.676 24.39 0.674 1500 23 0.814 0.0009 0.755 5.58 0.819 14.80 0.786 1509 35 0.823 0.0087 0.687 16.22 0.787 43.44 0.788 1512 35 0.700 0.0017 0.532 8.01 0.680 22.97 0.692 1520 36 0.883 0.0220 0.
20 28.61 0.8
5 91.34 0.
32 1524 16 0.744 0.0004 0.647 2.16 0.756 6.23 0.770 1551 30 0.548 0.0059 0.445 12.88 0.526 39.29 0.573 1601 19 0.8
5 0.0001 0.819 1.52 0.887 4.33 0.898 1611
1 0.929 0.0071 0.635 18.64 0.86
47.87 0.935 Average 27 0.773 0.0036 0.655 7.67 0.750 21.62 0.757
indicates data missing or illegible when filed
[0110] Table-III, provides the F1 scores for models built using private online learning solution. For batch b, performance is measured on the next batch, b+1. F 1 scores are averaged across all batches. Results for the same bandwidths as table-II are presented, along with the privacy costs . The F1 scores within 0.03 of the non-private scores with the lowest privacy costs are bolded.
[0111] For private online learning, one can see a small but noticeable drop in average F1 performance; from 0.773 to 0.750 for the middle bandwidth of w=0.001. The drop in performance is expected given the much smaller training set size of each x.sub.b compared to the whole x, making it more likely that hashes within the true distribution did not get sampled by x.sub.b, and so were given synthetic 's.
[0112] As expected, having to compose the privacy cost of multiple batches means that privacy costs are higher across the board. Interestingly, the composed privacy costs are only mildly correlated to the number of batches b, with a Pearson's correlation coefficient of r=0.34. This indicates that the shape and fit of the multivariate PDF for each batch still plays an important role, and can vary between batches.
[0113] Unlike in the offline case, no private online models have an F1 score within 0.03 of the non-private version when the bandwidth is w=0.01. For four models1238, 1244, 1509 and 1520performance drops by more than 0.03 even when w=0.0001. However some models still perform quite well. For example when w=0.001, Model 1444 manages to only cost E=0.34 after 24 batches, with performance dropping from 0.796 to 0.768. For a cost of E=2.16, Model 1524 actually improves in performance by over 1%. On average, the private models with F1 scores within 0.03 of the nonprivate models cost =8.438. This is comparable to those reported by other works, such as (=8.6) and (=2, 4, 8).
[0114] The benefits and advantages which may be provided by the present invention have been described above with regard to specific embodiments. These benefits and advantages, and any elements or limitations that may cause them to occur or to become more pronounced are not to be construed as critical, required, or essential features of any or all of the embodiments.
[0115] While the present invention has been described with reference to particular embodiments, it should be understood that the embodiments are illustrative and that the scope of the invention is not limited to these embodiments. Many variations, modifications, additions and improvements to the embodiments described above are possible. It is contemplated that these variations, modifications, additions and improvements fall within the scope of the invention.