Privacy-enhancing technologies for medical tests using genomic data
09536047 · 2017-01-03
Assignee
Inventors
- Erman Ayday (Ankara, TR)
- Jean-Pierre Hubaux (St-Sulpice, CH)
- Jean L. Raisaro (Lausanne, CH)
- Amalio Telenti (La Jolla, CA)
- Jacques Fellay (St. Maurice, CH)
- Paul J. McLaren (Winnipeg, CA)
- Jacques Rougemont (Geneva, CH)
- Mathias Humbert (Saarbrücken, DE)
Cpc classification
H04L2209/76
ELECTRICITY
G16B50/00
PHYSICS
H04L9/0866
ELECTRICITY
G16H10/60
PHYSICS
H04L9/0894
ELECTRICITY
International classification
G06F12/14
PHYSICS
G06F21/62
PHYSICS
Abstract
In this invention, we propose privacy-enhancing technologies for medical tests and personalized medicine methods, which utilize patients' genomic data. Assuming the whole genome sequencing is done by a certified institution, we propose to store patients' genomic data encrypted by a patient's public keys at a Storage and Processing Unit (SPU). A part of the corresponding private key is also stored on the SPU. At the time of the test by a Medical Unit (MU), the patient provides the second part of the private key to the MU. A test with its associated markers is determined by the MU and sent to the SPU. The test is carried out on the encrypted values thanks to homomorphic operation and returned back to the MU. The latter uses the second part of the private key to access the result.
Claims
1. A method to process genomic data comprising the steps of: associating, by a Certified Institution, a patient identification with a given patient; generating, by the Certified Institution, a pair of asymmetric keys related to said patient comprising a private and a public key; dividing, by the Certified Institution, the private key into at least a first and a second part; storing, by the Certified Institution, the second part of the private key in a medical unit or in a patient device; transmitting, by the Certified Institution, the first part of the private key to the Storage and Processing Unit; analyzing, by the Certified Institution, an output of a Deoxyribonucleic Acid (DNA) sequencer and preparing an aligned genomic data for said patient comprising approved variants, such as Single Nucleotide Polymorphisms (SNPs) or structural variants (SVs), each approved variant representing a position in the genome and a value representing a nucleotide that varies between individuals; extracting, by the Certified Institution, real and potential variants from said approved variants, said real and potential variants having each a position, said real variants being a subset of the approved variants and being different for each human being, said potential variants being the remaining part of the approved variants; encrypting the value of each real variant and of at least some selected potential variants with the public key of the patient; and sending the encrypted values with their respective positions and the patient identification to a Storage and Processing Unit.
2. The method of claim 1, further comprising: selecting, by the Certified Institution, all or part of the potential variants; analyzing, by the Certified Institution, the correlation between the selected potential variants and a privacy sensitivity of the real variants; selecting, by the Certified Institution, a number of other potential variants, said number being determined according to the previous analysis and a level of privacy required.
3. The method of claim 1, further comprising the steps of: generating, by the Certified Institution, a dummy variant comprising a dummy position and a dummy value, said dummy position being outside of the overall variant positions of a sequence; encrypting, by the Certified Institution, the positions of the real variants with a symmetric key of the patient; encrypting, by the Certified Institution, the dummy value with the public key of the patient; encrypting, by the Certified Institution, the position of the dummy variant with the symmetric key of the patient; sending, by the Certified Institution, to said Storage and Processing Unit, together with the encrypted variants, the dummy variant as well as the encrypted positions and the encrypted dummy position.
4. The method of claim 3, further comprising the steps of: storing, by the Certified Institution, the position of the dummy variant into a patient device; determining by the Certified Institution a set of positions which are common between the marker's position and the real variant's positions; receiving by the Certified Institution from the medical unit an encrypted set of positions with the symmetric key of said patient, and for the marker's positions not present in the variant's position, dummy positions; sending by the Certified Institution to the Storage and Processing Unit the encrypted marker's positions as well as the patient identification.
5. A method to process genomic data, said method comprising the steps of: receiving by a Storage and Processing Unit encrypted values of real variants, such as Single Nucleotide Polymorphisms (SNP) or structured variants (SVs), for a patient, each real variant representing a position in the genome and a value representing a nucleotide that varies between individuals; storing in the Storage and Processing Unit the encrypted values with their respective positions into the Storage and Processing Unit, as well as an identification of the patient; receiving from a Certified Institution a first part of a private key of the patient, said private key comprising said first part and a second part, said second part being stored in a medical unit or in a patient device; receiving by the Storage and Processing Unit from a medical unit genetic markers related to a personalized clinical test, the respective contributions of the related genetic markers and the patient identification of the patient; retrieving by the Storage and Processing Unit the encrypted values for said patient matching the position of the genetic markers; executing by the Storage and Processing Unit a genetic test by using the retrieved encrypted values, and the contribution of those markers thanks to homomorphic operations; partially decrypting by the Storage and Processing Unit the result of the genetic test using said first part of the private key; sending by the Storage and Processing Unit the partly decrypted result to a medical unit.
6. The method of claim 5, further comprising receiving by the Storage and Processing Unit encrypted values of at least some potential variants of said patient, said real and potential variants having each a position, said real variants being different for each human being, said potential variants being the remaining part of the approved variants.
7. The method of claim 5, further comprising the steps of retrieving by the Storage and Processing Unit, together with the encrypted variants, dummy variants encrypted with the public key of the patient, as well as positions of the real variant and of the dummy variants encrypted with a symmetric key of the patient.
8. The method of claim 5, further comprising receiving by the Storage and Processing Unit the allele associated with said genetic markers related to the personalized clinical test, and the corresponding probabilities.
Description
1.4 Brief Description of the Figures
(1) The invention will be better understood thanks to the attached figures in which:
(2) The
(3) The
(4) The
(5) The
(6) The
(7) The
(8) The
(9) The
(10) The
(11) The
2 PETS FOR MEDICAL TESTS AND PERSONALIZED MEDICINE METHODS
(12) In the present case, we study the privacy issues of medical tests and personalized medicine methods. Most medical tests and personalized medicine methods (that use genomic data) involve a patient and a medical unit. The patient is identified by a patient identification (ID), which could be a user name or a pseudonym (e.g., hash value of his social security number). In general, the medical unit is the family doctor, a physician, a pharmacist, or a medical council. In this study, we consider a malicious medical unit as the potential attacker. That is, a medical unit can be a malicious institution trying to obtain private information about a patient. Even if the medical unit is non-malicious, it is extremely difficult for medical units to protect themselves against the misdeeds of a hacker or a disgruntled employee. Similarly, the genomic data is too sensitive to be stored on users' personal devices (mostly due to security, availability, and storage issues), hence it is risky to leave the users' genomic data in their own hands. In addition, extreme precaution is needed between the patient and the medical unit due to the sensitivity of genomic data. Thus, we believe that a Storage and Processing Unit (SPU) should be used to store and process the genomic data. We note that a private company (e.g., cloud storage service), the government, or a non-profit organization could play the role of the SPU. We also assume that the SPU is an honest organization, but it might be curious (e.g., existence of a curious party at the SPU), hence genomic data should be stored at the SPU in encrypted form (i.e., the SPU should not be able to access the content of patients' genomic data). This general architecture is illustrated in
(13) For the simplicity of presentation, in the rest of this section, we will focus on a particular medical test (namely, computing genetic disease susceptibility). We note that similar techniques would apply for other medical tests and personalized medicine methods. In a typical disease-susceptibility test, a medical unit (MU) wants to check the susceptibility of a patient (P) to a particular disease X (i.e., probability that the patient P will develop disease X). It is shown that a genetic disease-susceptibility test can be realized by analyzing particular genetic variants of the patient via some operations, such as weighted averaging [25] or Likelihood Ratio (LR) test [26]. For simplicity, we focus on the simplest type of variant, the Single Nucleotide Polymorphism (SNP). Yet, the proposed methods are also valid for more complex types of variants such as the Structural Variants (SVs) that include among others: Copy-number Variations (CNVs), Inversions, Insertions, Deletions, etc. A SNP is a position in the genome holding a nucleotide (A, T, C or G), which varies between individuals. For example, it is reported that there are three particular genes bearing a total of ten particular SNPs necessary to analyze a patient's susceptibility to Alzheimer's disease [27]. Each SNP contributes to the susceptibility in a different amount and the contribution amount of each SNP is determined by previous studies on case and control groups (these studies are published in several papers). Furthermore, some of the SNPs contribute to the development of a disease, whereas some are protective.
(14) In general, there are two alleles observed at a given SNP position: (i) The major allele is the most frequently observed nucleotide, and (ii) the minor allele is the rare nucleotide. Everyone inherits one allele of every SNP location from each of his parents. If an individual receives the same allele from both parents, he is said to have a homozygous variant for that SNP location. If, however, he inherits a different allele from each parent (one minor and one major), he has a heterozygous variant. There are approximately 40 million approved variants (SNPs) in the human population as of now (according to the NCBI dbSNP [28]) and each patient carries on average 4 million SNPs (i.e., real variants) out of this 40 million. Moreover, this set of 4 million SNPs is different for each patient. From now on, to avoid confusion, for each patient, we refer to these 4 million variants as the real SNPs and the remaining non-variants (approved SNPs that do not exist for the considered patient) as the potential SNPs of the patient; when we only say SNPs, we mean both the real and potential SNPs.
(15) At this point, it can be argued that these 4 million real SNPs (nucleotides) could be easily stored on the patient's computer or mobile device, instead of the SPU. However, we assert that this should be avoided due to the following issues. On one hand, the number of approved SNPs in human population continues to increase with new discoveries. Further, as mentioned above, types of variations in human population are not limited to SNPs, and there are other types of variations such as Copy-Number Variations (CNVs), rearrangements, or translocations (our proposed privacy-preserving mechanisms can be smoothly adapted for these alternative variations), consequently the required storage per patient is likely to be considerably more than only 4 million nucleotides. This higher storage cost might still be affordable to an average patient (via desktop computers or USB drives), however, genomic data of the patient should be available any time (e.g., for emergencies), thus it should be stored at a reliable source such as the SPU. On the other hand, as we discussed before, leaving the patient's genomic data in his own hands and letting him store it on his computer or mobile device is risky, because his mobile device can be stolen or his computer can be hacked.
(16) A potential attacker can learn about the susceptibilities of the patient to privacy-sensitive diseases if he obtains some specific real SNPs of the patient. Moreover, the knowledge of 75 real SNPs (out of approximately 4 million), if not fewer, will enable the attacker to identify a person [29]. These situations could lead to genetic discrimination such as denying a person's access to health (or life) insurance or obstructing his employment opportunities. As we discussed before, in our setting, both the MU and SPU pose a threat to the patient's privacy. On one hand, the MU can either be a malicious institution trying to obtain private information about the patient or it can be hacked by another malicious entity. On the other hand, the SPU is considered as an honest but curious entity. Thus, our goal is to build mechanisms in which the patient can preserve the privacy of his genomic sequence (his real genetic variants) while enabling the MU to access his genomic data and conduct genetic tests.
(17) We assume that the whole genome sequencing is done by a Certified Institution (CI) with the consent of the patient. Moreover, the genomic data of the patient is encrypted by the same CI (using the patient's public key) and uploaded to the SPU so that only the patient can decrypt the stored (potential or real) SNPs, and the SPU cannot access the SNPs of the patient. We are aware that the number of discovered SNPs increases with time. Thus, the patient's complete DNA sequence is also encrypted as a single vector file (via symmetric encryption using the patient's key) and stored at the SPU, thus when new SNPs are discovered, these can be included in the pool of the previously stored SNPs of the patient. We also assume the SPU does not have access to the real identities of the patients and data is stored at the SPU by using pseudonyms; this way, the SPU cannot associate the conducted genomic tests to the real identities of the patients. As an alternative, the privacy of the genomic data at the SPU can be further increased using privacy enhanced access control [30] or Oblivious RAM (O-RAM) storage [31] techniques, in which the data access patterns are completely hidden from the server (SPU). Note that even the most efficient implementation of O-RAM introduces high storage overhead to the client (patient), and it introduces 2025 times more overhead with respect to non-oblivious storage. Thus once it becomes more efficient, O-RAM storage could be considered as a future add-on to the proposed privacy-preserving mechanisms.
(18) Depending on the access rights of the MU, the SPU can either (i) compute Pr(X), the probability that the patient will develop the disease X by checking the patient's encrypted variants via homomorphic encryption techniques [33] (In one of our proposed schemes, see Method 3 in Section 2.4, Pr(X) is computed at the MC via homomorphic operations), or (ii) provide the relevant variants to the MU (e.g., for complex diseases that cannot be interpreted using homomorphic operations). These access rights are defined either jointly by the MU and the patient or by the medical authorities. Further, access rights can be enforced by using a secure attribute-based system as in [34]. We note that homomorphic encryption lets the SPU (or MU) compute Pr(X) using encrypted variants of the patient P. In other words, the SPU (or MU) does not access P's variants to compute his predicted disease susceptibility. We use a modification of the Paillier cryptosystem (described in Section 2.1) to support the homomorphic operations at the SPU (or MU).
(19) We propose four different techniques for the storage and process of the SNPs at the SPU and the preservation of the patient's privacy: (i) Method 0 in Section 2.2, (ii) Method 1 in Section 2.3, (iii) Method 2 in Section 2.4, and (vi) Method 3 in Section 2.5. We describe these proposed techniques in detail in the following subsections. We also discuss the computation of genetic disease susceptibility by using homomorphic operations in Section 2.6.
(20) In the rest of this work, for simplicity of the presentation, we do not consider the type of the variant at a real SNP location (i.e., whether the variation is homozygous or heterozygous for that real SNP); we only consider whether the patient has a real SNP or not at a particular location. However, the proposed approaches and the analysis (in Section 2.4) can easily be extended to cover the types of the variants. In order to facilitate future references, frequently used notations are listed in Table I for the different stages of the proposed schemes.
(21) TABLE-US-00001 TABLE I NOTATIONS AND DEFINITIONS. General Notations SNP.sub.i.sup.P Type of SNP i, SNP.sub.i, of the patient P. SNP.sub.i.sup.P {0, 1}, 0 representing a potential SNP (i.e., non-variant) for P, and 1 representing a real SNP (i.e., a variant) for P. S.sub.P.sup.X Predicted susceptibility of the patient P to disease X. .sub.P Set of real SNPs of the patient P (SNPs at which P has a variant: around 4 million at each patient). .sub.P Set of potential SNPs of the patient P (SNPs at which P does not have a variant: around 36 million at each patient). Cryptographic Notations n, g Public parameters of modified Paillier cryptosystem. x Weak private key of the patient P. x.sup.(i) i.sup.th share of the patient P's private key. g.sup.x Public key of the patient P. E(m, g.sup.x) Encryption of message m with the patient P's public key. Susceptibility Test via Weighted Averaging p.sub.j.sup.i(X) Probability that P would develop disease X, given SNP.sub.i.sup.P = j, Pr(X|SNP.sub.i.sup.P = j). C.sub.i.sup.X Contribution of SNP.sub.i to the susceptibility to disease X. Susceptibility Test via Likelihood Ratios I.sub.X.sup.P Initial risk of the patient P for disease X. L.sub.X.sup.i(j) Likelihood Ratio (LR) when SNP.sub.i = j for disease X.
2.1 Paillier Cryptosystem
(22) In this section, we briefly review the modified Paillier cryptosystem (described in detail in [33, 35]), which we use in this work, and its homomorphic properties. We note that the usual notation in Paillier cryptosystem is to use a pair of keys named public and secret key. However, for the present description, we will use the denote the keys as public and private.
(23) The public key of the patient P is represented as (n, g, h=g.sup.x), where the strong private key is the factorization of n=pq (p, q are safe primes), the weak private key is x[1, n.sup.2/2], and g of order (p1)(q1)/2. Such a g can be easily found by selecting a random aZ*.sub.n.sub.
(24) Encryption of a Message: To encrypt a message mZ*.sub.n.sub.
T.sub.1=g.sup.r mod n.sup.2 and T.sub.2=h.sup.r(1+mn)mod n.sup.2.(1)
(25) Re-Encryption of a Message:
(26) An encrypted message (T.sub.1, T.sub.2) can be re-encrypted under the same public key, using a new random number r.sub.1[1, n/4] as below:
{circumflex over (T)}.sub.1=g.sup.r.sup.
(27) Decryption of a Message: The message m can be recovered as follows:
(28)
(29) Homomorphic Properties:
(30) Assume two messages m.sub.1 and m.sub.2 are encrypted using two different random numbers r.sub.1 and r.sub.2, under the same public key, (n, g, h=g.sup.x), such that E(m.sub.1, r.sub.1, g.sup.x)=(T.sub.1.sup.1, T.sub.2.sup.1) and E(m.sub.2, r.sub.2, g.sup.x)=(T.sub.1.sup.2, T.sub.2.sup.2). Assume also that c is a constant number. Then the below-mentioned homomorphic properties are supported by Paillier cryptosystem: The product of two ciphertexts will decrypt to the sum of their corresponding plaintexts.
D(E(m.sub.1,r.sub.1,g.sup.x).Math.E(m.sub.2,r.sub.2,g.sup.x))=D(T.sub.1.sup.1.Math.T.sub.1.sup.2.Math.T.sub.2.sup.1.Math.T.sub.2.sup.2 mod n.sup.2)=m.sub.1+m.sub.2 mod n.(4) An encrypted plaintext raised to a constant c will decrypt to the product of the plaintext and the constant.
D(E(m.sub.1,r.sub.1,g.sup.x).sup.c)=D((T.sub.1.sup.1).sup.c,(T.sub.2.sup.1).sup.c mod n.sup.2)=cm.sub.1 mod n.(5)
(31) These homomorphic operations are conducted at the SPU (or MU depending on which approach is used) to compute the predicted susceptibility of the patient P to disease X, as will be discussed in Section 2.6.
(32) Proxy encryption: The patient's weak private key x is divided (preferably randomly or by any other rule) into two shares: x.sup.(1) and x.sup.(2) (such that x=x.sup.(1)+x.sup.(2)). x.sup.(1) is given to the SPU and x.sup.(2) is given to the MU. Using the above Paillier cryptosystem, an encrypted message (T.sub.1, T.sub.2) (under the patient's public key) can be partially decrypted by the SPU (using x.sup.(1)) to generate the ciphertext pair ({tilde over (T)}.sub.1, {tilde over (T)}.sub.2) as below:
{tilde over (T)}.sub.1=T.sub.2 and {tilde over (T)}.sub.2=T.sub.2/T.sub.1.sup.x.sup.
(33) Now, ({tilde over (T)}.sub.1,{tilde over (T)}.sub.2) can be decrypted at the MU using x.sup.(2) to recover the original message. x.sup.(2) can be provided to the MU once the patient is registered to the medical unit or through the patient's digital ID card. Further details about the distribution of shares are out of the scope of this paper. We note that this approach is not proxy re-encryption; it is based on secret-sharing.
(34) Overall, this modified Paillier cryptosystem is not key optimal, because the size of the MU's and SPU's secret storages do not remain constant. That is, both the MU and SPU need to store a secret for every patient.
(35) However, this storage cost can be considered negligible when compared to the storage of the genomic data. Further, the shares (e.g., x.sup.(1) and x.sup.(2)) can be stored by the patient and sent to the MU and SPU only when it is needed in order to resolve this storage issue at the expense of extra communication overhead. Furthermore, the above modified Paillier cryptosystem is not proxy invisible, because all participants of the systems (i.e., P, MU and SPU) should be aware of the existence of the proxy. We discuss the security evaluation of this cryptosystem in Section 3.2.
2.2 Method 0
Only Store the Real SNPS at the SPU
(36) In this approach, the real SNPs of the patient are stored encrypted (via the patient's public key) and the locations of the corresponding real SNPs are stored in plaintext at the SPU.
(37) We assume that SNP.sub.i at the patient P is represented as SNP.sub.i.sup.P and SNP.sub.i.sup.P=1, if P has a real SNP (i.e., variant) at this location, and SNP.sub.i.sup.P=0, if P does not have a variant at this location. We let .sub.p be the set of real SNPs of the patient P (at which SNP.sub.i.sup.P=1). We also let P represent the set of potential SNPs (at which SNP.sub.i.sup.P=0).
(38) Below, we summarize the proposed approach for the privacy protecting disease-susceptibility test by using this particular storage technique. Step 0: The asymmetric keys (public and private keys) of each patient are generated and distributed to the patients during the initialization period. Then, symmetric keys are established between the parties, using which the communication between the parties is protected from an eavesdropper. We note that the distribution, update and revocation of cryptographic keys are handled by a trusted entity (similar to e-banking platforms). Step 1: The patient (P) provides his sample (e.g., his saliva) to the Certified Institution (CI) for sequencing. Step 2: The CI sequences P, and encrypts the contents of his real SNP locations (in .sub.P) by using P's public key. Step 3: The CI sends the encrypted real SNPs of P to the SPU (so that the SPU cannot access to P's SNPs). Step 4: We divide the private key into a first and a second part, the patient provides the first part of his private key (x.sup.(1)) to the SPU. Step 5: The MU wants to conduct a susceptibility test on P to a particular disease X, and P provides the second part of his private key (x.sup.(2)) to the MU as well as his identification ID. Step 6: The MU provides genetic variant markers, along with their individual contributions (to the disease susceptibility), to the SPU. Step 7: If the disease susceptibility can be interpreted by homomorphic operations, the SPU computes P's total susceptibility to disease X from the individual effects of SNPs by using the homomorphic properties of the Paillier cryptosystem as described in Section 2.6. Otherwise, the SPU provides the relevant real SNPs to the MU based on MU's access rights. Step 7: The SPU partially decrypts the end-result (or the relevant SNPs) using the first part of P's private key for example by following a proxy encryption protocol (Section 2.1). Step 8: The SPU sends the partially decrypted end-result (or the relevant real SNPs) to the MU. Step 9: The MU decrypts the message received from the SPU using the second part of P's private key and recovers the end-result (or the relevant real SNPs).
2.3 Method 1
Plaintext Locations at the SPU
(39) Method 0 in Section 2.2 might leak private information to the curious party at the SPU. As the locations of the SNPs are stored in plaintext, if the SPU only stores the real SNPs in .sub.P, a curious party at the SPU can learn all real SNP locations of the patient, and hence, much about his genomic sequence. The nucleotides corresponding to variants at particular locations of the DNA sequence are public knowledge. Thus, even though the contents of patient's real SNPs are encrypted, a curious party at the SPU can infer the nucleotides corresponding to these SNPs from their plaintext locations. Therefore, in this method, the SPU stores the contents of both real and potential SNP locations (in {.sub.P.sub.P}) in order to preserve the privacy of the patient. The locations of the corresponding SNPs are again stored in plaintext at the SPU. This is because, when a particular SNP (or set of SNPs) are queried by the MU, the SPU should know which SNPs to process (or send to the MU).
(40) As before, we assume that SNP.sub.i, at the patient P is represented as SNP.sub.i.sup.P and SNP.sub.i.sup.P=1, if P has a real SNP (i.e., variant) at this location, and SNP.sub.i.sup.P=0, if P does not have a variant at this location. We let .sub.P be the set of real SNPs of the patient P (at which SNP.sub.i.sup.P=1). We also let P represent the set of potential SNPs (at which SNP.sub.i.sup.P=0). Below, we summarize the proposed approach for the privacy protecting disease-susceptibility test by using this particular storage technique. This approach is illustrated in
(41) The above technique provides a high level of privacy and practicality for the patient, because (i) from the view point of a curious party at the SPU, inferring the locations of the patient's real SNPs with the stored information is equivalent to inferring them with no information about the patient, and (ii) the patient is not involved in the protocol after the sequencing (except for the consent between the patient and the MU for a particular test). However, this level of privacy and practicality comes at the cost of extra storage overhead at the SPU (due to the storage of both real and potential SNPs as discussed in Section 3.1).
2.4 Method 2
Redundant Storage at the SPU
(42) Due to the significant storage overhead mentioned in Section 2.3, here we propose another technique that reduces the storage overhead at the SPU at the expense of decrease in privacy. In a nutshell, we leave everything the same as in Section 2.3, but, instead of storing the contents of all potential and real SNP locations, we store the real SNPs of the patient along with a certain level of redundancy (i.e., contents of some potential SNP locations). In other words, to mislead a curious party at the SPU, among the 40 million discovered SNPs, we store the approximately 4 million real SNPs (for which SNP.sub.i.sup.P=1, i.sub.P) along with some redundant content from .sub.p (with SNP.sub.j.sup.P=0), for each patient.
(43) Again, we assume that the location of the encrypted (real or potential) SNPs are stored in plaintext at the SPU and there exists a potential curious party at the SPU trying to infer the real SNPs of the patient (in .sub.P). An important issue to consider in this approach is the Linkage Disequilibrium (LD) between SNPs [36].
(44) LD occurs when SNPs at two loci (SNP positions) are not independent of each other. For simplicity, we represent the LD relationship between two SNPs i and j as Pr(SNP.sub.i|SNP.sub.j), where SNP.sub.i (or SNP.sub.j) takes values from the set {0, 1}. In compliance with genetic observations, we assume that the LD between two SNPs are not symmetric, i.e., Pr(SNP.sub.i|SNP.sub.j)Pr(SNP.sub.j|SNP.sub.i). We note that LD relationships are defined among all 40 million discovered SNPs, regardless of their type (i.e., real or potential) at a particular patient.
(45) As in Section 2.3, the SPU provides the end-result of a disease-susceptibility test or the relevant SNPs to the MU. However, in this case, if a particular potential SNP (requested by the MU or needed in the susceptibility test) is not stored at the SPU (i.e., SNP.sub.j.sup.P=0), one of the following two scenarios occurs: (i) If the SPU provides the relevant SNPs to the MU, MU infers the missing potential SNPs from the reference genome (since it is known that the missing potential SNPs are not a variant for P), or (ii) if the SPU provides the end-result of the susceptibility test, the SPU uses the fact that SNP.sub.j.sup.P=0 for each missing potential SNP.sub.j.
(46) As expected, the amount of storage redundancy (due to the storage of the content from .sub.p), along with the LD between the SNPs and their characteristics, determine the level of a patient's genomic privacy.
(47) Therefore, in the rest of this section, we analyze the relationship between the amount of redundancy, LD values, characteristics of the SNPs, and the level of privacy. To do so, first, we observe the average probability of correctly inferring the locations of P's real SNPs (in .sup.y .sub.p) considering varying amounts of redundancy and the LD values between the SNPs. That is, how much information from a patient's un-stored potential SNPs is revealed to the curious party at the SPU about the locations of his real SNPs? This problem can also be formulated similarly if the goal of the attacker is to determine the type of the variant at a real SNP location (e.g., homozygous or heterozygous). In this case, SNP.sub.i.sup.P can take three different values from the set {0, 1, 2}, 0 representing a potential SNP (i.e., non-variant) 1 representing a real homozygous SNP, and 2 representing a real heterozygous SNP for P. It is worth noting that for this study, we create realistic models for the LD values and the characteristics of the SNPs. Further, for the created models, we try a wide range of parameters and observe a wide range of results to address most potential scenarios. However, as the field of genomics becomes more mature, our models can be replaced by the values obtained from the medical research.
(48) We let .sub.p.sup.s and .sub.p.sup.u denote the set of P's potential SNPs that are stored (for redundancy) and not stored at the SPU, respectively (.sub.p.sup.s.sub.p.sup.u=.sub.p). Further, K.sub.i is the set of SNPs with which a particular SNP i has LD, and |K.sub.i|=k (for each SNP, these k SNPs are chosen among approximately 40 million SNPs). We assume that k0 and it is a truncated Gaussian random variable with only discrete values and obtained from a distribution with mean (k) and standard deviation (k).
(49) Initially, we compute Pr(SNP.sub.i.sup.P) for all (real and potential) SNPs in {.sub.p.sub.p.sup.z} by using the LD relationships between these SNPs and those in .sub.p.sup.u. As all SNPs in {.sub.p.sub.p.sup.z} are encrypted and stored at the SPU, only the LD relationships between these SNPs and the un-stored SNPs in .sub.p.sup.u are helpful for the curious party.
(50) Therefore, for each real SNP i.sub.p, we observe Pr(SNP.sub.i.sup.P=1|SNP.sub.m.sup.P=0) for all m{K.sub.i(i).sub.lp.sup.ts}|}, get the average of these values, and compute Pr(SNP.sub.i.sup.P=1). Similarly, for each potential SNP j.sub.p.sup.s, we observe Pr(SNP.sub.j.sup.P=0|SNP.sub.m.sup.P=0) for all m{K.sub.j[.sub.p.sup.u)], average these values, and compute Pr(SNP.sub.j.sup.P=0). We let / be the indicator of the LD strength between two SNPs. Thus, we represent Pr(SNP.sub.i.sup.P=1|SNP.sub.m.sup.P=0)=/(i.sub.p, m{K.sub.i[.sub.p.sup.u)]) and Pr(SNP.sub.j.sup.P=0|SNP.sub.m.sup.P=0)=/(j.sub.lp.sup.ts,m[K.sub.i(j).sub.lp.sup.tu]) as truncated Gaussian random variables with range [0.5, 1], obtained from a distribution with mean (l) and standard deviation (l).
(51) Finally, if |K.sub.i|=k=0 or |K.sub.i.sub.p.sup.u=0 for a SNP i in {.sub.p.sub.p.sup.s}, we update Pr(SNP.sub.i.sup.P=1) considering the fact that the expected value of all stored SNPs is known by the curious party as below:
(52)
(53) In the following, we illustrate our numerical results that represent the relationship between storage, inference power of the curious party at the SPU, and LD values. We assume |.sub.P=4 million and |.sub.P.sub.P|=40 million. We define the percentage of storage redundancy at the SPU as
(54)
and compute the average value of Pr(SNP.sub.i.sup.P=1) for a SNP in .sub.P for varying values of (k), (k), (l), and (l). Higher values of Pr(SNP.sub.i.sup.P=1) indicate a higher inference power for the curious party at the SPU. We repeat each simulation 100 times to obtain an average. Note that Method 1 (in Section 2.3) is a special case of Method 2 (when the storage redundancy at the SPU is 900%), hence its privacy is the same as 900% redundancy in the following results.
(55) In
(56) We observe that the strength of LD, however, does not affect the inference power as strong as k. Then,
(57) The
(58) In the
(59) Next, considering the individual characteristics of the real SNPs (i.e., their severity levels), we analyze the level of genomic privacy of a patient against a curious party at the SPU. By the level of genomic privacy, we understand the level of information that a third party can infer about the real variants of a patient. The severity of a SNP i can be defined as the privacy-sensitivity of the SNP when SNP.sub.i.sup.P=1 (i.e., when it exists as a variant at the patient P). For example, a real SNP revealing the predisposition of a patient for Alzheimer's disease can be considered more severe than another real SNP revealing his predisposition to a more benign disease. Severity values of the SNPs are determined as a result of medical studies (depending on their contributions to various diseases) and tables of disease severities provided by insurance companies (e.g., percentage of invalidity). We denote the severity of a real SNP i as Vi, and 0Vi1 (1 denotes the highest severity). Thus, we define the genomic privacy of the patient P as below:
(60)
(61) We do not use the traditional entropy metric [37, 38] to quantify privacy, as only one state of SNP.sub.i.sup.P poses privacy risks (i.e. SNP.sub.i.sup.P=1), as discussed before.
(62) First, we study the relationship between the storage redundancy and the severity of the real SNPs by focusing on three types of patients: (i) patient A, carrying mostly low severity real SNPs (in .sub.A), (ii) patient B, carrying mostly high severity real SNPs (in .sub.B), and (iii) patient C, carrying mixed severity real SNPs (in .sub.C). For each patient, the highest level of privacy is achieved when the storage redundancy is maximum (as in Method 1 in Section 2.3). Thus, we recognize this level as 100% genomic privacy for the patient. For the evaluation, we take the highest privacy level of patient C as the base and normalize everything with respect to this value. We use the following parameters for the simulation. The severities of patient A's and patient B's real SNPs are represented as truncated Gaussian random variables with (A, A)=(0.25, 0.15) and (B, B)=(0.75, 0.15), respectively. Furthermore, the severity of patient C's real SNPs are represented as a uniform distribution between 0 and 1. We also set (l)=0.8, (l)=0.25, (k)=2, and (k)=0.75. In
(63) Finally, we study the relationship between the severity of the real SNPs, the number of LD pairs per SNP (number of SNPs with which a particular SNP has LD, i.e., k), and the storage redundancy. We assign the Vi values of the real SNPs (in .sub.P) following a uniform distribution between 0 and 1. We set the LD parameters as (l)=0.8, (l)=0.25, (k)=2, and (k)=1.5. Then, we observe and compare the following potential scenarios in different types of patients: (i) The real low severity SNPs of the patient (i.e., his real SNPs with low Vi values) have a higher number of LD pairs (i.e., higher k values) with respect to his high severity real SNPs (we note that, in all cases, k values are obtained from the same truncated Gaussian distribution with (k)=2, and (k)=1.5); (ii) k values are assigned randomly to the SNPs; and (iii) the real high severity SNPs of the patient (i.e., his real SNPs with high V.sub.i values) have a higher number of LD pairs (i.e., higher k values) with respect to his low severity real SNPs. Again, we set a patient's genomic privacy to 100% when the storage redundancy is maximum at the SPU (as in Method 1 in Section 2.3). We illustrate our results in
(64) We obtained similar patterns for further variations of the variables but we do not present these results due to the space limitation. In summary, depending on the actual (k), (k), (l), (l), and V.sub.i values (which will be determined as a result of the medical research), the storage redundancy can be determined (and customized for each patient based on the types of his variations) for this approach to keep the genomic privacy of the patient at a desired level. Note that the curious party at the SPU cannot infer the real SNPs of the patient (or the severities of the patient's real SNPs) from the amount of customized storage redundancy, because the storage redundancy (for a desired level of genomic privacy) depends on various factors. For example, a patient with low storage redundancy (for a desired level of genomic privacy) could mean that (i) he carries mostly low severity real SNP (as in
2.5 Method 3
Encrypted Locations at the SPU
(65) Let L.sup.P={L.sub.i:i.sub.P} represent the set of locations (on the DNA sequence) of the patient P's real SNPs (in .sub.P). As opposed to the previous two approaches, here, we propose to encrypt the locations of the SNPs along with their contents. By doing so, we save storage costs by storing only the real SNPs in .sub.p at the SPU (around 4 million) while providing the highest level of privacy (as in Section 2.3). These benefits, however, come with a cost in the practicality of the algorithm, introducing extra steps for the patient (P) during the protocol. Although we can assume that these extra steps can easily be handled via the patient's device such as smart card or mobile device, this approach still requires more message exchanges (as will be described next) between the parties, compared to the previous two approaches.
(66) In some environments, dividing the weak private of the patient, and distributing two shares of the weak private key to the SPU and MU might not be acceptable (e.g., when it is likely that the SPU and MU will collaborate to retrieve patient's weak private). Therefore, for the sake of completeness, in the following, we present Method 3 with and without proxy encryption (i.e., without distributing the patient's private key to other parties). The Method 1 and Method 2 can also be modified similarly to avoid proxy encryption.
2.5.1 With Proxy Encryption
(67) The initial steps of the protocol are the same as in Section 2.3, except for Steps 2 and 3 in which the locations of the SNPs are encrypted and a Bloom filter [39] is generated. Below, we summarize the different steps of this approach (the unchanged steps are not repeated). These steps are illustrated in
(68) A Bloom filter is a simple space-efficient randomized data structure for representing a set in order to support membership queries [39]. A Bloom filter for representing a set L.sup.P is described by an array of bits, initially all set to 0. It employs independent hash functions H.sub.1, . . . H.sub. with range {1, . . . , }. For every element L.sub.iL.sup.P, the bits H.sub.1(L.sub.i), . . . , H.sub.(L.sub.i) in the array are set to 1. A location can be set to 1 multiple times, but only the first change has an effect.
(69) After constructing the Bloom filter, the CI encrypts each element in L.sup.P by using a symmetric key shared between the CI and P (established during Step 0 as in Section 2.3) and generates L.sup.P.sub.E={E(L.sub.i):i.sub.P}. The CI also encrypts a dummy variant (representing the potential SNPs in .sub.P) along with the real SNPs of the patient (using P's public key). Furthermore, the CI associates a dummy position L.sub.0 for this dummy variant and encrypts L.sub.0 using the symmetric key between the CI and P to obtain the encrypted dummy position E(L.sub.0). Step 3: The CI sends the constructed Bloom filter and the encrypted dummy position E(L.sub.0) to the patient for storage into the patient device, and encrypted SNPs and locations to the SPU. Step 6: The MU tells the patient the locations of the SNPs that are required for the susceptibility test or requested directly as the relevant SNPs. Step 7: The patient inputs each requested location L.sub.j to the Bloom filter to determine if the corresponding location is stored at his Bloom filter (i.e., to determine if he has a real SNP at the corresponding location).
(70) To check if L.sub.j belongs to L.sup.P, the patient checks whether all H.sub.1(L.sub.j), . . . , H.sub.(L.sub.j) are set to 1. If not, L.sub.j definitely does not belong to L.sup.P. Otherwise, the patient assumes L.sub.jL.sup.P, although this may be wrong with some probability. That is, a Bloom filter could yield a false positive, where it suggests that L.sub.j is in L.sup.P even though it is not. This probability can be decreased at the expense of increasing Bloom filter length (i.e., ). Further, the false positive probability can be significantly reduced by using some proposed techniques such as [40, 41]. As a result of this process
(71) (a) If the location is in his Bloom filter (i.e., if he has a real SNP at the corresponding location), P encrypts the location with the symmetric key between the CI and P.
(72) (b) If the location is not in his Bloom filter (i.e., if he does not have a real SNP at the corresponding location), P uses E(L.sub.0) as the encrypted location.
(73) We note that the above operations can be easily done via the patient's device (e.g., by reading the patient's device at the MU as a consent to the test) or mobile device (e.g., by consenting via a smart phone application) by using the stored Bloom filter output, E(L.sub.0), and symmetric key between the CI and P. Step 8: The patient sends the SPU the encrypted locations of the SNPs which will be provided to the MU.
(74) Step 9: The encrypted SNPs are sent to the MU in the same order as they are requested in Step 6.
(75) (a) If only the end-result is requested, the corresponding SNPs are re-encrypted at the SPU under the patient's public key (re-encryption under the same public key is discusses in Section 2.1). As there is only one value stored at the SPU representing the contents of the potential SNPs at which P does not have a variant (at position E(L.sub.0)), this value is re-encrypted for each different request of a non-variant, so that the MU cannot infer the locations of the non-variants of the patient.
(b) If relevant SNPs are requested, the SPU partially decrypts the relevant SNPs by using a part of P's private key following a proxy encryption protocol (Section 2.1). Step 10: Re-encrypted (or partially decrypted) SNPs are sent to the MU by the SPU. Step 11: One of the following two scenarios occur at the MU: (a) If only the end-result is requested, the MU computes P's total susceptibility to disease X by using the homomorphic properties of the Paillier cryptosystem (similar to the discussion in Section 2.6) under the patient's public key. Although the discussion in Section 2.6 is held considering Method 1 (or Method 2), a similar technique is used for this approach at the MU, hence we do not discuss it again.
(b) If relevant SNPs are requested, the MU decrypts the message received from the SPU by using the other part of P's private key and recovers the relevant SNPs. Step 12: The MU sends the encrypted end-result to the SPU. Step 13: The SPU partially decrypts the end-result using a part of P's private key by following a proxy encryption protocol (Section 2.1) and sends it back to the MU. Step 14: The MU decrypts the message received from the SPU by using the other part of P's private key and recovers the end-result.
2.5.2 Without Proxy Encryption
(76) In this approach, the SPU stores only the encrypted SNPs and encrypted locations. Genomic data encrypted by P's public key is only decrypted at P, and the weak private key of P remains only at P (i.e., shares of the weak private key are not distributed to the SPU or MU). Most of this approach is the same as Method 3 with proxy encryption. Indeed, the first 8 steps of the algorithm are the same, except for the distribution of parts of P's private key. The only difference is the transfer of the end-result or the relevant SNPs to the MU as follows: If the relevant SNPs are requested by the MU, the SPU sends the encrypted SNPs (by P's public key) to P. P decrypts these SNPs (using his weak private key) and sends them to the MU. If the end-result of the susceptibility test is requested by the MU, the disease-susceptibility test is done (via homomorphic operations) at the MU and the encrypted end-result is sent to P. Then, P decrypts the end-result and sends it back to the MU.
(77) We note that the security of the communication between P and the MU is provided by symmetric keys as discussed before. The above operations put some more burdens on the patient during the protocol. However, we emphasize that these operations can be smoothly done on the patient's device without requiring a substantial effort from the patient himself.
(78) In summary, as the locations of the real SNPs are encrypted, a curious party at the SPU cannot infer the contents of the SNPs from their locations (as in Section 2.3), hence it is enough to store only the real SNPs in .sub.P. Furthermore, the privacy provided by this approach (with or without proxy encryption) is the same as 900% redundancy in Method 2 (i.e., similar to Method 1), hence we do not discuss it again. Another advantage of this approach (i.e., Method 3 in general) is that individual contributions of the genetic variant markers remain secret at the MU, because the homomorphic operations are conducted at the MU. This advantage might become more significant when this approach is used for personalized medicine methods in which the pharmaceutical company (embodied in this case as the medical unit) does not want to reveal the genetic properties of its drugs. Thus, if introducing the described extra steps for the patient and few additional message exchanges between the parties are tolerated, this approach operates with relatively modest storage and yet provides very good privacy.
2.6 Computing Disease Susceptibility Via Homomorphic Operations
(79) We now present the disease-susceptibility test via homomorphic operations at the SPU for Method 1 (Section 2.3) and Method 2 (Section 2.4). Similar techniques can be used for Method 3 at the MU, as discussed in Section 2.5.
(80) The SPU uses a proper function to compute P's predicted disease susceptibility via homomorphic encryption. There are different functions for computing the predicted susceptibility. In [25], focusing on one example of many diseases that require a susceptibility test involving multiple SNPs, Kathiresan et al. propose to count the number of unfavorable alleles carried by the patient for each SNP related to a particular disease. Similarly, in [26], Ashley et al. propose to multiply the Likelihood Ratios (LRs) of the most important SNPs for a particular disease in order to compute a patient's predicted susceptibility. LR values are determined as a result of medical studies. Furthermore, a weighted averaging function can also be used, which computes the predicted susceptibility by weighting the contributions of SNPs by their contributions (e.g., LR values of the SNPs). Note that our proposed privacy-preserving mechanisms are not limited by the types of the functions (used to test the disease susceptibility). It is expected that these functions will evolve over time; hence the proposed algorithms can be developed to keep up with this evolution.
(81) In the following, we discuss how to compute the predicted disease susceptibility at the SPU by using a toy example to show how the homomorphic encryption is used at the SPU. Initially, we assume that the function at the SPU is weighted averaging (which is an advanced version of the function proposed in [25]) and show how the predicted susceptibility is computed using encrypted SNPs. Then, we show how the function proposed in [26] (i.e., multiplication of LR values) can be utilized at the SPU.
2.6.1 Weighted Averaging
(82) Assume that (for simplicity) the susceptibility to disease X is determined by the set of SNPs ={SNP.sub.m,SNP.sub.n}, which occur at particular locations of the DNA sequence. SNP.sub.m.sup.P and SNP.sub.n.sup.P are not necessarily among the real SNPs of the patient P (i.e., P does not need to have a variant at those locations). The contributions of different states of SNP.sub.i.sup.P for i{m, n} to the susceptibility to disease X are computed via previous studies (on case and control populations) and they are already known by the MU. That is, p.sup.i.sub.0(X)Pr(X|SNP.sub.i.sup.P=0) and p.sup.i.sub.1(X)Pr(X|SNP.sub.i.sup.P=1) (i{m, n}) are determined and known by the MU. Further, the contribution (e.g., LR value) of SNP.sub.i to the susceptibility to disease X is denoted by C.sub.i.sup.X. Note that these contributions are also computed by previous studies on case and control groups and they are known by the MU.
(83) As we have discussed before, the SPU stores the set of SNPs of the patient P, encrypted by P's public key (n, g, h=g.sup.x). Encryption is done using the modified Paillier cryptosystem as discussed in Section 2.1. Thus, the SPU uses E(SNP.sub.m.sup.P, g.sup.x) and E(SNP.sub.n.sup.P, g.sup.x) for the computation of predicted susceptibility of P to disease X. From now on, we drop the r values in the above encrypted messages for the clarity of the presentation (r values are chosen randomly from the set [1, n/4] for every encrypted message as discussed in Section 2.1). Similarly, the MU provides the following to the SPU in plaintext: (i) the markers for disease X (SNP.sub.m and SNP.sub.n), (ii) corresponding probabilities p.sup.i.sub.j(X), i{m,n} and j{0,1}, and (iii) the contributions of each SNP C.sub.i.sup.X.
(84) Next, the SPU encrypts j(j{0,1}) using P's public key to obtain E(0, g.sup.x) and E(1, g.sup.x) for the homomorphic computations. This encryption can also be done at the MU and sent to the SPU. Alternatively, we might assume that SNPs of a patient are stored at the SPU in pairs of {E(|SNP.sub.i.sup.P0|, g.sup.x), E(|SNP.sub.i.sup.P1|, g.sup.x)} for each SNP.sub.i.sup.P, instead of the actual values of the SNPs. In this case, the above encryption at the SPU would not be required.
(85) The SPU computes the predicted susceptibility of the patient P to disease X by using weighted averaging.
(86) This can be computed in plaintext as below:
(87)
(88) The computation in (9) can be realized using the encrypted SNPs of the patient (and utilizing the homomorphic properties of the Paillier cryptosystem) to compute the encrypted disease susceptibility, E(S.sub.P.sup.X, g.sup.x) as below:
(89)
(90) We note that the end-result in (10) is encrypted by P's public key.
(91) Then, the SPU partially decrypts the end-result E(S.sub.P.sup.X, g.sup.x) using its share (x.sup.(1)) of P's private key (x) as discussed in Section 2.1 to obtain E(S.sub.P.sup.X, g.sup.x.sup.
(92) In some genetic tests, the types of the real SNPs (e.g., homozygous or heterozygous) become also important. In this case, SNP.sub.i.sup.P can take three different values from the set {0, 1, 2} to represent a potential SNP (i.e., nonvariant), a real homozygous SNP, and a real heterozygous SNP, respectively. In such a scenario, to conduct the disease-susceptibility test via homomorphic operations, the SPU should store the squared values of the SNPs. That is, for each SNP.sub.i.sup.P of the patient P, the SPU should store E((SNP.sub.i.sup.P).sup.2, g.sup.x). Depending on the types of genomic tests that would be supported by the SPU (and the functions required for these tests), the format of storage of patient's SNPs can be determined beforehand, and SNPs can be stored accordingly just after the sequencing process.
2.6.2 Likelihood Ratio Test
(93) We now assume that the predicted disease susceptibility is computed from the multiplication of Likelihood Ratios (LRs) of the corresponding SNPs as in [26] and show how such a computation would be handled at the SPU by using homomorphic operations.
(94) In this approach, the predicted disease susceptibility is computed by multiplying the initial risk of the patient (e.g., for disease X) by the LR value of each SNP related to that disease (LR value of a SNP i depends on the value of SNP.sub.i.sup.P at the patient P). The initial risk of the patient P for the disease X is represented as I.sub.X.sup.P. We note that I.sub.X.sup.P is determined by considering several factors (other than patient's genomic data) such as patient's age, gender, height, weight, and environment. Thus, this initial risk can be computed directly by the MU. We also note that if the LR value corresponding to a particular SNP is less than one, the risk for the disease decreases. Otherwise, if the LR value is greater than one, the risk increases for the corresponding disease.
(95) Similar to before, we assume that the susceptibility to disease X is determined by the set of SNPs in ={SNP.sub.m,SNP.sub.n}. We denote the LR values due to SNP.sub.i.sup.P=0 and SNP.sub.i.sup.P=1 for disease X as L.sup.i.sub.X (0) and L.sup.i.sub.X (1), respectively.
(96) The SPU stores the SNPs of the patient P, encrypted by P's public key. The MU sends the following to the SPU: (i) L.sup.i.sub.X (j) values (i{m,n} and j{0,1}) in plaintext, and (ii) the markers for disease X. The MU also encrypts the log of initial risk value, ln(I.sub.X.sup.P), by P's public key and sends E(ln(I.sub.X.sup.P), g.sup.x) to the SPU. Alternatively, the contribution of the initial risk to the disease susceptibility can be included to the end-result at the end, at the MU.
(97) The Paillier cryptosystem does not support multiplicative homomorphism in ciphertext (it only supports the multiplication of a ciphertext with a constant as discussed in Section 2.1). Thus, instead of multiplying the LR values, we propose using addition in log-domain at the SPU. Thus, the SPU computes the predicted susceptibility of P to disease X as below:
(98)
(99) We note that (12) corresponds to the below computation in plaintext:
(100)
(101) As before, the SPU partially decrypts E(ln(SX P), g.sup.x) using x(1) (its share of P's private key) to obtain E(ln(S.sub.P.sup.X), g.sup.x.sup.
2.6.3 Use of Structural Variations
(102) In this section, we describe how the method proposed in Section 2.6.1 for computing disease susceptibility based on weighted averaging can be extended beyond the use of SNPs. A similar approach can be used to also extend the likelihood ratio test as proposed in Section 2.6.2. We consider the case of more complex human genetic variations that involve multiple nucleotides (SNPs involve just a single nucleotide) such as insertions, deletions, copy number variants, and inversions. These variations are more generally referred as structural variations (SVs) [43]. Note that Method 0, Method 1, Method 2 and Method 3 described respectively in Sections 2.2, 2.3, 2.4, and 2.5 can be used with SVs without any further change. Only the disease susceptibility computation changes.
(103) Insertions and deletions (jointly called INDELs) are a specific type of SVs consisting of an insertion or a deletion of one or multiple contiguous nucleotides with respect to the reference genome. INDELs can determine the susceptibility or resistance to diseases susceptibility. For example, INDEL rs333, that is a 32 nucleotides deletion in the CCR5 gene, protects against HIV.
(104) Copy number variants (CNVs) consist of a segment of DNA (several contiguous nucleotides) that is present at a variable copy number in comparison with a reference genome. Some CNV can be associated with diseases. For example, a higher copy number for the EGFR gene can be associated with lung cancer while a higher copy number for CCL3L1 gene can been associated with lower susceptibility to HIV infection [44,45].
(105) Inversions are segment of DNA that are reversed in orientation with respect to the reference genome. Like INDELs and CNVs, also inversions can be associated with diseases. One of the best-characterized recurrent inversions giving rise to disease causes haemophilia A [46].
(106) Differently from SNPs that are mostly bi-allelic (i.e., they have two alleles or versions at a given SNP position), SVs can often be multi-allelic (i.e., they have more than two alleles at a given SV position; for example, for a CNV different copy numbers represent different alleles and there can be multiple copy numbers at a given CNV position).
(107) In the case of a bi-allelic SVs, we can treat them as SNPs. Hence, we can assume that SV.sub.i at the patient P is represented as SV.sub.i.sup.P and SV.sub.i.sup.P=1, if P has a real SV (i.e., a variant) at this location, and SV.sub.i.sup.P=0, if P does not have a variant at this location. The disease susceptibility can then be computed as discussed in Sections 2.6.1 and 2.6.2.
(108) In the case of multi-allelic SVs, different alleles might have a different impact on the susceptibility or resistance to a disease. Hence, a case-by-case assessment is needed to know which allele has a given patient. We can assume that SV.sub.i.sup.P=k.sub.i, if P has a real SV (i.e., a variant) at this location with the allele k.sub.i, and SV.sub.i.sup.P=0, if P does not have a variant at this location (i.e., his sequence is the same as the reference sequence in that position). We assume k.sub.i is a number from 1 to .sub.i, where .sub.i is the total number of alleles for SV.sub.i. Then, the disease susceptibility can be computed as follows.
(109) Similarly to the case of SNPs, we assume that the susceptibility to disease X is determined by the set of SVs ={SV.sub.m, SV.sub.n}. Assume also that a specific allele k.sub.i of SV.sub.i, for i{m, n}, is associated to the susceptibility to disease X and its contribution is computed by previous studies and known by the MU. That is, p.sub.k.sup.i(X)=Pr(X|SV.sub.i.sup.P=k.sub.i) and p.sub.
(110) The SPU stores the SVs of the patient P, encrypted by P's public key. The MU sends the following to the SPU: (i) markers for disease X (SV.sub.m,SV.sub.n), (ii) the allele associated with X encrypted with P's public key E(k.sub.i, g.sup.x), i{m, n}, (iii) the corresponding probabilities p.sub.k.sup.i(X) and p.sub.
(111)
and .sub.e(E(a,g.sup.x),E(b,g.sup.x)) is a secure equality function that takes as input two l bits integers encrypted with the same key E(a,g.sup.x) and E(b,g.sup.x), and outputs E(0, g.sup.x) if ab and E(1,g.sup.x) if a=b. In particular, the proposed secure equality function is computed as follows:
.sub.e(E(a,g.sup.x),E(b,g.sup.x))=.sub.c(E(a,g.sup.x),E(b,g.sup.x)).sub.c(E(b,g.sup.x),E(a,g.sup.x))E(1,g.sup.x).sup.1(3)
where .sub.c(E(a,g.sup.x),E(b,g.sup.x)) is a secure comparison protocol adapted from [47] that takes the same inputs as .sub.e and outputs E(1,g.sup.x) if ab and E(0, g.sup.x), otherwise. The secure comparison function .sub.c is described in detail in Algorithm 1.
(112) We note that (1) corresponds to the below computation in plaintext:
(113)
(114) As before, the SPU partially decrypts E(S.sub.P.sup.X,g.sup.x) using x.sup.(1) (its share of P's private key) to obtain E(S.sub.P.sup.X, g.sup.x.sup.
(115) TABLE-US-00002 Algorithm 1 Secure Comparison f.sub.c (E(a, g.sup.x), E(b, g.sup.x)) Input: @ SPU: E(a, g.sup.x), E(b,g.sup.x) and x.sup.(1). @ MU: x.sup.(2). Output: @ SPU: f.sub.c (E(a, g.sup.x),E(b, g.sup.x)) = E((a b), g.sup.x). @ MU: . // Let a and b be two l-bit integers 1: SPU computes E(z, g.sup.x) E(a, g.sup.x) * E(b, g.sup.x).sup.1 * E(2.sup.l, g.sup.x) = E(a b + 2.sup.l, g.sup.x). 2: SPU generates a random number r, 0 r < n.sup.2, and blinds E(z, g.sup.x): E(z, g.sup.x) E({circumflex over (z)}, g.sup.x) E(r, g.sup.x) = E(z + r, g.sup.x). 3: SPU partially decrypts E({circumflex over (z)}, g.sup.x) with x.sup.(1) and sends E (z, g.sup.x(2)) to MU 4: MU decrypts E ({circumflex over (z)}, g.sup.x(2)) with x.sup.(2) and obtains {circumflex over (z)} 5: MU computes {circumflex over (z)} mod 2.sup.l. 6: SPU computes r mod 2.sup.l. 7: SPU and MU run a DGK or a modified DGK comparison with private inputs and and obtain .sub.SPU (@ SPU) and .sub.MU (@ MU) as described in [6]. 8:
3 Evaluation and Implementation of the Proposed Methods
(116) In
3.1 Implementation and Complexity Evaluation
(117) To evaluate the practicality of the proposed privacy-preserving algorithms, we implemented them, and assessed their storage requirements and computational complexities on Intel Core i7-2620M CPU with 2.70 GHz processor under Windows 7 Enterprise 64-bit Operating System. We set the size of the security parameter (n in Paillier cryptosystem in Section 2.1) to 1024 bits. We computed the disease susceptibility using weighted averaging (at the SPU or MU, see Section 2.6.1 as well as LR test in Section 2.6.2 which also has similar complexity) and real SNP profiles from [42]. Our implementation relies on a MySQL 5.5 database managed by the open source tool MySQL Workbench. To provide a platform-independent implementation, we used the Java programming language along with the open-source Integrated Development Environment, NetBeans IDE 7.1.1., for the implementation of the Java code. We note that our code for the implementation is not optimized, and better results can be expected with an optimized implementation.
(118) In Table II, we summarize the computational and storage complexities of the proposed methods at (i) Certified Institution (CI), (ii) SPU, (iii) MU, and (iv) P. We evaluate the proposed methods considering the following costs: (i) encryption of patient's variants, (ii) disease-susceptibility test at the SPU via homomorphic operations (using ten variants), (iii) decryption of the end-result (or relevant SNPs), (iv) proxy encryption, and (v) storage costs, in which B represent the percentage of storage redundancy at the SPU. We did not explicitly implement the Bloom filter (for Method 3) and symmetric encryption/decryption between the parties for the security of the communication. However, the computational costs due to these operations are negligible compared to Paillier encryption/decryption and homomorphic operations.
(119) We emphasize that the encryption of the variants at the CI is a one-time operation and is significantly faster than the sequencing and analysis of the sequence (which takes days). Further, this encryption can be conducted much more efficiently by computing some parameters, such as (g.sup.r, h.sup.r) pairs, offline for various r values, for each patient. Indeed, by computing (g.sup.r, h.sup.r) pairs offline, we observe that the encryption takes only 0.017 ms per variant at the CI.
(120) TABLE-US-00003 TABLE 2 Computational and Storage Complexities of the Proposed Methods Method 1 and Method 2 @ CI @ SPU @ MU
(121) It is also possible to conduct private statistical tests (by a medical researcher) on the data stored at the SPU in order to get statistics about the variants of multiple patients. Conducting such a statistical test for a variant (about its type) on 100K patients takes around 55 minutes at the SPU and scales linearly with the number of patients. Note that such a statistical test is only possible with Method 1 or Method 2; using Method 3 and querying the encrypted locations of SNPs from 100K patients is not practical for this application.
(122) In summary, all these numbers show the practicality of our privacy-preserving algorithms.
3.2 Security Evaluation
(123) The proposed schemes preserve the privacy of patients' genomic data relying on the security strength of modified Paillier cryptosystem (in Section 2.1). The extensive security evaluation of the modified Paillier cryptosystem can be found in [33]. Below we summarize two important security features of this cryptosystem. One-wayness: This property means that no efficient adversary has any significant chance of finding a preimage to the ciphertext when he sees only the ciphertext and the public key of the patient. It is shown in [33] that the one-wayness of the modified Paillier cryptosystem can be related to the Lift Diffie-Hellman problem which is shown to be as hard as the partial Discrete Logarithm problem. Semantic security: This property ensures that an adversary will be unable to distinguish pairs of ciphertexts based on the message they encrypt. It is shown in [33] that if Decisional Diffie-Hellman Assumption (a computational hardness assumption about a certain problem involving discrete logarithms in cyclic groups) in Z*.sub.n.sub.
(124) Finally, if the weak private key of the patient, x, is randomly divided and distributed to the Storage and Processing Unit (SPU) and Medical unit (MU) as in Method 1, this weak private key could be revealed if the MU colludes with the SPU, but the factors n, p, and q remain secret. We note that such a collusion is not considered in this study. However, for the sake of completeness, in Section 2.5.2, we present an alternative approach (Method 3 without proxy encryption) that avoids distributing the patient's weak private key to other parties, hence is robust against such a collusion.
(125) The invention is also related to a computer readable storage medium having recorded thereon a computer program for processing genomic data of a patient and performing the steps of any of the method claims.
REFERENCES
(126) [1] A. Cavoukian, Privacy by design, 2009, http://www.ontla.on.ca/library/repository/mon/23002/289982.pdf. [2] S. F. Gurses, Multilateral privacy requirements analysis in online social network services, 2010, PhD thesis, K U Leuven. [3] M. Langheinrich, Principles of privacy-aware ubiquitous systems, Proceedings of Ubiquitous Computing (UbiComp), 2001. [4] G. van Blarkom, J. Borking, and J. Olk, Handbook of privacy and privacy-enhancing technologies (the case of intelligent software agents), College bescherming persoonsgegevens, 2003. [5] http://www.personalgenomes.org/consent/PGP Consent Approved 02212012. pdf. [6] J. R. Troncoso-Pastoriza, S. Katzenbeisser, and M. Celik, Privacy preserving error resilient DNA searching through oblivious automata, CCS '07: Proceedings of the 14th ACM Conference on Computer and Communications Security, pp. 519-528, 2007. [7] M. Blanton and M. Aliasgari, Secure outsourcing of DNA searching via finite automata, DBSec10: Proceedings of the 24th Annual IFIP WG 11.3 Working Conference on Data and Applications Security and Privacy, pp. 49-64, 2010. [8] S. Jha, L. Kruger, and V. Shmatikov, Towards practical privacy for genomic computation, Proceedings of the 2008 IEEE Symposium on Security and Privacy, pp. 216-230, 2008. [9] F. Bruekers, S. Katzenbeisser, K. Kursawe, and P. Tuyls, Privacy-preserving matching of DNA profiles, Tech. Rep., 2008. [10] M. Kantarcioglu, W. Jiang, Y. Liu, and B. Malin, A cryptographic approach to securely share and query genomic sequences, IEEE Transactions on Information Technology in Biomedicine, vol. 12, no. 5, pp. 606-617, 2008. [11] P. Baldi, R. Baronio, E. De Cristofaro, P. Gasti, and G. Tsudik, Countering GATTACA: efficient and secure testing of fully-sequenced human genomes, CCS '11: Proceedings of the 18th ACM Conference on Computer and Communications Security, pp. 691-702, 2011. [12] M. Canim, M. Kantarcioglu, and B. Malin, Secure management of biomedical data with cryptographic hardware, IEEE Transactions on Information Technology in Biomedicine, vol. 16, no. 1, 2012. [13] D. Eppstein, M. T. Goodrich, and P. Baldi, Privacy-enhanced methods for comparing compressed DNA sequences, CoRR, vol. abs/1107.3593, 2011. [Online]. Available: http://arxiv.org/abs/1107.3593 [14] D. Eppstein and M. T. Goodrich, Straggler identification in round-trip data streams via Newton's identities and invertible Bloom filters, IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 2, pp. 297-306, 2011. [15] R. Wang, Y. F. Li, X. Wang, H. Tang, and X. Zhou, Learning your identity and disease from research papers: information leaks in genome wide association study, CCS '09: Proceedings of the 16th ACM Conference on Computer and Communications Security, pp. 534-544, 2009. [16] B. Malin and L. Sweeney, How (not) to protect genomic data privacy in a distributed network: using trail re-identification to evaluate and design anonymity protection systems, Journal of Biomedical Informatics, vol. 37, pp. 179-192, June 2004. [17] N. Homer, S. Szelinger, M. Redman, D. Duggan, and W. Tembe, Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays, PLoS Genetics, vol. 4, August 2008. [18] J. Gitschier, Inferential genotyping of Y chromosomes in Latter-Day Saints founders and comparison to Utah samples in the HapMap project, Am. J. Hum. Genet., vol. 84, pp. 251-258, 2009. [19] X. Zhou, B. Peng, Y. F. Li, Y. Chen, H. Tang, and X. Wang, To release or not to release: evaluating information leaks in aggregate human-genome data, ESORICS'11: Proceedings of the 16th European Conference on Research in Computer Security, pp. 607-627, 2011. [20] S. E. Fienberg, A. Slavkovic, and C. Uhler, Privacy preserving GWAS data sharing, Proceedings of the IEEE 11th International Conference on Data Mining Workshops (ICDMW), December 2011. [21] Y. Chen, B. Peng, X. Wang, and H. Tang, Large-scale privacy-preserving mapping of human genomic sequences on hybrid clouds, NDSS12: Proceeding of the 19th Network and Distributed System Security Symposium, 2012. [22] R. Wang, X. Wang, Z. Li, H. Tang, M. K. Reiter, and Z. Dong, Privacy-preserving genomic computation through program specialization, Proceedings of the 16th ACM Conference on Computer and Communications Security, pp. 338-347, 2009. [23] R. Agrawal, A. Evfinnievski, and R. Srikant, Information sharing across private databases, Proceedings of SIGMOD Conference, 2003. [24] D. Dachman-Soled, T. Malkin, M. Raykova, and M. Yung, Efficient robust private set intersection, Proceedings of the 7th International Conference on Applied Cryptography and Network Security, pp. 125-142, 2009. [25] S. Kathiresan, O. Melander, D. Anevski, C. Guiducci, and N. Burtt, Polymorphisms associated with cholesterol and risk of cardiovascular events, The New England Journal of Medicine, vol. 358, pp. 1240-1249, 2008. [26] E. Ashley, A. Butte, M. Wheeler, R. Chen, and T. Klein, Clinical assessment incorporating a personal genome, The Lancet, vol. 375, no. 9725, pp. 1525-1535, 2010. [27] S. Seshadri, A. Fitzpatrick, M. A. Ikram, A. DeStefano, V. Gudnason, M. Boada, J. Bis, A. Smith, M. Carassquillo, J. Lambert, C. Consortium, G. Consortium, and E. Consortium, Genome-wide analysis of genetic loci associated with Alzheimer disease, JAMA, vol. 303, pp. 1832-1840, 2010. [28] http://www.ncbi.nInn.nih.gov/projects/SNP/. [29] D. Greenbaum, A. Sboner, X. Mu, and M. Gerstein, Genomics and privacy: Implications of the new reality of closed data for the field, PLoS Computational Biology, vol. 7, no. 12, 2011. [30] M. Raykova, H. Zhao, and S. M. Bellovin, Privacy enhanced access control for outsourced data sharing, Financial Cryptography and Data Security, 2012. [31] M. T. Goodrich and M. Mitzenmacher, Privacy-preserving access of outsourced data via oblivious RAM simulation, Proceedings of the 38th International Conference on Automata, Languages and ProgrammingVolume Part II, pp. 576-587, 2011. [32] E. Stefanov, E. Shi, and D. Song, Towards practical oblivious RAM, NDSS12: Proceeding of the 19th Network and Distributed System Security Symposium, 2012. [33] E. Bresson, D. Catalano, and D. Pointcheval, A simple public-key cryptosystem with a double trapdoor decryption mechanism and its applications, Proceedings of Asiacrypt 03, LNCS 2894, pp. 37-54, 2003. [34] M. Pirretti, P. Traynor, P. McDaniel, and B. Waters, Secure attribute-based systems, Proceedings of the 13th ACM Conference on Computer and Communications Security, pp. 99-112, 2006. [35] G. Ateniese, K. Fu, M. Green, and S. Hohenberger, Improved proxy re-encryption schemes with applications to secure distributed storage, ACM Transactions on Information and System Security, vol. 9, pp. 1-30, February 2006. [36] D. S. Falconer and T. F. Mackay, Introduction to Quantitative Genetics (4th Edition). Harlow, Essex, UK: Addison Wesley Longman, 1996. [37] C. Diaz, S. Seys, J. Claessens, and B. Preneel, Towards measuring anonymity, Proceedings of Privacy Enhancing Technologies Symposium (PETS), 2002. [38] A. Serjantov and G. Danezis, Towards an information theoretic metric for anonymity, Proceedings of Privacy Enhancing Technologies Symposium (PETS), 2002. [39] B. H. Bloom, Space/time trade-offs in hash coding with allowable errors, ACM Communications, vol. 13, no. 7, pp. 422-426, 1970. [40] F. Hao, M. Kodialam, and T. V. Lakshman, Building high accuracy Bloom filters using partitioned hashing, Proceedings of ACM International Conference on Measurement and Modeling of Computer Systems, pp. 277-288, 2007. [41] P. S. Almeida, C. Baquero, N. Preguica, and D. Hutchison, Scalable Bloom filters, Information Processing Letters, vol. 101, no. 6, pp. 255-261, 2007. [42] The 1000 Genomes Project Consortium, A map of human genome variation from population-scale sequencing, Nature, vol. 467, pp. 1061-1073, 2010. [43] Sudmant, Peter H., Tobias Rausch, Eugene J. Gardner, Robert E. Handsaker, Alexej Abyzov, John Huddleston, Yan Zhang et al. An integrated map of structural variation in 2,504 human genomes. Nature 526, no. 7571 (2015): 75-81. [44] Cappuzzo, Federico, Fred R. Hirsch, Elisa Rossi, Stefania Bartolini, Giovanni L. Ceresoli, Lynne Bemis, Jerry Haney et al. Epidermal growth factor receptor gene and protein and gefitinib sensitivity in non-small-cell lung cancer. Journal of the National Cancer Institute 97, no. 9 (2005): 643-655. [45] Gonzalez, Enrique, Hemant Kulkarni, Hector Bolivar, Andrea Mangano, Racquel Sanchez, Gabriel Catano, Robert J. Nibbs et al. The influence of CCL3L1 gene-containing segmental duplications on HIV-1/AIDS susceptibility. Science 307, no. 5714 (2005): 1434-1440. [46] Lakich, Delia, Haig H. Kazazian, Stylianos E. Antonarakis, and Jane Gitschier. Inversions disrupting the factor VIII gene are a common cause of severe haemophilia A. Nature genetics 5, no. 3(1993): 236-241. [47] Veugen, Thijs. Comparing encrypted data. Multimedia Signal Processing Group, Delft University of Technology, The Netherlands, and TNO Information and Communication Technology, Delft, The Netherlands, Tech. Rep (2011). [48] Veugen, Thijs. Improving the DGK comparison protocol. In Information Forensics and Security (WIFS), 2012 IEEE International Workshop on, pp. 49-54. IEEE, 2012.