METHODS FOR ENABLING SECURED AND PERSONALISED GENOMIC SEQUENCE ANALYSIS
20220293222 · 2022-09-15
Inventors
- Francois Paillier (London, GB)
- Jackeline Palma (London, GB)
- Pascal Pallier (Paris, FR)
- Matthieu Rivain (Paris, FR)
- Louis Goubin (Paris, FR)
Cpc classification
G06F21/6245
PHYSICS
G16B20/20
PHYSICS
G16H10/60
PHYSICS
G16B20/00
PHYSICS
International classification
G06F21/62
PHYSICS
G16B20/20
PHYSICS
G16B30/00
PHYSICS
G16H10/60
PHYSICS
Abstract
Described herein is a secure integrated storage and analysis solution for personal genomic applications. The method guarantees data privacy whilst enabling access and ongoing analysis of genomic data when required. Described is a computer implemented homomorphic encryption method for securely producing natively encrypted sequencing data in a way that allows subsequent analysis on the encrypted data without requiring the file to be decrypted.
Claims
1. A computer implemented method for securely providing a user with a personally relevant analysis of biological information comprising: a. taking a user specific electronic file containing a genetic sequence information; b. adding user specific personal information to form an integrated user specific file having the genetic sequence information and the user specific personal information; c. encrypting the integrated user specific file using fully homomorphic encryption supporting a non-linear prediction model, thereby combining all confidential information into encrypted data in a way that allows subsequent analysis directly on the encrypted data without need for decrypting the encrypted data to perform the computations; d. storing the encrypted file on a first user device or a first computation server; e. performing the non-linear prediction model on the encrypted data resulting in an encrypted analysis result; f. sending the encrypted result to a second user device or a second computation server for decryption; g. producing a personally relevant analysis report from the decrypted result.
2. The method according to claim 1, wherein a unique DNA based identifier is added to the user specific personal information at step b.
3. The method according to claim 2, wherein the unique DNA based identifier is selected from one or more of: a. analysis of single nucleotide polymorphisms (SNP's) composition; b. analysis of Short Tandem Repeat (STR) composition; c. analysis of Mitochondrial sequence composition; and/or d. analysis of insertion/deletion (InDel) markers.
4. The method according to claim 3, wherein the SNP's or STR's are from Chromosome Y or autosomes.
5. The method according to claim 1, wherein the genetic sequence information is a collection of SNP's.
6. The method according to claim 1, wherein the genetic sequence information is a whole genome sequence.
7. The method according to claim 1, wherein the genetic sequence information is a partial or exome sequence.
8. The method according to claim 1 wherein the genetic sequence information is compiled from a variety of different providers or experimental techniques, optionally including transcriptome, proteome, metabolome, medical data or any data stored in Electronic Medical Records or collected by quantify-self devices.
9. The method according to claim 8, wherein the encrypted file integrates genetic user data from two or more databases.
10. The method according to claim 1, wherein the user specific personal information added includes one or more of history of illness, blood group, dietary details; blood pressure; heart rate; allergy information, birth date, location of birth, nationality, family contacts or family history of illness.
11. The method according to claim 10, wherein the personal information is updated automatically from a fitness tracker or health monitoring device.
12. The method according to claim 1, wherein the encrypted file can have further genetic sequence information added after encryption.
13. The method according to claim 1, wherein interrogation of the encrypted file is operated through a mobile app providing access to a variety of analysis methods.
14. The method according to claim 13, wherein the analysis methods are applied to one or more fields of health that includes a risk prediction and a predispositions analysis, a nutrition field that includes a genetically optimised diet, a lifestyle field that includes a daily sunlight needs or life rhythms, a family history field that includes a genetic genealogy, paternity testing, or forensics, and genetic-centered social interactions that include a genetic interest group about syndromes or Orphan diseases.
15. The method according to claim 1, wherein the genetic sequence information is from a biological asset owned by the user, such as plants, animals, synthetic biological systems or microorganisms.
16. The method according to claim 15, wherein the biological asset is an animal or a plant used in an agro-food industry, a cosmetics industry, or any other industry or human activity.
17. The method according to claim 1, wherein the genetic sequence information is encrypted at the point of sequencing a sample provided by the user.
18. The method according to claim 1, wherein the genetic sequence information and authenticity of a sample is encrypted at the point of origin of the sample as provided by the user.
19. The method according to claim 1, wherein data of the encrypted file combines all or part of the following elements: a. user-specific raw data of different types, from different sources and at different levels of quality, b. user-specific analysed data including results from previous personal genomics analyses, c. user-specific preferences data including genetic data privacy preferences, preferences in terms of type of results that are communicated to who and how, and d. a unique digital signature.
Description
[0092] Described is a method for digitally authenticating a user during sample collection (such as a saliva sample), as well as a method to guarantee the integrity of a sample and of its shipment to the sequencing laboratory. These digital methods implement biometric authentication as well as digital tracking of the shipment and may involve the following technologies, alone or in any combination: GPS tracking, remote biometric authentication on secured software or hardware including drones, a USB stick logger embedded in the shipment, and blockchain recording of sampling and transportation events.
DETAILED DESCRIPTION OF THE INVENTION
[0098] Described herein is a secure integrated storage and analysis solution for personal genomic applications. The method guarantees data privacy whilst enabling access and ongoing analysis of genomic data when required. Described is a computer implemented homomorphic encryption method for securely producing natively encrypted sequencing data in a way that allows subsequent analysis on the encrypted data without requiring the file to be decrypted.
[0099] The system allows a user to directly mine his own information. Analyses are secured and new results are transferred encrypted. Each analysis method complies with End-to-End encryption. The user decrypts results using his unique key, guaranteeing total privacy. The methods bring the state-of-the-art algorithms close to the customer, for example using A.I. As new algorithms are developed they can be applied to the encrypted data. If a query cannot be applied due to insufficient genetic data, the system determines the quickest and most cost-effective way to generate the additional data required for the query to be satisfied.
[0100] The file can be supplemented with additional data, including additional genetic data or phenotype data.
[0101] Data on the file may include one or more of:
[0102] Contact details for the patient, the data owner, medical practitioners, genetic counsellor, family members, next of kin, emergency contacts.
[0103] Personal data (address, language, profile photo, current health status, current location, current diet type, lifestyle, cryptographic information)
[0104] Genetic data privacy preferences for user and family members
[0105] Personal objectives
[0106] Main risks and categorised health status (metabolic, cardiovascular, inflammatory, fitness-frailty, ontological, psychological, cognitive, infectious)
[0107] Encryption technology described herein allows fully homomorphic encryption to support fast operations in the encrypted domain. The technology comes in the form of a set of software tools for use-case specifications and semi-automatic code generation.
[0108] A user's genomic data is provided in encrypted form to a service provider in order to predict a genetic trait or a risk of disease. The service provider evaluates a proprietary prediction model homomorphically on the encrypted data and returns an encrypted result without ever being able to access the genomic data in the clear. The encrypted result is then decrypted by the user—or an associated device—to view the prediction result value.
[0109] The method supports a wide class of prediction models that combine table look-ups and additive aggregation of independent gene-level contributions. Thus the invention extends far beyond logistic regression—the classical linear model for genome-wide association.
[0110] The classes of prediction models supported by the invention and the methods of their application are described below.
[0111] 1. General Application of Prediction Models
[0112] 1.1. Input and output
[0113] The prediction service provider is provided with a set of single nucleotide polymorphisms (or SNPs)
S=((rsid.sub.1, x.sub.1), (rsid.sub.2, x.sub.2), . . . , (rsid.sub.n, x.sub.n))
[0114] where rsid.sub.i indicates the identifier of the i-th SNP and x.sub.i indicates its value. For instance, when the SNPs contain a pair of nucleotide bases, each x.sub.i is an ordered pair of symbols in the standard alphabet “-ACGTYRWSKMDVHBN” and can only have 136 possible values.
[0115] In addition to the set of SNPs, the prediction may require a set of covariates cov providing additional information such as age, weight, height, body mass index, ethnicity or other relevant user-specific information.
[0116] The output value of the prediction is a probability that measures the presence of a genetic trait or a health risk:
p=prediction_model(S, cov)
[0117] By applying a comparison with a selected threshold probability, the result value can be made a binary value (yes or no). By applying several models in parallel, the output may also be a vector of probabilities and/or binary values.
[0118] The sets of SNPs and covariates are input into the prediction models as a single vector of values:
V=(v.sub.1, v.sub.2, . . . , v.sub.k)
[0119] 1.2. Supported Prediction Models
[0120] 1.2.1 Linear Models (e.g. Logistic Regression)
[0121] Given the input vector V=(v.sub.1, v.sub.2, . . . , v.sub.k), a linear model returns the output probability
p=f(w.sub.0+w.sub.1·v.sub.1+ . . . +w.sub.k·v.sub.k)
[0122] where the function f and all the weights w.sub.0, w.sub.1, . . . , w.sub.k are real-valued and constitute the model.
[0123] For instance, when f is chosen to be the logistic function f(t)=1/(1+e.sup.−t), the model is said to be a logistic model and w.sub.0, w.sub.1, . . . , w.sub.k are called the regression coefficients. Other linear models may use different functions.
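As an illustration only, the linear model of paragraph [0121] with the logistic function of paragraph [0123] can be sketched in the clear as follows. The function names and the coefficient values are hypothetical and are not taken from the invention.

```python
import math


def logistic(t: float) -> float:
    """Logistic function f(t) = 1 / (1 + e^-t)."""
    return 1.0 / (1.0 + math.exp(-t))


def linear_model(weights, v, f=logistic):
    """Return p = f(w0 + w1*v1 + ... + wk*vk), where weights[0] is w0."""
    if len(weights) != len(v) + 1:
        raise ValueError("expected one weight per input plus the intercept w0")
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], v))
    return f(z)


# Hypothetical coefficients and inputs, for illustration only.
p = linear_model([-1.0, 0.5, 0.25], [2.0, 0.0])
```

Here the weighted sum is -1.0 + 0.5·2.0 + 0.25·0.0 = 0, so the logistic function returns a probability of one half.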
[0124] Linear models have 2 intrinsic limitations:
[0125] Limitation 1. They assume that all input variables have independent contributions in the computation of p. Indeed the contribution w.sub.i·v.sub.i of v.sub.i is independent from all the other input variables.
[0126] Limitation 2. The contribution of an input variable v.sub.i is linear in v.sub.i.
[0127] 1.2.2 Non-linear models
[0128] What we call here non-linear models are a generalization of linear models where
p=f(w.sub.0+f.sub.1(v.sub.1)+ . . . +f.sub.k(v.sub.k))
[0129] and the coefficient w.sub.0 as well as the functions f, f.sub.1, . . . , f.sub.k are arbitrary and belong to the model.
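A minimal sketch of such a non-linear model, in the clear, is given below. It assumes that each function f.sub.i is stored as a lookup table over the possible values of v.sub.i; the table contents, the names and the choice of the logistic function for f are hypothetical.

```python
import math


def nonlinear_model(w0, tables, v, f):
    """Return p = f(w0 + f_1(v_1) + ... + f_k(v_k)).

    tables[i] is a dict mapping each possible value of v_i to f_i(v_i),
    so every contribution is a table lookup rather than a product w_i * v_i.
    """
    if len(tables) != len(v):
        raise ValueError("expected one table per input variable")
    z = w0 + sum(table[x] for table, x in zip(tables, v))
    return f(z)


# Hypothetical tables: f_1 is deliberately not linear in v_1.
f_1 = {0: 0.0, 1: 0.2, 2: 1.5}
f_2 = {0: 0.0, 1: -0.5}
p = nonlinear_model(-1.0, [f_1, f_2], (2, 1),
                    lambda t: 1.0 / (1.0 + math.exp(-t)))
```

Each input still contributes on its own, which is why this class of models removes Limitation 2 only.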
[0130] Thus non-linear models escape Limitation 2. However each contribution f.sub.i(v.sub.i) remains independent from the other input variables, so Limitation 1 still applies.
[0131] 1.2.3 Non-linear Co-Dependent Models
[0132] Non-linear co-dependent models allow each contribution to depend on arbitrary subsets of input variables.
[0133] As an example, assume that input variables in V form contiguous clusters of co-dependent variables, for instance
V=((v.sub.1, v.sub.2), v.sub.3, (v.sub.4, v.sub.5, v.sub.6), v.sub.7, . . . ).
[0134] In this example, v.sub.1 and v.sub.2 form a cluster, v.sub.3 is independent, v.sub.4, v.sub.5 and v.sub.6 form another cluster, v.sub.7 is independent, and so forth. A non-linear co-dependent model outputs
p=f(w.sub.0+f.sub.12(v.sub.1, v.sub.2)+f.sub.3(v.sub.3)+f.sub.456(v.sub.4,v.sub.5,v.sub.6)+f.sub.7(v.sub.7)+ . . . )
[0135] and the model parameters now include arbitrary multivariate functions.
[0136] In the general case, V is a collection of clusters (V.sub.1, . . . , V.sub.q) where each cluster V.sub.l is a collection of input variables V.sub.l⊆{v.sub.1, . . . , v.sub.k}. An input variable may belong to several clusters. The contribution of cluster V.sub.l in the computation of p is f.sub.l(V.sub.l) and the output of the model is
p=f(w.sub.0+f.sub.1(V.sub.1)+ . . . +f.sub.q(V.sub.q)).
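The general formula can be sketched in the clear as follows. This is an illustration under stated assumptions: each cluster is represented by the (0-based) positions of its variables in V together with a multivariate lookup table, and all names and table values are hypothetical.

```python
def codependent_model(w0, clusters, v, f):
    """Return p = f(w0 + f_1(V_1) + ... + f_q(V_q)).

    clusters is a list of (indices, table) pairs. indices gives the positions
    in v of the variables forming the cluster, and table maps the tuple of
    their values to the joint contribution of the cluster.
    """
    z = w0
    for indices, table in clusters:
        key = tuple(v[i] for i in indices)
        z += table[key]
    return f(z)


# Hypothetical model: v_1 and v_2 form a cluster, v_3 is independent.
f_12 = {(0, 0): 0, (0, 1): 1, (1, 0): 3, (1, 1): -2}  # not a sum of two parts
f_3 = {(0,): 0, (1,): 2, (2,): 5}
clusters = [((0, 1), f_12), ((2,), f_3)]

# With f chosen as the identity, the call returns the aggregated sum itself.
z = codependent_model(1, clusters, (1, 0, 2), lambda t: t)
```

Because f_12 is indexed by both values at once, the contribution of v.sub.1 can depend on v.sub.2, which a sum of single-variable functions cannot express.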
[0137] We see that non-linear co-dependent models no longer have Limitation 1 and that
[0138] linear models ⊂ non-linear models ⊂ non-linear co-dependent models
[0139] The method as per the invention supports these 3 categories of models.
[0140] 1.2.4. Why Non-Linear Co-Dependent Models Matter in Genomics
[0141] In linear or non-linear models, all input SNP variables have an independent effect on the final prediction result.
[0142] However, in potentially many concrete cases of genomic predictions, this is not accurate because some of the input SNPs may belong to the same gene, resulting in dependencies between the contributions of these SNPs being observed in acquired medical data.
[0143] Therefore one gets a far more accurate model by combining the SNPs belonging to the same gene together in the same cluster, and possibly adding relevant covariates to that cluster as well, so that all observed dependencies are taken into account in the model.
[0144] The particular parameters of a model (the coefficient w.sub.0 and functions f, f.sub.1, . . . , f.sub.q) can be extracted from medical acquisitions in various ways e.g. using machine learning techniques.
[0145] 2. Homomorphic Evaluation of Prediction Models
[0146] We now show how the invention allows the evaluation of any non-linear co-dependent prediction model over encrypted input variables using homomorphic encryption.
[0147] Because this is the most general class of models, this description also applies—with simplifications—to linear and non-linear models.
[0148] The description that follows makes use of a generic homomorphic encryption scheme that supports:
[0149] the public encryption of integer values,
[0150] the homomorphic addition of encrypted values,
[0151] the homomorphic application of table lookups.
[0152] An encryption of an integer x is denoted [[x]].
[0153] Section 3 describes one particular reduction to practice in more detail using a particular scheme.
[0154] 2.1. Step 1: Key Generation
[0155] Using the key generation procedure of the encryption scheme, the user generates 3 different cryptographic keys:
[0156] a secret encryption/decryption key sec_key,
[0157] a public encryption key enc_key,
[0158] a public evaluation key eva_key.
[0159] The user publishes enc_key so that third parties can encrypt genomic data on behalf of the user.
[0160] The user publishes eva_key so that third parties such as prediction service providers can carry out homomorphic computations over encrypted data.
[0161] The user keeps sec_key private and will use it to decrypt the encrypted prediction results.
[0162] Optionally, sec_key can also be used by the user to provide encrypted genomic data to prediction service providers.
[0163] 2.2. Step 2: Encryption of User Data
[0164] User data is divided into 2 distinct categories:
[0165] 1. The set of SNPs attached to the user (genomic data),
[0166] 2. The set of covariates attached to the user (medical profile).
[0167] 2.2.1 Encrypting the SNPs
[0168] In its standard form, the value of an SNP is an ordered pair of symbols in the alphabet “-ACGTYRWSKMDVHBN”. For non-autosomal chromosomes, or in cases of trisomy, an SNP can be composed of fewer or more than 2 symbols.
[0169] A convention must be adopted to encode the SNP value into an integer in an appropriate range. Typically, SNPs containing a pair of standard symbols can be encoded as an integer ranging from 1 to 136.
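One convention consistent with the count of 136 is sketched below. It assumes that the two symbols of a pair are put in a canonical order before encoding, which gives 16·17/2 = 136 distinct pairs over the 16-symbol alphabet. The particular numbering and the function name are assumptions made for illustration; the invention does not prescribe them.

```python
from itertools import combinations_with_replacement

ALPHABET = "-ACGTYRWSKMDVHBN"  # the 16 standard symbols

# All pairs of symbols in canonical (alphabet) order: 16 * 17 / 2 = 136 pairs.
PAIRS = list(combinations_with_replacement(ALPHABET, 2))

# Hypothetical numbering: pairs are numbered 1..136 in enumeration order.
ENCODE = {pair: code for code, pair in enumerate(PAIRS, start=1)}


def encode_snp(first: str, second: str) -> int:
    """Encode a pair of symbols as an integer in the range 1..136."""
    pair = tuple(sorted((first, second), key=ALPHABET.index))
    return ENCODE[pair]
```

With this numbering, `encode_snp("A", "G")` and `encode_snp("G", "A")` return the same integer, since both describe the same pair once canonically ordered.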
[0170] Alternately, the values of an SNP may be categorized into genetic variants, or groups of variants that are known to produce the same statistical effect on the medical condition of the user. In that case, the SNP value is replaced with an integer that encodes the group of variants the SNP belongs to.
[0171] In any case, if (rsid.sub.i, x.sub.i) denotes an SNP, we identify x.sub.i with the integer-valued encoding of its value.
[0172] The above SNP is made available in encrypted form as (rsid.sub.i, [[x.sub.i]]) where [[x.sub.i]] is a homomorphic encryption of x.sub.i under the user's public encryption key enc_key.
[0173] 2.2.2 Encrypting the Covariates
[0174] Covariates may be of very different nature and may rely on medical measurements in various units. By convention, the numeric representation of the j-th covariate may adopt the generic format
(Description.sub.j, c.sub.j)
[0175] where (Description.sub.j) is a unique descriptive object (e.g. a character string or a reference to some class in an ontology) and c.sub.j an integer-valued encoding of the value of the covariate. For instance,
(‘Height(cm)@2019-05-13’, 189)
[0176] may represent the user's height in centimeters at a certain date.
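The covariate format can be sketched as a simple pair. In this illustration the descriptive object is a plain character string, the type and function names are hypothetical, and rounding to the nearest centimeter is an assumed encoding choice.

```python
from typing import NamedTuple


class Covariate(NamedTuple):
    description: str  # unique descriptive object, here a plain string
    value: int        # integer-valued encoding c_j of the covariate


def encode_height_cm(height_cm: float, date: str) -> Covariate:
    """Encode a height measurement, rounded to the nearest centimeter."""
    return Covariate(f"Height(cm)@{date}", round(height_cm))


height = encode_height_cm(188.6, "2019-05-13")
```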
[0177] The above covariate is made available in encrypted form as
(Description.sub.j, [[c.sub.j]])
[0178] where [[c.sub.j]] is a homomorphic encryption of c.sub.j.
[0179] 2.3. Step 3: Homomorphic Prediction
[0180] 2.3.1. The Homomorphic Prediction Model
[0181] The homomorphic prediction model, known by the service provider who is performing the evaluation homomorphically, is composed of:
[0182] The identifiers of all the SNPs required as input
(rsid.sub.1, . . . , rsid.sub.n)
[0183] The descriptions of all covariates required as input
(Description.sub.1, . . . , Description.sub.m)
[0184] The vector input clusters V.sub.1, . . . , V.sub.q and more precisely, for l=1, . . . , q
[0185] which SNP variables i.sub.1, . . . , i.sub.n.sub.
[0190] Since the homomorphic prediction model is necessarily integer-valued, it may be obtained by approximating a continuous prediction model with an appropriate degree of precision.
[0191] 2.3.2. Step 3a: Fetching the Encrypted Input Data
[0192] The prediction service provider is given the encrypted input data
[[x.sub.1]], . . . , [[x.sub.n]], [[c.sub.1]], . . . , [[c.sub.m]]
[0193] and for l=1, . . . , q, collects the encrypted variables belonging to cluster V.sub.l:
[[V.sub.l]]=([[x.sub.i]], [[c.sub.j]]) for all SNP indices i and covariate indices j that belong to cluster V.sub.l
[0194] 2.3.3. Step 3b: Fetching the User's Public Evaluation Key
[0195] The prediction service provider is given the user's public evaluation key eva_key.
[0196] 2.3.4. Step 3c: Homomorphic Evaluation of the Model
[0197] For a given query from a user, using eva_key, the prediction service provider performs the following algorithm:
[0198] 1. Initialize acc=w.sub.0
[0199] 2. For l=1 to q:
(2a). Perform a homomorphic table lookup with [[V.sub.l]]
[0200] on table T.sub.f.sub.l to get
z.sub.l=[[T.sub.f.sub.l[V.sub.l]]],
[0201] the encrypted contribution of cluster V.sub.l.
(2b). Use homomorphic addition to aggregate over l=1 to q
acc=acc+z.sub.l
[0202] where acc is the accumulated value.
[0203] 3. Perform a homomorphic table lookup with acc on table T.sub.f to get the encrypted prediction probability [[p]].
[0204] 2.3.5. Step 3d: Returning the Encrypted Result
[0205] The encrypted prediction result [[p]] is returned to the user.
[0206] 2.4 Step 4: Decryption by the User
[0207] Using the secret decryption key sec_key, the user decrypts [[p]] to get the prediction result value p in the clear.
[0208] 3. Reduction to Practice
[0209] In this particular embodiment of the invention, we make use of a set of techniques based on the Torus FHE (TFHE) homomorphic encryption scheme. TFHE defines 3 distinct encryption formats, TLWE, TRLWE and TRGSW, with distinct features. Only the description of TLWE is needed to show how the invention is implemented using TFHE.
[0210] TLWE Secret-Key Encryption
[0211] The plaintext is assigned a real value, μ, in the range [0,1) and is encrypted as
TLWE(μ)=(a.sub.1, . . . , a.sub.n, b)
[0212] with
b=a.sub.1·s.sub.1+ . . . +a.sub.n·s.sub.n+μ+ϵ mod 1
[0213] where each a.sub.i˜U.sub.[0,1) is picked uniformly at random in the interval [0,1) and ϵ˜N(0, σ) is a centered Gaussian noise with variance σ.sup.2.
[0214] The secret encryption-decryption key is sec_key=(s.sub.1, . . . , s.sub.n) ∈ {0,1}.sup.n.
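The secret-key format can be illustrated with a toy sketch that uses floating-point numbers modulo 1. It assumes the usual TLWE relation b = a.sub.1·s.sub.1 + . . . + a.sub.n·s.sub.n + μ + ϵ mod 1 from the TFHE literature, and it assumes that an integer m in {0, . . . , levels−1} is encoded as μ = m/levels. The function names and parameter values are hypothetical and provide no security.

```python
import random


def keygen(n, rng):
    """Secret key: n uniformly random bits."""
    return [rng.randrange(2) for _ in range(n)]


def encrypt(mu, sec_key, sigma, rng):
    """TLWE(mu) = (a_1, ..., a_n, b) with b = <a, s> + mu + noise mod 1."""
    a = [rng.random() for _ in sec_key]
    noise = rng.gauss(0.0, sigma)  # centered Gaussian, standard deviation sigma
    b = (sum(ai * si for ai, si in zip(a, sec_key)) + mu + noise) % 1.0
    return a, b


def decrypt(ciphertext, sec_key, levels):
    """Remove the mask <a, s>, then round to the nearest multiple of 1/levels."""
    a, b = ciphertext
    phase = (b - sum(ai * si for ai, si in zip(a, sec_key))) % 1.0
    return round(phase * levels) % levels


def add(ct1, ct2):
    """Homomorphic addition: add the two ciphertexts component-wise mod 1."""
    (a1, b1), (a2, b2) = ct1, ct2
    return [(x + y) % 1.0 for x, y in zip(a1, a2)], (b1 + b2) % 1.0


# Toy parameters chosen for readability, not for security.
rng = random.Random(0)
sec_key = keygen(16, rng)
c3 = encrypt(3 / 8, sec_key, 1e-4, rng)
c4 = encrypt(4 / 8, sec_key, 1e-4, rng)
```

Decrypting `add(c3, c4)` with `levels=8` recovers 7, because the masks cancel and the two small noise terms stay far below the rounding margin of 1/16.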
[0215] TLWE public-key encryption
[0216] Given sec_key, the encryption public key enc_key is derived by providing a vector of random encryptions of zero:
enc_key=(Z.sub.1, . . . , Z.sub.r)
[0217] where Z.sub.i=TLWE(0). The public-key encryption of μ ∈ [0,1) consists of selecting random bits a.sub.1, . . . , a.sub.r ∈ {0, 1} and computing
TLWE(μ)=a.sub.1·Z.sub.1+ . . . +a.sub.r·Z.sub.r+μ mod 1.
[0218] 3.1 Step 1: Key Generation
[0219] 1. The user selects sec_key=(s.sub.1, . . . , s.sub.n) ∈ {0,1}.sup.n uniformly at random.
[0220] 2. The user generates r encryptions of zero Z.sub.1, . . . , Z.sub.r and sets the encryption public key to enc_key=(Z.sub.1, . . . , Z.sub.r).
[0221] 3. The user randomly generates a bootstrapping key eva_key to allow homomorphic computations by third parties.
[0222] 3.2 Step 2: Encryption of User Data
[0223] To encrypt an integer variable v (an SNP value or a covariate), v is decomposed into bits v.sub.0, . . . , v.sub.t-1 and [[v]] is defined as
[[v]]=(TLWE(v.sub.0), . . . , TLWE(v.sub.t-1))
[0225] Relying on the description of section 2.3.4, it is enough to provide a description of how homomorphic table lookups and homomorphic additions are performed for a single cluster of input variables.
[0226] 3.3.1 Homomorphic Table Lookup
[0227] Given an encrypted cluster of integer variables
[[V.sub.l]]=([[x.sub.i]], [[c.sub.j]]) for all SNP indices i and covariate indices j that belong to cluster V.sub.l,
[0228] and since each encrypted variable is a vector of its encrypted bits under TLWE, we view [[V.sub.l]] as the concatenation of these vectors, that is, as a single vector of TLWE-encrypted bits.
[0229] Now, TFHE provides a technique for the homomorphic evaluation of a table lookup. Let T be an arbitrary t-dimensional table of 2.sup.t integer values in the range {0, . . . , 2.sup.d−1}. By applying the CMux tree and gate bootstrapping techniques on a vector of t encrypted bits,
[0230] one can compute an encryption of the entry of T that is selected by those bits,
[0231] where the integer d>0 is a system parameter.
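The selection logic of a multiplexer tree can be sketched in the clear as follows. This illustration only shows how t index bits select one of 2.sup.t table entries; in TFHE each selection is carried out on ciphertexts, which this sketch does not attempt to reproduce. The function names and the least-significant-bit-first ordering are assumptions.

```python
def cmux(bit, if_zero, if_one):
    """Multiplexer: return if_one when bit is 1, otherwise if_zero."""
    return if_one if bit else if_zero


def table_lookup(table, bits):
    """Return table[index], where bits holds the index bits, least significant first.

    Each bit drives one layer of multiplexers and halves the number of
    candidate entries, so t bits reduce 2**t entries to a single one.
    """
    if len(table) != 1 << len(bits):
        raise ValueError("table must have 2**t entries for t index bits")
    layer = list(table)
    for bit in bits:
        layer = [cmux(bit, layer[i], layer[i + 1])
                 for i in range(0, len(layer), 2)]
    return layer[0]


# Hypothetical table with 2**3 entries; index 5 is 101 in binary.
table = [10, 11, 12, 13, 14, 15, 16, 17]
entry = table_lookup(table, [1, 0, 1])
```

The tree uses 2.sup.t − 1 multiplexers in total, so the cost of a lookup grows with the size of the table, which is one reason to keep clusters small.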
[0232] In this embodiment of the invention, these techniques are used for every table lookup made necessary by the prediction model.
[0233] 3.3.2 Homomorphic Addition
[0234] Since TLWE supports homomorphic additions, the current accumulated value acc
[0235] can be updated as
acc=acc+z.sub.l
[0236] As a result of successive accumulations, the final value of the accumulator acc contains the sum
z=w.sub.0+T.sub.f.sub.1[V.sub.1]+ . . . +T.sub.f.sub.q[V.sub.q]
[0237] of all contributions, namely
acc=[[z]]
[0238] In this embodiment, the function f is not applied homomorphically on acc to compute [[p]]=f([[z]]). Instead, the prediction service provider directly returns acc=[[z]] to the user together with a description of f. The function f can also be chosen once and for all as a convention between users and prediction service providers.
[0239] 3.4 Step 4: Decryption by the user
[0240] Using her secret encryption-decryption key sec_key, the user
[0241] 1. Decrypts acc=[[z]] to get z.
2. Applies f to z to get p=f(z).
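The division of work between the user and the prediction service provider in this embodiment can be sketched with a mock ciphertext type. The mock holds its value in the clear and offers no security; it only mirrors the three operations required of the scheme (encryption, homomorphic addition, homomorphic table lookup). All names, identifiers, table values and the scale of 100 used to map the integer z back to a real number are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class MockCiphertext:
    """Stand-in for [[x]]: stores x in the clear, for illustration only."""
    value: int


def encrypt(x):
    return MockCiphertext(x)


def decrypt(ct):
    return ct.value


def h_add(ct1, ct2):
    """Mock homomorphic addition."""
    return MockCiphertext(ct1.value + ct2.value)


def h_lookup(table, cts):
    """Mock homomorphic table lookup on a cluster of encrypted variables."""
    return MockCiphertext(table[tuple(ct.value for ct in cts)])


def evaluate(w0, clusters, enc_inputs):
    """Service provider side: acc = w0 plus the sum of all cluster lookups."""
    acc = encrypt(w0)
    for names, table in clusters:
        z_l = h_lookup(table, [enc_inputs[name] for name in names])
        acc = h_add(acc, z_l)
    return acc  # [[z]]; f is left to the user, as in this embodiment


# Hypothetical integer-valued model with two clusters.
gene_table = {(0, 0): 0, (0, 1): 40, (1, 0): 50, (1, 1): 120}
age_table = {(0,): 0, (1,): 10, (2,): 30}
clusters = [(("rs_a", "rs_b"), gene_table), (("age_band",), age_table)]

# User side: encrypt the inputs, send them, then decrypt and apply f.
enc_inputs = {name: encrypt(x)
              for name, x in {"rs_a": 1, "rs_b": 1, "age_band": 2}.items()}
z = decrypt(evaluate(-150, clusters, enc_inputs))
p = 1.0 / (1.0 + math.exp(-z / 100))  # assumed convention for f
```

The service provider only ever handles `MockCiphertext` objects and the model tables, while decryption and the final application of f take place on the user side.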
[0242] An example is given below.
[0243] Among all predictive genetic tests currently available DTC, BRCA mutation testing can be considered the most actionable with proven clinical utility. Specific genetic variants in the BRCA1 and BRCA2 genes are associated with an increased risk of developing certain cancers, including breast cancer (in women and men) and ovarian cancer. These variants may also be associated with an increased risk for prostate cancer and certain other cancers. This test includes three genetic variants in the BRCA1 and BRCA2 genes that are most common in people of Ashkenazi Jewish descent.
[0244] Data relating to an individual was encrypted and the BRCA status analysed:
TABLE-US-00001
Case 1
BRCA1 185delAG:
(-;-) 6 185delAG BRCA1 mutation genotype
(-;AG) 6 BRCA1 (breast cancer) 185delAG carrier
(AG;AG) 0 common in clinvar
i4000377 17 41276045
II then NON Carrier; Elsif DI then Carrier; Elsif DD then MUTATED = breast cancer risk of about 60% in women, 2% in men
Case 2
(-;-) 0 Normal
(-;C) 6 BRCA1 variant considered pathogenic for breast cancer
(C;C) 6 BRCA1 variant considered pathogenic for breast cancer
i4000378 17 41209083
DD then NON Carrier; Elsif DI then Carrier; Elsif II then MUTATED = breast cancer risk of about 60% in women, 2% in men
Or
-- then NON Carrier; Elsif -C then Carrier; Elsif CC then MUTATED = breast cancer risk of about 60% in women, 2% in men
Case 3
(-;-) 6 BRCA2 variant considered pathogenic for breast cancer
(-;T) 6 BRCA2 variant considered pathogenic for breast cancer
(T;T) 0 common/normal
i4000379 13 32914438
II then NON Carrier; Elsif DI then Carrier; Elsif DD then MUTATED = breast cancer risk of about 50%, 8% in men