METHOD AND DATA PROCESSING DEVICE FOR PROCESSING GENETIC DATA
20230021229 · 2023-01-19
Cpc classification
H04L9/3239
ELECTRICITY
Y02A90/10
GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
International classification
H04L9/32
ELECTRICITY
Abstract
A method for processing genetic data, which comprise a series of sequence elements each representing a biomolecule, comprises the steps of forming sequence fragments (S2), wherein each sequence fragment comprises a section of the series of sequence elements having a fragment length of at least two sequence elements, applying a coding function to each of the sequence fragments in order to generate a multiplicity of encrypted fragment data items (S3) which are each assigned to one of the sequence fragments, and storing the encrypted fragment data (S4), wherein the sequence fragments are formed in such a manner that the sections of the series of sequence elements overlap and each sequence element is included in at least two sequence fragments. A description is also given of a data processing device for processing genetic data and a method for querying a database containing encrypted fragment data which were generated and stored using the method for processing genetic data.
Claims
1. A method for processing genetic data which comprise a series of sequence elements which represent, in each case, a biomolecule, comprising the steps forming sequence fragments, wherein each sequence fragment comprises a section of the series of sequence elements with a fragment length of at least two sequence elements, applying a coding function to each of the sequence fragments in order to generate a plurality of encrypted fragment data items, each being associated with one of the sequence fragments, and storing the encrypted fragment data, wherein the step of forming the sequence fragments takes place such that the sections of the series of sequence elements overlap and each sequence element is included in at least two sequence fragments.
2. The method according to claim 1, wherein the fragment length of each sequence fragment is at least 3.
3. The method according to claim 1, wherein the step of forming the sequence fragments comprises specifying the fragment length and a start element in the genetic data, and providing the sequence fragments, in each case, using the sections of the series of sequence elements with the predetermined fragment length beginning at the start element and at all the subsequent sequence elements.
4. The method according to claim 1, wherein all the sequence fragments have the same length.
5. The method according to claim 1, wherein the sequence fragments form a plurality of fragment groups of sequence fragments, wherein the sequence fragments in each fragment group each have the same length, the sequence fragments of different fragment groups have different lengths, and the forming the sequence fragments takes place such that in each fragment group the sections of the series of sequence elements overlap and each sequence element is included in at least two sequence fragments.
6. The method according to claim 1, wherein the coding function is a hash function and the encrypted fragment data include hash values.
7. The method according to claim 1, wherein the step of forming the sequence fragments before the application of the coding function comprises addition, in each case, of a stochastically selected character string to each of the sequence fragments.
8. The method according to claim 1, wherein genetic data from a plurality of individuals are processed, wherein the genetic data of each individual comprise a series of sequence elements which represent, in each case, a biomolecule.
9. A data processing apparatus which is configured for generating and storing encrypted fragment data with the method according to claim 1, comprising a fragmenting device which is configured for forming the sequence fragments such that the sections of the series of sequence elements overlap and each sequence element is included in at least two sequence fragments, a coding device which is configured for generating the plurality of encrypted fragment data, and a storage device which is configured for storing the encrypted fragment data.
10. A computer program product which is stored on a computer-readable storage medium and is configured for forming the sequence fragments and for generating the plurality of encrypted fragment data in a method according to claim 1.
11. A computer-readable storage medium on which a computer program product is stored which is configured for forming the sequence fragments and for generating the plurality of encrypted fragment data in a method according to claim 1.
12. A database with a plurality of searchable, encrypted fragment data which have been generated with a method according to claim 1.
13. A method for querying a database containing encrypted fragment data which have been generated and stored with a method according to claim 1, comprising the steps specifying a search sequence comprising a predetermined series of sequence elements which represent, in each case, a biomolecule, applying the coding function, with which the encrypted fragment data have been generated, on the search sequence for generating an encrypted search sequence, and searching for the encrypted search sequence in the stored encrypted fragment data.
14. The method according to claim 13, wherein the specifying of the search sequence comprises a shortening of an initial search sequence to a search sequence length that is equal to the fragment length of the sequence fragments from which the encrypted fragment data have been generated.
15. The method according to claim 1, wherein the encrypted fragment data are stored in a database.
16. The method according to claim 1, wherein the predetermined series of sequence elements comprises a section of genetic material.
17. The method according to claim 1, wherein the genetic data represent a nucleotide sequence or an amino acid sequence.
Description
[0046] Further details and advantages of the invention are described below, making reference to the accompanying drawings, which show in:
[0047]
[0048]
[0049]
[0050] Details of preferred embodiments of the invention are described below, in particular in relation to the formation of the sequence fragments, their encoding and storage in a database, and the querying of the database. Details of the selection of a coding function, in particular a hash function, are not explained since they are known per se from conventional encoding techniques in bioinformatics or from other technical fields. Reference is made, by way of example, to the use of the invention in the processing of genetic data which comprise a nucleotide sequence. The use of the invention is not restricted to these data, but is also possible with other genetic data, such as, for example, amino acid sequences (protein sequences).
[0051]
[0052] In the method sequence according to
[0053] Step S1 is a preparation step of the method according to the invention. The preparation of the genetic data 1 in step S1 can be provided immediately before the subsequent processing with the steps S2 to S4 or temporally separated from them.
[0054] In step S2, the formation of the sequence fragments 3 from the genetic data 1 follows.
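The fragment formation of step S2 can be illustrated with a minimal sketch, assuming the sliding-window scheme of claim 3 (a fixed fragment length, with one fragment beginning at the start element and at every subsequent element). The function name `form_fragments` and the example sequence are hypothetical and do not appear in the source. Note that in this sketch the first and last elements of a linear sequence fall into only one window; the available text does not say how such edge elements are handled.

```python
def form_fragments(sequence: str, fragment_length: int, start: int = 0) -> list[str]:
    """Return all overlapping windows of `fragment_length` starting at `start`.

    Hypothetical helper: one fragment begins at the start element and at
    every subsequent element for which a full-length window still fits.
    """
    if fragment_length < 2:
        raise ValueError("fragment length must be at least 2")
    last_start = len(sequence) - fragment_length
    return [sequence[i:i + fragment_length] for i in range(start, last_start + 1)]


# Example nucleotide sequence (illustrative only).
fragments = form_fragments("ACGTAC", 3)
print(fragments)  # ['ACG', 'CGT', 'GTA', 'TAC']
```

With a step of one element between neighbouring windows, every interior element is covered by several fragments, which is what produces the overlap required by the method.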
[0055] Subsequently, at step S3, the sequence fragments 3 are encoded with a coding device 20. The coding device 20 is configured to apply a hash function ƒ_H to the sequence fragments 3. Applying the hash function yields a hash value table. The elements of the hash value table are encrypted fragment data 5 which represent the sequence fragments 3. This hash value table thus contains the genome sequence of a person in a form that permits no inferences to be drawn about the identity of the person, or the like.
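The encoding of step S3 could look roughly as follows. This is a sketch only: the source does not name a concrete hash function, so SHA-256 from the Python standard library is an assumption, and the names `hash_fragment` and `build_hash_table` are hypothetical.

```python
import hashlib


def hash_fragment(fragment: str) -> str:
    """Apply the (assumed) hash function to one sequence fragment."""
    return hashlib.sha256(fragment.encode("ascii")).hexdigest()


def build_hash_table(fragments: list[str]) -> list[str]:
    """Return one hash value per fragment, in fragment order."""
    return [hash_fragment(fragment) for fragment in fragments]


# Fragments as they might result from step S2 (illustrative only).
hash_table = build_hash_table(["ACG", "CGT", "GTA", "TAC"])
print(len(hash_table))  # 4
```

Because the hash function is deterministic, identical fragments always map to identical hash values, which is what later allows a hashed search sequence to be compared against the stored table.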
[0056] As distinct from the representation in
[0057] The encoding of the sequence fragments 3 provides the encrypted fragment data 5 in the hash value table. The encrypted fragment data 5 (encoded sequence fragments) are subsequently stored at step S4 in the storage device 30, for example, the database 30A. The database 30A is part of the data processing apparatus 100 or is provided separately therefrom. The encrypted fragment data 5 of a hash value table, i.e. of an individual, are in each case stored in predetermined storage sections and/or together with a sequence identification (sample ID) representing the assignment to a particular hash value table, so that the association of the encrypted fragment data 5 with an anonymized sample from an individual is maintained.
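One way to store the encrypted fragment data together with a sample ID in step S4 is sketched below. The use of SQLite, the table name `fragment_hashes`, the column names, the sample ID format and SHA-256 as the hash function are all assumptions made for illustration; the source only requires that the association between hash values and an anonymized sample is maintained.

```python
import hashlib
import sqlite3


def store_hash_table(conn: sqlite3.Connection, sample_id: str, hashes: list[str]) -> None:
    """Store each hash value alongside the sample ID of its hash value table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fragment_hashes ("
        "sample_id TEXT NOT NULL, hash TEXT NOT NULL)"
    )
    # An index on the hash column keeps later lookups by hash value fast.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_fragment_hash ON fragment_hashes (hash)"
    )
    conn.executemany(
        "INSERT INTO fragment_hashes (sample_id, hash) VALUES (?, ?)",
        [(sample_id, h) for h in hashes],
    )
    conn.commit()


# Illustrative data: three fragments hashed with an assumed hash function.
hashes = [hashlib.sha256(f.encode("ascii")).hexdigest() for f in ("ACG", "CGT", "GTA")]
conn = sqlite3.connect(":memory:")
store_hash_table(conn, "sample-001", hashes)
row_count = conn.execute("SELECT COUNT(*) FROM fragment_hashes").fetchone()[0]
print(row_count)  # 3
```

Only the hash values and the sample ID reach the database in this sketch; the plaintext fragments are never written to storage.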
[0058] For querying of the database 30A, as shown in the right-hand part of
[0059] Further details of a preferred use of the invention are shown in
[0060] A research facility 50 has an interest in an evaluation of the genetic data 1. For example, in the search for a particular disease, the question arises of whether a prepared search sequence 6 (step S5) is included in the genetic data 1 (see upper double-headed arrow). However, this direct query is made difficult or even precluded by the excessive effort of a search in the genetic data 1 and by data protection requirements. In order nevertheless to be able to search the genetic data 1, as described above, the search sequence 6 is subjected to the encoding for generating a hash value (step S6), after which a search can be carried out in the database 30A (step S7). If the search shows that the stored encrypted fragment data 5 include the searched-for encrypted search sequence 7, the associated genetic data 1, i.e. the dataset of a particular individual, is identified. Subsequently, the research facility 50 can place a query relating to this specific dataset back to the clinical facility 40 in order, while observing the rules of data security, to obtain further information regarding the individual with the relevant search sequence and/or cell material of the individual with the relevant search sequence, for example, from a cell bank.
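The query of steps S5 to S7 can be sketched end to end as follows. This is an illustration under stated assumptions, not the implementation of the source: SHA-256 stands in for the unspecified hash function, an in-memory dictionary stands in for the database 30A, the fragment length of 4 is arbitrary, and the shortening of a longer search sequence (as in claim 14) is assumed to keep its leading elements. All names and sample data are hypothetical.

```python
import hashlib
from collections import defaultdict

FRAGMENT_LENGTH = 4  # assumed value; must match the length used when storing


def encode(fragment: str) -> str:
    """Assumed coding function; must be the same for storage and query."""
    return hashlib.sha256(fragment.encode("ascii")).hexdigest()


def build_index(samples: dict[str, str]) -> dict[str, set[str]]:
    """Map each fragment hash to the sample IDs whose data contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for sample_id, sequence in samples.items():
        for i in range(len(sequence) - FRAGMENT_LENGTH + 1):
            index[encode(sequence[i:i + FRAGMENT_LENGTH])].add(sample_id)
    return index


def query(index: dict[str, set[str]], search_sequence: str) -> set[str]:
    """Hash the (shortened) search sequence and look it up in the index."""
    shortened = search_sequence[:FRAGMENT_LENGTH]  # assumption: keep leading elements
    if len(shortened) < FRAGMENT_LENGTH:
        raise ValueError("search sequence is shorter than the fragment length")
    return set(index.get(encode(shortened), set()))


# Illustrative anonymized samples.
samples = {"sample-001": "ACGTTGCA", "sample-002": "GGGACGTA"}
index = build_index(samples)
print(sorted(query(index, "ACGT")))  # ['sample-001', 'sample-002']
print(sorted(query(index, "TTGC")))  # ['sample-001']
```

The lookup returns only sample IDs, so the querying party learns which anonymized datasets contain the search sequence without reading the underlying genetic data.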
[0061] It should be noted that the example given represents only one possible use of the invention, in which particular questions from the field of personalized medicine can be processed without exact knowledge of the genetic data. Depending on the available data and/or the data format, only the required format of the search sequences and/or the search question needs to be defined in order to prepare a hash-value match against the same data points in the database.
[0062] A further example for the use of the invention is where a research facility wishes to investigate a particular disease and for this purpose needs cell material with particular genetic features from a cell bank. If the genetic data of the material stored in the cell bank are processed according to the present invention, the invention can be applied to find suitable cell lines from the cell bank without accessing the genetic data. The research facility obtains information, with a significantly reduced cost and time expenditure, regarding which cell line is needed to carry out the planned investigations without having to sequence the cell material itself.
[0063] The features of the invention disclosed in the above description, the drawings and the claims can be significant either individually or in combination or sub-combination for the realization of the invention in its various embodiments.