METHOD AND DATA PROCESSING DEVICE FOR PROCESSING GENETIC DATA

20230021229 · 2023-01-19

    Abstract

    A method for processing genetic data, which comprise a series of sequence elements each representing a biomolecule, comprises the steps of forming sequence fragments (S2), wherein each sequence fragment comprises a section of the series of sequence elements having a fragment length of at least two sequence elements, applying a coding function to each of the sequence fragments in order to generate a multiplicity of encrypted fragment data items (S3) which are each assigned to one of the sequence fragments, and storing the encrypted fragment data (S4), wherein the sequence fragments are formed in such a manner that the sections of the series of sequence elements overlap and each sequence element is included in at least two sequence fragments. A description is also given of a data processing device for processing genetic data and a method for querying a database containing encrypted fragment data which were generated and stored using the method for processing genetic data.

    Claims

    1. A method for processing genetic data which comprise a series of sequence elements which represent, in each case, a biomolecule, comprising the steps forming sequence fragments, wherein each sequence fragment comprises a section of the series of sequence elements with a fragment length of at least two sequence elements, applying a coding function to each of the sequence fragments in order to generate a plurality of encrypted fragment data items, each being associated with one of the sequence fragments, and storing the encrypted fragment data, wherein the step of forming the sequence fragments takes place such that the sections of the series of sequence elements overlap and each sequence element is included in at least two sequence fragments.

    2. The method according to claim 1, wherein the fragment length of each sequence fragment is at least 3.

    3. The method according to claim 1, wherein the step of forming the sequence fragments comprises specifying the fragment length and a start element in the genetic data, and providing the sequence fragments, in each case, using the sections of the series of sequence elements with the predetermined fragment length beginning at the start element and at all the subsequent sequence elements.

    4. The method according to claim 1, wherein all the sequence fragments have the same length.

    5. The method according to claim 1, wherein the sequence fragments form a plurality of fragment groups of sequence fragments, wherein the sequence fragments in each fragment group each have the same length, the sequence fragments of different fragment groups have different lengths, and the forming the sequence fragments takes place such that in each fragment group the sections of the series of sequence elements overlap and each sequence element is included in at least two sequence fragments.

    6. The method according to claim 1, wherein the coding function is a hash function and the encrypted fragment data include hash values.

    7. The method according to claim 1, wherein the step of forming the sequence fragments before the application of the coding function comprises addition, in each case, of a stochastically selected character string to each of the sequence fragments.

    8. The method according to claim 1, wherein genetic data from a plurality of individuals are processed, wherein the genetic data of each individual comprise a series of sequence elements which represent, in each case, a biomolecule.

    9. A data processing apparatus which is configured for generating and storing encrypted fragment data with the method according to claim 1, comprising a fragmenting device which is configured for forming the sequence fragments such that the sections of the series of sequence elements overlap and each sequence element is included in at least two sequence fragments, a coding device which is configured for generating the plurality of encrypted fragment data, and a storage device which is configured for storing the encrypted fragment data.

    10. A computer program product which is stored on a computer-readable storage medium and is configured for forming the sequence fragments and for generating the plurality of encrypted fragment data in a method according to claim 1.

    11. A computer-readable storage medium on which a computer program product is stored which is configured for forming the sequence fragments and for generating the plurality of encrypted fragment data in a method according to claim 1.

    12. A database with a plurality of searchable, encrypted fragment data which have been generated with a method according to claim 1.

    13. A method for querying a database containing encrypted fragment data which have been generated and stored with a method according to claim 1, comprising the steps specifying a search sequence comprising a predetermined series of sequence elements which represent, in each case, a biomolecule, applying the coding function, with which the encrypted fragment data have been generated, on the search sequence for generating an encrypted search sequence, and searching for the encrypted search sequence in the stored encrypted fragment data.

    14. The method according to claim 13, wherein the specifying of the search sequence comprises a shortening of an initial search sequence to a search sequence length that is equal to the fragment length of the sequence fragments from which the encrypted fragment data have been generated.

    15. The method according to claim 1, wherein the encrypted fragment data are stored in a database.

    16. The method according to claim 1, wherein the predetermined series of sequence elements comprises a section of genetic material.

    17. The method according to claim 1, wherein the genetic data represent a nucleotide sequence or an amino acid sequence.

    Description

    [0046] Further details and advantages of the invention are described below, making reference to the accompanying drawings, which show in:

    [0047] FIG. 1 a schematic illustration of the processing of genetic data according to preferred embodiments of the invention,

    [0048] FIG. 2 further details of the encryption and storage of genetic data and the querying of a database according to further embodiments of the invention, and

    [0049] FIG. 3 a schematic overview of a preferred use of the invention for the processing of clinically obtained genetic data and their searching by users.

    [0050] Details of preferred embodiments of the invention are described below, in particular in relation to the formation of the sequence fragments, their encoding and storage in a database and the querying of the database. Details of the selection of a coding function, in particular a hash function are not explained since they are known per se from conventional encoding techniques in bioinformatics or from other technical fields. Reference is made, by way of example, to the use of the invention in the processing of genetic data which comprise a nucleotide sequence. The use of the invention is not restricted to these data, but is also possible with other genetic data, such as for example amino acid sequences (protein sequences).

    [0051] FIG. 1 schematically shows the main steps of the method for processing genetic data according to preferred embodiments of the invention, wherein further details are set out, by way of example, in FIG. 2. FIG. 2 also shows schematically the components of a data processing apparatus 100 with a fragmenting device 10, a coding device 20 and a storage device 30/database 30A.

    [0052] In the method sequence according to FIG. 1, firstly the preparation of the genetic data 1 is shown with step S1. The preparation of the genetic data 1 comprises, for example, the sequencing of genetic material of at least one individual. The sequencing takes place using per se known sequencing techniques. Alternatively, the preparation of the genetic data 1 comprises the retrieval of genetic data 1 from existing data sources, for example, freely accessible databases. The genetic data 1 typically comprises parts of a genome of the individual, but can also represent the entire genome. For example, the genetic data 1 of a particular individual relates to genetic data of iPS cells (induced pluripotent stem cells) of the individual.

    [0053] Step S1 is a preparation step of the method according to the invention. The preparation of the genetic data 1 in step S1 can be provided immediately before the subsequent processing with the steps S2 to S4 or temporally separated from them.

    [0054] In step S2, the formation of the sequence fragments 3 from the genetic data 1 follows. FIG. 2 shows, by way of example, genetic data 1 from sequence elements in the form of a nucleotide sequence. The nucleotide sequence consists of the nucleobases adenine, thymine, guanine and cytosine which are usually abbreviated as A, T, G and C. As sequence fragments 3, k-mers (herein, e.g. k=3) are formed. Beginning with a start element 2 (e.g. T), the step-wise readout of sequence fragments 3 of length 3 takes place. The provision of the sequence fragments 3 takes place with a readout using a sliding window. As a result, the succession 4 of sequence fragments 3 is formed. Step S2 can be implemented with a per se known sliding window algorithm.
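
    The sliding-window readout of step S2 can be illustrated with a minimal Python sketch. The function name form_sequence_fragments, its parameters and the example sequence are hypothetical and chosen only for illustration; the source does not prescribe a particular implementation.

```python
def form_sequence_fragments(sequence: str, k: int = 3, start: int = 0) -> list[str]:
    """Return all overlapping k-mers of `sequence`, beginning at index `start`."""
    if k < 2:
        raise ValueError("fragment length must be at least 2")
    # Advance the window by one sequence element per step, so that
    # neighbouring fragments overlap in k - 1 elements.
    return [sequence[i:i + k] for i in range(start, len(sequence) - k + 1)]


# Hypothetical example sequence, read out with k = 3.
fragments = form_sequence_fragments("TATGCA", k=3)
print(fragments)  # ['TAT', 'ATG', 'TGC', 'GCA']
```

    In this sketch the window advances by exactly one element, so every interior sequence element appears in several fragments, while the elements at the very ends of the sequence appear in fewer.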

    [0055] Subsequently, at step S3, the encoding of the sequence fragments 3 takes place with a coding device 20. The coding device 20 is configured to apply a hash function ƒ_H to the sequence fragments 3. As a result of the application of the hash function, a hash value table is obtained. The elements of the hash value table are encrypted fragment data 5 which represent the sequence fragments 3. This hash value table thus contains the genome sequence of a person in a form that permits no inferences as to the identity of the person, or the like.
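
    The encoding of step S3 might be sketched as follows. The source does not name a specific hash function; SHA-256 from the Python standard library is used here purely as an assumption, and the names encode_fragment and build_hash_value_table are hypothetical.

```python
import hashlib


def encode_fragment(fragment: str) -> str:
    """Apply the hash function (assumed here: SHA-256) to one sequence fragment."""
    return hashlib.sha256(fragment.encode("ascii")).hexdigest()


def build_hash_value_table(fragments: list[str]) -> list[str]:
    """Encode every fragment; the result plays the role of the hash value table."""
    return [encode_fragment(fragment) for fragment in fragments]


# Hypothetical fragments of length 3, as produced by a sliding-window readout.
hash_value_table = build_hash_value_table(["TAT", "ATG", "TGC", "GCA"])
print(len(hash_value_table))  # 4
```

    Because the hash function is deterministic, identical fragments always yield identical table entries, which is what later allows an encoded search sequence to be matched against the stored data.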

    [0056] As distinct from the representation in FIG. 2, the single application of the hash function ƒ_H can be replaced by the repeated (at least twofold) application of the hash function ƒ_H, namely a first application to the sequence fragments 3 and at least one further application to the encrypted fragment data 5.
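
    A repeated application of the hash function could look like the sketch below. SHA-256, the function name encode_repeated and the choice to feed the raw digest bytes into the next round are all assumptions made for illustration; the source only states that the hash function is applied again to the encrypted fragment data.

```python
import hashlib


def encode_repeated(fragment: str, rounds: int = 2) -> str:
    """Hash a fragment, then hash the result again until `rounds` applications are done."""
    if rounds < 1:
        raise ValueError("at least one application of the hash function is required")
    # First application: to the sequence fragment itself.
    digest = hashlib.sha256(fragment.encode("ascii")).digest()
    # Further applications: to the previously obtained hash value.
    for _ in range(rounds - 1):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


twofold = encode_repeated("ATG", rounds=2)
```

    A database query would have to apply the same number of rounds to the search sequence in order to produce matching values.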

    [0057] The encoding of the sequence fragments 3 provides the encrypted fragment data 5 in the hash value table. The encrypted fragment data 5 (encoded sequence fragments) are subsequently stored at step S4 in the storage device 30, for example, the database 30A. The database 30A is part of the data processing apparatus 100 or is provided separately therefrom. The encrypted fragment data 5 of a hash value table, i.e. of an individual, are stored in each case, in predetermined storage sections and/or together with a sequence identification (sample ID) representing the assignment to a particular hash value table, so that the association of the encrypted fragment data 5 with an anonymized sample from an individual is maintained.
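
    One way to keep the association between encrypted fragment data and an anonymized sample is sketched below. An in-memory dictionary stands in for the database 30A; this data layout, SHA-256 as the hash function, the function name store_sample and the sample IDs are assumptions for illustration only.

```python
import hashlib
from collections import defaultdict


def store_sample(index: dict, sample_id: str, sequence: str, k: int = 3) -> None:
    """Hash every overlapping k-mer of `sequence` and record it under `sample_id`."""
    for i in range(len(sequence) - k + 1):
        value = hashlib.sha256(sequence[i:i + k].encode("ascii")).hexdigest()
        # Each hash value maps to the set of sample IDs whose table contains it.
        index[value].add(sample_id)


# Hypothetical stand-in for the database: hash value -> set of sample IDs.
index = defaultdict(set)
store_sample(index, "sample-001", "TATGCA")
store_sample(index, "sample-002", "GGCATT")
```

    Only hash values and sample IDs are stored in this sketch; the plain sequences are not retained.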

    [0058] For querying of the database 30A, as shown in the right-hand part of FIG. 2, a search sequence 6 of nucleic acids, for example ATG, is initially prepared (step S5) and, by applying the hash function, is encrypted (step S6). As a result, an encrypted search sequence 7 is prepared in the form of a hash value. Subsequently, the database is searched with regard to the occurrence of this hash value using per se known search techniques (step S7). When the encrypted search sequence 7 is located, the hash value table to which the found search sequence belongs is acquired. By means of the data structure of the database 30A with a plurality of hash value tables, this search needs a constant runtime and is therefore efficient.
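
    The query steps S5 to S7 can be sketched as follows. The sketch assumes SHA-256 as the hash function, a dictionary as a stand-in for the database 30A, and a search sequence whose length already equals the fragment length; the names build_index and query and the sample data are hypothetical.

```python
import hashlib


def encode(fragment: str) -> str:
    """Hash function shared by storage and query (assumed here: SHA-256)."""
    return hashlib.sha256(fragment.encode("ascii")).hexdigest()


def build_index(samples: dict[str, str], k: int = 3) -> dict[str, set[str]]:
    """Map each hashed k-mer to the IDs of the samples that contain it."""
    index: dict[str, set[str]] = {}
    for sample_id, sequence in samples.items():
        for i in range(len(sequence) - k + 1):
            index.setdefault(encode(sequence[i:i + k]), set()).add(sample_id)
    return index


def query(index: dict[str, set[str]], search_sequence: str) -> set[str]:
    """Steps S6 and S7: encode the search sequence, then look up its hash value."""
    return index.get(encode(search_sequence), set())


index = build_index({"sample-001": "TATGCA", "sample-002": "GGCATT"})
print(query(index, "ATG"))  # {'sample-001'}
print(query(index, "CCC"))  # set()
```

    The dictionary lookup takes constant time on average, independent of the number of stored hash values, which corresponds to the constant runtime mentioned above.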

    [0059] Further details of a preferred use of the invention are shown in FIG. 3. With this use, a system 200 for preparing anonymized genetic data by clinical facilities and/or laboratories and for use of the data by an operator, for example, a university or industrial research facility is provided. On the left-hand side of FIG. 3 it is shown schematically how genetic data 1 are prepared, for example, at a clinical facility 40 (step S1). In a practical example, the system 200 can comprise a plurality of operators and a plurality of users who commonly access the database or a plurality of databases. Subsequently, the genetic data 1 is subjected to the method according to the invention with the steps S2 and S3 in order to prepare the encoded sequence fragments 5 and to store them in the database 30A (step S4).

    [0060] A research facility 50 has an interest in an evaluation of the genetic data 1. For example, in the search for a particular disease, the question arises of whether a prepared search sequence 6 (step S5) is included in the genetic data 1 (see upper double-headed arrow). However, this direct query is made difficult or even precluded by the excessive effort required for a search in the genetic data 1 and by data protection requirements. In order to be able nevertheless to search the genetic data 1, as described above, the search sequence 6 is subjected to the encoding for generating a hash value (step S6), after which a search can be carried out in the database 30A (step S7). If the search has the result that the stored encrypted fragment data 5 include the searched-for encrypted search sequence 7, the associated genetic data 1, i.e. the dataset of a particular individual, is identified. Subsequently, a query relating to this special dataset can be placed by the research facility 50 back to the clinical facility 40 in order, while observing the rules of data security, to obtain further information regarding the individual with the relevant search sequence and/or cell material of the individual with the relevant search sequence, for example, from a cell bank.

    [0061] It should be noted that the example given represents only one possible use of the invention in which it is enabled, without exact knowledge of the genetic data, to be able to process particular questions from the field of personalized medicine. Dependent upon the available data and/or the data format, only the necessary format of the search sequences and/or the search question are defined in order to prepare a hash-value match of the same data points in the database.

    [0062] A further example for the use of the invention is where a research facility wishes to investigate a particular disease and for this purpose needs cell material with particular genetic features from a cell bank. If the genetic data of the material stored in the cell bank are processed according to the present invention, the invention can be applied to find suitable cell lines from the cell bank without accessing the genetic data. The research facility obtains information, with a significantly reduced cost and time expenditure, regarding which cell line is needed to carry out the planned investigations without having to sequence the cell material itself.

    [0063] The features of the invention disclosed in the above description, the drawings and the claims can be significant either individually or in combination or sub-combination for the realization of the invention in its various embodiments.