Computer-Implemented Method and Computer System for Identifying Organisms
20170212985 · 2017-07-27
Cpc classification
G16B50/00
PHYSICS
G16B35/00
PHYSICS
Abstract
To identify organism types from a target gene sequence, a server receives (S1) the target gene sequence from a user via a telecommunications network. From a plurality of type-specific profiles, which define informative sequence regions for differentiating individual organisms, the profile having the highest correlation with the target gene sequence is selected (S2) automatically. The target gene sequence is compared (S4) automatically to reference sequences related to the selected profile. The comparison results related to the informative sequence regions are weighted (S5) and, from the reference sequences, the organism type associated with the type-specific reference sequence having the best match with the target gene sequence is determined (S9). The best match is determined based on the weighted comparison results. The profile search and weighted alignment provide identification of organism types from a target gene sequence while discriminating between trivial and significant inter-sequence differences.
Claims
1-27. (canceled)
28. A method of identifying or classifying organism types from a target gene sequence, said method comprising: (i) obtaining the target gene sequence at a server, the server comprising one or more processors; (ii) providing at least one database using the server, the at least one database comprising a plurality of organism type-specific profiles associated with one or more related reference sequences, wherein each organism type-specific profile defines informative sequence regions for differentiating individual organisms, and wherein each organism type-specific profile comprises position specific information derived from nucleotide positions of related reference sequences; (iii) correlating the target gene sequence and the plurality of organism type-specific profiles using the server; (iv) selecting an organism type-specific profile having a highest correlation with the target gene sequence based on the position specific information of the related organism type-specific profile using the server; (v) retrieving the related reference sequences associated with the selected organism type-specific profile from the at least one database using the server; (vi) comparing one or more nucleotides of the target gene sequence to the related reference sequences, and weighting the results of the nucleotide comparisons by weighting differentially nucleotide correspondences and nucleotide differences determined at said nucleotide positions which are informative for differentiating individual organisms of said organism type-specific profiles correlated with said target gene sequence using the server; (vii) determining, based on the nucleotide comparison results weighted for the informative sequence regions, an optimal organism type-specific reference sequence having a best match with the target gene sequence using the server; and (viii) identifying or classifying said organism types of said target gene sequence based on the optimal organism type-specific reference sequence by assigning to the target gene sequence the same organism type as the best matched organism type-specific reference sequence using the server; and (ix) communicating at least the organism type of the best matched organism type-specific reference sequence as an output of the server.
29. The method according to claim 28, wherein the nucleotide differences include a number of differences in nucleotide codes of each of the reference sequences when compared to the target gene sequence; wherein weighting the results of the nucleotide comparisons includes determining for each reference sequence a weighted number of differences by multiplying with a weighting factor the number of differences related to the informative sequence regions; and wherein the method further includes storing a list of the reference sequences, the list being sorted by the weighted number of differences of the respective reference sequences, when compared to the target gene sequence.
30. The method according to claim 28, further comprising: assessing the target gene sequence and the reference sequences related to the selected organism type-specific profile automatically for new informative sequence regions; and adapting the selected organism type-specific profile by storing a new informative sequence region as a part of the selected organism type-specific profile.
31. The method according to claim 30, wherein assessing the target gene sequence and the reference sequences includes: aligning the target gene sequence and the reference sequences related to the selected organism type-specific profile; and identifying the new informative sequence regions by identifying nucleotide codes corresponding at a same sequential position in at least a defined number of the target gene sequence and the reference sequences.
32. The method according to claim 28, wherein providing the at least one database comprises: aligning one or more organism type-specific gene sequences of the organism type-specific profiles; creating consensus sequences per organism type of the one or more organism type-specific gene sequences; identifying informative regions that enable differentiating individual organism types; and defining the organism type-specific profiles based on the informative regions.
33. The method according to claim 32, wherein the organism type-specific profiles stored in the at least one database include genus-specific or group-specific profiles, and wherein the genus-specific or group-specific profiles are determined by aligning genus-specific or group-specific gene sequences, by creating consensus sequences per organism, by identifying the informative regions that enable differentiating the individual organisms, and by defining the genus-specific or group-specific profiles based on the informative regions.
34. The method according to claim 28, further comprising: proofreading the target gene sequence based on the selected organism type-specific profile by at least: comparing the target gene sequence to the reference sequences related to the selected organism type-specific profile; assessing differences of nucleotide codes, located in informative sequence regions, whether the differences indicate another organism type; and initiating adaptation of the selected organism type-specific profile for differences assessed to indicate another organism type by determining sequence positions or regions that have a correlation across the respective reference gene sequence.
35. The method according to claim 28, wherein the target gene sequence comprises a target gene sequence received by the server via a telecommunications network; and wherein the method further comprises: transmitting the organism type of the target gene sequence as indicated by the organism type-specific reference sequence from the server via the telecommunications network to a user interface.
36. The method according to claim 28, wherein weighting differentially nucleotide correspondences and nucleotide differences comprises weighting more heavily nucleotide correspondences determined at said nucleotide positions and weighing less heavily nucleotide differences determined at said nucleotide positions.
37. The method according to claim 28, wherein weighting differentially nucleotide correspondences and nucleotide differences comprises weighting less heavily nucleotide correspondences determined at said nucleotide positions and weighing more heavily nucleotide differences determined at said nucleotide positions.
38. A method of identifying or classifying organism types, comprising: (i) obtaining a target gene sequence from an organism; (ii) providing at least one database comprising a plurality of organism type-specific profiles associated with one or more related reference sequences, wherein each organism type-specific profile defines informative sequence regions for differentiating individual organisms, and wherein each organism type-specific profile comprises position specific information derived from nucleotide positions of related reference sequences; (iii) correlating the target gene sequence and the plurality of organism type-specific profiles; (iv) selecting an organism type-specific profile having a highest correlation with the target gene sequence based on the position specific information of the related organism type-specific profile; (v) retrieving the related reference sequences associated with the selected organism type-specific profile from the at least one database; (vi) comparing one or more nucleotides of the target gene sequence to the related reference sequences, and weighting the results of the nucleotide comparisons by weighting differentially nucleotide correspondences and nucleotide differences determined at said nucleotide positions which are informative for differentiating individual organisms of said organism type-specific profiles correlated with said target gene sequence; (vii) determining, based on the nucleotide comparison results weighted for the informative sequence regions, an optimal organism type-specific reference sequence having a best match with the target gene sequence; and (viii) identifying or classifying said organism types of said target gene sequence based on the optimal organism type-specific reference sequence by assigning to the target gene sequence the same organism type as the best matched organism type-specific reference sequence.
39. The method according to claim 38, wherein the nucleotide differences include a number of differences in nucleotide codes of each of the reference sequences when compared to the target gene sequence; wherein weighting the results of the nucleotide comparisons includes determining for each reference sequence a weighted number of differences by multiplying with a weighting factor the number of differences related to the informative sequence regions; and wherein the method further includes storing a list of the reference sequences, the list being sorted by the weighted number of differences of the respective reference sequences, when compared to the target gene sequence.
40. The method according to claim 38, further comprising: assessing the target gene sequence and the reference sequences related to the selected organism type-specific profile automatically for new informative sequence regions; and adapting the selected organism type-specific profile by storing a new informative sequence region as a part of the selected organism type-specific profile.
41. The method according to claim 40, wherein assessing the target gene sequence and the reference sequences includes: aligning the target gene sequence and the reference sequences related to the selected organism type-specific profile; and identifying the new informative sequence regions by identifying nucleotide codes corresponding at a same sequential position in at least a defined number of the target gene sequence and the reference sequences.
42. The method according to claim 38, wherein providing the at least one database comprises: aligning one or more organism type-specific gene sequences of the organism type-specific profiles; creating consensus sequences per organism type of the one or more organism type-specific gene sequences; identifying informative regions that enable differentiating individual organism types; and defining the organism type-specific profiles based on the informative regions.
43. The method according to claim 42, wherein the organism type-specific profiles stored in the at least one database include genus-specific or group-specific profiles, and wherein the genus-specific or group-specific profiles are determined by aligning genus-specific or group-specific gene sequences, by creating consensus sequences per organism, by identifying the informative regions that enable differentiating the individual organisms, and by defining the genus-specific or group-specific profiles based on the informative regions.
44. The method according to claim 38, further comprising: proofreading the target gene sequence based on the selected organism type-specific profile by at least: comparing the target gene sequence to the reference sequences related to the selected organism type-specific profile; assessing differences of nucleotide codes, located in informative sequence regions, whether the differences indicate another organism type; and initiating adaptation of the selected organism type-specific profile for differences assessed to indicate another organism type by determining sequence positions or regions that have a correlation across the respective reference gene sequence.
45. The method according to claim 38, wherein the target gene sequence comprises a target gene sequence received by a server via a telecommunications network; and wherein the method further comprises: transmitting the organism type of the target gene sequence as indicated by the organism type-specific reference sequence from the server via the telecommunications network to a user interface.
46. The method according to claim 38, wherein weighting differentially nucleotide correspondences and nucleotide differences comprises weighting more heavily nucleotide correspondences determined at said nucleotide positions and weighing less heavily nucleotide differences determined at said nucleotide positions.
47. The method according to claim 38, wherein weighting differentially nucleotide correspondences and nucleotide differences comprises weighting less heavily nucleotide correspondences determined at said nucleotide positions and weighing more heavily nucleotide differences determined at said nucleotide positions.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0014] The present invention will be explained in more detail, by way of example, with reference to the drawings in which:
[0015]
[0016]
[0017]
[0018]
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0019] In
[0020] Connected to the personal computer 11 is a conventional sequencer 5, which provides the personal computer 11 with sequence data of DNA (Deoxyribonucleic Acid) fragments. For example, the fragment sequence data includes sequence signals and associated information (e.g. peak values) of the DNA fragments, each sequence signal including signals of the four nucleotide types Adenine, Cytosine, Guanine, and Thymine (A, C, G, T). Generally, the terms gene sequence, target sequence, or reference sequence are used herein to refer to a sequence of nucleotide codes, i.e. a sequence of codes, each representing one of the nucleotide types. The sequences are related to bacteria, fungi, microfungi, and viruses, for example.
[0021] As is illustrated in
[0022] As is illustrated schematically in
[0023] The server 3 includes different functional modules, namely a communication module 30, an application module 31, a profile selection module 32, a retrieval module 33, a comparison module 34, a type determination module 35, a profile adaptation module 36, a profiling module 37, and a proofreading module 38. The communication module 30 includes conventional hardware and software elements configured for exchanging data via telecommunications network 2 with a plurality of data entry terminals 1. The application module 31 is a programmed software module configured to provide users of the data entry terminal 1 with a user interface. Preferably, the user interface is provided through a conventional Internet browser such as Microsoft Explorer or Mozilla. Alternatively, the user interface is generated by user module 14. For example, application module 31 transmits a copy of the user module 14 via telecommunications network 2 to the personal computer 11 of the user. The user module 14 is installed and activated on the personal computer 11. When activated, user module 14 controls a processor of the personal computer 11 such that it generates the user interface on display 13. The profile selection module 32, the retrieval module 33, the comparison module 34, the type determination module 35, the profile adaptation module 36, the profiling module 37, and the proofreading module 38 are programmed software modules executing on a computer of server 3.
[0024] As is illustrated schematically in
TABLE 1: Sequence Profile
Profile identifier | Position-specific information: Region/position specification | Position-specific information: Annotation | Position-specific information: Weighting factor | Assigned reference gene sequence(s)
[0025] The initial profiles 42 are established and stored by profiling module 37, as will be described below with reference to
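The columns of Table 1 can be pictured as a small record type. The following is a minimal sketch only, not the data model of database 4: the class names, the field names, the half-open position convention, and the default weighting factor of 1.0 are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InformativeRegion:
    """One row of position-specific information (hypothetical layout)."""
    start: int        # first aligned position of the region (0-based, assumed)
    end: int          # position just past the region (exclusive, assumed)
    annotation: str   # free-text annotation of the region
    weight: float     # weighting factor applied to comparison results


@dataclass
class SequenceProfile:
    """A sequence profile with the columns listed in Table 1."""
    profile_id: str
    regions: List[InformativeRegion] = field(default_factory=list)
    reference_ids: List[str] = field(default_factory=list)

    def weight_at(self, position: int, default: float = 1.0) -> float:
        """Return the weighting factor of the first region covering a position."""
        for region in self.regions:
            if region.start <= position < region.end:
                return region.weight
        return default  # positions outside any informative region


profile = SequenceProfile(
    profile_id="P1",
    regions=[InformativeRegion(2, 4, "diagnostic for species", 5.0)],
    reference_ids=["R1", "R2"],
)
```

With this layout, a position inside an annotated region yields that region's weighting factor, and any other position falls back to the assumed default.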
[0026] As is illustrated in
[0027] In step S21, validated type sequences are selected for the target specified in step S20. For example, the validated gene sequences are selected automatically by the profiling module 37 from the reference sequences 41 in database 4. However, at least initially, the validated gene sequences are either retrieved from a validated reference database or selected and entered by an expert using data entry terminal 1. The validated type sequences for the specified target cover all the known variable positions, i.e. all the informative sequence regions (positions) that are known to be diagnostic of inter-strain or inter-species differences and thus indicative of organism types (including genus, species, sub-type, variant, and/or clade).
[0028] In step S22, using the profiling module 37, a seed MSA (multiple sequence alignment) is generated from the validated type sequences selected in step S21. Particularly, using the profiling module 37 the type sequences are aligned and a consensus sequence is created for the respective organism type.
[0029] In step S23, informative regions that enable differentiating individual organism types are identified. Using the profiling module 37, the MSA generated in step S22 is provided with annotations for secondary structures (3′ and 5′ pairing regions), and for positions known to be diagnostic for inter-strain or inter-species differences in the target organism, i.e. positions known to be indicative of organism types (including genus, species, sub-type, variant, and/or clade).
[0030] In step S24, the profiling module 37 converts the annotated MSA to one or more annotated profiles and stores these profiles 42 in database 4.
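Steps S22 to S24 can be illustrated with a short sketch. It assumes that the type sequences are already aligned to equal length, that the consensus is a simple per-column majority vote, and that a position counts as informative when the per-type consensus sequences disagree there. The function names are hypothetical, and the specification does not prescribe this particular procedure.

```python
from collections import Counter


def consensus(aligned):
    """Majority nucleotide code per column of equal-length aligned sequences."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must be non-empty and aligned to equal length")
    # zip(*aligned) walks the alignment column by column
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))


def variable_positions(consensus_by_type):
    """Positions at which the consensus sequences of the organism types differ."""
    sequences = list(consensus_by_type.values())
    return [pos for pos, col in enumerate(zip(*sequences)) if len(set(col)) > 1]


type_a = consensus(["ACGT", "ACGA", "ACGT"])
type_b = consensus(["ACTT", "ACTT", "GCTT"])
informative = variable_positions({"type A": type_a, "type B": type_b})
```

In this toy input, the two organism types differ only at the third aligned position, which would be the position annotated in the profile.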
[0031] In step S25, using the profiling module 37, the profiles 42 stored in step S24 are calibrated iteratively. Iterative calibration is achieved by using the profile(s) to search a collection of reference sequences known (validated by experts) to belong or not to belong to the respective organism type, i.e. genus, species, sub-type, variant, and/or clade.
[0032] In step S26, using the profiling module 37, the annotations of the profiles 42 are enriched by including positions that discriminate between the target organism and other genera.
[0033] In step S27, the annotated profiles are validated by experts in the field and through statistics on available sample sequences. For validation purposes, the profiles 42 stored in database 4 are made available to the experts through server 3 and telecommunications network 2. For example, the validated profiles are provided with an indicator, or an electronic certificate or signature in database 4.
[0034] In the following paragraphs, the identification of an organism type from a target gene sequence is described with reference to
[0035] In step S1, a target gene sequence is received by the communication module 30 via telecommunications network 2. The target gene sequence is provided by data entry terminal 1. For example, the target gene sequence is stored in personal computer 11 or generated from sequence data of DNA fragments provided by sequencer 5. Preferably, the target gene sequence is defined by a user through a user interface visualized on display 13 by application module 31 or by user module 14. Subsequently, the process for identifying the organism type is initiated automatically. Alternatively, the process is initiated by the user activating a control element such as a graphical button in the user interface.
[0036] In step S2, the profile selection module 32 determines in database 4 the sequence profile 42 having the highest correlation with the target gene sequence received in step S1. The degree of correlation between the target gene sequence and the sequence profiles 42 is determined based on the position-specific information contained in the sequence profiles 42, i.e. the profile selection module 32 uses the profile annotations on informative sequence regions to select the profile 42 having the best match with the target gene sequence. Preferably, the best matching sequence profile is determined by applying for each profile its position-specific weighting factors to deviations and/or correspondences of the target sequence from/with the profile.
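One way to picture the selection in step S2 is a weighted match fraction. The sketch below assumes that each profile is reduced to a consensus sequence and a list of position-specific weighting factors, and that the target is already aligned to each consensus. The scoring formula and all names are illustrative assumptions, since the specification does not fix a particular correlation measure.

```python
def profile_score(target, consensus, weights):
    """Weighted fraction of positions at which the target matches the consensus."""
    total = sum(weights)
    matched = sum(w for t, c, w in zip(target, consensus, weights) if t == c)
    return matched / total if total else 0.0


def select_profile(target, profiles):
    """Return the identifier of the profile with the highest weighted score."""
    return max(profiles, key=lambda name: profile_score(target, *profiles[name]))


# Hypothetical profiles: identifier -> (consensus, position-specific weights)
profiles = {
    "genus A": ("ACGTAC", [1, 1, 5, 1, 1, 1]),
    "genus B": ("ACTTAC", [1, 1, 5, 1, 1, 1]),
}
selected = select_profile("ACTTAC", profiles)
```

Because the third position carries a weighting factor of 5, a mismatch there halves the score of "genus A", whereas a mismatch at an unweighted position would reduce it by only a tenth.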
[0037] In step S3, the retrieval module 33 loads from database 4 the reference gene sequences 41 associated with the sequence profile 42 selected in step S2.
[0038] In step S4, the comparison module 34 compares the target gene sequence, received in step S1, to one of the reference gene sequences, retrieved in step S3. Furthermore, in step S5, the comparison module 34 weights the comparison results with weighting factors associated with the sequence profile, particularly weighting factors associated with the informative sequence regions. Consequently, comparison results related to a first sequence region may be weighted with a different weighting factor than comparison results related to a second sequence region. Thus, the comparison module 34 weights the number of differences and/or the number of correspondences, between the target gene sequence and the respective reference gene sequence, using weighting factors associated with the informative sequence regions outlined in the profile 42.
[0039] In step S6, the comparison module 34 stores a score, indicative of the matching level, assigned to the respective reference gene sequence. The score is based on the weighted comparison results. For example, comparison module 34 stores a score based on the weighted number of differences and/or correspondences.
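The weighted number of differences used as a score can be sketched as follows. The sketch assumes that target and reference are already aligned to equal length, that informative positions are given as a set of indices, and that a single weighting factor applies to all of them. A profile with region-specific factors would look up the factor per position instead. The function name is hypothetical.

```python
def weighted_differences(target, reference, informative, factor):
    """Count differing positions, multiplying those in informative regions."""
    score = 0.0
    for pos, (t, r) in enumerate(zip(target, reference)):
        if t != r:
            # A difference at an informative position counts `factor` times
            score += factor if pos in informative else 1.0
    return score


score = weighted_differences("ACGTAC", "ACTTAA", informative={2}, factor=10.0)
```

Here the two sequences differ at two positions; the difference at the informative position contributes 10 and the other contributes 1, so a reference that deviates in a diagnostic region is pushed well down the ranking.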
[0040] In step S7, the application module 31 checks whether or not there are further reference gene sequences assigned to the selected profile that need to be processed. If there are more reference sequences to be processed, processing continues in step S4. Otherwise, if all reference sequences assigned to the selected profile have been processed, processing continues in step S8.
[0041] In optional step S8, the type determination module 35 generates a full or partial list of the reference sequences assigned to the selected profile. The list is sorted by the score assigned to the reference sequences. For example, the list (with its entries and assigned scores) is transmitted via telecommunications network 2 to the data entry terminal 1 where it is shown to the user on display 13.
[0042] In step S9, the type determination module 35 determines the type-specific reference gene sequence having the best match with the target gene sequence. The type determination module 35 determines the type-specific reference gene sequence based on the assigned scores, i.e. the weighted comparison results, e.g. the weighted number of differences and/or correspondences assigned to the reference sequences retrieved in step S3. For example, the type-specific reference gene sequence is defined by the lowest weighted number of differences and/or the highest weighted number of correspondences. The type information associated with the type-specific reference gene sequence is selected to define the organism type of the target gene sequence. Thus, the organism type of the target gene sequence is defined by the genus, species, sub-type, variant, and/or clade associated with the type-specific reference gene sequence. Preferably, the organism type and its assigned score are transmitted by the communication module 30 via telecommunications network 2 to the data entry terminal 1 where the organism type and its assigned score are shown to the user on display 13.
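Steps S8 and S9 amount to sorting the stored scores and reading off the top entry. The following sketch assumes that the score is a weighted number of differences, so that lower is better, and that the type information is held in a simple mapping from reference identifier to organism type. Both function names and the sample data are hypothetical.

```python
def rank_references(scores):
    """List of (reference id, score) pairs, best match (lowest score) first."""
    return sorted(scores.items(), key=lambda item: item[1])


def best_match(scores, types):
    """Organism type and score of the best matching reference sequence."""
    ref_id, score = rank_references(scores)[0]  # raises IndexError if empty
    return types[ref_id], score


scores = {"ref1": 11.0, "ref2": 1.0, "ref3": 4.0}
types = {"ref1": "species A", "ref2": "species B", "ref3": "species C"}
ranking = rank_references(scores)
organism_type, top_score = best_match(scores, types)
```

If the score were instead a weighted number of correspondences, the sort order would be reversed so that the highest score comes first.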
[0043] As is illustrated in
[0044] In step S11, the profile adaptation module 36 assesses the new reference gene sequence from step S10, or one or more sample gene sequences stored in step S10, for new informative sequence regions. Particularly, the profile adaptation module 36 determines whether the reference gene sequences (and possibly sample gene sequences) assigned to the respective sequence profile 42 have informative sequence regions that are not yet included in the sequence profile but indicate an organism type, i.e. a genus, species, sub-type, variant, and/or clade. In essence, the profile adaptation module 36 determines sequence positions or regions that have a correlation across the respective reference gene sequences (and possibly sample gene sequences) exceeding a defined correlation threshold. For example, the profile adaptation module 36 aligns the new reference sequence (and possibly the sample gene sequences) and the reference sequences related to the selected profile, and identifies new informative sequence regions by identifying nucleotide codes corresponding at the same sequential position in at least a defined number of the new reference sequence (and possibly sample gene sequences) and reference sequences.
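The search for new informative sequence regions in step S11 can be illustrated with a deliberately simplified sketch. It assumes that all sequences are aligned to equal length and treats a position as a candidate when the same nucleotide code occurs there in at least a defined number of sequences and the position is not yet part of the profile. A practical implementation would also need to check that the position distinguishes the organism type from others; that check, like the function name and parameters, is outside what the sketch covers.

```python
from collections import Counter


def candidate_positions(aligned, min_count, known):
    """Positions shared by at least `min_count` sequences and not yet in the profile."""
    found = []
    for pos, column in enumerate(zip(*aligned)):
        # Most frequent nucleotide code in this alignment column and its count
        _, count = Counter(column).most_common(1)[0]
        if count >= min_count and pos not in known:
            found.append(pos)
    return found


aligned = ["ACGT", "ACGA", "ATGC"]
new_positions = candidate_positions(aligned, min_count=3, known={0})
```

In this toy alignment the first and third positions are fully conserved, but the first is already recorded in the profile, so only the third is reported as new.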
[0045] In step S12, the profile adaptation module 36 determines whether there are one or more new informative sequence regions. If there are new informative sequence regions, processing continues in step S13. Otherwise, processing of the target gene sequence ends in step S14.
[0046] In step S13, based on the new informative sequence region(s) determined in step S11, the profile adaptation module 36 adapts the sequence profile selected in step S2. Thus, the sequence profile selected in step S2 is refined by adding the new informative sequence region(s) to the sequence profile. Subsequently, the refined sequence profile may be subject to iterative calibration and validation, as described in the context of steps S25 and S27.
[0047] The proofreading module 38 is configured to support proofreading of, or to proofread automatically, the target gene sequence received in step S1, based on the sequence profile selected in step S2. To support a user in proofreading the target sequence, the proofreading module 38 indicates in a user interface any informative sequence regions that enable differentiation of organism type.
[0048] Preferably, this indication is provided in an alignment of target sequence, reference sequence(s) and/or a consensus sequence, displayed in the user interface, by highlighting visually the informative sequence regions. To provide automatic proofreading, for differences arising from nucleotide codes located in informative sequence regions, indicative of organism types, the proofreading module 38 calculates values for the probability that the difference is due to an error in the target sequence or that the difference indicates another organism type. Furthermore, the proofreading module 38 applies threshold values to the calculated probability values and corrects automatically an error (e.g. in the consensus sequence), or inserts (into the consensus sequence) a special code indicating ambiguity, for example an IUPAC (International Union of Pure and Applied Chemistry) code, or triggers an adaptation of the sequence profile, as described above with reference to
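The threshold logic of automatic proofreading can be sketched as a three-way decision. The sketch takes the error probability as an input, because the specification does not state how it is calculated; the two threshold values, the function name, and the returned action labels are assumptions. The two-nucleotide ambiguity codes are the standard IUPAC ones.

```python
# Standard IUPAC codes for two-nucleotide ambiguities
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def proofread_action(p_error, observed, expected,
                     correct_above=0.9, adapt_below=0.1):
    """Choose an action for one difference located in an informative region."""
    if p_error >= correct_above:
        return ("correct", expected)        # likely a sequencing error
    if p_error <= adapt_below:
        return ("adapt_profile", observed)  # likely another organism type
    # Undecided: record both possibilities with an ambiguity code
    return ("ambiguous", IUPAC.get(frozenset((observed, expected)), "N"))


action = proofread_action(0.5, observed="A", expected="G")
```

An intermediate probability leaves the call open, so the position is marked with the code covering both the observed and the expected nucleotide.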
[0049] In describing representative embodiments of the invention, the specification may have presented the method and/or process of the invention as a particular sequence of steps. However, to the extent that the method or process does not rely on the particular order of steps set forth herein, the method or process should not be limited to the particular sequence of steps described. As one of ordinary skill in the art would appreciate, other sequences of steps may be possible. Therefore, the particular order of the steps set forth in the specification should not be construed as limitations on the claims. In addition, the claims directed to the method and/or process of the invention should not be limited to the performance of their steps in the order written, and one skilled in the art can readily appreciate that the sequences may be varied and still remain within the spirit and scope of the invention.
[0050] The foregoing disclosure of the embodiments of the invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many variations and modifications of the embodiments described herein will be apparent to one of ordinary skill in the art in light of the above disclosure. The scope of the invention is to be defined only by the claims appended hereto, and by their equivalents. Particularly, in addition to the sectors of bacteriology, mycology and virology, the present invention can also be applied in any other sectors where sequence similarity searches are involved, e.g. in human and veterinary diseases and disease predispositions (e.g. cancer), in infectious diseases such as HBV, HIV etc., as well as in typing of animals, humans, plants, microorganisms and viruses.