CONSTRUCTION METHOD OF RIBOSOMAL RNA DATABASE

Abstract

A method for constructing a ribosomal RNA database is provided, including the following steps: selecting a source of a nucleic acid sequence database; performing normalization and homogenization on species classification rules; using AI technology for normalized classification and correction; selecting the kingdom to which the sequence species belongs; filtering out redundant sequences and sequences with inconsistent lengths; setting a threshold for unknown bases other than A, T, C or G, and excluding sequences whose unknown bases exceed the threshold; and excluding sequences with insufficient classification information.

Claims

1. A method for constructing a ribosomal RNA database, comprising: selecting a source of a nucleic acid sequence database; performing normalization and homogenization on species classification rules; using an AI technology for normalized classification and correction; selecting a kingdom to which a sequence species belongs; filtering out redundant sequences and sequences with inconsistent lengths; setting a threshold for unknown bases other than A, T, C or G, and excluding the unknown bases that exceed the threshold; and excluding sequences with insufficient classification information.

2. The method for constructing the ribosomal RNA database according to claim 1, wherein the nucleic acid sequence database comprises a native repository database or a value-added database.

3. The method for constructing the ribosomal RNA database according to claim 1, wherein the ribosomal RNA database comprises a 16S rRNA gene database.

4. The method for constructing the ribosomal RNA database according to claim 1, wherein a seventh-order nomenclature is used for normalization to form a hierarchy relation table, and hierarchies defined in the seventh-order nomenclature comprise kingdom, phylum, class, order, family, genus, and species.

5. The method for constructing the ribosomal RNA database according to claim 4, wherein the method for homogenization comprises finding out information of other hierarchy in the classification hierarchy relation table based on species names in the nucleic acid sequence database, or using a serial number as a search target for comparison with a database that stores the serial numbers based on a serial number of a species in the nucleic acid sequence database, after a species name of the serial number is found, the information of other hierarchy is found from the classification hierarchy relation table.

6. The method for constructing the ribosomal RNA database according to claim 5, wherein the step of using the AI technology to perform normalized classification and naming comprises performing a comparison according to a species hierarchy, so as to confirm that there is no repetition in sequence classification information.

7. The method for constructing the ribosomal RNA database according to claim 3, wherein the step of selecting the kingdom to which the sequence species belongs comprises selecting sequences belonging to a kingdom of Archaea and a kingdom of Bacteria directed at the 16S rRNA gene database, and excluding other kingdoms or sequences where a kingdom name is mistakenly named as Archaea or Bacteria.

8. The method for constructing the ribosomal RNA database according to claim 3, wherein in the 16S rRNA gene database, when a sequence contains the same species sequence with 100% identical conditions, the sequence is a redundant sequence.

9. The method for constructing the ribosomal RNA database according to claim 3, wherein in the 16S rRNA gene database, sequences with inconsistent lengths are those that are shorter than 1200 bases or longer than 1800 bases in length.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0017] FIG. 1 is a schematic flowchart of a method for constructing a ribosomal RNA database according to an embodiment of the present disclosure.

[0018] FIG. 2 and FIG. 3 are schematic diagrams of a homogenization method in the construction method of a ribosomal RNA database according to an embodiment of the present disclosure.

[0019] FIG. 4 is a schematic diagram of performing normalized classification and naming using AI technology in the method for constructing a ribosomal RNA database according to an embodiment of the present disclosure.

[0020] FIG. 5 is a schematic diagram of excluding sequences with insufficient classification information in the method for constructing a ribosomal RNA database according to an embodiment of the present disclosure.

DESCRIPTION OF EMBODIMENTS

[0021] As used herein, a range defined by “one value to another value” is a general description that avoids listing all the values in a range in the specification. Therefore, the recitation of a particular numerical range includes any numerical value within the numerical range and a smaller numerical range defined by any numerical value within the numerical range, and such recitation is equivalent to explicitly describing said any numerical value and said smaller numerical value in the specification.

[0022] The following examples will be described in detail in conjunction with the accompanying drawings, but the provided examples are not intended to limit the scope of the present disclosure.

[0023] The present disclosure provides a method for constructing a ribosomal RNA database. FIG. 1 is a schematic flowchart of a method for constructing a ribosomal RNA database according to an embodiment of the present disclosure. Hereinafter, the construction method of a ribosomal RNA database according to an embodiment of the present disclosure will be described in detail with reference to FIG. 1.

[0024] Please refer to FIG. 1. First, step S10 is performed, and a source of a nucleic acid sequence database is selected. The nucleic acid sequence database may include a native repository database or a value-added database as an initial data source. In this embodiment, the constructed ribosomal RNA database is, for example, a 16S rRNA gene database, and the 16S rRNA gene database is used as the main example in the description below, but the present disclosure is not limited thereto. 16S rRNA is an important component of the prokaryotic ribosomal small subunit and contains conserved regions and 9 highly variable regions. Many studies show that 16S rRNA is highly conserved among different species of bacteria, which means that even if genetic variation occurs within a single species, its 16S rRNA sequence does not change easily. Therefore, the 16S rRNA sequence is very suitable for identifying bacterial and archaeal species.

[0025] Next, please continue to refer to FIG. 1, and step S12 is carried out to perform normalization and homogenization on species classification rules. In terms of normalization, the classification of species is typically based on the classification rules established by Carl Linnaeus. As the rules have evolved over the years, they have come to be divided mainly into seven hierarchies: “kingdom, phylum, class, order, family, genus and species”. All sequence classification information may be normalized by using this seventh-order classification nomenclature to form a classification hierarchy relation table. In terms of homogenization, the process is performed mainly to homogenize the nomenclature across databases and, at the same time, to correct wrong information in the source database. More specifically, the homogenization process may include, for example, the following two methods. FIG. 2 and FIG. 3 are schematic diagrams of a homogenization method in the construction method of a ribosomal RNA database according to an embodiment of the present disclosure. The first method is, for example, to look up the information of the other hierarchies in the classification hierarchy relation table based on the species name in the nucleic acid sequence database. Please refer to FIG. 2. For example, the species (Abyssivirga alkaniphila) in EZBiocloud is matched against the classification hierarchy relation table, and the correction result is shown in the curated field (please refer to the box marked in red). As for the second method, please refer to FIG. 3. For example, when the species field of the nucleic acid sequence database contains a serial number instead of a species name (the ID in the species field, e.g. L81121 in FIG. 3), the serial number is used as a search target for comparison with a database that stores serial numbers. After the species name corresponding to the serial number is found, the information of the other hierarchies may be found from the classification hierarchy relation table.
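By way of a non-limiting illustration, the two homogenization lookups described above may be sketched in Python as follows. This is a minimal sketch under stated assumptions: the table contents, the serial number “ACC0001”, and the names HIERARCHY_TABLE, SERIAL_TO_SPECIES and homogenize are hypothetical and do not come from any source database.

```python
from typing import Dict, Optional

RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")

# Hypothetical classification hierarchy relation table, keyed by species name.
HIERARCHY_TABLE = {
    "Escherichia coli": (
        "Bacteria", "Proteobacteria", "Gammaproteobacteria",
        "Enterobacterales", "Enterobacteriaceae", "Escherichia",
        "Escherichia coli",
    ),
}

# Hypothetical serial-number database (the serial number is made up).
SERIAL_TO_SPECIES = {"ACC0001": "Escherichia coli"}


def homogenize(species_field: str) -> Optional[Dict[str, str]]:
    """Return the seven-rank lineage for a species field, or None if unknown."""
    key = species_field.strip()
    # Second method: the species field holds a serial number, so resolve it
    # to a species name first. Otherwise the field is used as the name.
    name = SERIAL_TO_SPECIES.get(key, key)
    # First method: look up the other hierarchies by species name.
    lineage = HIERARCHY_TABLE.get(name)
    if lineage is None:
        return None
    return dict(zip(RANKS, lineage))
```

A record that cannot be resolved returns None in this sketch, so that it can be reviewed separately rather than silently kept.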

[0026] Next, please further refer to FIG. 1 and proceed to step S14. AI technology is used for normalized classification and correction. Comparisons are made mainly at the species hierarchy. Pairwise comparison is required to confirm that there is no duplication of sequence classification information in the data. Two comparison methods are described below. In the first method, for example, a punctuation is replaced with a fixed symbol, and the punctuation is, for example, space, “.”, “-” or “I”, and the fixed symbol is “_”. For example: “Sinorhizobium sp. R-25067” is replaced with “Sinorhizobium_sp_R_25067”. The comparison is made on the adjusted string, and the string is restored to the original text after the comparison. Since the punctuation may carry other meanings (for example, “sp.” refers to one or more species without specifying the exact species), the process may filter out bacterial strains with repeated sequence classification information; for example, “Sinorhizobium sp. R-25067” and “Sinorhizobium sp. R-25067.” FIG. 4 is a schematic diagram of performing normalized classification and correction using AI technology in the method for constructing a ribosomal RNA database according to an embodiment of the present disclosure. Please refer to FIG. 4. In the second method, for example, a dynamic time warping (DTW) algorithm is adopted. The DTW algorithm computes a dynamic text distance between two strings and thereby measures the similarity between the two strings. Based on a given similarity threshold value, bacterial strain names that are closer to each other are determined to be more likely to belong to the same bacterial strain. As such, the problems of similar characters, similar spellings, or redundant punctuation may be solved. In addition, the distance between characters may be calculated as the Manhattan distance, and the DTW similarity formula is D(i, j)=Dist(i, j)+min[D(i−1, j), D(i, j−1), D(i−1, j−1)]. For example, please refer to FIG. 4. In calculating the similarity between “sp” and “sp.”, after the text is converted into a matrix, the Manhattan distance is adopted to calculate the distance between characters, and the sum of the minimum distances from each character to the reference string is calculated as an index to measure the distance between the pair. After calculation, the similarity between “sp” and “sp.” is 2.
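As a non-limiting illustration, the two comparison methods may be sketched in Python as follows. The disclosure does not state how the text is converted into a matrix; this sketch assumes a one-hot encoding of each character, under which the Manhattan distance between two different characters is 2. The function names, the handling of only space, “.” and “-”, the stripping of a trailing “_”, and the default threshold are likewise assumptions for illustration.

```python
import re


def normalize_name(name: str) -> str:
    """First method: replace each run of space, '.' or '-' with one '_'.

    Stripping a leading or trailing '_' is an assumption, made so that a
    name with a trailing period compares equal to the same name without it.
    """
    return re.sub(r"[ .\-]+", "_", name).strip("_")


def one_hot(ch: str, alphabet: list) -> list:
    """Encode a character as a one-hot vector over the given alphabet."""
    return [1 if ch == c else 0 for c in alphabet]


def manhattan(u: list, v: list) -> int:
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(u, v))


def dtw_distance(a: str, b: str) -> float:
    """Second method: D(i, j) = Dist(i, j) + min[D(i-1, j), D(i, j-1), D(i-1, j-1)]."""
    alphabet = sorted(set(a) | set(b))
    n, m = len(a), len(b)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = manhattan(one_hot(a[i - 1], alphabet),
                             one_hot(b[j - 1], alphabet))
            D[i][j] = dist + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]


def likely_same_name(a: str, b: str, threshold: float = 2) -> bool:
    """Hypothetical decision rule: names within the threshold are flagged."""
    return dtw_distance(a, b) <= threshold


print(normalize_name("Sinorhizobium sp. R-25067"))  # Sinorhizobium_sp_R_25067
print(dtw_distance("sp", "sp."))                    # 2
```

Under the assumed one-hot encoding, the sketch yields the value 2 for “sp” and “sp.”, which agrees with the example above; other encodings would give other values.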

[0027] Then, please continue to refer to FIG. 1, and proceed to step S16. The kingdom to which the sequence species belongs is selected. In this embodiment, the constructed ribosomal RNA database is, for example, a 16S rRNA gene database. Since 16S rRNA only exists in the Archaea kingdom and the Bacteria kingdom, the sequences belonging to the Archaea kingdom and the Bacteria kingdom are selected first, and sequences of other kingdoms, as well as sequences whose kingdom name is mistakenly given as Bacteria or Archaea, are excluded. For example, “Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium” belongs to Eukaryota and is therefore excluded.
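As a non-limiting illustration, the kingdom selection may be sketched in Python as follows, assuming that each record carries a semicolon-separated lineage string whose first field is the kingdom, as in the example above. The function name keep_by_kingdom is hypothetical. Detecting a kingdom name that was mistakenly given as Bacteria or Archaea would additionally require a check against the curated classification hierarchy relation table, which is not shown here.

```python
ALLOWED_KINGDOMS = {"Archaea", "Bacteria"}


def keep_by_kingdom(lineage: str) -> bool:
    """Keep a record only if the first lineage field is Archaea or Bacteria."""
    kingdom = lineage.split(";")[0].strip()
    return kingdom in ALLOWED_KINGDOMS


example = ("Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;"
           "Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium")
print(keep_by_kingdom(example))  # False
```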

[0028] Thereafter, please continue to refer to FIG. 1, and proceed to step S18.

[0029] Redundant sequences and sequences with inconsistent lengths are filtered out. In terms of redundant sequences, a bacterial strain might contain one or more sets of 16S rRNA with the same sequence. Due to the high degree of conservation of 16S rRNA, different subtypes of the same species might have exactly the same sequence. When a sequence is 100% identical to another sequence of the same species, it is regarded as a redundant sequence and should be filtered out. In terms of sequences with inconsistent lengths, the full length of 16S rRNA is about 1600 bases. Studies show that it is necessary to use sequences covering all 9 variable regions in order to accurately identify bacterial strains at the species hierarchy. If a sequence is too short, the sequence range available for identification is insufficient, which might lead to misclassification of species. If a sequence is too long, it may contain two or more sets of 16S rRNA, and other genes might be mixed in between them, which also affects the accuracy of species classification. The exclusion condition for sequence length is defined, for example, as sequences shorter than 1200 bases or longer than 1800 bases.
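As a non-limiting illustration, the redundancy and length filters may be sketched in Python as follows. The record layout (a species name paired with a sequence string), the case-insensitive comparison, and the function name are assumptions for illustration; the length limits follow the example values given above.

```python
MIN_LEN = 1200  # sequences shorter than this are excluded
MAX_LEN = 1800  # sequences longer than this are excluded


def filter_length_and_redundancy(records):
    """Keep the first copy of each (species, sequence) pair within the length limits.

    records: iterable of (species_name, sequence) tuples.
    """
    seen = set()
    kept = []
    for species, seq in records:
        seq = seq.upper()  # assumption: identity is judged case-insensitively
        if len(seq) < MIN_LEN or len(seq) > MAX_LEN:
            continue
        key = (species, seq)
        if key in seen:  # 100% identical sequence of the same species
            continue
        seen.add(key)
        kept.append((species, seq))
    return kept
```

In this sketch, identical sequences that belong to different species are both retained, because the redundancy condition above is stated for sequences of the same species.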

[0030] Next, please continue to refer to FIG. 1, and proceed to step S20. Sequences with a high proportion of ambiguous or unknown bases are excluded. 16S rRNA is highly conserved within a species, and is therefore highly discriminative among species. Within the classification units of the species hierarchy, the degree of difference between sequences of the same category is generally 1% to 1.3%. If the difference rate between sequences is too high, the sequences will be classified into different species. Unknown bases (not A, T, C, G) contained in a sequence might be treated as sequence errors in the calculation process; if the resulting sequence error rate is too high, subsequent comparison errors easily occur, and the sequence may be misclassified in the species hierarchy. In order to exclude an excessively high difference rate while retaining the flexibility to tolerate errors introduced by sequencing, a threshold is set, and sequences whose unknown bases exceed the threshold are excluded. For example, sequences with 0.5% or more of unknown bases (not A, T, C, G) are excluded. For example, an N character in a sequence means that the base at that site is unknown.
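As a non-limiting illustration, the unknown-base threshold may be sketched in Python as follows, using the example value of 0.5% given above. The function names and the treatment of an empty sequence as entirely unknown are assumptions for illustration.

```python
KNOWN_BASES = set("ATCG")


def unknown_fraction(seq: str) -> float:
    """Fraction of characters in the sequence that are not A, T, C or G."""
    if not seq:
        return 1.0  # assumption: an empty sequence is treated as fully unknown
    unknown = sum(1 for base in seq.upper() if base not in KNOWN_BASES)
    return unknown / len(seq)


def passes_unknown_threshold(seq: str, threshold: float = 0.005) -> bool:
    """Keep a sequence only if its unknown fraction is below the threshold."""
    return unknown_fraction(seq) < threshold
```

For a sequence of 1500 bases, this threshold tolerates at most 7 unknown bases, since 8 of 1500 is already above 0.5%.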

[0031] Finally, please continue to refer to FIG. 1 and proceed to step S22. Sequences with insufficient classification information are excluded. FIG. 5 is a schematic diagram of excluding sequences with insufficient classification information in the method for constructing a ribosomal RNA database according to an embodiment of the present disclosure. Because a large number of species still cannot be isolated and cultured in the laboratory, their names are assigned as uncultured bacterium/uncultured archaeon. Such sequences cannot provide effective information for species identification. Therefore, sequences whose species name is uncultured bacterium/uncultured archaeon and which have no information in the first five classification hierarchies of the species are excluded, as shown by the box marked in red in FIG. 5.
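As a non-limiting illustration, the exclusion of sequences with insufficient classification information may be sketched in Python as follows. The text leaves some room for interpretation, so this sketch makes two loudly stated assumptions: “no information in the first five classification hierarchies” is read as at least one of kingdom, phylum, class, order and family being empty, and a record is excluded only when that condition and the placeholder species name both hold. All names in the sketch are hypothetical.

```python
PLACEHOLDER_SPECIES = {"uncultured bacterium", "uncultured archaeon"}
FIRST_FIVE_RANKS = ("kingdom", "phylum", "class", "order", "family")


def has_placeholder_species(lineage: dict) -> bool:
    """True when the species name is an 'uncultured' placeholder."""
    return lineage.get("species", "").strip().lower() in PLACEHOLDER_SPECIES


def lacks_upper_ranks(lineage: dict) -> bool:
    """True when at least one of the first five ranks is empty (assumed reading)."""
    return any(not lineage.get(rank, "").strip() for rank in FIRST_FIVE_RANKS)


def is_insufficient(lineage: dict) -> bool:
    """Assumed combination: both conditions must hold for exclusion."""
    return has_placeholder_species(lineage) and lacks_upper_ranks(lineage)
```

An implementation that excludes on either condition alone would replace the final “and” with “or”; the figure referenced above, not this sketch, governs which reading applies.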

[0032] To sum up, the present disclosure provides a method for constructing a ribosomal RNA database that includes multiple filtering processes and ensures the integrity and interpretability of the species classification hierarchy of the sequences. The method is expected to increase the accuracy of ribosomal RNA sequence data analysis, so as to improve the prediction accuracy of the microbiota. By using the construction method of a ribosomal RNA database of the present disclosure, a high-quality and high-accuracy ribosomal RNA database may be established. The ribosomal RNA database may be used for cross-comparison with data that adopt the standard classification nomenclature, and the method of the disclosure may be directly applied to the analytical process of the microbiota.

[0033] More specifically, because the ribosomal RNA database is normalized and homogenized, the construction method of the present disclosure may ensure that the sequence names, which are the most important information, are unlikely to be misspelled or mistaken, while the database remains comparable across databases. After the database is filtered under multiple conditions, the amount of data is considerably reduced, which helps to reduce the calculation time and makes the database easier to maintain. The constructed ribosomal RNA database is suitable for use as a standard database against which researchers compare the unknown sequences they obtain, so the sequence information in the database must be representative and informative. Therefore, excluding sequences with a large number of ambiguous or unknown bases may improve the interpretability of analysis results.

CPC classification: G16B40/20; G16B50/10; G16B10/00; G16B30/10; G16B50/30; G16B30/00

International classification: G16B40/20; G16B30/00; G16B50/30
