CONSTRUCTION METHOD OF RIBOSOMAL RNA DATABASE
20230282312 · 2023-09-07
Assignee
- ACER INCORPORATED (New Taipei City, TW)
- Acer Medical Inc. (NEW TAIPEI CITY, TW)
- Chang Gung Memorial Hospital, Keelung (Keelung City, TW)
- National Health Research Institutes (Miaoli County, TW)
Inventors
- Yun-Hsuan Chan (New Taipei City, TW)
- I-Wen Wu (Keelung City, TW)
- Chieh Hua Lin (Miaoli County, TW)
- Yin-Hsong Hsu (New Taipei City, TW)
- Chi-Hsiao Yeh (Keelung City, TW)
- Yu-Chieh Liao (Miaoli County, TW)
- Tsung-Hsien Tsai (New Taipei City, TW)
Cpc classification
G16B10/00
PHYSICS
International classification
Abstract
A method for constructing a ribosomal RNA database is provided, including the following steps: selecting a source nucleic acid sequence database; performing normalization and homogenization on species classification rules; using AI technology for normalized classification and correction; selecting the kingdom to which the sequence species belongs; filtering out redundant sequences and sequences with inconsistent lengths; setting a threshold for unknown bases other than A, T, C or G, and excluding sequences in which the unknown bases exceed the threshold; and excluding sequences with insufficient classification information.
Claims
1. A method for constructing a ribosomal RNA database, comprising: selecting a source of a nucleic acid sequence database; performing normalization and homogenization on species classification rules; using an AI technology for normalized classification and correction; selecting a kingdom to which a sequence species belongs; filtering out redundant sequences and sequences with inconsistent lengths; setting a threshold for unknown bases other than A, T, C or G, and excluding the unknown bases that exceed the threshold; and excluding sequences with insufficient classification information.
2. The method for constructing the ribosomal RNA database according to claim 1, wherein the nucleic acid sequence database comprises a native repository database or a value-added database.
3. The method for constructing the ribosomal RNA database according to claim 1, wherein the ribosomal RNA database comprises a 16S rRNA gene database.
4. The method for constructing the ribosomal RNA database according to claim 1, wherein a seventh-order nomenclature is used for normalization to form a hierarchy relation table, and hierarchies defined in the seventh-order nomenclature comprise kingdom, phylum, class, order, family, genus, and species.
5. The method for constructing the ribosomal RNA database according to claim 4, wherein the method for homogenization comprises finding out information of other hierarchy in the classification hierarchy relation table based on species names in the nucleic acid sequence database, or using a serial number as a search target for comparison with a database that stores the serial numbers based on a serial number of a species in the nucleic acid sequence database, after a species name of the serial number is found, the information of other hierarchy is found from the classification hierarchy relation table.
6. The method for constructing the ribosomal RNA database according to claim 5, wherein the step of using the AI technology to perform normalized classification and naming comprises performing a comparison according to a species hierarchy, so as to confirm that there is no repetition in sequence classification information.
7. The method for constructing the ribosomal RNA database according to claim 3, wherein the step of selecting the kingdom to which the sequence species belongs comprises selecting sequences belonging to a kingdom of Archaea and a kingdom of Bacteria directed at the 16S rRNA gene database, and excluding other kingdoms or sequences where a kingdom name is mistakenly named as Archaea or Bacteria.
8. The method for constructing the ribosomal RNA database according to claim 3, wherein in the 16S rRNA gene database, when a sequence contains the same species sequence with 100% identical conditions, the sequence is a redundant sequence.
9. The method for constructing the ribosomal RNA database according to claim 3, wherein in the 16S rRNA gene database, sequences with inconsistent lengths are those that are shorter than 1200 bases or longer than 1800 bases in length.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0017]
[0018]
[0019]
[0020]
DESCRIPTION OF EMBODIMENTS
[0021] As used herein, a range defined by “one value to another value” is a general description that avoids listing all the values in a range in the specification. Therefore, the recitation of a particular numerical range includes any numerical value within the numerical range and a smaller numerical range defined by any numerical value within the numerical range, and such recitation is equivalent to explicitly describing said any numerical value and said smaller numerical value in the specification.
[0022] The following examples will be described in detail in conjunction with the accompanying drawings, but the provided examples are not intended to limit the scope of the present disclosure.
[0023] The present disclosure provides a method for constructing a ribosomal RNA database.
[0024] Please refer to
[0025] Next, please continue to refer to
[0026] Next, please further refer to
[0027] Then, please continue to refer to
[0028] Thereafter, please continue to refer to
[0029] Redundant sequences and sequences with inconsistent lengths are filtered out. Regarding redundant sequences, a bacterial strain might contain one or more sets of 16S rRNAs with the same sequence. Because 16S rRNAs are highly conserved, different subtypes of the same species might also have exactly the same sequence. When a sequence is 100% identical to another sequence of the same species, it is regarded as a redundant sequence and is filtered out. Regarding sequences with inconsistent lengths, the full length of 16S rRNA is about 1600 bases. Studies show that sequences covering the 9 variable regions are needed to accurately identify bacterial strains at the species level. If a sequence is too short, the range available for identification is insufficient, which might lead to misclassification of species. If a sequence is too long, this indicates that it contains two or more sets of 16S rRNA, and other genes might be mixed between the 16S rRNAs, which also affects the accuracy of species classification. The exclusion conditions for sequence length are defined, for example, as sequences shorter than 1200 bases or longer than 1800 bases.
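The redundancy and length filters of paragraph [0029] might be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the `Record` layout, the function name `filter_redundant_and_length`, and the upper-casing of sequences before comparison are hypothetical. The 1200 and 1800 base limits are the example values given in the text, and "redundant" is interpreted here as an exact sequence match within the same species.

```python
from __future__ import annotations

from typing import Iterable, NamedTuple


class Record(NamedTuple):
    """Hypothetical record: one 16S rRNA gene sequence and its species name."""
    seq_id: str
    species: str
    sequence: str


def filter_redundant_and_length(
    records: Iterable[Record],
    min_len: int = 1200,  # example lower bound from paragraph [0029]
    max_len: int = 1800,  # example upper bound from paragraph [0029]
) -> list[Record]:
    """Keep the first copy of each (species, sequence) pair within the length range."""
    seen: set[tuple[str, str]] = set()
    kept: list[Record] = []
    for rec in records:
        seq = rec.sequence.upper()  # assumption: case differences are not meaningful
        # Length filter: drop sequences shorter than min_len or longer than max_len.
        if len(seq) < min_len or len(seq) > max_len:
            continue
        # Redundancy filter: 100% identical sequence already seen for this species.
        key = (rec.species, seq)
        if key in seen:
            continue
        seen.add(key)
        kept.append(rec)
    return kept
```

In this sketch, identical sequences assigned to different species are both kept, since the text describes redundancy only among sequences of the same species.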
[0030] Next, please continue to refer to
[0031] Finally, please continue to refer to
[0032] To sum up, the present disclosure provides a method for constructing a ribosomal RNA database that includes multiple filtering processes and ensures the integrity and interpretability of the species classification hierarchy of the sequences. The method is expected to increase the accuracy of ribosomal RNA sequence data analysis and thereby improve the prediction accuracy of the microbial phase. By using the construction method of the present disclosure, a high-quality and high-accuracy ribosomal RNA database may be established. The database may be used for cross-comparison with data that adopt the standard classification nomenclature, and the method of the disclosure may be directly applied to the analytical process of the microbial phase.
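The kingdom selection and the check on classification integrity might be sketched as follows. The seven ranks and the two allowed kingdoms come from the claims. Everything else is an assumption made for illustration: the dictionary layout, the function name `has_usable_taxonomy`, and the reading of "insufficient classification information" as any empty or missing rank. The sketch does not attempt to detect sequences whose kingdom name is mistakenly given as Archaea or Bacteria.

```python
from __future__ import annotations

# Seven-level hierarchy named in the claims.
RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")

# Kingdoms retained for a 16S rRNA gene database, per the claims.
ALLOWED_KINGDOMS = {"Archaea", "Bacteria"}


def has_usable_taxonomy(taxonomy: dict[str, str]) -> bool:
    """Return True if the kingdom is allowed and all seven ranks are filled.

    Assumption: a record is "insufficiently classified" when any rank
    is missing or blank; the source does not define the criterion.
    """
    if taxonomy.get("kingdom", "").strip() not in ALLOWED_KINGDOMS:
        return False
    return all(taxonomy.get(rank, "").strip() for rank in RANKS)
```

A record would then be kept only when `has_usable_taxonomy` returns `True` for its classification entry.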
[0033] More specifically, because the ribosomal RNA database is normalized and homogenized while remaining comparable across databases, the construction method of the present disclosure may ensure that the most important sequence names are unlikely to be misspelt or mistaken. After the database is filtered under multiple conditions, the amount of data is considerably reduced, which helps to reduce calculation time and makes the database easier to maintain. The constructed ribosomal RNA database is suitable for use as a standard reference database against which researchers compare unknown sequences, so the sequence information in the database must be representative and informative. Therefore, excluding sequences with a large number of ambiguous or unknown bases may improve the interpretability of analysis results.
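The exclusion of sequences with too many unknown bases might be sketched as follows. The source states only that a threshold is set for bases other than A, T, C or G. The function names and the choice of an absolute count (rather than a fraction of the sequence length) are assumptions, and no default threshold is supplied because the source gives no value.

```python
KNOWN_BASES = frozenset("ATCG")


def count_unknown_bases(sequence: str) -> int:
    """Count characters that are not A, T, C or G (case-insensitive)."""
    return sum(1 for base in sequence.upper() if base not in KNOWN_BASES)


def passes_unknown_base_threshold(sequence: str, max_unknown: int) -> bool:
    """Return True if the sequence has at most `max_unknown` unknown bases.

    Assumption: the threshold is an absolute count chosen by the caller.
    """
    return count_unknown_bases(sequence) <= max_unknown
```

Sequences for which `passes_unknown_base_threshold` returns `False` would be dropped from the database.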