DNA CODING METHOD AND BIOMEDICAL ENGINEERING APPLICATION OF SAME CODING METHOD
20220139500 · 2022-05-05
Cpc classification
G16B20/20
PHYSICS
G16B50/00
PHYSICS
G16H50/20
PHYSICS
G16B20/00
PHYSICS
International classification
G16B50/00
PHYSICS
Abstract
The present invention relates to a method for standardizing DNA as a code, in which (a) C, T, A, and G are designated as 00, 01, 10, and 11, respectively, and (b) base pairs, read in the 5′ to 3′ direction, are designated as 1100 for G and C, 0011 for C and G, 1001 for A and T, and 0110 for T and A. The DNA code standardization method of the present invention thereby provides an easy way to identify specific patterns, secondary structures, and nucleotide sequence variations within a nucleotide sequence, such as that of a DNA fragment or an aptamer, and facilitates the prediction of diseases by using disease-specific sequence mutations such as SNPs.
Claims
1. (canceled)
2. A method of providing information optimized to identify a specific pattern or secondary structure of a specific DNA fragment or aptamer using DNA code standardization, comprising the steps of: (a) designating C, T, A, and G of a specific DNA fragment nucleotide sequence as 00, 01, 10, 11, respectively; and (b) comparing the numerically designated code arrangement with the code sum arrangement.
3. The method according to claim 2, wherein the step of comparing the numerically designated code arrangement with the code sum arrangement is characterized in that it is determined that a stem structure is formed when two or more pairs of codes whose sum of each sequence becomes 3 after transforming the binary number sequence of 00, 01, 10, and 11 in step (a) into a decimal number are arranged at both ends, and a loop structure is formed when three or more sequences that cannot form complementary binding are linked to the center because the code sum of the sequences facing each other is greater than or smaller than 3.
4. A method of providing information on the presence or absence of nucleotide sequence variation in a specific DNA fragment using DNA code standardization comprising the steps of: (a) designating C, T, A, and G of a specific DNA fragment nucleotide sequence as 00, 01, 10, 11, respectively; and (b) comparing the sum of the numerically designated codes.
5. The method of claim 4, wherein the step of comparing the sum of the codes is characterized in that it is determined that mutation is present when there is a difference of 1 to 3 after transforming the binary number arrangement of 00, 01, 10, and 11 in step (a) into a decimal number, obtaining the sum and then comparing it with a normal sequence.
6. The method according to claim 4, wherein the position of the variant sequence is confirmed by comparing each value of the codes obtained by designating C, T, A, and G of the nucleotide sequence of a specific DNA fragment as 00, 01, 10, and 11, respectively.
7. An information providing computer program, stored in a computer-readable medium, optimized for identifying a specific pattern or secondary structure of a specific DNA fragment or aptamer for causing a computer to perform the following steps, the steps of: (a) designating C, T, A, and G of the nucleotide sequence of a specific DNA fragment as 00, 01, 10, 11, respectively; and (b) it is determined that a stem structure is formed when two or more pairs of codes whose sum of each sequence becomes 3 after transforming the binary number sequence of 00, 01, 10, and 11 in step (a) into a decimal number are arranged at both ends, and a loop structure is formed when three or more sequences that cannot form complementary binding are linked to the center because the code sum of the sequences facing each other is greater than or smaller than 3.
8-9. (canceled)
Description
DESCRIPTION OF DRAWINGS
[0029]
[0030]
[0031]
[0032]
[0033]
MODE FOR INVENTION
[0034] Hereinafter, the present invention will be described in more detail by the following examples. However, the following examples are described with the intention of illustrating the present invention, and the scope of the present invention is not to be construed as being limited by the following examples.
Example 1: Code Standardization According to the Molecular Weight of Each Base
[0035] Each of the four bases that determine the sequence of DNA was expressed as a two-digit binary number, the representation used by computers, and the molecular weight of each base was analyzed and indicated in
[0036] The molecular weights of the bases decrease in the order G, A, T, and C. Adding the molecular weights of each complementary pair gives 654.4 (= 347.2 + 307.2) for G and C, which pair by hydrogen bonding, and 653.4 (= 331.2 + 322.2) for A and T, confirming that the two pairs have nearly equivalent masses, a ratio of approximately 1:1. The sum for A and T is 1 less than the sum for G and C because the G-C pair has one more nitrogen (N), whereas the A-T pair has one more carbon (C) and one more hydrogen (H). The pair sums therefore differ by the difference between the atomic mass of N and the combined atomic masses of C and H (14 − (12 + 1) = 1). Accordingly, the A-T pair, which lacks the additional O or N capable of hydrogen bonding, forms two hydrogen bonds and is weaker than the G-C pair, which forms three. The code of each base was therefore designated to reflect the molecular structure and binding mass ratio of DNA: C, T, A, and G were assigned the binary values 00, 01, 10, and 11, in order from the smallest to the largest molecular weight. (
[0037] The designated code value is designed so that when the bases of G and C and A and T are paired, the code sum ratio is 1:1, which is the same as the actual mass ratio. (
[0038] The code sum is the sum of the code values after the code of each base is converted into a decimal number. The code sum of the G-C pair and of the A-T pair is the same, 3.
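The code assignment and pair sums described in paragraphs [0036] to [0038] can be sketched in Python as follows. This is a minimal illustration, not the implementation of the invention; the names `BASE_CODE`, `code_value`, and `pair_code_sum` are assumptions introduced here.

```python
# Two-bit codes assigned in order of increasing molecular weight (C < T < A < G).
BASE_CODE = {"C": "00", "T": "01", "A": "10", "G": "11"}


def code_value(base: str) -> int:
    """Return the decimal value of a base's two-bit code."""
    return int(BASE_CODE[base.upper()], 2)


def pair_code_sum(base_a: str, base_b: str) -> int:
    """Return the code sum of two bases after decimal conversion."""
    return code_value(base_a) + code_value(base_b)


print(pair_code_sum("G", "C"))  # 3 (complementary pair)
print(pair_code_sum("A", "T"))  # 3 (complementary pair)
print(pair_code_sum("G", "T"))  # 4 (not complementary)
```

A code sum of exactly 3 marks a complementary pair; any other sum marks a pair that cannot form complementary binding.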
Example 2: Optimization of Reflection of Molecular Weight Ratios of DNA Fragments and Aptamers
[0039] Because the codes were assigned to the bases of DNA in order of molecular weight, from lowest to highest, the total code sum of each DNA fragment was calculated so as to reflect the ratio of the molecular weights of the sequences. (
[0040] The exemplary sequence is a sequence exemplified for the purpose of confirming the molecular weight reflection ratio of the code, and the scope of the present invention is not to be construed as being limited to the sequence of SEQ ID NOs: 1 to 6. The sequences of SEQ ID NOs: 1 to 6 are as follows.
TABLE-US-00001
(SEQ ID NO: 1) 5′ AGAGCTCGCGCCGGAGTTCTCAATGCAAGAGC 3′
(SEQ ID NO: 2) 5′ GCGGCGGTGGCCTGAAGTCTGGCGGTGGGCCCC 3′
(SEQ ID NO: 3) 5′ GCGGCGGTGGCCAGAAGTCTCGCGGTGGCGGC 3′
(SEQ ID NO: 4) 5′ GTGGAGGCGGTGGCCAGTCTCGCGGTGGCGGC 3′
(SEQ ID NO: 5) 5′ GTGGCGGTGGCCAGCATAGTGGCGGTGGGCCAG 3′
(SEQ ID NO: 6) 5′ GTGGAGGCGGTGGCCGTGGAGGCGGAGGCCGC 3′
[0041] The six exemplary sequences are 32 mer nucleotide sequences of the same length that differ in the types and order of their nucleotides. The code conversion values of each nucleotide are indicated in
[0042] When compared with the molecular weight (M.W.) of each sequence, the smaller the molecular weight, the smaller the code sum was, and the higher the molecular weight, the higher the code sum was. (
[0043] In this way, the codes were designated so as to reflect the ratio of molecular weights, and the method was optimized to compare the molecular weight ratios of sequences by using the resulting converted code sums.
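The comparison of code sums with molecular weights can be sketched as follows. This is an illustrative sketch under stated assumptions: the per-base weights are the values quoted in paragraph [0036], the mass of a sequence is approximated as the plain sum of those values (ignoring any mass lost on bond formation), and the names `code_sum` and `approx_weight` as well as the short test sequences are hypothetical.

```python
# Decimal code values for each base (binary 00, 01, 10, 11).
CODE = {"C": 0, "T": 1, "A": 2, "G": 3}

# Per-base molecular weights as quoted in paragraph [0036].
WEIGHT = {"C": 307.2, "T": 322.2, "A": 331.2, "G": 347.2}


def code_sum(seq: str) -> int:
    """Total code sum of a sequence after decimal conversion."""
    return sum(CODE[base] for base in seq.upper())


def approx_weight(seq: str) -> float:
    """Approximate mass as the plain sum of per-base weights (an assumption)."""
    return sum(WEIGHT[base] for base in seq.upper())


# Hypothetical equal-length sequences, listed side by side for comparison.
for seq in ("CCTT", "CTAG", "AAGG"):
    print(seq, code_sum(seq), f"{approx_weight(seq):.1f}")
```

Printing both quantities for sequences of equal length lets the code sum be checked against the approximate mass for each sequence.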
Example 3: Optimization of Pattern Identification of DNA Fragments and Aptamers
[0044] By converting the sequences of DNA fragments and aptamers into binary nucleotide codes and comparing each sequence, the method was optimized to identify specific patterns and secondary structures included in the sequences. To illustrate this, a DNA sequence of 9 nucleotides was used as an example sequence. (
[0045] The above exemplary sequence is intended to illustrate the pattern of the code, and the scope is not to be construed as being limited to the exemplary sequence of SEQ ID NO: 7.
[0046] An exemplary sequence of SEQ ID NO: 7 is as follows.
TABLE-US-00002 (SEQ ID NO: 7) 5′ GCGGTGGCG 3′
[0047] The example sequence converted into nucleotide codes is as follows.
[0048] 11 00 11 11 01 11 11 00 11 (Example sequence code 1)
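The conversion of SEQ ID NO: 7 into its code arrangement can be reproduced with a short Python sketch. The function name `encode` is an assumption introduced for illustration.

```python
# Two-bit code for each base, as designated in Example 1.
BASE_CODE = {"C": "00", "T": "01", "A": "10", "G": "11"}


def encode(seq: str) -> str:
    """Convert a 5' to 3' base sequence into space-separated two-bit codes."""
    return " ".join(BASE_CODE[base] for base in seq.upper())


print(encode("GCGGTGGCG"))  # 11 00 11 11 01 11 11 00 11
```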
[0049] The code is designed so that each base has a code sum of ‘3’ with a complementary base capable of forming hydrogen bonding, and the arrangement of these sequences can form a stem structure in the DNA aptamer sequence. (
[0050] Most stem-loop structure patterns of DNA have two or more bases at both ends that can form a stem structure. A loop structure can be formed when three or more bases that cannot form complementary binding, because the code sum of the bases facing each other is greater than or less than 3, are linked in the center.
[0051] The exemplary sequence may form two stem-loop structures, which can be simply confirmed by a base code arrangement. The sequence capable of complementary binding with the first 11 base code is the base of the 8th 00 code except for the 00 code next to it (
[0052] In this way, by standardizing each base as a code, it is possible to predict whether or not complementary binding to each base is possible according to the base code sum, and it was confirmed that it was easy to predict the secondary structure and pattern of the DNA sequence according to the number of complementary bonds of each sequence and the number of bases connected thereto.
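The stem-loop rule of this example (two or more facing pairs with a code sum of 3 at both ends, enclosing three or more bases in the center) can be sketched as a simple search. This is a hedged illustration, not the claimed program: the function name `find_stem_loops`, its parameters, and the exhaustive search strategy are assumptions, and positions are reported 1-based from the 5′ end.

```python
# Decimal code values for each base (binary 00, 01, 10, 11).
CODE = {"C": 0, "T": 1, "A": 2, "G": 3}


def find_stem_loops(seq, min_stem=2, min_loop=3):
    """Return (5' position, 3' position, stem length, loop bases) tuples."""
    codes = [CODE[base] for base in seq.upper()]
    n = len(codes)
    hits = []
    for i in range(n):
        for j in range(i + 1, n):
            # Skip pairs that only continue a stem starting one step further out.
            if i > 0 and j < n - 1 and codes[i - 1] + codes[j + 1] == 3:
                continue
            k = 0  # number of facing pairs accepted so far
            # Accept the next facing pair while its code sum is 3 and at
            # least min_loop bases would remain between the paired bases.
            while ((j - k) - (i + k) - 1 >= min_loop
                   and codes[i + k] + codes[j - k] == 3):
                k += 1
            if k >= min_stem:
                loop = seq[i + k:j - k + 1]
                hits.append((i + 1, j + 1, k, loop))
    return hits


for hit in find_stem_loops("GCGGTGGCG"):
    print(hit)
# (1, 8, 2, 'GGTG')
# (2, 9, 2, 'GTGG')
```

For SEQ ID NO: 7 the sketch reports two candidate structures, the first pairing position 1 with position 8, in line with the two stem-loop structures described in paragraph [0051].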
Example 4: Optimization of SNP Identification Due to Code Standardization
[0053] By converting DNA sequences into codes and comparing the code sum of each sequence, the method was optimized to determine whether the nucleotide sequence of a specific DNA fragment is mutated. Because an SNP sequence is a DNA fragment sequence in which one base has been mutated, applying the code to the SNP sequence and comparing it with the normal sequence made it easy to determine the presence and location of the mutation. The efficiency of code standardization was confirmed by applying it to an SNP sequence of the CD44 gene, one of various SNP sequences, which is found in 84% of breast cancer patients. [Zhou, J., Nagarkatti, P. S., Zhong, Y., Creek, K., Zhang, J., & Nagarkatti, M. (2010). Unique SNP in CD44 intron 1 and its role in breast cancer development. Anticancer Research, 30(4), 1263-1272]
[0054] The SNP sequence of the breast cancer patient is a sequence in which the A base at the 14th position from the exon (Exon 2) is mutated to G among the sequences present at the position of the first intron (intron 1) of the gene. This sequence was converted into a code, arranged in a binary array, and the code sum was calculated, and the code sum of the normal sequence and the mutant sequence was compared. (
[0055] When the codes of the normal sequence and the mutant sequence were each converted into decimal numbers and summed, the code sum was 39 for the normal sequence and 40 for the mutant sequence, that is, 1 greater than for the normal sequence. In this way, it is possible to determine whether a mutation exists in a DNA fragment from the code sum alone; the code sum may differ by 1 to 3 depending on the type of mutated base. In addition, the position of the mutation can be confirmed by comparing the individual code values of the two sequences.
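The two comparisons described here, the difference in code sums and the position-by-position comparison of code values, can be sketched as follows. The function name `compare_to_normal` and the five-base sequences are hypothetical and chosen only for illustration; they are not the CD44 sequences of this example.

```python
# Decimal code values for each base (binary 00, 01, 10, 11).
CODE = {"C": 0, "T": 1, "A": 2, "G": 3}


def compare_to_normal(normal: str, variant: str):
    """Return (code sum difference, 1-based positions whose codes differ)."""
    if len(normal) != len(variant):
        raise ValueError("sequences must have the same length")
    normal_codes = [CODE[base] for base in normal.upper()]
    variant_codes = [CODE[base] for base in variant.upper()]
    difference = sum(variant_codes) - sum(normal_codes)
    positions = [
        index + 1
        for index, (n, v) in enumerate(zip(normal_codes, variant_codes))
        if n != v
    ]
    return difference, positions


# Hypothetical sequences with an A to G substitution at position 3.
print(compare_to_normal("TCAGT", "TCGGT"))  # (1, [3])
```

An A to G substitution changes the code from 10 to 11, so the code sum rises by 1, the same difference as between 39 and 40 above, while the position list identifies where the change occurred.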
[0056] As described above, by converting the DNA fragment sequences identified in the normal control group and the specific mutant sequence identified in the disease test group into codes and comparing the code sum, the difference between the sequences can be quickly identified and the existence of SNPs can be easily searched for, and by applying a code sum to the identified SNP sequence, it can be used for disease diagnosis.