5-formylcytosine specific chemical labeling method and related applications

10519184 · 2019-12-31

Abstract

The present invention relates to a 5-formylcytosine specific chemical labeling method and related applications in aspects such as sequencing, detection, imaging, and diagnosis. In the method, a condensation reaction occurs between an active methylene group in an active methylene compound containing a side-chain reactive group and an aldehyde group in 5-formylcytosine or a 1-substituted derivative of 5-formylcytosine, and at the same time an intramolecular reaction occurs between the side-chain reactive group of the active methylene compound and a 4-amino group of cytosine to implement ring closing. By means of the 5-formylcytosine specific chemical labeling method and related compounds of the present invention, detection of the content of 5-formylcytosine in nucleic acid molecules, and specific concentration of 5-formylcytosine-containing nucleic acid samples, and analysis of sequence distribution information of 5-formylcytosine and/or single-base resolution sequence information in nucleic acid molecules and the like may be implemented. The present invention provides various effective research methods in the research fields of epigenetics and nucleic acid biochemistry.

Claims

1. A method for specific chemical labeling of 5-formylcytosine or a 1-substituted derivative thereof, comprising the step of reacting an active methylene compound containing a side-chain active group R.sub.1CH.sub.2R.sub.2 with the 5-formylcytosine or a 1-substituted derivative thereof, wherein a dehydration condensation reaction occurs between the active methylene compound containing a side-chain active group and a 5-formyl group of cytosine in the 5-formylcytosine or a 1-substituted derivative thereof, and at the same time an intramolecular reaction occurs between the side-chain active group of the active methylene compound and a 4-amino group of cytosine in the 5-formylcytosine or a 1-substituted derivative thereof to implement ring closing, as shown in the equation below: ##STR00028## wherein, R represents hydrogen, hydrocarbyl, hydrocarbyl with OH, NH.sub.2, CHO and/or COOH, ribosyl or deoxyribosyl, 5- or 3-phosphate-modified ribosyl or deoxyribosyl, or structures excluding the 5-formylcytosine from ribonucleic acid or deoxyribonucleic acid binding to 1-position of the 5-formylcytosine via glycosidic bond; the hydrocarbyl is C1-C30 linear or branched alkyl, C1-C30 linear or branched alkenyl, or C1-C30 linear or branched alkynyl; R.sub.1 is an electron-withdrawing group selected from the group consisting of cyano, nitro, formyl, carbonyl compound ##STR00029## and ##STR00030## R.sub.2 is an electron-withdrawing group selected from the group consisting of cyano, formyl, carbonyl compound ##STR00031## and ##STR00032## R.sub.3 is an unsubstituted C1-C30 linear or branched alkyl, alkenyl or alkynyl, or a C1-C30 linear or branched alkyl, alkenyl or alkynyl substituted with OH, NH.sub.2, CHO, COOH, azido and/or biotin; and R.sub.1 and R.sub.2 are independent from each other or forming a ring directly by bonding with each other or forming a ring indirectly by bonding via an atom C, N or O.

2. The method according to claim 1, characterized in that the active methylene compound containing a side-chain active group is compound i as shown in formula i, and the compound i reacts with the 5-formylcytosine or a 1-substituted derivative thereof in one step to synthesize compound I as shown in formula I: ##STR00033## wherein R and R.sub.1 are respectively as described in claim 1; R.sub.4 represents C1-C30 linear or branched alkyl, alkenyl or alkynyl, or C1-C30 linear or branched alkyl substituted with OH, NH.sub.2, CHO and/or COOH.

3. The method according to claim 2, characterized in that the compound i is methyl acetoacetate, ethyl acetoacetate, diethyl malonate or ethyl 6-azido-3-oxyhexanoate.

4. The method according to claim 1, characterized in that, the active methylene compound containing a side-chain active group is compound ii as shown in formula ii, and said compound ii reacts with the 5-formylcytosine or a 1-substituted derivative thereof in one step to synthesize compound II as shown in formula II: ##STR00034## wherein R and R.sub.1 are respectively as described in claim 1.

5. The method according to claim 4, characterized in that the compound ii is malononitrile.

6. The method according to claim 1, wherein the active methylene compound containing a side-chain active group is compound iii as shown in formula iii, and said compound iii reacts with the 5-formylcytosine or a 1-substituted derivative thereof in one step to synthesize compound III as shown in formula III: ##STR00035## wherein R is as described in claim 1; and R.sub.5, R.sub.6, R.sub.7 and R.sub.8 are, independently from each other, hydrogen, OH, NH.sub.2, CHO, COOH, CN, NO.sub.2, azido, or C1-C30 linear or branched alkyl, alkenyl or alkynyl, or C1-C30 linear or branched alkyl, alkenyl or alkynyl substituted with OH, O, NH.sub.2, NH, CHO, COOH, azido and/or biotin; or the active methylene compound containing a side-chain active group is a compound as shown in formula iv, ##STR00036## in formula iv, X represents C1-C5 linear or branched hydrocarbyl, or C1-C5 linear or branched hydrocarbyl with ether bond O and/or imino group NH; n is a positive integer greater than or equal to 1; and Y is biotin, azido, or C2-C20 alkynyl.

7. A compound selected from the compound as shown in formula I, II, III or iv: ##STR00037## wherein, R and R.sub.1 are as defined in claim 1; R.sub.5, R.sub.6, R.sub.7 and R.sub.8 are, independently from each other, hydrogen, OH, NH.sub.2, CHO, COOH, CN, NO.sub.2, azido, or C1-C30 linear or branched alkyl, alkenyl or alkynyl, or C1-C30 linear or branched alkyl, alkenyl or alkynyl substituted with OH, O, NH.sub.2, NH, CHO, COOH, azido and/or biotin; X represents C1-C5 linear or branched hydrocarbyl, or C1-C5 linear or branched hydrocarbyl with ether bond O or imino group NH; n is a positive integer greater than or equal to 1; and Y is biotin, azido, or C2-C20 alkynyl.

8. The method according to claim 1, wherein the hydrocarbyl is C1-C10 linear or branched alkyl, C1-C10 linear or branched alkenyl, or C1-C10 linear or branched alkynyl.

9. The method according to claim 1, wherein R represents CH.sub.3, CH.sub.2CH.sub.3, CHO, CH.sub.2CHO or ##STR00038##

10. The method according to claim 2, characterized in that, R.sub.4 represents C1-C10 linear or branched alkyl or C1-C10 linear or branched alkyl substituted with OH, NH.sub.2, CHO and/or COOH.

11. The method according to claim 6, wherein R.sub.5, R.sub.6, R.sub.7 and R.sub.8 are, independently from each other, hydrogen, OH, NH.sub.2, CHO, COOH, CN, NO.sub.2, azido, or C1-C10 linear alkyl, or C1-C10 linear alkyl substituted with OH, O, NH.sub.2, NH, CHO, COOH, azido and/or biotin; X is CH.sub.2, OCH.sub.2CH.sub.2, CH.sub.2OCH.sub.2 or CH.sub.2CH.sub.2O; n is a positive integer between 1 and 9; and Y is biotin, azido, or ethynyl or cyclooctynyl.

12. The method according to claim 6, wherein X is CH.sub.2, n is a positive integer between 1 and 9, and Y is biotin, azido or ethynyl.

13. The method according to claim 6, wherein the compound iii is 1,3-indandione; or the compound as shown in formula iv is 5-(2-azidoethyl)-1,3-indandione.

14. The compound according to claim 7, wherein in formula III, R.sub.5, R.sub.6, R.sub.7 and R.sub.8 are, independently from each other, hydrogen, OH, NH.sub.2, CHO, COOH, CN, NO.sub.2, azido or C1-C10 linear alkyl, or C1-C10 linear alkyl substituted with OH, O, NH.sub.2, NH, CHO, COOH, azido and/or biotin; and in formula iv, X is CH.sub.2, OCH.sub.2CH.sub.2, CH.sub.2OCH.sub.2 or CH.sub.2CH.sub.2O, n is a positive integer between 1 and 9, and Y is biotin, azido, or ethynyl or cyclooctynyl.

15. The compound according to claim 7, characterized in that, in formula iv, X is CH.sub.2, n is a positive integer between 1 and 9, and Y is biotin, azido or ethynyl.

16. The compound according to claim 7, wherein it is a compound selected from the compounds of the following formulas: ##STR00039## ##STR00040##

17. A kit for detecting 5-formylcytosine base, comprising the active methylene compound containing a side-chain active group R.sub.1CH.sub.2R.sub.2 as defined in claim 1, and corresponding reaction solvent, wherein the corresponding reaction solvent is an alkaline organic solution or an acidic to neutral aqueous solution.

18. The kit according to claim 17, wherein the active methylene compound containing a side-chain active group R.sub.1CH.sub.2R.sub.2 is the compound i as defined in claim 2, the compound ii as defined in claim 4, or the compound iii or iv as defined in claim 6.

19. The kit according to claim 17, wherein the active methylene compound containing a side-chain active group R.sub.1CH.sub.2R.sub.2 is methyl acetoacetate, ethyl acetoacetate, diethyl malonate, ethyl 6-azido-3-oxyhexanoate, malononitrile, 1,3-indandione or 5-(2-azidoethyl)-1,3-indandione.

20. The kit according to claim 17, wherein it is a kit selected from: Kit 1, comprising the following 4 modules: Module 1: a 5-formylcytosine reaction module, comprising ethyl 6-azido-3-oxyhexanoate, and corresponding reaction solution, wherein the corresponding reaction solution is an alkaline organic solution; Module 2: a selective enrichment module, comprising magnetic beads specifically binding to biotin, a screening buffer, and a reagent which selectively reacts with azido and contains biotin modification; Module 3: a sodium bisulfite treatment module, comprising a sodium bisulfite treating reagent and related recovering materials; and Module 4: a specific PCR amplification module, comprising a specific DNA polymerase and a reaction system screened for the reaction product of 5-formylcytosine; Kit 2, comprising the following 3 modules: Module 1: a 5-formylcytosine reaction module, comprising 5-(2-azidoethyl)-1,3-indandione, and corresponding reaction solution, wherein the corresponding reaction solution is an alkaline organic solution or an acidic to neutral aqueous solution; Module 2: a selective enrichment module, comprising magnetic beads specifically binding to biotin, a screening buffer, and a reagent which selectively reacts with azido and contains biotin modification; and Module 3: a specific PCR amplification module, comprising a specific DNA polymerase and a reaction system screened for the reaction product of 5-formylcytosine; Kit 3, comprising the following 3 modules: Module 1: a module for immunoprecipitation enrichment of 5-formylcytosine, comprising a 5-formylcytosine antibody and corresponding reaction buffer for DNA immunoprecipitation test; Module 2: a 5-formylcytosine reaction module, comprising malononitrile or 1,3-indandione, and corresponding reaction solution, wherein, for malononitrile, the corresponding reaction solution is an acidic to neutral aqueous solution; and for 1,3-indandione, the corresponding reaction solution is an alkaline organic solution or an acidic to neutral aqueous solution; and Module 3: a specific PCR amplification module, comprising a specific DNA polymerase and reaction system screened for the reaction product of 5-formylcytosine; and Kit 4, comprising the following 2 modules: Module 1: a 5-formylcytosine reaction module, comprising ethyl 6-azido-3-oxyhexanoate or 5-(2-azidoethyl)-1,3-indandione, and corresponding reaction solution, wherein, for ethyl 6-azido-3-oxyhexanoate, the corresponding reaction solution is an alkaline organic solution; and for 5-(2-azidoethyl)-1,3-indandione, the corresponding reaction solution is an alkaline organic solution or an acidic to neutral aqueous solution; and Module 2: a selective enrichment module, comprising magnetic beads specifically binding to biotin, a screening buffer, and a reagent which selectively reacts with azido and contains biotin modification.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 shows the mass spectrometry results of a 9-base DNA, i.e., 5'-AGA TC.sup.5fG TAT-3', and those obtained after the reaction of this 9-base DNA with 5 representatives of compounds i, ii and iii in example 1.

(2) FIG. 2 is the mass spectrometry results of 5 kinds of 9-base DNA sequences before and after the reaction with malononitrile in example 1, wherein each 9-base DNA sequence has one kind of cytosine different from that in the other 9-base DNA sequences. The results show the selectivity of the reaction in the present invention.

(3) FIGS. 3A and 3B show that 5fC ring formation protecting sodium bisulfite sequencing technique is implemented with compound i of diethyl malonate in example 2, wherein 5fC* represents the product obtained after the reaction of 5fC.

(4) FIG. 4 is a flow chart of specifically enriching a nucleic acid containing a 5fC base with compound AI.

(5) FIG. 5 is a flow chart of specific enrichment with compound AI (left) and MALDI-TOF detection spectra (right) in example 3.

(6) FIG. 6 shows the efficiency of enriching a DNA containing a 5fC base with compound AI in example 3.

(7) FIG. 7 is a flow chart of specifically enriching a nucleic acid containing a 5fC base with ethyl 6-azido-3-oxyhexanoate.

(8) FIGS. 8A and 8B are respectively the sequencing results obtained before and after the reaction with compound AI in example 4, which show that ring formation promoting 5fC to T conversion sequencing technique is implemented with compound AI, wherein 5fC* represents the product obtained after the reaction of 5fC.

(9) FIGS. 9A and 9B are respectively the sequencing results obtained before and after the reaction with malononitrile in example 5, which show that ring formation promoting 5fC to T conversion sequencing technique is implemented with compound ii of malononitrile, wherein 5fC* represents the product obtained after the reaction of 5fC.

(10) FIG. 10 shows the new ultraviolet absorption peaks produced by the reaction of the 9-base DNA, i.e. 5'-AGA TC.sup.5fG TAT-3', with each of 4 compounds, malononitrile (A), 1,3-indandione (B), ethyl acetoacetate (C) and diethyl malonate (D), as measured with a Thermo Nanodrop Micro-Ultraviolet Spectrophotometer in example 6.

(11) FIG. 11 shows the fluorescent activation effect produced by the reaction of 5fC bases with malononitrile in example 6.

(12) FIG. 12A shows the net increase of fluorescence intensity at different concentrations of the reaction product of Oligo NO.1 and malononitrile in example 6. In the diagram, the concentrations represented by curves from bottom to top are respectively 10 nM, 50 nM, 100 nM, 200 nM, 500 nM and 1000 nM in order. FIG. 12B is a linear relation diagram of the net increase of fluorescence intensity of the reaction product versus the concentration thereof.

(13) FIG. 13 shows that the double-stranded DNA sequence cannot be digested by TaqI after the reaction with compound AI in example 7.

(14) FIG. 14 shows the enrichment of 5fC distribution regions of genomic DNA in mouse embryonic stem cells with compound AI in example 7.

(15) FIG. 15 shows the representative regions of 5fC single-base resolution position in the genomic DNA of mice embryonic stem cells exhibited by the ring formation promoting 5fC to T conversion sequencing technique based on compound AI in example 7.

(16) FIG. 16 shows the comparison between the sequencing read results from 5fC ring formation protecting sodium bisulfite sequencing, ring formation promoting 5fC to T conversion sequencing technique, conventional sequencing, and sodium bisulfite sequencing.

DETAILED DESCRIPTION OF THE INVENTION

(17) The novel compounds, the synthesis methods and reaction conditions, the related applications of the compounds and methods of the present invention will be described in detail below, in order to clearly describe the contents of the present invention.

(18) The invention relates to the application of the 5-formylcytosine-related conjugate polycyclic compound.

(19) Any conjugate polycyclic compound having one of the three aforementioned structures of compounds I, II, or III falls within the present invention. The synthesis of compounds I, II, and III is not limited to the synthesis methods of the present invention; compounds with these structures belong to the three kinds of compounds of the present invention regardless of how they are synthesized.

(20) The three kinds of compounds provided by the invention can be used in studies on nucleic acids. They fluoresce under certain excitation light conditions, and thus can be used as a new category of fluorescent bases in research areas such as the kinetics of nucleic acid base conformation, the interaction between nucleic acids and other molecules (such as proteins), nucleic acid-nucleic acid interactions, the chemical environment in which a nucleic acid is present, and the like. Meanwhile, the three kinds of fluorescent bases provided by the invention can be introduced from 5fC bases. In use, the phosphoramidite monomer of the 5fC base can first be used in place of the target fluorescent base to synthesize DNA, and the fluorescent bases are then introduced by the reaction of the present invention when necessary. Therefore, these three kinds of fluorescent bases are more feasible to apply than other commercially available non-natural fluorescent bases.

(21) Related Applications of 5-formylcytosine Specific Chemical Labeling Method of the Present Invention

(22) 1. Specific Chemical Labeling of 5fC

(23) (1) Direct Labeling of 5fC

(24) 5-formylcytosine is directly labeled with this method. Particularly, 5-formylcytosine can be reacted with the compounds of the present invention under the reaction condition of the present invention, such that it is converted into a new cytosine conjugate polycyclic derivative compound, thereby incorporating new chemical properties, for example, new ultraviolet absorption spectrum and fluorescence emission spectrum. The new chemical properties of the obtained product can be used to indicate 5-formylcytosine, achieving labeling of 5-formylcytosine, and a new labeling method is provided for studying dynamic change of intracellular epigenetics. Quantitative analysis of 5-formylcytosine base in unknown nucleic acid samples can also be performed by making use of the special absorption spectrum or fluorescence emission spectrum of the reaction products I, II or III.

(25) In an embodiment of the invention, by means of the reaction of malononitrile with oligomeric deoxyribonucleotide chain containing 5fC base, a working curve of concentration versus fluorescence intensity is plotted, which shows good fitting degree. The concentration of 5fC base can be determined quantitatively by measuring the fluorescence intensity of 5fC reaction product in an unknown sample.
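The working-curve quantification described above can be sketched as a linear least-squares fit followed by inversion of the fitted line. This is only an illustrative sketch: the standard concentrations follow the values listed for FIG. 12A, but the intensity values, the function names `fit_line` and `concentration_from_signal`, and the assumption of a strictly linear response are hypothetical and are not taken from the examples.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def concentration_from_signal(signal, slope, intercept):
    """Invert the working curve: net fluorescence increase -> concentration."""
    return (signal - intercept) / slope


# Standard concentrations in nM (as listed for FIG. 12A).
standards_nm = [10, 50, 100, 200, 500, 1000]
# Placeholder intensities generated from an arbitrary line; NOT measured data.
net_intensity = [2.0 * c + 5.0 for c in standards_nm]

slope, intercept = fit_line(standards_nm, net_intensity)

# Estimate the concentration of an unknown sample from its net intensity.
unknown_nm = concentration_from_signal(405.0, slope, intercept)
```

In practice the fit would be made on measured intensities of the reaction product, and an unknown sample should fall within the concentration range covered by the standards.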

(26) (2) Indirect Labeling of 5fC

(27) Specific functional groups can be introduced into 5-formylcytosine by the reaction of the active methylene compound having a special functional group (i.e. active methylene compound containing a side-chain active group) with 5-formylcytosine, achieving indirect labeling of 5fC. In the case of fluorescent molecule, indirect labeling of 5fC is implemented by using the fluorescence emission spectrum of this fluorescent molecule under certain exciting light. Besides, azido or alkynyl can also be introduced, and then indirect labeling of 5fC is performed by further using the principle of click chemistry.

(28) The click chemistry here mainly refers to the [3+2] cycloaddition reaction of azido with alkynyl or alkynyl derivatives.

(29) 2. Changing Related Enzymology Effect of 5-Formylcytosine

(30) The chemical properties of 5fC bases can also be changed by the reaction in the method of the present invention, which means that specific labeling of 5-formylcytosine in DNA or RNA results in change in the chemical properties of 5-formylcytosine in biological samples, thus influencing the abilities of nucleic acid-binding proteins (such as nucleic acid polymerase, and restriction endonuclease) to identify and bind the 5fC-containing nucleic acids, and then the activities of related proteins to identify nucleic acid substrates can be influenced. Such a change can be used in special biological studies.

(31) In an embodiment, the 5fC base on the substrate sequence for TaqI restriction endonuclease is labeled with compound iv, and therefore the enzyme digestion reaction activity of TaqI is influenced, such that TaqI fails to digest the chemical-reaction-modified T/C.sup.5fGA sequence.

(32) The chemical modification of 5fC mentioned above changes the effect of enzyme treatment; the enzymes concerned include various restriction endonucleases and DNA polymerases. Commercial suppliers of such enzyme reagents include, but are not limited to, NEB, Thermo Scientific, TAKARA, Promega, Agilent and the like.

(33) 3. Specific Enrichment of 5fC

(34) Specific functional groups are introduced into 5-formylcytosine by means of the reaction of the active methylene compound having a special functional group (selected from the above side-chain active group-containing active methylene compounds i, ii, iii, and iv) with 5-formylcytosine, and the chemical properties of such special functional groups are used to implement the specific enrichment of a nucleic acid molecule containing 5-formylcytosine. For example, azido is introduced into the active methylene compound, and click chemistry reaction with this azido is performed by using biotin-labeled alkynyl or alkynyl derivatives, such that the biotin label is indirectly introduced into 5-formylcytosine; and then by means of the specific binding between streptavidin and biotin, screening of nucleic acid molecule with 5-formylcytosine is implemented. On the contrary, alkynyl can also be introduced into the active methylene compound, click chemistry reaction with this alkynyl is performed by utilizing azido molecule with biotin label, and then the enrichment can be implemented in the same way above.

(35) In a specific embodiment, 5fC is specifically labeled with compound AI, an azido derivative of 1,3-indandione; the obtained molecule is further labeled with biotin by means of a click reaction, and then the specific binding between biotin and streptavidin is used to enrich the nucleic acid molecule containing 5fC. The same effect can also be achieved by using an azido derivative of ethyl acetoacetate, i.e. ethyl 6-azido-3-oxyhexanoate.

(36) 4. Detection of Distribution Information of 5fC in Genome

(37) By means of the above method for specifically enriching 5fC, the distribution of 5fC in a genome can be detected. Through the specific labeling of the 5fC base, genomic DNA fragments containing 5fC are enriched and purified. Then, by sequencing and alignment with the corresponding genome, the distribution of 5-formylcytosine across genomic features such as regulatory regions, transcription initiation regions, gene exon and intron regions, characteristic histone modification regions and the like can be analyzed.
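The last step of this analysis, assigning aligned 5fC positions to annotated genomic features, can be sketched as a simple interval lookup. The region coordinates, the site positions, the half-open interval convention and the fallback label "intergenic" are all hypothetical values chosen for illustration; a real analysis would use a genome annotation and an alignment tool.

```python
from collections import Counter


def annotate(position, regions):
    """Return the name of the first region containing position (half-open [start, end))."""
    for name, start, end in regions:
        if start <= position < end:
            return name
    return "intergenic"  # assumed label for positions outside every annotated region


# Hypothetical annotation of one chromosome segment: (feature, start, end).
regions = [
    ("promoter", 0, 1000),
    ("exon", 1000, 1500),
    ("intron", 1500, 4000),
]

# Hypothetical positions of enriched 5fC-containing fragments after alignment.
sites = [120, 980, 1200, 2500, 5000]

# Count how many enriched sites fall in each kind of feature.
distribution = Counter(annotate(p, regions) for p in sites)
```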

(38) The genomic DNA samples above can be derived from cell culture, animal tissue, animal blood, formalin-fixed tissue, paraffin-embedded tissue, and trace sample such as early development sample of embryo, single cell and the like.

(39) 5. Single-Base Resolution Sequencing of 5fC

(40) The method of the present invention, i.e., the specific reaction between an active methylene group containing a side-chain active group and 5fC, can be used for single-base resolution detection of the 5fC position in the sequence of a nucleic acid sample.

(41) The nucleic acid sample above refers to a genomic DNA sample or RNA sample, which can be derived from cell culture, animal tissue, animal blood, formalin-fixed tissue, paraffin-embedded tissue, and trace sample such as early development sample of embryo, single cell and the like.

(42) Any technique using such reactions to carry out 5fC base sequencing can be applied to the present invention.

(43) (1) 5fC Ring-Protecting Sodium Bisulfite Sequencing Technique

(44) The 5fC ring-protecting sodium bisulfite sequencing technique is implemented through the reaction between compound i and 5-formylcytosine. The core of this technique lies in performing sodium bisulfite sequencing on the sample both before and after the reaction with compound i. Before the reaction, a 5fC site in the sample is read as T in sequencing; after the reaction, the 5fC base is read as C, because it is protected by a conjugate structure that makes it resistant to sodium bisulfite treatment. By comparing these two sequencing results, T-C mismatching sites are found, and single-base resolution sequence information of 5fC can be identified.

(45) The sodium bisulfite sequencing above refers to that nucleic acid is treated under a weak acid condition with high concentration sodium bisulfite, and cytosine (and the oxides thereof, i.e. 5-formylcytosine and 5-carboxylcytosine) is hydrolyzed to remove 4-amino, and finally converted to uracil. However, the two derivatives of cytosine, i.e. 5-methylcytosine 5mC and 5-hydroxymethylcytosine 5hmC will not be converted to uracil. In Polymerase Chain Reaction (PCR) amplification, uracil U is read as thymine T, and both the remaining 5mC and 5hmC are amplified into C. Further sequencing can determine whether a site read as C is 5mC or 5hmC.

(46) In a specific embodiment, compound i is selected as diethyl malonate. 5fC base in the sequence is read as T in sodium bisulfite sequencing before the reaction, while the product of 5fC is read as C in sodium bisulfite sequencing after the reaction.
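The read-out logic of this technique can be illustrated with a small model. The mapping below follows the behaviour stated in the text (C, 5fC and 5-carboxylcytosine are read as T after sodium bisulfite treatment and PCR; 5mC and 5hmC are read as C; the ring-closed product 5fC* is read as C). The strand, the function names, and the assumption that labeling and conversion are complete are hypothetical simplifications.

```python
# Base read after sodium bisulfite treatment and PCR, as described in the text.
# "5fC*" denotes the ring-closed reaction product of 5fC, which is protected.
BISULFITE_READOUT = {
    "A": "A", "G": "G", "T": "T",
    "C": "T", "5fC": "T", "5caC": "T",
    "5mC": "C", "5hmC": "C", "5fC*": "C",
}


def label_5fc(bases):
    """Model the labeling reaction as converting every 5fC into 5fC* (assumes 100% yield)."""
    return ["5fC*" if b == "5fC" else b for b in bases]


def bisulfite_read(bases):
    """Return the sequence that would be read after bisulfite treatment and PCR."""
    return "".join(BISULFITE_READOUT[b] for b in bases)


def t_to_c_sites(read_before, read_after):
    """0-based positions read as T without labeling and as C with labeling."""
    return [
        i for i, (x, y) in enumerate(zip(read_before, read_after))
        if x == "T" and y == "C"
    ]


# Hypothetical strand containing several cytosine variants.
strand = ["A", "C", "5mC", "G", "5fC", "T", "5hmC"]

read_before = bisulfite_read(strand)            # no labeling reaction
read_after = bisulfite_read(label_5fc(strand))  # labeling, then bisulfite
sites = t_to_c_sites(read_before, read_after)
```

Only the 5fC position differs between the two reads in this model; unmodified C stays T in both, and 5mC and 5hmC stay C in both.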

(47) (2) Ring Formation Promoting 5fC to T Conversion Sequencing Technique

(48) Ring formation promoting 5fC to T conversion sequencing technique can be implemented through the reaction between compound ii and 5-formylcytosine. The core of this technique lies in performing PCR amplification and sequencing for the samples before and after the reaction with compound ii respectively. 5fC site in the sample is not influenced before the reaction, and is read as cytosine C in sequencing; while 5fC site in the sample is read as thymine T in sequencing after the reaction, and therefore the sequencing result thereof is also shown as T. By comparing these two sequencing results, mutation sites of C-T are found, and single-base resolution sequence information of 5fC can be identified.

(49) Ring formation promoting 5fC to T conversion sequencing technique can also be implemented through the reaction between compound iii and 5-formylcytosine. The process is similar to that with compound ii, and comprises performing PCR amplification on the samples before and after the reaction with compound iii respectively. 5fC site in the sample is read as C before the reaction, while 5fC site in the sample is read as T after the reaction. By comparing the two sequencing results, particular sequence information of 5fC can be identified.

(50) For the two sequencing methods in (1) and (2) above, the commercial sequencing platform can be selected from any of the following:
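The comparison step of the 5fC to T conversion technique can be sketched for the case where several reads of the treated sample cover the same region. The sketch computes, for each C in the untreated read, the fraction of treated reads showing T, and reports positions above a threshold. The sequences, the threshold `min_fraction`, and the assumption that all reads are already aligned and of equal length are hypothetical.

```python
def c_to_t_fraction(untreated_read, treated_reads):
    """For each C in the untreated read, return the fraction of treated reads showing T."""
    fractions = {}
    for pos, base in enumerate(untreated_read):
        if base != "C":
            continue
        calls = [read[pos] for read in treated_reads]
        fractions[pos] = calls.count("T") / len(calls)
    return fractions


def call_5fc(untreated_read, treated_reads, min_fraction=0.5):
    """Report 0-based positions where C is converted to T in enough treated reads."""
    fractions = c_to_t_fraction(untreated_read, treated_reads)
    return sorted(p for p, f in fractions.items() if f >= min_fraction)


# Hypothetical example: position 4 is 5fC, position 8 is an unmodified C.
untreated = "AGATCGTACT"
treated = [
    "AGATTGTACT",
    "AGATTGTACT",
    "AGATCGTACT",  # one read in which the site was not converted
    "AGATTGTACT",
]

fractions = c_to_t_fraction(untreated, treated)
sites = call_5fc(untreated, treated)
```

Using a fraction rather than a single before/after pair tolerates incomplete conversion; the threshold would have to be chosen from control data.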

(51) 1) the first generation dideoxy base sequencing method, in which the commercial sequencing platforms that can be used include a series of instruments for the first generation sequencing platform from ABI;

(52) 2) the second generation high-throughput sequencing technique, in which the commercial sequencing platforms that can be used include: a series of sequencing platforms from Illumina (formerly Solexa), including but not limited to, Miseq, Hiseq 2000, Hiseq 2500, NextSeq 500, Hiseq X, etc.; sequencing platforms using pyrosequencing method from Roche (formerly 454), for example, including but not limited to GS FLX; and SOLiD sequencing platforms from ABI, for example, including but not limited to SOLiD 5500;

(53) 3) the third generation single molecule sequencing technique, in which the commercial sequencing platforms that can be used include: SMRT sequencing platforms from Pacific Biosciences, for example, including but not limited to SMRT RSII; nanopore single molecule sequencing platforms from Oxford Nanopore Technologies, such as the MinION platform; and the HeliScope platform from Helicos Biosciences.

(54) (3) The Third Generation Single Molecule Sequencing Based on Chemical Modification of 5fC

(55) The target bases are directly detected by combining the modification of the chemical structure of the 5fC base with compound i, ii, or iii with the third generation single molecule sequencing technique. Because the modification changes the chemical properties by which a protein identifies the 5fC base, the kinetic parameters of the binding of the protein to the base during the third generation single molecule sequencing are influenced, such that the base is distinguished from other naturally existing bases, thus directly identifying the position of the target 5fC bases.

(56) The third generation single molecule sequencing platform here can be selected from SMRT sequencing platforms from Pacific Biosciences, or nanopore single molecule sequencing platforms from Oxford Nanopore Technologies. When using a SMRT sequencing platform, the amplification efficiency of the polymerase is influenced after the modification of the chemical structure of the 5fC base with compound i, ii or iii, such that the kinetic parameters of amplification are influenced, and the positions of 5fC are identified. When using a nanopore single molecule sequencing platform, the kinetic parameters of the binding of the nanopore protein to the base are influenced after the modification of the chemical structure of the 5fC base with compound i, ii or iii. By measuring these kinetic parameters, it can be determined whether the base is a modified 5fC base.
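The idea of identifying modified bases from a changed kinetic parameter can be sketched as a per-position comparison between an unmodified control and the treated sample. The numbers, the function name, the use of a simple treated/control ratio and the threshold `min_ratio` are all assumptions made for illustration; the text does not specify which kinetic parameter is measured or how it is scored.

```python
def flag_modified_positions(control, treated, min_ratio=2.0):
    """Return 0-based positions whose treated/control kinetic ratio is at least min_ratio."""
    if len(control) != len(treated):
        raise ValueError("control and treated must cover the same positions")
    flagged = []
    for pos, (c, t) in enumerate(zip(control, treated)):
        # Skip positions without a usable control value to avoid dividing by zero.
        if c > 0 and t / c >= min_ratio:
            flagged.append(pos)
    return flagged


# Hypothetical per-position kinetic values (arbitrary units), NOT measured data.
control = [1.0, 1.1, 0.9, 1.0, 1.2]
treated = [1.1, 1.0, 1.0, 3.5, 1.1]

flagged = flag_modified_positions(control, treated)
```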

(57) 6. Kits for 5-Formylcytosine Sequencing

(58) (1) Kit 1 for 5fC Ring-Protecting Sodium Bisulfite Sequencing Technique

(59) By means of the reaction method for labeling 5-formylcytosine with azido-containing compound i, kit 1 for single-base resolution analysis of sequence information of 5-formylcytosine in a nucleic acid sample is designed. Based on the specific reaction between ethyl 6-azido-3-oxyhexanoate and 5fC, biotin is introduced to 5fC through click chemistry reaction, so as to perform selective enrichment of 5fC. In combination with sodium bisulfite sequencing technique, the sequencing results before and after the treatment with compound ethyl 6-azido-3-oxyhexanoate are compared to identify the positions of 5fC base, achieving the 5fC ring-protecting sodium bisulfite sequencing technique. Kit 1 mainly comprises the following 4 modules:

(60) Module 1: a 5fC reaction module, comprising a reagent of ethyl 6-azido-3-oxyhexanoate, and corresponding reaction solution. This module is used to react with 5fC base in nucleic acid sample, to label 5fC base with azido.

(61) Module 2: a selective enrichment module, comprising magnetic beads specifically binding to biotin, a screening buffer, and a reagent which selectively reacts with azido and contains biotin modification.

(62) This module is used to perform a click chemistry [3+2] cycloaddition reaction with the azido labeled in nucleic acid sample, such that 5fC base is further labeled with biotin. Further, by means of the binding of biotin to the streptavidin coupled to the magnetic beads, the nucleic acid sample fragments containing 5fC base are separated and purified with a magnetic frame.

(63) Module 3: a sodium bisulfite treatment module, comprising a sodium bisulfite treating reagent and related recovering materials.

(64) This module is used to react with the enriched nucleic acid sample fragment, such that normal cytosines and remaining 5-carboxylcytosines are deaminated and hydrolyzed into uracil U.

(65) Module 4: a specific PCR amplification module, comprising a specific DNA polymerase and a reaction system screened for the reaction product of 5fC.

(66) This module is used to amplify the labeled and sodium bisulfite treated nucleic acid sample, so as to perform high-throughput sequencing.

(67) (2) Kit 2 for Ring Formation Promoting 5fC to T Conversion Sequencing Technique

(68) By means of the reaction method for labeling 5-formylcytosine with azido-containing compound iv, kit 2 for single-base resolution analysis of sequence information of 5-formylcytosine in a nucleic acid sample is designed. In an example, based on the specific reaction between compound AI and 5fC, biotin is introduced to 5fC through click chemistry reaction, so as to perform selective enrichment of 5fC. Samples before and after the treatment of compound iv are PCR amplified and sequenced. By comparing the sequencing results, the sequence position of 5fC base can be identified, achieving the ring formation promoting 5fC to T conversion sequencing technique. Kit 2 mainly comprises the following 3 modules:

(69) Module 1: a 5fC reaction module, comprising a reagent compound AI, (5-(2-azidoethyl)-1,3-indandione), and corresponding reaction solution.

(70) This module is used to react with 5fC base in nucleic acid sample to label 5fC base with azido.

Module 2: a selective enrichment module, comprising magnetic beads specifically binding to biotin, a screening buffer, and a reagent which selectively reacts with azido and contains biotin modification.

(71) This module is used to perform a click chemistry [3+2] cycloaddition reaction with the azido labeled in nucleic acid sample, such that 5fC base is further labeled with biotin. Further, by means of the binding of biotin to the streptavidin coupled to the magnetic beads, the nucleic acid sample fragments containing 5fC base are separated and purified with a magnetic frame.

(72) Module 3: a specific PCR amplification module, comprising a specific DNA polymerase and a reaction system screened for the reaction product of 5fC.

(73) This module is used to amplify the enriched nucleic acid sample, so as to perform high-throughput sequencing. At the same time, the original 5fC site is allowed to be read as T in PCR amplification, and thus a mutation point is introduced, achieving the ring formation promoting 5fC to T conversion sequencing technique.

(74) (3) Kit 3 for Ring Formation Promoting 5fC to T Conversion Sequencing Technique

(75) By means of the reaction method for labeling 5-formylcytosine with compound ii or iii, kit 3 for single-base resolution analysis of sequence information of 5-formylcytosine in a nucleic acid sample is designed. The selective enrichment is performed with a published specific antibody for 5-formylcytosine (Li Shen, et al., Cell, 2013, 153:692-706). Then malononitrile is reacted with 5fC, and the conversion from 5fC to T is produced during PCR. By comparing the sequencing results of the amplified products, the position of 5fC base in the sequence can be identified, thus achieving the ring formation promoting 5fC to T conversion sequencing technique. Kit 3 mainly comprises the following 3 modules:

(76) Module 1: a module for immunoprecipitation enrichment of 5-formylcytosine, comprising a 5fC antibody and corresponding reaction buffer for DNA immunoprecipitation test.

(77) This module is used to directly enrich the nucleic acid sample fragment containing 5fC base.

(78) Module 2: a 5fC reaction module, comprising a reagent of malononitrile (compound ii) or 1,3-indandione (compound iii), and corresponding reaction solution.

(79) This module is used to react with 5fC base in the nucleic acid sample.

(80) Module 3: a specific PCR amplification module, comprising a specific DNA polymerase and reaction system screened for the reaction product of 5fC.

(81) This module is used to amplify and enrich the malononitrile treated nucleic acid sample, so as to perform high-throughput sequencing. At the same time, the original 5fC site is allowed to be read as T in PCR amplification, and thus a mutation point is introduced, achieving the ring formation promoting 5fC to T conversion sequencing technique.

(82) (4) Kit 4 for Single Molecule Sequencing Based on Labeling of 5fC

(83) By means of the reaction method for labeling 5-formylcytosine with azido labeled compound i or iii (including compound iv), in combination with the third generation single molecule sequencing platform, kit 4 for single-base resolution analysis of sequence information of 5-formylcytosine in a nucleic acid sample is designed. Based on the selective enrichment of DNA fragment containing 5fC base with ethyl 6-azido-3-oxyhexanoate or compound AI, the third generation single molecule real-time detection platform is further used to find the position having a special kinetic parameter, and identify the 5fC modified position, achieving single molecule real-time detection of sequence information of 5fC bases. Kit 4 mainly comprises the following 2 modules:

(84) Module 1: a 5fC reaction module, comprising a reagent of ethyl 6-azido-3-oxyhexanoate (compound i) or compound AI (5-(2-azidoethyl)-1,3-indandione, compound iv), and corresponding reaction solution.

(85) This module is used to react with 5fC base in nucleic acid sample, so as to label 5fC base with azido.

(86) Module 2: a selective enrichment module, comprising magnetic beads specifically binding to biotin, a screening buffer, and a reagent which selectively reacts with azido and contains biotin modification.

(87) This module is used to perform a click chemistry [3+2] cycloaddition reaction with the azido labeled in genome, such that 5fC base is further labeled with biotin. Further, by means of the binding of biotin to the streptavidin coupled to the magnetic beads, the nucleic acid sample fragments containing 5fC base are separated and purified with a magnetic frame.

(88) The nucleic acid sample targeted by kits 1, 2, 3, and 4 above refers to a genomic DNA sample or an RNA sample, which can be derived from cell culture, animal tissue, animal blood, formalin-fixed tissue, paraffin-embedded tissue, and trace samples such as early embryonic development samples, single cells and the like.

(89) 7. 5fC Labeling Method and Application of Related Compounds in the Aspect of Molecule Diagnosis

(90) The above specific enrichment methods for 5fC, and related active methylene compounds containing a specific chemical label are used in the molecule diagnosis involving 5-formylcytosine in biological samples. The changes in activities and expression quantities of 5-formylcytosine related proteins produced in cells such as TET protein, TDG for excision of 5-formylcytosine and the like will influence the content and sequence distribution of 5-formylcytosine in the genome. The changes in the content and sequence distribution of 5-formylcytosine in the biological samples are detected by using the above related labeling, detection, and sequencing methods for 5-formylcytosine. Thus, reference data can be provided for disease diagnosis and pathology indications such as pathological changes and histological changes, which is beneficial for clinical diagnosis.

(91) The present invention is further described through the following 8 particular examples, for the purpose of better understanding of the contents of the present invention. The contents of the present invention, however, are not limited to the examples illustrated below. All the reagents and solvents used in the examples are bought from commercial companies, unless otherwise specified.

(92) The DNA Sequences Involved in the Tests of the Present Invention

(93) TABLE-US-00001
Oligo NO. | Sequence (5′-3′) | Notes | SEQ ID NO.
1 | AGATC.sup.5fGTAT | 5fC-9mer | 22
2 | AGATCGTAT | C-9mer | 23
3 | AGATC.sup.5mGTAT | 5mC-9mer | 24
4 | AGATC.sup.5hmGTAT | 5hmC-9mer | 25
5 | AGATC.sup.5caGTAT | 5caC-9mer | 26
6 | CCTCACCATCTCAACCAATATTATATTATGTCTACACGTTC.sup.5fGC.sup.5fGTTCCGTGTTATAATATTGAGGGAGAAGTGGTGA | Oligo NO.6 Forward | 1
6 | TCACCACTTCTCCCTCAATATTATAACACGGAACG*CG*AACGTGTAGACATAATATAATATTGGTTGAGATGGTGAGG | Oligo NO.6 Reverse | 2
7 | CCCTTTTATTATTTTAATTAATATTATATT | Model-BS-F | 3
8 | CTCCGACATTATCACTACCATCAACCACCCATCCTACCTGGACTACATTCTTATTCAGTATTCACCACTTCTCCCTCAAT | Model-R | 4
9 | CTCCGACATTATCACTACCA | Model-SeqR sequencing primer | 5
10 | CATGAGTGCCCTCAGCAGTAAGTAACTGACCAGATCTCTCGTGCCTCTTGAGGCTACTGAGTTATCCAACCTTTAGGAGCCATGCATCGATAGCATCCGC.sup.5fCACAGGCAGTGAGGCTACTGAGTCATGCACGCAGAAAGAAATAGC | qPCR-5fC-M dsDNA | 6
11 | ATTCACTCCCACTGAGACTGTGGATCAGGCCAACATACATGCCTTCAGTAACTGACCAGATCTCTTAGTTCTCTTGAGGCTACTGAGTTAGAATGGCAGAGTCAAGGAGC | qPCR-Ctl dsDNA; obtained by PCR amplification, comprising 100% dATP, 100% dTTP, 100% dGTP, 70% dCTP, 15% d5mCTP, 10% d5hmCTP, 5% d5caCTP | 7
12 | CTACGCAAACTGGCTGTCAAAGTAACTGACCAGATCTCTCGGCTCTCTTGAGGCTACTGAGTTATCATGGACGCTACCTCACAG | qPCR-Ref dsDNA | 8
13 | CATGAGTGCCCTCAGCAGTA | qPCR-M-F | 9
14 | TCCAACCTTTAGGAGCCATG | qPCR-M-R | 10
15 | AGGCCAACATACATGCCTTC | qPCR-Ctl-F | 11
16 | GAATGGCAGAGTCAAGGAGC | qPCR-Ctl-R | 12
17 | CTACGCAAACTGGCTGTCAA | qPCR-Ref-F | 13
18 | CTGTGAGGTAGCGTCCATGA | qPCR-Ref-R | 14
19 | CCTCACCATCTCAACCAATATTATATTACGCGTATATC.sup.5fGC.sup.5fGTATTTCGCGTTATAATATTGAGGGAGAAGTGGTGA | 76mer 5fC x2 | 15
20 | CCTCACCATCTCAACCAATA | Model-F | 16
21 | CCTCACCATCTCAACCAATATTATATTACGCGTATATC.sup.5fGCGTATTTCGCGTTATAATATTGAGGGAGAAGTGGTGA | 76mer 5fC x1 | 17
22 | CCTCACCATCTCAACCAATATTATATTAGTATTTC.sup.5fGATTACGCGTTATTATATTGAGGGAGAAGTGGTGA | Oligo NO.22 Forward | 18
22 | TCACCACTTCTCCCTCAATATAATAACGCGTAATCGAAATACTAATATAATATTGGTTGAGATGGTGAGG | Oligo NO.22 Reverse | 19
23 | CCTCACCATCTCAACCAATATTATATTAGTATTTCGATTACGCGTTATTATATTGAGGGAGAAGTGGTGA | Oligo NO.23 Forward | 20
23 | TCACCACTTCTCCCTCAATATAATAACGCGTAATCGAAATACTAATATAATATTGGTTGAGATGGTGAGG | Oligo NO.23 Reverse | 21

(94) All the oligomeric nucleotide chains with modified bases used in the experiments were synthesized on an ABI EXPEDITE nucleic acid solid-phase synthesizer. The phosphoramidite monomers used for synthesis were bought from Glen Research, USA. The oligomeric nucleotide chains containing only normal bases used in the experiments were synthesized by Sangon Biotech (Shanghai) Co., Ltd.

Example 1

Synthesis of the Representative Compounds of Compounds I, II, and III

(95) The artificially synthesized 9-base oligomeric nucleotide chain containing 5fC base, Oligo NO.1, was reacted with the representative compounds i-1, ii-1, and iii-1 of the 3 kinds of compounds i, ii, and iii, resulting in 3 representative product compounds I-1, II-1, and III-1 of the 3 structures I, II, and III. In the reaction, the representative compound of compound i is either ethyl acetoacetate or methyl acetoacetate; the representative compound of compound ii is malononitrile; and the representative compound of compound iii is 1,3-indandione.

(96) The particular reaction route was as below.

(97) ##STR00025##

(98) Compound i-1 was the representative compound, ethyl acetoacetate or methyl acetoacetate. An appropriate amount of Oligo NO.1, i.e. the 5fC-9mer DNA oligomeric nucleotide chain, was dissolved in an alkaline methanol solution, then a large molar excess of ethyl acetoacetate or methyl acetoacetate was directly added, and after homogeneous mixing the reaction was performed under agitation at 37° C. for 24 h; both reagents gave the same compound I-1. In the reaction, the active methylene at the 2-position of ethyl acetoacetate or methyl acetoacetate was condensed with the formyl of 5fC, and at the same time an intramolecular reaction occurred, during which the 4-amino in the cytosine ring replaced the ethanol/methanol portion of the ester bond, thereby resulting in compound I-1 by ring formation. The MALDI-TOF mass spectrum identification showed that no peak of raw materials remained, m/z.sup.(ob): 2763.5.fwdarw.m/z.sup.(ob): 2829.8/2829.5 (as shown in A, B, and C of FIG. 1).

(99) Compound ii-1 was the representative compound, malononitrile. An appropriate amount of Oligo NO.1, i.e. the 5fC-9mer DNA oligomeric nucleotide chain, was dissolved in a weakly acidic aqueous solution, a large molar excess of malononitrile was added at the same time as a high-concentration aqueous stock solution, and after homogeneous mixing the reaction was performed under agitation at 37° C. for 24 h, obtaining compound II-1. In the reaction, the active methylene of malononitrile was condensed with the 5-formyl of 5fC, and then the 4-amino in the cytosine formed a ring together with the cyano of malononitrile through an intramolecular addition reaction, resulting in target compound II-1. The MALDI-TOF mass spectrum identification showed that no peak of raw materials remained, m/z.sup.(ob): 2763.5.fwdarw.m/z.sup.(ob): 2812.5 (as shown in A and D of FIG. 1).

(100) Compound iii-1 was the representative compound, 1,3-indandione. The reaction of 1,3-indandione with 5fC DNA can be accomplished in an alkaline methanol solution or a weakly acidic aqueous solution. An appropriate amount of Oligo NO.1, i.e. the 5fC-9mer DNA oligomeric nucleotide chain, was dissolved, a large molar excess of 1,3-indandione, a yellow solid, was added at the same time so that it dissolved (in the alkaline methanol solution) or reached saturation (in the weakly acidic aqueous solution), and after homogeneous mixing the reaction was performed under agitation at 37° C. for 24 h, obtaining compound III-1. In the reaction, the active methylene of 1,3-indandione was condensed with the 5-formyl of 5fC, and the 4-amino in the cytosine formed a ring together with a carbonyl of 1,3-indandione through an intramolecular reaction, resulting in target compound III-1. The MALDI-TOF mass spectrum identification showed that no peak of raw materials remained, m/z.sup.(ob): 2763.5.fwdarw.m/z.sup.(ob): 2874.7 (as shown in A and E of FIG. 1).

(101) The results of MALDI-TOF mass spectrum in FIG. 1 showed that no peak of raw materials was detected, which indicates extremely high reaction efficiency.
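The mass shifts reported in Example 1 can be cross-checked with simple arithmetic. The following is only an illustrative sketch, not part of the described method: it assumes average atomic masses, assumes that the two values 2829.8/2829.5 are listed in the order ethyl/methyl acetoacetate, and assumes that the 1,3-indandione ring closure releases a second molecule of water. All function and variable names are hypothetical.

```python
# Average atomic masses (g/mol) of the elements occurring in the reagents.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}


def mass(formula):
    """Average molecular mass of a composition given as {element: count}."""
    return sum(ATOMIC_MASS[el] * n for el, n in formula.items())


WATER = mass({"H": 2, "O": 1})
ETHANOL = mass({"C": 2, "H": 6, "O": 1})
METHANOL = mass({"C": 1, "H": 4, "O": 1})

# reagent -> (composition, mass of small molecules released on labeling)
REAGENTS = {
    # condensation releases water; ring closure releases the ester alcohol
    "ethyl acetoacetate": ({"C": 6, "H": 10, "O": 3}, WATER + ETHANOL),
    "methyl acetoacetate": ({"C": 5, "H": 8, "O": 3}, WATER + METHANOL),
    # condensation releases water; ring closure is an addition (no loss)
    "malononitrile": ({"C": 3, "H": 2, "N": 2}, WATER),
    # ASSUMPTION: condensation and ring closure each release one water
    "1,3-indandione": ({"C": 9, "H": 6, "O": 2}, 2 * WATER),
}

START = 2763.5  # observed m/z of unreacted Oligo NO.1
OBSERVED = {    # observed m/z of the products quoted in Example 1
    "ethyl acetoacetate": 2829.8,   # order of 2829.8/2829.5 is assumed
    "methyl acetoacetate": 2829.5,
    "malononitrile": 2812.5,
    "1,3-indandione": 2874.7,
}


def expected_shift(name):
    """Expected mass gain of the oligonucleotide after labeling."""
    formula, released = REAGENTS[name]
    return mass(formula) - released


for name, observed in OBSERVED.items():
    predicted = START + expected_shift(name)
    print(f"{name}: predicted {predicted:.1f}, observed {observed:.1f}")
```

Under these assumptions each predicted value lies within about 1 Da of the quoted m/z, which is consistent with the ring-closed products described above.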

(102) The reaction provided by the present invention has excellent selectivity. The reaction is specific for 5fC base only, and no side reaction with other cytosines or cytosine derivatives occurs. As shown in FIG. 2, malononitrile as a representative was reacted with DNA sequences containing the other 4 cytosines (C, 5mC, 5hmC, and 5caC; Oligo NO.2, Oligo NO.3, Oligo NO.4, and Oligo NO.5, respectively). MALDI-TOF mass spectrum identification showed that the other cytosines or cytosine derivatives did not react, and the corresponding increase in molecular weight was only observed for the 5fC-9mer DNA sequence after reaction (the secondary peak in the 5hmC group was attributed to an incompletely purified sample). This demonstrated excellent reaction selectivity.

Example 2

Implementing 5fC Ring-Protecting Sodium Bisulfite Sequencing Technique with Diethyl Malonate

(103) Diethyl malonate is an active methylene compound belonging to compound i. The target compound I-2 can be obtained through a two-step condensation reaction of diethyl malonate with 5fC base in an alkaline methanol solution (as shown in the schematic diagram below). The process of the reaction of Oligo NO.1, i.e. the 5fC-9mer DNA oligomeric nucleotide chain, with diethyl malonate is as follows: an appropriate amount of the DNA oligomeric nucleotide chain was dissolved in an alkaline methanol solution, then a large molar excess of diethyl malonate was directly added, and after homogeneous mixing the reaction was performed under agitation at 37° C. for 24 h, obtaining compound I-2. In the reaction, the active methylene at the 2-position of diethyl malonate was condensed with the formyl of 5fC, and at the same time an intramolecular reaction occurred, during which the 4-amino in the cytosine ring replaced the ethanol portion of the ester bond; in addition, the ester bond which did not participate in the ring formation underwent transesterification in the alkaline methanol solution to form a methoxycarbonyl group, producing compound I-2 through ring formation. The MALDI-TOF mass spectrum identification indicated that no peak of raw materials remained, m/z.sup.(ob): 2763.5.fwdarw.m/z.sup.(ob): 2845.4 (as shown in A and F of FIG. 1).

(104) ##STR00026##

(105) A double-stranded DNA sequence Oligo NO.6, 77 bases in length and containing two 5fC bases, was reacted with diethyl malonate. In the Oligo NO.6 sequence, the forward chain comprises two 5fC bases (5′-C.sup.5fGC.sup.5fG-3′), and the reverse chain does not comprise any 5fC base; the bases opposite the 5fC bases are G (5′-CG*CG*-3′). After the sodium bisulfite treatment, PCR amplification was performed with the two primers Oligo NO.7 and Oligo NO.8. The reverse sequencing primer Oligo NO.9 was used for sequencing. Thus, the G* signals of the sequence 5′-CG*CG*-3′ in the read results correspond to the 5fC signals. The reaction conditions were the same as set forth above. After evaporating methanol to dryness, the reaction product was recovered through ethanol precipitation.

(106) The recovered DNA samples were amplified through PCR either directly or after treatment with the EpiTect Fast Bisulfite Conversion Kit from QIAGEN. The products were then sequenced to determine whether the reaction product is resistant to the sodium bisulfite treatment. As shown in the sequencing results in FIGS. 3A and 3B, when the product was directly amplified and sequenced after the reaction with diethyl malonate, cytosines and 5fC bases were read as clear guanine G signals. After treatment with sodium bisulfite, however, normal cytosines in the sample sequence were converted to uracil U, which was amplified as thymine T through PCR and thus read as adenine A signals. In contrast, the products 5fC* of the reaction with diethyl malonate were resistant to the sodium bisulfite treatment, were still read as cytosine C during PCR, and thus were read as guanine G signals in sequencing. This means that the ring formation reaction protected the 4-amino of cytosine and did not prevent such cytosine from being read as C during normal PCR. During the sodium bisulfite treatment, the protected 5fC was not deaminated and hydrolyzed, whereas other normal cytosines were deaminated and hydrolyzed and read as T in sequencing. By comparing the sodium bisulfite sequencing results before and after the reaction, the positions of 5fC can be identified at single-base resolution (FIGS. 3A and 3B).

(107) In this method, the 4-amino of cytosine was protected by the ring formation reaction, such that the cytosine was prevented from being deaminated and hydrolyzed. Because a 5fC position that has not undergone the ring formation reaction is deaminated and hydrolyzed and therefore read as T in sequencing, comparing the two results identifies the position of the 5fC base in the sequence. This method can be called the 5fC ring formation-protecting sodium bisulfite sequencing technique.
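The comparison logic of this technique can be illustrated with a short sketch. It is a simplified model under stated assumptions: reads are written for the forward strand (the example itself used a reverse sequencing primer), every unprotected C is taken to be read as T after sodium bisulfite treatment and PCR, and the toy sequence, positions, and function names are hypothetical.

```python
def bisulfite_read(seq, protected=()):
    """Forward-strand read after bisulfite treatment and PCR.

    Every C is read as T, except at the 0-based positions in `protected`
    (ring-protected 5fC), which are still read as C.
    """
    return "".join(
        "T" if base == "C" and i not in protected else base
        for i, base in enumerate(seq)
    )


def call_protected_sites(read_without_labeling, read_with_labeling):
    """Positions read as T without labeling but as C after labeling."""
    return [
        i
        for i, (a, b) in enumerate(zip(read_without_labeling, read_with_labeling))
        if a == "T" and b == "C"
    ]


# Toy forward-strand fragment; positions 5 and 7 are assumed to be 5fC.
fragment = "ACGTTCGCGTTCCG"
sites_5fc = (5, 7)

unlabeled = bisulfite_read(fragment)                    # 5fC deaminated
labeled = bisulfite_read(fragment, protected=sites_5fc)  # 5fC ring-protected

called = call_protected_sites(unlabeled, labeled)
print(unlabeled)
print(labeled)
print(called)
```

Positions that differ between the two reads are the ring-protected 5fC sites; bases that resist sodium bisulfite for other reasons would read as C in both samples and are therefore not called by this comparison.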

Example 3

Specific Enrichment of Nucleic Acid Containing 5fC Base with the Representative Compound AI of Type iv (formula iv-1)

(108) The reactive region of 1,3-indandione is the methylene between the carbonyl groups in the 5-membered ring. Thus, the modifications at positions 3, 4, 5, and 6 of benzene ring structure will not have significant effect on the properties of the compound. Therefore, 5-(2-azidoethyl)-1,3-indandione (compound AI) was synthesized for specifically enriching nucleic acid containing 5fC bases.

(109) The synthesis route of 5-(2-azidoethyl)-1,3-indandione (compound AI) is as below.

(110) ##STR00027##

Synthesis of 4-(2-chloroethyl)-benzoyl chloride

(111) 4-(2-chloroethyl)-benzoic acid (10 g, 108 mmol) was mixed with 50 mL SOCl.sub.2, several drops of DMF were added, the mixture was heated under reflux for 12 h, and then the excess SOCl.sub.2 was evaporated, resulting in a yellow liquid (10.8 g, 96%). This liquid was used directly in the next reaction step.

Synthesis of 5-(2-chloroethyl)-1,3-indandione (5-(2-chloroethyl)-1H-indene-1,3(2H)-dione)

(112) AlCl.sub.3 (14 g, 106 mmol, 1 eq.) and 200 mL CH.sub.2Cl.sub.2 were added into a dried 500 mL 2-necked flask. 4-(2-chloroethyl)-benzoyl chloride (21.6 g, 106 mmol) was added into the CH.sub.2Cl.sub.2 solution under the protection of nitrogen. Then redistilled malonyl dichloride (16.5 g, 117 mmol, 1.1 eq.) was slowly added dropwise to the solution at 0° C., resulting in a dark brown liquid. The reaction was performed at room temperature for 12 h. After the reaction, the solution was poured onto ice, followed by adding HCl solution (10%, 250 mL) and stirring vigorously for 1 h. Then the solution was extracted with CHCl.sub.3 (3×400 mL). The extract was dried with anhydrous sodium sulfate, concentrated, subjected to column chromatography on silica gel, and eluted with petroleum ether/dichloromethane 2:1, resulting in a light yellow solid (7.9 g, 36%). 1H NMR (300 MHz, CDCl3) 7.93 (d, J=7.8 Hz, 1H), 7.83 (s, 1H), 7.71 (d, J=7.8 Hz, 1H), 3.80 (t, J=6.6 Hz, 2H), 3.25 (t, J=6.6 Hz, 2H), 3.24 (s, 2H).

Synthesis of 5-(2-Azidoethyl)-1,3-Indandione (5-(2-Azidoethyl)-1H-Indene-1,3(2H)-Dione, i.e. AI)

(113) NaN.sub.3 (2.3 g, 36 mmol, 2 eq.) was dissolved in 100 mL dried DMSO, and 5-(2-chloroethyl)-1,3-indandione (3.7 g, 18 mmol) was added. The reaction was performed at 80° C. for 20 min. After the reaction, 300 mL water was added into the solution. Then the solution was extracted with diethyl ether (3×400 mL). The extract was dried with anhydrous sodium sulfate, concentrated, subjected to column chromatography on silica gel, and eluted with petroleum ether/dichloromethane 1:1, resulting in a light yellow solid (680 mg, 18%). 1H NMR (300 MHz, CDCl3) 7.94 (d, J=7.8 Hz, 1H), 7.82 (s, 1H), 7.70 (d, J=7.8 Hz, 1H), 3.62 (t, J=6.6 Hz, 2H), 3.24 (s, 2H), 3.06 (t, J=6.6 Hz, 2H); 13C NMR (75 MHz, CDCl3) 197.6, 197.1, 147.4, 144.1, 142.4, 136.7, 123.8, 123.4, 51.9, 45.6, 35.9; MS(ESI) [M+H].sup.+, 216.2.
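The yields quoted for the last two synthesis steps can be reproduced from the stated masses. The sketch below is only an arithmetic check under assumptions: it uses average atomic masses, takes the acid chloride and the chloride intermediate as the limiting reagents of their respective steps, and uses hypothetical function and variable names.

```python
# Average atomic masses (g/mol).
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "Cl": 35.45}


def molar_mass(formula):
    """Average molar mass of a composition given as {element: count}."""
    return sum(ATOMIC_MASS[el] * n for el, n in formula.items())


def percent_yield(product_g, product_formula, limiting_g, limiting_formula):
    """Isolated yield in percent, relative to the limiting reagent (1:1)."""
    product_mol = product_g / molar_mass(product_formula)
    limiting_mol = limiting_g / molar_mass(limiting_formula)
    return 100.0 * product_mol / limiting_mol


ACID_CHLORIDE = {"C": 9, "H": 8, "Cl": 2, "O": 1}       # 4-(2-chloroethyl)-benzoyl chloride
CHLORO_INDANDIONE = {"C": 11, "H": 9, "Cl": 1, "O": 2}  # 5-(2-chloroethyl)-1,3-indandione
AZIDO_INDANDIONE = {"C": 11, "H": 9, "N": 3, "O": 2}    # compound AI

# Ring-forming acylation: 21.6 g acid chloride -> 7.9 g chloride intermediate
acylation_yield = percent_yield(7.9, CHLORO_INDANDIONE, 21.6, ACID_CHLORIDE)

# Azide substitution: 3.7 g chloride intermediate -> 0.680 g compound AI
azidation_yield = percent_yield(0.680, AZIDO_INDANDIONE, 3.7, CHLORO_INDANDIONE)

# Expected [M+H]+ of compound AI from its average molar mass
mh_plus = molar_mass(AZIDO_INDANDIONE) + ATOMIC_MASS["H"]

print(f"acylation yield: {acylation_yield:.0f}%")
print(f"azidation yield: {azidation_yield:.0f}%")
print(f"[M+H]+ of compound AI: {mh_plus:.1f}")
```

With these inputs the rounded yields agree with the 36% and 18% stated above, and the calculated [M+H]+ agrees with the reported ESI value of 216.2.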

(114) The specific reaction between the synthesized compound AI and a nucleic acid sequence containing 5fC can be used to selectively separate and enrich DNA samples containing 5fC base. The process is shown in FIG. 4: the 5fC base in a nucleic acid sample was reacted with compound AI, such that an azido was specifically introduced. A biotin with a disulfide linkage was then specifically introduced into the reaction product through the click chemistry reaction between the azido and the alkynyl. In this way, through the two-step reaction, a biotin group was introduced at the position of the 5fC base selectively and efficiently. Then, selective enrichment was carried out by utilizing the strong binding between streptavidin and biotin, and thus the DNA sequences containing 5fC were separated for subsequent operations such as sequencing analysis and the like. The MALDI-TOF mass spectra of the products obtained from the respective steps of the reaction of Oligo NO.1, containing a single 5fC, with compound AI are shown in FIG. 5 and indicate a high reaction efficiency.

(115) Three artificially synthesized double-stranded DNA samples were spiked into mouse embryonic stem cell genomic DNA samples in a proportion of 2 pg/(1 μg gDNA). The samples were enriched through the experimental process above. The enrichment effect was detected by real time fluorescent quantitative PCR. The three sequences used were respectively: Oligo NO.10, comprising one 5fC site, for which the Oligo NO.13/14 primer pair was used during qPCR; Oligo NO.11, a control sequence, obtained by PCR, comprising 100% dATP, 100% dTTP, 100% dGTP, 70% dCTP, 15% d5mCTP, 10% d5hmCTP, 5% d5caCTP, and comprising no 5fC, for which the Oligo NO.15/16 primer pair was used during qPCR; and Oligo NO.12, a reference sequence, comprising only the four basic bases, for which the Oligo NO.17/18 primer pair was used during qPCR. The relative enrichment degree was calculated using Ct method.
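The relative enrichment calculation can be sketched as follows. This is an assumption-laden illustration rather than the exact procedure used: it assumes the comparative (delta-delta Ct) form of the Ct method, assumes 100% amplification efficiency (a factor of 2 per cycle), normalizes against the reference sequence, and uses invented Ct values and a hypothetical function name.

```python
def relative_enrichment(ct_target_input, ct_ref_input,
                        ct_target_enriched, ct_ref_enriched):
    """Fold enrichment of a target over the reference sequence.

    delta Ct = Ct(target) - Ct(reference), computed separately for the
    input sample and for the enriched sample; the fold enrichment is
    2 ** -(delta Ct enriched - delta Ct input).
    """
    delta_input = ct_target_input - ct_ref_input
    delta_enriched = ct_target_enriched - ct_ref_enriched
    return 2.0 ** -(delta_enriched - delta_input)


# Invented Ct values for illustration only (not measured data).
fold_5fc = relative_enrichment(25.0, 25.0, 24.0, 30.0)      # 5fC spike-in
fold_control = relative_enrichment(25.0, 25.0, 30.0, 30.0)  # control spike-in

print(fold_5fc)      # enriched relative to the reference
print(fold_control)  # not enriched relative to the reference
```

A sequence that is pulled down no better than the reference gives a value of 1, while each cycle gained relative to the reference doubles the calculated enrichment.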

(116) The enrichment results are shown in FIG. 6. It can be seen that the DNA fragment containing 5fC can be selectively enriched with compound AI. The enrichment degree for the DNA sequence containing only a single 5fC base can be up to about 100 times. However, in the control group, the DNA sequence containing 15% 5mC, 10% 5hmC, and 5% 5caC bases was not enriched.

(117) Similar enrichment process can also be implemented with ethyl 6-azido-3-oxyhexanoate. As shown in FIG. 7, ethyl 6-azido-3-oxyhexanoate specifically reacted with a nucleic acid containing 5fC in an alkaline methanol solution, such that the nucleic acid containing 5fC was labeled with an azido. An affinity group, such as biotin, was further introduced into the nucleic acid by means of the click reaction between alkynyl and azido. The affinity group enabled the enrichment and separation of the nucleic acid containing 5fC.

Example 4

Implementing of Ring Formation Promoting 5fC to T Conversion Sequencing Technique with 1,3-Indandione and the Derivatives Thereof

(118) 1,3-indandione is a representative compound of compound iii of the present invention. A DNA sequence Oligo NO.19, 76 bases in length and containing two 5fC bases, was reacted with compound AI, a derivative of 1,3-indandione (see Example 3 for the synthesis route and application thereof). The sequence used comprises two 5fC bases (5′-C.sup.5fGC.sup.5fG-3′). The sample before or after the reaction was amplified directly with Oligo NO.8 and Oligo NO.20. The amplified product was also sequenced with Oligo NO.9. Because a reverse sequencing primer was used, the G* signals of the sequence 5′-CG*CG*-3′ in the sequencing result correspond to the signals for the 5fC sites. The reaction conditions were the same as set forth above. The reaction product was recovered through ethanol precipitation.

(119) The ring formation promoting 5fC to T conversion sequencing technique was implemented with compound AI. The result thereof was as shown in FIGS. 8A and 8B. Before the reaction with compound AI, the two 5fC bases were read as guanine G signals. After the reaction, the two 5fC base positions were read as thymine T during PCR amplification. Thus, when using reverse sequencing primer, 5fC position was read as adenine A signal, and the regions corresponding to other cytosines were not influenced. By comparing the sequence information before and after the reaction, C-T mutation signal (forward primer sequencing) or G-A mutation signal (reverse primer sequencing) was found to be the position of 5fC base. In this way, single base resolution sequence information of 5fC in the genome can also be easily detected.

(120) In this method, 5fC was reacted with compound AI, such that the reaction product of 5fC was read as thymine T during PCR amplification. By stably reading out the C-T mismatched sites through comparing the results before and after the reaction, the sequence position of 5fC can be directly identified. Such 5fC sequencing methods can be called the ring formation promoting 5fC to T conversion sequencing technique.
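The site-calling step of this technique, including the mapping from a reverse-primer read back to the forward strand, can be sketched as below. It is a minimal illustration under assumptions: reads are error-free and already aligned, the toy fragment is taken from the region around the two 5fC sites of Oligo NO.19, and all function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_5fc_sites(reverse_read_before, reverse_read_after):
    """0-based forward-strand positions showing G->A in the reverse read.

    A C->T change on the forward strand appears as G->A when the
    amplified product is sequenced with a reverse primer.
    """
    n = len(reverse_read_before)
    sites = [
        n - 1 - i
        for i, (before, after) in enumerate(
            zip(reverse_read_before, reverse_read_after)
        )
        if before == "G" and after == "A"
    ]
    return sorted(sites)


# Toy forward-strand fragment; positions 5 and 7 are the 5fC sites.
forward_before = "TATATCGCGTATTT"
forward_after = "TATATTGTGTATTT"  # labeled 5fC read as T during PCR

reverse_before = reverse_complement(forward_before)
reverse_after = reverse_complement(forward_after)

sites = find_5fc_sites(reverse_before, reverse_after)
print(reverse_before, reverse_after, sites)
```

Only the labeled 5fC positions change between the two reads; other cytosines are read identically before and after the reaction and are therefore not called.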

Example 5

Implementing of Ring Formation Promoting 5fC to T Conversion Sequencing Technique by Means of Malononitrile Reaction

(121) Malononitrile is a representative compound of compound ii of the invention. A DNA sequence Oligo NO.21, 76 bases in length and containing a single 5fC base, was reacted with malononitrile. The sequence used comprises only one 5fC base (5′-C.sup.5fGCG-3′). The sample before or after the reaction was amplified directly with Oligo NO.8 and Oligo NO.20. The amplified product was also sequenced with Oligo NO.9. Thus, the G* signal of the sequence 5′-CGCG*-3′ in the read result corresponds to the signal for 5fC. The reaction conditions were the same as set forth above. The reaction product was directly recovered through ethanol precipitation.

(122) The sequences before and after the reaction were directly amplified by PCR reaction respectively. The amplified products were sequenced with reverse primer, obtaining the results as shown in FIGS. 9A and 9B. Before the reaction with malononitrile, the 5fC base position was read as guanine G signal. After the reaction, the 5fC base position was read as thymine T during PCR amplification. Thus, when using reverse sequencing primer, 5fC position was read as adenine A signal. By comparing the sequence information before and after the reaction, C-T mutation signal (forward primer sequencing) or G-A mutation signal (reverse primer sequencing) was found to be the position of 5fC base. In this way, single base resolution sequence information of 5fC in a nucleic acid sequence can further be easily detected.

(123) In this method, 5fC was reacted with malononitrile, such that the reaction product of 5fC can also be amplified into thymine T during PCR amplification. Such 5fC sequencing methods are also classified as the ring formation promoting 5fC to T conversion sequencing technique.

Example 6

Specific Detection of the Concentration of 5fC by Means of the Fluorescence Property of the Reaction Product of Malononitrile

(124) When the Oligo NO.1 (5′-AGATC.sup.5fGTAT-3′) samples after reaction were quantified with a Nanodrop micro-ultraviolet spectrophotometer from Thermo, it was found that all of the compounds i, ii, and iii cause the sample to exhibit new ultraviolet absorption peaks. As shown in FIG. 10, the reaction product of Oligo NO.1 with malononitrile exhibits a new absorption peak at about 330 nm; the reaction product of Oligo NO.1 with 1,3-indandione exhibits a new absorption peak at about 310 nm; the reaction product of Oligo NO.1 with ethyl acetoacetate or methyl acetoacetate exhibits a new absorption peak at about 350 nm; and the reaction product of Oligo NO.1 with diethyl malonate exhibits a new absorption peak at about 345 nm. Since a new ultraviolet absorption arises from the formation of a conjugated polycyclic derivative, the reaction product may also generate new fluorescence. New fluorescence of the reaction product was indeed detected by a fluorescence spectrophotometer. Here, only the reaction product of malononitrile with 5fC base is used as an example for illustration. The other aforementioned compounds containing an active methylene are not additionally discussed here.

(125) The reaction product of malononitrile with 5fC base DNA possesses good fluorescence. As shown in FIG. 11, Oligo NO.1 was used as raw material to react with malononitrile, and the obtained reaction product was determined via a fluorescence spectrophotometer to be a new product (included within the scope of compound II) with a maximum excitation wavelength of 328 nm and a maximum emission wavelength of 370 nm.

(126) The reaction product was quantitatively prepared into standard solutions with a concentration gradient. Meanwhile, sample solutions of the raw material Oligo NO.1 were prepared with the same concentration gradient. The fluorescence intensities of the two kinds of solutions at the various concentrations were determined under the same conditions. The difference between the fluorescence intensities of the two kinds of solutions was calculated by subtracting the intensity of the raw material from that of the reaction product, giving the net increase in fluorescence intensity before and after the reaction. As shown in FIGS. 12A and 12B, the net increase of fluorescence intensity increases proportionally with the concentration of the reaction product (FIG. 12A). A standard curve was plotted with the net increase of fluorescence intensity as the vertical axis and the corresponding concentration as the horizontal axis, exhibiting a good linear relationship. The lower limit of detection can reach 10 nM (FIG. 12B).
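The standard-curve procedure described above can be sketched with an ordinary least-squares line. The intensities below are invented placeholder numbers chosen only to make the arithmetic easy to follow; they are not data from FIG. 12, and the function names are hypothetical.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Placeholder calibration series (arbitrary intensity units, NOT measured).
concentrations_nM = [10, 20, 50, 100, 200]
product_intensity = [25, 46, 109, 214, 424]  # reaction product
raw_intensity = [5, 6, 9, 14, 24]            # unreacted Oligo NO.1

# Net increase = intensity of product - intensity of raw material
net_increase = [p - r for p, r in zip(product_intensity, raw_intensity)]

slope, intercept = linear_fit(concentrations_nM, net_increase)


def concentration_from_signal(net_signal):
    """Invert the standard curve to estimate a concentration in nM."""
    return (net_signal - intercept) / slope


print(slope, intercept)
print(concentration_from_signal(150))
```

In practice an unknown sample would be measured under the same conditions as the standards, and values outside the calibrated range (here 10 to 200 nM in the placeholder series) would not be read from the curve.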

(127) The fluorescence activation effect of such reaction products can be used to quantify the concentration of 5fC base, and also can be used to label the 5fC base in a nucleic acid sample.

Example 7

Influencing the Identification of a Substrate Sequence with TaqI Endonuclease by Means of Compound AI Reaction

(128) TaqI can cleave a double-stranded DNA containing a 5-TCGA-3 palindromic sequence, and the cytosine at the second position may be a 5-position modified base (5mC, 5hmC, 5fC, 5caC) (Shinsuke Ito, et al., Science, 2011, 333:1300-1303). By means of the reactions of the aforementioned 3 kinds of compounds with the 5fC base in the 5-TC.sup.5fGA-3 sequence, the chemical properties of the 5fC base are altered, which may change the ability of TaqI to identify a substrate sequence. Here, only the reaction product of compound AI with 5fC base is used as an example for illustration. The other aforementioned active methylene compounds are not additionally discussed here.

(129) The double-stranded DNA used is Oligo NO.22, the forward strand of which comprises a 5-TC.sup.5fGA-3 sequence, and the reverse strand of which does not comprise any 5fC base. The reference sequence is Oligo NO.23, the sequence of which is identical to Oligo NO.22, except that it does not contain any 5fC base. Compound AI was reacted with Oligo NO.22, a biotin was coupled to Oligo NO.22 through Click Chemistry, and the completely labeled double-stranded Oligo NO.22 reaction product was eluted with DTT after enrichment. Then, the reference sequence Oligo NO.23, the Oligo NO.22 before the reaction, and the reacted and eluted Oligo NO.22 sequence were simultaneously digested with TaqI for 1 h, and loaded onto a 4% agarose gel to determine by electrophoresis whether the sequences were digested completely. As 5-TC.sup.5fGA-3 or 5-TCGA-3 is located in the middle of the sequence of Oligo NO.22 or Oligo NO.23, the sequence size before digestion is 70 bp, and the size of the completely digested product is 35 bp.
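The expected fragment sizes can be illustrated with a short sketch. It assumes the known TaqI cut position between T and C of the recognition sequence (T^CGA) and uses a hypothetical 70-nt placeholder strand, since the actual sequences of Oligo NO.22 and Oligo NO.23 are not reproduced here. Only top-strand lengths are computed; because the cut is staggered, the two strands of each fragment differ slightly in length.

```python
TAQI_SITE = "TCGA"
CUT_OFFSET = 1  # TaqI cuts between T and C (T^CGA)


def taqi_fragment_lengths(top_strand):
    """Return top-strand fragment lengths after complete digestion at every TCGA site."""
    cuts = []
    start = 0
    while True:
        idx = top_strand.find(TAQI_SITE, start)
        if idx == -1:
            break
        cuts.append(idx + CUT_OFFSET)
        start = idx + 1
    bounds = [0] + cuts + [len(top_strand)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


# Hypothetical 70-nt strand with a single TCGA site near the middle (placeholder only).
placeholder_strand = "A" * 34 + "TCGA" + "A" * 32
fragments = taqi_fragment_lengths(placeholder_strand)

# A strand with no recognition site stays intact.
uncut = taqi_fragment_lengths("A" * 70)
```

The sketch models only recognition of the unmodified site; whether a modified site is still recognized is precisely what the digestion experiment tests.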

(130) As shown in FIG. 13, in the control group, the samples containing a double-stranded 5-TCGA-3 (Oligo NO.23) or 5-TC.sup.5fGA-3 (Oligo NO.22) can be digested completely, while in the experimental group, the sample obtained from the reaction of 5-TC.sup.5fGA-3 (Oligo NO.22) with compound AI followed by enrichment cannot be digested, indicating that the reaction product influences the identification of the substrate by TaqI.

Example 8

Detection of Distribution of 5fC Base in Mouse Embryonic Stem Cell Genomic DNA by the Ring Formation Promoting 5fC to T Conversion Sequencing Technique Based on Compound AI

(131) To confirm whether the method of the present invention can detect distribution information of 5-formylcytosine and single-base resolution sequence information in biological samples (for example genomic DNA), the ring formation promoting 5fC to T conversion sequencing technique based on compound AI is used for illustration here. In particular, the procedures of the above Examples 3 and 4 were applied to genomic DNA samples of mouse embryonic stem cells (mESC).

(132) The pretreated genomic DNA of wild-type mESC was reacted with compound AI for 24 hours. The DNA was recovered and coupled with a biotin group through Click reaction. The labeled DNA sequences were separated and enriched with streptavidin magnetic beads, yielding the DNA fragments in which 5fC bases are distributed. The obtained samples were subjected to second-generation sequencing library construction, PCR amplification, and then high-throughput sequencing. The sequencing results were aligned to the genome. Thus the distribution information and single-base resolution sequence information of 5fC bases in the mESC genome can be observed.

(133) As shown in FIG. 14, three samples of genomic DNA were sequenced in one batch, including an unreacted sample, a sample after reaction but before enrichment, and an enriched sample. It can be seen that no significant enrichment was observed for the unreacted sample or the sample before enrichment, while a significant enrichment peak in the distribution region of 5fC base was observed for the enriched sample. The results show that the enrichment of 5fC base-containing DNA sequences based on compound AI is feasible, and can be used to analyze the genomic distribution information of 5fC base in combination with high-throughput sequencing data.

(134) Given that 5fC base is read as thymine T during PCR amplification after the reaction with compound AI, the single-base resolution position of 5-formylcytosine can be detected through the detection of C-T mismatched positions in the sequences read in high-throughput sequencing. FIG. 15 shows representative C-T mismatched positions in an enrichment peak. It can be seen that each sequence read in the enrichment peak contains one C-T mismatched position, and 4 C-T mismatched positions were obtained by comparing with the genome, wherein the 3 circled mismatched positions are located at CpG dyads. It follows that the ring formation promoting 5fC to T conversion sequencing technique can detect the single-base resolution position of 5fC base in real biological samples.
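The C-T mismatch detection described above can be sketched as follows. This is a simplified illustration under stated assumptions: each read is taken to be aligned to the reference without gaps, on the same strand as the reference, starting at a known offset. The function names and the short example sequences are hypothetical and are not taken from FIG. 15.

```python
def c_to_t_positions(reference, read, offset=0):
    """Return reference coordinates where the reference has C and the aligned read has T.

    Assumes an ungapped alignment of `read` starting at `offset` on the reference strand.
    """
    positions = []
    for i, base in enumerate(read):
        if reference[offset + i] == "C" and base == "T":
            positions.append(offset + i)
    return positions


def in_cpg_context(reference, pos):
    """True if the C at `pos` is immediately followed by G on the reference strand."""
    return (
        reference[pos] == "C"
        and pos + 1 < len(reference)
        and reference[pos + 1] == "G"
    )


# Hypothetical example sequences for illustration only.
reference = "AACGTTCGAC"
read_a = "TGTTCGAC"  # aligned at offset 2; C at reference position 2 read as T
read_b = "CGTTTGAC"  # aligned at offset 2; C at reference position 6 read as T

sites_a = c_to_t_positions(reference, read_a, offset=2)
sites_b = c_to_t_positions(reference, read_b, offset=2)
```

In practice, reads from the opposite strand would show the complementary G-A mismatch, and candidate positions would need to be distinguished from sequencing errors and genuine sequence variants; the sketch does not address either point.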

(135) By combining the above two methods, namely the 5fC ring protecting sodium bisulfite sequencing technique and the ring formation promoting 5fC to T conversion sequencing technique, the single-base resolution read information of all cytosines during sequencing can be summarized in the table shown in FIG. 16. In conventional sequencing, all five kinds of cytosine are read as cytosine C; in conventional sodium bisulfite sequencing, 5-methylcytosine and 5-hydroxymethylcytosine are read as C, while cytosine, 5-formylcytosine and 5-carboxylcytosine are read as thymine T. In the 5fC ring protecting sodium bisulfite sequencing technique provided in the present invention, 5fC base is protected, and is read as C in sodium bisulfite sequencing. Therefore, the position of 5fC base can be identified by comparing with the result of conventional sodium bisulfite sequencing, in which 5fC base is read as T. In addition, in the ring formation promoting 5fC to T conversion sequencing technique provided in the present invention, through direct PCR amplification and sequencing, 5fC base is read as thymine T. By comparing with the result of conventional sequencing, the C-T mismatched position is the single-base resolution sequence position of 5fC base.
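The comparison logic of the two techniques can be written as a small lookup sketch. It encodes only the read-outs stated in the text, and it assumes, as the comparison with conventional sodium bisulfite sequencing requires, that ring-protected 5fC resists bisulfite conversion and is read as C. Read-outs of the other cytosine forms under the two techniques of the invention are not encoded, since FIG. 16 is not reproduced here. All names are hypothetical.

```python
# Read-outs stated in the text; keys are illustrative labels, not official names.
STATED_READOUT = {
    "conventional": {"C": "C", "5mC": "C", "5hmC": "C", "5fC": "C", "5caC": "C"},
    "bisulfite": {"C": "T", "5mC": "C", "5hmC": "C", "5fC": "T", "5caC": "T"},
    # Only the 5fC entries are encoded for the two techniques of the invention.
    "ring_protected_bisulfite": {"5fC": "C"},
    "ring_promoted_conversion": {"5fC": "T"},
}


def is_5fc_by_protection(bisulfite_call, protected_bisulfite_call):
    """5fC candidate: T in conventional bisulfite, C in ring-protected bisulfite."""
    return bisulfite_call == "T" and protected_bisulfite_call == "C"


def is_5fc_by_conversion(conventional_call, promoted_call):
    """5fC candidate: C in conventional sequencing, T after ring-promoted conversion."""
    return conventional_call == "C" and promoted_call == "T"


protection_hit = is_5fc_by_protection(
    STATED_READOUT["bisulfite"]["5fC"],
    STATED_READOUT["ring_protected_bisulfite"]["5fC"],
)
conversion_hit = is_5fc_by_conversion(
    STATED_READOUT["conventional"]["5fC"],
    STATED_READOUT["ring_promoted_conversion"]["5fC"],
)
```

Each function takes the base calls observed at one genomic position in the two sequencing runs being compared and reports whether the pair matches the pattern expected for 5fC.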