SINGLE-STRANDED END PRESERVING ADAPTORS
20250297301 · 2025-09-25
Inventors
Cpc classification
C12Q2525/186
CHEMISTRY; METALLURGY
C12N9/22
CHEMISTRY; METALLURGY
C12Q1/6806
CHEMISTRY; METALLURGY
C12N15/1093
CHEMISTRY; METALLURGY
International classification
C12Q1/6806
CHEMISTRY; METALLURGY
C12N9/00
CHEMISTRY; METALLURGY
C12N9/22
CHEMISTRY; METALLURGY
Abstract
Provided herein are compositions, kits, systems, and methods employing single-stranded end-preserving adaptors. Such single-stranded adaptors are attached to DNA duplex molecules while preserving original 5′ or 3′ single-strand protruding ends (e.g., present in cell-free DNA) by attaching such adapters to 3′ ends of the DNA duplex molecules using a single strand ligase that has step 3 ligase activity, but not step 2 adenylyl transfer activity, and attaching such adapters to 5′ ends of the DNA duplex molecules using a ligase enzyme (e.g., a circligase), thereby forming loop-like structures on one or each end of the DNA duplex molecules. In further embodiments, the loop-like structures are cleaved (e.g., by an endonuclease) as the single-stranded adapters have a cleavable portion, thereby generating a two-part adapter on one or both ends of the DNA duplex molecules that preserves the initial 5′ or 3′ single-strand protruding ends, along with any methylation present.
Claims
1. A kit comprising: a) a plurality of single-stranded adaptors, wherein each of said single-stranded adaptors comprises a nucleic acid sequence and optionally a 3′ end blocking group attached to the 3′ end of said nucleic acid sequence, wherein said 3′ end blocking group is optionally a modified non-canonical nucleotide or a phosphate group, wherein the 5′ end of said nucleic acid sequence is adenylated, and wherein said nucleic acid sequence comprises: a 5′ region, a 3′ region, and a cleavable region between said 5′ and 3′ regions, and is optionally methylated at every, or nearly every, cytosine present in said nucleic acid sequence; and b) a single strand ligase, wherein said single strand ligase has step 3 ligase activity, but not step 2 adenylyl transfer activity.
2. The kit of claim 1, wherein said cleavable region comprises: A) a modified non-canonical base, B) a nucleic acid backbone linkage, or C) an endonuclease recognition site, wherein said endonuclease recognition site is formed at the junction of said 5′ and 3′ regions or is composed of a sequence between said 5′ and 3′ regions, or D) one or more RNA bases, wherein optionally the rest of the first and second nucleic acid sequences are composed of DNA, or E) a sequence that forms an endonuclease recognition sequence when a secondary oligonucleotide is added.
3. The kit of claim 1, wherein said single strand ligase is a thermostable lysine-mutant ssDNA/RNA ligase which is a mutated version of a precursor thermostable ssDNA/RNA ligase, wherein said precursor thermostable ssDNA/RNA ligase has a Motif I of EKx(D/N/H)G, and wherein said thermostable lysine-mutant replaces K in said Motif I with any other amino acid or with an amino acid selected from alanine (A), serine (S), cysteine (C), valine (V), threonine (T), and glycine (G).
4. The kit of claim 1, wherein said single strand ligase has an amino acid sequence that is 95% or 100% identical to any one of SEQ ID NOs: 1-21.
5. The kit of claim 1, wherein said 5′ and 3′ regions of said first nucleic acid sequence each comprise at least one element selected from: a flow cell attachment sequence, a unique barcode sequence, a non-unique barcode sequence, a sample-identifying index sequence, a read 1 primer binding sequence, a read 2 primer binding sequence, and a universal PCR amplification primer binding sequence.
6. The kit of claim 1, wherein said 5′ and 3′ regions of said first nucleic acid sequence each comprise a barcode sequence, wherein said barcode sequences have a predefined relationship to each other, which is optionally based on a pre-defined association which may be a look up table, and optionally wherein said barcode sequences are non-complementary to each other, and optionally wherein the barcodes are of different lengths.
7. The kit of claim 1, wherein said 3′ end blocking group is present.
8. The kit of claim 1, wherein said 3′ end blocking group is selected from: a nucleotide with a 3′-phosphate group and a nucleotide reversible terminator.
9. The kit of claim 1, further comprising a deblocking agent.
10. The kit of claim 1, further comprising a ligase enzyme selected from: Circligase I, Circligase II, RtcB ligase from E. coli or homologs, thermostable RtcB or homologs, TS2126 RNA ligase, Mth DNA ligase, T4 RNA ligase 1, and T4 RNA ligase 2.
11. The kit of claim 1, further comprising one or more enzymes capable of cleaving said cleavable region of said nucleic acid sequence.
12. The kit of claim 11, wherein said one or more enzymes, or reagent, are selected from: an endonuclease, endonuclease V, endonuclease VIII, an endonuclease and uracil DNA glycosylase, thermostable oxoguanine glycosylase (OGG), and iodine.
13. The kit of claim 1, further comprising a plurality of DNA duplex molecules, wherein each of said DNA duplex molecules comprises: i) a first duplex end that comprises a 3′ strand end and a 5′ strand end, and has a 3′ or 5′ single-strand protruding end that is either a single non-adenine nucleotide or is at least two nucleotides in length, and ii) a second duplex end that comprises a 3′ strand end and a 5′ strand end which optionally has a 3′ or 5′ single-strand protruding or blunt end, and optionally wherein any of, or all of, said protruding ends comprise at least one cytosine that is methylated.
14. The kit of claim 13, wherein each of said DNA duplex molecules comprises a loop-like structure on the first duplex end and second duplex end, wherein said loop-like structure is composed of said nucleic acid sequence.
15. The kit of claim 1, further comprising one or more containers for collectively or separately holding the recited components, optionally, wherein said components are present inside said container.
16. The kit of claim 15, wherein said one or more containers is selected from a cardboard box, a plastic bag or box, glass vials, and plastic vials.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DEFINITIONS
[0064] A single-stranded DNA/RNA ligase enzyme is considered to have step 3 ligase activity but not step 2 adenylyl transfer activity when it is able to ligate a 5′-adenylated end to a 3′-hydroxyl end of single-stranded DNA/RNA, but is not able to transfer AMP to the 5′-phosphate-terminated DNA or RNA strand to form a 5′-App-DNA/RNA intermediate. All known DNA and RNA ligases perform catalysis via a common pathway that involves three nucleotidyl transfer reactions (Lehman et al, Science, 1974; Lindahl et al, Annu Rev Biochem, 1992). In the case of ATP-dependent DNA or RNA ligases, the first step (step 1) involves attack on the α-phosphate of ATP by the ligase, which results in release of pyrophosphate and formation of a ligase-AMP intermediate. AMP is linked covalently to the amino group of a lysine residue within a conserved sequence motif. In the second step (step 2), the AMP nucleotide is transferred to the 5′-phosphate-terminated DNA or RNA strand to form a 5′-App-DNA/RNA intermediate. In the third and final step (step 3), attack by the 3′-OH strand on the 5′-App-DNA/RNA end joins the two polynucleotides and liberates AMP.
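The distinction drawn in this definition can be illustrated with a small toy model. The sketch below is only an illustration under stated assumptions: the class name `Ligase`, the flag `has_motif_lysine`, and the end labels "App", "P", and "OH" are hypothetical names chosen here, not terms from the disclosure, and the model ignores all kinetics and substrate preferences.

```python
from dataclasses import dataclass


@dataclass
class Ligase:
    """Toy model of a single strand ligase (hypothetical, for illustration only)."""

    # True if the conserved lysine that carries AMP in steps 1 and 2 is intact.
    has_motif_lysine: bool

    def can_ligate(self, five_prime_end: str, three_prime_end: str) -> bool:
        """Return True if the donor 5' end can be joined to the acceptor 3' end."""
        if three_prime_end != "OH":
            return False  # step 3 requires a free 3'-hydroxyl on the acceptor
        if five_prime_end == "App":
            return True  # pre-adenylated donor: step 3 alone is sufficient
        if five_prime_end == "P":
            # a 5'-phosphate donor must first be adenylated (steps 1 and 2)
            return self.has_motif_lysine
        return False


wild_type = Ligase(has_motif_lysine=True)
step3_only_mutant = Ligase(has_motif_lysine=False)
```

In this model the lysine mutant joins only donors that already carry a 5′-App group, which is the behavior the definition describes.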
[0065] End repair means fill-in of the 3′-single-strand protruding ends by polymerase and/or resection of the 5′-single-strand protruding ends of the original analyte DNA duplex. In certain embodiments, the methods herein do not employ any type of end repair.
DETAILED DESCRIPTION
[0066] Provided herein are compositions, kits, systems, and methods employing single-stranded end-preserving adaptors. Such single-stranded adaptors are attached to DNA duplex molecules while preserving original 5′ or 3′ single-strand protruding ends (e.g., present in cell-free DNA) by attaching such adapters to the 3′ ends of the DNA duplex molecules using a single strand ligase that has step 3 ligase activity, but not step 2 adenylyl transfer activity, and attaching such adapters to the 5′ ends of the DNA duplex molecules using a ligase enzyme (e.g., a circligase), thereby forming loop-like structures on one or each end of the DNA duplex molecules. In further embodiments, the loop-like structures are cleaved (e.g., by an endonuclease) as the single-stranded adapters have a cleavable portion, thereby generating a two-part adapter on one or both ends of the DNA duplex molecules that preserves the initial 5′ or 3′ single-strand protruding ends (or blunt ends), along with any methylation present, as the method does not require any end-repair or A-tailing. Such methods are particularly useful in sequencing library preparation and fragmentomic analyses.
[0067] Current end repair/A-tailing (ER/AT) chemistry used during NGS library preparation writes to and erases from the starting material, which introduces a variety of artifacts that affect sequencing and data interpretation. In addition, conventional ER/AT abolishes the native DNA ends, which for some types of samples can be informative when they are generated biologically. Provided herein, in certain embodiments, is a sequencing library preparation workflow that employs ligation (e.g., before an amplification step). In some embodiments, an important step is the duplex-retaining single strand DNA tail ligation (DISTAL ligation) schema (e.g., as exemplified in
[0068] In particular embodiments, the single-strand adaptors herein comprise a 5′ region and a 3′ region, wherein the 5′ region comprises a first barcode sequence, and the 3′ region comprises a second barcode sequence. In some embodiments, the first and second barcode sequences are complementary and may hybridize to each other. In other embodiments, the first and second barcodes are not complementary and do not hybridize to each other.
[0069] Embodiments where the first and second barcode sequences do not hybridize to each other differ from the normal approach in the art, where the barcode sequence is in the stalk of a Y shaped adaptor, necessitating that the first and second barcodes are always complementary to each other (due to the constraint of hybridization). As a result, in the normal approach, an adaptor with barcode A ligates to one end of the duplex and an adaptor with barcode B ligates to the other end of the duplex, so that the read pair from one strand has barcode AB, and the read pair from the other strand has barcode BA. During analysis, read pairs that share the same genomic start and stop coordinates, and also have AB/BA barcodes, are considered a duplex.
[0070] In embodiments herein the 5′ and 3′ barcodes are not complementary to each other, as shown in Table 5 below, where AB, CD, EF, and GH each represent a 5′ barcode and a 3′ barcode that do not hybridize to each other. Single strand adaptors with unique combinations of AB, CD, EF, and GH may be synthesized individually, and used as a mixture. As an example, during ligation, a first single strand adaptor with barcodes A and B ligates to the first duplex end, and a second single strand adaptor with barcodes C and D ligates to the second duplex end. The read pair from one strand has barcode DA, and the read pair from the other strand has barcode BC. During analysis, read pairs that overlap in their start and stop positions, and have the corresponding association as defined in the look up table (e.g., as long as they pair with each other in this table, DA/BC in the example, and therefore have a predetermined relationship that can be looked up) are considered a duplex. The duplex ends can then be inferred from the genomic start and stop positions of the two strands at either end.
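The look-up-table pairing described above can be sketched in a few lines. This is a minimal illustration, not the analysis pipeline of the disclosure: the tuple layout `(start, stop, first_barcode, second_barcode)`, the function name `is_duplex`, and the simple interval-overlap test are assumptions made here; the table contents follow the AB, CD, EF, GH example.

```python
# Hypothetical look-up table following the AB, CD, EF, GH example:
# 5' barcode -> 3' barcode carried on the same single strand adaptor.
LOOKUP = {"A": "B", "C": "D", "E": "F", "G": "H"}


def is_duplex(read_a, read_b, lookup=LOOKUP):
    """Decide whether two read pairs come from the two strands of one duplex.

    Each read pair is (start, stop, first_barcode, second_barcode), with the
    barcodes in the order used in the text (e.g. "DA" -> "D", "A").
    """
    start_a, stop_a, a_first, a_second = read_a
    start_b, stop_b, b_first, b_second = read_b
    # The two strands must overlap in their genomic start and stop positions.
    overlaps = start_a <= stop_b and start_b <= stop_a
    # The second barcode of each read pair is the 5' barcode of one adaptor;
    # its 3' partner in the table must be the first barcode of the other pair.
    paired = (
        lookup.get(a_second) == b_first and lookup.get(b_second) == a_first
    )
    return overlaps and paired


strand_1 = (100, 265, "D", "A")  # read pair with barcode DA
strand_2 = (103, 262, "B", "C")  # read pair with barcode BC
print(is_duplex(strand_1, strand_2))
```

Because the two strands may carry different protruding ends, the sketch only requires overlapping coordinates, not identical ones; a real pipeline would likely apply a stricter, tunable criterion.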
TABLE 5
5′ barcode    3′ barcode
A             B
C             D
E             F
G             H
[0072] As discussed above, the barcodes herein do not need to be complementary, which, for example, expands the design space to 16 possible combinations for 1-mers as shown in Table 6 below, in contrast to only 4 possible combinations if complementarity is required.
TABLE 6 1-mer duplex barcode design table
5′-barcode    3′-barcode    Mismatch
A             A             1
A             C             1
A             G             1
A             T             0
C             C             1
C             G             0
C             T             1
C             A             1
G             A             1
G             T             1
G             C             0
G             G             1
T             T             1
T             A             0
T             C             1
T             G             1
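The 16-versus-4 count can be checked by enumeration. The following sketch assumes that the Mismatch column is 0 when the two 1-mers are Watson-Crick complements and 1 otherwise, which is how the table reads; the function name `one_mer_design_table` is a hypothetical name used only here.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def one_mer_design_table():
    """Enumerate every (5' barcode, 3' barcode, mismatch) row for 1-mers."""
    rows = []
    for five, three in product("ACGT", repeat=2):
        # 0 if the pair could hybridize (complementary), 1 otherwise
        mismatch = 0 if COMPLEMENT[five] == three else 1
        rows.append((five, three, mismatch))
    return rows


rows = one_mer_design_table()
complementary_rows = [row for row in rows if row[2] == 0]
print(len(rows), len(complementary_rows))
```

Dropping the complementarity requirement therefore leaves all 16 rows available, while requiring it keeps only the 4 rows with a mismatch of 0.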
[0073] A possible look up table for the 1-mer duplex barcode is:
TABLE 6.1
5′-barcode    3′-barcode
A             A
C             T
G             G
T             C
[0074] Another possible look up table for the 1-mer duplex barcode is:
TABLE 6.2
5′-barcode    3′-barcode
A             C
C             T
G             A
T             G
[0075] As another example, barcodes of equal or unequal length can be used as duplex barcodes.
TABLE 7 barcodes of different length
5′-barcode    3′-barcode
AA            CCT
CC            TT
GC            A
TA            CCGG
TC            CC
[0076] As another example, barcodes with the same sequence can be used as duplex barcodes. This design has the added advantage that a look up table is not even needed.
TABLE 8 an example design of 96 duplex barcodes with same sequences
5′-Barcode    3′-Barcode
AATGCAC       AATGCAC
ATCCTTG       ATCCTTG
GTATGCT       GTATGCT
AGATGCG       AGATGCG
ATGTGGC       ATGTGGC
CGATACT       CGATACT
GACTGTA       GACTGTA
GGTACGT       GGTACGT
TAACGCG       TAACGCG
TTACGGC       TTACGGC
AACCGCA       AACCGCA
AACGTCC       AACGTCC
ACGCAGT       ACGCAGT
AGGCACA       AGGCACA
CAACACA       CAACACA
GATCGCT       GATCGCT
GGATATA       GGATATA
GGATCAG       GGATCAG
TTGTGCG       TTGTGCG
ACCGAAC       ACCGAAC
AGCGACT       AGCGACT
AGGTTAA       AGGTTAA
ATATTGG       ATATTGG
ATCACAC       ATCACAC
ATTAGGT       ATTAGGT
CAATATC       CAATATC
CAGTGGA       CAGTGGA
CCGATAG       CCGATAG
CTACATT       CTACATT
CTGTATA       CTGTATA
GAGTGAG       GAGTGAG
GATCCAG       GATCCAG
GCTACAC       GCTACAC
GCTCGAA       GCTCGAA
GGATTCC       GGATTCC
GGTGCAA       GGTGCAA
GTAAGAG       GTAAGAG
GTACCTG       GTACCTG
GTAGTCG       GTAGTCG
GTGTACC       GTGTACC
TCCGACA       TCCGACA
TTAGCCT       TTAGCCT
ACTGTTC       ACTGTTC
AGAATAC       AGAATAC
ATCAATA       ATCAATA
ATGCGCT       ATGCGCT
ATTGTAG       ATTGTAG
CAATGAT       CAATGAT
CATACAT       CATACAT
CATGCCA       CATGCCA
CGCGATA       CGCGATA
CGGCAAT       CGGCAAT
CTGTTAT       CTGTTAT
GAATAGG       GAATAGG
GATTATT       GATTATT
GCCTTAC       GCCTTAC
GTTAGTA       GTTAGTA
TACCATA       TACCATA
TATAATG       TATAATG
TCGAAGG       TCGAAGG
TCGACAT       TCGACAT
TGAAGTG       TGAAGTG
TGTCCAT       TGTCCAT
TGTTAAG       TGTTAAG
AACTTAG       AACTTAG
AAGCTGG       AAGCTGG
AATATCG       AATATCG
AATGTGT       AATGTGT
ACACTTA       ACACTTA
ACGAACC       ACGAACC
AGGTAGG       AGGTAGG
ATAGGCC       ATAGGCC
ATATAAC       ATATAAC
ATGATTC       ATGATTC
CACTCAC       CACTCAC
CAGGAAC       CAGGAAC
CCGGATT       CCGGATT
CGGATGT       CGGATGT
CTAATAA       CTAATAA
CTCTGTG       CTCTGTG
CTGGTGA       CTGGTGA
CTTGTCC       CTTGTCC
GAGCAGA       GAGCAGA
GCAATGG       GCAATGG
GCCGTTA       GCCGTTA
GCTGAAG       GCTGAAG
GTTCTTC       GTTCTTC
GTTGGAT       GTTGGAT
TAACCTT       TAACCTT
TAGCGGT       TAGCGGT
TAGGACG       TAGGACG
TCGCTCA       TCGCTCA
TCGTAAC       TCGTAAC
TGATTAT       TGATTAT
TGCACAG       TGCACAG
TGGATCG       TGGATCG
[0077] In certain embodiments, the two barcodes used herein (for a particular template) do not have the same length (and may or may not be complementary, and may be the same sequence). Such embodiments are shown in exemplary
[0078] In some embodiments, the barcodes (e.g., non-complementary barcodes or same-sequence barcodes) may have a length of 1, 2, 3, 4, 5, 6, 7, or more nucleotides. In particular embodiments, the non-complementary barcodes have 1, 2, 3 or more mismatches. In particular embodiments, barcodes (e.g., only barcodes) that form pairs with maximal mismatches (Hamming distance) are employed. In other embodiments, the barcodes are exactly the same. In further embodiments, the two barcodes in a combination differ in length, such as by 1, 2, 3, 4, or more nucleotides. In particular embodiments, one of the two barcodes in the combination is not present (zero nucleotides). In other embodiments, the barcodes have certain constant and/or degenerate bases.
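One way to count mismatches for barcodes longer than one nucleotide is sketched below. It rests on an assumption not stated in the text: that a mismatch is counted against the antiparallel (reverse-complement) pairing of the 5′ barcode, which reduces to the 0/1 Mismatch column of Table 6 for 1-mers. The function names are hypothetical, and barcodes of unequal length are rejected because Hamming distance is only defined for equal lengths.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def mismatches(five_barcode: str, three_barcode: str) -> int:
    """Hamming distance between the 3' barcode and the perfectly pairing partner."""
    if len(five_barcode) != len(three_barcode):
        raise ValueError("Hamming distance needs barcodes of equal length")
    partner = reverse_complement(five_barcode)
    return sum(1 for x, y in zip(partner, three_barcode) if x != y)


# A same-sequence pair from the 96-barcode example versus a complementary pair.
print(mismatches("AATGCAC", "AATGCAC"))
print(mismatches("AATGCAC", "GTGCATT"))
```

Under this assumption a pair with maximal mismatches is one whose count equals the barcode length, so that no position of the two barcodes can base-pair.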
[0079] In certain embodiments, the thermostable lysine-mutants employed in the methods, kits, systems, and compositions herein with the single stranded adaptors are as provided in SEQ ID NOs: 1-21 in Tables 1 and 2 below, or N- or C-terminally truncated versions thereof.
TABLE 1 Thermostable Lysine-Mutant ssDNA/RNA Ligases

SEQ ID NO: 1 (Thermococcus kodakarensis)
MVSSYFRNLLLKLGLPEERLEVLEGKGALAEDEFEGIRYVRFRDSARNFRRG
TVVFETGEAVLGFPHIKRVVQLENGIRRVFKNKPFYVEEXVDGYNVRVVKVK
DKILAITRGGFVCPFTTERIEDFVNFDFFKDYPNLVLVGEMAGPESPYLVEG
PPYVKEDIEFFLFDIQEKGTGRSLPAEERYRLAEEYGIPQVERFGLYDSSKV
GELKELIEWLSEEKREGIVMKSPDMRRIAKYVTPYANINDIKIGSHIFFDLP
HGYFMGRIKRLAFYLAENHVRGEEFENYAKALGTALLRPFVESIHEVANGGE
VDETFTVRVKNITTAHKMVTHFERLGVKIHIEDIEDLGNGYWRITEKRVYPD
ATREIRELWNGLAFVD
Where X is any amino acid except K.

SEQ ID NO: 2 (Pyrococcus yayanosii)
MVSSHFKEILMRLGLPEDRIEVLEAKGGITEEEFDGIRYLRFKDSARGLRRG
TVVFDEANVILGFPHIKRVVSLRAGVMRIFKRTPFYVEEXVDGYNVRVALVS
DRVLAITRGGFVCPFTTERILDFVPEEFFKDYPHLVLVGEMAGPESPYLVEG
PPYVEEDIRFFLFDIQEKGTGKSLPVQERLKLAEEYGIPHVKVFGLYTVDRI
EDLYDLIERLSREGREGVVMKSPDMKRVVKYVTPFANVNDVKIGAKVFFELP
PGYFMSRIMRLAFYVAERRIKGERFEELARNLGKALLEPFVESIWDVEQGDE
IAEVFRIRVKRIETAYKMVTHFERLGLNIKIEDIEEVGGMWRITFKRAYDEA
TREIRELIGGRAFVD
Where X is any amino acid except K.

SEQ ID NO: 3 (Pyrococcus horikoshii)
MVSSKFKDILYRLGIPEGKVEDLEARGGLVEDKFDDIKYLRIRNSVGKLRRG
TVVLNDKFIILGFPHIKRIVNLKNGIKRTFKRGEFYVEEXVDGYNVRVVKFR
GKXLGITRGGFICPFTTERISDFIPEEFFKDHPNLILVGEMAGPESPYLVEG
PPYVKEDIQFFLFDIQELGTGRSLPVEERLKIAEEYGISHVEVFGKFTYKDL
EEIYEIVERLSREGREGIVMKSPDMRKMVKYVTPYANINDIKIGARVFYELP
PGYFTSRISRLAFYIAEKRLRGENFEELAKELGKALLQPLVESIHDVEQEDE
IAEVFKVRVKKIETAYKMVTHFEKLGLRIEIVDIEEMKGGWRITFKRLYPDA
TEEIRELIGGKSFVD
Where X is any amino acid except K.

SEQ ID NO: 4 (Pyrococcus abyssi)
MKEVVSSVYKEILVKLGLTEDRIETLEMKGGIIEDEFDGIRYVRFKDSAGKL
RRGTVVIDEEYVIPGFPHIKRIINLRSGIRRIFKRGEFYVEEXVDGYNVRVV
MYKGKMLGITRGGFICPFTTERIPDFVPQEFFKDNPNLILVGEMAGPESPYL
VEGPPYVKEDIQFFLFDVQEIKTGRSLPVEERLKIAEEYGINHVEVFGKYTK
DDVDELYQLIERLSKEGREGIIMKSPDMKKIVKYVTPYANINDIKIGARVFY
ELPPGYFTSRISRLAFYLAEKRIKGEEFERVAKELGSALLQPFVESIFDVEQ
EEDIHELFKVRVKRIETAYKMVTHFEKLGLKIEIVDIEEIKDGWRITFKRLY
PDATNEIRELIGGKAFVD
Where X is any amino acid except K.

SEQ ID NO: 5 (Pyrococcus furiosus)
MENMVSSKFKELLYTLGIPEDKVEILEARGGIMEDEFEGIRYLRFKNSVGKL
RRGTVLFEDGTTVFGFPHIKRIVNLSAGVRKIFKSSEFYVEEXVDGYNVRVV
KFKDRILGITRGGFICPYTTERIAEFVPEEFFKDHKDLVLVGEMAGPESPYL
VEGPPYVKEDIQFFLFDIQDIKTGSSLPVEERLKLAEEYGINHVEVFGRYSY
KDIDDLYELIERLSREGREGIVMKSPDMKKIVKYVTPYANINDIKIGARVFY
ELPGGYFTSRISRLAFYIAEKKIRGEELHNLALQLGKALLQPLVEAIHDVTQ
GDVIAERFRVRVRKIETAYKMVTHFEKLGLEIEIEDIEEIEGGWRVTFKRVY
PEATREIRDLIGGKAFVD
Where X is any amino acid except K.

SEQ ID NO: 6 (Hyperthermus butylicus)
MTWIKNPEPWMVNLVAEKLGLDVERVETLARHGTIRFRGYRDVVYALLRREI
AGHPEGTVVLLERNGARLVPGYPPIQRMVLPTIALPRHFIDKVVVEEXMNGY
NVRLVMFHRKLLAVTRGGFICPYTTARLERLIGGRVRELFREIDPETYTIAG
EVVGLENPYTRYFYPEAPRFDYFVFDLFHELKPLPPLERNELLEKYGLKHVR
LLGVIDKNDVEMFKQIVAELDREGREGVVAKDPEYRVPPLKYTTSAVNIGDV
RYGMRFFMEEGRSFLFSRLLRELFRAYEEGFGDAQLEKLALEFGRAATEPAL
ESIRKVAMGDMLYEEFELVFADEVELEEFTSYMAELGVDIVVVSTSREDEGL
RARMRKIKDTWIQLRKVLDTGLSPVD
Where X is any amino acid except K.

SEQ ID NO: 7 (HbuRN12K106A)
MTWIKNPEPWMVNLVAEKLGLDVERVETLARHGTIRFRGYRDVVYALLRREI
AGHPEGTVVLLERNGARLVPGYPPIQRMVLPTIALPRHFIDKVVVEEAMNGY
NVRLVMFHRKLLAVTRGGFICPYTTARLERLIGGRVRELFREIDPETYTIAG
EVVGLENPYTRYFYPEAPRFDYFVFDLFHELKPLPPLERNELLEKYGLKHVR
LLGVIDKNDVEMFKQIVAELDREGREGVVAKDPEYRVPPLKYTTSAVNIGDV
RYGMRFFMEEGRSFLFSRLLRELFRAYEEGFGDAQLEKLALEFGRAATEPAL
ESIRKVAMGDMLYEEFELVFADEVELEEFTSYMAELGVDIVVVSTSREDEGL
RARMRKIKDTWIQLRKVLDTGLSPVD

SEQ ID NO: 8 (Aeropyrum pernix)
MASAAEVLASALRAVGVDPGSVDLEALSTRRSVRVSRFEDVVYVGFRRQFRG
VPEGTLVAFRRGEQIVVWGYPSIKRMLLPRVAVPRWFPGPTVLVEEXMNGYN
VRVFTLGGMVYAATRGGLICPYTTRRLRRLYGGALQKILEDLGAEGSFIAGE
VVGLENPYTRYYYEEAPGFGYFIFDIFKGGRQLPPRVKFSLAPEYGLKTVNL
LAEIPATASGVERLYTIVEDLEKRGREGVIVKDPEGRVEPLKYTTSRINIGD
IRLGMRYPFEEGRSFLFPRILREIFREWETGRRRYGELGEAILAPAIEAVEA
VSRGGRLVEEFELVFANEVEAEEVIAYFASLGVHLEIAGVARGVDGVRVAFR
KPRKSEGEIARILETGISPLD
Where X is any amino acid except K.

SEQ ID NO: 9 (Staphylothermus marinus)
MDENELVNKLSDALGIEYEKLSKHIGRSIRLMKYGELNYVVERRDLLGYREG
TTILLGEEPLIVHGYPSIQRLAFIEGVSKHMIDNVVVEEXMNGYNVRVVYYM
NNIYAITRGGYICPYTTARIRKLYSKNIKLAYQEYPDTILVGEVVGTENPYV
IYDYPEARGFDYFIFDTMKKDKLQPLRIRDEIAEKYSLKTVRILDIINKRDI
DRLKTIINRLEKERREGVVLKDPYQRVPPLKYTTIYINIRDIWEGMRYPFDE
GRGYLFSRIVRLIAQGYEYDWNNTELDRIALKLGRAILEPAINSLKKRANGE
IIASKYTLVFPSEDDLSKYIEYAESIGMDFIFRVVEKREDGCIVVELFKMKE
THNIYTKMLKTGYSPLD
Where X is any amino acid except K.

SEQ ID NO: 10 (Pyrolobus fumarii)
MIRIPLERWMIEKLAEALNVNIEEAERLARRRNVVRLMKWRNVTYFSLRKDV
YGLREGTLIAVWPDGYRVVPGYPSIQRVLLPSVALPKHFIDKIVVEEXLNGY
NVRVVKLRDEIVAVTRGGLICPYTTQRIRKLYGDKLTSLFREEGEELVVAGE
VIGLENPYVRFYYPEAGGFAYFIFDIVHGEKFLPPHERKEIVEKHGLLHVPV
LGEIDKNDIKAFRKIIEDLERRGREGVVLKDPEYRVPPLKYTTSFINIHDIE
IGMRFPFDEGRNYLFSRILREIFKAVEEGWDDRRLLLAEQNLGKAILEPAIE
AVKEVKNGKMLYEEFMLPEDTRDDFEEFLDYMASLGVDIIVAGVEQRSDGSI
VARIRKVKDTWREVQKILETGLSPID
Where X is any amino acid except K.

SEQ ID NO: 11 (Aquifex aeolicus)
MISPELVKEALKKKKVRSEEAFGLEYLRENDDYKDIPRGTAIFKDFIIWGYP
HIGRIFLLETGLREQFEAPFWVEEXVDGYNTRIFKYGDNYYALSRGGFICPF
TTDRLPDLIDLRILDENPDLVICAEVAGPENPYIEESPPYVKEDVQLEVEDE
MKKNEQGFLSQEEKMELIEKYNLPHVEILGRETASEEGIKKIKEILKRENEE
GREGVVFKEDSERNKRAKYITSYANLMDIKTNAKNMLQLPPEYYTNRILRLV
LFMYEEGLKTTEHLYEELGRAFIDGLFQAIEQFEKEHKVYKTFTCKERKKEN
AIALLELLSKTSKHIQVKERRLEKEGDYWRLEFDKVFLNMTGLLGHLLSGGI
VYD
Where X is any amino acid except K.
TABLE 2

SEQ ID NO: 12 (Pyrodictium delaneyi)
MTWIHSPESWMLDVVAEALGIDRERVEHLARHRTIRYRVERGILYASLRREV
AGHPEGTVIVFGRGWWRLIPGYPSIQRMVLPSVALPRHFVDKIVVEEXLNGY
NVRVALIDDRIIAVTRGGFICPYTTSRLERIMGNQLKDMLRELGPEEHVAAG
EVIGLENPYTRYFYPEAPRFGYFVFDVFREGKPLPPGWRDEVTEKHGVPHVP
VLGVLDKNDIEGFKKIVERLNQEGREGVIVKDPEYRVPPLKYTTPATNIGDI
RYGMRFFMEEGRGFLFSRLLREIFRVYEEGLTGPRLDALALELGRAALQPAI
ETVKKVAAGDMVYEEFELEFASRSELEEFMDYMQGLGVDLVLVEIREENGLL
KTRIRKMKETWLQVRKMLETGLTPID
Where X is any amino acid except K.

SEQ ID NO: 13 (Candidatus Methanoperedens)
MRRDVSQFANKLDIGKVSELLDIPEHRITGALKRKTIQYVWGKKELFRFDKP
VSSIEGGTSVFTEPFDIVRGFPKISRTLMLSPALQKHESSCRKVAVEEXMNG
YNVRVALIGDALVALTRGGFICPYTTEKAIDLIGYDFFNDHPDLVLCGEMVG
PDSPYVPKTFYDIESLDFFVEDIREKITGKPLSVMERRALVDKYGIKSVRLF
GEFEIGETHSEITRIIKDLGGSQHEGVVIKDPQMVVPPMKYTSSESNCADLR
YAFEFYNDFGRDFFFGRVCREAFQSVEWDEDEESVEKRCRQLGESLLLPMIK
TIKKKKDGERIAENVQIRVKSLDTVKEFEEYLKLVGVDAVFEEPEQTGNEYF
VRIRKMHQSTNDRTEAILGGQLWS
Where X is any amino acid except K.

SEQ ID NO: 14 (Pyrodictium occultum)
MTWIHRPEPWMLDVVADALGLPRERVEELASRRTLRFREFRGLLYASLRRGV
AGHHEGTAVVFGRGWWRVVPGYPPIQRMVLPSVALPRHELDRVVVEEXLNGY
NVRVVLVDDRILAVTRGGLICPYTTSRLERLMGDRLREMLRELGPEDHVAAG
EVIGLENPYTRYFYPEAPRFGYFVFDIFRGGRPLPPRMRDEAAEKHGVPHVP
VLGVLEKTDVEAFKRIVERLDREGREGVVVKDPDYRVPPLKYTTSSTNIGDI
RLGMRFFMEEGWSFLFSRILREIFRVYEEGVEGPRLDAIALELGRAALQPAV
ETVKKVAGGYMVYEEFELEFAGRDELEEFMDYMQSLGVDVVLVEAREEGGVL
RARMRKIKETWIRVRRILETGVSPID
Where X is any amino acid except K.

SEQ ID NO: 15 (Thermoproteota archaeon)
MGWVQPEPWMVDAVAEALGLERERVESLAKHRTIRFRVERGILYASLRRELG
GYPEGTVVIFGRGWSRVVHGYPPIQRMVLPSVALPRHFVDRIVVEEXLNGYN
VRVVLVDGRLLAVTRGGFICPYTTDRIERLLGGRLREMLRELGEEEHVAAGE
VIGLENPYTRYYYPEAPRFGYFVFDIFRSGKPLPPRVRDEATEKHGVPHVPV
LGVLDKGDIEGERSIVEALERRGREGVVVKDPEYRVHPLKYTTHATNVGDIR
LGMRFFMEEGRGFLFSRLLREIFRAYEQGLQGPRLEKLATEIGLAALEPALE
TVRLVAAGEPVYEEFELEFENRDRLEEFLEYMQSLGVDVVVAGTYERDGMLV
ARVRKMRDTWLQVRRMLETGLTPID
Where X is any amino acid except K.

SEQ ID NO: 16 (Geoglobus acetivorans)
MFVSESLGLSKHLGETLEERKILREALISHSFFSDVIEAVREDKKFGEIEEG
TVVAKTINGVRIVRGFPKIKRALVLNPTLKKHFENEVAVEEXMNGYNVRIAR
FGKNLYAMTRRGIICPYTTEKARELINPEFFKDHSDLVLCCEAVGEESPYVP
KSMYGVEGLDFFVEDIREERTNRPLPVEEKLRLCEEYGLRHATYFGTYDVDV
AHDEIKDIISDLAGKGREGVVIKDPEMKLSPLKYTTSQTNAEDLKYAFRFEN
DYGKDFMESRIVREGFQSFEFNEGDKEFRERCLRLGMAILKPMVESIREVAL
GGKVSEKLRLRFGSLDVMNLFFEQWKRSKVDFEITDIKKDGKDIVVFVNKTM
RNTTDKIKAHLEGIPW
Where X is any amino acid except K.

SEQ ID NO: 17 (Archaeoglobus profundus)
MKFIAEALGVSQAVIEKLNEKNLIRLAFIKHPFERDVIEAYKLERKVGEFEP
GTLIAKTVEGLRVVRGYPKIKRALTLYPTIKKHFKGEVVLEEXMNGYNVRLV
KFGENIYAITRGGFICPYTTEKARRLVNLDFFKDNPKLMLCCEAVGEESPFV
PKDVYGVKTIDFYVEDIRDQKTNIALPIKQKEKLAEEYGLKLAPILAEVQVS
KAHEIAKEIILELDKRGREGIVIKDPMMRRPPIKYTTSQCNCSDLSYAFRFF
EEYGKDEMESRIIREAFQSFEFRENEEKFKDRCLRLGEAILSMVKSIKEVNE
GKRIVEKMRLRFYDLEIFELFKEHIRRMGIRAEFSNPKREEDGYVVWVYRHI
MSTTDKIKYILAGNLY
Where X is any amino acid except K.

SEQ ID NO: 18 (Thermococcus litoralis)
MVSSHFKSLLLELGISRERIEILESKGGIVEDEFEGIRYLRFKDSAGSLRRG
TVVFDSHNIILGFPHIKRVVHLENGIKRVFKRKPFYVEEXVDGYNIRVAQIE
GRVFAFTRGGFVCPFTTERIEDFVNMEFFKDYPNLVLCGEMAGPESPYLVEG
PPYVKEDIEFFLFDIQEKKTGKSLTVEERLKIAEEYGIPSVEVFGVYDISKI
DELKELIEQLSREKREGIVMKSPDMKKIVKYVTPYANVNDIKIGARIFFDLP
HGYFMQRIKRLAFYLAEKRVQDEEFEKYARALGRALLEPFVESIWDVSAGEE
IAEVFTVRVKHIETAYKMVSHFERLGLKIHIEDIEEMPQGYWRITFKRVYPD
ATREIRELWSGHAFVD
Where X is any amino acid except K.

SEQ ID NO: 19 (Pyrococcus sp. ST04)
MVSSRFKDILTSLGISEERIEILEAKGGIVEDEYEGLRYLRFKDSAGKLRRG
TVVFDFDKIILGFPHIKRVVNLEKGIRRIFKRGEFYVEEXVDGYNVRVTKVG
ERILAITRGGFICPFTTERITDFVPEEFFKDNPNLVLVGEMAGPESPYLVEG
PPYVKEDIKFFLEDVQEINTGKSLPVEERLKLAEEYGIPHVEVFGKYTRDDI
GELYALIEKLSEEGREGIVMKSPDMKKIVKYVTPYANINDIKIGARVFYELP
PGYFTSRISRLAFYIAERKIRDEELRKLAEDLGKALLQPFVEGILDVEQGEE
IAETFKIRVKKIETAYKMVTHFEKLGLNIEIVDIEEMDGLWRITEKRVYSDA
TEKIKELVGGKAFVD
Where X is any amino acid except K.

SEQ ID NO: 20 (Ignicoccus pacificus DSM13166)
MKSERGIMKYKDFIYYPFKKGGFGKGSVIIYHNDDVKIVPGYPSIKRLVLLS
KVPEHFPEGVSVEEXMNGYNVRAMIVGGDVAFITRGGYLCPYTNARLNTLYG
EKVKALLEELPPGSFLAGEVVGVENPYVRVKYPEAPYFDYFIFDIFVKTEDG
WRQMPVEERHEIVKRHGLRSVRLLGTFESSEAPLKIKEIIDREDKEGREGVV
MKDPEYKRSPAKYTGSYTNIGDIREGMRYPFDEGKDYLFPRIVREIFKVYEE
GLSDKELERRALELGMAILKPAVESLKEVAQGETLFERFVLRFPHEEDLEEY
LNYTRSLGVKVIVEEKWEEGEWIVVKAKKFKNTSNVYRSMLKSGQTPLD
Where X is any amino acid except K.

SEQ ID NO: 21 (Palaeococcus pacificus)
MVSSYFKGILLNLGLDEERIEVLENKGGIVEDEFEGMRYLRLKDSARSLRRG
TVVEDEHNIILGFPHIKRVVQLENGIRRAFKRKPFYVEEXVDGYNVRVAKIG
EKILVFTRGGFVCPFTTERIEDFITLDFFKDYPNMVLCGEMAGPESPYLVEG
PPYVKEDIQFFLEDIQEKKTGRSLPVEERLKLAEEYGIPSVEVEGLYDLSRI
DELHALIDRLTKEKREGIVMKSPDMKKIVKYVTPYANINDIKIGARIFFDLP
HGYFMQRIKRLAFYLAERKIRGEEFDEYARALGKVLLEPFVESIWDISSGDD
EIAELFTVRVKKLETAHKMVTHFERLRLKIHIDDIEVLDNGYWRITEKRVYP
DATKEMRELWNGHAFVD
Where X is any amino acid except K.
[0080] In some embodiments, the sequences in Tables 1 and 2 above are used to perform a sequence search (e.g., using BLAST or PSI-BLAST) to find other thermostable ssDNA/RNA ligases from other species (e.g., by finding those with 30% . . . 50% . . . 60% or more homology). For a particular candidate homolog that is identified, the next step is to determine the growth temperature of the species it is from. In general, a useful single strand ligase candidate would come from a species that has a growth temperature range higher than about 65° C. Next, one can perform a multiple sequence alignment and locate the conserved catalytic motif, EKxxG (x is any amino acid; such as shown in Tables 1 and 2 above). Next, within the catalytic motif, K is mutated to any other amino acid (e.g., to make a step 3 ligase mutant). In certain embodiments, the lysine (K) in such Motif I is mutated to another amino acid, preferably alanine (A), serine (S), cysteine (C), valine (V), threonine (T), or glycine (G). Such candidate enzymes (e.g., mutant enzymes) can then be screened for ssDNA and ssRNA activities (and thermostability), for example, using the same procedure as in Example 1 below (e.g., replacing the step 3 ligase mutant in Example 1 with the candidate mutant and measuring performance).
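The motif-location and mutation steps of this paragraph can be sketched with a regular expression. This is a minimal sketch under stated assumptions: it searches a single sequence rather than a multiple sequence alignment, the function name `mutate_motif_lysine` is hypothetical, and the test fragment is taken from the region around the motif in SEQ ID NO: 6 with X set to K, which is an assumed precursor and not a sequence listed in the tables.

```python
import re

# Motif I as written in the paragraph above: E, K, any two residues, then G.
MOTIF_I = re.compile(r"EK..G")


def mutate_motif_lysine(sequence: str, replacement: str = "A") -> str:
    """Replace the lysine of the first Motif I match with another residue."""
    if replacement == "K":
        raise ValueError("replacement must differ from lysine")
    match = MOTIF_I.search(sequence)
    if match is None:
        raise ValueError("Motif I (EKxxG) not found")
    k_index = match.start() + 1  # the K directly follows the motif's E
    return sequence[:k_index] + replacement + sequence[k_index + 1:]


# Assumed precursor fragment (region around the motif of SEQ ID NO: 6, X = K).
fragment = "IDKVVVEEKMNGYNVRLV"
print(mutate_motif_lysine(fragment, "A"))
```

Only the lysine inside the motif is changed; other lysines in the sequence, such as the one earlier in the fragment, are left untouched. The K-to-A result for this fragment matches the corresponding region of SEQ ID NO: 7.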
[0081] In certain embodiments, the single stranded adaptors disclosed herein are used in library preparation (library prep) and/or then in sequencing methods, such as in attaching adaptors to library fragments for subsequent sequencing. For example, in some embodiments, the disclosure provided herein finds use in a Second Generation (a.k.a. Next Generation or Next-Gen), Third Generation (a.k.a. Next-Next-Gen), or Fourth Generation (a.k.a. N3-Gen) sequencing technology including, but not limited to, pyrosequencing, sequencing-by-ligation, single molecule sequencing, sequence-by-synthesis (SBS), sequence-by-binding, semiconductor sequencing, massive parallel clonal, massive parallel single molecule SBS, massive parallel single molecule real-time, massive parallel single molecule real-time nanopore technology, etc. Morozova and Marra provide a review of some such technologies in Genomics, 92:255 (2008), herein incorporated by reference in its entirety.
[0082] Any number of DNA sequencing techniques are suitable, including fluorescence-based sequencing methodologies (See, e.g., Birren et al., Genome Analysis: Analyzing DNA, 1, Cold Spring Harbor, N.Y.; herein incorporated by reference in its entirety). In some embodiments, the present disclosure finds use in automated sequencing techniques understood in the art. In some embodiments, the present technology finds use in parallel sequencing of partitioned amplicons (PCT Publication No: WO2006084132, herein incorporated by reference in its entirety). In some embodiments, the technology finds use in DNA sequencing by parallel oligonucleotide extension (See, e.g., U.S. Pat. Nos. 5,750,341, and 6,306,597, both of which are herein incorporated by reference in their entireties). Additional examples of sequencing techniques in which the technology finds use include the Church polony technology (Mitra et al., 2003, Analytical Biochemistry 320, 55-65; Shendure et al., 2005 Science 309, 1728-1732; U.S. Pat. Nos. 6,432,360, 6,485,944, 6,511,803; all of which are herein incorporated by reference in their entireties), the 454 picotiter pyrosequencing technology (Margulies et al., 2005 Nature 437, 376-380; US20050130173; herein incorporated by reference in their entireties), the Solexa single base addition technology (Bennett et al., 2005, Pharmacogenomics, 6, 373-382; U.S. Pat. Nos. 6,787,308; 6,833,246; herein incorporated by reference in their entireties), the Lynx massively parallel signature sequencing technology (Brenner et al. (2000). Nat. Biotechnol. 18:630-634; U.S. Pat. Nos. 5,695,934; 5,714,330; all of which are herein incorporated by reference in their entireties), and the Adessi PCR colony technology (Adessi et al. (2000). Nucleic Acid Res. 28, E87; WO 00018957; herein incorporated by reference in its entirety). In certain embodiments, the library preparation and sequencing technologies are as described in any of the following U.S. patents, each of which is herein incorporated by reference: 9,752,188; 10,570,451; 11,479,807; 8,383,345; 10,876,172; 9,598,731; 9,902,992; 10,801,063; 11,091,797; 8,532,930; 9,639,657; and 10,011,870.
[0083] In certain embodiments, the 3′ end of the single-stranded adaptors herein has a 3′ end blocking group such as a phosphate group or a modified noncanonical nucleotide to prevent ligation between the 5′-adenylated end and 3′ end of the adaptor. If the blocking group is, for example, a nucleotide with a 3′-phosphate, such blocking group can be removed by T4 polynucleotide kinase (T4 PNK), calf intestinal alkaline phosphatase (CIP), or shrimp alkaline phosphatase (SAP). In certain embodiments, where CIP or SAP are used, one may employ a round of T4 PNK to re-phosphorylate the 5′-end of the proximal strand.
[0084] Another class of base blocking and de-blocking agents can leverage reversible terminator technologies widely used in the sequencing industry.
[0085] In certain embodiments, the single-strand adaptors herein comprise a non-canonical base in the cleavable region, such as inosine, uracil, 5-formylcytosine, 5-carboxylcytosine, or 8-oxoguanine. In some embodiments, wherein inosine is present in the cleavable region, Endonuclease V or a similar enzyme is used to perform strand cleavage. In certain embodiments, wherein uracil, 5-formylcytosine, or 5-carboxylcytosine is present in the cleavable region, a combination of uracil DNA glycosylase (UDG) and Endonuclease VIII (or similar enzymes) is employed for strand cleavage. In particular embodiments, where 8-oxoguanine is present in the cleavable region, a thermostable OGG (oxoguanine glycosylase) is employed for strand cleavage. In particular embodiments, a cleavable backbone linkage can be, for example, phosphorothioate DNA, which is cleaved by iodine (I2) (Qiang Huang et al. Origin of iodine preferential attack at sulfur in phosphorothioate and subsequent P-O or P-S bond dissociation, PNAS, vol 119, 2022).
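The pairings of cleavable element and cleavage reagent listed in this paragraph can be collected into a small look-up for protocol planning. The sketch below simply restates the pairings given above; the dictionary name `CLEAVAGE_REAGENTS` and the function `reagents_for` are hypothetical names, and the "or similar enzymes" alternatives are not represented.

```python
# Cleavable element in the adaptor -> reagent(s) named in the paragraph above.
CLEAVAGE_REAGENTS = {
    "inosine": ("Endonuclease V",),
    "uracil": ("uracil DNA glycosylase", "Endonuclease VIII"),
    "5-formylcytosine": ("uracil DNA glycosylase", "Endonuclease VIII"),
    "5-carboxylcytosine": ("uracil DNA glycosylase", "Endonuclease VIII"),
    "8-oxoguanine": ("thermostable oxoguanine glycosylase",),
    "phosphorothioate": ("iodine",),
}


def reagents_for(cleavable_element: str):
    """Return the reagent tuple for a cleavable element (case-insensitive)."""
    key = cleavable_element.strip().lower()
    if key not in CLEAVAGE_REAGENTS:
        raise KeyError(f"no cleavage reagent listed for {cleavable_element!r}")
    return CLEAVAGE_REAGENTS[key]


print(reagents_for("Inosine"))
```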
[0086] In particular embodiments, the cleavable region is cut with an endonuclease. For example, a secondary oligo (as shown in
[0087] In certain embodiments, the methods and compositions preserve methylation on 3′ and 5′ protruding ends of target DNA as, for example, no end-repair or A-tailing is required of the target DNA. Conventional library prep uses an end-repair and A-tailing (ER/AT) step as part of its workflow. The enzymatic reactions in ER/AT write to the starting DNA templates by filling in the 3′-recessed ends or erase from the starting DNA template by removing the 3′-protruding ends, as shown in
[0088] In certain embodiments, the first and/or second nucleic acid sequence (e.g., that make up the adapters) is/are methylated at every (or almost every) cytosine present. In certain embodiments, the methylation is 5-methylcytosine (5mC) and/or 5-hydroxymethylcytosine (5hmC). Such methylation, for example, can be introduced during synthesis of the nucleic acid sequences. During bisulfite conversion (e.g., as part of bisulfite sequencing methods herein), 5-methylcytosine (or 5hmC) is not converted, while unmodified cytosine is converted to uracil. The conversion can be either chemical (such as bisulfite conversion) or enzymatic (such as EM-seq). Methylation of the nucleic acid sequences herein prevents, for example, any elements present in the adapters (such as flow cell binding sequences and universal primer binding sites) from having their sequences changed during bisulfite conversion (or other conversion). In this regard, the adapter sequences can still bind to the universal primers employed and can still bind to flow cell sequences (e.g., such as in Illumina sequencers). An exemplary workflow employing such methylated nucleic acid sequences is shown in
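The protective effect of adaptor methylation during conversion can be illustrated with a toy model. The sketch below is a simplified assumption, not a description of any actual conversion chemistry: unmethylated cytosines are read as T (uracil after conversion, thymine after amplification), while cytosines at methylated positions are left unchanged. The function name bisulfite_convert is hypothetical; the adaptor segment is taken from the ILMNAda2 sequence in Table 3.

```python
def bisulfite_convert(seq: str, methylated_positions: set) -> str:
    """Toy model of conversion: unmethylated C reads as T, methylated C stays C."""
    converted = []
    for index, base in enumerate(seq):
        if base == "C" and index not in methylated_positions:
            converted.append("T")  # C -> U during conversion, read as T after PCR
        else:
            converted.append(base)
    return "".join(converted)


# Segment of the ILMNAda2 adaptor from Table 3.
adaptor = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"

# Case 1: every cytosine in the adaptor is methylated (as described above).
all_cytosines = {i for i, base in enumerate(adaptor) if base == "C"}
protected = bisulfite_convert(adaptor, all_cytosines)

# Case 2: no cytosine is methylated.
unprotected = bisulfite_convert(adaptor, set())

print("fully methylated :", protected)    # identical to the input sequence
print("unmethylated     :", unprotected)  # every C now reads as T
```

In the fully methylated case the adaptor sequence is unchanged, so primer and flow cell binding sites remain intact; in the unmethylated case every cytosine is lost from the sequence.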
[0089] In certain embodiments, rolling circle amplification is performed instead of PCR amplification, as shown in
EXPERIMENTAL
Example 1
Single-Stranded End-Preserving Adaptors for Library Preparation
[0090] In this Example, a library preparation approach (DISTAL-seq) is illustrated based on a ligation scheme termed duplex-retaining single strand tail ligation (DISTAL ligation). In certain embodiments of DISTAL ligation, a 5′-adenylated single-strand adaptor DNA is ligated to the 3′-ends of the duplex DNA at elevated temperature (e.g., 75° C.), catalyzed by a step 3 ligase mutant (HyperLigase in this example), with the DNA duplex still retained (see,
[0091] Based on embodiments of DISTAL-seq, DUET-seq (duplex end restoration sequencing) was developed by incorporating strand-specific unique molecular identifiers (UMIs, aka barcodes which can be unique or non-unique) into the single strand adaptor, so that a portion of the resulting reads can be paired to its original duplex form (
[0092] In summary, this example demonstrates the principles and utilities of certain embodiments of DISTAL ligation and of the library preparation workflows (embodiments of DISTAL-seq and DUET-seq) that start with it. These methods provide advantageous alternatives to the conventional ER/AT-based NGS preparation methods.
METHODS AND MATERIALS
Single-Strand Adaptor Preparation
[0093] Two types of single-strand adaptors were prepared and used in this example, one for an embodiment of the regular DISTAL-seq workflow and one for duplex end-restoration sequencing (DUET-seq), as listed in Table 3. All oligos were ordered from Integrated DNA Technologies (IDT). All oligo purification was done using the Monarch DNA purification kit (NEB). For Ampure clean-up, Ampure XP beads were purchased from Beckman-Coulter.
TABLE-US-00009 TABLE 3 Single-strand adaptor oligos and PCR primers used in this example

Name               Sequence  SEQ ID NO:  Purification
ILMNAda2           NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC/ideoxyU/ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNNN/3Phos/  22  HPLC
dupada_3           GTCGABBBBBBBBAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC/ideoxyI/TACACTCTTTCCCTACACGACGCTCTTCCGATCT  23  HPLC
pBR322_4288F       ACTCTTCCTTTTTCAATATTATTGAAG  24  Desalted
pBR322_85R         ACACGGTGCCTGACTGCGTTAGCAATTTAAC  25  Desalted
pBR322_4320R       AGGTTAATGTCATGATAATAATGGTTTC  26  Desalted
lambda_4500F       GGGTTTTCTTTTGTGCGCTTGCAGGCCAGC  27  Desalted
lambda_4800R       AGCAGAATGCCGTCCACCATCGGATCGCTGG  28  Desalted
lambda_4500F_3res  GGGTTTCCTCAGCTCTTTTGTGCGCTTGCAGGCCAGC  29  Desalted
lambda_4800R_3res  AGCAGAACCTCAGCTGCCGTCCACCATCGGATCGCTGG  30  Desalted
lambda_4500F_5res  GGGTTTTGCTGAGGCTTTTGTGCGCTTGCAGGCCAGC  31  Desalted
lambda_4800R_5res  AGCAGAATGCTGAGGGCCGTCCACCATCGGATCGCTGG  32  Desalted

1. List of base modifications: /ideoxyU/: internal dU; /ideoxyI/: internal dI; /3Phos/: 3′-phosphate; B = C/G/A; N = A/C/G/T
2. Underlined sites denote Nb.BbVCI recognition sites.
[0094] To prepare a DISTAL-seq adaptor, ILMNAda2 (Table 3) was resuspended in water to 20 uM. 5′-phosphorylation was done by mixing 26 ul ILMNAda2, 3 ul 10× T4 ligase buffer (NEB), and 1 ul T4 polynucleotide kinase (3′ phosphatase minus) (NEB). The reaction was incubated at 37° C. for 1 hour and column purified with a 30 ul elution volume. 5′-adenylation was done using the 5′-adenylation kit (NEB), by mixing 6 ul phosphorylated ILMNAda2, 2 ul 10× adenylation buffer, 2 ul of 1 mM ATP, and 2 ul Mth RNA ligase in a 20 ul reaction volume. The reaction was incubated at 65° C. for 1 hour and heat-inactivated at 85° C. for 5 minutes. The reaction was then column purified with a 10 ul elution volume in low-TE buffer.
[0095] For an exemplary DUET-seq adaptor, dupada_3 (Table 3) was first resuspended in water to 20 uM. 3′-extension was done by mixing 26 ul dupada_3, 5 ul 10× reaction buffer, 1 ul dA/U/C/GTP mix (10 mM each), and 1 ul Taq-Klenow (cat #TT-100, MCLAB) in a total volume of 50 ul. The extension reaction was programmed to first heat to 95° C. for 1 min, and then to 68° C. for 10 min. The reaction was column purified with a 26 ul elution volume. Since the extension incorporates a single dU in the newly synthesized 3′ end, USER reagent (NEB) (uracil DNA glycosylase and Endo VIII) was then used to cleave at the dU site and generate the 3′-phosphate end. This was done by mixing 26 ul extended dupada_3, 3 ul 10× T4 ligase buffer (NEB), 1 ul T4 polynucleotide kinase (3′ phosphatase minus) (NEB), and 0.5 ul USER reagent (uracil DNA glycosylase and Endo VIII). The reaction was incubated at 37° C. for 40 min and column purified with a 25 ul elution volume. 5′-adenylation was done as above with the 5′-adenylation kit (NEB), by mixing 6 ul 5′- and 3′-phosphorylated dupada_3, 2 ul 10× adenylation buffer, 2 ul of 1 mM ATP, and 2 ul Mth RNA ligase in a 20 ul reaction volume. The reaction was incubated at 65° C. for 1 hour, heat-inactivated at 85° C. for 5 minutes, and then column purified with a 10 ul elution volume in low-TE buffer.
Distal Ligation and DNA Substrate Preparation
[0096] Purified HyperLigase is from RGENE Inc. (10). Cloning and purification of HyperLigase was described earlier in (10, herein incorporated by reference).
[0097] A typical 50 ul HyperLigase ligation is composed of: 5 ul 10× HyperLigase reaction buffer (700 mM Tris, pH=7.5), 5 ul MnCl.sub.2 (100 mM), 5 ul adenylated adaptor (10 uM), 15 ul 40% (w/v) PEG8000, 1.5 ul 5 M NaCl, 2.5 ul purified HyperLigase, and 15 ul input sample solution containing duplex DNA. Reactions are incubated in a PCR machine at 75° C. (with heated lid on) for 6 hours. Reaction series in
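The final concentrations implied by the typical HyperLigase reaction composition can be worked out by simple dilution arithmetic. The following is a minimal sketch, assuming that all concentrations are referred to the nominal 50 ul reaction volume; the component labels and the function name final_concentrations are hypothetical. Note that the listed component volumes add up to 49 ul rather than 50 ul; how the remaining 1 ul is made up is not stated, so the nominal volume is used here as an assumption.

```python
NOMINAL_VOLUME_UL = 50.0  # assumption: concentrations are referred to 50 ul

# (component, volume in ul, stock concentration, unit).
# None means no stock concentration is stated in the text.
COMPONENTS = [
    ("Tris (10x reaction buffer)", 5.0, 700.0, "mM"),
    ("MnCl2", 5.0, 100.0, "mM"),
    ("adenylated adaptor", 5.0, 10.0, "uM"),
    ("PEG8000", 15.0, 40.0, "% w/v"),
    ("NaCl", 1.5, 5000.0, "mM"),
    ("HyperLigase", 2.5, None, None),
    ("input DNA sample", 15.0, None, None),
]


def final_concentrations(components, total_ul=NOMINAL_VOLUME_UL):
    """Dilution arithmetic: final = stock * (volume added / total volume)."""
    return {
        name: stock * volume / total_ul
        for name, volume, stock, _unit in components
        if stock is not None
    }


final = final_concentrations(COMPONENTS)
listed_volume = sum(volume for _name, volume, _stock, _unit in COMPONENTS)

units = {name: unit for name, _volume, _stock, unit in COMPONENTS}
for name, value in final.items():
    print(f"{name}: {value:g} {units[name]}")
print(f"sum of listed volumes: {listed_volume:g} ul")
```

Under these assumptions the reaction contains approximately 70 mM Tris, 10 mM MnCl2, 1 uM adaptor, 12% (w/v) PEG8000, and 150 mM NaCl.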
[0098] For DNA duplex substrates used in
Distal-Seq and DUET-Seq Library Preparation
[0099] Genomic DNA of E. coli O157 strain EDL933 was ordered from Sigma (cat #IRMM449). Human genomic DNA extracted from blood (buffy coat) was purchased from Sigma (cat #11691112001, Roche). Human cell-free plasma DNA was purchased from PlasmaLab International (Everett, WA). Fragmentation was done either by using a Covaris M220 or by using NEBNext dsDNA fragmentase (NEB cat #M0348) following the manufacturer's instructions. For sonicated genomic DNA, an extra round of end polishing was done by treating the DNA with T4 polynucleotide kinase (PNK) in T4 ligase buffer, followed by bead clean-up.
[0100] In this example, exemplary DISTAL-seq and DUET-seq start directly with HyperLigase ligation using 50 ng fragmented DNA. Briefly, a 50 ul reaction consists of: 5 ul 10× reaction buffer, 5 ul MnCl.sub.2, 2.5 ul HyperLigase, 15 ul 40% (w/v) PEG8000, 5 ul adaptor, 1.5 ul 5 M NaCl, and 15 ul DNA solution. The reaction was incubated at 75° C. for 6 hours and purified by 2 rounds of 1× Ampure bead clean-up with an elution volume of 26 ul in water. De-blocking of the adaptor 3′-ends was done by using 26 ul DNA from the previous step, 3 ul 10× T4 ligase buffer, and 1 ul T4 PNK. The reaction was incubated at 37° C. for 40 min, after which another 1× bead clean-up was done with an elution volume of 10 ul. Circularization was done by using 10 ul DNA solution from the previous step, 1 ul 10× CircLigase reaction buffer, 0.5 ul CircLigase II, and 0.5 ul MnCl.sub.2 (Biosearch Technologies, cat #CL9021). The reaction was incubated at 60° C. for 1 hour, after which another 1× Ampure clean-up was done with an elution volume of 18 ul. Dumb-bell DNA digestion was done by using 18 ul DNA solution from the previous step, 2 ul of 10× rCutSmart, and 0.2 ul of endonuclease. For regular DISTAL-seq, in which an internal dU is embedded within the adaptor, USER reagent (uracil DNA glycosylase and Endo VIII) was used; for DUET-seq, in which an internal dI is embedded within the adaptor, Endonuclease V was used (cat #M0305, NEB). The digestion reaction was incubated at 37° C. for 15 min, followed by heat-inactivation at 65° C. for 20 min. For PCR enrichment, 25 ul of Kapa HiFi HotStart ReadyMix (KR0370, Roche Sequencing) and 5 ul UDI primer mix (part number 10005922, IDT) were added to the digestion reaction mix. PCR conditions followed the manufacturer's protocol. For the E. coli library, 12 cycles were performed; for DUET-seq, 16 cycles were performed. After PCR, purification was done by 0.9× bead clean-up.
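The enzymatic steps of the workflow described above can be summarized as an ordered list of incubations. The sketch below is illustrative only: the step labels and the data layout are hypothetical, and bead clean-ups and PCR cycling are deliberately excluded because their durations are not stated.

```python
from typing import NamedTuple


class Step(NamedTuple):
    name: str
    temperature_c: int
    minutes: int


# Incubation steps in the order described in the text.
WORKFLOW = [
    Step("HyperLigase ligation (adaptor to 3' ends)", 75, 6 * 60),
    Step("3'-end de-blocking with T4 PNK", 37, 40),
    Step("circularization with CircLigase II", 60, 60),
    Step("dumb-bell digestion (USER or Endonuclease V)", 37, 15),
    Step("heat-inactivation of digestion", 65, 20),
]

total_minutes = sum(step.minutes for step in WORKFLOW)
high_temperature_minutes = sum(
    step.minutes for step in WORKFLOW if step.temperature_c >= 60
)

for step in WORKFLOW:
    print(f"{step.minutes:4d} min at {step.temperature_c} C  {step.name}")
print(f"total incubation time: {total_minutes} min ({total_minutes / 60:.2f} h)")
print(f"time at 60 C or above: {high_temperature_minutes} min")
```

Tabulating the steps this way makes clear that the 6-hour ligation dominates the incubation time, and that most of the protocol is spent at 60° C. or above.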
Sequencing and Data Analysis
[0101] All libraries were quantified by qPCR (KR0405, Roche Sequencing) and pooled based on library concentration and planned read allocation. Sequencing was carried out with 2×151 cycles on an Illumina NextSeq 500 using a High-output kit according to the manufacturer's protocol.
[0102] Programs and commands, with parameters, used to process the read data are listed in Table 4.
TABLE-US-00010 TABLE 4 Computational programs and commands for analysis

Program: Trimmomatic
Command: java -jar ~/Trimmomatic-0.39/trimmomatic-0.39.jar PE s7_R1.fastq.gz s7_R2.fastq.gz s7_paired_R1.fq.gz output_forward_unpaired.fq.gz s7_paired_R2.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:True LEADING:3 TRAILING:3 MINLEN:36
Notes: Read trimming. S7 is the representative sample used.

Program: Picard
Command: picard FastqToSam F1=s7_paired_R1.fq.gz F2=s7_paired_R2.fq.gz O=s7_unaligned.bam SM=s7
Notes: Merge read 1 and read 2.

Program: Fgbio
Command: java -jar ~/picard/fgbio-2.2.0.jar ExtractUmisFromBam --input=s7_unaligned.bam --output=s7_unaligned_withumi.bam --read-structure=8M143T 13M138T --molecular-index-tags=ZA ZB --single-tag=RX
Notes: Extract UMIs from reads.

Program: picard
Command: picard SamToFastq I=s7_unaligned_withumi.bam F=s7_unaligned_withumi.fastq INTERLEAVE=TRUE
Notes: Regenerate fastq read file.

Program: Bwa
Command: bwa mem -t 4 -p ~/reference/human/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna s7_unaligned_withumi.fastq -o s7_aligned_withoutumi.sam
Notes: Alignment to reference.

Program: samtools
Command: samtools view s7_aligned_withoutumi.sam -b -o s7_aligned_withoutumi.bam
Notes: Conversion of sam to bam.

Program: picard
Command: picard MergeBamAlignment UNMAPPED=s7_unaligned_withumi.bam ALIGNED=s7_aligned_withoutumi.bam O=s7_aligned_withumi.bam R=/home/zhengyuhudson/reference/human/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna SO=coordinate ALIGNER_PROPER_PAIR_FLAGS=true MAX_GAPS=1 ORIENTATIONS=FR VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true
Notes: Merge bam, adding UMI info back.

Program: picard
Command: picard MarkDuplicates I=s7_aligned_withumi.bam O=markduplicates_umi.bam M=markduplicates.txt BARCODE_TAG=RX REMOVE_DUPLICATES=true
Notes: De-duplication.
[0103] Briefly, adaptor sequences and polyG sequences were first trimmed off the reads by using Trimmomatic (11). Reads were then aligned to the reference genomes by using bwa (12), and files were processed with samtools (13). Alignment statistics were generated by Picard tools (14). UMI-aware read processing was done by using fgbio (15). Custom scripts for pairing duplex strands and end restoration were written in Perl.
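The read structure passed to fgbio in Table 4 (8M143T 13M138T) can be unpacked with a short parser. The following is a minimal sketch: the function name parse_read_structure is hypothetical, the parser handles only fixed-length segments, and the operator meanings (M for molecular barcode bases, T for template bases) follow the fgbio read-structure convention.

```python
import re

SEGMENT_KINDS = {"M": "molecular barcode (UMI)", "T": "template"}


def parse_read_structure(structure: str):
    """Split a read structure such as '8M143T' into (length, kind) segments.

    Simplified sketch: only fixed-length M and T segments are supported.
    """
    segments = re.findall(r"(\d+)([A-Z])", structure)
    if "".join(length + kind for length, kind in segments) != structure:
        raise ValueError(f"unsupported read structure: {structure}")
    return [(int(length), kind) for length, kind in segments]


read_structures = "8M143T 13M138T".split()  # read 1 and read 2, from Table 4
parsed = [parse_read_structure(s) for s in read_structures]
read_lengths = [sum(length for length, _kind in segs) for segs in parsed]

for number, segs in enumerate(parsed, start=1):
    description = ", ".join(f"{n} x {SEGMENT_KINDS[k]}" for n, k in segs)
    print(f"read {number}: {description}")
print("bases covered per read:", read_lengths)
```

Both read structures cover 151 bases, which is consistent with 151-base reads: the first 8 bases of read 1 and the first 13 bases of read 2 are treated as barcode, and the remainder as template.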
RESULTS
Biochemical Characteristics of DISTAL Ligation
[0104] In this Example, DISTAL ligation is catalyzed by a mutant thermostable single-strand ligase (named HyperLigase), which originates from Hyperthermus butylicus, a hyperthermophilic archaebacterium that grows optimally between 95° C. and 106° C. (16). A lysine-to-alanine mutation was introduced at the catalytic site so that the mutant ligase is only capable of ligation between a 5′-adenylated end and a 3′-hydroxyl end of single-strand DNA/RNA at elevated temperature (up to 95° C.) (10). Another mutant ligase (thermostable Mth App ligase, NEB cat #0319) from Methanobacterium thermoautotrophicum, with an optimal reaction temperature of 65° C., was reported in (17) and was also tested in this example.
[0105] Although the 3′-hydroxyl end is typically provided by a single-strand DNA, the hypothesis tested here is whether duplex DNA, whose ends transiently separate into a single-strand conformation due to thermodynamics (DNA breathing (18)), can support ligation at the 3′-ends catalyzed by single-strand DNA ligases. The duplex is ideally retained before and after the ligation (hence the name), with the benefit of retaining strand pairing signals for the later end restoration (see DUET-seq below).
[0106] To test the impact of the reaction temperature, DISTAL ligation reactions were set up and incubated at various temperatures for 6 hours (
[0107] As a comparison, the thermostable Mth App ligase (NEB) was used to test ligation between duplex DNA and single strand DNA at 65 C. for 6 hours. However, almost no ligation products can be observed (lane 9,
[0108] The quantitative conversion efficiency of hyperligase ligation, as measured by the percentage of conversion from substrate to either one-sided or two-sided ligation product is shown in
[0109] A notable feature of the hyperligase ligation in
[0110] To investigate the potential impact of end configuration on HyperLigase ligation, duplex DNA with 5′-single-strand protruding ends or 3′-single-strand protruding ends was prepared by subjecting blunt-end PCR DNA to nicking enzyme (Nb.BbVCI) digestion. Nicking enzyme recognition sites were introduced by the primers used in PCR (Table 3). After digestion, 5′-single-strand protruding ends have a 9-10 nt single-strand portion, while 3′-single-strand protruding ends have an 11-12 nt single-strand portion, on either side of the duplex DNA. The digested DNA was then used as substrate for HyperLigase ligation and compared with blunt-end DNA, as shown in
Distal Ligation Enables Sequencing Library Preparation (Distal-Seq)
[0111]
[0112] As a proof-of-principle experiment, a library was made from 50 ng of reference material Escherichia coli O157 (EDL933) gDNA using the DISTAL-seq workflow (
[0113] To investigate the sequence-specific bias of both hyperligase ligation and circular ligation, stretches of randomized nucleotides (NNNNN, N=A/T/C/G) were designed at both 5- and 3-ends of the single strand adaptor. Mapped read1 and read2 of the E. coli DISTAL-seq data were aligned and sequence context on either side of the ligation junction was analyzed for potential bias (
[0114] Since a few steps in this exemplary DISTAL-seq library preparation require long incubation at elevated temperature (6 hours at 75° C. for HyperLigase ligation and 1 hour at 60° C. for the CircLigase II reaction), a potential concern is the DNA damage that heat may introduce, including but not limited to C>T transitions and G>T transversions (19). To address this question, DISTAL-seq data from the reference genomic material (Escherichia coli O157 (EDL933)) were used, and mutations were called from the sequencing data (see Methods). As shown in
Duplex End Restoration Sequencing (DUET-Seq) and Validation
[0115] Duplex sequencing uses a strategy of tagging each strand separately with distinct but corresponding UMIs so that the two strands can be paired during analysis (5). Duplex sequencing has been used to detect rare genomic mutations with high positive predictive value (PPV) (5) and has proven valuable in a variety of research and clinical applications. As discussed earlier, with the current duplex sequencing library preparation methods, native ends are either filled in or resected to blunt ends, so they cannot be restored after the duplex is reconstructed. The principle of DISTAL-seq provides a feasible framework in which, if strands can be paired, duplex and native end restoration sequencing (DUET-seq) may become possible.
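The idea of pairing two strands through distinct but corresponding UMIs can be sketched as follows. This is a hedged illustration only, not the analysis scripts used in this example: it assumes that "corresponding" means the UMI pair read from one strand is the reverse complement, in swapped order, of the pair read from the other strand. The read identifiers, the example UMI sequences, and the function names are all hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def pair_strands(top_reads: dict, bottom_reads: dict) -> list:
    """Pair top- and bottom-strand reads by corresponding UMIs.

    top_reads / bottom_reads map a read identifier to a (umi_a, umi_b) tuple.
    Assumption: a bottom-strand read carrying (a, b) pairs with a top-strand
    read carrying (revcomp(b), revcomp(a)).
    """
    index = {}
    for read_id, umis in top_reads.items():
        index.setdefault(umis, []).append(read_id)

    pairs = []
    for read_id, (umi_a, umi_b) in bottom_reads.items():
        key = (reverse_complement(umi_b), reverse_complement(umi_a))
        for top_id in index.get(key, []):
            pairs.append((top_id, read_id))
    return pairs


# Hypothetical example reads.
top = {"top_1": ("ACGGACCA", "GGCAGCAC"), "top_2": ("CCAAGGCA", "AGGCACCG")}
bottom = {
    "bottom_1": (reverse_complement("GGCAGCAC"), reverse_complement("ACGGACCA")),
    "bottom_2": ("AAAAAAAA", "CCCCCCCC"),  # no matching partner
}

pairs = pair_strands(top, bottom)
print(pairs)
```

Because pairing here relies on the UMIs rather than on identical alignment coordinates, the two strands of a duplex can be matched even when their ends differ, which is what allows the original single-strand overhangs to be inferred afterwards.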
[0116]
[0117] The read structure and the strand pairing diagram are also shown in
[0118] To validate DUET-seq, a mixture of DNA with known ends was used: lambda DNA (48 kb) digested with FauI (CCCGC(N).sub.4GGGCG(N).sub.6) was spiked into lambda DNA digested with AluI (AGCT/TCGA) at a 1:2000 ratio. FauI is expected to generate 2-nt 3′-single-strand protruding ends, while AluI generates blunt ends. In 50 ng of starting DNA mixture for DUET-seq, there are about 4.5×10.sup.5 copies of FauI-digested genome equivalent in a background of 9×10.sup.8 copies of AluI-digested genome equivalent. The theoretical diversity of the barcode combination from the DUET-seq adaptor is 4×10.sup.7 (
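The copy numbers and barcode diversity quoted above can be reproduced to within rounding by a back-of-the-envelope calculation. The sketch below rests on stated assumptions that do not come from the text: an average mass of 650 g/mol per base pair, a genome size of exactly 48,000 bp for the "48 kb" lambda genome, and a barcode combination made of two 8-base UMIs in which each position is one of three bases (B = C/G/A, as in the dupada_3 sequence of Table 3).

```python
AVOGADRO = 6.022e23          # molecules per mole
GRAMS_PER_MOL_PER_BP = 650.0  # assumption: average mass of one base pair


def genome_copies(mass_ng: float, genome_bp: int) -> float:
    """Number of genome equivalents in a given mass of double-stranded DNA."""
    moles = (mass_ng * 1e-9) / (genome_bp * GRAMS_PER_MOL_PER_BP)
    return moles * AVOGADRO


total_copies = genome_copies(mass_ng=50, genome_bp=48_000)
spike_copies = total_copies / 2000  # FauI-digested DNA at a 1:2000 ratio

# Assumption: two 8-base UMIs per fragment, three possible bases per position.
umi_diversity_single = 3 ** 8
umi_diversity_pair = umi_diversity_single ** 2

print(f"genome equivalents in 50 ng: {total_copies:.2e}")
print(f"FauI-digested spike-in:      {spike_copies:.2e}")
print(f"single UMI diversity:        {umi_diversity_single}")
print(f"UMI pair diversity:          {umi_diversity_pair:.2e}")
```

Under these assumptions the calculation gives roughly 9.7×10^8 genome equivalents, about 4.8×10^5 spike-in copies, and about 4.3×10^7 UMI combinations, which agree with the figures quoted above to within rounding.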
[0119] Duplex sequencing is reported to be inefficient in recovering both strands and often requires excess level of sequencing (20). During DUET-seq, an extra step of exonuclease treatment can be added before the endonuclease V digestion step (
[0120] The strand pairing analysis identified 1 duplex FauI-digested fragment from the Exo− library, as compared to 11 duplex FauI-digested fragments from the Exo+ library, which represents about a 10-fold enrichment in duplex recovery. For the duplex FauI-digested fragments identified from both libraries, both strands of the duplex originate from bona fide FauI cleavage, with no strand swapping between FauI-digested and AluI-digested strands. The signed single strand end length (
Application of DUET-Seq to Human Genomic DNA and Cell-Free DNA
[0121] Finally, this exemplary DUET-seq was applied to a few real-world samples to profile the states of the native ends. First, DUET-seq was used to compare the end profiles of sonicated genomic DNA and enzymatically fragmented genomic DNA. As shown in FIG. 4A, the majority of the DNA fragments in sonicated genomic DNA possess either blunt ends or ends with short single-strand overhangs (2-3 nt). The ends are almost equally divided between 5′-single-strand protruding and 3′-single-strand protruding. The enzymatically fragmented genomic DNA, however, has a majority of ends that are 3′-single-strand protruding, with a much wider size distribution of the single-strand overhang. A major peak at 1 nt, which represents a 1-base 3′-single-strand protruding end, accounts for 10% of all the ends. The 3′-single-strand protruding nature of the ends likely reflects the preference of the nucleases in the fragmentase product mix.
[0122] For cell-free plasma DNA, interestingly, a first observation is that its end pattern has an intriguing resemblance to that of the enzymatically fragmented DNA: the majority of the ends are 3′-single-strand protruding, with a major peak at 2 nt (10%). As a quality check of the DUET-seq library,
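An end profile of this kind can be tabulated from paired strands once each duplex end is reduced to a signed overhang length. The sketch below is an assumption-laden illustration: the sign convention (positive for 3′-single-strand protruding, negative for 5′-single-strand protruding, zero for blunt), the coordinate layout, and the example coordinates are all hypothetical and are not taken from the figures.

```python
from collections import Counter


def signed_overhang(top_5p_pos: int, bottom_3p_pos: int) -> int:
    """Signed single-strand length at the left end of a duplex.

    top_5p_pos:    leftmost reference position of the top strand (its 5' end)
    bottom_3p_pos: leftmost reference position of the bottom strand (its 3' end)

    Assumed convention: positive = 3' protruding (the bottom strand extends
    further left), negative = 5' protruding, zero = blunt.
    """
    return top_5p_pos - bottom_3p_pos


def classify(overhang: int) -> str:
    if overhang > 0:
        return "3'-protruding"
    if overhang < 0:
        return "5'-protruding"
    return "blunt"


# Hypothetical left-end coordinates for four paired duplexes.
duplex_left_ends = [(100, 98), (200, 200), (305, 307), (50, 49)]

overhangs = [signed_overhang(top, bottom) for top, bottom in duplex_left_ends]
length_profile = Counter(overhangs)
class_profile = Counter(classify(value) for value in overhangs)
fraction_3p = class_profile["3'-protruding"] / len(overhangs)

print("signed overhang lengths:", overhangs)
print("length profile:", dict(length_profile))
print("class profile:", dict(class_profile))
print(f"fraction 3'-protruding: {fraction_3p:.2f}")
```

The right end of each duplex would be handled symmetrically, with the roles of the two strands exchanged.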
[0123] The attachment of an adaptor with defined sequence to library DNA is crucial in driving creative ways to make sequenceable libraries for the interrogation of genomic alterations. Many ligation strategies have been described including duplex to duplex ligation, such as A/T ligation, blunt-blunt ligation, etc., and single strand ligation, for example, mediated by splint or direct single strand to single strand ligation. Here, an alternative strategy is illustrated in which single strand adaptor is directly ligated to duplex DNA. It is termed DISTAL (duplex retaining single strand tail) ligation. As shown in this example, DISTAL ligation enables sequencing library preparation workflows where conventional end repair is no longer necessary.
[0124] Reaction temperature plays a role in driving the efficiency of the DISTAL ligation, as shown in
[0125] Another element for DISTAL ligation is the choice of thermostable ligase capable of ligating between two single strand DNA molecules. In this example, a thermostable mutant ligase with a wide range of temperature tolerance was chosen to enable hyperligase ligation (10). In addition, the mutation at the catalytic lysine in the enzyme dictates the uni-directional ligation between the 5-adenylated end and the 3-hydroxyl end, minimizing the chance of undesired by-product generation. Other enzymes with similar characteristics through database mining may be useful for this purpose.
[0126] Embodiments of DISTAL ligation allow insights into adaptor design for sequencing library preparation. For example, unlike the conventional Y-adaptor, for which a short duplex is needed due to the substrate requirement of T4 DNA ligase, ligations in embodiments of DISTAL-seq are generally completed in two separate steps, and in either step the substrates are in the form of single-strand DNA. In particular embodiments, this substrate requirement might make the duplex portion of the conventional Illumina adaptor unnecessary. This may have the added benefit of reducing adaptor length as well as adaptor dimer length, so that adaptor dimers are removed more efficiently by size selection.
[0127] Although designed for ligation to duplex DNA, DISTAL workflow does not preclude single strand DNA in the starting material ligating to the adaptor. These ligated products can also go through the later steps of the library preparation, get amplified and sequenced. As such, DISTAL-seq and DUET-seq data may have captured both double and single strand DNA population in the starting material. Indeed, for the cell free plasma DNA, as shown in
[0128] For DUET-seq, note that as a proof of principle, the strand-specific UMIs were designed as complementary through primer extension for the ease of synthesis (
[0129] Finally, since no ER/AT is used in the library prep, DISTAL-seq can be extended to workflows for readout of epigenetic base modifications. The dilution of epigenetic signals, especially at the start of read 2, as observed in (3), is not expected to exist for the collected dataset. Such high-fidelity epigenomic datasets should be useful in illuminating epigenetic changes in the disease process, especially at early onset. The current examples are based on sequencing genetic information. When the adaptor is methylated, for example, sequencing of the epigenome can also be demonstrated.
REFERENCES
[0130] 1. Gregory, et al. (2020) Characterization and mitigation of fragmentation enzyme-induced dual stranded artifacts. NAR Genom Bioinform, 2.
[0131] 2. Xiong et al. (2022) Duplex-Repair enables highly accurate sequencing, despite DNA damage. Nucleic Acids Res, 50, e1-e1.
[0132] 3. Jiang, et al. (2020) Detection and characterization of jagged ends of double-stranded DNA in plasma. Genome Res, 30, 1144-1153.
[0133] 4. Thierry, A. R. (2023) Circulating DNA fragmentomics and cancer screening. Cell Genomics, 3, 100242.
[0134] 5. Schmitt, et al. (2012) Detection of ultra-rare mutations by next-generation sequencing. Proceedings of the National Academy of Sciences, 109, 14508-14513.
[0135] 6. Picelli, et al. (2014) Tn5 transposase and tagmentation procedures for massively scaled sequencing projects. Genome Res, 24, 2033-2040.
[0136] 7. Gansauge, M. T. and Meyer, M. (2013) Single-stranded DNA library preparation for the sequencing of ancient or damaged DNA. Nat Protoc, 8, 737-748.
[0137] 8. Troll, et al. (2019) A ligation-based single-stranded library preparation method to analyze cell-free DNA and synthetic oligos. BMC Genomics, 20, 1023.
[0138] 9. Harkins, et al. (2020) A novel NGS library preparation method to characterize native termini of fragmented DNA. Nucleic Acids Res, 48, e47-e47.
[0139] 10. Zheng, Y. and Hong, M. HYPER-THERMOSTABLE LYSINE-MUTANT SSDNA/RNA LIGASES. EP 3 430 154 B1.
[0140] 11. Bolger, A. M., Lohse, M. and Usadel, B. (2014) Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 30, 2114-2120.
[0141] 12. Li, H. and Durbin, R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754-60.
[0142] 13. Li, et al. (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25, 2078-2079.
[0143] 14. https://broadinstitute.github.io/picard/.
[0144] 15. https://github.com/fulcrumgenomics/fgbio.
[0145] 16. Zillig, et al. (1990) Hyperthermus butylicus, a hyperthermophilic sulfur-reducing archaebacterium that ferments peptides. J Bacteriol, 172, 3959-3965.
[0146] 17. Zhelkovsky, A. M. and McReynolds, L. A. (2012) Structure-function analysis of Methanobacterium thermoautotrophicum RNA ligase-engineering a thermostable ATP independent enzyme. BMC Mol Biol, 13, 24.
[0147] 18. Phelps, et al. (2013) Single-molecule FRET and linear dichroism studies of DNA breathing and helicase binding at replication fork junctions. Proceedings of the National Academy of Sciences, 110, 17320-17325.
[0148] 19. Costello, et al. (2013) Discovery and characterization of artifactual mutations in deep coverage targeted capture sequencing data due to oxidative DNA damage during sample preparation. Nucleic Acids Res, 41, e67-e67.
[0149] 20. Bae, et al. (2023) Single duplex DNA sequencing with CODEC detects mutations with high sensitivity. Nat Genet, 55, 871-879.
[0150] 21. Zheng, et al. (2010) A unique family of Mrr-like modification-dependent restriction endonucleases. Nucleic Acids Res, 38, 5527-5534.
[0151] 22. Jiang, et al. (2015) Lengthening and shortening of plasma DNA in hepatocellular carcinoma patients. Proceedings of the National Academy of Sciences, 112.
[0152] 23. Snyder, et al. (2016) Cell-free DNA Comprises an In Vivo Nucleosome Footprint that Informs Its Tissues-Of-Origin. Cell, 164, 57-68.
[0153] 24. Lanman, et al. (2015) Analytical and Clinical Validation of a Digital Sequencing Panel for Quantitative, Highly Accurate Evaluation of Cell-Free Circulating Tumor DNA. PLoS One, 10, e0140712.
[0154] 25. Tanaka, N. and Shuman, S. (2011) RtcB Is the RNA Ligase Component of an Escherichia coli RNA Repair Operon. Journal of Biological Chemistry, 286, 7727-7731.
[0155] 26. Das, et al. (2013) Rewriting the rules for end joining via enzymatic splicing of DNA 3′-PO4 and 5′-OH ends. Proceedings of the National Academy of Sciences, 110, 20437-20442.
[0156] 27. Duan, et al. (2019) Purification and enzymatic characterization of the RNA ligase RTCB from Thermus thermophilus. Biotechnol Lett, 41, 1051-1057.
[0157] 28. Desai, et al. (2015) Coevolution of RtcB and Archease created a multiple-turnover RNA ligase. RNA, 21, 1866-1872.
[0158] 29. U.S. Pat. Pub. 2022/0356467.
[0159] 30. Jacewicz, A., et al. (2022) Structures of RNA ligase RtcB in complexes with divalent cations and GTP. RNA, 28(11), 1509-1518.
[0160] All publications and patents mentioned in the specification and/or listed below are herein incorporated by reference. Various modifications and variations of the described method and system of the invention will be apparent to those skilled in the art without departing from the scope and spirit of the invention. Although the invention has been described in connection with specific embodiments, it should be understood that the invention as claimed should not be unduly limited to such specific embodiments. Indeed, various modifications of the described modes for carrying out the invention that are obvious to those skilled in the relevant fields are intended to be within the scope described herein.