SINGLE-STRANDED END PRESERVING ADAPTORS

Abstract

Provided herein are compositions, kits, systems, and methods employing single-stranded end-preserving adaptors. Such single-stranded adaptors are attached to DNA duplex molecules while preserving original 5' or 3' single-strand protruding ends (e.g., present in cell-free DNA) by attaching such adapters to 3' ends of the DNA duplex molecules using a single strand ligase that has step 3 ligase activity, but not step 2 adenylyl transfer activity, and attaching such adapters to 5' ends of the DNA duplex molecules using a ligase enzyme (e.g., a circligase), thereby forming loop-like structures on one or each end of the DNA duplex molecules. In further embodiments, the loop-like structures are cleaved (e.g., by an endonuclease) as the single-stranded adapters have a cleavable portion, thereby generating a two-part adapter on one or both ends of the DNA duplex molecules that preserves the initial 5' or 3' single-strand protruding ends, along with any methylation present.

Claims

1. A kit comprising: a) a plurality of single-stranded adaptors, wherein each of said single-stranded adaptors comprises a nucleic acid sequence and optionally a 3' end blocking group attached to the 3' end of said nucleic acid sequence, wherein said 3' end blocking group is optionally a modified non-canonical nucleotide or a phosphate group, wherein the 5' end of said nucleic acid sequence is adenylated, and wherein said nucleic acid sequence comprises: a 5' region, a 3' region, and a cleavable region between said 5' and 3' regions, and is optionally methylated at every, or nearly every, cytosine present in said nucleic acid sequence; and b) a single strand ligase, wherein said single strand ligase has step 3 ligase activity, but not step 2 adenylyl transfer activity.

2. The kit of claim 1, wherein said cleavable region comprises: A) a modified non-canonical base, B) a nucleic acid backbone linkage, or C) an endonuclease recognition site, wherein said endonuclease recognition site is formed at the junction of said 5' and 3' regions or is composed of a sequence between said 5' and 3' regions, or D) one or more RNA bases, wherein optionally the rest of the first and second nucleic acid sequences are composed of DNA, or E) a sequence that forms an endonuclease recognition sequence when a secondary oligonucleotide is added.

3. The kit of claim 1, wherein said single strand ligase is a thermostable lysine-mutant ssDNA/RNA ligase which is a mutated version of a precursor thermostable ssDNA/RNA ligase, wherein said precursor thermostable ssDNA/RNA ligase has a Motif I EKx(D/N/H)G, and wherein said thermostable lysine-mutant replaces K in said Motif I with any other amino acid or is selected from alanine (A), serine (S), cysteine (C), valine (V), threonine (T), and glycine (G).

4. The kit of claim 1, wherein said single strand ligase has an amino acid sequence that is 95% or 100% identical to any one of SEQ ID NOs: 1-21.

5. The kit of claim 1, wherein said 5' and 3' regions of said first nucleic acid sequence each comprise at least one element selected from: a flow cell attachment sequence, a unique barcode sequence, a non-unique barcode sequence, a sample-identifying index sequence, a read 1 primer binding sequence, a read 2 primer binding sequence, and a universal PCR amplification primer binding sequence.

6. The kit of claim 1, wherein said 5' and 3' regions of said first nucleic acid sequence each comprise a barcode sequence, wherein said barcode sequences have a predefined relationship to each other, which is optionally based on a pre-defined association which may be a look up table, and optionally wherein said barcode sequences are non-complementary to each other, and optionally wherein the barcodes are of different lengths.

7. The kit of claim 1, wherein said 3' end blocking group is present.

8. The kit of claim 1, wherein said 3' end blocking group is selected from: a nucleotide with a 3'-phosphate group and a nucleotide reversible terminator.

9. The kit of claim 1, further comprising a deblocking agent.

10. The kit of claim 1, further comprising a ligase enzyme selected from: Circligase I, Circligase II, RtcB ligase from E. coli or homologs, thermostable RtcB or homologs, TS2126 RNA ligase, Mth DNA ligase, T4 RNA ligase 1, and T4 RNA ligase 2.

11. The kit of claim 1, further comprising one or more enzymes capable of cleaving said cleavable region of said nucleic acid sequence.

12. The kit of claim 11, wherein said one or more enzymes, or reagent, are selected from: an endonuclease, endonuclease V, endonuclease VIII, an endonuclease and uracil DNA glycosylase, thermostable oxoguanine glycosylase (OGG), and iodine.

13. The kit of claim 1, further comprising a plurality of DNA duplex molecules, wherein each of said DNA duplex molecules comprises: i) a first duplex end that comprises a 3' strand end and a 5' strand end, and has a 3' or 5' single-strand protruding end that is either a single non-adenine nucleotide or is at least two nucleotides in length, and ii) a second duplex end that comprises a 3' strand end and a 5' strand end which optionally has a 3' or 5' single-strand protruding or blunt end, and optionally wherein any of, or all of, said protruding ends comprise at least one cytosine that is methylated.

14. The kit of claim 13, wherein each of said DNA duplex molecules comprises a loop-like structure on the first duplex end and second duplex end, wherein said loop-like structure is composed of said nucleic acid sequence.

15. The kit of claim 1, further comprising one or more containers for collectively or separately holding the recited components, optionally, wherein said components are present inside said container.

16. The kit of claim 15, wherein said one or more containers is selected from a cardboard box, a plastic bag or box, glass vials, and plastic vials.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0045] FIG. 1. Reaction temperature of exemplary DISTAL ligation. (A) Impact of reaction temperature on an exemplary DISTAL ligation on a 320 bp duplex DNA. For the DISTAL ligation series, lanes 1-7 show ligation of a single strand adaptor to a DNA duplex of approximately 300 bp at different temperatures. Lane 8 is duplex DNA only. Lane 9 is the same reaction catalyzed by the thermostable 5' App DNA/RNA Ligase (NEB) at 65° C. for 6 hours. Reaction samples were run on the high sensitivity D1000 screentape (Agilent) and analyzed by the Tapestation software. (B) Impact of reaction temperature on DISTAL ligation on a 180 bp duplex DNA. (C) Product distribution of reaction series in A and B. Molarity readings were obtained from Tapestation analysis software.

[0046] FIG. 2 shows an exemplary DISTAL-seq workflow and results. (A) Diagram of this exemplary DISTAL-seq workflow. (B) Tapestation trace of DISTAL-seq library using sheared E. coli genomic DNA. The distinct band around 150 bp is adaptor dimer. It's likely due to the residual unligated adaptors that go through deblocking, circularization, cleavage and amplification. (C) Sequence-specific bias for DISTAL ligation and Circligase II mediated ligation. (D) Coverage distribution for DISTAL-seq. (E) Mutation signature for the DISTAL-seq.

[0047] FIG. 3 shows an exemplary DUET-seq workflow and validation results on samples with known ends. (A) An exemplary DUET-seq adaptor preparation and read structure diagram. The unique molecular identifier (UMI) has 8 degenerate bases, with 3^8 = 6561 possible barcodes on either end, for a total of 6561 × 6561 ≈ 4 × 10^7 possible combinations. The degenerate base uses B (B=G/C/T) to avoid USER enzymes (uracil DNA glycosylase and Endo VIII) cutting inside the UMI. After paired-end sequencing, the first 8 bases of read 1 are UMI and the first 13 bases of read 2 are UMI, with 5 constant bases. (B) (C) Tapestation traces of DUET-seq libraries using a mix of pre-digested lambda DNA. Lane B1: without ExoI/ExoIII treatment; Lane C1: with ExoI/ExoIII treatment. (C) Single strand end length after end restoration. After the duplex is restored, the end length is calculated as the 3'-end coordinate of the fragment mapped to the forward strand minus the 5'-end coordinate of the fragment mapped to the reverse strand, or the 5'-end coordinate of the fragment mapped to the forward strand minus the 3'-end coordinate of the fragment mapped to the reverse strand. A positive single strand end length denotes a 5'-single-strand protruding end, while a negative end length denotes a 3'-single-strand protruding end. A value of zero denotes a blunt end.
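The UMI counts given for panel (A) can be checked with a short calculation. The following is only an illustrative sketch: it assumes each of the 8 degenerate positions is drawn independently from the three bases covered by B, and the variable names are hypothetical.

```python
# Assumption: every UMI position is an independent draw from B = G, C, or T.
B_BASES = "GCT"   # IUPAC code B excludes A
UMI_LENGTH = 8    # degenerate positions per UMI, as stated for FIG. 3(A)

umis_per_end = len(B_BASES) ** UMI_LENGTH   # distinct UMIs on one end
umi_pairs = umis_per_end ** 2               # one UMI on each end of a fragment

print(umis_per_end)   # 6561
print(umi_pairs)      # 43046721, i.e. roughly 4 x 10^7
```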

[0048] FIG. 4. End profiling of sheared genomic DNA and cell-free plasma DNA. (A) end length distribution of fragmented gDNA, including sonicated genomic DNA, enzymatically sheared genomic DNA and cell-free plasma DNA. (B) Insert distribution for the exemplary DUET-seq library of the cell free DNA. Data is taken from CollectInsertSizeMetrics from Picard Tools.

[0049] FIG. 5. Exemplary DISTAL ligation using 5'-single-strand protruding or 3'-single-strand protruding DNA, and comparison to blunt end DNA. Lane 1, blunt end DNA; lane 2, exemplary DISTAL ligation using blunt end DNA; lane 3, 5'-single-strand protruding DNA; lane 4, exemplary DISTAL ligation using 5'-single-strand protruding DNA; lane 5, 3'-single-strand protruding DNA; lane 6, DISTAL ligation using 3'-single-strand protruding DNA.

[0050] FIG. 6. Insert size distribution of the exemplary E. coli DISTAL-seq library. The figure was generated by using CollectInsertSizeMetrics in Picard tools.

[0051] FIG. 7. GC bias for the exemplary E. coli DISTAL-seq library. The plot is generated by CollectGcBiasMetrics in Picard tools.

[0052] FIG. 8. Mutation AF distribution in E. coli DISTAL-seq.

[0053] FIG. 9. Insert size distribution for sonicated gDNA DUET-seq library.

[0054] FIG. 10. Insert size distribution for enzymatically fragmented DUET-seq library.

[0055] FIG. 11A shows an exemplary single stranded adaptor with a 3'-blocking group, which can be, for example: a 3'-phosphate, a 3'-dideoxyC, a 3'-biotin, a 3'-spacer, etc. Other 3' blocker examples are sold by IDT in their catalog at Modifications/GetAllMods #3. FIG. 11B shows an exemplary single stranded adaptor with an embedded cleavable site. A cleavable base can be, for example, an internal uracil (e.g., cleaved by UDG and Endo VIII), internal 5-hydroxymethyluracil, internal inosine (e.g., cleaved by Endo V), and internal 8-oxoGuanine (cleaved by OGG and EndoVIII). A cleavable backbone linkage can be a phosphodiester bond (e.g., cleaved by an endonuclease) or, for example, could be a phosphorothioate DNA bond, which, for example, can be cleaved by iodine (I2) (Qiang Huang et al. Origin of iodine preferential attack at sulfur in phosphorothioate and subsequent P-O or P-S bond dissociation, PNAS, vol 119, 2022, herein incorporated by reference). FIG. 11C shows an exemplary single stranded adaptor that can be cleaved by annealing a secondary oligo, forming a double-stranded region, which is then cleaved. This can be accomplished, for example, by embedding a few RNA bases, and cleaving by RNaseH, or by embedding a restriction enzyme site which is cleaved by a restriction enzyme. FIG. 11D shows an exemplary single strand adaptor where more than one cleavage site is embedded inside the adaptor. Cleavage can be made, for example, sequentially. Such cleavage is also an alternative way to de-block the 3'-end. FIG. 11E shows two exemplary single stranded adaptors made of DNA and RNA, which may (top) or may not (bottom) employ a 3'-blocking group. FIG. 11F shows an exemplary single stranded adaptor with a 3' blocking group, which can be de-blocked by a phosphatase, such as T4 PNK, shrimp alkaline phosphatase, etc. Another type of 3' blocking that can be used is a nucleotide reversible terminator, for which the blocking group is removed by chemical agents.

[0056] FIG. 12A shows an exemplary single stranded adaptor that employs elements that, for example, can be used with ILLUMINA sequencers including: a read 2 primer sequence, a sample index 2, a P7 sequence, a P5 sequence, a sample index 1, and a read 1 primer sequence. FIG. 12B shows an exemplary single stranded adaptor that employs elements that, for example, can be used with ILLUMINA sequencers including: a second UMI (barcode) sequence, a read 2 primer sequence, a sample index 2, a P7 sequence, a P5 sequence, a sample index 1, a read 1 primer sequence, and a first UMI (barcode) sequence.

[0057] FIG. 13: RtcB Ligase from E. coli joins single stranded RNA with a 3'-phosphate or 2',3'-cyclic phosphate to another RNA with a 5'-hydroxyl end (25). It is also known that RtcB ligates a 3'-phosphate end to a 5'-hydroxyl end of single strand DNA (26). FIG. 13A shows the use of RtcB to generally circularize the single-stranded adaptor sequences herein. For example, after hyperligase ligation, one could use E. coli RtcB to directly ligate the 3'-phosphate end of the adaptor to the proximal 5'-end of the duplex, as shown in FIG. 13A, without de-blocking of the adaptor. In FIG. 13B, as E. coli RtcB catalyzes ligation at 37 degrees C., hyperligase ligation and RtcB ligation could occur in separate steps. As shown in FIG. 13B, by using thermostable RtcB, one could combine hyperligase ligation and RtcB in one reaction. Thermostable RtcBs have been reported (27) (28), both of which are herein incorporated by reference, particularly for the thermostable RtcBs reported therein.

[0058] FIG. 14A. Exemplary diagram of duplex recovery using 5' and 3' barcodes with predefined association (see look up tables 5-8). The duplex DNA fragment shows one end with a protruding end of about 9 nucleotides, and a second end with a protruding end of about six nucleotides. Duplex DNA fragment ends are ligated with a single strand adaptor with 5' and 3' barcodes, forming the dumb-bell DNA molecule shown with the protruding ends that are preserved. The two loops of this dumb-bell DNA molecule are then cleaved. PCR amplification is then employed, such as with universal primers. These amplicons are then sequenced to generate sequence reads. Read pairs with proper association are used to recover the duplex, as shown in the figure. Note that the recovered duplex is not necessarily blunt ended. The end coordinates can be used to recover the ends present in the starting material. FIG. 14B shows the same exemplary figure as FIG. 14A, but without barcodes added and with 3' protruding ends. FIG. 14C shows a similar exemplary figure without barcodes and with 5' protruding ends.

[0059] FIGS. 15A and 15B show exemplary embodiments where the barcodes in the two adapters flanking a template sequence are of different lengths.

[0060] FIG. 16A shows how the methods and compositions herein preserve methylation on 3' and 5' protruding ends of target DNA as, for example, no end-repair or A-tailing is required of the target DNA. FIG. 16B shows an exemplary workflow using fully methylated nucleic acid sequences that form adapters herein.

[0061] FIG. 17 shows a workflow where after ligation as described herein, instead of maintaining the duplex, DNA sample is denatured and each single strand DNA is circularized individually.

[0062] FIG. 18A shows a workflow similar to that of FIG. 2, except once the dumbbell type structure is formed, a primer that is designed to anneal to the adaptor sequence is added to initiate rolling-circle amplification (RCA), to generate concatenated single-strand DNA with multiple copies of the duplex DNA templates. In FIG. 18B, after the first and second adaptors are ligated to the duplex DNA molecule, duplex DNA is denatured and subjected to deblocking/ligation, so that each strand forms single strand circles with the adaptor sequence embedded inside. A primer that is designed to anneal to the adaptor sequence is added to initiate rolling-circle amplification (RCA), to generate concatenated single-strand DNA with multiple copies of the strand-specific DNA templates.

[0063] FIG. 19A shows the nucleic acid sequence (SEQ ID NO:54) of an exemplary single-stranded adapter, which is composed of DNA bases (e.g., 4 canonical bases) except for one uracil base. The Read 2 primer binding site (AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC; SEQ ID NO:56) and Read 1 sequencing primer binding site (ACACTCTTTCCCTACACGACGCTCTTCCGATCT; SEQ ID NO: 57) are used in certain Illumina sequencing instruments and are marked. The 5' end is adenylated and the 3' end has a phosphate group. FIG. 19B shows the same sequence as FIG. 19A, but further includes a 5 base random barcode at each end, to form SEQ ID NO: 55. FIG. 19C shows the same sequence as FIG. 19A (SEQ ID NO:54), but further diagrammatically depicts Barcode 1 at one end and Barcode 2 at the other end. Barcodes 1 and 2, in a pool of adapters, can, for example, be any of the barcode pairs from Table 8, which can form 96 pairs. FIG. 19D shows an exemplary single strand adapter (SEQ ID NO: 58) similar to FIG. 19A, but further includes a flow cell binding P7 sequence (ATCTCGTATGCCGTCTTCTGCTTG, SEQ ID NO:59) and a P5 flow cell binding sequence (AATGATACGGCGACCACCGAGATCTACAC, SEQ ID NO:60), that may be used, for example, when PCR is not used to add these sequences. FIG. 19E shows a similar sequence (SEQ ID NO:61) as FIG. 19D, but further adds an i7 (CTGATCGT, SEQ ID NO: 62) and an i5 (ATATGCGC, SEQ ID NO:63) index sequence.

DEFINITIONS

[0064] A single-stranded DNA/RNA ligase enzyme is considered to have step 3 ligase activity but not step 2 adenylyl transfer activity when it is able to ligate between a 5'-adenylated end and a 3'-hydroxyl end of single strand DNA/RNA, but is not able to transfer the AMP to the 5'-phosphate-terminated DNA or RNA strand to form a 5'-App-DNA/RNA intermediate. All known DNA and RNA ligases perform the catalysis via a common pathway which involves three nucleotidyl transfer reactions (Lehman et al, Science, 1974; Lindahl et al, Annu Rev Biochem, 1992). In the case of ATP-dependent DNA or RNA ligases, the first step (step 1) involves the attack on the α-phosphate of ATP by ligase, which results in release of pyrophosphate and formation of a ligase-AMP intermediate. AMP is linked covalently to the amino group of a lysine residue within a conserved sequence motif. In the second step (step 2), the AMP nucleotide is transferred to the 5'-phosphate-terminated DNA or RNA strand to form a 5'-App-DNA/RNA intermediate. In the third and final step (step 3), attack by the 3'-OH strand on the 5'-App-DNA/RNA end joins the two polynucleotides and liberates AMP.
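The practical consequence of this definition can be restated as a small decision rule. The following is a toy sketch rather than a biochemical model: the function name `can_join`, the end labels ("App", "P", "OH", "blocked"), and the simplifying assumption that a step-2-competent enzyme is already charged with AMP (step 1) are all illustrative choices.

```python
def can_join(donor_5_end, acceptor_3_end, has_step2):
    """Toy rule: can the ligase seal this donor 5' end to this acceptor 3' end?

    donor_5_end:    "App" (5'-adenylated) or "P" (5'-phosphate)
    acceptor_3_end: "OH" (free 3'-hydroxyl) or "blocked" (e.g. 3'-phosphate)
    has_step2:      True if the enzyme retains adenylyl transfer activity
    """
    if acceptor_3_end != "OH":
        return False          # step 3 needs a 3'-OH to attack the 5'-App end
    if donor_5_end == "App":
        return True           # pre-adenylated donor: step 3 alone suffices
    if donor_5_end == "P":
        return has_step2      # a 5'-phosphate must first be adenylated (step 2)
    return False

# Step-3-only ligase: joins a 5'-adenylated adaptor to a duplex 3'-OH ...
print(can_join("App", "OH", has_step2=False))      # True
# ... but cannot use a 5'-phosphate donor, and cannot use a blocked 3' end.
print(can_join("P", "OH", has_step2=False))        # False
print(can_join("App", "blocked", has_step2=False)) # False
```

Under these assumptions, an adaptor that is 5'-adenylated and 3'-blocked can only act as a donor, which is consistent with the adaptor design recited herein.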

[0065] End repair means fill-in of the 3'-single-strand protruding ends by polymerase and/or resection of the 5'-single-strand protruding ends of the original DNA analyte DNA duplex. In certain embodiments, the methods herein do not employ any type of end repair.

DETAILED DESCRIPTION

[0066] Provided herein are compositions, kits, systems, and methods employing single-stranded end-preserving adaptors. Such single-stranded adaptors are attached to DNA duplex molecules while preserving original 5' or 3' single-strand protruding ends (e.g., present in cell-free DNA) by attaching such adapters to the 3' ends of the DNA duplex molecules using a single strand ligase that has step 3 ligase activity, but not step 2 adenylyl transfer activity, and attaching such adapters to the 5' ends of the DNA duplex molecules using a ligase enzyme (e.g., a circligase), thereby forming loop-like structures on one or each end of the DNA duplex molecules. In further embodiments, the loop-like structures are cleaved (e.g., by an endonuclease) as the single-stranded adapters have a cleavable portion, thereby generating a two-part adapter on one or both ends of the DNA duplex molecules that preserves the initial 5' or 3' single-strand protruding ends (or blunt ends), along with any methylation present, as the method does not require any end-repair or A-tailing. Such methods are particularly useful in sequencing library preparation and fragmentomic analyses.

[0067] Current end repair/A-tailing (ER/AT) chemistry during NGS library preparation results in writing and erasing activities on the starting material, which introduce a variety of artifacts and have implications for sequencing and data interpretation. In addition, conventional ER/AT abolishes the native DNA ends, which for some types of samples can be informative when they are generated biologically. Provided herein, in certain embodiments, is a sequencing library preparation workflow that employs ligation (e.g., before an amplification step). In some embodiments, an important step is the duplex-retaining single strand DNA tail ligation (DISTAL ligation) schema (e.g., as exemplified in FIG. 2), in which ligation occurs at elevated temperature between the duplex and a single strand adaptor, catalyzed by a thermostable single strand DNA ligase (e.g., a thermostable lysine-mutant, such as shown in Tables 1 and 2 below). Work conducted during development of embodiments herein shows that embodiments of DISTAL ligation enable new sequencing library preparation workflows in which no end repair or A-tailing is necessary. In addition, in some embodiments, by using a barcoded adaptor, it is feasible to restore originally paired DNA strands in the same duplex as well as the native DNA ends that come with that pairing (DUET-seq). In certain embodiments, DUET-seq is applied to biological samples, such as those containing fragmented human genomic DNA and/or cell-free plasma DNA.

[0068] In particular embodiments, the single-strand adaptors herein comprise a 5' region and a 3' region, wherein the 5' region comprises a first barcode sequence, and the 3' region comprises a second barcode sequence. In some embodiments, the first and second barcode sequences are complementary and may hybridize to each other. In other embodiments, the first and second barcodes are not complementary and do not hybridize to each other.

[0069] Such embodiments, where the first and second barcode sequences do not hybridize to each other, differ from the normal approach in the art where the barcode sequence is in the stalk of a Y shaped adaptor, necessitating that the first and second barcodes are always complementary to each other (due to the constraint of hybridization). As a result, in the normal approach, an adaptor with barcode A ligates to one end of the duplex and an adaptor with barcode B ligates to the other end of the duplex, so that the read pair from one strand has barcode AB, and the read pair from the other strand has barcode BA. During analysis, read pairs that share the same genomic coordinates of start and stop positions, and also AB/BA barcodes, are considered as duplex.

[0070] In embodiments herein the 5' and 3' barcodes are not complementary to each other, as shown in Table 5 below, where AB, CD, EF, and GH each represent a 5' barcode and a 3' barcode that do not hybridize to each other. Single strand adaptors with unique combinations of AB, CD, EF, and GH may be synthesized individually, and used as a mixture. As an example, during ligation, a first single strand adaptor with barcodes A and B ligates to the first duplex end, and a second single strand adaptor with barcodes C and D ligates to the second duplex end. The read pair from one strand has barcode DA, and the read pair from the other strand has barcode BC. During analysis, read pairs that overlap in their start and stop positions, and have the corresponding association as defined in the look up table (e.g., as long as they pair with each other in this table, DA/BC in the example, and therefore have a predetermined relationship that can be looked up) are considered duplex. The duplex ends can then be inferred from the genomic start and stop positions of the two strands at either end.

TABLE-US-00001 TABLE 5
5' barcode    3' barcode
A             B
C             D
E             F
G             H
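The barcode part of the duplex call described above can be sketched as a look-up. This is a minimal illustration under stated assumptions: the dictionary mirrors Table 5, each read pair is written in the same order as the "DA" and "BC" notation (3' barcode observed first, then 5' barcode), and the function name is hypothetical. The overlap check on genomic start and stop positions is deliberately left out.

```python
# Hypothetical look-up table mirroring Table 5: 5' barcode -> associated 3' barcode.
LOOKUP = {"A": "B", "C": "D", "E": "F", "G": "H"}

def is_duplex_pair(read_pair_1, read_pair_2, lookup=LOOKUP):
    """Return True if two read pairs carry barcodes with the predefined association.

    Each read pair is a tuple (3' barcode observed, 5' barcode observed),
    e.g. ("D", "A") for the "DA" read pair in the example.
    """
    three_1, five_1 = read_pair_1
    three_2, five_2 = read_pair_2
    # The 5' barcode seen on one strand must map to the 3' barcode seen on the
    # other strand, for both adaptors.
    return lookup.get(five_1) == three_2 and lookup.get(five_2) == three_1

print(is_duplex_pair(("D", "A"), ("B", "C")))  # True: the DA/BC example
print(is_duplex_pair(("D", "A"), ("F", "C")))  # False: F is not associated with A
```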

[0071] FIG. 14A provides an exemplary figure demonstrating the use of barcodes that can be looked up (for bottom and top strand) in a look up table such as Table 5 above. FIG. 14A shows the duplex DNA fragment with one end having a protruding end of about 9 nucleotides, and a second end having a protruding end of about six nucleotides. Duplex DNA fragment ends are ligated with a single strand adaptor with 5' and 3' barcodes, forming the dumb-bell DNA molecule shown with the protruding ends that are preserved. The two loops of this dumb-bell DNA molecule are then cleaved. PCR amplification is then employed, such as with universal primers. These amplicons are then sequenced to generate sequence reads. Read pairs with proper association are used to recover the duplex, as shown in the figure. Note that the recovered duplex is not necessarily blunt ended. After reads are mapped onto a reference genome, the end coordinates can be used to recover the ends present in the starting material.

[0072] As discussed above, the barcodes herein do not need to be complementary, which, for example, expands the design space to 16 possible combinations for 1-mers as shown in Table 6 below, in contrast to only 4 possible combinations if complementarity is required.

TABLE-US-00002 TABLE 6 1-mer duplex barcode design table
5'-barcode    3'-barcode    Mismatch
A             A             1
A             C             1
A             G             1
A             T             0
C             C             1
C             G             0
C             T             1
C             A             1
G             A             1
G             T             1
G             C             0
G             G             1
T             T             1
T             A             0
T             C             1
T             G             1
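The 16-versus-4 count for 1-mers can be reproduced by enumeration. The sketch below is illustrative only; it assumes the Mismatch column is 0 when the two bases are Watson-Crick complements and 1 otherwise, and the function name is hypothetical.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def one_mer_design_table():
    """List every (5' barcode, 3' barcode, mismatch) combination for 1-mers."""
    rows = []
    for five, three in product("ACGT", repeat=2):
        mismatch = 0 if COMPLEMENT[five] == three else 1
        rows.append((five, three, mismatch))
    return rows

table = one_mer_design_table()
complementary = [row for row in table if row[2] == 0]

print(len(table))          # 16 combinations when complementarity is not required
print(len(complementary))  # 4 combinations when complementarity is required
```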

[0073] A possible look up table for the 1-mer duplex barcode is:

TABLE-US-00003 TABLE 6.1
5'-barcode    3'-barcode
A             A
C             T
G             G
T             C

[0074] Another possible look up table for the 1-mer duplex barcode is:

TABLE-US-00004 TABLE 6.2
5'-barcode    3'-barcode
A             C
C             T
G             A
T             G

[0075] As another example, barcodes of equal or unequal length can be used as duplex barcodes.

TABLE-US-00005 TABLE 7 barcodes of different length
5'-barcode    3'-barcode
AA            CCT
CC            TT
GC            A
TA            CCGG
TC            CC

[0076] As another example, barcodes of the same sequence can be used as duplex barcodes. This design has the added advantage that a look up table is not even needed.

TABLE-US-00006 TABLE 8 an example design of 96 duplex barcodes with same sequences
5'-Barcode    3'-Barcode
AATGCAC       AATGCAC
ATCCTTG       ATCCTTG
GTATGCT       GTATGCT
AGATGCG       AGATGCG
ATGTGGC       ATGTGGC
CGATACT       CGATACT
GACTGTA       GACTGTA
GGTACGT       GGTACGT
TAACGCG       TAACGCG
TTACGGC       TTACGGC
AACCGCA       AACCGCA
AACGTCC       AACGTCC
ACGCAGT       ACGCAGT
AGGCACA       AGGCACA
CAACACA       CAACACA
GATCGCT       GATCGCT
GGATATA       GGATATA
GGATCAG       GGATCAG
TTGTGCG       TTGTGCG
ACCGAAC       ACCGAAC
AGCGACT       AGCGACT
AGGTTAA       AGGTTAA
ATATTGG       ATATTGG
ATCACAC       ATCACAC
ATTAGGT       ATTAGGT
CAATATC       CAATATC
CAGTGGA       CAGTGGA
CCGATAG       CCGATAG
CTACATT       CTACATT
CTGTATA       CTGTATA
GAGTGAG       GAGTGAG
GATCCAG       GATCCAG
GCTACAC       GCTACAC
GCTCGAA       GCTCGAA
GGATTCC       GGATTCC
GGTGCAA       GGTGCAA
GTAAGAG       GTAAGAG
GTACCTG       GTACCTG
GTAGTCG       GTAGTCG
GTGTACC       GTGTACC
TCCGACA       TCCGACA
TTAGCCT       TTAGCCT
ACTGTTC       ACTGTTC
AGAATAC       AGAATAC
ATCAATA       ATCAATA
ATGCGCT       ATGCGCT
ATTGTAG       ATTGTAG
CAATGAT       CAATGAT
CATACAT       CATACAT
CATGCCA       CATGCCA
CGCGATA       CGCGATA
CGGCAAT       CGGCAAT
CTGTTAT       CTGTTAT
GAATAGG       GAATAGG
GATTATT       GATTATT
GCCTTAC       GCCTTAC
GTTAGTA       GTTAGTA
TACCATA       TACCATA
TATAATG       TATAATG
TCGAAGG       TCGAAGG
TCGACAT       TCGACAT
TGAAGTG       TGAAGTG
TGTCCAT       TGTCCAT
TGTTAAG       TGTTAAG
AACTTAG       AACTTAG
AAGCTGG       AAGCTGG
AATATCG       AATATCG
AATGTGT       AATGTGT
ACACTTA       ACACTTA
ACGAACC       ACGAACC
AGGTAGG       AGGTAGG
ATAGGCC       ATAGGCC
ATATAAC       ATATAAC
ATGATTC       ATGATTC
CACTCAC       CACTCAC
CAGGAAC       CAGGAAC
CCGGATT       CCGGATT
CGGATGT       CGGATGT
CTAATAA       CTAATAA
CTCTGTG       CTCTGTG
CTGGTGA       CTGGTGA
CTTGTCC       CTTGTCC
GAGCAGA       GAGCAGA
GCAATGG       GCAATGG
GCCGTTA       GCCGTTA
GCTGAAG       GCTGAAG
GTTCTTC       GTTCTTC
GTTGGAT       GTTGGAT
TAACCTT       TAACCTT
TAGCGGT       TAGCGGT
TAGGACG       TAGGACG
TCGCTCA       TCGCTCA
TCGTAAC       TCGTAAC
TGATTAT       TGATTAT
TGCACAG       TGCACAG
TGGATCG       TGGATCG

[0077] In certain embodiments, the two barcodes used herein (for a particular template) do not have the same length (and may or may not be complementary, and may be the same sequence). Such embodiments are shown in exemplary FIG. 15A. This figure shows that the duplex barcodes are not complementary, and are not the same length. Thus, combinations of barcode designs of different lengths further expand the design space that one may employ in the methods and compositions herein. It is noted that one exemplary advantage of this significantly expanded barcode design space is that it reduces the barcode length required for a given level of adaptor pool complexity. This can reduce the length of the adaptor oligonucleotides, and thereby reduce the synthesis cost. In further embodiments, as shown in FIG. 15B, the associated barcodes may contain additional degenerate bases of equal or unequal length.

[0078] In some embodiments, the barcodes (e.g., non-complementary barcodes; same sequence) may have a length of 1, 2, 3, 4, 5, 6, 7, or more nucleotides. In particular embodiments, the non-complementary barcodes have 1, 2, 3 or more mismatches. In particular embodiments, barcodes (e.g., only barcodes) that form pairs with maximal mismatches (Hamming distance) are employed. In other embodiments, the barcodes are exactly the same. In further embodiments, the two barcodes in a combination differ in length, such as by 1, 2, 3, 4, or more nucleotides. In particular embodiments, one of the two barcodes in the combination is not present (zero nucleotides). In other embodiments, the barcodes have certain constant and/or degenerate bases.

[0079] In certain embodiments, the thermostable lysine-mutants employed in the methods, kits, systems, and compositions herein with the single stranded adaptors are as provided in SEQ ID NOs: 1-21 in Tables 1 and 2 below, or N- or C-terminal truncated versions thereof.
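As a rough illustration of how a lysine-mutant relates to its precursor, the Motif I lysine can be located with a pattern search and then substituted. This is a hedged sketch, not the procedure used to generate the listed sequences: it reads Motif I as E-K-x-(D/N/H)-G with x being any residue, the function names are hypothetical, and the input fragment is made up for illustration and is not one of the listed sequences.

```python
import re

# Assumption: Motif I is read as E, K, any residue, then D/N/H, then G.
MOTIF_I = re.compile(r"EK[A-Z][DNH]G")

def motif_lysine_positions(protein):
    """Return the 1-based position of the lysine in every Motif I match."""
    return [match.start() + 2 for match in MOTIF_I.finditer(protein)]

def replace_motif_lysine(protein, replacement="A"):
    """Swap each Motif I lysine for another residue (alanine by default)."""
    if replacement == "K":
        raise ValueError("replacement must differ from lysine")
    residues = list(protein)
    for position in motif_lysine_positions(protein):
        residues[position - 1] = replacement
    return "".join(residues)

fragment = "VVVEEKMNGYNVR"  # made-up precursor fragment, for illustration only
print(motif_lysine_positions(fragment))  # [6]
print(replace_motif_lysine(fragment))    # VVVEEAMNGYNVR
```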

TABLE-US-00007 TABLE1 ThermostableLysine-MutantssDNA/RNALigases SEQID AminoAcidSequence NO: Speciesname MVSSYFRNLLLKLGLPEERLEVLEGKGALAEDEFEGIRYVRFRDSARNFRRG 1 Thermococcus TVVFETGEAVLGFPHIKRVVQLENGIRRVFKNKPFYVEEXVDGYNVRVVKVK kodakarensis DKILAITRGGFVCPFTTERIEDFVNFDFFKDYPNLVLVGEMAGPESPYLVEG PPYVKEDIEFFLFDIQEKGTGRSLPAEERYRLAEEYGIPQVERFGLYDSSKV GELKELIEWLSEEKREGIVMKSPDMRRIAKYVTPYANINDIKIGSHIFFDLP HGYFMGRIKRLAFYLAENHVRGEEFENYAKALGTALLRPFVESIHEVANGGE VDETFTVRVKNITTAHKMVTHFERLGVKIHIEDIEDLGNGYWRITEKRVYPD ATREIRELWNGLAFVD WhereXisanyaminoacidexceptK. MVSSHFKEILMRLGLPEDRIEVLEAKGGITEEEFDGIRYLRFKDSARGLRRG 2 Pyrococcusyayanosii TVVFDEANVILGFPHIKRVVSLRAGVMRIFKRTPFYVEEXVDGYNVRVALVS DRVLAITRGGFVCPFTTERILDFVPEEFFKDYPHLVLVGEMAGPESPYLVEG PPYVEEDIRFFLFDIQEKGTGKSLPVQERLKLAEEYGIPHVKVFGLYTVDRI EDLYDLIERLSREGREGVVMKSPDMKRVVKYVTPFANVNDVKIGAKVFFELP PGYFMSRIMRLAFYVAERRIKGERFEELARNLGKALLEPFVESIWDVEQGDE IAEVFRIRVKRIETAYKMVTHFERLGLNIKIEDIEEVGGMWRITFKRAYDEA TREIRELIGGRAFVD WhereXisanyaminoacidexceptK. MVSSKFKDILYRLGIPEGKVEDLEARGGLVEDKFDDIKYLRIRNSVGKLRRG 3 Pyrococcushorikoshii TVVLNDKFIILGFPHIKRIVNLKNGIKRTFKRGEFYVEEXVDGYNVRVVKFR GKXLGITRGGFICPFTTERISDFIPEEFFKDHPNLILVGEMAGPESPYLVEG PPYVKEDIQFFLFDIQELGTGRSLPVEERLKIAEEYGISHVEVFGKFTYKDL EEIYEIVERLSREGREGIVMKSPDMRKMVKYVTPYANINDIKIGARVFYELP PGYFTSRISRLAFYIAEKRLRGENFEELAKELGKALLQPLVESIHDVEQEDE IAEVFKVRVKKIETAYKMVTHFEKLGLRIEIVDIEEMKGGWRITFKRLYPDA TEEIRELIGGKSFVD WhereXisanyaminoacidexceptK. MKEVVSSVYKEILVKLGLTEDRIETLEMKGGIIEDEFDGIRYVRFKDSAGKL 4 Pyrococcusabyssi RRGTVVIDEEYVIPGFPHIKRIINLRSGIRRIFKRGEFYVEEXVDGYNVRVV MYKGKMLGITRGGFICPFTTERIPDFVPQEFFKDNPNLILVGEMAGPESPYL VEGPPYVKEDIQFFLFDVQEIKTGRSLPVEERLKIAEEYGINHVEVFGKYTK DDVDELYQLIERLSKEGREGIIMKSPDMKKIVKYVTPYANINDIKIGARVFY ELPPGYFTSRISRLAFYLAEKRIKGEEFERVAKELGSALLQPFVESIFDVEQ EEDIHELFKVRVKRIETAYKMVTHFEKLGLKIEIVDIEEIKDGWRITFKRLY PDATNEIRELIGGKAFVD WhereXisanyaminoacidexceptK. 
MENMVSSKFKELLYTLGIPEDKVEILEARGGIMEDEFEGIRYLRFKNSVGKL 5 Pyrococcusfuriosus RRGTVLFEDGTTVFGFPHIKRIVNLSAGVRKIFKSSEFYVEEXVDGYNVRVV KFKDRILGITRGGFICPYTTERIAEFVPEEFFKDHKDLVLVGEMAGPESPYL VEGPPYVKEDIQFFLFDIQDIKTGSSLPVEERLKLAEEYGINHVEVFGRYSY KDIDDLYELIERLSREGREGIVMKSPDMKKIVKYVTPYANINDIKIGARVFY ELPGGYFTSRISRLAFYIAEKKIRGEELHNLALQLGKALLQPLVEAIHDVTQ GDVIAERFRVRVRKIETAYKMVTHFEKLGLEIEIEDIEEIEGGWRVTFKRVY PEATREIRDLIGGKAFVD WhereXisanyaminoacidexceptK. MTWIKNPEPWMVNLVAEKLGLDVERVETLARHGTIRFRGYRDVVYALLRREI 6 Hyperthermus AGHPEGTVVLLERNGARLVPGYPPIQRMVLPTIALPRHFIDKVVVEEXMNGY butylicus NVRLVMFHRKLLAVTRGGFICPYTTARLERLIGGRVRELFREIDPETYTIAG EVVGLENPYTRYFYPEAPRFDYFVFDLFHELKPLPPLERNELLEKYGLKHVR LLGVIDKNDVEMFKQIVAELDREGREGVVAKDPEYRVPPLKYTTSAVNIGDV RYGMRFFMEEGRSFLFSRLLRELFRAYEEGFGDAQLEKLALEFGRAATEPAL ESIRKVAMGDMLYEEFELVFADEVELEEFTSYMAELGVDIVVVSTSREDEGL RARMRKIKDTWIQLRKVLDTGLSPVD WhereXisanyaminoacidexceptK. MTWIKNPEPWMVNLVAEKLGLDVERVETLARHGTIRFRGYRDVVYALLRREI 7 HbuRN12K106A AGHPEGTVVLLERNGARLVPGYPPIQRMVLPTIALPRHFIDKVVVEEAMNGY NVRLVMFHRKLLAVTRGGFICPYTTARLERLIGGRVRELFREIDPETYTIAG EVVGLENPYTRYFYPEAPRFDYFVFDLFHELKPLPPLERNELLEKYGLKHVR LLGVIDKNDVEMFKQIVAELDREGREGVVAKDPEYRVPPLKYTTSAVNIGDV RYGMRFFMEEGRSFLFSRLLRELFRAYEEGFGDAQLEKLALEFGRAATEPAL ESIRKVAMGDMLYEEFELVFADEVELEEFTSYMAELGVDIVVVSTSREDEGL RARMRKIKDTWIQLRKVLDTGLSPVD MASAAEVLASALRAVGVDPGSVDLEALSTRRSVRVSRFEDVVYVGFRRQFRG 8 Aeropyrumpernix VPEGTLVAFRRGEQIVVWGYPSIKRMLLPRVAVPRWFPGPTVLVEEXMNGYN VRVFTLGGMVYAATRGGLICPYTTRRLRRLYGGALQKILEDLGAEGSFIAGE VVGLENPYTRYYYEEAPGFGYFIFDIFKGGRQLPPRVKFSLAPEYGLKTVNL LAEIPATASGVERLYTIVEDLEKRGREGVIVKDPEGRVEPLKYTTSRINIGD IRLGMRYPFEEGRSFLFPRILREIFREWETGRRRYGELGEAILAPAIEAVEA VSRGGRLVEEFELVFANEVEAEEVIAYFASLGVHLEIAGVARGVDGVRVAFR KPRKSEGEIARILETGISPLD WhereXisanyaminoacidexceptK. 
MDENELVNKLSDALGIEYEKLSKHIGRSIRLMKYGELNYVVERRDLLGYREG 9 Staphylothermus TTILLGEEPLIVHGYPSIQRLAFIEGVSKHMIDNVVVEEXMNGYNVRVVYYM marinus NNIYAITRGGYICPYTTARIRKLYSKNIKLAYQEYPDTILVGEVVGTENPYV IYDYPEARGFDYFIFDTMKKDKLQPLRIRDEIAEKYSLKTVRILDIINKRDI DRLKTIINRLEKERREGVVLKDPYQRVPPLKYTTIYINIRDIWEGMRYPFDE GRGYLFSRIVRLIAQGYEYDWNNTELDRIALKLGRAILEPAINSLKKRANGE IIASKYTLVFPSEDDLSKYIEYAESIGMDFIFRVVEKREDGCIVVELFKMKE THNIYTKMLKTGYSPLD WhereXisanyaminoacidexceptK. MIRIPLERWMIEKLAEALNVNIEEAERLARRRNVVRLMKWRNVTYFSLRKDV 10 Pyrolobusfumarii YGLREGTLIAVWPDGYRVVPGYPSIQRVLLPSVALPKHFIDKIVVEEXLNGY NVRVVKLRDEIVAVTRGGLICPYTTQRIRKLYGDKLTSLFREEGEELVVAGE VIGLENPYVRFYYPEAGGFAYFIFDIVHGEKFLPPHERKEIVEKHGLLHVPV LGEIDKNDIKAFRKIIEDLERRGREGVVLKDPEYRVPPLKYTTSFINIHDIE IGMRFPFDEGRNYLFSRILREIFKAVEEGWDDRRLLLAEQNLGKAILEPAIE AVKEVKNGKMLYEEFMLPEDTRDDFEEFLDYMASLGVDIIVAGVEQRSDGSI VARIRKVKDTWREVQKILETGLSPID WhereXisanyaminoacidexceptK. MISPELVKEALKKKKVRSEEAFGLEYLRENDDYKDIPRGTAIFKDFIIWGYP 11 Aquifexaeolicus HIGRIFLLETGLREQFEAPFWVEEXVDGYNTRIFKYGDNYYALSRGGFICPF TTDRLPDLIDLRILDENPDLVICAEVAGPENPYIEESPPYVKEDVQLEVEDE MKKNEQGFLSQEEKMELIEKYNLPHVEILGRETASEEGIKKIKEILKRENEE GREGVVFKEDSERNKRAKYITSYANLMDIKTNAKNMLQLPPEYYTNRILRLV LFMYEEGLKTTEHLYEELGRAFIDGLFQAIEQFEKEHKVYKTFTCKERKKEN AIALLELLSKTSKHIQVKERRLEKEGDYWRLEFDKVFLNMTGLLGHLLSGGI VYD WhereXisanyaminoacidexceptK.

TABLE-US-00008 TABLE2 SEQID AminoAcidSequence NO: Speciesname MTWIHSPESWMLDVVAEALGIDRERVEHLARHRTIRYRVERGILYASLRREV 12 Pyrodictium AGHPEGTVIVFGRGWWRLIPGYPSIQRMVLPSVALPRHFVDKIVVEEXLNGY delaneyi NVRVALIDDRIIAVTRGGFICPYTTSRLERIMGNQLKDMLRELGPEEHVAAG EVIGLENPYTRYFYPEAPRFGYFVFDVFREGKPLPPGWRDEVTEKHGVPHVP VLGVLDKNDIEGFKKIVERLNQEGREGVIVKDPEYRVPPLKYTTPATNIGDI RYGMRFFMEEGRGFLFSRLLREIFRVYEEGLTGPRLDALALELGRAALQPAI ETVKKVAAGDMVYEEFELEFASRSELEEFMDYMQGLGVDLVLVEIREENGLL KTRIRKMKETWLQVRKMLETGLTPID WhereXisanyaminoacidexceptK. MRRDVSQFANKLDIGKVSELLDIPEHRITGALKRKTIQYVWGKKELFRFDKP 13 Candidatus VSSIEGGTSVFTEPFDIVRGFPKISRTLMLSPALQKHESSCRKVAVEEXMNG Methanoperedens YNVRVALIGDALVALTRGGFICPYTTEKAIDLIGYDFFNDHPDLVLCGEMVG PDSPYVPKTFYDIESLDFFVEDIREKITGKPLSVMERRALVDKYGIKSVRLF GEFEIGETHSEITRIIKDLGGSQHEGVVIKDPQMVVPPMKYTSSESNCADLR YAFEFYNDFGRDFFFGRVCREAFQSVEWDEDEESVEKRCRQLGESLLLPMIK TIKKKKDGERIAENVQIRVKSLDTVKEFEEYLKLVGVDAVFEEPEQTGNEYF VRIRKMHQSTNDRTEAILGGQLWS WhereXisanyaminoacidexceptK. MTWIHRPEPWMLDVVADALGLPRERVEELASRRTLRFREFRGLLYASLRRGV 14 Pyrodictium AGHHEGTAVVFGRGWWRVVPGYPPIQRMVLPSVALPRHELDRVVVEEXLNGY occultum NVRVVLVDDRILAVTRGGLICPYTTSRLERLMGDRLREMLRELGPEDHVAAG EVIGLENPYTRYFYPEAPRFGYFVFDIFRGGRPLPPRMRDEAAEKHGVPHVP VLGVLEKTDVEAFKRIVERLDREGREGVVVKDPDYRVPPLKYTTSSTNIGDI RLGMRFFMEEGWSFLFSRILREIFRVYEEGVEGPRLDAIALELGRAALQPAV ETVKKVAGGYMVYEEFELEFAGRDELEEFMDYMQSLGVDVVLVEAREEGGVL RARMRKIKETWIRVRRILETGVSPID WhereXisanyaminoacidexceptK. MGWVQPEPWMVDAVAEALGLERERVESLAKHRTIRFRVERGILYASLRRELG 15 Thermoproteota GYPEGTVVIFGRGWSRVVHGYPPIQRMVLPSVALPRHFVDRIVVEEXLNGYN archaeon VRVVLVDGRLLAVTRGGFICPYTTDRIERLLGGRLREMLRELGEEEHVAAGE VIGLENPYTRYYYPEAPRFGYFVFDIFRSGKPLPPRVRDEATEKHGVPHVPV LGVLDKGDIEGERSIVEALERRGREGVVVKDPEYRVHPLKYTTHATNVGDIR LGMRFFMEEGRGFLFSRLLREIFRAYEQGLQGPRLEKLATEIGLAALEPALE TVRLVAAGEPVYEEFELEFENRDRLEEFLEYMQSLGVDVVVAGTYERDGMLV ARVRKMRDTWLQVRRMLETGLTPID WhereXisanyaminoacidexceptK. 
MFVSESLGLSKHLGETLEERKILREALISHSFFSDVIEAVREDKKFGEIEEG 16 Geoglobus TVVAKTINGVRIVRGFPKIKRALVLNPTLKKHFENEVAVEEXMNGYNVRIAR acetivorans FGKNLYAMTRRGIICPYTTEKARELINPEFFKDHSDLVLCCEAVGEESPYVP KSMYGVEGLDFFVEDIREERTNRPLPVEEKLRLCEEYGLRHATYFGTYDVDV AHDEIKDIISDLAGKGREGVVIKDPEMKLSPLKYTTSQTNAEDLKYAFRFEN DYGKDFMESRIVREGFQSFEFNEGDKEFRERCLRLGMAILKPMVESIREVAL GGKVSEKLRLRFGSLDVMNLFFEQWKRSKVDFEITDIKKDGKDIVVFVNKTM RNTTDKIKAHLEGIPW WhereXisanyaminoacidexceptK. MKFIAEALGVSQAVIEKLNEKNLIRLAFIKHPFERDVIEAYKLERKVGEFEP 17 Archaeoglobus GTLIAKTVEGLRVVRGYPKIKRALTLYPTIKKHFKGEVVLEEXMNGYNVRLV profundus KFGENIYAITRGGFICPYTTEKARRLVNLDFFKDNPKLMLCCEAVGEESPFV PKDVYGVKTIDFYVEDIRDQKTNIALPIKQKEKLAEEYGLKLAPILAEVQVS KAHEIAKEIILELDKRGREGIVIKDPMMRRPPIKYTTSQCNCSDLSYAFRFF EEYGKDEMESRIIREAFQSFEFRENEEKFKDRCLRLGEAILSMVKSIKEVNE GKRIVEKMRLRFYDLEIFELFKEHIRRMGIRAEFSNPKREEDGYVVWVYRHI MSTTDKIKYILAGNLY WhereXisanyaminoacidexceptK. MVSSHFKSLLLELGISRERIEILESKGGIVEDEFEGIRYLRFKDSAGSLRRG 18 Thermococcus TVVFDSHNIILGFPHIKRVVHLENGIKRVFKRKPFYVEEXVDGYNIRVAQIE litoralis GRVFAFTRGGFVCPFTTERIEDFVNMEFFKDYPNLVLCGEMAGPESPYLVEG PPYVKEDIEFFLFDIQEKKTGKSLTVEERLKIAEEYGIPSVEVFGVYDISKI DELKELIEQLSREKREGIVMKSPDMKKIVKYVTPYANVNDIKIGARIFFDLP HGYFMQRIKRLAFYLAEKRVQDEEFEKYARALGRALLEPFVESIWDVSAGEE IAEVFTVRVKHIETAYKMVSHFERLGLKIHIEDIEEMPQGYWRITFKRVYPD ATREIRELWSGHAFVD WhereXisanyaminoacidexceptK. MVSSRFKDILTSLGISEERIEILEAKGGIVEDEYEGLRYLRFKDSAGKLRRG 19 Pyrococcussp. TVVFDFDKIILGFPHIKRVVNLEKGIRRIFKRGEFYVEEXVDGYNVRVTKVG ST04 ERILAITRGGFICPFTTERITDFVPEEFFKDNPNLVLVGEMAGPESPYLVEG PPYVKEDIKFFLEDVQEINTGKSLPVEERLKLAEEYGIPHVEVFGKYTRDDI GELYALIEKLSEEGREGIVMKSPDMKKIVKYVTPYANINDIKIGARVFYELP PGYFTSRISRLAFYIAERKIRDEELRKLAEDLGKALLQPFVEGILDVEQGEE IAETFKIRVKKIETAYKMVTHFEKLGLNIEIVDIEEMDGLWRITEKRVYSDA TEKIKELVGGKAFVD WhereXisanyaminoacidexceptK. 
MKSERGIMKYKDFIYYPFKKGGFGKGSVIIYHNDDVKIVPGYPSIKRLVLLS 20 Ignicoccuspacificus KVPEHFPEGVSVEEXMNGYNVRAMIVGGDVAFITRGGYLCPYTNARLNTLYG DSM13166 EKVKALLEELPPGSFLAGEVVGVENPYVRVKYPEAPYFDYFIFDIFVKTEDG WRQMPVEERHEIVKRHGLRSVRLLGTFESSEAPLKIKEIIDREDKEGREGVV MKDPEYKRSPAKYTGSYTNIGDIREGMRYPFDEGKDYLFPRIVREIFKVYEE GLSDKELERRALELGMAILKPAVESLKEVAQGETLFERFVLRFPHEEDLEEY LNYTRSLGVKVIVEEKWEEGEWIVVKAKKFKNTSNVYRSMLKSGQTPLD WhereXisanyaminoacidexceptK. MVSSYFKGILLNLGLDEERIEVLENKGGIVEDEFEGMRYLRLKDSARSLRRG 21 Palaeococcus TVVEDEHNIILGFPHIKRVVQLENGIRRAFKRKPFYVEEXVDGYNVRVAKIG EKILVFTRGGFVCPFTTERIEDFITLDFFKDYPNMVLCGEMAGPESPYLVEG PPYVKEDIQFFLEDIQEKKTGRSLPVEERLKLAEEYGIPSVEVEGLYDLSRI DELHALIDRLTKEKREGIVMKSPDMKKIVKYVTPYANINDIKIGARIFFDLP HGYFMQRIKRLAFYLAERKIRGEEFDEYARALGKVLLEPFVESIWDISSGDD EIAELFTVRVKKLETAHKMVTHFERLRLKIHIDDIEVLDNGYWRITEKRVYP DATKEMRELWNGHAFVD WhereXisanyaminoacidexceptK. pacificus

[0080] In some embodiments, the sequences in Tables 1 and 2 above are used to perform a sequence search (e.g., using BLAST or PSI-BLAST) to find other thermostable ssDNA/RNA ligases from other species (e.g., by finding those with 30%, 50%, 60%, or more homology). For a particular candidate homolog that is identified, the next step is to determine the growth temperature of the species it is from. In general, a useful single strand ligase candidate comes from a species that has a growth temperature range higher than about 65° C. Next, one can perform a multiple sequence alignment and locate the conserved catalytic motif, EKxxG (x is any amino acid; such as shown in Tables 1 and 2 above). Next, within the catalytic motif, the K is mutated to any other amino acid (e.g., to make a step 3 ligase mutant). In certain embodiments, the lysine (K) in such Motif I is mutated to another amino acid, preferably alanine (A), serine (S), cysteine (C), valine (V), threonine (T), or glycine (G). Such candidate enzymes (e.g., mutant enzymes) can then be screened for ssDNA and ssRNA ligation activities (and thermostability), for example, using the same procedure as in Example 1 below (e.g., replacing the step 3 ligase mutant in Example 1 with the candidate mutant and measuring performance).
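The motif-location and lysine-substitution step described above can be sketched as follows. This is a minimal illustration only: the function name, the short test fragment, and the choice to take the first motif match are assumptions made for this sketch, and in practice the motif position would be confirmed by multiple sequence alignment rather than by a pattern search alone.

```python
import re

# Motif I as described in the text: E-K-x-x-G, where x is any amino acid.
MOTIF = re.compile(r"EK..G")


def mutate_motif_lysine(protein, replacement="A"):
    """Replace the lysine of the first EKxxG motif found.

    Returns (mutant_sequence, 1-based lysine position), or None when the
    motif is absent. Taking the first match is a simplification; a real
    workflow would confirm the catalytic motif by alignment.
    """
    match = MOTIF.search(protein)
    if match is None:
        return None
    k_index = match.start() + 1  # 0-based index of the K in "EK..G"
    mutant = protein[:k_index] + replacement + protein[k_index + 1:]
    return mutant, k_index + 1


# Hypothetical short fragment used only to exercise the function.
fragment = "VVVEEKMNGYNVRLV"
print(mutate_motif_lysine(fragment))
```

Passing a different single-letter code as `replacement` (e.g., "S" or "G") gives the other substitutions listed above.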

[0081] In certain embodiments, the single stranded adaptors disclosed herein are used in library preparation (library prep) and/or then in sequencing methods, such as in attaching adaptors to library fragments for subsequent sequencing. For example, in some embodiments, the disclosure provided herein finds use in a Second Generation (a.k.a. Next Generation or Next-Gen), Third Generation (a.k.a. Next-Next-Gen), or Fourth Generation (a.k.a. N3-Gen) sequencing technology including, but not limited to, pyrosequencing, sequencing-by-ligation, single molecule sequencing, sequence-by-synthesis (SBS), sequence-by-binding, semiconductor sequencing, massive parallel clonal, massive parallel single molecule SBS, massive parallel single molecule real-time, massive parallel single molecule real-time nanopore technology, etc. Morozova and Marra provide a review of some such technologies in Genomics, 92:255 (2008), herein incorporated by reference in its entirety.

[0082] Any number of DNA sequencing techniques are suitable, including fluorescence-based sequencing methodologies (See, e.g., Birren et al., Genome Analysis: Analyzing DNA, 1, Cold Spring Harbor, N.Y.; herein incorporated by reference in its entirety). In some embodiments, the present disclosure finds use in automated sequencing techniques understood in the art. In some embodiments, the present technology finds use in parallel sequencing of partitioned amplicons (PCT Publication No: WO2006084132, herein incorporated by reference in its entirety). In some embodiments, the technology finds use in DNA sequencing by parallel oligonucleotide extension (See, e.g., U.S. Pat. Nos. 5,750,341, and 6,306,597, both of which are herein incorporated by reference in their entireties). Additional examples of sequencing techniques in which the technology finds use include the Church polony technology (Mitra et al., 2003, Analytical Biochemistry 320, 55-65; Shendure et al., 2005 Science 309, 1728-1732; U.S. Pat. Nos. 6,432,360, 6,485,944, 6,511,803; all of which are herein incorporated by reference in their entireties), the 454 picotiter pyrosequencing technology (Margulies et al., 2005 Nature 437, 376-380; US20050130173; herein incorporated by reference in their entireties), the Solexa single base addition technology (Bennett et al., 2005, Pharmacogenomics, 6, 373-382; U.S. Pat. Nos. 6,787,308; 6,833,246; herein incorporated by reference in their entireties), the Lynx massively parallel signature sequencing technology (Brenner et al. (2000). Nat. Biotechnol. 18:630-634; U.S. Pat. Nos. 5,695,934; 5,714,330; all of which are herein incorporated by reference in their entireties), and the Adessi PCR colony technology (Adessi et al. (2000). Nucleic Acid Res. 28, E87; WO 00018957; herein incorporated by reference in its entirety). In certain embodiments, the library preparation and sequencing technologies are as described in any of the following U.S. patents, each of which is herein incorporated by reference: 9,752,188; 10,570,451; 11,479,807; 8,383,345; 10,876,172; 9,598,731; 9,902,992; 10,801,063; 11,091,797; 8,532,930; 9,639,657; and 10,011,870.

[0083] In certain embodiments, the 3′ end of the single-stranded adaptors herein has a 3′ end blocking group, such as a phosphate group or a modified non-canonical nucleotide, to prevent ligation between the 5′-adenylated end and the 3′ end of the adaptor. If the blocking group is, for example, a nucleotide with a 3′-phosphate, such blocking group can be removed by T4 polynucleotide kinase (T4 PNK), calf intestinal alkaline phosphatase (CIP), or shrimp alkaline phosphatase (SAP). In certain embodiments, where CIP or SAP is used, one may employ a round of T4 PNK to re-phosphorylate the 5′-end of the proximal strand.

[0084] Another class of base blocking and de-blocking approaches can leverage the reversible terminator technologies widely used in the sequencing industry.

[0085] In certain embodiments, the single-strand adaptors herein comprise a non-canonical base in the cleavable region, such as inosine, uracil, 5-formylcytosine, 5-carboxylcytosine, or 8-oxoguanine. In some embodiments, wherein inosine is present in the cleavable region, Endonuclease V or a similar enzyme is used to perform strand cleavage. In certain embodiments, wherein uracil, 5-formylcytosine, or 5-carboxylcytosine is present in the cleavable region, a combination of uracil DNA glycosylase (UDG) and Endonuclease VIII (or similar enzymes) is employed for strand cleavage. In particular embodiments, where 8-oxoguanine is present in the cleavable region, a thermostable OGG (oxoguanine glycosylase) is employed for strand cleavage. In particular embodiments, a cleavable backbone linkage can be, for example, a phosphorothioate DNA linkage that is cleaved by iodine (I.sub.2) (Qiang Huang et al. Origin of iodine preferential attack at sulfur in phosphorothioate and subsequent P-O or P-S bond dissociation, PNAS, vol 119, 2022).

[0086] In particular embodiments, the cleavable region is cut with an endonuclease. For example, a secondary oligo (as shown in FIG. 11C) is added to the sample, hybridizes to the single strand adaptor, and creates a cleavage site in the cleavable region (e.g., which can be cleaved by a restriction endonuclease). In some embodiments, while most of the single strand adaptor is composed of DNA, the cleavable region contains one or more RNA bases, allowing cleavage by an RNase H, RNase H2, or site-specific endonuclease.

[0087] In certain embodiments, the methods and compositions preserve methylation on 3′ and 5′ protruding ends of target DNA as, for example, no end-repair or A-tailing is required of the target DNA. Conventional library prep uses an end-repair and A-tailing (ER/AT) step as part of its workflow. The enzymatic reactions in ER/AT either "write" to the starting DNA templates by filling in the 3′-recessed ends or "erase" from the starting DNA templates by removing the 3′-protruding ends, as shown in FIG. 16A, losing information about potential methylation in protruding ends. The DNA patches that are filled in during this step are void of biologically relevant base modifications, such as 5-methylcytosine (5mC), 5-hydroxymethylcytosine (5hmC), etc. As a result, they introduce a dilution effect at the CpG sites near the ends of the template DNA. Such artifacts exist extensively in conventional genome-wide methyl-seq results (see, e.g., Jiang et al., Genome Res. 2020 August; 30 (8): 1144-1153, herein incorporated by reference, particularly FIG. 1). As the methods and compositions herein do not require ER/AT, no writing is involved during the library preparation process. As a result, no artifactual DNA patches are introduced during the library preparation workflow and high-fidelity methylome data can be obtained (e.g., by bisulfite and similar methods). In other words, most methods rely on bisulfite conversion of DNA to detect unmethylated cytosines (which changes unmethylated cytosines to uracil during library preparation); converted bases are identified (after PCR) as thymine in the sequencing data, and read counts are used to determine the percentage of methylated cytosines. Thus, the methods and compositions herein allow for more realistic and higher fidelity methylation detection (e.g., which has implications for more accurate cancer detection).
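The read-count calculation and the dilution effect described above can be illustrated with a short sketch. The function names and the numeric values are hypothetical and chosen only for illustration; the dilution function assumes, as a simplification, that every filled-in patch reads as unmethylated at the site in question.

```python
def percent_methylation(c_reads, t_reads):
    """Percent methylation at one cytosine after conversion sequencing.

    c_reads: reads showing C (protected, i.e., methylated)
    t_reads: reads showing T (converted, i.e., unmethylated)
    """
    total = c_reads + t_reads
    if total == 0:
        raise ValueError("no coverage at this site")
    return 100.0 * c_reads / total


def observed_fraction_with_fill_in(true_fraction, filled_fraction):
    """Apparent methylated fraction when a share of the reads covering a
    site derives from unmethylated fill-in patches (simplifying assumption:
    filled-in bases always read as unmethylated)."""
    return true_fraction * (1.0 - filled_fraction)


# Hypothetical numbers: 80 protected reads and 20 converted reads.
print(percent_methylation(80, 20))
# A site that is truly 80% methylated, with 25% of covering reads filled in.
print(observed_fraction_with_fill_in(0.8, 0.25))
```

Under these assumptions, the apparent methylation level near fragment ends drops in proportion to the share of reads carrying filled-in sequence, which is the dilution that an ER/AT-free workflow avoids.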

[0088] In certain embodiments, the first and/or second nucleic acid sequence (e.g., that make up the adaptors) is/are methylated at every (or almost every) cytosine present. In certain embodiments, the methylation is 5-methylcytosine (5mC) and/or 5-hydroxymethylcytosine (5hmC). Such methylation, for example, can be introduced during synthesis of the nucleic acid sequences. During conversion (e.g., as part of bisulfite sequencing methods herein), 5-methylcytosine (or 5hmC) will not be converted, while unmodified cytosine will be converted to uracil. The conversion can be either chemical based (such as bisulfite conversion) or enzyme based (such as EM-seq). Methylation of the nucleic acid sequences herein prevents, for example, any elements present in the adaptors (such as flow cell binding sequences and universal primer binding sites) from having their sequences changed during bisulfite conversion (or other conversion). In this regard, the adaptor sequences can still bind to the universal primers employed and can still bind to flow cell sequences (e.g., such as in Illumina sequencers). An exemplary workflow employing such methylated nucleic acid sequences is shown in FIG. 16B.
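The protective effect of adaptor methylation can be seen in a simple in-silico conversion. The sketch below is an illustration under stated assumptions, not a model of any particular conversion chemistry: the function name is hypothetical, conversion is treated as complete, and the converted base is written as T (the base read after PCR). The 13-nt string is the start of the adaptor sequence listed in Table 3.

```python
def convert_unmethylated_c(seq, methylated_positions):
    """Simulate conversion followed by PCR on one strand.

    Unmethylated C is read as T; C at any 0-based index listed in
    methylated_positions is protected and stays C.
    """
    converted = []
    for index, base in enumerate(seq):
        if base == "C" and index not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


adaptor_fragment = "AGATCGGAAGAGC"
every_c = {i for i, base in enumerate(adaptor_fragment) if base == "C"}

# Fully methylated adaptor: sequence is unchanged, so primers still anneal.
print(convert_unmethylated_c(adaptor_fragment, every_c))
# Unmethylated adaptor: both cytosines are read as T.
print(convert_unmethylated_c(adaptor_fragment, set()))
```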

[0089] In certain embodiments, rolling circle amplification is performed instead of PCR amplification, as shown in FIGS. 18A and 18B. In FIG. 18A, after the first and second adaptors are ligated onto the duplex DNA molecule and dumbbell loop structures are formed, a primer that is designed to anneal to the adaptor sequence is added to initiate rolling-circle amplification (RCA), to generate concatenated single-strand DNA with multiple copies of the duplex DNA templates. In FIG. 18B, after the first and second adaptors are ligated to the duplex DNA molecule, duplex DNA is denatured and subject to deblocking/ligation, so that each strand forms single strand circles with the adaptor sequence embedded inside. A primer that is designed to anneal to the adaptor sequence is added to initiate rolling-circle amplification (RCA), to generate concatenated single-strand DNA with multiple copies of the strand-specific DNA templates.

EXPERIMENTAL

Example 1

Single-Stranded End-Preserving Adaptors for Library Preparation

[0090] In this Example, a library preparation approach (DISTAL-seq) is illustrated based on a ligation schema termed duplex-retaining single strand tail ligation (DISTAL ligation). In certain embodiments of DISTAL ligation, a 5′-adenylated single strand adaptor DNA is ligated to the 3′-ends of the duplex DNA at elevated temperature (e.g., 75° C.), catalyzed by a step 3 ligase mutant (HyperLigase in this example), with the DNA duplex still retained (see, FIGS. 1 and 2A). After DISTAL ligation, since the 3′-end of the adaptor is brought into proximity with the 5′-end of the duplex DNA, a more efficient intramolecular ligation is readily feasible. The resulting dumbbell-shaped DNA can then be processed for PCR enrichment (FIG. 2A). A unique feature of embodiments of DISTAL-seq is that the ligations at the 5′-ends and the 3′-ends of the library duplex DNA occur in single strand form, so that conventional ER/AT is not necessary.

[0091] Based on embodiments of DISTAL-seq, DUET-seq (duplex end restoration sequencing) was developed by incorporating strand-specific unique molecular identifiers (UMIs, aka barcodes, which can be unique or non-unique) into the single strand adaptor, so that a portion of the resulting reads can be paired to their original duplex form (FIG. 3A). As a result of the strand pairing, native DNA ends are also restored. For validation, certain embodiments of DUET-seq were shown to recover known DNA ends with no strand swapping from a pool of restriction enzyme digested DNA. Embodiments of DUET-seq were then used to compare the end profiles between sonicated genomic DNA and enzymatically fragmented DNA, and revealed that while ends of the sonicated DNA can be either 5′-single-strand protruding or 3′-single-strand protruding with equal probability and tend to be rather short, ends from enzymatic fragmentation are predominantly 3′-single-strand protruding and have a wider length distribution. Finally, embodiments of DUET-seq were applied to a cell free plasma DNA sample, for which the ends are generated and polished in an in vivo setting. Intriguingly, predominantly 3′-single-strand protruding ends were discovered in the cell free plasma DNA.
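Once the two strands of a duplex have been paired, the end configuration follows from comparing their reference coordinates. The sketch below is a hypothetical illustration of that comparison and is not the custom script used in this example; the function name and the half-open coordinate convention are assumptions. It relies only on strand geometry: the top strand has its 5′ end on the left and its 3′ end on the right, and the bottom strand is the reverse.

```python
def classify_ends(top, bottom):
    """Classify both duplex ends from paired strand coordinates.

    top, bottom: (start, end) half-open reference intervals covered by the
    top (forward) and bottom (reverse) strand of one duplex.
    Returns {"left": (kind, length), "right": (kind, length)}.
    """
    top_start, top_end = top
    bottom_start, bottom_end = bottom

    def describe(kind_if_top_longer, kind_if_bottom_longer, difference):
        if difference == 0:
            return ("blunt", 0)
        if difference > 0:
            return (kind_if_top_longer, difference)
        return (kind_if_bottom_longer, -difference)

    # Left end: the top strand contributes its 5' end, the bottom its 3' end.
    left = describe("5' protruding", "3' protruding", bottom_start - top_start)
    # Right end: the top strand contributes its 3' end, the bottom its 5' end.
    right = describe("3' protruding", "5' protruding", top_end - bottom_end)
    return {"left": left, "right": right}


# Hypothetical coordinates for one paired duplex.
print(classify_ends((100, 260), (105, 250)))
```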

[0092] In summary, this example demonstrates the principles and utilities of certain embodiments of DISTAL ligation, and the library preparation workflows (embodiments of DISTAL-seq and DUET-seq) that start with it. These methods provide advantageous alternatives to the conventional ER/AT-based NGS preparation methods.

METHODS AND MATERIALS

Single-Strand Adaptor Preparation

[0093] Two types of single-strand adaptors were prepared and used in this example, one for an embodiment of the regular DISTAL-seq workflow and one for duplex end-restoration sequencing (DUET-seq), as listed in Table 3. All oligos were ordered from Integrated DNA Technologies (IDT). All oligo purification was done using the Monarch DNA purification kit (NEB). For Ampure clean-up, Ampure XP beads were purchased from Beckman-Coulter.

TABLE-US-00009 TABLE 3 Single-strand adaptor oligos and PCR primers used in this example

Name                SEQ ID NO:  Purification  Sequence
ILMNAda2            22          HPLC          NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC/ideoxyU/ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNNN/3Phos/
dupada_3            23          HPLC          GTCGABBBBBBBBAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC/ideoxyI/TACACTCTTTCCCTACACGACGCTCTTCCGATCT
pBR322_4288F        24          Desalted      ACTCTTCCTTTTTCAATATTATTGAAG
pBR322_85R          25          Desalted      ACACGGTGCCTGACTGCGTTAGCAATTTAAC
pBR322_4320R        26          Desalted      AGGTTAATGTCATGATAATAATGGTTTC
lambda_4500F        27          Desalted      GGGTTTTCTTTTGTGCGCTTGCAGGCCAGC
lambda_4800R        28          Desalted      AGCAGAATGCCGTCCACCATCGGATCGCTGG
lambda_4500F_3res   29          Desalted      GGGTTTCCTCAGCTCTTTTGTGCGCTTGCAGGCCAGC
lambda_4800R_3res   30          Desalted      AGCAGAACCTCAGCTGCCGTCCACCATCGGATCGCTGG
lambda_4500F_5res   31          Desalted      GGGTTTTGCTGAGGCTTTTGTGCGCTTGCAGGCCAGC
lambda_4800R_5res   32          Desalted      AGCAGAATGCTGAGGGCCGTCCACCATCGGATCGCTGG

1. List of base modifications: /ideoxyU/: internal dU; /ideoxyI/: internal dI; /3Phos/: 3′-phosphate; B = C/G/A; N = A/C/G/T
2. Underlined sites denote Nb.BbVCI recognition sites.

[0094] To prepare a DISTAL-seq adaptor, ILMNAda2 (Table 3) was resuspended in water to 20 uM. 5′-phosphorylation was done by mixing 26 ul ILMNAda2, 3 ul 10× T4 ligase buffer (NEB), and 1 ul T4 polynucleotide kinase (3′ phosphatase minus) (NEB). The reaction was incubated at 37° C. for 1 hour and column purified with 30 ul elution volume. 5′-adenylation was done using the 5′-adenylation kit (NEB), by mixing 6 ul phosphorylated ILMNAda2, 2 ul 10× adenylation buffer, 2 ul of 1 mM ATP, and 2 ul Mth RNA ligase in a 20 ul reaction volume. The reaction was incubated at 65° C. for 1 hour and heat-inactivated at 85° C. for 5 minutes. The reaction was then column purified with 10 ul elution volume in low-TE buffer.

[0095] For an exemplary DUET-seq adaptor, dupada_3 (Table 3) was first resuspended in water to 20 uM. 3′-extension was done by mixing 26 ul dupada_3, 5 ul 10× reaction buffer, 1 ul dA/U/C/GTP mix (10 mM each), and 1 ul Taq-Klenow (cat #TT-100, MCLAB) in a total volume of 50 ul. The extension reaction was programmed to first heat to 95° C. for 1 min, and then 68° C. for 10 min. The reaction was column purified with 26 ul elution volume. Since the extension incorporates a single dU in the newly synthesized 3′ end, USER reagent (NEB) (uracil DNA glycosylase and Endo VIII) was then used to cleave at the dU site and generate the 3′-phosphate end. This was done by mixing 26 ul extended dupada_3, 3 ul 10× T4 ligase buffer (NEB), 1 ul T4 polynucleotide kinase (3′ phosphatase minus) (NEB), and 0.5 ul USER reagent (uracil DNA glycosylase and Endo VIII). The reaction was incubated at 37° C. for 40 min and column purified with 25 ul elution volume. 5′-adenylation was done similarly as above with the 5′-adenylation kit (NEB), by mixing 6 ul 5′- and 3′-phosphorylated dupada_3, 2 ul 10× adenylation buffer, 2 ul of 1 mM ATP, and 2 ul Mth RNA ligase in a 20 ul reaction volume. The reaction was incubated at 65° C. for 1 hour and heat-inactivated at 85° C. for 5 minutes, and was then column purified with 10 ul elution volume in low-TE buffer.

DISTAL Ligation and DNA Substrate Preparation

[0096] Purified HyperLigase was obtained from RGENE Inc. (10). Cloning and purification of HyperLigase were described earlier (10, herein incorporated by reference).

[0097] A typical 50 ul HyperLigase ligation is composed of: 5 ul 10× HyperLigase reaction buffer (700 mM Tris, pH=7.5), 5 ul MnCl.sub.2 (100 mM), 5 ul adenylated adaptor (10 uM), 15 ul 40% (w/v) PEG8000, 1.5 ul 5 M NaCl, 2.5 ul purified HyperLigase, and 15 ul input sample solution containing duplex DNA. Reactions are incubated in a PCR machine at 75° C. (with heated lid on) for 6 hours. The reaction series in FIG. 1 were incubated at various temperatures in separate runs (not with the gradient option). Reactions were then purified twice using 1× bead clean-up and run on a Tapestation using High Sensitivity D1000 ScreenTape (Agilent). For the reaction using thermostable App DNA ligase (cat #M0319, NEB), the reaction was set up according to the manufacturer's recommendation and incubated at 65° C. for 6 hours.
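As a worked check of the recipe above, final concentrations follow from stock concentration times added volume divided by the 50 ul reaction volume. The sketch below is illustrative arithmetic only. It assumes that the stated 700 mM Tris refers to the 10× buffer stock; the variable and function names are hypothetical.

```python
REACTION_VOLUME_UL = 50.0

# (component, volume added in ul, stock concentration, unit); the stock is
# None where the text gives no concentration.
COMPONENTS = [
    ("Tris from 10x buffer (assumed 700 mM stock)", 5.0, 700.0, "mM"),
    ("MnCl2", 5.0, 100.0, "mM"),
    ("adenylated adaptor", 5.0, 10.0, "uM"),
    ("PEG8000", 15.0, 40.0, "% w/v"),
    ("NaCl", 1.5, 5000.0, "mM"),
    ("HyperLigase", 2.5, None, ""),
    ("input DNA sample", 15.0, None, ""),
]


def final_concentration(volume_ul, stock, total_ul=REACTION_VOLUME_UL):
    """Dilution arithmetic: C_final = C_stock * V_added / V_total."""
    return stock * volume_ul / total_ul


for name, volume, stock, unit in COMPONENTS:
    if stock is not None:
        print(f"{name}: {final_concentration(volume, stock):g} {unit}")

# Sum of the listed volumes, for comparison with the nominal 50 ul.
listed_total = sum(volume for _, volume, _, _ in COMPONENTS)
print(f"listed volumes sum to {listed_total:g} ul")
```

Under these assumptions the reaction contains roughly 70 mM Tris, 10 mM MnCl.sub.2, 1 uM adaptor, 12% (w/v) PEG8000, and 150 mM NaCl.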

[0098] For DNA duplex substrates used in FIG. 1, primers were designed based on the pBR322 sequence (Table 3). pBR322_4280F and pBR322_4360R were used to generate a 180 bp amplicon, and pBR322_4280F and pBR322_85R were used to generate a 320 bp amplicon. PCRs were done using Taq 2× master mix (cat #M0270, NEB) following the manufacturer's recommended protocol. To generate duplex DNA with defined 3′-single-strand protruding and 5′-single-strand protruding ends, PCRs using lambda_4500F_3res/lambda_4800R_3res and lambda_4500F_5res/lambda_4800R_5res as primers, with lambda DNA as the template, were done separately. Purified PCR products were then subjected to Nb.BbVCI (NEB) digestion at 37° C. for 2 hours. Purified digested DNA with defined 3′-single-strand protruding and 5′-single-strand protruding ends was then used in DISTAL ligation. Blunt-ended duplex DNA was generated using lambda_4500F/lambda_4800R as PCR primers, purified, and used in DISTAL ligation.

DISTAL-Seq and DUET-Seq Library Preparation

[0099] Genomic DNA of E. coli O157 strain EDL933 was ordered from Sigma (cat #IRMM449). Human genomic DNA extracted from blood (buffy coat) was purchased from Sigma (cat #11691112001, Roche). Human cell-free plasma DNA was purchased from PlasmaLab International (Everett, WA). Fragmentation was done either using a Covaris M220 or using NEBNext dsDNA fragmentase (NEB cat #M0348), following the manufacturer's instructions. For sonicated genomic DNA, an extra round of end polishing was done by treating the DNA with T4 polynucleotide kinase (PNK) in T4 ligase buffer, followed by purification by bead clean-up.

[0100] In this example, exemplary DISTAL-seq and DUET-seq start directly with HyperLigase ligation using 50 ng fragmented DNA. Briefly, a 50 ul reaction consists of: 5 ul 10× reaction buffer, 5 ul MnCl.sub.2, 2.5 ul HyperLigase, 15 ul 40% (w/v) PEG8000, 5 ul adaptor, 1.5 ul 5 M NaCl, and 15 ul DNA solution. The reaction was incubated at 75° C. for 6 hours and purified by 2 rounds of 1× Ampure beads clean-up with an elution volume of 26 ul in water. De-blocking of the adaptor 3′-ends was done using 26 ul DNA from the previous step, 3 ul 10× T4 ligase buffer, and 1 ul T4 PNK. The reaction was incubated at 37° C. for 40 min, after which another 1× beads clean-up was done with an elution volume of 10 ul. Circularization was done using 10 ul DNA solution from the previous step, 1 ul 10× CircLigase reaction buffer, 0.5 ul CircLigase II, and 0.5 ul MnCl.sub.2 (Biosearch Technologies, cat #CL9021). The reaction was incubated at 60° C. for 1 hour, after which another 1× Ampure clean-up was done with an elution volume of 18 ul. Dumbbell DNA digestion was done using 18 ul DNA solution from the previous step, 2 ul of 10× rCutSmart, and 0.2 ul of endonuclease. For regular DISTAL-seq, in which an internal dU is embedded within the adaptor, USER reagent (uracil DNA glycosylase and Endo VIII) was used; for DUET-seq, in which an internal dI is embedded within the adaptor, Endonuclease V was used (cat #M0305, NEB). The digestion reaction was incubated at 37° C. for 15 min followed by heat-inactivation at 65° C. for 20 min. For PCR enrichment, 25 ul of Kapa HiFi HotStart ReadyMix (KR0370, Roche Sequencing) and 5 ul UDI primer mix (part number 10005922, IDT) were added to the digestion reaction mix. PCR conditions followed the manufacturer's protocol. For the E. coli library, 12 cycles were performed; for DUET-seq, 16 cycles were performed. After PCR, purification was done by 0.9× beads clean-up.

Sequencing and Data Analysis

[0101] All libraries were quantified by qPCR (KR0405, Roche Sequencing) and pooled based on library concentration and planned read allocation. Sequencing was carried out for 2×151 cycles on an Illumina NextSeq 500 using a High-output kit according to the manufacturer's protocol.

[0102] Programs and commands with parameters used to process the read data are listed in Table 4.

TABLE-US-00010 TABLE 4 Computational programs and commands for analysis

Program: Trimmomatic
Notes: Read trimming. s7 is the representative sample used.
Command: java -jar ~/Trimmomatic-0.39/trimmomatic-0.39.jar PE s7_R1.fastq.gz s7_R2.fastq.gz s7_paired_R1.fq.gz output_forward_unpaired.fq.gz s7_paired_R2.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:True LEADING:3 TRAILING:3 MINLEN:36

Program: Picard
Notes: Merge read 1 and read 2
Command: picard FastqToSam F1=s7_paired_R1.fq.gz F2=s7_paired_R2.fq.gz O=s7_unaligned.bam SM=s7

Program: Fgbio
Notes: Extract UMIs from reads
Command: java -jar ~/picard/fgbio-2.2.0.jar ExtractUmisFromBam --input=s7_unaligned.bam --output=s7_unaligned_withumi.bam --read-structure=8M143T 13M138T --molecular-index-tags=ZA ZB --single-tag=RX

Program: picard
Notes: Regenerate fastq read file
Command: picard SamToFastq I=s7_unaligned_withumi.bam F=s7_unaligned_withumi.fastq INTERLEAVE=TRUE

Program: Bwa
Notes: Alignment to reference
Command: bwa mem -t 4 -p ~/reference/human/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna s7_unaligned_withumi.fastq -o s7_aligned_withoutumi.sam

Program: samtools
Notes: Conversion of sam to bam
Command: samtools view s7_aligned_withoutumi.sam -b -o s7_aligned_withoutumi.bam

Program: picard
Notes: Merge bam, adding UMI info back
Command: picard MergeBamAlignment UNMAPPED=s7_unaligned_withumi.bam ALIGNED=s7_aligned_withoutumi.bam O=s7_aligned_withumi.bam R=/home/zhengyuhudson/reference/human/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna SO=coordinate ALIGNER_PROPER_PAIR_FLAGS=true MAX_GAPS=1 ORIENTATIONS=FR VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true

Program: picard
Notes: De-duplication
Command: picard MarkDuplicates I=s7_aligned_withumi.bam O=markduplicates_umi.bam M=markduplicates.txt BARCODE_TAG=RX REMOVE_DUPLICATES=true

[0103] Briefly, adaptor sequences and polyG sequences were first trimmed off reads using Trimmomatic (11). Reads were then aligned to the reference genomes using bwa (12), and files were processed with samtools (13). Alignment statistics were generated by Picard tools (14). UMI-aware read processing was done using fgbio (15). Custom scripts for pairing duplex strands and end restoration were written in Perl.
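The read structures given in Table 4 (8M143T for read 1 and 13M138T for read 2) describe how each 151-base read divides into molecular-barcode (M) and template (T) bases. The sketch below illustrates that split in plain Python. It is not fgbio's implementation, the function name is hypothetical, and it handles only fixed-length M and T segments.

```python
import re


def split_by_read_structure(read, structure):
    """Split a read into (umi, template) using a structure such as "8M143T".

    Only fixed-length M (molecular barcode) and T (template) segments are
    handled in this sketch.
    """
    segments = re.findall(r"(\d+)([MT])", structure)
    expected_length = sum(int(length) for length, _ in segments)
    if expected_length != len(read):
        raise ValueError("read length does not match the read structure")
    umi_parts, template_parts = [], []
    position = 0
    for length, kind in segments:
        piece = read[position:position + int(length)]
        (umi_parts if kind == "M" else template_parts).append(piece)
        position += int(length)
    return "".join(umi_parts), "".join(template_parts)


# Hypothetical 151-base read: an 8-base UMI followed by 143 template bases.
read_1 = "ACGTACGT" + "G" * 143
umi, template = split_by_read_structure(read_1, "8M143T")
print(umi, len(template))
```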

RESULTS

Biochemical Characteristics of DISTAL Ligation

[0104] In this Example, DISTAL ligation is catalyzed by a mutant thermostable single strand ligase (named HyperLigase), which originates from Hyperthermus butylicus, a hyperthermophilic archaebacterium that grows optimally between 95° C. and 106° C. (16). A lysine to alanine mutation was introduced at the catalytic site so that the mutant ligase is only capable of ligation between a 5′-adenylated end and a 3′-hydroxyl end of single strand DNA/RNA at elevated temperature (up to 95° C.) (10). Another mutant ligase (thermostable Mth App ligase, NEB cat #M0319) from Methanobacterium thermoautotrophicum, with an optimal reaction temperature of 65° C., was reported in (17) and was also tested in this example.

[0105] Although the 3′-hydroxyl end is typically provided by a single strand DNA, the hypothesis tested here is whether duplex DNA, with the duplex ends transiently separating into single strand conformation due to thermodynamics (DNA breathing (18)), can support ligation at its 3′-ends catalyzed by single strand DNA ligases. The duplex is ideally retained before and after the ligation (hence the name), with the benefit of retaining strand pairing signals for the later end restoration (see DUET-seq below).

[0106] To test the impact of the reaction temperature, DISTAL ligation reactions were set up and incubated at various temperatures for 6 hours (FIG. 1). As shown in FIG. 1A, both the one-sided ligation product and the two-sided ligation product can be readily seen above the starting duplex DNA substrate, which is around 320 bp (lane 8, FIG. 1A). This first observation demonstrates that HyperLigase ligation is clearly feasible at elevated temperature. The intensity of the two-sided ligation product band peaks at 75° C. and decreases at either higher or lower temperatures. When the reaction temperature drops below 70° C., the ligation reaction becomes inefficient, so that there is almost no observable two-sided ligation product (FIG. 1A). A similar experimental setup was repeated for a duplex DNA of a different size (180 bp). FIG. 1B shows that, in general, the same trends hold for the smaller-sized duplex DNA. Testing HyperLigase ligation near the 180 bp range is important because, for some types of DNA, such as cell free plasma DNA, the majority of the fragments may fall in the vicinity of that size range.

[0107] As a comparison, the thermostable Mth App ligase (NEB) was used to test ligation between duplex DNA and single strand DNA at 65° C. for 6 hours. However, almost no ligation products were observed (lane 9, FIG. 1A). This observation, together with the low ligation efficiency below 70° C. observed for HyperLigase (FIGS. 1A and 1B), further supports the importance of reaction temperature.

[0108] The quantitative conversion efficiency of hyperligase ligation, measured as the percentage of substrate converted to either one-sided or two-sided ligation product, is shown in FIG. 1C. At 75° C., more than 80% of the duplex DNA is ligated at one end or at both ends, with 30-40% being two-sided and 40-50% being one-sided (FIG. 1C). Since only the two-sided product will be efficiently amplified during PCR (DISTAL-seq, FIG. 2A), this suggests that, for the conditions in this Example, at most about 30% to 40% of the duplex DNA fragments are converted to sequenceable DNA templates. Time-course experiments suggest that ligation products keep accumulating over time but appear to reach a plateau after 6-8 hours (data not shown).
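
As a rough consistency check on these percentages, one can ask what a simple model would predict if each 3′-end of a duplex were ligated independently with the same probability p. The independence assumption is not made in this Example and is used here only for illustration; the function name `ligation_fractions` is hypothetical.

```python
def ligation_fractions(p: float) -> dict:
    """Fractions of duplexes with 0, 1 or 2 ligated 3'-ends.

    Illustrative assumption (not from the source): each of the two
    ends is ligated independently with the same probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    return {
        "unligated": (1 - p) ** 2,      # neither end ligated
        "one_sided": 2 * p * (1 - p),   # exactly one end ligated
        "two_sided": p ** 2,            # both ends ligated
    }


fractions = ligation_fractions(0.6)
```

Under this sketch, a per-end ligation probability of 0.6 gives about 36% two-sided and 48% one-sided product (84% ligated in total), which falls within the ranges reported for 75° C. It also makes explicit why the two-sided fraction, and hence the sequenceable fraction, rises quickly with any gain in per-end efficiency.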

[0109] A notable feature of the hyperligase ligation in FIG. 1 is that ligation can only occur between the 5′-adenylated end of the single strand adaptor DNA and the 3′-hydroxyl end of the duplex DNA. The 3′-end of the single strand adaptor DNA is modified (3′-phosphate in this Example) so that self-circularization or cross-concatenation of the adaptor is blocked. No ATP is added to the ligation reaction, so no additional 5′-adenylated ends can be generated, which also precludes ligation among the input DNA molecules themselves, whether in single strand or duplex form. The uni-directional nature of the hyperligase ligation increases efficiency and prevents sequencing artifacts due to the formation of chimeras.

[0110] To investigate the potential impact of end configuration on hyperligase ligation, duplex DNA with 5′-single-strand protruding ends or 3′-single-strand protruding ends was prepared by subjecting blunt-end PCR DNA to nicking enzyme (Nb.BbvCI) digestion. Nicking enzyme recognition sites were introduced through the primers used in PCR (Table 3). After digestion, the 5′-single-strand protruding ends have a 9-10 nt single strand portion, while the 3′-single-strand protruding ends have 11-12 nt, on either side of the duplex DNA. The digested DNA was then used as substrate for hyperligase ligation and compared with blunt-end DNA, as shown in FIG. 5. The results show that hyperligase ligation does not exhibit bias among the ends tested. This conclusion supports the use of hyperligase ligation to attach adaptors to the 3′-ends of a complex library with heterogeneous DNA end configurations.

DISTAL Ligation Enables Sequencing Library Preparation (DISTAL-seq)

[0111] FIG. 2A illustrates a sequencing library preparation workflow based on DISTAL ligation (DISTAL-seq), which can be grouped into 4 stages: (1) hyperligase ligation, in which a 5′-adenylated single strand adaptor is ligated to the 3′-end of the duplex library DNA; (2) adaptor 3′-end deblocking, during which the phosphate group at the 3′-end of the adaptor is removed by the phosphatase activity of T4 PNK (other DNA phosphatases can be used for this purpose as well; data not shown); (3) circular ligation between the proximal 5′- and 3′-ends by the single strand DNA/RNA ligase CircLigase II, which, similar to hyperligase ligation, occurs at elevated temperature (60° C.) but is intra-molecular rather than inter-molecular; and (4) cleavage of the adaptor at a pre-designed site by endonucleases, followed by PCR enrichment (FIG. 2A).

[0112] As a proof-of-principle experiment, a library from 50 ng of reference material Escherichia coli O157 (EDL933) gDNA was made using the DISTAL-seq workflow (FIGS. 2A and 2B) and sequenced. Approximately 4.8 million paired-end reads (2.4 million in each direction) were obtained and aligned to the reference genome. The mean coverage is around 85 and the median coverage is around 95 (FIG. 2C). The read mapping is of high quality, with a low error rate (PF_HQ_ERROR_RATE=0.0064, PF_INDEL_RATE=0.00010) and a low chimera rate (PCT_CHIMERAS=0.003); all metrics were reported by CollectAlignmentSummaryMetrics in Picard Tools. FIG. 6 shows the distribution of insert sizes, which is consistent with the intended 300 bp size range during sonication. A moderate level of GC bias is observed for both high-GC and low-GC bins, with normalized coverage dropping to around 0.5 at either extreme (FIG. 7). Nevertheless, the GC bias does not appear to significantly impact the overall coverage uniformity, since >90% of the bases are covered at more than 0.2 of the mean coverage (17) (FIG. 2D).
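
The uniformity criterion used above (the fraction of bases covered at more than 0.2 of the mean coverage) can be sketched as follows. This is a minimal illustration on a toy list of per-base depths; the function name `fraction_above` and the toy data are assumptions, and in practice the depths would come from a per-base coverage report of the aligned reads.

```python
from statistics import mean


def fraction_above(depths, threshold_ratio=0.2):
    """Fraction of positions whose depth exceeds threshold_ratio * mean depth."""
    if not depths:
        raise ValueError("depths must not be empty")
    cutoff = threshold_ratio * mean(depths)
    return sum(d > cutoff for d in depths) / len(depths)


# Toy per-base depths (hypothetical): mean is 78, so the cutoff is 15.6.
toy_depths = [0, 10, 80, 90, 100, 95, 85, 110, 120, 90]
uniformity = fraction_above(toy_depths)
```

For the toy data, 8 of the 10 positions exceed the cutoff, so `uniformity` is 0.8; the same calculation applied to the E. coli library would correspond to the >90% figure quoted above.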

[0113] To investigate the sequence-specific bias of both hyperligase ligation and circular ligation, stretches of randomized nucleotides (NNNNN, N=A/T/C/G) were designed at both the 5′- and 3′-ends of the single strand adaptor. Mapped read 1 and read 2 of the E. coli DISTAL-seq data were aligned, and the sequence context on either side of the ligation junction was analyzed for potential bias (FIG. 2C). As shown in FIG. 2C, no significant sequence bias is present at the ligation junction on the side of the genomic insert. There is a slight sequence bias on the adaptor side for both the DISTAL ligation and the circular ligation, which is not expected to be a concern and could be associated with the synthetic degenerate adaptor itself. In addition, the degenerate regions at both ends of the adaptor can serve as unique molecular identifiers (UMIs) and can be used for further error suppression (FIG. 2D).
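
One simple way to look for sequence bias at a ligation junction is to tabulate the base frequencies at each position of the junction-flanking sequences. The sketch below assumes the flanking contexts have already been extracted as equal-length strings; the function name `base_frequencies` and the toy sequences are hypothetical and are not the analysis code used in this Example.

```python
from collections import Counter


def base_frequencies(contexts):
    """Per-position A/C/G/T frequencies for equal-length sequences."""
    if not contexts:
        raise ValueError("contexts must not be empty")
    length = len(contexts[0])
    if any(len(seq) != length for seq in contexts):
        raise ValueError("all contexts must have the same length")
    profile = []
    for i in range(length):
        counts = Counter(seq[i].upper() for seq in contexts)
        profile.append({b: counts.get(b, 0) / len(contexts) for b in "ACGT"})
    return profile


# Toy junction contexts (hypothetical data).
toy_contexts = ["ACGT", "ACGA", "TCGT", "GCGC"]
profile = base_frequencies(toy_contexts)
```

In an unbiased ligation, each position of the genomic side of the junction would be expected to track the background base composition of the genome, while strong enrichment of one base at a position (as at position 1 of the toy data, which is always C) would indicate a preference.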

[0114] Since a few steps of this exemplary DISTAL-seq library preparation require long incubation at elevated temperature (6 hours at 75° C. for hyperligase ligation and 1 hour at 60° C. for the CircLigase II reaction), a potential concern is the DNA damage that heat may introduce, including but not limited to C>T transitions and G>T transversions (19). To address this question, DISTAL-seq data from the reference genomic material (Escherichia coli O157 (EDL933)) were used and mutations were called from the sequencing data (see Methods). As shown in FIG. 2E, all possible mutation types are at comparable levels, supporting a minimal impact of heat as a source of artifactual mutations. The allele fractions of each mutation type are shown in FIG. 8. Moreover, since UMIs are designed into both ends of the DISTAL-seq adaptor, further error suppression can be achieved by grouping reads with the same UMIs (FIG. 2E).

Duplex End Restoration Sequencing (DUET-seq) and Validation

[0115] Duplex sequencing tags each strand separately with distinct but corresponding UMIs so that the two strands can be paired during analysis (5). Duplex sequencing has been used to detect rare genomic mutations with high positive predictive value (PPV) (5), and has proven valuable in a variety of research and clinical applications. As discussed earlier, with current duplex sequencing library preparation methods, native ends are either filled in or resected to blunt ends, so that they cannot be restored after the duplex is reconstructed. The principle of DISTAL-seq provides a feasible framework in which, if strands can be paired, duplex and native end restoration sequencing (DUET-seq) becomes possible.

[0116] FIG. 3A illustrates an exemplary adaptor design and preparation process for DUET-seq. Strand-specific UMIs are made complementary to each other through primer extension. A single dU base is present in the newly synthesized portion, which can later be cleaved by USER reagent (uracil DNA glycosylase and Endo VIII) to generate a blocked 3′-phosphate end (FIG. 3A). The rest of the template for the primer extension is designed so that no additional dU is incorporated, to avoid unintended USER cleavage. In addition, due to the incorporation of dU, a single dI is designed into the loop region and can be readily cleaved by endonuclease V. The 5′-end of the DUET-seq adaptor can be enzymatically phosphorylated and adenylated (FIG. 3A). As in the DISTAL-seq library preparation process, since both ligations are performed at elevated temperature (75° C. for hyperligase and 60° C. for CircLigase), the DUET-seq adaptor is generally expected to adopt a single strand form instead of a stem-loop form and to ligate to each strand separately.

[0117] The read structure and the strand pairing diagram are also shown in FIG. 3A. Two sets of paired-end reads are considered to originate from the same duplex if their coordinates overlap by more than a heuristic threshold (150 bp used in this study) and they share the same set of UMIs with the structure A-B and B-A (read 1 UMI-read 2 UMI).
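
The pairing rule just described can be sketched in a few lines. This is a minimal illustration rather than the analysis pipeline used in this Example: the `ReadPair` record, its field names, and the half-open coordinate convention are assumptions, and the UMIs are assumed to have already been normalized to a common orientation so that a simple string comparison is meaningful.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPair:
    """One aligned paired-end fragment (hypothetical record layout)."""
    chrom: str
    start: int       # 0-based, inclusive
    end: int         # exclusive
    umi_read1: str
    umi_read2: str


def overlap_bp(a: ReadPair, b: ReadPair) -> int:
    """Number of reference bases shared by the two fragments."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def same_duplex(a: ReadPair, b: ReadPair, min_overlap: int = 150) -> bool:
    """True if the UMIs are swapped (A-B vs B-A) and overlap exceeds the threshold."""
    umis_swapped = a.umi_read1 == b.umi_read2 and a.umi_read2 == b.umi_read1
    return umis_swapped and overlap_bp(a, b) > min_overlap


top = ReadPair("lambda", 1000, 1320, "AAAAA", "CCCCC")
bottom = ReadPair("lambda", 1002, 1318, "CCCCC", "AAAAA")
paired = same_duplex(top, bottom)
```

Here `top` and `bottom` overlap by 316 bp and carry swapped UMIs, so `paired` is `True`. Two fragments with identical (unswapped) UMIs, or with swapped UMIs but no coordinate overlap, would not be paired under this rule.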

[0118] To validate DUET-seq, a mixture of DNA with known ends was used: lambda DNA (48 kb) digested with FauI (CCCGC(N)4/GGGCG(N)6) was spiked into lambda DNA digested with AluI (AGCT/TCGA) at a 1:2000 ratio. FauI is expected to generate 2-nt 3′-single-strand protruding ends while AluI generates blunt ends. In 50 ng of starting DNA mixture for DUET-seq, there are about 4.5×10^5 copies of FauI-digested genome equivalent in a background of 9×10^8 copies of AluI-digested genome equivalent. The theoretical diversity of the barcode combinations from the DUET-seq adaptor is 4×10^7 (FIG. 3A). Thus, for FauI-digested DNA fragments, it is unlikely that fragments sharing the same start and stop coordinates will be tagged with the same UMI combination. For other applications with a much larger genome size, e.g., human genomic DNA, 50 ng contains about 1.39×10^4 copies of haploid genome equivalent, far fewer than the diversity of the barcode combinations, so the likelihood of a UMI clash is of much less concern. The goal of the validation experiment is to test whether DUET-seq can faithfully restore the FauI-digested duplex and recover the FauI cutting pattern.
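
The genome-equivalent copy numbers quoted above follow from the input mass and the genome size. The sketch below reproduces the arithmetic under stated assumptions that are not taken from this Example: an average mass of 650 g/mol per base pair, a lambda genome of 48,502 bp, a haploid human genome of 3.3×10^9 bp, and the 1:2000 spike-in approximated as 1/2000 of the total. The function name `genome_copies` is hypothetical.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def genome_copies(mass_ng: float, genome_bp: float,
                  g_per_mol_per_bp: float = 650.0) -> float:
    """Approximate number of genome copies in a given mass of duplex DNA."""
    mass_g = mass_ng * 1e-9
    return mass_g * AVOGADRO / (genome_bp * g_per_mol_per_bp)


lambda_copies = genome_copies(50, 48_502)   # total lambda genome equivalents
faui_copies = lambda_copies / 2000          # spiked-in FauI-digested fraction
human_copies = genome_copies(50, 3.3e9)     # haploid human genome equivalents
```

With these assumed constants, the results are roughly 9.6×10^8 lambda copies, 4.8×10^5 FauI-digested copies, and 1.4×10^4 haploid human genome copies. These are close to the figures given in the text; the small differences presumably reflect slightly different constants.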

[0119] Duplex sequencing is reported to be inefficient in recovering both strands and often requires an excess level of sequencing (20). During DUET-seq, an extra exonuclease treatment step can be added before the endonuclease V digestion step (FIG. 3B). Un-ligated and partially ligated products, but not the fully ligated product, are substrates for Exonuclease I and III and are subject to degradation. The benefit of the exonuclease treatment is to enrich for DNA with both strands fully ligated, so that both strands can be sequenced and represented in the sequencing data. FIG. 3C shows the TapeStation trace of the lambda DUET-seq library with and without exonuclease treatment. Sequencing was then performed on these libraries, with 3.1 M paired reads for the Exo− DUET-seq library (median coverage 250) and 2.9 M paired reads for the Exo+ DUET-seq library (median coverage 250).

[0120] The strand pairing analysis identified 1 duplex FauI-digested fragment from the Exo− library, as compared to 11 duplex FauI-digested fragments from the Exo+ library, which represents about a 10-fold enrichment in duplex recovery. For the duplex FauI-digested fragments identified from both libraries, both strands of the duplex originate from bona fide FauI cleavage, with no strand swapping between FauI-digested and AluI-digested strands. The signed single strand end length (FIG. 3D) shows that the majority of the ends are 3′-single-strand protruding with a 2-nt overhang, while a small proportion show a 3-nt overhang. The 3-nt overhang observation is interesting in that it is known that type IIS-like restriction enzymes (e.g., FauI) may wobble in their cutting sites (21). Nevertheless, the relatively low sequencing depth might also contribute to a sampling bias. Overall, these results support that DUET-seq can recover duplex DNA as well as the native ends associated with the duplexes.
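
Once the two strands of a duplex have been paired, a signed single strand end length can be derived from their mapped coordinates. The sketch below is one possible formulation and is not taken from this Example: it assumes half-open reference intervals for the forward ("top") and reverse ("bottom") strands, and it adopts the convention that positive values denote a 5′-protruding end and negative values a 3′-protruding end. The function name `signed_overhangs` is hypothetical.

```python
def signed_overhangs(top_start: int, top_end: int,
                     bottom_start: int, bottom_end: int):
    """Signed overhang lengths (left end, right end) of a reconstructed duplex.

    Intervals are half-open [start, end) in reference coordinates.
    Assumed sign convention: > 0 means 5'-protruding, < 0 means
    3'-protruding, 0 means blunt.
    """
    # Left end: the top strand's 5' terminus and the bottom strand's 3' terminus.
    # If the top strand starts further left, its 5' end protrudes (positive).
    left = bottom_start - top_start
    # Right end: the top strand's 3' terminus and the bottom strand's 5' terminus.
    # If the bottom strand extends further right, its 5' end protrudes (positive).
    right = bottom_end - top_end
    return left, right


# Toy duplex with 2-nt 3'-protruding ends on both sides (hypothetical coordinates).
ends = signed_overhangs(100, 200, 98, 198)
```

For the toy coordinates, `ends` is `(-2, -2)`, that is, a 2-nt 3′-protruding end on each side under the assumed convention, and a duplex whose strands share identical coordinates gives `(0, 0)`. A histogram of such values over all paired duplexes would correspond to the kind of end-length profile described above.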

Application of DUET-seq to Human Genomic DNA and Cell-Free DNA

[0121] Finally, this exemplary DUET-seq was applied to a few real-world samples to profile the states of the native ends. First, DUET-seq was used to compare the end profiles of sonicated genomic DNA and enzymatically fragmented genomic DNA. As shown in FIG. 4A, the majority of the DNA fragments in sonicated genomic DNA possess either blunt ends or ends with short single strand overhangs (2-3 nt). There are almost equal populations of 5′-single-strand protruding and 3′-single-strand protruding ends. The enzymatically fragmented genomic DNA, however, has a majority of 3′-single-strand protruding ends, with a much wider size distribution of the single strand overhang. A major peak at 1 nt, which corresponds to a 1-base 3′-single-strand protruding end, accounts for 10% of all the ends. The 3′-single-strand protruding nature of the ends likely reflects the preference of the nucleases in the fragmentase product mix. FIGS. 9 and 10 show the insert size distributions for the sonicated and enzymatically fragmented libraries, respectively, which are similar. Although both fragmentation methods are commonly used in NGS library preparation, the results here show that it is important to recognize the distinct end patterns generated by the two methods, since they may result in different levels of end repair during ER/AT.

[0122] For cell-free plasma DNA, a first observation is that its end pattern closely resembles that of the enzymatically fragmented DNA: the majority of the ends are 3′-single-strand protruding, with a major peak at 2 nt (10%). As a quality check of the DUET-seq library, FIG. 4B shows the insert size distribution for the cell-free DNA library. It clearly shows the characteristic 170 bp mono-nucleosomal major peak and the 340 bp di-nucleosomal minor peak, consistent with previous reports (22). Sequence motif finding was attempted in the vicinity of the 5′- and 3′-ends, but yielded no significant findings (data not shown). The size distribution of the single strand overhang is also similar between the two types of DNA: (min=−47 nt, max=37 nt, mean=7.9 nt) for the cell-free DNA and (min=−118 nt, max=49 nt, mean=8.7 nt) for the enzymatically fragmented DNA. The majority of the ends have a single strand end length in the (25.2) interval (FIG. 4B). Compared with the previous report on the end length distribution (FIG. 5A in (3)), one of the main differences is that fragments with a long single strand overhang (>30 nt) are much rarer in the results here.

[0123] The attachment of an adaptor with defined sequence to library DNA is crucial in driving creative ways to make sequenceable libraries for the interrogation of genomic alterations. Many ligation strategies have been described including duplex to duplex ligation, such as A/T ligation, blunt-blunt ligation, etc., and single strand ligation, for example, mediated by splint or direct single strand to single strand ligation. Here, an alternative strategy is illustrated in which single strand adaptor is directly ligated to duplex DNA. It is termed DISTAL (duplex retaining single strand tail) ligation. As shown in this example, DISTAL ligation enables sequencing library preparation workflows where conventional end repair is no longer necessary.

[0124] Reaction temperature plays a role in driving the efficiency of DISTAL ligation, as shown in FIG. 1. It appears that hyperligase ligation occurs within a temperature window around 75° C. Lower temperatures might affect the thermodynamics of DNA ends by reducing the duration of breathing, thus reducing the accessibility of the 3′-ends. This is supported by the significantly lower efficiency at 65° C. for HyperLigase, as well as the observation that the thermostable Mth App ligase is not able to drive ligation at 65° C. On the other hand, much higher temperatures may increase the likelihood of separating the entire duplex and introduce bias into the sequencing data. Indeed, when DISTAL ligation was performed at 80° C., DISTAL-seq data from E. coli gDNA showed higher GC bias in the high-GC bins (data not shown). Other experimental techniques to lower the duplex melting temperature, such as supplementing with betaine or using a thermostable single strand binding protein, could be employed.

[0125] Another element of DISTAL ligation is the choice of a thermostable ligase capable of ligating two single strand DNA molecules. In this Example, a thermostable mutant ligase with a wide range of temperature tolerance was chosen to enable hyperligase ligation (10). In addition, the mutation at the catalytic lysine of the enzyme dictates uni-directional ligation between the 5′-adenylated end and the 3′-hydroxyl end, minimizing the chance of generating undesired by-products. Other enzymes with similar characteristics, identified for example through database mining, may be useful for this purpose.

[0126] Embodiments of DISTAL ligation offer insights into adaptor design for sequencing library preparation. For example, unlike the conventional Y-adaptor, for which a short duplex is needed due to the substrate requirement of T4 DNA ligase, ligations in embodiments of DISTAL-seq are generally completed in two separate steps, and in either step the substrates are in the form of single strand DNA. In particular embodiments, this substrate requirement might make the duplex portion of the conventional Illumina adaptor unnecessary. This may have the added benefit of reducing adaptor length as well as adaptor dimer length, allowing adaptor dimers to be removed more efficiently by size selection.

[0127] Although designed for ligation to duplex DNA, DISTAL workflow does not preclude single strand DNA in the starting material ligating to the adaptor. These ligated products can also go through the later steps of the library preparation, get amplified and sequenced. As such, DISTAL-seq and DUET-seq data may have captured both double and single strand DNA population in the starting material. Indeed, for the cell free plasma DNA, as shown in FIG. 4B, in addition to mono-nucleosomal and di-nucleosomal cfDNA, there are sequenced inserts within the small size range, such as less than 50 bp, which may have gone through DISTAL workflow as single strand to single strand ligation. These inserts could exist as single strand DNA, or part of nicked double strand DNA in the starting cfDNA samples (23).

[0128] For DUET-seq, note that, as a proof of principle, the strand-specific UMIs were designed to be complementary through primer extension for ease of synthesis (FIG. 3A). However, because the ligation steps are separate, the strand-specific UMIs do not need to be complementary. As long as there is a lookup table that links the two distinct UMIs, the strands from the same duplex can be paired during the computational analysis. For example, one can design and synthesize multiple adaptor oligos, each with distinct 5′- and 3′-UMIs, and pool them for use in DUET-seq. Given the advances in high-throughput oligo synthesis, this provides another route for DUET-seq adaptor preparation.

[0129] Finally, since no ER/AT is used in the library preparation, DISTAL-seq can be extended to workflows for the readout of epigenetic base modifications. The dilution of epigenetic signals, especially at the start of read 2, as observed in (3), is not expected to occur in the collected dataset. Such high-fidelity epigenomic datasets should be useful in illuminating epigenetic changes in the disease process, especially at an early onset. The current examples sequence genetic information. When the adaptor is methylated, for example, sequencing of the epigenome can also be demonstrated.

REFERENCES

[0130] 1. Gregory, et al. (2020) Characterization and mitigation of fragmentation enzyme-induced dual stranded artifacts. NAR Genom Bioinform, 2.
[0131] 2. Xiong, et al. (2022) Duplex-Repair enables highly accurate sequencing, despite DNA damage. Nucleic Acids Res, 50, e1-e1.
[0132] 3. Jiang, et al. (2020) Detection and characterization of jagged ends of double-stranded DNA in plasma. Genome Res, 30, 1144-1153.
[0133] 4. Thierry, A. R. (2023) Circulating DNA fragmentomics and cancer screening. Cell Genomics, 3, 100242.
[0134] 5. Schmitt, et al. (2012) Detection of ultra-rare mutations by next-generation sequencing. Proceedings of the National Academy of Sciences, 109, 14508-14513.
[0135] 6. Picelli, et al. (2014) Tn5 transposase and tagmentation procedures for massively scaled sequencing projects. Genome Res, 24, 2033-2040.
[0136] 7. Gansauge, M. T. and Meyer, M. (2013) Single-stranded DNA library preparation for the sequencing of ancient or damaged DNA. Nat Protoc, 8, 737-748.
[0137] 8. Troll, et al. (2019) A ligation-based single-stranded library preparation method to analyze cell-free DNA and synthetic oligos. BMC Genomics, 20, 1023.
[0138] 9. Harkins, et al. (2020) A novel NGS library preparation method to characterize native termini of fragmented DNA. Nucleic Acids Res, 48, e47-e47.
[0139] 10. Zheng, Y. and Hong, M. Hyper-thermostable lysine-mutant ssDNA/RNA ligases. EP 3 430 154 B1.
[0140] 11. Bolger, A. M., Lohse, M. and Usadel, B. (2014) Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 30, 2114-2120.
[0141] 12. Li, H. and Durbin, R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754-1760.
[0142] 13. Li, et al. (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25, 2078-2079.
[0143] 14. https://followed by broadinstitute.github.io/picard/.
[0144] 15. https://followed by github.com/fulcrumgenomics/fgbio.
[0145] 16. Zillig, et al. (1990) Hyperthermus butylicus, a hyperthermophilic sulfur-reducing archaebacterium that ferments peptides. J Bacteriol, 172, 3959-3965.
[0146] 17. Zhelkovsky, A. M. and McReynolds, L. A. (2012) Structure-function analysis of Methanobacterium thermoautotrophicum RNA ligase-engineering a thermostable ATP independent enzyme. BMC Mol Biol, 13, 24.
[0147] 18. Phelps, et al. (2013) Single-molecule FRET and linear dichroism studies of DNA breathing and helicase binding at replication fork junctions. Proceedings of the National Academy of Sciences, 110, 17320-17325.
[0148] 19. Costello, et al. (2013) Discovery and characterization of artifactual mutations in deep coverage targeted capture sequencing data due to oxidative DNA damage during sample preparation. Nucleic Acids Res, 41, e67-e67.
[0149] 20. Bae, et al. (2023) Single duplex DNA sequencing with CODEC detects mutations with high sensitivity. Nat Genet, 55, 871-879.
[0150] 21. Zheng, et al. (2010) A unique family of Mrr-like modification-dependent restriction endonucleases. Nucleic Acids Res, 38, 5527-5534.
[0151] 22. Jiang, et al. (2015) Lengthening and shortening of plasma DNA in hepatocellular carcinoma patients. Proceedings of the National Academy of Sciences, 112.
[0152] 23. Snyder, et al. (2016) Cell-free DNA Comprises an In Vivo Nucleosome Footprint that Informs Its Tissues-Of-Origin. Cell, 164, 57-68.
[0153] 24. Lanman, et al. (2015) Analytical and Clinical Validation of a Digital Sequencing Panel for Quantitative, Highly Accurate Evaluation of Cell-Free Circulating Tumor DNA. PLoS One, 10, e0140712.
[0154] 25. Tanaka, N. and Shuman, S. (2011) RtcB Is the RNA Ligase Component of an Escherichia coli RNA Repair Operon. Journal of Biological Chemistry, 286, 7727-7731.
[0155] 26. Das, et al. (2013) Rewriting the rules for end joining via enzymatic splicing of DNA 3′-PO4 and 5′-OH ends. Proceedings of the National Academy of Sciences, 110, 20437-20442.
[0156] 27. Duan, et al. (2019) Purification and enzymatic characterization of the RNA ligase RTCB from Thermus thermophilus. Biotechnol Lett, 41, 1051-1057.
[0157] 28. Desai, et al. (2015) Coevolution of RtcB and Archease created a multiple-turnover RNA ligase. RNA, 21, 1866-1872.
[0158] 29. U.S. Pat. Pub. 2022/0356467.
[0159] 30. Jacewicz, A., et al. (2022) Structures of RNA ligase RtcB in complexes with divalent cations and GTP. RNA, 28, 1509-1518.

[0160] All publications and patents mentioned in the specification and/or listed below are herein incorporated by reference. Various modifications and variations of the described method and system of the invention will be apparent to those skilled in the art without departing from the scope and spirit of the invention. Although the invention has been described in connection with specific embodiments, it should be understood that the invention as claimed should not be unduly limited to such specific embodiments. Indeed, various modifications of the described modes for carrying out the invention that are obvious to those skilled in the relevant fields are intended to be within the scope described herein.

CPC classification (all under CHEMISTRY; METALLURGY): C12Y605/01; C12Q2525/186; C12N9/22; C12N9/2497; C12Q2563/179; C12Q2535/122; C12Q2531/113; C12Q1/6869; C12N9/93; C12Y301/21007; C12Q2521/501; C12Q1/6806; C12Y302/02027; C12N15/1093; C12Q1/6855; C12Q2525/191

International classification