DEGRADABLE CARRIER NUCLEIC ACID FOR USE IN THE EXTRACTION, PRECIPITATION AND/OR PURIFICATION OF NUCLEIC ACIDS

20210164020 · 2021-06-03

    Abstract

    The present invention provides a degradable carrier nucleic acid for use in the extraction, precipitation and/or purification of nucleic acids.

    Claims

    1. A nucleic acid comprising at least a first sequence and a second sequence wherein at least one of the first or second sequence is an endonuclease recognition sequence or is capable of acting as an endonuclease recognition sequence when converted into a corresponding double-stranded DNA form, wherein the length of the endonuclease recognition sequence is at least 15 nucleotides.

    2. The nucleic acid according to claim 1 wherein each of the first and the second sequences is an endonuclease recognition sequence or is capable of acting as an endonuclease recognition sequence when converted into a corresponding double-stranded DNA form, and wherein the length of each of the first and second endonuclease recognition sequences is at least 15 nucleotides.

    3. The nucleic acid according to any of claim 1 or 2 wherein the nucleic acid comprises at least three sequences wherein each sequence is an endonuclease recognition sequence or is capable of acting as an endonuclease recognition sequence when converted into a corresponding double-stranded DNA form, and wherein the length of each endonuclease recognition sequence is at least 15 nucleotides, optionally wherein the nucleic acid comprises at least 4, optionally at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, at least 25, at least 30, at least 35, at least 40, at least 45, at least 50, at least 55, at least 60, at least 65, at least 70, at least 75, at least 80, at least 85, at least 90, at least 95, at least 100, at least 125, at least 150, at least 175, at least 200, at least 225, at least 250, at least 275, at least 300, at least 350, at least 400, at least 450, at least 500, at least 550, at least 600, at least 650, or at least 700 sequences wherein each sequence is an endonuclease recognition sequence or is capable of acting as an endonuclease recognition sequence when converted into a corresponding double-stranded DNA form, and wherein the length of each endonuclease recognition sequence is at least 15 nucleotides.

    4. The nucleic acid according to any of claims 1-3 wherein the length of the nucleic acid is 10 kb or less, optionally less than 9.5 kb, optionally less than 9.0 kb, optionally less than 8.5 kb, optionally less than 8.0 kb, optionally less than 7.5 kb, optionally less than 7.0 kb, optionally less than 6.5 kb, optionally less than 6.0 kb, optionally less than 5.5 kb, optionally less than 5.0 kb, optionally less than 4.5 kb, optionally less than 4.0 kb, optionally less than 3.5 kb, optionally less than 3.0 kb, optionally less than 2.5 kb, optionally less than 2.0 kb, optionally less than 1.5 kb, optionally less than 1.25 kb, optionally less than 1.0 kb, optionally less than 900 nucleotides, optionally less than 800 nucleotides, optionally less than 700 nucleotides, optionally less than 600 nucleotides, optionally less than 500 nucleotides, optionally less than 400 nucleotides, optionally less than 300 nucleotides, optionally less than 200 nucleotides, optionally less than 100 nucleotides.

    5. The nucleic acid according to any of claims 1-4 wherein the length of the nucleic acid is between 100 nucleotides and 10 kb in length, optionally between 200 nucleotides and 9 kb, optionally between 300 nucleotides and 8 kb, optionally between 400 nucleotides and 7 kb, optionally between 500 nucleotides and 6 kb, optionally between 600 nucleotides and 5 kb, optionally between 700 nucleotides and 4 kb, optionally between 800 nucleotides and 3 kb, optionally between 900 nucleotides and 2 kb, optionally 1 kb.

    6. The nucleic acid according to any of claims 1-5 wherein at least the first or the second sequence that is an endonuclease recognition sequence or that is capable of acting as an endonuclease recognition sequence when converted into a corresponding double-stranded DNA form is at least 16 nucleotides or greater in length, optionally at least 17 nucleotides or greater in length, optionally at least 18 nucleotides or greater in length, optionally at least 19 nucleotides or greater in length, optionally at least 20 nucleotides or greater in length, optionally at least 21 nucleotides or greater in length, optionally at least 22 nucleotides or greater in length, optionally at least 23 nucleotides or greater in length, optionally at least 24 nucleotides or greater in length, optionally at least 25 nucleotides or greater in length, optionally at least 26 nucleotides or greater in length, optionally at least 27 nucleotides or greater in length, optionally at least 28 nucleotides or greater in length, optionally at least 29 nucleotides or greater in length, optionally at least 30 nucleotides or greater in length, optionally at least 31 nucleotides or greater in length, optionally at least 32 nucleotides or greater in length, optionally at least 33 nucleotides or greater in length, optionally at least 34 nucleotides or greater in length, optionally at least 35 nucleotides or greater in length, optionally wherein both of the first sequence and a second sequence that are endonuclease recognition sequences or that are capable of acting as endonuclease recognition sequences when converted into a corresponding double-stranded DNA form are at least 16 nucleotides or greater in length, optionally at least 17 nucleotides or greater in length, optionally at least 18 nucleotides or greater in length, optionally at least 19 nucleotides or greater in length, optionally at least 20 nucleotides or greater in length, optionally at least 21 nucleotides or greater in length, optionally at least 22 nucleotides or greater in length, optionally at least 23 nucleotides or greater in length, optionally at least 24 nucleotides or greater in length, optionally at least 25 nucleotides or greater in length, optionally at least 26 nucleotides or greater in length, optionally at least 27 nucleotides or greater in length, optionally at least 28 nucleotides or greater in length, optionally at least 29 nucleotides or greater in length, optionally at least 30 nucleotides or greater in length, optionally at least 31 nucleotides or greater in length, optionally at least 32 nucleotides or greater in length, optionally at least 33 nucleotides or greater in length, optionally at least 34 nucleotides or greater in length, optionally at least 35 nucleotides or greater in length, optionally wherein any number of or all of the sequences that are endonuclease recognition sequences or that are capable of acting as endonuclease recognition sequences when converted into a corresponding double-stranded DNA form are at least 16 nucleotides or greater in length, optionally at least 17 nucleotides or greater in length, optionally at least 18 nucleotides or greater in length, optionally at least 19 nucleotides or greater in length, optionally at least 20 nucleotides or greater in length, optionally at least 21 nucleotides or greater in length, optionally at least 22 nucleotides or greater in length, optionally at least 23 nucleotides or greater in length, optionally at least 24 nucleotides or greater in length, optionally at least 25 nucleotides or greater in length, optionally at least 26 nucleotides or greater in length, optionally at least 27 nucleotides or greater in length, optionally at least 28 nucleotides or greater in length, optionally at least 29 nucleotides or greater in length, optionally at least 30 nucleotides or greater in length, optionally at least 31 nucleotides or greater in length, optionally at least 32 nucleotides or greater in length, optionally at least 33 nucleotides or greater in length, optionally at least 34 nucleotides or greater in length, optionally at least 35 nucleotides or greater in length.

    7. The nucleic acid according to any of claims 1-6 wherein the first and second sequence that are endonuclease recognition sequences or that are capable of acting as endonuclease recognition sequences when converted into a corresponding double-stranded DNA form are the same length, optionally wherein any number of or all of the sequences that are endonuclease recognition sequences or that are capable of acting as endonuclease recognition sequences when converted into a corresponding double-stranded DNA form are the same length.

    8. The nucleic acid according to any of claims 1-6 wherein the first and second sequence that are endonuclease recognition sequences or that are capable of acting as endonuclease recognition sequences when converted into a corresponding double-stranded DNA form are of different lengths, optionally wherein any number of or all of the sequences that are endonuclease recognition sequences or that are capable of acting as endonuclease recognition sequences when converted into a corresponding double-stranded DNA form are of different lengths.

    9. The nucleic acid according to any of claims 1-8 wherein the nucleic acid comprises sequences that are endonuclease recognition sequences or that are capable of acting as endonuclease recognition sequences when converted into a corresponding double-stranded DNA form of at least three different lengths, optionally at least four different lengths, optionally at least five different lengths, optionally at least six different lengths, optionally at least seven different lengths, optionally at least eight different lengths, optionally at least nine different lengths, optionally at least 10 different lengths.

    10. The nucleic acid according to any of claims 1-9 wherein the first sequence and the second sequence that are endonuclease recognition sequences or that are capable of acting as endonuclease recognition sequences when converted into a corresponding double-stranded DNA form are identical to one another, optionally wherein any number of or all of the sequences that are endonuclease recognition sequences or that are capable of acting as endonuclease recognition sequences when converted into a corresponding double-stranded DNA form are identical to one another.

    11. The nucleic acid according to any of claims 1-9 wherein the first sequence and the second sequence that are endonuclease recognition sequences or that are capable of acting as endonuclease recognition sequences when converted into a corresponding double-stranded DNA form are different to one another, optionally wherein any number of or all of the sequences that are endonuclease recognition sequences or that are capable of acting as endonuclease recognition sequences when converted into a corresponding double-stranded DNA form are different to one another.

    12. The nucleic acid according to any of claims 1-11 wherein the sequences that are endonuclease recognition sequences or that are capable of acting as endonuclease recognition sequences when converted into a corresponding double-stranded DNA form are recognition sequences for the same endonuclease, optionally wherein any number of or all of the sequences that are endonuclease recognition sequences or that are capable of acting as endonuclease recognition sequences when converted into a corresponding double-stranded DNA form are recognition sequences for the same endonucleases.

    13. The nucleic acid according to any of claims 1-12 wherein the sequences that are endonuclease recognition sequences or that are capable of acting as endonuclease recognition sequences when converted into a corresponding double-stranded DNA form are recognition sequences for different endonucleases, optionally wherein any number of or all of the sequences that are endonuclease recognition sequences or that are capable of acting as endonuclease recognition sequences when converted into a corresponding double-stranded DNA form are recognition sequences for different endonucleases.

    14. The nucleic acid according to any of claims 1-13 wherein the sequences that are endonuclease recognition sequences or that are capable of acting as endonuclease recognition sequences when converted into a corresponding double-stranded DNA form are recognition sequences for at least three different endonucleases, optionally at least four different endonucleases, optionally at least five different endonucleases, optionally at least six different endonucleases, optionally at least seven different endonucleases, optionally at least eight different endonucleases, optionally at least nine different endonucleases, optionally at least 10 different endonucleases.

    15. The nucleic acid according to any of claims 1-14 wherein at least one of the sequences that are endonuclease recognition sequences or that are capable of acting as endonuclease recognition sequences when converted into a corresponding double-stranded DNA form is a recognition sequence for a homing endonuclease, optionally wherein two of the sequences that are endonuclease recognition sequences or that are capable of acting as endonuclease recognition sequences when converted into a corresponding double-stranded DNA form are recognition sequences for homing endonuclease enzymes, optionally wherein all of the sequences that are endonuclease recognition sequences or that are capable of acting as endonuclease recognition sequences when converted into a corresponding double-stranded DNA form are recognition sequences for homing endonuclease enzymes.

    16. The nucleic acid according to any of claims 1-15 wherein the first and/or the second sequence is repeated within the nucleic acid, optionally wherein the first and/or the second sequence occurs at least twice within the nucleic acid, optionally at least three times, optionally at least four times, optionally at least five times, optionally at least 6 times, optionally at least 7 times, optionally at least 8 times, optionally at least 9 times, optionally at least 10 times, optionally at least 12 times, optionally at least 14 times, optionally at least 16 times, optionally at least 18 times, optionally at least 20 times, optionally at least 25 times, optionally at least 30 times, optionally at least 35 times, optionally at least 40 times, optionally at least 45 times, optionally at least 50 times, optionally at least 55 times, optionally at least 60 times, optionally at least 65 times, optionally at least 70 times, optionally at least 75 times, optionally at least 80 times, optionally at least 85 times, optionally at least 90 times, optionally at least 95 times, optionally at least 100 times, optionally at least 110 times, optionally at least 120 times, optionally at least 130 times, optionally at least 140 times, optionally at least 150 times, optionally at least 160 times, optionally at least 170 times, optionally at least 180 times, optionally at least 190 times, optionally at least 200 times.

    17. The nucleic acid according to any of claims 1-16 wherein both the first sequence and the second sequence are repeated in the nucleic acid in alternating fashion, optionally wherein where the nucleic acid comprises at least three sequences that are endonuclease recognition sequences or are capable of acting as endonuclease recognition sequences when converted into a corresponding double-stranded DNA form each of the sequences are repeated in an alternating fashion.

    18. The nucleic acid according to any of claims 1-17 wherein the first sequence and the second sequence are arranged such that cleavage of the first sequence results in the production of nucleic acid fragments that are all less than 200 nucleotides, optionally less than 180 nucleotides, optionally less than 160 nucleotides, optionally less than 140 nucleotides, optionally less than 120 nucleotides, optionally less than 100 nucleotides, optionally less than 80 nucleotides, optionally less than 60 nucleotides, optionally less than 40 nucleotides, optionally less than 20 nucleotides.

    19. The nucleic acid according to any of claims 1-18 wherein the first sequence and the second sequence are arranged such that cleavage of the second sequence results in the production of nucleic acid fragments that are all less than 200 nucleotides, optionally less than 180 nucleotides, optionally less than 160 nucleotides, optionally less than 140 nucleotides, optionally less than 120 nucleotides, optionally less than 100 nucleotides, optionally less than 80 nucleotides, optionally less than 60 nucleotides, optionally less than 40 nucleotides, optionally less than 20 nucleotides.

    20. The nucleic acid according to any of claims 1-19 wherein the first sequence and the second sequence are arranged such that cleavage of the first sequence and the second sequence results in the production of nucleic acid fragments that are all less than 200 nucleotides, optionally less than 180 nucleotides, optionally less than 160 nucleotides, optionally less than 140 nucleotides, optionally less than 120 nucleotides, optionally less than 100 nucleotides, optionally less than 80 nucleotides, optionally less than 60 nucleotides, optionally less than 40 nucleotides, optionally less than 20 nucleotides.

    21. The nucleic acid according to any of claims 1-20 wherein cleavage of the first and/or second sequence results in the production of nucleic acid fragments that are all less than 80 nucleotides, optionally less than 70 nucleotides.

    22. The nucleic acid according to any of claims 1-21 wherein cleavage of the first and/or second sequence results in the production of nucleic acid fragments that can be removed by Solid Phase Reversible Immobilization, optionally Solid Phase Reversible Immobilization-based paramagnetic beads, optionally AMPure beads or RNAclean XP beads.

    23. The nucleic acid according to any of claims 1-22 wherein the first and/or second sequences that are endonuclease recognition sequences or that are capable of acting as endonuclease recognition sequences when converted into a corresponding double-stranded DNA form are recognition sequences for a homing endonuclease wherein the homing endonuclease is selected from the group consisting of: BneMS4ORFIP, F-CphI, F-EcoT3I, F-EcoT5I, F-EcoT5II, F-EcoT5IV, F-PhiU5I, F-SceI, F-SceII, F-TevI, F-TevII, F-TevIII, F-TevIV, H-DreI, H-DreI, I-AabMI, I-AchMI, I-AniI, I-ApeKI, I-BanI, I-BasI, I-BmoI, I-Bth0305I, I-BthII, I-BthORFAP, I-CeuI, I-ChuI, I-CmoeI, I-CpaI, I-CpaII, I-CpaMI, I-CreI, I-CreII, I-CsmI, I-CvuI, I-DdiI, I-DmoI, I-GpeMI, I-GpiI, I-GzeI, I-GzeII, I-HjeMI, I-HmuI, I-HmuII, I-LlaI, I-LtrI, I-LtrWI, I-MpeMI, I-MsoI, I-NanI, I-NfiI, I-NitI, I-NjaI, I-OmiII, I-OnuI, I-PakI, I-PanMI, I-PfoP3I, I-PnoMI, I-PogTE7I, I-PorI, I-PpoI, I-ScaI, I-SceI, I-SceII, I-SceIII, I-SceIV, I-SceV, I-SceVI, I-SceVII, I-SecIII, I-SmaMI, I-SpomI, I-SscMI, I-Ssp6803I, I-TevI, I-TevII, I-TevIII, I-TsII, I-TsIWI, I-Tsp061I, I-TwoI, I-Vdi141I, -AvaI, PI-BciPI, PI-HvoWI, PI-MtuI, PI-PabI, PI-PabII, PI-PfuI, PI-PfuII, PI-PkoI, PI-PkoII, PI-PspI, PI-PspI, PI-ScaI, PI-SceI, PI-TfuI, PI-TfuII, PI-ThyI, PI-TliI, PI-TliII, PI-TmaI, PI-TmaKI, PI-ZbaI.

    24. The nucleic acid according to any of claims 1-23 wherein the first and/or second sequence that is an endonuclease recognition sequence or that is capable of acting as an endonuclease recognition sequence when converted into a corresponding double-stranded DNA form is selected from the group consisting of: SEQ ID NO: 1-142, optionally SEQ ID NO: 1 and/or SEQ ID NO: 2.

    25. The nucleic acid according to any of claims 1-24 wherein the nucleic acid comprises one or more modifications, optionally wherein the modification is biotinylation.

    26. The nucleic acid according to any of claims 1-25 wherein the nucleic acid is an RNA nucleic acid, optionally wherein the RNA comprises a 5′ cap.

    27. The nucleic acid according to any of claims 1-25 wherein where the nucleic acid is an RNA the RNA does not comprise a 5′ cap.

    28. The nucleic acid according to any of claims 1-25 wherein the nucleic acid is a DNA, optionally a double-stranded DNA nucleic acid.

    29. A vector comprising the nucleic acid according to any of claims 1-28.

    30. A vector comprising a sequence that is capable of being transcribed into an RNA transcript, wherein the transcript comprises a nucleic acid according to any of claims 1-28.

    31. A cell comprising a nucleic acid according to any of claims 1-28 or a vector according to any of claim 29 or 30, optionally wherein the cell is: a) a prokaryotic cell, optionally a bacterial cell, optionally an E. coli cell, a Bacillus subtilis cell, a Bacillus megaterium cell, a Vibrio natriegens cell, or a Pseudomonas fluorescens cell; or b) a eukaryotic cell, optionally a yeast cell, optionally Pichia pastoris or Saccharomyces cerevisiae; an insect cell, optionally a baculovirus infected insect cell; or a mammalian cell, optionally a baculovirus infected mammalian cell, a HEK293 cell, a HeLa cell, or CHO cells.

    32. A composition comprising at least one nucleic acid according to any of claims 1-28.

    33. A composition comprising at least two nucleic acids according to any of claims 1-28 wherein the at least two nucleic acids are of different sequence to one another, optionally comprising at least 3 different nucleic acids wherein the at least 3 different nucleic acids are of different sequence to one another, optionally comprising at least 4 different nucleic acids wherein the at least 4 different nucleic acids are of different sequence to one another, optionally comprising at least 5 different nucleic acids wherein the at least 5 different nucleic acids are of different sequence to one another, optionally comprising at least 6 different nucleic acids wherein the at least 6 different nucleic acids are of different sequence to one another, optionally comprising at least 7 different nucleic acids wherein the at least 7 different nucleic acids are of different sequence to one another, optionally comprising at least 8 different nucleic acids wherein the at least 8 different nucleic acids are of different sequence to one another, optionally comprising at least 9 different nucleic acids wherein the at least 9 different nucleic acids are of different sequence to one another, optionally comprising at least 10 different nucleic acids wherein the at least 10 different nucleic acids are of different sequence to one another.

    34. A composition comprising at least two nucleic acids according to any of claims 1-28 wherein the at least 2 nucleic acids are of different length, optionally at least 3 nucleic acids according to any of claims 1-24 wherein the at least 3 nucleic acids are of different lengths, optionally at least 4 nucleic acids according to any of claims 1-24 wherein the at least 4 nucleic acids are of different lengths, optionally at least 5 nucleic acids according to any of claims 1-24 wherein the at least 5 nucleic acids are of different lengths, optionally at least 6 nucleic acids according to any of claims 1-24 wherein the at least 6 nucleic acids are of different lengths, optionally at least 7 nucleic acids according to any of claims 1-24 wherein the at least 7 nucleic acids are of different lengths, optionally at least 8 nucleic acids according to any of claims 1-24 wherein the at least 8 nucleic acids are of different lengths, optionally at least 9 nucleic acids according to any of claims 1-24 wherein the at least 9 nucleic acids are of different lengths, optionally at least 10 nucleic acids according to any of claims 1-24 wherein the at least 10 nucleic acids are of different lengths.

    35. The composition according to any of claims 32-34 wherein the composition comprises 10 different nucleic acids according to any of claims 1-28 wherein each of the 10 nucleic acids is of a different length, optionally wherein the composition comprises a nucleic acid according to any of claims 1-28 of each of the following lengths: i) between 1000 nucleotides and 1200 nucleotides, optionally 1100 nucleotides, optionally 1034 nucleotides; ii) between 900 nucleotides and 1000 nucleotides, optionally between 920 nucleotides and 980 nucleotides, optionally between 960 nucleotides and 970 nucleotides, optionally 966 nucleotides; iii) between 850 nucleotides and 900 nucleotides, optionally between 860 nucleotides and 890 nucleotides, optionally 889 nucleotides; iv) between 800 nucleotides and 850 nucleotides, optionally between 810 nucleotides and 840 nucleotides, optionally between 820 nucleotides and 830 nucleotides, optionally 821 nucleotides; v) between 700 nucleotides and 800 nucleotides, optionally between 720 nucleotides and 780 nucleotides, optionally between 740 nucleotides and 760 nucleotides, optionally 744 nucleotides; vi) between 650 nucleotides and 700 nucleotides, optionally between 660 nucleotides and 690 nucleotides, optionally between 670 nucleotides and 680 nucleotides, optionally 676 nucleotides; vii) between 550 nucleotides and 650 nucleotides, optionally between 560 nucleotides and 640 nucleotides, optionally between 570 nucleotides and 630 nucleotides, optionally between 580 nucleotides and 620 nucleotides, optionally between 590 nucleotides and 610 nucleotides, optionally 599 nucleotides or 600 nucleotides; viii) between 500 nucleotides and 550 nucleotides, optionally between 510 nucleotides and 540 nucleotides, optionally between 520 nucleotides and 530 nucleotides, optionally 531 nucleotides; ix) between 400 nucleotides and 500 nucleotides, optionally between 410 nucleotides and 490 nucleotides, optionally between 420 nucleotides and 480 nucleotides, optionally between 430 nucleotides and 470 nucleotides, optionally between 440 nucleotides and 460 nucleotides, optionally 450 nucleotides or 454 nucleotides; and x) between 300 nucleotides and 400 nucleotides, optionally between 310 nucleotides and 390 nucleotides, optionally between 320 nucleotides and 380 nucleotides, optionally between 330 nucleotides and 370 nucleotides, optionally between 340 nucleotides and 360 nucleotides, optionally 350 nucleotides or 386 nucleotides; optionally wherein the composition comprises at least 10 different nucleic acids according to any of claims 1-25 wherein the nucleic acids are 1034 nucleotides in length, 966 nucleotides in length, 889 nucleotides in length, 821 nucleotides in length, 744 nucleotides in length, 676 nucleotides in length, 599 nucleotides in length, 531 nucleotides in length, 454 nucleotides in length, and 386 nucleotides in length.

    36. The composition according to any of claims 32-35 wherein the composition comprises capped RNA nucleic acids according to any of claims 1-28, optionally comprises capped and uncapped RNA nucleic acids according to any of claims 1-28.

    37. The composition according to any of claims 32-36 wherein the range of sizes of the nucleic acids according to any of claims 1-28 are similar to the range of sizes of RNA or DNA nucleic acids in a sample.

    38. The composition according to any of claims 32-37 wherein the percentage of RNA nucleic acids that comprise a 5′ cap is similar to the percentage of capped RNA nucleic acids in a sample.

    39. The composition according to any of claims 32-38 wherein the composition comprises RNA nucleic acids according to any of claims 1-28, wherein at least 5% of the RNA nucleic acids comprise a 5′ cap, optionally at least 10% of the RNA nucleic acids comprise a 5′ cap, optionally at least 20% of the RNA nucleic acids comprise a 5′ cap, optionally at least 30% of the RNA nucleic acids comprise a 5′ cap, optionally at least 40% of the RNA nucleic acids comprise a 5′ cap, optionally at least 50% of the RNA nucleic acids comprise a 5′ cap, optionally at least 60% of the RNA nucleic acids comprise a 5′ cap, optionally at least 70% of the RNA nucleic acids comprise a 5′ cap, optionally at least 80% of the RNA nucleic acids comprise a 5′ cap, optionally at least 90% of the RNA nucleic acids comprise a 5′ cap, optionally 100% of the RNA nucleic acids comprise a 5′ cap.

    40. A kit comprising: at least one nucleic acid according to any of claims 1-28, optionally comprising at least two nucleic acids according to any of claims 1-28, optionally comprising at least 3 nucleic acids according to any of claims 1-28, optionally comprising at least 4 nucleic acids according to any of claims 1-28, optionally comprising at least 5 nucleic acids according to any of claims 1-28, optionally comprising at least 6 nucleic acids according to any of claims 1-28, optionally comprising at least 7 nucleic acids according to any of claims 1-28, optionally comprising at least 8 nucleic acids according to any of claims 1-28, optionally comprising at least 9 nucleic acids according to any of claims 1-28, optionally comprising at least 10 nucleic acids according to any of claims 1-28; and/or at least one vector according to any of claims 29 and 30; and/or at least one cell according to claim 31; and/or at least one composition according to any of claims 32-39.

    41. The kit according to claim 40 wherein the kit comprises at least one nucleic acid according to any of claims 1-28 that is a capped RNA and at least one nucleic acid according to any of claims 1-28 that is an uncapped RNA.

    42. The kit according to any of claims 40 and 41 wherein the kit comprises at least two nucleic acids according to any of claims 1-28 wherein the at least two nucleic acids are of different lengths, optionally at least 3 nucleic acids according to any of claims 1-28 wherein the at least 3 nucleic acids are of different lengths, optionally at least 4 nucleic acids according to any of claims 1-28 wherein the at least 4 nucleic acids are of different lengths, optionally at least 5 nucleic acids according to any of claims 1-28 wherein the at least 5 nucleic acids are of different lengths, optionally at least 6 nucleic acids according to any of claims 1-28 wherein the at least 6 nucleic acids are of different lengths, optionally at least 7 nucleic acids according to any of claims 1-28 wherein the at least 7 nucleic acids are of different lengths, optionally at least 8 nucleic acids according to any of claims 1-28 wherein the at least 8 nucleic acids are of different lengths, optionally at least 9 nucleic acids according to any of claims 1-28 wherein the at least 9 nucleic acids are of different lengths, optionally at least 10 nucleic acids according to any of claims 1-28 wherein the at least 10 nucleic acids are of different lengths.

    43. The kit according to any of claims 40-42 wherein where the at least one nucleic acid according to any of claims 1-28 is an RNA nucleic acid, the kit comprises the RNA nucleic acid in a capped and uncapped form, optionally wherein the kit comprises 2 different RNA nucleic acids according to any one of claims 1-28, each in both a capped and uncapped form, optionally wherein the kit comprises 3 different RNA nucleic acids according to any one of claims 1-28, each in both a capped and uncapped form, optionally wherein the kit comprises 4 different RNA nucleic acids according to any one of claims 1-28, each in both a capped and uncapped form, optionally wherein the kit comprises 5 different RNA nucleic acids according to any one of claims 1-28, each in both a capped and uncapped form, optionally wherein the kit comprises 6 different RNA nucleic acids according to any one of claims 1-28, each in both a capped and uncapped form, optionally wherein the kit comprises 7 different RNA nucleic acids according to any one of claims 1-28, each in both a capped and uncapped form, optionally wherein the kit comprises 8 different RNA nucleic acids according to any one of claims 1-28, each in both a capped and uncapped form, optionally wherein the kit comprises 9 different RNA nucleic acids according to any one of claims 1-28, each in both a capped and uncapped form, optionally wherein the kit comprises 10 different RNA nucleic acids according to any one of claims 1-28, each in both a capped and uncapped form.

    44. The kit according to any of claims 40-43 wherein the kit comprises one or more endonuclease enzymes, optionally wherein at least one of the endonuclease enzymes is a homing endonuclease enzyme, and wherein the at least one endonuclease enzyme recognises at least one of the sequences that are endonuclease recognition sequences or that are capable of acting as endonuclease recognition sequences when converted into a corresponding double-stranded DNA form.

    45. A method of isolating nucleic acid from a sample wherein the method comprises the use of any one or more of: the nucleic acids according to any of claims 1-28; the vectors according to any one of claim 29 or 30; the cell according to claim 31; the compositions according to claims 32-40; and/or the kits according to claims 41-44.

    46. A method for improving the yield of nucleic acid obtained from a sample, wherein the method comprises the use of any one or more of: the nucleic acids according to any of claims 1-28; the vectors according to any one of claim 29 or 30; the cell according to claim 31; the compositions according to claims 32-40; and/or the kits according to claims 41-44.

    47. The method of any of claim 45 or 46 wherein the sample is a small sample, optionally wherein the sample comprises: 0.1 ng to 500 ng of nucleic acids; and/or less than 5000 cells, optionally less than 4000 cells, optionally less than 2000 cells, optionally less than 1000 cells, optionally less than 800 cells, optionally less than 600 cells, optionally less than 400 cells, optionally less than 200 cells, optionally around 100 cells or less.

    48. The method of any of claims 45-47 wherein the sample is selected from the group consisting of: a sample from an embryo; a sample of oocytes, FACS sorted cells, rare cell types, small biopsies, primordial germ cells, and samples of an embryo in the early embryonic developmental stages.

    49. The method according to any of claims 45-48 wherein the method further comprises contacting the nucleic acid according to any of claims 1-28 or the composition according to any of claims 32-40 with at least one endonuclease, optionally at least one homing endonuclease.

    50. A method for isolating nucleic acid that will be sequenced wherein the method comprises the use of any one or more of: the nucleic acids according to any of claims 1-28; the vectors according to any one of claim 29 or 30; the cell according to claim 31; the compositions according to claims 32-40; and/or the kits according to claims 41-44.

    51. A method for sequencing a nucleic acid wherein the method comprises the use of any one or more of: the nucleic acids according to any of claims 1-28; the vectors according to any one of claim 29 or 30; the cell according to claim 31; the compositions according to claims 32-40; and/or the kits according to claims 41-44.

    52. A method for cap analysis of gene expression (CAGE) wherein the method comprises the use of any one or more of: the nucleic acids according to any of claims 1-28 optionally wherein the nucleic acid according to any of claims 1-28 is an RNA nucleic acid; the vectors according to any one of claim 29 or 30; the cell according to claim 31; the compositions according to claims 32-40; and/or the kits according to claims 41-44.

    53. The method according to claim 52 wherein the method further comprises contacting the nucleic acid according to any one or more of claims 1-28 or the composition according to any of claims 32-40 with at least one endonuclease, optionally at least one homing endonuclease.

    54. The method according to claim 53 wherein said contacting occurs following reverse transcription of the RNA to DNA.

    55. A method for cap analysis of gene expression (CAGE) wherein the method comprises cleavage of the cDNA by a transposon and tagging of the double-stranded cDNA, optionally wherein the method comprises tagmentation.

    56. The method according to claim 55 wherein the method comprises the use of any one or more of: the nucleic acids according to any of claims 1-28 optionally wherein the nucleic acid according to any of claims 1-28 is an RNA nucleic acid; and/or the compositions according to claims 32-40.

    57. The method according to any of claim 55 or 56 wherein the method does not comprise a 3′ linker ligation reaction and/or does not comprise uracil specific excision reagent (USER) treatment.

    58. The method according to any of claims 56 and 57 wherein the method comprises the following steps in the following order: A) 1. Cleavage of the cDNA by a transposon and tagging of the double-stranded cDNA, for example tagmentation; 2. PCR amplification; 3. Degradation of the carrier nucleic acid according to the invention or the nucleic acid of the compositions according to the invention; 4. Purification of the DNA fragments, for example by Solid Phase Reversible Immobilization-based paramagnetic beads, for example AMPure beads or RNAclean XP; or B) 1. Degradation of the nucleic acid according to the invention or the nucleic acid of the compositions according to the invention; 2. Purification of the DNA fragments, for example by Solid Phase Reversible Immobilization-based paramagnetic beads, for example AMPure beads or RNAclean XP; 3. Cleavage of the cDNA by a transposon and tagging of the double-stranded cDNA, for example tagmentation; 4. PCR amplification; or C) 1. Degradation of the carrier nucleic acid according to the invention or the nucleic acid of the compositions according to the invention; 2. Purification of the DNA fragments, for example by Solid Phase Reversible Immobilization-based paramagnetic beads, for example AMPure beads or RNAclean XP; 3. PCR amplification; 4. (optional 2nd round of degradation of the carrier nucleic acid according to the invention or the nucleic acid of the compositions according to the invention); 5. Cleavage of the cDNA by a transposon and tagging of the double-stranded cDNA, for example tagmentation.

    59. A method for assessing gene promoters and/or transcription start sites, the method comprising: a) providing a sample of target nucleic acid; and b) mixing the sample of nucleic acid with a nucleic acid according to any of claims 1-28 or a composition according to any of claims 32-40.

    60. A method of generating a nucleic acid library, optionally a cDNA library, wherein the method comprises the use of any one or more of: the nucleic acids according to any of claims 1-28 optionally wherein the nucleic acid according to any of claims 1-28 is an RNA nucleic acid; the vectors according to any one of claim 29 or 30; the cell according to claim 31; the compositions according to claims 32-40; the kits according to claims 41-44; and/or the methods according to claims 45-59.

    61. A method of diagnosis wherein the method comprises the use of any one or more of: the nucleic acids according to any of claims 1-28 optionally wherein the nucleic acid according to any of claims 1-28 is an RNA nucleic acid; the vectors according to any one of claim 29 or 30; the cell according to claim 31; the compositions according to claims 32-40; the kits according to claims 41-44; and/or the methods according to claims 45-60.

    62. A method of chromatin immunoprecipitation (ChIP), ChIP-seq, or FARP-ChIP-seq wherein the method comprises the use of any one or more of: the nucleic acids according to any of claims 1-28 optionally wherein the nucleic acid according to any of claims 1-28 is an RNA nucleic acid; the vectors according to any one of claim 29 or 30; the cell according to claim 31; the compositions according to claims 32-40; the kits according to claims 41-44; and/or the methods according to claims 45-60.

    63. The method of any preceding claim wherein the method also comprises the use of an oligo blocker of carrier amplification during PCR amplification.

    Description

    FIGURE LEGENDS

    [0176] FIG. 1. SLIC-CAGE development and assessment. (a) Schematics of the SLIC-CAGE approach. Target RNA of limited quantity is mixed with the carrier mix to get 5 μg of total RNA material. cDNA is synthesised through reverse transcription and the cap oxidized using sodium periodate. Oxidation allows attachment of biotin using biotin hydrazide. In addition to the cap structure, biotin gets attached to the mRNA's 3′ end, as it is also oxidized using sodium periodate. To remove biotin from mRNA:cDNA hybrids with incompletely synthesized cDNA, and from mRNA's 3′ ends, the samples are treated with RNase I and RNase H. Complete cDNAs (cDNA that reached the 5′ end of mRNA) are selected by affinity purification on streptavidin magnetic beads (cap-trapping). cDNA is released from cap-trapped cDNA:mRNA hybrids and 5′- and 3′-linkers are ligated. The library molecules that originate from the carrier are degraded using I-SceI and I-CeuI homing endonucleases and the fragments removed using AMPure beads. The leftover library molecules are then PCR amplified to increase the amount of material for sequencing. (b-c) Pearson correlation of nAnT-iCAGE and SLIC-CAGE libraries prepared from (b) 5 ng or (c) 10 ng of S. cerevisiae total RNA. (d) Pearson correlation of SLIC-CAGE technical replicates prepared from 10 ng of S. cerevisiae total RNA. (e) CTSS signal in example locus on chromosome 12 in SLIC-CAGE libraries prepared from 5 or 10 ng of S. cerevisiae total RNA, and in nAnT-iCAGE library prepared from standard 5 μg of total RNA. The inset grey boxes show a magnification of a tag cluster. (f-h) Pearson correlation of nAnT-iCAGE and SLIC-CAGE libraries prepared from (f) 5 ng, (g) 10 ng or (h) 25 ng of M. musculus total RNA. (i) CTSS signal in example locus on chromosome 8 in SLIC-CAGE libraries prepared from 5, 10 or 25 ng of M. musculus total RNA, and in the reference nAnT-iCAGE library prepared from standard 5 μg of total RNA. The inset grey boxes show a magnification of a tag cluster.
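    The carrier top-up described in panel (a) is simple arithmetic. The following is a minimal sketch, assuming the 5 μg total stated above. The function name and the default capped fraction (0.5 μg capped per 4.9 μg of carrier, taken from the 100 ng carrier test in FIG. 15) are illustrative assumptions and are not prescribed for every input amount.

```python
TOTAL_RNA_NG = 5000.0  # 5 μg of total RNA material, as stated in the legend


def carrier_needed_ng(target_ng, capped_fraction=0.5 / 4.9):
    """Return (capped_ng, uncapped_ng) of carrier that tops the target RNA up to 5 μg.

    capped_fraction is an assumed split between capped and uncapped carrier;
    it is not a value fixed by the legend.
    """
    if not 0 < target_ng <= TOTAL_RNA_NG:
        raise ValueError("target amount must be above 0 and at most 5000 ng")
    carrier = TOTAL_RNA_NG - target_ng
    capped = carrier * capped_fraction
    return capped, carrier - capped


for ng in (5, 10, 25):
    capped, uncapped = carrier_needed_ng(ng)
    print(f"{ng} ng target: {capped:.0f} ng capped + {uncapped:.0f} ng uncapped carrier")
```

    With a 100 ng input this split reproduces the 0.5 μg capped and 4.4 μg uncapped amounts listed for the carrier mixes in FIG. 15.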

    [0177] FIG. 2. Identifying the lower limits of SLIC-CAGE libraries. (a) Genomic locations of tag clusters identified in SLIC-CAGE libraries prepared from 1, 5 or 10 ng of S. cerevisiae total RNA versus the reference nAnT-iCAGE library. (b) Distribution of tag cluster interquantile widths in SLIC-CAGE libraries prepared from 1, 5 or 10 ng of S. cerevisiae total RNA and in the nAnT-iCAGE library. (c) Nucleotide composition of all CTSSs identified in SLIC-CAGE libraries prepared from 5 or 10 ng of S. cerevisiae total RNA and in the reference nAnT-iCAGE library. (d) Dinucleotide composition of all CTSSs (left panel) or dominant CTSSs (right panel) identified in SLIC-CAGE libraries prepared from 5 or 10 ng of S. cerevisiae total RNA and in the nAnT-iCAGE library. Both panels are ordered from the most to the least used dinucleotide in nAnT-iCAGE. (e) Genomic locations of tag clusters in SLIC-CAGE libraries prepared from 5, 10 or 25 ng of M. musculus total RNA and in the nAnT-iCAGE library. (f) Distribution of tag cluster interquantile widths in SLIC-CAGE libraries prepared from 5, 10 or 25 ng of M. musculus total RNA and the nAnT-iCAGE library. (g) Nucleotide composition of all CTSSs identified in SLIC-CAGE libraries prepared from 5, 10 or 25 ng of M. musculus total RNA or identified in the nAnT-iCAGE library. (h) Dinucleotide composition of all CTSSs (left panel) or dominant CTSSs (right panel) identified in SLIC-CAGE libraries prepared from 5, 10 or 25 ng of M. musculus total RNA or identified in the reference nAnT-iCAGE library. Both panels are ordered from the most to the least used dinucleotide in the reference nAnT-iCAGE.

    [0178] FIG. 3. Comparison of nanoCAGE and the reference nAnT-iCAGE. (a-e) Pearson correlation of nAnT-iCAGE and nanoCAGE libraries prepared from (a) 5 ng, (b, c) 10 ng, (d) 50 ng or (e) 500 ng of S. cerevisiae total RNA. (f) Pearson correlation of nanoCAGE technical replicates prepared from 10 ng of S. cerevisiae total RNA. (g) CTSS signal in example locus on chromosome 12 in nanoCAGE libraries prepared from 5, 10, 50 or 500 ng, SLIC-CAGE library prepared from 5 ng, and the nAnT-iCAGE library prepared from 5 μg of S. cerevisiae total RNA (the same locus is shown in FIG. 1e). Insets in the first two nanoCAGE libraries have a different scale, as the signal is skewed by PCR amplification. The inset grey boxes show a magnification of a tag cluster. A different tag cluster is magnified compared to FIG. 1e, as nanoCAGE did not detect the upstream tag cluster on the minus strand. (h) Genomic locations of tag clusters identified in nanoCAGE libraries prepared from 5-500 ng of S. cerevisiae total RNA and in the nAnT-iCAGE library. (i) Distribution of tag cluster interquantile widths in nanoCAGE libraries prepared from 5-500 ng of S. cerevisiae total RNA versus the reference nAnT-iCAGE library. (j) Nucleotide composition of all CTSSs identified in nanoCAGE libraries prepared from 5-500 ng of S. cerevisiae total RNA or identified in the reference nAnT-iCAGE library. (k) Dinucleotide composition of all CTSSs (left panel) or dominant CTSSs (right panel) identified in nanoCAGE libraries prepared from 5-500 ng of S. cerevisiae total RNA or in the reference nAnT-iCAGE library. Both panels are ordered from the most to the least used dinucleotide in the reference nAnT-iCAGE.

    [0179] FIG. 4. SLIC-CAGE is equivalent to nAnT-iCAGE for pattern discovery. Comparison of SLIC-CAGE derived from 10 ng and nAnT-iCAGE derived from 5 μg of M. musculus total RNA. In all heatmaps, promoters are centred at the dominant CTSS (dashed vertical line at 0) and ordered by tag cluster interquantile width with sharpest promoters on top and broadest on the bottom of each heatmap. The horizontal line separates sharp and broad promoters (empirical boundary for sharp promoters is set at interquantile width <=3). (a) Comparison of TA dinucleotide density in the SLIC-CAGE (left) and nAnT-iCAGE library (right). (b) Comparison of TATA-box density in SLIC-CAGE (left) vs nAnT-iCAGE library (right). Promoter regions are scanned using a minimum of 80th percentile match to the TATA-box PWM. (c) Comparison of GC dinucleotide density in the SLIC-CAGE (left) and nAnT-iCAGE library (right). (d) Average WW (AA/AT/TA/TT) dinucleotide frequency in sharp and broad promoters identified in SLIC-CAGE (left) or nAnT-iCAGE library (right). Inset shows a closer view on WW dinucleotide frequency (blue) overlain with the signal obtained when the sequences are aligned to a randomly chosen identified CTSS within broad promoters (yellow). (e) CTSS coverage heatmap of SLIC-CAGE (left) or nAnT-iCAGE library (right). (f) H3K4me3 relative coverage in sharp versus broad promoters identified in SLIC-CAGE (left) or nAnT-iCAGE (right). (g) H3K4me3 signal density across promoter regions centred on SLIC-CAGE or nAnT-iCAGE identified dominant CTSS. (h) Relative coverage of CpG islands across sharp and broad promoters, centred on dominant CTSS identified in SLIC-CAGE (left) or nAnT-iCAGE (right). (i) CpG islands coverage signal across promoter regions centred on dominant CTSS identified in SLIC-CAGE (left) or nAnT-iCAGE (right).
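    The ordering and the sharp/broad boundary used in these heatmaps can be sketched as follows. This is a minimal illustration, assuming each tag cluster is available as an identifier paired with its interquantile width; the function name and the example widths are hypothetical.

```python
def classify_promoters(tag_clusters, sharp_max_width=3):
    """Order tag clusters by interquantile width and label them sharp or broad.

    tag_clusters: iterable of (cluster_id, interquantile_width) pairs.
    The default boundary of 3 is the empirical value quoted in the legend.
    """
    ordered = sorted(tag_clusters, key=lambda cluster: cluster[1])  # sharpest first
    return [
        (cluster_id, width, "sharp" if width <= sharp_max_width else "broad")
        for cluster_id, width in ordered
    ]


example = [("tc1", 42), ("tc2", 1), ("tc3", 3), ("tc4", 17)]  # made-up widths
for row in classify_promoters(example):
    print(row)
```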

    [0180] FIG. 5

    [0181] Sequence of the carrier synthetic gene. I-SceI recognition sites are underlined, while I-CeuI recognition sites are highlighted in bold.

    TABLE-US-00001 [SEQ ID NO: 143] CAGCGTTCGCTATAACTATAACGGTCCTAAGGTAGCGAAATGCAAGAGCA ATACCGCCCGGAAGAGATAGAATCCAAAGTACAGCTTCATAGGGATAACA GGGTAATTTGGGATGAGAAGCGCACATTTGAAGTAACCGAAGACGAGAGC AAAGAGATAACTATAACGGTCCTAAGGTAGCGAAAGTATTACTGCCTGTC TATGCTTCCCTATCCTTCTGGTCGACTACACATGTAGGGATAACAGGGTA ATGGCCACGTACGTAACTACACCATCGGTGACGTGATCGCCCGCTACCAG CGTAACTATAACGGTCCTAAGGTAGCGAATATGCTGGGCAAAAACGTCCT GCAGCCGATCGGCTGGGACGCGTTTGGTCTAGGGATAACAGGGTAATTGC CTGCGGAAGGCGCGGCGGTGAAAAACAACACCGCTCCGGCACCGTGGTAA CTATAACGGTCCTAAGGTAGCGAAACGTACGACAACATCGCGTATATGAA AAACCAGCTCAAAATGCTGGGCTTTAGGGATAACAGGGTAATTGGTTATG ACTGGAGCCGCGAGCTGGCAACCTGTACGCCGGAATACTACCTAACTATA ACGGTCCTAAGGTAGCGAAGTTGGGAACAGAAATTCTTCACCGAGCTGTA TAAAAAAGGCCTGGTATATTAGGGATAACAGGGTAATAAGAAGACTTCTG CGGTCAACTGGTGCCCGAACGACCAGACCGTACTGGCTAACTATAACGGT CCTAAGGTAGCGAAGAACGAACAAGTTATCGACGGCTGCTGCTGGCGCTG CGATACCAAAGTTGTAGGGATAACAGGGTAATAACGTAAAGAGATCCCGC AGTGGTTTATCAAAATCACTGCTTACGCTGACTAACTATAACGGTCCTAA GGTAGCGAATTGCAGCTCAACGATCTGGATAAACTGGATCACTGGCCAGA CACCGTTAATAGGGATAACAGGGTAATCGAATTCGTCTGCGACACGTAG Sequence of corresponding RNA transcript [SEQ ID NO: 144]: RNA_1: GNNNNNCAGCGUUCGCUACAGCGUUCGCUAUAACUAUAACGGUCCUAAGG UAGCGAAAUGCAAGAGCAAUACCGCCCGGAAGAGAUAGAAUCCAAAGUAC AGCUUCAUAGGGAUAACAGGGUAAUUUGGGAUGAGAAGCGCACAUUUGAA GUAACCGAAGACGAGAGCAAAGAGAUAACUAUAACGGUCCUAAGGUAGCG AAAGUAUUACUGCCUGUCUAUGCUUCCCUAUCCUUCUGGUCGACUACACA UGUAGGGAUAACAGGGUAAUGGCCACGUACGUAACUACACCAUCGGUGAC GUGAUCGCCCGCUACCAGCGUAACUAUAACGGUCCUAAGGUAGCGAAUAU GCUGGGCAAAAACGUCCUGCAGCCGAUCGGCUGGGACGCGUUUGGUCUAG GGAUAACAGGGUAAUUGCCUGCGGAAGGCGCGGCGGUGAAAAACAACACC GCUCCGGCACCGUGGUAACUAUAACGGUCCUAAGGUAGCGAAACGUACGA CAACAUCGCGUAUAUGAAAAACCAGCUCAAAAUGCUGGGCUUUAGGGAUA ACAGGGUAAUUGGUUAUGACUGGAGCCGCGAGCUGGCAACCUGUACGCCG GAAUACUACCUAACUAUAACGGUCCUAAGGUAGCGAAGUUGGGAACAGAA AUUCUUCACCGAGCUGUAUAAAAAAGGCCUGGUAUAUUAGGGAUAACAGG GUAAUAAGAAGACUUCUGCGGUCAACUGGUGCCCGAACGACCAGACCGUA CUGGCUAACUAUAACGGUCCUAAGGUAGCGAAGAACGAACAAGUUAUCGA CGGCUGCUGCUGGCGCUGCGAUACCAAAGUUGUAGGGAUAACAGGGUAAU 
AACGUAAAGAGAUCCCGCAGUGGUUUAUCAAAAUCACUGCUUACGCUGAC UAACUAUAACGGUCCUAAGGUAGCGAAUUGCAGCUCAACGAUCUGGAUAA ACUGGAUCACUGGCCAGACACCGUUAAUAGGGAUAACAGGGUAAUCGAAU UCGUCUGCGACACGUAGNNNNNN RNA_2: [SEQ ID NO: 145] GNNNNNCAGCGUUCGCUACAGCGUUCGCUAUAACUAUAACGGUCCUAAGG UAGCGAAAUGCAAGAGCAAUACCGCCCGGAAGAGAUAGAAUCCAAAGUAC AGCUUCAUAGGGAUAACAGGGUAAUUUGGGAUGAGAAGCGCACAUUUGAA GUAACCGAAGACGAGAGCAAAGAGAUAACUAUAACGGUCCUAAGGUAGCG AAAGUAUUACUGCCUGUCUAUGCUUCCCUAUCCUUCUGGUCGACUACACA UGUAGGGAUAACAGGGUAAUGGCCACGUACGUAACUACACCAUCGGUGAC GUGAUCGCCCGCUACCAGCGUAACUAUAACGGUCCUAAGGUAGCGAAUAU GCUGGGCAAAAACGUCCUGCAGCCGAUCGGCUGGGACGCGUUUGGUCUAG GGAUAACAGGGUAAUUGCCUGCGGAAGGCGCGGCGGUGAAAAACAACACC GCUCCGGCACCGUGGUAACUAUAACGGUCCUAAGGUAGCGAAACGUACGA CAACAUCGCGUAUAUGAAAAACCAGCUCAAAAUGCUGGGCUUUAGGGAUA ACAGGGUAAUUGGUUAUGACUGGAGCCGCGAGCUGGCAACCUGUACGCCG GAAUACUACCUAACUAUAACGGUCCUAAGGUAGCGAAGUUGGGAACAGAA AUUCUUCACCGAGCUGUAUAAAAAAGGCCUGGUAUAUUAGGGAUAACAGG GUAAUAAGAAGACUUCUGCGGUCAACUGGUGCCCGAACGACCAGACCGUA CUGGCUAACUAUAACGGUCCUAAGGUAGCGAAGAACGAACAAGUUAUCGA CGGCUGCUGCUGGCGCUGCGAUACCAAAGUUGUAGGGAUAACAGGGUAAU AACGUAAAGAGAUCCCGCAGUGGUUUAUCAAAAUCACUGCUUACGCUGAC UAACUAUAACGGUCCUAAGGUAGCGAAUUGCAGCUCAACGAUCUGGAUAN NNNNN RNA_3: [SEQ ID NO: 146] GNNNNNCAGCGUUCGCUACAGCGUUCGCUAUAACUAUAACGGUCCUAAGG UAGCGAAAUGCAAGAGCAAUACCGCCCGGAAGAGAUAGAAUCCAAAGUAC AGCUUCAUAGGGAUAACAGGGUAAUUUGGGAUGAGAAGCGCACAUUUGAA GUAACCGAAGACGAGAGCAAAGAGAUAACUAUAACGGUCCUAAGGUAGCG AAAGUAUUACUGCCUGUCUAUGCUUCCCUAUCCUUCUGGUCGACUACACA UGUAGGGAUAACAGGGUAAUGGCCACGUACGUAACUACACCAUCGGUGAC GUGAUCGCCCGCUACCAGCGUAACUAUAACGGUCCUAAGGUAGCGAAUAU GCUGGGCAAAAACGUCCUGCAGCCGAUCGGCUGGGACGCGUUUGGUCUAG GGAUAACAGGGUAAUUGCCUGCGGAAGGCGCGGCGGUGAAAAACAACACC GCUCCGGCACCGUGGUAACUAUAACGGUCCUAAGGUAGCGAAACGUACGA CAACAUCGCGUAUAUGAAAAACCAGCUCAAAAUGCUGGGCUUUAGGGAUA ACAGGGUAAUUGGUUAUGACUGGAGCCGCGAGCUGGCAACCUGUACGCCG GAAUACUACCUAACUAUAACGGUCCUAAGGUAGCGAAGUUGGGAACAGAA AUUCUUCACCGAGCUGUAUAAAAAAGGCCUGGUAUAUUAGGGAUAACAGG GUAAUAAGAAGACUUCUGCGGUCAACUGGUGCCCGAACGACCAGACCGUA 
CUGGCUAACUAUAACGGUCCUAAGGUAGCGAAGAACGAACAAGUUAUCGA CGGCUGCUGCUGGCGCUGCGAUACCAAAGUUGUAGGGAUAACAGGGUAAU AACGUAAAGAGAUCCCGCAGUGNNNNNN RNA_4: [SEQ ID NO: 147] GNNNNNCAGCGUUCGCUACAGCGUUCGCUAUAACUAUAACGGUCCUAAGG UAGCGAAAUGCAAGAGCAAUACCGCCCGGAAGAGAUAGAAUCCAAAGUAC AGCUUCAUAGGGAUAACAGGGUAAUUUGGGAUGAGAAGCGCACAUUUGAA GUAACCGAAGACGAGAGCAAAGAGAUAACUAUAACGGUCCUAAGGUAGCG AAAGUAUUACUGCCUGUCUAUGCUUCCCUAUCCUUCUGGUCGACUACACA UGUAGGGAUAACAGGGUAAUGGCCACGUACGUAACUACACCAUCGGUGAC GUGAUCGCCCGCUACCAGCGUAACUAUAACGGUCCUAAGGUAGCGAAUAU GCUGGGCAAAAACGUCCUGCAGCCGAUCGGCUGGGACGCGUUUGGUCUAG GGAUAACAGGGUAAUUGCCUGCGGAAGGCGCGGCGGUGAAAAACAACACC GCUCCGGCACCGUGGUAACUAUAACGGUCCUAAGGUAGCGAAACGUACGA CAACAUCGCGUAUAUGAAAAACCAGCUCAAAAUGCUGGGCUUUAGGGAUA ACAGGGUAAUUGGUUAUGACUGGAGCCGCGAGCUGGCAACCUGUACGCCG GAAUACUACCUAACUAUAACGGUCCUAAGGUAGCGAAGUUGGGAACAGAA AUUCUUCACCGAGCUGUAUAAAAAAGGCCUGGUAUAUUAGGGAUAACAGG GUAAUAAGAAGACUUCUGCGGUCAACUGGUGCCCGAACGACCAGACCGUA CUGGCUAACUAUAACGGUCCUAAGGUAGCGAAGAACGAACAAGUUAUCGA CGGCNNNNNN RNA_5: [SEQ ID NO: 148] GNNNNNCAGCGUUCGCUACAGCGUUCGCUAUAACUAUAACGGUCCUAAGG UAGCGAAAUGCAAGAGCAAUACCGCCCGGAAGAGAUAGAAUCCAAAGUAC AGCUUCAUAGGGAUAACAGGGUAAUUUGGGAUGAGAAGCGCACAUUUGAA GUAACCGAAGACGAGAGCAAAGAGAUAACUAUAACGGUCCUAAGGUAGCG AAAGUAUUACUGCCUGUCUAUGCUUCCCUAUCCUUCUGGUCGACUACACA UGUAGGGAUAACAGGGUAAUGGCCACGUACGUAACUACACCAUCGGUGAC GUGAUCGCCCGCUACCAGCGUAACUAUAACGGUCCUAAGGUAGCGAAUAU GCUGGGCAAAAACGUCCUGCAGCCGAUCGGCUGGGACGCGUUUGGUCUAG GGAUAACAGGGUAAUUGCCUGCGGAAGGCGCGGCGGUGAAAAACAACACC GCUCCGGCACCGUGGUAACUAUAACGGUCCUAAGGUAGCGAAACGUACGA CAACAUCGCGUAUAUGAAAAACCAGCUCAAAAUGCUGGGCUUUAGGGAUA ACAGGGUAAUUGGUUAUGACUGGAGCCGCGAGCUGGCAACCUGUACGCCG GAAUACUACCUAACUAUAACGGUCCUAAGGUAGCGAAGUUGGGAACAGAA AUUCUUCACCGAGCUGUAUAAAAAAGGCCUGGUAUAUUAGGGAUAACAGG GUAAUAAGAAGACUUCUGCGGUCAACUNNNNNN RNA_6: [SEQ ID NO: 149] GNNNNNCAGCGUUCGCUACAGCGUUCGCUAUAACUAUAACGGUCCUAAGG UAGCGAAAUGCAAGAGCAAUACCGCCCGGAAGAGAUAGAAUCCAAAGUAC AGCUUCAUAGGGAUAACAGGGUAAUUUGGGAUGAGAAGCGCACAUUUGAA GUAACCGAAGACGAGAGCAAAGAGAUAACUAUAACGGUCCUAAGGUAGCG 
AAAGUAUUACUGCCUGUCUAUGCUUCCCUAUCCUUCUGGUCGACUACACA UGUAGGGAUAACAGGGUAAUGGCCACGUACGUAACUACACCAUCGGUGAC GUGAUCGCCCGCUACCAGCGUAACUAUAACGGUCCUAAGGUAGCGAAUAU GCUGGGCAAAAACGUCCUGCAGCCGAUCGGCUGGGACGCGUUUGGUCUAG GGAUAACAGGGUAAUUGCCUGCGGAAGGCGCGGCGGUGAAAAACAACACC GCUCCGGCACCGUGGUAACUAUAACGGUCCUAAGGUAGCGAAACGUACGA CAACAUCGCGUAUAUGAAAAACCAGCUCAAAAUGCUGGGCUUUAGGGAUA ACAGGGUAAUUGGUUAUGACUGGAGCCGCGAGCUGGCAACCUGUACGCCG GAAUACUACCUAACUAUAACGGUCCUAAGGUAGCGAAGUUGGGAACAGAA AUUCUUCACNNNNNN RNA_7: [SEQ ID NO: 150] GNNNNNCAGCGUUCGCUACAGCGUUCGCUAUAACUAUAACGGUCCUAAGG UAGCGAAAUGCAAGAGCAAUACCGCCCGGAAGAGAUAGAAUCCAAAGUAC AGCUUCAUAGGGAUAACAGGGUAAUUUGGGAUGAGAAGCGCACAUUUGAA GUAACCGAAGACGAGAGCAAAGAGAUAACUAUAACGGUCCUAAGGUAGCG AAAGUAUUACUGCCUGUCUAUGCUUCCCUAUCCUUCUGGUCGACUACACA UGUAGGGAUAACAGGGUAAUGGCCACGUACGUAACUACACCAUCGGUGAC GUGAUCGCCCGCUACCAGCGUAACUAUAACGGUCCUAAGGUAGCGAAUAU GCUGGGCAAAAACGUCCUGCAGCCGAUCGGCUGGGACGCGUUUGGUCUAG GGAUAACAGGGUAAUUGCCUGCGGAAGGCGCGGCGGUGAAAAACAACACC GCUCCGGCACCGUGGUAACUAUAACGGUCCUAAGGUAGCGAAACGUACGA CAACAUCGCGUAUAUGAAAAACCAGCUCAAAAUGCUGGGCUUUAGGGAUA ACAGGGUAAUUGGUUAUGACUGGAGCCGCGAGNNNNNN RNA_8: [SEQ ID NO: 151] GNNNNNCAGCGUUCGCUACAGCGUUCGCUAUAACUAUAACGGUCCUAAGG UAGCGAAAUGCAAGAGCAAUACCGCCCGGAAGAGAUAGAAUCCAAAGUAC AGCUUCAUAGGGAUAACAGGGUAAUUUGGGAUGAGAAGCGCACAUUUGAA GUAACCGAAGACGAGAGCAAAGAGAUAACUAUAACGGUCCUAAGGUAGCG AAAGUAUUACUGCCUGUCUAUGCUUCCCUAUCCUUCUGGUCGACUACACA UGUAGGGAUAACAGGGUAAUGGCCACGUACGUAACUACACCAUCGGUGAC GUGAUCGCCCGCUACCAGCGUAACUAUAACGGUCCUAAGGUAGCGAAUAU GCUGGGCAAAAACGUCCUGCAGCCGAUCGGCUGGGACGCGUUUGGUCUAG GGAUAACAGGGUAAUUGCCUGCGGAAGGCGCGGCGGUGAAAAACAACACC GCUCCGGCACCGUGGUAACUAUAACGGUCCUAAGGUAGCGAAACGUACGA CAACAUCGCGUAUANNNNNN RNA_9: [SEQ ID NO: 152] GNNNNNCAGCGUUCGCUACAGCGUUCGCUAUAACUAUAACGGUCCUAAGG UAGCGAAAUGCAAGAGCAAUACCGCCCGGAAGAGAUAGAAUCCAAAGUAC AGCUUCAUAGGGAUAACAGGGUAAUUUGGGAUGAGAAGCGCACAUUUGAA GUAACCGAAGACGAGAGCAAAGAGAUAACUAUAACGGUCCUAAGGUAGCG AAAGUAUUACUGCCUGUCUAUGCUUCCCUAUCCUUCUGGUCGACUACACA UGUAGGGAUAACAGGGUAAUGGCCACGUACGUAACUACACCAUCGGUGAC 
GUGAUCGCCCGCUACCAGCGUAACUAUAACGGUCCUAAGGUAGCGAAUAU GCUGGGCAAAAACGUCCUGCAGCCGAUCGGCUGGGACGCGUUUGGUCUAG GGAUAACAGGGUAAUUGCCUGCGGAAGGCGCGGCGGUNNNNNN RNA_10: [SEQ ID NO: 153] GNNNNNCAGCGUUCGCUACAGCGUUCGCUAUAACUAUAACGGUCCUAAGG UAGCGAAAUGCAAGAGCAAUACCGCCCGGAAGAGAUAGAAUCCAAAGUAC AGCUUCAUAGGGAUAACAGGGUAAUUUGGGAUGAGAAGCGCACAUUUGAA GUAACCGAAGACGAGAGCAAAGAGAUAACUAUAACGGUCCUAAGGUAGCG AAAGUAUUACUGCCUGUCUAUGCUUCCCUAUCCUUCUGGUCGACUACACA UGUAGGGAUAACAGGGUAAUGGCCACGUACGUAACUACACCAUCGGUGAC GUGAUCGCCCGCUACCAGCGUAACUAUAACGGUCCUAAGGUAGCGAAUAU GCUGGGCAAAAACGUCCUGNNNNNN
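    Underlining and bold do not survive in plain text, so the recognition sites in the sequences above can instead be located programmatically. The sketch below is a minimal example under stated assumptions: the two site strings are the commonly published I-SceI and I-CeuI recognition sequences rather than values quoted from this document, the function name is hypothetical, and only the first 150 nucleotides of SEQ ID NO: 143 are used.

```python
I_SCEI = "TAGGGATAACAGGGTAAT"           # assumed I-SceI recognition sequence (18 nt)
I_CEUI = "TAACTATAACGGTCCTAAGGTAGCGAA"  # assumed I-CeuI recognition sequence (27 nt)

# First 150 nucleotides of SEQ ID NO: 143, with the block spacing removed.
EXCERPT = (
    "CAGCGTTCGCTATAACTATAACGGTCCTAAGGTAGCGAAATGCAAGAGCA"
    "ATACCGCCCGGAAGAGATAGAATCCAAAGTACAGCTTCATAGGGATAACA"
    "GGGTAATTTGGGATGAGAAGCGCACATTTGAAGTAACCGAAGACGAGAGC"
)


def find_sites(sequence, site):
    """Return the 0-based start position of every occurrence of site."""
    sequence = sequence.upper().replace("U", "T")  # accept RNA input as well
    starts = []
    pos = sequence.find(site)
    while pos != -1:
        starts.append(pos)
        pos = sequence.find(site, pos + 1)
    return starts


for name, site in (("I-CeuI", I_CEUI), ("I-SceI", I_SCEI)):
    print(name, find_sites(EXCERPT, site))
```

    Applying the same function to the full sequence, after removing the spaces, would list every embedded site of each type.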

    [0182] FIG. 6

    [0183] Primers used to amplify carrier molecules. The same forward primer is used to create PCR templates for all carrier molecules. T7 promoter sequence is underlined: PCR_GN5_f1:

    TABLE-US-00002 [SEQ ID NO: 154] TAATACGACTCACTATAGNNNNNCAGCGTTCGCTA
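    The structure of this primer can be made explicit in code. The following is a minimal sketch, assuming that the underlined T7 promoter corresponds to the widely used 18-nucleotide sequence ending in the first transcribed G; the helper names are hypothetical. The remaining 3′ portion matches the first 12 nucleotides of SEQ ID NO: 143.

```python
import random

PRIMER = "TAATACGACTCACTATAGNNNNNCAGCGTTCGCTA"  # PCR_GN5_f1, SEQ ID NO: 154
T7_PROMOTER = "TAATACGACTCACTATAG"  # assumed extent of the T7 promoter


def split_primer(primer):
    """Split the primer into T7 promoter, random stretch and carrier-specific part."""
    if not primer.startswith(T7_PROMOTER):
        raise ValueError("primer does not start with the assumed T7 promoter")
    rest = primer[len(T7_PROMOTER):]
    n_len = len(rest) - len(rest.lstrip("N"))  # length of the leading run of N
    return T7_PROMOTER, rest[:n_len], rest[n_len:]


def instantiate(primer, rng=random):
    """Replace every N with a randomly drawn nucleotide."""
    return "".join(rng.choice("ACGT") if base == "N" else base for base in primer)


print(split_primer(PRIMER))
print(instantiate(PRIMER))
```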

    [0184] FIG. 7

    [0185] A) PCR conditions used to create carrier templates. B) Carrier combinations tested in SLIC-CAGE. .sup.a Proportions of each carrier used are given in FIG. 8.

    [0186] FIG. 8

    [0187] A) Carrier molecule quantities used in SLIC-CAGE. Provides approximately 50 μg of the carrier mix 0.3-1 kb (44 μg of uncapped and 5 μg of capped). B) Primer sequences for qPCR used to estimate the ratio of target library and the leftover carrier. C) Real-time qPCR cycling conditions. D) PCR amplification conditions. .sup.a x corresponds to the Ct value obtained in qPCR with adapter_f1 and adapter_r1 primers.

    [0188] FIG. 9

    [0189] Number of PCR cycles used to amplify SLIC-CAGE and nanoCAGE libraries. .sup.a Reference nAnT-iCAGE sample diluted 100-fold and PCR amplified for 13 cycles using adapter_f1 and adapter_r1 primers.

    [0190] FIG. 10

    [0191] SLIC-CAGE, nAnT-iCAGE and nanoCAGE mapping efficiency.

    [0192] FIG. 11

    [0193] SLIC-CAGE carrier leftover.

    [0194] FIG. 12

    [0195] CTSS and tag cluster in SLIC-CAGE and nAnT-iCAGE.

    [0196] FIG. 13

    [0197] A) CTSS and tag cluster identification in nanoCAGE. B) CTSS and tag cluster identification in nanoCAGE. .sup.a Number of alignment mismatches at the 1st and 2nd nucleotide position in nanoCAGE tags. .sup.b Number of GG dinucleotides at 1st and 2nd position in nanoCAGE tags, flagged as mismatches in the alignment.

    [0198] FIG. 14

    [0199] Template switching oligonucleotides used in nanoCAGE. .sup.a TSO sequences are from Poulain, S. et al. NanoCAGE: A Method for the Analysis of Coding and Noncoding 5′-Capped Transcriptomes. Methods Mol Biol 1543, 57-109 (2017).

    [0200] FIG. 15

    [0201] Design and test of carrier molecules. (a) Schematics of the recombinant plasmid with the synthetic carrier gene. (b) Workflow for preparation of the carrier molecules with embedded I-CeuI and I-SceI recognition sites. First, the DNA template for in vitro transcription is produced using PCR amplification with a common forward primer (PCR_GN5_f1) and a variety of reverse primers (PCR_N6_r1-r10), to synthesise PCR templates of different lengths (931-351 nucleotides, Supplementary Table 2). The forward primer contains the T7-promoter sequence, and a GN.sub.5 sequence (N = random nucleotide). The reverse primer dictates the length of the final carrier and introduces random nucleotides at the 3′ end of carrier molecules (N.sub.6). After PCR amplification, the templates are gel-purified and the carrier molecules synthesised using run-off in vitro transcription. Carriers are then purified and a portion of them capped, followed by purification. (c-h) Test of various carrier mixes added to 100 ng of S. cerevisiae total RNA. Pearson correlation of the libraries constructed using 100 ng of S. cerevisiae total RNA and (c) no carrier added, (d) mix 1: mix of 931 nucleotides capped (0.5 μg) and 931 nucleotides (4.4 μg) uncapped carrier, (e) mix 2: mix of 351-931 nucleotides capped (0.5 μg) and 351-931 nucleotides (4.4 μg) uncapped carrier, replicate 1, (f) mix 2: same as in (e), replicate 2, (g) mix 3: 931 nucleotides capped (0.5 μg) carrier, (h) mix 4: 351-931 nucleotides capped (0.5 μg) carrier. All carrier mixes are presented in detail in Supplementary Tables 4 and 5. (i) Pearson correlation of two nAnT-iCAGE technical replicates constructed using 5 μg of total S. cerevisiae RNA. (j) Genomic locations of tag clusters identified in carrier test SLIC-CAGE libraries and the reference nAnT-iCAGE library. (k) Distribution of tag cluster interquantile widths in carrier test SLIC-CAGE libraries and the reference nAnT-iCAGE library. (l) Nucleotide composition of all CTSSs identified in carrier test SLIC-CAGE libraries and in the reference nAnT-iCAGE library. (m) Dinucleotide composition of all CTSSs (left panel) or dominant CTSSs (right panel) identified in carrier test SLIC-CAGE libraries and in the reference nAnT-iCAGE library. Both panels are ordered from the most to least used dinucleotide in the reference nAnT-iCAGE.

    [0202] FIG. 16

    [0203] Performance comparison of SLIC-CAGE and nanoCAGE libraries. Pearson correlation coefficients of (a) SLIC-CAGE libraries constructed from 1-100 ng of S. cerevisiae total RNA and corresponding nAnT-iCAGE libraries, (b) nanoCAGE libraries constructed from 5-500 ng of S. cerevisiae total RNA and the nAnT-iCAGE libraries, (c) SLIC-CAGE libraries constructed from 5-100 ng of M. musculus total RNA and the reference nAnT-iCAGE library. (d-f) Genomic locations of tag clusters identified in (d) S. cerevisiae SLIC-CAGE libraries and the reference nAnT-iCAGE library, (e) S. cerevisiae nanoCAGE libraries and the reference nAnT-iCAGE library, (f) M. musculus SLIC-CAGE libraries and the reference nAnT-iCAGE library. (g-i) Nucleotide composition of all CTSSs identified in (g) S. cerevisiae SLIC-CAGE libraries, (h) S. cerevisiae nanoCAGE libraries, (i) M. musculus SLIC-CAGE libraries. (j-l) Dinucleotide composition of all CTSSs identified in (j) S. cerevisiae SLIC-CAGE libraries, (k) S. cerevisiae nanoCAGE libraries, (l) M. musculus SLIC-CAGE libraries. All panels are ordered from the most to the least used dinucleotide in the reference nAnT-iCAGE. (m-o) Dinucleotide composition of dominant CTSSs identified in (m) S. cerevisiae SLIC-CAGE libraries, (n) S. cerevisiae nanoCAGE libraries, (o) M. musculus SLIC-CAGE libraries. All panels are ordered from the most to the least used dominant CTSS dinucleotide in the reference nAnT-iCAGE.

    [0204] FIG. 17

    [0205] Distribution of tag cluster interquantile widths in (a) SLIC-CAGE libraries prepared from 1-100 ng of S. cerevisiae total RNA in comparison with the nAnT-iCAGE and PCR amplified nAnT-iCAGE library (diluted in water 1:100 and PCR amplified for 13 cycles). (b) SLIC-CAGE libraries prepared from 5-100 ng of M. musculus total RNA in comparison with nAnT-iCAGE. (c) nanoCAGE libraries prepared from 5-500 ng of S. cerevisiae total RNA in comparison with nAnT-iCAGE.

    [0206] FIG. 18

    [0207] Assessment of positional accuracy in S. cerevisiae SLIC-CAGE libraries prepared from various amounts of total RNA (a) 1 ng, (b) 2 ng, (c) 5 ng, (d) 10 ng, replicate 1, (e) 10 ng, replicate 2, (f) 25 ng, (g) 50 ng, (h) 100 ng, or (i) nAnT-iCAGE library prepared from 5 μg of total RNA, diluted 1:100 and PCR amplified. Left panels: heatmaps represent log10(TPM ratio), where the ratio is defined as the nAnT-iCAGE TPM value divided by the corresponding SLIC-CAGE TPM value for each CTSS identified in both libraries. The horizontal lines separate four expression level (TPM) quantiles, with the lowest expression quantile on top, and the highest at the bottom of the heatmap. Within each quantile, the sequences are ordered from the highest to the lowest overall TPM ratio values per tag cluster in each SLIC-CAGE library. Middle panels: heatmaps represent the log10 TPM value of the CTSS present in the nAnT-iCAGE and absent from the SLIC-CAGE library, or the −log10 TPM value of the CTSS present in the SLIC-CAGE library and absent from the nAnT-iCAGE library. Ordering is the same as explained for left panels. Right panels: coverage of CTSSs present in the reference nAnT-iCAGE library, centred on the dominant CTSS identified in the SLIC-CAGE library with ordering as in the left panels.
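    The quantities plotted in the left and middle panels can be computed as sketched below. This is a minimal illustration, assuming each library is represented as a mapping from a CTSS identifier to a strictly positive TPM value; the function name and data layout are hypothetical and are not taken from the analysis code behind the figures.

```python
import math


def tpm_comparison(reference_tpm, test_tpm):
    """Compare two CTSS -> TPM mappings (all TPM values assumed > 0).

    Returns (shared, unique):
      shared maps each CTSS found in both libraries to log10(reference / test);
      unique maps each CTSS found in one library only to log10(TPM) when it is
      reference-only, or to -log10(TPM) when it is test-only.
    """
    shared, unique = {}, {}
    for ctss in reference_tpm.keys() | test_tpm.keys():
        ref = reference_tpm.get(ctss)
        test = test_tpm.get(ctss)
        if ref is not None and test is not None:
            shared[ctss] = math.log10(ref / test)
        elif ref is not None:
            unique[ctss] = math.log10(ref)
        else:
            unique[ctss] = -math.log10(test)
    return shared, unique
```

    Here reference_tpm would hold the nAnT-iCAGE values and test_tpm the values of the low-input library being assessed.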

    [0208] FIG. 19

    [0209] Assessment of positional accuracy in M. musculus SLIC-CAGE libraries prepared from various amounts of total RNA (a) 5 ng, (b) 10 ng, (c) 25 ng, (d) 50 ng or (e) 100 ng. Left panels: heatmaps represent log10(TPM ratio), where the ratio is defined as the nAnT-iCAGE TPM value divided by the corresponding SLIC-CAGE TPM value for each CTSS identified in both libraries. The horizontal lines separate four expression level (TPM) quantiles, with the lowest expression quantile on top, and the highest at the bottom of the heatmap. Within each quantile, the sequences are ordered from the highest to the lowest overall TPM ratio values per tag cluster in each SLIC-CAGE library. Middle panels: heatmaps represent the log10 TPM value of the CTSS present in the nAnT-iCAGE and absent from the SLIC-CAGE library, or the −log10 TPM value of the CTSS present in the SLIC-CAGE library and absent from the nAnT-iCAGE library. Ordering is the same as explained for left panels. Right panels: coverage of CTSSs present in the reference nAnT-iCAGE library, centred on the dominant CTSS identified in the SLIC-CAGE library with ordering as in the left panels.

    [0210] FIG. 20

    [0211] Dinucleotide composition of dominant CTSSs identified in nanoCAGE libraries derived from 5-500 ng of S. cerevisiae total RNA and compared with the nAnT-iCAGE library (derived from 5 μg of total RNA). Dominant CTSSs are split according to genomic locations.

    [0212] FIG. 21

    [0213] Dinucleotide composition of dominant CTSSs identified in nanoCAGE libraries derived from 5-500 ng of S. cerevisiae total RNA and compared with the nAnT-iCAGE library (derived from 5 μg of total RNA). Dominant CTSSs are split according to their expression (TPM) values into quartiles (Q1: the lowest 25%, Q4: the highest 25%).

    [0214] FIG. 22

    [0215] Assessment of positional accuracy in S. cerevisiae nanoCAGE libraries prepared from various amounts of total RNA: (a) 5 ng, (b) 10 ng, replicate 1, (c) 10 ng, replicate 2, (d) 25 ng, replicate 1, (e) 25 ng, replicate 2, (f) 50 ng, (g) 500 ng, replicate 1, (h) 500 ng, replicate 2, or (i) nAnT-iCAGE library prepared from 5 μg of total RNA, replicate 1. Left panels: heatmaps represent log10(TPM ratio), where the ratio is defined as the nAnT-iCAGE TPM value divided by the corresponding nanoCAGE TPM value for each CTSS identified in both libraries. The horizontal lines separate four expression level (TPM) quantiles, with the lowest expression quantile on top, and the highest at the bottom of the heatmap. Within each quantile, the sequences are ordered from the highest to the lowest overall TPM ratio values per tag cluster in each nanoCAGE library. Middle panels: heatmaps represent the log10 TPM value of the CTSS present in the nAnT-iCAGE and absent from the nanoCAGE library, or the −log10 TPM value of the CTSS present in the nanoCAGE library and absent from the nAnT-iCAGE library. Ordering is the same as explained for left panels. Right panels: coverage of CTSSs present in the reference nAnT-iCAGE library, centred on the dominant CTSS identified in the nanoCAGE library with ordering as in the left panels.

    [0216] FIG. 23

    [0217] Separation of sharp and broad promoters/tag clusters in M. musculus SLIC-CAGE libraries. (a) Number of sharp or broad tag clusters (y-axis) as a function of the interquantile width threshold (x-axis). The white dashed vertical line marks the chosen empirical threshold for separating sharp and broad tag clusters/promoters (sharp have interquantile width <=3 and broad >3). (b) Average AA/AT/TA/TT dinucleotide relative frequency in sharp or broad promoters identified in SLIC-CAGE or nAnT-iCAGE libraries. (c) Comparison of TATA-box density in SLIC-CAGE and nAnT-iCAGE libraries. Promoter regions are scanned using a minimum of 80.sup.th percentile match to the TATA-box position weight matrix (PWM), centred on the dominant TSS and ordered by interquantile width with the sharpest promoters on top of the heatmap, and broadest at the bottom. The horizontal black line separates sharp and broad promoters, as defined in (a). (d) TATA-box relative frequency in sharp or broad promoters.

    [0218] FIG. 24

    [0219] Pattern discovery in M. musculus SLIC-CAGE libraries. Comparison of CTSS coverage, TA dinucleotide density, GC dinucleotide density, H3K4me3 coverage and CpG island coverage in SLIC-CAGE libraries prepared from (a) 5 ng, (b) 10 ng, (c) 25 ng, (d) 50 ng or (e) 100 ng of total RNA and the nAnT-iCAGE library prepared from (f) 5 μg of total RNA. Windows are centred on the dominant CTSSs identified in SLIC-CAGE or nAnT-iCAGE libraries. Promoter regions are all ordered from sharpest to broadest tag cluster interquantile width. The horizontal line separates sharp and broad promoters (defined by an empirical threshold where interquantile width <=3 defines sharp, and interquantile width >3 defines broad promoters).

    [0220] FIG. 25

    [0221] Detailed workflow of SLIC-CAGE protocol steps following carrier degradation.

    [0222] FIG. 26

    [0223] Representative HS DNA bioanalyzer traces of libraries prepared from 5 or 10 ng of M. musculus total RNA after carrier degradation and PCR amplification steps: (a, b) prior to 2.sup.nd round of AMPure XP size selection; (c, d) final SLIC-CAGE library after 2.sup.nd round of AMPure XP size selection.

    [0224] FIG. 27

    [0225] Predicted frequency of cutting by homing endonuclease enzymes in various genomes.

    .sup.aSaccharomyces cerevisiae genome size: 12,100,000 nucleotides; GC content: 38%
    .sup.bDrosophila melanogaster genome size: 175,000,000 nucleotides; GC content: 43%
    .sup.cMus musculus genome size: 2,700,000,000 nucleotides; GC content: 42%
    .sup.dHomo sapiens genome size: 3,289,000,000 nucleotides, GC content: 41%
    *Perfect sequence matching is taken into account.
    **First number in the table is calculated using equal probability of each nucleotide. The second number takes into account the GC content of each organism (therefore it is more accurate).
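
    The two estimates described in the footnotes above can be approximated along the following lines. This is a minimal sketch only: it assumes independent nucleotides, perfect sequence matching and sites counted on both strands, and the function name and the 18-nucleotide placeholder site are hypothetical. The calculation actually used for FIG. 27 may differ.

```python
def expected_site_count(site, genome_size, gc_content=0.5, both_strands=True):
    """Expected number of perfect matches to `site` in a random genome.

    Assumes independent nucleotides with P(G) = P(C) = gc_content / 2 and
    P(A) = P(T) = (1 - gc_content) / 2. gc_content=0.5 gives the
    equal-probability estimate.
    """
    p_gc = gc_content / 2.0
    p_at = (1.0 - gc_content) / 2.0
    p_site = 1.0
    for base in site.upper():
        p_site *= p_gc if base in "GC" else p_at
    positions = genome_size - len(site) + 1  # possible start positions
    strands = 2 if both_strands else 1
    return p_site * positions * strands


# Hypothetical 18-nt placeholder site (not an actual enzyme recognition site)
site = "TAGCGATAACAGCGTAAT"
genome_size = 12_100_000  # S. cerevisiae genome size from the footnote
uniform = expected_site_count(site, genome_size)
gc_aware = expected_site_count(site, genome_size, gc_content=0.38)
print(f"equal probability: {uniform:.2e}, GC-adjusted: {gc_aware:.2e}")
```

    For an AT-rich site in an AT-rich genome the GC-adjusted estimate is higher than the equal-probability one, but both remain far below one expected site per genome for an 18-nucleotide sequence in a genome of this size.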

    [0226] FIG. 28

    [0227] Exemplary endonucleases that are considered to be suitable for use in the invention.

    REFERENCES

    [0228] Poulain, S. et al. NanoCAGE: A Method for the Analysis of Coding and Noncoding 5′-Capped Transcriptomes. Methods Mol Biol 1543, 57-109 (2017).
    [0229] 1. Shiraki, T. et al. Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proceedings of the National Academy of Sciences of the United States of America 100, 15776-15781 (2003).
    [0230] 2. Smale, S. T. & Kadonaga, J. T. The RNA polymerase II core promoter. Annu Rev Biochem 72, 449-479 (2003).
    [0231] 3. Consortium, F. et al. A promoter-level mammalian expression atlas. Nature 507, 462-470 (2014).
    [0232] 4. Andersson, R. et al. An atlas of active enhancers across human cell types and tissues. Nature 507, 455-461 (2014).
    [0233] 5. Haberle, V. et al. Two independent transcription initiation codes overlap on vertebrate core promoters. Nature 507, 381-385 (2014).
    [0234] 6. Lenhard, B., Sandelin, A. & Carninci, P. Metazoan promoters: emerging characteristics and insights into transcriptional regulation. Nat Rev Genet 13, 233-245 (2012).
    [0235] 7. Haberle, V. & Lenhard, B. Promoter architectures and developmental gene regulation. Semin Cell Dev Biol 57, 11-23 (2016).
    [0236] 8. Consortium, E. P. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57-74 (2012).
    [0237] 9. Celniker, S. E. et al. Unlocking the secrets of the genome. Nature 459, 927-930 (2009).
    [0238] 10. Kawaji, H., Kasukawa, T., Forrest, A., Carninci, P. & Hayashizaki, Y. The FANTOM5 collection, a data series underpinning mammalian transcriptome atlases in diverse cell types. Sci Data 4, 170113 (2017).
    [0239] 11. Carninci, P. et al. High-efficiency full-length cDNA cloning by biotinylated CAP trapper. Genomics 37, 327-336 (1996).
    [0240] 12. Kodzius, R. et al. CAGE: cap analysis of gene expression. Nat Methods 3, 211-222 (2006).
    [0241] 13. Takahashi, H., Lassmann, T., Murata, M. & Carninci, P. 5′ end-centered expression profiling using cap-analysis gene expression and next-generation sequencing. Nat Protoc 7, 542-561 (2012).
    [0242] 14. Murata, M. et al. Detecting expressed genes using CAGE. Methods Mol Biol 1164, 67-85 (2014).
    [0243] 15. Plessy, C. et al. Linking promoters to functional transcripts in small samples with nanoCAGE and CAGEscan. Nat Methods 7, 528-534 (2010).
    [0244] 16. Poulain, S. et al. NanoCAGE: A Method for the Analysis of Coding and Noncoding 5′-Capped Transcriptomes. Methods Mol Biol 1543, 57-109 (2017).
    [0245] 17. Zhu, Y. Y., Chenchik, A., Li, R., Hsieh, F. Y. & Siebert, P. D. in Genetic Library Construction and Screening: Advanced Techniques and Applications. (eds. R. C. Bird & B. F. Smith) 69-93 (Springer Berlin Heidelberg, Berlin, Heidelberg; 2002).
    [0246] 18. Tang, D. T. et al. Suppression of artifacts and barcode bias in high-throughput transcriptome analyses utilizing template switching. Nucleic acids research 41, e44 (2013).
    [0247] 19. Kivioja, T. et al. Counting absolute numbers of molecules using unique molecular identifiers. Nat Methods 9, 72-74 (2011).
    [0248] 20. Smith, T., Heger, A. & Sudbery, I. UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy. Genome Res 27, 491-499 (2017).
    [0249] 21. Gimble, F. S. & Wang, J. Substrate recognition and induced DNA distortion by the PI-SceI endonuclease, an enzyme generated by protein splicing. J Mol Biol 263, 163-180 (1996).
    [0250] 22. Argast, G. M., Stephens, K. M., Emond, M. J. & Monnat, R. J., Jr. I-PpoI and I-CreI homing site sequence degeneracy determined by random mutagenesis and sequential in vitro enrichment. J Mol Biol 280, 345-353 (1998).
    [0251] 23. Palazzo, A. F. & Lee, E. S. Non-coding RNA: what is functional and what is junk? Front Genet 6, 2 (2015).
    [0252] 24. Carninci, P. et al. Genome-wide analysis of mammalian promoter architecture and evolution. Nat Genet 38, 626-635 (2006).
    [0253] 25. Haberle, V., Forrest, A. R., Hayashizaki, Y., Carninci, P. & Lenhard, B. CAGEr: precise TSS data retrieval and high-resolution promoterome mining for integrative analyses. Nucleic acids research 43, e51 (2015).
    [0254] 26. Burke, T. W. & Kadonaga, J. T. The downstream core promoter element, DPE, is conserved from Drosophila to humans and is recognized by TAFII60 of Drosophila. Genes Dev 11, 3020-3031 (1997).
    [0255] 27. Zajac, P., Islam, S., Hochgerner, H., Lonnerberg, P. & Linnarsson, S. Base preferences in non-templated nucleotide incorporation by MMLV-derived reverse transcriptases. PLoS One 8, e85270 (2013).
    [0256] 28. Ponjavic, J. et al. Transcriptional and structural impact of TATA-initiation site spacing in mammalian core promoters. Genome Biol 7, R78 (2006).
    [0257] 29. Kutach, A. K. & Kadonaga, J. T. The downstream promoter element DPE appears to be as widely used as the TATA box in Drosophila core promoters. Mol Cell Biol 20, 4754-4764 (2000).
    [0258] 30. Segal, E. et al. A genomic code for nucleosome positioning. Nature 442, 772-778 (2006).
    [0259] 31. Zheng, X. et al. Low-Cell-Number Epigenome Profiling Aids the Study of Lens Aging and Hematopoiesis. Cell Rep 13, 1505-1518 (2015).
    [0260] 32. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nature Methods 9, 357 (2012).
    [0261] 33. Balwierz, P. J. et al. Methods for analyzing deep sequencing expression data: constructing the human and mouse promoterome with deepCAGE data. Genome Biol 10, R79 (2009).
    [0262] 34. Yu, G., Wang, L. G. & He, Q. Y. ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics 31, 2382-2383 (2015).
    [0263] 35. Aken, B. L. et al. The Ensembl gene annotation system. Database (Oxford) 2016 (2016).
    [0264] 36. Wickham, H. Ggplot2: elegant graphics for data analysis. (Springer, New York; 2009).
    [0265] 37. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079 (2009).
    [0266] 38. Lawrence, M., Gentleman, R. & Carey, V. rtracklayer: an R package for interfacing with genome browsers. Bioinformatics 25, 1841-1842 (2009).

    EXAMPLES

    Example 1—Introduction and Development of SLIC-CAGE

    [0267] The inventors have developed SLIC-CAGE, a Super-Low Input Carrier-CAGE approach that is based on cap-trapper technology and can generate unbiased high-complexity libraries from 5-10 ng of total RNA. Thus far the cap-trapper step has been the limiting factor in the reduction of the amount of required starting material. To facilitate the cap-trapper technology on the nanogram scale, representing capped RNA from as low as hundreds of eukaryotic cells, samples of the total RNA of interest are supplemented with novel pre-designed carrier RNAs. Prior to sequencing, the carrier is efficiently removed from the final library using homing endonucleases that target recognition sites embedded within the sequences of the carrier molecules, leaving only the target mRNA library to be amplified and sequenced. The specificity and the long recognition motifs of homing endonucleases ensure that no sample RNA is degraded in the process.

    [0268] The inventors have tested and validated SLIC-CAGE on a wide range of starting material amounts (1-100 ng of total RNA) from Saccharomyces cerevisiae and Mus musculus, using the current nAnT-iCAGE protocol as a gold standard. Additional direct comparison between SLIC-CAGE and the latest nanoCAGE protocol.sup.16 showed that SLIC-CAGE strongly outperforms nanoCAGE in sensitivity, resolution and reproducibility. SLIC-CAGE produced unbiased libraries of higher complexity and quality than nanoCAGE, even when constructed using low total RNA input (5-10 ng compared to 500 ng). Taken together, the inventors demonstrate that SLIC-CAGE enables reliable genome-wide promoter-centric biological discovery and promoter classification using as little as 5-10 ng of total RNA material.

    [0269] In typical CAGE protocols, the cap-trapper step needs at least 5 μg of total RNA.sup.14 and is therefore the limiting factor. This step has been difficult to scale down as it involves the pull-down of biotinylated capped RNA using streptavidin beads. In such situations, a common biochemical approach to prevent sample loss is the use of carriers; i.e. inert non-interfering molecules to minimize sample loss caused by nonspecific adsorption and to improve specificity in affinity purification steps. However, unless there is a way to selectively remove the carrier afterwards, the carrier signal will dominate the sequenced sample and therefore lead to orders of magnitude of reduced sequencing depth of the sample itself.

    [0270] To solve this problem and enable profiling of minute amounts of RNA, the inventors have designed carrier RNA that will be similar in size distribution and the percentage of capped RNA to the cellular RNA, but whose cDNA will be possible to selectively degrade without affecting the cDNA originating from the sample.

    [0271] The inventors constructed the synthetic gene used as a template for run-off in vitro transcription of the carrier RNA (FIGS. 5 and 15 a,b). The synthetic gene is based on the Escherichia coli leucyl-tRNA synthetase sequence for two main reasons. First, the carrier nucleic acid should preferably not map to eukaryotic genomes. Second, leucyl-tRNA synthetase is a housekeeping gene from a mesophilic species and therefore its sequence is not expected to form strong secondary structures that would reduce its translation in vivo, or reduce the efficiency of reverse transcription to form RNA:cDNA hybrids. The carrier was made selectively degradable by embedding in it multiple recognition sites of two homing endonucleases, I-CeuI and I-SceI (FIG. 1a, 5, 15a,b and 25). The combination of alternating recognition sites allows for higher degradation efficiency and reduces sequence repetitiveness. The two enzymes have recognition sites of lengths 27 and 18 nucleotides, respectively, which, even with some degeneracy allowed in the recognition site.sup.21, 22, makes their random occurrence in a transcriptome highly improbable. The two enzymes work at the same temperature and in the same buffer, so their digestion can be combined in a single step. A fraction of the synthesised carrier RNA is capped using Vaccinia Capping System (NEB) and mixed with uncapped carrier to achieve the desired capping percentage.sup.23.

    [0272] The percentage of capped RNAs in the carrier and its size distribution were optimised by performing the entire SLIC-CAGE protocol, starting by adding the synthetic carrier to the low-input sample to achieve a total of 5 μg of RNA material. To assess its performance, the output was compared with the nAnT-iCAGE library derived from 5 μg of total cellular RNA. nAnT-iCAGE was used as a reference as it is currently considered the most unbiased protocol for promoterome mapping.sup.14, and because TSS identification by cap-trapper based technology has been experimentally validated.sup.24. To identify the optimal ratio of capped and uncapped carrier, as well as the length of the carrier RNAs, the following carrier mixes were tested: 1) carriers with lengths distributed between 0.3-1 kb versus homogenous 1 kb length carriers and 2) a mixture of capped and uncapped versus only capped carrier. The SLIC-CAGE protocol was performed as outlined in FIG. 1a, starting with 100 ng of total RNA isolated from S. cerevisiae supplemented with the various carrier mixes up to a total of 5 μg of RNA. We then compared the output with the nAnT-iCAGE library generated using 5 μg of total RNA (FIG. 7B, 8A, 15c-m). Removal of the carrier was performed by two rounds of degradation using homing endonucleases (I-SceI and I-CeuI, FIG. 25) with a purification and a PCR amplification step between the rounds. The presence of the carrier significantly improved the correlation of individual CAGE-supported TSSs (CTSSs) between SLIC-CAGE and the reference nAnT-iCAGE library (FIG. 15c-h). This effect was not observed when either only the capped carrier or no carrier was used (FIG. 15 c,g,h). The highest correlation and reproducibility were achieved by a carrier mix composed of 10% capped and 90% uncapped molecules of 0.3-1 kb length (mix 2, FIGS. 7B and 8A). This mix was designed to closely mimic the composition of cellular total RNA. Other diagnostic criteria shown in FIG. 15 j-m confirm this is the optimal carrier choice.
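
    The arithmetic behind supplementing a low-input sample with carrier can be illustrated as follows. This is a hedged sketch, not part of the described protocol: the function name is hypothetical, and the 10%/90% split is treated here as a mass fraction purely for illustration, whereas the text above describes it in terms of molecules.

```python
def carrier_mix(sample_ng, total_ng=5000.0, capped_fraction=0.10):
    """Amounts of capped and uncapped carrier needed to reach `total_ng`.

    Assumption (for illustration only): the capped/uncapped split is
    applied as a mass fraction of the carrier.
    """
    if sample_ng <= 0 or sample_ng > total_ng:
        raise ValueError("sample_ng must be > 0 and <= total_ng")
    carrier_ng = total_ng - sample_ng
    capped_ng = carrier_ng * capped_fraction
    return {
        "carrier_ng": carrier_ng,
        "capped_ng": capped_ng,
        "uncapped_ng": carrier_ng - capped_ng,
        "carrier_to_sample": carrier_ng / sample_ng,
    }


# 100 ng of sample RNA topped up to 5 ug (5000 ng) of total RNA
mix = carrier_mix(100.0)
print(mix)
```

    With 1 ng of sample the same calculation gives a carrier excess of roughly 5000-fold, which is the regime in which efficient carrier removal matters most.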

    SLIC-CAGE Allows Genome-Wide TSS Identification from Nanogram-Scale Samples

    [0273] To identify the lowest amount of total RNA that can be used to produce high quality CAGE libraries a SLIC-CAGE titration test was performed with 1-100 ng of total S. cerevisiae RNA (FIG. 16a) and compared with an nAnT-iCAGE library derived from 5 μg of total RNA. The high correlation of individual CTSSs between SLIC-CAGE and the reference nAnT-iCAGE library (FIGS. 1b and c, Supplementary FIG. 16a) shows that genuine CTSSs are identified. Moreover, SLIC-CAGE libraries show high reproducibility (FIG. 1d). FIG. 1e shows an example locus in the genome browser, demonstrating the high similarity of SLIC-CAGE and nAnT-iCAGE CTSS profiles in all high-quality datasets (i.e. datasets with high complexity, see below).

    [0274] To confirm the general applicability of the SLIC-CAGE protocol, a similar titration test was performed using total RNA isolated from E14 mouse embryonic stem cells. The results obtained following sequencing of the libraries generated using 5, 10 or 25 ng of total RNA were highly correlated (Pearson correlation 0.9) with the reference nAnT-iCAGE derived library. The correlation did not improve further with increasing total starting RNA (FIG. 1f-h, 16c), again verifying the suitability of the SLIC-CAGE protocol for nanogram-scale samples. The genome browser view (FIG. 1i) confirms the similarity of profiles on the individual CTSS level, although the library prepared from 5 ng of M. musculus total RNA exhibits minor differences due to lower complexity, as discussed in detail in the next section.
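
    A CTSS-level Pearson correlation between two libraries can be computed as sketched below. This is an illustrative sketch under stated assumptions: libraries are represented as dictionaries keyed by a hypothetical (chromosome, position, strand) tuple, CTSSs missing from one library are counted as zero, and no log transformation is applied. The source does not specify these details, so the exact procedure used for the figures may differ.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def ctss_correlation(lib_a, lib_b):
    """Correlate TPM values over the union of CTSSs in two libraries.

    lib_a, lib_b: dict mapping (chrom, pos, strand) -> TPM.
    Assumption: a CTSS absent from one library contributes TPM 0 there.
    """
    keys = sorted(set(lib_a) | set(lib_b))
    x = [lib_a.get(k, 0.0) for k in keys]
    y = [lib_b.get(k, 0.0) for k in keys]
    return pearson(x, y)


# Toy libraries with made-up TPM values
reference = {("chr1", 100, "+"): 10.0, ("chr1", 101, "+"): 2.0}
low_input = {("chr1", 100, "+"): 20.0, ("chr1", 101, "+"): 4.0,
             ("chr1", 250, "-"): 0.5}
r = ctss_correlation(reference, low_input)
```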

    [0275] Analysis of library mapping efficiency demonstrated that selective degradation of the carrier is highly efficient. When only 1 ng of total RNA is used with 5000-fold more carrier (5 μg), 25% of the sequenced reads are uniquely mapped to the target organism, while the rest corresponds to the leftover carrier (27%), short amplified linkers or multimappers, commonly discarded from TSS analyses (FIGS. 10 and 11). This amount of leftover carrier is minor and does not significantly compromise sequencing depth (10% or less when 10 ng of total RNA are used). It is expected that with additional rounds of degradation and purification, the leftover carrier could be further reduced, although with a risk of sample loss, and we found it unnecessary.

    Example 2—Complexity and Resolution of SLIC-CAGE Libraries

    [0276] The complexity and any potential inherent CTSS detection biases of libraries produced using the SLIC-CAGE protocol were assessed. As discussed above, SLIC-CAGE and nAnT-iCAGE libraries are highly correlated at individual CTSSs, and so the spatial clustering of these CTSSs and its features was analysed.

    [0277] CTSSs in close vicinity reflect functionally equivalent transcripts and are generally clustered together and analysed as a single transcriptional unit termed a tag cluster.sup.25. Specificity in capturing genuine TSSs can be assessed by examining the fraction of tag clusters that overlap with expected promoter regions. A high percentage of SLIC-CAGE tag clusters were identified that map to known promoter regions in both S. cerevisiae and M. musculus libraries irrespective of the total starting RNA, thus indicating the high specificity of these libraries (approximately 80%, at the same level as the reference nAnT-iCAGE protocol, FIG. 2a,e and FIG. 16d,f).

    [0278] In addition to determining the number of unique CTSSs and tag clusters (FIG. 12), complexity of CAGE-derived libraries can be assessed by comparing tag cluster widths. To robustly identify tag cluster widths, the interquantile widths (IQ-width) were calculated that span the 10th to the 90th percentile (q0.1-q0.9) of the total tag cluster signal to exclude effects of extreme outlier CTSSs. The distribution of tag cluster IQ-widths serves as a good indicator of library complexity. In low-complexity libraries, sparse CTSS detection will lead to artificially sharp tag clusters. IQ-width distribution of S. cerevisiae SLIC-CAGE tag clusters reveals that complexity of the reference nAnT-iCAGE library is recapitulated using as little as 5 ng of total RNA. This result is substantiated by the number of unique CTSSs, which corresponds to the number identified with nAnT-iCAGE (around 70% overlap between 5 ng SLIC-CAGE and nAnT-iCAGE, and 90% overlap in tag cluster identification). Low complexity with artificially sharper tag clusters is seen only with 1-2 ng of total RNA input (FIGS. 2b, 12 and 17a). A highly similar result is observed with M. musculus SLIC-CAGE libraries, although lower complexity is notable at 5 ng of total RNA (FIGS. 2f and 17b). This is in agreement with the lower number of unique CTSSs identified in the 5 ng M. musculus SLIC-CAGE library compared to nAnT-iCAGE (FIG. 12). It is expected that an increase in sequencing depth would ultimately recapitulate the complexity of the reference dataset, as higher coverage in S. cerevisiae facilitates higher complexity libraries with lower starting amount (5 ng).
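
    The interquantile width described above can be sketched as follows. This is a minimal illustration, assuming that each quantile position is the first CTSS at which the cumulative signal reaches the given fraction of the cluster total and that the width is counted inclusively. The function names are hypothetical and the implementation in the CAGEr package may differ in detail. The threshold of 3 for sharp clusters follows the empirical threshold given in the figure legends.

```python
def interquantile_width(positions, tpm, q_low=0.1, q_high=0.9):
    """Width spanning the q_low-q_high quantiles of a tag cluster's signal.

    positions: genomic coordinates of the CTSSs in one tag cluster.
    tpm: TPM value of each CTSS, in the same order as `positions`.
    """
    pairs = sorted(zip(positions, tpm))
    total = sum(value for _, value in pairs)
    if total <= 0:
        raise ValueError("tag cluster has no signal")
    cumulative = 0.0
    low = high = None
    for pos, value in pairs:
        cumulative += value
        if low is None and cumulative >= q_low * total:
            low = pos
        if cumulative >= q_high * total:
            high = pos
            break
    return high - low + 1


def classify_cluster(width, threshold=3):
    """Sharp if IQ-width <= threshold, otherwise broad."""
    return "sharp" if width <= threshold else "broad"


# Toy cluster: one dominant CTSS flanked by weak outliers
width = interquantile_width([100, 101, 102, 103, 104], [1, 1, 10, 1, 1])
label = classify_cluster(width)
```

    In the toy cluster the weak CTSS at position 100 and the one at 104 fall outside the q0.1-q0.9 span, which is how the interquantile width discounts outlier positions.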

    [0279] SLIC-CAGE derived CTSS features from S. cerevisiae and M. musculus were also assessed and compared with features extracted using the nAnT-iCAGE library as reference. First, nucleotide composition of all SLIC-CAGE-identified CTSSs reveals highly similar results to nAnT-iCAGE independent of the total input RNA (FIGS. 2c,g and 16g,i). Furthermore, the composition of [−1,+1] dinucleotide initiators (where the +1 nucleotide represents the identified CTSS) also showed a highly similar pattern to the reference nAnT-iCAGE dataset (FIG. 2d,h left panel, and 16j,i). SLIC-CAGE libraries identify CA, TA, TG and CG as the most preferred initiators, similar to preferred mammalian initiator sequences.sup.24.

    [0280] Focusing only on the initiation patterns ([−1, +1] dinucleotide) of the dominant TSS of each tag cluster (the CTSS with the highest TPM within the tag cluster) facilitates estimation of the influence of PCR amplification on the distribution of tags within a tag cluster. Highly similar dinucleotide composition of dominant TSS initiators, independent of the amount of total RNA used, confirms that identification of the dominant TSSs is not obscured by PCR amplification (FIG. 2d,h right panel, and 16m,o). The identified preferred initiators are pyrimidine-purine dinucleotides CA, TG, TA (S. cerevisiae) or CA, CG, TG (M. musculus) in accordance with the Inr element (YR).sup.7, 26. These results confirmed the utility of SLIC-CAGE in uncovering authentic transcription initiation patterns such as the well-established CA initiator.
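
    Extraction of the [−1,+1] initiator dinucleotide can be sketched as below. This is an illustrative sketch only: it assumes 0-based forward-strand coordinates for the +1 nucleotide and an in-memory chromosome sequence, and the function names are hypothetical rather than taken from any package used by the inventors.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def initiator_dinucleotide(chrom_seq, tss_pos, strand):
    """Return the [-1,+1] dinucleotide in transcript orientation.

    tss_pos: 0-based forward-strand index of the +1 nucleotide (the CTSS).
    """
    if strand == "+":
        start = tss_pos - 1
        di = chrom_seq[start:tss_pos + 1].upper() if start >= 0 else ""
    elif strand == "-":
        # On the minus strand the -1 position lies at tss_pos + 1
        di = chrom_seq[tss_pos:tss_pos + 2].upper()
        di = di.translate(_COMPLEMENT)[::-1]
    else:
        raise ValueError("strand must be '+' or '-'")
    if len(di) != 2:
        raise ValueError("CTSS too close to the sequence end")
    return di


def initiator_frequencies(dinucleotides):
    """Relative frequency of each initiator dinucleotide."""
    counts = Counter(dinucleotides)
    total = sum(counts.values())
    return {di: n / total for di, n in counts.items()}


plus = initiator_dinucleotide("GGCATTTT", 3, "+")
minus = initiator_dinucleotide("AAATGCC", 3, "-")
freqs = initiator_frequencies([plus, minus, "TG", "TA"])
```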

    [0281] As a final assessment of SLIC-CAGE performance, the expression ratios per individual CTSS common to SLIC-CAGE and the reference nAnT-iCAGE were analysed (FIGS. 18 and 19 left panels) and the ratios presented in a heatmap centred on the dominant CTSS identified by the reference nAnT-iCAGE library. This analysis can uncover any positional biases, if introduced by the SLIC-CAGE protocol. Patterns of signal in heatmaps (grouping upstream or downstream of the nAnT-iCAGE-identified dominant CTSS) would signify positional bias and indicate non-random capturing of authentic TSSs. The positions and expression values of CTSSs identified in the nAnT-iCAGE but absent in SLIC-CAGE libraries were also evaluated (FIGS. 18 and 19, middle panels). No positional biases with regards to SLIC-CAGE-identified CTSSs and their expression values were identified, independent of the total input RNA. As expected, a higher number of CTSSs identified in nAnT-iCAGE were absent from lower complexity S. cerevisiae SLIC-CAGE libraries derived from 1 and 2 ng total RNA (FIGS. 18a and b, middle panels). This was particularly evident in those CTSSs with expression values in the lower two quartiles (top two sections in each heatmap). Further, the CTSSs identified in both low-complexity SLIC-CAGE and nAnT-iCAGE exhibit higher TPM ratios, likely reflecting the effect of PCR amplification. On the other hand, the SLIC-CAGE library derived from 5 ng of total RNA (FIG. 18c) shows similar patterns as libraries derived from greater amounts of RNA (FIG. 18d-h) or the library derived by PCR amplification of the nAnT-iCAGE library (FIG. 18i).
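
    The quantities plotted in the left and middle heatmap panels can be computed per CTSS as sketched below. This is a minimal sketch following the definitions in the figure legends (log10 of the reference TPM divided by the test TPM for shared CTSSs, and signed log10 TPM for CTSSs found in only one library). The dictionary representation and function name are assumptions made for illustration.

```python
import math


def compare_ctss(reference, test):
    """Per-CTSS comparison of a test library against a reference library.

    reference, test: dict mapping a CTSS key -> TPM (> 0 if detected).
    Returns (shared, unique):
      shared[ctss] = log10(reference TPM / test TPM)
      unique[ctss] = +log10(TPM) if only in reference,
                     -log10(TPM) if only in test.
    """
    shared, unique = {}, {}
    for ctss in set(reference) | set(test):
        ref_tpm = reference.get(ctss, 0.0)
        test_tpm = test.get(ctss, 0.0)
        if ref_tpm > 0 and test_tpm > 0:
            shared[ctss] = math.log10(ref_tpm / test_tpm)
        elif ref_tpm > 0:
            unique[ctss] = math.log10(ref_tpm)
        elif test_tpm > 0:
            unique[ctss] = -math.log10(test_tpm)
    return shared, unique


# Toy example with made-up TPM values
ref = {"chr1:100:+": 100.0, "chr1:105:+": 10.0}
tst = {"chr1:100:+": 10.0, "chr1:230:-": 100.0}
shared, unique = compare_ctss(ref, tst)
```

    A shared value above zero means the CTSS is relatively stronger in the reference library, and a systematic shift of such values upstream or downstream of the dominant CTSS would be the positional bias the heatmaps are designed to reveal.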

    [0282] Similar results were obtained when comparing M. musculus SLIC-CAGE libraries with their reference nAnT-iCAGE library (FIG. 19), albeit with a twofold greater minimum starting RNA (10 ng) required for high-complexity libraries. Overall, these results show that SLIC-CAGE increases the sensitivity of the CAGE method 1000-fold over the current “gold standard” nAnT-iCAGE, without decrease in signal quality. This unparalleled sensitivity positions SLIC-CAGE as a method of choice for unbiased identification of TSSs in low-input samples that were previously inaccessible to CAGE methodology.

    Example 3—SLIC-CAGE Generates Superior Quality Libraries Compared to Existing Low Input Methods

    [0283] The currently available method for low input samples, nanoCAGE, requires 50-500 ng of total cellular RNA and is very different from standard CAGE in its selection of capped RNAs.sup.15, 16. Whilst the gold standard verified CAGE protocols, e.g. nAnT-iCAGE, rely on cap-trapper based selection of capped RNA, nanoCAGE uses the template switching property of the reverse transcriptase to selectively introduce a barcoded adapter only onto 5′ ends of capped RNA. The result is hybrid cDNA molecules with a specific nucleotide sequence added to the 5′ end of the capped RNA.

    [0284] nanoCAGE was compared to nAnT-iCAGE. A nanoCAGE titration test was carried out using S. cerevisiae total RNA (5-500 ng) and the obtained libraries were compared with the reference nAnT-iCAGE library. CTSSs identified in nanoCAGE libraries were poorly correlated (Pearson correlation 0.5-0.6) with the nAnT-iCAGE library, irrespective of the total RNA used (FIG. 3a-e and FIG. 16b). Despite reduced similarity with nAnT-iCAGE, nanoCAGE libraries appeared reproducible (FIG. 3f). An example genome browser view also reveals significant differences in CTSS profiles between nanoCAGE and nAnT-iCAGE libraries (FIG. 3g). NanoCAGE systematically failed to capture all CTSSs identified with nAnT-iCAGE. In contrast, the SLIC-CAGE library derived from only 5 ng of total RNA accurately recapitulates the nAnT-iCAGE TSS profile shown in the same genomic region (FIG. 3g as FIG. 1e).

    [0285] The tag clusters identified in each nanoCAGE library were investigated, showing that approximately 85% were indeed in expected promoter regions (FIG. 3h, and FIG. 16e). The cluster overlap is highly similar to the reference nAnT-iCAGE library in all nanoCAGE libraries, independent of the amount of total RNA used. Therefore, nanoCAGE does not capture the full complexity of promoter TSS usage but its specificity for promoters is not diminished.

    [0286] To inspect the complexity of nanoCAGE libraries, the tag cluster IQ-widths were compared with the reference nAnT-iCAGE library (FIG. 3i and FIG. 17c). An increase in the number of sharper tag clusters is observed at 1-50 ng of total input RNA. The IQ-width distributions show that nanoCAGE systematically produces lower-complexity libraries compared to nAnT-iCAGE and SLIC-CAGE. This result agrees well with the consistently lower number of unique CTSSs identified in nanoCAGE libraries compared to nAnT-iCAGE libraries (FIG. 13a).

    [0287] Nucleotide composition of nanoCAGE-identified robust CTSSs revealed a strong preference for G-containing CTSS (FIG. 3j). This is specific to nanoCAGE libraries compared to nAnT-iCAGE and also independent of the total input RNA. This observed G-preference is not an artefact caused by the extra C added complementary to the cap structure at the 5′end of cDNA during reverse transcription, as that is common to all CAGE protocols and corrected in all datasets using the processing step in the Bioconductor package CAGEr.sup.25. Lastly, to check if in nanoCAGE more than one G is added during reverse transcription, the 5′end Gs flagged as a mismatch in the alignment were counted, indicating that the amount of two consecutive mismatches was not significant (FIG. 13B).

    [0288] The composition of [−1,+1] initiator dinucleotides revealed a severe depletion in identified CA and TA initiators, with a corresponding increase in G-containing initiators (TG and CG), in comparison with the reference nAnT-iCAGE dataset (FIG. 3k, left panel, Supplementary FIG. 2k). To assess the most robust CTSSs, the same analysis was repeated using only the dominant CTSSs in each tag cluster (FIG. 3k, right panel, Supplementary FIG. 2n) and the lack of CA and TA initiators was equally apparent. This property of nanoCAGE makes it unsuitable for the determination of dominant CTSSs and details of promoter architecture at base pair resolution.

    [0289] To exclude the effects of CTSSs located in non-promoter regions and to assess if CTSS identification depends on expression levels, tag clusters were divided according to their genomic location or expression values (division into four expression quartiles per library) and the analysis was repeated (FIGS. 20 and 21). Since a similar pattern (depletion of CA and TA initiators) was observed irrespective of the genomic location or expression level, these results suggest that the nanoCAGE bias is a consequence of the template switching property of reverse transcriptase, which is known to be sequence dependent, and is expected to prefer capped RNA that starts with G.sup.27.

    [0290] Signal ratios of individual CTSSs identified in each nanoCAGE library and the reference nAnT-iCAGE were analysed (ratio of TPM values, FIG. 22 left panels), as were CTSSs not identified in nanoCAGE (FIG. 22, middle panels), in the same way as described for SLIC-CAGE (see above). This analysis reveals that there are no position-specific biases in nanoCAGE and that the biases are primarily caused by nucleotide composition of the capped RNA 5′ ends. Further, it accentuates the inability of nanoCAGE to capture dominant CTSSs identified with the reference nAnT-iCAGE, even with higher amounts of starting material, compared to SLIC-CAGE (FIG. 22f-h vs FIG. 18a-h).

    Example 4—Use of SLIC-CAGE in Uncovering Promoter Architecture

    [0291] The dominant CTSS provides a structural reference point for the alignment of promoter sequences and thus facilitates the discovery of promoter-specific sequence features. High-quality data is necessary for the accurate identification of the dominant TSS within a tag cluster or promoter region. Sharp promoters, defined by small IQ-widths, are typically characterised by a fixed distance from a core promoter motif, such as a TATA-box or TATA-like element at the −30 position.sup.28 upstream of the TSS, or by a DPE motif at +28 to +32.sup.29 in Drosophila. Broader promoters, featuring multiple CTSS positions, are enriched for GC content and CpG island overlap.sup.7 in vertebrates. Lower complexity libraries have an increased number of artificially sharp tag clusters (FIG. 2f), due to sparse CTSS identification. Although the identified CTSSs in lower-complexity SLIC-CAGE libraries are canonical, association of sequence features may be obscured by artificially sharp tag clusters. To address this, the promoter architecture for known promoter features in E14 mouse cell lines was investigated using SLIC-CAGE libraries prepared from 5 to 100 ng of total RNA.

    [0292] The presence of a TA dinucleotide around the −30 position was assessed for all TSSs identified by SLIC-CAGE for both 5 and 10 ng of input RNA. The TSSs were ordered by IQ-width of their corresponding tag cluster and extended to include 1 kb of DNA sequence up- and downstream. The TA frequency is depicted in a heatmap in FIG. 4a for promoters ordered from sharp to broad for 10 ng of RNA and clearly recapitulates the patterns visible in the reference nAnT-iCAGE library. As expected, the sharpest tag clusters in libraries produced from 5 ng of total RNA have a weaker TA signal (FIG. 24a, TA heatmap), as these are likely artificially sharp and not the canonical sharp promoters. A similar result is observed for enrichment of the canonical TATA-box element: the 10 ng library recapitulates the reference nAnT-iCAGE library, whereas the 5 ng library shows a weaker enrichment (FIG. 4b and FIG. 23c,d).

    [0293] A GC-enrichment in the region between the dominant TSS and 250 nucleotides downstream of it indicates positioning of the +1 nucleosomes and is expected to be highly localized in broad promoters. This feature is again recapitulated by the 10 ng RNA input library (FIG. 4c, FIG. 24).sup.3, 5, 7. Furthermore, rotational positioning of the +1 nucleosomes is associated with WW periodicity (AA/AT/TA/TT dinucleotides) lined up with the dominant TSS. WW dinucleotide density was examined separately for sharp and broad promoters identified by SLIC-CAGE and the reference nAnT-iCAGE library (FIG. 4d). A strong 10.5-nucleotide periodicity of WW dinucleotides downstream of the dominant TSS was observed in SLIC-CAGE libraries derived from 10 ng of M. musculus total RNA and corresponded to the phasing observed with the reference nAnT-iCAGE library (FIG. 4d and FIG. 23b). This can only be observed across promoters if the dominant TSS is accurately identified and therefore it reflects the quality of the libraries. To confirm that WW dinucleotide periodicity reflects +1 nucleosome positioning in broad promoters, we assessed H3K4me3 data downloaded from ENCODE (FIGS. 4f and g, FIG. 24 H3K4me3 heatmap). H3K4me3 subtracted coverage reflects the well-positioned +1 nucleosome in broad promoters (FIG. 4f,g) and localizes with the WW periodicity specific for broad promoters (FIG. 4d). These results are in agreement with previously identified nucleosome positioning preferences.sup.30.

    [0294] As a final validation of SLIC-CAGE promoters, CpG island density was assessed separately in sharp and broad promoters (FIG. 4h,l, FIG. 24). A higher density of CpG islands was observed in SLIC-CAGE broad promoters, which corresponds to nAnT-iCAGE broad promoters and agrees with the expected association of broad promoters and CpG islands.sup.7, 24. These results demonstrate the utility of SLIC-CAGE libraries derived from nanogram-scale samples in promoter architecture discovery, alongside the gold standard nAnT-iCAGE libraries.

    Example 5—Discussion

    [0295] The inventors have developed SLIC-CAGE, an unbiased cap-trapper based CAGE protocol optimized for promoterome discovery from as little as 5-10 ng of isolated total RNA (approximately 10.sup.3 cells). SLIC-CAGE libraries are of equivalent quality and complexity as nAnT-iCAGE libraries derived from 500-1000-fold more material (5 μg of total RNA, approximately 10.sup.6 cells). SLIC-CAGE extends the nAnT-iCAGE protocol through addition of the degradable carrier to the target RNA material of limited availability. Since the best CAGE protocol is not amenable to downscaling, the idea behind the carrier is to increase the amount of material to permit highly specific cap-trapper based purification of target RNA polymerase II transcripts and to minimize material loss in many protocol steps.

    [0296] The carrier was designed to have a similar size distribution and fraction of capped molecules as the total cellular RNA, to effectively saturate non-specific adsorption sites on all surfaces and matrices used throughout the protocol. In the final stage of SLIC-CAGE, the carrier molecules are selectively degraded using homing endonucleases, while the intact target library is amplified and sequenced.

    [0297] SLIC-CAGE has been shown herein to have superior sensitivity, resolution and absence of bias compared to the only other low-input CAGE technology, nanoCAGE, which relies on template switching during cDNA synthesis.sup.15. Although the amount of starting material is significantly reduced, the lowest input limit for nanoCAGE is 50 ng of total RNA, which may require up to 30 PCR cycles.sup.16. The performances of SLIC-CAGE and nanoCAGE were directly compared in titration tests, which demonstrated that: 1) higher complexity libraries are achieved with significantly lower input: SLIC-CAGE requires 5-10 ng, while nanoCAGE requires 50 ng of total RNA; 2) nanoCAGE does not recapitulate the complexity of the nAnT-iCAGE libraries even with the highest recommended amount of RNA (500 ng), whereas SLIC-CAGE captures the full complexity when 5-10 ng are used; 3) nanoCAGE preferentially captures G-starting capped mRNAs, while SLIC-CAGE does not have 5′mRNA nucleotide-dependent biases; 4) biases in nanoCAGE libraries are independent of the total RNA amount used, and inherent to the template switching step.

    [0298] Importantly, because the carrier minimizes loss of the target sample, the SLIC-CAGE protocol requires fewer PCR amplification cycles: 15-18 cycles for 10-1 ng of total RNA as input. This is advantageous as a smaller number of PCR cycles reduces amplification biases and the fraction of observed duplicate reads. Although nanoCAGE takes advantage of unique molecular identifiers (UMIs) to remove PCR duplicates, in our experience, synthesis of truly random UMIs is problematic and subject to variability, which limits their usefulness.

    [0299] A different carrier-based approach has recently been applied to down-scale chromatin-precipitation based methods: favoured amplification recovery via protection ChIP-seq (FARP-ChIP-seq).sup.31. FARP-ChIP-seq relies on a designed biotinylated synthetic DNA carrier, mixed with chromatin of interest prior to ChIP-seq library preparation. Amplification of the synthetic DNA carrier is prevented using specific blocker oligonucleotides. These blocker oligonucleotides: 1) are complementary to the biotin-DNA carrier; 2) carry phosphorothioate modification of the first three nucleotides at the 5′end for resistance to exonuclease activity of the polymerase; 3) carry a 3′end 3-carbon spacer to inhibit extension by PCR. The blocker strategy can achieve a 99% reduction in amplification of the biotin-DNA, which, if applied instead of our degradable carrier, would leave much more carrier to sequence (starting SLIC-CAGE with 1 ng of total RNA and 5 μg of the carrier, the carrier accounts for 27% of the final library, which is more than a 10000-fold reduction, FIG. 11). Although more costly and time consuming, the blocker approach could be combined with selective degradation of the SLIC-CAGE carrier when near-complete removal of the carrier is required, thereby increasing the effective sequencing depth and allowing higher complexity libraries from even lower input amounts.
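
    The fold-reduction figure quoted above can be checked with simple arithmetic. The sketch below assumes that the 27% figure is the carrier's share of the final library, that the reduction is measured as the change in the carrier-to-target ratio, and that a 99% reduction in carrier amplification corresponds to a 100-fold drop in that ratio. These are interpretations rather than statements from the protocol, and the function names are illustrative.

```python
def carrier_fold_reduction(target_ng, carrier_ng, final_carrier_fraction):
    """Fold change in the carrier:target ratio between input and final library."""
    initial_ratio = carrier_ng / target_ng
    final_ratio = final_carrier_fraction / (1.0 - final_carrier_fraction)
    return initial_ratio / final_ratio


def carrier_fraction_after(target_ng, carrier_ng, fold_reduction):
    """Carrier share of the library if the carrier:target ratio drops by fold_reduction."""
    ratio = (carrier_ng / target_ng) / fold_reduction
    return ratio / (1.0 + ratio)


# 1 ng of total RNA mixed with 5 ug (5000 ng) of carrier, 27% carrier at the end.
fold = carrier_fold_reduction(1.0, 5000.0, 0.27)
# Assumed equivalent of a 99% reduction: a 100-fold lower carrier:target ratio.
blocker_only = carrier_fraction_after(1.0, 5000.0, 100.0)

print(f"fold reduction with degradable carrier: {fold:.0f}")
print(f"carrier share with a 100-fold reduction only: {blocker_only:.1%}")
```

    Under these assumptions the degradable carrier gives a reduction of well over 10000-fold, whereas a 100-fold reduction alone would leave a library that is still almost entirely carrier.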

    [0300] SLIC-CAGE is expected to prove invaluable for in-depth and high-resolution promoter analysis of rare cell types, including early embryonic developmental stages or embryonic tissue from a wide range of model organisms, which have so far been inaccessible to the method. With its low material requirement (5-10 ng of total RNA), SLIC-CAGE can also be applied to isolated nascent RNA, to provide an unbiased promoterome with high positional and temporal resolution. Lastly, as bidirectional capped RNA is a signature feature of active enhancers.sup.4, deeply sequenced SLIC-CAGE libraries can be used to identify active enhancers in rare cell types. The principle of the degradable carrier can also easily be extended to other protocols where the required amount of RNA or DNA is limiting.

    Example 6—Methods

    Preparation of the Carrier RNA Molecules

    [0301] DNA template (1 kb length) for preparation of the carrier by in vitro transcription was synthesized and cloned into pJ241 plasmid (service by DNA 2.0, FIGS. 5 and 15) to produce the carrier plasmid. The template encompasses the gene that serves as the carrier, embedded with restriction sites for I-SceI and I-CeuI to allow degradation in the final steps of the library preparation. The templates for in vitro transcription were prepared by PCR amplification using the unique forward primer (PCR_GN5_f1, FIG. 6) which introduces the T7 promoter followed by five random nucleotides, and the reverse primer which determines the total length of the carrier template and introduces six random nucleotides at the 3′end (FIG. 6 PCR_N6_r1-r10).

    [0302] The PCR reaction to produce the carrier templates was composed of 0.2 ng μl.sup.-1 carrier plasmid, 1 μM primers (each), 0.02 U μl.sup.-1 Phusion High-Fidelity DNA Polymerase (Thermo Fisher Scientific) and 0.2 mM dNTP in 1×Phusion HF Buffer (final concentrations). The cycling conditions are presented in FIG. 7. Produced carrier templates (lengths 1034-386 nucleotides) were gel-purified to remove non-specific products.

    [0303] Carrier RNA was in vitro transcribed using HiScribe™ T7 High Yield RNA Synthesis Kit (NEB) according to manufacturer's instructions, and purified using RNeasy Mini kit (Qiagen). A portion of carrier RNAs was capped using Vaccinia Capping System and purified using RNeasy Mini kit (Qiagen). The capping efficiency was estimated using RNA 5′Polyphosphatase and Terminator™ 5′-Phosphate-Dependent Exonuclease, as only uncapped RNAs are dephosphorylated and degraded, while capped RNAs are protected. Several carrier combinations were tested in SLIC-CAGE (FIGS. 7B and 8A) and the final carrier used in SLIC-CAGE comprised 90% uncapped carrier and 10% capped carrier, both of varying length (FIG. 8A).

    SLIC-CAGE Library Preparation

    [0304] For the standard cap analysis of gene expression, the latest nAnT-iCAGE protocol was followed.sup.14. In the SLIC-CAGE variant, the carrier was mixed with the RNA of interest, to the total amount of 5 μg, e.g. 10 ng of RNA of interest were mixed with 4990 ng of carrier mix and subjected to reverse transcription as in the nAnT-iCAGE protocol.sup.14. Further library preparation steps were followed as described in Murata et al 2014.sup.14 with several exceptions: 1) samples were pooled only prior to sequencing, to allow individual quality control steps; 2) samples were never completely dried using the centrifugal concentrator and then redissolved as in nAnT-iCAGE, instead the leftover volume was monitored to avoid complete drying and adjusted with water to achieve the required volume; 3) After the final AMPure purification in the nAnT-iCAGE protocol, each sample was concentrated using the centrifugal concentrator, and its volume adjusted to 15 μl, out of which 1 μl was used for quality control on the Agilent Bioanalyzer HS DNA chip.
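
    The input amounts described above can be tabulated with a short calculation. The sketch below fixes the total at 5 μg and applies the 90% uncapped and 10% capped carrier composition given for the final carrier. Treating that split as a split by mass is an assumption, and the function name and dictionary keys are illustrative.

```python
def carrier_mix(rna_ng, total_ng=5000.0, capped_fraction=0.10):
    """Split a fixed total input (ng) into RNA of interest, capped and uncapped carrier."""
    if not 0 < rna_ng <= total_ng:
        raise ValueError("rna_ng must be positive and no larger than total_ng")
    carrier_ng = total_ng - rna_ng
    # Assumption: the 90%/10% composition is applied by mass.
    capped_ng = carrier_ng * capped_fraction
    return {
        "rna_ng": rna_ng,
        "carrier_ng": carrier_ng,
        "capped_carrier_ng": capped_ng,
        "uncapped_carrier_ng": carrier_ng - capped_ng,
    }


for rna in (100, 10, 5, 1):
    mix = carrier_mix(rna)
    print(rna, round(mix["carrier_ng"], 1), round(mix["capped_carrier_ng"], 1))
```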

    [0305] Steps regarding degradation of the carrier in SLIC-CAGE libraries are schematically presented in FIG. 25.

    [0306] To degrade the carrier, 14 μl of sample was mixed with I-SceI (5 U) and I-CeuI (5 U) in 1×CutSmart buffer (NEB) and incubated at 37° C. for 3 h. The enzymes were heat inactivated at 65° C. for 20 min, and the samples purified using AMPure XP beads (1.8×AMPure XP volume per reaction volume, as described in Murata et al.sup.14). The libraries were eluted in 42 μl of water and concentrated to 20 μl using the centrifugal concentrator. A qPCR control was then performed to determine the suitable number of PCR cycles for library amplification and to assess the amount of leftover carrier. The primers designed to amplify the whole library are complementary to the 5′ and 3′ linker regions, while the primers used to selectively amplify just the carrier are complementary to the 5′end of the carrier (common to all carrier molecules) and the 3′linker (common to all molecules in the library, see Supplementary Table 6 for primer sequences). qPCR reactions were performed with the KAPA SYBR FAST qPCR kit using 1 μl of the sample and 0.1 μM primers (final concentration) in 10 μl total volume, with the PCR cycle conditions presented in FIG. 8C.

    [0307] The number of cycles for PCR amplification of the library corresponded to the Ct value obtained with the primers that amplify the whole library (adapter_f1 and adapter_r1, FIG. 8B). PCR amplification of the library was then performed using KAPA HiFi HS ReadyMix, with 0.1 μM primers (adapter_f1 and adapter_r1, FIG. 8B) and 18 μl of sample in a total volume of 100 μl. The cycling programme is presented in FIG. 8D and the final number of cycles used to amplify the libraries is given in FIG. 9. Amplified samples were purified using AMPure XP beads (1.8×volume ratio of the beads to the sample), eluted with 42 μl of water and concentrated to 14 μl using the centrifugal concentrator.

    [0308] A second round of carrier degradation was then performed as described for the first round. The samples were purified using AMPure XP beads (stringent 1:1 AMPure XP to sample volume ratio to exclude primer dimers and short fragments), eluted with 42 μl of water and concentrated to 12 μl using centrifugal concentrator. The combination of 1st round of carrier degradation followed by PCR amplification, AMPure XP purification and the 2nd round of carrier degradation is necessary to avoid substantial sample loss that leads to low-complexity libraries.

    [0309] Each sample was then individually assessed for fragment size distribution using an HS DNA chip (Bioanalyzer, Agilent). If short fragments were present in the library (<300 nucleotides, see Supplement FIG. 12), another round of size selection was performed using a stringent volume ratio of AMPure XP beads to the sample of 0.8× (the volume of each sample was adjusted with water to 30 μl prior to purification). The samples were eluted in 42 μl of water and concentrated to 12 μl using the centrifugal concentrator. Fragment size distribution was again checked using an HS DNA chip (Bioanalyzer, Agilent), to ensure removal of the short fragments.

    [0310] Finally, the amount of leftover carrier was estimated using qPCR, as described above for the step after the 1st round of carrier degradation. When the starting total RNA amount is 100-1 ng, the expected Ct is 12-13 using the adapter_f1 and adapter_r1 primer pair and 23-30 using the carrier_f1 and adapter_r1 primer pair (FIG. 8B).

    [0311] The libraries were sequenced on MiSeq (S. cerevisiae) or HiSeq2500 (M. musculus) Illumina platforms in single-end 50 base-pair mode (Genomics Facility, MRC, LMS).

    NanoCAGE Library Preparation

    [0312] S. cerevisiae nanoCAGE libraries were prepared as described in the latest protocol version by Poulain et al 2017.sup.16. Briefly, 5, 10, 25, 50 or 500 ng of S. cerevisiae total RNA was reverse transcribed in the presence of the corresponding template switching oligonucleotides (FIG. 14), followed by AMPure purification. One 500 ng replicate was pre-treated with exonuclease to test whether rRNA removal has any effect on the quality of the final library.

    [0313] The number of PCR-cycles for semi-suppressive PCR was determined by qPCR as described in Poulain et al 2017 (FIG. 9). Samples were AMPure purified after amplification, and the concentration of each sample determined using Picogreen.

    [0314] 2 ng of each sample were pooled prior to tagmentation and 0.5 ng of the pool was used in tagmentation. The sample was AMPure purified and quantified using Picogreen prior to MiSeq sequencing in single-end 50 base-pair mode (Genomics Facility, MRC, LMS).

    Processing of CAGE Tags: nAnT-iCAGE, SLIC-CAGE or nanoCAGE

    [0315] Sequenced CAGE tags (50 nucleotides) were mapped to a reference S. cerevisiae genome (sacCer3 assembly) or M. musculus genome (mm10 assembly) using Bowtie2.sup.32 with default parameters that allow zero mismatches per seed sequence (default 20 nucleotides). Sequenced nanoCAGE libraries were trimmed prior to mapping to remove the linker and UMI region (15 nucleotides from the 5′end were trimmed).
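
    The trimming step for nanoCAGE reads can be illustrated as follows. This is a minimal sketch that removes a fixed number of bases (15 by default, as stated above) from the 5′end of each read in uncompressed four-line FASTQ input. The function name is illustrative and the tool actually used for trimming is not specified here.

```python
def trim_fastq(lines, n_trim=15):
    """Yield (header, sequence, separator, quality) with n_trim bases cut from the 5' end."""
    records = [line.rstrip("\n") for line in lines]
    if len(records) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per read")
    for i in range(0, len(records), 4):
        header, seq, sep, qual = records[i:i + 4]
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        # Sequence and quality strings are trimmed together so they stay aligned.
        yield header, seq[n_trim:], sep, qual[n_trim:]


# Hypothetical read: 15 linker/UMI bases followed by 10 bases of insert.
example = ["@read1", "N" * 15 + "ACGTACGTAC", "+", "I" * 25]
for record in trim_fastq(example):
    print("\n".join(record))
```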

    [0316] Only uniquely mapped reads were used in downstream analysis within the R graphical and statistical computing environment (http://www.R-project.org/) using Bioconductor packages (http://www.bioconductor.org/) and custom scripts. The mapped reads were sorted and imported into R as bam files using CAGEr.sup.25. The additional G nucleotide at the 5′end of the reads, if added through template-free activity of the reverse transcriptase, was resolved within CAGEr's standard workflow designed to remove G's that do not map to the genome: 1) if the first nucleotide is G and a mismatch, i.e. it does not map to the genome, it is removed from the read; 2) if the first nucleotide is G and it matches, it is retained or removed according to the percentage of mismatched G.
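
    The two rules for the additional 5′ G can be sketched per read as shown below. This is a simplified illustration and not CAGEr's implementation: how the percentage of mismatched G enters rule 2 is not detailed here, so the sketch assumes it is supplied as a per-read removal probability (`removal_probability`, a hypothetical parameter). The function name is also illustrative.

```python
import random


def correct_leading_g(read, genome_base, removal_probability=0.0, rng=None):
    """Return the read after applying the two leading-G rules.

    read: read sequence, 5' to 3'.
    genome_base: reference base aligned to the first base of the read.
    removal_probability: ASSUMED chance of removing a leading G that matches
        the genome (stand-in for "the percentage of mismatched G").
    """
    if rng is None:
        rng = random.Random(0)
    if not read or read[0].upper() != "G":
        return read  # no leading G, nothing to correct
    if genome_base.upper() != "G":
        return read[1:]  # rule 1: mismatching G is removed
    if rng.random() < removal_probability:
        return read[1:]  # rule 2 (assumed probabilistic): matching G removed
    return read  # rule 2: matching G retained


print(correct_leading_g("GACGT", "A"))
print(correct_leading_g("GACGT", "G", removal_probability=0.0))
```

    When a leading G is removed, the 5′ position assigned to the read would also move one nucleotide downstream; that bookkeeping is omitted from the sketch.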

    [0317] All unique 5′ends represent CAGE tag-supported TSSs (CTSSs), and the number of tags within each CTSS represents expression levels. Raw tag counts were normalized using a referent power-law distribution to a total of 10.sup.6 tags, resulting in normalized tags per million (TPMs).sup.33.

    Clustering of CTSSs into Tag Clusters

    [0318] CTSSs that pass the threshold of 1 TPM in at least one of the samples were clustered using a distance-based method implemented in the CAGEr package with a maximum allowed distance of 20 nucleotides between the neighbouring CTSS.
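
    Distance-based clustering of this kind can be sketched as follows. The example groups CTSSs on the same chromosome and strand and starts a new tag cluster whenever the gap to the previous CTSS exceeds the maximum allowed distance (20 nucleotides by default). It assumes the input has already been filtered by the TPM threshold, uses an illustrative tuple layout, and is not the CAGEr implementation.

```python
from collections import defaultdict


def cluster_ctss(ctss, max_dist=20):
    """Group CTSSs into tag clusters by distance.

    ctss: iterable of (chrom, strand, pos, tpm) tuples, already filtered.
    Returns a list of (chrom, strand, [(pos, tpm), ...]) tuples.
    """
    groups = defaultdict(list)
    for chrom, strand, pos, tpm in ctss:
        groups[(chrom, strand)].append((pos, tpm))

    clusters = []
    for (chrom, strand), sites in sorted(groups.items()):
        sites.sort()
        current = [sites[0]]
        for site in sites[1:]:
            if site[0] - current[-1][0] <= max_dist:
                current.append(site)  # close enough: same tag cluster
            else:
                clusters.append((chrom, strand, current))
                current = [site]  # gap too large: start a new cluster
        clusters.append((chrom, strand, current))
    return clusters


example = [("chr1", "+", 100, 5.0), ("chr1", "+", 120, 2.0), ("chr1", "+", 141, 1.5)]
for chrom, strand, sites in cluster_ctss(example):
    print(chrom, strand, [pos for pos, _ in sites])
```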

    [0319] For each tag cluster, a cumulative distribution of signal was calculated and the boundaries of the tag cluster calculated using the 10th and 90th percentile of its signal. The distance between these boundaries represents the interquantile width of a tag cluster.
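
    The interquantile width calculation can be sketched as below. For one tag cluster, the cumulative signal is accumulated along the sorted positions and the first positions at which it reaches the 10th and 90th percentiles are taken as the boundaries. The width is reported here as the plain distance between the two boundaries; some implementations count both boundary positions and therefore report one more, so the convention is an assumption, as is the function name.

```python
def interquantile_bounds(sites, lower=0.1, upper=0.9):
    """Return (lower_pos, upper_pos, width) for one tag cluster.

    sites: list of (pos, tpm) pairs belonging to the cluster.
    """
    sites = sorted(sites)
    total = sum(tpm for _, tpm in sites)
    if not sites or total <= 0:
        raise ValueError("cluster must contain positive signal")

    cumulative = 0.0
    lower_pos = upper_pos = None
    for pos, tpm in sites:
        cumulative += tpm
        fraction = cumulative / total
        if lower_pos is None and fraction >= lower:
            lower_pos = pos
        if fraction >= upper:
            upper_pos = pos
            break
    if upper_pos is None:  # guard against floating-point shortfall at the end
        upper_pos = sites[-1][0]
    # Assumption: width is the distance between boundaries (not inclusive count).
    return lower_pos, upper_pos, upper_pos - lower_pos


print(interquantile_bounds([(100, 0.5), (101, 8.0), (105, 1.5)]))
```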

    Genomic Locations of Tag Clusters

    [0320] Tag clusters were annotated with their corresponding genomic locations using the ChIPseeker package.sup.34. In S. cerevisiae libraries, promoters were defined as 1 kb windows centred on Ensembl.sup.35 annotated transcription start sites (annotations imported from SGD) and in M. musculus libraries, promoters were defined as <=1 kb or 1-3 kb from the UCSC annotated transcription start site.

    Nucleotide and Dinucleotide Composition of CTSSs

    [0321] CTSSs from each library were filtered prior to analysis to include only CTSS with at least 1 TPM. In each library the number of A, C, G or T-containing CTSS was counted, divided by the total number of filtered CTSSs and converted to a percentage. The same analysis was performed using only dominant TSS (identified using the CAGEr package as a CTSS with highest expression within a tag cluster).

    [0322] For dinucleotide analysis, identified filtered CTSSs were extended to include one upstream nucleotide ([−1, +1] dinucleotides where +1 represents the identified CTSS) and the same analysis as described above repeated for 16 possible dinucleotides.
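
    The nucleotide and dinucleotide counting described above can be sketched as follows. CTSSs below the TPM threshold are dropped, the base at each remaining CTSS (optionally extended by one upstream base to give the [−1, +1] dinucleotide) is counted, and counts are converted to percentages. The sketch handles plus-strand CTSSs with 0-based coordinates only and uses a toy genome dictionary; minus-strand handling and the input layout are simplifications.

```python
from collections import Counter


def initiator_percentages(ctss, genome, min_tpm=1.0, upstream=0):
    """Percentage of CTSSs starting with each nucleotide or dinucleotide.

    ctss: iterable of (chrom, pos, tpm), plus strand, 0-based positions.
    genome: dict mapping chromosome name to sequence string.
    upstream: 0 for single nucleotides, 1 for [-1, +1] dinucleotides.
    """
    counts = Counter()
    for chrom, pos, tpm in ctss:
        if tpm < min_tpm:
            continue  # filter: keep only CTSSs with at least min_tpm
        start = pos - upstream
        if start < 0:
            continue  # no upstream base available at the chromosome start
        counts[genome[chrom][start:pos + 1].upper()] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: 100.0 * value / total for key, value in counts.items()}


genome = {"chr1": "ACGTACGA"}  # toy sequence for illustration
ctss = [("chr1", 0, 5.0), ("chr1", 1, 2.0), ("chr1", 4, 3.0), ("chr1", 2, 0.5)]
print(initiator_percentages(ctss, genome))
print(initiator_percentages(ctss, genome, upstream=1))
```

    Restricting the input to the dominant CTSS of each tag cluster gives the corresponding analysis for dominant TSSs.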

    Dinucleotide Pattern Analysis in M. musculus Libraries

    [0323] Heatmaps Bioconductor package (Perry M (18). heatmaps: Flexible Heatmaps for Functional Genomics and Sequence Features. R package version 1.2.0) was used to visualize dinucleotide patterns (TA and GC) across sequences centred on the dominant TSS. Sequences were ordered by interquantile width of the containing tag cluster, with the sharpest on top and broadest tag cluster on the bottom of the heatmap. Raw data with the exact matching for TA or GC was smoothed prior to plotting using kernel smoothing within the heatmaps package. Each heatmap was divided into two sections based on tag cluster's IQ-widths. Empirical boundary (Supplementary FIG. 9a) was set to separate sharp (IQ-width <=3 nucleotides) and broad (IQ-width >3) tag clusters identified in M. musculus libraries. The horizontal line/boundary was implemented using heatmaps options to partition heatmaps/rows of an image.
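
    The raw input to such a heatmap can be sketched as a binary matrix. The example below orders tag clusters from sharpest to broadest, marks exact dinucleotide matches along each sequence, and reports the row at which the sharp/broad boundary (IQ-width <=3) falls. Kernel smoothing and plotting are omitted, and the dictionary keys and function name are illustrative rather than part of the heatmaps package.

```python
def dinucleotide_rows(clusters, dinucleotide="TA", sharp_max=3):
    """Build ordered 0/1 rows of exact dinucleotide matches.

    clusters: list of dicts with 'iq_width' and 'sequence' (equal-length
        sequences centred on the dominant TSS).
    Returns (rows, n_sharp): rows are ordered sharpest first and n_sharp is
    the number of rows above the sharp/broad boundary.
    """
    ordered = sorted(clusters, key=lambda c: c["iq_width"])
    rows = []
    for cluster in ordered:
        seq = cluster["sequence"].upper()
        rows.append([1 if seq[i:i + 2] == dinucleotide else 0
                     for i in range(len(seq) - 1)])
    n_sharp = sum(1 for c in ordered if c["iq_width"] <= sharp_max)
    return rows, n_sharp


clusters = [
    {"iq_width": 10, "sequence": "GCGCTA"},
    {"iq_width": 2, "sequence": "TATAAG"},
]
rows, n_sharp = dinucleotide_rows(clusters)
print(rows, n_sharp)
```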

    TATA-Box Motif Analysis in M. musculus Libraries

    [0324] SeqPattern package was used to scan the sequences for the occurrence of the TATA-box motif using a threshold of 80th percentile match to the TATA-box PWM (imported from the seqpattern package). We further smoothed the obtained results using the kernel smoothing (heatmaps package) and plotted the results with sequences ordered by interquantile width of the containing tag cluster (sharpest on top and broadest on the bottom of the heatmap) and centred on the dominant TSS. The horizontal line in each heatmap represents the empirical boundary that separates sharp (IQ-width <=3) and broad tag clusters (IQ-width >3).

    [0325] TATA-box metaplots (average signal/profile) were produced separately for sharp and broad tag clusters (see definition above). Seqpattern was used for scanning sequences using TATA-box PWM to identify 80% matches. The results were converted to the average signal using the heatmaps package with a 2-nucleotide bin size. The final data was plotted using the ggplot2 package.sup.36.

    Nucleosome Positioning Signal in M. musculus Libraries—WW Periodicity

    [0326] WW dinucleotide (AA/AT/TA/TT) occurrence (average relative signal) was obtained using the heatmaps package separately for sharp and broad tag clusters (see definition above). A 2-nucleotide bin size was used and the sequences were centred on the dominant TSS. As a control for the importance of centring the sequences on the dominant TSS, WW dinucleotide (AA/AT/TA/TT) occurrence was obtained as an average relative signal from sequences where each sequence is centred on a randomly chosen CTSS within a tag cluster. The final data was plotted using the ggplot2 package.sup.36.
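
    The WW metaplot can be approximated with the sketch below, which computes the fraction of sequences carrying a WW dinucleotide at each position and then averages that fraction in 2-nucleotide bins. It reports a plain occurrence frequency; any further normalization implied by "relative signal" is not reproduced, and the function name is illustrative.

```python
WW = {"AA", "AT", "TA", "TT"}


def ww_profile(seqs, bin_size=2):
    """Mean WW dinucleotide occurrence per bin across equal-length sequences.

    seqs: sequences centred on the same reference point (for example the
        dominant TSS), all of the same length.
    """
    if not seqs:
        raise ValueError("at least one sequence is required")
    length = len(seqs[0])
    n_pos = length - 1  # number of dinucleotide start positions
    per_pos = [0.0] * n_pos
    for seq in seqs:
        if len(seq) != length:
            raise ValueError("all sequences must have the same length")
        seq = seq.upper()
        for i in range(n_pos):
            if seq[i:i + 2] in WW:
                per_pos[i] += 1
    per_pos = [count / len(seqs) for count in per_pos]
    # Average the per-position frequency within consecutive bins.
    profile = []
    for i in range(0, n_pos, bin_size):
        window = per_pos[i:i + bin_size]
        profile.append(sum(window) / len(window))
    return profile


print(ww_profile(["AATTC", "GGTAC"]))
```

    Running the same function on sequences centred on a randomly chosen CTSS within each tag cluster gives the control profile described above.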

    H3K4Me3 Signal Around M. musculus Tag Clusters

    [0327] H3K4me3 data for E14 cell line, mapped to mm10, was downloaded from ENCODE experiment ENCSR000CGO. Bam files for two replicates (accession numbers ENCFF997CAQ and ENCFF425ZMWO) were merged using samtools.sup.37 and the merged bam file was imported to R environment using the rtracklayer package.sup.38.

    [0328] H3K4me3 coverage was calculated separately for reads mapping to minus or plus strand and minus strand reads subtracted from plus strand reads to get the subtracted H3K4me3 coverage.
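
    The subtracted coverage can be sketched as follows. Per-base coverage is computed separately for plus- and minus-strand reads over a window, and the minus-strand coverage is subtracted from the plus-strand coverage. Read intervals are given as 0-based half-open coordinates relative to the window; that layout and the function names are assumptions for illustration.

```python
def strand_coverage(reads, length):
    """Per-base coverage for plus and minus strand reads within a window.

    reads: iterable of (start, end, strand), 0-based half-open coordinates.
    length: window length in nucleotides.
    """
    plus = [0] * length
    minus = [0] * length
    for start, end, strand in reads:
        target = plus if strand == "+" else minus
        # Clip each read to the window before counting.
        for i in range(max(start, 0), min(end, length)):
            target[i] += 1
    return plus, minus


def subtracted_coverage(reads, length):
    """Plus-strand coverage minus minus-strand coverage at each position."""
    plus, minus = strand_coverage(reads, length)
    return [p - m for p, m in zip(plus, minus)]


reads = [(0, 3, "+"), (2, 5, "-"), (1, 4, "+")]
print(subtracted_coverage(reads, 6))
```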

    [0329] Subtracted H3K4me3 coverage was visualized using heatmaps package centred on the dominant TSSs with sequences ordered by IQ-width of the containing tag clusters (sharpest on top, and broadest at the bottom of the heatmap). Each heatmap was divided into two sections based on tag cluster's IQ-widths as described above.

    [0330] H3K4me3 coverage metaplots were produced separately for sharp and broad tag clusters (see definition above, only strongly supported dominant CTSSs with at least 5 TPM were used) using heatmaps package with a 3-nucleotide bin size. The final data was plotted using the ggplot2 package.sup.36.

    M. musculus Tag Cluster Overlap with CpG Islands

    [0331] The CpG island track for mm10 was downloaded from the UCSC Genome Browser. Overlap with M. musculus tag clusters was visualized as a coverage heatmap using heatmaps package, centred on the dominant TSS with sequences ordered by IQ-width of the containing tag clusters (sharpest on top, and broadest at the bottom of the heatmap). Each heatmap was divided into two sections based on tag cluster's IQ-widths as described above.

    [0332] CpG coverage metaplots were produced separately for sharp and broad tag clusters (see definition above) using heatmaps package with a 3-nucleotide bin size. The final data was plotted using the ggplot2 package.sup.36.

    Code and Data Availability

    [0333] All custom scripts are available upon request and data is accessible at: https://drive.google.com/open?id=1T4ZL7JFnaWITUHD7LaLvu74r_IZYG60N

    Example 7

    New Method for CAGE

    [0334] The new method for CAGE is intended to improve sequencing efficiency on Illumina sequencing instruments (Illumina HiSeq2500) and shorten the protocol. Fewer protocol steps should lead to higher complexity libraries achieved with lower amount of total cellular RNA (currently the protocol is optimised to work with 5-10 ng, and optimisations may allow use of 1-2 ng).

    Changes in Protocol Steps:

    [0335] Average fragment length in the final SLIC-CAGE libraries is 800 nucleotides. Clustering of fragments on Illumina sequencers is more efficient for shorter fragments (standard Illumina sequencing libraries tend to have fragment size 200-500 nucleotides), therefore larger fragments typically lower sequencing efficiency. To improve sequencing quality of the SLIC-CAGE libraries, a tagmentation step (Illumina Nextera XT kit) is incorporated. The kit relies on transposition of barcode sequences randomly into DNA in a “cut and paste” reaction. This random insertion efficiently fragments the DNA and at the same time adds the sequences required for PCR amplification and sequencing. Incorporation of the tagmentation step after the SLIC-CAGE protocol is performed has been tested and analysis of the resultant libraries is underway.

    [0336] SLIC-CAGE methodology relies on nAnT-iCAGE protocol steps. However, by including tagmentation to decrease the size of the DNA fragments, the sequence necessary for sequencing and PCR amplification is added at the same time. Therefore, it is expected that the nAnT-iCAGE steps which include ligation of the 3′linker and treatment with the USER enzyme will be unnecessary. Replacement of the 3′linker ligation in the nAnT-iCAGE protocol with the tagmentation step is currently being investigated. As the carrier is still present in the libraries, and tagmentation also involves a PCR amplification step, the following conditions are being optimised and tested:

    A)

    [0337] 1. Cleavage of the cDNA by a transposon and tagging of the double-stranded cDNA, for example tagmentation;

    [0338] 2. PCR amplification;

    [0339] 3. Degradation of the carrier nucleic acid according to the invention or the nucleic acid of the compositions according to the invention;

    [0340] 4. Purification of the DNA fragments, for example by Solid Phase Reversible Immobilization-based paramagnetic beads, for example AMPure beads or RNAclean XP; or

    B)

    [0341] 1. Degradation of the nucleic acid according to the invention or the nucleic acid of the compositions according to the invention;

    [0342] 2. Purification of the DNA fragments, for example by Solid Phase Reversible Immobilization-based paramagnetic beads, for example AMPure beads or RNAclean XP;

    [0343] 3. Cleavage of the cDNA by a transposon and tagging of the double-stranded cDNA, for example tagmentation;

    [0344] 4. PCR amplification; or

    C)

    [0345] 1. Degradation of the carrier nucleic acid according to the invention or the nucleic acid of the compositions according to the invention;

    [0346] 2. Purification of the DNA fragments, for example by Solid Phase Reversible Immobilization-based paramagnetic beads, for example AMPure beads or RNAclean XP;

    [0347] 3. PCR amplification;

    [0348] 4. (optional 2nd round of degradation of the carrier nucleic acid according to the invention or the nucleic acid of the compositions according to the invention);

    [0349] 5. Cleavage of the cDNA by a transposon and tagging of the double-stranded cDNA, for example tagmentation.

    [0350] Libraries will be tested by qPCR for the presence of the carrier and, if necessary, a 2.sup.nd round of carrier degradation and size selection will be performed.