Attribute Sieving and Profiling By Pooled Sanger Sequencing
20190127777 · 2019-05-02
Inventors
Cpc classification
C12Q2537/143
CHEMISTRY; METALLURGY
C12Q2535/101
CHEMISTRY; METALLURGY
C12N15/1065
CHEMISTRY; METALLURGY
C12Q2537/165
CHEMISTRY; METALLURGY
C12Q2537/159
CHEMISTRY; METALLURGY
C12Q1/6806
CHEMISTRY; METALLURGY
International classification
C12Q1/6806
CHEMISTRY; METALLURGY
C12N15/10
CHEMISTRY; METALLURGY
C40B20/04
CHEMISTRY; METALLURGY
Abstract
Disclosed is a novel method of allele profiling, or nucleic acid sieving, with pooled Sanger sequencing as a first (aka screening) stage, where the first step is: amplifying a single sequence, delineated by forward and reverse primers, which may represent a single exon, or a segment thereof, or a contiguous stretch of multiple exons and introns. The amplicons produced from a pool of samples include the amplified sequence, and these are next converted into fragments in the standard Sanger labeling reaction. Ambiguities will appear as superposed peaks at any heterozygous position of interest, as the origin of the variant signal cannot be uniquely attributed to a specific sample, or samples, in the pool. These ambiguities may be resolved by the allele profiling process; or, resolution can be done with source-tagged primers generating source-tagged amplicons, which generate position shifts in labels, which can be decoded to resolve the ambiguities.
Claims
1-3. (canceled)
4. The process of claim 24 wherein the source tags are designed, and labeled second reaction products are formed with labeling primers selected, such that peaks representing the labeled di-deoxynucleotide-terminated fragments are shifted by a known amount.
5. The process of claim 24 wherein peak positions are determined by capillary electrophoresis.
6. The process of claim 24 wherein the selected sample pools are unambiguous for the desired allele because all constituent samples are homozygous for the desired allele.
7. The process of claim 24 wherein the selected sample pools are ambiguous for the desired allele because at least one constituent sample is not homozygous for the said desired allele.
8. (canceled)
9. The process of claim 24 wherein the desired alleles include markers for genes from the following list: β-thalassemia, cystic fibrosis, HLA, and RH.
10. The process of claim 24 wherein determining the presence of a desired allele at the variable site(s) of interest is by determining the presence of one of at least two peaks, at specific positions corresponding to said variable site(s) of interest.
11. The process of claim 10 wherein the determining if any desired alleles are in any combined pool is by determining whether at the position of variable site(s) of interest, a single peak (in a single color channel) or at least two peaks (in at least two color channels) are observed in the combined pools.
12-23. (canceled)
24. A process of selecting subsets of nucleic acid samples having one or more desired alleles of interest at one or more variable sites of interest, or not having any desired alleles, wherein the presence of said desired alleles gives rise to certain labeled reaction products, the process comprising: (a) for each of one or more desired alleles: (i) determining a value, d, representing a first maximum number of samples to be combined into pools, by finding first that the probability of any pool having any of the desired alleles does not exceed a predetermined probability threshold, wherein said desired alleles are known to occur in the population at a specified frequency, and wherein if some of the one or more desired alleles occur at specified frequencies substantially different than the specified frequencies of other desired alleles, forming different sample pools for determining if any of said desired alleles having said substantially different frequencies are in any pool, wherein d is determined in accordance with said substantially different frequencies; (ii) if d for said desired alleles is greater than a preset upper limit d_max, then setting the value d for said desired alleles equal to d_max; (iii) if d for said desired alleles is less than 1, then setting the value of d for said desired alleles equal to 1;
(b) performing the following steps: (i) combining aliquots from the nucleic acid samples to form a plurality of sample pools with not more than d samples per pool; (ii) associating particular said desired alleles in different sample pools with a source tag identifying the different sample pools; (iii) amplifying genomic regions of the samples containing the desired alleles to generate amplicons including source-tags; (iv) combining aliquots from one or more of the different amplicon-containing pools to form one or more combined pools wherein the number of amplified samples in each combined pool does not exceed d_max; (v) forming labeled reaction products from said amplicons using labeled di-deoxynucleotides which thereby generates labeled reaction products, but wherein only a subset of the labeled reaction products allow identification of desired alleles; (vi) determining if any pool contains desired alleles by identifying the label(s) of said subset of labeled reaction products; and (c) identifying the source tags of samples having, or not having, desired alleles, to determine the sample pool(s) of origin for any such samples, and selecting particular sample pool(s) containing at least one sample having a desired allele; or, selecting particular sample pool(s) having no sample including a desired allele.
25. The process of claim 24 wherein, in the event of ambiguity, steps b(v) and b(vi) are repeated with a single primer directed to a specific subsequence of the source-tag incorporated in the amplicons such that only selected amplicons, but not the other amplicons, form labeled reaction products.
26. The process of claim 24 wherein source tags are designed to share common 5′ subsequences.
27. The process of claim 24 wherein the labeling of combined amplicons is with a single primer directed to a common 5′ subsequence of the source tags.
28. The process of claim 24 wherein the labeling of a selected amplicon, but not the other amplicons within the pool of combined amplicons, is with a single primer directed to a specific subsequence of the source-tag incorporated in that amplicon.
29. The process of claim 28 wherein said single primer includes a unique 3′ subsequence, of at least 1 nucleotide in length.
Description
BRIEF DESCRIPTION OF THE FIGURES
[0008] The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
DETAILED DESCRIPTION
[0019] The term "variable site of interest" is defined as a polymorphic site or SNP; or an insertion or deletion mutation.
[0020] The term "disambiguating" means resolving an ambiguity, which occurs if, based on the results in question, at least one sample in a particular pool cannot be identified as either normal or variant at a variable site of interest. Disambiguation encompasses any method of sequencing or genotyping or allele identification, including but not limited to using allele-specific primers.
[0021] The term "reaction steps" refers to steps involved in amplifying, labeling, or extending a primer, amplicon or oligonucleotide chain.
[0022] The present invention provides for the use of pooled Sanger sequencing as the first (screening) stage in allele profiling or sieving which may comprise a second (disambiguation) stage, as illustrated, for two samples, in
[0023] In a first embodiment, the present invention provides for combining two or more amplicons in a Sanger labeling (aka cycle sequencing) reaction using differentially fluorescently labeled dideoxy-nucleotides, and subsequent analysis of these pooled labeled products, preferably by capillary electrophoresis. As with standard Sanger sequencing, a single (contiguous) sequence (typically comprising a single exon) is analyzed. As with allele profiling and nucleic acid sieving, samples that contain at least one variant allele in one or more (known or unknown) positions in the sequence generally will introduce an ambiguity associated with two or more superimposed peaks at variable positions in the sequence.
[0024] However, contrary to the general practice of discarding double-sequence data as contaminated (see DNA Sequencing Troubleshooting Guide by Eurofins Genomics, available on its website; DNA SEQUENCING SANGER: TECHNICALS SOLUTIONS GUIDE by Secugen, available on its website), the superimposed sequence traces generated by pooled samples may be decoded, so as to achieve disambiguation, as elaborated herein.
[0025] The process of the present invention will be a useful complement to the previously disclosed processes of allele profiling and nucleic acid sieving for the analysis of alleles and mutations, especially if all or the most prevalent mutations or alleles of interest reside on a single exon, as they do, in the case of mutations, for example, for β-thalassemia or for cystic fibrosis, or, in the case of alleles, for the polymorphic genes encoding the human leukocyte antigens (HLA) or the Rh antigens. More generally, the present invention also will be useful for analyzing a single amplicon (or other construct or reaction products) comprising multiple exons or sections thereof. Such amplicons may be generated by amplification with primers flanking the region in the sequence comprising a variable site, or multiple variable sites, of interest. Either of these flanking primers may be used in the subsequent Sanger labeling reaction of the combined samples, as illustrated in
[0026] Thus, pooled Sanger sequencing of the β-thalassemia gene for a pool of two samples would produce, for IVSI-5 (G>C), by far the most commonly observed β-thalassemia mutation in Pakistanis (Ansari 2011, Molecular epidemiology of β-thalassemia in Pakistan: far reaching implications, Int J Mol Epidemiol Genet. 2(4): 403-408), the following expected read (see also,
TABLE-US-00001
G
Sample 1 (normal) CAGGTTGGTATC (SEQ ID NO: 1)
Sample 2 (het) CAGGTTGGTATC (SEQ ID NO: 1)
C
[0027] The presence of a mutation in the pool will be readily detected, in the form of a het signal, characterized here by peaks in two color channels; however, ambiguity generally will remain as to the identity of the sample(s) carrying the mutation, as the following separately pooled configurations: (GG|GC), (GC|GG), (GG|CC), and (CC|GG) all will produce the same heterozygous signature in the compound sequence trace (though peak intensities may provide additional information). Similarly, insertions and deletions are readily detected, as illustrated in
[0028] Gain in Operational Efficiency
[0029] The probability of encountering at least one heterozygous configuration in a pool of d samples may be estimated from the (assumed known) population frequencies of anticipated variant alleles (as discussed in the allele profiling and sieving references) so as to determine the optimal d, subject to the constraint that the d-fold dilution of samples incurred by pooling will set an upper limit, d ≤ d_max, to the extent of practical pooling.
[0030] In comparison to standard Sanger sequencing for mutation analysis, pooled Sanger sequencing will produce a gain reflecting the reduction in the number, N, of sequencing runs in the single sample format to a number not greater than N/d + d*(N/d)*prob(at least one mutation in d samples), where d denotes the number of samples in a pool. Assuming bi-allelic genes, prob(at least one mutation in 2*d alleles) = 1 − (1 − f)^{2d}, where f represents the probability that a sample comprises at least one of a set of variant alleles or mutations of interest.
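The pool-size rule of paragraph [0029] and claim 24(a) can be sketched in Python as follows. This is a minimal illustration, not the applicants' implementation: the function names, the linear search, and the example threshold of 0.25 are assumptions introduced here; only the probability expression and the clamping of d to the range 1 to d_max come from the text.

```python
def pool_hit_probability(f: float, d: int) -> float:
    """Probability of at least one variant among the 2*d alleles of a pool,
    using the bi-allelic expression 1 - (1 - f)^(2d) from paragraph [0030]."""
    return 1.0 - (1.0 - f) ** (2 * d)


def choose_pool_size(f: float, threshold: float, d_max: int) -> int:
    """Largest d whose hit probability stays at or below `threshold`
    (a hypothetical, user-chosen value), clamped to the range 1..d_max."""
    d = 0
    while d < d_max and pool_hit_probability(f, d + 1) <= threshold:
        d += 1
    return max(1, d)  # claim 24(a)(iii): never pool fewer than one sample


# Example with the carrier frequency used in the text and an assumed threshold.
d = choose_pool_size(f=1 / 30, threshold=0.25, d_max=8)
```

With these assumed inputs the search stops at d = 4, because five samples per pool would push the hit probability above the chosen threshold.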
[0031] For example, taking the carrier frequency for β-thalassemia to be f=1/30, reflecting the combined abundances of the most commonly observed mutations in South-Asian populations (Ansari 2011), the expression yields 0.13, 0.24, 0.33 and 0.42, respectively, for d=2, 4, 6 and 8. Thus, for d=4, 96 β-thalassemia samples, combined into 96/4=24 pools, would yield roughly 0.24*24 pools requiring disambiguation; if performed on individual samples (d=1), this scenario would entail performing an additional 0.2376*24*4=22.8, or roughly 23, runs (assuming, in the worst case, that no two ambiguities are encountered in the same pool); thus, the total number of wells processed would be no greater than 96/4+23=47, a gain of roughly 2 (=96/47).
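The arithmetic of this example can be reproduced with a short Python sketch. The function name and return layout are illustrative assumptions; the bound N/d + d*(N/d)*prob and the inputs (96 samples, d=4, f=1/30) are taken from the text.

```python
import math


def expected_runs(n_samples: int, d: int, f: float):
    """Upper bound on sequencing runs for pooled screening followed by
    individual retesting of every sample in an ambiguous pool."""
    pools = math.ceil(n_samples / d)      # first-stage pooled runs
    p_hit = 1.0 - (1.0 - f) ** (2 * d)    # pool holds at least one variant allele
    retests = pools * d * p_hit           # worst case: one ambiguity per flagged pool
    return pools, p_hit, pools + retests


pools, p_hit, total = expected_runs(96, 4, 1 / 30)
gain = 96 / total  # ratio of single-sample runs to pooled-workflow runs
```

For the stated inputs this gives 24 pools, a hit probability of about 0.2375, and a total of just under 47 runs, so the gain is close to 2.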
[0032] Allele profiling (including the use of source tags) would further reduce the number of requisite additional runs, by another factor of 4, to roughly 6 (=23/4), the factor of 4 reflecting the pooling of source-tagged first reaction products (as described in the allele profiling references).
[0033] A more detailed comparison would break out gains in first and second reactions: standard Sanger sequencing would require 96 first reactions (namely: PCR amplification) plus 96 second reactions (namely: labeling), for a total of 192 reactions, the resulting fragments requiring 96 capillaries for analysis. In contrast, assuming d=4, allele profiling with pooled Sanger sequencing as the first stage would require: 96/4 first reactions, 96/4 second reactions, analyzed in 24 capillaries, plus, for disambiguations: 23 first reactions plus 6 second reactions, analyzed in 6 capillaries.
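The tallies in this comparison can be laid side by side in a few lines of Python. The `Workload` class is a hypothetical bookkeeping helper introduced only for this sketch; every number in it is copied from the paragraph above.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    first_reactions: int   # PCR amplifications
    second_reactions: int  # Sanger labeling reactions
    capillaries: int       # electrophoresis lanes used

    @property
    def reactions(self) -> int:
        return self.first_reactions + self.second_reactions


standard = Workload(first_reactions=96, second_reactions=96, capillaries=96)

# Pooled workflow at d=4: screening stage plus disambiguation stage.
screening = Workload(first_reactions=24, second_reactions=24, capillaries=24)
disambiguation = Workload(first_reactions=23, second_reactions=6, capillaries=6)
pooled = Workload(
    screening.first_reactions + disambiguation.first_reactions,
    screening.second_reactions + disambiguation.second_reactions,
    screening.capillaries + disambiguation.capillaries,
)
```

On these figures the pooled workflow totals 77 reactions and 30 capillaries, against 192 reactions and 96 capillaries for the single-sample format.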
[0034] Thus, the invention yields a substantial gain in process efficiency for carrier screening given the typically low carrier frequencies for inherited disorders. An example showing the application of pooled Sanger sequencing to molecular sieving for RHCE alleles is given below.
[0035] Disambiguation by Allele Profiling
[0036] Ambiguities may be resolved, in accordance with the allele profiling process previously disclosed, by using allele-specific amplification at heterozygous positions of interest, either one at a time or several at a time. In a preferred embodiment, DNA from the constituent samples of ambiguous pools is amplified using one or more pairs of fluorescently labeled primers directed to the alleles at heterozygous positions, paired with source-tagged primers, as illustrated in
[0037] These allele-specific amplification reactions may be performed using genomic DNA from individual samples, or a set of amplicons independently generated by random priming of the genome or selected genomic regions of these samples. In the latter case, the amplification of a specific sequence of interest will be accomplished in a small number of cycles generating source-and-marker-tagged products.
[0038] Disambiguation by Pooled Sanger Sequencing of Source-Tagged Products
[0039] In a further embodiment of the invention, the Sanger labeling reaction is performed with pools of first reaction products comprising source (S) tags, wherein source tags identifying the first reaction products produce predetermined relative shifts in the expected sequence traces, either by changing fragment length or (by one of several methods well known in the art) electrophoretic mobility.
[0040] In one embodiment, to change fragment length, the source tags have a common 5′ subsequence which may comprise the entire sequence of the shortest tag, and the universal labeling primer is complementary to that subsequence, as illustrated in
[0041] In another embodiment of the invention, the source tags differ in composition, by one or more base(s), at the end forming the junction with the gene- or exon-specific primer(s) as in
[0042] Thus, given a post-PCR pool of first reaction products, the choice of labeling primer, in accordance with these embodiments, permits either screening of multiple such first products for ambiguities reflecting heterozygous configurations, or disambiguating such configurations, by introducing peak shifts and labeling only a subset of pooled first reaction products.
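The effect of the labeling-primer choice can be sketched as a simple string model in Python. Everything specific here is a stated assumption: the tag names S8 and S10 follow the naming used elsewhere in this description, but the tag sequences are invented placeholders, and primer annealing is reduced to a prefix match, which ignores strand orientation and mismatch tolerance.

```python
# Hypothetical tag sequences: S10 extends the common 5' subsequence by two bases.
COMMON = "ACGTACGT"  # assumed common 5' subsequence (also the whole S8 tag)
TAGS = {"S8": COMMON, "S10": COMMON + "TC"}


def labeled_sources(primer: str, tags: dict) -> set:
    """Sources whose tag begins with the primer sequence; in this simplified
    model, only these amplicons form labeled reaction products."""
    return {name for name, tag in tags.items() if tag.startswith(primer)}


def relative_shifts(primer: str, tags: dict) -> dict:
    """Peak shift, in bases, of each labeled source relative to the shortest
    labeled tag, assuming shift equals the difference in tag length."""
    labeled = labeled_sources(primer, tags)
    shortest = min(len(tags[name]) for name in labeled)
    return {name: len(tags[name]) - shortest for name in labeled}


universal = labeled_sources(COMMON, TAGS)         # screening: both sources labeled
specific = labeled_sources(COMMON + "TC", TAGS)   # disambiguation: S10 only
shifts = relative_shifts(COMMON, TAGS)            # S10 trace offset by 2 bases
```

In this model a primer matching only the common subsequence labels every amplicon and yields shifted, superimposed traces, while a primer carrying a unique 3′ extension labels a single source.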
[0043] The process of the invention is in contrast to the analysis of Sanger sequence traces comprising signals from multiple samples by decomposition into constituents, with reference to a dictionary holding constituent peak patterns (or representations thereof). Superpositions of two or more such peak patterns are then compared to the observed pattern to infer the composition of the mixture, as shown for sequencing of the 16S rRNA gene for mixtures of bacterial pathogens (the Pathogenomix website; see also Kommedal 2008, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2576573/pdf/0213-08.pdf). This approach is akin to that of decomposing genotypes into constituent alleles, where alleles are defined over multiple variable sites, a step in the standard analysis of genotypes for highly polymorphic genes such as HLA.
[0044] Non-Integer Peak Shifts
[0045] Ambiguity remains when the mutation or variable position in the shifted peak pattern of a second sample is superimposed on an identical base in the sequence of the first sample. For example, if, in
TABLE-US-00002
G
Sample 1 (normal) S_8-CAGGTCGGTATC (SEQ ID NO: 2)
Sample 2 (hom mut) S10-CAGGTCGCTATC (SEQ ID NO: 3)
C
[0046] However, by constructing source-tags so as to introduce non-integer peak shifts (that is: shifts by non-integer multiples of the nominal peak-to-peak spacing of 1 base), this ambiguity is avoided: the presence of a second C peak, shifted by a non-integer displacement from the first, would unambiguously indicate the presence of a het configuration.
[0047] Non-integer peak shifts may be produced by using source tags that, for given length, differ in base composition. Thus, it has long been known that tag composition alters the electrophoretic mobilities of oligonucleotides (Frank 1979, DNA chain length markers and the influence of base composition on electrophoretic mobility of oligodeoxyribonucleotides in polyacrylamide-gels, Nucleic Acids Research Vol. 8 pp. 2069-87). Alternatively, the fluorescent dyes used in commercial Sanger sequencing kits are well known to introduce differential peak shifts (requiring correction as a pre-processing step when aligning the traces recorded in the different color channels and normalization with respect to a size ladder included in each reaction). In addition, chemical modifications with "drag tags" also have been described. Other chemical modifications, including methylation, also are available to introduce peak shifts.
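How a non-integer shift might be recognized in practice can be sketched in Python. This is an illustrative assumption, not a procedure given in the text: peak positions are taken to be already expressed in units of the nominal peak-to-peak spacing (1 base), and the tolerance of 0.2 base is an arbitrary placeholder.

```python
def fractional_offset(pos_a: float, pos_b: float) -> float:
    """Distance of the peak separation from the nearest whole number of bases."""
    delta = abs(pos_b - pos_a)
    return abs(delta - round(delta))


def is_non_integer_shift(pos_a: float, pos_b: float, tolerance: float = 0.2) -> bool:
    """True when two same-channel peaks are separated by a clearly non-integer
    number of bases, which would point to a second, source-shifted fragment
    rather than a neighboring base of the same sequence."""
    return fractional_offset(pos_a, pos_b) > tolerance


# Two C peaks 2.5 bases apart cannot both sit on the 1-base grid of one read.
shifted = is_non_integer_shift(18.0, 20.5)
# Two C peaks exactly 2 bases apart remain ambiguous under this test.
on_grid = is_non_integer_shift(18.0, 20.0)
```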
[0048] Key Process Steps
[0049] The analysis of sequence traces recorded from a pool of source-tagged samples would proceed as follows:
[0050] detect all peaks, using a standard peak detection algorithm such as the multi-scale detection algorithm (Du et al., Improved peak detection in mass spectrum by incorporating continuous wavelet transform-based pattern matching, Bioinformatics, Vol. 22, pp. 2059-65 (2006));
[0051] correct peak positions for relative mobility shifts introduced by fluorescent dyes;
[0052] for all known or anticipated variable positions in the sequence(s):
[0053] detect alleles for each such group of peaks; and
[0054] resolve residual ambiguities by allele profiling or by using source-tagged primers and comparing compound sequence traces with expected reads.
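The first steps of this analysis can be sketched in Python under simplifying assumptions. The sketch swaps the cited wavelet-based detector for a plain local-maximum test, treats the dye mobility correction as a fixed per-channel offset, and uses invented intensities; the function names, thresholds and window are all hypothetical.

```python
from typing import Dict, List, Set

Trace = Dict[str, List[float]]  # base letter (color channel) -> intensity per position


def detect_peaks(signal: List[float], min_height: float) -> List[int]:
    """Indices of local maxima at or above min_height; a simple stand-in
    for the continuous-wavelet detector cited in the text."""
    return [
        i for i in range(1, len(signal) - 1)
        if signal[i] >= min_height and signal[i - 1] < signal[i] >= signal[i + 1]
    ]


def call_site(trace: Trace, site: int, dye_shift: Dict[str, float],
              min_height: float = 0.3, window: float = 0.5) -> Set[str]:
    """Bases whose dye-corrected peaks fall within `window` of a variable site."""
    called = set()
    for base, signal in trace.items():
        for i in detect_peaks(signal, min_height):
            if abs((i - dye_shift.get(base, 0.0)) - site) <= window:
                called.add(base)
    return called


# Invented pooled trace: G and C both peak at site 5 once the assumed
# one-position dye offset of the C channel is removed.
trace = {
    "A": [0, 0, 0.8, 0, 0, 0, 0, 0, 0],
    "C": [0, 0, 0, 0, 0, 0, 0.5, 0, 0],
    "G": [0, 0, 0, 0, 0, 0.6, 0, 0, 0],
    "T": [0, 0, 0, 0, 0, 0, 0, 0, 0],
}
alleles = call_site(trace, site=5, dye_shift={"C": 1.0})
```

A site returning more than one base, as here, flags a het configuration in the pool and would be passed on to the disambiguation step.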
Examples
[0055] 1. Detecting Variants by Pooled Sanger Sequencing: Sieving for RHCE Exon 5 Alleles
[0056] Exon 5 of this gene comprises several important alleles including (ISBT: https://tinyurl.com/vca97t93):
TABLE-US-00003
Name    Polymorphism   Phenotype
A226P   676 G > C      e/E
V223F   667 G > T      hrS-, others
Q233E   697 C > G      Crawford
M238V   712 A > G      partial c, e
L245V   733 C > G      V, VS; partial c, e
[0057] The administration of red cells that are not properly matched for the phenotype determined at this locus contributes to the risk of alloantibody formation. The rapid determination of the alleles at this locus in particular, for recipients and donors of red cells, therefore has substantial clinical significance.
[0058] To apply the screening (or sieving) method of the invention using pooled Sanger sequencing, combine DNA samples from at least 2 individuals (d ≥ 2) for amplification using standard primers flanking exon 5, where d is determined as disclosed; then, commit the resulting amplicons to the Sanger labeling reaction performed with either of the PCR primers, and analyze the resulting labeled products by capillary electrophoresis. As with molecular sieving generally, the abundance of the variant alleles in the table, for the population of interest, determines the probability of a variant and an associated ambiguity, and thus determines the expected number of pools that are unambiguous for one or more of the listed alleles. Constituent samples of these pools may be selected in accordance with desired allele patterns; for example, pools comprising candidate donor samples that are homozygous E− may be selected for immediate assignment to recipients with existing anti-E antibodies.
[0059] 2. Detecting Variants by Pooled Sanger Sequencing: Screening for β-Thalassemia Mutations
[0060] Exon 1 of this gene comprises several of the most commonly observed mutations including substitutions, insertions and deletions. Illustrated here is the detection of an insertion in a pool of two samples of which one is normal, and the other is homozygous for the codon 8/9 (+G) mutation, producing the following expected read (see also
TABLE-US-00004
Sample 1 (normal) GAAGTCTGC
Sample 2 (ins) GAAGGTCTG
[0061] A G-T het configuration in the expected position of the insert, highlighted here by a bold-faced G, along with additional predictable downstream hets, indicates the presence of the insert in at least one allele. The corresponding sequence traces, in
[0062] Analogously, deletions, such as the 4-base deletion in codon 41/42 (-CTTT), another common β-thalassemia mutation, would be readily detected by the appearance of predictable het configurations. A simple substitution will produce a characteristic het at the expected position.
[0063] The example illustrates the case d=2, with 2 copies of the variant allele in the pool. The value of d is limited only by detection sensitivity, which must be such that 1 copy of a variant allele is reliably detected in a pool comprising 2d copies: thus, it is the detection sensitivity that ultimately limits the value of d ≤ d_max.
[0064] 3. Disambiguation by Using Source Tags: β-Thalassemia Mutations
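The predictable het pattern caused by an insertion can be reproduced by superposing the two expected reads position by position, as in the following Python sketch. The function names are illustrative assumptions; the two reads are those of the codon 8/9 (+G) example above.

```python
from typing import List, Set, Tuple


def superpose(reads: List[str]) -> List[Set[str]]:
    """Set of bases expected at each trace position for equal-length reads."""
    return [set(column) for column in zip(*reads)]


def het_positions(reads: List[str]) -> List[Tuple[int, str]]:
    """1-based positions showing more than one base, with the bases observed."""
    return [
        (i + 1, "".join(sorted(bases)))
        for i, bases in enumerate(superpose(reads))
        if len(bases) > 1
    ]


normal = "GAAGTCTGC"
insertion = "GAAGGTCTG"   # same read with one extra G after the fourth base
hets = het_positions([normal, insertion])
```

The first het, G over T at position 5, marks the insertion point, and every later position is also a het because the inserted base shifts the remainder of the read by one.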
[0065] As illustrated in
[0066] 4. Disambiguation by Using Source Tags: Cystic Fibrosis Mutations
[0067] A further example is that of assigning the G542X (G>T) mutation in exon 11 of the cystic fibrosis gene. Of the expected sequence reads for four possible configurations, shown in
[0068] It is worth pointing out that the use of a labeling primer directed to the S22 tag sequence permits the labeling of only the S22-tagged sample, even when both samples are in the pool: in fact, this is how the trace in the middle panel of