Method for detecting a minority genotype

Abstract

Disclosed are methods for detecting a minority genotype of a target nucleic acid. The disclosed method generally includes the steps of (a) deep sequencing at least a portion of the target nucleic acid; (b) using the deep sequencing results of (a) to detect the presence of variant nucleobases at one or more nucleotide reference positions within the target nucleic acid; (c) using the variant detection results generated in step (b) to perform a statistical analysis of whether the variants are significant; and (d) using the variant detection and variant significance results generated in steps (b) and (c) to perform a statistical analysis of whether a subset of sequences together exhibit a common set of significant variants.

Claims

1. A method for detecting a minority genotype of a target nucleic acid of a virus in a population of patients undergoing a population treatment regime, the treatment regime comprising receiving a drug for treating an infection by the virus, the patients being monitored to provide resistance data of the drug, the method comprising: (a) deep sequencing at least a portion of the target nucleic acid, wherein the deep sequencing is selected from the group consisting of: single molecule real time sequencing, ion semi-conductor sequencing, and bridge polymerase chain reaction sequencing, (b) detecting, using the deep sequencing results of (a), a presence of variant nucleobases at one or more nucleotide reference positions within the target nucleic acid; (c) calculating p-values of the variants detected in step (b) and selecting statistically significant variants with the lowest p-values or for which the p-values are below a threshold; (d) clustering, using non-negative matrix factorization, the statistically significant variants of step (c) into clusters; (e) correlating the clusters of statistically significant variants with the drug resistance data; (f) determining a patient treatment regime, different than the population treatment regime, for a patient infected with the virus and having received the drug, from the population of patients or otherwise, based on a determined presence in the patient of a cluster of statistically significant variants correlated with drug resistance in the population as determined in step (e), wherein the patient treatment regime is tailored to the drug resistance; and (g) administering the determined patient treatment regime to the patient; wherein the virus is hepatitis C and the patient treatment regime and the population treatment regime is each from a class of drugs selected from NS3/4A protease inhibitors, non-nucleoside inhibitors of RdRp and NS5A inhibitors, the selected classes being different for the patient treatment regime and the population treatment regime.

2. The method of claim 1, wherein said variant detection and variant significance is based on a log likelihood-ratio model for nucleotide variant detection, or a Poisson distribution model for codon variant detection.

3. The method of claim 1, wherein said variant detection and variant significance is based on a Bayesian model.

4. The method of claim 1, wherein the statistically significant variants determined in step (c) are subjected to smoothing or a derivative thereof before proceeding to step (d).

5. The method of claim 1, wherein the statistically significant variants determined in step (c) are subjected to correction by a false discovery rate minimization method before proceeding to step (d).

6. The method of claim 5, wherein the false discovery rate minimization method includes Bonferroni correction or a derivative thereof.

7. The method of claim 5, wherein the false discovery rate minimization method includes Benjamini and Hochberg correction or a derivative thereof.

8. The method of claim 1, further comprising amplifying one or more regions of the target nucleic acid prior to step (a), and wherein said deep sequencing comprises sequencing the one or more amplified regions.

9. The method of claim 1, wherein the target nucleic acid comprises at least one of the NS3, NS4, and NS5 regions from HCV.

10. The method of claim 1, wherein at least two samples are collected from each patient in the population, wherein each sample of the at least two samples corresponds to a different time point during a course of the population treatment regime, wherein steps (a)-(d) are performed with respect to the target nucleic acid from each of the at least two samples, and (i) wherein the variant detection results of step (b) include a determination of the frequency for each variant and/or the variant significance results of step (c) include a significance value assigned to each variant, and wherein the method further comprises (h) for each of one or more variants within the common set of significant variants determined in step (d), comparing the variant frequencies and/or variant significance values for the samples corresponding to different time points to determine whether the variant is a differential variant exhibiting different frequencies and/or significance values during the course of the treatment regime; or (ii) wherein a plurality of minority genotypes are detected within the samples corresponding to different time points, wherein the variant detection results of step (b) include a determination of the frequency for each variant and/or the variant significance results of step (c) include a significance value assigned to each variant, and wherein the method further comprises (i) optionally selecting a subset of the plurality of minority genotypes; and (j) for each of one or more variants within the common set of significant variants determined in step (d) for each minority genotype within said plurality or optional subset thereof, comparing the variant frequencies and/or variant significance values for the samples corresponding to different time points to determine whether the variant is a differential variant exhibiting different frequencies and/or significance values during the course of the treatment regime.

11. The method of claim 10, wherein a plurality of differential variants are identified in step (j), and wherein the method further comprises (k) selecting at least a subset of the plurality of differential variants; and (l) performing two-way hierarchical clustering on (1) the frequencies and/or significance values of the differential variants within said plurality or subset thereof and (2) the samples containing the target nucleic acid in which said differential variants were detected, each sample corresponding to a time point.

12. The method of claim 10, further comprising confirming at least one differential variant by Sanger sequencing.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 is a block flow diagram illustrating a representative embodiment of a method for detecting a minority genotype.

DETAILED DESCRIPTION OF THE INVENTION

(2) The present invention provides a method for detecting a minority (e.g., rare) genotype of a target nucleic acid. The method generally includes (1) the identification of variants and (2) linkage analysis of co-occurring variants to enable grouping or clustering sub-types within the larger population. The target nucleic acid may be, e.g., associated with a disease or disorder or a patient's potential response to one or more treatments for such disease or disorder, including, for example, infectious disease (e.g., HCV), cancer, or immune disorders such as, for example, autoimmune disease. The identification of minority genotypes according to the methods provided herein may in turn be useful for correlating one or more biological responses to the identified minority genotype. Particularly in the context of a disease or disorder, such identification may in turn be useful for, e.g., stratifying and interpreting efficacy and resistance data during clinical testing of a drug or for tailoring treatment schedules with one or more drugs according to a particular genotype for the target nucleic acid associated with the disease.

(3) The method according to the present invention generally includes the following steps: (a) deep sequencing at least a portion of the target nucleic acid; (b) using the deep sequencing results of (a) to detect the presence of variant nucleobases at one or more nucleotide reference positions within the target nucleic acid; (c) using the variant detection results generated in step (b) to perform a statistical analysis of whether the variants are significant; and (d) using the variant detection and variant significance results generated in steps (b) and (c) to perform a statistical analysis of whether a subset of sequences together exhibit a common set of significant variants (i.e., whether a set of significant variants are statistically likely to co-occur). If a subset of sequences together exhibit a common set of significant variants, a minority genotype of the target nucleic acid is detected.

(4) Using the method of the present invention, a minority genotype can be detected down to at least about 1% (1 variant nucleic acid in the presence of 100 wild-type nucleic acids). In certain embodiments, a minority genotype detected in accordance with the present invention represents a variant nucleic acid that is present at a frequency, within a representative population of corresponding nucleic acids, from at least about 1% to about 45%, from at least about 1% to about 40%, from at least about 1% to about 35%, from at least about 1% to about 30%, from at least about 1% to about 25%, from at least about 1% to about 20%, from at least about 1% to about 15%, from at least about 1% to about 10%, from at least about 1% to about 5%, from at least about 1% to about 4%, from at least about 1% to about 3%, or from at least about 1% to about 2%.

(5) FIG. 1 illustrates a representative embodiment of the present method. Referring to FIG. 1 at 100, a target nucleic acid is deep sequenced. The sequencing data can be obtained, for example, from any of the next generation sequencing platforms, including but not limited to Pacific Biosciences RS (“PacBio RS”; single molecule real-time, or SMRT™, sequencing), Ion Torrent™ ion semi conductor personal genome machine (PGM), Illumina HiSeq™, and Illumina MiSeq™ bridge polymerase chain reaction sequencing. Typically, the sequencing data is in a FASTQ formatted file, where each sequence has an ID, a string of letters of length n representing the nucleotide sequence, and a string of quality values (typically ascii characters) representing length n that prescribes a quality or confidence score to the particular nucleotide call. In a particular embodiment, the sequencing data is FASTQ formatted and generated using SMRT™ sequencing.

(6) Referring to FIG. 1 at 200, the sequencing data is aligned to a reference sequence of the region of interest. This could be an entire genome, a subset of the genome or, more specifically, a targeted region of the genome. The alignment method typically depends on the sequencing platform utilized to generate the sequencing data. Typically, software such as BWA (Li and Durbin, Bioinformatics 25:1754-60, 2010), BWA-SW (Li and Durbin, Bioinformatics 26:589-595, 2010), or other suitable alignment (mapping) software is used. In a particular embodiment, PacBio RS sequences are aligned using BWA-SW software with parameters suitable for PacBio sequences. Output of this stage is typically a SAM formatted file (Li et al., Bioinformatics 25:2078-2079, 2009).

(7) Referring to FIG. 1 at 300, the aligned reads are statistically pooled at one or more particular loci on the reference genome and evaluated to establish potential variations from the loci (see, e.g., Li, Bioinformatics 27:2987-2993, 2011). These variants are typically single nucleotide polymorphisms (SNPs), multi-nucleotide polymorphisms (MNPs), small insertions or deletions (e.g., indels), large indels, genetic re-arrangements such as fusions, and the like. In a particular embodiment, the relevant variants are SNPs. Optionally, coordinate transformations can be applied to convert the nucleotide sequences to their corresponding three letter codon representations. In a particular embodiment, variants are computed both in nucleotide and codon space.

(8) Referring to FIG. 1 at 400, for variants thus found in step 300, the significance of variance is computed using any of various statistical approaches. Multiple approaches are suitable at this stage. For example, a parametric model, such as a model based on Poisson distribution, may be assumed and p-values computed. Alternatively, non-parametric approaches are also suitable. In certain variations, a Bayesian model is used to establish likelihood estimates. Optionally, additive smoothing, such as Laplace smoothing, is applied to make the model more consistent. Optionally, a false discovery rate correction method is applied. In a particular embodiment, the significance of nucleotide and codon variants are separately computed using a Poisson model. In this embodiment, the Poisson model parameters are typically derived separately. In some such embodiments, an additional Laplace smoothing is applied to smooth the means of the Poisson distribution and consequently the p-values. In a separate embodiment, the Bonforroni FDR correction is applied to the computed p-values. The p-values can be used to select the most significant variants observed in the sequences.

(9) Referring to FIG. 1 at 500, cluster analysis is performed. Sequences are clustered based on co-occurrence of variants. The most significant p-values can be used to determine the “features” of the clustering analysis. Any of many clustering algorithms can be used, including, for example, k-means, Principal Component Analysis, Singular Value Decomposition, or Non-negative Matrix Factorization (NMF). In a particular embodiment, NMF is used to generate clusters. Output of NMF is a set of clusters where each cluster has a score, a membership which is a subset of variants. Also, sequence IDs for sequences that belong to each cluster, and the number of sequences belonging to each cluster, may be obtained. Based on the cluster analysis, one or more subset of sequences in which a set of significant variants co-occur may be identified, thereby identifying one or more minority genotypes of the target nucleic acid.

(10) In some embodiments, the method for detecting a minority genotype further includes amplifying one or more regions of the target nucleic acid prior to the deep sequencing step, and the deep sequencing comprises sequencing the one or more amplified regions. Generally, amplifying one or more target regions includes the following steps: (1) contacting a sample containing or suspected of containing the target nucleic acid with at least two amplification oligomers for amplifying at least one target region of the target nucleic acid; and (2) performing at least one in vitro nucleic acid amplification reaction, wherein any target nucleic acid present in the sample is used as a template for generating at least one amplification product corresponding to the at least one target region.

(11) In certain embodiments, the method further includes purifying the target nucleic acid from other components in a sample before the deep sequencing step, or before any amplification step prior to sequencing. Such purification may include methods of separating and/or concentrating target nucleic acid contained in a sample from other sample components. In particular embodiments, purifying the target nucleic acid includes capturing the target nucleic acid to specifically or non-specifically separate the target nucleic acid from other sample components. Non-specific target capture methods may involve selective precipitation of nucleic acids from a substantially aqueous mixture, adherence of nucleic acids to a support that is washed to remove other sample components, or other means of physically separating nucleic acids from a mixture that contains the target nucleic acid and other sample components.

(12) In some embodiments, a target nucleic is selectively separated from other sample components by hybridizing the target nucleic acid to a capture probe oligomer. In some variations, the capture probe oligomer comprises a target-hybridizing sequence configured to specifically hybridize to a target sequence within the target nucleic acid so as to form a target-sequence:capture-probe complex that is separated from sample components. In other variations, the capture probe oligomer uses a target-hybridizing sequence that includes randomized or non-randomized poly-GU, poly-GT, or poly U sequences to bind non-specifically to a target nucleic acid so as to form a target-sequence:capture-probe complex that is separated from sample components. In specific variations, the capture probe oligomer comprises a target-hybridizing sequence that includes a randomized poly-(k) sequence comprising G and T nucleotides or G and U nucleotides (e.g., a (k).sub.18 sequence), such as described, for example, in WIPO Publication No. 2008/016988, incorporated by reference herein.

(13) In a preferred variation, the target capture binds the target:capture-probe complex to an immobilized probe to form a target:capture-probe immobilized-probe complex that is separated from the sample and, optionally, washed to remove non-target sample components (see, e.g., U.S. Pat. Nos. 6,110,678; 6,280,952; and 6,534,273; each incorporated by reference herein). In such variations, the capture probe oligomer further includes a sequence or moiety that binds the capture probe, with its bound target sequence, to an immobilized probe attached to a solid support, thereby permitting the hybridized target nucleic acid to be separated from other sample components.

(14) In more specific embodiments, the capture probe oligomer includes a tail portion (e.g., a 3′ tail) that is not complementary to the target sequence but that specifically hybridizes to a sequence on the immobilized probe, thereby serving as the moiety allowing the target nucleic acid to be separated from other sample components, such as previously described in, e.g., U.S. Pat. No. 6,110,678, incorporated herein by reference. Any sequence may be used in a tail region, which is generally about 5 to 50 nt long, and preferred embodiments include a substantially homopolymeric tail of about 10 to 40 nt (e.g., A.sub.10 to A.sub.40), more preferably about 14 to 33 nt (e.g., A.sub.14 to A.sub.30 or T.sub.3A.sub.14 to T.sub.3A.sub.30), that bind to a complementary immobilized sequence (e.g., poly-T) attached to a solid support, e.g., a matrix or particle.

(15) In particular variations, the capture probe oligomer comprises a randomized poly-(k) nucleotide sequence (e.g., (k).sub.18) and a 3′ tail portion, preferably a substantially homopolymeric tail of about 10 to 40 nt (e.g., T.sub.3A.sub.30).

(16) Target capture typically occurs in a solution phase mixture that contains one or more capture probe oligomers that hybridize to the target nucleic acid under hybridizing conditions, usually at a temperature higher than the T.sub.m of the tail-sequence:immobilized-probe-sequence duplex. For embodiments comprising a capture probe tail, the target:capture-probe complex is captured by adjusting the hybridization conditions so that the capture probe tail hybridizes to the immobilized probe, and the entire complex on the solid support is then separated from other sample components. The support with the attached immobilized-probe:capture-probe:target may be washed one or more times to further remove other sample components. Preferred embodiments use a particulate solid support, such as paramagnetic beads, so that particles with the attached target:capture-probe:immobilized-probe complex may be suspended in a washing solution and retrieved from the washing solution, preferably by using magnetic attraction. To limit the number of handling steps, the target nucleic acid may be amplified by simply mixing the target nucleic acid in the complex on the support with amplification oligomers and proceeding with amplification steps.

(17) Amplifying a region of a target nucleic acid utilizes an in vitro amplification reaction using at least two amplification oligomers that flank a target region to be amplified. Suitable amplification methods include, for example, replicase-mediated amplification; polymerase chain reaction (PCR), including reverse transcription polymerase chain reaction (RT-PCR); ligase chain reaction (LCR); strand-displacement amplification (SDA); and transcription-mediated or transcription-associated amplification (TMA). Such amplification methods are well-known in the art and are readily used in accordance with the methods of the present invention.

(18) Once one or more target regions are amplified to produce one or more corresponding amplification products, the amplification product(s) may be used in subsequent procedures deep sequencing, alignment, and analysis to detect a minority genotype for the target nucleic acid as described herein.

(19) In certain embodiments, the target nucleic acid to be sequenced and analyzed according the present method is from a sample obtained from a patient suffering or at risk of a disease or disorder. In some such variations, the target nucleic acid is a disease-associated nucleic acid that may be characterized by heterogeneity within a single patient. Examples of such intra-patient heterogeneity include, e.g., intra-tumor genetic heterogeneity associated with some cancers as well as infectious agents (for example, hepatitis C virus (HCV)) that exhibit genetic heterogeneity within an infected subject due to, e.g., rapid mutation as it replicates. The methods of the present invention can be used to examine quasispecies complexity in such a patient and to detect a minority genotype that, for example, may be present in a low copy number during an early stage of infection or disease progression but that may be more resistant to treatment. In certain embodiments, where a detected minority genotype has a known effect on or correlation to efficacy and/or resistance to one or more treatment regimes for the disease or disorder, the method further includes determining a treatment regime for the patient based on the detected minority genotype. In some such embodiments, the method further includes administering the selected treatment regime to the patient.

(20) The method of the present invention also has particularly useful applications in studying disease-associated nucleic acid complexity in a patient population. For example, in some embodiments, the minority genotype is detected in a plurality of samples from patients suffering from a disease or disorder, such as, e.g., an infectious disease (e.g., HCV infection) or a cancer. In more specific variations, the method can be used to examine genetic heterogeneity associated with disease subtypes in patient populations receiving one or more treatment regimes in order to more effectively interpret clinical testing data. For example, in some embodiments in which the minority genotype is detected in a plurality of samples from patients suffering from a disease or disorder, the patients are receiving a predetermined treatment regime for the disease or disorder. In some such variations, the method further includes (i) monitoring the patients to obtain efficacy and/or resistance data for the treatment regime and (ii) determining the presence or absence of a correlation between the detected minority genotype and the efficacy and/or resistance data. These variations are particularly useful for assessing drugs in order to more effectively stratify and interpret efficacy and resistance data.

(21) For example, in some embodiments, a target nucleic acid is obtained from a plurality of samples from each of one or more patients suffering from a disease or disorder and receiving a treatment regime, where at least two of the samples from each patient each corresponds to a different time point during the course of treatment, and steps (a)-(d) as described herein are performed with respect to the target nucleic acid from each of the at least two samples. Typically, the method further includes a step in which, for each of one or more variants within the common set of significant variants determined in step (d), variant frequencies and/or variant significance values (e.g., Poisson p-values) for the samples corresponding to different time points are compared to determine whether the variant is a “differential variant” exhibiting different frequencies and/or significance values during the course of the treatment regime. If more than one minority genotype is detected within the samples corresponding to different time points, a subset of the minority genotypes may be selected and, for each of one or more variants within the common set of significant variants determined in step (d) for each of the detected minority genotypes or subset thereof, the variant frequencies and/or variant significance values for the samples corresponding to different time points may be compared to determine whether the variant is a differential variant. If a more than one differential variant is identified, a subset of the identified differential variants may be selected, and two-way hierarchical clustering may be performed on (1) the frequencies and/or significance values of the identified differential variants or subset thereof and (2) the samples containing the target nucleic acid in which differential variants were detected. Results from the two-way hierarchical clustering may then be used to determine a correlation between at least one differential variant and the course of the treatment regime. In some embodiments, the method further includes monitoring the one or more patients to obtain efficacy and/or resistance data for the treatment regime (typically including the same time points corresponding to the patient samples), and determining a correlation between at least one differential variant and the efficacy and/or resistance data (e.g., a correlation with at least one efficacy and/or resistance response to the treatment regime). Determining the presence of a correlation with efficacy and/or resistance data may include a comparison of the efficacy and/or resistance data with results from the two-way hierarchical clustering of differential variants across clinical samples. In particular embodiments, a differential variant is identified as one that is either resistant or susceptible to the treatment regime.

(22) In other embodiments, a method of the present invention is used to examine genetic heterogeneity during the progression of a disease or disorder. In some such embodiments, the method includes obtaining a target nucleic acid from a plurality of samples from each of one or more patients suffering from the disease or disorder, where at least two of the samples from each patient each corresponds to a different time point during the course of the disease or disorder progression, and steps (a)-(d) as described herein are performed with respect to the target nucleic acid from each of the at least two samples. Typically, the one or more patients are also monitored for one or more symptoms of the disease or disorder at each of the different time points. In preferred embodiments, the variation detection results of step (b) include a determination of the frequency for each variant and/or the variant significance results of step (c) include a significance value (e.g., a Poisson p-value) assigned to each variant, and the method further includes a step in which, for each of one or more variants within the common set of significant variants determined in step (d), the variant frequencies and/or variant significance values for the samples corresponding to different time points are compared to determine whether the variant is a “differential variant” exhibiting different frequencies and/or significance values during the progression of the disease or disorder. If more than one minority genotype is detected within the samples corresponding to different time points, a subset of the minority genotypes may be selected and, for each of one or more variants within the common set of significant variants determined in step (d) for each of the detected minority genotypes or subset thereof, the variant frequencies and/or variant significance values for the samples corresponding to different time points may be compared to determine whether the variant is a differential variant. A correlation between at least one differential variant and the disease or disorder progression may also be determined. If a more than one differential variant is identified, a subset of the identified differential variants may be selected, and two-way hierarchical clustering may be performed on (1) the frequencies and/or significance values of the identified differential variants or subset thereof and (2) the samples containing the target nucleic acid in which differential variants were detected. Results from the two-way hierarchical clustering may then be used to determine a correlation between at least one differential variant and the disease or disorder progression.

(23) The invention is further illustrated by the following non-limiting examples.

Example 1

(24) A specific example of HCV quasi-species detection was performed to illustrate the practical applicability of the present invention. A deep sequencing assay was developed for deep sequencing of entire NS3, NS4 and NS5 regions from HCV by combining a hybridization-based Target Capture method, RT-PCR and Pacific Biosciences' single molecule real-time sequencing platform. This assay, together with associated bioinformatics data analysis, provides a method for investigating quasispecies complexity of HCV virus.

(25) Table 1 below summarizes a specific experiment for which the variant analysis and clustering for linkage was performed. Sequencing data was generated from such a sequencing run. A specific set of steps are described that processed data generated from such a sequencing run.

(26) TABLE-US-00001 TABLE 1 Step Summary Target Capture Gen-Probe's Hybridization based Target Capture of HCV RNA from serum/plasma. Mutations Two point mutations in NS3 were introduced into a cloned HCV 1a wild-type sequence. This mutated species is used as the minority species. Amplicons Reverse transcription PCR (RT-PCR) of selected viral regions. Wild-type and mutant species amplified separately. 1 kb amplicons of NS3 regions generated. Sanger Amplicons of the mutated species resulting from the Sequencing above step Sanger sequenced for confirmation. Mixture Wild-type and mutant species mixed in desired concentration e.g. 5% minority mutant species and 95% wild-type species. Template Standard SMRTbell template preparation preparation recommended for 1 kb amplicons using C2 chemistry kits. Sequencing Standard SMRT Sequencing on 8 or 16 SMRTcells using C2 chemistry and compatible instrument software.
1) Sequence Alignment

(27) All circular consensus sequences (CCS) produced by the PacBio RS for a given SMRTcell were mapped to the reference sequence with BWA-SW, using the parameters recommended by the Broad Institute for mapping PacBio data. All sequencing reads that did not form CCS were discarded. All CCS reads that did not map to the reference were discarded. For CCS reads with multiple mappings, only the alignment with the highest overall mapping quality was retained. The resulting file was a SAM formatted file. This corresponds to step 200 in FIG. 1.

(28) Error analysis of alignment was performed. At each base position, if the sequence nucleotide did not match the reference nucleotide, the base position was marked erroneous. Error rate at a particular base position was defined by a ratio where the numerator is number of errors across all sequences at the particular base position and denominator is the number of sequences that span the base position.

(29) The raw sequencing reads from PacBio RS can also be aligned to generate SAM file and all subsequent steps are applicable. But the error rate of raw reads is significantly higher than CCS reads and may not result in usable data when the variant frequency is low (e.g., <20%).

(30) 2a) Variant Identification—Nucleotide Space

(31) All aligned CCS reads were piled-up on a nucleotide-by-nucleotide basis in a manner identical to the samtools pileup format. This created, for each nucleotide in the reference sequence, a list of all nucleotides aligned to that position from the alignments retained above. This list of aligned sub-sequences was then filtered to remove insertions and deletions, and to remove all sub-sequences below a quality threshold. At any reference position, the list of variants was gathered. If the reference at position p is an A, and if there are 100 sequences that are aligned at position p and 50 of them are A, 45 are T, 5 are C, then the possible variants at position p are {T, C}. This step corresponds to step 300 in the FIG. 1.

(32) Once the variants were identified, their p-values were computed. From the error analysis, the “baseline” substitution error at mean read length was chosen. Then at a given position, the frequency for a given variant was observed. The likelihood for this variant at this position was computed as L=−2*Ln(baseline substitution error rate/observed frequency)=2*Ln(observed frequency/baseline substitution error rate). The p-value was computed assuming degree of freedom=1 and Chi-square distribution, as shown below: 1. For each variant type at each position, calculate the log-likelihood ratio of variant frequency vs. positional substitution error,

(33) $L = 2 * Ln (\frac{freq (Observed Variant)}{freq (Underline Substitution Errors)})$ 2. L is subject to Chi-squared distribution with degree of freedom=1. 3. The cumulative Chi-squared distribution formula (usually Chi-squared CDF is available from most of scientific softwares, including MS Excel).

(34) $P value = 1 - CDF {\underline{χ}}^{2} (L, k) = 1 - \frac{γ (\frac{k}{2}, \frac{L}{2})}{Γ (\frac{k}{2})}$ When the degree of freedom k=1,

(35) $\begin{matrix} P value = 1 - \frac{1}{\sqrt{π}} * \sqrt{π} \erf (\sqrt{\frac{L}{2}}) \\ = 1 - \erf (\sqrt{\frac{L}{2}}) \end{matrix}$ Where the error function, erf, is defined as,

(36) $\erf (z) = \frac{2}{\sqrt{π}} {.Math.}_{n = 0}^{\infty} \frac{{(- 1)}^{n} z^{2 n + 1}}{n! (2 n + 1)} = \frac{2}{\sqrt{π}} (z - \frac{z^{3}}{3} + \frac{z^{5}}{10} - \frac{z^{7}}{42} + \frac{z^{9}}{216} - .Math.)$ Example: Without loose generality, the “baseline” substitution error is chosen at mean read length. In this case, 0.11% is used as the baseline substitution error. If at a given position, the observed frequency rate for a given variant is 0.98%, the likelihood for this variant at the position is
L=2*Ln(0.98/0.11)=4.374 With degree of freedom=1, we can calculate the p value of L=4.374 based on Chi-squared distribution p=0.03649124.

(37) The lower the p-value is, the more significant the variant is. Typically at each site, one of the variant will emerge as the most significant variant at that particular position. The results of all of the significance tests from all of the analyzed positions were then combined into a list and sorted by p-value. Table 2 describes the resulting most significant variants. This analysis of significance corresponds to step 400 in the overview.

(38) TABLE-US-00002 TABLE 2 Top 10 Variants Rank Reference Top Variant Reference Top Order Position Depth Frequency % P Value Seq Variant 1 446 1739 4.95 0.0085 C T 2 482 1740 4.83 0.0087 A T 3 414 1737 0.98 0.0548 C G 4 154 1724 0.70 0.0829 G C 5 973 1725 0.64 0.0924 C G 6 922 1726 0.64 0.0925 G C 7 350 1735 0.63 0.0931 G C 8 28 1705 0.53 0.1173 A G 9 239 1731 0.52 0.1195 C G 10 151 1723 0.46 0.1383 G C
2b) Variant Identification—Codon Space

(39) All aligned CCS reads were piled-up on a codon-by-codon basis in a manner similar to the samtools pileup format. This created, for each codon in the reference sequence, a list of all codons aligned to that position from the alignments retained above. This list of aligned sub-sequences was then filtered to remove insertions and deletions, and to remove all sub-sequences below a quality threshold. The coverage at a given position is defined as the number of sub-sequences remaining aligned to that position after this filtering has been performed. If the codon at position p is GCC and the coverage is 100, and 50 of them are GCC, 45 are GTC, 5 are GGC, then the possible variants at position p are {GTC, GGC}. This step corresponds to step 300 in FIG. 1.

(40) Next, the counts for each of the variant codons were identified from the pileup and tabulated. Laplacian smoothing was then applied to resulting counts. Each smoothed variant count was then compared to its expected value, which was estimated as the coverage at a given position multiplied by the rate of similar mutations observed in a large corpus of control data. The significance of the difference between these two numbers was estimated with a one-tailed test from the Poisson distribution. The observed p-values can also be corrected using Bonferroni correction to mitigate false discovery rate. The lower the p-value is, the more significant the variant is. The results of all of the significance tests from all of the analyzed positions were then combined into a list and sorted by p-value. Table 3 describes the resulting most significant variants. This analysis of significance corresponds to step 400 in FIG. 1.

(41) TABLE-US-00003 TABLE 3 Top 10 Mutants Frequencies Reference Reference Mutant Position Codon Codon Frequency % Pvalue 161 GAC GTC 1.18 9.095E−31 149 GCC GTC 0.97 9.049E−18 138 TGC TGG 0.90 1.604E−09 270 GGT GGC 0.50 9.917E−06 139 CCC CCG 0.35 2.105E−05 307 GGG GTG 0.36 6.687E−05 19 AAA TAA 0.15 7.150E−05 107 ATT GTG 0.07 1.172E−04 321 TCC ACC 0.14 2.029E−04 247 ACC ACG 0.21 2.688E−04
3) Data Preparation for Clustering

(42) A subset of the most significant variants from the list generated above were then selected for linkage analysis. Either the top 25 variants or all variants with p-values <0.001 was selected, depending on which criteria was more stringent.

(43) Returning to the aligned CCS reads, a matrix of presence-absence calls was generated for the selected variants in each of the alignments with one column for each variant in the subset and one row for each aligned sequence. Each read was then iterated over and each position with at least one variant from the selected subset was examined. The presence of such a mutation was recorded as a ‘1’ in the appropriate position in the data matrix, while its absence was recorded as a ‘0.’ Any such positions that were un-readable due to deletions or clipping of low-quality sequence were assumed to be wild-type and recorded as a ‘0’ to minimize the chance of false-positives. Finally, a small amount of random noise was added to every cell in the matrix to ensure the final data matrix was invertible and non-degenerate.

(44) 4) NMF Clustering

(45) The first step to performing any clustering analysis is to estimate the optimal number of clusters or rank in the result. To do this, the NMF algorithm (see, e.g., Gaujoux R, Seoighe C., BMC Bioinformatics 11:367, 2010) was run 10 times at each value in a range of possible ranks, typically from 2 to 15. For each rank, only the most accurate result was retained as determined by which replication had the lowest residual sum-of-squares at the termination of the algorithm. Next, the cophenetic correlation was calculated for the best matrix from each rank to estimate the goodness-of-fit of each set of clusters. The results were then graphed and inspected, and the first rank that showed a decrease in cophenetic correlation relative to the preceding position (i.e., the first rank that begins to show over-training) was selected as the optimal rank.

(46) Once the optimal rank was selected, the NMF algorithm was re-run an additional 100 times on the data set with the selected rank. The most accurate result as determined by the residual sum-of-squares was then retained for detailed analysis.

(47) The primary output of the clustering analysis was a matrix of dimensions R-by-N, where R was the selected rank and N was the size of the variant subset. The numbers in the matrix thus corresponded to the raw weight assigned to a given variant in a particular cluster, so to accurately separate the major and minor components of each cluster a copy of the matrix was made and normalized such that the contents of each row (cluster) to sum to 1. From this normalized matrix the major variants of each cluster were identified as those components with normalized weightings above a threshold (usually 0.1). Any remaining non-zero components likely corresponded to sequencing errors; therefore, the corresponding values in the un-normalized matrix were set to ‘0’ so they did not affect the scoring of the clusters.

(48) The secondary output of the clustering analysis was a matrix of basis values of dimensions M-by-R, where M was the number of reads in the original matrix and R was the rank selected. The values of each row in this matrix corresponded to the weight of each of the R clusters needed to approximate the corresponding row from the original matrix. Therefore, to score each of the clusters, the original matrix from each cluster was recomposed individually and its similarity to the original data matrix was measured. To do so, each column of basis values from this M-by-R matrix and the corresponding row from the matrix of cluster weightings were selected and then multiplied to get X′, an estimation of the original matrix X. This estimate was then subtracted from the original matrix X to get a difference matrix, D. The score for the cluster can then be calculated as the difference between the sum-of-squares of these matrices, respectively X.sub.SS and D.sub.SS. This difference, the explained sum-of-squares or E.sub.SS, represented the amount of information from matrix X captured in the estimation X′ generated from a given cluster. Finally, in order to present the results in a normalized and human-readable format, the ratio of the explained sum-of-squares was used to the total (E.sub.SS/X.sub.SS). This step corresponds to step 500 in the FIG. 1.

(49) 5) Human Readable Data Output

(50) Clusters were sorted in descending order of the explained sum-of-squares, and output with the names of the columns making significant contributions to that cluster, i.e., the variants that are a part of a given linkage grouping. Also output were the inclusive and exclusive counts for each cluster, which were the number of reads exhibiting at least one of the variants in a cluster, and the number of reads exhibiting all of them, respectively. Table 4 describes a sample output from the analysis. In this particular case, the minority species with two relevant mutations were seen as the most significant cluster.

(51) TABLE-US-00004 TABLE 4 Inclusive Count Exclusive Count Mutations with significant contribution to the Cluster Score (% of TOTAL) (% of TOTAL) cluster; Position (Reference > Mutant) 25.67 18 (1.24%) 13 (0.89%) 149 (GCC > GTC), 161 (GAC > GTC) 11.68 13 (0.89%) 13 (0.89%) 138(TGC > TGG) 8.07 9 (0.62%) 9 (0.62%) 270(GGT > GGC) 7.3 9 (0.62%) 1 (0.07%) 94(TCC > GCC), 139(CCC > CCG) 6.4 7 (0.48%) 1 (0.07%) 215(GGC > GGFA), 256(AGC > AGG) 5.4 6 (0.41%) 6 (0.41%) 307(GGG > GTG) 2.4 15 (1.03%) 0 (0.00%) 18(GAC > AAC), 57(ATC > ACC), 123 (CGG, GGG)

Example 2

(52) The deep sequencing assay referenced in Example 1—for deep sequencing of the entire NS3, NS4 and NS5 regions from HCV by combining a hybridization-based Target Capture method, RT-PCR and Pacific Biosciences' single molecule real-time sequencing platform—was run on clinical samples from HCV-infected patients. The clinical samples were taken both before and after the start of antiviral drug treatment, at four different time points: one day before treatment started (−1T), one day after treatment started (1T), two days after treatment started (2T), and four days after treatment started (4T).

(53) Variant analysis and NMF clustering for linkage was performed on the deep sequencing results from these samples using a method similar to that outlined in parts 1 through 5 of Example 1. Briefly, circular consensus sequences (CCS) produced by the PacBio RS for a given SMRTcell were aligned to the consensus sequence of HCV with BWA-SW. BWA-SW parameters used in this study included the following: mismatch penalty—3; gap open penalty—5; gap extension penalty—2; more penalty for indels; less penalty for mismatches. Codons exhibited in the CCS reads were piled-up along the reference based on the first reading frame. Variant codons in the samples were detected compared to the reference sequence, and the frequency of each variant was calculated. A Lambda table was created based on the codon pile up of wild-type (or control) sample. A Poisson distribution of variants was assumed and the P value for each detected variant was calculated using the Lambda table.

(54) For NMF clustering, about half of the reads were filtered out based on base quality score, and variant codons with <1% frequency were removed (p-values could also be used as a basis for removing variants). An assumption of 10 clusters was used: 10 clusters of variant codons were obtained from each run, with about 200 variant codons per run.

(55) The top-scoring NMF clusters were then compared between clinical samples corresponding to the −1T, 1T, 2T, and 4T time points to identify clusters exhibiting different frequencies during the course of antiviral drug treatment, thereby identifying clusters—and the variant codons contributing to each cluster—that relate to drug responses. Two-way hierarchical clustering was performed with the variant codons contributing to each cluster (“differential variants”) and the clinical samples corresponding to the different time points to create a heat map of 119 differential variants plotted against the time-associated samples.

(56) Based on the two-way hierarchical clustering, correlations between differential variants and the drug treatment time course were determined. Certain differential variants were found to fit within particular correlation profiles. With “+” indicating a higher frequency of the differential variant and “−” indicating a lower frequency for times (−1T/1T/2T/4T), five such profiles were defined as follows: (+/+/+/−) for Profile 1; (−/−/+/−) for Profile 2; (+/+/−/+) for Profile 3; (−/+/++/+++) for Profile 4; and (+/−/−/+) for Profile 5.

(57) Sanger sequencing was also performed to confirm the identified differential variants. Results showed that 86% of the differential codon variants as determined by Sanger sequencing were detected by NMF clustering, while 83% of the differential codon variants as determined by Sanger sequencing were detected by the heat map profile filtering method described above.

(58) From the foregoing, it will be appreciated that, although specific embodiments of the invention have been described herein for purposes of illustration, various modifications may be made without deviating from the spirit and scope of the invention. Accordingly, the invention is not limited except as by the appended claims. All publications, patents, and patent applications cited herein are hereby incorporated by reference in their entireties for all purposes.

Method for detecting a minority genotype

Assignee

Inventors

Cpc classification

Classification Explorer

G16B40/00

PHYSICS

Classification Explorer

C12Q1/6809

CHEMISTRY; METALLURGY

Classification Explorer

C12Q2535/122

CHEMISTRY; METALLURGY

Classification Explorer

C12Q1/707

CHEMISTRY; METALLURGY

Classification Explorer

C12Q2537/165

CHEMISTRY; METALLURGY

Classification Explorer

C12Q2535/122

CHEMISTRY; METALLURGY

Classification Explorer

G16B20/20

PHYSICS

Classification Explorer

C12Q2537/165

CHEMISTRY; METALLURGY

Classification Explorer

C12Q2600/106

CHEMISTRY; METALLURGY

Classification Explorer

C12Q1/6809

CHEMISTRY; METALLURGY

Classification Explorer

C12Q1/707

CHEMISTRY; METALLURGY

Classification Explorer

C12Q2600/156

CHEMISTRY; METALLURGY

Classification Explorer

C12Q2600/16

CHEMISTRY; METALLURGY

Classification Explorer

G16B20/00

PHYSICS

International classification

Classification Explorer

C12Q1/6809

CHEMISTRY; METALLURGY

Classification Explorer

G16B40/00

PHYSICS

Classification Explorer

G16B20/00

PHYSICS

Classification Explorer

G16B20/20

PHYSICS

Abstract

Claims

Description