SYSTEMS AND METHODS FOR DNA AMPLIFICATION WITH POST-SEQUENCING DATA FILTERING AND CELL ISOLATION
20170226588 · 2017-08-10
Inventors
Cpc classification
C12Q2537/165
CHEMISTRY; METALLURGY
C12Q1/6806
CHEMISTRY; METALLURGY
International classification
Abstract
A heuristic filtering system and method are described for variant DNA within a heterogeneous cell sample. After ion semiconductor sequencing, the amplicons are processed through a series of filters designed to eliminate noise in the variants to provide a clearer set of variant results. Reports are generated, showing both the filtered results and the effects the filters had on the original data.
Claims
1. A method for detection of variant DNA in a heterogenous cell sample, the method comprising: sequencing the heterogenous cell sample from a subject, producing an input sequence; and applying a heuristic filter pipeline to the input sequence, producing an output report.
2. The method of claim 1, further comprising: sequencing a control cell sample from the subject, producing a control sequence.
3. The method of claim 2, wherein the heuristic filter pipeline further comprises at least one of: determining amplicons to be excluded; determining read positions to be excluded; and determining variants to be excluded.
4. The method of claim 3, wherein the heuristic filter at least comprises said determining the amplicons to be excluded, and said determining the amplicons to be excluded comprises counting the number of reads mapped to each amplicon and excluding each amplicon that has a number of mapped reads below a threshold value.
5. The method of claim 3, wherein the heuristic filter pipeline at least comprises said determining the read positions to be excluded, and said determining the read positions to be excluded comprises at least one of: excluding each position that has a number or percentage of variant base calls below a variant count threshold; excluding each read position that has been identified in a database to be excluded; and excluding each position that is only present in a number of reads below a base coverage threshold.
6. The method of claim 3, wherein the heuristic filter pipeline at least comprises said determining the variants to be excluded, and said determining the variants to be excluded comprises at least one of: excluding each variant that is found in a negative control sequence at that variant's position; excluding each variant that is found within an end of read threshold range of that variant's corresponding read; excluding each variant that is within a homopolymer having a length equal to or greater than a homopolymer threshold; excluding each read that contains any variant that has another variant within a cluster threshold range on that read; excluding each variant, each of said each variant being at a corresponding variant position, that has over a variant threshold number of other variants within a global threshold range of the corresponding variant position on any read; and excluding each variant that is determined to be excludable based on clinical ramifications.
7. The method of claim 4, wherein the threshold value is a value from 500 to 2000.
8. The method of claim 5, wherein the variant count threshold is 1% of the number of reads containing that position.
9. The method of claim 5, wherein the base coverage threshold is a value from 500 to 2000.
10. The method of claim 6, wherein the end of read threshold range is 11.
11. The method of claim 6, wherein the homopolymer threshold is 4.
12. The method of claim 6, wherein the cluster threshold range is 100.
13. The method of claim 6, wherein the variant threshold is 0 and the global threshold range is 5.
14. The method of claim 1, further comprising posting the output report.
15. The method of claim 1, wherein the output report includes a report of candidate variants that the heuristic filter removed from an output result of variants.
16. The method of claim 1, wherein the sequencing comprises ion-to-bases sequencing.
17. A computer system comprising: at least one processor and memory configured to perform: generation of a user interface; file input; the method of claim 1; and file output.
18. The system of claim 17, further comprising a database.
19. The system of claim 18, wherein the database is a relational database.
20. The method of claim 1, further comprising procuring the heterogenous cell sample from the subject.
21. The method of claim 6, further comprising excluding each position that has a percentage of variant base calls below a variant count threshold for all reads not excluded by said excluding each read that contains any variant that has another variant within a cluster threshold range on that read.
Description
BRIEF DESCRIPTION OF DRAWINGS
[0006] The accompanying drawings, which are incorporated into and constitute a part of this specification, illustrate one or more embodiments of the present disclosure and, together with the description of example embodiments, serve to explain the principles and implementations of the disclosure.
DETAILED DESCRIPTION
[0014] Genome sequencing is useful for detection and identification of disease mutations in cells, such as with cancer. Difficulties can arise in computer-aided sequencing when the biological sample, for example taken from a blood sample from a patient, contains a heterogeneous mixture of cell deoxyribonucleic acid (DNA).
[0015] Nucleic acid sequencing is a method for determining the exact order of nucleotides present in a given DNA or RNA molecule. Next-generation sequencing (NGS), also known as high-throughput sequencing, is a term used to describe a number of different modern nucleic acid sequencing technologies including Illumina™ sequencing, Roche 454™ sequencing, Ion Torrent: Proton/PGM™ sequencing and SOLiD™ sequencing. These sequencing technologies allow one to sequence DNA and RNA quickly and cheaply compared to the previously used Sanger sequencing.
[0016] The terms “nucleic acids” and “polynucleotides” as used herein refer to biological molecules comprising a plurality of nucleotides. Exemplary nucleic acids include deoxyribonucleic acids (DNA) and ribonucleic acids (RNA), each synthesized from four different types of nucleotides, also called “bases”. The nucleotides for DNA include deoxy-adenosine (“A”), deoxy-thymidine (“T”), deoxy-cytosine (“C”), and deoxy-guanosine (“G”). The nucleotides for RNA include adenosine (“A”), uracil (“U”), cytosine (“C”) and guanosine (“G”). The nucleotides of a DNA or RNA are arranged in a particular order, referred to as the sequence of the DNA or RNA. The precise order of nucleotides, i.e. the four bases, within a DNA or RNA molecule is determined using nucleic acid sequencing methods.
[0017] In cases where a suspected disease or condition is concerned, targeted sequencing of specific genes or genomic regions is preferred. Compared to whole genome sequencing, which sequences an entire genome, targeted sequencing focuses on a sequence segment of interest comprising one or more specific genes or genomic regions. Targeted sequencing generally yields higher coverage of genomic regions of interest and reduces sequencing cost and time.
[0018] “Amplicon sequencing” as used herein refers to a targeted sequencing method in which a discrete region of a genome is first amplified from the entire genome using PCR and the generated amplicons are used as templates for subsequent sequencing. Amplicon sequencing is typically used to investigate genetic variants in complex and heterogeneous samples. Sequencing can be carried out in a sample containing amplification products of a single amplicon. Alternatively, the sample can contain mixtures of multiple amplicons pooled together, as will be understood by a skilled person. Amplicon Sequencing is a method where multiple amplicons are pooled together and co-sequenced.
[0019] “Amplicons” as used herein are defined as replicated DNA (or ribonucleic acid—RNA) strands that are formed by polymerase chain reaction (PCR), ligase chain reactions (LCR), or other DNA duplication methods, where the strands are copies of a target region of a genome. In order to multiplex PCR amplification, each amplicon has to be unique and independent (no overlapping amplicons), which requires careful selection of the primers used to tag the regions to be amplified. Amplicons for sequencing have a length typically in the range between 100 bp and 500 bp.
[0020] The processing and sequencing of amplicons with different sequencing platforms can be flexible and allows for a range of experimental designs. A variety of options regarding design parameters can be selected, such as the length of amplicons, the number of amplicons pooled together, the number of reads desired for a given amplicon or a pool of amplicons, whether to read from one end (unidirectional sequencing) or both ends (bi-directional sequencing) of the amplicon and other factors identifiable to a skilled person in the art.
[0021] “Read” or “reads” as used herein are defined as a sequenced range of DNA or RNA. A read can be a sequence that is output by a sequencing instrument, where the read attempts to match a range of DNA that was input to the instrument. Each set of reads maps to a particular amplicon, with a read being a sequence for the complete amplicon or, typically, a range of bases comprising a subset of the amplicon. The total set of reads in the input data for the filter pipeline can include multiple amplicons, each having multiple reads mapped to it. The range of the read lengths depends upon the primers chosen for a given library. The mapping of reads to an amplicon can be determined during alignment/assembly using a sequencing alignment tool, for example the Bowtie™ 2 read alignment tool from Johns Hopkins University (see “Fast gapped-read alignment with Bowtie 2” by Ben Langmead and Steven L. Salzberg, Nat Methods, Author manuscript; PMC Apr. 1, 2013).
[0022] In order to analyze libraries formed from heterogeneous mixtures of DNA (i.e., a mixture of different cells), rare sequencing events that contain a disease mutation, called herein the “signal”, must be differentiated or filtered from extraneous sequencing information, called herein the “noise”. A signal that is of the same order of magnitude as noise (e.g., a high frequency of DNA in the sample that is not being targeted for analysis) is difficult to interpret unless a specific filtering method is used to remove at least some of the noise.
[0023] There are at least two sources of noise in the sequencing pipeline. First, the DNA mixtures that are produced from input pellets (DNA or cell pellets) are complicated mixtures of cells and therefore any useful signal is diluted by DNA that has no informational content. A second source of noise is due to the specific sequencing technology employed. For example, sequencing noise or “machine” noise can be derived from an ion-to-bases sequencing process, for example with the Ion Torrent™ Personal Genome Machine (PGM™) platform. In particular, ion detection sequencing that reads bases based on pH detection is sensitive to homopolymers and will sometimes read a homopolymer chain as being one base too long or too short, particularly if the chain is long.
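By way of non-limiting illustration, the homopolymer sensitivity described above can be screened for by measuring the run of identical bases that covers a given reference position. The following Python sketch is an assumption of how such a check might look; the function names and the example reference string are hypothetical, and the default threshold of 4 is taken from the example value given for the Homopolymer filter in this disclosure.

```python
def homopolymer_run_length(reference: str, pos: int) -> int:
    """Length of the run of identical bases in `reference` covering index `pos` (0-based)."""
    base = reference[pos]
    start = pos
    while start > 0 and reference[start - 1] == base:
        start -= 1
    end = pos
    while end + 1 < len(reference) and reference[end + 1] == base:
        end += 1
    return end - start + 1


def in_long_homopolymer(reference: str, pos: int, threshold: int = 4) -> bool:
    """True when `pos` lies in a run whose length is >= `threshold`."""
    return homopolymer_run_length(reference, pos) >= threshold


ref = "ACGTTTTTGCA"                      # hypothetical reference with a run of five T bases
print(homopolymer_run_length(ref, 4))    # 5
print(in_long_homopolymer(ref, 4))       # True
print(in_long_homopolymer(ref, 1))       # False
```

A candidate variant whose position returns True in such a check would be treated as unreliable for an ion-to-bases platform.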
[0024] As used herein, “ion-to-bases” refers to ion semiconductor sequencing or ion detection sequencing, a method of sequencing DNA based on the detection of hydrogen ions that are released during the polymerization of DNA. This is a method of sequencing by synthesis, such that a complementary strand is built based on the sequence of the target strand.
[0025] Based upon empirical evidence, the machine noise contribution can be 5% to 10% or higher. Based upon the nature of the rare cell pellets recovered from a cell isolation platform, the required theoretical sensitivity needs to be on the order of about 1% to enable useful patient information to be reproducibly recovered from samples. Given that this sensitivity is not compatible with the noise characteristics of the sequencing platform, an informatics based sequence filtering strategy is required to reduce the noise below the required sensitivity (for example, 1%, or one cell in one hundred being a target cell). The noise in a sequencing pipeline can be reduced significantly by a heuristic filtration method.
[0026] The ability to distinguish a sequence variant (SNV) from a non-variant/reference genome requires sufficient sampling of the test sample to ensure a statistically valid result (i.e., a satisfactory degree of confidence in the results). For example, at the 1.0% threshold this translates to 20 informative (mutation bearing) reads per 2000 total reads. Cell-free DNA, however, may not have enough integrity to allow that many reads, so a lower read threshold might be required, which in turn results in a lower level of confidence in the results. In addition to collecting a sufficient number of total reads, there are other considerations that affect the ability to call SNVs from sequencing tests. In order to call a sequence variant as a true mutation, confounding artifacts of the sequencing process must be excluded.
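The arithmetic in the preceding paragraph can be sketched as follows. This is an illustrative helper only; the function name is hypothetical, and the sensitivity is passed as a decimal string so that exact rational arithmetic is used rather than floating point.

```python
from fractions import Fraction
import math


def min_variant_reads(total_reads: int, sensitivity: str = "0.01") -> int:
    """Smallest number of mutation-bearing reads that reaches the given
    sensitivity (a decimal string such as "0.01" for 1%) at this read depth."""
    return math.ceil(total_reads * Fraction(sensitivity))


print(min_variant_reads(2000))  # 20, the example given above
print(min_variant_reads(500))   # 5, a lower read depth as may occur with cell-free DNA
```

At 500 total reads only 5 informative reads are needed to reach 1%, which illustrates why a lower read depth gives less confidence in a call.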
[0027] Sequence variants, also known as mutations, include deletions, insertions, substitutions and duplications of a single nucleotide or multiple nucleotides, and chromosome rearrangements such as translocation and inversion. A particular type of sequence variant is a genetic variation formed by a single base pair substitution, called a point mutation.
[0028] Once the FASTQ files (i.e., text-based files containing sequences of reads produced from a genome sequencing procedure) are exported from the ion-to-bases conversion server, they must be analyzed for sequence variants (SNVs). To do so, the experimental sequences must first be aligned to the reference sequence, which requires a sequence alignment software tool capable of aligning the FASTQ sequences to a human reference assembly. The alignment output is in BAM format, a binary version of the tab-delimited SAM (Sequence Alignment/Map) format. Once an indexed BAM file has been produced and gapped, the actual alignment can be visualized if needed.
[0029] Despite the alignment of each FASTQ read to the reference sequence (an amplicon), there is still a chance that a given base will be in error due to the base calling or due to biological or machine noise. Thus a post-alignment software program for sequence analysis has been developed. This program is called the “heuristic filter pipeline”, a series of filtering steps that generates an SNV report from the FASTQ data. This SNV report can then be exported into the LIMS (Laboratory Information Management System) for patient reporting. An example heuristic filter algorithm is as follows:
[0030] 1) Review each amplicon for reads mapping to that amplicon. Exclude the entire amplicon (i.e., all of the reads mapped to that amplicon, as determined, for example, from an alignment/assembly process) from the results if the number of mapped reads is below a threshold value. A threshold of 2000 is typical, but lower thresholds, such as 500, can be set if the threshold excludes too many amplicons. (Amplicon Coverage filter)
[0031] 2) Count the total variant base calls across all the reads for each position. If the number of variant base calls is below a threshold, exclude all SNVs at that position from the results. The threshold can be a percentage of variants for the reads (e.g., if less than 1% of the reads have a variant at that position, exclude the position from the results). (Variant Count filter)
[0032] 3) Exclude any positions that have been marked in the database as having known problems (for example, as known from previous runs, or from external knowledge and added to the database by a user). (Exclusion filter)
[0033] 4) Exclude any positions that have a number of reads below a threshold value (e.g., if a position is found in fewer than 2000 reads, exclude all SNVs at that position from the results). As with the Amplicon Coverage filter above, the threshold can be lowered if the higher value excludes too many positions. (Base Coverage filter)
[0034] 5) Using a “case/control” model, compare the experimental sample DNA to a negative control DNA for each SNV. Any candidate SNV of the experimental sample must not be present in the negative control. (Case/Control filter)
[0035] 6) Determine the position of the SNV relative to each end of the read. Any candidate SNV must be more than a set number of nucleotides (for example, 11) from either end of a trimmed read. This is based on the idea that hits near the ends of each sequence are unreliable. (End-of-Read filter)
[0036] 7) Evaluate the position n_i in the amplicon for homopolymers. Any candidate SNV shall not be found within a preexisting homopolymer tract of length greater than or equal to a set value (for example, 4 nucleotides) relative to the reference. This is because ion-to-bases sequencing has difficulty reading strings of homopolymers, especially long ones. (Homopolymer filter)
[0037] 8) Evaluate the region surrounding the SNV (i.e., at position n_i±δ_c) on each read containing a variant for adjacent or clustered variants. Within a particular read there cannot be additional substitutions, regardless of base type, in the delimited region (for example, within 100 bases/positions; or as another example, within the entire amplicon length). Optionally, this step can be combined with the Variant Count filter, wherein the Variant Count filter is run (or re-run) with the set of reads remaining after reads are removed by the Cluster filter. For example, suppose the variant cutoff is 1%, and there is an initial count of 4000 reads of which 41 have a variant at position 100. Now suppose the Cluster filter step removes 1000 reads, leaving 3000 remaining reads. If 30 or more of the remaining 3000 reads still have the variant at position 100, the variant passes the step and is retained. If, however, fewer than 30 reads have the variant, the variant fails the step and is removed from further consideration in the pipeline. Re-running the Variant Count filter in this way could cause some variants that were originally removed by the Variant Count filter to now pass it. This can be addressed in one of two ways: the variants can be re-introduced into the results, optionally with them being re-run through the pipeline to be checked against any filters they would have missed in the previous run; or the pipeline can be run as exclude-only, so that the re-run Variant Count filter does not re-introduce previously failing variants, but only excludes previously passing variants. (Cluster filter)
[0038] 9) Evaluate the region surrounding the SNV (i.e., at position n_i±δ_g) for all reads of an amplicon (or, alternatively, for a subset of reads) for additional variants, and exclude the SNV if too many additional variants, not already excluded by the Amplicon Coverage filter and with the same non-reference base, are found (i.e., beyond a threshold value; with a threshold of 0, even one additional variant of that same base would be considered too many). An example value for δ_g is 5. (Global filter)
[0039] 10) Determine which variants are reportable based on knowledge of clinical ramifications. (Report filter)
[0040] 11) Post the heuristic filter pipeline hit list analysis.
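As a non-limiting illustration, several of the steps above (the Amplicon Coverage, Variant Count, Base Coverage and Cluster filters) can be sketched in Python, reproducing the worked example of 4000 reads with 41 variant-bearing reads at position 100. The `Read` record, the function names, the amplicon label "AMP1" and the read span of 0 to 199 are assumptions made for this sketch and are not part of the disclosure. The sketch also assumes that "within 100 positions" means a distance of 100 or less.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Read:
    amplicon: str
    start: int   # reference position of the first base of the read
    end: int     # reference position of the last base (inclusive)
    variants: dict = field(default_factory=dict)  # position -> variant base


def amplicon_coverage_filter(reads, min_reads=2000):
    """Step 1: drop all reads of any amplicon with fewer than min_reads reads."""
    counts = Counter(r.amplicon for r in reads)
    return [r for r in reads if counts[r.amplicon] >= min_reads]


def cluster_filter(reads, window=100):
    """Step 8: drop reads holding two variants within `window` positions."""
    kept = []
    for r in reads:
        positions = sorted(r.variants)
        if all(b - a > window for a, b in zip(positions, positions[1:])):
            kept.append(r)
    return kept


def count_filters(reads, min_percent=1, min_coverage=2000):
    """Steps 2 and 4: return the (amplicon, position) keys that have enough
    coverage and enough variant base calls among the given reads."""
    coverage, calls = Counter(), Counter()
    for r in reads:
        for pos in range(r.start, r.end + 1):
            coverage[(r.amplicon, pos)] += 1
        for pos in r.variants:
            calls[(r.amplicon, pos)] += 1
    return {
        key for key, n in calls.items()
        if coverage[key] >= min_coverage and 100 * n >= min_percent * coverage[key]
    }


def worked_example(carriers_lost_to_cluster_filter):
    """4000 reads, 41 carrying a variant at position 100; the Cluster filter
    removes 1000 reads, some of which carry that variant."""
    reads = []
    for i in range(4000):
        variants = {}
        if i < 1000:  # two variants 10 bases apart: removed by the Cluster filter
            variants = {50: "A", 60: "C"}
            if i < carriers_lost_to_cluster_filter:
                variants[100] = "T"
        elif i < 1000 + 41 - carriers_lost_to_cluster_filter:
            variants[100] = "T"
        reads.append(Read("AMP1", 0, 199, variants))
    reads = cluster_filter(amplicon_coverage_filter(reads))
    return len(reads), ("AMP1", 100) in count_filters(reads)


print(worked_example(11))  # (3000, True): 30 carriers remain, variant retained
print(worked_example(12))  # (3000, False): 29 carriers remain, variant removed
```

The percentage test uses integer arithmetic (`100 * n >= min_percent * coverage`) so that a count exactly at the 1% boundary passes, consistent with "30 or more" in the worked example.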
[0041] The filters can be applied in any order, and in any combination (i.e., not all filters need to be used). The inclusion of and thresholds used by the various filters can depend upon the nature of the input data and the sources of noise present in the DNA acquisition and sequencing process. Each filter step can also record a percentage of pass and/or fail rate for that filter as a threshold to determine if the filter should be applied to the results. For example, if the number of amplicons failing the Amplicon Coverage filter is too high (or, equivalently, if the number of amplicons passing the Amplicon Coverage filter is too low), then the amplicons that would be excluded by the Amplicon Coverage filter are not excluded. This would create a controllable tolerance level for the filter in question, allowing a filter to be more permissive for batches that would otherwise have too few remaining SNVs after filtering.
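The tolerance behavior described above can be sketched as a wrapper that measures a filter's fail rate before deciding whether to apply it. This is a minimal illustration under stated assumptions: the function name, the 50% tolerance and the per-amplicon read counts are hypothetical and are not values given in this disclosure.

```python
def apply_with_tolerance(items, keep, max_fail_rate=0.5):
    """Apply the predicate `keep` unless it would reject more than
    `max_fail_rate` of the items, in which case nothing is excluded.

    Returns (surviving_items, fail_rate, applied)."""
    if not items:
        return [], 0.0, False
    passed = [item for item in items if keep(item)]
    fail_rate = 1 - len(passed) / len(items)
    if fail_rate > max_fail_rate:
        return list(items), fail_rate, False
    return passed, fail_rate, True


# Hypothetical mapped-read counts per amplicon.
read_counts = {"AMP1": 2500, "AMP2": 300, "AMP3": 1800, "AMP4": 900}
amplicons = list(read_counts)

strict = apply_with_tolerance(amplicons, lambda a: read_counts[a] >= 2000)
relaxed = apply_with_tolerance(amplicons, lambda a: read_counts[a] >= 500)
print(strict)   # (['AMP1', 'AMP2', 'AMP3', 'AMP4'], 0.75, False)
print(relaxed)  # (['AMP1', 'AMP3', 'AMP4'], 0.25, True)
```

With the threshold of 2000, three of four amplicons would fail, so the filter is skipped; with the lower threshold of 500 the fail rate is within tolerance and the filter is applied.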
[0042] “Noise”, as used herein, includes false positive and unreliable results from any source, internal or external to the system, or data that is not clinically significant. “Signal”, as used herein, includes highly reliable results that a user is trying to analyze.
[0043] Case/control can include comparing sequences from the patient's sample (e.g., blood to be analyzed) and a germline control sample (e.g., patient's normal/unmutated tissue).
[0044] In an example library, for a three minute assembly the post-assembly process adds about one and a half minutes to the process.
[0045] In addition to the hit/miss statistics, the hit list analysis can include details of why each removed hit was filtered out. The specific filter (Case/control, End-of-read, Cluster, etc.) that removed the hit can be listed next to the hit for analysis of the noise of the system.
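One possible way to record the removing filter next to each hit is sketched below, by way of example only. The codes NON, EOR and GLOB follow the example filter report of this disclosure; the code CTRL for the Case/Control filter, the candidate records and their field names are hypothetical, as is the choice to attribute a hit to the first filter that removes it.

```python
from collections import Counter


def build_hit_list(variants, filters):
    """Tag each variant with the code of the first filter that removes it,
    or "NON" when no filter removes it."""
    hit_list = []
    for variant in variants:
        code = next((c for c, removes in filters if removes(variant)), "NON")
        hit_list.append({**variant, "filter": code})
    return hit_list


# Ordered (code, predicate) pairs; a predicate returns True to remove the variant.
FILTERS = [
    ("CTRL", lambda v: v["in_control"]),               # hypothetical code
    ("EOR", lambda v: v["dist_to_read_end"] <= 11),
    ("GLOB", lambda v: v["nearby_variants"] > 0),
]

# Hypothetical candidate variants.
candidates = [
    {"id": "v1", "in_control": False, "dist_to_read_end": 40, "nearby_variants": 0},
    {"id": "v2", "in_control": False, "dist_to_read_end": 5, "nearby_variants": 2},
    {"id": "v3", "in_control": True, "dist_to_read_end": 60, "nearby_variants": 0},
    {"id": "v4", "in_control": False, "dist_to_read_end": 30, "nearby_variants": 1},
]

hit_list = build_hit_list(candidates, FILTERS)
summary = Counter(row["filter"] for row in hit_list)
for row in hit_list:
    print(row["id"], row["filter"])
print(dict(summary))
```

The per-filter tally in `summary` gives the hit/miss statistics, while the per-variant code supports analysis of which noise source dominated a run.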
[0049] As it can be shown by
TABLE 1 Example Filter Report
run  pat_id  filter  chr  coordinate  aref  avar  coverage  var_count  effect
302  LB517   NON     9    133747505   T     C     10326     475        -
302  LB517   NON     9    133747506   C     T     10317     274        -
302  LB517   NON     9    133747507   C     T     10314     482        -
302  LB517   EOR     14   105241519   T     C     17603     1134       NS
302  LB5017  NON     2    29432625    C     A     5744      786        -
302  LB5017  GLOB    5    112175211   T     A     2503      32         NS
302  LB5017  GLOB    5    112175216   G     A     2506      71         NS
[0055] Table 1 shows a portion of an example filter report. For a given sequencing run (run) for a given patient (pat_id), variants (avar) are shown relative to the reference base (aref) they substitute, with the variant location identified by chromosome number (chr) and gene coordinate (coordinate). The total base coverage (coverage) and variant count (var_count) for the variant are given. A filter report field (filter) reports whether the variant was not filtered by the heuristic filter (value of NON) or, if it was filtered, which filter removed the variant from the final results (e.g., EOR for end-of-read filter, GLOB for global filter, etc.). Another field (effect) reports other effects that can determine scoring of the variant, such as being non-synonymous (value NS). The report can include further information, such as the type of run (e.g., germ line run), base counts at that position, percent variation, deletion counts, gene identification, transcript identification, protein change, complementary DNA (cDNA) change, or Catalogue of Somatic Mutations in Cancer (COSMIC) identification.
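For illustration, rows of a report laid out as in Table 1 could be read into typed records and the percent variation derived from the coverage and variant count fields. The sketch below assumes whitespace-separated columns in the order of Table 1 and a dash as the empty-effect placeholder; the `ReportRow` class and `parse_row` function are hypothetical names, and the three rows are copied from Table 1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReportRow:
    run: int
    pat_id: str
    filter: str            # NON when not filtered, else the removing filter's code
    chr: str
    coordinate: int
    aref: str
    avar: str
    coverage: int
    var_count: int
    effect: Optional[str]  # e.g. NS, or None when no effect is reported

    @property
    def percent_variation(self) -> float:
        return 100.0 * self.var_count / self.coverage


def parse_row(line: str) -> ReportRow:
    f = line.split()
    effect = None if f[9] in {"-", "\u2014"} else f[9]
    return ReportRow(int(f[0]), f[1], f[2], f[3], int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), effect)


TABLE_ROWS = """\
302 LB517 NON 9 133747505 T C 10326 475 -
302 LB517 EOR 14 105241519 T C 17603 1134 NS
302 LB5017 GLOB 5 112175211 T A 2503 32 NS
"""

rows = [parse_row(line) for line in TABLE_ROWS.splitlines()]
for row in rows:
    status = "kept" if row.filter == "NON" else f"removed by {row.filter}"
    print(row.chr, row.coordinate, f"{row.percent_variation:.2f}%", status)
```

Deriving percent variation from the stored counts, rather than storing it separately, keeps the two fields from drifting apart if a report is regenerated after re-filtering.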
[0057] A number of embodiments of the disclosure have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the present disclosure. Accordingly, other embodiments are within the scope of the following claims.
[0058] The examples set forth above are provided to those of ordinary skill in the art as a complete disclosure and description of how to make and use the embodiments of the disclosure, and are not intended to limit the scope of what the inventor/inventors regard as their disclosure.
[0059] Modifications of the above-described modes for carrying out the methods and systems herein disclosed that are obvious to persons of skill in the art are intended to be within the scope of the following claims. All patents and publications mentioned in the specification are indicative of the levels of skill of those skilled in the art to which the disclosure pertains. All references cited in this disclosure are incorporated by reference to the same extent as if each reference had been incorporated by reference in its entirety individually.
[0060] It is to be understood that the disclosure is not limited to particular methods or systems, which can, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting. As used in this specification and the appended claims, the singular forms “a,” “an,” and “the” include plural referents unless the content clearly dictates otherwise. The term “plurality” includes two or more referents unless the content clearly dictates otherwise. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the disclosure pertains.