Methods for genome characterization
11479878 · 2022-10-25
Assignee
- Dana-Farber Cancer Institute, Inc. (Boston, MA, US)
- The Broad Institute, Inc. (Cambridge, MA)
- President And Fellows Of Harvard College (Cambridge, MA)
Inventors
Cpc classification
C12Q1/6806
CHEMISTRY; METALLURGY
C40B40/08
CHEMISTRY; METALLURGY
C12Q2600/112
CHEMISTRY; METALLURGY
International classification
C40B40/08
CHEMISTRY; METALLURGY
C12Q1/6806
CHEMISTRY; METALLURGY
Abstract
The invention provides methods of using low coverage sequencing to assess the relative fraction of tumor versus normal DNA in a sample, and to assess copy number alterations present in the sample.
Claims
1. A method of characterizing DNA in a biological sample comprising or suspected of comprising tumor-derived DNA, the method comprising: (a) isolating fragments of DNA from a biological sample; (b) constructing an unamplified DNA library comprising said fragments, wherein the library is constructed using from about 2 ng to about 85 ng of DNA; (c) obtaining sequence data by sequencing the library to about 0.01-5× genome- or exome-wide sequencing coverage; (d) generating a copy number alteration profile for the sequence data; and (e) using the copy number alteration profile to detect presence or absence of a chromosomal copy number alteration in the sequence data, wherein detection of a chromosomal copy number alteration indicates that at least a portion of the DNA in the sample was derived from a neoplastic cell and failure to detect such an alteration indicates that DNA present in the sample is not derived from a neoplastic cell.
2. A method of characterizing DNA in a biological sample, the method comprising: (a) isolating fragments of DNA from a biological sample; (b) constructing an unamplified DNA library comprising said fragments, wherein the library is constructed using from about 2 ng to about 85 ng of DNA; (c) obtaining sequence data by sequencing the library to about 0.01-5× genome- or exome-wide sequencing coverage; (d) generating a copy number alteration profile for the sequence data, wherein generating the copy number alteration profile comprises computing sequence read coverage and sequence data normalization; and (e) using the copy number alteration profile to detect presence or absence of a chromosomal copy number alteration or focal chromosomal copy number alteration in the sequence data.
3. A method of determining the purity of tumor-derived DNA in a sample, the method comprising: (a) isolating fragments of DNA from a biological sample; (b) constructing an unamplified DNA library comprising said fragments, wherein the library is constructed using from about 2 ng to about 85 ng of DNA; (c) obtaining sequence data by sequencing the library to at least about 0.1X genome- or exome-wide sequencing coverage; (d) generating a copy number alteration profile for the sequence data; (e) using the copy number alteration profile to detect the presence or absence of a chromosomal copy number alteration in the sequence data; and (f) analyzing any chromosomal copy number alteration(s) detected in the sample to determine the purity of tumor-derived DNA in the sample.
4. The method of claim 3, further comprising (g) carrying out whole exome sequencing.
5. The method of claim 2, wherein the focal chromosomal copy number alteration is selected from the group consisting of about 1 KB, 3 KB, 5 KB, 10 kb, 50 kb, 100 kb, 500 kb, 2 MB, 3 MB, 4 MB, 5 MB, 10 MB, 50 MB, and 100 MB of DNA.
6. The method of claim 1, wherein the biological sample is tissue sample or a liquid biological sample selected from the group consisting of blood, plasma, serum, cerebrospinal fluid, phlegm, saliva, urine, semen, prostate fluid, breast milk, and tears.
7. The method of claim 1, wherein the sample is derived from a subject having or suspected of having a neoplasia.
8. The method of claim 1, wherein the sample is a fresh or archival sample derived from a subject having a cancer selected from the group consisting of prostate cancer, metastatic prostate cancer, breast cancer, triple negative breast cancer, lung cancer, colon cancer, or any other cancer comprising aneuploid cells.
9. A method of identifying a subject as having a neoplasia or monitoring disease status, the method comprising: (a) isolating fragments of cell-free DNA from a biological sample derived from the subject; (b) constructing an unamplified, cell-free DNA library comprising said fragments, wherein the library is constructed using from about 2 ng to about 85 ng of DNA; (c) obtaining sequence data by sequencing the library to at least about 0.01-5× exome- or genome-wide sequencing coverage; (d) generating a copy number alteration profile for the sequence data, and (e) using the copy number alteration profile to detect the presence or absence of a chromosomal copy number alteration in the sequence data, wherein the presence of a chromosomal copy number alteration identifies the subject as having a neoplasia, and the absence of a chromosomal copy number alteration indicates that no neoplasia was detected in the sample from the subject and, optionally, comparing the chromosomal copy number alteration(s) in the sequence data over time, thereby identifying the subject as having a neoplasia or monitoring disease status of the subject.
10. The method of claim 9, wherein an increase in chromosomal copy number alterations between a first time point and a later time point indicates that the subject's disease state has progressed or wherein a decrease in chromosomal copy number alterations between a first time point and a later time point indicates that the subject's disease state has stabilized or is not progressing.
11. A method of characterizing the efficacy of treatment of a subject having a disease characterized by an increase in chromosomal copy number, the method comprising: (a) isolating fragments of cell-free DNA from two or more biological samples derived from a subject undergoing cancer therapy, wherein a first biological sample is obtained at a first time point and a second or subsequent biological sample is obtained at a later time point; (b) constructing two or more unamplified, cell-free DNA libraries each comprising fragments from said samples, wherein each library is constructed using from about 2 ng to about 85 ng of DNA; (c) obtaining sequence data for each library by sequencing the libraries to at least about 0.01-5× genome- or exome-wide sequencing coverage; (d) generating a copy number alteration profiles from the sequence data, and (e) using the copy number alteration profiles to compare focal chromosomal copy number alterations in the sequence data over time, thereby characterizing the efficacy of treatment.
12. The method of claim 11, wherein a decrease in chromosomal copy number alterations between the first and later time points indicates that the treatment is effective.
13. The method of claim 11, wherein the disease is cancer.
14. The method of claim 12, wherein the treatment is an anti-cancer therapy selected from the group consisting of chemotherapy, radiotherapy, or surgery.
15. The method of claim 1, wherein the fragments of DNA are cell-free DNA.
16. The method of claim 1, wherein the exome-wide or genome-wide sequencing coverage is about 0.1×.
17. The method of claim 2, wherein detection of a focal chromosomal copy number alteration in the sequence data identifies the presence of tumor derived cell-free DNA present in the sample or wherein failure to detect a focal chromosomal copy number alteration in the sequence data indicates the absence of tumor derived cell-free DNA present in the sample.
18. The method of claim 2, wherein the focal chromosomal copy number alteration correlates with at least about 3% purity of tumor derived DNA.
19. The method of claim 1, wherein the library is constructed using about 5 ng of DNA.
20. The method of claim 1, wherein about 1% of the library is sequenced in (c).
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION OF THIS INVENTION
(7) The invention generally provides methods of using low coverage sequencing to assess the relative fraction of tumor versus normal DNA in a sample, and to assess copy number alterations present in the sample.
(8) The invention provides an efficient process to qualify DNA samples for whole-exome and/or whole genome sequencing based on tumor fraction (or tumor content). Without being bound by theory, the power to detect a variant at a particular sequencing depth depends on the tumor fraction of a sample. The selected samples can then be used to systematically compare the somatic mutations, indels, and copy number alterations detected in whole exome or whole genome sequencing of cfDNA to whole exome or whole genome sequencing of matched tumor biopsies. Advantageously, methods of the invention provide for ultra-low-pass sequencing (i.e., about 0.01, 0.05, 0.1, 0.5, 1, 2, 3, or 5× genome or exome wide sequencing coverage) for assessing tumor fraction; for samples in which sufficient tumor content is estimated, whole exome sequencing of cfDNA can then be performed. Among patients with metastatic breast or prostate cancers, whole exome sequencing of cfDNA was found to uncover 91% of the clonal mutations, 59% of the subclonal mutations, and 86% of the clonal copy number alterations (CNA) and 80% of the subclonal CNAs detected in whole exome sequencing of matched tumor biopsies. The high concordance suggests that comprehensive sequencing of cfDNA may enable genomic discovery in a routine and minimally invasive manner.
(9) This invention overcomes the challenge of screening large numbers of blood samples to assess how much tumor-derived cell-free DNA is present in blood plasma. This allows estimation of the fraction of tumor DNA in a sample from a trivial amount of sequencing (~0.1× coverage or roughly $20 of sequencing coverage). This also provides for the detection of copy number alterations in the sample at a 500 kb scale. This serves not only as an efficient way to qualify samples for targeted or whole-exome/whole-genome sequencing for cancer genomics discoveries, but also correlates with therapeutic response and enables the study of copy number alterations in large cohorts.
(10) The invention provides for characterization of the malignant status of a sample or for diagnosis of cancer in the subject from whom the sample is derived. Based on the detection of somatic, tumor-specific copy number alterations from DNA of a sample, the fraction of tumor DNA is estimated. cfDNA and germline DNA were isolated from blood and analyzed using low coverage sequencing to estimate tumor content based on genome-wide copy number. The CNA events and tumor fraction were simultaneously predicted in a unified Bayesian hidden Markov model (HMM) framework. The estimated error rate of tumor fraction estimation was about 0.03 at genome sequencing coverage of 0.1×, as determined from application to healthy donor samples, which should have an expected 0.00 tumor fraction. Therefore, samples having greater than 0.03 tumor fraction, due to harboring detectable CNA events, are classified as tumor-derived.
(11) In related applications, the methods described herein can be used to assess the tumor cellularity (purity) of tumor biopsies of human primary cancers with high stromal contamination and lymphocytic infiltration. Where little tumor DNA is present, methods of the invention can be used to assess whether sequencing is warranted. This would avoid the sequencing of samples that have little tumor content, which would provide significant cost saving. In addition, the methods of the invention could be used to assess the fraction of tumor relative to normal cells present in a tissue sample to be used for cell line generation. This would avoid the propagation of cell lines from human primary tumors that are largely composed of normal cells rather than of tumor cells.
(12) Whole Genome Sequencing and Whole Exome Sequencing
(13) Whole genome sequencing (also known as “WGS”, full genome sequencing, complete genome sequencing, or entire genome sequencing) is a process that determines the complete DNA sequence of an organism's genome. A common strategy used for WGS is shotgun sequencing, in which DNA is broken up randomly into numerous small segments, which are sequenced. Sequence data obtained from one sequencing reaction is termed a “read.” The reads can be assembled together based on sequence overlap. The genome sequence is obtained by assembling the reads into a reconstructed sequence.
(14) Whole exome sequencing (“WES”) is a technique used to sequence all the expressed genes in a genome (known as the exome). It includes first selecting only the subset of DNA that encodes proteins (exons), and then sequencing the exons using any DNA sequencing technology well known in the art or as described herein. In a human being, there are about 180,000 exons, which constitute about 1% of the human genome, or approximately 30 million base pairs. To sequence the exons of a genome, fragments of double-stranded genomic DNA are obtained (e.g., by methods such as sonication, nuclease digestion, or any other appropriate methods). Linkers or adapters are then attached to the DNA fragments, which are then hybridized to a library of polynucleotides designed to capture only the exons. The hybridized DNA fragments are then selectively isolated and subjected to sequencing using any sequencing method known in the art or described herein.
(15) In one embodiment, the sequencing of a DNA fragment is carried out using the commercially available SBS (sequencing by synthesis) technology from Illumina. In another embodiment, the sequencing of the DNA fragment is carried out using the chain termination method of DNA sequencing. In yet another embodiment, the sequencing of the DNA fragment is carried out using one of the commercially available next-generation sequencing technologies, including SMRT (single-molecule real-time) sequencing from Pacific Biosciences, Ion Torrent™ sequencing from ThermoFisher Scientific, Pyrosequencing (454) from Roche, and SOLiD® technology from Applied Biosystems. Any appropriate sequencing technology may be chosen for sequencing.
(16) This invention provides methods for qualifying a DNA sample for whole exome sequencing after ultra low pass (ULP)-whole genome sequencing. In one embodiment, the fraction of cells affected by a disease, such as cancer, may be estimated using one or more of the statistical models described herein. The qualifying process involves selecting a DNA sample having a tumor fraction greater than a threshold value, e.g. 5%, for whole exome sequencing (WES). Understanding the fraction of tumor-derived DNA in a sample allows one to adjust the depth of sequencing used in WES or other deeper sequencing. For instance, if a sample has a very low purity of tumor-derived DNA, a much greater depth of sequencing may be required to achieve the same sensitivity for detection of somatic alterations (Cibulskis et al. Nat. Biotechnol. 31, 213-219 (2013)).
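The qualifying step described above can be illustrated with a minimal sketch. The `Sample` class, the `qualify_for_wes` function, and the sample identifiers are hypothetical names introduced only for illustration; the 5% default mirrors the example threshold given in the text, and the tumor fraction values are assumed to come from a prior ULP-WGS analysis.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """A cfDNA library with a tumor fraction estimated from ULP-WGS (hypothetical type)."""
    sample_id: str
    tumor_fraction: float  # value in [0, 1]


def qualify_for_wes(samples: List[Sample], threshold: float = 0.05) -> List[Sample]:
    """Return the samples whose tumor fraction is greater than the threshold.

    The 0.05 default follows the "e.g. 5%" example in the text; a different
    threshold may be appropriate for a given study.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return [s for s in samples if s.tumor_fraction > threshold]


# Usage with made-up values: only the second sample exceeds 5%.
cohort = [Sample("lib_A", 0.02), Sample("lib_B", 0.18), Sample("lib_C", 0.05)]
selected = qualify_for_wes(cohort)
```

The comparison is strict (greater than), matching the wording "a tumor fraction greater than a threshold value".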
(17) Ultra low pass sequencing advantageously provides for the accurate characterization of genomic or exomic DNA at a significant savings of cost and time, thereby obviating the need for complete integrative clinical sequencing of the whole-exome, matched germline, and/or transcriptome as practiced, for example, by Robinson et al., Cell 161:1215-1228, 2015 where mean target coverage for tumor exomes was 160× and for matched normal exomes was 100×.
(18) As used herein, the term “coverage” refers to the percentage of genome covered by reads. In one embodiment, low coverage or ultra low pass coverage is less than about 1×. Coverage also refers to, in shotgun sequencing, the average number of reads representing a given nucleotide in the reconstructed sequence. It can be calculated from the length of the original genome (G), the number of reads (N), and the average read length (L) as N×L/G. Biases in sample preparation, sequencing, and genomic alignment and assembly can result in regions of the genome that lack coverage (that is, gaps) and in regions with much higher coverage than theoretically expected. It is important to assess the uniformity of coverage, and thus data quality, by calculating the variance in sequencing depth across the genome. The term depth may also be used to describe how much of the complexity in a sequencing library has been sampled. All sequencing libraries contain finite pools of distinct DNA fragments. In a sequencing experiment only some of these fragments are sampled.
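The N×L/G relationship above can be made concrete with a short sketch. The function names are hypothetical, and the genome length (3.1×10⁹ bp) and read length (100 bp) used in the example are illustrative assumptions rather than values stated in the text.

```python
import math


def mean_coverage(num_reads: int, read_length: float, genome_length: float) -> float:
    """Average sequencing depth computed as N x L / G."""
    if genome_length <= 0 or read_length <= 0:
        raise ValueError("genome_length and read_length must be positive")
    return num_reads * read_length / genome_length


def reads_for_coverage(target_coverage: float, read_length: float, genome_length: float) -> int:
    """Number of reads needed to reach a target average depth (N = C x G / L, rounded up)."""
    if genome_length <= 0 or read_length <= 0:
        raise ValueError("genome_length and read_length must be positive")
    return math.ceil(target_coverage * genome_length / read_length)


# Illustrative assumptions: a 3.1 Gb genome and 100 bp reads.
GENOME_LENGTH = 3.1e9
READ_LENGTH = 100

n_reads = reads_for_coverage(0.1, READ_LENGTH, GENOME_LENGTH)
depth = mean_coverage(n_reads, READ_LENGTH, GENOME_LENGTH)
```

Under these assumptions, 0.1× coverage corresponds to roughly three million reads, which gives a sense of how small an ultra-low-pass run is compared with 20× or deeper whole genome sequencing.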
(19) Alterations of Cancer Genome
(20) Alterations of the genome have been identified in virtually all cancers. These alterations may include, without limitation, gene mutation, loss of heterozygosity (LOH), changes in chromosome number (aneuploidy), deletions, insertions, inversions, translocations, amplifications, and copy number alterations (CNA). Detection of copy number alterations is useful for characterizing the malignant status of a tumor, diagnosis of cancer, assessment of the purity in tumor biopsies of human primary cancers, and qualifying samples for additional characterization, such as whole exome sequencing.
(21) This invention provides methods to characterize alterations present in a cancer genome. In one embodiment, this invention involves isolating cfDNA from a biological sample, sequencing the DNA using ULP-WGS, analyzing the sequence of the DNA using statistical models described herein, and characterizing the copy number alterations present in the cfDNA, for example, by generating a copy number alteration profile. In one embodiment, the CNA profile is used to determine the tumor fraction of a sample. In another embodiment, the invention provides a method of diagnosing cancer in a subject by detecting the CNA profile of a sample from the subject. In yet another embodiment, the invention provides a method of identifying a treatment and/or an agent for treatment and/or prevention of a cancer by characterizing CNAs present in cfDNA of the subject. In one embodiment, the method involves comparing the CNA profiles before and after the treatment and/or administration of the agent.
(22) The methods of the invention are applicable to any disease and/or disorder having copy number alterations.
(23) Statistical Methods
(24) To analyze ULP-WGS data, an approach modified from TITAN (Ha G. et al., Genome Res. 24, 1881-1893 (2014) (“Ha 2014”)), the content of which is incorporated by reference in its entirety, is used. The approach in this invention employs a hidden Markov model that simultaneously performs segmentation, copy number prediction, and purity (also termed tumor fraction) and ploidy estimation. This approach is optimized for increased sensitivity to detect events from low purity tumor-derived DNA in the absence of a control sample by using fine-tuned Bayesian priors and considering all CNA events in a unified probabilistic model. TITAN code is available at http://compbio.bccrc.ca/software/titan/ and https://github.com/gavinha/TitanCNA; the TITAN and HMMcopy frameworks and pipelines needed to be adapted. This new approach overcomes major challenges that TITAN/HMMcopy, and all other existing tools, were not designed to address: 1) ultra low coverage (0.1×) sequencing (most tools are optimized for WGS of 20× or greater); 2) unavailable matched germline normal sample as control (many tools require heterozygous SNPs determined from normal); 3) very low tumor content in DNA samples such as in cell-free DNA (many tools advertise benchmarks for reasonable performance at 0.15-0.20 tumor fractions). The landscape of the alterations of the cancer genome may be inferred from such estimates.
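To illustrate how tumor fraction, ploidy, and copy number jointly determine the read-depth signal, the following sketch uses a commonly used tumor/normal mixture formula for the expected log2 copy ratio of a segment. It is an assumption made for illustration and is not presented as the exact emission model of the hidden Markov model described above; the function names are hypothetical, and the normal genome is assumed to be diploid.

```python
import math


def expected_log2_ratio(tumor_fraction: float, tumor_copies: float, tumor_ploidy: float = 2.0) -> float:
    """Expected log2 copy ratio of a segment in a tumor/normal DNA mixture.

    Assumed model: normal cells carry 2 copies everywhere; the numerator is the
    average copy number at the segment and the denominator is the genome-wide
    average copy number of the mixture.
    """
    if not 0.0 <= tumor_fraction <= 1.0:
        raise ValueError("tumor_fraction must be between 0 and 1")
    segment = 2.0 * (1.0 - tumor_fraction) + tumor_fraction * tumor_copies
    genome = 2.0 * (1.0 - tumor_fraction) + tumor_fraction * tumor_ploidy
    return math.log2(segment / genome)


def tumor_fraction_from_one_copy_loss(observed_log2_ratio: float) -> float:
    """Invert the model for a single-copy loss (2 -> 1 copies) in a diploid tumor.

    With tumor_copies = 1 and tumor_ploidy = 2 the ratio is (2 - f) / 2,
    so f = 2 * (1 - 2 ** log2_ratio).
    """
    return 2.0 * (1.0 - 2.0 ** observed_log2_ratio)


# At a 0.10 tumor fraction, a one-copy loss shifts the log2 ratio only slightly.
shift = expected_log2_ratio(0.10, tumor_copies=1)
recovered = tumor_fraction_from_one_copy_loss(shift)
```

The small magnitude of the shift at low tumor fraction suggests why pooling evidence across large segments in a single probabilistic model helps when coverage is only about 0.1×.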
(25) Types of Samples
(26) This invention provides methods to extract and sequence a polynucleotide present in a sample. In one embodiment, the samples are biological samples generally derived from a human subject, preferably as a bodily fluid (such as blood, plasma, serum, cerebrospinal fluid, phlegm, saliva, urine, semen, prostate fluid, breast milk, or tears) or tissue sample (e.g. a tissue sample obtained by biopsy). In a further embodiment, the samples are biological samples derived from an animal, preferably as a bodily fluid (such as blood, cerebrospinal fluid, phlegm, saliva, or urine) or tissue sample (e.g. a tissue sample obtained by biopsy). In still another embodiment, the samples are biological samples from in vitro sources (such as cell culture medium). CfDNA attached to a substrate may be first suspended in a liquid medium, such as a buffer or water, and then subjected to sequencing and/or analysis. In yet another embodiment, the sample contains DNA within a cell, which may be extracted, sequenced and subjected to the same analysis for the landscape of alterations of the genome.
(27) Patient Monitoring
(28) The disease state or treatment of a patient having a cancer or disease characterized by copy number alterations can be monitored using the methods and compositions of this invention. In one embodiment, the response of a patient to a treatment can be monitored using the methods and compositions of this invention. Such monitoring may be useful, for example, in assessing the efficacy of a particular treatment in a patient. Treatments amenable to monitoring using the methods of the invention include, but are not limited to, chemotherapy, radiotherapy, immunotherapy, and surgery. Therapeutics that alter the CNA landscape of cfDNA are taken as particularly useful in this invention. In one embodiment, methods of the invention are used to monitor PTEN loss or treatment with PARP inhibitors (e.g., iniparib, talazoparib, olaparib, rucaparib, veliparib, MK 4827, BGB-290).
(29) Diagnostics
(30) Neoplastic tissues display alterations in their genome compared to corresponding normal reference tissues. Copy number alterations are correlated with neoplasia. Accordingly, this invention provides methods for detecting, diagnosing, or characterizing a neoplasia in a subject. The present invention provides a number of diagnostic assays that are useful for the identification or characterization of a neoplasia.
(31) In one approach, diagnostic methods of the invention are used to detect the CNA in a biological sample relative to a reference (e.g., a reference determined by an algorithm, determined based on known values, determined using a standard curve, determined using statistical modeling, or level present in a control polynucleotide, genome or exome).
(32) Methods of the invention are useful as clinical or companion diagnostics for therapies or can be used to guide treatment decisions based on clinical response/resistance. In other embodiments, methods of the invention can be used to qualify a sample for whole-exome sequencing.
(33) A physician may diagnose a subject and the physician thus has the option to recommend and/or refer the subject to seek the confirmation/treatment of the disease. The availability of high throughput sequencing technology allows the diagnosis of a large number of subjects.
(34) The practice of the present invention employs, unless otherwise indicated, conventional techniques of molecular biology (including recombinant techniques), microbiology, cell biology, biochemistry and immunology, which are well within the purview of the person of ordinary skill. Such techniques are explained fully in the literature, such as, “Molecular Cloning: A Laboratory Manual”, second edition (Sambrook, 1989); “Oligonucleotide Synthesis” (Gait, 1984); “Animal Cell Culture” (Freshney, 1987); “Methods in Enzymology”; “Handbook of Experimental Immunology” (Weir, 1996); “Gene Transfer Vectors for Mammalian Cells” (Miller and Calos, 1987); “Current Protocols in Molecular Biology” (Ausubel, 1987); “PCR: The Polymerase Chain Reaction”, (Mullis, 1994); “Current Protocols in Immunology” (Coligan, 1991). These techniques are applicable to the production of the polynucleotides and polypeptides of this invention, and, as such, may be considered in making and practicing this invention. Particularly useful techniques for particular embodiments will be discussed in the sections that follow.
(35) The following examples are put forth so as to provide those of ordinary skill in the art with a complete disclosure and description of how to make and use the assay, screening, and therapeutic methods of this invention, and are not intended to limit the scope of what the inventors regard as their invention.
EXAMPLES
Example 1
Ultra-Low-Pass Whole-Genome Sequencing can be Used to Characterize Tumor Content in a cfDNA Library
(36) A low-cost, sample-conserving, and unbiased approach to screen cell-free DNA libraries prior to whole exome sequencing would help to focus whole exome sequencing on libraries with detectable tumor DNA. While many previous approaches to screening for cancer-derived cfDNA have focused on targeted mutation detection, it was hypothesized that using somatic copy number alterations (SCNAs) would be more generally applicable as the vast majority of metastatic cancers harbor arm-level SCNAs (citation) whereas different tumor types may have few recurrent somatic mutations. Additionally, copy number detection is an unbiased approach which borrows statistical strength from genome-wide signals while somatic mutation detection relies on high coverage at targeted regions to achieve high sensitivity in low purity samples. The question of whether ultra-low-pass whole-genome sequencing (ULP-WGS) could sensitively detect large-scale SCNAs and be used to estimate tumor fractions was explored. This would only require a small fraction of each library (~1%) and conserve enough DNA for hybrid selection.
(37) An analytical approach termed “ichorCNA” was developed to quantify tumor fraction in cfDNA without prior knowledge of somatic single nucleotide variants (SSNVs) or somatic copy number alterations (SCNAs) in patients' tumors from ULP-WGS (
(38) An exemplary process begins with patient blood collection, separation of plasma from blood, extraction of cfDNA from plasma and germline DNA from blood, and construction of cfDNA libraries (
(39) It was found that the size distribution and yields of cfDNA from metastatic cancer patients (median=7.39, range=0.20 to 547.82 ng/mL plasma, n=1642) and healthy donors (median=2.64, range=0.55 to 21.27 ng/mL plasma, n=27) were consistent with previous reports (Snyder et al. Cell 164, 57-68 (2016); Szpechcinski et al. Br. J. Cancer 113, 476-483 (2015)). Whole-genome libraries were constructed from samples using just 4 mL of plasma—the amount of plasma in a single tube of blood. A library construction protocol was optimized for 5 ng of cfDNA input; 96.3% of cancer patients and 81.4% of healthy donors had ≥5 ng of cfDNA per 4 mL of plasma. Only 1% of each cfDNA sequencing library was then used for ULP-WGS to screen for tumor content. This process resulted in cfDNA and germline DNA libraries suitable for hybrid selection and whole-exome sequencing.
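The input requirement above reduces to simple arithmetic: yield per tube is the plasma concentration multiplied by the plasma volume. The sketch below uses hypothetical function names; the 4 mL volume and 5 ng input defaults come from the text, and the example concentrations are the reported medians and the reported minimum for cancer patients.

```python
def cfdna_yield_ng(concentration_ng_per_ml: float, plasma_volume_ml: float = 4.0) -> float:
    """Total cfDNA (ng) recovered from a given plasma volume."""
    if concentration_ng_per_ml < 0 or plasma_volume_ml < 0:
        raise ValueError("concentration and volume must be non-negative")
    return concentration_ng_per_ml * plasma_volume_ml


def meets_input_requirement(
    concentration_ng_per_ml: float,
    plasma_volume_ml: float = 4.0,
    required_input_ng: float = 5.0,
) -> bool:
    """True when the plasma volume yields at least the required library input."""
    return cfdna_yield_ng(concentration_ng_per_ml, plasma_volume_ml) >= required_input_ng


# Median concentrations reported above: 7.39 ng/mL (cancer) and 2.64 ng/mL (healthy).
cancer_median_yield = cfdna_yield_ng(7.39)
healthy_median_yield = cfdna_yield_ng(2.64)
# The lowest reported cancer patient concentration (0.20 ng/mL) falls short of 5 ng.
lowest_ok = meets_input_requirement(0.20)
```

At the median concentrations, a single 4 mL plasma aliquot provides several times the 5 ng input, which is consistent with most samples in both groups meeting the requirement.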
(40) ichorCNA simultaneously predicted segments of SCNA and estimates of tumor fraction while accounting for subclonality and tumor ploidy. To evaluate the performance of ichorCNA, ULP-WGS of cfDNA (
(41) To further evaluate how ULP-WGS of cfDNA compares with the metastatic tumor, standard whole-exome sequencing was performed on matched tumor biopsies (average mean target coverage 173×) from 41 patients with metastatic breast and prostate cancers who had a cfDNA sample with ≥0.1 tumor fraction. The majority of large-scale (>1 Mb) SCNAs detected by ULP-WGS of cfDNA were present in the metastatic tumors (median sensitivity 0.82, Spearman ρ=0.66,
(42) A series of benchmarking datasets were generated using in silico mixing of up to 50 cancer patient and 22 healthy donor cfDNA samples. The benchmarking datasets demonstrated accurate estimation of tumor fraction (median absolute deviation of error <0.014) and detection of SCNAs at 0.1× coverage. Furthermore, a lower limit of detection of 0.03 tumor fraction was determined, using only a single arm-level (>100 Mb) gain and loss of one copy to detect the presence of tumor. These results indicate that the application of ichorCNA to ULP-WGS of cfDNA offers an accurate approach to detect SCNAs that are reflective of tumor biopsies and provides accurate estimates of tumor fractions, potentially even in cancer types with few SCNAs.
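The benchmarking idea can be sketched as follows, under stated assumptions: reads from a patient sample with a known tumor fraction are mixed with healthy donor reads, the expected tumor fraction of the mixture is taken to scale with the proportion of patient reads, and accuracy is summarized as the median absolute deviation of the estimation errors. The function names and the numeric values are hypothetical, and the exact mixing and error definitions used for the reported figures are not specified in the text.

```python
from statistics import median
from typing import Sequence


def expected_mixture_tumor_fraction(patient_tumor_fraction: float, patient_read_proportion: float) -> float:
    """Expected tumor fraction after diluting patient reads with healthy donor reads.

    Assumes healthy donor reads contribute no tumor DNA and that read counts are
    proportional to DNA content.
    """
    for value in (patient_tumor_fraction, patient_read_proportion):
        if not 0.0 <= value <= 1.0:
            raise ValueError("fractions must be between 0 and 1")
    return patient_tumor_fraction * patient_read_proportion


def median_absolute_deviation(values: Sequence[float]) -> float:
    """Median of absolute deviations from the median."""
    center = median(values)
    return median([abs(v - center) for v in values])


# Made-up example: dilute a 0.40 tumor fraction sample to 25% patient reads.
truth = expected_mixture_tumor_fraction(0.40, 0.25)

# Made-up estimation errors (estimate minus expected) across several mixtures.
errors = [0.01, -0.02, 0.00, 0.03, -0.01]
mad = median_absolute_deviation(errors)
```

Diluting real patient data in this way provides samples with a known expected tumor fraction, against which estimates near the 0.03 limit of detection can be checked.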
(43) Whole-exome sequencing of cfDNA (average mean target coverage 191×) from the same 41 patients with matched metastatic breast and prostate tumor biopsies was performed and somatic alterations (SSNVs and SCNAs) were detected. First, ULP-WGS and whole-exome sequencing of cfDNA were compared and high concordance of tumor fraction estimates (Pearson's r=0.94,
Example 2
ULP-WGS Offers an Efficient Way to Screen cfDNA Libraries for Tumor Content Prior to Whole Exome Sequencing
(44) To validate the ULP-WGS approach for screening of cfDNA libraries from cancer patients, blood and matched tumor biopsies were collected from patients with metastatic breast cancer and patients with metastatic prostate cancer. ULP-WGS of cell-free DNA was performed. Cell-free DNA from cancer patient samples harbored SCNAs above 5 Mb, whereas cell-free DNA from healthy donors lacked large-scale copy number alteration events (
(45) The ULP-WGS and whole exome sequencing tumor fraction estimates were correlated with statistical significance (Pearson's r=0.86, p<0.01,
Example 3
ULP-WGS of cfDNA can be Used for Comprehensive Genomic Characterization of a Tumor Biopsy
(46) The overlap of SSNVs and SCNAs between whole-exome sequencing of cfDNA and matched tumor biopsies was examined. Clonal and subclonal events were distinguished by estimating the proportion of an observed somatic event out of the total tumor-derived DNA (cancer cell fraction, hereafter CCF) using ABSOLUTE (Carter et al. Nat. Biotechnol. 30, 413-421 (2012)). On average, 88% of the clonal (CCF≥0.9; range 29-100%) and 47% of the subclonal (CCF<0.9; range 9-100%) SSNVs that were detected in the tumor were confirmed to be present in cfDNA (i.e. supported by ≥3 variant reads) (
(47) Between cfDNA and the metastatic lesions, a median of 46% (range 12%-100%) of SSNVs and 78% (range 25-95%) of genes altered by SCNAs were observed to be clonal (CCF≥0.9) in both samples. For seventeen of the patients with a second cfDNA sample, clonal stability was observed, with the majority (>50%) of SSNVs having similar clonality (±0.1 CCF) between time points. Distinct subclonal patterns of SSNVs were also observed, including evolving clonal dynamics. For instance, in a metastatic breast cancer patient (MBC_284) previously treated with an aromatase inhibitor, multiple mutations were detected in ESR1 (D538G and L536P) in cfDNA at t₁ (0.12 and 0.45 CCF) (
Example 4
ULP-WGS of cfDNA Identified Focal Copy Number Alterations (CNAs) with Comparable Performance to Whole Exome Sequencing of the Same Library or Matched Tumor Biopsy
(48) To determine whether the data obtained through ULP-WGS may be used to reveal genomic aberrations in tumor tissue, a comparison study was conducted (
(49) Whether whole-exome sequencing of cfDNA can serve as a proxy for tumor biopsies in multiple applications of cancer exome analyses was assessed. First, known cancer-associated somatic alterations (Van Allen et al. Nat. Med. 20, 682-688 (2014)) in cfDNA and tumor biopsies were compared for 27 metastatic breast and 14 metastatic prostate cancer patients. In breast cancer, similar frequencies of altered genes (Pearson's r=0.97) were observed in both cfDNA and tumor biopsies, including mutations in TP53, ESR1, and PIK3CA, amplification of MYC, CCND1, ERBB2, PIK3CA, and losses of ATM and RB1 (
(50) As mutational processes operating in tumors have been associated with potential sensitivity to specific therapies (Alexandrov et al. Nat. Commun. 6, 8683 (2015)) and their detection in cfDNA could be clinically significant, the mutational signatures (Kasar et al. Nat. Commun. 6, 8866 (2015); Kim et al. Nat. Genet. 48, 600-606 (2016)) present in cfDNA and tumor biopsy were analyzed. Three previously (Alexandrov et al. Nature 500, 415-421 (2013)) described mutational signatures associated with aging (C>T mutations at CpG dinucleotides), APOBEC activity (C>T or C>G at a TC[A/T] context), and DNA homologous recombination deficiency (BRCA-like (Alexandrov et al. Nat. Commun. 6, 8683 (2015))) were identified (
(51) As cancer immunotherapies have been effective in clinical trials and analysis of neoantigens may influence treatment strategies (Rizvi et al. Science 348, 124-128 (2015)), the numbers of somatic mutations predicted to be neoantigens in cfDNA and matched tumor biopsies were compared. The binding affinity of missense SNVs to patient-specific MHC Class I alleles inferred from germline whole-exome sequencing data (Nielsen et al. PLoS One 2, (2007); Hoof et al. Immunogenetics 61, 1-13 (2009); Shukla et al. Nat. Biotechnol. 33, 1152-1158 (2015)) was predicted. Any mutation with an IC.sub.50<500 nM was considered a predicted neoantigen. The number of predicted neoantigens was strongly correlated between cfDNA and tumor biopsies (adjusted R.sup.2=0.90, p<1×10.sup.−16). Without being bound by theory, this indicates that whole-exome sequencing of cfDNA could lead to similar prediction of potential tumor immunogenicity as would sequencing of tumor biopsies (
(52) Finally, these results indicate that many patients with metastatic cancer will have sufficient tumor-derived cfDNA for whole-exome sequencing. ULP-WGS of cfDNA was analyzed from 913 blood samples from 391 patients with metastatic breast cancer and 579 blood samples from 129 patients with metastatic prostate cancer (
(53) Subsequent analysis of SCNAs detected from ULP-WGS of these samples revealed SCNA landscapes that closely reflected those reported (Curtis et al. Nature 486, 346-352 (2012)), including biopsies of metastatic tumors from 150 patients with castration-resistant prostate cancer (Robinson et al. Cell 161, 1215-1228 (2015)) (
Example 5
Focal CNAs Landscape Based on ULP-WGS of cfDNA are Similar to the Copy Number Landscapes Generated Using Conventional Methods
(54) ULP-WGS was used to map focal copy number aberrations (CNAs) present in cell free DNA isolated from blood samples obtained from 43 patients with metastatic castration resistant prostate cancer. In each patient, the purity of tumor cell-free DNA was greater than 10% and the coverage of the genome was greater than 0.05×. Results obtained using ULP-WGS were compared to copy number landscapes generated using whole-exome sequencing of over a hundred metastatic tumor biopsies (
Example 6
ULP-WGS of cfDNA can be Used to Characterize and Monitor a Patient's Response to Therapy
(55) CNA landscapes derived from ULP-WGS of cfDNA were used to determine whether a subject with prostate cancer responded to treatment (
(56) Results described herein above were obtained using the following methods and materials.
(57) CfDNA Isolation, Library Construction, and Sequence Data Generation
(58) CfDNA extractions were carried out using a commercially available automated DNA sample preparation technology from Qiagen, QiaSymphony®. The extracted DNA was quantified using a commercially available DNA quantification assay, the PicoGreen® assay. Sequencing libraries were constructed by direct ligation of adapters using the Kapa HyperPrep kit from KapaBiosystems. One library was constructed for each patient. A small fraction (˜1%) of each barcoded library was pooled and submitted for sequencing on 1 lane of a HiSeq 2500 per 96 samples. The sequence “barcode” allowed the library to be associated with the patient from which it was derived.
(59) High throughput DNA sequencing of the DNA fragments in each library was carried out using commercially available sequencing technology SBS (sequencing by synthesis) by Illumina on commercially available system, HiSeq 2500 from Illumina.
(60) Following analysis of tumor fractions, the remainder of the library was used to perform pooled hybrid capture with the Illumina Rapid Capture protocol and Illumina's baits for the whole exome. Each library was sequenced to standard exome depths. In parallel, whole-exome sequencing of germline DNA extracted from bulk white blood cells was carried out using conventional methods (Illumina Rapid Capture).
(61) Copy Number Analysis
(62) Cell-free DNA samples were initially qualified for whole exome sequencing (WES) using ultra-low-pass whole genome sequencing (ULP-WGS). ULP-WGS is a low-cost approach to nominate samples containing a sufficient fraction of tumor-derived DNA for whole exome sequencing.
(63) Large numbers of cell-free DNA samples were sequenced to an average of 0.1× genome-wide sequencing coverage. Samples with fewer than 1,500,000 reads (0.05×) were excluded due to insufficient coverage for analysis. A statistical approach (available from HMMcopy software) was used to correct for GC-content and mappability (sequence uniqueness) biases in read counts within genomic bins of 1 Mb, which substantially improves signal to noise ratio. Next, a modified approach was developed based on the TITAN framework as described by Ha et al. Genome Res. 22, 1995-2007 (2012); and Ha, et al. Genome Res. 24, 1881-1893 (2014). This approach performs segmentation of the count data, copy number prediction, and tumor fraction and ploidy estimation. This approach was optimized for increased sensitivity to detect events from low amounts of tumor-derived DNA in the absence of a control sample. Benchmarking of the tumor fraction estimates was carried out using simulation of tumor-normal mixing and comparison to corresponding whole exome sequencing data, revealing a robust lower estimation level of tumor content.
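The read-count threshold and the stated 0.05× coverage can be related by simple arithmetic. The following is a minimal sketch, assuming 100 bp reads (the read length mentioned in the read coverage section) and a genome size of roughly 3×10.sup.9 bp; the genome size and the function names are illustrative assumptions and are not taken from the described method.

```python
def genome_coverage(n_reads, read_length_bp=100, genome_size_bp=3.0e9):
    """Approximate fold coverage = total sequenced bases / genome size.

    read_length_bp and genome_size_bp are assumed values for illustration.
    """
    return n_reads * read_length_bp / genome_size_bp


def passes_coverage_filter(n_reads, min_reads=1_500_000):
    """Samples with fewer than min_reads reads are excluded."""
    return n_reads >= min_reads


if __name__ == "__main__":
    print(round(genome_coverage(1_500_000), 3))  # 0.05
```

Under these assumptions, 3,000,000 reads would correspond to the 0.1× average coverage targeted for ULP-WGS.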
(64) Analysis of ULP-WGS Using ichorCNA
(65) In order to assess the quality and presence of detectable tumor, ULP-WGS of cfDNA was performed to an average genome-wide fold coverage of 0.1×. The depth of coverage in a ULP sample was analyzed to evaluate large-scale copy number alterations (CNAs) and aneuploidies. A probabilistic model was developed and implemented in a software package called “ichorCNA,” which uses concepts from existing algorithms (Ha et al. Genome Res. 24, 1881-1893 (2014); Ha et al. Genome Res. 22, 1995-2007 (2012)) designed for deep coverage WGS/WES data to simultaneously predict regions of CNAs and estimate the fraction of tumor in ULP-WGS. The workflow consisted of 3 steps: 1) Computing read coverage, 2) Data normalization, and 3) CNA prediction and estimation of tumor fraction.
(66) Read Coverage Data
(67) The genome is divided into T non-overlapping windows, or bins, of 1 Mb. Aligned reads are counted based on overlap within each bin. This was done using the tools in HMMcopy Suite (http://compbio.bccrc.ca/software/hmmcopy/). Centromeres are filtered based on chromosome gap coordinates obtained from UCSC for hg19, including one 1 Mb bin up- and downstream of the gap. Because cfDNA fragments are short (e.g., 166 bp), paired reads of 100 bp often overlap, so that two overlapping reads can represent a single fragment. Abundance of cfDNA fragments has been shown to exhibit tissue-specific differences along local ˜200 bp scale regions of the genome (Snyder et al. Cell 164, 57-68 (2016)). For this analysis, because read counts are computed for large bins, the double-counting at ˜200 bp scale is not likely to have a major effect. To determine copy number alterations and structural rearrangements at 500 bp or smaller scales, counting coverage of fragments, rather than reads, will be more appropriate.
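The binning and centromere filtering described above can be illustrated as follows. This is a simplified sketch and not the HMMcopy Suite implementation: each read is assigned to the bin containing its start coordinate rather than by overlap, and the function names and input layout are hypothetical.

```python
from collections import Counter

BIN_SIZE = 1_000_000  # 1 Mb windows, as in the text


def count_reads_per_bin(read_starts, bin_size=BIN_SIZE):
    """Count reads per (chromosome, bin index).

    read_starts: iterable of (chromosome, 0-based start position) pairs.
    Assigning a read by its start position is a simplification of
    overlap-based counting.
    """
    counts = Counter()
    for chrom, start in read_starts:
        counts[(chrom, start // bin_size)] += 1
    return counts


def drop_gap_bins(counts, gaps, bin_size=BIN_SIZE, flank_bins=1):
    """Remove bins overlapping a gap (e.g. a centromere) plus flanking bins.

    gaps: iterable of (chromosome, gap_start, gap_end), end exclusive.
    """
    excluded = set()
    for chrom, gap_start, gap_end in gaps:
        first = gap_start // bin_size - flank_bins
        last = (gap_end - 1) // bin_size + flank_bins
        excluded.update((chrom, b) for b in range(first, last + 1))
    return {key: n for key, n in counts.items() if key not in excluded}
```

With flank_bins=1, one bin on either side of each gap is removed, matching the one-bin margin stated in the text.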
(68) Data Normalization
(69) The read counts are then normalized to correct for GC-content and mappability biases using the HMMcopy R package (Ha et al. Genome Res. 22, 1995-2007 (2012)). Briefly, two LOESS regression curve fits are performed on the bin-wise 1) GC-fraction and read counts, followed by 2) mappability uniqueness score and read counts. The curve fitting was only applied to autosomes. This generates corrected read counts r.sub.t for each bin t∈{1, . . . , T}.
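The idea behind the bias correction can be illustrated with a deliberately simpler technique than LOESS. The sketch below scales each bin by the median count of bins with similar GC fraction; this stratified-median scaling is a stand-in chosen for brevity, and it is not the HMMcopy procedure. The function name and the number of strata are assumptions.

```python
from statistics import median


def gc_correct(counts, gc_fractions, n_strata=20):
    """Divide each bin's count by the median count of bins with similar GC.

    A stratified-median substitute for the LOESS fit described in the
    text; n_strata is an arbitrary illustrative choice.
    """
    strata = [min(int(gc * n_strata), n_strata - 1) for gc in gc_fractions]
    by_stratum = {}
    for s, c in zip(strata, counts):
        by_stratum.setdefault(s, []).append(c)
    stratum_median = {s: median(v) for s, v in by_stratum.items()}
    return [c / stratum_median[s] if stratum_median[s] > 0 else float("nan")
            for s, c in zip(strata, counts)]
```

The same function could be applied a second time with mappability scores in place of GC fractions, mirroring the two successive fits described above.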
(70) Next, the gender of the patient is determined by inspecting the corrected read counts in chromosome X and Y. There are two criteria to determine if the sample is a male (otherwise the sample is a female): 1. The proportion of uncorrected chrY read counts out of the total number of reads is >0.001 and 2. The median corrected log ratio of chrX is <−0.5.
(71) If the sample is a male, then the bins in chrX are re-scaled, r.sub.t∈chrX/median (r.sub.t∈chrX).
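The two criteria and the chrX re-scaling can be expressed compactly. The following is a minimal sketch using the thresholds stated above; the function and argument names are hypothetical.

```python
from statistics import median


def is_male(chr_y_reads, total_reads, chr_x_log_ratios,
            min_chr_y_fraction=0.001, max_chr_x_median=-0.5):
    """Apply both criteria from the text; a sample failing either is female."""
    chr_y_fraction = chr_y_reads / total_reads
    return (chr_y_fraction > min_chr_y_fraction
            and median(chr_x_log_ratios) < max_chr_x_median)


def rescale_chr_x(chr_x_counts):
    """Divide corrected chrX bin counts by their median (male samples)."""
    m = median(chr_x_counts)
    return [r / m for r in chr_x_counts]
```

After re-scaling, the chrX bins of a male sample are centered on 1, so that a single copy of chrX is not called as a deletion.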
(72) ULP-WGS was also performed on cfDNA from 27 healthy donors using the same protocol in order to create a reference dataset. These data help to further normalize the cancer patient cfDNA to correct for systematic biases arising from library construction, sequencing platform, and cfDNA-specific artifacts. The healthy donor cfDNA ULP data were processed as above and also corrected for GC-content and mappability biases as above. This generated corrected read counts h.sub.it for each bin t∈{1, . . . , T} and each donor sample i. Then, the median at each bin was computed across the 27 samples to generate a reference dataset, h.sub.1:T.
(73) For a given cancer patient cfDNA sample and each bin t, the log.sub.2 copy ratios are computed as
(74) l.sub.t=log.sub.2(r.sub.t/h.sub.t)
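The reference construction and copy ratio computation can be sketched as follows, assuming that the log.sub.2 copy ratio of a bin is the patient's corrected read count divided by the healthy-donor median for that bin. The function names are hypothetical.

```python
from math import log2
from statistics import median


def reference_profile(donor_counts):
    """Per-bin median across healthy donors.

    donor_counts: list of per-donor lists, all of the same length T.
    """
    return [median(column) for column in zip(*donor_counts)]


def log2_copy_ratios(patient_counts, reference):
    """log2(patient / reference) per bin (assumed form of the copy ratio)."""
    return [log2(r / h) for r, h in zip(patient_counts, reference)]
```

A bin with twice the reference signal yields a log.sub.2 copy ratio of 1, and a bin with half the reference signal yields −1.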
ichorCNA: Copy Number Prediction and Tumor Fraction Estimation Using a HMM
Representation of tumor-normal clonality admixture
The cancer patient cfDNA CNA signal is composed of an admixture of DNA fragments derived from tumor and non-tumor cells. A 2-component mixture was used to model this explicitly (Carter et al. Nat. Biotechnol. 30, 413-421 (2012); Ha et al. Genome Res. 24, 1881-1893 (2014); Ha et al. Genome Res. 22, 1995-2007 (2012); Van Loo et al., Proc Natl Acad Sci USA 107, 1-6 (2010); Yau et al., Genome Biology 11, R92 (2010)):
observed CNA∝2n+(1−n)c
where n is the non-tumor proportion, (1−n) is the tumor proportion, and c is the copy number for a specific alteration (e.g. 1 for deletion, 3 for gain, etc.).
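A short numerical illustration of the 2-component mixture may be helpful. The sketch below treats the proportionality as an equality for the purpose of the example, which is an assumption; the function names are hypothetical.

```python
def observed_cna(n, c):
    """Two-component mixture: non-tumor cells carry 2 copies, tumor cells c."""
    return 2 * n + (1 - n) * c


def tumor_fraction_from_signal(signal, c):
    """Invert the mixture for the tumor fraction (1 - n), given c != 2.

    Follows from signal = 2 + (1 - n) * (c - 2).
    """
    return (signal - 2) / (c - 2)
```

For example, with a non-tumor proportion n=0.9, a single-copy gain (c=3) shifts the average copy number from 2 to about 2.1, which shows why low tumor fractions produce only small deviations in the observed signal.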
(75) For subclonal events, a third component is used to represent DNA fragments derived from tumor cells not harboring the CNA event (Carter et al. Nat. Biotechnol. 30, 413-421 (2012); Ha et al. Genome Res. 24, 1881-1893 (2014); Yau et al., Genome Biology 11, R92 (2010)),
observed CNA∝2n+2s(1−n)+(1−s) (1−n)c (Equation 1)
where s is the proportion of tumor not containing the event with copy number c. Thus, (1−s) is similar to the definitions of tumor-cellular-prevalence (Ha et al. Genome Res. 24, 1881-1893 (2014)) or cancer-cell-fraction for tissue tumors (Carter et al. Nat. Biotechnol. 30, 413-421 (2012)).
State space
The number of copy number states is dynamic depending on the initial average tumor ploidy ϕ∈{2, 3, 4}.
(76)
(77) The copy number states are mapped to hemizygous deletions (HETD, 1), copy neutral (NEUT, 2), copy gain (GAIN, 3), amplification (AMP, 4), and high-level amplification (HLAMP, 5-7 copies). The homozygous deletions state (HOMD, 0 copies) is excluded. For the analysis performed in this study, the copy number was fixed to be K={1, 2, 3, 4, 5} for all ploidy initializations.
(78) For subclonal events, two additional states are included: subclonal hemizygous deletion (HETD.sub.sc) and subclonal copy gain (GAIN.sub.sc)
K={K,{1,3}.sub.SC}
(79) A copy number state is assigned to G.sub.t for each bin t and the initial distribution of these copy number states is given by G.sub.0˜Mult(π).
Emission model
The input log copy ratios l.sub.1:T are modeled using a Student's-t distribution, where μ.sub.g, λ.sub.g, and v.sub.g are the mean, precision, and degrees of freedom, conditional on copy number state g∈K at bin t,
p(l.sub.t|G.sub.t=g)=St(l.sub.t|μ.sub.g, λ.sub.g, v.sub.g)
Mean μ.sub.g is defined by the 3-component mixture (Equation 1) for copy number state g with unknown global parameters n and average tumor ploidy ϕ,
(80)
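One plausible form of the emission model is sketched below. The numerator of the mean follows the 3-component mixture of Equation 1; the normalization by 2n+(1−n)ϕ and the use of a base-2 logarithm are assumptions introduced for this example. The density is the standard Student's-t parameterized by mean, precision, and degrees of freedom. Function names are hypothetical.

```python
from math import lgamma, log, log2, pi


def emission_mean(c, n, phi, s=0.0):
    """Expected log2 ratio for copy number c.

    Numerator: the three-component mixture of Equation 1.
    Denominator: 2n + (1 - n) * phi, an ASSUMED normalization by the
    average ploidy of the admixed sample.
    """
    mixture = 2 * n + 2 * s * (1 - n) + (1 - s) * (1 - n) * c
    return log2(mixture / (2 * n + (1 - n) * phi))


def student_t_logpdf(x, mu, lam, nu=2.1):
    """Log density of a Student's-t with mean mu, precision lam, dof nu."""
    return (lgamma((nu + 1) / 2) - lgamma(nu / 2)
            + 0.5 * (log(lam) - log(pi * nu))
            - (nu + 1) / 2 * log(1 + lam * (x - mu) ** 2 / nu))
```

Under these assumptions a copy-neutral state in a diploid tumor has a mean of zero, and a fully subclonal-absent event (s=1) is indistinguishable from the neutral state.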
Precision λ.sub.g for each g∈K is also a model parameter. The degrees of freedom v.sub.g is a constant (2.1) and is not estimated.
Transition model
A stationary (homogeneous) transition model is used in the HMM. Because all bins have equal sized intervals, with the exception of centromere regions, a non-stationary transition model to account for varying genomic distances between data points was not used. The transition matrix containing the transition probabilities is given by
(81)
where e is set to 0.99999.
Prior model
The HMM is implemented in a Bayesian framework with priors for each model parameter: Student's-t parameters μ.sub.g, λ.sub.g for each g∈K, transition probabilities A, initial state distribution π, and global parameters n, s, and ϕ,
n˜Beta(α.sub.n, β.sub.n)
s˜Beta(α.sub.s, β.sub.s)
ϕ˜Gamma(α.sub.φ, β.sub.φ)
λ.sub.g˜Gamma(α.sub.g, β.sub.g)
A˜Dir(δ.sub.A)
π˜Dir(δ.sub.π)
where ψ={δ.sub.A, δ.sub.π, α.sub.g, β.sub.g, α.sub.n, β.sub.n, α.sub.s, β.sub.s, α.sub.φ, β.sub.φ} for all g∈K are the hyper-parameters.
Learning and inference
The model parameters θ={μ.sub.1:|K|, λ.sub.1:|K|, A, π, n, φ} are estimated using the expectation-maximization (EM) algorithm given the data D={l.sub.1:T}. In the E-step, the forwards-backwards algorithm is applied to compute the posterior probabilities, p(G.sub.t=g|D, θ). In the M-step, the parameters θ(n) at EM iteration n are estimated using the maximum a posteriori (MAP) estimate.
(82)
The converged parameters θ^ are determined by the EM convergence criterion, which is met when the complete-data log-likelihood (including priors), F(n)=log p(D, Z|θ(n−1))+log p(θ(n)|ψ), changes by less than 0.1% (F(n)−F(n−1)<0.001). The complete-data log-likelihood at convergence is denoted F^.
(83) The Viterbi algorithm is then applied to find the optimal copy number state path for all bins,
(84)
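Decoding of the optimal state path can be illustrated with a generic Viterbi implementation. This is a minimal sketch and not the ichorCNA source code. The transition matrix places probability e on the diagonal and splits the remainder evenly among the other states; the even split is an assumption, since only the value of e is given. Function names are hypothetical.

```python
from math import log


def transition_matrix(n_states, e=0.99999):
    """Self-transition probability e; the remainder is split evenly.

    The even split of (1 - e) is an ASSUMPTION made for illustration.
    """
    off = (1 - e) / (n_states - 1)
    return [[e if i == j else off for j in range(n_states)]
            for i in range(n_states)]


def viterbi(log_emissions, trans, init):
    """Most probable state path.

    log_emissions[t][g]: log p(l_t | G_t = g); trans[i][j]: P(j | i);
    init[g]: initial state probability.
    """
    K = len(init)
    score = [log(init[g]) + log_emissions[0][g] for g in range(K)]
    back = []
    for obs in log_emissions[1:]:
        prev, score, pointers = score, [], []
        for j in range(K):
            # Best predecessor state for landing in state j at this bin.
            best = max(range(K), key=lambda i: prev[i] + log(trans[i][j]))
            pointers.append(best)
            score.append(prev[best] + log(trans[best][j]) + obs[j])
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(range(K), key=lambda g: score[g])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

Because the self-transition probability is close to 1, an isolated noisy bin rarely changes the decoded state, whereas a sustained shift in log ratio across several bins does.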
(85) Chromosome 19 was excluded during parameter estimation (i.e., EM) due to a systematic decrease in log.sub.2 copy ratio values across the majority of samples for bins within chr19 after GC-content correction. As a result, the estimation of tumor fraction is not influenced by this systematic bias. However, chr19 was included in the Viterbi algorithm as part of generating a genome-wide solution.
Model selection
To avoid the local-optimum limitation of EM, multiple restarts are performed by running EM over a range of initializations for tumor fraction (n.sup.(0)∈{0.35, 0.45, 0.50, 0.65, 0.75, 0.85, 0.95}) and tumor ploidy (φ.sup.(0)∈{2, 3, 4}) parameters. The solution with the maximum complete-data log-likelihood over all initialization pairs (n.sup.(0), φ.sup.(0)) is chosen.
(86) Due to the problem of identifiability between clonal and subclonal events, which is especially challenging in ULP sequencing and the absence of allelic information, solutions with >50% of the genome harboring subclonal CNA or >70% of CNA calls being subclonal are not selected.
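The restart and filtering logic can be sketched as follows. The callable run_em and the dictionary keys it returns are hypothetical placeholders for an EM fit; only the initialization grids and the two subclonal thresholds are taken from the text.

```python
from itertools import product

N_INIT = [0.35, 0.45, 0.50, 0.65, 0.75, 0.85, 0.95]  # n(0) values from the text
PLOIDY_INIT = [2, 3, 4]                               # ploidy initializations


def acceptable(solution, max_subclonal_genome=0.5, max_subclonal_calls=0.7):
    """Reject solutions dominated by subclonal calls."""
    return (solution["subclonal_genome_fraction"] <= max_subclonal_genome
            and solution["subclonal_call_fraction"] <= max_subclonal_calls)


def select_solution(run_em):
    """Run EM from every (n0, ploidy0) pair and keep the best accepted fit.

    run_em is a hypothetical callable returning a dict with keys
    "loglik", "subclonal_genome_fraction" and "subclonal_call_fraction".
    """
    fits = [run_em(n0, p0) for n0, p0 in product(N_INIT, PLOIDY_INIT)]
    accepted = [f for f in fits if acceptable(f)]
    return max(accepted, key=lambda f: f["loglik"]) if accepted else None
```

The grid yields 21 EM runs per sample; returning None when every fit is rejected is a choice made for this sketch rather than a behavior described in the text.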
(87) Solutions with a total alteration fraction (based on bins) <0.05 and with the largest CNA event spanning <50 bins are reassigned a tumor fraction of zero.
Post-analysis correction
For better comparability between samples with varying tumor fractions, the segment median log ratio values are corrected to account for the tumor fraction of the sample. That is, for lower tumor fraction samples, the log ratio values are adjusted toward higher signal magnitudes. Given the previous definition of the emission model (Equation 2), the tumor-content-corrected log.sub.2 copy ratios r^.sub.t at bin t are computed from the observed log.sub.2 copy ratio l.sub.t and the estimated normal content (n) or tumor fraction (1−n),
(88)
If
(89)
is negative, then it is set to 2×10.sup.−9 prior to log transformation.
Run-time and complexity
The ichorCNA HMM component has O(KT) memory complexity and O(K.sup.2T) time complexity. The run-time of the algorithm for 0.1× coverage is 1 minute for read coverage computation and 1 minute for analysis using the HMM.
Other Embodiments
(90) From the foregoing description, it will be apparent that variations and modifications may be made to the invention described herein to adapt it to various usages and conditions. Such embodiments are also within the scope of the following claims.
(91) The recitation of a listing of elements in any definition of a variable herein includes definitions of that variable as any single element or combination (or subcombination) of listed elements. The recitation of an embodiment herein includes that embodiment as any single embodiment or in combination with any other embodiments or portions thereof.
(92) All patents and publications mentioned in this specification are herein incorporated by reference to the same extent as if each independent patent and publication was specifically and individually indicated to be incorporated by reference.