METHOD FOR WHOLE GENOME SEQUENCING OF PICOGRAM QUANTITIES OF DNA
20230031082 · 2023-02-02
Inventors
- Ahmed Ashour AHMED (Oxford (Oxfordshire), GB)
- Mohammad KERAMI NEJAD RANJPAR (Oxford (Oxfordshire), GB)
Cpc classification
G16B40/00
PHYSICS
C12Q2563/159
CHEMISTRY; METALLURGY
C12Q2600/166
CHEMISTRY; METALLURGY
C12Q2537/143
CHEMISTRY; METALLURGY
G16B30/00
PHYSICS
C12Q2537/143
CHEMISTRY; METALLURGY
C12Q1/6806
CHEMISTRY; METALLURGY
C12Q2563/159
CHEMISTRY; METALLURGY
C12Q1/6806
CHEMISTRY; METALLURGY
International classification
C12Q1/6874
CHEMISTRY; METALLURGY
C12Q1/6806
CHEMISTRY; METALLURGY
G16B30/00
PHYSICS
Abstract
The present invention relates to a method of whole genome sequencing of a single cell or cell-group for identification of single nucleotide variants, determining chromosome structural variations, or determining phasing information in the genome of the single cell or cell-group. Methods of preparing an indexed DNA library for sequencing of nucleic acid molecules; preparing an indexed DNA library for whole genome sequencing of single cells or cell-groups for the identification of single nucleotide variants, determining chromosome structural variations, or determining phasing information in the genome of the single cells or cell-groups; and whole genome sequencing of a single cell or cell-group to provide data for the identification of single nucleotide variants (SNVs), determining chromosome structural variations, or determining phasing information in the genome of the single cell or cell-group are also described.
Claims
1. A method of whole genome sequencing of a single cell or cell-group for identification of single nucleotide variants, determining chromosome structural variations, or determining phasing information in the genome of the single cell or cell-group, the method comprising: i) providing a multi-well array plate comprising rows and columns of reaction wells; ii) providing genomic DNA of single cells or cell-groups, wherein the genomic DNA is distributed into a plurality of reaction wells on the multi-well array plate, such that there is no more than one single-stranded genomic DNA molecule of any given locus per reaction well, iii) carrying out whole genome amplification (WGA) of each genomic DNA molecule to provide multiple copies of the genomic DNA molecule in each reaction well; iv) fragmenting the DNA molecules of each reaction well and ligating a pair of looped adapters at each end or tagmenting using transposase-delivered adapters to form adapted-DNA fragments, wherein the looped adapters or transposase-delivered adapters comprise either a Column Index (Ci) sequence or a Row Index (Ri) sequence, wherein the Ci sequence is common to each looped adapter or transposase-delivered adapter of every reaction well in a column of the multi-well array plate, or wherein each Ri sequence is common to each looped adapter or transposase-delivered adapter of every reaction well in a row of the multi-well array plate; vi) providing the indexed DNA library by performing indexing PCR on the adapted-DNA fragments, wherein the adapted-DNA fragments are amplified to form indexed PCR products using forward and reverse indexing primers, wherein either a Row Index (Ri) sequence or Column Index (Ci) sequence is introduced by each forward and reverse indexing primers onto each end of the adapted-DNA fragments, such that the resulting indexed PCR products comprise both a pair of flanking Column Index (Ci) sequences that are common to each well of a column and a pair of flanking Row Index (Ri) sequences that are common to each well of a row, and vii) sequencing the indexed DNA library to provide data for determining any single nucleotide variants, determining chromosome structural variations, or determining phasing information in the genome of the single cell or cell-group.
2. The method of whole genome sequencing of a single cell or cell-group according to claim 1, wherein the cell or cell-group is from a tissue biopsy from a subject.
3. The method of whole genome sequencing of a single cell or cell-group according to claim 1 or claim 2, wherein the cell or cell-group comprises cancerous cells, pre-cancerous cells, or suspected cancerous cells, or a combination of cells thereof.
4. The method of whole genome sequencing of a single cell or cell-group according to any preceding claim, wherein the genomic DNA comprises DNA of between about 1 and 30 cells.
5. The method of whole genome sequencing of a single cell or cell-group according to any preceding claim, wherein the DNA content of a single cell is distributed amongst wells of a single row; or the DNA content of a cell or cell group is distributed amongst wells of both rows and columns of a single multi-well array plate.
6. The method of whole genome sequencing of a single cell or cell-group according to any preceding claim, wherein the multi-well array plate comprises a 384 well plate.
7. The method of whole genome sequencing of a single cell or cell-group according to any preceding claim, wherein the a DNA polymerisation reporter molecule is provided in the amplification mix.
8. The method of whole genome sequencing of a single cell or cell-group according to any preceding claim, wherein looped adapters are provided, such that the method comprises a step of fragmenting the DNA molecules of each reaction well and a subsequent ligation reaction to ligate looped adapters to the fragmented DNA; or wherein the transposase-delivered adapters are provided such that the method comprises fragmenting the DNA molecules by the process of tagmentation.
9. The method of whole genome sequencing of a single cell or cell-group according to any preceding claim, wherein the fragmenting of the DNA molecules of each reaction well into multiple dsDNA fragments comprises direct fragmentation by enzyme.
10. The method of whole genome sequencing of a single cell or cell-group according to any preceding claim, wherein the fragmenting or tagmentation reagents are concurrently added to each well.
11. The method of whole genome sequencing of a single cell or cell-group according to any preceding claim, wherein following fragmenting of the DNA to form DNA fragments, the DNA fragments are end-repaired and dA-tailed, such that they can be ligated to the looped adapters.
12. The method of whole genome sequencing of a single cell or cell-group according to any preceding claim, wherein the looped adapters comprise an oligonucleotide having a secondary stem-loop structure, and wherein the looped adapter comprises a pair of complementary sequence regions flanking a loop region, wherein the pair of complementary sequences are arranged to hybridise with each other to form the stem-loop structure of the looped adapter.
13. The method of whole genome sequencing of a single cell or cell-group according to any preceding claim, wherein the ends of the adapted-DNA fragments are symmetrical.
14. The method of whole genome sequencing of a single cell or cell-group according to any preceding claim, wherein the looped adapters comprise a uracil in the loop region and following ligation of the looped adapters, the single-stranded region of looped DNA is cleaved at the uracil.
15. The method of whole genome sequencing of a single cell or cell-group according to any preceding claim, wherein Ci sequences are provided in the adapted-DNA fragments, and the method may additionally comprise the step of pooling the adapted-DNA fragments of each reaction well in a row prior to the indexing PCR; or wherein Ri sequences are provided in the adapted-DNA fragments, and the method additionally comprise the step of pooling the adapted-DNA fragments of each reaction well in a column prior to the indexing PCR.
16. The method of whole genome sequencing of a single cell or cell-group according to any preceding claim, wherein the adapted-DNA fragments comprise Ci sequences, and the forward and reverse indexing PCR primers each comprise Ri sequences, for providing a pair of Ri sequences in the indexed PCR product.
17. The method of whole genome sequencing of a single cell or cell-group according to any preceding claim, wherein the forward and reverse indexing PCR primers further comprise sequencing adapter sequences, such that sequencing adapters are incorporated into the indexed PCR product.
18. The method of whole genome sequencing of a single cell or cell-group according to any preceding claim, wherein the indexed DNA fragment sizes of the indexed DNA library are filtered.
19. The method of whole genome sequencing of a single cell or cell-group according to any preceding claim, wherein the method comprises determining any real SNVs in the genome of the single cell or cell-group by determining if substantially all indexed DNA library sequences originating from a single well comprise the same SNV, or if only a fraction of the indexed DNA library sequences comprise the same SNV, wherein a SNV represented in substantially all indexed DNA library sequences originating from a single well is determined to be a real SNV in the genomic DNA, and a SNV found in only a fraction of the indexed DNA library sequences originating from a single well is determined to be a false positive (FP) SNV.
20. The method of whole genome sequencing of a single cell or cell-group according to any preceding claim, wherein the method further comprises matching indexed DNA library sequences originating from a single well representing one strand of the genomic DNA with indexed DNA library sequences originating from another well representing the complementary strand of genomic DNA, wherein a SNV substantially present in all indexed DNA library sequences of both complementary strands of the genomic DNA is determined to be a real SNV, and a SNV not substantially present in all indexed DNA library sequences of both complementary strands of genomic DNA is determined to be a false positive.
21. The method of whole genome sequencing of a single cell or cell-group according to claim 19 or 20, wherein the determination is carried out in silico using BAM file data that has been generated from mapping the sequence data to a reference genome.
22. The method of whole genome sequencing of a single cell or cell-group according to claim 21, wherein in silico determinations or matching of indexed DNA sequences, and/or the calculation of probability scores is carried out by an artificial neural network (ANN) model, optionally by a multilayer perceptron.
23. The method of whole genome sequencing of a single cell or cell-group according to any preceding claim wherein the method comprises the preparation of an indexed DNA library from both a tumour cell(s), suspected tumour cell(s), or pre-cancerous cell(s), and a normal (i.e. non-cancerous) cell(s), and wherein the sequencing data from the tumour cell(s), suspected tumour cell(s), or pre-cancerous cell(s) is compared to sequencing data obtained from the normal cell(s) (i.e. non-cancerous cell(s)) taken from normal tissue, as a control.
24. The method of whole genome sequencing of a single cell or cell-group according to any preceding claim wherein a probability score of a particular nucleotide variant being a real SNV or a false positive is calculated in silico, such that a given variant nucleotide is determined to have a statistically significant probability of being a real SNV or a false positive.
25. The method of whole genome sequencing of a single cell or cell-group according to any preceding claim wherein the sequence data is provided in the form of paired-read FastQ files.
26. The method of whole genome sequencing of a single cell or cell-group according to claim 25, wherein the sequence data of the Paired-read FastQ files is trimmed for removal of adapter sequences and for quality, to provide trimmed data.
27. A method of preparing an indexed DNA library for sequencing of nucleic acid molecules the method comprising: i) providing a multi-well array plate comprising rows and columns of reaction wells; ii) providing nucleic acid molecules, wherein the nucleic acid molecules are distributed into a plurality of reaction wells on the multi-well array plate, such that there is no more than one single-stranded nucleic acid molecule from any given locus per reaction well, iii) carrying out amplification of the nucleic acid molecule to provide multiple DNA copies of the nucleic acid molecule in each reaction well; iv) fragmenting the DNA molecules of each reaction well and ligating a pair of looped adapters at each end or tagmenting using transposase-delivered adapters to form adapted-DNA fragments, wherein the looped adapters or transposase-delivered adapters comprise either a Column Index (Ci) sequence or a Row Index (Ri) sequence, wherein the Ci sequence is common to each looped adapter or transposase-delivered adapter of every reaction well in a column of the multi-well array plate, or wherein each Ri sequence is common to each looped adapter or transposase-delivered adapter of every reaction well in a row of the multi-well array plate; vi) providing the indexed DNA library by performing indexing PCR on the adapted-DNA fragments, wherein the adapted-DNA fragments are amplified to form indexed PCR products using forward and reverse indexing primers, wherein either a Row Index (Ri) sequence or Column Index (Ci) sequence is introduced by each forward and reverse indexing primers onto each end of the adapted-DNA fragments, such that the resulting indexed PCR products comprise both a pair of flanking Column Index (Ci) sequences that are common to each well of a column and a pair of flanking Row Index (Ri) sequences that are common to each well of a row, and optionally wherein the forward and reverse indexing primers further provide respective 5′ and 3′ sequencing adapters onto the indexed PCR products that are suitable for use in a sequencing reaction.
28. A method of preparing an indexed DNA library for whole genome sequencing of single cells or cell-groups for the identification of single nucleotide variants, determining chromosome structural variations, or determining phasing information in the genome of the single cells or cell-groups, the method comprising: i) providing a multi-well array plate comprising rows and columns of reaction wells; ii) providing genomic DNA of single cells or cell-groups, wherein the genomic DNA is distributed into a plurality of reaction wells on the multi-well array plate, such that there is no more than one single-stranded genomic DNA molecule of any given locus per reaction well, iii) carrying out whole genome amplification (WGA) of each genomic DNA molecule to provide multiple copies of the genomic DNA molecule in each reaction well; iv) fragmenting the DNA molecules of each reaction well and ligating a pair of looped adapters at each end or tagmenting using transposase-delivered adapters to form adapted-DNA fragments, wherein the looped adapters or transposase-delivered adapters comprise either a Column Index (Ci) sequence or a Row Index (Ri) sequence, wherein the Ci sequence is common to each looped adapter or transposase-delivered adapter of every reaction well in a column of the multi-well array plate, or wherein each Ri sequence is common to each looped adapter or transposase-delivered adapter of every reaction well in a row of the multi-well array plate; vi) providing the indexed DNA library by performing indexing PCR on the adapted-DNA fragments, wherein the adapted-DNA fragments are amplified to form indexed PCR products using forward and reverse indexing primers, wherein either a Row Index (Ri) sequence or Column Index (Ci) sequence is introduced by each forward and reverse indexing primers onto each end of the adapted-DNA fragments, such that the resulting indexed PCR products comprise both a pair of flanking Column Index (Ci) sequences that are common to each well of a column and a pair of flanking Row Index (Ri) sequences that are common to each well of a row, and optionally wherein the forward and reverse indexing primers further provide respective 5′ and 3′ sequencing adapters onto the indexed PCR products that are suitable for use in a sequencing reaction.
29. A method of whole genome sequencing of a single cell or cell-group to provide data for the identification of single nucleotide variants (SNVs), determining chromosome structural variations, or determining phasing information in the genome of the single cell or cell-group, the method comprising: i) preparing an indexed DNA library by carrying out the method according to claim 27 or 28, or providing an indexed DNA library prepared in accordance with claim 27 or 28; ii) sequencing the indexed DNA library to provide data for determining any single nucleotide variants (SNVs), determining chromosome structural variations, or determining phasing information in the genome of the single cell or cell-group.
Description
[0121] Embodiments of the invention will now be described in more detail, by way of example only, with reference to the accompanying drawings.
[0122]
[0123]
[0124]
[0125]
[0126]
[0127]
[0128]
[0129]
[0130]
[0131]
[0132]
[0133]
[0134]
[0135]
[0136]
[0137]
[0138]
[0139]
EXAMPLE 1—REVEALING ACTIVE MUTATIONAL PROCESSES IN TUMOURS USING DIGIPICO/MUTLX AT UNPRECEDENTED ACCURACY
[0140] Summary
[0141] Bulk whole genome sequencing (WGS) enables the analysis of tumour evolution but, because of depth limitations, can only identify old mutational events. The discovery of current mutational processes for predicting the tumour's evolutionary trajectory requires dense sequencing of individual clones or single cells. Such studies, however, are inherently problematic because of the discovery of excessive false positive mutations when sequencing picogram quantities of DNA. Data pooling to increase the confidence in the discovered mutations, moves the discovery back in the past to a common ancestor. Here we report a robust whole genome sequencing and analysis pipeline (DigiPico/MutLX) that virtually eliminates all false positive results while retaining an excellent proportion of true positives. Using our method, we identified, for the first time, a hyper-mutation (kataegis) event in a group of ˜30 cancer cells from a recurrent ovarian carcinoma. This was unidentifiable from the bulk WGS data. Overall, we propose DigiPico/MutLX method as a powerful framework for the identification of clone-specific variants at an unprecedented accuracy.
[0142] Introduction
[0143] In this work we developed a single DNA molecule WGA and sequencing approach to obtain high-quality and data-rich sequencing results from picogram quantities of DNA obtained from clinical samples (we termed DigiPico; for Digital sequencing of Picograms of DNA). Moreover, we implemented a complementary analysis workflow for DigiPico data using an artificial neural network (ANN)-based algorithm (MutLX, for Mutation Learn) to eliminate false positive results while maintaining excellent sensitivity for true positive mutations on a whole genome scale. We validate our approach using data from an extensively sequenced tumor from a single patient with a cumulative depth of ˜4200× obtained from 45 whole genome sequencing runs on DNA from three different time points. We show the versatility of the methods by sequencing samples from 4 additional cancer patients and a lymphoblastoid cell line.
[0144] Material and Methods
[0145] Patient Samples and Consent
[0146] Patients #11152, #11502 and #11513 provided written consent for participation in the prospective biomarker validation study Gynaecological Oncology Targeted Therapy Study 01 (GO-Target-01) under research ethics approval number 11/SC/0014. Patient OP1036 participated in the prospective Oxford Ovarian Cancer Predict Chemotherapy Response Trial (OXO-PCR-01), under research ethics approval number 12/SC/0404. Necessary informed consents from study participants were obtained as appropriate. Blood samples were obtained on the day of surgery. Tumour samples were biopsied during laparoscopy or debulking surgery and were immediately frozen on dry ice. All samples were stored in clearly labelled cryovials in −80° C. freezers.
[0147] Cell Lines
[0148] GM12885 lymphoblastoid cell line (RRID:CVCL_5F01) was obtained from Coriell institute and cells were kept in culture as recommended by the provider.
[0149] Sectioning and LCM
[0150] Frozen tumour samples were embedded in OCT (NEG-50, Richard-Allan Scientific) and 10-15 μm sections were taken using MB DynaSharp microtome blades (ThermoFisher Scientific) in a CryoStar cryostat microtome (ThermoFisher Scientific). Tumour sections were then transferred to PEN membrane glass slides (Zeiss) and were immediately stained on ice (2 minutes in 70% ethanol, 2 minutes in 1% Cresyl violet (Sigma-Aldrich) in 50% ethanol, followed by rinse in 100% ethanol. A PALM Laser Microdissection System (Zeiss) was used to catapult individual tumour islets into a 200 μl opaque AdhesiveCap (Zeiss).
[0151] Standard WGS and Data Analysis
[0152] DNA was extracted using DNeasy blood and tissue kit (Qiagen). Up to 1 μg DNA was diluted in 50 μl of water for fragmentation using a Covaris S220 focused-ultrasonicator instrument to achieve 250-300 bp fragments. The resulting DNA fragments were then used for library preparation using NEBNext Ultra II library preparation kit (NEB), following the manufacturer's protocol. The resulting libraries were sequenced on Illumina NextSeq or HiSeq platforms at a depth of 30-40× over human genome. Sequencing reads in the FastQ format were initially trimmed using TrimGalore (14) and were then mapped to human hg19 genome using Bowtie2 (15). Germline variant calling was performed using GATK's HaplotypeCaller (16). Somatic variants were called using Strelka2 with a variant allele fraction cut-off of 0.2 (17).
[0153] DigiPico Sequencing
[0154] 200 pg of purified DNA, 20-30 resuspended nuclei, or laser-capture micro-dissected tumour islets, were first denatured using 5 μl of D2 buffer from Repli-g single cell kit (Qiagen). After 5 minutes incubation at room temperature, 95 μl of water was added to the sample and then using Mosquito HTS liquid handler (TTP Labtech) 200 nl of the denatured template was added to each well of a 384-well reaction plate already containing 800 nl of WGA mix (0.58 μl Sc Reaction Buffer, 0.04 μl Sc Polymerase (REPLI-g Single Cell kit, Qiagen), 0.075 μl 1 mM dUTP (Invitrogen), 0.04 μl EvaGreen 20× (Biotium), and 0.065 μl water). The plate was incubated at 30° C. for 2 hours followed by heat inactivation at 65° C. for 15 minutes. Addition of EvaGreen in the reaction allows for monitoring of the WGA reaction using a real-time PCR machine if required (18). Next, controlled enzymatic fragmentation (19) reaction steps were sequentially performed on the whole genome amplified DNA without any purification steps. Briefly, (A) 1200 nl of UDG mix (0.08 U/μl rSAP (NEB), 0.2 U/μUDG (NEB), 0.4 U/μl EndoIV (NEB) in 1.8× NEBuffer 3) was added with 2 hours incubation at 37° C. and heat inactivation at 65° C. for 15 minutes. (B) 1200 nl of Poll mix (0.4 U/μl DNA Polymerase I (NEB) 0.25 mM dNTP, 8 mM MgCl2, and 0.8 mM DTT) was added with 1.5 hours incubation at 37° C. and heat inactivation at 70° C. for 20 minutes. (C) 1200 nl of Klenow mix (0.5 U/μl Klenow exo− (NEB), 0.5 mM dATP, 8 mM MgCl2, and 0.8 mM DTT) was added with 45 minutes incubation at 37° C. and heat inactivation at 70° C. for 20 minutes. (D) 400 nl of 20 μM full-length Illumina adapter oligos (Table S1) with well specific indices were added to each well followed by the addition of 1100 nl of Ligation mix (40 U/μl T4 DNA Ligase (NEB), 5 mM ATP, 11.5% PEG 8000 (Qiagen), and 6.8 mM MgCl2) and with 30 minutes incubation at 20° C. and heat inactivation at 65° C. for 15 minutes.
[0155] The resulting products were then pooled and the DNA was precipitated using an equal volume of isopropanol. DNA was then resuspended in water and the products were dual-size selected using Agencourt AMPure XP SPRI magnetic beads (Beckmann coulter) with 0.45× bead ratio for the left selection and an additional 0.32× for the right selection. The purified DNA was then resuspended in water and was immediately used for limited-cycle PCR amplification using the P5 and P7 primer mix (Table S1). PCR was performed for 12 cycles with 10 seconds annealing at 55° C. and 45 seconds extension at 72° C. Final products were bead purified at 0.9× ratio. The resulting libraries were then sequenced on Illumina sequencing platforms in 2×150 paired-end sequencing mode to achieve a depth of coverage of 30-40× over human genome. The additional processing steps required for the DigiPico library preparation, at present, adds nearly £250 to the total reagent costs.
[0156] Analysis of DigiPico Sequencing Data
[0157] The analysis pipeline of DigiPico sequencing data is presented in Supplementary
[0158] MutLX Algorithm
[0159] The MutLX analysis pipeline is summarised in
[0160] Artificial Neural Network Architecture
[0161] The neural network model used in this study is a multilayer perceptron with an input layer consisting of N neurons (N=41) where N is the number of features used in each experiment and was implemented in Python3 using Keras (24). The model has two hidden layers with ReLU activations. We varied these numbers but did not see any significant improvement when using larger numbers of neurons. The last layer is a single output neuron with a sigmoid activation. The loss function is binary cross-entropy. For training, we applied a stochastic gradient descent optimization with momentum (Adam (25)) with a learning rate of 0.001, a batch-size of 8 and for 10 epochs. After 10 epochs we did not observe any additional improvement in performance.
[0162] Features Used for Training
[0163] The following features, extracted from the Platypus output of the DigiPico data, were used as the input of neural network model:
[0164] Platypus quality parameters: QUAL, BRF, FR, HP, HapScore, MGOF, MMLQ, MQ, QD, SbPval, NF, NR, TCF, and TCR (21).
[0165] Sequence context complexity: F.sub.20[1], F.sub.20[2], F.sub.20[3]. Where F.sub.20[i] is the sum of the frequency of the i most abundant nucleotides in the 10 bp sequence on either side of the variant position.
[0166] Read distribution data: R.sub.merge[ref+var], R.sub.merge[var], W[R[ref]>0s & R[var]=0], W[R[ref]>0][0/0], W[R[ref]>0][0/1], W[R[var]>0], W[R[var]>0][1/1], W[R[var]>0][0/1], W[R[var]>0 & R[ref]=0][0/1], W[R[ref]>0 & R[var]>0], W[R[ref]=0 & R[var]>0], W[R[ref]=0 & R[var]>1], W[R[ref]=0 & R[var]>2], W[R[ref]=0 & R[var]>3], W[R[ref]=0 & R[var]>4], W[R[ref]=0 & R[var]>5], R.sub.max[1][var], R.sub.max[2][var], R.sub.max[3][var], R.sub.max[1][ref+var], R.sub.max[2][ref+var], R.sub.max[3][ref+var], Max.sub.c+Max.sub.r, W[R[var]>0]−(Max.sub.c+Max.sub.r). Where R.sub.merge[x] indicates the total number of reads in the merged bam file supporting the allele x (ref indicates reference allele, var indicates variant allele). W[i][j] indicates the number of wells matching criteria i with reported genotype j, where indicated. Where in criteria i the R[x] indicates the number of reads in the specific well supporting allele x. R.sub.max[y][x] shows the number of reads in the well with the y.sup.th highest number of reads supporting allele x. Lastly, Max.sub.c is the number of variant supporting wells in the column with the highest number of wells supporting the variant allele and Max, is the number of variant supporting wells in the row with the highest number of wells supporting the variant allele.
[0167] Training Using MutLX
[0168] For each DigiPico run, we consider a full training set as the collection of all UTD variants (labelled as 0) and heterozygous germline SNPs (labelled as 1). The number of UTD variants in this set is much smaller than heterozygous germline SNPs, making the set imbalanced. Therefore, in order to avoid bias towards a specific label in the training we create 25 different balanced training subsets for each DigiPico run. This is done so that each training subset is composed of all UTD variants and a randomly selected subset of heterozygous germline SNPs with a size equal to the number of UTD variants. As explained previously, the majority of UTD variants are FP variant calls with an unknown ratio of true clone-specific variants among them, hence making the 0 labels noisy. To perform two-step training considering these noisy labels, we employ the following strategy. After training an initial model on each balanced training subset, the resulting model is applied to the mutations in the full training set to obtain an initial probability value for each mutation. These probability values indicate the predicted probability of a mutation belonging to label 1 category. Hence, any 0 labelled mutation that attains a predicted probability value close to 1, is likely to be a mislabelled mutation. Therefore, to reduce the level of mislabelled data in the training set, all UTD variants with a probability value of more than 0.7 and all germline SNPs with a probability value of less than 0.3 are considered mislabelled and are removed from the training set. The cut-off values in this step were empirically determined by the analysis of various simulated datasets. In the end, following a similar sub-sampling strategy as in the initial training, a new model is trained on the remaining mutations of the training set. This model is then used for the analysis of all UTD variants.
[0169] Calculation of Probability and Uncertainty Scores
[0170] As explained earlier, in MutLX the training process is repeated 25 times with different randomly selected germline SNP subsets, resulting in different models each time and hence 25 different predicted probability values for each mutation. We therefore defined the “probability score” for each mutation as the average of all of its predicted probability values:
[0171] Where P.sub.i is the probability value obtained from the i.sup.th training subset and n represents the number of subsets (n=25).
[0172] Moreover, to obtain an uncertainty estimation for each probability value we performed a test-time drop-out analysis (26). The trained model was applied to each mutation for 100 iterations during which, different neurons were dropped out with a rate of 0.8 and 0.7 for the first and the second hidden layers of the neural network, respectively. This process resulted in 100 probability values for each mutation. Based on these values, we defined the “uncertainty score” for each mutation as the average of the dropout variances from the 25 different subsets:
[0173] Where σ.sub.i.sup.2 is the variance of 100 probability values obtained from the dropout analysis of the i.sup.th training subset and n represents the number of subsets (n=25).
[0174] The uncertainty scores of all variants with a probability score above 0.2 (
[0175] Generation and Analysis of Simulated DigiPico Datasets
[0176] Simulated data were used to: (a) validate the negative correlation between the number of true UTDs and AUC (
[0177] To generate simulated datasets, we first identified somatic mutations in the bulk WGS data of the tumour sample PT2R from patient #11152 using Strelka2 somatic variant caller. These somatic variants were then identified in the de novo variant calling data of run D1110 and any somatic variant with a Tw>6 and Vw/Tw>0.45 was selected as a high-confidence somatic variant. Next, various numbers of randomly selected high-confidence somatic variants were artificially mislabelled as UTDs (UTD*) to achieve 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, and 0.1 UTD*/UTD ratios. The resulting synthetic list of variants were then independently used for MutLX analysis and the number of UTDs and UTD*s that passed the MutLX filtering were calculated for each run. To ensure a robust analysis, for each ratio 10 different subsets of the somatic variants were analysed. A similar analysis was also performed on the DigiPico data DE111, obtained from the bulk DNA extraction from an ascites sample of patient #11513.
[0178] Validation of MutLX Algorithm
[0179] Tumour sample PT2R from patient #11152 was used for the validation of the MutLX algorithm. A small piece of the tumour was macro-dissected from a frozen specimen and was embedded in OCT medium for sectioning. The first section (15 μm) from the tumour was collected in a separate tube and the nuclei were resuspended in 50 μl of sterile PBS solution. Total number of nuclei in the suspension was measured and a volume containing 30 nuclei was used for direct denaturation with an equal volume of D2 buffer from Repli-g mini WGA kit (Qiagen). The resulting crude lysate was directly used for DigiPico library preparation for run D1111. The remainder of the tumour sample was then used for bulk DNA extraction using DNeasy blood and tissue kit (Qiagen). 200 pg of the resulting DNA was directly used to prepare DigiPico library D1110. 1 μg of the DNA was used for standard library preparation using NEBNext Ultra DNA library preparation kit (NEB). In this setting only run D1111 is expected to have true clone-specific variants. Since the template for run D1110 is a subset of the template used in the bulk WGS analysis, nearly all real variants in run D1110 will also be present in the WGS data at similar frequencies and therefore will not be identified as UTDs. A similar logic is also applicable to the results of DigiPico runs DE011 and GM12885. Since both of these DigiPico runs had been performed on 200 pg of DNA from bulk DNA extractions, no true UTD variants are expected to be present in these samples. It is also worth noting that because of the digitized nature of the data, variants with very low frequencies (<0.05%) will show an inflated variant allele fraction in runs D1110, DE011, and GM12885, however, since such variants are unlikely to appear in more than one well, they will be eliminated from the data based on the Vw filter. Therefore, it is safe to assume that nearly all UTD variants in these runs are FP calls.
[0180] Application of SCcaller on DigiPico Data
[0181] SCcaller has originally been developed for the analysis of multiple displacement amplified single cell sequencing data (11). Since DigiPico library preparation also requires multiple displacement amplification on limiting amounts of template DNA, the resulting data are fundamentally similar to the natural input for SCcaller. Therefore, we used the merged bam file of DigiPico data as an input for SCcaller. For the analysis, the list of heterozygous SNPs was obtained from the respective bulk WGS data using GATK HaplotypeCaller and cut-off values were used for alpha=0.01. Next, all filtered SNVs were used for variant re-calling on respective standard WGS data and all variants that were confidently unsupported by WGS data were extracted as UTD variants.
[0182] Mutation Validation
[0183] Variants that pass the MutLX analysis were validated by comparison with deep sequencing data of the bulk tumour from an independent sequencing platform. All DigiPico data from patient #11152 were validated through comparison with 39 deep sequencing datasets obtained from the same tumour masses sequenced on Complete Genomics sequencing platform (27). This included three Complete Genomics bulk sequencing and 36 LFR (Long-Fragment Read) sequencing data. Since the independent sequencing data for the omental mass were not obtained from exactly the same tumour mass as the ones that were used for DigiPico sequencing the validation rate by such a comparison for these runs is not expected to be high.
[0184] For targeted validation, primers were designed to obtain amplicons containing the variants using the primer3 tool (Table 51). Amplicons were obtained by performing a 2-step PCR using Phusion® High-Fidelity PCR Master Mix with GC Buffer for 16 cycles on 1 ng of template. All amplicons from each sample were then pooled and purified before adapter ligation and indexing using NEBNext Ultra II kit. The resulting libraries were sequenced on a MiSeq platform. Sequencing results were mapped to human hg19 genome using Bowtie2 and the number of reads supporting each variant was counted using Platypus variant caller.
[0185] Local Hyper-Mutation (Kataegis) Analysis
[0186] To generate the rainfall plots, the distances between pairs of consecutive somatic mutations on chromosome 17 were plotted against their genomic position of the second mutation in each pair using a custom script in R. The presence of clusters of localized mutations indicates kataegis events. In these plots, each dot is coloured based on the mutation type of the second mutation in the pair in respect to the hg19 human reference genome.
[0187] Results
[0188] Implementation of DigiPico Sequencing Approach
[0189] A key feature of amplification errors with or without prior DNA damage is that they are introduced at random during the amplification process (6, 7, 28). We, therefore, hypothesized that when amplifying and sequencing a single DNA molecule an artefactual mutation would be present in only a fraction of the reads that have resulted from sequencing the original single DNA molecule. In contrast, genuine variants would be expected to be present in all such reads. Partitioning of the template DNA into individual compartments prior to WGA, such that each compartment receives no more than one DNA molecule from every locus would result in such single DNA molecule sequencing data (
[0190] To fully benefit from the data-richness of a partitioning and sequencing approach for accurate genomics study of clinical samples, we developed DigiPico sequencing (
[0191] DigiPico Sequencing Platform Generates High Quality Libraries from Limited Clinical Samples
[0192] Having optimized all the necessary aspects of DigiPico library preparation process, we decided to assess the quality of DigiPico libraries obtained from clinical samples. For this purpose, we prepared DigiPico libraries D1110 and D1111 from a frozen recurrent tumour sample (PT2R) obtained from a high-grade serous ovarian cancer patient (#11152). In this experiment, while D1110 library was prepared from 200 pg of template taken from a bulk DNA extraction of the PT2R sample, the D1111 library preparation was directly performed on a small frozen section of the remainder of this tumour sample (containing nearly 30 cancer cells). Each library was sequenced on an Illumina NextSeq platform to obtain nearly 400,000,000 reads in 150×2 paired-end format. The initial assessment of the obtained sequencing data revealed that both the D1110 and D1111 libraries have resulted in high quality sequencing data with an overall mapping rate of 91.35% and 94.27% on human hg19 genome, respectively (
[0193] Lastly, we assessed whether our initial hypotheses regarding the distinctive distribution pattern of different mutation types hold true in actual DigiPico datasets. For this purpose, we assumed that any variant that is shared between a DigiPico dataset and the standard bulk sequencing data of the same tumour sample must be a true variant. These should mainly consist of germline SNPs and clonal somatic variants. As a result, by definition, all FP variant calls and the majority of clone-specific mutations (had they existed in the sample under study) will be among variants that are only present in the DigiPico data and not in the bulk WGS data. These variants are referred to as UTD (Unique to DigiPico) for simplicity, hereafter. Consequently, given that the standard bulk sequencing data of the PT2R sample had been obtained from the same DNA extract that was used for D1110 library preparation, nearly all the UTD variants in D1110 DigiPico run ought to be artefacts (
[0194] MutLX Analysis Pipeline for DigiPico Data
[0195] Having obtained high-quality data using DigiPico sequencing, we decided to implement an analysis pipeline to eliminate FP variant calls based on the distribution pattern of mutations. As mentioned earlier, ANN algorithms are ideally suited for problems with such complex patterns. Given a representative set of correctly labelled examples (training set), an ANN can learn to classify mutations without the need for any class-specific information. However, there are two main issues in implementing ANN algorithms for the problem of eliminating FP mutations from sequencing data; (a) the difficulty in obtaining a generalizable model and (b) unavailability of representative accurately labelled training sets. First, it is not possible to generate a model that is generalizable for the analysis of every DigiPico dataset because the distribution pattern of mutations depends on various run-specific initial conditions that cannot easily be accounted for (e.g. the copy number state of the genome). Therefore, run-specific models tailored to each DigiPico run will be required. This means that subsets of run-specific mutations need to be selected as a training sets for each DigiPico run. Second, while correctly labelled examples of true mutations can easily be extracted from known SNPs in the genome, identifying a representative and accurate set of examples for artefactual mutations is not possible. To address this issue, we considered UTDs as a reasonable approximation for a representative set of artefactual mutations, assuming that UTDs are predominantly composed of such mutations. This assumption, however, can result in a key challenge. UTDs by definition are composed of artefactual mutations as well as true clone-specific mutations. While artefactual mutations are expected to be abundantly present in all DigiPico runs, true clone-specific mutations may be present at different frequencies depending on the sample (
[0196] Considering all the aforementioned limitations and issues, we designed and implemented an ANN-based binary classifier, MutLX, for the analysis of DigiPico datasets. The focus of the DigiPico analysis pipeline was set for effective elimination of FP calls and accurate identification of true clone-specific variants from UTDs. To address the issue of training with imperfect training sets, we employed the following approach in training MutLX. Initially we considered all UTD variants as examples for artefactual mutations (labelled as 0s) and a similar number of randomly selected heterozygous germline SNPs as example for true variants (labelled as 1s). Since the majority of UTD variants are FP calls with an unknown ratio of true clone-specific variants, the 0 labels are considered to be “noisy” at this stage. In other words, while true-clone specific variants, must have been labelled as 1, because of their anonymity at this stage, they are labelled as 0 among others. To accommodate for this type of noise in the training dataset we employed a two-step training process (
[0197] Validation of the MutLX Algorithm
[0198] To validate our strategy, we chose to test the MutLX analysis pipeline on runs D1110 and D1111. This is because these DigiPico runs had been obtained from a HGSOC that was previously extensively sequenced with data available from 48 independent whole genome sequencing data sets across three different time points (patient #11152) at a total depth of approximately 4200× from two independent sequencing platforms (33). To our knowledge, this comprises the most extensively whole genome sequenced tumour to date. This exceptionally large dataset allows for reliable cross-validation of mutations in this tumour. For this purpose, we used the MutLX algorithm to analyse the sequencing data from runs D1110 and D1111. As explained previously, when using the bulk sequencing data from the PT2R site for comparison with these DigiPico datasets, true UTD variants (clone-specific variants) are only expected to be present in run D1111, while nearly all UTDs in run D1110 are expected to be artefacts (
[0199] Additionally, we investigated whether the presence of true clone-specific mutations could compromise the sensitivity of the model due to over-fitting. For this purpose, we artificially mislabelled varying numbers of somatic mutations in runs D1110 and DE111 as artificial UTD variants (UTD*) to generate synthetic datasets with various ratios of true UTDs. These synthetic datasets were then independently analysed by MutLX and the FP rate as well as the recovery rate of UTD*s at varying UTD*/UTD ratios were examined in all the synthetic datasets. The results showed that a UTD*/UTD ratio as high as 10% does not significantly affect the recovery rate of UTD* variants, indicating that overfitting does not occur in MutLX (
[0200] Versatility of DigiPico/MutLX Sequencing and Analysis Approach
[0201] Finally, to ensure the versatility of our proposed method, DigiPico sequencing was performed on various sources of template DNA from four different HGSOC patients and the resulting UTDs were analysed using MutLX algorithm. The results clearly indicate that MutLX can reliably identify and eliminate the artefactual variant calls from a diverse set of DigiPico libraries (Table 1). This strongly suggests that DigiPico/MutLX can effectively enable the study of recently acquired mutations in solid tumours. Importantly, analysing the frequency of different mutation types in these data indicated the presence of a higher level of C>A mutations among the identified artefactual mutations, consistent with the notion that such FP calls are a result of oxidative damage to the template DNA (
[0202] The Study of Active Mutational Processes Using DigiPico/MutLX
[0203] We next tested the feasibility of studying mutational processes in a patient with HGSOC (#11152). For this patient, various sequencing data from a pre-chemotherapy omental mass were available (standard bulk sequencing at 30× as well as five tumour islet DigiPico runs). The patient subsequently had a recurrence and tumour samples were collected from the pelvis (pelvic recurrent tumour; PT2R) and from the para-aortic lymph node (PALNR) for standard bulk sequencing as well as DigiPico sequencing of tumour islets. The analysis of the bulk pre-chemotherapy sequencing data identified 13,721 somatic mutations. 84.6% of these mutations were present in at least three tumour islets from DigiPico data, 91.4% of which were also present in at least three additional islets from previously published LFR data (33). The high occurrence of the mutations indicates that they were early mutations that became fixed in the tumour. The analysis of DigiPico data from tumour islets revealed that there was a limited number of clone-specific mutations that were absent in the bulk tumour. Each of the five pre-chemotherapy islets harboured a number of truly unique mutations (2, 6, 8, 8, and 36), compared to other islets, indicating that they were recent occurrences (
[0204] Discussion
[0205] In this work we presented DigiPico/MutLX as an integrated platform for the identification of mutations from small groups of cells with unprecedented accuracy on a whole genome scale. We believe that this work provides an important stepping stone for the discovery of current or recent somatic mutational processes that occur in cancer and normal tissue. Understanding current mutational processes is key for predicting the evolutionary trajectory of a tumour and, potentially, for interfering with such trajectories therapeutically. A mutation that is identified in bulk sequencing of a tumour must have occurred at a point during the extended history of a tumour from the initiation till presentation. In contrast, a cell-specific mutation must have occurred during the limited lifetime of that cell. Similarly, a mutation in a small clone that has been derived from a single cell is also recent. The age of such a mutation can't be more than the age of the clone which is defined by the number of cell divisions it took to generate that clone. Studying patterns in cell-specific or small-clone-specific mutations can allow for the identification of recent or current mutational processes (1). Defining such processes is highly desirable since they can be causally linked to biological or chemical phenomena and, therefore, yield significant mechanistic insights. Identifying these mechanisms have important practical implications since they are potentially amenable for therapeutic intervention or for predicting future tumour behaviour. The current state of the art does not allow the direct accurate identification of mutations from individual cells or individual small clones from tumours. DigiPico/MutLX enables this endeavour for the first time.
[0206] To overcome significant technical pitfalls predominantly related to the discovery of false positive mutations, current methods for single cell WGS analysis either require extensive validation studies (11) or rely on combining data from multiple cells to obtain reliable mutations that are shared between cells (12, 34). These cells are then grouped into clones that have been derived from a common ancestor. While such techniques go to a more recent common ancestor compared to bulk sequencing, they are still not ideal as data derived from these approaches do not reflect mutational processes that are taking place in existing cells. Furthermore, reducing the depth of sequencing per cell to enable sequencing large numbers of cells, reduces the breadth of coverage, which is already compromised by loss of genetic materials during preparation steps. This increases the number of cells that are needed to be analysed for inferring and identifying clones which in turn moves the ancestor further back into the past. In addition, the lack of information about physical relatedness in single cell analysis methods, results in loss of an opportunity to group cells that are likely to be from a single clone. This increases the gap between the ancestor of an inferred clone and the present time, making it difficult to define processes that are active within currently existing cells in a tumour.
[0207] DigiPico/MutLX has the distinct advantage of enabling the preservation of spatial information. Analysing spatially-related cells, preserves physical relatedness and enables the assumption that physically related cells belong to an individual clone (9). Defining distinct structures that may have arisen from a tissue resident stem cell has also been suggested to identify and analyse clones. For example, cells from a single small intestinal crypt or a single endometrial gland could be reasonable expected to come from a single tissue-resident stem cell (35, 36). Under these circumstances, each anatomical unit defines a clone that may or may not have clone specific mutations that can be related to a mutational driver. Furthermore, sequencing data from a clone can be computationally used to infer subclones and predict more recent events that may have arisen within a clone. This is akin to what bulk sequencing and analysis achieves but at the level of a single clone that is composed of a limited number of cells. Preserving spatial information is also particularly interesting because of the recent developments in enabling spatial transcriptomics technologies (37). It is conceivable that combining highly accurate DNA sequencing with spatial transcriptomics would allow the dissection of genetic and non-genetic heterogeneity in tissues. In short, current technologies, for the analysis of small clones yield large number of false positive results making it impossible to obtain direct accurate clone specific information on a genome scale without exhausting validation. Combing data from multiple clones, is a common solution but moves the ancestor further back into the past. We have previously used this approach for the analysis of small collection of tumour cells (tumour islets) (33). Because of the uncertainty associated with the mutation calls from individual islets, it was necessary to only call mutations that were shared between all tumour islets and effectively identify only truncal mutations. This was then followed by independent validation of some 700 mutations using targeted sequencing. While this still yielded important biological insights, we were unable to study islet-specific mutations. DigiPico/MutLX is now enabling the study of such mutations. We demonstrated how the direct analysis of DNA from ˜30 cancer cells, resulted in the confident identification of a sub-clonal kataegis event.
[0208] Overall, here we showed that DigiPico and MutLX can enable hyper-accurate identification of somatic mutations from limiting numbers of cells obtained from clinical samples, as an important improvement over the existing methodologies. Moreover, unlike other computational methods that rely on diploid regions of the genome to calculate amplification biases, our method is also compatible with genomes that suffer from extensive copy number alterations, such as in HGSOC. We believe that the versatility of the DigiPico/MutLX method enables the study of active mutational processes in tumours as well as in normal tissues.
[0209] Availability
[0210] Source code for MutLX is available on Github (https://github.com/mmdknr/DigiPico).
[0211] Accession Numbers
[0212] All sequencing data used in this study are available on EGA (EGAD00001005118).
REFERENCES
[0213] 1. Turajlic, S., Sottoriva, A., Graham, T. and Swanton, C. (2019) Resolving genetic heterogeneity in cancer. Nat. Rev. Genet., 10.10381s41576-019-0114-6. [0214] 2. Zhang, J., Spath, S. S., Marjani, S. L., Zhang, W. and Pan, X. (2018) Characterization of cancer genomic heterogeneity by next-generation sequencing advances precision medicine in cancer treatment. Precis. Clin. Med., 1, 29-48. [0215] 3. Gerstung, M., Jolly, C., Leshchiner, I., Dentro, S. C., Gonzalez, S., Mitchell, T. J., Rubanova, Y., Anur, P., Rosebrock, D., Yu, K., et al. (2017) The evolutionary history of 2, 658 cancers. bioRxiv, 10.1101/161562. [0216] 4. Barber, L. J., Davies, M. N. and Gerlinger, M. (2015) Dissecting cancer evolution at the macro-heterogeneity and micro-heterogeneity scale. Curr. Opin. Genet. Dev., 30, 1-6. [0217] 5. Bohrson, C. L., Barton, A. R., Lodato, M. A., Rodin, R. E., Luquette, L. J., Viswanadham, V. V, Gulhan, D. C., Cortes-Ciriano, I., Sherman, M. A., Kwon, M., et al. (2019) Linked-read analysis identifies mutations in single-cell DNA-sequencing data. Nat. Genet., 10.1038/s41588-019-0366-2. [0218] 6. Chen, L., Liu, P., Evans, T. C. J. and Ettwiller, L. M. (2017) DNA damage is a pervasive cause of sequencing errors, directly confounding variant identification. Science, 355, 752-756. [0219] 7. Costello, M., Pugh, T. J., Fennell, T. J., Stewart, C., Lichtenstein, L., Meldrim, J. C., Fostel, J. L., Friedrich, D. C., Perrin, D., Dionne, D., et al. (2013) Discovery and characterization of artifactual mutations in deep coverage targeted capture sequencing data due to oxidative DNA damage during sample preparation. Nucleic Acids Res., 41, e67. [0220] 8. Nik-Zainal, S., Alexandrov, L. B., Wedge, D. C., Van Loo, P., Greenman, C. D., Raine, K., Jones, D., Hinton, J., Marshall, J., Stebbings, L. A., et al. (2012) Mutational processes molding the genomes of 21 breast cancers. Cell, 149, 979-993. [0221] 9. Martincorena, I., Fowler, J. C., Wabik, A., Lawson, A. R. J., Abascal, F., Hall, M. W. J., Cagan, A., Murai, K., Mahbubani, K., Stratton, M. R., et al. (2018) Somatic mutant clones colonize the human esophagus with age. Science, 362, 911-917. [0222] 10. Tubbs, A. and Nussenzweig, A. (2017) Endogenous DNA Damage as a Source of Genomic Instability in Cancer. Cell, 168, 644-656. [0223] 11. Dong, X., Zhang, L., Milholland, B., Lee, M., Maslov, A. Y., Wang, T. and Vijg, J. (2017) Accurate identification of single-nucleotide variants in whole-genome-amplified single cells. Nat. Methods, 14, 491-493. [0224] 12. Zafar, H., Wang, Y., Nakhleh, L., Navin, N. and Chen, K. (2016) Monovar: single-nucleotide variant detection in single cells. Nat. Methods, 13, 505-507. [0225] 13. Chen, C., Xing, D., Tan, L., Li, H., Zhou, G., Huang, L. and Xie, X. S. (2017) Single-cell whole-genome analyses by Linear Amplification via Transposon Insertion (LIANTI). Science, 356, 189-194. [0226] 14. Krueger F. (2016) Trim Galore! [0227] 15. Langmead, B. and Salzberg, S. L. (2012) Fast gapped-read alignment with Bowtie 2. Nat Meth, 9, 357-359. [0228] 16. McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., Daly, M., et al. (2010) The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res., 20, 1297-1303. [0229] 17. Kim, S., Scheffler, K., Halpern, A. L., Bekritsky, M. A., Noh, E., Kallberg M. Chen, X., Kim, Y., Beyter, D., Krusche, P., et al. (2018) Strelka2: fast and accurate calling of germline and somatic variants. Nat. Methods, 15, 591-594. [0230] 18. Hosokawa, M., Nishikawa, Y., Kogawa, M. and Takeyama, H. (2017) Massively parallel whole genome amplification for single-cell sequencing using droplet microfluidics. Sci. Rep., 7, 5199. [0231] 19. Peters, B. A., Kermani, B. G., Sparks, A. B., Alferov, O., Hong, P., Alexeev, A., Jiang, Y., Dahl, F., Tang, Y. T., Haas, J., et al. (2012) Accurate whole-genome sequencing and haplotyping from 10 to 20 human cells. Nature, 487, 190-195. [0232] 20. Picard Tools (2018). [0233] 21. Rimmer, A., Phan, H., Mathieson, I., Iqbal, Z., Twigg, S. R. F., Wilkie, A. O. M., McVean, G. and Lunter, G. (2014) Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications. Nat. Genet., 46, 912-918. [0234] 22. Derrien, T., Estellé, J., Marco Sola, S., Knowles, D. G., Raineri, E., Guigó, R. and Ribeca, P. (2012) Fast computation and applications of genome mappability. PLoS One, 7, e30377-e30377. [0235] 23. Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A., Handsaker, R. E., Lunter, G., Marth, G. T., Sherry, S. T., et al. (2011) The variant call format and VCFtools. Bioinformatics, 27, 2156-2158. [0236] 24. Chollet, F. and others (2015) Keras. [0237] 25. Kingma, D. P. and Ba, J. (2014) Adam: A Method for Stochastic Optimization. CoRR, abs/1412.6. [0238] 26. Gal, Y. and Ghahramani, Z. (2015) Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. arXiv e-prints. [0239] 27. Drmanac, R., Sparks, A. B., Callow, M. J., Halpern, A. L., Burns, N. L., Kermani, B. G., Carnevali, P., Nazarenko, I., Nilsen, G. B., Yeung, G., et al. (2010) Human Genome Sequencing Using Unchained Base Reads on Self-Assembling DNA Nanoarrays. Science (80-.)., 327. [0240] 28. Arbeithuber, B., Makova, K. D. and Tiemann-Boege, I. (2016) Artifactual mutations resulting from DNA lesions limit detection levels in ultrasensitive sequencing applications. DNA Res., 23, 547-559. [0241] 29. Amini, S., Pushkarev, D., Christiansen, L., Kostem, E., Royce, T., Turk, C., Pignatelli, N., Adey, A., Kitzman, J. O., Vijayan, K., et al. (2014) Haplotype-resolved whole-genome sequencing by contiguity-preserving transposition and combinatorial indexing. Nat. Genet., 46, 1343-1349. [0242] 30. Zheng, G. X. Y., Lau, B. T., Schnall-Levin, M., Jarosz, M., Bell, J. M., Hindson, C. M., Kyriazopoulou-Panagiotopoulou, S., Masquelier, D. A., Merrill, L., Terry, J. M., et al. (2016) Haplotyping germline and cancer genomes with high-throughput linked-read sequencing. Nat. Biotechnol., 34, 303-311. [0243] 31. Northcutt, C. G., Wu, T. and Chuang, I. L. (2017) Learning with Confident Examples: Rank Pruning for Robust Classification with Noisy Labels. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, UAI'17. AUAI Press. [0244] 32. Natarajan, N., Dhillon, I. S., Ravikumar, P. K. and Tewari, A. (2013) Learning with noisy labels. In Advances in neural information processing systems. pp. 1196-1204. [0245] 33. Hellner, K., Miranda, F., Fotso Chedom, D., Herrero-Gonzalez, S., Hayden, D. M., Tearle, R., Artibani, M., KaramiNejadRanjbar, M., Williams, R., Gaitskell, K., et al. (2016) Premalignant SOX2 overexpression in the fallopian tubes of ovarian cancer patients: Discovery and validation studies. EBioMedicine, 10, 137-149. [0246] 34. Laks, E., Zahn, H., Lai, D., McPherson, A., Steif, A., Brimhall, J., Biele, J., Wang, B., Masud, T., Grewal, D., et al. (2018) Resource: Scalable whole genome sequencing of 40,000 single cells identifies stochastic aneuploidies, genome replication states and clonal repertoires. bioRxiv, 10.1101/411058. [0247] 35. Moore, L., Leongamornlert, D., Coorens, T. H. H., Sanders, M. A., Ellis, P., Dawson, K., Maura, F., Nangalia, J., Tarpey, P. S., Brunner, S. F., et al. (2018) The mutational landscape of normal human endometrial epithelium. bioRxiv, 10.1101/505685. [0248] 36. Lee-Six, H., Ellis, P., Osborne, R. J., Sanders, M. A., Moore, L., Georgakopoulos, N., Torrente, F., Noorani, A., Goddard, M., Robinson, P., et al. (2018) The landscape of somatic mutation in normal colorectal epithelial cells. bioRxiv, 10.1101/416800. [0249] 37. Burgess, D. J. (2019) Spatial transcriptomics coming of age. Nat. Rev. Genet., 20, 317.
[0250] All references described herein may be incorporated by reference.
[0251] DigiPico2
EXAMPLE 2—DIGIPICO2, A NOVEL METHOD FOR WHOLE GENOME SEQUENCING OF PICOGRAM QUANTITIES OF DNA WITH UNPRECEDENTED ACCURACY
[0252] Introduction
[0253] Previously we described DigiPico library preparation pipeline and MutLX analysis platform as a method for accurate identification of single nucleotide variants (SNV) from limited amount of clinical material. This was an important methodological advancement mainly due to the fact that the limited amount of genetic material obtained from clinical samples must be whole genome amplified (WGA) prior to sequencing. The process of WGA however introduces up to 100,000 artefactual mutations in the amplified DNA which inundates the final analysis results with false positive variant calls that hamper any meaningful genetic interpretation from the original sample. In DigiPico/MutLX strategy we overcome this obstacle by separating individual molecules of DNA into independent compartments before the WGA step and indexing them after the process. By doing so we digitize the information for real mutations, meaning each compartment will either carry the mutated allele or not. Artefactual mutations, however, because of the way that they are generated during the WGA process, will result in compartments that will contain both the mutated and the reference allele information (
[0254] While producing high quality data, DigiPico library preparation method, however, suffers from few technical limitations. Firstly the fragmentation step of the library preparation (CoREF), which was borrowed from a previously described method, is extremely complex and time consuming. Moreover, CoREF requires the use of dUTP during the WGA process. Since dUTP is an unnatural nucleotide it is likely that it may introduce further artefactual mutations in the final products. Next, we found that the adapter ligation efficiency was very low in DigiPico which could sometimes compromise the library quality. Lastly, because of the high number of indices and lack of redundancy in the index information there is a chance for index cross-contamination which could adversely affect the final results. Therefore, we developed DigiPico2 library preparation method to address all these issues.
[0255] Results
[0256] Improving the DigiPico Library Preparation Workflow
[0257] As explained previously, the use of dUTP in the DigiPico method is because of the requirement of CoREF fragmentation procedure, which is a very complex fragmentation strategy (
[0258] Next, we aimed at addressing the low ligation efficiency in DigiPico. Originally our adapter ligation and indexing relied on using an asymmetric ligation approach. In this approach long indexing oligos with short complementary regions were used for ligation which is extremely inefficient (
[0259] DigiPico2 Workflow Significantly Improves the Library Quality
[0260] To test the effect of these modification on the final quality of the data, DigiPico2 sequencing was performed on 120 pg of DNA from blood of patient 11152. This sample was used because we had previously extensively sequenced the tumour and normal cells from this patient. As expected the WGA, similar to the previous version, resulted in a very uniform distribution of products (
[0261] Extending DigiPico2 Workflow to Single Cell Whole Genome Sequencing
[0262] Having established DigiPico2 workflow, we tested whether it can be applied to single cell whole genome sequencing. This is important, since an active mutational process is likely to start within an individual cell. We, therefore, introduced a work flow for single cell DigiPico (ScDigiPico) sequencing by partitioning DNA from individual cells to an entire row of a 384-well plate (
[0263] DigiPico2 Protocol
[0264] 200 pg of purified DNA, 20-30 resuspended nuclei, or laser-capture micro-dissected tumour islets, were first denatured using 5 μl of D2 buffer from Repli-g single cell kit (Qiagen). After 5 minutes incubation at room temperature, 95 μl of water was added to the sample and then using Mosquito HTS liquid handler (TTP Labtech) 200 nl of the denatured template was added to each well of a 384-well reaction plate already containing 800 nl of WGA mix (0.58 μl Sc Reaction Buffer, 0.04 μl Sc Polymerase (REPLI-g Single Cell kit, Qiagen), 0.04 μl EvaGreen 20× (Biotium), and 0.065 μl water). The plate was incubated at 30° C. for 1.5 hours followed by heat inactivation at 65° C. for 15 minutes. Addition of EvaGreen in the reaction allows for monitoring of the WGA reaction using a real-time PCR machine if required. Next, 250 nl of the WGA reactions was transferred to a new plate and 1.1 μl of NEBNext Ultra II FS Reaction mixture (753 nl water, 270 nl Ultra II FS Reaction Buffer, and 77 nl Ultra II FS Enzyme Mix) was added to each well using I-DOT dispenser (Dispendix). Plate was incubated at 37° C. for 6 minutes followed by incubation at 65° C. for 30 minutes. Next 150 nl of DigiPico indexed loop-adapters carrying column indices was added to all wells. Please note that all wells within the same column would receive the same indexing oligo at this stage. Next 1.2 μl of Ultra II Ligation mix (1150 nl Ultra II Ligation Master Mix, 38 nl Ligation Enhancer, and 12 nl water) was added to each well using Mosquito liquid handler with 5 cycles of mixing and the plate was incubated at 20° C. for 15 minutes followed by heat inactivation at 65° C. for 10 minutes. Next all wells within same row were pooled together using Mosquito liquid handler. Then 1.5 μl of USER enzyme (NEB) was added to 20 μl of pool from each row and the reaction was incubated at 37° C. for 15 minutes. USER enzyme cuts the looped adapters at the Uracil position. Next the products were size selected using SPRI beads to achieve a size range of 300-400 bp. Products from each row were then amplified using row indexing primers for 4 cycles. The final products were pooled together and the final library was purified using SPRI beads.
[0265] ScDigiPico Protocol
[0266] Individual cells were sorted to wells in the first column of a 384-well plate. Each well contained 4.5 μl of MyPK buffer 0. Plate was incubated at 55° C. for 30 minutes. Then 900 nl of Stop Solution was added to each well and the plate was kept at 95° C. for 5 minutes to inactivate the proteinase K. Lysed cells were then distrusted across the rows using Mosquito liquid handler, 200 nl in each well. Next 800 nl of WGA reaction mixture was added to each well and the WGA and library preparation was followed similar to DigiPico2.
[0267] Indexed Looped-Adapter—Column Indices
[0268] (some parts of the sequence from the instruction manual of library preparation (NEBNext® Multiplex Oligos for Illumin® (Index Primers Set 1))—https://international.neb.com/-/media/nebus/files/manuals/manuale7335.pdf?rev=4bf1622b342b4d73a2b01443068ed 2c5&hash=B049D91A18CDB471AB388DC6E67E06B79263E5C5)
TABLE-US-00001 (SEQ ID NO: 1) P-[index′] AGATCGGAAGAGCACACGTCTGAACTUCCCTACACGACGCTCTTCCGAT CT[index]*T
[0269] Where P is a 5′ phosphate group and * shows a phosphorothioate bond. Index=a column index (Ci) or row index (Ri) sequence acting as a unique barcode for each column or row respectively.
[0270] Row indexing primers (oligonucleotide sequences © 2007-2013 Illumina, Inc. All rights reserved.)
TABLE-US-00002 P5: (SEQ ID: NO: 2) AATGATACGGCGACCACCGAGATCTACAC ACACTCTTTCCCTACACGACGCTCTTCCGATC*T [r-index] P7: (SEQ ID: NO: 3) CAAGCAGAAGACGGCATACGAGAT GTGACTGGAGTTCAGACGTGTGCTCTTCCGATC*T [r-index] *shows a phosphorothioate bond.