DETERMINING CELL TYPE ORIGIN OF CIRCULATING CELL-FREE DNA WITH MOLECULAR COUNTING

20220403467 · 2022-12-22

    Abstract

    Provided herein are compounds, methods, and compositions for use in determining the cellular origin of circulating cell-free DNA.

    Claims

    1. A method of determining tissues and/or cell types giving rise to cell-free DNA (cfDNA) in a subject, the method comprising: a. isolating cfDNA from a biological sample from the subject, the isolated cfDNA comprising a plurality of cfDNA fragments; b. tagging a unique molecular identifier (UMI) to each isolated cfDNA fragment, the UMI comprising an oligomer of at least two nucleotides; c. determining pairs of sequences associated with at least a portion of the plurality of UMI-tagged cfDNA fragments; d. determining the subset of these pairs of sequences for which the sequence associated with the cfDNA fragment has more than one genomic location within a reference genome; and e. determining at least some of the tissues and/or cell types giving rise to the cfDNA fragments as a function of this subset of pairs of sequences.

    2. The method of claim 1 wherein the step of determining at least some of the tissues and/or cell types giving rise to the cfDNA fragments comprises comparing the sequences associated with the cfDNA fragments to one or more reference maps.

    3. The method of claim 2 wherein the reference maps comprise binding motifs for at least one transcription factor.

    4. The method of claim 2 wherein the reference maps comprise binding locations for at least one transcription factor.

    5. The method of claim 4 wherein the binding locations for at least one transcription factor are determined by immunoprecipitation (e.g. with ChIP-seq).

    6. The method of any preceding claim further comprising generating a report comprising a list of the determined tissues and/or cell types giving rise to the isolated cfDNA.

    7. A method of identifying a disease or disorder in a subject, the method comprising: a. isolating cfDNA from a biological sample from the subject, the isolated cfDNA comprising a plurality of cfDNA fragments; b. tagging a unique molecular identifier (UMI) to each isolated cfDNA fragment, the UMI comprising an oligomer of at least two nucleotides; c. determining pairs of sequences associated with at least a portion of the plurality of UMI-tagged cfDNA fragments; d. determining the subset of these pairs of sequences for which the sequence associated with the cfDNA fragment has more than one genomic location within a reference genome; e. determining at least some of the tissues and/or cell types giving rise to the cfDNA fragments as a function of this subset of pairs of sequences; and f. identifying the disease or disorder as a function of the determined tissues and/or cell types giving rise to the cfDNA.

    8. The method of claim 7 wherein the step of determining at least some of the tissues and/or cell types giving rise to the cfDNA fragments comprises comparing the sequences associated with the cfDNA fragments to one or more reference maps.

    9. The method of any preceding claim wherein the reference genome is associated with a human.

    10. The method of any preceding claim comprising generating a report comprising a statement identifying the disease or disorder.

    11. The method of claim 10 wherein the report further comprises a list of the determined tissue(s) and/or cell type(s) giving rise to the isolated cfDNA.

    12. The method of any preceding claim wherein the biological sample comprises, consists essentially of, or consists of whole blood, peripheral blood plasma, urine, or cerebral spinal fluid.

    13. The method of claim 1 wherein the step of determining at least some of the tissues and/or cell types giving rise to the cfDNA fragments comprises counting UMIs associated with identical cfDNA sequences to produce a vector of counts.

    14. The method of claim 13 wherein each UMI is tallied only once regardless of the number of times it appears in the subset.

    15. The method of claim 13 wherein the step of determining at least some of the tissues and/or cell types giving rise to the cfDNA fragments comprises performing a mathematical transformation on the vector of counts.

    Description

    DESCRIPTION OF EXEMPLARY EMBODIMENTS

    [0018] Provided herein are compounds, methods, and compositions for use in determining the cellular origin of circulating cell-free DNA. In certain embodiments, provided herein are methods for determining or quantifying the cell types and tissue-of-origin composition of cfDNA in bodily fluids on the basis of transcription factor (TF) footprints in short cfDNA fragments.

    [0019] In particular embodiments of the methods, cell-free DNA (cfDNA) is extracted and purified from a source. Extraction and purification can proceed according to techniques known to those of skill in the art. For example, the QIAGEN QIAamp Circulating Nucleic Acid kit is a common method, based on the binding of cfDNA to a silica column, for purification of cfDNA from plasma or urine. An alternative method, phenol-chloroform extraction followed by isopropanol or ethanol precipitation, provides similar results while allowing for more flexibility in the volume of the biological sample.

    [0020] After purification of cfDNA from biological fluids, the fragments can be subjected to one or more enzymatic steps to create a sequencing library. The enzymatic steps can proceed according to techniques known to those of skill in the art. An example of these enzymatic steps is described in Kivioja et al. (2011. Nat Methods 9(1):72-74). Commercial products such as Rubicon Genomics' ThruPLEX Tag-seq are also used to create sequencing libraries from purified cfDNA.

    [0021] According to the methods, the cfDNA fragments are tagged with an oligonucleotide unique molecular identifier (UMI) to facilitate identification of unique fragments. The UMI is typically a DNA oligomer. In certain embodiments, the UMI has a random sequence. In certain embodiments, the UMI is approximately 3-10 base-pairs in length. The UMI can serve as a molecular barcode.
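    As a rough illustration of why UMI length matters, the following sketch computes how many distinct UMIs exist at a given length and the chance that two independently tagged molecules receive the same UMI. It assumes a uniformly random UMI over the four bases A, C, G and T, which is an idealization; the function names are hypothetical and do not come from the specification.

```python
import random

BASES = "ACGT"


def umi_space(length: int) -> int:
    """Number of distinct UMIs of the given length over A, C, G, T."""
    return len(BASES) ** length


def same_umi_probability(length: int) -> float:
    """Chance that two molecules tagged independently and uniformly at
    random (an assumption) end up with the same UMI."""
    return 1.0 / umi_space(length)


def random_umi(length: int, rng: random.Random) -> str:
    """Draw one random UMI; stands in for a synthesized random oligomer."""
    return "".join(rng.choice(BASES) for _ in range(length))


if __name__ == "__main__":
    rng = random.Random(0)
    for length in (3, 10):
        print(length, umi_space(length), same_umi_probability(length),
              random_umi(length, rng))
```

    Under these assumptions a 3-base UMI offers 64 distinct tags and a 10-base UMI offers 1,048,576, so longer UMIs make it less likely that two biological duplicates are mistaken for one technical duplicate.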

    [0022] Library amplification (e.g. with PCR) and sequencing can each result in the same original cfDNA fragment being sequenced more than once and thus appearing as duplicate reads. However, cfDNA fragments may also be truly biologically duplicated at the sequence level—a possibility that is magnified as fragment length decreases. Disentangling these two scenarios—true biological duplication and technical duplication—is difficult or impossible with conventional DNA sequencing workflows. However, the addition of a UMI to each molecule allows these two scenarios to be disentangled, by uniquely tagging each molecule to allow the identification of technical duplicates (which would carry the same UMI).

    [0023] In certain embodiments, the UMI-tagged cfDNA fragments are amplified. Amplification can proceed according to techniques known to those of skill in the art. In certain embodiments, the UMI-tagged cfDNA fragments are sequenced. Sequencing can proceed according to techniques known to those of skill in the art.

    [0024] Following sequencing, duplicates can be identified by comparing reads on the basis of both their UMIs and their genomic locations and/or sequences. Technical duplicates, which share genomic locations and/or sequences as well as UMIs, can be discarded. Biological duplicates, which share genomic locations and/or sequences but do not share UMIs, can be retained. These remaining sequences can then be partitioned into length classes to enrich for TF footprints in the shortest class(es).
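    The duplicate handling described above can be sketched as follows. This is a minimal illustration, assuming each read has already been reduced to a (UMI, sequence) pair; the function names and the length boundaries of 50 and 100 bases are hypothetical choices, not values taken from the specification.

```python
from bisect import bisect_right


def collapse_technical_duplicates(reads):
    """Keep one copy of each (UMI, sequence) pair, in first-seen order.

    Reads sharing both UMI and sequence are treated as technical
    duplicates and collapsed. Reads sharing a sequence but carrying
    different UMIs are treated as biological duplicates and retained.
    """
    seen = set()
    kept = []
    for umi, seq in reads:
        key = (umi, seq)
        if key not in seen:
            seen.add(key)
            kept.append(key)
    return kept


def partition_by_length(reads, boundaries=(50, 100)):
    """Group reads into length classes; class 0 holds the shortest reads.

    The boundaries are illustrative assumptions: class 0 is < 50 bases,
    class 1 is 50-99 bases, class 2 is >= 100 bases.
    """
    classes = {i: [] for i in range(len(boundaries) + 1)}
    for umi, seq in reads:
        classes[bisect_right(boundaries, len(seq))].append((umi, seq))
    return classes


if __name__ == "__main__":
    reads = [
        ("ACGT", "TTGACGTCA"),  # original molecule
        ("ACGT", "TTGACGTCA"),  # technical duplicate: same UMI and sequence
        ("GGCA", "TTGACGTCA"),  # biological duplicate: same sequence, new UMI
        ("ACGT", "CCATTAGG"),   # different sequence that reuses a UMI
    ]
    kept = collapse_technical_duplicates(reads)
    print(kept)
    print(partition_by_length(kept))
```

    Keying on the pair rather than on the UMI alone matters because a short UMI can legitimately recur on unrelated fragments.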

    [0025] In certain embodiments, the reads that cannot be uniquely mapped are separated from the reads that can be uniquely mapped. These reads can be computationally compared to existing compendia of TF footprints (also known as “motifs”) to identify TFs that are likely to have conferred protection to the fragments from which the reads were derived. The comparison to existing compendia does not require exact sequence matches. In some embodiments, one or more sequence mismatches can be allowed to account for imperfect sequence specificity on the part of the TF. In some embodiments, the comparison is performed by searching for one or more informative subsequences of length k (often called “k-mers”), with gaps (“gapped k-mers”) or without gaps. The number of such reads derived from each TF using this comparison is tallied by counting the UMIs, thus allowing the relative frequency of each TF's footprint in the sample of reads to be quantified. By iterating this procedure across a large number of TFs, a vector of TF frequencies can be populated for each biological sample. This vector can then be normalized across biological samples and sequencing datasets (e.g. from multiple individuals, or from the same individual over time) by comparing to counts of uniquely mapped reads within a predefined set of genomic loci in each sample (i.e., accounting both for sequencing coverage and for fragment length biases owing to technical differences between samples).
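    One way to picture the tallying step is the sketch below. It is a simplified illustration and not the claimed implementation: each motif is represented as a single consensus string (real compendia typically use richer models), a match is any ungapped window within a fixed number of mismatches, and a molecule is counted once per distinct (UMI, sequence) pair. The TF names, the motif strings and the reference read count are all made up for the example.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def contains_motif(seq: str, motif: str, max_mismatches: int = 1) -> bool:
    """True if any ungapped window of seq is within max_mismatches of motif."""
    k = len(motif)
    return any(
        hamming(seq[i:i + k], motif) <= max_mismatches
        for i in range(len(seq) - k + 1)
    )


def tf_count_vector(reads, motifs, max_mismatches: int = 1):
    """Count distinct (UMI, sequence) pairs whose sequence matches each motif."""
    counts = {tf: 0 for tf in motifs}
    for _umi, seq in set(reads):  # set() collapses technical duplicates
        for tf, motif in motifs.items():
            if contains_motif(seq, motif, max_mismatches):
                counts[tf] += 1
    return counts


def normalize(counts, n_reference_reads: int):
    """Scale counts by the number of uniquely mapped reads in reference loci."""
    return {tf: c / n_reference_reads for tf, c in counts.items()}


if __name__ == "__main__":
    # Toy consensus strings for two hypothetical TFs; not real motifs.
    motifs = {"TF_A": "TGACGTCA", "TF_B": "CCATTA"}
    reads = [
        ("ACGT", "TTGACGTCAG"),   # exact match to TF_A
        ("ACGT", "TTGACGTCAG"),   # technical duplicate, counted once
        ("GGCA", "TTGACGTCAG"),   # biological duplicate, counted separately
        ("TTAG", "AATGACCTCAT"),  # one mismatch from TF_A
        ("CCGA", "GGCCATTAGG"),   # exact match to TF_B
    ]
    counts = tf_count_vector(reads, motifs)
    print(counts)
    print(normalize(counts, n_reference_reads=200))
```

    Setting max_mismatches to 0 in this toy data drops the one-mismatch read, which shows how the tolerance directly changes the resulting count vector.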

    [0026] In some embodiments, the vector of counts for each TF is then modeled as a mixture of TF profiles found in myriad cell types using orthogonal methods, including ChIP-seq assays such as those performed by the ENCODE project. This modeling can have several embodiments. In one embodiment the comparison involves a computational search for TF footprints that are present in the biological sample and whose cognate TFs are specific to a single cell type. In another embodiment, the vector of molecular counts described above and derived from a biological sample is modeled as a linear combination of vectors of TF profiles derived from orthogonal methods. The output from each embodiment is a list of contributing cell types, optionally including estimated proportions for each contributor in some embodiments.
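    The linear-combination embodiment can be illustrated with a small non-negative least-squares fit. The specification does not name a solver; the sketch below assumes projected gradient descent as one simple option, and the cell-type names, reference profiles, step size and iteration count are all invented for the example.

```python
def estimate_mixture(observed, profiles, steps=2000, lr=0.1):
    """Estimate non-negative cell-type weights for an observed TF vector.

    Minimizes the squared error between `observed` and a weighted sum of
    the reference `profiles` by projected gradient descent, clipping the
    weights at zero after every step, then rescales them to sum to 1.
    The step size suits profiles with entries in [0, 1]; other scales
    may need a smaller value.
    """
    names = list(profiles)
    weights = [1.0 / len(names)] * len(names)
    for _ in range(steps):
        predicted = [
            sum(w * profiles[name][i] for w, name in zip(weights, names))
            for i in range(len(observed))
        ]
        residual = [p - o for p, o in zip(predicted, observed)]
        for j, name in enumerate(names):
            gradient = 2.0 * sum(r * x for r, x in zip(residual, profiles[name]))
            weights[j] = max(0.0, weights[j] - lr * gradient)
    total = sum(weights)
    return {
        name: (w / total if total > 0 else 0.0)
        for name, w in zip(names, weights)
    }


if __name__ == "__main__":
    # Hypothetical reference TF profiles (one value per TF) for two cell types.
    profiles = {
        "cell_A": [0.8, 0.1, 0.1],
        "cell_B": [0.1, 0.7, 0.2],
    }
    # Observed vector constructed as 70% cell_A plus 30% cell_B.
    observed = [0.59, 0.28, 0.13]
    print(estimate_mixture(observed, profiles))
```

    On this constructed input the estimated proportions come out close to 0.7 and 0.3. With real data the fit would be approximate, and a library solver such as a dedicated non-negative least-squares routine may be preferable to this hand-rolled loop.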

    [0027] Transcription factor utilization is a dynamic process, such that single cells of the same type are not identical with respect to TF occupancy along their genome. Nonetheless, at the aggregate level, the complement of TFs within a cell is known to be cell type-specific. In other words, there are many coordinates in the genome at which the probability of TF occupancy substantially differs between cell or tissue types.

    [0028] The methods provided herein are based, at least in part, on the discovery that short cfDNA fragments, despite typically being discarded because their length challenges unique genomic placement, contain information about the complement of TFs active upon cell death and cfDNA genesis. The addition of unique molecular identifiers enables counting-based relative quantification of these TFs, and can be used to differentiate the relative contributions of two or more tissue or cell types to the composition of cfDNA in bodily fluids. Furthermore, the comparison of TF profiles between individuals and/or samples can be used to diagnose and/or monitor any pathology or clinical condition in humans in which the tissue-of-origin composition of cfDNA in bodily fluids is substantially altered in a way that consistently correlates with that pathology or clinical condition.

    [0029] All publications and patent applications cited in this specification are herein incorporated by reference as if each individual publication or patent application were specifically and individually indicated to be incorporated by reference. While the claimed subject matter has been described in terms of various embodiments, the skilled artisan will appreciate that various modifications, substitutions, omissions, and changes may be made without departing from the spirit thereof. Accordingly, it is intended that the scope of the subject matter be limited solely by the scope of the following claims, including equivalents thereof.