Deciphering Multi-Way Interactions In The Human Genome With Use Of Hypergraphs
20220406407 · 2022-12-22
Assignee
Inventors
Cpc classification
G16B25/00
PHYSICS
G16B45/00
PHYSICS
G16B40/10
PHYSICS
International classification
G16B25/00
PHYSICS
Abstract
A method is presented for analyzing interactions in a human genome. The method includes: receiving a biological sample of a cell from a subject; extracting read data from the biological sample, where the read data includes a set of reads; and constructing, by a computer processor, a hypergraph from the read data, where each node in the hypergraph represents a locus and hyperedges in the hypergraph represent interactions between two or more loci. The hypergraphs may be used for different applications including determining entropy, comparing different biological samples and reporting multi-way contacts in a set of transcription clusters.
Claims
1. A method for analyzing interactions in a human genome, comprising: receiving a biological sample of a cell from a subject; extracting read data from the biological sample, where the read data includes a set of reads; and constructing, by a computer processor, a hypergraph from the read data, where each node in the hypergraph represents a locus and hyperedges in the hypergraph represent interactions between two or more loci.
2. The method of claim 1 wherein the read data has a length in range of 100 to 500 base pairs.
3. The method of claim 1 wherein the read data has a length selected from one of 100,000 base pairs, one million base pairs or 25 million base pairs.
4. The method of claim 1 further comprises constructing an incidence matrix from the read data; constructing a Laplacian matrix for the incidence matrix; computing eigenvalues of the Laplacian matrix using eigendecomposition; normalizing the eigenvalues of the Laplacian matrix; and determining entropy of the hypergraph using the normalized eigenvalues.
5. The method of claim 4 wherein the eigenvalues of the Laplacian matrix are normalized such that Σλ.sub.i=1 and entropy is computed using Shannon entropy formula, where λ.sub.i is an eigenvalue of the Laplacian matrix.
6. The method of claim 1 further comprises constructing an incidence matrix for the hypergraph; for each multi-way contact in the incidence matrix, add a given multi-way contact to a set of potential transcription clusters in case where each locus associated with the given multi-way contact is accessible and at least one locus associated with the given multi-way contact is a binding site and the binding site is an indicator of transcription; for each multi-way contact in the set of potential transcription clusters, add a particular multi-way contact to a set of transcription clusters in case where loci associated with the particular multi-way contact contains two or more expressed genes and have at least one common transcription factor; and reporting multi-way contacts in the set of transcription clusters.
7. The method of claim 6 further comprises receiving chromatin accessibility data for the biological sample and determining whether locus are accessible from the chromatin accessibility data.
8. The method of claim 6 further comprises receiving binding data for the biological sample and determining whether a given locus is a binding site from the binding data, where the binding site is an indicator of transcription.
9. The method of claim 6 further comprises receiving gene expression data for the biological sample and determining whether a given loci contains two or more expressed genes from the gene expression data.
10. A method for analyzing interactions in a human genome, comprising: receiving a first biological sample of a cell from a subject; extracting read data from the first biological sample, where the read data includes a set of reads; constructing a first hypergraph from the read data, where each node in the first hypergraph represents a locus and hyperedges in the first hypergraph represent interactions between two or more loci; receiving a second biological sample of a cell from the subject; extracting read data from the second biological sample, where the read data includes a set of reads; constructing a second hypergraph from the read data, where each node in the second hypergraph represents a locus and hyperedges in the second hypergraph represent interactions between two or more loci; and comparing the first hypergraph to the second hypergraph by computing a distance between the first hypergraph and the second hypergraph.
11. The method of claim 10 wherein the first biological sample is taken from a cell having a first cell type and the second biological sample is taken from a cell having a second cell type different from the first cell type.
12. The method of claim 10 wherein the first biological sample is taken from a cell having a given cell type at a given time and the second biological sample is taken from a cell of the subject having the same cell type but at a time different than the given time.
13. The method of claim 10 further comprises constructing a first incidence matrix for the first hypergraph; constructing a first normalized Laplacian matrix for the first incidence matrix; computing a first set eigenvalues of the first normalized Laplacian matrix using eigendecomposition; constructing a second incidence matrix for the second hypergraph; constructing a second normalized Laplacian matrix for the second incidence matrix; computing a second set of eigenvalues of the second normalized Laplacian matrix using eigendecomposition; computing the distance between the first hypergraph and the second hypergraph using the first set of eigenvalues and the second set of eigenvalues.
14. The method of claim 13 wherein the first and the second normalized Laplacian matrix are constructed according to
15. A method for identifying transcription clusters in a human genome, comprising: receiving a biological sample of a cell from a subject; extracting read data from the biological sample, where the read data includes a set of reads; and constructing a hypergraph from the read data, where each node in the hypergraph represents a locus and hyperedges in the hypergraph represent interactions between two or more loci; constructing an incidence matrix for the hypergraph; for each multi-way contact in the incidence matrix, add a given multi-way contact to a set of potential transcription clusters in case where each locus associated with the given multi-way contact is accessible and at least one locus associated with the given multi-way contact is a binding site and the binding site is an indicator of transcription; for each multi-way contact in the set of potential transcription clusters, add a particular multi-way contact to a set of transcription clusters in case where loci associated with the particular multi-way contact contains two or more expressed genes and have at least one common transcription factor; and reporting multi-way contacts in the set of transcription clusters.
16. The method of claim 15 further comprises receiving chromatin accessibility data for the biological sample and determining whether locus are accessible from the chromatin accessibility data.
17. The method of claim 15 further comprises receiving binding data for the biological sample and determining whether a given locus is a binding site from the binding data, where the binding site is an indicator of transcription.
18. The method of claim 15 further comprises receiving gene expression data for the biological sample and determining whether a given loci contains two or more expressed genes from the gene expression data.
Description
DRAWINGS
[0019] The drawings described herein are for illustrative purposes only of selected embodiments and not all possible implementations, and are not intended to limit the scope of the present disclosure.
[0020]
[0021]
[0022]
[0023]
[0024]
[0025]
[0026]
[0027]
[0028]
[0029]
[0030]
[0031]
[0032]
[0033]
[0034]
[0035]
[0036] Corresponding reference numerals indicate corresponding parts throughout the several views of the drawings.
DETAILED DESCRIPTION
[0037]
[0038] In one example, Pore-C read data was extracted from the biological sample. To obtain Pore-C read data, DNA is cross-linked to histones, digested by a restriction enzyme, ligated together, and then sequenced as shown in
[0039] Hypergraphs are used to represent multi-way contacts as seen in
[0040] Using more standard experimental techniques, such as Hi-C, adjacency matrices are often used to capture the pair-wise genomic contacts. Multi-way contacts, however, are not able to be represented in this manner, since the rows and columns of adjacency matrices only account for individual loci. In contrast, incidence matrices are used to represent multi-way contacts (
TABLE-US-00001 Algorithm 1: Hypergraph incidence construction 1: Input: Aligned Pore-C data 2: for each multi-way contact j do 3: if multi-way contact contains locus i then 4: H(i, j) = 1 5: else 6: H(i, j) = 0 7: end if 8: end for 9: Return: Hypergraph incidence matrix H ϵ .sup.n×n where n is the total number of loci, and m is the total number of multi-way contacts.
In this way, incidence matrices allow one to include more than two loci per contact and provide a clear visualization of multi-way contacts. Multi-way contacts are be decomposed into pair-wise contacts by extracting all combinations of loci as seen in
[0041] With reference to
[0042] Incidence matrix visualization of a region in Chromosome 22 from fibroblasts (V1-V4) is shown in
[0043] To gain a better understanding of genome structure with multi-way contacts, hypergraphs were constructed for entire chromosomes in 1 Mb resolution. The incidence matrix of Chromosome 22 is shown as an example in
[0044] Network entropy is often used to measure the connectivity and regularity of a network. Hypergraph entropy is used to quantify the organization of chromatin structure from the read data (e.g., Pore-C data), where higher entropy corresponds to less organized folding patterns. Although there are different definitions for hypergraph entropy, one example analysis technique is further described in relation to
[0045] In mathematics, eigenvalues can quantitatively represent different features of a matrix. In this example, eigenvalues of a Laplacian matrix are exploited and then fit into the Shannon entropy. That is, an incidence matrix is constructed at 91 from the read data, for example in the manner described above, and a Laplacian matrix is then constructed for the incidence matrix as indicated at 92. The incidence matrix of the genomic hypergraph is denoted by by H and the Laplacian matrix is an n-by-n matrix (n is the total number of genomic loci in the hypergraph), which can be computed by L=HH.sup.T ∈R.sup.n×n, where T denotes matrix transpose. Eigenvalues of the Laplacian matrix are computed at 93, for example using eigendecomposition. In some embodiments, the eigenvalues are normalized such that Σ.sub.i=1.sup.nλ.sub.i as indicated at 94. Finally, the entropy of the hypergraph is computed at 95 using the normalized eigenvalues. More specifically, the hypergraph entropy is defined by
where λ.sub.i are the normalized eigenvalues of L, such that Σ.sub.i=1.sup.nλ.sub.i=1, and the convention 0ln0=0 is used. Biologically, genomic regions with high entropy are likely associated with high proportions of euchromatin, as euchromatin is more structurally permissive than heterochromatin. Further details for this example definition of hypergraph entropy are set forth by C. Chen and I. Rajapakse in “Tensor entropy for uniform hypergraphs” IEEE Transactions on Network Science and Engineering, 7(4):2889-2900, 2020, which is incorporated in its entirety herein by reference. Other definitions for hypergraph entropy also fall within the broader aspects of this disclosure.
[0046] Comparing graphs is a ubiquitous task in data analysis and machine learning. There is a rich body of literature for computing graph distance with examples, such as Hamming distance, Jaccard distance, and other spectral based distances. This disclosure proposes a spectral-based hypergraph distance measure which can be used to quantify global difference between two genomic hypergraphs G.sub.1 and G.sub.2.
[0047]
[0048] For each biological sample, read data is extracted from the biological sample as indicate at 102 and 105, and a hypergraph is constructed from the read data as indicated at 103 and 106. The two hypergraphs can then be compared at 107, for example by computing a distance between the two hypergraphs. Other techniques for comparing hypergraphs are also contemplated by this disclosure.
[0049] In an example embodiment, a spectral-based distance measure is used to compare the two hypergraphs. For each hypergraph, an incidence matrix is constructed and then a normalize Laplacian matrix is constructed. Denote the incidence matrices of two genomic hypergraphs by Hi E R.sup.nxm.sup.
where I∈R.sup.n×n is the identity matrix, E.sub.i∈R.sup.m.sup.
where λ.sub.ij is the jth eigenvalue of {tilde over (L)}.sub.i for i=1,2, and p≥1. In the example embodiment, p=2 although other values may be considered. In this way, the hypergraph distance is used to compare two genomic hypergraphs in a global scale since the eigenvalues of the normalized Laplacian are able to capture global connectivity patterns within the hypergraph.
[0050] Genes are transcribed in short sporadic bursts and transcription occurs in localized areas with high concentrations of transcriptional machinery. This includes transcriptionally engaged polymerase and the accumulation of necessary proteins, called transcription factors. Multiple genomic loci can colocalize at these areas for more efficient transcription. In fact, it has been shown using fluorescence in situ hybridization (FISH) that genes frequently colocalize during transcription. Simulations have also provided evidence that genomic loci, which are bound by common transcription factors, can self-assemble into clusters, forming structural patterns commonly observed in Hi-C data. These instances of highly concentrated areas of transcription machinery and genomic loci are referred to herein as transcription clusters. The colocalization of multiple genomic loci in transcription clusters naturally leads to multi-way contacts, but these interactions cannot be fully captured from the pair-wise contacts of Hi-C. Multi-way contacts derived from Pore-C and similar read data can detect interactions between many genomic loci, and are well suited for identifying potential transcription clusters.
[0051]
[0052] From the incidence matrix, potential transcription clusters are identified at 115. More specifically, each locus in the incidence matrix (i.e., multi-way contact) is queried for chromatin accessibility and binding. For a given multi-way contact, the given multi-way contact is added to a set of potential transcription clusters in the case where each locus associated with the given multi-way contact is accessible and at least one locus associated with the given multi-way contact is a binding site. Locus accessibility can be determined from chromatin accessibility data. In the example embodiment, the chromatin accessibility data is derived from the biological sample using Assay for Transposase-Accessible Chromatin sequencing (ATAC-seq). Chromatin accessibility data can be derived using other techniques including but not limited to DNase-seq and MNase-seq. Determining whether a given locus is a binding site can be determined from binding data, such as RNA Pol II data. In the example embodiment, the binding data is derived from the biological sample using ChIP-seq although other techniques are contemplated as well. Although not limited thereto, chromatin accessibility and transcription factor binding sites are preferably queried ±5 from the gene's transcription start site.
[0053] From the set of potential transcription clusters, transcription clusters are identified at 116. To do so, each multi-way contact in the set of potential transcription clusters is further evaluated. That is, each multi-way contact is queried for nearby expressed genes. For a given multi-way contact, the given multi-way contact is added to a set of transcription clusters in the case where loci associated with the particular multi-way contact contains two or more expressed genes and have at least one common transcription factor. Loci that contains two or more expressed genes is determined from gene expression data, for example obtained using RNA-seq or similar methods.
[0054] In the example embodiment, genes having common transcription factors were determined through binding motifs. For demonstration purposes, transcription factor binding site motifs were obtained from “The Human Transcription Factors” data. FIMO (https://meme-suite.org/meme/tools/fimo) was used to scan for motifs within ±5 kb of the transcription start sites for protein-coding and microRNA genes. The results were converted to a 22,083×1,007 MATLAB table, where rows are genes, columns are transcription factors, and entries are the number of binding sites for a particular transcription factor and gene. The table was then filtered to only include entries with three or more binding sites in downstream computations. This threshold was determined empirically and can be adjusted by changes to the provided MATLAB code. This method for identifying transcription clusters is also set forth below
TABLE-US-00002 Algorithm 4: Identification of Transcription Clusters 1: Input: Hypergraph incidence matrix H, gene expression R (RNA-seq), RNA Pol II P (ChIP-seq), chromatin accessibility C (ATAC-seq), transcription factor binding motifs B 2: for each multi-way contact J in H do 3: if all loci are accessible from C and 1 locus has Pol II binding from P then 4: mult-way contact j from H is added to the set of potential transcription clusters T.sub.p 5: end if 6: end for 7: for each potential transcription cluster k in T.sub.p do 8: if loci contain 2 expressed genes from R which have 1 common TFs from B then 9: multi-way contact k from T.sub.p is added to the set of transcription clusters T.sub.c 10: end if 11: if loci contain 2 expressed genes from R which have 1 common MRs from B then 12: multi-way contact k from T.sub.p is added to the set of transcription clusters T.sub.s 13: end if 14: end for 15: Return: Potential transcription clusters T.sub.p, transcription clusters T.sub.c, and specialized transcription clusters T.sub.s
[0055] Continuing with the example described above in relation to
[0056] The criteria for potential transcription clusters was tested for statistical significance. That is, a test was conducted to determine whether the identified transcription clusters are more likely to include genes, and if these genes more likely to share common transcription factors, than arbitrary multi-way contacts in both fibroblasts and B lymphocytes. It was found that the transcription clusters were significantly more likely to include 1 gene and 2 genes than random multi-way contacts (p<0.01). In addition, transcription clusters containing 2 genes were significantly more likely to have common transcription factors and common master regulators (p<0.01). After testing all order multi-way transcription clusters, the 3-way, 4-way, 5-way, and 6-way (or more) cases were tested individually. It was found that all cases were statistically significant (p<0.01) except for clusters for common transcription factors or master regulators in the 6-way (or more) case for both fibroblasts and B lymphocytes. One can hypothesize that these cases were not statistically significant due to the fact that the large number of loci involved in these multi-way contacts will naturally lead to an increase of overlap with genes. This increases the likelihood that at least two genes will have common transcription factors or master regulators. Approximately half of transcription clusters with at least two genes with common transcription factors also contained at least one enhancer locus (˜51% and ˜44% in fibroblasts and B lymphocytes, respectively). This offers even further support that these multi-way contacts represented real transcription clusters.
[0057] Through advancements in sequencing technology, multi-way contacts within the genome can be captured and reported. Multiway contacts will become increasingly important within biological studies, as the relationship higher-order chromatin structures and genome function are intrinsically linked. Based on this information, medical diagnosis and treatment of patients can be made. Furthermore, this information can be used to reprogram cells of a patient, for example by introducing a given transcription factor into a particular cell of the patient.
[0058] The techniques described herein may be implemented by one or more computer programs executed by one or more processors. The computer programs include processor-executable instructions that are stored on a non-transitory tangible computer readable medium. The computer programs may also include stored data. Non-limiting examples of the non-transitory tangible computer readable medium are nonvolatile memory, magnetic storage, and optical storage.
[0059] Some portions of the above description present the techniques described herein in terms of algorithms and symbolic representations of operations on information. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. These operations, while described functionally or logically, are understood to be implemented by computer programs. Furthermore, it has also proven convenient at times to refer to these arrangements of operations as modules or by functional names, without loss of generality.
[0060] Unless specifically stated otherwise as apparent from the above discussion, it is appreciated that throughout the description, discussions utilizing terms such as “processing” or “computing” or “calculating” or “determining” or “displaying” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system memories or registers or other such information storage, transmission or display devices.
[0061] Certain aspects of the described techniques include process steps and instructions described herein in the form of an algorithm. It should be noted that the described process steps and instructions could be embodied in software, firmware or hardware, and when embodied in software, could be downloaded to reside on and be operated from different platforms used by real time network operating systems.
[0062] The present disclosure also relates to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a computer selectively activated or reconfigured by a computer program stored on a computer readable medium that can be accessed by the computer. Such a computer program may be stored in a tangible computer readable storage medium, such as, but is not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, application specific integrated circuits (ASICs), or any type of media suitable for storing electronic instructions, and each coupled to a computer system bus. Furthermore, the computers referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability.
[0063] The algorithms and operations presented herein are not inherently related to any particular computer or other apparatus. Various systems may also be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatuses to perform the required method steps. The required structure for a variety of these systems will be apparent to those of skill in the art, along with equivalent variations. In addition, the present disclosure is not described with reference to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present disclosure as described herein.
[0064] The foregoing description of the embodiments has been provided for purposes of illustration and description. It is not intended to be exhaustive or to limit the disclosure. Individual elements or features of a particular embodiment are generally not limited to that particular embodiment, but, where applicable, are interchangeable and can be used in a selected embodiment, even if not specifically shown or described. The same may also be varied in many ways. Such variations are not to be regarded as a departure from the disclosure, and all such modifications are intended to be included within the scope of the disclosure.