METHODS OF DETERMINING CORRESPONDENCES BETWEEN BIOLOGICAL PROPERTIES OF CELLS

Abstract

A method of determining a correspondence between a first biological property of a cell and one or more further biological properties of cells is provided. The first biological property and the further biological properties are determined by different analysis techniques and each are contained in a respective one of a plurality of sets of biological properties. The method includes the steps of: converting the plurality of sets of biological properties into corresponding representations in a representation format which is invariant to the technologies used to derive the biological properties; determining, in said representation format, a representation from each of the converted sets of further biological properties which most closely matches the first representation of the first biological property; and re-converting the determined representations from the representation format back to the biological properties associated with the determined representations and thereby determining a correspondence between the first biological property and each of the further biological properties.

Claims

1. A computer-implemented method of determining a correspondence between a first biological property of a cell contained in a first set of biological properties determined by a first analysis technique and a second biological property of a cell contained in a second set of biological properties determined by a second, different, analysis technique, the method including the steps of: converting the first and second sets of biological properties into corresponding representations in a representation format which is invariant to the technologies used to derive the biological properties; determining, in said representation format, a second representation from the converted second set of biological properties which most closely matches a first representation of the first biological property; and re-converting the second representation from the representation format back to the biological property associated with the second representation and thereby determining a correspondence between the first biological property and the second biological property.

2. A computer-implemented method of determining a correspondence between a first biological property of a cell and a plurality of further biological properties of cells, the first biological property and the further biological properties each being determined by different analysis techniques and each being contained in a respective one of a plurality of sets of biological properties, the method including the steps of: converting the plurality of sets of biological properties into corresponding representations in a representation format which is invariant to the technologies used to derive the biological properties; determining, in said representation format, a representation from each of the converted sets of further biological properties which most closely matches the first representation of the first biological property; and re-converting the determined representations from the representation format back to the biological properties associated with the determined representations and thereby determining a correspondence between the first biological property and each of the further biological properties.

3. The method of claim 1, wherein at least one of the analysis techniques is a single-cell analysis technique.

4. The method of claim 1, wherein the representation which is invariant to the technologies used to derive the biological properties is a latent space.

5. The method of claim 4, wherein the latent space is constructed by creating neural networks for a) an encoder for each data set; b) a decoder for each data set; and c) a discriminator acting on the representations.

6. The method of claim 5 wherein the discriminator is a binary classifier.

7. The method of claim 5, wherein the latent space is constructed by minimizing the reconstruction error between the encoder and decoder for each data set while adversarially fooling the discriminator.

8. The method of claim 1, wherein the step of determining uses a bipartite matching approach between each pair of the converted sets.

9. The method of claim 8, wherein the step of determining includes the sub-step of reducing the search space using a k-Nearest Neighbour approach to reduce the number of possible correspondences to a pre-determined maximum.

10. The method of claim 8, wherein the bipartite matching approach uses a minimum-cost maximum-flow algorithm.

11. The method of claim 1, wherein the step of determining allows for the possibility of a correspondence not being found by adding a null node to each converted set.

12. The method of claim 1, wherein the step of determining allows for many-to-one matches to representations in a converted set which has fewer elements.

13-15. (canceled)

16. One or more computer-readable non-transitory storage media including instructions that, when executed by one or more processors, are configured to cause the one or more processors of a system to perform operations comprising: determining a correspondence between a first biological property of a cell contained in a first set of biological properties determined by a first analysis technique and a second biological property of a cell contained in a second set of biological properties determined by a second, different, analysis technique, the method including the steps of: converting the first and second sets of biological properties into corresponding representations in a representation format which is invariant to the technologies used to derive the biological properties; determining, in said representation format, a second representation from the converted second set of biological properties which most closely matches a first representation of the first biological property; and re-converting the second representation from the representation format back to the biological property associated with the second representation and thereby determining a correspondence between the first biological property and the second biological property.

17. The method of claim 16, wherein at least one of the analysis techniques is a single-cell analysis technique.

18. The method of claim 16, wherein the representation which is invariant to the technologies used to derive the biological properties is a latent space.

19. The method of claim 18, wherein the latent space is constructed by creating neural networks for a) an encoder for each data set; b) a decoder for each data set; and c) a discriminator acting on the representations.

20. The method of claim 19 wherein the discriminator is a binary classifier.

21. The method of claim 19, wherein the latent space is constructed by minimizing the reconstruction error between the encoder and decoder for each data set while adversarially fooling the discriminator.

22. The method of claim 16, wherein the step of determining uses a bipartite matching approach between each pair of the converted sets.

23. A system comprising: one or more processors and one or more computer-readable non-transitory storage media coupled to one or more of the processors and comprising instructions operable when executed by one or more of the processors to cause the system to perform operations comprising: determining a correspondence between a first biological property of a cell contained in a first set of biological properties determined by a first analysis technique and a second biological property of a cell contained in a second set of biological properties determined by a second, different, analysis technique, the method including the steps of: converting the first and second sets of biological properties into corresponding representations in a representation format which is invariant to the technologies used to derive the biological properties; determining, in said representation format, a second representation from the converted second set of biological properties which most closely matches a first representation of the first biological property; and re-converting the second representation from the representation format back to the biological property associated with the second representation and thereby determining a correspondence between the first biological property and the second biological property.

Description

[0061] Embodiments of the present invention will now be described by way of non-limitative examples with reference to the accompanying drawings, in which:

[0062] FIG. 1 shows, schematically, a method according to an embodiment of the present invention which performs a pairwise matching of cell across multiple single cell analysis technologies;

[0063] FIG. 2 shows a schematic matching graph structure forming part of a method according to an embodiment of the present invention;

[0064] FIG. 3 shows the tree used to simulate a PROSSTT dataset used to test a method according to an embodiment of the present invention;

[0065] FIG. 4 shows the tree used to simulate a further PROSSTT dataset used to test a method according to an embodiment of the present invention and a density plot of the pseudotime label between a pair of cells matched between source and target technologies;

[0066] FIG. 5 is a tSNE (perplexity=32) plot of the integrated latent space for the simulated data derived from the tree structure shown in FIG. 3;

[0067] FIG. 6 is a density plot of the pseudotime label in the simulated data between a pair of cells matched between source and target technologies;

[0068] FIG. 7 is tSNE (perplexity=32) plot of the integrated latent space for the simulated data derived from the tree structure shown in FIG. 4;

[0069] FIG. 8 is a density plot of the pseudotime label between a pair of cells matched between source and target technologies using latent codes obtained from MATCHER, a prior art system;

[0070] FIG. 9 is a tSNE embedding (perplexity=128) of the integrated latent space created by the application of a method according to an embodiment of the present invention to datasets derived from human tumour cells, with cell matches indicated by the grey lines;

[0071] FIG. 10 is a series of density plots of HLA-DRA marker abundance measured with scRNA-Seq (gene, x-axis) and CyTOF (protein, y-axis);

[0072] FIG. 11 shows a comparison of matching accuracy, with respect to cell-type label, between sparse and dense connections when a method according to an embodiment of the present invention is applied to datasets derived from human tumour cells;

[0073] FIG. 12 is a tSNE embedding similar to that of FIG. 6 but with cell matches obtained via matching on the data space indicated by the grey lines; and

[0074] FIG. 13 is a series of charts showing the training progress of a method according to an embodiment of the present invention on the datasets derived from human tumour cells.

[0075] An embodiment of the present invention which provides a method of matching cells from a source technology to cells in a target technology will be described below. The method has been tested on both simulated and real data networks as explained further below.

[0076] FIG. 1 illustrates the formation and operation of the method of this embodiment. The method assumes that the input of each technology comes from a similar distribution of cells, for example from a common sample, processed in parallel. The method proceeds by mapping cells into a technology-invariant latent space using an auto-encoder framework as described further below and an adversarial objective, such that the original cell signal can be faithfully reconstructed from the latent space, while the technology of each latent representation is indistinguishable. The obtained shared latent codes are then used to find the best cell analogues across technologies by using a fast bipartite matching framework.

Constructing a Technology-Invariant Latent Space & Model Architecture

[0077] An integrated latent space has ideally two properties. It must be possible to decode the latent representations into faithful reconstructions of their inputs, while the source technology of each latent representation should be indistinguishable. To this end, the method of the present embodiment has three types of networks: a pair of encoder (ϕ.sub.k) and decoder (ψ.sub.k) networks for each technology k, which map into and reconstruct out of the latent space, and a single discriminator network (γ) acting on the latent representations.

[0078] The discriminator is a binary classifier trained to identify the latent representations of a source technology from latent representations of all other technologies. The discriminator uses binary cross entropy.

[0079] The method of the present embodiment creates an integrated latent space by minimizing the reconstruction error while adversarially fooling the discriminator. Given the measurements of a batch of cells from the target technology, x.sub.t and the (fixed) latent representations of a batch of cells from the source technology, z.sub.s=ψ.sub.s (x.sub.s) the method minimizes the following objective function:

custom-character (x.sub.t,z.sub.s;ϕ.sub.r,ψ.sub.r)=.sub.nll(x.sub.t,{circumflex over (x)}.sub.t;ϕ.sub.t,ψt)+β.sub.adv(z.sub.s,z.sub.t;ϕt) (1)

where {circumflex over (x)}.sub.t=ϕ.sub.t(z.sub.t) is the reconstruction of x.sub.t and custom-character .sub.nll(x.sub.t, {circumflex over (x)}.sub.t) is the negative log-likelihood of the inputs under the reconstruction. For a Gaussian likelihood the latter has the form L.sub.nll(x.sub.t,{circumflex over (x)}.sub.t)=∥x.sub.t−{circumflex over (x)}.sub.t∥.sub.2.sup.2. .sub.adv is the discriminator's classification error when trying to classify z.sub.s/z.sub.t as the target/source technology. β is a hyperparameter weighing the influence of the adversarial loss.

[0080] All networks use the ReLU activation. The latent dimension of all models is set to 8, and discriminator networks with 2 layers and 8 hidden units each are used. The SNGAN (Miyato et al., 2018) framework is used to train the discriminator.

[0081] For the simulated data networks discussed further below, a 2 layer architecture with 64 hidden units for was used. For the real data networks, a 2 layer architecture with 8 hidden units was used for the CyTOF networks, and a 2 layer architecture with 64 hidden units used for the scRNA-Seq networks. A Gaussian activation was used for all decoders. Architecture search was done on VAE codes and reflects a trade-off between number of features and number of parameters.

[0082] Optimization was carried out by iteratively fixing one technology as the source and one technology as the target. In the case of more than two technologies, the technology corresponding to the discriminator's positive class must either be the source or target technology. The codes of the source technology are fixed and equation (1) is minimized with gradient updates to the encoder and decoder, ϕ.sub.t and ψ.sub.t, of the target technology using gradients computed on the batch x.sub.t. After each update, the discriminator is trained on z.sub.s and z.sub.t. The method is initialized by first training a VAE (Kingma and Welling, 2013) on a single technology, and using the latent representations as the first set of z.sub.s. All networks are optimized using the ADAM (Kingma and Ba, 2014) optimizer.

[0083] Due to the min-max nature of adversarial training, model comparison is challenging since it is not possible to directly compare the minimized objective functions of converged models (Lucic et al., 2017). The computer vision community has introduced a number of metrics specific to the image domain to help compare models (Heusel et al., 2017; Salimans et al., 2016), however, and these can be used to validate the quality of a set of lower dimensional latent representations. As was done in Wang et al. (2019), a k-Nearest Neighbor (kNN)-based divergence estimator (Wang et al., 2009) is used to quantitatively evaluate the quality of the integrated latent space. The divergence score between two sets of codes Z.sub.s and Z.sub.t is calculated as:

[00003] $\begin{matrix} \hat{D} (Z_{s} .Math. Z_{t}) = \frac{1}{2} {\hat{D}}_{K L} (Z_{s} .Math. Z_{t}) + \frac{1}{2} {\hat{D}}_{K L} (Z_{t} .Math. Z_{s}) where {\hat{D}}_{K L} (P .Math. Q) = \frac{d}{.Math. P .Math.} \underset{p_{i} \in P}{.Math.} \log \frac{v_{k} (p_{i})}{ρ_{k} (p_{i})} + \log \frac{.Math. P .Math.}{.Math. Q .Math. - 1} & (2) \end{matrix}$

[0084] where v.sub.k(p.sub.i) and ρ.sub.k (p.sub.i) are the distances from p.sub.i to the k.sup.th nearest neighbour in the sets P and Q respectively and all p.sub.i ∈ custom-character .sup.d.

[0085] This estimator approximates a symmetric variant of a Kullback-Leibler (KL) divergence, a measure of how much two distributions differ, using only empirical data. The divergence estimate is computed between the latent representations of the source technology and the target technology.

Matching

[0086] The obtained shared latent representation can then be used for finding analogous cells across technologies, i.e. those cells in the samples analysed by each technology which are most closely related to each other. Each cell is characterized in the latent space by a low-dimensional vector of latent codes, which are in one-to-one correspondence across technologies. To find the optimal pairwise matching in a computationally efficient way, the task can be viewed as a combinatorial bipartite matching problem.

[0087] However, given the high-dimensional nature of single-cell data, as a first step it is helpful to reduce the search space to the most-likely potential matches. In order to achieve this, a k-Nearest Neighbours (kNN) graph is built using source data and then queried by the target technology cells. The nodes correspond to cells and edge weights correspond to the distance between the cells. Since the latent codes are dense and directly comparable between the technologies, the Euclidean distance is used to measure the dissimilarity between each pair of cells. The sparsity of connections, regulated by the choice of hyperparameter k, corresponds to the trade-off between the computational performance (memory usage, run time) and the matching accuracy.

[0088] Given a cost matrix (e.g., Euclidean distance in latent or data space) with distances between all pairs of cells across technologies, the aim is to find row-cell pairs that result in the smallest overall cost. In the present embodiment, Euclidean distance in the latent space is used to generate the cost matrix. The aim then corresponds to solving a Linear Assignment Problem (LAP), i.e. bipartite matching, and can be formulated as:

[00004] $\begin{matrix} \min_{i, j} {.Math.}_{i} {.Math.}_{j} C_{i j} X_{i j} & (3) \end{matrix}$

[0089] with cost matrix C, Boolean assignment matrix X and cell indices i∈{1, . . . , n}, j∈{1, . . . , m}, where n, m denote the number of cells in source and target dataset, respectively. In the classical bipartite matching n=m and when formulated as a graph with cellular nodes, the capacity of each edge is exactly one, as bijective matching is enforced. To efficiently solve a LAP with sparse connections we use the Jonker-Volgenant Algorithm (Jonker and Volgenant, 1987), which has been shown to have good computational performance even in high-dimensional settings (Dell'Amico and Toth, 2000), despite the worst-case complexity of O(n.sup.3).

[0090] To relax the restriction on only one match, the generalized framework of Minimum-Cost Maximum-Flow Problems (Ahuja et al., 1993; Klein, 1967) can be applied.

[0091] Given a directed graph G=(V, E), with V denoting vertices and E edges, the Minimum-Cost Maximum-Flow of G would be the maximum flow that can be pushed from source to sink at a minimum cost. If u denotes the non-negative edge capacities and c the corresponding costs, then the flow of f(v, w) units on edge (v, w) would contribute c(v, iv) (v, w) to the objective

[00005] $\begin{matrix} \min_{(v, w)} {.Math.}_{(v, w)} c (v, w) f (v, w) subject to & (4) \end{matrix}$ $\begin{matrix} 0 \leq f (v, w) \leq u (v, w) \forall_{(v, w) \in E} & (5) \end{matrix}$ $\begin{matrix} {.Math.}_{w \in V} f (v, w) - {.Math.}_{w \in V} f (w, v) = b (v) \forall_{v \in V} & (6) \end{matrix}$

[0092] Finding a minimum cost of the flow through the network corresponds to finding the shortest path. Many algorithms have been designed to efficiently solve the Minimum-Cost Maximum-Flow Problem in polynomial time (see, for example, Ahuja et al., 1993; Kovacs, 2015).

[0093] The approach described so far makes a strong assumption of observing the same cellular composition across datasets, i.e., each cell has a direct corresponding sibling in the other technology. In order to allow for mismatches due to expected variation in cellular composition, the kNN graph is expanded with sparse connections by adding a densely-connected null node with high capacity and high assignment cost, capturing the otherwise poorly matched cells.

[0094] This graph structure is depicted in FIG. 2 in which the latent space is depicted as a set of nodes corresponding to cells in the source (S) and target (T) datasets and edges to the sparse connections between them, resulting from a kNN search. Edge labels (a, b) indicate matching cost and edge capacity, respectively. The null node, colored in grey, bottom-right, captures matches with source cells that lack a close enough analogue in the target technology. Its capacity equals the cardinality of the source data and the cost C.sub.*0 is relatively high. The thicker lines linking the nodes represent the actual matches selected by the algorithm.

[0095] Furthermore, to account for differences in the number of cells between modalities, many-to-one matches are allowed by increasing the capacity of the outgoing source edges, assuming the source corresponds to the smaller dataset. In order to penalize for leaving cells unmatched, the sum of outgoing source capacities, excluding the null node, is constrained to equal the cardinality of the target set

Σu.sub.i=|T|i∈{1, . . . ,m} (7)

[0096] The quality of matching is evaluated on different levels. First, the accuracy corresponding to the fraction of true positives w.r.t. cell-type label is reported. Cell-types can be determined in a technology-specific manner. If more fine-grained cellular information is available, such as pseudotime, a direct comparison of this quantity is carried out.

Simulated Dataset

[0097] The method of the present embodiment was first evaluated using two synthetic datasets generated by PROSSTT (Papadopoulos et al., 2019). PROSSTT simulates a temporal branching process that parameterizes a negative binomial model.

[0098] Two single cell analysis technologies were simulated by running PROSSTT under different seeds and different numbers of genes but retaining the underlying branching structure (see Table 1). Thus these two datasets have a common latent structure yet their features do not have any correspondences. A three branch tree as illustrated in FIG. 3 was used with different branch lengths. The first branch A runs from pseudotime 0 to 30, branch B from 30 to 50 and branch C from 30 to 60. The PROSSTT simulated data allows an underlying latent structure to be defined.

TABLE-US-00001 TABLE 1 Characteristics of PROSSTT simulated dataset using different seeds but the same underlying latent structure. Dataset # Markers # Cells Source 256 32,000 Target 128 32,000

[0099] The method of the present embodiment was run using the larger dataset (i.e. the one with more markers) as the source technology and the smaller dataset as the target technology. The latent space was initialized by training a VAE on the source technology for 250 epochs. These latent representations are fixed, and then the target technology was trained for 250 epochs. To help orient the latent space, branch labels were given to the discriminator (Makhzani et al., 2015). For each target cell a set of its most similar cells in the source dataset was identified by k-Nearest Neighbors algorithm with k=500, using the Euclidean distance on the latent codes. The optimal matching was then found by solving the Minimum-Cost Maximum-Flow problem, allowing for many-to-one matches.

[0100] In a second approach, three single cell analysis technologies were simulated by running PROSSTT under different seeds but retaining the underlying branching structure. Thus these three datasets also have a common latent structure yet their features do not have any correspondences. A five branch tree, shown on the left hand side of FIG. 4, was used with different branch lengths. Each dataset contained 64,000 cells with 256 markers.

[0101] The method of the present embodiment was run on the three simulated datasets. The latent space was initialized by training a VAE on the source technology for 256 epochs. These latent representations were fixed, and then the two target technologies are trained for 256 epochs. For each target cell a set of its most similar cells in the source dataset were identified by k-Nearest Neighbors algorithm with k=500, using the Euclidean distance on the latent codes. The optimal matching was then found by solving the bipartite matching problem.

Results

[0102] The method of the present embodiment was evaluated on the above pair of datasets generated by PROSSTT, such that there are no feature correspondences between the two datasets, while the shared underlying characteristics are preserved. Branches in PROSSTT define over-arching structure that mimics cell-types, while the temporal component, i.e., pseudotime, provides a continuous interpolation from one branch to another, as defined by the tree.

[0103] In latent space, the branch structure within the data produces large clusters, while the pseudotime structure provides orientation within each cluster as well as global smoothing of the manifold. For the three branch tree, the method of the present embodiment was able to correctly capture this structure as well as orient it across datasets, as shown in tSNE embeddings computed on the combined latent representations from both sources as shown in FIG. 5.

[0104] Each point in FIG. 5 corresponds to a cell's latent representation. In the first axis, points are coloured by the branch label and shaded by the source technology. FIG. 5 shows that the tree's branching structure is represented in the embedding and that the technology codes are grouped globally by branch label. In the second and third axes the source/technology codes are coloured by pseudotime label. As illustrated, the method was able to correctly capture the intra-cluster orientation.

[0105] The optimal matching algorithm was applied to a sparse kNN graph with k=1500. FIG. 6 provides a quantitative evaluation of the matches with respect to pseudotime. Coloured contours correspond to cells paired within the same branch. The grey contour corresponds to cells paired to a different branch. It can be seen that this mismatching mostly happens around the branching point, where cells from different branches are most similar.

[0106] The highest density of the cells is observed within a close proximity of the diagonal, thus showing that the derived matches do not only follow the overall branches but also reflect subtle changes. The mismatched cells (grey colour), w.r.t. branch label, constitute 15% of the total cell population.

[0107] Table 2 below is the corresponding confusion table.

TABLE-US-00002 TABLE 2 Confusion table showing branch labels of the optimal matches of cells in the simulated PROSSTT data. Entries on the diagonal correspond to correct matches whereas off-diagonal elements are mismatches. The overall accuracy, with respect to branch label, is 85%. Target Source A B C A 10007 1290 612 B 1419 6277 332 C 858 156 11049

[0108] The majority of the mismatches is located around the branching point, where the differences between cells become indistinguishable. As expected, the overall accuracy of the matches increases if the area of the branching point (pseudotime ∈[30−E, 30+E]) is not taken into account (accuracy: 98%, Table 3 below). The Spearman's and Pearson's correlation coefficients between all the matches amount 0.89 and 0.87, respectively. These results demonstrate that the method of the present embodiment is capable of accurately identifying the best matching cells across technologies, based on the shared latent representations, even in the absence of paired features.

TABLE-US-00003 TABLE 3 Confusion table showing branch labels of the optimal matches of cells in the simulated PROSSTT data, but excluding the branching point (pseudotime = 30) +/− epsilon of 10% of total pseudotime (E = 0.1 * 60 = 6). Entries on the diagonal correspond to correct matches whereas off- diagonal elements correspond to mismatches. The overall accuracy, with respect to branch label and with branching points excluded, is 98%. Target Source A B C A 7345 112 1 B 212 4301 6 C 120 1 8410

[0109] For the five branch dataset, the method of the present embodiment was also able to correct capture the underlying data structure as well as orient it across the datasets as shown in tSNE embeddings computed on the combined latent representations from all datasets as shown in FIG. 7.

[0110] The optimal matching Minimum-Cost Maximum-Flow algorithm described above was applied to a sparse kNN graph with k=50. The plot on the right hand side of FIG. 4 provides a quantitative evaluation of the matches with respect to pseudotime.

[0111] The highest density of the cells is observed within a close proximity of the diagonal, thus showing that the derived matches do not only follow the overall branches but also reflect subtle changes. The mismatched cells (grey color), with respect to branch label, constitute 15% of the total cell population. The corresponding confusion table is shown as Table 4 below.

[0112] The majority of the mismatches are located around the branching points, where the differences between cells become indistinguishable. The Spearman's and Pearson's correlation coefficients between all the matches amount 0.83 and 0.86, respectively. These results demonstrate that the method of the present embodiment is capable of accurately identifying the best matching cells across technologies, based on the shared latent representations, even in the absence of paired features.

TABLE-US-00004 TABLE 4 Confusion table showing branch labels of the optimal matches of cells in the simulated PROSSTT data (5 branches). Entries on the diagonal correspond to correct matches whereas off- diagonal elements correspond to mismatches. The overall accuracy, with respect to branch label, is 86%. Target Source A B C D E A 8528 583 679 99 23 B 724 6285 758 466 928 C 770 777 18917 48 99 D 42 371 22 7804 266 E 28 1086 41 305 7363

[0113] The method of the present embodiment was also compared to the MATCHER approach disclosed in Welch et al., 2017. MATCHER assumes a one-dimensional latent structure, which is violated by the branching structure in the simulated data. Additionally, MATCHER is a probabilistic model based on Gaussian processes and is thus limited in terms of scalability. To this end a budget of 48 hours compute time with maximum of 40 Gb memory was set. After this, the matching step of the present embodiment was applied to find matches using a kNN graph with k=50 and null node cost of 95. FIG. 8 shows a density plot of the pseudotime lable between a pair of cells matched between source and target technologies using latent codes obtained from MATCHER. Colored contours correspond to cells paired within the same branch, while the grey contour corresponds to cells paired to a different branch. The accuracy with respect to the branch label was 4%, while 40% of the cells were matched to the null node, since a better match could not be identified.

Single-Cell Profiles of a Melanoma Patient

[0114] The real dataset used to further exemplify the operation of the method of this embodiment is generated by the Tumor Profiler (TuPro) Consortium (Irmisch et al., 2020) as part of a multi-center, multi-cancer study comprising metastatic tumors from a cohort of deeply phenotyped individuals. Each patient's data is analyzed with multiple technologies, including scRNA-sequencing (Tang et al., 2009), Cytometry by Time Of Flight (Bandura et al., 2009, CyTOF) and Imaging Mass Cytometry (Giesen et al., 2014, IMC). All of these are capable of dissecting the tumor microenvironment and providing single-cell level, complementary information about the sample of interest. Although cell identity is lost throughout the process, the cells investigated by both technologies stem from the same population (i.e., were obtained from an aliquot of a common cell suspension).

[0115] For the CyTOF dataset, the patient's sample was profiled with CyTOF using a 40-markers panel designed for an in-depth characterization of the immune compartment of a sample. Data pre-processing was performed following the workflow described in Chevrier et al., 2017, 2018. Cell-type assignment was performed using a Random Forest classifier trained on multiple manually-gated samples. To investigate the utility of the method of the present embodiment, a subset of B Cells and T Cells only was considered, comprising m=135,334 cells. This dataset is referred to below as the target dataset.

[0116] A second aliquot of the same patient sample was analyzed by droplet-based scRNA-sequencing using the 10× Genomics platform. Standard QC-measures and pre-processing steps, such as removal of low quality cells, as well as filtering out mitochondrial, ribosomal and non-coding genes, were applied. Expression data was library-size normalized and corrected for the cell-cycle effect. Cell-type identification was performed using a set of cell-type-specific marker genes (Tirosh et al., 2016). Genes were then filtered to a set of 256, retaining those that could code for proteins measured in CyTOF channels, the top 32 T Cell/B Cell marker genes, and the remaining most variable genes. The total number of B Cells and T Cells in this dataset amounts to n=4,683. The scRNA-Seq dataset is used as the source dataset in the examples below.

TABLE-US-00005 TABLE 5 Characteristics of TuPro patient datasets. Dataset # Markers # Cells T Cells B Cells scRNA-Seq 1,024 135,334 70% 30% CyTOF 41 4,683 73% 27%

[0117] A 2-layer encoder and decoder was used for the scRNA-Seq/CyTOF technology, with 64/8 units in each hidden layer respectively. The discriminator was a 2 layer network with 8 units and the latent dimension was set to 8. All networks used the ReLU activation. The discriminator was conditioned with the cell type label. These were trained for 256 epochs. Identification of cell analogues between scRNA-Seq and CyTOF data, downsampled to the size of scRNA-Seq data (m=n=4,683 cells) is based on a sparse kNN graph, with k=500, computed using Euclidean distance.

Results

[0118] The method of the present embodiment was applied to integrate scRNA-Seq and CyTOF datasets generated from a single sample by the TuPro project as discussed above. Integrating these technologies (and other single-cell analysis techniques) can allow for a multi-view perspective on cell dynamics and thus can lead to a more thorough understanding of the undergoing biological processes.

[0119] The optimal matching on the latent codes has a 97% accuracy to recover the cell-type label. In comparison, matching on the data space for the same cells would result in only 72% accuracy with respect to the cell-type label.

[0120] A more fine-grained visual evaluation is performed by inspecting the matches on a tSNE embedding of the integrated latent space shown in FIG. 9, marked by grey lines. The colours (blue, orange) represent cell-types (B Cell, T Cell) and colour shades correspond to the profiling technology (light: scRNA-Seq, dark: CyTOF). The data has been downsampled at random to the size of the scRNA-Seq data (n=4.683 cells).

[0121] Given different cell-type proportions in the data, a certain percentage of mismatches, in this case 3%, is expected, which corresponds to the lines joining points across the two cell types. Note that, for this visualization and to investigate and demonstrate the accuracy of the method, no null node was included in the graph, which would have captured most of the mismatches.

[0122] The equivalent plot using matching directly on the data space is shown in FIG. 12. With cell matches obtained via matching on the data space indicated by the grey lines. Colours (blue, orange) represent cell types (B Cell, T Cell) and colour shades correspond to the profiling technology (light: scRNA-Seq, dark: CyTOF). The data has been downsampled at random to the size of scRNA-Seq data (n=4.683 cells). The accuracy of such matching is inferior to that using the latent representations and thus, many more links across the cell types can be observed.

[0123] Further, a more fine-grained information of marker expression correlation was used to quantitatively assess the matching quality in both cases. This used the correlation coefficients between the expression of a cancer-relevant HLA-DRA gene and the corresponding HLA-DR protein abundance, with the bivariate density plots shown in FIG. 10 with the HLA-DRA marker abundance measured with scRNA-Seq (gene, x-axis) and CyTOF (protein, y-axis). Coloured contours (bottom left, top right) correspond to cells paired within the same cell-type. The grey contour (top left, bottom right) corresponds to cells paired to a different cell-type. The colour intensity is weighted by the cell-type proportions. The plots correspond to, from left to right, matching in latent space, matching in data space and matching at random. Spearman's and Pearson's correlation coefficients are indicated for each plot and clearly show the benefits of performing optimal matching (matching at random has the lowest correlation coefficients) and at using shared latent representations for this task (highest correlation coefficients).

[0124] These show that matching using shared latent representations is significantly superior to using common features in data space (Pearson's coefficients: 0.64 and 0.22, respectively). Moreover, in both cases the optimal matching led to an improved expression correlation in comparison to matching the cells between the two technologies at random (Pearson's coefficient of 0.01). Thus, even in the presence of a subset of paired features across the technologies, using the shared latent representations proves beneficial for finding cell analogues.

[0125] A universal divergence estimator (Wang et al., 2009) was used, as described above, to evaluate the quality of the integrated latent space. FIG. 11 shows the training progress of a method according to the embodiment of the present invention on the cancer patient sample. A VAE was trained on scRNA-Seq data and a method according to the embodiment described above used to integrate CyTOF representations into the latent space defined by the scRNA-Seq codes.

[0126] FIG. 13 shows several metrics as the model trains. The top panel shows the negative log-likelihood of the CyTOF reconstruction. The middle panel shows the performance of the discriminator to correctly classify scRNA-Seq codes (critic-prior), CyTOF codes (critic-code) and the ability of the encoder to fool the discriminator, i.e. the mis-classification accuracy (generator). The bottom left panel shows the divergence of the latent representations. To measure the degree of integration of the latent space, the estimated divergence is computed on the latent representations of the two technologies, and the optimization is deemed to be a success if the divergence and reconstruction error are below empirically-set thresholds (divergence<0.3, custom-character .sub.nll<47). It was found that performance depended on tuning β and the learning rates of the encoder and critic networks as shown in Table 6 below. The most stable hyperparameter setting employs a large weight on the adversarial gradients, β=512, with a critic learning rate of 1×10.sup.−3, and an encoder/decoder learning rate of 5×10.sup.−4.

TABLE-US-00006 TABLE 6 Ablation study for training a method of the present embodiment on a real tumor sample. β is the regularization strength of the adversarial loss. Learning rates are the initial settings of the ADAM optimizer. If the latent space divergence (Wang et al., 2009) is below 0.3 and the negative log likelihood of the input under the reconstruction is below 47, the training is deemed a success. These values were chosen empirically. β and optimizer learning rates were heavily influential on model success. Learning Rate β Discriminator Encoder/Decoder Success Replicates 512 0.0010 0.0005 30% 10 128 0.0005 0.0005 12% 8 64 0.0010 0.0010 12% 8 64 0.0005 0.0010 11% 9 256 0.0010 0.0005 10% 10 256 0.0005 0.0005 7% 14 512 0.0005 0.0010 0% 9 512 0.0005 0.0005 0% 4 512 0.0010 0.0010 0% 3 256 0.0010 0.0010 0% 8 256 0.0005 0.0010 0% 6 128 0.0010 0.0005 0% 5 128 0.0010 0.0010 0% 6 128 0.0005 0.0010 0% 1 64 0.0010 0.0005 0% 6 64 0.0005 0.0005 0% 3 32 0.0005 0.0005 0% 5 32 0.0005 0.0010 0% 2 32 0.0010 0.0010 0% 4 32 0.0010 0.0005 0% 7

[0127] Bipartite matching was chosen since it ensures a global optimal matching between two technologies across all cells. Due to the large number of cells profiled with each individual technology per sample as well as the nature of the cell matching problem, as described above, a kNN search was combined with a bipartite matching. A comparison of the accuracy is shown in FIG. 11 in which the number of considered Nearest Neighbors k is indicated on the x-axis, which is displayed on a log 10-scale. The fraction of True Positive matches, with respect to cell-type label, is depicted on the y-axis. The accuracy level of dense connection is achieved already for k=500 and is within 1% absolute difference for k=100.

[0128] Thus, comparing the bipartite matching alone to the combined approach shows that equivalent accuracies can be achieved using a neighbourhood of only k=500 for the total of 4,683 cells. This is in line with expectations, since a match to an extremely distant neighbor could only be justified in case of many multiple ties, i.e., a set of indistinguishable cells. Thus, matching in the search space reduced to the closest neighborhood is correspondent to finding cell analogues based on the dense distance matrix.

[0129] Further, solving the task using the combined approach is computationally efficient, as with 0.5Gb of memory usage and less than 1 min of compute the pairwise matches for all the 4,683 cells were found, using k=500. As expected, with increasing k more computational resources are required to solve the optimal matching problem (Table 7 below). Nevertheless, the hyperparameter k can be set to be much smaller than the data dimensionality, whilst in most cases providing a substantial gain in computational performance over the matching on a fully connected graph.

TABLE-US-00007 TABLE 7 Memory usage and computation time of the optimal matching as the k hyperparameter in the kNN search increases. The table indicates the values obtained on the downsampled TuPro dataset, consisting of 4.683 cells per technology. k Max Memory [Mb] CPU time [s] 10 349.14 3.86 20 349.14 4.43 50 349.20 5.43 100 347.48 6.43 150 347.47 7.86 320 414.90 4.57 500 538.60 4.50 1000 826.68 7.57 1500 1119.6 9.71 2000 1412.5 12.43 2500 1705.3 15.43 3000 1998.1 17.57 3500 2289.2 20.29 4000 2583.2 23.00 4683 2983.9 26.29

[0130] Similar results were observed for the effect of varying the hyperparameter k in the application of the Minimum-Cost Maximum-Flow matching algorithm as illustrated in Table 8 below. Whilst the overall speed of this algorithm is slower, the method has other advantages in terms of scalability and providing trade-offs between true and false positive rates.

TABLE-US-00008 TABLE 8 Memory usage and computation time of the MCMF matching algorithm as the k hyperparameter in the kNN search increases. The table indicates the values obtained on the downsampled TuPro dataset, consisting of 4.683 cells per technology. k Max Memory [Mb] CPU time [s] 5 1674.9 1492.7 10 2456.3 2803.3 25 4690.1 7723.7 50 8740.3 7441.3 75 12344 12656 100 17002 17331 125 20606 22857 150 24213 29077

Conclusions

[0131] The method of the present embodiment provides a technology-invariant approach that pairs single-cell measurements across multi-modal datasets, without requiring feature correspondences, making real multi-modal single-cell analysis possible, and opening up new opportunities to gain a multi-view understanding of the dynamics of individual cells in various disease or developmental states. The underlying auto-encoder framework combined with a customized bipartite matching approach ensures scalability.

[0132] By introducing a divergence measure, common problems in adversarial training like training instability and convergence problems were addressed.

[0133] Scalability was provided by using a modified bipartite matching solution to efficiently match corresponding cells across technologies. The modifications and extensions to the bipartite matching can provide a wider applicability of our method, since shifts in cell-type composition across disjoint aliquots, even coming from the same sample, can be expected. Furthermore, the introduction of the null node ensures a higher quality of matches, by avoiding forced mismatches and thus, improving confidence in the cell-to-cell assignments.

[0134] With increasing data dimensionality, the number of nearest neighbours (k) should also rise, since more ties are likely to occur. Nevertheless, the difference in the number of True Positives across various values of k for the same dataset stays within 5% in the experiments carried out and hence it is considered that the performance is quite robust against the choice of this hyperparameter.

[0135] While the specific embodiment described above uses two datasets, it will be appreciated that the approaches described can be extended to situations in which there are three (or more) datasets. The latent space may be constructed using all of the datasets simultaneously, with the objective function adapted accordingly. Further, for larger numbers of datasets, it is believed to remain most computationally efficient to perform the matching using a bipartite matching approach between each potential pair of datasets.

[0136] The systems and methods of the above embodiments may be implemented in a computer system (in particular in computer hardware or in computer software) in addition to the structural components and user interactions described.

[0137] The term “computer system” includes the hardware, software and data storage devices for embodying a system or carrying out a method according to the above described embodiments. For example, a computer system may comprise a central processing unit (CPU), input means, output means and data storage. Preferably the computer system has a monitor to provide a visual output display. The data storage may comprise RAM, disk drives or other computer readable media. The computer system may include a plurality of computing devices connected by a network and able to communicate with each other over that network.

[0138] The methods of the above embodiments may be provided as computer programs or as computer program products or computer readable media carrying a computer program which is arranged, when run on a computer, to perform the method(s) described above.

[0139] The term “computer readable media” includes, without limitation, any non-transitory medium or media which can be read and accessed directly by a computer or computer system. The media can include, but are not limited to, magnetic storage media such as floppy discs, hard disc storage media and magnetic tape; optical storage media such as optical discs or CD-ROMs; electrical storage media such as memory, including RAM, ROM and flash memory; and hybrids and combinations of the above such as magnetic/optical storage media.

[0140] Those skilled in the art will appreciate that while the foregoing has described embodiments of the present invention, the present invention should not be limited to the specific configurations and methods disclosed in this description of the preferred embodiments. Those skilled in the art will recognise that present invention has a broad range of applications, and that the embodiments may take a wide range of modifications without departing from any inventive concept as defined in the appended claims.

[0141] The following references are hereby incorporated by reference in their entirety: [0142] Abadi, M. et al. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org. [0143] Ahuja, R. K. et al. (1993). Network Flows: Theory, Algorithms, and Applications. Prentice-Hall, Inc., USA. [0144] Amodio, M. and Krishnaswamy, S. (2018). MAGAN: Aligning biological manifolds. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 215-223, Stockholmsmässan, Stockholm Sweden. PMLR. [0145] Bandura, D. R. et al. (2009). Mass cytometry: Technique for real time single cell multitarget immunoassay based on inductively coupled plasma time-of-flight mass spectrometry. Analytical Chemistry, 81(16), 6813-6822. [0146] Chevrier, S. et al. (2017). An immune atlas of clear cell renal cell carcinoma. Cell, 169(4), 736-749.e18. [0147] Chevrier, S. et al. (2018). Compensation of signal spillover in suspension and imaging mass cytometry. Cell Systems, 6(5), 612-620.e5. [0148] Dell'Amico, M. and Toth, P. (2000). Algorithms and codes for dense assignment problems: the state of the art. Discrete Applied Mathematics, 100(1), 17-48. [0149] Giesen, C. et al. (2014). Highly multiplexed imaging of tumor tissues with subcellular resolution by mass cytometry. Nature Methods, 11(4), 417-422. [0150] Heusel, M. et al. (2017). GANs trained by a two Time-Scale update rule converge to a local nash equilibrium. [0151] Irmisch, A. et al. (2020). The tumor profiler study: Integrated, multi-omic, functional tumor profiling for clinical decision support. medRxiv. [0152] Jonker, R. and Volgenant, A. (1987). A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing, 38(4), 325-340. [0153] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. [0154] Kingma, D. P. and Welling, M. (2013). Auto-Encoding variational bayes. Klein, M. (1967). [0155] A primal method for minimal cost flows with applications to the assignment and transportation problems. Management Science, 14(3), 205-220. [0156] Kovács, P. (2015). Minimum-cost flow algorithms: an experimental evaluation. Optimization Methods and Software, 30(1), 94-127. [0157] Liu, J. et al. (2019). Jointly embedding multiple single-cell omics measurements. BioRxiv, page 644310. [0158] Lucic, M. et al. (2017). Are GANs created equal? a Large-Scale study. [0159] Makhzani, A. et al. (2015). Adversarial autoencoders. [0160] Miyato, T. et al. (2018). Spectral normalization for generative adversarial networks. [0161] Oetjen, K. A. et al. (2018). Human bone marrow assessment by single-cell rna sequencing, mass cytometry, and flow cytometry. JCI Insight, 3(23). [0162] Papadopoulos, N. et al. (2019). PROSSTT: probabilistic simulation of single-cell RNA-seq data for complex differentiation processes. Bioinformatics, 35(18), 3517-3519. [0163] Rozenblatt-Rosen, O. et al. (2017). The human cell atlas: from vision to reality. Nature News, 550(7677), 451. [0164] Salimans, T. et al. (2016). Improved techniques for training GANs. In D. D. Lee, M. [0165] Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2234-2242. Curran Associates, Inc. [0166] Stoeckius, M. et al. (2017). Simultaneous epitope and transcriptome measurement in single cells. Nature Methods, 14, 865. [0167] Tang, F. et al. (2009). mrna-seq whole-transcriptome analysis of a single cell. Nature Methods, 6(5), 377-382. [0168] Tirosh, I. et al. (2016). Dissecting the multicellular ecosystem of metastatic melanoma by single-cell rna-seq. Science, 352(6282), 189-196. [0169] Wang, Q. et al. (2009). Divergence estimation for multidimensional densities via k-nearest-neighbor distances. IEEE Transactions on Information Theory, 55(5), 2392-2405. [0170] Wang, T. et al. (2019). Bermuda: a novel deep transfer learning method for single-cell ma sequencing batch correction reveals hidden high-resolution cellular subtypes. Genome Biology, 20(1), 165. [0171] Welch, J. D. et al. (2017). Matcher: manifold alignment reveals correspondence between single cell transcriptome and epigenome dynamics. Genome Biology, 18(1), 138. [0172] Yang, K. D. and Uhler, C. (2019). Multi-domain translation by learning uncoupled autoencoders. CoRR, abs/1902.03515. [0173] Zhu, C. et al. (2020). Single-cell multimodal omics: the power of many. Nature Methods, 17(1), 11-14.

METHODS OF DETERMINING CORRESPONDENCES BETWEEN BIOLOGICAL PROPERTIES OF CELLS

Inventors

Cpc classification

Classification Explorer

C12Q1/68

CHEMISTRY; METALLURGY

Classification Explorer

G06V10/82

PHYSICS

Classification Explorer

G06V30/1988

PHYSICS

Classification Explorer

G16H50/20

PHYSICS

Classification Explorer

G06V20/69

PHYSICS

Classification Explorer

G06V30/19173

PHYSICS

Classification Explorer

G16B40/30

PHYSICS

International classification

Classification Explorer

G16B40/30

PHYSICS

Abstract

Claims

Description