PREDICTING METHOD OF CELL DECONVOLUTION BASED ON A CONVOLUTIONAL NEURAL NETWORK
20230223099 · 2023-07-13
Assignee
Inventors
- Zhendong Liu (Shanghai, CN)
- Xinrong Lv (Shandong, CN)
- Yunxiang Liu (Shanghai, CN)
- Ying Chen (Shanghai, CN)
Cpc classification
G16B5/00
PHYSICS
International classification
G16B5/00
PHYSICS
G16B30/00
PHYSICS
Abstract
A method for predicting cell deconvolution based on a convolutional neural network is provided. A convolutional neural network is used to infer the cell type composition of a tissue from single-cell RNA sequencing data. Compared with traditional cell deconvolution algorithms, the method avoids both the complex data preprocessing and the purpose-designed mathematical algorithm for standardizing single-cell sequencing data that those algorithms require. The convolutional neural network designed in the present disclosure extracts hidden features from the single-cell RNA sequencing data, its network nodes are highly robust to noise and errors in the data, and the internal relations among genes are fully mined, so that the cell deconvolution performance is improved. Meanwhile, the model of the present disclosure is established based on the neural network.
Claims
1. A method of cell deconvolution based on a convolutional neural network, comprising the following steps: (1) using single-cell RNA sequencing data to simulate artificial tissues, and determining a total number K of cells in a simulated artificial tissue and a number Q of artificial tissues that need to be generated; extracting K cells from the single-cell RNA sequencing data, and combining the gene expression matrices of the extracted cells to form a gene expression matrix of the simulated artificial tissue $X = \{X_1, X_2, \ldots, X_u, \ldots, X_n\}$, in which $X_u$ is a feature of the simulated tissue, $1 \le u \le n$; denoting a proportion $Z = \{Z_1, Z_2, \ldots, Z_i, \ldots, Z_t\}$ of each cell type in the tissue as marking information of the tissue, in which $Z_i$ is the cell proportion of a certain cell type in the tissue, and t is the number of cell types in the tissue, $1 \le i \le t$; K is a positive integer greater than 1, and Q is a positive integer greater than 1; (2) screening the features of the simulated artificial tissue $X = \{X_1, X_2, \ldots, X_u, \ldots, X_n\}$ obtained in step (1), converting each feature $X_u$ into logarithmic space and performing a normalizing operation on each feature, $1 \le u \le n$; obtaining a data set $X'$ through the above processing; (3) if the data set $X'$ obtained in step (2) comes from s different data sets, dividing the data set $X'$ into a training set $X'_{train}$ and a test set $X'_{test}$ for s-fold cross-validation, in which the training set consists of data from s-1 different sources, and the test set consists of partial data from the remaining one source, determining the batch size, and randomly extracting batch-size data $X'_{batch}$ from the training set $X'_{train}$ as input data of one training; (4) obtaining the cell type number t of the tissue from the input data in step (3) as the number of neurons in the last layer of the fully connected module of the convolutional neural network, constructing a convolutional neural network model Cbccon, and determining the learning rate of the model, the number of training iterations (step) of the model, and the optimization algorithm of the model; inputting $X'_{batch}$ in step (3) as the data of one training into the Cbccon model for performing model training, and obtaining the predicted tissue cell proportion $\hat{Z} = \{\hat{Z}_1, \hat{Z}_2, \ldots, \hat{Z}_i, \ldots, \hat{Z}_t\}$, in which $\hat{Z}_i$ is the cell proportion of a certain cell type in the tissue predicted for the training set, $1 \le i \le t$; calculating the loss function between the predicted value and the real value of the cell proportion by the formula
2. The method of cell deconvolution based on the convolutional neural network according to claim 1, wherein the K is 100-5000, and the Q is 1000-100000.
3. The method of cell deconvolution based on the convolutional neural network according to claim 1, wherein using single-cell RNA sequencing data for simulation in step (1) comprises the following steps: (1-1) determining the proportion of each cell type in a single simulated cell tissue by the formula
4. The method of cell deconvolution based on the convolutional neural network according to claim 1, wherein the value of the batch size in step (3) is 128.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DESCRIPTION OF THE EMBODIMENTS
[0031] In order to clearly illustrate the technical scheme of the present disclosure, the present disclosure will be described hereinafter with reference to
[0032] It should be pointed out that the following detailed description is exemplary and is intended to provide further explanation of the present disclosure. Unless otherwise indicated, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the art to which the present disclosure belongs.
[0035] The data is single-cell RNA sequencing data from human peripheral blood mononuclear cells (PBMC), drawn from four data sets that are referred to herein as data6k, data8k, donorA and donorC. The input to Cbccon consists of two txt files: count.txt contains the single-cell gene expression matrix of the PBMC data, and celltype.txt contains the cell types present in the PBMC tissues. The output of Cbccon consists of a pb file, a txt file and a csv file. The parameters of the trained model are saved in the savemodel.pb file. The prediction.txt file contains the predicted proportion of each cell type in the tissue. The compare.csv file compares the scores of the Cbccon model on the evaluation indexes RMSE, relate, hrelate and uniform with those of the CPM, Ci, Cix and Music methods, so as to compare the performance of the models. The total number of cells in a simulated artificial tissue is set as K=500, and the number of artificial tissues to be generated is set as Q=32000. The number of samples in one training batch is batch size=128. The learning rate of the model is learning rate=0.0001. The number of training iterations is step=5000. The optimization algorithm of the model is the RMSprop algorithm. The following are the specific steps of the cell deconvolution algorithm.
1. Single-Cell RNA Sequencing Data Is Used to Simulate Artificial Tissue
[0036] Single-cell RNA sequencing data from the PBMC data sets data6k, data8k, donorA and donorC is used to simulate artificial tissues, with the total number of cells in a simulated artificial tissue set to K=500 and the number of artificial tissues to be generated set to Q=32,000. 500 cells are extracted from the single-cell RNA sequencing data, and the gene expression matrices of the extracted cells are combined to form the gene expression matrix of the simulated artificial tissue $X = \{X_1, X_2, \ldots, X_i, \ldots, X_n\}$, with entries $X_{ij}$ ($1 \le i \le 32738$, $1 \le j \le 32000$), which is the feature of the simulated tissue. The proportion $Z = \{Z_1, Z_2, \ldots, Z_i, \ldots, Z_t\}$ of each cell type in the tissue is denoted as the marking information of the tissue, in which $Z_i$ ($1 \le i \le 6$) is the cell proportion of a certain cell type in the tissue. The simulation includes the following steps:
[0037] (1-1) determining the proportion of each cell type in a single simulated tissue by the formula $Z_i = f_i / \sum_{k=1}^{t} f_k$, that is, determining the marking information $Z = \{Z_1, Z_2, \ldots, Z_t\}$ of the simulated tissue, in which $Z_i$ ($1 \le i \le 6$) is the cell proportion of a certain cell type in the simulated tissue, $f_i$ is a random number created for a single cell type, $Z_i$ has a value in [0,1], and $\sum_{k=1}^{t} f_k$ is the sum of the random numbers created for all cell types, in which [formula not reproduced];
[0038] (1-2) determining the number of cells of each cell type to be actually extracted for a single simulated tissue by the formula $C_i = Z_i \cdot K$ ($1 \le i \le 6$), K=500, that is, determining the number of cells $C = \{C_1, C_2, \ldots, C_i, \ldots, C_t\}$ extracted for each cell type of a single simulated tissue, in which $C_i$ ($1 \le i \le 6$) is the number of cells to be extracted for a single cell type of a simulated tissue, $Z_i$ is the cell proportion of that cell type in the simulated tissue, and K is the total number of cells set for a simulated artificial tissue, in which [formula not reproduced].
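Steps (1-1) and (1-2) can be illustrated with a minimal NumPy sketch. It is not the Cbccon implementation: the function name `simulate_tissue`, the rounding of $Z_i \cdot K$ down to an integer, sampling with replacement, and summing the expression of the drawn cells are all assumptions made for illustration, since the text only says that the extracted cells are "combined".

```python
import numpy as np

def simulate_tissue(counts, cell_types, K=500, rng=None):
    """Simulate one artificial tissue (hypothetical helper).

    counts:     (n_cells, n_genes) single-cell expression matrix
    cell_types: (n_cells,) array with the cell type label of every cell
    """
    rng = np.random.default_rng() if rng is None else rng
    types = np.unique(cell_types)

    # Step (1-1): one random number f_i per cell type, normalised so that
    # the proportions Z_i lie in [0, 1] and sum to 1.
    f = rng.random(len(types))
    z = f / f.sum()

    # Step (1-2): C_i = Z_i * K, rounded down to an integer (assumption).
    c = np.floor(z * K).astype(int)

    # Draw C_i cells of each type (with replacement, an assumption) and
    # sum their expression to form the tissue profile (also an assumption).
    expr = np.zeros(counts.shape[1])
    for cell_type, n in zip(types, c):
        idx = np.flatnonzero(cell_types == cell_type)
        picked = rng.choice(idx, size=n, replace=True)
        expr += counts[picked].sum(axis=0)
    return expr, z

# Toy data: 60 cells, 5 genes, 3 cell types (values are made up).
rng = np.random.default_rng(0)
toy_counts = rng.poisson(2.0, size=(60, 5))
toy_types = np.repeat(np.array(["Bcells", "NK", "Monocytes"]), 20)
expr, z = simulate_tissue(toy_counts, toy_types, K=50, rng=rng)
```

Calling the function Q times would yield Q feature vectors and Q label vectors, one pair per simulated tissue.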
2. Data Preprocessing
[0039] The data of the simulated artificial tissue $X = \{X_1, X_2, \ldots, X_i, \ldots, X_n\}$, with entries $X_{ij}$ ($1 \le i \le 32738$, $1 \le j \le 32000$), obtained in step 1 is pre-processed. The features $X_i$ ($1 \le i \le 32738$) in the data set X are screened to remove 21,410 feature items, leaving 11,328 features. Thereafter, X is converted into logarithmic space and a normalizing operation is performed. The data set $X'$ is obtained through the above pre-processing, which includes the following steps.
[0040] (2-1) The data $X_i$ ($1 \le i \le 32738$) is converted into logarithmic space by the formula $\tilde{X}_{ij} = \log_2(X_{ij} + 1)$ to obtain $\tilde{X}$. Taking $\tilde{X}_1$ as an example, the values of the A1BG feature are converted from [105.2, 83.5, 55.8, ...] into [6.73, 6.4, 5.82, ...].
[0041] (2-2) Linear normalization is performed on $\tilde{X}$ by the formula
$$X'_{ij} = \frac{\tilde{X}_{ij} - \min(\tilde{X}_i)}{\max(\tilde{X}_i) - \min(\tilde{X}_i)}$$
($1 \le i \le n$, $1 \le j \le m$), so that the values of $\tilde{X}_i$ are scaled to [0,1] to obtain $X'$. Taking $\tilde{X}_1$ as an example, the maximum value of the A1BG feature is 10.54, and the minimum value thereof is 0.53.
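The two preprocessing steps can be checked numerically with a short NumPy sketch. The function name `log_minmax` and the handling of constant features are assumptions; the input values are the three A1BG values quoted in step (2-1). Because only three values are used here, the minimum and maximum differ from the 0.53 and 10.54 reported for the full data set.

```python
import numpy as np

def log_minmax(X):
    """Log-transform and min-max scale each feature (hypothetical helper).

    X: (n_features, n_tissues) array of non-negative expression values.
    Returns the log2(x + 1) matrix and the matrix scaled to [0, 1] per feature.
    """
    X_log = np.log2(X + 1.0)
    lo = X_log.min(axis=1, keepdims=True)
    hi = X_log.max(axis=1, keepdims=True)
    # Assumption: a feature with a single constant value is mapped to 0
    # instead of dividing by zero.
    span = np.where(hi > lo, hi - lo, 1.0)
    return X_log, (X_log - lo) / span

# The three A1BG values quoted in step (2-1).
a1bg = np.array([[105.2, 83.5, 55.8]])
a1bg_log, a1bg_scaled = log_minmax(a1bg)
```

The log-transformed values come out close to the 6.73, 6.4 and 5.82 given in the text, and the scaled values span exactly [0, 1].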
3. Dividing the Data Set
[0042] The data set $X'$ obtained in step 2 comes from 4 different data sets, namely, data6k, data8k, donorA and donorC. There are six cell types in the data set, namely, Monocytes, Unknown, CD4Tcells, Bcells, NK and CD8Tcells, in which Unknown represents an unknown cell type. The data set is divided into a training set $X'_{train}$ and a test set $X'_{test}$ for 4-fold cross-validation, in which the training set consists of data from 3 different sources, and the test set consists of partial data from the remaining one source. The data from data6k, data8k and donorC are selected from $X'$ as the training set, and data from donorA is used as the test set. For the convenience of testing, only 500 samples are extracted from donorA as the test set. The batch size is determined to be 128, and 128 samples $X'_{batch}$ are randomly extracted from the training set $X'_{train}$ as the input data of one training iteration.
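The split described above amounts to holding out one source and sampling batches from the rest. The sketch below shows one fold with NumPy index arrays. The helper names are hypothetical, and the even split of 8,000 tissues per source is an assumption, since the text gives only the total Q=32000.

```python
import numpy as np

def leave_one_source_out(sources, test_source, n_test=500, rng=None):
    """Return train/test indices for one cross-validation fold (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    sources = np.asarray(sources)
    train_idx = np.flatnonzero(sources != test_source)
    held_out = np.flatnonzero(sources == test_source)
    # Only part of the held-out source is used as the test set.
    test_idx = rng.choice(held_out, size=min(n_test, len(held_out)), replace=False)
    return train_idx, test_idx

def sample_batch(train_idx, batch_size=128, rng=None):
    """Randomly draw the indices of one training batch."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.choice(train_idx, size=batch_size, replace=False)

rng = np.random.default_rng(0)
# Assumption: the 32,000 simulated tissues are split evenly over the sources.
sources = np.repeat(np.array(["data6k", "data8k", "donorA", "donorC"]), 8000)
train_idx, test_idx = leave_one_source_out(sources, "donorA", n_test=500, rng=rng)
batch_idx = sample_batch(train_idx, 128, rng)
```

Repeating the call with each of the four source names in turn as `test_source` gives the four folds.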
4. Training the Cbccon Model
[0043] The cell type number t=6 of the tissue is obtained from the input data in step 3 and used as the number of neurons in the last layer of the fully connected module of the convolutional neural network. A convolutional neural network model Cbccon is constructed. The learning rate of the model is set to 0.0001, the number of training iterations (step) is set to 5000, and the optimization algorithm of the model is the RMSprop algorithm. $X'_{batch}$ from step 3 is input into the Cbccon model as the data of one training iteration, so as to obtain the predicted tissue cell proportion $\hat{Z} = \{\hat{Z}_1, \hat{Z}_2, \ldots, \hat{Z}_i, \ldots, \hat{Z}_t\}$ of the training set, in which $\hat{Z}_i$ ($1 \le i \le 6$) is the cell proportion of a certain cell type in the tissue predicted for the training set. The loss function $J_{MSE}$ between the predicted value and the real value of the cell proportion is calculated by a formula [formula not reproduced] in which $Z_i$ is the real cell fraction label of the tissue, and $\hat{Z}_i$ is the cell proportion finally predicted for the tissue. The loss function $J_{MSE}$ is optimized using the RMSprop optimization algorithm. Following step 3, $X'_{batch}$ is randomly extracted another 4,999 times for continued training, and after the training, the trained parameters of the Cbccon model are saved.
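The loss formula itself is not reproduced in this text. As a rough illustration only, the sketch below assumes the usual mean squared error, averaged over all entries, and the standard RMSprop update rule with a decay of 0.9. It does not build the Cbccon network: the "parameter" being optimised is the prediction vector itself, and the target proportions are made-up values.

```python
import numpy as np

def mse_loss(z_true, z_pred):
    """Mean squared error between real and predicted proportions (assumed form)."""
    return float(np.mean((z_true - z_pred) ** 2))

def rmsprop_step(param, grad, cache, lr=1e-4, decay=0.9, eps=1e-8):
    """One standard RMSprop update; decay and eps are assumed defaults."""
    cache = decay * cache + (1.0 - decay) * grad ** 2
    param = param - lr * grad / (np.sqrt(cache) + eps)
    return param, cache

# Toy problem: move a prediction for t = 6 cell types towards a made-up target.
z_true = np.array([0.20, 0.05, 0.40, 0.10, 0.10, 0.15])
z_pred = np.full(6, 1.0 / 6.0)
cache = np.zeros(6)
initial_loss = mse_loss(z_true, z_pred)

for _ in range(5000):                              # step = 5000, as in the embodiment
    grad = 2.0 * (z_pred - z_true) / z_true.size   # gradient of the MSE w.r.t. z_pred
    z_pred, cache = rmsprop_step(z_pred, grad, cache, lr=1e-4)

final_loss = mse_loss(z_true, z_pred)
```

In an actual training run the gradient would be taken with respect to the network weights and computed on a fresh random batch at every iteration.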
5. Using the Trained Model for Prediction
[0044] The Cbccon model trained in step 4 is used for prediction. The test set data $X'_{test}$, that is, the 500 test samples from donorA, is input into the trained model to obtain the prediction result, that is, the predicted tissue cell type proportion $Z' = \{Z'_1, Z'_2, \ldots, Z'_i, \ldots, Z'_t\}$ of the test set, in which $Z'_i$ ($1 \le i \le t$) is the cell proportion of a certain cell type in the tissue predicted for the test set data. Taking a simulated tissue named V241 in the test set as an example, the predicted cell proportions of V241 are as follows: the cell proportion of Monocytes type is 0.171; the cell proportion of Unknown type is 0.027; the cell proportion of CD4Tcells type is 0.428; the cell proportion of Bcells type is 0.102; the cell proportion of NK type is 0.086; and the cell proportion of CD8Tcells type is 0.185. The partial prediction results of the cell type proportion of 500 simulated tissues are shown in
6. Model Evaluation
[0045] Evaluation indexes are constructed from the results obtained in step 4 and step 5, and the performance of the model is evaluated. The performance of the Cbccon model is evaluated by four formulas [formulas not reproduced], respectively, and is compared with that of the CPM, Cibersort (Ci), Cibersortx (Cix) and MuSic methods. $Z'$ is the predicted cell proportion, $Z$ is the actual cell proportion, $\partial_{z}$ and $\partial_{z'}$ represent the standard deviations of the actual and the predicted cell proportions, respectively, and $\gamma_{z}$ and $\gamma_{z'}$ represent the averages of the actual and the predicted cell proportions, respectively. Comparing the evaluation indexes shows that, relative to the other algorithms, the Cbccon model has a lower RMSE value, a smaller variation range and a higher relate value. This shows that the Cbccon method has better deconvolution performance than the other algorithms. The improvement in prediction accuracy is mainly due to the fact that the convolution layer used in the model can fully mine the internal relations among genes from single-cell RNA sequencing data, thus extracting the hidden features of the data. Moreover, the network nodes of Cbccon are highly robust to noise and deviation in the data, so that the prediction accuracy of the cell proportion is higher. In addition, Cbccon solves the problem that the traditional algorithm needs a gene expression matrix of specific cell types to deconvolve the cells and needs various added constraints to standardize the model. The model structure is intuitive and understandable, and is highly extensible. The comparison results are shown in
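The four evaluation formulas are not reproduced in this text, so the sketch below covers only two standard quantities that match the symbols defined in the paragraph: the root mean squared error and the Pearson correlation coefficient, which is built from the means and standard deviations of the two sets of proportions. Treating the "relate" index as the Pearson correlation is an assumption, and the toy proportions are made up.

```python
import numpy as np

def rmse(z_true, z_pred):
    """Root mean squared error over all tissues and cell types."""
    return float(np.sqrt(np.mean((z_true - z_pred) ** 2)))

def pearson_r(z_true, z_pred):
    """Pearson correlation between actual and predicted proportions."""
    zt, zp = z_true.ravel(), z_pred.ravel()
    cov = np.mean((zt - zt.mean()) * (zp - zp.mean()))
    return float(cov / (zt.std() * zp.std()))

# Two toy tissues with three cell types each (made-up values).
z_true = np.array([[0.20, 0.30, 0.50],
                   [0.10, 0.60, 0.30]])
z_pred = np.array([[0.25, 0.25, 0.50],
                   [0.10, 0.55, 0.35]])
rmse_val = rmse(z_true, z_pred)
r_val = pearson_r(z_true, z_pred)
```

A lower `rmse_val` and an `r_val` closer to 1 both indicate predictions that track the actual proportions more closely.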
[0046] After fitting the model with the training data in step 4, the data coverage rate achieved by Cbccon is counted as follows:
[0047] (1) data with the error between the predicted value and the true value of the cell proportion within 10%; coverage rate: 99.8%;
[0048] (2) data with the error between the predicted value and the true value of the cell proportion within 5%; coverage rate: 85%;
[0049] (3) data with the error between the predicted value and the true value of the cell proportion within 1%; coverage rate: 30%.
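A coverage rate of this kind can be computed as the share of predictions whose error falls within a threshold. The sketch below assumes that the error is the absolute difference between predicted and true proportion, counted per tissue and cell type. The text does not say whether the error is absolute or relative, and the toy values are made up.

```python
import numpy as np

def coverage_rate(z_true, z_pred, tol):
    """Fraction of entries whose absolute error is at most tol (assumed definition)."""
    err = np.abs(z_true - z_pred)
    return float(np.mean(err <= tol))

# Two toy tissues with three cell types each (made-up values).
z_true = np.array([[0.20, 0.30, 0.50],
                   [0.10, 0.60, 0.30]])
z_pred = np.array([[0.23, 0.30, 0.47],
                   [0.18, 0.595, 0.225]])

cov10 = coverage_rate(z_true, z_pred, 0.10)   # all 6 entries are within 10%
cov05 = coverage_rate(z_true, z_pred, 0.05)   # 4 of 6 entries
cov01 = coverage_rate(z_true, z_pred, 0.01)   # 2 of 6 entries
```

As in the reported figures, the coverage rate can only fall or stay equal as the threshold tightens.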
[0050] Through the comparative result in
[0051] Finally, it should be explained that the above is only a preferred embodiment of the present disclosure, and it is not intended to limit the present disclosure. Although the present disclosure has been described in detail with reference to the aforementioned embodiments, it is still possible for those skilled in the art to modify the technical solutions described in the aforementioned embodiments or equivalently replace some of the technical features. Any modification, equivalent substitution, improvement, etc. made within the spirit and principle of the present disclosure shall be included in the scope of protection of the present disclosure.