SINGLE-STEP RETROSYNTHESIS METHOD AND SYSTEM BASED ON MULTI-SEMANTIC NETWORK
20230238083 · 2023-07-27
Inventors
- Juan LIU (Wuhan City, CN)
- Qiang ZHANG (Wuhan City, CN)
- Feng YANG (Wuhan City, CN)
- Zhihui YANG (Wuhan City, CN)
CPC classification
- G16C20/10 (PHYSICS)
Abstract
A single-step retrosynthesis method and system based on a multi-semantic network are provided. The method includes the following steps: inputting an ECFP4 feature and a SMILES word one-hot feature of a target product molecule during single-step retrosynthesis prediction, and outputting, through the multi-semantic network, the first k reactions which may occur on the target product molecule in the form of reaction templates. The SMILES string of a reactant corresponding to the target product molecule is calculated by applying the output reaction template to the SMILES string of the target product molecule. The present disclosure is the first method for performing single-step retrosynthesis prediction by using a multi-semantic fusion network in the field of single-step retrosynthesis. It is a template-based single-step retrosynthesis method, and the prediction result has relatively high interpretability. The network learns fused semantic features, ECFP4 semantic features and SMILES word one-hot semantic features.
Claims
1. A single-step retrosynthesis method based on a multi-semantic network, comprising: S1: acquiring a public data set, and preprocessing the public data set to obtain a preprocessed data set D, wherein each piece of data in the data set D corresponds to one specific reaction, and each piece of data comprises a reaction, a reactant molecule and a product molecule; S2: extracting reaction templates from all data in the data set D by using an RDChiral tool, and removing repeated reaction templates, to obtain a final reaction template set T, wherein each reaction template contains one or more reactions; S3: obtaining an ECFP4 feature set E of product molecules represented by ECFP4 vectors and a SMILES word one-hot feature set S of the product molecules represented by a SMILES word one-hot matrix, respectively, according to the product molecules in the data set D; S4: constructing a sample set G={(e.sub.i, s.sub.i), t.sub.i}.sub.i=1.sup.N, where e.sub.i∈E and s.sub.i∈S represent the ECFP4 feature and the SMILES word one-hot feature of a product molecule in an i-th data of the data set D, respectively, t.sub.i∈T represents a reaction template in the i-th data of the data set D, and N represents the number of samples in the sample set; S5: constructing the multi-semantic network, wherein the multi-semantic network comprises an input layer, a convolution layer, a normalization layer, an activation layer, a pooling layer, a dropout layer, a fully connected layer and an output layer, the convolution layer is configured to convolve input data, the normalization layer is configured to normalize a convolved feature, the activation layer is configured to activate a normalized feature, and the pooling layer performs a pooling operation on the ECFP4 feature and the SMILES word one-hot feature, respectively, so as to obtain an ECFP4 semantic feature and a SMILES word one-hot semantic feature; fusing the ECFP4 semantic feature with the SMILES word one-hot semantic feature to obtain a fused semantic
feature; and passing the fused semantic feature through the dropout layer, the fully connected layer and Softmax, to obtain a final output result by the output layer; S6: training the multi-semantic network in S5 by using the sample sets in S4 to obtain a trained single-step retrosynthesis prediction model; S7: for a target product molecule to be predicted, predicting a reaction template capable of generating the target product molecule by using the trained single-step retrosynthesis prediction model in S6, and calculating a SMILES string of the reactant molecule corresponding to the target product molecule by using the RDChiral tool in combination with the SMILES string of the target product molecule, thereby realizing single-step retrosynthesis prediction.
2. The method according to claim 1, wherein the S3 comprises: according to the product molecules in the data set D, generating the ECFP4 vectors of the product molecules in all data in the data set D by using the RDKit tool to obtain the ECFP4 feature set E of product molecules represented by ECFP4 vectors; generating the SMILES word one-hot matrix of the product molecules in all data in the data set D by using a Sklearn tool to obtain the SMILES word one-hot feature set S of the product molecules represented by the SMILES word one-hot matrix.
3. The method according to claim 2, wherein the generating the SMILES word one-hot matrix of the product molecules in all data in the data set D by using a Sklearn tool to obtain the SMILES word one-hot feature set S of the product molecules represented by the SMILES word one-hot matrix comprises: S3.1: performing one-hot-encoding on each character of an alphabet constructing the SMILES string to generate a word vector with dimension w.sub.2; using word vectors of a first l.sub.2 characters in each product molecule SMILES string to form a SMILES word one-hot matrix s.sub.2∈{0,1}.sup.l2×w2; S3.2: concatenating every successive n rows in the matrix s.sub.2 into one row, to constitute the SMILES word one-hot feature of the product molecule s∈{0,1}.sup.(l2/n)×(n·w2).
4. The method according to claim 1, wherein the multi-semantic network in S5 has one input layer, k.sub.1+k.sub.2 convolution layers, k.sub.1+k.sub.2 normalization layers, k.sub.1+k.sub.2 activation layers, k.sub.1+k.sub.2 pooling layers, two dropout layers, three fully connected layers and three output layers, where k.sub.1 and k.sub.2 are positive integers, and a processing step in S5 comprises: S5.1: inputting the ECFP4 feature represented by the ECFP4 vector at input node 1, and inputting the SMILES word one-hot feature represented by the SMILES word one-hot matrix at input node 2, wherein the input layer comprises the input node 1 and the input node 2; S5.2: convolving the ECFP4 feature input at the input node 1 by using k.sub.1 convolution kernels with a same size to obtain a convolved ECFP4 feature, wherein a number of output channels of the k.sub.1 convolution kernels is c.sub.1; S5.3: convolving the SMILES word one-hot feature input at the input node 2 by using k.sub.2 convolution kernels with different sizes to obtain a convolved SMILES word one-hot feature, wherein a number of output channels of the k.sub.2 convolution kernels is c.sub.2; S5.4: normalizing the convolved feature in S5.2 and the convolved feature in S5.3 by the normalization layer, respectively, to obtain a normalized ECFP4 feature and a normalized SMILES word one-hot feature; S5.5: performing a ReLU activation operation on the normalized ECFP4 feature and the normalized SMILES word one-hot feature by the activation layer, respectively, to obtain an activated ECFP4 feature and an activated SMILES word one-hot feature; S5.6: performing a max-pooling operation on the activated ECFP4 feature and the activated SMILES word one-hot feature by the pooling layer, respectively, to obtain a pooled ECFP4 feature and a pooled SMILES word one-hot feature; S5.7: concatenating the pooled ECFP4 feature to obtain a concatenated ECFP4 semantic feature, concatenating the pooled SMILES word one-hot feature to obtain a concatenated SMILES word one-hot semantic feature, and concatenating the ECFP4 semantic feature and the SMILES word one-hot semantic feature to obtain fused semantic features; S5.8: feeding the fused semantic features to one fully connected layer, and passing the fused semantic features through Softmax, to output a probability of each node which falls between [0,1] and is denoted as p.sub.1∈R.sup.d, p.sub.1 indicating a predicted occurrence probability of each type of reaction template; passing the ECFP4 semantic feature and the SMILES word one-hot semantic feature through the dropout layer, the fully connected layer and Softmax, respectively, to output probabilities of each node which fall between [0,1] and are denoted as p.sub.2∈R.sup.d and p.sub.3∈R.sup.d, respectively, p.sub.2 and p.sub.3 indicating occurrence probabilities of each type of reaction template predicted according to the ECFP4 semantic feature and the SMILES word one-hot semantic feature, d being a number of the reaction templates in a reaction template set T; S5.9: obtaining a final predicted result by the output layer according to a result in S5.8.
5. The method according to claim 1, wherein in the training process of step S6, according to three classification results of the model, three cross entropy losses of the model are denoted as loss.sub.1, loss.sub.2 and loss.sub.3, respectively, and a final loss of the single-step retrosynthesis prediction model is:
loss=α.sub.1loss.sub.1+α.sub.2loss.sub.2+α.sub.3loss.sub.3 where loss.sub.1, loss.sub.2 and loss.sub.3 represent a predicted loss of the fused semantic feature, a predicted loss of the ECFP4 semantic feature and a predicted loss of the SMILES word one-hot semantic feature, respectively, and α.sub.j (j=1,2,3) represent the weights of the three losses loss.sub.1, loss.sub.2 and loss.sub.3 in a global loss of the network, respectively, where Σα.sub.j=1 and α.sub.j∈(0,1).
6. A single-step retrosynthesis system based on a multi-semantic network, comprising: a data set preprocessing module configured to acquire a public data set and preprocess the public data set to obtain a preprocessed data set D, wherein each piece of data in the data set D corresponds to one specific reaction, and each piece of data comprises a reaction, a reactant molecule and a product molecule; a reaction template set constructing module configured to construct a reaction template set T; a feature constructing module configured to obtain an ECFP4 feature set E of product molecules represented by ECFP4 vectors and a SMILES word one-hot feature set S of product molecules represented by a SMILES word one-hot matrix, respectively, according to the product molecules in the data set D; a sample set constructing module configured to construct a sample set G={(e.sub.i, s.sub.i), t.sub.i}.sub.i=1.sup.N, where e.sub.i∈E and s.sub.i∈S represent the ECFP4 feature and the SMILES word one-hot feature of a product molecule in an i-th data of the data set D, respectively, t.sub.i∈T represents a reaction template in the i-th data of the data set D, and N represents the number of samples in the sample set; a multi-semantic network constructing module configured to construct the multi-semantic network, wherein the multi-semantic network comprises an input layer, a convolution layer, a normalization layer, an activation layer, a pooling layer, a dropout layer, a fully connected layer and an output layer, the convolution layer is configured to convolve input data, the normalization layer is configured to normalize a convolved feature, the activation layer is configured to activate a normalized feature, and the pooling layer performs a pooling operation on the ECFP4 feature and the SMILES word one-hot feature, respectively, so as to obtain the ECFP4 semantic feature and the SMILES word one-hot semantic feature; fuse the ECFP4 semantic feature with the SMILES word one-hot semantic feature to obtain a
fused semantic feature; and pass the fused semantic feature through the dropout layer, the fully connected layer and Softmax, to obtain a final output result by the output layer; a multi-semantic network training module configured to train the multi-semantic network in the multi-semantic network constructing module by using the sample set of the sample set constructing module to obtain a trained single-step retrosynthesis prediction model; a single-step retrosynthesis predicting module configured to, for a target product molecule to be predicted, predict a reaction template capable of generating the target product molecule by using the trained single-step retrosynthesis prediction model in the multi-semantic network training module, and calculate a SMILES string of the reactant corresponding to the target product molecule in combination with the SMILES string of the target product molecule, thereby realizing single-step retrosynthesis prediction.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0046] In order to explain the embodiments of the present disclosure or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art will be briefly described below. Apparently, the drawings in the following description show some embodiments of the present disclosure; those skilled in the art can derive other drawings from these drawings without creative effort.
DETAILED DESCRIPTION OF THE EMBODIMENTS
[0051] The present disclosure discloses a single-step retrosynthesis method and system based on a multi-semantic network. The method includes the following steps: inputting an ECFP4 feature and a SMILES word one-hot feature of a target product molecule during single-step retrosynthesis prediction, and outputting the first k reactions which may occur on the target product molecule in the form of reaction templates after passing through the multi-semantic network. The SMILES string of the reactant corresponding to the target product molecule is finally calculated according to the output reaction template in combination with the SMILES string of the target product molecule, thereby realizing single-step retrosynthesis prediction. The present disclosure further provides a single-step retrosynthesis system based on a multi-semantic network, which performs single-step retrosynthesis prediction by preprocessing a data set, constructing a reaction template set, constructing a feature, constructing a sample set, constructing a multi-semantic network, and training the multi-semantic network.
[0052] The method and system of embodiments of the present disclosure have the following advantages or beneficial technical effects.
[0053] The single-step retrosynthesis prediction method is the first method for performing single-step retrosynthesis prediction by using a multi-semantic fusion network in the field of single-step retrosynthesis, and is a template-based single-step retrosynthesis method. The prediction result has relatively high interpretability. The present disclosure designs a new loss function, which can improve the training precision of the model. The present disclosure designs a semantic extraction method, which can extract deep semantic features from the ECFP4 feature and the SMILES word one-hot feature of the target product molecule. The present disclosure designs a construction method for the SMILES word one-hot feature, which can contain more potential information. The single-step retrosynthesis prediction model of the present disclosure can be used for both single-step chemical retrosynthesis prediction and single-step bioretrosynthesis prediction, and no matching operation on enzyme information is required when the single-step bioretrosynthesis prediction is carried out.
[0054] In order to make the purposes, technical solutions and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure. Apparently, the described embodiments are a part of the embodiments of the present disclosure, rather than all of the embodiments. Based on the embodiments of the present disclosure, all other embodiments obtained by those skilled in the art without creative effort fall within the protection scope of the present disclosure.
Embodiment 1
[0055] The embodiment of the present disclosure provides a single-step retrosynthesis method based on a multi-semantic network, which includes the following steps S1-S7.
[0056] In step S1, a public data set is acquired and preprocessed to obtain a preprocessed data set D. Each piece of data in the data set D corresponds to one specific reaction, and each piece of data includes a reaction, a reactant molecule and a product molecule.
[0057] In step S2, reaction templates are extracted from all the data in the data set D, and repeated reaction templates are removed by using an RDChiral tool to obtain a final reaction template set T. Each reaction template contains one or more reactions.
[0058] In step S3, according to the product molecule in the data set D, an ECFP4 feature set E of a product molecule represented by an ECFP4 vector and a SMILES word one-hot feature set S of a product molecule represented by a SMILES word one-hot matrix are obtained, respectively.
[0059] In step S4, a sample set G={(e.sub.i, s.sub.i), t.sub.i}.sub.i=1.sup.N is constructed, where e.sub.i∈E and s.sub.i∈S represent the ECFP4 feature and the SMILES word one-hot feature of a product molecule in an i-th data of the data set D, respectively, t.sub.i∈T represents a reaction template in the i-th data of the data set D, and N represents the number of samples in the sample set.
[0060] In step S5, a multi-semantic network is constructed, the multi-semantic network includes an input layer, a convolution layer, a normalization layer, an activation layer, a pooling layer, a dropout layer, a fully connected layer and an output layer. The convolution layer is configured to perform convolution on the input data, the normalization layer is configured to perform normalization on the convolved feature, the activation layer is configured to activate the normalized feature, and the pooling layer performs pooling operation on the ECFP4 feature and the SMILES word one-hot feature, respectively, so as to obtain an ECFP4 semantic feature and a SMILES word one-hot semantic feature. The ECFP4 semantic feature is fused with the SMILES word one-hot semantic feature to obtain a fused semantic feature; and then, the fused semantic feature is passed through the dropout layer, the fully connected layer and Softmax, to obtain a final output result outputted by the output layer.
[0061] In step S6, the multi-semantic network in step S5 is trained by using the sample set in step S4 to obtain a trained single-step retrosynthesis prediction model.
[0062] In step S7, for a target product molecule to be predicted, the reaction template capable of generating the target product molecule is predicted using the single-step retrosynthesis prediction model trained in step S6, and then the SMILES string of the reactant molecule corresponding to the target product molecule is calculated by using the RDChiral tool in combination with the SMILES string of the target product molecule, thereby realizing single-step retrosynthesis prediction.
[0063] In the specific implementation process, each piece of data in the data set D specifically includes: (1) the reaction, represented by a SMILES string; (2) all reactant molecules participating in the reaction (represented by SMILES strings, in which, if there is more than one reactant molecule, the SMILES strings of the plurality of reactants are separated by separators); (3) one product molecule generated by the reaction (represented by a SMILES string); (4) an identification number of the catalytic enzyme (only for metabolic reactions; this field is optional). If there are a plurality of product molecules for a certain reaction in the original public data set, there are a plurality of related pieces of data in D, and each piece of data corresponds to one product of the reaction.
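For illustration, one record of the preprocessed data set D might be represented as a plain dictionary. The field names and the sample esterification reaction below are hypothetical conventions for this sketch, not taken from the actual data set:

```python
# One hypothetical record of the preprocessed data set D.
# Reactant SMILES strings are joined by a separator ("." here);
# the enzyme field is optional and only present for metabolic reactions.
record = {
    "reaction": "CCO.CC(=O)O>>CC(=O)OCC",  # full reaction SMILES
    "reactants": "CCO.CC(=O)O",            # two reactants, "."-separated
    "product": "CC(=O)OCC",                # exactly one product per record
    "enzyme": None,                        # optional catalytic-enzyme number
}

# A reaction with several products yields several records, one per product.
multi_product_reaction = "A>>P1.P2"  # schematic placeholder, not valid SMILES
```

A reaction such as `multi_product_reaction` would thus contribute two records to D, one keyed on each product.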
[0064] Referring to
[0065] In step S7, during the retrosynthesis prediction of the target product molecule to be predicted, an ECFP4 feature and a SMILES word one-hot feature of the target product molecule to be predicted are input, and the first k reactions which may occur on the target product molecule are output in the form of reaction templates by the single-step retrosynthesis prediction model. The SMILES string of the reactant corresponding to the target product molecule can be finally calculated according to the reaction template in combination with the SMILES string of the target product molecule, thereby realizing single-step retrosynthesis prediction.
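The "first k reactions" are simply the k reaction templates with the highest predicted probability. A minimal sketch of that selection step (the probability vector below is made up for illustration):

```python
def top_k_templates(probs, k):
    """Return indices of the k highest-probability reaction templates,
    sorted from most to least probable."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

# Made-up output of the Softmax layer over d = 5 templates.
p1 = [0.05, 0.40, 0.10, 0.30, 0.15]
print(top_k_templates(p1, 3))  # -> [1, 3, 4]
```

Each returned index identifies a template in T, which is then applied to the product SMILES to recover the reactant SMILES.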
[0066] In an embodiment, step S2 includes:
[0067] extracting reaction templates from all the data in the data set D and removing repeated reaction templates by using an RDChiral tool to obtain a final reaction template set T.
[0068] In the specific implementation process, the template_extractor function (template extraction function) in RDChiral is used to extract a reaction template in a SMARTS format.
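Step S2 amounts to extracting one SMARTS template per reaction and keeping only the distinct templates. A minimal sketch of the deduplication logic, with a stand-in `extract_template` in place of RDChiral's template_extractor (in practice the RDChiral function would return a SMARTS string; the stub and the schematic reaction strings here are purely illustrative):

```python
def extract_template(reaction_smiles):
    """Stand-in for RDChiral's template extraction; here it just
    pretends the part after '>>' identifies the template (hypothetical)."""
    return "t_" + reaction_smiles.split(">>")[1]

def build_template_set(reactions):
    """Deduplicate templates while preserving first-seen order,
    and map every reaction to its template index in T."""
    template_index = {}           # template string -> position in T
    labels = []                   # per-reaction template index
    for rxn in reactions:
        t = extract_template(rxn)
        if t not in template_index:
            template_index[t] = len(template_index)
        labels.append(template_index[t])
    T = list(template_index)      # final deduplicated template set
    return T, labels

reactions = ["A>>X", "B>>X", "C>>Y"]   # schematic reaction strings
T, labels = build_template_set(reactions)
print(T, labels)  # -> ['t_X', 't_Y'] [0, 0, 1]
```

The per-reaction indices double as the classification labels t.sub.i used when constructing the sample set G in step S4.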
[0069] In an embodiment, step S3 includes:
[0070] according to the product molecule in the data set D, generating an ECFP4 vector of the product molecule in all the data in the data set D by using the RDKit tool to obtain an ECFP4 feature set E of the product molecule represented by the ECFP4 vector; generating a SMILES word one-hot matrix of the product molecule in all the data in the data set D using a Sklearn tool to obtain the SMILES word one-hot feature set S of the product molecule represented by the SMILES word one-hot matrix.
[0071] In an embodiment, generating a SMILES word one-hot matrix of the product molecule in all the data in the data set D using a Sklearn tool to obtain the SMILES word one-hot feature set S of the product molecule represented by the SMILES word one-hot matrix includes the following steps S3.1-S3.2.
[0072] In step S3.1, one-hot-encoding is performed on each character of the alphabet constructing the SMILES string, to generate a word vector with a dimension of w.sub.2; the word vectors of the first l.sub.2 characters in each product molecule SMILES string are taken to form a SMILES word one-hot matrix s.sub.2∈{0,1}.sup.l2×w2.
[0073] In step S3.2, every successive n rows in the matrix s.sub.2∈{0,1}.sup.l2×w2 are concatenated into one row, thereby constituting the SMILES word one-hot feature of the product molecule s∈{0,1}.sup.(l2/n)×(n·w2).
[0074] Specifically, the SMILES word one-hot feature of the product molecule is a 0-1 matrix, and each row of the matrix is the concatenation of the word vector representations of n consecutive characters in the SMILES string of the product molecule. The SMILES word one-hot feature of the product molecule is generated by the following method. First, it is assumed that the alphabet of all molecule SMILES strings in the data set D contains w.sub.2 letters in total, and one-hot-encoding is performed on each character in all SMILES strings, so as to generate word vectors with length w.sub.2. Thereafter, the first l.sub.2 characters in the product molecule SMILES string are all represented by word vectors, and product molecule SMILES strings with fewer than l.sub.2 characters are padded with 0 vectors, to obtain the matrix s.sub.2∈{0,1}.sup.l2×w2. Finally, every n consecutive word vectors are concatenated into one composition of word vectors, and the matrix formed by these compositions of word vectors is the SMILES one-hot matrix feature of the product molecule s∈{0,1}.sup.(l2/n)×(n·w2).
[0075] In the specific implementation process, the length of the ECFP4 feature of the product molecule is 4096. The dimension of the SMILES word one-hot feature of the product molecule is s∈{0,1}.sup.75×120, which is a vectorized representation of the product molecule SMILES string. The generating step includes the following steps. First, one-hot-encoding is performed on the alphabet consisting of the characters in all the molecule SMILES strings, and 40-dimensional word vectors are generated. Then, the first 225 characters in the product molecule SMILES string are all represented by word vectors, and product molecule SMILES strings with fewer than 225 characters are padded with 0 vectors, to obtain s.sub.2∈{0,1}.sup.225×40. Finally, the word vectors of every three consecutive characters in s.sub.2∈{0,1}.sup.225×40 are concatenated into a composition of word vectors, to finally obtain the product word one-hot matrix s∈{0,1}.sup.75×120.
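The construction described in the preceding paragraphs can be sketched in plain Python. The toy alphabet and short SMILES below are assumptions for illustration only; the disclosure uses w.sub.2 = 40 characters, l.sub.2 = 225 and n = 3, which yields the 75×120 matrix:

```python
def smiles_one_hot(smiles, alphabet, l2, n):
    """Build the SMILES word one-hot feature: one-hot encode the first
    l2 characters (zero-padding shorter strings), then concatenate every
    n consecutive word vectors into one row."""
    w2 = len(alphabet)
    index = {ch: i for i, ch in enumerate(alphabet)}
    # s2: l2 x w2 one-hot matrix, padded with all-zero rows.
    s2 = [[0] * w2 for _ in range(l2)]
    for pos, ch in enumerate(smiles[:l2]):
        s2[pos][index[ch]] = 1
    # s: (l2 / n) x (n * w2) matrix of concatenated word vectors.
    return [sum(s2[r * n:(r + 1) * n], []) for r in range(l2 // n)]

alphabet = "CO()=c1"            # toy 7-character alphabet, illustration only
s = smiles_one_hot("CC(=O)OCC", alphabet, l2=9, n=3)
print(len(s), len(s[0]))        # -> 3 21  (i.e. (9/3) x (3*7))
```

With the disclosure's parameters (w.sub.2=40, l.sub.2=225, n=3) the same routine returns a 75×120 matrix.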
[0076] In an embodiment, the multi-semantic network in step S5 has one input layer, k.sub.1+k.sub.2 convolution layers, k.sub.1+k.sub.2 normalization layers, k.sub.1+k.sub.2 activation layers, k.sub.1+k.sub.2 pooling layers, two dropout layers, three fully connected layers and three output layers, where k.sub.1 and k.sub.2 are positive integers.
[0077] The processing step includes the following steps S5.1-S5.9.
[0078] In step S5.1, the input layer includes two nodes, the ECFP4 feature represented by the ECFP4 vector is input at the input node 1, and the SMILES word one-hot feature represented by the SMILES word one-hot matrix is input at the input node 2.
[0079] In step S5.2, the ECFP4 feature input at the node 1 is convolved by using k.sub.1 convolution kernels with the same size, in which a number of output channels of k.sub.1 convolution kernels is c.sub.1, so as to obtain the convolved ECFP4 feature.
[0080] In step S5.3, the SMILES word one-hot feature input at the node 2 is convolved by using k.sub.2 convolution kernels with different sizes, in which a number of output channels of k.sub.2 convolution kernels is c.sub.2, so as to obtain the convolved SMILES word one-hot feature.
[0081] In step S5.4, the feature after convolution in S5.2 and the feature after convolution in S5.3 are normalized by the normalization layer, respectively, to obtain the normalized ECFP4 feature and the normalized SMILES word one-hot feature.
[0082] In step S5.5, a ReLU activation operation is performed on the normalized ECFP4 feature and the normalized SMILES word one-hot feature by the activation layer, respectively, to obtain the activated ECFP4 feature and the activated SMILES word one-hot feature.
[0083] In step S5.6, max-pooling operation is performed on the activated ECFP4 feature and the activated SMILES word one-hot feature by the pooling layer, respectively, to obtain the ECFP4 feature and the SMILES word one-hot feature after pooling operation.
[0084] In step S5.7, the ECFP4 feature after max-pooling is concatenated to obtain the concatenated ECFP4 semantic feature, the SMILES word one-hot feature after max-pooling is concatenated to obtain the concatenated SMILES word one-hot semantic feature, and the ECFP4 semantic feature and the SMILES word one-hot semantic feature are concatenated to obtain the fused semantic features.
[0085] In step S5.8, the fused semantic features are fed to a fully connected layer, and then passed through Softmax to output the probability of each node, which falls between [0,1] and is denoted as p.sub.1∈R.sup.d; this probability indicates the predicted occurrence probability of each type of reaction template. Furthermore, the ECFP4 semantic feature and the SMILES word one-hot semantic feature are passed through the dropout layer, the fully connected layer and Softmax, respectively, so as to output the probabilities of each node, which fall between [0,1] and are denoted as p.sub.2∈R.sup.d and p.sub.3∈R.sup.d, respectively. p.sub.2 and p.sub.3 indicate the occurrence probabilities of each type of reaction template predicted according to the ECFP4 semantic feature and the SMILES word one-hot semantic feature, respectively. d is the number of reaction templates in the reaction template set T.
[0086] In step S5.9, the final predicted result is obtained by the output layer according to the result of step S5.8.
[0087] Specifically, semantics refers to more abstract features after operations of the convolution layer, the normalization layer, the activation layer and the pooling layer. After passing through the dropout layer and the fully connected layer, the fused semantic feature passes through Softmax to output the probability of each node which falls between [0,1] and is denoted as p.sub.1∈R.sup.d. p.sub.1 indicates the occurrence probability of each type of reaction templates predicted according to the fused semantic feature. Furthermore, the ECFP4 semantic feature and the SMILES word one-hot semantic feature pass through the dropout layer, the fully connected layer and Softmax, respectively, so as to output the probabilities of each node which fall between [0, 1] and are denoted as p.sub.2∈R.sup.d and p.sub.3∈R.sup.d, respectively. p.sub.2 and p.sub.3 indicate the occurrence probabilities of each type of reaction templates predicted according to the ECFP4 semantic feature and SMILES word one-hot semantic feature respectively.
[0088] p.sub.1 is the final output result of the network, which refers to the occurrence probability of each type of template predicted according to the fused semantic features. The fused semantic feature is obtained by concatenating the ECFP4 semantic feature and the SMILES word one-hot semantic feature. The occurrence probabilities of each type of template obtained by learning the ECFP4 semantic feature and the word one-hot semantic feature are p.sub.2 and p.sub.3. In this way, during training of the model, the ability of the network to learn the ECFP4 semantic feature and the SMILES word one-hot semantic feature is enhanced, so as to obtain more abstract ECFP4 and SMILES word one-hot semantic features; the more abstract fused semantic feature is obtained by concatenating the two. That is to say, the network learns the fused semantic features and also learns the ECFP4 semantic feature and the SMILES word one-hot semantic feature. In this way, the ability of the ECFP4 semantic feature and the SMILES word one-hot semantic feature to express molecules can be enhanced, and the ability of the fused semantic features to express molecules can also be enhanced, thus improving the precision of the prediction result of the network.
[0089] Referring to
[0090] In the specific embodiment, in step S5, the multi-semantic network includes one input layer, six convolution layers, six normalization layers, six activation layers, six pooling layers, two dropout layers, three fully connected layers and three output layers.
[0091] Referring to
[0092] In step S5.2, the ECFP4 feature input at the node 1 is convolved by using 3 convolution kernels with a size of 1×4096, in which the number of output channels of the 3 convolution kernels is 100, so as to obtain the convolved ECFP4 feature. In step S5.3, the SMILES word one-hot matrix input at the node 2 is convolved by using 3 convolution kernels with sizes of 3×120, 4×120 and 5×120, respectively, in which the number of output channels of the 3 convolution kernels is 100, so as to obtain the convolved SMILES word one-hot matrix feature.
[0093] The ReLU activation function in step S5.5 is:
f(x)=max(0,x)
where x represents the input of a neuron. The function changes all negative values to 0 while keeping positive values unchanged; this unilateral inhibition enables neurons in the neural network to have sparse activation.
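As a quick check of the activation, a direct transcription of f(x) = max(0, x):

```python
def relu(x):
    """ReLU: negative inputs become 0, positive inputs pass through unchanged."""
    return max(0.0, x)

print([relu(v) for v in [-2.0, -0.5, 0.0, 1.5]])  # -> [0.0, 0.0, 0.0, 1.5]
```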
[0094] In step S5.8, the Softmax function is specifically defined as:
S.sub.i=e.sup.i/Σ.sub.je.sup.j
where e is a natural constant, Σ.sub.je.sup.j represents the sum of the powers of all neurons with e as the base and the neuron's output as the exponent, and S.sub.i represents the result of the i-th neuron passing through Softmax.
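A direct implementation of this Softmax normalization (the max-subtraction below is a standard numerical-stability detail, not stated in the disclosure):

```python
import math

def softmax(z):
    """Map raw neuron outputs z to probabilities in [0, 1] that sum to 1."""
    m = max(z)                           # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]

p = softmax([2.0, 1.0, 0.1])
print(round(sum(p), 6))  # -> 1.0
```

The subtraction of the maximum does not change the result, since it cancels between numerator and denominator.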
[0095] In an embodiment, in the training process of step S6, according to three classification results of the model, three cross entropy losses of the obtained model are denoted as loss.sub.1, loss.sub.2 and loss.sub.3, respectively, and a final loss of the single-step retrosynthesis prediction model is:
loss=α.sub.1loss.sub.1+α.sub.2loss.sub.2+α.sub.3loss.sub.3
where loss.sub.1, loss.sub.2 and loss.sub.3 represent a predicted loss of the fused semantic feature, a predicted loss of the ECFP4 semantic feature and a predicted loss of the SMILES word one-hot semantic feature, respectively, and α.sub.j (j=1,2,3) represent the weights of the three losses loss.sub.1, loss.sub.2 and loss.sub.3 in the global loss of the network, respectively, where Σα.sub.j=1 and α.sub.j∈(0,1).
[0096] Specifically, the loss function represents the difference between a prediction result and a real value. The embodiment proposes a new loss function, which not only evaluates the classification result (classification probability p.sub.1) obtained by fused semantic learning, but additionally evaluates the classification results (classification probabilities p.sub.2 and p.sub.3) of the ECFP4 semantic feature and the SMILES word one-hot semantic feature. By comprehensively evaluating the whole model with the weights α.sub.1, α.sub.2 and α.sub.3, the training precision of the model is improved. The value of α.sub.j lies in the open interval (0,1); the endpoints 0 and 1 are excluded.
[0097] In the specific implementation process, an Adam optimizer is used in the model. When the model is trained, three cross entropy losses are calculated according to the three output results of step S5.
[0098] The specific form of the cross entropy loss function loss.sub.j (j=1, 2, 3) is as follows:

loss.sub.j=−Σ.sub.iΣ.sub.c=1.sup.dy.sub.i,c log(p.sub.j,i,c)

where d is the total number of labels, that is, the size of the reaction template set T; y.sub.i,c is a binary indicator of whether the real label of sample i is c, that is, whether the predicted rule of sample i is the same as the real rule c, taking 1 when the real label of sample i is the same as c and 0 otherwise; and p.sub.j,i,c represents the j-th output probability that the network assigns to label c for sample i, that is, the j-th output probability that the network predicts sample i with rule c.
[0099] In the specific embodiment, in step S6, the number of model training epochs is set as 100, multiple iterations are performed in each epoch until all training samples have participated in training once, and the batch size, that is, the number of training samples participating in one iteration, is set as 128. The initial learning rate is set as 0.001.
[0100] The following specific examples illustrate and verify the method of the present disclosure.
[0101] Example 1: the publicly available chemical reaction data set USPTO-50k is preprocessed according to step S1; a reaction template set is constructed according to step S2; an ECFP4 feature and a SMILES word one-hot feature are constructed according to step S3; a set G is constructed according to step S4, and the set G is randomly divided into a training set, a verification set and a test set according to a ratio of 8:1:1. The training set and the verification set are used to train and select the model, and the test set is used to evaluate the prediction precision of the trained single-step chemical retrosynthesis prediction model. Table 1 shows the prediction performance of the single-step retrosynthesis prediction method based on the multi-semantic network of the present disclosure in single-step chemical retrosynthesis. At present, the top-1, top-3, top-5 and top-10 prediction precisions of the best experimental results in the field are 52.5%, 69.0%, 75.6% and 83.7%, respectively. The prediction precision of the model of the present disclosure is significantly higher than these current best results.
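The 8:1:1 random split described above can be sketched in plain Python (the function name and fixed seed are illustrative, not from the disclosure):

```python
import random

def split_811(samples, seed=0):
    """Randomly partition samples into train/validation/test in an 8:1:1 ratio."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)          # reproducible shuffle
    n = len(samples)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    train = [samples[i] for i in idx[:n_train]]
    val   = [samples[i] for i in idx[n_train:n_train + n_val]]
    test  = [samples[i] for i in idx[n_train + n_val:]]
    return train, val, test
```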
TABLE 1
Prediction performance of single-step chemical retrosynthesis of the multi-semantic network
  Top-1   Top-3   Top-5   Top-10
  61.8%   80.6%   85.1%   89.5%
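The top-k precision reported in Table 1 counts a prediction as correct when the true reaction template appears among the model's k highest-ranked templates; a minimal sketch of the metric (names are illustrative):

```python
def top_k_accuracy(ranked_preds, true_labels, k):
    """ranked_preds[i] lists predicted templates for sample i, best first."""
    hits = sum(1 for preds, t in zip(ranked_preds, true_labels) if t in preds[:k])
    return hits / len(true_labels)

# Two samples: the first is right at rank 1, the second only at rank 3.
ranked = [["t1", "t2", "t3"], ["t2", "t1", "t3"]]
truth = ["t1", "t3"]
print(top_k_accuracy(ranked, truth, 1))  # 0.5
print(top_k_accuracy(ranked, truth, 3))  # 1.0
```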
[0102] Example 2: the publicly available metabolic reaction data set MetaNetX is preprocessed according to step S1; a reaction template set is constructed according to step S2; an ECFP4 feature and a SMILES word one-hot feature are constructed according to step S3; a set G is constructed according to step S4, and the set G is randomly divided into a training set, a verification set and a test set according to a ratio of 8:1:1. The training set and the verification set are used to train and select the model, and the test set is used to evaluate the prediction precision of the trained single-step retrosynthesis prediction model. Table 2 shows the prediction performance of the single-step retrosynthesis prediction method based on the multi-semantic network according to the present disclosure in single-step bioretrosynthesis. There are few existing studies on single-step bioretrosynthesis, and the model of the present disclosure can predict single-step bioretrosynthesis without matching enzyme information.
TABLE 2
Prediction performance of single-step bioretrosynthesis of the multi-semantic network
  Top-1   Top-3   Top-5   Top-10
  47.0%   66.4%   73.4%   79.8%
[0103] Compared with the prior art, the embodiments of the present disclosure have the following beneficial effects.
[0104] 1. The single-step retrosynthesis prediction method is the first method for performing single-step retrosynthesis prediction by using a multi-semantic fusion network in the field of single-step retrosynthesis, and is a template-based single-step retrosynthesis method. The prediction result has relatively high interpretability.
[0105] 2. The present disclosure designs a new loss function, which can improve the training precision of the model by comprehensively evaluating three prediction results of learning of the multi-semantic network.
[0106] 3. The present disclosure designs a semantic extraction method, which can extract deep semantic information of the ECFP4 feature and the SMILES word one-hot feature of target product molecules.
[0107] 4. The present disclosure designs a construction method of a SMILES word one-hot feature, which can contain more potential information.
[0108] 5. The single-step retrosynthesis prediction model of the present disclosure can be used as both single-step chemical retrosynthesis prediction and single-step bioretrosynthesis prediction, and matching operation of enzyme information is not required when the single-step bioretrosynthesis prediction is carried out.
Embodiment 2
[0109] Based on the same inventive idea, the embodiment provides a single-step retrosynthesis system based on a multi-semantic network, which includes a data set preprocessing module, a reaction template set constructing module, a feature constructing module, a sample set constructing module, a multi-semantic network constructing module, a multi-semantic network training module and a single-step retrosynthesis predicting module.
[0110] The data set preprocessing module is configured to acquire a public data set and preprocess the public data set to obtain a preprocessed data set D, each piece of data in the data set D corresponds to one specific reaction, and each piece of data includes a reaction, a reactant molecule and a product molecule.
[0111] The reaction template set constructing module is configured to construct a reaction template set T.
[0112] The feature constructing module is configured to obtain an ECFP4 feature set E of the product molecules represented by ECFP4 vectors and a SMILES word one-hot feature set S of the product molecules represented by a SMILES word one-hot matrix, respectively, according to the product molecules in the data set D.
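The one-hot matrix built by the feature constructing module can be sketched as follows. This is a simplified character-level illustration; the disclosure's actual SMILES word tokenization and vocabulary are not reproduced here, and all names are hypothetical:

```python
def smiles_one_hot(smiles, vocab, max_len):
    """One row per sequence position, one column per vocabulary token;
    positions beyond the SMILES length remain all-zero padding."""
    tok_index = {t: i for i, t in enumerate(vocab)}
    mat = [[0] * len(vocab) for _ in range(max_len)]
    for pos, ch in enumerate(smiles[:max_len]):
        mat[pos][tok_index[ch]] = 1
    return mat

# Acetic acid "CC(=O)O" over a toy 5-token vocabulary, padded to length 10.
vocab = ["C", "(", ")", "=", "O"]
mat = smiles_one_hot("CC(=O)O", vocab, 10)
```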
[0113] The sample set constructing module is configured to construct a sample set G={(e.sub.i, s.sub.i), t.sub.i}.sub.i=1.sup.N, where e.sub.i∈E and s.sub.i∈S represent the ECFP4 feature and the SMILES word one-hot feature of the product molecule in the i-th data of the data set D, respectively, t.sub.i∈T represents a reaction template in the i-th data of the data set D, and N represents the number of samples in the sample set.
[0114] The multi-semantic network constructing module is configured to construct a multi-semantic network, the multi-semantic network includes an input layer, a convolution layer, a normalization layer, an activation layer, a pooling layer, a dropout layer, a fully connected layer and an output layer. The convolution layer is configured to convolve the input data, the normalization layer is configured to normalize the convolved feature, the activation layer is configured to activate the normalized feature, and the pooling layer performs pooling operation on the ECFP4 feature and the SMILES word one-hot feature, respectively, so as to obtain the ECFP4 semantic feature and the SMILES word one-hot semantic feature; fuse the ECFP4 semantic feature with the SMILES word one-hot semantic feature to obtain the fused semantic features; and then, pass the fused semantic features through the dropout layer, the fully connected layer and Softmax, to obtain the final output result by the output layer.
[0115] The multi-semantic network training module is configured to train the multi-semantic network in the multi-semantic network constructing module by using the sample set of the sample set constructing module to obtain a trained single-step retrosynthesis prediction model.
[0116] The single-step retrosynthesis predicting module is configured to, for a target product molecule to be predicted, use the single-step retrosynthesis prediction model trained in the multi-semantic network training module to predict a reaction template capable of generating the target product molecule, and finally calculate the SMILES string of the reactant molecule corresponding to the target product molecule in combination with the SMILES string of the target product molecule, thereby realizing single-step retrosynthesis prediction.
[0117] Referring to
[0118] Generally speaking, the data set preprocessing module is configured to preprocess the data set to obtain the processed data set. The reaction template set constructing module is configured to generate a reaction template set based on the preprocessed data set. The feature constructing module is configured to generate an ECFP4 feature and a SMILES word one-hot feature of a product molecule according to the product molecule. The sample set constructing module is configured to generate a sample set consisting of the ECFP4 feature, the SMILES word one-hot feature and the reaction template. The multi-semantic network constructing module is configured to construct a multi-semantic network for single-step retrosynthesis prediction. The multi-semantic network training module is configured to train the multi-semantic network by using the data in the sample set to obtain a trained single-step retrosynthesis prediction model of the multi-semantic network. The single-step retrosynthesis predicting module is configured to perform single-step retrosynthesis prediction on new target product molecules by using the multi-semantic network model.
[0119] Because the system introduced in Embodiment 2 of the present disclosure is the system used to implement the single-step retrosynthesis method based on a multi-semantic network in Embodiment 1 of the present disclosure, those skilled in the art can understand the specific structure of the system based on the method introduced in Embodiment 1 of the present disclosure, which will not be described in detail herein. All systems used in the method of Embodiment 1 of the present disclosure fall within the scope of protection of the present disclosure.
[0120] It should be understood that the above description of the preferred embodiments is relatively detailed and should not be considered as a limitation on the protection scope of the present disclosure. Under the inspiration of the present disclosure, those skilled in the art can also make substitutions or modifications without departing from the protection scope claimed by the claims of the present disclosure, all of which fall within the protection scope of the present disclosure. The claimed protection scope of the present disclosure shall be subject to the appended claims.