MOLECULAR GRAPH REPRESENTATION LEARNING METHOD BASED ON CONTRASTIVE LEARNING

Abstract

The present invention provides a molecular graph representation learning method based on contrastive learning, the method comprising: obtaining a molecular fingerprint representation of each molecule, and calculating a similarity between each pair of molecular fingerprints; collecting a full amount of chemical functional group information, and matching a corresponding functional group for each atom in the molecule; using a heterogeneous graph to model a molecular graph; using a relational graph convolutional network (RGCN) in a structure-aware molecular encoder to encode the representation of each atom in the molecule and the representation of the functional group to which the atom belongs, and mapping the molecule to a feature space through an aggregation function to obtain a structure-aware feature representation; according to the fingerprint similarity between molecules, selecting positive and negative samples, and carrying out contrastive learning in the feature space; obtaining the structure-aware molecular encoder by training with the contrastive learning method on a large-sample molecular dataset, and applying the structure-aware molecular encoder to a downstream molecular attribute prediction task. The present invention helps to capture richer molecular structure information and to address the molecular property prediction problem.

Claims

1. A molecular graph representation learning method based on contrastive learning, wherein the method comprises the following steps: (1) obtaining a molecular fingerprint representation of each molecule, and calculating a similarity between each pair of molecular fingerprints; (2) collecting a full amount of chemical functional group information, and matching a corresponding functional group for each atom in the molecule; wherein, when an atom belongs to a plurality of functional groups, a functional group containing a larger number of atoms is preferentially matched as the functional group corresponding to the atom; (3) using a heterogeneous graph to model a molecular graph, wherein the heterogeneous graph is a graph containing different types of nodes and edges, different atoms correspond to different node types, and different bonds correspond to different edge types; (4) constructing a structure-aware molecular encoder, using a relational graph convolutional network (RGCN) in the structure-aware molecular encoder to encode the representation of each atom in the molecule and the representation of the functional group to which the atom belongs, and mapping the molecule to a feature space through an aggregation function to obtain a structure-aware feature representation, wherein the specific process is as follows: taking the heterogeneous graph with initialized node features and functional group features as an input of the structure-aware molecular encoder, transferring information by the relational graph convolutional network (RGCN) in the structure-aware molecular encoder through calculating and aggregating information for different types of edges, and integrating the information aggregated by different edges for different types of nodes; after obtaining the feature representation of each atom and the functional group that the atom belongs to, aggregating the features of the nodes and the functional groups to obtain the structure-aware feature representation of the molecule; wherein a formula for the information transfer of the relational graph convolutional network (RGCN) is as follows: $h_{i}^{l+1} = \sigma\left(\sum_{r \in R} \sum_{j \in N_{i}^{r}} \frac{1}{c_{i,r}} W_{r}^{l} h_{j}^{l} + W_{0}^{l} h_{i}^{l}\right)$ wherein R is the set of all edge types, $N_{i}^{r}$ is the set of all neighbor nodes adjacent to the node i through edges of type r, $c_{i,r}$ is a learnable normalization parameter, $W_{r}^{l}$ is the weight matrix for edge type r at the current layer l, $W_{0}^{l}$ is the weight matrix of the self-loop edge at the current layer l, and $h_{i}^{l}$ is the feature vector of the current node i at the current layer l; the feature of each neighbor node is multiplied by the weight matrix corresponding to the edge type and scaled by the learnable parameter, the results are summed, and finally the information transferred by the self-loop edge is added and the sum is passed through the activation function σ, the result being used as an output of the layer and an input of a next layer; (5) according to the fingerprint similarity between molecules, selecting positive and negative samples, and carrying out contrastive learning in the feature space; (6) obtaining the structure-aware molecular encoder by training with the contrastive learning method on a large-sample molecular dataset, and applying the structure-aware molecular encoder to a downstream molecular attribute prediction task.

2. The molecular graph representation learning method based on contrastive learning according to claim 1, wherein in step (1), a Simplified Molecular Input Line Entry System (SMILES) representation of each molecule is transformed into the molecular fingerprint through RDKit; the molecular fingerprint is one of a Morgan fingerprint, a Molecular ACCess System (MACCS) fingerprint and a topological fingerprint.

3. The molecular graph representation learning method based on contrastive learning according to claim 2, wherein in step (1), the similarity between two molecular fingerprints is calculated using a Tanimoto coefficient, and the formula is as follows: $S_{AB} = \frac{c}{a + b - c}$ wherein the MACCS fingerprint pre-specifies 166 partial molecular structures; when a molecule contains a given partial structure, the corresponding position is recorded as 1, otherwise it is recorded as 0; a and b respectively represent the number of positions set to 1 in the fingerprints of molecules A and B, and c represents the number of positions set to 1 in both fingerprints.

4. The molecular graph representation learning method based on contrastive learning according to claim 1, wherein, in step (5), when selecting the positive and negative samples, one molecule whose similarity to a target molecule is greater than a certain threshold is selected as the positive sample, and K molecules whose similarities to the target molecule are each less than a certain threshold are selected as the negative samples; a feature representation corresponding to the target molecule is denoted as $q$, a feature representation of the positive sample is denoted as $k_{0}$, and the feature representations of the K negative samples are denoted as $k_{1}, \ldots, k_{K}$.

5. The molecular graph representation learning method based on contrastive learning according to claim 4, wherein, after obtaining the feature representations of each target molecule and the positive and negative samples thereof, a loss is calculated by using a loss function, and the parameters of the structure-aware molecular encoder are updated through a back-propagation algorithm, which causes the structure-aware molecular encoder to recognize the target molecule and the positive samples as similar instances and distinguish the target molecule and the positive samples from dissimilar samples.

6. The molecular graph representation learning method based on contrastive learning according to claim 5, wherein the loss function is InfoNCE, and the formula is as follows: $L = - \log \frac{\exp (q^{T} k_{0} / \tau)}{\sum_{i=0}^{K} \exp (q^{T} k_{i} / \tau)}$ wherein τ is a temperature hyperparameter; the loss function causes the model to identify the target molecule $q$ and the positive sample $k_{0}$ as similar instances, and to distinguish $q$ from the dissimilar instances $k_{1}, \ldots, k_{K}$.

7. The molecular graph representation learning method based on contrastive learning according to claim 1, wherein the specific process of step (6) is as follows: training the structure-aware molecular encoder on the large-sample molecular dataset through the contrastive learning method described in step (5); then inputting molecular data from a small-sample dataset into the structure-aware molecular encoder, using a linear classifier to classify the molecular representations output by the encoder, and predicting the molecular attributes.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0039] In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the following briefly introduces the accompanying drawings used in the description of the embodiments or the prior art. Obviously, the drawings in the following description are only some embodiments of the present invention. For those of ordinary skill in the art, other drawings can also be obtained from these drawings without creative efforts.

[0040] FIG. 1 is a schematic flowchart of a molecular graph representation learning method based on contrastive learning provided by an embodiment of the present invention;

[0041] FIG. 2 is a schematic structural diagram of a structure-aware molecular encoder provided by an embodiment of the present invention.

DETAILED DESCRIPTION OF THE EMBODIMENTS

[0042] The present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be pointed out that the following embodiments are intended to facilitate the understanding of the present invention, but do not have any limiting effect on it.

[0043] The molecular graph representation learning method based on contrastive learning provided by the present invention can be used in application scenarios such as chemical molecule attribute prediction and virtual screening. The method selects positive and negative samples based on the similarity of molecular fingerprints, contrasts them with the target molecule in a feature space, and directly encodes functional group knowledge from the chemical domain into the representations of the molecules, so as to obtain a discriminative molecular graph representation that carries chemical domain knowledge. The present invention addresses the problem of insufficient labeled data in supervised learning by making full use of the structure and characteristics of the molecular graph data itself to construct labels.

[0044] As shown in FIG. 1, the molecular graph representation learning method based on contrastive learning comprises the following steps:

[0045] Firstly, transforming the SMILES representation of each molecule into a molecular fingerprint by RDKit, an open-source cheminformatics toolkit. For each molecule, after calculating the fingerprint similarities between it and all other molecules using a Tanimoto coefficient, selecting one molecule whose similarity to the molecule is greater than a certain threshold as a positive sample, and selecting K molecules whose similarities to the molecule are less than a certain threshold as negative samples.
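The similarity calculation and sample selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: fingerprints are represented as plain Python sets of on-bit positions instead of RDKit fingerprint objects, and the names `tanimoto`, `select_samples`, `pos_threshold` and `neg_threshold`, as well as the toy fingerprints and threshold values, are hypothetical.

```python
import random


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient c / (a + b - c) for two sets of on-bit positions."""
    a, b = len(fp_a), len(fp_b)
    c = len(fp_a & fp_b)  # positions set to 1 in both fingerprints
    denom = a + b - c
    return c / denom if denom else 0.0


def select_samples(target, fingerprints, pos_threshold, neg_threshold, k, rng=None):
    """Pick one positive (similarity > pos_threshold) and k negatives
    (similarity < neg_threshold) for the target molecule.

    Returns None when there are not enough candidates.
    """
    rng = rng or random.Random(0)
    sims = {
        name: tanimoto(fingerprints[target], fp)
        for name, fp in fingerprints.items()
        if name != target
    }
    positives = [name for name, s in sims.items() if s > pos_threshold]
    negatives = [name for name, s in sims.items() if s < neg_threshold]
    if not positives or len(negatives) < k:
        return None
    return rng.choice(positives), rng.sample(negatives, k)


# Toy fingerprints (hypothetical bit positions, not real MACCS keys).
toy_fps = {"A": {1, 2, 3, 4}, "B": {1, 2, 3, 5}, "C": {7, 8}, "D": {1, 9, 10}}
# Threshold values are arbitrary placeholders chosen for this example.
positive, negatives = select_samples("A", toy_fps, 0.5, 0.2, k=2)
```

In this toy set, molecule B shares three of five distinct bits with A, so it is the only candidate above the positive threshold, while C and D fall below the negative threshold.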

[0046] Modeling the target molecule and its corresponding positive and negative samples by using a heterogeneous graph, which aims to characterize the different attributes of each node and edge. Inputting the sample data of the molecules into the structure-aware molecular encoder shown in FIG. 2, and obtaining the feature representations corresponding to the target sample and the positive and negative samples. Denoting the feature representation corresponding to the target molecule as $q$, denoting the feature representation of the positive sample as $k_{0}$, and denoting the feature representations of the K negative samples as $k_{1}, \ldots, k_{K}$.
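One possible way to hold such a heterogeneous graph in memory is sketched below, with atoms as typed nodes and bonds grouped by bond type. The class name `HeteroMolGraph`, its fields, and the bond-type labels are hypothetical and chosen for illustration only; the atom and bond lists are written by hand rather than parsed from SMILES.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class HeteroMolGraph:
    # node_types[i] is the element symbol of atom i (the node type).
    node_types: list
    # edges maps a bond type (the edge type) to a list of (src, dst) pairs.
    edges: dict = field(default_factory=lambda: defaultdict(list))

    def add_bond(self, i, j, bond_type):
        # Store both directions so information can flow either way.
        self.edges[bond_type].append((i, j))
        self.edges[bond_type].append((j, i))

    def neighbors(self, i, bond_type):
        """Neighbors of node i connected through edges of the given type."""
        return [dst for src, dst in self.edges[bond_type] if src == i]


# Heavy atoms of acetic acid, CC(=O)O: C0-C1 single, C1=O2 double, C1-O3 single.
acetic_acid = HeteroMolGraph(node_types=["C", "C", "O", "O"])
acetic_acid.add_bond(0, 1, "SINGLE")
acetic_acid.add_bond(1, 2, "DOUBLE")
acetic_acid.add_bond(1, 3, "SINGLE")
```

Grouping edges by bond type makes the per-edge-type neighbor sets directly available, which is the form the RGCN information transfer operates on.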

[0047] Taking InfoNCE as a loss function, the parameters of the model are updated through a back-propagation algorithm, which encourages the model to identify the target molecule and positive samples as similar instances, and at the same time distinguish them from dissimilar instances to learn discriminative structure-aware molecular feature representation.

[0048] The loss function described is InfoNCE, and the formula is as follows:

[00004] $L = - \log \frac{\exp (q^{T} k_{0} / \tau)}{\sum_{i=0}^{K} \exp (q^{T} k_{i} / \tau)}$

[0049] wherein τ is a temperature hyperparameter; the loss function causes the model to identify the target molecule $q$ and the positive sample $k_{0}$ as similar instances, and to distinguish $q$ from the dissimilar instances $k_{1}, \ldots, k_{K}$.
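The InfoNCE formula can be evaluated for a single target molecule as in the following sketch. It uses plain Python lists as feature vectors and makes two assumptions that are not stated in the description: the vectors are used as given, without normalization, and the log-sum-exp is computed in a numerically stable way. The function names `dot` and `info_nce` are hypothetical.

```python
import math


def dot(u, v):
    """Inner product q^T k of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def info_nce(q, keys, tau):
    """InfoNCE loss for one target.

    keys[0] is the positive sample k_0 and keys[1:] are the K negatives
    k_1..k_K; tau is the temperature hyperparameter.
    """
    logits = [dot(q, k) / tau for k in keys]
    m = max(logits)  # subtract the maximum for numerical stability
    log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_denom - logits[0]  # equals -log(exp(l_0) / sum_i exp(l_i))


# Toy vectors: the positive is aligned with q, the negatives are not.
q = [1.0, 0.0]
loss_aligned = info_nce(q, [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]], tau=1.0)
loss_misaligned = info_nce(q, [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]], tau=1.0)
```

The loss is smaller when the positive sample is closer to the target than the negatives are, which is the behavior the back-propagation update exploits.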

[0050] FIG. 2 is a schematic diagram of the structure-aware graph neural network provided by an embodiment of the present invention. Modeling the molecules by using the heterogeneous graph with initialized node features and functional group features, and characterizing the different attributes of each node and edge. Taking the heterogeneous graph as an input of the structure-aware molecular encoder, then calculating and aggregating information for different types of edges by utilizing the RGCN, and integrating the information aggregated by different edges for different types of nodes to transfer information. The RGCN takes the type of each edge into account, and in order to transfer the features of the nodes in a previous layer to a next layer, the RGCN adds a special self-loop edge for each node. The specific information transfer process is as follows:

[00005] $h_{i}^{l+1} = \sigma\left(\sum_{r \in R} \sum_{j \in N_{i}^{r}} \frac{1}{c_{i,r}} W_{r}^{l} h_{j}^{l} + W_{0}^{l} h_{i}^{l}\right)$

[0051] wherein R is the set of all edge types, $N_{i}^{r}$ is the set of all neighbor nodes adjacent to the node i through edges of type r, $c_{i,r}$ is a learnable normalization parameter, $W_{r}^{l}$ is the weight matrix for edge type r at the current layer l, $W_{0}^{l}$ is the weight matrix of the self-loop edge at the current layer l, and $h_{i}^{l}$ is the feature vector of the current node i at the current layer l. The feature of each neighbor node is multiplied by the weight matrix corresponding to the edge type and scaled by the learnable parameter, and the results are summed; finally, the information transferred by the self-loop edge is added and the sum is passed through the activation function σ, the result being used as an output of the layer and an input of a next layer.
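A single information transfer step of this form can be sketched in plain Python as follows. The sketch deliberately simplifies one point: instead of learning $c_{i,r}$, it fixes $c_{i,r}$ to the number of neighbors of node i under edge type r, which is an assumption made here to keep the example short. ReLU is likewise assumed for the activation function σ, and the names `matvec`, `relu` and `rgcn_layer` are hypothetical.

```python
def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def relu(v):
    return [max(0.0, x) for x in v]


def rgcn_layer(h, neighbors, W_rel, W_self, activation=relu):
    """One RGCN information transfer step.

    h         : list of node feature vectors, h[i] is h_i^l
    neighbors : {edge_type: {i: [j, ...]}}, the neighbor sets N_i^r
    W_rel     : {edge_type: weight matrix W_r^l}
    W_self    : self-loop weight matrix W_0^l
    """
    out = []
    for i, h_i in enumerate(h):
        total = matvec(W_self, h_i)  # self-loop term W_0^l h_i^l
        for r, W_r in W_rel.items():
            nbrs = neighbors.get(r, {}).get(i, [])
            if not nbrs:
                continue
            c = len(nbrs)  # assumption: c_{i,r} = |N_i^r| rather than learned
            for j in nbrs:
                msg = matvec(W_r, h[j])
                total = [t + m / c for t, m in zip(total, msg)]
        out.append(activation(total))
    return out


# Three-node chain 0 - 1 - 2 with a single edge type and identity weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
h0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
nbrs = {"SINGLE": {0: [1], 1: [0, 2], 2: [1]}}
h1 = rgcn_layer(h0, nbrs, {"SINGLE": identity}, identity)
```

With identity weights, each output is the node's own feature plus the mean of its neighbors' features, so the effect of the normalization and the self-loop term can be checked by hand.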

[0052] After obtaining the feature representation of each atom and of the functional group that the atom belongs to by the RGCN, aggregating the features of the nodes and the functional groups by an aggregation function to obtain the structure-aware feature representation of the molecule.
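The description does not specify the aggregation function. Purely as an illustration, the sketch below mean-pools the atom features and the functional group features separately and concatenates the two results; this choice, and the names `mean_pool` and `readout`, are assumptions and not the aggregation used by the embodiment.

```python
def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def readout(atom_feats, group_feats):
    """Hypothetical readout: mean-pool atoms and functional groups
    separately, then concatenate into one molecule-level vector."""
    return mean_pool(atom_feats) + mean_pool(group_feats)


# Two atoms with 2-d features and one functional group with 2-d features.
molecule_vec = readout([[1.0, 3.0], [3.0, 5.0]], [[2.0, 2.0]])
```

Any permutation-invariant function, such as a sum or a maximum, could take the place of the mean in this sketch.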

[0053] The above-mentioned embodiments describe the technical solutions and beneficial effects of the present invention in detail. It should be understood that the above-mentioned embodiments are only specific embodiments of the present invention and are not intended to limit the present invention. Any modifications, additions and equivalent substitutions made within the scope of the principles of the present invention shall be included within the protection scope of the present invention.

CPC classification

G06N3/0464; G06N3/084; G06N3/08; G06N3/0985; G06N3/0895; G16C20/70; G06N3/042

International classification

G06N3/08