Graph neural network systems for generating structured representations of objects
11704541 · 2023-07-18
CPC classification
G06F17/18
PHYSICS
International classification
G06F17/18
PHYSICS
Abstract
There is described a neural network system for generating a graph, the graph comprising a set of nodes and edges. The system comprises one or more neural networks configured to represent a probability distribution over sequences of node generating decisions and/or edge generating decisions, and one or more computers configured to sample the probability distribution represented by the one or more neural networks to generate a graph.
Claims
1. A system comprising one or more computers and one or more storage devices storing instructions that when executed by one or more computers cause the one or more computers to perform operations for generating a graph, the graph comprising a set of nodes and edges, the operations comprising: generating, by the one or more computers and using a trained neural network system comprising one or more neural networks, an output representing a probability distribution over sequences that each include one or more node generating decisions, one or more edge generating decisions, or both, wherein the trained neural network system has been trained on a training dataset comprising a plurality of graphs that each comprise a respective plurality of nodes and edges; and sampling the probability distribution represented by the output of the trained neural network system comprising the one or more neural networks to generate the graph.
2. The system of claim 1, wherein the one or more neural networks comprises: a node creation neural network configured to receive as input a graph and output one or more probabilities of adding a new node to the graph; an edge addition neural network configured to receive as input a graph and an indication of a candidate node and output one or more probabilities of adding an edge connected to the candidate node to the graph; a node selection neural network configured to receive as input a graph and an indication of a candidate node and output one or more probabilities of adding an edge to the graph between the candidate node and each of the nodes of the graph; and wherein sampling the probability distribution further comprises: generating a node of the graph based upon the output of the node creation neural network; and generating an edge of the graph based upon the output of the edge addition neural network and the node selection neural network.
3. The system of claim 2, wherein sampling the probability distribution comprises performing an iterative process of generating a node and generating an edge and wherein, at each iteration of the iterative process, generating an edge occurs subsequent to generating a node.
4. The system of claim 1, wherein each node and each edge is sequentially generated one at a time.
5. The system of claim 2, wherein generating a node of the graph further comprises: receiving an initial graph; providing the initial graph to the node creation neural network; receiving an output from the node creation neural network; determining whether a new node of the graph is to be generated based upon the output of the node creation neural network; responsive to determining that a new node of the graph is to be generated: generating the new node; and generating an updated graph by updating the initial graph to include the new node.
6. The system of claim 5, wherein generating an edge of the graph further comprises: providing the updated graph and an indication of the new node to the edge addition neural network; receiving an output from the edge addition neural network; determining whether an edge connected to the new node is to be generated based upon the output of the edge addition neural network; responsive to determining that an edge is to be generated: providing the updated graph and an indication of the new node to the node selection neural network; receiving an output from the node selection neural network; selecting a node of the graph based upon the output of the node selection neural network; and updating the graph to include an edge between the new node and the selected node.
7. The system of claim 6, wherein generating an edge of the graph further comprises: determining whether to generate a further edge connected to the new node.
8. The system of claim 1, wherein each node of the graph is associated with information and wherein generating the output representing the probability distribution comprises propagating information between neighbouring nodes to provide a node with information from the local neighbourhood of the node.
9. The system of claim 8, wherein propagating information comprises performing a plurality of rounds of information propagation.
10. The system of claim 8, wherein the information associated with a node is encoded as a state vector.
11. The system of claim 10, wherein propagating information between neighbouring nodes comprises generating a message vector associated with an edge connecting the neighbouring nodes based upon the state vectors associated with the neighbouring nodes.
12. The system of claim 11, wherein the message vector is further based upon a feature vector associated with the edge.
13. The system of claim 12, wherein the feature vector is based upon an edge type.
14. The system of claim 11, wherein the message vector is generated using a respective one of the one or more neural networks in the neural network system.
15. The system of claim 11, wherein propagating information to a node further comprises updating a state vector of the node based upon an aggregation of one or more message vectors associated with one or more edges connected to the node.
16. The system of claim 15, wherein the aggregation of one or more message vectors is an aggregation of the message vectors associated with the incoming edges connected to the node.
17. The system of claim 15, wherein the aggregation of one or more message vectors is an aggregation of the message vectors associated with each of the edges connected to the node.
18. The system of claim 15, wherein updating a state vector of a node is based upon the output of a respective one of the one or more neural networks that is configured to receive the aggregation of message vectors and the current state vector of the node as input.
19. The system of claim 10, wherein the graph is associated with a graph state vector, wherein the graph state vector is based upon an aggregation of a set of node state vectors.
20. The system of claim 19, wherein aggregating a set of node state vectors comprises: for each node state vector of the set, generating a modified node state vector by a respective one of the one or more neural networks that is configured to receive a node state vector as input; and aggregating the set of modified node state vectors.
21. The system of claim 20, wherein the modified node state vector is of a higher dimensionality than the unmodified node state vector.
22. The system of claim 20, wherein aggregating the set of modified node state vectors is based upon a gated sum.
23. The system of claim 9, wherein propagating information comprises: performing at least one round of propagating information between neighbouring nodes; determining a graph state vector associated with the graph; wherein the one or more neural networks comprise a node creation neural network and wherein the node creation neural network is configured to: process the graph state vector to determine one or more probabilities of adding a new node to the graph.
24. The system of claim 23, wherein the one or more probabilities comprises a respective probability for each of a plurality of node types.
25. The system of claim 10, wherein propagating information comprises: performing at least one round of propagating information between neighbouring nodes; determining a graph state vector associated with the graph; wherein the one or more neural networks comprise an edge addition neural network and wherein the edge addition neural network is configured to: process the graph state vector and the state vector associated with a candidate node to determine one or more probabilities of adding an edge connected to the candidate node to the graph.
26. The system of claim 10, wherein propagating information comprises: performing at least one round of propagating information between neighbouring nodes; wherein the one or more neural networks comprise a node selection neural network and wherein the node selection neural network is configured to: determine a plurality of probabilistic scores for adding an edge between a candidate pair of nodes, the candidate pair of nodes comprising a candidate node and a particular node of the graph, based upon processing, by a neural network, the state vectors associated with the candidate node and the particular node.
27. The system of claim 26, wherein the plurality of probabilistic scores comprises a respective score for each of a plurality of edge types for connecting the candidate node and the particular node.
28. The system of claim 10, wherein the one or more neural networks comprises a node initialization neural network configured to initialize the state vector of a newly generated node, wherein initializing the state vector of a newly generated node comprises: processing, by the node initialization neural network, a graph state vector and a feature vector associated with the newly generated node to determine the initial state vector.
29. The system of claim 28, wherein the feature vector associated with the node is based upon a node type associated with the node.
30. The system of claim 1, wherein the probability distribution represented by the output of the neural network system comprising the one or more neural networks is determined based upon a conditional probability distribution.
31. The system of claim 30, wherein the conditional probability distribution is conditioned on data retrieved from a memory, the data retrieval based upon a memory look-up using a representation of at least a portion of the graph.
32. The system of claim 1, wherein the one or more neural networks is trained based upon maximizing an expected log-likelihood of the training dataset of graphs.
33. The system of claim 1, wherein each graph in the training dataset is associated with an ordering of nodes and edges of the graph.
34. The system of claim 1, wherein the operations further comprise: receiving data specifying a second graph; and evaluating the second graph based upon a probability distribution represented by an output generated for the second graph by the neural network system that comprises the one or more neural networks.
35. The system of claim 1, wherein the graph comprises a cycle.
36. The system of claim 1, wherein the graph represents the structure of a molecule, each respective node of the graph representing an atom of the molecule and each respective edge of the graph representing a chemical bond between atoms of the molecule.
37. The system of claim 36, wherein the one or more neural networks is trained based upon a metric comprising a chemical property.
38. The system of claim 1, wherein generating the output representing the probability distribution comprises, subsequent to completion of training of the one or more neural networks, adjusting an adjustable bias to alter a property associated with the graph.
39. A method performed by one or more computers and for generating a graph, the graph comprising a set of nodes and edges, the method comprising: generating, by the one or more computers and using a trained neural network system comprising one or more neural networks, an output representing a probability distribution over sequences that each include one or more node generating decisions, one or more edge generating decisions, or both, wherein the trained neural network system has been trained on a training dataset comprising a plurality of graphs that each comprise a respective plurality of nodes and edges; and sampling the probability distribution represented by the output of the trained neural network system comprising the one or more neural networks to generate the graph.
40. The method of claim 39, wherein the one or more neural networks comprises: a node creation neural network configured to receive as input a graph and output one or more probabilities of adding a new node to the graph; an edge addition neural network configured to receive as input a graph and an indication of a candidate node and output one or more probabilities of adding an edge connected to the candidate node to the graph; a node selection neural network configured to receive as input a graph and an indication of a candidate node and output one or more probabilities of adding an edge to the graph between the candidate node and each of the nodes of the graph; and wherein sampling the probability distribution further comprises: generating a node of the graph based upon the output of the node creation neural network; and generating an edge of the graph based upon the output of the edge addition neural network and the node selection neural network.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION
(8) The system 100 comprises one or more neural networks configured to represent a probability distribution over sequences of node generating and edge generating decisions. In the example system 100 of
(9) The node creation neural network 101 is configured to receive as input a graph 105 and to process the input graph 105 to output one or more probabilities 106 of adding a new node to the input graph 105. That is, the node creation neural network 101 is configured to represent a probability distribution for determining whether to add a new node to a particular graph. The input graph 105 is received from the graph engine 104 and may be a graph that is under construction by the graph engine 104. The one or more probabilities 106 of adding a new node to the input graph 105 is received by the graph engine 104 and the graph engine 104 is configured to determine whether to add a new node to the graph 105 based upon the one or more probabilities 106 of adding a new node. Where nodes are typed, the one or more probabilities 106 may comprise a probability of adding a new node of each particular type of node. The node creation neural network 101 may comprise a final softmax layer in order to provide outputs as probabilities.
(10) The input graph 105 may have an associated graph state vector and the node creation neural network 101 may be configured to process the graph state vector to determine the one or more probabilities 106 of adding a new node. Such processing is described in more detail below.
(11) The edge addition neural network 102 is configured to receive the input graph 105 and an indication of a candidate node 107 of the input graph 105. The edge addition neural network is further configured to process the input graph 105 and the indication of the candidate node 107 to output one or more probabilities of adding an edge 108 connected to the candidate node 107. That is, the edge addition neural network 102 is configured to represent a probability distribution for determining whether to add an edge to a node of a particular graph. The one or more probabilities 108 of adding an edge to the input graph 105 connected to the candidate node 107 are received by the graph engine 104. The graph engine 104 is configured to determine whether to add an edge connected to the candidate node 107 to the graph 105 based upon the one or more probabilities 108 of adding an edge connected to the candidate node 107. Where edges are typed, the one or more probabilities 108 may comprise a probability of adding a new edge of each particular type of edge. The edge addition neural network 102 may comprise a final softmax layer in order to provide outputs as probabilities or may comprise a final layer having a sigmoid activation function if only a single probability is required.
(12) The edge addition neural network 102 may be configured to process a graph state representation associated with the input graph 105 and a node state vector associated with the candidate node 107 to output the one or more probabilities 108 of adding an edge connected to the candidate node 107. Such processing is described in more detail below.
(13) The node selection neural network 103 is configured to receive the input graph 105 and an indication of the candidate node 107. The node selection neural network 103 is further configured to process the input graph 105 and the indication of the candidate node 107 to output one or more probabilities 109 of adding an edge to the graph between the candidate node 107 and each of the nodes of the graph 105. That is, the node selection neural network 103 is configured to represent a probability distribution for determining which node of a particular graph a new edge from a candidate node is to be connected to. The one or more probabilities 109 of adding an edge to the graph 105 between the candidate node 107 and each of the nodes of the graph 105 are received by the graph engine 104. The graph engine 104 is configured to determine which node of the graph 105 to add a new edge connected to the candidate node 107 in the graph 105 based upon the one or more probabilities 109 of adding an edge between the candidate node 107 and each node of the graph 105.
(14) The node selection neural network 103 may be configured to process a node state vector associated with the candidate node 107 and a node state vector associated with each node of the input graph 105 to generate a score associated with each pairing of the candidate node 107 and a respective node of the input graph 105. The node selection neural network 103 may comprise a final softmax layer to output the one or more probabilities 109 of adding an edge between the candidate node 107 and each node of the graph 105 based upon the generated scores. The generation of scores is described in further detail below.
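By way of illustration only (not the claimed implementation), the pairwise scoring and softmax normalisation described above might be sketched as follows; the dot-product score function and all names here are hypothetical stand-ins for the node selection neural network 103:

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of raw scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def selection_probabilities(h_candidate, node_states, score_fn):
    # One score per pairing of the candidate node with each graph node,
    # normalised jointly so the result is a distribution over nodes.
    scores = [score_fn(h_candidate, h_v) for h_v in node_states]
    return softmax(scores)

# Toy score function: dot product of the two node state vectors.
dot = lambda a, b: sum(x * y for x, y in zip(a, b))
probs = selection_probabilities([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], dot)
```

In the described system the score function would itself be a neural network processing the pair of node state vectors, with a separate score per edge type where edges are typed.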
(15) The node creation neural network 101, edge addition neural network 102 and node selection neural network 103 may each be a multi-layer perceptron (MLP). Alternatively, the neural networks 101, 102, 103 may be any other type of neural network having an appropriate type of activation function as deemed suitable by a person skilled in the art.
(16) The graph engine 104 is configured to generate and output a graph 110 by sampling the probability distributions represented by the node creation neural network 101, the edge addition neural network 102 and the node selection neural network 103. Processing to generate an output graph 110 may be an iterative process. The iterative process may alternate between the steps of node generation and edge generation. An example high-level iterative process is shown in
(17) At step S201 of
(18) At step S204, it is determined whether or not to add an edge connected to the new node. If an edge is not required, then processing returns to step S202 to determine whether further nodes are to be added to the graph. Otherwise, processing continues at step S205 to select a node of the graph to connect the new edge to. It will be appreciated that this may include the new node itself such that the graph may contain self-loops. At step S206, an edge connecting the new node and the selected node is added to the graph.
(19) Processing then continues at step S207 where it is determined whether any further edges connected to the new node are to be generated. If further edges are required, then processing returns to step S205 to select a node to connect a further edge to. If no further edges are required, then processing returns to step S202 to determine whether any further nodes are to be generated. Thus, processing to generate nodes and edges repeats until it is determined that no further nodes are to be generated, and the graph is output at step S208.
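A minimal sketch of the iterative process of steps S202 to S208, with the three probability distributions replaced by caller-supplied decision functions; all names are hypothetical, and the deterministic stand-in policies exist only so the example runs:

```python
import random

class Graph:
    def __init__(self):
        self.nodes = []   # node identifiers
        self.edges = []   # (u, v) pairs; self-loops (v, v) are permitted

def generate_graph(decide_add_node, decide_add_edge, select_node, rng):
    """Alternate node generation (steps S202/S203) and edge generation
    (steps S204-S207) until the add-node decision says stop (S208)."""
    g = Graph()
    while decide_add_node(g, rng):           # sample: add another node?
        v = len(g.nodes)
        g.nodes.append(v)
        while decide_add_edge(g, v, rng):    # sample: add an edge to v?
            u = select_node(g, v, rng)       # sample: which node to connect
            g.edges.append((u, v))
    return g

# Deterministic stand-ins that grow a 3-node path graph.
def stop_at_three(g, rng): return len(g.nodes) < 3
def one_edge_each(g, v, rng): return v > 0 and not any(e[1] == v for e in g.edges)
def previous_node(g, v, rng): return v - 1

g = generate_graph(stop_at_three, one_edge_each, previous_node, random.Random(0))
```

In the described system each decision function would instead sample from the probabilities output by the node creation, edge addition and node selection neural networks.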
(20) Referring to
(21) At step S301, an initial or input graph 105 is received. The input graph 105 is provided to the node creation neural network 101 at step S302. The node creation neural network 101 processes the input graph 105 to output one or more probabilities 106 of adding a new node to the input graph 105 at step S303. Processing of the input graph 105 may include processing of state vectors associated with each node of the input graph 105. The processing of node state vectors may be based upon an information propagation method to update the node state vectors. An exemplary information propagation method is described in more detail below with reference to
(22) A graph state vector may be determined based upon the updated node state vectors. The one or more probabilities 106 of adding a new node to the input graph 105 may be determined based upon the graph state vector. For example, the graph state vector may be processed by the node creation neural network 101 to obtain the one or more probabilities of adding a new node to the input graph 105. An exemplary method of determining a graph state vector is described in more detail below.
(23) At step S304, it is determined whether or not a new node is to be generated based upon the one or more probabilities 106 obtained from step S303. That is, the graph engine 104 samples the probability distribution represented by the one or more probabilities 106 to determine whether or not to add a new node to input graph 105. If the nodes are typed, the sampled outcome may be a decision to add a new node of a particular type.
(24) If it is determined that a new node is not to be added, then the input graph 105 is output unchanged at step S308. Otherwise, processing continues at step S305 to generate a new node. Generating a new node may comprise initializing a node state vector associated with the new node. The node state vector may be initialized by processing a graph state vector and a feature vector associated with the new node using a neural network. This node initialization neural network may be an MLP or any other type of suitable neural network. The feature vector associated with the new node may be based upon the type of the new node. By using both the node feature vector and the graph state vector in the initialization process, it can be ensured that different nodes having the same feature vector/type will have a different initialization dependent on the current state of the graph.
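As a hedged illustration, initialization from the node feature vector and the graph state vector might look as follows, with the node initialization neural network reduced to a single linear layer; the function and weight values are assumptions for the sketch, not the actual implementation:

```python
def init_node_state(x_v, h_graph, weights, biases):
    # h_new = W @ concat(x_v, h_G) + b: the initial state depends on both
    # the node's feature vector and the current graph state vector.
    inp = list(x_v) + list(h_graph)
    return [sum(w * x for w, x in zip(row, inp)) + b
            for row, b in zip(weights, biases)]

# Illustrative weights that simply copy the feature vector through.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0]]
h_new = init_node_state([0.5, -0.5], [0.1, 0.2], W, [0.0, 0.0])
```

Because the graph state vector is part of the input, two nodes with identical feature vectors receive different initial states when added at different stages of generation.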
(25) At step S306, the new node is added to the graph 105. The updated graph 105 is then output at step S307. Processing may then continue in order to add edges connecting the new node to other nodes in the graph as shown in steps S204 to S207 and described in more detail below with reference to the exemplary processing shown in
(26) At step S401 of
(27) The indication of the candidate node 107 may be a state vector associated with the candidate node 107. The node state vectors of each node in the graph 105 may be updated based upon the information propagation method mentioned above and described in more detail below with reference to
(28) A graph state vector may be determined based upon the updated node state vectors. The one or more probabilities 108 of adding an edge connected to the candidate node 107 to the graph may be determined based upon processing of the updated node state vector associated with the candidate node 107 and the graph state vector by the edge addition neural network 102.
(29) At step S403, it is determined whether or not an edge connected to the candidate node 107 is to be added based upon the one or more probabilities 108 obtained at step S402. That is, the graph engine 104 samples the probability distribution represented by the one or more probabilities 108 to determine whether or not to add an edge connected to the candidate node 107.
(30) If no edges are to be added, then the graph is provided as output at step S408. Processing may then return to step S202 to determine whether further nodes are to be added to the graph 105.
(31) If an edge is to be added, processing continues at step S404 in which the graph 105 and the indication of the candidate node 107 is provided to the node selection neural network 103. The node selection neural network 103 processes the graph 105 and the indication of the candidate node 107 to output one or more probabilities 109 of adding an edge between the candidate node 107 and each node of the graph 105. Similar to the above, the indication of the candidate node 107 may be a state vector associated with the candidate node 107 and the node state vectors of each node in the graph may be updated based upon the information propagation method. The node selection neural network 103 may process each pairing of the candidate node 107 and a respective node of the graph 105 to generate a score for adding an edge connecting the candidate node 107 and the respective node of the graph 105. Where edges have an associated type, a score may be generated for each type of edge. The node selection neural network 103 may then process the generated scores through a softmax layer to generate the one or more probabilities 109 of adding an edge between the candidate node 107 and each node of the graph.
(32) At step S405, a node is selected based upon the one or more probabilities obtained at step S404. That is, the graph engine 104 samples the probability distribution represented by the one or more probabilities 109 to determine which node of the graph 105 to connect the candidate node 107 to. If edges are typed, the selection may also include the type of edge to be used to connect the candidate node 107 and the selected node.
(33) The graph is then updated to include an edge between the candidate node 107 and the node selected at step S406. Processing may then return to step S403 to determine whether any further edges connecting the candidate node 107 are required.
(34) Referring now to
(35) At step S501, a message vector is generated for each edge of the graph. The message vector may be generated based upon the node state vectors associated with the two nodes connected by the respective edge. Additionally, the message vector may further be generated based upon a feature vector associated with the respective edge if for example, edges are typed. The node state vectors and the edge feature vector may be processed by an edge neural network to generate the message vector.
(36) At step S502, for each node, the message vectors associated with incoming edges are aggregated. It will be appreciated that aggregation of message vectors associated with outgoing edges is also possible and that the aggregated message vector may be an aggregation of the message vectors associated with either incoming edges, outgoing edges or every edge connected to a node. The aggregation may be a summation of the message vectors or any other method of aggregation as deemed suitable by a person skilled in the art.
(37) At step S503, the node state vector associated with each node is updated based upon the respective aggregated message vector determined at step S502. An updated node state vector for a respective node may be generated based upon processing, by a node neural network, of the respective aggregated message vector and the current node state vector to generate an updated node state vector.
(38) The message generating neural network of step S501 and the neural network of step S503 for updating node state vectors may be feedforward neural networks such as an MLP, or recurrent neural networks such as an LSTM or GRU.
(39) At step S504, it is determined whether or not further rounds of information propagation are required. A single round of information propagation will enable a node to obtain information regarding its local neighborhood whilst further rounds of information propagation will enable a node to obtain information from further afield in the graph.
(40) If a further round of information propagation is required, processing returns to step S501. Alternatively, if no further rounds of information propagation are required, the updated node state vectors may be output at step S505.
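The propagation loop of steps S501 to S505 can be sketched as below, assuming abstract message and update functions in place of the edge and node neural networks; the toy functions at the end are illustrative only:

```python
def propagate(node_states, edges, message_fn, update_fn, rounds=1):
    """node_states: {node: state vector}; edges: list of (u, v) pairs.
    Messages are aggregated over incoming edges only, by summation."""
    h = dict(node_states)
    for _ in range(rounds):                                    # step S504
        # Step S501: one message vector per edge.
        messages = {(u, v): message_fn(h[u], h[v]) for (u, v) in edges}
        for v in h:
            # Step S502: aggregate (sum) the messages on incoming edges.
            incoming = [m for (src, dst), m in messages.items() if dst == v]
            agg = ([sum(xs) for xs in zip(*incoming)] if incoming
                   else [0.0] * len(h[v]))
            # Step S503: update the node state from (old state, aggregate).
            h[v] = update_fn(h[v], agg)
    return h                                                   # step S505

# Toy stand-ins: the message is the sender's state; the update averages.
msg = lambda h_u, h_v: h_u
upd = lambda h_v, a_v: [(x + y) / 2 for x, y in zip(h_v, a_v)]
h = propagate({0: [1.0], 1: [0.0]}, [(0, 1)], msg, upd, rounds=1)
```

After one round, node 1 has absorbed information from its neighbour node 0; further rounds would spread information from further afield in the graph.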
(41) By using information propagation, node state vectors may comprise information relating to the structure of the graph. Use of node state vectors in the sequential graph generating process enables decisions to be made based upon the structure of the graph and the sequence of decisions. By contrast, a conventional recurrent neural network modelling the graph as a sequence of node/edge generating decisions relies only upon the sequence of decisions and lacks the ability to incorporate structural information of the graph or to refer to specific nodes in its decision making process.
(42) An exemplary implementation of the above information propagation process will now be described in more detail. The message vector of step S501 may be generated according to the following:
$m_{u \to v} = f_e(h_u, h_v, x_{u,v}) = \mathrm{MLP}(\mathrm{concat}([h_u, h_v, x_{u,v}]))$

(43) where $m_{u \to v}$ is the message vector associated with an edge connecting a node $u$ and a node $v$, $f_e$ is a message generating function which is implemented using a fully connected neural network ($\mathrm{MLP}$), $h_u$ and $h_v$ are the node state vectors associated with nodes $u$ and $v$, $x_{u,v}$ is a feature vector associated with the edge and $\mathrm{concat}$ is an operation concatenating its operands into a single vector.
(44) If message vectors are to be generated in the reverse (outgoing) direction, then an additional MLP having the same form as above may be used for generating message vectors in the reverse direction:
$m'_{u \to v} = f'_e(h_u, h_v, x_{u,v}) = \mathrm{MLP}'(\mathrm{concat}([h_u, h_v, x_{u,v}])).$

(45) The additional neural network, $\mathrm{MLP}'$, for the reverse direction may also share some or all of the parameters of the $\mathrm{MLP}$ for the incoming direction. Both MLPs may have linear activation functions.
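Reduced to a single linear layer (as the text permits, since the MLPs may have linear activations), the message function might be sketched as follows; the weights are illustrative only:

```python
def message(h_u, h_v, x_uv, W, b):
    # m_{u->v} = MLP(concat([h_u, h_v, x_uv])), here a single linear layer.
    z = list(h_u) + list(h_v) + list(x_uv)
    return [sum(w * x for w, x in zip(row, z)) + bi
            for row, bi in zip(W, b)]

# 1-d node states and edge feature, 2-d message; weights are illustrative.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 1.0]]
m = message([2.0], [3.0], [0.5], W, [0.0, 0.0])
```

A second network of the same shape would produce the reverse-direction messages, optionally sharing parameters with this one.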
(46) For a particular node $v$, aggregating the message vectors, corresponding to step S502, may be performed as follows:

(47) $a_v = \sum_{u : (u,v) \in E} m_{u \to v}$

(48) or as follows if both incoming and outgoing message vectors are to be used:

(49) $a_v = \sum_{u : (u,v) \in E} m_{u \to v} + \sum_{u : (v,u) \in E} m'_{v \to u}$

(50) where $a_v$ is the aggregated message vector for node $v$ and $E$ is the set of edges of the graph. As can be seen, the above aggregation is a summation of the message vectors associated with the incoming and/or outgoing edges connected to node $v$.
(51) As described above, at step S503, the node state vector for a respective node may be updated based upon processing the aggregated message vectors using a node neural network. For example, the node neural network may be implemented as a recurrent neural network, as follows:
$h'_v = \mathrm{RNNCell}(h_v, a_v).$
(52) The RNNCell may be a standard recurrent neural network type, in which case the update to the node state vector may be performed as:
$h'_v = \sigma(W h_v + U a_v)$

(53) where $\sigma$ is a sigmoid activation function and $W$ and $U$ are respective weight matrices.
(54) Alternatively, the recurrent neural network may be a GRU type, in which case the update to the node state vector may be performed as:
$z_v = \sigma(W_z h_v + U_z a_v),$

$r_v = \sigma(W_r h_v + U_r a_v),$

$\tilde{h}_v = \tanh(W(r_v \odot h_v) + U a_v),$

$h'_v = (1 - z_v) \odot h_v + z_v \odot \tilde{h}_v,$

(55) where $\odot$ denotes elementwise (Hadamard) multiplication and $z_v$ and $r_v$ are the update and reset gates.
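The GRU-style update can be sketched as follows; the zero weight matrices are placeholders for learned parameters (with zero weights both gates equal 0.5 and the candidate state is zero, so the update simply halves the state, which the example exploits):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_update(h, a, Wz, Uz, Wr, Ur, W, U):
    z = sigmoid(Wz @ h + Uz @ a)             # update gate z_v
    r = sigmoid(Wr @ h + Ur @ a)             # reset gate r_v
    h_tilde = np.tanh(W @ (r * h) + U @ a)   # candidate state
    return (1 - z) * h + z * h_tilde         # elementwise interpolation

d = 3
Z = np.zeros((d, d))                         # placeholder weight matrices
h = np.array([2.0, -4.0, 6.0])
h_new = gru_update(h, np.ones(d), Z, Z, Z, Z, Z, Z)
```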
(56) Alternatively, the recurrent neural network may be an LSTM type, in which case the update to the node state vector may be performed as:

$i_v = \sigma(W_i h_v + U_i a_v + V_i c_v),$

$f_v = \sigma(W_f h_v + U_f a_v + V_f c_v),$

$\tilde{c}_v = \tanh(W_c h_v + U_c a_v),$

$c'_v = f_v \odot c_v + i_v \odot \tilde{c}_v,$

$o'_v = \sigma(W_o h_v + U_o a_v + V_o c'_v),$

$h'_v = o'_v \odot \tanh(c'_v),$

(57) where $c_v$ is a cell state vector associated with node $v$ and each $V$ is a respective weight matrix.
(58) As discussed above, a graph state vector may be determined based upon the node state vectors (whether updated or not). Given that a graph will generally contain more information than individual nodes, the graph state vector may be of higher dimensionality than a node state vector. A node state vector may be mapped to a vector of higher dimensionality based upon processing the node state vector using a neural network. The graph state vector may then be obtained by aggregating the higher dimensional node state vectors. Alternatively, if mapping to a higher dimensional vector is not performed, the graph state vector may be an aggregation of the node state vectors.
(59) The aggregation to generate a graph state vector may be based upon a gated sum. For example, a gating neural network comprising a final sigmoid activation function layer and one or more lower layers may be used to process a node state vector (irrespective of dimensionality) to obtain a set of gating weights associated with the node state vector. The set of gating weights may comprise a weight for each element of the node state vector. The aggregation may be based upon a sum of the node state vectors having their corresponding gating weights applied.
(60) For example, mapping a node state vector to a higher dimensional node state vector may be implemented as follows:
h.sub.v.sup.G=f.sub.m(h.sub.v)
(61) where f.sub.m is a function mapping to a higher dimensional space implemented as an MLP. The MLP may have a linear activation function. The dimensionality of h.sub.v.sup.G may, for example, be twice the dimensionality of h.sub.v.
(62) The graph state vector may be computed as a gated sum of the higher dimensional node state vectors as follows:
(63) h.sub.G=Σ.sub.v∈Vg.sub.v.sup.G⊙h.sub.v.sup.G, g.sub.v.sup.G=σ(g.sub.m(h.sub.v))
(64) where V is the set of nodes of the graph, g.sub.m is an MLP having a linear activation function which feeds into an output layer having a sigmoid activation function to produce a set of weights g.sub.v.sup.G for node v.
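The gated-sum aggregation can be sketched as follows. This is a minimal illustration assuming the mappings `f_m` and `g_m` are supplied as callables; the toy stand-ins below (duplication for `f_m`, all-zero scores for `g_m`) are purely illustrative and replace the MLPs of the actual system:

```python
import math

def graph_state(node_states, f_m, g_m):
    """h_G = sum over nodes v of sigmoid(g_m(h_v)) * f_m(h_v), elementwise."""
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    h_G = None
    for h_v in node_states:
        h_up = f_m(h_v)                          # map to the higher-dimensional space
        gates = [sigmoid(s) for s in g_m(h_v)]   # per-element gating weights g_v^G
        gated = [g * x for g, x in zip(gates, h_up)]
        h_G = gated if h_G is None else [a + b for a, b in zip(h_G, gated)]
    return h_G

# Toy stand-ins for the MLPs: f_m duplicates the node state to double its
# dimensionality; g_m scores every element as zero, so each gate is 0.5.
f_m = lambda h: h + h
g_m = lambda h: [0.0] * (2 * len(h))
```

The gating lets the model weight each node's contribution per element, rather than summing all node states uniformly.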
(65) An exemplary implementation of the node creation neural network 101 will now be described in more detail and is schematically illustrated in the figures. The probability of adding a node to the graph may be computed as:
p(add one more node|G)=σ(f.sub.an(h.sub.G))
(66) where f.sub.an is an MLP having a linear activation function the output of which is fed into a layer having a sigmoid activation function denoted by σ, and h.sub.G is a graph state vector. Where nodes may be one of K types, the output of f.sub.an may be a K+1 dimensional vector representing a score for adding a node of each type and also a score for not adding a node. The sigmoid layer above may be replaced with a softmax layer to convert the scores to probabilities as shown below:
(67) p.sub.k=exp({circumflex over (p)}.sub.k)/Σ.sub.k′exp({circumflex over (p)}.sub.k′)
(68) where {circumflex over (p)}.sub.k is a score for node type k and p.sub.k is the probability of adding a node of type k.
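The conversion of the K+1 scores into node-generating probabilities can be sketched as a standard numerically stable softmax (a minimal illustration; the function name is introduced here):

```python
import math

def node_creation_probs(scores):
    """Convert K+1 scores (one per node type plus one 'do not add a node'
    score) into probabilities with a numerically stable softmax."""
    m = max(scores)                          # subtract the max to avoid overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Sampling from the resulting distribution makes the node-generating decision: drawing one of the first K entries adds a node of that type, while drawing the final entry terminates node creation.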
(69) An exemplary implementation of the edge addition neural network 102 will now be described in more detail with reference to the schematic illustration in the figures. The probability of adding an edge connected to a candidate node v may be computed as:
p(add edge|G,v)=σ(f.sub.ae(h.sub.G,h.sub.v.sup.(T)))
(70) where f.sub.ae is an MLP having a linear activation function the output of which is fed into a layer having a sigmoid activation function denoted by σ, h.sub.G is a graph state vector and h.sub.v.sup.(T) is an updated node state vector of a candidate node v 107 after T rounds of information propagation. In this implementation, the output of the edge addition neural network 102 is a probability of adding an edge connected to the candidate node 107, with edge typing handled by the node selection neural network 103. However, it will be appreciated that edge typing may be handled by the edge addition neural network 102 in a similar manner to that described above with respect to node types.
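The edge-adding decision reduces to a sigmoid over a learned score of the graph state and candidate node state. A minimal sketch, where `toy_f_ae` is a hypothetical stand-in for the MLP f.sub.ae (here just a dot product of the concatenated inputs with a fixed weight vector):

```python
import math

def add_edge_prob(h_G, h_v, f_ae):
    """p(add edge | G, v) = sigmoid(f_ae(h_G, h_v))."""
    return 1.0 / (1.0 + math.exp(-f_ae(h_G, h_v)))

# Purely illustrative stand-in for the MLP f_ae.
def toy_f_ae(h_G, h_v):
    w = [0.1] * (len(h_G) + len(h_v))
    return sum(wi * xi for wi, xi in zip(w, h_G + h_v))
```

Sampling a Bernoulli variable with this probability makes the edge-generating decision for the candidate node.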
(71) An exemplary implementation of the node selection neural network 103 will now be described in more detail and is schematically illustrated in the figures. A score for connecting the candidate node to each node of the graph may be computed as:
s.sub.u=f.sub.s(h.sub.u.sup.(T),h.sub.v.sup.(T))=MLP(concat([h.sub.u.sup.(T),h.sub.v.sup.(T)]))
(72) where s.sub.u is a score for selecting node u, f.sub.s is a scoring function implemented by an MLP, h.sub.u.sup.(T) is an updated node state vector for each node of the graph after performing T rounds of information propagation and h.sub.v.sup.(T) is the updated node state vector for the candidate node v 107 after performing T rounds of information propagation. In this implementation, T=1 or 2 rounds of information propagation are performed. However, it will be appreciated that further rounds of information propagation may be performed if deemed appropriate by a person skilled in the art.
(73) The scores may then be converted to a probability using a softmax layer:
(74) p.sub.u=exp(s.sub.u)/Σ.sub.u′exp(s.sub.u′)
(75) where p.sub.u is the probability of selecting node u of the graph to connect to candidate node v 107.
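The node selection step can be sketched as scoring each existing node against the candidate and normalizing with a softmax. A minimal illustration, with `f_s` supplied as a callable standing in for the scoring MLP:

```python
import math

def node_selection_probs(node_states, h_v, f_s):
    """Score every node u against the candidate v by applying f_s to the
    concatenation [h_u, h_v], then softmax the scores over the nodes."""
    scores = [f_s(h_u + h_v) for h_u in node_states]  # list '+' concatenates
    m = max(scores)                                   # stable softmax
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Sampling from the returned distribution selects the node of the graph to which the new edge is connected.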
(76) Where edges are typed, for each node, a vector of scores comprising a score for each of J edge types may be computed. The softmax layer may then be computed over all scores across node and edge types:
(77) p.sub.u,j=exp(s.sub.u,j)/Σ.sub.u′,j′exp(s.sub.u′,j′)
(78) where s.sub.u,j is a score for connecting the candidate node v 107 to node u of the graph using edge type j and p.sub.u,j is the corresponding probability.
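For typed edges, the softmax runs jointly over every (node, edge type) pair rather than over nodes alone. A minimal sketch, taking the per-node, per-type scores as a matrix:

```python
import math

def typed_edge_probs(score_matrix):
    """score_matrix[u][j] is the score for connecting the candidate node to
    node u with edge type j; the softmax is computed jointly over all
    (u, j) pairs, so the probabilities sum to 1 across the whole matrix."""
    flat = [s for row in score_matrix for s in row]
    m = max(flat)                           # stable softmax
    exps = [math.exp(s - m) for s in flat]
    total = sum(exps)
    it = iter(exps)
    return [[next(it) / total for _ in row] for row in score_matrix]
```

A single sample from the flattened distribution therefore selects both the target node and the edge type in one decision.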
(79) Any of the above probability distributions may be based upon a conditional probability distribution. That is, the graph generation process may be conditioned on some additional input. For example, the conditional probability distribution may be conditioned on data retrieved from a memory. The graph, for example, the graph state vector, may be used to query a memory to retrieve data from the memory. In addition, the conditioning may be based upon an attention mechanism.
(80) The generative graph model and in particular the probability distributions represented by the one or more neural networks defines a joint distribution p(G,π) over graphs G and node and edge orderings π. As will be appreciated from the above, generating a graph using the system 100 generates both a graph and a particular ordering of nodes and edges based upon the graph construction sequence. For training of the one or more neural networks, optimizing the logarithm of the marginal likelihood p(G)=Σ.sub.π∈P(G)p(G,π), where P(G) denotes the set of possible orderings of G, may be used. However, optimization of log p(G) may be intractable for large graphs. Therefore, the one or more neural networks may be trained based upon optimizing an expected joint log-likelihood sampled from a training dataset of graphs as shown below:
E.sub.p.sub.data.sub.(G,π)[log p(G,π)]=E.sub.p.sub.data.sub.(G)E.sub.p.sub.data.sub.(π|G)[log p(G,π)]
(81) where p.sub.data(G) may be computed by sampling from the training dataset whilst p.sub.data(π|G) may be sampled from the training dataset or may be suitably chosen beforehand. For example, where there exists a canonical ordering associated with a graph of the training dataset, the distribution p.sub.data(π|G) may be based upon a delta function that places all of the probability on the canonical ordering.
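The training objective amounts to averaging the model's joint log-likelihood over (graph, ordering) pairs drawn from the training data. A minimal sketch, where `log_p` is a hypothetical callable standing in for the model's log-likelihood of a graph under a given construction ordering:

```python
def expected_joint_log_likelihood(samples, log_p):
    """Monte-Carlo estimate of E_{p_data(G, pi)}[log p(G, pi)]: average the
    model's joint log-likelihood over (graph, ordering) pairs sampled from
    the training dataset. The training loss is the negative of this value."""
    return sum(log_p(G, pi) for G, pi in samples) / len(samples)
```

In practice this estimate would be maximized by gradient ascent on the neural network parameters, avoiding the intractable sum over all orderings.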
(82) The above training process enables the system to be trained faster than a conventional recurrent neural network system and provides better optimization of the system.
(83) The training dataset may comprise graphs that are structurally identical but have different permutations of their node and edge orderings. The permutations may be random permutations of a canonical ordering, random orderings generated from a uniform distribution, or learned orderings.
(84) In an exemplary application of drug discovery, the “ChEMBL” database provided by the European Bioinformatics Institute of the European Molecular Biology Laboratory (EMBL-EBI) may be used to obtain exemplary drug molecules for the generation of a training dataset. A canonical ordering may be based upon the simplified molecular-input line-entry system (SMILES) representation associated with the molecule and a graph generated from the SMILES representation using a tool such as “RDKit: Open-source cheminformatics software” available at http://www.rdkit.org.
(85) Evaluation of the learned graph generative model may be performed by computing the marginal likelihood. This may be performed using a sampling method such as importance sampling. For example, one Monte-Carlo estimate based on importance sampling is shown below:
(86) p(G)=Σ.sub.πp(G,π)=E.sub.q(π|G)[p(G,π)/q(π|G)]
(87) where q(π|G) is any proposal distribution over ordering permutations, and the estimate may be obtained by generating a few samples π.sub.1, . . . , π.sub.N from q(π|G) and then taking the average
(88) (1/N)Σ.sub.i=1.sup.Np(G,π.sub.i)/q(π.sub.i|G)
over the samples. Where there exists a canonical ordering associated with a graph, q(π|G) may be based upon a delta function that places all of the probability on the canonical ordering.
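The importance-sampling estimate can be sketched as follows. This is a minimal illustration, with `sample_pi`, `q_prob` and `p_joint` supplied as callables standing in for the proposal sampler, the proposal density and the model's joint likelihood respectively:

```python
def marginal_likelihood_estimate(G, sample_pi, q_prob, p_joint, n_samples=100):
    """Monte-Carlo importance-sampling estimate of p(G): average the weight
    p(G, pi) / q(pi | G) over orderings pi drawn from the proposal q."""
    total = 0.0
    for _ in range(n_samples):
        pi = sample_pi(G)                          # pi ~ q(. | G)
        total += p_joint(G, pi) / q_prob(pi, G)    # importance weight
    return total / n_samples
```

With a delta proposal on a canonical ordering, every sample is identical and the estimate reduces to the joint likelihood of the graph under that single ordering.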
(90) This specification uses the term “configured” in connection with systems and computer program components. For a system of one or more computers to be configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation cause the system to perform the operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions.
(91) Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non transitory storage medium for execution by, or to control the operation of, data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.
(92) The term “data processing apparatus” refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
(93) A computer program, which may also be referred to or described as a program, software, a software application, an app, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a stand alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.
(94) In this specification, the term “database” is used broadly to refer to any collection of data: the data does not need to be structured in any particular way, or structured at all, and it can be stored on storage devices in one or more locations. Thus, for example, the index database can include multiple collections of data, each of which may be organized and accessed differently.
(95) Similarly, in this specification the term “engine” is used broadly to refer to a software-based system, subsystem, or process that is programmed to perform one or more specific functions. Generally, an engine will be implemented as one or more software modules or components, installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines can be installed and running on the same computer or computers.
(96) The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA or an ASIC, or by a combination of special purpose logic circuitry and one or more programmed computers.
(97) Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.
(98) Computer readable media suitable for storing computer program instructions and data include all forms of non volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks.
(99) To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's device in response to requests received from the web browser. Also, a computer can interact with a user by sending text messages or other forms of message to a personal device, e.g., a smartphone that is running a messaging application, and receiving responsive messages from the user in return.
(100) Data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing common and compute-intensive parts of machine learning training or production, i.e., inference, workloads.
(101) Machine learning models can be implemented and deployed using a machine learning framework, e.g., a TensorFlow framework, a Microsoft Cognitive Toolkit framework, an Apache Singa framework, or an Apache MXNet framework.
(102) Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface, a web browser, or an app through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.
(103) The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client. Data generated at the user device, e.g., a result of the user interaction, can be received at the server from the device.
(104) While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially be claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
(105) Similarly, while operations are depicted in the drawings and recited in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
(106) Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.