System and method of connection information regularization, graph feature extraction and graph classification based on adjacency matrix

Abstract

Disclosed is system and method of connection information regularization, graph feature extraction and graph classification based on adjacency matrix. By concentrating the connection information elements in the adjacency matrix into a specific diagonal region of the adjacency matrix in order to reduce the non-connection information elements in advance. The subgraph structure of the graph is further extracted along the diagonal direction using the filter matrix. Then a stacked convolutional neural network is used to extract a larger subgraph structure. On the one hand, it greatly reduces the amount of computation and complexity, solving the limitations of the computational complexity and the limitations of window size. And on the other hand, it can capture large subgraph structure through a small window, as well as deep features from the implicit correlation structures at both vertex and edge level, which improves the accuracy and speed of the graph classification.

Claims

1. A connection information regularization system based on adjacency matrix in a computer environment, wherein the connection information regularization system is configured to reorder all vertices in a first adjacency matrix of a graph to obtain a second adjacency matrix; wherein connection information elements in the second adjacency matrix are mainly distributed in a diagonal region of the second adjacency matrix, the diagonal region having a size of n; where n is a positive integer, n≥2 and n is smaller than |V|; |V| is the number of rows or columns of the second adjacency matrix; wherein the diagonal region of the second adjacency matrix is composed of the following connection information elements: a positive integer i traverses from 1 to |V|; for i >max(n, |V|−n), elements from (i−n+1)-th to |V|-th columns in i-th row are selected; for i≤n, elements from 0-th to (i+n−1)-th columns in the i-th row are selected; for max(n, |V|−n)≥i≥min(|V|−n, n), elements from (i−n+1)-th to (i+n−1)-th columns in the i-th row are selected; wherein an element of the connection information elements is the corresponding element of an edge of the graph in the second adjacency matrix; the graph is a structure of objects in graph theory.

2. The system of claim 1, wherein when there is no weight on the edge of the graph, the value of the connection information element is 1 and the value of the non-connection information element is 0.

3. The system of claim 1, wherein when the edge of the graph has weight, the value of the connection information element is the weight of the edge, and the value of the non-connection information element is 0.

4. The system of claim 1, wherein the diagonal region refers to the diagonal region from the upper left corner to the lower right corner of a matrix.

5. The system of claim 1, wherein the diagonal region of the second adjacency matrix refers to a scanned area that is scanned diagonally by using a scanning rectangle with a size of n×n.

6. The system of claim 5, wherein the scanning process is described as follows: the upper left corner of the scanning rectangle is coincident with the upper left corner of the second adjacency matrix; then the scanning rectangle is moved by one grid down and to the right, until the lower right corner of the scanning rectangle coincides with the lower right corner of the second adjacency matrix.

7. The system of claim 1, wherein the connection information regularization system is configured to reorder all the vertices of the first adjacency matrix so that concentration of connection information elements in the diagonal region of the second adjacency matrix is maximized.

8. The system of claim 7, wherein the vertices of the first adjacency matrix are reordered by a greedy algorithm, which includes the following steps: (1) initial input: inputting the first adjacency matrix of the input graph as pending adjacency matrix; (2) counting swap pairs: calculating all possible vertex exchange pairs in the pending adjacency matrix; (3) row and column exchange: judging whether all possible vertex exchange pairs are in a processed state; when yes, outputting the pending adjacency matrix to obtain the second adjacency matrix, and the greedy algorithm ends; otherwise, selecting one vertex exchange pair as the current vertex exchange pair, and switching the corresponding two rows and two columns in the pending adjacent matrix to generate a new adjacency matrix and jump to Step (4); (4) exchange evaluation: calculating the concentration of connection information elements in new adjacency matrix; when the concentration of connection information elements in the new adjacency matrix is higher than before, accepting the exchange and replacing the pending adjacency matrix with the adjacency matrix and jumping to step (2); when the concentration of connection information elements in the new adjacency matrix is lower than or equal to before, abandoning the exchange, and marking the current vertex exchange pair as a processed state, and jumping the process to step (3).

9. The system of claim 7, wherein the vertices of the first adjacency matrix are reordered by a branch and bound algorithm, which includes the following steps: (1) initial input: inputting the first adjacency matrix of the input graph as pending adjacency matrix; (2) counting swap pairs: calculating all possible vertex exchange pairs in the pending adjacency matrix; (3) row and column exchange: judging whether all possible vertex exchange pairs are in a processed state; when yes, then outputting the pending adjacency matrix to obtain the second adjacency matrix, and ending the branch and bound algorithm; otherwise, performing an exchange operation for each of the unprocessed vertex exchange pairs and jumping to step (4); the exchange operation refers to simultaneous exchange of the two corresponding rows and columns in the pending adjacency matrix, and a new adjacency matrix is generated for each of said vertex exchange pairs performing the exchange operation; (4) exchange evaluation: calculating the concentration of connection information elements in each of the new adjacency matrixes, and when there is a new adjacency matrix in which the concentration of connection information elements is higher than before, selecting the newest adjacency matrix with the highest concentration and marking the vertex exchange pair as the processed state, and then going to step (3); when there is not a matrix whose concentration of elements is higher than before, outputting the current adjacency matrix to be processed to obtain the second adjacency matrix, and ending the branch and bound algorithm.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 is schematic diagram of diagonal area with width 3 in 6×6 adjacency matrix.

(2) FIG. 2 is a linear weighted calculation process.

(3) FIG. 3 is schematic diagram of converting a first adjacency matrix (left) to second adjacency matrix (right).

(4) FIG. 4 is flow diagram of greedy algorithm.

(5) FIG. 5 is flow diagram of branch and bound algorithm.

(6) FIG. 6 is data flow diagram of stacked CNN module.

(7) FIG. 7 is data flow diagram of stacked CNN module with independent pooling and convolution pooling module.

(8) FIG. 8 is data flow diagram of stacked CNN module with independent pooling and multiple convolution pooling module.

(9) FIG. 9 is a graph and corresponding first adjacency matrix.

(10) FIG. 10 is the flow diagram of greedy algorithm.

(11) FIG. 11 is schematic diagram of exchanging column and row of an adjacency matrix.

(12) FIG. 12 is a first adjacency matrix and a reordered second adjacency matrix.

(13) FIG. 13 is a graph and corresponding second adjacency matrix.

(14) FIG. 14 is diagram of filter matrix movement of feature generation module.

(15) FIG. 15 is a schematic diagram of calculation of a filter matrix of feature generation module.

(16) FIG. 16 is a schematic diagram of zero-padding operation for an adjacency matrix.

(17) FIG. 17 is a schematic diagram of a graph classification system based on stacked CNNs.

(18) FIG. 18 is the accuracy and running time on MUTAG.

(19) FIG. 19 is the accuracy and running time on PTC.

(20) FIG. 20 is the accuracy and running time on PROTEINS.

(21) FIG. 21 is the accuracy and running time on different dropout rate.

(22) FIG. 22 is the accuracy and running time with/without connection information regularization system on different datasets.

(23) FIG. 23 is the convergence curve on MUTAG.

(24) FIG. 24 is the convergence curve on PTC.

(25) FIG. 25 is the convergence curve on PROTEINS.

(26) FIG. 26 is a filter matrix and its corresponding subgraph structure, where (a) is the positive subgraph structure, (b) is the negative subgraph structure, and (c) is the filter matrix.

(27) FIG. 27 a schematic diagram of the subgraph structure corresponding to the features captured by each convolutional layer, where (a) is a 12-vertices graph, (b) is an extracted 4-vertices feature, (c) is an extracted 6-vertices feature, and (d) It is an extracted 8-vertices feature, (e) is an extracted 10-vertices feature, and (f) is an extracted 12-vertices feature.

(28) FIG. 28 is a schematic diagram of the physical meaning of the feature generation module.

(29) FIG. 29 a schematic diagram of a subgraph structure captured by feature generation module and stacked CNN module.

(30) FIG. 30 is structure of a graph classification system based on stacked CNN.

DETAILED DESCRIPTION OF THE DISCLOSURE

(31) In order to make the objectives, technical solutions and advantages of the present disclosure clearer, we take the system and method of graph feature extraction and graph classification based on adjacency matrix in the computer environment described in the present disclosure as an example to further describe the technical scheme. The following examples are only for illustrating the present disclosure and are not intended to limit the scope of the present disclosure. In addition, it should be understood that after reading the teachings of the present disclosure, those skilled in the art can make various changes or modifications to the present disclosure, and these equivalent forms also fall within the scope defined by the appended claims of the present disclosure.

(32) One embodiment specifically implements a connection information regularization system in a computer environment provided by the present disclosure. The connection information regularization system reorders all the vertices in the first adjacency matrix of the graph to obtain a second adjacency matrix, and the connection information elements in the second adjacency matrix are mainly distributed in a diagonal area of n of second adjacency, where n is a positive integer, n≥2 and n<|V|, |V| is the number of rows or columns of the second adjacency matrix; preferably, said diagonal region refers to the diagonal region from the upper left corner to the lower right corner of the matrix. For example, the shaded region in FIG. 1 is a diagonal region with a width of 3 in a 6×6 adjacency matrix.

(33) The graphs and subgraphs mentioned are graphs in graph theory.

(34) The connection information element is the corresponding element of the edge of the graph in the adjacency matrix.

(35) The connection information regularization system concentrates the connection information elements in the adjacency matrix into a specific diagonal region with a width of n in the second adjacency matrix (n is the size of the subgraph represented by the extracted features, i.e. the window size. And n is a positive integer, and n≤|V|, the |V| is the number of rows or columns of the second adjacency matrix). And then a matrix of size n×n (that is, the window size is n) is used to traverse along the diagonal region to complete the extraction of the subgraph structure with the number of vertices n in the graph, and the required computational complexity and calculation amount are greatly reduced, solving the computational complexity limit.

(36) In the present disclosure, the vector refers to a quantity having a magnitude and a direction, and in mathematics, a 1×m matrix, where m is a positive integer greater than 1. The features described in the present disclosure all represent features of a graph.

(37) The adjacency matrix in the present disclosure refers to a matrix representing the adjacent relationship between the vertices of a graph. The basic properties of the adjacency matrix are that by switching the two columns of the adjacency matrix and the corresponding rows, another adjacency matrix representing the same graph can be got. Let G=(V, E) be a graph, V is the vertex set (vertex set), v.sub.i is the i-th vertex in V, |V| represents the number of vertices in V, i is positive integers and i≤|V|, E is an edge set. G's adjacency matrix is an n-order square matrix with the following properties:

(38) The adjacency matrix in the present disclosure refers to a matrix representing the adjacent relationship between the vertices of a graph. The basic properties of the adjacency matrix are that by switching the two columns of the adjacency matrix and the corresponding rows, another adjacency matrix representing the same graph can be got. Let G=(V, E) be a graph, V is the vertex set (vertex set), v.sub.i is the i-th vertex in V, |V| represents the number of vertices in V, i is positive integers and i≤|V|, E is an edge set. G's adjacency matrix is an n-order square matrix with the following properties: 1) For undirected graphs, the adjacency matrix must be symmetric, and the main diagonal must be zero (only undirected simple graphs are discussed here), the sub-diagonal is not necessarily zero, and directed graphs are not necessarily so; the main diagonal is the diagonal of the upper left corner to the lower right corner of the matrix; the sub-diagonal is the diagonal of the upper right corner of the matrix to the lower left corner of the matrix. 2) In a directed graph, the degree of any vertex v.sub.i is the number of all non-zero elements in the i-th column (or i-th row); the vertex i is represented as the i-th column (or i-th row) in the matrix. In a directed graph, the in-degree of vertex i is the number of all non-zero elements in the i-th row; the out-degree of the vertex is the the number of all non-zero elements in the i-th row; the degree of the vertex is the number of edges associated with the vertex; the out-degree of the vertex is the number of edges start from the vertex and point to other vertices; the in-degree of the vertex is the number of edges start from other vertices and point to the vertex. 3) The adjacency matrix method need |V|.sup.2 elements to represent a graph. Since the adjacency matrix of an undirected graph must be symmetric, only data in upper right or lower left triangle need to be stored except the zeros in diagonals. Therefore, only |V|×(|V|−1)/2 elements are needed; when the edges of the undirected graph are edges with weights, the values of the connected elements in the adjacency matrix are replaced by weights, and when there are no connected elements, use 0 instead.

(39) The connection information element of the present disclosure is the corresponding element of the edge of the graph in the adjacency matrix; in the undirected graph, the element value of the i-th row and the j-th column represents whether the connection of the vertex vi and the vertex v.sub.j exists and whether there are connection weights; the value of the element in the i-th row and the j-th column in the directed graph represents whether the connection of the vertex v.sub.i to the vertex v.sub.j exists and whether there is a connection weight. For example, if there is an edge between the vertex v.sub.i and the vertex v.sub.j in the undirected graph, then the element values of the corresponding i-th row, j-th column and j-th row i-th column in the adjacency matrix are all 1; if there are no edges, the corresponding element values of the i-th row, j-th column and the j-th row, the i-th column are all 0. If there are edges and the weight exists on the edge, they are all w; for another example, if there is an edge between vertex v.sub.i and vertex v.sub.j in a directed graph and there is an edge starting from vertex v.sub.i to vertex v.sub.j, then the element in i-th row and the j-th column of adjacency matrix is 1. If there is no edge pointing to the vertex v.sub.j from the vertex v.sub.i, the element value of the corresponding i-th row and j-th column is 0. If there is an edge from the vertex v.sub.i to the vertex v.sub.j, and there is a weight w on the edge, then the element value of the corresponding i-th row and j-th column is w; where i, j is a positive integer less than or equal to |V|, |V| is the number of vertices in the graph, w is any real number.

(40) Preferably, if there is no weight on the edge of the graph, the value of the connection information element is 1 and the value of the non-connection information element is 0. More preferably, if the edge of the graph is weighted. Then, the value of the connection information element is the edge weight value, and the value of the non-connection information element is 0.

(41) The first adjacency matrix of the present disclosure refers to the first adjacency matrix obtained by converting the graph into an adjacency matrix at the beginning, that is, the initial adjacency matrix before the exchange of the corresponding rows and columns. And the second adjacency matrix refers to the matrix obtained by performing the exchange corresponding rows and columns on first adjacency matrix to concentrate the connection information. The connection information elements in the second adjacency matrix are centrally distributed in a diagonal area of width n of the second adjacency matrix, where n is a positive integer, and n≤|V|, said |V| is the number of rows or columns of the second adjacency matrix. A schematic diagram of converting the first adjacency matrix to the second adjacency matrix is shown in FIG. 3. The left is the first adjacency matrix, and the right is the second adjacency matrix.

(42) Further, the diagonal region of the second adjacency matrix is composed of the following elements: a positive integer i traverses from 1 to |V|, and when i>max(n, |V|−n), the i-th row is selected. Element of (i−n+1) to |V| column; when i≤n, select elements from 0-th to i+n−1th columns in the i-th row; when max(n, |V|−n)≥I≥min(|V|−n,n), then in the i-th column, select elements from (i−n+1)-th column to (i+n−1)-th column;

(43) Preferably, the diagonal region of the second adjacency matrix refers to a scanned area that is scanned diagonally by using a scanning rectangle with a size n×n; more preferably, the scanning process is described as follows. First, the upper left corner of the scanning rectangle is coincident with the upper left corner of the second adjacency matrix; then each time the scanning rectangle is moved to the right and the down by one grid, until the lower right corner of the scanning rectangle coincides with the lower right corner of the second adjacency matrix.

(44) Further, the connection information regularization system is configured to reorder all the vertices of the first adjacency matrix so that concentration of connection information elements in the diagonal region of the second adjacency matrix is maximized; concentration of connection information elements refers to the ratio of non-zero elements in the diagonal area.

(45) Preferably, the reordering method is an integer optimization algorithm, which functions to concentrate the connection information elements in the matrix into the diagonal region and make the concentration of the connection information elements as high as possible; the integer optimization algorithm refers to an algorithm that makes the information elements of the matrix more concentrated by exchanging the corresponding two rows and columns in the matrix at the same time.

(46) Further, the reordering method is a greedy algorithm and includes the following steps:

(47) (1) Initial Input: Input the first adjacency matrix of the input graph as pending adjacency matrix.

(48) (2) Counting Swap Pairs: Calculate all possible vertex exchange pairs in the pending adjacency matrix.

(49) (3) Row and Column Exchange: It is judged whether all possible vertex exchange pairs are in a processed state. If yes, the pending adjacency matrix is output to obtain the second adjacency matrix, and the greedy algorithm ends; otherwise, one vertex exchange pair is selected as the current vertex exchange pair, and switch the corresponding two rows and two columns in the pending adjacent matrix to generate a new adjacency matrix and jump to Step (4);

(50) (4) Exchange Evaluation: Calculating the concentration of connection information elements in new adjacency matrix. If the concentration of connection information elements in the new adjacency matrix is higher than before, the exchange is accepted. The adjacency matrix replaces the pending adjacency matrix and jumps to step (2); if the concentration of connection information elements in the new adjacency matrix is lower than or equal to before. Then, the exchange is abandoned, and the current vertex exchange pair is marked as a processed state, and the process jumps to step (3).

(51) The flow diagram of the greedy algorithm refers to FIG. 4.

(52) Further, the reordering method is a branch and bound algorithm and includes the following steps:

(53) (1) Initial Input: Input the first adjacency matrix of the input graph as pending adjacency matrix.

(54) (2) Counting Swap Pairs: Calculate all possible vertex exchange pairs in the pending adjacency matrix.

(55) (3) Row and Column Exchange: It is judged whether all possible vertex exchange pairs are in a processed state. If yes, then the pending adjacency matrix is output to obtain the second adjacency matrix, and the branch and bound algorithm ends; otherwise, perform an exchange operation for each of the unprocessed vertex exchange pairs and jump to step (4). The exchange operation refers to simultaneous exchange of the two corresponding rows and columns in the pending adjacency matrix, and a new adjacency matrix is generated for each of said vertex exchange pairs performing the exchange operation;

(56) (4) Exchange Evaluation: Calculating the concentration of connection information elements in each of the new adjacency matrixes, and if there is a new adjacency matrix in which the concentration of connection information elements is higher than before, select the newest adjacency matrix with the highest concentration and mark the vertex exchange pair as the processed state, and then go to step (3); If there is not a matrix whose concentration of elements is higher than before, the current adjacency matrix to be processed is output to obtain the second adjacency matrix, and the branch and bound algorithm ends.

(57) The flow diagram branch and bound algorithm refers to FIG. 5.

(58) Further, the concentration of connection information elements in the diagonal region of the second adjacency matrix depends on the number of connection information elements and/or the number of non-connection information elements in the diagonal region.

(59) Further, the concentration of connection information elements in the diagonal region of the second adjacency matrix depends on the number of connection information elements outside the diagonal region and/or the number of non-connection information elements.

(60) Further, the concentration can be measured by the Loss value. The smaller the Loss value is, the higher the concentration is, and the method for calculating the Loss value is as follows:

(61) $LS (A, n) = {.Math.}_{i = 1}^{n} {.Math.}_{j = i + n}^{.Math. V .Math.} A_{i, j} + {.Math.}_{i = n + 1}^{.Math. V .Math.} {.Math.}_{j = 1}^{i - n} A_{i, j}$

(62) In the formula, LS(A, n) represents the Loss value, A denotes the second adjacency matrix, n denotes the width of diagonal region of the second adjacency matrix and A.sub.i, j denotes the i-th row and j-th column elements in the second adjacency matrix. Preferably, the LS(A, n) denotes the Loss value of the second adjacency matrix A when the filter matrix size is n×n. The smaller the Loss value is, the higher the concentration is.

(63) Further, the concentration can also be measured using the ZR value. The smaller the ZR value is, the higher the concentration is, and the method for calculating the ZR value is as follows:

(64) $TC (A, n) = {.Math.}_{i = 1}^{n} {.Math.}_{j = 1}^{.Math. V .Math. - n + i - 1} C_{i, j} + {.Math.}_{i = n + 1}^{.Math. V .Math.} {.Math.}_{j = i - n + 1}^{.Math. V .Math.} C_{i, j}$ $T 1 (A, n) = {.Math.}_{i = 1}^{n} {.Math.}_{j = 1}^{.Math. V .Math. - n + i - 1} A_{i, j} + {.Math.}_{i = n + 1}^{.Math. V .Math.} {.Math.}_{j = i - n + 1}^{.Math. V .Math.} A_{i, j}$ $ZR (A, n) = \frac{TC \times T 1}{TC}$

(65) In the formula, A denotes the second adjacency matrix, C denotes the matrix with the same size of the A and all elements are connections information elements, A.sub.i, j denotes the elements of the i-th row and j-th column in A. Ci,j denotes the element of row i and column j in C. TC(A, n) and TC denotes the total number of elements in the diagonal region with width n in A. T1(A, n) and T1 denotes the number of connected information elements in the diagonal region with width n in A. ZR(A, n) denotes the ZR value, which means the proportion of non-connected information elements in the diagonal region with width n, and n denotes the number of rows or columns of the filter matrix. Preferably, the ZR(A, n) denotes the ZR value of the second adjacency matrix A when the filter matrix size is n×n. The smaller the ZR value is, the higher the concentration of the second adjacency matrix is.

(66) An embodiment implements a graph feature extraction system based on adjacency matrix in a computer environment provided by the present disclosure. The graph feature extraction system extracts features of a graph based on an adjacency matrix of the graph, and the features which correspond to the subgraph directly support the classification. The features are presented in the form of at least one vector, each vector corresponding to the distribution of a mixed state in the graph; the graph feature extraction system includes feature generation module and any form of connection information regularization system in a computer environment described above. The graph feature extraction system includes connection information regularization system and feature generation module, and they work together as a whole to effectively extract local patterns and connection features in a specific diagonal region with window size of n for datasets with different sizes and different structural complexity. The connection information system makes the computational complexity and calculation amount required by feature generation module reduce greatly, solving the limitation of computational complexity

(67) Preferably, the feature generation module generates a feature of the graph by using a filter matrix, and the filter matrix is a square matrix; more preferably, the feature generation module uses at least one filter matrix along the diagonal region of second adjacency matrix to obtain at least one vector corresponding to the features of the graph. The features which correspond to the subgraph directly support the classification and are presented in the form of at least one vector, and each vector corresponds to the distribution of a mixed state in the graph.

(68) Preferably, the distribution condition refers to the possibility that the subgraph structure in the mixed state appears in the graph; preferably, each of the mixed states represents a linear weight of an adjacency matrix corresponding to any of a plurality of subgraph structures. More preferably, the linear weighting refers to multiplying the adjacency matrix of each subgraph by the weight corresponding to the adjacency matrix, and then adding the bits together to obtain a matrix of the same size as the adjacency matrix of the subgraph. The sum of the weights corresponding to the adjacency matrix is 1; the calculation process is shown in FIG. 2.

(69) Preferably, the filtering operation is to add the inner product of filter matrix and second adjacency matrix and get the value through an activation function. Filter matrix moves diagonally to obtain a set of values to form a vector corresponding to the distribution of a subgraph structure in the graph; more preferably, the activation function is a sigmoid function, a ReLU activation function, and a pReLU function.

(70) Preferably, the feature generation module uses the different filter matrix to perform the filtering operation.

(71) Preferably, the initial value of each element in the filter matrix is a value of a random variable taken from the Gaussian distribution, respectively. The Gaussian distribution is a probability distribution. The Gaussian distribution is the distribution of continuous random variables with two parameters μ and σ. The first parameter μ is the mean value of the random variable that obeys the normal distribution, and the second parameter σ is the variance of the random variable; when the value of a random variable is taken from a Gaussian distribution, the closer the value of the random variable taken to μ, the greater the probability, while the greater the distance from μ, the smaller the probability.

(72) Preferably, the elements in the filter matrix are real number greater than or equal to −1 and less than or equal to 1. More preferably, the elements in the filter matrix are real numbers greater than or equal to 0 and less than or equal to 1.

(73) Preferably, the feature generation module participates in a machine learning process for adjusting the values of the elements of the filter matrix.

(74) Preferably, the machine learning process utilizes back propagation to calculate the gradient value by using the loss value and further adjust the values of each element in the filter matrix.

(75) The loss value refers to the error between the output of the machine learning process and the actual output that should be obtained; the gradient can be seen as the slope of a curved surface along a given direction, and the gradient of the scalar field is a vector field. The gradient at one point in the scalar field points to the fastest growing direction of the scalar field, and the gradient value is the largest rate of change in this direction.

(76) The machine learning process described consists of a forward propagation process and a backward propagation process. In the forward propagation process, input information is processed layer by layer from the input layer to the hidden layer and finally passed to the output layer. If the desired output value is not obtained in the output layer, the sum of the square error between output and the expected value is used as the objective function, and the back propagation is performed. The partial derivative of the target function for each neuron weight is calculated layer by layer adjust the values. The gradient of the weight vector of the function is used as a basis for modifying the weight value, and the machine learning process is completed during the weight value modification process. When the error converges to the desired value or reaches the maximum epochs of learnings, the machine learning process ends. The initial values of the elements in the filter matrix are the values of the random variables taken from the Gaussian distribution, which are then updated by back propagation in the machine learning process and are optimized at the end of the machine learning process.

(77) Preferably, the hidden layer refers to each layer other than the input layer and the output layer, and the hidden layer does not directly receive signals from the outside world and does not directly send signals to the outside world.

(78) Further, the size of the filter matrix is n×n, that is, the size of the filter matrix is the same as the width of the diagonal region in the second adjacency matrix. After the connection information elements concentrated into the diagonal region by the connection information regularization system, a filter matrix is used to perform diagonal convolution and it can extract the distribution of the subgraph structure of size n in the graph as much as possible under the premise of O(n) time complexity.

(79) An embodiment implements the graph classification system based on adjacency matrix in a computer environment provided by the present disclosure includes a class labeling module and any form of feature extraction based on adjacency matrix in a computer environment as described above. In the system, the class labeling module labels the graph based on the features extracted by the graph feature extraction system, and outputs the class of the graph; the graph is graph in graph theory.

(80) Preferably, the class labeling module calculates the possibility that the graph belongs to each class, and labels graph as the class with the highest possibility and completes the classification of the graph.

(81) Preferably, the class labeling module uses the classification algorithm to calculate the possibility that the graph belongs to each class, and labels the graph as the class with the highest possibility to complete the classification of the graph; more preferably, the classification algorithm is selected from any one of kNN, a linear classification algorithm, or any of a plurality of types.

(82) The kNN algorithm means that if most of the k nearest samples in a feature space belong to a certain class, the sample also belongs to this class and has the same characteristics of the samples in this class. This method determines the class based on the nearest one or several samples. The linear classification algorithm means that the data is classified using a straight line (or plane, hyperplane) in the feature space.

(83) Further, the graph classification system includes a stacked CNN module, and the stacked CNN module processes features generated by the graph feature extraction system and merges the subgraph structures features supporting the classification and generates the feature which represents larger subgraph structure in the graph. The larger subgraph structure refers to a subgraph structure with more than n vertices.

(84) Preferably, the stacked CNN module includes convolution submodule and pooling submodule.

(85) The convolution submodule uses at least one convolution layer to perform a convolution operation on features generated by the graph feature extraction system and merges the subgraph structures features supporting the classification to obtain at least one vector as a convolution result. The input of the first convolutional layer is the feature generated by any of the forms of the graph feature extraction system as described above. If there are multiple convolutional layers, the input of each convolutional layer is the result of the previous convolutional layer. The output of each convolutional layer is at least one vector. Each convolutional layer uses at least one filter matrix for the convolution operation, and result of the last convolutional layer is outputted to the pooling submodule.

(86) Further, the convolution operation refers to using a filter matrix to move on an adjacency matrix with some regularity, multiply bitwisely and sum up to get a value and the values obtained constitute a vector or a matrix.

(87) The filter matrix is a square matrix; the number of rows of the filter matrix in each of the convolution layers is the same as the number of vectors input to the convolution layer; preferably, the elements in the filter matrix are real numbers greater than or equal to −1 and less than or equal to 1; more preferably, the elements in the filter matrix are real numbers greater than or equal to 0 and less than or equal to 1.

(88) The pooling submodule is configured to perform a pooling operation on the matrix obtained by the convolution submodule, obtain at least one vector as a pooling result and output to the class labeling module to label the graph. The pooling result includes features of a larger subgraph structure in the graph; the larger subgraph structure refers to a subgraph structure having more than n vertices; preferably, the pooling operation is selected from the group consisting of: max-pooling, average-pooling. The max-pooling refers to taking the maximum value among the neighborhood; the average-pooling refers to averaging the values among the neighborhood.

(89) Further, the pooling operation is based on the convolution operation and performs mathematical operations on each convolution result, thereby reducing the dimension of the convolution result. The mathematical operations include but are not limited to averaging and taking the maximum value.

(90) Preferably, a data flow diagram of the stacked CNN module is shown in FIG. 6.

(91) The stacked CNN module extracts larger, deeper and more complex feature, which corresponds to larger, deeper and more complex subgraph in the graph, from the feature generated by feature generation module through a series of convolutional layers. The connection information regularization system, the feature generation module and the stacked CNN module in the graph classification system provided by the present disclosure work together to extract larger (the number of vertices is greater than n), deeper and complex features with a small window size n. First, it captures small subgraphs with small window of size n, and then larger, deeper and complex subgraphs with a number of vertices greater than n is extracted by the combination of small subgraphs. That is, it can capture large subgraph structure through a small window, as well as deep features from the implicit correlation structures at both vertex and edge level, which improves the accuracy and speed of the graph classification.

(92) Further, the graph classification system includes an independent pooling module and a convolution pooling module; the independent pooling module performs pooling operation on the feature extracted by graph feature extraction system to obtain at least one vector as the first pooling result and output to class labeling module. The convolution pooling module performs convolution and pooling operation on the input features extracted by any form of the graph feature extraction system as described above. It merges the subgraph structures features supporting the classification, generates a second pooling result representing a larger subgraph structure feature and output it to the class labeling module. The class labeling module classifies the graph and output the class label of graph according to the first pooling result and the second pooling result; the larger subgraph structure refers to a subgraph structure with more than n vertices.

(93) Preferably, the convolution pooling module includes a convolution submodule and a pooling submodule. The convolution submodule uses at least one filter matrix to perform convolution operation on the input merge the features which can support classification to obtain at least one vector as convolution result and output it to the pooling submodule. The pooling submodule performs the pooling operation on the convolution result to obtain at least one vector as the second pooling result and output it to class labeling module. The second pooling result contains features of a larger subgraph structure in the graph.

(94) The filter matrixes are square matrixes; the number of rows of the filter matrix in each of the convolution layers is the same as the number of vectors input to the convolution layer; preferably, the elements in the filter matrix are real numbers and greater than or equal to −1 and less than or equal to 1; more preferably, the elements in the filter matrix are real numbers greater than or equal to 0 and less than or equal to 1. Preferably, the pooling operation is selected from the largest pooling operation, the average pooling operation.

(95) Preferably, the data flow diagram of the stacked CNN module including the independent pooling module and the convolutional pooling module is shown in FIG. 7.

(96) Further, the graph classification system further includes an independent pooling module and multiple convolution pooling modules; the independent pooling module performs pooling operation on the feature extracted by graph feature extraction system to obtain at least one vector as the first pooling result and output to class labeling module. The convolution pooling module performs convolution and pooling operation on the input features in turn. Convolution operation is performed to merge the subgraph structures features supporting the classification and generate a convolution result. The pooling operation is performed on the convolution result to obtain at least a vector as pooling result which contains larger subgraph structure feature. The convolution result of previous convolution pooling module is output to the next convolution pooling module and the pooling result of each convolution pooling module is output to the class labeling module. The class labeling module classifies the graph and output the class label of graph according to the first pooling result and all the pooling result of convolution pooling module.

(97) Wherein, the input of the first convolution pooling module is the feature generated by any form of the graph feature extraction system as described above and the input of other convolution pooling module is the convolution result of the previous convolution pooling module. The last convolution pooling module only outputs the pooling result to the class labeling module; the larger subgraph structure refers to the subgraph structure with more than n vertices.

(98) Preferably, the convolution pooling module includes a convolution submodule and a pooling submodule. The convolution submodule uses at least one filter matrix to perform convolution operation on the input merge the features which can support classification to obtain at least one vector as convolution result and output it to the next convolution pooling module. The pooling submodule performs the pooling operation on the convolution result to obtain at least one vector as pooling result and output it to class labeling module. The pooling result contains features of a larger subgraph structure in the graph. Preferably, the number of convolution submodule and pooling submodule may be the same or different. Preferably, the number of convolution submodule and pooling submodule is one or more.

(99) The filter matrixes are square matrixes; the number of rows of the filter matrix in each of the convolution layers is the same as the number of vectors input to the convolution layer; preferably, the elements in the filter matrix are real numbers and greater than or equal to −1 and less than or equal to 1; more preferably, the elements in the filter matrix are real numbers greater than or equal to 0 and less than or equal to 1.

(100) Preferably, the number of the convolution pooling modules is less than or equal to 10, and more preferably, the number of convolution pooling modules in the graph classification system is less than or equal to 5; more preferably, the number of the convolution pooling modules is less than or equal to 5. The number of convolution pooling modules in the graph classification system is less than or equal to 3;

(101) Preferably, the pooling operation is selected from the max pooling operation, the average pooling operation.

(102) Preferably, the data flow diagram of the stacked CNN module including the independent pooling module and the multiple convolution pooling modules is shown in FIG. 8.

(103) Further, the element values of the vector of convolution result represent the possibility that the sub-graph structure appears at various positions on the graph. And the element values of the pooling result, the first pooling result, and the second pooling result represent the maximum or average probability that the subgraph structure appears in the graph.

(104) Further, the class labeling module includes a hidden layer unit, an activation unit, and a labeling unit.

(105) The hidden layer unit processes the received vector to obtain at least one mixed vector and output it to the activation unit, and the mixed vector contains information of all vectors received by the hidden layer unit. The hidden layer unit combines the input vectors as a combined vector and performs a linear weighted operation on the combined vector using at least one weighted vector to obtain at least one mixed vector. Preferably, the hidden layer refers to each layer other than the input layer and the output layer, and the hidden layer does not directly receive signals from the outside world and does not directly send signals to the outside world.

(106) The activation unit calculates a value for each mixed vector output by the hidden layer unit using an activation function, and outputs all the resulting values as a vector to the labeling unit; preferably, the activation functions performed are sigmoid function, ReLU activation function, pReLU function.

(107) The labeling unit is configured to calculate the possibility that the graph belongs to each class according to the result of the activation unit and labels the class with the highest possibility as the classification result of the graph to complete the classification. Preferably, the labeling unit calculates the probability that the graph belongs to each classification label based on the classification algorithm and labels the class with the highest possibility as the classification result of the graph to complete the classification. More preferably, the classification algorithm is any one or more than one of the kNN and the linear classification algorithm.

(108) The fourth object of the present disclosure is to provide a connection information regularization method in a computer environment, which includes the following steps:

(109) (1) Initial Input: convert the graph to the first adjacency matrix.

(110) (2) Connection Information Regularization: reorder all the vertices in the first adjacency matrix of the graph to obtain a second adjacency matrix, and the connection information elements in the second adjacency matrix are mainly distributed in a diagonal area of n of second adjacency, where n is a positive integer, n≥2 and n is much smaller than |V|, |V| is the number of rows or columns of the second adjacency matrix.

(111) The diagonal region of the second adjacency matrix is composed of the following elements: a positive integer i traverses from 1 to |V|, and when i>max(n, |V|−n), the i-th row is selected. Element of (i−n+1) to |V| column; when i≤n, select elements from 0-th to i+n−1th columns in the i-th row; when max(n, |V|−n)≥i≥min(|V|−n, n), then in the i-th column, select elements from (i−n+1)-th column to (i+n−1)-th column;

(112) The connection information element is the corresponding element of the edge of the graph in the adjacency matrix.

(113) the graph is graph in graph theory.

(114) Preferably, if there is no weight on the edge of the graph, the value of the connection information element is 1 and the value of the non-connection information element is 0; more preferably, if the edge of the graph is weighted Then, the value of the connection information element is the edge weight value, and the value of the non-connection information element is 0.

(115) Preferably, the diagonal region refers to the diagonal region from the upper left corner to the lower right corner of the matrix.

(116) Preferably, the diagonal region of the second adjacency matrix refers to a scanned area that is scanned diagonally by using a scanning rectangle with a size n×n.

(117) More preferably, the scanning process is described as follows. First, the upper left corner of the scanning rectangle is coincident with the upper left corner of the second adjacency matrix; then each time the scanning rectangle is moved to the right and the down by one grid, until the lower right corner of the scanning rectangle coincides with the lower right corner of the second adjacency matrix.

(118) Preferably, the reordering method is an integer optimization algorithm.

(119) Further, the reordering method is a greedy algorithm and includes the following steps:

(120) (1) Initial Input: Input the first adjacency matrix of the input graph as pending adjacency matrix.

(121) (2) Counting Swap Pairs: Calculate all possible vertex exchange pairs in the pending adjacency matrix.

(122) (3) Row and Column Exchange: It is judged whether all possible vertex exchange pairs are in a processed state. If yes, the pending adjacency matrix is output to obtain the second adjacency matrix, and the greedy algorithm ends; otherwise, one vertex exchange pair is selected as the current vertex exchange pair, and switch the corresponding two rows and two columns in the pending adjacent matrix to generate a new adjacency matrix and jump to Step (4);

(123) (4) Exchange Evaluation: Calculating the concentration of connection information elements in new adjacency matrix. If the concentration of connection information elements in the new adjacency matrix is higher than before, the exchange is accepted. The adjacency matrix replaces the pending adjacency matrix, and jumps to step (2); if the concentration of connection information elements in the new adjacency matrix is lower than or equal to before. Then, the exchange is abandoned, and the current vertex exchange pair is marked as a processed state, and the process jumps to step (3).

(124) Further, the reordering method is a branch and bound algorithm and includes the following steps:

(125) (1) Initial Input: Input the first adjacency matrix of the input graph as pending adjacency matrix.

(126) (2) Counting Swap Pairs: Calculate all possible vertex exchange pairs in the pending adjacency matrix.

(127) (3) Row and Column Exchange: It is judged whether all possible vertex exchange pairs are in a processed state. If yes, then the pending adjacency matrix is output to obtain the second adjacency matrix, and the branch and bound algorithm ends; otherwise, perform an exchange operation for each of the unprocessed vertex exchange pairs and jump to step (4). The exchange operation refers to simultaneous exchange of the two corresponding rows and columns in the pending adjacency matrix, and a new adjacency matrix is generated for each of said vertex exchange pairs performing the exchange operation;

(128) (4) Exchange Evaluation: Calculating the concentration of connection information elements in each of the new adjacency matrixes, and if there is a new adjacency matrix in which the concentration of connection information elements is higher than before, select the newest adjacency matrix with the highest concentration and mark the vertex exchange pair as the processed state, and then go to step (3); If there is not a matrix whose concentration of elements is higher than before, the current adjacency matrix to be processed is output to obtain the second adjacency matrix, and the branch and bound algorithm ends.

(129) Further, the concentration of connection information elements in the diagonal region of the second adjacency matrix depends on the number of connection information elements and/or the number of non-connection information elements in the diagonal region.

(130) Further, the concentration of connection information elements in the diagonal region of the second adjacency matrix depends on the number of connection information elements outside the diagonal region and/or the number of non-connection information elements.

(131) Further, the concentration can be measured by the Loss value. The smaller the Loss value is, the higher the concentration is, and the method for calculating the Loss value is as follows:

(132) $LS (A, n) = {.Math.}_{i = 1}^{n} {.Math.}_{j = i + n}^{.Math. V .Math.} A_{i, j} + {.Math.}_{i = n + 1}^{.Math. V .Math.} {.Math.}_{j = 1}^{i - n} A_{i, j}$

(133) In the formula, LS(A, n) represents the Loss value, A denotes the second adjacency matrix, n denotes the width of diagonal region of the second adjacency matrix, and Ai,j denotes the i-th row and j column elements in the second adjacency matrix.

(134) Further, the concentration can also be measured using the ZR value. The smaller the ZR value is, the higher the concentration is, and the method for calculating the ZR value is as follows:

(135) $TC (A, n) = {.Math.}_{i = 1}^{n} {.Math.}_{j = 1}^{.Math. V .Math. - n + i - 1} C_{i, j} + {.Math.}_{i = n + 1}^{.Math. V .Math.} {.Math.}_{j = i - n + 1}^{.Math. V .Math.} C_{i, j}$ $T 1 (A, n) = {.Math.}_{i = 1}^{n} {.Math.}_{j = 1}^{.Math. V .Math. - n + i - 1} A_{i, j} + {.Math.}_{i = n + 1}^{.Math. V .Math.} {.Math.}_{j = i - n + 1}^{.Math. V .Math.} A_{i, j}$ $ZR (A, n) = \frac{TC \times T 1}{TC}$

(136) In the formula, A denotes the second adjacency matrix, C denotes the matrix with the same size of the A and all elements are connections information elements, A.sub.i, j denotes the elements of the i-th row and j-th column in A. C.sub.i,j denotes the element of row i and column j in C. TC(A, n) and TC denotes the total number of elements in the diagonal region with width n in A. T1(A, n) and T1 denotes the number of connected information elements in the diagonal region with width n in A. ZR(A, n) denotes the ZR value, which means the proportion of non-connected information elements in the diagonal region with width n.

(137) An embodiment implements the graph feature extraction method based on adjacency matrix in a computer environment, and the method extracts features of a graph based on adjacency matrix of the graph, the features which correspond to the subgraph directly support the classification. The features are presented in the form of at least one vector, and each vector corresponds to the distribution of a mixed state in the graph. The method includes the following steps: (1) Connection Information regularization: based on the first adjacency matrix of the graph, the second adjacency matrix is obtained using any connection information regularization method described above. (2) Diagonal filtering: Based on the second adjacency matrix obtained in step (1), the features of the graph are generated. the features which correspond to the subgraph directly support the classification, and each vector corresponds to the distribution of a mixed state in the graph.

(138) The graphs and subgraphs are graphs in graph theory.

(139) Preferably, the step (2) utilizes a filtering matrix to generate features of the graph and the filtering matrix is a square matrix. More preferably, the step (2) utilizes at least one filter matrix along the diagonal region of second adjacency matrix to obtain at least one vector corresponding to the features of the graph. The features which correspond to the subgraph directly support the classification and are presented in the form of at least one vector, and each vector corresponds to the distribution of a mixed state in the graph.

(140) Preferably, the step (2) uses different filter matrixes to perform the filtering operation.

(141) Preferably, the distribution condition refers to the possibility that the subgraph structure in the mixed state appears in the graph; preferably, each of the mixed states represents a linear weight of an adjacency matrix corresponding to any of a plurality of subgraph structures. More preferably, the linear weighting refers to multiply the adjacency matrix of each subgraph by the weight corresponding to the adjacency matrix, and then add bitwise together to obtain a matrix of the same size as the adjacency matrix of the subgraph.

(142) Preferably, the filtering operation is to add the inner product of filter matrix and second adjacency matrix and get the value through an activation function. Filter matrix moves diagonally to obtain a set of values to form a vector corresponding to the distribution of a subgraph structure in the graph; more preferably, the activation function is a sigmoid function, a ReLU activation function, and a pReLU function.

(143) Preferably, the initial values of each element in the filter matrix are the values of random variables taken from the Gaussian distribution respectively;

(144) Preferably, the elements in the filter matrix are real numbers greater than or equal to −1 and less than or equal to 1, more preferably, the elements in the filter matrix are real numbers greater than or equal to 0 and less than or equal to 1.

(145) Preferably, the step (2) participates in a machine learning process for adjusting the values of the elements of the filter matrix.

(146) Preferably, the machine learning process utilizes back propagation to calculate the gradient value by using the loss value and further adjust the values of each element in the filter matrix. More preferably, the feature generation module can use different filter matrix to perform the filter operation.

(147) Preferably, the value of the connection information element is 1 and the value of the non-connection information element is 0; more preferably, if the edge of the graph is weighted Then, the value of the connection information element is the edge weight value, and the value of the non-connection information element is 0.

(148) Preferably, the diagonal region of the second adjacency matrix refers to a scanned area that is scanned diagonally by using a scanning rectangle with a size n×n.

(149) Further, the size of the filter matrix is n×n.

(150) An embodiment implements a method for classifying a graph based on adjacency matrix in a computer environment provided by the present disclosure. The method for classifying a graph includes the following steps:

(151) (1) Feature Extraction: Using the graph feature extraction method based on adjacency matrix of any form as described previously to extract the features of the graph.

(152) (2) Class Labeling: Based on the features extracted in step (1), classify the graph and output the class of the graph. The graph is the graph in graph theory. Preferably, the step (2) calculates the possibility that the graph belongs to each class, and labels graph as the class with the highest possibility and completes the classification of the graph. Preferably, the step (2) uses the classification algorithm to calculate the possibility that the graph belongs to each class, and labels the graph as the class with the highest possibility to complete the classification of the graph; more preferably, the classification algorithm is selected from any one of kNN, a linear classification algorithm, or any of a plurality of types.

(153) An embodiment implements a method for classifying a graph based on stacked CNN in a computer environment provided by the present disclosure. The method for classifying a graph includes the following steps:

(154) (1) Feature extraction: Using the graph feature extraction method based on adjacency matrix of any form as described previously to extract the features of the graph.

(155) (2) Convolution Operation: Using at least one convolutional layer to perform convolution operation on the features extracted in step (1) and merging the subgraph structures features which support the classification to obtain at least one vector as convolution result. The input of the convolutional layers is the feature extracted in step (1). If there are multiple convolution layers, the input of each convolutional layer is the result of the previous convolutional layer and the result of each convolutional layer is at least one vector, each convolution layer uses at least one filter matrix for convolution operation and the convolution result of the last convolution layer is output to step (3). The filter matrix is a square matrix. The number of rows of the filtering matrix in each convolution layer is the same as the number of vectors input to the convolution layer. Preferably, the elements in the filtering matrix are real numbers greater than or equal to −1 and less than or equal to 1. More preferably, the elements in the filter matrix are real numbers greater than or equal to 0 and less than or equal to 1.

(156) (3) Pooling Operation: Pooling the result of the convolution operation in step (2) and obtaining at least one vector as a pooling result and outputting it to step (4). The pooling result contains larger subgraph structure of the graph with more than n vertices. Preferably, the pooling operation is selected from maximum pooling and average pooling.

(157) (4) Class Labeling: Labeling the graph and outputting the class of graph according to the pooling result obtained by step (3).

(158) An embodiment implements another method for classifying graph based on stacked CNN in computer environment provided by the present disclosure. The method for classifying a graph includes the following steps:

(159) (1) Feature Extraction: Using the graph feature extraction method based on adjacency matrix of any form as described previously to extract the features of the graph and output to the step (2) and (3).

(160) (2) Independent Pooling Operation: Pooling the features extracted in step (1) to obtain at least one vector as the first pooling result and outputting to step (4).

(161) (3) Convolution Pooling Operation: Using at least one convolutional layer to perform convolution operation on the features extracted in step (1) and merging the subgraph structures features which support the classification to obtain at least one vector as convolution result. Then the pooling operation is performed on it to obtain at least on vector as the second pooling result and output to step (4). The second pooling result contains the feature of larger subgraph structure with more than n vertices. The filter matrix is square matrix. The number of rows of the filtering matrix in each convolution layer is the same as the number of vectors input to the convolution layer. Preferably, the elements in the filtering matrix are real numbers greater than or equal to −1 and less than or equal to 1. More preferably, the elements in the filter matrix are real numbers greater than or equal to 0 and less than or equal to 1. Preferably, the pooling operation is selected from maximum pooling and average pooling.

(162) (4) Class Labeling: Labeling the graph and outputting the class of graph according to the first pooling result and the second pooling result.

(163) An embodiment implements another method for classifying graph based on stacked CNN in computer environment provided by the present disclosure. The method for classifying a graph includes the following steps:

(164) (1) Feature Extraction: Using the graph feature extraction method based on adjacency matrix of any form as described previously to extract the features of the graph and output to the step (2).

(165) (2) Independent Pooling Operation: Pooling the features extracted in step (1) to obtain at least one vector as the first pooling result and outputting to step (3).

(166) (3) Convolution and Pooling Operation: Using at least one convolutional layer to perform convolution operation on the features extracted in step (1) and merging the subgraph structures features which support the classification to obtain at least one vector as convolution result. Then the pooling operation is performed on it to obtain at least on vector as pooling result which contains the feature of larger subgraph structure with more than n vertices. The convolution result of previous level is output to the next convolution and pooling operation and the pooling result of each level is output to the step (4). Wherein, the input of the first level convolution and pooling operation is the feature extracted in step (1). If there are multi-level convolution and pooling operation, the input of each level is the result of previous one, and only pooling result is output to the step (4) in the last level. The filter matrix is square matrix. The number of rows of the filtering matrix in each convolution layer is the same as the number of vectors input to the convolution layer. Preferably, the elements in the filtering matrix are real numbers greater than or equal to −1 and less than or equal to 1. More preferably, the elements in the filter matrix are real numbers greater than or equal to 0 and less than or equal to 1. Preferably, the pooling operation is selected from maximum pooling and average pooling.

(167) (4) Class Labeling: Labeling the graph and outputting the class of graph according to the first pooling result and all the pooling result in the step (3).

(168) Further, the element values of the vector of convolution result represent the possibility that the sub-graph structure appears at various positions on the graph. And the element values of the pooling result, the first pooling result, and the second pooling result represent the maximum or average probability that the subgraph structure appears in the graph.

(169) Further, the class labeling includes the following steps: (1) Feature Merging: The received vector is processed by the hidden layer and at least one mixed vector is obtained and output to step (2). The mixed vector contains information of all vectors received by the hidden layer. Preferably, the process described combines the input vectors into a combined vector and uses at least one weight vector to linearly weight the combined vector to obtain at least one mixed vector. (2) Feature Activation: Calculating a value for each mixed vector output by the hidden layer using an activation function, and outputting all the resulting values as a vector step (3); preferably, the activation functions performed are sigmoid function, ReLU activation function, pReLU function. (3) Class Labeling: The class labeling is configured to calculate the possibility that the graph belongs to each class according to the result of the activation and labels the class with the highest possibility as the classification result of the graph to complete the classification. Preferably, the class labeling calculates the probability that the graph belongs to each classification label based on the classification algorithm and labels the class with the highest possibility as the classification result of the graph to complete the classification. More preferably, the classification algorithm is any one or more than one of the kNN and the linear classification algorithm.

(170) One embodiment implements a graph classification system provided by the present disclosure. The vertex of the graph is an arbitrary entity, and an edge of the graph is a relationship between entities.

(171) Preferably, entity is any independent individual or set of individuals actual or virtual. Preferably, the entity may be one or combinations of person, thing, event, thing, concept. More preferably, any of said entities is selected from the group atoms in a compound or a single substance, any one or more of humans, commodities, events in a network.

(172) Preferably, the relationship is any relationship between entities. More preferably, the relationship is a chemical bond connecting atoms, a link between commodities, and a person-to-person relationship. More preferably, the relationship is the link between the commodities includes a causal relationship and an associated relationship of the purchased merchandise. More preferably, the person-to-person relationship includes an actual blood relationship, a friend relationship, a concern, transaction or message relationship in a virtual social network.

(173) One embodiment implements a network structure classification system provided by the present disclosure. The classification system implements a network structure classification based on any form of graph classification system as described above. The vertex of the graph is a node in the network. The edge of the graph is the relationship between nodes in the network. Preferably, the network is selected from the group consisting of electronic network, social network and logistics network. More preferably, the electronic network is selected from the group consisting of a local area network, a metropolitan area network, a wide area network, the Internet, 4G, 5G, CDMA, Wi-Fi, GSM, WiMax, 802.11, infrared, EV-DO, Bluetooth, GPS satellites, and/or any other communication scheme for wirelessly transmitting at least some of the information in at least a portion of a network of suitable wired/wireless technologies or protocols. Preferably, the node is selected from geographical position, mobile station, mobile device, user equipment, mobile user and network user. More preferably, the relationship between the nodes is selected from the information transmission relationship between the electronic network nodes, the transport relationship between geographical locations, the actual kinship between people, the friendship, attention, transaction or sending message relationship in the virtual social network. Preferably, the classification is selected from the network structure type. Structure type selected from the star, tree, fully connected and ring.

(174) One embodiment implements a compound classification system provided by the present disclosure. The classification system implements compound classification based on any form of a graph classification system as described before. The vertex of the graph is the atom of the compound. The edge is a chemical bond between the atoms. Preferably, the class is selected from the group consisting of activity, mutagenicity, carcinogenicity, catalytic activity etc. of the compound.

(175) One embodiment implements a social network classification system provided by the present disclosure. The classification system implements social network classification based on any form of a graph classification system as described above. The vertices of which are entities of social networks, including, but not limited to, people, institutions, events, geographic locations in social networks. The edges of the graph are relationships between entities, including, but not limited to, friends, concerns, Private letters, names, associations. The named name refers to a person who can use @.

(176) One embodiment implements a computer system provided by the present disclosure. The computer system includes any of graph feature extraction systems, graph classification system, the network structure classification system, the compound classification system, the social network classification system, or any of a plurality of types mentioned above.

(177) In addition, one embodiment takes a 6-vertex graph as an example to describe in detail the connection information regularization system and graph feature extraction system based on adjacency matrix in the computer environment of the present disclosure. For this 6-vertex graph, its vertices are denoted by a, b, c, d, e, f in alphabetical order, the six edges are (a, b), (a, c), (b, e), (b, f), (e, f) and (e, d) respectively. The graph structures and the its first adjacency matrix based on the order are shown in FIG. 9.

(178) The connection information regularization system is configured to reorder all the vertices in the first adjacency matrix of the graph to obtain a second adjacency matrix, and the connection information elements in the second adjacency matrix are mainly distributed in a diagonal area of n of second adjacency, where n is a positive integer, n≥2 and n is much smaller than |V|, |V| is the number of rows or columns of the second adjacency matrix. The diagonal region of the second adjacency matrix is composed of the following elements: a positive integer i traverses from 1 to |V|, when n<i<|V|−n, select the elements from (i−n+1)-th to (i+n−1)-th columns in i-th row; when i≤n, select elements from 0-th to i+n−1th columns in the i-th row; when i≥|V|−n, select the elements from (i−n+1)-th to |V|-th columns in i-th row.

(179) The vertex reordering method may be a greedy algorithm including the following steps:

(180) (1) Initial Input: input the first adjacency matrix A of the input graph as pending adjacency matrix.

(181) (2) Counting Swap Pairs: calculate all possible vertex swap pairs in A. Label columns in A as 1 to 6, then all possible vertex swap pairs are pairs={(m, h)|1<=m<=5, m+1<=h<=6}. So

(182) Specially, the pending matrix will be relabeled each time it is updated, then all possible pairs are reinitialized to 15 pairs. Init i=1, j=2.

(183) (3) Row and Column Exchange: judge whether i is equal to 5, if yes, then output A to obtain the second adjacency matrix, the greedy algorithm ends; otherwise, select pairs (i, j) as the current vertex exchange pair, execute swap (i, j), generate a new adjacency matrix and skip to step (4).

(184) (4) Exchange Evaluation: calculate the concentration of connection information elements in new adjacency matrix. If the concentration of connection information elements in the new adjacency matrix is higher than before, the refresh(A) is performed to replace A with the new matrix and jumps to step (2); if the concentration of connection information elements in the new adjacency matrix is lower than or equal to before. Then, the exchange is abandoned and execute j=j+1. If j>5, then execute i=i+1, j=i+1 and jump to step (3). If j≤5, then jump to step (3).

(185) The specific flow chart is shown in FIG. 10, where swap(A, i, j) indicates that the rows and columns corresponding to i, j in the adjacency matrix A are exchanged at the same time to obtain a new adjacency matrix and refresh(A) indicates that the adjacency matrix accept the exchange.

(186) The concentration of the connection information is measured by the Loss and ZR. The calculation method is shown in the following formula. For example, in FIG. 13(a), Loss(A, 3)=0, ZR(A, 3)=12/24=0.5. In FIG. 13(b), Loss(A, 3)=2, ZR(A, 3)=10/24=5/12. The lower Loss or ZR is, the higher the concentration.

(187) $LS (A, n) = {.Math.}_{i = 1}^{n} {.Math.}_{j = i + n}^{.Math. V .Math.} A_{i, j} + {.Math.}_{i = n + 1}^{.Math. V .Math.} {.Math.}_{j = 1}^{i - n} A_{i, j}$ $TC (A, n) = {.Math.}_{i = 1}^{n} {.Math.}_{j = 1}^{.Math. V .Math. - n + i - 1} C_{i, j} + {.Math.}_{i = n + 1}^{.Math. V .Math.} {.Math.}_{j = i - n + 1}^{.Math. V .Math.} C_{i, j}$ $T 1 (A, n) = {.Math.}_{i = 1}^{n} {.Math.}_{j = 1}^{.Math. V .Math. - n + i - 1} A_{i, j} + {.Math.}_{i = n + 1}^{.Math. V .Math.} {.Math.}_{j = i - n + 1}^{.Math. V .Math.} A_{i, j}$ $ZR (A, n) = \frac{TC \times T 1}{TC}$

(188) Taking the graph mentioned in FIG. 9 as an example, select n=3, and the corresponding two rows and two columns in the first adjacency matrix are exchanged as shown in FIG. 11. FIG. 11 (a) is the input first adjacency matrix with Loss(A, 3)=4, ZR(A, 3)=16/24=2/3. FIG. 11(b) is a new adjacency matrix A′ obtained by exchanging the rows and columns labeled as a and d. Its Loss(A′, 3)=6, ZR(A′, 3)=18/24=3/4, The Loss(A′, 3)>Loss(A, 3), ZR(A′, 3)>ZR(A, 3). That is, the concentration of connection information elements is reduced, so the exchange is abandoned. In FIG. 11(c) the new adjacency matrix A″ is obtained by exchanging the rows and columns labeled as b and c, the Loss(A″, 3)=2, ZR(A″, 3)=14/24=7/12, Loss (A″, 3)<Loss(A, 3),ZR(A″, 3)<ZR(A, 3). The concentration becomes higher, so replace A with A″. After constant attempts, the best results can be obtained, as shown in the right adjacency matrix in FIG. 12 and the optimal result is the second adjacency matrix. At this point, the vertex order of the second adjacency matrix becomes c, a, b, f, e, d. All the connection information elements (elements with value “1”) all fall into the second adjacency matrix with a width of n (n=3) in the diagonal area.

(189) An important role of the connection information regularization system is that given a first adjacency matrix, there may be more than one way to reorder the vertices of the graph, and the concentrations are the lowest. Therefore, there is more than one second adjacency matrix, these second adjacency matrices are isomorphic. As shown in FIG. 13(a), both adjacency matrices are the second adjacency matrix obtained by connection information regularization system. All the connection information are in the diagonal area of width n (n=3) in the adjacency matrix. However, the order of the vertices of the two matrices is not the same, so there may be multiple second adjacency matrices. In the present disclosure, different isomorphic representations of the graph are generated using this isomorphic property. These isomorphic second adjacency matrices are used to increase the training set at the preprocessing stages in the deep learning process of the graph classification system.

(190) The second adjacency matrix is input into the feature generation module to calculate and obtain at least one vector that directly corresponds to the subgraph structure supporting the classification. The feature generation module uses filter matrixes with size n×n, and moves along the diagonal of the second adjacency matrix to perform a convolution operation as shown in FIG. 14. These filter matrices is denoted as F.sup.0,i, i∈{1, . . . , n.sub.0}. Then the diagonal features extracted by the filter matrix F.sup.0,i in step j can be shown as:

(191) $P_{i, j}^{0} = α (.Math. F^{0, i}, A_{[j : j + n, j : j + n]}^{N} .Math.) = α ({.Math.}_{p = 1}^{n} {.Math.}_{q = 1}^{n} F_{p, q}^{0, i} A_{j + p, j + q}^{n})$

(192) Where α(.Math.) is the activation function, such as sigmoid. Therefore, the feature size obtained from diagonal convolution is n.sub.0×(|V|−n+1). In the following description, P.sup.0 is used to denote the feature {p.sub.i,j.sup.0] obtained by the feature generation module, and F.sub.0 is used to denote the filter parameter {F.sup.0,i].

(193) Also taking the graph shown in FIG. 9 as an example, n.sub.0=2 filter matrices with size 3×3 are used to calculate along the diagonal of second adjacency matrix, as shown in FIG. 15. FIG. 15(a) shows the graph and its second adjacency matrix. FIG. 15(b) shows the two filter matrices used. For convenience, the values in the filter matrix are 0 or 1, and the corresponding structures of two filter matrices used here are shown in FIG. 15(c). Using the above one filter matrix in (b) to move along the diagonal direction of the second adjacency matrix, the calculation is to multiply bitwise and sum up, so a vector (4, 4, 6, 4) can be obtained. Similarly using the below one filter matrix in (b) to move diagonally along the second adjacency matrix, another vector (4, 4, 4, 4) can be obtained. That is, after filtering operations, two vectors can be obtained, as shown in FIG. 15(d), and another vector can be obtained through the activation function (Sigmoid) as shown in FIG. 15(e). The higher the values in the vectors of FIG. 15(d) and FIG. 15(e), the higher the probability that the structure of filter matrix appears in corresponding position in the graph. For example, the region corresponding to 0.99 in FIG. 15(e) is the area enclosed by the dotted line in FIG. 15(a), that is, the subgraph structure represented by the three vertices b, e, f and it is exactly the same with the structure represented by the filter matrix (the structure above the FIG. 15 (c)).

(194) The main advantage of the connection information regularization system is that the connection information is concentrated in the diagonal area of the second adjacency matrix. The elements that do not contain the connection information do not contribute significantly to the classification of the graph, which results in a significant reduction in the amount of computation of the system. Specifically, without a connection information regularization system, when the feature generation module uses a filter matrix of size n×n to extract features, each filter matrix needs to perform calculations. After connection information regularization system, when using a filter matrix of size n×n to extract features, each filter matrix requires only calculations. Take FIG. 14 as an example, set n=3 and after the connection information regularization system, the number of calculations to be performed by each filter matrix is reduced from (6−3+1)2=16 times to 6−3+1=4 times. The amount is only 25% of the original. It can be seen that the graph feature extraction system with a connection information regularization system has a much smaller computational cost than the graph feature extraction system without one. The former calculation amount is only 25% of the latter.

(195) In addition, an embodiment is provided to describe in detail a specific implementation of the graph classification system based on adjacency matrix in a computer environment according to the present disclosure, and the effect of such an implementation is verified by public datasets.

(196) For datasets with irregularly sized graphs, we need to find a suitable window size n for it. When n is too small, it may lead to the loss of the most connection information element after passing through the connection information regularization system. In addition, small n may cause overfit of the feature generation module, because less likely subgraph structure features are captured. First, we unified the sizes of the adjacency matrices of all graphs, and choose the largest number of vertices in the dataset |V|max as the size of the uniform adjacency matrix (number of rows or columns). For graphs with vertices less than |V|max, such as the graph of 3 vertices, we use the zero-padding operation (addition of 0) to make the number of rows and columns of the adjacency matrix equal to |V|max. At the same time, it also ensures that the existing connection information in the original graph is maintained, that is, the additional 0 does not destroy or change the original vertices and edges in the graph. The zero-padding operation is shown in FIG. 16. FIG. 16(a) shows the graph structure of the three-vertex graph and its adjacency matrix, we perform zero-padding on it to make the size of the adjacency matrix become 5, as shown in FIG. 16. (b).

(197) When selecting n, a small number of graphs are sampled randomly from a given dataset. Then the connected information regularization system with different window sizes n is used to process the selected graphs and the Loss of the second adjacency matrices are compared. For the randomly selected graphs, the window size n that minimizes the average Loss of the second adjacency matrices is selected as the window size of the dataset.

(198) For each graph, after zero-padding is performed to get the first adjacency matrix, the first adjacency matrix is processed using the processing flow shown in FIG. 30. First the greedy algorithm in Embodiment 1 is used to regularize the connection information and generate features of graph. In the process of feature generation, nf0 filter matrices are used to filter according to the way described in Embodiment land output to the stacked CNN module. The first convolution result P1 is obtained in the stacked CNN module through the first convolution submodule and the value of the vector represents the possibility that the subgraph structure appears at various positions in the graph. And then repeatedly adding more convolution submodules, we can get more convolution results P.sup.2, P.sup.3, . . . , P.sup.m. The deeper convolution submodule is, the larger and more complex the subgraph represented by the convolution result. Table 1 describes the size, number of filter matrices and the size of the generated features in each convolution submodule. The diagonal convolution represents the feature generation module, and the convolution layer m is the m-th convolution submodule. It should be noted that stacking each convolution submodule in CNN requires setting the height of the filter matrix (i.e. the number of rows in the filter matrix) the same as the number of filter matrix in the previous convolution submodule (i.e. the number of vectors output by previous convolution submodule). For example, for the convolution submodule 2, the filter matrix size is n1×s2, which means that the filter matrix height is the same as the number of filter matrixes (n1) in the convolution submodule 1.

(199) Formally, for i-th convolution layer, we take feature in size of P.sup.i−1 in size of n.sub.i−1×(|V|−n+1) as input, extend it with zero-padding (s.sub.i−1)/2 on the left and zero-padding (s.sub.i−1)/2 on the right and get the {circumflex over (P)}.sup.i−1 in size of n.sub.i−1×(|V|−n+s.sub.i). Then we apply n.sub.i filters F.sup.i in size of (n.sub.i−1×s.sub.i), and get the feature. We define the elements of as follows:
P.sub.j,k.sup.i=α( custom character F.sup.i,j,{circumflex over (P)}.sub.[1:n.sub.i−1.sub.,n:n+s.sub.i.sub.−1])

(200) In the formula, α(.Math.) denotes an activation function, such as sigmoid. And j, k denotes the position of the element in P.sup.j, j-th row and the k-th column. S.sub.i denotes the width of the filter matrix in the i-th convolution layer, and n.sub.i denotes the number of filter matrixes in the i-th convolution layer.

(201) TABLE-US-00001 TABLE 1 Configuration and Feature Size in Each Layer of Graph Classification System Number of Filter Zero- Schema Filters size padding Feature size Input |V| × |V Diagonal n.sub.f.sub.0 n × n 0 n.sub.f.sub.0 × Convolution (|V| − n + 1) Convolution n.sub.f.sub.1 n.sub.f.sub.0 × s.sub.1 − 1 n.sub.f.sub.1 × Layer 1 s.sub.1 (|V| − n + 1) Convolution n.sub.f.sub.2 n.sub.f.sub.1 × s.sub.2 − 1 n.sub.f.sub.2 × Layer 2 s.sub.2 (|V| − n + 1) Convolution n.sub.f.sub.3 n.sub.f.sub.2 × s.sub.3 − 1 n.sub.f.sub.3 × Layer 3 s.sub.3 (|V| − n + 1) . . . . . . . . . . . . . . . Convolution n.sub.f.sub.m n.sub.f.sub.(m−1) × s.sub.m − 1 n.sub.f.sub.m × Layer m s.sub.m (|V| − n + 1) Pooling Layer n.sub.f.sub.m Hidden Layer n.sub.f.sub.m Output K

(202) After going deeper through the m convolution layers with system supplied parameter m, we obtain the deep feature set P.sup.0, . . . , P.sup.m. Pooling submodule is applied to perform pooling operation on each convolution result and max-pooling is taken here. We add the pooling layer for each deep feature set P.sup.j where i from 0 to m. For P.sup.j whose size is n.sub.i−1×(|V|−n+1), we take max-pooling on each row. Therefore, we get a vector of size n.sub.i−1×1.

(203) FIG. 17 shows the relationship between convolution submodules and pooling submodules in a stacked CNN, where the arrows indicate the direction of data flow between the modules. The hidden layer unit is a fully connected layer. The neurons in the fully connected layer have a complete connection with all the activation values of the previous layer. The weight parameter W.sub.h and the bias parameter b.sub.h are set in this layer to calculate for the input and get the activation value. And the dropout is set to prevent the neural network from overfitting. The dropout refers to that in a deep learning network training process, the neurons are temporarily discarded from the network with a certain probability, and the dropout can effectively prevent overfitting.

(204) In the classification unit, we perform multinomial logistic regression through another full connection on weight parameter W.sub.s, bias parameter b.sub.s and softmax function. The softmax function computes the probability distribution over the vector x of class labels and labels the graph with the label corresponding to highest probability in the result.

(205) The neural network training in the system is achieved by minimizing the cross-entropy loss. Its formula is:

(206) $C = - \log {.Math.}_{i = 1}^{.Math. ℛ .Math.} \Pr (y_{i} .Math. A_{i})$

(207) Where |R| is the total number of graphs in the training set R, A.sub.i denotes the adjacency matrix of the i-th graph in R, y.sub.i denotes the i-th class label in x. The parameters are optimized with stochastic gradient descent (SGD). The backpropagation algorithm is employed to compute the gradients.

(208) In order to evaluate the effect of the present disclosure, five open graph datasets were used for testing. Three bioinformatics datasets: MUTAG, PTC and PROTEINS are used in our experimental evaluation. MUTAG is a dataset with 188 nitro compounds where classes indicate whether the compound has a mutagenic effect on a bacterium. PTC is a dataset of 344 chemical compounds that reports the carcinogenicity for male and female rat. PROTEINS is a collection of graphs, in which nodes are secondary structure elements and edges indicate neighborhood in the amino-acid sequence or in 3D space. In addition, two social network datasets, IMDB-BINARY and IMDB-MULTI, are also used in our experimental comparison. IMDB-BINARY is a movie collaboration dataset where actor/actress and genre information of different movies are collected on IMDB. For each graph, nodes represent actors/actress and the edge connected between them if they appear in the same movie. The collaboration network and ego-network for each actor/actress are generated. The ego-network is labeled with that the genre it belongs to. IMDB-MULTI is the multi-class version since a movie can belong to several genres at the same time. IMDB-BINARY is the binary class version which has the set of ego-networks derived from Comedy, Romance and Sci-Fi genres.

(209) Based on the above data sets, two different implementations of the stacked CNN-based graph classification system of the present disclosure are used for verification. The first implementation uses one independent pooling module and one convolution pooling module; The second graph classification system uses an independent pooling module and 4 convoluted submodules. We set a parameter n from 3 to 17. Also the filter size s.sub.i used at each convolution layer is tuned from {3, 5, 7, 9, 11, 13, 15, 17, 19}. The number of convolution filters is tuned from {20, 30, 40, 50, 60, 70, 80} at each layer. The convergence condition is set to the accuracy difference of less than 0.3% from the previous iteration at the training phase or the number of iterations exceeding 30. The test set and training set are randomly sampled based on the ratio of 3:7 in each experiment.

(210) Given the test collection of graphs in size of N, each graph G.sub.i with class label y, and predicted class ŷ.sub.i by classifier, the accuracy measure is formalized as follows:

(211) $Accuracy = \frac{{.Math.}_{i = 1}^{N} δ (y_{i} = {\hat{y}}_{i})}{N}$

(212) where the indicator function δ(.Math.) gets value “1” if the condition is true, and gets value “0” otherwise.

(213) Comparing the present disclosure with three representative methods: DGK(Deep graph kernels, Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015: 1365-1374), PSCN(Learning convolutional neural networks for graphs, Proceedings of the 33 rd International Conference on Machine Learning, New York, N.Y., USA, 2016, 2014-2023) and MTL(Joint structure feature exploration and regularization for multi-task graph classification, IEEE Transactions on Knowledge and Data Engineering, 2016, 28(3): 715-728). Table 2 shows the characteristics of the five datasets used and summarizes the average accuracy and standard deviation of the comparison results. All the examples were run ten times in the same setup.

(214) TABLE-US-00002 TABLE 1 Properties of the datasets and accuracy for disclosure and 3 state-of-the-art approaches IMDB- IMDB- Datasets MUTAG PTC PROTEINS BINARY MULTI Number of 188 344 1113 1000 1500 Graphs Number of 2 2 2 2 3 classes Max Vertices 28 109 620 136 89 Number Avg Vertices 17.9 25.5 39.1 19.77 13 Number DGK 82.94 ± 2.68 59.17 ± 1.56 73.30 ± 0.82 66.96 ± 0.56 44.55 ± 0.52 (5 s) (30 s) (143 s) PSCN 92.63 ± 4.21 60.00 ± 4.82 75.89 ± 2.76 71.00 ± 2.29 45.23 ± 2.84 (3 s) (6 s) (30 s) MTL 82.81 ± 1.22 54.46 ± 1.61 59.74 ± 2.11 59.50 ± 3.23 36.53 ± 3.23 (0.006 s) (0.045 s) (0.014 s) The First 92.32 ± 4.10 62.50 ± 4.51 74.99 ± 2.13 63.43 ± 2.50 46.22 ± 1.15 Graph (0.01 s) (0.10 s) (0.39 s) Classification System The Second 94.99 ± 5.63 68.57 ± 1.72 75.96 ± 2.98 71.66 ± 2.71 50.66 ± 4.10 Graph (0.01 s) (0.08 s) (0.60 s) Classification System
For dataset MUTAG, compared to the best result of PSCN at 92.63%, the second graph classification system (5 convolution layers) obtained the accuracy of 94.99%, higher than PSCN. the first graph classification system achieved the accuracy of 92.32%, very similar to PSCN. For PTC dataset, DGK and PSCN obtained similar accuracy measure of around 60%. The first graph classification system achieved 62.50% and the second graph classification system achieved 64.99%, which is the best accuracy to date on this dataset, with the best of our knowledge. For dataset PROTEINS, the second graph classification system achieved the highest accuracy of 75.96%, which is slightly higher than the best result of 75.89% by PSCN. For the two social network datasets, the present disclosure has a competitive accuracy result of 71.66% for IMDB-BINARY, higher than the best of PSCN at 71.00% and has achieved the highest accuracy of 50.66% for IMDB-MULTI, compared to the best of PSCN at 45% and the best of DGK at 44%.

(215) Study the impact of parameter configuration on the accuracy of the classification result and the time complexity performance of the present disclosure.

(216) Window Size n:

(217) This is the key parameter for determining how good the system of the present disclosure can cover the most significant subgraph patterns in the given graph dataset. Because a small n may result in the fact that most graphs would fail to concentrate all connection information into the diagonal area with width n. Consequently, we may loss more structural connectivity information, which can be critical for classification of graph dataset. On the other hand, a big n will lead to high computation cost and time complexity. FIG. 18(a) shows the accuracy and executing time of the present disclosure varying n on dataset MUTAG. In this experiment, the number of convolution filters is set to 50 for all experiments and the stacked convolution filter width is set to 7. The accuracy and execution time are both the average value in 10 runs with the same experimental setting. From both FIG. 18(a), FIG. 19(a) and FIG. 20(a), we observe that the accuracy is insensitive to the increase of n while the execution time is more sensitive and grows significantly as the parameter n increases from 3 to 11 for both MUTAG dataset and PTC dataset. Thus, setting smaller n is more desirable. From Table 2, we can see that the maximum number of vertices in PTC is 109, the average number of vertices is 25.5, the maximum number of vertices in PROTEINS is 620, the average number of vertices is 39.1, and the window size n is 3 to 11, so the choice of n will be far less than the number of vertices of the graph |V|.

(218) Stacked Convolution Filter Width s.sub.i:

(219) For convenience, we set the same width for all layers to simply the discussion. Setting a larger width s.sub.i means that each filter can capture more complex subgraph structure features. Also, the complex subgraph structure features have higher possibility in combination. However, it is also hard to determine the filter width to cover all the possible combinations. In this embodiment, we set n=7, filter number by 50 and vary filter width from 3 to 15. Note that due to zero-padding, we can only use the filter with odd value, namely 3, 5, 7, 9, 11, 13, 15. We also performed 10 runs for each measurement collected under the same setting and take the average value in accuracy and executing time. FIG. 18(b), FIG. 19(b) and FIG. 20(b) illustrate the results on MUTAG, PTC and Proteins respectively. It shows that on MUTAG, the accuracy grows as we increase filter width from 3 to 9 and become more stable as we increase the filter width from 9 to 15. This indicates that 9 is an approximately optimal setting of filter width because the running time on 9 is smaller than that on the filter width of 9 and 15. Similar to MUTAG, PTC dataset shows that the best setting of the filter width is 7, because setting filter width as 9,11 and 13 respectively gets similar accuracy but takes longer running time compared to small filter width of 7. While in Proteins dataset, namely FIG. 20(b), we can see that optimal filter width is 11.

(220) Filter Number n.sub.f.sub.i

(221) Similar to filter width, we set the same filter number for all convolution layers, including diagonal convolution layer and stacked convolution layers. In this experiment, we set n by 7, filter width by 7 and vary filter number from 20 to 80. Each measurement is collected by 10 runs and the average value of accuracy and running time are reported. FIG. 18(c) shows the result on MUTAG and FIG. 19(c) shows the result of PTC. And FIG. 20(c) shows the result in PROTEINS. We make an interesting observation: a larger filter number, for example, 60 in FIG. 19(c), may result in much worse classification accuracy for both datasets. This is because the more filters are used, the more weights need to be trained. Thus, it is easier to get overfitted in training with larger filter number.

(222) Convolution Layer Number

(223) For better observing the efficiency and effectively of the present disclosure on different convolution layer number, the number of convolution layers on the MUTAG, PTC, and PROTEINS is set to 1 to 5 in this embodiment. FIG. 18(d), FIG. 19(d) and FIG. 20(d) illustrate the accuracy and executing time of our approach on MUTAG, PTC and Proteins, respectively. Note that in this embodiment, all other parameters are fixed as the default value. n and filter width are set as 7, filter number is set as 50. An interesting fact is that, without tuning other parameters, increasing convolution layer number will not increase the accuracy explicitly. In FIG. 18(d), the accuracy on 5-convolution layer is similar to 2-convolution layer version. It is because without increasing the filter number and filter width, the deeper convolution network cannot take advantage of its capacity in representing more complex features. In FIG. 20(d), the accuracy on 5-convolution layer is even worse than 2-convolution layer version. It means that the current parameter setting in n, filter width and number works well on 2-convolution layer and limits the performance on 5-convolution. Therefore, in this situation, we need enlarge the other parameters for 5-convolution layer version on PROTEINS dataset.

(224) Dropout Ratio

(225) The previous embodiments have shown that increasing the filter matrix width, filter matrix size and number and number of convolution layers may not improve performance. The next set of embodiments investigates the effect of overfitting by using the dropout ratio in batch normalization. The batch normalization is a technique for maintaining the same distribution of input of each layer of the neural network during the deep neural network training process, which can help the neural network to converge. FIG. 21 shows the results on MUTAG and PTC. The x-axis varies the dropout ratio, the left y-axis measures the accuracy and the right y-axis measures the running time. FIG. 21(a) shows that the accuracy increases when the dropout ratio is from 0 to 0.2 and the accuracy reduces when the dropout ratio is from 0.2 to 0.9 for MUTAG. FIG. 21(b) shows the measurements for PTC: the accuracy is stable when dropout ratio is from 0 to 0.4, increases when the dropout ratio is from 0.4 to 0.5, and decreases slightly when the dropout ratio is from 0.5 to 0.9. This set of experiments indicates that when the dropout ratio is set to 0.2, the present disclosure get the best fit on MUTAG and the optimal dropout ratio for PTC is 0.5.

(226) The present disclosure proposes a graph feature extraction system based on adjacency matrix, concentrating the connection information elements in an adjacency matrix and extracting features. The system is compared here with common CNN without connection information regularization system. For naïve CNN, a 2-dimension convolution layer is applied on adjacency matrix and the pooling layers are 2-dimension pooling. The configuration of the embodiment is n=7, filter width as 7 and filter number as 50, for both the present disclosure and common version. The results are reported in FIG. 22. FIG. 22(a) is the accuracy on these two approaches. We can see that the accuracy of the present disclosure gets higher. In FIG. 22(b), the computing time of common without regularization system version is larger than the present disclosure. It means that the present disclosure gets a higher accuracy and lower running time.

(227) Convergence

(228) FIGS. 23, 24, and 25 are the loss convergence process on training and validation set for MUTAG, PTC and PROTEINS. The grey line is the loss on training set and blue line is loss on validation set. It can be seen that in both three datasets, the loss reduces at first and get stable from 30 epochs. And just like most machine learning approaches, especially neural network, the loss on training set can get a lower value than validation set. It is because the training procedure apply Stochastic Gradient Descent on loss of training set not the validation set.

(229) Feature Training

(230) This embodiment is performed on the MUTAG dataset, with n set to 7, filter width set to 7 and filter number set to 20. FIG. 26 reports the results, in which the x-axis is the epoch number from 0 to 30. Epoch=0 means the initial value, which is sampled from a normal distribution. FIG. 26(c) shows the raw filter value, which is a 7×7 matrix. Each cell representing the corresponding position in the filter matrix. The darker the cell is, the bigger the value is. In other words, the darkest cell has the value closer to 1 while the white cell has the value more close to −1, the grey cell has the value around 0. In initial stage, more cells are grey, with values around 0. As moving forward with the training procedure, some dark cells become lighter and some light cells become darker, especially in the left top part. While the darkest cells, on the right bottom part, keep dark during the training. It means that these cells play important roles in classification of the given dataset of graphs. This is because the back propagation only modifies the cells that are non-contributing o the classification of the input graph. For better understanding the subgraph structure, FIG. 16 draws the positive subgraph and negative subgraph in FIG. 16(a) and FIG. 16(b) respectively. The positive subgraph is drawn by setting the cell as 1 if its value is bigger than 0 and as 0 if its value is smaller or equal to 0. This subgraph is called a positive subgraph because it represents the edges that should appear. In the contrast, the negative subgraph is drawn by setting the cell as 1 if its value is smaller or equal to 0 and as 1 if its value is bigger than 0. The negative subgraph denotes the edges that should not appear. It can be seen that, both positive graph and negative graph do change gradually from the initial state in the training procedure and arrive at the stable structures at the end of the training. It means that the training procedure eventually reaches the convergence.

(231) Feature Visualization

(232) FIG. 27 illustrates the subgraph features captured in different convolution layers. FIG. 27(a) presents the input graph of 12 vertices. In this embodiment, the second classification system (5 convolution layer) is used, with the window-size of n=4, the diagonal convolution filer of size 4×4 and the rest 4 convolution layer filters in size of 3. Thus, the feature size in each layer is 4, 6, 8, 10, 12. FIG. 27(b), (c), (d), (e), (f) show the patterns learned at each of the five convolution layers respectively. The adjacency matrix shows the existing probability of each edge, the darker the cell is, the higher probability that the corresponding edge is captured by this filter. In the first layer shown in FIG. 27(b), only the basic four vertex patterns can be handled. Moving forward to the second layer shown in FIG. 27(c), the filters can capture and represent the six-vertex patterns, which are composed by the first layer features. By further adding more convolution layers, the more complicated subgraph patterns can be captured and represented. Finally, in FIG. 27(f), the 12-vertex feature is captured, which is quite similar to the original input graph in FIG. 27(a).

(233) Finally, an embodiment is provided to mainly explain the important feature of the graph classification system based on adjacency matrix proposed by the present disclosure: Capturing a large multi-vertex subgraph structure using a smaller window.

(234) Taking a graph consisting of ten vertices (|V|=10) as an example, FIG. 28 shows the physical meaning of using the feature generation module on this graph. It is observed that the graph has two rings of size six vertices, and two vertices are shared by these two ring structures. To capture such a ring-based graph pattern, existing approaches usually require having the window size larger than 10. However, the method of the present disclosure can be effective even when the window size n is as small as six. Consider the original graph on the top left in FIG. 28, we sort the vertices by the connection information regularization system with n=6 and get the order labeled graph on the top right. We use a, b, c, d, e, f, g, h, i to denote the sequence of sorted vertices. Then diagonal convolution with filters in size of 6×6 is performed, namely n=6. The filter can move by |V|−n+1=10−6+1=5 steps. The five figures in the center of FIG. 28 shows how each of the five filters covers (captures) the different patterns of the graph in each step. For example, in the first step, the filter stops at A n and it covers all the connections [1:6,1:6] between any pair of vertices marked by a, b, c, d, e, f. As shown in step 1 of FIG. 28, the filter highlighted by the dash line, covers the ring consisting of vertices a, b, c, d, e, f. More interestingly, using the diagonal convolution operation, different subgraph structures (features) can be captured by the same filter. For instance, steps 1 and 5 capture the same graph structure: the six-vertex ring. At the same time, steps 2,3 and 4 capture another same type of graph structure: the six-vertex line.

(235) More specifically, FIG. 29 gives a numerical example to describe the features captured by the feature generation module and the features captured in the stacked CNN. FIG. 29(a) shows a 12-vertex graph and the second adjacency matrix of the graph. The graph contains two rings of six vertexes, and the two vertexes are shared by the two ring structures. The elements of the adjacency matrix and the filter matrix blank in FIG. 29 indicate that the value is 0. In order to simplify the calculation of the values of the elements in the filter matrix, 0 or 1 is selected. FIG. 29(b) shows two filter matrices in the feature generation module and the corresponding subgraph structure is shown in FIG. 29(c). Using the two filter matrices of FIG. 29(b) to perform filtering operations along the diagonal of the second adjacency matrix of the graph, the vector can be calculated as shown in FIG. 29(d). The elements enclosed by the dashed lines are zero-padding. The filter matrix in the stacked CNN is shown in FIG. 29(e). To simplify the calculation, the elements are also 0 or 1. Using the filter matrix in the stacked CNN to filter the captured features (FIG. 29(d)), the resulting vector is shown in FIG. 29(h). Considering the physical meaning represented by the stacked filter matrix in CNN, it represents the combination of the subgraph structures captured by the feature generation module. Therefore, the filter matrix of the feature generation module can be stacked according to the value of the filter matrix in the stack CNN. FIG. 29(i) shows. An adjacency matrix represented by a filter matrix in a stacked CNN is obtained as shown in FIG. 29(f). FIG. 29(g) is a subgraph structure represented by a filter matrix in a stacked CNN. It can be seen that in FIG. 29(g), a double ring with ten vertices and one with six vertex rings and 4 vertices.

(236) The graph classification system based on adjacency matrix proposed by the present disclosure can capture the large multi-vertex subgraph structure and the deep features of the implicit correlation structure from the vertices and edges through a smaller window, thereby improving the classification accuracy.

System and method of connection information regularization, graph feature extraction and graph classification based on adjacency matrix

Assignee

Inventors

Cpc classification

Classification Explorer

G06V20/30

PHYSICS

Classification Explorer

G06N3/0418

PHYSICS

Classification Explorer

G06V10/454

PHYSICS

Classification Explorer

G06N3/084

PHYSICS

Classification Explorer

G06V10/84

PHYSICS

Classification Explorer

G06F16/9024

PHYSICS

Classification Explorer

G06F16/906

PHYSICS

Classification Explorer

G06F18/29

PHYSICS

Classification Explorer

G06N3/045

PHYSICS

Classification Explorer

G06F18/2415

PHYSICS

Classification Explorer

G06V10/764

PHYSICS

Classification Explorer

G06F18/2323

PHYSICS

Classification Explorer

G06N7/01

PHYSICS

Classification Explorer

G06V10/82

PHYSICS

Classification Explorer

G06V10/774

PHYSICS

Classification Explorer

G06F17/10

PHYSICS

Classification Explorer

G06F18/24143

PHYSICS

Classification Explorer

G06V10/426

PHYSICS

International classification

Classification Explorer

G06K9/62

PHYSICS

Classification Explorer

G06F16/906

PHYSICS

Classification Explorer

G06N3/04

PHYSICS

Classification Explorer

G06F16/901

PHYSICS

Abstract

Claims

Description