Method for reconstructing a 3D object based on dynamic graph network

11715258 · 2023-08-01

Abstract

The present invention provides a method for reconstructing a 3D object based on a dynamic graph network. First, a plurality of feature vectors is obtained from a 2D image I of an object. Then input data are prepared: an initial ellipsoid mesh is predefined, a feature input X is obtained by filling initial features, and a relationship matrix A corresponding to the feature input X is created. Next, the feature input X and the corresponding relationship matrix A are input to a dynamic graph network, which integrates and deduces each vertex's features; a new relationship matrix is thereby obtained and used for the subsequent graph convolutions. This improves the initial graph information and adapts it to the mesh relations of the corresponding object, thereby improving the accuracy and effect of the 3D object reconstruction. Last, the vertex positions are regressed, the 3D structure of the object is deduced, and the 3D object reconstruction is completed.

Claims

1. A method for reconstructing a 3D object based on a dynamic graph network, comprising: (1): image feature extraction, comprising: extracting image features from a 2D (2-dimensional) image I of an object, wherein the extracted image features constitute a feature map, and the feature map comprises N feature vectors of D dimensions and N center coordinates of the image areas respectively corresponding to the N feature vectors, where N and D are natural numbers, the N feature vectors are denoted by F.sub.n, n=1, 2, . . . , N, feature vector F.sub.n is a column vector, and the center coordinate of the image area corresponding to feature vector F.sub.n is denoted by (x.sub.n, y.sub.n), the center coordinate (x.sub.n, y.sub.n) is taken as the feature vector coordinate of feature vector F.sub.n, where x.sub.n is the horizontal coordinate of the feature vector coordinate, and y.sub.n is the vertical coordinate of the feature vector coordinate; (2): input data preparation for a dynamic graph network, comprising: predefining an initial ellipsoid mesh, which comprises N vertices and a plurality of edges, where the coordinate of the k.sup.th vertex is (x′.sub.k, y′.sub.k, z′.sub.k), and x′.sub.k, y′.sub.k, z′.sub.k respectively are the x coordinate, y coordinate and z coordinate of the k.sup.th vertex, where k=1, 2, . . . , N; filling initial features: in the feature map, finding feature vector coordinate (x.sub.k′, y.sub.k′), k′∈{1, 2, . . . , N}, which has the nearest distance from the coordinate (x′.sub.k, y′.sub.k) of the k.sup.th vertex, and combining feature vector F.sub.k′ and coordinate (x′.sub.k, y′.sub.k, z′.sub.k) of the k.sup.th vertex into a feature vector, which is denoted by X.sub.k, where the number of dimensions of feature vector X.sub.k is c.sub.1, c.sub.1=D+3, and a feature input X is obtained, X={X.sub.1, X.sub.2, . . . , X.sub.N}; creating a relationship matrix corresponding to the feature input X, which is denoted by A, A=(A.sub.ij).sub.N×N, i=1, 2, . . . , N, j=1, 2, . . . , N; wherein if there is an edge connecting the i.sup.th vertex and the j.sup.th vertex, or a neighborhood relationship exists between the i.sup.th vertex and the j.sup.th vertex, element A.sub.ij=1; otherwise, element A.sub.ij=0; (3): feature mapping and convolution in a dynamic graph network, wherein the dynamic graph network (dynamic graph convolutional neural network) comprises a dynamic graph learning layer and two graph convolution layers, comprising: 3.1): in the dynamic graph learning layer, first performing a feature mapping for feature input X by a learnable parameter θ, wherein for the i.sup.th feature vector X.sub.i of feature input X, feature vector h.sub.i is obtained: h.sub.i=θ.sup.TX.sub.i, where the learnable parameter θ is a matrix with a size of c.sub.1×c.sub.2, and c.sub.2 is a natural number; then measuring the distances between vertex θ.sup.TX.sub.i and vertex θ.sup.TX.sub.j, and obtaining a relationship matrix, which is denoted by S, S=(S.sub.ij).sub.N×N, where element S.sub.ij of relationship matrix S is: S.sub.ij=exp{−d.sup.2(θ.sup.TX.sub.i, θ.sup.TX.sub.j)}/Σ.sub.n=1.sup.N exp{−d.sup.2(θ.sup.TX.sub.i, θ.sup.TX.sub.n)}; where d.sup.2( ) is a distance measure function used to calculate the distance between two feature vectors, and exp{ } is an exponential function; then normalizing relationship matrix S, where the normalized element S̄.sub.ij is: S̄.sub.ij=S.sub.ij/Σ.sub.n=1.sup.N S.sub.in; retaining the K largest elements of the N elements S.sub.i1, S.sub.i2, . . . , S.sub.iN of the i.sup.th row, and setting the remaining elements of the i.sup.th row to 0, thus a relationship matrix denoted by S is obtained by the retaining and setting of the N rows, where S=(S.sub.ij).sub.N×N, and K is a natural number less than N; last, integrating relationship matrix S with relationship matrix A into a new relationship matrix Â: Â=(1−η)A+ηS; where η is a hyper-parameter used to balance relationship matrix A and relationship matrix S, and η is a decimal less than 1; 3.2): in the two graph convolution layers, performing two graph convolutions for feature input X, thus a feature output denoted by Z is obtained according to the following equation: Z=σ(Âσ(ÂXW.sup.(1))W.sup.(2)); where feature output Z is a matrix consisting of N column vectors, σ(ÂXW.sup.(1)) is the output of the first graph convolution layer and is taken as the input of the second graph convolution layer, W.sup.(1) is a learnable linear parameter of the first graph convolution layer, W.sup.(2) is a learnable linear parameter of the second graph convolution layer, and σ( ) is an activation function; (4): linear regression mapping in a 3D coordinate regression layer, comprising: sending the N column vectors Z.sub.i, i=1, 2, . . . , N, of feature output Z as input to a 3D coordinate regression layer to perform linear regression mapping, wherein the 3D coordinate regression layer outputs 3-dimensional coordinate vectors P.sub.i, i=1, 2, . . . , N, which respectively correspond to the predicted 3D coordinates of the N vertices; (5): dynamic graph network training, comprising: 5.1): creating a graph learning loss function L.sub.graph: L.sub.graph=Σ.sub.i=1.sup.N Σ.sub.j=1.sup.N |Z.sub.i−Z.sub.j|.sup.2S.sub.ij+∥S−A∥.sub.F.sup.2; where Z.sub.i, Z.sub.j are respectively the i.sup.th, j.sup.th column vectors of feature output Z, |Z.sub.i−Z.sub.j| means calculating the Euclidean distance between the i.sup.th and j.sup.th column vectors, and ∥S−A∥.sub.F means calculating the Frobenius norm of S−A; 5.2): continuously inputting a 2D image I of a different object, and processing the 2D image I according to steps (1), (2) and (3), then updating learnable parameter θ of the dynamic graph learning layer and learnable linear parameters W.sup.(1), W.sup.(2) of the two graph convolution layers by adopting a gradient descent algorithm to perform a back propagation according to graph learning loss function L.sub.graph, until the decrease of the value of graph learning loss function L.sub.graph stops, completing the dynamic graph network training; (6): graph convolution layer training, comprising: after the dynamic graph network training is complete, continuously inputting a 2D image I of a different object, and processing the 2D image I according to steps (1), (2), (3) and (4), then updating the network parameters of the 3D coordinate regression layer by measuring the distance between the predicted 3D coordinates and the true 3D coordinates of the N vertices and adopting a gradient descent algorithm to perform a back propagation according to a Chamfer distance loss function, until the decrease of the value of the Chamfer distance loss function stops, completing the 3D coordinate regression layer training; (7): 3D object reconstruction, comprising: after the dynamic graph network training and the graph convolution layer training are completed, inputting a 2D image I of an object, and obtaining the predicted 3D coordinates of the N vertices by processing the 2D image I according to steps (1), (2), (3) and (4), thus the 3D structure of the object is deduced and the 3D object reconstruction is completed.

2. The method for reconstructing a 3D object based on the dynamic graph network of claim 1, further comprising pre-processing 2D image I before the image feature extraction, including enhancing, cropping and making uniform 2D image I, wherein a detailed cropping and unifying are as follows: (a) cropping 2D image I into an image with a fixed size of 256×256, removing the color of a background area by using an edge detection algorithm, and filling the background area with green color; and (b) normalizing 2D image I to make the pixel values uniform, wherein the normalizing makes the pixel values conform to a Gaussian distribution.

Description

BRIEF DESCRIPTION OF THE DRAWING

(1) The above and other objectives, features and advantages of the present invention will be more apparent from the following detailed description taken in conjunction with the accompanying drawings, in which:

(2) FIG. 1 is a diagram of an end-to-end deep learning architecture used for 3D object reconstruction in the prior art;

(3) FIG. 2 is a flow diagram of a method for reconstructing a 3D object based on dynamic graph network in accordance with the present invention;

(4) FIG. 3 is a diagram of the input data preparation for dynamic graph network shown in FIG. 2;

(5) FIG. 4 is a flow diagram of the 3D object reconstruction shown in FIG. 2;

(6) FIG. 5 is a diagram of the 3D object reconstruction shown in FIG. 2;

(7) FIG. 6 shows six 3D renderings reconstructed in accordance with the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

(8) Hereinafter, preferred embodiments of the present invention will be described with reference to the accompanying drawings. It should be noted that the similar modules are designated by similar reference numerals although they are illustrated in different drawings. Also, in the following description, a detailed description of known functions and configurations incorporated herein will be omitted when it may obscure the subject matter of the present invention.

Embodiment

(9) FIG. 2 is a flow diagram of a method for reconstructing a 3D object based on dynamic graph network in accordance with the present invention.

(10) In one embodiment, as shown in FIG. 2, a method for reconstructing a 3D object based on a dynamic graph network is provided, comprising the following steps:

(11) Step S1: image feature extraction

(12) Extracting image features from a 2D (2-dimensional) image I of an object, wherein the extracted image features constitute a feature map; the feature map comprises N feature vectors of D dimensions and N center coordinates of the image areas respectively corresponding to the N feature vectors. The N feature vectors are denoted by F.sub.n, n=1, 2, . . . , N, feature vector F.sub.n is a column vector, and the center coordinate of the image area corresponding to feature vector F.sub.n is denoted by (x.sub.n, y.sub.n); the center coordinate (x.sub.n, y.sub.n) is taken as the feature vector coordinate of feature vector F.sub.n, where x.sub.n is the horizontal coordinate of the feature vector coordinate and y.sub.n is the vertical coordinate of the feature vector coordinate.

(13) In the embodiment, the 2D (2-dimensional) image I of an object needs to be pre-processed before performing the image feature extraction, by enhancing, cropping and unifying 2D image I. The detailed cropping and unifying are as follows: (a) cropping 2D image I into an image with a fixed size of 256×256, removing the color of the background area by an edge detection algorithm, and filling the background area with green color; (b) normalizing 2D image I to make the pixel values uniform, which makes the pixel values conform to a Gaussian distribution.
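The cropping and Gaussian normalization described above can be sketched as follows. This is a minimal illustration only: the crop position (center crop) is an assumption, and the edge-detection-based background removal is omitted.

```python
import numpy as np

def preprocess(image):
    """Center-crop to 256x256 and normalize pixels to zero mean and unit
    variance (the Gaussian normalization described above). Background
    removal via edge detection is omitted from this sketch."""
    h, w = image.shape[:2]
    top, left = (h - 256) // 2, (w - 256) // 2
    cropped = image[top:top + 256, left:left + 256].astype(np.float64)
    return (cropped - cropped.mean()) / (cropped.std() + 1e-8)

# Toy input standing in for 2D image I.
img = np.random.default_rng(0).integers(0, 256, size=(300, 320, 3))
out = preprocess(img)
```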

(14) In the embodiment, the pre-processed 2D (2-dimensional) image I is sent to a ResNet50 residual network model, trained on the ImageNet image database, for image feature extraction. The ResNet50 model comprises multiple 3×3 convolutional layers, activation function layers and batch normalization layers, and also contains short-circuit (skip) connections that connect the upper-layer features to the lower-layer features. The ResNet50 model usually consists of multiple blocks; however, in the embodiment, the last block has been removed to obtain feature vectors corresponding to image areas of 2D image I. A feature map constituted by N=16×16=256 feature vectors of D=512 dimensions is then obtained. Namely, the feature map comprises 256 feature vectors F.sub.1, F.sub.2, . . . , F.sub.256 of 512 dimensions and 256 center coordinates of the image areas respectively corresponding to the 256 feature vectors.
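The arrangement of the feature map into N = 256 feature vectors with their area-center coordinates can be sketched as below. The feature map itself is a random placeholder (the grid size 16×16 and depth 512 follow the embodiment; the backbone is not reproduced here).

```python
import numpy as np

# Hypothetical 16x16x512 feature map, standing in for the output of the
# truncated CNN backbone applied to a 256x256 input.
rng = np.random.default_rng(1)
fmap = rng.standard_normal((16, 16, 512))

# Flatten to N = 256 feature vectors: row n is feature vector F_n.
F = fmap.reshape(-1, 512)

# Each feature vector covers a 16x16-pixel area of the 256x256 image;
# its feature vector coordinate (x_n, y_n) is the center of that area.
cell = 256 // 16
ys, xs = np.meshgrid(np.arange(16), np.arange(16), indexing="ij")
centers = np.stack([(xs.ravel() + 0.5) * cell,
                    (ys.ravel() + 0.5) * cell], axis=1)
```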

(15) Step S2: input data preparation for dynamic graph network

(16) As shown in FIG. 3, predefining an initial ellipsoid mesh, which comprises N vertices and a plurality of edges, where the coordinate of the k.sup.th vertex is (x′.sub.k, y′.sub.k, z′.sub.k), and x′.sub.k, y′.sub.k, z′.sub.k respectively are the x coordinate, y coordinate and z coordinate of the k.sup.th vertex, k=1, 2, . . . , N.

(17) Filling initial features: in the feature map, finding feature vector coordinate (x.sub.k′, y.sub.k′), k′∈{1, 2, . . . , N}, which has the nearest distance from the coordinate (x′.sub.k, y′.sub.k) of the k.sup.th vertex, and combining feature vector F.sub.k′ and coordinate (x′.sub.k, y′.sub.k, z′.sub.k) of the k.sup.th vertex into a feature vector, which is denoted by X.sub.k, where the number of dimensions of feature vector X.sub.k is c.sub.1, c.sub.1=D+3, and a feature input X is obtained, X={X.sub.1, X.sub.2, . . . , X.sub.N}. In the embodiment, c.sub.1=512+3=515.
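The initial-feature filling can be sketched as a nearest-neighbor lookup followed by concatenation. The projection of each vertex into image coordinates is a placeholder assumption (the patent does not specify the camera mapping here); the dimensions follow the embodiment (D=512, N=256, c.sub.1=515).

```python
import numpy as np

rng = np.random.default_rng(2)
D, N = 512, 256
F = rng.standard_normal((N, D))          # feature vectors F_n
centers = rng.uniform(0, 256, (N, 2))    # feature coordinates (x_n, y_n)
verts = rng.uniform(-1, 1, (N, 3))       # ellipsoid vertices (x', y', z')

# Hypothetical mapping of vertex (x', y') into image coordinates.
proj = (verts[:, :2] + 1) * 128

# For each vertex, find the nearest feature coordinate, then concatenate
# that feature vector with the vertex's 3D coordinate: X_k in R^(D+3).
d2 = ((proj[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
nearest = d2.argmin(axis=1)              # index k' for each vertex
X = np.concatenate([F[nearest], verts], axis=1)
```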

(18) Creating a relationship matrix corresponding to the feature input X, which is denoted by A, A=(A.sub.ij).sub.N×N; wherein if there is an edge connecting the i.sup.th vertex and the j.sup.th vertex, or a neighborhood relationship exists between the i.sup.th vertex and the j.sup.th vertex (the distance between the i.sup.th vertex and the j.sup.th vertex is less than a set threshold ε), element A.sub.ij=1; otherwise, element A.sub.ij=0.

(19) In the embodiment, the initial ellipsoid mesh comprises 256 vertices and 6×16 edges.
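The construction of relationship matrix A in step (18) can be sketched as below. The threshold value eps and the exclusion of self-loops are illustrative assumptions.

```python
import numpy as np

def adjacency(verts, edges, eps=0.3):
    """Relationship matrix A: A_ij = 1 if an edge connects vertices i
    and j, or if their distance is below the neighborhood threshold eps;
    otherwise A_ij = 0. Self-loops are excluded in this sketch."""
    n = len(verts)
    A = np.zeros((n, n))
    for i, j in edges:                   # mesh edges
        A[i, j] = A[j, i] = 1.0
    # Neighborhood relationship: pairwise Euclidean distance below eps.
    d = np.linalg.norm(verts[:, None, :] - verts[None, :, :], axis=-1)
    A[d < eps] = 1.0
    np.fill_diagonal(A, 0.0)
    return A

verts = np.array([[0.0, 0, 0], [0.1, 0, 0], [5.0, 0, 0]])
A = adjacency(verts, edges=[(0, 2)])
```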

(20) Step S3: feature mapping and convolution in a dynamic graph network

(21) The dynamic graph network (dynamic graph convolutional neural network) comprises a dynamic graph learning layer and two graph convolution layers.

(22) Step S3.1: in the dynamic graph learning layer, first performing a feature mapping for feature input X by a learnable parameter θ: for the i.sup.th feature vector X.sub.i of feature input X, feature vector h.sub.i is obtained as h.sub.i=θ.sup.TX.sub.i, where the learnable parameter θ is a matrix with a size of c.sub.1×c.sub.2, and c.sub.2 is the number of dimensions of the distance space; then measuring the distances between vertices, and obtaining a relationship matrix, which is denoted by S, S=(S.sub.ij).sub.N×N, where element S.sub.ij of relationship matrix S is:

(23) S.sub.ij=exp{−d.sup.2(θ.sup.TX.sub.i, θ.sup.TX.sub.j)}/Σ.sub.n=1.sup.N exp{−d.sup.2(θ.sup.TX.sub.i, θ.sup.TX.sub.n)};

(24) where d.sup.2( ) is a distance measure function used to calculate the distance between two feature vectors, exp{ } is an exponential function.

(25) Then normalizing relationship matrix S, where the normalized element S̄.sub.ij is:

(26) S̄.sub.ij=S.sub.ij/Σ.sub.n=1.sup.N S.sub.in.

(27) For the N elements S.sub.i1, S.sub.i2, . . . , S.sub.iN of the i.sup.th row, retaining the K largest elements and setting the remaining elements to 0; thus a sparse relationship matrix denoted by S is obtained by the retaining and setting of the N rows, where S=(S.sub.ij).sub.N×N and K is a natural number less than N.

(28) For each vertex, too many neighbor nodes will cause a graph over-smoothing problem during feature aggregation, making all nodes' features tend toward the same values, which can lead to network training failure. The present invention therefore makes relationship matrix S sparse: the weight of each neighbor vertex is normalized, the K nearest vertices are retained, and the remaining links are discarded.

(29) Last, integrating relationship matrix S with relationship matrix A into new relationship matrix Â:
Â=(1−η)A+ηS;

(30) where η is a hyper-parameter used to balance relationship matrix A and relationship matrix S; η is a decimal less than 1, and its value is determined according to the specific implementation.
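Step S3.1 as a whole (feature mapping, Gaussian-kernel distances, normalization, top-K sparsification, and blending with A) can be sketched as follows. Toy sizes and untrained random parameters are used; the squared Euclidean distance stands in for d.sup.2( ).

```python
import numpy as np

def dynamic_graph_layer(X, A, theta, K=8, eta=0.5):
    """Sketch of step S3.1: map features with theta, build a Gaussian-
    kernel relationship matrix S, normalize it row-wise, keep only the K
    largest entries per row, and blend with the initial matrix A."""
    H = X @ theta                                # h_i = theta^T X_i
    d2 = ((H[:, None, :] - H[None, :, :]) ** 2).sum(-1)
    S = np.exp(-d2)
    S = S / S.sum(axis=1, keepdims=True)         # S_ij (row-wise softmax)
    S = S / S.sum(axis=1, keepdims=True)         # normalization (idempotent here)
    small = np.argsort(S, axis=1)[:, :-K]        # indices of the N-K smallest entries
    np.put_along_axis(S, small, 0.0, axis=1)     # retain the K largest per row
    return (1 - eta) * A + eta * S               # A-hat = (1 - eta) A + eta S

rng = np.random.default_rng(3)
N, c1, c2 = 6, 5, 4
X = rng.standard_normal((N, c1))
A = (rng.random((N, N)) < 0.3).astype(float)
theta = rng.standard_normal((c1, c2))
A_hat = dynamic_graph_layer(X, A, theta, K=3, eta=0.4)
```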

(31) Step S3.2: in the two graph convolution layers, performing two graph convolutions for feature input X, thus a feature output denoted by Z is obtained according to the following equation:
Z=σ(Âσ(ÂXW.sup.(1))W.sup.(2));

(32) where feature output Z is a matrix consisting of N column vectors, σ(ÂXW.sup.(1)) is the output of the first graph convolution layer and is taken as the input of the second graph convolution layer, W.sup.(1) is the learnable linear parameter of the first graph convolution layer, W.sup.(2) is the learnable linear parameter of the second graph convolution layer, and σ( ) is an activation function.

(33) Here, multiplying feature input X by learnable linear parameter W.sup.(1) is a linear mapping, and multiplying by relationship matrix  is a feature aggregation of neighbor vertices.
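The two-layer graph convolution of step S3.2 can be sketched as below, with ReLU standing in for the unspecified activation σ and random placeholders for the trained parameters.

```python
import numpy as np

def gcn_forward(A_hat, X, W1, W2):
    """Two stacked graph convolutions: Z = relu(A_hat relu(A_hat X W1) W2).
    Multiplying by A_hat aggregates neighbor features; multiplying by
    W1/W2 is the learnable linear mapping; ReLU plays the role of sigma."""
    relu = lambda M: np.maximum(M, 0.0)
    return relu(A_hat @ relu(A_hat @ X @ W1) @ W2)

rng = np.random.default_rng(4)
N, c1, hidden, c_out = 8, 10, 16, 3
X = rng.standard_normal((N, c1))
A_hat = rng.random((N, N))                  # stand-in relationship matrix
W1 = rng.standard_normal((c1, hidden))
W2 = rng.standard_normal((hidden, c_out))
Z = gcn_forward(A_hat, X, W1, W2)           # row i is the output feature of vertex i
```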

(34) Step S4: linear regression mapping in 3D coordinate regression layer

(35) Sending the N column vectors Z.sub.i, i=1, 2, . . . , N, of feature output Z as input to a 3D coordinate regression layer to perform linear regression mapping, wherein the 3D coordinate regression layer outputs 3 dimensional coordinate vectors P.sub.i, i=1, 2, . . . , N, which respectively correspond to the predicted 3D coordinates of the N vertices.
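The 3D coordinate regression layer, viewed as a single linear mapping from each feature vector Z.sub.i to a predicted coordinate P.sub.i, can be sketched as below; the weights here are random placeholders, not trained parameters.

```python
import numpy as np

# Linear regression mapping: each feature vector Z_i of the feature
# output is mapped to a predicted 3D coordinate P_i = W^T Z_i.
rng = np.random.default_rng(5)
N, c_out = 256, 64
Z = rng.standard_normal((N, c_out))      # row i stands in for Z_i
W = rng.standard_normal((c_out, 3))      # placeholder regression weights
P = Z @ W                                # row i is the predicted 3D coordinate P_i
```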

(36) Step S5: dynamic graph network training

(37) Step S5.1: creating a graph learning loss function L.sub.graph:

(38) L.sub.graph=Σ.sub.i=1.sup.N Σ.sub.j=1.sup.N |Z.sub.i−Z.sub.j|.sup.2S.sub.ij+∥S−A∥.sub.F.sup.2;

(39) where Z.sub.i, Z.sub.j are respectively the i.sup.th and j.sup.th column vectors of feature output Z, |Z.sub.i−Z.sub.j| means calculating the Euclidean distance between the i.sup.th and j.sup.th column vectors, and ∥S−A∥.sub.F means calculating the Frobenius norm of S−A.
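The graph learning loss of step S5.1 can be computed directly from Z, S and A; a minimal sketch on a tiny hand-checked example:

```python
import numpy as np

def graph_learning_loss(Z, S, A):
    """L_graph = sum_ij |Z_i - Z_j|^2 S_ij + ||S - A||_F^2. The first
    term pulls the features of strongly connected vertices together
    (graph Laplacian regularization); the second keeps the learned
    relationship matrix S close to the initial matrix A."""
    diff2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return (diff2 * S).sum() + ((S - A) ** 2).sum()

Z = np.array([[0.0, 0.0], [1.0, 0.0]])
S = np.array([[0.0, 0.5], [0.5, 0.0]])
A = np.array([[0.0, 1.0], [1.0, 0.0]])
loss = graph_learning_loss(Z, S, A)   # 1.0 (smoothness) + 0.5 (Frobenius)
```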

(40) Step S5.2: continuously inputting a 2D image I of a different object and processing it according to steps S1, S2 and S3, then updating learnable parameter θ of the dynamic graph learning layer and learnable linear parameters W.sup.(1), W.sup.(2) of the two graph convolution layers by adopting a gradient descent algorithm to perform back propagation according to graph learning loss function L.sub.graph, until the decrease of the value of graph learning loss function L.sub.graph stops (the value of graph learning loss function L.sub.graph converges to a stable value); at this time, the dynamic graph network training is completed.

(41) The traditional graph network (graph convolutional neural network) requires the feature input X and the corresponding initial relationship matrix A to be inputted, and the initial relationship matrix A is used as the operational input for feature aggregation. However, relationship matrix A is constant and fixed throughout the process, which means that the initial relationship matrix A has a great influence on the feature transfer and aggregation process. The initial relationship matrix A suffers from incomplete information (e.g., missing edges) and does not correspond well to the mesh relationship of the object, so the present invention provides a method for reconstructing a 3D object based on a dynamic graph network. Unlike traditional 3D reconstruction based on a graph convolutional neural network, the dynamic graph network maps the feature of each vertex and updates the edges of the graph based on graph Laplacian regularization to discover the potential edge relationships; after a distance calculation based on a Gaussian kernel function and sparsification of the generated potential graph, it finally integrates relationship matrix S with the initial relationship matrix A into the new relationship matrix  for the later graph convolution operation. Compared with the initial relationship matrix A, the new relationship matrix  improves the incomplete initial graph information of A and can be better adapted to the mesh relationship of the corresponding object; thus the accuracy and effectiveness of the reconstruction are improved.

(42) Step S6: graph convolution layer training

(43) After the dynamic graph network training is complete, continuously inputting a 2D image I of a different object and processing it according to steps S1, S2, S3 and S4, then updating the network parameters of the 3D coordinate regression layer by measuring the distance between the predicted 3D coordinates and the true 3D coordinates of the N vertices and adopting a gradient descent algorithm to perform back propagation according to a Chamfer distance loss function, until the decrease of the value of the Chamfer distance loss function stops (the value of the Chamfer distance loss function converges to a stable value); at this time, the 3D coordinate regression layer training is completed.

(44) The Chamfer distance loss function is used to measure the distance between the predicted value and the true value; its expression is:

(45) L=Σ.sub.p∈M min.sub.q∈G |p−q|.sup.2+Σ.sub.q∈G min.sub.p∈M |p−q|.sup.2;

(46) where M is the set of predicted vertices and G is the set of true vertices. For each vertex in one vertex set, the Chamfer distance loss function finds the nearest vertex in the other vertex set, then adds up the squares of the distances between the vertices and their respective nearest vertices. The first item is the sum of the squared distances between each vertex p in the predicted vertex set and its nearest vertex in the true vertex set; the second item is the sum of the squared distances between each vertex q in the true vertex set and its nearest vertex in the predicted vertex set. If distance L is bigger, the difference between the predicted vertex set and the true vertex set is bigger; if distance L is smaller, the difference between the predicted vertex set and the true vertex set is smaller, and the effect of the 3D object reconstruction is better. The Chamfer distance is mainly used for point cloud reconstruction or 3D reconstruction work.
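The symmetric Chamfer distance of equation (45) can be sketched on a tiny hand-checked example:

```python
import numpy as np

def chamfer_distance(M, G):
    """Symmetric Chamfer distance between a predicted vertex set M and a
    true vertex set G: for every point, the squared distance to its
    nearest neighbor in the other set, summed over both directions."""
    d2 = ((M[:, None, :] - G[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).sum() + d2.min(axis=0).sum()

M = np.array([[0.0, 0, 0], [1.0, 0, 0]])   # predicted vertices
G = np.array([[0.0, 0, 0], [2.0, 0, 0]])   # true vertices
L = chamfer_distance(M, G)                  # (0 + 1) + (0 + 1)
```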

(47) In the embodiment, Chamfer distance loss function is denoted by L.sub.regress:

(48) L.sub.regress=Σ.sub.i=1.sup.N |P.sub.i−Q.sub.i*|.sup.2+Σ.sub.i=1.sup.N |P.sub.i′−Q.sub.i|.sup.2;

(49) where Q.sub.i* is the coordinate vector, constituted by a true 3D coordinate, that is nearest to the i.sup.th predicted coordinate vector P.sub.i; P.sub.i′ is the predicted coordinate vector nearest to coordinate vector Q.sub.i, where coordinate vector Q.sub.i is constituted by the true 3D coordinate of the i.sup.th vertex.

(50) Step S7: 3D object reconstruction

(51) As shown in FIG. 4 and FIG. 5, after the dynamic graph network training and the graph convolution layer training are completed, for a 2D image I of an object, the predicted 3D coordinates of the N vertices are obtained by processing it according to steps S0, S1, S2, S3 and S4: image pre-processing, image feature extraction, input data preparation for the dynamic graph network, feature mapping and convolution in the dynamic graph network, and linear regression mapping in the 3D coordinate regression layer; thus the 3D structure of the object is deduced and the 3D object reconstruction is completed.

(52) As shown in FIG. 4, the dynamic graph network comprises a dynamic graph learning layer and two graph convolution layers. Feature mapping, distance measurement, sparsification and integration of relationship matrices are performed in the dynamic graph learning layer, and the new relationship matrix  is obtained; namely, the potential edges are found by learning. Feature mapping and feature integration of neighbor vertices are performed in the two graph convolution layers, and feature output Z is obtained. A detailed diagram of the 3D object reconstruction is shown in FIG. 5.

(53) In the embodiment, 2D images of six objects are respectively taken as 2D image I and processed in accordance with the present invention, and the 3D structures of the six objects are deduced. The 3D structures of the six objects are converted into six 3D renderings, which are shown in FIG. 6. As can be seen from FIG. 6, the reconstructed 3D structures have the characteristics of completeness and smoothness, and can be applied to various practical applications such as games, drones and animation.

(54) While illustrative embodiments of the invention have been described above, it is, of course, understood that various modifications will be apparent to those of ordinary skill in the art. Such modifications are within the spirit and scope of the invention, which is limited and defined only by the appended claims.