Method for reconstructing a 3D object based on dynamic graph network
11715258 · 2023-08-01
Assignee
Inventors
Cpc classification
G06V10/84
PHYSICS
G06F18/217
PHYSICS
G06F18/213
PHYSICS
International classification
G06F18/21
PHYSICS
G06F18/213
PHYSICS
G06V10/84
PHYSICS
Abstract
The present invention provides a method for reconstructing a 3D object based on a dynamic graph network: first, obtaining a plurality of feature vectors from a 2D image I of an object; then, preparing input data: predefining an initial ellipsoid mesh, obtaining a feature input X by filling initial features, and creating a relationship matrix A corresponding to the feature input X; then, inputting the feature input X and the corresponding relationship matrix A to a dynamic graph network for integrating and deducing each vertex's feature, whereby a new relationship matrix is obtained and used for the later graph convolution, which improves the initial graph information and adapts it to the mesh relation of the corresponding object, so that the accuracy and the effect of 3D object reconstruction are improved; last, regressing the positions of the vertices, whereby the 3D structure of the object is deduced and the 3D object reconstruction is completed.
Claims
1. A method for reconstructing a 3D object based on a dynamic graph network, comprising: (1): image feature extraction, comprising: extracting image features from a 2D (2 dimensional) image I of an object, wherein the extracted image features constitute a feature map, and the feature map comprises N feature vectors of D dimensions and N center coordinates of the image areas respectively corresponding to the N feature vectors, where N and D are natural numbers, N feature vectors are denoted by F.sub.n, n=1, 2, . . . , N, feature vector F.sub.n is a column vector, and the center coordinate of the image area corresponding to feature vector F.sub.n is denoted by (x.sub.n, y.sub.n), the center coordinate (x.sub.n, y.sub.n) is taken as the feature vector coordinate of feature vector F.sub.n, where x.sub.n is the horizontal coordinate of the feature vector coordinate, y.sub.n is the vertical coordinate of the feature vector coordinate; (2): input data preparation for a dynamic graph network, comprising: predefining an initial ellipsoid mesh, which comprises N vertices and a plurality of edges, where the coordinate of the k.sup.th vertex is (x′.sub.k,y′.sub.k,z′.sub.k), x′.sub.k,y′.sub.k,z′.sub.k respectively are the x coordinate, y coordinate and z coordinate of the k.sup.th vertex, where k=1, 2, . . . , N; filling initial features: in the feature map, finding feature vector coordinate (x.sub.k′,y.sub.k′), k′∈{1, 2, . . . , N}, which has the nearest distance from the coordinate (x′.sub.k,y′.sub.k) of the k.sup.th vertex, and combining feature vector F.sub.k′ and coordinate (x′.sub.k,y′.sub.k,z′.sub.k) of the k.sup.th vertex into a feature vector, which is denoted by X.sub.k, where the number of dimensions of feature vector X.sub.k is c.sub.1, c.sub.1=D+3, and a feature input X is obtained, X={X.sub.1, X.sub.2, . . . , X.sub.N}; creating a relationship matrix corresponding to the feature input X, which is denoted by A and A=(A.sub.ij).sub.N×N, i=1, 2, . . . , N, j=1, 2, . .
. , N; wherein if there is an edge connecting the i.sup.th vertex and the j.sup.th vertex, or a neighborhood relationship exists between the i.sup.th vertex and the j.sup.th vertex, element A.sub.ij=1, otherwise, element A.sub.ij=0; (3): feature mapping and convolution in a dynamic graph network, wherein the dynamic graph network (dynamic graph convolutional neural network) comprises a dynamic graph learning layer and two graph convolution layers, comprising: 3.1): in the dynamic graph learning layer, first performing a feature mapping for feature input X by a learnable parameter θ, for the i.sup.th feature vector X.sub.i of feature input X, the feature vector h.sub.i is obtained: h.sub.i=θ.sup.TX.sub.i, where the learnable parameter θ is a matrix with size of c.sub.1×c.sub.2, c.sub.2 is a natural number; then measuring the distances between vertex θ.sup.TX.sub.i and vertex θ.sup.TX.sub.j, and obtaining a relationship matrix, which is denoted by S, S=(S.sub.ij).sub.N×N, element S.sub.ij of relationship matrix S is:
Â=(1−η)A+ηS̄
Z=σ(Âσ(ÂXW.sup.(1))W.sup.(2)); where feature output Z is a matrix consisting of N column vectors, σ(ÂXW.sup.(1)) is the output of the first graph convolution layer, and is taken as the input of the second graph convolution layer, W.sup.(1) is a learnable linear parameter of the first graph convolution layer, W.sup.(2) is a learnable linear parameter of the second graph convolution layer, σ( ) is an activation function; (4): linear regression mapping in a 3D coordinate regression layer, comprising: sending the N column vectors Z.sub.i, i=1, 2, . . . , N, of feature output Z as input to a 3D coordinate regression layer to perform linear regression mapping, wherein the 3D coordinate regression layer outputs 3 dimensional coordinate vectors P.sub.i, i=1, 2, . . . , N, which respectively correspond to the predicted 3D coordinates of the N vertices; (5): dynamic graph network training, comprising: 5.1): creating a graph learning loss function L.sub.graph:
2. The method for reconstructing a 3D object based on the dynamic graph network of claim 1, further comprising pre-processing 2D image I before the image feature extraction, including enhancing, cropping and making uniform 2D image I, wherein the detailed cropping and unifying are as follows: cropping 2D image I into an image with a fixed size of 256×256, removing the color of a background area by using an edge detection algorithm, and filling the background area with green color;
normalizing 2D image I to make the pixel values uniform, wherein the normalizing makes the pixel values conform to a Gaussian distribution.
Description
BRIEF DESCRIPTION OF THE DRAWING
(1) The above and other objectives, features and advantages of the present invention will be more apparent from the following detailed description taken in conjunction with the accompanying drawings, in which:
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
(8) Hereinafter, preferred embodiments of the present invention will be described with reference to the accompanying drawings. It should be noted that the similar modules are designated by similar reference numerals although they are illustrated in different drawings. Also, in the following description, a detailed description of known functions and configurations incorporated herein will be omitted when it may obscure the subject matter of the present invention.
Embodiment
(10) In one embodiment, as shown in
(11) Step S1: image feature extraction
(12) Extracting image features from a 2D (2 dimensional) image I of an object, wherein the extracted image features constitute a feature map, the feature map comprises N feature vectors of D dimensions and N center coordinates of the image areas respectively corresponding to the N feature vectors, N feature vectors are denoted by F.sub.n, n=1, 2, . . . , N, feature vector F.sub.n is a column vector, and the center coordinate of the image area corresponding to feature vector F.sub.n is denoted by (x.sub.n, y.sub.n), the center coordinate (x.sub.n, y.sub.n) is taken as the feature vector coordinate of feature vector F.sub.n, where x.sub.n is the horizontal coordinate of the feature vector coordinate, y.sub.n is the vertical coordinate of the feature vector coordinate.
(13) In the embodiment, 2D (2 dimensional) image I of an object needs to be pre-processed before performing the image feature extraction: enhancing, cropping and unifying 2D image I. The detailed cropping and unifying are as follows: cropping 2D image I into an image with a fixed size of 256×256, removing the color of the background area by using an edge detection algorithm, and filling the background area with green color;
normalizing 2D image I to make the pixel values uniform, so that the pixel values conform to a Gaussian distribution.
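The normalization step above can be sketched as follows; this is a minimal illustration using NumPy, assuming per-channel standardization to zero mean and unit variance (the embodiment does not fix the exact statistics used):

```python
import numpy as np

def normalize_image(img: np.ndarray) -> np.ndarray:
    """Normalize pixel values toward a standard Gaussian distribution:
    zero mean and unit variance, computed per color channel."""
    img = img.astype(np.float64)
    mean = img.mean(axis=(0, 1), keepdims=True)   # per-channel mean
    std = img.std(axis=(0, 1), keepdims=True)     # per-channel standard deviation
    return (img - mean) / (std + 1e-8)            # epsilon avoids division by zero

# Example: a random 256x256 RGB image standing in for the cropped 2D image I.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(256, 256, 3))
out = normalize_image(img)
```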
(14) In the embodiment, the pre-processed 2D (2 dimensional) image I is sent to a ResNet50 residual network model that has been trained on the ImageNet image database for image feature extraction. The ResNet50 model comprises multiple 3×3 convolutional layers, activation function layers and batch normalization layers, and also contains short-circuit connections to connect the upper layer features to the lower layer features. The ResNet50 model usually consists of multiple blocks; however, in the embodiment, the last block has been removed to obtain feature vectors corresponding to image areas of 2D image I. Then a feature map constituted by N=16×16=256 feature vectors with a size of D=512 dimensions is obtained. Namely, the feature map comprises 256 feature vectors F.sub.1, F.sub.2, . . . , F.sub.256 with a size of 512 dimensions and 256 center coordinates of the image areas respectively corresponding to the 256 feature vectors.
(15) Step S2: input data preparation for dynamic graph network
(16) As shown in
(17) Filling initial features: in the feature map, finding feature vector coordinate (x.sub.k′,y.sub.k′), k′∈{1, 2, . . . , N}, which has the nearest distance from the coordinate (x′.sub.k,y′.sub.k) of the k.sup.th vertex, and combining feature vector F.sub.k′ and coordinate (x′.sub.k,y′.sub.k,z′.sub.k) of the k.sup.th vertex into a feature vector, which is denoted by X.sub.k, where the number of dimensions of feature vector X.sub.k is c.sub.1, c.sub.1=D+3, and a feature input X is obtained, X={X.sub.1, X.sub.2, . . . , X.sub.N}. In the embodiment, c.sub.1=512+3=515.
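The feature filling described above can be sketched as follows; this is an illustrative NumPy sketch in which the toy sizes N=16 and D=8 are stand-ins for the embodiment's N=256 and D=512:

```python
import numpy as np

D = 8                      # feature dimension (stand-in for D=512)
N = 16                     # number of vertices / feature vectors (stand-in for 256)
rng = np.random.default_rng(1)

F = rng.normal(size=(N, D))          # feature vectors F_n, one per row
fcoords = rng.uniform(size=(N, 2))   # feature-vector coordinates (x_n, y_n)
vcoords = rng.uniform(size=(N, 3))   # ellipsoid vertex coordinates (x'_k, y'_k, z'_k)

def fill_initial_features(F, fcoords, vcoords):
    """For each vertex, find the feature vector whose 2D coordinate is nearest
    to the vertex's (x'_k, y'_k), then concatenate that feature vector F_{k'}
    with the vertex's 3D coordinate, giving X_k of dimension c1 = D + 3."""
    X = []
    for v in vcoords:
        d2 = ((fcoords - v[:2]) ** 2).sum(axis=1)  # squared distances in the image plane
        k_prime = int(np.argmin(d2))               # index k' of the nearest feature vector
        X.append(np.concatenate([F[k_prime], v]))  # F_{k'} followed by (x', y', z')
    return np.stack(X)

X = fill_initial_features(F, fcoords, vcoords)     # feature input X, shape (N, D + 3)
```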
(18) Creating a relationship matrix corresponding to the feature input X, which is denoted by A and A=(A.sub.ij).sub.N×N; wherein if there is an edge connecting the i.sup.th vertex and the j.sup.th vertex, or a neighborhood relationship exists between the i.sup.th vertex and the j.sup.th vertex (the distance between the i.sup.th vertex and the j.sup.th vertex is less than a set threshold ε), element A.sub.ij=1, otherwise, element A.sub.ij=0.
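The construction of relationship matrix A can be sketched as follows; a minimal NumPy sketch, where the edge list and threshold ε are illustrative values, not the embodiment's mesh:

```python
import numpy as np

def build_relationship_matrix(vcoords, edges, eps):
    """A_ij = 1 if vertices i and j are connected by an edge, or if their
    distance is below the threshold eps (a neighborhood relationship);
    otherwise A_ij = 0."""
    N = len(vcoords)
    A = np.zeros((N, N))
    for i, j in edges:                 # explicit mesh edges
        A[i, j] = A[j, i] = 1.0
    # pairwise Euclidean distances between vertex coordinates
    d = np.linalg.norm(vcoords[:, None, :] - vcoords[None, :, :], axis=-1)
    A[(d < eps) & ~np.eye(N, dtype=bool)] = 1.0   # neighborhood relationships
    return A

# three toy vertices: 0 and 1 are near each other, 0 and 2 share an edge
vcoords = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [5.0, 0.0, 0.0]])
A = build_relationship_matrix(vcoords, edges=[(0, 2)], eps=0.5)
```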
(19) In the embodiment, the initial ellipsoid mesh comprises 256 vertices and 6×16 edges.
(20) Step S3: feature mapping and convolution in a dynamic graph network
(21) The dynamic graph network (dynamic graph convolutional neural network) comprises a dynamic graph learning layer and two graph convolution layers.
(22) Step S3.1: in the dynamic graph learning layer, first performing a feature mapping for feature input X by a learnable parameter θ, for the i.sup.th feature vector X.sub.i of feature input X, the feature vector h.sub.i is obtained: h.sub.i=θ.sup.TX.sub.i, where the learnable parameter θ is a matrix with size of c.sub.1×c.sub.2, c.sub.2 is the number of dimensions in distance space; then measuring the distances between vertices, and a relationship matrix is obtained, which is denoted by S, S=(S.sub.ij).sub.N×N, element S.sub.ij of relationship matrix S is:
(23) S.sub.ij=exp{−d.sup.2(θ.sup.TX.sub.i,θ.sup.TX.sub.j)}
(24) where d.sup.2( ) is a distance measure function used to calculate the distance between two feature vectors, exp{ } is an exponential function.
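The mapping and distance measurement above can be sketched as follows; an illustrative NumPy sketch assuming the squared Euclidean distance as d.sup.2( ), with toy sizes for N, c.sub.1 and c.sub.2:

```python
import numpy as np

def gaussian_relationship(X, theta):
    """Map each feature vector with the learnable parameter theta (c1 x c2),
    h_i = theta^T X_i, then build S_ij = exp(-d^2(h_i, h_j)) using the squared
    Euclidean distance as the distance measure."""
    H = X @ theta                                        # rows are h_i = theta^T X_i
    d2 = ((H[:, None, :] - H[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    return np.exp(-d2)                                   # Gaussian kernel relationship

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 4))        # toy feature input: N=5 vectors of c1=4 dims
theta = rng.normal(size=(4, 3))    # c2 = 3 dimensions in the distance space
S = gaussian_relationship(X, theta)
```

Since d.sup.2(h.sub.i,h.sub.i)=0, the diagonal of S is 1, and all elements lie in (0, 1].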
(25) Then normalizing relationship matrix S, where the normalized element S̄.sub.ij is:
(26) S̄.sub.ij=S.sub.ij/Σ.sub.j′=1.sup.NS.sub.ij′
(27) To N elements
(28) To each vertex, too many neighbor nodes will cause a graph over-smoothing problem during feature aggregation, making all nodes' features tend to be the same, which can lead to network training failure. So the present invention makes relationship matrix S̄ sparse.
(29) Last, integrating relationship matrix S̄ with initial relationship matrix A to obtain an integrated relationship matrix Â:
Â=(1−η)A+ηS̄
(30) where η is a hyper-parameter used to balance relationship matrix A and relationship matrix S̄.
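The normalization and integration steps can be sketched together as follows; a minimal NumPy sketch assuming row normalization of S and omitting the sparsification step for brevity:

```python
import numpy as np

def integrate(A, S, eta=0.5):
    """Row-normalize S so each row sums to 1, then blend it with the initial
    relationship matrix: A_hat = (1 - eta) * A + eta * S_bar."""
    S_bar = S / S.sum(axis=1, keepdims=True)   # normalized relationship matrix
    return (1 - eta) * A + eta * S_bar         # integrated relationship matrix

A = np.array([[0.0, 1.0], [1.0, 0.0]])         # initial relationships (edges)
S = np.array([[1.0, 0.5], [0.5, 1.0]])         # learned Gaussian-kernel relationships
A_hat = integrate(A, S, eta=0.4)
```

A larger η lets the learned relationships dominate; a smaller η keeps the integrated matrix close to the initial mesh connectivity.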
(31) Step S3.2: in the two graph convolution layers, performing two graph convolutions for feature input X, thus a feature output denoted by Z is obtained according to the following equation:
Z=σ(Âσ(ÂXW.sup.(1))W.sup.(2));
(32) where feature output Z is a matrix consisting of N column vectors, σ(ÂXW.sup.(1)) is the output of the first graph convolution layer, and is taken as the input of the second graph convolution layer, W.sup.(1) is the learnable linear parameter of the first graph convolution layer, W.sup.(2) is the learnable linear parameter of the second graph convolution layer, and σ( ) is an activation function.
(33) where multiplying feature input X with learnable linear parameter W.sup.(1) is a linear mapping, and multiplying feature input X with relationship matrix  is a feature aggregation of neighbor vertices.
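The two graph convolutions can be sketched as follows; an illustrative NumPy sketch assuming ReLU as the activation σ( ) and toy dimensions (the embodiment's layer widths are not fixed here):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def two_layer_gcn(A_hat, X, W1, W2):
    """Z = sigma(A_hat @ sigma(A_hat @ X @ W1) @ W2): multiplying by W is a
    linear mapping; multiplying by A_hat aggregates neighbor vertex features."""
    H1 = relu(A_hat @ X @ W1)      # output of the first graph convolution layer
    return relu(A_hat @ H1 @ W2)   # output of the second graph convolution layer

rng = np.random.default_rng(3)
N, c1, hidden, out = 6, 5, 4, 3
A_hat = np.full((N, N), 1.0 / N)   # toy integrated relationship matrix (rows sum to 1)
X = rng.normal(size=(N, c1))       # feature input, one row per vertex
W1 = rng.normal(size=(c1, hidden))
W2 = rng.normal(size=(hidden, out))
Z = two_layer_gcn(A_hat, X, W1, W2)
```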
(34) Step S4: linear regression mapping in 3D coordinate regression layer
(35) Sending the N column vectors Z.sub.i, i=1, 2, . . . , N, of feature output Z as input to a 3D coordinate regression layer to perform linear regression mapping, wherein the 3D coordinate regression layer outputs 3 dimensional coordinate vectors P.sub.i, i=1, 2, . . . , N, which respectively correspond to the predicted 3D coordinates of the N vertices.
(36) Step S5: dynamic graph network training
(37) Step S5.1: creating a graph learning loss function L.sub.graph:
(38) L.sub.graph=Σ.sub.i,j=1.sup.N∥Z.sub.i−Z.sub.j∥.sup.2S.sub.ij+∥S−A∥.sub.F.sup.2
(39) where Z.sub.i, Z.sub.j are respectively the i.sup.th and j.sup.th column vectors of feature output Z, ∥Z.sub.i−Z.sub.j∥ denotes the Euclidean distance between the i.sup.th and j.sup.th column vectors, and ∥S−A∥.sub.F denotes the Frobenius norm of S−A.
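A graph learning loss of this graph-Laplacian-regularized form can be sketched as follows; a NumPy sketch in which the weighting `lam` between the two terms is an assumption of the sketch, not a value stated by the embodiment:

```python
import numpy as np

def graph_learning_loss(Z, S, A, lam=1.0):
    """Sum of pairwise squared distances ||Z_i - Z_j||^2 weighted by S_ij
    (a graph Laplacian smoothness term), plus the squared Frobenius norm of
    S - A. `lam` balances the two terms and is an assumption of this sketch."""
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # pairwise ||Z_i - Z_j||^2
    smooth = (d2 * S).sum()                               # smoothness over relationships
    return smooth + lam * np.linalg.norm(S - A, ord="fro") ** 2

Z = np.array([[0.0, 0.0], [1.0, 0.0]])    # toy vertex features (rows)
S = np.array([[1.0, 0.5], [0.5, 1.0]])    # learned relationship matrix
A = np.eye(2)                             # toy initial relationship matrix
loss = graph_learning_loss(Z, S, A)
```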
(40) Step S5.2: continuously inputting 2D images I of different objects, and processing them according to steps S1, S2 and S3, then updating learnable parameter θ of the dynamic graph learning layer and learnable linear parameters W.sup.(1), W.sup.(2) of the two graph convolution layers by adopting a gradient descent algorithm to perform a back propagation according to graph learning loss function L.sub.graph, until the value of graph learning loss function L.sub.graph stops decreasing (the value of graph learning loss function L.sub.graph converges to a stable value); at this time, the dynamic graph network training is completed.
(41) The traditional graph network (graph convolutional neural network) requires the feature input X and the corresponding initial relationship matrix A to be inputted, and initial relationship matrix A is used as the operational input for feature aggregation. But relationship matrix A is constant and fixed throughout the process, which means that initial relationship matrix A has a great influence on the feature transfer and aggregation process. However, initial relationship matrix A has the disadvantage of incomplete information (e.g., missing edges) as well as not corresponding well to the mesh relation of the object, so the present invention provides a method for reconstructing a 3D object based on a dynamic graph network. Unlike the traditional 3D reconstruction based on a graph convolutional neural network, the dynamic graph network maps the feature of each vertex and updates the edges of the graph based on graph Laplacian regularization to discover the potential edge relationships, and after distance calculation based on a Gaussian kernel function and sparsification of the generated potential graph, finally integrates relationship matrix S̄ with initial relationship matrix A to obtain integrated relationship matrix Â for the graph convolutions.
(42) Step S6: graph convolution layer training
(43) After the dynamic graph network training is complete, continuously inputting 2D images I of different objects, and processing them according to steps S1, S2, S3 and S4, then updating the network parameters of the 3D coordinate regression layer by measuring the distance between the predicted 3D coordinates and the true 3D coordinates of the N vertices and adopting a gradient descent algorithm to perform a back propagation according to the Chamfer distance loss function, until the value of the Chamfer distance loss function stops decreasing (the value of the Chamfer distance loss function converges to a stable value); at this time, the 3D coordinate regression layer training is completed.
(44) Chamfer distance loss function is used to measure the distance between the predicted value and true value, its expression is:
(45) L=Σ.sub.p∈Mmin.sub.q∈G∥p−q∥.sub.2.sup.2+Σ.sub.q∈Gmin.sub.p∈M∥p−q∥.sub.2.sup.2
(46) where M is the set of predicted vertices, and G is the set of true vertices. For each vertex in one vertex set, the Chamfer distance loss function means finding the nearest vertex in the other vertex set, then adding all squares of the distances between vertices and their respective nearest vertices. The first item is the sum of the squared distances between each vertex p in the predicted vertex set and its nearest vertex in the true vertex set; the second item is the sum of the squared distances between each vertex q in the true vertex set and its nearest vertex in the predicted vertex set. If distance L is bigger, the difference between the predicted vertex set and the true vertex set is bigger; if distance L is smaller, the difference between the predicted vertex set and the true vertex set is smaller, and the effect of 3D object reconstruction is better. The Chamfer distance is mainly used for point cloud reconstruction or 3D reconstruction work.
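The Chamfer distance described above can be sketched as follows; a minimal NumPy sketch over two toy vertex sets:

```python
import numpy as np

def chamfer_distance(M, G):
    """For each predicted vertex, find its nearest true vertex (and vice
    versa) and sum the squared distances between each vertex and its
    respective nearest vertex in the other set."""
    # pairwise squared Euclidean distances, shape (len(M), len(G))
    d2 = ((M[:, None, :] - G[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).sum() + d2.min(axis=0).sum()

M = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])   # predicted vertex set
G = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])   # true vertex set
L = chamfer_distance(M, G)
```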
(47) In the embodiment, Chamfer distance loss function is denoted by L.sub.regress:
(48) L.sub.regress=Σ.sub.i=1.sup.N∥P.sub.i−Q.sub.i*∥.sub.2.sup.2+Σ.sub.i=1.sup.N∥Q.sub.i−P.sub.i′∥.sub.2.sup.2
(49) where Q.sub.i* is the coordinate vector, constituted by a true 3D coordinate, that is nearest to the i.sup.th predicted coordinate vector P.sub.i; P.sub.i′ is the predicted coordinate vector nearest to coordinate vector Q.sub.i, where coordinate vector Q.sub.i is constituted by the true 3D coordinate of the i.sup.th vertex.
(50) Step S7: 3D object reconstruction
(51) As shown in
(52) As shown in
(53) In the embodiment, 2D images of six objects are respectively taken as 2D image I to be processed in accordance with the present invention, then the 3D structures of the six objects are deduced. The 3D structures of the six objects are converted into six 3D renderings, which are shown in
(54) While illustrative embodiments of the invention have been described above, it is, of course, understood that various modifications will be apparent to those of ordinary skill in the art. Such modifications are within the spirit and scope of the invention, which is limited and defined only by the appended claims.