LIGHT FIELD RECONSTRUCTION METHOD AND APPARATUS OF A DYNAMIC SCENE

Abstract

The light field reconstruction method includes: obtaining a human segmentation result via a pre-trained semantic segmentation network, and obtaining an object segmentation result according to a pre-obtained scene background; fusing multiple frames of depth maps to obtain a geometric model, obtaining a complete human model according to a pre-trained human model completion network, and registering the models by point cloud registration and fusing the registered models to obtain an object model, so as to obtain a complete human model with geometric details and the object model; tracking motion of a rigid object through point cloud registration; reconstructing the complete human model with geometric details through a human skeleton tracking and a non rigid tracking of human surface nodes; and performing a fusion operation in time sequence and obtaining a reconstructed human model and a reconstructed rigid object model through the fusion operation.

Claims

1. A light field reconstruction method of a dynamic scene, comprising: obtaining a human segmentation result via a pre-trained semantic segmentation network, and obtaining an object segmentation result according to a pre-obtained scene background; fusing multiple frames of depth maps of the human segmentation result and the object segmentation result to obtain a geometric model, obtaining a complete human model according to a pre-trained human model completion network, and registering the models by point cloud registration and fusing the registered models to obtain a fused object model, so as to obtain a complete human model with geometric details and the object model; tracking motion of a rigid object through point cloud registration; reconstructing the complete human model with geometric details through a human skeleton prior and non rigid point cloud tracking; and performing, after obtaining a motion field of the complete human model with geometric details and motion of the rigid object, a fusion operation in time sequence and obtaining a reconstructed human model and a reconstructed rigid object model through the fusion operation.

2. The light field reconstruction method of a dynamic scene according to claim 1, wherein the method further comprises: removing a wrong part in the human segmentation result according to a projection result of the reconstructed rigid object under a current visual angle.

3. The light field reconstruction method of a dynamic scene according to claim 1, wherein obtaining the complete human model according to the pre-trained human model completion network comprises: triggering to process the human model of a current frame with a trained depth learning network to obtain a complete human model, when an average weight of the geometric model reaches a specific preset threshold.

4. The light field reconstruction method of a dynamic scene according to claim 1, wherein the motion of the rigid object is obtained by solving an optimization function, the optimization function for point cloud registration comprises two optimization terms of color and geometry, expressions of the optimization function are as follows: $E_{object} (T) = λ_{color} E_{color} + λ_{geo} E_{geo}$ $E_{color} = \underset{i \in N}{.Math.} \underset{(p, q) \in ℛ}{.Math.} {(C_{p} (f (T_{i} q)) - C (q))}^{2}$ $E_{geo} = \underset{i \in N}{.Math.} \underset{(p, q) \in ℛ}{.Math.} {(n_{p}^{T} (p - T_{i} q))}^{2}$ where, N is the number of objects, R is a set of corresponding points found via closest searching, p and q are the corresponding points of frame t and frame t−1, function C returns the color q of the point, C.sub.p is a pre-computed function continuous on a tangent plane of the point p, function f projects a three-dimensional point to the tangent plane, T.sub.i is the motion of the rigid object, λ.sub.color is a coefficient of the color optimization item and set to 0.1, λ.sub.geo is a coefficient of the geometric optimization term and set to 0.9, E.sub.color is a calculation expression of the color optimization term, which is calculated from color differences of adjacent points, and E.sub.geo is a calculation expression of the geometric optimization term, which is calculated from spatial position differences of adjacent points.

5. The light field reconstruction method of a dynamic scene according to claim 1, wherein the tracking through the human skeleton comprises: tracking through the human skeleton prior and non rigid point cloud, and adding new constraints when solving a position of a human skeleton node: adding new constraints when solving the position of the human skeleton node:
E.sub.inter=λ.sub.gmmE.sub.gmm+λ.sub.1stmE.sub.1stm+λ.sub.sp_h1E.sub.sp_h where, E.sub.gmm is human pose data collected under an interaction between human and object, E.sub.1stm is a constraint item in time sequence, E.sub.sp_h is a geometric intersection term, λ.sub.gmm, λ.sub.1stm, λ.sub.sp_h1 are coefficients of optimization terms respectively.

6. The light field reconstruction method of a dynamic scene according to claim 1, wherein the non rigid tracking of human surface nodes comprises: jointly solving an optimization equation, wherein optimization variables are human shape β.sub.0 and pose θ.sub.0, and ED non rigid motion field G.sub.0, and the optimization equation is:
E.sub.comp(G.sub.0,β.sub.0,θ.sub.0)=λ.sub.vdE.sub.vdata+λ.sub.mdE.sub.mdata where, the first item is a voxel data item, wherein λ.sub.vd is an optimization coefficient, E.sub.vdata defines an error between a SMPL model and a reconstructed geometric model: $E_{vdata} (β_{0}, θ_{0}) = \underset{\overline{v} \in \overline{T}}{.Math.} ψ (D (W (T (\overline{v}; β_{0}, θ_{0}); β_{0}, θ_{0})),$ where, input of D(⋅) is coordinates of a point, and output is a SDF value of a bilinear) interpolation of the coordinates of the point in the TSDF volume, ψ(⋅) represents a robust Geman-McClure penalty function.

7. The light field reconstruction method of a dynamic scene according to claim 6, wherein the mutual action term E.sub.mdata is represented by the following point to plane distance: $E_{mdata} = \underset{(\overline{v}, u) \in C}{.Math.} ψ (n_{u}^{T} (W (T (\overline{v}; β_{0}, θ_{0})) - u)) + \underset{({\overline{v}}_{c}, u) \in 𝒫}{.Math.} ψ (n_{u}^{T} ({\tilde{v}}_{c} - u)),$ where, C is a closest point pair set of points v on the SMPL and points u on a complete model, P is a closest point pair set of vertexes {tilde over (v)}.sub.c of a partial model and points u on the complete model, and n.sub.u is a normal vector of a point.

8. The light field reconstruction method of a dynamic scene according to claim 6, wherein for each 3D voxel v text missing or illegible when filed , {tilde over (v)} represents a position warped through ED non rigid motion, N(v) represents the number of non empty voxels neighboring the voxel, D(v) represents a TSDF value of v; calculating a current SDF value d(v) and updating a weight by the following formulas:
d(v)=(u−{tilde over (v)})sgn(n.sub.u.sup.T(u−{tilde over (v)})),ω(v)=1/(1+N(v)). where, u is a three-dimensional point on the complete model corresponding to the {tilde over (v)}, n.sub.u is its normal vector, sgn(⋅) is a sign function and determined by positive and negative SDF values.

9. The light field reconstruction method of a dynamic scene according to claim 6, wherein through the SDF value and updating the weight, fusing according to a fusion strategy, and using a marching cubes algorithm, a complete mesh model with geometric details is obtained, the fusion strategy is: $D (v) \leftarrow \frac{D (v) W (v) + d (v) w (v)}{W (v) + w (v)}, W (v) \leftarrow W (v) + w (v) .$ where, D(v) represents a TSDF value of v, and W(v) represents a current accumulative weight.

10. A light field reconstruction apparatus of a dynamic scene, comprising: a segmenting module, configured to obtain a human segmentation result via a pre-trained semantic segmentation network, and obtain an object segmentation result according to a pre-obtained scene background; a registering module, configured to fuse multiple frames of depth maps of the human segmentation result and the object segmentation result to obtain a geometric model, obtain a complete human model according to a pre-trained human model completion network, and register the models by point cloud registration and fuse the registered models to obtain a fused object model, so as to obtain a complete human model with geometric details and the object model; a tracking module, configured to track motion of a rigid object through point cloud registration; a reconstructing module, configured to reconstruct the complete human model with geometric details through a human skeleton prior and non rigid point cloud tracking; and a fusing module, configured to perform, after obtaining a motion field of the complete human model with geometric details and motion of the rigid object, a fusion operation in time sequence and obtaining a reconstructed human model and a reconstructed rigid object model through the fusion operation.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0037] The above-mentioned and/or additional aspects and advantages of the present disclosure will become apparent and easy to understand from the following description of embodiments in combination with the accompanying drawings, in which:

[0038] FIG. 1 is a flowchart of a light field reconstruction method of a dynamic scene according to an embodiment of the present disclosure.

[0039] FIG. 2 is a schematic diagram of a principal design of light field reconstruction of a dynamic scene according to an embodiment of the present disclosure.

[0040] FIG. 3 is a schematic diagram of a three-dimensional model reconstruction result according to an embodiment of the present disclosure.

[0041] FIG. 4 is a block diagram of a light field reconstruction apparatus of a dynamic scene according to an embodiment of the present disclosure.

DETAILED DESCRIPTION

[0042] Embodiments of the present disclosure are described in detail below. The examples of the embodiments are shown in the drawings, in which the same or similar labels throughout represent the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary and are intended to explain the present disclosure, but cannot be understood as a limitation of the present disclosure.

[0043] A light field reconstruction method of a dynamic scene and a light field reconstruction apparatus of a dynamic scene of embodiment of the present present disclosure are described below with reference to the drawings.

[0044] The light field reconstruction method of a dynamic scene of embodiment of the present disclosure reconstructs a dynamic human body, through a non rigid tracking and prior information of a human skeleton under input of the monocular RGBD camera, tracking and reconstructing a rigid object through point cloud registration, and after obtaining the reconstructed rigid object, restricting a position of the human model through optimization items in space to prevent the human model from being inserted into the object model. At the same time, the result of human mask extraction in the original data is adjusted through the reconstructed rigid object model, as shown in FIG. 2.

[0045] FIG. 1 is a flowchart of a light field reconstruction method of a dynamic scene according to an embodiment of the present disclosure.

[0046] As illustrated in FIG. 1, the method includes the following steps.

[0047] At block S1, a human segmentation result is obtained via a pre-trained semantic segmentation network, and an object segmentation result is obtained according to a pre-obtained scene background.

[0048] Specifically, in the present disclosure, the semantic segmentation network is first used to obtain human segmented part, and the object in the scene is obtained according to the known scene background and the human segmentation result.

[0049] Further, the human segmentation result of the semantic segmentation network sometimes incorrectly contain the object, the wrong part in the human segmentation result is removed according to a projection result of a reconstructed rigid object under a current visual angle.

[0050] At block S2, multiple frames of depth maps of the human segmentation result and the object segmentation result are fused to obtain a geometric model, a complete human model is obtained according to a pre-trained human model completion network, and the models are registered by point cloud registration and the registered models are fused to obtain a fused object model, so as to obtain a complete human model with geometric details and the object model.

[0051] It can be understood that, firstly, the human body model with high-quality geometric details is obtained: with single visual angle non rigid dynamic reconstruction technology (ED node-graph based), the geometric model (TSDF form) of a current frame is obtained by fusing the multiple frames of depth maps, for example a partial model of the front of the human body obtained from 1-3 s of front video of the human body.

[0052] Secondly, the complete human model is obtained: once an average weight of TSDF volume reaches a certain threshold (32, adjustable), triggering to process the current frame with a trained depth learning network to obtain a complete human model. The specific implementation is as follows.

[0053] The network model refers to PIFu, which is composed of an image encoder and a MLP. The difference is that the input is not only RGB, but also D (depth map) and human parsing (segmentation map of human body parts), this aims to obtain a model that is closer a real situation in scale and human pose (that is, the above fused geometric model). The training data set can be obtained by rendering a large number of three-dimensional human models to obtain the depth and RGB and segmenting human body parts to obtain the human parsing, and then the depth learning model can be obtained through the changed PIFu network training.

[0054] Finally, the fusion of the human model is obtained: after obtaining the above two models, the two models are registered by point cloud registration, and then the registered models are fused, that is, the reconstructed incomplete model is supplemented by the learned complete model to form a complete human model with geometric details.

[0055] At block S3, motion of a rigid object is tracked through point cloud registration.

[0056] It can be understood that, for the rigid object, its motion is tracked through point cloud registration, the optimization function for point cloud registration includes two optimization terms of color and geometry, expressions of the optimization function are as follows:

[00005] $E_{object} (T) = λ_{color} E_{color} + λ_{geo} E_{geo}$ $E_{color} = \underset{i \in N}{.Math.} \underset{(p, q) \in ℛ}{.Math.} {(C_{p} (f (T_{i} q)) - C (q))}^{2}$ $E_{geo} = \underset{i \in N}{.Math.} \underset{(p, q) \in ℛ}{.Math.} {(n_{p}^{T} (p - T_{i} q))}^{2}$

[0057] where, N is the number of objects, R is a set of corresponding points found via closest searching, p and q are the corresponding points of frame t and frame t−1, function C returns the color q of the point, C.sub.p is a pre-computed function continuous on a tangent plane of the point p, function f projects a three-dimensional point to the tangent plane, T.sub.i is the motion of the rigid object, λ.sub.color is a coefficient of the color optimization item and set to 0.1, λ.sub.geo is a coefficient of the geometric optimization term and set to 0.9, E.sub.color is a calculation expression of the color optimization term, which is calculated from color differences of adjacent points, and E.sub.geo is a calculation expression of the geometric optimization term, which is calculated from spatial position differences of adjacent points.

[0058] At block S4, the complete human model with geometric details is reconstructed through a human skeleton tracking and a non rigid tracking of human surface nodes.

[0059] It can be understood that, the human model can be reconstructed through a human skeleton prior and non rigid point cloud tracking, and when solving a position of a human skeleton node, new constraints are added:

E.sub.inter=λ.sub.gmmE.sub.gmm+λ.sub.1stmE.sub.1stm+λ.sub.sp_h1E.sub.sp_h

[0060] where, λ.sub.gmm, λ.sub.1stm, λ.sub.sp_h1 are coefficients of optimization terms respectively, E.sub.gmm is human pose data collected under an interaction between human and object, a pose distribution under the interaction scene is obtained through Gaussian mixture model, and the current pose estimation is constrained to keep consistent with prior information of the Gaussian mixture model as much as possible; E.sub.1stm is a constraint item in time sequence, a LSTM network is trained to predict the current pose estimation based on a historical pose estimation, and the solution of the current pose estimation is constrained by a predictive value of the current pose estimation, when the human is occluded by an object, a better skeleton motion estimation can be realized based on the continuity in time sequence; E.sub.sp_h is a geometric intersection term, after obtaining the rigid object model, the human and the object models are constrained to not intersect in space, so as to avoid the incorrect insertion of the human model into the object under occlusion.

[0061] Further, the non rigid tracking of the human surface node is based on non rigid motion estimation: in order to obtain more realistic non rigid motion (clothing folds, etc.), an optimization problem is solved to estimate the non rigid motion G based on the pose estimation (skeleton motion estimation). ED node graph and SMPL model are used to represent the complete human model. For any 3D vertex v.sub.c, {tilde over (v)}.sub.c=ED(v.sub.c; G) represents a position warped through the ED node graph, G is a non rigid motion field. For the SMPL model, T is a unified template, T(β, θ) is a template deformed by shape and pose parameters, in which β represents the shape parameter (human shape), θ represents the pose parameter. For any vertex v∈T, W(T(v; β, θ); β, θ) is the 3D coordinate position after warp.

[0062] A specific operation of aligning a partial model in the form of TSDF and a complete mesh (mesh model) consists in jointly solving an optimization equation, in which optimization variables are human shape β.sub.0 and pose θ.sub.0, and ED non rigid motion field G.sub.0 (from the TSDF partial model to the complete mesh). The optimization equation is:

E.sub.comp(G.sub.0,β.sub.0,θ.sub.0)=λ.sub.vdE.sub.vdata+λ.sub.mdE.sub.mdata

[0063] where, the first item is a voxel data item, in which λ.sub.vd is an optimization coefficient, E.sub.vdata defines an error between a SMPL model and a reconstructed geometric model (a partial model of TSDF volume):

[00006] $E_{vdata} (β_{0}, θ_{0}) = \underset{\overline{v} \in \overline{T}}{.Math.} ψ (D (W (T (\overline{v}; β_{0}, θ_{0}); β_{0}, θ_{0})),$

[0064] where, input of D(⋅) is coordinates of a point, and output is a SDF value of a bilinear interpolation of the coordinates of the point in the TSDF volume (smaller means closer to the plane), ψ(⋅) represents a robust Geman-McClure penalty function.

[0065] The mutual action term E.sub.mdata measures a fitting from both the partial TSDF model and the SMPL model to the complete mesh, and is represented by the following point to plane distance:

[00007] $E_{mdata} = \underset{(\overline{v}, u) \in C}{.Math.} ψ (n_{u}^{T} (W (T (\overline{v}; β_{0}, θ_{0})) - u)) + \underset{({\overline{v}}_{c}, u) \in 𝒫}{.Math.} ψ (n_{u}^{T} ({\tilde{v}}_{c} - u)),$

[0066] where, C is a closest point pair set of points v on the SMPL and points u on a complete model, P is a closest point pair set of vertexes {tilde over (v)}.sub.c of a partial model and points u on the complete model, and n.sub.u is a normal vector of a point, λ.sub.mdata is a coefficient of the mutual action term.

[0067] At block S5, after obtaining a motion field of the complete human model with geometric details and motion of the rigid object, a fusion operation is performed in time sequence, and a reconstructed human model and a reconstructed rigid object model are obtained through the fusion operation.

[0068] Specifically, the fusion operation needs to be performed in time sequence after obtaining the motion field of the human model and the motion of the rigid object, the fusion is performed in the TSDF domain, the complete human model and rigid object model are obtained through fusing.

[0069] It can be understood that, for each 3D voxel v text missing or illegible when filed , {tilde over (v)} represents a position warped through ED non rigid motion, N(v) represents the number of non empty voxels neighboring the voxel, the larger the number is, the more observations are made in this part, and the more reliable it is, with the fusion of the partial models, the number from the middle to the edge becomes smaller and smaller, therefore using its inverse ratio to represent a fusion weight to achieve a seamless fusion effect. D(v) represents a TSDF value of v, text missing or illegible when filed W(v) represents a current accumulative weight of v.

[0070] a current SDF value d(v) is calculated and a weight is updated according to the following formulas:

d(v)=(u−{tilde over (v)})sgn(n.sub.u.sup.T(u−{tilde over (v)})),ω(v)=1/(1+N(v)).

[0071] where, u is a three-dimensional point on the complete model corresponding to the {tilde over (v)}, n.sub.u is its normal vector, sgn(⋅) is a sign function and determined by positive and negative SDF values. Through the SDF value and updating the weight, a fusion is performed according to a fusion strategy:

[00008] $D (v) \leftarrow \frac{D (v) W (v) + d (v) w (v)}{W (v) + w (v)}, W (v) \leftarrow W (v) + w (v) .$

[0072] then through a marching cubes algorithm, a complete mesh model with geometric details can be obtained from TSDF volume.

[0073] According to the light field reconstruction method of a dynamic scene of embodiments of the present disclosure, a human segmentation result is obtained via a pre-trained semantic segmentation network, and an object segmentation result is obtained according to a pre-obtained scene background; multiple frames of depth maps of the human segmentation result and the object segmentation result are fused to obtain a geometric model, a complete human model is obtained according to a pre-trained human model completion network, and the models are registered by point cloud registration and the registered models are fused to obtain a fused object model, so as to obtain a complete human model with geometric details and the object model; motion of a rigid object is tracked through point cloud registration; the complete human model with geometric details is reconstructed through a human skeleton prior and non rigid point cloud tracking; and after obtaining a motion field of the complete human model with geometric details and motion of the rigid object, a fusion operation is performed in time sequence and a reconstructed human model and a reconstructed rigid object model are obtained through the fusion operation. The present disclosure improves the robustness of dynamic light field reconstruction in the human-object interaction scene, through point cloud tracking of the rigid object and reconstructing a three-dimensional model, then constraining the human skeleton tracking based on prior information of the model after obtaining the three-dimensional model. The prior of time sequence information and pose prior of human-object interaction collected are used to enhance the human skeleton tracking effect in the case of object occlusion, so as to obtain the robust human skeleton tracking in the case of occlusion, which realizes a light field reconstruction of a dynamic scene in the case of occlusion.

[0074] FIG. 4 is a block diagram of a light field reconstruction apparatus of a dynamic scene according to an embodiment of the present disclosure.

[0075] As illustrated in FIG. 4, the apparatus 10 includes a segmenting module 100, a registering module 200, a tracking module 300, a reconstructing module 400, and a fusing module 500.

[0076] The segmenting module 100 is configured to obtain a human segmentation result via a pre-trained semantic segmentation network, and obtain an object segmentation result according to a pre-obtained scene background.

[0077] The registering module 200 is configured to fuse multiple frames of depth maps of the human segmentation result and the object segmentation result to obtain a geometric model, obtain a complete human model according to a pre-trained human model completion network, and register the models by point cloud registration and fuse the registered models to obtain a fused object model, so as to obtain a complete human model with geometric details and the object model.

[0078] The tracking module 300 is configured to track motion of a rigid object through point cloud registration.

[0079] The reconstructing module 400 is configured to reconstruct the complete human model with geometric details through a human skeleton prior and non rigid point cloud tracking.

[0080] The fusing module 500 is configured to perform, after obtaining a motion field of the complete human model with geometric details and motion of the rigid object, a fusion operation in time sequence and obtaining a reconstructed human model and a reconstructed rigid object model through the fusion operation.

[0081] According to the light field reconstruction apparatus of a dynamic scene of embodiments of the present disclosure, with the segmenting module, a human segmentation result is obtained via a pre-trained semantic segmentation network, and an object segmentation result is obtained according to a pre-obtained scene background; with the registering module, multiple frames of depth maps of the human segmentation result and the object segmentation result are fused to obtain a geometric model, a complete human model is obtained according to a pre-trained human model completion network, and the models are registered by point cloud registration and the registered models are fused to obtain a fused object model, so as to obtain a complete human model with geometric details and the object model; with the tracking module, motion of a rigid object is tracked through point cloud registration; with the reconstructing module, the the complete human model with geometric details is reconstructed through a human skeleton prior and non rigid point cloud tracking; and with the fusing module, after obtaining a motion field of the complete human model with geometric details and motion of the rigid object, a fusion operation is performed in time sequence and a reconstructed human model and a reconstructed rigid object model are obtained through the fusion operation. The present disclosure improves the robustness of dynamic light field reconstruction in the human-object interaction scene, through point cloud tracking of the rigid object and reconstructing a three-dimensional model, then constraining the human skeleton tracking based on prior information of the model after obtaining the three-dimensional model. The prior of time sequence information and pose prior of human-object interaction collected are used to enhance the human skeleton tracking effect in the case of object occlusion, so as to obtain the robust human skeleton tracking in the case of occlusion, which realizes a light field reconstruction of a dynamic scene in the case of occlusion.

[0082] In addition, the terms “first” and “second” are only used for description purposes and cannot be understood as indicating or implying relative importance or implicitly indicating the number of indicated technical features. Thus, the features defined as “first” and “second” can explicitly or implicitly include at least one such feature. In the description of the present disclosure, “multiple” means at least two, such as two, three, etc., unless otherwise specifically defined.

[0083] In the description of this specification, reference to the description of the terms “an embodiment”, “some embodiments”, “example”, “specific example”, or “some examples” means that the specific features, structures, materials, or features described in combination with this embodiment or example are included in at least one embodiment or example of the present disclosure. In this specification, the illustrative expression of the above terms need not refer to the same embodiments or examples. Furthermore, the specific features, structures, materials, or features described may be combined in an appropriate manner in any one or more embodiments or examples. In addition, those skilled in the art can combine and combine different embodiments or examples described in this specification and the features of different embodiments or examples without contradiction.

[0084] Although the embodiments of the present disclosure have been shown and described above, it can be understood that the above embodiments are exemplary and cannot be understood as a limitation of the present disclosure. Those skilled in the art can change, modify, replace and transform the above embodiments within the scope of the present disclosure.

LIGHT FIELD RECONSTRUCTION METHOD AND APPARATUS OF A DYNAMIC SCENE

Inventors

Cpc classification

Classification Explorer

G06T7/20

PHYSICS

Classification Explorer

G06T7/251

PHYSICS

Classification Explorer

G06T7/557

PHYSICS

Classification Explorer

G06T2200/08

PHYSICS

Classification Explorer

G06T2207/10024

PHYSICS

Classification Explorer

G06T2207/20084

PHYSICS

Classification Explorer

G06T2207/10052

PHYSICS

Classification Explorer

G06T2207/20081

PHYSICS

Classification Explorer

G06T15/506

PHYSICS

Classification Explorer

G06T17/00

PHYSICS

Classification Explorer

G06T2207/10028

PHYSICS

Classification Explorer

G06T2207/30196

PHYSICS

International classification

Classification Explorer

G06T7/20

PHYSICS

Classification Explorer

G06T15/50

PHYSICS

Abstract

Claims

Description