6D POSE AND SHAPE ESTIMATION METHOD

20230080133 · 2023-03-16


    Abstract

    A computer-implemented method of estimating a 6D pose and shape of one or more objects from a 2D image comprises the steps of: detecting, within the 2D image, one or more 2D regions of interest, each 2D region of interest containing a corresponding object among the one or more objects; cropping out a corresponding pixel value array, coordinate tensor, and feature map for each 2D region of interest; concatenating the corresponding pixel value array, coordinate tensor, and feature map for each 2D region of interest; and inferring, for each 2D region of interest, a 4D quaternion describing a rotation of the corresponding object in the 3D rotation group, a 2D centroid, which is a projection of a 3D translation of the corresponding object onto a plane of the 2D image given a camera matrix associated to the 2D image, a distance from a viewpoint of the 2D image to the corresponding object, a size, and a class-specific latent shape vector of the corresponding object.

    Claims

    1. A computer-implemented method of estimating 3D position, orientation and shape of one or more objects from a 2D image, the method comprising the steps of: detecting, within the 2D image, one or more 2D regions of interest, each 2D region of interest containing a corresponding object among the one or more objects; cropping out a corresponding pixel value array, coordinate tensor, and feature map for each 2D region of interest; concatenating the corresponding pixel value array, coordinate tensor, and feature map for each 2D region of interest; inferring, for each 2D region of interest, a 4D quaternion describing a rotation of the corresponding object in the 3D rotation group, a 2D centroid, which is a projection of a 3D translation of the corresponding object onto a plane of the 2D image given a camera matrix associated to the 2D image, a distance from a viewpoint of the 2D image to the corresponding object, a size, and a class-specific latent shape vector of the corresponding object.

    2. The computer-implemented method according to claim 1, wherein the step of cropping out a corresponding pixel value array, coordinate tensor, and feature map for each 2D region of interest also comprises resizing them into a uniform array size.

    3. The computer-implemented method according to claim 2, further comprising a step of back projecting the 2D centroid using the distance from the viewpoint and the camera matrix to compute the 3D translation.

    4. The computer-implemented method according to claim 3, wherein the 4D quaternion describes the rotation in an allocentric projection space and the method further comprises a step of computing an egocentric projection using the 4D quaternion and the 3D translation.

    5. The computer-implemented method according to claim 1, wherein the class-specific latent shape vector represents an offset from a mean latent shape representation of a corresponding object class, and the method further comprises the step of: adding the class-specific latent shape vector to the mean latent shape representation of the corresponding object class to obtain an absolute shape vector of the corresponding object.

    6. The computer-implemented method according to claim 5, comprising a further step of reconstructing an unscaled 3D point cloud, from the absolute shape vector, using a separately trained decoder neural network.

    7. The computer-implemented method according to claim 6, further comprising a step of scaling the unscaled 3D point cloud, using the inferred size to obtain a scaled 3D point cloud of the corresponding object.

    8. The computer-implemented method according to claim 7, wherein the method further comprises a step of meshing the scaled 3D point cloud to generate a triangle mesh of the scaled 3D shape.

    9. The computer-implemented method according to claim 8, wherein the method further comprises a step of merging mesh triangles of the triangle mesh, using a ball pivoting algorithm, to fill any remaining hole in the triangle mesh.

    10. The computer-implemented method according to claim 9, wherein the method further comprises a step of applying a Laplacian filter to the triangle mesh to generate a smoothed scaled 3D shape (M) of the corresponding object.

    11. The computer-implemented method according to claim 1, wherein the one or more 2D regions of interest are detected within the 2D image using a feature pyramid network.

    12. The computer-implemented method according to claim 11, further comprising a step of classifying each 2D region of interest using a fully convolutional neural network attached to each level of the feature pyramid network.

    13. The computer-implemented method according to claim 12, further comprising a step of regressing a boundary of each 2D region of interest towards the corresponding object using another fully convolutional neural network attached to each level of the feature pyramid network.

    14. The computer-implemented method according to claim 1, wherein the step of inferring, for each 2D region of interest, the 4D quaternion, 2D centroid, distance, size, and class-specific latent shape vector of the corresponding object is carried out using a separate neural network for each one of the 4D quaternion, 2D centroid, distance, size, and class-specific latent shape vector.

    15. The computer-implemented method according to claim 14, wherein each separate neural network for inferring the 4D quaternion, 2D centroid, distance, size, and class-specific latent shape vector comprises multiple 2D convolution layers, each followed by a batch normalization layer and a rectified linear unit activation layer, and a fully-connected layer at the end of the separate neural network.

    16. The computer-implemented method according to claim 15, wherein each one of the separate neural networks for inferring the 4D quaternion and distance comprises four 2D convolution layers followed each by a batch normalization layer and a rectified linear unit activation layer, whereas each one of the separate neural networks for inferring the 2D centroid, size and class-specific latent shape vector comprises only two 2D convolution layers followed each by a batch normalization layer and a rectified linear unit activation layer.

    17. The computer-implemented method according to claim 1, wherein the 2D image is in the form of a pixel array with at least one value for each pixel.

    18. The computer-implemented method according to claim 17, wherein the pixel array has an intensity value for each of three colors for each pixel.

    19-20. (canceled)

    21. A data processing device programmed to carry out a computer-implemented method of estimating 3D position, orientation and shape of one or more objects from a 2D image, the method comprising the steps of: detecting, within the 2D image, one or more 2D regions of interest, each 2D region of interest containing a corresponding object among the one or more objects; cropping out a corresponding pixel value array, coordinate tensor, and feature map for each 2D region of interest; concatenating the corresponding pixel value array, coordinate tensor, and feature map for each 2D region of interest; inferring, for each 2D region of interest, a 4D quaternion describing a rotation of the corresponding object in the 3D rotation group, a 2D centroid, which is a projection of a 3D translation of the corresponding object onto a plane of the 2D image given a camera matrix associated to the 2D image, a distance from a viewpoint of the 2D image to the corresponding object, a size, and a class-specific latent shape vector of the corresponding object.

    22. A system comprising a data processing device programmed to carry out a computer-implemented method of estimating 3D position, orientation and shape of one or more objects from a 2D image, and an imaging device connected to input the 2D image to the data processing device, wherein the computer-implemented method comprises the steps of: detecting, within the 2D image, one or more 2D regions of interest, each 2D region of interest containing a corresponding object among the one or more objects; cropping out a corresponding pixel value array, coordinate tensor, and feature map for each 2D region of interest; concatenating the corresponding pixel value array, coordinate tensor, and feature map for each 2D region of interest; inferring, for each 2D region of interest, a 4D quaternion describing a rotation of the corresponding object in the 3D rotation group, a 2D centroid, which is a projection of a 3D translation of the corresponding object onto a plane of the 2D image given a camera matrix associated to the 2D image, a distance from a viewpoint of the 2D image to the corresponding object, a size, and a class-specific latent shape vector of the corresponding object.

    23. The system of claim 22, further comprising a robotic manipulator connected to the data processing device, wherein the data processing device is also programmed to control the manipulator based on the estimated 3D position, orientation and shape of each object in the 2D image.

    24. The system of claim 22, further comprising propulsion, steering and/or braking devices wherein the data processing device is also programmed to control and/or assist control of the propulsion, steering and/or braking devices based on the estimated 3D position, orientation and shape of each object in the 2D image.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    [0027] The present disclosure may be more completely understood in consideration of the following detailed description of various embodiments in connection with the accompanying drawings, in which:

    [0028] FIG. 1 illustrates schematically a robotic application;

    [0029] FIG. 2 illustrates schematically an autonomous driving application;

    [0030] FIG. 3 illustrates schematically a 6D pose and shape estimating method;

    [0031] FIG. 4 illustrates schematically a step of detecting regions of interest in an image;

    [0032] FIG. 5 illustrates egocentric and allocentric projections of an object translated across an image plane;

    [0033] FIG. 6 illustrates a meshing and smoothing step performed on a point cloud representing a 3D shape; and

    [0034] FIG. 7 illustrates a learning step in which an estimated 6D pose and shape is compared to a ground truth point cloud.

    [0035] While the present disclosure is amenable to various modifications and alternative forms, specifics thereof have been shown by way of example in the drawings and will be described in detail. It should be understood, however, that the intention is not to limit aspects of the present disclosure to the particular embodiments described. On the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the scope of the present disclosure.

    DETAILED DESCRIPTION

    [0036] For the following defined terms, these definitions shall be applied, unless a different definition is given in the claims or elsewhere in this specification.

    [0037] As used in this specification and the appended claims, the singular forms “a”, “an”, and “the” include plural referents unless the content clearly dictates otherwise. As used in this specification and the appended claims, the term “or” is generally employed in its sense including “and/or” unless the content clearly dictates otherwise.

    [0038] The following detailed description should be read with reference to the drawings in which similar elements in different drawings are numbered the same. The detailed description and the drawings, which are not necessarily to scale, depict illustrative embodiments and are not intended to limit the scope of the present disclosure. The illustrative embodiments depicted are intended only as exemplary. Selected features of any illustrative embodiment may be incorporated into an additional embodiment unless clearly stated to the contrary.

    [0039] As illustrated in FIGS. 1 and 2, a system 100 may comprise a data processing device 101 and an imaging device 102. The imaging device 102 may be a camera configured to capture 2D images, and more particularly a color camera, which may be configured to capture each 2D image as an RGB pixel array with an intensity value for each one of three colors (red, green and blue) for each pixel. The imaging device 102 may be further connected to the data processing device 101 for inputting these 2D images to the data processing device 101. The data processing device 101 may be a computer programmed to estimate 3D position, orientation and shape of one or more objects 103 from a 2D image captured by the imaging device 102.

    [0040] As in the example shown in FIG. 1, the system 100 may further comprise a robotic manipulator 104 connected to the data processing device 101, which may further be programmed to control the robotic manipulator 104 according to the estimated 3D position, orientation and shape of one or more objects 103, for example for handling those objects 103 and/or working on them.

    [0041] Alternatively or additionally, however, the system 100 may be a vehicle, as illustrated in FIG. 2, further comprising propulsion, steering and/or braking devices 105, 106 and 107, and the data processing device 101 may be programmed to control and/or assist control of the propulsion, steering and/or braking devices 105, 106 and 107 according to the estimated 3D position, orientation and shape of one or more objects 103, for example for collision prevention in autonomous driving or driving assistance.

    [0042] In either case, the data processing device 101 may be programmed to estimate the 3D position, orientation and shape of each object 103 according to the method illustrated in FIG. 3. In a detection step 1 of a first stage S1 of this method, the data processing device 101 may compute a feature map FM of the 2D image received from the imaging device 102, as well as, for each object 103 within the 2D image I, a 2D region of interest ROI containing the corresponding object 103. This may be carried out, as illustrated in FIG. 4, using a feature pyramid network FPN with first and second fully-convolutional networks FCN1, FCN2 attached to each level of the feature pyramid network FPN, as for example in the RetinaNet detector described by Lin et al. in “Focal Loss for Dense Object Detection”, arXiv:1708.02002v2. The first fully-convolutional networks FCN1 may be used for classifying each 2D region of interest, and the second fully-convolutional networks FCN2 for regressing a boundary of each 2D region of interest towards the corresponding object 103.

    [0043] For each 2D region of interest ROI, a corresponding cropped pixel value array I′, cropped coordinate tensor C′, and cropped feature map FM′ may then be cropped out from, respectively, the 2D image I, a coordinate tensor C for the whole 2D image I, and the feature map FM, using for example an ROI align operator. The coordinate tensor C may have two channels in which each pixel is filled with its corresponding image coordinates, as described by Liu et al. in “An intriguing failing of convolutional neural networks and the coordconv solution”, NeurIPS, 2018. The cropped coordinate tensor C′ thus retains the location of the crop so that the global context is not lost. The cropped pixel value array I′, cropped coordinate tensor C′, and cropped feature map FM′ for each region of interest may then be concatenated in a concatenation step 3, before being inputted to a predictor in a second stage S2.
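    The cropping and concatenation of steps 2 and 3 may be sketched as follows, assuming a channels-first NumPy array layout; plain slicing stands in for the ROI align operator, and all array sizes are illustrative only:

```python
import numpy as np

def make_coord_tensor(h, w):
    """Two-channel tensor in which each pixel holds its own (x, y)
    image coordinates, in the spirit of CoordConv (Liu et al., 2018)."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return np.stack([xs, ys], axis=0).astype(np.float32)  # shape (2, h, w)

def crop_and_concat(image, coords, feat, box):
    """Crop the image (3, H, W), coordinate tensor (2, H, W) and feature
    map (C, H, W) to an axis-aligned box (x0, y0, x1, y1) and concatenate
    along the channel axis. A real pipeline would use an ROI align
    operator with resizing to a uniform array size."""
    x0, y0, x1, y1 = box
    crops = [t[:, y0:y1, x0:x1] for t in (image, coords, feat)]
    return np.concatenate(crops, axis=0)

image = np.random.rand(3, 8, 8).astype(np.float32)   # I
coords = make_coord_tensor(8, 8)                     # C
feat = np.random.rand(16, 8, 8).astype(np.float32)   # FM
roi = crop_and_concat(image, coords, feat, (2, 2, 6, 6))
print(roi.shape)  # 3 + 2 + 16 channels, 4x4 spatial crop
```

    Note that the cropped coordinate channels still contain the absolute image coordinates of the crop, which is how the global context is preserved.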

    [0044] The predictor of this second stage S2 may infer a 4D quaternion q describing a rotation of the corresponding object in the 3D rotation group SO(3), projected in the allocentric space; a 2D centroid (x, y), which is a projection of a 3D translation of the corresponding object onto a plane of the 2D image I given a camera matrix K associated to the imaging device 102; a distance z from a viewpoint of the 2D image to the corresponding object; a metric size (w, h, l); and a class-specific latent shape vector F.sub.Shape(FM′) of the corresponding object 103. The class-specific latent shape vector F.sub.Shape(FM′) may represent an offset of the shape of the corresponding object 103 from a mean latent shape representation m.sub.c of a corresponding object class c. Since this information is inferred from a cropped region of interest ROI, the allocentric representation is favored for the 4D quaternion q, as it is viewpoint invariant under 3D translation of the corresponding object 103. This is shown in FIG. 5, which compares the changing appearance of such an object 103 under the egocentric representation, when subject to a 3D translation lateral to the image plane, with the unchanging appearance of the same object 103 under the allocentric representation, when subject to the same 3D translation.

    [0045] This predictor may have separate branches for inferring each of the 4D quaternion q, 2D centroid (x, y), distance z, size (w, h, l), and class-specific latent shape vector F.sub.Shape(FM′), each branch being an artificial neural network containing successive 2D convolutions 4, each followed by a batch normalization layer 5 and a Rectified Linear Unit activation layer 6, and ending with a fully connected layer 7. Since features from the feature pyramid network FPN in the first stage S1 are forwarded, within the cropped feature maps FM′, to the second, predictor stage S2, a relatively small number of layers may be used for these inferences. Using separate lifting modules for each object class may thus not require excessive processing. This further improves performance, since 6D poses and scaled shapes from different classes do not interfere during optimization. As shown in FIG. 3, the branches for inferring the 4D quaternion q and the distance z may comprise more layers than the rest, for example four successive 2D convolutions 4, each followed by a batch normalization layer 5 and a Rectified Linear Unit activation layer 6, as opposed to just two for the rest. However, smaller numbers of layers may be envisaged for each branch.

    [0046] The 2D centroid (x,y) may be back projected into 3D space using, in a back projecting step 8, the distance z and the camera matrix K, so as to obtain a 3D translation t of the corresponding object 103 with respect to the viewpoint of the imaging device 102. In an egocentric projection computing step 9, the egocentric projection P of the corresponding object 103 with respect to the viewpoint and orientation of the imaging device 102 may be computed from the 3D translation t and the 4D quaternion q describing the rotation in the allocentric space, thus finally obtaining the 6D pose of the corresponding object 103 with respect to the imaging device 102.
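    The back projecting step 8 may be sketched as follows; the pinhole camera matrix K shown is an assumed example, and the distance z is treated here as the depth along the optical axis:

```python
import numpy as np

def back_project(centroid_2d, z, K):
    """Back project a predicted 2D centroid (x, y) into 3D using the
    predicted distance z and the camera matrix K, recovering the 3D
    translation t of the object: t = z * K^-1 (x, y, 1)^T."""
    x, y = centroid_2d
    ray = np.linalg.inv(K) @ np.array([x, y, 1.0])
    return z * ray

# Assumed pinhole intrinsics: focal length 500 px, principal point (320, 240).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
t = back_project((420.0, 300.0), 2.5, K)
print(t)  # -> [0.5 0.3 2.5]; the z component equals the predicted distance
```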

    [0047] The class-specific latent shape vector F.sub.Shape(FM′) inferred for the corresponding object may be applied, in an adding step 10, to the mean latent shape representation m.sub.c associated with the corresponding object class, to obtain a shape encoding e of the corresponding object. In a decoding step 11, a decoder neural network D.sub.c for that object class c may be used to reconstruct an unscaled 3D point cloud p, according to the equation:

    [00001] p = D_c(e)

    [0048] This unscaled 3D point cloud p may then be brought to scale in a scaling step 12, using the inferred size (w, h, l) of the corresponding object 103, to obtain a scaled 3D point cloud (w, h, l) ⊙ p.
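    The adding step 10, decoding step 11 and scaling step 12 may be sketched as follows; the latent dimensions, the random linear map standing in for the trained decoder neural network D.sub.c, and the size values are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions: 8-D latent space, 16-point clouds.
mean_latent = rng.normal(size=8)        # m_c for one object class
predicted_offset = rng.normal(size=8)   # F_Shape(FM') from the predictor

# Step 10: absolute shape encoding e = m_c + predicted offset.
e = mean_latent + predicted_offset

# Step 11: a random linear stand-in for the trained decoder D_c
# mapping the encoding e to an unscaled 3D point cloud p.
D_c = rng.normal(size=(16 * 3, 8))
p = (D_c @ e).reshape(16, 3)

# Step 12: per-axis scaling by the inferred metric size (w, h, l).
size = np.array([0.4, 0.3, 0.8])  # assumed (w, h, l) in metres
p_scaled = size * p               # element-wise (w, h, l) * p
print(p_scaled.shape)
```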

    [0049] In a meshing step 13, a metric triangle mesh TM may then be generated based on the scaled 3D point cloud (w, h, l) ⊙ p, as shown in FIG. 6. A ball pivoting algorithm may then be used in a next step to merge mesh triangles of the triangle mesh so as to fill any remaining holes in the triangle mesh TM. Finally, in another step, a Laplacian filter may be applied to the triangle mesh TM to generate a smoothed metric 3D shape M of the corresponding object 103, providing a realistic representation, for example for augmented reality applications.
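    A minimal sketch of the Laplacian filtering step is given below, using a simple umbrella operator over mesh neighbours; the toy mesh, the damping factor and the iteration count are assumptions, and a production pipeline would typically rely on an existing geometry processing library:

```python
import numpy as np

def laplacian_smooth(vertices, faces, lam=0.5, iterations=5):
    """Umbrella-operator Laplacian filter: repeatedly move each vertex
    toward the average of its mesh neighbours, damped by lam."""
    n = len(vertices)
    neighbours = [set() for _ in range(n)]
    for a, b, c in faces:
        neighbours[a].update((b, c))
        neighbours[b].update((a, c))
        neighbours[c].update((a, b))
    v = vertices.astype(np.float64).copy()
    for _ in range(iterations):
        means = np.array([v[list(nb)].mean(axis=0) if nb else v[i]
                          for i, nb in enumerate(neighbours)])
        v = v + lam * (means - v)
    return v

# A noisy planar patch of two triangles: smoothing flattens the z-noise.
verts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.3],
                  [1.0, 1.0, -0.2], [0.0, 1.0, 0.1]])
faces = [(0, 1, 2), (0, 2, 3)]
smoothed = laplacian_smooth(verts, faces)
print(smoothed.shape)
```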

    [0050] The artificial neural networks may be jointly trained to infer shape and pose. For this, a low-dimensional latent space representation S.sub.c of each object class c may be learned, which is a set of all latent space representations s.sub.c of training shapes associated to that object class c. During inference, this allows for reconstruction of a 3D model by predicting just a few shape parameters, thus improving generalization. A method such as, for example, the AtlasNet method disclosed by T. Groueix et al. in “AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation”, CVPR 2018, may be used to learn this latent space representation S.sub.c of an object class c by taking as input a complete point cloud and encoding it into a global shape descriptor. A 3D shape can then be reconstructed by concatenating that descriptor with points sampled from a 2D uv-map and feeding the result to a decoder network. This approach decouples the number of predicted points from that in the original training shapes, thus enabling the reconstruction of shapes with arbitrary resolution. One such auto-encoder and decoder network may be trained separately for each object class on a dataset of 3D shapes, such as for example a subset of the ShapeNet repository described by A. X. Chang et al. in “ShapeNet: An Information-Rich 3D Model Repository”, arXiv:1512.03012, 2015. Each auto-encoder and decoder network may thus learn a class-specific distribution of valid shapes in latent space.

    [0051] The latent space representation S.sub.c of each object class c can thus be computed using the encoder part E.sub.c of the auto-encoder and decoder network. The mean latent shape representation m.sub.c for each object class may then be computed as follows:

    [00002] m_c = (1/|S_c|) Σ_{s_c ∈ S_c} s_c
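    The mean latent shape computation may be sketched as follows, with random stand-ins for the encoder outputs; the number of shapes and the latent dimension are assumptions:

```python
import numpy as np

# Assumed encodings s_c of the training shapes of one class c, as would
# be produced by the encoder part E_c (5 shapes, 8-D latents).
S_c = np.random.default_rng(1).normal(size=(5, 8))

# Mean latent shape representation m_c = (1/|S_c|) * sum of all s_c.
m_c = S_c.sum(axis=0) / len(S_c)
print(m_c.shape)
```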

    [0052] The decoder part D.sub.c of the trained auto-encoder and decoder network may then be frozen to be used as the decoder neural network for reconstructing the unscaled 3D point cloud p in the pose and shape estimation method described above.

    [0053] To encourage the predictions to stay inside of the learnt shape distribution, a shape regularization loss may be employed. That is, assuming the training’s per-class shape encodings s.sub.c span a convex shape space S.sub.c, the decoder neural network may be penalized for any predicted shape encoding e outside of the shape space S.sub.c, and such outlying predictions be projected onto the boundary ∂S.sub.c of the shape space S.sub.c. In practice, a predicted shape encoding e may be considered to lie within the shape space, i.e. e ∈ S.sub.c, if:

    [00003] min_{s_{c,i}, s_{c,j} ∈ S_c, i ≠ j} (e − s_{c,i})^T (e − s_{c,j}) ≤ 0

    [0054] It may thus be checked whether there are two vectors in the embedding which enclose the predicted shape encoding e. Outlying encodings may then be projected on the line connecting the two closest training shape encodings s.sub.c,1, s.sub.c,2 within the shape space S.sub.c. These two training shape encodings s.sub.c,1, s.sub.c,2 may be retrieved by computing the Euclidean distance of the regressed encoding to all elements of S.sub.c and taking the two elements with the smallest distance. The error then amounts to the length of the projection

    [00004] π(e | s_{c,1}, s_{c,2})

    according to the equation:

    [00005] π(e | s_{c,1}, s_{c,2}) = (s_{c,1} − e) − [((s_{c,1} − e)^T (s_{c,2} − e)) / ((s_{c,2} − e)^T (s_{c,2} − e))] (s_{c,2} − e)

    [0055] The final loss L.sub.reg for one sample shape encoding e then amounts to:

    [00006] L_reg(e, S_c, s_{c,1}, s_{c,2}) = 0, if e ∈ S_c; ‖π(e | s_{c,1}, s_{c,2})‖_2, otherwise
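    The shape regularization loss may be sketched as follows; the toy 2D encodings are assumptions, and the enclosure test and projection residual follow the equations above, with the exact choice of norm an assumption:

```python
import numpy as np

def shape_regularization_loss(e, S_c):
    """Zero loss if the predicted encoding e is enclosed by two training
    encodings (the [00003] test); otherwise the length of the residual
    of (s1 - e) projected off (s2 - e), with s1, s2 the two nearest
    training encodings by Euclidean distance."""
    diffs = S_c - e                           # rows are (s_c,i - e)
    gram = diffs @ diffs.T
    off_diag = gram[~np.eye(len(S_c), dtype=bool)]
    if np.any(off_diag <= 0):                 # exists i != j enclosing e
        return 0.0
    order = np.argsort(np.linalg.norm(diffs, axis=1))
    d1, d2 = diffs[order[0]], diffs[order[1]]  # (s1 - e), (s2 - e)
    proj = d1 - (d1 @ d2) / (d2 @ d2) * d2     # residual per [00005]
    return float(np.linalg.norm(proj))

S_c = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 2.0]])
print(shape_regularization_loss(np.array([1.0, 0.5]), S_c))  # inside: 0.0
print(shape_regularization_loss(np.array([5.0, 5.0]), S_c))  # outside: > 0
```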

    [0056] The regressed 3D shape may be directly aligned with the scene in a fully differentiable manner. Provided the 6D pose as 4D quaternion q and 3D translation t, together with the unscaled 3D point cloud p and the size (w, h, l), the scaled shape p.sub.3D may be computed and transformed to 3D at the corresponding pose, using the camera matrix K, according to the formula:

    [00007] p_3D = q((w, h, l) ⊙ p)q^{−1} + K^{−1}(xz, yz, z)^T
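    The transformation of the scaled shape to the scene pose may be sketched as follows; an identity camera matrix and a 90° rotation about the z-axis are assumed to keep the example easy to verify by hand:

```python
import numpy as np

def quat_rotate(q, pts):
    """Rotate points by unit quaternion q = (w, x, y, z), i.e. q p q^-1,
    via the equivalent rotation matrix."""
    w, x, y, z = q
    R = np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])
    return pts @ R.T

def to_scene(p, size, q, centroid_2d, z, K):
    """p_3D = q((w, h, l) . p)q^-1 + K^-1 (xz, yz, z)^T, per [00007]."""
    x, y = centroid_2d
    t = np.linalg.inv(K) @ np.array([x * z, y * z, z])
    return quat_rotate(q, size * p) + t

K = np.eye(3)  # assumed identity intrinsics for the sketch
p = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
q = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])  # 90° about z
p3d = to_scene(p, np.array([1.0, 1.0, 1.0]), q, (0.0, 0.0), 2.0, K)
print(np.round(p3d, 6))  # [1,0,0] -> [0,1,2]; [0,1,0] -> [-1,0,2]
```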

    [0057] The alignment may then be simply measured, using a chamfer distance as shown in FIG. 7, against a point cloud p̂_3D representing the ground truth, which may for instance be computed by uniformly sampling a plurality of points, for example 2048 points, from a CAD model on whose basis the learning is being carried out, to form a point cloud p̂, and then applying to it the rotation R̂ and translation t̂ corresponding to its 6D pose in the learning scene, according to the formula:

    [00008] p̂_3D = R̂p̂ + t̂

    [0058] Eventually, the final loss L can be calculated as:

    [00009] L(p_3D, p̂_3D) := (1/|p_3D|) Σ_{v ∈ p_3D} min_{v̂ ∈ p̂_3D} ‖v − v̂‖² + (1/|p̂_3D|) Σ_{v̂ ∈ p̂_3D} min_{v ∈ p_3D} ‖v − v̂‖²

    wherein v and v̂ denote, respectively, vertices of the point clouds p_3D and p̂_3D.
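    The symmetric chamfer distance of the final loss L may be sketched as follows, on two tiny assumed point clouds:

```python
import numpy as np

def chamfer(p, p_hat):
    """Symmetric chamfer distance per [00009]: for each cloud, the mean
    squared distance from each vertex to its nearest neighbour in the
    other cloud, summed over both directions."""
    # Pairwise squared distances, shape (|p|, |p_hat|).
    d2 = ((p[:, None, :] - p_hat[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

p = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])      # predicted p_3D
p_hat = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 1.0]])  # ground truth
print(chamfer(p, p_hat))  # -> 1.0
```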

    [0059] Since applying the chamfer distance directly may turn out to be unstable due to local minima, a warm-up training may first be carried out in which an L.sub.1-norm between each component and the ground truth is computed, using for example the method disclosed by A. Kendall et al. in “Multi-task learning using uncertainty to weigh losses for scene geometry and semantics”, CVPR, 2018, to weight the different terms. Additionally, the predictions may be disentangled and the 3D point cloud loss computed separately for each parameter.

    [0060] Those skilled in the art will recognize that the present disclosure may be manifested in a variety of forms other than the specific embodiments described and contemplated herein. For example, the described 6D pose and shape estimating method may be applied over a sequence of successive 2D images I, estimating 3D position, orientation and shape of the one or more objects 103 from each 2D image I of the sequence of successive 2D images, in order to track the one or more objects over the sequence of successive 2D images. Accordingly, departure in form and detail may be made without departing from the scope of the present disclosure as described in the appended claims.