METHOD AND SYSTEM FOR HEAD POSE ESTIMATION

20210165999 · 2021-06-03

    Abstract

    A method for head pose estimation using a monocular camera. The method includes: providing an initial image frame recorded by the camera showing a head; and performing at least one pose updating loop with the following steps: identifying and selecting a plurality of salient points of the head having 2D coordinates in the initial image frame within a region of interest; determining, using a geometric head model of the head, 3D coordinates for the selected salient points corresponding to a head pose of the head model; providing an updated image frame recorded by the camera showing the head; identifying within the updated image frame at least some previously selected salient points having updated 2D coordinates; updating the head pose by determining updated 3D coordinates corresponding to the updated 2D coordinates using a perspective-n-point method; and using the updated image frame as the initial image frame for the next pose updating loop.

    Claims

    1. A method for head pose estimation using a monocular camera, the method comprising: providing an initial image frame recorded by the camera showing a head; and performing at least one pose updating loop with the following steps: identifying and selecting a plurality of salient points of the head having 2D coordinates in the initial image frame within a region of interest; using a geometric head model of the head, determining 3D coordinates for the selected salient points corresponding to a head pose of the geometric head model; providing an updated image frame recorded by the camera showing the head; identifying within the updated image frame at least some previously selected salient points having updated 2D coordinates; updating the head pose by determining updated 3D coordinates corresponding to the updated 2D coordinates using a perspective-n-point method; and using the updated image frame as the initial image frame for the next pose updating loop.

    2. The method of claim 1, wherein before performing the at least one pose updating loop, a distance between the camera and the head is determined.

    3. The method of claim 1, wherein before performing the at least one pose updating loop, dimensions of the head model are determined.

    4. The method of claim 1, wherein the head model is a cylindrical head model.

    5. The method of claim 1, wherein a plurality of consecutive pose updating loops are performed.

    6. The method of claim 1, wherein previously selected salient points are identified using optical flow.

    7. The method of claim 1, wherein the 3D coordinates are determined by projecting 2D coordinates from an image plane of the camera onto a visible head surface.

    8. The method of claim 7, wherein the visible head surface is determined by determining the intersection of a boundary plane with a model head surface.

    9. The method of claim 8, wherein the boundary plane is parallel to an X-axis of the camera and a center axis of the cylindrical head model.

    10. The method of claim 7, wherein the region of interest is defined by projecting the visible head surface onto the image plane.

    11. The method of claim 1, wherein the salient points are selected based on an associated weight which depends on the distance to a border of the region of interest.

    12. The method of claim 11, wherein the perspective-n-point method is performed based on the weight of the salient points.

    13. The method of claim 1, wherein in each pose updating loop, the region of interest is updated.

    14. A system for head pose estimation, comprising a monocular camera and a processing device, which is configured to: receive an initial image frame recorded by the camera showing a head; and perform at least one pose updating loop with the following steps: identifying and selecting a plurality of salient points of the head having 2D coordinates in the initial image frame within a region of interest; determining, using a geometric head model of the head, 3D coordinates for the selected salient points corresponding to a head pose of the head model; receiving an updated image frame recorded by the camera showing the head; identifying within the updated image frame at least some previously selected salient points having updated 2D coordinates; updating the head pose by determining updated 3D coordinates corresponding to the updated 2D coordinates using a perspective-n-point method; and using the updated image frame as the initial image frame for the next pose updating loop.

    15. The system of claim 14, wherein the system is adapted to determine a distance between the camera and the head before performing the at least one pose updating loop.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    [0038] Further details and advantages of the present invention will be apparent from the following detailed description of non-limiting embodiments with reference to the attached drawings, wherein:

    [0039] FIG. 1 is a schematic representation of an inventive system and a head;

    [0040] FIG. 2 is a flowchart illustrating an embodiment of the inventive method;

    [0041] FIG. 3 illustrates a first initialization step of the method of FIG. 2;

    [0042] FIG. 4 illustrates a second initialization step of the method of FIG. 2; and

    [0043] FIG. 5 illustrates a sequence of steps of the method of FIG. 2.

    DESCRIPTION OF THE ILLUSTRATED EMBODIMENTS

    [0044] FIG. 1 schematically shows a system 1 for head pose estimation according to an embodiment of the invention and a head 10 of a person. The system 1 comprises a monocular camera 2 which may be characterized by a vertical Y-axis, a horizontal Z-axis, which corresponds to the optical axis, and an X-axis which is perpendicular to the drawing plane of FIG. 1. The camera 2 is connected (by wire or wirelessly) to a processing device 3, which may receive image frames I.sub.0, I.sub.n, I.sub.n+1 recorded by the camera 2. The camera 2 is directed towards the head 10. The system 1 is configured to perform a method for head pose estimation, which will now be explained with reference to FIGS. 2 to 5.

    [0045] FIG. 2 is a flowchart illustrating one embodiment of the inventive method. After the start, an initial image frame I.sub.0 is recorded by the camera as shown in FIGS. 3 and 4. The “physical location” of any image frame corresponds to an image plane 2.1 of the camera 2. The initial image frame I.sub.0 is provided to the processing device 3. In a following step, the processing device 3 determines a distance Z.sub.eyes between the camera and the head 10, or rather between the camera and the baseline of the eyes, which (as illustrated by FIG. 3) is given by

    [00001] Z.sub.eyes=f·δ.sub.mm/δ.sub.px,

    with f being the focal length of the camera in pixels, δ.sub.px the estimated distance between the eyes' centers in the image frame I.sub.0, and δ.sub.mm the mean interpupillary distance, which corresponds to 64.7 mm for males and 62.3 mm for females according to anthropometric databases. As shown in FIGS. 3 to 5, the real head 10 is approximated by a cylindrical head model (CHM) 20. During initialization, the head 10 is assumed to be in a vertical position and facing the camera 2, so that the CHM 20 is also upright, with its center axis 23 parallel to the Y-axis of the camera 2. The center axis 23 runs through the centers C.sub.T, C.sub.B of the top and bottom bases of the CHM 20.
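    The distance estimate above can be illustrated with a minimal sketch. The focal length, pixel eye distance, and interpupillary distance used here are hypothetical values, not taken from the disclosure:

```python
# Illustrative sketch of the pinhole distance estimate Z_eyes = f * d_mm / d_px.
# All numeric values below are hypothetical.
F_PX = 900.0      # focal length f of the camera, in pixels (assumed)
DELTA_PX = 120.0  # measured distance between the eye centers in the image, in pixels (assumed)
DELTA_MM = 63.5   # assumed mean interpupillary distance in mm (between 62.3 and 64.7)

def eye_distance_mm(f_px: float, delta_px: float, delta_mm: float) -> float:
    """Estimate the camera-to-eyes distance Z_eyes = f * delta_mm / delta_px."""
    return f_px * delta_mm / delta_px

z_eyes = eye_distance_mm(F_PX, DELTA_PX, DELTA_MM)  # distance in mm
```

    The result scales inversely with the pixel distance: the farther the head, the smaller δ.sub.px and the larger Z.sub.eyes.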

    [0046] Z.sub.cam denotes the distance between the center of the CHM 20 and the camera 2 and is equal to the sum of Z.sub.eyes and the distance Z.sub.head from the center of the head 10 to the midpoint of the eyes' baseline. Z.sub.head is related to the radius r of the CHM by Z.sub.head=√(r.sup.2−(δ.sub.mm/2).sup.2). As shown in FIG. 4, the dimensions of the CHM 20 may be determined by a bounding box in the image frame, which defines a region of interest 30. The height of the bounding box corresponds to the height of the CHM 20, while the width of the bounding box corresponds to the diameter of the CHM 20. Of course, the respective quantities in the image frame I.sub.0 need to be scaled by a factor of

    [00002] δ.sub.mm/δ.sub.px

    in order to obtain the actual quantities in the 3D space. Given the 2D coordinates {p.sub.TL, p.sub.TR, p.sub.BL, p.sub.BR} of the top left, top right, bottom left and bottom right corners of the bounding box, the processing device 3 calculates

    [00003] r=(1/2)·|p.sub.TR−p.sub.TL|·δ.sub.mm/δ.sub.px.

    Similarly, the height h of the CHM 20 is calculated by

    [00004] h=|p.sub.TR−p.sub.BR|·δ.sub.mm/δ.sub.px.

    [0047] With Z.sub.cam determined (or estimated), the corners of the face bounding box in 3D space, i.e., {P.sub.TL, P.sub.TR, P.sub.BL, P.sub.BR} and the centers C.sub.T, C.sub.B of the top and bottom bases of the CHM 20 can be determined by projecting the corresponding 2D coordinates into 3D space and combining this with the information about Z.sub.cam.
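    The scaling and back-projection steps above can be sketched as follows. The bounding-box corners, focal length, scale factor, and Z.sub.cam are hypothetical values, and pixel coordinates are taken relative to the principal point:

```python
import numpy as np

F_PX = 900.0          # assumed focal length in pixels
SCALE = 63.5 / 120.0  # delta_mm / delta_px scale factor (hypothetical)

# Hypothetical bounding-box corners (pixels, relative to the principal point)
p_tl, p_tr = np.array([-80.0, -110.0]), np.array([80.0, -110.0])
p_bl, p_br = np.array([-80.0, 110.0]), np.array([80.0, 110.0])

# CHM radius and height: half the box width / the box height, scaled to mm
r = 0.5 * np.linalg.norm(p_tr - p_tl) * SCALE
h = np.linalg.norm(p_tr - p_br) * SCALE

def backproject(p, z_cam):
    """Lift a 2D pixel (relative to the principal point) to 3D at depth z_cam."""
    return np.array([p[0] * z_cam / F_PX, p[1] * z_cam / F_PX, z_cam])

Z_CAM = 520.0              # assumed distance to the CHM center, in mm
P_tl = backproject(p_tl, Z_CAM)  # one corner of the face bounding box in 3D
```

    The same back-projection applied to all four corners and the base centers yields {P.sub.TL, P.sub.TR, P.sub.BL, P.sub.BR} and C.sub.T, C.sub.B.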

    [0048] The steps described so far can be regarded as part of an initialization process. Once this is done, the method continues with the steps referring to the actual head pose estimation, which will now be described with reference to FIG. 5. The steps are part of a pose updating loop which is shown in the right half of FIG. 2.

    [0049] While FIG. 5 shows an initial image frame I.sub.n recorded by the camera 2 and provided to the processing device 3, this may be identical to the image frame I.sub.0 in FIGS. 3 and 4. According to one step of the method performed by the processing device 3, a plurality of salient points S are identified within the region of interest 30 and selected (indicated by the white-on-black numeral 1 in FIG. 5). Such salient points S are located in textured regions of the initial image frame I.sub.n and may be corners of an eye, of a mouth, of a nose or the like. In order to identify the salient points S, a suitable algorithm like FAST may be used. The salient points S are represented by 2D coordinates p.sub.i in the image frame I.sub.n. A weight is assigned to each salient point S which depends on a distance of the salient point S from a border 31 of the region of interest 30. The closer the respective salient point S is to the border 31, the lower its weight. It is possible that salient points S with the lowest weight are not selected, but discarded as being (rather) unreliable. This may serve to enhance the total performance of the method. It should be noted that the region of interest 30 comprises, apart from a facial region 32, several non-facial regions, e.g. a neck region 33, a head top region 34, a head side region 35 etc.
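    The weighting and selection step can be sketched as follows. The disclosure does not specify the weight function, so the linear distance-to-border weighting and the threshold `w_min` are assumptions for illustration:

```python
import numpy as np

def border_weights(points, roi):
    """Weight each 2D point by its distance to the nearest ROI border.

    points: (N, 2) array of (x, y); roi: (x_min, y_min, x_max, y_max).
    Points near the border get low weight, points near the center high weight
    (a linear weighting chosen here for illustration).
    """
    x_min, y_min, x_max, y_max = roi
    dx = np.minimum(points[:, 0] - x_min, x_max - points[:, 0])
    dy = np.minimum(points[:, 1] - y_min, y_max - points[:, 1])
    d = np.minimum(dx, dy)              # distance to the nearest border
    half = 0.5 * min(x_max - x_min, y_max - y_min)
    return np.clip(d / half, 0.0, 1.0)  # normalized to [0, 1]

def select_points(points, roi, w_min=0.1):
    """Discard points whose weight falls below w_min (deemed unreliable)."""
    w = border_weights(points, roi)
    keep = w >= w_min
    return points[keep], w[keep]
```

    Points detected near the ROI border (which often lie on non-facial regions) are thus either down-weighted or dropped entirely.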

    [0050] With the 2D coordinates pi of the selected salient points S known, corresponding 3D coordinates P.sub.i are determined (indicated by the white-on-black numeral 3 in FIG. 5). This is achieved by projecting the 2D coordinates onto a visible head surface 22 of the CHM 20. The visible head surface 22 is that part of a surface 21 of the CHM 20 that is considered to be visible for the camera 2. With the initial head pose of the CHM 20, the visible head surface 22 is one half of its side surface. The 3D coordinates P.sub.i may also be seen as the result of an intersection between a ray 40 starting at an optical center of the camera 2 and passing through the respective salient point S at the image plane 2.1, and the visible head surface 22 of the CHM 20. The equation of the ray 40 is defined as P=C+kV, with V being a vector parallel to the line that goes from the camera's optical center C through P. The scalar parameter k is computed by solving the quadratic equation of the geometric model.
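    The ray-cylinder intersection described above can be sketched as follows, with the CHM axis parallel to the Y-axis as in the initial pose. The camera center, ray direction, and cylinder parameters in the test values are hypothetical:

```python
import numpy as np

def ray_cylinder_intersection(C, V, axis_xz, r):
    """Intersect the ray P = C + k*V with an upright cylinder of radius r whose
    center axis is parallel to the Y-axis and passes through (x0, z0) = axis_xz.
    Returns the nearest intersection point (the surface facing the camera),
    or None if the ray misses the cylinder."""
    x0, z0 = axis_xz
    # Quadratic a*k^2 + b*k + c = 0 obtained by restricting to the X-Z plane
    a = V[0] ** 2 + V[2] ** 2
    b = 2.0 * ((C[0] - x0) * V[0] + (C[2] - z0) * V[2])
    c = (C[0] - x0) ** 2 + (C[2] - z0) ** 2 - r ** 2
    disc = b * b - 4.0 * a * c
    if disc < 0:
        return None
    k = (-b - np.sqrt(disc)) / (2.0 * a)  # smaller root = nearer surface
    return C + k * V

# A ray through a salient point at pixel (px, py) has direction V = [px/f, py/f, 1].
```

    Taking the smaller root of the quadratic selects the intersection on the visible head surface 22, as the far side of the cylinder is occluded.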

    [0051] In another step, an updated image frame I.sub.n+1, which has been recorded by the camera 2, is provided to the processing device 3, and at least some of the previously selected salient points S are identified within this updated image frame I.sub.n+1 (indicated by the white-on-black numeral 2 in FIG. 5), along with their updated 2D coordinates q.sub.i. This identification may be performed using optical flow. While the labels in FIG. 5 indicate that identification within the updated image frame I.sub.n+1 is performed before determining the 3D coordinates P.sub.i corresponding to the initial image frame I.sub.n, the sequence of these steps may be inverted as indicated in the flowchart of FIG. 2, or they may be performed in parallel.
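    As an illustration of tracking a salient point between frames, the sketch below uses a brute-force template search instead of a full optical-flow implementation (in practice one would typically use pyramidal Lucas-Kanade, e.g. OpenCV's `calcOpticalFlowPyrLK`); patch and search sizes are arbitrary choices:

```python
import numpy as np

def track_point(img0, img1, p, patch=5, search=8):
    """Track point p = (x, y) from img0 to img1 by minimizing the sum of
    squared differences of a (2*patch+1)^2 template over a local search
    window. A simple stand-in for pyramidal Lucas-Kanade optical flow."""
    x, y = p
    tmpl = img0[y - patch:y + patch + 1, x - patch:x + patch + 1].astype(float)
    best, best_q = np.inf, p
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cand = img1[y + dy - patch:y + dy + patch + 1,
                        x + dx - patch:x + dx + patch + 1].astype(float)
            if cand.shape != tmpl.shape:
                continue  # window fell outside the image
            ssd = np.sum((cand - tmpl) ** 2)
            if ssd < best:
                best, best_q = ssd, (x + dx, y + dy)
    return best_q  # updated 2D coordinates of the tracked point
```

    Points that cannot be re-identified (e.g. due to occlusion) are simply dropped; the perspective-n-point step only needs "at least some" of the previously selected points.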

    [0052] In another step (indicated by the white-on-black numeral 4 in FIG. 5), the processing device 3 uses the updated 2D coordinates q.sub.i and the 3D coordinates P.sub.i to solve a perspective-n-point problem and thus to update the head pose. The head pose is computed by calculating updated 3D coordinates P′.sub.i resulting from a translation t and rotation R, so that P′.sub.i=R·P.sub.i+t, and by minimizing the error between the reprojection of the 3D features onto the image plane and their respective detected 2D features by means of an iterative approach. In the definition of the error, it is also possible to take into account the weight associated with the respective salient point S, so that an error resulting from a salient point S with low weight contributes less to the total error. Applying the translation t and rotation R to the old head pose yields the updated head pose (indicated by the white-on-black numeral 5 in FIG. 5).
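    The weighted objective that the iterative perspective-n-point solver minimizes can be sketched as follows (the full solver, e.g. OpenCV's `solvePnP` or a Gauss-Newton scheme, would search over R and t to minimize this quantity; the focal length and test points below are hypothetical):

```python
import numpy as np

def weighted_reprojection_error(R, t, P, q, w, f_px):
    """Weighted reprojection error for a candidate pose (R, t).

    R: 3x3 rotation, t: (3,) translation, P: (N, 3) model points from the
    previous loop, q: (N, 2) detected 2D points in the updated frame
    (relative to the principal point), w: (N,) salient-point weights,
    f_px: focal length in pixels. A low-weight point (near the ROI border)
    contributes less to the total error.
    """
    P2 = P @ R.T + t                      # updated 3D coordinates P' = R*P + t
    proj = f_px * P2[:, :2] / P2[:, 2:3]  # pinhole projection onto the image plane
    res = np.linalg.norm(proj - q, axis=1)
    return np.sum(w * res ** 2)
```

    For the true pose the error vanishes; the solver returns the (R, t) pair that drives this weighted sum toward its minimum.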

    [0053] In another step, the region of interest 30 is updated. In this embodiment, the region of interest 30 is defined by the projection of the visible head surface 22 of the CHM 20 onto the image. The visible head surface 22 in turn is defined by the intersection of the head surface 21 with a boundary plane 24. The boundary plane 24 has a normal vector resulting from the cross product between a vector parallel to the X-axis of the camera 2 and a vector parallel to the center axis 23 of the CHM 20. In other words, the boundary plane 24 is parallel to the X-axis and to the center axis 23 (see the white-on-black numeral 6 in FIG. 5). The corners {P′.sub.TL, P′.sub.TR, P′.sub.BL, P′.sub.BR} of the visible head surface 22 of the CHM 20 are given by the furthermost intersected points between the model head surface 21 and the boundary plane 24, whereas the new region of interest 30 results from projecting the visible head surface 22 onto the image plane 2.1 (indicated by the white-on-black numeral 7 in FIG. 5).
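    The ROI update can be sketched as follows. The cylinder pose, dimensions, and focal length are hypothetical, the center axis is assumed not to be parallel to the camera X-axis (otherwise the cross product vanishes), and the projected corners are reduced here to an axis-aligned box for simplicity:

```python
import numpy as np

def update_roi(center, axis, r, h, f_px):
    """Recompute the region of interest from the current CHM pose.

    center: 3D center of the cylinder, axis: unit vector along the center
    axis, r: radius, h: height, f_px: focal length in pixels. The boundary
    plane contains the center axis and is parallel to the camera X-axis;
    its intersection with the cylinder surface gives the four corners of
    the visible head surface, which are projected onto the image plane.
    """
    x_cam = np.array([1.0, 0.0, 0.0])
    n = np.cross(x_cam, axis)  # boundary-plane normal (assumes axis not // X)
    n /= np.linalg.norm(n)
    u = np.cross(axis, n)      # in-plane direction across the cylinder
    u /= np.linalg.norm(u)
    corners3d = [center + sx * r * u + sy * (h / 2.0) * axis
                 for sx in (-1.0, 1.0) for sy in (-1.0, 1.0)]
    # Pinhole projection (coordinates relative to the principal point)
    corners2d = [f_px * np.array([P[0], P[1]]) / P[2] for P in corners3d]
    xs = [p[0] for p in corners2d]
    ys = [p[1] for p in corners2d]
    return min(xs), min(ys), max(xs), max(ys)  # axis-aligned ROI box
```

    As the head rotates, the updated ROI shifts with the visible half of the cylinder, so newly visible regions (ear, neck, top of the head) enter the ROI for the next loop.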

    [0054] The updated region of interest 30 again comprises non-facial regions like the neck region 33, the head top region 34, the head side region 35 etc. In the next loop, salient points from at least one of these non-facial regions 33-35 may be selected. For example, the head side region 35 now is closer to the center of the region of interest 30, making it likely that a salient point from this region will be selected, e.g. a feature of an ear.