Triggering a head-pose dependent action
20230102851 · 2023-03-30
CPC classification
G06T7/246
PHYSICS
G06F18/214
PHYSICS
G06T7/277
PHYSICS
H04N23/611
ELECTRICITY
H04R25/70
ELECTRICITY
G06V40/171
PHYSICS
G06V10/62
PHYSICS
International classification
G06F18/214
PHYSICS
G06T7/246
PHYSICS
G06T7/277
PHYSICS
H04N23/611
ELECTRICITY
Abstract
Disclosed herein is an apparatus comprising a camera and a processing unit operatively coupled to the camera, wherein the processing unit is configured to: receive a sequence of images captured by the camera; process a first image of the received sequence of images to compute respective likelihoods of each of a plurality of predetermined facial features being visible in the first image; compute, from the computed likelihoods, a probability that the first image depicts a predetermined first side of a human head; and, responsive to at least the computed probability exceeding a predetermined detection probability, trigger performance of a predetermined action.
Claims
1. An apparatus comprising a camera and a processing unit operatively coupled to the camera, wherein the processing unit is configured to: receive a sequence of images provided by the camera; process a first image of the sequence of images to compute likelihood(s) of facial feature(s) being visible in the first image; compute a probability that the first image depicts a first side of a human head based on the computed likelihood(s); and trigger a performance of an action if the computed probability exceeds a predetermined detection probability.
2. The apparatus according to claim 1, wherein the processing unit is further configured to: compute a stability parameter indicative of a stability of at least some of the images over time; and determine whether the computed stability parameter fulfills a predetermined stability condition; wherein the processing unit is configured to trigger the performance of the action if the computed probability exceeds the predetermined detection probability and if the computed stability parameter fulfills the predetermined stability condition.
3. The apparatus according to claim 2, wherein the processing unit is configured to compute the stability parameter by: tracking at least one of the facial feature(s) across two or more of the images, and computing a metric associated with a movement of the at least one of the facial feature(s) within a field of view of the camera.
4. The apparatus according to claim 3, wherein the predetermined stability condition is fulfilled if the metric is smaller than a threshold.
5. The apparatus according to claim 3, wherein the processing unit is configured to compute the stability parameter by determining a weighted sum of image positions of the at least one of the facial feature(s) in the two or more of the images to obtain a combined image position.
6. The apparatus according to claim 5, wherein the processing unit is configured to compute the stability parameter by applying a position-velocity Kalman Filter to track the combined image position.
7. The apparatus according to claim 3, wherein the metric is associated with a degree of movement.
8. The apparatus according to claim 3, wherein the metric is associated with a speed of movement.
9. The apparatus according to claim 1, wherein the camera is a monocular camera.
10. The apparatus according to claim 1, wherein the first side of the human head is a first lateral side of the human head.
11. The apparatus according to claim 1, wherein the facial feature(s) comprise multiple facial features.
12. The apparatus according to claim 11, wherein the multiple facial features comprise: a first facial feature of the first side of the human head, a second facial feature of a second side of the human head, the second side being opposite the first side, and a third facial feature of a third side of the human head, the third side being different from the first and second sides.
13. The apparatus according to claim 12, wherein the first facial feature is indicative of a first ear.
14. The apparatus according to claim 12, wherein the third side is a front side of the human head.
15. The apparatus according to claim 12, wherein the third facial feature is indicative of a nose.
16. The apparatus according to claim 1, wherein the action comprises recording at least one of the images in the sequence as an image of the first side of the human head.
17. The apparatus according to claim 1, wherein the action comprises recording an image of the first side of the human head.
18. A computer-implemented method for triggering an action, the method comprising: receiving, from a camera, a sequence of images; processing a first image of the sequence of images to determine likelihood(s) of facial feature(s) being visible in the first image; computing a probability that the first image depicts a first side of a human head based on the determined likelihood(s); and triggering a performance of the action if the computed probability exceeds a predetermined detection probability.
19. The computer-implemented method according to claim 18, further comprising: computing a stability parameter indicative of a stability of at least some of the images over time; and determining whether the computed stability parameter fulfills a predetermined stability condition; wherein the act of triggering the performance of the action is performed if the computed probability exceeds the predetermined detection probability, and if the computed stability parameter fulfills the predetermined stability condition.
20. The computer-implemented method according to claim 18, wherein the action comprises recording an image of the first side of the human head as a training image for a machine-learning process.
21. The computer-implemented method according to claim 18, wherein the action comprises using an image of the first side of the human head in a hearing-aid fitting process.
22. The computer-implemented method according to claim 18, wherein the action comprises performing image processing of an image of the first side of the human head to detect an ear in the image.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION
[0069] receive a sequence of images captured by the digital camera 12;
[0070] process a first image of the received sequence of images to compute respective likelihoods of each of at least three predetermined facial features being visible in the first image;
[0071] compute, from the computed likelihoods, a probability that the first image depicts a predetermined first side of a human head 30;
[0072] process one or more images of the sequence of images to compute a stability parameter indicative of a stability of the captured images over time;
[0073] determine whether the computed probability exceeds a predetermined detection probability and whether the computed stability parameter fulfills a predetermined stability condition;
[0074] responsive to the computed probability exceeding the predetermined detection probability and the computed stability parameter fulfilling the predetermined stability condition, trigger performance of a predetermined head-pose dependent action, e.g., the capturing of an image or the recording of one of the already captured images as an image of the first side of the human head 30.
[0075] For example, the processing unit may be configured to perform the acts of the method described in detail below in connection with
[0077] The digital camera 12 may be a webcam or another type of digital camera communicatively coupled to the data processing device 110 and operable to capture images of a human head 30. In the example of
[0078] The data processing device 110 comprises the processing unit 112, a memory 111 and a communications interface 113. The processing unit 112 may be a suitable programmed central processing unit. The processing unit 112 is operationally coupled to the memory 111 and the communications interface 113. The memory 111 may be configured to store a computer program to be executed by the processing unit 112. Alternatively or additionally, the memory 111 may be configured to store recorded images captured by, and received from, the digital camera 12. The communications interface 113 comprises circuitry for communicating with the digital camera 12. In the example of
[0079] While the data processing device 110 of the example of
[0081] At initial step S1, the process receives one or more captured images depicting a human head. The subsequent steps S2-S4 may be performed in real time or quasi-real time, i.e., the individual images, e.g., individual frames of video, may be processed as they are received, rather than waiting until an entire plurality of images has been received. In other examples, the process may receive a certain number of images, e.g., a certain number of frames, and then process them by performing steps S2-S4.
[0082] In subsequent step S2, the process processes the one or more captured images to detect an orientation of the depicted head relative to the camera having captured the image(s). In particular, this step may detect whether a predetermined lateral side view of a human head is depicted in the image. Accordingly, this step may output a corresponding lateral side view detection condition C.sub.side={true, false}, e.g., a left side detection condition C.sub.left={true, false} and/or a right side detection condition C.sub.right={true, false}. At least some embodiments of this step may be based on detected facial landmarks. An example of a process for detecting the orientation of the head will be described below in more detail with reference to
[0083] In subsequent step S3, the process processes one or a sequence of the captured images depicting the human head to determine whether the captured image or the sequence of the captured images depicts the human head in a sufficiently stable manner, i.e., whether a predetermined stability condition is fulfilled. The stability detection ensures that an action is only triggered if the images are sufficiently stable. For example, when the action to be triggered involves utilizing the captured images or capturing new images, the stability detection reduces the risk of out-of-focus pictures being taken. The stability detection may, e.g., detect movement of the depicted head relative to the camera. The stability detection may output a stability condition parameter C.sub.stable={true, false} indicative of whether the stability condition is fulfilled or not, e.g., whether the detected relative movement of the depicted head is sufficiently small, e.g., smaller than a predetermined threshold. Various methods for stability detection may be used:
[0084] For example, the stability detection may be based on the processing of raw video frames. Examples of such processing include an analysis of edges and gradients in the received images, as in-focus images tend to have visual features with sharp edges and strong color gradients. Another example of a stability detection based on the processing of raw video frames may involve an image histogram analysis, as the form of the histogram tends to differ between out-of-focus and in-focus images.
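As an illustrative sketch of the raw-frame approach, a simple gradient-based sharpness proxy may be computed as follows. The specific metric (mean squared intensity gradient) is an assumption chosen for illustration; the text only refers generally to edge and gradient analysis:

```python
import numpy as np

def gradient_sharpness(gray):
    """Mean squared intensity gradient of a grayscale frame.

    In-focus frames tend to have stronger gradients, so higher values
    suggest a sharper image. This particular metric is an illustrative
    assumption, not mandated by the disclosure.
    """
    gy, gx = np.gradient(gray.astype(float))
    return float(np.mean(gx * gx + gy * gy))

# A hard vertical edge versus a smoothed (ramped) version of it:
sharp = np.zeros((16, 16))
sharp[:, 8:] = 255.0
blurred = np.tile(np.interp(np.arange(16), [6, 10], [0.0, 255.0]), (16, 1))
```

A frame sequence could then be declared stable when this score stays above a tuned threshold for consecutive frames.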
[0085] Another example of stability detection utilizes detected landmarks, thus allowing reuse of results from the orientation detection. In particular, the stability detection may be based on a tracking of individual landmarks and/or tracking of a combination, e.g., a linear combination, of multiple landmarks across a sequence of images. The tracking of landmarks allows the computation of a statistical position deviation, based on which a stability criterion may be determined. An example of such a process will be described in more detail below with reference to
[0086] If the process detects that, with sufficiently high probability, the images depict the predetermined lateral side view of the human head and that the stability condition is fulfilled, the process proceeds at step S4; otherwise, the process terminates or returns to step S1.
[0087] Accordingly, in step S4, the process triggers a predetermined action, responsive to the side view detection condition and the stability condition being fulfilled. If the action is to be triggered by a detected left side view, the triggering condition is C.sub.left ∧ C.sub.stable. Similarly, if the action is to be triggered by a detected right side view, the triggering condition is C.sub.right ∧ C.sub.stable. If the action is to be triggered by any detected lateral side view, e.g., by both left and right side views, the triggering condition is (C.sub.right ∨ C.sub.left) ∧ C.sub.stable.
[0088] The predetermined action may include capturing one or more further images or recording one or more of the captured images as one or more images depicting a lateral side view of the human head.
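As an illustrative, non-limiting sketch, the triggering logic of step S4 amounts to a few boolean combinations; the trigger_on selector below is an assumed convenience, not part of the disclosure:

```python
def should_trigger(c_left, c_right, c_stable, trigger_on="any"):
    """Combine side-view and stability conditions (step S4).

    trigger_on selects which lateral side view triggers the action:
    "left", "right", or "any" (either side).
    """
    if trigger_on == "left":
        return c_left and c_stable        # C_left ∧ C_stable
    if trigger_on == "right":
        return c_right and c_stable       # C_right ∧ C_stable
    return (c_left or c_right) and c_stable  # (C_right ∨ C_left) ∧ C_stable
```

The stability condition gates every variant, so no action fires on an unstable sequence regardless of the detected pose.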
[0090] In initial step S21, the process detects one or more facial landmarks (also referred to as landmarks, landmark points or keypoints) in the image. The landmark detection may be performed using any suitable detection method known as such in the art, e.g., by a commercially available pose estimation library, such as PoseNet or OpenPose.
[0091] The detection process may be configured to detect a plurality of predetermined facial landmarks. Examples of such landmarks are illustrated in
[0092] In particular,
[0096] The process for detecting the facial landmarks may result in a list, array or other suitable data structure of detected facial landmarks. The list may include, for each detected landmark, a landmark identifier, 2D image coordinates indicating the position in the image where the landmark has been detected, and a confidence value. The landmark identifier may be indicative of which of the predetermined facial landmarks has been detected, e.g., the left corner of the right eye, etc. It will be appreciated that the landmark identifiers may be chosen in any suitable manner, e.g., a landmark name or other descriptor, a landmark serial number, etc., as long as the identifiers allow the different landmarks to be distinguished from one another. The 2D image coordinates indicate the 2D image position of the detected landmark in the image, e.g., as expressed in (x, y) pixel coordinates. The confidence value may be a value between 0 and 1 indicative of the confidence with which the landmark has been detected, where a confidence level of 1 may correspond to absolute certainty. The confidence level may also be referred to as a confidence score or simply score. It can be interpreted as a probability that the landmark is visible.
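By way of illustration, the per-landmark record described above may be represented as a simple data type; the field and landmark names below are illustrative assumptions, not part of the disclosure:

```python
from dataclasses import dataclass

@dataclass
class Landmark:
    # Identifier distinguishing the predetermined facial landmarks
    # (names here are illustrative, e.g., "nose" or "left_ear").
    identifier: str
    # 2D image position in (x, y) pixel coordinates.
    x: float
    y: float
    # Detection confidence in [0, 1]; 1.0 corresponds to certainty,
    # interpretable as the probability that the landmark is visible.
    confidence: float

# An example detection list for one frame (values are made up):
detections = [
    Landmark("nose", 412.0, 230.5, 0.97),
    Landmark("left_ear", 498.2, 241.0, 0.88),
]
```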
[0097] Most landmark detection algorithms produce numerous facial landmarks related to different features of a human face. At least some embodiments of the process disclosed herein only use information about selected, predetermined facial landmarks as an input for the detection of the orientation of the head depicted in the image. In particular, in order to detect a lateral side view of the human head, the process may utilize three groups of landmark features as schematically illustrated in
[0098] It will be appreciated that each of the groups of landmarks may include a single landmark or multiple landmarks. The groups may include equal numbers of landmarks or different numbers of landmarks. For each group of landmarks, the process may determine a representative image position, e.g., as a geometric center of the detected individual landmarks of the group or another aggregate position. Similarly, the process may determine an aggregate confidence level of the group of landmarks having been detected, e.g., as a product, average or other combination of the individual landmark confidence levels of the respective landmarks of the group.
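The group aggregation described above can be sketched as follows, using the geometric center and the product of confidences, which the text names as example aggregates (averages would also serve):

```python
def aggregate_group(landmarks):
    """Combine a group of detected landmarks into one representative
    position and one aggregate confidence.

    Each landmark is an ((x, y), confidence) pair. The geometric
    center and the product of confidences are the example aggregates
    mentioned in the text; other combinations are possible.
    """
    n = len(landmarks)
    cx = sum(p[0] for p, _ in landmarks) / n
    cy = sum(p[1] for p, _ in landmarks) / n
    conf = 1.0
    for _, c in landmarks:
        conf *= c
    return (cx, cy), conf

# Two left-ear-related landmarks combined into one group result:
center, conf = aggregate_group([((10.0, 20.0), 0.9), ((14.0, 24.0), 0.8)])
```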
[0099] An example of an input to the detection process is illustrated in the table below:
TABLE-US-00001
Landmark Group                        Position          Confidence
Group 1: Left ear-related landmarks   x.sub.1, y.sub.1  0 ≤ c.sub.1 ≤ 1
Group 2: Right ear-related landmarks  x.sub.2, y.sub.2  0 ≤ c.sub.2 ≤ 1
Group 3: Nose-related landmarks       x.sub.3, y.sub.3  0 ≤ c.sub.3 ≤ 1
[0100] This format is supported by OpenPose (see https://github.com/CMU-Perceptual-Computing-Lab/openpose) and PoseNet (see https://github.com/tensorflow/tfjs-models/tree/master/posenet) libraries.
[0101] Again referring to
[0102] A robust and efficient measure of the probability P.sub.left that an image depicts a left lateral side view of a human head may be computed from the above three groups of landmarks associated with the ears and the nose, e.g., as follows:
P.sub.left=c.sub.1(1−c.sub.2)c.sub.3
[0103] It will be appreciated that, if the side view detection is based on other landmarks, the probability of the image depicting a certain side view may be computed in a similar manner, depending on whether the corresponding landmarks are visible from the respective side or not.
[0104] In subsequent step S23, the process determines whether the predetermined lateral side view has been detected with sufficiently high probability, e.g., by comparing the computed probability with a predetermined threshold:

C.sub.left=(P.sub.left>P.sub.side)

[0105] where 0≤P.sub.side≤1 is the predetermined threshold and C.sub.left represents a left side view detection condition. The left side view detection condition has the logical value “true” if the left side view has been detected with sufficiently high probability, and the logical value “false” otherwise.
[0106] It will be appreciated that other embodiments of the process may detect a different lateral side view or multiple side views. For example, a probability P.sub.right that an image depicts a right lateral side view of a human head may be computed as:
P.sub.right=c.sub.2(1−c.sub.1)c.sub.3.
[0107] A corresponding right side detection condition C.sub.right may be computed as:

C.sub.right=(P.sub.right>P.sub.side)

[0108] Accordingly, embodiments of the process may compute P.sub.left and/or P.sub.right, depending on whether only one of the lateral side views is intended to trigger an action or whether both lateral side views are intended to trigger an action.
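Using the three aggregate confidences c.sub.1 (left ear), c.sub.2 (right ear) and c.sub.3 (nose), the side-view probabilities and their threshold comparison can be sketched as follows; the threshold value 0.8 is an arbitrary illustration:

```python
def side_view_conditions(c1, c2, c3, p_side=0.8):
    """Compute the left/right lateral side view detection conditions.

    c1, c2, c3 are the confidences of the left-ear, right-ear and
    nose landmark groups, each in [0, 1]. p_side is the detection
    threshold P_side (0.8 is an illustrative, not disclosed, value).
    """
    p_left = c1 * (1.0 - c2) * c3   # left ear and nose visible, right ear hidden
    p_right = c2 * (1.0 - c1) * c3  # right ear and nose visible, left ear hidden
    return p_left > p_side, p_right > p_side

# A head turned so that only the left ear and the nose are clearly visible:
c_left, c_right = side_view_conditions(c1=0.95, c2=0.02, c3=0.97)
```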
[0109] The process then returns the computed lateral side view detection condition or conditions.
[0111] In initial step S31, the process receives a sequence of input data sets associated with a corresponding sequence of video frames, the sequence of video frames having been captured at a certain frame rate, measured in frames per second (FPS). The input data may include the actual video frames. In that case, for each video frame, the process performs landmark detection, e.g., as described in connection with step S21 of the side view detection process of
[0112] In subsequent step S32, the process computes a weighted sum of the image coordinates of the detected landmarks:

z=Σ.sub.i=1.sup.N c.sub.i z.sub.i

[0113] where z.sub.i is a coordinate vector associated with the i-th landmark and z is the weighted sum of landmark positions, each landmark coordinate vector being weighted by its detection confidence level c.sub.i. The weighted sum z may be considered a generalized head center. N is the number of landmarks. It will be appreciated that the stability detection may be performed based on all detected landmarks or only on a subset of the detected landmarks, e.g., the landmarks selected for the side view detection, as described in connection with step S21 of the side view detection process of
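The confidence-weighted sum of step S32 can be sketched as follows. Note that normalizing by the total confidence (a common variant, assumed here only as a possibility, since the text says "weighted sum") would instead yield a weighted centroid:

```python
def generalized_head_center(landmarks):
    """Confidence-weighted sum of landmark positions (step S32).

    landmarks: list of ((x, y), confidence) pairs. Dividing the result
    by sum(confidences) would give a weighted centroid instead; the
    plain weighted sum follows the text's literal definition.
    """
    zx = sum(c * p[0] for p, c in landmarks)
    zy = sum(c * p[1] for p, c in landmarks)
    return zx, zy
```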
[0114] In subsequent step S33 the process computes model input parameters for a position-velocity Kalman Filter configured to track the generalized head center. In particular, the process defines a state vector, an associated evolution equation and a measurement equation.
[0115] To this end, the process defines a state vector x=[x y x′ y′].sup.T, where [x, y].sup.T=z represents the current position of the generalized head center and [x′, y′].sup.T=z′ represents a 2D velocity of the generalized head center.
[0116] The process further defines an evolution equation for use by the Kalman filter:
x=Fx.sub.prev+w,
[0117] where F denotes a transition matrix:

F=[[1, 0, ΔT, 0], [0, 1, 0, ΔT], [0, 0, 1, 0], [0, 0, 0, 1]]

[0118] In the evolution equation, x is related to the current video frame while x.sub.prev is related to the preceding video frame, i.e., the video frame processed during the preceding iteration of the process. ΔT=1/FPS is the reciprocal of the frame rate FPS (frames per second). The evolution equation further includes a process noise term w, which may be drawn from a zero-mean multivariate normal distribution with covariance Q, i.e., w˜N(0, Q). The covariance Q may be a diagonal matrix with suitable predetermined values.
[0119] The process may define the measurement equation as:
z=Hx+v,
[0120] where H denotes the measurement matrix

H=[[1, 0, 0, 0], [0, 1, 0, 0]]
[0121] and v denotes observation noise, which may be drawn from a zero mean multivariate normal distribution with suitably selected diagonal covariance R, i.e. v˜N(0,R).
[0122] In step S34, the process performs an iteration of a Kalman-Filter, known as such in the art (see e.g., https://en.wikipedia.org/wiki/Kalman_filter), using the above evolution and measurement equations.
[0123] In step S35, the process computes a position deviation measure from the Kalman-filtered state vector x. In particular, the filtered state vector x includes the 2D velocity z′=[x′, y′].sup.T, and the process computes a norm, e.g., a Euclidean norm ∥z′∥, of the 2D velocity and uses it as a stability parameter.
[0124] In step S36, the process determines whether the computed stability parameter fulfills a predetermined stability criterion, e.g., by comparing the computed stability parameter with a predetermined threshold:

C.sub.stable=(∥z′∥<z′.sub.stable)

[0125] where z′.sub.stable>0 is the predetermined threshold and C.sub.stable represents the stability condition. The stability condition has the logical value “true” if the stability parameter, i.e., the position deviation, is smaller than the threshold; otherwise, the stability condition has the logical value “false.”
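Steps S33-S36 can be sketched with NumPy as follows; the noise covariance magnitudes, the initial state covariance and the stability threshold are arbitrary illustrative choices, not values from the disclosure:

```python
import numpy as np

def make_model(fps, q_var=1e-2, r_var=1.0):
    """Build the position-velocity Kalman filter model (steps S33-S34)."""
    dt = 1.0 / fps  # ΔT, the reciprocal of the frame rate
    F = np.array([[1, 0, dt, 0],   # x  <- x + ΔT·x'
                  [0, 1, 0, dt],   # y  <- y + ΔT·y'
                  [0, 0, 1,  0],   # x' <- x'
                  [0, 0, 0,  1]])  # y' <- y'
    H = np.array([[1.0, 0, 0, 0],  # only the position z is measured
                  [0, 1.0, 0, 0]])
    Q = q_var * np.eye(4)          # process noise covariance (illustrative)
    R = r_var * np.eye(2)          # observation noise covariance (illustrative)
    return F, H, Q, R

def kalman_step(x, P, z, F, H, Q, R):
    """One predict/update iteration on state x and covariance P."""
    x = F @ x                       # predict state
    P = F @ P @ F.T + Q             # predict covariance
    y = z - H @ x                   # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)  # Kalman gain
    x = x + K @ y
    P = (np.eye(4) - K @ H) @ P
    return x, P

def is_stable(x, threshold=5.0):
    """Step S36: is the speed of the generalized head center below threshold?"""
    return bool(np.linalg.norm(x[2:]) < threshold)

# Track a stationary generalized head center at 30 FPS; the estimate stays
# at the constant measurement and the velocity estimate stays at zero.
F, H, Q, R = make_model(fps=30)
x, P = np.array([100.0, 80.0, 0.0, 0.0]), np.eye(4)
for _ in range(20):
    x, P = kalman_step(x, P, np.array([100.0, 80.0]), F, H, Q, R)
```

Because ΔT enters only through F, the same code adapts to any camera frame rate, matching the tuning behavior described below.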
[0126] The process then returns the computed stability condition C.sub.stable.
[0127] The process described above thus provides a robust and efficient triggering of a head-pose dependent action in real time that can be implemented in desktop, mobile and web computing environments. It has a relatively low computational cost, owing to the simplicity of the operations performed on top of the per-frame landmark detection. It is also adaptive to different camera frame rates and can easily be adapted and tuned by means of a few tuning parameters, namely the thresholds P.sub.side and z′.sub.stable.
[0128] Although the above embodiments have mainly been described with reference to certain specific examples, various modifications thereof will be apparent to those skilled in the art without departing from the spirit and scope of the invention as outlined in the claims appended hereto.