BEHAVIOR RECOGNITION APPARATUS, LEARNING APPARATUS, AND METHOD
20170351928 · 2017-12-07
Assignee
Inventors
CPC classification
G06V40/103 · G06V20/597 · G06T7/143
International classification
Abstract
Provided is a behavior recognition apparatus, including a detection unit configured to detect, based on a vehicle interior image obtained by photographing a vehicle interior, positions of a plurality of body parts of a person inside a vehicle in the vehicle interior image; a feature extraction unit configured to extract a rank-order feature which is a feature based on a rank-order of a magnitude of a distance between parts obtained by the detection unit; and a discrimination unit configured to discriminate a behavior of an occupant in the vehicle using a discriminator learned in advance and the rank-order feature extracted by the feature extraction unit. Also provided is a learning apparatus to learn the discrimination unit.
Claims
1. A behavior recognition apparatus, comprising: a detection unit configured to detect, based on a vehicle interior image obtained by photographing a vehicle interior, positions of a plurality of body parts of a person inside a vehicle in the vehicle interior image; a feature extraction unit configured to extract a rank-order feature which is a feature based on a rank-order of a magnitude of a distance between parts obtained by the detection unit; and a discrimination unit configured to discriminate a behavior of an occupant in the vehicle using a discriminator learned in advance and the rank-order feature extracted by the feature extraction unit.
2. The behavior recognition apparatus according to claim 1, wherein the discriminator is learned by decision tree learning and is configured based on a magnitude relationship between a rank-order of a magnitude of a distance between a first pair of body parts and a rank-order of a magnitude of a distance between a second pair of body parts.
3. The behavior recognition apparatus according to claim 1, wherein the discriminator is configured based on statistical machine learning.
4. The behavior recognition apparatus according to claim 1, wherein the discrimination unit is configured to calculate a likelihood for each of a plurality of behaviors determined in advance, and wherein, with respect to images of a plurality of frames constituting a moving image, the behavior recognition apparatus detects a position of a body part, extracts a rank-order feature, calculates a likelihood for each of the plurality of behaviors, and determines a behavior, for which a sum of squares of the likelihood is maximum, as the behavior of the occupant in the vehicle.
5. A learning apparatus, comprising: an input unit configured to acquire positions of a plurality of body parts of a person inside a vehicle in a vehicle interior image obtained by photographing a vehicle interior and a correct behavior taken by the person; a feature extraction unit configured to extract a rank-order feature which is a feature based on a rank-order of a magnitude of a distance between body parts; and a learning unit configured to learn a discriminator for discriminating a behavior of an occupant in the vehicle based on the rank-order feature extracted by the feature extraction unit and the correct behavior.
6. The learning apparatus according to claim 5, wherein the learning unit is configured to learn the discriminator by decision tree learning based on a magnitude relationship between a rank-order of a magnitude of a distance between a first pair of body parts and a rank-order of a magnitude of a distance between a second pair of body parts.
7. The learning apparatus according to claim 5, wherein the learning unit is configured to learn the discriminator based on statistical machine learning.
8. The learning apparatus according to claim 5, wherein the discriminator is learned also by using, as learning data representing a same correct behavior, input data obtained by adding a minute fluctuation to positions of the plurality of body parts in the vehicle interior image.
9. A behavior recognition method, comprising steps of: detecting, based on a vehicle interior image obtained by photographing a vehicle interior, positions of a plurality of body parts of a person inside a vehicle in the vehicle interior image; extracting a rank-order feature which is a feature based on a rank-order of a magnitude of a distance between parts obtained in the detecting step; and discriminating a behavior of an occupant in the vehicle using a discriminator learned in advance and the rank-order feature extracted in the feature extracting step.
10. The behavior recognition method according to claim 9, wherein the discriminator is learned by decision tree learning and is configured based on a magnitude relationship between a rank-order of a magnitude of a distance between a first pair of body parts and a rank-order of a magnitude of a distance between a second pair of body parts.
11. The behavior recognition method according to claim 9, wherein the discriminator is configured based on statistical machine learning.
12. The behavior recognition method according to claim 9, wherein the step of detecting the positions of the plurality of body parts and the step of extracting the rank-order feature are performed for each frame in a moving image, and wherein discriminating the behavior of the occupant comprises calculating a likelihood for each of a plurality of behaviors determined in advance, and determining a behavior, for which a sum of squares of the likelihood is maximum, as the behavior of the occupant.
13. A learning method, comprising steps of: acquiring positions of a plurality of body parts of a person inside a vehicle in a vehicle interior image obtained by photographing a vehicle interior and a correct behavior taken by the person inside the vehicle; extracting a rank-order feature which is a feature based on a rank-order of a magnitude of a distance between body parts; and learning a discriminator for discriminating a behavior of an occupant in the vehicle based on the rank-order feature extracted in the feature extracting step and the correct behavior.
14. The learning method according to claim 13, wherein the discriminator is learned by decision tree learning based on a magnitude relationship between a rank-order of a magnitude of a distance between a first pair of body parts and a rank-order of a magnitude of a distance between a second pair of body parts.
15. The learning method according to claim 13, wherein the discriminator is learned based on statistical machine learning.
16. The learning method according to claim 13, wherein the discriminator is learned also by using, as learning data representing a same correct behavior, input data obtained by adding a minute fluctuation to positions of the plurality of body parts in the vehicle interior image.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DESCRIPTION OF THE EMBODIMENTS
[Outline of Configuration]
[0040] Hereinafter, an embodiment of the present invention will be described with reference to the drawings.
[0042] Each of these functional units will be described together with descriptions of a learning process and a behavior recognition process presented below.
[Learning Process]
[0043] First, a learning process performed by the learning apparatus 2 will be described.
[0044] In step S10, the learning apparatus 2 acquires a moving image of infrared images and depth information (range images) containing a behavior, the correct recognition result (correct behavior) of which is known. The infrared images are input from the infrared image input unit 11, the depth information is input from the depth information input unit 12, and the correct behavior is input from the correct behavior input unit 17.
[0045] As shown in
[0046] The depth information input unit 12 acquires depth information of the inside of the vehicle (hereinafter, depth information) input from outside of the behavior recognition apparatus 1 and outputs depth information D(t) at an obtained time point t (t=1, 2, . . . , T) to the detection unit 13. In this case, the depth information D(t) may be acquired by installing a commercially-available stereoscopic camera, a time-of-flight (TOF) sensor, or the like inside the vehicle.
[0047] A correct behavior (correct category) of a presently-input infrared image and depth information is input to the correct behavior input unit 17. Examples of a correct behavior include an operation of a steering wheel, an adjustment of a rearview mirror, an adjustment of a control panel, wearing and removing a seat belt, an operation of a smartphone, and eating and drinking.
[0048] Processes of a loop L1 constituted by steps S11 to S13 are performed on each frame of an input moving image as a target.
[0049] In step S11, the detection unit 13 detects a body part from the infrared image I(t) and the depth information D(t).
[0050] As shown in
[0051] In this case, x.sub.m(t) represents a horizontal coordinate in the infrared image I(t) of an m-th part at a time point t. In addition, y.sub.m(t) represents a vertical coordinate in the infrared image I(t) of the m-th part at the time point t. Meanwhile, z.sub.m(t) represents a depth-direction coordinate of the m-th part at the time point t and is given as a value on the two-dimensional coordinates (x.sub.m(t), y.sub.m(t)) in the depth information D(t).
[0052] Specifically, for example, as described in Schwarz et al. [3], the two-dimensional coordinates (x.sub.m(t), y.sub.m(t)) (m=1, 2, . . . , M) of the M-number of parts of an occupant in a vehicle may be detected using a discriminator C.sub.1 generated in advance for this purpose. The discriminator C.sub.1 can be generated using a large amount of learning data to which the two-dimensional coordinates (x.sub.m(t), y.sub.m(t)) (m=1, 2, . . . , M) and the depth-direction coordinates z.sub.m(t) (m=1, 2, . . . , M) of the M-number of parts of an occupant in a vehicle are assigned.
[0053] Alternatively, as described in Toshev et al. [4], the two-dimensional coordinates (x.sub.m(t), y.sub.m(t)) (m=1, 2, . . . , M) of the M-number of parts of an occupant in a vehicle may be detected using a discriminator C.sub.2 generated in advance for this purpose. The discriminator C.sub.2 can be generated using a large amount of learning data to which the two-dimensional coordinates (x.sub.m(t), y.sub.m(t)) (m=1, 2, . . . , M) of the M-number of parts of an occupant in a vehicle are assigned.
[0054] In step S12, the minute fluctuation application unit 151 of the learning unit 15 adds a minute fluctuation to the two-dimensional coordinates (x.sub.m(t), y.sub.m(t)) (m=1, 2, . . . , M) of the M-number of parts of an occupant in a vehicle obtained by the detection unit 13 to create K-number of pieces of learning data D.sub.k(t) (k=1, 2, . . . , K) which are similar to, but differ from, each other. The correct behavior remains the same as that input to the correct behavior input unit 17 even after the minute fluctuation is added.
[0055] As shown in
[0056] In this case, Δx.sub.m,k(t) represents a minute fluctuation with respect to the horizontal direction of the m-th part and a magnitude thereof is given by a random value equal to or smaller than a maximum value Δx.sub.max determined in advance and differs in value for each k (=1, 2, . . . , K). In addition, Δy.sub.m,k(t) represents a minute fluctuation with respect to the vertical direction of the m-th part and a magnitude thereof is given by a random value equal to or smaller than a maximum value Δy.sub.max determined in advance and differs in value for each k (=1, 2, . . . , K). Furthermore, the maximum values Δx.sub.max and Δy.sub.max are each determined heuristically.
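The augmentation of step S12 can be sketched in Python as follows. This is an illustrative sketch, not the patented implementation: the function name and argument names are assumptions, and the text only specifies that each offset is a random value bounded by a heuristically chosen maximum.

```python
import random

def jitter_parts(parts, k, dx_max, dy_max, seed=None):
    """Create k jittered copies of the 2D part coordinates.

    Each coordinate is shifted by an independent random offset drawn from
    [-dx_max, dx_max] (horizontal) and [-dy_max, dy_max] (vertical), so the
    copies are similar to, but differ from, each other. The correct behavior
    label attached to the frame is left unchanged.
    """
    rng = random.Random(seed)
    return [
        [(x + rng.uniform(-dx_max, dx_max), y + rng.uniform(-dy_max, dy_max))
         for (x, y) in parts]
        for _ in range(k)
    ]
```

Each of the k copies is then treated as an independent piece of learning data D.sub.k(t) carrying the same correct behavior.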
[0057] In step S13, the feature extraction unit 152 extracts K-number of rank-order features F.sub.k(t) (k=1, 2, . . . , K) based on the K-number of pieces of learning data D.sub.k(t) (k=1, 2, . . . , K). Specifically, the rank-order feature F(t) is extracted using Expression (1) below.
F(t)=(R(D(1,2)),R(D(1,3)), . . . ,R(D(8,9)),R(D(9,10))) (1)
[0058] In Expression (1), D(m, n) represents a Euclidean distance on an infrared image space between the m-th part and an n-th part, and R(D(m, n)) represents a rank-order of D(m, n) when D(1, 2), D(1, 3), . . . , D(8, 9), D(9, 10) are sorted in a descending order. For example, for the sake of convenience, let us consider four parts as shown in
In this case, the rank-order feature F(t) at the time point t can be extracted as
F(t)=(1, 5, 6, 4, 3, 2).
[0059] The rank-order feature F(t) is characteristically invariable with respect to a scale fluctuation of a position of a body part as shown in
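The extraction of Expression (1) and the invariance described above can be sketched in Python as follows (an illustrative sketch; the function name is an assumption, and the pair ordering follows the convention of Expression (1)):

```python
import itertools
import math

def rank_order_feature(parts):
    """Rank-order feature per Expression (1): for every part pair taken in
    the fixed order (1,2), (1,3), ..., compute the Euclidean distance, then
    replace each distance by its rank when all pairwise distances are
    sorted in descending order (rank 1 = largest distance)."""
    pairs = itertools.combinations(range(len(parts)), 2)
    dists = [math.dist(parts[m], parts[n]) for m, n in pairs]
    order = sorted(range(len(dists)), key=dists.__getitem__, reverse=True)
    ranks = [0] * len(dists)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return tuple(ranks)
```

Because only the ordering of distances is retained, uniformly scaling, rotating, or translating all part positions leaves the feature unchanged.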
[0060] Due to the processes of steps S11 to S13 described above, a plurality of pieces of learning data D.sub.k(t) are created for an image corresponding to a single frame and the rank-order feature F(t) is determined for each piece of learning data D.sub.k(t). In addition, the processes are executed for each frame of the input moving image by repetitively performing the loop L1.
[0061] In step S14, the determination condition setting unit 153 of the learning unit 15 generates a discriminator C.sub.3 with respect to a discrimination category c (=1, . . . , C) using K×T-number of rank-order features F.sub.k(t) (k=1, 2, . . . , K, t=1, 2, . . . , T) obtained by the feature extraction unit 152 and a correct category corresponding to each rank-order feature F.sub.k(t). In the present embodiment, the discriminator C.sub.3 is generated using decision tree learning and, particularly, using Random Forests such as described in Breiman [5].
[0062] Random Forests refer to a type of ensemble learning algorithm which uses a decision tree as a weak discriminator and is constituted by a plurality of nodes r (=1, . . . , R) and links connecting the nodes. A node on a topmost layer is referred to as a root node, a node on a bottommost layer is referred to as a leaf node, and others are simply referred to as nodes. Each node stores, by learning, a determination condition Φ.sub.r (r=1, . . . , R) for sorting a rank-order feature in the node to a left-side node or a right-side node and a probability P.sub.r(c) (r=1, . . . , R) with respect to a discrimination category c (=1, . . . , C).
[0063] In this case, the discrimination category c (=1, . . . , C) refers to a correct behavior that is input to the correct behavior input unit 17. The discrimination category may be appropriately set based on a context of behavior recognition of an occupant in a vehicle. Examples of settings may include “c=1: operation of a steering wheel”, “c=2: adjustment of a rearview mirror”, “c=3: adjustment of a control panel”, “c=4: wearing and removing a seat belt”, “c=5: operation of a smartphone”, and “c=6: eating and drinking”.
[0064] In addition, candidates φ.sub.m (m=1, . . . , M) of a determination condition necessary for learning by Random Forests are set using an i-th element I and a j-th element J of the rank-order feature F(t) and a threshold τ for comparing magnitudes of the element I and the element J. A specific determination condition or, in other words, the values of i, j, and τ in φ.sub.m are randomly determined.
[0065] The determination condition is used to determine to which child node a transition is to be made from each node constituting a decision tree. Specifically, the magnitudes of the i-th element I and the j-th element J of the rank-order feature F(t) are compared with each other: when I−J>τ is satisfied, a transition is made to a right-side node; otherwise, a transition is made to a left-side node. For example, when a determination condition (i, j, τ)=(1, 5, 1) is applied to the rank-order feature F(t)=(1, 5, 6, 4, 3, 2), with elements counted from 0, since the i-th (=1st) element I=5 and the j-th (=5th) element J=2 satisfy I−J=5−2=3>1=τ, a transition is made to the right-side node. In a similar manner, when a determination condition (i, j, τ)=(1, 0, 7) is applied, since the i-th (=1st) element I=5 and the j-th (=0th) element J=1 satisfy I−J=5−1=4<7=τ, a transition is made to the left-side node.
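Traversal of a single learned decision tree can be sketched in Python as follows. The nested-dict node representation and the function name are assumptions made for illustration; the decision rule itself (go right when I−J>τ) is the one stated above, with element indices counted from 0 as in the worked example.

```python
def traverse(feature, node):
    """Trace one decision tree from its root: at each internal node, apply
    the stored determination condition (i, j, tau) and move to the
    right-side child when feature[i] - feature[j] > tau, otherwise to the
    left-side child. The reached leaf holds a probability for each
    discrimination category."""
    while "leaf" not in node:
        i, j, tau = node["cond"]
        node = node["right"] if feature[i] - feature[j] > tau else node["left"]
    return node["leaf"]
```

Applying the two conditions from the worked example in sequence, F(t)=(1, 5, 6, 4, 3, 2) first branches right under (1, 5, 1) and then left under (1, 0, 7).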
[0066] Once candidates φ.sub.m (m=1, . . . , M) of a determination condition are obtained as described above, learning may be subsequently performed according to procedures determined in Breiman [5]. In this case, learning refers to setting an appropriate determination condition Φ.sub.r (r=1, . . . , R) for each node r (=1, . . . , R) and setting a probability P.sub.r(c) (r=1, . . . , R) with respect to a discrimination category c (=1, . . . , C). Specifically, as the determination condition Φ.sub.r (r=1, . . . , R) of an r-th node, a candidate φ.sub.m for which reliability G(φ) defined by Expression (2) below is maximum among the candidates φ.sub.m (m=1, . . . , M) of a determination condition may be set.
G(φ)=H(Q(φ))−(Q.sub.l(φ)/Q(φ))H(Q.sub.l(φ))−(Q.sub.r(φ)/Q(φ))H(Q.sub.r(φ)) (2)
[0067] In this case, Q.sub.l(φ) represents the number of samples which make a transition to a left-side node under a determination condition φ, Q.sub.r(φ) represents the number of samples which make a transition to a right-side node under the determination condition φ, and Q(φ)=Q.sub.l(φ)+Q.sub.r(φ) represents the total number of samples at the node. H(Q(φ)) represents information entropy with respect to a discrimination category at a prescribed node, H(Q.sub.l(φ)) represents information entropy with respect to a discrimination category of a sample having made a transition to a left-side node under the determination condition φ, and H(Q.sub.r(φ)) represents information entropy with respect to a discrimination category of a sample having made a transition to a right-side node under the determination condition φ.
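The reliability described above is the standard information-gain criterion and can be sketched in Python as follows (function names are illustrative; entropy is computed in bits):

```python
import math
from collections import Counter

def entropy(labels):
    """Information entropy (in bits) of the category labels at a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def reliability(parent, left, right):
    """Reliability G(phi) of a candidate split: the parent node's entropy
    minus the size-weighted entropies of the samples sent to the left-side
    and right-side child nodes (i.e., the information gain)."""
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))
```

A split that separates the categories perfectly attains the maximum gain, while a split that leaves both children with the parent's category mixture attains zero gain.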
[0068] Finally, the determination condition Φ.sub.r (r=1, . . . , R) for each node r (=1, . . . , R) in Random Forests and the probability P.sub.r(c) (r=1, . . . , R) with respect to the discrimination category c (=1, . . . , C) are determined. The process described above is performed a plurality of times by varying a subset of learning data used in learning and the candidate φ.sub.m of the determination condition to create a plurality of decision trees. A discrimination result of a discriminator (corresponding to the probability calculation unit 161 of the discrimination unit 16) is an integration of the probability P.sub.r(c) with respect to the discrimination category c obtained by each decision tree.
[Behavior Recognition Process]
[0069] Next, a behavior recognition process performed by the behavior recognition apparatus 1 will be described.
[0070] In step S20, the behavior recognition apparatus 1 acquires a moving image of infrared images and depth information (range images) with respect to a behavior which is to be recognized. Acquisition of infrared images and depth information is basically similar to the acquisition during the learning process.
[0071] Processes of a loop L2 constituted by steps S21 to S23 are performed on each frame of an input moving image as a target.
[0072] In step S21, the detection unit 13 detects two-dimensional positions of body parts. In step S22, the feature extraction unit 14 extracts a rank-order feature based on a rank-order of a distance between body parts. The processes of steps S21 and S22 are similar to the processes of steps S11 and S13 in the learning process (the minute fluctuation of step S12 is not applied during recognition).
[0073] In step S23, the probability calculation unit 161, learned by the learning apparatus 2, obtains a probability corresponding to each recognition category c (=1, . . . , C) for the rank-order feature extracted by the feature extraction unit 14. A rank-order feature newly input from the feature extraction unit 14 will be denoted by F(t′); its correct recognition category is unknown. The probability calculation unit 161 calculates a probability P(t′, c) with respect to the recognition category c (=1, . . . , C) of the rank-order feature F(t′) (t′=1, . . . , T′) based on the determination condition Φ.sub.r (r=1, . . . , R) for each node r (=1, . . . , R) in Random Forests obtained by the learning unit 15 and the probability P.sub.r(c) (r=1, . . . , R) with respect to the discrimination category c (=1, . . . , C). The calculated probability P(t′, c) is output to the probability integration unit 162.
[0074] Specifically, the probability P(t′, c) is given as a probability P.sub.r′(c) of a leaf node r′ (where r′ is any one of 1 to R) which is eventually reached when sequentially tracing nodes from a root node in accordance with the determination condition Φ.sub.r (r=1, . . . , R) for each node r (=1, . . . , R) in Random Forests obtained by the learning unit 15.
[0075] In step S24, the probability integration unit 162 determines a behavior of an occupant in a vehicle in the input moving image based on a discrimination result (a probability for each category) of the L-number of most recent frames. Specifically, by integrating the probability P(t′, c) with respect to the recognition category c (=1, . . . , C) at the time point t′ obtained by the probability calculation unit 161 over L-number of frames in the time direction, the probability integration unit 162 determines to which recognition category c (=1, . . . , C) the rank-order feature F(t′) belongs. The recognition category c(F(t′)) (any one of 1 to C) to which the rank-order feature F(t′) belongs may be determined using Expression (3).
c(F(t′))=argmax.sub.c Σ.sub.s=t′−L+1.sup.t′ P(s, c).sup.2 (3)
[0076] In this case, a sum of squares is obtained instead of a simple sum with respect to the probability P(t′, c) in Expression (3) in order to highlight a difference between two recognition categories when the recognition categories are similar to, but differ from, each other. In addition, the value of L may be heuristically determined.
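The integration over recent frames can be sketched in Python as follows. The list-of-dicts representation and the function name are assumptions made for illustration; the decision rule (sum of squared per-frame probabilities, largest total wins) is the one described above.

```python
def integrate_probabilities(frame_probs):
    """Determine the behavior category: sum the square of each category's
    per-frame probability over the most recent frames, then pick the
    category with the largest total. `frame_probs` is a list (one entry
    per frame) of dicts mapping category -> probability."""
    totals = {}
    for probs in frame_probs:
        for c, p in probs.items():
            totals[c] = totals.get(c, 0.0) + p * p
    return max(totals, key=totals.get)
```

The effect of squaring is visible when two categories have equal simple sums: a category whose probability is concentrated in a few frames outscores one whose probability is spread evenly.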
[0077] The behavior recognition result c(F(t′)) of an occupant in a vehicle obtained in this manner is transmitted to a higher-level apparatus and applied to various applications which use a behavior of an occupant in a vehicle as an input. For example, the behavior recognition result c(F(t′)) is applied to recognize dangerous behavior, such as the occupant in a vehicle operating a smartphone or eating and drinking, and to adaptively alert the occupant in a vehicle by collating the dangerous behavior with a traveling state of the vehicle. Moreover, the configuration described above corresponds to an example of the behavior recognition apparatus 1.
[0078] In the present embodiment, since a rank-order of a magnitude of a distance between parts is used as a feature, accurate behavior recognition can be performed. This is because the rank-order of a magnitude of a distance is invariable even when a scale fluctuation such as enlargement or reduction, a rotation, or a translation occurs and is robust with respect to a minute fluctuation of parts. Due to such characteristics, an effect of various fluctuations which occur when estimating a behavior of an occupant in a vehicle such as a horizontal movement of a seat position, a difference in physiques among occupants, and a position or an orientation of a camera, an effect of an estimation error of a position of a body part by deep learning, and other effects can be suppressed.
[Modification]
[0079] In the description provided above, a two-dimensional position (x.sub.m(t), y.sub.m(t)) is obtained as a position of a body part and, therefore, a distance on an xy plane is also used as a distance between body parts. However, it is also preferable to obtain a position of a body part three-dimensionally and to use a distance in a three-dimensional space as a distance between parts. In this case, when adding a minute fluctuation to a position of a part in a learning process, a random value may be added to each of the x, y, and z components, or random values may be added to the x and y components while the value of the depth information D(t) at (x.sub.m(t)+Δx.sub.m,k(t), y.sub.m(t)+Δy.sub.m,k(t)) is adopted as the z component.
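The second variant of this modification, in which the z component is read from the depth map at the jittered 2D location, can be sketched in Python as follows (an illustrative sketch; the function name and the row-major `depth[y][x]` indexing convention are assumptions):

```python
def jittered_3d_part(x, y, dx, dy, depth):
    """Form one jittered 3D part position: jitter the x and y coordinates,
    then take the z component from the depth map D(t) at the jittered 2D
    location. `depth` is assumed to be indexed as depth[row][col], i.e.,
    depth[y][x], with the jittered coordinates rounded to the nearest
    pixel."""
    xj, yj = x + dx, y + dy
    return (xj, yj, depth[int(round(yj))][int(round(xj))])
```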
[0080] In addition, a position of a body part used in a learning process or a behavior recognition process may be obtained in any way. This means that, in addition to algorithms for part detection not being limited to a specific algorithm, part detection may also be performed manually. Nevertheless, in a behavior recognition process, desirably, the detection of a body part is performed by a machine to enable real-time processing.
[0081] Furthermore, while the probability integration unit 162 determines a recognition result of a final behavior category based on a sum of squares of a probability P(t′, c) in each frame, the recognition result of a final behavior category may instead be determined based on a simple sum or a product (or an arithmetic mean or a geometric mean).
[0082] In addition, while a case of adopting Random Forests as an example of decision tree learning has been described above, other decision tree learning algorithms such as ID3 and CART may be used instead.
[0083] Furthermore, adoptable learning processes are not limited to decision tree learning and other arbitrary statistical machine learning processes may be used. Statistical machine learning refers to a learning process of generating, from learning data, a model for discriminating classes of input data based on a statistical method. For example, a multi-class Support Vector Machine such as that described in Weston et al. [6] can be used. Alternatively, a least squares probabilistic classification method such as that described in Sugiyama [7] can be used. Bayesian estimation, neural networks, and the like can also be used.
[0084] The behavior recognition apparatus 1 and the learning apparatus 2 according to the present invention are not limited to implementations using a semiconductor integrated circuit (LSI) and may be realized when a program is executed by a computer having a general-purpose microprocessor and a general-purpose memory. In addition, while the behavior recognition apparatus 1 and the learning apparatus 2 are described as separate apparatuses in the description given above, a single apparatus may be configured so as to be switchable between a learning mode and a recognition mode.