METHOD AND SYSTEM FOR POSE ESTIMATION
20190122035 · 2019-04-25
Assignee
Inventors
- Xiaogang Wang (Shatin N.T., CN)
- Xiao Chu (Shatin N.T., CN)
- Wanli Ouyang (Shatin N.T., CN)
- Hongsheng Li (Shatin N.T., CN)
CPC classification
G06F18/217
PHYSICS
International classification
Abstract
The present disclosure relates to a method and a system for pose estimation. The method comprises: extracting a plurality of sets of part-feature maps from an image, each set of the extracted part-feature maps encoding the messages for a particular body part and forming a node of a part-feature network; passing a message of each set of the extracted part-feature maps through the part-feature network to update the extracted part-feature maps, resulting in each set of the extracted part-feature maps incorporating the message of upstream nodes; and estimating, based on the updated part-feature maps, the body part within the image.
Claims
1. A method for pose estimation, comprising: extracting a plurality of sets of part-feature maps from an image, each set of the extracted part-feature maps representing a body part and forming a node of a part-feature network; passing a message of each set of the extracted part-feature maps through the part-feature network to update the extracted part-feature maps, resulting in each set of the extracted part-feature maps incorporating the message of upstream nodes; and estimating, based on the updated part-feature maps, the body part within the image.
2. The method of claim 1, wherein the passing of the message is performed twice in opposite directions; each pair of the updated part-feature maps obtained in the different directions is combined into a score map; and the estimating of the body part is performed based on the combined score maps.
3. The method of claim 1, wherein the extracting of the part-feature maps is performed via a CNN.
4. The method of claim 3, wherein the CNN is a VGG net.
5. The method of claim 4, wherein three pooling layers are adopted in the VGG net.
6. The method of claim 1, wherein the passing of the message is performed by a convolution operation with a geometrical transformation kernel.
7. A system for pose estimation, comprising: a memory that stores executable components; and a processor electrically coupled to the memory to execute the executable components for: extracting a plurality of sets of part-feature maps from an image, each set of the extracted part-feature maps representing a body part and forming a node of a part-feature network; passing a message of each set of the extracted part-feature maps through the part-feature network to update the extracted part-feature maps, resulting in each set of the extracted part-feature maps incorporating the message of previously passed nodes; and estimating, based on the updated part-feature maps, the body part within the image.
8. The system of claim 7, wherein the passing of the message is performed twice in opposite directions; each pair of the updated part-feature maps obtained in the different directions is combined into a score map; and the estimating of the body part is performed based on the combined score maps.
9. The system of claim 7, wherein the extracting of the part-feature maps is performed via a CNN.
10. The system of claim 9, wherein the CNN is a VGG net.
11. The system of claim 10, wherein three pooling layers are adopted in the VGG net.
12. The system of claim 7, wherein the passing of the message is performed by a convolution operation with a geometrical transformation kernel.
Description
BRIEF DESCRIPTION OF THE DRAWING
[0011] Exemplary non-limiting embodiments of the present application are described below with reference to the attached drawings. The drawings are illustrative and generally not to an exact scale. The same or similar elements on different figures are referenced with the same reference numbers.
DETAILED DESCRIPTION
[0018] Reference will now be made in detail to some specific embodiments of the present application contemplated by the inventors for carrying out the present application. Examples of these specific embodiments are illustrated in the accompanying drawings. While the present application is described in conjunction with these specific embodiments, it will be appreciated by one skilled in the art that it is not intended to limit the present application to the described embodiments. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The present application may be practiced without some or all of these specific details. In other instances, well-known process operations have not been described in detail in order not to unnecessarily obscure the present application.
[0019] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
[0020] An exemplary system 1000 for estimating poses from an input image will now be described with reference to
$$h_{\text{fcn7}}^{k}(x,y)=f\left(h_{\text{fcn6}}(x,y)\otimes w_{\text{fcn7}}^{k}\right)\qquad(1)$$

wherein $\otimes$ denotes a convolution operation, $f$ denotes a nonlinear function, and $w_{\text{fcn7}}^{k}$ denotes a filter bank for part $k$. It should be noted that $h_{\text{fcn7}}^{k}$ contains a set of part-feature maps extracted from different channels. The part-feature maps of body parts contain rich information and detailed descriptions of human poses and appearance.
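Equation (1) can be illustrated with a minimal NumPy sketch. This is not the patented implementation: the function names, the 8×8 toy input, and the use of ReLU as the nonlinearity $f$ are all illustrative assumptions, and the convolution is written as a plain valid-mode cross-correlation in the usual CNN sense.

```python
import numpy as np

def relu(x):
    # stands in for the nonlinear function f in equation (1)
    return np.maximum(x, 0.0)

def conv2d_valid(feat, kernel):
    # plain 2-D cross-correlation (valid mode), standing in for the
    # convolution operator of equation (1)
    kh, kw = kernel.shape
    H, W = feat.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(feat[y:y + kh, x:x + kw] * kernel)
    return out

def extract_part_feature_maps(h_fcn6, filter_banks):
    # h_fcn7^k = f(h_fcn6 (*) w_fcn7^k): one set of part-feature maps per part k
    return [relu(conv2d_valid(h_fcn6, w)) for w in filter_banks]

# toy example: an 8x8 feature map and three hypothetical parts with 3x3 filters
rng = np.random.default_rng(0)
h_fcn6 = rng.standard_normal((8, 8))
banks = [rng.standard_normal((3, 3)) for _ in range(3)]
parts = extract_part_feature_maps(h_fcn6, banks)
```

In a trained network each filter bank would span many channels and be learned end-to-end; the single-channel loop above only shows the per-part structure of the extraction step.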
[0021] Since the spatial distributions and co-occurrence of part-feature maps obtained at different parts are highly correlated, passing the rich information contained in the part-feature maps between parts can effectively improve the features learned at each part. In the prior art, the passing process is implemented at the score-map level, which results in the loss of important inter-part information. Surprisingly, when messages are passed at the feature level, the rich information contained in the part-feature maps of the respective parts is largely preserved.
[0022] In the present application, the geometric constraints among body parts can be consolidated by shifting the part-feature maps of one body part towards its neighboring parts. The geometrical transformation kernels model the relationship between every pair of part-feature maps from neighboring parts. To optimize the features obtained at a part, it would be desirable to receive information from all other parts through a fully connected graph. However, directly modeling the relationship between the part-feature maps of distant parts requires large transformation kernels, which are difficult to train. Moreover, the relationships between some parts (such as head and foot) are unstable. It is therefore advantageous to pass messages between such parts through intermediate parts on a designed graph, since the relative spatial distribution between two adjacent parts is stable and the corresponding kernel is easy to train. The adjacent parts on the graph are close in distance and have a relatively stable relationship. The extracted sets of part-feature maps constitute a part-feature network processed by a structured feature learning layer 1220, wherein each set of part-feature maps occupies a node 1221 in the part-feature network. In an exemplary implementation, a message of each set of the extracted part-feature maps is passed through the part-feature network along a single direction. The passing operation will be illustrated in detail with reference to
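The message passing over a designed graph described above can be sketched as follows. This is a simplified single-channel illustration, not the patented implementation: the part names, graph shape, random kernels, and ReLU nonlinearity are all assumptions made for the example.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv_same(feat, kernel):
    # 'same'-size cross-correlation so the maps stay spatially aligned;
    # the kernel plays the role of a geometrical transformation kernel
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(feat, ((ph, ph), (pw, pw)))
    out = np.zeros_like(feat)
    for y in range(feat.shape[0]):
        for x in range(feat.shape[1]):
            out[y, x] = np.sum(padded[y:y + kh, x:x + kw] * kernel)
    return out

def pass_messages(node_maps, edges, kernels):
    # edges lists (src, dst) pairs of the designed graph in the order of one
    # passing direction; because each update reads the already-updated source,
    # every node ends up incorporating the messages of all its upstream nodes
    updated = dict(node_maps)
    for src, dst in edges:
        updated[dst] = relu(updated[dst] + conv_same(updated[src], kernels[(src, dst)]))
    return updated

# toy chain shoulder -> elbow -> wrist (part names are illustrative)
rng = np.random.default_rng(1)
maps = {k: rng.standard_normal((5, 5)) for k in ("shoulder", "elbow", "wrist")}
edges = [("shoulder", "elbow"), ("elbow", "wrist")]
kernels = {e: rng.standard_normal((3, 3)) for e in edges}
out = pass_messages(maps, edges, kernels)
```

Restricting the edges to adjacent parts keeps each kernel small, which reflects the design rationale above: distant pairs such as head and foot communicate only through the intermediate nodes of the graph.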
[0023] The flow chart illustrating a process for estimating poses from an input image is schematically shown in
[0024] Referring to
[0026] An exemplary bi-directional message passing process is illustrated in
$$A_{4}'=f\left(A_{4}+A_{5}\otimes w^{a_{5},a_{4}}\right)$$

wherein $A_{4}'$ represents the updated part-feature maps after message passing, $A_{4}$ represents the part-feature maps before message passing, and $w^{a_{5},a_{4}}$ denotes the geometrical transformation kernel between the two neighboring parts. The next node is updated in the same manner from the already-updated maps:

$$A_{3}'=f\left(A_{3}+A_{4}'\otimes w^{a_{4},a_{3}}\right)$$
[0027] The part-feature maps in the network 6200 may be updated in a similar way but in the opposite direction, and are therefore not discussed in detail here. Finally, the two sets of updated part-feature maps $(A_{k}, B_{k})$ may be linearly combined into a set of score maps indicating the likelihood of the existence of the corresponding body parts.
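The final combination step can be sketched as below. This is a hedged illustration only: the weights `alpha`, `beta`, and `bias` are hypothetical stand-ins for parameters a trained model would learn, and reading the part location off the score map via an argmax is one common convention, not necessarily the patented one.

```python
import numpy as np

def combine_score_map(A_k, B_k, alpha, beta, bias=0.0):
    # linearly combine the two directionally updated sets of part-feature maps
    # for part k into a single score map; alpha, beta, bias stand in for
    # learned per-channel weights (hypothetical values in this example)
    score = bias
    for a, w in zip(A_k, alpha):
        score = score + w * a
    for b, w in zip(B_k, beta):
        score = score + w * b
    return score

def estimate_part_location(score_map):
    # take the estimated joint position as the argmax of the score map
    return np.unravel_index(np.argmax(score_map), score_map.shape)

# toy example: the strongest response sits at row 2, column 3
A_k = [np.zeros((4, 4)), np.zeros((4, 4))]
A_k[0][2, 3] = 5.0
B_k = [np.zeros((4, 4))]
score = combine_score_map(A_k, B_k, alpha=[1.0, 0.5], beta=[0.2])
loc = estimate_part_location(score)
```

Because both passing directions contribute channels to the combination, the score map for each part reflects evidence propagated from nodes on both sides of the graph.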
[0028] As will be appreciated by one skilled in the art, the present application may be embodied as a system, a method or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.), or an embodiment combining software and hardware aspects that may all generally be referred to herein as a unit, circuit, module, or system. Much of the functionality and many of the principles, when implemented, are best supported with or in integrated circuits (ICs), such as a digital signal processor and software therefor, or application specific ICs. It is expected that one of ordinary skill, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein, will be readily capable of generating such ICs with minimal experimentation. Therefore, in the interest of brevity and minimization of any risk of obscuring the principles and concepts according to the present application, further discussion of such software and ICs, if any, will be limited to the essentials with respect to the principles and concepts used by the preferred embodiments. For example, the system may comprise a memory that stores executable components and a processor, electrically coupled to the memory to execute the executable components to perform operations of the system, as discussed in reference to