System and method for human pose estimation in unconstrained video
10509957 ยท 2019-12-17
Assignee
Inventors
Cpc classification
G06V40/103
PHYSICS
G06V10/84
PHYSICS
G06V20/46
PHYSICS
G06V20/52
PHYSICS
International classification
Abstract
A system and method for estimating a sequence of human poses in an unconstrained video. In the present invention, a unified two stage, tree-based, optimization problem is solved for which an efficient and exact solution exists. While the proposed method finds an exact solution, it does not sacrifice the ability to model the spatial and temporal constraints between body parts in the video frames on the unconstrained video.
Claims
1. A method for estimating human poses in an unconstrained video, the method comprising: receiving, at a computing device comprising hardware components and software programs, an unconstrained video comprising a plurality of consecutive frames including at least one human pose; generating a plurality of best full body pose hypotheses for each of the plurality of consecutive frames; extracting a plurality of real body part nodes from each of the plurality of best full body pose hypotheses; extracting a plurality of real body part nodes from each of the plurality of best full body pose hypotheses in each of the plurality of consecutive frames of the unconstrained video; generating a real body part hypotheses for each of the plurality of real body part nodes extracted from the plurality of best full body pose hypotheses; combining one or more pairs of symmetric real body part nodes into a single abstract coupled body part node to generate a plurality of abstract coupled body part nodes for each of the plurality of consecutive frames of the unconstrained video, wherein each of the one or more pairs of symmetric real body part nodes includes a left real body part node of the at least one human pose and a corresponding symmetric right real body part node of the at least one human pose; generating a plurality of abstract body part hypotheses from the plurality of abstract coupled body part nodes and each of the real body part hypotheses; generating an optimal tracklet for each of the abstract body part hypotheses; and estimating a human pose in the unconstrained video based upon the abstract body part tracklets using tree-based optimization.
2. The method of claim 1, wherein generating a plurality of best full body pose hypotheses for each of the plurality of consecutive frames, further comprising generating a plurality of best full body pose hypotheses using an N-best inference algorithm.
3. The method of claim 1, wherein the real body part nodes are selected from head, neck, right elbow, left elbow, right wrist, left wrist, right hip, left hip, right knee, left knee, right foot and left foot.
4. The method of claim 1, wherein the abstract body part nodes include abstract single body part nodes.
5. The method of claim 4, wherein the abstract single body part nodes include, head top and head bottom.
6. The method of claim 1, wherein the abstract coupled body part nodes include one or more of, right shoulder combined with left shoulder, right elbow combined with left elbow, right wrist combined with left wrist, right hip combined with left hip, right knee combined with left knee and right ankle combined with left ankle.
7. The method of claim 1, wherein generating an optimal tracklet for each of the abstract body part hypotheses further comprises selecting the one tracklet for each of the abstract body parts that maximizes a combined detection score and compatible score weights.
8. A system for estimating human poses in an unconstrained video, the system comprising: at least one computing device comprising hardware components and software programs for; receiving an unconstrained video comprising a plurality of consecutive frames including at least one human pose; generating a plurality of best full body pose hypotheses for each of the plurality of consecutive frames; extracting a plurality of real body part nodes from each of the plurality of best full body pose hypotheses in each of the plurality of consecutive frames of the unconstrained video; generating a real body part hypotheses for each of the plurality of real body part nodes extracted from the plurality of best full body pose hypotheses; combining one or more pairs of symmetric real body part nodes into a single abstract coupled body part node to generate a plurality of abstract coupled body part nodes for each of the plurality of consecutive frames of the unconstrained video, wherein each of the one or more pairs of symmetric real body part nodes includes a left real body part node of the at least one human pose and a corresponding symmetric right real body part node of the at least one human pose; generating a plurality of abstract body part hypotheses from the plurality of abstract coupled body part nodes and each of the real body part hypotheses; generating an optimal tracklet for each of the abstract body part hypotheses; and estimating a human pose in the unconstrained video based upon the abstract body part tracklets using tree-based optimization.
9. The system of claim 8, wherein the system further includes software programs for generating a plurality of best full body pose hypotheses for each of the plurality of consecutive frames, further comprising generating a plurality of best full body pose hypotheses using an N-best inference algorithm.
10. The system of claim 8, wherein the real body part nodes are selected from head, neck, right elbow, left elbow, right wrist, left wrist, right hip, left hip, right knee, left knee, right foot and left foot.
11. The system of claim 8, wherein the abstract body part nodes include abstract single body part nodes.
12. The system of claim 11, wherein the abstract single body part nodes include, head top and head bottom.
13. The system of claim 8, wherein the abstract coupled body part nodes include one or more of, right shoulder combined with left shoulder, right elbow combined with left elbow, right wrist combined with left wrist, right hip combined with left hip, right knee combined with left knee and right ankle combined with left ankle.
14. The system of claim 1, wherein the system is integrated into a gaming system or a camera.
15. One or more non-transitory computer-readable media having computer-executable instructions for performing a method of running a software program on a computing device, the computing device operating under an operating system, the method including issuing instructions from the software program comprising: receiving an unconstrained video comprising a plurality of consecutive frames including at least one human pose; generating a plurality of best full body pose hypotheses for each of the plurality of consecutive frames; extracting a plurality of real body part nodes from each of the plurality of best full body pose hypotheses in each of the plurality of consecutive frames of the unconstrained video; generating a real body part hypotheses for each of the plurality of real body part nodes extracted from the plurality of best full body pose hypotheses; combining one or more pairs of symmetric real body part nodes into a single abstract coupled body part node to generate a plurality of abstract coupled body part nodes for each of the plurality of consecutive frames of the unconstrained video, wherein each of the one or more pairs of symmetric real body part nodes includes a left real body part node of the at least one human pose and a corresponding symmetric right real body part node of the at least one human pose; generating a plurality of abstract body part hypotheses from the plurality of abstract coupled body part nodes and each of the real body part hypotheses; generating an optimal tracklet for each of the abstract body part hypotheses; and estimating a human pose in the unconstrained video based upon the abstract body part tracklets using tree-based optimization.
16. The media of claim 15, wherein generating a plurality of best full body pose hypotheses for each of the plurality of consecutive frames, further comprising generating a plurality of best full body pose hypotheses using an N-best inference algorithm.
17. The media of claim 15, wherein the real body part nodes are selected from head, neck, right elbow, left elbow, right wrist, left wrist, right hip, left hip, right knee, left knee, right foot and left foot.
18. The media of claim 15, wherein the abstract body part nodes include abstract single body part nodes.
19. The media of claim 18, wherein the abstract single body part nodes include, head top and head bottom.
20. The media of claim 15, wherein the abstract coupled body part nodes include one or more of, right shoulder combined with left shoulder, right elbow combined with left elbow, right wrist combined with left wrist, right hip combined with left hip, right knee combined with left knee and right ankle combined with left ankle.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
(2) For a fuller understanding of the invention, reference should be made to the following detailed description, taken in connection with the accompanying drawings, in which:
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
(13)
(14)
(15)
(16)
(17)
(18)
(19)
DETAILED DESCRIPTION OF THE INVENTION
(20) The present invention provides a method to estimate a sequence of human poses in an unconstrained video. In contrast with the commonly employed graph optimization framework, which is NP-hard (nondeterministic polynomial time-hard) and necessitates approximate solutions, in the present invention, this problem is formulated into a unified two stage, tree-based, optimization problem for which an efficient and exact solution exists. Although the proposed method finds an exact solution, it does not sacrifice the ability to model the spatial and temporal constraints between body parts in the video frames; indeed it even models the symmetric parts better than the existing methods currently known in the art.
(21) One commonly employed methodology for human pose estimation in videos is the graph optimization formulation. There are two types of such formulation. The first type of this formulation is to generate several human pose hypotheses in each frame and select one best hypothesis from each frame, while making sure they are consistent throughout the video. The inference in this approach is very efficient, however, due to the large variations of pose configurations, it is very difficult to get good poses with all body parts correctly estimated. Therefore, a second type of such formulation was introduced to handle each body part separately. In this formulation, hypotheses are generated for each body part in every frame. Following the spatial constraints between body parts in each frame and using temporal consistency of appearances and locations between adjacent frames, the goal is to optimally select the best hypotheses for each body part from all the frames together. This formulation is desirable, since it is able to expand sufficient diverse human pose configurations and it is able to effectively model spatiotemporal constraints between body parts. Despite all the benefits of this formulation, it is an NP-hard problem due to the underlying loopy graph structure (i.e. there are many simple cycles in the graph; e.g. the simple cycles in
(22)
(23) In various embodiments, the present invention solves the problem of exploiting the spatial constraints between the body parts in each frame and temporal consistency throughout the frames, to the greatest possible extent, while also providing an efficient exact solution. Since it is known that the inference of a tree-based optimization problem has a polynomial time solution, the main issue solved by the present invention becomes is how to formulate the problem in order to model the useful spatial and temporal constraints between body parts among the frames without inducing simple cycles.
(24) To solve the problems known in the art, the present invention approximates the original fully connected model into a simplified tree-based model. In contrast with the standard tree representation of body parts, the present invention introduces a new concept, related to the use of abstract body parts and referred to as abstraction, to conceptually combine the symmetric body parts.
(25)
(26) In computer vision, and several other disciplines, many problems can be abstracted as follows. Assume there is a set of entities ={e.sup.i|.sub.i=1.sup.N}, where each entity can only
(27) be in one of the many states S={s.sup.k|.sub.k=1.sup.M}, with the unary scoring functions {(e.sup.i, s.sup.k|e.sup.i , s.sup.k S}, which gives the likelihood that an entity e.sup.i is in state s.sup.k. And there is a binary compatibility function for each pair of entities {(e.sup.i, e.sup.j, s.sup.k, s.sup.l)|e.sup.i, e.sup.j , s.sup.k, s.sup.lS} which represents the compatibility of entity e.sup.i in state s.sup.k and entity e.sup.j in state s.sup.l. The goal then is to determine the best states for each entity such that all of them have high unary scores and they are also compatible with each other. This problem can be modeled as a graph optimization problem formulated by relational and hypothesis graphs, which is described below.
(28) A relational graph, G.sub.r=(V.sub.r, E.sub.r), represents the relationship of a set of entities which are represented by entity nodes {v.sub.r.sup.i|.sub.i=1.sup.|V.sup.
(29) For a tree-based relational graph, G.sub.r, and the corresponding hypothesis graph, G.sub.h, the objective function for a set of arbitrary selected nodes s={s.sup.i|.sub.i=1.sup.|V.sup.(s)=.sub.s.sub.
in which is the parameter for adjusting the binary and unary weights and the goal is to maximize (s):s*=argmax.sub.s(
(s)). Letting the algorithm process from the leaf nodes to the root, and letting
(i,k)=(v.sub.h(i).sup.k)+.sub.v.sub.
(j,l)).(2)
(30) Based on this recursive function, the problem can be solved efficiently by dynamic programming with a computation complexity of (|V.sub.r|.Math.N), in which N is the max number of hypotheses for each node in V.sub.r.
(31) In the present invention, the term real body parts is used to represent body parts which are commonly used in the literature. The term abstract body parts is a new concept introduced by the present invention to facilitate the formulation of the proposed method, as illustrated in
(32) A known human pose estimation approach can be applied to each video frame to generate N best full body pose hypotheses. N is usually a large number (normally N>300) and for each real body part, the body part hypotheses are body part locations extracted from the-best poses. The body part hypotheses are sampled by an iterative, non-maximum suppression (NMS) scheme based on the detection score map. Detection score is a combination of max-marginal and foreground score,
.sub.s(p)=.sub.M(p)+(1).sub.F(p),(3)
in which .sub.s is the detection score, .sub.M is the max-marginal, .sub.F is the foreground score obtained by the background subtraction, and p is the location of the body part.
(33) The abstract body part hypotheses for a single part are the same as its corresponding real body part hypotheses and the abstract body part hypotheses for a coupled part are the permutation of its corresponding left and right body part hypotheses.
(34) Based on the abstract body part hypotheses described above, the goal is to obtain several best single part and coupled part tracklets through the video frames. The problem is now to select one hypothesis from each frame, ensuring that they have high detection scores and are consistent throughout the frames. Following the definitions previously discussed, the relational graph for this problem is shown in
(35) Based on the single part hypotheses, a single part tracklet hypothesis graph is built, as shown in
(36)
where p.sup.f and p.sup.f+1 are two arbitrary hypotheses from frames f and f+1, (p) is the optical flow predicted location p, {circumflex over (p)}.sup.f is the optical flow predicted location for p.sup.f and f+1, and is a parameter. The goal is to select one node from each frame to maximize the overall combined unary and binary weights. Given an arbitrary selection of nodes from the graph s.sub.s={s.sub.s.sup.i|.sub.i=1.sup.F}, wherein F is the number of frames, the objective function is given by:.sub.s(s.sub.s)=.sub.i=1.sup.F.sub.s(s.sub.s.sup.i)+.sub.s.Math..sub.i=1.sup.F.sub.s(s.sub.s.sup.i, s.sub.s.sup.i+1),(5)
where .sub.s is the parameter for adjusting the binary and unary weights and s.sub.s*=argmax.sub.s.sub..sub.s(s.sub.s)) gives the optimal solution. It is clear that the relational graph of this problem is a degenerate tree (i.e. single branch tree), as shown in
(37) The relational graph for the coupled part tracklets generation is the same as for the single part; however, the nodes and edges are defined differently. In this case, each hypothesis node is composed of the locations of a pair of symmetric parts (e.g. left and right ankles).
(38)
where .sub.s is from Eq. 3, and wherein, r.p and r.q respectively represent the left and right components of the coupled part r, (p) is the normalized color histogram of a local patch around p, the denominator is the inverse sigmoid function which penalizes the overlap of the symmetric parts, and is the parameter that controls the penalty. The binary weights of the edges are computed as:
.sub.c(r.sup.f, r.sup.f+1)=.sub.s(r.p.sup.f, r.p.sup.f+1)+.sub.s(r.q.sup.f, r.q.sup.f+1),(7)
where .sub.s is from Eq. 4.
(39) Similarly, the goal is to select one node (which is a composition of a pair of symmetric parts) from each frame to maximize the overall combined unary and binary weights. Given an arbitrary selection of nodes from the graph s.sub.c={s.sub.c.sup.i|.sub.i=1.sup.F} (where F is the number of frames), the objective function is:.sub.c(s.sub.c)=.sub.i=1.sup.F.sub.c(s.sub.c.sup.i)+.sub.c.Math..sub.i=1.sup.F.sub.c(s.sub.c.sup.i, s.sub.c.sup.i+1),(8)
where .sub.c is the parameter to adjust the binary and unary weights and s.sub.c*=argmax.sub.s.sub..sub.c(s.sub.c)) gives the optimal solution. As previously discussed, the problem can also be solved by dynamic programming efficiently and iterated for multiple times to generate several tracklets.
(40) After the best tracklets for each of the abstract body parts are obtained by the methods previously described, the next step is to select the best tracklets that are compatible. The relational graph G.sub.T=(V.sub.T, E.sub.T), for this final tracklet based optimal pose estimation, is shown in
.sub.T(s.sub.s, t.sub.s)=.sub.i=1.sup.F.sub.d(s.sub.s.sup.i, t.sub.s.sup.i)(9)
the binary weight between a single part tracklet node s.sub.s={s.sub.s.sup.i|.sub.i=1.sup.F} and an adjacent coupled part tracklet node t.sub.c={t.sub.c.sup.i|.sub.i=1.sup.F} is:
.sub.T(s.sub.s, t.sub.c)=.sub.i=1.sup.F(.sub.d(s.sub.s.sup.i, t.sub.c.sup.i.Math.p)+.sub.d(s.sub.s.sup.i, t.sub.c.sup.i.Math.q))(10)
and the binary weight between a pair of adjacent coupled tracklet nodes s.sub.c={s.sub.c.sup.i|.sub.i=1.sup.F} and t.sub.c={t.sub.c.sup.i|.sub.i=1.sup.F} is:
.sub.T(s.sub.c, t.sub.c)=.sub.i=1.sup.F(.sub.d(s.sub.c.sup.i.Math.p, t.sub.c.sup.i.Math.p)+.sub.d(s.sub.c.sup.i.Math.q, t.sub.c.sup.i.Math.q)).(11)
(41) Now, the goal is to select only one tracklet for each abstract body part in order to maximize the combined unary (detection score) and binary (compatible score) weights. Given an arbitrary tree selected from the hypothesis graph s.sub.T={s.sub.T.sup.i|.sub.i=1.sup.|V.sup..sub.T(s.sub.T)=.sub.v.sub.
where .sub.T is a parameter for adjusting the binary and unary weights and the optimal solution, s.sub.T*=argmax.sub.s.sub..sub.T(s.sub.T)), can also be obtained by the dynamic programming algorithm efficiently. The body part locations in each frame are extracted from this final optimal solution.
(42) In one embodiment, the system of the present invention may include various software programs and associated hardware components, such as a central processing unit (CPU) and associated memory. In an exemplary embodiment of the present invention, 15 consecutive video frames were analyzed each time. For Eq. 5 and Eq. 5, the unary and binary weights were normalized for each from between 0 and 1. In Eq. 3, =0.5 and .sub.c=.sub.s=.sub.T=1 for Eq. 5, Eq. 8 and Eq. 12. For Eq. 5 and in Eq. 6, 10% of the median height (normally 15-30 pixels) of N-Best poses was used. For each real body part, 20 hypotheses were generated and for each abstract body part, the top 10 tracklets were selected.
(43) In the present invention, a tree-based optimization method for human pose estimation in videos is provided. The main contribution of the invention is focused on reformulating the problem to remove the simple cycles from the relational graph, while at the same time maintaining the useful connections to the greatest possible extent, in order to transform the original NP-hard problem into a simpler tree-based optimization problem, for which an exact solution exists and which can be solved efficiently. The method of the present invention is general and has potential to be employed in solving other problems in computer vision.
(44) The present invention has improved performance over current human pose estimation methods in videos currently known in the art. The method and associated software-executed algorithm of the present invention has lower computational complexity. As compared to the depth sensor based systems of the prior art, the present invention has broader application in the field of video estimation as a result of the ability to utilize a regular video camera, thereby eliminating the depth sensor requirement. In addition, the present invention also reduces the required equipment cost since video cameras are much cheaper than video systems incorporating depth sensor technology.
(45) The present invention has tremendous commercial value in the computer/digital gaming industry and also in the public surveillance industry. For the gaming companies, they can directly integrate the present inventive method into their new games and attract new customers by this improved user experience. Additionally, video camera producers can incorporate the inventive technology directly into the cameras, thereby creating a competitive advantage over other camera producers.
(46) The present invention may be embodied on various computing platforms that perform actions responsive to software-based instructions. The following provides an antecedent basis for the information technology that may be utilized to enable the invention.
(47) The computer readable medium described in the claims below may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any non-transitory, tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
(48) A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. However, as indicated above, due to circuit statutory subject matter restrictions, claims to this invention as a software product are those embodied in a non-transitory software medium such as a computer hard drive, flash-RAM, optical disk or the like.
(49) Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wire-line, optical fiber cable, radio frequency, etc., or any suitable combination of the foregoing. Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C#, C++, Visual Basic or the like and conventional procedural programming languages, such as the C programming language or similar programming languages.
(50) Aspects of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
(51) These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
(52) The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
(53) It should be noted that when referenced, an end-user is an operator of the software as opposed to a developer or author who modifies the underlying source code of the software. For security purposes, authentication means identifying the particular user while authorization defines what procedures and functions that user is permitted to execute.
(54) It will be seen that the advantages set forth above, and those made apparent from the foregoing description, are efficiently attained and since certain changes may be made in the above construction without departing from the scope of the invention, it is intended that all matters contained in the foregoing description or shown in the accompanying drawings shall be interpreted as illustrative and not in a limiting sense.
(55) It is also to be understood that the following claims are intended to cover all of the generic and specific features of the invention herein described, and all statements of the scope of the invention which, as a matter of language, might be said to fall therebetween.