METHOD AND SYSTEM FOR MATCHING 2D HUMAN POSES FROM MULTIPLE VIEWS
20230215043 · 2023-07-06
Inventors
Cpc classification
G06V40/103
PHYSICS
G06V30/1904
PHYSICS
G06V20/52
PHYSICS
G06V30/19067
PHYSICS
International classification
G06V40/10
PHYSICS
G06T7/246
PHYSICS
Abstract
This disclosure is directed to a method and system for matching human pose data in the form of 2D skeletons for the purposes of 3D reconstruction. The system may comprise a scoring module that assigns an affinity score to each pair of cross-view 2D skeletons, a matching module that assigns optimal pairwise matches based on the affinity scores, a grouping module that assigns each 2D skeleton to a group such that each group corresponds to a unique person, based on the pairwise matches; and a temporal consistency module that assigns each group an ID that maintains correspondence to the same person over the multi-video sequence.
Claims
1. A method of identifying humans between two or more camera views from two-dimensional (2D) skeletons of the humans of each view, the method comprising: a) for each skeleton in each of the two or more camera views, performing a pairwise scoring with each of the skeletons in another of the two or more camera views and assigning an affinity score to each pair; b) identifying a best match of a skeleton in a first camera view to a skeleton in a second camera view by maximizing the affinity score of the pair; c) grouping skeletons by identifying a set of skeletons in a first camera view, the set relating to the humans in the first camera view, with a set of skeletons in a second camera view using the best match; and d) assigning an identifier to each skeleton in the grouped skeletons in a frame of the two or more camera views and assigning the same identifier to each skeleton in the grouped skeletons in a subsequent frame of the two or more camera views that match.
2. (canceled)
3. A method of identifying humans between two or more camera views from two-dimensional (2D) skeletons of the humans of each view, the method comprising: a) for each skeleton in each of the two or more camera views, performing a pairwise scoring with each of the skeletons in another of the two or more camera views and assigning an affinity score to each pair, wherein the pairwise scoring of a pair of skeletons from a pair of camera views comprises modelling a ray from each camera view to an element of the 2D skeleton associated with the camera view and determining the minimum distance between the two rays; b) identifying a best match of a skeleton in a first camera view to a skeleton in a second camera view by maximizing the affinity score of the pair; and c) grouping skeletons by identifying a set of skeletons in a first camera view, the set relating to the humans in the first camera view, with a set of skeletons in a second camera view using the best match.
4. The method of claim 3 wherein if the rays are divergent, the pair is not included in the affinity score.
5. The method of claim 3 wherein the pairwise scoring of a pair of skeletons from a pair of camera views further comprises excluding elements where the minimum distance between the two rays exceeds a threshold.
6. The method of claim 3 wherein the pairwise scoring of a pair of skeletons from a pair of camera views further comprises determining a deviation of attributes of a putative three-dimensional (3D) skeleton formed from the 2D skeletons from a typical human.
7. The method of claim 1 further comprising calibrating each camera view by determining the position and angle of the camera, and synchronizing the camera view by aligning frames taken at the same time from the one or more camera views.
8. The method of claim 1 wherein identifying a best match of a skeleton in a first camera view to a skeleton in a second camera view includes not identifying any match.
9. A motion capture system for two or more humans, the system comprising: two or more calibrated cameras generating synchronized video streams, each camera having an overlapping field of views that include the two or more humans; a two-dimensional (2D) pose estimator module associated with each of the two or more calibrated cameras for generating a 2D skeleton for each human in the field of view of the camera for a frame of the video stream; a scoring module for performing a pairwise scoring for each of the 2D skeletons associated with a first camera with each 2D skeleton of another of the two or more cameras and assigning an affinity score to each pair; a matching module that matches a 2D skeleton in a first camera view to a 2D skeleton in a second camera view by maximizing the affinity score of the pair; a grouping module that groups 2D skeletons by identifying a set of 2D skeletons in a first camera view, the set relating to the humans in the first camera view, with a set of 2D skeletons in a second camera view using the best match; a temporal matching module that assigns an identifier to each 2D skeleton group that remains consistent across a sequence of frames of the video streams; and a three-dimensional (3D) reconstruction module that combines the grouped 2D skeleton across a sequence of frames for a human to create a 3D skeleton of the human, capturing the position of the human.
10. The system of claim 9 wherein the scoring module comprises a model of a ray from each camera view to an element of the 2D skeleton associated with the camera view and determining the minimum distance between the two rays.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0008] In drawings which illustrate by way of example only an embodiment of the disclosure,
[0009]
[0010]
[0011]
[0012]
DETAILED DESCRIPTION
[0013] This disclosure is directed to a method and system for matching human pose data in the form of 2D skeletons for the purposes of 3D reconstruction. The system may comprise a scoring module 20 that assigns an affinity score to each pair of cross-view 2D skeletons, a matching module 30 that assigns optimal pairwise matches based on the affinity scores, a grouping module 50 that assigns each 2D skeleton to a group such that each group corresponds to a unique person, based on the pairwise matches; and a temporal consistency module 60 that assigns each group an ID that maintains correspondence to the same person over the multi-video sequence.
[0014] With reference to
[0015] A 2D human pose estimator may generate 2D skeletons for each human in each of the two or more video sequences. This may be done using known techniques, such as a convolutional neural network (CNN), for example one provided by Wrnch.AI. A sequence of 2D skeletons may be provided corresponding to the video sequences for each camera.
[0016] With reference to
[0017] An approximate triangulation is computed by projecting a ray through each of the two keypoints. A keypoint of a 2D skeleton may be one particular element such as the centre of the head, the centre of the pelvis, or the right or left wrist. Assuming a pinhole camera model, each ray is modelled as originating at the respective camera's optical center, based on the known parameters of the camera such as its location, angle and field of view, and proceeding in the direction that passes through the keypoint on the virtual image plane. This is done for the same keypoint, for example the centre of the head, for the two skeletons being compared, one arising from a first camera and video sequence and one arising from the second camera and video sequence. The triangulation point is the point in 3-space at which the Euclidean distance between the two rays is at a minimum. The triangulation error may be the minimum distance between the two rays. If the triangulation point is determined to be behind the cameras, the rays are diverging and this point may not be considered in the score calculations. In some embodiments, this may be done for more than one keypoint pair.
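By way of illustration only, the approximate triangulation of paragraph [0017] may be sketched as follows. The sketch assumes each ray has already been back-projected into a common world frame as an origin (the optical center) and a direction. The function name `closest_approach`, the choice of the midpoint of the closest-approach segment as the triangulation point, and the treatment of near-parallel rays as unusable are assumptions of this sketch and are not taken from the disclosure.

```python
import math

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def closest_approach(o1, d1, o2, d2, eps=1e-9):
    """Approximate triangulation of two rays (hypothetical helper).

    o1, o2: ray origins (camera optical centers), as 3-tuples.
    d1, d2: ray directions through the keypoint, as 3-tuples.
    Returns (triangulation_point, triangulation_error), or None when the
    rays are near-parallel or the closest point lies behind a camera.
    """
    w = [p - q for p, q in zip(o1, o2)]
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if denom < eps:           # near-parallel rays: no unique closest point
        return None
    s = (b * e - c * d) / denom   # parameter along ray 1
    t = (a * e - b * d) / denom   # parameter along ray 2
    if s < 0 or t < 0:        # closest point behind a camera: rays diverge
        return None
    p1 = [o + s * k for o, k in zip(o1, d1)]
    p2 = [o + t * k for o, k in zip(o2, d2)]
    point = tuple((u + v) / 2 for u, v in zip(p1, p2))  # assumed: midpoint
    error = math.dist(p1, p2)  # minimum distance between the two rays
    return point, error
```

A return value of `None` corresponds to a keypoint pair that is not considered in the score calculations.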
[0018] One affinity score metric may be the total count of “inlier” keypoint pairs for the set of approximate triangulations for the given pair of 2D skeletons, where an inlier pair may be defined as a keypoint pair with a triangulation error below a certain threshold. For instance, a pair of 2D skeletons {A, B} may have a total of 7 inlier pairs out of a possible 8 (the pair corresponding to the left wrist joint is not considered an inlier because of high triangulation error), and another pair of skeletons {A, C} may have a total of 6 inlier pairs out of a possible 8 (the pairs corresponding to the right ankle and head joints respectively are not considered inliers). In this instance, {A, B} may score higher on the inlier metric of the weighted affinity score than {A, C}. Another metric may be the average triangulation error of all the pairs of keypoints belonging to the two skeletons. Another metric may be the “human-ness” of a putative 3D skeleton reconstruction consisting of all inlier triangulation points. The human-ness metric may be inversely proportional to the deviation of the limb lengths of the putative skeleton from those of an average person, based on anthropometric data. For instance, a putative 3D skeleton derived from a mismatched pair of 2D skeletons may have limbs that may be double the length of an average person, and thus may have a lower human-ness metric than a pair of correctly matched skeletons.
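By way of illustration only, the three metrics of paragraph [0018] may be combined into a weighted affinity score as sketched below. The disclosure does not specify the weights, the threshold, or the exact form of each term; the function names, the default values, and the mappings `1 / (1 + x)` used to turn an error or a deviation into a bounded score are all assumptions of this sketch. Triangulation errors are given per keypoint, with `None` marking a diverging pair that is excluded.

```python
def inlier_count(errors, threshold):
    """Count keypoint pairs whose triangulation error is below the threshold."""
    return sum(1 for e in errors if e is not None and e < threshold)

def mean_error(errors):
    """Average triangulation error over the keypoint pairs that were triangulated."""
    valid = [e for e in errors if e is not None]
    return sum(valid) / len(valid) if valid else float("inf")

def humanness(limb_lengths, reference_lengths):
    """Score in (0, 1] that falls as limb lengths deviate from the reference.

    Both arguments map a limb name to a length; the reference values would
    come from anthropometric data, which this sketch does not supply.
    """
    devs = [abs(limb_lengths[k] - reference_lengths[k]) / reference_lengths[k]
            for k in limb_lengths if k in reference_lengths]
    if not devs:
        return 0.0
    return 1.0 / (1.0 + sum(devs) / len(devs))

def affinity(errors, limb_lengths, reference_lengths,
             threshold=0.1, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of inlier fraction, error term and human-ness (all assumed forms)."""
    w_inlier, w_error, w_human = weights
    n = len(errors)
    inlier_term = inlier_count(errors, threshold) / n if n else 0.0
    error_term = 1.0 / (1.0 + mean_error(errors))  # 0.0 when nothing triangulated
    human_term = humanness(limb_lengths, reference_lengths)
    return w_inlier * inlier_term + w_error * error_term + w_human * human_term
```

With a threshold of 0.1, a pair having seven small errors and one large error yields an inlier count of 7, and a pair having six small and two large errors yields 6, mirroring the {A, B} and {A, C} example above. A putative skeleton whose limbs are double the reference length receives a human-ness of 0.5 under this particular mapping.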
[0019] With reference to
[0020] The grouping module 50 may take the set of pairwise matches and output N sets of 2D skeletons, where N is the number of distinct people in the scene and each set corresponds to a distinct person in the scene. With reference to
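By way of illustration only, one possible way to turn a set of pairwise matches into N groups is to treat the matches as edges of a graph and take its connected components, as sketched below using a union-find structure. This is an assumption of the sketch and is not stated to be the grouping method of the disclosure; the function name `group_skeletons` and the use of `(camera, index)` tuples as skeleton identifiers are likewise hypothetical.

```python
def group_skeletons(skeletons, matches):
    """Group skeletons into connected components of the match graph.

    skeletons: list of hashable skeleton identifiers, e.g. (camera, index).
    matches:   list of (skeleton_a, skeleton_b) pairwise matches.
    Returns a list of groups, each a list of skeleton identifiers.
    """
    parent = {s: s for s in skeletons}

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a   # merge the two groups

    groups = {}
    for s in skeletons:
        groups.setdefault(find(s), []).append(s)
    return list(groups.values())
```

A skeleton with no match forms a group of its own. A practical implementation might additionally need to reject groups that contain two skeletons from the same camera view, which this sketch does not do.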
[0021] The temporal matching module 60 may assign an ID to each 2D skeleton group, such that each person's ID remains consistent over the video sequences. An embodiment may achieve this by reprojecting the 3D skeletons from a previous timestep according to the camera parameters to create a set of predicted 2D skeletons in a current timestep. The pixel distance to each 2D skeleton group from the 2D skeleton projections of the previous timestep may be computed, and a matching method such as the Hungarian algorithm may be used to generate a one-to-one correspondence between the set of extant 3D skeletons and the 2D skeleton groups such that the pixel distances are minimized. The 2D groups may then be assigned IDs that correspond to the indices of the extant 3D skeletons. This may be continued for each timestep of the video sequence.
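By way of illustration only, the ID assignment of paragraph [0021] may be sketched as below. To remain self-contained, the sketch finds the minimum-cost one-to-one correspondence by exhaustive search over permutations rather than by the Hungarian algorithm; the two give the same optimal cost, but exhaustive search is only practical for a handful of people. The sketch further assumes that the reprojection has already been done, that each 2D skeleton group is represented by its keypoints in a single reference view, and that the names `mean_pixel_distance` and `assign_ids` are hypothetical.

```python
import math
from itertools import permutations

def mean_pixel_distance(predicted, observed):
    """Mean Euclidean pixel distance between corresponding keypoints."""
    return sum(math.dist(p, q) for p, q in zip(predicted, observed)) / len(predicted)

def assign_ids(predicted, observed):
    """Assign previous-timestep IDs to current 2D skeleton groups.

    predicted: dict mapping ID -> list of (x, y) reprojected keypoints.
    observed:  list of keypoint lists, one per current group.
    Returns a list with one entry per observed group: the matched ID,
    or None when there are more groups than extant 3D skeletons.
    """
    ids = list(predicted)
    n_ids, n_obs = len(ids), len(observed)
    cost = [[mean_pixel_distance(predicted[i], g) for g in observed] for i in ids]

    best_pairs, best_cost = [], math.inf
    if n_ids <= n_obs:
        # Choose a distinct observed group for every extant ID.
        for perm in permutations(range(n_obs), n_ids):
            total = sum(cost[k][j] for k, j in enumerate(perm))
            if total < best_cost:
                best_pairs, best_cost = list(enumerate(perm)), total
    else:
        # Choose a distinct extant ID for every observed group.
        for perm in permutations(range(n_ids), n_obs):
            total = sum(cost[k][j] for j, k in enumerate(perm))
            if total < best_cost:
                best_pairs, best_cost = [(k, j) for j, k in enumerate(perm)], total

    result = [None] * n_obs
    for k, j in best_pairs:
        result[j] = ids[k]
    return result
```

Groups returned as `None` have no predecessor in the previous timestep; how new IDs are issued for them is not addressed here.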
[0022] The system modules described may be separate software modules, separate hardware modules, or portions of one or more hardware components. The functionality of the modules described above may be implemented in a single system or provided in separate modules similar to or different from the modules described.
[0023] The software modules may consist of instructions written in a computer language such as C++ or assembly code and run on computer hardware such as a CPU, or they may be implemented on an FPGA. The software may utilize storage, such as RAM or magnetic storage, such as one or more hard drives. The system may run on a desktop computer, mobile phone or another platform that includes suitable memory for holding the software, data and skeleton parameters.
[0024] In an embodiment, the human matching system may comprise part of a motion capture system which digitizes the 3D poses of two or more human subjects, such as in real time or in post-processing. This digitized pose data may be used for such applications as performance capture for digital media, or for sport analytics. Two or more calibrated cameras may be synchronized and their video streams captured and processed by 2D pose estimator systems, such as one for each video stream. The matching system may receive the output 2D skeletons from the 2D pose estimators, such as through a network interface or computer bus. The matched 2D skeleton groups may then be provided to a 3D reconstruction module, which fuses the 2D keypoints for each person in the scene to obtain the 3D pose data for each skeleton.
[0025] Various embodiments of the present disclosure having been thus described in detail by way of example, it will be apparent to those skilled in the art that variations and modifications may be made without departing from the disclosure. The disclosure includes all such variations and modifications as fall within the scope of the appended claims.