METHOD AND SYSTEM OF GLOBAL POSITION PREDICTION FOR IMU MOTION CAPTURE
20230083619 · 2023-03-16
Inventors
Cpc classification
G06F3/011
PHYSICS
G06F3/0346
PHYSICS
G01C23/00
PHYSICS
International classification
G01C23/00
PHYSICS
Abstract
A computerized method for global position prediction for inertial measurement unit (IMU) motion capture comprising: implementing a u-net architecture; obtaining and utilizing a source data from an IMU based motion capture system; implementing pre-processing of the source data by: windowing the source data into a set of short sequences of time-windows, and performing a generic rotation of the windowed source data, wherein a motion captured by the IMU based motion capture system is invariant to a facing direction in a horizontal plane; pre-processing of a set of training targets using a set of transformations and adjusting for a center of mass and zeroing a root displacement at a start of each time window; implementing a post-processing by performing an inverse of the pre-processing of the set of training targets to generate a plurality of position estimations; and using a mean value of the plurality of position estimations for a set of position predictions to generate the global position prediction.
Claims
1. A computerized method for global position prediction for inertial measurement unit (IMU) motion capture comprising: implementing a u-net architecture; obtaining and utilizing a source data from an IMU based motion capture system; implementing pre-processing of the source data by: windowing the source data into a set of short sequences of time-windows, and performing a generic rotation of the windowed source data, wherein a motion captured by the IMU based motion capture system is invariant to a facing direction in a horizontal plane; pre-processing of a set of training targets using a set of transformations and adjusting for a center of mass and zeroing a root displacement at a start of each time window; implementing a post-processing by performing an inverse of the pre-processing of the set of training targets to generate a plurality of position estimations; and using a mean value of the plurality of position estimations for a set of position predictions to generate the global position prediction.
2. The computerized method of claim 1, wherein the u-net architecture is modified for regression and acts as an ensemble of regression models used to construct a prediction.
3. The computerized method of claim 2, wherein the u-net architecture comprises an encoder stage and a decoder stage with a set of skip-connections relaying information at different temporal scales.
4. The computerized method of claim 3, wherein in the encoder stage, the input data is encoded in a temporal dimension while being expanded in a feature dimension using convolutional layers.
5. The computerized method of claim 4, wherein input to the u-net architecture is a two-dimensional (2D) Tensor, with time in the vertical dimension and features in the horizontal dimension.
6. The computerized method of claim 5, wherein between each down and up sampling layer of a same temporal scale, there is a skip connection which passes the output of the encoder directly to a temporal counter part in the decoder side.
7. The computerized method of claim 6, wherein the decoder structure follows an inverse description of the encoding process, and wherein the up sampling is performed using linear interpolation.
8. The computerized method of claim 7, wherein the IMU based motion capture system provides a pose information oriented with respect to a world fixed coordinate system.
9. The computerized method of claim 8, wherein the source data provided by the IMU based motion capture system comprises a set of position vectors that indicate a human joint's position with respect to a root joint that has a fixed position in an origin of a world frame but is free to rotate.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0005] The present application can be best understood by reference to the following description taken in conjunction with the accompanying figures, in which like parts may be referred to by like numerals.
[0024] The Figures described above are a representative set and are not exhaustive with respect to embodying the invention.
DESCRIPTION
[0025] Disclosed are a system, method, and article of manufacture for global position prediction for IMU motion capture. The following description is presented to enable a person of ordinary skill in the art to make and use the various embodiments. Descriptions of specific devices, techniques, and applications are provided only as examples. Various modifications to the examples described herein will be readily apparent to those of ordinary skill in the art, and the general principles defined herein may be applied to other examples and applications without departing from the spirit and scope of the various embodiments.
[0026] Reference throughout this specification to “one embodiment,” “an embodiment,” “one example,” or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, appearances of the phrases “in one embodiment,” “in an embodiment,” and similar language throughout this specification may, but do not necessarily, all refer to the same embodiment.
[0027] Furthermore, the described features, structures, or characteristics of the invention may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided, such as examples of programming, software modules, user selections, network transactions, database queries, database structures, hardware modules, hardware circuits, hardware chips, etc., to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art can recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, materials, and so forth. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
[0028] The schematic flow chart diagrams included herein are generally set forth as logical flow chart diagrams. As such, the depicted order and labeled steps are indicative of one embodiment of the presented method. Other steps and methods may be conceived that are equivalent in function, logic, or effect to one or more steps, or portions thereof, of the illustrated method. Additionally, the format and symbols employed are provided to explain the logical steps of the method and are understood not to limit the scope of the method. Although various arrow types and line types may be employed in the flow chart diagrams, they are understood not to limit the scope of the corresponding method. Indeed, some arrows or other connectors may be used to indicate only the logical flow of the method. For instance, an arrow may indicate a waiting or monitoring period of unspecified duration between enumerated steps of the depicted method. Additionally, the order in which a particular method occurs may or may not strictly adhere to the order of the corresponding steps shown.
Definitions
[0029] Accelerometer is a device that measures acceleration such as proper acceleration. Proper acceleration is the acceleration (e.g. the rate of change of velocity) of a body in its own instantaneous rest frame.
[0030] Gyroscope is a device used for measuring or maintaining orientation and angular velocity.
[0031] Electromagnetic field (EMF) sensor measures the ambient (e.g. surrounding) electromagnetic field(s).
[0032] Inertial measurement unit (IMU) is an electronic device that measures and reports a body's specific force, angular rate, and sometimes the orientation of the body, using a combination of accelerometers, gyroscopes, and sometimes magnetometers.
[0033] Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed. Machine learning focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data. Example machine learning techniques that can be used herein include, inter alia: decision tree learning, association rule learning, artificial neural networks, inductive logic programming, support vector machines, clustering, Bayesian networks, reinforcement learning, representation learning, similarity and metric learning, and/or sparse dictionary learning.
[0034] Recurrent neural network (RNN) is a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence. This allows it to exhibit temporal dynamic behavior. Derived from feedforward neural networks, RNNs can use their internal state (memory) to process variable length sequences of inputs.
Example System for Global Position Prediction for IMU Motion Capture
[0035] A method can be used for reconstructing a global position for IMU-based motion capture. IMU suits can provide pose data along with the global orientation of a capture subject. Additional/other methods can be used to set the global position. For example, an IMU-based motion capture method can use a large collection of motion capture data to train a universal neural network to predict vertical height and per frame horizontal displacement given a short window of pose data. This can integrate horizontal position changes along with measured root orientation to produce a global output motion. The IMU-based motion capture method can be refined in a final step with a kinematic touch up. The IMU-based motion capture method can use various network architectures and data representations along with a quantitative evaluation of the method for different classes of motion.
[0036] The IMU-based motion capture method can utilize a learning-based solution to compute the global position of IMU motion capture by exploiting a large collection of previously recorded optical motion capture data. The IMU-based motion capture method can train a universal network (u-net) to predict the global body displacement from the optical skeleton data based on a short history of pose data. It is noted that the pose includes a lot of information about the activity being captured, and that a short temporal window of data provides sufficient information to predict the trajectory. The trained u-net model can be used to predict the vertical position of the root and the per-frame displacements in the horizontal plane. The latter are integrated to reconstruct the global motion.
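The integration of per-frame horizontal displacements described above can be sketched as follows. This is a minimal illustration and not the disclosed implementation: the function name `integrate_displacements`, the array layouts, the use of a per-frame yaw angle as the measured root orientation, and the choice of the third column as the vertical axis are all assumptions made for the example.

```python
import numpy as np

def integrate_displacements(local_dxy, yaw, height):
    """Accumulate per-frame horizontal displacements into a global root path.

    local_dxy: (T, 2) predicted displacement per frame in a heading-aligned
               frame (assumed layout: forward, lateral)
    yaw:       (T,) measured root heading in radians (assumed input)
    height:    (T,) predicted vertical root position
    Returns a (T, 3) array of [x, y, height] positions.
    """
    local_dxy = np.asarray(local_dxy, dtype=float)
    yaw = np.asarray(yaw, dtype=float)
    c, s = np.cos(yaw), np.sin(yaw)
    # Rotate each local step into the world frame using the measured heading.
    gx = c * local_dxy[:, 0] - s * local_dxy[:, 1]
    gy = s * local_dxy[:, 0] + c * local_dxy[:, 1]
    # A running sum of the steps gives the horizontal position.
    xy = np.cumsum(np.stack([gx, gy], axis=1), axis=0)
    return np.column_stack([xy, np.asarray(height, dtype=float)])

# Two unit steps "forward": first while facing +x, then while facing +y.
path = integrate_displacements([[1.0, 0.0], [1.0, 0.0]],
                               [0.0, np.pi / 2],
                               [1.0, 1.0])
```

Because the path is a running sum, any bias in the predicted displacements accumulates over time, which is one reason a later kinematic touch-up can be useful.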
[0037] The use of a fixed temporal window makes the IMU-based motion capture method history independent, in contrast to, for instance, a recurrent neural network. The motion that the IMU-based motion capture method reconstructs from the u-net output is of high quality. A kinematic touch-up can be applied to address foot-skate.
[0039] It is noted that ML can be used for the root positioning problem of motion-capture systems. Accordingly, an IMU-based motion-capture system can use ML for position estimation, an application that has not been investigated to date for IMU capture.
[0053] In step 1402, the u-net architecture can be implemented. The u-net can be modified for regression and acts as an ensemble of regression models from which process 1400 can construct a prediction. The network consists of an encoder stage and a decoder stage with skip-connections relaying information at different temporal scales.
[0054] In the encoder stage, the input data is encoded in the temporal dimension while being expanded in the feature dimension using convolutional layers. The input to the network is a 2D Tensor, with time in the vertical dimension and features in the horizontal dimension. Process 1400 can use T to denote the time-window size and N for the dimension of the combined feature vectors. In the case of a time-window of 64 frames and a character with, for example, nineteen (19) positional joint vectors, this results in a T×N=64×57 input tensor to the network.
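As an illustration of the input layout described above, the following sketch flattens nineteen 3D joint vectors into N=57 features per frame and slices a sequence into T=64 frame windows. The helper name `make_windows` and the stride of one frame are assumptions for the example; the stride of one is chosen because it makes each interior frame appear in 64 windows.

```python
import numpy as np

def make_windows(features, window=64, stride=1):
    """Slice a (frames, N) feature array into overlapping (window, N) tensors."""
    features = np.asarray(features)
    n_frames = features.shape[0]
    if n_frames < window:
        raise ValueError("sequence is shorter than one window")
    starts = range(0, n_frames - window + 1, stride)
    return np.stack([features[s:s + window] for s in starts])

joints = np.zeros((200, 19, 3))     # 200 frames, 19 joints, xyz per joint
flat = joints.reshape(200, -1)      # time by features: (200, 57)
windows = make_windows(flat)        # one T x N tensor per start frame
```

Each element of `windows` has time in the first dimension and features in the second, matching the T×N=64×57 input tensor.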
[0055] It is noted that u-net layout is summarized in
[0056] The second convolution can be in the temporal dimension, over all the output channels from the first convolution. The activation functions used throughout the network are rectified linear units (ReLU). After each set of convolutions the output of that step is reshaped so that the input to the next layer is again of the form [Batch×1×T×F]. Here F=Channels out can be seen as a new abstract feature dimension. At the end of the layer the current output is stored for later use in the skip connections. Then the output is down sampled in the temporal dimension using a maxpool operation with a length of 2. The feature dimension can be kept constant during this step.
[0057] Between each down and up sampling layer of the same temporal scale, there can be a skip connection which passes the output of the encoder directly to its temporal counterpart in the decoder side of the network. This ensures that the network can extract information and process it in the output for multiple timescales. The decoder structure can follow an inverse description of the encoding process, where the up sampling is performed using linear interpolation.
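The temporal down sampling, linear up sampling, and skip connection of one encoder/decoder level can be sketched as below. The convolutions and ReLU activations are omitted, the function names are hypothetical, and merging the skip branch by concatenation along the feature dimension is an assumption, since the text does not state how the two branches are combined.

```python
import numpy as np

def maxpool_time(x):
    """Halve the temporal axis of a (T, F) array by taking the max of frame pairs."""
    t, f = x.shape
    return x[: t - t % 2].reshape(t // 2, 2, f).max(axis=1)

def upsample_time_linear(x, new_len):
    """Stretch the temporal axis of a (T, F) array to new_len by linear interpolation."""
    old = np.linspace(0.0, 1.0, x.shape[0])
    new = np.linspace(0.0, 1.0, new_len)
    return np.stack([np.interp(new, old, x[:, j]) for j in range(x.shape[1])], axis=1)

x = np.random.default_rng(0).normal(size=(64, 8))   # (T, F) layer output
skip = x                                     # stored for the skip connection
down = maxpool_time(x)                       # encoder: T halves, F unchanged
up = upsample_time_linear(down, 64)          # decoder: back to the skip's T
merged = np.concatenate([skip, up], axis=1)  # assumed merge: concatenate features
```

The feature dimension stays constant through the pooling step, so the skip branch and the up sampled branch share the same temporal length when they meet.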
[0058] In step 1404, process 1400 can obtain and utilize source data. The raw data is from a specified motion library and comes in the form of assets, each containing a single character doing a motion or a short sequence of motions, such as a short walk, a dance, or a jump. The motion library can include a database for humanoid motion capture data, with over 3000 different assets. To ensure uniformity throughout the data set, the selected assets can have an identical subset of the skeleton configuration. The final data set contained 577 motion assets, totaling 629,093 frames or nearly two (2) hours of motion data. The data in the motion library comes from different motion capture studios and individuals, guaranteeing diversity of the characters with respect to size, shape, and gender.
[0060] In step 1504, process 1500 can perform resampling. The input to a u-net should have the same temporal frequency, that is, each time-window can be the same size and span the same period. However, the motion library assets come in different frame rates. Therefore in a first step, the data is re-sampled to a uniform frame rate of 100 Hz as this is consistent with typical IMU motion capture.
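A minimal sketch of the re-sampling step follows, assuming positional data and linear interpolation between source frames. The interpolation method is not specified in the text, so it is an assumption, as is the function name `resample`; rotational data would need a rotation-aware interpolation instead.

```python
import numpy as np

def resample(positions, src_hz, dst_hz=100.0):
    """Linearly resample a (frames, D) array of positions from src_hz to dst_hz."""
    positions = np.asarray(positions, dtype=float)
    t_src = np.arange(positions.shape[0]) / src_hz
    duration = t_src[-1]
    # Number of target frames that fit inside the clip (small tolerance for rounding).
    n_dst = int(np.floor(duration * dst_hz + 1e-9)) + 1
    t_dst = np.arange(n_dst) / dst_hz
    return np.stack([np.interp(t_dst, t_src, positions[:, d])
                     for d in range(positions.shape[1])], axis=1)

clip = np.linspace(0.0, 1.0, 61)[:, None]   # one second recorded at 60 Hz
out = resample(clip, src_hz=60.0)           # the same second at 100 Hz
```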
[0061] Returning to process 1400, in step 1406, process 1400 can implement the pre-processing of input data.
[0062] In step 1604, process 1600 performs generic rotation. The motion within the physical world around us is invariant to the facing direction in the horizontal plane: whether a person walks north or south does not change the physical properties of motion. To this end, process 1600 can define a generic space in which the model is trained. In this way, whenever the model receives a time-window, it sees it in the same way. Process 1600 can define the vertical axis of the reference frame to match the global frame vertical, with both set to be opposite the direction of gravity. The axes of the horizontal plane of the reference frame are set from the orientation of the hip at the first frame of a temporal window. The hip's frontal axis is projected to the global horizontal plane to define a forward direction. The lateral motion axis in the global horizontal plane is orthogonal to both the forward and vertical axes.
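The generic rotation can be sketched as a rotation about the vertical axis that aligns the projected first-frame hip direction with a fixed forward axis. In this example the vertical axis is assumed to be z, the generic forward axis is assumed to be +x, and the function name `to_generic_frame` is hypothetical.

```python
import numpy as np

def to_generic_frame(positions, hip_forward):
    """Rotate a window about the vertical axis so the first-frame hip faces +x.

    positions:   (T, J, 3) joint positions, z assumed to be up
    hip_forward: (3,) frontal axis of the hip at the first frame of the window
    """
    fx, fy = hip_forward[0], hip_forward[1]   # projection onto the horizontal plane
    norm = np.hypot(fx, fy)
    if norm < 1e-8:
        raise ValueError("hip frontal axis is vertical; heading is undefined")
    c, s = fx / norm, fy / norm
    # Rotation by minus the heading about z: maps (fx, fy) onto the +x axis.
    rot = np.array([[c, s, 0.0],
                    [-s, c, 0.0],
                    [0.0, 0.0, 1.0]])
    return positions @ rot.T

# A window of a character moving along +y while its hip faces (mostly) +y.
window = np.zeros((64, 19, 3))
window[:, :, 1] = np.linspace(0.0, 1.0, 64)[:, None]
generic = to_generic_frame(window, np.array([0.0, 1.0, 0.2]))
```

After the rotation the same walk runs along +x, so a walk to the north and a walk to the south produce identical network inputs.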
[0063] Returning to process 1400, in step 1408, process 1400 performs pre-processing of training targets. To compute the training targets, process 1400 can use a slightly different set of transformations, specifically, adjusting for the center of mass and zeroing the root displacement at the start of the temporal window (as also shown in
[0065] In step 1704, process 1700 can implement root resetting. Process 1700 can make the network invariant to the starting position of a time-window. To achieve this, in one example, the trajectory in the horizontal plane of each time-window is reset to start at the origin. The result is that the training target is a time series representing the displacement of the character over the time-window. In step 1408, process 1400 can implement pre-processing training targets.
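Root resetting can be illustrated with the short sketch below, which shifts the horizontal trajectory of a window so that it starts at the origin. The function name `reset_root` and the choice of the third column as the vertical axis are assumptions for the example; the center of mass adjustment is not shown.

```python
import numpy as np

def reset_root(root_xyz):
    """Shift a (T, 3) root trajectory so its horizontal track starts at the origin."""
    out = np.array(root_xyz, dtype=float)
    # Subtract the first frame's horizontal position; z (assumed vertical)
    # keeps its absolute height.
    out[:, :2] -= out[0, :2].copy()
    return out

traj = np.array([[5.0, 2.0, 0.9],
                 [5.1, 2.0, 0.9],
                 [5.3, 2.1, 1.0]])
target = reset_root(traj)
```

The resulting target describes displacement within the window only, so windows cut from different places in a capture volume become comparable.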
[0066] In step 1410, process 1400 can implement post-processing of prediction at run-time. To recover global root data, the post-processing pipeline performs the inverse of the target pre-processing. It is noted that the same frame can be present in 64 time-windows due to the windowing. This means that the network can give 64 different predictions for the CoM target of the same frame. So as a last step of the post-processing, process 1400 can choose to collect all the estimations into a final answer. Process 1400 can use the mean value for a set of position predictions.
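The final averaging step can be sketched as follows, assuming the per-window predictions have already been transformed back into a common frame by the inverse of the target pre-processing. The function name `average_overlaps` and the window indexing convention are assumptions for the example.

```python
import numpy as np

def average_overlaps(window_preds, stride=1):
    """Average per-frame predictions from overlapping windows.

    window_preds: (W, T, D) predictions, with window w covering frames
                  [w * stride, w * stride + T)
    Returns (frames, D): each frame's mean over all windows that contain it.
    """
    w, t, d = window_preds.shape
    n_frames = (w - 1) * stride + t
    total = np.zeros((n_frames, d))
    count = np.zeros((n_frames, 1))
    for i in range(w):
        s = i * stride
        total[s:s + t] += window_preds[i]   # sum of predictions per frame
        count[s:s + t] += 1                 # number of windows covering each frame
    return total / count

# Three windows of four frames; window i predicts the constant value i.
preds = np.stack([np.full((4, 1), float(i)) for i in range(3)])
mean_pred = average_overlaps(preds)
```

With a 64-frame window and a stride of one, interior frames are covered by 64 windows, while frames near the start and end of a sequence are covered by fewer, which the per-frame count accounts for.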
[0067] It is noted that process 1400 can be used for estimating global placement. Adding IMU data to the training can dramatically improve the performance for this type of data.
Additional Example Computer Architecture and Systems
[0068]
[0069]
Conclusion
[0070] Although the present embodiments have been described with reference to specific example embodiments, various modifications and changes can be made to these embodiments without departing from the broader spirit and scope of the various embodiments.
[0071] In addition, it will be appreciated that the various operations, processes, and methods disclosed herein can be performed in any order (e.g., including using means for achieving the various operations). Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.