System and Method for Group Activity Recognition in Images and Videos with Self-Attention Mechanisms
20220383639 · 2022-12-01
Assignee
Inventors
- Mehrsan JAVAN ROSHTKHARI (Beaconsfield, CA)
- Kirill GAVRILYUK (Amsterdam, NL)
- Ryan Hartley SANFORD (Montreal, CA)
CPC Classification
- G06F18/2414
- G06V10/774
- G06V20/52
- G06V40/23
International Classification
- G06V20/52
- G06V10/774
Abstract
A system and method are described for automatically analyzing and understanding individual and group activities and interactions. The method includes receiving at least one image from a video of a scene showing one or more individual objects or humans at a given time; applying at least one machine learning or artificial intelligence technique to automatically learn a spatial, temporal or a spatio-temporal informative representation of the image and video content for activity recognition; and identifying and analyzing individual and group activities in the scene.
Claims
1. A method for processing visual data for individual and group activities and interactions, the method comprising: receiving at least one image from a video of a scene showing one or more entities at a corresponding time; using a training set comprising at least one labeled individual or group activity; and applying at least one machine learning or artificial intelligence technique to learn from the training set to represent spatial, temporal or spatio-temporal content of the visual data and numerically model the visual data by assigning numerical representations.
2. The method of claim 1, further comprising: applying learnt machine learning and artificial intelligence models to the visual data; identifying individual and group activities by analyzing the numerical representation assigned to the spatial, temporal, or spatio-temporal content of the visual data; and outputting at least one label to categorize an individual or a group activity in the visual data.
3. The method of claim 1, further comprising using both temporally static and temporally dynamic representations of the visual data.
4. The method of claim 3, further comprising using at least one spatial attribute of the entities for representing temporally static or dynamic information of the visual data.
5. The method of claim 4, wherein the spatial attribute of a human entity comprises body pose information from a single image as a static representation, or body pose information from a plurality of image frames in a video as a dynamic representation.
6. The method of claim 3, further comprising generating a numerical representative feature vector in a high dimensional space for a static and dynamic modality.
7. The method of claim 1, wherein the spatial content corresponds to a position of the entities in the scene at a given time with respect to a predefined coordinate system.
8. The method of claim 1, wherein the activities are human actions, human-human interactions, human-object interactions, or object-object interactions.
9. The method of claim 8, wherein the visual data corresponds to a sport event, humans correspond to sport players and sport officials, objects correspond to balls or pucks used in the sport, and the activities and interactions are players' actions during the sport event.
10. The method of claim 9, wherein the data collected from the sport event is used for sport analytics applications.
11. The method of claim 1, further comprising identifying and localizing a key actor in a group activity, wherein a key actor corresponds to an entity carrying out a main action characterizing the group activity that has been identified.
12. The method of claim 1, further comprising localizing the individual and group activities in space and time in a plurality of images.
13. A non-transitory computer readable medium storing computer executable instructions for processing visual data for individual and group activities and interactions, comprising instructions for: receiving at least one image from a video of a scene showing one or more entities at a corresponding time; using a training set comprising at least one labeled individual or group activity; and applying at least one machine learning or artificial intelligence technique to learn from the training set to represent spatial, temporal or spatio-temporal content of the visual data and numerically model the visual data by assigning numerical representations.
14. A device configured to process visual data for individual and group activities and interactions, the device comprising a processor and memory, the memory storing computer executable instructions that, when executed by the processor, cause the device to: receive at least one image from a video of a scene showing one or more entities at a corresponding time; use a training set comprising at least one labeled individual or group activity; and apply at least one machine learning or artificial intelligence technique to learn from the training set to represent spatial, temporal or spatio-temporal content of the visual data and numerically model the visual data by assigning numerical representations.
15. The device of claim 14, further comprising computer executable instructions to: apply learnt machine learning and artificial intelligence models to the visual data; identify individual and group activities by analyzing the numerical representation assigned to the spatial, temporal, or spatio-temporal content of the visual data; and output at least one label to categorize an individual or a group activity in the visual data.
16. The device of claim 14, wherein the computer executable instructions further cause the device to use both temporally static and temporally dynamic representations of the visual data.
17. The device of claim 16, wherein the computer executable instructions further cause the device to use at least one spatial attribute of the entities for representing temporally static or dynamic information of the visual data.
18. The device of claim 17, wherein the spatial attribute of a human entity comprises body pose information from a single image as a static representation, or body pose information from a plurality of image frames in a video as a dynamic representation.
19. The device of claim 14, further comprising instructions to identify and localize a key actor in a group activity, wherein a key actor corresponds to an entity carrying out a main action characterizing the group activity that has been identified.
20. The device of claim 14, further comprising instructions to localize the individual and group activities in space and time in a plurality of images.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0016] Embodiments will now be described with reference to the appended drawings.
DETAILED DESCRIPTION
[0030] An exemplary embodiment of the presently described system takes a visual input, such as an image or video of a scene with multiple entities including individuals and objects, and detects, recognizes, identifies, categorizes, labels, analyzes and understands the individual actions, the group activities, and the key individual or entity, referred to as the "key actor", that either performs the most important action in the group or carries out a main action characterizing the group activity. The individual actions and group activities include human actions, human-human interactions, human-object interactions, or object-object interactions.
[0031] In the exemplary embodiment, a set of labeled videos or images containing at least one image or video of at least one individual or group activity is used as the "training set" to train machine learning algorithms. Given the training set, the machine learning algorithms learn to process the visual data for individual and group activities and interactions by generating a numerical representation of the spatial, temporal or spatio-temporal content of the visual data. The numerical representations, sometimes referred to as "visual features" or "features", either explicitly represent the labels and categories for the individual and group activities, or implicitly represent them for use in further processing. After training, the learnt models process an input image or video to generate the numerical representation of the visual content.
[0039] In this exemplary embodiment, the feature vectors representing the appearance and the skeletal structure of a person are obtained by passing images through artificial neural networks. However, any suitable method can be used to extract intermediate features representing the images. Therefore, while examples are provided using artificial neural networks, the principles described herein should not be limited thereto.
Actor Feature Extractor
[0040] All human actions involve the motion of body joints, such as hands and legs. This applies not only to fine-grained actions performed in sports activities, e.g., the spike and set in a volleyball game, but also to everyday actions such as walking and talking. It is therefore important to capture not only the position of joints but their temporal dynamics as well. For this purpose, one can use both the position and motion of individual body joints and of the actors themselves.
[0041] To obtain joint positions, a pose estimation model can be applied. This model receives as input a bounding box around the actor and predicts the locations of key joints. This embodiment does not rely on a particular choice of pose estimation model; for example, a state-of-the-art body pose estimator such as HRNet can be used (see reference [15]). One can use the features from the last layer of the pose estimation neural network, right before the final classification layer. To extract the temporal dynamics of each actor and model the motion data from the video frames, state-of-the-art 3D CNNs such as I3D models can be used. The dynamic feature extraction models can be applied to the sequence of detected body joints across the video, the raw video pixel data, or the optical flow video. The dynamic features are extracted from stacked frames F_t, t = 1, ..., T. The RGB pixel data and optical flow representations are considered here, but those skilled in computer vision will appreciate that dynamic features can be extracted from multiple different sources using different techniques. The dynamic feature extractors can be applied either to the whole video frame or only to the spatio-temporal region in which an actor or entity of interest is present.
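By way of illustration only, the following minimal sketch (assuming PyTorch, with simplified stand-in networks in place of HRNet and I3D, whose real architectures differ considerably) shows static pose features being taken from the layer before the classifier and dynamic features from a 3D CNN applied to the stacked frames F_t; the dimensions are illustrative, not values mandated by this embodiment.

```python
# Sketch of actor feature extraction (PyTorch assumed).
# PoseBackbone and Dynamic3DCNN are simplified stand-ins for networks
# such as HRNet and I3D; they only illustrate the data flow.
import torch
import torch.nn as nn

class PoseBackbone(nn.Module):
    """Stand-in pose network; features are taken before the final classifier."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.classifier = nn.Linear(feat_dim, 17)  # e.g., 17 joints (unused here)

    def forward(self, crop):                # crop: (N, 3, H, W) actor crops
        return self.features(crop)          # (N, feat_dim), pre-classifier

class Dynamic3DCNN(nn.Module):
    """Stand-in 3D CNN over stacked frames F_t, t = 1..T (RGB or optical flow)."""
    def __init__(self, in_ch=3, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )

    def forward(self, clip):                # clip: (N, C, T, H, W)
        return self.net(clip)

pose_net, rgb_net = PoseBackbone(), Dynamic3DCNN()
crops = torch.randn(12, 3, 128, 64)         # 12 actor crops from one frame
clips = torch.randn(12, 3, 10, 128, 64)     # T = 10 frames per actor
static_feats = pose_net(crops)              # static modality ("Pose")
dynamic_feats = rgb_net(clips)              # dynamic modality ("RGB" or "Flow")
```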
Self-Attention Mechanism
[0042] Transformer networks can learn and select important information for a specific task. A transformer network includes two main parts, an encoder and a decoder. The encoder receives an input sequence of words (the source), which is processed by a stack of identical layers, each including a multi-head self-attention layer and a fully connected feed-forward network. A decoder then generates an output sequence (the target) from the representation generated by the encoder. The decoder is built in a similar way to the encoder, having access to the encoded sequence. The self-attention mechanism is the vital component of the transformer network and can also be successfully used to reason about actors' relations and interactions.
[0043] Attention A is a function that represents a weighted sum of the values V. The weights are computed by matching a query Q against the set of keys K. The matching function can take different forms, the most popular being the scaled dot-product. Formally, attention with the scaled dot-product matching function can be written as:

A(Q, K, V) = softmax(QK^T / √d)V

[0044] where d is the dimension of both the queries and the keys. In the self-attention module, all three representations (Q, K, V) are computed from the input sequence S via linear projections.
[0045] Since attention is a weighted sum of all values, it overcomes the problem of forgetfulness over time. This mechanism gives more importance to the most relevant observations, which is a desirable property for group activity recognition because the system can enhance the information in each actor's features based on the other actors in the scene, without any spatial constraints. Multi-head attention A_h is an extension of attention with several parallel attention functions using separate linear projections h_i of (Q, K, V):

h_i = A(QW_i^Q, KW_i^K, VW_i^V)

A_h(Q, K, V) = concat(h_1, ..., h_m)W
[0046] A transformer encoder layer E includes multi-head attention combined with a feed-forward neural network L:

L(X) = Linear(Dropout(ReLU(Linear(X))))

E'(S) = LayerNorm(S + Dropout(A_h(S)))

E(S) = LayerNorm(E'(S) + Dropout(L(E'(S))))
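As an illustration of these formulas, the following is a minimal sketch of one encoder layer E, assuming PyTorch; the feature size, head count, hidden width and dropout rate are illustrative choices, not values specified by this embodiment.

```python
# Sketch of the scaled dot-product attention, multi-head attention A_h,
# and encoder layer E defined above (PyTorch assumed).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def attention(Q, K, V):
    """A(Q, K, V) = softmax(QK^T / sqrt(d))V."""
    d = Q.size(-1)
    weights = F.softmax(Q @ K.transpose(-2, -1) / math.sqrt(d), dim=-1)
    return weights @ V

class MultiHeadSelfAttention(nn.Module):
    """A_h(Q, K, V) = concat(h_1, ..., h_m)W, with Q, K, V projected from S."""
    def __init__(self, dim, heads):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.d = heads, dim // heads
        self.proj_q = nn.Linear(dim, dim)
        self.proj_k = nn.Linear(dim, dim)
        self.proj_v = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)      # the final projection W

    def forward(self, S):                   # S: (batch, N actors, dim)
        B, N, _ = S.shape
        split = lambda x: x.view(B, N, self.heads, self.d).transpose(1, 2)
        h = attention(split(self.proj_q(S)),
                      split(self.proj_k(S)),
                      split(self.proj_v(S)))
        h = h.transpose(1, 2).reshape(B, N, -1)   # concat(h_1, ..., h_m)
        return self.out(h)

class EncoderLayer(nn.Module):
    """E(S) = LayerNorm(E'(S) + Dropout(L(E'(S)))), with E' as defined above."""
    def __init__(self, dim=256, heads=8, hidden=512, p=0.1):
        super().__init__()
        self.attn = MultiHeadSelfAttention(dim, heads)
        self.ff = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                nn.Dropout(p), nn.Linear(hidden, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.drop = nn.Dropout(p)

    def forward(self, S):
        S = self.norm1(S + self.drop(self.attn(S)))    # E'(S)
        return self.norm2(S + self.drop(self.ff(S)))   # E(S)

layer = EncoderLayer()
refined = layer(torch.randn(2, 12, 256))  # 2 scenes, 12 actors each
```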
[0047] The transformer encoder can contain several such layers, which sequentially process an input S.
[0048] Here, S is a set of actors' features S = {s_i | i = 1, ..., N} obtained by the actor feature extractors and represented by numerical values. As the features s_i do not follow any particular order, the self-attention mechanism 18 is a more suitable model than RNNs and CNNs for refinement and aggregation of these features. An alternative approach would be to incorporate a graph representation. However, a graph representation requires explicit modeling of connections between nodes through appearance and position relations. The transformer encoder mitigates this requirement, relying solely on the self-attention mechanism 18. The transformer encoder also implicitly models spatial relations between actors via positional encoding of s_i. This can be done by representing each bounding box b_i of the respective actor's features s_i by its center point (x_i, y_i) and encoding that center point.
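By way of example only, one plausible realization of this center-point positional encoding is sketched below, assuming PyTorch and a sinusoidal encoding; the embodiment specifies only that the center point (x_i, y_i) is encoded, not the particular encoding function.

```python
import torch

def encode_positions(feats, boxes, dim):
    """Add a positional encoding of each actor's box center (x_i, y_i).

    feats: (N, dim) actor features s_i; boxes: (N, 4) as (x1, y1, x2, y2).
    The sinusoidal form is one common choice; dim must be divisible by 4.
    """
    centers = torch.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                           (boxes[:, 1] + boxes[:, 3]) / 2], dim=1)   # (N, 2)
    freqs = torch.arange(dim // 4, dtype=torch.float32)
    freqs = 1.0 / (10000 ** (freqs / (dim // 4)))
    args = centers.unsqueeze(-1) * freqs                  # (N, 2, dim // 4)
    pe = torch.cat([torch.sin(args), torch.cos(args)], dim=-1)
    return feats + pe.reshape(len(feats), -1)             # inject position into s_i
```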
[0049] It is apparent that using information from different modalities, i.e., static, dynamic, spatial attribute, RGB pixel value, and optical flow modalities, improves the performance of activity recognition methods. In this embodiment, several modalities are incorporated for individual and group activity detection, referred to as the static and dynamic modalities. The static modality is represented by the pose model, which captures the static position of body joints or spatial attributes of the entities, while the dynamic modality is represented by applying a temporal machine learning video processing technique, such as I3D, to a sequence of images in the video and is responsible for the temporal features of each actor in the scene. As RGB pixel values and optical flow can capture different aspects of motion, both are used in this embodiment. To fuse the static and dynamic modalities, two fusion strategies can be used: early fusion of the actors' features before the transformer network, and late fusion, which aggregates the labels assigned to the actions after classification/categorization. Early fusion enables access to both static and dynamic features before inference of the group activity. Late fusion processes static and dynamic features separately for group activity recognition and can concentrate on static or dynamic features individually.
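The two strategies can be sketched as follows (a hedged illustration assuming PyTorch; `encoder` and `classifier` stand for the transformer encoder and a classification head, and the weighted score averaging shown for late fusion is one plausible aggregation, not the only one contemplated):

```python
import torch

def early_fusion(static_feats, dynamic_feats, encoder, classifier):
    """Early fusion: concatenate the modalities' actor features before the
    transformer encoder, so group-activity inference sees both at once.

    static_feats, dynamic_feats: (B, N, dim) actor features per modality;
    `encoder` must accept the concatenated (B, N, 2 * dim) input.
    """
    fused = torch.cat([static_feats, dynamic_feats], dim=-1)
    return classifier(encoder(fused))

def late_fusion(static_scores, dynamic_scores, w_static=0.5, w_dynamic=0.5):
    """Late fusion: each modality is classified separately and the resulting
    scores are aggregated afterwards."""
    return w_static * static_scores + w_dynamic * dynamic_scores
```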
Training Objective
[0051] The parameters of all the components, i.e., the static and dynamic models, the self-attention mechanism 18 and the fusion mechanism, can be estimated either separately or jointly using standard machine learning techniques, such as the gradient-based learning methods commonly used for artificial neural networks. In one ideal setting, the parameters of all of those components can be estimated using a standard classification loss function, learnt from a set of available labelled examples. When learning the parameters of those components separately, each one can be estimated on its own and the learnt models then combined together. To estimate all parameters together, the neural network models can be trained in an end-to-end fashion to simultaneously predict the individual actions of each actor and the group activity. For both tasks one can use a standard loss function, such as the cross-entropy loss, and combine the two losses in a weighted sum:
ℒ = λ_g ℒ_g(y_g, ỹ_g) + λ_a ℒ_a(y_a, ỹ_a)
[0052] where ℒ_g and ℒ_a are the cross-entropy losses, y_g and y_a are the ground truth labels, ỹ_g and ỹ_a are the predictions for the group activity and the individual actions, respectively, and λ_g and λ_a are scalar weights for the two losses.
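For illustration, a minimal sketch of this weighted objective, assuming PyTorch and logits/labels in the usual cross-entropy format:

```python
import torch.nn.functional as F

def combined_loss(group_logits, action_logits, y_group, y_actions,
                  lambda_g=1.0, lambda_a=1.0):
    """L = lambda_g * L_g(y_g, y~_g) + lambda_a * L_a(y_a, y~_a)."""
    loss_g = F.cross_entropy(group_logits, y_group)      # group activity
    loss_a = F.cross_entropy(action_logits, y_actions)   # individual actions
    return lambda_g * loss_g + lambda_a * loss_a
```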
Experimental Evaluation
[0053] Experiments were carried out on publicly available group activity datasets, namely the volleyball dataset (see reference [3]) and the collective dataset (see reference [16]). The results were compared to the state-of-the-art.
[0054] For simplicity, in the next several paragraphs the static modality is called "Pose", the dynamic modality that uses raw pixel data from video frames is called "RGB", and the dynamic modality that uses optical flow frames is called "Flow".
[0055] The volleyball dataset includes clips from 55 videos of volleyball games, which are split into two sets: 39 training videos and 16 testing videos. There are 4830 clips in total: 3493 training clips and 1337 clips for testing. Each clip is 41 frames in length. The available annotation includes the group activity label, the individual players' bounding boxes and their respective actions, which are provided only for the middle frame of each clip. This dataset is extended with ground truth bounding boxes for the rest of the frames in the clips, which are also used in the experimental evaluation. The list of group activity labels contains four main activities (set, spike, pass, win point), which are divided into two subgroups, left and right, giving eight group activity labels in total. Each player can perform one of nine individual actions: blocking, digging, falling, jumping, moving, setting, spiking, standing and waiting.
[0056] The collective dataset includes 44 clips of varying length, ranging from 193 to around 1800 frames per clip. Every 10th frame has annotations of persons' bounding boxes with one of five individual actions: crossing, waiting, queueing, walking and talking. The group activity is determined by the action that most people perform in the clip.
[0057] For the experimental evaluation, T = 10 frames are used as the input: the frame labeled with the individual actions and group activity as the middle frame, the 5 frames before it, and the 4 frames after it. During training, one frame F_t is randomly sampled from the T input frames for the pose modality to extract the relevant body pose features. The group activity recognition accuracy is used as the evaluation metric.
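The frame selection just described amounts to the following small illustrative sketch (`labeled_idx` is a hypothetical name for the index of the annotated middle frame):

```python
import random

def clip_window(labeled_idx, T=10):
    """Return the T frame indices used per clip: for T = 10, the 5 frames
    before the labeled (middle) frame, the labeled frame, and the 4 after."""
    return list(range(labeled_idx - T // 2, labeled_idx + T - T // 2))

frames = clip_window(20)             # [15, 16, ..., 24]; frame 20 is the middle
pose_frame = random.choice(frames)   # one frame sampled for the pose modality
```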
[0058] Using the static modality, human body pose, without the dynamic modality results in an average accuracy of 91% for group activity recognition on the volleyball dataset. Including the relative positions of all the people in the scene, referred to as "positional encoding", increases the accuracy to 92.3%. Therefore, explicitly adding information about the actors' positions helps the transformer better reason about this part of the group activity. The results of using the static and dynamic modalities separately, without any information fusion, on the volleyball dataset are shown in the appended drawings.
[0059] The results of combining the dynamic and static modalities are presented in the appended drawings.
[0060] A comparison with the state-of-the-art on the volleyball dataset is shown in the appended drawings.
[0061] The static and dynamic modalities representing individual and group activities are used together to automatically learn the spatio-temporal context of the scene for group activities using a self-attention mechanism. In this particular embodiment, the human body pose is used as the static modality. However, any feature extraction technique can be applied to the images to extract other sorts of static representations instead of body pose. In addition, the static features extracted from images can be stacked together to be used as the dynamic modality, and the same can be applied to the dynamic modality to generate static features. Another key component is the self-attention mechanism 18, which dynamically selects the most relevant representative features for activity recognition from each modality. This exemplary embodiment discloses the use of human pose information from a single image as one of the inputs to the method; however, various modifications to make use of a sequence of images instead of one image will be apparent to those skilled in the art. Likewise, a multitude of different feature extractors and optimization loss functions can be used instead of the exemplary ones in the current embodiment. Although the examples use videos as the input to the model, a single image can be used instead, and rather than using both static and dynamic modalities, the static modality alone can be used. In this case, the body pose and the features extracted from the raw image pixels are both considered static modalities.
[0062] The exemplary methods described herein are used to categorize the visual input and assign appropriate labels to the individual actions and group activities. However, similar techniques can detect those activities in a video sequence, meaning that the time at which the activities are happening in a video can also be identified, as well as the spatial region in the video where the activities are happening. A sample method, which will be apparent to those skilled in the art, is to use a moving window over multiple video frames in time to detect and localize those activities, e.g., as sketched below.
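A hypothetical sliding-window sketch might look as follows, assuming PyTorch and a `model` that returns per-window group-activity probabilities; the stride and threshold are illustrative assumptions, not values specified by the embodiment:

```python
def localize_activities(video_frames, model, T=10, stride=5, threshold=0.8):
    """Slide a T-frame window over the video in time; report windows whose
    top activity score (a PyTorch tensor returned by `model`) exceeds the
    threshold, as (start_frame, end_frame, label, score) tuples."""
    detections = []
    for start in range(0, len(video_frames) - T + 1, stride):
        window = video_frames[start:start + T]
        probs = model(window)                 # (num_activities,) probabilities
        score, label = probs.max(dim=0)
        if score.item() >= threshold:
            detections.append((start, start + T, int(label), float(score)))
    return detections
```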
Analysis
[0063] To better understand the performance of the exemplary model, one can consider the confusion matrices for group activity recognition on the volleyball dataset presented in the appended drawings.
[0065] For simplicity and clarity of illustration, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements. In addition, numerous specific details are set forth in order to provide a thorough understanding of the examples described herein. However, it will be understood by those of ordinary skill in the art that the examples described herein may be practiced without these specific details. In other instances, well-known methods, procedures and components have not been described in detail so as not to obscure the examples described herein. Also, the description is not to be considered as limiting the scope of the examples described herein.
[0066] It will be appreciated that the examples and corresponding diagrams used herein are for illustrative purposes only. Different configurations and terminology can be used without departing from the principles expressed herein. For instance, components and modules can be added, deleted, modified, or arranged with differing connections without departing from these principles.
[0067] It will also be appreciated that any module or component exemplified herein that executes instructions may include or otherwise have access to computer readable media such as storage media, computer storage media, or data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape. Computer storage media may include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data. Examples of computer storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by an application, module, or both. Any such computer storage media may be part of the system 10, 20, 25, any component of or related to the system 10, 20, 25, etc., or accessible or connectable thereto. Any application or module herein described may be implemented using computer readable/executable instructions that may be stored or otherwise held by such computer readable media.
[0068] The steps or operations in the flow charts and diagrams described herein are just for example. There may be many variations to these steps or operations without departing from the principles discussed above. For instance, the steps may be performed in a differing order, or steps may be added, deleted, or modified.
[0069] Although the above principles have been described with reference to certain specific examples, various modifications thereof will be apparent to those skilled in the art as outlined in the appended claims.
REFERENCES
[0070] The following references are referred to above:
1. Mengshi Qi, Jie Qin, Annan Li, Yunhong Wang, Jiebo Luo, and Luc Van Gool. stagNet: An attentive semantic RNN for group activity recognition. In ECCV, 2018.
2. Jianchao Wu, Limin Wang, Li Wang, Jie Guo, and Gangshan Wu. Learning actor relation graphs for group activity recognition. In CVPR, 2019.
3. Mostafa S. Ibrahim, Srikanth Muralidharan, Zhiwei Deng, Arash Vahdat, and Greg Mori. A hierarchical deep temporal model for group activity recognition. In CVPR, 2016.
4. Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
5. Zhenyang Li, Kirill Gavrilyuk, Efstratios Gavves, Mihir Jain, and Cees G. M. Snoek. VideoLSTM convolves, attends and flows for action recognition. Computer Vision and Image Understanding, 166:41-50, 2018.
6. João Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
7. Rohit Girdhar and Deva Ramanan. Attentional pooling for action recognition. In NIPS, 2017.
8. Xiaolong Wang and Abhinav Gupta. Videos as space-time region graphs. In ECCV, 2018.
9. Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, and Michael J. Black. Towards understanding action recognition. In ICCV, 2013.
10. Yong Du, Wei Wang, and Liang Wang. Hierarchical recurrent neural network for skeleton based action recognition. In CVPR, 2015.
11. Guilhem Chéron, Ivan Laptev, and Cordelia Schmid. P-CNN: Pose-based CNN features for action recognition. In ICCV, 2015.
12. Wenbin Du, Yali Wang, and Yu Qiao. RPAN: An end-to-end recurrent pose-attention network for action recognition in videos. In ICCV, 2017.
13. Tian Lan, Yang Wang, Weilong Yang, Stephen N. Robinovitch, and Greg Mori. Discriminative latent models for recognizing contextual group activities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34:1549-1562, 2012.
14. Zhiwei Deng, Arash Vahdat, Hexiang Hu, and Greg Mori. Structure inference machines: Recurrent neural networks for analyzing relations in group activity recognition. In CVPR, 2016.
15. Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In CVPR, 2019.
16. Wongun Choi, Khuram Shahid, and Silvio Savarese. What are they doing?: Collective activity classification using spatio-temporal relationship among people. In ICCV Workshops, 2009.