Method for video recognition capable of encoding spatial and temporal relationships of concepts using contextual features
11416774 · 2022-08-16
Assignee
- SAMSUNG ELECTRONICA DA AMAZONIA LTDA. (Campinas, BR)
- UNIVERSIDADE FEDERAL DE MINAS GERAIS-UFMG (Belo Horizonte, BR)
Inventors
- Jesimon Barreto Santos (Minas Gerais, BR)
- Victor Hugo Cunha de Melo (Minas Gerais, BR)
- William Robson Schwartz (Minas Gerais, BR)
- Otávio Augusto Bizetto Penatti (São Paulo, BR)
CPC classification
- G06V20/41 (PHYSICS)
- G06V30/248 (PHYSICS)
- G06F16/7867 (PHYSICS)
Abstract
The proposed invention aims at encoding contextual information for video analysis and understanding, by encoding spatial and temporal relationships of objects and the main agent in a scene. The main target application of the invention is human activity recognition. The encoding of such spatial and temporal relationships may be crucial to distinguish different categories of human activities and may be important to help in the discrimination of different video categories, aiming at video classification, retrieval, categorization and other video analysis applications.
Claims
1. A method for video recognition using contextual features capable of encoding spatial and temporal relationships of concepts, the method comprising performing, by at least one processor, operations including: acquiring input video data from a video; processing the input video data to detect concepts in the video; computing contextual features from the detected concepts, wherein the computing contextual features includes: computing, by the Egocentric Pyramid, spatial relationships of detected concepts in relation to a main agent of the video as concept-agent pairings; computing pairings between concepts as concept-concept pairings; and making use of the computed pairings to determine temporal relationships of the concepts, using the Temporal Egocentric Relational Network, to generate prediction scores for the concepts; and outputting the generated prediction scores from the Temporal Egocentric Relational Network.
2. The method according to claim 1, wherein the acquiring input video data comprises splitting the video into t video segments of equal size T and then, from each video segment, sampling a random snippet S.sub.i with length |S.sub.i| such that |S.sub.i|≤T.
3. The method according to claim 1, wherein the computing contextual features from the detected concepts includes attributing scores to captured context to determine the concepts and agents in the video.
4. The method according to claim 3, wherein the Egocentric Pyramid considers as the main agent in the video to be the concept with the highest attributed score obtained by the detected concepts.
5. The method according to claim 1, wherein when more than one agent is in the video, a number of agents is a same number of Egocentric Pyramids, where each Egocentric Pyramid is considered a separate concept as the agent in the video.
6. The method according to claim 1, wherein the Temporal Egocentric Relational Network determines the temporal relationships from both egocentric pairings and concept pairings.
7. The method according to claim 1, wherein the Temporal Egocentric Relational Network determines the temporal relationships from egocentric pairings.
8. The method according to claim 1, wherein the Temporal Egocentric Relational Network determines the temporal relationships from concept pairings.
9. The method according to claim 1, wherein the Temporal Egocentric Relational Network uses the computed pairings to determine features and a classifier in a unified way.
10. The method according to claim 1, wherein the Temporal Egocentric Relational Network is configured to determine concept information over time.
11. The method according to claim 1, wherein the Temporal Egocentric Relational Network is defined as:
TERN(S)=(R.sub.Φ(S.sub.1), R.sub.Φ(S.sub.2), . . . , R.sub.Φ(S.sub.t)), where S.sub.t is the video, R.sub.Φ is a relational network with parameters Φ, the operator applied to the outputs of R.sub.Φ is a pooling operation, and the relational network R.sub.Φ, given parameters Φ=[ϕ.sub.1,ϕ.sub.2], is defined as R.sub.Φ(O)=ƒ.sub.ϕ.sub. … ∈ℝ.sup.ƒ; and functions ƒ.sub.ϕ.sub. …
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The objectives and advantages of the current invention will become clearer through the following detailed description of the example and non-limitative drawings presented at the end of this document.
DETAILED DESCRIPTION
Features of Invention
(9) The proposed invention discloses an approach for human activity recognition in videos that encodes contextual information through the spatial and temporal relationships between objects and the main agent in a scene. Spatial relationships are encoded using a temporal egocentric relational network, object occurrences, and egocentric pyramids, the last of which is a technique proposed to encode the spatial arrangement of objects around the main agent in the scene. Temporal relationships are encoded by combining relational networks and temporal segment networks. All the encoding steps are differentiable, allowing an end-to-end learning process, which makes it possible to obtain higher recognition rates and to deliver a better classifier for the target product. The proposed approach, although validated for human activity recognition, can be used for other tasks related to video analysis and understanding.
(10) Based on the description of the existing methods and on the description of the proposed invention, the following advantages for the invention can be enumerated:
(11) Encoding contextual information, i.e., both spatial and temporal relationships of objects regarding the main agent in the scene, improves the accuracy for human activity recognition systems based on video data;
(12) Such encoding of contextual information may benefit other video analysis and video understanding applications;
(13) More precise video understanding systems enable better knowledge extraction from video datasets, including better video categorization, better video search, video auto-tagging, video summarization, among other applications;
(14) Better video understanding systems require less human/manual annotation of video content;
(15) Embedding the invention in mobile devices enables better Artificial Intelligence (AI) applications for users;
(16) Better video categorization and video search in mobile applications (e.g., Gallery).
(17) The main goal of the proposed invention is the recognition of human activities based on videos. However, the invention can also be used for any other video analysis application. Input videos are recorded by cameras and may be available from different sources, such as YouTube, surveillance cameras, smart phones, etc. The recognition algorithms can understand the activities performed on video, like horse racing, kayaking, applying lipstick, walking with dog, playing cello, and others.
(18) As shown in the drawings, the method of the proposed invention comprises the following steps:
(21) a. acquiring input video data (201B);
(22) b. processing the input video data in order to detect concepts in the video (202B);
(23) c. computing contextual features from the detected concepts, further comprising the following sub-steps:
(24) i. computing, by the Egocentric Pyramid, spatial relationships of detected concepts in relation to the main agent of the scene (203B) (concept-agent pairings);
(25) ii. computing pairings between concepts (204B) (concept-concept pairings);
(26) iii. making use of concept pairings and egocentric pairings to learn their temporal relationships, by the Temporal Egocentric Relational Network, to generate prediction scores for the concepts (205B);
(27) d. outputting the prediction scores given by the Temporal Egocentric Relational Network (206B).
(28) The purpose of the invention is to recognize human activities based on video, which is the input video data (201A) of the system. The input video data (201A) is processed in order to detect concepts (202B). Concepts can be objects, people, object parts, etc. The concepts are then passed to the module that generates the contextual features, which is divided into two sub-modules. The egocentric pyramid (203A) obtains information regarding spatial relationships between objects and the main agent in the scene, and the concept pairings module (204A) obtains spatial relationships between pairs of objects. The spatial relationships are used as input for a Temporal Egocentric Relational Network (TERN) (205A), which learns not only the best object pairings but also their temporal relationships. The output (206A) of the method is the set of predictions, expressed as human activities in the case of human activity recognition, or as the categories of any other video classification task.
(29) The input video data (201A) can be obtained from, including but not limited to, video cameras, smart phones, wearable cameras, surveillance cameras, websites such as YouTube, and others. The input video data (201A) is initially split into t segments of equal size T. From each segment, a random snippet S.sub.i is sampled with length |S.sub.i| such that |S.sub.i|≤T. The video snippets can be used as input for the concept detection module (202A).
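The segment-and-snippet sampling described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `sample_snippets`, the use of frame indices, and the choice to ignore trailing frames that do not fill a whole segment are assumptions made for the example.

```python
import random

def sample_snippets(num_frames, t, snippet_len, rng=None):
    """Split frame indices [0, num_frames) into t segments of equal size T
    and draw one random snippet of snippet_len <= T frames from each."""
    rng = rng or random.Random()
    # Segment size T; frames beyond t * T are ignored (assumption).
    T = num_frames // t
    if T == 0 or not 1 <= snippet_len <= T:
        raise ValueError("need 1 <= snippet_len <= T")
    snippets = []
    for i in range(t):
        seg_start = i * T
        # Random offset that keeps the whole snippet inside segment i.
        offset = rng.randint(0, T - snippet_len)
        start = seg_start + offset
        snippets.append(list(range(start, start + snippet_len)))
    return snippets

# Example: a 90-frame video, t = 3 segments (T = 30), snippets of 5 frames.
snippets = sample_snippets(90, 3, 5, random.Random(0))
```

Each returned snippet lies entirely inside its own segment, so the beginning, middle, and end of the video are all represented.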
(30) The detection of concepts, which can be objects, people, object parts and others, can be based on object detectors including, but not limited to, YOLO (“YOLO9000: Better, Faster, Stronger”, Redmon and Farhadi, CVPR, 2017), SSD (“SSD: Single shot multibox detector”, Liu et al., ECCV, 2016), Faster-RCNN (“Faster R-CNN: Towards real-time object detection with region proposal networks”, Ren et al., NIPS, 2015), etc. The concept detection module (202A) outputs all the concepts detected in the input video data (201A).
(31) The egocentric pyramid (203A) is responsible for encoding spatial relationships between concepts and the main agent in the scene (concept-agent pairings).
(32) An advantage of the egocentric pyramid (203A) over common spatial pyramids is that its encoding of the elements surrounding a given agent is invariant to the agent's position in the frame. A common spatial pyramid takes the center of the frame as reference, which is a problem because it assumes that all activities are performed at the center of the video, and this is not necessarily true. For instance, if the ‘walking the dog’ activity is being targeted and the person escorting the dog starts at the upper-left corner of the frame and then moves to the bottom-right corner, the corresponding ‘dog’ bin will be assigned to the histograms corresponding to the second and fourth quadrants. That generates a different histogram signature from the one obtained for the same activity when the person with the dog starts at the bottom-left and moves to the bottom-right. The egocentric pyramid prevents this because it takes the agent position as reference instead of the frame's center, since the relevant elements move around the one performing the action.
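The position invariance described above can be illustrated with a small sketch. The quadrant numbering (0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right, in image coordinates with y pointing down), the use of concept centers, and the function names are assumptions for illustration only and do not come from the disclosure.

```python
from collections import Counter

def quadrant(point, origin):
    """Quadrant index (0..3) of point relative to origin, image coordinates."""
    dx, dy = point[0] - origin[0], point[1] - origin[1]
    right = dx >= 0
    below = dy >= 0
    return (2 if below else 0) + (1 if right else 0)

def quadrant_histogram(concepts, origin):
    """concepts: list of (label, (x, y)). Counts labels per quadrant around origin."""
    hist = Counter()
    for label, center in concepts:
        hist[(quadrant(center, origin), label)] += 1
    return hist

frame_center = (50, 50)
# Same arrangement (dog slightly right of and below the person), two frame positions.
person_a, dogs_a = (20, 20), [("dog", (30, 25))]
person_b, dogs_b = (70, 70), [("dog", (80, 75))]

# Frame-centered reference: the signature changes with the position in the frame.
spatial_a = quadrant_histogram(dogs_a, frame_center)
spatial_b = quadrant_histogram(dogs_b, frame_center)

# Agent-centered reference: the signature is the same in both cases.
ego_a = quadrant_histogram(dogs_a, person_a)
ego_b = quadrant_histogram(dogs_b, person_b)
```

With the frame center as reference the two histograms differ, while with the agent as reference they coincide.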
(33) In the egocentric pyramid (203A), when a concept lies on the boundary between quadrants or is split across multiple quadrants, there are several ways to update the corresponding quadrant histograms. One option is to update only the histogram of the quadrant that contains the largest part of the concept. Another option is to use the concept dimensions (determined by the bounding box computed by the concept detector) to update all the quadrant histograms, weighted by the portion of the concept that belongs to each quadrant.
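Both update options can be sketched as below, assuming axis-aligned bounding boxes given as (x1, y1, x2, y2), an agent position (ax, ay), and the quadrant order top-left, top-right, bottom-left, bottom-right. The function names and the histogram representation (one dictionary per quadrant) are hypothetical.

```python
def quadrant_weights(box, agent):
    """Fraction of the box area that falls in each of the 4 quadrants around the agent."""
    x1, y1, x2, y2 = box
    ax, ay = agent
    area = (x2 - x1) * (y2 - y1)
    if area <= 0:
        raise ValueError("box must have positive area")
    # Extent of the box on each side of the agent's vertical and horizontal axes.
    left = max(0.0, min(x2, ax) - x1)
    right = max(0.0, x2 - max(x1, ax))
    top = max(0.0, min(y2, ay) - y1)
    bottom = max(0.0, y2 - max(y1, ay))
    return [left * top / area, right * top / area,
            left * bottom / area, right * bottom / area]

def update_histograms(hists, label, box, agent, weighted=True):
    """hists: list of 4 dicts (one per quadrant), updated in place."""
    w = quadrant_weights(box, agent)
    if weighted:
        # Option 2: spread the vote over all quadrants by overlapping area.
        for q in range(4):
            if w[q] > 0:
                hists[q][label] = hists[q].get(label, 0.0) + w[q]
    else:
        # Option 1: vote only for the quadrant holding the largest part.
        q = max(range(4), key=lambda k: w[k])
        hists[q][label] = hists[q].get(label, 0.0) + 1.0

hists = [{} for _ in range(4)]
update_histograms(hists, "dog", (0, 0, 4, 2), (1, 1))
```

The weights always sum to 1 for a box of positive area, so the weighted option adds the same total mass per detection as the single-quadrant option.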
(34) The egocentric pyramid (203A) can also be used in case there is more than one prominent agent in the scene. This may happen when the scores of the concept detector are similar for more than one concept (e.g., three concepts with scores around 0.3). In this case, all the concepts with similar high scores are used as agents and a separate egocentric pyramid is computed using each concept as the main agent. All these egocentric pyramids can then be used as input for the Temporal Egocentric Relational Network (TERN) (205A).
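One possible way to decide which concepts act as agents is sketched below. The text does not specify how "similar" scores are determined, so the `tolerance` parameter and the function name `select_agents` are assumptions introduced only for this example.

```python
def select_agents(detections, tolerance=0.05):
    """detections: list of (label, score).
    Returns every detection whose score is within `tolerance` of the top score.
    The tolerance value is a hypothetical choice, not taken from the disclosure."""
    if not detections:
        return []
    top = max(score for _, score in detections)
    return [d for d in detections if top - d[1] <= tolerance]

# Three concepts with scores around 0.3: each becomes an agent,
# and one egocentric pyramid would be computed per agent.
agents = select_agents([("person", 0.31), ("dog", 0.30),
                        ("horse", 0.29), ("ball", 0.05)])

# A single clearly dominant concept yields a single agent.
single = select_agents([("person", 0.90), ("dog", 0.40)])
```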
(35) The concept pairings module (204A) obtains the spatial relationships of all pairs formed by one concept and another concept (concept-concept pairings).
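A minimal sketch of enumerating concept-concept pairings is given below. The text does not state how the spatial relationship of a pair is represented, so the relative offset between concept centers used here, the choice of unordered pairs, and the function name are illustrative assumptions.

```python
from itertools import combinations

def concept_pairings(concepts):
    """concepts: list of (label, (x, y)).
    Returns one (label_a, label_b, (dx, dy)) record per unordered pair,
    where (dx, dy) is the offset from the first concept to the second."""
    pairs = []
    for (la, (xa, ya)), (lb, (xb, yb)) in combinations(concepts, 2):
        pairs.append((la, lb, (xb - xa, yb - ya)))
    return pairs

pairs = concept_pairings([("person", (20, 20)),
                          ("dog", (30, 25)),
                          ("leash", (25, 22))])
```

For n detected concepts this produces n(n-1)/2 pairings.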
(36) The concept-agent pairings and concept-concept pairings, obtained respectively by the Egocentric Pyramid (203A) and Concept Pairings (204A) modules, can be used as input for a Temporal Egocentric Relational Network (TERN). That is, the contextual features can be computed by TERN considering only concept-agent information, only concept-concept information, or both types of pairings.
(37) As with any machine learning system, the method must first be trained. This learning phase can be based on a given video dataset, from which the system learns the parameters and generates a classification model. Training can happen separately from the system use, i.e., from the inference phase. For instance, the classifier can be trained on a computer/server and then the learned model can be used in a mobile device. It is also possible to have the two phases in the same location. In addition, it is possible to update or re-train the classifier at certain periods of time using new data, which can come from user datasets. The proposed invention has no restriction on where the training and inference phases occur.
(38) The Temporal Egocentric Relational Network (TERN) (205A) makes use of the pairings in order to learn the features and the classifier in a unified way. TERN is designed to reason over concept information over time, which means that TERN will learn the spatial and temporal relationships for the contextual features. Given a sequence of video snippets S={S.sub.1, S.sub.2, . . . , S.sub.t} comprising t snippets sampled uniformly or randomly, the Temporal Egocentric Relational Network is defined as
TERN(S)=(R.sub.Φ(S.sub.1),R.sub.Φ(S.sub.2), . . . ,R.sub.Φ(S.sub.t)),
(39) where S.sub.t is a video snippet, R.sub.Φ is a relational network with parameters Φ, and the operator applied to the per-snippet outputs of R.sub.Φ is a pooling operation. In particular, a relational network R.sub.Φ, given parameters Φ=[ϕ.sub.1, ϕ.sub.2], is defined as
(40)
(41) Here, O={o.sub.i}.sub.i=1.sup.n represents an input set of n detected concepts (e.g., objects), where o.sub.i is the i-th concept such that o.sub.i∈ℝ.sup.ƒ; and functions ƒ.sub.ϕ.sub.
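For reference, a relational network in the form commonly published in the literature (Santoro et al., “A simple neural network module for relational reasoning”, NIPS, 2017) can be written as below. This is a sketch under assumptions: the symbol 𝒫 for the pooling operation, the function name g, and the assignment of ϕ.sub.1 to ƒ and ϕ.sub.2 to g are illustrative and are not confirmed by the text above.

```latex
\mathrm{TERN}(S) = \mathcal{P}\bigl(R_\Phi(S_1),\, R_\Phi(S_2),\, \ldots,\, R_\Phi(S_t)\bigr),
\qquad
R_\Phi(O) = f_{\phi_1}\Bigl(\sum_{i,j} g_{\phi_2}(o_i, o_j)\Bigr),
\qquad
O = \{o_i\}_{i=1}^{n},\quad o_i \in \mathbb{R}^{f}.
```

In that formulation, g scores each pair of concepts, the sum aggregates the pair scores into an order-invariant representation, and ƒ maps the aggregate to the output; both functions are typically multilayer perceptrons.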
(42) The learning procedure outputs a model that will be employed during system use for feature extraction and classification. In this training setting, sampling random snippets is a data augmentation technique in which the network sees a different snippet every time. At the same time, it is ensured that the video is seen as a whole, according to the number of segments and the snippet length. For instance, if three segments are chosen, then it is ensured that the network will see data from the beginning, middle, and end of the video. The consensus layer then pushes the network to learn weights that favor consistency across the segments. TERN benefits from efficiently reusing weights between concept pairings and temporal segments. This imposes constraints that act as regularizers, while also reducing the number of parameters, as pointed out by the literature.
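The weight reuse and the consensus across segments can be sketched in plain Python as follows. This is a toy illustration under stated assumptions: `g` and `f` stand in for learned functions and are ordinary callables here, pairs are taken over distinct concepts only, and the consensus is an element-wise mean; none of these choices is confirmed by the disclosure.

```python
def relational_network(concepts, g, f):
    """Apply g to every ordered pair of distinct concept vectors,
    sum the results element-wise, and pass the sum through f."""
    total = None
    for i, oi in enumerate(concepts):
        for j, oj in enumerate(concepts):
            if i == j:
                continue
            out = g(oi, oj)
            total = out if total is None else [a + b for a, b in zip(total, out)]
    if total is None:
        raise ValueError("need at least two concepts")
    return f(total)

def mean_pool(score_lists):
    """Element-wise mean over the per-snippet outputs (assumed consensus)."""
    n = len(score_lists)
    return [sum(col) / n for col in zip(*score_lists)]

def tern(snippets, g, f, pool=mean_pool):
    # The same g and f (shared weights) are applied to every snippet.
    return pool([relational_network(s, g, f) for s in snippets])

# Toy stand-ins for the learned functions.
g = lambda a, b: [x * y for x, y in zip(a, b)]
f = lambda v: v

# Two snippets, each with two 2-dimensional concept vectors.
scores = tern([[[1, 0], [1, 1]], [[2, 0], [0, 1]]], g, f)
```

Because `g` and `f` are shared across all pairs and all snippets, the number of parameters does not grow with the number of concepts or segments.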
(43) The entire process of obtaining contextual features in the proposed invention (203B-204B-205B) is differentiable, which means that the system can be trained end-to-end, from concept detections to activity predictions. This allows the system to obtain the best parameters automatically, without requiring human intervention or expert knowledge of the problem domain.
(44) Experiments on the UCF101 Human Activity Recognition dataset demonstrate the improvements in accuracy over existing baselines when using the proposed invention. Initially, preliminary experiments are conducted on the 1st split of the UCF101 dataset to evaluate egocentric pyramid alone and baselines based on object occurrences, namely, spatial pyramid, object scores as reported by Jain et al., the implementation using an object detector, and the extension based on occurrences.
(48) To better understand how TERN and TSN affect each other, the difference in accuracy for each activity class regarding the fusion of TERN+two-stream (TSN) is analyzed.
(49) Although the present disclosure has been described in connection with certain preferred embodiments, it should be understood that it is not intended to limit the disclosure to those particular embodiments. Rather, it is intended to cover all alternatives, modifications and equivalents possible within the spirit and scope of the disclosure as defined by the appended claims.