Method for video recognition capable of encoding spatial and temporal relationships of concepts using contextual features

11416774 · 2022-08-16

Abstract

The proposed invention aims at encoding contextual information for video analysis and understanding, by encoding spatial and temporal relationships of objects and the main agent in a scene. The main target application of the invention is human activity recognition. The encoding of such spatial and temporal relationships may be crucial to distinguish different categories of human activities and may be important to help in the discrimination of different video categories, aiming at video classification, retrieval, categorization and other video analysis applications.

Claims

1. A method for video recognition using contextual features capable of encoding spatial and temporal relationships of concepts, the method comprising performing, by at least one processor, operations including: acquiring input video data from a video; processing the input video data to detect concepts in the video; computing contextual features from the detected concepts, wherein the computing contextual features includes: computing, by the Egocentric Pyramid, spatial relationships of detected concepts in relation to a main agent of the video as concept-agent pairings; computing pairings between concepts as concept-concept pairings; and making use of the computed pairings to determine temporal relationships of the concepts, using the Temporal Egocentric Relational Network, to generate prediction scores for the concepts; and outputting the generated prediction scores from the Temporal Egocentric Relational Network.

2. The method according to claim 1, wherein the acquiring input video data comprises splitting the video into t video segments of equal size T and then, from each video segment, sampling a random snippet $S_i$ with length $|S_i|$ such that $|S_i| \le T$.

3. The method according to claim 1, wherein the computing contextual features from the detected concepts includes attributing scores to captured context to determine the concepts and agents in the video.

4. The method according to claim 3, wherein the Egocentric Pyramid considers as the main agent in the video to be the concept with the highest attributed score obtained by the detected concepts.

5. The method according to claim 1, wherein when more than one agent is in the video, a number of agents is a same number of Egocentric Pyramids, where each Egocentric Pyramid is considered a separate concept as the agent in the video.

6. The method according to claim 1, wherein the Temporal Egocentric Relational Network determines the temporal relationships from both egocentric pairings and concept pairings.

7. The method according to claim 1, wherein the Temporal Egocentric Relational Network determines the temporal relationships from egocentric pairings.

8. The method according to claim 1, wherein the Temporal Egocentric Relational Network determines the temporal relationships from concept pairings.

9. The method according to claim 1, wherein the Temporal Egocentric Relational Network uses the computed pairings to determine features and a classifier in a unified way.

10. The method according to claim 1, wherein the Temporal Egocentric Relational Network is configured to determine concept information over time.

11. The method according to claim 1, wherein the Temporal Egocentric Relational Network is defined as:
$\mathrm{TERN}(S) = \mathcal{P}\big(R_\Phi(S_1), R_\Phi(S_2), \ldots, R_\Phi(S_t)\big)$, where $S_t$ is the video, $R_\Phi$ is a relational network with parameters $\Phi$, $\mathcal{P}$ is a pooling operation and the relational network $R_\Phi$, given parameters $\Phi = [\phi_1, \phi_2]$, is defined as $R_\Phi(O) = f_{\phi_1}\big(\frac{1}{n^2}\sum_{o_i, o_j} g_{\phi_2}(o_i, o_j)\big)$, where $O = \{o_i\}_{i=1}^{n}$ represents an input set of n detected concepts (e.g., objects), where $o_i$ is the i-th concept such that $o_i \in \mathbb{R}^{f}$; and functions $f_{\phi_1}$ and $g_{\phi_2}$ are stacked multi-layer perceptrons (MLP) parameterized by parameters $\phi_1$ and $\phi_2$, respectively.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) The objectives and advantages of the current invention will become clearer through the following detailed description of the example and non-limitative drawings presented at the end of this document:

(2) FIG. 1 shows a scenario of the proposed invention being used, in which a scenario or people are being recorded by cameras which generate digital videos.

(3) FIG. 2A depicts the pipeline of the proposed approach for the system.

(4) FIG. 2B shows the flowchart of the proposed approach for the method.

(5) FIG. 3 compares a common spatial pyramid with the proposed egocentric pyramid.

(6) FIG. 4 shows experimental results of the proposed invention in comparison with baselines based on object occurrences.

(7) FIG. 5 shows experimental results of one embodiment of the proposed invention (TERN) fused with other state-of-the-art architectures for action recognition, which also consider motion information.

(8) FIG. 6 presents the accuracy differences between the proposed invention alone (TERN) and the proposed invention fused with state-of-the-art approach (TERN+two-stream (TSN)).

DETAILED DESCRIPTION

Features of Invention

(9) The proposed invention discloses an approach for human activity recognition in videos, which encodes contextual information through the spatial and temporal relationships of objects and the main agent in a scene. Spatial relationships are encoded using the Temporal Egocentric Relational Network, object occurrences, and egocentric pyramids; the latter is a technique proposed to encode the spatial arrangement of objects around the main agent in the scene. Temporal relationships are encoded by combining relational networks and temporal segment networks. All the encoding steps are differentiable, allowing an end-to-end learning process, which makes it possible to obtain higher recognition rates and to deliver a better classifier for the target product. The proposed approach, although validated for human activity recognition, can be used for other tasks related to video analysis and understanding.

(10) Based on the description of the existing methods and on the description of the proposed invention, the following advantages for the invention can be enumerated:

(11) Encoding contextual information, i.e., both spatial and temporal relationships of objects regarding the main agent in the scene, improves the accuracy for human activity recognition systems based on video data;

(12) Such encoding of contextual information may benefit other video analysis and video understanding applications;

(13) More precise video understanding systems enable better knowledge extraction from video datasets, including better video categorization, better video search, video auto-tagging, video summarization, among other applications;

(14) Better video understanding systems require less human/manual annotation of video content;

(15) Embedding the invention in mobile devices enables better Artificial Intelligence (AI) applications for users;

(16) Better video categorization and video search in mobile applications (e.g., Gallery).

(17) The main goal of the proposed invention is the recognition of human activities based on videos. However, the invention can also be used for any other video analysis application. Input videos are recorded by cameras and may be available from different sources, such as YouTube, surveillance cameras, smart phones, etc. The recognition algorithms can understand the activities performed on video, like horse racing, kayaking, applying lipstick, walking with dog, playing cello, and others.

(18) As shown in FIG. 1, a user or scenario (101) is recorded by a camera (102), generating a video (103). This video (103) is then processed by the present invention (104), which can be executed in a computer, remote server, mobile device, or other device, including cloud servers. The output (105) of the proposed system consists of the recognized human activity categories. The camera device (102) can be an IP camera, a smart phone, a surveillance camera, or any other device comprising a camera. The proposed invention has the advantage of enabling recognition systems to learn contextual relationships for improving accuracy.

(19) FIG. 2A illustrates the system in which the proposed invention is performed. The input video data (201A) goes through the concept detection module (202A) in order to obtain the concepts (e.g., objects). The contextual features are then captured by the Egocentric Pyramid (203A) and the Concept Pairings module (204A), whose outputs are processed by the Temporal Egocentric Relational Network (205A), which considers the spatial and temporal relationships of concepts and generates the output (206A) of the system. The relationships are computed for both object-agent and object-object pairs and used as input for a neural network, which learns the best concept combinations and parameters.

(20) FIG. 2B illustrates the steps of the method of the proposed invention. FIG. 2B shows the pipeline of video recognition using contextual features capable of encoding spatial and temporal relationships of concepts comprising the steps of:

(21) a. acquiring input video data (201B);

(22) b. processing the input video data in order to detect concepts in the video (202B);

(23) c. computing contextual features from the detected concepts, further comprising the following sub-steps:

(24) i. computing, by the Egocentric Pyramid, spatial relationships of detected concepts in relation to the main agent of the scene (203B) (concept-agent pairings);

(25) ii. computing pairings between concepts (204B) (concept-concept pairings);

(26) iii. making use of concept pairings and egocentric pairings to learn their temporal relationships, by the Temporal Egocentric Relational Network, to generate prediction scores for the concepts (205B);

(27) d. outputting the prediction scores given by the Temporal Egocentric Relational Network (206B).

(28) The purpose of the invention is to recognize human activities based on video, which is the input video data (201A) of the system. The input video data (201A) is processed in order to detect concepts (202A). Concepts can be objects, people, object parts, etc. The concepts are then passed to the module that generates the contextual features, which is divided into two sub-modules: the egocentric pyramid (203A), which obtains the spatial relationships between objects and the main agent in the scene, and the concept pairings module (204A), which obtains the spatial relationships between pairs of objects. The spatial relationships are used as input for a Temporal Egocentric Relational Network (TERN) (205A), which learns not only the best object pairings, but also their temporal relationships. The output (206A) of the method is the set of predictions: human activity categories in the case of human activity recognition, or the categories of any other video classification task.

(29) The input video data (201A) can be obtained from sources including, but not limited to, video cameras, smart phones, wearable cameras, surveillance cameras, and websites such as YouTube. The input video data (201A) is initially split into t segments of equal size T. From each segment, a random snippet $S_i$ is sampled with length $|S_i|$ such that $|S_i| \le T$. The video snippets can be used as input for the concept detection module (202A).
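The segment-and-snippet sampling described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that a snippet is a run of consecutive frame indices and that trailing frames which do not fill a whole segment are ignored. The function name `sample_snippets` and its parameters are hypothetical.

```python
import random


def sample_snippets(num_frames, t, snippet_len, rng=None):
    """Split frame indices [0, num_frames) into t segments of equal size T
    and draw one random snippet of snippet_len frames from each segment."""
    rng = rng or random.Random()
    T = num_frames // t  # segment size; leftover frames are dropped (assumption)
    if T == 0 or not 1 <= snippet_len <= T:
        raise ValueError("snippet length must satisfy 1 <= |S_i| <= T")
    snippets = []
    for i in range(t):
        seg_start = i * T
        # Random offset chosen so the snippet stays inside segment i.
        start = seg_start + rng.randint(0, T - snippet_len)
        snippets.append(list(range(start, start + snippet_len)))
    return snippets


# Three segments of T = 30 frames each, one 5-frame snippet per segment.
snippets = sample_snippets(num_frames=90, t=3, snippet_len=5,
                           rng=random.Random(0))
```

Because one snippet is drawn from every segment, each call covers the beginning, middle and end of the video while still varying the exact frames seen.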

(30) The detection of concepts, which can be objects, people, object parts and others, can be based on object detectors including, but not limited to YOLO (“YOLO9000: Better, Faster, Stronger”, Redmon and Farhadi, CVPR, 2017), SSD (“SSD: Single shot multibox detector”, Liu et al., ECCV, 2016), Faster-RCNN (“Faster R-CNN: Towards real-time object detection with region proposal networks”, Ren et al., NIPS, 2015), etc. The concept detection module (202A) outputs all the concepts detected in the input video data (201A).

(31) The egocentric pyramid (203A) is responsible for encoding spatial relationships between concepts and the main agent in the scene (concept-agent pairings). FIG. 3 shows an egocentric pyramid in comparison with a common spatial pyramid. The egocentric pyramid splits the image space according to the main agent in the scene: it takes an agent as reference and builds a spatial pyramid centered on it. The agent is the central concept performing an activity. It may be determined in several ways, including but not limited to choosing the concept with the highest score assigned by the concept detector, or tracking the concept with the highest scores.

(32) An advantage of the egocentric pyramid (203A) over common spatial pyramids is that the representation of the elements surrounding a given agent is invariant to the agent's position. A common spatial pyramid takes the center of the frame as reference, which is problematic because it assumes that all activities are performed at the center of the video, which is not necessarily true. For instance, consider the "walking the dog" activity: if the person escorting the dog starts at the upper-left corner of the frame and then moves to the bottom-right corner, the corresponding 'dog' bin will be assigned to the histograms corresponding to the second and fourth quadrants. That generates a histogram signature different from the one obtained for the same activity when the person with the dog starts at the bottom-left and moves to the bottom-right. The egocentric pyramid prevents this, because it takes the agent position as reference instead of the frame's center, and the relevant elements move around the one performing the action.

(33) In the egocentric pyramid (203A), in the case a concept is in the boundary of more than one quadrant or a concept is split across multiple quadrants, there are some possibilities to update the corresponding quadrant histograms. One option is to update only the histogram in which the concept has the larger part. Another option is to use the concept dimensions (determined by the bounding box computed by the concept detector) to update all the quadrant histograms weighted by the portion of the concept that belongs to each quadrant.
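A single level of an egocentric pyramid, using the area-weighted update option described above, might look like the sketch below. It rests on several assumptions that the text does not fix: detections are given as hypothetical tuples `(class_id, score, (x1, y1, x2, y2))`, the agent is the detection with the highest score, the pyramid center is the center of the agent's bounding box, only one 2x2 level is computed, and quadrants are indexed 0 to 3 as top-left, top-right, bottom-left, bottom-right in image coordinates.

```python
def egocentric_histograms(detections, num_classes):
    """Return (agent_index, hist), where hist[q][c] accumulates the fraction
    of each class-c bounding box that falls in quadrant q around the agent."""
    # Agent: highest-scoring detection (one of the options named in the text).
    agent_idx = max(range(len(detections)), key=lambda i: detections[i][1])
    ax1, ay1, ax2, ay2 = detections[agent_idx][2]
    cx, cy = (ax1 + ax2) / 2.0, (ay1 + ay2) / 2.0
    inf = float("inf")
    # Quadrants as (x_min, x_max, y_min, y_max) regions around the agent center.
    quadrants = [(-inf, cx, -inf, cy), (cx, inf, -inf, cy),
                 (-inf, cx, cy, inf), (cx, inf, cy, inf)]
    hist = [[0.0] * num_classes for _ in quadrants]
    for i, (cls, _score, (x1, y1, x2, y2)) in enumerate(detections):
        if i == agent_idx:
            continue  # the agent is the reference, not a surrounding concept
        area = (x2 - x1) * (y2 - y1)
        if area <= 0:
            continue
        for q, (qx1, qx2, qy1, qy2) in enumerate(quadrants):
            w = max(0.0, min(x2, qx2) - max(x1, qx1))
            h = max(0.0, min(y2, qy2) - max(y1, qy1))
            hist[q][cls] += (w * h) / area  # portion of the box in quadrant q
    return agent_idx, hist


def shift(dets, dx, dy):
    """Translate every bounding box by (dx, dy)."""
    return [(c, s, (x1 + dx, y1 + dy, x2 + dx, y2 + dy))
            for c, s, (x1, y1, x2, y2) in dets]


# Class 0 = person (agent), class 1 = dog straddling the two lower quadrants.
detections = [(0, 0.9, (40, 40, 60, 60)), (1, 0.6, (45, 70, 55, 80))]
agent_idx, hist = egocentric_histograms(detections, num_classes=2)
_, shifted_hist = egocentric_histograms(shift(detections, 100, 30), 2)
```

In this example the dog contributes 0.5 to each of the two lower quadrants, and translating the whole scene leaves the histograms unchanged, which is the position invariance the egocentric pyramid is meant to provide.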

(34) The egocentric pyramid (203A) can also be used in case there is more than one prominent agent in the scene. This may happen when the scores of the concept detector are similar for more than one concept (e.g., three concepts with scores around 0.3). In this case, all the concepts with similar high scores are used as agents and a separate egocentric pyramid is computed using each concept as the main agent. All these egocentric pyramids can then be used as input for the Temporal Egocentric Relational Network (TERN) (205A).
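One possible way to pick several agents is sketched below. The text only says that concepts with "similar high scores" are all used as agents, so the tolerance threshold and the helper name `select_agents` are assumptions made for illustration.

```python
def select_agents(scores, tolerance=0.05):
    """Return indices of all concepts whose detector score is within
    `tolerance` of the best score (the tolerance value is an assumption)."""
    best = max(scores)
    return [i for i, s in enumerate(scores) if best - s <= tolerance]


# Three concepts with scores around 0.3 all become agents; one egocentric
# pyramid would then be computed per returned index.
agents = select_agents([0.31, 0.30, 0.29, 0.05])
```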

(35) The concept pairings module (204A) obtains the spatial relationships between every pair of detected concepts (concept-concept pairings).
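A minimal sketch of enumerating concept-concept pairings follows. The text does not specify which spatial relationship is stored for a pair, so the relative offset between bounding-box centers used here is purely an illustrative assumption, as are the tuple layout `(class_id, score, box)` and the function names.

```python
from itertools import permutations


def box_center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def concept_pairings(detections):
    """For every ordered pair of distinct detections (a, b), return
    (class_a, class_b, dx, dy), the offset from a's center to b's center."""
    pairs = []
    for a, b in permutations(detections, 2):
        (ax, ay), (bx, by) = box_center(a[2]), box_center(b[2])
        pairs.append((a[0], b[0], bx - ax, by - ay))
    return pairs


detections = [(0, 0.9, (40, 40, 60, 60)),
              (1, 0.6, (45, 70, 55, 80)),
              (2, 0.5, (0, 0, 10, 10))]
pairs = concept_pairings(detections)  # 3 concepts -> 6 ordered pairs
```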

(36) The concept-agent pairings and concept-concept pairings obtained respectively by Egocentric Pyramids (203A) and Concept Pairings modules (204A), can be used as input for a Temporal Egocentric Relational Network (TERN). That is, the contextual features can be computed by TERN considering only concept-agent information, only concept-concept information or both types of pairings.

(37) As with any machine learning system, the method must first be trained. This learning phase can be based on a given video dataset, from which the system learns the parameters and generates a classification model. Training can happen separately from system use, i.e., from the inference phase. For instance, the classifier can be trained on a computer/server and the learned model can then be used in a mobile device. It is also possible to have the two phases in the same location. In addition, it is possible to update or re-train the classifier at certain periods of time using new data, which can come from user datasets. The proposed invention has no restriction on where the training and inference phases occur.

(38) The Temporal Egocentric Relational Network (TERN) (205A) makes use of the pairings in order to learn features and classifier in a unified way. TERN is designed to reason over concept information over time, which means that TERN will learn the spatial and temporal relationships for the contextual features. Given a sequence of video snippets $S = \{S_1, S_2, \ldots, S_t\}$ comprising t snippets sampled uniformly or randomly, the Temporal Egocentric Relational Network is defined as
$$\mathrm{TERN}(S) = \mathcal{P}\big(R_\Phi(S_1), R_\Phi(S_2), \ldots, R_\Phi(S_t)\big),$$

(39) where each $S_i$ is a video snippet, $R_\Phi$ is a relational network with parameters $\Phi$, and $\mathcal{P}$ is a pooling operation. In particular, a relational network $R_\Phi$, given parameters $\Phi = [\phi_1, \phi_2]$, is defined as

(40) $$R_\Phi(O) = f_{\phi_1}\left(\frac{1}{n^2}\sum_{o_i, o_j} g_{\phi_2}(o_i, o_j)\right).$$

(41) Here, $O = \{o_i\}_{i=1}^{n}$ represents an input set of n detected concepts (e.g., objects), where $o_i$ is the i-th concept such that $o_i \in \mathbb{R}^{f}$; and functions $f_{\phi_1}$ and $g_{\phi_2}$ are stacked multi-layer perceptrons (MLP) parameterized by parameters $\phi_1$ and $\phi_2$, respectively.
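The forward pass of the two definitions above can be sketched in plain Python as follows. Several choices here are assumptions that the text leaves open: the pooling operation is taken to be an average over snippets, a pair of concepts is fed to g by concatenating the two feature vectors, the MLPs use ReLU between layers, and the weights are random rather than trained. All function names are hypothetical.

```python
import random


def init_layers(sizes, rng):
    """Random dense layers (W, b) for an MLP with the given layer sizes."""
    return [([[rng.uniform(-0.5, 0.5) for _ in range(m)] for _ in range(k)],
             [0.0] * k)
            for m, k in zip(sizes, sizes[1:])]


def mlp(layers, x):
    """Apply stacked dense layers with ReLU between them (assumption)."""
    for idx, (W, b) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + bias
             for row, bias in zip(W, b)]
        if idx < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


def relational_network(concepts, g_layers, f_layers):
    """R(O) = f( (1/n^2) * sum over all pairs (o_i, o_j) of g(o_i, o_j) )."""
    n = len(concepts)
    if n == 0:
        raise ValueError("at least one detected concept is required")
    total = None
    for o_i in concepts:
        for o_j in concepts:
            out = mlp(g_layers, o_i + o_j)  # pair fed as concatenation
            total = out if total is None else [a + b for a, b in zip(total, out)]
    return mlp(f_layers, [v / (n * n) for v in total])


def tern(snippets, g_layers, f_layers):
    """TERN(S): pool R over the t snippets (average pooling assumed)."""
    outs = [relational_network(s, g_layers, f_layers) for s in snippets]
    return [sum(col) / len(outs) for col in zip(*outs)]


rng = random.Random(0)
f_dim, num_classes = 4, 3
g_layers = init_layers([2 * f_dim, 16, 16], rng)
f_layers = init_layers([16, 16, num_classes], rng)
# t = 3 snippets containing 2, 3 and 4 concepts, each an f_dim-dimensional vector.
snippets = [[[rng.random() for _ in range(f_dim)] for _ in range(n)]
            for n in (2, 3, 4)]
scores = tern(snippets, g_layers, f_layers)  # one score per class
```

Because g is summed over all pairs and the same weights are reused for every pair and every snippet, the output does not depend on the order in which concepts are listed within a snippet.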

(42) The learning procedure outputs a model that will be employed during system use for feature extraction and classification. In this training setting, sampling random snippets is a data augmentation technique in which the network sees a different snippet every time. At the same time, it is ensured that the video is seen as a whole, according to the number of segments and the snippet length. For instance, if three segments are chosen, then it is ensured that the network will see data from the beginning, middle, and end of the video. The consensus layer then pushes the network to learn weights that favor consistency across the segments. TERN benefits from efficiently reusing weights between concept pairings and temporal segments. This imposes constraints that act as regularizers, while also reducing the number of parameters, as pointed out in the literature.

(43) The entire process of obtaining contextual features in the proposed invention (203B-204B-205B) is differentiable, which means that the system can be trained end-to-end, from concept detections to activity predictions. This allows the system to obtain the best parameters automatically, without requiring human intervention or expert knowledge of the problem domain.

(44) Experiments on the UCF101 Human Activity Recognition dataset demonstrate the improvements in accuracy over existing baselines when using the proposed invention. Initially, preliminary experiments are conducted on the 1st split of the UCF101 dataset to evaluate the egocentric pyramid alone and baselines based on object occurrences, namely, spatial pyramid, object scores as reported by Jain et al., the implementation using an object detector, and the extension based on occurrences. FIG. 4 summarizes the results. First, it can be seen that using the SSD detector to reproduce the baseline yields a gain of 5 percentage points (p.p.) when compared to the original report by Jain et al. (“What do 15,000 object categories tell us about classifying and localizing actions?”, Jain et al., CVPR, 2015). In addition, by evaluating the representation of number of occurrences alone, a similar result to the baseline (65%) is achieved. This shows that occurrences by themselves are not as representative as the object scores. Finally, by combining both scores and occurrences, an accuracy of 72% is obtained, which shows that they are complementary. Afterwards, the Temporal Egocentric Relational Network (TERN) deep learning architecture is evaluated. Comparing the TERN results with the egocentric pyramid alone, there is an improvement of 1.92 p.p., which suggests that there are other non-explicit contextual features that can be exploited besides spatial arrangements, such as temporal cues from relative frame position and multi-snippets, and other spatial cues such as size and fine-grained localization.

(45) FIG. 5 shows experimental results of one embodiment of the proposed invention (TERN) fused with other state-of-the-art architectures for action recognition, which also consider motion information. Results show the competitiveness of the proposed invention with other approaches.

(46) FIG. 5 presents TERN evaluated over the three UCF101 splits, where the last column is the average of the accuracies. These results are compared with the well-known two-stream networks, with modifications by Wang et al. (“Towards good practices for very deep two-stream convnets”, Wang et al., arXiv, 2015). Results of Wang et al. were obtained by running the code provided by the authors. To make a fair comparison, the Temporal Segment Networks (TSN) are also included. Compared to the spatial stream, TERN obtains a similar result, which might suggest that it is encoding part of the necessary spatial information encoded by the spatial stream. However, when the spatial stream is also embedded in the TSN framework, the gap between the two approaches increases. The reason might be that TSN enables the spatial convolutional network to learn temporally consistent visual patterns that are not available from object detections alone, such as scene/background and pose cues. Fusing TERN predictions with the temporal stream yields an improvement close to that of two-stream itself, suggesting complementarity between the two modalities. Compared to two-stream alone, the fusion of TERN+two-stream slightly improves recognition (1.34 p.p.). However, this gain is smaller when TERN is fused with TSN (0.04 p.p.), showing a smaller complementarity between both approaches.

(47) FIG. 6 presents the accuracy differences between the proposed invention alone (TERN) and the proposed invention fused with a state-of-the-art approach (TERN+two-stream (TSN)), showing which method is better for each activity class (positive bars indicate that TERN alone is better). The graph shows that activities that have clear objects are easier to recognize by TERN, such as archery, boxing punching bag, shotput, and typing. However, TERN performed worse for classes in which objects are difficult to detect (e.g., apply eye makeup), in which appearance and/or motion plays a major role (long jump, punch), or whose objects are absent among the detector categories (javelin throw).

(48) To better understand how TERN and TSN affect each other, the difference in accuracy for each activity class regarding the fusion of TERN+two-stream (TSN) is analyzed. FIG. 6 shows a summary of the scenarios in which TERN+two-stream (TSN) performed better (positive bars) and worse (negative bars) than two-stream (TSN) alone. Activities that involve objects, such as archery, boxing punching bag, shotput, and typing, are easier to recognize and performed better, while TERN performed worse for classes in which objects are difficult to detect (apply eye makeup), in which appearance and/or motion plays a major role (long jump, punch), or whose objects are absent among the detector categories (javelin throw). Comparing TERN alone with TSN, TERN only performed better in situations in which objects played an important role, such as playing guitar or horse riding. Still, appearance and motion perform better in most classes, as expected. However, as shown in FIGS. 5 and 6, there are activity categories that benefit from fusing TSN with TERN, suggesting that there are contextual cues that can be exploited by action recognition architectures besides appearance/motion.

(49) Although the present disclosure has been described in connection with certain preferred embodiments, it should be understood that it is not intended to limit the disclosure to those particular embodiments. Rather, it is intended to cover all alternatives, modifications and equivalents possible within the spirit and scope of the disclosure as defined by the appended claims.