ACTIVITY RECOGNITION IN DARK VIDEO BASED ON BOTH AUDIO AND VIDEO CONTENT
20230039641 · 2023-02-09
Inventors
- Yunhua Zhang (Amsterdam, NL)
- Xiantong Zhen (Amsterdam, NL)
- Ling Shao (Abu Dhabi, AE)
- Cees G.M. Snoek (Amsterdam, NL)
CPC classification
- G06V20/41 (Physics)
- G06V20/46 (Physics)
- G06F18/285 (Physics)
- G06F18/256 (Physics)
Abstract
Videos captured in low light conditions can be processed in order to identify an activity being performed in the video. The processing may use both the video and audio streams for identifying the activity in the low light video. The video portion is processed to generate a darkness-aware feature which may be used to modulate the features generated from the video and audio portions. The audio features may be used to generate a video attention feature and the video features may be used to generate an audio attention feature. The audio and video attention features may also be used in modulating the audio and video features. The modulated audio and video features may be used to predict an activity occurring in the video.
Claims
1. A method for identifying an activity occurring in media content captured in different possible lighting conditions including low light conditions, comprising: receiving the media content comprising video content and audio content; extracting video features from at least a portion of the video content using a video model; extracting audio features from at least a portion of the audio content using an audio model; generating a darkness-aware feature by applying the video features to a darkness-aware evaluation model; modulating the video features and the audio features based on the darkness-aware feature; and predicting an activity occurring in the media content based on the modulated video and audio features.
2. The method of claim 1, wherein modulating the video features and audio features based on the darkness-aware feature comprises: generating a video attention feature from the portion of the audio content; generating an audio attention feature from the portion of the video content; modulating the video features based on the video attention feature and the darkness-aware feature; and modulating the audio features based on the audio attention feature and the darkness-aware feature.
3. The method of claim 2, wherein predicting an activity occurring in the media content comprises: generating an audio prediction based on the modulated audio features; generating a video prediction based on the modulated video features; and combining the audio prediction and the video prediction to predict the activity occurring in the media content.
4. The method of claim 3, wherein generating the video prediction comprises adjusting classification boundaries of a video classifier based on the darkness-aware feature.
5. The method of claim 3, wherein generating the audio prediction comprises adjusting classification boundaries of an audio classifier based on the darkness-aware feature.
6. The method of claim 3, wherein combining the audio prediction and the video prediction comprises averaging the audio prediction and the video prediction together.
7. The method of claim 2, wherein the modulated video feature is generated according to:
F′.sub.v=M.sub.v⊙F.sub.v, wherein the modulated audio feature is generated according to:
F′.sub.a=M.sub.a⊙F.sub.a, where: F′.sub.v is the modulated video feature; F′.sub.a is the modulated audio feature; M.sub.v is the video attention from the portion of the audio content; M.sub.a is the audio attention from the portion of the video content; F.sub.v is the extracted video features; F.sub.a is the extracted audio features; and ⊙ indicates channel-wise multiplication.
8. The method of claim 1, wherein the darkness-aware feature is generated according to:
f.sub.d=Φ′.sub.d(F.sub.v), where: f.sub.d is the darkness-aware feature; Φ′.sub.d denotes a module obtained by removing a final layer from Φ.sub.d; Φ.sub.d is a darkness-aware model trained to produce darkness-aware features comprising a 3D residual block followed by a fully connected layer; and F.sub.v is the extracted video features.
9. The method of claim 1, wherein extracting the audio feature from at least a portion of the audio content comprises transforming the audio content to a spectrogram and applying the spectrogram to the audio model.
10. The method of claim 1, wherein the video model is trained on video content captured in well lit conditions.
11. A non-transitory computer readable memory storing instructions which when executed by a processor of a computer system configure the computer system to perform a method comprising: receiving media content comprising video content and audio content; extracting video features from at least a portion of the video content using a video model; extracting audio features from at least a portion of the audio content using an audio model; generating a darkness-aware feature by applying the video features to a darkness-aware evaluation model; modulating the video features and the audio features based on the darkness-aware feature; and predicting an activity occurring in the media content based on the modulated video and audio features.
12. The computer readable memory of claim 11, wherein modulating the video features and audio features based on the darkness-aware feature comprises: generating a video attention feature from the portion of the audio content; generating an audio attention feature from the portion of the video content; modulating the video features based on the video attention feature and the darkness-aware feature; and modulating the audio features based on the audio attention feature and the darkness-aware feature.
13. The computer readable memory of claim 12, wherein predicting an activity occurring in the media content comprises: generating an audio prediction based on the modulated audio features; generating a video prediction based on the modulated video features; and combining the audio prediction and the video prediction to predict the activity occurring in the media content.
14. The computer readable memory of claim 13, wherein generating the video prediction comprises adjusting classification boundaries of a video classifier based on the darkness-aware feature.
15. The computer readable memory of claim 13, wherein generating the audio prediction comprises adjusting classification boundaries of an audio classifier based on the darkness-aware feature.
16. The computer readable memory of claim 13, wherein combining the audio prediction and the video prediction comprises averaging the audio prediction and the video prediction together.
17. The computer readable memory of claim 12, wherein the modulated video feature is generated according to:
F′.sub.v=M.sub.v⊙F.sub.v, wherein the modulated audio feature is generated according to:
F′.sub.a=M.sub.a⊙F.sub.a, where: F′.sub.v is the modulated video feature; F′.sub.a is the modulated audio feature; M.sub.v is the video attention from the portion of the audio content; M.sub.a is the audio attention from the portion of the video content; F.sub.v is the extracted video features; F.sub.a is the extracted audio features; and ⊙ indicates channel-wise multiplication.
18. The computer readable memory of claim 11, wherein the darkness-aware feature is generated according to:
f.sub.d=Φ′.sub.d(F.sub.v), where: f.sub.d is the darkness-aware feature; Φ′.sub.d denotes a module obtained by removing a final layer from Φ.sub.d; Φ.sub.d is a darkness-aware model trained to produce darkness-aware features comprising a 3D residual block followed by a fully connected layer; and F.sub.v is the extracted video features.
19. The computer readable memory of claim 11, wherein extracting the audio feature from at least a portion of the audio content comprises transforming the audio content to a spectrogram and applying the spectrogram to the audio model.
20. The computer readable memory of claim 11, wherein the video model is trained on video content captured in well lit conditions.
21. A computer system comprising: a processor for executing instructions; and a memory storing instructions, which when executed by the processor configure the computer system to perform a method comprising: receiving media content comprising video content and audio content; extracting video features from at least a portion of the video content using a video model; extracting audio features from at least a portion of the audio content using an audio model; generating a darkness-aware feature by applying the video features to a darkness-aware evaluation model; modulating the video features and the audio features based on the darkness-aware feature; and predicting an activity occurring in the media content based on the modulated video and audio features.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0030] Further features and advantages of the present disclosure will become apparent from the following detailed description, taken in combination with the appended drawings.
DETAILED DESCRIPTION
[0036] Exploiting sound for video activity recognition in the dark can improve the activity prediction. As described further below, darkness-aware cross-modal interactions provide complementary information to a sight model and improve the activity predictions when the video is captured in dark conditions since audio signals are considered to be free of visual noise and can provide complementary information in the dark. The process described below exploits both audio and visual content of the video to identify activities that were captured in a dark environment, rather than using image enhancement techniques that can be computationally expensive. The model described below includes a darkness evaluation module that can dynamically adjust cross-modal interactions according to the darkness conditions of the video's visual content. The process may also combine both the audio and visual content of the video by cross-modal attention.
[0037] The audio signals are robust in the dark, but the outcome of cross-modal interaction can be sensitive to illumination conditions. While sound can be an important cue for recognition in the dark, the noise of the visual content should be taken into consideration. Accordingly, the process described below may control the cross-modal interaction based on the conditions of the visual content of the video and thereby exploit complementary information for each modality from the other.
[0038] The process described herein provides video activity recognition in the dark based on not only sight but also sound signals. The activity recognition can be applied to either live or recorded videos. The process may use an audiovisual model with both a sight and a sound stream, where each stream produces a classification prediction from its own modality. As the illumination conditions may vary between videos, a darkness evaluation module may be used to produce a darkness-aware feature representation for each video based on the conditions of the visual content. The darkness-aware feature may then be used to modulate the features used by the audio and visual streams for predicting activities in the video. The darkness-aware feature may be used to modulate the cross-modal channel attention and the classifier boundaries.
[0039] In order to train and test the current activity prediction models, four sight and sound datasets were derived from the XD-Violence, Kinetics-Sound and Moments-in-Time datasets for activity recognition in the dark. Three of these datasets are used for supervised learning and evaluation and the other for assessing the generalization ability from daylight to darkness. Experiments were conducted that demonstrate the benefit of sound, as well as of the other introduced network modules, for activity recognition in the dark. The experiments show that the sight and sound model, trained on daytime data only, can generalize well to nighttime videos.
[0042] The sound stream similarly comprises a convolutional neural network 220 that can extract features from the sound of the video, or more precisely from a spectrogram of the sound. The extracted sound features may be passed to a channel attention generation module 222 of the sound stream 204 that uses a 2D residual block 224 and sigmoid activation function 226 to generate an attention feature 228. The attention feature 228 generated by the sound stream 204 is passed to the sight stream and may be used to modulate the extracted sight features, for example by performing element-wise multiplication 230 and passing the generated feature through a fully connected layer 232. The sight features that are modulated by the sound attention feature can be passed to a classifier 234 that has classification boundaries that can be adjusted dynamically based on the darkness-aware feature. The classifier 234 outputs a prediction 236 of the activity classification.
[0043] Similarly, the sound stream 204 receives the attention feature 218 from the sight stream and uses it to modulate the extracted sound features, for example by performing element-wise multiplication 238 and passing the generated feature through a fully connected layer 240. The sound features that are modulated by the sight attention feature may be passed to a classifier 242 that has classification boundaries that can be adjusted dynamically based on the darkness-aware feature. The classifier 242 outputs a prediction 244 of the activity classification.
[0044] The activity predictions for the video generated by each of the sight stream and the sound stream may be combined together to generate a final activity prediction. For example, the predictions may be averaged together, or may be combined using other weightings possibly based on the darkness-aware feature.
[0045] As described above, a model includes both a visual stream, which may include a network trained using videos captured under normal lighting conditions, and an audio stream. A darkness evaluation module is designed to assess the extent to which discriminative information can be extracted from the current visual input video. The model may further include a darkness-aware cross-modal channel attention module that produces channel attention feature vectors for each stream considering the other stream and the darkness-aware feature. Further, since the feature distribution of an activity class in the dark can deviate from the one under normal illuminations, the darkness-aware classifier rectifies the classification boundary of each stream, also based on the darkness-aware feature.
[0046] The sight stream 202 and sound stream 204 first work in parallel, extracting the feature of the visual and audio modalities, respectively. Then the darkness evaluation module takes the visual feature as input and outputs the darkness-aware feature. To adapt to low illuminations, for the two streams, the features outputted by the intermediate layers of the streams are modulated with channel attentions and passed to a respective classifier considering the darkness-aware feature and cross-modal information. The final classification prediction may be the average of the predictions of the two streams.
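The parallel two-stream flow described above can be sketched as follows. This is an illustrative numpy sketch only: the feature dimensions, the random stand-in weights, and the use of single feature vectors in place of the real backbone outputs are all assumptions, not the trained networks of the disclosure:

```python
import numpy as np

rng = np.random.default_rng(0)
K, C = 18, 64  # hypothetical: 18 activity classes, 64 feature channels

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Stand-ins for the features the two backbones would extract
F_v = rng.standard_normal(C)  # sight features
F_a = rng.standard_normal(C)  # sound features

# Cross-modal channel attention: each stream is attended
# using features from the *other* modality
W_v4a, W_a4v = rng.standard_normal((C, C)), rng.standard_normal((C, C))
M_a = sigmoid(W_v4a @ F_v)  # attention for the sound stream, from sight
M_v = sigmoid(W_a4v @ F_a)  # attention for the sight stream, from sound

F_v_mod = M_v * F_v  # channel-wise modulation
F_a_mod = M_a * F_a

# Per-stream classifiers; the final prediction averages the two streams
W_cv, W_ca = rng.standard_normal((K, C)), rng.standard_normal((K, C))
P_v = softmax(W_cv @ F_v_mod)
P_a = softmax(W_ca @ F_a_mod)
P = 0.5 * (P_v + P_a)
predicted_class = int(np.argmax(P))
```

Because each softmax output sums to one, the averaged prediction P is itself a valid probability distribution over the K classes.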
[0047] The sight stream can use a modern 3D convolutional network as the backbone, such as described in "Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?" by Hara et al. in CVPR, 2018; "Identity mappings in deep residual networks" by He et al. in ECCV, 2016; "Aggregated residual transformations for deep neural networks" by Xie et al. in CVPR, 2017; and "Wide residual networks" by Zagoruyko et al. in BMVC, 2016, all of which are incorporated herein by reference in their entirety. The 3D convolutional network may be pre-trained under normal illumination conditions. The sight stream typically takes a video clip V.sub.i of size T.sub.v×H.sub.v×W.sub.v×3 as input and outputs the classification probability distribution P.sub.v∈.sup.K, where K is the number of classes. The sound stream may adopt a ResNet-18 as the backbone. The raw audio clip may be transformed into a spectrogram A.sub.i∈.sup.257×500×1, which becomes the input to the sound stream network. The sound stream outputs the recognition prediction P.sub.a∈.sup.K in parallel to the sight stream.
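The 257-bin spectrogram shape is consistent with a short-time Fourier transform using 512-sample windows (512/2 + 1 = 257 frequency bins). The sketch below illustrates such a transform; the hop length, the sample rate, and the log compression are illustrative assumptions rather than values stated in the disclosure:

```python
import numpy as np

def log_spectrogram(audio, n_fft=512, hop=220):
    """Magnitude STFT of a 1-D audio signal.

    n_fft=512 yields 512/2 + 1 = 257 frequency bins; the hop length
    and the assumed sample rate of the input are illustrative only.
    """
    window = np.hanning(n_fft)
    frames = [
        np.abs(np.fft.rfft(audio[s:s + n_fft] * window))
        for s in range(0, len(audio) - n_fft + 1, hop)
    ]
    # Shape (257, num_frames, 1), matching the A_i layout above
    return np.log1p(np.stack(frames, axis=1))[..., np.newaxis]

audio = np.random.default_rng(0).standard_normal(110_250)  # ~10 s at 11.025 kHz
A = log_spectrogram(audio)
```

The number of time frames depends on the clip length and hop; a fixed 500-frame input would be obtained by cropping or padding along the time axis.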
[0048] For videos under low illumination, assessing whether the sight-only model can extract discriminative features is desirable. To this end, a darkness evaluation module Φ.sub.d is trained to produce darkness-aware features. The darkness evaluation module may include a 3D residual block with the same structure as the backbone network of the sight stream, such as a 3D convolutional network, followed by a fully connected layer. The darkness evaluation module takes the intermediate features F.sub.v of the sight stream as input and outputs a prediction indicating whether the visual content can provide discriminative features.
[0049] For training the darkness evaluation module, the illuminance of each video may be determined by:
Y=(1/P)Σ.sub.j(0.299*R.sub.j+0.587*G.sub.j+0.114*B.sub.j), (1)
[0050] where R.sub.j, G.sub.j, and B.sub.j are the intensities of the red, green and blue channels of the j.sup.th pixel and P is the total number of pixels. It is observed that videos captured in the dark commonly satisfy Y<50. For a training set with N videos, first the predictions of the sight-only model are obtained. For videos captured in the dark, for example based on the illuminance, those on which the sight-only model can make correct predictions are treated as positive samples. For videos captured under normal illuminations, their brightness may be reduced, for example by the method proposed in "Image processing by linear interpolation and extrapolation" by Haeberli et al. in IRIS Universe Magazine, 28: 8-9, which is incorporated herein by reference in its entirety, to generate videos with low illuminations. The videos that still enable the sight-only model to recognize successfully are added to the positive samples, while the rest are negative ones. Finally, the darkness evaluation module is trained by a binary cross-entropy loss on these collected positive and negative videos.
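The illuminance test can be sketched as follows; the Rec. 601 luma weights (0.299, 0.587, 0.114) used here are an assumption consistent with the per-pixel R, G, B weighting described above, and the clip shapes are illustrative:

```python
import numpy as np

def video_illuminance(frames):
    """Mean luma Y over all pixels of a clip of RGB frames.

    frames: array of shape (T, H, W, 3) with intensities in [0, 255].
    The Rec. 601 weights are an assumed instantiation of the per-pixel
    R, G, B weighting described in the text.
    """
    r, g, b = frames[..., 0], frames[..., 1], frames[..., 2]
    return float(np.mean(0.299 * r + 0.587 * g + 0.114 * b))

def is_dark(frames, threshold=50.0):
    # Videos captured in the dark were observed to satisfy Y < 50
    return video_illuminance(frames) < threshold

dark_clip = np.full((4, 8, 8, 3), 20.0)    # uniformly dim frames
bright_clip = np.full((4, 8, 8, 3), 180.0)  # uniformly bright frames
```

A split of this kind can serve both to label training videos as dark versus daylight and to select the positive/negative samples described above.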
[0051] Once the training has finished, the output from the penultimate layer of Φ.sub.d is adopted as the darkness-aware feature representation for each video:
f.sub.d=Φ′.sub.d(F.sub.v), (2)
[0052] where Φ′.sub.d denotes the module obtained by removing the final layer from Φ.sub.d and f.sub.d∈.sup.2048 is the darkness-aware feature.
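Extracting the penultimate-layer output of equation (2) can be sketched with a toy stand-in module; the layer sizes, the ReLU, and the random weights below are illustrative assumptions, not the real 3D residual block:

```python
import numpy as np

class DarknessEvaluator:
    """Toy stand-in for the darkness evaluation module Φ_d: a feature
    extractor followed by a final binary layer. The real module is a
    3D residual block plus a fully connected layer; sizes here are
    illustrative."""

    def __init__(self, in_dim=32, feat_dim=2048, seed=0):
        rng = np.random.default_rng(seed)
        self.W_feat = rng.standard_normal((feat_dim, in_dim)) * 0.05
        self.W_out = rng.standard_normal((2, feat_dim)) * 0.05

    def features(self, F_v):
        # Φ'_d: the module with its final layer removed (equation (2))
        return np.maximum(self.W_feat @ F_v, 0.0)  # ReLU penultimate output

    def __call__(self, F_v):
        # Full Φ_d: binary darkness logits, used only during training
        return self.W_out @ self.features(F_v)

phi_d = DarknessEvaluator()
F_v = np.random.default_rng(1).standard_normal(32)
f_d = phi_d.features(F_v)   # 2048-dimensional darkness-aware feature
logits = phi_d(F_v)         # binary darkness prediction
```

After training, only `features` (the Φ′_d path) is used downstream; the final layer exists solely for the binary cross-entropy training objective.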
[0053] In order to exploit complementary cross-modal information, the current model uses the feature of one modality to generate channel attention for the other modality. The channel attention is used to filter out noise so that the models can focus on relevant patterns, while the result can be sensitive to the illumination. For example, when the visual content is too dark, the sight-only model cannot effectively extract discriminative features for recognition and the attention for the sound stream generated by such noisy features can erase important channels. Therefore, most channels of the sound stream should be preserved in such a situation. Similarly, the channel attention for the sight stream is also expected to preserve as much information as possible. To this end, the darkness-aware cross-modal channel attention module can dynamically adjust the sparsity of the channel attentions according to the illumination conditions.
[0054] The attention generation procedure of the visual and audio streams is similar, so the sound stream attention generation is described below as an example. The features F.sub.v of the sight stream are passed through a 3D residual block and a fully connected layer, denoted as Φ.sub.v4a.sup.att, to generate a feature vector f.sub.a.sup.att=Φ.sub.v4a.sup.att(F.sub.v). The channel attention may then be generated by a sigmoid activation with a per-video temperature:
M.sub.a,i=σ(f.sub.a,i.sup.att/τ.sub.a), (3)
[0055] where i indicates the i.sup.th channel, σ is the sigmoid function, M.sub.a is the generated attention, and τ.sub.a is the temperature that is dynamically changed per video. To obtain τ.sub.a, f.sub.d is passed through three fully connected layers, for example τ.sub.a=Φ.sub.v4a.sup.t(f.sub.d). Then, the modulated feature of the sound stream is obtained by:
F′.sub.a=M.sub.a⊙F.sub.a, (4)
[0056] where F.sub.a is the intermediate feature extracted by the sound stream and ⊙ indicates channel-wise multiplication. The attention for the sight stream is generated in the same manner:
f.sub.v.sup.att=Φ.sub.a4v.sup.att(F.sub.a), (5)
M.sub.v,i=σ(f.sub.v,i.sup.att/τ.sub.v), (6)
F′.sub.v=M.sub.v⊙F.sub.v, (7)
[0057] where Φ.sub.a4v.sup.att represents a residual block and fully connected layer, τ.sub.v is the temperature, f.sub.v.sup.att is the feature vector for generating attention, M.sub.v is the channel attention for the sight stream and F′.sub.v is the modulated feature. Similar to the estimation of τ.sub.a, τ.sub.v is also predicted by passing f.sub.d through three fully connected layers, for example τ.sub.v=Φ.sub.a4v.sup.t(f.sub.d).
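The effect of the per-video temperature can be illustrated with a small sketch; the sigmoid-with-temperature form and the sample values are assumptions consistent with the sigmoid activation and dynamically adjusted sparsity described above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(f_att, tau):
    """Sigmoid channel attention with a per-video temperature tau.

    A large tau flattens the attention toward 0.5 so most channels are
    preserved when the other modality is unreliable (e.g. very dark
    video); a small tau makes the attention sharper and sparser.
    """
    return sigmoid(f_att / tau)

f_att = np.array([-3.0, -0.5, 0.5, 3.0])   # hypothetical pre-attention values
sharp = channel_attention(f_att, tau=0.5)  # discriminative visual input
flat = channel_attention(f_att, tau=10.0)  # very dark visual input
```

With the large temperature, all attention values stay near 0.5, so channel-wise multiplication suppresses no channel strongly; with the small temperature, the attention approaches a hard channel selection.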
[0058] A cross-entropy classification loss can be adopted to train the channel attention generation module, i.e. parameters of Φ.sub.v4a.sup.att, Φ.sub.a4v.sup.att, Φ.sub.a4v.sup.t, while fixing the parameters of the other parts of the model.
[0059] The feature distributions of videos in the dark can deviate from those under normal illuminations. Therefore, the classification boundaries, i.e. the parameters of the classification layer, should be adaptive according to the illumination. To this end, a darkness-aware classifier can be trained, whose parameters used for classification prediction can be dynamically adjusted by the darkness-aware feature. For the sight-only model, the weights and biases of its original fully connected layer for classification are denoted as W.sub.v.sup.1 and b.sub.v.sup.1. Another set of weights and biases can be learned on videos in the dark, denoted as W.sub.v.sup.2 and b.sub.v.sup.2. These two sets of parameters are combined through a darkness-aware coefficient λ.sub.v produced by f.sub.d. As a result, when under low illumination, the parameters of the classification layer are:
W.sub.v=λ.sub.v*W.sub.v.sup.1+(1−λ.sub.v)*W.sub.v.sup.2, (8)
b.sub.v=λ.sub.v*b.sub.v.sup.1+(1−λ.sub.v)*b.sub.v.sup.2, (9)
λ.sub.v may be obtained by passing f.sub.d through three fully connected layers, denoted as λ.sub.v=Φ.sub.b(f.sub.d). The classifier boundary of the sound stream is adjusted in the same way as the sight stream. Similar to the darkness-aware channel attention, the parameters of the module, i.e. W.sub.v.sup.2, b.sub.v.sup.2 and Φ.sub.b, can be trained by cross-entropy loss while fixing the rest of the model. For training data, videos captured in the dark and videos captured in normal lighting conditions that have had the brightness lowered to provide "fake" dark videos may be used.
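The parameter blending of equations (8) and (9) can be sketched as follows; the class count, feature size, and random stand-in weights are illustrative assumptions:

```python
import numpy as np

def blended_classifier(x, W1, b1, W2, b2, lam):
    """Darkness-aware classification layer (equations (8)-(9)).

    The daylight parameters (W1, b1) and the dark-trained parameters
    (W2, b2) are mixed by a coefficient lam, which in the full model
    is predicted from the darkness-aware feature f_d.
    """
    W = lam * W1 + (1.0 - lam) * W2
    b = lam * b1 + (1.0 - lam) * b2
    return W @ x + b

rng = np.random.default_rng(0)
K, C = 5, 16  # hypothetical class and feature counts
W1, b1 = rng.standard_normal((K, C)), rng.standard_normal(K)
W2, b2 = rng.standard_normal((K, C)), rng.standard_normal(K)
x = rng.standard_normal(C)

day = blended_classifier(x, W1, b1, W2, b2, lam=1.0)   # pure daylight boundary
dark = blended_classifier(x, W1, b1, W2, b2, lam=0.0)  # pure dark boundary
```

Intermediate values of lam interpolate smoothly between the two decision boundaries, so a partially dark video receives a boundary between the two extremes.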
[0062] The functionality 410 provides various components for automatically recognizing activities occurring in a video 412 that may have been captured in a dark environment. The functionality 410 may include audio video ingestion functionality 414 that can receive AV content and generate corresponding video portions 416 and audio portions 418 that can be processed for identifying the activity. The video portions 416 may comprise one or more frames of the video 412, while the audio portions 418 may comprise a spectrogram of the audio of a corresponding time range from the video 412. The video portion 416 is processed by feature extraction functionality 420, which generates video features, and the audio portion 418 is processed by feature extraction functionality 422, which generates audio features. The features may be provided as respective feature vectors or other possible structures. The feature extraction functionality 420, 422 may generate multiple sets of video and audio features for subsequent use by other functionality. The feature extraction functionality may be provided by various model architectures. A darkness evaluation component 424 receives the extracted video features and processes them in order to output an indication of the ability to extract discriminative information from the video. As will be appreciated, videos captured under normal lighting conditions will generally have more discriminative information than videos captured in the dark.
[0063] The output of the darkness evaluation component 424 is used by video feature modulation functionality 426 and audio feature modulation functionality 428. The video feature modulation functionality receives the output from the darkness evaluation component and the output from the video feature extraction functionality and modulates the video features based on the darkness evaluation of the video. Similarly, the audio feature modulation functionality modulates the audio features from the audio feature extraction functionality based on the output from the darkness evaluation functionality.
[0064] The modulated video and audio features may be processed by activity classifier functionality 430 that outputs a prediction of one or more activities captured in the video. The activity classifier functionality may generate separate activity predictions based on each of the modulated video features and modulated audio features and then combine the two separate predictions into the final activity predictions 432, for example by averaging the two predictions together.
[0065] In order to validate the effectiveness of the sight and sound model for activity recognition in the dark, datasets are needed that allow for multimodal analysis and ensure sufficient instances of videos recorded in darkness. XD-Violence, Kinetics-Sound and Moments-in-Time meet these criteria. Yet, they still mainly contain videos captured under normal illumination conditions, so the datasets were repurposed and reorganized to provide the Dark-Violence, Dark-Kinetics, and Dark-Moments datasets.
[0066] Dark-Violence: The XD-Violence dataset serves as the largest dataset for violence and non-violence detection with 4,754 untrimmed videos, including the soundtrack, and annotated by video-level classification labels. There are a total of 610 videos captured in the dark and the original test set merely contains a few. The training and test sets were reorganized. The new training set includes all the 4,144 videos under normal illumination conditions and 200 dark videos, while the remaining 410 dark videos form the test set.
[0067] Dark-Kinetics: The Kinetics-Sound dataset is a subset of the Kinetics-400 dataset for activity recognition with 18 classes that make sound. The original training and validation sets, with around 20,000 and 1,400 video clips respectively, are used to learn the current model. Instead of directly using the original test set, the Dark-Kinetics dataset uses only videos captured in the dark for evaluation. 133 videos were selected from the test set of Kinetics-Sound and, to enlarge the number of videos, a further 325 videos were manually collected from the training and validation sets of Kinetics-700 that are not in the training set of Kinetics-400. The final Dark-Kinetics dataset includes 458 dark videos captured in the wild for testing, with their original annotations maintained.
[0068] Dark-Moments: The Moments in Time dataset is a large-scale human-annotated collection of one million short videos corresponding to 305 activity classes. It has a large coverage and diversity in both sight and sound modalities. The original training set of 727,305 video clips is used for model learning and the validation set of 30,500 clips for performance evaluation. 1,135 clips captured in the dark were further selected from the validation set, forming the Dark-Moments dataset to test the models in the dark.
[0069] Day2Dark-Moments: For many activities, videos captured during daytime are easy to obtain, while nighttime videos can be scarce, such as for the activities of nocturnal animals. Training on daytime videos only can lead a visual model to fail in the dark. To further investigate how sound can help in daylight-to-darkness generalization, the Moments-in-Time dataset was reorganized to introduce a new challenging task, in which the training set contains daylight videos only and dark videos form the test set. 54 classes were selected out of the 305 classes, for which most videos are under normal illumination. The videos belonging to these classes were then split into training and validation sets according to their illuminance. As a result, in this task, a model is expected to learn from daytime videos and generalize to nighttime test videos.
TABLE 1
Activity recognition in the dark datasets

  Dataset            Source(s)                  Dark Test  Dark Train  Day Train  Activities
  Dark-Violence      XD-Violence                    410         200       4,144           2
  Dark-Kinetics      Kinetics-Sound & Kinetics      458       1,760      18,541          18
  Dark-Moments       Moments-in-Time              1,135      37,336     689,969         305
  Day2Dark-Moments   Moments-in-Time              1,813           0     128,418          54
[0071] The activity recognition in the dark process was implemented using PyTorch with the Adam optimizer and one NVIDIA GTX1080Ti GPU. For the sight stream, all input video frames were resized to 112×112, and each clip contains 16 frames. For Dark-Kinetics, the visual models trained on Kinetics-400 provided by Hara et al. were adopted. The publicly released visual model of Moments-in-Time was used for experiments on Dark-Moments and Day2Dark-Moments. For Dark-Violence, no off-the-shelf state-of-the-art visual model is available, so four NVIDIA GTX1080Ti GPUs were used to learn the sight-only model based on a ResNext-101 backbone. Weights were initialized from the Kinetics pre-trained checkpoint and only the last residual block and the final classification layer were re-trained. The multi-instance learning strategy was adopted with a batch size of 4. The training takes 30 epochs with an initial learning rate of 10.sup.−2 gradually decreased to 10.sup.−4. For the sound stream, a ResNet-18 was used for all datasets and the weights were initialized from the VGGSound pre-trained checkpoint. The last residual block and the final classification layer were re-trained to adapt to the different datasets with an initial learning rate of 10.sup.−2 that is gradually decreased to 10.sup.−4, using a batch size of 16. For the darkness evaluation module, the training takes 15 epochs with an initial learning rate of 10.sup.−2 decreased gradually to 10.sup.−4 with a batch size of 2.
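A learning rate decayed from 10⁻² to 10⁻⁴ over a training run can be produced by a schedule such as the following; the geometric decay rule is an assumption, since the disclosure only states the start and end values:

```python
def lr_schedule(epoch, total_epochs, lr_start=1e-2, lr_end=1e-4):
    """Geometric decay from lr_start at epoch 0 to lr_end at the last
    epoch. The exact decay rule used in the experiments is not stated;
    this is one common choice."""
    frac = epoch / max(total_epochs - 1, 1)
    return lr_start * (lr_end / lr_start) ** frac

# Example: the 30-epoch sight-stream schedule described above
rates = [lr_schedule(e, 30) for e in range(30)]
```

The same helper covers the other stages by changing `total_epochs` (15 for the darkness evaluation module, 35 for the attention and classifier modules).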
[0072] For the cross-modal channel attention and classifier, each module is trained separately but with the same strategy. During training, the brightness of videos captured under normal illuminations is lowered, while videos captured in the dark keep their original visual content. The modules are trained on the whole training set for 35 epochs with an initial learning rate of 10.sup.−2 that is gradually decreased to 10.sup.−4. The modules were then fine-tuned on videos captured in the dark with a learning rate of 10.sup.−4 for 20 epochs. The batch size is set to 16. For the experiments on Day2Dark-Moments, only daylight videos were used, without the final fine-tuning stage, as no dark videos are available.
[0073] Once trained, the model's performance was evaluated. As described above, the model may include three darkness-aware components: the darkness evaluation component, the cross-modal channel attention component, and the adaptive classifier component. The performance of several network variants on Dark-Kinetics was evaluated to validate the efficacy of each component. The results are shown in Table 2. "Attention" indicates the channel attention without the temperature. It is generated from the visual feature under the sight-only setting, while adopting a cross-modal scheme as described above for the audiovisual setting. The "Attention & classifier" variant uses the feature outputted by the penultimate layer of the sight-only model to replace the darkness-aware feature for generating the temperature and coefficient. For "Darkness-aware attention" and "Darkness-aware attention & classifier", the visual feature was used to generate the channel attention under the sight-only setting, while combining sight and sound by the method described above.
TABLE 2
Activity recognition accuracy for different models

  Model                                        Accuracy
  Sight or Sound
    Sight model                                0.445
    Sound model                                0.557
  Sight & Sound
    Late fusion                                0.646
    Attention                                  0.736
    Attention & classifier                     0.739
    Darkness-aware attention                   0.753
    Darkness-aware attention & classifier      0.762
[0074] In isolation, the sound stream performs better than the sight stream on dark videos, with an accuracy of 0.557. When the sight and sound streams are combined by averaging their predictions (late fusion), accuracy increases to 0.646, surpassing all variants of the sight-only model. This demonstrates that the audio signals provide useful temporal information. The illumination-insensitive audiovisual models, i.e., "Attention" and "Attention & classifier", achieve accuracies of 0.736 and 0.739, inferior to the 0.753 achieved by the model with darkness-aware cross-modal channel attention. It is therefore believed that the darkness evaluation module provides useful information about lighting conditions. When the classifier boundary is further adjusted by the darkness feature, the best result, an accuracy of 0.762, is obtained.
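The late-fusion baseline described above simply averages the per-class predictions of the two streams and takes the highest-scoring class. A minimal sketch:

```python
def late_fusion(sight_probs, sound_probs):
    """Late fusion baseline: average the per-class probabilities from the
    sight and sound streams, then return the index of the predicted
    activity class."""
    avg = [(s + a) / 2 for s, a in zip(sight_probs, sound_probs)]
    return max(range(len(avg)), key=avg.__getitem__)
```

For example, if the sight stream favors class 0 but the sound stream strongly favors class 1, the averaged scores can flip the final decision toward the more confident modality.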
[0075] The current sight & sound model for activity recognition in the dark was also compared with a sight-only model that takes the enhanced frames produced by KinD as inputs. As shown in Table 3, image enhancement fails to improve performance over the sight-only model, whereas the sight & sound model achieves the best accuracy.
TABLE 3. Sight & sound model vs. image enhancement

  Model                         Accuracy
  Sight model                   0.225
  KinD-enhanced sight model     0.183
  Sight & sound model           0.250
[0076] Table 4 provides the results of the current model for video recognition in the dark on three tasks, with the ResNext-101 architecture as the backbone for the sight stream. On Dark-Kinetics and Dark-XD-Violence, the sound-only model outperforms the sight-only model under low illumination, as activities in these two datasets usually come with informative audio signals. Combining the sight and sound models with the darkness-aware cross-modal components yields the best results, with accuracies of 0.762 and 0.955, respectively.
TABLE 4. Sight-only and sound-only models vs. the current multimodal model

  Model           Dark-XD-Violence   Dark-Kinetics   Dark-Moments
  Sight model     0.815              0.445           0.214
  Sound model     0.941              0.557           0.021
  Current model   0.955              0.762           0.243
[0077] On Dark-Moments-in-Time, the sound-only model fails at event classification, with a very low accuracy of 0.021, as many event classes do not produce distinguishable sounds. Nevertheless, integrating the sound and sight models improves accuracy over the sight-only model, from 0.214 to 0.243.
[0078] Activity recognition under usual illumination. To further demonstrate the benefit of the current model, the method was also tested under usual illumination conditions, as is common in the literature. The model was trained and tested on the original partitions of Kinetics-Sound and Kinetics-400. The results are shown in Table 5. The full model was applied only under low illumination; otherwise, the prediction of the sight-only model was used. Overall, the full sight and sound model achieves better performance than the sight-only model, owing to higher accuracy on videos under low illumination. Qualitative results for different components of the current model are also depicted in the accompanying figures.
TABLE 5. Activity recognition under usual illumination

  Model           Kinetics-Sound   Kinetics-400
  Sight model     0.828            0.828
  Sound model     0.631            0.111
  Current model   0.829            0.828
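The gating described in paragraph [0078], where the full sight & sound model is used only under low illumination and the sight-only prediction otherwise, can be sketched as follows. The mean-brightness threshold here is a stand-in assumption for the learned darkness evaluation component described earlier in the document.

```python
def select_prediction(frame_brightness, sight_pred, audiovisual_pred,
                      dark_threshold=0.3):
    """Gate between models by estimated illumination.

    frame_brightness: mean pixel brightness in [0, 1] (a simple heuristic
    assumed here; the document's darkness evaluation component is a
    learned model, not a fixed threshold).
    """
    if frame_brightness < dark_threshold:
        # Low light: rely on the full darkness-aware sight & sound model.
        return audiovisual_pred
    # Normal light: the sight-only prediction suffices.
    return sight_pred
```

This design keeps normal-light accuracy identical to the sight-only baseline while letting the audiovisual model contribute exactly where the sight stream is unreliable, consistent with the per-dataset numbers in Table 5.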
[0079] Activity recognition from day to dark. As mentioned above, for some activities, videos captured in the dark can be scarce. Since audio signals are expected to play an important role in activity recognition in the dark, the current model was applied to a new, challenging task of daylight-to-darkness domain generalization, in which only daylight videos are available for model learning and dark videos are reserved for testing. The results are listed in Table 6. While the sound-only model fails to classify the events due to indistinguishable audio features, the sight-only counterpart shows poor generalization to videos in the dark. The current darkness-aware sight and sound model, however, outperforms both unimodal models by a large margin, indicating that the current method generalizes well across illumination gaps.
TABLE 6. Activity recognition from day to dark

  Model           Accuracy
  Sound model     0.072
  Sight model     0.182
  Current model   0.227
[0080] Although certain components and steps have been described, it is contemplated that individually described components, as well as steps, may be combined together into fewer components or steps or the steps may be performed sequentially, non-sequentially or concurrently. Further, although described above as occurring in a particular order, one of ordinary skill in the art having regard to the current teachings will appreciate that the particular order of certain steps relative to other steps may be changed. Similarly, individual components or steps may be provided by a plurality of components or steps. One of ordinary skill in the art having regard to the current teachings will appreciate that the components and processes described herein may be provided by various combinations of software, firmware and/or hardware, other than the specific implementations described herein as illustrative examples.
[0081] The techniques of various embodiments may be implemented using software, hardware and/or a combination of software and hardware. Various embodiments are directed to apparatus, e.g. a node which may be used in a communications system or data storage system. Various embodiments are also directed to non-transitory machine, e.g., computer, readable medium, e.g., ROM, RAM, CDs, hard discs, etc., which include machine readable instructions for controlling a machine, e.g., processor to implement one, more or all of the steps of the described method or methods.
[0082] Some embodiments are directed to a computer program product comprising a computer-readable medium comprising code for causing a computer, or multiple computers, to implement various functions, steps, acts and/or operations, e.g. one or more or all of the steps described above. Depending on the embodiment, the computer program product can, and sometimes does, include different code for each step to be performed. Thus, the computer program product may, and sometimes does, include code for each individual step of a method, e.g., a method of operating a communications device, e.g., a wireless terminal or node. The code may be in the form of machine, e.g., computer, executable instructions stored on a computer-readable medium such as a RAM (Random Access Memory), ROM (Read Only Memory) or other type of storage device. In addition to being directed to a computer program product, some embodiments are directed to a processor configured to implement one or more of the various functions, steps, acts and/or operations of one or more methods described above. Accordingly, some embodiments are directed to a processor, e.g., CPU, configured to implement some or all of the steps of the method(s) described herein. The processor may be for use in, e.g., a communications device or other device described in the present application.
[0083] Numerous additional variations on the methods and apparatus of the various embodiments described above will be apparent to those skilled in the art in view of the above description. Such variations are to be considered within the scope of this disclosure.