G06V10/806

System and method for multimodal emotion recognition

Systems, methods, apparatuses, and computer program products are provided for multimodal emotion recognition. The method may include receiving raw input from an input source. The method may also include extracting one or more feature vectors from the raw input. The method may further include determining an effectiveness of the one or more feature vectors. Further, the method may include performing, based on the determination, multiplicative fusion processing on the one or more feature vectors. The method may also include predicting, based on results of the multiplicative fusion processing, one or more emotions of the input source.
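As a rough illustration of the fusion step, the sketch below assumes that each modality has already been reduced to a class-probability vector and that the effectiveness determination yields a simple keep/drop flag per modality. Effective modalities are combined by elementwise product and renormalized. The function name `multiplicative_fusion`, the boolean flags, and the use of probabilities instead of raw feature vectors are illustrative assumptions, not details taken from the abstract.

```python
from math import prod


def multiplicative_fusion(modal_probs, effective):
    """Fuse per-modality class probabilities by elementwise product.

    modal_probs: list of equal-length probability lists, one per modality.
    effective:   list of booleans; False drops that modality (assumed check).
    """
    kept = [p for p, ok in zip(modal_probs, effective) if ok]
    if not kept:
        raise ValueError("no effective modality to fuse")
    # Elementwise product across the kept modalities.
    fused = [prod(values) for values in zip(*kept)]
    total = sum(fused)
    if total == 0:
        raise ValueError("fused scores are all zero")
    # Renormalize so the fused scores sum to 1.
    return [v / total for v in fused]


# Hypothetical three-class example (e.g. happy, sad, neutral).
face = [0.6, 0.3, 0.1]
speech = [0.5, 0.4, 0.1]
text = [0.1, 0.1, 0.8]  # flagged as ineffective in this example
fused = multiplicative_fusion([face, speech, text], [True, True, False])
predicted = max(range(len(fused)), key=fused.__getitem__)
```

Because the scores are multiplied, a class must be supported by every retained modality to score highly, which is why dropping an unreliable modality before fusion matters.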

Gesture recognition using multiple antenna

Various embodiments wirelessly detect micro-gestures using multiple antennas of a gesture sensor device. At times, the gesture sensor device transmits multiple outgoing radio frequency (RF) signals, each outgoing RF signal transmitted via a respective antenna of the gesture sensor device. The outgoing RF signals are configured to help capture information that can be used to identify micro-gestures performed by a hand. The gesture sensor device captures incoming RF signals generated by the outgoing RF signals reflecting off the hand, and then analyzes the incoming RF signals to identify the micro-gesture.

SYSTEMS AND METHODS FOR VIDEO AND LANGUAGE PRE-TRAINING
20230154146 · 2023-05-18

Embodiments describe a method of video-text pre-training that effectively learns cross-modal representations from sparse video frames and text. Specifically, an align and prompt framework provides a video and language pre-training framework that encodes the frames and text independently using a transformer-based video encoder and a text encoder. A multi-modal encoder is then employed to capture cross-modal interaction between a plurality of video frames and a plurality of texts. The pre-training includes prompting entity modeling, which enables the model to capture fine-grained region-entity alignment.

SCENE SEGMENTATION AND OBJECT TRACKING
20230386052 · 2023-11-30

Systems and techniques are provided for performing scene segmentation and object tracking. For example, a method for processing one or more frames is provided. The method may include determining first one or more features from a first frame, where the first frame includes a target object. The method may include obtaining a first mask associated with the first frame, where the first mask includes an indication of the target object. The method may further include generating, based on the first mask and the first one or more features, a representation of a foreground and a background of the first frame. The method may include determining second one or more features from a second frame and determining, based on the representation of the foreground and the background of the first frame and the second one or more features, a location of the target object in the second frame.
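One simple way to picture the foreground/background representation is masked average pooling: the features under the mask are averaged into a foreground prototype, the remaining features into a background prototype, and each position of the second frame is assigned to whichever prototype it matches better. This is a minimal sketch under that assumption; the abstract does not say how the representation is built, and the helper names `masked_mean` and `locate_target` are hypothetical.

```python
def masked_mean(features, mask, target):
    """Average the feature vectors at grid positions where mask == target."""
    picked = [f
              for row_f, row_m in zip(features, mask)
              for f, m in zip(row_f, row_m)
              if m == target]
    if not picked:
        raise ValueError("mask selects no positions")
    dim = len(picked[0])
    return [sum(f[d] for f in picked) / len(picked) for d in range(dim)]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def locate_target(features2, fg_proto, bg_proto):
    """Return (row, col) positions that match the foreground prototype better."""
    return [(r, c)
            for r, row in enumerate(features2)
            for c, f in enumerate(row)
            if dot(f, fg_proto) > dot(f, bg_proto)]


# First frame: 2x2 grid of 2-D features; the mask marks the top-left cell.
frame1 = [[[1.0, 0.0], [0.0, 1.0]],
          [[0.0, 1.0], [0.0, 1.0]]]
mask1 = [[1, 0],
         [0, 0]]
fg = masked_mean(frame1, mask1, 1)
bg = masked_mean(frame1, mask1, 0)

# Second frame: the target-like feature now sits in the bottom-right cell.
frame2 = [[[0.0, 1.0], [0.1, 0.9]],
          [[0.0, 1.0], [0.9, 0.1]]]
location = locate_target(frame2, fg, bg)
```

In this toy input only the bottom-right cell of the second frame is closer to the foreground prototype, so it is the only position returned.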

LOW-RESOLUTION FACE RECOGNITION DEVICE AND LOW-RESOLUTION FACE RECOGNIZER LEARNING DEVICE AND METHOD

The present disclosure relates to a low-resolution face recognition device, which includes a high-resolution face image inputter; a low-resolution face image inputter; a high-resolution face feature extractor configured to extract a high-resolution face feature by using high-resolution and low-resolution face images; a face quality feature extractor configured to extract face quality features by using the high-resolution and low-resolution face images; a feature combiner configured to detect the high-resolution and low-resolution face features by concatenating the high-resolution face feature and the face quality feature; a feature adaptation network configured to extract a high-resolution face feature map and a low-resolution face feature map by using the detected high-resolution and low-resolution face features, respectively; and a consistency meter configured to determine a face ID by measuring consistency of a face feature map by using the extracted high-resolution and low-resolution face feature maps.

VIDEO RETRIEVAL METHOD AND APPARATUS

Implementations of the present specification provide a video retrieval method and apparatus. In the method, a video frame in a video to be matched is obtained; an image feature and a text feature are extracted from the video frame; the image feature and the text feature are fused based on a center variable used to represent a cluster center to obtain a fused feature, where the center variable is used to associate features of different modes of a same video; and video retrieval is performed in a video database based on the fused feature to determine a video in the video database that matches the video to be matched, where a plurality of videos and video features corresponding to the plurality of videos are stored in the video database.
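The abstract does not spell out how the center variable is used. One plausible reading, sketched below, is a soft-assignment scheme in which image and text features share a set of cluster centers and the fused feature is the concatenation of the weighted residuals around each center. The functions `soft_assign` and `fuse_by_centers`, the squared-distance softmax, and the toy values are all assumptions for illustration.

```python
from math import exp


def soft_assign(feature, centers):
    """Softmax over negative squared distances from a feature to each center."""
    scores = [-sum((x - c) ** 2 for x, c in zip(feature, center))
              for center in centers]
    top = max(scores)  # subtract the max for numerical stability
    exps = [exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_by_centers(features, centers):
    """Accumulate weighted residuals of all features around shared centers."""
    dim = len(centers[0])
    residuals = [[0.0] * dim for _ in centers]
    for f in features:
        weights = soft_assign(f, centers)
        for k, (w, center) in enumerate(zip(weights, centers)):
            for d in range(dim):
                residuals[k][d] += w * (f[d] - center[d])
    # Flatten the per-center residuals into one fused vector.
    return [v for row in residuals for v in row]


centers = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical cluster centers
image_feature = [0.9, 0.1]
text_feature = [0.2, 0.8]
weights = soft_assign(image_feature, centers)
fused = fuse_by_centers([image_feature, text_feature], centers)
```

Because both modalities are expressed relative to the same centers, the fused vector has a fixed length (number of centers times feature dimension) and could be compared against stored video features with any standard distance.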

COMBINED VISION AND LANGUAGE LEARNING MODELS FOR AUTOMATED MEDICAL REPORTS GENERATION
20230386646 · 2023-11-30

A method of generating a medical report is presented herein. In some embodiments, the method includes receiving a medical image and at least one natural language medical question, extracting at least one image feature from the image; extracting at least one text feature from the question; and fusing the at least one image feature with the at least one text feature to form a combined feature. Some embodiments further include encoding, by an encoder, the combined feature to form a transformed combined feature; computing a set of prior context features based on a similarity between the transformed combined feature and each of a set of transformed text features derived from a set of training natural language answers; and generating, by a decoder, a first natural language answer conditioned on the transformed combined feature and the set of prior context features.
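The prior-context step can be pictured as a similarity lookup: the transformed combined feature is compared against the stored features of training answers, and the closest ones are kept as context for the decoder. The sketch below assumes cosine similarity and a top-k selection; neither is specified in the abstract, and `prior_context` is a hypothetical helper name.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def prior_context(combined, answer_features, k=2):
    """Return the k stored answer features most similar to `combined`."""
    ranked = sorted(answer_features,
                    key=lambda f: cosine(combined, f),
                    reverse=True)
    return ranked[:k]


combined = [1.0, 0.0]                             # transformed combined feature
answers = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]    # transformed answer features
context = prior_context(combined, answers, k=2)
```

The orthogonal answer feature is ranked last and dropped, so the decoder would be conditioned only on the two answers that resemble the current image-question pair.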

Vision-LiDAR fusion method and system based on deep canonical correlation analysis

A vision-LiDAR fusion method and system based on deep canonical correlation analysis are provided. The method comprises: collecting RGB images and point cloud data of a road surface synchronously; extracting features of the RGB images to obtain RGB features; performing coordinate system conversion and rasterization on the point cloud data in turn, and then extracting features to obtain point cloud features; inputting point cloud features and RGB features into a pre-established and well-trained fusion model at the same time, to output feature-enhanced fused point cloud features, wherein the fusion model fuses RGB features to point cloud features by using correlation analysis and in combination with a deep neural network; and inputting the fused point cloud features into a pre-established object detection network to achieve object detection. A similarity calculation matrix is utilized to fuse two different modal features.
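The final sentence about a similarity calculation matrix can be illustrated without the trained deep canonical correlation analysis model itself. In the sketch below, each point cloud feature is scored against every RGB feature, the scores are normalized with a softmax, and the weighted RGB features are added to the point cloud feature. This is a generic similarity-weighted fusion, not the patented model; it assumes both feature sets already share a dimension, and `fuse_with_similarity` is a hypothetical name.

```python
from math import exp


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_with_similarity(point_feats, rgb_feats):
    """Enhance each point cloud feature with similarity-weighted RGB features."""
    fused = []
    for p in point_feats:
        # One row of the similarity matrix: this point feature vs. all RGB features.
        row = [sum(a * b for a, b in zip(p, r)) for r in rgb_feats]
        weights = softmax(row)
        attended = [sum(w * r[d] for w, r in zip(weights, rgb_feats))
                    for d in range(len(p))]
        fused.append([pv + av for pv, av in zip(p, attended)])
    return fused


point_feats = [[1.0, 2.0]]
rgb_feats = [[0.5, 0.5]]
fused_points = fuse_with_similarity(point_feats, rgb_feats)
```

The output keeps one vector per point cloud feature, which matches the abstract's description of fusing RGB features into point cloud features before object detection.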

End-to-end multimodal gait recognition method based on deep learning

An end-to-end multimodal gait recognition method based on deep learning includes: first extracting gait appearance features (color, texture and the like) from RGB video frames, and obtaining a mask by semantic segmentation of the RGB video frames; then extracting gait mask features (contour and the like) from the mask; and finally performing fusion and recognition on the two kinds of features. The method extracts the gait appearance features and mask features with an improved GaitSet, speeds up semantic segmentation without sacrificing accuracy by using a simplified FCN, and fuses the gait appearance features and the mask features to obtain a more complete information representation.

Methods and Systems for Generating Composite Image Descriptors
20220414393 · 2022-12-29

An illustrative image descriptor generation system determines a subset of image descriptors from a plurality of image descriptors that each correspond to a different feature point included within an image. The subset of image descriptors is determined based on geometric proximity, within the image, of respective feature points of the subset of image descriptors to a feature point of a primary image descriptor. The image descriptor generation system then selects a secondary image descriptor from the subset of image descriptors and combines the primary image descriptor and the secondary image descriptor to form a composite image descriptor. Corresponding methods and systems are also disclosed.
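The steps in this abstract map onto a short procedure: rank the other feature points by distance to the primary feature point, keep the nearest few as the subset, pick one as the secondary descriptor, and join the two descriptors. The sketch below assumes Euclidean distance in image coordinates, picks the nearest neighbor as the secondary descriptor, and combines by concatenation; the abstract leaves the selection rule and the combination method open, and `composite_descriptor` is a hypothetical name.

```python
from math import dist


def composite_descriptor(primary_idx, points, descriptors, k=3):
    """Build a composite descriptor from a primary and a nearby secondary one.

    points:      list of (x, y) feature point coordinates in the image.
    descriptors: list of descriptor vectors, aligned with `points`.
    k:           size of the geometric-proximity subset (assumed parameter).
    """
    origin = points[primary_idx]
    others = [i for i in range(len(points)) if i != primary_idx]
    # Subset of descriptors whose feature points lie closest to the primary point.
    subset = sorted(others, key=lambda i: dist(points[i], origin))[:k]
    if not subset:
        raise ValueError("no neighboring feature points available")
    # Assumed selection rule: take the nearest neighbor as the secondary.
    secondary_idx = subset[0]
    combined = list(descriptors[primary_idx]) + list(descriptors[secondary_idx])
    return combined, secondary_idx


points = [(0, 0), (10, 10), (1, 0), (5, 5)]
descriptors = [[1, 2], [3, 4], [5, 6], [7, 8]]
composite, secondary = composite_descriptor(0, points, descriptors, k=2)
```

Concatenation doubles the descriptor length, so any matcher that consumes the composite would need to expect the longer vector.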