G06V10/806

METHOD FOR COUNTING SUCKLING PIGLETS BASED ON SELF-ATTENTION SPATIOTEMPORAL FEATURE FUSION

A method for counting suckling piglets based on self-attention spatiotemporal feature fusion is disclosed, which includes: detecting a side-lying sow in a video frame by using CenterNet to acquire a key frame of suckling piglets and a region of interest of the video frame, thereby overcoming the interference of the movement of non-suckling piglets on the spatiotemporal feature extraction for the region of interest; transforming spatiotemporal features extracted by a spatiotemporal two-stream convolutional network from a key-frame video clip into spatiotemporal feature vectors, and inputting the obtained spatiotemporal feature vectors into a temporal, a spatial and a fusion transformer to obtain a self-attention matrix; performing an element-wise product of the self-attention matrix and the fused spatiotemporal features to obtain a self-attention spatiotemporal feature map; and inputting the self-attention spatiotemporal feature map into a regression branch for the number of suckling piglets to complete the counting of the suckling piglets.
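For illustration only, a minimal PyTorch sketch of the fusion-and-regression stage described above; the two-stream backbone and CenterNet detector are omitted, and the sigmoid gating used to form the self-attention matrix is an assumption (the abstract does not specify the exact attention form):

```python
import torch
import torch.nn as nn

class SelfAttentionSpatioTemporalFusion(nn.Module):
    """Sketch: fuse spatial and temporal feature tokens with transformer
    encoders, gate the concatenated features with the resulting attention
    weights (element-wise product), and regress a piglet count."""

    def __init__(self, dim=256, n_heads=4):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.temporal_tr = nn.TransformerEncoder(layer(), num_layers=1)
        self.spatial_tr = nn.TransformerEncoder(layer(), num_layers=1)
        self.fusion_tr = nn.TransformerEncoder(layer(), num_layers=1)
        self.regressor = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, spatial_feat, temporal_feat):
        # spatial_feat, temporal_feat: (batch, tokens, dim), flattened from the
        # two-stream network's feature maps over the region of interest
        s = self.spatial_tr(spatial_feat)
        t = self.temporal_tr(temporal_feat)
        fused = self.fusion_tr(torch.cat([s, t], dim=1))                    # fused spatiotemporal tokens
        attn = torch.sigmoid(fused)                                         # self-attention matrix (assumed gating form)
        weighted = attn * torch.cat([spatial_feat, temporal_feat], dim=1)   # element-wise product
        return self.regressor(weighted.mean(dim=1))                         # predicted number of suckling piglets

x_s = torch.randn(2, 49, 256)   # e.g. 7x7 spatial tokens per clip
x_t = torch.randn(2, 49, 256)
print(SelfAttentionSpatioTemporalFusion()(x_s, x_t).shape)  # torch.Size([2, 1])
```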

METHOD AND SYSTEM FOR THE AUTOMATIC CLASSIFICATION OF ROCKS ACCORDING TO THEIR MINERALS

The present disclosure refers to methods and systems for classifying rocks according to their minerals based on the processing of color images and hyperspectral images. The methods and systems of this disclosure achieve an efficient and low-cost classification of mineral-bearing rocks. In particular, the method applies classifiers that take color images and hyperspectral images and output the probability that the rock or rocks present in said images are suitable or waste.
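A minimal late-fusion sketch of such a classifier, assuming one convolutional branch per image type, a concatenated feature vector, and a single suitable-vs-waste probability; the band count and layer sizes are assumptions not taken from the abstract:

```python
import torch
import torch.nn as nn

class RockSuitabilityClassifier(nn.Module):
    """Sketch: an RGB branch and a hyperspectral branch, fused and mapped to
    the probability that the imaged rock is suitable (1 - p: waste)."""

    def __init__(self, hyperspectral_bands=128):
        super().__init__()
        self.rgb_branch = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.hsi_branch = nn.Sequential(
            nn.Conv2d(hyperspectral_bands, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(16 + 32, 1)

    def forward(self, rgb, hsi):
        fused = torch.cat([self.rgb_branch(rgb), self.hsi_branch(hsi)], dim=1)
        return torch.sigmoid(self.head(fused))  # probability of "suitable"

p = RockSuitabilityClassifier()(torch.randn(1, 3, 64, 64), torch.randn(1, 128, 64, 64))
print(p.shape)  # torch.Size([1, 1])
```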

Method and device for classifying objects
11645848 · 2023-05-09 ·

A method for classifying objects comprises providing measurement data from a sensor to a feature extraction unit and extracting modality-independent features from the measurement data by means of the feature extraction unit, wherein the modality-independent features are independent of the sensor modality of the sensor, so that the sensor modality cannot be inferred from the modality-independent features.
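As an illustrative sketch of this idea: per-sensor adapters project raw measurements into a shared feature space of fixed size, and classification operates only on those shared features. The adapter/classifier shapes and the modality names are assumptions; any adversarial mechanism that would actively enforce modality independence is not shown:

```python
import torch
import torch.nn as nn

class ModalityAgnosticExtractor(nn.Module):
    """Sketch: map measurements from different sensors into one shared
    feature space so downstream classification does not depend on which
    sensor produced the data."""

    def __init__(self, feat_dim=64, n_classes=10):
        super().__init__()
        self.adapters = nn.ModuleDict({
            "camera": nn.Linear(1024, feat_dim),   # e.g. flattened image patch (assumed size)
            "radar": nn.Linear(256, feat_dim),     # e.g. range-Doppler cells (assumed size)
        })
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, x, modality):
        features = torch.relu(self.adapters[modality](x))  # modality-independent features
        return self.classifier(features)

model = ModalityAgnosticExtractor()
print(model(torch.randn(4, 1024), "camera").shape)  # torch.Size([4, 10])
```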

SYSTEMS AND METHODS FOR VIDEO AND LANGUAGE PRE-TRAINING
20230154188 · 2023-05-18 ·

Embodiments describe a method of video-text pre-training to effectively learn cross-modal representations from sparse video frames and text. Specifically, an align-and-prompt framework provides a video and language pre-training framework that encodes the frames and text independently using a transformer-based video encoder and a text encoder. A multi-modal encoder is then employed to capture cross-modal interaction between a plurality of video frames and a plurality of texts. The pre-training includes prompting entity modeling, which enables the model to capture fine-grained region-entity alignment.
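A minimal sketch of the encoder layout described above, assuming unimodal transformer encoders followed by a single cross-attention step for the multi-modal encoder; token dimensions and layer counts are assumptions:

```python
import torch
import torch.nn as nn

class AlignAndPromptSketch(nn.Module):
    """Sketch: a video encoder and a text encoder run independently, then a
    multi-modal encoder applies cross-attention from text tokens to frame
    tokens to capture cross-modal interaction."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        enc = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=2)
        self.video_encoder = enc()
        self.text_encoder = enc()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frame_tokens, text_tokens):
        v = self.video_encoder(frame_tokens)    # (B, F, dim) sparse frame tokens
        t = self.text_encoder(text_tokens)      # (B, T, dim) text tokens
        fused, _ = self.cross_attn(query=t, key=v, value=v)  # cross-modal interaction
        return v, t, fused

v, t, fused = AlignAndPromptSketch()(torch.randn(2, 8, 256), torch.randn(2, 16, 256))
print(fused.shape)  # torch.Size([2, 16, 256])
```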

EMOTION RECOGNITION IN MULTIMEDIA VIDEOS USING MULTI-MODAL FUSION-BASED DEEP NEURAL NETWORK
20230154172 · 2023-05-18 ·

A system and method of landmark detection using emotion recognition in multimedia videos with a multi-modal fusion-based deep neural network are provided. The system includes circuitry and a memory configured to store a multimodal fusion network which includes one or more feature extractors, a network of transformer encoders, a fusion attention network, and an output network coupled to the fusion attention network. The system feeds a multimodal input to the one or more feature extractors. The multimodal input is associated with an utterance depicted in one or more videos. The system generates input embeddings as the output of the one or more feature extractors for the multimodal input and further generates a set of emotion-relevant features based on the input embeddings. The system further generates a fused-feature representation of the set of emotion-relevant features and predicts an emotion label for the utterance based on the fused-feature representation.
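A compact sketch of that fusion pipeline, assuming three modalities (audio, text, video), one transformer encoder per modality, and a multi-head attention layer as the fusion attention network; the modality set, feature sizes, and seven-way emotion label space are assumptions:

```python
import torch
import torch.nn as nn

class MultimodalEmotionSketch(nn.Module):
    """Sketch: per-modality input embeddings -> transformer encoders
    (emotion-relevant features) -> attention-based fusion over modalities
    (fused-feature representation) -> emotion label for the utterance."""

    def __init__(self, dim=128, n_emotions=7):
        super().__init__()
        enc = lambda: nn.TransformerEncoderLayer(dim, 4, batch_first=True)
        self.encoders = nn.ModuleDict({m: nn.TransformerEncoder(enc(), 1)
                                       for m in ("audio", "text", "video")})
        self.fusion_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.out = nn.Linear(dim, n_emotions)

    def forward(self, embeddings):
        # embeddings: dict of (B, seq, dim) input embeddings per modality
        feats = torch.stack([self.encoders[m](x).mean(dim=1)
                             for m, x in embeddings.items()], dim=1)  # (B, modalities, dim)
        fused, _ = self.fusion_attn(feats, feats, feats)              # fused-feature representation
        return self.out(fused.mean(dim=1))                            # emotion logits

x = {m: torch.randn(2, 10, 128) for m in ("audio", "text", "video")}
print(MultimodalEmotionSketch()(x).shape)  # torch.Size([2, 7])
```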

APPARATUS AND METHOD WITH IMAGE SEGMENTATION

Provided are an image segmentation apparatus and method. The image segmentation method includes: extracting, from an image, a feature map of the image; generating a second slot matrix by associating the feature map of the image with a first slot matrix corresponding to the image; and obtaining segmentation results of the image based on the second slot matrix.
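A minimal sketch of the slot-update idea, assuming cross-attention as the association between the feature map and the first slot matrix and slot-to-pixel similarity for the segmentation result; the update rule and dimensions are assumptions:

```python
import torch
import torch.nn as nn

class SlotSegmentationSketch(nn.Module):
    """Sketch: a learned first slot matrix attends over the image feature map
    to produce a second slot matrix; per-pixel segmentation comes from
    slot-to-feature similarity."""

    def __init__(self, n_slots=8, dim=64):
        super().__init__()
        self.init_slots = nn.Parameter(torch.randn(n_slots, dim))  # first slot matrix
        self.attn = nn.MultiheadAttention(dim, 4, batch_first=True)

    def forward(self, feat_map):
        b, c, h, w = feat_map.shape
        feats = feat_map.flatten(2).transpose(1, 2)             # (B, HW, C) image features
        slots = self.init_slots.expand(b, -1, -1)               # (B, S, C)
        slots2, _ = self.attn(slots, feats, feats)              # second slot matrix
        logits = torch.einsum("bsc,bpc->bsp", slots2, feats)    # slot-to-pixel affinity
        return logits.argmax(dim=1).view(b, h, w)               # segmentation result

print(SlotSegmentationSketch()(torch.randn(1, 64, 16, 16)).shape)  # torch.Size([1, 16, 16])
```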

METHOD AND DEVICE WITH NEURAL NETWORK TRAINING AND IMAGE PROCESSING

A processor-implemented method includes: generating a first output of each of two or more layers of a teacher network, based on a first image; generating pseudo labels respectively corresponding to the first outputs, based on the first outputs; generating a second output using one or more layers of a student network comprising an output layer, based on the first image; generating prediction results respectively corresponding to the two or more layers of the teacher network, based on the second output; and training the student network by updating the student network based on the pseudo labels and the prediction results.
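A small sketch of that training step, assuming simple linear networks, two teacher layers used as pseudo-label sources, per-layer projection heads on the student output, and an MSE objective; all of these choices are assumptions layered on the abstract:

```python
import torch
import torch.nn as nn

# Sketch: intermediate teacher outputs become pseudo labels, the student's
# output is projected into predictions matching each teacher layer, and the
# student (plus projection heads) is updated on the mismatch.
teacher = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))
student = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))
heads = nn.ModuleList([nn.Linear(64, 64) for _ in range(2)])  # one head per teacher layer used
opt = torch.optim.Adam(list(student.parameters()) + list(heads.parameters()), lr=1e-3)

image = torch.randn(8, 32)                # stand-in for the first image
with torch.no_grad():
    pseudo_labels = [teacher[0](image),   # first output of an intermediate layer
                     teacher(image)]      # first output of the final layer

student_out = student(image)              # second output from the student
predictions = [head(student_out) for head in heads]
loss = sum(nn.functional.mse_loss(p, y) for p, y in zip(predictions, pseudo_labels))
opt.zero_grad(); loss.backward(); opt.step()
```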

Video content based on multiple capture devices

Techniques for video content based on multiple capture devices are described and are implementable to enable multiple video capture devices to be utilized for a single video feed. Generally, the described implementations enable video content captured by multiple video capture devices to be utilized, for instance by integrating different instances of video content into a merged video content stream. In at least one implementation, this enables higher-quality video attributes to be utilized than would be available from a single video content source.
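Purely as an illustration of merging frames from two capture devices, a naive side-by-side merge with OpenCV; the device indices, the resize-and-stack merge, and the output file are assumptions, and the patent's device-selection and quality logic is not reproduced here:

```python
import cv2
import numpy as np

# Sketch: read one frame from each of two local capture devices and merge
# them into a single frame of a combined video content stream.
cap_a, cap_b = cv2.VideoCapture(0), cv2.VideoCapture(1)
ok_a, frame_a = cap_a.read()
ok_b, frame_b = cap_b.read()
if ok_a and ok_b:
    frame_b = cv2.resize(frame_b, (frame_a.shape[1], frame_a.shape[0]))
    merged = np.hstack([frame_a, frame_b])   # one frame of the merged stream
    cv2.imwrite("merged_frame.png", merged)
cap_a.release(); cap_b.release()
```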

DRIVING SCENARIO UNDERSTANDING
20230154195 · 2023-05-18 ·

According to one aspect, intersection scenario description may be implemented by receiving a video stream of a surrounding environment of an ego-vehicle, extracting tracklets and appearance features associated with dynamic objects from the surrounding environment, extracting motion features associated with dynamic objects from the surrounding environment based on the corresponding tracklets, passing the appearance features through an appearance neural network to generate an appearance model, passing the motion features through a motion neural network to generate a motion model, passing the appearance model and the motion model through a fusion network to generate a fusion output, passing the fusion output through a classifier to generate a classifier output, and passing the classifier output through a loss function to generate a multi-label classification output associated with the ego-vehicle, dynamic objects, and corresponding motion paths.
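A minimal sketch of the appearance/motion/fusion/classifier chain described above, assuming simple fully connected networks for each stage and a binary cross-entropy loss as the multi-label objective; feature sizes and the number of scenario labels are assumptions:

```python
import torch
import torch.nn as nn

class ScenarioFusionSketch(nn.Module):
    """Sketch: appearance and motion features pass through separate networks,
    their outputs are fused, and a classifier produces multi-label logits."""

    def __init__(self, app_dim=512, mot_dim=128, hidden=256, n_labels=12):
        super().__init__()
        self.appearance_net = nn.Sequential(nn.Linear(app_dim, hidden), nn.ReLU())
        self.motion_net = nn.Sequential(nn.Linear(mot_dim, hidden), nn.ReLU())
        self.fusion_net = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU())
        self.classifier = nn.Linear(hidden, n_labels)

    def forward(self, appearance_feats, motion_feats):
        a = self.appearance_net(appearance_feats)     # appearance model output
        m = self.motion_net(motion_feats)             # motion model output
        fused = self.fusion_net(torch.cat([a, m], dim=-1))  # fusion output
        return self.classifier(fused)                 # classifier output (multi-label logits)

logits = ScenarioFusionSketch()(torch.randn(4, 512), torch.randn(4, 128))
targets = torch.randint(0, 2, (4, 12)).float()
loss = nn.functional.binary_cross_entropy_with_logits(logits, targets)  # multi-label loss
```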

MODAL INFORMATION COMPLETION METHOD, APPARATUS, AND DEVICE
20230206121 · 2023-06-29 ·

A modal information completion method, an apparatus, and a device are provided. A completion apparatus first obtains a modal information group, wherein the modal information group includes at least two pieces of modal information. Then, the completion apparatus may determine, based on an attribute of the modal information group, whether a part or all of the first modal information in the modal information group is missing. Subsequently, the completion apparatus determines a target feature vector of the first modal information based on a preset feature-vector mapping relationship and a feature vector of second modal information in the modal information group, thereby ensuring the accuracy of the target feature vector of the first modal information.
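A tiny sketch of the completion step, assuming the "preset feature vector mapping relationship" is realized as a small MLP and that audio is the missing first modality reconstructed from a video (second-modality) feature vector; the modality names, network, and dimensions are assumptions:

```python
import torch
import torch.nn as nn

# Sketch: when first-modality information is partly or fully missing, predict
# its target feature vector from the second modality's feature vector through
# a preset mapping network.
mapping = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 128))

video_feature = torch.randn(1, 256)        # feature vector of the second modality
audio_missing = True                       # decided from the modal group's attributes
if audio_missing:
    audio_feature = mapping(video_feature) # target feature vector of the first modality
    print(audio_feature.shape)             # torch.Size([1, 128])
```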