Patent classifications
G06V10/806
Video rating method, video rating device, equipment and storage medium
The present disclosure relates to a video rating method, a video rating device, equipment and a storage medium in the field of multimedia. An embodiment of the present disclosure provides a method for automatically rating a video based on features of multiple modalities of the video and on rating embeddings. Each rating is converted into a rating embedding in a vector space, the features of the multiple modalities are fused into a target feature, and a matching degree between the target feature and each rating embedding is acquired. The rating of the video is predicted according to the matching degree corresponding to each rating embedding, which can improve video rating efficiency and accuracy.
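The matching step described in this abstract can be illustrated with a minimal sketch. It assumes element-wise averaging as the fusion step and cosine similarity as the matching degree; the abstract specifies neither, and the function names, feature vectors and rating labels below are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity, used here as an assumed 'matching degree'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def fuse(modal_features):
    """Fuse per-modality vectors by element-wise averaging (an assumption)."""
    count = len(modal_features)
    return [sum(column) / count for column in zip(*modal_features)]

def predict_rating(modal_features, rating_embeddings):
    """Return the rating whose embedding best matches the fused target feature."""
    target = fuse(modal_features)
    scores = {name: cosine(target, emb) for name, emb in rating_embeddings.items()}
    return max(scores, key=scores.get), scores

# Hypothetical per-modality features and rating embeddings.
visual = [0.9, 0.1, 0.0]
audio = [0.7, 0.3, 0.0]
text = [0.8, 0.0, 0.2]
ratings = {"G": [1.0, 0.0, 0.0], "PG": [0.0, 1.0, 0.0], "R": [0.0, 0.0, 1.0]}

best, scores = predict_rating([visual, audio, text], ratings)
```

In a trained system both the fusion and the rating embeddings would presumably be learned; the fixed vectors here only show how the prediction follows from the per-rating matching degrees.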
Method and system for determining an activity of an occupant of a vehicle
A computer implemented method for determining an activity of an occupant of a vehicle comprises the following steps carried out by computer hardware components: capturing sensor data of the occupant using at least one sensor; determining respective two-dimensional or three-dimensional coordinates for a plurality of pre-determined portions of the body of the occupant based on the sensor data; determining at least one portion of the sensor data showing a pre-determined body part of the occupant based on the sensor data and the two-dimensional or three-dimensional coordinates; and determining the activity of the occupant based on the two-dimensional or three-dimensional coordinates and the at least one portion of the sensor data.
Systems, Methods and Computer Program Products for Associating Media Content Having Different Modalities
Systems, methods, and computer program products for associating one or more media content clips with other media content clips having a different modality by determining first embedding vectors of media content items of a first modality, receiving a media content clip of a second modality, determining a second embedding vector of the media content clip of the second modality, ranking the first embedding vectors based on a distance between each first embedding vector and the second embedding vector, and selecting one or more of the media content items of the first modality based on the ranking, thereby pairing media content clips based on emotion.
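The ranking and selection steps can be sketched as follows, assuming Euclidean distance and a shared embedding space for both modalities. The distance metric, the item names and the embedding values are all hypothetical; the abstract does not fix any of them.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_items(item_embeddings, query_embedding, top_k=1):
    """Rank first-modality items by distance to the second-modality clip
    and return the names of the top_k closest items."""
    ranked = sorted(
        item_embeddings,
        key=lambda name: euclidean(item_embeddings[name], query_embedding),
    )
    return ranked[:top_k]

# Hypothetical embeddings: audio tracks (first modality) and a video clip
# (second modality), assumed to live in the same vector space.
tracks = {
    "calm_piano": [0.1, 0.9],
    "upbeat_pop": [0.9, 0.8],
    "dark_ambient": [0.2, 0.1],
}
clip = [0.85, 0.75]

selected = rank_items(tracks, clip, top_k=2)
```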
Automated patient complexity classification for artificial intelligence tools
Mechanisms are provided for implementing a patient complexity classification (PCC) computing system. The PCC computing system receives medical image study data for a patient that comprises one or more medical image data structures and one or more corresponding medical image metadata data structures. A natural language processing engine of the PCC computing system performs natural language processing on the medical image metadata data structure to extract features indicative of at least one of patient or medical image characteristics. A complexity classifier of the PCC computing system evaluates the extracted features to determine a patient complexity indicating a complexity of a medical condition of the patient. Routing logic associated with the PCC computing system routes the one or more medical image data structures and one or more corresponding medical image metadata data structures to one or more downstream patient evaluation computing systems based on the determined patient complexity.
Device and method for training an object detection model
A training device may include one or more processors configured to generate, using a data augmentation model, augmented sensor data for sensor data, the sensor data provided by a plurality of sensors, wherein the augmented sensor data comprise error states of one or more sensors of the plurality of sensors providing the sensor data, and to train an object detection model based on the augmented sensor data.
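A minimal sketch of augmenting multi-sensor data with error states follows. The abstract describes a data augmentation model; this sketch substitutes a hand-written random rule, and the two error states (a sensor dropping out and a sensor returning noisy readings), the function name and the parameters are assumptions made for illustration.

```python
import random

def augment_with_sensor_errors(sample, dropout_prob=0.2, noise_prob=0.2,
                               noise_std=0.1, rng=None):
    """Return a copy of `sample` in which each sensor may be put into a
    simulated error state. `sample` maps sensor names to lists of readings."""
    rng = rng or random.Random()
    augmented = {}
    for sensor, readings in sample.items():
        roll = rng.random()
        if roll < dropout_prob:
            # Assumed error state 1: the sensor fails and reports zeros.
            augmented[sensor] = [0.0] * len(readings)
        elif roll < dropout_prob + noise_prob:
            # Assumed error state 2: the sensor reports noisy readings.
            augmented[sensor] = [v + rng.gauss(0.0, noise_std) for v in readings]
        else:
            # No error: readings are copied unchanged.
            augmented[sensor] = list(readings)
    return augmented

# Hypothetical readings from two sensors.
sample = {"camera": [0.5, 0.6], "lidar": [1.0, 2.0, 3.0]}
augmented = augment_with_sensor_errors(sample, rng=random.Random(0))
```

An object detection model trained on such samples alongside the clean data sees examples of partial sensor failure, which is the stated purpose of the augmentation.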
Visual relationship detection method and system based on region-aware learning mechanisms
The present invention discloses a visual relationship detection method based on a region-aware learning mechanism, comprising: acquiring a triplet graph structure and combining features after its aggregation with neighboring nodes, using the features as nodes in a second graph structure, and connecting in accordance with equiprobable edges to form the second graph structure; combining node features of the second graph structure with features of corresponding entity object nodes in the triplet, using the combined features as a visual attention mechanism and merging internal region visual features extracted by two entity objects, and using the merged region visual features as visual features to be used in the next message propagation by corresponding entity object nodes in the triplet; and after a certain number of message propagations, combining the output triplet node features and the node features of the second graph structure to infer predicates between object sets.
Apparatus and method for image processing for machine learning
An image processing apparatus includes a superpixel extractor configured to extract a plurality of superpixels from an input original image, a backbone network including N feature extracting layers (here, N is a natural number of two or more) which divide the input original image into grids including a plurality of regions and generate an output value including a feature value for each of the divided regions, and a superpixel pooling layer configured to generate a superpixel feature value corresponding to each of the plurality of superpixels using a first output value to an N-th output value output from each of the N feature extracting layers.
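One way to read the superpixel pooling layer is sketched below. It assumes that each feature extracting layer outputs a grid with one scalar feature per region, and that a superpixel's feature for a layer is the average over its pixels of the feature of the grid region containing each pixel. The abstract does not state the pooling rule, so this is an illustrative assumption, as are the function name and toy data.

```python
def superpixel_pool(labels, feature_maps):
    """Pool grid features into per-superpixel features.

    labels:       H x W nested list giving the superpixel id of each pixel.
    feature_maps: list of N grids (each gh x gw), one per feature
                  extracting layer, holding one scalar per grid region.
    Returns {superpixel_id: [feature from layer 1, ..., feature from layer N]}.
    """
    height, width = len(labels), len(labels[0])
    sums, counts = {}, {}
    for y in range(height):
        for x in range(width):
            sp = labels[y][x]
            counts[sp] = counts.get(sp, 0) + 1
            acc = sums.setdefault(sp, [0.0] * len(feature_maps))
            for i, grid in enumerate(feature_maps):
                # Map the pixel to the grid region that covers it.
                gy = y * len(grid) // height
                gx = x * len(grid[0]) // width
                acc[i] += grid[gy][gx]
    return {sp: [s / counts[sp] for s in acc] for sp, acc in sums.items()}

# Hypothetical 4 x 4 image split into a left (0) and right (1) superpixel.
labels = [[0, 0, 1, 1] for _ in range(4)]
layer1 = [[1.0, 2.0], [3.0, 4.0]]  # 2 x 2 grid from a shallower layer
layer2 = [[10.0]]                  # 1 x 1 grid from a deeper layer

pooled = superpixel_pool(labels, [layer1, layer2])
```

Because every layer contributes one entry, each superpixel ends up with a feature built from the first through the N-th output, regardless of how coarse each layer's grid is.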
Multi-modal, multi-technique vehicle signal detection
A vehicle includes one or more cameras that capture a plurality of two-dimensional images of a three-dimensional object. A light detector and/or a semantic classifier search within those images for lights of the three-dimensional object. A vehicle signal detection module fuses information from the light detector and/or the semantic classifier to produce a semantic meaning for the lights. The vehicle can be controlled based on the semantic meaning. Further, the vehicle can include a depth sensor and an object projector. The object projector can determine regions of interest within the two-dimensional images, based on the depth sensor. The light detector and/or the semantic classifier can use these regions of interest to efficiently perform the search for the lights.
Multimodal dimensional emotion recognition method
A multimodal dimensional emotion recognition method includes: acquiring a frame-level audio feature, a frame-level video feature, and a frame-level text feature from an audio, a video, and a corresponding text of a sample to be tested; performing temporal contextual modeling on the frame-level audio feature, the frame-level video feature, and the frame-level text feature respectively by using a temporal convolutional network to obtain a contextual audio feature, a contextual video feature, and a contextual text feature; performing weighted fusion on these three features by using a gated attention mechanism to obtain a multimodal feature; splicing the multimodal feature and these three features together to obtain a spliced feature, and then performing further temporal contextual modeling on the spliced feature by using a temporal convolutional network to obtain a contextual spliced feature; and performing regression prediction on the contextual spliced feature to obtain a final dimensional emotion prediction result.
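The weighted fusion and splicing steps can be sketched for a single frame as follows. The sketch assumes the gate for each modality is a softmax over a dot product with a weight vector; the abstract names a gated attention mechanism without giving its form, and the temporal convolutional networks and the regression head are omitted. All names and values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gated_fusion(audio, video, text, gate_weights):
    """Fuse three contextual feature vectors for one frame.

    Each modality gets a gate from a softmax over its dot product with
    `gate_weights` (an assumed gating form). The fused multimodal feature
    is the gate-weighted sum, and the spliced feature concatenates the
    fused feature with the three original features.
    """
    feats = [audio, video, text]
    scores = [sum(w * x for w, x in zip(gate_weights, f)) for f in feats]
    gates = softmax(scores)
    fused = [sum(g * f[i] for g, f in zip(gates, feats)) for i in range(len(audio))]
    spliced = fused + audio + video + text
    return gates, fused, spliced

# Hypothetical contextual features for one frame.
audio = [1.0, 0.0]
video = [0.0, 1.0]
text = [1.0, 1.0]

gates, fused, spliced = gated_fusion(audio, video, text, gate_weights=[0.5, -0.5])
```

In the method as described, the spliced feature of every frame would then pass through a further temporal convolutional network before regression.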
Framework for training machine-learned models on extremely large datasets
A MapReduce-based training framework exploits both data parallelism and model parallelism to scale training of complex models. Particular model architectures facilitate and benefit from use of such a training framework. As one example, a machine-learned model can include a shared feature extraction portion configured to receive and process a data input to produce an intermediate feature representation and a plurality of prediction heads that are configured to receive and process the intermediate feature representation to respectively produce a plurality of predictions. For example, the data input can be a video and the plurality of predictions can be a plurality of classifications for content of the video (e.g., relative to a plurality of classes).
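The model architecture in this abstract, a shared feature extraction portion feeding several prediction heads, can be sketched as below. The sketch covers only the forward pass; the MapReduce-based training itself is not shown. The single linear layer with ReLU, the linear heads, and all weights and class names are assumptions for illustration.

```python
def shared_extractor(x, weights):
    """Shared feature extraction: one linear layer followed by ReLU (assumed)."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]

def predict(x, shared_weights, heads):
    """Compute the intermediate representation once, then let every
    prediction head produce its own score from it."""
    features = shared_extractor(x, shared_weights)
    return {
        name: sum(w * f for w, f in zip(head_weights, features)) + bias
        for name, (head_weights, bias) in heads.items()
    }

# Hypothetical input, shared weights, and two per-class heads.
x = [1.0, 2.0]
shared_weights = [[1.0, 0.0], [0.0, -1.0]]
heads = {
    "sports": ([2.0, 1.0], 0.5),
    "music": ([-1.0, 3.0], 0.0),
}

predictions = predict(x, shared_weights, heads)
```

Since the heads depend only on the shared representation and not on each other, they are a natural unit to distribute across workers, which is one plausible reason such architectures suit the framework.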