Patent classifications
G06V10/806
Compositional Action Machine Learning Mechanisms
Mechanisms are provided for performing machine learning (ML) training of a ML action recognition computer model which involves processing an original input dataset to generate an object feature bank comprising object feature data structures for a plurality of different objects. For an input video, a verb data structure and an original object data structure are generated and a candidate object feature data structure is selected from the object feature bank for generation of pseudo composition (PC) training data. The PC training data is generated based on the selected candidate object feature data structure and comprises a combination of the verb data structure and the candidate object feature data structure. The PC training data represents a combination of an action and an object not represented in the original input dataset. ML training of the ML action recognition computer model is performed based on an unseen combination comprising the PC training data.
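The pseudo composition step described above can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the patented method: samples are reduced to (verb, object) label pairs, the `ObjectFeature` type and the length-based "feature" are hypothetical stand-ins for learned object features, and the candidate is drawn at random from the objects that would form an unseen pair.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ObjectFeature:
    label: str
    vector: Tuple[float, ...]  # hypothetical stand-in for a learned object feature


def build_object_bank(dataset: List[Tuple[str, str]]) -> Dict[str, ObjectFeature]:
    """Collect one feature entry per distinct object in the original dataset."""
    bank: Dict[str, ObjectFeature] = {}
    for _verb, obj in dataset:
        # Assumption: a real system would run a feature extractor here.
        bank.setdefault(obj, ObjectFeature(obj, (float(len(obj)),)))
    return bank


def pseudo_compositions(dataset, bank, rng):
    """Pair each sample's verb with a bank object that forms an unseen pair."""
    seen = set(dataset)
    composed = []
    for verb, _original_obj in dataset:
        candidates = [o for o in sorted(bank) if (verb, o) not in seen]
        if not candidates:
            continue  # every pairing of this verb already occurs in the data
        chosen = rng.choice(candidates)
        composed.append((verb, bank[chosen]))
    return composed


dataset = [("open", "door"), ("close", "door"), ("open", "jar"), ("peel", "orange")]
bank = build_object_bank(dataset)
pc_data = pseudo_compositions(dataset, bank, random.Random(0))
for verb, feature in pc_data:
    print(verb, feature.label)
```

Each printed pair combines a verb from the original data with an object it never appeared with, which is the property the abstract attributes to the PC training data.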
System and Method for Selecting a Dimensioning Function and Dimensioning an Object
An example dimensioning device includes: a sensor to capture data representing an object; a memory configured to store: a first dimensioning function and criteria associated with the first dimensioning function; and a default dimensioning function; and a processor interconnected with the sensor and the memory, the processor configured to: in response to a dimensioning request to dimension the object, obtain the data representing the object from the sensor; select, from the first dimensioning function and the default dimensioning function, a designated dimensioning function based on the data and the criteria; call the designated dimensioning function to obtain dimensions of the object; and output the dimensions of the object.
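The selection logic can be sketched as a small dispatch routine. The function names, the `shape_hint` criterion, and the bounding-box arithmetic below are all hypothetical; the abstract does not say what the criteria or the dimensioning functions actually compute.

```python
from typing import Callable, Dict, Tuple

Dims = Tuple[float, float, float]


def default_dimensioning(data: Dict) -> Dims:
    # Hypothetical default: axis-aligned bounding box of the captured points.
    pts = data["points"]
    return tuple(max(p[i] for p in pts) - min(p[i] for p in pts) for i in range(3))


def box_dimensioning(data: Dict) -> Dims:
    # Hypothetical first function: the same box, rounded to whole units.
    return tuple(float(round(v)) for v in default_dimensioning(data))


def box_criteria(data: Dict) -> bool:
    # Hypothetical criteria stored alongside the first function.
    return data.get("shape_hint") == "box"


def select_function(data: Dict,
                    first: Callable[[Dict], Dims],
                    criteria: Callable[[Dict], bool],
                    default: Callable[[Dict], Dims]) -> Callable[[Dict], Dims]:
    """Designate the first function when the data meets its criteria, else the default."""
    return first if criteria(data) else default


def handle_dimensioning_request(data: Dict) -> Dims:
    designated = select_function(data, box_dimensioning, box_criteria, default_dimensioning)
    return designated(data)


capture = {"points": [(0, 0, 0), (1.2, 2.6, 3.1)], "shape_hint": "box"}
print(handle_dimensioning_request(capture))
```

Keeping the criteria as a separate callable mirrors the abstract's memory layout, where the criteria are stored in association with the first dimensioning function rather than inside it.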
METHOD AND APPARATUS FOR PROCESSING MODEL DATA, ELECTRONIC DEVICE, AND COMPUTER READABLE MEDIUM
A method and apparatus for processing model data are provided, which relate to the technical field of artificial intelligence. The method comprises: acquiring data of at least two different modalities in a to-be-processed dataset; performing feature extraction on the data of the at least two different modalities, then splicing and/or superimposing the extracted features to obtain a feature sequence; performing model mapping on the feature sequence to obtain multi-modal input data adapted to an autoregressive model; and inputting the multi-modal input data into the autoregressive model to obtain a single-modal result outputted by the autoregressive model.
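The pipeline in this abstract (extract, splice, map, generate) can be sketched with toy stand-ins. All functions below are hypothetical: the extractors, the zero-padding used as "model mapping", and the arithmetic used as an "autoregressive model" are placeholders chosen only to make the data flow visible, and splicing is assumed to mean concatenating the per-modality sequences.

```python
from typing import List

Vector = List[float]


def text_features(text: str) -> List[Vector]:
    # Hypothetical extractor: one 2-dimensional vector per character.
    return [[float(ord(ch) % 10), 1.0] for ch in text]


def image_features(rows: List[List[int]]) -> List[Vector]:
    # Hypothetical extractor: one 3-dimensional vector per pixel row.
    return [[sum(r) / len(r), float(min(r)), float(max(r))] for r in rows]


def splice(*sequences: List[Vector]) -> List[Vector]:
    # Assumption: splicing concatenates the modality sequences end to end.
    return [vec for seq in sequences for vec in seq]


def map_to_model_input(sequence: List[Vector], width: int) -> List[Vector]:
    # Pad with zeros or truncate so every vector has the width the model expects.
    return [(vec + [0.0] * width)[:width] for vec in sequence]


def toy_autoregressive_model(inputs: List[Vector], steps: int) -> List[int]:
    # Stand-in model: each new token depends on the inputs plus all earlier tokens.
    context = [sum(vec) for vec in inputs]
    tokens: List[int] = []
    for _ in range(steps):
        token = int(sum(context)) % 100
        tokens.append(token)
        context.append(float(token))
    return tokens


features = splice(text_features("cat"), image_features([[0, 255], [128, 128]]))
model_input = map_to_model_input(features, width=4)
result = toy_autoregressive_model(model_input, steps=3)
print(result)
```

The output is a sequence of integer tokens only, which corresponds to the single-modal result the abstract describes, even though the input mixes two modalities.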
System and Method for Motion Prediction in Autonomous Driving
The present disclosure provides a system and a method for motion prediction for autonomous driving. The system disclosed herein provides an efficient deep-neural-network-based system to jointly perform perception and motion prediction from 3D point clouds. The system takes a pair of LiDAR sweeps as input and outputs, for each point in the second sweep, both a classification of the point into one of a set of semantic classes and a motion vector indicating the motion of the point within the world coordinate system. The system includes a spatiotemporal pyramid network, which extracts deep spatial and temporal features in a hierarchical fashion. The training of this system is regularized with spatial and temporal consistency losses. This provides an improved motion planner for autonomous driving applications.
SHAPE FUSION FOR IMAGE ANALYSIS
Various types of image analysis benefit from a multi-stream architecture that allows the analysis to consider shape data. A shape stream can process image data in parallel with a primary stream, where data from layers of a network in the primary stream is provided as input to a network of the shape stream. The shape data can be fused with the primary analysis data to produce more accurate output, such as to produce accurate boundary information when the shape data is used with semantic segmentation data produced by the primary stream. A gate structure can be used to connect the intermediate layers of the primary and shape streams, using higher level activations to gate lower level activations in the shape stream. Such a gate structure can help focus the shape stream on the relevant information and reduce any additional weight of the shape stream.
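The gating idea can be shown with a minimal element-wise sketch. It assumes the gate is a sigmoid of a scaled higher-level activation and that both activation lists are already aligned to the same length; the function name and the `weight` and `bias` parameters are hypothetical, and the abstract does not specify how the gate is actually computed.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gate_lower_level(low: List[float], high: List[float],
                     weight: float = 1.0, bias: float = 0.0) -> List[float]:
    """Scale each lower-level activation by a gate in (0, 1) derived from the
    matching higher-level activation."""
    if len(low) != len(high):
        raise ValueError("activations must be aligned to the same length")
    return [l * sigmoid(weight * h + bias) for l, h in zip(low, high)]


low_level = [1.0, 2.0, 3.0]      # shape-stream activations
high_level = [-10.0, 0.0, 10.0]  # primary-stream activations from a deeper layer
gated = gate_lower_level(low_level, high_level)
print(gated)
```

Positions where the higher-level activation is strongly negative are suppressed almost to zero, while strongly positive positions pass through nearly unchanged, which is one way a gate can focus the shape stream on relevant information.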
Single-channel and multi-channel source separation enhanced by lip motion
Methods and systems are provided for implementing source separation techniques, and more specifically performing source separation on mixed source single-channel and multi-channel audio signals enhanced by inputting lip motion information from captured image data, including selecting a target speaker facial image from a plurality of facial images captured over a period of interest; computing a motion vector based on facial features of the target speaker facial image; and separating, based on at least the motion vector, audio corresponding to a constituent source from a mixed source audio signal captured over the period of interest. The mixed source audio signal may be captured from single-channel or multi-channel audio capture devices. Separating audio from the audio signal may be performed by a fusion learning model comprising a plurality of learning sub-models. Separating the audio from the audio signal may be performed by a blind source separation (“BSS”) learning model.
MULTI-MODAL IMAGE CLASSIFICATION SYSTEM AND METHOD USING ATTENTION-BASED MULTI-INTERACTION NETWORK
The present disclosure belongs to the technical field of image processing, and provides a multi-modal image classification system and method using an attention-based multi-interaction network. The present disclosure utilizes a U-net network structure to fuse low-level visual features and high-level semantic features. An attention network is introduced to solve the problem of weak feature discrimination, and high attention is given to discriminative features, so that the attention network plays an important role in the final classification process. A sufficient multi-modal interaction mechanism is introduced, so that more effective correlation information and discriminative information are obtained among a plurality of modalities, and sufficient interaction among the plurality of modalities is completed, thereby solving the problems of weak feature discrimination and insufficient interaction among modalities in a multi-modal image classification task.
MULTI-TASK OBJECT DETECTION METHOD, ELECTRONIC DEVICE, MEDIUM, AND VEHICLE
The disclosure provides a multi-task object detection method, an electronic device, a medium, and a vehicle, to solve the technical problem of low detection accuracy or poor detection effect of an existing multi-task detection method. For this purpose, the multi-task object detection method of the disclosure includes: obtaining images captured by a vehicle-mounted sensor; inputting the images into a multi-scale feature extraction network to extract multi-scale features; inputting the multi-scale features into a multi-scale feature fusion network to obtain fused features, where the multi-scale feature fusion network includes multiple optimal fusion paths, and each optimal fusion path corresponds to one of multiple tasks; and inputting, into a corresponding detection head, the fused features output from each optimal fusion path, to obtain a detection result, where each detection head is capable of detecting one of the multiple tasks. In this way, the accuracy of multi-task object detection is improved.
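The routing of fused features to per-task detection heads can be sketched as follows. The scale names, task names, path weights, and head functions are all hypothetical, and the fusion paths here are hand-picked weightings rather than the result of any search for optimal paths.

```python
from typing import Callable, Dict, List

Features = Dict[str, List[float]]


def fuse(features: Features, path: Dict[str, float]) -> List[float]:
    """Weighted sum of the scales named in one task's fusion path."""
    length = len(next(iter(features.values())))
    fused = [0.0] * length
    for scale, weight in path.items():
        for i, value in enumerate(features[scale]):
            fused[i] += weight * value
    return fused


def detect(features: Features,
           paths: Dict[str, Dict[str, float]],
           heads: Dict[str, Callable[[List[float]], float]]) -> Dict[str, float]:
    """Send each task's fused features to that task's detection head."""
    return {task: heads[task](fuse(features, paths[task])) for task in heads}


# Stand-in for the output of a multi-scale feature extraction network.
multi_scale = {"fine": [1.0, 2.0], "mid": [3.0, 4.0], "coarse": [5.0, 6.0]}

# One fusion path per task (hypothetical weights).
paths = {"lane": {"fine": 1.0}, "vehicle": {"mid": 0.5, "coarse": 0.5}}

# One detection head per task (hypothetical scoring functions).
heads = {"lane": sum, "vehicle": max}

results = detect(multi_scale, paths, heads)
print(results)
```

Because each task has its own path and its own head, changing how one task combines scales leaves the other task's result untouched.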
METHOD AND SYSTEM FOR DETECTING FUNDUS IMAGE BASED ON DYNAMIC WEIGHTED ATTENTION MECHANISM
The present disclosure provides a method and system for detecting a fundus image based on a dynamic weighted attention mechanism. Lesion information in a fundus image of a premature infant is detected using a fundus image segmentation model. First, the fundus image is consecutively downsampled. Dynamic weighted attention fusion is performed on an obtained downsampling feature and an obtained downsampling feature of an adjacent layer. The weighted and fused features are fused with an output feature of a corresponding upsampling layer. Finally, a classification convolution operation is performed on an output of an n-th upsampling layer to obtain a lesion probability for each pixel. The present disclosure performs hierarchical feature fusion on a shallow network model using the dynamic weighted attention mechanism, which can reduce complexity of algorithm design, shorten a running time of an algorithm, and reduce excessive occupation of graphics processing unit (GPU) resources while ensuring recognition accuracy.
Multi-image-based image enhancement method and device
The present disclosure provides a multi-image-based image enhancement method and device, an electronic device and a non-transitory computer readable storage medium. The method includes: aligning a low-resolution target image and a reference image in an image domain; performing an alignment in a feature domain; and synthesizing features corresponding to the low-resolution target image and features corresponding to the reference image to generate a final output.