G06V10/806

Object and camera localization system and localization method for mapping of the real world

An object and camera localization system and localization method for mapping of the real world. The mapping can be done in real time or near real time relative to the detection of the real objects by a camera device, which is used to capture one or more images of an object. The localization method can be used to generate an object label of the object and a bounding box of the object in the image. The localization method can be used to generate anchor points in real-world coordinates of the real 3D space of the object, a cuboid of the object, and a centroid of the cuboid. A virtual 3D map can be generated that includes the location and pose of the real object in the real-world coordinates.
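
The relationship between anchor points, cuboid, and centroid can be illustrated with a minimal sketch. It assumes an axis-aligned cuboid that encloses the anchor points, which the abstract does not specify; the function names `cuboid_from_anchor_points` and `cuboid_centroid` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def cuboid_from_anchor_points(points: List[Point3D]) -> Tuple[Point3D, Point3D]:
    """Return (min_corner, max_corner) of the axis-aligned cuboid enclosing the points.

    Assumption: the cuboid is axis-aligned in real-world coordinates.
    """
    if not points:
        raise ValueError("at least one anchor point is required")
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


def cuboid_centroid(min_corner: Point3D, max_corner: Point3D) -> Point3D:
    """Return the midpoint between the two opposite corners of the cuboid."""
    return tuple((lo + hi) / 2.0 for lo, hi in zip(min_corner, max_corner))


# Usage: four hypothetical anchor points in real-world coordinates (metres).
anchors = [(0, 0, 0), (2, 0, 0), (2, 4, 0), (0, 4, 6)]
lo, hi = cuboid_from_anchor_points(anchors)
centroid = cuboid_centroid(lo, hi)  # (1.0, 2.0, 3.0)
```

An oriented cuboid that follows the pose of the object would need a rotation in addition to the two corners; the sketch leaves that out.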

Methods and Systems for Generating Composite Image Descriptors

An illustrative image descriptor generation system generates a descriptor listing that includes a plurality of image descriptors corresponding to different feature points included within an image. Based on the descriptor listing, the system generates a geometric map representing the plurality of image descriptors in accordance with respective geometric positions of the corresponding feature points of the image descriptors within the image. Based on the geometric map, the system determines a proximity listing for a primary image descriptor within the plurality of image descriptors. The proximity listing indicates a subset of image descriptors that are geometrically proximate to the primary image descriptor within the image. Based on the proximity listing, the system selects a secondary image descriptor from the subset of image descriptors and combines the primary and secondary image descriptors to form a composite image descriptor. Corresponding methods and systems are also disclosed.
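
The sequence of steps (descriptor listing, proximity listing, selection, combination) can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: the geometric map is replaced by a direct distance scan over feature positions, the secondary descriptor is taken to be the nearest neighbour, and the combination is plain concatenation. The names `proximity_listing` and `composite_descriptor` are hypothetical.

```python
import math
from typing import List, Optional, Tuple

# One entry of the descriptor listing: ((x, y) feature position, descriptor vector).
Descriptor = Tuple[Tuple[float, float], List[float]]


def proximity_listing(descriptors: List[Descriptor], primary_index: int,
                      radius: float) -> List[int]:
    """Indices of descriptors within `radius` of the primary one, nearest first."""
    px, py = descriptors[primary_index][0]
    nearby = []
    for i, ((x, y), _) in enumerate(descriptors):
        if i == primary_index:
            continue
        distance = math.hypot(x - px, y - py)
        if distance <= radius:
            nearby.append((distance, i))
    nearby.sort()
    return [i for _, i in nearby]


def composite_descriptor(descriptors: List[Descriptor], primary_index: int,
                         radius: float) -> Optional[List[float]]:
    """Concatenate the primary descriptor with its nearest neighbour, if any."""
    neighbours = proximity_listing(descriptors, primary_index, radius)
    if not neighbours:
        return None
    secondary_index = neighbours[0]  # assumption: pick the closest descriptor
    return descriptors[primary_index][1] + descriptors[secondary_index][1]


# Usage with a hypothetical four-entry descriptor listing.
listing = [((0, 0), [1.0, 2.0]), ((3, 4), [3.0, 4.0]),
           ((1, 0), [5.0, 6.0]), ((100, 100), [7.0, 8.0])]
combined = composite_descriptor(listing, 0, radius=5.0)  # [1.0, 2.0, 5.0, 6.0]
```

For large images, a spatial index such as a grid or k-d tree would play the role of the geometric map and avoid scanning every descriptor.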

Supplementing top-down predictions with image features

The described techniques relate to predicting object behavior based on top-down representations of an environment comprising top-down representations of image features in the environment. For example, a top-down representation may comprise a multi-channel image that includes semantic map information along with additional information for a target object and/or other objects in an environment. A top-down image feature representation may also be a multi-channel image that incorporates various tensors for different image features with channels of the multi-channel image, and may be generated directly from an input image. A prediction component can generate predictions of object behavior based at least in part on the top-down image feature representation, and in some cases, can generate predictions based on the top-down image feature representation together with the additional top-down representation.

IMAGE PROCESSING METHOD, IMAGE PROCESSING APPARATUS, ELECTRONIC DEVICE AND COMPUTER-READABLE STORAGE MEDIUM

An image processing method, an image processing apparatus, an electronic device, and a computer-readable storage medium relating to the technical field of image processing are provided. The image processing method may include performing blur classification on pixels of an image to obtain a classification mask image; and determining a blurred area of the image based on the classification mask image.
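
The two steps (per-pixel blur classification, then locating the blurred area from the mask) can be sketched roughly as below. The abstract does not say how pixels are classified; the local-contrast threshold used here is only a stand-in assumption, and `classification_mask` and `blurred_area` are hypothetical names. The blurred area is reported as a bounding rectangle, which is also an assumption.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]


def classification_mask(image: Image, threshold: int) -> Image:
    """Mark a pixel as blurred (1) when its contrast to the right and lower
    neighbours is below `threshold`; otherwise mark it sharp (0).

    Stand-in heuristic only: border pixels reuse themselves as the missing neighbour.
    """
    height, width = len(image), len(image[0])
    mask = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            right = image[r][min(c + 1, width - 1)]
            down = image[min(r + 1, height - 1)][c]
            contrast = abs(image[r][c] - right) + abs(image[r][c] - down)
            mask[r][c] = 1 if contrast < threshold else 0
    return mask


def blurred_area(mask: Image) -> Optional[Tuple[int, int, int, int]]:
    """Bounding rectangle (top, left, bottom, right) of all blurred pixels."""
    coords = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return min(rows), min(cols), max(rows), max(cols)


# Usage: the top two rows are high-contrast, the bottom two are flat.
image = [[0, 100, 0, 100],
         [100, 0, 100, 0],
         [50, 50, 50, 50],
         [50, 50, 50, 50]]
mask = classification_mask(image, threshold=10)
area = blurred_area(mask)  # (2, 0, 3, 3)
```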

COMPLEMENTARY LEARNING FOR MULTI-MODAL SALIENCY DETECTION

A saliency detection method explicitly models complementary information between appearance, or color, and depth information in images. A mutual-information minimization is used as a regularizer to reduce the redundancy between appearance features from RGB and geometric features from depth in the latent space. Then the latent features of each of the appearance and geometric modalities are fused to achieve multi-modal feature fusion for saliency detection.

MICROGENRE-BASED HYPER-PERSONALIZATION WITH MULTI-MODAL MACHINE LEARNING
20220245424 · 2022-08-04 ·

A method includes accessing video data that includes at least two different modalities. The method also includes using a convolutional neural network layer to incorporate temporal coherence into a machine learning model architecture configured to process the video data. The method further includes learning dependency among the at least two different modalities in an attention space of the machine learning model architecture. In addition, the method includes predicting one or more correlations among the at least two different modalities.

LEARNING ORTHOGONAL FACTORIZATION IN GAN LATENT SPACE
20220254152 · 2022-08-11 ·

A method for learning disentangled representations of videos is presented. The method includes feeding each frame of video data into an encoder to produce a sequence of visual features, passing the sequence of visual features through a deep convolutional network to obtain a posterior of a dynamic latent variable and a posterior of a static latent variable, sampling static and dynamic representations from the posterior of the static latent variable and the posterior of the dynamic latent variable, respectively, concatenating the static and dynamic representations to be fed into a decoder to generate reconstructed sequences, and applying three regularizers to the dynamic and static latent variables to trigger representation disentanglement. To facilitate the disentangled sequential representation learning, orthogonal factorization in generative adversarial network (GAN) latent space is leveraged to pre-train a generator as a decoder in the method.

SYSTEM AND METHOD OF USING RIGHT AND LEFT EARDRUM OTOSCOPY IMAGES FOR AUTOMATED OTOSCOPY IMAGE ANALYSIS TO DIAGNOSE EAR PATHOLOGY
20220261987 · 2022-08-18 ·

Disclosed herein are systems and methods to detect a wide range of eardrum abnormalities by using high-resolution otoscope images of both a left eardrum and a right eardrum of a subject and report the condition of each of the eardrums as “normal” or “abnormal.”

IMAGE PROCESSING DEVICE, TELEPRESENCE SYSTEM, IMAGE PROCESSING METHOD, AND IMAGE PROCESSING PROGRAM

An image processing device according to an embodiment of the present disclosure is an image processing device that processes an image to be displayed on a display unit of a telepresence robot disposed at a first base. The image processing device includes a background image acquisition unit that acquires, in real time, image information obtained by capturing an image in a rear direction of the display unit, a person image acquisition unit that acquires, in real time, image information including a person at a second base that is a remote area from the first base, an extraction unit that extracts an area in which the person is displayed from the image information including the person, and a composition unit that combines the extracted image information on the person and the image information in the rear direction of the display unit.
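
The composition step can be illustrated with a small mask-based sketch. It assumes the extraction unit has already produced a binary mask of the person area and that both images share the same resolution; the function name `compose` and the list-of-lists image representation are hypothetical.

```python
from typing import List

Image = List[List[int]]


def compose(person_image: Image, person_mask: Image, background_image: Image) -> Image:
    """Overlay the extracted person area onto the rear-direction background image.

    Where the mask is non-zero the person pixel is used; elsewhere the
    background pixel is kept.
    """
    same_height = len(person_image) == len(person_mask) == len(background_image)
    if not same_height or any(
        not (len(p) == len(m) == len(b))
        for p, m, b in zip(person_image, person_mask, background_image)
    ):
        raise ValueError("person image, mask, and background must share dimensions")
    return [
        [p if m else b for p, m, b in zip(p_row, m_row, b_row)]
        for p_row, m_row, b_row in zip(person_image, person_mask, background_image)
    ]


# Usage with 2x2 single-channel images.
person = [[9, 9], [9, 9]]
mask = [[1, 0], [0, 1]]
background = [[1, 2], [3, 4]]
frame = compose(person, mask, background)  # [[9, 2], [3, 9]]
```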

Object detection network and method

An object detection network includes: a hybrid voxel feature extractor configured to acquire a raw point cloud, extract a hybrid scale voxel feature from the raw point cloud, and project the hybrid scale voxel feature to generate a pseudo-image feature map; a backbone network configured to perform a hybrid voxel scale feature fusion by using the pseudo-image feature map to generate multi-class pyramid features; and a detection head configured to predict a three-dimensional object box of a corresponding class according to the multi-class pyramid features. The object detection network can effectively solve a problem that under a single voxel scale, inference time is longer if the voxel scale is smaller, and an intricate feature cannot be captured and a smaller object cannot be accurately located if the voxel scale is larger. Different classes of 3D objects can be detected quickly and accurately in a 3D scene.
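
The trade-off between voxel scales can be made concrete by grouping the same raw point cloud at several voxel sizes. The sketch below covers only that grouping step; the learned feature extraction, the pseudo-image projection, the backbone, and the detection head are not shown. The names `voxelize` and `hybrid_scale_voxelize` are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point3D = Tuple[float, float, float]
VoxelKey = Tuple[int, int, int]


def voxelize(points: Sequence[Point3D], voxel_size: float) -> Dict[VoxelKey, List[Point3D]]:
    """Group points by the integer index of the cubic voxel that contains them."""
    if voxel_size <= 0:
        raise ValueError("voxel_size must be positive")
    voxels: Dict[VoxelKey, List[Point3D]] = defaultdict(list)
    for point in points:
        key = tuple(math.floor(coord / voxel_size) for coord in point)
        voxels[key].append(point)
    return dict(voxels)


def hybrid_scale_voxelize(points: Sequence[Point3D],
                          voxel_sizes: Sequence[float]) -> Dict[float, Dict[VoxelKey, List[Point3D]]]:
    """Voxelize the same point cloud once per scale."""
    return {size: voxelize(points, size) for size in voxel_sizes}


# Usage: three points along the x axis at two voxel scales.
cloud = [(0.1, 0.1, 0.1), (0.6, 0.1, 0.1), (1.2, 0.1, 0.1)]
grids = hybrid_scale_voxelize(cloud, [0.5, 1.0])
# The 0.5 scale separates all three points; the 1.0 scale merges the first two.
```

Smaller voxels preserve fine detail but produce more occupied cells to process, which matches the inference-time concern the abstract raises.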