G06K9/72

METHOD AND APPARATUS FOR VISUAL QUESTION ANSWERING, COMPUTER DEVICE AND MEDIUM

The present disclosure provides a method for visual question answering, which relates to fields of computer vision and natural language processing. The method includes: acquiring an input image and an input question; detecting visual information and position information of each of at least one text region in the input image; determining semantic information and attribute information of each of the at least one text region based on the visual information and the position information; determining a global feature of the input image based on the visual information, the position information, the semantic information, and the attribute information; determining a question feature based on the input question; and generating a predicted answer for the input image and the input question based on the global feature and the question feature. The present disclosure further provides a device for visual question answering, a computer device and a medium.

Systems and methods to manage application program interface communications

Systems and methods for managing Application Programming Interfaces (APIs) are disclosed. For example, the system may include one or more memory units storing instructions and one or more processors configured to execute the instructions to perform operations. The operations may include receiving a call to an API node. The operations may include determining that the call is associated with a first version of the API. The operations may include determining that the API node is associated with a second version of the API. The operations may include translating the call into a translated call using a translation model, the translated call being associated with the second version of the API.
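The version check and translation step described above can be illustrated with a minimal sketch. The abstract does not say what form the translation model takes, so the sketch swaps in a static parameter-rename table as a stand-in; the names ApiCall, FIELD_RENAMES and translate_call, and the version labels "v1" and "v2", are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ApiCall:
    version: str   # API version the caller used
    endpoint: str
    params: dict


# Hypothetical stand-in for the "translation model": a table of parameter
# renames keyed by (call version, node version). A learned model could
# replace this lookup; the abstract does not specify one.
FIELD_RENAMES = {("v1", "v2"): {"user": "user_id", "q": "query"}}


def translate_call(call, node_version, renames=FIELD_RENAMES):
    """Return a call that matches the version served by the API node."""
    if call.version == node_version:
        return call  # versions already agree, nothing to translate
    mapping = renames.get((call.version, node_version))
    if mapping is None:
        raise ValueError(
            f"no translation from {call.version} to {node_version}")
    # Rename known parameters; pass unknown ones through unchanged.
    params = {mapping.get(k, k): v for k, v in call.params.items()}
    return ApiCall(node_version, call.endpoint, params)


old_call = ApiCall("v1", "/search", {"user": 7, "q": "cats", "limit": 5})
new_call = translate_call(old_call, "v2")
```

Passing unknown parameters through unchanged is a design choice of this sketch, not something the abstract states.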

Aligning symbols and objects using co-attention for understanding visual content

A method, apparatus and system for understanding visual content include determining at least one region proposal for an image; attending at least one symbol of the proposed image region; attending a portion of the proposed image region using information regarding the attended symbol; extracting appearance features of the attended portion of the proposed image region; fusing the appearance features of the attended image region and features of the attended symbol; projecting the fused features into a semantic embedding space having been trained using fused attended appearance features and attended symbol features of images having known descriptive messages; computing a similarity measure between the projected, fused features and fused attended appearance features and attended symbol features embedded in the semantic embedding space having at least one associated descriptive message; and predicting a descriptive message for an image associated with the projected, fused features.

Artificial neural networks having attention-based selective plasticity and methods of training the same

An autonomous navigation system for a vehicle includes a controller configured to control the vehicle, sensors configured to detect objects in a path of the vehicle, nonvolatile memory including an artificial neural network configured to classify the objects detected by the sensors, and a processor. The artificial neural network includes a series of neurons in each of an input layer, at least one hidden layer, and an output layer. The memory includes instructions which, when executed by the processor, cause the processor to train the artificial neural network on a first task, identify, utilizing a contrastive excitation backpropagation algorithm, important neurons for the first task, identify, utilizing a learning algorithm, important synapses between the neurons for the first task based on the important neurons identified, and rigidify the important synapses to achieve selective plasticity of the series of neurons in the artificial neural network.

Scene-to-Text Conversion
20210397842 · 2021-12-23 ·

Various implementations disclosed herein include devices, systems, and methods for performing scene-to-text conversion. In various implementations, a device includes a non-transitory memory and one or more processors coupled with the non-transitory memory. In some implementations, a method includes obtaining environmental data corresponding to an environment. Based on the environmental data, a plurality of objects that are in the environment are identified. An audio output describing at least a first object of the plurality of objects in the environment is generated based on a characteristic value associated with a user of the device. The audio output is outputted.

IMAGE GENERATION USING ONE OR MORE NEURAL NETWORKS
20210398338 · 2021-12-23 ·

Apparatuses, systems, and techniques are presented to generate view-specific representations of an object or environment. In at least one embodiment, one or more neural networks are used to generate one or more images based, at least in part, on two or more two-dimensional (2D) images having different frames of reference.

AUTOMATIC IDENTIFICATION OF MISLEADING VIDEOS USING A COMPUTER NETWORK

Machine-based video classifying to identify misleading videos by training a model using a video corpus, obtaining a subject video from a content server, generating respective feature vectors of a title, a thumbnail, a description, and a content of the subject video, determining first semantic similarities between ones of the feature vectors, determining a second semantic similarity between the title of the subject video and titles of videos in the misleading video corpus in a same domain as the subject video, determining a third semantic similarity between comments of the subject video and comments of videos in the misleading video corpus in the same domain as the subject video, classifying the subject video using the model and based on the first semantic similarities, the second semantic similarity, and the third semantic similarity, and outputting the classification of the subject video to a user.
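The first semantic similarities, taken between pairs of the title, thumbnail, description, and content feature vectors, might be computed as in the following sketch. The abstract does not name a similarity measure, so cosine similarity is an assumption here, and the function names and toy vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pairwise_similarities(features):
    """Map each unordered pair of feature names to its cosine similarity."""
    return {(a, b): cosine(features[a], features[b])
            for a, b in combinations(sorted(features), 2)}


# Toy two-dimensional vectors; real feature vectors would come from
# text and image encoders that the abstract does not describe.
features = {
    "title": [1.0, 0.0],
    "thumbnail": [0.0, 1.0],
    "description": [1.0, 0.0],
    "content": [1.0, 1.0],
}
first_similarities = pairwise_similarities(features)
```

A low similarity between, say, the thumbnail and the content is the kind of signal such a classifier could use, although the abstract leaves the decision rule to the trained model.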

OPTICAL CHARACTER RECOGNITION OF DOCUMENTS HAVING NON-COPLANAR REGIONS
20210390328 · 2021-12-16 ·

Systems and methods for performing OCR of an image depicting text symbols and imaging a document having a plurality of planar regions are disclosed. An example method comprises: receiving a first image of a document having a plurality of planar regions and one or more second images of the document; identifying a plurality of coordinate transformations corresponding to each of the planar regions of the first image of the document; identifying, using the plurality of coordinate transformations, a cluster of symbol sequences of the text in the first image and in the one or more second images; and producing a resulting OCR text comprising a median symbol sequence for the cluster of symbol sequences.
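One way to picture the median symbol sequence of a cluster is sketched below. The abstract does not say how the median is computed; this sketch assumes a set median, meaning the cluster member with the smallest total Levenshtein distance to the other members, which is a common approximation. The function names and the sample cluster are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two symbol sequences (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def set_median(cluster):
    """Cluster member minimising the summed edit distance to all members.

    Raises ValueError on an empty cluster.
    """
    return min(cluster,
               key=lambda s: sum(levenshtein(s, t) for t in cluster))


# Recognition results for the same text fragment in several images.
cluster = ["lnvoice", "Invoice", "Invoice", "Invo1ce"]
best = set_median(cluster)
```

A true median string may lie outside the cluster; restricting the search to cluster members keeps the sketch short at the cost of that generality.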

LOCAL SELF-ATTENTION COMPUTER VISION NEURAL NETWORKS

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for processing images using a computer vision neural network that has one or more local self-attention layers. Each local self-attention layer is configured to apply one or more local self-attention mechanisms to the layer input to the local self-attention layer.
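A local self-attention mechanism restricts each position to attending over a small spatial neighbourhood instead of the whole image. The following is a minimal pure-Python sketch under simplifying assumptions that are not in the abstract: a single head, identity query, key and value projections, scaled dot-product scores, and a square window clipped at the image border. The function name and the radius parameter are hypothetical.

```python
import math


def local_self_attention(grid, radius=1):
    """Apply windowed self-attention to an H x W grid of feature vectors.

    Each output vector is a softmax-weighted average of the input vectors
    within `radius` pixels of the same position.
    """
    h, w = len(grid), len(grid[0])
    dim = len(grid[0][0])
    scale = 1.0 / math.sqrt(dim)
    out = [[None] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            query = grid[y][x]
            # Neighbourhood of the query position, clipped at the border.
            neigh = [grid[j][i]
                     for j in range(max(0, y - radius), min(h, y + radius + 1))
                     for i in range(max(0, x - radius), min(w, x + radius + 1))]
            scores = [scale * sum(q * k for q, k in zip(query, key))
                      for key in neigh]
            top = max(scores)  # subtract the max for numerical stability
            weights = [math.exp(s - top) for s in scores]
            total = sum(weights)
            out[y][x] = [sum(wt * v[d] for wt, v in zip(weights, neigh)) / total
                         for d in range(dim)]
    return out


image = [[[1.0, 0.0], [0.0, 1.0]],
         [[0.5, 0.5], [1.0, 1.0]]]
attended = local_self_attention(image, radius=1)
```

With radius=0 every position attends only to itself and the input is returned unchanged; a trained layer would add learned projections and typically several heads.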

Machine learning classifiers

In an implementation, a non-transitory machine-readable storage medium stores instructions that, when executed by a processor, cause the processor to allocate classifier data structures to persistent memory, read a number of categories from a set of training data, and populate the classifier data structures with training data including training-based category and word probabilities calculated based on the training data.
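The category and word probabilities mentioned above can be illustrated with a short sketch. The abstract does not name the classifier, so a naive Bayes style model with add-one (Laplace) smoothing is an assumption, as are the function name and the toy training set. Allocation in persistent memory is not shown; the sketch keeps everything in ordinary in-memory dictionaries.

```python
from collections import Counter, defaultdict


def train(samples):
    """Compute category priors and smoothed per-category word probabilities.

    samples: iterable of (category, text) pairs.
    Returns (cat_prob, word_prob), where word_prob[c][w] estimates P(w | c).
    """
    cat_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for category, text in samples:
        cat_counts[category] += 1
        words = text.lower().split()
        word_counts[category].update(words)
        vocab.update(words)

    total = sum(cat_counts.values())
    cat_prob = {c: n / total for c, n in cat_counts.items()}

    word_prob = {}
    for c in cat_counts:
        # Add-one smoothing: every vocabulary word gets one extra count.
        denom = sum(word_counts[c].values()) + len(vocab)
        word_prob[c] = {w: (word_counts[c][w] + 1) / denom for w in vocab}
    return cat_prob, word_prob


training_data = [("spam", "buy now"),
                 ("spam", "buy cheap"),
                 ("ham", "hello friend")]
cat_prob, word_prob = train(training_data)
```

The number of categories is read off the training data as len(cat_prob); smoothing keeps words unseen in a category from receiving zero probability.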