Patent classifications
G06V10/86
SUPERVISED CONTRASTIVE LEARNING FOR VISUAL GROUNDING
A method of training a neural network model includes: generating a positive image based on an original image; generating a positive text corresponding to the positive image based on an original text corresponding to the original image, the positive text referring to an object in the positive image; constructing a positive image-text pair for the object based on the positive image and the positive text; constructing a negative image-text pair for the object based on the original image and a negative text, the negative text not referring to the object; training the neural network model based on the positive image-text pair and the negative image-text pair to output features representing an input image-text pair; and identifying the object in the original image based on the features representing the input image-text pair.
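A minimal sketch of the training signal the abstract describes: a positive image-text pair is pulled together in feature space while a negative pair (text not referring to the object) is pushed apart. The encoders, the augmentation that produces the positive image/text, and the temperature value are assumptions here, not details from the patent.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_feat, pos_txt_feat, neg_txt_feat, temperature=0.07):
    """InfoNCE-style loss over one positive and one negative text per image."""
    img = F.normalize(img_feat, dim=-1)
    pos = F.normalize(pos_txt_feat, dim=-1)
    neg = F.normalize(neg_txt_feat, dim=-1)
    # Similarity of each image to its positive and its negative text.
    logits = torch.stack(
        [(img * pos).sum(-1), (img * neg).sum(-1)], dim=-1
    ) / temperature
    # The positive text is always at index 0.
    labels = torch.zeros(img.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)

# Toy usage: random features stand in for image/text encoder outputs.
B, D = 8, 256
img = torch.randn(B, D, requires_grad=True)
loss = contrastive_loss(img, torch.randn(B, D), torch.randn(B, D))
loss.backward()
```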
Automatic delineation and extraction of tabular data in portable document format using graph neural networks
Aspects of the present invention disclose a method for automatic delineation and extraction of tabular data in portable document format (PDF). The method includes one or more processors extracting metadata corresponding to tabular data in a text-based PDF, wherein the metadata is associated with the characters and border lines of the tabular data. The method further includes generating a graph structure corresponding to the tabular data in the text-based PDF based at least in part on the metadata. The method further includes generating a vector representation of the graph structure. The method further includes constructing a tree structure corresponding to the tabular data based at least in part on the vector representation.
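A rough sketch of the graph-construction step only, assuming character-level metadata (text plus bounding box) has already been extracted from the text-based PDF by some parsing library. Nodes are characters; edges connect characters that are roughly row- or column-adjacent, approximating the table structure. The gap thresholds and edge rules are illustrative assumptions, not the patented pipeline.

```python
import networkx as nx

def build_table_graph(chars, x_gap=5.0, y_gap=3.0):
    """chars: list of dicts like {"text": "A", "x0": ..., "y0": ..., "x1": ..., "y1": ...}."""
    g = nx.Graph()
    for i, c in enumerate(chars):
        g.add_node(i, **c)  # keep the raw metadata as node attributes
    for i, a in enumerate(chars):
        for j, b in enumerate(chars):
            if j <= i:
                continue
            same_row = abs(a["y0"] - b["y0"]) <= y_gap
            same_col = abs(a["x0"] - b["x0"]) <= x_gap
            # Horizontally adjacent characters in the same row, or
            # vertically adjacent characters in the same column.
            if same_row and abs(a["x1"] - b["x0"]) <= x_gap:
                g.add_edge(i, j, kind="row")
            elif same_col and abs(a["y1"] - b["y0"]) <= y_gap:
                g.add_edge(i, j, kind="col")
    return g

cells = [
    {"text": "A", "x0": 0, "y0": 0, "x1": 10, "y1": 12},
    {"text": "B", "x0": 12, "y0": 0, "x1": 22, "y1": 12},
]
print(build_table_graph(cells).edges(data=True))  # one "row" edge A-B
```

The vector representation of this graph would then come from a graph neural network, which is outside the scope of this sketch.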
VIDEO QUESTION ANSWERING METHOD, ELECTRONIC DEVICE AND STORAGE MEDIUM
There is provided a video question answering method and apparatus, an electronic device, and a storage medium, relating to the field of artificial intelligence, such as natural language processing, deep learning, voice recognition, knowledge graph, and computer vision technologies. The method includes: determining, for a video, M key frames corresponding to a question to be answered, M being a positive integer greater than 1 and less than or equal to the number of video frames in the video; and determining an answer to the question according to the M key frames.
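A simplified sketch of the key-frame step: score each frame's feature vector against the question's feature vector and keep the top M. The abstract does not specify the selection criterion; cosine similarity is assumed here for illustration, and random vectors stand in for the frame and question encoders.

```python
import numpy as np

def select_key_frames(frame_feats, question_feat, m):
    """Return indices of the M frames most similar to the question."""
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    q = question_feat / np.linalg.norm(question_feat)
    scores = f @ q                       # cosine similarity per frame
    return np.argsort(scores)[::-1][:m]  # indices of the M best frames

rng = np.random.default_rng(0)
frames = rng.normal(size=(120, 512))     # 120 frames, 512-d features
question = rng.normal(size=512)
print(select_key_frames(frames, question, m=8))
```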
SYSTEM AND METHOD FOR ONTOLOGY GUIDED INDOOR SCENE UNDERSTANDING FOR COGNITIVE ROBOTIC TASKS
Existing cognitive robotic applications follow a practice of building specific applications for specific use cases. However, knowledge of the world and its semantics are common to a robot across multiple tasks. In this disclosure, to enable the use of knowledge across multiple scenarios, a method and system for ontology-guided indoor scene understanding for cognitive robotic tasks is described, wherein scenes are processed with techniques selected by querying an ontology with the relevant objects in the perceived scene, to generate a semantically rich scene graph. Herein, an initially manually created ontology is updated and refined in an online fashion using an external knowledge base, human-robot interaction, and perceived information. This knowledge supports semantic navigation and aids speech- and text-based human-robot interactions.
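An illustrative sketch of querying an ontology with objects detected in a scene to enrich a scene graph. The tiny dict-based ontology and the relation names are invented for illustration; the disclosure's ontology is built manually and refined online from external knowledge bases and interaction.

```python
# Hypothetical toy ontology; entries and relations are made up.
ONTOLOGY = {
    "cup":   {"is_a": "container", "typically_on": ["table", "shelf"]},
    "table": {"is_a": "furniture", "found_in": ["kitchen", "office"]},
}

def enrich_scene_graph(detected_objects):
    """Return (subject, relation, object) triples for the detected objects."""
    triples = []
    for obj in detected_objects:
        for rel, val in ONTOLOGY.get(obj, {}).items():
            vals = val if isinstance(val, list) else [val]
            triples.extend((obj, rel, v) for v in vals)
    return triples

print(enrich_scene_graph(["cup", "table"]))
```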
VISUALIZATION METHOD, PROGRAM FOR THE SAME, VISUALIZATION DEVICE, AND DISCRIMINATION DEVICE HAVING THE SAME
The second multi-dimensional feature vectors 92a of sample image data 34a having instruction signals, converted by a feature converter 27, are read in (Step S10); two-dimensional graph data for model 36a is generated based on the read second multi-dimensional feature vectors 92a and stored (Step S12); and two-dimensional model graphs Og and Ng are generated based on the generated two-dimensional graph data for model 36a and displayed in the window 62 (Step S14). The second multi-dimensional feature vectors 92a are indicators suitable for visualizing the trained state (individuality) of a trained model 35. Thus, it is possible to visually check and evaluate whether the trained model 35 is in an appropriately trained state (individuality).
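A generic sketch of the visualization idea: project the multi-dimensional feature vectors of labeled sample images into two dimensions and plot them per class, so the trained model's feature space can be inspected by eye. PCA stands in here for whatever conversion the feature converter 27 actually performs, and treating the two model graphs Og and Ng as two label classes is an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 64))     # stand-ins for feature vectors 92a
labels = rng.integers(0, 2, size=200)  # stand-ins for the instruction signals

# Two-dimensional graph data derived from the feature vectors.
xy = PCA(n_components=2).fit_transform(feats)
for lab, name in [(0, "Og"), (1, "Ng")]:
    pts = xy[labels == lab]
    plt.scatter(pts[:, 0], pts[:, 1], label=name, s=10)
plt.legend()
plt.show()
```

Well-separated clusters in such a plot would suggest an appropriately trained feature space; heavy overlap would suggest the opposite.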
NON-TRANSITORY COMPUTER-READABLE RECORDING MEDIUM, INFORMATION PROCESSING METHOD, AND INFORMATION PROCESSING APPARATUS
An information processing apparatus acquires video image data that includes target objects, including a person and an object, and specifies, by using graph data stored in a storage unit that indicates relationships between target objects, a relationship between the target objects included in the acquired video image data. The information processing apparatus specifies, by using a feature value of the person included in the acquired video image data, a behavior of the person in the video image data. The information processing apparatus predicts, by inputting the specified behavior of the person and the specified relationship into a probability model, a future behavior or a future state of the person.
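A toy sketch of the prediction step: a probability model, represented here as a simple conditional distribution table, maps the specified current behavior and person-object relationship to a distribution over future behaviors. The behaviors, relationships, and probabilities below are invented for illustration.

```python
# Hypothetical conditional distributions: (behavior, relationship) -> P(future behavior).
PROB_MODEL = {
    ("holding_item", "near_exit"):     {"leave_store": 0.7, "return_item": 0.3},
    ("holding_item", "near_register"): {"purchase": 0.9, "leave_store": 0.1},
}

def predict_future_behavior(behavior, relationship):
    """Return the most probable future behavior, or None for an unseen pair."""
    dist = PROB_MODEL.get((behavior, relationship), {})
    return max(dist, key=dist.get) if dist else None

print(predict_future_behavior("holding_item", "near_register"))  # -> purchase
```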