Scene-aware video dialog
11210523 · 2021-12-28
Assignee
Inventors
- Shijie Geng (Piscataway, NJ, US)
- Peng Gao (Cambridge, MA, US)
- Anoop Cherian (Belmont, MA, US)
- Chiori Hori (Lexington, MA, US)
- Jonathan Le Roux (Arlington, MA, US)
Cpc classification
G06V20/41
PHYSICS
G06V10/84
PHYSICS
G06F18/2113
PHYSICS
G06F16/9035
PHYSICS
International classification
Abstract
A scene-aware dialog system includes an input interface to receive a sequence of video frames, contextual information, and a query, and a memory configured to store neural networks trained to generate a response to the input query by analyzing one or a combination of the input sequence of video frames and the input contextual information. The system further includes a processor configured to detect and classify objects in each video frame of the sequence of video frames; determine relationships among the classified objects in each of the video frames; extract features representing the classified objects and the determined relationships for each of the video frames to produce a sequence of feature vectors; and submit the sequence of feature vectors, the input query and the input contextual information to the neural networks to generate a response to the input query.
Claims
1. A scene-aware dialog system, comprising: an input interface configured to receive a sequence of video frames, contextual information, and a query; a memory configured to store at least one neural network comprising a visual scene-aware dialog neural network trained to generate a response to the input query by analyzing one or combination of the input sequence of video frames and the input contextual information provided to the neural network; a processor configured to detect and classify objects in each video frame of the sequence of video frames; integrate region of interests of objects in the sequence of video frames to determine relationships among the classified objects in each of the video frame, wherein at least one video frame of the sequence of video frames includes at least two classified objects, and wherein the relationship between the two classified objects is an intra-frame object relationship confined within the video frame; extract features representing the classified objects and the determined relationships for each of the video frame to produce a sequence of feature vectors, wherein there is one feature vector for one video frame; and submit the sequence of feature vectors, the input query and the input contextual information to the neural network to generate a response to the input query; and an output interface to render the response to the input query.
2. The scene-aware dialog system of claim 1, wherein the input query concerns one or combination of objects, relationships among the objects, and temporal evolutions of the objects in the input sequence of video frames, and wherein the contextual information includes one or combination of audio information and textual information about the input video, such that the neural network is a multi-modal neural network configured to process information of modalities.
3. The scene-aware dialog system of claim 2, wherein the processor is further configured to modify values of each feature vector of the sequence of feature vectors with weighted values of neighboring feature vectors in the sequence of feature vectors.
4. The scene-aware dialog system of claim 3, wherein the values of each of the feature vector are determined as a weighted combination of values of multiple feature vectors fitting a window centered on the feature vector.
5. The scene-aware dialog system of claim 3, wherein the at least one neural network stored in the memory includes an audio visual scene aware dialog (AVSD) neural network trained to prepare the response to the input query, a feature extraction neural network trained to represent the objects and the corresponding relationships among the objects in the sequence of video frames with the sequence of feature vectors, and an aggregation neural network trained to determine the values of each feature vectors of the sequence of feature vectors as a weighted combination of values of multiple feature vectors fitting the window centered on the feature vector.
6. The scene-aware dialog system of claim 5, wherein the AVSD neural network corresponds to an attention-based architecture and includes one or combination of a faster region-based convolutional neural network (faster RCNN) and a 3-dimensional (3D) convolutional neural network (CNN).
7. The scene-aware dialog system of claim 1, wherein the memory stores a set of neural network based classifiers comprising an object classifier configured to detect and classify a predefined type of objects in the input sequence of video frames and a relationship classifier to classify relationships among the classified objects, and wherein the processor is configured to select and execute the selected neural network based classifiers to detect and classify the objects and corresponding relationships among the classified objects in each video frame of the input sequence of video frames.
8. The scene-aware dialog system of claim 7, wherein the processor is further configured to select the object classifier and the relationship classifier from the set of neural network based classifiers based on the input sequence of video frames, the input contextual information, the input query or combination thereof.
9. The scene-aware dialog system of claim 1, wherein the memory stores an object and a relationship classifiers configured to detect and classify objects and their relationship relevant for generating navigation instructions for driving a vehicle, and wherein the processor is configured to generate a navigation instruction using a description and a relationships of an object pertinent to a navigation route to a destination of the vehicle.
10. The scene-aware dialog system of claim 1, wherein the processor is further configured to generate a spatio-temporal scene graph representation (STSGR) model for each frame of the sequence of video frames based on an integrated region of interests and the visual memory, and wherein the at least one neural network is trained to perform spatio-temporal relational learning on training STSGR models of the sequence of video frames to generate responses to training queries.
11. The scene-aware dialog system of claim 10, wherein each STSGR model represents each corresponding video frame as a spatio-temporal visual graphs stream and a semantic graph stream, and wherein the at least one neural network is a multi-head shuffled transformer for generating an object-level graph reasoning, the multi-head shuffled transformer enable shuffling heads of the sequence of feature vectors.
12. The scene-aware dialog system of claim 1, wherein the processor is further configured to aggregate the classified objects and the determined relationships for generating visual memory for each video frame of the sequence of video frames.
13. A scene-aware dialog method, wherein the method uses a processor coupled with stored instructions implementing the method, wherein the instructions, when executed by the processor carry out steps of the method, comprising: receiving a sequence of video frames, contextual information, and a query; detecting and classifying objects in each video frame of the sequence of video frames; integrating region of interests of objects in the sequence of video frames for determining relationships among the classified objects in each of the video frame, wherein at least one video frame of the sequence of video frames includes at least two classified objects, and wherein the relationship between the two classified objects is an intra-frame object relationship confined within the video frame; extracting features representing the classified objects and the determined relationships for each of the video frame to produce a sequence of feature vectors, wherein there is one feature vector for one video frame; submitting the sequence of feature vectors, the input query and the input contextual information to at least one neural network comprising a visual scene-aware dialog neural network trained to generate a response to the input query by analyzing one or combination of input sequence of video frames and the input contextual information; and rendering the response to the input query via an output interface.
14. The method of claim 13, wherein the input query concerns one or combination of objects, relationships among the objects, and temporal evolutions of the objects in the input sequence of video frames, and wherein the contextual information includes one or combination of audio information and textual information about the input video, such that the neural network is a multi-modal neural network configured to process information of different modalities.
15. The method of claim 14, further comprising modifying values of each feature vector of the sequence of feature vectors with weighted values of neighboring feature vectors in the sequence of feature vectors, the values of the of each of the feature vector are determined as a weighted combination of values of multiple feature vectors fitting a window centered on the feature vector.
16. The method of claim 15, wherein the at least one neural network includes an audio visual scene aware dialog (AVSD) neural network trained to prepare the response to the input query, a feature extraction neural network trained to represent the objects and the corresponding relationship among the objects in the sequence of video frames with the sequence of feature vectors, and an aggregation neural network trained to determine values of each feature vectors of the sequence of feature vectors as a weighted combination of values of multiple feature vectors fitting the window centered on the feature vector.
17. The method of claim 16, further comprising selecting an object classifier and a relationship classifier from a set of neural network based classifiers and executing the selected object classifier for detecting and classifying a predefined type of objects in the input sequence of video frames and the relationship classifier for classifying relationships among the classified objects, the selection of the neural network based classifiers based on the input contextual information, the input sequence of video frames, the input query, or combination thereof.
18. The method of claim 13, further comprising generating a spatio-temporal scene graph representation (STSGR) model for each frame of the sequence of video frames, each STSGR model represents each corresponding video frame as a spatio-temporal visual graphs stream and a semantic graph stream, wherein the at least one neural network is a multi-head shuffled transformer for generating an object-level graph reasoning and wherein the neural network is trained to perform spatio-temporal relational learning on training STSGR models of the sequence of video frames to generate responses to training queries.
19. The method of claim 13, further comprising aggregating the classified objects and the determined relationships for generating visual memory for each video frame of the sequence of video frames.
20. A scene-aware dialog system, comprising: an input interface configured to receive a sequence of video frames, contextual information, and a query; a memory configured to store at least one neural network comprising a visual scene-aware dialog neural network trained to generate a response to the input query by analyzing one or combination of the input sequence of video frames and the input contextual information provided to the neural network, wherein the input query concerns one or combination of objects, relationships among the objects, and temporal evolutions of the objects in the input sequence of video frames, and wherein the contextual information includes one or combination of audio information and textual information about the input video, such that the neural network is a multi-modal neural network configured to process information of modalities; a processor configured to detect and classify objects in each video frame of the sequence of video frames; determine relationships among the classified objects in each of the video frame, wherein at least one video frame of the sequence of video frames includes at least two classified objects, and wherein the relationship between the two classified objects is an intra-frame object relationship confined within the video frame; extract features representing the classified objects and the determined relationships for each of the video frame to produce a sequence of feature vectors, wherein there is one feature vector for one video frame; modify values of each feature vector of the sequence of feature vectors with weighted values of neighboring feature vectors in the sequence of feature vectors; and submit the sequence of feature vectors, the input query and the input contextual information to the neural network to generate a response to the input query; and an output interface to render the response to the input query.
21. A scene-aware dialog system, comprising: an input interface configured to receive a sequence of video frames, contextual information, and a query; a memory configured to store: at least one neural network comprising a visual scene-aware dialog neural network trained to generate a response to the input query by analyzing one or combination of input sequence of video frames and the input contextual information provided to the neural network; and a set of neural network based classifiers comprising an object classifier configured to detect and classify a predefined type of objects in the input sequence of video frames and a relationship classifier to classify relationships among the classified objects; a processor configured to: select the object classifier and the relationship classifier from the set of neural network based classifiers based on the input sequence of video frames, the input contextual information, the input query or combination thereof; execute the selected object classifier and relationship classifier to detect and classify objects and corresponding relationships among the classified objects in each video frame of the input sequence of video frames; determine relationships among the classified objects in each of the video frame, wherein at least one video frame of the sequence of video frames includes at least two classified objects, and wherein the relationship between the two classified objects is an intra-frame object relationship confined within the video frame; extract features representing the classified objects and the determined relationships for each of the video frame to produce a sequence of feature vectors, wherein there is one feature vector for one video frame; and submit the sequence of feature vectors, the input query and the input contextual information to the neural network to generate a response to the input query; and an output interface to render the response to the input query.
22. A scene-aware dialog system, comprising: an input interface configured to receive a sequence of video frames, contextual information, and a query; a memory configured to store: at least one neural network comprising a visual scene-aware dialog neural network trained to generate a response to the input query by analyzing one or combination of input sequence of video frames and the input contextual information provided to the neural network; and an object classifier and a relationship classifier configured to detect and classify objects and their relationship relevant for generating navigation instructions for driving a vehicle; a processor configured to detect and classify objects in each video frame of the sequence of video frames; determine relationships among the classified objects in each of the video frame, wherein at least one video frame of the sequence of video frames includes at least two classified objects, and wherein the relationship between the two classified objects is an intra-frame object relationship confined within the video frame; extract features representing the classified objects and the determined relationships for each of the video frame to produce a sequence of feature vectors, wherein there is one feature vector for one video frame; and submit the sequence of feature vectors, the input query and the input contextual information to the neural network to generate a response to the input query; and an output interface to render the response to the input query, wherein the processor is further configured to generate a navigation instruction using a description and a relationships of an object pertinent to a navigation route to a destination of the vehicle.
23. A scene-aware dialog system, comprising: an input interface configured to receive a sequence of video frames, contextual information, and a query; a memory configured to store at least one neural network comprising a visual scene-aware dialog neural network trained to generate a response to the input query by analyzing one or combination of the input sequence of video frames and the input contextual information provided to the neural network; a processor configured to detect and classify objects in each video frame of the sequence of video frames; determine relationships among the classified objects in each of the video frame, wherein at least one video frame of the sequence of video frames includes at least two classified objects, and wherein the relationship between the two classified objects is an intra-frame object relationship confined within the video frame; extract features representing the classified objects and the determined relationships for each of the video frame to produce a sequence of feature vectors, wherein there is one feature vector for one video frame; and submit the sequence of feature vectors, the input query and the input contextual information to the neural network to generate a response to the input query; and an output interface to render the response to the input query, wherein the processor is further configured to generate a spatio-temporal scene graph representation (STSGR) model for each frame of the sequence of video frames based on an integrated region of interests and the visual memory, and wherein the at least one neural network is trained to perform spatio-temporal relational learning on training STSGR models of the sequence of video frames to generate responses to training queries.
24. A scene-aware dialog method, wherein the method uses a processor coupled with stored instructions implementing the method, wherein the instructions, when executed by the processor carry out steps of the method, comprising: receiving a sequence of video frames, contextual information, and a query, wherein the input query concerns one or combination of objects in the input sequence of video frames, relationships among the objects in the input sequence of video frames, and temporal evolutions of the objects in the input sequence of video frames, and wherein the contextual information includes one or combination of audio information and textual information about the input video; detecting and classifying objects in each video frame of the sequence of video frames; determining relationships among the classified objects in each of the video frame, wherein at least one video frame of the sequence of video frames includes at least two classified objects, and wherein the relationship between the two classified objects is an intra-frame object relationship confined within the video frame; extracting features representing the classified objects and the determined relationships for each of the video frame to produce a sequence of feature vectors, wherein there is one feature vector for one video frame; modifying values of each feature vector of the sequence of feature vectors with weighted values of neighboring feature vectors in the sequence of feature vectors, the values of the of each of the feature vector are determined as a weighted combination of values of multiple feature vectors fitting a window centered on the feature vector; submitting the sequence of feature vectors, the input query and the input contextual information to at least one neural network comprising a visual scene-aware dialog neural network trained to generate a response to the input query by analyzing one or combination of input sequence of video frames and the input contextual information, wherein the at least one neural network is a multi-modal neural network configured to process information of different modalities; and rendering the response to the input query via an output interface.
25. A scene-aware dialog method, wherein the method uses a processor coupled with stored instructions implementing the method, wherein the instructions, when executed by the processor carry out steps of the method, comprising: receiving a sequence of video frames, contextual information, and a query; detecting and classifying objects in each video frame of the sequence of video frames; determining relationships among the classified objects in each of the video frame, wherein at least one video frame of the sequence of video frames includes at least two classified objects, and wherein the relationship between the two classified objects is an intra-frame object relationship confined within the video frame; extracting features representing the classified objects and the determined relationships for each of the video frame to produce a sequence of feature vectors, wherein there is one feature vector for one video frame; generating a spatio-temporal scene graph representation (STSGR) model for each frame of the sequence of video frames, each STSGR model represents each corresponding video frame as a spatio-temporal visual graphs stream and a semantic graph stream; submitting the sequence of feature vectors, the input query and the input contextual information to at least one neural network comprising a visual scene-aware dialog neural network trained to generate a response to the input query by analyzing one or combination of input sequence of video frames and the input contextual information, wherein the at least one neural network is a multi-head shuffled transformer for generating an object-level graph reasoning and wherein the neural network is trained to perform spatio-temporal relational learning on training STSGR models of the sequence of video frames to generate responses to training queries; and rendering the response to the input query via an output interface.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION
(18) In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be apparent, however, to one skilled in the art that the present disclosure may be practiced without these specific details. In other instances, apparatuses and methods are shown in block diagram form only in order to avoid obscuring the present disclosure.
(19) As used in this specification and claims, the terms “for example,” “for instance,” and “such as,” and the verbs “comprising,” “having,” “including,” and their other verb forms, when used in conjunction with a listing of one or more components or other items, are each to be construed as open ended, meaning that the listing is not to be considered as excluding other, additional components or items. The term “based on” means at least partially based on. Further, it is to be understood that the phraseology and terminology employed herein are for the purpose of the description and should not be regarded as limiting. Any heading utilized within this description is for convenience only and has no legal or limiting effect.
(20) System Overview
(21)
(22) The processor 104 is configured to detect and classify objects in each video frame of the sequence of video frames. The processor 104 is further configured to determine relationships among the objects in each video frame of the sequence of video frames and to extract features representing the objects and their relationships in each video frame in order to generate a sequence of feature vectors. Each feature vector of the sequence of feature vectors corresponds to one video frame of the sequence of video frames. The processor 104 is further configured to submit the sequence of feature vectors, the query and the contextual information to one or more neural networks stored in the memory 106 to generate a response to the query, where the query concerns one or a combination of objects, relationships among the objects, and temporal evolutions of the objects in the input sequence of video frames. The one or more neural networks are trained to generate the response to the query by analyzing one or a combination of the input sequence of video frames and the input contextual information, such that the one or more neural networks are multi-modal neural networks configured to process information of different modalities. The one or more trained neural networks include an audio-visual scene aware neural network 108, a feature extraction neural network 110 and an aggregation neural network 112.
(23) For example, the audio-visual scene aware neural network 108 is trained to process features (e.g., scene graph representations) that represent the scene in each video frame of the sequence of video frames to generate a response. Examples of the audio-visual scene aware neural network 108 include one or a combination of a faster region-based convolutional neural network (faster RCNN) and a 3-dimensional (3D) convolutional neural network (CNN). The scene graph representations provide spatial information of each video frame, which includes features of objects in the video frame and relationships among the objects. The processor 104 is further configured to combine the spatial information with temporal information of each video frame to generate a spatio-temporal scene graph representation (STSGR) model. More specifically, the processor 104 generates the STSGR model for each video frame of the sequence of video frames based on an integrated region of interests. Further, the one or more neural networks are trained to utilize each STSGR model for performing spatio-temporal learning on training STSGR models of the sequence of video frames such that a relevant response to the query is generated. The spatio-temporal reasoning captures visual and semantic information flows inside videos. Accordingly, each STSGR model represents each corresponding video frame as a spatio-temporal visual graph stream and a semantic graph stream. This allows object-level graph reasoning for generating responses relevant to queries about the video. In some embodiments, the one or more neural networks are multi-head shuffled transformers for generating the object-level graph reasoning, where the multi-head shuffled transformers enable shuffling heads of the sequence of feature vectors.
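The head shuffling mentioned above can be illustrated with a small sketch. This passage does not specify how the shuffled transformer permutes heads, so the following is only one possible reading, assuming each feature vector is split into equal-sized head chunks and the same random head order is applied to every vector in the sequence. The function name `shuffle_heads` and the toy data are hypothetical.

```python
import random


def shuffle_heads(sequence, num_heads, rng):
    """Permute the head chunks of every feature vector in a sequence.

    sequence:  list of equal-length feature vectors (lists of numbers)
    num_heads: number of equal-sized chunks ("heads") per vector
    rng:       a random.Random instance, passed in for reproducibility

    Assumption: one permutation is shared by all vectors in the sequence.
    """
    dim = len(sequence[0])
    if dim % num_heads != 0:
        raise ValueError("feature dimension must be divisible by num_heads")
    size = dim // num_heads

    # Draw a single random ordering of the head indices.
    order = list(range(num_heads))
    rng.shuffle(order)

    shuffled = []
    for vec in sequence:
        # Split the vector into per-head chunks, then re-concatenate
        # the chunks in the shuffled order.
        heads = [vec[i * size:(i + 1) * size] for i in range(num_heads)]
        shuffled.append([x for h in order for x in heads[h]])
    return shuffled, order


if __name__ == "__main__":
    frames = [[1, 2, 3, 4], [5, 6, 7, 8]]  # two frames, two heads of size 2
    out, order = shuffle_heads(frames, num_heads=2, rng=random.Random(0))
    print(order, out)
```

The values inside each head chunk are left untouched; only the order of the chunks changes, so the shuffled vector carries the same information in a different head arrangement.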
(24) In some embodiments, the audio-visual scene aware neural network 108 corresponds to an attention-based network architecture that computes an attention coefficient for each edge in the scene graph representation. The attention coefficient provides a value that determines the importance between two linked nodes of a graph, such as the STSGR. The attention coefficient is utilized to selectively focus on salient features, such as the classified objects and the relationships, for the spatio-temporal reasoning.
(25) Further, the processor 104 is configured to extract feature vectors from the generated scene graph representations by utilizing the feature extraction neural network 110. The feature extraction neural network 110 corresponds to a pre-trained neural network that extracts the feature vectors from the generated scene graph representations and generates a sequence of feature vectors corresponding to the sequence of scene graph representations of the sequence of video frames. Further, the processor 104 is configured to modify values of each feature vector of the sequence of feature vectors with weighted values of neighboring feature vectors in the sequence of feature vectors.
(26) In particular, the feature extraction neural network 110 performs a frame-level intra-graph reasoning for extracting the feature vectors. The feature vectors herein correspond to visual graph memories of the sequence of scene graph representations.
(27) In one implementation, the intra-graph reasoning is performed by aggregating node-based features (i.e. object features) of the scene graph representations and aggregating edge-based features (i.e. relationship features) of the scene graph representations. Further, the intra-graph reasoning uses an attention based neural network for computing a weight value (i.e. a self-attention value) for a pair of linked nodes. The weight value indicates importance of a node paired to another node. In a similar manner, weights of other paired nodes are determined using the attention based neural network. Further, a weighted sum of neighboring nodes (i.e. neighboring objects) in the scene graph representation is computed based on all the weight values. The processor 104 updates features of each node in each corresponding scene graph representation based on the weighted sum.
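The intra-graph reasoning step described above can be sketched in a few lines. This is a minimal illustration and not the trained attention network of the disclosure: the dot-product scoring, the softmax normalization, the function name `attend_neighbors`, and the toy object features are all assumptions made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_neighbors(node_feats, edges):
    """Update each node as an attention-weighted sum of its linked neighbors.

    node_feats: dict mapping node id -> feature vector (list of floats)
    edges:      iterable of (a, b) pairs, treated as undirected links

    Assumption: the weight for a pair of linked nodes is a softmax over
    dot-product scores; the text only says an attention-based network
    computes it.
    """
    neighbors = {n: [] for n in node_feats}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    updated = {}
    for n, feat in node_feats.items():
        nbrs = neighbors[n]
        if not nbrs:
            # Isolated node: nothing to aggregate, keep its features.
            updated[n] = list(feat)
            continue
        # One weight per linked pair (n, m).
        scores = [sum(x * y for x, y in zip(feat, node_feats[m])) for m in nbrs]
        weights = softmax(scores)
        # Weighted sum of the neighboring nodes' features.
        updated[n] = [
            sum(w * node_feats[m][k] for w, m in zip(weights, nbrs))
            for k in range(len(feat))
        ]
    return updated


if __name__ == "__main__":
    feats = {"person": [1.0, 0.0], "cup": [0.0, 1.0], "table": [1.0, 1.0]}
    links = [("person", "cup"), ("person", "table")]
    print(attend_neighbors(feats, links))
```

Because the weights for each node sum to one, the updated feature is a convex combination of its neighbors' features, which is what lets a salient neighbor dominate the update.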
(28) Furthermore, the processor 104 provides the weighted sum as input to a relationship neural network. In one implementation, the relationship neural network corresponds to a multi-layer fully connected network for generating relationship features from two connected node features of the updated scene graph representation. Thus, the updated scene graph representations are pooled into visual graph memories that provide the sequence of feature vectors. The sequence of feature vectors represents spatial representations and temporal representations of the input data 126. Further, to determine each feature vector of the sequence of feature vectors, the aggregation neural network 112 aggregates values of multiple feature vectors fitting a window centered on the feature vector. Each such feature vector of the sequence of feature vectors is a weighted combination of the values of the multiple feature vectors fitting the window centered on that feature vector. This allows aggregation of similar features about the objects and their relationships in neighboring video frames. The processor 104 is configured to utilize each feature vector of the sequence of feature vectors for generating the response to the query. Further, the generated response to the query is rendered on an output device 118 via the output interface 116. The output device 118 includes, but is not limited to, a computer, a laptop, a tablet, a phablet, or any display device. In some implementations, the output device 118 may include an application interface for rendering the response.
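The windowed aggregation of neighboring feature vectors can be sketched as below. In the disclosure the weights come from a trained aggregation neural network; this example instead assumes fixed, hand-picked window weights, and it assumes that windows clipped at the ends of the sequence are renormalized. The function name `aggregate_window` is hypothetical.

```python
def aggregate_window(vectors, weights):
    """Replace each feature vector by a weighted combination of the
    vectors inside a window centered on it.

    vectors: list of per-frame feature vectors (lists of floats)
    weights: odd-length list of window weights; the middle entry is
             applied to the center frame

    Assumptions: weights are fixed (not learned), and windows that run
    past either end of the sequence are clipped and renormalized.
    """
    if len(weights) % 2 == 0:
        raise ValueError("window must have odd length to be centered")
    half = len(weights) // 2
    dim = len(vectors[0])

    out = []
    for t in range(len(vectors)):
        acc = [0.0] * dim
        total = 0.0
        for offset, w in enumerate(weights, start=-half):
            j = t + offset
            if 0 <= j < len(vectors):  # skip positions outside the sequence
                total += w
                for k, v in enumerate(vectors[j]):
                    acc[k] += w * v
        out.append([a / total for a in acc])
    return out


if __name__ == "__main__":
    frames = [[0.0], [3.0], [6.0]]
    print(aggregate_window(frames, [1.0, 1.0, 1.0]))
```

A wider window smooths features over more neighboring frames, while a window of length one leaves the sequence unchanged.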
(29) In some embodiments, in order to improve the efficiency of the AVSD system 100, the processor 104 may be configured to extract a set of key frames from the sequence of video frames. The set of key frames includes video frames that represent transitions in the movement of a person or an object in the sequence of video frames of the video. The processor 104 may be configured to extract the set of key frames based on pre-trained models (e.g. Visual Genome) for AVSD applications. Extracting the set of key frames enables the processor 104 to process the video efficiently when generating the response to the query, because only the key frames, rather than every frame of the sequence, need to be processed. Accordingly, utilization of the key frames facilitates an efficient AVSD system (i.e. the AVSD system 100) for generating a response to a query for a video.
(30)
(31) In some embodiments, the processor 104 further detects and classifies objects in each video frame of the sequence of video frames 202a-202d. The audio-visual scene aware neural network 108 may include a set of instructions for object detection techniques such as a bounding box technique. The processor 104 may be configured to execute such instructions to detect the objects in each video frame. For instance, in video frame 202a, detected objects are indicated by bounding boxes, such as bounding box 204a and bounding box 204b. The detected objects are classified based on an object classifier of the audio-visual scene aware neural network 108. The object classifier may include a set of instructions to classify objects based on conventional object classification techniques. The processor 104 may be configured to execute the set of instructions of the object classifier in order to classify the objects in each video frame. The classified objects are labeled (not shown in
(32)
(33) In some embodiments, the audio-visual scene aware neural network 108 includes a set of classifiers (also known as a set of neural network based classifiers) for extracting features of the objects based on visual features or semantic features of the objects in each video frame of the sequence of video frames 202a-202d. The set of classifiers includes an object classifier and a relationship classifier. The processor 104 is configured to select the object classifier and the relationship classifier from the set of neural network based classifiers based on the input sequence of video frames, the input contextual information, the input query or a combination thereof. For instance, for the video frame 202a denoted by I, the object features denoted by F.sub.I, the bounding boxes denoted by B.sub.I, and the semantics denoted by S.sub.I can be extracted by a neural network based object classifier, such as a Faster region-based convolutional neural network (Faster R-CNN) object detection model, as
\[ F_I, B_I, S_I = \mathrm{RCNN}(I) \tag{1} \]
where F.sub.l∈R.sup.N.sup.
(34) In a similar manner, the processor 104 selects the relationship classifier for determining relationships among the classified objects in each video frame of the sequence of video frames 202a-202d. The relationship classifier recognizes visual relations between the classified objects in the sequence of video frames 202a-202d. Further, the relationship classifier generates a fixed number N.sub.r of relation proposals with the highest confidences. For instance, N.sub.r is set to 100. In some implementations, the visual relations in each video frame (i.e., each of the video frames 202a-202d) are determined using a relationship detection model that embeds objects and relations into vector spaces where both discriminative capability and semantic affinity are preserved. The relationship detection model is trained on a video dataset that contains 150 objects and 50 relationships indicated as predicates. When the relationship detection model is applied on the video frames 202a-202d, a set of subject S, predicate P and object O, i.e. <S, P, O>, is obtained as output for each video frame. In one embodiment, the original predicate semantics P are discarded, as the relation predicates of the relationship detection model trained on the video dataset are limited and fixed. Thus, the relation proposals are based on <S, O> pairs that are used to learn implicit relation semantics of the objects. In most cases, the N.sub.r relation proposals do not mention every detected object. The unmentioned objects are filtered out by graph pruning.
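The selection of relation proposals and the subsequent graph pruning can be illustrated with a minimal Python sketch. The tuple layout of a proposal, the function names `top_relation_pairs` and `prune_graph`, and the toy confidences are assumptions made for illustration only; the disclosure does not prescribe a particular data structure.

```python
from typing import List, Set, Tuple

# Hypothetical proposal layout: (subject_id, predicate, object_id, confidence).
Proposal = Tuple[int, str, int, float]


def top_relation_pairs(proposals: List[Proposal], n_r: int = 100) -> List[Tuple[int, int]]:
    """Keep the n_r most confident proposals and drop the predicate, leaving <S, O> pairs."""
    ranked = sorted(proposals, key=lambda p: p[3], reverse=True)[:n_r]
    pairs: List[Tuple[int, int]] = []
    for subj, _predicate, obj, _conf in ranked:
        if (subj, obj) not in pairs:  # predicates are discarded, so merge duplicate pairs
            pairs.append((subj, obj))
    return pairs


def prune_graph(num_objects: int, pairs: List[Tuple[int, int]]) -> List[int]:
    """Return the ids of objects mentioned in at least one retained <S, O> pair."""
    mentioned: Set[int] = set()
    for subj, obj in pairs:
        mentioned.update((subj, obj))
    return [i for i in range(num_objects) if i in mentioned]


# Toy frame with five detected objects and four relation proposals.
proposals = [
    (0, "holding", 1, 0.9),
    (0, "near", 1, 0.4),
    (2, "on", 1, 0.7),
    (3, "behind", 0, 0.1),
]
pairs = top_relation_pairs(proposals, n_r=3)
kept = prune_graph(num_objects=5, pairs=pairs)
print(pairs, kept)
```

In this toy frame, objects that appear in none of the retained pairs are dropped from the scene graph, which mirrors the pruning step described above.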
(35) Further, the processor 104 integrates regions of interest of objects in the sequence of video frames 202a-202d for determining relationships between two objects of the classified objects. More specifically, for the determined <S, O> pairs, a union box of bounding boxes (e.g., the bounding boxes 204a and 204b of
(36) Thus, the processor 104 extracts the sequence of features 208a-208d that includes features of both visual and semantic information using the feature extraction neural network 110. Further, the processor 104 uses the scene graph representations 206a-206d to extract visual memories of the video frames 202a-202d, which is described further in description of
(37)
(38) In some embodiments, the processor 104 aggregates the classified objects and the determined relationships for generating visual memory for each video frame of the sequence of video frames 202a-202d. To that end, the processor 104 utilizes graph attention network 210 and the relationship graph network 212 that are stored in the memory 106. The graph attention network 210 includes a node 210a representing an attention coefficient value. Each scene graph representation of the sequence of scene graph representations 206a-206d is aggregated by performing intra-graph reasoning using the graph attention network 210 and the relationship graph network 212. The processor 104 utilizes the feature extraction neural network 110 of the AVSD system 100 to execute a frame-level intra-graph reasoning on the scene graph representations 206a-206d to extract visual graph memory or semantic graph memory for each video frame of the sequence of video frames 202a-202d.
(39) The visual graph memory provides higher-level features that represent finer-grained information for each node (i.e. the object) in the scene graph representations 206a-206d. The higher-level features are extracted based on an attention coefficient for each edge in each scene graph representation of the sequence of scene graph representations 206a-206d. The processor 104 determines the attention coefficient by the graph attention network 210. Further, the processor 104 aggregates the node features of the scene graph representations 206a-206d based on the attention coefficients. The processor 104 computes extra edge features based on the node features by the relationship graph network 212. Further, the processor 104 aggregates the extra edge features so that the node features are updated by the relationship graph network 212.
(40) In the node-based feature aggregation, for M node features X={x.sub.1, x.sub.2, . . . , x.sub.M} in a scene graph representation (e.g., one of the scene graph representations 206a-206d), self-attention is performed for each pair of linked nodes. For linked nodes x.sub.i and x.sub.j, the attention coefficient 210a, α.sub.ij, which indicates the importance of node j to node i, is calculated by
(41)
where ∥ denotes vertical concatenation operation, N.sub.i indicates neighborhood object nodes of object i, W∈R.sup.d.sup.
\[ x'_i = \big\Vert_{k=1}^{K} \, \sigma\Big( \sum_{j \in N_i} \alpha_{ij}^{k} W^{k} x_j \Big) \tag{3} \]
where variables k and K define the number of heads in a multi-head attention scheme. The weighted sum x′.sub.i is used as input for the relationship graph network 212 for performing edge-based feature aggregation. In the relationship graph network 212, a multi-layer fully connected network h.sub.Λ is employed to generate edge features e.sub.ij from two connected node features (x′.sub.i, x′.sub.j):
\[ e_{ij} = h_{\Lambda}(x'_i, x'_j) \tag{4} \]
where h.sub.Λ: R.sup.d.sup.
\[ x^{*}_i = \max_{j:(j,i) \in \varepsilon_i} e_{ij} \tag{5} \]
where ε.sub.i denotes the set of edges pointing to node i. Thus, the processor 104 updates the node features inside the sequence of scene graph representations 206a-206d based on the graph attention network 210 and the relationship graph network 212. Further, to obtain the higher-level features for each node of the scene graph representations 206a-206d, the updated graph is pooled into the visual graph memory. In one implementation, the processor 104 is configured to execute the pooling of visual graph memory based on graph average pooling (GAP) and graph max pooling (GMP). The GAP and GMP are stored in the memory 106. The processor 104 accesses the GAP and GMP operations and provides them to the feature extraction neural network 110 for generating two graph streams that represent the visual graph memories. The visual graph memories such as visual graph memory 214a, visual graph memory 214b, visual graph memory 214c and visual graph memory 214d are described in
(42)
\[ V^{*} = \mathrm{GAP}(X^{*}, \varepsilon) \,\Vert\, \mathrm{GMP}(X^{*}, \varepsilon) \tag{6} \]
where ε denotes the connection structure of the scene graph representations 206a-206d, and X* denotes the final node features {x*.sub.1, x*.sub.2, . . . , x*.sub.M}.
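The frame-level intra-graph reasoning described above (attention-weighted node aggregation, edge-based max aggregation, and GAP/GMP pooling) can be sketched in plain Python as follows. This is a simplified single-head illustration under stated assumptions: the attention score is a plain dot product rather than the learned scoring of the graph attention network 210, the nonlinearity after the weighted sum is omitted, and `edge_feature` is a hypothetical stand-in for the multi-layer fully connected network h.sub.Λ.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def node_aggregate(x: List[Vector], neighbors: Dict[int, List[int]]) -> List[Vector]:
    """Node-based aggregation: attention-weighted sum over each node's neighbors."""
    out = []
    for i, xi in enumerate(x):
        nbrs = neighbors.get(i, [])
        if not nbrs:  # isolated node: keep its features unchanged
            out.append(list(xi))
            continue
        # Assumption: dot-product score instead of learned attention parameters.
        alpha = softmax([sum(a * b for a, b in zip(xi, x[j])) for j in nbrs])
        out.append([sum(a * x[j][d] for a, j in zip(alpha, nbrs)) for d in range(len(xi))])
    return out


def edge_feature(xi: Vector, xj: Vector) -> Vector:
    """Hypothetical stand-in for h_Lambda: ReLU of the element-wise sum."""
    return [max(0.0, a + b) for a, b in zip(xi, xj)]


def edge_aggregate(x: List[Vector], edges: List[Tuple[int, int]]) -> List[Vector]:
    """Edge-based aggregation: element-wise max over edge features pointing to node i."""
    out = []
    for i, xi in enumerate(x):
        feats = [edge_feature(xi, x[j]) for (j, target) in edges if target == i]
        out.append([max(col) for col in zip(*feats)] if feats else list(xi))
    return out


def pool(x: List[Vector]) -> Vector:
    """Graph memory: graph average pooling concatenated with graph max pooling."""
    cols = list(zip(*x))
    return [sum(c) / len(c) for c in cols] + [max(c) for c in cols]


# Toy scene graph with three nodes; each edge (j, i) points from node j to node i.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = [(1, 0), (2, 0), (0, 1), (0, 2)]
neighbors = {0: [1, 2], 1: [0], 2: [0]}

x_prime = node_aggregate(x, neighbors)
x_star = edge_aggregate(x_prime, edges)
memory = pool(x_star)
print(len(memory))
```

The pooled memory has twice the node feature dimension because the average-pooled and max-pooled streams are concatenated, which matches the two graph streams mentioned above.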
(43) For a sequence of scene graph memories (such as the scene graph memories 214a-214d) denoted by {υ*.sub.1, υ*.sub.2 , . . . , υ*.sub.L} of length L, windows 216a, 216b and 216c of size S are used to update the graph memory 214b of center video frame (such as the video frame 202b) in each window of the windows 216a-216c by aggregating graph memories 214a and 214c of neighboring video frames 202a and 202c in the window 216b. The processor 104 utilizes the aggregation neural network 112 for aggregating the graph memories 214a and 214c. The sequence of visual graph memories 214a-214d is set as f∈R.sup.2.sub.d.sub.
\[ \alpha = \mathrm{softmax}\big(P_{\alpha}^{T} \tanh(W_t f)\big) \tag{7} \]
where W.sub.t∈R.sup.2d.sup.
\[ v_c = \alpha f^{T} \tag{8} \]
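As an illustration of the sliding-window aggregation, the following Python sketch computes attention weights over the graph memories inside one window and replaces the center memory with their weighted combination. The parameter shapes, the function names, and the handling of windows at the sequence boundaries (the window is simply truncated) are assumptions, not details taken from the disclosure.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def window_attention(window: List[Vector], w_t: Matrix, p_alpha: Vector) -> Vector:
    """alpha = softmax(p_alpha^T tanh(W_t f)); result = sum_s alpha_s * f_s."""
    scores = []
    for f in window:
        hidden = [math.tanh(sum(w * v for w, v in zip(row, f))) for row in w_t]
        scores.append(sum(p * h for p, h in zip(p_alpha, hidden)))
    alpha = softmax(scores)
    dim = len(window[0])
    return [sum(a * f[d] for a, f in zip(alpha, window)) for d in range(dim)]


def aggregate_sequence(memories: List[Vector], w_t: Matrix, p_alpha: Vector,
                       size: int = 3) -> List[Vector]:
    """Slide a window of the given size over the sequence of graph memories."""
    half = size // 2
    out = []
    for c in range(len(memories)):
        lo, hi = max(0, c - half), min(len(memories), c + half + 1)
        out.append(window_attention(memories[lo:hi], w_t, p_alpha))
    return out


# Toy example: identity W_t and a zero p_alpha give uniform attention weights.
memories = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
w_t = [[1.0, 0.0], [0.0, 1.0]]
p_alpha = [0.0, 0.0]
final = aggregate_sequence(memories, w_t, p_alpha, size=3)
print(len(final))
```

With trained parameters the weights would favor the neighboring frames whose graph memories are most relevant to the center frame; the uniform weights here are only a property of the toy parameters.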
(44) Sliding the windows 216a-216c over the visual graph memory (such as the graph memory 214b) of the center video frame (e.g., the video frame 202b) of the sequence of video frames 202a-202d provides a sequence of final graph memories 218a, 218b, 218c and 218d. The sequence of the final graph memories 218a-218d can be represented as V={v.sub.1, v.sub.2, . . . , v.sub.L}, which aggregates both the spatial information and the temporal information of the video frames 202a-202d. The final graph memories 218a-218d are provided as an input to a self-attention encoder and a feed forward network layer 220. The self-attention encoder and the feed forward network layer 220 extract features represented as feature vectors 220a, 220b, 220c and 220d. The feature vectors 220a, 220b, 220c and 220d are submitted to a semantic-controlled transformer for generating a response to a query of the video 200A. The semantic-controlled transformer encodes contextual information, which is described further in
(45)
(46) In particular, the feature vector 302, the contextual information 304 and the query 306 are provided as the input to the MHA network layer 308a. The MHA network layer 308a encodes text information based on the contextual information 304 and learns a dialog model for generating a response to the query 306. Further, the MHA network layer 308a generates an encoded feature vector 312, encoded contextual information 314 and an encoded query 316. The encoded feature vector 312, the encoded contextual information 314, the encoded query 316 and features of a sub answer 310 (A×D) are provided as input to another MHA network layer 308b to generate a response for the query 306. The response includes feature vector 318a, feature vector 318b, feature vector 318c and feature vector 318d generated by shuffling head vectors of the reference answer 310, the encoded feature vector 312, the encoded contextual information 314 and the encoded query 316, respectively. The shuffling of the head vectors improves performance of the semantic-controlled transformer 300 as hidden features are also extracted. The response is generated in an iterative manner, as shown in
(47) The head vectors of the feature vectors 318a-318d are shuffled before being fed into the feed-forward network (FFN) module 320 and are later concatenated. The FFN module 320 includes two fully connected layers with a ReLU function in between. The concatenation fuses the features of the contextual information 304 and the visual features of the feature vector 302 to extract a feature vector 322. A loss function (L) 326 is implemented between a predicted probability distribution P of the feature vector 322 and a ground token distribution G of features 324 of reference answers. In one embodiment, the loss function 326 is based on Kullback-Leibler divergence:
(48)
(49) In each iteration, one word is generated and next word for the response is predicted using a co-attention transformer of the semantic-controlled transformer 300. Further, all next token probability distributions are collected in a batch to obtain the predicted probability distribution P. In a similar manner, ground token distribution G is obtained from ground truth answers or responses to the query 306.
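A minimal Python sketch of a Kullback-Leibler loss between the ground token distribution G and the predicted distribution P is given below. The direction KL(G || P), the averaging over the batch, and the function names are assumptions chosen for illustration; the disclosure states only that the loss is based on Kullback-Leibler divergence.

```python
import math
from typing import List


def kl_divergence(g: List[float], p: List[float], eps: float = 1e-12) -> float:
    """KL(G || P) = sum_i g_i * log(g_i / p_i); terms with g_i == 0 contribute nothing."""
    return sum(gi * math.log(gi / max(pi, eps)) for gi, pi in zip(g, p) if gi > 0.0)


def batch_loss(G: List[List[float]], P: List[List[float]]) -> float:
    """Average the per-token KL divergence over all next-token distributions in a batch."""
    return sum(kl_divergence(g, p) for g, p in zip(G, P)) / len(G)


# Toy vocabulary of three tokens; the ground truth is a one-hot distribution.
G = [[0.0, 1.0, 0.0]]
P = [[0.25, 0.5, 0.25]]
loss = batch_loss(G, P)
print(loss)
```

When G is one-hot, this quantity reduces to the negative log-probability that P assigns to the ground-truth token.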
(50) Thus, the semantic-controlled transformer 300 learns the dialog model and generates the responses to the query 306.
(51)
(52) Typically, in language modelling, words for the answer sentences are predicted from a vocabulary repository. In one implementation, prediction of next word for a word in an answer sentence is performed based on the input query 412. The contextual information 304 includes source sentences, such as video caption, dialog history and the reference answer 310 (i.e., an already generated answer). For instance, dialog history, H={C, (Q.sub.1, A.sub.1), . . . , (Q.sub.l-1, A.sub.l-1)}, where C is the video caption, Q.sub.l is the query and A.sub.l.sup.in is the reference answer. The semantics-controlled transformer reasoning 410 generates probability distribution of next token of a word for all tokens of words in the vocabulary for the output response 414. The reasoning process of the semantics-controlled transformer reasoning 410 is controlled based on concatenated visual graph memories 214a-214d and final graph memories 218a-218d.
(53) In the semantics-controlled transformer reasoning 410, the sentence sources that include the dialog history (H), the video caption (C), the query (Q.sub.l) and the reference answer (A.sub.l.sup.in) are embedded together using a tokenization and word positional embedding layer. For instance, the text sources (H, C, Q.sub.l, A.sub.l.sup.in) are tokenized as e.sub.h, e.sub.c, e.sub.q, e.sub.a. In one implementation, a text source is tokenized by byte-pair encoding (BPE). The tokenized text source is transformed by the word positional embedding layer into a representation of L×W dimensional vectors, where L is the sentence length and W is the word embedding dimension. Each word of the tokenized text source is encoded into a position embedding space and added to the word embedding layer. In a similar manner, a target sentence is encoded into a position embedding space. A continuous representation S∈R.sup.L×C of the text source at the input of a self-attention module is translated into key (k), query (q) and value (v) using linear transforms. The self-attention module computes an attention value between the key and the query. The attention value between the key and the query enables each word in the text source to aggregate information from other words using the self-attention module.
(54) Further, the visual graph memories (i.e., the visual graph memories 214c-214d) of dimension 2d.sub.h are transferred to d.sub.h dimension features, e.sub.v that match LW dimension of the text sources. Next, the tokenized reference answer (i.e., the reference answer 310) e.sub.a is encoded using a self-attention based multi-head shuffling transformer (i.e., the MHA network layer 308a) to generate encoded hidden representations (h.sub.enc).
\[ h_{enc} = \mathrm{FFN}\big(\mathrm{Attention}(W_q e_a, W_k e_a, W_v e_a)\big) \tag{9} \]
where W.sub.q, W.sub.k, W.sub.v are weight matrices for the query (q), key (k) and value (v), respectively. FFN is a feed-forward network module that includes two fully-connected layers with an activation function (i.e., rectified linear unit (ReLU)) in between. The encoded hidden representations correspond to the feature vectors 318a-318d.
(55) The attention coefficient between the key (k) and the query (q) with the value (v) is determined based on attention function defined as:
(56)
where √d.sub.h is a scaling factor that keeps the scalars within a stable order of magnitude, and d.sub.h is the dimension of each head in the feature vectors (i.e., the encoded feature vector 312, the encoded contextual information 314 and the encoded query 316).
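For illustration, a single-head attention function with the √d.sub.h scaling can be sketched in Python as below. The sketch assumes the widely used scaled dot-product form, softmax(q k^T / √d_h) v, with one row per token; the helper names are hypothetical and the learned projections W.sub.q, W.sub.k, W.sub.v are assumed to have been applied already.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Scaled dot-product attention for one head; rows of q, k and v are tokens."""
    d_h = len(q[0])
    out = []
    for qi in q:
        # Score of this query token against every key token, scaled by sqrt(d_h).
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_h) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(len(v[0]))])
    return out


# Toy example: two identical keys receive equal weight, so the output is the mean of v.
q = [[1.0, 0.0]]
k = [[1.0, 0.0], [1.0, 0.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
result = attention(q, k, v)
print(result)
```

Passing the same source for q, k and v gives self-attention, while taking q from the encoded answer and k, v from another source gives the co-attention pattern.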
(57) After encoding the input query 412, co-attention for each of the other word and visual embedding e.sub.j is performed, where j∈{h,c, q,v}, with the same transformer structure of the multi-head shuffling transformer (i.e., the MHA network layer 308b):
\[ h'_{enc,j} = \mathrm{FFN}\big(\mathrm{Attention}(W_q h_{enc}, W_k e_j, W_v e_j)\big) \tag{11} \]
where, h′.sub.enc,j is a new encoded feature.
(58) By concatenating features of the sentence sources and the visual features (i.e., the feature vectors 318a-318d), a feature vector h*.sub.enc,j is extracted. Each head vector in each sentence source feature (i.e., the encoded contextual information 314 and the encoded query 316) and each visual feature (i.e., the encoded feature vector 312) are shuffled by the multi-head shuffling transformer (i.e., the MHA network layer 308b). The multi-head shuffling enables head vectors of the encoded feature vectors 312, the encoded contextual information 314 and the encoded query 316 to interact from start to end, which improves performance of the semantic-controlled transformer reasoning 410. The head vectors are shuffled before being fed into two fully connected layers of the FFN module 320 and are later concatenated. The concatenation fuses the features of the text sources and the visual features to extract the final encoded feature vector h*.sub.enc,j. The feature vector h*.sub.enc,j is used for predicting the next token probability distribution (p.sub.vocab) over the tokens in the vocabulary. The next token probability distribution (p.sub.vocab) is predicted using an FFN with a softmax function:
\[ p_{vocab} = \mathrm{softmax}\big(\mathrm{FFN}(h^{*}_{enc})\big) \tag{12} \]
In the testing stage, beam search with b beams is conducted to generate an answer sentence. In each step, the b tokens with the highest confidence scores are selected. The answer is completed either when the end-of-sentence token <eos> is generated or when the maximum number of tokens is reached. Accordingly, the processor outputs the output response 414 to the input query 412 based on the generated answer.
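The beam search decoding can be sketched in Python as follows. The callable `next_token_probs`, the toy lookup table `toy_model`, and the use of summed log-probabilities as the beam score are assumptions for illustration; in the system described here, the next-token distribution would come from the semantic-controlled transformer rather than from a table.

```python
import math
from typing import Callable, Dict, List, Tuple

Beam = Tuple[float, List[str]]


def beam_search(next_token_probs: Callable[[List[str]], Dict[str, float]],
                b: int = 2, max_len: int = 10, eos: str = "<eos>") -> List[str]:
    """Keep the b best partial answers per step; stop at <eos> or at max_len tokens."""
    beams: List[Beam] = [(0.0, [])]
    finished: List[Beam] = []
    for _ in range(max_len):
        candidates: List[Beam] = []
        for score, tokens in beams:
            for tok, prob in next_token_probs(tokens).items():
                if prob > 0.0:
                    candidates.append((score + math.log(prob), tokens + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:b]:
            # Completed answers leave the beam; the rest are expanded in the next step.
            (finished if tokens[-1] == eos else beams).append((score, tokens))
        if not beams:
            break
    pool = finished + beams or [(0.0, [])]
    return max(pool, key=lambda c: c[0])[1]


def toy_model(prefix: List[str]) -> Dict[str, float]:
    """Hypothetical next-token distributions, keyed by the tokens generated so far."""
    table = {
        (): {"two": 0.6, "three": 0.4},
        ("two",): {"players": 0.9, "<eos>": 0.1},
        ("three",): {"players": 1.0},
        ("two", "players"): {"<eos>": 1.0},
        ("three", "players"): {"<eos>": 1.0},
    }
    return table[tuple(prefix)]


answer = beam_search(toy_model, b=2, max_len=5)
print(answer)
```

Summed log-probabilities tend to favor shorter answers, so practical decoders often add a length normalization; that refinement is omitted from this sketch.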
(59)
(60) In an alternate embodiment, the AVSD system 100 may be remotely coupled with the device 516 through an online connection link of a network, such as the network 124. Further, the user 502 is associated with an electronic device 504 that is capable of communicating with the device 516. The electronic device 504 may communicate with the device 516 via communication links, such as Bluetooth connection, infra-red connection, Wi-Fi connection, or the like. In an alternate embodiment, the AVSD system 100 may be coupled to the device 516 via a cloud network (not shown in
(61) Further, the device 516 may include one or more components such as a camera 508, a display screen 510, a microphone 512, a speaker 514, and the like. The camera 508 captures the user 502 that is in field of view 518 of the camera 508. Additionally or alternatively, the camera 508 captures gestures of the user 502, such as hand gestures pointing to an object in a video. Accordingly, the device 516 transmits the query 506 along with the gestures to the AVSD system 100.
(62) For instance, the user 502 is watching a sports match displayed on the display screen 510 of the device 516 and the user 502 provides a query 506 for the sports match via the electronic device 504. The query 506 may be “how many players are playing in the field?”. The user 502 may provide the query 506 along with a hand gesture pointing to the sports match. The query 506 is captured by the electronic device 504 to enable the system 100 to determine what the user 502 is asking about. The electronic device 504 transmits the query 506 to the device 516. Alternatively, the user 502 may provide the query 506 via the microphone 512 of the device 516. The microphone 512 receives the query 506 and provides it to the AVSD system 100 in the device 516. The AVSD system 100 processes the sports match video, the input query 506 and contextual information of the sports match stored in the storage device 114 to generate a response 520 to the query 506 as described above in description of
(63)
(64) Further, at row 614, under the column of generated answers 612, one or more answers to an input query are generated. Each generated answer in the generated answers 612 is associated with a confidence score. The generated answer with the highest confidence score is selected as an output response (e.g. the output response 314 as described in description of
(65)
(66) To that end, the AVSD system 100 includes the processor 104 that processes the video frames 702 and extracts visual and semantic information from the video frames 702. Further, the processor 104 encodes the visual and semantic information with contextual information, such as video caption of the video frame 702, video dialog history and audio of the video frame 702 for generating the response 708. The response 708 is generated based on a generated answer with the highest confidence score in the generated answers 612 as described in description of
(67)
(68)
(69) In this embodiment, the AVSD system 100 can use object and relationship classifiers configured to detect and classify objects and their relationships relevant for generating navigation instructions. For example, the objects can include buildings, cars, pedestrians, poles, traffic lights or any other object relevant to a driver. Examples of relationships can include ahead, behind, on the right, on the left, etc. In this embodiment, the AVSD system 100 is configured to generate a navigation instruction using a description of the classified objects and their relationship with the navigation route for the destination. For example, the AVSD system 100 can generate a navigation instruction such as “follow the car ahead, and make a left turn after the tree ahead left.” In this example, the classified objects are a car and a tree. Their relationships with the navigated vehicle indicate that both the car and the tree are ahead of the vehicle. Their relationships with the navigation route for the destination indicate that there is a need to turn left to follow the navigation route.
(70) This embodiment is based on the recognition that there is a need to provide route guidance to a driver of a vehicle based on real-time unimodal or multimodal information about static and dynamic objects in the vicinity of the vehicle. For example, it is an object of some embodiments to provide context based driving instructions like “turn right before the brown brick building” or “follow the white car” in addition to or as an alternative to GPS based instructions like “in 100 feet take the second right onto Johnson street.” Such context based driving instructions can be generated based on real-time awareness of a scene in proximity of the vehicle. To that end, the context based navigation is referred to herein as a scene-aware navigation that can be implemented using a dialog system according to the various embodiments.
(71)
(72) At block 808, the system extracts features representing the classified objects and the determined relationships for each of the video frame to produce a sequence of feature vectors (i.e., the sequence of feature vectors 418a-418d of
(73)
(74) As shown in
(75)
(76) By evaluating on AVSD at DSTC7 with objective metrics, performance of the AVSD system 100 is compared with four baseline methods: a baseline method, a multimodal attention method, a simple method and an MTN method. The baseline method is based on the DSTC challenge and extracts features for different modalities. The extracted features of the different modalities are combined using simple concatenation or addition for generating a response to an input query. The multimodal attention method implements a multimodal attention that selectively focuses on salient features for the response generation. The simple method adds image features, such as VGG features, and factor graph attention for the response generation. The MTN method applies self-attention and co-attention to aggregate information between video, audio, and multi-turn dialog information. In addition, an answer auto-encoding loss has been applied to boost the performance.
(77)
(78) Exemplary Embodiments
(79)
(80) The processor 104 utilizes the visual graph memories 1108a-1108c for inter-graph information aggregation 1112 to generate final graph memories, i.e. the graph memories 1128a-1128c. The final graph memories 1128a-1128c are provided as an input for the semantic-controlled transformer 1124. Further, the processor 104 is configured to execute the semantic-controlled transformer reasoning 1124, to encode the final graph memories 1128a-1128c, the contextual information 304 and the input query 1134 to generate the output response 1140. In some embodiments, the input query 1134 is provided to the semantic-controlled transformer 300 to execute the semantics-controlled transformer reasoning 1124. The semantics-controlled transformer reasoning 1124 generates probability distribution of next token of a word for all tokens of words in the vocabulary for the output response 1140. The reasoning process of the semantics-controlled transformer reasoning 1124 is controlled based on concatenated visual graph memories 214a-214d and final graph memories 218a-218d.
(81) The following description provides exemplary embodiments only, and is not intended to limit the scope, applicability, or configuration of the disclosure. Rather, the following description of the exemplary embodiments will provide those skilled in the art with an enabling description for implementing one or more exemplary embodiments. Contemplated are various changes that may be made in the function and arrangement of elements without departing from the spirit and scope of the subject matter disclosed as set forth in the appended claims.
(82) Specific details are given in the following description to provide a thorough understanding of the embodiments. However, it will be understood by one of ordinary skill in the art that the embodiments may be practiced without these specific details. For example, systems, processes, and other elements in the subject matter disclosed may be shown as components in block diagram form in order not to obscure the embodiments in unnecessary detail. In other instances, well-known processes, structures, and techniques may be shown without unnecessary detail in order to avoid obscuring the embodiments. Further, like reference numbers and designations in the various drawings indicate like elements.
(83) Also, individual embodiments may be described as a process which is depicted as a flowchart, a flow diagram, a data flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be re-arranged. A process may be terminated when its operations are completed, but may have additional steps not discussed or included in a figure. Furthermore, not all operations in any particularly described process may occur in all embodiments. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, the function's termination can correspond to a return of the function to the calling function or the main function.
(84) Furthermore, embodiments of the subject matter disclosed may be implemented, at least in part, either manually or automatically. Manual or automatic implementations may be executed, or at least assisted, through the use of machines, hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments to perform the necessary tasks may be stored in a machine readable medium. A processor(s) may perform the necessary tasks.
(85) Various methods or processes outlined herein may be coded as software that is executable on one or more processors that employ any one of a variety of operating systems or platforms. Additionally, such software may be written using any of a number of suitable programming languages and/or programming or scripting tools, and also may be compiled as executable machine language code or intermediate code that is executed on a framework or virtual machine. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments.
(86) Embodiments of the present disclosure may be embodied as a method, of which an example has been provided. The acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts concurrently, even though shown as sequential acts in illustrative embodiments. Further, use of ordinal terms such as “first,” “second,” in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another or the temporal order in which acts of a method are performed, but are used merely as labels to distinguish one claim element having a certain name from another element having a same name (but for use of the ordinal term) to distinguish the claim elements.
(87) Although the present disclosure has been described with reference to certain preferred embodiments, it is to be understood that various other adaptations and modifications can be made within the spirit and scope of the present disclosure. Therefore, it is the aspect of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the present disclosure.