System and method for generation of unseen composite data objects

Abstract

A computer implemented system for generating one or more data structures is described, the one or more data structures representing an unseen composition based on a first category and a second category observed individually in a training data set. During training of a generator, a proposed framework utilizes at least one of the following discriminators—three pixel-centric discriminators, namely, frame discriminator, gradient discriminator, video discriminator; and one object-centric relational discriminator. The three pixel-centric discriminators ensure spatial and temporal consistency across the frames, and the relational discriminator leverages spatio-temporal scene graphs to reason over the object layouts in videos ensuring the right interactions among objects.

Claims

1. A computer implemented system for generating one or more data structures, the one or more data structures representing an unseen composition based on a first category and a second category observed individually, the system comprising: one or more processors operating in conjunction with computer memory, the one or more processors configured to provide: a data receiver interface configured to receive a training data set including labelled data elements based on the first category and labelled data elements based on the second category and to receive a target category indication representative of the unseen composition; a conditional generative adversarial network configured to process the training data set to train a discriminator model architecture coupled to a generator model architecture, the discriminator model architecture having a plurality of adversarial networks operating in concert to train the generator model architecture, the discriminator model architecture including at least: a sequence discriminator configured to distinguish between a real sequence and a generated sequence; a frame discriminator configured to differentiate between frames representing sequence subsets of the real sequence and the generated sequence; a gradient discriminator configured to differentiate between a domain-specific gradient determined based on the type of data structure of the one or more data structures and the training data set; and a relational discriminator configured to assign weights for shifting focus of the generator model architecture to a subset of the one or more new data structures based on an identified context associated with the target category indication of the unseen composition; wherein the generator model architecture is configured to generate the one or more data structures representing the unseen composition based on the outputs of the plurality of adversarial networks.

2. The system of claim 1, wherein the first category includes a set of actions, the second category includes a set of objects, and the training data set includes a plurality of data structures of action/object pairs different than the target category indication representative of the unseen composition.

3. The system of claim 2, wherein the new data structures includes at least a new video data structure generated to represent an action/object pair representative of the unseen composition by synthesizing independently observed data represented in the training data set.

4. The system of claim 1, wherein the first category includes vectorized transactional information and wherein the second category includes vectorized representation of one or more events.

5. The system of claim 1, wherein vectorized labels associated with each training data element in the training data set are processed to identify one or more contextual components that are used for comparison with a vector representing the unseen composition, the one or more contextual components utilized for modifying the operation of the discriminator model architecture.

6. The system of claim 1, wherein the sequence discriminator utilizes a loss function having the relation:
L.sub.v=½[log(D.sub.v(V.sub.real,s.sub.a,s.sub.o)+log(1−D.sub.v(V.sub.gen,s.sub.a,s.sub.o))].

7. The system of claim 1, wherein the frame discriminator utilizes a loss function having the relation: $L_{f} = \frac{1}{2 T} {.Math.}_{i = 1}^{T} [\log (D_{f}^{i} (V_{real}, s_{a}, s_{o}) + \log (1 - D_{f}^{i} (V_{gen}, s_{a}, s_{o}))] .$

8. The system of claim 1, wherein the gradient discriminator utilizes a loss function having the relation: $L_{g} = \frac{1}{2 (T - 1)} {.Math.}_{i = 1}^{T - 1} [\log (D_{g}^{i} (δ V_{real}, s_{a}, s_{o}) + \log (1 - D_{g}^{i} (δ V_{gen}, s_{a}, s_{o}))] .$

9. The system of claim 1, wherein the relational discriminator utilizes a loss function having the relation: $L_{fg} = \frac{1}{2 T} {.Math.}_{i = 1}^{T} [\log (D_{fg}^{i} (F_{real}, s_{a}, s_{o}) + \log (1 - D_{fg}^{i} (F_{gen}, s_{a}, s_{o}))] .$

10. The system of claim 1, wherein the generator model architecture is configured to be optimized using an objective function having the relation: $ℒ_{gan} = \log (1 - D_{v} (V_{gen}, s_{a}, s_{o}))] + \frac{1}{T} {.Math.}_{i = 1}^{T} [\log (1 - D_{f}^{i} (V_{gen}, s_{a}, s_{o}))] + \frac{1}{(T - 1)} {.Math.}_{i = 1}^{T - 1} [\log (1 - D_{g}^{i} (δ V_{gen}, s_{a}, s_{o}))] + \frac{1}{T} {.Math.}_{i = 1}^{T} [\log (1 - D_{fg}^{i} (F_{gen}, s_{a}, s_{o}))] .$

11. A computer implemented method for generating one or more data structures using a conditional generative adversarial network, the one or more data structures representing an unseen composition based on a first category and a second category observed individually, the method comprising: receiving a training data set including labelled data elements based on the first category and labelled data elements based on the second category; receiving a target category indication representative of the unseen composition; processing the training data set to train a discriminator model architecture coupled to a generator model architecture, the discriminator model architecture including at least: a relational discriminator D.sub.r configured to assign weights for shifting focus of the generator model architecture to a subset of the one or more new data structures based on an identified context associated with the target category indication of the unseen composition; and generating using the generator model architecture the one or more data structures; wherein the relational discriminator utilizes a spatio-temporal scene graph, and adapts a neural network to distinguish between element layouts of real data objects V.sub.real and generated data objects V.sub.gen; wherein the spatio-temporal scene graph is represented as custom character =(, ε) and generated from V, where the nodes and edges are represented by and ε.

12. The method of claim 11, wherein the relational discriminator operates on scene graph custom character using a graph convolutional network (GCN) followed by stacking and average-pooling of the resulting node representations along the time axis.

13. The method of claim 12, wherein the scene graph is the concatenated with spatially replicated copies of s.sub.a and s.sub.o to generate a tensor of size (dim(s.sub.a)+dim(s.sub.o)+N.sup.(t))×w.sub.0.sup.(t)×h.sub.0.sup.(t), wherein s.sub.a and s.sub.o represent word embeddings of two different characteristics.

14. The method of claim 13, the method further comprising applying convolutions and sigmoid to the tensor of size (dim(s.sub.a)+dim(s.sub.o)+N.sup.(t))×w.sub.0.sup.(t)×h.sub.0.sup.(t) to obtain an intermediate output which denotes the probability of the scene graph belonging to a real data object, the intermediate output used to assign the weights for shifting focus of the generator model architecture.

15. The method of claim 11, wherein an objective function of the relational discriminator is given by:
L.sub.r=½[log(D.sub.r( custom character .sub.real;s.sub.a,s.sub.o))+log(1−D.sub.r(.sub.gen;s.sub.a,s.sub.o))].

16. The method of claim 11, wherein the discriminator model architecture further includes a sequence discriminator configured to distinguish between a real sequence and a generated sequence.

17. The method of claim 16, wherein the discriminator model architecture further includes a gradient discriminator configured to differentiate between a domain-specific gradient determined based on the type of data structure of the one or more data structures and the training data set.

18. The method of claim 17, wherein the discriminator model architecture further includes a frame discriminator configured to differentiate between frames representing sequence subsets of the real sequence and the generated sequence.

19. The method of claim 18, wherein the relational discriminator, the sequence discriminator, the gradient discriminator, and the frame discriminator are trained simultaneously.

20. A non-transitory, computer readable medium, storing machine interpretable instructions, which when executed by a processor, cause the processor to perform a computer implemented method of generating one or more data structures using a conditional generative adversarial network, the one or more data structures representing an unseen composition based on a first category and a second category observed individually, the method comprising: receiving a training data set including labelled data elements based on the first category and labelled data elements based on the second category; receiving a target category indication representative of the unseen composition; processing the training data set to train a discriminator model architecture coupled to a generator model architecture, the discriminator model architecture having a plurality of adversarial networks operating in concert to train the generator model architecture, the discriminator model architecture including at least: a relational discriminator configured to assign weights for shifting focus of the generator model architecture to a subset of the one or more new data structures based on an identified context associated with the target category indication of the unseen composition; and generating, using the generator model architecture, the one or more data structures; wherein the relational discriminator utilizes a spatio-temporal scene graph, and learns to distinguish between element layouts of real element objects V.sub.real and generated data elements V.sub.gen; wherein the spatio-temporal scene graph is represented as custom character =(, ε) and generated from V, where the nodes and edges are represented by and ε.

Description

DESCRIPTION OF THE FIGURES

(1) In the figures, embodiments are illustrated by way of example. It is to be expressly understood that the description and figures are only for the purpose of illustration and as an aid to understanding.

(2) Embodiments will now be described, by way of example only, with reference to the attached figures, wherein in the figures:

(3) FIG. 1 is an example generative adversarial network system, according to some embodiments.

(4) FIG. 2 is a set of illustrations showing an example approach in relation to human-object interaction (HOI) videos, according to some embodiments.

(5) FIG. 3A is an example block rendering of an example generative multi-adversarial network, according to some embodiments.

(6) FIG. 3B is a more in-depth rendering of components of the discriminator network, according to some embodiments.

(7) FIG. 3C is an example diagram showing a spatio-temporal scene graph, according to some embodiments.

(8) FIG. 4A is an example of a word embedding that can be used to establish relationships between different object/action pairs in the context of video generation, according to some embodiments.

(9) FIG. 4B is an example of a word embedding that can be used to establish relationships between different object/action pairs in the context of transaction generation, according to some embodiments.

(10) FIG. 5 is an example method for generating one or more data structures, the one or more data structures representing an unseen composition based on a first category and a second category observed individually, according to some embodiments.

(11) FIG. 6 is a schematic diagram of a computing device such as a server, according to some embodiments.

(12) FIG. 7 are generated renderings of composite data objects (in this case, videos) based on unseen compositions, according to some embodiments.

(13) FIG. 8 is an example set of output frames of videos generated by example proposed systems, according to some embodiments.

(14) FIG. 9 is an example set of output frames of videos generated by example proposed systems depicting failed attempts, according to some embodiments.

DETAILED DESCRIPTION

(15) Despite the promising success of generative models in the field of image and video generation, the capability of video generation models is limited to constrained settings. Task-oriented generation of realistic videos is a natural next challenge for video generation models. Human activity videos are a good example of realistic videos and serve as a proxy to evaluate action recognition models.

(16) Current action recognition models are limited to the predetermined categories in the dataset. Thus, it is valuable to be able to generate video corresponding to unseen categories and thereby enhancing the generalizability of action recognition models even with limited data collection. Embodiments described herein are not limited to videos, and rather extend to other types of composites generated based on unseen combinations of categories.

(17) FIG. 1 is an example generative adversarial network system, according to some embodiments. The generative adversarial network system 100 is adapted to generate one or more composite data objects, which are one or more data structures representing an unseen composition. Training data can be received at a data receiver interface 102, along with a target category indication that represents a desired unseen composition.

(18) Concretely, the conditional inputs to the system 100 can be semantic labels (e.g., of action and object), and a single start frame with a mask providing the background and location for the object. Then, the model has to create the object, reason over the action, and enact the action on the object (leading to object translation and/or transformation) over the background, thus generating the interaction video.

(19) During training of the generator, the system 100 can utilizes four discriminators (or subsets thereof having one or more of the discriminators)—three pixel-centric discriminators, namely, frame discriminator, gradient discriminator, sequence (video) discriminator; and one object-centric relational discriminator. The three pixel-centric discriminators ensure spatial and temporal consistency across the frames. The novel relational discriminator leverages spatio-temporal scene graphs to reason over the object layouts in videos ensuring the right interactions among objects. Through experiments, Applicants show that the proposed GAN framework of various embodiments is able to disentangle objects and actions and learns to generate videos with unseen compositions. Different performance can be obtained by using different variations of the discriminator networks.

(20) The discriminator networks can be established using neural networks, for example, implemented on computer circuitry and provided, for example, on a computer server or distributed computing resources. Neural networks maintain a number of interconnected nodal data objects which when operated in concert, process incoming data to generate output data through traversal of the nodal data objects.

(21) Over a period of training epochs, the architecture of the neural network is modified (e.g., weights represented as data values coupled to each of the nodal data objects are changed) in response to specific optimization of an objective function, such that the processing of inputs to outputs is modified.

(22) As noted below, each of the discriminator networks is configured for different tracking, and Applicant provides experimental validation of some embodiments.

(23) The components shown in blocks in FIG. 1 are implemented using computer components, including processors, computer memory, and electronic circuitry. In some embodiments, system 100 is a computer server configured for machine learning and composite generation, and may interface with a simulation engine and an object library, which interoperate to submit requests for composite generation. Composites are generated as new data objects for downstream processing.

(24) The simulation engine may, for example, be used for scenario generation and evaluation of potential simulated events and responses thereof. For example, composite data objects can be used to generate data representations of hypothetical transactions that someone may undertake upon the birth of a new baby (diaper purchases), etc. Other types of composite data objects can include stock market/equity market transaction records and event information.

(25) In the context of a composite video, the video may, for example, be uploaded to a new object library storing simulations. In the context of a sequence of transactions, a data structure may be generated encapsulating a set of simulated transactions and/or life events, for example.

(26) As described herein, a discriminator network 106 is provided that is adapted to evaluate and contribute to an aggregated loss function that combines sequence level discrimination, frame (e.g., subsets of sequences) level discrimination, and foreground discrimination (e.g., assigning sub-areas of focus within frames). Generator network G 104 is depicted with a set of 4 discriminators: (1) a frame discriminator D.sub.f, which encourages the generator to learn spatially coherent content (e.g., visual content); (2) a gradient discriminator D.sub.g, which incentivizes G to produce temporally consistent frames; (3) a video discriminator D.sub.v, which provides the generator with global spatio-temporal context; and (4) a relational discriminator D.sub.r, which assists the generator in producing correct object layouts (e.g., in a video). The system 100 can utilize all or a subset of the discriminator networks. While some examples and experimentation describe using all of the networks together, the embodiments are not meant to be limited to using them all together.

(27) The frame discriminator, gradient discriminator, and video discriminators can be considered pixel-centric discriminators, while the relational discriminator can be considered an object (e.g., in the context of a video, physical article, or in the context of stock market or transaction data analysis, event) based discriminator. The discriminators can be operated separately in some embodiments, which can increase performance as parallelization is possible across different devices, different threads, or different processor cores.

(28) The video discriminator is configured to process a block of frames as one, and conduct an analysis based on whether this set of frames is similar to what it is supposed to appear to be. For example, in the context of a transaction flow, the client becomes married, moves somewhere—if one were to generate the future sequence as a whole, the video discriminator would look at the whole set of frames—e.g., determine whether this set of time-series slices look like a realistic month for that client. While slices for a video set of frames can be considered two dimensional images, the video discriminator described herein can also be applied in respect of single dimensional information (e.g., for transaction flows).

(29) The temporal gradient is configured to effectively avoid abrupt changes to promote consistency over time. In the context of a video, for example, a person should not jump from one physical location and jumping to another location between frames—e.g., pixels in a video should be smooth with occasionally transitions, and there is a bias towards having them more often smooth than not.

(30) The relational discriminator, for example, can track elements that are consistent across multiple frames (e.g., slices of time) and track their corresponding layouts (e.g., spatial layouts, or based on other types of vector distance-based “layouts”). For example, spatial layouts can include positioning of physical articles in the context of a video (e.g., background articles such as tables, cutting boards), and in the context of a transaction flow, this can include the tracking of events that persist across frames (e.g., raining, heat wave, Christmas holidays), among others. The spatial layout in respect of event tracking can be based on assigned characterizations that can be mapped to vectors or points in a transformed representative space, and “spatial” distances can then be gauged through determining vector distances (e.g., through mapping to a Euclidean space or other type of manifold space).

(31) The difference in the relational discriminator as opposed to the video discriminator is that it tracks, for example, on a specific event or characteristic that persists over a set of time series slices in querying whether the generated output is realistic.

(32) The aggregated loss function provided by the discriminator network 106 is combined with a generator 104, such that the generator 104 (e.g., generator model architecture), operating in concert with the discriminator network 106 (e.g., discriminator model architecture), provides the overall generative adversarial network system 100.

(33) In various embodiments, one, two, three, or all four of the discriminators can be used together. In an example embodiment pretrained word embeddings can be used for semantic representations of actions and objects, and all discriminators are conditioned on word embeddings of the characteristic pair (e.g., in the context of a video, it can be action (s.sub.a) and physical object/object (s.sub.o)) and all discriminators can be trained simultaneously in an end-to-end manner. For example, the discriminators can be implemented using python code that runs on different processors for generation time, run separately (e.g., parallelized over a number of CPUs, for example, based on data parallelism or model parallelism).

(34) The generator 104 is optimized to generate composite data from the underlying training data that is difficult for the discriminator network to differentiate from it establishes as “real data” (as extracted from the training data).

(35) As a simplified description, the generator 104 generates novel candidate data object composites which are then evaluated by discriminator network 106 and accepted/rejected. Ultimately, the system 100 attempts to output the data object composites which the discriminator network 106 is unable to distinguish as synthesized, and thus would be considered computationally as part of the real data distribution.

(36) The generative adversarial network system as provided in various embodiments, is a conditional generative adversarial network system that maintains a computer-based representation in memory that is updated over a period of training iterations and/or reinforcement learning feedback iterations to estimate a mapping (e.g., a transfer/policy function) between conditioning variables and a real data distribution.

(37) The generative adversarial network system can store, on a data storage 108, a memory object representation, maintained, for example, as one or more neural networks.

(38) The neural networks may be represented as having interconnected computing nodes, stored as data structure objects, that are linked to one another through a set of link weights, filters, etc., which represent influence/activation associated with the corresponding computing nodes. As the neural networks receive feedback during training or reinforcement learning, the neural networks iteratively update and tune weightings and connections.

(39) The interconnections and computing nodes can represent various types of relationships that together provide the policy/transfer function, being tweaked and refined across numerous iterations by, in some embodiments, computationally attempting to minimize errors (e.g., as defined by a loss function). The generative adversarial network system, in some embodiments, can utilize support vector machines, or other types of machine learning computing representations and data architectures.

(40) The training data includes example training compositions and data objects that show linkages between different labels associated with the data objects. In an example embodiment, the training data includes data objects primarily labelled across two categories, the two categories providing a pairwise relationship.

(41) In variants, there may be more than two categories. The pairwise relationships are used to establish training examples that aid in generating interferences, and underlying vectorized metadata and other labels, in some embodiments, expanding upon the category labels, aid in providing additional context.

(42) The categories, as provided in some embodiments, can include action/object pairs associated with underlying training data objects. The training data objects can be associated with vector data structures storing metadata, which together is used to establish relationships in the underlying data.

(43) When a request to generate a new composite data object is received, the system 100 utilizes the generative adversarial network to attempt to create the composite data object by combining aspects of the underlying training data, compositing aspects in an attempt to create a data object that cannot be distinguished by the discriminator (or minimizes a loss function thereof).

(44) However, as the system 100 has not encountered any training data representing the combination required in generating the composite data object (“zero shot”), it has to determine which aspects of the underlying training data to transform, combine, merge, or otherwise stitch together to generate the composite data object.

(45) In the example of FIG. 2, an example approach 200 is described in relation to human-object interaction (HOI) videos 202 and 204. Generation of HOI videos would abridge the gap between the requirement of training data for recognition models on one hand (the more the better) and data collection (the lesser the cheaper). Furthermore, it is valuable to be able to learn recognition models that generalize well over unseen categories or compositions.

(46) Consider the action sequences for “wash aubergine” (A1: wash, O1: aubergine) and “put tomato” (A2: put, O2: tomato) in FIG. 2, as humans it is likely that after looking at these videos, humans would be able to imagine the sequences for categories “wash tomato” (A1,O2) 206 and “put aubergine” (A2,O1) 208 without explicitly looking at the corresponding videos. Individual frames of sequences 202 and 204 may show, for example, the actions being conducted, including still image frames showing the objects, and actions being conducted with them, having movement and interactions being captured in differences between frames. For example, wash aubergine, 202 may include a person's hands lifting the aubergine, and washing and cleaning the aubergine in successive frames. Similarly, for put tomato 204, the tomato may be observed being lifted and moved across frames to be disposed on a table, for example, on a plate.

(47) A composite video may focus on replacing aspects of the videos to shift a tomato into the wash video, or to put an aubergine in the put video, replacing parts of the frames, and in some embodiments, applying transformations to modify the aspects that do not particularly fit to better fit in the context of the composite videos (e.g., the shapes may not match exactly).

(48) Thus, besides providing more training data for recognition models, the advantages of generating HOI videos in zero-shot compositionality setting are multifold: (1) including unseen compositions in the training data would enhance the generalizability of our recognition models; and (2) generated videos can serve as a testbed for several visual reasoning tasks such as counterfactual reasoning.

(49) A task of generating HOI videos with unseen compositions of action and physical article having seen the action and physical article pertaining to that combination independently is proposed in relation to this example, and referred to this “zero-shot HOI video generation”.

(50) Towards this goal, based on the observation that the human activity videos are typically labeled as compositions of an action and a object (e.g., physical article) involved in that action, in an aspect, a task of generating human-object interaction videos in zero-shot compositionality setting is proposed. To generate zero-shot human-object interaction videos, a conditional DCGAN based multi-adversarial GAN is proposed that is configured for focusing on different aspects of a video. Finally, the approach is evaluated on two challenging datasets: Dataset 1 and Dataset 2.

(51) As described herein, the task of zero-shot HOI video generation is introduced. Specifically, given the videos of a set of action and object compositions, an approach proposes to generate unseen compositions having been seen the action and object of a target composition individually, i.e., the target action paired with another object in the existing dataset or the target object being involved in another action in the dataset.

(52) A conditional GAN based generative framework is proposed to generate videos for zero-shot HOI interactions in videos. The proposed framework adopts a multi-adversary approach with each adversarial network focusing on different aspects of the video to train a generator network. Specifically, given an action and object labels along with an image as a context image of the scene, the generator learns to generate a video corresponding to the given action and object in the scene given as the context.

(53) Empirical results and extensive evaluation of an example model is conducted on both subjective and objective metrics demonstrating that the proposed approach outperforms the video generation baselines for two challenging datasets: Dataset 1 and Dataset 2.

(54) Overall, approaches are valuable in enhancing the generalization of HOI models with limited data acquisition. Furthermore, embodiments described herein provide a way to accelerate research in direction of the robust transfer learning based discriminative tasks in human activity videos, thus taking the computational AI systems a step closer to robust understanding and reasoning of the visual world.

(55) Model

(56) To generate videos of human-object interactions, a generative multi-adversarial network is proposed.

(57) FIG. 3A is an example block rendering 300A of an example generative multi-adversarial network, according to some embodiments. Information is shown to be provided to a generator, which generates Vgen (generated data objects) for comparison with Vreal (real data objects). For example, the data objects can be time series based data, such as videos, transaction data, or stock market data, according to various embodiments.

(58) The generator operates in conjunction with the discriminator networks in an attempt to computationally and automatically reduce the loss between Vgen and Vreal. Aspects of information, for example, can be extracted and converted into visual embeddings, word embeddings, among others. Furthermore, as described in further detail below, a scene graph data structure can be maintained which aids in relational discrimination tasks. All or a subset of the discriminators can operate in concert to provide feedback data sets to the generator for incorporation to inform how the generator network should be modified to reduce the loss. Accordingly, over a training period, the generator network along with the discriminator networks are continuously updated.

(59) FIG. 3B is a more in-depth rendering 300B of components of the discriminator network, according to some embodiments.

(60) This model focuses on several aspects of videos, namely, each of the frames of the video, temporally coherent frames and salient objects involved in the activity in the video.

(61) A detailed description of an example model architecture is as follows.

(62) Preliminaries

(63) Generative Adversarial Networks (GAN) consist of two models, namely, generator G and discriminator D that compete with each other. On one hand, the generator G is optimized to learn the true data distribution p.sub.data by generating data that is difficult for the discriminator D to differentiate from real data.

(64) On the other hand, D is optimized to differentiate real data and synthetic data generated by G. Overall, the training follows a two-player zero-sum game with the objective function described below.

(65) $\min_{G} \max_{D} ℒ (G, D) = {��}_{x \sim p_{data}} [\log D (x)] + {��}_{x \sim p_{z}} [\log (1 - D (G (z))]$
where custom character is a noise vector sampled from a distribution p.sub.z such as uniform or Gaussian distribution and x is the real data sample from the true data distribution p.sub.data.

(66) Conditional GAN is a variant of GAN where both generator and discriminator are provided conditioning variables c. Subsequently, the network is optimized using the similar zero-sum game objective to obtain G( custom character , c) and D(x, c). The class of GANs allows the generator network G to learn a mapping between conditioning variables c and the real data distribution.

(67) Proposed Model

(68) Based on the above discussion, a model is introduced on conditional GANs and the training of the model is described in some embodiments. In the following examples, there is a description of each of the elements of the proposed framework below. Overall, the four discriminator networks, i.e., frame discriminator D.sub.f, gradient discriminator D.sub.g, video discriminator D.sub.v, and relational discriminator D.sub.r are all involved in a zero-sum game with the generator network G.

(69) Problem Formulation. Let s.sub.a and s.sub.o be the semantic embedding of action and object label. In the context of non-video examples, these can be two different labelled characteristics instead. Let I be the image provided as a context for the sequence (e.g., video) generation. The approach encodes I using an encoder E.sub.v to obtain an embedding s I, which can be referred to as a context vector. The goal is to generate an output object (e.g., video) V=(V.sup.(i)).sub.i=1.sup.T of length T depicting the action a performed on the object o with context image I as the background of V. To this end, the system 100 learns a function G: (z, s.sub.a, s.sub.o, s I) custom character V, where z is a noise vector sampled from a distribution p.sub.z, such as a Gaussian distribution.

(70) The sequence may, in some embodiments, be a set of sequential data elements, such as frames representing transaction events, rather than videos, and videos are used as a non-limiting, illustrative example.

(71) The context image is encoded using an encoder E to obtain I.sub.c as the context vector.

(72) Let V be the target video to be generated consisting of T(>0) frames V.sup.1, V.sup.2 . . . V.sup.T. The overall goal is to learn a function G:|( custom character , s.sub.a, s.sub.o, I.sub.c)V where z is the noise vector sampled from a distribution p.sub.z such as uniform or Gaussian distribution.

(73) An adversarial approach is proposed with multiple adversaries working simultaneously to learn this generator function. Concretely, the generator network G is trained using four discriminator networks described below: (1) sequence (video) discriminator D.sub.v, (2) frame discriminator D.sub.f, (3) gradient discriminator D.sub.g, and (4) foreground discriminator D.sub.fg as shown in FIG. 3A, and FIG. 3B (i). Not all of the discriminator networks need to be used together, in variant embodiments, one or a plurality of the discriminator networks in various combinations are used.

(74) Sequence (Video) Discriminator Given semantic embeddings s.sub.a and s.sub.o of action and object labels, the sequence (video) discriminator network D.sub.v learns to distinguish between the real video V.sub.real and generated video V.sub.gen=G( custom character , s.sub.a, s.sub.o, I.sub.c).

(75) The network comprises of stacked 3D convolution layers each followed by Batch Normalization layer and LeakyReLU layer with a=0.2 except the last layer which has only sigmoid activation layer, shown in FIG. 3B (ii). The objective function of the network D.sub.v is the following loss function L.sub.v.
L.sub.v=½[log(D.sub.v(V.sub.real,s.sub.a,s.sub.o)+log(1−D.sub.v(V.sub.gen,s.sub.a,s.sub.o))]

(76) The video discriminator network D.sub.v learns to distinguish between real videos V.sub.real and generated videos V.sub.gen by comparing their global spatio-temporal contexts. The architecture consists of stacked conv3d layers, i.e., 3D convolutional layers followed by spectral normalization and leaky ReLU layers with a=0.2.

(77) The system obtains a N×d.sub.0×w.sub.0×h.sub.0 tensor, where N, d.sub.0, w.sub.0, and h.sub.0 are the channel length, depth, width, and height of the activation of the last conv3d layer respectively. We concatenate this tensor with spatially replicated copies of s.sub.a and s.sub.o, which results in a tensor of size (dim(s.sub.a)+dim(s.sub.o)+N)×d.sub.0×w.sub.0×h.sub.0, where dim(⋅) returns the dimensionality of a vector. The system then applies another conv3d layer to obtain a N×d.sub.0×w.sub.0×h.sub.0 tensor.

(78) Finally, the system applies a 1×1×1 convolution followed by a d.sub.0×w.sub.0×h.sub.0 convolution and a sigmoid to obtain the output, which represents the probability that the video V is real. The objective function of the network D.sub.v is the following loss function:
L.sub.v=½[log(D.sub.v(V.sub.real;s.sub.a,s.sub.o))+log(1−D.sub.v(V.sub.gen;s.sub.a,s.sub.o))s.sub.o))].

(79) Frame Discriminator Given semantic embeddings s.sub.a and s.sub.o of action and object labels, the frame discriminator network D.sub.f is optimized to differentiate between each of the frames of the real video V.sub.real and that of the generated video V.sub.gen=G( custom character , s.sub.a, s.sub.o, I.sub.c). In an example embodiment, each of the T frames are processed independently using a network consists of stacked 2D convolution layers each followed by Batch Normalization layer and LeakyReLU layer with a=0.2 [47] except the last layer which has only sigmoid activation layer, shown in FIG. 3B (iii).

(80) The frame discriminator network D.sub.f learns to distinguish between real and generated frames corresponding to the real video V.sub.real and generated video V.sub.gen=G(z, s.sub.a, s.sub.o, s.sub.I) respectively. Each frame in V.sub.gen and V.sub.real can be processed independently using a network consisting of stacked conv2d layers, i.e., 2D convolutional layers followed by spectral normalization and leaky ReLU layers with a=0.2.

(81) The system then obtains a tensor of size N.sup.(t)×w.sub.0.sup.(t)×h.sub.0.sup.(t) (t=1, 2, . . . , T), where N.sup.(t), w.sub.0.sup.(t), and h.sub.0.sup.(t) are the channel length, width and height of the activation of the last conv2d layer respectively.

(82) This tensor is concatenated with spatially replicated copies of s.sub.a and s.sub.o, which results in a tensor of size (dim(s.sub.a)+dim(s.sub.o)+N.sup.(t))×w.sub.0.sup.(t)×h.sub.0.sup.(t). The system then applies another conv2d layer to obtain a N×w.sub.0.sup.(t)×h.sub.0.sup.(t) tensor, and the system now performs 1×1 convolutions followed by w.sub.0.sup.(t)×h.sub.0.sup.(t) convolutions and a sigmoid to obtain a T-dimensional vector corresponding to the T frames of the video V. The i-th element of the output denotes the probability that the frame V.sup.(i) is real.

(83) An example objective function of the network D.sub.f is defined below.

(84) The output of D.sub.f is a T-dimensional vector corresponding to each of the T frames of the video (real or generated).

(85) $- 2 {ptL}_{f} = \frac{1}{2 T} {.Math.}_{i = 1}^{T} [\log (D_{f}^{(i)} (V_{real}; s_{a}, s_{o})) + \log (1 - D_{f}^{(i)} (V_{gen}; s_{a}, s_{o}))],$
where D.sub.f.sup.i is the i-th element of the T-dimensional output of the frame discriminator network D.sub.f.

(86) Another variation of the objective function is the loss function:

(87) $L_{f} = \frac{1}{2 T} {.Math.}_{i = 1}^{T} [\log (D_{f}^{(i)} (V_{real}; s_{a}, s_{o})) + \log (1 - D_{f}^{(i)} (V_{gen}; s_{a}, s_{o}))],$
where D.sub.f.sup.i is the i-th element of the output of the frame discriminator.

(88) Gradient Discriminator

(89) The gradient discriminator network D.sub.g enforces temporal smoothness by learning to differentiate between the temporal gradient of a real video V.sub.real and a generated video V.sub.gen. The temporal gradient ∇.sub.tV of a video V with T frames V.sup.(1), . . . , V.sup.(T) is defined as pixelwise differences between two consecutive frames of the video. The i-th element of ∇.sub.tV is defined as:
[∇.sub.tV].sub.i=V.sup.(i+1)−V.sup.(i), i=1,2, . . . ,(T−1).

(90) Given semantic embeddings s.sub.a and s.sub.o of action and object labels, the frame discriminator network D.sub.g is optimized to differentiate between pixelwise gradient of the real video δV.sub.real and that of the generated video δV.sub.gen.

(91) The pixelwise gradient is a domain-specific aspect that may be different based on different types of target composite data objects. For example, if the composite data object is a linked transaction data structure associated with an event (e.g., coffee shop purchase after job promotion), a different component may be utilized. The gradient discriminator aids in avoiding “jagged” or aberrant shifts as between different sequential sequence elements (e.g., in the context of a video, abrupt jumps between pixels of proximate frames).

(92) The architecture of the gradient discriminator D.sub.g can be similar to that of the frame discriminator D.sub.f. The output of D.sub.g is a (T−1)-dimensional vector corresponding to the (T−1) values in gradient ∇.sub.tV.

(93) The objective function of D.sub.g is

(94) $L_{g} = \frac{1}{2 (T - 1)} {.Math.}_{i = 1}^{T - 1} [\log (D_{g}^{(i)} (\nabla_{t} V_{real}; s_{a}, s_{o})) + \log (1 - D_{g}^{(i)} (\nabla_{t} V_{gen}; s_{a}, s_{o}))],$

(95) where D.sub.g.sup.(i) is the i-th element of the output of D.sub.g.

(96) Foreground Discriminator The foreground of the sequence (video) V with T frames V.sup.1 . . . V.sup.T can be defined with corresponding foreground mask M with T foreground masks m.sup.1 . . . m.sup.T corresponding to the T frames.
F.sup.t=m.sup.t⊙V.sup.t+(1−m.sup.t)⊙V.sup.t, t=1,2 . . . T (6)
where ⊙ is elementwise multiplication of the mask and corresponding frame.

(97) The foreground discriminator is adapted to track and focus attention of the discriminator network in relation to sub-portions of a frame, and in some embodiments, track these attention elements as they move relative to the frame. In the context of a video, if the desired data object is “cut aubergine”, focus may be emphasized on pixels or interface elements representing knife and/or the eggplant, and more specifically on the part of the eggplant being cut.

(98) The focus may be tracked as, for example, a knife and an eggplant translate and rotate in 3-D space and such movements are tracked in the frames of the video. In the context of FIG. 3A, m refers to a mask, which is used, in some embodiments, to identify to areas of focus for the discriminator.

(99) Different approaches can be used to establishing focus—in some embodiments, a human or other mechanism may establish a “ground truth” portion, but such establishing may be very resource intensive (e.g., human has to review and flag sections). Other approaches include generating or establishing ranges and/or areas automatically, for example, using bounding boxes (bboxes) or masks (e.g., polygons or other types of continuous shapes and/or rules).

(100) In relation to a potential sequence of transactions (instead of videos/screen frames), each transaction may be considered a frame. In this example, a ground truth may be established based on which transactions are involved—for example, rent payments can be flagged and tagged.

(101) In another embodiment, a bounding box can be established based on a region of time of payments which are likely to be rent payments (e.g., first of the month). In another embodiment, masks are used as an automated way of getting a detailed estimate of which payments are relevant as rent payments.

(102) Given semantic embeddings s.sub.a and s.sub.o, of action and object labels, the frame discriminator network D.sub.fg is optimized to differentiate between pixelwise gradient of the real video δV.sub.real and that of the generated video δV.sub.gen.

(103) The architecture for foreground discriminator D.sub.g can be similar to that of frame discriminator. The objective function of the network D.sub.fg is defined below. The output of D.sub.fg is a T-dimensional vector corresponding to each of the T foreground frames of the sequence (e.g., video) (real or generated).

(104) $\begin{matrix} L_{fg} = \frac{1}{2 T} {.Math.}_{i = 1}^{T} [\log (D_{fg}^{i} (F_{real}, s_{a}, s_{o}) + \log (1 - D_{fg}^{i} (F_{gen}, s_{a}, s_{o}))] & (7) \end{matrix}$

(105) Relational Discriminator. The relational discriminator D.sub.r leverages a spatio-temporal scene graph to distinguish between object layouts in videos. Each node contains convolutional embedding, position and aspect ratio (AR) information of the object crop obtained from MaskRCNN. The nodes are connected in space and time and edges are weighted based on their inverse distance. Edge weights of (dis)appearing objects are set to 0.

(106) In addition to the pixel-centric discriminators above, Applicants also propose a novel object-centric discriminator D.sub.r. Driven by a spatio-temporal scene graph, this relational discriminator learns to distinguish between the object layouts of real videos V.sub.real and generated videos V.sub.gen (see FIG. 3C). As shown in FIG. 3C, objects (e.g., physical articles) in this frame are tracked—glass, aubergine, sponge, fork.

(107) Specifically, the discriminator builds a spatio-temporal scene graph custom character =(, ε) from V, where the nodes and edges are represented by and ε respectively.

(108) The scene graph can include spatial edges 302, temporal edges 304, and disabled edges 306.

(109) The system assumes one node per object per frame. Each node is connected to all other nodes in the same frame, referred to as spatial edges 302. In addition, to represent temporal evolution of objects, each node is connected to the corresponding nodes in the adjacent frames that also depict the same object, referred to as temporal edges 304. To obtain the node representations, the system crops the objects in V using Mask-RCNN, computes a convolutional embedding for them, and then augments the resulting vectors with the aspect ratio and position of the corresponding bounding boxes.

(110) The weights of spatial edges in ε are given by inverse Euclidean distances between the centers of these bounding boxes. The weights of the temporal edges 304 is set to 1 by default. The cases of (dis)appearing objects are handled by setting the corresponding spatial and temporal edges to 0 (e.g., disabled edge 306).

(111) The relational discriminator D.sub.r operates on this scene graph custom character by virtue of a graph convolutional network (GCN) followed by stacking and average-pooling of the resulting node representations along the time axis.

(112) The discriminator is configured to then concatenate this tensor with spatially replicated copies of s.sub.a and s.sub.o to result in a tensor of size (dim(s.sub.a)+dim(s.sub.o)+N.sup.(t))×w.sub.0.sup.(t)×h.sub.0.sup.(t).

(113) As before, the discriminator is configured to then apply convolutions and sigmoid to obtain the final output which denotes the probability of the scene graph belonging to a real output data object (e.g., video). The objective function of the network D.sub.r is given by:
L.sub.r=½[log(D.sub.r( custom character .sub.real;s.sub.a,s.sub.o))+log(1−D.sub.r(.sub.gen;s.sub.a,s.sub.o))].

(114) Generator. Given the semantic embeddings s.sub.a, s.sub.o of action and object labels and context vector I.sub.c, the generator network learns to generate T frames of size H×W×3 See FIG. 3B (i). The approach can include concatenating noise z with the conditions, namely, s.sub.a, s.sub.o, and s I. This concatenated vector can be provided as the input to the network G.

(115) The network comprises stacked deconv3d layers, i.e., 3D transposed convolution layers each followed by Batch Normalization and leaky ReLU layers with a=0.2 except the last convolutional layer which is followed by a Batch Normalization layer and a tan h activation layer

(116) The network can comprise stacked 3D transposed convolution networks. Each convolutional layer can be followed by Batch Normalization layers and ReLU activation layer except the last convolutional layer which is followed by Batch Normalization layers and tan h activation layer. The network can be optimized according to the following objective function, in an embodiment.

(117) 0 $L_{gan} = \frac{1}{T} {.Math.}_{i = 1}^{T} [\log (1 - D_{f}^{(i)} (V_{gen}; s_{a}, s_{o}))] + \frac{1}{(T - 1)} {.Math.}_{i = 1}^{T - 1} [\log (1 - D_{g}^{(i)} (\nabla_{t} V_{gen}; s_{a}, s_{o}))] + \log (1 - D_{v} (V_{gen}; s_{a}, s_{o})) + \log (1 - D_{r} ({��}_{gen}; s_{a}, s_{o}))$

(118) FIG. 4A is an example depiction 400A of a word embedding that can be used to establish relationships between different physical article/action pairs in the context of video generation, according to some embodiments.

(119) In this example, similar to FIG. 2, a new video is requested to be generated based off of “wash tomato”. The system has observed “wash aubergine” and “put tomato” in the training set. To create “wash tomato”, the system identifies aspects of the training videos for composite generation to create a composite, and in some embodiments, transforms the aspects based on other features extracted from other training data.

(120) As the size of the training data set grows, the system's ability to mix and match, transform, and generate composites grows. For example, if the system has observed tomatoes, peaches, strawberries, etc., in videos, it may draw upon and generate new compositions based on combinations and transformations thereof based on, for example, a vector distance between the desired composition and the underlying training data vectors.

(121) In another, more technically challenging example, the system may receive requests for unseen compositions where aspects of the unseen compositions are unknown even in the training examples. In these situations, the system may attempt generation of unknown aspects based on extending aspects of other training examples, even if such generation may yield (to humans) a fairly nonsensical result.

(122) For example, an unseen composition may be directed to “cut peach”, or “open egg”, and the system may adapt aspects of other approaches and insert frame elements into these sequences based on similarities in word embeddings associated with the underlying categories and training objects. For “cut peach”, the inside portion of a nectarine may be inserted into the peach since the system may have observed that a nectarine is also a stone fruit. Similarly, opening an egg may also yield nectarine inner portions as the system may not be able to identify what should be in an egg as it has never observed the insides of an egg in training, and simply picks the nectarine based on the shape of the nectarine (round).

(123) FIG. 4B is an example depiction 400B of a word embedding that can be used to establish relationships between different object/action pairs in the context of transaction generation, according to some embodiments. In this example, the system is tasked with generating a data object that is a composite of the underlying training data elements, without having observed the data object classification in the training data.

(124) The system in this example is tasked with generating a representation of transaction sequences in a hypothetical scenario where Michael has two children.

(125) As shown in FIG. 4B, transaction sequences in the real world are known for Michael (with no children), and for Greg (with children). A mapping and extension of aspects of Greg to Michael would be generated as a vector representation, and, for example, a sequence of simulated transactions could be stored therein.

(126) FIG. 5 is an example method for generating one or more data structures, the one or more data structures representing an unseen composition based on a first category and a second category observed individually, according to some embodiments. The method 500 is shown as an example, and other steps, alternate steps, and variations are possible.

(127) At 502, a data receiver interface receives a training data set including labelled data elements based on the first category and labelled data elements based on the second category and receives a target category indication representative of the unseen composition.

(128) At 504 a conditional generative adversarial network processes the training data set to train a discriminator model architecture coupled to a generator model architecture, the discriminator model architecture having a plurality of adversarial networks operating in concert to train the generator model architecture.

(129) At 506, a sequence discriminator is configured to distinguish between a real sequence and a generated sequence.

(130) At 508, a frame discriminator is configured to differentiate between frames representing sequence subsets of the real sequence and the generated sequence.

(131) At 510, a gradient discriminator is configured to differentiate between a domain-specific gradient determined based on the type of data structure of the one or more data structures and the training data set.

(132) At 512, a foreground or a relational discriminator is configured to assign weights for shifting focus of the generator model architecture to a subset of the one or more new data structures based on an identified context associated with the target category indication of the unseen composition.

(133) At 514, a generator model architecture generates the one or more data structures representing the unseen composition.

(134) FIG. 6 is a schematic diagram of a computing device 600 such as a server. As depicted, the computing device includes at least one processor 602, memory 606, at least one I/O interface 606, and at least one network interface 608.

(135) Processor 602 may be an Intel or AMD x86 or x64, PowerPC, ARM processor, or the like. Memory 604 may include a combination of computer memory that is located either internally or externally such as, for example, random-access memory (RAM), read-only memory (ROM), compact disc read-only memory (CDROM). Each I/O interface 606 enables computing device 600 to interconnect with one or more input devices, such as a keyboard, mouse, camera, touch screen and a microphone, or with one or more output devices such as a display screen and a speaker.

(136) Each network interface 608 enables computing device 600 to communicate with other components, to exchange data with other components, to access and connect to network resources, to serve applications, and perform other computing applications by connecting to a network (or multiple networks) capable of carrying data including the Internet, Ethernet, plain old telephone service (POTS) line, public switch telephone network (PSTN), integrated services digital network (ISDN), digital subscriber line (DSL), coaxial cable, fiber optics, satellite, mobile, wireless (e.g. WiMAX), SS7 signaling network, fixed line, local area network, wide area network, and others.

(137) Computing device 600, in some embodiments, is a special purpose machine that may reside at a data center. The special purpose machine, for example, incorporates the features of the system 100 and is provided in a portable computing mechanism that, for example, may be placed into a data center as a rack server or rack server component that interoperates and interconnects with other devices, for example, across a network or a message bus, and configured to generate insights and create new composite data objects based on training data and received data requests.

Experiments

(138) Experiments on zero-shot human-object sequence (e.g. video) generation showcase: (1) the ability of the proposed model to generate videos in different scenarios, (2) the performance comparison of proposed approach over state-of-the-art video generation models, and (3) finally, the limitations of the proposed approach of some embodiments. As mentioned, videos are only one type of data object and other types of composite data objects are contemplated in other embodiments.

(139) In the experiments, the convolutional layers in all networks, namely, G, D.sub.f, D.sub.g, D.sub.v, D.sub.r have kernel size 4 and stride 2.

(140) The approach includes generating a video clip consisting of T=16 frames having H=W=64. The noise vector z is of length 100. The parameters w.sub.0=h.sub.0=4, d.sub.0=1 and N=512 for D.sub.v and w.sub.0.sup.t=h.sub.0.sup.t=4 and N.sup.(t)=512 for D.sub.f, D.sub.g, and D.sub.r. To obtain the semantic embeddings s.sub.a and s.sub.o corresponding to action and object labels respectively, Applicants use Wikipedia-pretrained GLoVe embedding vectors of length 300.

(141) For training, Applicants use the Adam optimizer with learning rate 0.0002 and β.sub.1=0.5, β.sub.2=0.999 but other approaches are possible. Applicants train all models with a batch size of 32. In this experimental validation, Applicants used dropout (probability=0.3) in the last layer of all discriminators and all layers (except first) of the generator.

(142) Dataset 1 is shown on the left and Dataset 2 is shown on the right.

(143) TABLE-US-00001 TABLE 1 Quantitative Evaluation for GS1 I-score ↑ S-score ↓ D-score ↑ I-score ↑ S-score ↓ D-score ↑ Baseline: C-VGAN 1.8 30.9 0.2 2.1 25.4 0.4 Baseline: C-TGAN 1.5 35.9 0.4 2.2 28.9 0.6 Ours-V 2.0 29.2 0.3 2.1 27.2 0.3 Ours-V + F 2.3 26.1 0.6 2.5 22.2 0.65 Ours-V + F + G 2.8 15.1 1.4 2.8 14.2 1.1 Ours-V + F + G + Fg(gt) 4.1 13.1 2.1 — — — Ours-V + F + G + Fg(bboxes) 4.0 14.5 1.9 5.6 12.7 2.4 Ours-V + F + G + Fg(masks) 4.8 11.5 2.9 6.6 10.2 3.0 Including Unlabeled Data Baseline: C-VGAN Baseline: C-TGAN Ours(bboxes) 5.0 9.5 2.4 7.3 10.2 3.6 Ours(masks) 7.7 7.5 3.4 9.4 6.2 4.5 One hot encoded labels instead of embeddings Baseline: C-VGAN Baseline: C-TGAN Ours(bboxes) 3.0 20.5 1.4 3.3 29.2 1.6 Ours(masks) 2.8 24.5 2.0 4.2 18.5 3.1

(144) TABLE-US-00002 TABLE 2 Quantitative Evaluation for GS2 I-score ↑ S-score ↓ D-score ↑ I-score ↑ S-score ↓ D-score ↑ Baseline: C-VGAN 1.4 44.9 0.3 1.8 40.5 0.3 Baseline: C-TGAN 1.5 35.9 0.4 1.6 39.7 0.5 Ours-V 1.2 42.1 0.4 1.6 41.1 0.6 Ours-V + F 2.2 34.1 0.6 2.2 37.3 0.7 Ours-V + F + G 2.6 29.7 1.9 2.4 27.6 1.7 Ours-V + F + G + Fg(gt) 3.6 21.1 2.1 — — — Ours-V + F + G + FG(bboxes) 3.4 27.5 2.4 4.3 15.2 1.4 Ours-V + F + G + FG(masks) 3.6 32.7 3.4 4.6 12.9 2.4 Including Unlabeled Data Baseline: C-VGAN Baseline: C-TGAN Ours(bboxes) 4.5 15.7 2.4 5.3 10.2 3.7 Ours(masks) 5.0 12.6 3.4 7.0 9.6 4.1 One hot encoded labels instead of embedding Baseline: C-VGAN Baseline: C-TGAN Ours(bboxes) 2.4 25.5 1.3 3.6 32.2 1.6 Ours(masks) 3.6 21.2 2.1 4.7 25.2 3.1
Experimental Setup

(145) Datasets. Two datasets: (1) Dataset 1, (2) Dataset 2 consisting of diverse and challenging human-object interaction videos ranging from simple translational motion of objects (e.g., push, move) to rotation (e.g. open) and transformations in state of objects (e.g. cut, fold).

(146) Both of these datasets comprise a diverse set of HOI videos ranging from simple translational motion of objects (e.g. push, move) and rotation (e.g. open) to transformations in state of objects (e.g. cut, fold). Therefore, these datasets, with their wide ranging variety and complexity, provide a challenging setup for evaluating HOI video generation models.

(147) Dataset 1 contains egocentric videos of activities in several kitchens. A video clip V is annotated with action label a and object label O (e.g., open microwave, cut apple, move pan) along with a set of bounding boxes custom character (one per frame) for objects that the human interacts with while performing the action. There are around 40000 instances in the form of (V, a, o, ) across 352 objects and 125 actions. This dataset is referred to as Dataset 1 hereafter.

(148) Dataset 2 contains videos of daily activities performed by humans. A video clip V is annotated with a label l with action template and one or two objects O involved in the activity (e.g., moving a book down with action template ‘moving something down’, hitting ball with racket with action template ‘hitting something with something’). There are 220,847 training instances of the form (V, l) across 30,408 objects and 174 action templates.

(149) To transform the dataset from elements of the form of videos with natural language labels (V, l) to videos with action and object labels (V, a, o), Applicant used NLTK POS-tagger to obtain verbs and nouns in l as follows. Applicant derived action label a by stemming the verb (e.g. for closing, the action label a is close) in l. All of the labels in the dataset begin with present perfect form of the verb therefore, the active object O is the noun that occurs just after the verb in the label l. Applicant refer to this dataset as Dataset 2 hereafter.

(150) Splitting by Compositions/Data Splits. To make the dataset splits suitable for the problem of zero-shot human-object interactions, the system combined the videos in the validation and train splits originally provided in the dataset and perform the split ensuring that all the unique objects and action labels in the original dataset are seen independently in training set however a particular combination of object and action present in testing set is not present in training and vice versa. Formally, the approach splits the dataset custom character into two splits training set .sub.tr and testing set .sub.te based on the set of unique actions and the set of unique objects in the dataset .

(151) The training set custom character .sub.tr contains videos with action and object label (V,a,o) with a∈ and o∈ such that the data samples, i.e., videos cover all elements in set of actions and set of object .

(152) Therefore, videos with both action label a.sub.t and object label o.sub.t in custom character .sub.te would never occur in .sub.tr however video with action label a.sub.t and another object label o.sub.t′ or another action label a′.sub.t and the object label o.sub.t can be present in .sub.tr.

(153) Data Processing To obtain the semantic embedding for action and object labels, one can use Wikipedia-pretrained GLoVe embeddings. Each of the embeddings are of dimension 300. To obtain the foreground masks (both bounding boxes and segmentation masks), one can use MS-COCO pretrained Mask-RCNN. The masks were obtained for both datasets.

(154) Generation Scenarios. Two different generation scenarios are provided to evaluate the Generator model trained on the training set described earlier in the section.

(155) Recall that the generator network in an embodiment of the proposed framework 300A (FIG. 3A) has 3 conditional inputs, namely, action embedding, object embedding, and context frame I.

(156) The context frame serves as the background in the scene. Thus, to provide this context frame during training, the system can apply a binary mask M.sup.(1) corresponding to the first frame V.sup.(1) of a real video as I=(1−M.sup.(1))⊙V.sup.(1), where 1 represents a matrix of size M.sup.(1) containing all ones and ⊙ denotes elementwise multiplication.

(157) This mask M.sup.(1) contains ones in regions (either rectangular bounding boxes or segmentation masks) corresponding to the objects (non-person classes) detected using MaskRCNN and zeros for other regions. Intuitively, this helps ensure the generator learns to map the action and object embeddings to relevant visual content in the HOI video.

(158) During testing, to evaluate the generator's capability to synthesize the right human-object interactions, Applicants provide a background frame as described above. This background frame can be selected from either the test set or training set, and can be suitable or unsuitable for the target action-object composition. To capture these possibilities, we design two different generation scenarios.

(159) Specifically, in Generation Scenario 1 (GS1), the input context frame I is the masked first frame of a video from the test set corresponding to the target action-object composition (unseen during training).

(160) In Generation Scenario 2 (GS2), I is the masked first frame of a video from the training set which depicts an object other than the target object. The original action in this video could be same or different than the target action. Refer to Table 1 to see the contrast between the two scenarios.

(161) TABLE-US-00003 TABLE 1 Generation Scenarios. Description of the conditional inputs for the two generation scenarios GS1 & GS2 used for evaluation. Target Conditions GS1 GS2 Target action a seen during training ✓ ✓ Target object o seen during training ✓ ✓ Background of target context I seen during training x ✓ Object mask in target context I corresponds to target ✓ x object o Target action a seen with target context I during training x ✓/x Target object o seen with target context I during training x x Target action-object composition (a-o) seen during x x training ✓ denotes ‘Yes’, x denotes ‘No’.

(162) As such, in GS1, the generator receives a context that it has not seen during training but the context (including object mask) is consistent with the target action-object composition it is being asked to generate.

(163) In contrast, in GS2, the generator receives a context frame that it has seen during training but is not consistent with the action-object composition it is being asked to generate. Particularly, the object mask in the context does not correspond to the target object. Thus, these generation scenarios help illustrate that the generator indeed generalizes over compositions.

(164) Evaluation Metrics. Quantitative evaluation of the quality of images or videos is inherently challenging thus, Applicants use both quantitative and qualitative metrics.

(165) Quantitative Metrics. Inception Score (I-score) is a widely used metric for evaluating image generation models. For images x with labels y, I-score is defined as exp(KL(ρ(y|x)∥ρ(x))) where ρ(y|x) is the conditional label distribution of an ImageNet-pretrained Inception model. Applicants adopted this metric for video quality evaluation. Applicants fine-tune a Kinetics-pretrained video classifier ResNeXt for each of the source datasets and use it for calculating I-score (higher is better). It is based on one of the state-of-the-art video classification architectures. Applicants used the same evaluation setup for the baselines and an embodiment of the proposed model to ensure a fair comparison.

(166) In addition, Applicants hypothesize that measuring realism explicitly is more relevant for the task as the generation process can be conditioned on any context frame arbitrarily to obtain diverse samples. Therefore, in addition to I-score, Applicants also analyze the first and second terms of the KL divergence separately.

(167) Applicants refer to these terms as: (1) Saliency score or S-score (lower is better) to specifically measure realism, and (2) Diversity score or D-score (higher is better) to indicate the diversity in generated samples.

(168) A smaller value of S-score implies that the generated videos are more realistic as the classifier is very confident in classifying the generated videos. Specifically, the saliency score will have a low value (low is good) only when the classifier is confidently able to classify the generated videos into action-object categories matching the conditional input composition (action-object), thus indicating realistic instances of the required target interaction. In fact, even if a model generates realistic-looking videos but depicts an action-object composition not corresponding to the conditional action-object input, the saliency score will have high values.

(169) Finally, a larger value of D-score implies the model generates diverse samples.

(170) Human Preference Score. Applicants conducted a user study for evaluating the quality of generated videos. In each test, Applicants present the participants with two videos generated by two different algorithms and ask which among the two better depicts the given activity, i.e., action-object composition (e.g. lift fork). Applicants evaluate the performance of an approach as the overall percentage of tests in which that approach's outputs are preferred. This is an aggregate measure over all the test instances across all participants.

(171) Baselines. Applicants compare the approach of some embodiments with three state-of-the-art video generation approaches: (1) VGAN, (2) TGAN, and (3) MoCoGAN. Applicants develop the conditional variants of VGAN and TGAN from the descriptions provided in their papers. Applicants refer to the conditional variants as C-VGAN and C-TGAN respectively.

(172) Applicants observed that these two models saturated easily in the initial iterations, thus, Applicants added dropout in the last layer of the discriminator network in both models. MoCoGAN focuses on disentangling motion and content in the latent space and is the closest baseline. Applicants use the code provided by the authors.

(173) As shown in Table 2, the proposed generator network with different conditional inputs outperforms C-VGAN and C-TGAN by a wide margin in both generation scenarios. Ours refers to models based on variations of the proposed embodiments.

(174) TABLE-US-00004 TABLE 2 Quantitative Evaluation. Comparison of HOI-GAN with C-VGAN, C-TGAN, and MoCoGAN baselines. We distinguish training of HOI-GAN with bounding boxes (bboxes) and segmentation masks (masks). EPIC SS GS1 GS2 GS1 GS2 Model I↑ S↓ D↑ I↑ S↓ D↑ I↑ S↓ D↑ I↑ S↓ D↑ C-VGAN [68] 1.8 30.9 0.2 1.4 44.9 0.3 2.1 25.4 0.4 1.8 40.5 0.3 C-TGAN [58] 2.0 30.4 0.6 1.5 35.9 0.4 2.2 28.9 0.6 1.6 39.7 0.5 MoCoGAN [66] 2.4 30.7 0.5 2.2 31.4 1.2 2.8 17.5 1.0 2.4 33.7 1.4 (ours) HOI-GAN (bboxes) 6.0 14.0 3.4 5.7 20.8 4.0 6.6 12.7 3.5 6.0 15.2 2.9 HOI-GAN (masks) 6.2 13.2 3.7 5.2 18.3 3.5 8.6 11.4 4.4 7.1 14.7 4.0 Arrows indicate whether lower (↓) or higher (↑) is better. [I: inception score; S: saliency score; D: diversity score]

(175) In addition, the overall proposed model shows considerable improvement over MoCoGAN, while MoCoGAN has comparable scores to some ablated versions of the proposed models (specifically where gradient discriminator and/or relational discriminator is missing).

(176) Furthermore, Applicants varied the richness of the masks in the conditional input context frame ranging from bounding boxes to segmentation masks obtained corresponding to non-person classes using MaskRCNN framework. As such, the usage of segmentation masks implies explicit shape information as opposed to the usage of bounding boxes where the shape information needs to be learnt by the model. Applicants observe that providing masks during training leads to slight improvements in both scenarios as compared to using bounding boxes (refer to Table 2).

(177) Applicants also show the samples generated using the best version of the generator network for the two generation scenarios (FIG. 7).

(178) FIG. 7 shows screen captures 700 of videos generated using the best version of HOI-GAN using embeddings for action (a)-object (o) composition and the context frame. Applicants show 5 frames of the video clip generated for both generation scenarios GS1 and GS2. The context frame in GS1 is obtained from a video in the test set depicting an action object composition same as the target one. The context frame for GS2 scenarios shown here are from videos depicting \take carrot” (for row 3) and \put bowl” (for row 4). Conditional VideoGAN. VideoGAN involves two stream generator involving generation of foreground and background separately. Applicants develop the conditional variant of the VGAN model from the descriptions in the paper. Specifically, the approach provides semantic embeddings as the inputs and encoded images as the inputs to the generator and the semantic embeddings as the inputs to the last fully-connected layer of the discriminator. The conditional variant of the VideoGAN model is referred to as C-VGAN hereafter. Conditional TemporalGAN. TemporalGAN uses a temporal generator involving 1D convolutions along the depth of the input to produce n latent variables from the input noise.

(179) These latent variables are provided inputs to n independent generator to generate each of the n frames in a video. The conditional variant is developed of the TGAN as described in various embodiments. Specifically, the approach provides semantic embeddings and context image (encoded) as inputs to the temporal and image generators and the semantic embeddings as the inputs to the last fully-connected layer of the discriminator. The conditional variant of TemporalGAN is referred to as C-TGAN hereafter.

(180) Implementation Details Networks G, D, D.sub.f, D.sub.g, D.sub.fg are implemented with convolutional layers of kernel size 4 and stride 2. To optimize the networks, an approach uses Adam optimizer with learning rate 0.0002 with β.sub.1=0:9 and β.sub.2=0:999. A batch size of 64 is maintained while training our model and baselines (C-VGAN abd C-TGAN).

(181) Quantitative Results

(182) Comparison with Baselines

(183) Applicants compare with baselines as described above in both generation scenarios (shown in Table 1 and 2).

(184) Including Unlabeled Data

(185) A weaker zero-shot is performed in semi-supervised setting where the model is fed the full dataset with the categories in the testing set are not given any labels or embedding. Refer Table 1 and 2.

(186) Labels vs Embeddings

(187) Applicants argue that the embeddings provide auxiliary information about the label categories. To verify this arguments, Applicants compare the model outputs of labels with categories, and refers to the results of Table 1 and 2.

(188) Qualitative Results

(189) Qualitative results of experiments are provided in FIG. 8. FIG. 8 are generated versions 800 of composite data objects (in this case, videos) based on unseen compositions, according to some embodiments.

(190) As shown in FIG. 8, unseen compositions are based on category combinations where the training data may have observed data objects based off of each of the categories individually, or off of similar categories. In this example, the computer system is tasked with generating composite's based off of the compositions put banana celery, hold bowl, and put apple. As shown in these illustrative examples, the system takes aspects of the underlying training data objects and combines them together to form new generated videos. However, as there may be gaps in observations, the system adapts by transforming or otherwise manipulating the underlying data objects in an attempt to create realistic looking composite data objects. FIG. 8 shows that this is problem is challenging for computer systems.

(191) As described herein, various embodiments are proposed in relation to systems and methods for generating composite objects, including, for example, zero-shot HOI videos.

(192) Specifically, the problem of generating video corresponding to unseen compositions of action and object having seen the action and object independently is evaluated. In various embodiments, there is proposed a DC-GAN based multi-adversarial model. An example evaluation is evaluated using subjective and objective measures and demonstrated that some embodiments of the approach perform better than baselines.

(193) Ablation Study. To illustrate the impact of each discriminator in generating HOI videos, Applicants conducted ablation experiments (refer to Table 3). Applicant observe that the addition of temporal information using the gradient discriminator and spatio-temporal information using the video discriminator lead to improvement in generation quality.

(194) In particular, the addition of our scene graph based relational discriminator leads to considerable improvement in generation quality resulting in more realistic videos (refer to second block in Table 3).

(195) TABLE-US-00005 TABLE 3 Ablation Study. We evaluate the contributions of our pixel-centric losses (F, G, V) and relational losses (first|block vs. second block) by conducting ablation study on HOI-GAN (masks). The last row corresponds to the overall proposed model. EPIC SS GS1 GS2 GS1 GS2 Model I↑ S↓ D↑ I↑ S↓ D↑ I↑ S↓ D↑ I↑ S↓ D↑ −R HOI-GAN (F) 1.4 44.2 0.2 1.1 47.2 0.3 1.8 34.7 0.4 1.5 39.5 0.3 HOI-GAN (F + 2.3 25.6 0.7 1.9 30.7 0.5 3.0 24.5 0.9 2.7 28.8 0.7 G) HOI-GAN (F + 2.8 21.2 1.3 2.6 29.7 1.7 3.3 18.6 1.2 3.0 20.7 1.0 G + V) +R HOI-GAN (F) 2.4 24.9 0.8 2.2 26.0 0.7 3.1 20.3 1.0 2.9 27.7 0.9 HOI-GAN (F + 5.9 15.4 3.5 4.8 21.3 3.3 7.4 12.1 3.5 5.4 19.2 3.4 G) HOI-GAN (F + 6.2 13.2 3.7 5.2 18.3 3.5 8.6 11.4 4.4 7.1 14.7 4.0 G + V) [F: frame discriminator D.sub.f; G: gradient discriminator D.sub.g; V: video discriminator D.sub.v; R: relational discriminator D.sub.r]

(196) Human Evaluation: Applicants recruited 15 sequestered participants for a user study. Applicants randomly chose 50 unique categories and chose generated videos for half of them from generation scenario GS1 and the other half from GS2. For each category, Applicants provided three instances, each containing a pair of videos; one generated using a baseline model and the other using HOI-GAN. For each instance, at least 3 participants (ensuring inter-rater reliability) were asked to choose the video that best depicts the given category. The (aggregate) human preference scores for the proposed model versus the baselines range between 69-84% for both generation scenarios (refer Table 4) indicate that HOI-GAN generates more realistic videos than the baselines.

(197) TABLE-US-00006 TABLE 4 Human Evaluation. Human Preference Score (%) for scenarios GS1 and GS2. All the results have p-value less than 005 implying statistical significance. Ours/Baseline GS1 GS2 HOI-GAN/MoCoGAN 71.7/28.3 69.2/30.8 HOI-GAN/C-TGAN 75.4/34.9 79.3/30.7 HOI-GAN/C-VGAN 83.6/16.4 80.4/19.6

(198) Failure Cases: Applicants discuss the limitations of the framework using qualitative examples shown in the screenshots 900 of FIG. 9. For “open microwave”, Applicants observe that although HOI-GAN is able to generate conventional colors for a microwave, it shows limited capability to hallucinate such large objects. For “cut peach” (FIG. 9), the generated sample shows that the model can learn the increase in count of partial objects corresponding to the action cut and yellow-green color of a peach.

(199) However, as the model has not observed the interior of a peach during training (as cut peach was not in training set), it is unable to create realistic transformations in the state of peach that show the interior clearly. Accordingly, in some embodiments, Applicants suggest that using external knowledge and semi-supervised data in conjunction with the models described herein can potentially lead to more powerful generative models while still adhering to the zero-shot compositional setting.

(200) Applicant notes that the described embodiments and examples are illustrative and non-limiting. Practical implementation of the features may incorporate a combination of some or all of the aspects, and features described herein should not be taken as indications of future or existing product plans. Applicant partakes in both foundational and applied research, and in some cases, the features described are developed on an exploratory basis.

(201) The term “connected” or “coupled to” may include both direct coupling (in which two elements that are coupled to each other contact each other) and indirect coupling (in which at least one additional element is located between the two elements).

(202) Although the embodiments have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the scope. Moreover, the scope of the present application is not intended to be limited to the particular embodiments of the process, machine, manufacture, composition of matter, means, methods and steps described in the specification.

(203) As one of ordinary skill in the art will readily appreciate from the disclosure, processes, machines, manufacture, compositions of matter, means, methods, or steps, presently existing or later to be developed, that perform substantially the same function or achieve substantially the same result as the corresponding embodiments described herein may be utilized. Accordingly, the embodiments are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or steps.

(204) As can be understood, the examples described above and illustrated are intended to be exemplary only.

System and method for generation of unseen composite data objects

Assignee

Inventors

Cpc classification

Classification Explorer

G06V20/41

PHYSICS

Classification Explorer

G06V10/82

PHYSICS

Classification Explorer

G06F18/214

PHYSICS

Classification Explorer

G06V10/774

PHYSICS

Classification Explorer

G06V20/44

PHYSICS

Classification Explorer

G06V10/776

PHYSICS

Classification Explorer

G06N3/08

PHYSICS

Classification Explorer

G06N3/045

PHYSICS

Classification Explorer

H04N5/222

ELECTRICITY

Classification Explorer

G06F18/211

PHYSICS

International classification

Classification Explorer

G06K9/62

PHYSICS

Classification Explorer

G06N3/08

PHYSICS

Abstract

Claims

Description