METHOD AND SYSTEM FOR SCENE GRAPH GENERATION
20220391704 · 2022-12-08
Assignee
Inventors
CPC classification
G06V10/778
PHYSICS
G06V10/751
PHYSICS
G06V20/70
PHYSICS
G06V10/84
PHYSICS
G06N3/042
PHYSICS
G06V10/771
PHYSICS
G06V10/774
PHYSICS
International classification
G06V10/75
PHYSICS
G06V10/771
PHYSICS
G06V10/774
PHYSICS
Abstract
Broadly speaking, the disclosure generally relates to computer-implemented methods and systems for scene graph generation, and in particular to training a machine learning, ML, model to generate a scene graph. The method includes inputting a training image into a machine learning model, and outputting a predicted label for each of at least two objects in the training image and a predicted label for a relationship between the at least two objects. The training method includes calculating a loss, which takes into account both a supervised loss calculated by comparing the predicted labels to the actual labels for the training image, and a logic-based loss calculated by comparing the predicted labels to stored integrity constraints comprising common-sense knowledge. Advantageously, this means that the performance of the model is improved without increasing processing at inference time.
Claims
1. A computer-implemented method for training a machine learning model comprising a plurality of neural network layers for scene graph generation, SGG, comprising: inputting a training image into the machine learning model, the training image depicting a scene comprising at least two objects; outputting, from the machine learning model, a predicted label for each of the at least two objects and a predicted label for a relationship between the at least two objects; calculating a loss; and updating, based on the calculated loss, weights for computation by the plurality of neural network layers of the machine learning model, wherein calculating the loss comprises: calculating a supervised loss by comparing the predicted label for each of the at least two objects and the predicted label for the relationship between the at least two objects to actual labels corresponding to the training image; and calculating a logic-based loss by comparing the predicted label for each of the at least two objects and the predicted label for the relationship between the at least two objects to a plurality of stored integrity constraints, wherein the integrity constraints comprise common-sense knowledge.
2. The method of claim 1, wherein calculating the logic-based loss comprises: selecting a maximally non-satisfied subset of the plurality of stored integrity constraints, and calculating the logic-based loss based on the maximally non-satisfied subset.
3. The method of claim 2, wherein selecting the maximally non-satisfied subset comprises: obtaining output predictions of the machine learning model with the highest confidences; ordering the obtained output predictions by decreasing confidence; removing any prediction of the obtained output predictions for which there is no corresponding negative integrity constraint in the plurality of stored integrity constraints; selecting random subsets of the obtained output predictions; calculating a loss for each of the random subsets; and selecting the maximum loss corresponding to the random subsets as the logic-based loss.
4. The method of claim 1, wherein the logic-based loss is inversely proportional to a satisfaction function, which measures how closely the output prediction for each of the at least two objects and the output prediction for the relationship between the at least two objects satisfy the plurality of stored integrity constraints.
5. The method of claim 1, wherein the logic-based loss is calculated using probabilistic logic.
6. The method of claim 1, wherein the logic-based loss is calculated using fuzzy logic.
7. The method of claim 1, wherein the plurality of stored integrity constraints comprises positive integrity constraints expressing permissible relationships between objects, and negative integrity constraints expressing impermissible relationships between objects.
8. A system comprising: at least one processor, coupled to memory, configured to train a machine learning model comprising a plurality of neural network layers for scene graph generation, SGG, by: inputting a training image into the machine learning model, the training image depicting a scene comprising at least two objects; outputting, from the machine learning model, a predicted label for each of the at least two objects and a predicted label for a relationship between the at least two objects; calculating a loss; and updating, based on the calculated loss, weights for computation by the plurality of neural network layers of the machine learning model, wherein calculating the loss comprises: calculating a supervised loss by comparing the predicted label for each of the at least two objects and the predicted label for the relationship between the at least two objects to actual labels corresponding to the training image; and calculating a logic-based loss by comparing the predicted label for each of the at least two objects and the predicted label for the relationship between the at least two objects to a plurality of stored integrity constraints, wherein the integrity constraints comprise common-sense knowledge.
9. The system of claim 8, wherein the at least one processor is configured to train the machine learning model by: selecting a maximally non-satisfied subset of the plurality of stored integrity constraints, and calculating the logic-based loss based on the maximally non-satisfied subset.
10. The system of claim 9, wherein the at least one processor is configured to train the machine learning model by: obtaining output predictions of the machine learning model with the highest confidences; ordering the obtained output predictions by decreasing confidence; removing any prediction of the obtained output predictions for which there is no corresponding negative integrity constraint in the plurality of stored integrity constraints; selecting random subsets of the obtained output predictions; calculating a loss for each of the random subsets; and selecting the maximum loss corresponding to the random subsets as the logic-based loss.
11. The system of claim 8, wherein the logic-based loss is inversely proportional to a satisfaction function, which measures how closely the output prediction for each of the at least two objects and the output prediction for the relationship between the at least two objects satisfy the plurality of stored integrity constraints.
12. The system of claim 8, wherein the logic-based loss is calculated using probabilistic logic.
13. The system of claim 8, wherein the logic-based loss is calculated using fuzzy logic.
14. The system of claim 8, wherein the plurality of stored integrity constraints comprises positive integrity constraints expressing permissible relationships between objects, and negative integrity constraints expressing impermissible relationships between objects.
15. A non-transitory machine-readable medium comprising instructions that, when executed by at least one processor of an electronic device, cause the at least one processor to train a machine learning model comprising a plurality of neural network layers for scene graph generation, SGG, by: inputting a training image into the machine learning model, the training image depicting a scene comprising at least two objects; outputting, from the machine learning model, a predicted label for each of the at least two objects and a predicted label for a relationship between the at least two objects; calculating a loss; and updating, based on the calculated loss, weights for computation by the plurality of neural network layers of the machine learning model, wherein calculating the loss comprises: calculating a supervised loss by comparing the predicted label for each of the at least two objects and the predicted label for the relationship between the at least two objects to actual labels corresponding to the training image; and calculating a logic-based loss by comparing the predicted label for each of the at least two objects and the predicted label for the relationship between the at least two objects to a plurality of stored integrity constraints, wherein the integrity constraints comprise common-sense knowledge.
16. The non-transitory machine-readable medium of claim 15, wherein the instructions when executed cause the at least one processor to train the machine learning model by: selecting a maximally non-satisfied subset of the plurality of stored integrity constraints, and calculating the logic-based loss based on the maximally non-satisfied subset.
17. The non-transitory machine-readable medium of claim 16, wherein the instructions when executed cause the at least one processor to train the machine learning model by: obtaining output predictions of the machine learning model with the highest confidences; ordering the obtained output predictions by decreasing confidence; removing any prediction of the obtained output predictions for which there is no corresponding negative integrity constraint in the plurality of stored integrity constraints; selecting random subsets of the obtained output predictions; calculating a loss for each of the random subsets; and selecting the maximum loss corresponding to the random subsets as the logic-based loss.
18. The non-transitory machine-readable medium of claim 15, wherein the logic-based loss is inversely proportional to a satisfaction function, which measures how closely the output prediction for each of the at least two objects and the output prediction for the relationship between the at least two objects satisfy the plurality of stored integrity constraints.
19. The non-transitory machine-readable medium of claim 15, wherein the logic-based loss is calculated using probabilistic logic.
20. The non-transitory machine-readable medium of claim 15, wherein the logic-based loss is calculated using fuzzy logic.
Description
BRIEF DESCRIPTION OF DRAWINGS
[0035] Implementations of some embodiments of the disclosure will now be described, by way of example only, with reference to the accompanying drawings.
MODE OF DISCLOSURE
[0053] Broadly speaking, embodiments of the disclosure generally relate to computer-implemented methods and systems for scene graph generation, and in particular to training a machine learning, ML, model to generate a scene graph. The method includes inputting a training image into a machine learning model, and outputting a predicted label for each of at least two objects in the training image and a predicted label for a relationship between the at least two objects. The training method includes calculating a loss, which takes into account both a supervised loss calculated by comparing the predicted labels to the actual labels for the training image, and a logic-based loss calculated by comparing the predicted labels to stored integrity constraints comprising common-sense knowledge. Advantageously, this means that the performance of the model is improved without increasing processing at inference time.
[0054]
[0055] The system 100 receives an input image 10. The input image 10 includes at least two objects. In the example shown, the image 10 includes objects in the form of a man 11 and a horse 12. It will be appreciated that this is merely an example and that input images 10 to the system 100 may include a wide variety of objects, and more than two objects.
[0056] The system 100 includes a machine learning model 110, which will be discussed in more detail below. The machine learning model 110 includes a plurality of neural network layers.
[0057] The plurality of neural network layers includes an output layer 120. The output layer 120 comprises a plurality of neurons, which provide the output 20 of the machine learning model 110.
[0058] The output layer 120 comprises a first set of neurons 121 that provide an output 21 identifying a first object in the image 10, for example using a bounding box, and providing a label for the identified first object. In the example of
[0059] The output layer 120 also comprises a second set of neurons 122 that provide an output 22 identifying a second object in the image 10, for example using a bounding box and providing a label for the identified second object. In the example of
[0060] The output layer 120 further comprises a third set of neurons 123, which provide an output 25 identifying a relationship between the first object 11 and the second object 12. In the example shown, the identified relationship is “riding”. The relationship is a semantic relationship, and may be directional. In other words, “a man riding a horse” and “a horse riding a man” represent two different relationships. As such, it is important which object is identified as the first object 11 and which object is identified as the second object 12.
[0061] Throughout this disclosure, the first object 11 may also be referred to as the “subject”, the second object 12 may also be referred to as the “object” and the relationship may also be referred to as the “predicate” or “relation”.
[0062] Each neuron in a set of neurons 121, 122, 123 may provide an output value in the range 0 to 1 that represents the probability that the object represented by that neuron is the correct output. For example, neurons in set 121 may respectively output the probability that the first object 11 is a man, a horse, a cat, and so on. The sum of the probabilities in a set may be 1, so that probabilities in each set form a probability distribution. The output layer 120 may for example use the softmax activation function to provide the probability distribution. Accordingly, the system 100 may provide as output the label corresponding to the neuron with the highest value. The probabilities may also be referred to herein as “confidences” or “confidence scores”, in that they represent the confidence the model 110 has in the predicted label.
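By way of non-limiting illustration, the per-set probability distributions described above may be sketched as follows. This is a hypothetical Python sketch; the label vocabulary, raw scores, and function names are illustrative assumptions only and do not form part of the disclosure.

```python
import math

def softmax(logits):
    # Convert a set of raw neuron outputs into a probability
    # distribution whose values sum to 1.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical label vocabulary for the first set of neurons 121.
subject_labels = ["man", "horse", "cat"]
subject_logits = [2.0, 0.5, -1.0]  # illustrative raw outputs

probs = softmax(subject_logits)
# The system outputs the label corresponding to the neuron with the
# highest value (confidence).
predicted = subject_labels[probs.index(max(probs))]
```

Each of the sets 121, 122, 123 would be handled in the same way, yielding one confidence distribution per output.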
[0063] The system also includes at least one processor 101 and a memory 102 coupled to the processor 101. The at least one processor 101 may comprise one or more of: a microprocessor, a microcontroller, and an integrated circuit. The memory 102 may comprise volatile memory, such as random access memory (RAM), for use as temporary memory, and/or non-volatile memory such as Flash, read only memory (ROM), or electrically erasable programmable ROM (EEPROM), for storing data, programs, or instructions, for example.
[0064] The system 100 may include or be any one of: a smartphone, tablet, laptop, computer or computing device, virtual assistant device, a vehicle, a drone, an autonomous vehicle, a robot or robotic device, a robotic assistant, image capture system or device, an augmented reality system or device, a virtual reality system or device, a gaming system, an Internet of Things device, a smart consumer device, a smartwatch, a fitness tracker, and a wearable device. It will be understood that this is a non-exhaustive and non-limiting list of example systems.
[0065]
[0066] The object detector 130 receives the image 10 as input and detects the objects therein. Optionally, the object detector 130 may also output labels for the detected objects. It will be appreciated that object detection is a well-known computer vision task. Accordingly, any known object detection technique may be employed for the object detector 130. For example, the object detection technique may be the technique described in “Faster R-CNN: towards real-time object detection with region proposal networks”, Ren et al., NeurIPS 2015. In another example, the object detection technique may be the technique described in “You Only Look Once: Unified, Real-Time Object Detection”, Redmon et al., CVPR 2016. The disclosures of these documents are incorporated herein by reference in their entirety.
[0067] The SGG model 140 may receive the output of the object detector 130 and output subjects, objects and relations. The SGG model 140 may be based on any suitable SGG technique, further trained as discussed herein. Example SGG models suitable as a basis for the SGG model 140 include those described in the following publications, each of which is incorporated herein by reference in its entirety: “Scene graph generation by iterative message passing”, Xu et al., CVPR 2017; “Neural-motifs: Scene graph parsing with global context”, Zellers et al., CVPR 2018; “Learning to compose dynamic tree structures for visual contexts”, Tang et al., CVPR 2019; “Unbiased Scene Graph Generation from Biased Training”, Tang et al., CVPR 2020. The output layer 120 forms part of the SGG model 140.
[0070] The system 200 also includes at least one processor 201 and a memory 202 coupled to the processor 201. The at least one processor 201 may comprise one or more of: a microprocessor, a microcontroller, and an integrated circuit. The memory 202 may comprise volatile memory, such as random access memory (RAM), for use as temporary memory, and/or non-volatile memory such as Flash, read only memory (ROM), or electrically erasable programmable ROM (EEPROM), for storing data, programs, or instructions, for example. Although shown separately on
[0071] In common with widely known and well-established techniques for training machine learning models 110 including neural network layers, the system 200 receives labelled input data, represented by input image 210 and label 211, which is processed by the model 110. The error of the output prediction of model 110 is calculated using a loss function, in this case represented by loss computation module 240. The output of the loss function is then fed back to the model 110, for example using backpropagation, adjusting the weights thereof. This process iterates until a relevant termination condition is reached, such as an acceptable error value being reached or completion of a threshold number of iterations.
[0072] The system 200 differs from existing training techniques by the calculation of the loss function by the loss computation module 240. Particularly, the loss computation module 240 takes into account both supervised loss, which is calculated by comparing the output prediction of the model 110 to the input labels, and logic-based loss, which is calculated by evaluating the output of the model 110 against the knowledge in the background knowledge store 220. The computation of the loss by the loss computation module 240 will be discussed in detail below.
[0073] In more detail, the system 200 receives a plurality of labelled training images 210 as input, wherein the label for each training image 210 takes the form of a scene graph 211. In some examples, the scene graph 211 may instead be replaced by a list of relations, for example in the form (subject, object, predicate).
[0074] The background knowledge store 220 stores a plurality of rules, which encode common sense or background knowledge. A number of common sense knowledge bases exist, including ConceptNet (Robyn Speer, Joshua Chin, and Catherine Havasi, “ConceptNet 5.5: An open multilingual graph of general knowledge”, In AAAI, pages 4444-4451, 2017) and ATOMIC (Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A. Smith, and Yejin Choi, “ATOMIC: an atlas of machine commonsense for if-then reasoning”, In AAAI, pages 3027-3035, 2019). The rules in the background knowledge store 220 may comprise all or a subset of the rules from one or more knowledge bases.
[0075] Particularly, the rules in the background knowledge store 220 comprise integrity constraints (ICs), which express permissible and/or impermissible relationships between objects. A number of example ICs are shown in the table below, both in plain text form and in logical representation:
TABLE-US-00001
Text rule                            Logical representation
A man can wear a jacket              wear(man, jacket)
A man cannot eat a jacket            ¬eat(man, jacket)
A painting can be on a wall          on(painting, wall)
A painting cannot ride a wall        ¬ride(painting, wall)
An orange can hang from a tree       hang(orange, tree)
An orange cannot be made of a tree   ¬made of(orange, tree)
A horse cannot drink an eye          ¬drink(horse, eye)
[0076] In one example, the background knowledge store 220 may be populated by identifying subject-object pairs in background knowledge bases that have a limited number of possible relations. For example, the subject-object pair “person-fruit” may only have a small number of possible relationships that link the subject and object, such as “eat” or “hold”. These relationships form particularly useful ICs, on the basis that relationships beyond the limited number in the knowledge base are likely to be incorrect. In one example, the background knowledge store 220 may comprise approximately 500,000 rules.
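By way of non-limiting illustration, the background knowledge store 220 may be sketched as follows. The data structure, the function name, and the use of example rules from the table above are illustrative assumptions only.

```python
# Hypothetical store: ICs held as (predicate, subject, object) triples,
# split into positive (permissible) and negative (impermissible) rules.
positive_ics = {("wear", "man", "jacket"), ("on", "painting", "wall"),
                ("hang", "orange", "tree")}
negative_ics = {("eat", "man", "jacket"), ("ride", "painting", "wall"),
                ("made_of", "orange", "tree"), ("drink", "horse", "eye")}

def violates_negative_ic(predicate, subject, obj):
    # A prediction predicate(subject, object) violates the store if a
    # corresponding negative IC ¬predicate(subject, object) is present.
    return (predicate, subject, obj) in negative_ics
```

In practice the store may hold on the order of 500,000 such rules, as noted above.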
[0077] As noted above, the loss computation module 240 computes the loss based on both supervised loss and logic-based loss. Accordingly, the loss function may take the following form:
L(θ) = β₁ Σ_{i=1}^{m} L^n(y^i, w_θ^i) + β₂ Σ_{i=1}^{m} L^s(T, w_θ^i)  (1)

[0078] In equation (1) above, the term Σ_{i=1}^{m} L^n(y^i, w_θ^i) represents the supervised loss and the term Σ_{i=1}^{m} L^s(T, w_θ^i) represents the logic-based loss.
[0079] β₁ and β₂ are hyperparameters that control the importance of each of the two components of the loss. The hyperparameters may be computed automatically, for example as set out in Kendall, Alex, Yarin Gal, and Roberto Cipolla, “Multi-task learning using uncertainty to weigh losses for scene geometry and semantics”, Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, the contents of which are incorporated herein by reference. In other examples, the relative weighting of the supervised loss and logic-based loss may be predetermined. For example, the losses may be weighted equally.
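By way of non-limiting illustration, the weighted combination of the two loss components over a batch of m training images may be sketched as follows, taking equal predetermined weights as one of the options discussed above; the function name and default values are hypothetical.

```python
def total_loss(supervised_losses, logic_losses, beta1=0.5, beta2=0.5):
    # Weighted sum of the two loss components over a batch,
    # mirroring equation (1): beta1 * sum(L^n) + beta2 * sum(L^s).
    # Equal weighting (beta1 == beta2) corresponds to the case where the
    # losses are weighted equally.
    return beta1 * sum(supervised_losses) + beta2 * sum(logic_losses)
```

The resulting scalar would then be fed back to the model, for example via backpropagation, as described above.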
[0080] In equation (1) above, i indexes an input image, w_θ^i are the predictions of the model 110 for image i, y^i are the labels of the input image, and T represents the ICs in the background knowledge store 220.
[0081] L^n may comprise any suitable supervised loss function for neural networks. For example, it may comprise cross-entropy loss, mean squared error, mean absolute error, exponential loss, or any other function that can be applied to determine the loss.
[0082] In terms of L^s, the logic-based loss is inversely proportional to a function SAT: (φ, w_θ) → ℝ⁺, where φ is an IC drawn from T. For example, if φ = ¬drink(horse, eye), SAT will decrease as drink(horse, eye) becomes closer to true. Therefore, the logic-based loss penalises predictions of the neural network 110 that violate the ICs in the background knowledge store 220.
[0083] In equation (1) above, the term L^s(T, w_θ^i) acts as a shorthand for the following:

L^s(T, w_θ^i) = L^s(⋀_{φ∈T} φ, w_θ^i)

[0084] In other words, to obtain the logic-based loss in respect of the entirety of T, the conjunction is taken of the rules in T and the loss is calculated for the conjoined formula.
[0085] It is envisaged that L^s may be calculated in any of a number of ways. In one example, L^s complies with the following properties:

[0086] L^s(φ, w_θ) is differentiable almost everywhere;

L^s(φ, w_θ) = 0 if w_θ ⊨ φ;

L^s(φ, w_θ) ≤ L^s(φ′, w_θ) if SAT(φ, w_θ) ≥ SAT(φ′, w_θ); and

L^s(φ, w_θ) = L^s(φ′, w_θ) if φ is logically equivalent to φ′.
[0087] Two example techniques for calculating the logic-based loss will now be discussed.
[0088] In a first example technique, fuzzy logic is employed to calculate the logic-based loss. In one example, the fuzzy loss DL2 is employed. DL2 is discussed in Marc Fischer, Mislav Balunovic, Dana Drachsler-Cohen, Timon Gehr, Ce Zhang, and Martin T. Vechev, “DL2: training and querying neural networks with logic”, In ICML, volume 97, pages 1931-1941, 2019, the contents of which are incorporated herein by reference.
[0089] DL2 includes the following definitions:

L^s(t, w) := 1 − w(t)

L^s(¬t, w) := w(t)

L^s(φ₁ ∧ φ₂, w) := L^s(φ₁, w) + L^s(φ₂, w)

L^s(φ₁ ∨ φ₂, w) := L^s(φ₁, w) · L^s(φ₂, w)

L^s(¬φ, w) := L^s(ψ, w)
[0090] In the above, t is a variable, φ, φ₁ and φ₂ are formulas over variables, ∧, ∨ and ¬ are the Boolean connectives, and ψ is the formula that results after applying De Morgan's rules to ¬φ until each negation operator is applied at the level of single variables.
[0091] Accordingly, the example negative rule ¬drink(horse, eye) becomes w(drink)·w(horse)·w(eye). An example positive rule wear(man, jacket) becomes 1 − (w(wear)·w(man)·w(jacket)). As noted above, w is the output of the machine learning model 110 in respect of the given label. Accordingly, w(wear) = 0.7 would represent an example in which the machine learning model predicts a 0.7 probability of the relationship being “wear”.
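By way of non-limiting illustration, the translation of the example rules in paragraph [0091] may be sketched as follows; the function names and confidence values are hypothetical.

```python
def negative_rule_loss(w, parts):
    # Loss for a negative rule such as ¬drink(horse, eye): the product
    # of the model's confidences in the individual labels,
    # i.e. w(drink) * w(horse) * w(eye), per paragraph [0091].
    loss = 1.0
    for p in parts:
        loss *= w[p]
    return loss

def positive_rule_loss(w, parts):
    # Loss for a positive rule such as wear(man, jacket):
    # 1 - (w(wear) * w(man) * w(jacket)), per paragraph [0091].
    return 1.0 - negative_rule_loss(w, parts)

# Hypothetical model confidences for illustration only.
w = {"drink": 0.2, "horse": 0.9, "eye": 0.5,
     "wear": 0.7, "man": 0.8, "jacket": 0.9}
```

A confident violation of a negative rule thus yields a loss close to 1, while a confidently satisfied positive rule yields a loss close to 0.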
[0092] In a second example technique, probabilistic logic is employed to calculate the logic-based loss. In contrast to fuzzy logic, in probabilistic logic all interpretations of a formula are reasoned over. Particularly, each variable in a formula can be assigned a value corresponding to true (i.e. the value obtained from the machine learning model 110) or false (i.e. 1−the value obtained from the machine learning model 110). An interpretation of the formula is one permutation of the assignments of values to the variables in the formula. Accordingly, in probabilistic logic, each possible permutation of values for the variables is considered.
[0093] In one example, Weighted Model Counting is used to reason over the possible interpretations. Weighted Model Counting is discussed in Chavira, Mark and Adnan Darwiche. “On probabilistic inference by weighted model counting.” Artif. Intell. 172 (2008): 772-799, the contents of which are incorporated herein by reference.
[0094] More formally, an interpretation I of a propositional formula φ is an assignment of its variables to true (⊤) or false (⊥). ℐ(φ) is the set of all interpretations of formula φ and I(t) is the value of variable t in an interpretation I. Interpretation I is a model of φ, denoted as I ⊨ φ, if φ evaluates to true under I. The weight W(I, w) of I under w is defined as follows: if I is not a model of φ then it is zero; otherwise it is given by:

W(I, w) = ∏_{t∈φ: I(t)=⊤} w(t) · ∏_{t∈φ: I(t)=⊥} (1 − w(t))

[0095] where t ∈ φ denotes that variable t occurs in formula φ. WMC is then calculated as the sum of the weights of all models of φ under w:

[0096] WMC(φ, w) = Σ_{I∈ℐ(φ): I⊨φ} W(I, w)
[0097] The WMC provides a measure of satisfiability of the IC. However, to define the loss function a measure is required that effectively defines the distance to satisfiability. Accordingly, in one example, the logic-based loss function is defined as follows:
L^s(φ, w) = −log WMC(φ, w)
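By way of non-limiting illustration, WMC by explicit enumeration of interpretations, and the loss L^s(φ, w) = −log WMC(φ, w) of paragraphs [0094] to [0097], may be sketched as follows. The representation of a formula as a Python predicate over truth assignments and the confidence values are illustrative assumptions; brute-force enumeration is exponential in the number of variables and is shown for clarity only.

```python
import itertools
import math

def wmc(variables, weights, formula):
    # Weighted Model Counting: sum, over every interpretation I that is
    # a model of the formula, the product of w(t) for variables assigned
    # true and (1 - w(t)) for variables assigned false.
    total = 0.0
    for values in itertools.product([True, False], repeat=len(variables)):
        interp = dict(zip(variables, values))
        if formula(interp):  # I is a model of the formula
            weight = 1.0
            for t in variables:
                weight *= weights[t] if interp[t] else (1.0 - weights[t])
            total += weight
    return total

def logic_loss(variables, weights, formula):
    # L^s(phi, w) = -log WMC(phi, w): zero when the formula is certainly
    # satisfied, growing as satisfaction becomes less probable.
    return -math.log(wmc(variables, weights, formula))

# Example: the negative IC ¬drink(horse, eye), treating each label as a
# variable whose weight is the model's confidence (illustrative values).
variables = ["drink", "horse", "eye"]
weights = {"drink": 0.2, "horse": 0.9, "eye": 0.5}
neg_ic = lambda I: not (I["drink"] and I["horse"] and I["eye"])
```

For this example, WMC equals 1 − 0.2·0.9·0.5 = 0.91, so the loss is small, reflecting that the IC is probably satisfied.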
[0098] Returning now to
[0099] In one example, the knowledge selection unit 230 is configured to select a subset of the rules in the knowledge store 220 that are maximally non-satisfied. In other words, the knowledge selection unit 230 selects rules that are maximally violated by the current prediction of the model 110 in respect of the current training image, to use as feedback to the machine learning model 110.
[0100] This may improve the chances of providing meaningful feedback to the machine learning model 110, on the basis that predictions of the machine learning model 110 that do not satisfy ICs (i.e. have a low value for SAT) are highly likely to be incorrect and in need of amendment. On the other hand, predictions that satisfy an IC may not necessarily be correct, and instead it could be the case that the predictions relate to a permissible relationship between a subject and object, but not the one present in the training image.
[0101] In one example, the subset of rules may be selected by the following procedure.
[0102] Firstly, the N output predictions of the machine learning model 110 with the highest confidences are obtained. N is a hyperparameter of the training method. The list of N output predictions is ordered by decreasing confidence.
[0103] Next, any prediction predicate(subject, object) for which there is no corresponding negative IC ¬predicate(subject, object) in the background knowledge store 220 is removed from the list. This effectively removes from consideration any high-confidence predictions that do not violate a negative IC. Accordingly, at this stage the list contains only predictions that violate a negative IC.
[0104] Next, random subsets of the list are selected, each of size p. In one example, p is 10, though in other examples it may take different values, for example 2, 3, 5, or 7. In one example, at most 30 random subsets are selected from the list. In other examples, at most 20, or at most 10, random subsets are selected. It will be appreciated that the number may be varied.
[0105] The loss L^s is then calculated for each of the subsets. The maximum loss is then taken as the logic-based loss.
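By way of non-limiting illustration, the selection procedure of paragraphs [0102] to [0105] may be sketched as follows; the function signature, the fixed random seed, and the per-subset loss function used in any example are illustrative assumptions only.

```python
import random

def select_logic_loss(predictions, negative_ics, loss_fn,
                      n=100, p=10, num_subsets=30, rng=None):
    # predictions: list of ((predicate, subject, object), confidence) pairs.
    rng = rng or random.Random(0)  # seeded here for reproducibility
    # Steps 1-2: keep the N most confident predictions, ordered by
    # decreasing confidence.
    ranked = sorted(predictions, key=lambda x: x[1], reverse=True)[:n]
    # Step 3: keep only predictions that violate a negative IC.
    violating = [fact for fact, _ in ranked if fact in negative_ics]
    if not violating:
        return 0.0
    # Step 4: draw random subsets of (at most) size p.
    subsets = [rng.sample(violating, min(p, len(violating)))
               for _ in range(num_subsets)]
    # Steps 5-6: the logic-based loss is the maximum loss over the subsets.
    return max(loss_fn(s) for s in subsets)
```

This mirrors the intuition above: the rules most strongly violated by the current predictions provide the most informative feedback to the model.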
[0106]
[0107] The method comprises a step S61 of inputting the training image into a machine learning model, for example machine learning model 110 discussed herein. The training image depicts a scene comprising at least two objects.
[0108] The method further comprises a step S62 of outputting, from the machine learning model, a predicted label for each of the at least two objects and a predicted label for a relationship between the at least two objects.
[0109] Next, in step S63, a loss is calculated using a loss function. Calculating the loss comprises calculating a supervised loss by comparing the predicted label for each of the at least two objects and the predicted label for the relationship between the at least two objects to actual labels corresponding to the training image. Calculating the loss also comprises calculating a logic-based loss by comparing the predicted label for each of the at least two objects and the predicted label for the relationship between the at least two objects to a plurality of stored integrity constraints, wherein the integrity constraints comprise common-sense knowledge.
[0110] Subsequently, in step S64, weights of the machine learning model are updated based on the calculated loss.
[0111] The example method is iterated to train the machine learning model. The method may iterate until a termination condition is met. The example method of
[0112]
[0113] In a first step S71, the N output predictions of the machine learning model with the highest confidences are obtained. In a second step S72, the list of N output predictions is ordered by decreasing confidence. Next, in step S73, any prediction for which there is no corresponding negative IC in the background knowledge store is removed. Next, in step S74, random subsets of the list are selected. Next, in step S75, the loss is calculated for each of the subsets. Subsequently, in step S76, the maximum loss is selected as the logic-based loss.
[0114]
[0115] In the experiments represented in
[0116] In the experiments represented in
[0117] Results are presented for predicate classification and SGG classification. The predicate classification task is defined as follows: given an input image I, a pair of a subject and an object (s, o) and the coordinates of the bounding boxes that enclose s and o, predict the relationship r between s and o. The scene graph classification task conceals the pair (s, o) and asks the model to return the fact r(s, o).
[0118] The metrics shown in
[0119] Taking
[0126] The system 300 comprises a user interface 350. The user interface 350 may take the form of any suitable means of receiving user input, including but not limited to a touch screen display, a mouse and/or keyboard, and a microphone, which may be associated with suitable speech-to-text conversion software.
[0127] The memory 301 of the system 300 stores a plurality of images 301a. For example, the plurality of images 301a may be images captured via an image capture unit (not shown) of the system. Accordingly, the plurality of images 301a may be an image gallery, for example as commonly found on a smartphone or tablet.
[0128] The system 300 is configured to receive a user image query 305 via the user interface 350. The query may comprise at least a label of an object in an image to be retrieved and a label of a relationship involving the object. For example, a query may include a subject and a relationship, such as “show me an image of a horse eating”. In another example, the query may include a relationship and object, such as “show me images of things on a table”. The image query may also include all three of a subject, object and relation.
[0129] The system 300 is further configured to retrieve one or more images of the images 301a matching the query. Particularly, the system 300 may generate scene graphs corresponding to the images 301a. The system 300 may then return an image 301b that includes the queried object and relationship in its scene graph. The returned image 301b may be displayed via the user interface 350.
[0130] It will be appreciated that in some examples the scene graphs of the images are generated in response to the query, but in other examples the scene graphs are generated in advance (e.g. shortly after capture of each image) and stored in the memory 301.
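The retrieval described above may be sketched as matching a partially specified query against precomputed scene graphs. The triple representation, the wildcard convention for missing query fields and the function names are assumptions for this example:

```python
def matches_query(scene_graph, query):
    # scene_graph: list of (subject, relation, object) triples;
    # query: dict with optional "subject", "relation", "object" keys.
    # Missing query fields act as wildcards.
    for s, r, o in scene_graph:
        if query.get("subject") not in (None, s):
            continue
        if query.get("relation") not in (None, r):
            continue
        if query.get("object") not in (None, o):
            continue
        return True
    return False

def retrieve(images_with_graphs, query):
    # Return stored images whose scene graph satisfies the query.
    return [img for img, g in images_with_graphs if matches_query(g, query)]
```

A query such as "show me an image of a horse eating" thus maps to {"subject": "horse", "relation": "eating"}, leaving the object as a wildcard.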
[0131] In another example, the stored images 301a may be frames of a video, including for example television programmes or films. Accordingly, the system 300 may be used to search for scenes in a video in which objects are performing a particular action.
[0132]
[0133] The system 400 includes a user interface 450 and an image capture unit 460. The user interface 450 may take the form of any suitable means of receiving user input, including but not limited to a touch screen display, a mouse and/or keyboard, and a microphone, which may be associated with suitable speech-to-text conversion software.
[0134] The image capture unit 460 may comprise one or more cameras. The image capture unit 460 may comprise cameras integral with the other components of the system 400 (i.e. built into a smartphone or tablet), though in other examples the image capture unit 460 may comprise cameras disposed remote from the other components of the system 400.
[0135] The personal assistant system 400 is configured to capture images using the image capture unit 460 of a user performing an activity. In the example shown in
[0136] The personal assistant system 400 is further configured to generate a scene graph 406 based on the captured image 405 of the user performing the activity. For example, in
[0137] The personal assistant system 400 may be further configured to determine whether the user is performing the activity correctly. For example, the memory 402 may store a list of permitted actions associated with the activity. The system 400 may then determine whether the user is performing one of the permitted actions stored in the memory, based on the scene graph 406. In some examples, the permitted actions may be ordered—for example in the case of sequential steps in a recipe. Accordingly, the system 400 may be configured to determine whether the action performed by the user in the captured image 405 is being performed at the correct time or in the correct sequence.
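The permitted-actions check may be sketched as follows; representing actions as (subject, relation, object) triples and signalling via returned strings are assumptions made for this illustration:

```python
def check_activity(scene_graph, permitted_actions, expected_step=None):
    # permitted_actions: ordered list of (subject, relation, object)
    # triples, e.g. the sequential steps of a recipe.
    # expected_step: index of the action the user should currently
    # perform, or None if ordering is not enforced.
    performed = [t for t in scene_graph if t in permitted_actions]
    if not performed:
        return "alert: action not permitted"
    if expected_step is not None and permitted_actions[expected_step] not in performed:
        return "alert: action out of sequence"
    return "ok"
```

The first alert corresponds to an action outside the stored list of permitted actions; the second corresponds to a permitted action performed at the wrong point in the sequence.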
[0138] The personal assistant system 400 may be configured to alert the user in the event that the activity is not being performed correctly, for example via user interface 450.
[0139] It will be appreciated that the personal assistant system 400 is not limited to being a kitchen assistant system. In other examples, the personal assistant system 400 may guide the user in other tasks, such as exercise routines, repair or maintenance tasks, and the like.
[0140]
[0141] Advantageously, embodiments of the disclosure provide a method of augmenting the training of a machine learning model for SGG using common-sense knowledge. Embodiments of the disclosure advantageously operate at training time, and thus do not increase the computation required at inference time. Embodiments of the disclosure are furthermore applicable to any existing SGG model, which may be further trained according to the disclosed techniques. In addition, embodiments of the disclosure can efficiently leverage the knowledge stored in very large knowledge stores, by selecting integrity constraints from the store that are maximally non-satisfied. The SGG models may be employed in applications including image search and smart assistants.
[0142] Those skilled in the art will appreciate that while the foregoing has described what is considered to be the best mode and, where appropriate, other modes of performing embodiments of the disclosure, the disclosure should not be limited to the specific configurations and methods disclosed in this description of the preferred embodiments. Those skilled in the art will recognise that embodiments of the disclosure have a broad range of applications, and that the embodiments may be subject to a wide range of modifications without departing from any inventive concept as defined in the appended claims.