Visual relationship detection method and system based on adaptive clustering learning
11361186 · 2022-06-14
Cpc classification
G06F18/214
PHYSICS
Y02D10/00
GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
G06F18/213
PHYSICS
Abstract
The present disclosure discloses a visual relationship detection method based on adaptive clustering learning, including: detecting visual objects from an input image and recognizing the visual objects to obtain context representations; embedding the context representations of pair-wise visual objects into a low-dimensional joint subspace to obtain visual relationship sharing representations; embedding the context representations into a plurality of low-dimensional clustering subspaces, respectively, to obtain a plurality of preliminary visual relationship enhancing representations, and then regularizing them by a clustering-driven attention mechanism; and fusing the visual relationship sharing representations and the regularized visual relationship enhancing representations with a prior distribution over the category labels of visual relationship predicates, to predict visual relationship predicates by synthetic relational reasoning. By mining the latent relatedness between visual relationships, the method can recognize visual relationships of different subclasses at a fine-grained level, which improves the accuracy of visual relationship detection.
Claims
1. A visual relationship detection method based on adaptive clustering learning, comprising, executed by a processor, the following steps: detecting visual objects from an input image and recognizing the visual objects by a contextual message passing mechanism to obtain context representations of the visual objects; embedding the context representations of pair-wise visual objects into a low-dimensional joint subspace to obtain visual relationship sharing representations; embedding the context representations of pair-wise visual objects into a plurality of low-dimensional clustering subspaces, respectively, to obtain a plurality of preliminary visual relationship enhancing representations; and then performing regularization to the preliminary visual relationship enhancing representations by clustering-driven attention mechanisms; and fusing the visual relationship sharing representations, the regularized visual relationship enhancing representations and a prior distribution over category labels of visual relationship predicates, to predict visual relationship predicates by synthetic relational reasoning.
2. The visual relationship detection method based on adaptive clustering learning according to claim 1, wherein the method further comprises: calculating empirical distribution of the visual relationships from training set samples of a visual relationship data set to obtain a visual relationship prior function.
3. The visual relationship detection method based on adaptive clustering learning according to claim 1, wherein the method further comprises: constructing an initialized visual relationship detection model, and training the model by the training data of the visual relationship data set.
4. The visual relationship detection method based on adaptive clustering learning according to claim 1, wherein the step of obtaining the visual relationship sharing representations is specifically: obtaining a first product of a joint subject mapping matrix and the context representations of the visual object of the subject, obtaining a second product of a joint object mapping matrix and the context representations of the visual object of the object; subtracting the second product from the first product, and dot-multiplying the difference value and convolutional features of a visual relationship candidate region; wherein, the joint subject mapping matrix and the joint object mapping matrix are mapping matrices that map the visual objects context representations to a joint subspace; and the visual relationship candidate region is the minimum rectangle box that can fully cover the corresponding visual object candidate regions of the subject and object; the convolutional features are extracted from the visual relationship candidate region by a convolutional neural network.
5. The visual relationship detection method based on adaptive clustering learning according to claim 4, wherein the step of obtaining a plurality of preliminary visual relationship enhancing representation is specifically: obtaining a third product of a k.sup.th clustering subject mapping matrix and the context representation of the visual object of the subject, obtaining a fourth product of a k.sup.th clustering object mapping matrix and the context representation of the visual object of the object; subtracting the fourth product from the third product, and dot-multiplying the difference value and convolutional features of a visual relationship candidate region to obtain a k.sup.th preliminary visual relationship enhancing representation; wherein the k.sup.th clustering subject mapping matrix and the k.sup.th clustering object mapping matrix are mapping matrices that map the visual objects context representation to the k.sup.th clustering subspace.
6. The visual relationship detection method based on adaptive clustering learning according to claim 5, wherein the step of “performing regularization to the preliminary visual relationship enhancing representations of different subspaces by clustering-driven attention mechanisms” is specifically: obtaining attentive scores of the clustering subspaces; obtaining a sixth product of the k.sup.th preliminary visual relationship enhancing representations and the k.sup.th regularized mapping matrix, and performing weighted sum operation to the sixth products of different clustering subspaces by using the attentive scores of the clustering subspace as the clustering weight; wherein, the k.sup.th regularized mapping matrix is the k.sup.th mapping matrix that transforms the preliminary visual relationship enhancing representation.
7. The visual relationship detection method based on adaptive clustering learning according to claim 6, wherein the step of “obtaining attentive scores of the clustering subspaces” is specifically: inputting a predicted category label of visual object of subject and a predicted category label of visual object of object into the visual relationship prior function to obtain a prior distribution over the category label of visual relationship predicate; obtaining a fifth product of the prior distribution over the category label of visual relationship predicate and the k.sup.th attention mapping matrix, and substituting the fifth product into the softmax function for normalization; wherein, the k.sup.th attention mapping matrix is the mapping matrix that transforms the prior distribution over the category label of visual relationship predicate.
8. The visual relationship detection method based on adaptive clustering learning according to claim 6, wherein the step of “fusing the visual relationship sharing representations and the regularized visual relationship enhancing representations with a prior distribution over category labels of visual relationship predicates, to predict visual relationship predicates by synthetic relational reasoning” is specifically: inputting a predicted category label of visual object of subject and a predicted category label of visual object of object into the visual relationship prior function to obtain a prior distribution over the category label of visual relationship predicate; and obtaining a seventh product of the visual relationship sharing mapping matrix and the visual relationship sharing representations, obtaining an eighth product of the visual relationship enhancing mapping matrix and the regularized visual relationship enhancing representations; summing the seventh product, the eighth product and the prior distribution over the category label of visual relationship predicate, and then substituting the result into the softmax function.
9. A system for a visual relationship detection method based on adaptive clustering learning, the system comprising: a processor configured for: detecting visual objects from an input image and recognizing the visual objects by a contextual message passing mechanism to obtain context representations of the visual objects; embedding the context representations of pair-wise visual objects into a low-dimensional joint subspace to obtain visual relationship sharing representations; embedding the context representations of pair-wise visual objects into a plurality of low-dimensional clustering subspaces, respectively, to obtain a plurality of preliminary visual relationship enhancing representations; and then performing regularization to the preliminary visual relationship enhancing representations by clustering-driven attention mechanisms; and fusing the visual relationship sharing representations, the regularized visual relationship enhancing representations and a prior distribution over category labels of visual relationship predicates, to predict visual relationship predicates by synthetic relational reasoning.
10. The system according to claim 9, wherein the method further comprises: calculating empirical distribution of the visual relationships from training set samples of a visual relationship data set to obtain a visual relationship prior function.
11. The system according to claim 9, wherein the method further comprises: constructing an initialized visual relationship detection model, and training the model by the training data of the visual relationship data set.
12. The system according to claim 9, wherein the step of obtaining the visual relationship sharing representations is specifically: obtaining a first product of a joint subject mapping matrix and the context representations of the visual object of the subject, obtaining a second product of a joint object mapping matrix and the context representations of the visual object of the object; subtracting the second product from the first product, and dot-multiplying the difference value and convolutional features of a visual relationship candidate region; wherein, the joint subject mapping matrix and the joint object mapping matrix are mapping matrices that map the visual objects context representations to a joint subspace; and the visual relationship candidate region is the minimum rectangle box that can fully cover the corresponding visual object candidate regions of the subject and object; the convolutional features are extracted from the visual relationship candidate region by a convolutional neural network.
13. The system according to claim 12, wherein the step of obtaining a plurality of preliminary visual relationship enhancing representation is specifically: obtaining a third product of a k.sup.th clustering subject mapping matrix and the context representation of the visual object of the subject, obtaining a fourth product of a k.sup.th clustering object mapping matrix and the context representation of the visual object of the object; subtracting the fourth product from the third product, and dot-multiplying the difference value and convolutional features of a visual relationship candidate region to obtain a k.sup.th preliminary visual relationship enhancing representation; wherein the k.sup.th clustering subject mapping matrix and the k.sup.th clustering object mapping matrix are mapping matrices that map the visual objects context representation to the k.sup.th clustering subspace.
14. The system according to claim 13, wherein the step of “performing regularization to the preliminary visual relationship enhancing representations of different subspaces by clustering-driven attention mechanisms” is specifically: obtaining attentive scores of the clustering subspaces; obtaining a sixth product of the k.sup.th preliminary visual relationship enhancing representations and the k.sup.th regularized mapping matrix, and performing weighted sum operation to the sixth products of different clustering subspaces by using the attentive scores of the clustering subspace as the clustering weight; wherein, the k.sup.th regularized mapping matrix is the k.sup.th mapping matrix that transforms the preliminary visual relationship enhancing representation.
15. The system according to claim 14, wherein the step of “obtaining attentive scores of the clustering subspaces” is specifically: inputting a predicted category label of visual object of subject and a predicted category label of visual object of object into the visual relationship prior function to obtain a prior distribution over the category label of visual relationship predicate; obtaining a fifth product of the prior distribution over the category label of visual relationship predicate and the k.sup.th attention mapping matrix, and substituting the fifth product into the softmax function for normalization; wherein, the k.sup.th attention mapping matrix is the mapping matrix that transforms the prior distribution over the category label of visual relationship predicate.
16. The system according to claim 14, wherein the step of “fusing the visual relationship sharing representations and the regularized visual relationship enhancing representations with a prior distribution over the category labels of visual relationship predicates, to predict visual relationship predicates by synthetic relational reasoning” is specifically: inputting a predicted category label of visual object of subject and a predicted category label of visual object of object into the visual relationship prior function to obtain a prior distribution over the category label of visual relationship predicate; and obtaining a seventh product of the visual relationship sharing mapping matrix and the visual relationship sharing representations, obtaining an eighth product of the visual relationship enhancing mapping matrix and the regularized visual relationship enhancing representations; summing the seventh product, the eighth product and the prior distribution over the category label of visual relationship predicate, and then substituting the result into the softmax function.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The accompanying drawings illustrate one or more embodiments of the present disclosure and, together with the written description, serve to explain the principles of the invention. Wherever possible, the same reference numbers are used throughout the drawings to refer to the same or like elements of an embodiment.
DETAILED DESCRIPTION OF THE PRESENT DISCLOSURE
(5) The method provided by the present disclosure will be described below in detail by embodiments with reference to the accompanying drawings. The present disclosure may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure is thorough and complete, and will fully convey the scope of the invention to those skilled in the art. Like reference numerals refer to like elements throughout.
(6) In order to solve the above problems, a visual relationship detection method is needed that can fully, automatically, and accurately mine latent relatedness information between visual relationships. Studies have shown that highly relevant visual relationships exist in reality. Such relationships share specific visual patterns and characteristics, so fine-grained detection of multiple visual relationships can be built on the recognition of highly relevant visual relationships, which improves the recall rate of visual relationship detection (hereinafter referred to as VRD). The present disclosure proposes a VRD method based on adaptive clustering learning. Referring to the accompanying drawings, the method includes the following steps:
(7) 101: Calculating the empirical distribution of the visual relationships from training set samples of the visual relationship data set to obtain a visual relationship prior function.
(8) The visual relationship data set may be any data set containing images and corresponding visual relationship annotations, including but not limited to the Visual Genome data set. The training set samples of the visual relationship data set include training images and corresponding visual relationship true label data. The visual relationship true label data of each training image includes: a visual object true category label $\hat{o}_i$ of the subject, a visual object true category label $\hat{o}_j$ of the object, and a corresponding visual relationship predicate true category label $r_{i\to j}$. Given the true category label $\hat{o}_i$ of the subject and the true category label $\hat{o}_j$ of the object, the conditional empirical distribution $P(r_{i\to j}\mid\hat{o}_i,\hat{o}_j)$ of the visual relationship predicate true category label is calculated over all visual relationship true label data and is then stored as the visual relationship prior function $w(\hat{o}_i,\hat{o}_j)$.
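As an illustration of step 101, the following is a minimal sketch of how a conditional empirical distribution over predicate labels might be tabulated from annotated triples. The function name `build_prior`, the triple layout, the toy labels, and the uniform fallback for unseen subject-object pairs are assumptions made for this example; the disclosure does not specify them.

```python
from collections import Counter, defaultdict


def build_prior(triples, num_predicates):
    """Return a function w(subject_label, object_label) giving predicate frequencies.

    `triples` is an iterable of (subject_label, predicate_id, object_label).
    """
    counts = defaultdict(Counter)
    for subj, pred, obj in triples:
        counts[(subj, obj)][pred] += 1

    def w(subj, obj):
        pair_counts = counts.get((subj, obj))
        if not pair_counts:
            # Assumption: pairs never seen in training get a uniform distribution.
            return [1.0 / num_predicates] * num_predicates
        total = sum(pair_counts.values())
        return [pair_counts[p] / total for p in range(num_predicates)]

    return w


# Toy annotations (hypothetical): (subject label, predicate id, object label)
triples = [
    ("person", 0, "horse"),
    ("person", 0, "horse"),
    ("person", 1, "horse"),
    ("dog", 2, "sofa"),
]
w = build_prior(triples, num_predicates=3)
prior = w("person", "horse")   # relative frequencies of predicates 0, 1, 2
unseen = w("cat", "table")     # falls back to the uniform distribution
```

Storing raw counts and normalizing on lookup keeps the table compact; a real system might instead precompute a dense array indexed by label ids.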
(9) 102: Constructing an initialized visual relationship detection model, and training the model by the training data of the visual relationship data set.
(10) The visual relationship data set may be any data set containing images and corresponding visual relationship annotations, including but not limited to the Visual Genome data set. The training data of the visual relationship data set includes training images and the corresponding visual relationship true region data and true label data. The true region data of each training image includes: a visual object true region of the subject, a visual object true region of the object, and a corresponding visual relationship predicate true region. The true label data of each training image includes: a visual object true category label of the subject, a visual object true category label of the object, and a corresponding visual relationship predicate true category label.
(11) During the training of the initialized VRD model, the embodiment uses the model to predict, for each training image, a subject visual object prediction category label, an object visual object prediction category label, and a corresponding visual relationship predicate prediction category label. It then obtains the category training errors between the subject visual object prediction category label and the subject visual object true category label, between the object visual object prediction category label and the object visual object true category label, and between the visual relationship predicate prediction category label and the visual relationship predicate true category label. It further obtains the region training errors between the subject visual object prediction region and the subject visual object true region, between the object visual object prediction region and the object visual object true region, and between the visual relationship predicate prediction region and the visual relationship predicate true region.
(12) In the embodiment, gradient back-propagation is performed iteratively on the model according to the category training errors and the region training errors of each training image until the model converges, and the parameters of the trained VRD model are used in the subsequent steps.
(13) 103: Detecting visual objects from an input image and recognizing the visual objects by a contextual message passing mechanism to obtain context representations of the visual objects.
(14) Firstly, a candidate region set and a corresponding candidate region feature set are extracted from the input image.
(15) Any object detector can be used for the extraction operation, including but not limited to the Faster R-CNN object detector used in this embodiment. The candidate regions include visual object candidate regions and visual relationship candidate regions. A visual relationship candidate region is represented by the minimum rectangular box that fully covers the corresponding visual object candidate regions of the subject and the object, each of which may be any one of the plurality of visual object candidate regions. The candidate region features include: a visual object candidate region convolutional feature $f_i$, a visual object category label probability $l_i$, and a visual object candidate region bounding box coordinate $b_i$; the visual relationship candidate region feature includes a visual relationship candidate region convolutional feature $f_{i,j}$.
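The minimum rectangular box covering a subject region and an object region can be sketched as follows. The corner format `(x1, y1, x2, y2)` and the name `union_box` are assumptions for illustration; the disclosure does not state how bounding box coordinates are encoded.

```python
def union_box(box_a, box_b):
    """Smallest axis-aligned box (x1, y1, x2, y2) that covers both input boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    return (min(ax1, bx1), min(ay1, by1), max(ax2, bx2), max(ay2, by2))


# Hypothetical subject and object candidate regions in pixel coordinates.
subject_box = (10, 20, 50, 80)
object_box = (40, 60, 120, 100)
relation_box = union_box(subject_box, object_box)
```

The convolutional feature of the relationship candidate region would then be pooled from this box by the detector backbone, which is outside the scope of this sketch.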
(16) Secondly, contextual encoding is performed on the visual object candidate region features to obtain the visual object representations.
(17) The embodiment adopts a bi-directional long short-term memory network (biLSTM) to sequentially encode all the visual object candidate region features to obtain the object context representations $C$:
$$C=\mathrm{biLSTM}_1\big([f_i;\,W_1 l_i]_{i=1,\dots,N}\big) \tag{1}$$
(18) where the parameters of the biLSTM are obtained in step 102; $C=\{c_i\}_{i=1}^{N}$ is the set of hidden states of the long short-term memory network (LSTM), and $c_i$ corresponds to the $i$-th input visual object candidate region feature; $W_1$ denotes learned parameters obtained in step 102; $[\cdot\,;\cdot]$ denotes the concatenation operation; and $N$ is the number of input visual object candidate region features.
(19) Thirdly, the visual objects are recognized by using the visual object representations.
(20) The embodiment adopts an LSTM to predict the $i$-th visual object category label $\hat{o}_i$ from the visual object representation $c_i$ and the previously predicted $(i-1)$-th label $\hat{o}_{i-1}$:
$$h_i=\mathrm{LSTM}_1\big([c_i;\,\hat{o}_{i-1}]\big) \tag{2}$$
$$\hat{o}_i=\operatorname{argmax}(W_2 h_i) \tag{3}$$
(21) where the parameters of the LSTM are obtained in step 102, $h_i$ is the hidden state of the LSTM, and $W_2$ denotes learned parameters obtained in step 102.
(22) Finally, the visual object context representations are obtained from the visual object representations and the visual object label embeddings.
(23) Because visual object label embeddings are beneficial to visual relationship inference, this embodiment adopts another biLSTM to compute the visual object context representations from the previously predicted visual object category labels $\hat{o}_i$ and the visual object representations $c_i$:
$$D=\mathrm{biLSTM}_2\big([c_i;\,W_3\hat{o}_i]_{i=1,\dots,N}\big) \tag{4}$$
(24) where the parameters of the biLSTM are obtained in step 102; $D=\{d_i\}_{i=1}^{N}$ is the set of hidden states of the LSTM, and $d_i$ corresponds to the $i$-th input visual object representation; $W_3$ denotes learned parameters obtained in step 102.
(25) 104: Embedding the context representations of pair-wise visual objects into a low-dimensional joint subspace to obtain visual relationship sharing representations.
(26) Let the detected subject visual object context representation be denoted as $d_i$ and the object visual object context representation as $d_j$, where the subject and object context representations may be any two of the plurality of visual object context representations, and let $f_{i,j}$ be the convolutional features of the visual relationship candidate region corresponding to the subject visual object and the object visual object. The visual relationship sharing representation is then obtained as follows:
$$E_{i,j}^{s}=(W_{es}d_i-W_{eo}d_j)\circ f_{i,j} \tag{5}$$
(27) where $W_{es}$ and $W_{eo}$ are the joint subject mapping matrix and the joint object mapping matrix, which map the visual object context representations to the joint subspace and are obtained in step 102; "$\circ$" represents the element-wise multiplication operation; and $E_{i,j}^{s}$ is the resulting visual relationship sharing representation.
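Equation (5) can be illustrated with a small numeric sketch: project the subject and object context vectors, subtract, and multiply element-wise by the region feature. The helper names, the toy dimensions, and the assumption that $f_{i,j}$ already has the same dimensionality as the joint subspace are choices made for this example only.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def relation_embedding(w_subj, w_obj, d_subj, d_obj, f_rel):
    """Compute (W_subj d_subj - W_obj d_obj) multiplied element-wise by f_rel."""
    projected_subj = matvec(w_subj, d_subj)
    projected_obj = matvec(w_obj, d_obj)
    # Assumption: f_rel has the same length as the projected vectors.
    return [(s - o) * f for s, o, f in zip(projected_subj, projected_obj, f_rel)]


# Toy values (hypothetical): 3-dimensional context vectors, 2-dimensional subspace.
W_es = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0]]
W_eo = [[0.5, 0.5, 0.0], [1.0, 0.0, 1.0]]
d_i = [1.0, 2.0, 3.0]
d_j = [2.0, 0.0, 1.0]
f_ij = [0.5, 2.0]

E_s = relation_embedding(W_es, W_eo, d_i, d_j, f_ij)
```

Because the subject and object projections are subtracted rather than concatenated, swapping the roles of the two objects changes the sign of the difference term, so the representation is direction-sensitive.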
(28) 105: Embedding the context representations of pair-wise visual objects into a plurality of low-dimensional clustering subspaces, respectively, to obtain a plurality of preliminary visual relationship enhancing representations.
(29) Let the detected subject visual object context representation be denoted as $d_i$ and the object visual object context representation as $d_j$, where the subject and object context representations may be any two of the plurality of visual object context representations, and let $f_{i,j}$ be the convolutional features of the visual relationship candidate region corresponding to the subject visual object and the object visual object. The $k$-th preliminary visual relationship enhancing representation is then obtained as follows:
$$e_{i,j}^{k}=(W_{es}^{k}d_i-W_{eo}^{k}d_j)\circ f_{i,j},\quad k\in[1,K] \tag{6}$$
(30) where $W_{es}^{k}$ and $W_{eo}^{k}$ are the clustering subject mapping matrix and the clustering object mapping matrix that map the visual object context representations to the $k$-th clustering subspace, which are obtained in step 102; $e_{i,j}^{k}$ represents the obtained $k$-th preliminary visual relationship enhancing representation; and $K$ is the number of clustering subspaces.
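Equation (6) repeats the same projection once per clustering subspace, each with its own pair of mapping matrices. The sketch below assumes, for illustration only, that all K subspaces share one dimensionality equal to that of $f_{i,j}$; the function and variable names are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def clustering_embeddings(w_subj_list, w_obj_list, d_subj, d_obj, f_rel):
    """Return one preliminary enhancing representation per clustering subspace."""
    embeddings = []
    for w_subj, w_obj in zip(w_subj_list, w_obj_list):
        projected_subj = matvec(w_subj, d_subj)
        projected_obj = matvec(w_obj, d_obj)
        embeddings.append(
            [(s - o) * f for s, o, f in zip(projected_subj, projected_obj, f_rel)]
        )
    return embeddings


# Toy values (hypothetical): K = 2 subspaces, 2-dimensional vectors.
W_es_k = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]]
W_eo_k = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
d_i = [3.0, 1.0]
d_j = [1.0, 2.0]
f_ij = [1.0, 0.5]

e_k = clustering_embeddings(W_es_k, W_eo_k, d_i, d_j, f_ij)
```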
(31) 106: Performing regularization on the plurality of preliminary visual relationship enhancing representations in the different clustering subspaces by a clustering-driven attention mechanism.
(32) Let the $i$-th and the $j$-th visual object category labels be denoted as $\hat{o}_i$ and $\hat{o}_j$, respectively. The attentive scores of the clustering subspaces are obtained as follows:
$$\alpha_{i,j}^{k}=\mathrm{softmax}\big(W_{\alpha}^{k}\,w(\hat{o}_i,\hat{o}_j)\big),\quad j\in[1,n],\ k\in[1,K] \tag{7}$$
(33) where $W_{\alpha}^{k}$ is the $k$-th attention mapping matrix, which is obtained in step 102; $w(\cdot,\cdot)$ is the visual relationship prior function; $\alpha_{i,j}^{k}$ is the attentive score of the $k$-th clustering subspace; and $\mathrm{softmax}(\cdot)$ represents the following equation:
(34) $$\mathrm{softmax}(i_j)=\frac{\exp(i_j)}{\sum_{m=1}^{n}\exp(i_m)}$$
(35) where $i_j$ represents the $j$-th input variable of the softmax function, and $n$ represents the number of input variables of the softmax function.
(36) With $e_{i,j}^{k}$ being the obtained $k$-th preliminary visual relationship enhancing representation, the regularized visual relationship enhancing representation is calculated as follows:
(37) $$E_{i,j}^{p}=\sum_{k=1}^{K}\alpha_{i,j}^{k}\,W_{b}^{k}\,e_{i,j}^{k} \tag{8}$$
(38) where $W_{b}^{k}$ is the regularized mapping matrix that transforms the $k$-th preliminary visual relationship enhancing representation, which is obtained in step 102, and $E_{i,j}^{p}$ represents the regularized visual relationship enhancing representation.
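The attention step can be sketched as follows: the predicate prior is mapped to one score per clustering subspace, the scores are normalized with a softmax, and the transformed preliminary representations are combined by a weighted sum, as described for the regularization step in claim 6. This sketch makes an interpretive assumption that the disclosure does not spell out: each attention mapping is treated as a row vector yielding a single scalar per subspace, and the softmax is taken across the K subspaces. All names and values are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def regularize(prior, attn_vectors, reg_matrices, prelim):
    """Attention-weighted sum of transformed per-subspace representations."""
    # Assumption: one scalar logit per subspace, normalized across subspaces.
    logits = [sum(a * p for a, p in zip(attn, prior)) for attn in attn_vectors]
    scores = softmax(logits)
    fused = [0.0] * len(reg_matrices[0])
    for alpha, w_b, e_k in zip(scores, reg_matrices, prelim):
        transformed = matvec(w_b, e_k)
        fused = [acc + alpha * t for acc, t in zip(fused, transformed)]
    return scores, fused


# Toy values (hypothetical): 2 predicates in the prior, K = 2 subspaces.
prior = [0.5, 0.5]
W_alpha = [[2.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
W_b = [identity, identity]
e_k = [[2.0, -0.5], [4.0, 0.5]]

alpha, E_p = regularize(prior, W_alpha, W_b, e_k)
```

Because the scores depend only on the prior for the predicted subject and object labels, two object pairs with the same labels receive the same subspace weighting in this sketch, regardless of their visual features.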
(39) 107: Fusing the visual relationship sharing representations and the regularized visual relationship enhancing representations with a prior distribution over the category labels of visual relationship predicates, to predict visual relationship predicates by synthetic relational reasoning.
(40) With $E_{i,j}^{s}$ being the visual relationship sharing representation, $E_{i,j}^{p}$ the regularized visual relationship enhancing representation, and $w(\cdot,\cdot)$ the visual relationship prior function, the probability distribution $\Pr(d_{i\to j}\mid B,O)$ of the visual relationship predicate corresponding to the $i$-th and $j$-th visual objects is obtained as follows:
$$\Pr(d_{i\to j}\mid B,O)=\mathrm{softmax}\big(W_{r}^{s}E_{i,j}^{s}+W_{r}^{p}E_{i,j}^{p}+w(\hat{o}_i,\hat{o}_j)\big) \tag{9}$$
(41) where $W_{r}^{s}$ and $W_{r}^{p}$ are the learned visual relationship sharing mapping matrix and visual relationship enhancing mapping matrix, respectively, which are obtained in step 102; $w(\hat{o}_i,\hat{o}_j)$ represents the prior distribution over visual relationship predicate category labels when the subject visual object category label is $\hat{o}_i$ and the object visual object category label is $\hat{o}_j$.
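The fusion in equation (9) sums three per-predicate terms and normalizes them with a softmax. The following is a minimal sketch under the assumption that both mapping matrices project into a space whose dimension equals the number of predicate categories, so that the prior can be added directly; names and toy values are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def predict_predicate(w_share, e_share, w_enh, e_enh, prior):
    """Return the predicate distribution and the index of the most likely predicate."""
    logits = [
        s + p + w
        for s, p, w in zip(matvec(w_share, e_share), matvec(w_enh, e_enh), prior)
    ]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return probs, best


# Toy values (hypothetical): 3 predicate categories, 2-dimensional representations.
W_r_s = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
W_r_p = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
E_s = [1.5, 2.0]
E_p = [3.0, 0.0]
prior = [2.0 / 3.0, 1.0 / 3.0, 0.0]

probs, best = predict_predicate(W_r_s, E_s, W_r_p, E_p, prior)
```

In this toy case the second predicate receives the largest combined score, so it is returned as the predicted predicate.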
(42) The methods and systems of the present disclosure can be implemented on one or more computers or processors. The methods and systems disclosed can utilize one or more computers or processors to perform one or more functions in one or more locations. The processing of the disclosed methods and systems can also be performed by software components. The disclosed systems and methods can be described in the general context of computer-executable instructions such as program modules, being executed by one or more computers or devices. For example, each server or computer processor can include the program modules such as mathematical construction module, simplifying module, and maximum delay calculation module, and other related modules described in the above specification. These program modules or module related data can be stored on the mass storage device of the server and one or more client devices. Each of the operating modules can comprise elements of the programming and the data management software.
(43) The components of the server can comprise, but are not limited to, one or more processors or processing units, a system memory, a mass storage device, an operating system, an Input/Output Interface, a display device, a display interface, a network adaptor, and a system bus that couples various system components. The server and one or more power systems can be implemented over a wired or wireless network connection at physically separate locations, implementing a fully distributed system. By way of example, a server can be a personal computer, portable computer, smartphone, a network computer, a peer device, or other common network node, and so on. Logical connections between the server and one or more power systems can be made via a network, such as a local area network (LAN) and/or a general wide area network (WAN).
(44) Although the principle and implementations of the present disclosure have been described above by specific examples in the embodiments of the present disclosure, the foregoing description of the embodiments is merely for helping understanding the method of the present disclosure and the core concept thereof.
(45) Meanwhile, various alterations to the specific implementations and application ranges may come to a person of ordinary skill in the art according to the concept of the present disclosure. In conclusion, the contents of this specification shall not be regarded as limitations to the present disclosure.
(46) The foregoing description of the exemplary embodiments of the present disclosure has been presented only for the purposes of illustration and description and is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations are possible in light of the above teaching.
(47) The embodiments were chosen and described in order to explain the principles of the invention and their practical application so as to enable others skilled in the art to utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. Alternative embodiments will become apparent to those skilled in the art to which the present disclosure pertains without departing from its spirit and scope. Accordingly, the scope of the present disclosure is defined by the appended claims rather than the foregoing description and the exemplary embodiments described therein.