MULTI-MODAL IMAGE SEARCH

    20200104318 · 2020-04-02

    Abstract

    The present invention relates to methods for searching for two-dimensional or three-dimensional objects. More particularly, the present invention relates to searching for two-dimensional or three-dimensional objects in a collection by using a multi-modal query of image and/or tag data. Aspects and/or embodiments seek to provide a method of searching for digital objects using any combination of images, three-dimensional shapes and text by embedding the vector representations for these multiple modes in the same space. Aspects and/or embodiments can be easily extended to any other type of modality, making the method more general.

    Claims

    1. A method for combining image data and tag data into a unified representation, comprising the steps of: determining a vector representation for the image data in a vector space of words; determining a vector representation for the tag data in the vector space of words; and combining the vector representations by performing vector calculus.

    2. The method of claim 1, wherein semantically close words are mapped to spatially close vectors in the vector space of words.

    3. The method of claim 1, wherein the vector representations are determined by: determining a vector for each of the image data in the vector space of words; and embedding the image data and tag data in the vector space of words.

    4. The method of claim 1, wherein the step of combining the vector representations by performing vector calculus comprises determining linear combinations of the vector representations in the vector space of words.

    5. The method of claim 1, wherein the image data or tag data comprises one or more weights.

    6. The method of claim 1, further comprising performing the step of determining a vector representation for the image data using a neural network, optionally a convolutional neural network.

    7. The method of claim 6, further comprising generating the neural network by an image classifier followed by an image encoder operable to generate embeddings in the vector space of words.

    8. The method of claim 7, wherein the classifier is operable to be trained to identify image labels.

    9. The method of claim 7, further comprising converting the image classifier to an encoder operable to generate semantic-based descriptors.

    10. The method of claim 6, wherein the neural network comprises one or more fully-connected layers.

    11. The method of claim 10, wherein the one or more fully-connected layers are operable to return a vector of the same dimensionality of the image data.

    12. The method of claim 10, wherein one or more parameters of the one or more fully-connected layers are updated to minimize a total Euclidean loss.

    13. The method of claim 12, further comprising calculating the total Euclidean loss through consideration of the smallest Euclidean difference between two points.

    14. The method of claim 6, wherein the neural network is operable to minimize a softmax loss.

    15. The method of claim 1, further comprising: calculating, in advance of a query being received and/or processed, one or more descriptors; receiving a query regarding the image data and/or the tag data; and providing a unified representation in relation to the query.

    16. (canceled)

    17. The method of claim 15, wherein: the image data comprises one or more embedded images; the one or more descriptors are calculated in relation to each of the one or more embedded images; and a shape descriptor is calculated according to an average of the one or more descriptors in relation to each of the one or more embedded images, optionally wherein the average is biased according to one or more of the weights.

    18. (canceled)

    19. (canceled)

    20. A method of searching a collection of objects based on visual and semantic similarity of the unified representations of the collection of objects wherein the unified representations are combined using the method of any preceding claim, comprising the steps of: determining a unified descriptor for the search query, where the search query comprises both image or object data and word or tag data; determining one or more objects in the collection of objects having a spatially close vector representation to the unified descriptor for the search query.

    21. The method of claim 20, wherein the unified descriptor or vector representation of a shape is an average of one or more rendered views of the shape.

    22. A method for searching for an image or shape based on a query comprising tag and image data, comprising the steps of: creating a word space in which images, three dimensional objects, text and combinations of the same are embedded; determining vector representations for each of the images, three dimensional objects, text and combinations of the same; determining a vector representation for the query; determining which one or more of the images, three dimensional objects, text and combinations have a spatially close vector representation to the vector representation for the query.

    23. The method of claim 22, further comprising the step of replacing objects in an image with contextually similar objects.

    Description

    BRIEF DESCRIPTION OF DRAWINGS

    [0036] Embodiments will now be described, by way of example only and with reference to the accompanying drawings having like-reference numerals, in which:

    [0037] FIG. 1 illustrates an example of a multi-modal query comprising one image and one tag, and the corresponding 3D models retrieved;

    [0038] FIG. 2 illustrates an example of vector space showing semantically close words being mapped to spatially close vectors;

    [0039] FIG. 3 illustrates an example of vector space showing 3D shape descriptors mapped into the above word vector space, in particular showing semantically close shapes being mapped to spatially close vectors;

    [0040] FIG. 4 illustrates an example trained convolutional neural network to embed an image in a vector space of words;

    [0041] FIG. 5 illustrates an example of a multimodal query comprising an image and a tag, and the corresponding results retrieved comprising databases of 3D models, sketches and images;

    [0042] FIG. 6 illustrates an example of a multimodal query comprising a hand-drawn sketch and a tag, and the corresponding results retrieved comprising databases of 3D models, sketches and images; and

    [0043] FIG. 7 illustrates a further example of a multi-modal query comprising one image and one tag, and the corresponding 3D models retrieved.

    SPECIFIC DESCRIPTION

    [0044] In the publication by Tasse, Flora Ponjou, and Neil Dodgson titled "Shape2Vec: semantic-based descriptors for 3D shapes, sketches and images", published in ACM Transactions on Graphics (TOG) 35.6 (2016): 208, a method of shape description is presented that embeds a set of modalities in a fixed word vector space, thus generating semantic-based descriptors. Shape2Vec captures both visual and semantic features, and is applicable to modalities such as sketches, colour images, 3D shapes and RGB-D images. Results show that retrieval based on this representation outperforms previous mesh-based shape retrieval by 73%. However, the retrieval applications of Shape2Vec accepted only one modality at a time as a query. In embodiments, the method is extended to support multimodal queries, and in particular queries consisting of an image and text.

    [0045] Referring to FIGS. 1 to 6, an embodiment will now be described.

    [0046] First, a unified descriptor from a combination of image and text is constructed. This is done by first representing each of the image 10 and text descriptors 20a separately using Shape2Vec, followed by using vector calculus (which can comprise vector analysis and/or multivariable calculus and/or vector arithmetic) to combine the descriptors 20b.

    [0047] In this embodiment, W is a vector space of words 200 that can map text to a vectorial representation. It is assumed that such a vector space already exists, and remains fixed. An example of such a vector space approach is the word2vec approach and neural network presented in Mikolov, Tomas, et al., "Distributed representations of words and phrases and their compositionality", published in Advances in neural information processing systems 2013. Semantically close words are mapped to spatially close vectors, as illustrated in FIG. 2. A similar vector space 300 showing 3D shape descriptors mapped into the above word vector space, in particular showing semantically close shapes being mapped to spatially close vectors, is shown in FIG. 3.

    [0048] The visual-semantic descriptors are computed in two steps: (1) embedding images and shapes in W and then (2) embedding multi-modal input of the image (object) and text in W.

    [0049] For the first step of computing the visual-semantic descriptors, based on a training dataset, a neural network is trained to map an image (or shape) to a vector that lies in W. This is a similar approach to that taken by Shape2Vec described and referenced above. This process embeds images and shapes in the same word vector space, thus ensuring that all modalities share a common representation.

    [0050] For the second step of computing the visual-semantic descriptors, in this embodiment W is a vector space where addition and scalar multiplication are defined. Using the properties of a vector space, linear combinations of vectors in W also lie in W. Thus, taking a multimodal input, a descriptor for each modality is computed, and these descriptors are combined into one descriptor using vector arithmetic.

    [0051] Given the above steps, descriptors can be computed for each object in a collection in advance of queries being received or processed. At query time, a unified descriptor is computed for a multimodal input, and any object with a spatially close descriptor is a relevant result.

    [0052] For training, a dataset of images (and 3D models) is needed. Let I={(I.sub.i, y.sub.i)}.sub.i∈[0,N-1] be a set of N labelled images, and S={(S.sub.i, y.sub.i)}.sub.i∈[0,M-1] be a set of M labelled 3D shapes. An example of a set of images I is the ImageNet dataset and an example of S is ShapeNet [10].

    [0053] The next step is to embed images and shapes in W, specifically to embed an image I.sub.k in W. This is extendable to three-dimensional shapes by computing rendered views for a shape from multiple viewpoints and computing a descriptor for each view. The shape descriptor is then an average of its view descriptors.
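
    The averaging of view descriptors described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function name `shape_descriptor`, the optional `weights` argument (mirroring the optional weighted average mentioned in the claims) and the toy two-dimensional vectors are all hypothetical.

```python
from typing import List, Optional, Sequence


def shape_descriptor(
    view_descriptors: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Average per-view descriptors into one shape descriptor.

    Each entry of view_descriptors is assumed to be the embedding in W of
    one rendered view. weights is a hypothetical per-view weighting; when
    omitted, a plain (unweighted) average is returned.
    """
    if not view_descriptors:
        raise ValueError("at least one view descriptor is required")
    dim = len(view_descriptors[0])
    if any(len(v) != dim for v in view_descriptors):
        raise ValueError("all view descriptors must have the same dimension")
    if weights is None:
        weights = [1.0] * len(view_descriptors)
    if len(weights) != len(view_descriptors):
        raise ValueError("one weight per view descriptor is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weighted mean, computed independently for each coordinate.
    return [
        sum(w * v[d] for w, v in zip(weights, view_descriptors)) / total
        for d in range(dim)
    ]


# Toy example: two rendered views embedded in a 2-D stand-in for W.
views = [[0.0, 2.0], [2.0, 4.0]]
print(shape_descriptor(views))
print(shape_descriptor(views, weights=[3.0, 1.0]))
```

    Because W is a vector space, the averaged descriptor lies in W as well, so a shape can be compared directly with image and word vectors.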

    [0054] The approach of this embodiment thus trains a convolutional neural network to embed an image in W, using Shape2Vec. This is illustrated in FIG. 4.

    [0055] To obtain this convolutional network, an image classifier is created, followed by an image encoder operable to generate embeddings in the word vector space. A classifier may be trained to correctly identify image labels. This classifier may then be converted into an encoder and fine-tuned to generate semantic-based descriptors, as described below.

    [0056] The classifier can be implemented as a convolutional neural network, for example based on the AlexNet implementation set out in a publication by Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton titled "Imagenet classification with deep convolutional neural networks", published in Advances in neural information processing systems 2012. The AlexNet implementation is a multi-layer network consisting of one input layer, a combination of five convolutional and pooling layers and three fully-connected layers.

    [0057] The convolutional layers of the AlexNet implementation capture progressively more complex edge information, while the pooling layers apply subsampling. The fully-connected layers specify how the information captured from previous layers is combined to compute the probability that an image has a given label. These layers are controlled by about 60 million parameters that are initialized with AlexNet and optimised using the training dataset I. The AlexNet architecture can be replaced with any other architecture that achieves more accurate performance on classification tasks.

    [0058] The top half of FIG. 4 illustrates the classifier architecture 405. The optimization goal is to minimize the Softmax loss 410, which captures the performance of the classifier. The Softmax loss 410 for a training sample (I.sub.i, y.sub.i) is:

    [00001] $$L_i^S = -\log\left(\frac{e^{f_{y_i}}}{\sum_{j=0}^{K-1} e^{f_j}}\right)$$

    [0059] where y.sub.i is the correct label, K is the number of classes in the dataset and f is the Softmax function.

    [00002] $$f_j(z) = \frac{e^{z_j}}{\sum_{k=0}^{K-1} e^{z_k}}$$

    [0060] where z is the output of the final fully-connected (FC) layers 415 comprising the scores of each class given the image I.sub.i 400.

    [0061] The total Softmax loss L.sup.S is the mean of individual losses L.sub.i.sup.S over a batch of training input, plus regularisation terms such as L2 regularisation that encourages parameters to be small.
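
    The per-sample and total Softmax losses described above can be sketched in plain Python. This is an illustrative sketch only: the function names, the `l2_weight` coefficient and the flat `params` list are hypothetical. `softmax_loss` evaluates the expression in [00001] for whatever score vector it is given; the usage lines pass raw class scores z directly, which is the common cross-entropy formulation and is an assumption here.

```python
import math
from typing import List, Sequence


def softmax(z: Sequence[float]) -> List[float]:
    """f_j(z) = e^{z_j} / sum_k e^{z_k}, shifted by max(z) for stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_loss(f: Sequence[float], label: int) -> float:
    """-log(e^{f_label} / sum_j e^{f_j}), computed via log-sum-exp."""
    m = max(f)
    log_sum = m + math.log(sum(math.exp(v - m) for v in f))
    return log_sum - f[label]


def total_softmax_loss(
    batch_scores: Sequence[Sequence[float]],
    labels: Sequence[int],
    params: Sequence[float],
    l2_weight: float = 0.0,
) -> float:
    """Mean per-sample loss over a batch plus an L2 penalty on params.

    l2_weight is a hypothetical regularisation coefficient.
    """
    if len(batch_scores) != len(labels) or not labels:
        raise ValueError("need one label per non-empty batch entry")
    mean_loss = sum(
        softmax_loss(f, y) for f, y in zip(batch_scores, labels)
    ) / len(labels)
    return mean_loss + l2_weight * sum(p * p for p in params)


# Toy batch: two samples, K = 3 classes, made-up scores and parameters.
scores = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
print(softmax(scores[0]))
print(total_softmax_loss(scores, [0, 2], params=[0.3, -0.2], l2_weight=0.01))
```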

    [0062] After training the above neural network, a classifier has been developed that can identify image labels. This network can then be modified and its parameters fine-tuned to embed 425 images in W as described below.

    [0063] Next, an image encoder is trained to output vectors in W. To generate a vector that lies in W, given an image, the classifier is converted into an encoder that returns vectors similar to the vector representation of the image label.

    [0064] More specifically, the final fully-connected layer now returns a vector of the same dimensionality as W, and given the training sample (I.sub.i, y.sub.i), the loss is


    $$L_i^E = \lVert z_i - W(y_i) \rVert_2$$

    [0065] where W(y.sub.i) is the vector representation of the label y.sub.i.

    [0066] To preserve visual features captured by the classifier, only the parameters of the fully-connected layers are updated to minimize the total Euclidean loss 420, L.sup.E=Σ.sub.i L.sup.E.sub.i.
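
    The Euclidean loss used to train the encoder can be illustrated as follows. This is a sketch under stated assumptions: the total loss is taken to be the sum of the per-sample losses over a batch, the dictionary `word_vectors` is a hypothetical stand-in for the mapping W from a label to its word vector, and the two-dimensional vectors are toy values. The restriction of parameter updates to the fully-connected layers is not shown.

```python
import math
from typing import Dict, Sequence


def euclidean_loss(z: Sequence[float], target: Sequence[float]) -> float:
    """Per-sample loss: the L2 norm of z - W(y)."""
    if len(z) != len(target):
        raise ValueError("encoder output and word vector must match in size")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z, target)))


def total_euclidean_loss(
    outputs: Sequence[Sequence[float]],
    labels: Sequence[str],
    word_vectors: Dict[str, Sequence[float]],
) -> float:
    """Sum of per-sample losses over a batch (assumed aggregation)."""
    return sum(
        euclidean_loss(z, word_vectors[y]) for z, y in zip(outputs, labels)
    )


# Toy stand-in for W: two labels embedded in two dimensions.
word_vectors = {"chair": [0.0, 0.0], "table": [1.0, 1.0]}
outputs = [[3.0, 4.0], [1.0, 1.0]]  # hypothetical encoder outputs z_i
print(total_euclidean_loss(outputs, ["chair", "table"], word_vectors))
```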

    [0067] Training the modified network produces an encoder E that embeds images in W.

    [0068] Next, the multi-modal input of image/object and text is embedded in W. Given an image I.sub.input and a set of tags {t.sub.0, . . . , t.sub.n-1}, we use the vector arithmetic in the vector space W, to compute a unified descriptor as follows:

    [00003] $$E(I_{input}, t_0, \ldots, t_{n-1}) = w_v \, E(I_{input}) + w_s \sum_{j=0}^{n-1} W(t_j)$$

    [0069] where w.sub.v and w.sub.s represent the weights of the image (visual) and tags (semantic) respectively. These weights are used to specify the influence of each of the query items. Although by default they are set to w.sub.v=w.sub.s=0.5, they can be modified to:

    [0070] 1) accommodate user intention (for example, the tags may carry more value than the image), or

    [0071] 2) reflect, through automatically computed weights, how well the given queries are described in the vector space (for instance, if the tag "balafon" is poorly represented during training, its descriptor should carry less value).
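
    The weighted combination above can be sketched in a few lines. This is an illustrative sketch, not the implementation of the embodiment: the function name `unified_descriptor` is hypothetical, and the image and tag vectors are toy values standing in for E(I.sub.input) and W(t.sub.j).

```python
from typing import List, Sequence


def unified_descriptor(
    image_vec: Sequence[float],
    tag_vecs: Sequence[Sequence[float]],
    w_v: float = 0.5,
    w_s: float = 0.5,
) -> List[float]:
    """Return w_v * E(I_input) + w_s * sum_j W(t_j).

    image_vec stands for the encoder output E(I_input); each entry of
    tag_vecs stands for a word vector W(t_j). All must share one dimension.
    """
    dim = len(image_vec)
    if any(len(t) != dim for t in tag_vecs):
        raise ValueError("tag vectors must match the image vector dimension")
    # Sum the tag vectors coordinate by coordinate (zeros if no tags).
    tag_sum = [sum(t[d] for t in tag_vecs) for d in range(dim)]
    return [w_v * image_vec[d] + w_s * tag_sum[d] for d in range(dim)]


# Toy query: one image embedding and one tag embedding in two dimensions.
image_vec = [2.0, 0.0]
tag_vecs = [[0.0, 2.0]]
print(unified_descriptor(image_vec, tag_vecs))            # default weights
print(unified_descriptor(image_vec, tag_vecs, 0.2, 0.8))  # tags favoured
```

    Raising w.sub.s relative to w.sub.v moves the query descriptor towards the tag vectors, which corresponds to the case where the tags carry more value than the image.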

    [0072] This may be considered a significant step away from typical descriptor fusion methods such as concatenation, where this type of weighting can present a number of challenges, as well as from learning-based fusion methods, where weighting has been incorporated into the training process. The disclosure represents a simple and efficient weighting scheme enabled by the semantic information that vector arithmetic carries.

    [0073] To retrieve the top K objects in a collection based on the multimodal query, the method of this embodiment returns the K objects or images I.sub.k with the smallest Euclidean distance between E(I.sub.k) and E(I.sub.input, t.sub.0, . . . , t.sub.n-1).
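
    A brute-force version of this retrieval step can be sketched as follows, assuming the descriptors of the collection have been computed in advance and are held in memory. The function name `top_k`, the dictionary layout and the object names are hypothetical, and a real collection would likely need an indexed nearest-neighbour search rather than a linear scan.

```python
import heapq
import math
from typing import Dict, List, Sequence


def top_k(
    query: Sequence[float],
    collection: Dict[str, Sequence[float]],
    k: int,
) -> List[str]:
    """Return the names of the k objects closest to the query descriptor.

    collection maps an object name to its precomputed descriptor in W.
    Closeness is Euclidean distance; results are ordered nearest first.
    """
    return heapq.nsmallest(
        k, collection, key=lambda name: math.dist(collection[name], query)
    )


# Toy collection of precomputed descriptors (two dimensions for brevity).
collection = {
    "model_a": [0.0, 0.0],
    "model_b": [1.0, 1.0],
    "sketch_c": [5.0, 5.0],
}
query = [0.9, 0.9]  # hypothetical unified descriptor of a multimodal query
print(top_k(query, collection, k=2))
```

    `math.dist` requires Python 3.8 or later. Because every modality shares the same space W, the collection can mix descriptors of 3D models, sketches and images.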

    [0074] FIG. 5 illustrates an example of a multimodal query 500 comprising an image and a tag, and the corresponding results retrieved comprising databases of 3D models 505, sketches 510 and images 515.

    [0075] This method can be easily extended to a new modality such as depth images and sketches by training an encoder for mapping the modality to W. An example of a multi-modal query 600 comprising a sketch and tag is shown in FIG. 6. There is also shown the corresponding results retrieved comprising databases of 3D models in an upright orientation 605, databases of 3D models in an arbitrary orientation 610, sketches 615 and images 620.

    [0076] Any system feature as described herein may also be provided as a method feature, and vice versa. As used herein, means plus function features may be expressed alternatively in terms of their corresponding structure.

    [0077] Any feature in one aspect may be applied to other aspects, in any appropriate combination. In particular, method aspects may be applied to system aspects, and vice versa. Furthermore, any, some and/or all features in one aspect can be applied to any, some and/or all features in any other aspect, in any appropriate combination.

    [0078] It should also be appreciated that particular combinations of the various features described and defined in any aspects can be implemented and/or supplied and/or used independently.