Text Based Image Search
20220343626 · 2022-10-27
Assignee
Inventors
CPC classification
G06V10/76
PHYSICS
G06V10/454
PHYSICS
International classification
G06V10/44
PHYSICS
G06V10/75
PHYSICS
Abstract
A method and system for building a machine learning model for finding visual targets from text queries. The method comprises the steps of: receiving a set of training data comprising text attribute labelled images, wherein each image has more than one text attribute label; receiving a first vector space comprising a mapping of words, the mapping defining relationships between words; generating a visual feature vector space by grouping images of the set of training data having similar attribute labels; mapping each attribute label within the training data set on to the first vector space to form a second vector space; fusing the visual feature vector space and the second vector space to form a third vector space; and generating a similarity matching model from the third vector space.
Claims
1. A method for building a machine learning model for finding visual targets from text queries, the method comprising the steps of: receiving a set of training data comprising text attribute labelled images, wherein each image has more than one text attribute label; receiving a first vector space comprising a mapping of words, the mapping defining relationships between words; generating a visual feature vector space by grouping images of the set of training data having similar attribute labels; mapping each attribute label within the training data set on to the first vector space to form a second vector space; fusing the visual feature vector space and the second vector space to form a third vector space; and generating a similarity matching model from the third vector space.
2. The method of claim 1, wherein the similarity matching model is generated using a mean square error loss function.
3. The method of claim 2, wherein the mean square error loss function is:
4. The method according to claim 1, wherein the first vector space is based on a Wikipedia pre-trained word2vector model.
5. The method according to claim 1, wherein the textual terms within the first vector space include the words of the text labels of the images within the training data set.
6. The method according to claim 1, wherein generating the visual feature vector space by grouping images of the set of training data having similar attribute labels further comprises discriminative learning using a softmax Cross Entropy loss in a Deep Convolutional Neural Network, CNN, where each attribute label is treated as a separate classification task with loss ℒ_cls, according to
7. The method according to claim 1, wherein mapping each attribute label within the training data set on to the first vector space to form a second vector space further comprises embedding each attribute label, z_i^loc, i∈{1, . . . , N_att}.
8. The method according to claim 7 further comprising the step of obtaining a global textual embedding, z^glo, according to:
9. The method of claim 8 further comprising discriminative learning using a softmax Cross Entropy loss, where each attribute label is treated as a separate classification task with loss ℒ_cls, according to
10. The method according to claim 1, wherein generating the visual feature vector space by grouping images of the set of training data having similar attribute labels further comprises building local attribute-specific embeddings x_i^loc, i∈{1, . . . , N_att}, based on a global part x^glo in a ResNet-50 CNN architecture.
11. The method according to claim 1, wherein fusing the visual feature vector space and the second vector space to form the third vector space further comprises element-wise multiplication.
12. The method of claim 11, wherein the element-wise multiplication is a Hadamard Product in CNN learning optimisation.
13. The method of claim 12, wherein for each attribute label a separate lightweight branch with two fully connected, FC, layers of a deep CNN are used.
14. The method of claim 12, further comprising cross-modality global-level embedding s^glo according to:
s^glo = x^glo ∘ z^glo, wherein ∘ specifies the Hadamard Product.
15. The method according to claim 1, wherein fusing the visual feature vector space and the second vector space to form the third vector space further comprises forming per-attribute cross-modality embedding according to:
s_i^loc = x_i^loc ∘ z_i^loc, i∈{1, . . . , N_att}.
16. The method of claim 15, wherein fusing the visual feature vector space and the second vector space to form the third vector space is based on a quality aware fusion algorithm.
17. The method of claim 16 further comprising estimating a per-attribute quality, ρ_i^loc, using minimum prediction scores on image and text as:
ρ_i^loc = min(p_i^vis, p_i^tex), i∈{1, . . . , N_att}, where p_i^vis and p_i^tex denote the ground-truth class posterior probabilities estimated by the corresponding classifiers.
18. The method of claim 17 further comprising adaptively cross-attribute embedding according to:
s^loc = f({ρ_i^loc · s_i^loc, i∈{1, . . . , N_att}}).
19. The method of claim 18 further comprising forming a final cross-modality cross-level embedding according to:
s = f({s^loc, s^glo}), where the final embedding s is used to estimate an attribute matching result ŷ.
20. (canceled)
21. One or more non-transitory computer readable media storing computer readable instructions which, when executed by a processor of a wireless communication device, cause the device to perform: receiving a set of training data comprising text attribute labelled images, wherein each image has more than one text attribute label; receiving a first vector space comprising a mapping of words, the mapping defining relationships between words; generating a visual feature vector space by grouping images of the set of training data having similar attribute labels; mapping each attribute label within the training data set on to the first vector space to form a second vector space; fusing the visual feature vector space and the second vector space to form a third vector space; and generating a similarity matching model from the third vector space.
22. (canceled)
23. (canceled)
24. Use of the similarity matching model generated according to claim 1, to identify unlabelled images from a text query.
Description
BRIEF DESCRIPTION OF THE FIGURES
[0040] The present invention may be put into practice in a number of ways and embodiments will now be described by way of example only and with reference to the accompanying drawings.
[0057] It should be noted that the figures are illustrated for simplicity and are not necessarily drawn to scale. Like features are provided with the same reference numerals.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0066] Person search in large scale video datasets is a challenging problem with extensive applications in forensic video analysis and live video surveillance [6]. With increasing numbers of smart cities across the world, each equipped with tens to hundreds of thousands of 24/7 surveillance cameras, a massive quantity of raw video data is produced daily. It is infeasible for human operators to manually search for people (e.g. criminal suspects or missing persons) in such data, so automated person search becomes essential.
[0067] Most existing person search methods are based on image queries (probes), also known as person re-identification [6, 8, 16, 35, 36]. Given a query image, a system computes pairwise visual similarity scores between the query image and every gallery image in the test data, and the top ranks with the highest similarity scores are considered possible matches. Such an operation assumes that at least one image (one-shot) of the queried person is available for initiating the search; it is therefore limited when only a verbal or text description of the target person is available.
[0068] There have been a number of attempts at person search by text queries, e.g. natural language descriptions [15, 14] or discrete text attributes [33, 10, 27]. To learn such search systems, labelling a large training dataset across textual and visual data modalities is necessary. Elaborate language descriptions not only require more expensive training data labelling, but also present significant computational challenges. This is due to ambiguities in interpretation between language descriptions and image appearance: (1) there are significant and/or subtle visual variations for the same language description; (2) the sentence syntax of language descriptions of the same image is flexible; and (3) modelling the sequential word dependence in a sentence is a difficult problem, particularly for long descriptions.
[0069] In contrast, text attribute descriptions are not only much cheaper in collecting labelled training data, but also more tractable in model optimisation. Importantly, they eliminate the need for modelling complex sentence structures and their correlations to the same visual appearance, and vice versa. Whilst giving a compromise of weaker appearance descriptive capacity, using text attributes favourably enables a more robust and computationally tractable means to execute text-queries for person searches without requiring image probes.
[0070] An intuitive approach to text-based image search is to estimate an attribute vector (text description) of each person image, and then to match the attribute vector of the query person with those of all the gallery person images [10, 27]. By treating the attribute labels independently, this method scales flexibly to handle the huge attribute combination space. However, this technique lacks a supporting context that accounts for a holistic interpretation of all the text attributes as a whole, which helps text-image matching in person search. The current state-of-the-art model, AAIPR [33], takes the text-image matching strategy but loses the generalisation scalability of individual attribute modelling.
[0071] The present system solves the problem of text attribute query person search by zero-shot learning (ZSL) [31, 5]. ZSL does not require a probe image to provide results. Image data for potential test query categories (text attribute combinations) may exist at large scale, but only a small proportion of these categories can be available for model training due to the high cost of exhaustively acquiring training data per category. This results in a cross-category problem between model training and test, i.e. zero-shot samples for unseen categories during training. Therefore, the present system and method provide a cross-modal matching method based on global category-level visual-textual embedding with a common zero-shot learning approach. AAIPR [33] also uses the global embedding idea but totally ignores the zero-shot learning challenge in model design.
[0072] As a type of solution for attribute query person search, existing ZSL models are however suboptimal. Unlike the conventional ZSL settings that classify a test image into a small number of categories, the present system and method matches a text attribute description against a large number of person images with many more categories. This represents a larger scale and more challenging problem (i.e. a “zero-shot search” problem). Existing state-of-the-art ZSL methods may be based on global category-level visual-textual embedding but scale poorly [31]. One reason for this may be due to insufficient local attribute-level discrimination for more fine-grained matching. Furthermore, surveillance images in person search usually present significantly more noise and ambiguity, presenting a more difficult task. Additionally, lacking semantically meaningful person category names prevents the exploitation of inter-class relationships.
[0073] In the present system, an Attribute-Image Hierarchical Matching (AIHM) method is formulated. This performs attribute and image matching for person search at multiple hierarchical levels, including both global category-level visual-textual embedding and local attribute-level feature embedding. This method overcomes the limitations of conventional ZSL models and existing text-based person search methods, by benefiting from the generalisation scalability of conventional attribute classification methods. Importantly, cross-modal matching can be end-to-end optimised across different levels simultaneously.
[0074] At a high level: (I) An extended ZSL approach is formulated to solve a text attribute query person search problem. The present model solves the intrinsic challenge of limited training category data in surveillance videos. (II) The method (AIHM) is able to match more reliably sparse attribute descriptions with noisy surveillance person images at global category and local attribute levels concurrently. This goes beyond the common ZSL nearest neighbour search. (III) The system and method further introduce a quality-aware fusion scheme for resolving visual ambiguity problems. Extensive experiments show the superiority of the system AIHM over the state-of-the-art methods for attribute query person search on three benchmarks: Market-1501 [35], DukeMTMC [22, 18], and PA100K [19].
[0075] Related Work: Person Search. The most common existing person search approach is based on taking bounding box images as probes (queries), framed as an extension of the person re-identification problem [6, 16, 35, 11, 17]. However, image queries are not always available in practice. Recently, text query person search has gained increasing attention with search queries as natural language descriptions [15, 14] or short text keywords (text attributes) [33, 10, 27]. These models enable person search on images by verbal or written text descriptions. Using natural language sentences for person search is attractive due to its natural human user friendliness. However, this imposes extra challenges in computational modelling because (1) accurate and rich training data is expensive to obtain, and (2) modelling consistently and reliably rich and complex sentence syntax and its interpretation to arbitrary images is non-trivial, with added difficulties from poor-quality surveillance images. In contrast, short text attribute descriptions offer a more cost-effective and computationally more tractable approach to solving this problem.
[0076] Visual Attributes. Computing visual attributes has been extensively used for person search [12, 10, 11, 23, 21, 29]. The idea is to exploit the visual representation of a person by attributes as the mid-level descriptions, which are semantically meaningful and more reliable than low-level pixel feature representations. For example, Peng et al. [21] mine un-labelled latent visual attributes in a limited attribute label space for enriching the appearance representation. Considered as a more domain-invariant or domain adaptive visual feature representation, Wang et al. [29] exploit visual attribute learning for unsupervised identity knowledge transfer across surveillance domains. All these existing methods are focused on visual attribute representations to facilitate image query person search. On the contrary, the focus of this work is on text query person search.
[0077] Text Attributes: A few attempts for text attribute query person search have been proposed [27, 10, 33]. In particular, Vaquero et al. [27] and Layne et al. [10] propose the first studies that treat the problem as a multi-label classification learning task. Whilst flexibly modeling arbitrary attribute combinations, this strategy has no capacity for modelling the holistic person category information and is therefore suboptimal for processing ambiguous surveillance data. More recently, Yin et al. [33] exploit the idea of cross-modal data alignment. This captures the holistic appearance information of persons, but suffers from a cross-category domain gap problem between the training and test data. In contrast, the present system and method considers the problem from a zero-shot learning perspective. Critically, the present system and method not only addresses the limitation of existing solutions but also combines their modelling merits for enabling extra complementary benefits.
[0078] Zero-Shot Learning: Attribute query person search can be understood from zero-shot learning (ZSL) [9, 31, 25, 34], due to the need for generalising to unseen categories. However, there are several significant differences. First of all, most ZSL methods are designed for image classification rather than search/retrieval; the latter is often more challenging due to the larger search space. In contrast to the conventional ZSL setting, there are no meaningful category names in person search, which disables the exploitation of semantic relationships between seen and unseen categories. Furthermore, the imagery data of person search often involve more noise and corruption, which are more difficult to handle. These factors render the state-of-the-art ZSL methods less effective for person search, as demonstrated in the experiments described within the description.
[0079] To train a textual attribute query person search model, a labelled set of N image-attribute training pairs, D = {I_i, a_i}, i = 1, . . . , N, describing N_id different person descriptions is required. A multi-label attribute text description of a person image may be described as an attribute vector a_i and defines a value of each attribute label with respect to the corresponding person appearance. Persons sharing the same attribute vector description, specifying a type of people, are considered to belong to a person category. There are a total of N_att different binary-class or multi-class attribute labels. This problem may be modelled by zero-shot learning (ZSL), considering that test person categories may be unseen during model training.
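By way of illustration only, and not as part of the claimed method, the image-attribute training pairs described above may be represented as in the following sketch; the attribute names, values and the encode_attribute_vector helper are hypothetical examples.

import numpy as np

# Hypothetical attribute schema: N_att = 3 multi-class text attribute labels.
ATTRIBUTES = {
    "gender": ["male", "female"],
    "upper_colour": ["black", "white", "red", "blue", "purple"],
    "age": ["young", "teenage", "adult", "old"],
}

def encode_attribute_vector(description):
    # Encode a text description into the attribute vector a_i (one class index per attribute).
    return np.array([ATTRIBUTES[name].index(description[name]) for name in ATTRIBUTES])

# One image-attribute training pair (I_i, a_i): a placeholder image and its attribute vector.
image = np.zeros((256, 128, 3), dtype=np.uint8)
a_i = encode_attribute_vector({"gender": "female", "upper_colour": "red", "age": "young"})
print(a_i)  # e.g. [1 2 0]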
[0080] A schematic overview of the proposed AIHM model is illustrated in
The attribute-image matching is supervised by a mean square error matching loss: ℒ_mse = (1/N_batch) Σ_{i=1 . . . N_batch} (y_i − ŷ_i)² (equation 1), where y_i and ŷ_i denote the ground-truth and predicted similarity of the i-th training pair, respectively, and the mini-batch size is specified by N_batch. To enable such matching, a hierarchical visual-textual embedding is formed (see below) together with cross-modality fusion (see below) as the matching input (equation (7)). As a simplification, in the following a two-level hierarchy is assumed: a global category level and a local per-attribute level. It is straightforward to extend to more hierarchical levels without changing the model designs as described below.
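A minimal sketch of the mean square error matching loss of equation (1), assuming a PyTorch implementation; the tensors below are random placeholders rather than outputs of the described model.

import torch

def matching_loss(y_hat, y):
    # Mean square error matching loss (equation 1), averaged over the mini-batch.
    return torch.mean((y - y_hat) ** 2)

# Illustrative mini-batch of N_batch = 16 attribute-image pairs.
y = torch.randint(0, 2, (16,)).float()   # ground-truth similarity (1 = matched pair)
y_hat = torch.rand(16)                   # predicted similarity from the matching module
loss = matching_loss(y_hat, y)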
[0081] Hierarchical Visual Embedding. For hierarchical visual embedding of a person image, a multi-task joint learning strategy [2] is employed. An overview of hierarchical visual embedding is given in
[0082] For discriminative learning of local attribute-level visual embedding, the softmax Cross Entropy (CE) loss is utilised. Each individual attribute label is treated as a separate classification task with loss ℒ_cls. Formally, this is formulated as: ℒ_cls = −(1/N_batch) Σ_{i=1 . . . N_batch} log(p_ij) (equation 2), where p_ij is the probability estimate of the i-th training sample on its j-th ground-truth attribute class. By multi-task learning, the global category-level visual embedding can be obtained as the shared feature representation of all local embeddings.
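The multi-task visual embedding branches and per-attribute cross entropy loss (equation 2) may be sketched as follows, assuming PyTorch; the backbone features, branch sizes and attribute class counts are illustrative assumptions rather than the exact architecture of table 10.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalVisualEmbedding(nn.Module):
    # One lightweight branch per attribute on top of a shared backbone feature.
    def __init__(self, feat_dim=2048, emb_dim=512, classes_per_attr=(2, 5, 4)):
        super().__init__()
        self.branches = nn.ModuleList()
        self.classifiers = nn.ModuleList()
        for n_cls in classes_per_attr:
            self.branches.append(nn.Sequential(nn.Linear(feat_dim, 1024), nn.ReLU(),
                                               nn.Linear(1024, emb_dim)))
            self.classifiers.append(nn.Linear(emb_dim, n_cls))

    def forward(self, shared_feat):
        embs = [branch(shared_feat) for branch in self.branches]      # x_i^loc
        logits = [cls(e) for cls, e in zip(self.classifiers, embs)]
        return embs, logits

def attribute_ce_loss(logits, targets):
    # Sum of softmax cross entropy losses, one classification task per attribute label.
    return sum(F.cross_entropy(l, t) for l, t in zip(logits, targets.t()))

feat = torch.randn(16, 2048)   # stand-in for shared backbone (e.g. ResNet-50) features
targets = torch.stack([torch.randint(0, n, (16,)) for n in (2, 5, 4)], dim=1)
embs, logits = LocalVisualEmbedding()(feat)
loss = attribute_ce_loss(logits, targets)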
[0083] Hierarchical Textual Embedding. A hierarchical embedding of text attributes needs to be learnt. An overview of hierarchical textual embedding is shown in
[0084] To enable the benefit of rich Wikipedia information (other text sources can be used), the attribute labels are represented by word-to-vector (e.g. word2vector) representations. Specifically, a word2vector model is used to map each attribute name into a semantic (300-D) space, and then further into the local textual embedding space z^loc by one FC layer. Similarly, multi-task learning is adopted for embedding each attribute label (z_i^loc, i∈{1, . . . , N_att}). To obtain the global textual embedding z^glo, a simple approach is average pooling of the per-attribute embeddings. This may be suboptimal due to the lack of task-specific supervised learning. To overcome this problem, the per-attribute embeddings may instead be combined by a fusion unit f consisting of two 1×1 conv layers, which allows for both intra-attribute and inter-attribute fusion: z^glo = f({z_i^loc, i∈{1, . . . , N_att}}) (equation 3), where the fusion unit f applies the learnable parameters w_1 and w_2 (the two 1×1 conv layers) and Tanh is a non-linear activation function.
[0085] The CE loss function (Eq (2)) is used to supervise the textual embedding. In training, the embedding loss and the matching loss may be jointly optimised end-to-end with identical weights. Note that, unlike the visual embedding process, the global category-level textual embedding is obtained by combining all local attribute-level counterparts, i.e. an inverse process. This is due to additionally using auxiliary information (e.g. Wikipedia).
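One possible reading of the fusion unit of two 1×1 conv layers with Tanh activations (equation 3) is sketched below, assuming PyTorch; the layer sizes and the FusionUnit and to_local names are illustrative assumptions, and the 300-D word vectors are random placeholders for word2vector outputs.

import torch
import torch.nn as nn

class FusionUnit(nn.Module):
    # Combine a set of per-attribute embeddings into one embedding via two 1x1 conv layers.
    def __init__(self, n_inputs, emb_dim=512, out_dim=1024):
        super().__init__()
        self.conv1 = nn.Conv1d(emb_dim, out_dim, kernel_size=1)    # intra-attribute fusion (w_1)
        self.conv2 = nn.Conv1d(n_inputs, 1, kernel_size=1)         # inter-attribute fusion (w_2)

    def forward(self, embs):
        z = torch.stack(embs, dim=2)                    # (batch, emb_dim, N_att)
        z = torch.tanh(self.conv1(z))                   # (batch, out_dim, N_att)
        z = torch.tanh(self.conv2(z.transpose(1, 2)))   # (batch, 1, out_dim)
        return z.squeeze(1)                             # fused embedding, e.g. z^glo

n_att = 3
word_vecs = [torch.randn(16, 300) for _ in range(n_att)]    # word2vector stand-ins
to_local = nn.Linear(300, 512)
local_embs = [torch.tanh(to_local(w)) for w in word_vecs]   # z_i^loc
z_glo = FusionUnit(n_att)(local_embs)                       # z^glo, shape (16, 1024)

The design choice here is that the first 1×1 convolution mixes feature channels within each attribute, while the second mixes across attributes, matching the intra- and inter-attribute fusion described above.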
[0086] Negative Category Augmentation. The one-shot-per-category problem in the textual modality increases model training difficulty. To alleviate this problem, negative category augmentation is exploited for AIHM model learning. This may be achieved by generating new random attribute vectors and using these synthesised attribute vectors as negative samples in the matching loss (Eq (1)). This helps alleviate the model over-fitting risk whilst enriching the sparse training data, particularly for global textual embedding. Existing ZSL and person search methods do not use or leverage this strategy; one possible reason is that previous methods mostly do not exploit negative cross-modality pairs in the objective learning loss function. The efficacy of this scheme is demonstrated within the graphs of
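A minimal sketch of the negative category augmentation described above: random attribute vectors not present in the training set are synthesised for use as negative samples; the attribute class counts and helper name are hypothetical.

import numpy as np

def augment_negative_categories(real_vectors, classes_per_attr, n_new, seed=0):
    # Generate random attribute vectors unseen in training, for use as negative
    # samples in the matching loss (Eq (1)).
    rng = np.random.default_rng(seed)
    seen = {tuple(v) for v in real_vectors}
    negatives = []
    while len(negatives) < n_new:
        cand = tuple(int(rng.integers(0, n)) for n in classes_per_attr)
        if cand not in seen:
            negatives.append(cand)
            seen.add(cand)
    return np.array(negatives)

real = np.array([[0, 2, 1], [1, 4, 0]])    # attribute vectors seen in training
new_negatives = augment_negative_categories(real, classes_per_attr=(2, 5, 4), n_new=10)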
[0087] Cross-Modality Cross-Level Embedding. Given the hierarchical visual and textual embeddings described above, these are combined across modalities and levels to form the final embedding for attribute-image matching. An illustration of this cross-modality cross-level embedding is shown in
[0088] Cross-Modality Global-Level Embedding. The cross-modality global-level embedding s^glo may be defined as:
s^glo = x^glo ∘ z^glo (equation 4)
where ∘ specifies the Hadamard product.
[0089] Cross-Modality Local-Level Embedding. Unlike the single global-level embedding, multiple local per-attribute embeddings are required in both modalities. Therefore, per-attribute cross-modality embeddings may be formed as:
s_i^loc = x_i^loc ∘ z_i^loc, i∈{1, . . . , N_att} (equation 5)
[0090] Fusing over attributes then takes place. Instead of average pooling, a quality-aware fusion algorithm may be used. This is based on two considerations: (1) both surveillance imagery (poor quality with noisy and corrupted observations) and attribute labelling (annotation errors due to poor imaging conditions) are not highly reliable, so trusting all attributes and treating them equally in matching is prone to error; and (2) the significance of individual attributes for person search may vary.
[0091] Specifically, to estimate the per-attribute quality ρ_i^loc, minimum prediction scores may be used on image and text as ρ_i^loc = min(p_i^vis, p_i^tex), i∈{1, . . . , N_att}, where p_i^vis and p_i^tex denote the ground-truth class posterior probability estimated by the corresponding classifier. This discourages the model from fitting towards corrupted and noisy observations. Based on this quality measure, a fusion unit (Eq (3)) learns an adaptive cross-attribute embedding as:
s^loc = f({ρ_i^loc · s_i^loc, i∈{1, . . . , N_att}}) (equation 6)
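The quality-aware local fusion of equation (6) may be sketched as follows, assuming PyTorch; the posteriors p_i^vis and p_i^tex are random placeholders for the classifier outputs, and the learnable fusion unit f of Eq (3) is stubbed here by simple averaging for brevity.

import torch

def quality_aware_fusion(x_loc, z_loc, p_vis, p_tex, fusion):
    # s_i^loc = x_i^loc * z_i^loc (Hadamard product, eq 5), weighted by the quality
    # rho_i^loc = min(p_i^vis, p_i^tex), then fused across attributes (eq 6).
    s_loc = [x * z for x, z in zip(x_loc, z_loc)]
    rho = [torch.minimum(pv, pt) for pv, pt in zip(p_vis, p_tex)]
    weighted = [r.unsqueeze(1) * s for r, s in zip(rho, s_loc)]
    return fusion(weighted)

n_att, batch, dim = 3, 16, 512
x_loc = [torch.randn(batch, dim) for _ in range(n_att)]   # local visual embeddings
z_loc = [torch.randn(batch, dim) for _ in range(n_att)]   # local textual embeddings
p_vis = [torch.rand(batch) for _ in range(n_att)]          # visual ground-truth class posteriors
p_tex = [torch.rand(batch) for _ in range(n_att)]          # textual ground-truth class posteriors
fusion_stub = lambda items: torch.stack(items, dim=0).mean(dim=0)   # stand-in for f of Eq (3)
s_loc = quality_aware_fusion(x_loc, z_loc, p_vis, p_tex, fusion_stub)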
[0092] Cross-Modality Cross-Level Embedding. A fusion unit (Eq (3)) is used to form the final cross-modality cross-level embedding as:
s = f({s^loc, s^glo}) (equation 7)
[0093] The final embedding s is used to estimate the attribute-image matching result ŷ (Eq (1)) given an input attribute query and person image.
EXPERIMENTS
[0094] Datasets. In the evaluations, two publicly available person search benchmarks (Market-1501 [35], DukeMTMC [22, 18]) were used, as well as one large pedestrian analysis benchmark (PA100K [19]). These datasets present good challenges for person search with varying camera viewing conditions. Standard evaluation settings were followed. The dataset statistics are summarised in table 1.
[0095] Performance Metrics. The Cumulative Matching Characteristic (CMC) and mean Average Precision (mAP) were used as evaluation metrics. As in [33], the gallery images matching a given attribute vector query were treated as true matches.
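For reference, CMC and mAP can be computed from ranked retrieval results as sketched below; this is a generic implementation of the standard metrics, not code from the described system, and the similarity and relevance inputs are random placeholders.

import numpy as np

def cmc_and_map(similarity, relevance, ranks=(1, 5, 10)):
    # similarity: (n_queries, n_gallery) scores; relevance: boolean true-match matrix.
    cmc_hits = np.zeros(len(ranks))
    aps, n_valid = [], 0
    for sim, rel in zip(similarity, relevance):
        if not rel.any():
            continue   # skip queries with no true match in the gallery
        n_valid += 1
        order = np.argsort(-sim)                 # gallery sorted by descending similarity
        rel_sorted = rel[order]
        first_hit = int(np.argmax(rel_sorted))   # rank index of the first true match
        cmc_hits += [first_hit < r for r in ranks]
        hits = np.cumsum(rel_sorted)
        precision_at_hits = hits[rel_sorted] / (np.flatnonzero(rel_sorted) + 1)
        aps.append(precision_at_hits.mean())     # average precision for this query
    return cmc_hits / n_valid, float(np.mean(aps))

sim = np.random.rand(4, 100)         # 4 attribute queries vs 100 gallery images
rel = np.random.rand(4, 100) > 0.9   # which gallery images are true matches
cmc, mean_ap = cmc_and_map(sim, rel)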
[0096] Implementation Details. For fair comparison to [33], ResNet-50 [7] was used as the backbone network for learning the visual embedding. Adam was employed as the optimiser. The batch size was set to 16 (attribute-image pairs), the learning rate to 1e-5, and the number of epochs to 150. In each mini-batch, 16/255 (16×16−1) positive/negative text-image training pairs were formed on the fly. 50 training person categories were used for parameter cross-validation. A two-level hierarchy was used in AIHM for the main experiments, with different hierarchy structures evaluated separately.
[0097] The system and method (AIHM) were compared with a wide range of plausible solutions to text attribute person search in two paradigms: (1) Global category-level visual-textual embedding methods: learning to align the distributions of text attributes and images in a common space, including CCA [1, 30, 3, 24] or MMD [26] based cross-modal matching models, ZSL methods (DEM [34], RN [25], GAZSL [37]), visual-semantic embedding (VSE++ [4]), and GAN based cross-modality alignment (AAIPR [33]). (2) Local attribute-level visual-textual embedding methods: learning attribute-image region correspondence, including region proposal based dense text-image cross-modal matching (SCAN [13]) and natural language query based person search (GNA-RNN [15] and CMCE [14]). Officially released code was used, with careful parameter tuning where needed, e.g. for methods originally applied to different applications. In testing the language models [4, 13, 15, 14], randomly ordered attribute sentences were used (attributes have no inherent ordering) and the average results of 10 trials are reported. For all methods, ResNet-50 was used for visual embedding.
[0098] Results. The person search performance comparisons on the three benchmarks are shown in table 2. It is evident that the AIHM model outperforms all the existing methods, e.g. surpassing the second best and state-of-the-art person search model AAIPR [33] by a margin of 5.0%/3.7% in Rank-1/mAP on Market-1501. The performance margins over the other global visual-textual embedding methods and the local region correspondence learning models are even more significant. In particular, state-of-the-art ZSL models also fail to excel due to the larger scale search, more ambiguous visual observations, and meaningless category names. Overall, these results show that, despite their respective modelling strengths, either global or local embedding alone is suboptimal for the more challenging person search problem. It is clearly beneficial to the overall model performance if their complementary advantages are utilised as formulated in the AIHM model.
[0099] Qualitative Analysis and Visual Examination. To provide a more in-depth visual examination of the performance of the system (AIHM) 10, a qualitative analysis was conducted, as shown in
[0100] False retrieval images are often due to ambiguous visual appearances and/or text descriptions. For example, the Rank 7 image (b) shows “up-purple” whilst the Rank 9 image shows “up-red”; such a colour difference is visually very subtle even for humans. Another example with visual ambiguity is “blue” vs “black” (c). In terms of ambiguous text attribute descriptions, “Teenage” and “Young” are semantically very close. This causes the failed search results (d), where “Teenage” person images in the top-7 are instead retrieved against the query attribute “Young”.
[0101] Further Analysis and Discussion. Hierarchical embedding and matching. The effect and complementary benefit of joint local attribute-level and global category-level visual-textual embedding in AIHM were examined by comparing the individual embeddings with their combination. Table 3 shows that: (1) either embedding alone is already considerably strong and discriminative for person search, with local AIHM embedding alone being competitive with the state-of-the-art AAIPR [33]; and (2) a clear performance gain is obtained by combining both global and local embedding as a whole in person search. This validates the complementary benefits and performance advantages of jointly learning local and global visual-textual embedding interactively in the present system and method (AIHM).
[0102] Quality-aware fusion. Recall that a quality-aware fusion (Eq (6)) was included in AIHM for alleviating the negative effect of noisy and ambiguous observation in local visual-textual embedding. The efficacy of this component was tested in comparison to the common average pooling strategy. Table 4 shows that our quality-aware fusion is more effective in suppressing noisy information, e.g. improving over the average pooling in Rank1/mAP rates by 6.2%/0.5% on Market-1501, 5.6%/1.3% on DukeMTMC, and 5.2%/1.9% on PA100K, respectively. This shows the benefit of taking into account the input data quality in person search.
[0103] Negative category augmentation. To combat the one-shot learning challenge in global textual embedding, negative category augmentation was exploited in AIHM model learning, so as to enrich the training text data and reduce the over-fitting risk. Three different augmentation sizes were tested: 5 k, 10 k, and 20 k. It is shown in
[0104] Person search by individual attribute recognition. Two high-level model design strategies were examined for person search: (1) Attribute Recognition (AR): using the attribute prediction scores from the AIHM's visual component and the L_2 distance metric in the attribute vector space for cross-modal matching and ranking; and (2) the learning-to-match strategy, i.e. the AIHM, which considers both global category-level and local attribute-level textual-visual embedding. It is interesting to find from table 5 that the AR baseline performs reasonably well when compared to the other techniques in table 2; for example, AR even approaches the performance of the state-of-the-art person search model AAIPR [33]. Note that this strong AR result is likely to benefit from the hierarchical embedding learning design. The large performance margins of the present model over AR suggest that the learning-to-match strategy with joint optimisation is superior.
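The Attribute Recognition (AR) baseline of strategy (1) may be sketched as a nearest neighbour ranking in the attribute vector space; the query vector and predicted gallery scores below are illustrative placeholders, not outputs of the described visual component.

import numpy as np

def rank_gallery_by_attributes(query_vec, gallery_pred):
    # Rank gallery images by L2 distance between the query attribute vector and each
    # image's predicted attribute vector (Attribute Recognition baseline).
    dists = np.linalg.norm(gallery_pred - query_vec, axis=1)
    return np.argsort(dists)   # best matches first

query = np.array([1.0, 0.0, 1.0, 0.0, 1.0])   # binary text attribute query
gallery = np.random.rand(1000, 5)             # predicted attribute scores per gallery image
ranking = rank_gallery_by_attributes(query, gallery)
top10 = ranking[:10]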
[0105] Global textual embedding. Three design considerations for learning the global textual embedding were examined: (1) Individual attribute representation: One-Hot (OH) vs Word2Vec (WV), (2) Aggregation of multiple attribute embedding: RNN (LSTM) vs CNN. (3) Binary-class label representation: Zero vs Transformed Input. Table 6 shows that:
(1) OH+CNN outperforms OH+RNN, suggesting that artificially introducing the modelling of temporal structure information on orderless person attributes is not only unnecessary but also brings adverse effect to model performance.
(2) WV+CNN outperforms OH+CNN, indicating that WV is a more informative attribute representation particularly in case of sparse training attribute data. Textual embedding design via CNN is superior to directly using WV, suggesting the necessity of feature transformation because the generic WV is not optimised particularly for person image analysis.
[0106] Multi-task learning scalability. Multi-task learning for local visual-textual embedding was used, so the branch number is decided by the attribute set size N_att (
[0108] Hierarchy depth. The effect of AIHM's hierarchy depth on model performance was evaluated. Random grouping to form size-balanced intermediate layers was used for l-level (l = 2 or 4) hierarchies (see
[0109] Unlike most existing methods, which assume image based queries that are not always available in practice, the present system and method (AIHM) enables person search with only short text attribute descriptions. In contrast to the few existing methods for attribute query person search, this problem is formulated as an extended zero-shot learning problem with a more principled approach to its solution. Algorithmically, the AIHM model addresses the fundamental limitations of existing ZSL methods by joint global category-level and local attribute-level visual-textual embedding and matching. This aims to eliminate their respective modelling weaknesses whilst exploiting their mutually complementary advantages. Extensive comparative evaluations demonstrated the performance superiority of the AIHM model over a wide range of existing alternative methods on three attribute person search benchmarks. Detailed component analyses were provided in order to give insights into the model design and its performance advantages.
[0110] As described above, an example implementation of the system and method (AIHM) comprises four components: (1) hierarchical visual embedding, (2) hierarchical textual embedding, (3) cross-modality cross-level embedding, and (4) a matching module. The network designs of these components are detailed below. The embedding dimensions are summarised in table 9.
Hierarchical Visual Embedding Network. The details of the 2-layer and 4-layer hierarchical visual embedding follow.
[0111] 2-Layer Hierarchical Visual Embedding. In the previously described experiments, a 2-layer multi-task learning design is used for hierarchical visual embedding. The architecture details are shown in
[0112] 4-Layer Hierarchical Visual Embedding. The 4-layer hierarchical visual embedding uses a tree-structured multi-task learning design. The architecture is shown in
[0113] Hierarchical Textual Embedding Network. The textual embedding consists of two parts: (1) local textual embedding and (2) global textual embedding. The 2-layer and 4-layer hierarchical textual embeddings are described in turn below.
[0114] 2-Layer Hierarchical Textual Embedding. In textual embedding, the input is a set of text attributes. Each text attribute is first passed into a word2vector model trained on Wikipedia [38] and then into three FC layers. The resulting local embeddings are then utilised to form the global embedding. See the architecture in
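A sketch of this local textual embedding path, assuming PyTorch and the FC sizes listed in table 12; the 300-D inputs stand in for word2vector outputs and the class name is an illustrative assumption.

import torch
import torch.nn as nn

class LocalTextualEmbedding(nn.Module):
    # word2vector (300-D) -> three FC layers with Tanh -> local textual embedding z_i^loc.
    def __init__(self, emb_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(300, 512), nn.Tanh(),
            nn.Linear(512, 1024), nn.Tanh(),
            nn.Linear(1024, emb_dim), nn.Tanh(),
        )

    def forward(self, word_vec):
        return self.net(word_vec)

word_vec = torch.randn(16, 300)   # stand-in for word2vector embeddings of attribute names
z_loc = LocalTextualEmbedding()(word_vec)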
[0115] 4-Layer Hierarchical Textual Embedding. The 4-layer textual embedding has a similar structure to its 2-layer counterpart. See the architecture in
[0116] Cross-Modality Cross-Level Embedding. Given the hierarchical visual and textual embeddings, global-level cross-modality embedding is conducted, followed by cross-level cross-modality embedding. The configurations of the layers are listed in table 14.
[0117] Cross-Modality Global-Level Embedding. The global-level fusion module takes as input the global visual embedding x^glo and the global textual embedding z^glo, outputting the global cross-modality embedding s^glo. The architecture is shown in
[0118] Cross-Modality Local-Level Embedding. The local-level fusion module takes as input the local visual embeddings {x_i^loc, i∈{1, . . . , N_att}} and the local textual embeddings {z_i^loc, i∈{1, . . . , N_att}}, outputting the local cross-modality embedding s^loc.
[0119] Cross-Modality Cross-Level Embedding. Given the global s^glo and local s^loc cross-modality embeddings, the cross-modality cross-level embedding s is obtained as shown in
[0120] Matching Module. The matching module takes as input the cross-modality cross-level embedding s, and outputs the similarity score ŷ∈[0, 1] of the input image and attribute set. In training, the ground-truth similarity score is set to 1 for matching attribute-image pairs and to 0 for unmatched attribute-image pairs. The details are shown in table 15 and
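A sketch of such a matching module, assuming PyTorch and the layer sizes of table 15; in training the output ŷ would be supervised against the 0/1 ground-truth similarity via the loss of equation (1). The class name is an illustrative assumption.

import torch
import torch.nn as nn

class MatchingModule(nn.Module):
    # Cross-modality cross-level embedding s -> similarity score y_hat in [0, 1].
    def __init__(self, s_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(s_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid(),
        )

    def forward(self, s):
        return self.net(s).squeeze(1)

s = torch.randn(16, 512)      # fused embeddings for 16 attribute-image pairs
y_hat = MatchingModule()(s)   # predicted similarity scores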
[0121] As will be appreciated by the skilled person, details of the above embodiment may be varied without departing from the scope of the present invention, as defined by the appended claims.
[0122] For example, although the examples provided use images of people and the text-based search are descriptions of physical attributes of people, the methods, techniques and systems can be used with images (e.g. from video sources) of other targets. For example, the system and method may be used with searching for manufactured products, buildings, animals, plants and geographic or natural structures, for example.
[0123] Many combinations, modifications, or alterations to the features of the above embodiments will be readily apparent to the skilled person and are intended to form part of the invention. Any of the features described specifically relating to one embodiment or example may be used in any other embodiment by making the appropriate changes.
TABLE 1 Statistics of person search datasets. Other datasets may be used.
                            Market-1501   DukeMTMC   PA100K
 # Attribute category                10          8       15
 # Train person category            508        300     2020
 # Train image                   12,936     16,522   80,000
 # Test person category             529        387      849
 # Unseen                           367        229      168
 # Test image                    15,913     19,889   10,000
TABLE 2 Comparisons to the state-of-the-art methods. Bold: best results.
                   Market-1501               DukeMTMC                  PA100K
Method             Rank1 Rank5 Rank10 mAP    Rank1 Rank5 Rank10 mAP    Rank1 Rank5 Rank10 mAP
DEM [34]           34.0  48.1  57.5   17.0   22.7  43.9  54.5   12.9   20.8  38.7  44.2   14.8
RN [25]            17.2  38.7  47.3   15.5   25.1  42.0  51.5   13.0   27.5  38.8  46.6   13.6
GAZSL [37]         23.3  36.9  45.9   14.1   18.2  30.0  37.8   11.9    2.2   3.8   5.3    0.9
Deep CCAE [30]      8.1  23.9  34.5    9.7   33.2  59.3  67.6   14.9   21.2  39.7  48.0   15.6
DeepCCA [1]        29.9  50.7  58.1   17.5   36.7  58.8  65.1   13.5   19.5  40.3  49.0   15.4
2WayNet [3]        11.2  24.3  31.4    7.7   25.2  39.8  45.9   10.1   19.5  26.6  34.5   10.6
MMD [26]           34.1  47.9  57.2   18.9   41.7  62.3  68.6   14.2   25.8  38.9  46.2   14.4
DeepCoral [24]     36.5  47.6  55.9   20.0   46.1  61.0  68.1   17.1   22.0  39.7  48.1   14.1
VSE++ [4]          27.0  49.1  58.2   17.2   33.6  54.7  62.8   15.5   22.7  39.8  48.1   15.7
AAIPR [33]         40.2  49.2  58.6   20.6   46.6  59.6  69.0   15.6   27.3  40.5  49.8   15.2
SCAN [13]           4.0  10.1  15.3    2.1    3.5   9.3  14.3    1.6    2.9   8.2  12.5    1.9
GNA-RNN [15]       30.4  38.7  44.4   15.4   34.6  52.7  65.8   14.2   20.3  30.8  38.2    9.3
CMCE [14]          35.0  50.9  56.4   22.8   39.7  56.3  62.7   15.4   25.8  34.9  45.4   13.1
AIHM               45.2  56.7  64.5   24.3   50.5  65.2  75.3   17.4   31.3  45.1  51.0   17.0
TABLE 3 Hierarchical embedding and matching analysis.
               Market-1501       DukeMTMC          PA100K
Method         Rank1   mAP       Rank1   mAP       Rank1   mAP
Global Only    30.6    20.5      40.7    13.7      26.1    14.3
Local Only     39.5    21.9      46.9    15.3      29.4    15.6
Hierarchy      45.2    24.3      50.5    17.4      31.3    17.0
TABLE 4 Quality-aware fusion vs. average pooling.
               Market-1501       DukeMTMC          PA100K
Method         Rank1   mAP       Rank1   mAP       Rank1   mAP
Avg Pool       39.0    23.8      44.9    16.1      26.1    15.1
AIHM           45.2    24.3      50.5    17.4      31.3    17.0
TABLE 5 Model design strategy examination: Attribute Recognition (AR) vs Learning to Compare (as AIHM).
Dataset        Method   Rank1   Rank5   Rank10   mAP
Market-1501    AR       35.7    47.8    57.8     19.8
               AIHM     45.2    56.7    64.5     24.3
DukeMTMC       AR       42.0    52.9    63.2     15.8
               AIHM     50.5    65.2    75.3     17.4
PA100K         AR       30.3    42.8    47.8     13.8
               AIHM     31.3    45.1    51.0     17.0
TABLE 6 Global textual embedding analysis. OH: One-Hot; WV: Word2Vec.
               Market-1501       DukeMTMC          PA100K
Method         Rank1   mAP       Rank1   mAP       Rank1   mAP
OH + RNN       35.7    17.8      46.6    16.8      21.4    12.3
OH + CNN       37.1    21.0      49.8    18.1      25.3    13.7
WV             43.8    22.9      48.7    16.2      29.1    14.2
OH + CNN       39.1    22.0      46.5    16.1      25.3    13.7
WV + CNN       45.2    24.3      50.5    17.4      31.3    17.0
TABLE 7 Scalability of multi-task learning local embedding.
               Market-1501       DukeMTMC          PA100K
#Branch        Rank1   mAP       Rank1   mAP       Rank1   mAP
N_att/4        43.5    23.9      47.9    15.6      30.3    16.3
N_att          45.2    24.3      50.5    17.4      31.3    17.0
TABLE 8 Effect of hierarchy depth.
               Market-1501       DukeMTMC          PA100K
#Depth         Rank1   mAP       Rank1   mAP       Rank1   mAP
2              45.2    24.3      50.5    17.4      31.3    17.0
4              47.5    25.2      53.6    18.5      33.4    17.8
TABLE 9 Embedding dimensions.
Definition                           Notation       Value
Local embedding dimension            Dim_emb^loc    512
Global embedding dimension           Dim_emb^glo    1024
Cross-modal embedding dimension      Dim_emb^S      512
TABLE 10 Configuration of 2-layer visual embedding.
Structure                                      Size
ResNet-50                                      output size is 2048
FC_1                                           2048 × Dim_emb^glo, Tanh
FC_1,i (i = 1, 2, . . . , N_Attr)              2048 × 1024, ReLU
FC_2,i (i = 1, 2, . . . , N_Attr)              1024 × Dim_emb^loc
Classification_i (i = 1, 2, . . . , N_Attr)    Dim_emb^glo × N_Attr_i
TABLE 11 Configuration of 4-layer visual embedding.
Structure                                      Size
ResNet-50                                      output size is 2048
FC_1                                           2048 × Dim_emb^glo, Tanh
FC_1,i (i = 1, 2)                              2048 × 1024, ReLU
FC_2,i (i = 1, 2, 3, 4)                        1024 × 512, ReLU
FC_3,i (i = 1, 2, . . . , N_Attr)              512 × Dim_emb^loc
Classification_i (i = 1, 2, . . . , N_Attr)    Dim_emb^glo × N_Attr_i
TABLE 12 Configuration of 2-layer textual embedding.
Structure                              Size
FC_1                                   300 × 512, Tanh
FC_2                                   512 × 1024, Tanh
FC_3                                   1024 × Dim_emb^loc, Tanh
Cls_i (i = 1, 2, . . . , N_Attr)       Dim_emb^loc × N_Attr_i
TABLE 13 Configuration of 4-layer textual embedding.
Structure                              Size
FC_1                                   300 × 512, Tanh
FC_2                                   512 × 1024, Tanh
FC_3                                   1024 × Dim_emb^loc, Tanh
Cls_i (i = 1, 2, . . . , N_Attr)       Dim_emb^loc × N_Attr_i
TABLE 14 Configuration of cross-level (CL) cross-modality (CM) embedding.
Structure                              Size
FC_T/V^i, i ∈ {glo, loc}               Dim_emb^i × 512
FC_1                                   512 × 512, Tanh
FC_2                                   512 × 512, Tanh
Fusion_CM                              Conv [512, Dim_emb^S, 1, 1, 0], Conv [N_Attr, 1, 1, 1, 0], Tanh
Fusion_CL                              Conv [512, Dim_emb^S, 1, 1, 0], Conv [N_Attr, 1, 1, 1, 0], Tanh
TABLE 15 Configuration of matching module.
Structure      Size
FC_1           Dim_emb^S × 256, ReLU
FC_2           256 × 128, ReLU
FC_3           128 × 1, Sigmoid
REFERENCES
[1] G. Andrew, R. Arora, J. Bilmes, and K. Livescu. Deep canonical correlation analysis. In ICML, 2013.
[2] Q. Dong, S. Gong, and X. Zhu. Multi-task curriculum transfer deep learning of clothing attributes. In WACV, 2017.
[3] A. Eisenschtat and L. Wolf. Linking image and text with 2-way nets. In CVPR, 2017.
[4] F. Faghri, D. J. Fleet, J. R. Kiros, and S. Fidler. VSE++: Improving visual-semantic embeddings with hard negatives. 2018.
[5] Y. Fu, T. M. Hospedales, T. Xiang, and S. Gong. Transductive multi-view zero-shot learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(11):2332-2345, 2015.
[6] S. Gong, M. Cristani, S. Yan, and C. C. Loy. Person Re-Identification. Springer, 2014.
[7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[8] M. Koestinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof. Large scale metric learning from equivalence constraints. In CVPR, 2012.
[9] C. H. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3):453-465, 2014.
[10] R. Layne, T. M. Hospedales, and S. Gong. Attributes-based re-identification. In Person Re-Identification, Springer, 2014.
[11] R. Layne, T. M. Hospedales, and S. Gong. Re-id: Hunting attributes in the wild. In BMVC, 2014.
[12] R. Layne, T. M. Hospedales, S. Gong, and Q. Mary. Person re-identification by attributes. In BMVC, 2012.
[13] K.-H. Lee, X. Chen, G. Hua, H. Hu, and X. He. Stacked cross attention for image-text matching. In ECCV, 2018.
[14] S. Li, T. Xiao, H. Li, W. Yang, and X. Wang. Identity-aware textual-visual matching with latent co-attention. In ICCV, 2017.
[15] S. Li, T. Xiao, H. Li, B. Zhou, D. Yue, and X. Wang. Person search with natural language description. In CVPR, 2017.
[16] W. Li, R. Zhao, T. Xiao, and X. Wang. DeepReID: Deep filter pairing neural network for person re-identification. In CVPR, 2014.
[17] W. Li, X. Zhu, and S. Gong. Harmonious attention network for person re-identification. In CVPR, 2018.
[18] Y. Lin, L. Zheng, Z. Zheng, Y. Wu, and Y. Yang. Improving person re-identification by attribute and identity learning. arXiv, 2017.
[19] X. Liu, H. Zhao, M. Tian, L. Sheng, J. Shao, J. Yan, and X. Wang. HydraPlus-Net: Attentive deep features for pedestrian analysis. In ICCV, 2017.
[20] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
[21] P. Peng, Y. Tian, T. Xiang, Y. Wang, M. Pontil, and T. Huang. Joint semantic and latent attribute modelling for cross-class transfer learning. TPAMI, 2018.
[22] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In ECCV Workshop on Benchmarking Multi-Target Tracking, 2016.
[23] C. Su, S. Zhang, J. Xing, W. Gao, and Q. Tian. Deep attributes driven multi-camera person re-identification. In ECCV, 2016.
[24] B. Sun and K. Saenko. Deep CORAL: Correlation alignment for deep domain adaptation. In ECCV, 2016.
[25] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales. Learning to compare: Relation network for few-shot learning. In CVPR, 2018.
[26] I. O. Tolstikhin, B. K. Sriperumbudur, and B. Schölkopf. Minimax estimation of maximum mean discrepancy with radial kernels. In NIPS, 2016.
[27] D. A. Vaquero, R. S. Feris, D. Tran, L. Brown, A. Hampapur, and M. Turk. Attribute-based people search in surveillance environments. In Workshop of WACV, 2009.
[28] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In NIPS, 2016.
[29] J. Wang, X. Zhu, S. Gong, and W. Li. Transferable joint attribute-identity deep learning for unsupervised person re-identification. In CVPR, 2018.
[30] W. Wang, R. Arora, K. Livescu, and J. Bilmes. On deep multi-view representation learning. In ICML, 2015.
[31] Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata. Zero-shot learning—a comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[32] W. Xie, L. Shen, and A. Zisserman. Comparator networks. In ECCV, 2018.
[33] Z. Yin, W.-S. Zheng, A. Wu, H.-X. Yu, H. Wan, X. Guo, F. Huang, and J. Lai. Adversarial attribute-image person re-identification. In IJCAI, 2018.
[34] L. Zhang, T. Xiang, and S. Gong. Learning a deep embedding model for zero-shot learning. In CVPR, 2017.
[35] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian. Scalable person re-identification: A benchmark. In ICCV, 2015.
[36] W.-S. Zheng, S. Gong, and T. Xiang. Reidentification by relative distance comparison. TPAMI, 2013.
[37] Y. Zhu, M. Elhoseiny, B. Liu, X. Peng, and A. Elgammal. A generative adversarial approach for zero-shot learning from noisy texts. In CVPR, 2018.
[38] Z. Fu, T. Xiang, E. Kodirov, and S. Gong. Zero-shot learning on semantic class prototype graph. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(8):2009-2022, August 2018.
[39] Y. Fu, T. Xiang, Y.-G. Jiang, X. Xue, L. Sigal, and S. Gong. Recent advances in zero-shot recognition: Towards data-efficient understanding of visual content. IEEE Signal Processing Magazine, 35(1):112-125, January 2018.
[40] X. Xu, T. Hospedales, and S. Gong. Transductive zero-shot action recognition by word-vector embedding. International Journal of Computer Vision, 123(3):309-333, July 2017.
[41] Y. Fu, T. Hospedales, T. Xiang, and S. Gong. Transductive multi-view zero-shot learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(11):2332-2345, November 2015.
[42] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Proc. Neural Information Processing Systems Conf., 2013, pp. 3111-3119.