DICTIONARY LEARNING METHOD AND MEANS FOR ZERO-SHOT RECOGNITION
20230131545 · 2023-04-27
Inventors
- Lichun WANG (Beijing, CN)
- Shuang LI (Beijing, CN)
- Shaofan WANG (Beijing, CN)
- Dehui KONG (Beijing, CN)
- Baocai YIN (Beijing, CN)
CPC classification
- G06V10/778
- G06V10/772
- G06V20/41
- G06V10/7753
Abstract
A dictionary learning method and means for zero-shot recognition establish alignment between the visual space and the semantic space at both the category level and the image level, thereby realizing high-precision zero-shot image recognition. The dictionary learning method includes the following steps: (1) training a cross domain dictionary of a category layer based on a cross domain dictionary learning method; (2) generating semantic attributes of an image based on the cross domain dictionary of the category layer learned in step (1); (3) training a cross domain dictionary of an image layer based on the image semantic attributes generated in step (2); (4) completing a recognition task of unseen category images based on the cross domain dictionary of the image layer learned in step (3).
Claims
1. A dictionary learning method for zero-shot recognition, comprising the following steps: (1) training a cross domain dictionary of a category layer based on a cross domain dictionary learning method; (2) generating semantic attributes of an image based on the cross domain dictionary of the category layer learned in step (1); (3) training a cross domain dictionary of an image layer based on the image semantic attributes generated in step (2); (4) completing a recognition task of unseen category images based on the cross domain dictionary of the image layer learned in step (3).
2. The dictionary learning method for zero-shot recognition according to claim 1, wherein step (1) comprises:
(1.1) extracting a category prototype P_v of the visual space by computing the category centers of the seen category images, as formula (1):
\mathcal{L}_p = \|Y_v - P_v H\|_F^2,   (1)
wherein Y_v is a sample feature matrix and H is a sample label matrix;
(1.2) forming a pair of inputs from the category prototype P_v and the category semantic attributes P_s, training the cross domain dictionary at the category layer, and establishing the relationship between the visual space and the semantic space at the category layer by constraining the category prototypes and the category semantic attributes to share the same sparse coefficients, as formula (2):
\mathcal{L}_{seen} = \|P_v - D_v X_p\|_F^2 + λ\|P_s - D_s X_p\|_F^2,   (2)
wherein the first term is the reconstruction error of the visual space dictionary, the second term is the reconstruction error of the semantic space dictionary, D_v is the visual space dictionary, D_s is the semantic space dictionary, X_p is a sparse coefficient matrix, and λ is a harmonic parameter;
(1.3) introducing an adaptive loss for the unseen categories as formula (3), in order to reduce the impact of the domain gap between seen and unseen categories on model accuracy and to improve the ability of the model to recognize unseen category samples:
\mathcal{L}_{unseen} = \|P_v^u - D_v X_p^u\|_F^2 + λ\|P_s^u - D_s X_p^u\|_F^2,   (3)
wherein P_v^u is the unseen category prototype to be solved, P_s^u is the semantic attribute matrix of the unseen categories, and X_p^u is the sparse coefficient matrix of the unseen categories; the overall loss function of the category-level model is:
\mathcal{L}_{class} = \mathcal{L}_{seen} + α\mathcal{L}_{unseen} + β\mathcal{L}_p,   (4)
and the training objective of the category layer is to minimize the loss function in formula (4) over the variables: the visual space dictionary D_v, the semantic space dictionary D_s, the seen category prototype P_v, the unseen category prototype P_v^u, the seen category sparse coefficient X_p, and the unseen category sparse coefficient X_p^u.
3. The dictionary learning method for zero-shot recognition according to claim 2, wherein step (2) comprises:
(2.1) generating a sparse coefficient X_y of the image by using the visual space dictionary D_v, as formula (5):
\min_{X_y} \|Y_v - D_v X_y\|_F^2 + w_x\|X_y - X_p H\|_F^2,   (5)
wherein the first term is a reconstruction error term, the second term constrains the generated image sparse coefficient to be close to the sparse coefficient generated by its category under the same visual space dictionary D_v, and w_x is a harmonic parameter;
(2.2) generating a semantic attribute Y_s of the image by using the semantic space dictionary D_s and its category semantic attribute P_s, as formula (6):
\min_{Y_s} \|Y_s - D_s X_y\|_F^2 + w_p\|Y_s - P_s H\|_F^2,   (6)
wherein w_p is a harmonic parameter.
4. The dictionary learning method for zero-shot recognition according to claim 3, wherein step (3) comprises: training the cross domain dictionary of the image layer based on the image semantic attributes generated in step (2), in order to further mine image-level information and improve the generalization performance of the model, as formula (7):
\mathcal{L}_{seen} = \|Y_v - D_v^{image} X\|_F^2 + μ\|Y_s - D_s^{image} X\|_F^2,   (7)
wherein the first term is the reconstruction error of the visual space, the second term is the reconstruction error of the semantic space, D_v^{image} and D_s^{image} are the visual space dictionary and the semantic space dictionary at the image layer, respectively, X is a sparse coefficient matrix, and μ is a harmonic parameter.
5. The dictionary learning method for zero-shot recognition according to claim 4, wherein step (4) comprises:
for comparison in the visual space: first generating a sparse coefficient X^u from the unseen category semantic attributes P_s^u through the semantic space dictionary of the image layer D_s^{image}, as formula (8):
\min_{X^u} \|P_s^u - D_s^{image} X^u\|_F^2,   (8)
then generating the representation of each category in the visual space, P_v^{ul} = D_v^{image} X^u, by using the visual space dictionary of the image layer D_v^{image}, computing the cosine distance between the test image y_v and the description P_v^{ul}[c] of each category, and judging the category of the test image by the smallest distance, as formula (9):
\min_c \mathrm{dist}(P_v^{ul}[c], y_v),   (9)
wherein dist(·,·) denotes the cosine distance;
for comparison in the sparse domain: extracting the representation x^u of the test image in the sparse space according to the visual space dictionary of the image layer, as formula (10):
\min_{x^u} \|y_v - D_v^{image} x^u\|_F^2,   (10)
and computing the cosine distance between x^u and the description X^u[c] of each category in the sparse space, the category closest to the test image being the category of the image, as formula (11):
\min_c \mathrm{dist}(X^u[c], x^u),   (11)
for comparison in the semantic space: first encoding the test image into x^u according to the visual space dictionary of the image layer; then generating the semantic attribute of the image, y_s = D_s^{image} x^u, according to the semantic space dictionary of the image layer; computing the cosine distance between y_s and the semantic attributes of the categories, and judging the category of the test image by the smallest distance, as formula (12):
\min_c \mathrm{dist}(P_s^u[c], y_s).   (12)
6. The dictionary learning method for zero-shot recognition according to claim 5, wherein the method is tested on two image data sets for the zero-shot recognition task, the AwA data set and the aPY data set, and its recognition accuracy is compared with current mainstream zero-shot recognition models, including SJE, EZSL, SYNC, SAE, CDL, ALE, CONSE, LATEM, and DEVISE; AwA is an animal image data set including 50 animal categories and 30,475 images, and each category has 85 annotated attributes; the standard division for the zero-shot recognition experiment uses 40 categories as seen categories and the other 10 categories as unseen categories.
7. A dictionary learning means for zero-shot recognition, comprising: a first training module, which trains a cross domain dictionary of a category layer based on a cross domain dictionary learning method; a generation module, which generates semantic attributes of an image based on the cross domain dictionary of the category layer learned by the first training module; a second training module, which trains a cross domain dictionary of an image layer based on the image semantic attributes generated by the generation module; and a recognition module, which completes a recognition task of unseen category images based on the cross domain dictionary of the image layer learned by the second training module.
8. The dictionary learning means for zero-shot recognition according to claim 7, wherein the first training module performs:
extracting a category prototype P_v of the visual space by computing the category centers of the seen category images, as formula (1):
\mathcal{L}_p = \|Y_v - P_v H\|_F^2,   (1)
wherein Y_v is a sample feature matrix and H is a sample label matrix;
forming a pair of inputs from the category prototype P_v and the category semantic attributes P_s, training the cross domain dictionary at the category layer, and establishing the relationship between the visual space and the semantic space at the category layer by constraining the category prototypes and the category semantic attributes to share the same sparse coefficients, as formula (2):
\mathcal{L}_{seen} = \|P_v - D_v X_p\|_F^2 + λ\|P_s - D_s X_p\|_F^2,   (2)
wherein the first term is the reconstruction error of the visual space dictionary, the second term is the reconstruction error of the semantic space dictionary, D_v is the visual space dictionary, D_s is the semantic space dictionary, X_p is a sparse coefficient matrix, and λ is a harmonic parameter;
introducing an adaptive loss for the unseen categories as formula (3), in order to reduce the impact of the domain gap between seen and unseen categories on model accuracy and to improve the ability of the model to recognize unseen category samples:
\mathcal{L}_{unseen} = \|P_v^u - D_v X_p^u\|_F^2 + λ\|P_s^u - D_s X_p^u\|_F^2,   (3)
wherein P_v^u is the unseen category prototype to be solved, P_s^u is the semantic attribute matrix of the unseen categories, and X_p^u is the sparse coefficient matrix of the unseen categories; the overall loss function of the category-level model is:
\mathcal{L}_{class} = \mathcal{L}_{seen} + α\mathcal{L}_{unseen} + β\mathcal{L}_p,   (4)
and the training objective of the category layer is to minimize the loss function in formula (4) over the variables: the visual space dictionary D_v, the semantic space dictionary D_s, the seen category prototype P_v, the unseen category prototype P_v^u, the seen category sparse coefficient X_p, and the unseen category sparse coefficient X_p^u.
9. The dictionary learning means for zero-shot recognition according to claim 8, wherein the generation module performs:
generating a sparse coefficient X_y of the image by using the visual space dictionary D_v, as formula (5):
\min_{X_y} \|Y_v - D_v X_y\|_F^2 + w_x\|X_y - X_p H\|_F^2,   (5)
wherein the first term is a reconstruction error term, the second term constrains the generated image sparse coefficient to be close to the sparse coefficient generated by its category under the same visual space dictionary D_v, and w_x is a harmonic parameter; and
generating a semantic attribute Y_s of the image by using the semantic space dictionary D_s and its category semantic attribute P_s, as formula (6):
\min_{Y_s} \|Y_s - D_s X_y\|_F^2 + w_p\|Y_s - P_s H\|_F^2,   (6)
wherein w_p is a harmonic parameter;
and wherein the second training module performs: training the cross domain dictionary of the image layer based on the image semantic attributes generated by the generation module, in order to further mine image-level information and improve the generalization performance of the model, as formula (7):
\mathcal{L}_{seen} = \|Y_v - D_v^{image} X\|_F^2 + μ\|Y_s - D_s^{image} X\|_F^2,   (7)
wherein the first term is the reconstruction error of the visual space, the second term is the reconstruction error of the semantic space, D_v^{image} and D_s^{image} are the visual space dictionary and the semantic space dictionary at the image layer, respectively, X is a sparse coefficient matrix, and μ is a harmonic parameter.
10. The dictionary learning means for zero-shot recognition according to claim 9, wherein the recognition module performs:
for comparison in the visual space: first generating a sparse coefficient X^u from the unseen category semantic attributes P_s^u through the semantic space dictionary of the image layer D_s^{image}, as formula (8):
\min_{X^u} \|P_s^u - D_s^{image} X^u\|_F^2,   (8)
then generating the representation of each category in the visual space, P_v^{ul} = D_v^{image} X^u, by using the visual space dictionary of the image layer D_v^{image}, computing the cosine distance between the test image y_v and the description P_v^{ul}[c] of each category, and judging the category of the test image by the smallest distance, as formula (9):
\min_c \mathrm{dist}(P_v^{ul}[c], y_v),   (9)
wherein dist(·,·) denotes the cosine distance;
for comparison in the sparse domain: extracting the representation x^u of the test image in the sparse space according to the visual space dictionary of the image layer, as formula (10):
\min_{x^u} \|y_v - D_v^{image} x^u\|_F^2,   (10)
and computing the cosine distance between x^u and the description X^u[c] of each category in the sparse space, the category closest to the test image being the category of the image, as formula (11):
\min_c \mathrm{dist}(X^u[c], x^u),   (11)
for comparison in the semantic space: first encoding the test image into x^u according to the visual space dictionary of the image layer; then generating the semantic attribute of the image, y_s = D_s^{image} x^u, according to the semantic space dictionary of the image layer; computing the cosine distance between y_s and the semantic attributes of the categories, and judging the category of the test image by the smallest distance, as formula (12):
\min_c \mathrm{dist}(P_s^u[c], y_s).   (12)
Description
DETAILED DESCRIPTION OF THE EMBODIMENTS
[0024] As shown in the drawings, a dictionary learning method for zero-shot recognition comprises the following steps:
[0025] (1) training a cross domain dictionary of a category layer based on a cross domain dictionary learning method;
[0026] (2) generating semantic attributes of an image based on the cross domain dictionary of the category layer learned in step (1);
[0027] (3) training a cross domain dictionary of the image layer based on the image semantic attributes generated in step (2);
[0028] (4) completing a recognition task of unseen category images based on the cross domain dictionary of the image layer learned in step (3).
[0029] The model is based on cross domain dictionary learning: by constraining the visual space data and the semantic space data to share a consistent representation when projected into the sparse space through their respective dictionaries, the association between the visual space and the semantic space is established first at the category layer and then at the image layer. Adding a cross domain dictionary at the image level aligns the two spaces at both the category level and the image level and extracts finer-grained image information than the category layer alone, thereby realizing high-precision zero-shot image recognition.
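To make the two-level structure concrete, the following Python (numpy) sketch outlines the end-to-end flow using the helper functions sketched under steps (1) to (4) below; all function and variable names are illustrative assumptions of this description, not limitations of the method:

    # End-to-end flow: category layer -> image attributes -> image layer -> recognition.
    # Relies on the per-step helper sketches given below; atom counts follow the
    # AwA configuration reported later in this description.
    def zero_shot_pipeline(Y_v, H, P_s, P_s_u, y_test, n_atoms_cat=40, n_atoms_img=200):
        D_v, D_s, P_v, P_v_u, X_p, X_pu = train_category_layer(
            Y_v, H, P_s, P_s_u, n_atoms_cat)                # step (1), formulas (1)-(4)
        X_y = image_codes(Y_v, D_v, X_p, H, w_x=1.0)        # step (2), formula (5)
        Y_s = image_attributes(X_y, D_s, P_s, H, w_p=1.0)   # step (2), formula (6)
        D_v_img, D_s_img, _ = train_image_layer(Y_v, Y_s, n_atoms_img)  # step (3)
        return classify(y_test, D_v_img, D_s_img, P_s_u)    # step (4), formulas (8)-(12)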
[0030] Preferably, step (1) comprises:
[0031] (1.1) extracting a category prototype P_v of the visual space by computing the category centers of the seen category images, as formula (1):
\mathcal{L}_p = \|Y_v - P_v H\|_F^2,   (1)
[0032] wherein Y_v is a sample feature matrix and H is a sample label matrix;
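By way of illustration only, a minimal numpy sketch of the prototype extraction in formula (1); the function name and matrix shapes are assumptions of this description:

    import numpy as np

    def class_prototypes(Y_v, H):
        # Closed-form minimizer of formula (1): min over P_v of ||Y_v - P_v H||_F^2.
        # Y_v: (d, n) feature matrix, one column per sample.
        # H:   (C, n) one-hot label matrix, H[c, i] = 1 iff sample i has label c.
        # The least-squares solution P_v = Y_v H^T (H H^T)^{-1} reduces, for one-hot
        # H, to averaging the features of each class (the "category center").
        counts = H.sum(axis=1)           # (C,) samples per class
        return (Y_v @ H.T) / counts      # (d, C); column c is the mean of class c

    # e.g. two classes over three 2-D samples:
    Y_v = np.array([[1., 3., 5.], [2., 4., 6.]])
    H = np.array([[1., 1., 0.], [0., 0., 1.]])
    print(class_prototypes(Y_v, H))      # columns [2, 3] and [5, 6]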
[0033] (1.2) forming a pair of inputs from the category prototype P_v and the category semantic attributes P_s, training the cross domain dictionary at the category layer, and establishing the relationship between the visual space and the semantic space at the category layer by constraining the category prototypes and the category semantic attributes to share the same sparse coefficients, as formula (2):
\mathcal{L}_{seen} = \|P_v - D_v X_p\|_F^2 + λ\|P_s - D_s X_p\|_F^2,   (2)
[0034] wherein the first term is the reconstruction error of the visual space dictionary, the second term is the reconstruction error of the semantic space dictionary, D_v is the visual space dictionary, D_s is the semantic space dictionary, X_p is a sparse coefficient matrix, and λ is a harmonic parameter;
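The patent does not state a solver for formula (2); assuming both dictionaries are held fixed, the shared coefficient matrix has the closed-form least-squares update sketched below (formula (2) carries no explicit sparsity penalty, so plain least squares suffices for this sketch):

    import numpy as np

    def update_shared_codes(P_v, P_s, D_v, D_s, lam):
        # Minimize ||P_v - D_v X||_F^2 + lam * ||P_s - D_s X||_F^2 over X.
        # Setting the gradient to zero gives the normal equations
        #   (D_v^T D_v + lam * D_s^T D_s) X = D_v^T P_v + lam * D_s^T P_s.
        A = D_v.T @ D_v + lam * D_s.T @ D_s
        b = D_v.T @ P_v + lam * D_s.T @ P_s
        return np.linalg.solve(A, b)

Sharing one coefficient matrix X across both reconstruction terms is exactly what couples the visual and semantic spaces.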
[0035] (1.3) introducing an adaptive loss for the unseen categories as formula (3), in order to reduce the impact of the domain gap between seen and unseen categories on model accuracy and to improve the ability of the model to recognize unseen category samples:
\mathcal{L}_{unseen} = \|P_v^u - D_v X_p^u\|_F^2 + λ\|P_s^u - D_s X_p^u\|_F^2,   (3)
[0036] wherein P_v^u is the unseen category prototype to be solved, P_s^u is the semantic attribute matrix of the unseen categories, and X_p^u is the sparse coefficient matrix of the unseen categories;
[0037] the overall loss function of the category-level model is as follows:
\mathcal{L}_{class} = \mathcal{L}_{seen} + α\mathcal{L}_{unseen} + β\mathcal{L}_p,   (4)
[0038] the training objective of the category layer is to minimize the loss function shown in formula (4) over the variables: the visual space dictionary D_v, the semantic space dictionary D_s, the seen category prototype P_v, the unseen category prototype P_v^u, the seen category sparse coefficient X_p, and the unseen category sparse coefficient X_p^u.
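The patent specifies the objective (4) but not an optimizer. One plausible alternating least-squares scheme, reusing class_prototypes and update_shared_codes from the sketches above and, for simplicity, holding the seen prototypes P_v fixed at their formula (1) solution, is:

    import numpy as np

    def _unit_cols(D, eps=1e-12):
        # Rescale dictionary atoms to unit norm (removes the usual scale ambiguity).
        return D / np.maximum(np.linalg.norm(D, axis=0), eps)

    def train_category_layer(Y_v, H, P_s, P_s_u, n_atoms, lam=1.0, alpha=1.0,
                             iters=50, seed=0):
        rng = np.random.default_rng(seed)
        P_v = class_prototypes(Y_v, H)                      # formula (1)
        D_v = _unit_cols(rng.standard_normal((Y_v.shape[0], n_atoms)))
        D_s = _unit_cols(rng.standard_normal((P_s.shape[0], n_atoms)))
        P_v_u = rng.standard_normal((Y_v.shape[0], P_s_u.shape[1]))  # to be solved
        for _ in range(iters):
            X_p = update_shared_codes(P_v, P_s, D_v, D_s, lam)       # seen codes
            X_pu = update_shared_codes(P_v_u, P_s_u, D_v, D_s, lam)  # unseen codes
            P_v_u = D_v @ X_pu    # exact minimizer of the first term of formula (3)
            # Least-squares dictionary updates over seen and unseen codes jointly;
            # alpha weights the unseen loss, as in formula (4).
            G = np.linalg.pinv(X_p @ X_p.T + alpha * X_pu @ X_pu.T)
            D_v = _unit_cols((P_v @ X_p.T + alpha * P_v_u @ X_pu.T) @ G)
            D_s = _unit_cols((P_s @ X_p.T + alpha * P_s_u @ X_pu.T) @ G)
        return D_v, D_s, P_v, P_v_u, X_p, X_pu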
[0039] Preferably, step (2) comprises:
[0040] (2.1) generating a sparse coefficient X_y of the image by using the visual space dictionary D_v, as formula (5):
\min_{X_y} \|Y_v - D_v X_y\|_F^2 + w_x\|X_y - X_p H\|_F^2,   (5)
[0041] wherein the first term is a reconstruction error term, the second term constrains the generated image sparse coefficient to be close to the sparse coefficient generated by its category under the same visual space dictionary D_v, and w_x is a harmonic parameter;
[0042] (2.2) generating a semantic attribute Y_s of the image by using the semantic space dictionary D_s and its category semantic attribute P_s, as formula (6):
\min_{Y_s} \|Y_s - D_s X_y\|_F^2 + w_p\|Y_s - P_s H\|_F^2,   (6)
[0043] wherein w_p is a harmonic parameter.
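Both generation steps admit closed forms; the numpy sketch below assumes the forms of formulas (5) and (6) given above (the body of formula (6) is itself reconstructed from the surrounding text of the published application, so this sketch should be read with that caveat):

    import numpy as np

    def image_codes(Y_v, D_v, X_p, H, w_x):
        # Formula (5): min over X of ||Y_v - D_v X||_F^2 + w_x ||X - X_p H||_F^2.
        # Each image's code is pulled toward the code X_p[:, c] of its class c,
        # selected by the one-hot labels H; ridge-style closed-form solution.
        A = D_v.T @ D_v + w_x * np.eye(D_v.shape[1])
        return np.linalg.solve(A, D_v.T @ Y_v + w_x * (X_p @ H))

    def image_attributes(X_y, D_s, P_s, H, w_p):
        # Formula (6) as reconstructed: min over Y of
        #   ||Y - D_s X_y||_F^2 + w_p ||Y - P_s H||_F^2,
        # i.e. a convex blend of the decoded attributes and the class attributes.
        return (D_s @ X_y + w_p * (P_s @ H)) / (1.0 + w_p)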
[0044] Preferably, step (3) comprises:
[0045] training the cross domain dictionary of the image layer based on the image semantic attributes generated in step (2), in order to further mine image-level information and improve the generalization performance of the model, as formula (7):
\mathcal{L}_{seen} = \|Y_v - D_v^{image} X\|_F^2 + μ\|Y_s - D_s^{image} X\|_F^2,   (7)
[0046] wherein the first term is the reconstruction error of the visual space, the second term is the reconstruction error of the semantic space, D_v^{image} and D_s^{image} are the visual space dictionary and the semantic space dictionary at the image layer, respectively, X is a sparse coefficient matrix, and μ is a harmonic parameter.
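An alternating least-squares sketch of the image-layer training of formula (7); the patent again leaves the solver unspecified, so this is one standard coupled-dictionary scheme under that assumption:

    import numpy as np

    def train_image_layer(Y_v, Y_s, n_atoms=200, mu=1.0, iters=50, seed=0):
        # Alternate between the shared codes X and the two image-layer dictionaries
        # of formula (7): ||Y_v - D_v X||_F^2 + mu * ||Y_s - D_s X||_F^2.
        rng = np.random.default_rng(seed)
        D_v = rng.standard_normal((Y_v.shape[0], n_atoms))
        D_s = rng.standard_normal((Y_s.shape[0], n_atoms))
        for _ in range(iters):
            A = D_v.T @ D_v + mu * D_s.T @ D_s      # same normal equations as above
            X = np.linalg.solve(A, D_v.T @ Y_v + mu * D_s.T @ Y_s)
            G = np.linalg.pinv(X @ X.T)             # least-squares dictionary updates
            D_v = Y_v @ X.T @ G
            D_s = Y_s @ X.T @ G
            D_v /= np.maximum(np.linalg.norm(D_v, axis=0), 1e-12)  # unit-norm atoms
            D_s /= np.maximum(np.linalg.norm(D_s, axis=0), 1e-12)
        return D_v, D_s, X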
[0047] Preferably, step (4) comprises:
[0048] for comparison in the visual space:
[0049] generating a sparse coefficient X^u from the unseen category semantic attributes P_s^u through the semantic space dictionary of the image layer D_s^{image}, as formula (8):
\min_{X^u} \|P_s^u - D_s^{image} X^u\|_F^2,   (8)
[0050] then generating the representation of each category in the visual space, P_v^{ul} = D_v^{image} X^u, by using the visual space dictionary of the image layer D_v^{image}, computing the cosine distance between the test image y_v and the description P_v^{ul}[c] of each category, and judging the category of the test image by the smallest distance, as formula (9):
\min_c \mathrm{dist}(P_v^{ul}[c], y_v),   (9)
wherein dist(·,·) denotes the cosine distance;
[0051] for comparison in the sparse domain:
[0052] extracting the representation x^u of the test image in the sparse space according to the visual space dictionary of the image layer, as formula (10):
\min_{x^u} \|y_v - D_v^{image} x^u\|_F^2,   (10)
[0053] computing the cosine distance between x^u and the description X^u[c] of each category in the sparse space; the category closest to the test image is the category of the image, as formula (11):
\min_c \mathrm{dist}(X^u[c], x^u),   (11)
[0054] for comparison in the semantic space:
[0055] first, encoding the test image into x^u according to the visual space dictionary of the image layer; then, generating the semantic attribute of the image, y_s = D_s^{image} x^u, according to the semantic space dictionary of the image layer; computing the cosine distance between y_s and the semantic attributes of the categories, and judging the category of the test image by the smallest distance, as formula (12):
\min_c \mathrm{dist}(P_s^u[c], y_s).   (12)
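The three test-time comparisons can be sketched as follows; plain least squares stands in for whatever solver formulas (8) and (10) intend, and the mode names are illustrative assumptions:

    import numpy as np

    def _cos_dist(a, B):
        # Cosine distance between vector a and each column of matrix B.
        a = a / np.linalg.norm(a)
        B = B / np.linalg.norm(B, axis=0)
        return 1.0 - a @ B

    def classify(y_v, D_v_img, D_s_img, P_s_u, mode="semantic"):
        # Predict the unseen-class index of test feature y_v (formulas (8)-(12)).
        X_u = np.linalg.lstsq(D_s_img, P_s_u, rcond=None)[0]   # formula (8)
        x_u = np.linalg.lstsq(D_v_img, y_v, rcond=None)[0]     # formula (10)
        if mode == "visual":                                   # formula (9)
            P_v_ul = D_v_img @ X_u        # class descriptions in visual space
            return int(np.argmin(_cos_dist(y_v, P_v_ul)))
        if mode == "sparse":                                   # formula (11)
            return int(np.argmin(_cos_dist(x_u, X_u)))
        y_s = D_s_img @ x_u               # image semantic attribute
        return int(np.argmin(_cos_dist(y_s, P_s_u)))           # formula (12)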
[0056] Preferably, the method is tested on two image data sets for the zero-shot recognition task, the AwA data set and the aPY data set, and its recognition accuracy is compared with current mainstream zero-shot recognition models, including Structured Joint Embedding (SJE), Embarrassingly Simple Zero-Shot Learning (EZSL), Synthesized Classifiers (SYNC), Semantic Autoencoder (SAE), Coupled Dictionary Learning (CDL), Attribute Label-Embedding (ALE), Convex Combination of Semantic Embeddings (CONSE), Latent Embeddings (LATEM), and Deep Visual-Semantic Embedding (DEVISE);
[0057] AwA is an animal image data set including 50 animal categories and 30,475 images, and each category has 85 annotated attributes; the standard division for the zero-shot recognition experiment uses 40 categories as seen categories and the other 10 categories as unseen categories.
[0058] A dictionary learning means for zero-shot recognition is also provided, which comprises:
[0059] a first training module, which trains a cross domain dictionary of a category layer based on a cross domain dictionary learning method;
[0060] a generation module, which generates semantic attributes of an image based on the cross domain dictionary of the category layer learned by the first training module;
[0061] a second training module, which trains a cross domain dictionary of the image layer based on the image semantic attributes generated by the generation module; and
[0062] a recognition module, which completes a recognition task of unseen category images based on the cross domain dictionary of the image layer learned by the second training module.
[0063] Preferably, the first training module performs:
[0064] extracting a category prototype P_v of the visual space by computing the category centers of the seen category images, as formula (1):
\mathcal{L}_p = \|Y_v - P_v H\|_F^2,   (1)
[0065] wherein Y_v is a sample feature matrix and H is a sample label matrix;
[0066] forming a pair of inputs from the category prototype P_v and the category semantic attributes P_s, training the cross domain dictionary at the category layer, and establishing the relationship between the visual space and the semantic space at the category layer by constraining the category prototypes and the category semantic attributes to share the same sparse coefficients, as formula (2):
\mathcal{L}_{seen} = \|P_v - D_v X_p\|_F^2 + λ\|P_s - D_s X_p\|_F^2,   (2)
[0067] wherein the first term is the reconstruction error of the visual space dictionary, the second term is the reconstruction error of the semantic space dictionary, D_v is the visual space dictionary, D_s is the semantic space dictionary, X_p is a sparse coefficient matrix, and λ is a harmonic parameter;
[0068] introducing an adaptive loss for the unseen categories as formula (3), in order to reduce the impact of the domain gap between seen and unseen categories on model accuracy and to improve the ability of the model to recognize unseen category samples:
\mathcal{L}_{unseen} = \|P_v^u - D_v X_p^u\|_F^2 + λ\|P_s^u - D_s X_p^u\|_F^2,   (3)
[0069] wherein P_v^u is the unseen category prototype to be solved, P_s^u is the semantic attribute matrix of the unseen categories, and X_p^u is the sparse coefficient matrix of the unseen categories;
[0070] the overall loss function of the category-level model is as follows:
\mathcal{L}_{class} = \mathcal{L}_{seen} + α\mathcal{L}_{unseen} + β\mathcal{L}_p,   (4)
[0071] the training objective of the category layer is to minimize the loss function shown in formula (4) over the variables: the visual space dictionary D_v, the semantic space dictionary D_s, the seen category prototype P_v, the unseen category prototype P_v^u, the seen category sparse coefficient X_p, and the unseen category sparse coefficient X_p^u.
[0072] Preferably, the generation module performs:
[0073] generating a sparse coefficient X_y of the image by using the visual space dictionary D_v, as formula (5):
\min_{X_y} \|Y_v - D_v X_y\|_F^2 + w_x\|X_y - X_p H\|_F^2,   (5)
[0074] wherein the first term is a reconstruction error term, the second term constrains the generated image sparse coefficient to be close to the sparse coefficient generated by its category under the same visual space dictionary D_v, and w_x is a harmonic parameter;
[0075] generating a semantic attribute Y_s of the image by using the semantic space dictionary D_s and its category semantic attribute P_s, as formula (6):
\min_{Y_s} \|Y_s - D_s X_y\|_F^2 + w_p\|Y_s - P_s H\|_F^2,   (6)
[0076] wherein w_p is a harmonic parameter;
[0077] the second training module performs:
[0078] training the cross domain dictionary of the image layer based on the image semantic attributes generated by the generation module, in order to further mine image-level information and improve the generalization performance of the model, as formula (7):
\mathcal{L}_{seen} = \|Y_v - D_v^{image} X\|_F^2 + μ\|Y_s - D_s^{image} X\|_F^2,   (7)
[0079] wherein the first term is the reconstruction error of the visual space, the second term is the reconstruction error of the semantic space, D_v^{image} and D_s^{image} are the visual space dictionary and the semantic space dictionary at the image layer, respectively, X is a sparse coefficient matrix, and μ is a harmonic parameter.
[0080] Preferably, the recognition module performs:
[0081] for comparison in the visual space:
[0082] generating a sparse coefficient X^u from the unseen category semantic attributes P_s^u through the semantic space dictionary of the image layer D_s^{image}, as formula (8):
\min_{X^u} \|P_s^u - D_s^{image} X^u\|_F^2,   (8)
[0083] then generating the representation of each category in the visual space, P_v^{ul} = D_v^{image} X^u, by using the visual space dictionary of the image layer D_v^{image}, computing the cosine distance between the test image y_v and the description P_v^{ul}[c] of each category, and judging the category of the test image by the smallest distance, as formula (9):
\min_c \mathrm{dist}(P_v^{ul}[c], y_v),   (9)
[0084] for comparison in the sparse domain:
[0085] extracting the representation x^u of the test image in the sparse space according to the visual space dictionary of the image layer, as formula (10):
\min_{x^u} \|y_v - D_v^{image} x^u\|_F^2,   (10)
[0086] computing the cosine distance between x^u and the description X^u[c] of each category in the sparse space; the category closest to the test image is the category of the image, as formula (11):
\min_c \mathrm{dist}(X^u[c], x^u),   (11)
[0087] for comparison in the semantic space:
[0088] first, encoding the test image into x^u according to the visual space dictionary of the image layer; then, generating the semantic attribute of the image, y_s = D_s^{image} x^u, according to the semantic space dictionary of the image layer; computing the cosine distance between y_s and the semantic attributes of the categories, and judging the category of the test image by the smallest distance, as formula (12):
\min_c \mathrm{dist}(P_s^u[c], y_s).   (12)
[0089] To test the effectiveness of the proposed method, experiments are carried out on two image data sets (the AwA data set and the aPY data set) for the zero-shot recognition task, and the recognition accuracy is compared with current mainstream zero-shot recognition models, including SJE, EZSL, SYNC, SAE, CDL, ALE, CONSE, LATEM, and DEVISE. Table 1 and Table 2 respectively compare the zero-shot recognition accuracy of the proposed method with that of the existing methods on the two data sets.
[0090] AwA is an animal image data set that contains 50 animal categories and 30,475 images, and each category has 85 annotated attributes. The standard division for the zero-shot recognition experiment uses 40 categories as seen categories and the other 10 categories as unseen categories.
[0091] ResNet-101 is used to extract image features, and the feature dimension is 2048. The cross domain dictionary at the category layer has 40 atoms and the cross domain dictionary at the image layer has 200 atoms, with parameters λ=1, α=1, β=1, μ=1, w_x=1, and w_p=1e-10. The recognition accuracy of the proposed method and the compared methods is shown in Table 1; the proposed method obtains the highest accuracy on this data set.
TABLE 1
Method             Recognition accuracy (%)
SAE                53.0
SYNC               54.0
EZSL               58.2
ALE                59.9
SJE                65.6
CDL                69.9
Proposed method    71.0
[0092] The aPY data set contains 32 categories and 15,339 images, and each category has 64-dimensional semantic attributes. According to the standard division method, 20 categories are regarded as seen categories and the other 12 categories are regarded as unseen categories.
[0093] ResNet-101 is used to extract image features, and the feature dimension is 2048. The cross domain dictionary at the category layer has 20 atoms and the cross domain dictionary at the image layer has 200 atoms, with parameters λ=1, α=1, β=1, μ=1, w_x=1, and w_p=1. The recognition accuracy of the proposed method and the compared methods is shown in Table 2; the proposed method obtains the highest accuracy on this data set.
TABLE 2
Method             Recognition accuracy (%)
SYNC               23.9
CONSE              26.9
SJE                32.9
LATEM              35.2
EZSL               38.3
ALE                39.7
DEVISE             39.8
CDL                43.0
Proposed method    47.3
[0094] The above contents are only preferred embodiments of the present invention and do not limit the present invention in any manner. Any improvements, amendments, and alternative changes made to the above embodiments according to the technical spirit of the present invention shall fall within the claimed scope of the present invention.