PROGRESSIVE LOCALIZATION METHOD FOR TEXT-TO-VIDEO CLIP LOCALIZATION

Abstract

A progressive localization method for text-to-video clip localization. The method comprises: first, respectively extracting features of two modes, namely a video mode and a text mode by using different feature extraction methods; then progressively selecting different step sizes, and learning the correlation between the video and the text in multiple stages; and finally, training a model in an end-to-end manner based on the correlation loss of each stage. Moreover, the fine time granularity stage is fused with information of the coarse time granularity stage by means of a condition feature update module and up-sampling connection, such that different stages are mutually promoted. Different stages can pay attention to clips with different time granularities, and the model can cope with the situation that the length of a target clip is obviously changed based on the interrelation between the stages.

Claims

1. A progressive localization method for text-to-video clip localization, comprising: step 1: extracting a video feature and a text feature, respectively, by using different feature extraction methods; step 2: coarse-time-granularity localization: sampling the video feature obtained in step 1 with a first step length to generate a candidate clip; step 3: fusing the candidate clip obtained in step 2 with the text feature obtained in step 1; step 4: feeding the fused feature to a convolution neural network to obtain a coarse-grained feature map, and then obtaining a correlation score map via an FC layer; step 5: fine-time-granularity localization: sampling the video feature obtained in step 1 with a second step size, updating the features by a conditional feature update module by combining with the feature map obtained in step (4), and then generating a candidate clip, wherein the first step is greater than the second step; step 6: fusing the candidate clip in step 5 with the text features obtained in step 1, and fusing information of the last step by combining the feature map obtained in step 4 through up-sampling connection; step 7: feeding the fused features to the convolution neural network to obtain a fine-grained feature map, and then obtaining a correlation score map by an FC layer; step 8: calculating a loss value of the correlation score matrices obtained in step 4 and step 7 by using binary cross entropy loss, respectively, combining the loss value with a certain weight, and finally training a model in an end-to-end manner; and step 9: realizing text-based video clip localization by using the model trained in step 8.

2. The method according to claim 1, wherein said extracting a video feature and a text feature, respectively, in step 1 comprises: step 1-1: dividing a video into several video units at a certain interval, extracting visual features of each video unit by using a pre-trained CNN model, and finally obtaining the video features by average pooling and an FC layer; and step 1-2: transforming each word in a text into an embedding vector by using a Glove word2vec model, learning the relationship between words by a LSTM network to, and taking output features of a last hidden state as the text features.

3. The method according to claim 1, wherein step 2 comprises: step 2-1: sampling the video features obtained in step 1 with a large step size to obtain temporally ordered basic clip feature vectors, and obtaining a series of temporally continuous clips with different lengths by combining basic clips; step 2-2: selecting a candidate clip from all candicate clips by a sparse sampling strategy, and reducing redundant information as much as possible without affecting the performance of the model; step 2-3: performing a maximum pooling operation on the basic clips in each candidate clip interval to obtain a feature of the candidate clip; and step 2-4: representing features of all candidate clips by using a two-dimensional feature map, with starting and ending positions of each candidate clip corresponding to coordinates of the two-dimensional feature map, and placing a feature of each candidate clip in corresponding positions to finally obtain a two-dimensional feature map of the candidate clip.

4. The method according to claim 1, wherein in step 3, the text features and the features of the candidate clips are mapped to a same dimensional space through an FC layer, and then the fused feature is obtained through Hadamard Product and Frobenius normalization.

5. The method according to claim 1, wherein step 4 comprises: step 4-1: feeding the fused feature to a two-layer convolutional neural network to learn the correlation between the candidate clip and the text to obtain an intermediate feature map with a same shape as an input, wherein the intermediate feature map will transfer learning information to the fine time granularity localization stage; and step 4-2: feeding the intermediate feature map obtained in step 4-1 to an FC layer to obtain the correlation score map of the candidate clip in the coarse time granularity localization stage.

6. The method according to claim 5, wherein step 5 comprises: step 5-1: sampling the video feature obtained in step 1 with a second step size to obtain a series of temporally ordered basic clip feature vectors; step 5-2: updating the basic clip feature vectors by a conditional feature update module by using the intermediate feature map obtained in step 4-1, and obtaining a series of temporally continuous clips through a combination of the basic clips; wherein a current stage is desired to be focused on an area that have great relevance to the text, which has been learned in the coarse time granularity localization stage; the conditional feature update module is to update the features of the current stage by learning weights by using information in the coarse time granularity localization stage; and step 5-3: selecting candidate clips from all possible clips by sparse sampling strategy, performing the maximum pooling operation on the basic clips in each candidate clip interval to obtain the features of the candidate clip, and representing the features of all candidate clips with the two-dimensional feature map to obtain the two-dimensional feature map of the candidate clips during fine time granularity localization.

7. The method according to claim 6, wherein said updating the basic clip feature vectors by the conditional feature update module comprises: the intermediate feature map H.sup.t−1 of a stage t−1 is transformed into h.sup.t−1 by the maximum pooling operation; for the basic clip feature vector c.sub.i.sup.t of each stage t, a weight a.sub.i.sup.t is learned in combination with the information of the stage t−1, which is expressed by the formula as follows:
a.sub.i.sup.t=sigmoid(W.sub.r.sup.t.Math.(h.sup.t−1⊙c.sub.i.sup.t)+b.sub.r.sup.t), where W.sub.r.sup.t and b.sub.r.sup.t are parameters to be learned, ⊙ is Hadamard Product, and sigmoid is a nonlinear sigmoid activation function; an updated feature vector c.sub.i.sup.t of each basic clip is obtained according to a learned weight a.sub.i.sup.t, that is:
c.sub.i.sup.t=c.sub.i.sup.t⊙a.sub.i.sup.t.

8. The method according to claim 5, wherein step 6 comprises: step 6-1: fusing the candidate clip features and the text features to obtain the fused features of the stage t; and step 6-2: fusing the intermediate feature map H.sup.t−1 in the stage t−1 obtained in step 4-1 with the feature map F.sup.t in the stage t obtained in step 6-1 by up-sampling connection, and obtaining the fused feature map G.sup.t; wherein the formula is as follows:
G.sup.t=F.sup.t custom-character sigmoid({Conv.sub.k(upsample(H.sup.t−1))}.sub.n) where the subscript n indicates n-th up-sampling and convolution operations, the subscript k indicates a size of a convolution kernel, is maximum pooling element by element, and sigmoid is a nonlinear sigmoid activation function.

9. The method according to claim 1, wherein step 7 comprises: step 7-1 feeding the fused features to a two-layer convolutional neural network to learn the correlation between the candidate clip and the text, and obtaining an intermediate feature map with fine time granularity; and step 7-2 feeding the intermediate feature map obtained in step 7-1 to an FC layer to obtain the correlation score map of the candidate clips during fine-time-granularity localization.

10. The method according to claim 1, wherein in step 9, said progressively realizing the text-based video clip localization by using the trained model comprises: step 9-1 inputting a query text and a corresponding video into the model to obtain correlation score matrices with different time granularities; and step 9-2 selecting correlation score matrices of a stage with the finest granularity, sorting candidate clips according to scores, selecting a candidate clip with a highest score, and returning the position information thereof in the original video.

Description

BRIEF DESCRIPTION OF DRAWINGS

[0049] FIG. 1 is a schematic diagram of a progressive localization network structure for text-to-video clip localization; and

[0050] FIG. 2 is a schematic diagram of the structure of a conditional feature update module in the progressive localization network.

DESCRIPTION OF EMBODIMENTS

[0051] The present disclosure will be described in detail below with reference to the drawings and specific embodiments.

[0052] To solve the problem of text-to-video clip localization, the present disclosure proposes a progressive localization method for text-to-video clip localization, which is realized based on a progressive localization network, and the structure of the progressive localization network is shown in FIG. 1. The method includes the following specific steps.

[0053] Step 1, video features and text features are extracted respectively by using different feature extraction methods.

[0054] Step 1-1, for a given video, it is divided to 256 video units (note that the length of each video unit after division is different because of its different length of origin video). For each video unit, the deep features of all frames are extracted by using a convolution neural network (CNN) model trained on an ImageNet data set, the features in each video unit are merged by average pooling, and then the dimension of each video unit is reduced by an FC layer to obtain the feature vector of the video unit. In this way, the video can be described by a series of feature vectors, v={u.sub.i}.sub.i=1.sup.l.sup.v, where u.sub.i∈ custom-character .sup.d.sup.v represents the feature vector of the i.sup.th video unit, d.sup.v=512 represents the dimension of the feature vector and l.sup.v=256 represents the number of video units.

[0055] Step 1-2, given a sentence of a length l.sup.s, each word is transformed into an embedding vector using a GloVe word2vec model, and a word embedding vector sequence{w.sub.1, w.sub.2, . . . w.sub.l.sub.s} can be obtained, where wt is the embedding vector of the t.sup.th word. Because words in a sentence are closely related to each other, it is very important to model the context of each word for the full representation of text features. LSTM is a special recurrent neural network (RNN), which can learn the dependencies between words in sentences. Therefore, the embedding vector sequence is fed to the learning of a three-layer two-way LSTM network, so as to establish a sufficient connection between words, and the output vector of the last hidden state is selected as the feature vector of the text, which is represented by f.sub.s ∈ custom-character .sup.d.sup.s, where d.sup.s=512 indicates the size of each hidden state of the LSTM network.

[0056] Step 2, after the video features and text features are obtained, candidate clips need to be generated. A coarse-to-fine and progressive idea is adopted to solve the task of localizing video clips that are relevant to the given textual query. Therefore, at first, the model is allowed to learn the correlation between the candidate clips with a coarse time granularity and the text with a larger step size, which is called a coarse time granularity branching. Firstly, a feature map of candidate clips is constructed, and the specific steps are as follows:

[0057] Step 2-1, sampling the video features obtained in step 1-1 with a first step size s.sup.1 to obtain a series of basic clips, that is, C.sup.1={c.sub.i.sup.1}.sub.i=1.sup.N.sup.1, c.sub.i.sup.1 represents the feature vector of the i.sup.th basic clip, which is obtained by average pooling of the video feature vectors in the corresponding interval. Average pooling can make the basic clip retain the video information in the corresponding interval. N.sup.1=l.sup.v/s.sup.1 indicates the number of basic clips. In order to represent different stages with different step sizes, the superscript is used to represent the current stage (for example, the superscript 1 here represents the first stage). With the increase of superscript, the step size will decrease by multiple, thus obtaining the feature representation with different time granularities.

[0058] Step 2-2, theoriotically, there are Σ.sub.k=1.sup.N.sup.1 k different candidate clips for the N.sup.1 basic clips obtained in step 2-1. In order to reduce the computational cost and remove more redundant information, the model adopts the following strategy to select candidate clips: if N.sup.1≤16, all possible candidate clips are selected; otherwise, for any clip from c.sub.a.sup.1 to c.sub.b.sup.1, if G(a, b)=1 is satisfied, it will be selected. G(a, b) can be obtained by the following formula:

G(a,b)←(a mod s=0)&((b−s′)mod s=0),

[0059] where s and s′ are defined as follows:

[00001] $s = 2^{k - 1},$ $s^{'} = {\begin{matrix} 0, & if k = 1, \\ 2^{k + 2} - 1, & otherwise \end{matrix}$ $where k = .Math. \log_{2} (\frac{b - a + 1}{8}) .Math.,$

┌.Math.┐ represents an upward rounding function.

[0060] Step 2-3, maximum pooling processing is carried out on the basic clip features contained in each selected candidate clip to obtain the feature vectors of the candidate clips. For example, for a candidate clip from c.sub.a.sup.1 to c.sub.b.sup.1, its feature m.sub.a,b.sup.1=maxpool(c.sub.a.sup.1, c.sub.a+1.sup.1, . . . , c.sub.b.sup.1) Here, the maximum pooling operation is similar to a feature selection, and it is desirable that the features with better discrimination can be kept for the next step of learning.

[0061] Step 2-4, the feature vectors of all candidate clips are stored in a two-dimensional feature map according to their positions, obtaining M.sup.1 Σ custom-character .sup.N.sup.1.sup.×N.sup.1.sup.×d.sup.v, where M.sup.1[a, b,:]=m.sub.a,b.sup.1, and all invalid positions (including those that are not selected and those whose start time is less than the end time) are filled with 0.

[0062] Step 3, after obtaining the feature map of the candidate clip, it is necessary to combine the information of the text. The specific steps are as follows:

[0063] Firstly, the text features and candidate clip features are mapped to a d.sup.u=512−dimensional space through an FC layer respectively, and then a fused feature F.sup.1 is obtained through Hadamard Product and Frobenius normalization. The above process is expressed as:

F.sup.1=∥(W.sub.v.Math.M.sup.1)⊙(W.sub.s.Math.f.sub.s.Math.1.sup.T)∥.sub.F

[0064] where W.sub.v and W.sub.s are the parameters to be learned in the FC layer of candidate clip features and text features, respectively, 1.sup.T represents the row vector of all 1, and ⊙ and ∥.Math.∥F represent Hadamard Product and Frobenius normalization, respectively.

[0065] Step 4, the fused features are fed to a convolution neural network to obtain the feature map, and then a correlation score map is obtained by a fully connected FC layer. The specific steps are as follows:

[0066] Step 4-1, the fused feature F.sup.1 is fed to a two-layer convolutional neural network to learn the correlation between the candidate clips and the text to obtain a feature map H.sup.1. In a two-layer convolutional neural network, a 5×5 convolution kernel is used. At the same time, as the shape of the map H.sup.1 is related to the position of the candidate clip, padding is used in the convolution process to remain the output size unchanged.

[0067] Step 4-2, through the leaning of the convolution neural network, the model has learned the correlation between the candidate clips and the text, and this information is stored in the feature map H.sup.1. In order to make the correlation information in the feature map clearer, the feature map H.sup.1 is fed to an FC layer to get a correlation score map P.sup.1 ∈ custom-character .sup.N.sup.1.sup.×N.sup.1. The value of each position in the feature map P.sup.1 represents the correlation between the candidate clip at the current position and the textual query. The higher the score, the more relevant the candidate clip is to the text. The correlation score map can be expressed by the formula as follows:

P.sup.1=W.sup.1.Math.H.sup.1+b.sup.1

[0068] where W.sup.1 and b.sup.1 are the parameters to be learned.

[0069] Step 5, obtaining the correlation score of the coarse time granularity branch means that the location of the current branch has been completed, but this branch only pays attention to the coarse-grained candidate clips and cannot cope with the shorter target clips. The progressive localization network also has a fine time granularity localization branch, which solves the defects of the first branch and pays attention to those short target clips. Of course, the two branches are not independent learning. A conditional feature update module and the up-sampling connection are designed to connect the two branches.

[0070] For the learning of fine time granularity branch, the feature map of candidate clips is first constructed and the specific steps are as follows:

[0071] Step 5-1, the video features obtained in step 1-1 are sampled with a second step size s.sup.2 (relative to step 2-1) to obtain a basic clip feature vector C.sup.2={c.sub.i.sup.2}.sub.i=1.sup.N.sup.2. In this embodiment, the first step size is 10 seconds and the second step size is 2 seconds.

[0072] Step 5-2, before generating candidate clips, the information learned in the previous branch is used for the first time; the feature map H.sup.1 obtained in step 4-1 implies the correlation between the candidate clips and the query text, and it is desirable to update C.sup.2 in combination with the correlation; the basic clip feature vector C.sup.2 is updated by the conditional feature update module, as shown in FIG. 2; the specific steps are as follows:

[0073] First of all, H.sup.1∈ custom-character .sup.d.sup.u.sup.×N.sup.1.sup.×N.sup.1 is converted into h.sup.1∈.sup.d.sup.u.sup.×N.sup.1 by a maximum pooling operation.

[0074] Then, for each c.sub.i.sup.2, a weight a.sub.i.sup.2 is learned by combining the information of the previous branch, which is expressed as follows:

a.sub.i.sup.2=sigmoid(W.sub.r.sup.2.Math.(h.sup.1⊙c.sub.i.sup.2)+b.sub.r.sup.2),

[0075] where W.sub.r.sup.2 and b.sub.r.sup.2 represent parameters to be learned, ⊙ represents Hadamard Product, and sigmoid represents a nonlinear sigmoid activation function.

[0076] Finally, the learned weights can be used to get the updated feature vector c.sub.i.sup.2 of each basic clip, namely:

c.sub.i.sup.2=c.sub.i.sup.2⊙a.sub.i.sup.2.

[0077] With the aid of the conditional feature update module, the correlation information learned by the coarse time granularity branch is passed to the fine time granularity branch, so that those areas with stronger correlation can get more attention.

[0078] Step 5-3, the feature vector C.sup.2 of the updated basic clip is obtained, and the two-dimensional feature map M.sup.2 of the fine time granularity branch is obtained by the methods of steps 2-2, 2-3 and 2-4.

[0079] Step 6, similarly, the feature of the candidate clips of the fine time granularity branch needs to be fused with the given text. After that, the information of the coarse time granularity branch will be used for the second time. The specific steps are as follows:

[0080] Step 6-1, the candidate clip feature map and text features are fused to obtain fused feature F.sup.2 by the method of step 3;

[0081] Step 6-2, the relevant information of the previous branch is indirectly utilized by learning a weight in Step 5-2; here, it is desirable to use them more directly, so up-sampling connection is designed; the details are as follows:

[0082] First of all, it should be clear that it is the feature map H.sup.1 learned by convolutional neural network in step 4-1 that contains relevant information in the previous step;

[0083] Next, it is noted that the shapes of H.sup.1 and F.sup.2 are different due to the different step sizes of the two branches, and the shape of F.sup.2 should be larger.

[0084] Therefore, first, H.sup.1 is upsampled to make the two shapes consistent; and then the up-sampled H.sup.1 is fed to a two-layer convolutional neural network (Conv.sub.k, where the subscript k indicates the size of the convolutional kernel, which can be 3).

[0085] After performing up-sampling and convolution operations n times, the shapes of H.sup.1 and F.sup.2 are consistent, and then the activation function sigmoid is applied.

[0086] Finally, it is fused with the fusion feature F.sup.2 through element-by-element maximum pooling ( custom-character ) to obtain the feature map G.sup.2.

[0087] The above process can be expressed as:

G.sup.2=F.sup.2 custom-character sigmoid({Conv.sub.k(upsample(H.sup.1))}.sub.n)

[0088] where the subscript n indicates n-th up-sampling and convolution operations, which also means that the shape of H.sup.1 is expanded by

[00002] $\frac{\frac{N^{2}}{N^{1}}}{n}$

times in each up-sampling.

[0089] The feature update module makes the fine granularity branch pay more attention to the video features through constraints, and the up-sampling connection keeps the features with better recognition in the candidate clips of the two stages by way of feature selection, so that the model has better localization ability.

[0090] Step 7, the same method as step 4-2 is adopted, and the correlation score map P.sup.2 of the fine time granularity branch is obtained by the two-layer convolutional neural network and the FC layer.

[0091] Step 8, after the above steps, each branch gets a correlation score map which reflect s the correlation between the candidate clip and the query text. A binary cross entropy function is used to calculate the localization loss of each branch:

[00003] $Λ_{t} = - \frac{1}{V^{t}} {.Math.}_{i = 1}^{V^{t}} y_{i}^{t} \log p_{i}^{t} + (1 - y_{i}^{t}) \log (1 - p_{i}^{t}),$

[0092] where p.sub.i.sup.t ∈sigmoid(P.sup.t) indicates the predicted label between 0 and 1 converted according to the correlation score of candidate clips, V.sup.t is the number of candidate clips at the stage t, V.sup.t is the ground-truth label of each candidate clip at the stage t,. In application, the idea of “soft label” is used, that is, the ground-truth label of each candidate clip is not all “either 1 or 0”, and they are classified according to the Intersection over Union (IoU) o.sub.i.sup.t of the candidate clip and the target clip, which can be represented as:

[00004] $y_{i}^{t} = {\begin{matrix} 0, & o_{i}^{t} \leq τ \\ \frac{o_{i}^{t} - τ,}{1 - τ}, & others \end{matrix}$

[0093] where τ is a threshold, which can be 0.5.

[0094] Finally, the total loss of T branches of the model can be expressed as:

custom-character =Σ.sub.t−1.sup.Tλ.sub.t.sub.t,

[0095] where λ.sub.t represents the weight of the stages t.

[0096] With the total loss function, the progressive localization network model can be trained in an end-to-end manner.

[0097] It is worth noting that, due to the sparse sampling strategy in step 2-2, the scores in the correlation score map are not all valid. A map with the same shape as the candidate clip feature map is used to record each valid position, and the final correlation score will be filtered by the recording map of each branch.

[0098] In addition, at last, the model can generate several correlation score matrices with different time granularities, and the fine time granularity branch often has better performance through conditional feature update module and up-sampling connection.

[0099] Step 9, by the training in step 8, the model has learned how to select the one most relevant to the query text from the candidate clips. Given a query and a corresponding video, the model can finally output the start and end time of the most relevant clip in the video with respect to the given query. The steps are as follows:

[0100] Step 9-1, given text and video are input into the model, and several correlation score matrices with different time granularities can be obtained;

[0101] Step 9-2, the score map of the branch with the finest granularity is selected; after the invalid scores are filtered by the recording map, they are sorted according to the scores, the one with the highest score is selected, and converted into the original time according to its coordinates and the step size of previous sampling; and the result is returned.

[0102] The concrete implementation steps of the progressive localization network with two branches have been introduced above. In practical application, the branches can be increased by reasonably selecting the step size, thus obtaining an optimal effect.

PROGRESSIVE LOCALIZATION METHOD FOR TEXT-TO-VIDEO CLIP LOCALIZATION

Inventors

Cpc classification

Classification Explorer

G06V20/47

PHYSICS

Classification Explorer

G06V20/41

PHYSICS

Classification Explorer

G06V10/82

PHYSICS

Classification Explorer

G06V20/635

PHYSICS

Classification Explorer

G06V20/70

PHYSICS

Classification Explorer

G06V10/454

PHYSICS

Classification Explorer

G06V20/46

PHYSICS

Classification Explorer

G06V20/40

PHYSICS

Classification Explorer

G06V20/49

PHYSICS

Classification Explorer

G06F18/253

PHYSICS

Classification Explorer

G06V30/19173

PHYSICS

Classification Explorer

G06V10/764

PHYSICS

Classification Explorer

G06V10/806

PHYSICS

International classification

Classification Explorer

G06V10/80

PHYSICS

Classification Explorer

G06V20/40

PHYSICS

Classification Explorer

G06V10/82

PHYSICS

Abstract

Claims

Description