METHOD FOR IDENTIFYING A VIDEO FRAME OF INTEREST IN A VIDEO SEQUENCE, METHOD FOR GENERATING HIGHLIGHTS, ASSOCIATED SYSTEMS
20210390316 · 2021-12-16
Inventors
Cpc classification
G06V20/41
PHYSICS
A63F2300/572
HUMAN NECESSITIES
International classification
Abstract
A method for automatically generating a multimedia event on a screen by analyzing a video sequence includes acquiring a plurality of time-sequenced video frames from an input video sequence; applying a learned convolutional neural network to each video frame of the acquired time-sequenced video frames for outputting feature vectors, the learned convolutional neural network being learned by a method for training a neural network that includes applying a convolutional neural network to some video frames for extracting time-sequenced feature vectors; applying a learned transformation function that produces at least one predictive feature vector from a subset of the extracted time-sequenced feature vectors; classifying each feature vector according to different classes in a feature space, the different classes defining a frame classifier; extracting the video frames that correspond to feature vectors which are classified in one predefined class of the classifier.
Claims
1. A method for automatically generating a multimedia event on a screen by analyzing a video sequence, the method comprising: acquiring a plurality of time-sequenced video frames from an input video sequence; applying a learned convolutional neural network to each video frame of the acquired time-sequenced video frames for outputting feature vectors, said learned convolutional neural network being learned by a method for training a neural network that comprises: applying a convolutional neural network to some video frames for extracting time-sequenced feature vectors; applying a recurrent neural network that produces at least one predictive feature vector from a subset of the extracted time-sequenced feature vectors; calculating a loss function, said loss function comprising a computation of a contrastive distance between: a first distance computed between a predicted feature vector and an extracted feature vector for a same-related time sequence video frame and; a second distance computed between the predicted feature vector for the same related time sequence video frame and one extracted feature vector, updating the parameters of the convolutional neural network and the parameters of the recurrent neural network in order to minimize the loss function, classifying each feature vector according to different classes in a feature space, said different classes defining a video frame classifier; extracting the video frames that correspond to feature vectors which is classified in one predefined class of the classifier.
2. The method according to claim 1, wherein applying a learned convolutional neural network to each video frame of the acquired time-sequenced video frames for outputting feature vectors is followed by a step of: applying a learned transformation function to each of the feature vectors, said learned convolutional neural network and learned transformation function being learned by a method for training a neural network that comprises: applying a convolutional neural network to some video frames for extracting time-sequenced feature vectors; applying a learned transformation function that produces at least one predictive feature vector from a subset of the extracted time-sequenced feature vectors; classifying each feature vector according to different classes in a feature space, said different classes defining a video frame classifier or a video sequence classifier; extracting a new video sequence comprising at least one video frame that corresponds to feature vectors which are classified in one predefined class of the video sequence classifier or the video frame classifier.
3. The method according to claim 1, wherein the method comprises: detecting at least one feature vector corresponding to at least one predefined class from a video frame classifier or a video sequence classifier; generating a new video sequence automatically comprising at least one video frame corresponding to the at least detected feature vector according to the predefined class, said video sequence having a predetermined duration.
4. The method according to claim 1, wherein the video sequence comprises: aggregating video sequences corresponding to a plurality of detected feature vectors according to at least one predefined class, said video sequence having a predetermined duration and/or; aggregating video frames corresponding to a plurality of detected feature vector according to at least two predefined classes, said video sequence having a predetermined duration.
5. The method according to claim 2, wherein the extracted video is associated with: a predefined audio sequence which is selected in accordance with at least one predefined class of the classifier; or a predefined visual effect which is applied in accordance with at least one predefined class of the classifier.
6. The method according to claim 1, wherein the method for training a neural network, comprises: acquiring a first set of videos; acquiring a plurality of time-sequenced video frames from a first video sequence from the above-mentioned first set of videos; applying a convolutional neural network to each video frame of the acquired time-sequenced video frames for extracting time-sequenced feature vectors; applying a learned transformation function that produces at least one predictive feature vector from a subset of the extracted time-sequenced feature vectors, said learned transformation function being repeated for a plurality of subsets; calculating a loss function, said loss function comprising a computation of a distance between each predicted feature vector and each extracted feature vector for a same-related time sequence video frame; updating the parameters of the convolutional neural network and the parameters of the learned transformation function in order to minimize the loss function.
7. The method according to claim 6, wherein each video of the first set of videos is video extracted from a computer program having a predefined images library and code instructions that, when applied by said computer program, produced a time-sequenced video scenario.
8. The method according to claim 6, wherein the time-sequenced video frames are extracted from a video at a predefined interval of time.
9. The method according to claim 6, wherein the subset of the extracted time-sequenced feature vectors is a selection of a predefined number of time-sequenced feature vectors and the at least one predictive feature vector correspond(s) to the next feature vector in the sequence of the selected time-sequenced feature vectors.
10. The method according to claim 6, wherein the loss function comprises aggregating each computed distance.
11. The method according to claim 6, wherein the loss function comprises computing a contrastive distance between: a first distance computed between a predicted feature vector and an extracted feature vector for a same-related time sequence video frame and; a second distance computed between the predicted feature vector for the same related time sequence video frame and one extracted feature vector corresponding to a previous time sequence video frame, said previous time sequence video frame being selected beyond or after a predefined time window centered on the instant of the same related time sequence video frame or; one extracted feature vector corresponding to a time sequence video frame of another video sequence, and comprises aggregating each contrastive distance computed for each time sequence feature vector, said aggregation defining a first set of inputs.
12. The method according to claim 6, wherein the loss function comprises computing a contrastive distance between: a first distance computed between a predicted feature vector and an extracted feature vector for a same-related time sequence video frame and; a second distance computed between the predicted feature vector for the same related time sequence video frame and one extracted feature vector chosen in an uncorrelated time window, said uncorrelated time window being defined out of a correlation time window, said correlation time window comprising at least a predefined number of time sequenced feature vectors in a predefined time window centered on the instant of the same related time sequence video frame, and comprises aggregating each contrastive distance computed for each time sequence feature vector, said aggregation defining a first set of inputs.
13. The method according to claim 6, wherein the parameters of the convolutional neural network and/or the parameters of the learned transformation function are updated by considering the first set of inputs in order to minimize the distance function.
14. The method according to claim 6, wherein the learned transformation function is a recurrent neural network.
15. A non-transitory computer-readable medium that comprises software code portions for the execution of the method according to claim 1.
Description
BRIEF DESCRIPTION OF FIGURES
[0068]
[0069]
[0070]
[0071]
[0072]
[0073]
[0074]
DETAILED DESCRIPTION
[0075] In the following description some of the following terminology and definitions are used.
[0076] Video frames are noted with the following convention:
[0077] {vf.sub.k}.sub.k∈[1; N]: a plurality of acquired video frames as inputs of the method;
[0078] vf.sub.k: one acquired video frame as an input of the method;
[0079] . . . vf.sub.i−1, vf.sub.i, vf.sub.i+1 . . . : successive acquired video frames;
[0080] vf.sub.p: one extracted video frame as an output of the method, said video frames being classified in a classifier. These video frames may also be considered as video frames of interest.
[0081] Feature vectors extracted from the convolutional neural network CNN are noted with the following convention:
[0082] f.sub.k: one feature vector computed by a convolutional neural network CNN corresponding to the acquired video frame vf.sub.k; correspondence should be understood as meaning the same timestamp in the time-sequenced video frames;
[0083] . . . f.sub.i−1, f.sub.i, f.sub.i+1 . . . : successive feature vectors corresponding to a sequence of acquired video frames vf.sub.i−1, vf.sub.i, vf.sub.i+1;
[0084] f.sub.p: one extracted feature vector of the convolutional neural network as an output of the method, the extracted feature vector being classified in a classifier and corresponding to the extracted video frame vf.sub.p;
[0085] pf.sub.i: one feature vector predicted by the learned transformation function, which is used by the loss function or the contrastive loss function.
[0086] Feature vectors extracted from the learned transformation function LTF.sub.1 are noted with the following convention:
[0087] . . . o.sub.i−1, o.sub.i, o.sub.i+1 . . . : successive feature vectors outputted from a learned transformation function LTF.sub.1 corresponding to the successive feature vectors f.sub.i−1, f.sub.i, f.sub.i+1, which themselves correspond to a sequence of acquired video frames vf.sub.i−1, vf.sub.i, vf.sub.i+1;
[0088] o.sub.p: one extracted feature vector of the learned transformation function LTF.sub.1 which is classified by the method of the invention.
[0089] The convolutional neural network used in the application method, described in
[0090] The convolutional neural network used in the learning method, described in
[0091] More generally, the properties of a convolutional neural network used in an application method or in a learning method, described in
[0092]
[0093] The first step of the method, noted ACQ, comprises the acquisition of a plurality of time-sequenced video frames {vf.sub.k}.sub.k∈[1; N] from an input video sequence VS.sub.1. The time-sequenced video frames are noted vf.sub.k and are called video frames in the description. Each video frame vf.sub.k is an image that is, for example, quantized into pixels in an encoded, predefined digital format such as jpeg or png (portable network graphics), or any digital format that allows encoding a digital image.
[0094] Video Frame
[0095] According to an embodiment of the invention, the full video sequence VS.sub.1 is segmented into a plurality of video frames vf.sub.k that are all processed by the method of the invention. According to another embodiment, the selected video frames {vf.sub.k}.sub.k∈[1; N] in the acquisition step ACQ of the method are sampled from the video sequence VS.sub.1 according to a predefined sampling frequency. For example, one video frame vf.sub.k is acquired every second for being processed in the further steps of the method.
[0096] The video may be received by any interface, such as a communication interface, a wireless interface or a user interface. The video may be recorded in a memory before being segmented.
[0097] For instance, assuming a video sequence VS.sub.1 of 10 min encoded at 25 images/s, the total number of images is about 15000. The sampling frequency is set to 1 frame over 25 images, which is equivalent to considering one frame every second. The acquisition step then comprises the acquisition of a time-sequence of N video frames for a single video, with N=10×60×25/25=600 video frames in this example. According to an example, a training dataset might have N frames per video, the number of videos being of the order of several hundred or several thousand. Within a single training example, the number of frames may be of the order of 10 or 20.
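By way of non-limiting illustration, the arithmetic of the preceding example can be checked with the short sketch below. The function name sampled_frame_indices and its parameters are assumptions introduced for illustration only and are not part of the described method.

```python
def sampled_frame_indices(duration_s, fps, step):
    """Return the indices of the images kept when one frame is taken every `step` images."""
    total_images = duration_s * fps          # 10 min at 25 images/s gives 15000 images
    return list(range(0, total_images, step))


# 10-minute sequence encoded at 25 images/s, sampled at 1 frame over 25 images
indices = sampled_frame_indices(duration_s=10 * 60, fps=25, step=25)
print(len(indices))  # N = 10 * 60 * 25 / 25 = 600
```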
[0098] According to an example, a pre-detection algorithm is implemented in order to select some specific segments of the video sequence VS.sub.1. These segments may be sampled for acquiring video frames vf.sub.k. According to an example, the sampling frequency may be variable in time. Some labeled timestamps on the video sequence VS.sub.1 may be used for acquiring more video frames in a first segment of the video sequence VS.sub.1 than in a second one. For example, a beginning of a video sequence VS.sub.1 may be sampled with a low sampling frequency and the end of stage in a level of a video game VG.sub.1 may be sampled with a higher sampling frequency.
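As a purely illustrative sketch of a sampling frequency that varies in time, the code below samples labeled segments of a video sequence with different periods. The segment boundaries, the sampling periods and the function name variable_rate_timestamps are hypothetical values chosen for the example.

```python
def variable_rate_timestamps(segments):
    """segments: list of (start_s, end_s, period_s) tuples.

    Returns the timestamps, in seconds, at which video frames are acquired.
    """
    timestamps = []
    for start_s, end_s, period_s in segments:
        count = int((end_s - start_s) / period_s)   # number of frames in this segment
        timestamps.extend(start_s + k * period_s for k in range(count))
    return timestamps


# Hypothetical labeled segments: the beginning of the sequence is sampled
# every 2 s, the end of a stage is sampled every 0.5 s.
segments = [(0.0, 60.0, 2.0), (540.0, 600.0, 0.5)]
timestamps = variable_rate_timestamps(segments)
```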
[0099] The video frames vf.sub.k are used to detect frames of interest vf.sub.p, also called FoI, which is detailed in
[0100] The video frames vf.sub.k may be acquired from a unique video sequence VS.sub.1 when the method is applied for generating some highlights of said video sequence VS.sub.1 or from a plurality of video sequences VS.sub.1, for example, when the method is implemented in a training process of a neural network.
[0101] Convolutional Neural Network
[0102] The second step of the method of the invention, noted APPL1_CNN on
[0103] The CNN processes each acquired video frame vf.sub.k and is able to extract some feature vectors {f.sub.k}.sub.k∈[1; N]. A feature vector f.sub.k may be represented in a feature space such as the one represented in
[0104] The CNN may be any convolutional neural network comprising a multilayer architecture based on the application of successive transformation operations, such as convolutions, between said layers. Each input of the CNN, i.e. each video frame vf.sub.k, is processed through the successive layers by the application of transformation operations. The CNN thus converts an image into a vector.
[0105] The goal of a CNN is to transform video frames as inputs of the neural network into a feature space that allows a better classification of the transformed inputs by a classifier VfC, VsC. Another goal is that the transformed data is used to train the neural network in order to increase the recognition of the content of the inputs.
[0106] In an embodiment, the CNN comprises a convolutional layer, a non-linearity or rectification layer, a normalization layer and a pooling layer. According to different embodiments, the CNN may comprise a combination of one or more of the aforementioned layers. According to an embodiment, the CNN comprises a backpropagation process for training the model by modifying the parameters of each layer of the CNN. Other derivative architectures may be implemented according to different embodiments of the invention.
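The chaining of a convolutional layer, a rectification layer and a pooling layer may be sketched as follows, as a minimal illustration only. The single-channel image, the 2×2 kernel and the function names are assumptions made for the example; the normalization layer is omitted for brevity, and a practical CNN would stack many such layers with learned kernels.

```python
def conv2d(image, kernel):
    """Valid 2D convolution (cross-correlation form) of a 2D list by a 2D kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def relu(fmap):
    """Rectification layer: negative activations are set to zero."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool2x2(fmap):
    """Pooling layer: keep the maximum of each non-overlapping 2x2 block."""
    return [[max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
             for c in range(0, len(fmap[0]) - 1, 2)]
            for r in range(0, len(fmap) - 1, 2)]


def tiny_cnn(image, kernel):
    """Convert an image into a flat feature vector."""
    pooled = max_pool2x2(relu(conv2d(image, kernel)))
    return [v for row in pooled for v in row]


# Illustrative 5x5 single-channel image and an arbitrary 2x2 kernel
image = [[float((r * 5 + c) % 7) for c in range(5)] for r in range(5)]
kernel = [[1.0, 0.0], [0.0, -1.0]]
feature_vector = tiny_cnn(image, kernel)   # 5x5 -> 4x4 -> 2x2 -> vector of 4 values
```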
[0107] According to an implementation, the incoming video frames vf.sub.k, which are processed by the learned convolutional neural network CNN.sub.L, are gathered by successive batches of N incoming video frames, for instance as it is represented in
[0108] In other examples, the CNN may be configured so that batches comprise between 2 and 25 video frames. According to an example, the batch comprises 4 frames or 6 frames.
[0109] According to an embodiment, the CNN is learned to output a plurality of successive feature vectors f.sub.i−1, f.sub.i, f.sub.i+1, each feature vector being timestamped according to the acquired time-sequenced video frames vf.sub.i−1, vf.sub.i, vf.sub.i+1. The weights of the CNN, and more generally the other learned parameters of the CNN and the configuration data that describe the architecture of the CNN, are recorded in a memory that may be in a server on the Internet, in the cloud or in a dedicated server. For some applications, the memory is a local memory of one computer.
[0110] The learned CNN.sub.L is trained before or during the application of the method of the invention according to
[0111] The feature vectors f.sub.i that are computed by the convolutional neural network CNN.sub.1 and the learned transformation function LTF.sub.1 may be used to train the learned convolutional neural network CNN.sub.L model and possibly the RNN model when it is also implemented in the method according to
[0112] This training process outputs a learned neural network which is implemented in a video processing application and which minimizes data processing in an automatic video editing process. Such a learned neural network allows relevant highlights of a video game to be outputted. This method also increases the relevance of the video frame classifier VfC that is built in the process.
[0113] Backpropagation
[0114]
[0115] In other words, the weights of the CNN.sub.L, and the learned transformation function LTF.sub.1 when implemented, are updated simultaneously via backpropagation. The updates are derived, for example, by backpropagating a contrastive loss throughout the neural network.
[0116] The backpropagation of the method that is used to train the CNN.sub.L or the CNN.sub.L+ RNN may be realized thanks to a contrastive loss function CLF.sub.1 that predicts a target in order to compare an output with a predicted target. The backpropagation then comprises updating the parameters in the neural network, in a way such that the next time the same input goes through the network, the output will be closer to the desired target.
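The principle of a parameter update that brings the output closer to the desired target can be illustrated on a single weight. This is a simplified sketch: the one-parameter model, the squared-error loss (used here as a stand-in for the contrastive loss function CLF.sub.1), the learning rate and the name sgd_step are all assumptions made for the example.

```python
def sgd_step(w, x, target, lr=0.1):
    """One gradient-descent update of a single weight w for the loss (w*x - target)^2."""
    output = w * x
    grad_w = 2.0 * (output - target) * x    # derivative of the loss with respect to w
    return w - lr * grad_w


w, x, target = 0.5, 2.0, 3.0
error_before = abs(w * x - target)          # distance to the target before the update
w = sgd_step(w, x, target)
error_after = abs(w * x - target)           # the same input now yields a closer output
```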
[0117] Classifier
[0118] A third step is a classifying step, noted CLASS. This step comprises classifying each extracted feature vector f.sub.p, or the respective extracted video frame vf.sub.p, according to different classes C{p}.sub.p∈[1, Z] of a video frame classifier VfC in a feature space. The different classes C{p} together define the video frame classifier.
[0119] A fourth step is an extraction step, noted EXTRACT(vf). This step comprises extracting the video frames vf.sub.p that correspond to feature vectors f.sub.p which are classified in at least one class C{p} of the classifier. In the scope of the invention, the extracting step may correspond to an operation of marking, identifying, or annotating these video frames vf.sub.p. The annotated video frames vf.sub.p may be used, for example, in an automatic film editing operation for gathering annotated frames of one class of the classifier VfC in order to generate a highlight sequence.
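The classifying step CLASS and the extraction step EXTRACT(vf) may be sketched as follows. The nearest-centroid rule used here is only an assumed stand-in for the video frame classifier VfC, whose actual form is not fixed by this sketch; the centroids, the feature values and the function names are hypothetical.

```python
import math


def classify(feature, centroids):
    """Return the index p of the class whose centroid is nearest to the feature vector."""
    return min(range(len(centroids)), key=lambda p: math.dist(feature, centroids[p]))


def extract_frames_of_interest(features, centroids, class_of_interest):
    """Return the indices k of the frames whose feature vector falls in the class of interest."""
    return [k for k, f in enumerate(features)
            if classify(f, centroids) == class_of_interest]


centroids = [[0.0, 0.0], [5.0, 5.0]]                       # two illustrative classes
features = [[0.1, 0.2], [4.8, 5.1], [0.3, -0.1], [5.2, 4.9]]
frames_of_interest = extract_frames_of_interest(features, centroids, class_of_interest=1)
print(frames_of_interest)  # [1, 3]
```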
[0120] According to the example of
[0121] For instance, the classifier VfC may comprise classes of interest CoI comprising video frames vf.sub.p related to highlights of a video sequence VS.sub.1. Highlights may appear, for example, at times when many events occur at about the same time in the video sequence VS.sub.1, when a user changes level during gameplay, when different user avatars meet in a scene during high intensity action, when there is a collision of a car, ship or plane, or a death of an avatar, etc. A benefit of the classifier of the invention is that classes are dynamically defined in a training process that corresponds to many scenarios which are difficult to enumerate or anticipate.
[0122] According to some embodiments, different methods can be used for generating a short video when considering a specific extracted video frame vf.sub.p. The length of the video subsequence of interest SSoI can be a few seconds. For example, the duration of the SSoI may be in the range of 1 s to 10 s. The SSoI may be generated so that the video frame of interest vf.sub.p is placed at the middle of the SSoI, or placed at ⅔ of the duration of the SSoI. In an example, the SSoI may start or finish at the FoI.
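As an illustration of the placement of the frame of interest within the SSoI, the sketch below computes the start and end times of a subsequence from the timestamp of the frame of interest. The function name ssoi_window, the parameter `position` (0.5 for the middle, 2/3 for two thirds, 0.0 or 1.0 for a subsequence starting or finishing at the frame of interest) and the clamping to the bounds of the video are assumptions made for the example.

```python
def ssoi_window(foi_time_s, duration_s, position=0.5, video_length_s=None):
    """Return (start, end), in seconds, of a subsequence of `duration_s` seconds
    in which the frame of interest sits at fraction `position` of the duration."""
    start = foi_time_s - position * duration_s
    end = start + duration_s
    if start < 0.0:                                   # clamp to the start of the video
        start, end = 0.0, duration_s
    if video_length_s is not None and end > video_length_s:
        end = video_length_s                          # clamp to the end of the video
        start = max(0.0, end - duration_s)
    return start, end


centered = ssoi_window(100.0, 6.0)                    # frame of interest in the middle
starting = ssoi_window(100.0, 6.0, position=0.0)      # subsequence starts at the frame
finishing = ssoi_window(100.0, 6.0, position=1.0)     # subsequence finishes at the frame
```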
[0123] According to an embodiment, some visual effects may be integrated during the SSoI such as slowdown(s) or acceleration(s), zoom on the user virtual camera, including video of the subsequence generated by another virtual camera different from the user's point of view, an inscription on the picture, etc.
[0124] According to an embodiment, the duration of the SSoI depends on the class wherein the FoI is selected. For instance, the classifier VfC or VsC may comprise different classes of Interest C{p}: a class with high intensity actions, class with new appearing events, etc. Some implementations take advantage of the variety of classes that is generated according to the method of the invention. The SSoI may be generated taking into account classes of the classifier. For example, the duration of the SSoI may depend on the classes, the visual effects applied may also depend on the classes, the order of the SSoI in a video montage may depend on the classes, etc.
[0125] According to an example, a video subsequence of interest SSoI is generated when several video frames of interest vf.sub.p are identified in the same time period. When a time period, for example of a few seconds, comprises several FoI, an SSoI is automatically generated. This solution may be implemented when some FoI of different classes are detected in the same lapse of time during the video sequence VS.sub.1.
[0126] An application of the invention is the automatic generation of films that results from the automatic selection of several extracted video sequences according to the method of the invention. Such films may comprise automatic aggregations of audio sequences, visual effects, written inscriptions such as titles, etc., depending on the classes in which said extracted video sequences are selected.
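The generation of a subsequence when several frames of interest fall within the same time period may be sketched as follows. The grouping rule, the default period of 3 seconds, the minimum count of 2 and the name group_fois are hypothetical choices made for illustration.

```python
def group_fois(foi_times_s, period_s=3.0, min_count=2):
    """Group frame-of-interest timestamps lying within `period_s` of the first one
    of the group; return (first, last) for each group of at least `min_count` frames."""
    groups, current = [], []
    for t in sorted(foi_times_s):
        if current and t - current[0] > period_s:
            if len(current) >= min_count:
                groups.append((current[0], current[-1]))
            current = []                    # start a new group with this timestamp
        current.append(t)
    if len(current) >= min_count:           # flush the last group
        groups.append((current[0], current[-1]))
    return groups


print(group_fois([10.0, 11.0, 12.5, 40.0, 75.0, 76.0]))  # [(10.0, 12.5), (75.0, 76.0)]
```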
[0127] Learned Transformation Function
[0128]
[0129] In an embodiment, the learned transformation function LTF.sub.1 is a recurrent neural network, also noted RNN. The RNN is implemented so as to process the outputs f.sub.i of the learned convolutional neural network CNN.sub.L in order to output new feature vectors o.sub.i. A benefit of implementing a recurrent neural network RNN is that it temporally aggregates the transformed data within its own feature extraction process. The connections between the nodes of an RNN allow the temporal dynamic behavior of the acquired time-sequenced video frames to be modeled. The performance of the classifier is increased by taking into account the temporal neighborhood of a video frame.
[0130] According to different examples, the RNN may be one of the following variants: fully recurrent type, Elman network and Jordan network types, Hopfield type, independently recurrent (IndRNN) type, recursive type, neural history compressor type, second order RNN type, long short-term memory (LSTM) type, gated recurrent unit (GRU) type, bi-directional type or continuous-time type, recurrent multilayer perceptron network type, multiple timescales model type, neural Turing machine type, differentiable neural computer type, neural network pushdown automata type, memristive network type, transformer type.
[0131] According to the invention, the RNN aims to continuously output a prediction of the feature vector of the next frame. This prediction function may be applied continuously to a batch of feature vectors f.sub.i that is outputted by the CNN.sub.L. The RNN may be configured for predicting one output vector over a batch of N−1 incoming feature vectors in order to apply, in a further step, a loss function LF.sub.1, such as a contrastive loss function CLF.sub.1.
[0132] The implementation of an RNN, or more generally a learned transformation function LTF.sub.1, is used for training the learned neural network of the method of
[0133] The method of
[0134] It is to be noted that in the example of
[0135] A video sequences classifier VsC may be implemented so that it includes a selecting step of classified SSoI. This is an alternative to the previous embodiments wherein subsequences of interest SSoI were generated from selected FoI from a video frame classifier VfC.
[0136]
[0137] According to an embodiment, the method of
[0138] An example of an algorithm describing the training loop for an example of a specific implementation of the method of the invention is detailed here after.
[0139] In that example, a database of 10-second video clips from a single videogame, sampled at 1 frame per second, is considered. The following sequence is processed until convergence or until training is otherwise complete.
TABLE-US-00001
For each batch of B video clips in random_shuffle(database):
    let X = the batch of image sequences    # X.shape == (B, 10, 3, h, w) == (batch_size, seq_len, RGB, height, width)
    F = CNN(X)              # extract feature vectors of dimension D using CNN, independently for each image in X; F.shape == (B, 10, D)
    o = RNN(F[:, -5:-1])    # predicts the last vector in F, based on the 4 vectors before that one; o.shape == (B, D)
    pred = Proj(o)          # Proj is a two layer neural network that projects to a lower dimension d; pred.shape == (B, d)
    f = Proj(F)             # we also project F to this lower dimension for comparisons; f.shape == (B, 10, d)
    loss = 0.0              # initialise the loss
    for i in 1 ... B
        pos_score[i] = dot_product(pred[i], f[i, -1])        # we want the prediction to be close to the last vector in the sequence
        neg_score[i] = exp(dot_product(pred[i], f[i, 0]))    # we want the prediction to be far from the first vector in the sequence
        for j in 1 ... B where j != i    # use all feature vectors from all other sequences in the batch as additional negative examples
            for t in 1 ... 10
                neg_score[i] += exp(dot_product(pred[i], f[j, t]))
            end for
        end for
        loss -= pos_score[i] / log(neg_score[i])    # contrastive loss
    end for
    # backpropagate the loss to the parameters of the CNN, RNN and Proj networks, and do
    # an update step with stochastic gradient descent, so as to minimise the average loss:
    update([CNN, RNN, Proj], loss / B)
end for
done
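The selection of positive and negative examples in the above listing can be exercised with the runnable sketch below, which operates on already projected vectors. It is given under stated assumptions: the loss is written in the common log-softmax (InfoNCE-style) form, which is substituted here for the expression of the listing and is not asserted to be identical to it; the CNN, RNN and Proj networks and the update step are left out; the names dot and contrastive_loss are introduced for the example.

```python
import math


def dot(u, v):
    """Dot product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(pred, f):
    """pred: B predicted vectors; f: B sequences of projected feature vectors.

    Positive example: the last vector of the same sequence.
    Negative examples: the first vector of the same sequence and every
    vector of every other sequence in the batch.
    """
    batch_size = len(pred)
    total = 0.0
    for i in range(batch_size):
        pos = dot(pred[i], f[i][-1])
        negs = [dot(pred[i], f[i][0])]
        negs += [dot(pred[i], f[j][t])
                 for j in range(batch_size) if j != i
                 for t in range(len(f[j]))]
        scores = [pos] + negs
        m = max(scores)                                    # log-sum-exp stabilisation
        log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
        total -= pos - log_denominator                     # minus log-probability of the positive
    return total / batch_size


# Two illustrative sequences of three projected vectors each
f = [[[0.0, 0.0], [0.0, 0.0], [10.0, 0.0]],
     [[0.0, 0.0], [0.0, 0.0], [0.0, 10.0]]]
good = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], f)       # predictions aligned with their targets
bad = contrastive_loss([[0.0, 1.0], [1.0, 0.0]], f)        # predictions aligned with the wrong sequence
```

With these illustrative values, the loss is close to zero when each prediction points towards the last vector of its own sequence, and large when it points towards a vector of the other sequence.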
[0140] In an embodiment, the invention aims to initiate the learning of the neural network which may be continuously implemented when methods of
[0141] The acquisition step ACQ, the application of the CNN.sub.L and the application of a learned transformation function LTF.sub.1 in
[0142] The RNN is further detailed in
[0143] The loss function LF.sub.1 is detailed in
[0144]
[0145] The h.sub.i vectors evolve through the neural network layer NNL by successively passing through processing blocks, called activation functions or transfer functions. The h.sub.i vectors are applied to each new input of the learned transformation function LTF.sub.1 for outputting a new feature vector o.sub.i.
[0146] According to different embodiments, the RNN may comprise one or more network layers. Each node of the layer may be implemented by an activation function such as a linear activation function or a non-linear activation function. A non-linear activation function that is implemented may be a differentiable, monotonic function. As an example, the activation functions implemented in the layer(s) of the RNN may be: a sigmoid or logistic activation function, a tanh or hyperbolic tangent activation function, a ReLU (Rectified Linear Unit) activation function, a Leaky ReLU activation function, GRU (Gated Recurrent Units), or any other activation function. In a configuration, LSTMs and GRUs may be implemented with a mix of sigmoid and tanh functions.
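Some of the activation functions listed above, and one recurrent update of a hidden vector, may be sketched as follows. The update rule h.sub.i = activation(W f.sub.i + U h.sub.i−1) is that of a simple Elman-style recurrent layer and is used here as an assumption; the weight matrices W and U, their values and the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    return x if x > 0.0 else slope * x


def rnn_step(f_i, h_prev, W, U, activation=math.tanh):
    """Return the new hidden vector h_i = activation(W f_i + U h_prev)."""
    return [activation(sum(w * x for w, x in zip(W[r], f_i)) +
                       sum(u * h for u, h in zip(U[r], h_prev)))
            for r in range(len(W))]


# Illustrative weights and a short sequence of feature vectors f_i
W = [[0.5, -0.5], [1.0, 0.0]]
U = [[0.1, 0.0], [0.0, 0.1]]
h = [0.0, 0.0]
for f_i in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    h = rnn_step(f_i, h, W, U)      # the hidden vector carries the temporal context forward
```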
[0147] Contrastive Loss Function
[0148] According to an embodiment of the invention, a loss function LF.sub.1 is implemented in the method of
[0149] According to an embodiment, the loss function LF.sub.1 may also be implemented in a method according to
[0150] According to an embodiment, the loss function LF.sub.1 is a contrastive loss function CLF.sub.1.
[0151] In the example of
[0152] In this approach, the RNN works as a predicting function wherein the result is an input of the contrastive loss function CLF.sub.1. The prediction function comprises computing a next feature vector o.sub.i+1 from previous received feature vectors { . . . , f.sub.i−2, f.sub.i−1, f.sub.i}, where o.sub.i+1 is a prediction of f.sub.i+1. In
[0153] As a convention, the outputs of the RNN, or of any equivalent learned transformation function LTF.sub.1, are called {o.sub.i}.sub.i∈[1;N] when the learned transformation function LTF.sub.1 is implemented in an application method for identifying highlights, for example. The outputs of the RNN, or of any equivalent learned transformation function LTF.sub.1, are called {pf.sub.i}.sub.i∈[1;N] when the learned transformation function LTF.sub.1 is implemented for training the learned neural network {CNN.sub.L} or {CNN.sub.L+LTF.sub.1}.
[0154] In other embodiments, the RNN may be replaced by any learned transformation function LTF.sub.1 that aims to predict a feature vector pf.sub.i considering past feature vectors {f.sub.j}.sub.j∈[W;i−1] and that aims to train a learned neural network model via backpropagation of the errors computed by a loss function LF.sub.1.
[0155]
[0156] According to an embodiment, the loss function LF.sub.1 comprises the computation of a distance d.sub.1(o.sub.i+1, f.sub.i+1). The distance d.sub.1(o.sub.i+1, f.sub.i+1) is computed between each predicted feature vector pf.sub.i+1 calculated by the RNN and each extracted feature vector f.sub.i+1 calculated by the convolutional neural network CNN. In that implementation, pf.sub.i+1 and f.sub.i+1 correspond to a same-related time sequence video frame vf.sub.i+1.
[0157] According to an embodiment, when the loss function LF.sub.1 is a contrastive loss function CLF.sub.1, it comprises computing a contrastive distance Cd.sub.1 between:
[0158] a first distance d.sub.1(o.sub.i+1, f.sub.i+1) computed between a predicted feature vector o.sub.i+1 and an extracted feature vector f.sub.i+1 for a same-related time sequence video frame vf.sub.i+1 and;
[0159] a second distance d.sub.2(o.sub.i+1, f.sub.k) computed between the predicted feature vector o.sub.i+1 corresponding to the feature vector outputted from the CNN and one reference extracted feature vector Rf.sub.n that should be uncorrelated from the video frame vf.sub.i.
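One possible way of contrasting the first distance with the second distance is a margin-based comparison, sketched below. The Euclidean distance, the margin value and the hinge form max(0, d.sub.1 − d.sub.2 + margin) are assumptions chosen for illustration; the description above does not fix the exact expression of the contrastive distance Cd.sub.1.

```python
import math


def contrastive_distance(predicted, target, reference, margin=1.0):
    """Hinge-style contrast between d1 (predicted vs. target of the same frame)
    and d2 (predicted vs. an uncorrelated reference vector)."""
    d1 = math.dist(predicted, target)       # should be small
    d2 = math.dist(predicted, reference)    # should be large
    return max(0.0, d1 - d2 + margin)


# The prediction is near its target and far from the reference: nothing to correct
well_separated = contrastive_distance([1.0, 0.0], [1.0, 0.1], [5.0, 5.0])
# The prediction is nearer to the reference than to its target: a positive error results
confused = contrastive_distance([1.0, 0.0], [5.0, 5.0], [1.0, 0.1])
```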
[0160] In practice, the reference extracted feature vector Rf.sub.n is ensured to be uncorrelated from the extracted feature vector f.sub.i by only considering frames that are separated by a sufficient period of time from the video frame vf.sub.i, or by considering frames acquired from a different video entirely. This means that the index “n” is chosen to precede the current frame index “i” by more than a predefined number of frames, for instance n<i−5. In the present invention, an uncorrelated time window UW is defined in which the reference extracted feature vector Rf.sub.n may be chosen.
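The choice of a reference index in the uncorrelated time window may be sketched as follows, using the example condition n<i−5. The function name sample_reference_index, the `gap` parameter and the convention of returning None when the same video offers no uncorrelated frame (in which case a frame of another video would be used) are assumptions made for the example.

```python
import random


def sample_reference_index(i, gap=5, rng=random):
    """Return a random index n such that n < i - gap, or None if no such frame exists."""
    candidates = range(0, i - gap)          # empty when i - gap <= 0
    if len(candidates) == 0:
        return None                         # fall back to a frame from another video
    return rng.choice(candidates)


n = sample_reference_index(20)              # some index between 0 and 14 inclusive
too_early = sample_reference_index(3)       # no frame is far enough in the past
```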
[0161] The reference feature vectors Rf.sub.n that are used to define the contrastive distance function Cd.sub.1 may correspond to frames of the same video sequence VS.sub.1 from which the video frames vf.sub.i are extracted or to frames of another video sequence VS.sub.2.
[0162] In an example, the contrastive loss function CLF.sub.1 randomly samples other frames vf.sub.k, or feature vectors f.sub.k, of the video sequence VS.sub.1 in order to define a set of reference extracted feature vectors Rf.sub.i.
[0163] The combination of reference feature vectors Rf.sub.i taken from random other video clips in the dataset, along with feature vectors from the same video clip but outside the predefined “correlation time window” CW, provides the neural network with a mix of “easy” and “hard” tasks. This mix ensures the presence of a useful training signal throughout the training procedure.
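By way of non-limiting illustration, the mix of reference feature vectors described above could be sampled as sketched below. The function name `sample_reference_indices`, its parameters, the clip identifiers and the restriction of same-clip references to past frames are assumptions made for this sketch only; they are not specified in the present description.

```python
import random

def sample_reference_indices(clip_lengths, clip_id, i, cw_depth,
                             n_easy, n_hard, rng=None):
    """Pick (clip, frame index) references for frame i of clip `clip_id`.

    Hypothetical helper: "easy" references are drawn from other clips,
    "hard" references from the same clip, outside the correlation window.
    """
    if rng is None:
        rng = random.Random()

    # "Easy" references: random frames of random other clips.
    other_clips = [c for c in clip_lengths if c != clip_id]
    easy = []
    if other_clips:
        for _ in range(n_easy):
            c = rng.choice(other_clips)
            easy.append((c, rng.randrange(clip_lengths[c])))

    # "Hard" references: same clip, indices n < i - cw_depth
    # (assumption: only past frames are considered).
    hard_pool = list(range(max(0, i - cw_depth)))
    picked = rng.sample(hard_pool, min(n_hard, len(hard_pool)))
    hard = [(clip_id, n) for n in picked]

    return easy + hard
```

For instance, `sample_reference_indices({"a": 20, "b": 8}, "a", i=10, cw_depth=5, n_easy=3, n_hard=2)` would return three references from clip "b" followed by two references from clip "a" whose indices are lower than 5.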
[0164] The invention allows extracting reference feature vectors and comparing their distance to a predicted vector, versus that predicted vector's distance to a target vector. This process allows for desired properties of the neural network to be expressed in a mathematical, differentiable loss function, which can in turn be used to train the neural network.
[0165] The training of the neural network allows distinguishing a near-future video frame from a randomly selected reference frame, in order to better distinguish highlights in a video sequence from other video frame sequences.
[0166]
[0167] The contrastive loss function CLF.sub.1 compares d.sub.1 and d.sub.2 in order to generate a computed error between d.sub.1 and d.sub.2 that is backpropagated to the weights of the neural network.
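As a minimal sketch, one possible way of turning the comparison of the first distance d.sub.1 and the second distance d.sub.2 into an error is a margin-based formulation. The present description does not specify the form of the contrastive loss nor the distance used; the Euclidean distance, the `margin` parameter and the function names below are assumptions chosen for illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_error(predicted, target, reference, margin=1.0):
    """Margin-based error comparing d1 and d2 (assumed formulation).

    d1: distance between the predicted vector and the extracted vector
        of the same frame (positive pair).
    d2: distance between the predicted vector and a reference vector
        that should be uncorrelated (negative pair).
    The error is zero once d2 exceeds d1 by at least `margin`.
    """
    d1 = euclidean(predicted, target)
    d2 = euclidean(predicted, reference)
    return max(0.0, d1 - d2 + margin)
```

In a differentiable framework, the same expression would be written with tensor operations so that the error can be backpropagated; the plain-Python version above only illustrates the arithmetic.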
[0168] To train the model, a positive pair is required, as well as at least one negative pair to contrast against this positive pair. Using a sequence length of 5 like in
[0171] According to an embodiment, the loss function LF.sub.1 comprises aggregating each computed contrastive distance Cd.sub.1 for increasing the accuracy of the detection of relevant video frames vf.sub.p.
[0172] The resulting error Er from the computed contrastive distance is backpropagated to update the parameters of the neural network model. Once the neural network is trained efficiently, this backpropagation allows relevant video frames vf.sub.p to be found.
[0173] The loss function LF.sub.1, or more particularly, the contrastive loss function CLF.sub.1 comprises a projection module PROJ for computing the projection of each feature vector f.sub.i or o.sub.i. The predicted feature vector pf.sub.i may undergo an additional, and possibly nonlinear, transformation to a projection space. A second step corresponds to the computation of each predicted component of the feature vector in order to generate the predicted feature vector pf.sub.i. This predicted feature vector aims to define a pseudo target for defining an efficient training process of the neural network.
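Purely as an illustration of a possibly nonlinear transformation to a projection space, the projection module PROJ could take the form of a single affine layer followed by a hyperbolic tangent. This particular form, the function name `project` and the `weights` and `bias` parameters are assumptions; in practice such parameters would be learned together with the rest of the model.

```python
import math

def project(vector, weights, bias):
    """Map a feature vector to a projection space as tanh(W.v + b).

    `weights` is a list of rows (one per output dimension) and `bias`
    a list with one value per output dimension (assumed layout).
    """
    return [
        math.tanh(sum(w * x for w, x in zip(row, vector)) + b)
        for row, b in zip(weights, bias)
    ]
```

The distances d.sub.1 and d.sub.2 would then be computed between projected vectors rather than between the raw feature vectors.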
[0174] The objective of the loss function LF.sub.1 is to push the predicted future features and true future features closer together, while pushing the predicted future features further away from the features of some other random image in the dataset.
[0175]
[0176] Uncorrelated Time Window
[0177] The invention allows aggregating feature vectors in a set of reference feature vectors {Rf.sub.i}.sub.l which are supposed to be uncorrelated with a feature vector f.sub.k which is currently processed by the learned transformation function LTF.sub.1 and the contrastive loss function CLF.sub.1. According to a configuration, an uncorrelated window UW corresponds to the video frames occurring outside a predefined time period centered on the timestamp t.sub.k of the frame vf.sub.k. This means that the frames vf.sub.k−7, vf.sub.k−8, vf.sub.k−9, vf.sub.k−10, etc. may be considered as uncorrelated with vf.sub.k, because they are far from the event occurring on frame vf.sub.k. In this case, the uncorrelated time window UW is defined by the frame closest to the frame vf.sub.k, which in that example is the frame vf.sub.k−7. This parameter is called the depth of the uncorrelated time window UW.
[0178] Consider, for example, a duration of 1 second between successive video frames, Δ(t.sub.k, t.sub.k−1)=1 s, obtained with a sampling frequency of 1/25 on a video at 25 frames per second. In that example, it may be considered that the frame vf.sub.k−7 is uncorrelated with the video frame vf.sub.k. It is assumed that, 7 seconds before the video frame vf.sub.k, the frame vf.sub.k−7 is different from the frame vf.sub.k in which an event may occur. In such a configuration, d.sub.2(f.sub.k−7, f.sub.k) is considered as a negative pair, in the same way that d.sub.2(Rf.sub.i, f.sub.k) is considered a negative pair. In this example, the pairs d.sub.1(f.sub.k−6, f.sub.k), d.sub.1(f.sub.k−5, f.sub.k), d.sub.1(f.sub.k−4, f.sub.k), d.sub.1(f.sub.k−3, f.sub.k), d.sub.1(f.sub.k−2, f.sub.k), d.sub.1(f.sub.k−1, f.sub.k) may or may not be defined as positive pairs, but they cannot be defined as negative pairs. Only the pairs d.sub.1(f.sub.k−4, f.sub.k), d.sub.1(f.sub.k−3, f.sub.k), d.sub.1(f.sub.k−2, f.sub.k), d.sub.1(f.sub.k−1, f.sub.k) may be defined as positive pairs according to the definition of a correlation time window CW.
[0179] This configuration is well adapted for a video sequence VS.sub.1 of a video game VG.sub.1. It may, however, be adapted for another video game VG.sub.2 or for another video sequence VS.sub.2 of the same video game, for example corresponding to another level of said video game.
[0180] Correlation Window
[0181] The invention allows aggregating feature vectors in a set of correlated feature vectors {Cf.sub.i}.sub.l which are supposed to be correlated with a feature vector f.sub.k which is currently processed by the learned transformation function LTF.sub.1 and the contrastive loss function CLF.sub.1. According to a configuration, a correlation time window CW corresponds to a predefined time period centered on the timestamp t.sub.k of the frame vf.sub.k. This means that the frames vf.sub.k−1, vf.sub.k−2, vf.sub.k−3, vf.sub.k−4 may be considered as correlated with vf.sub.k. In this case, the correlation window CW is defined by the frame farthest from the frame vf.sub.k, which in that example is the frame vf.sub.k−4. This parameter is called the depth of the correlation time window CW.
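The interplay between the two windows can be sketched as follows, using the depths of the examples of this description (4 for the correlation time window CW and 7 for the uncorrelated time window UW). The function name `pair_role`, the returned labels and the handling of equal depths are assumptions made for illustration only.

```python
def pair_role(offset, cw_depth=4, uw_depth=7):
    """Role of the pair (f_{k-offset}, f_k) for a past frame, offset >= 1.

    Frames no farther than `cw_depth` may serve as positive pairs, frames
    at least `uw_depth` away may serve as negative pairs, and frames in
    between are used as neither.
    """
    if offset < 1:
        raise ValueError("offset must be a positive number of frames")
    # Assumption: if both depths are equal, the boundary frame is
    # treated as belonging to the correlation window.
    if offset <= cw_depth:
        return "positive candidate"
    if offset >= uw_depth:
        return "negative candidate"
    return "neither"
```

With the default depths, offsets 1 to 4 are positive candidates, offsets 5 and 6 are used as neither, and offsets of 7 or more are negative candidates.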
[0182] According to an embodiment, the depth of the correlation time window CW and the depth of the uncorrelated time window may be set at the same value.
[0183] The method according to the invention comprises a controller that allows configuring the depth of the correlation time window CW and the depth of the uncorrelated time window UW. For instance, in a specific configuration they may be chosen with the same depth.
[0184] This configuration may be adapted to the video game or to information related to an event rate. For instance, in a car race video game, numerous events or changes may occur in a short time window. In that case, the correlation time window CW may be set at 3 s, including positive pairs inside the range [t.sub.k−3; t.sub.k] and/or excluding negative pairs from this correlation time window CW. In other examples, the time window is longer, for instance 10 s, including positive pairs inside the range [t.sub.k−10; t.sub.k] and/or excluding negative pairs from this correlation time window CW.
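As a hedged example of how a window duration expressed in seconds could be converted into a depth expressed in sampled frames, the sketch below assumes the sampling of the earlier example (one frame kept out of 25 on a video at 25 frames per second, that is, one sampled frame per second). The function name `window_depth` and its parameters are hypothetical.

```python
def window_depth(window_seconds, video_fps=25.0, sampling_ratio=1 / 25):
    """Number of sampled frames covered by a time window.

    `sampling_ratio` is the fraction of video frames that are kept
    (1/25 keeps one frame out of 25).
    """
    sampled_frames_per_second = video_fps * sampling_ratio
    # Rounding absorbs small floating-point errors in the product.
    return round(window_seconds * sampled_frames_per_second)
```

Under these assumptions, a correlation time window of 3 s corresponds to a depth of 3 sampled frames and a window of 10 s to a depth of 10 sampled frames; a denser sampling would yield proportionally larger depths.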
[0185] Embodiments of the subject matter and the operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on computer storage medium for execution by, or to control the operation of, data processing apparatus.
[0186] A computer storage medium can be, or can be included in, a computer-readable storage device, a computer-readable storage substrate, a random or serial access memory array or device, or a combination of one or more of them. Moreover, while a computer storage medium (e.g. a memory) is not a propagated signal, a computer storage medium can be a source or destination of computer program instructions encoded in an artificially-generated propagated signal. The computer storage medium also can be, or can be included in, one or more separate physical components or media (e.g., multiple CDs, disks, or other storage devices). The operations described in this specification can be implemented as operations performed by a data processing apparatus on data stored on one or more computer-readable storage devices or received from other sources.
[0187] The term “programmed processor” encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, digital signal processor (DSP), a computer, a system on a chip, or multiple ones, or combinations, of the foregoing. The apparatus can include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
[0188] The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform actions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
[0189] Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random-access memory or both. The essential elements of a computer are a processor for performing actions in accordance with instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Devices suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
[0190] To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., an LCD (liquid crystal display), LED (light emitting diode), or OLED (organic light emitting diode) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. In some implementations, a touch screen can be used to display information and to receive input from a user. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
[0191] The present invention has been described and illustrated in the present detailed description and in the figures of the appended drawings, in possible embodiments. The present invention is not however limited to the embodiments described. Other alternatives and embodiments may be deduced and implemented by those skilled in the art on reading the present description and the appended drawings.
[0192] In the claims, the term “includes” or “comprises” does not exclude other elements or other steps. A single processor or several other units may be used to implement the invention. The different characteristics described and/or claimed may be beneficially combined. Their presence in the description or in the different dependent claims does not exclude this possibility. The reference signs cannot be understood as limiting the scope of the invention.
[0193] It will be appreciated that the various embodiments described previously are combinable according to any technically permissible combinations.