METHOD FOR IDENTIFYING A VIDEO FRAME OF INTEREST IN A VIDEO SEQUENCE, METHOD FOR GENERATING HIGHLIGHTS, ASSOCIATED SYSTEMS

Abstract

A method for automatically generating a multimedia event on a screen by analyzing a video sequence includes acquiring a plurality of time-sequenced video frames from an input video sequence; applying a learned convolutional neural network to each video frame of the acquired time-sequenced video frames for outputting feature vectors, the learned convolutional neural network being learned by a method for training a neural network that includes applying a convolutional neural network to some video frames for extracting time-sequenced feature vectors and applying a learned transformation function that produces at least one predictive feature vector from a subset of the extracted time-sequenced feature vectors; classifying each feature vector according to different classes in a feature space, the different classes defining a frame classifier; and extracting the video frames that correspond to feature vectors which are classified in one predefined class of the classifier.

Claims

1. A method for automatically generating a multimedia event on a screen by analyzing a video sequence, the method comprising: acquiring a plurality of time-sequenced video frames from an input video sequence; applying a learned convolutional neural network to each video frame of the acquired time-sequenced video frames for outputting feature vectors, said learned convolutional neural network being learned by a method for training a neural network that comprises: applying a convolutional neural network to some video frames for extracting time-sequenced feature vectors; applying a recurrent neural network that produces at least one predictive feature vector from a subset of the extracted time-sequenced feature vectors; calculating a loss function, said loss function comprising a computation of a contrastive distance between: a first distance computed between a predicted feature vector and an extracted feature vector for a same-related time sequence video frame and; a second distance computed between the predicted feature vector for the same related time sequence video frame and one extracted feature vector, updating the parameters of the convolutional neural network and the parameters of the recurrent neural network in order to minimize the loss function, classifying each feature vector according to different classes in a feature space, said different classes defining a video frame classifier; extracting the video frames that correspond to feature vectors which are classified in one predefined class of the classifier.

2. The method according to claim 1, wherein applying a learned convolutional neural network to each video frame of the acquired time-sequenced video frames for outputting feature vectors is followed by a step of: applying a learned transformation function to each of the feature vectors, said learned convolutional neural network and learned transformation function being learned by a method for training a neural network that comprises: applying a convolutional neural network to some video frames for extracting time-sequenced feature vectors; applying a learned transformation function that produces at least one predictive feature vector from a subset of the extracted time-sequenced feature vectors; classifying each feature vector according to different classes in a feature space, said different classes defining a video frame classifier or a video sequence classifier; extracting a new video sequence comprising at least one video frame that corresponds to feature vectors which are classified in one predefined class of the video sequence classifier or the video frame classifier.

3. The method according to claim 1, wherein the method comprises: detecting at least one feature vector corresponding to at least one predefined class from a video frame classifier or a video sequence classifier; generating a new video sequence automatically comprising at least one video frame corresponding to the at least one detected feature vector according to the predefined class, said video sequence having a predetermined duration.

4. The method according to claim 1, wherein the video sequence comprises: aggregating video sequences corresponding to a plurality of detected feature vectors according to at least one predefined class, said video sequence having a predetermined duration and/or; aggregating video frames corresponding to a plurality of detected feature vectors according to at least two predefined classes, said video sequence having a predetermined duration.

5. The method according to claim 2, wherein the extracted video is associated with: a predefined audio sequence which is selected in accordance with at least one predefined class of the classifier; or a predefined visual effect which is applied in accordance with at least one predefined class of the classifier.

6. The method according to claim 1, wherein the method for training a neural network, comprises: acquiring a first set of videos; acquiring a plurality of time-sequenced video frames from a first video sequence from the above-mentioned first set of videos; applying a convolutional neural network to each video frame of the acquired time-sequenced video frames for extracting time-sequenced feature vectors; applying a learned transformation function that produces at least one predictive feature vector from a subset of the extracted time-sequenced feature vectors, said learned transformation function being repeated for a plurality of subsets; calculating a loss function, said loss function comprising a computation of a distance between each predicted feature vector and each extracted feature vector for a same-related time sequence video frame; updating the parameters of the convolutional neural network and the parameters of the learned transformation function in order to minimize the loss function.

7. The method according to claim 6, wherein each video of the first set of videos is a video extracted from a computer program having a predefined images library and code instructions that, when applied by said computer program, produce a time-sequenced video scenario.

8. The method according to claim 6, wherein the time-sequenced video frames are extracted from a video at a predefined interval of time.

9. The method according to claim 6, wherein the subset of the extracted time-sequenced feature vectors is a selection of a predefined number of time-sequenced feature vectors and the at least one predictive feature vector corresponds to the next feature vector in the sequence of the selected time-sequenced feature vectors.

10. The method according to claim 6, wherein the loss function comprises aggregating each computed distance.

11. The method according to claim 6, wherein the loss function comprises computing a contrastive distance between: a first distance computed between a predicted feature vector and an extracted feature vector for a same-related time sequence video frame and; a second distance computed between the predicted feature vector for the same related time sequence video frame and one extracted feature vector corresponding to a previous time sequence video frame, said previous time sequence video frame being selected beyond or after a predefined time window centered on the instant of the same related time sequence video frame or; one extracted feature vector corresponding to a time sequence video frame of another video sequence, and comprises aggregating each contrastive distance computed for each time sequence feature vector, said aggregation defining a first set of inputs.

12. The method according to claim 6, wherein the loss function comprises computing a contrastive distance between: a first distance computed between a predicted feature vector and an extracted feature vector for a same-related time sequence video frame and; a second distance computed between the predicted feature vector for the same related time sequence video frame and one extracted feature vector chosen in an uncorrelated time window, said uncorrelated time window being defined out of a correlation time window, said correlation time window comprising at least a predefined number of time sequenced feature vectors in a predefined time window centered on the instant of the same related time sequence video frame, and comprises aggregating each contrastive distance computed for each time sequence feature vector, said aggregation defining a first set of inputs.

13. The method according to claim 6, wherein the parameters of the convolutional neural network and/or the parameters of the learned transformation function are updated by considering the first set of inputs in order to minimize the distance function.

14. The method according to claim 6, wherein the learned transformation function is a recurrent neural network.

15. A non-transitory computer-readable medium that comprises software code portions for the execution of the method according to claim 1.

Description

BRIEF DESCRIPTION OF FIGURES

[0068] FIG. 1 is a flowchart of the main steps of an embodiment of the method for extracting a video frame of interest in a video.

[0069] FIG. 2 is a flowchart of the main steps of an embodiment of the method for generating a video sequence of interest.

[0070] FIG. 3 is a flowchart of the main steps of an embodiment of the method for training a neural network that is used for extracting a video frame of interest in a video.

[0071] FIG. 4 is a schematic representation of an architecture that may be implemented according to an example of the method of the invention.

[0072] FIG. 5 is an example of a distribution of features in a feature space or a classifier according to an example of the invention.

[0073] FIG. 6 is a schematic view focusing on an example of a recurrent neural network according to an example of the invention.

[0074] FIG. 7 is a schematic representation of an example of the implementation of a contrastive loss function according to the invention.

DETAILED DESCRIPTION

[0075] The following terminology and definitions are used in the description below.

[0076] Video frames are noted with the following convention:

[0077] {vf.sub.k}.sub.k∈[1; N]: a plurality of acquired video frames as inputs of the method;

[0078] vf.sub.k: one acquired video frame as an input of the method;

[0079] . . . vf.sub.i−1, vf.sub.i, vf.sub.i+1 . . . : successive acquired video frames;

[0080] vf.sub.p: one extracted video frame of the method as an output of the method, said video frames being classified in a classifier. These video frames may also be considered as video frames of interest.

[0081] Feature vectors extracted from the convolutional neural network CNN are noted with the following convention:

[0082] f.sub.k: one feature vector computed by a convolutional neural network CNN corresponding to the acquired video frame vf.sub.k; correspondence should be understood as meaning the same timestamp in the time-sequenced video frames;

[0083] . . . f.sub.i−1, f.sub.i, f.sub.i+1 . . . : successive feature vectors corresponding to a sequence of acquired video frames vf.sub.i−1, vf.sub.i, vf.sub.i+1;

[0084] f.sub.p: one extracted feature vector of the convolutional neural network as an output of the method, the extracted feature vector being classified in a classifier and corresponding to the extracted video frame vf.sub.p;

[0085] pf.sub.i: one feature vector predicted by the learned transformation function, which is used by the loss function or the contrastive loss function.

[0086] Feature vectors outputted by the learned transformation function LTF.sub.1 are noted with the following convention:

[0087] . . . o.sub.i−1, o.sub.i, o.sub.i+1 . . . : successive feature vectors outputted from a learned transformation function LTF.sub.1 corresponding to the successive feature vectors f.sub.i−1, f.sub.i, f.sub.i+1, which themselves correspond to a sequence of acquired video frames vf.sub.i−1, vf.sub.i, vf.sub.i+1;

[0088] o.sub.p: one extracted feature vector of the learned transformation function LTF.sub.1 which is classified by the method of the invention.

[0089] The convolutional neural network used in the application method, described in FIGS. 1 and 2, is named a learned convolutional neural network and noted CNN.sub.L.

[0090] The convolutional neural network used in the learning method, described in FIG. 3 is named a convolutional neural network and it is noted CNN.sub.1.

[0091] More generally, when properties common to the convolutional neural network used in the application method and in the learning method of FIGS. 1, 2 and 3 are described, the network is named a convolutional neural network and it is noted CNN.

[0092] FIG. 1 represents the main steps of an example of the method of the invention that allows extracting a video frame vf.sub.k in a video sequence VS.sub.1.

[0093] The first step of the method, noted ACQ, comprises the acquisition of a plurality of time-sequenced video frames {vf.sub.k}.sub.k∈[1; N] from an input video sequence VS.sub.1. The time-sequenced video frames are noted vf.sub.k and are called video frames in the description. Each video frame vf.sub.k is an image that is, for example, quantized into pixels and encoded in a predefined digital format such as jpeg or png (portable network graphics), or any digital format that allows encoding a digital image.

[0094] Video Frame

[0095] According to an embodiment of the invention, the full video sequence VS.sub.1 is segmented into a plurality of video frames vf.sub.k that are all processed by the method of the invention. According to another embodiment, the selected video frames {vf.sub.k}.sub.k∈[1; N] in the acquisition step ACQ of the method are sampled from the video sequence VS.sub.1 according to a predefined sampling frequency. For example, one video frame vf.sub.k is acquired every second to be processed in the further steps of the method.

[0096] The video may be received by any interface, such as a communication interface, a wireless interface or a user interface. The video may be recorded in a memory before being segmented.

[0097] For instance, assuming a video sequence VS.sub.1 of 10 min encoded with 25 images/s, the total number of images is 10×60×25=15000 images. The sampling frequency is set to 1 frame out of every 25 images, which is equivalent to considering one frame every second. The acquisition step comprises the acquisition of a time-sequence of N video frames for a single video, with N=10×60×25/25=600 video frames in the previous example. According to an example, a training dataset might have N frames per video, where N is a number of the order of several hundred or several thousand frames. During a single training example, this number may be of the order of 10 or 20 frames.
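By way of non-limiting illustration, the arithmetic of the preceding example may be sketched as follows. The function name sampled_frame_indices and its parameters are hypothetical and are introduced only for this sketch; a constant sampling interval is assumed.

```python
def sampled_frame_indices(duration_s, fps, sample_every_n):
    """Return the indices of the frames kept when one frame out of
    every `sample_every_n` encoded images is acquired."""
    total_frames = duration_s * fps
    return list(range(0, total_frames, sample_every_n))


# 10-minute video encoded at 25 images/s, one frame kept out of 25.
total = 10 * 60 * 25
indices = sampled_frame_indices(duration_s=10 * 60, fps=25, sample_every_n=25)
print(total, len(indices))
```

With these values, the sketch yields one acquired frame per second of video.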

[0098] According to an example, a pre-detection algorithm is implemented in order to select some specific segments of the video sequence VS.sub.1. These segments may be sampled for acquiring video frames vf.sub.k. According to an example, the sampling frequency may be variable in time. Some labeled timestamps on the video sequence VS.sub.1 may be used for acquiring more video frames in a first segment of the video sequence VS.sub.1 than in a second one. For example, the beginning of a video sequence VS.sub.1 may be sampled with a low sampling frequency and the end of a stage in a level of a video game VG.sub.1 may be sampled with a higher sampling frequency.

[0099] The video frames vf.sub.k are used to detect frames of interest vf.sub.p, also called FoI, which is detailed in FIG. 1 or to detect subsequences of interest, also called SSoI, which is detailed in FIG. 2 or to improve a learning process of the method of the invention which is detailed in FIG. 3.

[0100] The video frames vf.sub.k may be acquired from a single video sequence VS.sub.1 when the method is applied for generating some highlights of said video sequence VS.sub.1, or from a plurality of video sequences VS.sub.i, for example, when the method is implemented in a training process of a neural network.

[0101] Convolutional Neural Network

[0102] The second step of the method of the invention, noted APPL1_CNN in FIG. 1, comprises applying a learned convolutional neural network, noted CNN.sub.L, to the input video frames {vf.sub.k}.sub.k∈[1; N].

[0103] The CNN processes each acquired video frame vf.sub.k and extracts feature vectors {f.sub.k}.sub.k∈[1; N]. A feature vector f.sub.k may be represented in a feature space such as the one represented in FIG. 5. The feature space of FIG. 5 comprises 2 dimensions: DIM1 and DIM2. In this example, two classes C.sub.A, C.sub.B of the feature space are represented. Each class delimits a region gathering vectors that share common properties or the like.

[0104] The CNN may be any convolutional neural network comprising a multilayer architecture based on the application of successive transformation operations, such as convolution, between said layers. Each input of the CNN, i.e. each video frame vf.sub.k, is processed through the successive layers by the application of transformation operations. The CNN thereby converts an image into a vector.

[0105] The goal of a CNN is to transform video frames as inputs of the neural network into a feature space that allows a better classification of the transformed inputs by a classifier VfC, VsC. Another goal is that the transformed data is used to train the neural network in order to increase the recognition of the content of the inputs.

[0106] In an embodiment, the CNN comprises a convolutional layer, a non-linearity or rectification layer, a normalization layer and a pooling layer. According to different embodiments, the CNN may comprise a combination of one or more of said layers. According to an embodiment, the CNN is trained by a backpropagation process that modifies the parameters of each layer of the CNN. Other derivative architectures may be implemented according to different embodiments of the invention.
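By way of non-limiting illustration, the chaining of a convolutional layer, a rectification layer and a pooling layer that converts an image into a vector may be sketched as follows. The sketch assumes a single-channel image, a single fixed 2×2 kernel and no normalization layer; the helper names conv2d, relu, max_pool and flatten are hypothetical, and the kernel values are arbitrary rather than learned.

```python
def conv2d(image, kernel):
    """Valid 2D convolution (cross-correlation form) of a single-channel image."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def relu(fmap):
    """Rectification layer: negative activations are set to zero."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """Non-overlapping max pooling over size x size blocks."""
    return [[max(fmap[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]


def flatten(fmap):
    """Turn the pooled feature map into a feature vector."""
    return [v for row in fmap for v in row]


# Toy 5x5 "video frame" whose pixel values increase from left to right
# and from top to bottom.
image = [[r * 5 + c for c in range(5)] for r in range(5)]
kernel = [[-1, 0], [0, 1]]  # arbitrary kernel responding to diagonal increase
feature = flatten(max_pool(relu(conv2d(image, kernel))))
print(feature)
```

In a learned network, the kernel values would be parameters updated by backpropagation rather than constants.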

[0107] According to an implementation, the incoming video frames vf.sub.k, which are processed by the learned convolutional neural network CNN.sub.L, are gathered in successive batches of incoming video frames. For instance, as represented in FIG. 4, a batch may comprise 5 video frames vf.sub.k. The incoming video frames are processed so as to respect their time-sequenced arrangement. According to an example, the CNN.sub.L is configured to continuously receive batches of 5 video frames vf.sub.k to be computed through its layers. In such an example, each incoming batch of video frames {vf.sub.i−2, vf.sub.i−1, vf.sub.i, vf.sub.i+1, vf.sub.i+2} leads to the output of a batch of the same number of feature vectors {f.sub.i−2, f.sub.i−1, f.sub.i, f.sub.i+1, f.sub.i+2}.
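By way of non-limiting illustration, the gathering of time-ordered frames into batches of 5 may be sketched as follows. The sketch assumes non-overlapping batches and drops an incomplete trailing batch, neither of which is specified by the description; stand_in_cnn is a hypothetical placeholder for the CNN.sub.L that only shows that one feature vector is returned per frame, in the same order.

```python
def batches_in_order(frames, batch_size=5):
    """Split time-ordered frames into consecutive batches of `batch_size`
    frames; an incomplete trailing batch is dropped (assumption)."""
    return [frames[start:start + batch_size]
            for start in range(0, len(frames) - batch_size + 1, batch_size)]


def stand_in_cnn(batch):
    """Placeholder for the learned CNN: one output per input frame,
    with the time order preserved."""
    return [("f", k) for (_, k) in batch]


frames = [("vf", k) for k in range(1, 13)]  # vf_1 ... vf_12
batches = batches_in_order(frames)
features = [stand_in_cnn(batch) for batch in batches]
print(features[0])
```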

[0108] In other examples, the CNN may be configured so that batches comprise between 2 and 25 video frames. According to an example, the batch comprises 4 frames or 6 frames.

[0109] According to an embodiment, the CNN is learned to output a plurality of successive feature vectors f.sub.i−1, f.sub.i, f.sub.i+1, each feature vector being timestamped according to the acquired time-sequenced video frames vf.sub.i−1, vf.sub.i, vf.sub.i+1. The weights of the CNN, and more generally the other learned parameters of the CNN and the configuration data that describe the architecture of the CNN, are recorded in a memory that may be in a server on the Internet, the cloud or a dedicated server. For some applications, the memory is a local memory of one computer.

[0110] The learned CNN.sub.L is trained before or during the application of the method of the invention according to FIGS. 1 and 2 by the application of two successive functions. The first function is a convolutional neural network CNN.sub.1 applied to some video frames vf.sub.k for computing time-sequenced feature vectors f.sub.k. The second function is a learned transformation function LTF.sub.1 that produces at least one predictive feature vector pf.sub.i+1 from a subset SUB.sub.i of the outputted time-sequenced feature vectors {f.sub.i}.sub.i∈[W;i], where “W” is the set of timestamps defining a correlation time window CW. The correlation time window CW may be defined by the number of feature vectors comprised in the band [f.sub.w; f.sub.i] which are used for predicting a predicted feature vector pf.sub.i+1. In an example, the second function is a recurrent neural network, noted RNN.
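By way of non-limiting illustration, the selection of the subsets SUB.sub.i and of the feature vector that each predictive feature vector pf.sub.i+1 is compared with may be sketched as follows. The sketch assumes that the correlation time window CW is a fixed number of consecutive feature vectors; the function name prediction_pairs and the parameter window are hypothetical, and the feature vectors are represented by labels only.

```python
def prediction_pairs(features, window):
    """Pair each subset SUB_i = (f_{i-window+1}, ..., f_i) with the
    feature vector f_{i+1} that the transformation function should predict."""
    pairs = []
    for i in range(window - 1, len(features) - 1):
        subset = features[i - window + 1:i + 1]
        target = features[i + 1]
        pairs.append((subset, target))
    return pairs


features = ["f0", "f1", "f2", "f3", "f4", "f5"]
pairs = prediction_pairs(features, window=4)
for subset, target in pairs:
    print(subset, "->", target)
```

Repeating the prediction over every such subset corresponds to applying the learned transformation function to a plurality of subsets.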

[0111] The feature vectors f.sub.i that are computed by the convolutional neural network CNN.sub.1 and the learned transformation function LTF.sub.1 may be used to train the learned convolutional neural network CNN.sub.L model and possibly the RNN model when it is also implemented in the method according to FIG. 2.

[0112] This training process outputs a learned neural network which is implemented in a video processing application that minimizes data processing in an automatic video editing process. Such a learned neural network allows outputting relevant highlights in video games. This method also increases the relevance of the video frame classifier VfC built during the process.

[0113] Backpropagation

[0114] FIG. 1 represents a feedback loop BP that corresponds to the backpropagation of the learning step of the CNN.sub.L. The backpropagation is an operation that aims to evaluate how a change in the kernel weights of the CNN.sub.L affects a loss function LF.sub.1. The backpropagation may run as a continuous task that is applied during the extraction of video frames of interest vf.sub.p.

[0115] In other words, the weights of the CNN.sub.L, and the learned transformation function LTF.sub.1 when implemented, are updated simultaneously via backpropagation. The updates are derived, for example, by backpropagating a contrastive loss throughout the neural network.

[0116] The backpropagation of the method that is used to train the CNN.sub.L or the CNN.sub.L+RNN may be performed using a contrastive loss function CLF.sub.1 that predicts a target in order to compare an output with a predicted target. The backpropagation then comprises updating the parameters in the neural network in such a way that the next time the same input goes through the network, the output will be closer to the desired target.

[0117] Classifier

[0118] A third step is a classifying step, noted CLASS. This step comprises classifying each extracted feature vector f.sub.p, or the respective extracted video frame vf.sub.p, according to different classes C{p}.sub.p∈[1, Z] in a feature space. These different classes C{p} define a video frame classifier VfC.

[0119] A fourth step is an extraction step, noted EXTRACT(vf). This step comprises extracting the video frames vf.sub.p that correspond to feature vectors f.sub.p which are classified in at least one class C{p} of the classifier. In the scope of the invention, the extracting step may correspond to an operation of marking, identifying, or annotating these video frames vf.sub.p. The annotated video frames vf.sub.p may be used, for example, in an automatic film editing operation for gathering annotated frames of one class of the classifier VfC in order to generate a highlight sequence.
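By way of non-limiting illustration, the classifying step CLASS and the extraction step EXTRACT(vf) may be sketched as follows in a two-dimensional feature space such as that of FIG. 5. The description does not specify the classifier; a nearest-centroid rule is assumed here purely for the sketch, and the centroid coordinates, the feature values and the function names nearest_class and extract_frames_of_interest are hypothetical.

```python
import math


def nearest_class(feature, centroids):
    """Assign a feature vector to the class whose centroid is closest
    (nearest-centroid rule, assumed for this sketch only)."""
    return min(centroids, key=lambda name: math.dist(feature, centroids[name]))


def extract_frames_of_interest(features, centroids, class_of_interest):
    """Return the indices k of the frames whose feature vector f_k falls
    in the class of interest; the frames are marked, not copied."""
    return [k for k, f in enumerate(features)
            if nearest_class(f, centroids) == class_of_interest]


centroids = {"C_A": (0.0, 0.0), "C_B": (5.0, 5.0)}  # hypothetical class centers
features = [(0.2, -0.1), (4.8, 5.3), (0.9, 0.4), (5.5, 4.7)]
foi = extract_frames_of_interest(features, centroids, "C_B")
print(foi)
```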

[0120] According to the example of FIG. 2, extracting a video subsequence of interest SSoI comprises selecting a portion of the video sequence timestamped at a video frame vf.sub.p selected by the classifier VfC. Highlights correspond to short video sequences of the video sequence VS.sub.1, herein called video subsequences of interest SSoI. According to different embodiments, a highlight may correspond to a short video generated at a video frame vf.sub.p that is classified in a specific class C{i}.sub.i∈[1,Z] of the classifier. Such a class may be named Class of Interest CoI.

[0121] For instance, the classifier VfC may comprise classes of interest CoI comprising video frames vf.sub.p related to highlights of a video sequence VS.sub.1. Highlights may appear, for example, at times when many events occur at about the same time in the video sequence VS.sub.1, when a user changes level during gameplay, when different user avatars meet in a scene during high intensity action, when there are collisions of a car, ship or plane or a death of an avatar, etc. A benefit of the classifier of the invention is that classes are dynamically defined in a training process that corresponds to many scenarios which are difficult to enumerate or anticipate.

[0122] According to some embodiments, different methods can be used for generating a short video when considering a specific extracted video frame vf.sub.p. The length of the video sequence SSoI can be a few seconds. For example, the duration of the SSoI may be in the range of 1 s to 10 s. The SSoI may be generated so that the video frame of interest vf.sub.p is placed at the middle of the SSoI, or placed at ⅔ of the duration of the SSoI. In an example, the SSoI may start or finish at the FoI.
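By way of non-limiting illustration, the placement of a video frame of interest inside a subsequence of interest may be sketched as follows. The function name ssoi_window and its parameters are hypothetical; clamping the window to the bounds of the video is an assumption that the description does not address.

```python
def ssoi_window(foi_time_s, duration_s, position=0.5, video_length_s=None):
    """Return (start, end) in seconds of a subsequence of `duration_s`
    seconds in which the frame of interest sits at the fraction
    `position` of the subsequence (0.0 = start, 0.5 = middle, 1.0 = end)."""
    start = max(0.0, foi_time_s - position * duration_s)
    end = start + duration_s
    if video_length_s is not None and end > video_length_s:
        # Shift the window back so that it stays inside the video.
        end = video_length_s
        start = max(0.0, end - duration_s)
    return start, end


# Frame of interest at t = 120 s, 6-second highlight.
centered = ssoi_window(120.0, 6.0)                      # FoI in the middle
two_thirds = ssoi_window(120.0, 6.0, position=2 / 3)    # FoI at 2/3 of the SSoI
ends_at_foi = ssoi_window(120.0, 6.0, position=1.0)     # SSoI finishes at the FoI
print(centered, two_thirds, ends_at_foi)
```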

[0123] According to an embodiment, some visual effects may be integrated during the SSoI, such as slowdown(s) or acceleration(s), a zoom on the user virtual camera, the inclusion of video of the subsequence generated by another virtual camera different from the user's point of view, an inscription on the picture, etc.

[0124] According to an embodiment, the duration of the SSoI depends on the class wherein the FoI is selected. For instance, the classifier VfC or VsC may comprise different classes of interest C{p}: a class with high intensity actions, a class with newly appearing events, etc. Some implementations take advantage of the variety of classes that is generated according to the method of the invention. The SSoI may be generated taking into account classes of the classifier. For example, the duration of the SSoI may depend on the classes, the visual effects applied may also depend on the classes, the order of the SSoI in a video montage may depend on the classes, etc.

[0125] According to an example, a video subsequence of interest SSoI is generated when several video frames of interest vf.sub.p are identified in the same time period. When a time period, for example of a few seconds, comprises several FoI, an SSoI is automatically generated. This solution may be implemented when some FoI of different classes are detected in the same lapse of time during the video sequence VS.sub.1.
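By way of non-limiting illustration, the detection of time periods that comprise several frames of interest may be sketched as follows. The function name dense_periods and the parameters period_s and min_count are hypothetical; the grouping is a simple greedy pass anchored on the first FoI of each group, which is one possible choice among others.

```python
def dense_periods(foi_times_s, period_s=3.0, min_count=2):
    """Return (first, last) timestamps of groups of frames of interest in
    which at least `min_count` FoI fall within `period_s` seconds of the
    first FoI of the group (greedy grouping)."""
    spans = []
    group = []
    for t in sorted(foi_times_s):
        if group and t - group[0] > period_s:
            if len(group) >= min_count:
                spans.append((group[0], group[-1]))
            group = []
        group.append(t)
    if len(group) >= min_count:
        spans.append((group[0], group[-1]))
    return spans


foi_times = [10.0, 11.0, 12.5, 40.0, 90.0, 91.0]
spans = dense_periods(foi_times)
print(spans)
```

Each returned span could then serve as the basis for one automatically generated SSoI, while the isolated FoI at 40 s would not trigger one under these assumed settings.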

[0126] An application of the invention is the automatic generation of films that result from the automatic selection of several extracted video sequences according to the method of the invention. Such films may comprise automatic aggregations of audio sequences, visual effects, written inscriptions such as titles, etc., depending on the classes wherein said extracted video sequences are selected.

[0127] Learned Transformation Function

[0128] FIG. 2 represents another embodiment of the invention wherein a learned transformation function LTF.sub.1 is applied after the convolutional neural network. This step is noted APPL2_LT.

[0129] In an embodiment, the learned transformation function LTF.sub.1 is a recurrent neural network, also noted RNN. The RNN is implemented so as to process the outputs f.sub.i of the learned convolutional neural network CNN.sub.L in order to output new feature vectors o.sub.i. A benefit of implementing a recurrent neural network RNN is that it aggregates the transformed data temporally within its own feature extraction process. The connections between the nodes of an RNN allow it to model the temporal dynamic behavior of the acquired time-sequenced video frames. The performance of the classifier is increased by taking into account the temporal neighborhood of a video frame.

[0130] According to different examples, the RNN may be one of the following variants: fully recurrent type, Elman network and Jordan network types, Hopfield type, independently recurrent (IndRNN) type, recursive type, neural history compressor type, second order RNN type, long short-term memory (LSTM) type, gated recurrent unit (GRU) type, bi-directional type or continuous-time type, recurrent multilayer perceptron network type, multiple timescales model type, neural Turing machine type, differentiable neural computer type, neural network pushdown automata type, memristive network type, transformer type.

[0131] According to the invention, the RNN aims to continuously output a prediction of the feature vector of the next frame. This prediction function may be applied continuously to a batch of feature vectors f.sub.i that is outputted by the CNN.sub.L. The RNN may be configured for predicting one output vector over a batch of N−1 incoming feature vectors, so that a loss function LF.sub.1, such as a contrastive loss function CLF.sub.1, can be applied in a further step.

[0132] The implementation of an RNN, or more generally a learned transformation function LTF.sub.1, is used for training the learned neural network of the method of FIG. 1 or FIG. 2, that is, the CNN.sub.L or the {CNN.sub.L+LTF.sub.1}. In a first embodiment, the learned neural network of the method according to FIGS. 1 and 2 may comprise only a CNN. In a second embodiment, the learned neural network of the method according to FIGS. 1 and 2 may comprise a combination of a CNN and an LTF.sub.1, such as an RNN. In this last case, the learned transformation function LTF.sub.1 is used for improving the detection of FoI or SSoI. The use of an RNN in the method of FIG. 2 allows aggregating past information for processing the current input f.sub.i. The feature vector o.sub.i outputted by the RNN is used for improving the training of the method of FIG. 1 and FIG. 2 and also for improving a video subsequence classifier or a video frame classifier.

[0133] The method of FIG. 1 or FIG. 2 may be repeated for each input video sequence VS.sub.i of a set of video sequences {VS.sub.i}.sub.i∈[1, P].

[0134] It is to be noted that in the example of FIG. 2 the last step, noted EXTRACT(VS), corresponds to an extraction of video subsequences of interest SSoI from a video sequence classifier VsC. The embodiments of FIG. 2 may be combined with the embodiments of FIG. 1. For example, the extraction of video frames vf.sub.p in FIG. 1 may be implemented in the method of FIG. 2 by replacing the step of extracting video subsequences by the step of extracting video frames vf.sub.p.

[0135] A video sequences classifier VsC may be implemented so that it includes a selecting step of classified SSoI. This is an alternative to the previous embodiments wherein subsequences of interest SSoI were generated from selected FoI from a video frame classifier VfC.

[0136] FIG. 3 shows an embodiment of the learning process of the invention used for training the CNN.sub.L, either alone or when implemented with an RNN or more generally with a learned transformation function LTF.sub.1. The method of FIG. 3 is a method for training a neural network such as those described in FIG. 1 or FIG. 2. The neural network of the methods of FIGS. 1 and 2 may be trained continuously while it is also used for classifying each newly acquired video frame vf.sub.p.

[0137] According to an embodiment, the method of FIG. 3 may be trained with a set SET.sub.1 of video sequences {VS.sub.i}.sub.i∈[1, P]. The set of video sequences SET.sub.1 may comprise video sequences with different lengths and coming from different video sources. For example, SET.sub.1 may comprise video sequences VS.sub.1, VS.sub.2, VS.sub.3, etc., each one corresponding to different user instances of a specific video game. A benefit is to train the neural network of the method with a set of video sequences {VS.sub.i}.sub.i∈[1, P] of one specific video game. In another embodiment, different video games may be considered in the training process. In such cases, the model learns features that are more generally useful across many different video games.

[0138] An example of an algorithm describing the training loop for a specific implementation of the method of the invention is detailed hereafter.

[0139] In that example, a database of 10 second video clips from a single videogame, sampled at 1 frame per second, is considered. The following sequence is processed until convergence or until training is otherwise complete.

TABLE-US-00001
For each batch of B video clips in random_shuffle(database):
    let X = the batch of image sequences  # X.shape == (B, 10, 3, h, w) == (batch_size, seq_len, RGB, height, width)
    F = CNN(X)  # extract feature vectors of dimension D using CNN, independently for each image in X; F.shape == (B, 10, D)
    o = RNN(F[:, -5:-1])  # predicts the last vector in F, based on the 4 vectors before that one; o.shape == (B, D)
    pred = Proj(o)  # Proj is a two layer neural network that projects to a lower dimension d; pred.shape == (B, d)
    f = Proj(F)  # we also project F to this lower dimension for comparisons; f.shape == (B, 10, d)
    loss = 0.0  # initialise the loss
    for i in 1 ... B
        pos_score[i] = dot_product(pred[i], f[i, -1])  # we want the prediction to be close to the last vector in the sequence
        neg_score[i] = exp(dot_product(pred[i], f[i, 0]))  # we want the prediction to be far from the first vector in the sequence
        for j in 1 ... B where j != i  # use all feature vectors from all other sequences in the batch as additional negative examples
            for t in 1 ... 10
                neg_score[i] += exp(dot_product(pred[i], f[j, t]))
            end for
        end for
        loss -= pos_score[i] / log(neg_score[i])  # contrastive loss
    end for
    # backpropagate the loss to the parameters of the CNN, RNN and Proj networks, and do
    # an update step with stochastic gradient descent, so as to minimise the average loss:
    update([CNN, RNN, Proj], loss / B)
end for
done
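The loss computation of the training loop above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the implementation of the invention: the helper names `dot` and `contrastive_loss` are hypothetical, the inputs are assumed to be already projected vectors (the outputs of Proj), and the per-clip term is assumed to be the log-ratio `pos_score - log(neg_score)` commonly used for contrastive objectives, which is a swapped-in form that differs from the division written in the listing. Backpropagation and the parameter update are left out.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors given as lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(pred, f):
    """Average contrastive loss over a batch.

    pred: list of B predicted vectors, each of length d.
    f:    list of B clips, each a list of T projected feature vectors.
    """
    batch_size = len(pred)
    total = 0.0
    for i in range(batch_size):
        # Positive score: prediction against the last vector of its own clip.
        pos_score = dot(pred[i], f[i][-1])
        # First negative: the first vector of the same clip.
        neg_score = math.exp(dot(pred[i], f[i][0]))
        # Additional negatives: every vector of every other clip in the batch.
        for j in range(batch_size):
            if j == i:
                continue
            for vector in f[j]:
                neg_score += math.exp(dot(pred[i], vector))
        # ASSUMPTION: log-ratio form, not the division shown in the listing.
        total -= pos_score - math.log(neg_score)
    return total / batch_size
```

With this form, the loss decreases when a prediction aligns with the last vector of its own clip and moves away from the negatives, which is the behaviour stated in the comments of the listing.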

[0140] In an embodiment, the invention aims to initiate the learning of the neural network, which may then be continued while the methods of FIG. 1 or FIG. 2 are carried out.

[0141] The acquisition step ACQ, the application of the CNN.sub.L and the application of a learned transformation function LTF.sub.1 in FIG. 3 may be the same steps described in FIG. 1 and FIG. 2.

[0142] The RNN is further detailed in FIGS. 3 and 6. It may also be implemented in the methods of FIGS. 1 and 2 as a learned neural network when it is combined with a CNN.sub.L.

[0143] The loss function LF.sub.1 is detailed in FIG. 3, FIG. 4 and FIG. 7. It is to be noted that the loss function LF.sub.1 may be implemented in the methods of FIGS. 1 and 2 for processing the backpropagation BP with a computation of an error distance Er.

[0144] FIG. 6 represents an example of how a recurrent neural network RNN may be implemented. An RNN comprises a dynamic loop applied on the inputs of the network, allowing information to persist. This dynamic loop is represented by successive “h.sub.i” vectors that are applied to the incoming extracted feature vectors f.sub.i coming from the CNN in an ordered sequence, as a continuous process. In such an architecture, the invention allows connecting past information, such as previously processed extracted feature vectors f.sub.i coming from the CNN, and allows selecting them from a correlation time window CW. These connecting and selecting tasks allow processing the present extracted feature vector f.sub.i from the CNN in the RNN. According to an example, the RNN is an LSTM network.

[0145] The h.sub.i vectors evolve through the neural network layer NNL by successively passing through processing blocks, called activation functions or transfer functions. The h.sub.i vectors are applied to each new input of the learned transformation function LTF.sub.1 for outputting a new feature vector o.sub.i.
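The way the h.sub.i vectors carry information from one feature vector to the next can be illustrated with a very small recurrence. This is a minimal sketch assuming a plain tanh recurrent cell rather than the LSTM mentioned as an example; the function names `rnn_step` and `run_rnn` and the weight matrices `W_h` and `W_f` are hypothetical and only serve the illustration.

```python
import math


def rnn_step(h, f, W_h, W_f):
    """One recurrence step: h_new = tanh(W_h . h + W_f . f).

    h:   previous hidden vector (list of floats).
    f:   current extracted feature vector (list of floats).
    W_h: square matrix applied to h (list of rows).
    W_f: matrix applied to f (list of rows, one row per hidden unit).
    """
    return [
        math.tanh(
            sum(w * x for w, x in zip(W_h[r], h))
            + sum(w * x for w, x in zip(W_f[r], f))
        )
        for r in range(len(h))
    ]


def run_rnn(features, W_h, W_f):
    """Apply the recurrence to an ordered sequence of feature vectors.

    Returns the list of hidden vectors, one per input feature vector.
    """
    h = [0.0] * len(W_h)  # initial hidden state, assumed to be zero
    states = []
    for f in features:
        h = rnn_step(h, f, W_h, W_f)
        states.append(h)
    return states
```

In this sketch, a non-zero first input still influences the hidden vector after a later all-zero input, which is the persistence of past information described above.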

[0146] According to different embodiments, the RNN may comprise one or more network layers. Each node of a layer may be implemented by an activation function, such as a linear activation function or a non-linear activation function. A non-linear activation function that is implemented may be a differentiable monotonic function. As an example, the activation functions implemented in the layer(s) of the RNN may be: a sigmoid or logistic activation function, a tanh or hyperbolic tangent activation function, a ReLU (Rectified Linear Unit) activation function, a Leaky ReLU activation function, GRU (Gated Recurrent Units), or any other activation function. In a configuration, LSTMs and GRUs may be implemented with a mix of sigmoid and tanh functions.
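For reference, the scalar activation functions named above can be written in a few lines. This is an illustrative sketch only; the negative slope of 0.01 used for Leaky ReLU is an assumed default, not a value taken from the description.

```python
import math


def sigmoid(x):
    """Logistic function, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Hyperbolic tangent, output in (-1, 1)."""
    return math.tanh(x)


def relu(x):
    """Rectified Linear Unit: zero for negative inputs, identity otherwise."""
    return max(0.0, x)


def leaky_relu(x, negative_slope=0.01):
    """Leaky ReLU: small slope (assumed 0.01) for negative inputs."""
    return x if x > 0 else negative_slope * x
```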

[0147] Contrastive Loss Function

[0148] According to an embodiment of the invention, a loss function LF.sub.1 is implemented in the method of FIG. 3 in order to train a neural network. This training process aims to provide a learned neural network that can be used in any application for classifying video sequences, any application for detecting specific video frames vf.sub.p, or any application for generating video sequences. The errors computed by the loss function LF.sub.1 are used to update the parameters of the neural network via a backpropagation process. The error is preferably a distance error that is minimized through the learning process.

[0149] According to an embodiment, the loss function LF.sub.1 may also be implemented in a method according to FIG. 1 or FIG. 2 in order to improve the detection of video frames of interest vf.sub.p. This detection relies on dynamically analyzing video subsequences by considering past information in the treatment of current information.

[0150] According to an embodiment, the loss function LF.sub.1 is a contrastive loss function CLF.sub.1. FIG. 4 shows an example of an implementation of contrastive loss function CLF.sub.1.

[0151] In the example of FIG. 4, the CNN and the RNN work as two different modules which deliver outputs that are considered by the contrastive loss function CLF.sub.1.

[0152] In this approach, the RNN works as a predicting function whose result is an input of the contrastive loss function CLF.sub.1. The prediction function comprises computing a next feature vector o.sub.i+1 from previously received feature vectors { . . . , f.sub.i−2, f.sub.i−1, f.sub.i}, where o.sub.i+1 is a prediction of f.sub.i+1. In FIG. 4, feature vector o.sub.5 is a prediction of the feature vector f.sub.5. This prediction is processed by considering the last four input feature vectors {f.sub.1, f.sub.2, f.sub.3, f.sub.4} and the last four output vectors {o.sub.1, o.sub.2, o.sub.3, o.sub.4}. When the RNN or the learned transformation function LTF.sub.1 is implemented for predicting an output feature vector o.sub.i, that feature vector is noted pf.sub.i.
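The pairing of four past feature vectors with the vector to be predicted can be sketched as a sliding window over the sequence. The function name `context_target_pairs` and the parameter `context_len` are hypothetical; the context length of 4 matches the FIG. 4 example and is otherwise an assumption.

```python
def context_target_pairs(features, context_len=4):
    """Build (context, target) pairs from an ordered feature sequence.

    Each context holds the `context_len` vectors that precede the target,
    so that a predictor fed with the context is compared to the target.
    """
    pairs = []
    for i in range(context_len, len(features)):
        pairs.append((features[i - context_len:i], features[i]))
    return pairs
```

For a sequence f.sub.1 to f.sub.5, this yields a single pair whose context is {f.sub.1, f.sub.2, f.sub.3, f.sub.4} and whose target is f.sub.5.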

[0153] As a convention, the outputs of the RNN, or of any equivalent learned transformation function LTF.sub.1, are called {o.sub.i}.sub.iε[1;N] when the learned transformation function LTF.sub.1 is implemented in an application method, for example for identifying highlights. The outputs of the RNN or of any equivalent learned transformation function LTF.sub.1 are called {pf.sub.i}.sub.iε[1;N] when the learned transformation function LTF.sub.1 is implemented for training the learned neural network {CNN} or {CNN.sub.L+LTF.sub.1}.

[0154] In other embodiments, the RNN may be replaced by any learned transformation function LTF.sub.1 that aims to predict a feature vector pf.sub.i considering past feature vectors {f.sub.j}.sub.jε[W;i−1] and that aims to train a learned neural network model via backpropagation of computed errors by a loss function LF.sub.1.

[0155] FIG. 4 shows an embodiment wherein the RNN is implemented and FIG. 7 shows an embodiment wherein a learned transformation function LTF.sub.1 replacing the RNN is represented.

[0156] According to an embodiment, the loss function LF.sub.1 comprises the computation of a distance d.sub.1(o.sub.i+1, f.sub.i+1). The distance d.sub.1(o.sub.i+1, f.sub.i+1) is computed between each predicted feature vector pf.sub.i+1 calculated by the RNN and each extracted feature vector f.sub.i+1 calculated by the convolutional neural network CNN. In that implementation, pf.sub.i+1 and f.sub.i+1 correspond to a same-related time sequence video frame vf.sub.i+1.

[0157] According to an embodiment, when the loss function LF.sub.1 is a contrastive loss function CLF.sub.1, it comprises computing a contrastive distance Cd.sub.1 between:

[0158] a first distance d.sub.1(o.sub.i+1, f.sub.i+1) computed between a predicted feature vector o.sub.i+1 and an extracted feature vector f.sub.i+1 for a same-related time sequence video frame vf.sub.i+1; and

[0159] a second distance d.sub.2(o.sub.i+1, f.sub.k) computed between the predicted feature vector o.sub.i+1 corresponding to the feature vector outputted from the CNN and one reference extracted feature vector Rf.sub.n that should be uncorrelated from the video frame vf.sub.i.

[0160] In practice, the reference extracted feature vector Rf.sub.n is ensured to be uncorrelated with the extracted feature vector f.sub.i by only considering frames that are separated by a sufficient period of time from the video frame vf.sub.i, or by considering frames acquired from a different video entirely. In the first case, the index “n” is chosen at least a predefined number of frames before the current frame “i”, for instance n<i−5. In the present invention, an uncorrelated time window UW is defined in which the reference extracted feature vector Rf.sub.n may be chosen.
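The selection of reference indices satisfying n<i−5 can be sketched as follows. The function name `sample_reference_indices`, the 0-based frame indexing, the number of references drawn and the fixed default seed are assumptions made for the illustration; only the gap of 5 comes from the example above.

```python
import random


def sample_reference_indices(i, min_gap=5, count=3, rng=None):
    """Draw indices n of reference frames such that n < i - min_gap.

    i:       index of the current frame (0-based, an assumption).
    min_gap: minimum separation, 5 in the example n < i - 5.
    count:   how many references to draw (hypothetical parameter).
    rng:     optional random.Random instance; seeded by default so that
             the sketch is reproducible.
    """
    rng = rng or random.Random(0)
    candidates = list(range(0, i - min_gap))  # empty when i <= min_gap
    if not candidates:
        return []
    return rng.sample(candidates, min(count, len(candidates)))
```

When the current frame is too close to the start of the sequence, no reference frame of the same video qualifies, and references would have to come from a different video instead.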

[0161] The reference feature vectors Rf.sub.n that are used to define the contrastive distance function Cd.sub.1 may correspond to frames of the same video sequence VS.sub.1 from which the video frames vf.sub.i are extracted, or to frames of another video sequence.

[0162] In an example, the contrastive loss function CLF.sub.1 randomly samples other frames vf.sub.k, or feature vectors f.sub.k, of the video sequence VS.sub.1 in order to define a set of reference extracted feature vectors Rf.sub.i.

[0163] The combination of reference feature vectors Rf.sub.i taken from random other video clips in the dataset, along with feature vectors from the same video clip but outside the predefined “correlation time window” CW, provides the neural network with a mix of “easy” and “hard” tasks. This mix ensures the presence of a useful training signal throughout the training procedure.

[0164] The invention allows extracting reference feature vectors and comparing their distance to a predicted vector, versus that predicted vector's distance to a target vector. This process allows for desired properties of the neural network to be expressed in a mathematical, differentiable loss function, which can in turn be used to train the neural network.

[0165] The training of the neural network allows distinguishing a near-future video frame from a randomly selected reference frame, in order to better distinguish highlights in a video sequence from other video frame sequences.

[0166] FIG. 4 shows a first block named Pos(Pairs) that aims to calculate a first distance d.sub.1 between the extracted feature vector f.sub.k+1 and a predicted feature vector pf.sub.k+1. This first block evaluates distances between the set of positive pairs. A second block named Neg(Pairs) represents the function that computes distances d.sub.2 between the set of extracted feature vectors f.sub.k+1 and reference feature vectors Rf.sub.k.

[0167] The contrastive loss function CLF.sub.1 compares d.sub.1 and d.sub.2 in order to generate a computed error between d.sub.1 and d.sub.2 that is backpropagated to the weights of the neural network.

[0168] To train the model, a positive pair is required, as well as at least one negative pair to contrast against this positive pair. Using a sequence length of 5, as in FIG. 4, leads us to consider:

[0169] the positive pair, by computing a distance d.sub.1 between a true future feature f.sub.5 and a predicted future feature pf.sub.5;

[0170] the negative pair, by computing a distance d.sub.2 between a feature vector f.sub.n from any other random video frame vf.sub.n of one video sequence VS.sub.1 in a predefined dataset and a predicted future feature pf.sub.5.

[0171] According to an embodiment, the loss function LF.sub.1 comprises aggregating each computed contrastive distance Cd.sub.1 for increasing the accuracy of the detection of relevant video frames vf.sub.p.

[0172] The resulting error Er from the contrastive computed distance is backpropagated to update the parameters of the neural network model. This backpropagation allows finding relevant video frame vf.sub.p when the neural network is trained efficiently.

[0173] The loss function LF.sub.1, or more particularly the contrastive loss function CLF.sub.1, comprises a projection module PROJ for computing the projection of each feature vector f.sub.i or o.sub.i. The predicted feature vector pf.sub.i may undergo an additional, and possibly nonlinear, transformation to a projection space. A second step corresponds to the computation of each predicted component of the feature vector in order to generate the predicted feature vector pf.sub.i. This predicted feature vector aims to define a pseudo target for defining an efficient training process of the neural network.
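A projection module of this kind can be sketched as a small two-layer network that maps a feature vector to a lower dimension. This is a minimal sketch: the function name `project`, the weight matrices `W1` and `W2`, the absence of bias terms and the ReLU between the two layers are all assumptions, since the description only states that the transformation is possibly nonlinear.

```python
def project(v, W1, W2):
    """Project vector v through two linear layers with a ReLU in between.

    v:  input feature vector of dimension D (list of floats).
    W1: first-layer weights, one row per hidden unit.
    W2: second-layer weights, one row per output component (d rows, d < D).
    """
    # First layer followed by ReLU (the nonlinearity is an assumption).
    hidden = [max(0.0, sum(w * x for w, x in zip(row, v))) for row in W1]
    # Second layer maps the hidden vector to the projection space.
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]
```

As a usage example, projecting a three-dimensional vector with an identity first layer and a 2x3 second layer returns a two-dimensional vector, which is then the quantity compared in the loss.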

[0174] The objective of the loss function LF.sub.1 is to push the predicted future features and true future features closer together, while pushing the predicted future features further away from the features of some other random image in the dataset.

[0175] FIG. 7 represents a schematic view of the way that a contrastive loss function CLF.sub.1 may be implemented. The computation of an error Er is backpropagated into the CNN and the RNN.

[0176] Uncorrelated Time Window

[0177] The invention allows aggregating feature vectors in a set of reference feature vectors {Rf.sub.i}.sub.l which are supposed to be uncorrelated with a feature vector f.sub.k which is currently processed by the learned transformation function LTF.sub.1 and the contrastive loss function CLF.sub.1. According to a configuration, an uncorrelated window UW corresponds to the video frames occurring outside a predefined time period centered on the timestamp t.sub.k of the frame vf.sub.k. This means that the frames vf.sub.k−7, vf.sub.k−8, vf.sub.k−9, vf.sub.k−10, etc. may be considered as uncorrelated with vf.sub.k, because they are far from the event occurring on frame vf.sub.k. In this case, the uncorrelated time window UW is defined by its frame closest to the frame vf.sub.k, which in that example is the frame vf.sub.k−7. This parameter is called the depth of the uncorrelated time window UW.

[0178] Consider, for example, a duration of 1 second between successive sampled video frames, Δ(t.sub.k, t.sub.k−1)=1 s, obtained with a sampling ratio of 1/25 applied to a video at 25 frames per second. In that example, it may be considered that the frame vf.sub.k−7 is uncorrelated with the video frame vf.sub.k. In other words, it is assumed that, 7 seconds before the video frame vf.sub.k, the frame vf.sub.k−7 is different from the frame vf.sub.k in which an event may occur. In such a configuration, the pair associated with d.sub.2(f.sub.k−7, f.sub.k) is considered as a negative pair, in the same way that the pair associated with d.sub.2(Rf.sub.i, f.sub.k) is considered a negative pair. In this example, the pairs associated with the distances d.sub.1(f.sub.k−6, f.sub.k), d.sub.1(f.sub.k−5, f.sub.k), d.sub.1(f.sub.k−4, f.sub.k), d.sub.1(f.sub.k−3, f.sub.k), d.sub.1(f.sub.k−2, f.sub.k), d.sub.1(f.sub.k−1, f.sub.k) may or may not be defined as positive pairs, but they cannot be defined as negative pairs. Among them, only the pairs associated with d.sub.1(f.sub.k−4, f.sub.k), d.sub.1(f.sub.k−3, f.sub.k), d.sub.1(f.sub.k−2, f.sub.k), d.sub.1(f.sub.k−1, f.sub.k) may be defined as positive pairs according to the definition of a correlation time window CW.
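The rule of this example, with an uncorrelated window depth of 7 and a correlation window depth of 4, can be sketched as a small classification of frame offsets. The function name `pair_role` and the labels `"positive"`, `"negative"` and `"unused"` are hypothetical; the two depths are those of the example and would be configurable in practice.

```python
def pair_role(offset, cw_depth=4, uw_depth=7):
    """Classify the pair (f_{k-offset}, f_k) by its temporal offset.

    offset:   number of sampled frames between the earlier frame and vf_k.
    cw_depth: depth of the correlation time window CW (4 in the example).
    uw_depth: depth of the uncorrelated time window UW (7 in the example).
    """
    if offset < 1:
        raise ValueError("offset must be at least 1")
    if offset <= cw_depth:
        return "positive"  # inside the correlation window
    if offset >= uw_depth:
        return "negative"  # inside the uncorrelated window
    return "unused"  # between the two windows: neither positive nor negative
```

With the default depths, offsets 1 to 4 give positive pairs, offsets 5 and 6 are left out, and offsets of 7 or more give negative pairs.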

[0179] This configuration is well adapted for a video sequence VS.sub.1 of a video game VG.sub.1. However, this configuration may be adapted for another video game VG.sub.2, or for another video sequence VS.sub.2 of the same video game, for example corresponding to another level of said video game.

[0180] Correlation Window

[0181] The invention allows aggregating feature vectors in a set of correlated feature vectors {Cf.sub.i}.sub.l which are supposed to be correlated with a feature vector f.sub.k which is currently processed by the learned transformation function LTF.sub.1 and the contrastive loss function CLF.sub.1. According to a configuration, a correlation time window CW corresponds to a predefined time period centered on the timestamp t.sub.k of the frame vf.sub.k. This means that the frames vf.sub.k−1, vf.sub.k−2, vf.sub.k−3, vf.sub.k−4 may be considered as correlated with vf.sub.k. In this case, the correlation window CW is defined by its frame farthest from the frame vf.sub.k, which in that example is the frame vf.sub.k−4. This parameter is called the depth of the correlation time window CW.

[0182] According to an embodiment, the depth of the correlation time window CW and the depth of the uncorrelated time window may be set at the same value.

[0183] The method according to the invention comprises a controller that allows configuring the depth of the correlation time window CW and the depth of the uncorrelated time window UW. For instance, in a specific configuration they may be chosen with the same depth.

[0184] This configuration may be adapted to the video game or to information related to an event rate. For instance, in a car race video game, numerous events or changes may occur in a short time window. In that case, the correlation time window CW may be set at 3 s, including positive pairs inside the range [t.sub.k−3; t.sub.k] and/or excluding negative pairs from this correlation time window CW. In other examples, the time window is longer, for instance 10 s, including positive pairs inside the range [t.sub.k−10; t.sub.k] and/or excluding negative pairs from this correlation time window CW.
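A configurable pair of window depths can be sketched as a small settings object that converts durations in seconds into depths in sampled frames. The class name `WindowConfig`, its fields and the two presets are hypothetical. The check that the uncorrelated window is not shorter than the correlation window is an assumption, added so that a frame cannot be both a positive and a negative; the 3 s and 10 s values are those of the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowConfig:
    """Depths of the correlation (CW) and uncorrelated (UW) time windows."""

    cw_seconds: float
    uw_seconds: float
    seconds_between_frames: float = 1.0  # 1 frame per second, as in the example

    def __post_init__(self):
        # ASSUMPTION: UW must start at or beyond the end of CW.
        if self.uw_seconds < self.cw_seconds:
            raise ValueError("uw_seconds must be >= cw_seconds")

    @property
    def cw_depth(self):
        """CW depth expressed in sampled frames."""
        return round(self.cw_seconds / self.seconds_between_frames)

    @property
    def uw_depth(self):
        """UW depth expressed in sampled frames."""
        return round(self.uw_seconds / self.seconds_between_frames)


# Hypothetical presets: same depth for both windows, as in one configuration.
CAR_RACE = WindowConfig(cw_seconds=3, uw_seconds=3)
LONGER_WINDOW = WindowConfig(cw_seconds=10, uw_seconds=10)
```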

[0185] Embodiments of the subject matter and the operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on computer storage medium for execution by, or to control the operation of, data processing apparatus.

[0186] A computer storage medium can be, or can be included in, a computer-readable storage device, a computer-readable storage substrate, a random or serial access memory array or device, or a combination of one or more of them. Moreover, while a computer storage medium (e.g. a memory) is not a propagated signal, a computer storage medium can be a source or destination of computer program instructions encoded in an artificially-generated propagated signal. The computer storage medium also can be, or can be included in, one or more separate physical components or media (e.g., multiple CDs, disks, or other storage devices). The operations described in this specification can be implemented as operations performed by a data processing apparatus on data stored on one or more computer-readable storage devices or received from other sources.

[0187] The term “programmed processor” encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, digital signal processor (DSP), a computer, a system on a chip, or multiple ones, or combinations, of the foregoing. The apparatus can include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).

[0188] The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform actions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).

[0189] Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random-access memory or both. The essential elements of a computer are a processor for performing actions in accordance with instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Devices suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

[0190] To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., an LCD (liquid crystal display), LED (light emitting diode), or OLED (organic light emitting diode) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. In some implementations, a touch screen can be used to display information and to receive input from a user. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.

[0191] The present invention has been described and illustrated in the present detailed description and in the figures of the appended drawings, in possible embodiments. The present invention is not however limited to the embodiments described. Other alternatives and embodiments may be deduced and implemented by those skilled in the art on reading the present description and the appended drawings.

[0192] In the claims, the term “includes” or “comprises” does not exclude other elements or other steps. A single processor or several other units may be used to implement the invention. The different characteristics described and/or claimed may be beneficially combined. Their presence in the description or in the different dependent claims does not exclude this possibility. The reference signs cannot be understood as limiting the scope of the invention.

[0193] It will be appreciated that the various embodiments described previously are combinable according to any technically permissible combinations.


CPC classification

G06V20/47 (PHYSICS)
G06V20/41 (PHYSICS)
G06V10/82 (PHYSICS)
G06N3/08 (PHYSICS)
G06V20/40 (PHYSICS)
A63F13/86 (HUMAN NECESSITIES)
A63F13/85 (HUMAN NECESSITIES)
A63F2300/572 (HUMAN NECESSITIES)
G06V10/764 (PHYSICS)

International classification

G06K9/00 (PHYSICS)
A63F13/85 (HUMAN NECESSITIES)
G06N3/08 (PHYSICS)
