METHOD AND SYSTEM FOR SELECTING HIGHLIGHT SEGMENTS

20230230378 · 2023-07-20

    Inventors

    CPC classification

    International classification

    Abstract

    Described are methods and systems for selecting a highlight segment. The computer-implemented method comprises receiving a sequence of frames, and at least one user data; via a converting module, for each frame, selecting a local neighborhood around it, said neighborhood comprising at least one frame; and converting each neighborhood into a feature vector; via a highlighting module, assigning a score to each of the feature vectors based on the user data; via a selection module, selecting at least one highlight segment based on the scoring of the feature vectors; and via an outputting module, outputting the highlight segment. The system comprises a receiving module configured to receive a sequence of frames, and at least one user data; a converting module configured to select a local neighborhood around each frame, said neighborhood comprising at least one frame, and convert each neighborhood into a feature vector; a highlighting module configured to assign a score to each of the feature vectors based on the user data; a selection module configured to select at least one highlight segment based on the scoring of the feature vectors; and an output component configured to output the highlight segment.

    Claims

    1. A computer-implemented method for selecting a highlight segment, the method comprising: receiving a sequence of frames, and at least one user data; via a converting module, for each frame, selecting a local neighborhood around it, said neighborhood comprising at least one frame, and converting each neighborhood into a feature vector; via a highlighting module, assigning a score to each of the feature vectors based on the user data; via a selection module, selecting at least one highlight segment based on the scoring of the feature vectors; and via an outputting module, outputting the highlight segment.

    2. The method according to claim 1 further comprising generating and maintaining a database of video segments and selecting at least one video segment as user data based on at least one characteristic associated with the user.

    3. The method according to claim 1 further comprising, prior to the inputting step, receiving at least one reference video segment indicative of a user's preference and converting it into the user data and wherein converting the video segment comprises converting the reference video segment into a reference feature vector.

    4. The method according to claim 3 wherein the user data comprises a plurality of reference feature vectors obtained by converting a plurality of reference video segments indicative of a user's preference.

    5. The method according to claim 4 wherein the plurality of reference video segments are indicative of different user preferences and wherein the reference video segments are grouped into sets, each said set indicative of a particular user preference, and wherein each set is converted into a distinct user data subset comprising a subset of the reference feature vectors associated with the reference video segments forming part of it.

    6. The method according to claim 5 wherein the feature vectors are assigned a score based on each user data subset and wherein the method further comprises for each feature vector, assigning a score based on a comparison to each of the user data subsets.

    7. The method according to claim 5 further comprising assigning a weight to each of the user data subsets, said weight associated with the user's relative preference towards it.

    8. The method according to claim 1 further comprising, prior to selecting the neighborhood for each frame, via a segmentation module, generating at least one segment, each segment comprising at least one frame of the sequence of frames.

    9. The method according to claim 8 wherein each neighborhood is comprised within a single segment.

    10. The method according to claim 3 wherein assigning scores to the feature vectors comprises comparing each of the feature vectors with each of the reference feature vectors and assigning scores to the associated neighborhoods based on each input feature vector's difference with respect to the closest matching one of the user feature vectors.

    11. The method according to claim 5 wherein assigning scores to the feature vectors further comprises determining which user data subset is closest to each feature vector and assigning it a value based on a comparison between the subset of reference feature vectors and said feature vector.

    12. The method according to claim 11 further comprising accounting for the relative weight of each of the user data subsets when assigning scores to the feature vectors.

    13. The method according to claim 1 further comprising the selection module constructing the highlight segment and wherein the highlight segment comprises a plurality of frames selected from the input sequence of frames.

    14. The method according to claim 13 wherein the highlight segment is constructed by evaluating assigned scores of all feature vectors corresponding to the frames and their neighboring frames and identifying a plurality of neighboring frames with an average best assigned score.

    15. The method according to claim 13 further comprising the selection module constructing a plurality of highlight segments, each comprising a plurality of frames selected from the input sequence of frames, and corresponding to a plurality of distinct neighboring frames with an average highest assigned score.

    16. A system for selecting a video highlight segment, the system comprising: a receiving module configured to receive a sequence of frames, and at least one user data; a converting module configured to, for each frame, select a local neighborhood around it, said neighborhood comprising at least one frame, and convert each neighborhood into a feature vector; a highlighting module configured to assign a score to each of the feature vectors based on the user data; a selection module configured to select at least one highlight segment based on the scoring of the feature vectors; and an output component configured to output the highlight segment.

    17. The system according to claim 16 further comprising at least one database comprising a plurality of user data associated with particular users and wherein the database comprises a plurality of reference video segments and wherein the user data is generated from a plurality of video segments based on at least one user-specific characteristic.

    18. The system according to claim 16 further comprising a segmentation module configured to generate a plurality of segments, each comprising a plurality of frames of the video.

    19. The system according to claim 18 wherein each neighborhood is comprised within a single segment.

    20. The system according to claim 16 wherein the selection module is configured to construct the highlight segment and wherein the highlight segment comprises a plurality of frames selected from the input sequence of frames.

    21. The system according to claim 20 wherein the selection module is configured to construct the highlight segment by evaluating assigned scores of all feature vectors corresponding to the frames and their neighboring frames and identifying a plurality of neighboring frames with an average best assigned score.

    22. The system according to claim 16 further comprising a user terminal configured to display at least the highlight segment output by the output component.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    [0144] FIG. 1 depicts an embodiment of a method for selecting highlight segments according to an embodiment of the present invention;

    [0145] FIG. 2 schematically shows a system for selecting highlight segments with several optional elements;

    [0146] FIG. 3 shows an exemplary procedure for converting frame sequences according to an aspect of the present invention;

    [0147] FIG. 4 shows a schematic embodiment of the present advantageous procedure for selecting highlight segments.

    DESCRIPTION OF EMBODIMENTS

    [0148] FIG. 1 shows a method according to an embodiment of the present invention. The method can be advantageously used to select highlight segments. Particularly, the method can be used to generate and output personalized video highlights or “segments” for users based on certain predetermined user data.

    [0149] In a first step, S1, a sequence of frames is received or input together with at least one user data. The frames can comprise images such as frames in a video. Additionally or alternatively, the frames can comprise point clouds or light fields. Further, 3D video frames or compilations from a plurality of cameras can be considered as frames within the present disclosure. Put differently, the frames can correspond to encodings or projections of real-world or computer-generated content.

    [0150] In a preferred embodiment, a video comprising a sequence of frames is input. The video may be temporally subsampled, so that the frames may not be directly consecutive. In other words, some frames from the video may be skipped. That is, the sequence of frames can also be irregularly sampled from the video.

    [0151] The sequence of frames can then be processed as per S2, to select a local neighborhood around each frame. This can be performed by a converting module. The generated neighborhoods may each comprise at least one frame. The local neighborhood of a given frame may comprise the given frame, along with a few frames adjoining it. In the case of a video or a processed video serving as input, the neighborhood can correspond to a temporal interval centered around the given frame. The neighborhood may correspond to e.g. 5 to 21 frames, with the given frame and between 2 and 10 frames selected on either side of it according to the sequence of frames that is input. The neighborhood may also correspond to a single frame, which would then be the given frame.
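    As an illustration of the neighborhood selection in step S2, a minimal sketch follows. The window radius, and the clamping of windows at the edges of the sequence, are assumptions for illustration only and are not mandated by the method.

```python
def select_neighborhoods(num_frames, window_radius=2):
    """For each frame index, return its local neighborhood: the frame itself
    plus up to `window_radius` frames on either side, clamped to the bounds
    of the sequence (so edge frames get smaller neighborhoods).
    The clamping behavior is an assumption, not specified by the method."""
    neighborhoods = []
    for k in range(num_frames):
        start = max(0, k - window_radius)
        end = min(num_frames, k + window_radius + 1)
        neighborhoods.append(list(range(start, end)))
    return neighborhoods
```

    With `window_radius=2`, interior frames receive a 5-frame neighborhood, consistent with the 5 to 21 frame range mentioned above.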

    [0152] In the third step, the neighborhoods can be converted into feature vectors by a converting module (the converting module may be used to perform both steps S2 and S3, but they can also be performed by separate modules, submodules, algorithms, or the like). The feature vectors can comprise embeddings and/or vectors in an n-dimensional vector space.
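    The conversion of a neighborhood into a feature vector can be sketched as follows. Mean-pooling of per-frame features is a stand-in assumption for illustration; in practice a learned embedding, such as one produced by a convolutional neural network, would be used.

```python
def neighborhood_to_feature_vector(frames):
    """Reduce a neighborhood (a list of per-frame feature lists) to a single
    feature vector by averaging each component across the frames.
    Mean-pooling is purely illustrative; a learned embedding would be used
    in a practical implementation."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / n for i in range(dim)]
```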

    [0153] In the fourth step, S4, the feature vectors are assigned scores based on the user data. This step can be performed by the highlighting module, to which the user data may also be input (e.g. from a database comprising user data and/or reference data based on which user data can be generated). Assigning scores to the neighborhoods can comprise determining how similar each of them is to the user's preference for highlight segments (e.g. video highlight segments), defined or represented by the user data, and sorting them according to this similarity. For example, the user data may comprise one or more video segments corresponding to a given user's interests or preferences. Such user-specific segments may then be used as a reference or benchmark for the incoming segments, so that segments of the input sequence of frames (e.g. a video) most similar to the given user's interests may be ranked as such or assigned an appropriate score.

    [0154] Step S5 comprises selecting a highlight segment based on the assigned scores. This can be done by a selection module. The selection module can construct the highlight segment based on a set of scores assigned to each of the neighborhoods. This construction can be based on considering not only the top assigned scores, but average scores of a plurality of consecutive frames. This can help avoid selecting the highlight segment based on one top-scored frame, which may be due to noise. In other words, the selection or construction of the highlight segment can be performed by evaluating the scores assigned to each frame (represented by a neighborhood) and selecting a plurality of consecutive frames which were all assigned a relatively high score on average.
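    The averaging-based selection of step S5 can be sketched as follows; the fixed window length and the tie-breaking towards the earliest window are assumptions for illustration.

```python
def best_highlight_window(scores, window=3):
    """Return (start, end) frame indices of the contiguous run of `window`
    frames with the highest average assigned score, so that a single
    noise-inflated top score does not dictate the highlight on its own."""
    best_start, best_avg = 0, float("-inf")
    for s in range(len(scores) - window + 1):
        avg = sum(scores[s:s + window]) / window
        if avg > best_avg:
            best_start, best_avg = s, avg
    return best_start, best_start + window
```

    For example, with scores [0, 0, 9, 0, 0, 4, 4, 4] and a 3-frame window, the run of three moderately high scores (frames 5 to 7, average 4) is preferred over any window containing the isolated spike at frame 2 (average at most 3).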

    [0155] In step S6, the highlight segment is output. The outputting may be performed by an output component. In one specific example, the outputting may refer to providing or displaying the segment to the user.

    [0156] To summarize, the present advantageous method may be used to identify video highlights based on individual user preference. It may also then be used to automatically generate a video highlight for a given user and provide it to them. For example, the present method may be used as part of a content provider's algorithm for ensuring that each user gets shown a preview or highlight of a video that may pique their interest and maximize their engagement, resulting in them watching the entire video.

    [0157] FIG. 2 schematically depicts an embodiment of a system for selecting video segments according to an aspect of the present invention. Some elements of the system are optional, and are depicted in FIG. 2 merely on an exemplary basis. A skilled person will understand that such elements may be skipped or replaced by appropriate alternatives.

    [0158] In FIG. 2, an exemplary sequence of frames (depicted as a video) 1 is shown to be input into a converting module 10. The converting module 10 may comprise an algorithm and/or a routine and/or a subroutine that can be implemented to run on a local and/or remote and/or distributed processor so as to execute certain instructions. In other words, the converting module 10 may comprise a computer-implemented algorithm with a particular purpose and defined inputs and outputs.

    [0159] The converting module 10 can select a local neighborhood around each frame. The local neighborhood may correspond to a few frames on each side of the given frame. If the input comprises a video, the local neighborhood may correspond to a short excerpt of this video with a few frames before and after the central frame forming it.

    [0160] The converting module 10 then converts the neighborhoods into feature vectors 12. Note that neighborhood selection and conversion into feature vectors can also be done by separate modules, submodules, algorithms, or the like.

    [0161] An optional part of the system comprises the segmentation module 60. The segmentation module 60 can generate a plurality of segments 62 based on the sequence of frames 1. The segments 62 may be generated, for example, by running the sequence of frames 1 through a neural network 64, which can be configured to extract appropriate segments from it. The segments 62 may be defined based on a comparison between frames of the video 1 to determine e.g. a change of scene. In other words, the segmentation module 60 can comprise a shot detector, which can identify and separate different shots present in the input sequence of frames (or a video). The neural network 64 may be a convolutional neural network specifically trained to extract segments from videos. The generated segments 62 may be used to ensure that each of the neighborhoods 12 does not intersect a segment boundary. In other words, each neighborhood 12 may be comprised within one segment only. This is useful, as the scoring of neighborhoods as feature vectors can be made more accurate by ensuring that each of the neighborhoods converted into a feature vector does not include a segment boundary (e.g. a shot boundary).
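    The constraint that each neighborhood lies within a single segment can be sketched as follows; representing segment boundaries as (start, end) index pairs is an assumption for illustration.

```python
def clip_neighborhood_to_segment(k, radius, segment_bounds):
    """Return the neighborhood of frame k clipped so that it never crosses
    the boundary of the segment (e.g. shot) containing k.
    `segment_bounds` is a list of (start, end) pairs, end exclusive."""
    for start, end in segment_bounds:
        if start <= k < end:
            lo = max(start, k - radius)
            hi = min(end, k + radius + 1)
            return list(range(lo, hi))
    raise ValueError("frame index not covered by any segment")
```

    For a frame near a shot boundary, the window is truncated on the side of the boundary rather than mixing frames from two different shots.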

    [0162] The generated feature vectors 12 may then be input into a highlighting module 20 together with a user data 42. The user data 42 may be stored in a user data database 40. Additionally or alternatively, data that can be used to immediately provide user data 42 can optionally be stored in the database 40. For example, the database 40 may comprise a selection of video segments generally considered interesting or relevant by a plurality of users. The user data 42 may then be further personalized for a particular user by selecting only those video segments that they would find particularly interesting. The selection can be done based on user characteristics (such as e.g. demographic parameters) and/or be user-defined.

    [0163] The user data 42 may be indicative of a given user's preference for videos. For example, the user data 42 may comprise user-selected (or automatically collected) video segments indicative of their interests. Such segments may be grouped into sets indicative of different categories of user interests. For example, one set may comprise user-selected videos showing cats, and another set may comprise user-selected videos showing paragliding. The user data 42 may further comprise user-selected or user-specific videos that have been transformed into a format where they can be easily used for benchmarking or filtering the extracted segments from the input video. For example, the user data 42 may comprise user-preferred videos converted to reference feature vectors such as n-dimensional vectors. The frames from the input sequence of frames can then also be converted into such feature vectors, and compared with the user data 42 by computing the distance between them. If the user data 42 comprises multiple sets of user-preferred videos (optionally converted into a particular format), each of the feature vectors of the input video may be compared with each of the sets, and a similarity or “closeness” score may be computed for each of the cases. In this scenario, the set for which the feature vector has the highest similarity score may be considered as a reference set and the distance or difference between the feature vector (of the input video segment) and the reference feature vector corresponding to this set may be further considered for assigning a score.
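    The comparison against multiple sets of user data described above can be sketched as follows. The Euclidean distance and the 1/(1+d) mapping from distance to similarity score are assumptions for illustration; the document leaves the exact metric and scoring function open.

```python
import math

def score_against_user_subsets(feature_vector, user_subsets):
    """Compare a feature vector with several subsets of reference feature
    vectors (one subset per user-interest category). The subset containing
    the closest reference vector is chosen as the reference set, and the
    score is derived from that smallest distance (smaller distance ->
    higher score). Both choices are illustrative assumptions."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    best_idx, best_d = 0, float("inf")
    for i, subset in enumerate(user_subsets):
        d = min(dist(feature_vector, ref) for ref in subset)
        if d < best_d:
            best_idx, best_d = i, d
    score = 1.0 / (1.0 + best_d)  # assumed distance-to-score mapping
    return best_idx, score
```

    The function returns both the index of the closest user-interest subset and the similarity score, mirroring the two-stage comparison (closest set, then distance within it) described above.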

    [0164] The user data 42 may also optionally comprise auxiliary data, which can comprise e.g. metadata. This can comprise data related to videos, such as comments, reactions to videos, graphic features added by users to various videos, or the like.

    [0165] The highlighting module 20 is configured to output scores assigned to the input feature vectors based on the user data 42. The assigned scores 22 may be given e.g. based on similarity (or similarity score) to the user data 42. Put differently, each of the feature vectors from the input sequence of frames may be compared to the user data 42 (indicative of a user's preference for particular videos), and the segments most similar to the user's preference as indicated by the user data 42 are ranked as top segments or assigned highest scores. However, the scores may not correspond purely to the distance between feature vectors and reference vectors. Rather, this distance may be the input to the highlighting module, which can then use machine learning techniques to select the highest ranked segment, which may not be the one corresponding to the feature vector with the lowest distance to one of the reference feature vectors. The machine learning techniques used to assign scores to the feature vectors (and therefore the frames) may comprise neural networks and may be trained with annotated data (such as e.g. segments ranked as most entertaining by a group of test users).

    [0166] A selection module 30 receives the assigned scores from the highlighting module 20 and selects or constructs a highlight segment 32 based on these scores. The highlight segment 32 may comprise a plurality of sequential frames that were assigned a higher average score than all other subsets of sequential frames. In other words, the scores are evaluated by the selection module, and a certain subset of the input sequence of frames is selected. This selected subset then comprises the sequential frames with the highest average assigned score. This can be done to avoid selecting a highlight segment based on a single top assigned score, as this score may be a fluke or due to various sources of noise.

    [0167] The highlight segment 32 is then output by the output module 70. The output module 70 may provide the highlight segment to the user associated with the user data 42. For example, the output module 70 may send the highlight segment 32 to a user terminal 50, which can then play the segment to the user. As the highlight segment 32 is determined based on the user's individual preference, the user may immediately know whether they would find the input sequence of frames (or video) 1 interesting in its entirety, and whether they should watch it or not. In this way, the user may advantageously save time by only watching videos they are likely to be interested in. The user terminal 50 is also optional. The output module 70 may instead output the highlight segment 32 to a general user interface and/or save it in a database for future use.

    [0168] The present system can automatically select highlight segments from videos based on users' preferences. Furthermore, it can take into account different categories of user preferences, such as, for example, an interest in cats and an interest in paragliding. The different interests can be separately considered as different subsets of user data and therefore different sets of reference feature vectors. Each of the frames of the input video (and the associated neighborhoods) can be advantageously compared with the most similar subset of user data.

    [0169] FIG. 3 schematically depicts an exemplary implementation of a part of the present method for selecting video segments. The depicted part presents an example of converting an input sequence of frames into a format in which they can be quantitatively compared with a user data. More specifically, the frames are input into a converting module, and feature vectors are output. The converting module may comprise a neural network, such as a convolutional neural network. The resulting feature vector may be an embedding or a vector in an n-dimensional space, which can then be compared with similar user-specific reference feature vectors indicative of a user's preference for videos.

    [0170] FIG. 4 depicts a schematic embodiment of a personalized highlighting system and method according to an aspect of the present invention.

    [0171] The neighborhoods or temporal aggregation windows of an input video (or sequence of frames) can be generated based on each frame of said video (or sequence of frames), with a certain number of frames (preferably corresponding to a certain temporal interval) taken on each side of the frame in question, thereby defining an interval corresponding to a neighborhood. In the figure, the frame in question is denoted by k, and the frames on either side of it (based on the sequence of frames in the video) as k−1 and k+1.

    [0172] The neighborhoods extracted from an input video may be converted into feature vectors (denoted in the figure as f.sub.k), and then input into a highlighting module. The highlighting module may then perform a three-step process.

    [0173] In a first step, the feature vectors can be compared to reference feature vectors of user data (with potentially different subsets of reference feature vectors used). The comparison may comprise computing the distance between the vectors and, for each feature vector, identifying the reference feature vector with the smallest distance and its corresponding subset. This is shown by depicting the different subsets of user data as filters, where one filter may comprise videos showing jumping cats, and another hockey fouls. The feature vectors input into the personalized highlighting module may be filtered according to user data indicative of a user's preference. The preference may comprise different categories or sets. In FIG. 4, the user preference comprises two sets: videos related to cats jumping and videos related to hockey fouls. These can be treated separately, so that each preference category is individually evaluated.

    [0174] In a second step, scores can be assigned to the feature vectors, preferably based on the smallest distance between each feature vector and the reference feature vectors corresponding to the user's interests.

    [0175] In a third step, which has been previously referred to as performed by the selection module, the scores assigned to each feature vector (corresponding to neighborhoods based around individual frames) are evaluated. This can be done, for example, by analyzing a curve comprising all of the scores assigned to the individual frames, and locating intervals of this curve with the highest average scores. An exemplary highlight score curve is shown in the graph in FIG. 4. Selection and/or construction of the highlight segment may be performed based on an analysis of this curve.

    [0176] The personalized highlighting module can be advantageously trained to take a user's history into account, in such a way that no additional training is required to add new users with potentially very different preferences.

    [0177] Whenever a relative term, such as “about”, “substantially” or “approximately” is used in this specification, such a term should also be construed to also include the exact term. That is, e.g., “substantially straight” should be construed to also include “(exactly) straight”.

    [0178] Whenever steps were recited in the above or also in the appended claims, it should be noted that the order in which the steps are recited in this text may be the preferred order, but it may not be mandatory to carry out the steps in the recited order. That is, unless otherwise specified or unless clear to the skilled person, the order in which steps are recited may not be mandatory. That is, when the present document states, e.g., that a method comprises steps (A) and (B), this does not necessarily mean that step (A) precedes step (B), but it is also possible that step (A) is performed (at least partly) simultaneously with step (B) or that step (B) precedes step (A). Furthermore, when a step (X) is said to precede another step (Z), this does not imply that there is no step between steps (X) and (Z). That is, step (X) preceding step (Z) encompasses the situation that step (X) is performed directly before step (Z), but also the situation that (X) is performed before one or more steps (Y1), . . . , followed by step (Z). Corresponding considerations apply when terms like “after” or “before” are used.