Video Advertising Signage Replacement
20220398823 · 2022-12-15
Inventors
CPC classification
G06V10/25
PHYSICS
G06T3/40
PHYSICS
International classification
G06V10/25
PHYSICS
G06T3/40
PHYSICS
Abstract
A system and methods are provided for determining an embedding region in a video stream, including: generating a mask of an initial estimate of an embedding region in a video frame of the video stream, wherein an initial boundary is a boundary of the initial estimate of the embedding region; determining a refined boundary as a region demarked by four best line segments; transforming a replacement image to fit the dimensions of the refined boundary; and inserting the transformed replacement image into the video frame, within the refined boundary.
Claims
1. A method for determining an embedding region in a video stream, comprising: a) generating a mask of an initial estimate of an embedding region in a video frame of the video stream, wherein an initial boundary is a boundary of the initial estimate of the embedding region; b) determining a refined boundary as a region demarked by four best line segments or four corners with sub-pixel resolution; c) transforming a replacement image to fit the dimensions of the refined boundary; and d) inserting the transformed replacement image into the video frame, within said refined boundary.
2. The method of claim 1, further comprising: a) before inserting the transformed replacement image into the video frame, analyzing the properties of the background image of said video frame, in the vicinity of the replacement image; and b) making adaptations in said transformed replacement image to comply with said background properties.
3. The method of claim 2, wherein the properties of the background includes one or more of the following: frequency components; focus/sharpness; blur/noise level; geometric transformations; illumination.
4. The method of claim 1, wherein determining the refined boundary further comprises: a) identifying multiple line segments in the video frame; b) calculating line segment scores according to distances between pixels of each of the multiple line segments and the initial embedding boundary and according to gradient values at the pixels of each of the multiple line segments; and c) determining from the line segment scores four best line segments as line segments with best line segment scores with respect to four sides of the initial boundary.
5. The method of claim 4, wherein determining the line segment scores includes calculating for each line segment the value of Σ∀p∈l ∇I(p)·l(p)·f(p)/(d(p)+1), where d(p) is an average distance from the boundary of the initial estimate of the embedding region, l(p) is the pixel of the line segment, f(p) is an importance function, and ∇I(p) is a gradient of the line segment at a pixel p.
6. The method of claim 4, further comprising calculating the line segment scores by generating a distance map of distances between each pixel of the video frame and the initial boundary, and mapping each line segment to the distance map.
7. The method of claim 4, further comprising calculating the line segment scores as average distances of multiple pixels of the line segments from the initial boundary.
8. The method of claim 1, further comprising refining a position and orientation of the four best line segments by calculating normal distances between pixels of the best line segments and pixels of the initial boundary.
9. The method of claim 1, wherein determining the refined boundary further comprises applying a machine learning model trained to identify best line segments in an image with respect to an initial boundary.
10. The method of claim 9, further comprising: a) mapping the vertices of predicted boundaries to a predicted polygon P and its corresponding mask representation; b) defining a loss function representing the difference between said predicted polygon and an actual corresponding frame F; and c) further training the machine learning model by updating the parameters of said machine learning model to reduce said difference.
11. A system for determining an embedding region in a video stream, comprising a processor and memory, wherein the memory includes instructions that when executed by the processor implement the steps of: a) generating a mask of an initial estimate of an embedding region in a video frame of the video stream, wherein an initial boundary is a boundary of the initial estimate of the embedding region; b) determining a refined boundary as a region demarked by four best line segments or four points with sub-pixel resolution; c) transforming a replacement image to fit the dimensions of the refined boundary; and d) inserting the transformed replacement image into the video frame, within the refined boundary.
12. A system according to claim 11, in which the processor is further adapted to: a) analyze the properties of the background image of the video frame, in the vicinity of the replacement image, before inserting the transformed replacement image into said video frame; and b) make adaptations in said transformed replacement image to comply with said background properties.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0019] For a better understanding of various embodiments of the invention and to show how the same may be carried into effect, reference will now be made, by way of example, to the accompanying drawings. Structural details of the invention are shown to provide a fundamental understanding of the invention, the description, taken with the drawings, making apparent to those skilled in the art how the several forms of the invention may be embodied in practice.
DETAILED DESCRIPTION OF THE INVENTION
[0024] A workflow of the methodology applied here is presented in
[0025] Embedding a replacement image into a video requires detecting an embedding region in an image space of a video frame and tracking that region in multiple video frames of the video. Hereinbelow, a detailed method and system is described for detecting and refining an embedding region to designate a region of pixels of the image that may then be replaced with a new, personalized advertisement.
[0026] A machine learning algorithm may be applied to detect an initial, candidate embedding region in one or more video frames. The advantage of using a machine learning algorithm is that it can autonomously define and detect advertisement-related features. Training the machine learning algorithm may include collecting and labeling a large repository of advertisement images (e.g., signage) from various topics with different contexts. The advertisement in each image of a training set of images may be marked by an enclosing polygon and labeled accordingly.
[0027] In some embodiments, a convolutional neural network (CNN) model followed by a recurrent neural network may be trained to detect advertisements that will serve as embedding regions. For example, a Mask R-CNN architecture may be modified using the annotated advertisement database generated with the labeling process described above. Training the model on the generated database enables it to detect and to segment an advertisement in an image. Such training generates a machine learning model that is able to create a mask associated with pixels of an advertisement.
[0028] Because of the time required to manually label a large number of video frames, a shortcut may be used: advertisements are detected in initial video frames and then tracked in the following video frames, thereby labeling additional video frames automatically. Tracking over additional video frames also provides verification of the accuracy of the labeling of the advertisements in each video frame. Temporal coherence among consecutive video frames is utilized, and the steps specified below are applied to obtain pixel-level accuracy of an embedding region. At the end of the training, a large database of video segments is available, which includes accurately labeled advertisements and which serves as a dataset for training a machine learning (ML) model.
[0029] At step 22, a generated, labeled database trains a CNN-based machine learning model to detect an initial estimate of an embedding region (e.g., advertisement or signage) in an image or video shot and to generate a mask of the initial embedding region. Processing by the machine learning model may be performed on a video stream that is live, or "off-line" on a stored video. For real-time (live) video segments, a look-ahead buffer including multiple video frames is used (as there usually is a buffered delay at the receiving end). Before applying the initial embedding region detection, a video segment is subdivided into video shots, where a video shot is defined as a sequence of video frames between two video cuts. A video cut is an abrupt video transition, which separates consecutive video shots. Each video shot is typically processed independently.
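The shot-subdivision step described above can be sketched as a simple histogram-difference cut detector. This is an illustrative sketch only, not the detector used by the invention; the 32-bin histogram, the L1 distance, and the 0.5 threshold are all assumed values.

```python
import numpy as np

def detect_cuts(frames, threshold=0.5):
    """Split a video segment into shots by detecting abrupt transitions.

    A cut is declared when the L1 difference between the normalized
    intensity histograms of consecutive frames exceeds `threshold`.
    Returns the index of the first frame of each shot.
    """
    cuts = [0]
    prev_hist = None
    for i, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=32, range=(0, 255))
        hist = hist / max(hist.sum(), 1)
        if prev_hist is not None:
            # L1 distance between normalized histograms lies in [0, 2]
            if np.abs(hist - prev_hist).sum() > threshold:
                cuts.append(i)
        prev_hist = hist
    return cuts
```

Each returned index starts a video shot, which can then be processed independently as described above.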
[0030] Initial embedding regions within a video shot are detected and then ranked according to several parameters, such as the size of the detected regions, the visibility duration, and the shape changes of the embedding region over the video frames of the shot. Typically, a higher priority is given to larger regions, regions which are visible over many video frames, and regions whose shape does not change significantly across the video frames of the shot, i.e., whose orientation with respect to the camera does not change significantly. Embedding regions with high scores are then selected for advertisement embedding and tracking in the subsequent steps of process 20 described below.
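The ranking described above might be sketched as a weighted score over size, visibility duration, and shape stability. The field names, the weights, and the linear form of the score are hypothetical choices for illustration, not values from the specification.

```python
def rank_embedding_regions(regions, w_size=1.0, w_duration=1.0, w_stability=1.0):
    """Rank candidate embedding regions within a shot, best first.

    Each region is a dict with (hypothetical) keys:
      'area'         - mean pixel area over the frames it appears in
      'n_frames'     - number of frames in which the region is visible
      'shape_change' - 0..1, how much the shape varies over the shot
    Larger, longer-visible, more stable regions score higher.
    """
    def score(r):
        return (w_size * r['area']
                + w_duration * r['n_frames']
                + w_stability * (1.0 - r['shape_change']))
    return sorted(regions, key=score, reverse=True)
```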
[0031] Each initial, estimated embedding region, is bounded by an “estimated” initial boundary. The determination of pixels that define the boundary is the result of a probabilistic approach used by CNNs in general. To get a higher, pixel-level accuracy, the estimated embedding boundary may then be refined by the methods described hereinbelow with respect to a boundary refinement step 24. Detecting the initial boundary and boundary refinement may be carried out by using a deep learning model or a CNN-based deep learning model.
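To illustrate how a pixel-level initial boundary can be read off the mask produced by such a model, the following sketch (assuming the probabilistic mask has already been thresholded to a binary mask) marks every mask pixel that has at least one 4-neighbour outside the mask:

```python
import numpy as np

def mask_boundary(mask):
    """Return the boundary pixels of a binary mask as a boolean array.

    A pixel is on the boundary if it belongs to the mask but at least
    one of its 4-neighbours does not (or it lies on the image border).
    """
    m = mask.astype(bool)
    padded = np.pad(m, 1, constant_values=False)
    # A pixel is "interior" when all four of its neighbours are in the mask
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    return m & ~interior
```

The resulting boundary pixels are what the refinement step 24 then snaps to the best line segments.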
[0033] Returning to the process 20 of
[0034] Returning to the process 20 of
[0036] Every line segment is mapped onto the distance map, such that the distance map values for each pixel of the line segment can be calculated. At step 34, scores for each segment may be calculated. The calculation may be performed by the following equation:
Σ∀p∈l ∇I(p)·l(p)·f(p)/(d(p)+1) (1)
where d(p) is the average distance from pixels of the segment to the detected boundary, ∇I(p) is the gradient at pixel p, l(p) is the pixel of the segment, and f(p) is an importance function.
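Equation (1) and the distance map described above can be sketched as follows. The brute-force `distance_map` stands in for an optimized distance transform (e.g., `scipy.ndimage.distance_transform_edt`), the importance function f(p) defaults to 1, and all function names are illustrative assumptions.

```python
import numpy as np

def distance_map(boundary_mask):
    """Distance from every pixel to the nearest boundary pixel (brute force)."""
    by, bx = np.nonzero(boundary_mask)
    ys, xs = np.indices(boundary_mask.shape)
    d = np.sqrt((ys[..., None] - by) ** 2 + (xs[..., None] - bx) ** 2)
    return d.min(axis=-1)

def segment_score(pixels, grad_mag, dmap, importance=None):
    """Score of one line segment in the spirit of equation (1).

    pixels     - list of (y, x) integer coordinates on the segment
    grad_mag   - 2-D array of image gradient magnitudes |grad I|
    dmap       - distance map to the initial boundary
    importance - optional per-pixel importance f(p); defaults to 1
    """
    score = 0.0
    for i, (y, x) in enumerate(pixels):
        f = 1.0 if importance is None else importance[i]
        score += grad_mag[y, x] * f / (dmap[y, x] + 1.0)
    return score
```

Segments that are long, lie on strong gradients, and hug the initial boundary score highest, matching the analysis of LS 200, LS 202, and LS 204 below.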
[0037] For the three line segments, LS 200, LS 202, and LS 204, the following analyses are performed by the system by applying the above equation (1):

[0038] The line segment LS 200 is a relatively long segment lying on a strong edge; thus, it has large gradient values, which means that the sum Σ∀p∈l ∇I(p)·l(p)·f(p) has a relatively high value. Moreover, LS 200 is close to the detected boundary (d(p) is small); therefore, the full expression Σ∀p∈l ∇I(p)·l(p)·f(p)/(d(p)+1) has a relatively high value, which means that LS 200 is a strong contender for one of the edges of the object.

[0039] The line segment LS 204 is a short line segment and is far away from the boundary, which results in a low value of the sum Σ∀p∈l ∇I(p)·l(p)·f(p) and a high value of d(p). Consequently, the total expression Σ∀p∈l ∇I(p)·l(p)·f(p)/(d(p)+1) has a low value. This means that LS 204 may not be considered as one of the advertisement edges.

[0040] The line segment LS 202 is similar to LS 200 with respect to its length and the strength of the edge it lies on; therefore, the two lines will have a similar value for the sum Σ∀p∈l ∇I(p)·l(p)·f(p). However, LS 202 is farther away from the detected boundary; thus, its average distance d(p) yields a larger value compared to LS 200. As a result, LS 200 may be preferred to LS 202 as the bottom edge of the embedding region, as described below.
[0041] An alternative method of boundary refinement may proceed by projecting pixels of line segments onto the initial boundary, as indicated by a step 40. The line segment scores may be calculated based on some or all pixels on the line segments. Scores may also be calculated as follows: for every pixel on the initial boundary (which is based, for example, on the Mask R-CNN's prediction or on other machine learning methods), and for every line segment within a threshold distance, assign the line segment a score based on the line segment's length, its normal (line orientation), and the boundary's normal at the pixel. Then, project the pixel of the boundary onto the line segment with the highest score.
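A sketch of this projection-based refinement, assuming line segments are given as endpoint pairs and a unit normal of the boundary at the pixel is precomputed. The `candidate_score` below (segment length weighted by normal alignment) is one plausible reading of the scoring described above, not a formula from the specification.

```python
import numpy as np

def candidate_score(seg, boundary_normal):
    """Score a line segment for one boundary pixel: prefer long segments
    whose normal agrees with the boundary normal at that pixel."""
    a, b = np.asarray(seg[0], float), np.asarray(seg[1], float)
    d = b - a
    length = np.linalg.norm(d)
    n = np.array([-d[1], d[0]]) / length          # unit normal of the line
    alignment = abs(np.dot(n, boundary_normal))   # 1 when normals are parallel
    return length * alignment

def project_onto_best_segment(p, segments, boundary_normal):
    """Project boundary pixel p onto the highest-scoring line segment."""
    best = max(segments, key=lambda s: candidate_score(s, boundary_normal))
    a, b = np.asarray(best[0], float), np.asarray(best[1], float)
    d = b - a
    t = np.dot(np.asarray(p, float) - a, d) / np.dot(d, d)
    return a + t * d
```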
[0042] Subsequently, at step 42, the boundary lines may be rotated and translated by small increments to determine whether such transformations better conform to the edge of the embedding region mask. The process is indicated graphically in
[0043] For example, as shown in
[0044] ⟨normal(p), normal(l)⟩·distance(p, l) + α·θ + β·t, where normal(p) is the normal of the boundary at pixel p; normal(l) is the normal of the line l; θ and t are the rotation and translation of the transformed line with respect to the boundary edge; and α and β are weighting factors. The line with the lowest sum value for each boundary edge is selected as the representative edge.
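The small-increment search of step 42 can be sketched as a grid search over rotations θ and normal translations t of a line in normal form (ρ = x·cos a + y·sin a). The data term below sums the normal distances from the boundary pixels to the candidate line, with α and β penalizing large transformations; the grid ranges and the weights are illustrative assumptions.

```python
import numpy as np

def line_cost(theta, t, boundary_pts, base_theta, base_rho, alpha=0.1, beta=0.1):
    """Cost of rotating the base line by theta and translating it by t
    along its normal, evaluated against the boundary pixels."""
    a = base_theta + theta
    rho = base_rho + t
    n = np.array([np.cos(a), np.sin(a)])
    dists = np.abs(boundary_pts @ n - rho)        # normal distances to the line
    return dists.sum() + alpha * abs(theta) + beta * abs(t)

def refine_line(boundary_pts, base_theta, base_rho,
                thetas=np.linspace(-0.05, 0.05, 11),
                ts=np.linspace(-2, 2, 9)):
    """Grid-search small rotations/translations; keep the lowest-cost line."""
    best = min(((line_cost(th, t, boundary_pts, base_theta, base_rho), th, t)
                for th in thetas for t in ts))
    _, th, t = best
    return base_theta + th, base_rho + t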
[0045] A third method of boundary refinement, indicated as step 50, includes identifying the best line segments for a refined boundary by a trained CNN. This method is described in further detail hereinbelow with respect to
[0046] Returning to process 20 of
[0047] After determining the boundary lines based on the four best line segments, corners 702 of the boundary are also determined. That is, after four edges are calculated, thereby determining a refined boundary of a refined (or “optimized”) embedding region (a quadrilateral), the corners defining the region are also calculated.
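Once the four best lines are fixed, the corners follow from pairwise line intersections; representing each line as (a, b, c) with a·x + b·y = c, solving a 2×2 linear system per corner gives sub-pixel coordinates. A minimal sketch:

```python
import numpy as np

def line_intersection(l1, l2):
    """Intersection of two lines given as (a, b, c) with a*x + b*y = c.

    Solving the 2x2 system yields sub-pixel corner coordinates.
    """
    A = np.array([l1[:2], l2[:2]], dtype=float)
    c = np.array([l1[2], l2[2]], dtype=float)
    return np.linalg.solve(A, c)

def quadrilateral_corners(top, right, bottom, left):
    """Corners of the refined boundary from its four best lines,
    ordered clockwise starting at the top-left."""
    return [line_intersection(top, left),
            line_intersection(top, right),
            line_intersection(bottom, right),
            line_intersection(bottom, left)]
```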
[0048] At step 62, an image that is to replace the embedding region may be transformed to the dimensions of the refined embedding region boundary and inserted in place of the embedding region in the original video frame. The replacement is indicated as replacement 1100 in
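The transformation of step 62 is typically a perspective warp: a homography is estimated from the replacement image's corners to the refined corners, and the image is resampled through it (in practice, a library routine such as OpenCV's `warpPerspective` would do the resampling). A sketch of the standard direct linear transform (DLT) estimation, offered as an assumption rather than code from the specification:

```python
import numpy as np

def homography(src, dst):
    """Estimate the 3x3 homography mapping src corners to dst corners (DLT).

    src, dst - 4x2 arrays of corresponding points.
    """
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    # The homography is the null vector of A (last right-singular vector)
    _, _, vt = np.linalg.svd(np.array(A, float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def warp_point(H, p):
    """Apply homography H to a 2-D point (homogeneous divide)."""
    x, y, w = H @ np.array([p[0], p[1], 1.0])
    return np.array([x / w, y / w])
```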
[0051] As in process 1200, the feature map 1258 indicates “regions of interest,” meaning regions that have traits of embedding regions. The shape of the output feature map is W×H×D, where W and H are the width and height of the input image and D is the depth of the feature map. Feature extraction is done by a Feature Extractor Model 1301 (shown in
[0052] In another embodiment, the training accuracy may be further increased by generating an improved machine learning model (called Points To Polygons Net—PTPNet), which applies advanced geometrical loss functions to optimize the prediction of the vertices of a polygon. The PTPNet outputs a polygon and its mask representation.
[0053] The PTPNet architecture is shown in
[0054] In this example, the Regressor model outputs a vector of 2n scalars that represent the n vertices of the predicted polygon representation. However, the method provided by the present invention can be implemented for polygons with any number of vertices.
[0055] The Renderer model generates a binary mask that corresponds to the Regressor's predicted polygon. It may be trained separately from the regression model using the polygons' contours.
[0056] The PTPNet uses a rendering component, which generates a binary mask that resembles the quadrilateral corresponding to the predicted vertices. The PTPNet loss function (which represents the difference between a predicted polygon P, in this example a quadrangle, and a ground-truth polygon F, at both the vertex and shape levels) is more accurate because it considers the difference between the predicted polygon and the ground-truth polygon. This difference is treated as an error (represented by the loss function), which is used for updating the model to reduce the error.
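A simplified, non-differentiable sketch of such a combined loss: the rasterizer below stands in for PTPNet's Renderer (which must be differentiable for training), vertices are assumed to be consistently ordered, and the equal weighting of the vertex term and the (1 − IoU) mask term is an assumption.

```python
import numpy as np

def render_polygon_mask(vertices, shape):
    """Rasterize a consistently-ordered convex polygon into a binary mask
    via half-plane tests. Simplified stand-in for PTPNet's Renderer."""
    ys, xs = np.indices(shape)
    pts = np.stack([xs, ys], axis=-1).astype(float)
    inside = np.ones(shape, dtype=bool)
    n = len(vertices)
    for i in range(n):
        a = np.asarray(vertices[i], float)
        b = np.asarray(vertices[(i + 1) % n], float)
        # Keep pixels on the interior side of each directed edge
        cross = ((b[0] - a[0]) * (pts[..., 1] - a[1])
                 - (b[1] - a[1]) * (pts[..., 0] - a[0]))
        inside &= cross >= 0
    return inside

def ptp_loss(pred_vertices, gt_vertices, shape, w_vertex=1.0, w_mask=1.0):
    """Combined loss: mean vertex distance plus (1 - mask IoU)."""
    pv = np.asarray(pred_vertices, float)
    gv = np.asarray(gt_vertices, float)
    vertex_loss = np.mean(np.linalg.norm(pv - gv, axis=1))
    pm = render_polygon_mask(pred_vertices, shape)
    gm = render_polygon_mask(gt_vertices, shape)
    inter = np.logical_and(pm, gm).sum()
    union = np.logical_or(pm, gm).sum()
    iou = inter / union if union else 1.0
    return w_vertex * vertex_loss + w_mask * (1.0 - iou)
```

The shape (mask) term is what distinguishes this loss from a plain vertex regression loss, penalizing predictions whose rendered quadrilateral deviates from the ground-truth region even when individual vertices are close.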
[0057] This way, the loss function is improved to consider not only the predicted vertices (four, in this example), but also a mapping of the predicted frame that is also compared with the ground truth (actual) polygon.
[0058] It should be noted that an image (or a frame) may contain multiple advertisements, all of which will be detected. In addition, the method provided by the present invention is not limited to advertisements; it may be implemented similarly to replace a placeholder.
[0059] Processing elements of the system described herein may be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations thereof. Such elements can be implemented as a computer program product, tangibly embodied in an information carrier, such as a non-transient, machine-readable storage device, for execution by, or to control the operation of, a data processing apparatus, such as a programmable processor or computer, and may be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites. Memory storage for software and data may include one or more memory units, including one or more types of storage media. Examples of storage media include, but are not limited to, magnetic media, optical media, and integrated circuits such as read-only memory devices (ROM) and random access memory (RAM). Network interface modules may control the sending and receiving of data packets over networks. Method steps associated with the system and process can be rearranged and/or one or more such steps can be omitted to achieve the same, or similar, results to those described herein.
[0060] It is to be understood that the embodiments described hereinabove are cited by way of example, and that the present invention is not limited to what has been particularly shown and described hereinabove. For example, the process described above may be calculated for each video segment and stored with the video file or transmitted over a data network. When playing the video file, it will be possible to select personalized advertisements to be embodied within the video, based on the locality of the player.