Video processing method and apparatus for use with a sequence of stereoscopic images

Abstract

To generate a warning that a stereoscopic image sequence has been synthesised from a 2D image sequence, a video processor correlates left-eye image data and right-eye image data to identify any sustained temporal offset between the left-eye and right-eye image data. A measure of sustained correlation between a measured spatial distribution of horizontal disparity and a spatial model can also be used to generate the warning.

Claims

1. A method of processing a stereoscopic image sequence to generate a warning that the stereoscopic image sequence has been derived from a 2D image sequence, the method comprising the steps in a video processor of: receiving left-eye image data and right-eye image data from a sequence of stereoscopic images; performing a correlation process upon said left-eye image data and right-eye image data to identify any temporal offset where left-eye image data is substantially equal to delayed right-eye image data or right-eye image data is substantially equal to delayed left-eye image data; and performing an analysis of said temporal offset to generate a warning signal that the said stereoscopic image sequence has been derived from a 2D image sequence where said temporal offset is constant at one value for a first set of said images and constant at a different value for a second set of said images.

2. A method according to claim 1, wherein left-eye image data is compared with right-eye image data delayed by one image and with right-eye image data delayed by two images and right-eye image data is compared with left-eye image data delayed by one image and with left-eye image data delayed by two images.

3. A method according to claim 1, wherein said first set and second set of images each comprise at least ten images.

4. A method according to claim 1, wherein said left-eye image data and said right-eye image data are vertically filtered prior to performing of said correlation process.

5. A method according to claim 4, wherein columns in said left-eye image data and said right-eye image data are vertically averaged.

6. A method according to claim 1, wherein said analysis includes a determination that there is sufficient variation in pixel values over the image to make correlation meaningful.

7. A method according to claim 1, wherein said warning signal is temporally filtered.

8. A method according to claim 7, wherein said warning signal is temporally filtered in a non-linear temporal filter.

9. Video processing apparatus comprising: an input for receiving left-eye image data and right-eye image data from a sequence of stereoscopic images; a correlator for performing a correlation process upon said left-eye image data and right-eye image data to identify any temporal offset where left-eye image data is substantially equal to delayed right-eye image data or right-eye image data is substantially equal to delayed left-eye image data; and a logic block for performing an analysis to generate a warning signal that the said stereoscopic image sequence has been derived from a 2D image sequence where said temporal offset is constant at one value for a first set of said images and constant at a different value for a second set of said images.

10. Apparatus according to claim 9, wherein said correlator is adapted to compare left-eye image data with right-eye image data delayed by one image and with right-eye image data delayed by two images and to compare right-eye image data with left-eye image data delayed by one image and with left-eye image data delayed by two images.

11. Apparatus according to claim 9, wherein said first set and second set of images each comprise at least ten images.

12. Apparatus according to claim 9, comprising a vertical filter operating on said left-eye image data and said right-eye image data prior to performing of said correlation process.

13. Apparatus according to claim 12, wherein said vertical filter comprises a vertical averager operating on columns in said left-eye image data and said right-eye image data.

14. Apparatus according to claim 9, comprising an activity detector enabling said logic block to make a determination that there is sufficient variation in pixel values over the image to make correlation meaningful.

15. Apparatus according to claim 9, comprising a temporal filter operating on said warning signal.

16. Apparatus according to claim 15, wherein said temporal filter is non-linear.

17. A non-transitory computer program product containing instructions adapted to cause programmable apparatus to implement a method of processing a stereoscopic image sequence to generate a warning that the stereoscopic image sequence has been derived from a 2D image sequence, the method comprising the steps in a video processor of: receiving left-eye image data and right-eye image data from a sequence of stereoscopic images; performing a correlation process upon said left-eye image data and right-eye image data to identify any temporal offset where left-eye image data is substantially equal to delayed right-eye image data or right-eye image data is substantially equal to delayed left-eye image data; and performing an analysis of said temporal offset to generate a warning signal that the said stereoscopic image sequence has been derived from a 2D image sequence where said temporal offset is constant at one value for a first set of said images and constant at a different value for a second set of said images.

18. A computer program product according to claim 17, wherein left-eye image data is compared with right-eye image data delayed by one image and with right-eye image data delayed by two images and right-eye image data is compared with left-eye image data delayed by one image and with left-eye image data delayed by two images.

19. A computer program product according to claim 17, wherein said left-eye image data and said right-eye image data are vertically filtered prior to performing of said correlation process.

20. A computer program product according to claim 17, wherein said analysis includes a determination that there is sufficient variation in pixel values over the image to make correlation meaningful.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) An example of the invention will now be described with reference to the drawings in which:

(2) FIG. 1 shows a block diagram of a first exemplary embodiment of the invention.

(3) FIG. 2 shows a block diagram of a second exemplary embodiment of the invention.

DETAILED DESCRIPTION OF THE INVENTION

(4) An important feature of the invention is the detection of a spatio-temporal image offset between the left-eye and right-eye images of a stereo pair of images that is piecewise constant, that is to say the offset remains constant for a section of an image sequence and then changes rapidly to another constant value during another section of the image sequence. When such offsets are found, it is likely that the pair of images were not created by two cameras with horizontally-separated viewpoints (i.e. true stereoscopic image acquisition), but rather a single image has been modified and duplicated to create a synthetic stereo pair of images.
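
By way of illustration only, the piecewise-constant condition can be sketched as a search for sustained runs of identical offset values. The function names, the use of None for frames where no offset was found, and the default run length of ten frames (taken from claim 3) are assumptions made for this sketch.

```python
from itertools import groupby


def sustained_runs(offsets, min_run=10):
    """Return (offset, length) for every run of identical offsets of at least min_run frames.

    None entries (assumed here to mean "no match found") are never reported.
    """
    runs = [(value, len(list(group))) for value, group in groupby(offsets)]
    return [(value, length) for value, length in runs
            if value is not None and length >= min_run]


def is_piecewise_constant(offsets, min_run=10):
    """True when sustained runs at two or more different offset values are present."""
    return len({value for value, _ in sustained_runs(offsets, min_run)}) >= 2
```

A per-frame offset list such as twelve frames at +1 followed by fifteen frames at -2 would satisfy this test, whereas a single unchanging offset would not.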

(5) FIG. 1 shows an example of a system to achieve this detection. Two co-timed video data streams (101), (102), which represent the left-eye and right-eye images respectively of a stereoscopic video sequence, are input to respective vertical averaging processes (103), (104). These each output a set of pixel values for each video frame, where each value is the average for a column of vertically-aligned pixels extending over substantially the full image height. Typically luminance values of pixels are averaged, though any other representative pixel measure can be used. The sets of values for the frames are input to two pairs of cascaded frame delays: the left-eye data sets (105) are input to a first frame delay (107), whose output feeds a second frame delay (109); and the right-eye data sets (106) are input to a first frame delay (108), whose output feeds a second frame delay (110).
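
A minimal sketch of the vertical averaging and the cascaded frame delays follows. It assumes each frame is supplied as a list of rows of luminance values; the names column_averages and FrameDelayLine are hypothetical and do not appear in the drawings.

```python
from collections import deque


def column_averages(frame):
    """Average each column of a frame given as a list of equal-length rows."""
    height = len(frame)
    return [sum(column) / height for column in zip(*frame)]


class FrameDelayLine:
    """Holds the newest data set together with its one- and two-frame-delayed copies."""

    def __init__(self, depth=2):
        self._history = deque(maxlen=depth + 1)

    def push(self, data_set):
        # The newest data set is stored at index 0; the oldest falls off the end.
        self._history.appendleft(data_set)

    def delayed(self, frames):
        """Return the data set from `frames` frames ago, or None if not yet available."""
        return self._history[frames] if frames < len(self._history) else None
```

One delay line per eye would be used, so that the undelayed data of one eye can be compared with the one-frame-delayed and two-frame-delayed data of the other.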

(6) Four identical correlation processors (111), (112), (113) and (114) compare the average data sets (105) and (106) with the respective opposite-eye, frame-delayed data sets. The correlation processor (111) compares left-eye data (105) with one-frame-delayed right-eye data from the frame delay (108). The correlation processor (113) compares left-eye data (105) with two-frame-delayed right-eye data from the frame delay (110). The correlation processor (112) compares right-eye data (106) with one-frame-delayed left-eye data from the frame delay (107). The correlation processor (114) compares right-eye data (106) with two-frame-delayed left-eye data from the frame delay (109).

(7) Each correlation processor outputs a measure of the best match between the respective undelayed data and the respective delayed data. The best match is the closest match obtained by horizontally shifting the pixel values over a search window, typically 10% of the image width. The correlation process may use the well-known Pearson correlation method, or a simple average of inter-pixel difference values can be evaluated for a number of horizontal shift positions and the smallest average value used as the measure of match.
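
The simpler of the two matching options, the average of inter-pixel differences evaluated over a range of horizontal shifts, might be sketched as below. Treating the 10% search window as plus or minus 5% of the width, and averaging only over the overlapping samples at each shift, are assumptions of this sketch; the function names are hypothetical.

```python
def mean_abs_diff(a, b, shift):
    """Mean absolute difference between a[i] and b[i + shift] over the overlapping samples."""
    n = len(a)
    lo, hi = max(0, -shift), min(n, n - shift)
    return sum(abs(a[i] - b[i + shift]) for i in range(lo, hi)) / (hi - lo)


def best_match_error(undelayed, delayed, window_fraction=0.10):
    """Smallest mean absolute difference over all shifts inside the search window."""
    max_shift = max(1, int(len(undelayed) * window_fraction / 2))
    return min(mean_abs_diff(undelayed, delayed, shift)
               for shift in range(-max_shift, max_shift + 1))
```

A result near zero indicates that the delayed data set is substantially a horizontally shifted copy of the undelayed data set.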

(8) The outputs from the four correlation processors are passed to an evaluation logic block (115), which also receives the output of an activity detector (116). The evaluation logic block (115) determines when there is significant correlation between one- or two-frame-delayed left-eye and right-eye data, and that there is sufficient variation in pixel values over the image, as determined by the activity detector (116), to make the correlation meaningful. The evaluation logic block (115) could simply take the lowest match error, test to see if it is below a threshold value, and output it if the output of the activity detector is asserted. More complex evaluation methods are possible; for example, a test to see whether one correlation is significantly better than all the others could be included.
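
The simple evaluation described above, with the optional test that one correlation is significantly better than the others, could be sketched as follows. The dictionary keyed by signed frame offset, the threshold value and the margin parameter are illustrative assumptions rather than values given in this description.

```python
def evaluate(match_errors, activity_ok, threshold=2.0, margin=None):
    """Return (offset, error) for the best match, or None when no output is warranted.

    match_errors maps a signed frame offset (hypothetical convention) to its match error.
    """
    if not activity_ok:
        return None
    ranked = sorted(match_errors.items(), key=lambda item: item[1])
    best_offset, best_error = ranked[0]
    if best_error >= threshold:
        return None
    # Optional stricter test: the best match must beat the runner-up by `margin`.
    if margin is not None and len(ranked) > 1 and ranked[1][1] - best_error < margin:
        return None
    return best_offset, best_error
```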

(9) The activity detector (116) evaluates a measure of high spatial-frequency energy over each input image. This could be a simple average of value differences between neighbouring pixels. The two input images of the stereo pair could both be evaluated and the results combined, or only one image of the pair could be evaluated. To save processing resources it may be convenient to evaluate the activity measure for one or both of the vertically-averaged data sets (105) and (106).
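
As a sketch of the simplest activity measure mentioned above, the following computes the average difference between neighbouring values of a vertically-averaged data set. The threshold is an arbitrary illustrative value and the function names are hypothetical.

```python
def activity(data_set):
    """Mean absolute difference between horizontally neighbouring values."""
    if len(data_set) < 2:
        return 0.0
    total = sum(abs(b - a) for a, b in zip(data_set, data_set[1:]))
    return total / (len(data_set) - 1)


def is_active(data_set, threshold=1.0):
    """True when the data set varies enough for correlation to be meaningful."""
    return activity(data_set) > threshold
```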

(10) The output from the evaluation logic block (115) is a measure of the likelihood that the input images are not a true stereo pair derived from different viewpoints. Because the validity of this output changes over time, and is only available when picture activity is detected, it is helpful to filter it temporally in the temporal low-pass filter (117). This can be a simple running average recursive filter, or may include non-linearity, so that the characteristics of the filter are modified in dependence upon its most recent output. The effect of the filter is to reject short-duration outputs from the evaluation logic block (115); only outputs that are sustained over several tens of frames should give rise to an indication of synthetic 3D. A sustained output does not necessarily mean that the temporal offset is constant; a sequence of non-zero offsets with a magnitude of one or two frames of either polarity that lasts for several tens of frames is a valid warning. The output from the temporal low-pass filter (117) is thus a more reliable indication of the presence of synthetic 3D than the instantaneous output from the evaluation logic block (115).
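
One possible form of such a filter is sketched below: a first-order recursive filter whose coefficient is chosen by comparing each new input with the most recent output, so that the output rises slowly and falls quickly. The class name, the coefficients and the switching level are assumptions chosen so that roughly 35 consecutive detections are needed before the warning is asserted.

```python
class WarningFilter:
    """Recursive low-pass filter with different rise and fall rates (illustrative values)."""

    def __init__(self, rise=0.02, fall=0.2, on_level=0.5):
        self.rise, self.fall, self.on_level = rise, fall, on_level
        self.state = 0.0

    def update(self, detected):
        """Feed one per-frame detection flag; return the filtered warning flag."""
        target = 1.0 if detected else 0.0
        # Non-linearity: the coefficient depends on whether the input is above the last output.
        k = self.rise if target > self.state else self.fall
        self.state += k * (target - self.state)
        return self.state >= self.on_level
```

With these values a burst of a few detections leaves the warning unasserted, while detections sustained over several tens of frames assert it.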

(11) As mentioned in the introduction, temporal offset is often combined with fixed, position-dependent spatial offsets in order to create synthetic 3D. A second example of the invention will now be described that detects this technique.

(12) Referring to FIG. 2, co-timed left-eye and right-eye video data streams are input to respective fingerprint detectors (203) and (204). The object of video fingerprinting is to derive a parameter that describes a video frame sufficiently well for a copy of that frame to be identified by comparison of the respective fingerprint parameters. There are many known video fingerprinting techniques; some are described in International Patent Application WO 2009/104022 (the content of which is hereby incorporated by reference). A very simple fingerprint is the average luminance of a frame, or a set of average luminance values for a defined set of spatial regions within a frame.

(13) The left-eye and right-eye fingerprints are input to a correlator (205) that evaluates the temporal correlation between the two streams of fingerprints to find the temporal offset between the input video streams (201) and (202). Typically the process compares the correlation between the fingerprint streams after the application of a number of trial offset values, and the offset value that gives the best match is output.
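
A trial-offset search of this kind can be sketched as below, assuming one scalar fingerprint (for example average luminance) per frame and a mean absolute difference as the match measure. The sign convention, in which a result d means that left-eye frame t matches right-eye frame t + d, and the search range of two frames are assumptions of the sketch.

```python
def estimate_offset(left_fp, right_fp, max_offset=2):
    """Return the trial offset d that minimises the mean of |left_fp[t] - right_fp[t + d]|."""

    def cost(d):
        pairs = [(left_fp[t], right_fp[t + d])
                 for t in range(len(left_fp))
                 if 0 <= t + d < len(right_fp)]
        return sum(abs(l - r) for l, r in pairs) / len(pairs)

    return min(range(-max_offset, max_offset + 1), key=cost)
```

The fingerprint streams must be longer than the largest trial offset, and must vary over time, for the result to be meaningful.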

(14) The output from the correlator (205) is temporally low-pass filtered (206). This filter may be nonlinear; for example, it may be optimised to detect piecewise constant inputs by controlling its bandwidth according to the frequency of changes in its input value. The filter output must be rounded to an integral number of video frames, and this number is used to control a time alignment block (207). This removes any temporal offset between the input data streams (201) and (202) by delaying one or other of the input streams by the number of frames indicated by the filter output.
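
The rounding and time alignment steps might look as follows when the two streams are held as lists of frames. The sketch assumes that a positive offset d means left-eye frame t corresponds to right-eye frame t + d; the function name is hypothetical. Note that Python's built-in round resolves exact halves to the nearest even integer.

```python
def time_align(left_frames, right_frames, filtered_offset):
    """Round the filtered offset to whole frames and return the two streams aligned."""
    d = round(filtered_offset)
    if d > 0:
        # Right-eye stream lags: pair left[t] with right[t + d].
        left_frames, right_frames = left_frames[:len(left_frames) - d], right_frames[d:]
    elif d < 0:
        # Left-eye stream lags: pair left[t - d] with right[t].
        left_frames, right_frames = left_frames[-d:], right_frames[:len(right_frames) + d]
    n = min(len(left_frames), len(right_frames))
    return left_frames[:n], right_frames[:n]
```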

(15) The two temporally-aligned data streams are input to a disparity mapping block (208). This uses any known method of evaluating the horizontal disparity between spatially co-located regions in the temporally-aligned left-eye and right-eye images. For example, the method of determining the disparity value for a region described in UK patent application 1104159.7 and U.S. patent application Ser. No. 13/415,962 (the content of both of which is hereby incorporated by reference) can be used. The number of image regions for which disparity values are obtained will depend on the available processing resources; it is clearly advantageous to have a large number of regions, and to ensure that the majority of the image area is evaluated. However, image edge regions can be ignored.

(16) The output of the disparity mapping block (208) is thus a stream of sets of disparity values, one set for each frame of the time-aligned video streams from the time alignment block (207); each set describes the spatial disparity pattern for the respective frame. These sets of disparity values are input to a temporal high-pass filter (209) that outputs sets of temporally-filtered disparity values at frame rate. The filter forms each member of each set of output values from a weighted sum of co-located disparity values from a number of adjacent frames. The simplest example, which may be suitable in many cases, is for each output value to be the difference between the current disparity for a region and the disparity for the same region in the previous frame.

(17) The sets of temporally high-pass filtered disparity values are input to a mean square calculator (209). This forms a measure of total temporal energy of horizontal disparity for each frame. Preferably each input disparity value is squared and the mean of the sum of the squares over each video frame is output. If processing resources are scarce it may be acceptable to output the mean value of the total of the magnitudes of the disparity values for each frame.
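
The simplest temporal high-pass filter, a frame-to-frame difference, and the two energy measures described above can be sketched as follows. Each set of disparity values is assumed to be a flat list with one value per image region; the function names are hypothetical.

```python
def temporal_high_pass(current, previous):
    """Difference between co-located disparity values in the current and previous frames."""
    return [c - p for c, p in zip(current, previous)]


def mean_square(values):
    """Mean of the squared values: the preferred temporal energy measure."""
    return sum(v * v for v in values) / len(values)


def mean_magnitude(values):
    """Mean of the absolute values: a cheaper alternative when resources are scarce."""
    return sum(abs(v) for v in values) / len(values)
```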

(18) The output of the disparity mapping block (208) is also input to a spatial regression block (211). This evaluates how easy it is to fit a simple spatial model to the pattern of disparity values. The simplest implementation is to average the disparity values vertically and perform linear regression on the set of average disparity versus horizontal position data; and to average the disparity values horizontally and perform linear regression on the set of average disparity versus vertical position data. As is well-known, classical linear regression finds the linear model that best fits the data, and evaluates the errors from that model in a single operation. The two regression coefficients, quantifying the quality of fit of the disparity distribution of the current frame to a linear relationship with respect to horizontal position, and a linear relationship with respect to vertical position, are input to a decision logic block (212).
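
The following sketch computes the two quality-of-fit values from a rectangular grid of region disparities (a list of rows, with at least two rows and two columns). It uses the coefficient of determination of a least-squares line as the measure of fit, which is one reasonable reading of the regression coefficients referred to above. Reporting a perfect fit for a completely flat profile is an assumption of this sketch.

```python
def r_squared(xs, ys):
    """Coefficient of determination of the least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if syy == 0:
        return 1.0  # Assumption: a flat profile is treated as a perfect linear fit.
    return (sxy * sxy) / (sxx * syy)


def linear_fit_quality(disparity_grid):
    """Return (fit versus horizontal position, fit versus vertical position)."""
    rows, cols = len(disparity_grid), len(disparity_grid[0])
    column_means = [sum(column) / rows for column in zip(*disparity_grid)]
    row_means = [sum(row) / cols for row in disparity_grid]
    return r_squared(range(cols), column_means), r_squared(range(rows), row_means)
```

Values near unity indicate that the disparity pattern is well described by a simple linear ramp across or down the image.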

(19) If the disparity distribution fits a linear model well, and there is little temporal disparity variation energy, then it is very likely that synthetic 3D is present. True stereoscopic images are likely to have temporal variations in disparity due to moving objects; and, the spatial variation of disparity is likely to be complex. The logic block (212) thus detects the condition when there is a low output from the mean square evaluation (209) and one or two near-unity outputs from the spatial regression analysis (210). When this condition is detected, a synthetic 3D warning (213) is output. The decision logic (212) can also make use of the output from the temporal low-pass filter (206) so that the combination of temporal offset with a linear model of spatial offset is recognised as strongly characterising synthetic 3D. As with the system of FIG. 1 it is important to reject short duration correlation events, and a temporal low-pass filter should be included to ensure that only sustained correlation with the disparity model gives rise to a warning.
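
The condition detected by the decision logic can be sketched as a pair of threshold tests on a single frame, before any temporal filtering. The function name and both threshold values are illustrative assumptions; taking the larger of the two fit values reflects the statement that one or two near-unity outputs suffice.

```python
def synthetic_3d_suspected(temporal_energy, fit_horizontal, fit_vertical,
                           energy_max=0.05, fit_min=0.95):
    """True when disparity barely changes over time and fits a linear spatial model well."""
    low_energy = temporal_energy < energy_max
    good_fit = max(fit_horizontal, fit_vertical) >= fit_min
    return low_energy and good_fit
```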

(20) And, also as with the system of FIG. 1, it is important to reject false warnings derived from ambiguous input data. The detection of temporal offset requires temporal activity, and the detection of disparity requires spatial activity. A control system is thus necessary to confirm this activity. Unchanging fingerprint parameters from the fingerprint blocks (203) and (204) indicate that the temporal offset cannot be determined. And, lack of high spatial frequencies prevents disparity from being determined. If it is no longer possible to measure an image characteristic that gave rise to a warning, the warning should be maintained until a valid measurement that cancels the warning is obtained.
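
The rule that a warning is maintained while the underlying characteristic cannot be measured might be sketched as below. The function name and the representation of measurement validity as a single flag are assumptions.

```python
def hold_warning(previous_warning, measurement_valid, detected):
    """Update the warning only from a valid measurement; otherwise keep the previous state."""
    return detected if measurement_valid else previous_warning
```

For example, a warning raised on an active frame would persist through a run of static or featureless frames and would be cancelled only by a later valid measurement that finds no synthetic 3D.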

(21) It will be understood that features from the two described embodiments may be combined. For example, the correlation process described in relation to FIG. 1 may be used to drive the time alignment block (207) of the FIG. 2 embodiment, in place of the described correlation process using fingerprints. Similarly, a correlation process using fingerprints may be used as part of a system in which the measure of sustained temporal offset between the left-eye and right-eye image data is used to generate the warning, without consideration of spatial distribution of horizontal disparity.


CPC classification (section H, ELECTRICITY)

H04N13/106
H04N13/161
H04N13/111
H04N2013/0081
H04N13/139
H04N13/194
H04N13/211
H04N2013/0074

International classification (section H, ELECTRICITY)

H04N13/00
H04N13/139
H04N13/106
H04N13/111
H04N13/194
H04N13/211
H04N13/161
