FREE VIEWPOINT VIDEO GENERATION AND INTERACTION METHOD BASED ON DEEP CONVOLUTIONAL NEURAL NETWORK

20220394226 · 2022-12-08

    Abstract

    A Free Viewpoint Video (FVV) generation and interaction method based on a deep Convolutional Neural Network (CNN) includes the steps of: acquiring multi-viewpoint data of a target scene with a synchronous shooting system built around a multi-camera array, to obtain groups of synchronous video frame sequences from a plurality of viewpoints, and rectifying the baselines of the sequences at pixel level in batches; extracting, through encoding and decoding network structures, features of each group of viewpoint images input into a designed and trained deep CNN model, to obtain deep feature information of the scene, and combining that information with the input images to generate a virtual viewpoint image between each group of adjacent physical viewpoints at every moment; and synthesizing all viewpoints into frames of the FVV by stitching matrices according to time and the spatial positions of the viewpoints. The method eliminates the need for camera rectification and depth image calculation.

    Claims

    1. A Free Viewpoint Video (FVV) generation and interaction method based on a deep Convolutional Neural Network (CNN), comprising the following steps: a step (1) of rectifying the pose and color of cameras in an acquisition system, wherein the acquisition system comprises N cameras in a uniform, circular arc-shaped arrangement at the same height; the pose and position of the cameras are rectified based on a reference object at the center of the circular arc, and the position of each camera remains unchanged after rectification; and color parameters of the N cameras are rectified by a white balance algorithm based on the Gray World assumption; a step (2) of shooting a target scene object in synchronous video sequences by the camera array of the acquisition system, and selecting video frames at a certain moment to rectify baselines of N−1 groups of adjacent viewpoints in sequence, to obtain N−1 image affine transformation matrices M_i, i=1, 2, . . . , N−1; a step (3) of rectifying baselines of all frame data of adjacent sub-viewpoints in sequence by the obtained affine transformation matrices M_i; a step (4) of pre-processing binocular datasets through baseline rectification, color rectification based on the Gray World algorithm, and displacement threshold screening based on optical flow calculation, and then training the virtual viewpoint generation ability of the deep CNN; a step (5) of inputting the baseline data rectified in the step (3) into the deep CNN pre-trained in the step (4), and outputting generated virtual viewpoint 2D images based on the number of reconstructed virtual viewpoints; a step (6) of stitching the physical viewpoints and the generated virtual viewpoints into an image matrix in order of physical spatial position, and labeling the Block_Indexes of all viewpoints in the image matrix in sequence; and a step (7) of synthesizing the stitched frames at every moment obtained in the step (6) into an FVV at the shooting frame rate of the multiple cameras.

    2. The FVV generation and interaction method based on a deep CNN according to claim 1, wherein, in the step (1), a horizontal and vertical “cross-shaped” reference plate is placed at the center of the circular arc, centers of all cameras are directed at the center of the reference plate, while the vertical direction in the middle of a camera image coincides with a vertical reference line of the reference plate, and the position of each camera remains unchanged after rectification.

    3. The FVV generation and interaction method based on a deep CNN according to claim 1, wherein, in the step (1), an angle between optical axes of adjacent cameras is controlled at about 30 degrees.

    4. The FVV generation and interaction method based on a deep CNN according to claim 1, wherein after being synthesized in the step (7), the FVV is compressed and stored in a local server according to a certain compression ratio.

    5. The FVV generation and interaction method based on a deep CNN according to claim 4, wherein after the FVV synthesized in the step (7) is loaded by software, users can smoothly switch in real time among video blocks from different viewpoints based on the viewpoint Block_Indexes in the step (6), thus realizing human-computer video interaction.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    [0020] FIG. 1 is a flow chart of a method according to the present invention;

    [0021] FIG. 2 is a topology of a hardware acquisition system according to an embodiment of the present invention;

    [0022] FIG. 3 is a schematic diagram of a baseline rectification method according to the embodiment of the present invention;

    [0023] FIG. 4 is a flow chart of a deep CNN according to the embodiment of the present invention; and

    [0024] FIG. 5 is an interface of an FVV Player according to the embodiment of the present invention.

    DETAILED DESCRIPTION OF THE INVENTION

    [0025] The present invention will be described in detail below with reference to the accompanying drawings by embodiments.

    [0026] In this embodiment, a multi-camera array, arranged as shown in the topology of FIG. 2, synchronously acquires video sequence information of a program stage scene. The acquired data are processed and synthesized into an interactive Free Viewpoint Video (FVV), which users can view interactively through a free-viewpoint interactive display system developed accordingly, making it possible to transmit relayed information bidirectionally.

    [0027] A processing flow of this embodiment is shown in FIG. 1, including the following steps.

    [0028] (1) Circular Arc-Shaped Arrangement of a Camera Array, and Pose and Color Rectification of Multiple Cameras

    [0029] The topology of the hardware acquisition system is shown in FIG. 2. There are N cameras in a uniform, circular arc-shaped arrangement at the same height, and the angle between the optical axes of adjacent cameras is controlled at about 30 degrees. A horizontal and vertical “cross-shaped” reference object is placed at the center of the scene to rectify the pose and position of the cameras. As shown in FIG. 3, a vertical and horizontal plate is placed at the center of the scene; the centers of all cameras are directed at the center O of the reference plate, while the vertical centerline of each camera image coincides with the vertical reference line of the reference plate, and the position of each camera remains unchanged after rectification. At the same time, the color parameters of the N cameras are rectified by a white balance algorithm based on the Gray World assumption.
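    The Gray World white balance used here assumes the average scene color is achromatic, so each channel is scaled toward the global mean. A minimal sketch of that correction (the function name and 8-bit RGB image layout are illustrative, not from the specification):

```python
import numpy as np

def gray_world(img):
    # Gray World assumption: the average of each color channel should
    # equal the global gray mean; scale every channel accordingly.
    img = img.astype(np.float64)
    channel_means = img.reshape(-1, 3).mean(axis=0)
    gain = channel_means.mean() / channel_means   # per-channel gain
    return np.clip(img * gain, 0, 255).astype(np.uint8)
```

    Applying the same gains per camera makes the N views share consistent color statistics before virtual viewpoint generation.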

    [0030] (2) Synchronous Rectification of Multiple Cameras

    [0031] All cameras are synchronized by an external trigger signal generator through video data trigger lines, and the frequency of trigger signals is adjusted to trigger all cameras to synchronously acquire information of a shot scene.

    [0032] (3) Simultaneous Acquisition of Video Sequences and Baseline Rectification to Obtain Affine Transformation Matrices

    [0033] The camera array arranged in the step (1) shoots a target scene object in synchronous video sequences. Video frames at a certain moment are selected to rectify the baselines of N−1 groups of adjacent viewpoints in sequence. For each pair, a translation factor (x, y), a rotation factor θ and a scaling factor k of the affine transformation are manually set based on feature points of the object in the scene, so that the feature points at the center of the scene are located at the position of the reference object, such as the central rectification point O of the scene in the schematic diagram of the baseline rectification system used in this embodiment (shown in FIG. 3). There, Cam_L and Cam_R are the left and right cameras, with the same parameters, simultaneously shooting an object at the center of the scene; the three feature points of the object obtained in the left image Img_L (L1, L2, L3) and in the right image Img_R (R1, R2, R3) are made to coincide in height at the same time, thus ensuring that the baselines of the left and right cameras are on the same level. In this way, N−1 image affine transformation matrices M_i (i=1, 2, . . . , N−1) are obtained by the above baseline rectification method. The specific form of the affine transformation matrix is

    [00001]

        M = [  α    β    (1−α)·x − β·y
              −β    α    β·x + (1−α)·y ]   (2×3)

    [0034] where α = k·cos(θ) and β = k·sin(θ).
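    This 2×3 matrix has the same form as the one returned by OpenCV's cv2.getRotationMatrix2D((x, y), θ, k): a rotation by θ and scaling by k about the center (x, y). A small sketch that builds it directly (the function name and degree convention are illustrative):

```python
import numpy as np

def affine_matrix(x, y, theta_deg, k):
    # Rotation by theta about (x, y) plus uniform scaling k; same form
    # as cv2.getRotationMatrix2D((x, y), theta_deg, k).
    theta = np.deg2rad(theta_deg)
    a = k * np.cos(theta)
    b = k * np.sin(theta)
    return np.array([[ a, b, (1 - a) * x - b * y],
                     [-b, a, b * x + (1 - a) * y]])
```

    By construction the rectification center (x, y) maps to itself, which is what keeps the chosen scene feature point fixed while the baseline is leveled.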

    [0035] (4) Batch Baseline Rectification

    [0036] The baselines of all frame data of adjacent sub-viewpoints are rectified in sequence through the warpAffine( ) function in OpenCV: the baselines of the N−1 groups of cameras are rectified in pairs by the affine matrices M_i (i=1, 2, . . . , N−1) obtained in the step (3), sequentially according to the spatial positions of the N cameras in the circular arc-shaped arrangement, so that the rectified image baselines of all N cameras are kept at the same level.

    [0037] (5) Virtual Viewpoint Generation Network Training

    [0038] This step starts with pre-processing datasets through baseline rectification, color rectification, and displacement threshold screening based on optical flow calculation. Each dataset consists of image triplets of ‘left-center-right’ viewpoints in many scenes. The baselines of the image triplets are first rectified in batches by the same method as in the step (3), so that several groups of feature points in every three images are kept at the same level. Color rectification is then performed by the white balance algorithm based on the Gray World assumption, so that the three images of the same scene have the same white balance parameters. Finally, optical flow fields of the triplets are calculated in pairs to obtain the average pixel displacement of the same object in the same scene, and a threshold is set so that only the triplets exceeding the threshold are selected to form a new training dataset.

    [0039] The structure of the deep CNN used in this embodiment is based on the open-source network SepConv, as shown in FIG. 4 (refer to Video Frame Interpolation via Adaptive Separable Convolution for details). It specifically includes an encoding network and a decoding network (the two sub-network blocks Encoder and Decoder in the dashed boxes in FIG. 4). Image1 and Image2 from the left and right viewpoints pass through the encoding network and the decoding network in sequence: in the encoding network they pass through convolutional layers (Conv) and average pooling layers (Pooling) of various sizes as shown in FIG. 4, and in the decoding network they pass through convolutional layers (Upconv) and linear upsampling layers (Upsampling) of various sizes, to obtain deep feature mapping parameters S1 and S2 of the scene, respectively. These parameters are cascaded and added with the input images Image1 and Image2, respectively, to predict a 2D image Output of a virtual viewpoint between the left and right physical viewpoints. In training, the results are quantified by the difference between the Output and the intermediate image of the triplet, which serves as the Ground Truth, and the following two loss functions are adopted:


    L_1 = ||R − R_GT||_2^2,  L_2 = ||S(R) − S(R_GT)||_2^2

    [0040] The total loss is L_total = L_1 + α·L_2, where L_1 is the 2-norm error between the network-predicted image R and the Ground Truth R_GT based on the pixel RGB difference, L_2 is the difference in the feature structure extracted by the network, and S( ) is a feature extraction function used to train the network model to perceive the deep structure of the scene. The total training loss L_total is a linear weighted sum of L_1 and L_2. An optimal parameter model of a Virtual View Generation Network (VVGN) is obtained by iterative training for a certain period.
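    Numerically, the combined loss can be sketched as follows (a NumPy stand-in for the training framework; the weight value and the feature extractor S are placeholders, since the specification does not fix them):

```python
import numpy as np

def total_loss(pred, gt, feat_fn, alpha=0.1):
    # L1: squared 2-norm of the pixel RGB difference.
    # L2: squared 2-norm of the difference between feature maps S(R).
    # alpha: linear weight on L2 (value here is illustrative).
    l1 = np.sum((pred - gt) ** 2)
    l2 = np.sum((feat_fn(pred) - feat_fn(gt)) ** 2)
    return l1 + alpha * l2
```

    In actual training, feat_fn would be the network's feature extraction branch and the sums would run over batched tensors.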

    [0041] (6) Virtual Viewpoint Generation

    [0042] The baseline data rectified in the step (4) are input into the pre-trained deep Virtual View Generation Network (VVGN), and generated virtual viewpoint 2D images are output based on the number of reconstructed virtual viewpoints. Unlike traditional virtual viewpoint generation methods, the method in the present invention predicts and generates a virtual viewpoint between two physical viewpoints with the deep CNN: the input data are baseline-rectified at pixel level in advance, and the feature structures of the two viewpoints are learned directly by the CNN to produce the output, without rectifying the multiple cameras in advance. This step determines the quality of the generated virtual viewpoints. The binocular datasets are pre-processed through baseline rectification, color rectification, and displacement threshold screening based on optical flow calculation as in the step (5), and input into the CNN shown in FIG. 4 for training. The training inputs are two binocular 2D images, and the training loss functions are:


    L_1 = ||R − R_GT||_2^2,  L_2 = ||S(R) − S(R_GT)||_2^2,

    [0043] As above, L_total = L_1 + α·L_2, where L_1 is the 2-norm error between the network-predicted image and the Ground Truth based on the pixel RGB difference, L_2 is the difference in the feature structure extracted by the network, and S( ) is a feature extraction function used to train the network model to perceive the deep structure of the scene. The total training loss L_total is a linear weighted sum of L_1 and L_2. For binocular wide baselines, better virtual viewpoint quality may be obtained than with existing deep learning-based video interpolation networks, and the computational effort is much lower than with traditional methods.

    [0044] (7) Stitching Matrices of all Viewpoint Image Frames

    [0045] The image matrices of the physical viewpoints and the virtual viewpoints generated in the step (6) are stitched in order of physical spatial position (the number of rows and columns of the stitched matrix depends on the number of virtual viewpoints generated), and the Block_Indexes of all viewpoints in the image matrix are labeled in sequence, in row-major order by default.
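    The stitching and row-major Block_Index labeling can be sketched as a simple tiling (the function name and the assumption that all viewpoint images share one size are illustrative):

```python
import numpy as np

def stitch_frame(views, cols):
    # Tile viewpoint images (physical + virtual, ordered by spatial
    # position) into one frame; the tile order is the Block_Index,
    # assigned row-major by default.
    rows = -(-len(views) // cols)          # ceiling division
    h, w = views[0].shape[:2]
    canvas = np.zeros((rows * h, cols * w) + views[0].shape[2:],
                      views[0].dtype)
    for idx, v in enumerate(views):        # idx == Block_Index
        r, c = divmod(idx, cols)
        canvas[r * h:(r + 1) * h, c * w:(c + 1) * w] = v
    return canvas
```

    One such stitched canvas is produced per moment and becomes one frame of the FVV.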

    [0046] (8) FVV Synthesis

    [0047] The stitched frames at every moment obtained in the previous step are synthesized into an FVV, using FFmpeg or the cv2.VideoWriter( ) function in OpenCV, at the shooting frame rate of the multiple cameras, and the FVV is compressed and stored in a local server at a certain compression ratio.

    [0048] (9) Interactive Viewing of FVV by Users

    [0049] The interface of an FVV Player is shown in FIG. 5. After the FVV synthesized in the step (8) is loaded, users can smoothly switch in real time among video blocks from different viewpoints, each corresponding to a specific viewpoint Block_Index, using a Slider or Dial interactive button module, thus realizing a free and interactive viewing experience.
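    Viewpoint switching then amounts to cropping, from each stitched frame, the block addressed by the current Block_Index; the Slider or Dial simply changes that index. A sketch under those assumptions (all names are illustrative):

```python
import numpy as np

def view_block(stitched, block_index, cols, block_h, block_w):
    # Crop the sub-video block for one viewpoint out of a stitched FVV
    # frame, using the row-major Block_Index assigned at stitching time.
    r, c = divmod(block_index, cols)
    return stitched[r * block_h:(r + 1) * block_h,
                    c * block_w:(c + 1) * block_w]
```

    Because every viewpoint is already present in each frame, switching requires no decoding of a second stream, which is what makes the real-time interaction smooth.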