MULTI-TASK DEEP LEARNING-BASED REAL-TIME MATTING METHOD FOR NON-GREEN-SCREEN PORTRAITS
20230005160 · 2023-01-05
Inventors
CPC classification
G06V40/103
PHYSICS
G06T2207/20016
PHYSICS
Y02T10/40
GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
G06V10/25
PHYSICS
G06T3/40
PHYSICS
G06V10/34
PHYSICS
International classification
G06T3/40
PHYSICS
G06V10/25
PHYSICS
Abstract
A multi-task deep learning-based real-time matting method for non-green-screen portraits is provided. The method includes: performing binary classification adjustment on an original dataset, inputting an image or video containing portrait information, and performing preprocessing; constructing a deep learning network for person detection, extracting image features by using a deep residual neural network, and obtaining a region of interest (ROI) of portrait foreground and a portrait trimap in the ROI through logistic regression; and constructing a portrait alpha mask matting deep learning network. An encoder sharing mechanism effectively accelerates a computing process of the network. An alpha mask prediction result of the portrait foreground is output in an end-to-end manner to implement portrait matting. In this method, green screens are not required during portrait matting. In addition, during the matting, only original images or videos need to be provided, without a need to provide manually annotated portrait trimaps.
Claims
1. A multi-task deep learning-based real-time matting method for non-green-screen portraits, comprising: step 1: performing binary classification adjustment on an original multi-class multi-object detection dataset, inputting an image or video containing portrait information, and performing data preprocessing on the image or video to obtain preprocessed data of an original input file; step 2: using an encoder-logistic regression structure to construct a deep learning network for person detection, inputting the preprocessed data obtained in step 1, constructing a loss function, training and optimizing the deep learning network for person detection to obtain a person detection model; step 3: extracting feature maps from an encoder of the person detection model in step 2, and performing feature stitching and fusing multi-scale image features to form an encoder of a portrait alpha mask matting network, to implement an encoder shared by the person detection and portrait alpha mask matting networks; step 4: constructing a decoder of the portrait alpha mask matting network, forming an end-to-end encoder-decoder portrait alpha mask matting network structure together with the shared encoder in step 3, inputting an image containing person information and a trimap, constructing a loss function, and training and optimizing the portrait alpha mask matting network; step 5: inputting the preprocessed data obtained in step 1 to a trained network in step 4, and outputting a region of interest (ROI) of portrait foreground and a portrait trimap in the ROI through logistic regression of the person detection model in step 2; and step 6: inputting the ROI of the portrait foreground and the portrait trimap in step 5 into the portrait alpha mask matting network constructed in step 4 to obtain a portrait alpha mask prediction result.
2. The multi-task deep learning-based real-time matting method for non-green-screen portraits according to claim 1, wherein the data preprocessing in step 1 comprises video frame processing and input image resizing.
3. The multi-task deep learning-based real-time matting method for non-green-screen portraits according to claim 1, wherein the deep learning network for person detection in step 2 is implemented through model prediction of a deep residual neural network.
4. The multi-task deep learning-based real-time matting method for non-green-screen portraits according to claim 1, wherein a main structure of the decoder in step 4 comprises upsampling, convolution, an exponential linear unit (ELU) activation function, and a fully connected layer for outputting.
5. The multi-task deep learning-based real-time matting method for non-green-screen portraits according to claim 4, wherein the upsampling is used to restore an image feature size after downsampling in the encoder, a scaled ELU (SELU) activation function is used, hyperparameters λ and α are fixed constants, and the activation function is expressed by formula (2):
6. The multi-task deep learning-based real-time matting method for non-green-screen portraits according to claim 1, wherein the constructing a loss function, and training and optimizing the portrait alpha mask matting network in step 4 specifically comprise: (4.1) computing an alpha mask prediction error, as expressed by formula (3):
Loss.sub.αlp=√{square root over ((α.sub.pre−α.sub.gro).sup.2)}+ε,α.sub.pre,α.sub.gro∈[0,1] (3) wherein Loss.sub.αlp represents the alpha mask prediction error, α.sub.pre and α.sub.gro respectively represent predicted and ground-truth alpha mask values, and ε represents a very small constant; (4.2) computing an image compositing error, as expressed by formula (4):
Loss.sub.com=√{square root over ((c.sub.pre−c.sub.gro).sup.2)}+ε (4) wherein Loss.sub.com represents the image compositing error, c.sub.pre and c.sub.gro respectively represent predicted and ground-truth alpha composite images, and ε represents a very small constant; and (4.3) constructing an overall loss function based on the alpha mask prediction error and the image compositing error, as expressed by formula (5):
Loss.sub.overall=ω.sub.1Loss.sub.αlp+ω.sub.2Loss.sub.com,ω.sub.1+ω.sub.2=1 (5) wherein Loss.sub.overall represents the overall loss function, and ω.sub.1 and ω.sub.2 respectively represent weights of the alpha mask prediction error Loss.sub.αlp and the image compositing error Loss.sub.com.
7. The multi-task deep learning-based real-time matting method for non-green-screen portraits according to claim 1, wherein the outputting a ROI of portrait foreground and a portrait trimap in the ROI in step 5 specifically comprises: (5.1) using a relative intersection over union (RIOU) obtained by improving an original criterion for determining the ROI of the portrait foreground, wherein the RIOU is expressed by formula (7):
8. A multi-task deep learning-based real-time matting system for non-green-screen portraits, comprising an input unit, a processor and a memory storing program codes, wherein the processor performs the stored program codes for: step 1: performing binary classification adjustment on an original multi-class multi-object detection dataset, and performing data preprocessing on an image or video containing portrait information and inputted from the input unit, to obtain preprocessed data of an original input file; step 2: using an encoder-logistic regression structure to construct a deep learning network for person detection, inputting the preprocessed data obtained in step 1, constructing a loss function, training and optimizing the deep learning network for person detection to obtain a person detection model; step 3: extracting feature maps from an encoder of the person detection model in step 2, and performing feature stitching and fusing multi-scale image features to form an encoder of a portrait alpha mask matting network, to implement an encoder shared by the person detection and portrait alpha mask matting networks; step 4: constructing a decoder of the portrait alpha mask matting network, forming an end-to-end encoder-decoder portrait alpha mask matting network structure together with the shared encoder in step 3, inputting an image containing person information and a trimap, constructing a loss function, and training and optimizing the portrait alpha mask matting network; step 5: inputting the preprocessed data obtained in step 1 to a trained network in step 4, and outputting a region of interest (ROI) of portrait foreground and a portrait trimap in the ROI through logistic regression of the person detection model in step 2; and step 6: inputting the ROI of the portrait foreground and the portrait trimap in step 5 into the portrait alpha mask matting network constructed in step 4 to obtain a portrait alpha mask prediction result.
9. A computer program product comprising a non-volatile computer readable medium having computer executable codes stored thereon, the codes comprising instructions for: step 1: performing binary classification adjustment on an original multi-class multi-object detection dataset, and performing data preprocessing on an input image or video containing portrait information, to obtain preprocessed data of an original input file; step 2: using an encoder-logistic regression structure to construct a deep learning network for person detection, inputting the preprocessed data obtained in step 1, constructing a loss function, training and optimizing the deep learning network for person detection to obtain a person detection model; step 3: extracting feature maps from an encoder of the person detection model in step 2, and performing feature stitching and fusing multi-scale image features to form an encoder of a portrait alpha mask matting network, to implement an encoder shared by the person detection and portrait alpha mask matting networks; step 4: constructing a decoder of the portrait alpha mask matting network, forming an end-to-end encoder-decoder portrait alpha mask matting network structure together with the shared encoder in step 3, inputting an image containing person information and a trimap, constructing a loss function, and training and optimizing the portrait alpha mask matting network; step 5: inputting the preprocessed data obtained in step 1 to a trained network in step 4, and outputting a region of interest (ROI) of portrait foreground and a portrait trimap in the ROI through logistic regression of the person detection model in step 2; and step 6: inputting the ROI of the portrait foreground and the portrait trimap in step 5 into the portrait alpha mask matting network constructed in step 4 to obtain a portrait alpha mask prediction result.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION OF THE EMBODIMENTS
[0062] The following further describes a multi-task deep learning-based real-time matting method for non-green-screen portraits with reference to the accompanying drawings.
[0063] As shown in
[0064] Step S601: Improve an original dataset, input an image or video in an improved dataset, and perform corresponding data preprocessing on the image or video to obtain preprocessed data of an original input file.
[0065] That the original dataset is improved and the data preprocessing is performed in step 1 may specifically include:
[0066] (1.1) Perform binary classification adjustment and supplement on a multi-class multi-object detection dataset. The binary classification adjustment is performed to modify an original 80-class COCO dataset to two classes: person and others, and supplement the dataset according to this criterion.
[0067] (1.2) Perform video frame processing by using FFmpeg to convert the video into video frames, so that a video file can be processed by the same method used for image files in subsequent work.
[0068] (1.3) Resize the input images: unify the sizes of different input images through cropping and padding, and keep the sizes of the network feature maps the same as those of the original images.
[0069] Step S602: Use an encoder-logistic regression structure to construct a deep learning network 103 for person detection. Input the preprocessed data obtained in step 1, construct a loss function, and train and optimize the deep learning network for person detection to obtain a person detection model.
[0070] The deep learning network 103 for person detection may specifically include:
[0071] (2.1) The encoder 104 is a fully convolutional residual neural network. In the network, skip connections are used to construct residual blocks res_block of different depths, and a feature sequence is obtained by extracting features of the image containing the portrait information.
[0072] (2.2) The loss function is constructed by adding a cross-entropy error of the person-others binary classification as an additional term to a general object detection task.
[0073] (2.3) The logistic regression 105 is an output structure for multi-scale detection of a central position (x.sub.i, y.sub.i) of a ROI, a length and width (w.sub.i, h.sub.i) of the ROI, a confidence level C.sub.i of the ROI, and a class p.sub.i(c), c∈classes of an object in the ROI. classes represents all classes in the training sample, namely, [class0:person, class1:others], and pixel.sub.i represents an i.sup.th pixel in the ROI.
[0074] Step S603: Fuse multi-scale image features to form an encoder of a portrait alpha mask matting network, to implement an encoder shared by the person detection and portrait alpha mask matting networks.
[0075] The multi-scale encoder shared by the person detection and portrait alpha mask matting networks may specifically include:
[0076] (3.1) Perform a forward pass through the fully convolutional deep residual neural network 104 to obtain outputs of the residual blocks res_block with downsampling multiples of 8 times, 16 times, and 32 times. Convolution kernels with a stride of 2 are used to implement the downsampling. core.sub.8, core.sub.16, and core.sub.32 are set as the convolution kernels during the downsampling, and a size of each convolution kernel is x×y. A size of an input is m×n, and a size of an output is m/2×n/2. Convolution corresponding to the output is expressed by formula (1). fun(⋅) represents an activation function, and β represents a bias.
output.sub.m/2,n/2=fun(ΣΣinput.sub.mn*core.sub.xy+β) (1)
[0077] (3.2) Fuse and stitch the output to form large, medium, and small fused image feature structures as the encoder of the portrait alpha mask matting network, to implement the encoder shared by the person detection and portrait alpha mask matting networks.
[0078] Step S604: Construct a decoder 106 of the portrait alpha mask matting network, and form an end-to-end encoder-decoder portrait alpha mask matting network structure together with the shared encoder in step 3. Input an image containing person information and a trimap 107, construct a loss function, and train and optimize the portrait alpha mask matting network.
[0079] A main structure of the decoder 106 of the portrait alpha mask matting network may include upsampling, convolution, an ELU activation function, and a fully connected layer for outputting.
[0080] (4.1) The upsampling restores an image feature size after downsampling in the encoder.
[0081] (4.2) A SELU activation function is used to set outputs of some neurons in the deep learning network to 0 to form a sparse network structure. Hyperparameters λ and α in the SELU activation function are fixed constants, and the activation function is expressed by formula (2):
selu(x)=λx,x>0; selu(x)=λα(e.sup.x−1),x≤0 (2)
[0082] That the loss function of the portrait alpha mask matting network is constructed may specifically include:
[0083] (4.3) An alpha mask prediction error is expressed by formula (3):
Loss.sub.αlp=√{square root over ((α.sub.pre−α.sub.gro).sup.2)}+ε,α.sub.pre,α.sub.gro∈[0,1] (3)
[0084] where α.sub.pre and α.sub.gro respectively represent predicted and ground-truth alpha mask values, and ε represents a very small constant.
[0085] (4.4) An image compositing error is expressed by formula (4):
Loss.sub.com=√{square root over ((c.sub.pre−c.sub.gro).sup.2)}+ε (4)
[0086] where c.sub.pre and c.sub.gro respectively represent predicted and ground-truth alpha composite images.
[0087] (4.5) An overall loss function is constructed based on the alpha mask prediction error and the image compositing error, as expressed by formula (5):
Loss.sub.overall=ω.sub.1Loss.sub.αlp+ω.sub.2Loss.sub.com,ω.sub.1+ω.sub.2=1 (5)
[0088] Step S605: Input the preprocessed image data obtained in step 1 to a trained network, and output a ROI 108 of portrait foreground and a portrait trimap 107 in the ROI 108 through the logistic regression of the person detection network in step 2.
[0089] That the ROI 108 of the portrait foreground and the portrait trimap 107 in the ROI 108 are output may specifically include:
[0090] (5.1) Use a RIOU obtained by improving an original criterion for determining the ROI of the portrait foreground. This gives the ROI a stronger enclosing capability and prevents a fine edge of a person from being placed outside the ROI during object detection. The RIOU is expressed by formula (7):
[0091] where ROI.sub.edge represents a minimum bounding rectangle ROI that can enclose ROI.sub.p and ROI.sub.g, and [⋅] represents an area of the ROI.
[0092] (5.2) For person foreground and background binary classification results, use an erosion algorithm to remove noise, and then use a dilation algorithm to generate a clear edge contour. The finally obtained portrait trimap 107 is expressed by formula (8):
[0093] where f(pixel.sub.i) indicates that an i.sup.th pixel pixel.sub.i belongs to the foreground, b(pixel.sub.i) indicates that the i.sup.th pixel pixel.sub.i belongs to the background, and trimap.sub.i represents an alpha mask channel value of the i.sup.th pixel pixel.sub.i.
[0094] Step S606: Input the ROI 108 of the portrait foreground and the portrait trimap 107 in step 5 into the portrait alpha mask matting network constructed in step 4 to obtain a portrait alpha mask prediction result.
[0095] More specifically, the multi-task deep learning-based real-time matting method for non-green-screen portraits divides portrait matting into two algorithm tasks: the person detection task 101 in step 1 and the portrait foreground alpha mask matting task 102 in step 2, specifically including the following steps:
[0096] In step 1, the data preprocessing includes video frame processing and input image resizing.
[0097] The video frame processing may include:
[0098] Convert the video into frames by using FFmpeg, use an original video number as a folder name in a project directory, and store the frames as image files in the folder. In this way, a processed video file can be processed by using a same method as that used to process an image file in subsequent work.
[0099] The input image resizing may include:
[0100] Unify sizes of different input images. Calculate a zoom factor with the longest side of an original image as the reference side, scale the longest side proportionally to the input size specified by the subsequent network, and fill the vacant content on the short side with a gray background through padding. Keep the size of a network feature map the same as that of the original image. This prevents abnormal network output values caused by an invalid size of the input image.
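The resizing rule above can be sketched as follows. This is a minimal NumPy illustration: the 512-pixel target size, the gray value 128, and the nearest-neighbor resampling are choices made for the sketch, not values taken from the patent.

```python
import numpy as np

def letterbox(image: np.ndarray, target: int = 512, gray: int = 128) -> np.ndarray:
    """Scale the longest side to `target` and pad the short side with gray.

    `target` and `gray` are illustrative values, not taken from the patent.
    Nearest-neighbor scaling is used to keep the sketch dependency-free.
    """
    h, w = image.shape[:2]
    scale = target / max(h, w)                     # zoom factor from the longest side
    nh, nw = max(1, round(h * scale)), max(1, round(w * scale))
    # nearest-neighbor resample via index mapping
    ys = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    xs = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    resized = image[ys][:, xs]
    # gray canvas, scaled image centered on it
    canvas = np.full((target, target, image.shape[2]), gray, dtype=image.dtype)
    top, left = (target - nh) // 2, (target - nw) // 2
    canvas[top:top + nh, left:left + nw] = resized
    return canvas
```

Because the longest side defines the zoom factor, the image never overflows the canvas and the aspect ratio is preserved, which is what prevents the abnormal network output values mentioned above.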
[0101] As shown in
[0102] As shown in
[0103] Step S301: The encoder 104 is a fully convolutional residual neural network. In the network 104, skip connections are used to construct residual blocks res_block of different depths, and a feature sequence x is obtained by extracting features of the image containing the portrait information. For a processed frame {V.sub.t}.sub.t=1.sup.T, a feature sequence {x.sub.t}.sub.t=1.sup.T of a length T is extracted. V.sub.t represents a t.sup.th frame, and x.sub.t represents a feature sequence of the t.sup.th frame.
[0104] The feature extraction may include:
[0105] Use a deep learning technology to perform a cognitive process of the original image or the frame obtained after the video is preprocessed, and convert the image into a feature sequence that can be recognized by a computer.
[0106] Step S302: The logistic regression 105 is an output structure for multi-scale detection of a central position (x.sub.i, y.sub.i) of a ROI, a length and width (w.sub.i, h.sub.i) of the ROI, a confidence level C.sub.i of the ROI, a class p.sub.i(c), c∈classes of an object in the ROI, and the person foreground f(pixel.sub.i) and background b(pixel.sub.i) binary classification results. classes represents all classes in the training sample, namely, [class0:person, class1:others], and pixel.sub.i represents the i.sup.th pixel in the ROI.
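As an illustration of the logistic-regression output structure, the following sketch decodes one raw prediction vector into the quantities listed above. The tensor layout [x, y, w, h, confidence, class logits...] and the use of sigmoid/softmax are assumptions made for the sketch; the patent does not specify them.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_detection(raw: np.ndarray):
    """Decode one raw prediction vector into (center, size, confidence, probs).

    The layout [x, y, w, h, confidence, class logits...] is an assumption
    for illustration only.
    """
    x, y = sigmoid(raw[0]), sigmoid(raw[1])        # central position, in [0, 1]
    w, h = np.exp(raw[2]), np.exp(raw[3])          # length and width of the ROI
    conf = sigmoid(raw[4])                         # confidence level C_i
    logits = raw[5:]                               # one logit per class
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                           # p_i(c) over [person, others]
    return (x, y), (w, h), conf, probs
```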
[0107] As shown in
[0108] Step S401: Perform a forward pass through the deep residual neural network to obtain the outputs of the residual blocks res_block with downsampling multiples of 8 times, 16 times, and 32 times. To reduce a negative effect of a gradient caused by pooling, the downsampling adopts convolution kernels with a stride of 2. core.sub.8, core.sub.16, and core.sub.32 are set as the convolution kernels during the downsampling. Quantities of channels channel_n are equal to those of the corresponding inputs input.sub.8, input.sub.16, and input.sub.32, and a size of each convolution kernel is x×y. A size of an input is m×n, and a size of an output is m/2×n/2. Convolution corresponding to the output is expressed by formula (1). fun(⋅) represents an activation function, and β represents a bias.
output.sub.m/2,n/2=fun(ΣΣinput.sub.mn*core.sub.xy+β) (1)
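Formula (1) can be sketched as a single-channel stride-2 convolution. ReLU is used here as a stand-in for fun(⋅), and zero padding (an assumption for the sketch) keeps the output at exactly half the input size.

```python
import numpy as np

def conv_stride2(inp: np.ndarray, core: np.ndarray, beta: float = 0.0,
                 fun=lambda v: np.maximum(v, 0.0)) -> np.ndarray:
    """Stride-2 convolution per formula (1): output = fun(sum(input*core) + beta).

    Single-channel sketch; `fun` defaults to ReLU as a stand-in activation.
    Zero padding keeps the output at half the input size.
    """
    m, n = inp.shape
    kx, ky = core.shape
    px, py = kx // 2, ky // 2
    padded = np.pad(inp, ((px, px), (py, py)))
    out = np.empty((m // 2, n // 2))
    for i in range(m // 2):
        for j in range(n // 2):
            # the kernel window advances by 2 in each direction (stride of 2)
            patch = padded[2 * i:2 * i + kx, 2 * j:2 * j + ky]
            out[i, j] = fun(np.sum(patch * core) + beta)
    return out
```

Halving the spatial size with the stride, rather than with pooling, is what the passage means by avoiding the negative gradient effect of pooling.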
[0109] Further, corresponding outputs pass through a 3×3 convolution kernel conv3 to expand a receptive field of the feature map and increase local context information of the image feature. Then, the outputs pass through a 1×1 convolution kernel conv1 to reduce a feature channel dimension. The outputs are fused and stitched to form large, medium, and small fused image feature structures as the encoder of the portrait alpha mask matting network, to implement the encoder shared by the person detection and portrait alpha mask matting networks.
[0110] Step S402: A main structure of the decoder includes upsampling, convolution, an ELU activation function, and a fully connected layer for outputting. Input the image containing the person information and the trimap, construct a network loss function with an alpha mask prediction error and an image compositing error as a core, and train and optimize the portrait alpha mask matting network.
[0111] The upsampling is implemented by an upsampling operation. A specific value in the input image feature is mapped and filled to a corresponding area of the output upsampled image feature, and a blank area after upsampling is filled with the same value to restore the size of the image feature after downsampling in the encoder.
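The fill-with-the-same-value mapping described above amounts to nearest-neighbor upsampling, which can be sketched as:

```python
import numpy as np

def upsample_nearest(feat: np.ndarray, factor: int = 2) -> np.ndarray:
    """Map each input value to a factor x factor block of the output,
    filling the blank positions with the same value."""
    return np.repeat(np.repeat(feat, factor, axis=0), factor, axis=1)
```

Applied three, four, or five times with factor 2, this restores the 8x, 16x, and 32x downsampled feature sizes of the encoder.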
[0112] A SELU activation function is used to set outputs of some neurons in the deep learning network to 0 to form a sparse network structure. This effectively reduces overfitting of the matting network, and avoids gradient disappearance of a traditional sigmoid activation function during back propagation. Hyperparameters λ and α in the SELU activation function are fixed constants, and the activation function is expressed by formula (2):
selu(x)=λx,x>0; selu(x)=λα(e.sup.x−1),x≤0 (2)
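A sketch of the SELU activation follows, using the commonly published fixed constants for λ and α; the patent states only that they are fixed, so these values are an assumption.

```python
import numpy as np

# Fixed constants of the SELU activation (the commonly published values,
# assumed here since the patent does not list them).
LAMBDA = 1.0507009873554805
ALPHA = 1.6732632423543772

def selu(x):
    """selu(x) = λ·x for x > 0, and λ·α·(e^x − 1) otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, LAMBDA * x, LAMBDA * ALPHA * np.expm1(x))
```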
[0113] The alpha mask prediction error is expressed by formula (3):
Loss.sub.αlp=√{square root over ((α.sub.pre−α.sub.gro).sup.2)}+ε,α.sub.pre,α.sub.gro∈[0,1] (3)
[0114] where α.sub.pre and α.sub.gro respectively represent predicted and ground-truth alpha mask values, and ε represents a very small constant.
[0115] The image compositing error is expressed by formula (4):
Loss.sub.com=√{square root over ((c.sub.pre−c.sub.gro).sup.2)}+ε (4)
[0116] where c.sub.pre and c.sub.gro respectively represent predicted and ground-truth alpha composite images, and ε represents a very small constant.
[0117] An overall loss function is constructed based on the alpha mask prediction error and the image compositing error, as expressed by formula (5):
Loss.sub.overall=ω.sub.1Loss.sub.αlp+ω.sub.2Loss.sub.com,ω.sub.1+ω.sub.2=1 (5)
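The loss terms of formulas (3) to (5) can be sketched as follows. Averaging over pixels and the value of ε are assumptions made for the illustration; the formulas as printed are per-pixel.

```python
import numpy as np

EPS = 1e-6  # the very small constant ε (value assumed for illustration)

def alpha_loss(a_pre, a_gro):
    """Formula (3): sqrt((α_pre − α_gro)^2) + ε, averaged over pixels."""
    return float(np.mean(np.sqrt((a_pre - a_gro) ** 2) + EPS))

def composite_loss(c_pre, c_gro):
    """Formula (4): sqrt((c_pre − c_gro)^2) + ε, averaged over pixels."""
    return float(np.mean(np.sqrt((c_pre - c_gro) ** 2) + EPS))

def overall_loss(a_pre, a_gro, c_pre, c_gro, w1=0.5, w2=0.5):
    """Formula (5): ω1·Loss_alp + ω2·Loss_com, with ω1 + ω2 = 1."""
    assert abs(w1 + w2 - 1.0) < 1e-9
    return w1 * alpha_loss(a_pre, a_gro) + w2 * composite_loss(c_pre, c_gro)
```

The equal weights w1 = w2 = 0.5 are an arbitrary choice satisfying the ω1 + ω2 = 1 constraint; the patent does not fix the weights.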
[0118] As shown in
[0119] Step S501: The improvement and the data preprocessing are performed on the dataset to be processed.
[0120] Step S502: Input preprocessed image data to the trained person detection network model, and predict a ROI of portrait foreground and a portrait trimap 107 in the ROI through the logistic regression.
[0121] Generally, a ROI is determined based on an IOU during object detection, as expressed by formula (6). ROI.sub.p and ROI.sub.g respectively represent predicted and ground-truth ROIs.
IOU=[ROI.sub.p∩ROI.sub.g]/[ROI.sub.p∪ROI.sub.g] (6)
[0122] The present disclosure proposes the RIOU for determining the ROI of the portrait foreground. This gives the ROI a stronger enclosing capability and prevents a fine edge of a person from being placed outside the ROI during object detection. The RIOU is expressed by formula (7):
[0123] where ROI.sub.edge represents a minimum bounding rectangle ROI that can enclose ROI.sub.p and ROI.sub.g, and [⋅] represents an area of the ROI.
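The standard IOU and the minimum bounding rectangle ROI.sub.edge can be sketched as follows. Since formula (7) itself is not reproduced in the text, only these ingredients are shown; boxes are given as (x1, y1, x2, y2), a representation assumed for the sketch.

```python
def box_area(b):
    """Area [·] of a box given as (x1, y1, x2, y2)."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def iou(p, g):
    """Standard IOU: [ROI_p ∩ ROI_g] / [ROI_p ∪ ROI_g]."""
    ix1, iy1 = max(p[0], g[0]), max(p[1], g[1])
    ix2, iy2 = min(p[2], g[2]), min(p[3], g[3])
    inter = box_area((ix1, iy1, ix2, iy2))
    union = box_area(p) + box_area(g) - inter
    return inter / union if union > 0 else 0.0

def roi_edge(p, g):
    """Minimum bounding rectangle that encloses both ROI_p and ROI_g."""
    return (min(p[0], g[0]), min(p[1], g[1]), max(p[2], g[2]), max(p[3], g[3]))
```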
[0124] Further, for person foreground and background binary classification results, use an erosion algorithm to remove noise, and then use a dilation algorithm to generate a clear edge contour. The finally obtained portrait trimap 107 is expressed by formula (8):
[0125] where f(pixel.sub.i) indicates that an i.sup.th pixel pixel.sub.i belongs to the foreground, b(pixel.sub.i) indicates that the i.sup.th pixel pixel.sub.i belongs to the background, and trimap.sub.i represents an alpha mask channel value of the i.sup.th pixel pixel.sub.i.
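The erosion-then-dilation trimap generation can be sketched with plain 3×3 morphological operators. The unknown-region value of 0.5 and the number of dilation passes are assumptions made for the illustration; formula (8) itself is not reproduced in the text.

```python
import numpy as np

def _shifted_stack(mask: np.ndarray) -> np.ndarray:
    """Stack the 3x3 neighborhood of every pixel (edge-padded)."""
    padded = np.pad(mask, 1, mode="edge")
    h, w = mask.shape
    return np.stack([padded[dy:dy + h, dx:dx + w]
                     for dy in range(3) for dx in range(3)])

def erode(mask):
    # a pixel survives only if its whole 3x3 neighborhood is set
    return _shifted_stack(mask).min(axis=0)

def dilate(mask):
    # a pixel is set if any pixel in its 3x3 neighborhood is set
    return _shifted_stack(mask).max(axis=0)

def make_trimap(fg: np.ndarray) -> np.ndarray:
    """Denoise with erosion, widen the contour with dilation, and mark the
    band between them as unknown (0.5 is an assumed channel value)."""
    core = erode(fg)                 # confident foreground
    shell = dilate(dilate(fg))       # everything possibly foreground
    trimap = np.full(fg.shape, 0.5)
    trimap[core == 1] = 1.0
    trimap[shell == 0] = 0.0
    return trimap
```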
[0126] Step S503: Perform feature mapping on the ROI 108 of the original portrait foreground in step 2, and input the portrait trimap 107 in the ROI 108 into the portrait alpha mask matting network model to reduce the convolution computing scale and accelerate network computing. After the original resolution of the image is restored through the upsampling of the decoder, a portrait alpha mask prediction result α is obtained from the output of the fully connected layer.
[0127] Step S504: In combination with the original input image, the portrait matting task is completed through foreground extraction, as expressed by formula (9). I represents the input image, F represents the portrait foreground, and B represents the background image.
I=αF+(1−α)B (9)
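Formula (9) corresponds directly to alpha compositing, sketched here with the alpha mask broadcast over the color channels:

```python
import numpy as np

def composite(alpha: np.ndarray, fg: np.ndarray, bg: np.ndarray) -> np.ndarray:
    """Formula (9): I = αF + (1 − α)B, with α broadcast over channels."""
    a = alpha[..., None]            # H x W -> H x W x 1 for broadcasting
    return a * fg + (1.0 - a) * bg
```

Matting runs the same equation in reverse: given I and the predicted α, the foreground F can be extracted and recomposited onto any new background B.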
[0128] The foregoing is merely a description of the embodiments of the present disclosure, and is not a limitation to the present disclosure. Those of ordinary skill in the art should realize that any changes and modifications made to the present disclosure fall within the protection scope of the present disclosure.