SYSTEM AND METHOD FOR FACIAL UN-DISTORTION IN DIGITAL IMAGES USING MULTIPLE IMAGING SENSORS
20230245330 · 2023-08-03
Abstract
A method includes aligning landmark points between multiple distorted images to generate multiple aligned images, where the multiple distorted images exhibit perspective distortion in at least one face appearing in the multiple distorted images. The method also includes predicting a depth map using a disparity estimation neural network that receives the multiple aligned images as input. The method further includes generating a warp field using a selected one of the multiple aligned images. The method also includes performing a two-dimensional (2D) image projection on the selected aligned image using the depth map and the warp field to generate an undistorted image. In addition, the method includes filling in one or more missing pixels in the undistorted image using an inpainting neural network to generate a final undistorted image.
Claims
1. A method comprising: aligning landmark points between multiple distorted images to generate multiple aligned images, wherein the multiple distorted images exhibit perspective distortion in at least one face appearing in the multiple distorted images; predicting a depth map using a disparity estimation neural network that receives the multiple aligned images as input; generating a warp field using a selected one of the multiple aligned images; performing a two-dimensional (2D) image projection on the selected aligned image using the depth map and the warp field to generate an undistorted image; and filling in one or more missing pixels in the undistorted image using an inpainting neural network to generate a final undistorted image.
2. The method of claim 1, wherein the disparity estimation neural network and the inpainting neural network are trained by adjusting weights based on a loss value determined according to differences between an undistorted ground truth image and a predicted image generated using the disparity estimation neural network and the inpainting neural network.
3. The method of claim 1, wherein the multiple distorted images are captured by dual imaging sensors of an electronic device.
4. The method of claim 1, further comprising: before generating the warp field, virtually making one or more subjects in the selected aligned image more distant by adding a constant distance vector to each pixel in the selected aligned image.
5. The method of claim 1, wherein the landmark points are aligned to correct baseline disparities between the multiple distorted images caused by differences in at least one of: sensor sensitivities, calibrations, focal lengths, and baseline distance between imaging sensors.
6. The method of claim 1, wherein the landmark points are aligned using an affine transformation.
7. The method of claim 1, further comprising: adjusting for brightness differences between the multiple distorted images using histogram equalization such that the multiple distorted images have similar brightness.
8. An electronic device comprising: at least one memory configured to store instructions; and at least one processing device configured when executing the instructions to: align landmark points between multiple distorted images to generate multiple aligned images, wherein the multiple distorted images exhibit perspective distortion in at least one face appearing in the multiple distorted images; predict a depth map using a disparity estimation neural network that receives the multiple aligned images as input; generate a warp field using a selected one of the multiple aligned images; perform a two-dimensional (2D) image projection on the selected aligned image using the depth map and the warp field to generate an undistorted image; and fill in one or more missing pixels in the undistorted image using an inpainting neural network to generate a final undistorted image.
9. The electronic device of claim 8, wherein the disparity estimation neural network and the inpainting neural network are trained by adjusting weights based on a loss value determined according to differences between an undistorted ground truth image and a predicted image generated using the disparity estimation neural network and the inpainting neural network.
10. The electronic device of claim 8, wherein the multiple distorted images comprise images captured by dual imaging sensors of the electronic device or another electronic device.
11. The electronic device of claim 8, wherein the at least one processing device is further configured when executing the instructions to: before generating the warp field, virtually make one or more subjects in the selected aligned image more distant by adding a constant distance vector to each pixel in the selected aligned image.
12. The electronic device of claim 8, wherein the at least one processing device is configured to align the landmark points to correct baseline disparities between the multiple distorted images caused by differences in at least one of: sensor sensitivities, calibrations, focal lengths, and baseline distance between imaging sensors.
13. The electronic device of claim 8, wherein the at least one processing device is configured to align the landmark points using an affine transformation.
14. The electronic device of claim 8, wherein the at least one processing device is further configured when executing the instructions to adjust for brightness differences between the multiple distorted images using histogram equalization such that the multiple distorted images have similar brightness.
15. A non-transitory machine-readable medium containing instructions that when executed cause at least one processor of an electronic device to: align landmark points between multiple distorted images to generate multiple aligned images, wherein the multiple distorted images exhibit perspective distortion in at least one face appearing in the multiple distorted images; predict a depth map using a disparity estimation neural network that receives the multiple aligned images as input; generate a warp field using a selected one of the multiple aligned images; perform a two-dimensional (2D) image projection on the selected aligned image using the depth map and the warp field to generate an undistorted image; and fill in one or more missing pixels in the undistorted image using an inpainting neural network to generate a final undistorted image.
16. The non-transitory machine-readable medium of claim 15, wherein the disparity estimation neural network and the inpainting neural network are trained by adjusting weights based on a loss value determined according to differences between an undistorted ground truth image and a predicted image generated using the disparity estimation neural network and the inpainting neural network.
17. The non-transitory machine-readable medium of claim 15, wherein the multiple distorted images comprise images captured by dual imaging sensors of the electronic device or another electronic device.
18. The non-transitory machine-readable medium of claim 15, further containing instructions that when executed cause the at least one processor to: before generating the warp field, virtually make one or more subjects in the selected aligned image more distant by adding a constant distance vector to each pixel in the selected aligned image.
19. The non-transitory machine-readable medium of claim 15, wherein the instructions when executed cause at least one processor to align the landmark points to correct baseline disparities between the multiple distorted images caused by differences in at least one of: sensor sensitivities, calibrations, focal lengths, and baseline distance between imaging sensors.
20. The non-transitory machine-readable medium of claim 15, wherein the instructions when executed cause at least one processor to align the landmark points using an affine transformation.
21. A method comprising: identifying landmark points on a face portion of a person appearing in an undistorted ground truth image; generating a three-dimensional (3D) face model that fits the landmark points of the face portion, the 3D face model including depth information of the face portion; performing a perspective projection using the undistorted ground truth image and the depth information of the face portion to generate left and right distorted image pixel locations; generating left and right warp fields based on the left and right distorted image pixel locations; and performing a two-dimensional (2D) image projection on the undistorted ground truth image using the 3D face model and the left and right warp fields to generate a stereo image pair.
22. The method of claim 21, wherein the stereo image pair comprises left and right images that exhibit perspective distortion in the face portion of the person.
23. The method of claim 21, further comprising: training a disparity estimation neural network and an inpainting neural network using the stereo image pair, wherein the disparity estimation neural network and the inpainting neural network are trained by adjusting weights based on a loss value determined according to differences between the undistorted ground truth image and a predicted image generated using the disparity estimation neural network, the inpainting neural network, and the stereo image pair.
24. The method of claim 21, wherein generating the 3D face model comprises: obtaining a set of face model parameters by projecting corresponding landmark points of the 3D face model onto the face portion of the undistorted ground truth image and minimizing a distance between the landmark points of the undistorted ground truth image and the corresponding landmark points of the 3D face model.
25. The method of claim 24, wherein the corresponding landmark points of the 3D face model are projected onto the face portion of the undistorted ground truth image by dividing coordinates of the corresponding landmark points of the 3D face model by a constant value representing a distance from an imaging sensor to the person.
26. The method of claim 21, wherein the landmark points comprise points associated with one or more of: eyes, eyebrows, a nose, nostrils, lips, and a contour of a jaw line of the face portion.
27. The method of claim 21, further comprising: performing a perspective transformation that simulates moving a virtual imaging sensor closer to the person appearing in the undistorted ground truth image and reprojecting the landmark points on the 3D face model, thereby generating stronger distortion.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0019] For a more complete understanding of this disclosure and its advantages, reference is now made to the following description taken in conjunction with the accompanying drawings, in which like reference numerals represent like parts:
DETAILED DESCRIPTION
[0027] As discussed above, with recent developments in mobile device camera technology, it has become common for mobile device users to take “selfies.” Since, by definition, a selfie is an image of the user taken by the user, selfie photos are commonly captured at arm's length. This small distance can result in perspective distortion of any face that appears in the selfie. For example, in many selfies, the face appears narrower and the nose appears enlarged compared to the actual face. Here, perspective distortion is distinguished from other types of image distortion, such as distortion caused by sensor noise or malfunction. Perspective distortion of faces can result in unappealing facial appearances in selfies. This can also result in an unsatisfactory user experience, both when capturing selfies and during handheld video calls. Some techniques have been developed to fix perspective distortion in selfies. However, these techniques are implemented post-capture and require a depth map of the subject. These techniques therefore require additional complicated algorithms and/or depth sensors, both of which add complexity to a smartphone or other mobile device.
[0028] This disclosure provides systems and methods for facial un-distortion in digital images using multiple imaging sensors. As described in more detail below, the disclosed systems and methods receive a distorted pair of images captured with multiple imaging sensors and correct perspective distortion of human faces in the images without the use of a pre-generated depth map. In some embodiments, the disclosed systems and methods use an end-to-end differentiable deep learning pipeline to correct the perspective distortion. In addition, the disclosed systems and methods allow for variable distance between the multiple imaging sensors, as well as variable focal lengths and sensor gains. Compared to prior techniques, the disclosed embodiments achieve significant improvement in facial distortion correction without requiring the use of a depth sensor. Note that while some of the embodiments discussed below are described in the context of use in consumer electronic devices, such as smartphones or tablet computers, these are merely examples. It will be understood that the principles of this disclosure may be implemented in any number of other suitable contexts.
[0030] According to embodiments of this disclosure, an electronic device 101 is included in the network configuration 100. The electronic device 101 can include at least one of a bus 110, a processor 120, a memory 130, an input/output (I/O) interface 150, a display 160, a communication interface 170, or a sensor 180. In some embodiments, the electronic device 101 may exclude at least one of these components or may add at least one other component. The bus 110 includes a circuit for connecting the components 120-180 with one another and for transferring communications (such as control messages and/or data) between the components.
[0031] The processor 120 includes one or more of a central processing unit (CPU), an application processor (AP), or a communication processor (CP). The processor 120 is able to perform control on at least one of the other components of the electronic device 101 and/or perform an operation or data processing relating to communication. In some embodiments, the processor 120 can be a graphics processing unit (GPU). As described in more detail below, the processor 120 may perform one or more operations for facial un-distortion in digital images using multiple cameras or other imaging sensors.
[0032] The memory 130 can include a volatile and/or non-volatile memory. For example, the memory 130 can store commands or data related to at least one other component of the electronic device 101. According to embodiments of this disclosure, the memory 130 can store software and/or a program 140. The program 140 includes, for example, a kernel 141, middleware 143, an application programming interface (API) 145, and/or an application program (or “application”) 147. At least a portion of the kernel 141, middleware 143, or API 145 may be denoted an operating system (OS).
[0033] The kernel 141 can control or manage system resources (such as the bus 110, processor 120, or memory 130) used to perform operations or functions implemented in other programs (such as the middleware 143, API 145, or application 147). The kernel 141 provides an interface that allows the middleware 143, the API 145, or the application 147 to access the individual components of the electronic device 101 to control or manage the system resources. The application 147 may support one or more functions for facial un-distortion in digital images using multiple cameras or other imaging sensors as discussed below. These functions can be performed by a single application or by multiple applications that each carry out one or more of these functions. The middleware 143 can function as a relay to allow the API 145 or the application 147 to communicate data with the kernel 141, for instance. A plurality of applications 147 can be provided. The middleware 143 is able to control work requests received from the applications 147, such as by allocating the priority of using the system resources of the electronic device 101 (like the bus 110, the processor 120, or the memory 130) to at least one of the plurality of applications 147. The API 145 is an interface allowing the application 147 to control functions provided from the kernel 141 or the middleware 143. For example, the API 145 includes at least one interface or function (such as a command) for filing control, window control, image processing, or text control.
[0034] The I/O interface 150 serves as an interface that can, for example, transfer commands or data input from a user or other external devices to other component(s) of the electronic device 101. The I/O interface 150 can also output commands or data received from other component(s) of the electronic device 101 to the user or the other external device.
[0035] The display 160 includes, for example, a liquid crystal display (LCD), a light emitting diode (LED) display, an organic light emitting diode (OLED) display, a quantum-dot light emitting diode (QLED) display, a microelectromechanical systems (MEMS) display, or an electronic paper display. The display 160 can also be a depth-aware display, such as a multi-focal display. The display 160 is able to display, for example, various contents (such as text, images, videos, icons, or symbols) to the user. The display 160 can include a touchscreen and may receive, for example, a touch, gesture, proximity, or hovering input using an electronic pen or a body portion of the user.
[0036] The communication interface 170, for example, is able to set up communication between the electronic device 101 and an external electronic device (such as a first electronic device 102, a second electronic device 104, or a server 106). For example, the communication interface 170 can be connected with a network 162 or 164 through wireless or wired communication to communicate with the external electronic device. The communication interface 170 can be a wired or wireless transceiver or any other component for transmitting and receiving signals.
[0037] The wireless communication is able to use at least one of, for example, long term evolution (LTE), long term evolution-advanced (LTE-A), 5th generation wireless system (5G), millimeter-wave or 60 GHz wireless communication, Wireless USB, code division multiple access (CDMA), wideband code division multiple access (WCDMA), universal mobile telecommunication system (UMTS), wireless broadband (WiBro), or global system for mobile communication (GSM), as a cellular communication protocol. The wired connection can include, for example, at least one of a universal serial bus (USB), high definition multimedia interface (HDMI), recommended standard 232 (RS-232), or plain old telephone service (POTS). The network 162 or 164 includes at least one communication network, such as a computer network (like a local area network (LAN) or wide area network (WAN)), Internet, or a telephone network.
[0038] The electronic device 101 further includes one or more sensors 180 that can meter a physical quantity or detect an activation state of the electronic device 101 and convert metered or detected information into an electrical signal. For example, one or more sensors 180 include one or more cameras or other imaging sensors for capturing images of scenes. The sensor(s) 180 can also include one or more buttons for touch input, a gesture sensor, a gyroscope or gyro sensor, an air pressure sensor, a magnetic sensor or magnetometer, an acceleration sensor or accelerometer, a grip sensor, a proximity sensor, a color sensor (such as a red green blue (RGB) sensor), a bio-physical sensor, a temperature sensor, a humidity sensor, an illumination sensor, an ultraviolet (UV) sensor, an electromyography (EMG) sensor, an electroencephalogram (EEG) sensor, an electrocardiogram (ECG) sensor, an infrared (IR) sensor, an ultrasound sensor, an iris sensor, or a fingerprint sensor. The sensor(s) 180 can further include an inertial measurement unit, which can include one or more accelerometers, gyroscopes, and other components. In addition, the sensor(s) 180 can include a control circuit for controlling at least one of the sensors included here. Any of these sensor(s) 180 can be located within the electronic device 101.
[0039] The first external electronic device 102 or the second external electronic device 104 can be a wearable device or an electronic device-mountable wearable device (such as an HMD). When the electronic device 101 is mounted in the electronic device 102 (such as the HMD), the electronic device 101 can communicate with the electronic device 102 through the communication interface 170. The electronic device 101 can be directly connected with the electronic device 102 to communicate with the electronic device 102 without involving a separate network. The electronic device 101 can also be an augmented reality wearable device, such as eyeglasses, that includes one or more imaging sensors.
[0040] The first and second external electronic devices 102 and 104 and the server 106 each can be a device of the same or a different type from the electronic device 101. According to certain embodiments of this disclosure, the server 106 includes a group of one or more servers. Also, according to certain embodiments of this disclosure, all or some of the operations executed on the electronic device 101 can be executed on another or multiple other electronic devices (such as the electronic devices 102 and 104 or server 106). Further, according to certain embodiments of this disclosure, when the electronic device 101 should perform some function or service automatically or at a request, the electronic device 101, instead of executing the function or service on its own or additionally, can request another device (such as electronic devices 102 and 104 or server 106) to perform at least some functions associated therewith. The other electronic device (such as electronic devices 102 and 104 or server 106) is able to execute the requested functions or additional functions and transfer a result of the execution to the electronic device 101. The electronic device 101 can provide a requested function or service by processing the received result as it is or additionally. To that end, a cloud computing, distributed computing, or client-server computing technique may be used, for example.
[0041] The server 106 can include the same or similar components 110-180 as the electronic device 101 (or a suitable subset thereof). The server 106 can support the electronic device 101 by performing at least one of the operations (or functions) implemented on the electronic device 101. For example, the server 106 can include a processing module or processor that may support the processor 120 implemented in the electronic device 101. As described in more detail below, the server 106 may perform one or more operations to support techniques for facial un-distortion in digital images using multiple cameras or other imaging sensors.
[0045] Starting with the training data generation process 220, the electronic device 101 obtains at least one ground truth image 202. The ground truth image 202 represents a “clean” image of a person's face that includes no distortion. Each ground truth image 202 can be obtained in any suitable manner. For example, the electronic device 101 can obtain the ground truth image 202 from an image database. Using the ground truth image 202 as an input, the electronic device 101 performs a two-dimensional (2D) landmark detection process 204 to identify specific landmarks (such as eyes, nose, nostrils, corners of lips, a contour of a jaw line, and the like) on the face shown in the ground truth image 202.
[0046] In some embodiments, the 2D landmark detection process 204 can be expressed as follows:

Φ: I → {(x_1, y_1), (x_2, y_2), . . . , (x_L, y_L)}    (1)

where Φ represents the 2D landmark detection process 204, I represents the ground truth image 202, and each (x_i, y_i) represents one of the L landmark points 250.
[0047] Using the set of landmark points 250 as an input, the electronic device 101 performs a weak perspective three-dimensional (3D) face model fitting process 206 to generate a 3D model of the face in the ground truth image 202, where the 3D model can explain the landmark points 250. Here, a weak perspective model represents use of a virtual camera or other virtual imaging sensor that is relatively far away from the subject, thereby having weaker perspective distortion (as compared to a relatively close virtual imaging sensor, which would generate a stronger perspective distortion).
[0048] The 3D model 252 can be represented by a set of face model parameters, which are obtained by projecting corresponding landmark points of the 3D model 252 onto the face portion of the ground truth image 202 and minimizing the distance between the landmark points 250 and the projected landmark points of the 3D model 252. In projecting the landmark points from the 3D model to the 2D image plane, the weak perspective 3D face model fitting process 206 assumes a large distance between the subject and the imaging sensor. That is, a weak perspective model is assumed because the ground truth image 202 is distortion free, so it is assumed that the ground truth image 202 is obtained at a large distance. Thus, the relative depths of facial features are negligible compared with the imaging sensor-to-subject distance. The projection of each landmark point 250 can be represented by the following:
x = X / Z_avg, y = Y / Z_avg    (2)

where X and Y represent coordinates of landmark points on the 3D model 252, and Z_avg represents the average subject-to-imaging sensor distance. It is noted that Z_avg is a constant and is much larger than the values of X and Y. For example, in some embodiments, Z_avg is at least one hundred times larger than the values of X and Y, although other multiples are possible. It is also noted that, for purposes of mathematical simplicity and without loss of generality, the focal length is assumed to be equal to a constant value of 1 (in whatever units are being used).
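As a rough illustration, the weak perspective projection of Equation (2) can be sketched as follows. This is a minimal numpy example; the landmark coordinates and the value of Z_avg are hypothetical and are not taken from the disclosure.

```python
import numpy as np

def weak_perspective_project(points_3d, z_avg):
    # Equation (2): x = X / Z_avg, y = Y / Z_avg, with a focal length of 1.
    # Each point's own depth Z is ignored because Z_avg dominates it.
    pts = np.asarray(points_3d, dtype=float)
    return pts[:, :2] / z_avg

# Hypothetical landmark coordinates (X, Y, Z); Z_avg is far larger than X and Y.
landmarks = np.array([[1.0, 2.0, 300.0],
                      [-1.5, 0.5, 302.0]])
projected = weak_perspective_project(landmarks, z_avg=300.0)
# Each projected point is simply (X / 300, Y / 300).
```

Because the divisor is the same constant for every landmark, relative proportions of the face are preserved, which is why a weak perspective model is appropriate for the distortion-free ground truth image.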
[0049] The weak perspective 3D face model fitting process 206 represents any suitable technique for generating a 3D model of a face from a given input image under the assumption that the camera to subject distance is large. In some embodiments, the weak perspective 3D face model fitting process 206 uses a FLAME model technique to generate the 3D model, although other techniques are possible and within the scope of this disclosure.
[0050] After the weak perspective 3D face model fitting process 206, the electronic device 101 performs a weak-to-strong perspective transformation process 208, which simulates moving the virtual imaging sensor closer to the subject. The electronic device 101 also reprojects all the pixels of the input ground truth image using their depth information from the fitted 3D model 252, thereby generating a stronger perspective distortion model (since perspective distortion is stronger at closer distances). Using the weak-to-strong perspective transformation process 208, the electronic device 101 obtains new pixel locations for a distorted image that has the appearance of being captured from a short distance. In some embodiments, the electronic device 101 uses a strong perspective model that can be represented by the following (assuming a focal length of 1):
x′ = X / Z, y′ = Y / Z    (3)

where X, Y, and Z represent the 3D locations of the landmark points on the 3D model 252 with respect to the camera optical center (origin) C, and (x′, y′) represents the new 2D location of each pixel in the distorted image. In Equation (3), the denominator Z is variable and can change for each landmark point. The values of Z may be much less than the value of Z_avg in Equation (2) because the virtual imaging sensor has been moved much closer to the subject.
[0051] In some cases, the calculations using Equation (3) can be performed twice in the weak-to-strong perspective transformation process 208, once for each distorted image of a training stereo pair 218a-218b. As discussed in greater detail below, each image of the training stereo pair 218a-218b represents an image taken from a different imaging sensor of an electronic device, such as a dual imaging sensor smartphone or other electronic device having a left imaging sensor and a right imaging sensor. Each imaging sensor inherently has a different origin C.
[0052] After the weak-to-strong perspective transformation process 208, the electronic device 101 performs a virtual imaging sensor transformation process 210 and a depth calculation process 212. The virtual imaging sensor transformation process 210 is performed to obtain new distance values for the distorted images using the depth values from the 3D model 252. The electronic device 101 can perform any suitable virtual imaging sensor transformation process 210. The depth calculation process 212 is performed to convert distance units of the depth values (such as meters, centimeters, inches, or the like) into pixels.
[0053] After the depth calculation process 212, the electronic device 101 performs a warp field generation process 214 to generate warp fields corresponding to the left and right imaging sensors.
In some embodiments, for each pixel, a warp field stores a difference vector between the pixel's distorted and original locations, which can be represented by the following:

d = (x′ − x, y′ − y)    (4)

The difference vector d can be used to obtain the new location of a pixel in the distorted image plane. In the warp field generation process 214, the electronic device 101 computes a left warp field 254 corresponding to the left imaging sensor and a right warp field 254 corresponding to the right imaging sensor. Each warp field 254 can be based on a different origin C.
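A warp field of per-pixel difference vectors can be sketched as follows. This is a simplified numpy illustration over a handful of points rather than a full image grid; the coordinate values are hypothetical.

```python
import numpy as np

def generate_warp_field(original_xy, distorted_xy):
    # Per-pixel difference vectors d = (x' - x, y' - y).
    return np.asarray(distorted_xy, dtype=float) - np.asarray(original_xy, dtype=float)

def apply_warp_field(original_xy, warp_field):
    # Adding d to an original pixel location yields its location in the
    # distorted image plane.
    return np.asarray(original_xy, dtype=float) + warp_field

# Hypothetical pixel locations before and after perspective projection.
orig = np.array([[10.0, 20.0], [15.0, 25.0]])
dist = np.array([[12.0, 19.0], [15.5, 26.0]])
warp = generate_warp_field(orig, dist)
```

In the training data generation process, one such field would be computed per imaging sensor (a left warp field and a right warp field), since each sensor's origin C yields different distorted pixel locations.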
[0054] After the electronic device 101 generates the warp fields 254, the electronic device 101 performs a 2D image projection process 216 on the ground truth image 202 to generate a training stereo pair 218a-218b, which are the distorted images that can be used for training.
[0056] In the training process 240, the electronic device 101 obtains one or more pairs of distorted images 222a-222b, which represent two distorted images (such as left and right distorted images) that exhibit perspective distortion in a person's face shown in the distorted images 222a-222b. Each pair of distorted images 222a-222b can represent a training stereo pair 218a-218b generated during the training data generation process 220. Using the distorted images 222a-222b as input, the electronic device 101 performs a 2D baseline normalization process 224 to address variable parametric differences that can occur between different imaging sensors. For example, stereo imaging sensors used during training might be different than the imaging sensor used to generate a ground truth image 202. Different imaging sensors and devices can exhibit parametric differences, such as sensor sensitivities, calibrations, focal lengths, and the baseline distance between the imaging sensors in a given device. The electronic device 101 performs the 2D baseline normalization process 224 to remove any baseline differences or “disparities” that exist between the distorted images 222a-222b, thereby “normalizing” the distorted images 222a-222b.
[0057] In some cases, the 2D baseline normalization process 224 removes the baseline disparity by aligning a subset of the landmark points 250 identified during the training data generation process 220 (such as only the nostrils of the face). Using the subset of landmark points 250, the faces can be aligned between the distorted images 222a-222b such that the landmark points 250 appear at nearby locations in the 2D grid. The 2D baseline normalization process 224 can use one or more transformations, such as an affine transformation, to align the images. An affine transformation can rotate an image (such as to account for sensor alignment), scale an image (such as to account for focal length differences), and translate an image (such as to account for baseline distance differences). This can be expressed mathematically as follows:
y=Ax+b (5)
where y represents the coordinates of a transformed point, x represents the coordinates of an input point, A represents a matrix that models rotation and scaling, and b represents a vector that models translation.
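The affine alignment of equation (5) can be sketched as follows, assuming matched 2D landmark coordinates are available as numpy arrays. This is a minimal illustration only; the function names and the least-squares fitting approach are illustrative and not taken from the disclosure.

```python
import numpy as np

def fit_affine(src_pts, dst_pts):
    """Least-squares fit of y = A x + b mapping source landmarks onto destination landmarks.

    src_pts, dst_pts: (N, 2) arrays of matched 2D landmark coordinates.
    Returns A (2x2 matrix modeling rotation and scaling) and b (translation vector).
    """
    n = src_pts.shape[0]
    # Augment the source points with a column of ones so one solve recovers [A^T; b].
    X = np.hstack([src_pts, np.ones((n, 1))])
    params, *_ = np.linalg.lstsq(X, dst_pts, rcond=None)
    A = params[:2].T   # rotation-and-scale block
    b = params[2]      # translation vector
    return A, b

def apply_affine(A, b, pts):
    # y = A x + b applied row-wise to an (N, 2) array of points.
    return pts @ A.T + b
```

In practice, a robust estimator (such as RANSAC over the landmark correspondences) may be preferable when some landmark detections are unreliable.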
[0058]
[0059] The 2D baseline normalization process 224 can also account for brightness differences between sensors. For example, the 2D baseline normalization process 224 can use histogram equalization so that the distorted images 222a-222b have similar brightness. Of course, histogram equalization is only one example technique for equalizing brightness between images, and any other suitable technique can be used.
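One way to equalize brightness between the two images is histogram matching, which remaps one image's intensities so its cumulative distribution follows the other's. The sketch below assumes grayscale uint8 inputs; the function name and implementation details are illustrative, not from the disclosure.

```python
import numpy as np

def match_brightness(source, reference):
    """Remap source pixel intensities so their histogram approximates the reference.

    source, reference: 2D uint8 grayscale arrays. Returns the remapped source image.
    """
    src_vals, src_counts = np.unique(source.ravel(), return_counts=True)
    ref_vals, ref_counts = np.unique(reference.ravel(), return_counts=True)
    # Normalized cumulative distributions of both images.
    src_cdf = np.cumsum(src_counts) / source.size
    ref_cdf = np.cumsum(ref_counts) / reference.size
    # For each source intensity, find the reference intensity at the same CDF level.
    remapped_vals = np.interp(src_cdf, ref_cdf, ref_vals)
    # Look up each source pixel's new intensity.
    idx = np.searchsorted(src_vals, source.ravel())
    return remapped_vals[idx].reshape(source.shape).astype(np.uint8)
```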
[0060] After the 2D baseline normalization process 224, the electronic device 101 trains a disparity estimation network 226 using the distorted images 222a-222b. The disparity estimation network 226 is a deep neural network (DNN), such as a convolutional neural network (CNN). Deep learning networks may require training using a large number of training examples (such as dozens, hundreds, or thousands) to perform at high levels of accuracy. Thus, the electronic device 101 performs the training process 240, in part, to train the disparity estimation network 226. The disparity estimation network 226 accepts the distorted images 222a-222b as input and, for each pair, predicts a depth map 244 (such as in units of pixels). Each depth map 244 indicates, for each pixel in the associated distorted images 222a-222b, how far the object represented by that pixel is from the imaging sensor. The disparity estimation network 226 represents any suitable deep learning network or other machine learning model that is trained to predict depth maps using distorted images. In some embodiments, the disparity estimation network 226 includes multiple layers, which can include one or more encoder layers, decoder layers, and the like.
[0061] After obtaining the depth map 244 using the disparity estimation network 226, the electronic device 101 performs a depth calculation process 228 to convert the disparity from pixel units to physical distance units (such as meters, centimeters, inches, or the like) using one or more parameters of the imaging sensor model. Essentially, the depth calculation process 228 rescales the depth map 244 into different units to make the depth map 244 better suited for downstream processes.
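The conversion performed by the depth calculation process 228 can follow the standard stereo relation depth = focal length × baseline / disparity. The sketch below assumes the focal length is expressed in pixels and the baseline in meters; the function name and parameters are illustrative, not from the disclosure.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_length_px, baseline_m, eps=1e-6):
    """Convert a disparity map in pixel units to metric depth.

    Uses depth = focal_length * baseline / disparity (larger disparity = closer object).
    disparity_px: array of disparities in pixels.
    focal_length_px: focal length of the imaging sensor, expressed in pixels.
    baseline_m: physical distance between the two imaging sensors, in meters.
    eps: small floor to avoid division by zero for far-away (zero-disparity) pixels.
    """
    return focal_length_px * baseline_m / np.maximum(disparity_px, eps)
```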
[0062] After the electronic device 101 uses the disparity estimation network 226 to obtain a depth map 244, the electronic device 101 performs a virtual imaging sensor transformation process 230 to “virtually” move a virtual imaging sensor further away from a subject. Moving the virtual imaging sensor further away corresponds to a reduction in perspective distortion as discussed above. In some embodiments, the virtual imaging sensor transformation process 230 includes adding a constant distance vector to every pixel in a selected one of the distorted images 222a-222b (such as the distorted image 222a). After the virtual imaging sensor transformation process 230, the electronic device 101 performs a warp field generation process 232 using the selected distorted image 222a to generate a warp field. The warp field generation process 232 is similar to the warp field generation process 214 of the training data generation process 220 discussed above, except the warp field generated in the warp field generation process 232 is used to eliminate the distortion in the distorted image 222a (in contrast to the warp fields 254, which are used to introduce distortion to the ground truth image 202). Using the depth map 244 and the warp field generated in the warp field generation process 232, the electronic device 101 performs a 2D image projection process 234 on the distorted image 222a to generate an undistorted image 260, an example of which is shown in
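Under a pinhole model, retreating the virtual sensor by a constant distance d rescales each pixel's offset from the principal point by Z / (Z + d), where Z is that pixel's depth (the focal length cancels in the ratio). A minimal sketch of generating such a warp field, with illustrative names not taken from the disclosure:

```python
import numpy as np

def virtual_dolly_warp(depth_m, center, extra_dist_m):
    """Warp field for a virtual imaging sensor moved back by extra_dist_m meters.

    A point at depth Z projecting to pixel p (measured from the principal point)
    reprojects to p * Z / (Z + d) after the sensor retreats by d, which flattens
    the perspective distortion of nearby faces.
    depth_m: (H, W) per-pixel depth in meters.
    center: (cx, cy) principal point in pixel coordinates.
    Returns per-pixel target coordinates (map_x, map_y) as two (H, W) arrays.
    """
    h, w = depth_m.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    cx, cy = center
    # Scale factor < 1 everywhere: pixels move toward the principal point.
    scale = depth_m / (depth_m + extra_dist_m)
    map_x = cx + (xs - cx) * scale
    map_y = cy + (ys - cy) * scale
    return map_x, map_y
```

Applying this warp leaves “holes” where pixels move apart, which is why the inpainting stage discussed below is needed.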
[0063] As shown in
[0064] The prediction image 238 represents a prediction of what is actually shown in the ground truth image 202. However, the prediction may not be entirely accurate, especially early in training. Thus, the training process 240 is performed iteratively, and a loss 242 can be calculated for each iteration. The loss 242 is calculated to represent the difference between the ground truth image 202 and the prediction image 238. The electronic device 101 may calculate the loss 242 using any suitable metric for image quality, such as L1, structural similarity index (SSIM), multi-scale SSIM (MS-SSIM), and the like. An example of a loss function is given below:
Loss = Σ_{i=1}^{P} (X_i − X̂_i)^2 (6)
where X_i represents the value of the i-th pixel in the ground truth image 202, X̂_i represents the value of the i-th pixel in the prediction image 238, and P represents the number of pixels. Of course, this is merely one example, and other loss function calculations can be used. Once the loss 242 is calculated, the electronic device 101 uses the loss 242 to tune one or more network weights. For example, in the training process 240, both the disparity estimation network 226 and the inpainting network 236 include weights that are updated based on the calculated loss 242, such as via a backpropagation algorithm. Once the weights are updated, the electronic device 101 can perform another iteration of the training process 240, and the iterations can continue until the loss 242 is acceptably small or until one or more other criteria are met (such as a specified amount of time elapsing or a specified number of training iterations completing).
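The sum-of-squared-differences loss of equation (6) can be computed directly; a minimal sketch (the function name is illustrative, not from the disclosure):

```python
import numpy as np

def training_loss(ground_truth, prediction):
    """Sum of squared per-pixel differences between the ground truth image
    and the prediction image, matching equation (6)."""
    diff = ground_truth.astype(np.float64) - prediction.astype(np.float64)
    return float(np.sum(diff ** 2))
```

Perceptual metrics such as SSIM or MS-SSIM could be substituted or combined with this term, as noted above.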
[0065] Although
[0066]
[0067] As shown in
[0068] Using the images 302a-302b as input, the electronic device 101 performs the 2D baseline normalization process 224 to address parametric differences that can occur between the dual imaging sensors of the electronic device 101. For example, the electronic device 101 may perform the 2D baseline normalization process 224 to remove any baseline disparities that exist between the images 302a-302b, thereby aligning the images 302a-302b. After the 2D baseline normalization process 224, the electronic device 101 provides the aligned images 302a-302b as input to the disparity estimation network 226. The disparity estimation network 226 uses the aligned images 302a-302b to predict the depth map 244, such as in units of pixels. The electronic device 101 also performs the depth calculation process 228 to convert the disparity from pixel units to physical distance units (such as meters, centimeters, inches, or the like).
[0069] The electronic device 101 performs the virtual imaging sensor transformation process 230 to “virtually” make the subject in the images 302a-302b appear more distant. After the virtual imaging sensor transformation process 230, the electronic device 101 performs a warp field generation process 232 to generate a warp field, which is used to eliminate the distortion in a selected one of the images 302a-302b (such as the image 302a). Using the warp field generated in the warp field generation process 232, the electronic device 101 performs a 2D image projection process 234 on the image 302a to generate an undistorted image 260. The electronic device 101 implements the inpainting network 236 to fill in any holes 262 (such as missing pixels) in the undistorted image 260 and generate a final undistorted image 304. The final undistorted image 304 can be output, saved, displayed to a user of the electronic device 101, provided as input to another image processing technique, or used in any other suitable manner.
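The staged inference data flow described above can be sketched as a generic pipeline in which each stage is an injected callable. All stage names below are illustrative placeholders, not functions defined in the disclosure; the sketch only shows how the outputs of each stage feed the next.

```python
def undistort_face(left, right, stages):
    """Chain the inference stages end to end.

    left, right: the captured stereo image pair.
    stages: dict of callables, one per processing stage (names are illustrative).
    Returns the final undistorted image.
    """
    a_left, a_right = stages["normalize"](left, right)   # 2D baseline normalization
    disparity = stages["disparity"](a_left, a_right)     # disparity estimation network
    depth = stages["to_depth"](disparity)                # pixel units -> physical units
    warp = stages["warp_field"](depth)                   # virtual sensor move + warp field
    undistorted = stages["project"](a_left, depth, warp) # 2D image projection
    return stages["inpaint"](undistorted)                # fill holes / missing pixels
```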
[0070] Although
[0071] Note that the operations and functions shown in
[0072]
[0073] Although
[0074]
[0075] As shown in
[0076] A warp field is generated using a selected one of the multiple aligned images at step 506. This could include, for example, the electronic device 101 performing the warp field generation process 232 using a selected one of the multiple aligned images 222a-222b to generate a warp field 254. A 2D image projection is performed on the selected aligned image using the depth map and the warp field to generate an undistorted image at step 508. This could include, for example, the electronic device 101 performing the 2D image projection process 234 to generate an undistorted image 260. One or more missing pixels are filled in within the undistorted image using an inpainting neural network to generate a final undistorted image at step 510. This could include, for example, the electronic device 101 implementing the inpainting network 236 to fill in one or more missing pixels in the undistorted image 260 to generate the final undistorted image 304.
[0077] Although
[0078]
[0079] As shown in
[0080] A strong perspective projection is performed using the undistorted ground truth image and the depth information of the face portion to generate left and right distorted image pixel locations at step 606. This could include, for example, the electronic device 101 performing the weak-to-strong perspective transformation process 208 using the ground truth image 202 and the depth information of the face portion to generate left and right distorted image pixel locations (x′, y′). Left and right warp fields are generated based on the left and right distorted image pixel locations at step 608. This could include, for example, the electronic device 101 generating the left and right warp fields 254 based on the left and right distorted image pixel locations. A 2D image projection is performed on the undistorted ground truth image using the 3D face model and the left and right warp fields to generate a stereo image pair at step 610. This could include, for example, the electronic device 101 performing the 2D image projection process 216 on the ground truth image 202 using the 3D model 252 and the left and right warp fields 254 to generate the training stereo pair 218a-218b.
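The weak-to-strong perspective transformation can be viewed as the inverse of the virtual retreat used at inference: moving the virtual sensor closer by d rescales pixel offsets about the principal point by Z / (Z − d), and the stereo baseline adds a horizontal offset of opposite sign in the left and right views. The sketch below is illustrative only; the function name, parameters, and sign convention for the left/right views are assumptions, not taken from the disclosure.

```python
import numpy as np

def strong_perspective_locations(xs, ys, depth_m, focal_px, center, move_in_m, baseline_m):
    """Left and right distorted pixel locations (x', y') for a virtual sensor
    moved closer by move_in_m meters, with a stereo baseline of baseline_m.

    xs, ys: source pixel coordinates; depth_m: per-pixel depth in meters.
    focal_px: focal length in pixels; center: (cx, cy) principal point.
    """
    cx, cy = center
    # Moving in by d magnifies offsets from the principal point by Z / (Z - d).
    scale = depth_m / (depth_m - move_in_m)
    x_c = cx + (xs - cx) * scale
    y_c = cy + (ys - cy) * scale
    # Each sensor sits baseline/2 to either side, adding opposite horizontal shifts.
    disp = focal_px * (baseline_m / 2.0) / (depth_m - move_in_m)
    return (x_c + disp, y_c), (x_c - disp, y_c)   # (left view), (right view)
```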
[0081] Although
[0082] Although this disclosure has been described with reference to various example embodiments, various changes and modifications may be suggested to one skilled in the art. It is intended that this disclosure encompass such changes and modifications as fall within the scope of the appended claims.