METHOD FOR GENERATING DEPTH IN IMAGES, ELECTRONIC DEVICE, AND NON-TRANSITORY STORAGE MEDIUM
20230386063 · 2023-11-30
International classification
G06V10/80
PHYSICS
Abstract
A method and system for generating depth in monocular images acquires multiple sets of binocular images to build a dataset containing instance segmentation labels; trains an autoencoder network using the dataset with instance segmentation labels to obtain a trained autoencoder network; acquires a monocular image and inputs the monocular image into the trained autoencoder network to obtain a first disparity map; and converts the first disparity map to obtain a depth image corresponding to the monocular image. The method combines binocular images with instance segmentation images as training data for the autoencoder network, so that a monocular image can simply be input into the autoencoder network to output a disparity map. Depth estimation for monocular images is achieved by converting the disparity map into a depth image corresponding to the monocular image. An electronic device and a non-transitory storage medium are also disclosed.
Claims
1. A depth image generation method, comprising: acquiring multiple sets of binocular images to build a dataset containing instance segmentation labels based on the multiple sets of binocular images; training an autoencoder network based on the dataset containing instance segmentation labels to obtain a trained autoencoder network; acquiring a monocular image and inputting the monocular image into the trained autoencoder network to obtain a first disparity map; and converting the first disparity map to obtain a depth image corresponding to the monocular image.
2. The depth image generation method of claim 1, wherein each set of the multiple sets of binocular images comprises a first image and a second image, and training the autoencoder network based on the dataset containing instance segmentation labels to obtain the trained autoencoder network comprises: inputting the first image into the autoencoder network to obtain a second disparity map; processing the second disparity map based on the instance segmentation labels to obtain a third disparity map; adding the first image with the third disparity map to obtain a predicted image of the second image; using a preset mean square error formula to calculate an error between the second image and the predicted image; and determining the error as a training loss of the autoencoder network until the training loss converges to obtain the trained autoencoder network.
3. The depth image generation method of claim 2, wherein processing the second disparity map based on the instance segmentation labels to obtain the third disparity map comprises: generating a processed image based on the instance segmentation labels; and fusing and correcting the second disparity map based on the processed image to obtain the third disparity map.
4. The depth image generation method of claim 2, wherein the preset mean square error formula comprises: MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2, wherein MSE is the mean square error between the predicted image and the second image, y_i is a value of an i-th pixel of the second image, ŷ_i is a value of the i-th pixel of the predicted image, and n is the number of pixels of the second image.
5. The depth image generation method of claim 2, wherein the first image is a left image of the binocular images and the second image is a right image of the binocular images.
6. The depth image generation method of claim 2, wherein the first image is a right image of the binocular images and the second image is a left image of the binocular images.
7. An electronic device, comprising: at least one processor; and a data storage storing one or more programs which when executed by the at least one processor, cause the at least one processor to: acquire multiple sets of binocular images to build a dataset containing instance segmentation labels based on the multiple sets of binocular images; train an autoencoder network based on the dataset containing instance segmentation labels to obtain a trained autoencoder network; acquire a monocular image and input the monocular image into the trained autoencoder network to obtain a first disparity map; and convert the first disparity map to obtain a depth image corresponding to the monocular image.
8. The electronic device of claim 7, wherein each set of the multiple sets of binocular images comprises a first image and a second image, and training the autoencoder network based on the dataset containing instance segmentation labels to obtain the trained autoencoder network comprises: inputting the first image into the autoencoder network to obtain a second disparity map; processing the second disparity map based on the instance segmentation labels to obtain a third disparity map; adding the first image with the third disparity map to obtain a predicted image of the second image; using a preset mean square error formula to calculate an error between the second image and the predicted image; and determining the error as a training loss of the autoencoder network until the training loss converges to obtain the trained autoencoder network.
9. The electronic device of claim 8, wherein processing the second disparity map based on the instance segmentation labels to obtain the third disparity map comprises: generating a processed image based on the instance segmentation labels; and fusing and correcting the second disparity map based on the processed image to obtain the third disparity map.
10. The electronic device of claim 8, wherein the preset mean square error formula comprises: MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2, wherein MSE is the mean square error between the predicted image and the second image, y_i is a value of an i-th pixel of the second image, ŷ_i is a value of the i-th pixel of the predicted image, and n is the number of pixels of the second image.
11. The electronic device of claim 8, wherein the first image is a left image of the binocular images and the second image is a right image of the binocular images.
12. The electronic device of claim 8, wherein the first image is a right image of the binocular images and the second image is a left image of the binocular images.
13. A non-transitory storage medium having stored thereon instructions that, when executed by a processor of an electronic device, causes the electronic device to perform a depth image generation method, the depth image generation method comprising: acquiring multiple sets of binocular images to build a dataset containing instance segmentation labels based on the multiple sets of binocular images; training an autoencoder network based on the dataset containing instance segmentation labels to obtain a trained autoencoder network; acquiring a monocular image and inputting the monocular image into the trained autoencoder network to obtain a first disparity map; and converting the first disparity map to obtain a depth image corresponding to the monocular image.
14. The non-transitory storage medium of claim 13, wherein each set of the multiple sets of binocular images comprises a first image and a second image, and training the autoencoder network based on the dataset containing instance segmentation labels to obtain the trained autoencoder network comprises: inputting the first image into the autoencoder network to obtain a second disparity map; processing the second disparity map based on the instance segmentation labels to obtain a third disparity map; adding the first image with the third disparity map to obtain a predicted image of the second image; using a preset mean square error formula to calculate an error between the second image and the predicted image; and determining the error as a training loss of the autoencoder network until the training loss converges to obtain the trained autoencoder network.
15. The non-transitory storage medium of claim 14, wherein processing the second disparity map based on the instance segmentation labels to obtain the third disparity map comprises: generating a processed image based on the instance segmentation labels; and fusing and correcting the second disparity map based on the processed image to obtain the third disparity map.
16. The non-transitory storage medium of claim 14, wherein the preset mean square error formula comprises: MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2, wherein MSE is the mean square error between the predicted image and the second image, y_i is a value of an i-th pixel of the second image, ŷ_i is a value of the i-th pixel of the predicted image, and n is the number of pixels of the second image.
17. The non-transitory storage medium of claim 14, wherein the first image is a left image of the binocular images and the second image is a right image of the binocular images.
18. The non-transitory storage medium of claim 14, wherein the first image is a right image of the binocular images and the second image is a left image of the binocular images.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0004] Implementations of the present disclosure will now be described, by way of embodiments, with reference to the attached figures.
DETAILED DESCRIPTION
[0010] It will be appreciated that for simplicity and clarity of illustration, where appropriate, reference numerals have been repeated among the different figures to indicate corresponding or analogous elements. In addition, numerous specific details are set forth in order to provide a thorough understanding of the embodiments described herein. However, it will be understood by those of ordinary skill in the art that the embodiments described herein can be practiced without these specific details. In other instances, methods, procedures, and components have not been described in detail so as not to obscure the relevant features being described. Also, the description is not to be considered as limiting the scope of the embodiments described herein. The drawings are not necessarily to scale, and the proportions of certain parts may be exaggerated to better illustrate details and features of the present disclosure. It should be noted that references to “an” or “one” embodiment in this disclosure are not necessarily to the same embodiment, and such references mean “at least one”.
[0011] Several definitions that apply throughout this disclosure will now be presented.
[0012] The term “connection” can be such that the objects are permanently connected or releasably connected. The term “comprising,” when utilized, means “including, but not necessarily limited to”; it specifically indicates open-ended inclusion or membership in the so-described combination, group, series, and the like.
[0013] The depth image generation method is applied in one or more electronic devices having computing capability. The hardware may be, but is not limited to, a microprogrammed control unit (MCU), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
[0015] In block S100, multiple sets of binocular images are acquired to build a dataset containing instance segmentation labels based on the multiple sets of binocular images.
[0016] In one embodiment, binocular images can be original binocular images. The original binocular images can be images that are directly captured by an image acquisition device. For example, a camera is an image acquisition device.
[0017] In one embodiment, the multiple sets of binocular images can also be obtained from a storage device, for example, a USB flash drive (U disk) or the like. The multiple sets of binocular images can also be obtained from a website server through a network.
[0018] In one embodiment, instance segmentation distinguishes multiple instances of the same category in an image. For example, an image may comprise a number of people; in instance segmentation, each person must be distinguished, and each person can be assigned a corresponding instance segmentation label. The instance segmentation labels of the binocular images can be obtained by segmenting the binocular images, and a dataset containing instance segmentation labels can be built from the instance segmentation labels of the binocular images.
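As an illustration (a hedged sketch, not part of the disclosed method; all names and values below are hypothetical), instance segmentation labels can be represented as an integer label map the same size as the image, where 0 marks background and each positive integer identifies one instance:

```python
import numpy as np

# Hypothetical 4x6 image: a label map the same shape as the image.
# 0 = background; 1 and 2 identify two distinct "person" instances.
label_map = np.array([
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
])

# One boolean mask per instance, recovered from the label map.
instance_ids = np.unique(label_map)
instance_ids = instance_ids[instance_ids != 0]   # drop the background id
masks = {int(i): (label_map == i) for i in instance_ids}

print(len(masks))                 # number of instances found
print(int(masks[1].sum()), int(masks[2].sum()))  # pixels per instance
```

A dataset record for one stereo pair would then bundle the left image, the right image, and such a label map.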
[0019] In block S200, an autoencoder network is trained based on the dataset containing instance segmentation labels to obtain a trained autoencoder network.
[0020] In one embodiment, the autoencoder network is trained by a dataset. The dataset can be established based on the multiple sets of binocular images containing instance segmentation labels. This produces better training results; the training method of the autoencoder network is described in blocks S210 to S250 below.
[0021] In block S300, a monocular image without a depth characteristic (M-image) is acquired, and the monocular image is input into the trained autoencoder network to obtain a first disparity map.
[0022] In one embodiment, an M-image can be obtained by a monocular camera, and the M-image can be an image of any scene.
[0023] The sampling data of the autoencoder network come from binocular images, so the accuracy of depth prediction for M-images can be improved when the disparity of binocular images is used to guide that prediction.
[0024] In block S400, the first disparity map is converted to obtain a depth image corresponding to the M-image.
[0025] In one embodiment, after the M-image is obtained, the M-image is input into the trained autoencoder network, and a first disparity map corresponding to the M-image is output by the autoencoder network; a depth image cannot be output directly by the autoencoder network. Therefore, it is also necessary to convert the first disparity map output by the autoencoder network based on the baseline distance of the lens of the camera shooting the M-image and the focal length of that camera. Thus, the depth image corresponding to the M-image is obtained.
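The conversion in block S400 follows the standard stereo relation depth = focal length × baseline / disparity. A minimal sketch, with illustrative focal length and baseline values that are not taken from the disclosure:

```python
import numpy as np

def disparity_to_depth(disparity, focal_length_px, baseline_m, eps=1e-6):
    """Convert a disparity map (in pixels) to a depth map (in meters).

    depth = f * B / d. The eps floor guards against division by zero
    where disparity is (near) zero, i.e. points at effectively
    infinite depth.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    return focal_length_px * baseline_m / np.maximum(disparity, eps)

# Illustrative values: 700 px focal length, 0.12 m baseline.
d = np.array([[70.0, 35.0], [14.0, 7.0]])
depth = disparity_to_depth(d, focal_length_px=700.0, baseline_m=0.12)
print(depth)  # 700 * 0.12 / 70 = 1.2 m, and so on per pixel
```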
[0026] In one embodiment, each set of the multiple sets of binocular images comprises a first image and a second image, and the training of the autoencoder network in block S200 comprises the following blocks S210 to S250.
[0027] In block S210, the first image is input into the autoencoder network to obtain a second disparity map.
[0028] In block S220, the second disparity map is processed based on the instance segmentation label to obtain a third disparity map.
[0029] In one embodiment, a processed image can be generated based on the instance segmentation labels. Using the processed image as guidance, the second disparity map is fused and corrected to obtain a finer third disparity map.
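The disclosure does not specify the exact fusion and correction rule. As one hedged sketch (a hypothetical rule, not the disclosed mechanism), the disparity inside each instance mask can be blended toward that instance's median, on the assumption that pixels of a single object lie at similar depth:

```python
import numpy as np

def refine_disparity(disparity, label_map, blend=0.5):
    """Blend each instance's disparity toward its per-instance median.

    Hypothetical correction rule: within each instance mask, outlier
    disparities are pulled toward the instance median, yielding a
    piecewise-smoother map (a stand-in for the "third disparity map").
    Background (label 0) is left untouched.
    """
    refined = np.asarray(disparity, dtype=np.float64).copy()
    for inst in np.unique(label_map):
        if inst == 0:                      # 0 = background
            continue
        mask = label_map == inst
        med = np.median(refined[mask])
        refined[mask] = (1.0 - blend) * refined[mask] + blend * med
    return refined

disp = np.array([[10.0, 10.0, 30.0], [10.0, 10.0, 10.0]])
labels = np.array([[1, 1, 1], [1, 1, 1]])
print(refine_disparity(disp, labels))  # the 30.0 outlier is pulled toward 10.0
```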
[0030] In block S230, the first image is added with the third disparity map to obtain a predicted image of the second image.
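One hedged reading of block S230 (the disclosure does not detail the operation) is disparity-based warping: each pixel of the first image is shifted horizontally by its disparity to synthesize the second view. A minimal sketch under that assumption:

```python
import numpy as np

def warp_with_disparity(first_image, disparity):
    """Predict the second view by shifting each pixel of the first image
    horizontally by its (rounded) disparity.

    Assumed convention: column x in the first view maps to column
    x - d(y, x) in the predicted second view. Pixels with no source
    (occlusions / out-of-range shifts) keep value 0.
    """
    h, w = first_image.shape
    predicted = np.zeros_like(first_image)
    for y in range(h):
        for x in range(w):
            tx = x - int(round(disparity[y, x]))
            if 0 <= tx < w:
                predicted[y, tx] = first_image[y, x]
    return predicted

left = np.array([[0.0, 1.0, 2.0, 3.0]])
disp = np.ones_like(left)              # uniform shift of one pixel
print(warp_with_disparity(left, disp))
```

With a uniform disparity of 1, every pixel moves one column left and the rightmost column is left unfilled.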
[0031] In block S240, the error between the second image and the predicted image is calculated by using the preset mean square error formula.
[0032] In one embodiment, the preset mean square error formula is: MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2,
wherein MSE is the mean square error between the predicted image and the second image, y_i is a value of an i-th pixel of the second image, ŷ_i is a value of the i-th pixel of the predicted image, and n is the number of pixels of the second image.
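The loss of block S240 can be computed directly from the two images; a minimal NumPy sketch of a mean square error between the second image and the predicted image:

```python
import numpy as np

def mse_loss(second_image, predicted_image):
    """Mean square error between the second (ground-truth) image and
    the predicted image: MSE = (1/n) * sum_i (y_i - yhat_i)^2."""
    y = np.asarray(second_image, dtype=np.float64).ravel()
    yhat = np.asarray(predicted_image, dtype=np.float64).ravel()
    return float(np.mean((y - yhat) ** 2))

print(mse_loss([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # (0 + 0 + 4) / 3
```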
[0033] In block S250, the error is determined as a training loss of the autoencoder network, and training continues until the training loss converges, to obtain the trained autoencoder network.
[0034] In one embodiment, the first image can be a left image of the binocular images, and the second image can be a right image of the binocular images. In other embodiments, the first image can be a right image of the binocular images, and the second image can be a left image of the binocular images. In other words, the disparity map can be the disparity map corresponding to the left view based on the left view, or the disparity map corresponding to the right view based on the right view. This is not limited in any embodiment.
[0035] The training sample data of the autoencoder network in this application come from the binocular images containing the instance segmentation labels; that is to say, this application uses binocular parallax to guide the prediction of M-image depth. Therefore, the depth image generation method does not require a large amount of data and labeling, and a better training effect is achieved.
[0036] A depth image generation system 20 is also provided.
[0037] Specifically, in one embodiment, the depth image generation system 20 can be applied to electronic devices. The depth image generation system 20 can comprise an image acquisition module 21, a model training module 22, an image inference module 23, and an image conversion module 24. The image acquisition module 21 acquires multiple sets of binocular images to build a dataset containing instance segmentation labels based on the multiple sets of binocular images. The model training module 22 trains an autoencoder network by using the dataset with instance segmentation labels to obtain a trained autoencoder network. The image inference module 23 acquires an M-image and inputs the M-image into the trained autoencoder network to obtain a first disparity map. The image conversion module 24 converts the first disparity map to obtain the depth image corresponding to the M-image.
[0038] In one embodiment, each set of the multiple sets of binocular images comprises the first image and the second image.
[0039] In one embodiment, an electronic device 100 comprises a data storage 101 and a processor 102.
[0040] In one embodiment, the data storage 101 can be in the electronic device 100, or can be a separate external memory card, such as an SM card (Smart Media Card), an SD card (Secure Digital Card), or the like. The data storage 101 can include various types of non-transitory computer-readable storage mediums. For example, the data storage 101 can be an internal storage system, such as a flash memory, a random access memory (RAM) for temporary storage of information, and/or a read-only memory (ROM) for permanent storage of information. The data storage 101 can also be an external storage system, such as a hard disk, a storage card, or a data storage medium. The processor 102 can be a central processing unit (CPU), a microprocessor, or other data processor chip that performs functions of the electronic device 100.
[0041] In one embodiment, a non-transitory storage medium having stored thereon instructions is also disclosed. When the instructions are executed by the processor 102 of the electronic device 100, the electronic device 100 can perform the depth image generation method.
[0042] The embodiments shown and described above are only examples. Many details known in the field are neither shown nor described. Even though numerous characteristics and advantages of the present technology have been set forth in the foregoing description, together with details of the structure and function of the present disclosure, the disclosure is illustrative only, and changes may be made in the detail, including in matters of shape, size, and arrangement of the parts within the principles of the present disclosure, up to and including the full extent established by the broad general meaning of the terms used in the claims. It will therefore be appreciated that the embodiments described above may be modified within the scope of the claims.