METHOD FOR IMAGE SUPER RESOLUTION IMITATING OPTICAL ZOOM IMPLEMENTED ON A RESOURCE-CONSTRAINED MOBILE DEVICE, AND A MOBILE DEVICE IMPLEMENTING THE SAME
20200211159 · 2020-07-02
Inventors
- Artem Sergeevich Migukin (Moscow, RU)
- Anna Andreevna Varfolomeeva (Moscow, RU)
- Alexey Stanislavovich Chernyavskiy (Moscow, RU)
- Vitaly Vladimirovich Chernov (Moscow, RU)
CPC classification
G06T3/04
PHYSICS
G06T3/4053
PHYSICS
International classification
G06T3/40
PHYSICS
Abstract
A method for super resolution of an image imitating optical zoom implemented on a resource-constrained mobile device includes inputting a first-resolution image. The method also includes assigning a super resolution factor from a predefined set of super resolution factors that indicate increasing the first resolution to obtain a second-resolution image. The super resolution factor indicates a trained intelligent system imitating the operation of the corresponding optical system among a plurality of intelligent systems trained for each super resolution factor of the predefined set of super resolution factors. The method further includes transforming the first-resolution image into the second-resolution image using said trained intelligent system.
Claims
1. A method for image super resolution imitating optical zoom implemented on a resource-constrained mobile device, the method comprising: obtaining a first-resolution image; assigning a super resolution factor from a predefined set of super resolution factors that indicate increasing a first resolution to obtain a second-resolution image, wherein the super resolution factor indicates a trained intelligent system imitating an operation of a corresponding optical system among a plurality of intelligent systems trained for each super resolution factor of the predefined set of super resolution factors; and transforming the first-resolution image into the second-resolution image using the trained intelligent system.
2. The method of claim 1, wherein a convolutional neural network (CNN) is used as an intelligent system and the CNN is trained on pairs of image patches of a first-resolution and a second-resolution for training the intelligent system, and wherein the first-resolution is lower than the second-resolution.
3. The method of claim 1, wherein obtaining the first-resolution image, further comprises one of: inputting the first-resolution image captured by a camera of the mobile device, loading the first-resolution image from a memory of the mobile device or an external data source, or selecting an arbitrary image area as the first-resolution image, wherein the first-resolution image is a still image or a video data frame.
4. The method of claim 2, wherein the CNN training comprises: capturing a plurality of image pairs consisting of a first-resolution image and a second-resolution image, images of each of the image pairs depicting a same scene, wherein the first-resolution image is captured by a camera with 1 zoom factor, and the second-resolution image is captured by a camera having optical zoom with the zoom factor being higher than 1, the zoom factor for the second-resolution image is selected from a predefined set of zoom factors corresponding to the predefined set of super resolution factors; matching the image patches for each image pair using at least one keypoint found by at least one detector as a point in an image at which there is a local maximum of information measure built based on a human visual system (HVS) model and computed by combining gradient values of brightness, colors or other quantitative characteristics of the image; cropping, for the at least one keypoint, a first-resolution image patch and a second-resolution image patch to form a pair of patches comprising the at least one keypoint and a neighborhood of the at least one keypoint; matching, for each pair of patches, a brightness histogram of the first-resolution image patch and a brightness histogram of the second-resolution image patch; and training the CNN by using the pair of patches to obtain numerical parameters of the trained CNN, including sets of weights, wherein each pair of patches comprises a first-resolution patch and a corresponding second-resolution patch with the matched brightness histograms.
5. The method of claim 4, wherein the camera with 1 zoom factor and the camera having optical zoom with the zoom factor being higher than 1 are timed and centered along an optical axis via an optical divider.
6. The method of claim 2, wherein transforming the first-resolution image into the second-resolution image further comprises: applying two-dimensional filtering in each convolutional CNN layer to each individual feature map extracted from the first-resolution image, and applying one or more weighted summation operations with weight sets obtained as a result of CNN training to a set of feature maps, wherein a number of feature maps processed in each CNN convolutional layer, except for a first one, is equal to a number of filters applied in a previous CNN convolutional layer.
7. The method of claim 4, wherein the at least one detector used in matching the image patches is selected from a group consisting of at least ORB, BRISK, SURF, FAST, Harris, MSER, MinEigen, HOG, LBPF or the like.
8. The method of claim 1, wherein transforming the first-resolution image into the second-resolution image further comprises: performing weighted summation of an image obtained by interpolating the first-resolution image and a residual image generated by the trained intelligent system, wherein the interpolation is one of bilinear interpolation, bicubic interpolation or Lanczos filter interpolation.
9. The method of claim 4, wherein a number of layers and a number of CNN parameters are determined by using a greedy algorithm to minimize used layers and CNN parameters while maximizing a super resolution by numerical characteristics including standard deviation, structural similarity and the like.
10. The method of claim 1, wherein the trained intelligent system is one of: an auto-encoder comprising a coupled encoder and decoder, the auto-encoder configured to generalize and correlate input data so that the decoder transforms the first-resolution image to the second-resolution image, or non-linear regression configured to approximate a function of transforming the first-resolution image into the second-resolution image as a non-linear combination of model parameters.
11. A mobile device, comprising: a processor; a camera configured to capture a first-resolution image; a memory configured to store numerical parameters of a plurality of trained intelligent systems and instructions that, when executed by the processor, cause the processor to: obtain a first-resolution image; assign a super resolution factor from a predefined set of super resolution factors that indicate increasing a first resolution to obtain a second-resolution image, wherein the super resolution factor indicates a trained intelligent system imitating an operation of a corresponding optical system among a plurality of intelligent systems trained for each super resolution factor of the predefined set of super resolution factors; and transform the first-resolution image into the second-resolution image using the trained intelligent system.
12. The mobile device of claim 11, wherein a convolutional neural network (CNN) is used as an intelligent system and the CNN is trained on pairs of image patches of a first-resolution and a second-resolution for training the intelligent system, and wherein the first-resolution is lower than the second-resolution.
13. The mobile device of claim 11, wherein the instructions, when executed, cause the processor to obtain the first-resolution image by: inputting the first-resolution image captured by a camera of the mobile device, loading the first-resolution image from a memory of the mobile device or an external data source, or selecting an arbitrary image area as the first-resolution image, wherein the first-resolution image is a still image or a video data frame.
14. The mobile device of claim 12, wherein to train the CNN the instructions further cause the processor to: capture a plurality of image pairs consisting of a first-resolution image and a second-resolution image, images of each of the image pairs depicting a same scene, wherein the first-resolution image is captured by a camera with 1 zoom factor, and the second-resolution image is captured by a camera having optical zoom with the zoom factor being higher than 1, the zoom factor for the second-resolution image is selected from a predefined set of zoom factors corresponding to the predefined set of super resolution factors; match the image patches for each image pair using at least one keypoint found by at least one detector as a point in an image at which there is a local maximum of information measure built based on a human visual system (HVS) model and computed by combining gradient values of brightness, colors or other quantitative characteristics of the image; crop, for the at least one keypoint, a first-resolution image patch and a second-resolution image patch to form a pair of patches comprising the at least one keypoint and a neighborhood of the at least one keypoint; match, for each pair of patches, a brightness histogram of the first-resolution image patch and a brightness histogram of the second-resolution image patch; and train the CNN by using the pair of patches to obtain numerical parameters of the trained CNN, including sets of weights, wherein each pair of patches comprises a first-resolution patch and a corresponding second-resolution patch with the matched brightness histograms.
15. The mobile device of claim 14, wherein the camera with 1 zoom factor and the camera having optical zoom with the zoom factor being higher than 1 are timed and centered along an optical axis via an optical divider.
16. The mobile device of claim 12, wherein to transform the first-resolution image into the second-resolution image, the instructions further cause the processor to: apply two-dimensional filtering in each convolutional CNN layer to each individual feature map extracted from the first-resolution image, and apply one or more weighted summation operations with weight sets obtained as a result of CNN training to a set of feature maps, wherein a number of feature maps processed in each CNN convolutional layer, except for a first one, is equal to a number of filters applied in a previous CNN convolutional layer.
17. The mobile device of claim 14, wherein the at least one detector used to match the image patches is selected from a group consisting of at least ORB, BRISK, SURF, FAST, Harris, MSER, MinEigen, HOG, LBPF or the like.
18. The mobile device of claim 11, wherein to transform the first-resolution image into the second-resolution image, the instructions further cause the processor to: perform weighted summation of an image obtained by interpolating the first-resolution image and a residual image generated by the trained intelligent system, wherein the interpolation is one of bilinear interpolation, bicubic interpolation or Lanczos filter interpolation.
19. The mobile device of claim 14, wherein the instructions cause the processor to determine a number of layers and a number of CNN parameters by using a greedy algorithm to minimize used layers and CNN parameters while maximizing a super resolution by numerical characteristics including standard deviation, structural similarity and the like.
20. The mobile device of claim 11, wherein the trained intelligent system is one of: an auto-encoder comprising a coupled encoder and decoder, the auto-encoder configured to generalize and correlate input data so that the decoder transforms the first-resolution image to the second-resolution image, or non-linear regression configured to approximate a function of transforming the first-resolution image into the second-resolution image as a non-linear combination of model parameters.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0019] Other advantages of the present invention will become apparent to those of ordinary skill in the art upon review of the following detailed description of various embodiments thereof, as well as the drawings, in which:
DETAILED DESCRIPTION
[0027] Preferred embodiments of the present invention will now be described in more detail with reference to the drawings, in which identical elements in different figures, where possible, are identified by the same reference numerals. These embodiments serve to explain the present invention, rather than to limit it. After reading this detailed description and drawings, various modifications and variations will be apparent to those skilled in the art.
[0028] Image super resolution usually means upscaling an image (in pixels) to a higher resolution (HR) with greater zooming and higher visual perception quality. This is a fundamental difference from simple scaling of an image, in which upsampling inevitably leads to deterioration of image resolution and zooming. Image zooming here refers to the number of pixels per unit area of an image (ppi, pixels per inch) or the number of dots per unit area of an image (dpi, dots per inch).
[0030] In step 110, a first-resolution image can be input in various ways, for example, by capturing the first-resolution image with a camera of the mobile device, loading the first-resolution image from a memory of the mobile device or an external data source, selecting an arbitrary image area as a first-resolution image, or any other way. The first-resolution image may be, but is not limited to, for example, a still image, a video data frame, or a sequence of video data frames. In addition, the first resolution may be lower than the second resolution.
[0031] Then, according to the method, in step 120, the super resolution factor can be assigned from a predefined set of super resolution factors. The super resolution factor indicates increasing the first resolution to obtain a second-resolution image. In addition, the super resolution factor indicates a trained intelligent system imitating the operation of the corresponding optical system among a plurality of intelligent systems trained for each super resolution factor of the predefined set of super resolution factors, which should be used when transforming the first-resolution image to the second-resolution image. According to an embodiment of the present disclosure, a trained convolutional neural network (CNN) is used as a trained intelligent system. However, instead of a CNN, the present disclosure can use any other intelligent system, such as, for example, an auto-encoder consisting of a coupled encoder and decoder, configured to generalize and correlate input data so that the decoder transforms the first-resolution image to the second-resolution image, or non-linear regression capable of approximating a function of transforming the first-resolution image into the second-resolution image as a non-linear combination of the model parameters.
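As an illustrative sketch only (the disclosure does not specify an implementation), the per-factor selection in step 120 can be modeled as a lookup from the predefined set of super resolution factors to one trained system each; the class name `SRRegistry` and the weight-file naming pattern are hypothetical, not taken from the disclosure.

```python
# Illustrative sketch of step 120: one trained intelligent system per
# super resolution factor. The registry class and the weight-file naming
# scheme are hypothetical, not taken from the disclosure.

class SRRegistry:
    """Maps each factor of a predefined set to its trained model's parameters."""

    def __init__(self, factors):
        # e.g. (2, 3, 4), mirroring the optical zoom factors the systems imitate
        self._models = {f: f"sr_cnn_x{f}.weights" for f in factors}

    def assign(self, factor):
        # assigning a factor also selects the trained system to apply
        if factor not in self._models:
            raise ValueError(f"factor {factor} is not in the predefined set")
        return self._models[factor]
```

Assigning a factor outside the predefined set fails, which mirrors the requirement that only factors with a correspondingly trained system may be assigned.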
[0032] Next, in step 130, the first-resolution image is transformed into the second-resolution image using the trained intelligent system indicated by the assigned super resolution factor.
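The per-layer transformation used during step 130 (two-dimensional filtering of each individual feature map, followed by weighted summation of the filtered maps with trained weights, as recited in claims 6 and 16) can be sketched in plain NumPy; the kernel size and the helper names are illustrative assumptions, not the disclosed architecture.

```python
import numpy as np

def conv2d_same(x, k):
    # plain 2D filtering of one feature map with a 3x3 kernel
    # ("same" output size, zero padding); a stand-in for the filtering
    # applied in one convolutional CNN layer
    p = np.pad(x, 1)
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(p[i:i + 3, j:j + 3] * k)
    return out

def layer(feature_maps, kernels, mix_weights):
    # each feature map is filtered individually, then the filtered maps
    # are combined by weighted summation with weights obtained in training
    filtered = [conv2d_same(fm, k) for fm, k in zip(feature_maps, kernels)]
    return sum(w * f for w, f in zip(mix_weights, filtered))
```

The number of maps entering each subsequent layer then equals the number of filters applied in the previous one, as the claims state.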
[0033] Step 130 may further comprise performing weighted summation or another combination of the image obtained by interpolating the first-resolution image and the residual image. The interpolation can be, for example, bilinear interpolation, bicubic interpolation, Lanczos filter interpolation or any other standard interpolation algorithm. Weighted summation or combination makes it possible to eliminate blurring and gradation of the edges of objects in the interpolated image and other defects in the image. The residual image can be generated by the CNN from the first-resolution image, then the generated residual image can be upscaled using an additional back-convolution layer to the dimensions of the interpolated image to enable correct weighted summation or another combination of these images. This approach reduces computational load, since the residual image is generated from the first-resolution image before it is enlarged.
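The combination described in paragraph [0033] can be sketched as follows; nearest-neighbor upscaling is a deliberate simplification standing in for the bilinear, bicubic or Lanczos interpolation named above, and the scalar weight `w` is an assumed simplification of the weighted summation.

```python
import numpy as np

def upscale_nearest(img, s):
    # stand-in for the interpolation step (bilinear/bicubic/Lanczos in the text)
    return np.repeat(np.repeat(img, s, axis=0), s, axis=1)

def combine(interpolated, residual, w=1.0):
    # weighted summation of the interpolated image and the residual image
    # generated by the trained intelligent system
    return np.clip(interpolated + w * residual, 0.0, 1.0)

lr = np.full((2, 2), 0.5)      # toy first-resolution image in [0, 1]
residual = np.zeros((4, 4))    # toy residual, already upscaled to match
sr = combine(upscale_nearest(lr, 2), residual)
```

As the paragraph notes, the residual can be produced at the first resolution and upscaled by a back-convolution layer before this summation; the toy residual above is assumed to be already at the target size.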
[0035] For each image pair, matching, in step 220, of the image patches is performed by using at least one keypoint found by at least one detector as a point in the image at which there is a local maximum of the information measure built based on a human visual system (HVS) model and computed by combining the gradient values of brightness, colors and/or other quantitative characteristics of the images. The at least one detector used at this step may be selected from the group consisting of at least ORB, BRISK, SURF, FAST, Harris, MSER, MinEigen, HOG, LBPF or the like. The present disclosure is not limited to the listed detectors, and includes any number of detectors as commonly used by one of ordinary skill in the art. The human visual system (HVS) model is understood to include other approaches as commonly understood by one of ordinary skill in the art, including those described in documents by V. Yanulevskaya, et al., Cognit Comput. 3, 94-104 (2011) and H. Z. Momtaz and M. R. Dalir, Cogn Neurodyn. 10, 31-47 (2016). Quantitative characteristics of images can be understood as gradients of brightness, colors, saturation, numerical values obtained when said data of a specific image is subjected to multiplication, addition, exponentiation, convolution with a given set of digital filters, as well as statistics computed for each image pixel, such as mathematical expectation, variance, kurtosis coefficient, algebraic and geometric moments and so on.
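A much-simplified stand-in for the keypoint search of step 220 can be sketched as below: the information measure here is only the combined brightness gradient magnitude (the HVS-based measure in the text also combines color and other quantitative characteristics), and keypoints are its local maxima over a 3x3 neighborhood; the function names and the threshold are illustrative assumptions.

```python
import numpy as np

def information_measure(img):
    # simplified measure: combined brightness gradient magnitude only
    # (the disclosed HVS-based measure also combines color and other
    # quantitative characteristics of the image)
    gy, gx = np.gradient(img.astype(float))
    return np.hypot(gx, gy)

def keypoints(img, thresh=0.1):
    # keypoints = local maxima of the measure over a 3x3 neighborhood
    m = information_measure(img)
    pts = []
    for i in range(1, m.shape[0] - 1):
        for j in range(1, m.shape[1] - 1):
            if m[i, j] >= thresh and m[i, j] == m[i-1:i+2, j-1:j+2].max():
                pts.append((i, j))
    return pts
```

In practice one of the listed detectors (ORB, BRISK, SURF, etc.) would be used instead of this toy search; the sketch only illustrates the "local maximum of an information measure" idea.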
[0036] Next, in step 230, for each keypoint, a first-resolution image patch and a second-resolution image patch are cropped, forming a pair of patches comprising said keypoint and the neighborhood of said keypoint. A patch is a neighborhood, which can have any shape (for example, round, square, etc.), of the point at which the local maximum of the information measure was found; the patch thus includes the keypoint itself and the surroundings around it.
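A minimal sketch of the cropping in step 230, assuming square patches, an integer zoom factor, and that the second-resolution keypoint location is obtained by scaling the first-resolution one; the helper names and that scaling rule are illustrative assumptions, not the disclosed procedure.

```python
import numpy as np

def crop_patch(img, center, radius):
    # square neighborhood of a keypoint, clipped at the image border
    i, j = center
    return img[max(i - radius, 0):i + radius + 1,
               max(j - radius, 0):j + radius + 1]

def crop_pair(lr_img, hr_img, keypoint, factor, radius):
    # the second-resolution patch covers the same scene area, so its
    # center and radius scale with the zoom/super resolution factor
    hr_center = (keypoint[0] * factor, keypoint[1] * factor)
    return (crop_patch(lr_img, keypoint, radius),
            crop_patch(hr_img, hr_center, radius * factor))
```

The resulting pair of patches, after the histogram matching of step 240, is what feeds the CNN training of step 250.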
[0037] Then, in step 240, for each said pair of patches, the brightness histogram of the first-resolution image patch and the brightness histogram of the second-resolution image patch are matched with each other.
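Step 240 does not fix a particular matching algorithm; the sketch below uses a standard CDF-based histogram-matching scheme as one plausible realization, mapping each brightness value in the source patch to the reference value of equal rank.

```python
import numpy as np

def match_histogram(source, reference):
    # map each source brightness value to the reference value of equal
    # cumulative rank, so the brightness histograms coincide afterwards
    s_vals, s_idx, s_cnt = np.unique(source.ravel(),
                                     return_inverse=True, return_counts=True)
    r_vals, r_cnt = np.unique(reference.ravel(), return_counts=True)
    s_cdf = np.cumsum(s_cnt) / source.size
    r_cdf = np.cumsum(r_cnt) / reference.size
    return np.interp(s_cdf, r_cdf, r_vals)[s_idx].reshape(source.shape)
```

Matching the histograms removes exposure differences between the two cameras, so that the CNN learns the resolution transform rather than a brightness offset.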
[0038] In step 250, the CNN is trained using the obtained plurality of patch pairs to obtain the numerical parameters of the trained CNN, including sets of weights, wherein each patch pair consists of a first-resolution patch and a corresponding second-resolution patch with the matched histograms. The above learning steps (210, 220, 230, 240, 250) are performed for each zoom factor of said set of zoom factors to obtain said plurality of the trained intelligent systems, each having its own corresponding zoom factor. The set of zoom factors one-to-one corresponds to the set of super resolution factors, i.e. one particular super resolution factor corresponds to a particular zoom factor. Training of auto-encoder and non-linear regression can be carried out in a similar way on pairs of image patches of different resolution.
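The one-to-one correspondence between zoom factors and trained systems described in paragraph [0038] can be sketched as below; `train_cnn` is a hypothetical placeholder for the actual CNN optimization on the patch pairs, not the disclosed training procedure.

```python
def train_cnn(patch_pairs):
    # hypothetical placeholder: real training would fit CNN weights on
    # (first-resolution patch, second-resolution patch) pairs
    return {"num_pairs": len(patch_pairs)}

def train_per_factor(pairs_by_factor):
    # steps 210-250 repeated per zoom factor of the predefined set:
    # each factor gets its own trained system (its own parameter set)
    return {factor: train_cnn(pairs)
            for factor, pairs in pairs_by_factor.items()}
```

At inference time, assigning a super resolution factor then selects exactly one of these per-factor parameter sets.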
[0039] The camera with 1 zoom factor and the camera having optical zoom with the zoom factor being higher than 1 used according to embodiments of the present disclosure to obtain training images, are timed and centered along the optical axis via an optical divider. The configuration of such a training image capturing system can be any suitable configuration in which timing and centering along the optical axis of the captured images are achieved.
[0040] The CNN training may additionally comprise optimization of the number of layers and the number of CNN parameters, where the number of layers and the number of CNN parameters are determined using a greedy algorithm known in the art to minimize the used layers and CNN parameters while maximizing the super resolution by numerical characteristics including, but not limited to, for example, standard deviation and structural similarity. This optimization is performed iteratively: first, the maximum permissible limits of the future CNN are selected; then the CNN is trained to achieve the maximum desired level of super resolution, and this resolution is evaluated either by subjective estimates or by objective quantitative quality criteria such as, but not limited to, for example, peak signal-to-noise ratio, structure/feature similarity, or Wasserstein distance. Then, at the next iteration, the number of layers and the number of parameters are reduced simultaneously with CNN retraining, and the super resolution quality of the retrained CNN is re-evaluated. The number of such iterations is not limited; they are performed until the problem of minimizing the layers and CNN parameters while maximizing the super resolution in terms of numerical characteristics is resolved.
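The iterative shrink-retrain-re-evaluate loop of paragraph [0040] can be sketched as a greedy search over the layer count; `quality_of` is an assumed stand-in that folds together retraining and quality evaluation (PSNR, structural similarity, etc.), and is not the disclosed procedure.

```python
def greedy_minimize_layers(max_layers, quality_of, target):
    # start from the maximum permissible network, then greedily remove
    # layers (retraining is folded into quality_of) while the evaluated
    # super resolution quality still meets the target
    best = max_layers
    for layers in range(max_layers - 1, 0, -1):
        if quality_of(layers) >= target:
            best = layers      # smaller network still meets the target
        else:
            break              # quality dropped below target: stop shrinking
    return best
```

A real search would shrink layer count and parameter count jointly; the single dimension here keeps the greedy stopping rule visible.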
[0042] In addition, the disclosed method may be implemented by a processor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a system on a chip (SoC). In addition, the disclosed method can be implemented by a computer-readable medium that stores the numerical parameters of a plurality of trained intelligent systems and computer-executable instructions that, when executed by a computer processor, cause the computer to carry out the disclosed method for image super resolution. The trained intelligent system and the instructions for implementing the method can be downloaded into a mobile device via a network or from a medium.
[0043] Reference numerals used in this disclosure should not be interpreted as unambiguously determining the sequence of steps, since after reading the above disclosure, other modified sequences of the above steps will become apparent to those skilled in the art. The reference numerals are used in this description and in the following claims only as consistent indicators of the corresponding elements of the application, which facilitates comprehension and ensures unity of terminology.
[0044] Although the present disclosure has been described with various embodiments, various changes and modifications may be suggested to one skilled in the art. It is intended that the present disclosure encompass such changes and modifications as fall within the scope of the appended claims.