METHOD AND SYSTEM FOR PROCESSING IMAGE BASED ON WEIGHTED MULTIPLE KERNELS
20230073175 · 2023-03-09
Inventors
CPC classification
G06V10/7715
PHYSICS
G06T3/4053
PHYSICS
International classification
G06T3/40
PHYSICS
Abstract
Systems and methods for processing a plurality of images include obtaining input data including the plurality of images; providing the input data to a first machine learning model; providing an output of the first machine learning model to a second machine learning model and a third machine learning model; generating a first feature map corresponding to a plurality of kernels based on an output of the second machine learning model; generating a second feature map corresponding to a plurality of weights based on an output of the third machine learning model; generating a predicted kernel based on a weighted sum of the plurality of kernels; and generating output data based on the input data and the predicted kernel.
Claims
1. A method of processing a plurality of images, the method comprising: obtaining input data including the plurality of images; providing the input data to a first machine learning model; providing an output of the first machine learning model to a second machine learning model and a third machine learning model; generating a first feature map corresponding to a plurality of kernels based on an output of the second machine learning model; generating a second feature map corresponding to a plurality of weights based on an output of the third machine learning model; generating a predicted kernel based on a weighted sum of the plurality of kernels; and generating output data based on the input data and the predicted kernel.
2. The method of claim 1, wherein the generating of the predicted kernel comprises: extracting the plurality of kernels from the first feature map, wherein the plurality of kernels has a plurality of different kernel sizes; extracting the plurality of weights respectively corresponding to the plurality of kernels from the second feature map; and calculating the weighted sum of the plurality of kernels based on the plurality of weights.
3. The method of claim 2, wherein the calculating of the weighted sum comprises: adding zero to a product of a weight and a second kernel that is different from a first kernel among the plurality of kernels, wherein the first kernel has a largest size among the plurality of different kernel sizes.
4. The method of claim 2, wherein the generating of the first feature map comprises: reshaping the output of the second machine learning model based on a number of the plurality of images, a number of channels in each of the plurality of images, and the plurality of different kernel sizes.
5. The method of claim 2, wherein the generating of the second feature map comprises: reshaping the output of the third machine learning model based on a number of the plurality of images and a number of the plurality of kernels.
6. The method of claim 5, wherein the extracting of the plurality of weights comprises: applying a softmax function to the extracted weights.
7. The method of claim 1, wherein the obtaining of the input data comprises: generating the input data by arranging the plurality of images such that pixels of the plurality of images overlap.
8. The method of claim 1, wherein the first machine learning model comprises a U-Net convolution network.
9. The method of claim 1, wherein a number of layers included in the third machine learning model is less than a number of layers included in the second machine learning model.
10. The method of claim 1, wherein the output data includes a plurality of output images, and the generating of the output data comprises generating each of the plurality of output images by performing a convolution operation on each of the plurality of images with the predicted kernel.
11. The method of claim 1, further comprising: providing the output data to a fourth machine learning model including a plurality of residual blocks; and generating a super-resolution image corresponding to the input data based on an output of the fourth machine learning model.
12. A system comprising: at least one processor; and a non-transitory storage medium storing instructions allowing the at least one processor to perform image processing when the instructions are executed by the at least one processor, wherein the image processing comprises: obtaining input data including a plurality of images; providing the input data to a first machine learning model; providing an output of the first machine learning model to a second machine learning model and a third machine learning model; generating a first feature map corresponding to a plurality of kernels based on an output of the second machine learning model; generating a second feature map corresponding to a plurality of weights based on an output of the third machine learning model; generating a predicted kernel based on a weighted sum of the plurality of kernels; and generating output data based on the input data and the predicted kernel.
13. The system of claim 12, wherein the generating of a predicted kernel comprises: extracting the plurality of kernels from the first feature map, wherein the plurality of kernels has a plurality of different kernel sizes; extracting the plurality of weights respectively corresponding to the plurality of kernels from the second feature map; and calculating a weighted sum of the plurality of kernels based on the plurality of weights.
14. The system of claim 13, wherein the generating of the first feature map comprises: reshaping the output of the second machine learning model based on a number of the plurality of images, a number of channels in each of the plurality of images, and the plurality of different kernel sizes.
15. The system of claim 13, wherein the generating of the second feature map comprises: reshaping the output of the third machine learning model based on a number of the plurality of images and a number of the plurality of kernels.
16. The system of claim 13, wherein the extracting of the plurality of weights comprises: applying a softmax function to the extracted weights.
17. The system of claim 12, wherein the image processing further comprises: providing the output data to a fourth machine learning model including a plurality of residual blocks; and generating a super-resolution image corresponding to the input data based on an output of the fourth machine learning model.
18. A non-transitory computer-readable storage medium comprising instructions, wherein the instructions allow at least one processor to perform image processing when the instructions are executed by the at least one processor, and wherein the image processing comprises: obtaining input data including a plurality of images; providing the input data to a first machine learning model; providing an output of the first machine learning model to a second machine learning model and a third machine learning model; generating a first feature map corresponding to a plurality of kernels based on an output of the second machine learning model; generating a second feature map corresponding to a plurality of weights based on an output of the third machine learning model; generating a predicted kernel based on a weighted sum of the plurality of kernels; and generating output data based on the input data and the predicted kernel.
19. The non-transitory computer-readable storage medium of claim 18, wherein the generating of the predicted kernel comprises: extracting the plurality of kernels from the first feature map, wherein the plurality of kernels has a plurality of different kernel sizes; extracting the plurality of weights respectively corresponding to the plurality of kernels from the second feature map; and calculating a weighted sum of the plurality of kernels based on the plurality of weights.
20. The non-transitory computer-readable storage medium of claim 18, wherein the generating of the first feature map comprises: reshaping the output of the second machine learning model based on a number of the plurality of images, a number of channels in each of the plurality of images, and the plurality of different kernel sizes, and the generating of the second feature map includes reshaping the output of the third machine learning model based on the number of the plurality of images and a number of the plurality of kernels.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0008] Embodiments of the inventive concept will be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings.
DETAILED DESCRIPTION
[0020] The present disclosure relates to image processing. Embodiments of the disclosure include systems and methods for super-resolution imaging. Super-resolution imaging may refer to generating a high-resolution image from a low-resolution image. According to embodiments of the disclosure, super-resolution imaging based on deep learning may include predicting a multi-kernel and generating a high-resolution image from a low-resolution image based on the multi-kernel.
[0021] According to an embodiment of the inventive concept, two branches are formed in a network for kernel prediction. One branch is used for multi-kernel prediction, and the other branch is used for multi-kernel weight prediction. Input images are provided to a first model, and the output of the first model is provided to both a second model (i.e., the first branch) and a third model (i.e., the second branch). In some examples, the output of the first model may be reshaped for extraction of multi-kernel, and the output of the second model may be reshaped for extraction of weights. A weighted sum of the multi-kernel may be calculated based on the weights, and the kernel may be predicted. The predicted kernel may be applied to the input images, which may result in the generation of high-quality images. In some embodiments, the output images may be provided to a fourth model for super-resolution imaging. Due to the high-quality images provided to the fourth model, the super-resolution image may also have high quality.
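The two-branch flow described above can be outlined in code. The following sketch is illustrative only: the kernel sizes, the example weights, and all helper names are assumptions, and random values stand in for the actual model outputs.

```python
import numpy as np

def softmax(w):
    """Normalize branch-two weights so they sum to one."""
    e = np.exp(w - w.max())
    return e / e.sum()

def pad_to(k, size):
    """Zero-pad a square kernel to size x size (centered)."""
    p = (size - k.shape[0]) // 2
    return np.pad(k, p)

def predict_kernel(kernels, weights):
    """Weighted sum of multi-size kernels, padded to the largest size."""
    largest = max(k.shape[0] for k in kernels)
    w = softmax(np.asarray(weights))
    return sum(wi * pad_to(ki, largest) for wi, ki in zip(w, kernels))

# Hypothetical branch outputs: kernels of sizes 1, 3, 5 and their raw weights
kernels = [np.ones((1, 1)), np.ones((3, 3)), np.ones((5, 5))]
weights = [0.1, 0.5, 0.2]
K = predict_kernel(kernels, weights)
print(K.shape)  # (5, 5)
```

The predicted kernel K could then be convolved with each input image, as in the reconstruction described below.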
[0023] In some embodiments, the image processing 10 of
[0024] Referring to
[0025] Herein, a machine learning model may have any structure that is trainable. For example, the machine learning model may include an artificial neural network, a decision tree, a support vector machine, a Bayesian network, and/or a genetic algorithm, and the like.
[0026] An artificial neural network (ANN) is a hardware or a software component that includes a number of connected nodes (i.e., artificial neurons), which loosely correspond to the neurons in a human brain. Each connection, or edge, transmits a signal from one node to another (like the physical synapses in a brain). When a node receives a signal, it processes the signal and then transmits the processed signal to other connected nodes. In some cases, the signals between nodes comprise real numbers, and the output of each node is computed by a function of the sum of its inputs. In some examples, nodes may determine their output using other mathematical algorithms (e.g., selecting the max from the inputs as the output) or any other suitable algorithm for activating the node. Each node and edge is associated with one or more node weights that determine how the signal is processed and transmitted.
[0027] During the training process, these weights are adjusted to improve the accuracy of the result (i.e., by minimizing a loss function which corresponds in some way to the difference between the current result and the target result). The weight of an edge increases or decreases the strength of the signal transmitted between nodes. In some cases, nodes have a threshold below which a signal is not transmitted at all. In some examples, the nodes are aggregated into layers. Different layers perform different transformations on their inputs. The initial layer is known as the input layer and the last layer is known as the output layer. In some cases, signals traverse certain layers multiple times.
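The weight adjustment described above can be sketched for a single artificial neuron; the input values, learning rate, and activation here are illustrative assumptions, not part of the claimed models.

```python
import numpy as np

# One artificial neuron: output = activation(w . x + b).
# Gradient descent adjusts the edge weights to minimize a squared-error loss.
x = np.array([1.0, 2.0])
w = np.array([0.5, -0.5])
b = 0.0
target = 1.0
lr = 0.1

for _ in range(50):
    y = np.tanh(w @ x + b)          # node output
    grad_y = 2 * (y - target)       # dL/dy for L = (y - target)^2
    grad_pre = grad_y * (1 - y**2)  # chain rule through tanh
    w -= lr * grad_pre * x          # adjust edge weights
    b -= lr * grad_pre

print(round(float(np.tanh(w @ x + b)), 2))
```

After training, the neuron's output has moved from its initial value toward the target, illustrating loss-driven weight updates.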
[0028] Hereinafter, the machine learning model is described with reference to an artificial neural network, but example embodiments are not limited thereto. The artificial neural network may include, as a non-limiting example, a convolution neural network (CNN), a region with convolution neural network (R-CNN), a region proposal network (RPN), a recurrent neural network (RNN), a stacking-based deep neural network (S-DNN), a state-space dynamic neural network (S-SDNN), a deconvolution network, a deep belief network (DBN), a restricted Boltzmann machine (RBM), a fully convolutional network, a long short-term memory (LSTM) network, a classification network, and the like. Herein, the machine learning model may be simply referred to as a model.
[0029] For example, the first model may include a CNN. A CNN is a class of neural network that is commonly used in computer vision or image classification systems. In some cases, a CNN may enable processing of digital images with minimal pre-processing. A CNN may be characterized by the use of convolutional (or cross-correlational) hidden layers. These layers apply a convolution operation to the input before signaling the result to the next layer. Each convolutional node may process data for a limited field of input (i.e., the receptive field). During a forward pass of the CNN, filters at each layer may be convolved across the input volume, computing the dot product between the filter and the input. During the training process, the filters may be modified so that they activate when they detect a particular feature within the input. In some cases, the filter may be referred to as a kernel.
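The convolution described above (a filter sliding over the input, computing a dot product at each receptive field) can be demonstrated directly; the image contents and filter are illustrative assumptions.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Cross-correlation of a 2-D image with a kernel (no padding, stride 1)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for y in range(oh):
        for x in range(ow):
            # dot product between the filter and its receptive field
            out[y, x] = np.sum(image[y:y+kh, x:x+kw] * kernel)
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
edge = np.array([[1.0, -1.0]])  # horizontal difference filter
out = conv2d_valid(img, edge)
print(out.shape)  # (4, 3)
```

On this ramp image, every output entry is the difference of adjacent columns, showing how a filter activates on a particular feature (here, horizontal gradient).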
[0030] The second model 12 receives the first output OUT1 from the first model 11 and may generate a second output OUT2. The second model 12 may be trained to predict a kernel K that will be applied to the input data DIN in the reconstruction 17 to be described later. For example, the second model 12 may be trained to output a plurality of kernels of different sizes, as described below with reference to
[0031] The third model 13 receives the first output OUT1 from the first model 11 and may generate a third output OUT3. The third model 13 may be trained to predict weights for the kernel K that will be applied to the input data DIN in the reconstruction 17 to be described later. For example, the third model 13 may be trained to output a plurality of weights respectively corresponding to the plurality of kernels with different sizes, as described below with reference to
[0032] When the kernel is predicted based only on the second output OUT2 output by the second model 12, multiple kernels may be equally applied regardless of the position of the image. For example, an image may include a region having a high frequency and a region having a low frequency, and equally applying multiple kernels to the regions may limit the quality of a final image. On the other hand, the image processing 10 of
[0033] The first post-processing 14 generates a first feature map FM1 using the second output OUT2. For example, the first post-processing 14 may generate the first feature map FM1 by reshaping the second output OUT2 so that a plurality of kernels may be extracted in the kernel generation 16 to be described later. An example of the first feature map FM1 generated by the first post-processing 14 is described below with reference to
[0034] The second post-processing 15 generates a second feature map FM2 using the third output OUT3. For example, the second post-processing 15 may generate the second feature map FM2 by reshaping the third output OUT3 so that a plurality of weights may be extracted in the kernel generation 16 to be described later. An example of the second feature map FM2 generated by the second post-processing 15 is described below with reference to
[0035] The kernel generation 16 generates the kernel K from the first feature map FM1 and the second feature map FM2. As described above, the first feature map FM1 may correspond to the plurality of kernels, and the second feature map FM2 may correspond to the plurality of weights. The kernel generation 16 may identify respective importance levels of the plurality of kernels based on the plurality of weights, and may generate the kernel K from the plurality of kernels based on the identified importance levels. An example of the kernel generation 16 is described below with reference to
[0036] The reconstruction 17 generates the output data DOUT from the input data DIN and the kernel K. As described above, the input data DIN may include a plurality of images, and each of the plurality of images may be operated with the kernel K. The output data DOUT may include a plurality of images generated by applying the kernel K to the plurality of images, respectively. An image included in the output data DOUT may have a higher quality than an image included in the input data DIN. For example, the image included in the output data DOUT may correspond to an image in which noise is removed from the image included in the input data DIN, or may correspond to an image in which images included in the input data DIN are aligned.
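The per-image application of the kernel K described above can be sketched as follows; the frame count, image size, noise level, and the choice of an averaging kernel as the predicted K are all illustrative assumptions.

```python
import numpy as np

def apply_kernel(images, K):
    """Apply the predicted kernel K to every input image ('same'-size output)."""
    p = K.shape[0] // 2
    out = []
    for img in images:
        padded = np.pad(img, p)          # zero-pad so output matches input size
        h, w = img.shape
        res = np.empty_like(img)
        for y in range(h):
            for x in range(w):
                res[y, x] = np.sum(padded[y:y+K.shape[0], x:x+K.shape[1]] * K)
        out.append(res)
    return np.stack(out)

# Hypothetical input: three noisy frames; a 3x3 averaging kernel stands in for K
rng = np.random.default_rng(0)
frames = [np.ones((8, 8)) + 0.1 * rng.standard_normal((8, 8)) for _ in range(3)]
K = np.full((3, 3), 1 / 9)
out = apply_kernel(frames, K)
print(out.shape)  # (3, 8, 8)
```

With an averaging kernel, the interior of each output frame is smoother than the input, illustrating how applying K can remove noise from the images in DIN.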
[0038] Referring to
[0039] In some embodiments, the plurality of images included in the input data DIN may correspond to image frames generated by repeatedly photographing the same object. For example, in order to produce high-quality images despite limited performance, such as that of the camera module of a smartphone, the same object may be repeatedly photographed, and a high-quality image frame may be generated based on the generated image frames. In some embodiments, the images included in the input data DIN may be images divided from a large-sized source image. For example, in order to reduce the complexity of image processing, a large-sized source image may be divided into a plurality of images, and the divided images may be provided to the first model 21.
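The arrangement of burst frames into the input data DIN, with pixels at the same position overlapping, can be sketched as follows; the frame count and resolution are illustrative assumptions.

```python
import numpy as np

# Hypothetical burst: T frames of the same scene, stacked so that pixels at
# the same (x, y) position overlap along the leading (frame/channel) axis.
T, H, W = 4, 8, 8
frames = [np.full((H, W), i, dtype=float) for i in range(T)]
din = np.stack(frames, axis=0)   # input data DIN: shape (T, H, W)
print(din.shape)  # (4, 8, 8)
```

Each slice din[i] is one frame, so din[:, y, x] holds the overlapping pixel values at one image position across all frames.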
[0040] In some embodiments, the first model 21 may be based on U-Net. The U-Net may be a fully convolutional network of the end-to-end scheme. As shown in
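The contracting/expanding data flow of a U-Net-style network can be sketched without any learned parameters; the pooling, upsampling, and merge operations below are simplified stand-ins for the actual convolutional blocks.

```python
import numpy as np

def down(x):
    """Contracting step: 2x2 average pooling (stand-in for conv + pool)."""
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).mean(axis=(1, 3))

def up(x):
    """Expanding step: nearest-neighbour upsampling (stand-in for up-conv)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def unet_like(x):
    """Skeleton of the U-Net data flow: contract, expand, and merge encoder
    features with decoder features at the same resolution (skip connection)."""
    e1 = down(x)                 # encoder level 1
    e2 = down(e1)                # bottleneck
    d1 = up(e2)                  # decoder level 1
    merged = np.stack([d1, e1])  # skip connection: reuse encoder features
    return up(merged.mean(axis=0))

x = np.arange(64, dtype=float).reshape(8, 8)
y = unet_like(x)
print(y.shape)  # (8, 8)
```

The output has the same spatial size as the input, as expected of an end-to-end fully convolutional encoder-decoder.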
[0041] The second model 22 may generate a second output OUT2 from the first output OUT1. As shown in
[0043] Referring to
[0044] Referring to
[0046] As described above with reference to
[0047] In the kernel generation 16, the first feature map FM1 may be divided into T feature maps FM1.sub.1 to FM1.sub.T in the channel direction. As shown in
K.sub.i.sup.S(x, y)=B.sub.k(I.sub.i(x, y)) [Equation 1]
[0048] In Equation 1, S is the kernel group, i is the image index, B.sub.k is a model for predicting a plurality of kernels, for example, a model including the first model 11 and the second model 12 in
[0049] As described above with reference to
w.sub.i.sup.S(x, y)=B.sub.w(I.sub.i(x, y)) [Equation 2]
[0050] In Equation 2, S is the kernel group, i is the image index, B.sub.w is a model for predicting a plurality of weights, for example, a model including the first model 11 and the third model 13 of
[0051] In the kernel generation 16, a weighted sum of the plurality of kernels may be calculated based on the plurality of weights. For example, as shown in
[0052] In some embodiments, a softmax function may be applied to the weights extracted from the second feature map FM2. For example, the softmax function may be applied to the first to fourth weights w.sub.1 to w.sub.4 extracted from the feature map FM2.sub.1, and the weights to which the softmax function is applied may be multiplied by the kernels. With softmax, important weights may be further emphasized. The weight {tilde over (W)}.sub.i to which the softmax is applied may be expressed as in Equation 3 below.
{tilde over (W)}.sub.i.sup.j=exp(w.sub.i.sup.j)/Σ.sub.j exp(w.sub.i.sup.j) [Equation 3]
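The softmax normalization of the extracted weights can be shown numerically; the four raw weight values below are illustrative assumptions.

```python
import numpy as np

w = np.array([0.1, 0.5, 0.2, 0.2])   # raw weights extracted from FM2
e = np.exp(w - w.max())              # shift by max for numerical stability
w_soft = e / e.sum()                 # softmax over the kernel index j
print(round(float(w_soft.sum()), 6))  # 1.0
```

The normalized weights sum to one, and the largest raw weight remains dominant, so the softmax emphasizes the most important kernel.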
[0053] In Equation 3, j is an index of a weight (or one kernel among a plurality of kernels).
[0054] The products of the kernel and the weight are summed, so that the final kernel may correspond to the weighted sum of the kernels. To sum products of different sizes, zero padding may be applied to the products corresponding to smaller kernel sizes. For example, as shown by a dashed line in
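The zero-padding step that makes differently sized products summable can be shown concretely; the two sizes and the weight values are illustrative assumptions.

```python
import numpy as np

# Products of weight * kernel at sizes 3 and 5; pad the smaller to 5x5
p3 = 0.6 * np.ones((3, 3))
p5 = 0.4 * np.ones((5, 5))
p3_padded = np.pad(p3, 1)   # add a zero border so both products are 5x5
K = p3_padded + p5          # sum of equal-size products = final kernel
print(K.shape)  # (5, 5)
```

The center of K receives contributions from both products, while the border receives only the larger product's contribution, matching the "adding zero" behavior described above.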
[0056] Referring to
[0058] As high-resolution displays such as ultra-high definition (UHD) displays become popular, super resolution (SR) imaging that converts a low-resolution (LR) image such as a full-high definition (FHD) image into a high-resolution (HR) image may be used. A method based on a machine learning model such as deep learning may be used for the SR imaging, and output data DOUT of
[0059] When using deep learning models for super resolution, high complexity due to a deep network may require many resources, and the depth of the network and the performance of the network may not necessarily be proportional. In order to solve such a problem, residual learning may be used. The residual learning may refer to adding an LR image to an HR image and learning a difference value between two images. In some examples, the network may be divided into a plurality of residual blocks, and filter parameters may be more easily optimized by connecting each of the plurality of residual blocks through a skip connection. In some embodiments, a fourth model 60 of
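The residual learning described above can be sketched with a minimal residual block; the stand-in transformation F(x) and the chosen weights are illustrative assumptions, not the model's actual convolution layers.

```python
import numpy as np

def residual_block(x, weight):
    """One residual block: output = x + F(x). The skip connection lets the
    block learn only the residual (difference) rather than the full mapping."""
    fx = np.maximum(0, x * weight)   # stand-in for a conv + ReLU transformation
    return x + fx                    # skip connection adds the input back

x = np.linspace(-1, 1, 5)
out = x
for w in (0.5, 0.25):                # a chain of residual blocks
    out = residual_block(out, w)
print(out.shape)  # (5,)
```

Because the input is added back at every block, the identity mapping is always available, which is what makes deep chains of such blocks easier to optimize.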
[0060] Referring to
[0062] Referring to
[0063] In operation S20, the input data DIN is provided to the first model 11. For example, the first model 11 may be based on U-Net and generate a first output OUT1 including features of a plurality of images included in the input data DIN by processing the input data DIN.
[0064] In operation S30, the output of the first model 11 is provided to the second model 12 and the third model 13. For example, the second model 12 may be trained to extract a plurality of kernels, and the third model 13 may be trained to extract a plurality of weights respectively corresponding to a plurality of kernels. The second model 12 and the third model 13 may receive the output of the first model 11, for example, the first output OUT1 in common, and may generate a second output OUT2 and a third output OUT3, respectively.
[0065] In operation S40, a first feature map FM1 is generated. For example, the second output OUT2 generated by the second model 12 in operation S30 may be reshaped to enable extraction of a plurality of kernels, and thus the first feature map FM1 may be generated. In some embodiments, the second output OUT2 may be reshaped based on the number of image frames included in the input data DIN, the number of channels of the image, and sizes of the plurality of kernels. For example, as described above with reference to
[0066] In operation S50, the second feature map FM2 is generated. For example, the third output OUT3 generated by the third model 13 in operation S30 may be reshaped so that a plurality of weights may be extracted, and thus the second feature map FM2 may be generated. In some embodiments, the third output OUT3 may be reshaped based on the number of image frames included in the input data DIN and the number of kernels. For example, the second feature map FM2 may include 3TM slices each having a resolution of W*H like an image included in the input data DIN.
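The two reshaping operations (S40 and S50) can be sketched as follows; the frame count, channel count, kernel sizes, and resolution are illustrative assumptions, and zero arrays stand in for the branch outputs.

```python
import numpy as np

# Hypothetical dimensions: T frames, C channels, kernel sizes 1/3/5/7, HxW maps
T, C, H, W = 4, 3, 16, 16
sizes = [1, 3, 5, 7]

# S40: reshape the kernel branch output into FM1, with one channel per
# kernel coefficient (sum of s*s over all kernel sizes) per image channel.
coeffs = sum(s * s for s in sizes)
out2 = np.zeros((T * C * coeffs, H, W))        # flat branch output (OUT2)
FM1 = out2.reshape(T, C, coeffs, H, W)

# S50: reshape the weight branch output into FM2, with one weight map per
# kernel per frame.
M = len(sizes)
out3 = np.zeros((T * M, H, W))                 # flat branch output (OUT3)
FM2 = out3.reshape(T, M, H, W)

print(FM1.shape, FM2.shape)  # (4, 3, 84, 16, 16) (4, 4, 16, 16)
```

Each (y, x) position of FM1 then yields per-pixel kernel coefficients, and the matching position of FM2 yields the per-kernel weights used in the weighted sum.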
[0067] In operation S60, a predicted kernel is generated. For example, a kernel may be generated based on the first feature map FM1 generated in operation S40 and the second feature map FM2 generated in operation S50. As described above with reference to the drawings, the predicted kernel may be generated based on a weighted sum of a plurality of kernels based on importance, and thus may be more suitable for the input data DIN. An example of operation S60 is described below with reference to
[0068] In operation S70, output data DOUT is generated. The predicted kernel generated in operation S60 may be applied to the input data DIN, and thus the output data DOUT may be generated. For example, as described above with reference to
[0070] Referring to
[0071] In operation S64, a plurality of weights are extracted from the second feature map FM2. For example, a plurality of weights respectively corresponding to the plurality of kernels extracted in operation S62 may be extracted from the second feature map FM2. As described above with reference to
[0072] In operation S66, a weighted sum of the plurality of kernels is calculated. A weighted sum of the plurality of kernels extracted in operation S62 may be calculated based on the weights extracted in operation S64. For example, a product of the weight and the kernel may be generated, and a plurality of products may be summed. Due to the different sizes of the kernels, zero padding may be applied to the product corresponding to the kernel with a smaller size, as described above with reference to
[0074] Referring to
[0075] In operation S90, an HR image IMG is generated. For example, the input data DIN may be processed by the convolution layer and the upsampling layer, and the processing result may be summed with the fourth output OUT4 generated in operation S80. The summation result may be processed again by the convolution layer, and thus an HR image IMG may be generated. Due to the improved quality of the output data DOUT, the HR image IMG may also have improved quality.
[0076] In some embodiments, loss functions used to train the fourth model 60 of
[0077] In Equation 5, H.sub.SR may correspond to all the models described above.
[0078] Also, a structural similarity index measure (SSIM) loss may be used, and the SSIM loss may be defined as in Equation 6 below based on the dependence of the mean and standard deviation for a pixel p in a region P.
[0079] Models may be trained considering both the loss of Equation 5 and the loss of Equation 6, and accordingly, the loss function of Equation 7 below may be defined.
L.sub.Total=λ.sub.SR·L.sub.SR+λ.sub.SSIM·L.sub.SSIM [Equation 7]
[0080] In Equation 7, λ.sub.SR and λ.sub.SSIM are coefficients determined based on a balanced learning method between consistency and visual quality. In some examples, parameters of one or more machine learning models are updated iteratively to minimize the loss function of Equation 7.
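The combined loss of Equation 7 can be sketched as follows. This is an illustrative assumption in two respects: L.sub.SR is represented by an L1 reconstruction loss (the document does not reproduce Equation 5), and the SSIM term uses a simplified single-window SSIM rather than the patch-wise form of Equation 6.

```python
import numpy as np

def l1_loss(pred, target):
    """Stand-in reconstruction loss L_SR (assumed L1)."""
    return np.mean(np.abs(pred - target))

def ssim_like(pred, target, c1=1e-4, c2=9e-4):
    """Global single-window SSIM; a simplification of the patch-wise SSIM."""
    mu_x, mu_y = pred.mean(), target.mean()
    var_x, var_y = pred.var(), target.var()
    cov = ((pred - mu_x) * (target - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2))

def total_loss(pred, target, lam_sr=1.0, lam_ssim=0.1):
    """Equation 7: weighted combination of the SR and SSIM losses."""
    return lam_sr * l1_loss(pred, target) + lam_ssim * (1 - ssim_like(pred, target))

rng = np.random.default_rng(0)
hr = rng.random((8, 8))
print(total_loss(hr, hr))   # identical images -> near-zero total loss
```

Larger λ.sub.SSIM shifts the balance toward structural (visual) fidelity, while larger λ.sub.SR emphasizes pixel-wise consistency.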
[0082] The computer system 100 may refer to any system, including general-purpose or special-purpose computing systems. For example, the computer system 100 may include personal computers, server computers, laptop computers, consumer electronics, and the like. As shown in
[0083] The at least one processor 101 executes program modules including computer-system-executable instructions. A program module may include routines, programs, objects, components, logic, data structures, etc. that perform particular operations or implement particular abstract data types. The memory 102 may include a computer-readable recording medium in the form of volatile memory, such as random-access memory (RAM). The at least one processor 101 may access the memory 102 and execute instructions loaded in the memory 102. The storage system 103 may store information in a non-volatile manner, and may include at least one program product including a program module configured to perform image processing and/or training of models described above with reference to the drawings in some embodiments. The program may include an operating system, at least one application, other program modules, and program data, as a non-limiting example.
[0084] The network adapter 104 provides connectivity to a local area network (LAN), a wide area network (WAN), and/or a public network (e.g., the Internet), and the like. The input/output interface 105 may provide a communication channel with a peripheral device, such as a keyboard, a pointing device, an audio system, or the like. The display 106 may output a variety of information that the user may verify.
[0085] In some embodiments, the image processing and/or training of models described above with reference to the drawings may be implemented as a computer program product. The computer program product may include a non-transitory computer-readable medium (or storage medium) including computer-readable program instructions for allowing the at least one processor 101 to perform image processing and/or training of models. Computer-readable instructions may be, as a non-limiting example, assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state setting data, or source code or object code written in at least one programming language.
[0086] The computer-readable medium may be any tangible medium that may non-transitorily maintain and store instructions executed by the at least one processor 101 or any instruction-executable device. The computer-readable medium may be an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any combination thereof, but is not limited thereto. For example, the computer-readable medium may be a portable computer diskette, a hard disk, RAM, read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory, static random-access memory (SRAM), a CD, a DVD, a memory stick, a floppy disk, a mechanically encoded device, such as a punch card, or any combination thereof.
[0088] Referring to
[0089] The at least one processor 111 may execute instructions. For example, the at least one processor 111 may execute an operating system by executing instructions stored in the memory 113, or may execute applications executed by the operating system. In some embodiments, the at least one processor 111 may instruct an operation to the AI accelerator 115 and/or the hardware accelerator 117 by executing instructions, and may obtain a result of performing the operation from the AI accelerator 115 and/or the hardware accelerator 117. In some embodiments, the at least one processor 111 may be an application specific instruction set processor (ASIP) customized for a specific purpose, and may support a dedicated instruction set.
[0090] The memory 113 may have any structure for storing data. For example, the memory 113 may include a volatile memory device such as DRAM, SRAM, or the like, and include a non-volatile memory device such as flash memory, resistive random-access memory (RRAM), or the like. The at least one processor 111, the AI accelerator 115, and the hardware accelerator 117 may store data (e.g., IN, IMG_I, IMG_O, OUT in
[0091] The AI accelerator 115 may refer to hardware designed for AI applications. In some embodiments, the AI accelerator 115 may include a neural processing unit (NPU) for implementing a neuromorphic structure, may generate output data by processing input data provided from the at least one processor 111 and/or the hardware accelerator 117, and may provide output data to the at least one processor 111 and/or the hardware accelerator 117. In some embodiments, the AI accelerator 115 may be programmable and may be programmed by the at least one processor 111 and/or the hardware accelerator 117.
[0092] The hardware accelerator 117 may refer to hardware designed to perform a specific operation at a high speed. For example, the hardware accelerator 117 may be designed to perform data transformation such as demodulation, modulation, encoding, and decoding at a high speed. The hardware accelerator 117 may be programmable and may be programmed by the at least one processor 111 and/or the AI accelerator 115.
[0093] In some embodiments, the AI accelerator 115 executes the models described above with reference to the drawings. The AI accelerator 115 may generate an output that includes useful information by processing an image, a feature map, and the like. In addition, in some embodiments, at least some of the models executed by the AI accelerator 115 may be executed by the at least one processor 111 and/or the hardware accelerator 117.
[0094] While the inventive concept has been particularly shown and described with reference to embodiments thereof, it will be understood that various changes in form and details may be made therein without departing from the spirit and scope of the following claims.