System and Method to Utilize a Reduced Image Resolution for Computer Vision Applications

20240037702 · 2024-02-01

Abstract

A system, device and method are provided for generating image processing models for selected hardware. The method, illustratively, includes obtaining a reference model, a desired image resolution based on target hardware, and a training set of images comprising images with the desired image resolution and images with a higher resolution. The method includes generating an updated model by: iteratively training the reference model with a combined set of features, the combined set of features comprising features determined from the images with the higher resolution with at least one stem and features determined from the images with the desired resolution. The method includes outputting the trained updated model to the target hardware to process images with the desired image resolution.

Claims

1. A computer-implemented method for generating image processing models, the method comprising: obtaining a reference model, a desired image resolution based on target hardware, and a training set of images comprising images with the desired image resolution and images with a higher resolution; generating an updated model by: iteratively training the reference model with a combined set of features, the combined set of features comprising features determined from the images with the higher resolution with at least one stem and features determined from the images with the desired resolution; and outputting the trained updated model to the target hardware to process images with the desired image resolution.

2. The method of claim 1, wherein the at least one stem comprises one or more of a convolution structure, a pooling structure, and a space to depth structure.

3. The method of claim 1, wherein the at least one stem comprises two different convolution structures.

4. The method of claim 1, wherein the at least one stem comprises two different stems, or two identical stems.

5. The method of claim 4, wherein each of the two stems comprises one or more of a convolution structure, a pooling structure, and a space to depth structure.

6. The method of claim 5, wherein each of the stems comprises different convolution structures.

7. The method of claim 1, wherein the at least one stem comprises an instance of the space to depth structure outputting into an instance of the convolution structure.

8. The method of claim 1, wherein the at least one stem comprises an instance of the convolution structure outputting into an instance of the pooling structure.

9. The method of claim 1, wherein the at least one stem comprises an instance of the pooling structure outputting into an instance of the convolution structure.

10. The method of claim 4, wherein features learned from each of the stems are combined for use in training the updated reference model.

11. The method of claim 1, further comprising: evaluating the reference model performance for different image resolutions during a training operation; and determining the desired image resolution based on the reference model performance during the evaluation, the desired image resolution defining characteristics of the target hardware.

12. A device comprising a processor and memory, the memory comprising computer executable instructions for generating image processing models, the instructions causing the processor to: obtain a reference model, a desired image resolution based on target hardware, and a training set of images comprising images with the desired image resolution and images with a higher resolution; generate an updated model by: iteratively training the reference model with a combined set of features, the combined set of features comprising features determined from the images with the higher resolution with at least one stem and features determined from the images with the desired resolution; and output the trained updated model to the target hardware to process images with the desired image resolution.

13. The device of claim 12, wherein the at least one stem comprises one or more of a convolution structure, a pooling structure, and a space to depth structure.

14. The device of claim 12, wherein the at least one stem comprises two different convolution structures.

15. The device of claim 12, wherein the at least one stem comprises two different stems, or two identical stems.

16. The device of claim 15, wherein features learned from each of the two or more stems are combined for use in training the updated reference model.

17. The device of claim 15, wherein the at least one stem comprises two stems, each of the two stems comprising one or more of a convolution structure, a pooling structure, and a space to depth structure.

18. The device of claim 17, wherein at least one of the stems comprises an instance of the space to depth structure outputting into an instance of the convolution structure.

19. The device of claim 12, the instructions causing the processor to: evaluate the reference model performance for different image resolutions during a training operation; and determine the desired image resolution based on the reference model performance during the evaluation, the desired image resolution defining characteristics of the target hardware.

20. A computer readable medium comprising computer executable instructions for generating image processing models, the instructions for: obtaining a reference model, a desired image resolution based on target hardware, and a training set of images comprising images with the desired image resolution and images with a higher resolution; generating an updated model by: iteratively training the reference model with a combined set of features, the combined set of features comprising features determined from the images with the higher resolution with at least one stem and features determined from the images with the desired resolution; and outputting the trained updated model to the target hardware to process images with the desired image resolution.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0027] Embodiments will now be described with reference to the appended drawings wherein:

[0028] FIG. 1 is a block diagram for finding an optimal resolution.

[0029] FIG. 2 is a flow chart illustrating a two stem architecture using S2D.

[0030] FIG. 3 is a flow chart illustrating a two stem architecture using Conv2D.

[0031] FIG. 4 is a flow chart illustrating two identical stems with sharp down sampling.

[0032] FIG. 5 is a flow chart illustrating a single stem with sharp down sampling followed by a 1×1 Conv.

[0033] FIG. 6 is a flow chart illustrating a process for generating an optimized model for a target hardware system.

DETAILED DESCRIPTION

[0034] Deep learning models used in computer vision perform well at higher input resolutions and when the model capacity is high. Accuracy metrics start to drop when the system reduces the input resolution, the model capacity, or both. While high model capacity is needed when the model requires a complex understanding of the problem (e.g., a high number of categories, semantically complex categories, etc.), most industrial and practical applications do not need a model to detect more than a few classes. For example, a surveillance application for apartments may need to detect only a few object categories such as persons, pets, and cars. Similarly, automotive applications need to detect different types of vehicles, people, and animals. Most applications use a resolution that is either chosen based on empirical studies from the academic literature or chosen using a limited set of experiments. To address this issue, an algorithm has been developed to find an optimal resolution for a given task automatically, as follows.

[0035] First, one can divide the image resolutions into smaller bins divisible by 32. These bins fall in the range [0.4*R_org : R_org], where R_org is the original resolution.

[0036] Second, the system finds the optimal resolution R_opt for which the drop in model performance stays within an allowed range (delta). Delta is the maximum accuracy drop that the application can afford, which can be as low as zero.

[0037] Third, the system uses R_opt to add an auxiliary stem to the object detection pipeline. The auxiliary stem accepts the image at the original resolution (R_org), while the other of the two stems is the same as in the original model and accepts the input resized by a scaling factor of (R_opt/R_org).

[0038] In the present solution, the system can add another stem that accepts the input at resolution R_org; this input goes through a few layers before the output is concatenated to the output of stem1.

[0039] The above architecture results in an increase in mean average precision (mAP) of a few points and a speedup of around (R_org/R_opt)^2.

[0040] As a next step, the system is configured to choose a resolution one or two bins lower, based on the accuracy gain from the above operations, and to retrain the model with two stems to obtain an accuracy that is almost the same as that of the original model. This step would provide an overall speedup of (R_opt - 32)/R_org with zero accuracy drop.
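The quadratic speedup estimate (R_org/R_opt)^2 given above can be illustrated with a short calculation. The sketch below assumes that the cost of a convolutional model scales with the number of input pixels, which is why the ratio of side lengths is squared; the function name and the example resolutions are hypothetical and are not taken from the experiments.

```python
def quadratic_speedup(r_org, r_opt):
    """Estimated speedup (R_org / R_opt)^2 when running at R_opt instead of R_org."""
    if not 0 < r_opt <= r_org:
        raise ValueError("expected 0 < r_opt <= r_org")
    return (r_org / r_opt) ** 2


# Hypothetical resolutions, chosen only for illustration.
print(quadratic_speedup(640, 320))            # 4.0
print(round(quadratic_speedup(640, 448), 2))  # 2.04
```

Measured speedups can differ from this estimate, since layers that do not scale with the pixel count and fixed overheads are ignored.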

Context

[0041] The experiments are carried out using a YOLOv5/v4 backbone [1]. Space2Depth, introduced by Sajjadi et al. [2], can be used to downscale the input resolution. Zhang [5] proposed anti-aliasing by low-pass filtering before down sampling, which improves detection performance. TResNet [3] is a variant of ResNet that aims to boost accuracy while maintaining GPU training and inference efficiency. It includes multiple design choices, including Space2Depth and anti-aliasing. Through extensive ablation studies, Sandler et al. [4] show that resolution in the first few layers does not matter as much as in the later layers.

Finding An Optimal Resolution

[0042] The first step in finding the optimal resolution during model training is to create resolution bins at intervals of 32 within the range [0.4*R_org : R_org], as shown in FIG. 1. During model training, at each evaluation stage, the model is evaluated on all the resolution bins and the best accuracy within the allowed accuracy drop (delta) is recorded. After model training ends, the optimal resolution (R_opt) is selected from the evaluation results.
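The binning and selection steps described above can be sketched as follows. This is a minimal illustration and not the implementation used in the experiments: the function names are hypothetical, R_org is assumed to be a multiple of 32, the selection rule (smallest resolution whose accuracy drop stays within delta) is one reading of the text, and the accuracy values are made up.

```python
import math


def resolution_bins(r_org, step=32, low_frac=0.4):
    """Return the multiples of `step` in [low_frac * r_org, r_org], ascending."""
    low = round(low_frac * r_org, 6)       # rounding guards against float noise
    first = math.ceil(low / step) * step   # smallest multiple of step >= low
    return list(range(first, r_org + 1, step))


def select_optimal_resolution(acc_by_res, r_org, delta):
    """Pick the smallest resolution whose accuracy drop from r_org is <= delta."""
    baseline = acc_by_res[r_org]
    eligible = [r for r, acc in acc_by_res.items() if baseline - acc <= delta]
    return min(eligible)


bins = resolution_bins(320)
print(bins)  # [128, 160, 192, 224, 256, 288, 320]

# Made-up accuracies (percent) for illustration only; not measured results.
made_up_acc = dict(zip(bins, [40.0, 48.0, 53.0, 56.0, 58.0, 59.0, 60.0]))
print(select_optimal_resolution(made_up_acc, r_org=320, delta=2.0))  # 256
```

With delta set to zero, the same rule returns the smallest resolution that matches or exceeds the accuracy at R_org.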

Optimal Resolution Guided Model Architecture Change

[0043] To support the optimal resolution automatically, the system introduces some changes in the architecture of the model, which can be achieved by using different methods described in this section.

Two Stem Architecture Using S2D

[0044] The two stem architecture shown in FIG. 2 uses an image of resolution H×W. Stem2 uses an average pool to convert the image to half the original resolution (H/2×W/2), followed by a convolution with a stride of 2 to get an output of N×H/4×W/4 (where N=48 is a width hyperparameter of the model). Stem1 uses a block called space to depth (S2D), which stacks spatial blocks of an image with resolution H×W along the channel dimension to increase the depth. For example, a single image channel with dimensions H×W is converted to H/4×W/4 with 16 channels. Therefore, an input image with 3 channels yields a total of 3*16 (48) channels. This output goes through a convolution layer with stride=1 to produce a 48×H/4×W/4 output. The outputs from stem1 and stem2 are added in elementwise fashion and the rest of the network structure is kept the same. The rationale behind this approach is that, instead of only using a low-resolution input to increase the speed of a model (with a loss in accuracy), the system adds an additional stem that uses a higher resolution image but goes through a sharp down sampling (using S2D), so that the system can retain information that would otherwise have been lost.
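The space to depth rearrangement used by stem1 can be illustrated in plain Python on nested lists. This is a sketch under stated assumptions: the function name is hypothetical, the image is stored as channels × height × width, and the order in which the 16 sub-positions are stacked into channels is an arbitrary choice here (deep learning frameworks differ on this ordering).

```python
def space_to_depth(image, block=4):
    """Rearrange a C x H x W nested list into (C*block*block) x H/block x W/block."""
    channels = len(image)
    height, width = len(image[0]), len(image[0][0])
    if height % block or width % block:
        raise ValueError("H and W must be divisible by the block size")
    out = []
    for c in range(channels):
        for dy in range(block):
            for dx in range(block):
                # One output channel per (input channel, offset inside the block).
                out.append([
                    [image[c][y + dy][x + dx] for x in range(0, width, block)]
                    for y in range(0, height, block)
                ])
    return out


# A 3-channel 8x8 image becomes 48 channels of size 2x2.
image = [[[c * 100 + y * 8 + x for x in range(8)] for y in range(8)] for c in range(3)]
packed = space_to_depth(image)
print(len(packed), len(packed[0]), len(packed[0][0]))  # 48 2 2
```

No pixel values are discarded: the spatial size shrinks by a factor of 4 in each dimension while the channel count grows by a factor of 16, which is the property the rationale above relies on.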

Two Stem Architecture Using Conv2D

[0045] FIG. 3 shows a two stem architecture using Conv2D. This approach is a slight modification of the S2D stem: instead of using an S2D module followed by a Conv2D with stride 1 in stem1, a Conv2D with stride 2 is applied first, which results in a tensor of size 32×H/2×W/2. Then, an average pooling layer with kernel size 2 is applied so that the final tensor from stem1 is 32×H/4×W/4. The Conv2D in stem2 has 32 output channels as well. In this design, the Conv2D in stem1 is applied to a larger resolution image, which gives the model the opportunity to extract features at that scale.
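The tensor sizes quoted for the Conv2D design can be checked with a small shape-tracing helper. The sketch below makes several assumptions: the helper and its operation tuples are hypothetical, padding is assumed to be chosen so that a stride-s layer divides H and W exactly by s, stem2 is assumed to keep the average pool followed by a stride-2 convolution described for FIG. 2, and H = W = 320 is only an example input.

```python
def trace_shape(shape, ops):
    """Follow a (channels, height, width) shape through a list of stem operations.

    Assumes padding is chosen so that a stride-s layer divides H and W by s.
    """
    c, h, w = shape
    for op in ops:
        kind = op[0]
        if kind == "conv":        # ("conv", out_channels, stride)
            _, c, stride = op
        elif kind == "avgpool":   # ("avgpool", kernel), with stride equal to kernel
            _, stride = op
        else:
            raise ValueError(f"unknown op: {kind}")
        h, w = h // stride, w // stride
    return (c, h, w)


image = (3, 320, 320)  # hypothetical 3-channel input with H = W = 320
stem1 = trace_shape(image, [("conv", 32, 2), ("avgpool", 2)])
stem2 = trace_shape(image, [("avgpool", 2), ("conv", 32, 2)])
print(stem1, stem2)  # (32, 80, 80) (32, 80, 80)
```

Both stems end at 32×H/4×W/4, so their outputs have matching shapes and can be combined before the rest of the network.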

Two Identical Stems with Sharp Down-Sampling

[0046] The architecture shown in FIG. 4 was used to verify whether the accuracy gain comes from using two stems or from the sharp down-sampling. Both stems have a convolution with a stride of 4 to convert the input resolution from H×W to H/4×W/4. The outputs of both stems are added elementwise and the result is sent to the next layer.
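A toy version of two stride-4 stems whose outputs are summed can be written in plain Python. This sketch is illustrative only: it handles a single channel, uses a 4×4 kernel without padding (the kernel size is not stated above and is an assumption), and uses fixed made-up weights where a trained model would have learned ones.

```python
def conv2d_strided(image, kernel, stride):
    """Single-channel 2D convolution (cross-correlation) with a stride and no padding."""
    k = len(kernel)
    out_h = (len(image) - k) // stride + 1
    out_w = (len(image[0]) - k) // stride + 1
    return [
        [
            sum(kernel[i][j] * image[y * stride + i][x * stride + j]
                for i in range(k) for j in range(k))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


def add_elementwise(a, b):
    """Add two equally sized 2D feature maps position by position."""
    return [[u + v for u, v in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


image = [[1] * 8 for _ in range(8)]              # 8x8 image of ones
mean_kernel = [[1 / 16] * 4 for _ in range(4)]   # made-up weights for the first stem
sum_kernel = [[1] * 4 for _ in range(4)]         # made-up weights for the second stem

stem_a = conv2d_strided(image, mean_kernel, stride=4)
stem_b = conv2d_strided(image, sum_kernel, stride=4)
print(add_elementwise(stem_a, stem_b))  # [[17.0, 17.0], [17.0, 17.0]]
```

Each stem reduces the 8×8 input to 2×2, that is H/4×W/4, so the two outputs can be added position by position.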

Single Stem with Sharp Down-Sampling and 1×1 Conv

[0047] The architecture shown in FIG. 5 verifies that, for many datasets and models, the accuracy gain can be achieved by a single stem with sharp down sampling followed by a 1×1 convolution layer.
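For readers unfamiliar with the 1×1 convolution that follows the sharp down sampling, the sketch below shows what such a layer computes: at every spatial position it mixes the input channels with a weight matrix and leaves the spatial size unchanged. The function name, the tiny feature map, and the weights are hypothetical, and the bias term is omitted.

```python
def conv1x1(features, weights):
    """Apply a 1x1 convolution.

    features: C_in x H x W nested list.
    weights:  C_out x C_in nested list (one row of channel weights per output channel).
    Returns a C_out x H x W nested list.
    """
    height, width = len(features[0]), len(features[0][0])
    return [
        [
            [sum(w * features[c][y][x] for c, w in enumerate(row)) for x in range(width)]
            for y in range(height)
        ]
        for row in weights
    ]


features = [[[1, 2], [3, 4]], [[10, 20], [30, 40]]]  # 2 channels, 2x2 each
weights = [[1, 1], [1, -1], [0, 2]]                   # 3 output channels
mixed = conv1x1(features, weights)
print(mixed[0])  # [[11, 22], [33, 44]]
```

Because it only changes the channel count, a 1×1 layer is a comparatively cheap way to recombine the features produced by the down-sampling convolution.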

Results

[0048] All the benchmarking results shown in Tables 1 and 2 below are from the YOLOv5s model, trained and evaluated on a subset of the COCO dataset with 8 classes (person, dog, cat, car, bus, truck, motorcycle, and bicycle). All of the models are trained from scratch (no pretrained model is used) to ensure that the performance results are comparable without any bias.

TABLE 1: Benchmarking of Yolo5s Model Using Different Architectures on Input Resolution 320

| Exp | Input resolution | Stem1 | Stem2 | MAP@0.5 | CPU time (ms) | GPU time (ms) |
|---|---|---|---|---|---|---|
| Yolov5s | 320 | NA | NA | 53.94 | 352.45 | 61.11 |
| Yolov5s | 160 | | | 37.0 | 105.54 | 22.82 |
| Yolov5s 2stem-avgpool_160 | 320* | conv(s = 4) | avgpool(2×2) -> conv(s = 2) | 40.59 | 113.53 | 24.14 |
| Yolov5s-2stem-maxblurpool_320 | 320* | conv(s = 4) | maxblurpool(2×2) -> conv(s = 2) | 40.98 | 113.27 | 24.88 |
| Yolov5s-2stem-conv_320 | 320* | conv(s = 4) | conv(s = 4) | 40.4 | 112.24 | 23.809 |
| Yolov5s-2stem-conv_sa_320 | 320* | conv(s = 4) -> conv(1×1) | conv(s = 4) -> conv(1×1) | 41.61 | 129.82 | 35.93 |
| Yolov5s-conv-2x-channels_320 | 320* | conv(s = 4) *2C -> conv(1×1) | conv(s = 4) *2C -> conv(1×1) | 41.57 | 115.27 | 23.77 |

TABLE 2: Benchmarking of Yolo5s Model Using Different Architectures on Input Resolution 448

| Exp | Input resolution | Stem1 | Stem2 | MAP@0.5 | CPU time (ms) | GPU time (ms) |
|---|---|---|---|---|---|---|
| Yolov5s | 448 | NA | NA | 58.8 | 683 | 112 |
| Yolov5s | 480 | NA | NA | 60.9 | 772 | 129 |
| Yolov5s 2stem-interpolate | 640 | Upsample --> 896; Conv (S = 2, K = 3); Maxpool (K = 2) | Interpolate --> 448; Conv (S = 2, K = 3) | 59.8 | 781 | 139 |
| Yolov5s 2stem-interpolate | 640 | Upsample --> 896; Conv (S = 4, K = 7) | Interpolate --> 448; Conv (S = 2, K = 6) | 60.2 | 931 | 157 |
| Final single stem model | 896 (actual will be half this) | Conv (S = 2, K = 7, c2 = 64); Maxpool (K = 2); Conv 1×1 (c2 = 32) | N/A | 60.6 | 768 | 126 |

[0049] It may be noted that for these tables, the resolution noted is used as the input of the model but the effective resolution would be half of the actual resolution.
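The trade-off reported in Table 1 can be made explicit by computing, for each variant, the mAP drop and the speedup relative to the Yolov5s baseline at resolution 320. The sketch below only re-uses the numbers from Table 1; the shortened row labels and the helper name are hypothetical.

```python
# (label, MAP@0.5, CPU time in ms, GPU time in ms), copied from Table 1.
BASELINE = ("Yolov5s @ 320", 53.94, 352.45, 61.11)
ROWS = [
    ("Yolov5s @ 160", 37.0, 105.54, 22.82),
    ("2stem-avgpool", 40.59, 113.53, 24.14),
    ("2stem-maxblurpool", 40.98, 113.27, 24.88),
    ("2stem-conv", 40.4, 112.24, 23.809),
    ("2stem-conv_sa", 41.61, 129.82, 35.93),
    ("conv-2x-channels", 41.57, 115.27, 23.77),
]


def compare(row, base=BASELINE):
    """Return the mAP drop and CPU/GPU speedups of `row` relative to `base`."""
    name, m_ap, cpu_ms, gpu_ms = row
    return {
        "name": name,
        "map_drop": round(base[1] - m_ap, 2),
        "cpu_speedup": round(base[2] / cpu_ms, 2),
        "gpu_speedup": round(base[3] / gpu_ms, 2),
    }


for row in ROWS:
    print(compare(row))
```

Read this way, the two stem variants recover part of the mAP lost by the plain 160 model while keeping most of its CPU speedup over the 320 baseline.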

CONCLUSION

[0050] The proposed solution addresses two different aspects of object detection performance, namely i) finding the optimal resolution for better latency, and ii) proposing changes in the models for better accuracy. The inference resolution directly impacts the latency of the models, but in industrial use cases this resolution is decided without any structured experiments. The above proposes a framework to find the optimal inference resolution, that is, the resolution that gives the lowest inference time while staying within an accuracy difference of delta from the model at the original resolution.

[0051] For finding the changes in the architecture, multiple experiments were conducted including single stem and two stem architectures. The fundamental idea is to extract more information from the same image using different operations. For two stem architectures, two different approaches were attempted, one with S2D auxiliary stem, and one with Conv2D auxiliary stem. The Conv2D auxiliary stem produced better accuracy than S2D in most of the experiments.

[0052] It was found that the accuracy gain achieved by a single stem with sharp down sampling followed by a 1×1 convolution layer is equivalent to, or sometimes better than, that of the two stem approaches.

[0053] Applying both the accuracy and latency aspects to a given object detection model can improve performance and help save costs in large-scale applications.

[0054] Referring now to FIG. 6, the proposed solution in an application is summarized. With the reference model and data as inputs, the system evaluates the model on different resolutions (e.g., the resolution bin intervals discussed above) during training. This produces an optimized resolution which can be used along with the original resolution in the next stage. The original resolution is used to insert additional layers to learn from the high resolution features, while the optimized resolution is used to learn features from the low(er) resolution. The features learned in these operations are then concatenated and used to train an optimized model as discussed above. The optimized model can be used by a target hardware, such as a CPU, NPU, embedded GPU, etc. to make inferences on the optimized resolution. The process shown in FIG. 6 can be adapted for different applications, different computing environments, and/or different hardware types to utilize the optimal resolution in various systems and devices.

[0055] For simplicity and clarity of illustration, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements. In addition, numerous specific details are set forth in order to provide a thorough understanding of the examples described herein. However, it will be understood by those of ordinary skill in the art that the examples described herein may be practiced without these specific details. In other instances, well-known methods, procedures and components have not been described in detail so as not to obscure the examples described herein. Also, the description is not to be considered as limiting the scope of the examples described herein.

[0056] It will be appreciated that the examples and corresponding diagrams used herein are for illustrative purposes only. Different configurations and terminology can be used without departing from the principles expressed herein. For instance, components and modules can be added, deleted, modified, or arranged with differing connections without departing from these principles.

[0057] It will also be appreciated that any module or component exemplified herein that executes instructions may include or otherwise have access to computer readable media such as storage media, computer storage media, or data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape. Computer storage media may include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data. Examples of computer storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transitory computer readable medium which can be used to store the desired information and which can be accessed by an application, module, or both. Any such computer storage media may be part of the system, any component of or related thereto, etc., or accessible or connectable thereto. Any application or module herein described may be implemented using computer readable/executable instructions that may be stored or otherwise held by such computer readable media.

[0058] The steps or operations in the flow charts and diagrams described herein are provided by way of example. There may be many variations to these steps or operations without departing from the principles discussed above. For instance, the steps may be performed in a differing order, or steps may be added, deleted, or modified.

[0059] Although the above principles have been described with reference to certain specific examples, various modifications thereof will be apparent to those skilled in the art as having regard to the appended claims in view of the specification as a whole.

REFERENCES

[0060] [1] Bochkovskiy, Alexey, Chien-Yao Wang, and Hong-Yuan Mark Liao. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020).

[0061] [2] Sajjadi, Mehdi S M, Raviteja Vemulapalli, and Matthew Brown. Frame-recurrent video super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6626-6634. 2018.

[0062] [3] Ridnik, Tal, Hussam Lawen, Asaf Noy, Emanuel Ben Baruch, Gilad Sharir, and Itamar Friedman. Tresnet: High performance gpu-dedicated architecture. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1400-1409. 2021.

[0063] [4] Sandler, Mark, Jonathan Baccash, Andrey Zhmoginov, and Andrew Howard. Non-discriminative data or weak model? on the relative importance of data and model resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp. 0-0. 2019.

[0064] [5] Zhang, Richard. Making convolutional networks shift-invariant again. In International conference on machine learning, pp. 7324-7334. PMLR, 2019.