Method and System for Object Tracking and Recognition Using Low Power Compressive Sensing Camera in Real-Time Applications
20200160110 · 2020-05-21
Assignee
Inventors
CPC classification
G06V20/52 (PHYSICS)
G06F18/2148 (PHYSICS)
G06V10/758 (PHYSICS)
International classification
Abstract
The present invention integrates a low-power, high-compression pixel-wise coded exposure (PCE) camera with advanced object detection, tracking, and classification algorithms into a real-time system. A PCE camera can control the exposure time of every pixel and can simultaneously compress multiple frames into a single compressed frame. Consequently, it can significantly improve dynamic range while reducing data storage and transmission bandwidth usage. Conventional approaches that utilize a PCE camera for object detection, tracking, and classification require the compressed frames to be reconstructed. These approaches are extremely time consuming and hence make PCE cameras unsuitable for real-time applications. The present invention presents an integrated solution that incorporates advanced algorithms into the PCE camera, saving reconstruction time and making it feasible to work in real-time applications.
Claims
1. A system for object tracking and recognition in real-time applications comprising: a compressive sensing camera that generates video frames of motion coded images; the motion coded images being directly connected to an object tracking and recognition unit without frame reconstruction; wherein the object tracking and recognition unit has a video frames trainer including: a histogram matching unit with one output connected to a You Only Look Once (YOLO) tracker and another output connected to a data augmentation unit; a vehicle label and location unit with an output connected to the YOLO tracker; a Residual Network (ResNet) classifier connected to an output of the data augmentation unit; an output of the YOLO tracker and an output of the ResNet classifier connected to a performance metrics unit; and an output of the performance metrics unit fed back to the YOLO tracker and the ResNet classifier, respectively.
2. A system for object tracking and recognition in real-time applications in accordance to claim 1, wherein: the compressive sensing camera is either a pixel-wise coded exposure (PCE) or a Coded Aperture (CA) camera.
3. A system for object tracking and recognition in real-time applications in accordance to claim 2, further comprising: a cropped object histogram matching unit connected between the output of the YOLO tracker and an input of the ResNet classifier; and a majority voting unit is connected to an output of the ResNet classifier.
4. A system for object tracking and recognition in real-time applications in accordance to claim 3, wherein: the compressive sensing camera generates measurements at a variable compression ratio to save video data storage space and transmission bandwidth.
5. A system for object tracking and recognition in accordance to claim 4, wherein: individual exposure control is applied to each pixel to compress the video data to improve dynamic range of the motion coded image.
6. A system for object tracking and recognition in real-time applications in accordance to claim 5, wherein: the YOLO tracker is a deep learning based tracker which tracks multiple objects simultaneously without initial bounding boxes on the objects.
7. A system for object tracking and recognition in real-time applications in accordance to claim 6, wherein: two deep learning algorithms are integrated into the compressive sensing camera.
8. A system for object tracking and recognition in real-time applications in accordance to claim 7, wherein: the algorithms can be implemented in low cost Digital Signal Processor (DSP) and Field Programmable Gate Array (FPGA) for real-time processing.
9. A method for object tracking and recognition in real-time applications, comprising the steps of: using a compressive sensing camera to produce motion coded images containing raw sensing measurements; generating training video frames from the raw sensing measurements directly without frames reconstruction; histogram matching the training video frames to a common frame reference; extracting and saving object label and location information from the motion coded images; training a You Only Look Once (YOLO) tracker using outputs from the histogram matched video frames and the extracted label and location information; training a Residual Network (ResNet) classifier by augmenting data generated by the histogram matching; selecting classification metrics from training results of the YOLO tracker and ResNet classifier, respectively; and feeding back the selected training results to the YOLO tracker and ResNet classifier.
10. A method for object tracking and recognition in real-time applications in accordance to claim 9, wherein: the compressive sensing camera is either a pixel-wise coded exposure (PCE) or a Coded Aperture (CA) camera.
11. A method for object tracking and recognition in real-time applications in accordance to claim 10, further comprising the steps of: cropping the histogram matched video frame objects from outputs of the YOLO tracker; histogram matching the cropped video frame objects to a common frame reference before sending the output to the Residual Network (ResNet) classifier; applying a decision level fusion based on majority voting to the ResNet classifier to improve classification performance; and displaying target type and location information on output videos.
12. A method for object tracking and recognition in real-time applications in accordance to claim 11, wherein the cropping step further comprises the steps of: scaling up and down the cropped objects; rotating the cropped objects by different angles; or brightening and dimming the cropped objects.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION OF THE INVENTION
[0024] Video Imager with PCE
[0025] The present invention employs a sensing scheme based on a pixel-wise coded exposure (PCE) camera, also known as a Coded Aperture (CA) camera, as described in [1].
[0026] Conventional approaches using PCE cameras involve frame reconstruction from the compressed frames using sparsity-based methods [6] [7], which are extremely time consuming and unsuitable for real-time applications.
[0027] Instead of performing sparse reconstruction on PCE images or frames, the scheme of the present invention operates directly in the compressive measurement domain, as shown in the drawings.
[0031] The PCE Full Model (PCE Full or CA Full) is quite similar to a conventional video sensor: every pixel in the spatial scene is exposed for exactly the same duration of one second. However, motion is expected to be blurred significantly. This simple model still produces a compression ratio of 30:1, but there is not much saving in sensing power, since all pixels are exposed at all times.
[0032] Next, the sensing model labeled PCE 50% or CA 50% uses the following set of parameters. For each frame, roughly 1.85% of the pixels are activated. The exposure time is T_e = 133.3 ms, so each exposed pixel stays continuously active for a 4-frame duration. In short, the present invention outputs ONE coded aperture image for every group of 30 frames, resulting in a temporal sensing ratio of 1 frame per second (fps), or equivalently a 30:1 compression ratio in terms of frame rate. In every frame, a new set of pixels that have not yet been activated is selected for activation. Once activated, each pixel has exactly the same exposure duration. This scheme results in 50% of the pixel locations being captured at various time-stamps within one sensing period (1 second), yielding a single coded aperture image or PCE frame with 50% activated pixels for every 30 conventional video frames. The PCE 50% Model yields a data saving ratio of 1/30 × 1/2 = 1/60 and a power saving ratio of 1/60 × 4 = 1/15.
[0033] For the PCE 25% or CA 25% Model, the percentage of pixels activated per frame is further decreased so that the final output PCE frame contains only 25% of randomly activated pixel locations. The exposure duration is still set at the same conventional 4-frame duration. A simple way to generate PCE 25% data is to randomly ignore half of the measurements collected from the PCE 50% Model. The PCE 25% Model yields a data saving ratio of 1/30 × 1/4 = 1/120 and a power saving ratio of 1/120 × 4 = 1/30. Note that the present invention can easily reduce the sensing power further by using a much shorter exposure duration. This might be advantageous for tracking fast-moving objects, at the expense of noisier measurements in low-light conditions.
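The activation schedule described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the frame size, the helper name, and the way pixels are partitioned across frames are all assumptions; only the 50% coverage, 30-frame group, and 4-frame exposure come from the text.

```python
import random

def simulate_pce_mask(height=60, width=60, frames=30, coverage=0.5, exposure=4):
    """Sketch of the PCE 50% sensing model: each conventional frame
    activates a fresh set of not-yet-used pixels, every activated pixel
    stays exposed for `exposure` frames, and one coded frame is emitted
    per group of `frames` frames."""
    total = height * width
    target = int(total * coverage)
    pixels = list(range(total))
    random.shuffle(pixels)
    # Only frames in which a full 4-frame exposure still fits can start
    # an activation, e.g. 27 of 30 frames for the PCE 50% parameters.
    start_frames = frames - exposure + 1
    per_frame = target // start_frames   # ~1.85% of pixels per frame
    activated = pixels[:per_frame * start_frames]
    return activated, per_frame / total

active, frac = simulate_pce_mask()
print(f"fraction of pixels starting exposure per frame: {frac:.4f}")  # ~0.0185
```

With these parameters, 50%/27 ≈ 1.85% of pixels begin exposure in each conventional frame, matching the figure given in paragraph [0032].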
[0034] Table 1 below summarizes the comparison between the three sensing models.
TABLE 1. Comparison in Data Compression Ratio and Power Saving Ratio between Three Sensing Models.

                      PCE Full/CA Full   PCE 50%/CA 50%   PCE 25%/CA 25%
Data Saving Ratio           30:1               60:1             120:1
Power Saving Ratio           1:1               15:1              30:1
[0035] A small portion of the sensing mask in 3-dimensional spatio-temporal space for the PCE 50% Model is shown in the drawings.
Integrated Framework for Object Detection, Tracking, and Classification Directly in Compressive Measurement Domain
[0036] In the present invention, two deep learning algorithms, a YOLO tracker and a ResNet classifier, are integrated into the PCE camera.
[0037] After the preparation is done, the YOLO tracker is trained via standard procedures. One useful technique concerns the burn-in period: it is found necessary to perform 1000 epochs of burn-in using a small learning rate, which prevents unstable training. Another technique is that, if multiple objects need to be tracked, it is better to create a single-class model that lumps all the objects into one class; otherwise, the training will never converge. In the training of ResNet, data augmentation plays a critical role, especially when there are not many video frames for training.
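The burn-in idea above can be expressed as a simple learning-rate schedule. The function name and the specific rate values are illustrative assumptions; the text only specifies a small learning rate for roughly the first 1000 epochs.

```python
def burn_in_learning_rate(epoch, burn_in_epochs=1000,
                          burn_in_lr=1e-5, base_lr=1e-3):
    """Sketch of a burn-in schedule: train with a very small learning
    rate for the first `burn_in_epochs` epochs to avoid unstable early
    training, then switch to the base rate. Rate values are assumed."""
    return burn_in_lr if epoch < burn_in_epochs else base_lr
```

In practice this would be plugged into whichever optimizer the training framework uses, queried once per epoch.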
[0038] In training both the YOLO tracker and the ResNet classifier, an iterative process improves performance over time by using feedback information from the ground truth and the training results.
[0040] The following metrics are used in the present invention for evaluating the YOLO tracker performance:
[0041] Center Location Error (CLE): the error between the center of the detected bounding box and the center of the ground-truth bounding box.
[0042] Distance Precision (DP): the percentage of frames where the centroids of detected bounding boxes are within 20 pixels of the centroids of ground-truth bounding boxes.
[0043] EinGT: the percentage of frames where the centroids of the detected bounding boxes are inside the ground-truth bounding boxes.
[0044] Number of frames with detection: the total number of frames that have a detection.
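The four metrics above can be computed as sketched below. The box format `(x, y, w, h)`, the use of `None` for frames without a detection, and the choice to average over detected frames only are illustrative assumptions not spelled out in the text.

```python
import math

def tracking_metrics(detections, ground_truths, dp_threshold=20):
    """Sketch of CLE, DP, EinGT, and detection count for one target.
    Boxes are (x, y, w, h) with top-left corners; None marks a frame
    with no detection."""
    cles, within, inside, detected = [], 0, 0, 0
    for det, gt in zip(detections, ground_truths):
        if det is None:
            continue
        detected += 1
        dcx, dcy = det[0] + det[2] / 2, det[1] + det[3] / 2
        gcx, gcy = gt[0] + gt[2] / 2, gt[1] + gt[3] / 2
        err = math.hypot(dcx - gcx, dcy - gcy)
        cles.append(err)                              # Center Location Error
        within += err <= dp_threshold                 # DP: centroid within 20 px
        inside += (gt[0] <= dcx <= gt[0] + gt[2] and  # EinGT: centroid inside GT box
                   gt[1] <= dcy <= gt[1] + gt[3])
    n = max(detected, 1)
    return {"CLE": sum(cles) / n, "DP": within / n,
            "EinGT": inside / n, "detections": detected}
```

Applied per target, this yields rows of the form shown in Tables 2 through 4 (CLE, DP, EinGT, frames with detection).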
[0045] Classification metrics are presented as confusion matrices, in which the correct and false detection percentages are tabulated.
Histogram Matching
[0046] Since histogram matching is an important step in both the training and testing processes, the idea is briefly summarized below. The idea is to histogram match all frames to a common reference with a fixed mean and standard deviation. Two histogram matching steps are implemented: one for the whole frame and another for the image patches inside the bounding boxes.
[0047] For the whole frame, the formula is given by:

F_i-new = (F_i − mean(F_i)) / std(F_i) × std(R) + mean(R)

where F_i is the i-th frame before histogram matching, F_i-new is the i-th frame after matching, and R is a reference frame selected by the user.
[0048] For the patch inside the bounding box, the formula is given by:

P_i-new = (P_i − mean(P_i)) / std(P_i) × std(R) + mean(R)

where P_i is the patch containing objects; the patches are detected by the YOLO tracker.
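The matching formula above (a mean/standard-deviation normalization toward a reference) translates directly into code. Representing frames as flat lists of pixel values is an assumption made here for brevity.

```python
from statistics import mean, pstdev

def histogram_match(frame, reference):
    """Sketch of the matching step above: shift and scale a frame's
    pixel values so its mean and standard deviation match those of the
    user-selected reference frame. Frames are flat lists of pixels."""
    f_mean, f_std = mean(frame), pstdev(frame)
    r_mean, r_std = mean(reference), pstdev(reference)
    return [(p - f_mean) / f_std * r_std + r_mean for p in frame]
```

The same function serves both steps: applied to whole frames before training, and to the cropped patches produced by the YOLO tracker before classification.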
YOLO Tracker
[0049] The YOLO tracker [2] is fast and has performance similar to Faster R-CNN [3]. In the present invention, YOLO is chosen because it is easy to install and is compatible with the hardware, on which Faster R-CNN proved difficult to install and run. The training of YOLO is quite simple: only images with ground-truth target locations are needed.
[0050] YOLO has 24 convolutional layers followed by 2 fully connected layers; details can be found in [2]. The input images are resized to 448×448. YOLO has some built-in capability to deal with different target sizes and illuminations. However, it is found that histogram matching is essential to make the tracker more robust to illumination changes.
[0051] YOLO also comes with a classification module. However, based on evaluations, its classification accuracy is not as good as that of ResNet, as can be seen in the results below. This is perhaps due to a lack of training data. In the training of YOLO, several important techniques need to be applied. First, burn-in is critical: the model is trained using a very small learning rate for about one thousand epochs. Second, performance is better if all the targets/objects are lumped into a single model.
ResNet Classifier
[0052] The ResNet-18 model [4] is an 18-layer convolutional neural network (CNN) that avoids the performance saturation and/or degradation that commonly occurs in other CNN architectures when training deeper layers. The ResNet-18 model avoids this saturation by implementing an identity shortcut connection: it skips one or more layers and learns the residual mapping of the layer rather than the original mapping.
Training of ResNet requires target patches. The targets are cropped from training videos, and mirror images are then created. Data augmentation is then performed using scaling (larger and smaller), rotation (every 45 degrees), and illumination (brighter and dimmer) to create more training data. For each cropped target, the present invention can create a data set with 64 augmented images.
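The 64-image figure above follows from enumerating the augmentation grid: 2 mirror states × 8 rotations (every 45 degrees) × 2 scales × 2 illumination levels. The sketch below only enumerates the parameter combinations; the actual image operations, and the assumption that every factor combines with every other, are illustrative.

```python
from itertools import product

# Augmentation grid implied by the description: mirror x rotation x
# scale x illumination. Factor values are labels, not image transforms.
mirrors = ["original", "mirrored"]
rotations = list(range(0, 360, 45))      # every 45 degrees: 8 angles
scales = ["larger", "smaller"]
illuminations = ["brighter", "dimmer"]

augmentations = list(product(mirrors, rotations, scales, illuminations))
print(len(augmentations))  # 2 * 8 * 2 * 2 = 64
```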
Conventional Tracker Results
[0053] Some tracking results using a conventional tracker known as STAPLE [5] are presented first. STAPLE requires the target location to be known in the first frame; after that, it learns the target model online and tracks the target. Two cases are shown here: PCE Full and PCE 50%.
[0054] The STAPLE results for these two cases are shown in the drawings.
YOLO Results
[0055] Results where training was done using only one video, called Video 4, and testing was done using another video, called Video 5, are shown in Table 2 through Table 4 below.
TABLE 2. Tracking metrics for PCE Full. Train using Video 4 and test using Video 5.

             CLE    DP    EinGT   Frames with detection
Ram          5.07    1     0.93          74/89
Frontier     4.59    1     0.74          82/89
Silverado    5.69    1     0.98          63/89
TABLE 3. Tracking metrics for PCE 50. Train using Video 4 and test using Video 5.

             CLE    DP    EinGT   Frames with detection
Ram          7.08    1     0.97          74/89
Frontier     6.1     1     0.75          83/89
Silverado    6.56    1     1             65/89
TABLE 4. Tracking metrics for PCE 25. Train using Video 4 and test using Video 5.

             CLE    DP    EinGT   Frames with detection
Ram          9.07    1     0.97          39/89
Frontier     6.85    1     0.74          67/89
Silverado   10.9     1     0.88          33/89
Classification Using ResNet
[0056] Here, two classifiers are applied: YOLO and ResNet. It should be noted that classification is performed only when there are good detection results from the YOLO tracker. For some frames in the PCE 50 and PCE 25 cases, there may not be positive detection results, and for those frames no classification result is generated.
[0057] Similar to the tracking case, the training was done by using Video 4 and the testing was done by using Video 5. Table 5-Table 7 show the classification results using YOLO and ResNet. The first observation is that the ResNet performance is better than that of YOLO. The second observation is that the classification performance deteriorates with high missing rates. The third observation is that Ram and Silverado have lower classification rates. This is because Ram and Silverado have similar appearances. A fourth observation is that the results in Table 7 appear to be better than other cases. This may be misleading, as the classification is done only for frames with good detection.
TABLE 5. Classification results for PCE Full case. Video 4 for training and Video 5 for testing.

(a) YOLO classifier outputs
             Ram   Frontier   Silverado   Accuracy
Ram           13         10          50     0.1781
Frontier       9         69           3     0.8519
Silverado     55          0           7     0.1129

(b) ResNet classifier outputs
             Ram   Frontier   Silverado   Accuracy
Ram           48         17           9     0.6486
Frontier      15         67           0     0.8171
Silverado     16         19          28     0.4444
TABLE 6. Classification results for PCE 50 case. Video 4 for training and Video 5 for testing.

(a) YOLO classifier outputs
             Ram   Frontier   Silverado   Accuracy
Ram           15         37          19     0.2113
Frontier       8         75           0     0.9036
Silverado     60          0           5     0.0769

(b) ResNet classifier outputs
             Ram   Frontier   Silverado   Accuracy
Ram           26          5          43     0.3514
Frontier       9         53          21     0.6386
Silverado     11          1          53     0.8154
TABLE 7. Classification results for PCE 25 case. Video 4 for training and Video 5 for testing.

(a) YOLO classifier outputs
             Ram   Frontier   Silverado   Accuracy
Ram           18         14           7     0.4615
Frontier      15         50           2     0.7463
Silverado     28          0           5     0.1515

(b) ResNet classifier outputs
             Ram   Frontier   Silverado   Accuracy
Ram           29          9           1     0.7436
Frontier       0         69           0     1.0000
Silverado     17          3          13     0.3939
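The accuracy column in Tables 5 through 7 is the diagonal count over the row total (rows are true classes, columns predicted classes). A minimal sketch of that computation, using the YOLO row data from Table 5(a):

```python
def per_class_accuracy(matrix, labels):
    """Per-class accuracy from a confusion matrix: each row is a true
    class, each column a predicted class, and accuracy is the diagonal
    entry divided by the row total."""
    return {label: row[i] / sum(row)
            for i, (label, row) in enumerate(zip(labels, matrix))}

# Rows taken from Table 5(a), YOLO classifier outputs:
yolo_pce_full = [[13, 10, 50], [9, 69, 3], [55, 0, 7]]
acc = per_class_accuracy(yolo_pce_full, ["Ram", "Frontier", "Silverado"])
print(round(acc["Ram"], 4))  # 0.1781
```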
Post-Classification Enhancement Step
[0058] To further increase classification performance, a decision-level fusion based on voting is proposed. At a particular instant, the classification decision is based on all decisions made in the past N frames, i.e., majority voting: the class label with the most votes is selected as the decision at the current instant.
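The majority-voting fusion above can be sketched as follows. The window size N is left unspecified in the text, so the default here is an arbitrary assumption, as is the tie-breaking behavior (most recent majority via insertion order).

```python
from collections import Counter, deque

def majority_vote_stream(labels, window=5):
    """Sketch of decision-level fusion by majority voting: at each
    instant, report the class with the most votes among the per-frame
    classifier decisions in the past `window` frames."""
    history = deque(maxlen=window)
    fused = []
    for label in labels:
        history.append(label)
        fused.append(Counter(history).most_common(1)[0][0])
    return fused
```

For example, an isolated misclassification of a Ram as a Frontier is outvoted by the surrounding frames, smoothing the per-frame errors visible in Tables 5 through 7.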
[0059] It will be apparent to those skilled in the art that various modifications and variations can be made to the system and method of the present disclosure without departing from the scope or spirit of the disclosure. It should be understood that the illustrated embodiments are only preferred examples describing the invention and should not be taken as limiting the scope of the invention.
REFERENCES
[0060] [1] J. Zhang, T. Xiong, T. Tran, S. Chin, and R. Etienne-Cummings, "Compact all-CMOS spatio-temporal compressive sensing video camera with pixel-wise coded exposure," Optics Express, vol. 24, no. 8, pp. 9013-9024, April 2016.
[0061] [2] J. Redmon and A. Farhadi, "YOLOv3: An Incremental Improvement," arXiv, April 2018.
[0062] [3] S. Ren et al., "Faster R-CNN: Towards real-time object detection with region proposal networks," Advances in Neural Information Processing Systems, 2015.
[0063] [4] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," Conference on Computer Vision and Pattern Recognition, 2016.
[0064] [5] L. Bertinetto et al., "Staple: Complementary Learners for Real-Time Tracking," Conference on Computer Vision and Pattern Recognition, 2016.
[0065] [6] M. Dao, C. Kwan, K. Koperski, and G. Marchisio, "A Joint Sparsity Approach to Tunnel Activity Monitoring Using High Resolution Satellite Images," IEEE Ubiquitous Computing, Electronics & Mobile Communication Conference, pp. 322-328, New York City, October 2017.
[0066] [7] J. Zhou, B. Ayhan, C. Kwan, and T. Tran, "ATR Performance Improvement Using Images with Corrupted or Missing Pixels," Proc. SPIE 10649, Pattern Recognition and Tracking XXIX, 106490E, 30 Apr. 2018.