ALGORITHMIC PIPELINE AND GENERAL PRINCIPLES OF COST-EFFICIENT AUTOMATIC PLATE RECOGNITION SYSTEM FOR RESOURCE-CONSTRAINED EMBEDDED DEVICES

Abstract

The present disclosure proposes a method for recognizing a license plate number on an image, comprising the following steps: detecting a license plate on a vehicle in an image; recognizing characters on the license plate and coordinates of the characters by a neural network, wherein the loss function of the neural network consists of classification loss, confidence loss and Complete Intersection over Union (CIOU) loss; and organizing the recognized characters based on the coordinates to form the recognized license plate number. By this method, when recognition of a license plate number on an image is performed, the network bandwidth and the hardware cost may be greatly reduced.

Claims

1. A method (100) for recognizing a license plate number on an image, comprising the following steps: detecting (101) a license plate on a vehicle in an image; recognizing (102) characters on the license plate and coordinates of the characters by a neural network, wherein the loss function of the neural network consists of classification loss, confidence loss and Complete Intersection over Union (CIOU) loss; and organizing (103) the recognized characters based on the coordinates to form the recognized license plate number.

2. The method (100) of claim 1, wherein the loss function is: $LOSS = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v - \sum_{i = 0}^{S^{2}} \sum_{j = 0}^{B} I_{ij}^{obj} [\hat{C}_{i} \log (C_{i}) + (1 - \hat{C}_{i}) \log (1 - C_{i})] - \lambda_{noobj} \sum_{i = 0}^{S^{2}} \sum_{j = 0}^{B} I_{ij}^{noobj} [\hat{C}_{i} \log (C_{i}) + (1 - \hat{C}_{i}) \log (1 - C_{i})] - \sum_{i = 0}^{S^{2}} I_{ij}^{obj} \sum_{c \in classes} [\hat{p}_{i}(c) \log (p_{i}(c)) + (1 - \hat{p}_{i}(c)) \log (1 - p_{i}(c))]$

3. The method (100) of claim 1, wherein for a positive match prediction of a character, the confidence loss is penalized according to the confidence score of the class of the character; and for a negative match prediction of a character, the confidence loss is penalized according to the softmax loss over confidences of multiple classes in the equation below: $L_{conf}(x, c) = - \sum_{i \in Pos}^{N} x_{ij}^{p} \log (\hat{c}_{i}^{p}) - \sum_{i \in Neg} \log (\hat{c}_{i}^{0}), \quad \text{where} \quad \hat{c}_{i}^{p} = \frac{\exp (c_{i}^{p})}{\sum_{p} \exp (c_{i}^{p})}$ where N is the number of matched default boxes.

4. The method (100) of claim 1, wherein the detecting and recognizing steps are performed on each of multiple images from adjacent frames in a video about the vehicle, and the method further comprising, for each character position: comparing characters from that position on the multiple images with each other; choosing the character which occurs most often as the recognized character on that position.

5. The method (100) of claim 1, wherein the detecting, recognizing and organizing steps are performed on each of multiple images from adjacent frames in a video about the vehicle, to form multiple license plate numbers, and the method further comprising: assigning all license plate numbers that have low edit-distance from the multiple license plate numbers to a cluster; and choosing the license plate number which occurs most often in the cluster as the recognized license plate number.

6. The method (100) of claim 1, wherein the recognized license plate number is a two-line license plate number.

7. The method (100) of claim 1, wherein the neural network predicts 35 classes, including digits 0-9 and letters A-Z, for a character, wherein the letter O and the digit 0 are treated as one class.

8. The method (100) of claim 1, further comprising, before the detecting step: detecting the vehicle on a photo or a video frame about the vehicle; and cropping a region including the detected vehicle from the photo or the video frame as the image.

9. A device (600) for recognizing a license plate number on an image, comprising: a processor (601); and a memory (602), having stored instructions that when executed by the processor cause the device to perform the method of claim 1.

10. A machine-readable medium, having stored thereon instructions, that when executed on a device cause the device to perform the method of claim 1.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0015] The above and other aspects, features, and benefits of the present disclosure will become more fully apparent from the following detailed description with reference to the accompanying drawings, in which like reference numerals or letters are used to designate like or equivalent elements. The drawings are illustrated for facilitating better understanding of the embodiments of the disclosure and not necessarily drawn to scale, in which:

[0016] FIG. 1 illustrates a flowchart of the method for recognizing a license plate number on an image according to the present disclosure;

[0017] FIG. 2 illustrates an example architecture of a neural network for detecting a vehicle;

[0018] FIG. 3 illustrates an example architecture of a neural network for detecting a license plate;

[0019] FIG. 4 illustrates an example architecture of a neural network for recognizing characters on a license plate;

[0020] FIG. 5 is a schematic block diagram of a device according to the present disclosure.

[0021] FIG. 6 is another schematic block diagram of a device according to the present disclosure.

DETAILED DESCRIPTION

[0022] Embodiments herein will be described more fully hereinafter with reference to the accompanying drawings. The embodiments herein may, however, be embodied in many different forms and should not be construed as limiting the scope of the appended claims.

[0023] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises", "comprising", "includes" and/or "including", when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

[0024] Also, use of ordinal terms such as first, second, third, etc., herein to modify an element does not by itself connote any priority, precedence, or order of one element over another or the temporal order in which acts of a method are performed, but are used merely as labels to distinguish one element having a certain name from another element having a same name (but for use of the ordinal term) to distinguish the elements.

[0025] Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood. It will be further understood that terms used herein should be interpreted as having a meaning that is consistent with their meaning in the context of this specification and the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

[0026] A flowchart of a method 100 for recognizing a license plate number on an image according to the present disclosure is shown in FIG. 1. The method 100 comprises the following steps: a step 101 of detecting a license plate on a vehicle in an image; a step 102 of recognizing characters on the license plate and coordinates of the characters by a neural network, wherein the loss function of the neural network consists of classification loss, confidence loss and Complete Intersection over Union (CIOU) loss; and a step 103 of organizing the recognized characters based on the coordinates to form the recognized license plate number.

[0027] The method according to the present disclosure may employ state-of-the-art object detectors based on Convolutional Neural Networks (CNNs). The networks may be trained using images from several open-source datasets, as well as images that were manually collected and labeled by UrbanChain Group Limited. The computational efficiency of the method allows it to be run on a wide range of resource-constrained embedded devices (also known as IoT devices), including systems-on-chip (SoC) based on ARM or Intel mobile processors, with an on-board camera connected to the SoC via a MIPI-CSI interface.

[0028] The amount of shared CPU-GPU RAM consumed by the proposed method is estimated to be as low as 1.5 GB. Because all processing is performed on the device, the method does not rely on over-the-network video transfer and does not employ any image processing on cloud servers, thus minimizing data usage, bandwidth requirements and overall cost of ownership.

[0029] Embodiments of the method according to the present disclosure will be described below. They may involve some preprocessing and/or preconditions, such as taking photos which each contain an image of a moving vehicle, and selecting one or more relatively clear photos from the taken photos. However, it can be understood that the embodiments can also be applied without such preprocessing and preconditions, for example, when the image of a vehicle comes directly from a database (instead of a photo which was just taken), when the vehicle is static, and/or when the image itself is clear enough.

[0030] As described above, the step 101 is performed on an image of a vehicle. The image may come from e.g., a photo. For example, a plurality of photos of a moving vehicle may be taken with at least one camera, in e.g., a parking lot, wherein the photos taken with the same camera are taken sequentially, each at a different point in time. In an example, the plurality of photos comprises 5-10 photos. In another example, the plurality of photos may also be adjacent frames of a video of the vehicle. In another example, the photos are taken with two cameras connected to the same acquisition device, with each camera having a different angle of view and a different height above ground level, to account for possible glare from the vehicle's headlights, which varies with viewing angle and degrades the quality of the taken photos.

[0031] If some of the taken photos are blurry, it may be necessary to reduce motion blur in order to accurately recognize symbols on a license plate. One solution to this problem could be a camera with a high shutter speed, but such hardware could be quite expensive to deploy at scale. It is proposed to solve the problem algorithmically, e.g., by selecting one or more relatively clear photos from the plurality of photos for further processing, wherein the selection may be based on comparing at least one calculated first parameter, associated with blurriness of the photos, with a threshold. Preferably the first parameter is a variance of the photo. Preferably edge detection is performed on the plurality of photos before the first parameter is calculated. Preferably a Laplacian kernel ($D_{xy}^{2}$) is used for edge detection.

[0032] For example, the plurality of (for example, 5 to 10) consecutively taken photos may be placed in a buffer, and their measures of blurriness may be calculated by sliding a Laplacian kernel over each photo and then calculating the variance of the resulting matrix. The Laplacian is a 2D isotropic measure of the second spatial derivative of a photo. The Laplacian of a photo highlights regions of rapid intensity change and is therefore often used for edge detection. The Laplacian L(x, y) of a photo with pixel intensity values I(x, y) is given by:

[00001] $\begin{matrix} L(x, y) = \frac{\partial^{2} I}{\partial x^{2}} + \frac{\partial^{2} I}{\partial y^{2}} & (1) \end{matrix}$

[0033] Since the input photo is represented as a set of discrete pixels, we may find a discrete convolution kernel that can approximate the second derivatives in (1):

[00002] $\begin{matrix} D_{xy}^{2} = \begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix} & (2) \end{matrix}$

[0034] Using kernel (2), the Laplacian can be calculated using standard convolution methods. The variance of the Laplacian can serve as a reliable measure of photo blurriness: a sharp photo contains many strong edges and therefore yields a high variance, whereas a blurry photo yields a low one. Comparing this variance with a suitable threshold allows the relatively clear photo(s) to be selected for further processing.
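
The blur measure described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation of the disclosure: the function names, the handling of border pixels (they are simply skipped) and the threshold value are assumptions made for the example.

```python
# Illustrative sketch: names, border handling and threshold are assumptions.
LAPLACIAN_KERNEL = [[0, 1, 0],
                    [1, -4, 1],
                    [0, 1, 0]]


def laplacian_variance(gray):
    """Return the variance of the Laplacian response of a grayscale image.

    `gray` is a list of rows of pixel intensities. Border pixels are skipped
    so that the 3x3 kernel always fits inside the image.
    """
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            acc = 0
            for ky in range(3):
                for kx in range(3):
                    acc += LAPLACIAN_KERNEL[ky][kx] * gray[y + ky - 1][x + kx - 1]
            responses.append(acc)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def select_clear_photos(photos, threshold):
    """Keep the photos whose Laplacian variance is at least `threshold`."""
    return [p for p in photos if laplacian_variance(p) >= threshold]


# A checkerboard has strong edges everywhere; a flat gray patch has none.
sharp = [[255 if (x + y) % 2 else 0 for x in range(6)] for y in range(6)]
flat = [[128] * 6 for _ in range(6)]
clear = select_clear_photos([sharp, flat], threshold=100.0)  # keeps only `sharp`
```

In practice the threshold would have to be tuned on photos from the target camera, since the absolute variance depends on resolution, exposure and scene content.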

[0035] Then, in an example, the vehicle may be detected on each of the selected photo(s), by means of e.g., a trained neural network with a custom architecture optimized for efficient inference on low-resource devices. The image region containing the detected vehicle may then be cropped and used as the image on which the method of the present disclosure is applied.

[0036] An example architecture of a neural network for detecting a vehicle is shown in FIG. 2, which describes parameters and dimensions of individual layers. The input size of the photo being fed to the network was adapted to a non-square frame size (e.g., 448×288 pixels) that can be acquired from a typical camera. The network was trained on a private dataset of approximately 5000 photos collected and labeled by UrbanChain Group Limited. The photos were taken in unconstrained scenarios of road traffic and within parking lots. A range of data augmentation strategies, such as flipping, rescaling, blurring and the cut-mix approach, was used to train the network, thus preventing overfitting by creating synthetic photos with different characteristics from a single labeled photo.

[0037] In an example, after the image region containing the vehicle is cropped from the photo, the cropped image region may act as the image processed by the method according to the present disclosure. The license plate on the cropped image region of each of the selected photos may be detected by means of e.g., a trained neural network whose architecture is designed for efficient inference on mobile processors. For example, a neural network for detecting a license plate, which may employ ideas from the YOLOv3 paper [J. Redmon and A. Farhadi, YOLOv3: An incremental improvement, CoRR, vol. abs/1804.02767, 2018], was trained on thousands of manually collected and labeled real-world photos and adapted to be executed in quasi-real-time on low-resource embedded devices based on e.g., ARM chips. An example architecture of the neural network for detecting a license plate is shown in FIG. 3. The resolution of resized input images was set to e.g., 608×608 during model training, and to e.g., 416×416 during the inference stage, to speed up computations on embedded platforms.

[0038] Once the license plate is detected in the image, characters on the license plate and coordinates of the characters may be recognized by a neural network with e.g., a custom architecture designed for efficient computation on low-resource devices. For fast and resource-efficient character detection in real time, the set and order of layers, as well as the corresponding number of parameters (including convolutional filters and their sizes) in each layer of the neural network, were carefully selected to minimize the total number of multiply-add operations involved. Said network architecture may be considered a critical part of the proposed solution. For example, the network architecture has shown a license plate recognition accuracy of 99.7% during a field test conducted on approximately 2000 license plate numbers recognized from photos taken in unconstrained real-world scenarios by the acquisition device, both during the day and at night.

[0039] An example layer-by-layer description of the proposed neural network architecture for recognizing the characters on the license plate is shown in FIG. 4. The associated computational costs, in terms of the number of multiply-add operations at each stage (i.e., each layer of the neural network), are also listed in FIG. 4.

[0040] The neural network was trained to predict e.g., 35 classes (0-9, A-Z, where the letter O is detected/recognized jointly with the digit 0) and outputs a class and pixel coordinates of each recognized character. In an embodiment, the total loss function used as an optimization criterion during training of the neural network consists of classification loss, confidence loss and Complete Intersection over Union (CIOU) loss and has the following form:

[00003] $LOSS = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v - \sum_{i = 0}^{S^{2}} \sum_{j = 0}^{B} I_{ij}^{obj} [\hat{C}_{i} \log (C_{i}) + (1 - \hat{C}_{i}) \log (1 - C_{i})] - \lambda_{noobj} \sum_{i = 0}^{S^{2}} \sum_{j = 0}^{B} I_{ij}^{noobj} [\hat{C}_{i} \log (C_{i}) + (1 - \hat{C}_{i}) \log (1 - C_{i})] - \sum_{i = 0}^{S^{2}} I_{ij}^{obj} \sum_{c \in classes} [\hat{p}_{i}(c) \log (p_{i}(c)) + (1 - \hat{p}_{i}(c)) \log (1 - p_{i}(c))]$
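
The leading terms of the loss form the CIOU box-regression component. The sketch below computes it for a single predicted box and its ground-truth box. It assumes the commonly used CIoU definition, which the disclosure does not spell out: the fraction is the squared distance between box centres divided by the squared diagonal of the smallest enclosing box, v measures aspect-ratio mismatch, and the weight on v is v / ((1 - IoU) + v). The function name and the (x1, y1, x2, y2) box format are likewise assumptions.

```python
import math


def ciou_loss(box, gt):
    """CIoU loss for two boxes given as (x1, y1, x2, y2), x2 > x1 and y2 > y1."""
    # Intersection over union.
    iw = max(0.0, min(box[2], gt[2]) - max(box[0], gt[0]))
    ih = max(0.0, min(box[3], gt[3]) - max(box[1], gt[1]))
    inter = iw * ih
    w, h = box[2] - box[0], box[3] - box[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    iou = inter / (w * h + wg * hg - inter)

    # Squared centre distance over squared diagonal of the enclosing box.
    rho2 = ((box[0] + box[2]) / 2 - (gt[0] + gt[2]) / 2) ** 2 \
         + ((box[1] + box[3]) / 2 - (gt[1] + gt[3]) / 2) ** 2
    cw = max(box[2], gt[2]) - min(box[0], gt[0])
    ch = max(box[3], gt[3]) - min(box[1], gt[1])
    c2 = cw ** 2 + ch ** 2

    # Aspect-ratio consistency term v and its trade-off weight alpha.
    v = (4 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(w / h)) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0
    return 1 - iou + rho2 / c2 + alpha * v


perfect = ciou_loss((0, 0, 2, 2), (0, 0, 2, 2))   # identical boxes
disjoint = ciou_loss((0, 0, 2, 2), (2, 0, 4, 2))  # same shape, no overlap
```

Unlike plain IoU, the distance term keeps the loss informative for boxes that do not overlap at all, which is why the disjoint pair above is penalized more than a loss of exactly 1.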

[0041] The above-mentioned confidence score is an output of the neural network. The M-dimensional vector of confidence scores (where M is the total number of available classes, i.e., letters and digits in this particular example) is a slice of the initial vector of parameters predicted for each detection box, as are the detection box coordinates. This initial vector is obtained through a sequence of algebraic operations commonly employed in neural networks (e.g., matrix addition, multiplication, concatenation, pooling, etc.), using the set of weights that results from non-convex optimization of the target loss function by a variant of the Stochastic Gradient Descent (SGD) algorithm, e.g., SGD with Momentum. In other words, the confidence loss is the loss in making a class prediction. In an embodiment, for a positive match prediction of a character, the confidence loss may be penalized according to e.g., the confidence score of the class of the character; for a negative match prediction of a character, the confidence loss may be penalized according to e.g., the confidence score of a special class which indicates that no object is detected. The confidence loss may be calculated as the softmax loss over confidences c of multiple classes (class scores), as indicated below in (3).

[00004] $\begin{matrix} L_{conf}(x, c) = - \sum_{i \in Pos}^{N} x_{ij}^{p} \log (\hat{c}_{i}^{p}) - \sum_{i \in Neg} \log (\hat{c}_{i}^{0}), \quad \text{where} \quad \hat{c}_{i}^{p} = \frac{\exp (c_{i}^{p})}{\sum_{p} \exp (c_{i}^{p})} & (3) \end{matrix}$

where N is the number of matched default boxes (in the paradigm of Single-Shot Object Detector, which is used for building the object detector).
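
As a rough illustration of (3), the following sketch evaluates the confidence loss for a handful of boxes. It assumes that index 0 of each score vector is the special "no object" class and that each positive box comes with the index of its matched class; the function names and data layout are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_loss(pos, neg):
    """Confidence loss over matched (positive) and unmatched (negative) boxes.

    `pos` is a list of (scores, class_index) pairs, one per matched box.
    `neg` is a list of score vectors, one per unmatched box.
    Index 0 is assumed to be the "no object" class.
    """
    loss = 0.0
    for scores, cls in pos:
        loss -= math.log(softmax(scores)[cls])   # penalize the matched class
    for scores in neg:
        loss -= math.log(softmax(scores)[0])     # penalize the no-object class
    return loss


# With uniform scores over 4 classes every probability is 1/4.
uniform = [0.0, 0.0, 0.0, 0.0]
example = confidence_loss(pos=[(uniform, 1)], neg=[uniform])
```

A box that assigns high probability to the correct class (or, for a negative box, to "no object") contributes a value close to zero, so the loss falls as predictions become more confident and correct.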

[0042] In addition, to minimize errors arising at the character recognition stage, various techniques may be employed, e.g., at both the character level and the string level (given a string representation of the license plate number).

[0043] For example, non-maximum suppression may be an important process for the character recognition. The network may generate a plurality of detection boxes for the same single character, wherein a detection box is an area of an image which comprises all pixels of a character. Initially, all detection boxes are sorted on the basis of the confidence score (a second parameter) calculated for each box. The detection box with the maximum confidence score is selected, and all remaining detection boxes having a significant overlap with that detection box, as determined by a third parameter, are suppressed (i.e., discarded). The third parameter is equal to the area of intersection of a pair of detection boxes divided by the total area covered by the pair of detection boxes. For each selected detection box, the third parameter of each pair consisting of the selected detection box and another remaining detection box is compared with a predefined threshold value. If the third parameter is higher than the predefined threshold value, said other remaining detection box is discarded. This process is repeated on the remaining detection boxes until none are left. Finally, all the selected detection boxes may be used as the final detection boxes for the characters on the license plate.
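
The suppression procedure above can be sketched as follows. This is an illustrative greedy implementation, not code from the disclosure: the (x1, y1, x2, y2) box format, the function names and the default threshold of 0.5 are assumptions.

```python
def iou(a, b):
    """Intersection area divided by the total area covered by boxes a and b."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_maximum_suppression(detections, iou_threshold=0.5):
    """Greedy NMS over (box, score) pairs, with box = (x1, y1, x2, y2)."""
    # Sort by confidence score, highest first.
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)          # most confident box still available
        kept.append(best)
        # Discard every box that overlaps the selected one too strongly.
        remaining = [d for d in remaining if iou(best[0], d[0]) <= iou_threshold]
    return kept


# Two overlapping boxes for the same character and one separate character.
dets = [((0, 0, 10, 20), 0.9), ((1, 0, 11, 20), 0.8), ((30, 0, 40, 20), 0.7)]
final = non_maximum_suppression(dets)    # the 0.8 box is suppressed
```

Because characters on a plate sit side by side with little overlap, a moderate threshold usually removes duplicate boxes without discarding neighbouring characters; the right value would still need to be checked on real plates.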

[0044] In an embodiment, the detecting and recognizing steps in the method according to the present disclosure are performed on each of multiple images from adjacent frames in a video about the vehicle, and the method further comprises, for each character position: comparing characters from that position on the multiple images with each other; choosing the character which occurs most often as the recognized character on that position.
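
A minimal sketch of this per-position vote is given below. It assumes that the readings from the individual frames are already available as strings and, as a simplification not stated in the disclosure, that readings whose length differs from the most common length are discarded so that positions line up.

```python
from collections import Counter


def vote_per_position(readings):
    """Pick, for each character position, the character seen most often."""
    # Assumption: only readings of the most common length are position-aligned.
    length = Counter(len(r) for r in readings).most_common(1)[0][0]
    aligned = [r for r in readings if len(r) == length]
    # zip(*aligned) yields the characters found at each position across frames.
    return "".join(Counter(chars).most_common(1)[0][0] for chars in zip(*aligned))


# The second frame misreads B as 8 and the fourth frame drops a character.
frames = ["AB123", "A8123", "AB123", "AB12"]
plate = vote_per_position(frames)
```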

[0045] Further, to deal with an incorrectly recognized license plate number, well-known string processing algorithms, such as the Levenshtein edit distance, may be employed to filter out errors resulting from e.g., motion blur of the target vehicle, various occlusions and lighting issues. The Levenshtein algorithm calculates the edit distance, i.e., the least number of edit operations that are necessary to transform one string into another.

[0046] Hence, in another embodiment, the detecting, recognizing and organizing steps in the method according to the present disclosure are performed on each of multiple images from adjacent frames in a video about the vehicle, to form multiple license plate numbers, and the method further comprises: assigning all license plate numbers that have low edit-distance (such as 2) from the multiple license plate numbers to a cluster; and choosing the license plate number which occurs most often in the cluster as the recognized license plate number. The resulting license plate number may then be sent to a backend server, with some additional metadata payload. This step may reduce the risk of obtaining a wrong character from the image, and it allows missing data to be recovered from adjacent images.
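
One possible reading of this clustering step is sketched below. The disclosure does not fix how clusters are formed, so the greedy scheme used here (each reading joins the first cluster whose first member is within the distance limit, and the largest cluster wins) is an assumption, as are the function names.

```python
from collections import Counter


def levenshtein(a, b):
    """Least number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def consensus_plate(readings, max_distance=2):
    """Cluster similar readings and return the mode of the largest cluster."""
    clusters = []
    for reading in readings:
        for cluster in clusters:
            # Assumption: the first member acts as the cluster representative.
            if levenshtein(cluster[0], reading) <= max_distance:
                cluster.append(reading)
                break
        else:
            clusters.append([reading])
    largest = max(clusters, key=len)
    return Counter(largest).most_common(1)[0][0]


# Three noisy variants of one plate plus one unrelated outlier.
readings = ["AB1234", "A81234", "AB1234", "AB123", "ZZ9999"]
plate = consensus_plate(readings)
```

The outlier ends up in its own cluster and is ignored, while the near-duplicates reinforce the most frequent reading of the true plate.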

[0047] The recognized characters may be arranged to form e.g., a one-line or two-line license plate number. In an embodiment, the recognized license plate number is a two-line license plate number. In this embodiment, initially, each character may be assigned a position on the second line by default. In order to distribute the characters reliably between the two lines, each character of the license plate number is compared with its two preceding and two succeeding characters, by calculating the difference between the Y-coordinates of the centers of the detection boxes for each pair of characters. If said difference exceeds 80% of the mean height (in pixels) of the character boxes (calculated by averaging the heights of all the detected character boxes), the character with the smaller Y-coordinate is placed on the first line.
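
The line-assignment rule can be illustrated with the sketch below. It assumes that "preceding" and "succeeding" refer to the left-to-right order of the box centers, that image Y-coordinates grow downwards (so a smaller Y means the upper line), and that each character is given as a (label, x_center, y_center, height) tuple; none of these details are fixed by the disclosure.

```python
def arrange_two_lines(chars, ratio=0.8):
    """Split characters into (first_line, second_line) strings.

    `chars` is a list of (label, x_center, y_center, height) tuples.
    Every character starts on the second line by default.
    """
    if not chars:
        return "", ""
    ordered = sorted(chars, key=lambda c: c[1])           # left to right
    mean_height = sum(c[3] for c in ordered) / len(ordered)
    first_line = set()
    for i, ch in enumerate(ordered):
        # Compare with up to two preceding and two succeeding characters.
        for k in range(max(0, i - 2), min(len(ordered), i + 3)):
            if k == i:
                continue
            if abs(ch[2] - ordered[k][2]) > ratio * mean_height:
                # The character with the smaller Y goes to the first line.
                first_line.add(i if ch[2] < ordered[k][2] else k)
    line1 = "".join(c[0] for i, c in enumerate(ordered) if i in first_line)
    line2 = "".join(c[0] for i, c in enumerate(ordered) if i not in first_line)
    return line1, line2


# "AB" on the upper line (y = 10) above "1234" on the lower line (y = 40).
boxes = [("1", 5, 40, 20), ("A", 10, 10, 20), ("2", 15, 40, 20),
         ("3", 25, 40, 20), ("B", 30, 10, 20), ("4", 35, 40, 20)]
lines = arrange_two_lines(boxes)
```

For a one-line plate all Y-coordinates are nearly equal, no difference exceeds the limit, and every character stays on the second line, so the same routine handles both layouts.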

[0048] In a further embodiment, a system is proposed which comprises at least one server and at least one acquisition device, the acquisition device comprising at least one camera, a processor, a memory and a power source. The acquisition device is configured to perform the method according to the present disclosure and to transfer at least the license plate number to the server. Due to its high computational efficiency, the method can be run on a relatively low-power device, and it is thus possible to perform most of the computations on the acquisition device. This makes it possible to transmit smaller data packages (the license plate number, metadata, etc., rather than a series of raw photos). It also makes the server significantly cheaper, owing to lower requirements on available computing power and memory.

[0049] FIG. 5 illustrates a schematic block diagram of a device 500 according to the present disclosure. The device 500 is used for recognizing a license plate number on an image, and may include: a detecting unit 501, for detecting a license plate on a vehicle in an image; a recognizing unit 502, for recognizing characters on the license plate and coordinates of the characters by a neural network, wherein the loss function of the neural network consists of classification loss, confidence loss and Complete Intersection over Union (CIOU) loss; and an organizing unit 503, for organizing the recognized characters based on the coordinates to form the recognized license plate number.

[0050] It can be appreciated that, the device 500 described herein may be implemented by various units, so that the device 500 implementing one or more functions described with the embodiments may comprise not only the units shown in the corresponding figure, but also other units for implementing one or more functions thereof. In addition, the device 500 may comprise a single unit configured to perform two or more functions, or separate units for each separate function. Moreover, the units may be implemented in hardware, firmware, software, or any combination thereof.

[0051] It is understood that blocks of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, may be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, and/or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer and/or other programmable data processing apparatus, create means for implementing the functions/acts specified in the block diagrams and/or flowchart block or blocks.

[0052] It is also to be understood that the functions/acts noted in the blocks of the flowchart may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Although some of the diagrams include arrows on communication paths to show a primary direction of communication, it is to be understood that communication may occur in the opposite direction to the depicted arrows.

[0053] Furthermore, the solution of the present disclosure may take the form of a computer program on a memory having computer-usable or computer-readable program code embodied in the medium for use by or in connection with an instruction execution system. In the context of this document, a memory may be any medium that may contain, store, or is adapted to communicate the program for use by or in connection with the instruction execution system, apparatus, or device.

[0054] Therefore, the present disclosure also provides a device 600 including a processor 601 and a memory 602, as shown in FIG. 6. In the device 600, the memory 602 stores instructions that when executed by the processor 601 cause the device 600 to perform the method described above with the embodiments.

[0055] The present disclosure also provides a machine-readable medium (not illustrated) having stored thereon instructions that when executed on a device cause the device to perform the method described with the above embodiments.

[0056] While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any implementation or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular implementations. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable sub-combination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a sub-combination or variation of a sub-combination.

[0057] It will be obvious to a person skilled in the art that, as the technology advances, the inventive concept can be implemented in various ways. The above described embodiments are given for describing rather than limiting the disclosure, and it is to be understood that modifications and variations may be resorted to without departing from the spirit and scope of the disclosure as those skilled in the art readily understand. Such modifications and variations are considered to be within the scope of the disclosure and the appended claims. The protection scope of the disclosure is defined by the accompanying claims.

CPC classification

G06V20/41 (PHYSICS)
G06V30/15 (PHYSICS)
G06V30/19007 (PHYSICS)
G06V20/625 (PHYSICS)
G06V30/19173 (PHYSICS)

International classification

G06V20/62 (PHYSICS)
G06V30/148 (PHYSICS)
G06V20/40 (PHYSICS)
G06V30/19 (PHYSICS)
