SYSTEMS AND METHODS FOR RAPID DEVELOPMENT OF OBJECT DETECTOR MODELS
20230023164 · 2023-01-26
Inventors
- Vasudev PARAMESWARAN (Fremont, CA, US)
- Atul KANAUJIA (San Jose, CA, US)
- Simon CHEN (Pleasanton, CA, US)
- Jerome BERCLAZ (San Jose, CA, US)
- Ivan KOVTUN (San Jose, CA, US)
- Alison HIGUERA (San Jose, CA, US)
- Vidyadayini TALAPADY (Sunnyvale, CA, US)
- Derek YOUNG (Carbondale, CO, US)
- Balan AYYAR (Oakton, VA, US)
- Rajendra SHAH (Cupertino, CA, US)
- Timo PYLVANAINEN (Menlo Park, CA, US)
CPC Classification
- G06V10/778
- G06V10/72
- G06V10/7753
International Classification
- G06V10/72
- G06V10/774
Abstract
A computer vision system configured for detection and recognition of objects in video and still imagery, in a live or historical setting, uses a teacher-student object detector training approach to yield a merged student model capable of detecting all of the classes of objects that any of the teacher models is trained to detect. Further, training is simplified by providing an iterative training process wherein a relatively small number of images is labeled manually as initial training data, after which an iterated model cooperates with a machine-assisted labeling process and an active learning process such that detector model accuracy improves with each iteration, yielding improved computational efficiency. Further, synthetic data is generated by which an object of interest can be placed in a variety of settings sufficient to permit training of models. A user interface guides the operator in the construction of a custom model capable of detecting a new object.
Claims
1. A method for developing in one or more processors a merged deep learning model for classification and detection of one or more previously specified objects and at least one newly specified object comprising providing in one or more processors and associated storage a system production model comprising a first deep learning model capable of detecting and classifying one or more previously specified objects identified through the use of at least some anchor bounding boxes, providing in one or more processors and associated storage an iterated model comprising a second deep learning model capable, following training, of detecting and classifying at least one newly specified object, providing a system production training dataset representative of the previously specified objects to the system production model and the iterated model, providing a second training dataset representative of a newly specified object identified through the use of at least some bounding boxes to the system production model and the iterated model, processing, in both the system production model and the iterated model, at least the system production dataset and the second training dataset and generating a system training output and an iterated training output, respectively, optimizing the training output from the processing step by applying classification and regression algorithms to the system training output and the iterated training output to generate an optimized training output, supplying the optimized training output as the merged deep learning model configured to classify and detect at least some of the one or more previously specified objects and at least one of the newly specified objects.
2. The method of claim 1 wherein new unlabeled data is supplied to the system production model, the iterated model, and the optimizing step and the processing step includes processing the new unlabeled data.
3. The method of claim 1 wherein at least one of the system production model and the iterated model is a single shot multibox detector.
4. The method of claim 1 wherein classification comprises determining the probability distribution of the presence, at an anchor box, of any of the objects of interest or the background.
5. The method of claim 1 wherein classification is modeled as a softmax function to output confidence of a foreground class or a background class.
6. The method of claim 1 wherein regression is modeled as a non-linear multivariate regression function.
7. The method of claim 6 wherein the multivariate regression function outputs a four-dimensional vector representing center coordinates, width and height of the bounding box enclosing the object in the image.
8. The method of claim 1 wherein the second training dataset is only partly labeled.
9. The method of claim 1 wherein the system production model is interoperable with the iterated model.
10. A method for developing in one or more processors a student deep learning model for classification and detection of one or more previously specified objects and at least one newly specified object, comprising the steps of providing in one or more processors and associated storage a first teacher model comprising a first deep learning model capable of detecting and classifying at least one previously specified object, providing in one or more processors and associated storage a second teacher model comprising a second deep learning model configured for being trained to detect and classify at least one newly specified object, providing to the first teacher model and the second teacher model a first training dataset representative of the previously specified objects, providing to the first teacher model and the second teacher model a second training dataset representative of a newly specified object identified through the use of at least some bounding boxes, processing, in both the first teacher model and the second teacher model, at least the first training dataset and the second training dataset and generating a first training output and a second training output, respectively, optimizing the first and second training outputs to generate an optimized training output from the processing step by applying classification algorithms for determining the probability distribution, at an anchor box, of the presence of any of the objects of interest or the background and applying regression algorithms for determining the bounding box of an object that is detected at the anchor box, supplying the optimized training output as the student deep learning model configured to classify and detect at least one of the one or more previously specified objects and at least one of the newly specified objects.
11. The method of claim 10 wherein at least the second training dataset comprises in part video snippets.
12. The method of claim 10 wherein at least the second training dataset comprises in part synthetic data.
13. The method of claim 10 wherein the first teacher model and the second teacher model are interoperable.
14. The method of claim 13 wherein at least one of the first teacher model and the second teacher model is a single shot multibox detector.
15. The method of claim 10 wherein the second training output is provided to an operator for correction and the corrected output is processed in a second iteration of the processing step.
16. The method of claim 10 comprising the further step of providing a validation dataset to the first teacher model and the second teacher model.
17. The method of claim 10 wherein at least some images are provided to an operator as the result of an uncertainty calculation for distinguishing an object from background.
18. The method of claim 17 wherein the uncertainty calculation is based in part on a variable threshold.
19. The method of claim 10 wherein a grid of anchor boxes is distributed uniformly throughout an image.
20. A method for developing in one or more processors a merged deep learning model for classification and detection of one or more previously specified objects and at least one newly specified object, comprising the steps of providing in one or more processors and associated storage a plurality of teacher models each comprising a deep learning model capable of detecting and classifying at least one previously specified object and at least one newly specified object, providing to each teacher model a plurality of first training datasets, some of which are representative of at least some of the one or more previously specified objects, providing to each teacher model at least one new training dataset representative of the at least one newly specified object identified through the use of at least some bounding boxes, providing to each teacher model new unlabeled data, processing, in each of the teacher models, each of the plurality of training datasets and the new unlabeled data and generating a training output from each of the plurality of teacher models, optimizing the training output from each of the plurality of teacher models to generate an optimized training output by applying classification algorithms and regression algorithms to each of the training outputs, supplying the optimized training output as the merged deep learning model configured to detect and classify one or more of the previously specified objects and at least one newly specified object.
21. The method of claim 20 wherein each of the plurality of teacher models is interoperable with the remainder of the plurality of teacher models.
22. The method of claim 20 wherein at least some of the teacher models are selected from a group comprising a single shot multibox detector and a low shot learning detector.
23. The method of claim 20 wherein the classification algorithms determine the probability distribution, at an anchor box, of the presence of any of the objects of interest or the background, and the regression algorithms determine the bounding box of an object that is detected at the anchor box.
24. The method of claim 20 wherein generating the training output is fine tuned by propagating a loss function gradient.
25. The method of claim 20 wherein at least one new training dataset comprises in part synthetic data.
26. A method for developing in one or more processors a student deep learning model for classification and detection of one or more previously specified objects and at least one newly specified object, comprising the steps of providing in one or more processors and associated storage at least one of a first teacher model comprising one of either a single shot multibox detector or a low shot learning detector configured to classify and detect at least one previously specified object, providing in one or more processors and associated storage at least one of a second teacher model comprising one of either a single shot multibox detector or a low shot learning detector configured for being trained to classify and detect at least one newly specified object, providing to each first teacher model and each second teacher model a first training dataset representative of the previously specified objects, providing to each first teacher model and each second teacher model at least one new training dataset representative of a newly specified object, processing, in each first teacher model and each second teacher model, the first training dataset and the at least one new training dataset and generating a first training output and at least one new training output, respectively, optimizing the first training output and new training outputs to generate an optimized training output from the processing step by applying classification algorithms and regression algorithms, supplying the optimized training output as the student deep learning model configured to classify and detect at least some of the one or more previously specified objects and at least one of the newly specified objects.
27. The method of claim 26 wherein the at least one first teacher model is interoperable with the at least one second teacher model.
28. The method of claim 26 further comprising the step of testing the student deep learning model by processing production data to generate a production output and feeding the production output to an operator.
29. The method of claim 26 wherein only part of the at least one new training dataset is labeled.
30. The method of claim 26 further comprising the step of iteratively improving the student deep learning model by testing for uncertainty as to whether an anchor box includes an object and, for a plurality of anchor boxes, sorting according to uncertainty values.
Description
DETAILED DESCRIPTION OF THE INVENTION
[0058] The present invention enables a user to create an object detection model for custom objects, and to then use that custom model to find those objects in video and still frame imagery, whether that imagery is live or pre-recorded. In an embodiment of an aspect of the invention, the training of the custom object detection model is achieved with a volume of training data substantially less than in many prior art systems. In an embodiment of a further aspect of the invention, the custom model, together with a backbone object detection neural network that is pretrained on a variety of objects, forms the teacher portion of a teacher-student ensemble network which permits development of an optimized student object detection model with significantly improved computational efficiency. In an embodiment, each of the networks is a "Single Shot Multibox Detector" or "SSD" neural network for the detection task, with classification and regression performed at and relative to anchor boxes, where, in at least some implementations, the predefined, fixed grid of anchor boxes is spread uniformly through the image. While the following description assumes a supervised learning model, those skilled in the art will recognize, once they have digested the teachings herein, that unsupervised learning can also be used in at least some embodiments. In particular, if a model is "pre-trained" on a large amount of video data without labels (in essence, self-supervision), the amount of fine-tuning needed to build a specific model is significantly reduced. While a general purpose model that will work for any scene requires substantial compute power and data storage, data with considerable redundancy greatly reduces those compute and storage needs. Where the data source is a specific camera or group of cameras, a common configuration in which a specific camera sees a highly regular scene with a lot of redundancy, an unsupervised learning system can reduce the tuning time.
[0059] To develop a custom object detector model, a set of representative images of the object of interest is gathered. The images can come from an existing or newly captured dataset or, in some embodiments, can be generated synthetically, as discussed in greater detail below.
[0060] Each of the images is then labeled by identifying all of the occurrences of the object of interest and drawing a tight bounding box enclosing the entire object without extraneous elements. The minimum number of images for generation of a model can vary depending upon the size of the dataset and the nature of the objects being sought, but is typically between 10 and 1,000, with 50 images an exemplary number.
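As a concrete illustration of such a labeled image, a minimal record might look like the following sketch, where the file name, class name, and coordinates are hypothetical and the layout is not a format prescribed by this disclosure:

```python
# Hypothetical labeled-image record: one tight bounding box per occurrence
# of the object, expressed as (x_min, y_min, x_max, y_max) pixel coordinates.
label_record = {
    "image": "red_ball_0001.jpg",     # illustrative file name
    "class": "red ball",              # operator-assigned class descriptor
    "boxes": [
        (312, 148, 388, 224),         # first occurrence of the object
        (40, 510, 96, 566),           # second occurrence
    ],
}
```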
[0061] Once a sufficient quantity of images has been labeled, training is performed by the associated SSD, which may be operating with any of a variety of backbones, for example Resnet50, Resnet34, InceptionV3, or numerous other SSD variations, but, for at least some embodiments, with the weights unfrozen so that the detectors can be fine-tuned for a specific task by propagating the gradient of the loss function from the top to the bottom. The output of the SSDs comprises a first model. That model and an extensively trained system production model together comprise the "teacher" side of a teacher-student network, where the teacher networks are merged in an optimizing step using a novel form of distillation and the output of that step is a student model capable of detecting objects in all of the classes for which the system production model is trained plus all classes that can be detected by the iterated model. In some embodiments, no system production model will have been previously developed; the handling of that case is described below in connection with paragraph [0070].
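By way of a hedged sketch, such fine-tuning with all weights unfrozen might look like the following. The torchvision SSD300/VGG16 variant is used purely because it is readily available (the backbones named above, such as Resnet50, would be substituted in practice), and the training-step helper is illustrative rather than code from this disclosure:

```python
# A minimal sketch of fine-tuning an SSD with every layer left trainable
# ("weights unfrozen"), so the loss gradient propagates from top to bottom.
import torch
from torchvision.models.detection import ssd300_vgg16, SSD300_VGG16_Weights

model = ssd300_vgg16(weights=SSD300_VGG16_Weights.DEFAULT)
for p in model.parameters():
    p.requires_grad = True            # fine-tune the whole network, not just the head
model.train()

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_step(images, targets):
    """images: list of CHW float tensors; targets: list of dicts, each with
    'boxes' (N x 4, xyxy pixels) and 'labels' (N,) int64 class ids."""
    loss_dict = model(images, targets)     # classification + box-regression losses
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()                        # gradient flows top to bottom
    optimizer.step()
    return float(loss)
```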
[0062] The model is then tested against a set of images for validation, which provides an indication of how well the model performs. As discussed below, and depending upon the embodiment, various feedback and iterative techniques can be implemented to improve the model. In at least some embodiments, it is desirable to provide interoperability between the system production model and the custom, or iterated, model. Thus, in an embodiment, the two teacher models use a common vocabulary of object classes, where an operator seeking to designate a new class can see the previously trained classes and thus avoid duplication. Further, in an embodiment, the models use the same deep neural network framework, although such commonality is not required in all embodiments. In other embodiments, interoperability can be achieved where the neural network models are understandable in both frameworks, for example using the ONNX format, although the ONNX format does not always yield successful results without operator intervention. It will be appreciated by those skilled in the art that, if the networks are interoperable, the custom model can be merged with the system production model. Further, should the system production model yield poor results, for example as the result of poor labeling, the images from the system production model can be supplied to the image set of the present invention such that any labeling errors can be corrected, resulting in a more accurate production model.
[0064] At step 115, the classes of objects are defined, for example, "red ball", or "sunflower", or any other appropriate term. The descriptors for the class are assigned by the operator in many embodiments, although it will be appreciated that, if synthetic data is used, the object is already defined and, as with step 110, no human involvement is required. Next, at step 120, at least some of the images from the collected image set are labeled by applying bounding boxes tightly around each occurrence of the object in the images. While human intervention is required to apply bounding boxes for many types of images, for at least synthetic images the labeling can be performed automatically, since the process of generating a synthetic image includes knowing where the object is within the image. Next, at step 125, the model is trained by processing the labeled images in an appropriate neural network, where the result is an iterated model 130. The training typically uses an SSD as described above, although in some instances a low shot learning approach can yield an iterated model faster, with less labor in acquiring training data. Other types of deep learning networks suitable for detecting objects in imagery are also acceptable. In an embodiment, the backbone or deep residual network of the SSD can be the Resnet50 architecture, although architectures such as InceptionV3, Resnet34 with an additional layer for smaller objects, or any other functionally equivalent architecture may also be acceptable.
[0065] The output of the iterated model 130 is a set of images and labeling data, where the top layer classifier for the iterated model will have two outputs, specifically new-class versus background. That output is supplied to an optimization process 135, described in more detail below.
[0066] Active learning 145 is discussed in greater detail below.
[0067] Through repetition of the cycle of labeling, training, creating the iterated model, testing for uncertainty, and then sending the least certain images back to the operator for reassessment, the model is iteratively improved. Because the uncertainty threshold or selection process can be adjusted according to any convenient criteria, the group of images sent back for review by the operator can be small compared to the full dataset, with the result that a relatively small volume of images can, through iterative assessment, refine the iterated model 130 until it achieves acceptable accuracy. This reduces the labor involved and can also reduce computational expense.
[0068] As noted above, the output of the iterated model 130 is also supplied to an optimization process 135, which also receives as inputs the images and a system production model 150. The system production model 150 and the iterated model 130 form the teacher pair of networks, where each is trained for different objects and, through optimization process 135, their trainings are combined into a single student model, specifically merged model 155, trained to detect any object or objects that could have been detected by either (or both) the system production model or the iterated model. If the system production model detects N classes, the merged model will have (N+1)+1=(N+2) outputs, where the "+1" inside the parentheses corresponds to the newly added class and the final "+1" is for the background class.
[0069] The output of the merged model is then deployed, step 160, where it is applied to the production data 165. The results of that deployment are then fed back to step 120, as were the images labeled by the machine-assisted labeling process 140 and the active learning process 145, to allow the operator to correct the labeling of any images that the operator determines were mislabeled. It will be appreciated that, depending upon the embodiment, the feedback from one or more of the feedback sources 140, 145 and 165 is optional.
[0070] Further, in implementations where the system production model still needs to be developed, the foregoing steps can be used to create the system production model simply by executing the above-described process steps but without inclusion of the system production model and its associated dataset as inputs. As just one example, in an embodiment, the first execution of the process of the invention, including the aforementioned feedback as desired, classifies and detects a first object. That model, while capable of classifying and detecting only a first object, can be used as a nascent system production model, where each successive execution of the process adds an additional object to the objects that can be detected by that developing system production model. The collection of training data developed through successive addition of objects to the developing system production model becomes the system production training dataset. For purposes of the present invention, the foregoing description of the development of the system production model is not intended to be limiting, and the system production model can be developed in any suitable manner. The following description of the invention assumes a pre-existing system production model unless specifically stated to the contrary, although it will be apparent to those skilled in the art, upon digesting the details presented hereinafter, how to modify those processes and systems to develop the system production model if one does not yet exist.
[0072] To begin, in an embodiment a user assigns a name to an object of interest and then labels a batch of unlabeled images 200A. In some embodiments, the batch may range in size from about ten images to 1000 or more images, based at least partly on the size of the production data set. The images in the batch are then labeled, step 210, which corresponds to step 120 described above.
[0073] The result of the training step 125 is the first iteration of iterated model 130, which also functions as a teacher, as discussed further below.
[0076] In an additional aspect of some embodiments of the invention, tracking of video snippets, indicated at 270, is also provided.
[0078] While distillation is known where the task is to classify an image into different categories, the present invention extends this concept to a detection task where the model is required to report not only whether an object of a particular class exists in an image, but also the location of that object in the image, with the location typically represented as a tight bounding box around the object. In an embodiment, the present invention enables combining two or more teacher models trained for different objects into a single student model containing all the objects, and also enables using only partially labeled datasets to train a model. That is, at least some embodiments of the invention enable using different datasets in which only a single object of interest, or only some objects of interest, are labeled, thus saving substantial effort in that it becomes unnecessary to review all the data and relabel all the objects in all the images.
[0080] In an embodiment of the invention, a "Single Shot Multibox Detector" (SSD) neural network is used for the detection task. Classification and regression are performed at and relative to a predefined, fixed grid of boxes called "anchor boxes". For a large set of anchor boxes spread uniformly through the image, the SSD algorithm trains a network to perform two tasks, classification and regression, where classification is determining the probability distribution of the presence, at an anchor box, of any of the objects of interest or the background, and regression is determining the bounding box of the object that is detected at the anchor box.
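A minimal sketch of such a predefined, fixed anchor grid spread uniformly through the image follows; the grid resolution, scales, and aspect ratios are illustrative choices, not values taken from this disclosure:

```python
# Generate a uniform grid of anchor boxes as (cx, cy, w, h) tuples.
import numpy as np

def make_anchor_grid(img_w, img_h, grid=8,
                     scales=(0.1, 0.25), ratios=(1.0, 2.0, 0.5)):
    anchors = []
    for gy in range(grid):
        for gx in range(grid):
            cx = (gx + 0.5) / grid * img_w    # centers spread uniformly
            cy = (gy + 0.5) / grid * img_h
            for s in scales:                  # several box shapes per center
                for r in ratios:
                    w = s * img_w * np.sqrt(r)
                    h = s * img_h / np.sqrt(r)
                    anchors.append((cx, cy, w, h))
    return np.array(anchors)                  # shape: (grid*grid*scales*ratios, 4)

anchors = make_anchor_grid(640, 480)          # 8*8*2*3 = 384 anchor boxes
```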
[0081] Classification is modeled as a softmax function to output the confidence of a foreground class or the background class:

P(C_k|X) = exp(s_k(X)) / (exp(s_B(X)) + Σ_j exp(s_j(X)))

for foreground classes C_k and background class B, for the anchor box X, where s_c(X) denotes the network's score for class c at X. Note here that the background is treated just as one of the classes amongst all the classes modeled by the softmax function. The background class is trained by extracting negative examples around the positive examples in the labeled images. The loss function for training the classifier is a cross-entropy loss defined for every association of an anchor box to a label, denoted by x_{Label,Anchor} [Eq. 1]:

L_conf(C,X) = −Σ_{Label,Anchor} x_{Label,Anchor} · log P(C_Label|X_Anchor)   (Eq. 1)
[0082] Regression is modeled as a non-linear multivariate regression function that outputs a four-dimensional vector representing the center coordinates, width, and height of the bounding box enclosing the object in the image. The loss function for training the regressor is a smooth-L1 loss function L_loc(C,X). Only foreground objects are used for training the regressor, as the background class has no boundaries [Eq. 2]:

L_loc(C,X) = Σ_{Label,Anchor} x_{Label,Anchor} · smooth_L1(Δx_Pred − Δx_Box)   (Eq. 2)

Here x_{Label,Anchor} is 1 for an association between a positive label and a predefined anchor box, Δx_Box is the offset of the ground truth label relative to the associated anchor box, and Δx_Pred is the predicted bounding box offset from the network.
[0083] For training a standard SSD [3] model, parameters are learned that minimize L_conf(C,X) + L_loc(C,X), defined over a selective set of positive and negative anchor boxes X chosen using the labels from manually annotated images. These labels are called hard labels, with one-hot encoding for positive samples.
[0084] As a part of the workflow for the present invention, an operator will train multiple detectors by labeling multiple sets of data where only a particular object of interest is labeled in each dataset. Distillation enables an operator to train a single student model from multiple teacher models without losing accuracy, and without requiring the operator to label all the objects on all the datasets. The advantage of doing this is the performance gain resulting from running a single detector instead of multiple detectors.
[0085] The teacher in this case constitutes multiple networks of similar complexity, where each network is able to detect a new class of object as trained by the user. The student is a new network of similar complexity as the teacher models, where the goal is to distill the knowledge from multiple teacher models into a single student model.
[0086] While the distillation process can be performed on any number of teacher networks, as an example the algorithm can be illustrated using two teacher networks M_1 and M_2 to train a student network M. The teacher networks are trained to detect classes C_1 and C_2 with the respective "background" classes B_1 and B_2. "Background", in this context, means regions that do not contain the object of interest. (Labeled-Data)_1 and (Labeled-Data)_2, each of which has only its respective class labeled, are employed for training M_1 and M_2.
[0087] In an embodiment, the student model is a single deepnet model M with two classes and a single background class B that is the intersection of classes B_1 and B_2. The probability mapping for the combined model can be performed as follows. For the input X, the models for (Labeled-Data)_1 and (Labeled-Data)_2 have class probabilities P(C_1|M_1,X) and P(C_2|M_2,X), respectively. The corresponding background probabilities are P(B_1|M_1,X) and P(B_2|M_2,X), respectively. The probabilities for the teacher models are computed as follows:

P_Teacher(B|X) = P(B_1|M_1,X) × P(B_2|M_2,X)

P_Teacher(C_1|X) = P(C_1|M_1,X) × P(B_2|M_2,X)

P_Teacher(C_2|X) = P(C_2|M_2,X) × P(B_1|M_1,X)
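The probability mapping above can be sketched in a few lines; the renormalization at the end is an assumption (so the soft labels form a distribution) and is not explicitly specified in the text:

```python
# Merge per-anchor probabilities from two single-class teachers into
# soft labels over {background, C1, C2} for the student.
import numpy as np

def merge_teacher_probs(p_c1, p_b1, p_c2, p_b2):
    """Each argument is an array of shape (num_anchors,):
    p_c1 = P(C1|M1,X), p_b1 = P(B1|M1,X), p_c2 = P(C2|M2,X), p_b2 = P(B2|M2,X)."""
    p_b   = p_b1 * p_b2     # P_Teacher(B|X)
    p_c1m = p_c1 * p_b2     # P_Teacher(C1|X)
    p_c2m = p_c2 * p_b1     # P_Teacher(C2|X)
    probs = np.stack([p_b, p_c1m, p_c2m], axis=-1)
    return probs / probs.sum(axis=-1, keepdims=True)   # assumed renormalization
```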
[0088] In this example, the loss terms for training the SSD comprise a loss term for the classifier and a term for the regressor, shown in Eq. 1 and Eq. 2, above. In the present invention, the loss function for training a student model is a linear combination of two loss functions:
[0089] Loss1: Positive labels are hard labels that are extracted from (Labeled-Data)_1 and (Labeled-Data)_2, where only positive labels are sampled and no negative samples are extracted, because it is not known whether a negative sample for class C_1 contains a class C_2 object (and vice versa).
[0090] a. For training the classifier, only positive examples are used in the cross-entropy loss of Eq. 1, above.
Here, X represents the anchor box associated with positive soft labels and Δx represents the difference between the soft label and the associated anchor box X. Thus a highly confident classification score will have more influence in optimizing the corresponding regression loss (the smooth-L1 loss). A bounding box that does not have a high-confidence C_1 or C_2 score is most likely background and will not have any significant influence on the regression function.
[0096] The combined loss is α·Loss1 + (1−α)·Loss2, where α is used to control the weights of the combined loss and, in an embodiment, is set to 0.25. Note that any amount of representative unlabeled data can also be used to train a student model from the teacher models M_1 and M_2. In that case, only the Loss2 term is employed, as there are only soft labels from the models and no hard labels as used in the Loss1 term.
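A hedged sketch of the combined loss follows. Because the paragraphs defining Loss2 are not reproduced above, the soft-label term here is an assumed soft cross-entropy against the merged teacher distribution; only the hard-label term and the α = 0.25 weighting come directly from the text:

```python
import torch
import torch.nn.functional as F

ALPHA = 0.25  # weight of the hard-label term, per the embodiment above

def student_loss(student_logits, hard_labels, pos_mask, teacher_soft):
    """student_logits: (A, K) per-anchor class scores; hard_labels: (A,) int64
    class ids; pos_mask: (A,) bool, True at anchors with positive hard labels;
    teacher_soft: (A, K) merged teacher probabilities (soft labels)."""
    # Loss1: cross-entropy using positive hard labels only (no negatives sampled).
    loss1 = F.cross_entropy(student_logits[pos_mask], hard_labels[pos_mask])
    # Loss2 (assumed form): soft cross-entropy against the teacher distribution.
    log_p = F.log_softmax(student_logits, dim=-1)
    loss2 = -(teacher_soft * log_p).sum(dim=-1).mean()
    return ALPHA * loss1 + (1 - ALPHA) * loss2
```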
[0098] Referring next to the active learning process, the system organizes the unlabeled data according to each datum's uncertainty score, after which the operator is invited to label a batch of the unlabeled data having the highest uncertainty scores. The model is then retrained using all of the labeled data, yielding an improved result. This cyclic process of labeling, training, and querying is continued until the model converges or the validation accuracy is deemed satisfactory by the user. By using active learning, customers are able to train a model with high accuracy by labeling only a small subset of the raw data, for example as few as ten images for some models and as many as 1000 images or more for other models, based at least in part on the size of the dataset.
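The cycle can be summarized by the following sketch. The `train_fn`, `score_fn`, and `label_fn` callables stand in for the training step, the detector's per-image best confidence, and the operator's labeling session, respectively; they are placeholders, not components named by this disclosure:

```python
# Active-learning loop: label the most uncertain images each round, retrain,
# and repeat for a fixed number of rounds (or until accuracy is satisfactory).
def uncertainty(best_confidence):
    """Least-confidence score: peaks when the detector's best foreground
    confidence is near the 0.5 decision boundary."""
    return 1.0 - abs(2.0 * best_confidence - 1.0)

def active_learning_loop(model, labeled, unlabeled,
                         train_fn, score_fn, label_fn,
                         batch_size=50, rounds=5):
    for _ in range(rounds):
        model = train_fn(model, labeled)                 # retrain on all labels
        ranked = sorted(unlabeled,
                        key=lambda img: uncertainty(score_fn(model, img)),
                        reverse=True)                    # most uncertain first
        batch, unlabeled = ranked[:batch_size], ranked[batch_size:]
        labeled = labeled + label_fn(batch)              # operator labels the batch
    return model
```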
[0100] In some instances, the object is available physically but there are insufficient images of the object in context, i.e., with an appropriate background, to create a dataset adequate to train a model to yield sufficiently accurate results. In other cases, no physical example exists, but a 3D computer model is available. In such circumstances, the generation of synthetic images can offer a number of advantages. An embodiment of such an approach can be appreciated from the following.
[0101] The details of the object are then provided from 525 to a blending process, 535, which also receives data representative of at least the color, tone, texture and scale of the scene depicted in a background image, 540, as well as characterizing information specifying the position and angle of view of a virtual camera, 545, together with characteristics of the virtual camera such as distortion, foreshortening, compression, etc. The virtual camera can be defined by any suitable digital representation of a camera model. The process 535 modifies the object in accordance with the context of the background image, including color and texture matching as well as scaling the object to be consistent with its location in the background image, and adjusts the image of the object by warping, horizontally or vertically tilting the object, and other similar photo post-processing techniques to give the synthetic representation of the object proper scale, perspective, distortion representative of the camera lens, noise, and related camera characteristics. The blended and scaled object image from step 555 is then provided to a renderer 560, which places the blended and scaled object into the background image. To achieve that result, the renderer 560 also receives the background image 540 and the camera information, 545 and 550. The result is a synthetic image 565 of the object in the background image, usable in dataset 200.
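A much-simplified sketch of the scaling and compositing portion of this pipeline is shown below using Pillow; real blending would also match color, tone and texture and model lens distortion, and the file paths, scale, and paste position are hypothetical:

```python
# Scale an object cutout and alpha-composite it into a background image;
# the paste position and size directly yield the training bounding box.
from PIL import Image

background = Image.open("background.jpg").convert("RGB")
obj = Image.open("object_rgba.png").convert("RGBA")       # cutout with alpha mask

scale = 0.15                                              # chosen to suit scene depth
new_w = int(background.width * scale)
new_h = int(obj.height * new_w / obj.width)               # preserve aspect ratio
obj = obj.resize((new_w, new_h))

x, y = int(background.width * 0.6), int(background.height * 0.7)
background.paste(obj, (x, y), mask=obj)                   # alpha composite
background.save("synthetic_image.jpg")
bounding_box = (x, y, x + new_w, y + new_h)               # automatic label
```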
[0104] The multisensor processor 615 can be a server computer such as maintained on premises or in a cloud network, a client computer, a personal computer (PC), a tablet PC, a set-top box (STB), a personal digital assistant (PDA), a cellular telephone, a smartphone, a web appliance, a network router, switch or bridge, or any machine capable of executing instructions 635 (sequential or otherwise) that specify actions to be taken by that machine. Further, while only a single machine is illustrated, the term “machine” is to be understood to include any collection of machines that individually or jointly execute instructions 635 to perform any one or more of the methods or processes discussed herein.
[0105] In at least some embodiments, the multisensor processor 615 comprises one or more processors 650. Each processor of the one or more processors 650 can comprise a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), a controller, one or more application specific integrated circuits (ASICs), one or more radio-frequency integrated circuits (RFICs), or any combination of these. In an embodiment, the machine 615 further comprises static memory 655 together with main memory 645, which are configured to communicate with each other via bus 660. The machine 615 can further include one or more visual displays as well as associated interfaces, all indicated at 665, for displaying messages or data. The visual displays may be of any suitable type, such as monitors, head-up displays, windows, projectors, touch enabled devices, and so on. At least some embodiments further comprise an alphanumeric input device 670 (such as a keyboard, touchpad, touchscreen or similar), together with a pointing or other cursor control device 675 (such as a mouse, a trackball, a joystick, a motion sensor, a touchpad, a tablet, and so on), a storage unit or machine-readable medium 640 wherein the machine-readable instructions 635 are stored, a signal generation device 680 such as a speaker, and a network interface device 685. A user device interface 690 communicates bidirectionally with user devices 620.
[0107] While machine-readable medium or storage device 640 is shown in an example embodiment to be a single medium, the term "machine-readable medium" should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, or associated caches and servers) able to store instructions (e.g., instructions 635). The term "machine-readable medium" includes any medium that is capable of storing instructions (e.g., instructions 635) for execution by the machine such that the instructions cause the machine to perform any one or more of the methodologies disclosed herein. The term "machine-readable medium" includes, but is not limited to, data repositories in the form of solid-state memories, optical media, and magnetic media. The storage device 640 can be the same device as data store 630.
[0109] Where the multisensor data from inputs 700A-700n includes full motion video from terrestrial or other sensors, the processor 615 can, in an embodiment, comprise a face detector 720 chained with a recognition module 725, which comprises an embedding extractor, and an object detector 730. In an embodiment, the face detector 720 and object detector 730 can employ a single shot multibox detector (SSD) network, which is a form of convolutional neural network. SSDs characteristically perform the tasks of object localization and classification in a single forward pass of the network, using a technique for bounding box regression such that the network both detects objects and classifies those detected objects. Using, for example, the FaceNet neural network architecture, the face recognition module 725 represents each face with an "embedding", which is a 128-dimensional vector designed to capture the identity of the face and to be invariant to nuisance factors such as viewing conditions, the person's age, glasses, hairstyle, etc. Alternatively, various other architectures, of which SphereFace is one example, can also be used. In embodiments having other types of sensors, other appropriate detectors and recognizers may be used. Machine learning algorithms may be applied to combine results from the various sensor types to improve detection and classification of the objects, e.g., faces or inanimate objects. In an embodiment, the embeddings of the faces and objects comprise at least part of the data saved by the data saver 710 and encoders 705 to the data store 630. The embedding and entity detections, as well as the raw data, can then be made available for querying, which can be performed in near real time or at some later time.
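As a small sketch of how such embeddings support recognition, two 128-dimensional vectors can be compared by cosine similarity and declared the same identity above a threshold; the threshold value here is illustrative, not one specified by this disclosure:

```python
# Compare two face embeddings; similar vectors imply the same identity.
import numpy as np

def same_identity(emb_a, emb_b, threshold=0.7):
    """emb_a, emb_b: 128-dimensional embedding vectors (numpy arrays)."""
    cos = float(np.dot(emb_a, emb_b) /
                (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))
    return cos >= threshold
```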
[0110] Queries to the data are initiated by analysts or other users through a user interface 735, which connects bidirectionally to a reasoning engine 740, typically through network 620.
[0111] Queries are processed in the processor 615 by a query process 755. The user interface 735 allows querying of the multisensor data for faces and objects (collectively, entities) and activities. One exemplary query can be "Find all images in the data from multiple sensors where the person in a given photograph appears". Another example might be "Did John Doe drive into the parking lot in a red car, meet Jane Doe, who handed him a bag?". Alternatively, in an embodiment, a visual GUI can be helpful for constructing queries. The reasoning engine 740, which typically executes in processor 615, takes queries from the user interface via web services and quickly reasons through, or examines, the entity data in data store 630 to determine whether there are entities or activities that match the analysis query. In an embodiment, the system geo-correlates the multisensor data to provide a comprehensive visualization of all relevant data in a single model. Once that visualization of the relevant data is complete, a report generator module 760 in the processor 615 saves the results of various queries and generates a report through the report generation step 765. In an embodiment, the report can also include any related analysis or other data that the user has input into the system.
[0112] The data saver 715 receives output from the processing system and saves the data on the data store 630, although in some embodiments these functions may be integrated. In an embodiment, the data from processing is stored in a columnar data storage format, such as Parquet as just one example, that can be loaded by the search backend and searched quickly for specific embeddings or object types. The search data can be stored in the cloud (e.g., AWS S3), on premises using HDFS (Hadoop Distributed File System), NFS, or some other scalable storage. In some embodiments, web services 745, together with user interface (UI) 735, provide users such as analysts with access to the platform of the invention through a web-based interface. The web-based interface provides a REST API to the UI. The web-based interface, in turn, communicates with the various components via remote procedure calls implemented using Apache Thrift. This allows the various components to be written in different languages.
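As a hedged sketch of this columnar storage, per-detection rows with their embeddings can be written to and selectively read from Parquet as follows; the column names and file path are illustrative, and pandas with a Parquet engine such as pyarrow is assumed:

```python
# Write detections to a columnar Parquet file, then load only the columns
# the search backend needs.
import pandas as pd

detections = pd.DataFrame({
    "frame_ts":    [1690000000.0, 1690000000.5],
    "object_type": ["face", "car"],
    "embedding":   [[0.12] * 128, [0.88] * 128],   # one vector per detection
    "confidence":  [0.97, 0.81],
})
detections.to_parquet("detections.parquet")

subset = pd.read_parquet("detections.parquet",
                         columns=["object_type", "embedding"])
```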
[0113] In an embodiment, the UI is implemented using React and node.js, and is a fully featured client-side application. The UI retrieves content from the various back-end components via REST calls to the web service. The UI supports upload and processing of recorded or live data, and supports generation of query data by examining that data. For example, in the case of video, it supports generation of face snippets from an uploaded photograph or from live video, to be used for querying. Upon receiving results from the reasoning engine via the web service, the UI displays the results on a webpage.
[0114] A user interface comprises another aspect of the present invention, and various screens of an embodiment of a user interface are described below.
[0115] If the operator decides that the existing models would not yield the desired results, the operator can click on "New", shown at 925, in which case, in an embodiment, a screen for defining a new model is displayed.
[0116] The operator is invited to define a new object by clicking on "New Object", 1115, which causes, in an embodiment, the screen 1120 to be displayed.
[0120] Once the model has been trained sufficiently, the merged model 155 can be deployed, as discussed above in connection with step 160.
[0122] To increase or decrease the number of detections, the confidence threshold can be adjusted to any desired level, for example by slider 1580.
[0123] The display of confidence percentages can also vary depending upon the selections of the data to be displayed to the operator. For example, in an embodiment of the Analysis Results display, confidence percentages are hidden by default in the video player, and by default also hidden for objects displayed in the larger view shown at 1555. At the same time, by default all detections exceeding a default low confidence threshold, for example one percent, may be returned as search results, optionally arranged by confidence percentage. In contrast, the defaults for Live Monitoring Alerts may be, for example, to return all detections above a default threshold of 20% confidence, with confidence percentages always visible. As noted above, the default values can be adjusted via the settings accessible at icon 1560.
[0124] In an embodiment, "inspect" mode reveals to the operator all detections of any searched object or objects above a default confidence level, for example 20%, with the identities of the searched objects visible at 1590. Optionally, the user can be permitted to select which of the objects shown at 1590 are revealed in inspect mode, surrounded by their respective bounding boxes. Again, the confidence threshold can be adjusted in at least some embodiments. Alternatively, inspect mode can also be configured to reveal all objects detected by the system, whether or not a given object is part of the analysis results, or can be configured to allow the operator to incrementally add types or classes of objects that the system will reveal in inspect mode. Inspect mode can thus be used by an operator to reveal associations between or among detected objects, where the types of detections to be revealed vary with each iteration of a search. Inspect mode can also be used as a verification step, to ensure that the system is successfully detecting all objects in a frame or a video sequence, regardless of whether they are included in a given search. In any of the modes, a given scene can be captured by clicking on "capture scene", shown at 1595.
[0125] Having fully described a preferred embodiment of the invention and various alternatives, those skilled in the art will recognize, given the teachings herein, that numerous alternatives and equivalents exist which do not depart from the invention. It is therefore intended that the invention not be limited by the foregoing description, but only by the appended claims.