AUTOMATED IMAGING SYSTEM FOR OBJECT FOOTPRINT DETECTION AND A METHOD THEREOF
20230066119 · 2023-03-02
Assignee
Inventors
- Pradip GUPTA (West Bengal, IN)
- Naveen Kumar PANDEY (Uttar Pradesh, IN)
- Balakrishna PAILLA (Goa, IN)
- Shailesh KUMAR (Telangana, IN)
- Shubham BHARDWAJ (Telangana, IN)
- Anmol KARNWAL (Uttar Pradesh, IN)
CPC classification
G06V10/469
PHYSICS
G06V10/774
PHYSICS
G06V20/70
PHYSICS
G06V10/44
PHYSICS
International classification
G06V10/774
PHYSICS
Abstract
The present disclosure provides for a system for facilitating a completely automated process that may directly fetch imagery of a given location and area from any mapping module and extract a plurality of objects from the imagery. Further, a deep learning-based object segmentation method, such as, but not limited to, a cascaded reverse Mask R-CNN framework, may generate a set of predefined vectors associated with the image. The system may be configured to automate the generation of the predefined vectors based on the image received from the image sensing assembly.
Claims
1. An automated image sensing system (108), said system (108) comprising: an image module (106) comprising one or more processors (202), wherein the one or more processors (202) are coupled with a memory (204), wherein said memory (204) stores instructions which, when executed by the one or more processors (202), cause said system (108) to: receive a set of images of an object from an image sensing assembly (112), wherein the set of images are obtained at a plurality of viewpoints; extract, by using a Deep Learning (DL) engine (216), a first set of attributes from each image in the set of images recorded at respective viewpoints based on a location template, wherein the first set of attributes pertains to center coordinates and radius of the region of interest of each image at the respective viewpoints, wherein the DL engine (216) is operatively coupled to the one or more processors (202); extract, by using the DL engine (216), a second set of attributes from the first set of attributes extracted, the second set of attributes pertaining to a set of predefined boundaries associated with the object; generate, by using the DL engine (216), a mask for each set of predefined boundaries of the object; and, merge, by the DL engine (216), the masks of each set of predefined boundaries of the object with each other to obtain a set of predefined vectors to be stored in a database associated with the system (108).
2. The system as claimed in claim 1, wherein the plurality of viewpoints refers to the coordinates and radius of an object or region of interest, and the latitude and longitude of a region.
3. The system as claimed in claim 1, wherein the set of predefined boundaries comprises background, object interior, object edges and object separators.
4. The system as claimed in claim 1, wherein an object detection module is operatively coupled to the one or more processors, wherein the object detection module is configured to: process the extracted second set of attributes; obtain a set of features from the processed second set of attributes; and, map down one or more precise unique features of the object from the set of features obtained.
5. The system as claimed in claim 1, wherein the DL engine is further configured to: obtain a set of values of each image of the set of images; process the set of values of each said image to yield a set of predefined vectors; and, generate a trained model from the set of predefined vectors.
6. The system as claimed in claim 5, wherein the DL engine is further configured to: automate fetching of an image from the image sensing assembly to generate the predefined set of vectors specific to the image.
7. The system as claimed in claim 5, wherein the set of values of each image are any or a combination of red green blue (RGB) values, greyscale values, and luma, blue projection and red projection (YUV) values.
8. The system as claimed in claim 5, wherein the trained model is trained to take an image automatically as an input and return a minimum rotated bounding box for the object along with one or more pixel labels associated with the object.
9. The system as claimed in claim 1, wherein a segmentation module is operatively coupled with the DL engine (216), wherein the segmentation module is further configured to: cascade a multi-class segmentation task to generate a plurality of pixel-level semantic features in a hierarchical manner.
10. The system as claimed in claim 1, wherein the image sensing assembly (112) comprises one or more analog electronic input sources configured for recording a plurality of physical parameters simultaneously with the set of images and a network (104) connecting one or more camera sensors (204) and the one or more analog input sources to the computing device (102).
11. A method for facilitating automated image sensing, said method comprising: receiving, by an image module (106), a set of images of an object from an image sensing assembly (112), wherein the set of images are obtained at a plurality of viewpoints, wherein the image module (106) comprises one or more processors (202), wherein the one or more processors (202) are coupled with a memory (204), wherein said memory (204) stores instructions which are executed by the one or more processors (202); extracting, by using a Deep Learning (DL) engine (216), a first set of attributes from each image in the set of images recorded at respective viewpoints based on a location template, wherein the first set of attributes pertains to center coordinates and radius of the region of interest of each image at the respective viewpoints, wherein the DL engine (216) is operatively coupled to the one or more processors (202); extracting, by using the DL engine (216), a second set of attributes from the first set of attributes extracted, the second set of attributes pertaining to a set of predefined boundaries associated with the object; generating, by using the DL engine (216), a mask for each set of predefined boundaries of the object; and, merging, by the DL engine (216), the masks of each set of predefined boundaries of the object with each other to obtain a set of predefined vectors to be stored in a database associated with the system (108).
12. The method as claimed in claim 11, wherein the plurality of viewpoints refers to the coordinates and radius of an object or region of interest, and the latitude and longitude of a region.
13. The method as claimed in claim 11, wherein the set of predefined boundaries comprises background, object interior, object edges and object separators.
14. The method as claimed in claim 11, wherein an object detection module is operatively coupled to the one or more processors, wherein the method further comprises the steps of: processing, by the object detection module, the extracted second set of attributes; obtaining, by the object detection module, a set of features from the processed second set of attributes; and, mapping down, by the object detection module, one or more precise unique features of the object from the set of features obtained.
15. The method as claimed in claim 11, wherein the method further comprises the steps of: obtaining, by the DL engine (216), a set of values of each image of the set of images; processing, by the DL engine (216), the set of values of each said image to yield a set of predefined vectors; and, generating, by the DL engine (216), a trained model from the set of predefined vectors.
16. The method as claimed in claim 15, wherein the method further comprises the step of: automating the fetching of an image, by the DL engine (216), from the image sensing assembly to generate the predefined set of vectors specific to the image.
17. The method as claimed in claim 15, wherein the set of values of each image are any or a combination of red green blue (RGB) values, greyscale values, and luma, blue projection and red projection (YUV) values.
18. The method as claimed in claim 15, wherein the trained model is configured to take an image automatically as an input and return a minimum rotated bounding box for the object along with one or more pixel labels associated with the object.
19. The method as claimed in claim 11, wherein a segmentation module is operatively coupled with the one or more processors, wherein the method further comprises the step of: cascading, by the segmentation module, a multi-class segmentation task to generate a plurality of pixel-level semantic features in a hierarchical manner.
20. The method as claimed in claim 11, wherein the image sensing assembly (112) comprises one or more analog electronic input sources configured for recording a plurality of physical parameters simultaneously with the set of images and a network (104) connecting one or more camera sensors (204) and the one or more analog input sources to the computing device (102).
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0030] In the figures, similar components and/or features may have the same reference label. Further, various components of the same type may be distinguished by following the reference label with a second label that distinguishes among the similar components. If only the first reference label is used in the specification, the description is applicable to any one of the similar components having the same first reference label irrespective of the second reference label.
DETAILED DESCRIPTION
[0040] In the following description, numerous specific details are set forth in order to provide a thorough understanding of embodiments of the present invention. It will be apparent to one skilled in the art that embodiments of the present invention may be practiced without some of these specific details.
[0041] Provided herein are an imaging system and a method for mapping of objects; the disclosure relates to the imaging of three-dimensional (3-D) objects, and more specifically to the fetching and processing of imagery of a given location or an object.
[0042] In an aspect, the present disclosure further provides for a system for facilitating a completely automated process that may directly fetch imagery of a given location or object and its surrounding area from any mapping module and extract a plurality of objects from the imagery. Further, the system may employ deep learning-based object segmentation methods, such as, but not limited to, a cascaded reverse Mask R-CNN framework, which reach state-of-the-art performance even in cluttered rural and urban environments.
[0043] Several embodiments of the present disclosure are described hereafter in detail with reference to the drawings. The specification herein is to be considered an illustration of the invention and is not intended to limit the scope of the invention to the embodiments described by the drawings and the description provided below for an imaging system.
[0044] Embodiments of the present invention may be provided as a computer program product, which may include a machine-readable storage medium tangibly embodying thereon instructions, which may be used to program a computer (or other electronic devices) to perform a process. The machine-readable medium may include, but is not limited to, fixed (hard) drives, magnetic tape, floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs), magneto-optical disks, and semiconductor memories, such as ROMs, random access memories (RAMs), programmable read-only memories (PROMs), erasable PROMs (EPROMs), electrically erasable PROMs (EEPROMs), flash memory, magnetic or optical cards, or other types of media/machine-readable medium suitable for storing electronic instructions (e.g., computer programming code, such as software or firmware).
[0045] Various methods described herein may be practiced by combining one or more machine-readable storage media containing the code according to the present invention with appropriate standard computer hardware to execute the code contained therein. An apparatus for practicing various embodiments of the present invention may involve one or more computers (or one or more processors within a single computer) and storage systems containing or having network access to computer program(s) coded in accordance with various methods described herein, and the method steps of the invention could be accomplished by modules, routines, subroutines, or subparts of a computer program product.
[0046] Exemplary embodiments will now be described more fully hereinafter with reference to the accompanying drawings, in which exemplary embodiments are shown. These exemplary embodiments are provided only for illustrative purposes and so that this disclosure will be thorough and complete and will fully convey the scope of the invention to those of ordinary skill in the art. The invention disclosed may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Various modifications will be readily apparent to persons skilled in the art. Thus, the present invention is to be accorded the widest scope encompassing numerous alternatives, modifications and equivalents consistent with the principles and features disclosed.
[0047] The embodiments will become clear from the illustrative drawings explained henceforth.
[0048] Referring to the accompanying drawings, an exemplary network architecture in which the proposed system may be implemented is described below.
[0049] In an exemplary embodiment, the plurality of objects may be buildings, roads, bridges, flyovers, and the like, but not limited thereto. The imaging module (106) may be operatively coupled to at least one computing device (102-1, 102-2, . . . , 102-N) (hereinafter interchangeably referred to as a computing device (102), and collectively referred to as 102). The computing device (102) and the system (108) may communicate with each other over a network (104). The system (108) may further be associated with a centralized server (110). The data can be stored to a computer hard-disk, external drives, cloud systems or the centralized server (110).
[0050] In an embodiment, the network (104) can include any or a combination of a wireless network module, a wired network module, a dedicated network module and a shared network module. Furthermore, the network can be implemented as one of the different types of networks, such as Intranet, Local Area Network (LAN), Wide Area Network (WAN), Internet, and the like. The shared network can represent an association of the different types of networks that can use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), Wireless Application Protocol (WAP), and the like.
[0051] In an embodiment, the deep learning (DL) engine (216) may cause the system (108) to receive a set of images of an object from an image sensing assembly (112) at a plurality of viewpoints. The plurality of viewpoints may refer to the coordinates and radius of a region of interest, latitude, longitude and the like. For example, the image sensing assembly (112) can take images of a location having corners north west (NW)=15.394745, 73.832864 and south east (SE)=15.393420, 73.834237.
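By way of illustration only, the following minimal Python sketch shows how a viewpoint specification of this kind could be mapped to map-tile requests, assuming a standard slippy-map (Web Mercator) tile scheme; the mapping module's actual interface, the zoom level and the helper name latlon_to_tile are assumptions, not part of the present disclosure.

```python
import math

def latlon_to_tile(lat: float, lon: float, zoom: int) -> tuple[int, int]:
    """Convert a WGS84 latitude/longitude pair to slippy-map tile indices."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    return x, y

# Corner viewpoints from the example above, given as (lat, lon)
nw = (15.394745, 73.832864)
se = (15.393420, 73.834237)
zoom = 18  # assumed zoom level

x0, y0 = latlon_to_tile(*nw, zoom)
x1, y1 = latlon_to_tile(*se, zoom)
tiles = [(x, y)
         for x in range(min(x0, x1), max(x0, x1) + 1)
         for y in range(min(y0, y1), max(y0, y1) + 1)]
print(f"{len(tiles)} map tiles cover the region of interest")
```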
[0052] The DL engine (216) may further cause the system (108) to extract a first set of attributes from each image in the set of images recorded at respective viewpoints based on a location template. The first set of attributes may refer to center coordinates and radius of the region of interest of each image at the respective viewpoints. The DL engine (216) may further extract a second set of attributes from the first set of attributes extracted. The second set of attributes may pertain to a set of predefined boundaries associated with the object. For example, the set of predefined boundaries may include background, building interior, building edges and building separators (i.e., the gap between two close buildings), but not limited to the like.
[0053] The DL engine (216) may further cause the system to generate a mask for each set of predefined boundaries of the object and then merge the masks of each set of predefined boundaries of the object with each other to obtain a set of predefined vectors to be stored in a database associated with the system (108).
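As a hedged sketch of this mask-merging step (the mask representation, argument names and the use of OpenCV are illustrative assumptions; the disclosure does not prescribe a particular implementation):

```python
import numpy as np
import cv2  # OpenCV, assumed available for contour extraction

def merge_masks_to_polygons(interior: np.ndarray,
                            edges: np.ndarray,
                            separators: np.ndarray):
    """Merge per-boundary binary masks (HxW, uint8, values 0/255) into
    per-building polygon vectors."""
    building = cv2.bitwise_or(interior, edges)  # union of interior and edge pixels
    # Carve out the separator pixels so adjoining buildings stay distinct
    building = cv2.bitwise_and(building, cv2.bitwise_not(separators))
    contours, _ = cv2.findContours(building, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    # Each external contour approximates one building footprint polygon
    return [c.reshape(-1, 2) for c in contours]
```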
[0054] In an embodiment, an object detection module (not shown in the figures) may be operatively coupled to the one or more processors (202). The object detection module may be configured to process the extracted second set of attributes, obtain a set of features from the processed second set of attributes, and map down one or more precise unique features of the object from the set of features obtained.
[0055] In an exemplary embodiment, the system (108) may generate, through the DL engine, a trained model configured to process the image to yield a set of predefined vectors, such as a set of geometrical structures, as target output. For example, the boundary and its vicinity information may be learnt by the DL engine (216) utilising the prior data coming from the previous module, and each pixel may be predicted as assigned to one of the four classes to implicitly capture the geometric properties which may otherwise be difficult to learn, for example the pixels between two close buildings. Thus, in an exemplary embodiment, the DL engine (216) may facilitate automating the fetching of an image from the image sensing assembly (112) to generate the predefined set of vectors specific to the image.
[0056] In an exemplary embodiment, the image from the image sensing assembly (112) may be an RGB image, a greyscale image, a YUV image, and the like.
[0057] In an exemplary embodiment, by way of example and not as a limitation, an RGB image may be fetched from the image sensing assembly (112), act as an input, and return a minimum rotated bounding box for each building instance along with the respective pixel labels. In an embodiment, a segmentation module associated with the system may cascade a multi-class segmentation task to generate pixel-level semantic features in a hierarchical manner and further apply an optimized object detection on the extracted feature maps to obtain precise object corner points.
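A minimal sketch of deriving a minimum rotated bounding box from pixel labels follows; the instance-label representation and the use of OpenCV's minAreaRect are illustrative assumptions rather than the claimed method itself:

```python
import numpy as np
import cv2

def rotated_boxes_from_labels(instance_labels: np.ndarray):
    """Return one minimum rotated bounding box per building instance.

    instance_labels: HxW integer array where 0 is background and each
    positive integer labels the pixels of one building instance."""
    boxes = []
    for label in np.unique(instance_labels):
        if label == 0:
            continue  # skip background
        ys, xs = np.nonzero(instance_labels == label)
        points = np.column_stack([xs, ys]).astype(np.float32)
        # ((cx, cy), (w, h), angle) -- the minimum-area rotated rectangle
        boxes.append(cv2.minAreaRect(points))
    return boxes
```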
[0058] In an embodiment, the image sensing assembly (112) may further include one or more analog electronic input sources configured for recording several physical parameters simultaneously with the images and a wired or a wireless network (104) connecting one or more camera sensors (204) and the one or more analog input sources to the computing device (102). The wired network (104) may include one or more cables to connect the one or more camera sensors and the one or more analog input sources to the computing device (102).
[0059] In an embodiment, a user may perform the image profile mapping of the object using a respective computing device via a set of instructions residing on any operating system, including but not limited to, Android™, iOS™, and the like. In an embodiment, the computing device (102) may include, but is not limited to, any electrical, electronic, electro-mechanical equipment or a combination of one or more of the above devices, such as a mobile phone, smartphone, virtual reality (VR) device, augmented reality (AR) device, pager, laptop, general-purpose computer, personal computer (PC), workstation, industrial computer, super-computer, desktop, personal digital assistant, tablet computer, mainframe computer, or any other computing device, wherein the computing device may include one or more in-built or externally coupled accessories including, but not limited to, a visual aid device such as a camera, an audio aid, a microphone, a keyboard, and input devices for receiving input from a user such as a touch pad, touch-enabled screen, electronic pen and the like. It may be appreciated that the computing device (102) may not be restricted to the mentioned devices and various other devices may be used. A smart computing device may be one of the appropriate systems for storing data and other private/sensitive information.
[0060] In an embodiment, the system (108) for imaging may include one or more processors coupled with a memory, wherein the memory may store instructions which, when executed by the one or more processors, may cause the system to perform the receiving, extraction, mask generation and merging steps as described hereinabove.
[0061] In an embodiment, the system (108) may include an interface(s) (206). The interface(s) (206) may comprise a variety of interfaces, for example, interfaces for data input and output devices, referred to as I/O devices, storage devices, and the like. The interface(s) (206) may facilitate communication of the system (108). The interface(s) (206) may also provide a communication pathway for one or more components of the system (108). Examples of such components include, but are not limited to, processing engine(s) (208) (engine(s) are also referred to as module(s)) and a database (210).
[0062] The processing engine(s) (208) may be implemented as a combination of hardware and programming (for example, programmable instructions) to implement one or more functionalities of the processing engine(s) (208). In examples described herein, such combinations of hardware and programming may be implemented in several different ways. For example, the programming for the processing engine(s) (208) may be processor executable instructions stored on a non-transitory machine-readable storage medium and the hardware for the processing engine(s) (208) may comprise a processing resource (for example, one or more processors), to execute such instructions. In the present examples, the machine-readable storage medium may store instructions that, when executed by the processing resource, implement the processing engine(s) (208). In such examples, the system (108) may comprise the machine-readable storage medium storing the instructions and the processing resource to execute the instructions, or the machine-readable storage medium may be separate but accessible to the system (108) and the processing resource. In other examples, the processing engine(s) (208) may be implemented by electronic circuitry.
[0063] The processing engine (208) may include one or more engines selected from any of an image acquisition module (212), an image processing module (214), a deep learning (DL) engine (216), and other modules (218). The other modules (218) may help in other functionalities such as image acquisition, calibration, processing, post-processing of the images and data obtained with other analog inputs and for storage of images and post-processed data.
[0064] In an aspect, a method (300) for facilitating automated image sensing may include the steps of receiving, by the image module (106), a set of images of an object from the image sensing assembly (112) at a plurality of viewpoints; extracting, by the DL engine (216), a first set of attributes from each image recorded at the respective viewpoints based on a location template; and extracting, by the DL engine (216), a second set of attributes pertaining to a set of predefined boundaries associated with the object.
[0065] Further, the method (300) may include the step at 308 of generating, by the DL engine (216), a mask for each set of predefined boundaries of the object and the step at 310 of merging, by the DL engine (216), the masks of each set of predefined boundaries of the object with each other to obtain a set of predefined vectors to be stored in a database associated with the processor.
[0068] As illustrated in the exemplary pipeline, the system may comprise a mapping module (504) configured to download map tiles of a given location, a DL segmentation module (506), a post processing module (508), and a vectorization module (510) that outputs geospatial shape files (512).
[0069] In an embodiment, the DL segmentation module (506) may work on map tiles downloaded by the mapping module (504). The DL segmentation module (506) may take the map tile images as input and may generate at least four different pixel masks for each tile: background, building interior, building exterior and building separator masks, but not limited to the like. The DL segmentation module (506) may make use of a Reverse Mask R-CNN model, but is not limited to it, to obtain pixel masks in a cascaded manner. The deep-learning segmentation model may learn at least four attributes from the input image:
[0070] Background,
[0071] Building interior,
[0072] Building edges, and
[0073] Building separators (i.e., the gap between two close buildings).
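A minimal sketch of splitting the model output into these four pixel masks (the (4, H, W) score layout and the class ordering are illustrative assumptions):

```python
import numpy as np

CLASSES = ("background", "building_interior", "building_edge", "building_separator")

def scores_to_masks(scores: np.ndarray) -> dict:
    """Turn a (4, H, W) per-pixel class score map into four binary masks."""
    labels = scores.argmax(axis=0)  # winning class index per pixel
    return {name: (labels == i).astype(np.uint8)
            for i, name in enumerate(CLASSES)}
```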
[0074] In an embodiment, the post processing module (508) may merge all four different types of pixel masks to create a single building polygon mask for each building. The post processing module (508) may also stitch overlapping building regions from multiple tiles to create a unified polygon mask for the building.
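The stitching step could look like the following sketch, assuming square tiles addressed by (x, y) indices; the tile size and layout are illustrative assumptions:

```python
import numpy as np

def stitch_tiles(tile_masks: dict, tile_size: int = 256) -> np.ndarray:
    """Stitch per-tile building masks, keyed by (tile_x, tile_y), into one
    mosaic so a building spanning several tiles gets a unified mask."""
    xs = [x for x, _ in tile_masks]
    ys = [y for _, y in tile_masks]
    width = (max(xs) - min(xs) + 1) * tile_size
    height = (max(ys) - min(ys) + 1) * tile_size
    mosaic = np.zeros((height, width), dtype=np.uint8)
    for (x, y), mask in tile_masks.items():
        ox = (x - min(xs)) * tile_size
        oy = (y - min(ys)) * tile_size
        mosaic[oy:oy + tile_size, ox:ox + tile_size] |= mask  # OR overlapping pixels
    return mosaic
```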
[0075] In another embodiment, the vectorization module (510) may take the polygon masks and convert them into geospatial shape files (512) ready for geospatial database ingestion.
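As a sketch of the vectorization step (the use of shapely/geopandas and the assumption that polygon vertices are already georeferenced to WGS84 are illustrative, not prescribed by the disclosure):

```python
import geopandas as gpd
from shapely.geometry import Polygon

def polygons_to_shapefile(polygons, out_path: str = "buildings.shp") -> None:
    """Write building footprint polygons ((N, 2) arrays of lon/lat vertices)
    to an ESRI shapefile ready for geospatial database ingestion."""
    gdf = gpd.GeoDataFrame(geometry=[Polygon(p) for p in polygons],
                           crs="EPSG:4326")  # WGS84 latitude/longitude
    gdf.to_file(out_path)
```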
[0077] In an embodiment, the overall pipeline is a combination of cascaded segmentation and oriented bounding box detection. In essence, the building extraction task is conceptualised as a multi-stage hierarchical learning problem. At the end of each stage, the output feature map is fused with the input RGB image, and the concatenated representation becomes the input to the next stage. This strategy may allow higher-level concepts to be learnt from raw images in a bottom-up manner, with a gradual increase in the learning complexity at each stage, by exploiting a prior in the form of the previous stage's output to learn the next stage's features. This ensures that the model is never overwhelmed by the feature complexity in the initial stages. Unlike conventional object detection, the adopted approach estimates the pose, shape and size simultaneously. It also overcomes the convergence issues found in its five-variable oriented object detection counterparts due to the use of a consistent scale and unit in all eight variables.
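The channel bookkeeping of this fusion strategy can be sketched as below (a minimal PyTorch illustration; the three stage networks are stubs, and only the 3-, 4- and 7-channel inputs follow the text):

```python
import torch
import torch.nn as nn

class CascadedPipeline(nn.Module):
    """Fuses each stage's output with the RGB input before the next stage."""

    def __init__(self, stage1: nn.Module, stage2: nn.Module, stage3: nn.Module):
        super().__init__()
        self.stage1, self.stage2, self.stage3 = stage1, stage2, stage3

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        zone = self.stage1(rgb)                          # (B, 1, H, W) confidence zone
        parts = self.stage2(torch.cat([rgb, zone], 1))   # 4-channel input: RGB + zone
        final = self.stage3(torch.cat([rgb, parts], 1))  # 7-channel input: RGB + 4 part masks
        return final
```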
[0078] In an exemplary embodiment, the cascaded framework may comprise three segmentation stages, namely a confidence zone segmentation module (702), a part segmentation module (704) and a diffusion module (706), followed by an oriented object detection module (708).
[0079] In an exemplary embodiment, by way of example and not as a limitation, all three stages may share the same encoder-decoder architecture because of its ability to extract rotation-invariant representations. A ResNet-34 may be utilized as the encoder module, with dilated convolutions of a kernel size of, but not limited to, 3 and dilation sizes of, but not limited to, 2, 5, and 7 in the decoder.
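A minimal sketch of such a decoder block with the stated dilation rates (the parallel-branch wiring and channel sizes are assumptions; the source only names the encoder and the kernel/dilation sizes):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

class DilatedDecoderBlock(nn.Module):
    """Parallel 3x3 convolutions with dilation rates 2, 5 and 7, summed."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, dilation=d, padding=d)
            for d in (2, 5, 7)  # padding=d keeps the spatial size for a 3x3 kernel
        ])
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(sum(branch(x) for branch in self.branches))

# ResNet-34 backbone (final pooling/fc layers dropped) as the encoder
encoder = nn.Sequential(*list(resnet34(weights=None).children())[:-2])
```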
[0080] In an exemplary embodiment, the confidence zone segmentation module (702) may be trained with the RGB image as input and ground truth binary masks representing building/no-building regions as the target output. At this stage, the network attempts to narrow down the area of interest and learns coarse structures with fuzzy boundaries that are used for subsequent learning in the upcoming stages.
[0081] In an exemplary embodiment, the learning complexity may be gradually increased by guiding the part segmentation module (704) to learn the geometric properties of buildings as the target output. A set of morphological operations like area opening, thinning and area closing may be applied to decompose the original ground truth mask into four classes, namely: building boundary, building interior, inter-building gaps (the strip of land separating two close buildings) and background.
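A hedged sketch of this decomposition follows; the exact morphological recipe (area thresholds, the thinning step) is not specified in the source, so simple erosion and closing stand in for it:

```python
import numpy as np
from skimage.morphology import (area_opening, area_closing,
                                binary_closing, binary_erosion, disk)

def decompose_ground_truth(gt: np.ndarray) -> np.ndarray:
    """Decompose a binary building mask into an HxW label map:
    0 background, 1 building boundary, 2 building interior, 3 inter-building gap."""
    clean = area_closing(area_opening(gt.astype(np.uint8))) > 0  # denoise small blobs/holes
    interior = binary_erosion(clean)   # shrink footprints to their interiors
    boundary = clean & ~interior       # the rim left behind by the erosion
    # Narrow background strips between close buildings: closing bridges them
    gap = binary_closing(clean, disk(3)) & ~clean
    labels = np.zeros(gt.shape, dtype=np.uint8)
    labels[gap] = 3
    labels[interior] = 2
    labels[boundary] = 1
    return labels
```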
[0082] Further, the part segmentation module (704) may be trained with a four-channel input consisting of the three RGB channels and the output from the confidence zone segmentation to yield the four classes of decomposed geometrical structures as the target output. Essentially, the part segmentation module may be forced to learn the bottleneck, i.e., the boundary and its vicinity information, utilising the prior coming from the previous network. Each pixel competes to be assigned to one of the four classes and implicitly captures the geometric properties which are otherwise difficult to learn, like the pixels between two close buildings.
[0083] In an exemplary embodiment, the Diffusion module (706) may be trained with a seven-channel input consisting of an RGB input image as well as the output masks from part segmentation. The target output is the final ground truth binary masks representing building/no-building regions. Essentially, the Diffusion module (706) performs anisotropic diffusion over implicit deep learning features with wider context.
[0084] In another exemplary embodiment, an Oriented Object Detection (OBD) module (708) may be applied on the extracted feature maps to regress a minimum rotated bounding box for each object instance.
[0085] In an exemplary embodiment, the confidence zone segmentation module (702) may have at least one target class (C=1) with, but not limited to, sigmoid as its final destination function. In another exemplary embodiment, the part segmentation module (704) may have at least four target classes (C=4) with, but not limited to, softmax as the final destination function. In another exemplary embodiment, the diffusion module (706) may have at least one target class (C=1) with, but not limited to, sigmoid as the final destination function. In yet another exemplary embodiment, the oriented object detection module (708) may have, but is not limited to, ResNet-34 as the encoder and regression as the final destination function.
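These per-stage output heads can be sketched as follows (the 64 input channels and 1x1 convolutions are illustrative assumptions; only the class counts, the sigmoid/softmax choices and the eight regressed variables come from the text):

```python
import torch.nn as nn

confidence_head = nn.Sequential(nn.Conv2d(64, 1, 1), nn.Sigmoid())       # C=1, sigmoid
part_head       = nn.Sequential(nn.Conv2d(64, 4, 1), nn.Softmax(dim=1))  # C=4, softmax
diffusion_head  = nn.Sequential(nn.Conv2d(64, 1, 1), nn.Sigmoid())       # C=1, sigmoid
obd_head        = nn.Conv2d(64, 8, 1)  # regresses the eight box variables directly
```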
[0087] In an exemplary embodiment, a computer system in which or with which embodiments of the present invention may be utilized may include a processor (970), a bus (920), memory and storage blocks, and a communication port (960), as described below.
[0088] Bus (920) communicatively couples processor(s) (970) with the other memory, storage and communication blocks. Bus (920) can be, e.g., a Peripheral Component Interconnect (PCI)/PCI Extended (PCI-X) bus, Small Computer System Interface (SCSI), USB or the like, for connecting expansion cards, drives and other subsystems, as well as other buses, such as a front side bus (FSB), which connects processor (970) to the software system.
[0089] Optionally, operator and administrative interfaces, e.g., a display, keyboard, joystick and a cursor control device, may also be coupled to bus (920) to support direct operator interaction with the computer system. Other operator and administrative interfaces can be provided through network connections connected through communication port (960). The external storage device (99) can be any kind of external hard-drive, floppy drive, IOMEGA® Zip Drive, Compact Disc-Read Only Memory (CD-ROM), Compact Disc-Re-Writable (CD-RW), or Digital Video Disk-Read Only Memory (DVD-ROM). Components described above are meant only to exemplify various possibilities. In no way should the aforementioned exemplary computer system limit the scope of the present disclosure.
[0090] The present disclosure provides a method and system for automated building footprint extraction from satellite imagery with a cascaded multitask segmentation framework, reverse Mask R-CNN. The unique solution provides an extremely low cost when compared to the specialized GIS talent needed to manually label and curate building boundaries, easy logistics to operationalize as any other software or application, and infinite scalability, as the process is completely automated and eliminates manual specialized labor (GIS experts/field agents), achieving results comparable to extraction by GIS specialists even in the most cluttered settlements in less time. Also, the method accounts for unplanned, remote, dense and cluttered urban as well as rural regions of developing nations like India, previously unaccounted for by the building detection community.
[0091] While the foregoing describes various embodiments of the invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof. The scope of the invention is determined by the claims that follow. The invention is not limited to the described embodiments, versions or examples, which are included to enable a person having ordinary skill in the art to make and use the invention when combined with information and knowledge available to the person having ordinary skill in the art.
ADVANTAGES OF THE PRESENT DISCLOSURE
[0092] The present disclosure provides for a simple, compact, portable and cost-effective imaging system for non-contact and non-destructive measurement of full-field 3-dimensional profile of the surfaces of buildings and structures of a given location.
[0093] The present disclosure provides for a real-time, remote, in-situ and extremely low-cost method when compared to the specialized GIS talent needed to manually label and curate building boundaries.
[0094] The present disclosure provides for a method with easy logistics that can be operationalized like any other software or application.
[0095] The present disclosure provides for a system and method that is infinitely scalable, as the process is completely automated and eliminates manual labor; scaling requires only adding more servers.
[0096] The present disclosure provides for periodically identifying new settlement clusters (newly constructed buildings).
[0097] The present disclosure provides for a system and method for extending the same solution to infrastructure components like roads, bridges, flyovers, etc.