System and method for recognizing environment and/or location using object identification techniques
11605054 · 2023-03-14
Assignee
Inventors
Cpc classification
G06F16/00
PHYSICS
G06F7/00
PHYSICS
G06F16/9537
PHYSICS
G06V20/46
PHYSICS
G06Q10/025
PHYSICS
International classification
G06F16/9537
PHYSICS
Abstract
The system and method of the present invention identify an environment or a location out of a sequence of information received in a video format. The invention provides a learning system: the more videos relating to a specific environment/location that are received, the more accurate the identification will be when a different image is later analyzed, including the ability to identify the environment/location seen from different viewpoints and angles.
Claims
1. A method for planning a trip by identifying a location of an unidentified environment depicted in a video stream, comprising: a) receiving at least one video stream, said video stream including at least a first plurality of frames each including an image of an object of interest and an environment of surroundings of said object of interest that are identifiable with said object of interest; b) breaking said at least one video stream into disunited frames; c) sending each of said disunited frames to a location identification engine which is configured to identify a location that is visible in the sent frame, wherein said location identification engine is trained through deep learning techniques to compare images within each disunited frame with images stored in an image recognition database; d) for each of the first plurality of frames, identifying a location of the object of interest on a basis of the comparison performed with the location identification engine, and further identifying a location of the environment of the surroundings of the object of interest; e) if one or more of said disunited frames is unidentified by said location identification engine, invoking a completion algorithm to identify an environment that is visible in a given one of said unidentified frames, whereby said completion algorithm is operable to identify one or more frames within the first plurality of frames with an identified location and being associated with an identified environment, to determine that an environment shown in said unidentified frame is the same as the environment with the identified location, and to set a location of the given unidentified frame with the location of the environment with the identified location; f) presenting videos with the identified location that are relevant to a requested destination; g) allowing a user to choose attractions the user wants to visit by viewing the videos and selecting at least a portion thereof; and h) outputting a trip plan recommendation according to the videos provided with the identified location and according to user selections during the allowing step, wherein said trip plan recommendation includes a list of sites to visit.
2. The method according to claim 1, wherein the frames broken from the at least one video stream are a plurality of sequential frames.
3. The method according to claim 2, wherein the plurality of sequential frames broken from the at least one video stream include all the frames in a sequence.
4. The method according to claim 1, further comprising the step of identifying a location that is visible in the sent frame, if not identified by the location identification engine, according to keywords derived from an audio transcript of the sent frame.
5. The method according to claim 4, further comprising the steps of: i) using a public API database to identify a location that is visible in the sent frame, if not identified according to an audio of the sent frame; ii) manually approving said identification; iii) adding the identified location using the public API database to the image recognition database after such approval; and iv) invoking the completion algorithm if a location that is visible in the sent frame is not identified using the public API database.
6. The method according to claim 5, further comprising the step of adding metadata of each of the disunited frames associated with the identified location to the image recognition database.
7. The method according to claim 1, wherein the video stream is acquired using a crawler.
8. The method according to claim 7, wherein additional data acquired by the crawler is related to one or more of the groups consisting of text, photos, audio, maps reviews, descriptions and ratings.
9. The method according to claim 1, wherein the received at least one video stream is a 360-degree video whereby images of an object are taken at different angles and are connected into a panoramic rounding frame, and the rounding frames of the 360-degree video are sent to the image recognition database.
10. The method according to claim 1, further comprising, following setting of the location of each of the unidentified disunited frames, adding each of the unidentified disunited frames whose location has been set and accompanying metadata to the image recognition database.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1)
(2)
(3)
DETAILED DESCRIPTION OF THE EMBODIMENTS OF THE INVENTION
(4) The system and method of the present invention identify an environment or a location out of a sequence of information received in a video format. The invention provides a learning system: the more videos relating to a specific environment/location that are received, the more accurate the identification will be when a different image is later analyzed, including the ability to identify the environment/location seen from different viewpoints and angles. The system of the present invention employs a Deep Learning algorithm. Deep learning algorithms are known to the skilled person and, therefore, are not discussed in detail herein, for the sake of brevity. The invention makes it possible to identify the location appearing in an image, even if the image does not exist in the system database and even if the frame was taken at a unique angle which does not exist in the system database. The system and method of the invention identify objects which are related to a specific environment/location and recognize the environment in which the object is located, even if the environment is not shown with the object related to it. Furthermore, the system of the invention also identifies the environment/location shown in a frame even if the object related to it is not shown. For example: the Eiffel Tower can be identified even if the analyzed image only shows the garden near the Eiffel Tower without showing the Tower itself.
(5) The method of the invention comprises two stages: A. A training stage, which is the stage of the construction of the image recognition database of the system, with the information of the frame location and the objects that are associated and/or tagged to said locations. The image recognition database can always be updated, even after the system is running. B. An identification stage, which recognizes the environment/location of an image presented to the system, on the basis of the information found in the image recognition database. As the system of the invention is a learning system, the identification stage can always be updated with new locations and new frames that are added to the system as part of the training stage.
(6) Prior art systems focus on analyzing an image to recognize objects found therein. In contrast, the invention assigns context to the objects in an image by using a plurality of images taken from a video stream. Thus, for instance, when a video camera sweeps a location and shows objects in the frames it creates from different points of view, the invention uses a plurality of images from the video to create context for the objects. Taking the simple example of the Eiffel Tower, when a frame in a video is seen showing a garden, after the previous or later frames showed the Eiffel Tower, context is generated for the garden that associates it with the Eiffel Tower, even though the Tower itself is never seen in the image of the garden. Accordingly, in an embodiment of the invention, the system comprises a location identification engine, which receives videos as an input, processes the videos and identifies and tags the locations in a plurality of frames of the videos. The engine uses sophisticated vision processing techniques along with public reverse image search APIs and a private image search DB. The private image search DB can be, for instance, a deep learning neural network.
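By way of illustration only, the way context from neighboring frames can be used to set the location of an unidentified frame may be sketched as follows. The sketch assumes that an unidentified frame shows the same environment as the identified frame closest to it in the video sequence; this nearest-frame rule, the function name `complete_locations` and the sample data are hypothetical and are not taken from the embodiments described herein.

```python
from typing import List, Optional


def complete_locations(frame_locations: List[Optional[str]]) -> List[Optional[str]]:
    """Fill unidentified frames (None) with the location of the nearest identified frame.

    Assumption: proximity in the frame sequence stands in for "same environment".
    On a tie, the earlier identified frame is used.
    """
    identified = [i for i, loc in enumerate(frame_locations) if loc is not None]
    if not identified:
        # No identified frame exists, so there is no context to propagate.
        return list(frame_locations)

    completed: List[Optional[str]] = []
    for i, loc in enumerate(frame_locations):
        if loc is None:
            nearest = min(identified, key=lambda j: abs(j - i))
            loc = frame_locations[nearest]
        completed.append(loc)
    return completed


# The garden frames (None) follow frames in which the Tower itself was identified.
frames = ["Eiffel Tower", "Eiffel Tower", None, None, "Louvre"]
print(complete_locations(frames))
```

A production system would likely compare visual similarity between frames rather than rely on sequence position alone; the sketch only shows how identified frames can lend their location to unidentified ones.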
(7)
(8) As stated above, the location identification engine of the present invention uses Deep Learning techniques. In an embodiment of the invention the system uses the Convolutional Neural Networks algorithm. This algorithm is a variant of deep learning algorithms, which mimics the way animals and humans process visual elements and categorize objects. The method of the present invention uses Convolutional Neural Networks to categorize the different images according to their location/site IDs. As part of the CNN (convolutional neural network), sub-categories are used in order to be able to identify a single site from different angles. As already stated above, since such algorithms are known to the person skilled in the art, they are not explained herein in detail, for the sake of brevity.
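One possible reading of the use of sub-categories is that each site is split into several view-specific classes and the class scores are merged back into a single site ID. The following minimal sketch assumes that the network outputs one probability per sub-category and that labels take the hypothetical form `site_id/view`; the label format, the summation rule and the function name are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Tuple


def site_from_subcategories(probabilities: Dict[str, float]) -> Tuple[str, float]:
    """Sum per-view probabilities for each site and return the best site with its total.

    Labels are assumed to look like "site_id/view" (hypothetical convention).
    """
    totals: Dict[str, float] = defaultdict(float)
    for label, probability in probabilities.items():
        site_id, _, _view = label.partition("/")
        totals[site_id] += probability
    best_site = max(totals, key=lambda site: totals[site])
    return best_site, totals[best_site]


# Hypothetical classifier output for one frame.
scores = {
    "eiffel_tower/front": 0.30,
    "eiffel_tower/garden": 0.25,
    "louvre/pyramid": 0.45,
}
site, total = site_from_subcategories(scores)
print(site, round(total, 2))
```

In this example no single view of the Eiffel Tower outscores the Louvre class, yet the two views together do, which is the kind of effect sub-categories are intended to achieve.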
(9) In an embodiment of the invention, the location identification engine of the present invention uses a Reverse Image Search DB technology. A reverse image search DB makes it possible to index photos with identifiers (every photo has a key/ID identifying it) in the database and, later on, to send a request with one photo, which the DB matches against the photos already indexed in the database, using various image pattern matching algorithms. The result set is a list of “Hits”, every hit including the hit photo identifier and a score indicating how closely the indexed photo matches the one given in the request.
(10) In one embodiment of the invention, when indexing a photo, the photo itself is not saved in the DB. The photo is encoded in a binary form which later allows other photos to be matched to it.
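The index-then-match behavior described above may be sketched as follows. The sketch substitutes a very simple technique, an average-based binary signature compared by Hamming distance, for the unspecified pattern matching algorithms; the class `ReverseImageIndex`, the function `signature` and the tiny 2x2 sample images are hypothetical. All images are assumed to have the same number of pixels.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]  # grayscale pixel rows, values 0-255


def signature(image: Image) -> int:
    """Encode an image as bits: 1 where a pixel is brighter than the image mean."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


class ReverseImageIndex:
    def __init__(self, n_bits: int) -> None:
        self.n_bits = n_bits  # number of pixels per image
        self._signatures: Dict[str, int] = {}

    def index(self, photo_id: str, image: Image) -> None:
        # Only the binary signature is kept; the photo itself is not stored.
        self._signatures[photo_id] = signature(image)

    def search(self, image: Image, min_score: float = 0.0) -> List[Tuple[str, float]]:
        """Return hits as (photo_id, score) pairs, best match first."""
        query = signature(image)
        hits = []
        for photo_id, stored in self._signatures.items():
            distance = bin(query ^ stored).count("1")  # Hamming distance
            score = 1.0 - distance / self.n_bits
            if score >= min_score:
                hits.append((photo_id, score))
        return sorted(hits, key=lambda hit: hit[1], reverse=True)


db = ReverseImageIndex(n_bits=4)
db.index("tower", [[200, 10], [200, 10]])
db.index("garden", [[10, 10], [200, 200]])
print(db.search([[190, 20], [210, 5]]))
```

The query image differs from the indexed "tower" image in its exact pixel values but yields the same signature, so it is returned as the top hit.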
(11)
(12) The step of relevant video collection is performed by the web crawler 201. The crawler 201 connects to video sites such as YouTube (or Vimeo, Netflix, Yahoo Screen or any other video web site) and conducts search queries on the videos in the site with pre-configured keywords. The selected videos can be regular videos or 360-degree videos. For example, the crawler can be configured as follows: crawl the YouTube web site; look for videos whose title contains the words “Paris” and “attractions”; filter the result set for videos longer than 10 minutes; filter out results whose title contains the words “slideshow” and “landing”.
(13) The crawler traverses each video in the list received from the above query and also filters videos according to their transcript (most video sites include a transcript for the video). Videos whose transcript contains desired keywords (such as Paris or attractions) are collected. Videos whose transcript includes undesired keywords (such as slideshow) are removed from the final results list.
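The crawler configuration and filtering steps described above can be sketched as a plain filter over already-fetched search results. No real video-site API is called; the `Video` and `CrawlQuery` types and their field names are hypothetical. The sketch assumes that any one excluded or undesired word is enough to reject a video, and that any one desired transcript keyword is enough to keep it.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Video:
    title: str
    length_minutes: float
    transcript: str = ""


@dataclass
class CrawlQuery:
    required_title_words: Tuple[str, ...]
    excluded_title_words: Tuple[str, ...]
    min_length_minutes: float
    desired_transcript_words: Tuple[str, ...]
    undesired_transcript_words: Tuple[str, ...]


def _contains(text: str, word: str) -> bool:
    return word.lower() in text.lower()  # case-insensitive substring match


def select_videos(videos: Iterable[Video], query: CrawlQuery) -> List[Video]:
    selected = []
    for video in videos:
        if not all(_contains(video.title, w) for w in query.required_title_words):
            continue  # title lacks a required word
        if any(_contains(video.title, w) for w in query.excluded_title_words):
            continue  # title contains an excluded word
        if video.length_minutes <= query.min_length_minutes:
            continue  # too short
        if any(_contains(video.transcript, w) for w in query.undesired_transcript_words):
            continue  # transcript contains an undesired keyword
        if not any(_contains(video.transcript, w) for w in query.desired_transcript_words):
            continue  # transcript lacks every desired keyword
        selected.append(video)
    return selected


paris_query = CrawlQuery(
    required_title_words=("Paris", "attractions"),
    excluded_title_words=("slideshow", "landing"),
    min_length_minutes=10,
    desired_transcript_words=("Paris", "attractions"),
    undesired_transcript_words=("slideshow",),
)
results = [
    Video("Top Paris attractions", 12, "Welcome to Paris, city of light"),
    Video("Paris attractions slideshow", 15, "Paris"),
    Video("Paris attractions in 5 minutes", 5, "Paris"),
    Video("Paris attractions guide", 20, "A slideshow of photos from Paris"),
]
print([video.title for video in select_videos(results, paris_query)])
```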
(14) After each query, the crawler 201 analyzes the result set. For videos which are selected as relevant, a request is sent for similar videos and the system 200 processes them as well.
(15) The crawler runs according to a predefined schedule and sends different queries from time to time in order to get different results.
(16) The resulting video list is pushed to a queue of “Videos require processing” and waits for the location identification engine 202 to pull and process it. The location identification engine 202 automatically processes the videos and identifies and tags the location in each frame, as described above. After a video is auto-tagged, a data collector 203 pulls additional information from the web, such as photos, reviews, descriptions and ratings, for new locations/sites (locations that are inserted into the information DB for the first time) that were created during the processing and tagging of this video.
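The hand-off between the queue, the location identification engine and the data collector may be sketched as follows. The engine and the collector are passed in as plain callables so that the sketch stays self-contained; the function `process_pending`, its parameters and the stub data are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Set


def process_pending(
    pending: Deque[str],
    identify_sites: Callable[[str], List[str]],
    known_sites: Set[str],
    collect_data: Callable[[str], Dict[str, str]],
) -> Dict[str, Dict[str, str]]:
    """Pull videos from the queue, tag them, and collect data for newly created sites only."""
    collected: Dict[str, Dict[str, str]] = {}
    while pending:
        video_id = pending.popleft()
        for site in identify_sites(video_id):
            if site not in known_sites:
                # The site is new to the information DB, so the collector is invoked once.
                known_sites.add(site)
                collected[site] = collect_data(site)
    return collected


# Stubs standing in for the engine 202 and the data collector 203.
tags = {"video-1": ["Eiffel Tower", "Louvre"], "video-2": ["Louvre", "Notre-Dame"]}
queue = deque(["video-1", "video-2"])
known = {"Eiffel Tower"}
new_data = process_pending(
    queue,
    identify_sites=lambda video_id: tags[video_id],
    known_sites=known,
    collect_data=lambda site: {"description": f"placeholder for {site}"},
)
print(sorted(new_data))
```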
(17) The data collector 203 pulls the data from public APIs such as Google Places, TripAdvisor, Lonely Planet and so on for each new site created by the location identification engine.
(18) Once the auto-location identification engine 202 adds the tagged videos to the DB, the videos are still not usable by the product platforms that use the system of the invention (a trip planner web site, for example). In order for a video to be usable for such a specific purpose, it must be verified and approved by an administrator, who reviews the auto tags assigned by the auto-location identification engine. This process is conducted using a verification application 204. The application 204 allows opening a previously tagged video and reviewing the tags assigned to it. Using the application, the administrator can edit the tags, create new sites, update titles and descriptions, and so on. Of course, this process can also be fully automated so as to avoid human intervention, although in many cases the precision of identification may then fall below the level desired by a user.
(19) Once the administrator finishes the review, he can use the application to mark the video as approved. From this point on, the video is available in every product platform and it is also added to the training process so the system can “learn” from the approved results.
(20)
(21) Upon approval of a video, its tagged frames are sent to the image search DB for training. From this point on, when the location identification engine 202 searches for similar photos, it can retrieve their exact site identification from the Image DB.
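The review and approval flow described above can be sketched as a small in-memory store. Auto-tagged videos stay hidden from the product platforms until approved, and approval is the point at which their tagged frames are released for training. The class and method names are hypothetical, and the training set is represented merely as a list of tuples.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class TaggedVideo:
    video_id: str
    frame_tags: Dict[int, str]  # frame index -> site ID assigned by the engine
    approved: bool = False


class VerificationStore:
    def __init__(self) -> None:
        self.videos: Dict[str, TaggedVideo] = {}
        self.training_examples: List[Tuple[str, int, str]] = []

    def add_auto_tagged(self, video: TaggedVideo) -> None:
        self.videos[video.video_id] = video

    def edit_tag(self, video_id: str, frame_index: int, site_id: str) -> None:
        # The administrator corrects a tag before approval.
        self.videos[video_id].frame_tags[frame_index] = site_id

    def approve(self, video_id: str) -> None:
        video = self.videos[video_id]
        if video.approved:
            return  # already approved; do not send frames for training twice
        video.approved = True
        for frame_index, site_id in sorted(video.frame_tags.items()):
            self.training_examples.append((video_id, frame_index, site_id))

    def visible_to_platforms(self) -> List[str]:
        return [v.video_id for v in self.videos.values() if v.approved]


store = VerificationStore()
store.add_auto_tagged(TaggedVideo("video-1", {0: "Eiffel Tower", 1: "Louvre"}))
store.edit_tag("video-1", 1, "Eiffel Tower")  # correction made during review
store.approve("video-1")
print(store.visible_to_platforms(), store.training_examples)
```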
(22) An example of the use of the system and method of the present invention is a trip planning system based on video. It is to be understood that this is merely an example of a useful application and is not intended to limit the invention in any way. As will be apparent to the skilled person, the location recognizing method and system of the invention can be used for a variety of different applications that require recognizing a location from an image.
(23) The trip planning system allows creating a full trip from end to end in one stop, including selecting attractions, hotels, flights and rental cars, and booking it all. The trip planning system platform includes the following components:
(24) Web application;
(25) Mobile application;
(26) Web Service;
(27) Information DB; and
(28) Search DB.
(29) The web application according to one embodiment of the invention enables end users to build their trip plan end to end, using a web browser. The uniqueness of the trip planning system based on video analyzed according to the invention lies in the way in which a user selects the attractions he wants to visit. In trip planners known in the art, the user searches for sites using text, and in order to decide whether he wants to visit a site or not he is required to read data. However, in the trip planning system based on video according to the invention, the selection is made using videos. When the user enters the site, he watches videos of attractions, accommodation options, etc., and can then decide, according to what he sees, whether he would like to visit the location/site or not. In one embodiment, after the user selects the sites he desires to visit, the trip planning engine calculates the best hotels, flights and rental cars for the user and provides the user with a full trip plan for his convenience.
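The video-based selection described above may be sketched as two small steps: presenting the videos whose identified location is relevant to the requested destination, and deriving a list of sites to visit from the videos the user selected. The `LocatedVideo` type, its fields and both function names are hypothetical; the calculation of hotels, flights and rental cars is outside the scope of this sketch.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class LocatedVideo:
    video_id: str
    site: str         # location identified for the video
    destination: str  # destination the site belongs to


def videos_for_destination(videos: Iterable[LocatedVideo], destination: str) -> List[LocatedVideo]:
    """Return the videos to present for the requested destination."""
    return [v for v in videos if v.destination.lower() == destination.lower()]


def trip_plan(presented: Iterable[LocatedVideo], selected_ids: Iterable[str]) -> List[str]:
    """Return the distinct sites of the selected videos, in presentation order."""
    chosen = set(selected_ids)
    sites: List[str] = []
    for video in presented:
        if video.video_id in chosen and video.site not in sites:
            sites.append(video.site)
    return sites


catalog = [
    LocatedVideo("v1", "Eiffel Tower", "Paris"),
    LocatedVideo("v2", "Louvre", "Paris"),
    LocatedVideo("v3", "Colosseum", "Rome"),
    LocatedVideo("v4", "Eiffel Tower", "Paris"),
]
presented = videos_for_destination(catalog, "paris")
print(trip_plan(presented, ["v4", "v2", "v3"]))
```

The selection of "v3" is ignored because that video was never presented for the requested destination.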
(30) After a user creates a trip plan for himself using the web application, he can use a mobile app during his trip. The mobile application provides real-time information on the locations/sites the user selected for his trip, trip directions, attraction information (opening hours, prices, etc.), restaurant information and so on. The information is sent directly to the application once the user finishes planning the trip.
(31) The operational web service of the trip planning system provides the needed data and infrastructure to the various apps. The system exposes a RESTful API (a RESTful API is a type of API which works on top of the HTTP protocol) with all needed functionality.
(32) The verification application provides the ability to manage the information DB which holds the trip planning service data (update sites, remove, edit text, etc.).
(33) The illustrative system discussed with reference to
(34) All the data required for the trip planning system is held in an information DB based on relational technology. Currently, MS SQL DB on top of Azure is used; however, any other relational DB can be used, such as MySQL, LiteSql, Oracle etc. The information DB is populated using the auto-location identification engine and using data that arrives from the web site, such as users' data, ratings and so forth.
(35) The trip planning system according to one embodiment of the invention uses, in addition to the information DB, which is a relational DB, a search DB that is based on document indexing technologies. The search DB allows fetching data quickly according to keywords, and enables features such as search suggestions in the web site. The trip planning system uses, in one embodiment, a search DB called Solr, but can also use any other search DB such as Elasticsearch, Lucene or cloud search services. The search DB is accessed by the web site through the operational web service. The search DB is populated using data replication from the information DB according to a pre-defined schedule. According to an embodiment of the invention the entire trip planning system stack is deployed in cloud services, which allows the platform to scale up as much as needed.
(36) The information database is hosted, in one embodiment, in Azure SQL DB, while the rest of the components are hosted in virtual machines in the cloud.
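The replication from the information DB to the search DB, and the search suggestions it enables, may be sketched as follows. The sketch uses an in-memory SQLite database in place of the hosted relational DB and a simple prefix matcher in place of Solr; the `sites` table, its columns and the `SearchIndex` class are hypothetical. In a deployment, `replicate_from` would be invoked by a scheduler rather than called once.

```python
import sqlite3
from typing import List


def build_information_db() -> sqlite3.Connection:
    """Create a stand-in relational information DB with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sites (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO sites (name) VALUES (?)",
        [("Eiffel Tower",), ("Louvre Museum",), ("Luxembourg Gardens",)],
    )
    conn.commit()
    return conn


class SearchIndex:
    """Stand-in search DB holding a replicated copy of the site names."""

    def __init__(self) -> None:
        self._names: List[str] = []

    def replicate_from(self, conn: sqlite3.Connection) -> None:
        # Full copy on each run; a real replication job might copy only changes.
        rows = conn.execute("SELECT name FROM sites ORDER BY name").fetchall()
        self._names = [row[0] for row in rows]

    def suggest(self, prefix: str, limit: int = 5) -> List[str]:
        """Return names in which any word starts with the typed prefix."""
        p = prefix.lower()
        matches = [
            name for name in self._names
            if any(word.startswith(p) for word in name.lower().split())
        ]
        return matches[:limit]


information_db = build_information_db()
search_db = SearchIndex()
search_db.replicate_from(information_db)
print(search_db.suggest("l"))
```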
(37) All the above description of preferred embodiments has been provided for the purpose of illustration only and is not intended to limit the invention in any way, except as per the appended claims.