Process and wearable device equipped with stereoscopic vision for helping the user
20170249863 · 2017-08-31
Inventors
CPC classification
G06F18/214
PHYSICS
G02C5/001
PHYSICS
H04N2213/008
ELECTRICITY
G09B21/008
PHYSICS
H04N13/239
ELECTRICITY
G06F3/0346
PHYSICS
International classification
Abstract
Wearable device for helping a user and process for using such device, which device includes at least one image and/or video acquisition unit of the stereoscopic or multicamera type, a data processing unit connected to the image and/or video acquisition unit, and a unit indicating information processed by the processing unit to the user. The device also includes additional sensors connected to the processing unit and is intended to perform a plurality of functions, which are activated and/or deactivated for defining a plurality of device operating states alternative to each other, there being provided a unit analyzing the three-dimensional structure of a scene and the signals generated by the sensors for assigning the operating state.
Claims
1. A wearable device for helping a user, comprising: at least one of an image or a video acquisition unit (1); a data processing unit (2) connected to the image and/or video acquisition unit (1); an indication unit (3) indicating information processed by the processing unit (2) to the user, wherein the image or video acquisition unit (1) is stereoscopic or multicamera, wherein the processing unit (2) is equipped with at least one programmable processor, which analyzes acquired images or videos, by evaluating a three-dimensional structure of a scene observed through the image or video acquisition unit (1) to detect objects of interest, and wherein the processing unit (2) is equipped with a mass memory (10); sensors connected to said processing unit (2), wherein said wearable device is designed to perform a plurality of functionalities, which are activated or deactivated for defining a plurality of device operating states (24, 25, 26, 27) alternative to each other; and a unit analyzing the three-dimensional structure of the scene and signals generated by said sensors to automatically assign one of the operating states (24, 25, 26, 27).
2. The wearable device according to claim 1, wherein said sensors comprise at least one microphone (15).
3. The wearable device according to claim 1, wherein said sensors comprise at least one inertial sensor (23).
4. The wearable device according to claim 1, further comprising a database (11) of known objects and a learning system for said database (11).
5. The wearable device according to claim 1, wherein the indication unit (3) comprises one or more loudspeakers.
6. The wearable device according to claim 5, wherein the loudspeakers comprise at least one earphone.
7. The wearable device according to claim 6, wherein the loudspeakers comprise a bone conduction sound system (30).
8. The wearable device according to claim 1, further comprising a control interface (14) connected to the processing unit (2), said control interface having one or more keys (16) and/or one or more microphones (15) in combination with a voice message recognition unit.
9. The wearable device according to claim 1, wherein the image or video acquisition unit (1) comprises at least two video cameras (9) arranged along a vertical axis with the wearable device in worn condition.
10. The wearable device according to claim 1, further comprising a supporting frame (22) provided with two ends (220) resting on the user's ears and with a connection body (221) to be placed behind the user's nape, the acquisition unit (1) being placed in proximity to one of the two ends (220).
11. The wearable device according to claim 1, wherein the wearable device is adapted to be connected to a pair of eyeglasses (4) with a coupling system (6).
12. A process of operation of a wearable device for helping a user, comprising the following steps: (a) acquiring one or more images or videos in stereoscopic or multicamera mode, which images or videos represent a scene; (b) processing the images or the videos to obtain a three-dimensional structure of the scene; (c) identifying objects in the scene by isolating the objects from a background; (d) indicating defined objects to the user, wherein the wearable device is designed to perform a plurality of functions, which are activated or deactivated to define a plurality of device operating states alternative to each other; (e) detecting signals with additional sensors; (f) analyzing in real-time detected signals and the three-dimensional structure of the scene; and (g) assigning or keeping the device operating state based on a performed analysis.
13. The process according to claim 12, further comprising a recording of audio signals and a processing of the user's speech to trigger one or more of steps (a)-(g).
14. The process according to claim 12, wherein identifying the defined objects comprises one or more of detecting characteristic features of an object in the image or searching for the characteristic features of the object that correspond with each other in a plurality of video frames, and grouping the characteristic features together in homogeneous groups.
15. The process according to claim 12, further comprising the step of providing a system managing a database of known objects, and a system for learning said objects.
16. The process according to claim 12, wherein the identified objects are classified by the wearable device with one or more classifiers with reference to a database of known objects, further comprising the step of learning unknown identified objects by storing characteristic features of said unknown identified objects in said database.
17. The process according to claim 16, wherein, for each unknown identified object, the user is asked to pronounce a name of the unknown identified object, and the pronounced name is stored in association with the characteristic features.
18. The process according to claim 12, wherein the objects isolated from the background are analyzed to evaluate whether said objects have alphanumeric characters, said alphanumeric characters being automatically identified and read to the user by means of OCR (Optical Character Recognition).
19. The process according to claim 18, wherein, if a plurality of groups of alphanumeric characters is present, a priority value is assigned to each group based on criteria that include a distance of the group from a fixation point, said fixation point being defined as a center of the acquired image, a distance of the group from the user, or a dimension of a font, and wherein the groups of characters are read in priority order.
20. An eyeglass frame (4) for helping a user, said eyeglass frame having side arms and a wearable device integrated in said side arms, wherein: (a) the wearable device is the wearable device according to claim 1; and (b) the eyeglass frame is adapted to perform a process based on stereoscopic or multicamera vision, which comprises: acquiring one or more images or videos in stereoscopic or multicamera mode, which images or videos represent a scene; processing the images or the videos to obtain a three-dimensional structure of the scene; identifying objects in the scene by isolating the objects from a background; indicating defined objects to the user, wherein the wearable device is designed to perform a plurality of functions, which are activated or deactivated to define a plurality of device operating states alternative to each other; detecting signals with additional sensors; analyzing in real-time detected signals and the three-dimensional structure of the scene; and assigning or keeping the device operating state based on a performed analysis.
Description
[0051] These and other characteristics and advantages of the present invention will become clearer from the following description of some embodiments shown in the annexed drawings, wherein:
[0052]
[0053]
[0054]
[0055]
[0056]
[0057]
[0058] The device, wearable by the user, provides a data processing unit 2 connectable to a stereoscopic or multicamera image and/or video acquisition unit 1. The processing unit 2 can receive images or video sequences from the stereoscopic or multicamera image and/or video acquisition unit 1, which images or video sequences are processed in order to provide the information required by the user. By using stereoscopy the device can distinguish bodies in the foreground from those in the background, and it can therefore perform different actions depending on the command given by the user and on the context.
[0059] The device is also equipped with a mass memory system, used for all purposes requiring data storage.
[0060] In the example of
[0061] In
[0062] The device is provided with two bone conduction sound systems 30, placed such to lean on the user in the region of the temples with the device in the worn condition.
[0063] In the embodiment of
[0064] As an alternative the device can be composed of one or more housings coupled to the eyeglasses 4, as shown in
[0065] In order to couple the visual sensor to the eyeglasses, coupling means 6 are used, visible in
[0066] As an alternative it is possible to provide the whole device, suitably miniaturized, integrated into the frame of the eyeglasses 4.
[0067] The stereoscopic or multicamera image and/or video acquisition unit 1 mentioned above can be composed of several video cameras, in particular a pair of video cameras 9, or of a video camera and/or other sensors that give the device knowledge of the three-dimensional structure of the scene at each moment, namely in real time, regardless of the condition of the device (for example, it is not necessary for the device to be in motion to perceive three-dimensionality). Said video cameras 9 can be monochromatic, RGB or of the NIR type (using near infrared).
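The knowledge of three-dimensional structure mentioned above typically comes from disparity between the two rectified camera views. As a minimal sketch (not part of the patent; the focal length, baseline and disparity values below are invented for illustration), depth can be recovered with the classical pinhole relation Z = f·B/d:

```python
# Illustrative sketch: depth from disparity for a rectified stereo pair,
# such as the pair of video cameras 9. All numeric values are hypothetical.

def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth Z = f * B / d for a rectified, calibrated stereo pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a visible point")
    return focal_px * baseline_m / disparity_px

# Example: 700 px focal length, 6 cm baseline, 35 px measured disparity
z = depth_from_disparity(35, 700.0, 0.06)
print(round(z, 2))  # depth in metres
```

Closer objects produce larger disparities, which is what lets the device separate foreground bodies from the background without any motion of the device itself.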
[0068] In the rear part, opposite to the video cameras 9, the housing of the visual system 1 has connectors 5, visible in
[0069] The device is equipped with microphones connected to the processing unit, so as to record and process the recorded sounds for interpreting the user's commands. This speech recognition operation can take place both on the main processing unit and on dedicated cards. The presence of several microphones allows noise reduction techniques to be applied, so as to further emphasize the user's voice with respect to other noises present.
[0070] The device can give auditory feedback to the user, namely it can communicate with him/her, through earphones 3, as shown in
[0071] The processing unit 2 can be contained within a housing mountable on the eyeglasses 4, as shown in
[0072] The unit is provided with a connection port for input and output data, in the example in the figure a USB port 8, and one connector 7 for recharging the batteries.
[0073]
[0074] The processing unit 2 processes the images and it communicates to the user processed information by means of the sound indication unit 3.
[0075] There is provided a memory unit 10 where data are saved, which contains a database 11 of known data for object recognition. Such memory unit can be put in communication with a remote server to remotely save data or to receive updates or training of the database 11. However, the connection to the server is optional and the device can work in complete autonomy.
[0076] The processing unit 2 can be set by a control interface 14 comprising a microphone 15 and a push-button panel 16, visible for example in
[0077] By pressing the buttons with which it is equipped, or by speech commands, it is possible to access the functionalities provided by the device.
[0078] The device is provided with means 18 for reading printed texts and communicating them to the user: it acquires images, extracting the position of possible printed texts by means of algorithms similar or equal to those present in the prior art, then it applies optical character recognition (OCR) techniques to obtain a representation of the text and to communicate it to the user by the methods described above. In case of ambiguity, or in the presence of several texts within the image, the device asks for user intervention to decide which text has to be communicated, or it chooses on the basis of pre-programmed logic, such as closeness to the fixation point. The fixation point is identified as the center of the acquired images.
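The "closeness to the fixation point" criterion (also recited in claim 19) can be sketched as a simple priority ordering. The following is an illustrative example, not the patent's implementation; the group names, coordinates and image size are invented:

```python
# Hypothetical sketch of the reading-priority logic: when several text
# groups are detected, read first the group whose centre lies closest to
# the fixation point, defined as the centre of the acquired image.
import math

def read_order(groups, image_w, image_h):
    """Sort text groups by distance of their centre from the fixation point."""
    fx, fy = image_w / 2, image_h / 2
    def priority(group):
        gx, gy = group["centre"]
        return math.hypot(gx - fx, gy - fy)
    return sorted(groups, key=priority)

groups = [
    {"text": "EXIT", "centre": (600, 50)},        # far from image centre
    {"text": "Platform 2", "centre": (330, 250)}, # near the fixation point
]
for g in read_order(groups, 640, 480):
    print(g["text"])
```

Claim 19 also names distance from the user and font size as criteria; those could be folded into the same priority function as additional weighted terms.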
[0079] The device is provided with crosswalk recognition means 17: the device analyses acquired images, and by using dedicated algorithms, it audibly indicates to the user whether crosswalks are present or not in his/her field of vision and, if any, it can communicate their position and orientation to the user.
[0080] The device is provided with face recognition means 21: by analyzing the acquired images, the device can determine whether a face is present in the scene and, if it is known, identify it. This recognition procedure can use stereoscopy to separate the face from the background in the acquired images. In order to identify such a face the device can use one or more classifiers specifically trained for the purpose. The device can further learn new faces, so as to adapt itself to the life of each user. The device can therefore keep within the database 11 the features of known faces. Such features can for example comprise gradients, intensity, colour, texture and contrast.
[0081] The device is provided with means 19 for identifying and recognizing a plurality of objects: by analysing the acquired images the device can extract the features of the scene (some of them have been described in the previous paragraph). To do this it can exploit the advantages offered by stereoscopic vision. The database 11 in this case contains the features of recognizable objects. The features identified by the processing unit 2 in the acquired images are then compared with the features of the objects present in the database 11. By means of trained classifiers, or of the differences between the features of the several objects, it is possible for the device to communicate the object, if identified, to the user, and to indicate its position, allowing the user to reach it.
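The comparison between scene features and database features can be sketched as a nearest-neighbour lookup with a rejection threshold. This is an assumed, minimal stand-in for the trained classifiers the patent mentions; the feature vectors, object names and threshold are all invented:

```python
# Illustrative sketch: match a feature vector extracted from the scene
# against the feature vectors stored in the database 11. An object is
# reported only if its distance falls under a rejection threshold;
# otherwise it is treated as unknown (a candidate for the learning step).

def classify(scene_features, database, max_distance=1.0):
    """Return the name of the closest known object, or None if unknown."""
    best_name, best_dist = None, float("inf")
    for name, ref in database.items():
        dist = sum((a - b) ** 2 for a, b in zip(scene_features, ref)) ** 0.5
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= max_distance else None

database = {"cup": [0.9, 0.1, 0.3], "keys": [0.2, 0.8, 0.7]}
print(classify([0.85, 0.15, 0.25], database))  # close to "cup"
print(classify([5.0, 5.0, 5.0], database))     # nothing nearby -> None
```

Returning `None` for distant vectors mirrors the device's behaviour of passing to the Learning state when an analysed object is unknown.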
[0082] The device can further learn new objects and/or improve the representation of those already learnt. The start of the learning procedure is decided by the user by a speech command or by pressing the buttons present on the device. Once the object to be learnt is within the field of vision of the video cameras, in order to identify it and separate it from the rest of the scene the device can use techniques related to stereoscopic vision, contour analysis or colour segmentation. Thus the computation of the main features is much more accurate. Once enough features have been acquired to identify the object from different points of view, the device can ask the user to assign a name to it by generating a sound. Such sound can be recorded by the device and used by the device when it has to refer to the object when talking to the user, and vice versa by the user when he/she has to refer to the object when talking to the device. Such sound can both be recorded as it is and stored in the memory, and be processed by a speech recognition engine to produce a text representation thereof, with the same purposes just mentioned.
[0083] When analysing the scene, the device can use images and video sequences acquired by several video cameras 9, as well as a time filtering thereof. Thus the accuracy of detection and classification algorithms is improved.
[0084] The device is provided with means 20 for recognizing obstacles present on the path of the user. This is performed by an extensive use of stereoscopy to infer the three-dimensional structure of the observed scene. The obstacles in the field of vision of the user (for example people, steps, protrusions) can be recognized and communicated to the user, who thus has the possibility of acting accordingly.
[0085] The device can be characterized by several operation modes: a first one, intended to reply to questions of the user about the scene, such as the presence of crosswalks or the identification of an object held in the hands, and a second one characterized by a higher autonomy of the device, which will warn the user if something important is detected (for example, asking the device to warn the user as soon as it detects crosswalks, or as soon as it detects a particular object he/she is looking for in the house).
[0086] With reference to the classification processes mentioned above, a general learning process will be performed by presenting the stimulus to be learnt many times, so that the device can extract its essential features (such as those mentioned above). The latter will be associated with the stimulus and will be used for identifying it. As regards some classes of objects, it is possible for the device to already have in memory a reduced database with main recognition functionalities (for example banknotes, road signs), so as to guarantee a minimum set of functionalities even when it is started for the first time.
[0087] With reference to the image or video processing required by the several functionalities of the device, this can comprise, but not exclusively, disparity analysis, optical flow calculation, colour segmentation, connected component analysis, differences between frames, neural network applications and time filtering.
[0088] The device can be connected to a personal computer by a cable, for example a USB cable, or by a wireless connection. Such cable, in addition to providing the power supply for recharging the battery (which can also take place by means of other types of standard dedicated chargers), will allow a possible internet connection to be used for obtaining updates. In this step, the device can also communicate with a central server, so as to further refine the classifiers intended to recognize people and objects (a learning/improving process of the "offline" type).
[0089]
[0090] The device is intended to perform a plurality of functionalities, which functionalities are activated and/or deactivated for defining a plurality of device operating states alternative to each other.
[0091] Such states can be for example “Idle” 24, where the device is in standby; “Recognition” 25 where the three-dimensional scene is analysed, the objects are isolated from the background and such objects are recognized and communicated to the user; “Learning” 26, where the user provides the identity of unknown objects, that therefore are learnt by the device; “Reading” 27 where a text recognized by OCR is read to the user; “Navigation” 28 where the three-dimensional scene is monitored for detecting obstacles, crosswalks, indication of the path and the like.
[0092] A real-time analysis is performed for the signals detected by the sensors 15 and 23 and the three-dimensional structure of the scene acquired by the unit 1, and therefore the operating state is assigned or kept on the basis of the performed analysis.
[0093] From Idle state 24 it is possible to pass to Recognition state 25 when the acquisition unit 1 detects close objects. Thus the object is analysed and recognized by the device.
[0094] If the object analysed by vision is unknown, the device passes to the Learning state 26, where it is possible for the user to define the object by voice.
[0095] The Recognition state 25, and possibly also the Learning state 26 can be assigned by speech activation, by recognizing speech instructions of the user.
[0096] Once recognition or learning have ended, the device automatically goes back to Idle 24.
[0097] If, on the contrary, vision detects a text by OCR, the state passes from Recognition 25 to Reading 27, and the text is read to the user. Once the text has ended, the device automatically goes back to Idle 24.
[0098] If the inertial sensor 23 detects a rapid movement of the head, the system passes to Idle 24 and reading can then start again later. The movement is identified as a change of attention of the user from the text to a further object, for example as a reply to a call.
[0099] If the inertial sensor 23 detects a walk, acting for example as a pedometer, the Navigation state 28 is assigned. When the inertial sensor 23 detects the walk has stopped, the device automatically passes to Idle 24, and Navigation state 28 can be again activated if the user starts walking again.
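The transitions of paragraphs [0093] to [0099] form a small finite-state machine. The sketch below is illustrative only: the event names are invented, whereas in the actual device the events would be derived from the microphone 15, the inertial sensor 23 and the three-dimensional scene analysis:

```python
# Illustrative sketch of the operating-state machine: Idle 24,
# Recognition 25, Learning 26, Reading 27 and Navigation 28.
# Unlisted (state, event) pairs keep the current state.

TRANSITIONS = {
    ("Idle", "close_object"): "Recognition",       # [0093]
    ("Recognition", "unknown_object"): "Learning", # [0094]
    ("Recognition", "done"): "Idle",               # [0096]
    ("Learning", "done"): "Idle",                  # [0096]
    ("Recognition", "text_detected"): "Reading",   # [0097]
    ("Reading", "text_ended"): "Idle",             # [0097]
    ("Reading", "head_moved"): "Idle",             # [0098]
    ("Idle", "walking"): "Navigation",             # [0099]
    ("Navigation", "stopped"): "Idle",             # [0099]
}

def next_state(state, event):
    """Assign a new operating state, or keep the current one."""
    return TRANSITIONS.get((state, event), state)

state = "Idle"
for event in ["close_object", "text_detected", "head_moved"]:
    state = next_state(state, event)
print(state)  # a rapid head movement during reading returns to Idle
```

Keeping the current state for unlisted events matches step (g) of claim 12, where the analysis either assigns a new operating state or keeps the existing one.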
[0100] The invention described in this manner is not limited, as regards shape and process, to the examples provided in the previous paragraphs. Many variants can characterize it, starting from the shape, the anatomical position where the housings are worn, and the type of data processing unit and video camera used. Moreover, as regards the process, the same result can often be achieved by following different approaches; therefore those provided in the description must not be considered limitative.