Portable system that allows blind or visually impaired persons to interpret the surrounding environment by sound and touch
11185445 · 2021-11-30
Inventors
CPC classification
H04N7/181
ELECTRICITY
A61F9/08
HUMAN NECESSITIES
G08B6/00
PHYSICS
G08B3/00
PHYSICS
International classification
A61F9/08
HUMAN NECESSITIES
G08B6/00
PHYSICS
G08B3/00
PHYSICS
H04N7/18
ELECTRICITY
Abstract
A portable system that allows blind or visually impaired persons to interpret the surrounding environment by sound or touch, said system comprising: two cameras separate from one another and configured to capture an image of the environment simultaneously, and means for generating sound and/or touch output signals. Advantageously, the system also comprises processing means connected to the cameras and to the means for generating sound and/or touch signals. The processing means are configured to combine the images captured in real time and to process the information associated with at least one vertical band with information relating to the depth of the elements in the combined image.
Claims
1. A portable system that allows blind or visually impaired persons to interpret in real time the surrounding environment by sound and touch, comprising: two cameras, separate from one another and configured to continuously capture images of the environment simultaneously, generating means for generating sound and touch output signals; processing means connected to the cameras and to the generating means for generating sound and touch signals, characterized in that said processing means are configured to combine the images captured in real time and to establish a vertical band (from the cameras' point of view) covering a total height of the combined image, and to generate sound and touch signals from pixels associated with depth information located exclusively within the vertical band, and said processing means also being configured to: divide the vertical band into at least two regions; define a sound and touch signal in each region according to the depth information of the region and the height of the region within the vertical band; and define a sound and touch output signal based on the sound and touch signals in each region of the vertical band.
2. The system according to claim 1, wherein the vertical band is a central band of the combined image.
3. The system according to claim 2, wherein the processing means are configured to process a plurality of side vertical bands in the combined image, on each side of the central vertical band, and characterized in that a left side signal and a right side signal are defined from the regions of each left side band and of each right side band, respectively.
4. The system according to claim 3, wherein the processing means are suitable for providing a simultaneous analysis of the plurality of side vertical bands, such that a segmentation region is processed horizontally on the complete image acquired by the cameras.
5. The system according to claim 4, wherein the operating mode of the processing means can be configured by the user, such that the mode of simultaneous analysis of the plurality of side vertical bands and the mode of analysis of a single vertical band can be activated and deactivated by said user.
6. The system according to claim 1, wherein the generating means operate in stereophonic mode, combining a left side sound and touch signal and a right side sound and touch signal, and/or wherein the sound generated is monaural, where both modalities can be selected by the user.
7. The system according to claim 1, wherein the processing means define a strength of the sound and touch signal according to the depth of the region.
8. The system according to claim 7, wherein the processing means define a frequency of the sound and touch signal according to the height of the region in the vertical band.
9. The system according to claim 8, wherein the processing means are configured to determine the depth of a region, according to grayscale color coding or by means of a color gradient, on a depth map of the image of the environment.
10. The system according to claim 1, comprising a support structure to be carried by the user, and configured to situate the reproduction means and the two cameras.
11. The system according to claim 1, wherein the touch signal is a signal generated by vibration.
12. The system according to claim 1, wherein the frequency of the sound signal is chosen from within the range between 100 Hz and 18000 Hz.
13. The system according to claim 1, wherein the generating means comprise bone conduction headphones.
14. The system according to claim 1, wherein the support structure is chosen from at least: glasses, a headband, neck support, pectoral support, shoulder support, hand support.
15. The system according to claim 1, comprising wired and/or wireless data transmission means connected to the processing unit, wherein said transmission means are connected to an external device with a wired and/or wireless connection, and/or to a wearable type of device.
16. The system according to claim 1, wherein the vertical band is defined in width by at least one pixel associated with depth information, and in height by the total height of the combined image in pixels associated with depth information.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DISCLOSURE OF THE INVENTION
(6) For the sake of greater clarity, an embodiment of the invention is described in a non-limiting manner in relation to the drawings and focusing on sound or touch signals.
(8) The cameras' own circuitry (3i, 3d) pre-processes the captured images to provide a steady flow of images, correcting geometric and chromatic artifacts and aberrations. The sensor circuitry delivers a pair of images synchronized in time.
(9) This video stream is then transmitted to a processing unit (2). The processing unit (2) is preferably a specific hardware design implementing the algorithm for converting images to audio/vibration. A cable (6) connects the cameras (3i, 3d) to the processing unit (2); nevertheless, wireless transmission is contemplated in other, more complex embodiments.
(10) The processing unit (2) converts the stereoscopic images into a grayscale depth map. A disparity map (without scale information) is generated first.
(11) Depth map is understood to be a grayscale image, in which the color called process black means maximum remoteness (depending on the scale used) and pure white means maximum closeness (depending on the scale used). The rest of the grays specify intermediate distances. Nevertheless, in other embodiments of the invention it is possible to reverse the contrast and make the darker colors correspond to the closest distances, or the use of a pre-established color scale similar to a thermographic representation.
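The grayscale convention described above can be sketched as a simple mapping. This is a minimal illustration only: the linear scale, the 8-bit range, and the `max_range_m` parameter are assumptions, since the patent leaves the exact scale open.

```python
def distance_to_gray(distance_m, max_range_m=6.0):
    """Map a real-world distance to an 8-bit gray value.

    Pure white (255) means maximum closeness, process black (0) means
    maximum remoteness, and intermediate grays specify intermediate
    distances. A linear scale is assumed here for illustration.
    """
    clipped = min(max(distance_m, 0.0), max_range_m)
    return round(255 * (1.0 - clipped / max_range_m))
```

Reversing the contrast, as the text contemplates, would amount to dropping the `1.0 -` term.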
(12) Disparity map is understood to be the resulting image that is obtained from the superimposition of a pair of stereo images, which are subjected to mathematical processing. The binocular disparity map expresses, in one image, the differences in pixel level between two stereo images. By means of applying the mathematical disparity algorithm, by knowing the distance between cameras and the camera calibration files, the difference between pixels can be adapted to real distances. The distance of the camera from each portion (pixel size) of the image taken is known as a result of this process. A grayscale is used to express that distance.
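Per pixel, the triangulation described here reduces to the standard stereo relation Z = f·B/d, where f is the focal length in pixels, B the distance between the cameras, and d the disparity. A sketch under assumed pinhole-camera conventions; `focal_px` and `baseline_m` are illustrative names, not values from the patent:

```python
def disparity_to_distance(disparity_px, focal_px, baseline_m):
    """Stereo triangulation: Z = focal_px * baseline_m / disparity_px.

    Larger disparities (bigger pixel differences between the two stereo
    images) correspond to closer objects.
    """
    if disparity_px <= 0:
        return float("inf")  # no measurable parallax: treat as remote
    return focal_px * baseline_m / disparity_px
```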
(13) The disparity map is then converted into a depth map: after a mathematical process in which grayscale levels are assigned to distances, a depth map is obtained.
(14) A conversion algorithm developed for this purpose is then applied to the generated depth map. It is a highly optimized algorithm, and therefore one requiring fewer computational resources, which allows spatial data relating to depth to be converted into audio more efficiently than in known systems.
(15) The result is that with an initial pair of stereo images, a non-verbal stereo sound signal is achieved which is transmitted to the user through cochlear headphones or through bone conduction (4i, 4d). Audiovisual language which reliably translates visual information into auditory information in an intuitive manner for the user is thereby defined.
(18) Using the information from the depth map, a matrix or table with information relating to the environment at that time is built. This information must be converted into audio according to the following considerations. Disparity mapping is performed with each pair of stereo frames: given the pixel-level differences between the images and using the camera data (FOV, interocular distance, specific calibration), triangulations can be established, so pixels can be associated with distances in the real world. With this information, the image is processed to provide a depth map: an outline, grayscale image of the objects expressing their volumes and real distances. This provides a single composite image containing spatial information relating to the scene. Example of scan operating mode in reference to
(19) A volume intensity (I) is associated with the grayscale value of a pixel. Thus a pixel with values (0,0,0) in the RGB model corresponds to a remote region, and the associated intensity is silence (I=0); a pixel with values (255,255,255) corresponds to a very close region, and the volume of the signal is maximum (I=0 dB). Each pixel can thereby be viewed as a "sound unit" used to make an audio composition. The sound frequency preferably ranges from 100 Hz to 18000 Hz.
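The gray-to-volume rule above can be expressed as a linear amplitude mapping, with 0 dB read as full scale (dBFS). The linear curve is an assumption; the patent only fixes the two endpoints (silence and maximum volume):

```python
import math

def gray_to_amplitude(gray):
    """Linear amplitude for a pixel's gray value:
    0 (remote, process black) -> 0.0 (silence, I=0)
    255 (very close, pure white) -> 1.0 (full scale, I=0 dB)
    """
    return gray / 255.0

def amplitude_to_dbfs(amplitude):
    """Express a linear amplitude in dB relative to full scale."""
    if amplitude <= 0.0:
        return float("-inf")  # silence
    return 20.0 * math.log10(amplitude)
```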
(20) According to the operating mode, position X of the pixel can be interpreted in two ways. Scan mode: only the signals corresponding to the pixels in the central column sound. The scene is scanned when the user moves the head from side to side, as if shaking it to say "no". This is similar to scanning with a cane. Complete landscape mode: several columns of pixels associated with the scene sound simultaneously. Scanning is not necessary in this mode; the image is represented (or "sounded") in its entirety. For example, the further to the right the pixels are, the louder they sound on the right of the stereo panorama, and likewise for the central and left regions.
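One way to realize "the further to the right, the louder on the right" in complete landscape mode is a constant-power pan law. This is an assumption for illustration; the patent does not specify a pan curve:

```python
import math

def stereo_gains(x, width):
    """Left/right channel gains for a pixel column at horizontal position x.

    Constant-power panning: a far-left column (x=0) plays only in the left
    channel, a far-right column only in the right channel, and total power
    stays roughly constant across the panorama.
    """
    pan = x / (width - 1)               # 0.0 = far left, 1.0 = far right
    left = math.cos(pan * math.pi / 2)
    right = math.sin(pan * math.pi / 2)
    return left, right
```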
(21) Complete landscape mode requires high computing power, so depending on the performance of the processing unit (2), instead of all the columns in the image sounding, the analysis can be optimized to use five columns, i.e., central, 45°, −45°, 80°, −80°. More columns can be used according to the processing power.
(22) The position Y of the pixel (the height of the object) defines how it sounds in terms of frequency: a bandpass filter is used (or a generated sine wave frequency, or a pre-calculated sample with a specific frequency range, alternatives chosen according to the calculating power of the device), so the pixels in the high area sound high-pitched and the pixels in the low area sound low-pitched. The portion of the sound spectrum covered by each pixel is defined by the number of pixels along Y.
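The height-to-pitch rule can be sketched as a mapping from image row to frequency. The 100 Hz-18000 Hz range and the high-means-high-pitched convention come from the text; the logarithmic spacing (perceptually even pitch steps) is an assumption:

```python
F_MIN, F_MAX = 100.0, 18000.0  # frequency range given in the text

def row_to_frequency(row, height):
    """Map an image row to a pitch: row 0 (top of the image, high area)
    gets the highest frequency, the bottom row the lowest.

    Log spacing between F_MIN and F_MAX is assumed for illustration.
    """
    frac = 1.0 - row / (height - 1)     # 1.0 at the top, 0.0 at the bottom
    return F_MIN * (F_MAX / F_MIN) ** frac
```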
EXAMPLE
(23) This example is provided to clarify how sound is generated from the depth image. It is assumed that the scan mode has been selected and that a depth image like that shown in
(24) The strength of the signal at that moment in time would be the analog mix of all the signals.
(25) The user would notice different frequencies according to the position of the pixel in height. The pixels that are at a lower height are lower pitched, and the pixels that are at a greater height are higher pitched. The sound generated by this column can be divided into a low pitch component with a high sound intensity (area B) and a component having an intermediate sound intensity with a higher pitched frequency (area C). This signal would be generated for the two left and right channels (and would be reproduced in the headphones (4i, 4d), respectively).
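The "analog mix of all the signals" in a column can be sketched as a sum of one sine component per pixel, with amplitude taken from the gray value and frequency from the height, as described above. The sample rate, duration, log frequency spacing, and normalization here are illustrative assumptions:

```python
import math

def synthesize_column(column, sample_rate=16000, duration=0.05):
    """Mix one scan-mode column of depth pixels into a mono signal.

    column: list of gray values (0-255), index 0 = top of the image.
    Each pixel contributes a sine wave whose amplitude encodes closeness
    and whose frequency encodes height (top = high-pitched).
    """
    height = len(column)
    samples = []
    for n in range(int(sample_rate * duration)):
        t = n / sample_rate
        s = 0.0
        for row, gray in enumerate(column):
            amp = gray / 255.0                                     # closeness -> volume
            freq = 100.0 * (18000.0 / 100.0) ** (1.0 - row / (height - 1))
            s += amp * math.sin(2 * math.pi * freq * t)
        samples.append(s / height)  # normalize the mix to avoid clipping
    return samples
```

A column with a bright (close) region at the bottom would thus yield a loud, low-pitched component, matching the area B / area C description in the example.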
(26) When the user changes position of the cameras by turning the head, the depth image, and therefore the associated sound signal, will be modified.
(29) The amount of information and detail carried by the sound allows forms and spaces to be identified with a precision that was unheard of until now. In tests conducted with blind users, it has been verified that after a short training period the present invention allows recognizing specific forms by their associated sound. For example, bottles, glasses and plates on a table have characteristic sounds that allow them to be distinguished from one another.
(30) Bone conduction headphones, which leave the ear canal free, are preferably used to transmit the sound. This improves user comfort, greatly reduces listening fatigue and is much more hygienic for prolonged use sessions.
(31) In one embodiment, an interface associated with the processing unit (2) is envisaged, having a range selection button to determine the analysis distance, for example close, normal and far, with distances of 40 cm, 2 m and 6 m, respectively, or with distances defined by the user through an interface suited to that effect. When the button is pushed, the distances are selected in a cyclical manner. The range selection serves to adapt the system to different scenarios and circumstances, for example, 40 cm for locating objects on a table, 2 m for walking around the house, and 6 m for crossing the street.
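The cyclical range button can be sketched as a small state machine. The three distances come from the text; the class and method names are illustrative:

```python
RANGES_M = [0.4, 2.0, 6.0]  # close / normal / far, as in the text

class RangeSelector:
    """Cycles through the analysis distances each time the button is pushed."""

    def __init__(self):
        self.index = 0  # start at the "close" range

    def push(self):
        """Advance cyclically to the next range and return it (in meters)."""
        self.index = (self.index + 1) % len(RANGES_M)
        return RANGES_M[self.index]

    @property
    def current(self):
        return RANGES_M[self.index]
```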
(32) In another preferred embodiment of the invention, the system comprises wireless data transmission means (for example by means of Wi-Fi, Bluetooth or other similar technologies) connected to the processing unit (2), where said transmission means are connected to an external device with a wireless connection and/or to a wearable type of device.
(33) It is envisaged in one embodiment that the interface associated with the processing unit (2) has an analysis mode button. The selection between modes will be cyclical.
(34) Scan mode: Analysis only in the central area of the image. The user will turn the head in a cyclical manner from left to right, scanning the scene similarly to how this would be done with a cane. The sound is monaural.
(35) Complete landscape mode: the analysis is performed on the entire image and the sound is stereo. The user can therefore perceive forms and spaces in the entire field of vision simultaneously. For example, a column is perceived on the left (left stereo panorama), a low table is perceived in the center (central stereo panorama), and on the right (right stereo panorama) the path is clear. This mode is more complex in terms of sound, since it provides more information than the scan mode does. It is easy to master, although it does require somewhat more training.