Patent classifications
G10L2021/065
WEARABLE VIBROTACTILE SPEECH AID
A method for training vibrotactile speech perception in the absence of auditory speech includes selecting a first word, generating a first control signal configured to cause at least one vibrotactile transducer to vibrate against a person's body with a first vibration pattern based on the first word, sampling a second word spoken by the person, generating a second control signal configured to cause at least one vibrotactile transducer to vibrate against the person's body with a second vibration pattern based on the sampled second word, and presenting a comparison between the first word and the second word to the person. An array of vibrotactile transducers can be in contact with the person's body. A method for improving auditory and/or visual speech perception in adverse listening conditions or for hearing-impaired individuals can also include sampling a speech signal, extracting a speech envelope, and generating a control signal configured to cause a vibrotactile transducer to vibrate against a person's body with an intensity that varies over time based on the speech envelope.
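The envelope-to-vibration mapping in the last sentence can be sketched concretely. Below is a minimal Python sketch, assuming the envelope is obtained by full-wave rectification followed by low-pass filtering (a common choice; the abstract does not specify an extractor) and assuming a hypothetical `envelope_to_drive_levels` stage that quantizes the envelope into transducer drive levels.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def speech_envelope(samples: np.ndarray, fs: int, cutoff_hz: float = 16.0) -> np.ndarray:
    """Slowly varying amplitude envelope: full-wave rectification, then low-pass."""
    rectified = np.abs(samples)
    b, a = butter(2, cutoff_hz / (fs / 2), btype="low")
    return filtfilt(b, a, rectified)

def envelope_to_drive_levels(envelope: np.ndarray, levels: int = 256) -> np.ndarray:
    """Quantize the envelope into integer drive levels for a vibrotactile transducer."""
    norm = envelope / (envelope.max() + 1e-12)
    return np.clip((norm * (levels - 1)).astype(int), 0, levels - 1)

# Demo on a 1 s synthetic "speech" burst at 16 kHz.
fs = 16_000
t = np.arange(fs) / fs
burst = np.sin(2 * np.pi * 200 * t) * np.exp(-((t - 0.5) ** 2) / 0.02)
drive = envelope_to_drive_levels(speech_envelope(burst, fs))
print(drive.min(), drive.max())  # intensity varies over time with the envelope
```

A cutoff in the low tens of hertz trades temporal detail against smoothness while preserving the syllabic rhythm that vibrotactile perception can track.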
WEARABLE AUDITORY FEEDBACK DEVICE
A wearable auditory feedback device includes a frame, a plurality of microphone arrays, a plurality of feedback motors, and a processor. The frame is wearable on a user's head or neck. The microphone arrays are embedded in the frame on a left side, a right side, and a rear side with respect to the user. The feedback motors are also embedded in the frame on the left side, the right side, and the rear side with respect to the user. The processor is configured to receive a plurality of sound waves collected with the microphone arrays from a sound wave source, determine an originating direction of the sound waves, and activate a feedback motor on a side of the frame corresponding to the originating direction.
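As a rough illustration of the direction-determination step, here is a Python sketch that deliberately simplifies localization to an energy comparison across the three arrays (a real device would more likely use time differences of arrival across array elements); `activate_motor` is a hypothetical stand-in for the motor driver.

```python
import numpy as np

SIDES = ("left", "right", "rear")

def originating_side(frames: dict) -> str:
    """Pick the side whose microphone array captured the most energy this frame."""
    rms = {side: float(np.sqrt(np.mean(frames[side] ** 2))) for side in SIDES}
    return max(rms, key=rms.get)

def activate_motor(side: str) -> None:
    """Stand-in for driving the haptic motor embedded on that side of the frame."""
    print(f"vibrate {side} motor")

# One frame of samples per array; the right side is loudest here.
frames = {
    "left": np.random.randn(1024) * 0.1,
    "right": np.random.randn(1024) * 0.9,
    "rear": np.random.randn(1024) * 0.2,
}
activate_motor(originating_side(frames))
```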
SYSTEM AND METHOD TO INSERT VISUAL SUBTITLES IN VIDEOS
A system and method to insert visual subtitles in videos is described. The method comprises segmenting an input video signal to extract the speech segments and music segments. Next, a speaker representation is associated with each speech segment corresponding to a speaker visible in the frame. Further, the speech segments are analyzed to compute the phones and the duration of each phone. The phones are mapped to corresponding visemes, and a viseme-based language model is created with a corresponding score. The most relevant viseme is selected for each speech segment by computing a total viseme score. Further, a speaker representation sequence is created such that phones and emotions in the speech segments are represented as reconstructed lip movements and eyebrow movements. The speaker representation sequence is then integrated with the music segments and superimposed on the input video signal to create subtitles.
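The phone-to-viseme mapping and scoring steps can be illustrated with a toy sketch. The table below is a hypothetical many-to-one mapping (real viseme inventories are larger), and the unigram scores are a stand-in for the viseme-based language model the abstract describes.

```python
# Hypothetical many-to-one phone -> viseme table (real inventories are larger).
PHONE_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "iy": "spread", "eh": "mid-open", "aa": "open",
}

def phones_to_visemes(phones):
    """Map (phone, duration) pairs to visemes, merging adjacent repeats."""
    sequence = []
    for phone, dur in phones:
        viseme = PHONE_TO_VISEME.get(phone, "neutral")
        if sequence and sequence[-1][0] == viseme:
            sequence[-1] = (viseme, sequence[-1][1] + dur)  # merge repeated visemes
        else:
            sequence.append((viseme, dur))
    return sequence

def total_viseme_score(sequence, unigram_scores):
    """Duration-weighted score of a viseme sequence under a toy unigram model."""
    return sum(dur * unigram_scores.get(v, 0.0) for v, dur in sequence)

phones = [("p", 0.05), ("aa", 0.12), ("m", 0.06)]           # e.g. the word "pom"
lm = {"bilabial": 0.8, "open": 0.6, "neutral": 0.1}
seq = phones_to_visemes(phones)
print(seq, total_viseme_score(seq, lm))
```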
WEARABLE DEVICE, DISPLAY CONTROL METHOD, AND COMPUTER-READABLE RECORDING MEDIUM
A wearable device is provided that includes a microphone, a display, and a controller. The controller analyzes audio information picked up by the microphone and, when audio corresponding to a predetermined verbal address phrase is detected in that audio information, causes the display to present an indication that a verbal address has been uttered.
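A minimal sketch of the detection logic, assuming speech recognition happens upstream so the controller works on transcribed text; the phrase set and display call are hypothetical.

```python
ADDRESS_PHRASES = ("excuse me", "hello", "hey")  # hypothetical phrase set

def detect_verbal_address(transcript: str) -> bool:
    """True when the transcript contains a predetermined verbal address phrase."""
    lowered = transcript.lower()
    return any(phrase in lowered for phrase in ADDRESS_PHRASES)

def update_display(transcript: str) -> None:
    """Stand-in for the controller's call to the display."""
    if detect_verbal_address(transcript):
        print("[display] Someone is addressing you")

update_display("Excuse me, do you have a moment?")
```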
APPARATUS FOR BI-DIRECTIONAL SIGN LANGUAGE/SPEECH TRANSLATION IN REAL TIME AND METHOD
Provided are an apparatus and method for bi-directional sign language/speech translation in real time that may automatically translate a sign into speech or speech into a sign by separately performing an operation of recognizing external speech captured through a microphone and outputting a sign corresponding to the speech, and an operation of recognizing a sign sensed through a camera and outputting speech corresponding to the sign.
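The separation of the two operations can be made concrete with a toy dispatcher. Every recognizer and synthesizer below is a stub; the point is only that the speech-to-sign and sign-to-speech paths run independently, as the abstract describes.

```python
# Every stage below is a stub; only the control flow mirrors the abstract.
def recognize_speech(audio_frame: str) -> str:
    return audio_frame                                # pretend ASR already ran

def recognize_sign(video_frame: str) -> str:
    return {"wave-hand": "hello"}.get(video_frame, "<unknown sign>")

def render_sign(text: str) -> str:
    return f"<sign animation for '{text}'>"

def synthesize_speech(text: str) -> str:
    return f"<spoken audio for '{text}'>"

def translate(direction: str, frame: str) -> str:
    """The two translation operations are performed separately, never mixed."""
    if direction == "speech->sign":
        return render_sign(recognize_speech(frame))
    if direction == "sign->speech":
        return synthesize_speech(recognize_sign(frame))
    raise ValueError(direction)

print(translate("speech->sign", "good morning"))
print(translate("sign->speech", "wave-hand"))
```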
SIGN LANGUAGE COMMUNICATION WITH COMMUNICATION DEVICES
Implementations enable conversations between operators of communication devices who use sign language and operators who do not. A method may include receiving images of first sign language gestures captured by a camera of a first communication device, converting the first sign language gestures into first text, transmitting the first text to a second communication device, receiving second text from the second communication device, and converting the second text into images of second sign language gestures made by an avatar. The method may also include operating the camera to capture the images of the first sign language gestures and presenting the images of the second sign language gestures on a display of the first communication device. The method may further include receiving first speech captured at the second communication device, converting the first speech into third text, and then into images of third sign language gestures made by the avatar.
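A toy sketch of the text relay at the heart of the method: only text crosses the network, with gesture recognition and avatar rendering stubbed out at the sign-language user's device.

```python
def gestures_to_text(gesture_labels):
    """Stub recognizer: assume each captured gesture image is already labeled."""
    return " ".join(gesture_labels)

def text_to_avatar_gestures(text):
    """Stub renderer: one avatar gesture image per word of received text."""
    return [f"<avatar signs '{word}'>" for word in text.split()]

outbound = gestures_to_text(["where", "is", "the", "exit"])  # sent as first text
inbound = text_to_avatar_gestures("down the hall")           # second text received
print(outbound)
print(inbound)
```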
SPEECH ASSESSMENT DEVICE AND METHOD FOR A MULTISYLLABIC-WORD LEARNING MACHINE, AND A METHOD FOR VISUALIZING CONTINUOUS AUDIO
A speech assessment device and method for a multisyllabic-word learning machine, and a method for visualizing continuous audio, are provided. By performing the step of starting the assessment mode, the step of selecting words to be assessed, the step of choosing to play or record, the step of recording, the step of visualization (including the steps of extracting the fundamental frequency, defining analysis points, transforming them into polygonal lines, and simplifying the polygonal lines), the step of repeating, and the step of assessment, the speech assessment device and method are capable of providing assistance in oral language learning and of rehabilitating patients with hearing impairment through visual aids.
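The visualization steps map naturally onto a short pipeline. Below is a Python sketch, assuming autocorrelation-based pitch estimation and Ramer-Douglas-Peucker polyline simplification; the abstract names the steps but not the specific algorithms, so both choices are illustrative.

```python
import numpy as np

def frame_f0(frame: np.ndarray, fs: int, fmin: int = 75, fmax: int = 400) -> float:
    """Fundamental frequency of one frame via the autocorrelation peak."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    return fs / (lo + int(np.argmax(ac[lo:hi])))

def rdp(points, epsilon):
    """Ramer-Douglas-Peucker simplification of a polygonal line."""
    if len(points) < 3:
        return points
    start, end = np.asarray(points[0]), np.asarray(points[-1])
    seg = end - start
    norm = np.hypot(seg[0], seg[1]) + 1e-12
    dists = [abs(seg[0] * (p[1] - start[1]) - seg[1] * (p[0] - start[0])) / norm
             for p in points]
    i = int(np.argmax(dists))
    if dists[i] <= epsilon:
        return [points[0], points[-1]]
    return rdp(points[: i + 1], epsilon)[:-1] + rdp(points[i:], epsilon)

fs = 16_000
t = np.arange(fs) / fs
audio = np.sin(2 * np.pi * (150 + 50 * t) * t)        # synthetic rising-pitch "word"
hop, win = fs // 100, fs // 50
contour = [(i / fs, frame_f0(audio[i:i + win], fs))   # one analysis point per frame
           for i in range(0, len(audio) - win, hop)]
print(rdp(contour, epsilon=5.0))                      # the simplified polygonal line
```

For this near-linear pitch contour the simplification collapses roughly a hundred analysis points into a handful of vertices, which is the compact polyline a learner would see.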
AUTOMATIC SMOOTHED CAPTIONING OF NON-SPEECH SOUNDS FROM AUDIO
A content server accesses an audio stream and inputs portions of the audio stream into one or more non-speech classifiers for classification. The non-speech classifiers generate, for each portion of the audio stream, a set of raw scores representing the likelihood that the portion includes an occurrence of the particular class of non-speech sounds associated with each classifier. The content server generates binary scores from the sets of raw scores, each binary score based on a smoothing of the respective set of raw scores. The content server applies a set of non-speech captions to portions of the audio stream in time, each set of non-speech captions based on a different one of the binary scores of the corresponding portion of the audio stream.
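A minimal sketch of the smoothing-and-binarization stage, assuming a moving-average smoother and a fixed threshold (the abstract does not commit to a particular smoothing method):

```python
import numpy as np

def smooth_and_binarize(raw_scores, window=5, threshold=0.5):
    """Moving-average smoothing of per-portion raw scores, then thresholding
    into binary scores, so captions do not flicker on isolated score spikes."""
    kernel = np.ones(window) / window
    smoothed = np.convolve(raw_scores, kernel, mode="same")
    return (smoothed >= threshold).astype(int)

def apply_captions(binary, label):
    """Emit one caption span per contiguous run of 1s."""
    spans, start = [], None
    for i, b in enumerate(binary):
        if b and start is None:
            start = i
        elif not b and start is not None:
            spans.append((start, i, label))
            start = None
    if start is not None:
        spans.append((start, len(binary), label))
    return spans

raw = [0.1, 0.9, 0.2, 0.8, 0.9, 0.85, 0.1, 0.05]   # raw scores for one class
binary = smooth_and_binarize(raw, window=3)
print(apply_captions(binary, "[APPLAUSE]"))        # isolated spike at index 1 is gone
```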
CONFERENCE SUPPORT APPARATUS, CONFERENCE SUPPORT METHOD, AND COMPUTER PROGRAM PRODUCT
According to an embodiment, a conference support apparatus includes a recognizer, a detector, a summarizer, and a subtitle generator. The recognizer is configured to recognize speech in speech data and generate text data. The detector is configured to detect a correction operation on the text data, the correction operation being an operation of correcting character data that has been incorrectly converted. The summarizer is configured to generate, while a correction operation is detected, a summary of the portion of the text data subsequent to the part on which the correction operation is being performed. The subtitle generator is configured to generate subtitle information corresponding to the summary while the correction operation is detected, and to generate subtitle information corresponding to the text data otherwise.
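The subtitle generator's switching behavior can be sketched in a few lines. The summarizer here is a stub that keeps only the first words of the trailing text; a real apparatus would use an actual summarization method.

```python
def make_subtitle(text_segments, correcting: bool, summary_len: int = 8) -> str:
    """Show recognized text verbatim, or, during a correction, a summary of the
    text subsequent to the part being corrected (stub: keep the first words)."""
    if not correcting:
        return " ".join(text_segments)
    words = text_segments[-1].split()          # text after the corrected part
    suffix = "…" if len(words) > summary_len else ""
    return " ".join(words[:summary_len]) + suffix

segments = ["we reviewed the budget",
            "next quarter we plan to expand the team and hire two engineers"]
print(make_subtitle(segments, correcting=False))  # full subtitle
print(make_subtitle(segments, correcting=True))   # summary while correcting
```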
DISPLAY EYEWEAR WITH AUDITORY ENHANCEMENT
Some embodiments provide display eyewear with auditory enhancement. In general, one aspect disclosed features a head-wearable apparatus comprising: a microphone; a gaze tracker configured to determine a direction of a gaze of a wearer of the head-wearable apparatus; a display panel visible to the wearer; and a controller configured to extract speech from sound collected by the microphone from the determined direction and present the extracted speech on the display panel.
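A rough sketch of the gaze-to-speech selection, assuming a fixed set of beamformer steering angles and upstream transcription so each beam already carries recognized text; all names are illustrative.

```python
def select_beam(gaze_deg: float, beam_angles_deg) -> float:
    """Steering angle closest to where the wearer is looking."""
    return min(beam_angles_deg, key=lambda a: abs(a - gaze_deg))

def caption_from_gaze(gaze_deg: float, beams: dict) -> str:
    """beams maps steering angle -> recognized text for that direction."""
    return beams[select_beam(gaze_deg, list(beams))]

beams = {-30.0: "…background chatter…",
         0.0: "Nice to meet you.",
         30.0: "…music…"}
print(caption_from_gaze(5.0, beams))  # gaze near 0° -> that talker's speech
```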