Patent classifications
G10L21/18
Transcription summary presentation
A method to present a summary of a transcription may include obtaining, at a first device, audio directed to the first device from a second device during a communication session between the first device and the second device. Additionally, the method may include sending, from the first device, the audio to a transcription system. The method may include obtaining, at the first device, a transcription during the communication session from the transcription system based on the audio. Additionally, the method may include obtaining, at the first device, a summary of the transcription during the communication session. Additionally, the method may include presenting, on a display, both the summary and the transcription simultaneously during the communication session.
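The claimed flow — capture audio during a session, send it out for transcription, obtain a summary, and present both at once — can be sketched in Python. The transcription system and summarizer below are hypothetical stand-ins (the abstract specifies neither):

```python
# Minimal sketch of the claimed method, with hypothetical stand-ins for the
# transcription system and the summarizer.

def transcribe(audio_chunks):
    """Stand-in transcription system: one text segment per audio chunk."""
    return [f"segment for {chunk}" for chunk in audio_chunks]

def summarize(transcription):
    """Stand-in summarizer: naively keep only the first segment."""
    return transcription[:1]

def present(transcription, summary):
    """Return both views together, as the method claims to show them
    simultaneously on one display."""
    return {"summary": summary, "transcription": transcription}

# During the communication session: audio arrives at the first device,
# is sent for transcription, and both transcription and summary are shown.
audio = ["chunk-0", "chunk-1", "chunk-2"]
transcription = transcribe(audio)
summary = summarize(transcription)
display = present(transcription, summary)
```

In a real system `transcribe` would stream results back during the session rather than return all segments at once; the point here is only the data flow between the four claimed steps.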
AUDIO-VISUAL SPEECH SEPARATION
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for audio-visual speech separation. A method includes: obtaining, for each frame in a stream of frames from a video in which faces of one or more speakers have been detected, a respective per-frame face embedding of the face of each speaker; processing, for each speaker, the per-frame face embeddings of the face of the speaker to generate visual features for the face of the speaker; obtaining a spectrogram of an audio soundtrack for the video; processing the spectrogram to generate an audio embedding for the audio soundtrack; combining the visual features for the one or more speakers and the audio embedding for the audio soundtrack to generate an audio-visual embedding for the video; determining a respective spectrogram mask for each of the one or more speakers; and determining a respective isolated speech spectrogram for each speaker.
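The pipeline in this abstract — per-speaker visual features, an audio embedding of the mixture spectrogram, a fused audio-visual embedding, per-speaker masks, and masked (isolated) spectrograms — can be sketched with numpy. All networks are replaced by random projections and the dimensions are toy sizes; this illustrates the tensor flow, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
T, F, S = 10, 64, 2          # frames, frequency bins, speakers (toy sizes)

# Per-frame face embeddings -> visual features per speaker.
# (A real system would run a learned network here; we pass them through.)
face_emb = rng.standard_normal((S, T, 16))
visual_feats = face_emb

# Mixture spectrogram -> audio embedding (stand-in: a random linear map).
spectrogram = np.abs(rng.standard_normal((T, F)))
audio_emb = spectrogram @ rng.standard_normal((F, 16))

# Fuse into a joint audio-visual embedding, one per speaker and frame.
av_emb = np.concatenate(
    [visual_feats, np.broadcast_to(audio_emb, (S, T, 16))], axis=-1)

# Per-speaker spectrogram masks via a softmax across the speaker axis,
# so the masks at each time-frequency bin sum to 1.
logits = av_emb @ rng.standard_normal((32, F))      # shape (S, T, F)
shifted = logits - logits.max(axis=0)               # stabilize the softmax
masks = np.exp(shifted) / np.exp(shifted).sum(axis=0)

# Isolated speech spectrogram per speaker = mask * mixture spectrogram.
isolated = masks * spectrogram
```

Because the masks sum to 1 across speakers, the isolated spectrograms sum back to the mixture — one common design choice for separation masks, used here for illustration.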
VOICE TRANSMISSION COMPENSATION APPARATUS, VOICE TRANSMISSION COMPENSATION METHOD AND PROGRAM
A speech transmission compensation apparatus that assists a user in discriminating heard speech includes one or more computers, each with a memory and a processor, configured to: accept input of a speech signal, detect a specific type of sound in the speech signal, analyze an acoustic characteristic of that sound, and output the acoustic characteristic; accept the output acoustic characteristic, generate a vibration signal with a duration corresponding to the acoustic characteristic, and output the vibration signal; and accept the output vibration signal and provide the user with vibration for that duration on the basis of the vibration signal.
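The detect-then-vibrate idea can be sketched in Python. The "specific type of sound" detector here is a crude RMS-energy gate and the sample rate and vibration frequency are assumed values; the abstract does not specify any of these:

```python
import numpy as np

FS = 8000  # sample rate in Hz (assumed; not stated in the abstract)

def detect_segments(signal, threshold=0.1, frame=80):
    """Flag frames whose RMS energy exceeds a threshold — a crude stand-in
    for detecting the 'specific type of sound' named in the abstract."""
    n = len(signal) // frame
    rms = np.sqrt((signal[:n * frame].reshape(n, frame) ** 2).mean(axis=1))
    return rms > threshold

def vibration_signal(active, frame=80, fs=FS, vib_hz=200.0):
    """Generate a vibration waveform whose duration matches the detected
    sound: a vib_hz sine wherever the detection flag is set, silence elsewhere."""
    t = np.arange(len(active) * frame) / fs
    gate = np.repeat(active.astype(float), frame)
    return gate * np.sin(2 * np.pi * vib_hz * t)

# One second of silence with a 0.25 s tone burst in the middle.
sig = np.zeros(FS)
sig[3000:5000] = 0.5 * np.sin(2 * np.pi * 150 * np.arange(2000) / FS)

active = detect_segments(sig)
vib = vibration_signal(active)
```

The vibration is nonzero only over (roughly) the detected burst, so its duration tracks the duration of the detected sound, which is the core of the claim.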
Method and apparatus for processing speech signal
An apparatus for processing a speech signal is provided. The apparatus includes a communicator comprising communication circuitry configured to transmit and receive data, an actuator comprising actuation circuitry configured to generate vibration and to output a signal, a formant enhancement filter configured to increase a formant of the speech signal, and a controller comprising processing circuitry configured to control the speech signal to be received through the communicator, to estimate at least one formant frequency from the speech signal based on linear predictive coding (LPC), to estimate a bandwidth of the at least one formant frequency, to determine whether the speech signal is a voiced sound or a voiceless sound, to configure the formant enhancement filter based on the at least one formant frequency, the bandwidth of the at least one formant frequency, characteristics of the determined voiced sound or voiceless sound, and signal delivery characteristics of a human body, to apply the formant enhancement filter to the speech signal, and to control the speech signal to which the formant enhancement filter is applied to be output using the actuator through the human body.
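The LPC step of this apparatus — estimating formant frequencies and their bandwidths from the roots of the linear-prediction polynomial — can be sketched with numpy. The sample rate and test signal are assumptions, and the sketch stops at formant estimation (it omits the voiced/voiceless decision and the body-transfer filter design):

```python
import numpy as np

FS = 8000  # sample rate in Hz (assumed)

def lpc(signal, order):
    """LPC coefficients [1, a1..ap] via the autocorrelation normal equations."""
    r = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return np.concatenate(([1.0], -a))

def formants(a, fs=FS):
    """Formant frequencies (Hz) and bandwidths (Hz) from the LPC roots."""
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]            # one root per conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)
    bws = -np.log(np.abs(roots)) * fs / np.pi    # bandwidth from pole radius
    idx = np.argsort(freqs)
    return freqs[idx], bws[idx]

# Synthesize a two-formant signal as the impulse response of a known
# all-pole filter, so the estimate can be checked against the true poles.
true = [(700.0, 100.0), (1200.0, 150.0)]         # (frequency, bandwidth) pairs
poles = [np.exp(-np.pi * bw / FS) * np.exp(2j * np.pi * f0 / FS)
         for f0, bw in true]
a_true = np.real(np.poly(poles + [np.conj(p) for p in poles]))

y = np.zeros(2048)
for n in range(len(y)):
    acc = 1.0 if n == 0 else 0.0                 # unit impulse excitation
    for k in range(1, len(a_true)):
        if n >= k:
            acc -= a_true[k] * y[n - k]
    y[n] = acc

freqs, bws = formants(lpc(y, order=4))
```

Because the test signal is exactly all-pole, the estimated formants land essentially on the true resonances; real speech would add noise, spectral tilt, and the need for pre-emphasis and windowing.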
AUDIO TRACKING
A transcription provider is presented with an audio recording created using one or more recording devices. A transcriptionist using proprietary computer software records at discrete intervals both the position of the audio playing for the transcriptionist and the position of the cursor in the document being typed by the transcriptionist, thereby creating both a completely typed document and an audio map. The completed document may be further processed such that each word is matched to its corresponding audio position using the information acquired from the audio map and the matched word may then be put into a separate document as a hyperlink containing meta-data that points to the exact matching audio position. By simultaneously tracking the progress of audio playback and transcriptionist progress within a document, the transcriptionist is then able to display an interactive version of the completed document.
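The audio-map idea — periodically sampling (audio position, cursor position) pairs while the transcriptionist types, then matching each word to an audio position — can be sketched in stdlib Python. The matching rule here (take the most recent sample whose cursor position is at or before the word's start) is a hypothetical simplification of whatever the proprietary software does:

```python
import bisect

def build_links(text, audio_map):
    """Map each word to the audio time of the most recent sampled cursor
    position at or before the word's start offset (hypothetical rule)."""
    cursors = [c for _, c in audio_map]   # audio_map is sorted by time
    links = []
    pos = 0
    for word in text.split():
        start = text.index(word, pos)     # character offset of this word
        i = bisect.bisect_right(cursors, start) - 1
        links.append((word, audio_map[max(i, 0)][0]))
        pos = start + len(word)
    return links

# Samples taken at discrete intervals: (seconds of audio played, cursor index).
audio_map = [(0.0, 0), (2.0, 10), (4.0, 21), (6.0, 33)]
doc = "the quick brown fox jumps over the lazy dog"
links = build_links(doc, audio_map)
```

Each `(word, time)` pair is what the abstract's hyperlink would carry as metadata: clicking the word in the interactive document would seek playback to that audio position.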