Patent classifications
G10L15/04
Speech translation device, speech translation method, and recording medium
A speech translation device, for conversation between a first speaker making an utterance in a first language and a second speaker making an utterance in a second language different from the first language, includes: a speech detector that detects, from sounds that are input, a speech segment in which the first speaker or the second speaker made an utterance; a display that, after speech recognition is performed on the utterance, displays a translation result obtained by translating the utterance from the first language to the second language or from the second language to the first language; and an utterance instructor that outputs, in the second language via the display, a message prompting the second speaker to make an utterance after a first speaker's utterance or outputs, in the first language via the display, a message prompting the first speaker to make an utterance after a second speaker's utterance.
Method and apparatus with speech processing
Disclosed are a method and an apparatus for processing speech. The method includes obtaining context information from a speech signal of a user using a neural network-based encoder, determining intent information of the speech signal based on the context information, determining, based on the context information, attention information corresponding to a segment included in the speech signal, and determining, based on the attention information, a segment value of the segment by recognizing, using a decoder, a portion of the context information identified as corresponding to the segment.
NATURAL LANGUAGE PROCESSING DEVICE
A natural language processing device according to an embodiment of the present disclosure may comprise: a memory for storing a first channel named entity dictionary including basic channel names and a synonym of each of the basic channel names; a communication interface for receiving, from a display device, voice data corresponding to a voice instruction uttered by a user; and a processor which: acquires multiple channel names included in electronic program guide information; extracts channel names matching the acquired multiple channel names from the first channel named entity dictionary so as to acquire a second channel named entity dictionary; acquires the intention of a speech of the voice instruction on the basis of text data of the voice data and the second channel named entity dictionary; and transmits the acquired intention of the speech to the display device through the communication interface.
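The dictionary-narrowing step in the abstract above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the first dictionary is a mapping from basic channel name to a list of synonyms, that matching against the electronic program guide is a case-insensitive comparison of basic names, and that intent resolution is a simple substring lookup. The names build_second_dictionary and resolve_channel, and all channel names, are hypothetical.

```python
def build_second_dictionary(first_dictionary, epg_channel_names):
    """Keep only first-dictionary entries whose basic name appears in the EPG.

    Assumption: matching is a case-insensitive comparison of basic names.
    """
    epg = {name.casefold() for name in epg_channel_names}
    return {
        basic: synonyms
        for basic, synonyms in first_dictionary.items()
        if basic.casefold() in epg
    }


def resolve_channel(text, second_dictionary):
    """Return the basic channel name mentioned in the text, or None.

    Assumption: a channel is "mentioned" if its basic name or any synonym
    occurs as a substring of the recognized text.
    """
    lowered = text.casefold()
    for basic, synonyms in second_dictionary.items():
        for candidate in (basic, *synonyms):
            if candidate.casefold() in lowered:
                return basic
    return None


# Hypothetical data for illustration only.
first_dictionary = {
    "News One": ["news 1", "channel one news"],
    "Sports Max": ["sportsmax"],
    "Movie Box": ["moviebox"],
}
epg_channel_names = ["News One", "Movie Box"]

second_dictionary = build_second_dictionary(first_dictionary, epg_channel_names)
```

With this data, second_dictionary keeps only "News One" and "Movie Box", so an utterance that names "sportsmax" resolves to no channel, while "switch to news 1 please" resolves to "News One".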
METHODS TO EMPLOY CONCATENATION IN ASR SERVICE USAGE
Systems and methods for processing audio streams are disclosed herein. In a disclosed method, N audio streams are received from an independent source, each audio stream including speech content. A set of the N audio streams is concatenated to generate a concatenated audio stream. Based on the N received audio streams or on the set of audio streams, audio stream separators are generated, numbering N−1 or one less than the total number of audio streams in the set, respectively. An audio stream separator is inserted between every two adjacent audio streams of the concatenated audio stream to generate a single audio stream payload. The single audio stream payload is transmitted for transcription of the audio stream speech content to text content, and in response to transmitting the payload, a text file is received that includes text content corresponding to the audio streams delineated by the audio stream separators.
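The concatenate-and-separate step described above can be sketched as follows. This is a rough illustration under stated assumptions: audio streams are modeled as byte strings, the separator is a hypothetical byte marker (a real system would more likely use an audio cue that the transcription service renders as a recognizable token), and the returned transcript is assumed to carry a matching text marker. build_payload and split_transcript are illustrative names, not part of any ASR service API.

```python
SEPARATOR = b"<SEP>"      # hypothetical separator inserted between streams
TEXT_SEPARATOR = "<SEP>"  # assumed form of the separator in the transcript


def build_payload(streams):
    """Concatenate streams into one payload with a separator between neighbors.

    For N streams this generates N - 1 separators, one per adjacent pair.
    Returns the payload and the number of separators used.
    """
    if not streams:
        raise ValueError("at least one audio stream is required")
    separators = [SEPARATOR] * (len(streams) - 1)
    payload = bytearray()
    for index, stream in enumerate(streams):
        payload += stream
        if index < len(separators):
            payload += separators[index]
    return bytes(payload), len(separators)


def split_transcript(text):
    """Split a returned transcript back into one text per original stream."""
    return [part.strip() for part in text.split(TEXT_SEPARATOR)]


# Three placeholder "streams" produce two separators.
payload, separator_count = build_payload([b"aa", b"bb", b"cc"])
```

Sending one payload instead of N separate requests is the point of the design: the separators let the caller recover per-stream text from the single transcript that comes back.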
INFORMATION PROCESSING APPARATUS, INFORMATION PROCESSING SYSTEM, AND NON-TRANSITORY COMPUTER READABLE MEDIUM
An information processing apparatus includes a processor configured to: segment, into multiple voice segments, voice data and text data converted from the voice data; impart a security level to each of the voice segments in accordance with contents of the text data and the voice data in each of the voice segments; and perform control on an output of each of the voice segments in accordance with the security level.
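The segment, label, and control flow in the abstract above can be sketched as follows. This is a minimal illustration only: segmentation is assumed to have already happened, the security level is derived from the text of each segment alone (the abstract also uses the voice data, which is not modeled here), and output control is reduced to withholding segments above a listener's clearance. The term list, the level values, and all names are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical mapping from sensitive terms to security levels.
SENSITIVE_TERMS = {"password": 2, "salary": 1}


@dataclass
class VoiceSegment:
    start: float            # segment start time in seconds
    end: float              # segment end time in seconds
    text: str               # text converted from the voice data
    security_level: int = 0


def impart_security_levels(segments):
    """Assign each segment the highest level of any sensitive term it contains."""
    for segment in segments:
        words = (word.strip(".,!?") for word in segment.text.casefold().split())
        segment.security_level = max(
            (SENSITIVE_TERMS.get(word, 0) for word in words), default=0
        )
    return segments


def controlled_output(segments, listener_clearance):
    """Return segment texts, withholding those above the listener's clearance."""
    return [
        segment.text if segment.security_level <= listener_clearance else "[withheld]"
        for segment in segments
    ]


segments = impart_security_levels([
    VoiceSegment(0.0, 1.0, "Hello everyone."),
    VoiceSegment(1.0, 2.0, "The password is tulip."),
    VoiceSegment(2.0, 3.0, "Her salary changed."),
])
```

Calling controlled_output(segments, 1) would pass the first and third segments through and replace the second with a placeholder, since its level of 2 exceeds the clearance of 1.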
ENABLING NATURAL CONVERSATIONS WITH SOFT ENDPOINTING FOR AN AUTOMATED ASSISTANT
As part of a dialog session between a user and an automated assistant, implementations can process, using a streaming ASR model, a stream of audio data that captures a portion of a spoken utterance to generate ASR output, process, using an NLU model, the ASR output to generate NLU output, and cause, based on the NLU output, a stream of fulfillment data to be generated. Implementations can further determine, based on processing the stream of audio data, audio-based characteristics associated with the portion of the spoken utterance captured in the stream of audio data. Based on the audio-based characteristics and/or the stream of NLU output, implementations can determine whether the user has paused in providing the spoken utterance or has completed providing the spoken utterance. If the user has paused, implementations can cause natural conversation output to be provided for presentation to the user.
System and method for combining phonetic and automatic speech recognition search
A text search query including one or more words may be received. An ASR index created for an audio recording may be searched using the query to produce ASR search results including words, each word associated with a confidence score. For each word in the ASR search results whose confidence score is below a threshold (and which in some cases has one or more preceding words and one or more subsequent words in the ASR index), a phonetic representation of the audio recording may be searched for that word where it occurs in the audio recording, possibly after the one or more preceding words and before the one or more subsequent words, to produce phonetic search results. Search results that include ASR and phonetic results may be returned.
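The fallback from ASR search to phonetic search can be sketched as follows. This is an illustration under several assumptions: the ASR index is modeled as a time-ordered list of words with confidence scores, the phonetic search is a caller-supplied function (here a stub with made-up data), and the search window is bounded by the start times of the neighboring ASR words. The threshold value and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AsrWord:
    word: str
    start: float       # start time in seconds
    confidence: float  # ASR confidence in [0, 1]


def hybrid_search(query_words, asr_index, phonetic_search, threshold=0.6):
    """Return (source, word, time) hits, re-checking low-confidence ASR hits phonetically."""
    results = []
    for position, hit in enumerate(asr_index):
        if hit.word not in query_words:
            continue
        if hit.confidence >= threshold:
            results.append(("asr", hit.word, hit.start))
            continue
        # Low confidence: bound the phonetic search by the neighboring words.
        window_start = asr_index[position - 1].start if position > 0 else 0.0
        window_end = (
            asr_index[position + 1].start
            if position + 1 < len(asr_index)
            else float("inf")
        )
        for time in phonetic_search(hit.word, window_start, window_end):
            results.append(("phonetic", hit.word, time))
    return results


def fake_phonetic_search(word, window_start, window_end):
    """Stub standing in for a real phonetic index lookup (made-up data)."""
    known = {"refund": [4.2, 30.0]}
    return [t for t in known.get(word, []) if window_start <= t <= window_end]


asr_index = [
    AsrWord("the", 3.9, 0.95),
    AsrWord("refund", 4.2, 0.35),
    AsrWord("policy", 4.8, 0.90),
]
hits = hybrid_search({"refund", "policy"}, asr_index, fake_phonetic_search)
```

Restricting the phonetic search to the window between neighboring words is what keeps the second, unrelated occurrence at 30.0 seconds out of the results for this hit.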