Patent classifications
G10L15/183
DIALOG MANAGEMENT WITH MULTIPLE APPLICATIONS
Features are disclosed for performing functions in response to user requests based on contextual data regarding prior user requests. Users may engage in conversations with a computing device in order to initiate some function or obtain some information. A dialog manager may manage the conversations and store contextual data regarding one or more of the conversations. Processing and responding to subsequent conversations may benefit from the previously stored contextual data by, e.g., reducing the amount of information that a user must provide if the user has already provided the information in the context of a prior conversation. Additional information associated with performing functions responsive to user requests may be shared among applications, further improving efficiency and enhancing the user experience.
DIALOG MANAGEMENT WITH MULTIPLE APPLICATIONS
Features are disclosed for performing functions in response to user requests based on contextual data regarding prior user requests. Users may engage in conversations with a computing device in order to initiate some function or obtain some information. A dialog manager may manage the conversations and store contextual data regarding one or more of the conversations. Processing and responding to subsequent conversations may benefit from the previously stored contextual data by, e.g., reducing the amount of information that a user must provide if the user has already provided the information in the context of a prior conversation. Additional information associated with performing functions responsive to user requests may be shared among applications, further improving efficiency and enhancing the user experience.
CALL MANAGEMENT SYSTEM AND ITS SPEECH RECOGNITION CONTROL METHOD
A speech recognition server has a speech recognition engine, and a mode control table to hold a speech recognition mode for each call. The speech recognition engine has a mode management unit to designate a speech recognition mode for a decoder, and an output analysis unit to analyze recognition result data speech-to-text converted by speech recognition. The output analysis unit designates the speech recognition mode for the mode management unit in accordance with result of analysis of the recognition result data speech-to-text converted by the speech recognition. The mode management unit designates the speech recognition mode for the decoder in accordance with the designation with the output analysis unit. Upon speech recognition of call data, it is possible to suppress hardware resource consumption while improve users' satisfaction.
CALL MANAGEMENT SYSTEM AND ITS SPEECH RECOGNITION CONTROL METHOD
A speech recognition server has a speech recognition engine, and a mode control table to hold a speech recognition mode for each call. The speech recognition engine has a mode management unit to designate a speech recognition mode for a decoder, and an output analysis unit to analyze recognition result data speech-to-text converted by speech recognition. The output analysis unit designates the speech recognition mode for the mode management unit in accordance with result of analysis of the recognition result data speech-to-text converted by the speech recognition. The mode management unit designates the speech recognition mode for the decoder in accordance with the designation with the output analysis unit. Upon speech recognition of call data, it is possible to suppress hardware resource consumption while improve users' satisfaction.
EXTERNAL LANGUAGE MODEL INFORMATION INTEGRATED INTO NEURAL TRANSDUCER MODEL
A computer-implemented method for training a neural transducer is provided including, by using audio data and transcription data of the audio data as input data, obtaining outputs from a trained language model and a seed neural transducer, respectively, combining the outputs to obtain a supervisory output, and updating parameters of another neural transducer in training so that its output is close to the supervisory output. The neural transducer can be a Recurrent Neural Network Transducer (RNN-T).
EXTERNAL LANGUAGE MODEL INFORMATION INTEGRATED INTO NEURAL TRANSDUCER MODEL
A computer-implemented method for training a neural transducer is provided including, by using audio data and transcription data of the audio data as input data, obtaining outputs from a trained language model and a seed neural transducer, respectively, combining the outputs to obtain a supervisory output, and updating parameters of another neural transducer in training so that its output is close to the supervisory output. The neural transducer can be a Recurrent Neural Network Transducer (RNN-T).
Pronunciation error detection apparatus, pronunciation error detection method and program
The present invention provides a pronunciation error detection apparatus capable of following a text without the need for a correct sentence even when erroneous recognition such as a reading error occurs. The pronunciation error detection apparatus comprises: a speech recognition part that recognizes the speech in speech data based on a speech recognition model for a non-native speaker, and outputs speech recognition results, reliability and time information; a reliability determination part that outputs the speech recognition results with higher reliability than a predetermined threshold and the corresponding time information as the determined speech recognition results and the determined time information; and a pronunciation error detection part that outputs a phoneme as a pronunciation error when reliability for each phoneme in the speech recognition results using the native speaker speech recognition model under a weakly constraining grammar is greater than the reliability of the corresponding phoneme in the speech recognition results using the native speaker acoustic model under a constraining grammar in which the determined speech recognition results are correct for the speech data in a segment specified by the determined time information.
Real time correction of accent in speech audio signals
Systems and methods for real-time correction of an accent in a speech audio signal are provided. A method includes dividing the speech audio signal into a stream of input chunks, an input chunk from the stream of input chunks including a pre-defined number of frames of the speech audio signal, extracting, by an acoustic features extraction module from the input chunk and a context associated with the input chunk, acoustic features, the context is a pre-determined number of the frames preceding the input chunk in the stream; extracting, by a linguistic features extraction module from the input chunk and the context, linguistic features, receiving a speaker embedding for a human speaker, providing the speaker embedding, the acoustic features, and the linguistic features to a synthesis module to generate a melspectrogram with a reduced accent, providing the melspectrogram to a vocoder to generate an output chunk of an output audio signal.
SYSTEMS AND METHODS FOR AUTOMATED AUDIO TRANSCRIPTION, TRANSLATION, AND TRANSFER FOR ONLINE MEETING
The present invention discloses systems and methods for multimedia processing. For example, the present invention provides systems and methods for receiving spoken audio, converting the spoken audio to text, and transferring the text to a user. As desired, the speech or text can be translated into one or more different languages. Systems and methods for real-time conversion and transmission of speech and text are provided, including systems and methods for large scale processing of multimedia events.
SYSTEMS AND METHODS FOR AUTOMATED AUDIO TRANSCRIPTION, TRANSLATION, AND TRANSFER FOR ONLINE MEETING
The present invention discloses systems and methods for multimedia processing. For example, the present invention provides systems and methods for receiving spoken audio, converting the spoken audio to text, and transferring the text to a user. As desired, the speech or text can be translated into one or more different languages. Systems and methods for real-time conversion and transmission of speech and text are provided, including systems and methods for large scale processing of multimedia events.