G10L13/00

Pronunciation error detection apparatus, pronunciation error detection method and program

The present invention provides a pronunciation error detection apparatus capable of following a text without requiring a correct reference sentence, even when erroneous recognition such as a reading error occurs. The pronunciation error detection apparatus comprises: a speech recognition part that recognizes the speech in speech data based on a speech recognition model for non-native speakers, and outputs speech recognition results, reliability, and time information; a reliability determination part that outputs the speech recognition results whose reliability exceeds a predetermined threshold, together with the corresponding time information, as the determined speech recognition results and determined time information; and a pronunciation error detection part that, for the speech data in a segment specified by the determined time information, outputs a phoneme as a pronunciation error when the reliability of that phoneme in the speech recognition results obtained with a native-speaker speech recognition model under a weakly constraining grammar is greater than the reliability of the corresponding phoneme in the speech recognition results obtained with a native-speaker acoustic model under a grammar constrained to treat the determined speech recognition results as correct.
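The comparison rule in this abstract can be sketched as follows. This is a minimal illustration, not the patented method: the per-phoneme reliability scores, phoneme labels, and data layout are invented placeholders standing in for the outputs of the two decoding passes (weakly constrained vs. transcript-constrained).

```python
# Hypothetical sketch: a phoneme is flagged as a pronunciation error when its
# reliability under a weakly constraining grammar exceeds its reliability
# under a grammar constrained to the determined (high-confidence) transcript.

def detect_pronunciation_errors(weak_grammar_scores, constrained_scores):
    """Return phonemes whose weak-grammar reliability beats the constrained one.

    Both arguments map phoneme positions to (phoneme, reliability) pairs.
    """
    errors = []
    for pos, (phoneme, weak_rel) in weak_grammar_scores.items():
        _, constrained_rel = constrained_scores[pos]
        if weak_rel > constrained_rel:
            errors.append((pos, phoneme))
    return errors

# Toy scores (invented for illustration): position 1 is flagged because the
# recognizer was more confident when free to deviate from the transcript.
weak = {0: ("k", 0.62), 1: ("ae", 0.91), 2: ("t", 0.70)}
constrained = {0: ("k", 0.88), 1: ("ae", 0.55), 2: ("t", 0.81)}
print(detect_pronunciation_errors(weak, constrained))  # [(1, 'ae')]
```

The intuition: if the recognizer does better when allowed to stray from the expected transcript, the learner likely did not pronounce the expected phoneme.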

Processing speech signals of a user to generate a visual representation of the user
11568864 · 2023-01-31

A computing system for generating image data representing a speaker's face includes a detection device configured to route data representing a voice signal to one or more processors and a data processing device comprising the one or more processors configured to generate a representation of a speaker that generated the voice signal in response to receiving the voice signal. The data processing device executes a voice embedding function to generate a feature vector from the voice signal representing one or more signal features of the voice signal; maps a signal feature of the feature vector to a visual feature of the speaker by a modality transfer function specifying a relationship between the visual feature of the speaker and the signal feature of the feature vector; and generates a visual representation of at least a portion of the speaker based on the mapping, the visual representation comprising the visual feature.
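The embedding-then-transfer pipeline above can be sketched as below. Everything here is an invented stand-in for illustration: the signal features (`energy`, `pitch_proxy`), the visual features, and the scaling relationships all take the place of a learned voice embedding and modality transfer function.

```python
# Hypothetical sketch: a voice embedding function turns a signal into a
# feature vector, and a modality transfer function maps each signal feature
# to a visual feature of the speaker.

def voice_embedding(signal):
    """Toy embedding: mean amplitude and a crude pitch proxy."""
    mean_amp = sum(abs(s) for s in signal) / len(signal)
    zero_crossings = sum(
        1 for a, b in zip(signal, signal[1:]) if (a < 0) != (b < 0)
    )
    return {"energy": mean_amp, "pitch_proxy": zero_crossings / len(signal)}

# Modality transfer: each signal feature is associated with a visual feature
# and a simple scaling relationship (stand-ins for a learned mapping).
MODALITY_TRANSFER = {
    "energy": ("mouth_openness", 2.0),
    "pitch_proxy": ("jaw_width", 0.5),
}

def visual_representation(signal):
    vector = voice_embedding(signal)
    return {
        visual: round(scale * vector[feat], 3)
        for feat, (visual, scale) in MODALITY_TRANSFER.items()
    }

print(visual_representation([0.1, -0.2, 0.3, -0.1]))
```

The key structural point is the indirection: the embedding never produces pixels directly; a separate transfer function owns the cross-modal relationship, so either side can be swapped independently.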

CAPTIONED TELEPHONE SERVICE SYSTEM HAVING TEXT-TO-SPEECH AND ANSWER ASSISTANCE FUNCTIONS
20230239401 · 2023-07-27

A captioned telephone service system having the text-to-speech and answer assistance functions includes a captioner, a text-to-speech system, and an answer assistance system. The captioner provides captions to a user during a phone call between the user and a peer by receiving the peer’s voice from a peer device, transcribing the peer’s voice into caption data, and transferring the caption data to the user device. The text-to-speech system is configured to receive text data from the user device, convert the text data into speech, and transfer the voice of the speech to the peer device via the voice path in real time. The answer assistance system is configured to receive the caption data from the captioner, analyze the caption data to identify a question, analyze the question to generate answer suggestions, and forward the answer suggestions to the user device for review, editing, and selection.
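The answer-assistance flow described above (receive captions, identify a question, generate suggestions for the user to review) can be sketched as follows. A real system would use trained language models; the keyword rules and canned answers here are invented placeholders.

```python
# Hypothetical sketch of the answer-assistance flow: scan incoming caption
# data for a question, then propose candidate answers the user can review,
# edit, and select.

import re

SUGGESTION_RULES = [
    (r"\bwhat time\b", ["At 3 pm.", "Let me check my calendar."]),
    (r"\bhow are you\b", ["I'm doing well, thanks.", "Fine, and you?"]),
]

def is_question(caption):
    return caption.strip().endswith("?") or caption.lower().startswith(
        ("what", "how", "when", "where", "why", "who")
    )

def answer_suggestions(caption):
    """Return suggested answers for a caption, or [] if it is not a question."""
    if not is_question(caption):
        return []
    text = caption.lower()
    for pattern, suggestions in SUGGESTION_RULES:
        if re.search(pattern, text):
            return suggestions
    return ["Could you repeat that?"]  # generic fallback

print(answer_suggestions("What time works for you?"))
```

Keeping the suggestions as proposals rather than automatic replies matches the abstract's emphasis on user review, editing, and selection before anything is sent.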

SYSTEM FOR INTELLIGENT FACILITATION OF SPEECH SYNTHESIS AND SPEECH RECOGNITION WITH AUTO-TRANSLATION ON SOCIAL MEDIA PLATFORM

The present invention relates to social media networking platform features. It particularly relates to a system for facilitating speech synthesis, i.e., a text-to-speech or audio feature, on a social media networking platform, and further relates to facilitating speech recognition, i.e., an audio-to-text feature, on the platform. In addition, the aforementioned system facilitates auto-translation into all languages to aid users operating in their own preferred language. Further, the system enables sharing of content on its portal while retaining the track and identity of the content's original creator. The system may be operated through all common forms of multimedia platforms such as computers, laptops, mobile phones, and tablets.

SYSTEMS AND METHODS FOR AUTOMATED AUDIO TRANSCRIPTION, TRANSLATION, AND TRANSFER FOR ONLINE MEETING

The present invention discloses systems and methods for multimedia processing. For example, the present invention provides systems and methods for receiving spoken audio, converting the spoken audio to text, and transferring the text to a user. As desired, the speech or text can be translated into one or more different languages. Systems and methods for real-time conversion and transmission of speech and text are provided, including systems and methods for large scale processing of multimedia events.
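The receive-transcribe-translate-transfer chain described above can be sketched as a small pipeline. The `transcribe` and `translate` callables are stand-ins for real ASR and machine-translation services, and the recipient map and language codes are invented for illustration.

```python
# Hypothetical sketch: spoken audio is transcribed to text, optionally
# translated, and forwarded to each recipient in their preferred language.

def run_pipeline(audio_chunk, recipients, transcribe, translate):
    """Transcribe one audio chunk and deliver per-recipient translations."""
    text = transcribe(audio_chunk)
    deliveries = {}
    for user, language in recipients.items():
        deliveries[user] = text if language == "en" else translate(text, language)
    return deliveries

# Toy stand-ins for illustration only.
fake_transcribe = lambda audio: "hello everyone"
fake_translate = lambda text, lang: f"[{lang}] {text}"

print(run_pipeline(b"...", {"alice": "en", "bob": "fr"},
                   fake_transcribe, fake_translate))
```

Transcribing once and fanning out translations per recipient is what lets this scale to the large multimedia events the abstract mentions: the expensive ASR step is shared, and only the cheap per-language step is repeated.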

Operating modes that designate an interface modality for interacting with an automated assistant
11561764 · 2023-01-24

Implementations described herein relate to transitioning a computing device between operating modes according to whether the computing device is suitably oriented for receiving non-audio gestures. For instance, the user can attach a portable computing device to a docking station of a vehicle and, while in transit, wave their hand near the portable computing device in order to invoke the automated assistant. Such an action by the user can be detected by a proximity sensor and/or any other device capable of determining a context of the portable computing device and/or the user's interest in invoking the automated assistant. In some implementations, the location, orientation, and/or motion of the portable computing device can be detected and used in combination with the proximity sensor's output to determine whether to invoke the automated assistant in response to an input gesture from the user.
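The combined decision (proximity signal gated by device context) can be sketched as below. The field names, the docked-and-upright notion of "suitably oriented", and the tilt threshold are all invented for illustration, not taken from the patent.

```python
# Hypothetical sketch: invoke the assistant for a non-audio gesture only when
# the proximity sensor fires AND the device's context suggests the gesture
# was meant for it.

from dataclasses import dataclass

@dataclass
class DeviceContext:
    proximity_triggered: bool   # hand detected near the device
    docked: bool                # attached to a docking station
    tilt_degrees: float         # deviation from an upright orientation
    moving: bool                # the vehicle (not the device) is in motion

def should_invoke_assistant(ctx, max_tilt=30.0):
    """Invoke only if proximity fired and the device is suitably oriented."""
    suitably_oriented = ctx.docked and ctx.tilt_degrees <= max_tilt
    return ctx.proximity_triggered and suitably_oriented

ctx = DeviceContext(proximity_triggered=True, docked=True,
                    tilt_degrees=10.0, moving=True)
print(should_invoke_assistant(ctx))  # True
```

Gating on orientation is what prevents false invocations: a hand passing a phone lying face-down in a pocket trips the proximity sensor too, but fails the context check.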

SECURING PERSONALLY IDENTIFIABLE AND PRIVATE INFORMATION IN CONVERSATIONAL AI-BASED COMMUNICATION
20230229808 · 2023-07-20

Method and system for securing personally identifiable and sensitive information in conversational AI-based communication. The method comprises: enabling a first service provider device as the communication channel provider of an incoming communication mode and a second service provider device as the communication channel provider of an outgoing communication mode, at least one of the incoming and outgoing communication modes comprising an audio communication; storing content of a conversation in the incoming communication mode in a first storage medium accessible to the first service provider device but not the second; storing content of the conversation in the outgoing communication mode in a second storage medium accessible to the second service provider device but not the first; and anonymizing the audio communication such that personally identifiable audio characteristics of the user are obfuscated from the service provider devices.
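The storage split described above can be sketched as follows: each provider may read only its own leg of the conversation, and identifying voice characteristics are replaced by an anonymized token. The class, the provider-to-leg access map, and the use of a hash as the obfuscation step are all invented for illustration; real voice anonymization would transform the audio itself.

```python
# Hypothetical sketch: per-leg conversation storage with provider-scoped
# access, plus a stand-in for obfuscating identifying voice characteristics.

import hashlib

class ConversationStore:
    def __init__(self):
        self._store = {"incoming": {}, "outgoing": {}}  # one store per leg

    def save(self, leg, turn_id, content):
        self._store[leg][turn_id] = content

    def read(self, provider, leg, turn_id):
        # The first provider may only read the incoming leg; the second,
        # only the outgoing leg.
        allowed = {"provider1": "incoming", "provider2": "outgoing"}
        if allowed[provider] != leg:
            raise PermissionError(f"{provider} cannot read the {leg} leg")
        return self._store[leg][turn_id]

def anonymize_speaker_id(voiceprint):
    """Replace an identifying voiceprint with an irreversible token."""
    return hashlib.sha256(voiceprint).hexdigest()[:12]

store = ConversationStore()
store.save("incoming", 1, "caller: my account number is ...")
print(store.read("provider1", "incoming", 1))
```

The design point is that neither provider ever holds both halves of the conversation, so neither can reconstruct the full exchange on its own.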