Patent classifications
G10L15/02
Voice Filtering Other Speakers From Calls And Audio Messages
A method includes receiving a first instance of raw audio data corresponding to a voice-based command and receiving a second instance of the raw audio data corresponding to an utterance of audible contents for an audio-based communication spoken by a user. When a voice filtering recognition routine determines to activate voice filtering for at least the voice of the user, the method also includes obtaining a respective speaker embedding of the user and processing, using the respective speaker embedding, the second instance of the raw audio data to generate enhanced audio data for the audio-based communication that isolates the utterance of the audible contents spoken by the user and excludes at least a portion of the one or more additional sounds that are not spoken by the user. The method also includes executing.
Segment-based speaker verification using dynamically generated phrases
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for verifying an identity of a user. The methods, systems, and apparatus include actions of receiving a request for a verification phrase for verifying an identity of a user. Additional actions include, in response to receiving the request for the verification phrase for verifying the identity of the user, identifying subwords to be included in the verification phrase and in response to identifying the subwords to be included in the verification phrase, obtaining a candidate phrase that includes at least some of the identified subwords as the verification phrase. Further actions include providing the verification phrase as a response to the request for the verification phrase for verifying the identity of the user.
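The selection step in this abstract (identify subwords, then obtain a candidate phrase that includes at least some of them) can be illustrated with a minimal sketch. The function name, the candidate table, and the "largest overlap wins" rule are assumptions made for illustration; the abstract does not say how a candidate phrase is chosen.

```python
from typing import Dict, Iterable, Optional, Set


def pick_verification_phrase(
    needed_subwords: Iterable[str],
    candidates: Dict[str, Set[str]],
) -> Optional[str]:
    """Return the candidate phrase covering the most identified subwords.

    `candidates` maps each phrase to the set of subwords it contains.
    Both the mapping and the scoring rule are hypothetical.
    """
    needed = set(needed_subwords)
    best_phrase: Optional[str] = None
    best_cover = 0
    for phrase, subwords in candidates.items():
        cover = len(needed & subwords)  # how many needed subwords this phrase includes
        if cover > best_cover:
            best_phrase, best_cover = phrase, cover
    return best_phrase  # None when no candidate contains any needed subword


# Hypothetical candidate phrases and their subword inventories.
CANDIDATES = {
    "peanut butter": {"pea", "nut", "but", "ter"},
    "hammer time": {"ham", "mer", "time"},
}
```

With the needed subwords "nut", "ter", and "ham", this sketch returns "peanut butter", since that phrase covers two of the three.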
Methods and systems for detecting and processing speech signals
Provided are methods, systems, and apparatuses for detecting, processing, and responding to audio signals, including speech signals, within a designated area or space. A platform for multiple media devices connected via a network is configured to process speech, such as voice commands, detected at the media devices, and respond to the detected speech by causing the media devices to simultaneously perform one or more requested actions. The platform is capable of scoring the quality of a speech request, handling speech requests from multiple end points of the platform using a centralized processing approach, a de-centralized processing approach, or a combination thereof, and also manipulating partial processing of speech requests from multiple end points into a coherent whole when necessary.
Processing speech signals of a user to generate a visual representation of the user
A computing system for generating image data representing a speaker's face includes a detection device configured to route data representing a voice signal to one or more processors and a data processing device comprising the one or more processors configured to generate a representation of a speaker that generated the voice signal in response to receiving the voice signal. The data processing device executes a voice embedding function to generate a feature vector from the voice signal representing one or more signal features of the voice signal, maps a signal feature of the feature vector to a visual feature of the speaker by a modality transfer function specifying a relationship between the visual feature of the speaker and the signal feature of the feature vector; and generates a visual representation of at least a portion of the speaker based on the mapping, the visual representation comprising the visual feature.
Method and apparatus for information query and storage medium
The present application discloses a method and an apparatus for information query, and an electronic device, which relate to the fields of deep learning (DL), natural language processing (NLP) and artificial intelligence (AI) technology. The method includes: receiving a query sentence, segmenting the query sentence to obtain word segments, and obtaining a dependency relationship between two word segments and the part of speech of the word segments; obtaining a coding sequence of the query sentence according to the dependency relationship and the part of speech of the word segments; matching the coding sequence with a generalized template to obtain a core corpus of the query sentence, wherein the generalized template comprises a part of speech to be extracted and a dependency relationship to be extracted; and obtaining a query result corresponding to the query sentence based on the core corpus. The application no longer relies on the accumulation of massive business scenario data to enhance generalization ability, which ensures accurate and efficient information query and improves the efficiency and reliability of the information query process. At the same time, it may support information query in different business scenarios, with strong expansion capability and high universality.
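The template-matching step can be pictured with a small sketch. It assumes each word segment is already labeled with a part of speech and a dependency relationship, and it treats the generalized template as a set of (part of speech, dependency) pairs to extract. The tag names, the `Segment` type, and the `extract_core` function are hypothetical; the abstract does not specify the coding sequence or the template format.

```python
from typing import List, NamedTuple, Set, Tuple


class Segment(NamedTuple):
    word: str
    pos: str  # part of speech (hypothetical tag set)
    dep: str  # dependency relationship to the segment's head (hypothetical labels)


def extract_core(segments: List[Segment], template: Set[Tuple[str, str]]) -> List[str]:
    """Keep the words whose (pos, dep) pair appears in the template."""
    return [s.word for s in segments if (s.pos, s.dep) in template]


# Hypothetical analysis of the query "what is the weather in Paris".
QUERY = [
    Segment("what", "PRON", "attr"),
    Segment("is", "VERB", "root"),
    Segment("the", "DET", "det"),
    Segment("weather", "NOUN", "nsubj"),
    Segment("in", "ADP", "prep"),
    Segment("Paris", "PROPN", "pobj"),
]

# Hypothetical generalized template: extract subject nouns and prepositional objects.
TEMPLATE = {("NOUN", "nsubj"), ("PROPN", "pobj")}

core = extract_core(QUERY, TEMPLATE)
```

Because the template names only grammatical roles and not specific words, the same template applies to queries from other domains that share the sentence shape.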
Real time correction of accent in speech audio signals
Systems and methods for real-time correction of an accent in a speech audio signal are provided. A method includes dividing the speech audio signal into a stream of input chunks, an input chunk from the stream of input chunks including a pre-defined number of frames of the speech audio signal; extracting, by an acoustic features extraction module, acoustic features from the input chunk and a context associated with the input chunk, the context being a pre-determined number of the frames preceding the input chunk in the stream; extracting, by a linguistic features extraction module, linguistic features from the input chunk and the context; receiving a speaker embedding for a human speaker; providing the speaker embedding, the acoustic features, and the linguistic features to a synthesis module to generate a melspectrogram with a reduced accent; and providing the melspectrogram to a vocoder to generate an output chunk of an output audio signal.
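The chunking step described here, in which each input chunk is paired with a fixed number of preceding frames as context, can be sketched as follows. The function name and parameters are assumptions for illustration. Frames are treated as opaque items, and the feature extraction, synthesis, and vocoder stages are not modeled.

```python
from typing import Iterator, List, Sequence, Tuple


def chunks_with_context(
    frames: Sequence,
    chunk_size: int,
    context_size: int,
) -> Iterator[Tuple[List, List]]:
    """Yield (context, chunk) pairs over a sequence of frames.

    Each chunk holds exactly `chunk_size` frames. Its context is up to
    `context_size` frames immediately preceding the chunk (fewer at the
    start of the stream). Trailing frames that do not fill a whole chunk
    are not emitted; in a live stream they would wait for more input.
    """
    if chunk_size <= 0 or context_size < 0:
        raise ValueError("chunk_size must be positive and context_size non-negative")
    for start in range(0, len(frames) - chunk_size + 1, chunk_size):
        context = list(frames[max(0, start - context_size):start])
        chunk = list(frames[start:start + chunk_size])
        yield context, chunk


# Seven frames, chunks of three frames, two frames of context.
pairs = list(chunks_with_context(list(range(7)), chunk_size=3, context_size=2))
```

Here `pairs` is `[([], [0, 1, 2]), ([1, 2], [3, 4, 5])]`: the first chunk has no preceding frames, and frame 6 is held back until a full chunk is available.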