Patent classifications
G10L25/87
METHOD, APPARATUS, ELECTRONIC DEVICE, COMPUTER-READABLE STORAGE MEDIUM, AND COMPUTER PROGRAM PRODUCT FOR VIDEO COMMUNICATION
A method for video communication includes: selecting a first virtual object on a first video communication interface and obtaining associated first virtual object information; displaying a first virtual reality video image and a second virtual reality video image on the first video communication interface, where the first virtual reality video image corresponds to the first virtual object information and a first user feature, and the second virtual reality video image corresponds to second virtual object information and a second user feature; and playing a target virtual audio, the target virtual audio including one or both of a first virtual audio and a second virtual audio, where the first virtual audio corresponds to first voice data and the first virtual object information, and the second virtual audio corresponds to second voice data and the second virtual object information.
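For illustration only, a minimal Python sketch of the flow this abstract describes, under the assumption that avatar rendering and voice conversion are available as black boxes; every name here (VirtualObjectInfo, render_virtual_frame, make_virtual_audio) is a hypothetical stand-in, not the patent's implementation.

```python
# Hypothetical sketch: combine selected virtual object info with user features
# to produce virtual reality video images and virtual audio for the call.
from dataclasses import dataclass

@dataclass
class VirtualObjectInfo:
    avatar_model: str   # virtual object selected on the interface
    voice_profile: str  # timbre used for the corresponding virtual audio

def render_virtual_frame(object_info: VirtualObjectInfo, user_feature):
    """Combine the virtual object with the user's captured feature (e.g. pose,
    expression) into one virtual reality video frame."""
    return {"avatar": object_info.avatar_model, "feature": user_feature}

def make_virtual_audio(object_info: VirtualObjectInfo, voice_data: bytes):
    """Re-render the user's voice data with the virtual object's voice profile."""
    return {"profile": object_info.voice_profile, "audio": voice_data}

def update_call_interface(first_obj, first_feat, second_obj, second_feat,
                          first_voice=None, second_voice=None):
    # Display both virtual reality video images on the first interface.
    frames = [render_virtual_frame(first_obj, first_feat),
              render_virtual_frame(second_obj, second_feat)]
    # Target virtual audio: one or both of the first and second virtual audios.
    audio = [make_virtual_audio(obj, voice)
             for obj, voice in ((first_obj, first_voice), (second_obj, second_voice))
             if voice is not None]
    return frames, audio
```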
Speech feature extraction apparatus, speech feature extraction method, and computer-readable storage medium
A speech feature extraction apparatus 100 includes a voice activity detection unit 103 that drops non-voice frames from frames corresponding to an input speech utterance and calculates a posterior of being voiced for each frame; a voice activity detection process unit 106 that calculates function values, used as weights when pooling frames to produce an utterance-level feature, from the given voice activity detection posteriors; and an utterance-level feature extraction unit 112 that extracts an utterance-level feature from the frames, on the basis of multiple frame-level features, using the function values.
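A short sketch of the posterior-weighted pooling this abstract describes, assuming the frame-level features and VAD posteriors are already computed as numpy arrays; the function f mapping posteriors to weights is a placeholder.

```python
# Sketch: pool frame-level features into an utterance-level feature, using
# a function of the VAD posterior of each frame as its pooling weight.
import numpy as np

def posterior_weighted_pooling(frame_features, vad_posteriors, f=lambda p: p):
    """frame_features : (T, D) array of frame-level features
                        (non-voice frames dropped upstream)
       vad_posteriors : (T,) array, posterior of each frame being voiced
       f              : function mapping a posterior to a pooling weight"""
    weights = f(vad_posteriors)                  # function values used as weights
    weights = weights / (weights.sum() + 1e-8)   # normalize to sum to one
    # Weighted average over frames gives the utterance-level feature.
    return (weights[:, None] * frame_features).sum(axis=0)
```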
Emitting word timings with end-to-end models
A method includes receiving a training example that includes audio data representing a spoken utterance and a ground truth transcription. For each word in the spoken utterance, the method also includes inserting a placeholder symbol before the respective word identifying a respective ground truth alignment for a beginning and an end of the respective word, determining a beginning word piece and an ending word piece, and generating a first constrained alignment for the beginning word piece and a second constrained alignment for the ending word piece. The first constrained alignment is aligned with the ground truth alignment for the beginning of the respective word and the second constrained alignment is aligned with the ground truth alignment for the ending of the respective word. The method also includes constraining an attention head of a second pass decoder by applying the first and second constrained alignments.
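An illustrative sketch of how the constrained alignments described above might be assembled; the placeholder symbol, word-piece tokenization, and ground-truth word times are assumed given, and all names are hypothetical.

```python
# Sketch: insert a placeholder before each word, then constrain the beginning
# and ending word pieces to the word's ground-truth begin/end frames.
PLACEHOLDER = "<w>"  # placeholder symbol inserted before each word

def build_constrained_alignments(words, word_pieces_per_word, word_times):
    """words                : list of words in the spoken utterance
       word_pieces_per_word : list of word-piece lists, one per word
       word_times           : list of (begin_frame, end_frame) ground-truth alignments"""
    tokens, constraints = [], {}
    for word, pieces, (begin, end) in zip(words, word_pieces_per_word, word_times):
        tokens.append(PLACEHOLDER)      # marks the upcoming word boundary
        start_idx = len(tokens)
        tokens.extend(pieces)
        # First constrained alignment: beginning word piece at the word's start frame.
        constraints[start_idx] = begin
        # Second constrained alignment: ending word piece at the word's end frame.
        constraints[len(tokens) - 1] = end
    # The constraints are later used to restrict an attention head of the
    # second-pass decoder.
    return tokens, constraints
```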
Speaker based anaphora resolution
A speech-processing system configured to determine entities corresponding to ambiguous words such as anaphora (“he,” “she,” “they,” etc.) included in an utterance. The system may associate incoming utterances with a speaker identification (ID), device ID, and other data. The system then tracks entities referred to in utterances so that if a later utterance includes an ambiguous entity reference, the system may take the speaker ID, device ID, etc. from the ambiguous reference, along with the text of the utterance and other data, and compare that information to previously mentioned entities (or other entities that may be relevant) to identify the entity mentioned in the ambiguous statement. Once the entity is determined, the system may then complete command processing of the utterance using the identified entity.
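A rough sketch of the entity-tracking idea, assuming a simple per-speaker, per-device history and most-recent-entity resolution; the real system also weighs the utterance text and other data, and all class and function names here are hypothetical.

```python
# Sketch: track entities mentioned per (speaker ID, device ID) and resolve
# pronouns against the most recently mentioned entity for that pair.
from collections import deque

PRONOUNS = {"he", "she", "they", "him", "her", "them", "it"}

class AnaphoraResolver:
    def __init__(self, history_size=20):
        self.history_size = history_size
        # Recently mentioned entities, keyed by (speaker_id, device_id).
        self.history = {}

    def record(self, speaker_id, device_id, entity):
        key = (speaker_id, device_id)
        self.history.setdefault(key, deque(maxlen=self.history_size)).appendleft(entity)

    def resolve(self, speaker_id, device_id, utterance_text):
        tokens = set(utterance_text.lower().split())
        if not PRONOUNS & tokens:
            return None  # no ambiguous reference to resolve
        # In this sketch the most recently mentioned entity for the same
        # speaker/device wins; command processing then uses that entity.
        candidates = self.history.get((speaker_id, device_id), [])
        return candidates[0] if candidates else None
```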
Natural assistant interaction
Systems and processes for operating a virtual assistant to provide natural assistant interaction are provided. In accordance with one or more examples, a method includes, at an electronic device with one or more processors and memory: receiving a first audio stream including one or more utterances; determining whether the first audio stream includes a lexical trigger; generating one or more candidate text representations of the one or more utterances; and determining whether at least one candidate text representation of the one or more candidate text representations is to be disregarded by the virtual assistant. If at least one candidate text representation is to be disregarded, one or more candidate intents are generated based on the candidate text representations other than the at least one candidate text representation to be disregarded.
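A simplified sketch of the filtering step described above; should_disregard and generate_intent are hypothetical stand-ins for the assistant's actual components, and the lexical trigger check is shown as plain substring matching for illustration.

```python
# Sketch: generate candidate intents only from candidate text representations
# that are not disregarded, and only when the lexical trigger is present.
def process_candidates(candidate_texts, lexical_trigger="assistant"):
    # Only proceed when the audio stream's candidates contain the lexical trigger.
    if not any(lexical_trigger in text.lower() for text in candidate_texts):
        return []
    kept = [text for text in candidate_texts if not should_disregard(text)]
    # Candidate intents come only from the candidates that were not disregarded.
    return [generate_intent(text) for text in kept]

def should_disregard(text):
    # Placeholder heuristic: disregard candidates not addressed to the assistant.
    return "assistant" not in text.lower()

def generate_intent(text):
    return {"text": text, "intent": "UNKNOWN"}  # stub intent representation
```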
Pre-wakeword speech processing
A system for capturing and processing portions of a spoken utterance command that may occur before a wakeword. The system buffers incoming audio and indicates locations in the audio where the utterance changes, for example when a long pause is detected. When the system detects a wakeword within a particular utterance, the system determines the most recent utterance change location prior to the wakeword and sends the audio from that location to the end of the command utterance to a server for further speech processing.
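A sketch of the buffering idea: keep recent audio, mark utterance-change locations (for example, long pauses), and on wakeword detection send the audio from the most recent change location before the wakeword onward. Frame counts, thresholds, and class names are illustrative assumptions.

```python
# Sketch: rolling audio buffer with utterance-change tracking for
# pre-wakeword speech capture.
from collections import deque

class PreWakewordBuffer:
    def __init__(self, max_frames=500, pause_frames=30):
        self.frames = deque(maxlen=max_frames)  # rolling buffer of (index, frame)
        self.change_points = []                 # indices where the utterance changed
        self.silent_run = 0
        self.pause_frames = pause_frames
        self.index = 0

    def push(self, frame, is_silence):
        self.frames.append((self.index, frame))
        self.silent_run = self.silent_run + 1 if is_silence else 0
        if self.silent_run == self.pause_frames:
            # A long pause marks a change in the utterance.
            self.change_points.append(self.index)
        self.index += 1

    def audio_since_last_change(self, wakeword_index):
        # Most recent utterance-change location prior to the wakeword; the audio
        # from here to the end of the command is what gets sent for processing.
        start = max((c for c in self.change_points if c < wakeword_index), default=0)
        return [frame for i, frame in self.frames if i >= start]
```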
SPEECH ENDPOINTING BASED ON WORD COMPARISONS
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech endpointing based on word comparisons are described. In one aspect, a method includes the actions of obtaining a transcription of an utterance. The actions further include determining, as a first value, a quantity of text samples in a collection of text samples that (i) include terms that match the transcription, and (ii) do not include any additional terms. The actions further include determining, as a second value, a quantity of text samples in the collection of text samples that (i) include terms that match the transcription, and (ii) include one or more additional terms. The actions further include classifying the utterance as a likely incomplete utterance or not a likely incomplete utterance based at least on comparing the first value and the second value.
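A direct sketch of the counting and comparison this abstract describes, assuming the collection of text samples is a plain list of strings and matching is done at the token level; the margin parameter is an illustrative assumption.

```python
# Sketch: count text samples that exactly match the transcription (first value)
# versus samples that match it but continue with additional terms (second value),
# then classify the utterance as likely incomplete or not by comparing the two.
def classify_likely_incomplete(transcription, text_samples, margin=1.0):
    words = transcription.strip().lower().split()
    first_value = 0   # samples matching the transcription with no additional terms
    second_value = 0  # samples matching the transcription with one or more additional terms
    for sample in text_samples:
        sample_words = sample.strip().lower().split()
        if sample_words[:len(words)] == words:
            if len(sample_words) == len(words):
                first_value += 1
            else:
                second_value += 1
    # Likely incomplete when matching samples usually continue past the transcription.
    return second_value > margin * first_value
```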