Patent classifications
G10L17/02
Low-latency multi-speaker speech recognition
Systems and processes for operating an intelligent automated assistant are provided. In one example, a method includes receiving mixed speech data representing utterances of a target speaker and utterances of one or more interfering audio sources. The method further includes obtaining a target speaker representation, which represents speech characteristics of the target speaker; and determining, using a learning network, probability distributions of phonetic elements directly from the mixed speech data. The inputs of the learning network include the mixed speech data and the target speaker representation. An output of the learning network includes the probability distributions of phonetic elements. The method further includes generating text corresponding to the utterances of the target speaker based on the probability distributions of the phonetic elements; and providing a response to the target speaker based on the text corresponding to the utterances of the target speaker.
ANCHORED MESSAGES FOR AUGMENTED REALITY
Augmented reality devices can be configured to display messages in response to sounds from an environment. A variety of techniques can be combined to localize and track the sources of the sounds in the environment. Messages created in response to the sounds can then be anchored to their corresponding sources in order to provide a user with a clear understanding of the location of sources of the messages. Additionally, these anchored messages can be enhanced with additional information, such as identification, to further the user’s understanding of the sources of the messages. The anchored messages can track relative movement to integrate with the AR environment.
SPEECH CONTROL METHOD AND APPARATUS, ELECTRONIC DEVICE AND STORAGE MEDIUM
The disclosure provides a speech control method, an electronic device and a storage medium. The method includes: obtaining a speech to be processed; obtaining a speech feature vector by performing feature analysis on the speech to be processed; determining whether the speech to be processed belongs to a target type based on the speech feature vector; and in response to the speech to be processed belonging to the target type, performing wake-up control on a target device based on the speech to be processed.
STREAMING DATA PROCESSING FOR HYBRID ONLINE MEETINGS
Techniques of streaming data processing for hybrid online meetings are disclosed herein. In one example, a method includes receiving, at a remote server, a video stream captured by a camera in a conference room. The video stream captures images of multiple local participants of an online meeting. The method also includes determining identities of the captured images of the multiple local participants in the received video stream using meeting information of the online meeting and generating a set of individual video streams each corresponding to one of the multiple local participants. The set of individual video streams can then be transmitted to a second computing device corresponding to a remote participant of the online meeting as if the multiple local participants are virtually joining the online meeting.
DEVICE FINDER USING VOICE AUTHENTICATION
A computing device may receive an indication of an audio signal captured by a microphone, wherein the audio signal includes voice input. The computing device may determine that the voice input in the audio signal is from an authorized user of the computing device and includes a trigger phrase associated with a request to trigger device finder functionality based at least in part on comparing the voice input with data provided by the authorized user of the computing device. The computing device may, in response to determining that the voice input in the audio signal is from the authorized user of the computing device and includes the trigger phrase associated with the request to trigger device finder functionality, cause a speaker of the computing device to audibly output an alert sound to assist the authorized user to locate the computing device.
Securely executing voice actions with speaker identification and authorization code
In some implementations, (i) audio data representing a voice command spoken by a speaker and (ii) a speaker identification result indicating that the voice command was spoken by the speaker are obtained. A voice action is selected based at least on a transcription of the audio data. A service provider corresponding to the selected voice action is selected from among a plurality of different service providers. One or more input data types that the selected service provider uses to perform authentication for the selected voice action are identified. A request to perform the selected voice action and one or more values that correspond to the identified one or more input data types are provided to the service provider.
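
The flow in the last abstract above (select a voice action from the transcription, select a service provider, identify the input data types that provider uses for authentication, then send the request with matching values) can be sketched roughly as follows. This is only an illustration and not the patented implementation: the provider registry, the keyword matching, and every name here (`ServiceProvider`, `PROVIDERS`, `select_voice_action`, `build_request`, `example_bank`, `example_music`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceProvider:
    name: str
    # Input data types this provider uses for authentication, per voice action.
    required_inputs: dict


# Hypothetical registry mapping each voice action to its service provider.
PROVIDERS = {
    "transfer_money": ServiceProvider(
        "example_bank",
        {"transfer_money": ["speaker_id", "authorization_code"]},
    ),
    "play_music": ServiceProvider(
        "example_music",
        {"play_music": ["speaker_id"]},
    ),
}


def select_voice_action(transcription):
    """Pick a voice action from the transcription (toy keyword matching)."""
    text = transcription.lower()
    if "transfer" in text:
        return "transfer_money"
    if "play" in text:
        return "play_music"
    return None


def build_request(transcription, speaker_id, context):
    """Assemble the request that would be sent to the selected provider."""
    action = select_voice_action(transcription)
    if action is None:
        return None  # No supported voice action recognised.

    provider = PROVIDERS[action]
    input_types = provider.required_inputs[action]

    # Values available on the device: the speaker identification result
    # plus any extra context such as a spoken authorization code.
    available = {"speaker_id": speaker_id, **context}
    missing = [t for t in input_types if t not in available]
    if missing:
        raise ValueError(f"missing authentication inputs: {missing}")

    # Send only the values the provider actually asks for.
    values = {t: available[t] for t in input_types}
    return {"provider": provider.name, "action": action, "values": values}
```

In this sketch a call such as `build_request("Transfer ten dollars to Alex", "user-1", {"authorization_code": "1234"})` produces a request addressed to the hypothetical `example_bank` provider that carries only the two values that provider requires.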