Patent classifications
G10L15/285
Systems and methods for voice identification and analysis
Obtaining configuration audio data including voice information for a plurality of meeting participants. Generating localization information indicating a respective location for each meeting participant. Generating a respective voiceprint for each meeting participant. Obtaining meeting audio data. Identifying a first meeting participant and a second meeting participant. Linking a first meeting participant identifier of the first meeting participant with a first segment of the meeting audio data. Linking a second meeting participant identifier of the second meeting participant with a second segment of the meeting audio data. Generating a GUI indicating the respective locations of the first and second meeting participants, and the GUI indicating a first transcription of the first segment and a second transcription of the second segment. The first transcription is associated with the first meeting participant in the GUI, and the second transcription is associated with the second meeting participant in the GUI.
PHONEME-BASED CONTEXTUALIZATION FOR CROSS-LINGUAL SPEECH RECOGNITION IN END-TO-END MODELS
A method includes receiving audio data encoding an utterance spoken by a native speaker of a first language, and receiving a biasing term list including one or more terms in a second language different than the first language. The method also includes processing, using a speech recognition model, acoustic features derived from the audio data to generate speech recognition scores for both wordpieces and corresponding phoneme sequences in the first language. The method also includes rescoring the speech recognition scores for the phoneme sequences based on the one or more terms in the biasing term list, and executing, using the speech recognition scores for the wordpieces and the rescored speech recognition scores for the phoneme sequences, a decoding graph to generate a transcription for the utterance.
ZERO LATENCY DIGITAL ASSISTANT
An electronic device can implement a zero-latency digital assistant by capturing audio input from a microphone and using a first processor to write audio data representing the captured audio input to a memory buffer. In response to detecting a user input while capturing the audio input, the device can determine whether the user input meets a predetermined criteria. If the user input meets the criteria, the device can use a second processor to identify and execute a task based on at least a portion of the contents of the memory buffer.
MITIGATION OF CLIENT DEVICE LATENCY IN RENDERING OF REMOTELY GENERATED AUTOMATED ASSISTANT CONTENT
Implementations relate to mitigating client device latency in rendering of remotely generated automated assistant content. Some of those implementations mitigate client device latency between rendering of multiple instances of out-put that are each based on content that is responsive to a corresponding automated assistant action of a multiple action request. For example, those implementations can reduce latency between rendering of first output that is based on first content responsive to a first automated assistant action of a multiple action request, and second output that is based on second content responsive to a second automated assistant action of the multiple action request.
Voice control processing method and apparatus
A voice control processing method and apparatus, where the method includes enabling, by a terminal in a data service disabled state, a data service after the terminal receives a voice instruction using a first application, where the first application is an application program used for voice control in the terminal, prohibiting, by the terminal, another application other than the first application in the terminal from using the data service, and controlling, by the terminal, the first application to execute the voice instruction using the data service, after the terminal enables the data service. The terminal in a data service disabled state receives the voice instruction. Then, the terminal enables the data service and prohibits another application from using the data service.
System and method of correlating mouth images to input commands
A system for automated speech recognition utilizes computer memory, a processor executing imaging software and audio processing software, and a camera transmitting images of a physical source of speech input. Audio processing software includes an audio data stream of audio samples derived from at least one speech input. At least one timer is configured to transmit elapsed time values as measured in response to respective triggers received by the timer. The audio processing software is configured to assert and de-assert the timer triggers to measure respective audio sample times and interim period times between the audio samples. The audio processing software is further configured to compare the interim period times with a command spacing time value corresponding to an expected interim time value between commands, thereby determining if the speech input is command data or non-command data.
Zero latency digital assistant
An electronic device can implement a zero-latency digital assistant by capturing audio input from a microphone and using a first processor to write audio data representing the captured audio input to a memory buffer. In response to detecting a user input while capturing the audio input, the device can determine whether the user input meets a predetermined criteria. If the user input meets the criteria, the device can use a second processor to identify and execute a task based on at least a portion of the contents of the memory buffer.
DISTRIBUTED PERSONAL ASSISTANT
An exemplary method for using a virtual assistant may include, at an electronic device configured to transmit and receive data, receiving a user request for a service from a virtual assistant; determining at least one task to perform in response to the user request; estimating at least one performance characteristic for completion of the at least one task with the electronic device, based on at least one heuristic; based on the estimating, determining whether to execute the at least one task at the electronic device; in accordance with a determination to execute the at least one task at the electronic device, causing the execution of the at least one task at the electronic device; in accordance with a determination to execute the at least one task outside the electronic device: generating executable code for carrying out the least one task; and transmitting the executable code from the electronic device.
PROMOTING VOICE ACTIONS TO HOTWORDS
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for designating certain voice commands as hotwords. The methods, systems, and apparatus include actions of receiving a hotword followed by a voice command. Additional actions include determining that the voice command satisfies one or more predetermined criteria associated with designating the voice command as a hotword, where a voice command that is designated as a hotword is treated as a voice input regardless of whether the voice command is preceded by another hotword. Further actions include, in response to determining that the voice command satisfies one or more predetermined criteria associated with designating the voice command as a hotword, designating the voice command as a hotword.
CONTEXT-BASED SMARTPHONE SENSOR LOGIC
Methods employ sensors in portable devices (e.g., smartphones) both to sense content information (e.g., audio and imagery) and context information. Device processing is desirably dependent on both. For example, some embodiments activate certain processor intensive operations (e.g., content recognition) based on classification of sensed content and context. The context can control the location where information produced from such operations is stored, or control an alert signal indicating, e.g., that sensed speech is being transcribed. Some arrangements post sensor data collected by one device to a cloud repository, for access and processing by other devices. Multiple devices can collaborate in collecting and processing data, to exploit advantages each may have (e.g., in location, processing ability, social network resources, etc.). A great many other features and arrangements are also detailed.