Patent classifications
G10L15/28
DETECTION OF SPEECH
A method of own voice detection is provided for a user of a device. A first signal is detected, representing air-conducted speech using a first microphone of the device. A second signal is detected, representing bone-conducted speech using a bone-conduction sensor of the device. The first signal is filtered to obtain a component of the first signal at a speech articulation rate, and the second signal is filtered to obtain a component of the second signal at the speech articulation rate. The component of the first signal at the speech articulation rate and the component of the second signal at the speech articulation rate are compared, and it is determined that the speech has not been generated by the user of the device, if a difference between the component of the first signal at the speech articulation rate and the component of the second signal at the speech articulation rate exceeds a threshold value.
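The comparison described in this abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the 2 to 10 Hz band used for the "speech articulation rate", the rectify-and-smooth envelope, the unit-RMS normalization, and the threshold of 0.5 are all assumptions, and the function names are hypothetical.

```python
import math


def one_pole_lowpass(samples, cutoff_hz, sample_rate_hz):
    """First-order low-pass filter (exponential smoothing)."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate_hz)
    state = 0.0
    out = []
    for value in samples:
        state += alpha * (value - state)
        out.append(state)
    return out


def articulation_component(samples, sample_rate_hz, low_hz=2.0, high_hz=10.0):
    """Envelope content in an assumed articulation-rate band, scaled to unit RMS."""
    envelope = [abs(v) for v in samples]  # crude envelope: full-wave rectification
    upper = one_pole_lowpass(envelope, high_hz, sample_rate_hz)
    lower = one_pole_lowpass(envelope, low_hz, sample_rate_hz)
    band = [u - l for u, l in zip(upper, lower)]  # crude band-pass
    rms = math.sqrt(sum(b * b for b in band) / len(band))
    if rms == 0.0:
        return band  # silent input: nothing to normalize
    return [b / rms for b in band]


def is_own_voice(air, bone, sample_rate_hz, threshold=0.5):
    """Return False when the two articulation-rate components differ too much."""
    a = articulation_component(air, sample_rate_hz)
    b = articulation_component(bone, sample_rate_hz)
    difference = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))
    return difference <= threshold
```

With this normalization, a bone-conducted signal that is a scaled copy of the air-conducted signal gives a difference near zero, while a silent bone-conduction sensor (speech from another talker) gives a difference near one.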
Privacy enhancement apparatuses for use with voice-activated devices and assistants
Devices for preventing unintended conversation from being recorded by a voice activated assistant device/application (VAD) are disclosed. The device is contoured to fit over a functional surface of a VAD that typically includes a plurality of microphones and control buttons. The device covers the microphones and uses its own microphones to monitor for an authorization input signal. In an embodiment, the device uses speakers aligned with and opposing each VAD microphone. The device emits interfering audible signals during this mode of operation. Once the device senses an authorization input, the device decouples its speakers from the interfering audible signal and instead allows the device microphones to pass through to the VAD. During this mode, the VAD is in normal operation.
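The two modes of operation described above can be pictured as a small state machine. The sketch below is an illustration only: the class and method names are hypothetical, and the return to the interfering mode via `on_session_end` is an assumption, since the abstract does not say how the device leaves pass-through mode.

```python
from enum import Enum


class Mode(Enum):
    JAMMING = "jamming"            # speakers play the interfering signal
    PASS_THROUGH = "pass_through"  # speakers relay the cover's own microphones


class PrivacyCover:
    """Hypothetical controller for a cover placed over the VAD microphones."""

    def __init__(self):
        self.mode = Mode.JAMMING  # default: the VAD hears only interference

    def on_authorization(self):
        # Authorization input sensed: decouple speakers from the interfering signal.
        self.mode = Mode.PASS_THROUGH

    def on_session_end(self):
        # Assumption: some event (timeout, button press) restores the jamming mode.
        self.mode = Mode.JAMMING

    def speaker_output(self, mic_frame, noise_frame):
        """Choose what the speakers facing the VAD microphones should play."""
        if self.mode is Mode.JAMMING:
            return noise_frame
        return mic_frame
```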
Small Footprint Multi-Channel Keyword Spotting
A method (800) to detect a hotword in a spoken utterance (120) includes receiving a sequence of input frames (210) characterizing streaming multi-channel audio (118). Each channel (119) of the streaming multi-channel audio includes respective audio features (510) captured by a separate dedicated microphone (107). For each input frame, the method includes processing, using a three-dimensional (3D) single value decomposition filter (SVDF) input layer (302) of a memorized neural network (300), the respective audio features of each channel in parallel and generating a corresponding multi-channel audio feature representation (420) based on a concatenation of the respective audio features (344). The method also includes generating, using sequentially-stacked SVDF layers (350), a probability score (360) indicating a presence of a hotword in the audio. The method also includes determining whether the probability score satisfies a threshold and, when satisfied, initiating a wake-up process on a user device (102).
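A heavily simplified sketch of the per-frame flow described above follows. It assumes a rank-1 SVDF node (a filter over the features of one frame followed by a filter over a short memory of past projections), one node per channel, and a single logistic output in place of the sequentially-stacked SVDF layers. All weights, names, and the threshold are hypothetical.

```python
import math
from collections import deque


class SvdfNode:
    """Rank-1 SVDF node: feature filter, then time filter over a frame memory."""

    def __init__(self, feature_filter, time_filter):
        self.feature_filter = feature_filter
        self.time_filter = time_filter  # last weight applies to the newest frame
        self.memory = deque([0.0] * len(time_filter), maxlen=len(time_filter))

    def step(self, frame):
        projected = sum(w * x for w, x in zip(self.feature_filter, frame))
        self.memory.append(projected)  # oldest projection is dropped
        mixed = sum(w * m for w, m in zip(self.time_filter, self.memory))
        return max(0.0, mixed)  # ReLU activation


def hotword_score(channel_nodes, output_weights, bias, multichannel_frame):
    """Process each channel with its own node, concatenate, and score."""
    features = [node.step(frame)
                for node, frame in zip(channel_nodes, multichannel_frame)]
    logit = bias + sum(w * f for w, f in zip(output_weights, features))
    return 1.0 / (1.0 + math.exp(-logit))


def should_wake(score, threshold=0.5):
    """Initiate the wake-up process only when the score satisfies the threshold."""
    return score >= threshold
```

Because each node keeps its own memory, the score for a frame depends on the preceding frames of the stream as well as the current one.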
Inverted Projection for Robust Speech Translation
The technology provides an approach to train translation models that are robust to transcription errors and punctuation errors. The approach includes introducing errors from actual automatic speech recognition and automatic punctuation systems into the source side of the machine translation training data. A method for training a machine translation model includes performing automatic speech recognition on input source audio to generate a system transcript. The method aligns a human transcript of the source audio to the system transcript, including projecting system segmentation onto the human transcript. Then the method performs segment robustness training of a machine translation model according to the aligned human and system transcripts, and performs system robustness training of the machine translation model, e.g., by injecting token errors into training data.
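The two data-preparation steps named in this abstract can be sketched roughly as below. This is an assumption-laden illustration: it uses `difflib.SequenceMatcher` as a stand-in token aligner, and it injects random token errors, whereas the abstract describes errors taken from actual recognition and punctuation systems. The function names are hypothetical.

```python
import random
from difflib import SequenceMatcher


def project_segmentation(system_segments, human_tokens):
    """Cut the human transcript at the positions of the system segment boundaries."""
    system_tokens = [tok for seg in system_segments for tok in seg]
    matcher = SequenceMatcher(a=system_tokens, b=human_tokens, autojunk=False)
    # boundary_map[i]: human-token index aligned with "just before system token i".
    boundary_map = [0] * (len(system_tokens) + 1)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        for i in range(i1, i2):
            boundary_map[i] = min(j1 + (i - i1), j2)
    boundary_map[len(system_tokens)] = len(human_tokens)

    projected, start, position = [], 0, 0
    for seg in system_segments:
        position += len(seg)
        end = boundary_map[position]
        projected.append(human_tokens[start:end])
        start = end
    return projected


def inject_token_errors(tokens, vocabulary, error_rate, rng):
    """Randomly substitute, delete, or insert tokens (stand-in for real ASR errors)."""
    noisy = []
    for token in tokens:
        if rng.random() >= error_rate:
            noisy.append(token)
            continue
        operation = rng.choice(("substitute", "delete", "insert"))
        if operation == "substitute":
            noisy.append(rng.choice(vocabulary))
        elif operation == "insert":
            noisy.extend([token, rng.choice(vocabulary)])
        # "delete": the token is simply dropped
    return noisy
```

For example, if the system transcript is segmented as `hello` / `word how are you`, projecting onto the human transcript `hello world how are you` yields clean human text with the same, imperfect segment break.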
ELECTRONIC DEVICE AND METHOD FOR PROCESSING SPEECH BY CLASSIFYING SPEECH TARGET
Various embodiments of the disclosure provide a method and a device which includes multiple cameras arranged at different positions, multiple microphones arranged at different positions, a memory, and a processor operatively connected to at least one of the multiple cameras, the multiple microphones, and the memory, wherein the processor is configured to: determine, using at least one of the multiple cameras, whether at least one of a user wearing the electronic device or a counterpart having a conversation with the user makes an utterance, configure directivity of at least one of the multiple microphones based on the determination, obtain an audio from at least one of the multiple microphones based on the configured directivity, obtain an image including a mouth shape of the user or the counterpart from at least one of the multiple cameras, and process speech of an utterance target in a different manner based on the obtained audio and the image.
Signal processing coordination among digital voice assistant computing devices
Coordinating signal processing among computing devices in a voice-driven computing environment is provided. A first and second digital assistant can detect an input audio signal, perform a signal quality check, and provide indications that the first and second digital assistants are operational to process the input audio signal. A system can select the first digital assistant for further processing. The system can receive, from the first digital assistant, data packets including a command. The system can generate, for a network connected device selected from a plurality of network connected devices, an action data structure based on the data packets, and transmit the action data structure to the selected network connected device.
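The selection step can be sketched as follows. The abstract does not state the selection criterion, so choosing the operational assistant with the highest signal quality is an assumption, as are the field names and the shape of the action data structure.

```python
from dataclasses import dataclass


@dataclass
class Indication:
    """What each digital assistant reports after detecting the input audio."""
    assistant_id: str
    operational: bool
    signal_quality: float  # assumed scale: higher is better


def select_assistant(indications):
    """Pick one operational assistant to handle further processing."""
    candidates = [ind for ind in indications if ind.operational]
    if not candidates:
        return None
    return max(candidates, key=lambda ind: ind.signal_quality).assistant_id


def build_action(command, target_device):
    """Hypothetical action data structure sent to the selected networked device."""
    return {"device": target_device, "action": command}
```

Only the selected assistant forwards the command, which avoids two nearby devices both acting on the same utterance.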