G10L15/285

ELECTRONIC DEVICE AND METHOD FOR CONTROLLING THE ELECTRONIC DEVICE

Disclosed are an electronic device capable of efficiently performing speech recognition and natural language understanding and a method for controlling the same. The electronic device includes: a microphone; a non-volatile memory configured to store virtual assistant model data comprising data that is classified according to a plurality of domains and data that is commonly used for the plurality of domains; a volatile memory; and a processor configured to: based on receiving, through the microphone, a trigger input to perform speech recognition for a user speech, initiate loading the virtual assistant model data from the non-volatile memory into the volatile memory, load, into the volatile memory, first data from among the data classified according to the plurality of domains and, while loading the first data into the volatile memory, load at least a part of the data commonly used for the plurality of domains into the volatile memory.
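The overlapped loading described above can be pictured with a minimal sketch. It assumes the non-volatile store is a plain dictionary and that the keys `domain:<name>` and `common` exist; these names and the two-worker thread pool are illustrative assumptions, not details taken from the abstract.

```python
from concurrent.futures import ThreadPoolExecutor


def load_on_trigger(storage, first_domain):
    """Copy one domain's data and the shared data into 'RAM' concurrently.

    storage: dict standing in for non-volatile memory (hypothetical layout).
    first_domain: name of the domain whose data is loaded first.
    """
    ram = {}  # stands in for volatile memory

    def load(key):
        ram[key] = storage[key]

    # Both loads are submitted together so the common data is loaded
    # while the first domain's data is still being loaded.
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [
            pool.submit(load, f"domain:{first_domain}"),
            pool.submit(load, "common"),
        ]
        for future in futures:
            future.result()  # re-raise any loading error
    return ram
```

Data for the remaining domains stays in the store until it is needed, which is one plausible reading of why only the first data and the common data are loaded on the trigger.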

SYSTEMS AND METHODS FOR VOICE IDENTIFICATION AND ANALYSIS
20210065713 · 2021-03-04

Obtaining configuration audio data including voice information for a plurality of meeting participants. Generating localization information indicating a respective location for each meeting participant. Generating a respective voiceprint for each meeting participant. Obtaining meeting audio data. Identifying a first meeting participant and a second meeting participant. Linking a first meeting participant identifier of the first meeting participant with a first segment of the meeting audio data. Linking a second meeting participant identifier of the second meeting participant with a second segment of the meeting audio data. Generating a GUI indicating the respective locations of the first and second meeting participants, and the GUI indicating a first transcription of the first segment and a second transcription of the second segment. The first transcription is associated with the first meeting participant in the GUI, and the second transcription is associated with the second meeting participant in the GUI.

Harmony generation device and storage medium

A harmony generation device and a program for the same which can generate a natural harmony sound are provided. The harmony generation device (1) generates first and second harmony tones to which a voice input through a microphone (M) is shifted in pitch by first and second shift amounts calculated based on both the voice input through the microphone (M) and a chord determined from performance information of an electric guitar (G) input through an input device (34). That is, since the first and second harmony tones can be tones based on the chord of the electric guitar (G) that changes from moment to moment, the harmony sound obtained by mixing the first and second harmony tones with the voice input through the microphone (M) can be a natural harmony sound that is rich in variation according to the chord of the electric guitar (G).
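As a rough illustration of how two shift amounts could depend on both the sung pitch and the current chord, the sketch below picks the next two chord tones above the voice. The selection rule, the MIDI note representation, and the function name `harmony_shifts` are assumptions made for this example; the abstract does not state how the shift amounts are calculated.

```python
def harmony_shifts(voice_note, chord_pitch_classes, count=2):
    """Return semitone shifts from the voice to the next chord tones above it.

    voice_note: sung pitch as a MIDI note number (assumed representation).
    chord_pitch_classes: set of pitch classes 0-11 in the detected chord.
    """
    if not chord_pitch_classes:
        return []
    shifts = []
    note = voice_note
    while len(shifts) < count:
        note += 1
        if note % 12 in chord_pitch_classes:
            shifts.append(note - voice_note)
    return shifts


# Voice sings E4 (MIDI 64); the guitar moves from C major to A minor.
print(harmony_shifts(64, {0, 4, 7}))  # C major: G4 and C5
print(harmony_shifts(64, {9, 0, 4}))  # A minor: A4 and C5
```

The same sung note yields different first shift amounts as the chord changes, which is the behavior the abstract attributes to the device.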

Pausing automatic speech recognition

A speech interface device is configured to process user speech by storing, in volatile memory of the speech interface device, audio data that represents user speech, and inputting first audio data, of the stored audio data, to an automatic speech recognition (ASR) component of the speech interface device, determining that a criterion is satisfied, and, based on the criterion being satisfied, maintaining second audio data in the volatile memory. The ASR component may generate text data based on the first audio data, a natural language understanding (NLU) component of the speech interface device may generate NLU data based on the text data, and, if the NLU data corresponds to a recognized intent, the second audio data may be deleted. Otherwise, speech processing can be resumed by inputting the second audio data to the ASR component.
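The pause-and-resume flow can be sketched as follows. The `asr` and `nlu` arguments are hypothetical callables standing in for the device's components, and audio is modeled as plain strings for simplicity; none of these names come from the abstract.

```python
def process_with_pause(first_audio, second_audio, asr, nlu):
    """Run ASR on the first audio; only use the held-back audio if needed.

    asr: callable mapping audio to text (hypothetical).
    nlu: callable mapping text to an intent, or None if unrecognized.
    Returns (intent, text_used).
    """
    text = asr(first_audio)
    intent = nlu(text)
    if intent is not None:
        # Recognized intent: the second audio can be discarded unprocessed.
        return intent, text
    # No recognized intent: resume by sending the held-back audio to ASR.
    text = f"{text} {asr(second_audio)}".strip()
    return nlu(text), text
```

In this sketch, discarding the second audio is modeled simply by never passing it to `asr`.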

SYSTEM AND METHOD OF CORRELATING MOUTH IMAGES TO INPUT COMMANDS
20210035586 · 2021-02-04

A system for automated speech recognition utilizes computer memory, a processor executing imaging software and audio processing software, and a camera transmitting images of a physical source of speech input. Audio processing software includes an audio data stream of audio samples derived from at least one speech input. At least one timer is configured to transmit elapsed time values as measured in response to respective triggers received by the timer. The audio processing software is configured to assert and de-assert the timer triggers to measure respective audio sample times and interim period times between the audio samples. The audio processing software is further configured to compare the interim period times with a command spacing time value corresponding to an expected interim time value between commands, thereby determining if the speech input is command data or non-command data.
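A minimal sketch of the timing comparison follows. It assumes audio samples are given as `(start_ms, end_ms)` pairs and that a pause at least as long as the command spacing marks the next sample as a command; the abstract does not specify the direction of the comparison, so that rule is an assumption, as are the function names.

```python
def interim_periods(segments):
    """Pause lengths (ms) between consecutive (start_ms, end_ms) segments."""
    return [
        next_start - prev_end
        for (_, prev_end), (next_start, _) in zip(segments, segments[1:])
    ]


def label_following_segments(segments, command_spacing_ms):
    """Label every segment after the first by the pause preceding it.

    Assumed rule: a pause >= command_spacing_ms precedes a command.
    """
    return [
        "command" if gap >= command_spacing_ms else "non-command"
        for gap in interim_periods(segments)
    ]
```

For example, segments at 0-1000 ms, 1200-2000 ms and 3500-4000 ms have pauses of 200 ms and 1500 ms, so with a 1000 ms spacing value only the third segment is labeled a command.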

Hotword detection on multiple devices
10909987 · 2021-02-02

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for hotword detection on multiple devices are disclosed. In one aspect, a method includes the actions of receiving, by a first computing device, audio data that corresponds to an utterance. The actions further include determining a first value corresponding to a likelihood that the utterance includes a hotword. The actions further include receiving a second value corresponding to a likelihood that the utterance includes the hotword, the second value being determined by a second computing device. The actions further include comparing the first value and the second value. The actions further include, based on comparing the first value to the second value, initiating speech recognition processing on the audio data.
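The comparison step might look like the sketch below from the first device's point of view. The rule that the device with the higher likelihood proceeds, and the minimum `threshold`, are assumptions for illustration; the abstract only says that recognition is initiated based on the comparison.

```python
def should_start_recognition(local_score, remote_score, threshold=0.5):
    """Decide whether this device should start speech recognition.

    local_score: this device's hotword likelihood (first value).
    remote_score: likelihood reported by the other device (second value).
    threshold: assumed minimum likelihood for treating audio as a hotword.
    """
    if local_score < threshold:
        return False  # this device did not confidently hear the hotword
    return local_score > remote_score
```

With this rule, at most one of two nearby devices responds to a given utterance, except when both report exactly the same value, a case this sketch leaves unhandled.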

AUDIBLE KEYWORD DETECTION AND METHOD

The disclosure describes keyword detection in an audio processor and methods therefor including a low-power keyword detection engine (LKDE) and a high-power keyword detection engine (HKDE). In one implementation, the LKDE detects a keyword in data from a single audio source while buffering data from multiple audio sources and, upon detection of a keyword, the HKDE is awakened to verify the previously detected keyword by processing the buffered audio data from the multiple sources.
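The two-stage arrangement can be sketched as below. The detector callables, the frame layout (one tuple of per-source samples per time step), and the buffer size are hypothetical; real engines would operate on audio features rather than single numbers.

```python
from collections import deque


class TwoStageKeywordDetector:
    """Cheap always-on detector gating an expensive verifier (sketch)."""

    def __init__(self, low_power_detect, high_power_verify, buffer_frames=50):
        self.low_power_detect = low_power_detect    # sees one source only
        self.high_power_verify = high_power_verify  # sees buffered frames
        self.buffer = deque(maxlen=buffer_frames)   # ring buffer, all sources

    def push(self, frame):
        """frame: tuple of samples, one per audio source, for one time step."""
        self.buffer.append(frame)
        if not self.low_power_detect(frame[0]):
            return False  # verifier stays asleep
        # Wake the verifier on the buffered multi-source audio.
        return self.high_power_verify(list(self.buffer))
```

The verifier is only invoked after the low-power stage fires, which is where the power saving would come from.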

SPEECH CHIP AND ELECTRONIC DEVICE
20200402514 · 2020-12-24

The present disclosure proposes a speech chip and an electronic device. The speech chip includes: a peripheral interface connected to a speech receiver and configured to receive a speech signal; a bus matrix connected to the peripheral interface; a first processor connected to the bus matrix and configured to determine, according to the speech signal, whether the speech signal contains a wake-up word; a second processor connected to the bus matrix and configured to perform signal denoising and speech recognition on the speech signal; and a memory array connected to the bus matrix.

ELECTRONIC APPARATUS FOR DYNAMIC NOTE MATCHING AND OPERATING METHOD OF THE SAME
20200394214 · 2020-12-17

Disclosed are an electronic apparatus for dynamic note matching (DNM) and an operating method thereof, the method including acquiring a first section sequence by reducing a first sequence extracted from an input signal based on at least one first section in which the respective values are successively arranged; acquiring a second section sequence reduced from a pre-stored second sequence based on at least one second section in which the respective values are successively arranged; and calculating a similarity between the first section sequence and the second section sequence.
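One way to read the reduction step is as collapsing each run of repeated values into a single entry before comparing. The sketch below takes that reading as an assumption and uses `difflib.SequenceMatcher` as a stand-in similarity measure, since the abstract does not say how the similarity is calculated; the note values are MIDI numbers chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import groupby


def to_section_sequence(sequence):
    """Collapse each run of successive equal values into one value."""
    return [value for value, _ in groupby(sequence)]


def similarity(first_sequence, second_sequence):
    """Similarity in [0, 1] between the two reduced sequences (stand-in)."""
    a = to_section_sequence(first_sequence)
    b = to_section_sequence(second_sequence)
    return SequenceMatcher(None, a, b).ratio()


# Same melody sung with different note durations.
print(to_section_sequence([60, 60, 60, 62, 62, 64]))
print(similarity([60, 60, 62, 64, 64], [60, 62, 62, 62, 64]))
```

Because durations are discarded by the reduction, two renditions of the same note order compare as identical regardless of tempo.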

Method of creating a demographic based personalized pronunciation dictionary
20200372110 · 2020-11-26

The present invention relates to a method of creating a demographic-based personalized pronunciation dictionary for a user, the method comprising: determining at least one item of demographic information of the user; receiving at least one voice input from the user in association with the at least one item of demographic information; determining at least one voice characteristic in association with the at least one voice input received from the user in association with the at least one item of demographic information; determining at least one item of non-demographic information; identifying at least one item of pronunciation information from a demographic-specific pronunciation dictionary located in a database in association with the at least one item of non-demographic information; determining, upon receiving at least one voice input from the user in association with the at least one item of non-demographic information, at least one voice characteristic in association with that voice input; creating a personalized pronunciation dictionary for the user; and storing the personalized pronunciation dictionary for the user in the database.