Patent classifications
G10L25/27
Viseme data generation for presentation while content is output
Systems and methods for viseme data generation are disclosed. Uncompressed audio data is generated and/or utilized to determine the beats per minute of the audio data. Visemes are associated with the audio data utilizing a Viterbi algorithm and the beats per minute. A time-stamped list of viseme data is generated that associates the visemes with the portions of the audio data that they correspond to. An animatronic toy and/or an animation is caused to lip sync using the viseme data while audio corresponding to the audio data is output.
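The core of the abstract above is a Viterbi alignment of viseme labels to audio frames, collapsed into a time-stamped list. The sketch below is illustrative only: the viseme inventory, the probability inputs, and the frame duration are all assumptions, not details from the patent.

```python
# Illustrative sketch: aligning visemes to audio frames with a Viterbi
# pass, then collapsing the per-frame path into a time-stamped list.
# Viseme labels, probabilities, and frame rate are hypothetical.
import numpy as np

VISEMES = ["sil", "AA", "EE", "OO", "MM"]

def viterbi_viseme_path(emission_log_probs, transition_log_probs):
    """emission_log_probs: (T, V) per-frame viseme scores;
    transition_log_probs: (V, V) viseme-to-viseme scores."""
    T, V = emission_log_probs.shape
    dp = np.full((T, V), -np.inf)
    back = np.zeros((T, V), dtype=int)
    dp[0] = emission_log_probs[0]
    for t in range(1, T):
        scores = dp[t - 1][:, None] + transition_log_probs  # (V, V)
        back[t] = scores.argmax(axis=0)
        dp[t] = scores.max(axis=0) + emission_log_probs[t]
    # Backtrace the best-scoring state sequence.
    path = [int(dp[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    path.reverse()
    return path

def timestamped_visemes(path, frame_dur_s=0.02):
    """Emit (start_time, viseme) entries whenever the viseme changes."""
    out = []
    for t, v in enumerate(path):
        if not out or out[-1][1] != VISEMES[v]:
            out.append((round(t * frame_dur_s, 3), VISEMES[v]))
    return out
```

The time-stamped list is what an animatronic controller or animation engine would consume while the corresponding audio plays; how the beats-per-minute estimate feeds the transition scores is not specified in the abstract, so it is omitted here.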
OPTIMIZATION OF NETWORK MICROPHONE DEVICES USING NOISE CLASSIFICATION
Systems and methods for optimizing network microphone devices using noise classification are disclosed herein. In one example, individual microphones of a network microphone device (NMD) detect sound. The sound data is analyzed to detect a trigger event such as a wake word. Metadata associated with the sound data is captured in a lookback buffer of the NMD. After detecting the trigger event, the metadata is analyzed to classify noise in the sound data. Based on the classified noise, at least one performance parameter of the NMD is modified.
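The flow above — buffer metadata, wait for a trigger, classify the noise, then adjust a performance parameter — can be sketched as follows. The class names, the spectral-flatness rule, and the gain table are assumptions for illustration, not the patent's actual classifier.

```python
# Hypothetical sketch of the noise-classification flow: metadata held in
# a lookback buffer is classified after a wake-word trigger, and a
# performance parameter (here, a microphone gain offset) is adjusted.
from collections import deque

# Assumed gain offsets per noise class, in dB.
NOISE_GAIN_DB = {"quiet": 0.0, "fan": 3.0, "traffic": 6.0}

class LookbackBuffer:
    def __init__(self, maxlen=50):
        self.frames = deque(maxlen=maxlen)  # per-frame metadata dicts

    def push(self, meta):
        self.frames.append(meta)

def classify_noise(buffer):
    """Toy classifier: average spectral flatness decides the class."""
    if not buffer.frames:
        return "quiet"
    flatness = sum(f["spectral_flatness"] for f in buffer.frames) / len(buffer.frames)
    if flatness > 0.7:
        return "fan"      # broadband, steady noise
    if flatness > 0.4:
        return "traffic"  # mixed, less stationary noise
    return "quiet"

def adjust_gain_after_trigger(buffer, base_gain_db=0.0):
    """Run after the trigger event: classify, then modify the parameter."""
    noise_class = classify_noise(buffer)
    return base_gain_db + NOISE_GAIN_DB[noise_class]
```

In a real NMD the modified parameter could equally be a wake-word detection threshold or a beamforming setting; gain is used here only to keep the sketch concrete.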
AUTOMATIC GAIN CONTROL BASED ON MACHINE LEARNING LEVEL ESTIMATION OF THE DESIRED SIGNAL
A method includes receiving, at a server device, audio data from a plurality of input devices. The audio data of each input device corresponds to a time-related portion of the audio data. The method determines a speech energy level for each input device by providing the time-related audio portion as input to a trained model. For each input device, a statistical value associated with the speech energy level is determined. A strongest input device is identified based on the statistical value. The statistical value associated with the speech energy level of each input device other than the strongest input device is compared to the statistical value of the strongest input device. Depending on the comparison, the method determines whether to update the gain value of an input device to an estimated target gain value based on the statistical value of the speech energy level of the respective input device.
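The comparison-and-update logic above can be sketched as below. The choice of statistic (median), the comparison margin, and the target level are assumptions; the trained level-estimation model is abstracted as a callable.

```python
# Sketch of the gain-update decision: estimate per-device speech energy
# with a trained model, find the strongest device, and update only the
# devices whose statistic is close enough to the strongest one.
import statistics

def speech_energy_levels(device_audio, model):
    """model(chunk) -> estimated speech energy (dB) for that chunk."""
    return {dev: [model(c) for c in chunks]
            for dev, chunks in device_audio.items()}

def update_gains(levels, gains, target_db=-20.0, margin_db=6.0):
    """levels: {device: [energy_db, ...]}; gains: {device: gain_db}."""
    stats = {dev: statistics.median(v) for dev, v in levels.items()}
    strongest = max(stats, key=stats.get)
    new_gains = dict(gains)
    for dev, stat in stats.items():
        if dev == strongest:
            continue
        # Devices far below the strongest are likely picking up noise
        # rather than the desired talker; leave their gain untouched.
        if stats[strongest] - stat <= margin_db:
            new_gains[dev] = target_db - stat
    return new_gains
```

Using the median rather than the mean makes the statistic robust to short noise bursts, which matches the abstract's emphasis on estimating the desired signal rather than the raw level.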
SOUND PROCESSING METHOD USING DJ TRANSFORM
Provided is a sound processing method performed by a computer, the method comprising generating a DJ transform spectrogram indicating estimated pure-tone amplitudes for respective frequencies corresponding to natural frequencies of a plurality of springs and a plurality of time points by modeling an oscillation motion of the plurality of springs having different natural frequencies, with respect to an input sound, and calculating the estimated pure-tone amplitudes for the respective natural frequencies; calculating degrees of fundamental frequency suitability based on a moving average of the estimated pure-tone amplitudes or a moving standard deviation of the estimated pure-tone amplitudes with respect to each natural frequency of the DJ transform spectrogram; and extracting the fundamental frequency based on local maximum values of the degrees of fundamental frequency suitability for the respective natural frequencies at each of the plurality of time points.
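The resonator-bank idea behind this abstract can be approximated as follows: each natural frequency is modeled as a driven damped spring, the per-spring oscillation amplitudes over time form a spectrogram-like map, and a suitability score over frequency selects the fundamental at a local maximum. The integration scheme, damping value, and the exact suitability rule (a moving standard deviation over the last window, with the largest local maximum taken) are assumptions, not the patent's method.

```python
# Rough sketch of a spring-bank "transform": drive one damped spring per
# natural frequency with the input signal, then pick the fundamental from
# local maxima of a moving-std suitability score. Constants are assumed.
import numpy as np

def spring_spectrogram(x, sr, freqs, damping=0.01):
    """Semi-implicit Euler simulation of one driven spring per natural
    frequency; returns |displacement| per (frequency, sample)."""
    spec = np.zeros((len(freqs), len(x)))
    dt = 1.0 / sr
    for i, f in enumerate(freqs):
        w = 2 * np.pi * f
        pos, vel = 0.0, 0.0
        for n, xn in enumerate(x):
            acc = xn - 2 * damping * w * vel - w * w * pos
            vel += acc * dt
            pos += vel * dt
            spec[i, n] = abs(pos)
    return spec

def fundamental_by_suitability(spec, freqs, win=100):
    """Moving-std suitability per frequency over the last window, then
    the largest interior local maximum across frequencies."""
    suit = spec[:, -win:].std(axis=1)
    peaks = [i for i in range(1, len(freqs) - 1)
             if suit[i] > suit[i - 1] and suit[i] > suit[i + 1]]
    return freqs[max(peaks, key=lambda i: suit[i])] if peaks else None
```

The spring tuned to the driving frequency resonates, so its displacement envelope, and hence its moving standard deviation, dominates the suitability curve near the true pitch.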
UNDERSTANDING AND RANKING RECORDED CONVERSATIONS BY CLARITY OF AUDIO
Systems and methods are provided for generating quality scores associated with a contact (e.g., a telephonic call including an agent) and with agents. In particular, the disclosed technology classifies frames of the contact's content as speech and/or noise, with noise further classified into standard noise and non-standard noise. A frame type determiner determines the type of a frame based on waveform analysis and/or speech and noise models trained through machine learning. Standard noise is noise that is expected and consistent across contacts and agents (e.g., hold music). Non-standard noise is noise that is unexpected in occasion and audio source (e.g., a barking dog, a siren from the street, and the like). The disclosed technology enables assessing contacts and agents based on issues associated with remote working environments that vary among agents.
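The scoring idea — penalize only the unexpected noise — can be sketched as below. The frame labeling rule, the clarity formula, and the standard-noise set are assumptions for illustration; the patent's frame type determiner would use waveform analysis or trained models instead.

```python
# Hedged sketch: label each frame speech, standard noise, or
# non-standard noise, then score clarity so that only non-standard
# noise (barking dogs, sirens) lowers a contact's score.
from collections import Counter

STANDARD_NOISE = {"hold_music", "dial_tone"}  # assumed examples

def label_frame(frame):
    """frame: dict with a 'kind' field from an upstream classifier."""
    if frame["kind"] == "speech":
        return "speech"
    return "standard" if frame["kind"] in STANDARD_NOISE else "non_standard"

def clarity_score(frames):
    counts = Counter(label_frame(f) for f in frames)
    scored = counts["speech"] + counts["non_standard"]
    if scored == 0:
        return 1.0  # nothing to penalize
    return counts["speech"] / scored

def rank_agents(contacts_by_agent):
    """Average clarity over each agent's contacts, best first."""
    scores = {a: sum(clarity_score(c) for c in cs) / len(cs)
              for a, cs in contacts_by_agent.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Excluding standard noise from the denominator is the point of the standard/non-standard split: hold music should not make an agent's remote setup look noisy.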
Information processing apparatus, information processing system, and information processing method
Provided is an apparatus that includes a voice recognition section that executes a voice recognition process on a user speech and a learning processing section that executes a process of updating a degree of confidence on the basis of an interaction made between a user and the information processing apparatus after the user speech. The degree of confidence is an evaluation value indicating the reliability of a voice recognition result of the user speech. The voice recognition section generates confidence data for the recognition of the user speech, in which plural user speech candidates based on the voice recognition result are each associated with a degree of confidence, an evaluation value indicating the reliability of the corresponding candidate.
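The update loop above — recognition candidates carry confidence values, and the subsequent interaction nudges them — can be sketched as below. The update rule and learning rate are assumptions; the recognizer is abstracted as a callable returning (candidate, confidence) pairs.

```python
# Minimal sketch of the confidence-update process: the recognizer emits
# candidate transcriptions with confidence values, and the post-speech
# interaction (the user confirming or correcting) adjusts them.

def recognize(utterance_audio, recognizer):
    """recognizer(audio) -> list of (candidate_text, confidence)."""
    return recognizer(utterance_audio)

def update_confidences(candidates, accepted_text, lr=0.2):
    """Raise the confidence of the candidate the interaction confirmed
    and lower the rest, keeping all values in [0, 1]."""
    updated = []
    for text, conf in candidates:
        if text == accepted_text:
            conf = conf + lr * (1.0 - conf)   # move toward 1
        else:
            conf = conf * (1.0 - lr)          # decay toward 0
        updated.append((text, round(conf, 4)))
    return updated
```

Over repeated interactions this drifts the evaluation values toward how often each candidate interpretation actually matches the user's intent, which is the learning behavior the abstract describes.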