Patent classifications
G10L17/00
Automated meeting minutes generation service
Attributes of electronic content from a meeting are identified and evaluated to determine whether sub-portions of the electronic content should or should not be attributed to a user profile. Upon determining that the sub-portion should be attributed to a user profile, attributes of the sub-portion of electronic content are compared to attributes of stored user profiles. A probability that the sub-portion corresponds to at least one stored user profile is calculated. Based on the calculated probability, the sub-portion is attributed to a stored user profile or a guest user profile.
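The comparison step in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how attributes are represented or how the probability is computed, so the sketch treats attributes as numeric vectors and uses a clipped cosine similarity as a stand-in for the probability. The names `attribute_segment`, `profiles`, and `threshold` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute_segment(segment_vec, profiles, threshold=0.8):
    """Return (profile_name, score) for a sub-portion of content.

    Assumption: `profiles` maps a stored profile name to an attribute
    vector, and the clipped cosine similarity stands in for the
    probability mentioned in the abstract. Scores below `threshold`
    fall back to a guest profile.
    """
    best_name, best_score = "guest", 0.0
    for name, vec in profiles.items():
        score = max(0.0, cosine(segment_vec, vec))
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return "guest", best_score
    return best_name, best_score
```

With two stored profiles, a segment close to one of them is attributed to that profile, while an ambiguous segment is attributed to the guest profile.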
Method and apparatus with registration for speaker recognition
Disclosed is a method and apparatus with registration for speaker recognition. The method includes determining whether an input feature vector corresponding to a voice signal of a speaker meets a candidate similarity criterion with at least one registered data included in a registration database, selectively, based on a result of the determining of whether the input feature vector meets the candidate similarity criterion, constructing a candidate list based on the input feature vector, determining whether a candidate input feature vector, among one or more candidate input feature vectors constructed in the candidate list in the selective constructing of the candidate list, meets a registration update similarity criterion with the at least one registered data, and selectively, based on a result of the determination of whether the candidate input feature vector meets the registration update similarity criterion, updating the registration database based on the candidate input feature vector.
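The two-stage flow in this abstract (a candidate criterion followed by a registration update criterion) can be sketched as below. The abstract does not specify the similarity measure or the thresholds, so cosine similarity and the values `candidate_thr` and `update_thr` are assumptions, and the `Registry` class and its method names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class Registry:
    """Toy registration database with a two-stage update (assumed design)."""

    def __init__(self, registered, candidate_thr=0.6, update_thr=0.85):
        self.registered = [list(v) for v in registered]
        self.candidates = []
        self.candidate_thr = candidate_thr
        self.update_thr = update_thr

    def _best_similarity(self, vec):
        # Highest similarity to any registered vector; 0.0 if none exist.
        return max((cosine(vec, r) for r in self.registered), default=0.0)

    def observe(self, vec):
        """Stage 1: keep the input as a candidate if it is similar enough."""
        if self._best_similarity(vec) >= self.candidate_thr:
            self.candidates.append(list(vec))
            return True
        return False

    def update(self):
        """Stage 2: promote candidates that meet the stricter criterion."""
        promoted = [c for c in self.candidates
                    if self._best_similarity(c) >= self.update_thr]
        self.registered.extend(promoted)
        self.candidates = []
        return len(promoted)
```

Using a looser threshold for the candidate list and a stricter one for the update is one plausible reading of the two criteria; the abstract itself does not state how they relate.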
Processing audio and video
A wearable device may include an image sensor configured to capture a plurality of images from an environment, a microphone configured to capture sounds from the environment, and at least one processor. The at least one processor may be programmed to receive audio signals representative of the sounds captured by the microphone, and receive a first image including a representation of a first individual from among the plurality of images captured by the image sensor. The at least one processor may also be programmed to obtain a first audio segment from the audio signals using the first image. The first audio segment may include a first portion of the audio signals in which the first individual is speaking. The at least one processor may also be programmed to receive a second image including a representation of a second individual from among the plurality of images captured by the image sensor, and obtain a second audio segment from the audio signals using the second image. The second audio segment may include a second portion of the audio signals in which the second individual is speaking. The at least one processor may also be programmed to receive a third image including a representation of the first individual from among the plurality of images captured by the image sensor, and using the third image, obtain a third audio segment from the audio signals. The third audio segment may include a third portion of the audio signals in which the first individual is speaking. The at least one processor may also associate the first and third audio segments with the first individual and associate the second audio segment with the second individual.
Multilingual speech translation with adaptive speech synthesis and adaptive physiognomy
Techniques for the generation of dubbed audio for an audio/visual file are described. An exemplary approach is to receive a request to generate dubbed speech for an audio/visual file; and, in response to the request: extract speech segments from an audio track of the audio/visual file associated with identified speakers; translate the extracted speech segments into a target language; determine a machine learning model per identified speaker, the trained machine learning models to be used to generate a spoken version of the translated, extracted speech segments based on the identified speaker; generate, per translated, extracted speech segment, a spoken version of the translated, extracted speech segment using a trained machine learning model that corresponds to the identified speaker of the translated, extracted speech segment and prosody information for the extracted speech segments; and replace the extracted speech segments from the audio track of the audio/visual file with the spoken versions of the translated, extracted speech segments to generate a modified audio track.
MICROPHONE UNIT
A microphone unit includes: an audio data acquisition unit that acquires speech as audio data; an audio data registration unit that registers verification audio data obtained by extracting a feature point from the audio data; an evaluation audio data acquisition unit that acquires speech that is input to a first microphone as evaluation audio data; a verification unit that verifies whether or not a speaker who uttered speech that is based on the evaluation audio data is a speaker who uttered speech that is based on the verification audio data, based on the verification audio data and a feature point extracted from the evaluation audio data; and a verification result output unit that outputs a result of verification performed by the verification unit.
SYSTEM AND METHOD FOR ESTABLISHING DENTAL TREATMENT ENVIRONMENT
A system for establishing a dental treatment environment includes: a head-mounted device provided at a dental clinic to be mounted on a patient's head, the head-mounted device having an image display unit and an ear-mounted speaker; a microphone for converting a sound including the voice of the medical staff in charge of the patient into an electric signal; a voice recognition module for recognizing the voice of the medical staff in charge from the electric signal input from the microphone; a content module storing multiple image contents for relaxing the patient mentally and physically; a user interface having a content selection unit configured such that the patient can select a play content provided to the image display unit from the multiple image contents; and an output signal generating module for generating an output signal that is output to the head-mounted device.
VIDEO PROCESSING METHOD AND ASSOCIATED SYSTEM ON CHIP
The present invention provides a SoC including a person recognition circuit, a sound detection circuit and a processing circuit. The person recognition circuit is configured to obtain image data from an image capturing device, and perform a person recognition operation on the image data to generate a recognition result. The sound detection circuit is configured to receive a plurality of sound signals from a plurality of microphones, and determine a sound characteristic value of a main sound. The processing circuit is configured to determine a specific region in the image data according to the recognition result and the sound characteristic value of the main sound, and process the image data to highlight the specific region.
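One way to picture the region selection described here is the sketch below. It rests on assumptions the abstract does not confirm: that the recognition result is a list of person bounding boxes, that the sound characteristic value is an azimuth angle of the main sound, and that azimuth maps linearly to a pixel column across the camera's horizontal field of view. The function name and parameters are hypothetical.

```python
def pick_speaker_region(boxes, azimuth_deg, image_width, fov_deg=90.0):
    """Choose the person box that best matches the main sound direction.

    Assumptions (not from the abstract):
      - boxes is a list of (x, y, w, h) tuples in pixels,
      - azimuth_deg is 0 at the image center and negative to the left,
      - azimuth maps linearly to a column across the field of view.
    Returns the selected box, or None when no person was recognized.
    """
    if not boxes:
        return None
    # Pixel column where the sound source is expected to appear.
    expected_x = image_width / 2 + (azimuth_deg / fov_deg) * image_width
    # Pick the box whose horizontal center is closest to that column.
    return min(boxes, key=lambda b: abs((b[0] + b[2] / 2) - expected_x))
```

The returned box would then be the region to highlight, for example by cropping or zooming, which is outside the scope of this sketch.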
EARLY INVOCATION FOR CONTEXTUAL DATA PROCESSING
A speech processing system uses contextual data to determine the specific domains, subdomains, and applications appropriate for taking action in response to spoken commands and other utterances. The system can use signals and other contextual data associated with an utterance, such as location signals, content catalog data, data regarding historical usage patterns, data regarding content visually presented on a display screen of a computing device when an utterance was made, other data, or some combination thereof.
MULTI-TIER SPEECH PROCESSING AND CONTENT OPERATIONS
A multi-tier architecture is provided for processing user voice queries and making routing decisions for generating responses, including responses to book browsing requests and other content requests. When an utterance is associated with multiple applications in a given domain, the applications may be organized into a subdomain and a tier of routing decisions may be added to the inter-domain and intra-domain routing decision system. The system uses contextual signals to make subdomain routing decisions, including signals regarding content items that are already in a user's content catalog, consumption status of individual content items in the user's catalog, and the like.
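The tiered routing idea can be illustrated with a small sketch for a book request. Everything concrete here is an assumption made for illustration: the catalog schema, the subdomain names (`library`, `shopping`), the application names, and the resume/start action are hypothetical and do not come from the abstract.

```python
def route_book_request(title, catalog):
    """Return (subdomain, application, action) for a book request.

    Assumption: `catalog` maps a title to a dict with a "format" key
    ("audiobook" or "ebook") and a "status" key ("unstarted",
    "in_progress", or "finished"). This schema is hypothetical.
    """
    item = catalog.get(title)

    # Subdomain tier: catalog membership is the contextual signal.
    if item is None:
        return ("shopping", "store_browser", "browse")

    # Application tier inside the "library" subdomain.
    if item["format"] == "audiobook":
        app = "audiobook_player"
    else:
        app = "ebook_reader"

    # Consumption status decides whether to resume or start over.
    action = "resume" if item["status"] == "in_progress" else "start"
    return ("library", app, action)
```

For example, a title the user is partway through as an audiobook would route to the player with a resume action, while a title missing from the catalog would route to the shopping subdomain.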