
ARRAY GEOMETRY AGNOSTIC MULTI-CHANNEL PERSONALIZED SPEECH ENHANCEMENT

Examples of array geometry agnostic multi-channel personalized speech enhancement (PSE) extract speaker embeddings, which represent acoustic characteristics of one or more target speakers, from target speaker enrollment data. Spatial features (e.g., inter-channel phase difference) are extracted from input audio captured by a microphone array. The input audio includes a mixture of speech data of the target speaker(s) and one or more interfering speaker(s). The input audio, the extracted speaker embeddings, and the extracted spatial features are provided to a trained geometry-agnostic PSE model. Output data is produced, which comprises estimated clean speech data of the target speaker(s) that has a reduction (or elimination) of speech data of the interfering speaker(s), without the trained PSE model requiring geometry information for the microphone array.
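As one concrete illustration of the spatial features the abstract mentions, the sketch below computes cosine/sine inter-channel phase difference (IPD) features from a multi-channel waveform. The function name, framing parameters, and feature layout are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def ipd_features(multichannel_audio, n_fft=512, hop=256, ref_channel=0):
    """Compute cos/sin inter-channel phase differences against a reference channel.

    multichannel_audio: array of shape (channels, samples)
    Returns features of shape (channels - 1, frames, n_fft // 2 + 1, 2).
    """
    channels, samples = multichannel_audio.shape
    window = np.hanning(n_fft)
    frames = 1 + (samples - n_fft) // hop
    # Per-channel STFT: frames x frequency bins
    specs = np.empty((channels, frames, n_fft // 2 + 1), dtype=complex)
    for c in range(channels):
        for t in range(frames):
            seg = multichannel_audio[c, t * hop : t * hop + n_fft] * window
            specs[c, t] = np.fft.rfft(seg)
    phase = np.angle(specs)
    feats = []
    for c in range(channels):
        if c == ref_channel:
            continue
        ipd = phase[c] - phase[ref_channel]
        # cos/sin encoding avoids phase-wrapping discontinuities
        feats.append(np.stack([np.cos(ipd), np.sin(ipd)], axis=-1))
    return np.stack(feats)
```

Because the IPD depends only on relative phase between channels, features of this form carry spatial cues without the model needing the array geometry itself.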

Artificial intelligence apparatus for performing voice control using voice extraction filter and method for the same
11468886 · 2022-10-11

According to an embodiment of the present invention, an artificial intelligence (AI) apparatus for performing voice control includes a memory configured to store a voice extraction filter for extracting the voice of a registered user, and a processor configured to receive identification information of a user and a first voice signal of the user, to register the user using the received identification information, to extract, when a second voice signal is received, the voice of the registered user from the second voice signal by using the voice extraction filter corresponding to the registered user, and to perform a control operation corresponding to intention information of the extracted voice of the registered user. The voice extraction filter is generated using the received first voice signal of the registered user.

SPEECH SIGNAL PROCESSING DEVICE, SPEECH SIGNAL PROCESSING METHOD, SPEECH SIGNAL PROCESSING PROGRAM, TRAINING DEVICE, TRAINING METHOD, AND TRAINING PROGRAM

An audio signal processing apparatus (10) includes a first auxiliary feature conversion unit (12) and a second auxiliary feature conversion unit (13), which convert a plurality of signals relating to processing of an audio signal of a target speaker into a plurality of auxiliary features using a plurality of auxiliary neural networks corresponding to those signals, and an audio signal processing unit (11), which estimates information regarding the audio signal of the target speaker included in a mixed audio signal using a main neural network, based on an input feature of the mixed audio signal and the plurality of auxiliary features. The plurality of signals relating to processing of the audio signal of the target speaker are two or more pieces of information of different modalities.

APPARATUS FOR OUTPUTTING AN AUDIO SIGNAL IN A VEHICLE CABIN
20220319531 · 2022-10-06

Apparatus (2) for outputting an audio signal (3) in a vehicle cabin (4), the apparatus (2) comprising: at least one audio outputting device (6) configured to output an audio signal (3) in the vehicle cabin (4), particularly an audio signal (3) comprising at least one audio signal component containing a human voice, particularly a singer's voice; at least one audio receiving device (10) configured to receive a human voice signal (9) of at least one person (P) located in the vehicle cabin (4) while the at least one audio outputting device (6) outputs the audio signal (3); and at least one processing device (11) configured to combine the audio signal (3) and the received human voice signal (9) so as to generate a combined audio signal containing both, which combined audio signal is output (or outputtable) in the vehicle cabin (4) via the at least one audio outputting device (6).

Robust short-time fourier transform acoustic echo cancellation during audio playback
11646045 · 2023-05-09

Example techniques involve noise-robust acoustic echo cancellation. An example implementation may involve causing one or more speakers of a playback device to play back audio content and, while the audio content is playing back, capturing, via one or more microphones, audio within an acoustic environment that includes the audio playback. The implementation may involve determining measured and reference signals in the short-time Fourier transform (STFT) domain. During each nth iteration of an acoustic echo canceller (AEC), the implementation may involve determining a frame of an output signal by generating a frame of a model signal (passing a frame of the reference signal through an instance of an adaptive filter) and then subtracting the nth frame of the model signal from the nth frame of the measured signal. The implementation may further involve determining an instance of the adaptive filter for the next iteration of the AEC.
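The per-iteration step described above can be sketched as a frequency-domain adaptive filter. The NLMS-style rule used here to determine the filter for the next iteration is an assumption, as the abstract does not specify the adaptation method; all names are illustrative.

```python
import numpy as np

def aec_iteration(measured_frame, ref_frames, filt, mu=0.5, eps=1e-8):
    """One iteration of an STFT-domain adaptive echo canceller.

    measured_frame: STFT frame of the microphone signal, shape (bins,)
    ref_frames: last L STFT frames of the reference signal, shape (L, bins)
    filt: adaptive filter taps per frequency bin, shape (L, bins), complex
    Returns (output_frame, updated_filter).
    """
    # Model signal: the reference passed through the adaptive filter
    model = np.sum(np.conj(filt) * ref_frames, axis=0)
    # Output frame: model signal subtracted from the measured signal
    output = measured_frame - model
    # NLMS update producing the filter instance for the next iteration
    norm = np.sum(np.abs(ref_frames) ** 2, axis=0) + eps
    filt = filt + mu * ref_frames * np.conj(output) / norm
    return output, filt
```

Iterating this on a stationary echo path drives the residual in the output toward zero, which is the cancellation behaviour the abstract describes.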

SPATIAL OPTIMIZATION FOR AUDIO PACKET TRANSFER IN A METAVERSE
20230145605 · 2023-05-11

A computer-implemented method includes receiving audio packets associated with a first client device, where the audio packets each include an audio capture waveform, a timestamp, and a digital entity identification (ID). The method further includes determining, based on the digital entity ID, a position of a first digital entity in a metaverse. The method further includes determining a subset of other digital entities in the metaverse that are within an audio area of the first digital entity based on (a) a falloff distance between the first digital entity and each of the other digital entities and (b) a direction of audio propagation between the first digital entity and each of the other digital entities. The method further includes transmitting the audio packets to second client devices associated with the subset of other digital entities in the metaverse.
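A minimal sketch of the two admission checks, modelling the direction of audio propagation as a cone around a propagation vector in 2-D; the cone model and the default thresholds are assumptions not stated in the abstract.

```python
import math

def within_audio_area(source_pos, source_dir, other_pos,
                      falloff_distance=30.0, half_angle_deg=90.0):
    """Decide whether another digital entity should receive the audio packets.

    Applies the two checks from the method: (a) the other entity lies
    within the falloff distance of the source entity, and (b) it lies
    within the direction of audio propagation, modelled as a cone.
    """
    dx = other_pos[0] - source_pos[0]
    dy = other_pos[1] - source_pos[1]
    dist = math.hypot(dx, dy)
    if dist > falloff_distance:
        return False  # check (a) fails: beyond the falloff distance
    if dist == 0.0:
        return True   # co-located entities always hear each other
    # Check (b): angle between the propagation direction and the
    # vector from the source entity to the other entity
    dot = (dx * source_dir[0] + dy * source_dir[1]) / (
        dist * math.hypot(*source_dir))
    angle = math.degrees(math.acos(max(-1.0, min(1.0, dot))))
    return angle <= half_angle_deg
```

Packets would then be sent only to client devices whose entities pass both checks, reducing bandwidth relative to broadcasting to every entity.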

AUDIO SYSTEM, AUDIO DEVICE, AND METHOD FOR SPEAKER EXTRACTION
20230206941 · 2023-06-29

A method for speech extraction in an audio device is disclosed. The method comprises obtaining a microphone input signal from one or more microphones including a first microphone. The method comprises applying an extraction model, which is a machine-learning model, to the microphone input signal for provision of an output. The method comprises extracting a near-speaker component of the microphone input signal according to the output of the extraction model for provision of a speaker output. The method comprises outputting the speaker output.

SELECTIVE AMPLIFICATION OF AN ACOUSTIC SIGNAL

The present subject matter relates to systems and methods for selectively amplifying an acoustic signal in a closed environment. In an implementation, a plurality of acoustic signals may be received from within the closed environment. Frequency ranges corresponding to each acoustic signal may be obtained and compared to determine the presence of at least one individual in the closed environment. Acoustic signals pertaining to the at least one individual may be analysed to detect occurrence of a physiological event. Based on the analysis, the acoustic signal pertaining to that individual may be recognized as a target signal, and the target signal may be amplified in the closed environment. Further, an interfering signal may be generated to cancel other acoustic signals within the closed environment.

SYSTEMS AND METHODS FOR FILTERING UNWANTED SOUNDS FROM A CONFERENCE CALL USING VOICE SYNTHESIS
20220383888 · 2022-12-01

To filter unwanted sounds from a conference call, a first voice signal is captured by a first device during a conference call and converted into corresponding text, which is then analyzed to determine that a first portion of the text was spoken by a first user and a second portion of the text was spoken by a second user. If the first user is relevant to the conference call while the second user is not, the first voice signal is prevented from being transmitted into the conference call, the first portion of text is converted into a second voice signal using a voice profile of the first user to synthesize the voice of the first user, and the second voice signal is then transmitted into the conference call. The second portion of text is not converted into a voice signal, as the second user is determined not to be relevant.
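The attribution-based filtering step can be sketched as below, assuming an upstream diarized speech-to-text stage has already labelled each text portion with its speaker; only the kept portions would then be re-synthesized in the relevant speaker's voice and transmitted. The function and data layout are illustrative assumptions.

```python
def filter_transcript(segments, relevant_users):
    """Keep only text portions attributed to relevant participants.

    segments: list of (speaker_id, text) pairs from diarized speech-to-text.
    Portions from non-relevant speakers are dropped and never reach the
    voice-synthesis stage, so they are never transmitted into the call.
    """
    return [(spk, txt) for spk, txt in segments if spk in relevant_users]
```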

VOICE PROCESSING DEVICE
20170352349 · 2017-12-07

A voice processing device includes plural microphones 22 disposed in a vehicle; a voice source direction determination portion 16 that determines the direction of a voice source by handling the sound reception signal obtained by each of the plural microphones as a spherical wave when the voice source (the source of a voice included in the sound reception signal) is located in the near field, and as a plane wave when the voice source is located in the far field; and a beamforming processing portion 12 that performs beamforming so as to suppress sound arriving from any direction range other than the direction range including the direction of the voice source.
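The near-field/far-field distinction above amounts to two different delay models for the same array, which a delay-and-sum view makes concrete: a plane wave yields delays from projecting microphone positions onto the arrival direction, while a spherical wave yields delays from exact microphone-to-source distances. Positions in metres, 2-D coordinates, and all names below are illustrative assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, at roughly room temperature

def steering_delays(mic_positions, source, far_field):
    """Per-microphone delays for delay-and-sum beamforming.

    far_field=True treats the wavefront as a plane wave arriving from
    the direction of `source`; far_field=False treats it as a spherical
    wave emanating from the point `source`.
    """
    mics = np.asarray(mic_positions, dtype=float)
    src = np.asarray(source, dtype=float)
    if far_field:
        # Plane wave: delay is the projection onto the unit direction
        direction = src / np.linalg.norm(src)
        delays = mics @ direction / SPEED_OF_SOUND
    else:
        # Spherical wave: delay follows the exact mic-to-source distance
        delays = np.linalg.norm(mics - src, axis=1) / SPEED_OF_SOUND
    return delays - delays.min()  # relative delays, earliest mic = 0
```

Aligning the channels by these delays before summing steers the array toward the voice source and suppresses sound from other direction ranges, as the beamforming processing portion does.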