G10L2015/0631

METHOD AND APPARATUS FOR PROCESSING SIGNAL, COMPUTER READABLE MEDIUM
20220358951 · 2022-11-10 ·

A method and apparatus for processing a signal. An implementation of the method includes: acquiring a reference signal of a to-be-tested voice, the reference signal being a signal output to a voice output device, where the voice output device outputs the to-be-tested voice after obtaining the reference signal; receiving, from a voice input device, an echo signal of the to-be-tested voice, the echo signal being a signal of the to-be-tested voice collected by the voice input device; performing signal preprocessing on the reference signal and the echo signal respectively; and inputting the processed reference signal and the processed echo signal into a pre-trained time delay estimation model, to obtain, as the output of the time delay estimation model, a time difference between the reference signal and the echo signal.
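
The quantity being estimated can be illustrated with a classical cross-correlation baseline. This is not the pre-trained time delay estimation model of the abstract, only a minimal sketch of the same input/output relationship. The function names, the mean-removal and peak-scaling preprocessing, and the 16 kHz sample rate are assumptions made for illustration.

```python
SAMPLE_RATE = 16000  # assumed sample rate in Hz, not stated in the abstract


def preprocess(signal):
    """Assumed preprocessing: remove the mean and scale to unit peak."""
    mean = sum(signal) / len(signal)
    centered = [x - mean for x in signal]
    peak = max(abs(x) for x in centered) or 1.0
    return [x / peak for x in centered]


def estimate_delay(reference, echo, max_lag):
    """Return the lag in samples (0..max_lag) that maximizes the
    cross-correlation between reference[n] and echo[n + lag]."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        score = sum(r * e for r, e in zip(reference, echo[lag:]))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Toy signals: the echo is the reference attenuated and delayed by 3 samples.
reference = [0.0, 1.0, -0.5, 0.25, 0.0, 0.0, 0.0, 0.0]
echo = [0.0, 0.0, 0.0] + [0.6 * x for x in reference[:5]]

lag = estimate_delay(preprocess(reference), preprocess(echo), max_lag=5)
delay_seconds = lag / SAMPLE_RATE
```

A learned model would replace `estimate_delay`, which is a plain correlation search and degrades under noise and reverberation.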

Multi-turn dialogue response generation with autoregressive transformer models

Machine classifiers in accordance with embodiments of the invention capture long-term temporal dependencies in the dialogue data better than the existing RNN-based architectures. Additionally, machine classifiers may model the joint distribution of the context and response as opposed to the conditional distribution of the response given the context as employed in sequence-to-sequence frameworks. Machine classifiers in accordance with embodiments further append random paddings before and/or after the input data to reduce the syntactic redundancy in the input data, thereby improving the performance of the machine classifiers for a variety of dialogue-related tasks. The random padding of the input data may further provide regularization during the training of the machine classifier and/or reduce exposure bias. In a variety of embodiments, the input data may be encoded based on subword tokenization.
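
The random padding step might look like the following sketch. The `<pad>` token, the `max_pad` limit, and the uniform choice of padding lengths are assumptions; the abstract does not specify how the padding amounts are drawn.

```python
import random

PAD = "<pad>"  # hypothetical padding token


def random_pad(tokens, max_pad, rng):
    """Add a random number of PAD tokens (0..max_pad) before and after
    the token sequence."""
    before = rng.randint(0, max_pad)
    after = rng.randint(0, max_pad)
    return [PAD] * before + list(tokens) + [PAD] * after


rng = random.Random(0)  # seeded so runs are repeatable
padded = random_pad(["hello", "there"], max_pad=3, rng=rng)
```

Because the padding lengths change on every call, the same dialogue appears at different offsets across training passes.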

ELECTRONIC DEVICE AND OPERATION METHOD
20230030738 · 2023-02-02 ·

An electronic device may include a user interface, a processor operatively connected to the user interface, and a memory operatively connected to the processor. The memory may store instructions that, when executed, may cause the processor to: identify, in response to failing to detect a hotword in a first user input received using the user interface, a modified hotword included in the first user input; monitor a second user input received during a specified time using the user interface; identify an existing hotword corresponding to the modified hotword using the second user input; provide, through the user interface, response data indicating whether to update the existing hotword using the modified hotword; and update a hotword model based on a user input to the response data. Various other embodiments identified through the disclosure are also possible.

INFORMATION PROCESSING APPARATUS, INFORMATION PROCESSING METHOD, AND INFORMATION PROCESSING PROGRAM

In a conventional method of detecting a user position from footsteps of a user collected by microphones installed in a user's home, the microphones must be precisely aligned in advance, down to the level of coordinates. This is inconvenient for both the system and the user. Provided is an information processing apparatus including an acquisition unit that acquires sound data recorded by a plurality of microphones installed in arbitrary places, together with the positions, relative to the microphones, of footsteps included in the sound data, and a learning unit that generates a learning model by learning training data including the sound data as input and the relative positions as correct answers. As a result, precise positioning of the microphones in the user's home becomes unnecessary, and the user position can be detected in a more convenient manner for both the system and the user.

Speaker-Turn-Based Online Speaker Diarization with Constrained Spectral Clustering

A method includes receiving an input audio signal that corresponds to utterances spoken by multiple speakers. The method also includes processing the input audio signal to generate a transcription of the utterances and a sequence of speaker turn tokens each indicating a location of a respective speaker turn. The method also includes segmenting the input audio signal into a plurality of speaker segments based on the sequence of speaker turn tokens. The method also includes extracting a speaker-discriminative embedding from each speaker segment and performing spectral clustering on the speaker-discriminative embeddings to cluster the plurality of speaker segments into k classes. The method also includes assigning a respective speaker label to each speaker segment clustered into the respective class that is different than the respective speaker label assigned to the speaker segments clustered into each other class of the k classes.
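
The segmentation step can be sketched as follows. The `<st>` token and the `(text, start_sec, end_sec)` tuple layout are hypothetical; the embedding extraction and spectral clustering steps are not shown.

```python
TURN = "<st>"  # hypothetical speaker turn token


def segment_by_turns(tokens):
    """Split a token stream into speaker segments at each TURN token.

    tokens: list of (text, start_sec, end_sec) tuples.
    Returns a list of (segment_start, segment_end, segment_text).
    """
    segments, current = [], []
    for text, start, end in tokens:
        if text == TURN:
            if current:  # close the running segment at a turn boundary
                segments.append(current)
                current = []
        else:
            current.append((text, start, end))
    if current:  # flush the final segment
        segments.append(current)
    return [
        (seg[0][1], seg[-1][2], " ".join(word for word, _, _ in seg))
        for seg in segments
    ]


tokens = [
    ("how", 0.0, 0.2), ("are", 0.2, 0.4), ("you", 0.4, 0.6),
    (TURN, 0.6, 0.6),
    ("fine", 0.7, 1.0), ("thanks", 1.0, 1.4),
]
segments = segment_by_turns(tokens)
```

Each resulting segment would then be passed to a speaker encoder, and the embeddings clustered into k classes.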

CONVERSATIONAL RECOMMENDATION METHOD, METHOD OF TRAINING MODEL, DEVICE AND MEDIUM
20230088445 · 2023-03-23 ·

A conversational recommendation method, a method of training a conversational recommendation model, an electronic device, and a storage medium are provided, which relate to the technical field of data processing, in particular to the technical fields of voice interaction, deep learning, artificial intelligence and the like. The conversational recommendation method includes: acquiring historical conversation information; determining a target conversation object to be generated from a conversation target graph based on the historical conversation information, where the conversation target graph includes an object node, the object node is configured to represent a conversation object, and the target conversation object is determined based on the object node; and generating target conversation information for recommendation based on the target conversation object.

Multi-step linear interpolation of language models

A computer-implemented method is provided for generating a language model for an application. The method includes estimating interpolation weights of each of a plurality of language models according to an Expectation Maximization (EM) algorithm based on a first metric. The method further includes classifying the plurality of language models into two or more sets based on characteristics of the two or more sets. The method also includes estimating a hyper interpolation weight for the two or more sets based on a second metric specific to the application. The method additionally includes interpolating the plurality of language models using the interpolation weights and the hyper interpolation weight to generate a final language model.
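
The first step, estimating linear interpolation weights with EM, can be sketched as below. The sketch assumes the first metric is likelihood on held-out text and that each model's per-token probabilities are already available; the function name and the toy probabilities are illustrative, and the set classification and hyper interpolation weight steps are not shown.

```python
def em_interpolation_weights(probs, iterations=50):
    """Estimate mixture weights for linearly interpolated language models.

    probs[i][m] is the probability that model m assigns to held-out
    token i (all values assumed strictly positive).
    """
    num_models = len(probs[0])
    weights = [1.0 / num_models] * num_models  # uniform start
    for _ in range(iterations):
        counts = [0.0] * num_models
        for row in probs:
            mix = sum(w * p for w, p in zip(weights, row))
            # E-step: posterior responsibility of each model for this token
            for m, (w, p) in enumerate(zip(weights, row)):
                counts[m] += w * p / mix
        # M-step: new weights are the average responsibilities
        weights = [c / len(probs) for c in counts]
    return weights


# Toy held-out data: model 0 assigns higher probability to every token.
held_out = [[0.5, 0.1], [0.4, 0.2], [0.6, 0.1]]
weights = em_interpolation_weights(held_out)
```

Each EM iteration keeps the weights summing to one and never lowers the held-out likelihood of the mixture.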

Characterizing, selecting and adapting audio and acoustic training data for automatic speech recognition systems

A system for and method of characterizing a target application acoustic domain analyzes one or more speech data samples from the target application acoustic domain to determine one or more target acoustic characteristics, including a CODEC type and bit-rate associated with the speech data samples. The determined target acoustic characteristics may also include other aspects of the target speech data samples such as sampling frequency, active bandwidth, noise level, reverberation level, clipping level, and speaking rate. The determined target acoustic characteristics are stored in a memory as a target acoustic data profile. The data profile may be used to select and/or modify one or more out-of-domain speech samples based on the one or more target acoustic characteristics.

CONVERSATION GENERATION USING SUMMARY-GROUNDED CONVERSATION GENERATORS

An example system includes a processor to receive a summary of a conversation to be generated. The processor can input the summary into a trained summary-grounded conversation generator. The processor can receive a generated conversation from the trained summary-grounded conversation generator.

METHOD AND DEVICE FOR PRESERVING CONTEXT IN CONVERSATIONS

The present disclosure relates to preserving context in a conversation between a user (101) and a digital assistant device (102). During training, the digital assistant device (102) is provided with a plurality of conversations having a plurality of dialogues. Each of the plurality of dialogues is assigned an ID based on a context. Further, two or more test queries having the same context are provided as input, and the two or more queries are assigned an ID based on the context. Thereafter, the digital assistant device (102) is configured to retrieve one or more dialogues from the plurality of dialogues where the ID of the one or more dialogues matches the ID of the two or more queries. In real time, one or more queries are received and, based on a context of the one or more queries, one or more dialogues are retrieved and provided to the user.
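
The ID-matching retrieval can be sketched with a simple index, assuming the context IDs have already been assigned to both the dialogues and the queries. How the IDs are derived from context is not specified in the abstract; the ID strings and function names here are hypothetical.

```python
from collections import defaultdict


def build_index(dialogues):
    """Group dialogue texts by their context ID.

    dialogues: list of (context_id, text) pairs.
    """
    index = defaultdict(list)
    for context_id, text in dialogues:
        index[context_id].append(text)
    return index


def retrieve(index, query_context_id):
    """Return every dialogue whose ID matches the query's context ID."""
    return index.get(query_context_id, [])


dialogues = [
    ("flight_booking", "Which date would you like to fly?"),
    ("weather", "It will rain tomorrow."),
    ("flight_booking", "Your seat is confirmed."),
]
index = build_index(dialogues)
matches = retrieve(index, "flight_booking")
```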