G10L2013/083

REAL-TIME SYSTEM FOR SPOKEN NATURAL STYLISTIC CONVERSATIONS WITH LARGE LANGUAGE MODELS

The techniques disclosed herein enable systems for spoken natural stylistic conversations with large language models. In contrast to many existing modalities for interacting with large language models, which are limited to text, the techniques presented herein enable users to carry on a fully spoken conversation with a large language model. This is accomplished by converting a user speech audio input to text and utilizing a prompt engine to analyze a sentiment expressed by the user. A large language model, having been trained on example conversations, then generates a text response as well as a style cue to express emotion in response to the sentiment expressed by the speech audio input. A text-to-speech engine can subsequently interpret the text response and style cue to generate an audio output which emulates the sensation of human conversation.

Audio playing method, electronic device, and storage medium

The present application provides an audio playing method, an electronic device, and a computer readable storage medium. The method comprises: recognizing an audio file to be played as a text file containing sentence segmentation symbols; generating respective sentence segmentation tags at positions corresponding to the sentence segmentation symbols in the audio file, according to a correspondence relationship between the audio file and the text file; in response to a trigger operation, determining a target play point according to a current play position of the audio file and respective positions of the sentence segmentation tags; and playing the audio file from the target play point.
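
As a rough illustration of the target-play-point step, the sketch below assumes the sentence segmentation tags are stored as a sorted list of sentence-start times in seconds, and that the trigger operation is either "jump back to the start of the current sentence" or "skip to the next sentence". The function name `target_play_point`, the `direction` parameter, and the time representation are hypothetical and are not taken from the application.

```python
from bisect import bisect_right

def target_play_point(tags, current, direction="back"):
    """Pick the position to resume playback from.

    tags: sorted list of sentence-start times in seconds (assumed format).
    current: current play position in seconds.
    direction: "back" for the start of the sentence being played,
               "forward" for the start of the next sentence.
    """
    if not tags:
        return current  # no tags available, so stay where we are
    if direction == "back":
        i = bisect_right(tags, current) - 1  # last tag at or before current
        return tags[i] if i >= 0 else 0.0
    i = bisect_right(tags, current)  # first tag strictly after current
    return tags[i] if i < len(tags) else current

tags = [0.0, 4.2, 9.8, 15.1]
print(target_play_point(tags, 11.0))             # start of the current sentence
print(target_play_point(tags, 11.0, "forward"))  # start of the next sentence
```

A binary search keeps the lookup cheap even for long recordings with many sentence tags; an actual implementation might also snap to the previous sentence when the current position is only a fraction of a second past a tag.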

METHODS AND SERVERS FOR TRAINING A MODEL TO PERFORM SPEAKER CHANGE DETECTION
20250372078 · 2025-12-04

A method and a server for training a model are provided. The method comprises: acquiring a punctuation training dataset including a first input and a first label, the first input including audio data and textual data representative of an utterance, the first label including a sequence of ground-truth tokens; training the model using the punctuation training dataset, thereby generating a punctuation trained model; acquiring a speaker change training dataset including a second input and a second label, the second input including second audio data and second textual data, the second label including a second sequence of ground-truth tokens; fine-tuning the punctuation trained model using the speaker change training dataset, thereby generating a speaker change model; acquiring in-use textual data and corresponding in-use audio data; and generating, using the speaker change model, an in-use sequence of tokens based on the in-use audio data and the in-use textual data.

METHOD AND SYSTEM FOR GENERATING SPEECH DATA FILE
20260004770 · 2026-01-01

A method for generating a speech data file from a text file, including: calculating a number of words included in a sentence part of the text file; calculating an expected duration for the sentence part based on the number of words; assigning a pausing time for the sentence part based on at least the expected duration and a saying time duration parameter, the pausing time to be attached at the end of the sentence part; and generating the speech data file associated with the text file, the speech data file including, for the sentence part, an audio speech part that, when played, includes voice of the sentence part, and a pausing part that is played after the voice of the sentence part, that does not include the content of the sentence part, and that has a duration equal to the associated pausing time.
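
The abstract does not state how the pausing time is derived from the expected duration and the saying time duration parameter. The following is a minimal sketch under stated assumptions: the expected duration is the word count times a fixed per-word speaking time, and the pausing time is the expected duration scaled by a multiplier. All names (`plan_speech_file`, `seconds_per_word`, `saying_time_factor`) and default values are hypothetical.

```python
def count_words(sentence_part):
    """Count whitespace-separated words in a sentence part."""
    return len(sentence_part.split())

def expected_duration(num_words, seconds_per_word=0.4):
    """Estimate speaking time; the per-word rate is an assumed constant."""
    return num_words * seconds_per_word

def pausing_time(expected, saying_time_factor=1.5):
    """Assumed rule: pause is the expected duration times a multiplier."""
    return expected * saying_time_factor

def plan_speech_file(sentence_parts, seconds_per_word=0.4, saying_time_factor=1.5):
    """Build a per-sentence plan: speech duration followed by a silent pause."""
    plan = []
    for part in sentence_parts:
        n = count_words(part)
        speech_s = expected_duration(n, seconds_per_word)
        plan.append({
            "text": part,
            "words": n,
            "speech_s": speech_s,
            "pause_s": pausing_time(speech_s, saying_time_factor),
        })
    return plan

for entry in plan_speech_file(["Good morning.", "How are you today?"]):
    print(entry)
```

Each plan entry could then drive a synthesizer that renders the sentence audio and appends silence of length `pause_s`; the synthesis step itself is outside the scope of this sketch.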

Non-transitory computer-readable medium and voice generating system
12518736 · 2026-01-06

According to one or more embodiments, a non-transitory computer-readable medium includes a program that, when executed, causes a server to perform functions including: converting a first text into a voice feature amount by inputting the first text into a learned conversion model, wherein the first text is in a different language from a predetermined language and the learned conversion model is pre-learned to convert a second text in the predetermined language into a voice feature amount; and synthesizing a voice from the converted voice feature amount.

Method and system for generating speech data file
12609107 · 2026-04-21

A method for generating a speech data file from a text file, including: calculating a number of words included in a sentence part of the text file; calculating an expected duration for the sentence part based on the number of words; assigning a pausing time for the sentence part based on at least the expected duration and a saying time duration parameter, the pausing time to be attached at the end of the sentence part; and generating the speech data file associated with the text file, the speech data file including, for the sentence part, an audio speech part that, when played, includes voice of the sentence part, and a pausing part that is played after the voice of the sentence part, that does not include the content of the sentence part, and that has a duration equal to the associated pausing time.