METHODS FOR REAL-TIME ACCENT CONVERSION AND SYSTEMS THEREOF
20240265908 · 2024-08-08
Inventors
Cpc classification
G10L15/02
PHYSICS
G10L13/02
PHYSICS
G10L2015/022
PHYSICS
International classification
G10L13/02
PHYSICS
G10L15/02
PHYSICS
Abstract
Techniques for real-time accent conversion are described herein. An example computing device receives an indication of a first accent and a second accent. The computing device further receives, via at least one microphone, speech content having the first accent. The computing device is configured to derive, using a first machine-learning algorithm trained with audio data including the first accent, a linguistic representation of the received speech content having the first accent. The computing device is configured to, based on the derived linguistic representation of the received speech content having the first accent, synthesize, using a second machine-learning algorithm trained with (i) audio data comprising the first accent and (ii) audio data including the second accent, audio data representative of the received speech content having the second accent. The computing device is configured to convert the synthesized audio data into a synthesized version of the received speech content having the second accent.
Claims
1. A system, comprising memory having instructions stored thereon and one or more processors coupled to the memory and configured to execute the instructions to: train a first machine-learning algorithm with first speech content from a first plurality of speakers having a first accent; apply the first machine-learning algorithm to second speech content comprising a set of phonemes associated with a first pronunciation of the second speech content to generate an output; based on the output, synthesize, using a second machine-learning-algorithm trained with first audio data comprising the first accent and second audio data comprising a second accent, third audio data representative of the second speech content having the second accent; and convert the synthesized third audio data into a synthesized version of the second speech content having the second accent.
2. The system of claim 1, wherein the instructions are executable by the one or more processors to further cause the system to align and classify each of a plurality of frames of the first speech content corresponding to respective ones of the speakers to facilitate the training.
3. The system of claim 1, wherein the instructions are executable by the one or more processors to further cause the system to map at least a first non-text linguistic representation of a first phoneme of the set of phonemes to a second non-text linguistic representation of a second phoneme associated with a second pronunciation of the second speech content, the synthesized version of the second speech content further comprises the second phoneme, and the first and second phonemes are different phonemes.
4. The system of claim 3, wherein the second pronunciation of the second speech content is different than the first pronunciation of the first speech content.
5. The system of claim 1, wherein the instructions are executable by the one or more processors to further cause the system to apply, to the output, a learned mapping between the first audio data and the second audio data.
6. The system of claim 3, wherein the instructions are executable by the one or more processors to further cause the system to map one or more frames in the output to one or more corresponding frames in the second non-text linguistic representation.
7. The system of claim 1, wherein the first audio data corresponds to a second plurality of speakers having the first accent and the second audio data corresponds to a single speaker having the second accent.
8. A method implemented by one or more computing devices and comprising: aligning and classifying each of a plurality of frames of first speech content corresponding to respective speakers having a first accent to train a first machine-learning algorithm; applying the first machine-learning algorithm to second speech content comprising a first set of phonemes associated with a first pronunciation of the second speech content; based on the application of the first machine-learning algorithm, synthesizing, using a second machine-learning-algorithm trained with first audio data comprising the first accent and second audio data comprising a second accent, third audio data representative of the second speech content having the second accent; and converting the synthesized third audio data into a synthesized version of the second speech content having the second accent.
9. The method of claim 8, further comprising mapping at least a first non-text linguistic representation of a first phoneme of the first set of phonemes to a second non-text linguistic representation of a second phoneme of a second set of phonemes associated with a second pronunciation of the second speech content to facilitate the synthesizing.
10. The method of claim 9, wherein the synthesized version of the second speech content comprises the second set of phonemes.
11. The method of claim 9, wherein the second pronunciation of the second speech content is different than the first pronunciation of the second speech content and the first and second phonemes are different phonemes.
12. The method of claim 8, further comprising continuously converting the synthesized third audio data into a synthesized version of third speech content having the second accent between 50-700 ms after receiving the third speech content having the first accent, wherein the synthesized version of the third speech content has the second accent.
13. The method of claim 8, further comprising receiving a first user input indicating a selection of the first accent and a second user input indicating a selection of the second accent.
14. The method of claim 8, wherein the first machine-learning algorithm comprises a non-text learned linguistic representation for the first accent and the method further comprises: aligning and classifying each of the plurality of frames according to monophone and triphone sounds of the first speech content to train the first machine-learning algorithm; and detecting, for each of another plurality of frames in the second speech content, a respective monophone and triphone sound based on the non-text learned linguistic representation.
15. A non-transitory computer-readable medium comprising instructions that, when executed by at least one processor, cause the at least one processor to: apply a first machine-learning algorithm to first speech content comprising first phonemes associated with a first pronunciation to derive a non-text linguistic representation of the first phonemes; based on the non-text linguistic representation of the first phonemes, synthesize, using a second machine-learning algorithm trained with first audio data comprising a first accent and second audio data comprising a second accent, third audio data representative of the first speech content having the second accent, wherein the synthesizing comprises mapping at least a first non-text linguistic representation of a first phoneme of the first phonemes to a second non-text linguistic representation of a second phoneme of second phonemes associated with a second pronunciation of the first speech content; and convert the synthesized third audio data into a synthesized version of the first speech content having the second accent and comprising the second phonemes.
16. The non-transitory computer-readable medium of claim 15, wherein the instructions, when executed by the at least one processor further causes the at least one processor to train the first machine-learning algorithm with fourth audio data comprising second speech content from speakers having the first accent.
17. The non-transitory computer-readable medium of claim 16, wherein the instructions, when executed by the at least one processor further causes the at least one processor to align and classify each of a plurality of frames of the second speech content corresponding to respective ones of the speakers to train the first machine-learning algorithm.
18. The non-transitory computer-readable medium of claim 15, wherein the second pronunciation of the first speech content is different than the first pronunciation of the first speech content and the first and second phonemes are different phonemes.
19. The non-transitory computer-readable medium of claim 15, wherein the first speech content further comprises a set of prosodic features, the instructions, when executed by the at least one processor further causes the at least one processor to synthesize the third audio data and the set of prosodic features, and the synthesized version of the first speech content has the set of prosodic features.
20. The non-transitory computer-readable medium of claim 15, wherein the instructions, when executed by the at least one processor further causes the at least one processor to transmit the synthesized version of the first speech content to a computing device.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0012]
[0013]
[0014]
[0015]
DETAILED DESCRIPTION
[0016] The following disclosure refers to the accompanying figures and several example embodiments. One of ordinary skill in the art should understand that such references are for the purpose of explanation only and are therefore not meant to be limiting. Part or all of the disclosed systems, devices, and methods may be rearranged, combined, added to, and/or removed in a variety of manners, each of which is contemplated herein.
I. Example Computing Device
[0017]
[0018] The processor 102 may comprise one or more processor components, such as general-purpose processors (e.g., a single- or multi-core microprocessor), special-purpose processors (e.g., an application-specific integrated circuit or digital-signal processor), programmable logic devices (e.g., a field programmable gate array), controllers (e.g., microcontrollers), and/or any other processor components now known or later developed. In line with the discussion above, it should also be understood that processor 102 could comprise processing components that are distributed across a plurality of physical computing devices connected via a network, such as a computing cluster of a public, private, or hybrid cloud.
[0019] In turn, data storage 104 may comprise one or more non-transitory computer-readable storage mediums that are collectively configured to store (i) software components including program instructions that are executable by processor 102 such that computing device 100 is configured to perform some or all of the disclosed functions and (ii) data that may be received, derived, or otherwise stored, for example, in one or more databases, file systems, or the like, by computing device 100 in connection with the disclosed functions. In this respect, the one or more non-transitory computer-readable storage mediums of data storage 104 may take various forms, examples of which may include volatile storage mediums such as random-access memory, registers, cache, etc. and non-volatile storage mediums such as read-only memory, a hard-disk drive, a solid-state drive, flash memory, an optical-storage device, etc. In line with the discussion above, it should also be understood that data storage 104 may comprise computer-readable storage mediums that are distributed across a plurality of physical computing devices connected via a network, such as a storage cluster of a public, private, or hybrid cloud. Data storage 104 may take other forms and/or store data in other manners as well.
[0020] The communication interface 106 may be configured to facilitate wireless and/or wired communication between the computing device 100 and other systems or devices. As such, communication interface 106 may communicate according to any of various communication protocols, examples of which may include Ethernet, Wi-Fi, Controller Area Network (CAN) bus, serial bus (e.g., Universal Serial Bus (USB) or Firewire), cellular network, and/or short-range wireless protocols, among other possibilities. In some embodiments, the communication interface 106 may include multiple communication interfaces of different types. Other configurations are possible as well.
[0021] The I/O interfaces 108 of computing device 100 may be configured to (i) receive or capture information at computing device 100 and/or (ii) output information for presentation to a user. In this respect, the one or more I/O interfaces 108 may include or provide connectivity to input components such as a microphone, a camera, a keyboard, a mouse, a trackpad, a touchscreen, or a stylus, among other possibilities. Similarly, the I/O interfaces 108 may include or provide connectivity to output components such as a display screen and an audio speaker, among other possibilities.
[0022] It should be understood that computing device 100 is one example of a computing device that may be used with the embodiments described herein, and may be representative of the computing devices 200 and 300 shown in
II. Example Functionality
[0023] Turning to
[0024] For example, as shown in
[0025] The speech content may then be passed to the accent-conversion application 203 shown in
[0026]
[0027] Further, the virtual microphone interface 205 may include a drop-down menu 207 or similar option for selecting the input source from which the accent-conversion application 203 will receive speech content, as the computing device 200 might have multiple available options to use as an input source. Still further, the virtual microphone interface 205 may include a drop-down menu 208 or similar option for selecting the desired output accent for the speech content. As shown in
[0028] Advantageously, the accent-conversion application 203 may accomplish the operations above, and discussed in further detail below, at speeds that enable real-time communications, having a latency in the range of 50-700 ms (e.g., 200 ms) from the time the input speech is received by the accent-conversion application 203 to the time the converted speech content is provided to the communication application 204. Further, the accent-conversion application 203 may process incoming speech content as it is received, making it capable of handling both extended periods of speech and the frequent stops and starts that may be associated with some conversations. For example, in some embodiments, the accent-conversion application 203 may process incoming speech content every 160 ms. In other embodiments, the accent-conversion application 203 may process the incoming speech content more frequently (e.g., every 80 ms) or less frequently (e.g., every 300 ms).
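By way of illustration only, the periodic processing described above might be sketched as a generator that groups incoming samples into fixed-duration chunks. The function name chunk_stream, the 16 kHz sample rate, and the flushing of a trailing partial chunk are assumptions made for this sketch and are not taken from the disclosure; only the 160 ms and 80 ms intervals come from the text.

```python
from typing import Iterable, Iterator, List


def chunk_stream(
    samples: Iterable[float],
    sample_rate: int = 16_000,  # assumed sample rate, not stated in the text
    chunk_ms: int = 160,        # processing interval mentioned in the text
) -> Iterator[List[float]]:
    """Yield fixed-duration chunks of audio as samples arrive."""
    chunk_len = sample_rate * chunk_ms // 1000
    buffer: List[float] = []
    for sample in samples:
        buffer.append(sample)
        if len(buffer) == chunk_len:
            yield buffer
            buffer = []
    if buffer:
        # Flush a trailing partial chunk, e.g. when the speaker stops mid-interval.
        yield buffer


if __name__ == "__main__":
    silence = [0.0] * 6000
    print([len(c) for c in chunk_stream(silence)])
```

A shorter interval (e.g., chunk_ms=80) lowers the buffering delay per chunk at the cost of more frequent model invocations, which is one way to trade throughput against the overall latency budget.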
[0029] Turning now to
[0030]
[0031] At block 402, the computing device 300 may receive speech content 301 having a first accent. For instance, as discussed above with respect to
[0032] The ASR engine 302 includes one or more machine learning models (e.g., a neural network, such as a recurrent neural network (RNN), a transformer neural network, etc.) that are trained using previously captured speech content from many different speakers having the first accent. Continuing the example above, the ASR engine 302 may be trained with previously captured speech content from a multitude of different speakers, each having an Indian English accent. For instance, the captured speech content used as training data may include transcribed content in which each of the speakers read the same script (e.g., a script curated to provide a wide sampling of speech sounds, as well as specific sounds that are unique to the first accent). Thus, the ASR engine 302 may align and classify each frame of the captured speech content according to its monophone and triphone sounds, as indicated in the corresponding transcript. As a result of this frame-wise breakdown of the captured speech across multiple speakers having the first accent, the ASR engine 302 may develop a learned linguistic representation of speech having an Indian English accent that is not speaker-specific.
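As a non-limiting sketch of the frame-wise breakdown described above, the following expands a phone-level alignment of a transcript into one monophone label and one triphone label per frame. The 10 ms frame hop, the "left-center+right" triphone notation, the "sil" boundary symbol, and the function name frame_labels are all assumptions chosen for illustration; the disclosure does not specify how the alignment is produced or encoded.

```python
from typing import List, Tuple

# (start_ms, end_ms, phone) for each phone in the aligned transcript.
Segment = Tuple[int, int, str]


def frame_labels(alignment: List[Segment], hop_ms: int = 10) -> List[Tuple[str, str]]:
    """Return a (monophone, triphone) label for every frame of the utterance."""
    phones = [phone for _, _, phone in alignment]
    labels: List[Tuple[str, str]] = []
    for i, (start_ms, end_ms, phone) in enumerate(alignment):
        # Context phones; "sil" (silence) is an assumed boundary symbol.
        left = phones[i - 1] if i > 0 else "sil"
        right = phones[i + 1] if i + 1 < len(phones) else "sil"
        triphone = f"{left}-{phone}+{right}"
        n_frames = (end_ms - start_ms) // hop_ms
        labels.extend([(phone, triphone)] * n_frames)
    return labels


if __name__ == "__main__":
    # Hypothetical alignment for the word "cat".
    cat = [(0, 30, "k"), (30, 80, "ae"), (80, 100, "t")]
    for label in frame_labels(cat):
        print(label)
```

Labels of this kind could serve as per-frame classification targets during training, with the same script read by many speakers yielding many acoustic realizations of each label.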
[0033] On the other hand, the ASR engine 302 may also be used to develop a learned linguistic representation for an output accent that is based only on speech content from a single, representative speaker (e.g., a target SAE speaker) reading a script in the output accent, and is therefore speaker-specific. In this way, the synthesized speech content that is generated having the target accent (discussed further below) will tend to sound like the target speaker for the output accent. In some cases, this may simplify the processing required to perform accent conversion and generally reduce latency.
[0034] In some implementations, the speech content collected from the multiple Indian English speakers as well as the target SAE speaker for training the ASR engine 302 may be based on the same script, also known as parallel speech. In this way the transcripts used by the ASR engine 302 to develop a linguistic representation for speech content in both accents are the same, which may facilitate mapping one linguistic representation to the other in some situations. Alternatively, the training data may include non-parallel speech, which may require less training data. Other implementations are also possible, including hybrid parallel and non-parallel approaches.
[0035] It should be noted that the learned linguistic representations developed by the ASR engine 302 and discussed herein may not be recognizable as such to a human. Rather, the learned linguistic representations may be encoded as machine-readable data (e.g., a hidden representation) that the ASR engine 302 uses to represent linguistic information.
[0036] In practice, the ASR engine 302 may be individually trained with speech content including multiple different accents, across different languages, and may develop a learned linguistic representation for each one. Accordingly, at block 404, the computing device 300 may receive an indication of the Indian English accent associated with the received speech content 301, so that the appropriate linguistic representation is used by the ASR engine 302. As noted above, this indication of the incoming accent (e.g., incoming accent 303 in
[0037] At block 406, the ASR engine 302 may derive a linguistic representation of the received speech content 301, based on the learned linguistic representation the ASR engine 302 has developed for the Indian English accent. For instance, the ASR engine 302 may break down the received speech content 301 by frame and classify each frame according to the sounds (e.g., monophones and triphones) that are detected, and according to how those particular sounds are represented and inter-related in the learned linguistic representation associated with an Indian English accent.
[0038] In this way, the ASR engine 302 functions to deconstruct the received speech content 301 having the first accent into a derived linguistic representation with very low latency. In this regard, it should be noted that the ASR engine 302 may differ from some other speech recognition models that are configured to predict and generate output speech, such as a speech-to-text model. Accordingly, the ASR engine 302 may not need to include such functionality.
[0039] The derived linguistic representation of the received speech content 301 may then be passed to the VC engine 304. Similar to the indication of the incoming accent 303, the computing device 300 may also receive an indication of the output accent (e.g., output accent 305 in
[0040] Similar to the ASR engine 302, the VC engine 304 includes one or more machine learning models (e.g., a neural network) that use the learned linguistic representations developed by the ASR engine 302 as training inputs to learn how to map speech content from one accent to another. For instance, the VC engine 304 may be trained to map an ASR-based linguistic representation of Indian English speech to an ASR-based linguistic representation of a target SAE speaker, using individual monophones and triphones within the training data as a heuristic to better determine the alignments. Like the learned linguistic representations themselves, the learned mapping between the two representations may be encoded as machine-readable data (e.g., a hidden representation) that the VC engine 304 uses to represent linguistic information.
[0041] Accordingly, at block 408, the VC engine 304 may utilize the learned mapping between the two linguistic representations to synthesize, based on the derived linguistic representation of the received speech content 301, audio data that is representative of the speech content 301 having the second accent. The audio data that is synthesized in this way may take the form of a set of mel spectrograms. For example, the VC engine 304 may map each incoming frame in the derived linguistic representation to an outgoing target speech frame.
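For context on the mel spectrogram format mentioned above, the following sketch converts between frequency in hertz and the mel scale using one widely used formula, m = 2595 log10(1 + f/700), and places band centers evenly on that scale. The disclosure does not state which mel variant, how many bands, or what frequency range is used; the 80 bands and 0-8000 Hz range below are assumptions for illustration.

```python
import math
from typing import List


def hz_to_mel(f_hz: float) -> float:
    """Convert hertz to mels using a common logarithmic formula."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_band_centers(n_mels: int = 80, f_min: float = 0.0, f_max: float = 8000.0) -> List[float]:
    """Center frequencies (Hz) of n_mels bands spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


if __name__ == "__main__":
    centers = mel_band_centers()
    print(f"first center: {centers[0]:.1f} Hz, last center: {centers[-1]:.1f} Hz")
```

Because the bands are spaced evenly in mels rather than hertz, they are narrow at low frequencies and wide at high frequencies, which gives a compact per-frame representation for a synthesis model to predict.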
[0042] In this way, the VC engine 304 functions to reconstruct acoustic features from the derived linguistic representation into audio data that is representative of speech by a different speaker having the second accent, all with very low latency. Advantageously, because the VC engine 304 works at the level of encoded linguistic data and does not need to predict and generate output speech as a midpoint for the conversion, it can function more quickly than alternatives such as an STT-TTS approach. Further, the VC engine 304 may more accurately capture some of the nuances of voice communications, such as brief pauses or changes in pitch, which may be lost if the speech content were converted to text first and then back to speech.
[0043] At block 410, the output speech generation engine 306 may convert the synthesized audio data into output speech, which may be a synthesized version of the received speech content 301 having the second accent. As noted above, the output speech may further have the voice identity of the target speaker whose speech content was used to train the ASR engine 302. In some examples, the output speech generation engine 306 may take the form of a vocoder or similar component that can rapidly process audio under the real-time conditions contemplated herein. The output speech generation engine 306 may include one or more additional machine learning algorithms (e.g., a neural network, such as a generative adversarial network, one or more Griffin-Lim algorithms, etc.) that learn to convert the synthesized audio data into waveforms that are able to be heard. Other examples are also possible.
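The three stages described above (deriving a linguistic representation, synthesizing audio data, and converting it to output speech) might be composed as in the following sketch. The class name AccentConversionPipeline, the callable signatures, and the toy stand-in functions are hypothetical and serve only to show the data flow; they are not the trained models of the ASR engine 302, the VC engine 304, or the output speech generation engine 306.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = List[float]


@dataclass
class AccentConversionPipeline:
    """Chains three stages; each stage is any callable with the given shape."""

    asr: Callable[[Sequence[float]], List[Frame]]  # audio chunk -> linguistic frames
    vc: Callable[[List[Frame]], List[Frame]]       # linguistic frames -> spectrogram frames
    vocoder: Callable[[List[Frame]], List[float]]  # spectrogram frames -> waveform samples

    def convert(self, chunk: Sequence[float]) -> List[float]:
        linguistic = self.asr(chunk)
        spectrogram = self.vc(linguistic)
        return self.vocoder(spectrogram)


# Toy stand-ins (assumptions) so the sketch runs without any trained model.
def toy_asr(chunk: Sequence[float]) -> List[Frame]:
    return [list(chunk[i:i + 4]) for i in range(0, len(chunk), 4)]


def toy_vc(frames: List[Frame]) -> List[Frame]:
    return [[sum(frame) / len(frame)] for frame in frames]


def toy_vocoder(frames: List[Frame]) -> List[float]:
    return [value for frame in frames for value in frame]


if __name__ == "__main__":
    pipeline = AccentConversionPipeline(toy_asr, toy_vc, toy_vocoder)
    print(pipeline.convert([1.0] * 8))
```

Keeping each stage behind a plain callable makes it straightforward to swap in a different model per accent pair, and no stage in the chain produces or consumes text.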
[0044] As shown in
[0045] Although the examples discussed above involve a computing device 300 that utilizes the accent-conversion application for outgoing speech (e.g., situations where the user of computing device 300 is the speaker), it is also contemplated that the accent-conversion application may be used by the computing device 300 in the opposite direction as well, for incoming speech content 301 where the user is a listener. For instance, rather than being situated as a virtual microphone between a hardware microphone and the communication application 307, the accent-conversion application may be deployed as a virtual speaker between the communication application 307 and a hardware speaker of the computing device 300, and the indication of the incoming accent 303 and the indication of the output accent 305 shown in
[0046] As a further extension, the examples discussed above involve an ASR engine 302 that is provided with an indication of the incoming accent. However, in some embodiments it may be possible to use the accent-conversion application discussed above in conjunction with an accent detection model, such that the computing device 300 is initially unaware of one or both accents that may be present in a given communication. For example, an accent detection model may be used in the initial moments of a conversation to identify the accents of the speakers. Based on the identified accents, the accent-conversion application may determine the appropriate learned linguistic representation(s) that should be used by the ASR engine 302 and the corresponding learned mapping between representations that should be used by the VC engine 304. Additionally, or alternatively, the accent detection model may be used to provide a suggestion to a user for which input/output accent the user should select to obtain the best results. Other implementations incorporating an accent detection model are also possible.
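One possible, non-limiting way to use an accent detection model in the initial moments of a conversation is a majority vote over the first few chunks of speech, followed by a lookup of the models for the identified accent pair. In the sketch below, the classify callable stands in for an accent detection model, and the max_chunks limit, the accent labels, and the registry contents are hypothetical; the disclosure does not describe a voting scheme.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Sequence, Tuple


def detect_accent(
    chunks: Iterable[Sequence[float]],
    classify: Callable[[Sequence[float]], str],  # hypothetical per-chunk accent classifier
    max_chunks: int = 10,                        # assumed length of the "initial moments"
) -> str:
    """Return the accent label predicted most often over the first chunks."""
    votes: Counter = Counter()
    for i, chunk in enumerate(chunks):
        if i >= max_chunks:
            break
        votes[classify(chunk)] += 1
    if not votes:
        raise ValueError("no speech content available for accent detection")
    return votes.most_common(1)[0][0]


def select_models(registry: Dict[Tuple[str, str], str], incoming: str, output: str) -> str:
    """Look up the model identifier registered for an (incoming, output) accent pair."""
    return registry[(incoming, output)]


if __name__ == "__main__":
    # Toy classifier and registry, for demonstration only.
    toy_classify = lambda chunk: "indian_english" if sum(chunk) > 0 else "sae"
    registry = {("indian_english", "sae"): "hypothetical-model-id"}
    accent = detect_accent([[1.0], [1.0], [-1.0]], toy_classify)
    print(accent, select_models(registry, accent, "sae"))
```

The same detected label could alternatively be surfaced to the user as a suggested selection rather than applied automatically.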
[0047]
[0048] In addition, for the example flow chart in
[0049] The program code may be stored on any type of computer readable medium, such as a storage device including a disk or hard drive. The computer readable medium may include non-transitory computer readable media that store data for short periods of time, such as register memory, processor cache, and Random-Access Memory (RAM). The computer readable medium may also include non-transitory media, such as secondary or persistent long-term storage, like read only memory (ROM), optical or magnetic disks, or compact disc read only memory (CD-ROM). The computer readable media may also be any other volatile or non-volatile storage systems. The computer readable medium may be considered a computer readable storage medium, for example, or a tangible storage device. In addition, for the processes and methods disclosed herein, each block in
III. Conclusion
[0050] Example embodiments of the disclosed innovations have been described above. Those skilled in the art will understand, however, that changes and modifications may be made to the embodiments described without departing from the true scope and spirit of the present invention, which will be defined by the claims.
[0051] Further, to the extent that examples described herein involve operations performed or initiated by actors, such as humans, operators, users, or other entities, this is for purposes of example and explanation only. Claims should not be construed as requiring action by such actors unless explicitly recited in claim language.