VOICE TRANSLATION AND VIDEO MANIPULATION SYSTEM
20230009957 · 2023-01-12
Assignee
Inventors
CPC classification
G10L15/22
PHYSICS
G06F40/58
PHYSICS
International classification
G10L15/22
PHYSICS
Abstract
A communication modification system including an audio gathering unit that gathers an audio stream and a language detection unit that converts the audio stream into text, where the language detection unit correlates portions of the text with audio portions of the audio stream, and the language detection unit determines first and second deviations in the audio stream portion based on the text portion and audio portion gathered by the audio gathering unit.
Claims
1. A communication modification system including: an audio gathering unit that gathers an audio stream; a language detection unit that converts the audio stream into text, wherein the language detection unit correlates portions of the text with audio portions of the audio stream, and the language detection unit determines first and second deviations in the audio stream portion based on the text portion and audio portion gathered by the audio gathering unit.
2. The communication modification system of claim 1 wherein the text is broken into individual words or phrases.
3. The communication modification system of claim 2 wherein the individual phrases and words are logically related to audio segments of the audio stream.
4. The communication modification system of claim 3 wherein each audio segment is analyzed to determine at least one speech characteristic.
5. The communication modification system of claim 4 wherein the speech characteristic is one of a dialect, a speed, an emotion, or any other characteristic detectable in the audio stream.
6. The communication modification system of claim 5 wherein a first deviation algorithm is determined based on the speech characteristic in at least one audio segment.
7. The communication modification system of claim 6 wherein the first deviation algorithm is applied to at least one audio segment to produce a modified audio segment, and the modified audio segment is analyzed to identify additional speech characteristics.
8. The communication modification system of claim 7 wherein a second deviation algorithm is determined using the modified audio segment.
9. The communication modification system of claim 8 wherein a second modified audio segment is generated using the second deviation algorithm.
10. The communication modification system of claim 9 wherein the first and second deviation algorithms are applied to all audio segments to produce a modified audio stream.
11. A method of modifying an audio stream including the steps of: gathering an audio stream via an audio gathering unit; converting the audio stream into text using a language detection unit; correlating portions of the text with audio portions of the audio stream; and determining first and second deviations in the audio stream portion based on the text portion and audio portion gathered by the audio gathering unit.
12. The method of claim 11 including the step of breaking the text into individual words or phrases.
13. The method of claim 12 including the step of logically relating the individual phrases and words to audio segments of the audio stream.
14. The method of claim 13 including the step of analyzing each audio segment to determine at least one speech characteristic.
15. The method of claim 14 including the step of detecting at least one speech characteristic including dialect, speed, emotion or any other characteristic detectable in the audio stream.
16. The method of claim 15 including the step of determining a first deviation algorithm in at least one audio stream segment based on the speech characteristic determined for each of the at least one audio segments.
17. The method of claim 16 including the step of applying the first deviation algorithm to at least one audio segment to produce a modified audio segment and analyzing the modified audio segment to identify additional speech characteristics.
18. The method of claim 17 including the step of determining a second deviation algorithm using the modified audio segment.
19. The method of claim 18 including the step of generating a second modified audio segment using the second deviation algorithm.
20. The method of claim 19 including the step of applying the first and second deviation algorithms to all audio segments to produce a modified audio stream.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0025] The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an implementation of the present invention and, together with the description, serve to explain the advantages and principles of the invention. In the drawings:
DETAILED DESCRIPTION OF THE INVENTION
[0031] Referring now to the drawings which depict different embodiments consistent with the present invention, wherever possible, the same reference numbers will be used throughout the drawings and the following description to refer to the same or like parts.
[0032] The voice translator system gathers information on the audio and video streams of a video communication. The audio and video are parsed, and the audio is translated into a foreign language and the video is manipulated such that the mouths of the users mimic the user speaking the foreign language.
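The parsing-and-translation pipeline described above can be sketched in simplified form. The word-level dictionary translator and the AudioSegment structure below are illustrative stand-ins for a real speech-to-text and machine-translation stack, not part of the disclosed system:

```python
from dataclasses import dataclass

@dataclass
class AudioSegment:
    text: str       # transcribed text correlated with this audio portion
    start_ms: int   # position of the segment in the stream
    end_ms: int

# Toy word-for-word dictionary standing in for a real translation service.
TOY_DICTIONARY = {"hello": "hola", "world": "mundo"}

def translate_text(text: str, dictionary=TOY_DICTIONARY) -> str:
    """Translate word-by-word using the toy dictionary (illustrative only)."""
    return " ".join(dictionary.get(w, w) for w in text.lower().split())

def translate_stream(segments):
    """Translate every correlated audio segment while preserving its timing."""
    return [AudioSegment(translate_text(s.text), s.start_ms, s.end_ms)
            for s in segments]

segments = [AudioSegment("Hello world", 0, 900)]
out = translate_stream(segments)
print(out[0].text)  # hola mundo
```

The timing fields are kept on each translated segment so that downstream video manipulation can align mouth movements with the new audio.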
[0034] The audio gathering unit 110 and language detection unit 112 may be embodied by one or more servers. Likewise, each of the facial recognition unit 114 and facial recreation unit 116 may be implemented using any combination of hardware and software, whether incorporated in a single device or functionally distributed across multiple platforms and devices.
[0035] In one embodiment, the network 108 may be any private or public communication network known to one skilled in the art, such as a local area network (“LAN”), wide area network (“WAN”), peer-to-peer network, cellular network, TCP/IP network, or any other suitable network topology using standard communication protocols, and may include hardwired as well as wireless branches. In another embodiment, the voice translation unit 102 may be implemented on servers, workstations, network appliances, or any other suitable data storage devices. In another embodiment, the communication devices 104 and 106 may be any combination of cellular phones, telephones, personal data assistants, or any other suitable communication devices.
[0040] In step 318, the language detection unit 112 detects the language spoken in the video stream. In step 320, the audio gathering unit 110 generates a new audio stream using the translated text. In step 322, the audio gathering unit 110 uses digital audio processing to identify specific speech patterns of the speaker in the original audio stream and applies those speech patterns to the newly generated audio stream. The speech patterns are normalized and then applied to the translated audio stream such that the audio stream is manipulated to match the speaker's voice to the pronunciations in the converted text. In step 324, the pixels representing the mouth of the speaker are manipulated to replicate the movement of the speaker's mouth saying the newly translated text. The facial recreation unit 116 modifies the pixels representing the speaker's mouth in the video based on previously observed formations of the mouth while the speaker was dictating the original untranslated text. In step 326, the facial recreation unit 116 modifies all frames of the video to correspond with the mouth movements for the translated text.
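One part of the normalization in step 322 can be illustrated as loudness matching: scaling the synthesized audio so its energy matches the speaker's original recording. This is a minimal sketch of that one aspect; a real system would also transfer pitch, timing, and pronunciation patterns:

```python
import math

def rms(samples):
    """Root-mean-square energy of a list of audio samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def match_loudness(original, synthesized):
    """Scale synthesized samples so their RMS energy matches the original's."""
    target, current = rms(original), rms(synthesized)
    gain = target / current if current else 1.0
    return [s * gain for s in synthesized]

original = [0.4, -0.4, 0.4, -0.4]     # speaker's recorded samples
synthesized = [0.1, -0.1, 0.1, -0.1]  # generated translated speech
matched = match_loudness(original, synthesized)
print(round(rms(matched), 3))  # 0.4
```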
[0041] In addition to translating speech and replicating the movements of the speaker's mouth, emotions represent a large portion of communication. In one embodiment, after the face coordinates are identified, the audio gathering unit 110 may determine the emotions conveyed by the speaker and may adjust the speaker's image to convey the speaker's emotional state. In one embodiment, the audio gathering unit 110 may determine the emotional state of the speaker by analyzing the volume and speed of the audio as well as the facial coordinates of the speaker while speaking.
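A crude heuristic for estimating emotional state from volume and speaking rate might look as follows. The thresholds and labels are illustrative assumptions, not values disclosed in this application:

```python
def estimate_emotion(samples, words, duration_s):
    """Classify emotional state from volume (RMS) and speaking rate.
    Thresholds below are arbitrary illustrative values."""
    volume = (sum(s * s for s in samples) / len(samples)) ** 0.5
    rate = len(words) / duration_s  # words per second
    if volume > 0.5 and rate > 3.0:
        return "excited"
    if volume < 0.2 and rate < 1.5:
        return "calm"
    return "neutral"

print(estimate_emotion([0.7, -0.8, 0.6], ["go", "go", "go", "now"], 1.0))  # excited
```

A production system would combine such audio cues with the facial coordinates mentioned above rather than rely on volume and rate alone.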
[0042] As an illustrative example, a first user on a communication device 104 calls a user on a second communication device 106, with the first user initiating a video call. Each user selects their native language, and the audio gathering unit 110 translates the audio into the preferred language of each user. In addition, the mouth images of each user are adjusted to mimic each user saying the words in the other user's preferred language.
[0044] In step 412, the language detection unit 112 analyzes each of the correlated audio segments to determine the voice characteristics in each segment. In step 414, the language detection unit 112 determines the characteristics of each segment to be adjusted based on a predetermined audio output format. The characteristics include, but are not limited to, speed of speech, tone, pronunciation of letters and words, and any other voice characteristic. In step 416, the language detection unit 112 adjusts the voice characteristics of each audio segment to generate a modified audio output. In step 418, a second audio deviation is determined based on user input. In step 420, the language detection unit 112 applies the second deviation to each audio segment. In step 422, the language detection unit 112 combines all of the segments into a single audio stream.
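The two-stage deviation process of steps 412 through 422 can be sketched as follows. Segments are modeled as plain sample lists, and the two deviations are reduced to a gain and an offset purely for illustration; the actual deviations would be derived from the output format and user input as described above:

```python
def first_deviation(segment, gain=0.5):
    """Adjust a voice characteristic (here: amplitude) of one segment."""
    return [s * gain for s in segment]

def second_deviation(segment, offset=0.1):
    """Further modify the already-modified segment (user-driven deviation)."""
    return [s + offset for s in segment]

def modify_stream(segments):
    """Apply both deviations to every segment, then join them into one stream."""
    combined = []
    for seg in segments:
        combined.extend(second_deviation(first_deviation(seg)))
    return combined

stream = modify_stream([[1.0, 2.0], [4.0]])
print([round(x, 3) for x in stream])  # [0.6, 1.1, 2.1]
```

Chaining the deviations, rather than applying each to the original audio, mirrors claims 7 through 10: the second deviation operates on the already-modified segment.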
[0045] As an illustrative example, the process may be used to receive a user's voice and modify it to simulate another person's voice. By applying the second deviation, the modified voice can be further modified to sound similar to a third person's voice.
[0046] While various embodiments of the present invention have been described, it will be apparent to those of skill in the art that many more embodiments and implementations are possible that are within the scope of this invention. Accordingly, the present invention is not to be restricted except in light of the attached claims and their equivalents.