Automated systems and methods for providing bidirectional parallel language recognition and translation processing with machine speech production for two users simultaneously to enable gapless interactive conversational communication
20190354592 · 2019-11-21
Inventors
- Sharat Chandra Musham (Austin, TX, US)
- Pranitha Paladi (Austin, TX, US)
- Rey Moulton (Brooklyn, NY, US)
CPC classification
G10L21/06
PHYSICS
G06F40/58
PHYSICS
G10L13/02
PHYSICS
International classification
Abstract
A novel system and multi-device invention that provides a means to communicate in real time (conversationally) between two or more individuals, regardless of each individual's preferred or limited mode of transmission or receipt (by gesture, by voice in any language from Mandarin to German to Farsi, by text in any major language, and, via machine learning, eventually by dialect).
Systems and methods are provided for conversational communication between two individuals using multiple language modes (e.g., visual language and verbal language) through the use of a worn device (for hands-free language input capability). Information may be stored in memory regarding user preferences, as well as various language databases (visual, verbal, and textual), or the system can determine and adapt to user (primary and secondary) preferences and modes based on direct input. Core processing for the worn device can be performed (1) off-device via cloud processing through wireless transmission, (2) on-board, or (3) as a mix of both, depending on the embodiment and the location of use, for example if the user is out of range of a high-speed wireless network and needs to rely more on on-board processing, or to maintain conversational-speed dual/real-time translation and conversion.
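The off-device/on-board split described above can be sketched as a simple routing decision. This is a minimal illustration only; the function name, the bandwidth threshold, and the three-way "mixed" mode are assumptions for the sketch, not details from the disclosure.

```python
def choose_processing(link_mbps, min_streaming_mbps=5.0):
    """Pick where core processing runs based on current wireless link quality."""
    # Route core processing to the cloud when the wireless link can sustain
    # conversational-speed streaming; otherwise fall back toward on-board
    # processing. The 5 Mbps threshold is an assumed illustrative figure.
    if link_mbps >= min_streaming_mbps:
        return "cloud"
    if link_mbps > 0:
        return "mixed"      # partial connectivity: split the workload
    return "on-board"       # out of range: rely entirely on the device
```

A caller might re-evaluate this per conversation (or per segment) so the device degrades gracefully as the user moves out of network range.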
Claims
1. A device worn by a speech-impaired signer behind both of their hands, approximately mid-torso based on natural hand position for sign-language communication, which tracks both of their hands and records gestures from behind said hands, before said device uses its own in-system database of behind-the-hands-perspective correlative gestural images to map the captured images to standard front-of-hand sign language dictionaries via a machine-learned model trained using multiple similar images per word (i.e., converting the visual analogue data to digital data, then to text-based language), and outputs machine speech through the device via a wireless method such as Wi-Fi to a second person's smartphone or similar device, or via an embedded speaker on said device, to communicate with other people whose primary input/output communication mode is verbal.
2. The device of claim 1, wherein the device can receive voice from another human and translate it to symbolic data displayed on the top of the device via an LCD panel, for the wearer to read on the device.
3. The device of claim 1, in which the text or symbol display on the device is touch-activated, so that sentences or symbol strings can be paused by the user.
4. The touch-screen display of claim 3, by which a user can double-tap an individual word or symbol to request a further definition via the device system's database.
5. The touch-screen display of claim 3, in which, after an individual selects a word or symbol for further definition and receives it, the user can tap the touch-screen device to restart the text so it continues moving.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0108] The following drawings further describe by illustration the advantages and objects of the present invention in various embodiments. Each drawing is referenced by corresponding figure reference characters within the DETAILED DESCRIPTION section to follow.
DETAILED DESCRIPTION AND FIRST EMBODIMENT
[0114] Introduction. The following is a detailed description of exemplary embodiments to illustrate the principles of the invention. The embodiments are provided to illustrate aspects of the invention, but the invention is not limited to any embodiment. The scope of the invention encompasses numerous alternatives, modifications, and equivalents; it is limited only by the claims.
[0116] The interaction in
[0118] As you can see, in
[0119] Our system and method as embodied in a hardware machine detailed by one participant's use as illustrated in
[0120] Based on our system's recognition and translation of individual speech data (at the word or moneme level), combined with faster continual speech output, it can pace directly with a normal human conversation. The average length of an individual's message contribution in an interactive conversation is approximately 30 words, so in this case we show the total time, in seconds, from User 110B's first word input to the last machine speech output at 11 seconds across the timeline 238.
[0121] It is important to note here that our system already has its own internal analogue-to-digital converters, 214 and 222, language translation processors, 216 and 224, and even on-board speech modeling to convert translated text language from 216 and 224 into machine-speech-modeled sound at 218 and 226, respectively. All processors have their own memory for storing recent data input and output, typically for at least a few cycles' worth of recent processing, colloquially referred to as scratch memory. Therefore, without augmenting any individual component or major process, but simply by adding memory at current components, our current system could record user voice data in an overall vocal imprint map (as a database) so that, over time, as the total map is completed, the specific machine speech's voice can be augmented (updated as a more complete model of the user's voice is created over time, with variations across words, monemes, and phonemes capturing the pitch and tone variations specific to their own voiceprint). In this way our system's machine speech on a user's behalf, regardless of output language, can replicate with precision any User's, such as 110B's, exact speech pattern. For example, is the user in
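The vocal imprint map described above can be sketched as a per-unit sample store with a coverage threshold. All names here (`VoiceprintMap`, `add_sample`, `coverage`) are hypothetical, and the 2,000-unit target is borrowed from the "about 2,000 spoken words" figure given later in the OBJECTS OF THE INVENTION section; the disclosure does not specify a data model.

```python
from collections import defaultdict

class VoiceprintMap:
    """Accumulates per-word/per-moneme voice samples into an overall imprint map."""

    def __init__(self, target_units=2000):
        self.samples = defaultdict(list)  # unit (word/moneme) -> feature vectors
        self.target_units = target_units  # assumed coverage goal (~2,000 words)

    def add_sample(self, unit, features):
        # Store one observed pronunciation of a word or moneme, as captured
        # from the existing processors' scratch memory.
        self.samples[unit].append(features)

    def coverage(self):
        # Fraction of the target vocabulary observed so far.
        return min(1.0, len(self.samples) / self.target_units)

    def ready(self):
        # Once the map is sufficiently complete, machine speech output can be
        # voiced with the user's own imprint, regardless of output language.
        return self.coverage() >= 1.0
```

In this sketch the map only counts distinct units; a real model would also merge the repeated samples per unit into pitch/tone statistics.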
[0122] More importantly, at the point where our system has learned User 110B's voiceprint and can use it in its own machine speech to where it sounds like her real voice, as mentioned previously, should User 110B start using our system to talk to her friend Sharat, a native Hindi speaker, then the first time User 110B uses our system with Sharat, as it will have already started modeling its own speech output on the updating model of her voiceprint, its Hindi speech will sound, at the very least, feminine, and closer to 110B's own real voice, from the first time she and Sharat converse through our system. Of course, since all Users in our system are both sender and receiver, and as our system is activated by use, and thus has automated speech recognition and modeling based on its core function, it will develop a more exact voice model for all users, from User 112A to 110B's friend Sharat, once he comes into the system as a conversational member.
[0125] As our system is translating and delivering moneme by moneme, or word by word, as they are being spoken at conversational speed, in this case from User 110B to User 112B, our system uses one audio channel, 312A, for User 110B's speech input, and another audio channel, 312B, for machine speech output in French to 112B, so that neither channel creates noise for the other, since much of the time both User 1 and the machine will be speaking at the same time as User 2 is listening.
[00128] Furthermore, while both the speech input and speech output streams, over audio channel 1, 312A, and audio channel 2, 312B, are spoken and heard as a series of consecutive words in a normal conversational stream, our system processes and delivers one word at a time, and reassembles the output words as speech sequentially in one audio stream to mimic a natural speech pattern and aid reception. For example, User 110B says "people" and our system receives and processes it, 316A, and sends it as speech output ("les gens") 316B, so that User 112B hears "les gens" in under 1 second; then User 110B, who continues speaking normally without pause, says "spend" and our system receives and processes it, 318A, and sends it as speech output ("passant") 318B, so that User 112B hears "passant" approximately 1.5 seconds into the conversation. User 110B conveys one idea, which happens to contain 3 sentences of varying length, and which in this case, measured at a normal rate of conversation, takes 10.5 seconds to complete, as per the timeline 314A. However, as our system already started delivering translated speech to 112B at approximately 1 second from when User 110B started talking, when User 110B finishes talking at approximately 10.5 seconds with the last words, 320A, only a total of 11 seconds have passed between both conversational partners through our system, 110B and 112B, as User 112B hears the last words in the French machine speech stream, "à nouveau," 320B.
Our system's use of parallel (but integrated) processing, from translation through machine speech output, essentially allows its Users, as speakers, to speak directly in another language (through our machine language). Uniquely, there is no waiting gap between the original-language speech segment and the translated speech segment, so a receiver such as 112B can process the meaning of what 110B is saying while she is saying it, and can respond almost immediately as soon as she has finished, in this case without waiting another 11 seconds to hear a translated set of speech after User 110B has spoken. This is how our system uniquely enables true real-time conversation between two individuals speaking two different languages.
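The contrast between per-word streaming delivery and conventional batch delivery can be shown with a toy sketch. The two-entry dictionary mirrors the "people" → "les gens" and "spend" → "passant" example above; the function names and the dictionary-lookup stand-in for a real translation engine are illustrative assumptions.

```python
# Toy word-level dictionary standing in for a real translation engine.
TOY_DICT = {"people": "les gens", "spend": "passant"}

def stream_translate(words):
    # Emit each translated unit as soon as its source unit arrives, so the
    # listener's audio stream starts roughly one word behind the speaker.
    for w in words:
        yield TOY_DICT.get(w, w)

def batch_translate(words):
    # Conventional approach: nothing is delivered until the full message
    # has been spoken and translated, doubling the end-to-end time.
    return [TOY_DICT.get(w, w)]
```

With `stream_translate`, the first translated word is available while the speaker is still producing the rest of the sentence, which is the source of the "no waiting gap" property claimed above.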
[0127] The total time for this process to complete is approximately a few milliseconds, but it is processed discretely, since individual words or monemes must be outputted in sequence at 470 while the User/Speaker continues to speak into the system at 410 and the input continues to be processed live at 412.
[0128] Ergo, within milliseconds of word or moneme one going into our system's Process A through 416 and 418A, the second word or moneme being spoken into the system at 410 is being processed by the system's bridge processor at 412 and converted to digital text at 416, then fed into a second parallel process running in tandem with Process A, namely Process B at 418B. It is the same process as shown for word or moneme one. This means the second word or moneme at 418B is checked against database 420 to see if it is already in the database. If it is not, the text is saved in database storage 422B for later use. If it is in the database, the system then checks to see if the translation is already available at 424B; if it is not, based on the user setting enabled, the text will be translated into a second language of choice at 426B and saved as translated text at step 428B. If the translated text is already available in the system's database 420, it is pulled and received at 432B, and our system then checks to see whether machine-generated output is already saved in storage and available for output at 438B; if so, it is retrieved from database storage at 442B. Then machine speech for this word or moneme is outputted at 436B and placed in the queue for machine-generated audio output at 450. In addition, if the translated text is multiple words, such as when a moneme, or smallest unit of meaning, is more than one word, these multiple words will be reassembled into grammatically correct sequencing at 430B; our system then generates machine audio from that reassembled set at 434B before the machine speech is outputted for the word or word set at 436B and placed into the queue for machine-generated audio at 450.
[0129] This is an ongoing dual process: as word or moneme one and word or moneme two move through our system's processes A and B, respectively, the next words being spoken live through 410 have already moved through bridge processing 412 and been converted to digital text at 416, and then also move through processes A and B; for example, word or moneme 3 and word or moneme 4 from speech input 410 move through process A and process B, respectively. Because our system can dual-process and machine-output words queued at 450 faster than normal human speech at 410, with organic speech pauses, all of these words are queued in sequence at 450 and output as machine speech audio at 470 in a normal speech-pattern flow for a second user to hear in their language, even while the first user is still speaking their message through 410. So when a human speaker completes their conversational speech, which is on average 30 words for a normal conversational contribution, the machine speech output will have completed the same message in the other language, so that User 2 has heard, at 470, the entire spoken message in machine audio within a second after User 1 has spoken it at 410.
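The alternating Process A/Process B arrangement with an ordered output queue can be sketched with two worker threads. This is only an analogy under stated assumptions: a dictionary lookup stands in for per-unit translation, and `ThreadPoolExecutor.map` supplies the ordering guarantee that the queue at 450 provides in the description.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_pipeline(words, translate, workers=2):
    """Translate units on two concurrent workers while preserving input order."""
    # Two workers stand in for Process A and Process B: alternating units are
    # translated concurrently, but Executor.map returns results in input
    # order, matching the sequenced machine-audio queue (450) before output (470).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(translate, words))
```

The key property is that parallelism speeds up per-unit processing without ever reordering the listener's audio stream.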
[0130] To avoid generating noise when providing machine speech output at 470 while the human speaker is speaking into our system at 410 almost simultaneously, our system, as noted, always utilizes two separate audio channels: audio channel one for input at 410 and 412, and audio channel two for output at 450 and 470.
OBJECTS OF THE INVENTION
[0131] Introduction. Accordingly, several advantages and objects of the present invention are:
[0132] A principal object of the present invention is to provide a means for two individuals speaking two different native languages to converse in real time without the typical translation or delivery gaps, through the use of machine segmentation and component processing, translation, and conversion into translated audio while the original-language message is still being spoken, so that when one person finishes speaking in one language, the receiver has already heard their message, via machine speech, in their own language, and uniquely has already had time to process (mentally) its linguistic meaning so they can respond immediately, just as in a normal face-to-face human conversation.
[0133] As the primary and ideal channel is via mobile device, and as the system (by eliminating the processing drag and time gap of batch processing, as is standard with all current translation technology) can be used uniquely by two individuals at the same time, as opposed to the standard linear sequence of input, translation, and delivery followed by the receiver having to use their own system to do the same in return, our system can be automatically closed to all except the two members in the conversation, so that all information can be automatically encrypted and decrypted, though this is not apparent to either user in the conversation. If transmitted via a wireless or wired channel, even through a translation engine, any in-transit information will be undecipherable to a person who has access to a public point, such as a translation engine server, but the receiver within our system hears speech in their own language without having to manually decrypt the data as it is delivered from the system to that single reception point.
[0134] Another object of the invention, based on its pre-componentized speech monemes or words, is to be able to repurpose those audio components to learn, by placing each unique component in a voice model mesh/database, the specific and unique imprint of its user. This way, over time, once the system has received enough data points (about 2,000 spoken words), it can model its own machine speech on behalf of the User to sound perceptually replicated, and that model can be used as the machine's speech when delivering translated speech to individuals in whichever language User 1 chooses, but in their voice.
[0135] In addition, once a User voice imprint model is complete, it can be used as a rapid filter to capture pitch, tonal, and pacing changes unique to a User's specific conversation, to be able to transmit those emotional or paralinguistic cues or messages to a receiver at the time of this conversation.
[0136] Furthermore, based on the general processes of our system and the ease of repurposing the componentization necessary for its translation, the unique tonal and pitch characteristics of a User's voice imprint model can be used as an always-on reverse filter. Should the system be used in a mobile computing environment, it can thereby also be used to eliminate ambient noise in public areas, should a User desire to have a close face-to-face conversation that necessitates speaking at a distance from a mobile phone while holding it equidistant from a conversational partner so that both can hear and input as well.