Automated systems and methods for providing bidirectional parallel language recognition and translation processing with machine speech production for two users simultaneously to enable gapless interactive conversational communication

20190354592 · 2019-11-21

    Abstract

    A novel system and multi-device invention that provides a means to communicate in real time (conversationally) between two or more individuals, regardless of each individual's preferred or limited mode of transmission or receipt (by gesture; by voice, whether in Mandarin, German, or Farsi; by text in any major language; and, via machine learning, eventually by dialect).

    Systems and methods for conversational communication between two individuals using multiple language modes (e.g., visual language and verbal language) through the use of a worn device (for hands-free language input capability) are provided. Information may be stored in memory regarding user preferences, as well as various language databases (visual, verbal, and textual), or the system can determine the preferences and modes of the primary and secondary users based on direct input, and adapt accordingly. Core processing for the worn device can be performed (1) off-device via cloud processing through wireless transmission, (2) on-board, or (3) as a mix of both, depending on the embodiment and the location of use, for example if the user is out of range of a high-speed wireless network and needs to rely more on on-board processing, or to maintain conversational-speed, dual/real-time translation and conversion.
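
    As a non-limiting illustration of the processing split described above, the following minimal sketch chooses between cloud, on-board, and mixed processing; the bandwidth threshold, function names, and enum values are assumptions for illustration only, not the system's actual logic.

```python
# Illustrative sketch only: choosing where core processing runs, as described
# in the abstract. The network test and threshold are assumptions.
from enum import Enum

class ProcessingMode(Enum):
    CLOUD = "off-device cloud processing"
    ONBOARD = "on-board processing"
    HYBRID = "mix of cloud and on-board"

def choose_mode(link_mbps: float, onboard_capable: bool) -> ProcessingMode:
    if not onboard_capable:
        return ProcessingMode.CLOUD
    if link_mbps >= 10.0:               # assumed threshold for conversational speed
        return ProcessingMode.HYBRID
    return ProcessingMode.ONBOARD       # out of high-speed range: rely on-board

print(choose_mode(link_mbps=2.5, onboard_capable=True))
```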

    Claims

    1. A device worn by a speech-impaired signer behind both of their hands, approximately mid-torso, based on natural hand position for sign-language communication, which tracks both hands and records gestures from behind said hands, before said device uses its own in-system database of behind-the-hands-perspective correlative gestural images to map the captured images to standard front-of-hand sign language dictionaries via a machine-learned model trained on multiple similar images per word (i.e., converting the visual analogue data to digital data and then to text-based language), and outputs the result as machine speech, either from the device via a wireless method such as Wi-Fi to a second person's smartphone or similar device, or via an embedded speaker on said device, to communicate with other people whose primary input/output communication mode is verbal.

    2. The device of claim 1, wherein the device can receive voice from another human and translate it to symbolic data displayed on the top of the device via an LCD panel, for the wearer to read.

    3. The device of claim 1, in which the text or symbol display on the device is touch-activated, so that sentences or symbol strings can be paused by the user.

    4. The touch-screen display of claim 3, by which a user can double-tap an individual word or symbol to request a further definition from the device system's database.

    5. The touch-screen display of claim 3, in which, after an individual selects a word or symbol for further definition and receives it, the individual can tap the touch-screen to restart the text so that it continues moving.
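
    Purely as a non-limiting illustration of the gesture-to-speech pipeline recited in claims 1 and 2, the following sketch maps a behind-the-hands gesture image to a standard dictionary word and emits it as speech; the classifier, dictionary entries, and speak function are hypothetical placeholders, not the claimed device's actual components.

```python
# Illustrative sketch only of the pipeline named in claims 1-2: a behind-the-
# hands gesture image is classified, mapped to a front-of-hand dictionary
# entry (text), and emitted as machine speech. classify_gesture() and speak()
# are hypothetical stand-ins; no actual sign-recognition model is specified.
from typing import Callable

# behind-the-hands gesture label -> front-of-hand dictionary word (claim 1)
CORRELATIVE_DICTIONARY: dict[str, str] = {
    "gesture_0017": "hello",
    "gesture_0042": "thank you",
}

def gesture_to_speech(image: bytes,
                      classify_gesture: Callable[[bytes], str],
                      speak: Callable[[str], None]) -> str:
    label = classify_gesture(image)                 # machine-learned model (claim 1)
    word = CORRELATIVE_DICTIONARY.get(label, "<unknown sign>")
    speak(word)                                     # embedded speaker or paired device
    return word

# Hypothetical stand-ins so the sketch runs end to end.
word = gesture_to_speech(b"raw-camera-frame",
                         classify_gesture=lambda img: "gesture_0017",
                         speak=lambda text: print(f"[machine speech] {text}"))
```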

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    [0108] The following drawings further describe by illustration the advantages and objects of the present invention in various embodiments. Each drawing is referenced by corresponding figure reference characters within the DETAILED DESCRIPTION section to follow.

    [0109] FIG. 1 is a diagram showing two individuals who speak different languages using our systems and methods, with input speech recognition, translation, and machine speech output running in tandem on similar timelines so as to enable a gapless conversational loop.

    [0110] FIG. 2 is a block diagram depicting one embodiment of our systems and methods with one of its users to show how our hardware machine simultaneously converts input to output as the User is speaking.

    [0111] FIG. 3A is a diagram depicting one embodiment of our systems and methods, showing how User 1's speech input is serialized and how our system receives speech input while delivering, through machine speech, translated speech output to a receiver, User 2, in their language while the initial message is still being spoken by User 1 in User 1's language.

    [0112] FIG. 3B is a diagram depicting the same embodiment of our systems and methods, showing how User 2 becomes a sender from within our same system as User 1, responding spontaneously in their language while User 1 receives, via our system's machine speech, the message in User 1's language, mirroring the same processes as shown in the prior figure, but from the second participant's point of view.

    [0113] FIG. 4 is a flowchart depicting an embodiment of the serialized speech parallel process from human input to translated machine speech output, 400, focusing on the processing of the first and second words or monemes in a User's longer, ongoing spoken message.

    DETAILED DESCRIPTION AND FIRST EMBODIMENT

    [0114] Introduction. The following is a detailed description of exemplary embodiments to illustrate the principles of the invention. The embodiments are provided to illustrate aspects of the invention, but the invention is not limited to any embodiment. The scope of the invention encompasses numerous alternatives, modifications, and equivalents; it is limited only by the claims.

    [0115] FIGS. 1-4 illustrate the core scenario in which user devices 105A and 105B facilitate conversational (i.e., interactive, direct, and gapless) communication between two users speaking different source languages, with our system speaking in tandem with each user as a live speech translator in the target language of the other user at that point in time.

    [0116] The interaction in FIGS. 1-4 is described with reference to two primary users who, in a conversational loop through our system, are at times both sender and receiver in what would, in a standard model of communication, be a linear point-to-point route. The user devices 105A and 105B can be, for example, a cellular or smart phone, a desktop or laptop computer, or even our own manufactured device, since these devices are merely entry points into our system. These user devices allow immediate access to our system anywhere in the world; the primary requirements for them are an embedded speaker and microphone, a wireless transmitter, and access to power if they lack their own source, as the components for much of our system's core unique functionality are typically, though not necessarily, proprietary hardware.

    [0117] FIG. 1 is an illustration showing the general communication advantage of our system and method in enabling a continual conversational loop between two or more people, from any location, and its parallel processing enabling two in-tandem, concurrent speech streams, that is, original-language input and target-language output, with approximate time durations for both individuals' contributions to the first conversational loop.

    [0118] As shown in FIG. 1, User 1, 110A, is holding a terminal for input into our system, and her original speech signal is shown going from left to right as data stream 114. However, because our system serializes and dual-processes (recognizes, text-translates, and converts to translated speech via machine speech) every word separately as it is being spoken, and begins delivering each word as speech while the following word is being spoken in the originating language, the second data stream, 116, is shown active concurrently with 114. So, if User 1, 110A, starts speaking into terminal 105A at the zero point, the system is already translating the first input word by 0.5 seconds and completes its translation by 1 second, which is why the machine-speech output data stream 116 starts at 0.5 seconds. User 1, 110A, continues speaking, and our system continues to parse, identify, convert, translate, and then produce translated speech output for the second word on a second processing stream, as the first word has already been delivered, and the third word User 1, 110A, is speaking is already being translated, and so on until User 1, 110A, completes whatever she wants to say, from sentence to paragraph length. As for a ceiling on the length of speech translation with our system, a User could even deliver a 10-minute Shakespearian monologue: the processing strain, spread across two sets of processors at one word per cycle, is much lower than that of a single batch of 30 words as with prior speech translation systems, so the system can run almost indefinitely, constrained only by terminal power, human physical endurance, and the attention span of a listener, e.g., User 2, 112A. Please note that 114 ends at 10.5 seconds, whereas 116 ends at 11 seconds. This 0.5-1 second offset is the approximate lag time between User 1 speaking a given word and our system's machine speech output of that translated word being delivered to User 2, 112A. From a human communication framework, this is a nominal, almost imperceptible lag between input and output, and this is why User 2, 112A, can respond immediately to User 1, 110A: our system has made it possible for him to (mentally) process the content or meaning of User 1's speech, as translated speech, while User 1 has been speaking, as opposed to prior speech translation systems where he would have to wait to hear a separate translated speech segment after another User has spoken in a different language.
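
    Purely for illustration, the following minimal sketch shows the overlapped, word-serialized timing described above, with one thread translating each word while the speaker keeps talking; the timing constants and the translation placeholder are assumptions for illustration, not measurements or implementations of our system.

```python
# Minimal sketch of word-serialized, overlapped input/output timing.
# The per-word translation delay and speaking rate are assumed values.
import queue
import threading
import time

def translator_worker(in_q: queue.Queue, out_q: queue.Queue) -> None:
    """Translate each word as soon as it arrives, while input continues."""
    while True:
        word = in_q.get()
        if word is None:                      # sentinel: speaker has finished
            out_q.put(None)
            return
        time.sleep(0.3)                       # assumed per-word translation time
        out_q.put(f"<{word} translated>")     # stands in for translated machine speech

def run_conversation(words):
    in_q, out_q = queue.Queue(), queue.Queue()
    threading.Thread(target=translator_worker, args=(in_q, out_q), daemon=True).start()
    start = time.time()
    for w in words:                           # speaker produces roughly 1 word per 0.35 s
        in_q.put(w)
        time.sleep(0.35)
    in_q.put(None)
    while True:                               # listener hears output while input continues
        out = out_q.get()
        if out is None:
            break
        print(f"{time.time() - start:4.1f}s  {out}")

if __name__ == "__main__":
    run_conversation("people spend a lot of time tidying things".split())
```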

    [0119] Our system and method as embodied in a hardware machine are detailed through one participant's use as illustrated in FIG. 2, which shows a User 1, 110B, speaking into a mobile terminal. Our system receives this analogue speech input data into a Bridge Processor, 210, as a stream, and parcels out each word, in the order received, alternately to two different sub-systems in order to continually convert and deliver translated machine speech at the pace of the originating sentence, so that it can be received as fluid conversational speech rather than individual words separated by pauses. Therefore, when a speaker, here shown as 110B, speaks into the terminal, the analogue sound data passes through bridge processor 210; the first word is then sent to our system's Analogue-to-Digital processor (hereafter referred to as the A/D Processor), where it is converted into digital data, then to processor 216, where it is converted to text in the originating language and then translated to text in the second, target language; this text is then converted to audio through our machine speech processor 218, before going to our second bridge processor 228, which reassembles an analogue audio stream as each word is received, and the stream is then sent off-terminal through the wireless transmitter 230. But as word one, 212, is moving through this processing, word two, 220, is already being sent from Bridge Processor 210 to A/D processor 222, translation processor 224, and text-to-machine-speech conversion at processor 226, then through processor 228, and becomes the second word in the stream that wireless transmitter 230 has already begun sending. And so on: as word 3, 232, is sent from the input bridge processor, 210, word 4, 234, is moving through bridge processor 210 en route to A/D processor 222. This bridge-parceled dual processing of individually parceled word data, and the processed and reassembled data stream, continue, as shown in box 236, for as long as User 110B continues to talk.
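
    As an illustration of the alternating dispatch just described, the following sketch hands words to two identical sub-pipelines and reassembles the results in order; the function bodies are placeholders standing in for the A/D, translation, and machine-speech processors, not their actual implementations.

```python
# Minimal sketch of the bridge-processor dispatch described above: words are
# alternately handed to two identical sub-pipelines (analogous to 214/216/218
# and 222/224/226) and reassembled in input order (analogous to 228).
# All function bodies are illustrative placeholders.
from concurrent.futures import ThreadPoolExecutor

def sub_pipeline(word: str) -> str:
    digital = word.lower()                 # stands in for A/D conversion (214/222)
    translated = f"tr({digital})"          # stands in for text translation (216/224)
    return f"audio[{translated}]"          # stands in for machine-speech synthesis (218/226)

def bridge_process(words: list[str]) -> list[str]:
    # Two workers play the roles of the two parallel sub-systems; the bridge
    # alternates words between them and reassembles results in input order.
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(sub_pipeline, w) for w in words]
        return [f.result() for f in futures]   # in-order reassembly (228)

print(bridge_process(["people", "spend", "a", "lot", "of", "time"]))
```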

    [0120] Based on our system's recognition and translation of individual speech data (at the word or moneme level) and its faster, continual speech output, it can pace directly with a normal human conversation. The average length of an individual's message contribution in an interactive conversation is approximately 30 words, so in this case we show the total time, in seconds, from User 110B's first word of input to the last machine speech output as 11 seconds across the timeline 238.

    [0121] It is important to note here that our system already has its own internal analogue-to-digital converters, 214 and 222, language translation processors, 216 and 224, and even on-board speech modeling to convert translated text language from 216 and 224 to machine-speech-modeled sound at 218 and 226, respectively, and that all processors have their own memory for storing recent data input and output, typically for at least a few cycles' worth of recent processing, colloquially referred to as scratch memory. Our current system, without augmenting any individual component or major process, and with simply added memory at current components, could therefore record user voice data in an overall vocal imprint map (as a database), so that over time, as the total map is completed, the specific machine speech voice can be augmented (updated as a more complete model of the user's voice is created over time, with variations across words, monemes, and phonemes capturing pitch and tone variations specific to their own voiceprint), so that our system's specific machine speech on their behalf, regardless of output language, can replicate with precision any User's, such as 110B's, exact speech pattern. For example, if the user in FIG. 2, 110B, speaks in English, idiosyncratic qualities of her voice can be gleaned from every (English) word she speaks, and over a few hundred 30-40 word contributions, regardless of translated output language and speech, a close model of User 110B's voice will be stored in our system's own database attached to 218 and 226, or optionally with bridge processor 228 should a User opt for speech-to-speech vocal modulation. Either way, even with an ongoing conversational partner through our system, such as User 112A, who will be shown in the next Figure to be a French speaker, User 110B's own voice imprint model will be perfected over time independently of how it is spoken post-translation. Over several conversations with User 112A, that User will recognize a shift in the general tone, pitch, and other related vocal characteristics of our system's machine speech (on behalf of User 110B), as it more closely resembles 110B's real vocal pattern. To that User, it will sound as if User 110B is speaking French fluently in her own voice, but of course, unless he speaks to User 110B in person he will not recognize it as her real voice, and unless he can also speak English, he will not realize that User 110B is not a native French speaker until a face-to-face meeting.
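
    By way of illustration only, the following sketch accumulates a per-user vocal imprint map from per-word observations, roughly as described above; the feature used here (a single pitch estimate per word) is an assumed simplification, not the system's actual voice-modeling method.

```python
# Illustrative sketch only: accumulating a per-user "vocal imprint map" from
# per-word observations. Mean pitch per word is a placeholder feature.
from collections import defaultdict
from statistics import mean

class VoiceImprintMap:
    def __init__(self) -> None:
        # word -> list of observed pitch estimates (Hz) for that word
        self.samples: dict[str, list[float]] = defaultdict(list)

    def add_observation(self, word: str, pitch_hz: float) -> None:
        self.samples[word].append(pitch_hz)

    def words_observed(self) -> int:
        return sum(len(v) for v in self.samples.values())

    def profile(self) -> dict[str, float]:
        # A crude per-word pitch profile; a real model would capture far more.
        return {w: mean(p) for w, p in self.samples.items()}

imprint = VoiceImprintMap()
for word, pitch in [("people", 212.0), ("spend", 208.5), ("people", 215.2)]:
    imprint.add_observation(word, pitch)
print(imprint.words_observed(), imprint.profile())
```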

    [0122] More importantly, once our system has learned User 110B's voiceprint and can use it in its own machine speech to the point where it sounds like her real voice, as mentioned previously, should User 110B start using our system to talk to her friend Sharat, a native Hindi speaker, then from the first time User 110B uses our system with Sharat, because it will have already started modeling its own speech output on the updating model of her voiceprint, its Hindi speech will sound, at the very least, feminine, and closer to 110B's own real voice from the first time she and Sharat converse through our system. Of course, since all Users in our system are both sender and receiver, and as our system is activated by use, and thus performs automated speech recognition and modeling as part of its core function, it will develop a more exact voice model for all users, from User 112A to 110B's friend Sharat, once he enters the system as a conversational member.

    [0123] FIGS. 3A and 3B show the conversational loop of our system, depicted broadly but in whole in FIG. 1, specifically from a word-by-word perspective of speech input, translation, and conversion of speech to output.

    [0124] FIG. 3A shows User 110B, as in FIG. 2, initiating a conversation through our system with User 112B. User 110B says, as shown in block 310A, "people spend a lot of time tidying things, but they never seem to spend time muddling them. Things just seem to get in a muddle by themselves. And then people have to tidy them up again." This is translated and spoken by our system in machine speech to User 112B in his native language, French, as "les gens passent beaucoup de temps à ranger les choses mais ils ne semblent jamais passer du temps à les embrouiller. Les choses semblent simplement se brouiller par elles-mêmes. Et puis les gens doivent les ranger à nouveau.", as shown in block 310B.

    [0125] As our system is translating and delivering moneme by moneme, or word by word, as the words are being spoken at conversational speed, in this case from User 110B to User 112B, our system uses one audio channel, 312A, for User 110B's speech input, and another audio channel, 312B, for machine speech output in French to 112B, so that neither channel creates noise for the other, as much of the time both User 1 and the machine will be speaking at the same time as User 2 is listening.

    [00128] Furthermore, while both the speech input and speech output streams, over audio channel 1, 312A, and audio channel 2, 312B, are spoken and heard as a series of consecutive words in a normal conversational stream, our system processes and delivers one word at a time, and reassembles the output words as speech sequentially in one audio stream to mimic a natural speech pattern and aid reception. For example, User 110B says "people" and our system receives and processes it, 316A, and sends it as speech output ("les gens") 316B, so that User 112B hears "les gens" in under 1 second; then User 110B, who continues speaking normally without pause, says "spend" and our system receives and processes it, 318A, and sends it as speech output ("passent") 318B, so that User 112B hears "passent" approximately 1.5 seconds into the conversation. User 110B conveys one idea, which happens to take 3 sentences of varying length to accomplish, and which in this case, measured at a normal rate of conversation, takes 10.5 seconds to complete, as per the timeline 314A. However, as our system already started delivering translated speech to 112B at approximately 1 second after User 110B started talking, when User 110B finishes talking at approximately 10.5 seconds, with the last word, 320A, being "again," only a total of 11 seconds have passed between both conversational partners through our system, 110B and 112B, as User 112B hears the last words in the French machine speech stream, "à nouveau," 320B. Our system, by use of parallel (but integrated) processing from translation through machine speech output, essentially allows its Users, as speakers, to speak directly in another language (through our machine speech); so, uniquely, there is no waiting gap between the original-language speech segment and the translated speech segment, and a receiver such as 112B can process the meaning of what 110B is saying while she is saying it, and so can respond almost immediately as soon as she has finished, in this case without waiting another 11 seconds to hear a translated set of speech after User 110B has spoken. This is how our system uniquely enables true real-time conversation between two individuals speaking two different languages.

    [00129] FIG. 3B is an illustration showing the second part of the conversational loop our system enables, as described in FIG. 1, and shows a French-speaking User, 112C, directly responding to User 110B, here labeled 110C as the user is shown as a receiver/listener instead of a speaker/sender as in FIG. 3A. To show a direct response, the timeline 320A starts at 11.5 seconds into the conversation, more specifically, into this complete loop within a larger conversation. This loop lasts 21.5 seconds in total and comprises both Users as speaker/sender and listener/receiver. The figure also shows User 112C having his own dedicated audio channel, 312B, over which he had heard French machine speech from our system in FIG. 3A, and over which he now speaks into our system, shown here directionally from right to left over audio channel 2, 312B. Our system works exactly the same as shown in FIG. 3A, with User 112C's first spoken word, "Oui," spoken at 11.5 seconds (from the start of the overall conversation as depicted in FIG. 3A), and received as "yes" from our system's machine speech by the 12th second of the conversation by User 110C. For ease of comprehension, we have listed User 112C's own speech from left to right, as is the western convention, while the timeline 320A counts up, from right to left, from 11.5 seconds to 21.5 seconds, when User 112C completes his last word, "moi," 324A, which is heard through our system's machine speech as "me," in English, by User 110C, 324B.

    [0126] FIG. 4 is a flowchart depicting an embodiment of the serialized speech parallel process from human input to translated machine speech output, 400, focusing on the processing of the first and second words or monemes in a User's longer, ongoing spoken message, such as User 110 in FIG. 1 communicating with second User 112. In our system and method, when a user speaks into our system, the speech is received live at 410, passes through one audio channel into a bridge processor at 412, and is immediately entered into the translation process 400 while the User is still speaking. So, as the User is speaking into the system at 410 and this audio is being captured by the bridge processor, that processing is continual, flowing from 412 to 416, where it is converted from analogue speech data to digitized text. From there, the first word or moneme (depending on the smallest unit of meaning for the input or output language) enters process A at 418A, where it is checked against database 420 to see if it is already in the database. If it is not, the text is saved in database storage 422A for later use. If it is in the database, then the system checks to see if the translation is already available at 424A; if it is not, based on the user setting enabled, the text is translated into the second language of choice at 426A and saved as translated text at step 428A. If the translated text is available in the system's database 420, it is pulled and received at 432A, and our system then checks at 438A whether machine-generated output is already saved in storage and available for output; if so, it is retrieved from database storage at 442A. Then, for this first word or moneme, machine speech is outputted at 436A and placed in the queue for machine-generated audio output at 450. In addition, if the translated text is multiple words, that is, if the smallest unit of meaning, such as a moneme, is more than one word, these multiple words are reassembled into grammatically correct sequencing at 430A; our system then generates machine audio from that reassembled set at 434A before our system's machine speech is outputted for the first word or word set at 436A and placed into the queue for machine-generated audio at 450.
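
    The following is a minimal sketch of the per-word lookup-and-cache flow described for process A (418A through 442A); the translate() and synthesize() calls are hypothetical stand-ins for the system's translation and machine-speech components, and the in-memory dictionaries stand in for database 420.

```python
# Illustrative sketch of the per-word lookup-and-cache flow of process A.
# translate() and synthesize() are placeholders, not actual components.
text_cache: set[str] = set()                 # words already seen (420/422A)
translation_cache: dict[str, str] = {}       # source word -> translated text (428A)
audio_cache: dict[str, bytes] = {}           # translated text -> synthesized audio (442A)

def translate(word: str, target_lang: str) -> str:
    return f"{word}[{target_lang}]"          # placeholder translation (426A)

def synthesize(text: str) -> bytes:
    return text.encode("utf-8")              # placeholder machine speech (434A)

def process_word(word: str, target_lang: str) -> bytes:
    if word not in text_cache:
        text_cache.add(word)                 # save new text for later use (422A)
    if word not in translation_cache:        # translation already available? (424A)
        translation_cache[word] = translate(word, target_lang)
    translated = translation_cache[word]     # pull translated text (432A)
    if translated not in audio_cache:        # machine output already saved? (438A)
        audio_cache[translated] = synthesize(translated)
    return audio_cache[translated]           # ready for the output queue (450)

output_queue = [process_word(w, "fr") for w in ["people", "spend", "people"]]
print(output_queue)
```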

    [0127] The total time for this process to be completed is approximately a few milliseconds, but it is discretely processed, since individual words or monemes have to be outputted in sequence at 470 while the User/Speaker continues to speak into the system at 410 and the speech continues to be processed live at 412.

    [0128] Ergo, within milliseconds of word one or moneme one entering our system's process A through 416 and 418A, the second word or moneme being spoken into the system at 410 is being processed by the system's bridge processor at 412 and converted to digital text at 416, and then enters a second parallel process running in tandem with Process A, namely Process B at 418B. It is the same process as shown for word or moneme one. This means the second word or moneme at 418B is checked against database 420 to see if it is already in the database. If it is not, the text is saved in database storage 422B for later use. If it is in the database, then the system checks to see if the translation is already available at 424B; if it is not, based on the user setting enabled, the text is translated into the second language of choice at 426B and saved as translated text at step 428B. If the translated text is available in the system's database 420, it is pulled and received at 432B, and our system then checks at 438B whether machine-generated output is already saved in storage and available for output; if so, it is retrieved from database storage at 442B. Then, for this second word or moneme, machine speech is outputted at 436B and placed in the queue for machine-generated audio output at 450. In addition, if the translated text is multiple words, that is, if the smallest unit of meaning, such as a moneme, is more than one word, these multiple words are reassembled into grammatically correct sequencing at 430B; our system then generates machine audio from that reassembled set at 434B before our system's machine speech is outputted for the second word or word set at 436B and placed into the queue for machine-generated audio at 450.

    [0129] This is an ongoing dual process: as word or moneme one and word or moneme two are moving through our system's processes A and B, respectively, the next words being spoken live through 410 have already moved through bridge processing 412 and been converted to digital text at 416, and then also move through processes A and B; for example, word or moneme 3 and word or moneme 4 from speech input 410 move through process A and process B, respectively. Because our system can dual-process and machine-output the words queued at 450 faster than normal human speech at 410, with its organic speech pauses, all of these words, in sequence, are queued in order at 450 and output as machine speech audio at 470 in a normal speech pattern flow for a second user to hear in their language, while the first user is still speaking their message through 410. So, when a human speaker completes their conversational speech, which averages 30 words over about 30 seconds for a normal conversational contribution, the machine speech output will have completed the same message in the other language, and User 2 will have heard, at 470, the entire spoken message in machine audio within a second after User 1 has spoken it at 410.
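
    As an illustration of the in-order output queue (450) just described, the following sketch releases words strictly in the original sequence even when the two parallel word processes finish out of order; the heap-based buffer here is an illustrative choice, not the patented design.

```python
# Minimal sketch of the in-order output queue (450): results from the two
# parallel word processes (A and B) may finish out of order, but are released
# for audio output (470) strictly in the original word sequence.
import heapq

def ordered_output(results: list[tuple[int, str]]) -> list[str]:
    """results: (sequence_number, synthesized_word) pairs, possibly out of order."""
    buffer: list[tuple[int, str]] = []
    released: list[str] = []
    next_seq = 0
    for seq, audio_word in results:
        heapq.heappush(buffer, (seq, audio_word))
        # Release every word whose turn has come, keeping speech flow natural.
        while buffer and buffer[0][0] == next_seq:
            released.append(heapq.heappop(buffer)[1])
            next_seq += 1
    return released

# Word 1 (seq 0) finished after word 2 (seq 1) here, yet output stays in order.
print(ordered_output([(1, "gens"), (0, "les"), (2, "passent")]))
```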

    [0130] To avoid generating noise when providing machine speech output at 470 while the human speaker is speaking into our system at 410 almost simultaneously, we have noted that our system always utilizes two separate audio channels: audio channel one for input at 410 and 412, and audio channel two for output at 450 and 470.

    OBJECTS OF THE INVENTION

    [0131] Introduction. Accordingly, several advantages and objects of the present invention are:

    [0132] A principal object of the present invention is to provide a means for two individuals speaking two different native languages to converse in real time, without typical translation or delivery gaps, through the use of machine segmentation and component processing, translation, and conversion into translated audio while the original-language message is still being spoken, so that when one person finishes speaking in one language, the receiver has already heard their message, via machine speech, in their own language and, uniquely, has already had time to process (mentally) its linguistic meaning, so they can respond immediately, just as in a normal face-to-face human conversation.

    [0133] As the primary and ideal channel is via mobile device, and as the system, by eliminating the processing drag and time gap of batch processing that is standard with all current translation technology, can also be used by two individuals at the same time, as opposed to the standard linear sequence of input, translation delivery, and then the receiver having to use their own system to do the same in return, our system can be automatically closed to all except the two members in the conversation. All information can thus be automatically encrypted and decrypted, though this is not apparent to either user in a conversation, when transmitted via a wireless or wired channel, even through a translation engine; any in-transit information will be undecipherable to a person who has access to a public point, such as a translation engine server, while the receiver within our system hears speech in their own language without having to manually decrypt the data as it is delivered from the system to that single reception point.
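
    By way of illustration only, since no cipher or key-exchange scheme is specified above, the following sketch shows a shared symmetric key protecting in-transit word data between the two conversation members, using the third-party "cryptography" package as an assumed stand-in.

```python
# Illustrative sketch only: a shared symmetric key (assumed to be
# pre-established between the two conversation members) protects in-transit
# word data, so an intermediate public point sees only ciphertext.
# Requires the third-party "cryptography" package (pip install cryptography).
from cryptography.fernet import Fernet

shared_key = Fernet.generate_key()        # in practice, established per conversation
channel_cipher = Fernet(shared_key)

def send_word(word: str) -> bytes:
    return channel_cipher.encrypt(word.encode("utf-8"))        # sender side

def receive_word(ciphertext: bytes) -> str:
    return channel_cipher.decrypt(ciphertext).decode("utf-8")  # receiver side

in_transit = send_word("les gens")        # undecipherable at any public relay
print(receive_word(in_transit))           # the receiver hears/reads plaintext
```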

    [0134] Another object of the invention, based on its pre-componentized speech monemes or words, is to repurpose those audio components to learn, by placing each unique component in a voice model mesh/database, the specific and unique vocal imprint of its user. This way, over time, once the system has received enough data points (about 2,000 spoken words), it can model its own machine speech on behalf of the User so that it sounds perceptibly like a replica of the User's voice, and that model can be used as the machine's speech when delivering translated speech to individuals in whatever language User 1 chooses, but in User 1's own voice.

    [0135] In addition, once a User's voice imprint model is complete, it can be used as a rapid filter to capture pitch, tonal, and pacing changes unique to a specific conversation, so that those emotional or paralinguistic cues or messages can be transmitted to a receiver at the time of that conversation.

    [0136] Furthermore, based on the general processes of our system and the ease of repurposing the componentization necessary for its translation, the unique tonal and pitch characteristics of a User's voice imprint model can be used as an always-on reverse filter. Should the system be used in a mobile computing environment, this filter can also be used to eliminate ambient noise in public areas, should a User desire to have a close, face-to-face conversation that requires speaking at a distance from a mobile phone while holding it equidistant from a conversational partner so that it can both capture input and be heard.