Artificial Intelligence Communication with Caller and Real-Time Transcription and Manipulation Thereof
20180054507 · 2018-02-22
Inventors
- Adam Bentitou (New York, NY, US)
- David Mansfield (New York, NY, US)
- Robert Lippman (East Hampton, NY, US)
CPC classification
H04L43/10
ELECTRICITY
H04M3/527
ELECTRICITY
H04W4/14
ELECTRICITY
International classification
H04W4/14
ELECTRICITY
Abstract
Receiving a telephone call to an auto-attendant, artificial intelligence, or person takes place. While this phone call is being conducted, a speech to text transcription is created and sent in real-time to another person at another network node. This person can read the transcript and interact with the phone call by sending his or her own commands, text, or speech to be made part of the phone call.
Claims
1. A method of receiving and processing a telephone call, comprising the steps of: receiving a phone call at a first network node; using speech recognition, creating a transcription of audio of said telephone call; while creating said transcription of audio of said telephone call, sending said transcription to a bidirectional transceiver at a second network node in real-time; determining a desired disposition of the call based on conversations between artificial intelligence and the calling party; and receiving instructions from said bidirectional transceiver to assist or instruct the artificial intelligence in responding to the calling party or in determining the desired disposition of the call.
2. The method of claim 1, wherein, after receiving said phone call at said first network node, having a conversation with a calling party of said telephone call using text to speech synthesis; and wherein text of said text to speech synthesis is used in said transcription.
3. The method of claim 1, wherein, after receiving said phone call at said first network node, having a conversation with a calling party of said telephone call using pre-recorded audio; and wherein a transcript of said pre-recorded audio is stored before said telephone call is made and used in said transcription.
4. The method of claim 1, wherein audio of said telephone call is played at said bidirectional transceiver in real-time, before said step of receiving instructions from said bidirectional transceiver to send said telephone call to said second network node.
5. The method of claim 1, wherein said transcription of audio continues after said phone call is sent to said second network node.
6. The method of claim 1, further comprising a step of sending audio of said phone call to a third network node while said call is sent to said second network node.
7. A method of receiving and processing a telephone call, comprising the steps of: receiving a phone call at a first network node; using speech recognition, creating a transcription of audio of said telephone call; while creating said transcription of audio of said telephone call, sending said transcription to a bidirectional transceiver at a second network node in real-time; during said phone call, transmitting a message that includes audio output of at least one of text to speech synthesis or pre-recorded audio to a calling party, based on information provided by or obtained from the calling party and instructions received from said bidirectional transceiver receiving said transcription; and directing the call or responding to the calling party with the message based on the information provided by or obtained from the calling party and instructions received from said bidirectional transceiver.
8. The method of claim 7, wherein said speech recognition determines that said calling party wants to schedule a meeting, and said instructions received from said bidirectional transceiver include a date and time for said meeting.
9. The method of claim 7, wherein said instructions received from said bidirectional transceiver indicate that a called party is unavailable, and a proposed time for said called party to place a new telephone call to said calling party, said instructions further comprising said proposed new time.
10. The method of claim 7, wherein said instructions include playing audio during said telephone call, based on input into said bidirectional transceiver.
11. The method of claim 7, wherein said bidirectional transceiver, while receiving said transcription, sends instructions to said first network node to end said telephone call; and said telephone call is disconnected from said first network node.
12. The method of claim 7, wherein said bidirectional transceiver, while receiving said transcription, sends instructions to said first network node to forward said phone call to a third party; during said phone call, audio is transmitted to said calling party, indicating said phone call is being transferred or answered; and said telephone call is forwarded from said first network node to a bidirectional transceiver associated with said third party.
13. The method of claim 7, wherein while creating said transcription, importance or urgency is detected by a device at said first network node, and said telephone call is forwarded from said first network node to said bidirectional transceiver in response to the detected importance or urgency.
14. A telephone switch comprising at least one telephone network node and at least one network connection with a bidirectional transceiver, which: receives a phone call at said at least one network node; uses speech recognition to create a transcription of audio of said telephone call; while creating said transcription of audio of said telephone call, sends said transcription to said bidirectional transceiver in real-time via said at least one network connection; during said phone call, transmits a message that includes audio output of at least one of text to speech synthesis or pre-recorded audio to a calling party via said at least one network node based on information provided by or obtained from the calling party and instructions received from said bidirectional transceiver receiving said transcription; and directs the call or responds to the calling party with the message based on the information provided by or obtained from the calling party and instructions received from said bidirectional transceiver.
15. The telephone switch of claim 14, wherein using said speech recognition, a processor on said telephone switch determines that said calling party wants to schedule a meeting, and said instructions received from said bidirectional transceiver include a date and time for said meeting.
16. The telephone switch of claim 14, wherein said instructions received from said bidirectional transceiver indicate that a called party is unavailable and a proposed time for said called party to place a new telephone call to said calling party, said instructions further comprising said proposed new time.
17. The telephone switch of claim 14, wherein said instructions include playing audio in said telephone call based on input into said bidirectional transceiver.
18. The telephone switch of claim 14, wherein said bidirectional transceiver, while receiving said transcription, sends instructions to said at least one network node to end said telephone call; and said telephone call is disconnected from said at least one network node.
19. The telephone switch of claim 14, wherein said bidirectional transceiver, while receiving said transcription, sends instructions to said at least one network node to forward said phone call to a third party; during said phone call, audio is transmitted to said calling party indicating said phone call is being transferred or answered; and said telephone call is forwarded from said at least one network node to a bidirectional transceiver associated with said third party.
20. The method of claim 1, wherein the artificial intelligence executes the disposition of the call automatically and independently after communicating with the calling party unless overridden by instructions from the bidirectional transceiver, with the instructions from the bidirectional transceiver received by the artificial intelligence either before or after the artificial intelligence determines the disposition of the call.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0016]
[0017]
[0018]
[0019]
[0020]
DETAILED DESCRIPTION OF EMBODIMENTS OF THE DISCLOSED TECHNOLOGY
[0021] Receiving a telephone call to an auto-attendant, artificial intelligence, or person takes place. While this phone call is being conducted, a speech to text transcription is created and sent in real-time to another person at another network node. This person can read the transcript and interact with the phone call by sending his or her own commands, text, or speech to be made part of the phone call.
[0022] For purposes of this disclosure, speech recognition is defined as making a determination of words exhibited aurally. Further, voice recognition is defined as making a determination as to who is the speaker of words.
[0023] Embodiments of the disclosed technology are described below, with reference to the figures provided.
[0024]
[0025] Calling party identification mechanisms, used to determine who the calling party is, include location determination mechanisms based on a location reported by GPS, the Internet protocol (IP) address of one of the bi-directional transceivers 110 and/or 120, and looking up a location associated with a number reported by the calling line identification (caller ID) or automatic number identification (ANI) protocols.
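By way of non-limiting illustration only (this sketch is not part of the claimed subject matter; the area-code table contents and the function name are hypothetical), a lookup of a location associated with a caller ID/ANI-reported number might be sketched as:

```python
# Hypothetical sketch: resolve a coarse caller location from an
# ANI/caller-ID number via an area-code table. Table entries are
# illustrative examples, not data from the disclosure.

AREA_CODE_LOCATIONS = {
    "212": "New York, NY",
    "631": "Suffolk County, NY",
    "415": "San Francisco, CA",
}

def location_from_ani(ani: str) -> str:
    """Return a coarse location for a 10-digit NANP number, or 'unknown'."""
    digits = "".join(ch for ch in ani if ch.isdigit())
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]              # strip the country code
    if len(digits) != 10:
        return "unknown"
    return AREA_CODE_LOCATIONS.get(digits[:3], "unknown")
```

A real deployment would consult a full numbering-plan database rather than a small in-memory table.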
[0026] Input/output mechanisms of the bi-directional transceivers can include a keyboard, touch screen, display, and the like, used to receive input from, and send output to, a user of the device. A transmitter enables wireless transmission and receipt of data via a packet-switched network, such as packet-switched network 130. This network, in embodiments, interfaces with a telecommunications switch 132 which routes phone calls and data between two of the bi-directional transceivers 110 and 120. Versions of these data, which include portions thereof, can be transmitted between the devices. A version of data is that which has some of the identifying or salient information, as understood by a device receiving the information. For example, audio converted into packetized data can be compressed, uncompressed, and compressed again, forming another version. Such versions of data are within the scope of the claimed technology, when audio or other aspects are mentioned.
[0027] The telecom switch 132, a device and node where data are received and transmitted to another device via electronic or wireless transmission, is connected to a network node 134, such as one operated by an entity controlling the methods of use of the technology disclosed herein. This network node is a distinct device on the telephone network, which sends and receives data to the telephone network, or to another network which carries audio or versions of data used for creating, or created from, audio. At the network node is a processor 135, which decides when the bi-directional transceivers 110 and 120 can communicate with each other via audio, such as by forwarding the call from a transceiver 110 to a transceiver 120. At the network node 134 there are also memory 136 (volatile or non-volatile) for temporary storage of data, storage 138 for permanent storage of data, input/output 137 (like the input/output 124), and an interface 139 for connecting via electrical connection to other devices.
[0028] Still discussing
[0029]
[0030] While the call is taking place between the calling party and the AI in step 220, a written transcription of the call is created in real-time in step 225, based on the text (converted to speech), text transcribed from the recorded voice, and/or the voice of the calling party converted to text. This transcription, again in real-time, or at the same moment in time that another part of the conversation is being transcribed, is sent to a second network node, such as the bi-directional transceiver 120. Thus, the calling party (such as at transceiver 110) and the AI are having a conversation with audio back and forth, speech to text, and text to speech, while the called party and/or second network node and/or bi-directional transceiver 120 receives a written transcribed version of part or all of the audio between the calling party and the AI.
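As a non-limiting sketch of this real-time fan-out (the class and method names are illustrative, not from the disclosure), each utterance appended to the transcript can be pushed immediately to any subscribed node, such as the bi-directional transceiver 120:

```python
# Hypothetical sketch: a live transcript that delivers each new entry
# to subscribers (e.g., transceiver 120) as soon as it is created.

class LiveTranscript:
    def __init__(self):
        self.entries = []
        self.subscribers = []          # callbacks at other network nodes

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def add_utterance(self, speaker: str, text: str):
        """Record one utterance and fan it out in real-time."""
        entry = f"{speaker}: {text}"
        self.entries.append(entry)
        for push in self.subscribers:  # real-time delivery to each node
            push(entry)
```

In practice the "callbacks" would be network sends over the packet-switched network 130, but the ordering behavior is the same: the remote reader sees each utterance as it is transcribed.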
[0031] A sample transcription of the audio in the phone call between the AI and the calling party might look something like this, by way of example:
[0032] Synthesized voice: I'm sorry, but Mr. Lippman is unavailable. Is this an urgent matter?
[0033] Calling Party: Yes, it is!
[0034] Synthesized voice: Please tell me why it's urgent.
[0035] Calling Party: I can't find the cat food, and the cat needs to eat!
[0036] Synthesized voice: Who is this, by the way?
[0037] Calling Party: It's Mr. Lippman's son.
[0038] Synthesized voice: Okay, let me see if Mr. Lippman wants to answer the call.
[0039] The calling party and the urgency of the call can be determined automatically based on the text transcription of the conversation. For example, Mr. Lippman's son might be identified as the caller based on voice recognition (comparing the voice to previous calls with the son), his location (comparing to prior locations from which the son called, and/or limiting recognition, when the caller is believed to be the son, to calls from an area code or area codes that Mr. Lippman has previously designated as places his family might call from), or the like. In this case, urgency might also be detected based on certain keywords, such as "son" or "cat." Mr. Lippman might want all calls from his son to be treated as urgent, so that the call is detected as urgent as soon as the calling party says "Yes, it is!" or makes another recognizable utterance determined to be from a specific calling party. Or, in another embodiment, a negative keyword such as "cat" may be used. Thus, if someone says "cat," the call will be considered non-urgent, because Mr. Lippman doesn't want to be interrupted to talk about the cat when he is unavailable. In any of the above cases, once urgency is detected, the call can be sent to the called party in step 240, such as to a device associated with, or under the direct operative control of, the called party, such that the called party can exchange audio with the calling party in the phone call.
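A minimal sketch of such keyword-based urgency detection follows; the keyword sets are hypothetical per-called-party configuration (here echoing the example, with "cat" as a negative keyword that overrides urgency), not lists defined by the disclosure:

```python
# Hypothetical sketch: classify a transcript fragment as urgent or not
# using positive and negative keyword sets configured per called party.

URGENT_KEYWORDS = {"urgent", "emergency", "son"}
NEGATIVE_KEYWORDS = {"cat"}            # negative keyword: force non-urgent

def is_urgent(transcript: str) -> bool:
    """True if the fragment matches an urgent keyword and no negative one."""
    words = {w.strip(".,!?").lower() for w in transcript.split()}
    if words & NEGATIVE_KEYWORDS:      # e.g., talk about the cat
        return False
    return bool(words & URGENT_KEYWORDS)
```

A production system would combine this with voice recognition and caller identification, as the paragraph above describes, rather than rely on keywords alone.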
[0040] In other embodiments, the called party and/or second network node and/or bi-directional transceiver 120 sends data, which are received by a device carrying out parts of the disclosed technology, such as a telephone switch (which can comprise a single physical device, or many such devices interacting directly or indirectly with the telephone network and affecting audio in the telephone network itself). These data can include, as in step 235, a request to transfer the call to another party. That is, the call can be transferred to the second network node in step 240, or to a third network node in step 245. The third network node can be, in embodiments of the disclosed technology, a third party previously unconnected to the audio or transcript of the call taking place. This can be a form of call forwarding, which involves forwarding the call itself to another telephone network node and/or forwarding the real-time or live transcription to another node.
[0041] Or, in step 270, the bi-directional transceiver 120 can send instructions for the call to be disconnected. This can take place instead of, or after, steps 240 and/or 245. This can be indicated by hanging up the phone or selecting a button exhibited on the phone to disconnect the call. Further, once the call is disconnected, or as a function of selecting to disconnect the call (via voice instruction or text instruction which is recognized as such, or selecting a button, such as shown in
[0042] If no call transfer request is made in step 235, then step 250 can be carried out. Otherwise, the AI can continue to converse with the caller while steps 220, 225, and 230 are carried out cyclically and/or simultaneously, until the calling party or the AI decides to end the call and disconnect the phone call. If step 250 is answered in the affirmative and a meeting time is requested, then steps 260 and 265 are carried out cyclically, where in step 260 a requested time is presented to the called party, and in step 265 a meeting time and place is negotiated. The meeting time and place can be arranged entirely by the calling party and the artificial intelligence, and in some embodiments, also with input provided, during the call, into the bidirectional transceiver receiving the transcription. This meeting time and place can be a physical meeting place, or simply a time when the calling party and an intended recipient or other human being, such as an operator of the bidirectional transceiver (120) at the second network node, can converse via voice. Such a negotiated time for a further phone call might create a temporary whitelist for the calling party at the time of the future call, or provide a password/passcode for the calling party to present on the subsequent call so that it reaches the bidirectional transceiver by way of carrying out step 240. After negotiating the time and place, the call can continue between the calling party and the AI (steps 220, 225, and, in some cases, step 230).
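The temporary whitelist for a negotiated callback might be sketched as follows; this is a hypothetical in-memory store keyed by caller number, and the 15-minute admission window is an assumed default, not one stated in the disclosure:

```python
import time

# Hypothetical sketch: a temporary whitelist that admits a caller only
# near the callback time negotiated during the earlier call.

class CallbackWhitelist:
    def __init__(self, window_seconds=15 * 60):
        self.window = window_seconds
        self.slots = {}                # caller number -> agreed epoch time

    def grant(self, caller, agreed_time):
        """Record the negotiated callback time for this caller."""
        self.slots[caller] = agreed_time

    def admits(self, caller, now=None):
        """True if the caller should be forwarded per step 240 right now."""
        now = time.time() if now is None else now
        agreed = self.slots.get(caller)
        if agreed is None:
            return False
        return abs(now - agreed) <= self.window
```

The password/passcode variant mentioned above could replace the time window with a one-time token check.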
[0043]
[0044] Steps 220, 225, and 230 remain as shown and described with respect to
[0045] In addition to selecting an exhibited selectable element in step 310, a person operating the bi-directional transceiver 120 might also input text or speech in response to a query made by the AI to the second party (person receiving the transcript). A conversation, for example, might take place as follows:
[0046] Calling Party: Please tell Adam his refrigerator is running.
[0047] AI: I can do that for you. Hold on one moment.
[0048] Adam, viewing this conversation, might read this in the transcription on his device and then select a button such as "Acknowledge receipt" in step 310, enter text into his device (e.g., by typing or selecting letters) in step 315, such as "I know," or input speech into a microphone of the device in step 318 by saying, "I know." In any of these cases, the inputted information on the bi-directional transceiver is then transmitted to the switch in step 320, such as via a wired or wireless network, such as a cellular phone data network or a wired IP connection.
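The instruction sent to the switch in step 320 might be serialized as a small message; the schema below is a hypothetical wire format chosen for illustration, not one defined by the disclosure:

```python
import json

# Hypothetical sketch: package transceiver input (button press, typed
# text, or recognized speech, per steps 310/315/318) for the switch.

def make_instruction(kind: str, payload: str = "") -> str:
    """Serialize one instruction; kind is 'button', 'text', or 'speech'."""
    if kind not in {"button", "text", "speech"}:
        raise ValueError(f"unknown input kind: {kind}")
    return json.dumps({"kind": kind, "payload": payload})
```

For example, pressing "Acknowledge receipt" might produce `make_instruction("button", "acknowledge_receipt")`, while typing "I know" produces `make_instruction("text", "I know")`.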
[0049] In another example, the calling party and AI are having a back and forth conversation such as follows:
[0050] Calling Party: My internet is down.
[0051] AI: I understand your internet connection is not working. Did you check if your router is plugged in?
[0052] Calling Party: The problem is DNS server is not responding.
[0053] AI: Again, did you unplug your router and plug it back in?
[0054] Calling Party: Ugh. Don't you understand what I'm saying?
[0055] At this point, the person reading the transcript at the bi-directional transceiver may carry out step 315 or 318 and enter free-form text to be inserted into the conversation, such as, "What is your DNS server IP address currently?" The AI will wait for a moment in the conversation to enter the text in step 350, when the input is parsed, and then modify the AI conversation in step 355 accordingly. The AI can transcribe the speech input of step 318 into text, or use the text from step 315, and synthesize it in the AI voice, stating, "What is your DNS server IP address currently?" In this manner, the calling party still hears only the AI, but the input for the conversation actually comes from a human interacting directly with the conversation.
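This queue-then-speak behavior of steps 350 and 355 can be sketched as follows; the class name and the `speak_tts` callback are illustrative stand-ins, with the callback representing a real text-to-speech call that renders the supervisor's text in the AI voice:

```python
from collections import deque

# Hypothetical sketch: supervisor-injected text is queued and inserted
# at the next pause in the AI conversation, spoken in the AI's voice.

class AiConversation:
    def __init__(self, speak_tts):
        self.speak_tts = speak_tts     # stand-in for a TTS engine call
        self.pending = deque()         # supervisor-injected utterances

    def inject(self, text: str):
        """Step 315/318 input, parsed in step 350 and held until a pause."""
        self.pending.append(text)

    def on_pause(self):
        """A moment of silence in the call; step 355 modifies the dialog."""
        while self.pending:
            self.speak_tts(self.pending.popleft())
```

The caller hears only the synthesized AI voice, as described above, even though the content originated with the human reading the transcript.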
[0056] In yet another embodiment, an AI need not be used at all. Building on the tech support example above, suppose the "AI" which does not understand "DNS server" is actually a human being. In such a case, in step 220 a human is conversing with the caller. In this case, the written transcript in step 225 is still created, based at least in part on instructions read by the tech support person or on speech recognition. The modification of the AI conversation in step 355 then becomes modification of the conversation, based on input provided by the second party. So the second party might then tell the tech support person (the called party) what to say, while monitoring the transcript. Many such transcripts of many simultaneous calls can be monitored in this way by, for example, a person with more experience in handling calls. Upon seeing that a call needs to be escalated to a higher level, such a selectable element can be selected in step 310 and transmitted to the switch in step 320, and the call is forwarded to the second party or to another party better able to handle the call.
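Monitoring many simultaneous transcripts for escalation candidates could be sketched as a simple scan; the trigger-phrase set is hypothetical configuration, and a real system might instead rely on the supervisor's judgment via the step 310 selectable element:

```python
# Hypothetical sketch: flag which of several live transcripts may need
# escalation, based on an assumed set of trigger keywords.

ESCALATION_TRIGGERS = {"supervisor", "complaint", "cancel"}

def calls_needing_escalation(transcripts: dict) -> list:
    """transcripts maps call_id -> latest transcript text; returns flagged ids."""
    flagged = []
    for call_id, text in transcripts.items():
        words = {w.strip(".,!?").lower() for w in text.split()}
        if words & ESCALATION_TRIGGERS:
            flagged.append(call_id)
    return sorted(flagged)
```

The flagged call identifiers would then be surfaced to the experienced monitor, who selects the forwarding element for any call that truly needs escalation.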
[0057]
[0058] Selectable element 415 instructs the AI to schedule a time to call back later and determine who will make the call and to what number. This is confirmed through a conversation where such information is exchanged and confirmed between the calling party (shown as caller in the figure) and the AI. Similarly, using selectable element 420, an in-person meeting can be scheduled. The operator of the device 120 may also desire to hear the audio in real-time by using selectable element 425 to do so. While doing so, the rest of the selectable elements can continue to function as before. Or, the person can take the call outright, using button 435, and the call is forwarded to the bi-directional transceiver 120. In some embodiments, the transcription continues, while in others the transcription ceases at this point.
[0059] The person can also select forward button 430 to have the call forwarded to a third party, as described with reference to
[0060] The blacklist selectable element 450 ensures that the next time a particular calling party is recognized (such as by using voice recognition or caller identity information, e.g., caller ID or ANI), the steps of sending a transcript to the second party/second node/bi-directional transceiver 120 are not carried out. Conversely, the whitelist selectable element 455 ensures that the next time a particular calling party is recognized in a subsequent call, the call is forwarded with two-way voice communication to the second node/bi-directional transceiver 120. In such a case, a transcription may or may not be made, depending on the embodiment. Thus, it should also be understood that hearing audio 425 and speaking 440 involve one-way audio communication, whereas taking a call 435 or forwarding a call 430 involves two-way audio communication. Speaking 440 can actually involve no direct audio communication, as a version of the spoken word is sent based on speech to text (speech recognition), followed by text to speech conversion, so that the speech is in the voice of the AI or other called party handling the audio of the call.
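The routing decision for a recognized repeat caller under elements 450 and 455 can be sketched as follows; the function name and the returned labels are illustrative, not terms from the disclosure:

```python
# Hypothetical sketch: decide how to route a recognized repeat caller
# given the blacklist (element 450) and whitelist (element 455).

def route_recognized_caller(caller: str, blacklist: set, whitelist: set) -> str:
    if caller in blacklist:
        return "handle_without_transcript"   # 450: no transcript fan-out
    if caller in whitelist:
        return "forward_two_way_audio"       # 455: forward to node 120
    return "transcribe_and_monitor"          # default real-time behavior
```

A caller on neither list falls through to the default behavior of transcribing the call and sending the transcript in real-time.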
[0061]
[0062] Further, it should be understood that all subject matter disclosed herein is directed at, and should be read only on, statutory, non-abstract subject matter. All terminology should be read to include only the portions of the definitions which may be claimed. By way of example, computer readable storage medium is understood to be defined as only non-transitory storage media.
[0063] While the disclosed technology has been taught with specific reference to the above embodiments, a person having ordinary skill in the art will recognize that changes can be made in form and detail without departing from the spirit and the scope of the disclosed technology. The described embodiments are to be considered in all respects only as illustrative and not restrictive. All changes that come within the meaning and range of equivalency of the claims are to be embraced within their scope. Combinations of any of the methods, systems, and devices described herein-above are also contemplated and within the scope of the disclosed technology.