ELECTRONIC PERSONAL INTERACTIVE DEVICE
20220319517 · 2022-10-06
Inventors
CPC classification
G10L15/30
PHYSICS
G06F3/017
PHYSICS
G06F3/167
PHYSICS
H04L12/2818
ELECTRICITY
G10L13/033
PHYSICS
G10L15/1815
PHYSICS
G06V40/28
PHYSICS
H04W84/18
ELECTRICITY
G10L15/22
PHYSICS
G06F16/435
PHYSICS
H04M1/0202
ELECTRICITY
International classification
G10L15/22
PHYSICS
G06F16/435
PHYSICS
G10L13/033
PHYSICS
G10L15/14
PHYSICS
G10L15/30
PHYSICS
Abstract
An interface device and method of use, comprising audio and image inputs; a processor for determining topics of interest, and receiving information of interest to the user from a remote resource; an audio-visual output for presenting an anthropomorphic object conveying the received information, having a selectively defined and adaptively alterable mood; an external communication device adapted to remotely communicate at least a voice conversation with a human user of the personal interface device. Also provided is a system and method adapted to receive logic for, synthesize, and engage in conversation dependent on received conversational logic and a personality.
Claims
1. An electronic system, comprising: a network communication port configured to communicate with a communication network; a memory configured to store topics of interest to a user; an external interface, configured to transmit a series of search requests to at least one of an external automated search engine and an external automated social network system, and to receive responses through the network communication port; a conversational agent configured to: interact with the user according to a natural language conversation dependent on the stored topics of interest and the received responses; define the series of search requests; and update the memory to maintain current topics of interest to the user dependent on user feedback obtained during the natural language conversation; and a user interface directed by the conversational agent.
2. The electronic system according to claim 1, wherein the user interface comprises a microphone, a speaker, a display, and a camera.
3. The electronic system according to claim 1, wherein: the topics of interest comprise identified persons; the conversational agent is configured to communicate the identified persons to the external automated social network system; and the received responses comprise social network records relating to the identified persons.
4. The electronic system according to claim 1, wherein: the topics of interest comprise current events; the conversational agent is configured to communicate characteristics of the current events to the external automated search engine; and the received responses comprise news reports relating to the current events.
5. The electronic system according to claim 1, wherein: the user interface comprises an audiovisual interface; the conversational agent is further configured to determine an emotional state of the user based on interactions with the user through the audiovisual interface; and the conversational agent is further configured to selectively react to the emotional state of the user.
6. The electronic system according to claim 1, wherein: the conversational agent is further configured to determine an emotional state of the user; and the series of requests are selectively dependent on the determined emotional state of the user.
7. The electronic system according to claim 1, wherein: the conversational agent is further configured to determine a current emotional state of the user and a desired emotional state of the user; the natural language conversation is dependent on the stored topics of interest, the received responses, the determined current emotional state of the user, and the determined desired emotional state of the user.
8. The electronic system according to claim 1, wherein the conversational agent is further configured to: store a status of conversational elements at an end of a conversation in the memory; update the conversational elements at the end of the conversation by transmitting search requests; and introducing the updated conversational elements in the natural language conversation.
9. The electronic system according to claim 1, wherein the conversational agent is implemented using an artificial neural network.
10. The electronic system according to claim 1, wherein the conversational agent is further configured to determine an emergency state of the user, and to automatically contact emergency assistance services in event of the determined emergency state of the user.
11. The electronic system according to claim 10, wherein the conversational agent is further configured to control the user interface to selectively communicate at least one of audio and visual information to the emergency assistance services.
12. The electronic system according to claim 10, wherein the conversational agent is further configured to determine a mood of the user based on implicit information in audio input received through the user interface.
13. The electronic system according to claim 10, wherein the conversational agent is further configured to mine data from the at least one of the external automated search engine and the external automated social network system based on learned relevance of information within the topics of interest from prior interaction with the user.
14. The electronic system according to claim 10, wherein the conversational agent is further configured to initiate a conversation with the user.
15. An interactive conversational system, comprising: a network communication port; a memory configured to store topics of interest to a user; a database interface, configured to transmit search requests to an automated database system, and to receive responses, through the network communication port; a conversational agent configured to define the search requests and to interact with the user in a natural language conversation dependent on the stored topics of interest and the received responses; a conversation continuity agent configured to update the memory to maintain current topics of interest to the user dependent on user feedback obtained during the natural language conversation; and a user interface directed by the conversational agent.
16. The interactive conversational system according to claim 15, wherein the automated database system comprises an Internet search engine comprising knowledge records.
17. The interactive conversational system according to claim 15, wherein the automated database system comprises an Internet social network system comprising human relationship records.
18. The interactive conversational system according to claim 15, wherein: the user interface comprises an audiovisual interface; the conversational agent is further configured to determine an emotional state of the user based on implicit interactions with the user through the audiovisual interface; and the conversational agent is further configured to select topics of conversation based on the emotional state of the user.
19. The interactive conversational system according to claim 15, wherein the conversational agent is further configured to: store a status of conversational elements at an end of a conversation in the memory; update the conversational elements at the end of the conversation by transmitting search requests; and introducing the updated conversational elements into the natural language conversation.
20. An interactive conversational method, comprising: storing topics of interest to a user in a memory; transmitting search requests to an automated database system selected from the group consisting of an Internet search engine, a new search engine, and a social network database search engine, through an automated communication network; receiving responses to the transmitted search requests through the automated communication network; defining the search requests based on conversational natural language communications with a user through a user interface device, the topics of interest to the user, and the received responses; updating the memory to maintain current topics of interest to the user dependent on user feedback obtained during the conversational natural language communications; and conducting the conversational natural language communications with the user with an automated conversational agent according to the topics of interest to the user, the received responses, natural language user inputs, and a user context, with an automated conversational agent.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0148]
[0149]
[0150]
[0151]
[0152]
[0153]
[0154]
[0155]
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
EXAMPLE 1
Cell Phone
[0156]
[0157]
[0158] In the example, in step 210, Ulysses says, “Is my grandson James partying instead of studying?” Ulysses has an angry voice and a mad facial expression. In step 220, the machine detects the mood of the user (angry/mad) based on audio input (angry voice) and image input (mad facial expression). This detection is done by one or more processors, for example, a Qualcomm Snapdragon processor. The one or more processors are also involved in detecting the meaning of the speech, so that the machine is able to provide a conversationally relevant response that is at least partially responsive to any query or comment the user makes, and builds on the user's last statement, in the context of this conversation and the course of dealings between the machine and the user. Roy, US App. 2009/0063147, incorporated herein by reference, discusses an exemplary phonetic, syntactic and conceptual analysis driven speech recognition system. Roy's system, or a similar technology, could be used to map the words and grammatical structures uttered by the user to a “meaning”, which could then be responded to, with a response converted back to speech, presented in conjunction with an anthropomorphic avatar on the screen, in order to provide a conversationally relevant output. Another embodiment of this invention might use hierarchal stacked neural networks, such as those described by Commons, U.S. Pat. No. 7,613,663, incorporated herein by reference, in order to detect the phonemes the user pronounces and to convert those phonemes into meaningful words and sentences or other grammatical structures. In one embodiment, the facial expression and/or the intonation of the user's voice are coupled with the words chosen by the user to generate the meaning. In any case, at a high level, the device may interpret the user input as a concept with a purpose, and generate a response as a related concept with a counter-purpose.
The purpose need not be broader than furthering the conversation, or it may be goal-oriented. In step 230, the machine then adjusts the facial expression of the image of Penelope to angry/mad to mirror the user, as a contextually appropriate emotive response. In another embodiment, the machine might use a different facial expression in order to attempt to modify the user's mood. Thus, if the machine determines that a heated argument is an appropriate path, then a similar emotion to that of the user would carry the conversation forward. In other cases, the interface adopts a more submissive response, to defuse the aggression of the user.
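Steps 220 and 230 can be sketched roughly as follows. This is a minimal illustration, not the disclosed implementation: the mood labels other than "angry", the agreement rule between audio and image input, and the names `Observation`, `detect_mood`, `choose_expression`, and `CALMING_EXPRESSION` are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical mapping from a user mood to a defusing avatar expression.
# The specification only says the interface may adopt a "more submissive response".
CALMING_EXPRESSION = {"angry": "calm", "sad": "sympathetic"}


@dataclass
class Observation:
    voice_mood: str  # mood inferred from the audio input (step 220)
    face_mood: str   # mood inferred from the image input (step 220)


def detect_mood(obs: Observation) -> str:
    """Return the mood when audio and image agree, otherwise 'neutral' (assumed rule)."""
    return obs.voice_mood if obs.voice_mood == obs.face_mood else "neutral"


def choose_expression(user_mood: str, mirror: bool = True) -> str:
    """Step 230: mirror the user's mood, or pick a defusing expression instead."""
    if mirror:
        return user_mood
    return CALMING_EXPRESSION.get(user_mood, "neutral")
```

With an angry voice and a mad facial expression, `detect_mood` yields "angry"; `choose_expression("angry")` mirrors it, while `choose_expression("angry", mirror=False)` selects the calmer alternative described for the other embodiment.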
[0159] Clearly, the machine has no way of knowing whether James is partying or studying without relying on external data. However, according to one embodiment of the invention, the machine can access a network, such as the Internet, or a database to get some relevant information. Here, in step 240, the machine checks the social networking website Facebook to determine James' recent activity. Facebook reveals that James got a C on his biology midterm and displays several photographs of James getting drunk and engaging in “partying” behavior. The machine then replies 250 to the user, in an angry female voice, “It is horrible. James got a C on his biology midterm, and he is drinking very heavily. Look at these photographs taken by his neighbor.” The machine then proceeds to display the photographs to the user. In step 260, the user continues the conversation, “Oh my God. What will we do? Should I tell James that I will disinherit him unless he improves his grades?”
[0160] Note that a female voice was used because Penelope is a woman. In one embodiment, other features of Penelope, for example, her race, age, accent, profession, and background could be used to select an optimal voice, dialect, and intonation for her. For example, Penelope might be a 75-year-old, lifelong white Texan housewife who speaks with a strong rural Texas accent.
[0161] The machine could look up the information about James in response to the query, as illustrated here. In another embodiment, the machine could know that the user has some favorite topics that he likes to discuss (e.g., family, weather, etc.). The machine would then prepare for these discussions in advance or in real-time by looking up relevant information on the network and storing it. This way, the machine would be able to discuss James' college experience in a place where there was no Internet access. In accordance with this embodiment, at least one Internet search may occur automatically, without a direct request from the user. In yet another embodiment, instead of doing the lookup electronically, the machine could connect to a remote computer server or a remote person who would select a response to give the user. Note that the remote person might be different from the person whose photograph appears on the display. This embodiment is useful because it ensures that the machine will not advise the user to do something rash, such as disinheriting his grandson.
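The advance lookup of favorite topics could be organized along the following lines. This is only a sketch under stated assumptions: `TopicCache` and its methods are hypothetical names, and `lookup` stands in for whatever network search the device actually performs.

```python
from typing import Callable, Dict, Iterable, Optional


class TopicCache:
    """Stores looked-up information per topic so it can be discussed offline."""

    def __init__(self, lookup: Callable[[str], str]):
        self._lookup = lookup              # hypothetical network search function
        self._store: Dict[str, str] = {}   # topic -> last retrieved information

    def prefetch(self, topics: Iterable[str]) -> None:
        """Look up each favorite topic in advance, without a user request."""
        for topic in topics:
            self._store[topic] = self._lookup(topic)

    def get(self, topic: str, online: bool) -> Optional[str]:
        """Refresh the topic when online; otherwise fall back to stored data."""
        if online:
            self._store[topic] = self._lookup(topic)
        return self._store.get(topic)
```

After `prefetch(["family", "weather"])`, a call such as `get("family", online=False)` returns the stored result even where there is no Internet access, and returns `None` for a topic that was never looked up.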
[0162] Note that both the machine's response to the user's first inquiry and the user's response to the machine are conversationally relevant, meaning that the statements respond to the queries, add to the conversation, and increase the knowledge available to the other party. In the first step, the user asked a question about what James was doing. The machine then responded that James' grades were bad and that he had been drunk on several occasions. This information added to the user's base of knowledge about James. The user then built on what the machine had to say by suggesting threatening to disinherit James as a potential solution to the problem of James' poor grades.
[0163] In one embodiment, the machine starts up and shuts down in response to the user's oral commands. This is convenient for elderly users who may have difficulty pressing buttons. A deactivation permits the machine to enter a low power consumption mode. In another embodiment, the microphone and camera continuously monitor the scene for the presence of an emergency. If an emergency is detected, emergency assistance services, selected for example from the group of one or more of police, fire, ambulance, nursing home staff, hospital staff, and family members, might be called. Optionally, the device could store and provide information relevant to the emergency to emergency assistance personnel. Information relevant to the emergency includes, for example, a video, photograph or audio recording of the circumstance causing the emergency. To the extent the machine is a telephone, an automated e911 call might be placed, which typically conveys the user's location. The machine, therefore, may include a GPS receiver, other satellite geolocation receiver, or be usable with a network-based location system.
[0164] In another embodiment of this invention, the machine provides a social networking site by providing the responses of various people to different situations. For example, Ulysses is not the first grandfather to deal with a grandson with poor grades who drinks and parties a lot. If the machine could provide Ulysses with information about how other grandparents dealt with this problem (without disinheriting their grandchildren), it might be useful to Ulysses.
[0165] In yet another embodiment (not illustrated) the machine implementing the invention could be programmed to periodically start conversations with the user itself, for example, if the machine learns of an event that would be interesting to the user. (E.g., in the above example, if James received an A+ in chemistry, the machine might be prompted to share the happy news with Ulysses.) To implement this embodiment, the machine would receive relevant information from a network or database, for example through a web crawler or an RSS feed. Alternatively, the machine could itself check various relevant websites, such as James' social networking pages, to determine if there are updates. The machine might also receive proactive communications from a remote system, such as using an SMS or MMS message, email, IP packet, or other electronic communication.
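One way to decide when the machine should start a conversation is to filter incoming items against the stored topics of interest and skip anything already seen. The sketch below assumes plain-text feed items and simple substring matching; the function name `new_items_of_interest` is hypothetical, and a real device would obtain the items from a crawler, an RSS feed, or a message rather than a list.

```python
from typing import Iterable, List, Set


def new_items_of_interest(
    feed_items: Iterable[str], topics: Iterable[str], seen: Set[str]
) -> List[str]:
    """Return unseen items that mention a topic of interest; mark all items as seen."""
    topics_lower = [t.lower() for t in topics]
    fresh = []
    for item in feed_items:
        if item in seen:
            continue  # already handled in an earlier check
        seen.add(item)
        if any(t in item.lower() for t in topics_lower):
            fresh.append(item)
    return fresh
```

A non-empty result would be the trigger to open a conversation, for example to share the news about James' chemistry grade.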
EXAMPLE 2
Cell Phone with Low Processing Abilities
[0166] This embodiment of this invention, as illustrated in
[0167] The user says something that is heard at call center 330 by employee 332. The employee 332 can also see the user through the camera in the user's telephone. An image of the user appears on the employee's computer 334, such that the employee can look at the user and infer the user's mood. The employee then selects a conversationally relevant response, which builds on what the user said and is at least partially responsive to the query, to say to the user. The employee can control the facial expression of the avatar on the user's cell phone screen. In one embodiment, the employee sets up the facial expression on the computer screen by adjusting the face through mouse “drag and drop” techniques. In another embodiment, the computer 334 has a camera that detects the employee's facial expression and makes the same expression on the user's screen. This is processed by the call center computer 334 to provide an output to the user through the speaker of cell phone 310. If the user asks a question, such as, “What will the weather be in New York tomorrow?” the call center employee 332 can look up the answer through Google or Microsoft Bing search on computer 334.
[0168] Preferably, each call center employee is assigned to a small group of users whose calls she answers. This way, the call center employee can come to personally know the people with whom she speaks and the topics that they enjoy discussing. Conversations will thus be more meaningful to the users.
EXAMPLE 3
Smart Phone, Laptop or Desktop with CPU Connected to a Network
[0169] Another embodiment of the invention illustrated in
[0170] As noted above, persons skilled in the art will recognize many ways the mood-determining logic 430 could operate. For example, Bohacek, U.S. Pat. No. 6,411,687, incorporated herein by reference, teaches that a speaker's gender, age, and dialect or accent can be determined from the speech. Black, U.S. Pat. No. 5,774,591, incorporated herein by reference, teaches about using a camera to ascertain the facial expression of a user and determining the user's mood from the facial expression. Bushey, U.S. Pat. No. 7,224,790, similarly teaches about “verbal style analysis” to determine a customer's level of frustration when the customer telephones a call center. A similar “verbal style analysis” can be used here to ascertain the mood of the user. Combining the technologies taught by Bohacek, Black, and Bushey would provide the best picture of the emotional state of the user, taking many different factors into account.
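Combining several mood estimates can be illustrated with a simple score-summing rule. This is an assumption made for illustration only; none of the cited references is stated to work this way, and the name `fuse_mood_scores` and the score dictionaries are hypothetical.

```python
from typing import Dict, List


def fuse_mood_scores(estimates: List[Dict[str, float]]) -> str:
    """Sum per-mood scores from several analyzers and return the top-scoring mood."""
    totals: Dict[str, float] = {}
    for estimate in estimates:
        for mood, score in estimate.items():
            totals[mood] = totals.get(mood, 0.0) + score
    if not totals:
        return "neutral"  # assumed default when no analyzer reports anything
    return max(totals, key=lambda mood: totals[mood])
```

For example, if the speech analysis, the facial expression analysis, and the verbal style analysis each give "angry" the highest or an equal share of their scores, the fused result is "angry" even when one analyzer is undecided.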
[0171] Persons skilled in the art will also recognize many ways to implement the speech recognizer 440. For example, Gupta, U.S. Pat. No. 6,138,095, incorporated herein by reference, teaches a speech recognizer where the words that a person is saying are compared with a dictionary. An error checker is used to determine the degree of the possible error in pronunciation. Alternatively, in a preferred embodiment, a hierarchal stacked neural network, as taught by Commons, U.S. Pat. No. 7,613,663, incorporated herein by reference, could be used. If the neural networks of Commons are used to implement the invention, the lowest level neural network would recognize speech as speech (rather than background noise). The second level neural network would arrange speech into phonemes. The third level neural network would arrange the phonemes into words. The fourth level would arrange words into sentences. The fifth level would combine sentences into meaningful paragraphs or idea structures. The neural network is the preferred embodiment for the speech recognition software because the meanings of words (especially keywords) used by humans are often fuzzy and context sensitive. Rules, which are programmed to process clear-cut categories, are not efficient for interpreting ambiguity.
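The five-level arrangement can be pictured as a chain in which each level consumes the output of the level below it. The sketch below shows only that data flow. Each level is a toy, rule-based stand-in rather than a neural network, and the frame encoding (one symbol per frame, `|` ending a word, `.` ending a sentence) is an assumption invented for the example.

```python
from functools import reduce
from typing import Any, Callable, List

NOISE = "<noise>"  # hypothetical marker for a background-noise frame


def detect_speech(frames: List[str]) -> List[str]:
    """Level 1: keep speech frames and drop background noise."""
    return [f for f in frames if f != NOISE]


def to_phonemes(frames: List[str]) -> List[str]:
    """Level 2: map each speech frame to a phoneme symbol (toy rule: lowercase)."""
    return [f.lower() for f in frames]


def to_words(phonemes: List[str]) -> List[str]:
    """Level 3: group phonemes into words; '.' is passed on as a sentence marker."""
    words: List[str] = []
    current = ""
    for p in phonemes:
        if p in ("|", "."):
            if current:
                words.append(current)
                current = ""
            if p == ".":
                words.append(".")
        else:
            current += p
    if current:
        words.append(current)
    return words


def to_sentences(words: List[str]) -> List[List[str]]:
    """Level 4: group words into sentences at each '.' marker."""
    sentences: List[List[str]] = []
    current: List[str] = []
    for w in words:
        if w == ".":
            if current:
                sentences.append(current)
                current = []
        else:
            current.append(w)
    if current:
        sentences.append(current)
    return sentences


def to_idea(sentences: List[List[str]]) -> str:
    """Level 5: combine sentences into one structure (toy rule: joined text)."""
    return " ".join(" ".join(s) + "." for s in sentences)


LEVELS: List[Callable[[Any], Any]] = [
    detect_speech, to_phonemes, to_words, to_sentences, to_idea
]


def recognize(frames: List[str]) -> str:
    """Feed the signal through every level in order, lowest level first."""
    return reduce(lambda data, level: level(data), LEVELS, frames)
```

In a neural implementation each function would be replaced by a trained network, which is where the tolerance for fuzzy, context-sensitive input described above would come from.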
[0172] The output of the logic to determine mood 430 and the speech recognizer 440 are provided to a conversation logic 450. The conversation logic selects a conversationally relevant response 452 to the user's verbal (and preferably also image and voice tone) input to provide to the speakers 460. It also selects a facial expression for the face on the screen 470. The conversationally relevant response should expand on the user's last statement and what was previously said in the conversation. If the user's last statement included at least one query, the conversationally relevant response preferably answers at least part of the query. If necessary, the conversation logic 450 could consult the internet 454 to get an answer to the query 456. This could be necessary if the user asks a query such as “Is my grandson James partying instead of studying?” or “What is the weather in New York?”
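The flow through conversation logic 450 might look roughly like the following. The sketch is heavily simplified and rests on assumptions: a query is detected by a trailing question mark, the fallback phrases are invented, the avatar simply mirrors the detected mood, and `Turn`, `conversation_logic`, and `lookup` are hypothetical names (with `lookup` standing in for the Internet consultation 454/456).

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Turn:
    speech: str      # response 452, sent to the speakers 460
    expression: str  # facial expression for the face on the screen 470


def conversation_logic(
    text: str, mood: str, lookup: Callable[[str], Optional[str]]
) -> Turn:
    """Answer a query via an external lookup when needed; otherwise invite more detail."""
    if text.strip().endswith("?"):
        answer = lookup(text)  # consult the network only when a query is present
        speech = answer if answer is not None else "I am not sure, but let me find out."
    else:
        speech = "Tell me more about that."
    return Turn(speech=speech, expression=mood)
```

A fuller implementation would also draw on the earlier turns of the conversation, since the response is meant to expand on what was previously said and not only on the last statement.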
[0173] To determine whether the user's grandson James is partying or studying, the conversation logic 450 would first convert “grandson James” into a name, such as James Kerner. The last name could be determined either through memory (stored either in the memory of the phone or computer or on a server accessible over the Internet 454) of prior conversations or by asking the user, “What is James' last name?” The data as to whether James is partying or studying could be determined using a standard search engine accessed through the Internet 454, such as Google or Microsoft Bing. While these might not provide accurate information about James, these might provide conversationally relevant information to allow the phone or computer implementing the invention to say something to keep the conversation going. Alternatively, to provide more accurate information the conversation logic 450 could search for information about James Kerner on social networking sites accessible on the Internet 454, such as Facebook, LinkedIn, Twitter, etc., as well as any public internet sites dedicated specifically to providing information about James Kerner. (For example, many law firms provide a separate web page describing each of their attorneys.) If the user is a member of a social networking site, the conversation logic could log into the site to be able to view information that is available to the user but not to the general public. For example, Facebook allows users to share some information with their “friends” but not with the general public. The conversation logic 450 could use the combination of text, photographs, videos, etc. to learn about James' activities and to come to a conclusion as to whether they constitute “partying” or “studying.”
[0174] To determine the weather in New York, the conversation logic 450 could use a search engine accessed through the Internet 454, such as Google or Microsoft Bing. Alternatively, the conversation logic could connect with a server adapted to provide weather information, such as The Weather Channel, www.weather.com, or AccuWeather, www.accuweather.com, or the National Oceanic and Atmospheric Administration, www.nws.noaa.gov.
[0175] Note that, to be conversationally relevant, each statement must expand on what was said previously. Thus, if the user asks the question, “What is the weather in New York?” twice, the second response must be different from the first. For example, the first response might be, “It will rain in the morning,” and the second response might be, “It will be sunny after the rain stops in the afternoon.” However, if the second response were exactly the same as the first, it would not be conversationally relevant as it would not build on the knowledge available to the parties.
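The requirement that a repeated question receive a different answer can be sketched by tracking what has already been said. The function name `next_relevant_response` and the use of exact string comparison are assumptions for illustration; a real system would need a looser notion of "the same response".

```python
from typing import List, Optional, Sequence


def next_relevant_response(
    candidates: Sequence[str], already_said: List[str]
) -> Optional[str]:
    """Return the first candidate not yet said and record it; None if all were used."""
    for candidate in candidates:
        if candidate not in already_said:
            already_said.append(candidate)
            return candidate
    return None
```

Given the two weather statements from the example as candidates, the first call returns the morning forecast, the second returns the afternoon forecast, and a third call returns `None`, signaling that new information would have to be retrieved.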
[0176] The phone or computer implementing the invention can say arbitrary phrases. In one embodiment, if the voice samples of the person on the screen are available, that voice could be used. In another embodiment, the decision as to which voice to use is made based on the gender of the speaker alone.
[0177] In a preferred embodiment, the image on the screen 470 looks like it is talking. When the image on the screen is talking, several parameters need to be modified, including jaw rotation and thrust, horizontal mouth width, lip corner and protrusion controls, lower lip tuck, vertical lip position, horizontal and vertical teeth offset, and tongue angle, width, and length. Preferably, the processor of the phone or computer that is implementing the invention will model the talking head as a 3D mesh that can be parametrically deformed (in response to facial movements during speech and facial gestures).
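Parametric deformation of the talking head can be illustrated by interpolating between two parameter sets, for example a rest pose and an open-mouth pose. Linear interpolation is an assumption chosen for simplicity, and `blend_pose` and the parameter names are hypothetical; the specification lists the parameters but does not say how they are animated.

```python
from typing import Dict


def blend_pose(start: Dict[str, float], end: Dict[str, float], t: float) -> Dict[str, float]:
    """Linearly interpolate every face parameter from start (t=0) to end (t=1)."""
    t = max(0.0, min(1.0, t))  # clamp so the face never overshoots either pose
    return {name: start[name] + t * (end[name] - start[name]) for name in start}
```

Calling `blend_pose` once per video frame with an increasing `t` would move parameters such as jaw rotation and horizontal mouth width smoothly from one pose to the next, and the resulting values would then drive the 3D mesh.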
EXAMPLE 4
Smart Clock Radio
[0178] Another embodiment of this invention illustrated in
[0179] In one embodiment, the radio 500 operates in a manner equivalent to that described in the smartphone/laptop embodiment illustrated in
[0180] Therefore, in a preferred embodiment, the camera 510 is more powerful than a typical laptop camera and is adapted to viewing the user's face to determine the facial expression from a distance. Camera resolutions on the order of 8-12 megapixels are preferred, although any camera will suffice for the purposes of the invention.
EXAMPLE 5
Television with Set-Top Box
[0181] The next detailed embodiment of the invention illustrated in
[0182] If the STB has a memory and is able to process machine instructions and connect to the internet (over WiFi, Ethernet or similar), the invention may be implemented on the STB (not illustrated). Otherwise, the STB may connect to a remote server 650 to implement the invention. The remote server will take as input the audio and image data gathered by the STB's microphone and camera. The output provided is an image to display in screen 630 and audio output for speakers 640.
[0183] The logic to determine mood 430, speech recognizer 440, and the conversation logic 450, which connects to the Internet 454 to provide data for discussion all operate in a manner identical to the description of
[0184] When setting up the person to be displayed on the screen, the user needs to either select a default display or send a photograph of a person that the user wishes to speak with to the company implementing the invention. In one embodiment, the image is transmitted electronically over the Internet. In another embodiment, the user mails a paper photograph to an office, where the photograph is scanned, and a digital image of the person is stored.
EXAMPLE 6
Robot with a Face
[0185]
[0186] The logic implementing the invention operates in a manner essentially identical to that illustrated in
[0187] There are some notable differences between the present embodiment and that illustrated in
[0188] In one embodiment, the camera is mobile, and the robot rotates the camera so as to continue looking at the user when the user moves. Further, the camera is a three-dimensional camera comprising a structured light illuminator. Preferably, the structured light illuminator is not in a visible frequency, thereby allowing it to ascertain the image of the user's face and all of the contours thereon.
[0189] Structured light involves projecting a known pattern of pixels (often grids or horizontal bars) on to a scene. These patterns deform when striking surfaces, thereby allowing vision systems to calculate the depth and surface information of the objects in the scene. For the present invention, this feature of structured light is useful to calculate and to ascertain the facial features of the user. Structured light could be outside the visible spectrum, for example, infrared light. This allows for the robot to effectively detect the user's facial features without the user being discomforted.
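The depth calculation can be illustrated with the standard triangulation relation for a rectified projector and camera pair, in which depth equals focal length times baseline divided by the observed shift of a pattern feature. This relation and the numbers below are assumptions for illustration; the specification does not state which depth computation is used, and `depth_from_shift` is a hypothetical name.

```python
def depth_from_shift(focal_length_px: float, baseline_m: float, shift_px: float) -> float:
    """Depth in meters from the pixel shift of a projected pattern feature.

    Assumes a rectified projector/camera pair separated by baseline_m.
    """
    if shift_px <= 0:
        raise ValueError("shift must be positive for a point in front of the camera")
    return focal_length_px * baseline_m / shift_px
```

Under these assumptions, a larger shift of the pattern corresponds to a closer surface, so the nose tip of a face would show a larger shift than the cheeks, which is how the contours of the face become measurable.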
[0190] In a preferred embodiment, the robot is completely responsive to voice prompts and has very few buttons, all of which are rather large. This embodiment is preferred because it makes the robot easier to use for elderly and disabled people who might have difficulty pressing small buttons.
[0191] In this disclosure, we have described several embodiments of this broad invention. Persons skilled in the art will definitely have other ideas as to how the teachings of this specification can be used. It is not our intent to limit this broad invention to the embodiments described in the specification. Rather, the invention is limited by the following claims.
[0192] With reference to
[0193] A number of program modules may be stored on the hard disk, magnetic disk 29, optical disk 31, ROM 24 or RAM 25, including an operating system 35, one or more application programs 36, other program modules 37, and program data 38. A user may enter commands and information into the personal computer 20 through input devices such as a keyboard 40 and pointing device 42. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 21 through a serial data interface 46 that is coupled to the system bus, but may be connected by other interfaces, such as a parallel port, game port or a universal serial bus (USB). A monitor 47 or another type of display device is also connected to the system bus 23 via an interface, such as a video adapter 48. In addition to the monitor, personal computers typically include other peripheral output devices (not shown), such as speakers and printers.
[0194] The personal computer 20 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 49, through a packet data network interface to a packet switch data network. The remote computer 49 may be another personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the personal computer 20, although only a memory storage device 50 has been illustrated in
[0195] When used in a LAN networking environment, the personal computer 20 is connected to the local network 51 through a network interface or adapter 53. When used in a WAN networking environment, the personal computer 20 typically includes a modem 54 or other elements for establishing communications over the wide area network 52, such as the Internet. The modem 54, which may be internal or external, is connected to the system bus 23 via the serial port interface 46. In a networked environment, program modules depicted relative to the personal computer 20, or portions thereof, may be stored in the remote memory storage device. It will be appreciated that the network connections shown are exemplary and other elements for establishing a communications link between the computers may be used.
[0196] Typically, a digital data stream from a superconducting digital electronic processing system may have a data rate which exceeds a capability of a room temperature processing system to handle. For example, complex (but not necessarily high data rate) calculations or user interface functions may be more efficiently executed on a general purpose computer than a specialized superconducting digital signal processing system. In that case, the data may be parallelized or decimated to provide a lower clock rate, while retaining essential information for downstream processing.
[0197] The present embodiments are to be considered in all respects as illustrative and not restrictive, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. The invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The disclosure shall be interpreted to encompass all of the various combinations and permutations of the elements, steps, and claims disclosed herein, to the extent consistent, and shall not be limited to specific combinations as provided in the detailed embodiments.