INVOKING AUTOMATED ASSISTANT FUNCTION(S) BASED ON DETECTED GESTURE AND GAZE
20230053873 · 2023-02-23
Inventors
CPC classification
G10L15/22 (PHYSICS)
G06F3/038 (PHYSICS)
G06F3/017 (PHYSICS)
G06F3/167 (PHYSICS)
International classification
Abstract
Invoking one or more previously dormant functions of an automated assistant in response to detecting, based on processing of vision data from one or more vision components: (1) a particular gesture (e.g., of one or more “invocation gestures”) of a user; and/or (2) detecting that a gaze of the user is directed at an assistant device that provides an automated assistant interface (graphical and/or audible) of the automated assistant. For example, the previously dormant function(s) can be invoked in response to detecting the particular gesture, detecting that the gaze of the user is directed at an assistant device for at least a threshold amount of time, and optionally that the particular gesture and the directed gaze of the user co-occur or occur within a threshold temporal proximity of one another.
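The invocation condition described in the abstract (a gaze held for at least a threshold time, and a gesture co-occurring with or temporally near that gaze) can be sketched as follows. This is an illustrative sketch only, not the claimed implementation; the function name, timestamps-in-seconds representation, and the specific threshold values are assumptions.

```python
def should_invoke(gesture_time: float, gaze_start: float, gaze_end: float,
                  min_gaze_s: float = 1.0, max_gap_s: float = 2.0) -> bool:
    """Decide whether to invoke dormant assistant functions.

    Requires the gaze to be held for at least min_gaze_s, and the
    invocation gesture to co-occur with the gaze or fall within
    max_gap_s of the gaze window (hypothetical threshold values).
    """
    if (gaze_end - gaze_start) < min_gaze_s:
        return False  # gaze not held long enough
    if gaze_start <= gesture_time <= gaze_end:
        return True   # gesture and gaze co-occur
    # otherwise require threshold temporal proximity to the gaze window
    gap = min(abs(gesture_time - gaze_start), abs(gesture_time - gaze_end))
    return gap <= max_gap_s
```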
Claims
1. A client device comprising: at least one vision component; at least one microphone; one or more processors; memory operably coupled with the one or more processors, wherein the memory stores instructions that, in response to execution of the instructions by one or more of the processors, cause one or more of the processors to perform the following operations: receiving a stream of vision data that is based on output from the vision component of the client device; receiving a stream of audio data that is based on output from the microphone of the client device; determining, based on processing the vision data: that a gaze of a user is directed toward the client device, and a user profile for the user; determining, based on processing the audio data, that a spoken utterance, included in the audio data: temporally corresponds to the gaze, and has voice characteristics that match the user profile that is determined based on processing the vision data; and in response to determining the gaze of the user, and contingent on determining that the spoken utterance temporally corresponds to the gaze and has the voice characteristics that match the user profile that is determined based on processing the vision data: causing at least one dormant function of the automated assistant to be activated.
2. The client device of claim 1, wherein the at least one dormant function of the automated assistant, that is caused to be activated in response to determining the gaze of the user, and contingent on determining that the spoken utterance temporally corresponds to the gaze and has the voice characteristics that match the user profile that is determined based on processing the vision data comprises: transmitting of data, from the client device, to a remote server associated with the automated assistant.
3. The client device of claim 1, wherein the at least one dormant function of the automated assistant, that is caused to be activated in response to determining the gaze of the user, and contingent on determining that the spoken utterance temporally corresponds to the gaze and has the voice characteristics that match the user profile that is determined based on processing the vision data further comprises: graphically rendering content that is tailored to the user profile.
4. The client device of claim 1, wherein the at least one dormant function of the automated assistant, that is caused to be activated in response to determining the gaze of the user, and contingent on determining that the spoken utterance temporally corresponds to the gaze and has the voice characteristics that match the user profile that is determined based on processing the vision data comprises: automatic speech processing of the audio data.
5. The client device of claim 1, wherein determining, based on processing the vision data, the user profile of the user comprises performing facial recognition based on processing the vision data.
6. The client device of claim 1, wherein determining, based on processing the vision data, that the gaze of the user is directed toward the client device comprises processing the vision data using a trained gaze machine learning model stored locally at the client device.
7. The client device of claim 1, further comprising: determining that the user profile is authorized for the client device; wherein causing the at least one dormant function of the automated assistant to be activated is further contingent on determining that the user profile is authorized for the client device.
8. A method implemented by one or more processors of a client device that facilitates touch-free interaction between one or more users and an automated assistant, the method comprising: processing image frames captured by a camera of the client device; determining, based on processing the image frames: that a gaze of a user is directed toward the client device, and a user profile for the user; processing audio data captured by one or more microphones of the client device; determining, based on processing the audio data, that a spoken utterance, included in the audio data: temporally corresponds to the gaze, and has voice characteristics that match the user profile that is determined based on processing the image frames; and in response to determining the gaze of the user, and contingent on determining that the spoken utterance temporally corresponds to the gaze and has the voice characteristics that match the user profile that is determined based on processing the image frames: causing at least one dormant function of the automated assistant to be activated.
9. The method of claim 8, wherein the at least one dormant function of the automated assistant, that is caused to be activated in response to determining the gaze of the user, and contingent on determining that the spoken utterance temporally corresponds to the gaze and has the voice characteristics that match the user profile that is determined based on processing the image frames comprises: transmitting of data, from the client device, to a remote server associated with the automated assistant.
10. The method of claim 9, wherein the at least one dormant function of the automated assistant, that is caused to be activated in response to determining the gaze of the user, and contingent on determining that the spoken utterance temporally corresponds to the gaze and has the voice characteristics that match the user profile that is determined based on processing the image frames comprises: automatic speech processing of the audio data.
11. The method of claim 8, wherein the at least one dormant function of the automated assistant, that is caused to be activated in response to determining the gaze of the user, and contingent on determining that the spoken utterance temporally corresponds to the gaze and has the voice characteristics that match the user profile that is determined based on processing the image frames further comprises: graphically rendering content that is tailored to the user profile.
12. The method of claim 8, wherein the at least one dormant function of the automated assistant, that is caused to be activated in response to determining the gaze of the user, and contingent on determining that the spoken utterance temporally corresponds to the gaze and has the voice characteristics that match the user profile that is determined based on processing the image frames comprises: automatic speech processing of the audio data.
13. The method of claim 8, wherein determining, based on processing the image frames, the user profile of the user comprises performing facial recognition based on processing at least one of the image frames.
14. The method of claim 13, wherein determining, based on processing the image frames, that the gaze of the user is directed toward the client device comprises processing the image frames using a trained gaze machine learning model stored locally at the client device.
15. The method of claim 8, further comprising: determining that the user profile is authorized for the client device; wherein causing the at least one dormant function of the automated assistant to be activated is further contingent on determining that the user profile is authorized for the client device.
16. A client device, comprising: a vision component; a presence sensor; one or more processors, wherein one or more of the processors are configured to: detect, based on a signal from the presence sensor, that a human is present in an environment of the presence sensor; in response to detecting that the human is present in the environment: activate the vision component to provide a stream of vision data that is based on output from the vision component; process the vision data using at least one trained machine learning model stored locally on the client device to monitor for occurrence of both: an invocation gesture of a user captured by the vision data, and a gaze of the user that is directed toward the client device; detect, based on the monitoring, occurrence of both: the invocation gesture, and the gaze; and in response to detecting the occurrence of both the invocation gesture and the gaze: cause at least one dormant function of an automated assistant to be activated.
17. The client device of claim 16, wherein the at least one dormant function of the automated assistant, that is caused to be activated in response to detecting the occurrence of both the invocation gesture and the gaze comprises: transmitting of data, from the client device, to a remote server associated with the automated assistant.
18. The client device of claim 16, wherein the at least one dormant function of the automated assistant, that is caused to be activated in response to detecting the occurrence of both the invocation gesture and the gaze comprises: automatic speech processing of the audio data.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION
[0037] In various implementations, an instance of an automated assistant client 110, by way of its interactions with one or more cloud-based automated assistant components 130, may form what appears to be, from a user's perspective, a logical instance of an automated assistant 120 with which the user may engage in human-to-computer interactions (e.g., spoken interactions, gesture-based interactions, and/or touch-based interactions). One instance of such an automated assistant 120 is depicted in
[0038] The one or more client devices 106 may include, for example, one or more of: a desktop computing device, a laptop computing device, a tablet computing device, a mobile phone computing device, a computing device of a vehicle of the user (e.g., an in-vehicle communications system, an in-vehicle entertainment system, an in-vehicle navigation system), a standalone interactive speaker (which in some cases may include a vision sensor), a smart appliance such as a smart television (or a standard television equipped with a networked dongle with automated assistant capabilities), and/or a wearable apparatus of the user that includes a computing device (e.g., a watch of the user having a computing device, glasses of the user having a computing device, a virtual or augmented reality computing device). Additional and/or alternative client computing devices may be provided. As noted previously, some client devices 106 may take the form of assistant devices that are primarily designed to facilitate interactions between users and automated assistant 120 (e.g., a standalone interactive device with speaker(s) and a display).
[0039] Client device 106 can be equipped with one or more vision components 107 having one or more fields of view. Vision component(s) 107 may take various forms, such as monographic cameras, stereographic cameras, a LIDAR component, a radar component, etc. The one or more vision components 107 may be used, e.g., by a visual capture module 114, to capture vision frames (e.g., image frames (still images or video)) of an environment in which client device 106 is deployed. These vision frames may then be at least selectively analyzed, e.g., by a gaze and gesture module 116 of invocation engine 115, to monitor for occurrence of: a particular gesture (of one or more candidate gestures) of a user captured by the vision frames and/or a directed gaze from the user (i.e., a gaze that is directed toward the client device 106). The gaze and gesture module 116 can utilize one or more trained machine learning models 117 in monitoring for occurrence of a particular gesture and/or a directed gaze.
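The per-frame monitoring performed by gaze and gesture module 116 might be sketched as below. The model interfaces (a gaze model returning a probability, a gesture model returning a gesture label or `None`) and the 0.8 threshold are assumptions for illustration, not details from the disclosure.

```python
from dataclasses import dataclass
from typing import Iterator, Optional

@dataclass
class FrameResult:
    gaze_directed: bool        # gaze determined to be directed at the device
    gesture: Optional[str]     # detected invocation gesture, if any

def monitor_frames(frames, gaze_model, gesture_model,
                   gaze_threshold: float = 0.8) -> Iterator[FrameResult]:
    """Run each vision frame through locally stored trained models to
    monitor for a directed gaze and a particular invocation gesture."""
    for frame in frames:
        gaze_prob = gaze_model(frame)     # hypothetical model: P(gaze at device)
        gesture = gesture_model(frame)    # e.g. "wave", "thumbs up", or None
        yield FrameResult(gaze_prob >= gaze_threshold, gesture)
```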
[0040] In response to detection of the particular gesture and the directed gaze (and optionally in response to detection of one or more other condition(s) by other conditions module 118), the invocation engine 115 can invoke one or more previously dormant functions of the automated assistant 120. Such dormant functions can include, for example, processing of certain sensor data (e.g., audio data, video, image(s), etc.) and/or rendering (e.g., graphically and/or audibly) of certain content.
[0041] As one non-limiting example, prior to detection of the particular gesture and the directed gaze, vision data and/or audio data captured at the client device 106 can be processed and/or temporarily buffered only locally at the client device 106 (i.e., without transmission to the cloud-based automated assistant component(s) 130). However, in response to detection of the particular gesture and the directed gaze, audio data and/or vision data (e.g., recently buffered data and/or data received after the detection) can be transmitted to the cloud-based automated assistant component(s) 130 for further processing. For example, the detection of the particular gesture and the directed gaze can obviate a need for the user to speak an explicit invocation phrase (e.g., “OK Assistant”) in order to cause a spoken utterance of the user to be fully processed by the automated assistant 120, and responsive content generated by the automated assistant 120 and rendered to the user.
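The local-only buffering described above, where sensor data stays on the device until invocation and only recently buffered data is then transmitted, can be illustrated with a bounded buffer. The class name and chunk representation are assumptions; a real implementation would hand the flushed chunks to a network transport.

```python
from collections import deque

class LocalAudioBuffer:
    """Temporarily buffers audio chunks locally; nothing leaves the
    device until the gesture-plus-gaze invocation is detected."""

    def __init__(self, max_chunks: int = 50):
        # bounded: the oldest chunks fall off, so only recent data is kept
        self._buffer = deque(maxlen=max_chunks)

    def append(self, chunk: bytes) -> None:
        self._buffer.append(chunk)

    def flush_for_transmission(self) -> list:
        """On invocation, return the recently buffered chunks (to be sent
        to the cloud-based components) and clear the local copy."""
        chunks = list(self._buffer)
        self._buffer.clear()
        return chunks
```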
[0042] For instance, rather than the user needing to speak “OK Assistant, what's today's forecast” to obtain today's forecast, the user could instead: perform a particular gesture, look at the client device 106, and speak only “what's today's forecast” during or temporally near (e.g., within a threshold of time before and/or after) performing the gesture and/or looking at the client device 106. Data corresponding to the spoken utterance “What's today's forecast” (e.g., audio data that captures the spoken utterance, or a textual or other semantic conversion thereof) can be transmitted by the client device 106 to the cloud-based automated assistant component(s) 130 in response to detecting the gesture and the directed gaze, and in response to the spoken utterance being received during and/or temporally near the gesture and directed gaze. In another example, rather than the user needing to speak “OK Assistant, turn up the heat” to increase the temperature of his/her home via a connected thermostat, the user could instead: perform a particular gesture, look at the client device 106, and speak only “turn up the heat” during or temporally near (e.g., within a threshold of time before and/or after) performing the gesture and/or looking at the client device 106. Data corresponding to the spoken utterance “turn up the heat” (e.g., audio data that captures the spoken utterance, or a textual or other semantic conversion thereof) can be transmitted by the client device 106 to the cloud-based automated assistant component(s) 130 in response to detecting the gesture and the directed gaze, and in response to the spoken utterance being received during and/or temporally near the gesture and directed gaze. 
In another example, rather than the user needing to speak “OK Assistant, open the garage door” to open his/her garage, the user could instead: perform a particular gesture, look at the client device 106, and speak only “open the garage door” during or temporally near (e.g., within a threshold of time before and/or after) performing the gesture and/or looking at the client device 106. Data corresponding to the spoken utterance “open the garage door” (e.g., audio data that captures the spoken utterance, or a textual or other semantic conversion thereof) can be transmitted by the client device 106 to the cloud-based automated assistant component(s) 130 in response to detecting the gesture and the directed gaze, and in response to the spoken utterance being received during and/or temporally near the gesture and directed gaze. In some implementations, the transmission of the data by the client device 106 can be further contingent on the other conditions module 118 determining the occurrence of one or more additional conditions. For example, the transmission of the data can be further based on local voice activity detection processing of the audio data, performed by the other conditions module 118, indicating that voice activity is present in the audio data. Also, for example, the transmission of the data can additionally or alternatively be further based on determining, by the other conditions module 118, that the audio data corresponds to the user that provided the gesture and the directed gaze. For instance, a direction of the user (relative to the client device 106) can be determined based on the vision data, and the transmission of the data can be further based on determining, by the other conditions module 118, that a spoken utterance in the audio data comes from the same direction (e.g., using beamforming and/or other techniques).
Also, for instance, a user profile of the user can be determined based on the vision data (e.g., using facial recognition) and the transmission of the data can be further based on determining, by the other conditions module 118, that a spoken utterance in the audio data has voice characteristics that match the user profile. As yet another example, transmission of the data can additionally or alternatively be further based on determining, by the other conditions module 118 based on vision data, that mouth movement of the user co-occurred with the detected gesture and/or directed gaze of the user, or occurred with a threshold amount of time of the detected gesture and/or directed gaze. The other conditions module 118 can optionally utilize one or more other machine learning models 119 in determining that other condition(s) are present. Additional description of implementations of gaze and gesture module 116, and of the other conditions module 118, is provided herein (e.g., with reference to
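The additional gating performed by the other conditions module 118 (voice activity present, and the utterance arriving from the same direction as the detected user) might be combined as in the sketch below. The function name, degree-based direction representation, and the 20-degree tolerance are illustrative assumptions.

```python
def should_transmit(gesture_and_gaze: bool,
                    voice_activity: bool,
                    utterance_direction_deg: float,
                    user_direction_deg: float,
                    direction_tolerance_deg: float = 20.0) -> bool:
    """Gate transmission to the cloud on the gesture/gaze detection plus
    additional conditions: voice activity detected in the audio data, and
    the utterance arriving (e.g., via beamforming) from approximately the
    same direction as the user who provided the gesture and gaze."""
    if not (gesture_and_gaze and voice_activity):
        return False
    return abs(utterance_direction_deg - user_direction_deg) <= direction_tolerance_deg
```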
[0043] Each of client computing device 106 and computing device(s) operating cloud-based automated assistant components 130 may include one or more memories for storage of data and software applications, one or more processors for accessing data and executing applications, and other components that facilitate communication over a network. The operations performed by client computing device 106 and/or by automated assistant 120 may be distributed across multiple computer systems. Automated assistant 120 may be implemented as, for example, computer programs running on one or more computers in one or more locations that are coupled to each other through a network.
[0044] As noted above, in various implementations, client computing device 106 may operate an automated assistant client 110. In some of those various implementations, automated assistant client 110 may include a speech capture module 112, the aforementioned visual capture module 114, and an invocation engine 115, which can include the gaze and gesture module 116 and optionally the other conditions module 118. In other implementations, one or more aspects of speech capture module 112, visual capture module 114, and/or invocation engine 115 may be implemented separately from automated assistant client 110, e.g., by one or more cloud-based automated assistant components 130.
[0045] In various implementations, speech capture module 112, which may be implemented using any combination of hardware and software, may interface with hardware such as a microphone(s) 109 or other pressure sensor to capture an audio recording of a user's spoken utterance(s). Various types of processing may be performed on this audio recording for various purposes, as will be described below. In various implementations, visual capture module 114, which may be implemented using any combination of hardware or software, may be configured to interface with visual component 107 to capture one or more vision frames (e.g., digital images), that correspond to an optionally adaptable field of view of the vision sensor 107.
[0046] Speech capture module 112 may be configured to capture a user's speech, e.g., via a microphone(s) 109, as mentioned previously. Additionally or alternatively, in some implementations, speech capture module 112 may be further configured to convert that captured audio to text and/or to other representations or embeddings, e.g., using speech-to-text (“STT”) processing techniques. However, because client device 106 may be relatively constrained in terms of computing resources (e.g., processor cycles, memory, battery, etc.), speech capture module 112 local to client device 106 may be configured to convert a finite number of different spoken phrases, such as phrases that invoke automated assistant 120, to text (or to other forms, such as lower-dimensionality embeddings). Other speech input may be sent to cloud-based automated assistant components 130, which may include a cloud-based STT module 132.
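The local-first routing described in [0046], where a resource-constrained on-device recognizer handles a finite set of phrases and everything else is forwarded to cloud STT, could be sketched as follows. The callable interfaces (a local recognizer returning `None` for unrecognized speech, and a cloud send function) are illustrative assumptions.

```python
def process_locally_or_route(audio_chunk, local_stt, cloud_stt_send):
    """Try the resource-constrained on-device recognizer first; it only
    knows a finite set of phrases (e.g., invocation phrases). Speech it
    cannot recognize is sent on to the cloud-based STT module."""
    text = local_stt(audio_chunk)   # hypothetical: returns None if unrecognized
    if text is not None:
        return ("local", text)
    cloud_stt_send(audio_chunk)     # forward to cloud-based STT module 132
    return ("cloud", None)
```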
[0047] Cloud-based TTS module 131 may be configured to leverage the virtually limitless resources of the cloud to convert textual data (e.g., natural language responses formulated by automated assistant 120) into computer-generated speech output. In some implementations, TTS module 131 may provide the computer-generated speech output to client device 106 to be output directly, e.g., using one or more speakers. In other implementations, textual data (e.g., natural language responses) generated by automated assistant 120 may be provided to client device 106, and a local TTS module of client device 106 may then convert the textual data into computer-generated speech that is output locally.
[0048] Cloud-based STT module 132 may be configured to leverage the virtually limitless resources of the cloud to convert audio data captured by speech capture module 112 into text, which may then be provided to natural language understanding module 135. In some implementations, cloud-based STT module 132 may convert an audio recording of speech to one or more phonemes, and then convert the one or more phonemes to text. Additionally or alternatively, in some implementations, STT module 132 may employ a state decoding graph. In some implementations, STT module 132 may generate a plurality of candidate textual interpretations of the user's utterance, and utilize one or more techniques to select a given interpretation from the candidates.
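One simple technique for the last step in [0048], selecting a given interpretation from a plurality of candidates, is to pick the highest-scoring candidate. The (text, score) pair representation is an assumption; the disclosure does not specify the selection technique.

```python
def select_interpretation(candidates):
    """Given candidate textual interpretations of an utterance as
    (text, score) pairs, select the highest-scoring one."""
    best_text, _ = max(candidates, key=lambda pair: pair[1])
    return best_text
```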
[0049] Automated assistant 120 (and in particular, cloud-based automated assistant components 130) may include an intent understanding module 135, the aforementioned TTS module 131, the aforementioned STT module 132, and other components that are described in more detail herein. In some implementations, one or more of the modules of automated assistant 120 may be omitted, combined, and/or implemented in a component that is separate from automated assistant 120. In some implementations, one or more of the components of automated assistant 120, such as intent understanding module 135, TTS module 131, STT module 132, etc., may be implemented at least in part on client devices 106 (e.g., in combination with, or to the exclusion of, the cloud-based implementations).
[0050] In some implementations, automated assistant 120 generates various content for audible and/or graphical rendering to a user via the client device 106. For example, automated assistant 120 may generate content such as a weather forecast, a daily schedule, etc., and can cause the content to be rendered in response to detecting a gesture and/or directed gaze from the user as described herein. Also, for example, automated assistant 120 may generate content in response to a free-form natural language input of the user provided via client device 106, in response to gestures of the user that are detected via vision data from visual component 107 of client device, etc. As used herein, free-form input is input that is formulated by a user and that is not constrained to a group of options presented for selection by the user. The free-form input can be, for example, typed input and/or spoken input.
[0051] Natural language processor 133 of intent understanding module 135 processes natural language input generated by user(s) via client device 106 and may generate annotated output (e.g., in textual form) for use by one or more other components of automated assistant 120. For example, the natural language processor 133 may process natural language free-form input that is generated by a user via one or more user interface input devices of client device 106. The generated annotated output includes one or more annotations of the natural language input and one or more (e.g., all) of the terms of the natural language input.
[0052] In some implementations, the natural language processor 133 is configured to identify and annotate various types of grammatical information in natural language input. For example, the natural language processor 133 may include a morphological module that may separate individual words into morphemes and/or annotate the morphemes, e.g., with their classes. Natural language processor 133 may also include a part of speech tagger configured to annotate terms with their grammatical roles. Also, for example, in some implementations the natural language processor 133 may additionally and/or alternatively include a dependency parser (not depicted) configured to determine syntactic relationships between terms in natural language input.
[0053] In some implementations, the natural language processor 133 may additionally and/or alternatively include an entity tagger (not depicted) configured to annotate entity references in one or more segments such as references to people (including, for instance, literary characters, celebrities, public figures, etc.), organizations, locations (real and imaginary), and so forth. In some implementations, data about entities may be stored in one or more databases, such as in a knowledge graph (not depicted), and the entity tagger of the natural language processor 133 can utilize such database(s) in entity tagging.
[0054] In some implementations, the natural language processor 133 may additionally and/or alternatively include a coreference resolver (not depicted) configured to group, or “cluster,” references to the same entity based on one or more contextual cues. For example, the coreference resolver may be utilized to resolve the term “there” to “Hypothetical Café” in the natural language input “I liked Hypothetical Café last time we ate there.”
[0055] In some implementations, one or more components of the natural language processor 133 may rely on annotations from one or more other components of the natural language processor 133. For example, in some implementations the named entity tagger may rely on annotations from the coreference resolver and/or dependency parser in annotating all mentions to a particular entity. Also, for example, in some implementations the coreference resolver may rely on annotations from the dependency parser in clustering references to the same entity. In some implementations, in processing a particular natural language input, one or more components of the natural language processor 133 may use related prior input and/or other related data outside of the particular natural language input to determine one or more annotations.
[0056] Intent understanding module 135 may also include an intent matcher 134 that is configured to determine an intent of a user engaged in an interaction with automated assistant 120. While depicted separately from natural language processor 133 in
[0057] Intent matcher 134 may use various techniques to determine an intent of the user, e.g., based on output from natural language processor 133 (which may include annotations and terms of the natural language input), based on user touch inputs at a touch-sensitive display of client device 106, and/or based on gestures and/or other visual cues detected in vision data. In some implementations, intent matcher 134 may have access to one or more databases (not depicted) that include, for instance, a plurality of mappings between grammars and responsive actions (or more generally, intents), visual cues and responsive actions, and/or touch inputs and responsive actions. For example, the grammars included in the mappings can be selected and/or learned over time, and may represent common intents of users. For example, one grammar, “play <artist>”, may be mapped to an intent that invokes a responsive action that causes music by the <artist> to be played on the client device 106 operated by the user. Another grammar, “[weather|forecast] today,” may be match-able to user queries such as “what's the weather today” and “what's the forecast for today?” As another example, the visual cue to action mappings can include “general” mappings that are applicable to multiple users (e.g., all users) and/or user-specific mappings. Some examples of visual cue to action mappings include mappings for gestures. For instance, a “wave” gesture can be mapped to an action of causing tailored content (tailored to the user providing the gesture) to be rendered to the user; a “thumbs up” gesture can be mapped to a “play music” action; and a “high five” gesture can be mapped to a “routine” of automated assistant actions to be performed, such as turning on a smart coffee maker, turning on certain smart lights, and audibly rendering a news summary.
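The grammar-to-intent and visual-cue-to-action mappings of [0057] can be illustrated with a minimal sketch. The pattern syntax (regular expressions standing in for grammars), the intent names, and the routine contents are illustrative assumptions, not the claimed database format.

```python
import re

# Hypothetical grammar-to-intent mappings; <artist> is a slot (named group).
GRAMMAR_TO_INTENT = [
    (r"play (?P<artist>.+)", "play_music"),
    (r"(weather|forecast) today", "get_forecast"),
]

# Hypothetical visual-cue-to-action mappings, per the examples above.
VISUAL_CUE_TO_ACTION = {
    "wave": "render_tailored_content",
    "thumbs up": "play_music",
    "high five": "morning_routine",  # e.g., coffee maker, lights, news summary
}

def match_intent(utterance: str):
    """Return (intent, slot values) for the first grammar that matches,
    or (None, {}) when no grammar applies."""
    text = utterance.strip().lower()
    for pattern, intent in GRAMMAR_TO_INTENT:
        m = re.search(pattern, text)
        if m:
            return intent, m.groupdict()
    return None, {}
```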
[0058] In addition to or instead of grammars, in some implementations, intent matcher 134 may employ one or more trained machine learning models, alone or in combination with one or more grammars, visual cues, and/or touch inputs. These trained machine learning models may also be stored in one or more databases and may be trained to identify intents, e.g., by embedding data indicative of a user's utterance and/or any detected user-provided visual cues into a reduced dimensionality space, and then determining which other embeddings (and therefore, intents) are most proximate, e.g., using techniques such as Euclidean distance, cosine similarity, etc.
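The embedding-proximity technique of [0058], embedding the utterance and selecting the most proximate intent embedding, can be sketched with cosine similarity over plain lists. The embeddings here are toy vectors; in practice they would come from a trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_intent(utterance_embedding, intent_embeddings):
    """Return the intent whose reduced-dimensionality embedding is most
    proximate to the utterance embedding, by cosine similarity."""
    return max(intent_embeddings,
               key=lambda intent: cosine_similarity(utterance_embedding,
                                                    intent_embeddings[intent]))
```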
[0059] As seen in the “play <artist>” example grammar above, some grammars have slots (e.g., <artist>) that can be filled with slot values (or “parameters”). Slot values may be determined in various ways. Often users will provide the slot values proactively. For example, for a grammar “Order me a <topping> pizza,” a user may likely speak the phrase “order me a sausage pizza,” in which case the slot <topping> is filled automatically. Additionally or alternatively, if a user invokes a grammar that includes slots to be filled with slot values, without the user proactively providing the slot values, automated assistant 120 may solicit those slot values from the user (e.g., “what type of crust do you want on your pizza?”). In some implementations, slots may be filled with slot values based on visual cues detected based on vision data captured by vision component 107. For example, a user could utter something like “Order me this many cat bowls” while holding up three fingers to visual component 107 of client device 106. Or, a user could utter something like “Find me more movies like this” while holding up a DVD case for a particular movie.
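The slot-filling flow of [0059], merging proactively provided values with values inferred from visual cues and soliciting whatever remains, might look like this. The `REQUIRED_SLOTS` table, the `finger_count` cue key, and the function name are illustrative assumptions.

```python
# Hypothetical per-intent required slots.
REQUIRED_SLOTS = {"order_pizza": ["topping", "crust"]}

def fill_slots(intent, provided_slots, visual_cues=None):
    """Merge proactively provided slot values with values inferred from
    visual cues (e.g., three raised fingers -> quantity 3), and report
    which required slots must still be solicited from the user."""
    slots = dict(provided_slots)
    if visual_cues and "finger_count" in visual_cues:
        # only fill from the visual cue if the user did not say a quantity
        slots.setdefault("quantity", visual_cues["finger_count"])
    missing = [s for s in REQUIRED_SLOTS.get(intent, []) if s not in slots]
    return slots, missing
```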
[0060] In some implementations, automated assistant 120 may facilitate (or “broker”) transactions between users and agents, which may be independent software processes that receive input and provide responsive output. Some agents may take the form of third party applications that may or may not operate on computing systems that are separate from those that operate, for instance, cloud-based automated assistant components 130. One kind of user intent that may be identified by intent matcher 134 is to engage a third party application. For example, automated assistant 120 may provide access to an application programming interface (“API”) to a pizza delivery service. A user may invoke automated assistant 120 and provide a command such as “I'd like to order a pizza.” Intent matcher 134 may map this command to a grammar that triggers automated assistant 120 to engage with the third party pizza delivery service. The third party pizza delivery service may provide automated assistant 120 with a minimum list of slots that need to be filled in order to fulfill a pizza delivery order. Automated assistant 120 may generate and provide to the user (via client device 106) natural language output that solicits parameters for the slots.
[0061] Fulfillment module 138 may be configured to receive the predicted/estimated intent that is output by intent matcher 134, as well as any associated slot values (whether provided by the user proactively or solicited from the user), and fulfill (or “resolve”) the intent. In various implementations, fulfillment (or “resolution”) of the user's intent may cause various fulfillment information (also referred to as “responsive” information or data) to be generated/obtained, e.g., by fulfillment module 138.
[0062] Fulfillment information may take various forms because an intent can be fulfilled in a variety of ways. Suppose a user requests pure information, such as “Where were the outdoor shots of ‘The Shining’ filmed?” The intent of the user may be determined, e.g., by intent matcher 134, as being a search query. The intent and content of the search query may be provided to fulfillment module 138, which as depicted in
[0063] Additionally or alternatively, fulfillment module 138 may be configured to receive, e.g., from intent understanding module 135, a user's intent and any slot values provided by the user or determined using other means (e.g., GPS coordinates of the user, user preferences, etc.) and trigger a responsive action. Responsive actions may include, for instance, ordering a good/service, starting a timer, setting a reminder, initiating a phone call, playing media, sending a message, initiating a routine of multiple actions, etc. In some such implementations, fulfillment information may include slot values associated with the fulfillment, confirmation responses (which may be selected from predetermined responses in some cases), etc.
[0064] Additionally or alternatively, fulfillment module 138 may be configured to infer intent(s) of a user (e.g., based on time of day, past interactions, etc.) and obtain responsive information for those intent(s). For example, the fulfillment module 138 can be configured to obtain a daily calendar summary for a user, a weather forecast for the user, and/or other content for the user. The fulfillment module 138 can further cause such content to be “pushed” for graphical and/or audible rendering to the user. For example, the rendering of such content can be the dormant functionality that is invoked in response to invocation engine 115 detecting the occurrence of a particular gesture and a directed gaze.
[0065] Natural language generator 136 may be configured to generate and/or select natural language output (e.g., words/phrases that are designed to mimic human speech) based on data obtained from various sources. In some implementations, natural language generator 136 may be configured to receive, as input, fulfillment information associated with fulfillment of an intent, and to generate natural language output based on the fulfillment information. Additionally or alternatively, natural language generator 136 may receive information from other sources, such as third party applications, which it may use to compose natural language output for the user.
[0066] Referring now to
[0067] Turning initially to
[0068] The gaze and gesture module 116 processes the vision frames using one or more machine learning models 117 to monitor for the occurrence of both a directed gaze and a particular gesture. When both the directed gaze and the particular gesture are detected, the gaze and gesture module 116 provides an indication of detection of the gaze and gesture to invocation engine 115.
[0069] In
[0070] When the invocation engine 115 receives an indication of the directed gaze and gesture, and a temporally proximate indication of the other conditions, the invocation engine 115 causes invocation of dormant function(s) 101. For example, the invocation of the dormant function(s) 101 can include one or more of: activating a display screen of the client device 106; causing content to be visually and/or audibly rendered by the client device 106; causing visual frames and/or audio data to be transmitted by the client device 106 to one or more cloud-based automated assistant component(s) 130; etc.
[0071] In some implementations, and as described in more detail with respect to
[0072] In some other implementations, the gaze and gesture module 116 can utilize an end-to-end machine learning model that accepts, as input, visual frames (or features thereof) and that can be utilized to generate (based on processing of the input over the model) output that indicates whether a particular gesture and a directed gaze have occurred. Such a machine learning model can be, for example, a neural network model, such as a recurrent neural network (RNN) model that includes one or more memory layers (e.g., long short-term memory (LSTM) layer(s)). Training of such an RNN model can be based on training examples that include, as training example input, a sequence of visual frames (e.g., a video) and, as training example output, an indication of whether the sequence includes both a gesture and a directed gaze. For example, the training example output can be a single value that indicates whether both the gesture and directed gaze are present. As another example, the training example output can include a first value that indicates whether a directed gaze is present and N additional values that each indicate whether a corresponding one of N gestures is included (thereby enabling training of the model to predict a corresponding probability for each of N separate gestures). As yet another example, the training example output can include a first value that indicates whether a directed gaze is present and a second value that indicates whether any of one or more particular gestures is present (thereby enabling training of the model to predict a probability that corresponds to whether any gesture is included).
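The “first value plus N additional values” output encoding described above can be sketched as follows. The gesture vocabulary and the use of plain Python lists for targets are illustrative assumptions; a real pipeline would emit tensors:

```python
GESTURES = ["wave", "thumbs_up", "high_five"]  # hypothetical N = 3 gestures

def make_training_target(gaze_present, gestures_present):
    """Build a 1 + N training target: one value indicating whether a directed
    gaze is present, then one value per gesture class."""
    target = [1.0 if gaze_present else 0.0]
    target += [1.0 if g in gestures_present else 0.0 for g in GESTURES]
    return target

print(make_training_target(True, {"wave"}))
# -> [1.0, 1.0, 0.0, 0.0]
```

The single-value variant described first would simply be `1.0` only when both the gaze flag and at least one gesture flag are set.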
[0073] In implementations where the model is trained to predict a corresponding probability for each of N separate gestures, the gaze and gesture module 116 can optionally provide invocation engine 115 with an indication of which of the N gestures occurred. Further, the invocation of the dormant functions 101 by the invocation engine 115 can be dependent on which of the N separate gestures occurred. For example, for a “wave” gesture the invocation engine 115 can cause certain content to be rendered on a display screen of the client device; for a “thumbs up” gesture the invocation engine 115 can cause audio data and/or visual frame(s) to be transmitted to cloud-based automated assistant component(s) 130; and for a “high five” gesture the invocation engine 115 can cause a “routine” of automated assistant actions to be performed, such as turning on a smart coffee maker, turning on certain smart lights, and audibly rendering a news summary.
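The gesture-dependent invocation just described amounts to a dispatch table. The action names below are illustrative placeholders for the dormant functions 101, not identifiers from this disclosure:

```python
# Hypothetical mapping from a detected gesture to the dormant function(s)
# the invocation engine activates for it.
GESTURE_ACTIONS = {
    "wave": ["render_content_on_display"],
    "thumbs_up": ["stream_audio_to_cloud", "stream_frames_to_cloud"],
    "high_five": ["coffee_maker_on", "smart_lights_on", "render_news_summary"],
}

def actions_for_gesture(gesture):
    """Return the dormant function(s) to invoke for a detected gesture."""
    return GESTURE_ACTIONS.get(gesture, [])
```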
[0074]
[0075] In
[0076] In the example of
[0077] The gesture module 116A can use one or more gesture machine learning models 117A for detecting a particular gesture. Such a machine learning model can be, for example, a neural network model, such as an RNN model that includes one or more memory layers. Training of such an RNN model can be based on training examples that include, as training example input, a sequence of visual frames (e.g., a video) and, as training example output, an indication of whether the sequence includes one or more particular gestures. For example, the training example output can be a single value that indicates whether a single particular gesture is present. For instance, the single value can be a “0” when the single particular gesture is not present and a “1” when the single particular gesture is present. In some of those examples, multiple gesture machine learning models 117A are utilized, each tailored to a different single particular gesture. As another example, the training example output can include N values that each indicate whether a corresponding one of N gestures is included (thereby enabling training of the model to predict a corresponding probability for each of N separate gestures). In implementations where the model is trained to predict a corresponding probability for each of N separate gestures, the gesture module 116A can optionally provide invocation engine 115 with an indication of which of the N gestures occurred. Further, the invocation of the dormant functions by the invocation engine 115 can be dependent on which of the N separate gestures occurred.
[0078] The gaze module 116B can use one or more gaze machine learning models 117B for detecting a directed gaze. Such a machine learning model can be, for example, a neural network model, such as a convolutional neural network (CNN) model. Training of such a CNN model can be based on training examples that include, as training example input, a visual frame (e.g., an image) and, as training example output, an indication of whether the image includes a directed gaze. For example, the training example output can be a single value that indicates whether a directed gaze is present. For instance, the single value can be a “0” when no directed gaze is present, a “1” when a gaze is present that is directed directly at, or within 5 degrees of, a sensor that captures the image, a “0.75” when a gaze is present that is directed within 5-10 degrees of a sensor that captures the image, etc.
[0079] In some of those and/or other implementations, the gaze module 116B determines a directed gaze only when a directed gaze is detected with at least a threshold probability and/or for at least a threshold duration. For example, a stream of image frames can be processed using the CNN model and processing each frame can result in a corresponding probability that the frame includes a directed gaze. The gaze module can determine there is a directed gaze only if at least X % of a sequence of image frames (that corresponds to the threshold duration) has a corresponding probability that satisfies a threshold. For instance, assume X % is 60%, the probability threshold is 0.7, and the threshold duration is 0.5 seconds. Further assume 10 image frames correspond to 0.5 seconds. If the image frames are processed to generate probabilities of [0.75, 0.85, 0.5, 0.4, 0.9, 0.95, 0.85, 0.89, 0.6, 0.85], a directed gaze can be detected since 70% of the frames indicated a directed gaze with a probability that is greater than 0.7. In these and other manners, even when a user briefly diverts his/her gaze direction, a directed gaze can be detected. Additional and/or alternative machine learning models (e.g., RNN models) and/or techniques can be utilized to detect a directed gaze that occurs with at least a threshold duration.
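The windowed detection logic above can be sketched directly, using the example numbers from the paragraph (60% of frames, probability threshold 0.7, a 10-frame window covering 0.5 seconds). The function name is illustrative:

```python
def directed_gaze_detected(frame_probs, prob_threshold=0.7, fraction=0.6):
    """Detect a directed gaze when at least `fraction` of the frames in the
    window have a probability exceeding `prob_threshold`, so that a brief
    diversion of the user's gaze does not defeat detection."""
    hits = sum(1 for p in frame_probs if p > prob_threshold)
    return hits / len(frame_probs) >= fraction

# The ten per-frame probabilities from the example above (a 0.5 s window):
probs = [0.75, 0.85, 0.5, 0.4, 0.9, 0.95, 0.85, 0.89, 0.6, 0.85]
print(directed_gaze_detected(probs))  # -> True: 7 of 10 frames exceed 0.7
```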
[0080]
[0081] In
[0082] In some implementations, the gesture module 116A can utilize the provided region(s) to process only corresponding portion(s) of each vision frame. For example, the gesture module 116A can “crop” and resize the vision frames to process only those portion(s) that include human and/or body region(s). In some of those implementations, the gesture machine learning model(s) 117A can be trained based on vision frames that are “cropped” and the resizing can be to a size that conforms to input dimensions of such a model. In some additional or alternative implementations, the gesture module 116A can utilize the provided region(s) to skip processing of some vision frames altogether (e.g., those indicated as not including human and/or body regions). In yet other implementations, the gesture module 116A can utilize the provided region(s) as an attention mechanism (e.g., as a separate attention input to the gesture machine learning model 117A) to focus the processing of each vision frame.
[0083] Likewise, in some implementations, the gaze module 116B can utilize the provided region(s) to process only corresponding portion(s) of each vision frame. For example, the gaze module 116B can “crop” and resize the vision frames to process only those portion(s) that include human and/or face region(s). In some of those implementations, the gaze machine learning model 117B can be trained based on vision frames that are “cropped” and the resizing can be to a size that conforms to input dimensions of such a model. In some additional or alternative implementations, the gaze module 116B can utilize the provided region(s) to skip processing of some vision frames altogether (e.g., those indicated as not including human and/or face regions). In yet other implementations, the gaze module 116B can utilize the provided region(s) as an attention mechanism (e.g., as a separate attention input to the gaze machine learning model 117B) to focus the processing of each vision frame.
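The region-based cropping and frame-skipping described in the two paragraphs above can be sketched as follows. The nested-list frame representation and the (top, left, bottom, right) box convention are illustrative assumptions; a real implementation would operate on image tensors and also resize each crop to the model's input dimensions:

```python
def crop_region(frame, box):
    """Crop a row-major, nested-list frame to a bounding box
    (top, left, bottom, right); bottom and right are exclusive."""
    top, left, bottom, right = box
    return [row[left:right] for row in frame[top:bottom]]

def regions_to_process(frames_with_regions):
    """Yield only the cropped region(s) of frames that have at least one
    detected human/body/face region; frames with none are skipped altogether."""
    for frame, boxes in frames_with_regions:
        for box in boxes:
            yield crop_region(frame, box)
```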
[0084] In some implementations, detection and classification module 116C can additionally or alternatively provide indications of certain region(s) to other conditions module 118 (not depicted in
[0085] In some implementations, detection and classification module 116C can additionally or alternatively provide, to gesture module 116A and gaze module 116B, indications of region(s) that are classified as TVs or other video display sources. In some of those implementations, the modules 116A and 116B can crop those region(s) out of processed vision frames, focus attention away from those regions, and/or otherwise ignore those regions in detections or lessen the chances that detections will be based on such regions. In these and other manners, false-positive invocations of dormant function(s) can be mitigated.
[0086]
[0087]
[0088] In image 360, a bounding box 361 is provided and represents a region of the image that can be determined (e.g., by detection and classification module 116C of
[0089] In image 360, a bounding box 362 is also provided and represents a region of the image that can be determined (e.g., by detection and classification module 116C of
[0090] In image 360, a bounding box 363 is also provided and represents a region of the image that can be determined to correspond to a video display and that might raise false positives of visual cues. For example, the television might render video showing one or more individuals making gestures, looking into the camera, etc., any of which could be misinterpreted as occurrence of a gesture and/or directed gaze. In some implementations, detection and classification module 116C of
[0091]
[0092] At block 402, the system receives vision data that is based on output from vision component(s). In some implementations, the vision component(s) can be integrated with a client device that includes an assistant client. In some implementations, the vision component(s) can be separate from, but in communication with, the client device. For example, the vision component(s) can include a stand-alone smart camera that is in wired and/or wireless communication with a client device that includes an assistant client.
[0093] At block 404, the system processes vision data using at least one machine learning model, to monitor for occurrence of both: a gesture and a directed gaze.
[0094] At block 406, the system determines whether both a gesture and a gaze have been detected based on the monitoring of block 404. If not, the system proceeds back to block 402, receives additional vision data, and performs another iteration of blocks 404 and 406. In some implementations, the system determines both a gesture and a gaze have been detected based on detecting that the gesture and directed gaze co-occur or occur within a threshold temporal proximity of one another. In some additional or alternative implementations, the system determines both a gesture and a gaze have been detected based on detecting the gesture is of at least a threshold duration (e.g., “waving” for at least X duration or “thumbs up” for at least X duration) and/or the directed gaze is of at least a threshold duration (which can be the same as or different from that optionally used for the gesture duration).
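The co-occurrence and temporal-proximity check at block 406 can be sketched as below. The specific thresholds (0.5 s for both the gap and the minimum gaze duration) are hypothetical values chosen for illustration:

```python
def should_invoke(gesture_time, gaze_start, gaze_end,
                  max_gap=0.5, min_gaze_duration=0.5):
    """Return True when the directed gaze lasts at least min_gaze_duration
    and the gesture either co-occurs with the gaze or falls within max_gap
    seconds of it (all times in seconds)."""
    if gaze_end - gaze_start < min_gaze_duration:
        return False
    if gaze_start <= gesture_time <= gaze_end:
        return True  # co-occurrence
    gap = min(abs(gesture_time - gaze_start), abs(gesture_time - gaze_end))
    return gap <= max_gap
```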
[0095] If, at an iteration of block 406, the system determines that both a gesture and a gaze have been detected based on the monitoring of block 404, the system optionally proceeds to block 408 (or, when block 408 is not included, directly to block 410).
[0096] At optional block 408, the system determines whether one or more other conditions are satisfied. If not, the system proceeds back to block 402, receives additional vision data, and performs another iteration of blocks 404, 406, and 408. If so, the system proceeds to block 410. The system can determine whether one or more other conditions are satisfied using the vision data received at block 402, audio data, and/or other sensor or non-sensor data. Various other condition(s) can be considered by the system, such as those explicitly described herein.
[0097] At block 410, the system activates one or more inactive automated assistant functions. The system can activate various inactive automated assistant functions, such as those described explicitly herein. In some implementations different types of gestures can be monitored for in block 404, and which inactive function(s) are activated in block 410 can be dependent on the particular type of gesture that is detected in the monitoring of block 404.
[0098] At block 412, the system monitors for deactivation condition(s), for the automated assistant function(s) activated at block 410. Deactivation condition(s) can include, for example, a timeout, at least a threshold duration of lack of detected spoken input and/or detected directed gaze, an explicit stop command (spoken, gestured, or touch inputted), and/or other condition(s).
[0099] At block 414, the system determines whether deactivation condition(s) have been detected based on the monitoring of block 412. If not, the system proceeds back to block 412 and continues to monitor for the deactivation condition(s). If so, the system can deactivate the function(s) activated at block 410, and proceeds back to block 402 to again receive vision data and again monitor for the occurrence of both an invocation gesture and a gaze.
[0100] As one example of blocks 412 and 414, where the activated function(s) include the streaming of audio data to one or more cloud-based automated assistant component(s), the system can stop the streaming in response to detecting a lack of voice activity for at least a threshold duration (e.g., using a VAD), in response to an explicit stop command, or in response to detecting (through continued gaze monitoring) that the user's gaze has not been directed to the client device for at least a threshold duration.
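The deactivation monitoring of blocks 412 and 414 can be sketched as below. The timeout values and the timestamp-based interface are illustrative assumptions; a real system would feed this from a VAD and the continued gaze monitoring described above:

```python
def should_deactivate(now, last_voice_time, last_gaze_time, stop_command,
                      voice_timeout=5.0, gaze_timeout=7.5):
    """Return True when any hypothetical deactivation condition holds: an
    explicit stop command, no detected voice activity for voice_timeout
    seconds, or no directed gaze for gaze_timeout seconds."""
    if stop_command:
        return True
    if now - last_voice_time >= voice_timeout:
        return True
    if now - last_gaze_time >= gaze_timeout:
        return True
    return False
```

On a True result the system would stop the activated function(s), e.g., halt the audio stream to the cloud-based components, and return to block 402.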
[0101] Turning now to
[0102] At block 402A, the system receives and buffers vision data.
[0103] At block 404A, the system processes the vision data using a gesture machine learning model to monitor for occurrence of a gesture. In some implementations, block 404A includes sub-block 404A1, where the system processes a portion of the vision data, using the gesture machine learning model, based on detecting that the portion corresponds to human and/or body regions.
[0104] At block 406A, the system determines whether a gesture has been detected based on the monitoring of block 404A. If not, the system proceeds back to block 402A, receives and buffers additional vision data, and performs another iteration of block 404A.
[0105] If, at an iteration of block 406A, the system determines that a gesture has been detected based on the monitoring of block 404A, the system proceeds to block 404B.
[0106] At block 404B, the system processes buffered and/or additional vision data using a gaze machine learning model to monitor for occurrence of a directed gaze.
[0107] In some implementations, block 404B includes sub-block 404B1, where the system processes a portion of the vision data, using the gaze machine learning model, based on detecting that the portion corresponds to human and/or face regions.
[0108] At block 406B, the system determines whether a directed gaze has been detected based on the monitoring of block 404B. If not, the system proceeds back to block 402A, receives and buffers additional vision data, and performs another iteration of block 404A.
[0109] If, at an iteration of block 406B, the system determines that a directed gaze has been detected based on the monitoring of block 404B, the system proceeds to block 408 or 410 of
[0110] Various examples are described herein of activating dormant assistant function(s) in response to detecting both a particular gesture and a directed gaze. However, in various implementations dormant assistant function(s) can be activated in response to detecting only one of: a particular gesture, and a directed gaze, optionally in combination with one or more other conditions, such as those described herein. For example, in some of those various implementations, dormant assistant function(s) can be activated in response to detecting a directed gaze of a user that is of at least a threshold duration, along with co-occurring other condition(s) such as mouth movement of the user. Also, for example, in some of those various implementations, dormant assistant function(s) can be activated in response to detecting a gesture of a user, along with co-occurring and/or temporally proximal other condition(s) such as detected voice activity.
[0111]
[0112] Computing device 510 typically includes at least one processor 514 which communicates with a number of peripheral devices via bus subsystem 512. These peripheral devices may include a storage subsystem 524, including, for example, a memory subsystem 525 and a file storage subsystem 526, user interface output devices 520, user interface input devices 522, and a network interface subsystem 516. The input and output devices allow user interaction with computing device 510. Network interface subsystem 516 provides an interface to outside networks and is coupled to corresponding interface devices in other computing devices.
[0113] User interface input devices 522 may include a keyboard, pointing devices such as a mouse, trackball, touchpad, or graphics tablet, a scanner, a touchscreen incorporated into the display, audio input devices such as voice recognition systems, microphones, and/or other types of input devices. In general, use of the term “input device” is intended to include all possible types of devices and ways to input information into computing device 510 or onto a communication network.
[0114] User interface output devices 520 may include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem may include a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem may also provide non-visual display such as via audio output devices. In general, use of the term “output device” is intended to include all possible types of devices and ways to output information from computing device 510 to the user or to another machine or computing device.
[0115] Storage subsystem 524 stores programming and data constructs that provide the functionality of some or all of the modules described herein. For example, the storage subsystem 524 may include the logic to perform selected aspects of the method of
[0116] These software modules are generally executed by processor 514 alone or in combination with other processors. Memory 525 used in the storage subsystem 524 can include a number of memories including a main random access memory (RAM) 530 for storage of instructions and data during program execution and a read only memory (ROM) 532 in which fixed instructions are stored. A file storage subsystem 526 can provide persistent storage for program and data files, and may include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. The modules implementing the functionality of certain implementations may be stored by file storage subsystem 526 in the storage subsystem 524, or in other machines accessible by the processor(s) 514.
[0117] Bus subsystem 512 provides a mechanism for letting the various components and subsystems of computing device 510 communicate with each other as intended. Although bus subsystem 512 is shown schematically as a single bus, alternative implementations of the bus subsystem may use multiple busses.
[0118] Computing device 510 can be of varying types including a workstation, server, computing cluster, blade server, server farm, or any other data processing system or computing device. Due to the ever-changing nature of computers and networks, the description of computing device 510 depicted in
[0119] In situations in which the systems described herein collect or otherwise monitor personal information about users, or may make use of personal and/or monitored information, the users may be provided with an opportunity to control whether programs or features collect user information (e.g., information about a user's social network, social actions or activities, profession, a user's preferences, or a user's current geographic location), or to control whether and/or how to receive content from the content server that may be more relevant to the user. Also, certain data may be treated in one or more ways before it is stored or used, so that personal identifiable information is removed. For example, a user's identity may be treated so that no personal identifiable information can be determined for the user, or a user's geographic location may be generalized where geographic location information is obtained (such as to a city, ZIP code, or state level), so that a particular geographic location of a user cannot be determined. Thus, the user may have control over how information is collected about the user and/or used. For example, in some implementations, users may opt out of assistant devices using vision component 107 and/or using vision data from vision component 107 in monitoring for occurrence of gestures and/or directed gazes.