Processing Voice Commands
20220392435 · 2022-12-08
Inventors
CPC classification
G10L15/22
PHYSICS
G06F16/686
PHYSICS
G06F3/167
PHYSICS
G10L17/02
PHYSICS
International classification
G06F16/68
PHYSICS
G10L17/02
PHYSICS
Abstract
Recorded background noises, and other contextual data, may be used to assist in resolving ambiguity in spoken voice commands. The background noises may comprise sounds from entities in a room other than the user issuing the voice commands. One such entity may be a content item being watched by the user, and the captured background noises may comprise audio of the content item. The content item may be identified based on the captured audio of the content item in the background noises, and the identification may be used to interpret the ambiguous voice command. Additional contextual information associated with the voice commands (e.g., identifications of the users in the room) and/or the content item (e.g., the video quality of the content item, a service outputting the content item, a genre of the content item, etc.) may be used to identify the content item.
Claims
1. A method comprising: receiving, by a computing device, audio comprising a voice command and background noise; determining, based on speech recognition, that the voice command is associated with a plurality of devices; identifying, based on a comparison of the background noise to a database of audio fingerprints, a content item audio in the background noise; selecting, based on the identified content item audio, one of the plurality of devices; and causing an action to be executed on the selected one of the plurality of devices.
2. The method of claim 1, further comprising narrowing, based on contextual information associated with the audio, a search space in the database of audio fingerprints, wherein the identifying the content item audio in the background noise comprises searching the narrowed search space for a match to the background noise.
3. The method of claim 1, further comprising: determining, based on the audio, an identity of a user who spoke the voice command; and narrowing, based on one or more viewing characteristics of the user, a search space in the database of audio fingerprints, wherein the identifying the content item audio in the background comprises searching the narrowed search space for a match to the background noise.
4. The method of claim 1, further comprising: receiving a video image associated with the audio; identifying one or more visual objects in the video image; and narrowing, based on the one or more visual objects, a search space in the database of audio fingerprints, wherein the identifying the content item audio in the background comprises searching the narrowed search space for a match to the background noise.
5. The method of claim 1, further comprising: receiving information indicating a video quality of a content item; and narrowing, based on the video quality, a search space in the database of audio fingerprints, wherein the identifying the content item audio in the background comprises searching the narrowed search space for a match to the background noise.
6. The method of claim 1, further comprising: receiving information indicating a content source currently in use; determining content items available from the content source; and narrowing, based on the content items available from the content source, a search space in the database of audio fingerprints, wherein the identifying the content item audio in the background comprises searching the narrowed search space for a match to the background noise.
7. The method of claim 1, wherein the voice command corresponds to: adjusting an audio volume of a content output device; and adjusting a temperature setting on a thermostat.
8. The method of claim 1, wherein the voice command corresponds to: adjusting an audio volume of a content output device; and adjusting a temperature setting on a thermostat, and wherein the identifying the content item audio in the background is further based on: a current temperature in a room associated with the audio; a current volume level of the audio; or one or more content sources or applications currently in use.
9. The method of claim 1, further comprising storing ambiguity resolution data indicating, for the voice command: a plurality of context conditions; and for each of the context conditions, a corresponding action to be taken.
10. The method of claim 1, further comprising: receiving information indicating an application currently in use; and narrowing, based on the application, a search space in the database of audio fingerprints, wherein the identifying the content item audio in the background comprises searching the narrowed search space for a match to the background noise.
11. A method comprising: receiving, by a computing device, audio comprising a voice command; determining, based on speech recognition, a content item audio present in a background of the audio; selecting, based on the content item audio, a voice-enabled device corresponding to the voice command; and causing the selected voice-enabled device to perform the voice command.
12. The method of claim 11, wherein the determining the content item audio comprises: narrowing an audio fingerprint search space based on contextual information associated with the audio; and determining, from the narrowed audio fingerprint search space, a content item matching the background of the audio.
13. The method of claim 11, wherein the determining the content item audio comprises: narrowing an audio fingerprint search space based on information indicating content items available from a content service; and determining, from the narrowed audio fingerprint search space, a content item matching the background of the audio.
14. The method of claim 11, wherein the determining the content item audio comprises: narrowing an audio fingerprint search space based on recognizing a visual object in an image of a screen of a content output device; and determining, from the narrowed audio fingerprint search space, a content item matching the background of the audio.
15. The method of claim 11, further comprising storing information associating the voice command with a plurality of different voice-enabled devices, wherein the information indicates one or more context conditions for each of the different voice-enabled devices.
16. A method comprising: receiving, by a computing device, audio comprising a voice command and background noise; determining, based on speech recognition, that the voice command comprises a request for content recommendation; identifying, based on a comparison of the background noise to a database of audio fingerprints, a content item matching the background noise; generating the content recommendation based on the matching content item; and causing display of the generated content recommendation.
17. The method of claim 16, wherein the identifying the matching content item comprises: narrowing a search space based on contextual information associated with the audio; and identifying, from the narrowed search space, the matching content item.
18. The method of claim 16, wherein the identifying the matching content item comprises: determining, based on identifying one or more objects in an image of a screen of a content output device, a genre of a content item being outputted by the content output device; determining a search space associated with the genre; and searching the search space to find a match between the background noise and audio of the matching content item in the search space.
19. The method of claim 16, wherein the identifying the matching content item comprises: identifying, from an image of a screen of a content output device, a logo; determining a search space comprising content items associated with the logo; and searching the search space to find a match between the background noise and audio of the matching content item in the search space.
20. The method of claim 16, wherein the identifying the matching content item comprises: receiving information indicating an application currently in use; determining a search space comprising content items associated with the application; and searching the search space to find a match between the background noise and audio of the matching content item in the search space.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0006] Some features are shown by way of example, and not by limitation, in the accompanying drawings. In the drawings, like numerals reference similar elements.
DETAILED DESCRIPTION
[0016] The accompanying drawings show examples of various features. It is to be understood that the examples shown in the drawings and/or discussed herein are non-exclusive and that there are other examples of how the disclosure may be practiced.
[0018] The communication links 101 may originate from the local office 103 and may comprise components not shown, such as splitters, filters, amplifiers, etc., to help convey signals clearly. The communication links 101 may be coupled to one or more wireless access points 127 configured to communicate with one or more mobile devices 125 via one or more wireless networks. The mobile devices 125 may comprise smartphones, tablets, or laptop computers with wireless transceivers, tablets or laptop computers communicatively coupled to other devices with wireless transceivers, and/or any other type of device configured to communicate via a wireless network.
[0019] An example premises 102a may comprise an interface 120. The interface 120 may comprise circuitry used to communicate via the communication links 101. The interface 120 may comprise a modem 110, which may comprise transmitters and receivers used to communicate via the communication links 101 with the local office 103. The modem 110 may comprise, for example, a coaxial cable modem (for coaxial cable lines of the communication links 101), a fiber interface node (for fiber optic lines of the communication links 101), a twisted-pair telephone modem, a wireless transceiver, and/or any other desired modem device. One modem is shown in
[0020] The gateway 111 may also comprise one or more local network interfaces to communicate, via one or more local networks, with devices associated with the premises 102a. Example types of local networks comprise Multimedia Over Coax Alliance (MoCA) networks, Ethernet networks, networks communicating via Universal Serial Bus (USB) interfaces, wireless networks (e.g., IEEE 802.11, IEEE 802.15, Bluetooth), networks communicating via in-premises power lines, and others. The lines connecting the interface 120 with the other devices associated with the premises 102a may represent wired or wireless connections, as may be appropriate for the type of local network used. One or more of the devices at the premises 102a may be configured to provide wireless communications channels (e.g., IEEE 802.11 channels) to communicate with one or more of the mobile devices 125, which may be on- or off-premises.
[0021] The devices in the example premises 102a may comprise, e.g., content output devices 112 (e.g., televisions), other devices 113 (e.g., a DVR or STB), personal computers 114, laptop computers 115, wireless devices 116 (e.g., wireless routers, wireless laptops, notebooks, tablets and netbooks, cordless phones (e.g., Digital Enhanced Cordless Telephone—DECT phones), mobile phones, mobile televisions, personal digital assistants (PDA)), landline phones 117 (e.g., Voice over Internet Protocol—VoIP phones), and any other desired devices. The mobile devices 125, one or more of the devices in the premises 102a, and/or other devices may receive, store, output, and/or otherwise use assets. An asset may comprise a video, a game, one or more images, software, audio, text, webpage(s), and/or other content.
[0022] One or more of the devices in the example premises 102a may be voice-enabled devices that may be controlled by voice commands from users in the premises 102a. The voice-enabled devices may be used in many different contexts in the premises 102a, such as controlling video and/or audio output of content output devices in the premises 102a (e.g., “turn on CNN,” “turn the volume up,” etc.), controlling heating or cooling systems in the premises 102a (e.g., “turn thermostat temperature up”), initiating outgoing telephone calls (e.g., “call Aaliyah”), receiving incoming telephone calls (e.g., “accept call”), controlling a home security system (e.g., “enable security system till 7 am”), shopping (e.g., “buy dishwashing liquid”), sending and receiving e-mails, text messaging, web browsing, controlling other devices, searching keywords (e.g., finding a podcast where particular words were spoken), simple data entry (e.g., entering a credit card number), determining user voice characteristics, speech-to-text processing (e.g., word processors or emails), and many others. A voice command processing system may be configured to coordinate activities of the various devices, and may initially process a captured voice command in order to determine the corresponding action to take (e.g., directing the recorded voice command to a particular voice-enabled device, sending a control command based on the voice command, etc.). The voice command processing system may be implemented as hardware processors and/or software executing on a computing device. When a user communicates with a voice-enabled device by speaking, the user's speech is recorded as an audio clip. In addition to the user's voice command (e.g., “show me more like this”), the audio clip comprising the user's speech may also comprise background noises captured during the recording of the audio clip. The background noises may comprise sounds from one or more entities near the user.
The entities may be human beings (e.g., other human beings in the same room as the user or the device), content output devices (e.g., televisions, personal computers, laptop computers, notebooks, tablets, netbooks, mobile phones, etc.), other devices (e.g., cordless phones, etc.), appliances (e.g., heating or cooling system, mowers, leaf blowers, blending machines, etc.) and/or other entities (e.g., toys, pets, etc.). The additional background noises from these entities may help the voice command processing system determine the appropriate course of action for the speaking user.
[0023] The voice command processing system may use speech recognition to interpret one or more requests in the voice commands spoken by the user in the context of the user's speech and determine an appropriate action to be taken based on the voice command. If a voice command processing system cannot interpret a request in a voice command (e.g., if the voice command matches commands for multiple voice-enabled devices), the voice command processing system may use sounds of one or more entities in the background noises to interpret the voice command (e.g., the user says, “show me more programs like this,” and the background noises comprising audio of the content item being watched by the user are used to determine which program the user is watching). After interpreting the voice command, the voice command processing system may trigger, based on the interpreted voice command, an action or a set of actions to process the user requests. The voice command processing system may be able to interpret the request, but may use the sounds of one or more entities in the background noises to determine the accuracy of the interpretation of the request (e.g., the user says, “show me more programs like this,” and the voice command processing system is aware of the program the user is watching but uses the background noises comprising audio of the content item to confirm that the user is indeed watching, and was referring to, that program). Similarly, if the user says “turn it down,” and that command could apply equally to a volume setting of a television and a temperature setting of a thermostat, the system may determine that, because there are no background noises matching any content item, the command was likely intended for the thermostat and not the television.
Additionally, the voice command processing system may comprise one or more cameras that record a video of the speaking user while recording the voice command, and the video captured in the recording may be used to interpret the voice command. The background images may comprise physical gestures made by the user while issuing the voice command (e.g., determining that the speaker is moving his hand from left to right in the video while saying “change” may be used to interpret the voice command as a request to change a channel) or certain activities of the speaking user (e.g., determining that the user is exiting his home while saying “turn on” may be used to interpret the voice command as a request to turn on the home security system).
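The thermostat-versus-television disambiguation described above may be sketched, purely for illustration, as follows. The `match_content_fingerprint` stub and the device action names are hypothetical and are not part of this disclosure:

```python
# Illustrative sketch of resolving an ambiguous "turn it down" command based
# on whether any content item audio is present in the background noise.

def match_content_fingerprint(background_audio):
    """Return an identified content item, or None if the background audio
    matches nothing in the fingerprint database (stubbed for illustration)."""
    return None  # stub: pretend no content item was detected

def resolve_turn_it_down(background_audio):
    # If a content item is audible in the background, the user is likely
    # watching something, so "turn it down" most plausibly targets volume.
    if match_content_fingerprint(background_audio) is not None:
        return "television.volume_down"
    # No content audio detected: the command was likely intended for the
    # thermostat's temperature setting instead.
    return "thermostat.temperature_down"
```

In this sketch, the absence of a fingerprint match is itself a contextual signal, mirroring the example in the paragraph above.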
[0024] The local office 103 may comprise an interface 104. The interface 104 may comprise one or more computing devices configured to send information downstream to, and to receive information upstream from, devices communicating with the local office 103 via the communications links 101. The interface 104 may be configured to manage communications among those devices, to manage communications between those devices and backend devices such as servers 105-107 and 122, and/or to manage communications between those devices and one or more external networks 109. The interface 104 may, for example, comprise one or more routers, one or more base stations, one or more optical line terminals (OLTs), one or more termination systems (e.g., a modular cable modem termination system (M-CMTS), or an integrated cable modem termination system (I-CMTS)), one or more digital subscriber line access modules (DSLAMs), and/or any other computing device(s). The local office 103 may comprise one or more network interfaces 108 that comprise circuitry needed to communicate via the external networks 109. The external networks 109 may comprise networks of Internet devices, telephone networks, wireless networks, wired networks, fiber optic networks, and/or any other desired network. The local office 103 may also or alternatively communicate with the mobile devices 125 via the interface 108 and one or more of the external networks 109, e.g., via one or more of the wireless access points 127.
[0025] The push notification server 105 may be configured to generate push notifications to deliver information to devices in the premises 102 and/or to the mobile devices 125. The content server 106 may be configured to provide content to devices in the premises 102 and/or to the mobile devices 125. This content may comprise, for example, video, audio, text, web pages, images, files, etc. The content server 106 (or, alternatively, an authentication server) may comprise software to validate user identities and entitlements, to locate and retrieve requested content, and/or to initiate delivery (e.g., streaming) of the content. The application server 107 may be configured to offer any desired service. For example, an application server may be responsible for collecting, and generating a download of, information for electronic program guide listings. Another application server may be responsible for monitoring user viewing habits and collecting information from that monitoring for use in selecting advertisements. Yet another application server may be responsible for formatting and inserting advertisements in a video stream being transmitted to devices in the premises 102 and/or to the mobile devices 125. The local office 103 may comprise additional servers, such as the fingerprint analysis server 122 (described below), additional push, content, and/or application servers, and/or other types of servers. Although shown separately, the push server 105, the content server 106, the application server 107, the fingerprint analysis server 122, and/or other server(s) may be combined. The servers 105, 106, 107, and 122, and/or other servers, may be computing devices and may comprise memory storing data and also storing computer executable instructions that, when executed by one or more processors, cause the server(s) to perform steps described herein.
[0026] The fingerprint analysis server 122 may be configured to receive data associated with the sounds of one or more entities in background noises captured by the devices in the premises 102 and 102a and identify the entities based on the received data. For example, the fingerprint analysis server 122 may receive data associated with the audio of a content item (e.g., a movie, a show, a gaming event, an advertisement, live news, etc.) being displayed by one of the devices in the example premises 102a. The fingerprint analysis server 122 may identify the displayed content item based on data associated with the audio of the content item. The fingerprint analysis server 122 may then provide information associated with the identified content item to the devices in the premises 102 and 102a. Additional details of the fingerprint analysis server 122 will be discussed further below.
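For illustration, the lookup performed by the fingerprint analysis server 122, including the contextual narrowing of the search space recited in the claims, might resemble the following sketch. The exact-hash index is a simplifying assumption; practical systems use robust acoustic fingerprints, and all names here are hypothetical:

```python
# Toy fingerprint index: fingerprint -> content item metadata.
FINGERPRINT_DB = {}

def make_fingerprint(audio_samples):
    # Stand-in for an acoustic fingerprint (e.g., spectral peak hashing);
    # here, a simple deterministic hash of the raw samples.
    return hash(tuple(audio_samples))

def identify_content(audio_samples, context=None):
    """Match background audio against the fingerprint database, optionally
    narrowing the search space with contextual information such as genre,
    content source, or application currently in use."""
    candidates = FINGERPRINT_DB
    if context:
        # Keep only entries whose metadata is consistent with the context.
        candidates = {fp: meta for fp, meta in FINGERPRINT_DB.items()
                      if all(meta.get(k) == v for k, v in context.items())}
    return candidates.get(make_fingerprint(audio_samples))
```

Narrowing the candidate set before matching reduces both search cost and the chance of a spurious match, which is the motivation for the contextual-narrowing claims above.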
[0030] The home automation system 305 may monitor and/or control various attributes in the environment 300 and may serve as a voice command processing system. The home automation system 305 may monitor and control various lighting systems (e.g., smart electrical plugs and/or switches, smart lighting, etc.), heating, ventilation, and air conditioning (HVAC) systems (e.g., smart thermostats, smart smoke detectors, etc.), entertainment systems (e.g., multimedia hubs, wearable devices, toy robots, etc.), pet monitoring devices and systems (e.g., electronically controlled dog doors, litter boxes, aquariums, terrariums, etc.), and/or smart appliances in the premises (e.g., a smart oven or stove, a smart coffee machine, smart locks, etc.), and any other devices such as those typically found around the premises. The home automation system 305 may typically connect with the controlled devices, systems, and/or appliances. The home automation system 305 may comprise a variety of devices, such as wall-mounted terminals, tablet or desktop computers, a mobile phone application, a Web interface that may also be accessible off-site through the Internet, the gateway device 304, and/or one of the content output devices (e.g., the content output devices 308, 314, 316, and 318). The home automation system 305 may be voice-enabled and may serve as a voice command processing system to receive and process voice commands from the users 326 and 328 via one or more microphones 305A, the remote control device 310, and/or the digital assistant 312 (e.g., AMAZON ALEXA on the AMAZON ECHO devices, SIRI on an IPHONE, GOOGLE ASSISTANT on GOOGLE-enabled/ANDROID mobile devices, etc.).
[0031] The security system 306 deployed at the environment 300 may communicate with a number of sensors that can be configured to detect various occurrences and/or other changes in state(s) at the environment 300. For example, the security system 306 may include an image sensing or capturing device (e.g., the camera 320) for periodically capturing an image of the environment 300. The camera 320 may be located at any suitable location throughout the environment 300. Furthermore, the camera 320 may be positioned such that the display screens of one or more of the content output devices (e.g., the content output devices 308, 314, 316, and 318) in the environment 300 may be in the field of view of the camera 320. Additionally or alternatively, the security system 306 may use the cameras of the content output devices (e.g., the content output devices 308, 314, 316, and 318) as additional image sensing or capturing devices for capturing images of the environment 300. The security system 306 may also comprise one or more door sensors, one or more window sensors, one or more smoke detectors, one or more glass break sensors, flood sensors, gas leak sensors, and medical sensors. While
[0032] The gateway device 304 may implement one or more aspects of the gateway interface device 111, which was discussed above with respect to
[0033] The gateway device 304 may provide a local area network interface to allow communications among the various devices (e.g., the content output devices 308, 314, 316, and 318, the remote control 310, the digital assistant 312, etc.) and the various systems (the home automation system 305, the security system 306 including the sensor device 320, etc.) in the environment 300. The gateway device 304 may also provide these devices and systems with Internet connectivity and wireless local area networking (WLAN) functionality.
[0034] One or more of the content output devices (e.g., the content output devices 308, 314, 316, and 318) and/or one or more of the systems in the environment 300 (e.g., the home automation system 305, the security system 306, software applications running on any of the devices, etc.) may be voice-enabled devices that are capable of receiving and interpreting voice commands. The voice commands may be received via one or more microphones that are part of or otherwise connected to a particular voice-enabled content output device, the remote control device 310, and/or the digital assistant 312. The voice-enabled devices may further be capable of controlling another device in the environment 300. For example, the content output device 318 may, in response to a voice command, communicate with another device such as the content output device 308 to cause the content output device 308 to record media content or to display media content. The communication between the content output device 308 and the content output device 318 may be a direct communication between the two devices or a communication via an intermediate device, such as the gateway device 304. If the content output device being controlled is itself a voice-enabled device, the content output device may control itself in response to the voice command. For example, if the content output device 308 is a voice-enabled device and has its own one or more microphones, the content output device 308 may, in response to a voice command it receives, record media content and/or display media content. If the content output device being controlled is itself a voice-enabled device and also capable of controlling another device, the content output device may receive a voice command and interpret the voice command to determine, in response to the voice command, whether to control itself and/or control the other device.
If the content output device determines that the voice command is one that is also a voice command for the other device, then the content output device may take steps to avoid redundant action based on the same voice command. For example, the content output device could delay taking action on the command and determine whether the other device takes its corresponding action (e.g., the content output device can determine whether the other device lowers its volume). If the other device does not respond to the command within a predetermined time, then the content output device may take a delayed action in response to the command.
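The delayed-action behavior described above may be sketched as follows. The polling loop, the callback names, and the two-second default window are illustrative assumptions, not specifics from this disclosure:

```python
import time

def handle_shared_command(execute_action, other_device_acted, wait_seconds=2.0):
    """Delay acting on a command that another device may also handle; act
    only if the other device has not responded within the waiting window."""
    deadline = time.monotonic() + wait_seconds
    while time.monotonic() < deadline:
        if other_device_acted():
            return "deferred"        # other device handled it; avoid redundancy
        time.sleep(0.05)             # poll briefly for the other device's action
    execute_action()                 # window elapsed; take the delayed action
    return "executed"
```

For example, a content output device could pass a callback that checks whether the other device's volume has already dropped, and lower its own volume only on timeout.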
[0035] Voice-enabled devices (e.g., the content output devices 308, 314, 316, and 318, the home automation system 305, the security system 306, software applications executing on a computing device, etc.) may be controlled by voice commands spoken by the user 326 and/or the user 328, and the home automation system 305 may include a voice command processing system to provide centralized coordination of voice command responses. The voice command processing system may listen to various users or entities in the environment 300 by continuously recording audio clips via an integrated microphone, a remote control 310, a digital assistant 312, and/or other recording devices. Alternatively, the user 326 and/or the user 328 may initiate recording of an audio clip comprising a voice command by pressing a button on the content output device, the remote control 310, the digital assistant 312, and/or other recording devices. The voice command processing system may then perform a speech recognition analysis on the recorded audio clips to identify voice commands. Alternatively, the voice processing system of multiple voice-enabled devices may be hosted as a Software-as-a-Service (SaaS) application, a web-architected application, or a cloud-delivered service in a cloud-based system such as Amazon Web Services (AWS). The voice processing system may be localized or distributed nationally or internationally and may include load balancing to handle processing voice commands from a large number of users, voice-enabled devices, and/or premises.
[0036] The voice command processing system (e.g., implemented by home automation system 305) may maintain a small vocabulary of words and/or phrases that can be used to recognize a number of voice commands and/or maintain a list of recognized voice commands. For example, a number of voice commands may be associated with controlling the security system 306, and other voice commands may be associated with controlling a content output device 308. Such a vocabulary and/or list of recognized voice commands may be stored by the voice-enabled device and/or by a different physical device, such as in the non-rewritable memory 202, the rewritable memory 203, the removable media 204, and/or the hard drive 205, accessible to the voice command processing system. After receiving an audio clip, a voice command processing system may identify words and/or phrases in the audio clip and match the identified words and/or phrases to words and/or phrases of recognized voice commands. If a matching recognized voice command is found, the voice command processing system may deliver the audio clip to the corresponding voice-enabled device, which may perform actions associated with the recognized voice command. The voice command processing system may also send commands to control devices that are to be controlled by the voice command, without requiring those devices to process the audio clip—this may be useful to provide voice control functionality to devices that, on their own, do not possess voice command processing ability.
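The vocabulary lookup described above may be sketched as follows. The phrases and device names in the table are illustrative assumptions only:

```python
# Toy vocabulary mapping recognized command phrases to the voice-enabled
# device (or controlled device) that should receive the command.
COMMAND_VOCABULARY = {
    "turn on the light": "home_automation_system",
    "enable security system": "security_system",
    "turn on cnn": "content_output_device",
}

def route_voice_command(transcript):
    """Match the recognized words/phrases of an audio clip against the list
    of recognized voice commands; return the target device, or None if no
    recognized command matches."""
    return COMMAND_VOCABULARY.get(transcript.strip().lower())
```

A match causes the clip (or a derived control command) to be delivered to the corresponding device; a miss leaves the clip for further disambiguation.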
[0037] A voice command processing system may control multiple devices based on received voice commands (e.g., the digital assistant 312 may control the home automation system 305, the security system 306, and/or one or more content output devices). Such a voice command processing system may maintain a vocabulary of recognized words and phrases and/or a list of recognized voice commands for each of the devices it controls. After receiving an audio clip, the voice command processing system may interpret the voice commands to identify the device that the speaker of the voice command intended to control by matching words and/or phrases in the received voice commands with recognized words and/or phrases in the list of recognized voice commands for each of the devices and/or systems it controls. For example, the voice command processing system of the home automation system 305 may receive an audio clip “turn on the light” and compare the received voice command to voice commands in the list of recognized voice commands for the home automation system 305 and the list of recognized voice commands for the security system 306. Based on the determined device or system and the request in the voice commands, the voice command processing system may allow for certain commands to be completed. Similarly, a voice-enabled device (e.g., the content output device 308) may host multiple voice-enabled applications (e.g., NETFLIX, YOUTUBE, HULU, email processing software, an application for controlling the settings of the voice-enabled device, etc.) and store a vocabulary of recognized words and/or phrases or a list of recognized voice commands for each of the applications. 
After recording or receiving an audio clip, the voice command processing system may interpret the voice command in the audio clip to identify the application that the speaker of the voice command intended to control by matching words and/or phrases in the received voice commands with recognized words and/or phrases in the list of recognized voice commands for each of the applications.
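The per-device (or per-application) matching described above can be sketched as a lookup against stored command lists. This is a minimal illustration, not the patent's implementation; the device names and command phrases are assumptions.

```python
# Hypothetical sketch: match words/phrases in a transcribed audio clip against
# per-device lists of recognized voice commands. Device names and phrases are
# illustrative assumptions.

RECOGNIZED_COMMANDS = {
    "home_automation": {"turn on the light", "turn off the light"},
    "security_system": {"arm security system", "disarm security system"},
}

def match_command(transcript: str) -> list[tuple[str, str]]:
    """Return (device, command) pairs whose recognized phrase appears in the transcript."""
    text = transcript.lower().strip()
    matches = []
    for device, commands in RECOGNIZED_COMMANDS.items():
        for command in commands:
            if command in text:
                matches.append((device, command))
    return matches

# A single match can be routed directly; multiple matches signal ambiguity.
print(match_command("please turn on the light"))
```

A phrase that matches command lists for more than one device would return multiple pairs, which is the ambiguous case the later paragraphs address.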
[0038] A particular voice command phrase may be valid for controlling multiple different services, and detecting the voice command phrase alone might not be sufficient for allowing the voice command processing system to determine what the user intended to control. For example, the phrase “turn it up” may be a valid voice command for increasing an audio level for a content output device 308, as well as for increasing a temperature setting on a thermostat. For such ambiguous voice command phrases, additional processing may be performed to resolve the ambiguity and to execute the speaker's intended function.
[0039] In Example 351, the voice command processing system may determine that the background audio matches the audio of a movie. The fact that the movie audio is heard in the background may suggest that the user's intended command was to increase the audio volume of the movie. The determination may be associated with a confidence value to allow other contextual clues to be taken into account. If desired, the voice command processing system may be configured to request further clarification from the user, instead of assuming that a low-confidence interpretation was correct, if the confidence is below a threshold (e.g., below "Medium"). If the matching audio was the only detected contextual condition, then the voice command processing system may proceed to send a command to increase audio volume of a movie that the user was watching via the content output device 308. If other contextual conditions are detected, then other interpretations of the voice command may be reached, as will be discussed below.
[0040] In Example 352, the voice command processing system may determine that the background audio matches the audio of a song. The fact that the song audio is heard in the background may suggest that the user's intended command was to increase the audio volume of a music service that the user was listening to via an audio content output device 316. The voice command processing system may send a command to increase the audio volume of the audio content output device 316. Similar to example 351, this command may be associated with a relatively low confidence if that is the only available contextual information.
[0041] In Example 353, after determining that the background audio matches the audio of a movie, additional contextual information may be used to provide a greater degree of confidence. For example, the voice command processing system may determine that the background audio matches a movie that is available from a particular streaming application and that this streaming application is currently executing on the content output device 308 (e.g., by querying the streaming application or consulting a list of content offered by the streaming application), and that the user may be watching the movie via the streaming application. The voice command processing system may send a command to increase the audio volume of the movie being watched on the content output device 308, and may do so with greater confidence due to the additional context information. For example, while Example 351 heard background audio matching a movie, it was not known that the movie was actually being viewed by a user of the system. That matching movie audio could have simply been overheard audio from another user using a different device in the room (e.g., someone watching a movie on their phone, perhaps). Knowing, in Example 353, that the movie is also being output by an application of the system increases the likelihood that the overheard audio was from content actually being viewed by a user of the system.
[0042] In Example 354, yet more contextual information may be used. In addition to the context information from Example 353, the voice command processing system may also determine that the current temperature, in the room in which the voice command phrase was heard, is within a normal temperature range. The voice command processing system may communicate with a thermostat and may retrieve historical temperature settings and current temperature measurements to make this determination. If the current temperature is within the normal temperature range, then the likelihood that the user intended to adjust the thermostat is relatively low, and as a consequence, the voice command processing system may adjust the audio volume of the movie being output on the content output device 308. This may be done with a high degree of confidence since several independent sources of contextual information are in agreement.
[0043] In Example 355, if a temperature measurement in the room is below normal, then this may suggest that the user's intent was to increase the temperature setting of the thermostat. The voice command processing system may increase the thermostat setting with low confidence if that is the only available contextual information.
[0044] In Example 356, if the temperature measurement in the room is below normal, and there is no background audio, then the system may increase the thermostat with medium confidence, as it is more likely that the user did not intend to control the audio of a movie or music. Similarly, if there is audio but it does not match a content item, or if there is a match, but the audio volume is already louder than usual (and/or already louder than an ambient sound level in the environment 300), then the voice command processing system may infer that the user was unlikely to have wanted to increase an audio volume even more, and as a consequence may increase the thermostat setting with medium confidence.
[0045] In Example 357, if the current room temperature is below normal and there are no audio services running, then the voice command processing system may determine, with high confidence, that the user intended to control the thermostat. The voice command processing system may then send a command to increase the thermostat setting.
[0046] In Example 358, the voice command processing system may be unable to resolve the ambiguity with sufficient confidence. The room temperature may be colder than normal, but there may also be a matching audio in the background that matches a movie that a currently-running streaming application has available. In such situations, the voice command processing system may determine that additional clarification is needed and may prompt the user to clarify whether the user intended to increase the volume or increase the temperature in the room. An audio prompt may be output (e.g., “Did you mean to increase audio volume?” or “Are you referring to content audio or room temperature—please say ‘audio’ or ‘temperature’ for the one you wanted to control”). The degree of confidence needed to take action may be configured as desired. The Example 358 context may be configured to default to controlling the audio. The default may be established based on the user's historical patterns, such as the user's preferred audio volume level, or by determining that the user has a tendency to make frequent audio adjustments and infrequent thermostat adjustments.
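Examples 351 through 358 can be summarized as a small rule table mapping contextual conditions to an action and a confidence level. The following sketch is an illustrative condensation of that logic, not the patent's table 350; the condition keys, confidence labels, and fall-through ordering are assumptions.

```python
# Illustrative resolution logic for the ambiguous command "turn it up",
# loosely following Examples 351-358. Condition names and confidence labels
# are assumptions, not the patent's exact table 350.

def resolve_turn_it_up(ctx: dict) -> tuple[str, str]:
    """Map contextual conditions to an (action, confidence) pair."""
    audio_match = ctx.get("background_matches_content", False)
    app_running = ctx.get("matching_app_running", False)
    temp_low = ctx.get("room_temp_below_normal", False)

    if audio_match and app_running and temp_low:
        # Example 358: conflicting evidence; prompt the user to clarify.
        return ("ask_user", "n/a")
    if audio_match and app_running:
        # Examples 353/354: matching content is being output by a known app.
        return ("raise_volume", "high")
    if audio_match:
        # Examples 351/352: matching audio is the only contextual clue.
        return ("raise_volume", "low")
    if temp_low:
        # Examples 356/357: cold room; confidence depends on audio services.
        return ("raise_thermostat",
                "medium" if ctx.get("audio_service_running") else "high")
    return ("ask_user", "n/a")

print(resolve_turn_it_up({"background_matches_content": True,
                          "matching_app_running": True}))
```

In practice each branch could carry a numeric confidence compared against a configurable threshold, as the discussion of low/medium/high confidence above suggests.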
[0047] The table 350 may be used to resolve ambiguities, but it may also be used to indicate command results even for unambiguous commands. For example, the phrase “arm security system” may be uniquely associated with the security system 306. The table 350 may simply indicate that the result for that command is to change the status setting of the security system 306 to an armed state.
[0048] Sometimes, a voice-enabled device or system may receive an ambiguous voice command and fail to determine the request in the ambiguous voice command and/or the device to be controlled by using speech recognition analysis. An ambiguous voice command may be a voice command which only partially matches one or more voice commands from the list of recognized voice commands for the voice-enabled device or system, or which does not match any recognized voice command in the list of voice commands. The voice command processing system of the voice-enabled device or system may determine one or more subject items (e.g., missing information) in the voice command and try to find the identity of the subject items to fully interpret the requests in the ambiguous voice commands. The subject items may be associated with a content item that a content output device is outputting, a voice-enabled device, a voice-enabled system, a voice-enabled service/application, an attribute of a voice-enabled device/system/service/application that the user is trying to control, etc.
[0049] As another example, the user 326 may issue a voice command "show me more programs like this" to the content output device 308, which may be coupled with a DVR and host applications for multiple video on-demand streaming content services (e.g., NETFLIX, HULU, YOUTUBE, PEACOCK, etc.) that do not share their streaming activities with the content output device 308. Therefore, the voice command processing system may recognize the phrase "show me more programs like this" but may not be able to determine which program the user is watching if the user is streaming the program from one of the multiple video on-demand streaming content services that do not share their streaming activities. The voice command processing system may learn the identity of the program by recognizing the audio of the program playing in the background of the recorded voice command and then use that identity to generate and display a content recommendation for the user. As another example, the user 326 may issue a voice command "play the next episode." The voice command processing system may recognize the phrase "play the next episode" but may not be able to determine which video on-demand streaming content service the speaker is targeting and/or whether the speaker is targeting the DVR that is coupled to the content output device 308. The voice command processing system may determine which service (e.g., DVR, NETFLIX, PEACOCK, etc.) is in current use and direct the voice command to that service. Even if the voice command is directed to that service, the voice command processing system may also use the voice command "play the next episode" for its own purposes. For example, the voice command processing system may learn the identity of the program and use that identity to generate its own content recommendation for the user.
[0050] Additional information related to the environment 300, the users, the content output devices, and/or the systems in the environment 300 may be used to correctly interpret the ambiguous voice commands. The voice command processing system may interpret the request in a voice command on its own and then use additional information related to the environment 300 to confirm the accuracy of the interpretation.
[0051]
[0052] A user 401 may speak a voice command 402, and the voice command 402 may be captured by one or more microphones 403. The microphones 403 may be standalone microphones and/or integrated into other devices such as a handheld remote control, portable computing device, smartphone, etc. A voice command identification process 404 may perform audio processing to recognize the voice command 402. Any desired speech recognition technique may be used to identify the user's voice in the audio clip. The voice command identification process 404 may retrieve a voice pattern of the user 401 and may use this pattern to identify the user's voice in the audio captured by microphone 403. The voice command identification process 404 may filter audio signals with frequencies that are associated with human speech from the audio clip by using any desired filtering technique. The voice command identification process 404 may also filter one or more signals with frequencies that are not associated with human speech and classify these audio signals as background noises 408.
[0053] The voice command identification process 404 may parse speech in the filtered audio signals that are associated with human speech into blocks or chunks more suitable for subsequent speech processing. For example, linear predictive coding (LPC) can be used to break the human speech into various items, such as verbs, sentences, nouns, and so on. Speech recognition can be performed to identify a request in the identified items in the human speech. For example, the voice command identification process 404 may identify a request by fully or partially matching the identified items in the speech to one or more recognized voice commands in the list of recognized voice commands for the voice-enabled device 400. Duplicated items and items indicating filler words (e.g., "um," "uh," "er," "ah," "like," "okay," "right," "you know," etc.) may be discarded before the speech recognition process. Various speech recognition techniques may be used, such as hidden Markov Models (HMM), dynamic time warping (DTW)-based speech recognition, neural networks, Viterbi decoding, and deep feedforward and/or recurrent neural networks.
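The filler-word and duplicate discarding step described above can be sketched very simply. This is an illustrative token-level cleanup under simplifying assumptions (whitespace tokenization, a fixed filler list); a real pipeline would operate on recognizer output and keep meaningful repeats and words like "like" when they carry content.

```python
# Minimal sketch of discarding filler words and duplicated items before
# matching a transcript to recognized voice commands. The filler list and
# naive whitespace tokenization are simplifying assumptions.

FILLERS = {"um", "uh", "er", "ah", "okay", "right"}

def clean_tokens(utterance: str) -> list[str]:
    """Drop filler words and repeated tokens, preserving first-seen order."""
    seen, cleaned = set(), []
    for tok in utterance.lower().split():
        if tok in FILLERS or tok in seen:
            continue
        seen.add(tok)
        cleaned.append(tok)
    return cleaned

print(clean_tokens("um turn turn it up okay"))
```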
[0054] If a voice command 402 is detected and is not ambiguous (e.g., "switch to CNN," "increase thermostat temperature by 2 degrees," etc.), then the voice command identification process 404 may simply send a corresponding command signal 417. The command may control a device, such as controlling a thermostat to increase a temperature setting or controlling a content output device to switch to CNN, in accordance with the voice command 402. The command signal 417 may include a copy of the audio that the microphone 403 captured, which may be helpful if the target device has voice processing capability of its own. For example, if the voice command identification process 404 determines that the voice command 402 was intended for a voice-enabled content service 413, then the signal 417 may send a copy of the microphone's 403 audio to the voice-enabled content service 413, thereby allowing the content service 413 to process the audio on its own.
[0055] However, if the voice command identification process 404 supports multiple different voice-enabled devices, and if the same voice command 402 is usable for multiple devices (e.g., "turn it up"), then an ambiguity may result. The voice command identification process 404 may use table 350 to determine a response to the voice command 402 and may enlist the assistance of other processes to help identify the correct response. Additionally or alternatively, the voice command identification process 404 may be able to determine which device the voice command is targeted for but may still find the voice command to be ambiguous (e.g., "show me more programs like this," etc.).
[0056] The other processes may identify background sounds that were captured by the microphone 403 when the user 401 spoke the voice command 402. The background sounds may have included content audio 405 of a video program that the user 401 was watching when the voice command 402 was spoken and/or other miscellaneous sounds 406 (e.g., passing cars outside, other people, pets, household appliances, etc.). Some or all of these background sounds may be used to help the voice command identification process 404 determine the intent of the voice command.
[0057] The voice command identification process 404 may use any desired audio filtering technique to separate the voice command 402 from the audio captured by the microphone 403, resulting in background noise 408 that may comprise the content audio 405 and/or other miscellaneous sounds 406. For example, the voice command identification process 404 may filter the audio captured by the microphone 403 to remove the recognized voice command (e.g., by applying an inverse audio signal of the voice command), and the remaining sounds in the audio may be designated the background noise 408. Portions of the audio occurring before and/or after the recognized voice command may be designated the background noise 408. The background noise 408 may be provided to a background noise analysis process 407. In return, the background noise analysis 407 may provide an identification 409 of the content audio 405 (and/or any other recognized noise in the background noise 408), and the voice command identification process 404 may use this identified content item to determine the intent behind the voice command 402, and to determine how to react to the voice command 402.
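The "apply an inverse audio signal" idea above amounts to subtracting an estimate of the voice signal from the captured mixture, leaving the background as the residual. The toy example below illustrates this on integer sample lists; real systems would need time alignment, sampled waveforms, and spectral processing, all of which are abstracted away here.

```python
# Toy illustration of isolating background noise 408 by adding the inverse
# of (i.e., subtracting) an estimated voice-command signal from the captured
# audio. Integer sample lists stand in for real aligned waveforms.

def extract_background(captured: list[int], voice_estimate: list[int]) -> list[int]:
    """Subtract the estimated voice signal sample-by-sample; the residual
    approximates the background noise."""
    return [c - v for c, v in zip(captured, voice_estimate)]

background = [3, -1, 4, -1, 5]   # content audio + miscellaneous sounds
voice = [10, -7, 2, 0, -3]       # the spoken command
captured = [b + v for b, v in zip(background, voice)]  # what the mic hears
print(extract_background(captured, voice))  # recovers the background samples
```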
[0058] The background noise analysis 407 may compare the background noise 408 (or a fingerprint pattern of the background noise 408) with a database 410 of known sounds to identify sounds in the background noise 408 (e.g., identify a movie being watched while the voice command 402 was spoken; identify a household appliance that was in use, etc.). The database 410 may contain audio fingerprints for a large variety of content items, such as video programs, movies, songs, online videos, audio books, etc. Any desired type of audio fingerprinting may be used and may identify characteristics of content for purposes of comparison (e.g., frequency levels, colors, patterns, objects, etc.). The database 410 may also contain audio fingerprints for sound sources other than content items, such as sounds of household appliances, smoke alarms, security systems, barking dogs, etc. The database 410 may allow the voice command identification process 404 to identify the various sounds that are in the background noise 408, and any desired audio (and/or video) fingerprinting technique may be used for supporting audio and/or visual searching.
[0059] Fingerprints of content items stored in the fingerprint database 410 may be tagged with one or more characteristics associated with the content item, such as the name of the content item, the season number, the episode number, the genre of the content item, actors present in the content item, content providers broadcasting or streaming the content item, broadcast time, video quality of the content item and so on. For example, a fingerprint for a show named “HAWAII FIVE-O” may be tagged with the genres “crime” and “drama,” actors “Scott Park” and “Grace Kim,” video quality “4K,” and content provider “CBS.” As another example, a fingerprint for a movie named “The Sleepover” may be tagged with the genre “comedy,” actors “Sadie Stanley” and “Maxwell Simkins,” video quality “8K,” and content provider “NETFLIX.” Additionally, the database 410 may comprise fingerprints of entities other than content items, such as fingerprints of sounds of HVAC systems, lawnmowers, pets, etc.
[0060] The size of the database 410 used for the background noise analysis 407 will affect the time that the background noise analysis 407 comparison requires, as the background noise 408 may need to be compared to more entries if the database 410 has more entries. To help streamline this process, context information may be used to limit the comparison to a subset search space of the database 410. For example, if the background noise analysis process 407 knows that the user 401 historically watches a lot of action movies, then the background noise analysis 407 may begin its analysis by focusing on comparing the background noise 408 to known audio 410 of action movies. With a smaller search space, the background noise analysis 407 may more quickly identify sources of sound in the background noise 408.
[0061] A context collection process 411 may supply contextual information 412 to the background noise analysis 407, to assist with the recognition process. The context collection process 411 may gather context information from a variety of sources.
[0062] One source of context information may deal with the content service(s) 413 that the user 401 is using. The user 401 may have access to content items via multiple different content services 413. A content service may, for example, comprise and/or otherwise be associated with a source of content. A content service may comprise a remote linear content provider that broadcasts television channels on a cable service (e.g., NBC, CBS, FOX, HGTV, etc.) and/or streams live events (e.g., sporting events, news, weather, etc.). A content service may also be a remote video on-demand (VOD) streaming content provider (e.g., NETFLIX, AMAZON PRIME VIDEO, HULU, APPLE TV+, DISNEY PLUS, HBO NOW, PEACOCK, ITUNES, etc.) that provides content items to speakers/users based on the speakers/users' requests. Additionally, the content service may be a local content service provider in the premise (e.g., DVD, DVR recordings, content recorded on a user's phone or computer, a video feed from a security camera, etc.). The user 401 may choose to watch a content item from one of the content services 413. The chosen content service may enable retrieval of the selected content item from one or more content servers (e.g., the content server 106 in
[0063] If the background noise analysis 407 is provided with contextual information 412 indicating that the user 401 is currently using a video streaming application such as PEACOCK, but PEACOCK does not share information about its streaming activities (e.g., which content item is currently being streamed, etc.), then the background noise analysis 407 may use the contextual information that PEACOCK is currently in use to identify the content item being streamed by PEACOCK. The background noise analysis 407 may begin its sound matching comparison by comparing the background noise 408 with known audio 410 of content items that are available from the PEACOCK streaming application. The known audio database 410 may include information indicating one or more sources for various content items represented in that database. However, if PEACOCK does provide information identifying content being output, such as through an application program interface (API), then the background noise analysis 407 may not be needed, as the system can simply query the API to receive information identifying the content being output.
[0064] The contextual information 412 may include information about the usage of various devices, such as the video and audio rendering device 414. If the background noise analysis process 407 knows, for example, that a display device is not in use, but an audio device is in use, then the audio matching comparison can focus on audio content such as songs and musical soundtracks, instead of video content.
[0065] The contextual information 412 may include information from a user database 415. The user database 415 may contain various kinds of information about the user 401, such as user preferences, viewing history, service usage history, application permissions, demographic information, subscription information, voice characteristics, temperature settings for the thermostat, usage history for lighting, etc. This user information may be useful in, for example, deducing whether the user 401 is likely to be watching a favorite show or to limit the search space for content items based on the user's preferences. The user database 415 may comprise samples of the user 401's voice to assist in recognizing the user 401. The user database 415 may indicate frequently-viewed genres of content items, preferred genres based on the time and day (e.g., morning, afternoon, evening, weekend, weekday, etc.), and/or other characteristics of users in the environment 300. The user's frequently viewed genres of content items, or preferred genres based on the time and day may be used to determine a search space that comprises content items associated with the speaker's preferred genres and not other genres, as will be discussed further below.
[0066] The user database 415 may contain privacy settings for the user 401. The privacy settings may indicate the user's preferences regarding usage of the contextual information 412. For example, the user 401 may indicate whether the user gives permission to have photos taken in the room, or facial recognition to occur, or viewing history information to be accessed, etc. The user 401 may wish to only allow voice recognition of the user's 401 own voice and not of any other human voices that may be captured by the microphone. The user 401 may indicate that certain portions of the viewing history are not to be used (e.g., individual content items, or content types, that the user 401 does not wish to be identified).
[0067]
[0068] The contextual information 412 may be useful to the background noise analysis process 407 for narrowing the search space for recognizing a sound in the background noise 408. The contextual information 412 may also be useful to the voice command identification process 404 to assist in evaluating the actions in, for example, the conditions in table 350 discussed above. For example, if the voice command identification process 404 knows that the current temperature in the room is colder than the user's normal preference as indicated in the user information 415, then the voice command identification process may be more likely to conclude that the user's “turn it up” voice command was intended to increase the thermostat setting. The contextual information 412 may be used for both the background noise analysis 407 and the voice command identification 404.
[0069] The contextual information 412 may include image data captured by a camera. For example, the security system 306 may send images captured by the camera 320. The voice command identification process 404 (or other processes in voice command processing system 400) may identify various objects in the captured images that may reduce the search space for identifying various entities in the background noise 408, or otherwise assist in handling the voice command 402. For example, image processing may recognize users who are in the room when a voice command 402 is spoken, and the preferences of those users may be used to limit the search space for content audio recognition.
[0070] Additionally or alternately, the captured images may comprise the display screen of a content output device 308 outputting the content item. The genre of the content item may be determined based on visual objects present on the display screen (e.g., a display screen showing players playing football in a field may indicate that the content item may be a sporting event). The background noise analysis process 407 may initiate a reduced search space that only includes sounds of sporting events.
[0071] A logo of the content service may be recognized from the captured images of the display screen. The background noise analysis process 407 may initiate a reduced search space that only includes sounds of content items available via the identified content source for identifying the content item. The security system 306 may be configured to capture images whenever the microphone 403 records an audio clip. Additionally or alternately, the security system 306 may capture images at periodic intervals (e.g., every thirty seconds, every minute, five minutes, and so forth) and transmit the captured images to the context collection process 411.
[0072] The contextual information 412 may include data from a gateway device (e.g., the gateway device 304 in
[0073] The contextual information 412 may include data indicating the status of various devices on the premises. For example, the home automation system 305 may send data to the context collection process 411 regarding which lights are on in the environment 300, the states of the HVAC systems (e.g., temperature setting, fan setting, timers, information from smoke detectors, etc.), which entertainment systems (e.g., multimedia hubs, wearable devices, toy robots, etc.) are currently active and their states, and/or information about active and inactive smart appliances in the environment 300 (e.g., a smart oven or stove, a smart coffee machine, smart locks, etc.), etc. Data from the home automation system 305 may also be used to modify the search space of the background noise analysis process 407. For example, if the home automation system 305 indicates that the coffee machine is on, this information may be used to determine, from user information 415, that the user 401 often watches a particular talk show while drinking coffee. Therefore, the reduced search space for identifying the background noise 408 may initially focus on finding matches among talk show audio samples.
[0074] The contextual information 412 may also include external contextual information 416 received from remote sources. For example, information about the user 401's usage of a streaming service may be obtained from a streaming service server located remotely from the user's home. Any of the contextual information discussed herein may be obtained from an external source.
[0075]
[0076] In step 501, a voice command processing system may be initialized in one or more computing devices, such as those illustrated in the voice command processing system 400 in
[0077] As part of this initial configuration, one or more user interfaces may be displayed to a user 401 to gather user information 415, such as their viewing preferences, desired thermostat settings for different days and times, subscription services, etc. This information may also be gathered automatically by the voice command processing system 400 by monitoring user behavior over time.
[0078] Privacy can be an important concern to user 401, and in step 502, the user 401 may be prompted to provide privacy settings.
[0079] A pre-existing profile for privacy settings may already be stored, for example, in user information 415, and a user interface may be displayed based on the pre-existing profile. Alternatively, different predefined privacy settings may be identified (e.g., the predefined privacy settings may include default profiles for using all the sounds identified in the background noises and all the contextual information gathered by the voice command processing system, etc.) and a user interface may be displayed based on the pre-defined privacy settings. After receiving user inputs via the displayed user interface, the user inputs from the displayed user interfaces may be stored as privacy settings for the user in a database at step 503.
[0080] At step 504, communication may be initialized with a gateway device (e.g., the gateway device 304 in
[0081] At step 505, communication with a home automation system (e.g., the home automation system 305 in
[0082] At step 506, communication may be initialized with a security system (e.g., the security system 306 in
[0083] At step 507, communication may be initialized with one or more content applications (e.g., content services 404 in
[0084] At step 508, communication may be initialized with a video and audio rendering engine (e.g., the video and audio rendering engine 414 in
[0085] After initialization, the voice command processing system 400 may begin to listen for potential voice commands. The microphone 403 may continuously record audio clips in the environment 300, and if any sound beyond a minimal threshold is detected, a determination may be made in step 509 as to whether a voice command was detected in the audio clip. The voice command may include a keyphrase, such as “Hey Xfinity” or “Hey Alexa,” to help clearly indicate that a voice command is being spoken.
[0086] The presence of a voice command may be determined by filtering audio signals associated with human voice using various signal filtering techniques (e.g., frequency-division multiplexing) and analyzing the filtered audio signals by using any speech recognition technique, such as hidden Markov Models (HMM), dynamic time warping (DTW)-based speech recognition, neural networks, Viterbi decoding, and deep feedforward and/or recurrent neural networks. A voice command may be identified in the audio clip by identifying a word or a phrase in the filtered audio signals, where the identified word or phrase comprises a request by a speaker of the word or phrase to control one of many computing devices near the speaker. The identity of the speaking user 401 may be determined by analyzing the prosodic characteristics of the user's speech, such as pitch, loudness, tempo, rhythm, and intonation, and comparing them, for example, with stored data 415 indicating the prosodic characteristics of the user. If a voice command is not identified in the audio clip, the algorithm continues to wait for another recorded audio clip.
[0087] The presence of the voice command may be determined using any speech recognition technique, such as hidden Markov Models (HMM), dynamic time warping (DTW)-based speech recognition, neural networks, Viterbi decoding, and deep feedforward and/or recurrent neural networks. The voice command may be separated from any other sounds in the audio captured by microphone 403, and those other sounds may be designated as the background noises 408.
[0088] If a voice command is identified, then a determination may be made at step 510 as to whether the voice command comprises one or more requests for content recommendation (e.g., “show me more programs like this”). If content recommendation requests are determined, the algorithm may proceed to step 516.
[0089] If a content recommendation is not identified, then a determination may be made at step 511 as to whether additional background processing will be used to assist in processing the voice command. If, for example, the voice command was clearly understood and unambiguous (e.g., “Hey Xfinity, please arm the security system”), then the voice command may be processed without needing any additional assistance regarding background noises. For example, the voice command may be understood with a high level of confidence or a confidence level that is higher than a predetermined confidence threshold. This confidence may be indicated by a voice recognition process and/or table 350. Such a clear identification may occur if the recognized voice command is only assigned to one corresponding result in table 350. In that case, in step 512, a corresponding command may be sent based on the recognized voice command. For example, a control signal may be sent to the security system 306, changing the security setting to an armed state based on clearly identifying a voice command to do so.
[0090] However, if the voice command is ambiguous (e.g., if the same voice command is valid in table 350 for multiple different results or if critical information is missing in the voice command) or if the voice command is interpreted or understood with a low level of confidence or the confidence level does not satisfy a predetermined confidence threshold, then additional processing of background information may be used to help resolve the ambiguity. The table 350 may be consulted to retrieve entries for the ambiguous voice command (e.g., if an ambiguous “turn it up” command was heard, then the table 350 may contain entries for the possible results associated with that ambiguous command). The entries may indicate one or more types of contextual information that can be used to resolve the ambiguity (e.g., content audio matching, applications being used, etc.), and in step 513, those context types may be determined. In the example table 350, for the ambiguous voice command “turn it up,” the ambiguity may be resolved using: 1) audio content matching (e.g., recognizing audio of movie/song); 2) application usage (e.g., streaming app 1 in use); 3) content source availability (e.g., movies available from streaming app 1); 4) thermostat information (e.g., room temperature and user's temperature preference); and 5) room audio level (e.g., audio volume). To assist with resolving the ambiguity, as indicated in the table 350, these various types of context information may be retrieved. Additionally, while resolving ambiguity is one example of using background information, there may be other reasons. For example, if the voice command processing system 400 simply wishes to provide an added service to complement a content source 413, then the background processing may be helpful. 
The voice command processing system 400 may wish to identify content items that a user device is outputting, so that a separate set of content recommendations (distinct from the content source 413 being outputted) may be provided. Such a content recommendation service may provide recommendations at a more comprehensive level—if the user 401 uses five different content sources 413, a comprehensive content recommendation system may offer recommendations based on knowledge of the user's usage of all of the different content sources 413. Alternatively or additionally, instead of processing background noises after determining ambiguous voice commands, the background noises may be periodically processed (e.g., every five minutes, every 10 minutes, etc.) to identify sources of various sounds in the background noises and/or content items being outputted by content devices. An ambiguous voice command may be interpreted by using information from the last processing of background noises.
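One possible shape for table 350, as described above, is a mapping from an ambiguous voice command to its candidate results and the context types that can resolve the ambiguity. The structure, entries, and names below are illustrative assumptions:

```python
# Illustrative sketch of table 350: a command is ambiguous if it maps to
# multiple results (step 511), and each candidate result lists the context
# types that may resolve it (step 513).

TABLE_350 = {
    "turn it up": [
        {"result": "increase_volume", "contexts": ["audio_content_match",
                                                   "application_usage",
                                                   "room_audio_level"]},
        {"result": "increase_thermostat", "contexts": ["thermostat_info"]},
    ],
    "arm the security system": [
        {"result": "arm_security", "contexts": []},
    ],
}


def is_ambiguous(command):
    """A command is ambiguous if table 350 lists multiple possible results."""
    return len(TABLE_350.get(command, [])) > 1


def contexts_to_collect(command):
    """Context types to gather before resolving the command."""
    needed = []
    for entry in TABLE_350.get(command, []):
        for ctx in entry["contexts"]:
            if ctx not in needed:
                needed.append(ctx)
    return needed
```

Under this sketch, "arm the security system" resolves immediately (one result, no extra context), while "turn it up" triggers context collection.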
[0091] Context information from a heating/ventilation/air conditioning (HVAC) system 330 may be useful. If, in step 514, such information would help resolve the ambiguity, then in step 515, the context collection process 411 may obtain context information from an HVAC system 330. This context information may include thermostat settings, current measured temperatures, current HVAC status (e.g., heat is running, air-conditioning is running, etc.), historical heating and/or cooling patterns, etc. The examples above are merely examples, and with the proliferation of the Internet of Things (IoT) connecting more and more smart devices (wearables such as watches, video game consoles, smart appliances, etc.), context information may be obtained from any sort of device, depending on the contexts that will help resolve a particular voice command.
[0092] In step 516, if application usage context information would be useful in resolving the ambiguity, then that application usage context information may be retrieved in step 517. Retrieving the application usage context information may comprise the context collection process 411 sending a request to various content service applications 413 to inquire about whether the content services are in use. The context collection process 411 may communicate with one or more external servers to request external contextual information 416 regarding current applications that may be in use. For example, the user information 415 may indicate that the user 401 has subscriptions to several streaming services, and the context collection process 411 may comprise communicating with those streaming services to determine whether they are in current use by the user 401. The context collection process 411 may also send requests to various computing devices to request identification of applications that are currently in use. The collected context information may comprise more than simply a binary indication of whether the application is in use, or which applications are in use. Other application details may also be retrieved. For example, if the application provides information identifying a title of a content item being streamed, or a library of available content items, or historical usage information, etc., then such additional application details may also be retrieved.
[0093] In step 518, if device usage context information would be useful in resolving the ambiguity, then the device usage context information may be retrieved in step 519. Similar to the application usage context collection in step 516, the device usage context information may be retrieved by sending requests to various devices that are associated with the user, to determine which devices are in current use. Various device contexts may be identified, and several additional examples are illustrated in the following steps.
[0094] For example, the context collection process 411 may determine whether the video and audio rendering engine 414 is in use or if the gateway 304 is currently in use. The security system 306 may be an example of such a device, and in step 520, a determination may be made as to whether security system 306 context information would be useful in resolving the ambiguity. If so, then the context collection process 411 may retrieve security system 306 information (e.g., current armed status, security sensor history, alarm schedule, etc.) from the security system 306 in step 521. Usage information from the gateway 304 may be obtained, indicating bandwidth being used for streaming, a source of streaming content, information indicating types of data being streamed and to which device, etc. Usage of appliances may be determined. For example, the status of a coffee maker may be used if a user 401 tends to watch television while drinking coffee in the morning.
[0095] Video cameras, such as sensor 320, may also provide context information. For example, the voice command processing system 400 may use facial recognition to recognize the user 401 issuing the voice command and may retrieve preferences of the user 401 from user information 415 to resolve an ambiguity in a voice command. Multiple users may be recognized as well, and multiple user preferences may be retrieved from user information 415. In step 522, a determination may be made as to whether video camera context information would be useful in resolving the ambiguity. If so, video image context information may be retrieved at step 523. The video image context information may be one or more images from a camera, and/or may be information processed using one or more images from a camera. For example, the video image context information may simply comprise an identification of a user 401 whose face was recognized via a facial recognition process. Various facial recognition techniques may be used, such as machine learning-based models, including regression-based models, neural network-based models, and/or fully-connected network-based models.
[0096] The video context information may comprise other recognized objects in the one or more images from a camera. For example, a camera may capture an image of a display screen in the room (e.g., audio/video output device 308), and an object recognition process may be executed on the captured image to recognize one or more objects being displayed on the display screen. A recognized video source logo may help indicate a content source (e.g., a television channel, streaming service, etc.) that the user 401 was watching when speaking the voice command. Actors visible on the display screen may be recognized through a facial recognition process and may be identified in the video image context information. A fingerprint analysis server 322 may be able to recognize a video content item by recognizing scenes from a video image (e.g., by generating image fingerprints to allow visual searching to find a scene in a content item), and may provide context information identifying the source and/or content item on the display screen. One or more objects, such as actors, objects, a genre of the content item being displayed via the content output device, and/or logos of a content service outputting the content item may be identified from the screen of the content output device. The genre of the content item may be determined based on one or more objects identified from the display screen in the image. For example, as shown in
[0097] The image processing is not limited to images from a camera in the environment. The same processing may be performed on video images being displayed by a display device (e.g., audio/video display 308), or sent to such a display device by gateway 304.
[0098] The context information collected by the context collection process 411 may be used to streamline the audio search for identifying sounds in the background noise 408. In step 524, an initial search space of the database 410 may be determined. The initial search space may be the entire database 410, which would allow the voice command processing system 400 to recognize the most possible matches in the background audio 408. However, the collected context information may help reduce this search space.
[0099] In step 525, the search space may be reduced based on preferred content of the user 401. For example, the context collection process 411 may determine the identity of the user 401 based on facial recognition context information retrieved in step 523 and may also obtain user preference information 415 corresponding to the identified user 401. If the user 401 prefers watching comedies, then the search space 600 may be reduced to (at least initially) focus on audio fingerprints for comedic content items 601. If a search were executed using this reduced space, then the audio in the background noise 408 would be compared against the audio fingerprints for comedies 601, which is smaller than the entire database 410, and as a consequence, this search would be conducted much faster. However, additional context information may be used to even further narrow this search space.
[0100] In step 526, the search space may be reduced based on the application(s) that are currently in use. For example, if in step 517 it is determined that the user 401 is using the HULU content streaming application, then the search space may be reduced to focus on content items that are available from that streaming service by eliminating content items that are not available from that streaming service. As noted previously, the database 410 may contain information indicating one or more sources for each of the listed content items, and this information may be used for this reduction. Further reducing the search space may result in an intersection 602 of comedies on HULU, and searching this space for a match with the background noise 408 may be accomplished much faster than searching the entire database 410.
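The narrowing of steps 524 through 526 can be sketched as set intersection over the fingerprint database. The catalog contents, genre sets, and service sets below are invented for illustration:

```python
# Illustrative sketch: start from the full fingerprint database (step 524)
# and intersect with each available context filter (steps 525 and 526).

DATABASE_410 = {"movie_a", "movie_b", "sitcom_c", "sitcom_d", "drama_e"}
BY_GENRE = {"comedy": {"sitcom_c", "sitcom_d", "movie_b"}}
BY_SERVICE = {"hulu": {"sitcom_c", "drama_e", "movie_b"}}


def reduced_search_space(preferred_genre=None, service_in_use=None):
    """Intersect the full database with each available context filter."""
    space = set(DATABASE_410)  # initial search space (step 524)
    if preferred_genre in BY_GENRE:
        space &= BY_GENRE[preferred_genre]  # user preference (step 525)
    if service_in_use in BY_SERVICE:
        space &= BY_SERVICE[service_in_use]  # application in use (step 526)
    return space
```

With both filters applied, the result corresponds to the intersection 602 of comedies available on the service in use; with no context available, the full database is searched.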
[0101] Steps 525 and 526 are merely examples. The use of context information to reduce the audio fingerprint search space may use different contextual information, and/or may omit some of the contextual information, and may generally use any of the types of context information described herein. For example, contextual information associated with a certain actor may also be used to narrow the initial search space. The contextual information associated with the actor may be determined by recognizing the actor (e.g., via facial recognition in obtaining device context information at step 518, or as part of image processing in step 522) on the display screen outputting the content item. The initial search space may be narrowed by selecting content items that are associated with the identified actor (e.g., if John Smith is recognized from the image of the display screen, select fingerprints that are tagged as being associated with John Smith). In some examples, multiple actors may be identified at step 523, and the initial search space may be determined by selecting content items that are associated with some or all of the identified actors.
[0102] In step 527, the reduced search space may be searched to find a matching audio fingerprint that matches the background noise 408. Any desired fingerprint matching process may be used, and if a match is not found, then the search space may be broadened. For example, the fingerprint may comprise sound amplitudes or frequencies of the sound wave, as measured at several points in time. However, any type of fingerprint may be generated using various components of the sound wave.
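Step 527, with the broadening fallback described above, might be sketched as follows. The fingerprint representation (amplitudes at several points in time), the distance metric, and the tolerance are illustrative assumptions, and the fingerprint values are toy data:

```python
# Hedged sketch of step 527: match the background-noise fingerprint
# against the reduced search space; if nothing matches within tolerance,
# broaden the search to the full fingerprint database.

FINGERPRINTS = {  # content item -> amplitudes at several points in time
    "sitcom_c": [0.1, 0.4, 0.2, 0.5],
    "drama_e": [0.9, 0.8, 0.7, 0.9],
}


def fingerprint_distance(a, b):
    """Mean absolute difference between two equal-length fingerprints."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def match_background(noise_fp, search_space, tolerance=0.05):
    """Return the best-matching content item in the space, or None."""
    best, best_dist = None, tolerance
    for item in search_space:
        fp = FINGERPRINTS.get(item)
        if fp is None:
            continue
        dist = fingerprint_distance(noise_fp, fp)
        if dist < best_dist:
            best, best_dist = item, dist
    return best


def search_with_broadening(noise_fp, reduced_space):
    """Try the reduced space first; fall back to the full database."""
    return (match_background(noise_fp, reduced_space)
            or match_background(noise_fp, FINGERPRINTS.keys()))
```

Real fingerprint matching (e.g., spectral landmark hashing) is far more robust than this mean-difference comparison; the sketch only shows the reduce-then-broaden control flow.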
[0103] As illustrated in
[0104] In step 532 of
[0105] If, however, in step 534, the recognized content item is not known to be available from the content application or device that is currently in use, the voice command identification 404 may still conclude that the “turn it up” voice command was a request to increase audio volume (step 536), but this determination may be made with a lower confidence (e.g., a “Medium” confidence level). As noted above, a threshold may be established for a degree of confidence required for the system to take action—if a conclusion cannot be reached with sufficient confidence, then the system may request assistance from the user 401, as indicated further below.
[0106] In step 533, if the content was not recognized in the background audio 408, then in step 537, a determination may be made as to whether the current temperature in the room is colder than the normal range of temperature that is preferred for the HVAC 330. This information may be retrieved, for example, in step 515. If the room is cold, then the voice command identification 404 may conclude that, given the contextual circumstances of a content app/device in use, unrecognized content in the background audio 408, and a cold room, the “turn it up” voice command was intended to increase the temperature setting of the HVAC thermostat. Given the context, this determination may be made with a lower confidence. On the other hand, if the room is not currently cold, then the voice command identification 404 may conclude that the “turn it up” command was intended to increase the audio volume of the content application/device that is currently in use (step 539). This determination, given the context, may also be given a low confidence.
[0107] In step 532, if there is no content application or device currently in use, then the resolution of the “turn it up” ambiguity could simply depend on the temperature in the room. The temperature may be checked in step 540, and if the room is cold, then in step 541, the voice command identification 404 may conclude that the voice command was a request to increase the temperature of the room. A control signal may be sent to the HVAC 330 to increase the temperature setting of the HVAC 330 thermostat.
[0108] If, in step 540, it is determined that the room is not cold, then the contextual information might not be able to resolve the ambiguity. There is no content application or device in use, and the room is not currently cold, so it may remain ambiguous what the user meant with the “turn it up” command. In that situation, in step 542, the voice command identification 404 may conclude that there is insufficient information to resolve the ambiguity and may prompt the user 401 for clarification as to what was intended. The prompt may be an audio message played via audio output device 316 and may ask the user to restate the desired command or to use a different phrase for the intended result (e.g., “Say ‘volume’ if you meant to increase the audio; say ‘temperature’ if you meant to increase the room temperature.”). After processing the voice command, the process may return to step 509 to await the next voice command. The example process in
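The resolution flow for the ambiguous “turn it up” command (steps 532 through 542) can be summarized as a small decision function. The action labels and confidence labels below are illustrative assumptions drawn from the branches described above:

```python
# Illustrative sketch of the "turn it up" decision flow: whether a content
# app/device is in use, whether background content was recognized and is
# available from that app, and whether the room is cold together select
# an action and a confidence level.

def resolve_turn_it_up(app_in_use, content_recognized,
                       content_available_from_app, room_is_cold):
    if app_in_use:
        if content_recognized:
            if content_available_from_app:
                return ("increase_volume", "high")
            return ("increase_volume", "medium")   # step 536 branch
        if room_is_cold:
            return ("increase_temperature", "low")  # cold room, app in use
        return ("increase_volume", "low")           # step 539 branch
    if room_is_cold:
        return ("increase_temperature", "high")     # step 541 branch
    return ("ask_user", "none")                     # step 542: prompt for clarification
```

A confidence threshold, as described above, would then decide whether the selected action executes directly or the user 401 is prompted to clarify.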
[0109]
[0110] In step 544, a content recommendation may be generated comprising content items that are similar to the identified content items from the content identification results at step 527. For example, the content recommendation may comprise content items from the same genre as the identified content item, with the same actors, plots, ratings, directors, and/or producers as the recognized content item, or from the same content source of the identified content item. At step 545, the content recommendation may be displayed via one or more content output devices.
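The recommendation generation of step 544 (selecting items that share a genre or actors with the identified content item) may be sketched as follows; the catalog and its metadata are invented for illustration:

```python
# Illustrative sketch of step 544: recommend content items similar to the
# item identified at step 527, here by shared genre or shared actors.

CATALOG = {
    "sitcom_c": {"genre": "comedy", "actors": {"john smith"}},
    "sitcom_d": {"genre": "comedy", "actors": {"jane doe"}},
    "drama_e": {"genre": "drama", "actors": {"john smith"}},
    "movie_a": {"genre": "horror", "actors": {"pat lee"}},
}


def recommend_similar(identified_item):
    """Return other catalog items sharing a genre or an actor with the seed."""
    seed = CATALOG[identified_item]
    return sorted(
        title for title, meta in CATALOG.items()
        if title != identified_item
        and (meta["genre"] == seed["genre"] or meta["actors"] & seed["actors"])
    )
```

A fuller implementation would also weigh plots, ratings, directors, producers, and content source, as the paragraph above notes; the shared-attribute filter here only shows the basic selection step before display at step 545.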
[0111] A database may comprise two or more separate databases, and when considered together, still constitute a “database” as that term is used herein. A database may be distributed across a cloud or the Internet. Various processes described herein may comprise hardware module(s), software module(s) executing on one or more hardware processors, or a combination of hardware and software modules; any of the software modules may comprise instructions stored in the non-rewritable memory 202, the rewritable memory 203, the removable media 204, and/or the hard drive 205, and the instructions, when executed by one or more hardware processors, may cause the one or more hardware processors to perform one or more functions. Although examples are described above, features and/or steps of those examples may be combined, divided, omitted, rearranged, revised, and/or augmented in any desired manner. Various alterations, modifications, and improvements will readily occur to those skilled in the art. Such alterations, modifications, and improvements are intended to be part of this description, though not expressly stated herein, and are intended to be within the spirit and scope of the disclosure. Accordingly, the foregoing description is by way of example only, and is not limiting.