Handling Responses to Speech Processing

20170330565 · 2017-11-16

Abstract

A plurality of microphones are positioned at different locations. A dispatch system in communication with the microphones derives a plurality of audio signals from the plurality of microphones, computes a confidence score for each derived audio signal, and compares the computed confidence scores. Based on the comparison, the dispatch system selects at least one of the derived audio signals for further handling, receives a response to the further processing, and outputs the response using an output device. The output device does not correspond to the microphone that captured the selected audio signals.

Claims

1. A system comprising: a plurality of microphones positioned at different locations; an output device; and a dispatch system in communication with the microphones and configured to: derive a plurality of audio signals from the plurality of microphones, compute a confidence score for each derived audio signal, compare the computed confidence scores, based on the comparison, select at least one of the derived audio signals for further handling, receive a response to the further processing, and output the response using the output device; wherein the output device does not correspond to the microphone that captured the selected audio signals.

2. The system of claim 1, wherein the output device comprises one or more of a loudspeaker, headphones, a wearable audio device, a display, a video screen, or an appliance.

3. The system of claim 1, wherein upon receiving multiple responses to the further processing, the dispatch system determines an order in which to output the responses by combining the responses into a single output.

4. The system of claim 1, wherein upon receiving multiple responses to the further processing, the dispatch system determines an order in which to output the responses by selecting fewer than all of the responses to output.

5. The system of claim 1, wherein upon receiving multiple responses to the further processing, the dispatch system sends different responses to different output devices.

6. A method of processing audio signals, comprising: receiving audio signals from a plurality of microphones positioned at different locations; in a dispatch system in communication with the microphones: deriving a plurality of audio signals from the plurality of microphones, computing a confidence score for each derived audio signal, comparing the computed confidence scores, based on the comparison, selecting at least one of the derived audio signals for further handling, receiving a response to the further processing, and outputting the response using an output device; wherein the output device does not correspond to the microphone that captured the selected audio signals.

7. The method of claim 6, wherein the output device is not located at any of the locations where the microphones are located.

8. A system comprising: a plurality of devices positioned at different locations; and a dispatch system in communication with the devices and configured to: receive a response from a speech processing system in response to a previously-communicated request, determine a relevance of the response to each of the devices, and forward the response to at least one of the devices based on the determination.

9. The system of claim 8, wherein the at least one of the devices comprises an audio output device, and forwarding the response causes that device to output audio signals corresponding to the response.

10. The system of claim 8, wherein the at least one of the devices comprises a display, a video screen, or an appliance.

11. The system of claim 8, wherein the response is a first response, and the dispatch system is further configured to receive a response from a second speech processing system.

12. The system of claim 11, wherein the dispatch system is further configured to forward the first response to a first one of the devices, and forward the second response to a second one of the devices.

13. The system of claim 11, wherein the dispatch system is further configured to forward both the first response and the second response to a first one of the devices.

14. The system of claim 11, wherein the dispatch system is further configured to forward only one of the first response and the second response to any of the devices.

15. The system of claim 8, wherein determining the relevance of the response comprises determining which of the devices were associated with the previously-communicated request.

16. The system of claim 8, wherein determining the relevance of the response comprises determining which of the devices is closest to a user associated with the previously-communicated request.

17. The system of claim 8, wherein determining the relevance of the response is based on preferences associated with a user of the claimed system.

18. The system of claim 8, wherein determining the relevance of the response comprises determining a context of the previously-communicated request.

19. The system of claim 18, wherein the context includes one or more of an identification of a user that was associated with the request, which microphone of a plurality of microphones was associated with the request, a location of the user relative to the device locations, operating state of other devices in the system, and time of day.

20. The system of claim 8, wherein determining the relevance of the response comprises determining capabilities or resource availability of the devices.

21. The system of claim 8, wherein determining the relevance of the response comprises determining a relationship between the output devices and the microphones associated with the selected audio signals.

22. The system of claim 8, wherein determining the relevance of the response comprises determining which of the output devices is closest to a source of the selected audio signal.

23. A system comprising: a plurality of microphones positioned at different microphone locations; a plurality of loudspeakers positioned at different loudspeaker locations; and a dispatch system in communication with the microphones and loudspeakers and configured to: derive a plurality of voice signals from the plurality of microphones; compute a confidence score about the inclusion of a wakeup word for each derived voice signal; compare the computed confidence scores, based on the comparison, select at least one of the derived voice signals and transmit at least a portion of the selected signal or signals to a speech processing system, receive a response from a speech processing system in response to the transmission, determine a relevance of the response to each of the loudspeakers, and forward the response to at least one of the loudspeakers for output based on the determination.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0019] FIG. 1 shows a system layout of microphones and devices that may respond to voice commands received by the microphones.

DESCRIPTION

[0020] As more and more devices implement voice-controlled user interfaces (VUIs), a problem arises that multiple devices may detect the same spoken command and attempt to handle it, resulting in problems ranging from redundant responses to contradictory actions being taken at different points of action. Similarly, if a spoken command can result in output or action by multiple devices, which device should take action may be ambiguous. In some VUIs, a special phrase, referred to as a wakeup word, wake word, or keyword, is used to activate the speech recognition features of the VUI: the device implementing the VUI is always listening for the wakeup word, and when it hears it, it parses whatever spoken commands came after it. This is done to conserve processing resources, by not parsing every sound that is detected, and can help disambiguate which system was the target of the command. However, if multiple systems are listening for the same wakeup word, such as because the wakeup word is associated with a service provider and not individual pieces of hardware, the problem remains of determining which device should handle the command.

[0021] FIG. 1 shows a potential environment, in which a stand-alone microphone array 102, a smart phone 104, a loudspeaker 106, and a set of headphones 108 each have microphones that detect a user's speech (to avoid confusion, we refer to the person speaking as the user and the device 106 as a loudspeaker; discrete things spoken by the user are utterances). Each of the devices that detects the utterance 110 transmits what it heard as an audio signal to a dispatch system 112. In the case of the devices having multiple microphones, those devices may combine the signals rendered by the individual microphones to render a single combined audio signal, or they may transmit a signal rendered by each microphone.

[0022] This disclosure refers to various different types of audio and related signals. For clarity, the following conventions are used. Acoustic signal refers to physical signals, that is, physical sound pressure waves that are interpreted as sound by humans, such as the utterances mentioned above. Audio signal refers to electrical signals that represent sound. Audio signals may be generated from a microphone responding to acoustic audio, or they may be received from other electronic sources, such as recordings, computer-generated signals, or streamed data. Audio output refers to acoustic signals generated by a loudspeaker based on an audio signal input to the speaker.

[0023] The dispatch system 112 may be a cloud-based service to which each of the devices is individually connected, a local service running on one of the same devices or an associated device, a distributed service running cooperatively on some or all of the devices themselves, or any combination of these or similar architectures. Due to their different microphone designs and their differing proximity to the user, each of the devices may hear the utterance 110 differently, if at all. For example, the stand-alone microphone array 102 may have a high-quality beam-forming capability that allows it to clearly hear the utterance regardless of where the user is, while the headphones 108 and the smart phone 104 have highly directional near-field microphones that only clearly pick up the user's voice if the user is wearing the headphones and holding the phone up to their face, respectively. Meanwhile, the loudspeaker 106 may have a simple omnidirectional microphone that detects the speech well if the user is close to and facing the loudspeaker, but produces a low-quality signal otherwise.

[0024] Based on these and similar factors, the dispatch system 112 computes a confidence score for each audio signal (this may include the devices themselves scoring their own detection before sending what they heard, and sending that score along with their respective audio signals). Based on a comparison of the confidence scores, to each other, to a baseline, or both, the dispatch system 112 selects one or more of the audio signals for further processing. This may include locally performing speech recognition and taking direct action, or transmitting the audio signal over a network 114, such as the Internet or any private network, to another service provider. For example, if one of the devices produces an audio signal with a high confidence that the signal contains the wakeup word "OK Google," that audio signal may be sent to Google's cloud-based speech recognition system for handling. In the case that the audio signal is transmitted to a remote service, the wakeup word may be included along with whatever utterance followed it, or the utterance alone may be sent.
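The comparison of confidence scores to each other and to a baseline can be illustrated with a minimal sketch. The names ScoredSignal and select_signals, the [0, 1] score range, the baseline of 0.5, and the margin of 0.05 are all assumptions made for illustration; the disclosure does not specify how the comparison is performed.

```python
from dataclasses import dataclass


@dataclass
class ScoredSignal:
    device: str        # hypothetical device label, e.g. "microphone_array_102"
    confidence: float  # assumed to lie in [0, 1]


def select_signals(signals, baseline=0.5, margin=0.05):
    """Keep signals that clear the baseline and are within `margin` of the best."""
    # Comparison to a baseline: discard signals that are too uncertain.
    eligible = [s for s in signals if s.confidence >= baseline]
    if not eligible:
        return []
    # Comparison to each other: keep the best signal and any near ties.
    best = max(s.confidence for s in eligible)
    return [s for s in eligible if best - s.confidence <= margin]


signals = [
    ScoredSignal("microphone_array_102", 0.92),
    ScoredSignal("smart_phone_104", 0.40),
    ScoredSignal("loudspeaker_106", 0.90),
    ScoredSignal("headphones_108", 0.15),
]
selected = select_signals(signals)
print([s.device for s in selected])
```

Under these assumed values, the microphone array and the loudspeaker are both selected, which corresponds to the case in which more than one audio signal is passed on for further processing.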

[0025] The confidence scoring may be based on a large number of factors, and may indicate confidence in more than one parameter as well. For example, the score may indicate a degree of confidence about which wakeup word was used (including whether one was used at all), or where the user was located relative to the microphone. The score may also indicate a degree of confidence in whether the audio signal is of high quality. In one example, the dispatch system may score the audio signals from two devices as both having a high confidence score that a particular wakeup word was used, but score one of them with a low confidence in the quality of the audio signal, while the other is scored with a high confidence in the audio signal quality. The audio signal with the high confidence score for signal quality would be selected for further processing.
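The two-device example above can be sketched as a score with more than one component. The dictionary layout, the function name pick_by_quality, and the 0.8 wakeup threshold are illustrative assumptions, not details taken from the disclosure.

```python
def pick_by_quality(scores, wake_threshold=0.8):
    """Among signals confident about the wakeup word, pick the best quality.

    `scores` maps a device label to a dict holding two assumed confidence
    components, "wakeup" and "quality", each in [0, 1].
    """
    candidates = {
        device: parts
        for device, parts in scores.items()
        if parts["wakeup"] >= wake_threshold
    }
    if not candidates:
        return None  # no device is confident that a wakeup word was spoken
    return max(candidates, key=lambda device: candidates[device]["quality"])


scores = {
    "smart_phone_104": {"wakeup": 0.95, "quality": 0.30},
    "microphone_array_102": {"wakeup": 0.93, "quality": 0.88},
    "loudspeaker_106": {"wakeup": 0.20, "quality": 0.99},
}
chosen = pick_by_quality(scores)
print(chosen)
```

Here the loudspeaker has the cleanest signal but is excluded because it has low confidence that the wakeup word was used, so the microphone array is chosen over the smart phone on signal quality.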

[0026] When more than one device transmits an audio signal, one of the critical things to determine confidence in is whether the audio signals represent the same utterance or two (or more) different utterances. The scoring itself may be based on such factors as signal level, signal-to-noise ratio (SNR), amount of reverberation in the signal, spectral content of the signal, user identification, knowledge about the user's location relative to the microphones, or relative timing of the audio signals at two or more of the devices. Location-related scoring and user identity-related scoring may be based on both the audio signals themselves and on external data such as visual systems, wearable trackers worn by users, and identity of the devices providing the signals. For example, if a smart phone is the source of the audio signal, a confidence score that the owner of that smart phone is the user whose voice was heard would be high. User location may be determined based on the strength and timing of acoustic signals received at multiple locations, or at multiple microphones in an array at a single location.
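Two of the factors listed above, signal-to-noise ratio and relative timing, can be sketched as follows. The SNR formula is the standard decibel power ratio; the grouping rule (onsets within a fixed window of the earliest onset in a group are treated as the same utterance) and the 0.25 second window are assumptions chosen for illustration only.

```python
import math


def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels from two power estimates."""
    return 10.0 * math.log10(signal_power / noise_power)


def group_by_onset(onsets, window_s=0.25):
    """Group devices whose utterance onsets fall within `window_s` seconds
    of the first onset in the group (an assumed same-utterance heuristic)."""
    groups = []
    for device, t in sorted(onsets.items(), key=lambda item: item[1]):
        if groups and t - groups[-1][0][1] <= window_s:
            groups[-1].append((device, t))
        else:
            groups.append([(device, t)])
    return [[device for device, _ in group] for group in groups]


# Hypothetical onset times, in seconds on a shared clock.
onsets = {
    "microphone_array_102": 10.00,
    "loudspeaker_106": 10.02,
    "smart_phone_104": 13.50,
}
utterances = group_by_onset(onsets)
print(utterances)
print(snr_db(100.0, 1.0))
```

With these assumed onsets, the microphone array and the loudspeaker are grouped as hearing one utterance, while the smart phone is treated as having heard a separate, later one.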

[0027] In addition to determining which wakeup word was used and which signal is best, the scoring may provide additional context that informs how the audio signal should be handled. For example, if the confidence scores indicate that the user was facing the loudspeaker, then it may be that a VUI associated with the loudspeaker should be used, over one associated with the smart phone. Context may include such things as which user was speaking, where the user was located and facing relative to the devices, what activity the user was engaged in (e.g., exercising, cooking, watching TV), what time of day it is, or what other devices are in use (including devices other than those providing the audio signals).

[0028] In some cases, the scoring indicates that more than one command was heard. For example, two devices may each have high confidence that they heard different wakeup words, or that they heard different users speaking. In that case, the dispatch system may send two requests: one request to each system for which a wakeup word was used, or two different requests to a single system that both users invoked. In other cases, more than one of the audio signals may be sent, for example, to get more than one response, to let the remote system decide which one to use, or to improve the voice recognition by combining the signals. In addition to selecting an audio signal for further handling, the scoring may also lead to other user feedback. For example, a light may be flashed on whichever device was selected, so that the user knows the command was received.
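The case of more than one command can be sketched by grouping detections by wakeup word and user and issuing one request per group. The detection fields, the wakeup word labels "wake_a" and "wake_b", and the function name build_requests are hypothetical; the rule of forwarding the highest-quality signal in each group is one possible choice among those the paragraph describes.

```python
from collections import defaultdict


def build_requests(detections):
    """Create one request per distinct (wakeup word, user) pair."""
    groups = defaultdict(list)
    for detection in detections:
        key = (detection["wakeup_word"], detection["user"])
        groups[key].append(detection)

    requests = []
    for (wakeup_word, user), members in groups.items():
        # Within a group, forward the signal with the best quality score.
        best = max(members, key=lambda d: d["quality"])
        requests.append(
            {"service": wakeup_word, "user": user, "source": best["device"]}
        )
    return requests


detections = [
    {"device": "microphone_array_102", "wakeup_word": "wake_a", "user": "user_1", "quality": 0.9},
    {"device": "loudspeaker_106", "wakeup_word": "wake_a", "user": "user_1", "quality": 0.6},
    {"device": "smart_phone_104", "wakeup_word": "wake_b", "user": "user_2", "quality": 0.8},
]
requests = build_requests(detections)
for request in requests:
    print(request)
```

Two requests result: one for the first wakeup word using the microphone array's signal, and one for the second wakeup word using the smart phone's signal.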

[0029] Similar considerations come into play when a response is received from whatever service or system the dispatch system sent the audio signal to for handling. In many cases, the context around the utterance will also inform the handling of the response. For example, the response may be sent to the device from which the selected audio signal was received. In other cases, the response may be sent to a different device. For example, if the audio signal from the stand-alone microphone array 102 was selected, but the response back from the VUI is to start playing an audio file, the response should be handled by the headphones 108 or the loudspeaker 106. If the response is to display information, the smart phone 104 or some other device with a screen would be used to deliver the response. If the microphone array audio signal was selected because the scoring indicated that it had the best signal quality, additional scoring may have indicated that the user was not using the headphones 108 but was in the same room as the loudspeaker 106, so the loudspeaker is the likely target for the response. Other capabilities of the devices would also be considered; for example, while only audio devices are shown, voice commands could address other systems, such as lighting or home automation systems. Hence, if the response to the utterance is to turn down lights, the dispatch system may conclude that it is referring to the lights in the room where the strongest audio signal was detected. Other potential output devices include displays, screens (e.g., the screen on the smart phone, or a television monitor), appliances, door locks, and the like. In some examples, the context is provided to the remote system, and the remote system specifically targets a particular output device based on a combination of the utterance and the context.
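The routing of a response to an output device can be sketched as a capability filter followed by a context-based ranking. The capability labels, room names, response kinds, and the ranking order (devices in use first, then devices in the user's room) are assumptions made for illustration and are not prescribed by the disclosure.

```python
def route_response(response_kind, devices, user_room, in_use):
    """Pick an output device that can handle the response and suits the context."""
    needed = {"play_audio": "audio_out", "display": "screen"}[response_kind]
    capable = [d for d in devices if needed in d["capabilities"]]
    if not capable:
        return None

    def rank(device):
        # Prefer a device the user is actively using, then one in the user's room.
        return (device["name"] in in_use, device["room"] == user_room)

    return max(capable, key=rank)["name"]


# Hypothetical description of the devices of FIG. 1.
devices = [
    {"name": "microphone_array_102", "room": "living_room", "capabilities": set()},
    {"name": "smart_phone_104", "room": "bedroom", "capabilities": {"screen"}},
    {"name": "loudspeaker_106", "room": "living_room", "capabilities": {"audio_out"}},
    {"name": "headphones_108", "room": None, "capabilities": {"audio_out"}},
]

audio_target = route_response("play_audio", devices, "living_room", in_use=set())
display_target = route_response("display", devices, "living_room", in_use=set())
print(audio_target, display_target)
```

With the headphones not in use and the user in the same room as the loudspeaker, the audio response goes to the loudspeaker, while a display response goes to the smart phone as the only device with a screen. Passing in_use={"headphones_108"} would send the audio response to the headphones instead.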

[0030] As mentioned, the dispatch system may be a single computer or a distributed system. The speech processing provided may similarly be provided by a single computer or a distributed system, coextensive with or separate from the dispatch system. They each may be located entirely locally to the devices, entirely in the cloud, or split between both. They may be integrated into one or all of the devices. The various tasks described (scoring signals, detecting wakeup words, sending a signal to another system for handling, parsing the signal for a command, handling the command, generating a response, determining which device should handle the response, etc.) may be combined together or broken down into more sub-tasks. Each of the tasks and sub-tasks may be performed by a different device or combination of devices, locally or in a cloud-based or other remote system.

[0031] When we refer to microphones, we include microphone arrays without any intended restriction on particular microphone technology, topology, or signal processing. Similarly, references to loudspeakers and headphones should be understood to include any audio output devices: televisions, home theater systems, doorbells, wearable speakers, etc.

[0032] Embodiments of the systems and methods described above comprise computer components and computer-implemented steps that will be apparent to those skilled in the art. For example, it should be understood by one of skill in the art that instructions for executing the computer-implemented steps may be stored as computer-executable instructions on a computer-readable medium such as, for example, floppy disks, hard disks, optical disks, Flash ROMS, nonvolatile ROM, and RAM. Furthermore, it should be understood by one of skill in the art that the computer-executable instructions may be executed on a variety of processors such as, for example, microprocessors, digital signal processors, gate arrays, etc. For ease of exposition, not every step or element of the systems and methods described above is described herein as part of a computer system, but those skilled in the art will recognize that each step or element may have a corresponding computer system or software component. Such computer system and/or software components are therefore enabled by describing their corresponding steps or elements (that is, their functionality), and are within the scope of the disclosure.

[0033] A number of implementations have been described. Nevertheless, it will be understood that additional modifications may be made without departing from the scope of the inventive concepts described herein, and, accordingly, other embodiments are within the scope of the following claims.