MULTI-DEVICE WAKEWORD DETECTION
20220351724 · 2022-11-03
Assignee
Inventors
Cpc classification
G10L15/22
PHYSICS
G10L15/32
PHYSICS
International classification
Abstract
A method for selecting a device for audio processing may involve receiving a first wakeword confidence metric from a first device that includes at least a first microphone and receiving a second wakeword confidence metric from a second device that includes at least a second microphone. The first and second wakeword confidence metrics may correspond to a first local maximum of a first plurality of wakeword confidence values determined by the first device and a second local maximum of a second plurality of wakeword confidence values determined by the second device. The method may involve comparing the first wakeword confidence metric and the second wakeword confidence metric and selecting a device for subsequent audio processing based, at least in part, on a comparison of the first wakeword confidence metric and the second wakeword confidence metric.
Claims
1. A method of selecting a device for audio processing, the method comprising: receiving a first wakeword confidence metric from a first device that includes at least a first microphone, the first wakeword confidence metric corresponding to a first local maximum of a first plurality of wakeword confidence values determined by the first device; receiving a second wakeword confidence metric from a second device that includes at least a second microphone, the second wakeword confidence metric corresponding to a second local maximum of a second plurality of wakeword confidence values determined by the second device; comparing the first wakeword confidence metric and the second wakeword confidence metric; and selecting a device for subsequent audio processing based, at least in part, on a comparison of the first wakeword confidence metric and the second wakeword confidence metric.
2. The method of claim 1, further comprising: receiving a third wakeword confidence metric from a third device that includes at least a third microphone, the third wakeword confidence metric corresponding to a third local maximum of a third plurality of wakeword confidence values determined by the third device; comparing the third wakeword confidence metric with the first wakeword confidence metric and the second wakeword confidence metric; and selecting a device for subsequent audio processing based, at least in part, on a comparison of the first wakeword confidence metric, the second wakeword confidence metric and the third wakeword confidence metric.
3. The method of claim 1, wherein the subsequent audio processing comprises a speech recognition process.
4. The method of claim 1, wherein the subsequent audio processing comprises a command recognition process.
5. The method of claim 4, further comprising controlling a selected device according to the command recognition process.
6. The method of claim 1, wherein a local maximum is determined subsequent to determining that a wakeword confidence value exceeds a wakeword detection start threshold.
7. The method of claim 6, wherein a local maximum is determined by detecting a decrease in a wakeword confidence value after a previous wakeword confidence value has exceeded the wakeword detection start threshold, or wherein a local maximum is determined by detecting, after a previous wakeword confidence value has exceeded the wakeword detection start threshold, a decrease in a wakeword confidence value of audio frame n as compared to a wakeword confidence value of audio frame n−k, wherein k is an integer.
8. (canceled)
9. The method of claim 6, further comprising initiating a local maximum determination time interval after a wakeword confidence value of the first device, the second device or another device exceeds, with a rising edge, the wakeword detection start threshold.
10. The method of claim 9, further comprising terminating the local maximum determination time interval after a wakeword confidence value of the first device, the second device or another device falls below a wakeword detection end threshold.
11. The method of claim 1, wherein: the first device samples audio data received by the first microphone according to a first clock domain; and the second device samples audio data received by the second microphone according to a second clock domain that is different from the first clock domain.
12-14. (canceled)
15. A method of selecting a device for audio processing, the method comprising: determining, by a first device that includes a first microphone system having at least a first microphone, a first wakeword confidence metric, wherein determining the first wakeword confidence metric involves: producing, via the first microphone system, first audio data corresponding to detected sound; determining, based on the first audio data, a first plurality of wakeword confidence values; determining a first local maximum of the first plurality of wakeword confidence values; and determining the first wakeword confidence metric based on the first local maximum; receiving a second wakeword confidence metric from a second device that includes at least a second microphone, the second wakeword confidence metric corresponding to a second local maximum of a second plurality of wakeword confidence values determined by the second device; comparing the first wakeword confidence metric and the second wakeword confidence metric; and selecting a device for subsequent audio processing based, at least in part, on a comparison of the first wakeword confidence metric and the second wakeword confidence metric.
16. The method of claim 15, wherein a local maximum is determined subsequent to determining that a wakeword confidence value exceeds a wakeword detection start threshold, or wherein a local maximum is determined by detecting a decrease in a wakeword confidence value after a previous wakeword confidence value has exceeded the wakeword detection start threshold.
17. (canceled)
18. The method of claim 15, wherein a local maximum is determined by detecting, after a previous wakeword confidence value has exceeded the wakeword detection start threshold, a decrease in a wakeword confidence value of audio frame n as compared to a wakeword confidence value of audio frame n−k, wherein k is an integer.
19. The method of claim 18, further comprising initiating a local maximum determination time interval after a wakeword confidence value of the first device, the second device or another device exceeds, with a rising edge, the wakeword detection start threshold.
20. The method of claim 19, wherein the local maximum determination time interval initiates at time A and terminates at a time (A+K), a time at which wakeword confidence values of the first device and the second device fall below a wakeword detection end threshold.
21. The method of claim 19, wherein the local maximum determination time interval initiates at time A and terminates at a time (A+K), a time at which a wakeword confidence value of the first device, the second device or another device falls below a wakeword detection end threshold.
22-26. (canceled)
27. An apparatus configured to perform the method of claim 1.
28. One or more non-transitory media having software stored thereon, the software including instructions for controlling one or more devices to perform the method of claim 1.
29. An apparatus configured to perform the method of claim 15.
30. One or more non-transitory media having software stored thereon, the software including instructions for controlling one or more devices to perform the method of claim 15.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0058] Like reference numbers and designations in the various drawings indicate like elements.
DETAILED DESCRIPTION
[0059] An orchestrated system consisting of multiple smart audio devices may be configured to determine when a “wakeword” (defined above) from a user is detected. At least some devices of such a system may be configured to listen for a command from the user.
[0061] In a living space (e.g., that of
[0070] In accordance with some embodiments, a system that estimates where a sound (e.g., a wakeword or other signal for attention) arises or originates may have some determined confidence in (or multiple hypotheses for) the estimate. For example, if a user happens to be near a boundary between zones of the system's environment, an uncertain estimate of the location of the user may include a determined confidence that the user is in each of the zones. In some conventional implementations of a voice interface, the voice assistant's voice may be required to issue from only one location at a time, forcing a single choice for the single location (e.g., one of the eight speaker locations, 1.1 and 1.3, in
[0071] Next, with reference to
[0072] More specifically, elements of the
[0082] As talker 101 utters sound 102 indicative of a wakeword in the acoustic space, the sound is received by nearby device 103, mid-distance device 105, and far device 107. In this example, each of devices 103, 105, and 107 is (or includes) a wakeword detector, and each of devices 103, 105, and 107 is configured to determine when wakeword likelihood (probability that a wakeword has been detected by the device) exceeds a predefined threshold. As time progresses, the wakeword likelihood determined by each device can be graphed as a function of time.
[0084] As is apparent from inspection of
[0085] Returning to
[0087] In this example, the apparatus 300 includes an interface system 305 and a control system 310. The interface system 305 may, in some implementations, be configured for receiving input from each of a plurality of microphones in an environment. The interface system 305 may include one or more network interfaces and/or one or more external device interfaces (such as one or more universal serial bus (USB) interfaces). According to some implementations, the interface system 305 may include one or more wireless interfaces. The interface system 305 may include one or more devices for implementing a user interface, such as one or more microphones, one or more speakers, a display system, a touch sensor system and/or a gesture sensor system. In some examples, the interface system 305 may include one or more interfaces between the control system 310 and a memory system, such as the optional memory system 315 shown in
[0088] The control system 310 may, for example, include a general purpose single- or multi-chip processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, and/or discrete hardware components. In some implementations, the control system 310 may reside in more than one device. For example, a portion of the control system 310 may reside in a device within one of the environments depicted in
[0089] In some implementations, the control system 310 may be configured for performing, at least in part, the methods disclosed herein. According to some examples, the control system 310 may be configured for implementing methods of selecting a device for audio processing, e.g., such as those disclosed herein. In some such examples, the control system 310 may be configured for selecting, based at least in part on a comparison of a plurality of wakeword confidence metrics, a device for audio processing.
[0090] Some or all of the methods described herein may be performed by one or more devices according to instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include memory devices such as those described herein, including but not limited to random access memory (RAM) devices, read-only memory (ROM) devices, etc. The one or more non-transitory media may, for example, reside in the optional memory system 315 shown in
[0091] In some examples, the apparatus 300 may include the optional microphone system shown in
[0093] In this example, block 405 involves receiving a first wakeword confidence metric from a first device that includes at least a first microphone. According to this example, the first wakeword confidence metric corresponds to a first local maximum of a first plurality of wakeword confidence values determined by the first device.
[0094] In this implementation, block 410 involves receiving a second wakeword confidence metric from a second device that includes at least a second microphone. According to this example, the second wakeword confidence metric corresponds to a second local maximum of a second plurality of wakeword confidence values determined by the second device. In this example, the first device and the second device are in the same environment, which may be an environment like that shown in
[0095] However, the first microphone and the second microphone may or may not be synchronous microphones, based on the particular implementation. As used herein, microphones may be referred to as “synchronous” if the sounds detected by the microphones are digitally sampled using the same sample clock, or synchronized sample clocks. For example, a first microphone of a plurality of microphones within the environment may sample audio data according to a first sample clock and a second microphone of the plurality of microphones may sample audio data according to the first sample clock.
[0096] According to some alternative implementations, at least some microphones, or microphone systems, of an environment may be “asynchronous.” As used herein, microphones may be referred to as “asynchronous” if the sounds detected by the microphones are digitally sampled using distinct sample clocks. For example, a first microphone of a plurality of microphones within the environment may sample audio data according to a first sample clock and a second microphone of the plurality of microphones may sample audio data according to a second sample clock. In some instances, the microphones in an environment may be randomly located, or at least may be distributed within the environment in an irregular and/or asymmetric manner.
[0097] Referring again to
[0098] According to the example shown in
[0099] According to some implementations, the subsequent audio processing may be, or may include, a speech recognition process. For example, the subsequent audio processing may be, or may include, a command recognition process. In some instances, method 400 may involve controlling a selected device according to the command recognition process. For example, method 400 may involve controlling a virtual assistant according to the command recognition process. In some such examples, method 400 may involve controlling the virtual assistant to initiate a telephone call, controlling the virtual assistant to perform an Internet search, or controlling the virtual assistant to provide instructions to another device, such as a television, a sound system controller or another device in the environment.
[0100] In some examples, method 400 may involve receiving wakeword confidence metrics from more than two devices in an environment. Some such examples may involve receiving a third wakeword confidence metric from a third device that includes at least a third microphone. The third wakeword confidence metric may correspond to a third local maximum of a third plurality of wakeword confidence values determined by the third device. In some such examples, method 400 may involve comparing the third wakeword confidence metric with the first wakeword confidence metric and the second wakeword confidence metric and selecting a device for subsequent audio processing based, at least in part, on a comparison of the first wakeword confidence metric, the second wakeword confidence metric and the third wakeword confidence metric.
[0101] According to some examples, method 400 may involve receiving first through Nth wakeword confidence metrics from first through Nth devices in an environment. The first through Nth wakeword confidence metrics may correspond to first through Nth local maxima of the wakeword confidence values determined by the first through Nth devices. In some such examples, method 400 may involve comparing the first through Nth wakeword confidence metrics and selecting a device for subsequent audio processing based, at least in part, on a comparison of the first through Nth wakeword confidence metrics.
[0102] In some implementations, blocks 405 and 410 may involve receiving, by a third device configured for determining wakeword confidence values and determining a local maximum of the wakeword confidence values, the first wakeword confidence metric and the second wakeword confidence metric. In some such implementations, the third device may be configured to perform at least blocks 415 and 420 of method 400. In some implementations, the third device may be a local device. In some such implementations, all three devices may be, or may include, a wakeword detector. One or more of the devices may be, or may include, a virtual assistant. However, in other implementations, the third device may be a remote device, such as a server.
[0103] According to some examples, a local maximum may be determined subsequent to determining that a wakeword confidence value exceeds a wakeword detection start threshold, which may be a predetermined threshold. For example, referring again to
[0104] In some such implementations, a local maximum may be determined by detecting, after a previous wakeword confidence value has exceeded the wakeword detection start threshold, a decrease in a wakeword confidence value of an audio frame as compared to a wakeword confidence value of a previous audio frame, which in some instances may be the most recent audio frame or one of the most recent audio frames. For example, a local maximum may be determined by detecting, after a previous wakeword confidence value has exceeded the wakeword detection start threshold, a decrease in a wakeword confidence value of audio frame n as compared to a wakeword confidence value of audio frame n−k, wherein k is an integer.
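The frame-comparison test described above may be illustrated by the following minimal sketch. The function name `find_local_max`, its parameters, and the choice to report the largest value seen up to the frame at which the decrease is detected are assumptions made for illustration only and are not taken from the disclosure.

```python
from typing import Optional, Sequence, Tuple


def find_local_max(
    values: Sequence[float], start_threshold: float, k: int = 1
) -> Optional[Tuple[int, float]]:
    """Return (n, peak) for the first frame n whose confidence is lower than
    that of frame n-k, once an earlier frame has exceeded start_threshold.

    peak is the largest confidence value seen up to and including frame n
    (an illustrative choice). Returns None if no such decrease is found.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    armed = False  # becomes True once a value has exceeded the start threshold
    for n, value in enumerate(values):
        # Only a *previous* value may arm the detector, so test before arming.
        if armed and n >= k and value < values[n - k]:
            return n, max(values[: n + 1])
        if value > start_threshold:
            armed = True
    return None


confidences = [0.1, 0.3, 0.6, 0.8, 0.9, 0.85, 0.5]
print(find_local_max(confidences, start_threshold=0.5, k=1))
print(find_local_max(confidences, start_threshold=0.5, k=2))
```

A larger k delays the decision by comparing against an older frame, which may make the test less sensitive to small frame-to-frame fluctuations.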
[0105] According to some such implementations, some methods may involve initiating a local maximum determination time interval after a wakeword confidence value of the first device, the second device or another device exceeds, with a rising edge, the wakeword detection start threshold. Some such methods may involve terminating the local maximum determination time interval after a wakeword confidence value of the first device, the second device or another device falls below a wakeword detection end threshold.
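One possible reading of the local maximum determination time interval is sketched below. The class name `LocalMaxInterval`, the per-frame dictionary input, the `require_all` switch, and the treatment of a previously unseen device as having a prior value of 0.0 are all assumptions introduced for this sketch. The interval opens on a rising edge through the start threshold on any device, and closes either when any device falls through the end threshold or, if `require_all` is set, when every device is below it.

```python
from typing import Dict, Optional


class LocalMaxInterval:
    """Tracks per-device peak confidence inside one determination interval."""

    def __init__(self, start_threshold: float, end_threshold: float,
                 require_all: bool = True) -> None:
        self.start = start_threshold
        self.end = end_threshold
        self.require_all = require_all
        self.active = False
        self.prev: Dict[str, float] = {}   # values from the previous frame
        self.peaks: Dict[str, float] = {}  # running maximum per device

    def update(self, frame: Dict[str, float]) -> Optional[Dict[str, float]]:
        """Feed one frame of {device_id: confidence}.

        Returns the per-device peaks when the interval terminates, else None.
        """
        if not self.active:
            # Rising edge: at or below the start threshold before, above it now.
            if any(v > self.start and self.prev.get(d, 0.0) <= self.start
                   for d, v in frame.items()):
                self.active = True
                self.peaks = {}
        closed = None
        if self.active:
            for d, v in frame.items():
                self.peaks[d] = max(v, self.peaks.get(d, v))
            if self.require_all:
                done = all(v < self.end for v in frame.values())
            else:
                done = any(v < self.end and self.prev.get(d, 0.0) >= self.end
                           for d, v in frame.items())
            if done:
                self.active = False
                closed = dict(self.peaks)
        self.prev = dict(frame)
        return closed


tracker = LocalMaxInterval(start_threshold=0.5, end_threshold=0.3)
frames = [
    {"near": 0.1, "far": 0.05},
    {"near": 0.6, "far": 0.2},
    {"near": 0.9, "far": 0.4},
    {"near": 0.4, "far": 0.35},
    {"near": 0.2, "far": 0.1},
]
results = [tracker.update(f) for f in frames]
print(results[-1])
```

Using an end threshold lower than the start threshold gives the interval hysteresis, so that a confidence value hovering near a single threshold would not repeatedly open and close it.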
[0106] For example, referring again to
[0107] According to some examples, the local maximum determination time interval may terminate after a wakeword confidence value of all devices in a group falls below the wakeword detection end threshold 215b. For example, referring to
[0109] According to this example, a sequence of values of wakeword confidence is determined by each of the detectors 502A-502N, and each such sequence is fed into one of a plurality of local maximum detectors 503A-503N. In some such examples, each such value is $w_i(n)$, $i \in \{1, \ldots, M\}$, where M represents the number of wakeword detectors 502, i represents a detector index and n represents a frame index. At some time after a wakeword confidence (determined by one of detectors 502A-502N) exceeds a predefined wakeword detection start threshold, the wakeword confidence typically begins to fall. For example, one of the local maximum detectors 503A-503N may determine that $w_i(n) < w_i(n-k)$, where k represents a number of frames. In one such implementation, one of the local maximum detectors 503A-503N may determine that $w_i(n) < w_i(n-1)$. When the wakeword confidence begins to fall, in some implementations the local maximum confidence value $y_i$ up to this point in time may be recorded. In some implementations, $y_i = \max(\mathbf{w}_i)$, where $\mathbf{w}_i = [w_i(n-N), w_i(n-N+1), \ldots, w_i(n)]^T$ and N represents the length of a relevant history buffer.
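A streaming local maximum detector of the kind described above might be sketched as follows. The class name `LocalMaxDetector`, the `reset` method, and the decision to report only once until reset are assumptions made for illustration. The buffer holds N+1 values so that it spans frames n−N through n, matching the vector given above.

```python
from collections import deque
from typing import Deque, Optional


class LocalMaxDetector:
    """Per-device detector: buffers recent confidences and reports their
    maximum once a fall relative to frame n-k is seen after the start
    threshold has been exceeded."""

    def __init__(self, start_threshold: float, history_len: int, k: int = 1) -> None:
        if k < 1 or history_len < k:
            raise ValueError("require 1 <= k <= history_len")
        self.start_threshold = start_threshold
        self.k = k
        # N + 1 entries: w(n-N), ..., w(n)
        self.history: Deque[float] = deque(maxlen=history_len + 1)
        self.armed = False
        self.reported = False  # assumption: report once, then wait for reset()

    def reset(self) -> None:
        self.history.clear()
        self.armed = False
        self.reported = False

    def push(self, w: float) -> Optional[float]:
        """Append the newest confidence value; return the buffered maximum
        when a fall is detected, otherwise None."""
        self.history.append(w)
        if self.reported:
            return None
        fell = (self.armed and len(self.history) > self.k
                and w < self.history[-1 - self.k])
        if fell:
            self.reported = True
            return max(self.history)
        if w > self.start_threshold:
            self.armed = True
        return None


detector = LocalMaxDetector(start_threshold=0.5, history_len=4)
outputs = [detector.push(w) for w in [0.1, 0.3, 0.6, 0.8, 0.9, 0.85, 0.5]]
print(outputs)
```

If the history buffer is shorter than the rise time of the confidence curve, the true peak may already have been discarded by the time the fall is detected, so the buffer length would need to be chosen with the expected wakeword duration in mind.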
[0110] According to some such implementations, each such local maximum confidence value may be provided to an element of the system that implements a device selector. In the example that is shown in
[0111] According to some examples, after all of the devices have produced a maximum confidence value $y_i$, the most confident device, i.e., the device with index $\arg\max_i y_i$, whose maximum confidence value is the greatest of the values $y_i$, is chosen for subsequent speech capture. For example, if the wakeword detector nearest the user generates the greatest maximum confidence value, the smart audio device in (or for) which this detector is implemented is caused to enter a state of attentiveness (and may assert an appropriate attentiveness indication to the user) in which it awaits a subsequent voice command, and then, in response to such a voice command, the device may perform at least one predetermined action.
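The selection step may be sketched as a small collector that waits until every device has reported its maximum confidence value and then returns the index of the most confident device. The class name `DeviceSelector`, the `report` method, and the tie-breaking rule (the first-reported device among equal values wins) are assumptions introduced for this sketch.

```python
from typing import Dict, Optional


class DeviceSelector:
    """Collects one maximum confidence value per device, then picks argmax."""

    def __init__(self, num_devices: int) -> None:
        self.num_devices = num_devices
        self.reports: Dict[int, float] = {}

    def report(self, index: int, y: float) -> Optional[int]:
        """Record the maximum confidence y for device `index`.

        Returns the index of the selected device once all devices have
        reported, otherwise None.
        """
        self.reports[index] = y
        if len(self.reports) < self.num_devices:
            return None
        # max() over the keys returns the first key with the largest value,
        # so ties go to the device that reported first (an assumption).
        winner = max(self.reports, key=lambda i: self.reports[i])
        self.reports = {}  # ready for the next wakeword event
        return winner


selector = DeviceSelector(num_devices=3)
selector.report(0, 0.93)   # nearby device
selector.report(2, 0.42)   # far device
print(selector.report(1, 0.71))
```

Because only one scalar per device is exchanged, a selector of this kind would not need the devices to share a sample clock.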
[0113] In this example, block 605 involves determining, by a first device that includes a first microphone system having at least a first microphone, a first wakeword confidence metric. In this example, determining the first wakeword confidence metric involves producing, via the first microphone system, first audio data corresponding to detected sound. According to this example, determining the first wakeword confidence metric involves determining, based on the first audio data, a first plurality of wakeword confidence values and determining a first local maximum of the first plurality of wakeword confidence values. In this implementation, determining the first wakeword confidence metric involves determining the first wakeword confidence metric based on the first local maximum. For example, determining the first wakeword confidence metric may involve making the first wakeword confidence metric equal to the first local maximum.
[0114] In this implementation, block 610 involves receiving a second wakeword confidence metric from a second device that includes at least a second microphone. According to this example, the second wakeword confidence metric corresponds to a second local maximum of a second plurality of wakeword confidence values determined by the second device. In this example, the first device and the second device are in the same environment, which may be an environment like that shown in
[0115] However, the first microphone and the second microphone may or may not be synchronous microphones, based on the particular implementation. According to some examples, a first microphone of a plurality of microphones within the environment may sample audio data according to a first sample clock and a second microphone of the plurality of microphones may sample audio data according to a second sample clock.
[0116] According to the example shown in
[0117] According to some implementations, the subsequent audio processing may be, or may include, a speech recognition process. For example, the subsequent audio processing may be, or may include, a command recognition process. In some instances, method 600 may involve controlling a selected device according to the command recognition process. For example, method 600 may involve controlling a virtual assistant according to the command recognition process. In some such examples, method 600 may involve controlling the virtual assistant to initiate a telephone call, controlling the virtual assistant to perform an Internet search, or controlling the virtual assistant to provide instructions to another device, such as a television, a sound system controller or another device in the environment.
[0118] In some examples, method 600 may involve receiving wakeword confidence metrics from more than two devices in an environment. Some such examples may involve receiving a third wakeword confidence metric from a third device that includes at least a third microphone. The third wakeword confidence metric may correspond to a third local maximum of a third plurality of wakeword confidence values determined by the third device. In some such examples, method 600 may involve comparing the third wakeword confidence metric with the first wakeword confidence metric and the second wakeword confidence metric and selecting a device for subsequent audio processing based, at least in part, on a comparison of the first wakeword confidence metric, the second wakeword confidence metric and the third wakeword confidence metric.
[0119] According to some examples, method 600 may involve receiving first through Nth wakeword confidence metrics from first through Nth devices in an environment. The first through Nth wakeword confidence metrics may correspond to first through Nth local maxima of the wakeword confidence values determined by the first through Nth devices. In some such examples, method 600 may involve comparing the first through Nth wakeword confidence metrics and selecting a device for subsequent audio processing based, at least in part, on a comparison of the first through Nth wakeword confidence metrics.
[0120] In some implementations, method 600 may involve receiving, by a third device configured for determining wakeword confidence values and determining a local maximum of the wakeword confidence values, the first wakeword confidence metric and the second wakeword confidence metric. In some such implementations, the third device may be configured to perform at least blocks 415 and 420 of method 400. In some implementations, the third device may be a local device. In some such implementations, all three devices may be, or may include, a wakeword detector. One or more of the devices may be, or may include, a virtual assistant. However, in other implementations, the third device may be a local device that does not include a wakeword detector and/or a device that is not configured to determine a wakeword confidence metric corresponding to a local maximum of a plurality of wakeword confidence values. According to some alternative implementations, the third device may be a remote device, such as a server.
[0121] According to some examples, a local maximum may be determined subsequent to determining that a wakeword confidence value exceeds a wakeword detection start threshold, which may be a predetermined threshold. For example, referring again to
[0122] In some such implementations, a local maximum may be determined by detecting, after a previous wakeword confidence value has exceeded the wakeword detection start threshold, a decrease in a wakeword confidence value of an audio frame as compared to a wakeword confidence value of a previous audio frame, which in some instances may be the most recent audio frame or one of the most recent audio frames. For example, a local maximum may be determined by detecting, after a previous wakeword confidence value has exceeded the wakeword detection start threshold, a decrease in a wakeword confidence value of audio frame n as compared to a wakeword confidence value of audio frame n−k, wherein k is an integer.
[0123] According to some such implementations, some methods may involve initiating a local maximum determination time interval after a wakeword confidence value of the first device, the second device or another device exceeds, with a rising edge, the wakeword detection start threshold. Some such methods may involve terminating the local maximum determination time interval after a wakeword confidence value of the first device, the second device or another device falls below a wakeword detection end threshold.
[0124] According to some such methods, the local maximum determination time interval may initiate at time A and may terminate at a time (A+K). Some such methods are described above with reference to
[0125] While specific embodiments and applications of the disclosure have been described herein, it will be apparent to those of ordinary skill in the art that many variations on the embodiments and applications described herein are possible without departing from the scope of this disclosure.