VOICE IDENTIFICATION FOR OPTIMIZING VOICE SEARCH RESULTS
20230186941 · 2023-06-15
Inventors
- Ajay Juneja (Punjab, IN)
- Vaibhav Gupta (Bangalore, IN)
- Ashish Gupta (Bangalore, IN)
- Senthil Kumar Karuppasamy (Bangalore, IN)
- Reda Harb (Bellevue, WA, US)
Cpc classification
G10L15/22
PHYSICS
International classification
G10L15/22
PHYSICS
Abstract
Systems and methods are provided for processing a voice input stream with interruptions and/or supplemental comments. Generally, a virtual voice assistant may receive an input stream with a first input comprising a voice query from a first voice and a second input comprising a secondary query from a second voice (e.g., an interruption or a supplement). The virtual assistant may determine that the second voice does not match the first voice, and then process the voice query to produce first results. Some embodiments may determine whether the secondary query is a supplement or an interruption and, e.g., choose to ignore an interruption or set aside a supplement if it may be used to help the search query. In some embodiments, results for the first query may be compared with results for the first query with a portion of the supplement.
Claims
1. A method of processing a voice input stream comprising a first input and a second input, the method comprising: receiving the first input comprising a voice query from a first voice; receiving the second input comprising a secondary query from a second voice; determining that the second voice does not match the first voice; and in response to determining that the second voice does not match the first voice, processing the voice query, without the second query, to produce first results.
2. The method of claim 1 further comprising determining, based on the first results, whether the secondary query is a supplement or an interruption.
3. The method of claim 2, wherein determining, based on the first results, whether the secondary query is a supplement or an interruption comprises: calculating a relevance score for the first results; determining whether the relevance score meets or exceeds a predetermined threshold; in response to determining the relevance score is below the predetermined threshold: providing the first results; and in response to determining the relevance score meets or exceeds the predetermined threshold: processing the voice query with one or more portions of the secondary query to produce second results.
4. The method of claim 1 further comprising: calculating a first relevance score for the first results; processing the voice query with one or more portions of the secondary query to produce second results; calculating a second relevance score for the second results; comparing the first relevance score to the second relevance score; and in response to determining the second relevance score meets or exceeds the first relevance score, providing a portion of the second results.
5. The method of claim 1, wherein determining that the second voice does not match the first voice comprises: comparing traits of the first voice with traits of the second voice; determining, based on the comparison, a voice match score; determining that the voice match score is less than a match threshold; and outputting that no match exists.
6. The method of claim 1, wherein determining that the second voice does not match the first voice comprises inputting the first input and the second input into a trained machine learning model to generate data indicative of whether the first input matches the second input.
7. The method of claim 1, wherein determining that the second voice does not match the first voice comprises: accessing a plurality of voice profiles; comparing the first input to the plurality of voice profiles to determine a first profile for the first voice; comparing the second input to the plurality of voice profiles to determine a second profile for the second voice; determining that the first profile is not a match to the second profile; and outputting that no match exists.
8. The method of claim 1, wherein the voice query comprises a first set of text based on the first input and the second query comprises a second set of text based on the second input.
9. The method of claim 1, wherein determining that the second voice does not match the first voice further comprises: receiving a third input comprising a third query from a third voice; determining that the third voice matches the first voice; and combining the third query with the first query.
10. The method of claim 1, wherein determining that the second voice does not match the first voice further comprises: receiving a third input comprising a third query from a third voice; determining that the third query matches at least one of the following: the first query and the second query; transmitting a command to pause or mute content; receiving a fourth input comprising a fourth query; and processing the fourth query.
11. A system for processing a voice input stream comprising a first input and a second input, the system comprising: input/output circuitry configured to: receive the first input comprising a voice query from a first voice; receive the second input comprising a secondary query from a second voice; and processing circuitry configured to: determine that the second voice does not match the first voice; and in response to determining that the second voice does not match the first voice, process the voice query, without the second query, to produce first results.
12. The system of claim 11, wherein the processing circuitry is further configured to: determine, based on the first results, whether the secondary query is a supplement or an interruption; in response to determining the secondary query is a supplement, process the voice query with one or more portions of the secondary query to produce second results; and provide the second results.
13. The system of claim 12, wherein the processing circuitry is further configured to determine, based on the first results, whether the secondary query is a supplement or an interruption by: calculating a relevance score for the first results; determining whether the relevance score meets or exceeds a predetermined threshold; in response to determining the relevance score is below the predetermined threshold, providing the first results; and in response to determining the relevance score meets or exceeds the predetermined threshold, processing the voice query with one or more portions of the secondary query to produce second results.
14. The system of claim 11, wherein the instructions further cause the control circuitry to: calculate a first relevance score for the first results; process the voice query with one or more portions of the secondary query to produce second results; calculate a second relevance score for the second results; compare the first relevance score to the second relevance score; and in response to determining the second relevance score meets or exceeds the first relevance score, provide a portion of the second results.
15. The system of claim 11, wherein the processing circuitry is further configured to determine that the second voice does not match the first voice by: comparing traits of the first voice with traits of the second voice; determining, based on the comparison, a voice match score; determining that the voice match score is less than a match threshold; and outputting that no match exists.
16. The system of claim 11, wherein the processing circuitry is further configured to determine that the second voice does not match the first voice by inputting the first input and the second input into a trained machine learning model to generate data indicative of whether the first input matches the second input.
17. The system of claim 11, wherein the processing circuitry is further configured to determine that the second voice does not match the first voice by: accessing a plurality of voice profiles; comparing the first input to the plurality of voice profiles to determine a first profile for the first voice; comparing the second input to the plurality of voice profiles to determine a second profile for the second voice; determining that the first profile is not a match to the second profile; and outputting that no match exists.
18. The system of claim 11, wherein the voice query comprises a first set of text based on the first input and the second query comprises a second set of text based on the first input.
19. The system of claim 11, wherein the input/output circuitry is further configured to receive a third input comprising a third query from a third voice; and wherein the processing circuitry is further configured to determine that the second voice does not match the first voice by: determining that the third voice matches the first voice; and combining the third query with the first query.
20. The system of claim 11, wherein the input/output circuitry is further configured to: receive a third input comprising a third query from a third voice; transmit a command to pause or mute content; receive a fourth input comprising a fourth query; and wherein the processing circuitry is further configured to determine that the second voice does not match the first voice by: determining that the third query matches at least one of the following: the first query and the second query; instructing the input/output circuitry to transmit a command to pause or mute content in response to determining that the third query matches the first query or the second query; and processing the fourth query.
21-60. (canceled)
Description
BRIEF DESCRIPTION OF THE FIGURES
[0029] The above and other objects and advantages of the disclosure will be apparent upon consideration of the following detailed description, taken in conjunction with the accompanying drawings, in which like reference characters refer to like parts throughout, and in which:
[0030]
[0031]
[0032]
[0033]
[0034]
[0035]
[0036]
[0037]
[0038]
[0039]
[0040]
[0041]
[0042]
[0043]
[0044]
DETAILED DESCRIPTION
[0045]
[0046] Device 101 may be any computing device providing a user interface, such as a voice assistant, a virtual assistant, and/or a voice interface allowing for voice-based communication with a user and/or via an electronic content display system for a user. Examples of such computing devices are a smart home assistant similar to a Google Home® device or an Amazon® Alexa® or Echo® device, a smartphone or laptop computer with a voice interface application for receiving and broadcasting information in voice format, a set-top box or television running a media guide program or other content display program for a user, or a server executing a content display application for generating content for display to a user. In some embodiments, computing devices may work in conjunction such as devices depicted in
[0047] In scenario 100, first user 110 and second user 120 are attempting to query device 101. For example, each of first user 110 and second user 120 may be making a request for a virtual assistant interface of device 101, and each user may be in the same room/area or not. In some embodiments, first user 110 and second user 120 may each be considered a user of device 101, e.g., making queries and requests to device 101 regularly and each have a voice profile with device 101. In some embodiments, both first user 110 and second user 120 may be using device 101 for the first time.
[0048] Device 101 captures each request from first user 110 and second user 120. One or more of wake word 112, request 114, interrupting request 122, and request 116 may be captured as an input stream, e.g., to be processed by a virtual assistant. In some embodiments, device 101 automatically converts audio/voice to text for each portion of the input stream, e.g., using automated speech recognition (ASR). In some embodiments, device 101 transmits audio files to a server to convert audio/voice to text for each request. For instance, first user 110 may speak wake word 112 (“Hey Assistant, . . . ”) to activate the virtual assistant on device 101. First user 110 may begin request 114, saying, “Play . . . ” before being interrupted with interrupting request 122 from second user 120. For instance, interrupting request 122 may include a request for a song that is unpopular or inappropriate for the situation, e.g., saying, “C'mon, play ‘Free Bird’ by Skynyrd!” First user 110 may follow request 114, e.g., after a brief pause, perhaps due to an interruption, with request 116, requesting to play “‘Celebration’ by Kool & The Gang.”
[0049] In some embodiments, device 101 may determine to which request to respond and/or act. For instance, first user 110 requests to play “Celebration” but second user 120 requests to play “Free Bird.” Deciding which request to honor may depend on determining which user initiated the first virtual assistant request. In scenario 100, first user 110 initiated the request with wake word 112 and started request 114. In scenario 100, second user 120 interrupts first user 110 with interrupting request 122. The virtual assistant of device 101 in scenario 100 must determine whether all requests, e.g., in the voice input stream, came from one person and/or whether to discard one or more of the captured requests as interruptions.
[0050] In order to correctly process the right request from an input stream and ignore an interruption, there are a few steps a virtual assistant may perform. For instance, in scenario 100, the virtual assistant of device 101 may identify that the voice input(s) by first user 110 and second user 120 are not from the same source. In some embodiments, device 101 may discard statements in the input stream made by anyone other than the user who initiated the request, e.g., first user 110.
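The discard behavior described above can be illustrated with a minimal sketch. It assumes that some upstream voice-identification step has already labeled each portion of the input stream with a speaker; the `Utterance` type, the `speaker_id` labels, and the function name are hypothetical and are not part of the disclosure.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker_id: str  # hypothetical label assigned by an upstream voice-identification step
    text: str        # transcript of this portion of the input stream, e.g., from ASR


def keep_initiator_statements(stream: list[Utterance]) -> str:
    """Join only the statements made by whoever spoke first in the stream."""
    if not stream:
        return ""
    initiator = stream[0].speaker_id  # assumption: the first speaker initiated the request
    kept = [u.text for u in stream if u.speaker_id == initiator]
    return " ".join(kept)


# Scenario 100, with the wake word already stripped (an assumption for brevity).
stream = [
    Utterance("user_110", "Play"),
    Utterance("user_120", "C'mon, play Free Bird by Skynyrd!"),
    Utterance("user_110", "Celebration by Kool & The Gang"),
]
print(keep_initiator_statements(stream))
```

In this sketch the interrupting statement is simply dropped; an embodiment that sets the statement aside for later use would retain it in a separate list instead.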
[0051] In scenario 100, device 101 makes listen decision 124, e.g., to set aside interrupting request 122. Listen decision 124 depicts a determination to ignore interrupting request 122 and/or statements from second user 120. In scenario 100, device 101 issues virtual assistant response 126, saying, “OK. Now playing “Celebration” by Kool & The Gang,” and begins to play the song, also demonstrating that interrupting request 122 is set aside and/or ignored. In some embodiments, device 101 may set aside statements made by second user 120 and/or determine if interrupting request 122 may offer supplemental information.
[0052]
[0053] In scenario 150, first user 160 and second user 170 are providing voice input to device 101. For example, each of first user 160 and second user 170 may be making a request for a virtual assistant interface of device 101, and each user may be in the same room/area or not. In some embodiments, first user 160 and/or second user 170 may each be considered a user of device 101, e.g., making queries and requests to device 101 regularly. In some embodiments, both first user 160 and second user 170 may be using device 101 for the first time.
[0054] Device 101 captures each request from first user 160 and second user 170. One or more of wake word 162, request 164, and supplemental request 172 may be captured as an input stream, e.g., to be processed by a virtual assistant. In some embodiments, device 101 automatically converts audio/voice to text for each portion of the input stream, e.g., using ASR. In some embodiments, device 101 transmits audio files to a server to convert audio/voice to text for each request. For instance, first user 160 may speak wake word 162 (“Hey Assistant, . . . ”) to activate the virtual assistant on device 101. First user 160 may begin request 164, saying, “Play “Jump” by . . . ” before forgetting which version of the song titled “Jump” is correct. For instance, there are at least three popular songs with the title “Jump,” including a pop song by the Pointer Sisters, a hip hop song by Kris Kross, and a rock song by Van Halen. In scenario 150, second user 170 offers a supplemental request 172, saying, “ . . . it's by Van Halen.” First user 160 does not say anything else in this scenario. In some embodiments, first user 160 may offer confirmation, e.g., by repeating “Van Halen” or saying, “Yes.” In some embodiments, first user 160 may deny supplemental request 172 by disagreeing, canceling, or offering additional voice input for the query.
[0055] In some embodiments, device 101 may determine to which request to respond and/or act. For instance, first user 160 requests to play “Jump” and second user 170 supplements the artist “Van Halen.” Deciding whether to incorporate supplemental request 172 in processing request 164 may depend on determining which user initiated the first virtual assistant request. In scenario 150, first user 160 initiated the request with wake word 162 and started request 164. In scenario 150, second user 170 supplements first user 160 with supplemental request 172. The virtual assistant of device 101 in scenario 150 must determine whether all requests, e.g., in the voice input stream, came from one person and/or whether to use a statement as a supplement (or, e.g., discard one or more of the captured requests as an interruption, as depicted in
[0056] In order to correctly process the right request from an input stream and determine whether to incorporate a potential supplement, there are a few steps a virtual assistant may perform. For instance, in scenario 150, the virtual assistant of device 101 may identify that the voice input(s) by first user 160 and second user 170 are not from the same source. In some embodiments, device 101 may discard statements in the input stream made by anyone other than the user who initiated the request, e.g., first user 160.
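One way to picture the determination that two voice inputs are not from the same source is the trait comparison recited in claim 5: compare traits, derive a voice match score, and test it against a match threshold. The sketch below is illustrative only. Representing traits as numeric vectors, scoring with cosine similarity, and the threshold value of 0.8 are all assumptions and are not specified in the disclosure.

```python
import math


def voice_match_score(traits_a: list[float], traits_b: list[float]) -> float:
    """Cosine similarity between two trait vectors (an assumed scoring choice)."""
    if len(traits_a) != len(traits_b):
        raise ValueError("trait vectors must have the same length")
    dot = sum(a * b for a, b in zip(traits_a, traits_b))
    norm_a = math.sqrt(sum(a * a for a in traits_a))
    norm_b = math.sqrt(sum(b * b for b in traits_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # no usable traits, so treat the voices as not matching
    return dot / (norm_a * norm_b)


def voices_match(traits_a: list[float], traits_b: list[float],
                 match_threshold: float = 0.8) -> bool:
    """Report a match only when the score meets or exceeds the threshold."""
    return voice_match_score(traits_a, traits_b) >= match_threshold


# Hypothetical trait vectors for first user 160 and second user 170.
first_voice = [1.0, 0.0, 0.5]
second_voice = [0.0, 1.0, 0.2]
print(voices_match(first_voice, second_voice))
```

A trained machine learning model, as in claim 6, could replace `voice_match_score` without changing the thresholding step.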
[0057] In scenario 150, device 101 makes listen decision 174, e.g., to accept supplemental request 172. Listen decision 174 depicts a determination to listen to supplemental request 172 from second user 170. In scenario 150, device 101 issues virtual assistant response 176, saying, “OK. Now playing “Jump” by Van Halen,” and begins to play back the song, also demonstrating that supplemental request 172 was incorporated. In some embodiments, device 101 may set aside statements made by second user 170 prior to determining whether supplemental request 172 may offer valuable supplemental information.
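The decision of whether a set-aside statement is a useful supplement can be sketched along the lines of claim 4: score the results of the voice query alone, score the results of the voice query combined with a portion of the secondary query, and provide the second results when their score meets or exceeds the first. Everything concrete below is an assumption made for illustration: the toy `CATALOG`, the word-matching `search`, and a `relevance` score that treats fewer, less ambiguous results as more relevant.

```python
# Toy catalog standing in for a real content search backend (hypothetical).
CATALOG = [
    "Jump by Van Halen",
    "Jump by Kris Kross",
    "Jump by The Pointer Sisters",
    "Celebration by Kool & The Gang",
]


def search(query: str) -> list[str]:
    """Return catalog entries that contain every word of the query."""
    words = query.lower().split()
    return [t for t in CATALOG if all(w in t.lower().split() for w in words)]


def relevance(results: list[str]) -> float:
    """Assumed score: a single unambiguous hit scores 1.0, no hits score 0.0."""
    return 1.0 / len(results) if results else 0.0


def resolve(voice_query: str, secondary_query: str) -> list[str]:
    """Provide second results only if they score at least as well as the first."""
    first_results = search(voice_query)
    second_results = search(f"{voice_query} {secondary_query}")
    if relevance(second_results) >= relevance(first_results):
        return second_results  # secondary query acted as a supplement
    return first_results       # secondary query treated as an interruption


print(resolve("Jump", "Van Halen"))         # supplement narrows three hits to one
print(resolve("Celebration", "Free Bird"))  # interruption is ignored
```

The sketch assumes the relevant portion of the secondary query (e.g., “Van Halen” from “it's by Van Halen”) has already been extracted.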
[0058]
[0059] In scenario 175, first user 180 and second user 190 are providing voice input to device 101. For example, each of first user 180 and second user 190 may be making a request for a virtual assistant interface of device 101, and each user may be in the same room/area or not.
[0060] Device 101 captures each request from first user 180 and second user 190. One or more of wake word 182, request 184, and supplemental request 192 may be captured as an input stream, e.g., to be processed by a virtual assistant. In some embodiments, device 101 automatically converts audio/voice to text for each portion of the input stream, e.g., using ASR. In some embodiments, device 101 transmits audio files to a server to convert audio/voice to text for each request. For instance, first user 180 may speak wake word 182 (“Hey Assistant, . . . ”) to activate the virtual assistant on device 101. First user 180 may begin request 184, saying, “What's the weather look like this weekend in Ocean City?” before identifying which Ocean City. For instance, there are at least five states in the United States of America with cities or towns named “Ocean City,” including Maryland, New Jersey, North Carolina, Florida, and Washington. In scenario 175, second user 190 offers a supplemental request 192, saying, “ . . . New Jersey.” First user 180 does not say anything else in this scenario. In some embodiments, first user 180 may offer confirmation, e.g., by repeating “New Jersey” or saying, “Yes.” In some other scenarios, first user 180 may deny supplemental request 192 by disagreeing, canceling, or offering additional voice input for the query, e.g., “No, the one in Maryland,” but does not.
[0061] In some embodiments, device 101 may determine to which request to respond and/or act. For instance, first user 180 requests the weather in “Ocean City” and second user 190 supplements with the state “New Jersey.” Deciding whether to incorporate supplemental request 192 in processing request 184 may depend on determining which user initiated the first virtual assistant request. In scenario 175, first user 180 initiated the request with wake word 182 and started request 184. In scenario 175, second user 190 supplements first user 180 with supplemental request 192. The virtual assistant of device 101 in scenario 175 must determine whether all requests, e.g., in the voice input stream, came from one person and/or whether to use a statement as a supplement (or, e.g., discard one or more of the captured requests as an interruption, like in
[0062] In order to correctly process the right request from an input stream and determine whether to incorporate a potential supplement, there are a few steps a virtual assistant may perform. For instance, in scenario 175, the virtual assistant of device 101 may identify that the voice input(s) by first user 180 and second user 190 are not from the same source. In some embodiments, device 101 may discard statements in the input stream made by anyone other than the user who initiated the request, e.g., first user 180.
[0063] In scenario 175, device 101 makes listen decision 194, e.g., to accept supplemental request 192. Listen decision 194 depicts a determination to listen to supplemental request 192 from second user 190. In scenario 175, device 101 issues virtual assistant response 196, saying, “The weather in Ocean City, N.J. looks clear this weekend, with a high of 71° and a low of 55° at night,” demonstrating that supplemental request 192 was incorporated. In some embodiments, device 101 may set aside statements made by second user 190 prior to determining whether supplemental request 192 may offer valuable supplemental information.
[0064]
[0065] The computing device 200, e.g., device 101, may be any device capable of acting as a voice interface system such as by running one or more application programs implementing voice-based communication with a user, and engaging in electronic communication with server 230. For example, computing device 200 may be a voice assistant, smart home assistant, digital TV, laptop computer, smartphone, tablet computer, or the like.
[0066] Control circuitry 304 may be based on any suitable processing circuitry such as processing circuitry 306. As referred to herein, processing circuitry should be understood to mean circuitry based on one or more microprocessors, microcontrollers, digital signal processors, programmable logic devices, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), etc., and may include a multi-core processor (e.g., dual-core, quad-core, hexa-core, or any suitable number of cores). In some embodiments, processing circuitry may be distributed across multiple separate processors or processing units, for example, multiple of the same type of processing units (e.g., two Intel Core i7 processors) or multiple different processors (e.g., an Intel Core i5 processor and an Intel Core i7 processor). In some embodiments, control circuitry 304 executes instructions for receiving streamed content and executing its display, such as executing application programs that provide interfaces for content providers to stream and display content on display 312.
[0067] Control circuitry 304 may thus include communications circuitry suitable for communicating with a content provider 140 server or other networks or servers. Communications circuitry may include a cable modem, an integrated services digital network (ISDN) modem, a digital subscriber line (DSL) modem, a telephone modem, Ethernet card, or a wireless modem for communications with other equipment, or any other suitable communications circuitry. Such communications may involve the Internet or any other suitable communications networks or paths. In addition, communications circuitry may include circuitry that enables peer-to-peer communication of user equipment devices, or communication of user equipment devices in locations remote from each other.
[0068] Memory may be an electronic storage device provided as storage 308 that is part of control circuitry 304. As referred to herein, the phrase “electronic storage device” or “storage device” should be understood to mean any device for storing electronic data, computer software, or firmware, such as random-access memory, read-only memory, hard drives, optical drives, digital video disc (DVD) recorders, compact disc (CD) recorders, BLU-RAY disc (BD) recorders, BLU-RAY 3D disc recorders, digital video recorders (DVR, sometimes called a personal video recorder, or PVR), solid state devices, quantum storage devices, gaming consoles, gaming media, or any other suitable fixed or removable storage devices, and/or any combination of the same. Storage 308 may be used to store various types of content described herein as well as media guidance data described above. Nonvolatile memory may also be used (e.g., to launch a boot-up routine and other instructions). Cloud-based storage may be used to supplement storage 308 or instead of storage 308.
[0069] Storage 308 may also store instructions or code for an operating system and any number of application programs to be executed by the operating system. In operation, processing circuitry 306 retrieves and executes the instructions stored in storage 308, to run both the operating system and any application programs started by the user. The application programs can include one or more voice interface applications for implementing voice communication with a user, and/or content display applications which implement an interface allowing users to select and display content on display 312 or another display.
[0070] Control circuitry 304 may include video generating circuitry and tuning circuitry, such as one or more analog tuners, one or more MPEG-2 decoders or other digital decoding circuitry, high-definition tuners, or any other suitable tuning or video circuits or combinations of such circuits. Encoding circuitry (e.g., for converting over-the-air, analog, or digital signals to MPEG signals for storage) may also be included. Control circuitry 304 may also include scaler circuitry for upconverting and downconverting content into the preferred output format of the user equipment 300. Circuitry 304 may also include digital-to-analog converter circuitry and analog-to-digital converter circuitry for converting between digital and analog signals. The tuning and encoding circuitry may be used by the user equipment device to receive and to display, to play, or to record content. The tuning and encoding circuitry may also be used to receive guidance data. The circuitry described herein, including for example, the tuning, video generating, encoding, decoding, encrypting, decrypting, scaler, and analog/digital circuitry, may be implemented using software running on one or more general purpose or specialized processors. Multiple tuners may be provided to handle simultaneous tuning functions (e.g., watch and record functions, picture-in-picture (PIP) functions, multiple-tuner recording, etc.). If storage 308 is provided as a separate device from user equipment 300, the tuning and encoding circuitry (including multiple tuners) may be associated with storage 308.
[0071] A user may send instructions to control circuitry 304 using user input interface 310. User input interface 310 may be any suitable user interface, such as a remote control, mouse, trackball, keypad, keyboard, touch screen, touchpad, stylus input, joystick, voice recognition interface, or other user input interfaces. Display 312 may be provided as a stand-alone device or integrated with other elements of user equipment device 300. For example, display 312 may be a touchscreen or touch-sensitive display. In such circumstances, user input interface 310 may be integrated with or combined with display 312. Display 312 may be one or more of a monitor, a television, a liquid crystal display (LCD) for a mobile device, amorphous silicon display, low temperature poly silicon display, electronic ink display, electrophoretic display, active matrix display, electro-wetting display, electrofluidic display, cathode ray tube display, light-emitting diode display, electroluminescent display, plasma display panel, high-performance addressing display, thin-film transistor display, organic light-emitting diode display, surface-conduction electron-emitter display (SED), laser television, carbon nanotubes, quantum dot display, interferometric modulator display, or any other suitable equipment for displaying visual images. In some embodiments, display 312 may be HDTV-capable. In some embodiments, display 312 may be a 3D display, and the interactive media guidance application and any suitable content may be displayed in 3D. A video card or graphics card may generate the output to the display 312. The video card may offer various functions such as accelerated rendering of 3D scenes and 2D graphics, MPEG-2/MPEG-4 decoding, TV output, or the ability to connect multiple monitors. The video card may be any processing circuitry described above in relation to control circuitry 304. The video card may be integrated with the control circuitry 304. 
Speakers 314 may be provided as integrated with other elements of user equipment device 300 or may be stand-alone units. The audio component of videos and other content displayed on display 312 may be played through speakers 314. In some embodiments, the audio may be distributed to a receiver (not shown), which processes and outputs the audio via speakers 314.
[0072]
[0073] Storage 410 is a memory that stores a number of programs for execution by processing circuitry 408. In particular, storage 410 may store a number of device interfaces 412, an ASR interface 414, voice engine 416 for processing voice inputs via device 200 and selecting voice profiles therefrom, and storage 418. The device interfaces 412 are interface programs for handling the exchange of commands and data with the various devices 200. ASR interface 414 is an interface program for handling the exchange of commands with and transmission of voice inputs to various ASR servers 220. A separate interface 414 may exist for each different ASR server 220 that has its own format for commands or content. Voice engine 416 includes code for executing all of the above-described functions for processing voice inputs, identifying and/or differentiating voice inputs, determining interruptions, determining supplemental information, and sending one or more portions of a voice input to ASR interface 414 for transmission to ASR server 220. Storage 418 is memory available for any application and is available for storage of terms or other data retrieved from device 200, such as voice profiles, or the like.
[0074] The device 400 may be any electronic device capable of electronic communication with other devices and accepting voice inputs. For example, the device 400 may be a server, or a networked in-home smart device connected to a home modem and thereby to various devices 200. The device 400 may alternatively be a laptop computer or desktop computer configured as above.
[0075] ASR server 220 may be any server configured to run an ASR application program and may be configured similar to server 400 of
[0076]
[0077] Profile data structure 500 comprises multiple profiles such as profiles 510, 520, 530, 540, 550, 560, and 570. Voice identification (ID) numbers in profile data structure 500 may be populated with ID numbers. Each profile of profile data structure 500 has fields, such as fields 562-568. For instance, profile 560 has a voice ID 562 of “VOICE ID N,” language 564 as “en-US” for U.S.-based English, demographic 565 as “adult female,” voice fingerprint 566 of “voice fingerprint N,” and timestamp 568 of “2021-06-29 2:47 PM.” Timestamp 568 is the most recent of the timestamps while timestamp 518 is the oldest. In some embodiments, a timestamp indicates creation date. In some embodiments, a timestamp indicates the date and time of last use of the profile. In some embodiments, the profile database may be governed by an expiration time (e.g., three months, one year, etc.), and each profile may be deleted at a certain point after the corresponding timestamp if there is insufficient use. For instance, timestamp 518 of profile 510 indicates “2021-06-09 10:18 AM.” If profile data structure 500 has an expiration timer of, e.g., six months, then profile 510 would be deleted on Dec. 9, 2021, if there is no additional use.
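The profile fields and the expiration rule can be sketched as a small record type with a calendar-month expiry check. The class name, field names, and helper functions below are hypothetical, and the field values given for profile 510 other than its timestamp are placeholders, since the description states only that timestamp.

```python
import calendar
from dataclasses import dataclass
from datetime import datetime


@dataclass
class VoiceProfile:
    voice_id: str
    language: str
    demographic: str
    voice_fingerprint: str
    timestamp: datetime  # creation or last-use time, depending on the embodiment


def add_months(moment: datetime, months: int) -> datetime:
    """Shift a datetime forward by whole calendar months, clamping the day."""
    month_index = moment.month - 1 + months
    year = moment.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return moment.replace(year=year, month=month, day=min(moment.day, last_day))


def is_expired(profile: VoiceProfile, now: datetime, expiration_months: int = 6) -> bool:
    """A profile expires once the expiration period has elapsed without further use."""
    return now >= add_months(profile.timestamp, expiration_months)


# Profile 560 as described; profile 510 uses placeholder values except its timestamp.
profile_560 = VoiceProfile("VOICE ID N", "en-US", "adult female",
                           "voice fingerprint N", datetime(2021, 6, 29, 14, 47))
profile_510 = VoiceProfile("placeholder id", "en-US", "placeholder",
                           "placeholder fingerprint", datetime(2021, 6, 9, 10, 18))

print(add_months(profile_510.timestamp, 6).date())        # 2021-12-09
print(is_expired(profile_510, datetime(2021, 12, 10)))    # True
```

If the timestamp records last use rather than creation, updating `timestamp` on each successful match would restart the expiration period.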
[0078]
[0079] Some embodiments may utilize a voice engine to perform one or more parts of process 600, e.g., as part of an ASR platform or interactive virtual assistant application, stored and executed by one or more of the processors and memory of a device and/or server such as those depicted in
[0080] At step 602, a voice engine receives a first voice input as an input query, e.g., for a voice query to be processed. For instance, the voice engine (e.g., in conjunction with an ASR engine) may determine text and/or keywords based on a first voice input as the input query. A voice engine may capture an input stream that comprises multiple inputs, e.g., from one or more voices. In some embodiments, portions of an input stream may be processed as separate inputs. In some embodiments, a virtual assistant may receive a wake word and a query as a first voice input, e.g., as part of a captured input stream, to be set as the input query. In scenario 100 of
[0081] At step 604, the voice engine identifies a first profile for the first voice input. For example, the user who initiates the virtual assistant may be identified and/or assigned a profile. The first voice to issue voice input may be identified as the primary voice input (e.g., first voice profile) for the query. In some embodiments, interrupting voices may be assigned as “interrupters,” “supplemental,” and/or secondary voices. In scenario 100 of
[0082] At step 608, the voice engine receives a second voice input, e.g., as part of the input stream. For instance, a second voice command or query may be provided to a virtual assistant. In some embodiments, a second voice input may be provided by a different user from the one who provided the first voice input, e.g., a person who interrupts and/or provides supplemental comments. For instance, the second voice input may detrimentally interrupt the voice query or may positively supplement the initial query. In some cases, the second voice input may be an interruption and not helpful with the first query. For instance,
[0083] At step 610, the voice engine determines whether the second voice input matches the identified profile. In some embodiments, a voice profile may be assigned to the second voice input, e.g., following step 604. In some embodiments, the second voice input may be compared with the first voice input to determine if the same user provided both inputs.
[0084] If, at step 610, the voice engine determines the second voice input matches the identified first profile then, at step 612, the voice engine combines the second voice input with the input query (e.g., the first voice input). For instance, there might be a slight pause between two utterances by a first user that were intended to be one statement or query submitted to a voice assistant. In
[0085] If, at step 610, the voice engine determines the second voice input and the identified first profile are not a match then, at step 614, the voice engine sets aside the second voice input from the input stream. In some embodiments the second voice input may be set aside and used as a supplemental query term if, e.g., the results for the input query fail. In some embodiments, the second voice input may be set aside and used as a supplemental query term if, e.g., the results for the input query are ambiguous, too numerous, or otherwise improper. In some embodiments, the second voice input may be discarded completely.
[0086] At step 616, the voice engine receives a third voice input. For instance, the third voice input may interrupt the voice query or may supplement the query. In some cases, the third voice input may be provided by the same user as a prior input, e.g., following a brief pause after the first voice input or the second voice input. For instance, in
[0087] At step 620, the voice engine determines whether the third voice input matches the identified first profile. In some embodiments, the third voice input may be compared with the first voice input to determine if the same user provided both inputs.
[0088] If the voice engine determines the third voice input matches the identified first profile, then, at step 622, the voice engine combines the third voice input with the input query (e.g., the first voice input). For instance, there might be a slight pause (or interruption) between two utterances by a first user that were intended to be one statement or query submitted to a voice assistant. For instance, in
[0089] If the voice engine determines the third voice input does not match the identified first profile then, at step 624, the voice engine sets aside the third voice input. In some embodiments, the third voice input may be set aside and used as a supplemental query term if, e.g., the results for the input query fail or are ambiguous, too numerous, or otherwise improper. In some embodiments, the third voice input may be discarded.
[0090] At step 626, the voice engine transmits the input query for processing and response. For instance, the virtual assistant may process the input query and provide one or more results for the input query. In some embodiments, the input query may incorporate one or more parts of the voice input stream, e.g., as an audio file and/or as processed by ASR/NLP. In some instances, the input query may comprise one or more of, e.g., the wake word, the first voice input, the second voice input, and the third voice input. In some embodiments, a wake word will be removed and/or ignored. In some instances, the input query may comprise the first voice input and supplemental input from one or more of, e.g., the second voice input, and the third voice input.
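By way of a minimal sketch, the flow of process 600 may be expressed as a single pass over the input stream. The names `VoiceInput` and `build_input_query` are hypothetical, and the sketch assumes that speaker identification and speech-to-text have already been performed upstream, so that each input carries a profile identifier and recognized text. The utterances in the usage lines are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class VoiceInput:
    profile_id: str  # profile assigned by voice identification (assumed upstream)
    text: str        # text produced by ASR (assumed upstream)


def build_input_query(stream: List[VoiceInput]) -> Tuple[str, List[VoiceInput]]:
    """Combine inputs from the first voice; set aside inputs from other voices."""
    if not stream:
        return "", []
    first_profile = stream[0].profile_id             # step 604
    query_parts = [stream[0].text]                   # step 602
    set_aside: List[VoiceInput] = []
    for voice_input in stream[1:]:                   # steps 608 and 616
        if voice_input.profile_id == first_profile:  # steps 610 and 620
            query_parts.append(voice_input.text)     # steps 612 and 622
        else:
            set_aside.append(voice_input)            # steps 614 and 624
    return " ".join(query_parts), set_aside          # query transmitted at step 626


# Hypothetical stream: speaker A pauses, speaker B interjects, speaker A resumes.
stream = [
    VoiceInput("A", "play the latest episode"),
    VoiceInput("B", "did you feed the dog"),
    VoiceInput("A", "of the cooking show"),
]
query, set_aside = build_input_query(stream)
```

The inputs that were set aside are returned rather than dropped so that an embodiment may later use them as supplemental query terms or discard them.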
[0091]
[0092] At step 702, a voice engine receives a first voice input as an input query, e.g., for a voice query to be processed. For instance, the voice engine (e.g., in conjunction with an ASR engine) may determine text and/or keywords based on a first voice input as the input query. A voice engine may capture an input stream that comprises multiple inputs, e.g., from one or more voices. In some embodiments, portions of an input stream may be processed as separate inputs. In some embodiments, a virtual assistant may receive a wake word and a query as a first voice input, e.g., as part of a captured input stream, to be set as the input query. In scenario 100 of
[0093] At step 704, the voice engine identifies a first profile for the first voice input. For example, the user who initiates the virtual assistant may be identified and/or assigned a profile. The first voice to issue voice input may be identified as the primary voice input (e.g., first voice profile) for the query. In some embodiments, interrupting voices may be assigned as “interrupters,” “supplemental,” and/or other secondary voices. In scenario 100 of
[0094] At step 708, the voice engine receives a second voice input, e.g., as part of the input stream. For instance, a second voice command or query may be provided to a virtual assistant. In some embodiments, a second voice input may be provided by a different user from the one who provided the first voice input, e.g., a person who interrupts and/or provides supplemental comments. For instance,
[0095] At step 710, the voice engine determines whether the second voice input matches the identified profile. In some embodiments, a voice profile may be assigned to the second voice input, e.g., following step 704. In some embodiments, the second voice input may be compared with the first voice input to determine if the same user provided both inputs.
[0096] If, at step 710, the voice engine determines the second voice input matches the identified first profile then, at step 712, the voice engine combines the second voice input with the input query (e.g., the first voice input). For instance, there might be a slight pause between two utterances by a first user that were intended to be one statement or query submitted to a voice assistant. In some embodiments, two voice inputs may already be combined, e.g., as part of the same input stream and an interruption may be removed. In
[0097] If, at step 710, the voice engine determines the second voice input and the identified first profile are not a match then, at step 720, the voice engine determines whether the second voice input adds supplemental information to the input query. For instance, the voice engine (e.g., in conjunction with an ASR engine) may determine whether the text of the second voice input is related to the text of the input query. In some embodiments, a second voice input may be supplemental if it filters and/or refines initial search results. In some embodiments, a machine learning model may be trained to determine similarity and/or whether two voice inputs may be considered related or supplemental to one another. In some embodiments, the voice engine may determine whether the results for the query from the first voice input fail and/or are too ambiguous, too numerous, or otherwise improper prior to evaluating whether the second voice input would improve the input query and thus appropriately add supplemental information to the initial query.
[0098] If, at step 720, the voice engine determines the second voice input adds supplemental information to the input query then, at step 712, the voice engine combines the second voice input with the input query (e.g., the first voice input). In some embodiments, two voice inputs may already be combined, e.g., as part of the same input stream, and an interruption may be removed. For instance, a query and a supplement may be a part of the same input stream and the supplement may remain as part of the input stream to be processed (while any interruptions or non-relevant input may be removed).
[0099] If, at step 720, the voice engine determines the second voice input does not add supplemental information to the input query then, at step 724, the voice engine sets aside the second voice input. For instance, the second voice input may be marked as an interrupter or unrelated comment and the initial query may be used without supplement. In some embodiments, the second voice input may be removed from the voice input stream and not processed with the first input. In some embodiments, the second voice input may be set aside and only used as a supplemental query term if, e.g., the results for the input query are exceedingly poor, e.g., below a very low threshold (e.g., 10-20% match). For instance, the number of search results may be very high (e.g., hundreds or thousands) and/or the results may be even more ambiguous, numerous, or otherwise improper. In some cases, the search results might fail. In some embodiments, the second voice input may be recorded, e.g., for voice training, model training, profiling, etc., even though it is set aside.
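The decision at steps 710, 720, 712, and 724 may be sketched as follows. This sketch adopts only one of the approaches mentioned above, namely treating a second voice input as supplemental when it would filter the initial results. The keyword-overlap test, the stop-word list, and the function names are assumptions for illustration; a trained similarity model could be substituted. The result titles are hypothetical.

```python
from typing import List, Set

# Small illustrative stop-word list (an assumption, not an exhaustive set).
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "on", "for", "is", "it", "you", "did"}


def keywords(text: str) -> Set[str]:
    """Lower-cased words of `text` with stop words removed."""
    return {word for word in text.lower().split() if word not in STOP_WORDS}


def refines(results: List[str], second_text: str) -> bool:
    """True if some, but not all, results share a keyword with the second input."""
    terms = keywords(second_text)
    hits = [result for result in results if terms & keywords(result)]
    return 0 < len(hits) < len(results)


def handle_second_input(query: str, second_text: str, same_speaker: bool,
                        results: List[str]) -> str:
    """Return the input query to transmit after considering the second input."""
    if same_speaker:                   # step 710 -> step 712
        return f"{query} {second_text}"
    if refines(results, second_text):  # step 720 -> step 712
        return f"{query} {second_text}"
    return query                       # step 724: second input set aside


initial_results = ["Giants baseball score tonight", "Giants football score Sunday"]
with_supplement = handle_second_input("giants score", "the baseball team", False, initial_results)
with_interruption = handle_second_input("giants score", "did you feed the dog", False, initial_results)
```

Here the supplement narrows two ambiguous results to one and is combined with the query, whereas the unrelated interruption matches no result and is set aside.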
[0100] At step 726, the voice engine transmits the input query for processing and response. For instance, the virtual assistant may process the input query, determine one or more keywords and/or text from the input query for search, and provide search results based on the input query. In some embodiments, the input query may incorporate one or more parts of the voice input stream, e.g., as an audio file and/or as processed by ASR/NLP. In some instances, the input query may comprise one or more of, e.g., the wake word, the first voice input, the second voice input, and the third voice input. In some embodiments, a wake word will be removed and/or ignored. In some instances, the input query may comprise the first voice input and supplemental input from one or more of, e.g., the second voice input, and the third voice input. For instance,
[0101]
[0102] At step 752, a voice engine receives a first voice input. For instance, a first voice command or query is provided to a virtual assistant, e.g., to be processed. A voice engine may capture an input stream that comprises multiple inputs, e.g., from one or more voices. In some embodiments, portions of an input stream may be processed as separate inputs, e.g., a first voice input and a second voice input.
[0103] At step 754, the voice engine generates a first query from the first voice input. For instance, the voice engine (e.g., in conjunction with an ASR engine) may determine text and/or keywords based on the first voice input as the first query. In some embodiments, a virtual assistant may receive a wake word and a command/query as a first voice input to be set as the first query. In scenario 100 of
[0104] At step 756, the voice engine receives a second voice input. For instance, a second voice command or query may be provided to a virtual assistant. In some embodiments, a second voice input may be provided by a different user from the one who provided the first voice input, e.g., a person who interrupts and/or provides supplemental comments. For instance,
[0105] At step 758, the voice engine generates a supplement from the second voice input. For instance, the voice engine (e.g., in conjunction with an ASR engine) may determine text and/or keywords based on the second voice input as the “supplement.” A supplement may be generated when the second voice input interrupts and/or follows the first voice input. At this point, a supplement may comprise a detrimental interruption or a positive addition. Generally, in some embodiments, the supplement generated from the second voice input may be combined with the query, may be set aside and used when results for the initial query require more information, or may be discarded.
[0106] At step 760, the voice engine generates one or more search results for the first query. For instance, a virtual assistant may submit the first query to a search engine such as Google® or Bing® and receive search results for the submitted first query. In some embodiments, a virtual assistant may conduct its own search, via a network or the internet, and return search results.
[0107] At step 762, the voice engine generates a relevance score for the one or more search results. A relevancy score may be any type of determination of strength of the search results including, for instance, a score based on metrics of relevance to the query, relevance to other results, lack of ambiguity, number of irrelevant results, popularity, accessibility, redundancy in results, publish dates of results, links, interlinks, and other key search metrics. In some embodiments, a relevance score may be calculated for each result for the submitted query, e.g., as the search results are determined. For example, a search engine may rank each of the search results by a score for presentation, and a normalized score (e.g., 0-100) may be used as a relevance score for each result. In some embodiments, the normalized score of the top-ranked hit is the normalized relevance score for the set of search results. This may be helpful because many ASR platforms only return the top hit of the search results for a particular voice query. In some embodiments, a model may be trained to receive an input of search results and produce a relevance score.
[0108] In some embodiments, a weighted average of the top few (e.g., 3-5) results may be used to determine a relevance score for the set of search results for a particular query. In some embodiments, relevancy of the top few (e.g., 2-6) results with each other may be used to determine a relevance score for the set of search results for a particular query. For instance, if a search on “Giants score” produces results for both baseball and football, the lack of relevance among the search results indicates ambiguity (and a potential need for supplemental information). In some embodiments, higher relevance scores reflect a lack of ambiguity in the search results.
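Two of the set-level relevance scores described above may be sketched as follows, assuming each result already carries a normalized 0-100 score from the search engine. The function names and the default weights (3, 2, 1) are illustrative assumptions, not values taken from any embodiment.

```python
from typing import Sequence


def top_hit_score(scores: Sequence[float]) -> float:
    """Set-level relevance taken as the normalized score of the top-ranked hit."""
    return max(scores) if scores else 0.0


def weighted_top_k_score(scores: Sequence[float],
                         weights: Sequence[float] = (3, 2, 1)) -> float:
    """Weighted average of the top few normalized (0-100) result scores."""
    ranked = sorted(scores, reverse=True)[:len(weights)]
    if not ranked:
        return 0.0
    used = weights[:len(ranked)]  # fewer results than weights: use a prefix
    return sum(s * w for s, w in zip(ranked, used)) / sum(used)


scores = [60, 90, 30]
top = top_hit_score(scores)              # 90
weighted = weighted_top_k_score(scores)  # (90*3 + 60*2 + 30*1) / 6 = 70.0
```

The weighted average penalizes a result set whose quality drops off quickly after the first hit, which the top-hit score alone would not reveal.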
[0109] In some embodiments, the search query itself may be at least a portion of the basis for a relevance score of the results. For instance, known and popular commands and queries may each have a preset high score. For example, asking a virtual assistant for the time or weather at home may be assigned a high score triggering automatic dismissal of any interruptions or supplements as unnecessary, moving to step 766. However, in some embodiments, when a question requires dynamic details that could be considered ambiguous, e.g., time or weather in a different location, the search result relevance score may be ambiguous. For instance, in
[0110] At step 764, the voice engine determines whether the relevance score meets or exceeds a predetermined threshold. For instance, with a relevance score scale of, e.g., 0-100, a threshold of 75 may indicate whether the search results are good enough and/or sufficiently free of ambiguity. In some embodiments, with a relevance score scale of, e.g., low, medium, or high, a threshold of medium may indicate whether the search results are sufficiently relevant and/or clear of ambiguity.
[0111] If the relevance score meets or exceeds the predetermined threshold then, at step 766, the voice engine provides the search result(s). For example, with a relevance score scale of, e.g., 0-100, and a threshold of 65, a relevance score of 80 would surpass the threshold. In some embodiments, one or more of the search results are passed to the virtual assistant for delivery. For instance, the top result may be read aloud by the virtual assistant. In some embodiments, one or more of the search results may be provided via an interface for the virtual assistant and/or another connected device. In some embodiments, an answer to the query may be taken as a part of one or more of the search results. In scenario 100 of
[0112] If the relevance score is below the predetermined threshold, then, at step 768, the voice engine generates new search result(s) based on the first query and the supplement. For instance, with a relevance score scale of, e.g., 0-100, and a threshold of 70, a relevance score of 69 would fall short of the threshold, and new results using the query and the supplement would be generated. A new search, e.g., based on the first query and the supplement, may be conducted in various ways. In some embodiments, a search with the query and the supplement may be conducted and new results produced. For instance, one or more keywords may be taken from the supplement and combined with the initial query to produce a new set of search results. In some embodiments, the initial search results from a search based on the first query may be filtered or refined using, e.g., a portion of the supplement, so that a new set of results is produced (e.g., and the top result(s) output). For instance, one or more keywords may be taken from the supplement and used to filter the initial search results and produce new search results. In some embodiments, one or more keywords may be taken from the first query and combined with the supplement to produce new search results.
[0113] At step 769, the voice engine provides the new search result(s) based on the first query and the supplement. In some embodiments, one or more of the new search results are passed to the virtual assistant for delivery. For instance, the top result of the new search may be read aloud by the virtual assistant. In some embodiments, one or more of the new search results may be provided via an interface for the virtual assistant and/or another connected device. In some embodiments, an answer to the first query (and supplement) may be taken as a part of one or more of the new search results. In some embodiments, a new relevance score may be determined for the new search results and, e.g., the new search results may only be provided if the new relevance score is greater than the relevance score for the search results for the first query. In some embodiments, if the new relevance score is not greater than the relevance score for the first query results, an error and/or request to repeat may be issued.
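Steps 760 through 769 may be sketched as follows, under the assumption that the search and scoring functions are supplied by the caller. The names `answer_query`, `Search`, and `Score`, the returned labels, and the stub index used in the usage lines are all hypothetical. The sketch treats a score that meets or exceeds the threshold as sufficient and includes the optional check that new results are provided only if they score higher than the first results.

```python
from typing import Callable, List, Optional, Tuple

Search = Callable[[str], List[str]]   # query text -> search results
Score = Callable[[List[str]], float]  # search results -> relevance score (0-100)


def answer_query(query: str, supplement: Optional[str], search: Search,
                 score: Score, threshold: float = 75.0) -> Tuple[str, List[str]]:
    """Return a label ("first", "new", or "retry") and the results to provide."""
    first_results = search(query)                   # step 760
    first_score = score(first_results)              # step 762
    # Step 764; with no supplement available, the first results are all there is.
    if first_score >= threshold or not supplement:
        return "first", first_results               # step 766
    new_results = search(f"{query} {supplement}")   # step 768
    if score(new_results) > first_score:            # optional check at step 769
        return "new", new_results
    return "retry", []                              # error and/or request to repeat


# Stub search index and scorer, for illustration only.
fake_index = {
    "giants score": ["baseball result", "football result"],
    "giants score baseball": ["baseball result"],
}
fake_search = lambda q: fake_index.get(q, [])
fake_score = lambda results: 90.0 if len(results) == 1 else 40.0

outcome = answer_query("giants score", "baseball", fake_search, fake_score)
```

In this stub, the ambiguous first search scores 40, below the threshold of 75, so the supplement is applied and the single unambiguous result is provided.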
[0114]
[0115] At step 772, a voice engine receives a first voice input. For instance, a first voice command or query is provided to a virtual assistant, e.g., to be processed. A voice engine may capture an input stream that comprises multiple inputs, e.g., from one or more voices. In some embodiments, portions of an input stream may be processed as separate inputs, e.g., a first voice input and a second voice input.
[0116] At step 774, the voice engine generates a first query from the first voice input. For instance, the voice engine (e.g., in conjunction with an ASR engine) may determine text and/or keywords based on the first voice input as the first query. In scenario 100 of
[0117] At step 776, the voice engine generates one or more first search results for the first query. For instance, a virtual assistant may submit the first query to a search engine such as Google® or Bing® and receive a set of first search results for the submitted first query. In some embodiments, a virtual assistant may conduct its own search, via a network or the internet, and return the first search results.
[0118] At step 778, the voice engine generates a relevance score for the one or more search results. A relevancy score may be any type of determination of strength of the search results including, for instance, a score based on metrics of relevance to the query, relevance to other results, lack of ambiguity, number of irrelevant results, popularity, accessibility, redundancy in results, publish dates of results, links, interlinks, and other key search metrics. In some embodiments, a relevance score may be calculated for each result for the submitted query, e.g., as the search results are determined. For example, a search engine may rank each of the search results by a score for presentation, and a normalized score (e.g., 0-100) may be used as a relevance score for each result. In some embodiments, the normalized score of the top-ranked hit is the normalized relevance score for the set of search results. This may be helpful because many ASR platforms only return the top hit of the search results for a particular voice query. In some embodiments, a model may be trained to receive an input of search results and produce a relevance score. In some embodiments, a weighted average of the top few (e.g., 3-5) results may be used to determine a relevance score for the set of search results for a particular query. In some embodiments, relevancy of the top few (e.g., 2-6) results with each other may be used to determine a relevance score for the set of search results for a particular query.
[0119] At step 782, the voice engine receives a second voice input. For instance, a second voice command or query may be provided to a virtual assistant. In some embodiments, a second voice input may be provided by a different user from the one who provided the first voice input, e.g., a person who interrupts and/or provides supplemental comments. For instance,
[0120] At step 784, the voice engine generates a supplement from the second voice input. For instance, the voice engine (e.g., in conjunction with an ASR engine) may determine text and/or keywords based on the second voice input as the “supplement.” A supplement may be generated when the second voice input interrupts and/or follows the first voice input. At this point, a supplement may comprise a detrimental interruption or a positive addition. Generally, in some embodiments, the supplement generated from the second voice input may be combined with the query, may be set aside and used when results for the initial query require more information, or may be discarded.
[0121] At step 786, the voice engine generates one or more new search results for the first query and the supplement. A new search, e.g., based on the first query and the supplement, may be conducted in various ways. In some embodiments, a search with the query and the supplement may be conducted and new results produced. For instance, one or more keywords may be taken from the supplement and combined with the initial query to produce a new set of search results. In some embodiments, the initial search results from a search based on the first query may be filtered or refined using, e.g., a portion of the supplement so that a new set of results is produced (e.g., and the top result(s) output). For instance, one or more keywords may be taken from the supplement and used to filter the initial search results and produce new search results. In some embodiments, one or more keywords may be taken from the first query and combined with the supplement to produce new search results.
[0122] At step 790, the voice engine determines whether the first relevance score is greater than the second relevance score. For instance, with a relevance score scale of, e.g., 0-100, a first score of 67 may indicate the first search results are good, but a new relevance score of 73 may indicate that the new search result(s) are better. In some embodiments, with a relevance score scale of, e.g., low, medium, or high, a first score of high may indicate a better search than with a supplement/interruption with a relevance score of low. In some embodiments, if the new relevance score is not greater than the relevance score for the first query results by a certain percentage or threshold, an error and/or request to repeat the query or queries may be issued.
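The comparison at step 790 may be sketched as follows for both numeric and low/medium/high scales. The function names and the ordinal mapping are assumptions, and the handling of a tie (falling back to the first results) is a choice made for this sketch because the text does not specify one.

```python
from typing import List, Tuple, Union

ScoreValue = Union[float, str]
ORDINAL = {"low": 0, "medium": 1, "high": 2}  # assumed ordering of the coarse scale


def as_number(score: ScoreValue) -> float:
    """Map a numeric or low/medium/high score onto a comparable number."""
    return float(ORDINAL[score]) if isinstance(score, str) else float(score)


def select_results(first_results: List[str], first_score: ScoreValue,
                   new_results: List[str], new_score: ScoreValue) -> Tuple[str, List[str]]:
    """Pick whichever result set has the higher relevance score (step 790)."""
    if as_number(new_score) > as_number(first_score):
        return "new", new_results     # step 794
    return "first", first_results     # step 792 (ties fall back here by assumption)


better_with_supplement = select_results(["first"], 67, ["new"], 73)
better_without = select_results(["first"], "high", ["new"], "low")
```

An embodiment that requires the new score to exceed the first by a margin could add a third outcome that issues an error and/or a request to repeat.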
[0123] If the first relevance score is greater than the second relevance score then, at step 792, the voice engine provides the first search result(s). For example, with a relevance score scale of, e.g., 0-100, a first relevance score of 85 and a second relevance score of 65, the initial search results are probably more accurate than the results based on the supplement. In some embodiments, one or more of the first search results are passed to the virtual assistant for delivery. For instance, the top result may be read aloud by the virtual assistant. In some embodiments, one or more of the search results may be provided via an interface for the virtual assistant and/or another connected device. In some embodiments, an answer to the first query may be taken as a part of one or more of the first search results. In scenario 100 of
[0124] If, at step 790, the second relevance score is greater than the first relevance score then, at step 794, the voice engine provides the new search result(s) based on the first query and the supplement. In some embodiments, one or more of the new search results are passed to the virtual assistant for delivery. For instance, the top result of the new search may be read aloud by the virtual assistant or provided via an interface. In some embodiments, an answer to the first query (and supplement) may be taken as a part of one or more of the new search results. In scenario 150 of
[0125]
[0126] At step 802, a voice engine receives a voice input. For instance, a voice command or query is provided to a virtual assistant, e.g., to be processed. In some embodiments, an interruption or supplemental comment may be provided to a virtual assistant, e.g., to be profiled and/or matched to a profile. A voice engine may capture an input stream that comprises multiple inputs, e.g., from one or more voices. In some embodiments, portions of an input stream may be processed as separate inputs, e.g., a first voice input, etc.
[0127] At step 804, the voice engine generates a fingerprint—e.g., a “voiceprint,” a “voice fingerprint,” or a “voice template”—of the voice input. A voice fingerprint is a typical way to perform voice recognition. For instance, each voice may have a fingerprint. Voice fingerprints may be used, e.g., for identification, security, and other biometric applications. In some embodiments, a fingerprint may be a mathematical expression of a person's voice or vocal tract. A voice fingerprint may be developed from a few phrases. In some embodiments, an initial voice fingerprint may be developed based on an initial training session. In some embodiments, many voice fingerprints may be generated for a user which may be merged together, e.g., with an initial voice fingerprint, for higher accuracy. In some embodiments, a voice fingerprint may be stored as a hash value.
[0128] At step 808, the voice engine accesses voice profiles, e.g., in a database. For instance, the voice engine may access a database of voice profiles with each unique voice profile having a fingerprint. An exemplary voice database is depicted in
[0129] At step 810, the voice engine compares the fingerprint to profile fingerprints. For instance, with voice identification the voice fingerprint in question may be compared to each available voice fingerprint in the database to find a match, if it exists. In some embodiments, a new voice fingerprint may be correlated with each voice fingerprint in the database and a match score (e.g., 0-100 scale) may be produced based on the confidence of the match. Generally, if the match score is above a predetermined confidence threshold, a profile match is said to exist. In some embodiments, the voice database may be organized to expedite matching by, e.g., clustering similar voice fingerprints based on similar voice traits. In some embodiments, a machine learning model may be trained to receive a voice input and produce a match from a database of voice fingerprints. For instance, a training set of voices and profiles may be used to train, test, and retrain a model that predicts a voice identification for each provided new voice input.
[0130] At step 812, the voice engine determines whether the fingerprint matches any profile fingerprint, e.g., with a match score above a confidence threshold. For instance, if the match score between the fingerprint of a new voice input and a profile fingerprint is above a predetermined confidence threshold, a profile match is said to exist and a voice is identified. In some embodiments, the confidence threshold may be low (e.g., 55 on a scale of 0-100). For instance, sometimes the voice engine aims to quickly differentiate speakers and determine if an assumed interruption or supplemental comment comes from the same speaker or a new person. In such cases, quick, lower-confidence matching might be more efficient than, e.g., using a confidence threshold for a match required for digital security.
[0131] If, at step 812, the fingerprint matches a profile fingerprint (e.g., a match score that meets or exceeds the confidence threshold) then, at step 814, the voice engine provides the profile matching the voice input.
[0132] If, at step 812, the fingerprint does not match a profile fingerprint (e.g., no match scores above the confidence threshold) then, at step 816, the voice engine generates a new voice profile. In such cases, a new voice profile may be used to, e.g., differentiate voices that may be offering commands and queries from voices offering interruptions and/or supplemental information.
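Steps 808 through 816 may be sketched as follows, assuming for illustration that each voice fingerprint is a numeric feature vector and that the match score is the cosine similarity mapped onto a 0-100 scale. Neither assumption is required by the embodiments above, and the names `match_score` and `identify`, as well as the scheme for naming new profiles, are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def match_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two fingerprint vectors, mapped to a 0-100 scale."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 0.0
    return max(0.0, dot / norm) * 100.0  # negative similarity is treated as no match


def identify(fingerprint: Sequence[float], profiles: Dict[str, List[float]],
             threshold: float = 55.0) -> Tuple[str, bool]:
    """Return (profile ID, created_new) for the given fingerprint."""
    best_id, best_score = None, -1.0
    for profile_id, stored in profiles.items():       # step 810
        score = match_score(fingerprint, stored)
        if score > best_score:
            best_id, best_score = profile_id, score
    if best_id is not None and best_score >= threshold:  # step 812
        return best_id, False                             # step 814
    new_id = f"VOICE ID {len(profiles) + 1}"              # step 816 (naming is assumed)
    profiles[new_id] = list(fingerprint)
    return new_id, True


profiles = {"VOICE ID 1": [1.0, 0.0], "VOICE ID 2": [0.0, 1.0]}
known = identify([0.9, 0.1], profiles)      # close to VOICE ID 1
unknown = identify([-1.0, -1.0], profiles)  # matches nothing; new profile created
```

The default threshold of 55 mirrors the low-confidence example above; a security-oriented embodiment would likely use a much higher value.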
[0133]
[0134] At step 822, a voice engine receives a first voice input. For instance, a voice command or query may be provided to a virtual assistant, e.g., to be processed. A voice engine may capture an input stream that comprises multiple inputs, e.g., from one or more voices. In some embodiments, portions of an input stream may be processed as separate inputs, e.g., a first voice input and a second voice input. In some embodiments, a request, such as request 114 of
[0135] At step 824, the voice engine receives a second voice input. For instance, a second voice command or query may be provided to a virtual assistant. In some embodiments, an interruption or supplemental comment may be provided to a virtual assistant, e.g., to be profiled and/or matched to a profile. In some embodiments, a second voice input may be provided by a different user from the one who provided the first voice input, e.g., a person who interrupts and/or provides supplemental comments. For instance,
[0136] At step 830, the voice engine compares the first voice input with the second voice input for various traits, e.g., acoustic metrics. For instance, the voice engine may compare one or more acoustic traits such as pitch, tone, resonance, amplitude, loudness, etc. In some cases, the voice engine may compare loudness and/or amplitude to determine if the first voice input and the second voice input came from a similar distance from the microphone prior to analyzing other voice traits. Some embodiments may be able to differentiate voices quickly based on volume before looking at other traits like, e.g., pitch, timbre, echo, etc. In some embodiments, one or more traits may be measured and/or depicted mathematically (e.g., using a graphic equalizer) and compared. In some embodiments, a sound match score may be determined based on a comparison of one or more acoustic traits such as pitch, timbre, echo, etc.
[0137] At step 832, the voice engine determines whether the voice and/or acoustic traits of the first voice input match those of the second voice input, e.g., with a match score above a confidence threshold. In some embodiments, each trait may have a confidence threshold. For instance, if the first voice input and the second voice input match in amplitude by less than 70%, they are probably not from the same source. In some embodiments, if the first voice input and the second voice input match in amplitude at about 75%, other traits such as pitch may be needed to differentiate the speakers. In some cases, if pitch matches by less than, e.g., 65%, then the two voice inputs may be assumed to be different.
[0138] If, at step 832, the first voice traits match the second voice traits (e.g., a match score that meets or exceeds the threshold) then, at step 834, the voice engine outputs that the first voice input and the second voice input are from the same speaker.
[0139] If, at step 832, the first voice traits do not match the second voice traits (e.g., a match score below the confidence threshold) then, at step 816, the voice engine outputs that the first voice input and the second voice input are from different speakers.
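The per-trait comparison of steps 830 and 832 may be sketched, for illustration only, as a staged check that looks at amplitude first and pitch second. The 70% and 65% thresholds are the example values given above; the ratio-based `trait_similarity` measure, the dictionary layout of an input, and the function names are assumptions introduced for this sketch.

```python
def trait_similarity(a, b):
    """Assumed measure: percent similarity of two non-negative values."""
    if a == b:
        return 100.0
    return 100.0 * min(a, b) / max(a, b)


def same_speaker(first, second, amplitude_threshold=70.0, pitch_threshold=65.0):
    """Return True if two inputs appear to come from the same speaker.

    Each input is a dict with hypothetical "amplitude" and "pitch" keys.
    """
    # Quick check on volume: a poor amplitude match suggests a different
    # source, so the remaining traits need not be examined.
    amplitude_match = trait_similarity(first["amplitude"], second["amplitude"])
    if amplitude_match < amplitude_threshold:
        return False

    # Amplitude alone is not conclusive, so pitch is used to differentiate.
    pitch_match = trait_similarity(first["pitch"], second["pitch"])
    return pitch_match >= pitch_threshold
```

Checking amplitude first mirrors the observation that some embodiments may differentiate voices quickly based on volume before looking at other traits.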
[0140]
[0141] Some embodiments may utilize a voice engine to perform one or more parts of process 900, e.g., as part of an ASR platform or interactive virtual assistant application, stored and executed by one or more of the processors and memory of a device and/or server such as those depicted in
[0142] At step 902, a voice engine receives a first voice input, e.g., a voice query to be processed. For instance, a virtual assistant may receive a wake word and a query as a first voice input. In scenario 100 of
[0143] At step 904, the voice engine processes and responds to the input query. In some embodiments, the voice engine transmits the input query for processing. In some embodiments, the virtual assistant may process the input query, determine one or more keywords and/or text from the input query for a search, and provide search results based on the input query. In some embodiments, a wake word will be removed and/or ignored. In some instances, the input query may comprise a first voice input and a supplemental input. For instance,
[0144] At step 908, the voice engine receives a second voice input. For instance, a second voice command or query may be provided to a virtual assistant. In some embodiments, the second voice input may be a new request or a repeat of one or more portions of the prior request. For example, a user may repeat a request because the response was incorrect. In some cases, the second voice input may be provided by a different user (e.g., a new request or still a repeat).
[0145] At step 910, the voice engine determines whether the second voice input matches the first voice input. In some embodiments, consecutive voice inputs that match may indicate that the voice engine provided an improper response and, e.g., the first input may not have been correctly captured. A repeat request may be identical or similar with regard to the sound and/or substance of the first voice input, e.g., a repeat, a rephrase, one or more similar-sounding portions, one or more similar words, etc. In some embodiments, the voice engine may analyze the sound and substance of the first voice input and the second voice input for similarities and generate a match score. In some embodiments, there may be a predetermined threshold match score to determine if two voice inputs match. For instance, a match score of 50 or higher on a 0 to 100 scale may indicate that the second voice input matches the first voice input. In some embodiments, the virtual assistant may be more cautious, assuming a match more readily, and use, e.g., a match score of 35 or higher on a 0 to 100 scale to indicate that the second voice input matches the first voice input. In some embodiments, the virtual assistant may have an adjustable threshold that depends on how recently the last request was made. For instance, a second request following a first request fairly quickly may indicate a repeated query due to an improper response, so a threshold may be lower (e.g., 20 on a scale of 0-100) when a new voice input occurs 5 seconds after a first query/initial response than if a new voice input were provided 30 seconds after a prior query (e.g., a threshold of 60 on the same scale).
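One possible form of the adjustable threshold is sketched below for illustration. The two anchor points (a threshold of 20 at 5 seconds and 60 at 30 seconds) are the example values given above; the linear interpolation between them, the clamping outside that range, and the function names are assumptions of this sketch rather than part of the disclosure.

```python
def repeat_threshold(seconds_since_response, near=(5.0, 20.0), far=(30.0, 60.0)):
    """Return a match-score threshold (0-100) that grows with elapsed time.

    `near` and `far` are (seconds, threshold) anchor points. Values between
    them are linearly interpolated, which is an assumed policy.
    """
    near_seconds, near_threshold = near
    far_seconds, far_threshold = far
    if seconds_since_response <= near_seconds:
        return near_threshold
    if seconds_since_response >= far_seconds:
        return far_threshold
    fraction = (seconds_since_response - near_seconds) / (far_seconds - near_seconds)
    return near_threshold + fraction * (far_threshold - near_threshold)


def is_repeat(match_score, seconds_since_response):
    """A match score that meets or exceeds the threshold indicates a repeat."""
    return match_score >= repeat_threshold(seconds_since_response)
```

Under these assumptions, a match score of 35 would be treated as a repeat when the second input arrives 5 seconds after the response, but not when it arrives 30 seconds after.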
[0146] If the voice engine determines, at step 910, that the second voice input does not match the first voice input then, at step 912, the voice engine processes and responds to the latest input, e.g., the second voice input. For instance,
[0147] If, at step 910, the voice engine determines that the second voice input matches the first voice input then, at step 914, the voice engine transmits a signal to pause and/or mute a background noise. For instance, a virtual assistant working in conjunction with a content delivery system, e.g., a cable provider and/or streaming platform, may transmit a signal to pause the content playback to allow a repeat of a request or command. In some embodiments, a virtual assistant may transmit a signal via wire (e.g., over HDMI, ethernet, etc.) or wirelessly (e.g., infrared, RF, WiFi, Bluetooth, etc.) to pause content playback. For instance, a command to pause playback may be transmitted to allow the user to repeat his or her request. In some embodiments, a virtual assistant may transmit a signal, e.g., via wire or wirelessly, to mute sounds in the background of the request. For instance, a command to mute a TV and/or speakers may be transmitted to allow the user to repeat his or her request. In some embodiments, the virtual assistant may be playing back the background noise and, thus, may be able to pause or mute the background noise. In some embodiments, a virtual assistant may be able to detect which device is playing the background noise. For instance, a virtual assistant may receive a signal via network about which device is playing the background noise. In some embodiments, a virtual assistant may identify the background noise (e.g., using a music or content identification application) and determine which device is playing the background noise. In some embodiments, a virtual assistant may identify the background noise and trigger performance of noise cancellation. The voice engine then waits for further voice input, e.g., at step 916.
[0148] At step 916, the voice engine receives a new voice input. For instance, a new voice command or query may be provided to a virtual assistant, e.g., while the background noise is muted/paused. In some embodiments, the new voice input may be a new request or a repeat of one or more portions of one or more of the prior requests. For example, a user may repeat a request (multiple times) because the virtual assistant's prior response was incorrect. In some cases, the new voice input may be provided by a different user (e.g., a new request or still a repeat).
[0149] At step 918, the voice engine processes and responds to the latest voice input. For instance,
[0150] At step 920, the voice engine transmits a signal to resume and/or unmute the background noise. For instance, a virtual assistant may transmit a signal (via streaming platform and/or content delivery system) to resume/un-pause the content playback after allowing repeat of the request or command. In some embodiments, a virtual assistant may transmit a signal via wire or wirelessly to resume/un-pause content playback. For instance, a command to resume playback may be transmitted after allowing the user to repeat his or her prior request. In some embodiments, a virtual assistant may transmit a signal, e.g., via wire or wirelessly, to unmute sounds in the background of the request that were previously muted to allow repeat of a query. For instance, a command to unmute a TV and/or speakers may be transmitted after previously muting the sounds and allowing the user to repeat his or her request. In some embodiments, the virtual assistant may have been playing back the background noise prior to muting or pausing and, thus, may be able to resume or unmute the background noise quickly.
[0151] In some embodiments, the voice engine finishes responding and waits for a new first voice input, e.g., at step 902. For instance, if a minute lapses since an input/response, the voice engine may assume the query was correctly answered. In some embodiments, the voice engine returns to step 908 and waits for further voice input. For instance, if a new input is provided, the voice engine may assume the query was incorrectly answered again and have to determine whether to mute/pause the background noise again.
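The control flow of steps 910 through 920 may be sketched as follows, for illustration only. Every name in the sketch is hypothetical: `matches`, `respond`, and `listen` are caller-supplied callables standing in for the voice engine's comparison, query processing, and audio capture, and `RecordingPlayback` is a stand-in for whatever device or platform actually receives the pause and resume signals.

```python
class RecordingPlayback:
    """Hypothetical stand-in for a content player; records received signals."""

    def __init__(self):
        self.events = []

    def pause(self):
        self.events.append("pause")

    def resume(self):
        self.events.append("resume")


def handle_follow_up(first_input, second_input, matches, respond, playback, listen):
    """Respond to a second input, pausing playback if it looks like a repeat."""
    if not matches(first_input, second_input):
        # Step 912: no match, so process and respond to the latest input.
        return respond(second_input)

    # Step 914: a match suggests the first response was improper, so the
    # background noise is paused before asking for the request again.
    playback.pause()
    try:
        new_input = listen()       # Step 916: receive a new voice input.
        return respond(new_input)  # Step 918: respond to the latest input.
    finally:
        # Step 920: resume playback even if capture or processing fails.
        playback.resume()
```

Placing the resume signal in a `finally` clause is a design choice of the sketch, intended to avoid leaving content paused if an error occurs while the repeated request is captured or processed.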
[0152]
[0153] At step 952, a voice engine receives a first voice input. For instance, a voice command or query may be provided to a virtual assistant, e.g., to be processed. A voice engine may capture an input stream that comprises multiple inputs, e.g., from one or more voices. In some embodiments, a request, such as request 114 of
[0154] At step 954, the voice engine receives a second voice input. For instance, a second voice command or query may be provided to a virtual assistant. In some embodiments, the second voice input may be a new request or a repeat of one or more portions of the prior request. In some embodiments, consecutive voice inputs that match may indicate that the voice engine provided an improper response and, e.g., the first input may not have been correctly captured. For example, a user may repeat a request because the response was incorrect. In some embodiments, a second voice input may be provided by a different user (e.g., a new request or still a repeat).
[0155] At step 960, the voice engine compares the first voice input with the second voice input for sound and substance. For instance, the voice engine may compare the first voice input with the second voice input regarding sound by comparing one or more various traits, e.g., acoustic metrics, of each input. For instance, the voice engine may compare one or more acoustic traits such as pitch, tone, resonance, amplitude, loudness, etc. In some cases, the voice engine may compare loudness and/or amplitude to determine if the first voice input and the second voice input came from a similar distance from the microphone prior to analyzing other voice traits. Some embodiments may be able to differentiate voices quickly based on volume before looking at other traits like, e.g., pitch, timbre, echo, etc. In some embodiments, a sound match score may be determined based on a comparison of one or more acoustic traits such as pitch, timbre, echo, etc. In some embodiments, one or more traits may be measured and/or depicted mathematically (e.g., using a graphic equalizer) and compared. The voice engine may also compare the first voice input with the second voice input regarding substance, e.g., by processing each using ASR/NLP and comparing the substance of each request and/or query. In some embodiments, such a comparison may analyze keywords, topics, homonyms, synonyms, syntax, sentence structure, etc. to determine if the substance of the first voice input and the second voice input are the same. In some embodiments, a substance match score (normalized, e.g., 0-100) may be determined based on a comparison of one or more of keywords, topics, homonyms, synonyms, syntax, sentence structure, etc. In some embodiments, a match score may be determined based on one or more of a sound match score and a substance match score. For instance, a match score may be calculated based on a weighted average of a sound match score and a substance match score. In some embodiments, timing between the voice queries may be considered, e.g., as a factor pointing towards a repeat (or correction) due to loud background noise.
[0156] At step 962, the voice engine determines whether the first voice input matches the second voice input based on sound and substance, e.g., above a threshold. In some embodiments, a match score, calculated based on a weighted average of a sound match score and a substance match score, may have a confidence threshold (e.g., meeting or exceeding 75 on a normalized scale of 0-100). In some embodiments, each acoustic trait and/or substantive trait may have a confidence threshold. For instance, if the first voice input and the second voice input match in amplitude by less than 70%, they are probably not from the same source. However, in some embodiments, a high substantive score and a low sound match score may indicate that another source is making the request/query. In some embodiments, if the substantive analysis reveals that each input shares, e.g., more than two keywords, then the voice engine may determine that the first voice input matches the second voice input. In some embodiments, if the substantive analysis reveals that each input shares, e.g., at least one homophone and/or synonym, then the voice engine may determine that the first voice input matches the second voice input. In some embodiments, a combination of acoustic traits and/or substantive traits may have one or more confidence thresholds. For instance, if the voice is determined to be the same with 80% confidence and includes at least one keyword, a match may be determined.
[0157] If, at step 962, the first voice input is determined as matching the second voice input (e.g., a match score that meets or exceeds the threshold) then, at step 964, the voice engine outputs that the first voice input and the second voice input indicate a repeat.
[0158] If, at step 962, the first voice input is determined as not matching the second voice input (e.g., a match score that falls below the threshold) then, at step 966, the voice engine outputs that the first voice input and the second voice input do not indicate a repeat.
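For illustration only, the weighted-average match score of steps 960 through 966 may be sketched as below. The threshold of 75 on a 0-100 scale is the example value given above. The equal default weighting, the use of word-set (Jaccard) overlap as a stand-in for the substance comparison, and all function names are assumptions of the sketch; an actual embodiment may instead rely on ASR/NLP analysis of keywords, topics, synonyms, and the like.

```python
def keyword_overlap_score(first_text, second_text):
    """Assumed substance score (0-100): Jaccard overlap of lowercased words."""
    first_words = set(first_text.lower().split())
    second_words = set(second_text.lower().split())
    if not first_words or not second_words:
        return 0.0
    shared = first_words & second_words
    return 100.0 * len(shared) / len(first_words | second_words)


def combined_match_score(sound_score, substance_score, sound_weight=0.5):
    """Weighted average of a sound match score and a substance match score."""
    return sound_weight * sound_score + (1.0 - sound_weight) * substance_score


def indicates_repeat(sound_score, substance_score, threshold=75.0, sound_weight=0.5):
    """Steps 962-966: a score meeting or exceeding the threshold is a repeat."""
    score = combined_match_score(sound_score, substance_score, sound_weight)
    return score >= threshold
```

With equal weights, a sound match score of 90 and a substance match score of 60 average to 75 and would indicate a repeat, whereas a sound match score of 40 and a substance match score of 90 average to 65 and would not.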
[0159] The foregoing description, for purposes of explanation, used specific nomenclature to provide a thorough understanding of the disclosure. However, it will be apparent to one skilled in the art that the specific details are not required to practice the methods and systems of the disclosure. Thus, the foregoing descriptions of specific embodiments of the present invention are presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the methods and systems of the disclosure and various embodiments with various modifications as are suited to the particular use contemplated. Additionally, different features of the various embodiments, disclosed or otherwise, can be mixed and matched or otherwise combined so as to create further embodiments contemplated by the disclosure.