VEHICLE AVATAR DEVICES FOR INTERACTIVE VIRTUAL ASSISTANT
20230012342 · 2023-01-12
Inventors
- Shenbin Zhao (Shanghai, CN)
- Jianchao Lin (Shanghai, CN)
- Nan Li (Beijing City, CN)
- David Yu (Shanghai, CN)
- Lei Shiah (Shanghai, CN)
- Fatty Lin (New Taipei City, TW)
- Bruno Xu (Beijing, CN)
- Feng Xia (Chengdu, CN)
- Feng Liu (Shanghai, CN)
CPC classification
G10L15/22
PHYSICS
G01C21/3629
PHYSICS
B60Q9/00
PERFORMING OPERATIONS; TRANSPORTING
G06F3/167
PHYSICS
G06F3/165
PHYSICS
International classification
G10L15/22
PHYSICS
Abstract
A system and method for providing avatar device status indicators for voice assistants in multi-zone vehicles. The method comprises: receiving at least one signal from a plurality of microphones, wherein each microphone is associated with one of a plurality of spatial zones and one of a plurality of avatar devices; wherein the at least one signal further comprises a speech signal component from a speaker; wherein the speech signal component is a voice command or question; sending zone information associated with the speaker and with one of the plurality of spatial zones to an avatar device; and activating one of the plurality of avatar devices in a respective one of the plurality of spatial zones associated with the speaker.
Claims
1. A method for an interactive voice assistant system for a vehicle comprising: receiving at least one signal from a plurality of microphones, wherein each microphone is associated with one of a plurality of spatial zones, and one of a plurality of avatar devices, wherein the at least one signal further comprises a speech signal component from a speaker, and wherein the speech signal component is a voice command or question; sending zone information associated with the speaker and with a zone of the plurality of spatial zones to an avatar device, from the plurality of avatar devices, associated with the zone; and controlling lighting display of the avatar device in the zone of the plurality of spatial zones associated with the speaker to visually indicate statuses for different operations for the interactive voice assistant system.
2. The method of claim 1, wherein each avatar device, of the plurality of avatar devices, comprises an LED device, and wherein the system is configured to control a respective light display of the each avatar device by illuminating, de-illuminating, or changing the illumination of the respective LED device.
3. The method of claim 1, wherein sending the zone information comprises: determining the avatar device to activate based on detection of the speech signal component by one or more of the plurality of microphones.
4. The method of claim 3, wherein determining the avatar device to activate based on the detection of the speech signal component comprises: determining proximity of at least one microphone from the plurality of microphones to the speaker according to one or more of: coherence-to-diffuse ratio that indicates proximity of the at least one microphone to the speaker, relative time delays between the at least one microphone and one or more other microphones of the plurality of microphones, a signal-to-noise ratio smoothed over time, zone activity detection based on voice biometrics, or visual information provided by a camera or another sensor configured to provide information regarding a spatial zone position of an active speaker.
5. The method of claim 1, wherein controlling the lighting display of the avatar device comprises: lighting the lighting display of the avatar device according to a lighting configuration indicating that the voice assistant system is active or is listening for a further voice command in the zone associated with the speaker.
6. The method of claim 1, wherein controlling the lighting display of the avatar device comprises: lighting the lighting display of the avatar device according to one of a plurality of status indication lighting configurations associated with a determined voice assistant system status from a plurality of voice assistant system statuses.
7. The method of claim 6, wherein the plurality of voice assistant system statuses includes one or more of: a listening status indicating the voice assistant system is receiving a command from the speaker, a processing status indicating the voice assistant system is processing a previously received command, a snoozing status indicating the voice assistant system is in snoozing mode, or an idle status indicating the voice assistant system is in idle mode.
8. The method of claim 6, wherein the plurality of status indication lighting configurations each controls a brightness level and activation sequence of multiple LED lights of the avatar device.
9. The method of claim 6, further comprising: causing, based on the determined voice assistant system status, vibrations of a seat in the zone associated with the speaker.
10. The method of claim 1, wherein controlling the lighting display of the avatar device further comprises: controlling the lighting display of the avatar device to indicate emotion information associated with an executed activity of the voice assistant system.
11. The method of claim 10, wherein controlling the lighting display to indicate the emotion information comprises: controlling the lighting display to indicate an angry emotion in response to a determination that an amount of time, computed by a navigation system coupled to the voice assistance system, to arrive at a destination specified by the speaker is greater than usual.
12. The method of claim 1, wherein controlling the lighting display of the avatar device further comprises: controlling the lighting display of the avatar device to indicate the speaker's emotional state.
13. The method of claim 12, further comprising: determining the speaker's emotion state based on one or more of: voice data received from one or more microphones in the zone associated with the speaker, or visual information of the speaker obtained from a camera in the vehicle.
14. Non-transitory computer readable media comprising computer instructions executable on a processor-based device to: receive at least one signal from a plurality of microphones, wherein each microphone is associated with one of a plurality of spatial zones, and one of a plurality of avatar devices, wherein the at least one signal further comprises a speech signal component from a speaker, and wherein the speech signal component is a voice command or question; send zone information associated with the speaker and with a zone of the plurality of spatial zones to an avatar device, from the plurality of avatar devices, associated with the zone; and control lighting display of the avatar device in the zone of the plurality of spatial zones associated with the speaker to visually indicate statuses for different operations for the interactive voice assistant system.
15. A system comprising: an interactive voice assistant subsystem for a vehicle; one or more microphones to generate voice signals responsive to acoustic signals, wherein each microphone is associated with one of a plurality of spatial zones; a plurality of avatar devices with light displays to provide light-based information; and a processor-based controller to: receive at least one signal from at least one of the plurality of microphones, wherein the at least one signal comprises a speech signal component from a speaker, and wherein the speech signal component is a voice command or a question; send zone information associated with the speaker and with a zone of the plurality of spatial zones to an avatar device, from the plurality of avatar devices, associated with the zone; and control a respective lighting display of the avatar device in the zone of the plurality of spatial zones associated with the speaker to visually indicate statuses for different operations for the interactive voice assistant subsystem.
16. The system of claim 15, wherein each avatar device, of the plurality of avatar devices, comprises an LED device, and wherein the controller is configured to control a respective light display of the each avatar device by illuminating, de-illuminating, or changing the illumination of the respective LED device.
17. The system of claim 15, wherein the controller configured to send the zone information is configured to: determine the avatar device, from the plurality of avatar devices, to activate based on detection of the speech signal component by one or more of the plurality of microphones, including to determine proximity of at least one microphone from the plurality of microphones to the speaker according to one or more of: coherence-to-diffuse ratio that indicates proximity of the at least one microphone to the speaker, relative time delays between the at least one microphone and one or more other microphones of the plurality of microphones, a signal-to-noise ratio smoothed over time, zone activity detection based on voice biometrics, or visual information provided by a camera or another sensor configured to provide information regarding a spatial zone position of an active speaker.
18. The system of claim 15, wherein the controller configured to control the lighting display of the avatar device is configured to: light the lighting display of the avatar device according to one of a plurality of status indication lighting configurations associated with a determined voice assistant system status from a plurality of voice assistant system statuses.
19. The system of claim 15, wherein the controller configured to control the lighting display of the avatar device is further configured to: control the lighting display of the avatar device to indicate emotion information associated with an executed activity of the voice assistant system.
20. The system of claim 15, wherein the controller is further configured to: determine the speaker's emotion state based on one or more of: voice data received from one or more microphones in the zone associated with the speaker, or visual information of the speaker obtained from a camera in the vehicle; and control the lighting display of the avatar device to indicate the speaker's emotional state.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] The accompanying drawings illustrate aspects of the present disclosure, and together with the general description given above and the detailed description given below, explain the principles of the present disclosure. As shown throughout the drawings, like reference numerals designate like or corresponding parts.
[0011]
[0012]
[0013]
[0014]
DETAILED DESCRIPTION OF THE DISCLOSURE
[0015] Referring to the drawings and, in particular, to
[0016] Environment 10 can include spatial zones 110, 120, 130, and 140, having microphones 114, 124, 134, and 144, respectively. Microphones 114, 124, 134, and 144 are arranged such that different spatial zones 110, 120, 130, and 140 are covered by each respective microphone. Specifically, microphones 114, 124, 134, and 144 are spatially separated so that each spatial zone is defined by proximity to the corresponding microphone. This is also referred to as an “acoustic bubble” around the microphone.
[0017] Environment 10 can also include spatial zones 110, 120, 130, and 140, having avatars 115, 125, 135, and 145, respectively. Avatars 115, 125, 135, and 145 are arranged such that different spatial zones 110, 120, 130, and 140 are covered by each respective avatar. In the embodiments as described herein, avatars 115, 125, 135, and 145 are described as lighting devices corresponding to each spatial zone, for example an LED light or LED strip.
[0018] Spatial zones 110, 120, 130, and 140 are indicated by the respective dashed boundary lines. The dashed lines are for illustrative purposes only and are not intended to limit the relative sizes and/or dispositions within environment 10.
[0019] In
[0020] Although four spatial zones are shown in environment 10, the system and method of the present disclosure is operable in an environment with at least two zones. For example, in a vehicular environment, there can be one seat-dedicated microphone 114 and LED strip 115 for zone 110 and a second seat-dedicated microphone 124 and LED strip 125 for zone 120. Such a configuration corresponds to one microphone and LED avatar for the driver's seat and one microphone and LED avatar for the front passenger's seat.
[0021] Although each of spatial zones 110, 120, 130 and 140 is shown in the figures to include a single microphone, each zone can include multiple microphones or an array of microphones to focus on the related speaker in each zone. That is, although microphone 114 is shown and described as one microphone, for example, microphone 114 can be an array of microphones. Advantageously, such an arrangement allows for techniques such as beamforming. Examples can also comprise virtual microphones. A virtual microphone, as used herein, is a combination of multiple physical microphones in an array of microphones dedicated to a single spatial zone, together with the processing that determines one output signal from them; beamforming is one example of a technique for determining that output signal. This output signal, associated with the array of microphones and designated as the output signal of a virtual microphone, can focus on one dedicated zone much like a single omni-directional microphone positioned close to a speaker in a particular zone, or like a directional microphone steered toward the desired zone or speaker.
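The virtual-microphone idea above can be illustrated with a simple delay-and-sum beamformer. The following is a hedged sketch, not the disclosed implementation: integer per-channel delays are assumed to be known in advance, whereas in practice they would be estimated from the array geometry or the signals themselves.

```python
# Illustrative delay-and-sum "virtual microphone": several physical microphones
# in one zone are combined into a single, speaker-focused output signal.
# The per-channel delays (in samples) are assumed known for this sketch.

def delay_and_sum(channels, delays):
    """Align each channel by its integer sample delay and average them.

    channels: list of equal-length sample lists, one per physical microphone.
    delays:   per-channel delay (in samples) of the path from the speaker.
    """
    n = len(channels[0])
    out = [0.0] * n
    for ch, d in zip(channels, delays):
        for i in range(n):
            j = i + d  # advance the later-arriving channel to align wavefronts
            if 0 <= j < n:
                out[i] += ch[j]
    return [s / len(channels) for s in out]

# Two microphones hearing the same pulse, arriving one sample later at mic_b:
mic_a = [0.0, 1.0, 0.0, 0.0]
mic_b = [0.0, 0.0, 1.0, 0.0]
virtual = delay_and_sum([mic_a, mic_b], delays=[0, 1])
```

After alignment the two pulses add coherently, so the virtual microphone emphasizes the in-zone speaker while uncorrelated sound from other zones averages down.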
[0022] Although each of spatial zones 110, 120, 130 and 140 is shown in the figures to include a single LED or LED strip, each zone can include multiple LEDs or LED strips. Also, while the terms “avatar” and “LED” or “LED strip” are used interchangeably herein, an avatar can include other devices providing visual cues or lighting that can correspond uniquely to a different zone. As will also be appreciated, while avatars are described as visual avatars, other sensory avatar devices that do not distract a driver can be used. For example, avatars 115, 125, 135, and 145 can be haptic avatars, such as a vibrating element embedded in each seat.
[0023] It will further be understood that environments such as environment 10 can have more than four spatial zones as long as each zone has at least one microphone and one avatar. For example, a sport utility vehicle with seating for six passengers can be outfitted with six microphones and six LED strips for six zones corresponding to six seats. The same applies to a van having twelve seats (12 zones, 12 microphones, 12 LED strips), a bus having sixty seats (60 zones, 60 microphones, 60 LED strips), and so on.
[0024] Referring to
[0025] System 100 includes the following exemplary components that are electrically and/or communicatively connected: a sound reproducer 102 (
[0026] SP unit 210 performs gain estimation and application, speaker activity detection, and multi-channel signal processing.
[0027] Sound reproducer 102 is an electromechanical device that produces sound, also known as a loudspeaker. The location shown for sound reproducer 102 in
[0028] Microphones 114, 124, 134, and 144 are transducers that convert sound into an electrical signal. Typically, a microphone utilizes a diaphragm that converts sound to mechanical motion that is in turn converted to an electrical signal.
[0029] Several types of microphones exist that use different techniques to convert, for example, air pressure variations of a sound wave into an electrical signal. Nonlimiting examples include: dynamic microphones that use a coil of wire suspended in a magnetic field; condenser microphones that use a vibrating diaphragm as a capacitor plate; and piezoelectric microphones that use a crystal of made of piezoelectric material. A microphone according to the present disclosure can also include a radio transmitter and receiver for wireless applications.
[0030] Microphones 114, 124, 134, and 144 can be directional microphones (e.g. cardioid microphones) so that focus on a spatial zone is emphasized. An omni-directional microphone can also focus on one zone by its position within the zone close to the desired speaker. Microphone 114 can be one or more microphones or microphone arrays. Microphones 124, 134, and 144 can also be one or more microphones or microphone arrays.
[0031] Sound reproducer 102 and microphones 114, 124, 134, and 144 can be disposed in one or more enclosures 150.
[0032] Detecting in which zone of at least two zones a person is speaking based on multiple microphone signals can be done, e.g., by evaluating the speech power occurring at a microphone in each of the at least two zones.
[0033] The system can be configured to perform multi-zone processing (e.g., for separation, combination, or zone selection) using, for example, the observation of level differences between the different microphone signals. For each passenger speaking, it is assumed that the passenger-dedicated microphone for the respective passenger's seat shows a higher signal level compared to the microphones for the other seats. Typically, acoustic cross-talk couplings between the spatial zones in the car (“cross-talk”) are at least in the range of about −6 dB (depending on the placement of the microphones, the position of the speaker, and further room acoustic parameters).
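The level-difference zone selection described above might be sketched as follows. The frame values, the 6 dB decision margin, and the zone numbering are illustrative assumptions, not parameters from the disclosure.

```python
import math

def rms_db(signal):
    """Root-mean-square level of one frame, in dB relative to full scale 1.0."""
    rms = math.sqrt(sum(s * s for s in signal) / len(signal))
    return 20 * math.log10(max(rms, 1e-12))

def active_zone(zone_signals, min_margin_db=6.0):
    """Pick the zone whose dedicated microphone shows the highest level,
    provided it exceeds the runner-up by the assumed cross-talk margin."""
    levels = {zone: rms_db(sig) for zone, sig in zone_signals.items()}
    ranked = sorted(levels.items(), key=lambda kv: kv[1], reverse=True)
    best, second = ranked[0], ranked[1]
    if best[1] - second[1] >= min_margin_db:
        return best[0]
    return None  # ambiguous: no zone clearly dominates

frames = {
    110: [0.5, -0.4, 0.5, -0.5],   # driver speaking: strong at mic 114
    120: [0.1, -0.1, 0.1, -0.1],   # cross-talk at mic 124, well below
}
```

Here the driver's microphone dominates by more than the margin, so zone 110 is selected; equal levels would return no zone rather than guess.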
[0034] The system is also configured with a virtual assistant. The terms “virtual assistant,” “digital assistant,” “intelligent automated assistant,” or “automatic digital assistant” can refer to any information processing system that can interpret natural language input in spoken and/or textual form to infer user intent, and perform actions based on the inferred user intent. For example, to act on an inferred user intent, the system can be configured to perform one or more of the following: identifying a task flow with steps and parameters designed to accomplish the inferred user intent; inputting specific requirements from the inferred user intent into the task flow; executing the task flow by invoking programs, methods, services, APIs, or the like; and generating output responses to the user in an audible (e.g., speech) and/or visual form.
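The intent-to-task-flow dispatch listed above could be sketched as a simple lookup-and-execute step. The intent names, parameters, and handler functions here are illustrative assumptions, not the assistant's actual API.

```python
# Hypothetical mapping from an inferred intent to its task flow. Each handler
# stands in for invoking programs, services, or APIs and returning a response.

def handle_navigate(params):
    return f"Routing to {params['destination']}"

def handle_radio(params):
    return f"Tuning to {params['station']}"

TASK_FLOWS = {
    "navigate": handle_navigate,
    "set_radio": handle_radio,
}

def execute_intent(intent, params):
    """Look up the task flow for an inferred intent, feed it the specific
    requirements extracted from the utterance, and return the output response."""
    handler = TASK_FLOWS.get(intent)
    if handler is None:
        return "Sorry, I can't do that yet."
    return handler(params)
```

A real assistant would infer `intent` and `params` from natural language understanding; here they are passed in directly to keep the sketch self-contained.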
[0035] A virtual assistant is configured to accept a user request at least partially in the form of a natural language command, request, statement, narrative, and/or inquiry. Typically, the user request seeks either an informational answer or performance of a task by the virtual assistant.
[0036] As shown in
[0037] The client-side portion executed on the vehicle control system computing device 200 can provide client-side functionalities, such as user-facing input and output processing and communications with server system 280. Server system 280 can provide server-side functionalities for any number of clients residing on a respective user device.
[0038] Server system can include one or more virtual assistant servers 281 that can include a client-facing I/O interface 284, one or more processing modules, data and model storage 283, and an I/O interface to external services. The client-facing I/O interface 284 can facilitate the client-facing input and output processing for virtual assistant server. The one or more processing modules can utilize data and model storage 283 to determine the user's intent based on natural language input, and can perform task execution based on inferred user intent. Virtual assistant server 281 can include an external services I/O interface configured to communicate with external services 30, such as telephony services, calendar services, information services, messaging services, navigation services, and the like, through network(s) 20 for task completion or information acquisition. The I/O interface 285 to external services 30 can facilitate such communications.
[0039] Server system 280 can be implemented on one or more standalone data processing devices or a distributed network of computers. In some examples, server system 280 can employ various virtual devices and/or services of third party service providers (e.g., third-party cloud service providers) to provide the underlying computing resources and/or infrastructure resources of server system.
[0040] Although the functionality of the virtual assistant is described in as including both a client-side portion 220 and a server-side portion 281, in some examples, the functions of an assistant (or speech recognition in general) can be implemented as a standalone application installed on a user device or vehicle control system 200. In addition, the division of functionalities between the client and server portions of the virtual assistant can vary in different examples.
[0041]
[0042] At step 310, a user such as user 12 speaks a wake-up or voice command detected by one of the microphones, such as microphone 114, which is used to signal system 100 to wake up. The wake-up command spoken by the user can be a phrase such as “hello” in various languages. The microphone closest to and in the zone of the speaker sends the signal to system 100. System 100 is able to identify which zone the signal was received from. For example, a signal received from microphone 114 in zone 110 indicates the speaker is user 12 and is the driver.
[0043] In some embodiments, at step 301, a trigger can be used to detect a speech zone. Exemplary non-limiting triggers include a Coherence-to-Diffuse-Ratio that indicates proximity of the microphone to the speaker, relative time delays between microphones, a Signal-to-Noise-Ratio smoothed over time, zone activity detection based on voice biometrics, or visual information provided by a camera or another sensor (not shown) configured to provide information regarding the spatial zone position of an active speaker.
[0044] In some embodiments, where an algorithm related to a camera extracts activity information of the zone dedicated speaker based on visual information, a camera can be used for the trigger.
[0045] At step 320, the system 100 sends zone or seat information to an avatar such as LED avatar 115. The LED avatar can be any one of the LED avatars in the vehicle, depending on which user is the speaker or which zone needs information based on the voice input. For example, where user 12 is the driver and the speaker, LED avatar 115 in zone 110 is sent the seat information at step 320, which is the same zone as the speaker.
[0046] At step 330, in response to receiving the zone or seat information, the LED avatar lights up in the zone corresponding to the zone of the user 12. For example, LED avatar 115 lights up in zone 110 to indicate that the voice assistant is active or listening for a voice command in zone 110 from speaker user 12.
[0047] In some embodiments, the LED avatar in the active area or zone can light up with a low brightness to indicate that the voice assistant is active in that zone.
[0048] At step 340, a user issues voice commands to the voice assistant to operate a voice recognition system. A user can operate the voice recognition system as known in the art by issuing various commands to command the voice assistant to operate any number of vehicle operations and systems. For example, voice assistants are configured to perform operations such as switching radio stations, opening or closing a specific window, locking the vehicle, calling a specific person, or adjusting or entering a destination in the navigation system. The voice commands are received by system 100 through a microphone that is in the same zone as the user.
[0049] At step 350, system 100 sends avatar status information and a command to an LED avatar. Avatar status information can include any number of statuses such as, for example, listening, processing, snoozing, or idle. For example, when user 12 speaks a voice command at step 340, the system 100 sends avatar status information corresponding to the status “listening” to LED avatar 115.
[0050] In some embodiments, when system 100 sends the status “listening” to an LED avatar, the system is configured to control the brightness of the LED. The system can be further configured to vary the brightness based on speaker input to the system. For example, in a “listening” status configuration, the system can be configured to dim or brighten the LED to correspond to the volume of a user's voice. In some embodiments, the LEDs can be configured to dim when the volume of a user's voice decreases and brighten when the volume of a user's voice increases.
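The volume-tracking “listening” brightness might be computed as in the following sketch. The brightness floor, ceiling, and full-scale voice level are assumed values chosen for illustration.

```python
def listening_brightness(volume_rms, floor=0.1, ceil=1.0, full_scale=0.5):
    """Map the speaker's voice level to an LED brightness in [floor, ceil].

    The LED dims toward the floor as the voice gets quieter and brightens
    toward the ceiling as it gets louder; levels above full_scale are clamped.
    """
    level = min(volume_rms / full_scale, 1.0)
    return floor + (ceil - floor) * level
```

A silent input keeps the LED at a faint glow (the floor) rather than dark, so the “listening” status stays visible even between words.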
[0051] In some embodiments, when the avatar status is “processing”, the system 100 can be configured to light an avatar having an LED strip or series of LED lights from a left end of the LED lights to a right end of the LED lights. The LED lights that light up first will also fade first.
[0052] The system can then be configured to light the LED lights from the right end to the left end. This sequence can repeat while the avatar status information is “processing”. In some embodiments, the processing status can indicate that the system 100 received a voice command and is currently processing the request.
[0053] In some embodiments, when the avatar status is “snoozing”, the system 100 can be configured to change the brightness of the LED avatar from low to high and then from high to low, and repeat this sequence while the avatar status is “snoozing”.
[0054] In some embodiments when the avatar status is “idle”, the system 100 can be configured to turn the LED avatar off so no light is emitted.
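The four status lighting configurations described above can be summarized in one sketch that emits per-step brightness patterns for an LED strip. The frame counts, brightness levels, and sweep shape are illustrative choices, not the disclosed patterns.

```python
# Illustrative status-to-lighting mapping: each call returns a list of
# animation frames, where a frame is one brightness value (0.0-1.0) per LED.

def lighting_frames(status, num_leds=8, steps=4):
    if status == "idle":
        return [[0.0] * num_leds]              # all LEDs off, no light emitted
    if status == "listening":
        return [[0.2] * num_leds]              # steady low brightness
    if status == "snoozing":                   # breathe: low -> high -> low
        up = [i / steps for i in range(steps + 1)]
        return [[b] * num_leds for b in up + up[-2::-1]]
    if status == "processing":                 # sweep left -> right, then back
        sweep = list(range(num_leds)) + list(range(num_leds - 1, -1, -1))
        return [[1.0 if i == pos else 0.0 for i in range(num_leds)]
                for pos in sweep]
    raise ValueError(f"unknown status: {status}")
```

A controller would loop these frames out to the LED driver while the status holds, restarting the sequence for repeating patterns such as “processing” and “snoozing”.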
[0055] In some embodiments, the system can be configured to operate a haptic feedback avatar device to send vibrations to the seat in the zone the user is in, based on the avatar status information.
[0056] Accordingly, at step 360, the system 100 is configured to operate the LED or other avatar, for example as described above, based on the avatar status information sent in step 350.
[0057] In some embodiments, at step 370 the system 100 can be configured to provide emotion information to an LED avatar in the zone of the user to indicate a particular emotion. In some embodiments the emotion is related to the voice command issued by the user in step 340, and the result of the processed request or command. For example, if the voice command issued by the user at step 340 is “go home”, the system 100 can be configured to enter the known address of the user in the system's navigation. If the amount of time for the user to arrive at the destination is greater than usual due to traffic, the system 100 can be configured to send emotion information indicating an angry state to the LED avatar in the zone of the user. For example, the system can be configured to command the LED to emit a red light based on the “angry” emotional prompt.
[0058] In some embodiments, the system 100 can detect the user's emotional state through the voice data received from the microphone in the user's zone. In some embodiments, system 100 can detect the user's emotional state through visual information received from a camera within the vehicle. In some embodiments, system 100 can send emotion information related to the user's current emotional state to an LED avatar. For example, if the system 100 detects the user is upset, the system 100 can be configured to send the emotion information indicating the user's angry state to the LED avatar. In some embodiments, the LED avatar can be configured to provide certain lighting to calm the user when the user is upset, for example a soft blue light.
[0059] In some embodiments emotion information is obtained or generated through natural language understanding (NLU) algorithms. For example, the system 100 can be configured with an NLU system configured to perform sentiment analysis as known in the art. In some embodiments, the system can also be configured to perform emotion recognition, for example using facial tracking recognition systems and emotion recognition software.
[0060] At step 380, the LED avatar can be configured to display lighting corresponding to the emotion information received from system 100 at step 370.
[0061] For example, as described above, the system can be configured so that the LED avatar lights up with a red color to indicate anger based on an “anger” prompt at step 370. If the emotion information is “calm”, the LED can light up with a blue color to show a calm state.
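The emotion-to-lighting mapping above might be represented as a simple lookup. The red/angry and blue/calm pairs come from the examples in this disclosure; the specific RGB values and the default color are assumptions for illustration.

```python
# Illustrative emotion-information-to-LED-color table (RGB triples).
EMOTION_COLORS = {
    "angry": (255, 0, 0),     # red, e.g. when arrival will take longer than usual
    "calm": (70, 130, 235),   # soft blue, used to soothe an upset speaker
}
DEFAULT_COLOR = (255, 255, 255)  # assumed neutral white for unmapped emotions

def emotion_color(emotion):
    """Return the RGB color the avatar should display for an emotion label."""
    return EMOTION_COLORS.get(emotion, DEFAULT_COLOR)
```

At step 380 a controller would write the returned triple to the LED avatar in the speaker's zone.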
[0062] It should be understood that elements or functions of the present invention as described above can be implemented in the form of control logic using computer software in a modular or integrated manner. Based on the disclosure and teachings provided herein, a person of ordinary skill in the art will know and appreciate other ways and/or methods to implement the present invention using hardware and a combination of hardware and software.
[0063] When a certain structural element is described as “is connected to”, “is coupled to”, or “is in contact with” a second structural element, it should be interpreted that the second structural element can “be connected to”, “be coupled to”, or “be in contact with” another structural element, as well as that the certain structural element is directly connected to or is in direct contact with yet another structural element.
[0064] It should be noted that the terms “first”, “second”, and the like can be used herein to modify various elements. These modifiers do not imply a spatial, sequential or hierarchical order to the modified elements unless specifically stated.
[0065] As used herein, the terms “a” and “an” mean “one or more” unless specifically indicated otherwise.
[0066] As used herein, the term “substantially” means the complete or nearly complete extent or degree of an action, characteristic, property, state, structure, item, or result. For example, an object that is “substantially” enclosed means that the object is either completely enclosed or nearly completely enclosed. The exact allowable degree of deviation from absolute completeness can in some cases depend on the specific context. However, generally, the nearness of completion will be to have the same overall result as if absolute and total completion were obtained.
[0067] As used herein, the term “about” is used to provide flexibility to a numerical range endpoint by providing that a given value can be “a little above” or “a little below” the endpoint. Further, where a numerical range is provided, the range is intended to include any and all numbers within the numerical range, including the end points of the range.
[0068] While the present disclosure has been described with reference to one or more exemplary embodiments, it will be understood by those skilled in the art, that various changes can be made, and equivalents can be substituted for elements thereof without departing from the scope of the present disclosure. In addition, many modifications can be made to adapt a particular situation or material to the teachings of the present disclosure without departing from the scope thereof. Therefore, it is intended that the present disclosure will not be limited to the particular embodiments disclosed herein.
[0069] The operation of certain aspects of the present disclosure have been described with respect to flowchart illustrations. In at least one of various embodiments, processes described in conjunction with
[0070] It will be understood that each block of the flowchart illustrations described herein, and combinations of blocks in the flowchart illustrations, can be implemented by computer program instructions. These program instructions can be provided to a processor to produce a machine, such that the instructions, which execute on the processor, create means for implementing the actions specified in the flowchart block or blocks. The computer program instructions can be executed by a processor to cause a series of operational steps to be performed by the processor to produce a computer-implemented process, such that the instructions, which execute on the processor, provide steps for implementing the actions specified in the flowchart block or blocks. The computer program instructions can also cause at least some of the operational steps shown in the blocks of the flowchart to be performed in parallel. Moreover, some of the steps can also be performed across more than one processor, such as might arise in a multi-processor computer system or even a group of multiple computer systems. In addition, one or more blocks or combinations of blocks in the flowchart illustration can also be performed concurrently with other blocks or combinations of blocks, or even in a different sequence than illustrated, without departing from the scope or spirit of the present disclosure.
[0071] Accordingly, blocks of the flowchart illustrations support combinations for performing the specified actions, combinations of steps for performing the specified actions and program instruction means for performing the specified actions. It will also be understood that each block of the flowchart illustrations, and combinations of blocks in the flowchart illustrations, can be implemented by special purpose hardware-based systems, which perform the specified actions or steps, or combinations of special purpose hardware and computer instructions. The foregoing examples should not be construed as limiting and/or exhaustive, but rather, as illustrative use cases to show an implementation of at least one of the various embodiments of the present disclosure.