Head-tracked spatial audio
11546687 · 2023-01-03
Assignee
Inventors
- Symeon Delikaris Manias (Los Angeles, CA, US)
- Shai Messingher Lang (Santa Clara, CA, US)
- Juha O. Merimaa (San Mateo, CA)
CPC classification
- H04S2400/15 (ELECTRICITY)
- H04R2499/15 (ELECTRICITY)
- H04R2499/11 (ELECTRICITY)
International classification
Abstract
Spatial filters are generated that map the response of an audio capture device to head related transfer functions (HRTFs) for different positions of the audio capture device relative to the HRTFs. A current set of spatial filters is determined based on the plurality of spatial filters and a head position of a user. Microphone signals are convolved with the current set of spatial filters, resulting in a left audio channel and a right audio channel that form output binaural audio channels. The binaural audio channels can be used to drive speakers of a headphone set to generate sound that is perceived to have a spatial quality. Other aspects are described and claimed.
Claims
1. A method, performed by a computing device, comprising: generating a plurality of spatial filters that map response of an audio capture device having a plurality of microphones to a plurality of head related transfer functions (HRTFs) for a plurality of positions of the audio capture device relative to the HRTFs; determining a current set of spatial filters based on the plurality of spatial filters and a head position of a user; and convolving microphone signals with the current set of spatial filters, resulting in output binaural audio channels.
2. The method of claim 1, wherein the output binaural audio channels are generated from the microphone signals without creating an intermediate format.
3. The method of claim 1, wherein determining the current set of spatial filters includes selecting one or more of the plurality of spatial filters as the current set of spatial filters, based on the head position of the user.
4. The method of claim 1, wherein determining the current set of spatial filters includes interpolating one or more of the plurality of spatial filters selected based on the head position of the user, to determine the current set of spatial filters.
5. The method of claim 1, wherein generating the plurality of spatial filters includes discarding a subset of the plurality of spatial filters for which a reproduction error exceeds a threshold and storing the plurality of spatial filters on the computing device or on an external computing device in communication with the computing device.
6. The method of claim 1, wherein the plurality of spatial filters are represented as beamforming filters, and generating the plurality of spatial filters includes a) generating virtual beamformed speakers based on directivity pattern of the plurality of microphones of the audio capture device, b) generating virtual beamformed microphone pickups based on the HRTFs, and c) generating the beamforming filters such that the beamforming filters map the virtual beamformed speakers to the virtual beamformed microphone pickups.
7. The method of claim 1, wherein the head position of the user is generated by one or more sensors integrated with a headphone set or head mounted display.
8. The method of claim 7, wherein the output binaural audio channels are applied to a left speaker and right speaker of the headphone set or the head mounted display, to produce spatial binaural audio.
9. The method of claim 1, wherein each of the plurality of spatial filters corresponds to a different position of the user's head.
10. The method of claim 1, wherein the head position of the user defines at least one of a roll, pitch, or yaw of the user's head.
11. An audio processing system comprising a processor, configured to perform operations including: generating a plurality of spatial filters that map response of an audio capture device having a plurality of microphones to a plurality of head related transfer functions (HRTFs) for a plurality of positions of the audio capture device relative to the HRTFs; determining a current set of spatial filters based on the plurality of spatial filters and a head position of a user; and convolving microphone signals with the current set of spatial filters, resulting in output binaural audio channels.
12. The audio processing system of claim 11, wherein the output binaural audio channels are generated from the microphone signals without creating an intermediate format.
13. The audio processing system of claim 11, wherein determining the current set of spatial filters includes selecting one or more of the plurality of spatial filters as the current set of spatial filters, based on the head position of the user.
14. The audio processing system of claim 11, wherein determining the current set of spatial filters includes interpolating one or more of the plurality of spatial filters selected based on the head position of the user, to determine the current set of spatial filters.
15. The audio processing system of claim 11, wherein generating the plurality of spatial filters includes discarding a subset of the plurality of spatial filters for which a reproduction error exceeds a threshold and storing the plurality of spatial filters on the computing device or on an external computing device in communication with the computing device.
16. The audio processing system of claim 11, wherein the plurality of spatial filters are represented as beamforming filters, and generating the plurality of spatial filters includes a) generating virtual beamformed speakers based on directivity pattern of the plurality of microphones of the audio capture device, b) generating virtual beamformed microphone pickups based on the HRTFs, and c) generating the beamforming filters such that the beamforming filters map the virtual beamformed speakers to the virtual beamformed microphone pickups.
17. An electronic device, comprising a processor, configured to perform operations including: accessing a plurality of spatial filters that map response of an audio capture device having a plurality of microphones to a plurality of head related transfer functions (HRTFs) for a plurality of positions of the audio capture device relative to the HRTFs; determining a current set of spatial filters based on the plurality of spatial filters and a head position of a user; and convolving microphone signals with the current set of spatial filters, resulting in output binaural audio channels.
18. The electronic device of claim 17, wherein the output binaural audio channels are generated from the microphone signals without creating an intermediate format.
19. The electronic device of claim 17, wherein determining the current set of spatial filters includes selecting one or more of the plurality of spatial filters as the current set of spatial filters, based on the head position of the user.
20. The electronic device of claim 17, wherein determining the current set of spatial filters includes interpolating one or more of the plurality of spatial filters selected based on the head position of the user, to determine the current set of spatial filters.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) Several aspects of the disclosure here are illustrated by way of example and not by way of limitation in the figures of the accompanying drawings in which like references indicate similar elements. It should be noted that references to “an” or “one” aspect in this disclosure are not necessarily to the same aspect, and they mean at least one. Also, in the interest of conciseness and reducing the total number of figures, a given figure may be used to illustrate the features of more than one aspect of the disclosure, and not all elements in the figure may be required for a given aspect.
DETAILED DESCRIPTION
(9) Several aspects of the disclosure with reference to the appended drawings are now explained. Whenever the shapes, relative positions and other aspects of the parts described are not explicitly defined, the scope of the invention is not limited only to the parts shown, which are meant merely for the purpose of illustration. Also, while numerous details are set forth, it is understood that some aspects of the disclosure may be practiced without these details. In other instances, well-known circuits, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
(11) A spatial filter mapper 14 generates a plurality of spatial filters that map the capture response of the audio capture device to a plurality of head related transfer functions (HRTFs) for a plurality of positions of the audio capture device relative to the HRTFs. The HRTFs can be expressed as pickup beams that are arranged at and about a user's left or right ear. In some aspects, a single best guess HRTF is used to cover a wide range of users. This HRTF can be determined based on routine test and experimentation to best cover all users or a targeted group of users. The capture response of the audio capture device can include positions of the microphones and/or directivity of each of the microphones. The directivity of a microphone can be described as a polar pattern of the microphone.
(12) At filter determination and convolution block 12, a current set of spatial filters are determined based on the plurality of spatial filters and a head position of a user. Microphone signals are convolved with the current set of spatial filters, resulting in output binaural audio channels. In such a manner, rather than a single transform filter, the system can include N transformations that map the capture response of each microphone signal of the capture device using spatial filters corresponding to different directions. Head-tracking can be performed based on those filters (e.g., through interpolation). N sets of spatial filters (e.g., beamforming filters) can be applied to transform the N microphone signals to generate the output audio channels.
(13) These output audio channels can be applied to speakers 17 to generate spatial audio. In particular, the output binaural audio channels can be applied to a left speaker and right speaker of a headphone set or a head mounted display, to produce spatial binaural audio. The output binaural channels are generated from the microphone signals without creating an intermediate format.
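The convolution step described above can be sketched in code, by way of illustration only; the function name, array shapes, and use of time-domain FIR filters are assumptions for this sketch, not limitations of the claims:

```python
import numpy as np

def render_binaural(mic_signals, filter_set):
    """Convolve N microphone signals with the current spatial filter set
    and sum the results into left/right binaural output channels.

    mic_signals : (n_mics, n_samples) array of capture-device signals
    filter_set  : (n_mics, 2, filt_len) array; per-microphone FIR filters
                  for the left (0) and right (1) ears at the current
                  head position
    """
    n_mics, n_samples = mic_signals.shape
    filt_len = filter_set.shape[2]
    out = np.zeros((2, n_samples + filt_len - 1))
    for m in range(n_mics):
        for ear in range(2):
            # direct convolution for clarity; a real-time system would
            # typically use block-based FFT convolution instead
            out[ear] += np.convolve(mic_signals[m], filter_set[m, ear])
    return out  # (2, n_samples + filt_len - 1): left and right channels
```

Note that the microphone signals go directly to the binaural output; no intermediate spatial format is produced along the way.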
(14) In some aspects, the microphone signals are transmitted by a recording and/or encoding device to an intermediate device, a decoding device, or a playback device. The spatial filter mapper can generate the spatial filters ‘offline’ (e.g., at an audio lab or recording studio). These spatial filters can be stored to individual decoding or playback devices. Additionally, or alternatively, the spatial filters can be made available on a networked device (e.g., a server) for a decoding or playback device to access. The spatial filters or sets thereof can be associated with different head positions and retrieved based on head position (e.g., with a look-up algorithm).
(16) Referring back to
(17) Determining the current set of spatial filters that are used to convolve the microphone signals can include selecting one or more of the plurality of spatial filters as the current set of spatial filters, based on the head position of the user. Each of the plurality of spatial filters can correspond to a different position of the user's head.
(18) For example, the user's head position can be expressed in terms of spherical coordinates such as azimuth and elevation. Each set of spatial filters can be associated with a different head position (e.g., at a particular azimuth and elevation angle). If the user's head is at an azimuth of 120° and an elevation of −10°, then the set of spatial filters associated with 120° and −10° can be selected. Those selected filters are then used to convolve the microphone signals to generate spatialized binaural output channels with the head-tracking information of the user ‘baked in’ to the audio channels, so that sounds heard in the audio reflect the position of the user's head.
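A minimal sketch of such a selection, assuming filter sets keyed by (azimuth, elevation) pairs and nearest-neighbor matching by angular distance (all names here are illustrative, not part of the claims):

```python
import numpy as np

def select_filter_set(head_az, head_el, filter_bank):
    """Pick the pre-computed spatial filter set whose stored head position
    is angularly closest to the tracked head position.

    head_az, head_el : tracked head azimuth and elevation in degrees
    filter_bank      : dict mapping (azimuth_deg, elevation_deg) tuples
                       to pre-computed filter sets
    """
    def angular_distance(az, el):
        # great-circle distance between (az, el) and the head position
        a1, e1 = np.radians([head_az, head_el])
        a2, e2 = np.radians([az, el])
        return np.arccos(np.clip(
            np.sin(e1) * np.sin(e2) +
            np.cos(e1) * np.cos(e2) * np.cos(a1 - a2), -1.0, 1.0))

    key = min(filter_bank, key=lambda k: angular_distance(*k))
    return filter_bank[key]
```

With this scheme, a head position of 118° azimuth and −8° elevation would retrieve the filter set stored for 120° and −10° when that is the closest stored position.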
(19) In some aspects, spatial filters may not be calculated and/or stored for each and every position of the user's head. The amount of spatial filters may vary depending on application. As the number of spatial filters (and corresponding head positions) that are calculated and stored increases, the spatial resolution of the spatial filter bank also increases. Increasing the number of spatial filters, however, also increases audio processing and storage overhead. Thus, in some aspects, spatial filters can be interpolated, to address when the user head position is not exactly aligned with any of the pre-calculated sets of spatial filters that are each associated with a particular head position.
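One possible interpolation, sketched here for a one-dimensional azimuth grid with linear blending of filter coefficients, is shown below. This is a simplification (blending FIR taps directly can smear phase, and the interpolation described above is not limited to this form):

```python
import numpy as np

def interpolate_filters(head_az, az_grid, filter_sets):
    """Linearly blend the two pre-computed filter sets that bracket the
    tracked azimuth (1-D case; elevation would be handled analogously).

    head_az     : tracked head azimuth in degrees
    az_grid     : sorted 1-D array of azimuths (deg) with stored filters
    filter_sets : array of filter sets; filter_sets[i] matches az_grid[i]
    """
    i = np.searchsorted(az_grid, head_az)
    if i == 0:
        return filter_sets[0]          # below the grid: clamp to first set
    if i == len(az_grid):
        return filter_sets[-1]         # above the grid: clamp to last set
    lo, hi = az_grid[i - 1], az_grid[i]
    w = (head_az - lo) / (hi - lo)     # blend weight toward upper neighbor
    return (1 - w) * filter_sets[i - 1] + w * filter_sets[i]
```

This trades storage for computation: a coarser grid of pre-calculated sets suffices when intermediate head positions are interpolated at run time.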
(21) In some aspects, as shown in
(23) The spatial filters can include gains and phase delays or differences of each of the microphone signals. The spatial filters can be different for each microphone signal. A spatial filter set refers to a set of spatial filters for a plurality of microphones that are associated with a particular head direction. The spatial filters can be stored on a computing device or on an external computing device in communication with the computing device (e.g., on the cloud). A look-up algorithm can be used to select a spatial filter set based on head position. It should be understood that the spatial filter sets that are not interpolated are pre-calculated (e.g., offline). In some aspects, the spatial filters are linear. In some aspects, the spatial filters are adaptive (e.g., varying over time).
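As an illustration of a filter expressed as per-frequency gains and phase delays applied to one microphone signal (the function and its parameterization are assumed for this sketch only):

```python
import numpy as np

def apply_spectral_filter(mic_spectrum, gains, phase_delays, freqs):
    """Apply a spatial filter expressed as per-frequency gains and phase
    delays to the spectrum of one microphone signal.

    mic_spectrum : complex spectrum of one microphone signal (e.g. rfft)
    gains        : per-bin linear gains
    phase_delays : per-bin delays in seconds
    freqs        : bin center frequencies in Hz
    """
    # a delay of tau seconds is a phase rotation of -2*pi*f*tau in bin f
    return mic_spectrum * gains * np.exp(-2j * np.pi * freqs * phase_delays)
```

A delay that is constant across frequency simply shifts the signal in time; frequency-dependent gains and delays shape both the magnitude and the interaural timing cues per bin.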
(24) In some aspects, the spatial filters are represented as beamforming filters (e.g., beamforming coefficients defining phase and gain for each microphone signal). Beamforming controls directionality of a speaker array or microphone array to target how wave energy (e.g., sound) is transmitted or received. Beamforming filters (defining phase and gain values) are applied to each of microphone signals or audio channels in order to create a pattern of constructive and destructive interference in a wave front.
(25) For example, each beam can replicate the direction and polar pattern of each microphone of the capture device. Further, virtual beamformed microphone pickups are generated based on the HRTFs. For example, a plurality of pick-up beams 23 are generated at different directions relative to a head. Each beam can have different characteristics that represent a particular head related transfer function at a particular direction relative to a user's head.
(26) The beamforming filters are generated by the spatial filter mapper such that the beamforming filters map the virtual beamformed speakers to the virtual beamformed microphone pickups. The beamforming filters can be generated at various positions of the capture device relative to the virtual pickup beams. Beamforming filter sets can each correspond to different head positions relative to the capture device. For example, the spatial filters can be determined to map from the capture device to the HRTFs at different rotations or directions of the capture device relative to a virtual listener (and respective HRTFs positioned about the virtual listener).
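One conventional way to compute such mapping filters, offered only as an illustrative sketch and not necessarily the method claimed above, is a per-frequency regularized least-squares fit from the capture array's response to the target HRTF pickup beams:

```python
import numpy as np

def design_mapping_filters(array_response, hrtf_targets, reg=1e-3):
    """Per-frequency regularized least-squares filters mapping a capture
    array's directional response to target left/right HRTF values.

    array_response : (n_freqs, n_dirs, n_mics) array response matrix A
    hrtf_targets   : (n_freqs, n_dirs, 2) left/right HRTF values H
    reg            : Tikhonov regularization weight
    returns        : (n_freqs, n_mics, 2) filters W such that A @ W ~= H
    """
    n_freqs, n_dirs, n_mics = array_response.shape
    W = np.zeros((n_freqs, n_mics, 2), dtype=complex)
    for f in range(n_freqs):
        A = array_response[f]
        # Tikhonov-regularized normal equations: (A^H A + reg*I) W = A^H H
        AhA = A.conj().T @ A + reg * np.eye(n_mics)
        W[f] = np.linalg.solve(AhA, A.conj().T @ hrtf_targets[f])
    return W
```

Repeating this fit for each rotation of the capture device relative to the virtual listener yields one filter set per head position, matching the per-position filter banks described above. The regularization keeps the solution well conditioned at frequencies where the array response is nearly rank deficient.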
(29) The microphone signals may be provided to the processor 152 and to a memory 151 (for example, solid state non-volatile memory) for storage, in digital, discrete time format, by an audio codec. The processor 152 may also communicate with external devices via a communication module 164, for example, to communicate over the internet. The processor 152 can be a single processor or a plurality of processors.
(30) The memory 151 has stored therein instructions that when executed by the processor 152 perform the processes described in the present disclosure. Note that some of these circuit components, and their associated digital signal processes, may be alternatively implemented by hardwired logic circuits (for example, dedicated digital filter blocks, hardwired state machines). The system can include one or more cameras 158, and/or a display 160 (e.g., a head mounted display).
(31) Various aspects described herein may be embodied, at least in part, in software. That is, the techniques may be carried out in an audio processing system in response to its processor executing a sequence of instructions contained in a storage medium, such as a non-transitory machine-readable storage medium (for example DRAM or flash memory). In various aspects, hardwired circuitry may be used in combination with software instructions to implement the techniques described herein. Thus, the techniques are not limited to any specific combination of hardware circuitry and software, or to any particular source for the instructions executed by the audio processing system.
(32) In the description, certain terminology is used to describe features of various aspects. For example, in certain situations, the terms "renderer", "processor", "mapper", "beamformer", "component," "block," "model", "extractor", "selector", and "logic" are representative of hardware and/or software configured to perform one or more functions. For instance, examples of "hardware" include, but are not limited or restricted to an integrated circuit such as a processor (for example, a digital signal processor, microprocessor, application specific integrated circuit, a micro-controller, etc.). Of course, the hardware may be alternatively implemented as a finite state machine or even combinatorial logic. An example of "software" includes executable code in the form of an application, an applet, a routine or even a series of instructions. As mentioned above, the software may be stored in any type of machine-readable medium.
(33) It will be appreciated that the aspects disclosed herein can utilize memory that is remote from the system, such as a network storage device which is coupled to the audio processing system through a network interface such as a modem or Ethernet interface. The buses 162 can be connected to each other through various bridges, controllers and/or adapters as is well known in the art. In one aspect, one or more network device(s) can be coupled to the bus 162. The network device(s) can be wired network devices (e.g., Ethernet) or wireless network devices (e.g., WI-FI, Bluetooth). In some aspects, various aspects described (e.g., extraction of voice and ambience from microphone signals described as being performed at the capture device, or audio and visual processing described as being performed at the playback device) can be performed by a networked server in communication with the capture device and/or the playback device.
(34) Some portions of the preceding detailed descriptions have been presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the ways used by those skilled in the audio processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of operations leading to a desired result. The operations are those requiring physical manipulations of physical quantities. It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the above discussion, it is appreciated that throughout the description, discussions utilizing terms such as those set forth in the claims below, refer to the action and processes of an audio processing system, or similar electronic device, that manipulates and transforms data represented as physical (electronic) quantities within the system's registers and memories into other data similarly represented as physical quantities within the system memories or registers or other such information storage, transmission or display devices.
(35) The processes and blocks described herein are not limited to the specific examples described and are not limited to the specific orders used as examples herein. Rather, any of the processing blocks may be re-ordered, combined or removed, performed in parallel or in serial, as necessary, to achieve the results set forth above. The processing blocks associated with implementing the audio processing system may be performed by one or more programmable processors executing one or more computer programs stored on a non-transitory computer readable storage medium to perform the functions of the system. All or part of the audio processing system may be implemented as special purpose logic circuitry (e.g., an FPGA (field-programmable gate array) and/or an ASIC (application-specific integrated circuit)). All or part of the audio system may be implemented using electronic hardware circuitry that includes electronic devices such as, for example, at least one of a processor, a memory, a programmable logic device or a logic gate. Further, processes can be implemented in any combination of hardware devices and software components.
(36) While certain aspects have been described and shown in the accompanying drawings, it is to be understood that such aspects are merely illustrative of and not restrictive on the broad invention, and the invention is not limited to the specific constructions and arrangements shown and described, since various other modifications may occur to those of ordinary skill in the art. The description is thus to be regarded as illustrative instead of limiting.
(37) To aid the Patent Office and any readers of any patent issued on this application in interpreting the claims appended hereto, applicants wish to note that they do not intend any of the appended claims or claim elements to invoke 35 U.S.C. 112(f) unless the words “means for” or “step for” are explicitly used in the particular claim.
(38) It is well understood that the use of personally identifiable information should follow privacy policies and practices that are generally recognized as meeting or exceeding industry or governmental requirements for maintaining the privacy of users. In particular, personally identifiable information data should be managed and handled so as to minimize risks of unintentional or unauthorized access or use, and the nature of authorized use should be clearly indicated to users.