Systems and methods for dynamic spatial separation of sound objects
12167224 · 2024-12-10
Assignee
Inventors
Cpc classification
H04S2400/11
ELECTRICITY
H04R2227/009
ELECTRICITY
H04S3/008
ELECTRICITY
International classification
Abstract
Sound objects are identified within a content item and location metadata is extracted from the content item for each sound object. A reference layout is generated, relative to a user position, for the sound objects based on the location metadata. If a first sound object is within a threshold angle, relative to the user position, from a second sound object, a virtual position of either the first sound object or the second sound object is adjusted by an adjustment angle.
Claims
1. A method for dynamic spatial separation of sound sources in a content item, the method comprising: identifying, in the content item, a plurality of sound objects; extracting location metadata for each sound object of the plurality of sound objects; generating a reference layout, relative to a user position, for the plurality of sound objects based on the location metadata; determining, based on the reference layout, that a first sound object is within a threshold angle, relative to the user position, of a second sound object; and in response to determining that the first sound object is within the threshold angle of the second sound object, adjusting a virtual position of either the first sound object or the second sound object by an adjustment angle, relative to the user position.
2. The method of claim 1, further comprising: detecting that the virtual position of the sound object has been adjusted; and in response to detecting that the virtual position of the sound object has been adjusted: decreasing the threshold angle; and decreasing the adjustment angle.
3. The method of claim 2, further comprising: determining, after adjusting the virtual position of either the first sound object or the second sound object, that a third sound object is within the decreased threshold angle, relative to the user position, of either the first sound object or the second sound object; and in response to determining, after adjusting the virtual position of either the first sound object or the second sound object, that a third sound object is within the decreased threshold angle, relative to the user position, of either the first sound object or the second sound object, adjusting a virtual position of the third sound object by the decreased adjustment angle, relative to the user position.
4. The method of claim 1, further comprising detecting a gaze of the user.
5. The method of claim 4, further comprising: identifying a subset of the plurality of sound objects, wherein each sound object of the subset of sound objects is located along a path defined by the gaze of the user; and wherein the first sound object and the second sound object are members of the subset.
6. The method of claim 4, further comprising: detecting a second gaze of the user; determining that a difference between the gaze of the user and the second gaze of the user is greater than a threshold difference; and in response to determining that the difference is greater than a threshold difference: identifying a sound object along a path defined by the second gaze of the user; and enhancing audio of the sound object.
7. The method of claim 1, further comprising: scaling the threshold angle based on the distance between the user and a sound object.
8. The method of claim 1, further comprising: analyzing respective audio of each sound object; and determining, based on the analyzing, that the first sound object contains a voice.
9. The method of claim 8, wherein adjusting the virtual position of either the first sound object or the second sound object by an adjustment angle further comprises adjusting the virtual position of the second sound object by the adjustment angle.
10. The method of claim 1, further comprising: determining whether the first sound object is within a minimum angle, relative to the user position, of a third sound object; and determining whether the second sound object is within the minimum angle, relative to the user position, of the third sound object.
11. The method of claim 10, wherein adjusting the virtual position of either the first sound object or the second sound object by the adjustment angle further comprises: in response to determining that the first sound object is within the minimum angle, relative to the user position, of the third sound object, and that the second sound object is not within the minimum angle, relative to the user position, of the third sound object, adjusting the virtual position of the first sound object.
12. The method of claim 10, wherein adjusting the virtual position of either the first sound object or the second sound object by the adjustment angle further comprises: in response to determining that the first sound object is not within the minimum angle, relative to the user position, of the third sound object, and that the second sound object is within the minimum angle, relative to the user position, of the third sound object, adjusting the virtual position of the second sound object.
13. The method of claim 10, wherein adjusting the virtual position of either the first sound object or the second sound object by the adjustment angle further comprises: in response to determining that both the first sound object and the second sound object are within the minimum angle, relative to the user position, of the third sound object: adjusting a virtual position of the first sound object by an adjustment angle, relative to the user position, in a first direction that increases a distance between the first sound object and both the third sound object and the second sound object; and adjusting a virtual position of the second sound object by an adjustment angle, relative to the user position, in a second direction that increases a distance between the second sound object and both the third sound object and the first sound object.
14. A system for dynamic spatial separation of sound sources in a content item, the system comprising: input/output circuitry configured to receive the content item; and control circuitry configured to: identify, in the content item, a plurality of sound objects; extract location metadata for each sound object of the plurality of sound objects; generate a reference layout, relative to a user position, for the plurality of sound objects based on the location metadata; determine, based on the reference layout, that a first sound object is within a threshold angle, relative to the user position, of a second sound object; and in response to determining that the first sound object is within the threshold angle of the second sound object, adjust a virtual position of either the first sound object or the second sound object by an adjustment angle, relative to the user position.
15. The system of claim 14, wherein the control circuitry is further configured to: detect that the virtual position of the sound object has been adjusted; and in response to detecting that the virtual position of the sound object has been adjusted: decrease the threshold angle; and decrease the adjustment angle.
16. The system of claim 15, wherein the control circuitry is further configured to: determine, after adjusting the virtual position of either the first sound object or the second sound object, that a third sound object is within the decreased threshold angle, relative to the user position, of either the first sound object or the second sound object; and in response to determining, after adjusting the virtual position of either the first sound object or the second sound object, that a third sound object is within the decreased threshold angle, relative to the user position, of either the first sound object or the second sound object, adjust a virtual position of the third sound object by the decreased adjustment angle, relative to the user position.
17. The system of claim 14, wherein the control circuitry is further configured to detect a gaze of the user.
18. The system of claim 17, wherein the control circuitry is further configured to: identify a subset of the plurality of sound objects, wherein each sound object of the subset of sound objects is located along a path defined by the gaze of the user; and wherein the first sound object and the second sound object are members of the subset.
19. The system of claim 17, wherein the control circuitry is further configured to: detect a second gaze of the user; determine that a difference between the gaze of the user and the second gaze of the user is greater than a threshold difference; and in response to determining that the difference is greater than a threshold difference: identify a sound object along a path defined by the second gaze of the user; and enhance audio of the sound object.
20. The system of claim 14, wherein the control circuitry is further configured to: scale the threshold angle based on the distance between the user and a sound object.
21. The system of claim 14, wherein the control circuitry is further configured to: analyze respective audio of each sound object; and determine, based on the analyzing, that the first sound object contains a voice.
22. The system of claim 21, wherein the control circuitry configured to adjust the virtual position of either the first sound object or the second sound object by an adjustment angle is further configured to adjust the virtual position of the second sound object by the adjustment angle.
23. The system of claim 14, wherein the control circuitry is further configured to: determine whether the first sound object is within a minimum angle, relative to the user position, of a third sound object; and determine whether the second sound object is within the minimum angle, relative to the user position, of the third sound object.
24. The system of claim 23, wherein the control circuitry configured to adjust the virtual position of either the first sound object or the second sound object by the adjustment angle is further configured to: in response to determining that the first sound object is within the minimum angle, relative to the user position, of the third sound object, and that the second sound object is not within the minimum angle, relative to the user position, of the third sound object, adjust the virtual position of the first sound object.
25. The system of claim 23, wherein the control circuitry configured to adjust the virtual position of either the first sound object or the second sound object by the adjustment angle is further configured to: in response to determining that the first sound object is not within the minimum angle, relative to the user position, of the third sound object, and that the second sound object is within the minimum angle, relative to the user position, of the third sound object, adjust the virtual position of the second sound object.
26. The system of claim 23, wherein the control circuitry configured to adjust the virtual position of either the first sound object or the second sound object by the adjustment angle is further configured to: in response to determining that both the first sound object and the second sound object are within the minimum angle, relative to the user position, of the third sound object: adjust a virtual position of the first sound object by an adjustment angle, relative to the user position, in a first direction that increases a distance between the first sound object and both the third sound object and the second sound object; and adjust a virtual position of the second sound object by an adjustment angle, relative to the user position, in a second direction that increases a distance between the second sound object and both the third sound object and the first sound object.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The above and other objects and advantages of the disclosure will be apparent upon consideration of the following detailed description, taken in conjunction with the accompanying drawings.
DETAILED DESCRIPTION
(18) In the real world, humans often reorient their heads to focus on a particular sound source. In human-to-human conversation, for example, we often turn toward the speaker; we do this not just out of politeness but also because such a head orientation increases intelligibility of the sound. While there is thus a natural improvement of intelligibility due to this action, we can take advantage of orientation-sensing headphones and earbuds to computationally improve intelligibility as well.
(19) With orientation-responsive audio enhancement, the system detects, in real time, the head orientation of the listener and determines the closest spatialized sound object along the user's line of sight (line of hearing in this case). Once that sound object has been identified, it is processed to increase the overall amplitude of the output of that sound object, thus increasing its volume, while also potentially decreasing the amplitude of other sound objects. This allows a sort of selective filtering, in which the sound object the user is orienting toward in the moment is assumed to have the user's attention and focus, and is thus selectively enhanced. This effect mimics the real-world experience of orienting toward a sound source one wishes to attend to.
(22) In some embodiments, the distance between the user and each sound object, or the position of each sound object, may be modified from the location metadata based on the user's physical position relative to a display screen. The location metadata may provide virtual locations for each sound object relative to a camera perspective or to a plane defined by the two-dimensional display of the content item. A user's location relative to the display can be determined (e.g., using a camera, infrared sensor, or position data from a smart device or wearable device of the user) and the virtual locations for each sound object modified to account for the user's location. For example, location metadata for a sound object may indicate a distance from the camera perspective of ten feet. If the user is sitting six feet from the display, the virtual location of the sound object can be modified to place it sixteen feet away from the user.
(23) A user's viewing angle of the display may also be used to modify the virtual locations of sound objects. For example, the location metadata may indicate a virtual position of a sound object relative to the center of the camera perspective. However, the user may not be positioned directly in front of the display. The user's distance from the display and an angle between a line from the center of the display extending perpendicular to the display and a line from the user's position to the center of the display can be used to triangulate a new virtual position for the sound object.
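The position correction described above can be sketched as follows. This is a minimal two-dimensional illustration, not the disclosed implementation: the coordinate frame (display center at the origin, +y pointing into the scene), the sign convention for the viewing angle, and the function name `position_relative_to_user` are all assumptions made for the example.

```python
import math

def position_relative_to_user(obj_xy, user_distance, viewing_angle_deg=0.0):
    """Return (dx, dy, distance) of a sound object as seen from the user.

    Assumed frame: display center at the origin, +y pointing into the scene
    (away from the viewer), +x to the viewer's right when facing the display.
    A positive viewing angle places the user to the right of the display normal.
    """
    theta = math.radians(viewing_angle_deg)
    # The user sits user_distance from the display center, off the normal by theta.
    user_x = user_distance * math.sin(theta)
    user_y = -user_distance * math.cos(theta)
    dx = obj_xy[0] - user_x
    dy = obj_xy[1] - user_y
    return dx, dy, math.hypot(dx, dy)

# Object ten feet behind the display center, user six feet directly in front.
print(position_relative_to_user((0.0, 10.0), 6.0))
```

With the user directly in front of the display, the sketch reproduces the sixteen-foot distance from the example above; a non-zero viewing angle shifts both the distance and the direction of the object.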
(25) If the user turns their gaze, or otherwise orients themselves toward a specific sound object, audio of the specific sound object is enhanced. For example, as shown in reference layout 350, the user has turned 352 their gaze 354. Path 356 is extrapolated from the new gaze direction. If path 356 is within a threshold angle, relative to the user's position, of a specific sound object, audio from that specific sound object is enhanced 358.
(26) It is noted that, in a listening environment in which multiple users are using headphones, this effect can be differentially applied to multiple users at once. Each user has their own head orientation, and the sound objects in the AV track can be modified separately for each user.
(27) In some embodiments, the users themselves may be considered as sound objects, with a similar process applied. During a video presentation in which multiple users are present and using headphones, they may turn to attend to specific voices or sounds in the audio track, then turn toward another user in order to attend to that user. In this case, audio detected by the focused user is added into the audio mix and selectively enhanced for any user also oriented toward that user. This aspect provides the orientation-responsive audio enhancement of the first aspect, while allowing users to converse with each other, using the same ability to shift auditory focus among either virtual sound objects or other users that are nearby in the real world.
(28) It is noted that this embodiment depends on the ability of the system to process the relative locations of the multiple users in the environment. Such an ability may be accomplished via Ultra Wideband (UWB) ranging and positioning, or other means.
(30) As noted earlier, human audio perception is impacted by spatialization; in particular, humans' ability to selectively focus on a single voice in a noisy environment is enhanced when sound is spatially located, and further when there is spatial separation among sound sources (so that the sounds appear to be coming from distinct locations rather than the same point in space). Thus, some embodiments of the present disclosure adjust effective spatial separation among clustered audio effects. When multiple sound objects are at the same, or nearly the same, location, the system adjusts their virtual positioning to move them slightly apart from each other. Further, this can be done to specifically isolate voice tracks, by moving sound effects such as explosions and others away from voice. Note that as sound objects are nudged into new virtual locations, they may come into contact with other sound objects, meaning that those other sound objects may have to have their positions adjusted. Thus, there may be a repeated process as the best locations for sound objects are determined. A number of algorithms could be employed to arrive at such best positioning (including force-directed layout, simulated thermal annealing, and others).
(31) At a high level, this process identifies sound objects with a similar angular alignment (as determined by the adjustment window) and attempts to move them to new locations by adjusting their positions by the adjustment factor. If any object is moved, it may be newly brought into overlap with some existing sound object, so the process repeats, decreasing both the window and the factor to attempt to create a new layout by using a smaller nudge. The process repeats until no sound objects have been moved, or the factor is decreased to zero, in which case overlaps may still exist but cannot be nudged further without moving the sound objects too far from their natural positions.
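The loop described above might look like the following sketch. Several details are assumptions made for illustration: positions are reduced to a single azimuth in degrees, each overlapping pair is pushed apart symmetrically, the window and factor shrink by a constant ratio `shrink`, and a floor `min_factor` stands in for the factor reaching zero. Wrap-around at ±180 degrees is ignored.

```python
def separate(azimuths, window=5.0, factor=2.0, shrink=0.5, min_factor=0.1):
    """Nudge sound objects apart in azimuth (degrees) until no pair overlaps.

    azimuths: dict of object name -> azimuth relative to the user.
    Returns a new dict; the input is left unchanged.
    """
    layout = dict(azimuths)
    while factor >= min_factor:
        names = sorted(layout, key=layout.get)
        moved = False
        for left, right in zip(names, names[1:]):
            if layout[right] - layout[left] < window:
                # Push the pair apart, half of the adjustment each way.
                layout[left] -= factor / 2
                layout[right] += factor / 2
                moved = True
        if not moved:
            break  # no overlaps remain
        # Something moved, so retry with a smaller window and a smaller nudge.
        window *= shrink
        factor *= shrink
    return layout

print(separate({"voice": 0.0, "horn": 1.0, "siren": 20.0}))
```

A force-directed or annealing-based layout, as mentioned above, could replace the pairwise nudge without changing the outer loop.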
(33) Upon determining that sound objects 504, 506, and 508 fall within angular slice 512 of the user's visual field, the virtual positions of one or more of sound objects 504, 506, and 508 may be adjusted. The amount of adjustment corresponds to an adjustment angle. For example, the adjustment angle may be two degrees. The adjustment angle may also be a dynamic value that changes based on the distance between the user and the virtual position of the sound object to be moved. For example, a sound object close to the user position may need a larger angular adjustment to move by a sufficient linear distance than a sound object that is farther from the user position. In the example of
(34) Frequency spreading can also be employed to enhance intelligibility of audio. Many people have experienced the situation where a high-pitched sound is heard, and changes in head position are used to try to determine where the sound is coming from. This happens because such a high-pitched sound has a very narrow frequency spectrum, which means that it is difficult to localize. In contrast, wide band sound contains elements at a range of frequencies, which are easier for humans to localize. By reorienting our heads we essentially are taking measurements of how the frequencies in that sound are modified by our ear shape at different orientations, giving us different auditory perspectives on that sound and improving our localization ability.
(35) This aspect of the invention applies this notion of broadband sound being easier to localize (and hence easier to separate from other background sounds) by using dynamic frequency spreading to enhance localization performance. Sounds with a narrow spectrum are processed in order to create frequency components above and below the central frequency point, effectively spreading the sound over a broader frequency spectrum. The result is a sound that is perceptually similar to the original but yields better human localization performance. Sounds with a limited frequency spectrum are identified by applying a pre-defined cutoff: for example, sounds that have a frequency range within 200 Hz. Such sounds would be candidates for frequency spreading. When such sounds are identified, the mean frequency for the sound, which represents the center of the frequency spectrum for the sound, is computed. Then, for each frequency component both above and below this center, a scaling factor is applied in order to spread the sound frequencies more broadly around this center.
(36) Higher scaling factors should yield better localization performance, yet may result in sounds that are perceptibly different from the original. Thus, one variant on this process may be to choose a scaling factor based on the frequency spectrum of the original sound. An original sound that is close to the cutoff threshold may need less processing (a lower scaling factor) than an original sound that is extremely narrow (and which thus requires a larger scaling factor).
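One way to choose a scaling factor from the bandwidth of the original sound is sketched below. The linear interpolation and the upper bound `max_factor` are assumptions chosen for illustration; the disclosure states only that a sound near the cutoff may need a lower scaling factor than an extremely narrow one.

```python
def spreading_factor(bandwidth_hz, cutoff_hz=200.0, max_factor=1.5):
    """Return a multiplicative spreading factor (1.0 means no spreading).

    max_factor is a hypothetical upper bound applied to a pure tone.
    """
    if bandwidth_hz >= cutoff_hz:
        return 1.0  # wide enough already; not a candidate for spreading
    # 0.0 at the cutoff, 1.0 for a zero-bandwidth (pure) tone.
    narrowness = 1.0 - bandwidth_hz / cutoff_hz
    return 1.0 + (max_factor - 1.0) * narrowness

for bw in (0.0, 100.0, 200.0):
    print(bw, spreading_factor(bw))
```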
(38) In some embodiments, a user may be consuming the content item on a device that is capable of displaying text or graphics overlaid on the content. For example, a device enabled for augmented reality (AR) or virtual reality (VR) may present a user with additional information or content relating to one or more portions of the content item. If multiple sound objects are located along a path defined by a user's gaze, additional information for each sound object and/or representations of each sound object can be generated for display. The user can then select which object's audio is to be enhanced by turning their gaze toward one set of information, additional content, or representation being displayed.
(41) Transceiver circuitry 806 in turn transmits 808 the media content to control circuitry 810. Control circuitry 810 may be based on any suitable processing circuitry and comprises control circuits and memory circuits, which may be disposed on a single integrated circuit or may be discrete components. As referred to herein, processing circuitry should be understood to mean circuitry based on one or more microprocessors, microcontrollers, digital signal processors, programmable logic devices, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), etc., and may include a multi-core processor (e.g., dual-core, quad-core, hexa-core, or any suitable number of cores). In some embodiments, processing circuitry may be distributed across multiple separate processors or processing units, for example, multiple of the same type of processing units (e.g., two Intel Core i7 processors) or multiple different processors (e.g., an Intel Core i5 processor and an Intel Core i7 processor). Control circuitry 810 receives the media content at media processing circuitry 812.
(42) Media processing circuitry 812 processes the media content to extract video and audio data and generate video and audio signals for output. Media processing circuitry 812 also extracts metadata from the media content. The metadata includes location data for each sound object in the media content for each frame of the media content. The location data may be used by surround sound systems to locate sound corresponding to each sound object within a physical space in which the surround sound system is installed, and output the sounds using the correct output devices to simulate the presence of the sound object at the identified location. This may be accomplished using spatial processing circuitry 814.
(43) When multiple sounds are played concurrently, it can be difficult for a user to understand or process them simultaneously, and the user may use natural motions to compensate. For example, the user may tilt their head or turn their ear toward the source of the sound, to try to better hear the sound. The user's gaze may shift from being generally centered on the display of the media content to being focused on a specific sound object within the media content.
(44) Spatial processing circuitry 814 generates a reference layout of sound objects in a virtual space based on the location metadata. For example, the user's position is set as the origin (i.e., coordinates (0,0) in a two-dimensional reference layout or coordinates (0,0,0) in a three-dimensional reference layout), and each sound object is placed in a position around the user according to its location metadata. As will be discussed below, some modification of sound object positions relative to the user position may be made based on the user's position relative to a display of the media content.
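A two-dimensional reference layout of this kind can be sketched as below. The `SoundObject` class, the axis convention (+y straight ahead of the user, +x to the user's right), and the helper `close_pairs` are hypothetical names introduced for the example; the sketch also shows how such a layout supports checking whether two sound objects fall within a threshold angle of each other as seen from the origin.

```python
import math
from dataclasses import dataclass

@dataclass
class SoundObject:
    name: str
    x: float  # to the user's right
    y: float  # straight ahead of the user

    @property
    def azimuth_deg(self):
        # 0 degrees is straight ahead; positive angles are to the user's right.
        return math.degrees(math.atan2(self.x, self.y))

    @property
    def distance(self):
        return math.hypot(self.x, self.y)

def close_pairs(layout, threshold_deg):
    """Return name pairs whose angular separation, seen from the origin, is below the threshold."""
    pairs = []
    for i, a in enumerate(layout):
        for b in layout[i + 1:]:
            diff = abs(a.azimuth_deg - b.azimuth_deg)
            diff = min(diff, 360.0 - diff)  # handle wrap-around at +/-180 degrees
            if diff < threshold_deg:
                pairs.append((a.name, b.name))
    return pairs

layout = [
    SoundObject("voice", 0.0, 10.0),
    SoundObject("horn", 0.5, 10.0),
    SoundObject("siren", 8.0, 8.0),
]
print(close_pairs(layout, 5.0))
```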
(45) Media device 800 receives 816 user position information from one or more sensors, such as camera 818, inertial measurement unit 820, and accelerometer 822. Camera 818 may be external to, or integrated with, media device 800. Inertial measurement unit 820 and accelerometer 822 may be integrated with media device 800, or with a device worn by, or in the possession of, the user, such as a smartphone, smartwatch, headphones, headset, or other device. The position information is received at transceiver circuitry 806, which in turn transmits 824 the position information to orientation tracking circuitry 826. Orientation tracking circuitry 826 determines where the user's gaze or attention is focused. For example, using camera 818, orientation tracking circuitry may determine the direction of the user's gaze using facial recognition, pupil tracking, or other techniques. Orientation tracking circuitry may use data from inertial measurement unit 820 and/or accelerometer 822 to determine a position and/or pose of the user's head.
(46) Orientation tracking circuitry 826 transmits 828 information relating to the user's gaze direction and/or orientation to spatial processing circuitry 814. Media processing circuitry 812 also transmits 830 the location information extracted from metadata of the content item to spatial processing circuitry 814. Spatial processing circuitry 814 determines a path along which the user is focused. This may be a gaze path based on the direction the user is gazing or may be a path from the user's ear toward a sound source. Spatial processing circuitry 814 projects the path through the reference layout. Depending on the user's position relative to a display of the content item, spatial processing circuitry 814 may modify the path, or may modify the positions of sound objects in the reference layout. For example, the gaze of a user seated to the left of a display screen will be directed toward the user's right in order to focus on the center of the display. In contrast, a user seated directly in front of the display will have a forward gaze when focused on the center of the display. Thus, the angle at which a user focuses on the display must be corrected, and the gaze line shifted, to account for the user's position relative to the display.
(47) Spatial processing circuitry 814 determines, based on the location information and the gaze line, which sound objects intersect the gaze line or are within a threshold angle, relative to the user's position, of the gaze line. For example, spatial processing circuitry 814 determines whether the gaze line passes through coordinates occupied by a sound object. Spatial processing circuitry 814 may generate secondary bounding lines that diverge from the gaze line by a threshold angle, such as five degrees, in any direction. Spatial processing circuitry 814 then determines whether any sound object falls within a two-dimensional slice or three-dimensional sector of the content item defined by the secondary bounding lines.
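The test against the secondary bounding lines can be sketched in two dimensions as follows. The function name `objects_along_gaze`, the dictionary input format, and the nearest-first ordering of the result are assumptions for illustration; the five-degree default follows the example above.

```python
import math

def objects_along_gaze(objects, gaze_deg, threshold_deg=5.0):
    """Return names of objects within threshold_deg of the gaze line, nearest first.

    objects: dict name -> (x, y) with the user at the origin and +y straight ahead.
    gaze_deg: gaze direction, 0 straight ahead, positive to the user's right.
    """
    hits = []
    for name, (x, y) in objects.items():
        azimuth = math.degrees(math.atan2(x, y))
        # Signed angular difference folded into [-180, 180), then made absolute.
        off_axis = abs((azimuth - gaze_deg + 180.0) % 360.0 - 180.0)
        if off_axis <= threshold_deg:
            hits.append((math.hypot(x, y), name))
    return [name for _, name in sorted(hits)]

scene = {"voice": (0.0, 10.0), "horn": (0.2, 4.0), "siren": (8.0, 8.0)}
print(objects_along_gaze(scene, 0.0))
print(objects_along_gaze(scene, 45.0))
```

A result with exactly one name corresponds to the single-object case, where that object's audio is enhanced; a longer result corresponds to the case where objects are repositioned.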
(48) If one sound object is located along the gaze path, spatial processing circuitry 814 transmits 832 an instruction to media processing circuitry 812 to enhance audio of the sound object. In response to the instruction, media processing circuitry 812 performs the requested audio enhancement. For example, while processing the audio of the media content for output, media processing circuitry 812 may raise the volume of audio of the sound object. Alternatively or additionally, media processing circuitry 812 may reduce volume of other sound objects in the media content. Media processing circuitry 812 then transmits 834 the video signal and enhanced audio signal to media output circuitry 836. Media output circuitry 836 may be a display driver and/or speaker driver. In some embodiments, media output circuitry 836 may receive 838 instructions from spatial processing circuitry 814 to modify the output of audio to one or more channels of a surround sound system. Media output circuitry 836 then outputs 840 the video and enhanced audio to the user.
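The volume adjustment can be illustrated with per-object gains. This is a sketch only: the decibel values `boost_db` and `duck_db` and the function name `focus_gains` are hypothetical, and the disclosure does not specify how much the focused object is raised or the others reduced.

```python
def focus_gains(names, focused, boost_db=6.0, duck_db=-3.0):
    """Return a linear gain per sound object: boost the focused one, duck the rest."""
    def to_linear(db):
        # Convert a decibel change in amplitude to a linear multiplier.
        return 10.0 ** (db / 20.0)
    return {n: to_linear(boost_db if n == focused else duck_db) for n in names}

gains = focus_gains(["voice", "horn", "siren"], focused="voice")
print(gains)
```

Each object's samples would be multiplied by its gain before mixing; setting `duck_db` to 0 leaves the other objects unchanged.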
(49) In some embodiments, other users present in physical proximity to the user may also be considered sound objects. Media device 800 may receive 842, using transceiver circuitry 806, location data for each user from user location database 844. User location database 844 may be managed by a mobile network operator, internet service provider, or other third party. Alternatively, user location database 844 may be local to media device 800 and may be populated with location data for each user through user location detection methods such as Ultra-Wideband ranging and positioning, GPS signals, etc. Transceiver circuitry 806 transmits 846 the location data to spatial processing circuitry 814, where it is combined with the location metadata extracted from the content item. When the user focuses on one of the other users present, audio captured from that user is enhanced, rather than audio from within the content item.
(50) If more than one sound object is located along the gaze path, or within the slice or sector defined by the secondary bounding lines, spatial processing circuitry 814 repositions one or more sound objects to improve intelligibility of audio of a single sound object. If two sound objects are along the gaze path, spatial processing circuitry 814 may determine which of the two sound objects the user is trying to focus on. For example, one sound object may contain a voice track and the other sound object contains a noise, such as a car horn, siren, or other background noise. Spatial processing circuitry 814, upon identifying the sound object being focused on, may adjust the virtual position of the other sound object by a linear distance corresponding to an adjustment angle. For example, the adjustment angle may be two degrees. The virtual position of the sound object is thus adjusted by a linear distance corresponding to an angular distance of two degrees from the gaze path and the distance from the plane of display of the content. If three or more sound objects are along the gaze path, those on which the user is determined not to be focused are moved away from the sound object of focus and from each other.
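The relationship between an adjustment angle and the resulting linear distance can be sketched as below. The sketch assumes the object is rotated about the user at a fixed distance and measures the straight-line (chord) displacement; using the distance from the user rather than from the display plane is a simplification, and both function names are hypothetical.

```python
import math

def linear_shift(distance, adjustment_deg=2.0):
    """Straight-line displacement of an object rotated adjustment_deg about the user."""
    return 2.0 * distance * math.sin(math.radians(adjustment_deg) / 2.0)

def angle_for_shift(distance, min_shift):
    """Smallest adjustment angle (degrees) giving at least min_shift of displacement.

    distance must be positive; the ratio is clamped so asin stays in its domain.
    """
    ratio = min(1.0, min_shift / (2.0 * distance))
    return math.degrees(2.0 * math.asin(ratio))

print(linear_shift(10.0))          # displacement at ten units for two degrees
print(angle_for_shift(2.0, 0.35))  # a nearer object needs a larger angle
print(angle_for_shift(10.0, 0.35))
```

The second function shows why a nearer object needs a larger adjustment angle than a farther one to achieve the same linear separation.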
(51) After adjusting the virtual position of a sound object, the threshold angle that defines the secondary bounding lines is reduced. The adjustment angle by which virtual positions of sound objects are adjusted is also reduced. The process above is then repeated iteratively until no further adjustments are necessary (i.e., no sound objects are too close to each other) or the adjustment angle or threshold angle reaches a minimum value.
(52) In some embodiments, when multiple sound objects are along the gaze path, representations of each sound object may be generated for display. The user can then select which of the represented sound objects should have its audio enhanced. Media device 800 may receive 848, using transceiver circuitry 806, an input from the user selecting a sound object. Transceiver circuitry 806 transmits 850 the selection to spatial processing circuitry 814. Spatial processing circuitry 814 then instructs media processing circuitry 812 or media output circuitry 836 as discussed above.
(53) In some embodiments, media processing circuitry 812 determines a frequency bandwidth of audio of one or more sound objects. For example, media processing circuitry 812 may determine a frequency bandwidth of audio of the selected sound object on which the user is focused. Alternatively, media processing circuitry 812 may process audio from every sound object in the media content. If the frequency bandwidth of audio of a sound object is below a threshold frequency (e.g., 200 Hz), media processing circuitry 812 performs a frequency spreading operation. Media processing circuitry 812 converts the audio from a time-domain signal to a frequency-domain signal. This may be accomplished using a Fourier transform operation. Media processing circuitry 812 can thus identify each frequency component of the audio signal. Media processing circuitry 812 determines a mean frequency of the signal. Each frequency component below the mean frequency is multiplied by a first scaling factor (e.g., a value between zero and one) to generate additional frequency components below the mean frequency. Similarly, each frequency component above the mean frequency is multiplied by a second scaling factor (e.g., a value greater than one) to generate additional frequency components above the mean frequency. The resulting frequency-domain signal is then converted back to a time-domain signal to generate a new audio signal. Media processing circuitry 812 then transmits 834 the new audio signal for the sound object to media output circuitry 836 in place of the original audio signal for that sound object.
(54)
(55) At 902, control circuitry 810 identifies, within a content item, a plurality of sound objects. For example, control circuitry 810 may extract, from the content item, metadata describing the plurality of sound objects. Alternatively, control circuitry 810 may analyze audio and/or video data of the content item to identify objects emitting sound, including sounds from sources that are not depicted in the video data.
(56) At 904, control circuitry 810 initializes a counter variable N, setting its value to one, and a variable T representing the number of identified sound objects. At 906, control circuitry 810 extracts, from the content item, location metadata for the N.sup.th sound object. For example, the content item may include metadata for each of the plurality of sound objects. The N.sup.th sound object may have a corresponding identifier used to extract metadata specific to the N.sup.th sound object. At 908, control circuitry 810 determines whether N is equal to T, meaning that location metadata for all identified sound objects has been extracted from the content item. If N is not equal to T (No at 908), then, at 910, control circuitry 810 increments the value of N by one, and processing returns to 906.
(57) If N is equal to T (Yes at 908), then, at 912, control circuitry 810 generates a reference layout, relative to a user position, for the plurality of sound objects. For example, control circuitry 810 may generate a two-dimensional or three-dimensional virtual space in which each sound object is placed according to its location metadata. The user is placed in a position within the virtual space from which positions of all sound objects are calculated. For example, in a two-dimensional reference layout, the user may be placed at coordinates (0,0).
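As an illustration only, a two-dimensional reference layout of the kind described above might be sketched as follows. The `SoundObject` type, the coordinate convention (x to the user's right, y straight ahead), and the function name `build_reference_layout` are assumptions made for this sketch and do not appear in the disclosure.

```python
import math
from dataclasses import dataclass


@dataclass
class SoundObject:
    name: str
    x: float  # hypothetical: lateral position, positive to the user's right
    y: float  # hypothetical: forward position, positive in front of the user


def build_reference_layout(objects, user=(0.0, 0.0)):
    """Return {name: (azimuth_degrees, distance)} relative to the user position."""
    layout = {}
    for obj in objects:
        dx, dy = obj.x - user[0], obj.y - user[1]
        # Azimuth is measured from straight ahead (0 degrees), positive to the right.
        azimuth = math.degrees(math.atan2(dx, dy))
        layout[obj.name] = (azimuth, math.hypot(dx, dy))
    return layout


layout = build_reference_layout(
    [SoundObject("voice", 0.0, 10.0), SoundObject("siren", 10.0, 10.0)]
)
```

Storing an azimuth and a distance for each sound object keeps later angular comparisons, such as a threshold-angle test, to a simple subtraction.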
(58) At 914, control circuitry 810 detects a gaze of the user. Control circuitry 810 may receive position and/or orientation data of the user from a variety of sensors or sources, such as cameras, inertial measurement devices, and accelerometers. This information is used to determine where the user is looking. At 916, control circuitry 810 identifies, based on the reference layout, a sound object along a path defined by the gaze of the user. Control circuitry 810 generates a path based on the direction of the user's gaze. The path may be superimposed on the reference layout. Control circuitry 810 compares the position of sound objects with the gaze path to determine if a specific sound object falls along the path.
(59) At 918, control circuitry 810 enhances audio of the identified sound object. If a sound object is identified along the path of the user's gaze, audio of that sound object is enhanced. Control circuitry 810 may increase the volume of audio of the sound object. Alternatively or additionally, control circuitry 810 may reduce volume of other sound objects. Other types of enhancements may be performed, including dynamic spatial separation of clustered sound objects and frequency spreading of narrow-band sounds. These are discussed further below.
(60) The actions or descriptions of
(61)
(62) At 1002, control circuitry 810 determines whether more than one sound object is located along the path defined by the gaze of the user. Control circuitry 810 may compare the gaze path with the location metadata and/or the reference layout and identify each sound object that is along the gaze path. In some embodiments, sound objects are considered to be along the gaze path if they fall within a linear distance corresponding to a threshold angle, relative to the user position, from the gaze path.
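By way of a non-limiting sketch, the test for whether a sound object is along the gaze path can be expressed as an angular comparison. The sketch assumes each sound object is already described by an azimuth in degrees relative to the user position, and compares angles directly rather than the equivalent linear distances. The function names and the five-degree default are illustrative assumptions.

```python
def angular_offset(gaze_deg, azimuth_deg):
    """Smallest absolute angle, in degrees, between the gaze direction and an object."""
    # Wrap the difference into [-180, 180) so that 179 and -179 are 2 degrees apart.
    diff = (azimuth_deg - gaze_deg + 180.0) % 360.0 - 180.0
    return abs(diff)


def objects_along_gaze(azimuths, gaze_deg, threshold_deg=5.0):
    """Return names of objects whose azimuth is within threshold_deg of the gaze."""
    return [
        name
        for name, azimuth in azimuths.items()
        if angular_offset(gaze_deg, azimuth) <= threshold_deg
    ]


azimuths = {"voice": 1.0, "car_horn": -3.0, "siren": 40.0}
hits = objects_along_gaze(azimuths, gaze_deg=0.0)
more_than_one = len(hits) > 1  # corresponds to the determination at 1002
```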
(63) If more than one sound object is located along the gaze path (Yes at 1002), then, at 1004, control circuitry 810 enhances audio of a first sound object that is along the gaze path. At 1006, control circuitry 810 determines whether a negative input has been received. Control circuitry 810 may receive negative inputs from the user from a physical input device, such as a keyboard or touchscreen, or may capture gestures or speech of the user. For example, the user may shake their head to indicate a negative response to enhancement of a sound object. If a negative input has not been received (No at 1006), then control circuitry 810 continues to enhance audio of the first sound object until the end of audio of the first sound object.
(64) If a negative input has been received (Yes at 1006), then, at 1008, control circuitry 810 stops enhancement of audio of the first sound object. At 1010, control circuitry 810 selects a second sound object for enhancement. Then, at 1012, control circuitry 810 enhances audio of the second sound object.
(65) In some embodiments, control circuitry 810 may first analyze the audio of all the sound objects along the path and rank them in order of likelihood of being the target of the user's focus. For example, audio of one sound object may contain a voice. Control circuitry 810 may determine that the user is most likely to be focusing on the voice and rank that sound object highest. When determining candidate sound objects for enhancement, control circuitry 810 may rank each sound object in order of likelihood of being the target of the user's focus. If a negative input is received for a sound object having a first rank, control circuitry 810 selects the sound object having the next highest rank, moving down in rank until no negative inputs are received.
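The ranking and fallback behavior described above may be sketched as follows. The `score` field stands in for whatever likelihood estimate an embodiment uses and is purely hypothetical, as are the function names.

```python
def rank_candidates(candidates):
    """Sort candidates by a hypothetical likelihood score, highest first."""
    return sorted(candidates, key=lambda c: c["score"], reverse=True)


def select_for_enhancement(candidates, rejected):
    """Return the highest-ranked candidate for which no negative input was received."""
    for candidate in rank_candidates(candidates):
        if candidate["name"] not in rejected:
            return candidate["name"]
    return None  # every candidate was rejected


candidates = [
    {"name": "siren", "score": 0.4},
    {"name": "voice", "score": 0.9},  # a voice is assumed most likely to be the focus
    {"name": "car_horn", "score": 0.2},
]
first_choice = select_for_enhancement(candidates, rejected=set())
second_choice = select_for_enhancement(candidates, rejected={"voice"})
```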
(66) The actions or descriptions of
(67)
(68) At 1102, control circuitry 810 determines whether more than one sound object is located along the path defined by the gaze of the user. Control circuitry 810 may compare the gaze path with the location metadata and/or the reference layout and identify each sound object that is along the gaze path. In some embodiments, sound objects are considered to be along the gaze path if they fall within a linear distance corresponding to a threshold angle, relative to the user position, from the gaze path.
(69) If more than one sound object is located along the path defined by the gaze of the user (Yes at 1102), then, at 1104, control circuitry 810 initializes a counter variable N, setting its value to one, and a variable T representing the number of sound objects located along the path. At 1106, control circuitry 810 analyzes audio of the N.sup.th sound object and determines its audio characteristics. At 1108, control circuitry 810 determines, based on the audio characteristics, whether the N.sup.th sound object contains a voice. If the N.sup.th sound object contains a voice (Yes at 1108), then, at 1110, control circuitry 810 enhances audio of the N.sup.th sound object. If the N.sup.th sound object does not contain a voice (No at 1108), then, at 1112, control circuitry 810 determines whether N is equal to T, meaning that audio for each of the sound objects has been analyzed. If N is not equal to T (No at 1112), then, at 1114, control circuitry 810 increments the value of N by one, and processing returns to 1106. If N is equal to T (Yes at 1112), then the process ends.
(70) The actions or descriptions of
(71)
(72) At 1202, control circuitry 810 determines whether more than one sound object is located along the path defined by the gaze of the user. Control circuitry 810 may compare the gaze path with the location metadata and/or the reference layout and identify each sound object that is along the gaze path. In some embodiments, sound objects are considered to be along the gaze path if they fall within a linear distance corresponding to a threshold angle, relative to the user position, from the gaze path.
(73) If more than one sound object is located along the path defined by the gaze of the user (Yes at 1202), then, at 1204, control circuitry 810 generates for display a representation of each sound object located along the path. For example, control circuitry 810 may generate for display an outline or highlight of each sound object overlaid on the content item. Alternatively or additionally, identifying text may be generated for display. In some embodiments, the representations are generated for display on a second display device.
(74) At 1206, control circuitry 810 detects a second gaze of the user. This may be accomplished using gaze detection methods described above. At 1208, control circuitry 810 identifies a representation of a sound object along a second gaze path defined by the second gaze. For example, the positions of each representation may be added to the reference layout and the second gaze path compared with those positions. As another example, the second gaze path may be used to identify a position on the display at which the user is focused. Control circuitry 810 then identifies the representation that is along, or closest to, the second gaze path.
(75) The actions or descriptions of
(76) At 1302, control circuitry 810 identifies, within a content item, a plurality of sound objects. This may be accomplished using methods described above in connection with
(77) At 1314, control circuitry 810 determines whether a first sound object is within a threshold angle, relative to the user position, of a second sound object. Control circuitry 810 may generate secondary bounding lines that diverge from the gaze line by a threshold angle, such as five degrees, in any direction. Control circuitry 810 then determines whether any sound object falls within a two-dimensional slice or three-dimensional sector of the reference layout defined by the secondary bounding lines.
(78) If a first sound object is within the threshold angle, relative to the user position, of a second sound object (Yes at 1314), then, at 1316, control circuitry 810 adjusts a virtual position of either the first sound object or the second sound object by an adjustment angle, relative to the user position. For example, the adjustment angle may be two degrees. The virtual position of the sound object is thus adjusted by a linear distance corresponding to an angular distance of two degrees from the gaze path and the distance from the plane of display of the content. The adjustment angle may also be a dynamic value that changes based on the distance between the user and the virtual position of the sound object to be moved. For example, if the sound object to be moved is close to the user position, a larger angular distance may be needed to move the sound object by a sufficient linear distance than for a sound object that is farther from the user position.
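The relationship between an adjustment angle and the resulting linear displacement can be illustrated with a short sketch. It assumes a two-dimensional layout in which the sound object is rotated about the user position, so the displacement is the chord of the rotation. The function names are hypothetical.

```python
import math


def rotate_about_user(position, angle_deg, user=(0.0, 0.0)):
    """Rotate a 2-D position counterclockwise about the user by angle_deg."""
    dx, dy = position[0] - user[0], position[1] - user[1]
    a = math.radians(angle_deg)
    return (
        user[0] + dx * math.cos(a) - dy * math.sin(a),
        user[1] + dx * math.sin(a) + dy * math.cos(a),
    )


def chord_length(distance, angle_deg):
    """Linear displacement produced by rotating through angle_deg at a given distance."""
    return 2.0 * distance * math.sin(math.radians(angle_deg) / 2.0)


def angle_for_displacement(distance, displacement):
    """Angle, in degrees, needed to move an object by `displacement` at `distance`."""
    return math.degrees(2.0 * math.asin(displacement / (2.0 * distance)))


original = (0.0, 10.0)
moved = rotate_about_user(original, 2.0)
shift = math.hypot(moved[0] - original[0], moved[1] - original[1])
```

For the same required displacement, `angle_for_displacement` returns a larger angle at a short distance than at a long one, which matches the dynamic adjustment angle described above.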
(79) At 1318, control circuitry 810 determines whether either the threshold angle or the adjustment angle is at a minimum value. Below a minimum value, adjustments of virtual positions of sound objects cease to effectively separate the sound objects. When sound objects are repositioned, the distance between sound objects considered to be too close together must be reduced so that the adjusted positions do not end up being considered too close to other sound objects. If neither the threshold angle nor the adjustment angle has reached a minimum value (No at 1318), then, at 1320, the threshold angle and adjustment angle are decreased. Processing then returns to 1314 and proceeds iteratively until the threshold angle or adjustment angle reaches a minimum value (Yes at 1318) or there are no more sound objects within the threshold angle, relative to the user position, of the second sound object (No at 1314).
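One possible reading of the iterative loop of steps 1314 through 1320 is sketched below. The sketch makes several assumptions not stated in the disclosure: sound objects are represented only by azimuths in degrees, every object other than the object of focus is pushed directly away from it, and both angles are decreased by a fixed multiplicative `decay` factor. The default values and the name `separate` are illustrative.

```python
def separate(azimuths, focus, threshold=5.0, adjustment=2.0,
             min_threshold=1.0, min_adjustment=0.5, decay=0.8):
    """Push non-focus objects away from the focus object until none is too close."""
    result = dict(azimuths)
    while True:
        close = [
            name for name, azimuth in result.items()
            if name != focus and abs(azimuth - result[focus]) < threshold
        ]
        if not close:  # No at 1314: nothing is within the threshold angle
            break
        for name in close:  # 1316: move each close object away from the focus
            direction = 1.0 if result[name] >= result[focus] else -1.0
            result[name] += direction * adjustment
        if threshold <= min_threshold or adjustment <= min_adjustment:
            break  # Yes at 1318: a minimum value has been reached
        # 1320: decrease both angles, never going below their minimum values
        threshold = max(min_threshold, threshold * decay)
        adjustment = max(min_adjustment, adjustment * decay)
    return result


start = {"voice": 0.0, "car_horn": 1.0, "siren": -1.0, "far": 30.0}
separated = separate(start, focus="voice")
```

Because the threshold angle shrinks on every pass and the loop exits once a minimum is reached, the loop always terminates.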
(80) The actions or descriptions of
(81)
(82) At 1402, control circuitry 810 determines a distance between the user and a display on which the content item is being output. For example, control circuitry 810 may use a camera, infrared sensor, or other imaging or ranging sensor to determine a distance between the user and the display. Alternatively, control circuitry 810 may separately determine distances from a first point (e.g., a sensor location) to the user and to the display. Control circuitry 810 may then calculate a distance between the user and the display based on these distances.
(83) At 1404, control circuitry 810 calculates, based on the determined distance and the reference layout, a perceived distance between the user and a sound object. For example, the distance between the user and the display may be five feet, and the reference layout may indicate that a sound object is located ten feet from the plane of the display. Thus, the perceived distance between the user and the sound object is fifteen feet. Control circuitry 810 may also determine angles between the user and the display and between the plane of the display and the virtual position of the sound object. Based on these angles, the perceived distance between the user and the sound object can be more accurately calculated and accounts for the user's position relative to the display.
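The perceived-distance calculation may be sketched as follows. Instead of the angles mentioned above, this sketch uses lateral offsets measured parallel to the plane of the display, which is an assumption made to keep the geometry simple. With both offsets at zero it reduces to the five-plus-ten-feet example.

```python
import math


def perceived_distance(user_to_display, object_depth,
                       user_offset=0.0, object_offset=0.0):
    """Distance from the user to a sound object located behind the display plane.

    user_to_display: perpendicular distance from the user to the display.
    object_depth: distance of the sound object behind the plane of the display.
    user_offset, object_offset: hypothetical lateral positions along the display.
    """
    return math.hypot(object_offset - user_offset, user_to_display + object_depth)


straight_on = perceived_distance(5.0, 10.0)
off_axis = perceived_distance(5.0, 10.0, user_offset=0.0, object_offset=8.0)
```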
(84) At 1406, control circuitry 810 determines whether the perceived distance differs from a given distance by at least a threshold amount. As distance from the user increases, linear distances corresponding to a given angular distance increase. Thus, if the perceived distance is greater than a set distance (e.g., ten feet), the threshold angle and adjustment angle should be smaller than if the perceived distance is less than the set distance. If the perceived distance differs from the given distance by at least a threshold amount (Yes at 1406), then, at 1408, control circuitry 810 scales the threshold angle and adjustment angle accordingly.
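A minimal sketch of steps 1406 and 1408 follows. It assumes the angles are scaled in inverse proportion to the perceived distance, which is only one way of scaling them "accordingly". The ten-foot reference distance comes from the example above, while the one-foot tolerance and the function name are hypothetical.

```python
def scale_angles(threshold_deg, adjustment_deg, perceived,
                 reference=10.0, tolerance=1.0):
    """Scale both angles when the perceived distance departs from the reference."""
    if abs(perceived - reference) < tolerance:  # No at 1406: leave angles unchanged
        return threshold_deg, adjustment_deg
    # 1408: a farther object needs a smaller angle for the same linear distance.
    factor = reference / perceived
    return threshold_deg * factor, adjustment_deg * factor


farther = scale_angles(5.0, 2.0, perceived=15.0)
nearer = scale_angles(5.0, 2.0, perceived=5.0)
unchanged = scale_angles(5.0, 2.0, perceived=10.5)
```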
(85) The actions or descriptions of
(86)
(87) At 1502, control circuitry 810 initializes two Boolean flags, First_Third and Second_Third, setting both flags to FALSE. At 1504, control circuitry 810 determines whether the first sound object is within a minimum angle (e.g., the threshold angle), relative to the user position, of a third sound object. If not (No at 1504), processing continues at 1508. If so (Yes at 1504), then, at 1506, control circuitry 810 sets the value of First_Third to TRUE. Control circuitry 810 then, at 1508, determines whether the second sound object is within a minimum angle, relative to the user position, of the third sound object. If not (No at 1508), processing continues at 1512. If so (Yes at 1508), then, at 1510, control circuitry 810 sets the value of Second_Third to TRUE.
(88) Control circuitry 810 then, at 1512, checks the values of both flags. If First_Third is TRUE and Second_Third is FALSE (Yes at 1512), then, at 1514, control circuitry 810 adjusts the virtual position of the first sound object. If First_Third is FALSE and Second_Third is TRUE (Yes at 1516), then, at 1518, control circuitry 810 adjusts the virtual position of the second sound object. If both flags are TRUE (Yes at 1520), then, at 1522, control circuitry 810 adjusts the virtual position of the first sound object in a first direction that increases the distance between the first sound object and both the second sound object and the third sound object. At 1524, control circuitry 810 adjusts the virtual position of the second sound object in a second direction that increases the distance between the second sound object and both the first sound object and the third sound object.
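The flag logic of steps 1502 through 1524 can be summarized in a short sketch. Azimuths in degrees stand in for the virtual positions. The case in which neither flag is TRUE is not addressed in the description above, so returning an empty list for that case is an assumption of this sketch.

```python
def plan_adjustments(first_az, second_az, third_az, min_angle=5.0):
    """Return which of the first and second sound objects should be moved."""
    # 1502-1510: both flags start FALSE and are set by the two angle tests.
    first_third = abs(first_az - third_az) < min_angle
    second_third = abs(second_az - third_az) < min_angle

    if first_third and not second_third:  # Yes at 1512
        return ["first"]
    if second_third and not first_third:  # Yes at 1516
        return ["second"]
    if first_third and second_third:      # Yes at 1520
        return ["first", "second"]
    return []  # assumption: neither object is near the third object


only_first = plan_adjustments(0.0, 20.0, 3.0)
only_second = plan_adjustments(20.0, 0.0, 3.0)
both = plan_adjustments(0.0, 2.0, 1.0)
```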
(89) The actions or descriptions of
(90)
(91) At 1602, control circuitry 810 converts audio of a sound object from a time domain to a frequency domain. For example, control circuitry 810 performs a Fourier transform operation to identify all the frequency components of the audio. At 1604, control circuitry 810 determines a frequency range of the audio. Control circuitry 810 may subtract the lowest frequency value from the highest frequency value. For example, the lowest frequency component may have a frequency of 500 Hz and the highest frequency component may have a frequency of 800 Hz. Control circuitry 810 may therefore determine that the audio has a frequency range of 300 Hz.
(92) At 1606, control circuitry 810 determines whether the frequency range is below a threshold range. Control circuitry 810 may compare the frequency range of the audio to a threshold value. For example, the threshold value may be 500 Hz. If the range of the audio is only 300 Hz, as in the example above, then control circuitry 810 determines that the frequency range is below the threshold range. If the frequency range meets or exceeds the threshold range (No at 1606), then the process ends.
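Steps 1602 through 1606 may be illustrated with the 500 Hz and 800 Hz example above. The sketch uses a direct discrete Fourier transform written in plain Python for clarity rather than speed. The sample rate, the signal length, and the one-percent magnitude floor used to decide which components are present are all assumptions.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitudes of the non-negative-frequency bins of a direct DFT."""
    n = len(samples)
    return [
        abs(sum(s * cmath.exp(-2j * math.pi * k * i / n)
                for i, s in enumerate(samples)))
        for k in range(n // 2 + 1)
    ]


def frequency_range(samples, sample_rate, floor=0.01):
    """Highest minus lowest frequency whose magnitude exceeds floor * peak."""
    mags = dft_magnitudes(samples)
    cutoff = floor * max(mags)
    present = [k * sample_rate / len(samples)
               for k, m in enumerate(mags) if m > cutoff]
    return max(present) - min(present)


rate, count = 8000, 80  # 100 Hz per bin, so both tones fall exactly on a bin
signal = [math.sin(2 * math.pi * 500 * i / rate) +
          math.sin(2 * math.pi * 800 * i / rate) for i in range(count)]
span = frequency_range(signal, rate)
needs_spreading = span < 500.0  # Yes at 1606 for a 500 Hz threshold range
```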
(93) If the frequency range of the audio is below the threshold range (Yes at 1606), then, at 1608, control circuitry 810 computes the mean frequency of the audio. Control circuitry 810 calculates an average frequency from all the frequency components of the audio. At 1610, control circuitry 810 initializes a counter variable N, setting its value to one, and a variable T representing the number of frequency components in the audio. At 1612, control circuitry 810 determines whether the N.sup.th frequency component is above the mean frequency. If so (Yes at 1612), then, at 1614, control circuitry 810 applies a positive scaling factor to the N.sup.th frequency component. The positive scaling factor may be a positive number, or may be a value that, when used to scale the N.sup.th frequency component, results in additional frequency components at higher frequencies than that of the N.sup.th frequency component. If the N.sup.th frequency component is not above the mean frequency (No at 1612), then, at 1616, control circuitry 810 applies a negative scaling factor to the N.sup.th frequency component. The negative scaling factor may be a negative number, or may be a value that, when used to scale the N.sup.th frequency component, results in additional frequency components at lower frequencies than that of the N.sup.th frequency component.
(94) After scaling the N.sup.th frequency component, at 1618, control circuitry 810 determines whether N is equal to T, meaning that all frequency components have been scaled. If not (No at 1618), then, at 1620, control circuitry 810 increments the value of N by one and processing returns to 1612. If N is equal to T (Yes at 1618), then, at 1622, control circuitry 810 converts the audio from a frequency domain to a time domain. For example, control circuitry 810 may perform an inverse Fourier transform operation. This results in a new audio signal to replace the audio of the sound object.
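The scaling loop of steps 1608 through 1620 admits more than one interpretation. The sketch below adopts one of them as an assumption: each frequency component keeps its original entry, and an additional component is generated at the original frequency multiplied by a factor greater than one (above the mean) or between zero and one (at or below the mean). Components are represented as (frequency, amplitude) pairs, and conversion back to the time domain is omitted. The factor values are illustrative.

```python
def spread_components(components, low_factor=0.8, high_factor=1.25):
    """Widen the frequency range of a list of (frequency_hz, amplitude) pairs."""
    # 1608: mean frequency over all frequency components
    mean = sum(freq for freq, _ in components) / len(components)
    widened = list(components)
    for freq, amp in components:  # 1610-1620: visit each component once
        # 1612: components above the mean are pushed up, the rest pushed down
        factor = high_factor if freq > mean else low_factor
        widened.append((freq * factor, amp))
    return sorted(widened)


narrow = [(500.0, 1.0), (600.0, 1.0), (800.0, 1.0)]
wide = spread_components(narrow)
new_range = wide[-1][0] - wide[0][0]
```

In this example the original 300 Hz range grows to roughly 600 Hz, because new components appear near 400 Hz and at 1000 Hz.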
(95) The actions or descriptions of
(96) The processes described above are intended to be illustrative and not limiting. One skilled in the art would appreciate that the steps of the processes discussed herein may be omitted, modified, combined, and/or rearranged, and any additional steps may be performed without departing from the scope of the invention. More generally, the above disclosure is meant to be exemplary and not limiting. Only the claims that follow are meant to set bounds as to what the present invention includes. Furthermore, it should be noted that the features and limitations described in any one embodiment may be applied to any other embodiment herein, and flowcharts or examples relating to one embodiment may be combined with any other embodiment in a suitable manner, done in different orders, or done in parallel. In addition, the systems and methods described herein may be performed in real time. It should also be noted that the systems and/or methods described above may be applied to, or used in accordance with, other systems and/or methods.