EYE GAZE AS A PROXY OF ATTENTION FOR VIDEO STREAMING SERVICES
20230164387 · 2023-05-25
Inventors
- Fabián E. Bustamante (Evanston, IL, US)
- Stefan Birrer (Chicago, IL, US)
- Roy Reichbach (Chicago, IL, US)
CPC classification (Section H, ELECTRICITY)
- H04N21/2402
- H04N21/21805
- H04N21/4438
- H04N21/44218
- H04N21/4316
- H04N21/4312
- H04N21/2662
International classification (Section H, ELECTRICITY)
- H04N21/442
- H04N21/431
Abstract
A method uses the eye gaze of a user as a proxy for their attention, leveraging it to provide a more natural experience in a multi-view, i.e. multi-party and multi-perspective, video streaming service. The method takes advantage of increasingly powerful, inexpensive cameras and related software to provide commodity eye tracking. The method also leverages collected data on user interactions and uses machine learning techniques to customize its response to individual usage patterns. A system is specified for implementing the described method on a streaming architecture.
Claims
1. A method for live and real-time media streaming, comprising: a multi-view application presenting a user with multiple windows, or icons, on a screen on a user device that represents multiple participants or perspectives; capturing a user's eye-gaze with at least one camera facing the user when the user, while operating said user device, gazes at a focus window; using said user's eye-gaze as a focus of attention of the user to identify a specific window or windows that are the focus of user attention; and determining at any given point in time, a fraction of total screen size of, and a selected resolution for, a particular window or icon as a function of a focus value of that window, a fraction of screen resolution associated with a main window, and a maximum allocatable bandwidth available.
2. The method of claim 1, said user device comprising a display, computational resources, and persistent and random-access memory.
3. The method of claim 2, wherein said user device is in communication with said at least one camera facing the user.
4. The method of claim 1, where the user device comprises any of a desktop, a laptop computer, a pad, or a smartphone.
5. The method of claim 1, wherein said multiple windows or icons represent all or a subset of participants in a multiparty call and/or multiple view angles of an event.
6. The method of claim 1, wherein said multi-view application comprises an interface of a multi-view streaming application with different streams, of different sizes and resolutions, as shown by windows, where each window comprises one of a plurality of different streams, of different sizes and resolutions.
7. The method of claim 1, wherein different streams are associated with different participants in a multi-party application and/or with different views in a multi-perspective application.
8. The method of claim 1, wherein every window or icon has an associated focus value, w, which is proportional to a fraction of time the user's gaze is focused on a particular window or icon over an observation period.
9. The method of claim 8, wherein a sum of all windows' focus values is equal to 100 (Σ_{all w} w = 100).
10. The method of claim 1, further comprising: providing a known bandwidth budget for every stream quality level; wherein available levels of stream quality form a discrete set.
11. The method of claim 1, wherein when a main window comprises x % of the screen and the main window maximum resolution requires y % of the maximum allocatable bandwidth; wherein a total screen allocation for all other windows of the screen is no more than 100−x %; and wherein bandwidth demand for all other windows of the screen is no more than 100−y % of a maximum allocated bandwidth.
12. The method of claim 1, further comprising: dynamically adjusting a budget allocated to a particular window and its associated stream quality level as a function of user attention when a user's attention shifts between windows during a session by dividing the session into observation periods and tracking a user focus on the different views of a multi-view application interface during each period.
13. The method of claim 12, wherein a focus value of a particular window determined during an observation period t is used to allocate a fraction of said particular window screen size and assign a most appropriate resolution for an associated stream of said particular window during a subsequent observation period t+1.
14. The method of claim 12, further comprising: adjusting views in said multi-view application to complement other interaction modes available to said user to disambiguate said user's input.
15. The method of claim 1, further comprising: detecting user attention or inattention during a video conference call or while viewing content; capturing metrics regarding said attention/inattention; and using said metrics to generate reports.
16. The method of claim 7, further comprising: using audience information regarding user's eye-gaze for a plurality of users in real time to inform a broadcaster that most of their audience prefers one view over another; and any of using said audience information to make global broadcast decisions in real time; using said audience information to change a broadcast stream automatically; and using said audience information to select a dominant camera for a broadcast stream source.
17. A method for live and real-time media streaming, comprising: a multi-view application presenting a user with multiple windows, or icons, on a screen on a user device that represents multiple participants or perspectives; capturing a user's eye-gaze with at least one camera facing the user when the user, while operating said user device, gazes at a focus window; using said user's eye-gaze as a focus of attention of the user to identify a specific window or windows that are the focus of user attention to select among available views of a multi-view application; determining screen size fractions and resolutions; during an initial observation period, providing updated screen size fraction and resolution information when the specific window or windows that are the focus of user attention correspond to an initial focus value; during a next observation period determining a next focus value for said next observation period; replacing the initial focus value for the initial observation period with the next focus value for said next observation period; and updating said screen based on the next focus value for said next observation period.
18. The method of claim 17, further comprising: collecting data during user interactions through eye gaze and through alternative inputs when different users interact with said multi-view application in different ways by constantly shifting their gaze among multiple windows or narrowly focusing on a particular window; and using said device gaze information as an input, compiling and processing said data with machine learning techniques to yield a focus value; and using said focus value to customize a response provided to a specific user.
19. The method of claim 17, further comprising: identifying available bandwidth and providing updated screen size and resolution information; and using as an input a determination of focus values of a current period and, based on data collected during prior user interactions through eye gaze and through alternative inputs as stored in a user log, updating a user model of focus.
20. A method for live and real-time media streaming, comprising: a multi-view application presenting a user with multiple windows, or icons, on a screen on a user device that represents multiple participants or perspectives; capturing a user's eye-gaze with at least one camera facing the user when the user, while operating said user device, gazes at a focus window; using said user's eye-gaze as a focus of attention of the user to identify a specific window or windows that are the focus of user attention to select among available views of a multi-view application; as a user focus shifts from one window to another, providing a proportionally larger portion of the screen with a new dominant window; and streaming said new dominant window at a higher quality level while the previous dominant window takes a smaller portion of the screen and is streamed at a lower quality level.
21. The method of claim 20, further comprising: streaming different streams composing the multi-view application using an adaptive bit-rate method to enable seamless transition between different levels of quality and resolution, to a higher or lower quality for the new or old focus window, respectively.
22. The method of claim 21, further comprising: dynamically or proactively generating an alternative version of a same stream at a different level of quality for seamless transitions during real-time streaming.
23. The method of claim 22, further comprising: supporting a seamless migration of attention focus back into and away from a dominant window by distributing an allocated bandwidth budget to additional backup streams surrounding said dominant window when user focus shifts.
24. The method of claim 23, further comprising: allocating bandwidth budget for backup streams when a user's attention gives preference to certain windows over others and wherein throughout the session the user keeps returning to the preferred window.
25. The method of claim 23, further comprising: using machine learning techniques to analyze collected data of prior user interactions to identify one or more user-specific attention dominant windows and the allocation of bandwidth budget for backup streams associated with said user-specific attention dominant windows to support seamless migration of attention focus back into and away from said one or more user-specific attention dominant windows.
26. An apparatus for live and real-time media streaming, comprising: a user device comprising a display, computational resources, and persistent and random-access memory; at least one camera facing the user when the user, while operating said user device, gazes at a focus window for capturing a user's eye-gaze, wherein said user device is in communication with said at least one camera facing the user; a multi-view application configured for presenting a user with multiple windows, or icons, on a screen on said user device that represents multiple participants or perspectives; a processor configured for using said user's eye-gaze as a focus of attention of the user to identify a specific window or windows that are the focus of user attention; and said processor configured for determining at any given point in time, a fraction of total screen size of, and a selected resolution for, a particular window or icon as a function of a focus value of that window, a fraction of screen resolution associated with a main window, and a maximum allocatable bandwidth available.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION
[0031] The following detailed description describes an embodiment of the invention that comprises a method to capture and use the focus of attention of a user to provide a more natural experience with multi-view video streaming services.
[0032] Embodiments of the invention leverage a user's eye-gaze as a proxy of user attention, thus taking advantage of new, powerful, inexpensive cameras, and new software that uses these cameras, to provide commodity eye-tracking.
[0033] Embodiments of the invention comprise a user device that is wirelessly or wired connected with at least one camera facing the user, and that includes a display, computational resources, and persistent and random-access memory. Embodiments of the device take the form of any of a desktop, a laptop computer, a pad, or a smartphone.
[0035] A multi-view application presents a user with multiple windows, or icons, on a screen representing multiple participants or perspectives. In one instantiation, the multiple windows or icons represent all or a subset of the participants in a multiparty call or multiple view angles of a sporting event.
[0036] Embodiments of the invention rely upon eye gaze to identify the specific window or windows that are the focus of user attention. Every window or icon has an associated focus value, w, ranging from 0 to 100, which is proportional to the fraction of time the user's gaze was focused on a particular window or icon over the observation period.
[0037] The sum of all windows' focus values is equal to 100 (Σ_{all w} w = 100). For example, if the user was solely focused on the speaker in window 1, then w1 = 100. If the user's attention had instead shifted back and forth between two windows, e.g. windows 1 and 2, then w1 = w2 = 50.
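The focus values described above can be computed directly from a stream of per-sample gaze hits. A minimal sketch follows; the names `focus_values` and `gaze_samples` are illustrative assumptions, not part of the specification:

```python
from collections import Counter

def focus_values(gaze_samples):
    """Map each window id to a focus value w in [0, 100], proportional
    to the fraction of gaze samples that landed on that window over the
    observation period. The values sum to 100 whenever at least one
    sample was captured."""
    counts = Counter(gaze_samples)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {window: 100.0 * n / total for window, n in counts.items()}

# A user solely focused on window 1:
# focus_values([1, 1, 1]) -> {1: 100.0}
# A user alternating evenly between windows 1 and 2:
# focus_values([1, 2, 1, 2]) -> {1: 50.0, 2: 50.0}
```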
[0038] At any given point in time, the fraction of the total screen size of, and the selected resolution for, a particular window or icon is a function of the focus value of that window, the fraction of the screen resolution associated with the main window, and the maximum allocatable bandwidth available. There is a known bandwidth budget for every stream quality level; the available levels of stream quality form a discrete set, e.g. Low, Standard, and High Definition.
[0039] The user device's camera and associated software track the user gaze with sufficient precision to select among the available views of a multi-view application. The focus value associated with a window is determined by the input of this device. For instance, if the main window takes 40% of the screen and its maximum resolution requires 30% of the maximum allocatable bandwidth, the total screen allocation for the other windows cannot be larger than 60%; and the bandwidth demand cannot add up to more than 70% of the maximum allocated bandwidth.
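The allocation constraint in the example above (a main window holding 40% of the screen and 30% of the bandwidth budget leaves at most 60% and 70%, respectively, for the rest) can be sketched as a proportional split of the remaining budgets by focus value. All names here are hypothetical; the specification does not fix an allocation algorithm:

```python
def allocate(focus, main_window, main_screen_pct, main_bw_pct):
    """Split the screen and bandwidth budgets left over by the main
    window among the remaining windows, in proportion to their focus
    values. Percentages are on a 0-100 scale; focus maps
    window id -> focus value."""
    others = {w: f for w, f in focus.items() if w != main_window}
    total = sum(others.values())
    screen_left = 100.0 - main_screen_pct
    bw_left = 100.0 - main_bw_pct
    alloc = {main_window: (main_screen_pct, main_bw_pct)}
    for w, f in others.items():
        share = f / total if total else 1.0 / len(others)
        alloc[w] = (screen_left * share, bw_left * share)
    return alloc

# With the figures from the text: a main window at 40% of the screen
# and 30% of bandwidth leaves two equally focused windows 30% of the
# screen and 35% of the bandwidth each.
```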
Changing Focus of Attention
[0040] During the length of a session, a user's attention shifts between windows perhaps focusing on different participants or the total audience. Embodiments of the invention dynamically adjust the budget allocated to a particular window and its associated stream quality level as a function of user attention. It does this by dividing the session into observation periods, potentially of seconds of duration, and tracking a user focus on the different views of a multi-view application interface during each period. The focus value of a window determined during observation period t is used to allocate its fraction of screen size and assign the most appropriate resolution for its associated stream during observation period t+1.
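The period-by-period scheme above, in which focus measured during period t shapes the layout applied during period t+1, can be sketched as a simple loop. The callbacks `capture_gaze_sample` (returning the window id under the user's gaze) and `apply_layout` are hypothetical placeholders for the device's eye-tracking input and the application's rendering path:

```python
import time
from collections import Counter

def run_session(capture_gaze_sample, apply_layout, period_s=2.0, periods=10):
    """Divide a session into fixed-length observation periods. Focus
    values measured during period t drive the screen-size and
    resolution choices applied during period t+1; the first period
    uses a default layout, signalled here by apply_layout(None)."""
    apply_layout(None)  # default layout before any measurement exists
    for _ in range(periods):
        counts = Counter()
        deadline = time.monotonic() + period_s
        while time.monotonic() < deadline:
            counts[capture_gaze_sample()] += 1  # tally gaze per window
        total = sum(counts.values()) or 1
        focus = {w: 100.0 * n / total for w, n in counts.items()}
        apply_layout(focus)  # these values shape period t+1
```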
[0042] The herein disclosed method of using eye gaze to identify the focus of attention of a user, and of adjusting the views in a multi-view application accordingly, complements other interaction modes available to users, such as mouse clicks, that can help disambiguate user input. This is labelled as 'User override' in the drawings.
[0043] The method identifies available bandwidth and provides updated screen size and resolution information. If the display corresponds to the focus value, i.e. the current display is true (46), then the process ends (48) until a next observation period. During the next observation period the method determines the focus values of the period (44); these values replace the focus values for the previous observation period, and the process repeats as described.
User-Specific Attention Estimation
[0044] Different users may interact with a multi-view interface in different ways, constantly shifting their gaze among multiple windows or narrowly focusing on a particular one. Embodiments of the invention include data collected during prior user interactions through eye gaze and through alternative inputs, such as mouse clicks (‘User override’). These data are compiled and processed with machine learning techniques to customize the response provided to a specific user, taking the device gaze information as input, and yielding a focus value.
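The customization described above can be illustrated with a deliberately simple stand-in for the unspecified machine-learning model: an exponential moving average that blends the current period's gaze-derived focus with the user's history, while an explicit override (e.g. a mouse click on a window) wins outright. All names and the blending scheme are assumptions for illustration:

```python
def personalized_focus(gaze_focus, history, override=None, alpha=0.7):
    """Blend current gaze-derived focus values with a per-user history
    of prior focus values. gaze_focus and history map window id -> w;
    override, when given, is a window id the user clicked, which takes
    full focus regardless of gaze."""
    if override is not None:
        return {override: 100.0}
    return {w: alpha * f + (1 - alpha) * history.get(w, f)
            for w, f in gaze_focus.items()}
```

A heavier model trained on the compiled interaction logs could replace this average without changing the surrounding control flow.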
Improving Seamless Change of Focus
[0046] As a user focus shifts from one window to another, the new dominant window takes a proportionally larger portion of the interface and is streamed at a higher quality level while the previous dominant window takes a smaller portion of the interface and is streamed at a lower quality level.
[0047] The different streams composing the multi-view application are streamed using an adaptive bit-rate method that enables a seamless transition between different levels of quality and resolution, to a higher or lower quality for the new or old focus window, respectively. For real-time streaming, this seamless transition may require the dynamic or proactive generation of an alternative version of the same stream at a different level of quality. Because doing this for the potentially tens or hundreds of windows in a multi-view application may not scale, an embodiment of the invention distributes the allocated bandwidth budget for backup streams to windows surrounding the dominant window (see the drawings).
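The distribution of the backup-stream budget to windows surrounding the dominant one can be sketched as follows, assuming (hypothetically) that the interface arranges windows in a grid and that "surrounding" means grid-adjacent:

```python
def backup_budget(dominant, grid, budget_pct):
    """Spread a backup-stream bandwidth budget evenly over the windows
    adjacent to the dominant window in the on-screen grid, so that a
    gaze shift to a neighbour can be upgraded seamlessly.
    grid maps window id -> (row, col)."""
    r, c = grid[dominant]
    neighbours = [w for w, (wr, wc) in grid.items()
                  if w != dominant and abs(wr - r) <= 1 and abs(wc - c) <= 1]
    if not neighbours:
        return {}
    share = budget_pct / len(neighbours)
    return {w: share for w in neighbours}
```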
Attention Dominant Windows
[0048] Over time, a user's attention may tend to give preference to certain windows over others, e.g. the speaker or a friend, such that throughout the session the user keeps returning to that window (see the drawings).
[0049] Embodiments of the invention use machine learning techniques to analyze the collected data of prior user interactions to identify user-specific attention dominant windows and the allocation of bandwidth budget for backup streams associated with these windows (see views 13 and 17 in the drawings).
[0050] Embodiments also detect user attention or inattention, for example during a video conference call or while viewing content, such as advertisements, performances, and the like. Metrics regarding such attention/inattention can be captured and used to generate various reports. Such information can also be used in real time to inform, for example, a broadcast service or sporting event promoter that most of their audience prefers one view over another. This information can be used to make global broadcast decisions in real time; alternatively, audience gaze information can be used to change a broadcast stream automatically, for example to select a dominant camera for the stream source.
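Aggregating per-viewer focus values across an audience, for example to select a dominant camera as described above, can be sketched as a simple tally. The function name and data shapes are illustrative assumptions:

```python
from collections import Counter

def dominant_view(audience_focus):
    """Given per-viewer focus values (viewer id -> {view id -> w}),
    return the view attracting the most aggregate attention, e.g. to
    select the dominant camera for the broadcast stream source.
    Returns None for an empty audience."""
    totals = Counter()
    for focus in audience_focus.values():
        for view, w in focus.items():
            totals[view] += w
    return max(totals, key=totals.get) if totals else None
```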
Computer Implementation
[0052] The computing system 80 may include one or more central processing units ("processors") 81, memory 82, input/output devices 85, e.g. keyboard and pointing devices, touch devices, display devices, storage devices 84, e.g. disk drives, and network adapters 86, e.g. network interfaces, that are connected to an interconnect 83. The interconnect 83 is illustrated as an abstraction that represents any one or more separate physical buses, point-to-point connections, or both, connected by appropriate bridges, adapters, or controllers. The interconnect 83, therefore, may include, for example, a system bus, a Peripheral Component Interconnect (PCI) bus or PCI-Express bus, a HyperTransport or industry standard architecture (ISA) bus, a small computer system interface (SCSI) bus, a universal serial bus (USB), an IIC (I2C) bus, or an Institute of Electrical and Electronics Engineers (IEEE) standard 1394 bus, also called FireWire.
[0053] The memory 82 and storage devices 84 are computer-readable storage media that may store instructions that implement at least portions of the various embodiments. In addition, the data structures and message structures may be stored or transmitted via a data transmission medium, e.g. a signal on a communications link. Various communications links may be used, e.g. the Internet, a local area network, a wide area network, or a point-to-point dial-up connection. Thus, computer readable media can include computer-readable storage media, e.g. non-transitory media, and computer-readable transmission media.
[0054] The instructions stored in memory 82 can be implemented as software and/or firmware to program the processor 81 to carry out the actions described above. In some embodiments, such software or firmware may be initially provided to the computing system 80 by downloading it from a remote system, e.g. via network adapter 86.
[0055] The various embodiments introduced herein can be implemented by, for example, programmable circuitry, e.g. one or more microprocessors, programmed with software and/or firmware, or entirely in special purpose hardwired (non-programmable) circuitry, or in a combination of such forms. Special-purpose hardwired circuitry may be in the form of, for example, one or more ASICs, PLDs, FPGAs, etc.
[0056] The language used in the specification has been principally selected for readability and instructional purposes. It may not have been selected to delineate or circumscribe the subject matter. It is therefore intended that the scope of the technology be limited not by this Detailed Description, but rather by any claims that issue on an application based hereon. Accordingly, the disclosure of various embodiments is intended to be illustrative, but not limiting, of the scope of the technology as set forth in the following claims.