PREDICTING MULTIMEDIA SESSION MOS
20210409820 · 2021-12-30
Inventors
- Jing Fu (Solna, SE)
- Junaid Shaikh (Sundbyberg, SE)
- Tomas Lundberg (Lulea, SE)
- Gunnar Heikkilä (Gammelstad, SE)
Cpc classification
H04N21/44209
ELECTRICITY
International classification
H04N21/442
ELECTRICITY
H04N21/24
ELECTRICITY
Abstract
It is provided a method, performed by a MOS, Mean Opinion Score, estimator, for predicting a multimedia session MOS. The multimedia comprises a video and an audio, wherein video quality is represented by a list of per time unit scores of a video quality, an initial buffering event and rebuffering events in the video, and wherein audio quality is represented by a list of per time unit scores of audio quality. The method comprises: generating video features from the list of per time unit scores of the video quality; generating audio features from the list of per time unit scores of the audio quality; generating buffering features from the initial buffering event and rebuffering events in the video; and estimating a multimedia session MOS from the generated video features, generated audio features and generated buffering features by using machine learning technique.
Claims
1. A method, performed by a MOS, Mean Opinion Score, estimator, for predicting a multimedia session MOS, wherein the multimedia comprises a video and an audio, wherein video quality is represented by a list of per time unit scores of the video quality, an initial buffering event and rebuffering events in the video, and wherein audio quality is represented by a list of per time unit scores of the audio quality, the method comprising: generating one or more of the group consisting of: video features from the list of per time unit scores of the video quality; audio features from the list of per time unit scores of the audio quality; buffering features from the initial buffering event and the rebuffering events in the video; and estimating the multimedia session MOS from the one or more of the generated video features, generated audio features, and generated buffering features by using machine learning technique.
2. The method according to claim 1, wherein the video features comprise a feature being a first percentile of the per unit time scores of the video quality.
3. The method according to claim 1, wherein the video features comprise a feature being a fifth percentile of the per unit time scores of the video quality.
4. The method according to claim 1, wherein the video features comprise a feature being a fifteenth percentile of the per unit time scores of the video quality.
5. The method according to claim 1, wherein the step of estimating is based on a random forest based model.
6. The method according to claim 1, wherein the buffering features comprise a feature being total buffering time.
7. The method according to claim 1, wherein the buffering features comprise a feature being number of the rebuffering events.
8. The method according to claim 1, wherein the buffering features comprise a feature being percentage of buffering time divided by video time.
9. The method according to claim 1, wherein the buffering features comprise a feature being number of the rebuffering events per video length.
10. The method according to claim 1, wherein the buffering features comprise a feature being last seen rebuffering from the end of the video.
11. A MOS, Mean Opinion Score, estimator for predicting a multimedia session MOS, wherein the multimedia comprises a video and an audio, wherein video quality is represented by a list of per time unit scores of the video quality and an initial buffering event and rebuffering events in the video and wherein audio quality is represented by a list of per time unit scores of the audio quality, the MOS estimator comprising processing means and a memory comprising instructions which, when executed by the processing means, causes the MOS estimator to: generate one or more of the group consisting of: video features from the input list of per time unit scores of the video quality; audio features from the input list of per time unit scores of the audio quality; buffering features from the initial buffering event and the rebuffering events in the video; and estimate the multimedia session MOS from the one or more of the generated video features, generated audio features, and generated buffering features by using machine learning technique.
12. The MOS estimator according to claim 11, wherein the video features comprise a feature being a first percentile of the per unit time scores of the video quality.
13. The MOS estimator according to claim 11, wherein the video features comprise a feature being a fifth percentile of the per unit time scores of the video quality.
14. The MOS estimator according to claim 11, wherein the video features comprise a feature being a fifteenth percentile of the per unit time scores of the video quality.
15. The MOS estimator according to claim 11, wherein the instructions to estimate comprise instructions which, when executed by the processing means, causes the MOS estimator to estimate using a random forest based model.
16. The MOS estimator according to claim 11, wherein the buffering features comprise a feature being total buffering time.
17. The MOS estimator according to claim 11, wherein the buffering features comprise a feature being number of rebuffering events.
18. The MOS estimator according to claim 11, wherein the buffering features comprise a feature being percentage of buffering time divided by video time.
19. The MOS estimator according to claim 11, wherein the buffering features comprise a feature being number of rebuffering events per video length.
20. The MOS estimator according to claim 11, wherein the buffering features comprise a feature being last seen rebuffering from the end of the video.
21. A MOS, Mean Opinion Score, estimator comprising: a generating module, configured to generate video features from an input list of per time unit scores of video quality, generate audio features from the input list of per time unit scores of audio quality and generate buffering features from an initial buffering event and rebuffering events in a video; and a predicting module, configured to predict a multimedia session MOS from the generated video features, generated audio features and generated buffering features by using machine learning technique.
22. A non-transitory computer-readable storage medium comprising a computer program product including instructions to cause at least one processor to: generate video features from an input list of per time unit scores of video quality; generate audio features from the input list of per time unit scores of audio quality; generate buffering features from an initial buffering event and rebuffering events in the video; and estimate a multimedia session MOS, Mean Opinion Score, from the generated video features, generated audio features and generated buffering features by using machine learning technique.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0038] The invention is now described, by way of example, with reference to the accompanying drawings, in which:
[0039]
[0040]
[0041]
[0042]
[0043]
[0044]
[0045]
[0046]
[0047]
DETAILED DESCRIPTION
[0048] The invention will now be described more fully hereinafter with reference to the accompanying drawings, in which certain embodiments of the invention are shown. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided by way of example so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art. Like numbers refer to like elements throughout the description.
[0049] The subjective MOS is how humans rate the quality of a multimedia sequence. Objective MOS estimation uses models to predict or estimate how humans will rate it. Parametric methods are commonly used to predict the multimedia MOS, but such methods usually result in quite a large prediction error.
[0050] The basic idea of embodiments presented herein is to predict the multimedia session MOS using machine learning approaches. The input for the prediction includes the following:
[0051] 1. A list of per time unit scores of the video quality
[0052] 2. A list of per time unit scores of the audio quality
[0053] 3. The initial buffering event and rebuffering events in the video.
[0054] A time unit may be a second. Thus, the lists of per time unit scores of the video and audio quality may be obtained per second.
[0055] From these inputs, a number of features are generated. Next, using these features, a machine learning model is trained with random forest to predict session MOS. Each feature is such that a Boolean condition can be obtained by evaluating the feature according to some criteria, e.g. comparing a scalar feature with a certain value.
[0056]
[0057] We also have an audio quality module 22 to predict the per time unit (e.g., second) scores of the audio quality of a video sequence. The input 16 to this audio module 22 includes audio bitrates, audio codecs, etc. The output 26 is also a list of scores, but for audio quality. For instance, the audio output 26 includes a series of audio scores such as [2.1, 2.2, 3.3, 3.3, 3.5, . . . ].
[0058] Also, there is a buffering module 23 to provide statistics of the buffering during this video playout. The input 17 to the buffering module 23 consists of buffering events. The output 27 of the buffering module 23 contains a list of buffering events, where each event includes the time since the start of the video clip and the duration of the buffering. For instance, the buffer output 27 includes a series of buffering events such as [[0, 2], [10, 1]], where the first parameter of each event is the video time (i.e. a timestamp in a media timeline) at which the buffering started and the second is the buffering time (i.e. the duration of the buffering). If the video time is 0, the event is called the initial buffering. Otherwise, it is considered a rebuffering.
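The distinction between initial buffering and rebuffering can be illustrated with a minimal sketch. The function name `split_buffering_events` and the tuple layout are hypothetical and chosen only for illustration; the sketch assumes each event is a pair of video time and buffering time, both in seconds, as in the example above.

```python
from typing import List, Optional, Sequence, Tuple

# One event: (video time in seconds, buffering duration in seconds).
BufferingEvent = Tuple[float, float]


def split_buffering_events(
    events: Sequence[Sequence[float]],
) -> Tuple[Optional[BufferingEvent], List[BufferingEvent]]:
    """Separate the initial buffering (video time 0) from the rebufferings.

    Hypothetical helper: an event whose video time is 0 is treated as the
    initial buffering, every other event as a rebuffering.
    """
    initial: Optional[BufferingEvent] = None
    rebufferings: List[BufferingEvent] = []
    for video_time, duration in events:
        if video_time == 0:
            initial = (video_time, duration)
        else:
            rebufferings.append((video_time, duration))
    return initial, rebufferings


initial, rebufferings = split_buffering_events([[0, 2], [10, 1]])
print(initial)       # (0, 2)
print(rebufferings)  # [(10, 1)]
```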
[0059] The video input 15, the audio input 16 and the buffering input 17 all relate to one multimedia session, i.e. a single reception of multimedia comprising both video and audio.
[0060] An aggregation module 1 takes the outputs 25, 26, 27 from the video module 21, audio module 22 and buffering module 23 to predict the final session MOS score 29. The aggregation module 1 is also referred to as a MOS estimator herein.
[0061]
[0062] One impact in quality is rebuffering (when the transmission speed is not high enough), as seen in
[0063] When the transmission capacity in a network fluctuates, for instance for a wireless connection, the media player (in the receiver, 13 of
[0064] Note that also in the case of adaptive bitrate, rebufferings may occur, so the combinations of adaptations and rebufferings need to be handled, as in the more complex example of
[0065]
[0066] The aggregation module 1 here uses an objective model for estimation of multimedia streaming quality, belonging to the No-Reference context.
[0067] The aggregation module 1 contains several sub-modules. First, there is a video feature generation module 30. Second, there is an audio feature generation module 31. Third, there is a buffering feature generation module 32. Finally, there is a random forest prediction module 35 together with model builder 36.
[0068] The video feature generation module 30 generates features from the list of per time unit video scores 25 (obtained from the video module 21 of
[0076] However, the embodiments presented herein are by no means limited to the seven features identified above, nor to the exact numerical values given above.
[0077] The audio feature generation module 31 is used to generate features from the list of per second audio scores 26 (obtained from the audio module 22 of
[0080] The buffering feature generation module 32 is used to generate features based on buffering events. The input is the list of buffering events 27 (obtained from the buffering module 23 of
[0081] An example of a list of buffering events is shown below. This example comprises three buffering events: one at the beginning of the video with 3 seconds of buffering, one at 10 seconds into the video with 2 seconds of rebuffering, and one at 50 seconds into the video with 1 second of rebuffering.
[0082] The buffering event input 27 is then represented by: [[0, 3], [10, 2], [50, 1]]
[0083] One or more of the following features can be generated according to embodiments herein:
[0084] 1. Total buffering time (with some adjustment for initial buffering)
[0085] 2. Number of rebuffering events
[0086] 3. Percentage of buffering time divided by video time
[0087] 4. Number of rebuffering events per video length
[0088] 5. Last seen rebuffering from the end of the video
[0089] However, the embodiments presented herein are not limited to the five features above; more features may be used in this module.
[0090] The total buffering time sums up the seconds of buffering in the video playout. However, the initial buffering event is given one third of the weight of the other rebuffering events, as users tend to forget about events occurring at the start of streaming due to the effect of memory. Also, our data driven approach shows that this weighting provides higher prediction accuracy. With this weighting, the total buffering time is 3/3+2+1=4 seconds for the example above.
[0091] The number of rebuffering events is the count of all the rebuffering events. In the above example, number of rebuffering events is two. The initial buffering is not considered a rebuffering event.
[0092] The third feature, the percentage of buffering time compared to video time, is calculated by dividing the value of feature 1 by the total video length. This can also be thought of as a ratio of stalling duration. As the test set contains video sequences of different lengths, this length-normalised feature is also useful. The value of feature 3 in the above example becomes 4/60=0.067 if the media length is 60 seconds.
[0093] The fourth feature, i.e. the number of rebuffering events per video length, is feature 2 divided by the video length. The value for the example is then 2/60=0.033. This can also be thought of as a frequency of rebuffering events.
[0094] Finally, there is a feature about last seen rebuffering. In the example, the video length is 60 seconds, while the (start of the) last rebuffering is done at 50 seconds, so the last seen rebuffering from the end of the video is 60−50=10 seconds. If there is no rebuffering in the session, the last seen rebuffering is set to the media length.
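The five buffering features described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the buffering feature generation module 32: the function name `buffering_features`, the dictionary keys and the `initial_weight` parameter are hypothetical, and feature 3 is returned as a fraction (4/60) to match the worked example.

```python
from typing import Dict, Sequence


def buffering_features(
    events: Sequence[Sequence[float]],
    video_length: float,
    initial_weight: float = 1.0 / 3.0,  # weight of the initial buffering
) -> Dict[str, float]:
    """Compute the five buffering features from [video_time, duration] events."""
    if video_length <= 0:
        raise ValueError("video_length must be positive")

    # Video time 0 marks the initial buffering; everything else is rebuffering.
    initial_time = sum(d for t, d in events if t == 0)
    rebufferings = [(t, d) for t, d in events if t != 0]

    # Feature 1: total buffering time, initial buffering weighted by one third.
    total = initial_weight * initial_time + sum(d for _, d in rebufferings)
    # Feature 2: number of rebuffering events (initial buffering excluded).
    count = len(rebufferings)
    # Feature 5: distance from the start of the last rebuffering to the end;
    # set to the media length when there is no rebuffering at all.
    if rebufferings:
        last_seen = video_length - max(t for t, _ in rebufferings)
    else:
        last_seen = video_length

    return {
        "total_buffering_time": total,
        "rebuffering_count": float(count),
        "buffering_ratio": total / video_length,          # feature 3
        "rebufferings_per_second": count / video_length,  # feature 4
        "last_seen_rebuffering": last_seen,
    }


features = buffering_features([[0, 3], [10, 2], [50, 1]], video_length=60)
for name, value in features.items():
    print(name, round(value, 3))
```

For the example list and a 60 second video, the sketch yields a total buffering time of 4, two rebuffering events, ratios of about 0.067 and 0.033, and a last seen rebuffering of 10 seconds, in line with the values given above.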
[0095] A random forest MOS prediction module 35 takes in inputs (features) from the video feature generation module 30, the audio feature generation module 31 and the buffering feature generation module 32. In the case when the inputs from the modules are as described above, there is a total of fifteen inputs (i.e. features): seven inputs are from the video feature generation module 30, two inputs are from the audio feature generation module 31, five inputs are from the buffering feature generation module 32 and, optionally, one more input is a feature about the device type 33, which indicates whether the test is on a mobile device or a PC device. The output 39 of the random forest MOS prediction module 35 is an estimated (or predicted) MOS value.
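As an illustration of how the inputs might be combined into one feature vector, the following sketch concatenates seven video features, two audio features and five buffering features, and appends the optional device type. The function name `assemble_feature_vector`, the ordering of the features and the 0/1 encoding of the device type are assumptions made for this example only; the description does not prescribe them.

```python
from typing import List, Optional, Sequence


def assemble_feature_vector(
    video: Sequence[float],
    audio: Sequence[float],
    buffering: Sequence[float],
    is_mobile: Optional[bool] = None,
) -> List[float]:
    """Concatenate per-module features into one model input vector."""
    if len(video) != 7 or len(audio) != 2 or len(buffering) != 5:
        raise ValueError("expected 7 video, 2 audio and 5 buffering features")
    vector = list(video) + list(audio) + list(buffering)
    if is_mobile is not None:
        # Assumed encoding: 1.0 for a mobile device, 0.0 for a PC device.
        vector.append(1.0 if is_mobile else 0.0)
    return vector


# Placeholder values only; they are not taken from any measurement.
vector = assemble_feature_vector([3.0] * 7, [3.5] * 2, [0.0] * 5, is_mobile=True)
print(len(vector))  # 15
```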
[0096]
[0097] The random forest based model 55 may be built on a number of trees, each tree characterised by a maximum depth. For a model according to one scenario, the random forest is built on fifty trees. Each tree has a maximum depth of eight, which means that hierarchically there can be at most eight levels in a tree. The depth is set to eight, which provides a good trade-off between accuracy and complexity of the model. Upon receiving inputs from the video, audio and buffering modules for a particular streaming session, the model estimates the MOS in the following way.
[0098] The model 55 parses through fifty trees 51a-n that were already constructed during a training phase of the model 55. At each node in a tree, there is a condition check of whether the value of a certain feature (for example, among the fifteen features described above) is smaller than the specified value of the tree node. If the answer to the condition is YES, it proceeds to the left child node. Otherwise, it proceeds to the right child node. It recursively goes through all the levels in a tree until it reaches a leaf node of the tree, where it gets a MOS estimate. The leaf node can be reached at any depth between one and eight, depending on the tree and the values of the features.
[0099] The above process is performed for all fifty trees, which can be done in parallel. Finally, there are fifty estimates of MOS scores for the streaming session, obtained from the corresponding fifty trees. The average of the fifty MOS scores is then calculated. Hence, the module gives an output (29 of
[0100] An example of the random forest model 55 with fifty trees and maximum depth of eight is shown in
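The tree traversal and averaging described above can be sketched as follows. This is a simplified illustration, not the trained model 55: the `Node` class and the functions `predict_tree` and `predict_forest` are hypothetical names, the traversal is written as a loop rather than recursively, and the two toy trees stand in for the fifty trees produced by training.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Node:
    """A tree node; a leaf has `value` set, an inner node has a condition."""
    feature: int = -1                # index of the feature to test
    threshold: float = 0.0           # specified value of the tree node
    left: Optional["Node"] = None    # taken when feature value < threshold
    right: Optional["Node"] = None   # taken otherwise
    value: Optional[float] = None    # MOS estimate, set only on leaf nodes


def predict_tree(node: Node, features: Sequence[float]) -> float:
    """Walk from the root to a leaf and return the leaf's MOS estimate."""
    while node.value is None:
        if features[node.feature] < node.threshold:
            node = node.left
        else:
            node = node.right
    return node.value


def predict_forest(trees: List[Node], features: Sequence[float]) -> float:
    """Average the per-tree estimates to obtain the session MOS."""
    estimates = [predict_tree(tree, features) for tree in trees]
    return sum(estimates) / len(estimates)


# Two toy trees with made-up thresholds and leaf values.
tree_a = Node(feature=0, threshold=3.0, left=Node(value=2.0), right=Node(value=4.0))
tree_b = Node(feature=1, threshold=1.5, left=Node(value=4.5), right=Node(value=2.5))
print(predict_forest([tree_a, tree_b], [3.5, 4.0]))  # 3.25
```

Because each tree is evaluated independently, the list comprehension in `predict_forest` could be replaced by a parallel map without changing the result.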
[0101] The model in the embodiments presented herein may optionally be extended by introducing one or more other features, some of which are listed here:
[0102] 1. Mean play duration: the play duration ($t_p$) is the time interval in which a user enjoys smooth playout of multimedia streaming without any rebufferings. It can also be defined as the time duration between two subsequent rebuffering events and thus expressed as:
$$t_p = T_r^{i+1} - T_r^{i}$$
[0103] where $T_r^{i}$ is the time when rebuffering event $i$ occurs.
[0104] Thus, mean play duration (
$$\mu = \sum_{i}^{n} \left(vMOS_{i+1} - vMOS_{i}\right)^{2}$$
[0117] where $vMOS_i$ is the video MOS score of the $i$-th second of a streaming session.
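The two expressions that survive in the text above can be evaluated as in the following sketch. The function names `play_durations` and `squared_score_differences` are hypothetical. The definition of the mean play duration and the name of the second feature are cut off in the text, so neither is reproduced here; the sum is assumed to run over all consecutive pairs of per-second scores.

```python
from typing import List, Sequence


def play_durations(rebuffering_times: Sequence[float]) -> List[float]:
    """Play durations t_p between subsequent rebuffering events.

    Each duration is the time of rebuffering event i+1 minus the time of
    rebuffering event i, following the expression for t_p.
    """
    return [
        later - earlier
        for earlier, later in zip(rebuffering_times, rebuffering_times[1:])
    ]


def squared_score_differences(video_scores: Sequence[float]) -> float:
    """Sum of squared differences between consecutive per-second video scores."""
    return sum(
        (nxt - cur) ** 2 for cur, nxt in zip(video_scores, video_scores[1:])
    )


print(play_durations([10, 50]))                           # [40]
print(squared_score_differences([3.0, 4.0, 4.0, 2.0]))    # 5.0
```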
[0118]
[0119] In a generate video features step 40, video features are generated from the list of per time unit scores of the video quality. The video features may comprise a feature being a first percentile of the per unit time scores of the video quality. The video features may comprise a feature being a fifth percentile of the per unit time scores of the video quality. The video features may comprise a feature being a fifteenth percentile of the per unit time scores of the video quality.
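A minimal sketch of the percentile features named in step 40 is given below. The function names `percentile` and `video_percentile_features` are hypothetical, and linear interpolation between neighbouring sorted scores is an assumption; the description does not state which percentile definition is used.

```python
from typing import Dict, Sequence


def percentile(scores: Sequence[float], p: float) -> float:
    """Return the p-th percentile (0-100) using linear interpolation."""
    if not scores:
        raise ValueError("scores must not be empty")
    ordered = sorted(scores)
    # Fractional rank into the sorted list (assumed interpolation scheme).
    rank = (len(ordered) - 1) * p / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def video_percentile_features(scores: Sequence[float]) -> Dict[str, float]:
    """First, fifth and fifteenth percentile of the per time unit scores."""
    return {f"p{p}": percentile(scores, p) for p in (1, 5, 15)}


# Made-up per-second video scores, for illustration only.
per_second_scores = [3.1, 3.4, 2.2, 4.0, 3.8, 1.9, 3.3, 3.6]
print(video_percentile_features(per_second_scores))
```

Low percentiles such as these summarise the worst stretches of the session rather than its average, which is one plausible reason for using them as features.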
[0120] In a generate audio features step 42, audio features are generated from the list of per time unit scores of the audio quality.
[0121] In a generate buffering features step 44, buffering features are generated from the initial buffering event and rebuffering events in the video. The buffering features may comprise a feature being total buffering time. The buffering features may comprise a feature being number of rebuffering events. The buffering features may comprise a feature being percentage of buffering time divided by video time. The buffering features may comprise a feature being number of rebuffering events per video length. The buffering features may comprise a feature being last seen rebuffering from the end of the video.
[0122] In a predict a multimedia session MOS step 46, a multimedia session MOS (39 of
[0123]
[0124] A generating means 70 corresponds to steps 40, 42 and 44. A predicting means 72 corresponds to step 46.
[0125]
[0126]
[0127] The memory 64 can be any combination of read and write memory (RAM) and read only memory (ROM). The memory 64 also comprises persistent storage, which, for example, can be any single one or combination of magnetic memory, optical memory, solid state memory or even remotely mounted memory.
[0128] A data memory 66 is also provided for reading and/or storing data during execution of software instructions in the processor 60. The data memory 66 can be any combination of read and write memory (RAM) and read only memory (ROM).
[0129] The MOS estimator further comprises an I/O interface 62 for communicating with other external entities. Optionally, the I/O interface 62 also includes a user interface.
[0130] Other components of the MOS estimator are omitted in order not to obscure the concepts presented herein.
[0131]
[0132] The invention has mainly been described above with reference to a few embodiments. However, as is readily appreciated by a person skilled in the art, other embodiments than the ones disclosed above are equally possible within the scope of the invention, as defined by the appended patent claims.