NETWORK-BASED ASSISTANCE FOR RECEIVER PROCESSING OF VIDEO DATA
20220038756 · 2022-02-03
Inventors
- Hans Maarten Stokking (Wateringen, NL)
- Frank Ter Haar (Woerden, NL)
- Hendrikus Nathaniël Hindriks (Gouda, NL)
Cpc classification
H04N21/23418
ELECTRICITY
H04N21/4424
ELECTRICITY
H04N21/222
ELECTRICITY
H04N21/435
ELECTRICITY
International classification
H04N21/234
ELECTRICITY
Abstract
An intermediary system and method may be provided for assisting a receiver system in processing video data which is streamed as a video stream to the receiver system via a network. The processing of the video data by the receiver system may be dependent on an analysis of the video data. The intermediary system may provide processing assist data to the receiver system which comprises an analysis result or a processing instruction derived from the analysis results. Accordingly, the receiver system may process the video data without a need for the receiver system itself to analyze the video data, thereby offloading computational complexity to the intermediary system. Compared to techniques in which most or all of the processing is performed by the intermediary system, an advantage of continuing to process the video data at the receiver system is that the receiver system may already decode the video stream while the video stream is decoded and/or analyzed by the intermediary system, thereby reducing the delay from transmission to display of the video stream.
Claims
1. A processor system configured for assisting a receiver system in processing video data which is streamed as a video stream to the receiver system via a network, wherein the processing of the video data by the receiver system is dependent on an analysis of the video data, the processor system comprising: a network interface to the network; a processor configured to: via the network interface, receive the video stream; decode at least part of the video stream to obtain a decoded video data part; analyze the decoded video data part to obtain an analysis result; generate processing assist data comprising the analysis result or a processing instruction derived from the analysis results; via the network interface, provide the processing assist data to the receiver system to enable the receiver system to process the video data using the analysis result or the processing instruction provided by the processing assist data.
2. The processor system according to claim 1, wherein the processor is configured to analyze the decoded video data part by at least one of the group of: a segmentation technique, whereby the analysis result comprises a segmentation of an object in the decoded video data part; an object tracking technique, whereby the analysis result comprises a position of an object in the decoded video data part; and a calibration technique, whereby the analysis result comprises a calibration parameter used in the processing of the video data.
3. The processor system according to claim 1, wherein the processing of the video data by the receiver system comprises compositing an object into the video data, and wherein the processor is configured to: via the network interface, provide object data to the receiver system, the object data defining at least part of the object; analyze the decoded video data part to determine, as the analysis result to be included in the processing assist data, a characteristic of said composition of the object into the video data, such as a position and/or orientation of the object.
4. The processor system according to claim 1, wherein the processor is configured to include timing information in the processing assist data, the timing information being indicative of the part of the video stream or the decoded video data part from which the processing assist data was generated.
5. The processor system according to claim 4, wherein the timing information comprises at least one of the group of: a sequence number; and a content timestamp.
6. The processor system according to claim 1, wherein the processor is configured to: sequentially decode the video stream to obtain a series of decoded video data parts; sequentially analyze, and generate processing assist data for, individual ones of the decoded video data parts to obtain a series of processing assist data; and provide the series of processing assist data to the receiver system as a processing assist data stream.
7. The processor system according to claim 1, wherein the processor is configured to, via the network interface, receive the video stream from a stream source in the network and to forward the video stream to the receiver system.
8. A processor system configured for processing video data which is received as a video stream via a network, the processor system comprising: a network interface to the network; a processor configured to: via the network interface, receive the video stream; decode the video stream to obtain the video data; process the video data to obtain processed video data, wherein the processing is dependent on an analysis of at least part of the video data; wherein the processor is further configured to: via the network interface, receive processing assist data comprising an analysis result of the analysis of at least the part of the video data, or a processing instruction derived from the analysis results; and perform the processing of the video data using the analysis result or the processing instruction provided by the processing assist data.
9. The processor system according to claim 8, wherein the processing assist data comprises a segmentation of an object in the part of the video data, and wherein the processor is configured to use the segmentation of the object for processing video data of the object or video data outside of the object.
10. The processor system according to claim 8, wherein the processing assist data comprises timing information, the timing information being indicative of the part of the video stream or the decoded video data part from which the processing assist data was generated, and wherein the processor is configured to identify the part of the video stream or the decoded video data part on the basis of the timing information and to use the analysis result or the processing instruction provided by the processing assist data specifically for the processing of said part.
11. A system comprising the processor system according to claim 1 as an intermediary system and the processor system according to claim 8 as a receiver system, wherein: both the intermediary system and the receiver system are configured to receive the video stream from a stream source in the network; or the intermediary system is configured to receive the video stream from the stream source in the network and forwards the video stream to the receiver system.
12. A non-transitory computer-readable medium comprising processing assist data, the processing assist data comprising an analysis result of an analysis of video data, or a processing instruction derived from the analysis results, wherein the processing assist data enables a receiver system which receives the video data as a video stream to process the video data using the analysis result or the processing instruction provided by the processing assist data.
13. A computer-implemented method for assisting a receiver system in processing video data which is streamed as a video stream to the receiver system via a network, wherein the processing of the video data by the receiver system is dependent on an analysis of the video data, the method comprising: via the network, receiving the video stream; decoding at least part of the video stream to obtain a decoded video data part; analyzing the decoded video data part to obtain an analysis result; generating processing assist data comprising the analysis result or a processing instruction derived from the analysis results; via the network, providing the processing assist data to the receiver system to enable the receiver system to process the video data using the analysis result or the processing instruction provided by the processing assist data.
14. A computer-implemented method for processing video data which is received as a video stream via a network, the method comprising: via the network, receiving the video stream; decoding the video stream to obtain the video data; processing the video data to obtain processed video data, wherein the processing is dependent on an analysis of at least part of the video data; wherein the method further comprises: receiving processing assist data comprising an analysis result of the analysis of at least said part of the video data, or a processing instruction derived from the analysis results; and performing the processing of the video data using the analysis result or the processing instruction provided by the processing assist data.
15. A non-transitory computer-readable medium comprising a computer program, the computer program comprising instructions for causing a processor system to perform the method according to claim 13.
16. A non-transitory computer-readable medium comprising a computer program, the computer program comprising instructions for causing a processor system to perform the method according to claim 14.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0074] These and other aspects of the invention are apparent from and will be elucidated with reference to the embodiments described hereinafter. In the drawings,
[0075]
[0076]
[0077]
[0078]
[0079]
[0080]
[0081]
[0082]
[0083]
[0084]
[0085]
[0086]
[0087]
[0088]
[0089] It should be noted that items which have the same reference numbers in different figures, have the same structural features and the same functions, or are the same signals. Where the function and/or structure of such an item has been explained, there is no necessity for repeated explanation thereof in the detailed description.
LIST OF REFERENCES AND ABBREVIATIONS
[0090] The following list of references and abbreviations is provided for facilitating the interpretation of the drawings and shall not be construed as limiting the claims.
[0091] 010 video data
[0092] 012 pre-processed video data
[0093] 014 processed video data
[0094] 020 sender system
[0095] 022 encode as video stream
[0096] 024 transport video stream
[0097] 040, 042 network
[0098] 060 (NBMP-based) intermediary system
[0099] 062 decode video stream
[0100] 064 process video data
[0101] 066 encode as video stream
[0102] 068 transport video stream
[0103] 080 receiver system
[0104] 082 decode video stream
[0105] 100 intermediary (processor) system
[0106] 102 decode at least part of video stream
[0107] 104 analyse video data part
[0108] 106 forward video stream
[0109] 107 transport video stream
[0110] 108 provide processing assist data
[0111] 110 processing assist data
[0112] 120 network interface
[0113] 122 network data communication
[0114] 140 processor
[0115] 160 data storage
[0116] 200 receiver (processor) system
[0117] 202 decode video stream
[0118] 204 process video data using processing assist data
[0119] 220 network interface
[0120] 222 network data communication
[0121] 240 processor
[0122] 260 display output
[0123] 262 display data
[0124] 280 display
[0125] 300 method for assisting receiver system in processing video data
[0126] 310 receiving video stream
[0127] 320 decoding at least part of video stream
[0128] 330 analyzing decoded video data part
[0129] 340 generating processing assist data
[0130] 350 providing processing assist data to receiver system
[0131] 400 method for processing video data received as video stream
[0132] 410 receiving video stream
[0133] 420 decoding video stream
[0134] 430 receiving processing assist data from intermediary system
[0135] 440 processing video data using processing assist data
[0136] 500 computer readable medium
[0137] 510 non-transitory data
[0138] 600 video frame
[0139] 602 person (foreground)
[0140] 604 room (background)
[0141] 610 video frame after background removal
[0142] 620 foreground/background segmentation mask
[0143] 700, 702 video frame
[0144] 710 HMD
[0145] 720 3D model of user
[0146] 730, 732 selected part of 3D model
[0147] 740, 742 image part showing selected part of 3D model
[0148] 750, 752 video frame after HMD removal
[0149] 800 user recorded by handheld camera
[0150] 802 room
[0151] 810-814 handheld camera
[0152] 820-824 video frame recorded by handheld camera
[0153] 1000 exemplary data processing system
[0154] 1002 processor
[0155] 1004 memory element
[0156] 1006 system bus
[0157] 1008 local memory
[0158] 1010 bulk storage device
[0159] 1012 input device
[0160] 1014 output device
[0161] 1016 network adapter
[0162] 1018 application
DETAILED DESCRIPTION OF EMBODIMENTS
[0163] Some of the following embodiments are described within the context of ‘Social VR’ where a number of users participate in a teleconference using HMDs and cameras and in which it may be desirable to process a video containing a live camera recording of a user to make the video suitable for being shown in the virtual environment, for example by background removal or the replacement of an HMD by a 3D model of the user's face. However, the processing assist data and the framework for generating, transmitting and using the processing assist data as described in this specification may also be applied in all other applications in which the processing of the video comprises an analysis phase which may, at the receiver system, be substituted by an analysis result or a processing instruction derived from the analysis result. A typical example may be the replacing of items in a video by other items, e.g., for product placement, such as showing the local beer instead of a national beer brand. It is further noted that in the following, any reference to a ‘video stream’ may refer to a data representation of a video which is suitable for being streamed, e.g., using known streaming techniques. Any reference to ‘video encoding’ and/or ‘video decoding’ may refer to the use of any suitable video coding technique, including but not limited to video coding techniques based on MPEG-2 Part 2, MPEG-4 Part 2, H.264 (MPEG-4 Part 10), HEVC, etc. Furthermore, a reference to a ‘video’ may include a video stream but also a data representation of the video which is not (yet) suitable for being streamed or at least conventionally not intended for streaming. In the Figures, video (streams) may be schematically represented by a single video frame.
[0164]
[0165] As indicated earlier and also further discussed with reference to
[0166]
[0167] When following the example of the mirroring of the video data, the processing assist data may contain a processing instruction which instructs the receiver system 200 to mirror the video data along a particular axis, with the need for mirroring and/or the axis being determined by the analysis 104 (which is visually indicated by an adjusted depiction of the figure). Various other types of analysis results and/or processing instructions are also conceivable, and discussed with reference to
[0168] It is noted that in
[0169] For example, the network 040 may include a part of a core network of a telecommunication network, while the network 042 may include a same or adjoining part of the core network and an access network.
[0170]
[0171]
[0172] It can be seen that the decoding 102 of a video stream part by the intermediary system 100 and the decoding 202 of the same video stream part by the receiver system 200 may be at least partially performed in parallel, or at least the decoding 202 may not have to await the transmission 108 of the processing assist data since the decoding 202 of the video stream part does not require the processing assist data. The degree of parallelism may depend on various factors, including network delays. For example, in some embodiments, the network delay between the sender system 020 and the intermediary system 100 may be smaller than the network delay between the sender system 020 and the receiver system 200, thereby causing the transmission 107 of the video stream from the sender system 020 to the receiver system 200 to take longer than the transmission 024 of the video stream from the sender system 020 to the intermediary system 100. An example is that the sender system 020 and the intermediary system 100 may be both located in a core network of a telecommunication network, whereas the receiver system 200 may be connected to the core network via an access network. Similarly, if the video stream is forwarded 106 by the intermediary system 100 to the receiver system 200 instead of being directly transmitted by the sender system 020, the decoding 202 by the receiver system 200 may also be delayed compared to the decoding 102 by the intermediary system 100.
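The timing argument above can be illustrated with a small back-of-the-envelope model. The following sketch is not part of the specification: the function names, the parameters and all delay values (in milliseconds) are hypothetical assumptions chosen purely for illustration. It compares a pipeline in which the intermediary system decodes, processes and re-encodes the video with a pipeline in which the receiver system decodes in parallel and only has to wait for the processing assist data.

```python
# Illustrative timing model only; all names and numbers are assumptions.

def delay_intermediary_processing(t_to_intermediary, t_decode, t_analyze,
                                  t_process, t_encode, t_to_receiver):
    """Intermediary decodes, analyzes, processes and re-encodes the video;
    the receiver then has to decode the re-encoded stream again."""
    return (t_to_intermediary + t_decode + t_analyze + t_process
            + t_encode + t_to_receiver + t_decode)


def delay_assisted_receiver(t_to_intermediary, t_to_receiver, t_decode,
                            t_analyze, t_assist_to_receiver, t_process):
    """Receiver decodes the stream itself and processes it using assist data."""
    # Path A: the video stream reaches the receiver and is decoded there.
    video_ready = t_to_receiver + t_decode
    # Path B: the intermediary receives, decodes and analyzes the stream,
    # then sends the (small) processing assist data to the receiver.
    assist_ready = (t_to_intermediary + t_decode + t_analyze
                    + t_assist_to_receiver)
    # Processing can start once both the decoded video and the assist data
    # are available, so the slower of the two parallel paths dominates.
    return max(video_ready, assist_ready) + t_process


conventional = delay_intermediary_processing(
    t_to_intermediary=10, t_decode=15, t_analyze=20,
    t_process=10, t_encode=30, t_to_receiver=25)
assisted = delay_assisted_receiver(
    t_to_intermediary=10, t_to_receiver=30, t_decode=15,
    t_analyze=20, t_assist_to_receiver=5, t_process=10)
print(conventional, assisted)  # prints "125 60"
```

With these assumed values the assisted pipeline saves the re-encoding step and the second decode, and the receiver-side decoding is hidden behind the analysis path. Real savings depend on the actual network and processing delays.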
[0173] It can be seen that the overall delay between the encoding 022 by the sender system 020 and the receiver system 200 obtaining a decoded and processed video data part may correspond to D₂, which may be smaller than D₁ of
[0174] Another factor in the end-to-end delay from streaming a video stream by a stream source to display of a video stream by the receiver system may be buffering. Typically, before decoding a video stream, the video stream may be buffered at the receiver system. This may be done to ensure continuous playback. As networks may cause jitter, i.e. certain packets on the network may suffer larger delays than others, buffering may be used to ensure that decoding and displaying of video frames may be continuous. This buffering is typically one of the major factors in the end-to-end delay. As the intermediary system may not (have to) display the video, the intermediary system may omit buffering the video stream before processing, or suffice with a much more limited buffer, and may in general simply process the video stream as it arrives.
[0175]
[0176] In general, the processing assist data 110 may be provided in a manner which enables the receiver system to associate the processing assist data 110 with the video stream or the decoded video data. For example, the processing assist data 110 may contain an identifier of the video stream. Various other means of association are equally conceivable. For example, in some embodiments, the video stream may link to the processing assist data 110, e.g., by containing an URL at which the processing assist data 110 is accessible. In some embodiments, the processing assist data 110 may be provided in a manner which enables the receiver system to associate the processing assist data 110 with a specific part of the video stream or decoded video data part, for example the same part which was used as input to the analysis on the basis of which the processing assist data 110 was generated. For example, the intermediary system may include timing information in the processing assist data 110 which may be indicative of the part of the video stream or the decoded video data part from which the processing assist data was generated. In a specific example, the timing information may be a sequence number and/or a content timestamp which is also present in the video stream, such as a Presentation TimeStamp (PTS) value.
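As a minimal sketch of such an association, the receiver system could index incoming processing assist data by a stream identifier and a content timestamp, and look it up when a decoded frame with the same timestamp is about to be processed. The class and field names below (`ProcessingAssistData`, `AssistDataIndex`, `stream_id`, `pts`, `payload`) are hypothetical and are not defined by this specification.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple


@dataclass
class ProcessingAssistData:
    stream_id: str  # identifier of the associated video stream (assumed name)
    pts: int        # content timestamp also present in the stream, e.g. a PTS
    payload: Dict[str, Any] = field(default_factory=dict)  # result/instruction


class AssistDataIndex:
    """Receiver-side lookup of assist data by (stream_id, pts)."""

    def __init__(self) -> None:
        self._by_key: Dict[Tuple[str, int], ProcessingAssistData] = {}

    def add(self, assist: ProcessingAssistData) -> None:
        self._by_key[(assist.stream_id, assist.pts)] = assist

    def for_frame(self, stream_id: str, pts: int) -> Optional[ProcessingAssistData]:
        # Remove and return the matching entry, or None if it has not arrived
        # (yet). Each entry is consumed by the frame it belongs to.
        return self._by_key.pop((stream_id, pts), None)


index = AssistDataIndex()
index.add(ProcessingAssistData("cam-1", 9000, {"mirror_axis": "vertical"}))
match = index.for_frame("cam-1", 9000)   # assist data for this exact frame
missing = index.for_frame("cam-1", 12000)  # None: nothing received for it
```

Whether a frame without matching assist data is held back, processed with the most recent assist data or left unprocessed is a design choice that this sketch leaves open.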
[0177] In general, examples of analysis by the intermediary system may include the following. In the previously mentioned and specific example of Social VR, such analysis may include performing foreground/background segmentation. Accordingly, the processing assist data 110 may comprise a 1-bit mask identifying foreground and background. The analysis may further include detecting the location and orientation of the HMD to identify the location and orientation of the user's head and face in a video frame. Accordingly, the processing assist data 110 may comprise corresponding location data and orientation data. The analysis may further include selecting a part and angle of a 3D model for facial reconstruction, e.g., to replace the HMD occluding part of the user's face. Accordingly, the processing assist data 110 may further comprise an indication of the angle and the part of the 3D model which is to be used.
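To illustrate how a 1-bit foreground/background mask might be carried and applied, the following sketch packs a mask at 8 pixels per byte, unpacks it again, and uses it for background removal on a decoded frame. The bit order (row by row, most significant bit first), the function names and the representation of a frame as nested lists of RGB tuples are assumptions made for this example only.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def pack_mask(mask: List[List[int]]) -> bytes:
    """Pack a 2D mask of 0/1 values row by row, 8 pixels per byte, MSB first."""
    bits = [bit for row in mask for bit in row]
    out = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)


def unpack_mask(data: bytes, width: int, height: int) -> List[List[int]]:
    """Inverse of pack_mask for a mask of the given dimensions."""
    bits = [(data[i // 8] >> (7 - i % 8)) & 1 for i in range(width * height)]
    return [bits[r * width:(r + 1) * width] for r in range(height)]


def remove_background(frame: List[List[Pixel]], mask: List[List[int]],
                      fill: Pixel = (0, 0, 0)) -> List[List[Pixel]]:
    """Keep pixels where the mask is 1 (foreground), replace the rest."""
    return [[px if m else fill for px, m in zip(frame_row, mask_row)]
            for frame_row, mask_row in zip(frame, mask)]


mask = [[0, 1, 1],
        [1, 0, 0]]
packed = pack_mask(mask)               # 6 mask bits fit in a single byte
restored = unpack_mask(packed, width=3, height=2)
frame = [[(10, 10, 10), (200, 0, 0), (0, 200, 0)],
         [(0, 0, 200), (20, 20, 20), (30, 30, 30)]]
foreground_only = remove_background(frame, restored)
```

The receiver system only performs the inexpensive per-pixel selection, while the segmentation that produced the mask is performed elsewhere.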
[0178] In some embodiments, the processing assist data 110 may be comprised of different types of data, such as the aforementioned 1-bit segmentation mask and location data and orientation data. In such embodiments, the different types of data may also be transmitted separately, e.g., as processing assist data parts, and in some embodiments may be provided at different time intervals. For example, if the processing assist data 110 contains calibration data and a 1-bit segmentation mask, such calibration data may be provided once at a start of streaming while the 1-bit segmentation mask may be provided every n-th video frame, with n≥1, or adaptively and thereby aperiodically depending on an amount of motion in the video data.
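The different provision intervals can be sketched as a simple per-frame decision at the intermediary system. The function name, the part labels and the use of a scalar motion measure with a threshold are hypothetical choices for this example; the specification does not prescribe how the amount of motion is measured.

```python
from typing import List, Optional


def assist_parts_for_frame(frame_index: int, n: int = 1, motion: float = 0.0,
                           motion_threshold: Optional[float] = None) -> List[str]:
    """Decide which processing assist data parts to send for a given frame.

    Periodic mode (motion_threshold is None): mask every n-th frame, n >= 1.
    Adaptive mode: mask on the first frame and whenever the measured motion
    reaches the threshold, which makes the mask updates aperiodic.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    parts: List[str] = []
    if frame_index == 0:
        parts.append("calibration")  # provided once, at the start of streaming
    if motion_threshold is not None:
        if frame_index == 0 or motion >= motion_threshold:
            parts.append("segmentation_mask")
    elif frame_index % n == 0:
        parts.append("segmentation_mask")
    return parts


periodic = [assist_parts_for_frame(i, n=3) for i in range(4)]
adaptive = assist_parts_for_frame(5, motion=0.8, motion_threshold=0.5)
```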
[0179]
[0180]
[0181] A (simplified) procedure is shown in
[0182] For such and similar types of HMD removal, the processing assist data may contain several types of data, including but not limited to one or more of:
[0183] The detected position and orientation of the HMD in the video frame. This may be described as the center point of the HMD in the frame (indicating x and y coordinates, and possibly depth if the video frame includes depth), or as the coordinates of a bounding box (which also include size information), and may describe the orientation using an axis system with a third (z) axis orthogonally coming out of the frame, allowing orientation to be described in terms of a vector, or in terms of yaw-pitch-roll.
[0184] The part of the 3D model to be used and the scaling to be applied for appropriate sizing. This may assume the same 3D model is available at both the intermediary system and the receiver system. The part of the 3D model may be indicated as coordinates in the 3D model's UV projection, where the orientation may also be described in a 3D axial system. Note that the part of the 3D model to be used may be similar for different orientations, and therefore both coordinates and the orientation may be indicated by the processing assist data.
[0185] The exact or at least approximate location where the part of the 3D model may need to be placed in the original video frame. The location may be given in coordinates in the video frame, including depth coordinates if applicable.
[0186] The adjustments to be made to the final result, for example in terms of filters to be applied to (possibly specified) parts of the resulting video frame, for example edge smoothing, color correction and/or lighting correction.
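The types of data listed above could, for example, be grouped into a single record per video frame. The sketch below is one possible layout and not a format defined by this specification: all class and field names are hypothetical, and JSON is merely an assumed serialization for the example.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional, Tuple


@dataclass
class HmdPose:
    center: Tuple[float, float]                 # x, y of HMD center in frame
    depth: Optional[float]                      # only if the frame has depth
    yaw_pitch_roll: Tuple[float, float, float]  # orientation in degrees


@dataclass
class ModelPart:
    uv_box: Tuple[float, float, float, float]   # u_min, v_min, u_max, v_max
    orientation: Tuple[float, float, float]     # orientation of the 3D model
    scale: float                                # scaling for appropriate sizing


@dataclass
class HmdRemovalAssist:
    pts: int                                    # timestamp of the video frame
    hmd: HmdPose
    model_part: ModelPart
    placement: Tuple[float, float]              # where to place the model part
    filters: List[str] = field(default_factory=list)  # e.g. edge smoothing


def to_json(assist: HmdRemovalAssist) -> str:
    """Serialize the record; nested dataclasses become nested JSON objects."""
    return json.dumps(asdict(assist))


assist = HmdRemovalAssist(
    pts=9000,
    hmd=HmdPose(center=(320.0, 180.0), depth=None,
                yaw_pitch_roll=(15.0, -5.0, 0.0)),
    model_part=ModelPart(uv_box=(0.25, 0.1, 0.75, 0.5),
                         orientation=(15.0, -5.0, 0.0), scale=1.2),
    placement=(300.0, 160.0),
    filters=["edge_smoothing", "color_correction"],
)
decoded = json.loads(to_json(assist))
```

Because both systems are assumed to hold the same 3D model, only the UV coordinates, orientation and scale need to be transmitted rather than any model geometry.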
[0187] Various other types of analysis for HMD removal, and corresponding types of processing assist data, are equally conceivable. For example, detected facial expression and eye orientation may also be part of the processing assist data.
[0188]
[0189] The intermediary system as described elsewhere may perform an analysis which may assist in such video stabilization. In
[0190] A first way may be to detect the actual movement of the camera 810-814, and indicate this movement as processing assist data to the receiver system. The movement may comprise or consist of a change in position and a change in orientation of the camera. The detection itself may be done using static background parts. In this example, the user 800 may be in a room 802 near the corner of the room. The lines where walls meet and where ceiling and walls meet are shown. As the camera moves, the perspective on this static background changes and thus the camera movement may be derived from captured video frames, as known in the art.
[0191] For describing a change in position and orientation, an axial system may be defined. Such an axial system typically consists of an X, Y and Z axis, and rotations about the axes may be defined using either a right-handed or left-handed method (e.g., thumb in direction of the axis, fingers point in the positive rotation direction). Looking straight forward may be defined as 0 rotation on all axes. Thus, an initial video frame 820 from a moving camera may be defined by position P=(0, 0, 0) and rotation R=(0, 0, 0). Updates to the position and rotation may be sent by sending new position and rotation value vectors, or by sending updates on the previous values. For example, camera position 2 shown is to the left (negative Y) and a bit forward (positive X), and rotated on the vertical axis (positive Z rotation), which may be represented as position P=(+0.2, −0.5, 0) and rotation R=(0, 0, 20°). Similarly, camera position 3 may be represented as P=(+0.5, −0.8, 0) and rotation R=(0, 0, 60°). This information may be provided as processing assist data to the receiver system, possibly with a reference to the timestamp of a video frame to synchronize said data with the video frame.
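The two update styles mentioned above, sending new absolute values or sending updates on the previous values, can be sketched as follows using the example camera positions. The names `CameraPose`, `pose_delta` and `apply_delta` are hypothetical. Note the simplifying assumption that rotations are combined by component-wise addition, which holds for the single-axis rotation of this example but not for arbitrary 3D rotations.

```python
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


def _add(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def _sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


@dataclass(frozen=True)
class CameraPose:
    position: Vec3 = (0.0, 0.0, 0.0)      # P, along the X, Y and Z axes
    rotation_deg: Vec3 = (0.0, 0.0, 0.0)  # R, rotation about X, Y and Z


def pose_delta(previous: CameraPose, current: CameraPose) -> CameraPose:
    """Update relative to the previous values (intermediary side)."""
    return CameraPose(_sub(current.position, previous.position),
                      _sub(current.rotation_deg, previous.rotation_deg))


def apply_delta(previous: CameraPose, delta: CameraPose) -> CameraPose:
    """Reconstruct the absolute pose from a relative update (receiver side).
    Assumption: component-wise addition, valid for single-axis rotation."""
    return CameraPose(_add(previous.position, delta.position),
                      _add(previous.rotation_deg, delta.rotation_deg))


pose1 = CameraPose()                                   # initial video frame
pose2 = CameraPose((0.2, -0.5, 0.0), (0.0, 0.0, 20.0))
pose3 = CameraPose((0.5, -0.8, 0.0), (0.0, 0.0, 60.0))

update_2_to_3 = pose_delta(pose2, pose3)  # roughly P +(0.3, -0.3, 0), R +40
reconstructed = apply_delta(pose2, update_2_to_3)
```

Relative updates are smaller but accumulate error if an update is lost, so a stream of updates might periodically include an absolute pose as well.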
[0192] A second way may be to describe the change in the position and orientation of the object captured in the video frame. As the camera moves, the object may be captured from a different position, and the orientation of the object in the captured video frames may be different. The movement of the object in the video frame may be described by a translation and rotation vector on 3 axes. These values may be determined by analyzing the captured video frames, as known in the art.
[0193] In
[0194] It is noted that in
[0195] The processor system 100 may be embodied by a (single) device or apparatus. For example, the processor system 100 may be embodied by a server, workstation, personal computer, etc. The processor system 100 may also be embodied by a distributed system of such devices or apparatuses. An example of the latter may be the functionality of the processor system 100 being at least in part distributed over network elements in a network. In another example, the processor system 100 may be embodied by an edge node of a 5G or next-gen telecommunication network.
[0196]
[0197] The processor 240 may be embodied by a single Central Processing Unit (CPU), but also by a combination or system of such CPUs and/or other types of processing units, such as for example Graphics Processing Units (GPUs). Although not shown in
[0198] In general, the processor system 100 of
[0199]
[0200]
[0201] It will be appreciated that, in general, the operations of method 300 of
[0202] It is noted that any of the methods described in this specification, for example in any of the claims, may be implemented on a computer as a computer implemented method, as dedicated hardware, or as a combination of both. Instructions for the computer, e.g., executable code, may be stored on a computer readable medium 500 as for example shown in
[0203] In an alternative embodiment of the computer readable medium 500 of
[0204]
[0205] The data processing system 1000 may include at least one processor 1002 coupled to memory elements 1004 through a system bus 1006. As such, the data processing system may store program code within memory elements 1004. Furthermore, processor 1002 may execute the program code accessed from memory elements 1004 via system bus 1006. In one aspect, data processing system may be implemented as a computer that is suitable for storing and/or executing program code. It should be appreciated, however, that data processing system 1000 may be implemented in the form of any system including a processor and memory that is capable of performing the functions described within this specification.
[0206] The memory elements 1004 may include one or more physical memory devices such as, for example, local memory 1008 and one or more bulk storage devices 1010. Local memory may refer to random access memory or other non-persistent memory device(s) generally used during actual execution of the program code. A bulk storage device may be implemented as a hard drive, solid state disk or other persistent data storage device. The data processing system 1000 may also include one or more cache memories (not shown) that provide temporary storage of at least some program code in order to reduce the number of times program code is otherwise retrieved from bulk storage device 1010 during execution.
[0207] Input/output (I/O) devices depicted as input device 1012 and output device 1014 optionally can be coupled to the data processing system. Examples of input devices may include, but are not limited to, for example, a microphone, a keyboard, a pointing device such as a mouse, a game controller, a Bluetooth controller, a VR controller, and a gesture-based input device, or the like. Examples of output devices may include, but are not limited to, for example, a monitor or display, speakers, or the like. Input device and/or output device may be coupled to data processing system either directly or through intervening I/O controllers. A network adapter 1016 may also be coupled to data processing system to enable it to become coupled to other systems, computer systems, remote network devices, and/or remote storage devices through intervening private or public networks. The network adapter may comprise a data receiver for receiving data that is transmitted by said systems, devices and/or networks to said data processing system, and a data transmitter for transmitting data to said systems, devices and/or networks. Modems, cable modems, and Ethernet cards are examples of different types of network adapter that may be used with data processing system 1000.
[0208] As shown in
[0209] For example, data processing system 1000 may represent a processor system as described with reference to
[0210] In accordance with an abstract of the present specification, an intermediary system and method may be provided for assisting a receiver system in processing video data which is streamed as a video stream to the receiver system via a network. The processing of the video data by the receiver system may be dependent on an analysis of the video data. The intermediary system may provide processing assist data to the receiver system which comprises an analysis result or a processing instruction derived from the analysis results. Accordingly, the receiver system may process the video data without a need for the receiver system itself to analyze the video data, thereby offloading computational complexity to the intermediary system. Compared to techniques in which most or all of the processing is performed by the intermediary system, an advantage of continuing to process the video data at the receiver system may be that the receiver system may already decode the video stream while the video stream is decoded and/or analyzed by the intermediary system. This may reduce the delay from transmission by a sender system to display by the receiver system.
[0211] In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. Use of the verb “comprise” and its conjugations does not exclude the presence of elements or steps other than those stated in a claim. Expressions such as “at least one of” when preceding a list or group of elements represent a selection of all or of any subset of elements from the list or group. For example, the expression, “at least one of A, B, and C” should be understood as including only A, only B, only C, both A and B, both A and C, both B and C, or all of A, B, and C. The article “a” or “an” preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the device claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.