Obtaining image data of an object in a scene

11582383 · 2023-02-14

Abstract

A method and processor system are provided which analyze a depth map, which may be obtained from a range sensor capturing depth information of a scene, to identify where an object is located in the scene. Accordingly, a region of interest may be identified in the scene which includes the object, and image data may be selectively obtained of the region of interest, rather than of the entire scene containing the object. This image data may be acquired by an image sensor configured for capturing visible light information of the scene. By only selectively obtaining the image data within the region of interest, rather than all of the image data, improvements may be realized in the computational complexity of a possible further processing of the image data, the storage of the image data and/or the transmission of the image data.

Claims

1. A method of obtaining image data of an object in a scene using a range sensor and an image sensor, wherein the range sensor and the image sensor have a known spatial relation, the range sensor is configured for capturing depth information of the scene, the image sensor is configured for capturing visible light information of the scene, the method comprising: obtaining a depth map of the scene acquired by the range sensor; analyzing the depth map to identify a region of interest in the scene, wherein the region of interest contains the object; generating selection data indicating the region of interest; and based on the selection data, selectively obtaining image data of the region of interest, wherein the image data is acquired by the image sensor and the image data of the scene acquired by the image sensor is accessible by streaming from a media source via a network, and wherein the selectively obtaining the image data of the region of interest comprises signalling the media source the selection data so as to request a selective streaming of the image data of the region of interest.

2. The method according to claim 1, wherein the selectively obtaining the image data of the region of interest comprises selectively receiving the image data of the region of interest via a bandwidth constrained link, such as a bus or a network.

3. The method according to claim 1, wherein the selectively obtaining the image data of the region of interest comprises: configuring the image sensor to selectively acquire the visible light information of the scene within the region of interest; and/or selectively reading-out the image data of the region of interest from a memory comprised in or connected to the image sensor.

4. The method according to claim 1, wherein: the image data of the scene acquired by the image sensor is accessible by tile-based streaming from the media source via the network; and the selection data is generated to comprise an identifier of one or more tiles which comprise the image data of the region of interest, wherein said selection data is generated on the basis of spatial relationship description data defining a spatial relationship between different tiles available for streaming.

5. The method according to claim 1, wherein the identifying the region of interest in the scene comprises: applying an object detection technique to the depth map; and/or identifying an object in the depth map on the basis of the object's depth values indicating a proximity to the depth sensor.

6. The method according to claim 1, further comprising applying a background removal technique to the image data of the region of interest so as to remove a background surrounding the object in the image data.

7. A non-transitory computer-readable medium comprising a computer program, the computer program comprising instructions for causing a processor system to perform the method according to claim 1.

8. A method of obtaining image data of an object in a scene using a range sensor and an image sensor, wherein the range sensor and the image sensor have a known spatial relation, the range sensor is configured for capturing depth information of the scene, the image sensor is configured for capturing visible light information of the scene, the method comprising: obtaining a depth map of the scene acquired by the range sensor; analyzing the depth map to identify a region of interest in the scene, wherein the region of interest contains the object, wherein the identifying the region of interest in the scene comprises: obtaining a first depth map of the scene acquired by the range sensor when the object is not present; obtaining a second depth map of the scene acquired by the range sensor when the object is present; and identifying the region of interest in the scene based on a comparison of the first depth map and the second depth map; generating selection data indicating the region of interest and based on the selection data, selectively obtaining image data of the region of interest, wherein the image data is acquired by the image sensor.

9. The method according to claim 8, wherein the selectively obtaining the image data of the region of interest comprises selectively receiving the image data of the region of interest via a bandwidth constrained link, such as a bus or a network.

10. The method according to claim 8, wherein the selectively obtaining the image data of the region of interest comprises: configuring the image sensor to selectively acquire the visible light information of the scene within the region of interest; and/or selectively reading-out the image data of the region of interest from a memory comprised in or connected to the image sensor.

11. The method according to claim 8, wherein the identifying the region of interest in the scene comprises: applying an object detection technique to the depth map; and/or identifying an object in the depth map on the basis of the object's depth values indicating a proximity to the depth sensor.

12. The method according to claim 8, further comprising applying a background removal technique to the image data of the region of interest so as to remove a background surrounding the object in the image data.

13. A non-transitory computer-readable medium comprising a computer program, the computer program comprising instructions for causing a processor system to perform the method according to claim 8.

14. A method of obtaining image data of an object in a scene using a range sensor and an image sensor, wherein the range sensor and the image sensor have a known spatial relation, the range sensor is configured for capturing depth information of the scene, the image sensor is configured for capturing visible light information of the scene, the method comprising: obtaining a depth map of the scene acquired by the range sensor, wherein the depth map is acquired by the range sensor at a first time instance; analyzing the depth map to identify a region of interest in the scene, wherein the region of interest contains the object; generating selection data indicating the region of interest, wherein the generating selection data comprises compensating for movement of the object with respect to the scene between the first time instance and a second time instance which is later in time than the first time instance; and based on the selection data, selectively obtaining image data of the region of interest, wherein the image data is acquired by the image sensor, and wherein the selectively obtained image data is acquired by the image sensor at the second time instance.

15. The method according to claim 14, wherein said compensating for the movement of the object comprises at least one of: adding a margin to an outline of the region of interest; and adjusting a spatial location of the region of interest based on a prediction of the movement of the object, for example: by applying motion estimation to at least two depth maps acquired at different time instances to determine the movement of the object and extrapolating said movement to the second time instance.

16. The method according to claim 14, wherein the selectively obtaining the image data of the region of interest comprises selectively receiving the image data of the region of interest via a bandwidth constrained link, such as a bus or a network.

17. The method according to claim 14, wherein the selectively obtaining the image data of the region of interest comprises: configuring the image sensor to selectively acquire the visible light information of the scene within the region of interest; and/or selectively reading-out the image data of the region of interest from a memory comprised in or connected to the image sensor.

18. The method according to claim 14, wherein the identifying the region of interest in the scene comprises: applying an object detection technique to the depth map; and/or identifying an object in the depth map on the basis of the object's depth values indicating a proximity to the depth sensor.

19. The method according to claim 14, further comprising applying a background removal technique to the image data of the region of interest so as to remove a background surrounding the object in the image data.

20. A non-transitory computer-readable medium comprising a computer program, the computer program comprising instructions for causing a processor system to perform the method according to claim 14.

21. A processor system configured for obtaining image data of an object in a scene using a range sensor and an image sensor, wherein the range sensor and the image sensor have a known spatial relation, the range sensor is configured for capturing depth information of the scene, the image sensor is configured for capturing visible light information of the scene, wherein the image data of the scene acquired by the image sensor is accessible by streaming from a media source via a network, and the processor system comprising: a communication interface, wherein the communication interface is a network interface to the network comprising a bandwidth constrained link; a processor configured to: via the communication interface, obtain a depth map of the scene acquired by the range sensor; analyze the depth map to identify a region of interest in the scene which contains the object; generate selection data indicating the region of interest; and based on the selection data and using the communication interface, signal the media source a spatial location of the region of interest so as to request a selective streaming of the image data of the region of interest and selectively obtain image data of the region of interest from the media source.

22. The processor system according to claim 21, wherein: the processor system comprises and is connected to the range sensor and the image sensor via the communication interface and an internal bus, or the processor system is connected to the range sensor and the image sensor via the communication interface and an external bus.

23. The processor system according to claim 22, wherein the processor is configured to, via the communication interface: configure the image sensor to selectively acquire the visible light information of the scene within the region of interest; and/or selectively read-out the image data of the region of interest from a memory comprised in or connected to the image sensor.

24. A processor system configured as a media source and comprising: a storage medium for at least temporary storing: at least one depth map of a scene which is acquired by a range sensor configured for capturing depth information of the scene; at least one visible light image of the scene which is acquired by an image sensor configured for capturing visible light information of the scene, the image sensor having a known spatial relation with the range sensor; a network interface to a network comprising a bandwidth constrained link to enable the processor system to communicate with a media client, wherein image data of the scene acquired by the image sensor is accessible by streaming from the media source to the media client via the network; a processor configured to, via the network interface: provide the depth map to the media client; receive selection data from the media client which is indicative of a region of interest with respect to the scene; and based on the selection data, selectively stream the image data of the region of interest to the media client.

25. A processor system configured for obtaining image data of an object in a scene using a range sensor and an image sensor, wherein the range sensor and the image sensor have a known spatial relation, the range sensor is configured for capturing depth information of the scene, the image sensor is configured for capturing visible light information of the scene, and the processor system comprising: a communication interface to a bandwidth constrained link, such as a bus or a network; a processor configured to: via the communication interface, obtain a depth map of the scene acquired by the range sensor; analyze the depth map to identify a region of interest in the scene which contains the object, wherein the identifying the region of interest in the scene comprises: obtaining a first depth map of the scene acquired by the range sensor when the object is not present; obtaining a second depth map of the scene acquired by the range sensor when the object is present; and identifying the region of interest in the scene based on a comparison of the first depth map and the second depth map; generate selection data indicating the region of interest; and based on the selection data and using the communication interface, selectively obtain image data of the region of interest acquired by the image sensor.

26. A processor system configured for obtaining image data of an object in a scene using a range sensor and an image sensor, wherein the range sensor and the image sensor have a known spatial relation, the range sensor is configured for capturing depth information of the scene, the image sensor is configured for capturing visible light information of the scene, and the processor system comprising: a communication interface to a bandwidth constrained link, such as a bus or a network; a processor configured to: via the communication interface, obtain a depth map of the scene acquired by the range sensor at a first time instance; analyze the depth map to identify a region of interest in the scene which contains the object; generate selection data indicating the region of interest comprising compensation for movement of the object with respect to the scene between the first time instance and a second time instance which is later in time than the first time instance; and based on the selection data and using the communication interface, selectively obtain image data of the region of interest acquired by the image sensor at a second time instance.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) These and other aspects of the invention are apparent from and will be elucidated with reference to the embodiments described hereinafter. In the drawings,

(2) FIG. 1 shows a camera comprising a range sensor and an image sensor, and a processor system configured to analyze a depth map provided by the camera to identify a region of interest which contains an object, and to generate selection data identifying the region of interest so as to selectively obtain its image data;

(3) FIG. 2 shows an example of a region of interest within an image;

(4) FIG. 3 shows a message exchange between the processor system and the camera so as to selectively obtain the image data of the region of interest;

(5) FIG. 4 shows an example of a tiled representation of an image, wherein the region of interest is represented by a subset of tiles of said tiled representation;

(6) FIG. 5 shows a message exchange between a media client and a media source in which the image data of the region of interest is obtained by streaming;

(7) FIG. 6 shows a message exchange between a media client and a media source in which the media source ‘pushes’ data and instructions to the media client;

(8) FIG. 7 illustrates prediction of object movement;

(9) FIG. 8 provides an overview of the sender system;

(10) FIG. 9 provides an overview of the rendering system; and

(11) FIG. 10 shows an exemplary data processing system.

(12) It should be noted that items which have the same reference numbers in different figures, have the same structural features and the same functions, or are the same signals. Where the function and/or structure of such an item has been explained, there is no necessity for repeated explanation thereof in the detailed description.

LIST OF REFERENCES AND ABBREVIATIONS

(13) The following list of references and abbreviations is provided for facilitating the interpretation of the drawings and shall not be construed as limiting the claims.

050 bandwidth constrained link
100 processor system
105 processor system configured as media client
120 communication interface
140 processor
142 depth processor
144 image processor
160 data storage
200 camera
220 range sensor
240 image sensor
300 processor system
305 processor system configured as media source
320 communication interface
340 processor
360 data storage
400 depth map
402 object mask
410 selection data
420 image
422, 424 tiled representation of image
430 region of interest
432 image data of region of interest
440 tiles representing region of interest
442-446 tile representing region of interest at different time instances
450 object
452 image data of object
500 determine ROI
502 repeated sending of depth map, determine ROI
510 remove background
512 remove background, update ROI
520 stream foreground
1000 exemplary data processing system
1002 processor
1004 memory element
1006 system bus
1008 local memory
1010 bulk storage device
1012 input device
1014 output device
1016 network adapter
1018 application

DETAILED DESCRIPTION OF EMBODIMENTS

(14) The following embodiments involve or relate to selectively obtaining image data of a region of interest in a scene on the basis of selection data which spatially indicates the region of interest. The selection data may be generated by analyzing a depth map of the scene, and the region of interest may be determined so as to include an object in the scene. A non-limiting example is that the object may be a person.

(15) In the following, specific embodiments or examples are described which apply background removal/foreground extraction to the image data of the region of interest, e.g., for video-based VR conferencing, which is described in more detail under the heading ‘Foreground extraction’. The object may here also be denoted as ‘foreground object’ or simply as ‘foreground’. However, as already indicated in the introductory section, such image data of a region of interest may also be used for various other use-cases, e.g., reducing storage space when storing security video footage or reducing computational complexity when performing image enhancements. Additional embodiments pertaining to such other use-cases are well within reach of the skilled person on the basis of the present disclosure.

(16) Processor System and Camera Embodiments

(17) FIG. 1 shows a first embodiment in which a camera 200 is provided which comprises a range sensor 220 and an image sensor 240. Such type of camera is known per se, e.g., in the form of a Microsoft Kinect or a Razer Stargazer or Asus Zenfone AR, and may use various types of range sensors, as for example described in the summary of the invention. Alternatively to having a separate range sensor 220, the camera 200 may also comprise an additional image sensor arranged in a stereoscopic manner with the first-mentioned image sensor 240. In this example, both image sensors may together represent a range sensor 220 operating on the principle of stereoscopic triangulation.

(18) In a specific example, the camera 200 may be a so-termed RGB-D camera having an image sensor 240 configured to acquire color image data having R, G and B color components, whereas the range sensor 220 may provide depth data comprising Depth (D) values. In the following, such color image data may be referred to as ‘color image’ or simply by the label ‘color’. It will be understood that the color image data may also be acquired and/or obtained in any other known color format, including but not limited to YUV or HSV. Additionally, instead of three-component color image data, also any other number of components may be acquired, including one (monochromatic).

(19) As also shown in FIG. 1, a processor system 100 may be provided which may be connected to the camera 200 via a bandwidth constrained link 050, such as an internal bus, external bus or network. The processor system 100 may be configured for obtaining, via the bandwidth constrained link 050, a depth map 400 of the scene acquired by the range sensor. The processor system 100 may comprise a depth processor 142 which may be configured for analyzing the depth map 400 to identify a region of interest in the scene, being specifically a region of interest containing a foreground object, and for generating selection data 410 indicating the region of interest. In this example, the processor system 100 may provide the selection data 410 to the camera 200, e.g., via the bandwidth constrained link 050. This may in turn cause the camera 200 to provide the image data 432 of the region of interest via the bandwidth constrained link 050 to the processor system 100. Note that multiple links, e.g., busses or networks, may together form the connection between camera and processor system. For example, a different network may be used for uplink and for downlink, or different links may be used for depth data and image data.

(20) In a specific example, the image data 432 of the region of interest may be selectively obtained on the basis of the processor system 100 configuring the image sensor 240 to selectively acquire the visible light information of the scene within the region of interest, e.g., as described in [2] or [3]. In this example, the selection data 410 may be accompanied by control data and/or configuration data so as to effect said configuration of the image sensor 240. In another specific example, the image data 432 of the region of interest may be selectively obtained by way of the processor system 100, and in particular the image processor 144, selectively reading-out the image data 432 from a memory comprised in or connected to the image sensor 240 (not shown). In yet another example, the image sensor 240 may capture the entire scene, with the camera comprising a processor (not shown) which selectively outputs the image data 432 of the region of interest on the basis of the received selection data 410.

(21) In accordance with the above-described use-case of video-based VR conferencing, the depth processor 142 may additionally generate a mask 402 representing the foreground object, e.g., as also described under ‘Foreground extraction’, and the processor system 100 may further comprise an image processor 144 which may be configured to, based on the mask 402, apply foreground extraction to the image data 432 so as to selectively obtain and output the image data 452 of the foreground object, e.g., to other participants in a video-based VR conference.

(22) With further reference to FIG. 1, a specific example may be described as follows. This example also discusses the temporal relationship between the analyzed depth map and the selectively obtained image data. Here, the camera 200 may be an RGB-D camera connected through a USB connection (the bandwidth constrained link 050) to a personal computer (in short ‘PC’, representing the processor system 100).

(23) The PC 100 may first retrieve a reference RGB-D image, referring to the combination of an image and a depth map acquired at substantially the same time. The PC 100 may then retrieve a stream of depth maps which may be acquired sequentially over time. Each depth map may be encoded with a timestamp or a sequence number. The PC 100 may then subtract the reference depth map from each subsequent depth map, and optionally post-process the result so as to obtain a depth mask which represents the foreground object. This depth mask is indicated as ‘depth foreground’ 402 in FIG. 1. The PC 100 may then determine the region of interest based on the depth mask. For example, the region of interest may be a rectangular area which encloses the foreground object while adding a margin around the foreground object. Image data of the region of interest may then be obtained from the camera 200. Such image data may correspond in time to the depth map from which the region of interest was determined, e.g., be associated with a same timestamp or sequence number. However, this may require the camera to buffer image data. Alternatively, the image data may be acquired at a later moment in time, e.g., after receiving the selection data, and thus may be associated with a later timestamp or sequence number.
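The depth subtraction and the derivation of a rectangular region of interest described above may be illustrated by the following minimal sketch. It is not taken from the disclosure: the function names, the threshold of 100 (in assumed millimetre units), the one-pixel margin and the toy depth maps are all hypothetical, and the optional post-processing of the mask is omitted.

```python
def foreground_mask(reference, current, threshold=100):
    """Mark pixels whose depth differs from the reference by more than threshold.

    The threshold value and the millimetre unit are assumptions for illustration.
    """
    return [
        [abs(cur - ref) > threshold for ref, cur in zip(ref_row, cur_row)]
        for ref_row, cur_row in zip(reference, current)
    ]


def bounding_roi(mask, margin=1):
    """Return (x0, y0, x1, y1) enclosing all marked pixels plus a margin.

    The lower-right corner (x1, y1) is exclusive; the result is clamped to
    the mask dimensions. Returns None when no pixel is marked.
    """
    height, width = len(mask), len(mask[0])
    xs = [x for row in mask for x, hit in enumerate(row) if hit]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None
    return (
        max(min(xs) - margin, 0),
        max(min(ys) - margin, 0),
        min(max(xs) + 1 + margin, width),
        min(max(ys) + 1 + margin, height),
    )


# Toy 6x5 depth maps: an empty reference scene, then the same scene with an object.
reference = [[3000] * 6 for _ in range(5)]
current = [row[:] for row in reference]
for y in (2, 3):
    for x in (2, 3):
        current[y][x] = 1200  # the object is closer to the sensor

mask = foreground_mask(reference, current)
roi = bounding_roi(mask, margin=1)
print(roi)
```

In this sketch the margin serves the purpose mentioned above of enclosing the foreground object with some slack; a larger margin would be one simple way to tolerate movement when the image data is acquired later than the depth map.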

(24) FIG. 2 shows an example of a region of interest 430 within an image 420, as may be spatially defined by the selection data. Here, the region of interest 430 is defined as a rectangular area between coordinates (800, 230) and (1200, 850). The object is here shown to be a person, in particular his/her upper torso and head.
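To give a rough sense of the data reduction in this example, the following sketch compares the pixel count of the region of interest with that of the entire image. The image size is not stated for FIG. 2, so a full HD frame of 1920×1080 pixels is assumed here, and the lower-right coordinate is treated as exclusive; both are assumptions for illustration only.

```python
# Assumption: a full HD frame; FIG. 2 itself does not state the image size.
FRAME_WIDTH, FRAME_HEIGHT = 1920, 1080

# Rectangle from the example; the lower-right corner is treated as exclusive.
x0, y0, x1, y1 = 800, 230, 1200, 850

roi_pixels = (x1 - x0) * (y1 - y0)
frame_pixels = FRAME_WIDTH * FRAME_HEIGHT
fraction = roi_pixels / frame_pixels

print(f"ROI: {roi_pixels} of {frame_pixels} pixels ({fraction:.1%})")
```

Under these assumptions the region of interest covers about 12 percent of the pixels of the image, before any encoding is considered.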

(25) FIG. 3 shows an example of a message exchange between the camera 200 and the processor system 100. This example also discusses the temporal relationship between the analyzed depth map and the selectively obtained image data. Here and in subsequent Figs. pertaining to message exchanges, the entities involved in the message exchange are represented by respective vertical lines, with horizontal arrows indicating a message exchange and the vertical axis representing a time axis.

(26) In this example, the camera 200 may be configured or modified to support partial capture of the RGB images, e.g., it may support a region of interest (ROI)-based capture. Furthermore, the camera 200 may support the requesting of images having a certain timestamp or sequence number. The camera 200 may further be configured to buffer a number of images to allow the processor system 100 to determine the region of interest and request the image data 432 of the region of interest for a particular timestamp or sequence number. Accordingly, the processor system 100 may request a depth map acquired at time T1 with a message labeled ‘REQ(Depth_T1)’. The camera 200 may respond by providing the depth map, see the arrow labeled ‘Depth_T1’. The processor system 100 may then analyze the depth map to determine 500 the region of interest, see the block labeled ‘Determine ROI’, and based thereon request image data of the region of interest acquired at time T1 with a message labeled ‘REQ(Image_T1, ROI)’. The camera 200 may respond by providing the image data of the region of interest, see the arrow labeled ‘Image_T1, ROI’. The processor system 100 may then perform a background removal 510, see the block labeled ‘Remove background’, and stream 520 the image data of the foreground object, e.g., to another entity, see the arrow labeled ‘stream foreground_image’. Such background removal may involve, e.g., replacing the background by a solid color such as green, or by setting the transparency in the background to 100%, etc.
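The message exchange of FIG. 3 may be sketched as follows. The class `BufferingCamera` and its methods are hypothetical stand-ins for a camera supporting buffering and ROI-based capture, not an actual camera API, and the region of interest is determined here by a simple proximity threshold (an assumed value of 2000) rather than by the reference subtraction described earlier.

```python
from collections import OrderedDict


class BufferingCamera:
    """Hypothetical camera that buffers recent frames by timestamp."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.frames = OrderedDict()  # timestamp -> (depth_map, image)

    def capture(self, timestamp, depth_map, image):
        self.frames[timestamp] = (depth_map, image)
        while len(self.frames) > self.capacity:
            self.frames.popitem(last=False)  # drop the oldest buffered frame

    def request_depth(self, timestamp):
        return self.frames[timestamp][0]

    def request_image(self, timestamp, roi):
        # Return only the image data inside the ROI (exclusive lower-right corner).
        x0, y0, x1, y1 = roi
        image = self.frames[timestamp][1]
        return [row[x0:x1] for row in image[y0:y1]]


def determine_roi(depth_map, near=2000):
    """Bounding box of all pixels closer than `near` (an assumed threshold)."""
    points = [
        (x, y)
        for y, row in enumerate(depth_map)
        for x, depth in enumerate(row)
        if depth < near
    ]
    if not points:
        return None
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


camera = BufferingCamera()
depth_t1 = [
    [3000, 3000, 3000, 3000],
    [3000, 1200, 1200, 3000],
    [3000, 1200, 1200, 3000],
]
image_t1 = [
    ["bg", "bg", "bg", "bg"],
    ["bg", "fg", "fg", "bg"],
    ["bg", "fg", "fg", "bg"],
]
camera.capture("T1", depth_t1, image_t1)

depth = camera.request_depth("T1")         # REQ(Depth_T1)
roi = determine_roi(depth)                 # Determine ROI
partial = camera.request_image("T1", roi)  # REQ(Image_T1, ROI)
print(roi, partial)
```

The bounded buffer reflects the trade-off noted above: requesting image data for the same timestamp as the depth map requires the camera to retain images until the region of interest has been determined.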

(27) In some embodiments, the processor system 100 may need to be aware of the spatial relationship between, on the one hand, the image data of the region of interest which was received and, on the other hand, the entire image. This information may be provided by the processor system 100 operating in a ‘stateful’ manner, e.g., by buffering the selection data, e.g., in association with a timestamp or sequence number. Additionally or alternatively, the camera 200 may indicate the spatial relation together with the image data, e.g., by including position metadata such as, or position metadata similar in type to, Spatial Relationship Descriptor metadata as described in [5]. This may allow the processor system 100 to match the depth data from T1 to the partial image data from T1. Another alternative is that the processor system 100 may match the image data of the region of interest to the depth data, so that the depth data spatially matches the image data and foreground extraction of the object can be performed.
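The ‘stateful’ option may be illustrated by the following sketch, in which the selection data is buffered per timestamp and later used to crop the full-size depth mask so that it spatially matches the received partial image. The names `pending_selections`, `remember_selection` and `remove_background` are hypothetical, and the solid green fill is merely one of the background replacements mentioned above.

```python
GREEN = (0, 255, 0)

# Buffered selection data: timestamp -> (x0, y0, x1, y1), lower-right exclusive.
pending_selections = {}


def remember_selection(timestamp, roi):
    """Store the selection data sent for a given timestamp."""
    pending_selections[timestamp] = roi


def remove_background(timestamp, partial_image, full_mask, fill=GREEN):
    """Crop the full-size mask to the buffered ROI and apply it to the partial image."""
    x0, y0, x1, y1 = pending_selections.pop(timestamp)
    cropped_mask = [row[x0:x1] for row in full_mask[y0:y1]]
    return [
        [pixel if keep else fill for pixel, keep in zip(image_row, mask_row)]
        for image_row, mask_row in zip(partial_image, cropped_mask)
    ]


T, F = True, False
full_mask = [
    [F, F, F, F],
    [F, T, T, F],
    [F, T, F, F],
    [F, F, F, F],
]
remember_selection("T1", (1, 1, 3, 3))

# Partial image as received for the ROI: 2x2 RGB pixels.
partial_image = [
    [(10, 10, 10), (20, 20, 20)],
    [(30, 30, 30), (40, 40, 40)],
]
foreground = remove_background("T1", partial_image, full_mask)
print(foreground)
```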

(28) Media Client and Media Source Embodiments

(29) FIGS. 1-3 relate to the image data of the region of interest being selectively acquired by and/or selectively obtained from a camera. Other types of embodiments, which are described with reference to FIGS. 4-7, relate to the image data of the region of interest being selectively obtained by a media client from a media source via a network. Here, the media source may be a media server which stores image data and depth maps, but which may not comprise or be directly connected to the image sensor and/or range sensor. Alternatively, the media source may be a device which comprises or is directly connected to the image sensor and/or range sensor, and which may make the captured image data and/or depth map(s) accessible via a network. For example, the media source may be represented by a smartphone or networked camera, and may in the context of telecommunication also be referred to as ‘terminal’ or ‘terminal device’.

(30) The media client may be a network node which may be configured for processing the image data obtained from the media source, e.g., to perform background removal for video-based VR conferencing, and may be configured to selectively obtain the image data of the region of interest from the media source. The media client may be located in the network, in some embodiments near the media source or near the ultimate destination of the foreground image, e.g., in an edge node such as a 5G Mobile Edge Computer (MEC).

(31) The media client may also be, e.g., an augmented reality or virtual reality rendering system, or a media rendering system such as a display device including televisions, Head Mounted Displays, VR/AR glasses, user equipment or mobile phones, tablets, laptops. In particular, a media client comprising such a display device, or a media client or media rendering system connectable to such a display device, where multiple objects or persons need to be combined in one view and/or an object or person has to be merged in a different background or surrounding, such as in a virtual conference system or any social VR system, may benefit from the present invention.

(32) The selective obtaining of the image data of the region of interest may comprise the media client requesting a reference depth map from the media source via the network. The media client may then request one or more depth maps from the media source so as to identify the region of interest, e.g., in a manner as described with reference to FIGS. 1-3. Having determined the region of interest, the media client may request the image data of the region of interest from the media source.

(33) In the examples of FIGS. 4-7, the image data of the region of interest is provided to the media client by streaming, for example by tile-based streaming. Such tile-based or tiled streaming is known in the art, see, e.g., [5]. Briefly speaking, and as also illustrated in FIG. 4, an image may be spatially segmented in tiles 422. The spatial relationship between tiles may be described using a Spatial Relationship Description (SRD), which may be included in an MPD (Media Presentation Description). Tiles may then be requested individually. In this example, so-termed High Efficiency Video Coding (HEVC) tiles are used, in which the image is split up into 5 rows and 5 columns, which is the maximum for a full HD picture, according to current HEVC specifications [https://www.itu.int/rec/T-REC-H.265-201612-I/en]. Accordingly, each tile's size may be 384×216 pixels, thereby establishing the 25-tile representation as shown in FIG. 4.

(34) The media client may, after determining the region of interest, map this region to a select number of tiles, or directly determine the region of interest as a select number of tiles. Having determined the select number of tiles, the selection of these tiles may be identified to the media source, e.g., in the form of selection data. Accordingly, the region of interest may be expressed as a selection of one or more of the tiles 422. For example, in the example of FIG. 4, the region of interest 440 may be represented by a 2×4 block of tiles, i.e., a selection of 8 tiles. The media source may thus stream the image data of the region of interest by streaming only the tiles of the region of interest 440.
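The mapping of a region of interest to a selection of tiles may, for example, be sketched as follows. The pixel coordinates of the region are invented for illustration, and the helper `tiles_for_roi` is a hypothetical name; the sketch simply selects every tile that overlaps the region.

```python
def tiles_for_roi(roi, tile_w, tile_h, cols, rows):
    """Return the (col, row) indices of all tiles overlapping the region.

    roi is (x, y, w, h) in pixels; indices are clamped to the grid.
    """
    x, y, w, h = roi
    if w <= 0 or h <= 0:
        raise ValueError("region of interest must have a positive size")
    first_col = max(0, x // tile_w)
    last_col = min(cols - 1, (x + w - 1) // tile_w)
    first_row = max(0, y // tile_h)
    last_row = min(rows - 1, (y + h - 1) // tile_h)
    return [
        (c, r)
        for r in range(first_row, last_row + 1)
        for c in range(first_col, last_col + 1)
    ]


# Illustrative region in a full HD picture with 384x216 tiles (5x5 grid).
selected = tiles_for_roi((800, 250, 500, 800), 384, 216, cols=5, rows=5)
```

With these illustrative values the region spans two tile columns and four tile rows, i.e., a selection of 8 tiles.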

(35) Alternatively, another form of spatially segmented streaming of image data may be used, e.g., other than tile-based streaming. Yet another alternative is that the media source may simply crop or selectively encode the region of interest, e.g., in accordance with coordinates or other type of selection data provided by the media client, rather than relying on a selection of one or more predefined spatial segments. Another alternative is that all image data outside the region of interest may be replaced by one color, thereby replacing this ‘outside’ image data with a uniform color, which may be encoded more efficiently. Even though the data which is then transmitted may have the resolution of the original image data, the data may be encoded more efficiently, which also allows for saving bandwidth on a bandwidth-constrained link.
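The alternative of replacing the image data outside the region of interest by a uniform color may be sketched as follows, under the assumption that the image is a list of rows of pixel values. The function name `blank_outside_roi` and the toy 4×4 image are hypothetical.

```python
def blank_outside_roi(image, roi, fill=0):
    """Return a copy of the image with every pixel outside roi set to fill.

    image is a list of rows; roi is (x, y, w, h) in pixels.
    """
    x, y, w, h = roi
    return [
        [
            px if (x <= cx < x + w and y <= cy < y + h) else fill
            for cx, px in enumerate(row)
        ]
        for cy, row in enumerate(image)
    ]


# Toy 4x4 single-channel image with pixel values 1..16.
image = [[r * 4 + c + 1 for c in range(4)] for r in range(4)]
masked = blank_outside_roi(image, roi=(1, 1, 2, 2))
```

The output keeps the resolution of the input; only the 2×2 region retains its original values, so a subsequent encoder may compress the uniform surroundings cheaply.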

(36) FIG. 5 shows a message exchange illustrating the above-described obtaining of the image data of the region of interest by streaming. The message exchange applies to streaming, including but not limited to tile-based streaming.

(37) In FIG. 5, the media client 105 is shown to request the media source 305 to start streaming a depth stream, e.g., a stream of depth maps, by way of a message labeled ‘REQ(depth stream)’. In response, the media source 305 may stream the depth stream to the media client 105, see label ‘Depth stream’. The media client 105 may then determine 500 the region of interest based on the depth stream, see the block labeled ‘Determine ROI’, and request an image stream on the basis of the determined region of interest by way of a message labeled ‘REQ(image stream, ROI)’. In case of tile-based streaming, the message may contain a selection of one or more tiles. Alternatively, the message may contain a selection of another type of spatial segment, or simply coordinates (e.g. as in an SRD) or a mask representing the region of interest, etc. In response, the media source 305 may stream the image data of the region of interest, see label ‘Image_ROI stream’. The media client 105 may then perform background removal 512, see the block labeled ‘Remove background’, and stream 520 the image data of the foreground object, e.g., to another entity, see the label ‘stream foreground_image’.

(38) In a continuous or semi-continuous process, the media client 105 may monitor the region of interest based on one or more later received depth maps of the depth stream. If the object starts moving, the media client 105 may update 512 the definition of the region of interest, as also shown by the label ‘update ROI’ following the ‘Remove background’. This may result in an update of the selection data, which may be provided to the media source 305 directly or in another form, see the arrow labeled ‘UPDATE(ROI)’. In case of tile-based streaming, this may cause the media source 305 to stop streaming one or more tiles and start streaming one or more other tiles. When performing streaming by some form of HTTP Adaptive Streaming, such as DASH or HLS, the client may request the streaming by continuously requesting small segments in time of the entire stream. The requesting of a certain region of interest may then be performed by requesting this region of interest for certain segments, and the updating of the region of interest may be performed by simply requesting other spatial parts of segments, e.g. using a tiled streaming based approach.
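How an update of the region of interest may translate into stopping and starting individual tile streams can be sketched with plain set operations. This is a sketch only; the function name `tile_update` and the returned keys are hypothetical and do not correspond to any DASH or HLS client API.

```python
def tile_update(current_tiles, new_tiles):
    """Compare two tile selections and report which tiles change state."""
    current, new = set(current_tiles), set(new_tiles)
    return {
        "stop": sorted(current - new),   # no longer inside the region
        "start": sorted(new - current),  # newly inside the region
        "keep": sorted(current & new),   # unchanged
    }


# The object moves one tile column to the right.
update = tile_update([(1, 1), (2, 1)], [(2, 1), (3, 1)])
```

In a segment-based scheme, the same difference may be applied at the next segment boundary by requesting segments only for the tiles listed under ‘start’ and ‘keep’.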

(39) In the example of FIG. 5 and others, a depth map may be used for determining the region of interest, but also to remove the background in the received image data of the region of interest. In some embodiments, the first one or more depth maps may be used solely to determine the region of interest. Once the image data of the region of interest is received, the background removal may then start.

(40) In some embodiments, depth maps and images may be transmitted together. In these embodiments, it is not needed to make use of sequence numbers or timestamps, since each depth map is directly linked to its accompanying image. In such embodiments, a depth map may, on the one hand, be used to determine the region of interest in ‘future’ images since it serves as a basis for a subsequent request of image data of the region of interest. The region of interest may thus effectively represent a prediction of the location of the object. On the other hand, the depth map may be used to remove the background in the ‘current’ image data of the region of interest.

(41) The example of FIG. 5 is based on a ‘pull’ mechanism, in that the media client 105, e.g., a network node, requests data from the media source 305, e.g., a terminal. Instead of a pull model, also a ‘push’ mechanism may be used, in that the media source may ‘push’ data and/or instructions to the media client.

(42) FIG. 6 shows an example of such a ‘push’ mechanism, in which the media source 305 may instantiate a function on the media client 105 to assist the media source with establishing the push mechanism. Firstly, the media source 305 may instruct the media client 105 to establish a stream of the image data of a foreground object to a particular network destination, as shown in FIG. 6 by a message labeled ‘CreateForegroundStream(destination)’. The destination may, for example, be specified by a destination address, port number and protocol to use. For this purpose, TURN (IETF RFC 5766 Traversal Using Relays around NAT) may be used, or at least the principles of TURN, as TURN typically allows a client (here the media source) to instruct a relay server (here the media client) to allocate a port for relaying a stream, in which the client can instruct the server to bind a channel to a certain destination peer.

(43) The media client 105 may subsequently be provided with a reference comprising the depth map and optionally an image, as shown in FIG. 6 by a message labeled ‘SupplyReference(Depth, Image)’. Next, the media source 305 may send a depth map to the media client 105 and request the media client to specify a region of interest, see the message labeled ‘REQ_ROI(depth_map)’. Such specification of the region of interest may take any suitable form as previously discussed with reference to the selection data. In this particular example, the SRD from DASH [5] may be used to describe a rectangular region of interest. For example, the entire image may be divided in 16×9 tiles (horizontal×vertical), and the media client may determine 500 the region of interest to correspond to the tiles from (8,4) to (9,5), i.e., a section of 2×2 tiles having as their upper-leftmost corner (8,4), see the label ‘Determine ROI’ in FIG. 6.

(44) The media client 105 may send a specification of the region of interest, see the message labeled ‘ROI’, e.g., in the form of selection data comprising the string (0, 8, 4, 2, 2, 16, 9), which may be defined in accordance with the below syntax:

(45) TABLE-US-00001

Property name | Property value | Comments
------------- | -------------- | --------
source_id | 0 | Unique identifier for the source of the content, to show what content the spatial part belongs to
object_x | 8 | x-coordinate of the upper-left corner of the tile
object_y | 4 | y-coordinate of the upper-left corner of the tile
object_width | 2 | Width of the tile
object_height | 2 | Height of the tile
total_width | 16 | Total width of the content
total_height | 9 | Total height of the content
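By way of illustration, the selection data string (0, 8, 4, 2, 2, 16, 9) may be parsed and converted to pixel coordinates as sketched below. The class `SelectionData` and the functions `parse_selection` and `to_pixels` are hypothetical names chosen for this sketch, and the full HD frame size is an assumption; only the seven property names are taken from the syntax above.

```python
from typing import NamedTuple


class SelectionData(NamedTuple):
    """The seven properties of the syntax above, in order."""
    source_id: int
    object_x: int
    object_y: int
    object_width: int
    object_height: int
    total_width: int
    total_height: int


def parse_selection(text):
    """Parse a string such as '(0, 8, 4, 2, 2, 16, 9)'."""
    fields = [int(part) for part in text.strip().strip("()").split(",")]
    if len(fields) != 7:
        raise ValueError("expected 7 comma-separated integers")
    return SelectionData(*fields)


def to_pixels(sel, frame_w, frame_h):
    """Scale the region from grid units to pixels: (x, y, w, h)."""
    unit_w = frame_w / sel.total_width
    unit_h = frame_h / sel.total_height
    return (
        round(sel.object_x * unit_w),
        round(sel.object_y * unit_h),
        round(sel.object_width * unit_w),
        round(sel.object_height * unit_h),
    )


selection = parse_selection("(0, 8, 4, 2, 2, 16, 9)")
roi_pixels = to_pixels(selection, 1920, 1080)  # assumed full HD frame
```

For an assumed 1920×1080 frame, each of the 16×9 grid units measures 120×120 pixels, so the 2×2 section starting at (8,4) corresponds to a 240×240 pixel region at (960, 480).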

(46) After the specification of the region of interest is received by the media source 305, the media source may selectively stream the image data of the region of interest, while optionally, also the depth values in the region of interest may be selectively streamed, e.g., instead of all of the depth map, see label ‘StreamROI(Depth, Image)’. The media client 105 may then perform background removal and update 512 the requested ROI, see the block labeled ‘Remove background, update ROI’, and stream 520 the image data of the foreground object, e.g., to another entity, see the arrow labeled ‘stream foreground_image’. In the example of FIG. 6, the media source 305 is also shown to regularly send an entire depth map to the media client to update the ROI, see brace 502 representing a repeat of ‘REQ_ROI(depth_map)’ and ‘ROI’. This may facilitate the object detection, as in some cases, e.g., in case of large or erratic movement, the object may not be (fully) shown in the depth values which are currently available to the media client from a region of interest in a previous depth map.

(47) In general, there exist various alternatives to the push mechanism illustrated in FIG. 6. For example, the entire depth maps may be streamed continuously, and a Subscribe/Notify principle may be used to receive updates on the selection of the region of interest by the media client. This may avoid the situation that the object is not (fully) shown in the depth values currently available to the media client. The image data may nevertheless still be streamed selectively, e.g., within the region of interest.

(48) Another alternative is to stream only the image data and depth values within the region of interest, and use prediction to adjust the region of interest over time. Again, a Subscribe/Notify principle may be used to receive region of interest updates, as the region of interest may be predicted on the media client.

(49) Yet another alternative is that the media client may continuously send the spatial location of the detected object to the media source, e.g., as coordinates. The media source may then perform the aforementioned prediction and determine within which region of interest the image data is to be streamed to the media client.

(50) FIG. 7 illustrates the prediction of object movement within the context of tile-based streaming, which is illustrated by a tile-based representation 424 of an image which is divided in 9 regions labeled A1 to C3. A foreground object is shown in the center at time T1 442, located in the center of region B2. This foreground object may move towards the right, as shown for T2 444 and T3 446. If this movement is extrapolated, the foreground object is expected to move from region B2 to C2. In this case, the region of interest may include only region B2 at the start, but after T3 may be extended to include C2 as well. If the foreground object were to continue moving to the right and entirely arrive in region C2, then B2 may be omitted from the region of interest. Such movement, and thus also its prediction, may in general also be horizontal, vertical or diagonal, and vary in speed and acceleration/deceleration.
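The extrapolation described for FIG. 7 may, for example, be sketched as a constant-velocity prediction over a 3×3 grid. The frame size of 300×300 pixels, the object half-size of 20 pixels and the sample positions are assumptions made for this sketch, as are the function names; columns are labeled A to C from left to right and rows 1 to 3 from top to bottom.

```python
def regions_for_box(cx, cy, half, cell=100, cells=3):
    """Labels of the grid regions overlapped by a square object box.

    (cx, cy) is the object centre; the box spans half pixels to each side.
    """
    first_col = max(0, (cx - half) // cell)
    last_col = min(cells - 1, (cx + half - 1) // cell)
    first_row = max(0, (cy - half) // cell)
    last_row = min(cells - 1, (cy + half - 1) // cell)
    return {
        "ABC"[c] + str(r + 1)
        for c in range(first_col, last_col + 1)
        for r in range(first_row, last_row + 1)
    }


def predicted_roi(previous, current, half=20):
    """Regions covering the current and the linearly extrapolated position."""
    px, py = previous
    cx, cy = current
    nxt = (2 * cx - px, 2 * cy - py)  # constant-velocity extrapolation
    return regions_for_box(cx, cy, half) | regions_for_box(*nxt, half)


at_start = predicted_roi((150, 150), (150, 150))    # object at rest in B2
after_t3 = predicted_roi((165, 150), (180, 150))    # moving to the right
arrived = predicted_roi((240, 150), (250, 150))     # entirely inside C2
```

With these values the region of interest starts as B2 only, is extended with C2 once the extrapolated position crosses the region boundary, and shrinks to C2 only after the object has fully arrived there. A practical predictor might additionally account for acceleration or deceleration.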

(51) Foreground Extraction

(52) The following discusses how to perform background removal/foreground extraction, which may in some embodiments be applied to the selectively obtained image data of the region of interest, e.g., for video-based VR conferencing. Various such techniques are known, e.g., from the fields of image analysis, image processing and computer vision [1]. A specific example is that foreground extraction may involve a reference being captured first without the foreground object, with the reference comprising a depth map and optionally a visible-light image. For subsequent depth maps, which may now show the foreground object, the reference depth map may be subtracted from a current depth map, thereby obtaining an indication of the foreground object in the form of depth values having (significant) non-zero values. This subtraction result is henceforth also simply referred to as foreground depth map.

(53) However, the result of the subtraction may be noisy. For example, there may be ‘holes’ in the region of the foreground depth map corresponding to the foreground object. The result may therefore be post-processed using known techniques. For example, 1) zero and negative values may be replaced by higher values, 2) only pixels with depth values in a desired range—which may be a dynamic range—may be selected as foreground, and 3) the holes may be filled using erosion and dilation operations.
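A minimal sketch of the subtraction and post-processing steps is given below. It rests on several assumptions which are not taken from this specification: depth values are distances in arbitrary units, a value of 0 denotes a missing measurement, the absolute difference to the reference is thresholded, and the holes are filled by a morphological closing (a dilation followed by an erosion) with a 3×3 window. All function names and threshold values are hypothetical.

```python
def foreground_mask(reference, current, min_diff=50, depth_range=(500, 1500)):
    """1 where the depth differs from the reference and lies in range."""
    lo, hi = depth_range
    return [
        [
            1 if (cur > 0 and abs(cur - ref) >= min_diff and lo <= cur <= hi)
            else 0
            for ref, cur in zip(ref_row, cur_row)
        ]
        for ref_row, cur_row in zip(reference, current)
    ]


def _morph(mask, pick):
    """Apply pick (max = dilation, min = erosion) over a 3x3 window."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                mask[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            out[y][x] = pick(window)
    return out


def close_holes(mask):
    """Morphological closing: dilation followed by erosion."""
    return _morph(_morph(mask, max), min)


# Reference: empty scene at distance 2000. Current: a 3x3 object at
# distance 1000 with one missing measurement (0) in its centre.
reference = [[2000] * 7 for _ in range(7)]
current = [row[:] for row in reference]
for y in range(2, 5):
    for x in range(2, 5):
        current[y][x] = 1000
current[3][3] = 0

raw = foreground_mask(reference, current)
filled = close_holes(raw)
```

In this toy scene the raw mask has a hole at the centre of the object, which the closing fills without enlarging the object.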

(54) Of course, not all objects appearing in the subtraction result may correspond to the desired ‘foreground’ object, as there may be other objects which have entered, left or changed their location with respect to the scene. Therefore, a connected components analysis [6], [7] may be performed which enables distinguishing between, e.g., objects ‘person’ and ‘desk’, albeit non-semantically. Objects in the foreground depth map may thus be addressed individually and included or excluded from the region of interest as desired. Alternatively, semantic object labeling [8] may be used but this may be limited to prior (training) information and a limited number of classes. It is noted that connected component analysis and similar techniques may allow compensating for a moving background, e.g., due to a camera pan or actual movement of the background, namely by allowing a specific object in the scene to be selected.
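A connected components analysis of a binary foreground mask may be sketched as follows, here using a breadth-first flood fill with 4-connectivity. This is one of several known labeling approaches and is not asserted to be the one of [6] or [7]; the function name and the toy mask are hypothetical.

```python
from collections import deque


def label_components(mask):
    """Label 4-connected regions of non-zero pixels.

    Returns (labels, count); labels holds 0 for background and 1..count
    for the components, numbered in scan order.
    """
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or labels[y][x]:
                continue
            count += 1
            labels[y][x] = count
            queue = deque([(x, y)])
            while queue:
                cx, cy = queue.popleft()
                for nx, ny in ((cx + 1, cy), (cx - 1, cy),
                               (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < w and 0 <= ny < h
                            and mask[ny][nx] and not labels[ny][nx]):
                        labels[ny][nx] = count
                        queue.append((nx, ny))
    return labels, count


# Two separate objects in a toy foreground mask.
mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
labels, count = label_components(mask)
sizes = {n: sum(row.count(n) for row in labels) for n in range(1, count + 1)}
```

Each label may then be included in or excluded from the region of interest individually, for example by keeping only the largest component.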

(55) The region of interest may now be determined, for example as a bounding box or similar geometric construct around the object of interest in the foreground depth map. Alternatively, the foreground depth map may be directly used as a mask representing the foreground object, and thus the region of interest. This may work best when a pixel-to-pixel mapping between the depth map and the image exists. If this is not the case, a correction step for this mapping may be needed, which may comprise warping the depth map onto the image, for example using a feature-based homography computation. In this way, affine transformations may also be taken into account, which may result in a more accurate selection of the object. The mask may be used to selectively read-out the image sensor, read-out a memory connected to the image sensor or in general to selectively obtain the image data of the region of interest.
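Determining the region of interest as a bounding box around the foreground mask may be sketched as follows, assuming a pixel-to-pixel mapping between the depth map and the image. The function name `bounding_box` and the toy mask are hypothetical.

```python
def bounding_box(mask):
    """Smallest (x, y, w, h) box enclosing all non-zero mask pixels.

    Returns None if the mask contains no foreground pixel.
    """
    xs = [x for row in mask for x, value in enumerate(row) if value]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None
    return (min(xs), min(ys), max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)


foreground = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
roi = bounding_box(foreground)
```

The resulting box may serve directly as selection data, or may first be mapped to tiles or other spatial segments.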

(56) General Remarks

(57) In general, the term ‘obtaining’ may refer to ‘receiving’ and the term ‘providing’ may refer to ‘sending’, e.g., via an internal bus, external bus or a network.

(58) Instead of a range sensor yielding a depth map, also a heat sensor may be used which yields a heat map and which may be used to select a region of interest containing an object on the basis of a heat signature of the object. Any reference to ‘depth’, as adjective or noun, may thus instead be read as ‘heat’, mutatis mutandis.

(59) In addition to the depth map, also image data may be used to select the region of interest. This may enable a more accurate selection of the region of interest. For example, updates of the region of interest, e.g., to accommodate a change in spatial location of the object with respect to the scene and/or the image sensor, may be determined based on depth data and/or image data.

(60) The region of interest may be defined in any manner known per se, e.g., as a bounding box, but also by a non-rectangular shape, e.g. using geometric forms, using a formula, using a ‘negative’ description of what is not part of the region, etc.

(61) Once an initial region of interest has been determined, the depth map may now also be selectively obtained, e.g., as depth values within the region of interest. To allow for object movement, motion prediction and/or a spatial margin may be used.
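A spatial margin may, for example, be applied by enlarging the region of interest on all sides and clamping the result to the frame. The margin of 20 pixels and the 640×480 frame are illustrative assumptions, and `expand_roi` is a hypothetical name.

```python
def expand_roi(roi, margin, frame_w, frame_h):
    """Grow (x, y, w, h) by margin pixels per side, clamped to the frame."""
    x, y, w, h = roi
    left = max(0, x - margin)
    top = max(0, y - margin)
    right = min(frame_w, x + w + margin)
    bottom = min(frame_h, y + h + margin)
    return (left, top, right - left, bottom - top)


inner = expand_roi((100, 50, 200, 300), 20, 640, 480)
corner = expand_roi((600, 400, 40, 80), 20, 640, 480)
```

A larger margin tolerates faster object movement between updates, at the cost of obtaining more depth values or image data than strictly needed.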

(62) The foreground extraction may, in some embodiments, not make use of the depth map, but rather use, e.g., a ‘subtraction’ image between a reference image which does not contain the object and a later image containing the object. In such embodiments, the depth map may be low-resolution as it is only or predominately used to identify the region of interest, but not for extracting the actual foreground.

(63) The image sensor may be a CMOS sensor configured for partial capture.

(64) The range sensor and the image sensor may be combined in one device, but may also be separately provided but spatially and temporally aligned.

(65) For a moving camera or moving background, the reference depth map and/or image may be continuously updated so as to allow foreground extraction from the depth map and/or image.

(66) When the object is not moving for a certain period, the depth map may be retrieved only every X frames, e.g. every other frame, every 10 frames, etc.

(67) If there is a difference in resolution, viewpoint or perspective between the depth map and the image, a spatial mapping may be used to map one to the other.
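For the simplest case of such a spatial mapping, namely a difference in resolution only, a region of interest found in the depth map may be scaled to image coordinates as sketched below. Differences in viewpoint or perspective would require a more general transformation and are not covered by this sketch; the resolutions and the name `scale_roi` are assumptions.

```python
import math


def scale_roi(roi, src_size, dst_size):
    """Map (x, y, w, h) from src_size to dst_size, both (width, height).

    The box is rounded outwards so that it never loses object pixels.
    """
    sx = dst_size[0] / src_size[0]
    sy = dst_size[1] / src_size[1]
    x, y, w, h = roi
    left, top = math.floor(x * sx), math.floor(y * sy)
    right, bottom = math.ceil((x + w) * sx), math.ceil((y + h) * sy)
    return (left, top, right - left, bottom - top)


# Region found in a 640x360 depth map, mapped to a 1920x1080 image.
image_roi = scale_roi((100, 50, 40, 60), (640, 360), (1920, 1080))
```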

(68) Timestamps or sequence numbers may be omitted if the depth map and the accompanying image are transmitted together, e.g. in an MPEG-TS.

(69) If only the region of interest of the image is outputted, e.g., as in the case of streaming to a video-based VR conference, the coordinates of the region of interest may be indicated in the stream, e.g., as metadata, so that if an object is moving but due to an update of the region of interest appears static in the streamed image data, the image data may be displayed appropriately, e.g., also in a moving manner.

(70) In some embodiments, the selection data may be provided, e.g., as a signal or as data stored on a transitory or non-transitory computer readable medium.

(71) In some embodiments, a camera may be provided which may comprise a visible light image sensor and optionally a range sensor. The camera may be configured for enabling selective capture or selective read-out from a memory of image data of a region of interest as defined by the selection data described in this specification. Additionally or alternatively, the camera may comprise a processor configured to selectively output said image data on the basis of the selection data.

(72) Processor Systems

(73) FIG. 8 shows a more detailed view of a processor system 100 which may be configured for analyzing the depth map to identify a region of interest in the scene and generating selection data. The processor system 100 of FIG. 8 may correspond to the processor system 100 described with reference to FIGS. 1 and 3 and others, and/or to the media client 105 as described with reference to FIGS. 5 and 6 and others.

(74) The processor system 100 is shown to comprise a communication interface 120 for obtaining a depth map of a scene acquired by a range sensor, and for selectively obtaining image data of the region of interest. For example, the communication interface may be a communication interface to an internal bus or an external bus such as a Universal Serial Bus (USB) via which the range sensor and/or the image sensor may be accessible. Alternatively, the communication interface may be a network interface, including but not limited to a wireless network interface, e.g., based on Wi-Fi, Bluetooth, ZigBee, 4G mobile communication or 5G mobile communication, or a wired network interface, e.g., based on Ethernet or optical fiber. In this case, the processor system 100 may access the depth map and the image data via the network, e.g., from a media source such as that shown in FIG. 9. For example, the network interface may be a local area network (LAN) network interface, but may also be a network interface to a wide area network (WAN), e.g., the Internet.

(75) FIG. 8 further shows the processor system 100 comprising a data storage 160, such as internal memory, a hard disk, a solid-state drive, or an array thereof, which may be used to buffer data, e.g., the depth map and the image data of the region of interest. The processor system 100 may further comprise a processor 140 which may be configured, e.g., by hardware design or software, to perform the operations described with reference to FIGS. 1, 3, 5, 6 and others, e.g., at least in as far as pertaining to the analyzing of the depth map to identify a region of interest in the scene, and to the subsequent generating of the selection data. For example, the processor 140 may be embodied by a single Central Processing Unit (CPU), but also by a combination or system of such CPUs and/or other types of processing units.

(76) The processor system 100 may be embodied by a (single) device or apparatus. For example, the processor system 100 may be embodied as smartphone, personal computer, laptop, tablet device, gaming console, set-top box, television, monitor, projector, smart watch, smart glasses, media player, media recorder, etc., and may in the context of telecommunication also be referred to as ‘terminal’ or ‘terminal device’. The processor system 100 may also be embodied by a distributed system of such devices or apparatuses. An example of the latter may be the functionality of the processor system 100 being distributed over different network elements in a network.

(77) FIG. 9 shows a more detailed view of a processor system 300 which may be configured as media source. The processor system 300 of FIG. 9 may correspond to the media source 305 as described with reference to FIGS. 5 and 6 and others.

(78) It can be seen that the processor system 300 comprises a network interface 320 for communicating with the processor system 100 as described with reference to FIG. 8. The network interface 320 may take any suitable form, including but not limited to those described with reference to the communication interface 120 of the processor system 100 of FIG. 8.

(79) The processor system 300 may further comprise a processor 340 which may be configured, e.g., by hardware design or software, to perform the operations described with reference to FIGS. 5 and 6 and others in as far as pertaining to the media source. The processor 340 may be embodied by a single Central Processing Unit (CPU), but also by a combination or system of such CPUs and/or other types of processing units. The processor system 300 may further comprise a storage medium 360 for at least temporarily storing at least one depth map of a scene, and/or at least one visible light image of the scene. The storage medium 360 may take any suitable form, including but not limited to those described with reference to the data storage 160 of the processor system 100 of FIG. 8.

(80) The processor system 300 may be embodied by a (single) device or apparatus. The processor system 300 may also be embodied by a distributed system of such devices or apparatuses. An example of the latter may be the functionality of the processor system 300 being distributed over different network elements in a network. In a specific example, the processor system 300 may be embodied by a network node, such as a server or an edge node such as a 5G Mobile Edge Computer (MEC).

(81) In general, the processor system 100 of FIG. 8 and the processor system 300 of FIG. 9 may each be embodied as, or in, a device or apparatus. The device or apparatus may comprise one or more (micro)processors which execute appropriate software. The processors of either system may be embodied by one or more of these (micro)processors. Software implementing the functionality of either system may have been downloaded and/or stored in a corresponding memory or memories, e.g., in volatile memory such as RAM or in non-volatile memory such as Flash. Alternatively, the processors of either system may be implemented in the device or apparatus in the form of programmable logic, e.g., as a Field-Programmable Gate Array (FPGA). Any input and/or output interfaces may be implemented by respective interfaces of the device or apparatus, such as a network interface. In general, each unit of either system may be implemented in the form of a circuit. It is noted that either system may also be implemented in a distributed manner, e.g., involving different devices.

(82) It is noted that any of the methods described in this specification, for example in any of the claims, may be implemented on a computer as a computer implemented method, as dedicated hardware, or as a combination of both. Instructions for the computer, e.g., executable code, may be stored on a computer readable medium, e.g., in the form of a series of machine readable physical marks and/or as a series of elements having different electrical, e.g., magnetic, or optical properties or values. The executable code may be stored in a transitory or non-transitory manner. Examples of computer readable mediums include memory devices, optical storage devices, integrated circuits, servers, online software, etc.

(83) FIG. 10 is a block diagram illustrating an exemplary data processing system that may be used in the embodiments described in this specification. Such data processing systems include data processing entities described in this specification, including but not limited to the processor systems, the media client, media source, etc.

(84) The data processing system 1000 may include at least one processor 1002 coupled to memory elements 1004 through a system bus 1006. As such, the data processing system may store program code within memory elements 1004. Further, processor 1002 may execute the program code accessed from memory elements 1004 via system bus 1006. In one aspect, data processing system may be implemented as a computer that is suitable for storing and/or executing program code. It should be appreciated, however, that data processing system 1000 may be implemented in the form of any system including a processor and memory that is capable of performing the functions described within this specification.

(85) Memory elements 1004 may include one or more physical memory devices such as, for example, local memory 1008 and one or more bulk storage devices 1010. Local memory may refer to random access memory or other non-persistent memory device(s) generally used during actual execution of the program code. A bulk storage device may be implemented as a hard drive, solid state disk or other persistent data storage device. The data processing system 1000 may also include one or more cache memories (not shown) that provide temporary storage of at least some program code in order to reduce the number of times program code must be retrieved from bulk storage device 1010 during execution.

(86) Input/output (I/O) devices depicted as input device 1012 and output device 1014 optionally can be coupled to the data processing system. Examples of input devices may include, but are not limited to, for example, a microphone, a keyboard, a pointing device such as a mouse, a game controller, a Bluetooth controller, a VR controller, and a gesture based input device, or the like. Examples of output devices may include, but are not limited to, for example, a monitor or display, speakers, or the like. Input device and/or output device may be coupled to data processing system either directly or through intervening I/O controllers. A network adapter 1016 may also be coupled to data processing system to enable it to become coupled to other systems, computer systems, remote network devices, and/or remote storage devices through intervening private or public networks. The network adapter may comprise a data receiver for receiving data that is transmitted by said systems, devices and/or networks to said data processing system, and a data transmitter for transmitting data to said systems, devices and/or networks. Modems, cable modems, and Ethernet cards are examples of different types of network adapter that may be used with data processing system 1000.

(87) As shown in FIG. 10, memory elements 1004 may store an application 1018. It should be appreciated that data processing system 1000 may further execute an operating system (not shown) that can facilitate execution of the application. The application, being implemented in the form of executable program code, can be executed by data processing system 1000, e.g., by processor 1002. Responsive to executing the application, the data processing system may be configured to perform one or more operations to be described herein in further detail.

(88) In one aspect, for example, data processing system 1000 may represent one of the entities indicated by numerals 100, 105, 300, 305 in this specification, e.g., a processor system, media source or media client. In that case, application 1018 may represent an application that, when executed, configures data processing system 1000 to perform the functions described herein with reference to said entity.

REFERENCES

(89) [1] Camplani, M., & Salgado, L. (2014). Background foreground segmentation with RGB-D Kinect data: An efficient combination of classifiers. Journal of Visual Communication and Image Representation, 25(1), 122-136.

(90) [2] Caselle, Michele, et al. “Ultrafast streaming camera platform for scientific applications.” IEEE Transactions on Nuclear Science 60.5 (2013): 3669-3677.

(91) [3] Schrey, Olaf, et al. “A 1 K × 1 K high dynamic range CMOS image sensor with on-chip programmable region-of-interest readout.” IEEE Journal of Solid-State Circuits 37.7 (2002): 911-915.

(92) [4] Barber, Charles P., et al. “Reading apparatus having partial frame operating mode.” U.S. Pat. No. 8,702,000. 22 Apr. 2014.

(93) [5] Ochi, Daisuke, et al. “Live streaming system for omnidirectional video” Virtual Reality (VR), 2015 IEEE.

(94) [6] https://en.wikipedia.org/wiki/Connected-component_labeling

(95) [7] Samet, H.; Tamminen, M. (1988). “Efficient Component Labeling of Images of Arbitrary Dimension Represented by Linear Bintrees”. IEEE Transactions on Pattern Analysis and Machine Intelligence. IEEE. 10 (4): 579. doi:10.1109/34.3918.

(96) [8] Camplani, M., & Salgado, L. (2014). Background foreground segmentation with RGB-D Kinect data: An efficient combination of classifiers. Journal of Visual Communication and Image Representation, 25(1), 122-136.

(97) In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. Use of the verb “comprise” and its conjugations does not exclude the presence of elements or steps other than those stated in a claim. The article “a” or “an” preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the device claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.