Image processing
11710273 · 2023-07-25
Assignee
Inventors
Cpc classification
H04N13/282
ELECTRICITY
H04N2013/0081
ELECTRICITY
H04N23/90
ELECTRICITY
H04N13/271
ELECTRICITY
International classification
H04N13/271
ELECTRICITY
Abstract
Apparatus comprises a camera configured to capture images of a user in a scene; a depth detector configured to capture depth representations of the scene, the depth detector comprising an emitter configured to emit a non-visible signal; a mirror arranged to reflect at least some of the non-visible signal emitted by the emitter to one or more features within the scene that would otherwise be occluded by the user and to reflect light from the one or more features to the camera; a pose detector configured to detect a position and orientation of the mirror relative to at least one of the camera and depth detector; and a scene generator configured to generate a three-dimensional representation of the scene in dependence on the images captured by the camera and the depth representations captured by the depth detector and the pose of the mirror detected by the pose detector.
Claims
1. Apparatus comprising: a camera configured to capture images of a user in a scene; a depth detector configured to capture depth representations of the scene, the depth detector comprising an emitter configured to emit a non-visible signal; a mirror arranged to reflect at least some of the non-visible signal emitted by the emitter to one or more features within the scene that would otherwise be occluded by the user and to reflect light from the one or more features to the camera; a pose detector configured to detect a position and orientation of the mirror relative to at least one of the camera and the depth detector; a scene generator configured to generate a three-dimensional representation of the scene in dependence on the images captured by the camera and the depth representations captured by the depth detector and the pose of the mirror detected by the pose detector; a mirror actuator configured to move the mirror; a controller configured to control a position and/or orientation of the mirror using the mirror actuator; and a feature detector configured to detect a feature of interest in the scene, and to detect whether the feature of interest is present in images captured by the camera, wherein the controller is configured to adjust a position and/or orientation of the mirror, in response to determining that the feature of interest is not present in images captured by the camera.
2. Apparatus according to claim 1, in which the controller is configured to adjust the position and/or orientation of the mirror so as to provide the camera with two or more different views of the scene.
3. Apparatus according to claim 1, in which the feature detector is configured to detect an object of interest and to detect, as the feature of interest, a surface of the object of interest at least partly occluded from a direct view of the camera.
4. Apparatus according to claim 1, comprising: two or more mirrors, each having an associated pose detector.
5. Apparatus according to claim 1, in which the camera and at least a non-visible emitter of the depth detector are substantially co-sited.
6. Apparatus according to claim 1, comprising a free viewpoint viewing arrangement to view the three-dimensional representation of the scene.
7. A method comprising: capturing images of a user in a scene; capturing depth representations of the scene, including emitting a non-visible signal; providing a mirror arranged to reflect at least some of the non-visible signal emitted by the emitter to one or more features within the scene that would otherwise be occluded by the user and to reflect light from the one or more features for the step of capturing images; detecting a position and orientation of the mirror; generating a three-dimensional representation of the scene in dependence on the captured images and depth representations and the detected pose of the mirror; moving the mirror in response to control signals to control a position and/or orientation of the mirror; and detecting a feature of interest in the scene, and whether the feature of interest is present in images captured by the camera, wherein the moving includes adjusting a position and/or orientation of the mirror, in response to determining that the feature of interest is not present in images captured by the camera.
8. A non-transitory machine-readable storage medium which stores computer software which, when executed by a computer, causes the computer to perform a method for: capturing images of a user in a scene; capturing depth representations of the scene, including emitting a non-visible signal; providing a mirror arranged to reflect at least some of the non-visible signal emitted by the emitter to one or more features within the scene that would otherwise be occluded by the user and to reflect light from the one or more features for the step of capturing images; detecting a position and orientation of the mirror; generating a three-dimensional representation of the scene in dependence on the captured images and depth representations and the detected pose of the mirror; moving the mirror in response to control signals to control a position and/or orientation of the mirror; and detecting a feature of interest in the scene, and whether the feature of interest is present in images captured by the camera, wherein the moving includes adjusting a position and/or orientation of the mirror, in response to determining that the feature of interest is not present in images captured by the camera.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) A more complete appreciation of the disclosure and many of the attendant advantages thereof will be readily obtained as the same becomes better understood by reference to the following detailed description when considered in connection with the accompanying drawings, wherein:
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
(13)
(14)
(15)
(16)
(17)
DESCRIPTION OF THE EMBODIMENTS
(18) A number of different approaches for implementing free viewpoint content are considered to be suitable, including photogrammetric, light field/multiscopic, and volumetric approaches. Of course, a number of other approaches (or combinations of the above) may be considered.
(19) The first of these approaches comprises the manipulation of captured images in order to appear three-dimensional; this can add freedom to the viewpoint by enabling the user to peer ‘around’ an object in the image—this can often be rather limited in scope, but is suitable for a number of purposes. Reprojection of the captured image is often used in methods following this approach, enabling the simulation of the ‘correct’ view (that is, a view that appears to be from the correct position).
(20) The second approach relies on the capturing of a number of images of the environment from different locations. A free viewpoint experience may then be provided to the user by using interpolation between the captured images; the user is able to manipulate the viewpoint freely within the bounds of the image capture area (that is, the area or volume bounded by the image capture devices).
(21) The third approach that is considered, which is the approach in the context of which the present application is provided, comprises the generation of a virtual scene representing the imaged volume in the content capture process. This may include identifying the geometry of the volume and the objects within it, as well as determining any other parameters (such as lighting effects) as appropriate. Such an approach is discussed in ‘Multi-View Stereo: A Tutorial’ (Y Furukawa, C Hernández, Foundations and Trends in Computer Graphics and Vision, Vol 9, No. 1-2, 2013).
(22) While the present application is framed within the context of the volumetric approach to free viewpoint content, it is considered that the techniques discussed within may be applicable to one or more other approaches.
(23)
(24) A step 100 comprises capturing the content. The content capturing process includes the use of image sensors, such as cameras, and may further include the use of microphones or the like for capturing audio. While in some cases the captured image content may be entirely two-dimensional, in other cases the content capturing process includes the capturing of depth information for a scene—this can be achieved using stereoscopic or depth cameras/detectors, for example, or any other method for determining the distance to an object in the capture environment. Examples of content capturing arrangements are described below with reference to
(25) A step 110 comprises performing processing on the captured content, with the aim of generating content that a user is able to use to explore the captured environment with the aid of a free viewpoint. Examples of processing include the estimating of the depth of objects within the captured images, and the encoding of the processed data into a suitable format for storage and/or output to a viewer. Each of these is discussed below with reference to
(26) The processed data comprises a three-dimensional representation of the environment for which the content capture is performed (or is sufficiently complete so as to enable the generation of such a representation). This representation may be distributed to a user to enable them to generate free viewpoint experiences locally, or it may be used (for example, at a server) to generate image frames in accordance with a viewpoint defined by a client device.
(27) A step 120 comprises the output of the free viewpoint content to a viewer. This may be performed in a number of different ways; for example, the viewer may request a particular viewpoint from a server which holds the encoded data. The server may then generate images representing the viewpoint at the requested position, and transmit these to the viewer. In some embodiments, the viewer may instead be provided with encoded data for the whole of (or at least a part of) the captured environment such that processing for generating image content is performed locally.
(28)
(29) In this Figure, a plurality of cameras 210 are arranged so as to capture images of a person 200 (such as an actor in a movie) from a range of different angles. The cameras 210 may also be configured to capture audio in the environment, although this may instead be captured separately. In some embodiments it is advantageous to be able to synchronise the cameras or establish the timing offset between their image capture—this may assist with generating a high-quality output for a user.
(30) Between them, the cameras 210 may be arranged so as to be able to capture images of a significant proportion of the environment and objects within the environment. In an ideal scenario every part of every surface within the environment is imaged by the arrangement of cameras, although in practice this is rarely possible due to factors such as occlusions by other objects in the environment. Such an issue may be addressed in a number of manners, a selection of which are discussed below.
(31) For example, the arrangement of cameras 210 as shown in
(32)
(33)
(34)
(35)
(36)
(37) In some cases, the camera system for capturing images of the environment may be robust to such occlusions—for example, given enough cameras it is possible that the arrangement leads to every part of the environment (or at least a sufficient number of parts of the environment) being imaged by more than one camera. In such a case, it is possible that images of an area occluded from one camera's view are captured by another camera.
(38) Alternatively, or in addition, a number of processing techniques may be used to fill such gaps. For instance, information about that area (such as the colour of the trousers worn by the person 200) may be stored from previously captured frames, or determined in dependence upon other information—for example, it may be assumed that the colour is constant (either over time, spatially, or both), and so any image of the trousers may be enough to supply the colour information despite being captured at a different time, and/or imaging a different portion of the trousers. Similarly, the colour could be input by an operator or the like.
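By way of a non-limiting illustration, the reuse of colour information from previously captured frames might be sketched in Python as follows. The function name and the use of None to mark an occluded sample are assumptions made for this example only; the description does not prescribe any particular data layout.

```python
def fill_occluded_colours(current_frame, previous_frame):
    """Return a copy of current_frame with gaps filled from previous_frame.

    Each frame is a list of RGB tuples. None marks a sample that is occluded
    in that frame (an illustrative convention, not taken from the description).
    """
    if len(current_frame) != len(previous_frame):
        raise ValueError("frames must have the same number of samples")
    # Keep the current sample where one exists; otherwise assume the colour
    # is constant over time and reuse the earlier observation.
    return [cur if cur is not None else prev
            for cur, prev in zip(current_frame, previous_frame)]


previous = [(10, 20, 30), (40, 50, 60), (70, 80, 90)]
current = [(11, 21, 31), None, (71, 81, 91)]
filled = fill_occluded_colours(current, previous)
```

The sketch relies on the assumption stated above that the colour is constant over time; a sample that was also occluded in the earlier frame simply remains None.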
(39)
(40) A step 500 comprises an estimation of the depth of one or more parts of the environment that is imaged. In some cases, this may be performed by identifying the disparity associated with an object between a pair of stereoscopic images; in other cases, monoscopic depth detection may be performed, or a position may be estimated from a number of images based upon knowledge about the position and orientation of the cameras used to capture those images.
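As a hedged illustration of the disparity-based case, the following Python sketch uses the standard relation Z = f * B / d for a rectified stereo pair. The rectified pinhole model, the function name and the parameter names are assumptions for the purposes of the example; the description does not state which depth estimation model is used.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Depth (metres) of a point seen in a rectified stereo pair.

    disparity_px    -- horizontal shift of the point between the two images
    focal_length_px -- focal length expressed in pixels
    baseline_m      -- distance between the two camera centres in metres
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_length_px * baseline_m / disparity_px


# A point shifted by 50 pixels, seen with a 1000-pixel focal length and a
# 0.1 m baseline, lies about 2 m from the cameras.
example_depth = depth_from_disparity(50, 1000, 0.1)
```

A larger disparity corresponds to a nearer object, which is why the relation is inversely proportional.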
(41) A step 510 comprises the fusion of image data. Fusion of image data is the process of combining the information that is obtainable from each of a plurality of images in order to generate a three-dimensional space using images in a two-dimensional space. For example, image data may be fused so as to generate a three-dimensional model of an object that comprises two-dimensional information about each side of the object, as imaged by a corresponding plurality of cameras. This is discussed below in more detail, with reference to
(42) A step 520 comprises the encoding of the processed image data, for example to generate data that is in a format that is suitable for storage and/or transmission to a user. Examples of suitable representations of the content include the use of point clouds and/or meshes to represent objects and features in the environment.
(43) Further processing may also be performed in addition to, or instead of, one or more of the steps shown in
(44) As discussed with reference to step 510, fusion of image data may be performed in order to generate a more complete description of the environment in which image capture is performed. For example, image data from a second camera may be used to supplement the image data from a first camera, which can mitigate the problem of occlusion.
(45) In general, fusion techniques utilise a number of captured images, each comprising a two-dimensional image and depth information of the environment, the images being captured at different times or from different camera positions. These images are then processed to extract information to enable a three-dimensional reconstruction or generation.
(46) At a first stage, segmentation is performed. This process results in a separation of an imaged object and a background of the image from one another, such that the background may be removed from the image. The segmented image of the object, in conjunction with the depth data that is captured, can then be used to generate a three-dimensional image of the object from one side, where every pixel of the image represents a point in three-dimensional space.
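The conversion of a segmented image and its depth data into points in three-dimensional space might be sketched as below. This is a minimal example assuming a pinhole camera with intrinsic parameters fx, fy, cx and cy; those parameters, the function name and the list-of-lists layout are assumptions of the example rather than features of the description.

```python
def backproject(depth, mask, fx, fy, cx, cy):
    """Turn segmented depth pixels into 3D points in the camera frame.

    depth -- rows of depth values (metres); 0 means no measurement
    mask  -- rows of booleans; True marks pixels belonging to the object
    fx, fy, cx, cy -- assumed pinhole intrinsics in pixels
    """
    points = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, keep) in enumerate(zip(depth_row, mask_row)):
            # Background pixels and pixels without a depth value are dropped.
            if keep and z > 0:
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


depth_image = [[2.0, 2.0],
               [0.0, 4.0]]
object_mask = [[True, False],
               [True, True]]
object_points = backproject(depth_image, object_mask, fx=2, fy=2, cx=0, cy=0)
```

Each retained pixel yields exactly one point, which matches the statement that every pixel of the segmented image represents a point in three-dimensional space.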
(47) By generating multiple such images from a number of viewpoints, three-dimensional images may be generated for an object from a number of different sides; this can enable the construction of a full three-dimensional volume representing the external shape of the object. The fusion process here is used to correlate matching points as captured by the different cameras, and to remove any erroneous points, so as to enable a combination of the captured three-dimensional images into a three-dimensional representation.
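One simple stand-in for the correlation and error-removal step is a voxel-grid vote, sketched below. This is not presented as the fusion technique of the description: the voxel grid, the min_views threshold and all names are assumptions, and the points from every view are assumed to have already been transformed into a common coordinate frame.

```python
import math
from collections import defaultdict


def fuse_views(views, voxel_size=0.05, min_views=2):
    """Merge per-view point lists, keeping only points confirmed by several views.

    views -- one list of (x, y, z) points per camera, in a shared frame
    """
    cells = defaultdict(lambda: {"views": set(), "points": []})
    for view_id, points in enumerate(views):
        for p in points:
            key = tuple(int(math.floor(c / voxel_size)) for c in p)
            cells[key]["views"].add(view_id)
            cells[key]["points"].append(p)

    fused = []
    for key in sorted(cells):
        cell = cells[key]
        # A cell seen from too few views is treated as an erroneous point.
        if len(cell["views"]) >= min_views:
            n = len(cell["points"])
            fused.append(tuple(sum(p[i] for p in cell["points"]) / n
                               for i in range(3)))
    return fused


view_a = [(0.01, 0.01, 1.01), (5.0, 5.0, 5.0)]   # second point is a stray
view_b = [(0.02, 0.02, 1.02)]
fused_points = fuse_views([view_a, view_b])
```

In this example the stray point is discarded because only one view supports it, while the two matching observations are averaged into a single fused point.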
(48)
(49) Temporal fusion is a fusion technique that may be performed within a single image data set (that is, an image data set captured by a single camera over a time duration). In
(50) Spatial fusion may be performed between the two image data sets 601 and 610 (that is, image data sets captured by cameras located at different viewpoints); for example, image data from the frame labelled 1′ may be used to supplement the image data derived from the frame labelled 1. This may be performed for any pairing of image frames, rather than necessarily being limited to those captured at (at least substantially) the same time. Spatial fusion is advantageous in that the image data from each of the image data sets is obtained from a different position—different views of the same object may therefore be captured.
(51)
(52) In the second, labelled 710, the back, left, and top portions of the object can be seen by an image capture device. In the context of
(53) In either case, the data from each of the images 700 and 710 may be combined so as to generate a more complete description of the imaged object than would be available using only a single image frame comprising the object. Of course, any suitable combination of spatial and temporal fusion may be used as appropriate—the fusion process should not be limited to the specific examples provided above.
(54) At the conclusion of the method described with reference to
(55) As noted above, this information may be provided to the user in a raw form including data (such as a point cloud representation of the environment, in addition to texture and lighting information) for the whole of the environment. However, this represents a significant amount of data to transmit and store (point clouds may comprise millions or even billions of data points) and may therefore be inappropriate in a number of scenarios.
(56) As an alternative, this information may be provided to a viewer by generating an image at a server in response to an input viewpoint position/orientation. While this may result in a degree of input latency, it may be responsive enough to provide a suitable free viewpoint experience to a user.
(57) In either case, rendering of a viewpoint must be performed based upon the encoded data. For example, when using a point cloud representation to store information about the captured environment the rendering process comprises a surface reconstruction process as a part of generating an image for display. This is performed so as to enable the generation of surfaces from a set of discrete points in the point cloud.
(58) As a development and variant of the techniques discussed above,
(59) In
(60) Images of a front side 830 of the user 820 can be captured by a direct view 840 from a camera 850 (not shown in
(61) In example embodiments, images can still be captured of the occluded regions of an object of interest such as the user 820 by an indirect view 870, which is to say a view via a reflective surface or mirror 880. As discussed below, from the direct view 840 and the indirect view 870, a three-dimensional representation of the object of interest, for example the user 820, can be constructed.
(62) The mirror 880 is associated with apparatus 890. Variants of the apparatus 890 will be discussed in more detail below. For example, the apparatus 890 can provide primarily a sensing function so as to allow the data processing functionality of the laptop computer 800 to know the position and/or orientation of the mirror 880. Or the apparatus 890 can provide primarily a positioning function so that, for example under the control of the data processing functionality of the laptop computer 800, the apparatus 890 controls the position and/or orientation of the mirror 880 relative to the camera 850. As mentioned, both of these variants will be discussed further below.
(63)
(64) The camera 850 may be a camera 940 built into the casing 920, or may be a peripheral camera 950 attachable to the casing 920. In the particular examples shown, the camera 850 is provided by the built in camera 940, and the peripheral device 950 in fact represents a depth detector, and in particular a depth detector by which a non-visible signal such as an infra-red signal is emitted by an emitter so that a depth representation of a scene to be captured can be obtained. So, in the examples shown, the device 940 represents a camera to capture images of a scene and the device 950 represents a depth detector configured to capture depth representations of the scene, the depth detector having an emitter configured to emit a non-visible signal.
(65) As shown, in some examples, the camera and at least a non-visible emitter of the depth detector are substantially co-sited.
(66) It will be appreciated that in other examples, neither of the camera and the depth detector may be embodied within the laptop computer 800, or in further different embodiments, both of the camera and the depth detector could be embodied within the casing 920 of the laptop computer 800.
(67) The depth detector may be represented by a so-called LIDAR sensor or the like. A LIDAR sensor, where “LIDAR” signifies “light detection and ranging”, uses emitted light to measure distances from the sensor, based upon a measure of the amount of time it takes for an emitted pulse of light to be reflected back to the LIDAR sensor.
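The time-of-flight relationship can be illustrated with the short Python sketch below. The function and constant names are chosen for the example only; the halving reflects the fact that the measured time covers the path to the reflecting surface and back.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def range_from_round_trip(round_trip_s):
    """Distance (metres) to a surface given the pulse round-trip time (seconds)."""
    if round_trip_s < 0:
        raise ValueError("round-trip time cannot be negative")
    # The pulse travels out and back, so the one-way distance is half the path.
    return SPEED_OF_LIGHT_M_PER_S * round_trip_s / 2.0


# A pulse returning after roughly 6.67 nanoseconds indicates a surface
# about one metre away.
one_metre = range_from_round_trip(2.0 / SPEED_OF_LIGHT_M_PER_S)
```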
(68) In terms of the direct view 840 in
(69) In terms of the indirect view 870, the range-sensing light (the non-visible signal) emitted by the depth detector is reflected by the mirror 880 onto the rear or occluded surface 860 of the user 820, from which it is reflected back to the mirror 880 and from there back to the depth detector. Similarly, images are captured by the camera 850 of the view formed by reflection in the mirror 880. Therefore, the images of the occluded surface 860 of the user 820 are associated with depth measurements representing a depth equal to the length 875 from the camera 850 to the mirror 880 plus the length 876 of the path from the mirror 880 to the occluded surface 860 of the user 820. Note that the depth detected by the direct view 840 will represent the length 842 of the direct path between the depth detector and the front surface 830 of the user 820.
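One way in which a depth sample obtained via the mirror could be placed in the scene is sketched below: the sample is first positioned along the viewing ray at the total measured path length (the sum of the lengths 875 and 876), giving a virtual point behind the mirror, and that point is then reflected across the mirror plane. This is an illustrative sketch only. It assumes a planar mirror, a co-sited camera and depth detector at the origin, and a mirror pose expressed as a point on the plane and a normal; the function names are hypothetical.

```python
import math


def reflect_across_plane(point, plane_point, plane_normal):
    """Mirror a 3D point across the plane through plane_point with the given normal."""
    norm = math.sqrt(sum(c * c for c in plane_normal))
    n = [c / norm for c in plane_normal]
    # Signed distance from the plane to the point, measured along the normal.
    signed = sum((p - q) * c for p, q, c in zip(point, plane_point, n))
    return tuple(p - 2.0 * signed * c for p, c in zip(point, n))


def unfold_mirror_sample(ray_dir, path_length, plane_point, plane_normal):
    """Recover the real position of a surface point seen via the mirror.

    ray_dir     -- direction from the sensor towards the mirror reflection
    path_length -- total measured depth (sensor to mirror plus mirror to surface)
    """
    norm = math.sqrt(sum(c * c for c in ray_dir))
    virtual = tuple(path_length * c / norm for c in ray_dir)
    return reflect_across_plane(virtual, plane_point, plane_normal)


# Mirror 3 m in front of the sensor, facing it; total measured path is 5 m,
# so the occluded surface lies 2 m back from the mirror, at z = 1 m.
real_point = unfold_mirror_sample((0.0, 0.0, 1.0), 5.0,
                                  plane_point=(0.0, 0.0, 3.0),
                                  plane_normal=(0.0, 0.0, -1.0))
```

The reflection preserves distances to the mirror plane, so the recovered point is the same distance from the mirror as the virtual point, but on the sensor's side.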
(70) As a comparison of the techniques in
(71) In
(72) Note that the mirrors 880 are shown in
(73)
(74) In order to combine the images, the processor 1140 has access to information 1160 indicating a current position and/or orientation of each of the mirrors 880. From this, the processor establishes features such as: the angle of view provided by the indirect path 870 (note that in the example of
(75) The system can also obtain the camera-to-mirror distance 875 from a LIDAR or similar range detection of a non-reflective or marker portion of the mirror arrangement.
(76)
(77) In
(78) An example of the latter arrangement (
(79) Referring to the schematic flowchart in
(80) The alternative arrangement of
(81)
(82) The controller 1610 generates a required position and/or orientation of the mirror 880 in response to signals 1200 and also outputs signals 1210 indicating a current controlled position and/or orientation of the mirror 880.
(83) The control in this way of the mirror configuration can be useful in example arrangements in which the mirror configuration is potentially changed to ensure that an appropriate view of an object of interest (such as the user 820 in the drawings discussed above) is available. Such an arrangement is shown in the schematic flowchart of
(84) Referring to
(85) If an adequate view is provided of the occluded surface then the process can end 1715. Otherwise, at a step 1720 the processor 1140 derives another configuration of the mirror 880 and provides this as data 1200 to the apparatus 890 which then, at a step 1730, adjusts the mirror configuration to the one indicated by the data 1200 before control returns to the step 1700.
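The adjust-and-retest loop described above might be sketched as follows. The function name, the list of candidate configurations, the view_is_adequate callable and the max_attempts limit are all assumptions introduced for the example; in particular, how an adequate view is judged and how another configuration is derived are left as opaque inputs.

```python
def find_mirror_configuration(candidate_configs, view_is_adequate, max_attempts=10):
    """Try mirror configurations in turn until one gives an adequate view.

    candidate_configs -- iterable of configurations to try (hypothetical)
    view_is_adequate  -- callable that applies a configuration, captures a
                         view and reports whether the feature is visible
    Returns the first adequate configuration, or None if none was found.
    """
    for attempt, config in enumerate(candidate_configs):
        if attempt >= max_attempts:
            break
        # Analogue of adjusting the mirror (step 1730) and re-testing the view.
        if view_is_adequate(config):
            return config  # adequate view found: the process can end (1715)
        # Otherwise another configuration is derived (step 1720) on the next pass.
    return None


# Illustration with mirror pan angles in degrees; the occluded surface is
# assumed to become visible once the pan reaches 30 degrees.
chosen = find_mirror_configuration([0, 15, 30, 45], lambda pan: pan >= 30)
```

The attempt limit is a design choice of the sketch, included so that the loop terminates even if no configuration gives an adequate view.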
(86) By way of summary,
(87) capturing (at a step 1800) images of a user in a scene;
(88) capturing (at a step 1810) depth representations of the scene, including emitting a non-visible signal;
(89) providing (at a step 1820) a mirror arranged to reflect at least some of the non-visible signal emitted by the emitter to one or more features within the scene (such as a rear surface of the user) that would otherwise be occluded by the user and to reflect light from the one or more features for the step of capturing images;
(90) detecting (at a step 1830) a position and orientation of the mirror; and
(91) generating (at a step 1840) a three-dimensional representation of the scene in dependence on the captured images and depth representations and the detected pose of the mirror.
(92)
(93) The resulting scene may be viewed using a display arrangement having a screen or a head mountable display 2000 (
(94)
(95) a camera 2100 configured to capture images of a user in a scene;
(96) a depth detector 2110 configured to capture depth representations of the scene, the depth detector comprising an emitter configured to emit a non-visible signal;
(97) a mirror 2120 arranged to reflect at least some of the non-visible signal emitted by the emitter to one or more features within the scene that would otherwise be occluded by the user and to reflect light from the one or more features to the camera;
(98) a pose detector 2130 configured to detect a position and orientation of the mirror relative to at least one of the camera and the depth detector; and
(99) a scene generator 2140 configured to generate a three-dimensional representation of the scene in dependence on the images captured by the camera and the depth representations captured by the depth detector and the pose of the mirror detected by the pose detector.
(100) The apparatus may work in conjunction with a feature detector 2150 configured to detect a feature of interest in the scene, such as an occluded surface or other occluded object. The feature detector may be configured to detect whether the feature of interest is present in images captured by the camera. The feature detector may be configured to detect an object of interest and to detect, as the feature of interest, a surface of the object of interest at least partly occluded from a direct view of the camera.
(101) It will be appreciated that example embodiments can be implemented by computer software operating on a general purpose computing system such as a games machine. In these examples, computer software, which when executed by a computer, causes the computer to carry out any of the methods discussed above is considered as an embodiment of the present disclosure. Similarly, embodiments of the disclosure are provided by a non-transitory, machine-readable storage medium which stores such computer software.
(102) It will also be apparent that numerous modifications and variations of the present disclosure are possible in light of the above teachings. It is therefore to be understood that within the scope of the appended claims, the disclosure may be practised otherwise than as specifically described herein.