AN IMAGE PROCESSOR AND A METHOD THEREIN FOR PROVIDING A TARGET IMAGE
20220375025 · 2022-11-24
CPC classification
G06V10/255 · G06V20/52 · G06V10/758
International classification
G06V10/75
Abstract
An image processor and a method therein to provide a target image for evaluation with an object detector. The method comprises: obtaining a source image captured by a camera and depicting an object, and applying an inverse pixel transform to each target pixel of a target image to determine one or more source pixels located at a position in the source image corresponding to a position of each target pixel in the target image. The method further comprises assigning, to each target pixel, a target pixel value determined based on one or more source pixel values of the determined one or more source pixels located at the corresponding position, whereby a size c in the target image of the depicted object of a specific object type is normalized in at least one size dimension. Thereafter, the target image is fed to an object detector for evaluation.
Claims
1. A method to provide a target image for evaluation with an object detector, the method comprising: obtaining a source image captured by a camera and depicting one or more objects in a scene; applying an inverse pixel transform T⁻¹ to each target pixel of a target image to determine one or more source pixels located at a position in the source image corresponding to a position of each target pixel in the target image; assigning, to each target pixel, a target pixel value determined based on one or more source pixel values of the determined one or more source pixels located at the corresponding position, whereby a size c in the target image of the depicted one or more objects of a specific object type is normalized in at least one size dimension of the depicted one or more objects of the specific object type; and feeding the target image to an object detector for evaluation.
2. The method of claim 1, wherein the inverse pixel transform T⁻¹ takes as input an assumed size in the at least one size dimension of an object of the specific object type, one or more mounting parameters of the camera, and a distance d_t from a principal point P_t of the target image to the target pixel to which the inverse pixel transform T⁻¹ is to be applied.
3. The method of claim 2, wherein the assumed object size is an assumed object height and/or an assumed object width, wherein the one or more mounting parameters comprise a camera mounting height above ground, and wherein the assumed object size and the one or more mounting parameters are given in a world coordinate system.
4. The method of claim 2, wherein the inverse pixel transform T⁻¹ applied to each target pixel forces the target pixels to reproduce the depicted one or more objects of the specific object type in the target image with a constant size in the at least one size dimension irrespective of the location of each of the depicted one or more objects of the specific object type in the target image.
5. The method of claim 4, wherein the inverse pixel transform T⁻¹ as a function of the distance d_t is given by: T⁻¹(d_t) = exp[(d_t − b)/a] if d_t > d_cutoff and T⁻¹(d_t) = d_t if d_t ≤ d_cutoff; and wherein: a = c/[log(z_1/z_2)]; b = d_cutoff − a·log(d_cutoff); c is a parameter specifying a size in the target image of an object of the specific object type and is set to c = [(d_cutoff)(z_0/z_2)] − [(d_cutoff)(z_0/z_1)] to make the object at the distance d_cutoff in the source image and in the target image of equal size; z_0 is an assumed height above ground of a center of an object of the specific type and is given by (z_1 + z_2)/2; z_1 is the camera mounting height above ground; z_2 is the camera mounting height above ground minus the assumed object height; d_t is the distance in the target image from the principal point P_t to the target pixel to which the inverse pixel transform T⁻¹ is to be applied; and d_cutoff is a distance threshold value at which an object of the specific object type in the source image will have the same size as the object of the specific object type in the target image.
6. The method of claim 2, wherein the inverse pixel transform T⁻¹ is a function depending on a scaling factor s_factor, and on the distance d_t.
7. The method of claim 6, wherein the inverse pixel transform T⁻¹ as a function of the distance d_t is given by: T⁻¹(d_t) = s_factor·d_t²/2 + s_1·d_t + s_0 if d_t > d_cutoff and T⁻¹(d_t) = s_center·d_t if d_t ≤ d_cutoff; and wherein: s_1 = s_center − s_factor·d_cutoff; s_0 = s_factor·d_cutoff²/2; s_factor is the scaling factor given by s_factor = [(z_0/z_2) − (z_0/z_1)]; z_1 is the camera mounting height above ground; z_2 is the camera mounting height above ground minus the assumed object height; z_0 is an assumed height above ground of a center of the object of the specific type and is given by (z_1 + z_2)/2; d_t is the distance in the target image from the principal point P_t to the target pixel to which the inverse pixel transform T⁻¹ is to be applied; d_cutoff is a distance threshold value at which an object of the specific object type in the source image will have the same size as the object of the specific object type in the target image; and s_center is a constant scaling factor used for source pixels located within the distance d_cutoff from the principal point P_t.
8. The method of claim 6, wherein the object of the specific type is modelled as a bounding box, and wherein the inverse pixel transform T⁻¹ as a function of the distance d_t is given by: T⁻¹(d_t) = [s_center·s_factor/(2·s_box)]·d_t² + s_center·d_t; and wherein: d_t is the distance in the target image from the principal point P_t to the target pixel to which the inverse pixel transform T⁻¹ is to be applied; s_center is a constant scaling factor used for source pixels located within a distance of d_cutoff from the principal point P_t; s_factor is the scaling factor given by s_factor = [(z_0/z_2) − (z_0/z_1)]; z_1 is the camera mounting height above ground; z_2 is the camera mounting height above ground minus the assumed object height; and s_box is the projected height of the top of the bounding box, which bounding box is w_box wide and is given by s_box = w_box/z_2.
9. The method of claim 2, wherein at least one part of the object of the specific object type is modelled as a sphere, and wherein a scaling factor is proportional to a projected radius r of the sphere in the source image.
10. The method of claim 9, wherein the inverse pixel transform T⁻¹ as a function of the distance d_t is given by:
11. The method of claim 1, wherein the inverse pixel transform T⁻¹ is rotationally symmetric around a principal point P_t of the target image.
12. An image processor configured to provide a target image for evaluation with an object detector, wherein the image processor is configured to: obtain a source image captured by an image capturing module and depicting one or more objects in a scene; apply an inverse pixel transform T⁻¹ to each target pixel of a target image to determine one or more source pixels located at a position in the source image corresponding to a position of each target pixel in the target image; assign, to each target pixel, a target pixel value determined based on one or more source pixel values of the determined one or more source pixels located at the corresponding position, whereby a size c in the target image of the depicted one or more objects of a specific object type is normalized in at least one size dimension of the depicted one or more objects of the specific object type; and feed the target image to an object detector for evaluation.
13. A camera comprising the image processor of claim 12, and configured to: capture the source image depicting the one or more objects in the scene; provide the source image to the image processor; and wherein the camera further comprises the object detector configured to receive the target image and to evaluate the target image by performing object detection on the target image.
14. The camera of claim 13, wherein the camera is a monitoring camera.
15. A non-transitory computer-readable medium having stored thereon computer code instructions adapted to carry out a method when executed by a device having processing capability, the method providing a target image for evaluation with an object detector, the method comprising: obtaining a source image captured by a camera and depicting one or more objects in a scene; applying an inverse pixel transform T⁻¹ to each target pixel of a target image to determine one or more source pixels located at a position in the source image corresponding to a position of each target pixel in the target image; assigning, to each target pixel, a target pixel value determined based on one or more source pixel values of the determined one or more source pixels located at the corresponding position, whereby a size c in the target image of the depicted one or more objects of a specific object type is normalized in at least one size dimension of the depicted one or more objects of the specific object type; and feeding the target image to an object detector for evaluation.
16. The method of claim 15, wherein the inverse pixel transform T⁻¹ takes as input an assumed size in the at least one size dimension of an object of the specific object type, one or more mounting parameters of the camera, and a distance d_t from a principal point P_t of the target image to the target pixel to which the inverse pixel transform T⁻¹ is to be applied.
17. The method of claim 16, wherein the assumed object size is an assumed object height and/or an assumed object width, wherein the one or more mounting parameters comprise a camera mounting height above ground, and wherein the assumed object size and the one or more mounting parameters are given in a world coordinate system.
18. The method of claim 16, wherein the inverse pixel transform T⁻¹ applied to each target pixel forces the target pixels to reproduce the depicted one or more objects of the specific object type in the target image with a constant size in the at least one size dimension irrespective of the location of each of the depicted one or more objects of the specific object type in the target image.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0018] The above, as well as additional objects, features and advantages of the present disclosure, will be better understood through the following illustrative and non-limiting detailed description of embodiments of the present disclosure, with reference to the appended drawings, where the same reference numerals will be used for similar elements.
DETAILED DESCRIPTION
[0029] The present disclosure will be described more fully hereinafter with reference to the accompanying drawings, in which embodiments of the disclosure are shown.
[0031] The term field of view refers to the part of the scene that is visible through the camera 108 at a particular position and orientation in space of the camera 108 and at a particular zoom setting of the camera 108. The particular position is given by the installation location and the orientation is given by the pan setting and/or tilt setting of the camera. Thus, it should be understood that the field of view may depend on one or more different camera parameters. For example, the field of view may depend on the installation location of the camera 108 such as height above ground, the type of camera lens, the zoom setting of the camera 108, the pan setting of the camera 108 and/or the tilt setting of the camera.
[0032] The camera 108 may be a monitoring camera, sometimes also referred to as surveillance camera. Further, the camera may be a fixed camera, e.g., a stationary camera, or a movable camera, e.g., a pan, tilt and zoom (PTZ) camera. The camera 108 may be a visible light camera, a thermal camera, or a camera comprising both a visible light camera and a thermal camera.
[0033] As further illustrated in FIG. 1, the camera 108 is connected over a network 112 to a client 114 and to a server 116.
[0034] The client 114 may have a display where an operator can view images and/or video streams from the camera. Typically, the client 114 is also connected to the server 116, where the images and/or video streams can be stored and/or processed further. The client 114 may be used to control the camera 108, for example, by the operator issuing control commands at the client 114.
[0036] The camera 108 further comprises an object detector 209. The object detector 209 may be implemented in software, e.g., as an object detection algorithm, to determine object recognitions in a captured image of the scene 102. Alternatively, or additionally, the object detector 209 may be implemented as an object detection neural network, e.g., a convolutional neural network (CNN), performing an object detection algorithm to determine the object recognitions in the captured image of the scene 102. These object recognitions may be associated with their images by including them as metadata to the images. The association may be kept through the subsequent encoding process that is performed by the encoder 210 of the camera 108. Object detectors as such are well known to those having ordinary skill in the art and thus will not be described in any more detail in this disclosure.
[0037] The camera 108 also comprises an image processor 208, an encoder 210 and an input/output interface 212.
[0038] The image processor 208, sometimes referred to as an image processing pipeline, processes captured images by different processing algorithms.
[0039] The image processor 208 may be configured to perform a range of various other operations on images received from the image sensor 204. Such operations may include: filtering; demosaicing; colour correction; noise filtering for eliminating spatial and/or temporal noise; distortion correction for eliminating effects of, e.g., barrel distortion; global and/or local tone mapping, e.g., enabling imaging of scenes containing a wide range of intensities; transformation, e.g., rotation; flat-field correction, e.g., for removal of the effects of vignetting; and application of overlays, e.g., privacy masks or explanatory text.
[0040] Following the image processor 208, the images are forwarded to the encoder 210, in which the images are encoded according to an encoding protocol and forwarded to a receiver, e.g., the client 114 and/or the server 116, over the network 112 using the input/output interface 212.
[0041] The camera 108 may also comprise a data storage 214 for storing data relating to the captured images, to the processing of the captured images and to the object recognitions, just to give some examples. The data storage may be a non-volatile memory, such as an SD card.
[0042] There are a number of conventional video encoding protocols. Some common video encoding protocols that work with the various embodiments of the present disclosure include: High Efficiency Video Coding (HEVC), also known as H.265 and MPEG-H Part 2; Advanced Video Coding (AVC), also known as H.264 and MPEG-4 Part 10; Versatile Video Coding (VVC), also known as H.266, MPEG-I Part 3 and Future Video Coding (FVC); VP9, VP10 and AOMedia Video 1 (AV1), just to give some examples.
[0043] As mentioned, the input/output interface 212 is configured to act as a communication interface between the camera 108, the network 112 and one or more receiving devices such as the client 114 and the server 116. Thus, the camera 108 may receive instructions from the client 114 and may transmit video streams to the client 114 and/or the server 116 via the input/output interface 212.
[0044] The image processor 208, the encoder 210 and the input/output interface 212 may form an image processing and encoding module 214, which is connected to the image capturing module 206. The image capturing module 206 and the image processing and encoding module 214 may be arranged as two separate units arranged at a distance from each other and in communication with each other. In such a scenario, the camera 108 may also be referred to as a camera system. Alternatively, the image capturing module 206 and the image processing and encoding module 214 may be arranged as a single unit comprised in the camera 108. Further, the image capturing module 206 may be movable, e.g., in pan and/or tilt directions, while the image processing and encoding module 214 may be stationary.
[0051] It should be noted that in an image captured with a pinhole camera, a non-flat object, e.g., an object whose extension in the longitudinal direction is larger than its extension in the lateral direction, is depicted smaller when located close to the optical axis than when located far from the optical axis. The difference in depicted sizes of the same non-flat object at different distances from the optical axis will increase with an increasing extension of the object in the longitudinal direction. Thus, if the object is a person and the extension in the longitudinal direction is given by the person's length, the difference in depicted sizes of the person at different distances from the optical axis will increase with increasing length of the person. This means that, in the image captured with the pinhole camera, a non-flat object with a given extension in the longitudinal direction and at a smaller lateral distance from the optical axis is depicted smaller than the same object would have been depicted had it been at a larger lateral distance. For an image captured by a fisheye camera the opposite is true, i.e., the non-flat object will be depicted larger when closer to the optical axis than when farther away from it. By embodiments disclosed herein, an object 104 at the same longitudinal distance from the camera 108 will be depicted with the same or almost the same size in the target image irrespective of its lateral distance to the optical axis, i.e., irrespective of where in the camera's field of view the object occurs, and thus irrespective of being depicted with different sizes in the source image.
[0052] A method to provide a target image that is to be evaluated with an object detector, e.g., the object detector 209, will now be described with reference to the flowchart of FIG. 5.
[0053] In action 502, a source image captured by the camera 108 is obtained. The source image depicts one or more objects 104 in the scene 102. The one or more objects 104 may be of one or more object types. For example, the one or more object types may be a human being, an animal or a vehicle. Thus, the object may be a person, a dog or a car. The source image may be captured by the image capturing module 206 of the camera 108. As previously mentioned, the image capturing module 206 may comprise the lens 202, e.g., a fisheye lens, and the image sensor 204. The source image has an image format corresponding to the size of the image sensor 204 such that each source pixel in the source image corresponds to a readout from a corresponding sensor element of the image sensor 204. For example, a source image format may be 1024×1024, 1920×1080, 2560×1440, 2880×2880 or 4096×2160 pixels. The image processor 208 may obtain the captured source image directly from the image capturing module 206, from an internal storage, e.g., the data storage 214 of the camera 108, or from an external device via the input/output interface 212.
[0054] In action 504, an inverse pixel transform T⁻¹ is applied to each target pixel of a target image to determine one or more source pixels located at a position in the source image corresponding to a position of each target pixel in the target image. Thus, the inverse pixel transform T⁻¹ is applied to every pixel of the target image irrespective of where in the source image the one or more objects 104 may occur. This means that the source image is not processed in any way to detect the one or more objects 104 or their locations in the source image before the inverse pixel transform T⁻¹ is applied to the target image. Instead, by applying the inverse pixel transform T⁻¹ to the target image, the target image is provided based on the captured source image in such a way that an object of a specific object type will be depicted in the target image with a size that has been normalised in at least one dimension irrespective of where in the source image it is depicted. This will be described in more detail below. The target image is of an image format suitable for being analysed by the object detector 209. For example, a suitable target image format may be 1024×1024, 1536×1536, 1920×1080, 2048×512, 2560×1440, or 4096×2160 pixels. It should be understood that the source image format and target image format do not have to be the same.
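Since the inverse pixel transforms described herein are rotationally symmetric around the principal point (see paragraph [0058] and claim 11), the mapping from each target pixel to a source position can be computed per radius. The following is a minimal sketch of this inverse-mapping step, assuming a caller-supplied radial function t_inv that maps a distance in the target image to a distance in the source image; the function names and the use of NumPy are illustrative only and are not part of the disclosure.

import numpy as np

def inverse_map(target_shape, principal_point, t_inv):
    # For every target pixel, compute the corresponding source position by
    # rescaling its radius from the principal point with the radial transform.
    h, w = target_shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    px, py = principal_point
    dx, dy = xs - px, ys - py
    d_t = np.hypot(dx, dy)              # distance d_t to the principal point
    d_s = t_inv(d_t)                    # corresponding source distance d_s
    scale = np.divide(d_s, d_t, out=np.ones_like(d_t), where=d_t > 0)
    # The source position lies on the same ray out of the principal point.
    return px + dx * scale, py + dy * scale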
[0055] In some embodiments, the inverse pixel transform T⁻¹ takes as input an assumed size in the at least one size dimension of the object 104 of the specific object type, one or more mounting parameters of the camera 108, and a distance d_t from the principal point P_t of the target image to the target pixel to which the inverse pixel transform T⁻¹ is to be applied.
[0056] For example, the assumed object size may be an assumed object height and/or an assumed object width, and the one or more mounting parameters may comprise a camera mounting height above ground. The assumed object size and the one or more mounting parameters are given in the world coordinate system. If, for example, the object 104 of the specific type is a person, the assumed size in the at least one size dimension of the object 104 may be an assumed length of the person. For example, the assumed length of the person may be set to an average length of a human being, e.g., to a value in the range of 1-2 meters, such as 1.8 m. The assumed size may be a predetermined value.
[0057] The closer the assumed size in the at least one dimension is to the real size of the object 104 in that dimension, the better the result of the inverse pixel transform. Thus, a better normalisation of the depicted object's size in the at least one dimension in the target image will be achieved when the assumed size in the at least one dimension is the same as or close to the object's real size in that dimension, as compared to the case when the object's real size differs from the assumed size. Therefore, it may be advantageous to select the assumed size depending on the type of object to be depicted and the expected size of that object.
[0058] Different inverse pixel transforms T⁻¹ will be described in more detail below. The different inverse pixel transforms T⁻¹ all have in common that they are rotationally symmetric around the principal point P_t of the target image. This may also be expressed as: they are rotationally symmetric around the principal point P_s of the source image, and around the optical axis of the camera 108, i.e., around the z-axis. It should be understood that the inverse pixel transforms T⁻¹ may be applied also for wall-mounted cameras or other cameras not capturing images from above the object 104. In such scenarios, the source image is captured by the wall-mounted camera, and a new inverse pixel transform T_new⁻¹ is determined. The new inverse pixel transform T_new⁻¹ produces, as a virtual intermediate image, the image the source image would have been had it been captured by a camera mounted in the ceiling at the same height as the wall-mounted camera. By combining the new inverse pixel transform T_new⁻¹ with the inverse pixel transform T⁻¹, a combined inverse pixel transform T_com⁻¹ is obtained, which may be applied to each target pixel of the target image to determine one or more source pixels located at a position in the source image corresponding to a position of each target pixel in the target image.
[0059] In action 506, a respective target pixel value is assigned to each target pixel. The respective target pixel value is determined based on one or more source pixel values of the determined one or more source pixels located at the corresponding position. Thereby, a size c in the target image of the depicted one or more objects 104 of a specific object type is normalized in at least one size dimension of the depicted one or more objects of the specific object type. It should be understood that when the inverse pixel transform T⁻¹ is applied to the target pixel, it may find a location in the source image that contributes to the target pixel, i.e., to the target pixel value. This location does not have to coincide with source pixels in the source image but may be a location between two or more source pixels. In such a case, an image value for the location may be determined by performing an interpolation of two or more source pixel values of two or more source pixels surrounding the location. If more than one such location in the source image contributes to the target pixel value, the target pixel value may be determined by performing an interpolation over the more than one such location. The interpolation may be a nearest-neighbour interpolation, a bicubic interpolation, a bilinear interpolation or another interpolation suitable for interpolating image values. The interpolation may additionally or alternatively comprise a downscaling interpolation such as a Lanczos interpolation, a spline interpolation, an area interpolation or a mipmap interpolation. These interpolations are well known to a person skilled in the art and will therefore not be described in more detail.
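As a concrete illustration of action 506, the sketch below bilinearly interpolates a single-channel source image at the fractional source positions computed in action 504 (e.g., by the illustrative inverse_map above). It is a simplified example of one of the interpolations mentioned, not the only option contemplated.

import numpy as np

def sample_bilinear(source, sx, sy):
    # Assign each target pixel the bilinearly interpolated value of the
    # source image at the (fractional) position (sx, sy); single channel.
    h, w = source.shape
    x0 = np.clip(np.floor(sx).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(sy).astype(int), 0, h - 2)
    fx = np.clip(sx - x0, 0.0, 1.0)
    fy = np.clip(sy - y0, 0.0, 1.0)
    top = source[y0, x0] * (1 - fx) + source[y0, x0 + 1] * fx
    bottom = source[y0 + 1, x0] * (1 - fx) + source[y0 + 1, x0 + 1] * fx
    return top * (1 - fy) + bottom * fy

Together with inverse_map above, a target image can then be produced as sample_bilinear(source, *inverse_map(target_shape, (px, py), t_inv)).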
[0060] Actions 504 and 506 will now be described with reference to an exemplifying figure.
[0061] As shown, the inverse pixel transform T⁻¹ results in the target pixels at distances d_t,1 and d_t,2, respectively, from the principal point P_t of the target image both obtaining the source pixel value of the source pixel at a distance d_s from the principal point P_s in the source image. Thereby, an object at the distance d_s from the principal point P_s in the source image will be enlarged in the target image. It should be understood that when the source image is captured by the camera 108 being a pinhole camera, the opposite is true, i.e., in analogy with the example above, two pixels in the source image will be combined into one pixel of the target image, thereby reducing the size of the object in the target image.
[0063] In action 508, the target image is fed to the object detector 209 for evaluation. Thus, the target image is provided to the object detector 209. For example, the image processor 208 may provide the target image directly to the object detector 209 or via the data storage 214, from where the object detector 209 may retrieve the target image. When in receipt of the target image, the object detector 209 is able to perform object detection on the target image. If the object detector 209 detects one or more objects of the specific type in the target image, it will conclude that the detected one or more objects were present also in the source image. The detections may be mapped back to the source image by transforming their coordinates using the inverse pixel transform T⁻¹.
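Because the transform is radial, mapping a detection back to the source image amounts to applying T⁻¹ to the distance of each detection point from the principal point. A hedged sketch (the helper name detection_to_source is illustrative):

def detection_to_source(pt_target, principal_point, t_inv):
    # Map a detected point in the target image back to the source image by
    # rescaling its radius from the principal point with T^-1.
    px, py = principal_point
    dx, dy = pt_target[0] - px, pt_target[1] - py
    d_t = (dx * dx + dy * dy) ** 0.5
    if d_t == 0.0:
        return (px, py)
    s = t_inv(d_t) / d_t
    return (px + dx * s, py + dy * s)

Applying this to, e.g., the corners of a detected bounding box gives the corresponding region in the source image.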
Different Inverse Pixel Transforms T⁻¹
[0064] Different inverse pixel transforms T⁻¹ will now be described in more detail. In the description below, the camera 108 is mounted at a height, e.g., in the ceiling, and captures the image of the object 104 from above.
[0065] It should be noted that the equations relating to the different inverse pixel transforms T⁻¹ are given for the case when the camera 108 is a pinhole camera, i.e., a camera with a narrow-angle lens. In case the camera 108 is a fisheye camera, i.e., a camera with a fisheye lens, the inverse pixel transform T⁻¹ may be combined with a dewarping transform T_dewarp taking the difference between the fisheye lens and the narrow-angle lens into consideration. The reason for this is that the spherical view of the fisheye camera causes angular distortion of straight lines, giving objects a strange bulged and deformed appearance. Therefore, a dewarping transform T_dewarp is applied to take the output of a fisheye lens and correct the deformed image so that lines appear straight again and the objects look normal. The image may also be rotated so that all portions of the view appear right-side-up.
[0066] Further, it should be understood that the inverse pixel transform T⁻¹ is a transformation that transforms a distance d_t from the principal point in the target image to a distance d_p from the principal point in the source image produced by the pinhole camera, i.e., d_p = T⁻¹(d_t). The cameras used in the real world today are not perfect pinhole cameras. However, their images may be transformed using classical dewarping transforms to produce the same images as a pinhole camera would produce. Such a dewarping transform T_dewarp may be combined with one of the inverse pixel transforms T⁻¹ described herein in order to apply them to real-world cameras. The same approach may be used to apply them to different types of cameras, such as for example fisheye cameras, by using different dewarping parameters in the dewarping transform T_dewarp. Let T_dewarp⁻¹ be the inverse dewarping transform that transforms the distance d_p into the distance d_s from the principal point of a warped source image, i.e., d_s = T_dewarp⁻¹(d_p). The full transform going from the target image to the warped image produced by the fisheye camera then becomes d_s = T_dewarp⁻¹(T⁻¹(d_t)).
[0067] For example, the inverse dewarping transform T_dewarp⁻¹ may be a polynomial relating a viewing angle α = arctan(d_p/f) to the distance d_s as:

d_s = c_0 + c_1·α + c_2·α² + c_3·α³ + c_4·α⁴

[0068] wherein c_0, c_1, c_2, c_3 and c_4 are constants describing the lens used, e.g., describing the used narrow-angle lens or the used fisheye lens. The focal length of the lens is given by f. The viewing angle is sometimes referred to as an angle of view or a field of view. For a fisheye lens, the viewing angle α may be between 100 and 180 degrees. The constants c_0, c_1, c_2, c_3 and c_4 may for a fisheye camera be c_0=0, c_2=0.01, c_3=0.03, c_4=0.001, just to give an example. However, it should be understood that the constants may be varied to describe another type of fisheye lens or a narrow-angle lens.
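A small sketch of this composition under stated assumptions: the focal length f and the value of c_1 (which the example above does not give) are placeholders chosen here purely for illustration, and t_inv again stands for any of the radial transforms T⁻¹ described below.

import numpy as np

F = 800.0                              # focal length in pixels (assumed)
C = (0.0, 1.0, 0.01, 0.03, 0.001)      # c_0..c_4; c_1 = 1.0 is an assumption

def dewarp_inverse(d_p, f=F, coeffs=C):
    # d_s = c_0 + c_1*alpha + c_2*alpha^2 + c_3*alpha^3 + c_4*alpha^4,
    # with the viewing angle alpha = arctan(d_p / f).
    alpha = np.arctan(d_p / f)
    return sum(c_k * alpha**k for k, c_k in enumerate(coeffs))

def full_inverse(d_t, t_inv):
    # d_s = T_dewarp^-1(T^-1(d_t)): target image -> warped fisheye source image.
    return dewarp_inverse(t_inv(d_t))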
Height Scaling and Box Scaling
[0069] In some embodiments, herein described as relating to height scaling and box scaling, respectively, the inverse pixel transform T⁻¹ is a function depending on a scaling factor s_factor and on the distance d_t between the principal point P_t of the target image and the target pixel to which the inverse pixel transform T⁻¹ is to be applied.
[0070] The determination of the inverse pixel transform T⁻¹ for embodiments relating to height scaling will now be described. For these embodiments, the inverse pixel transform T⁻¹ may be seen as a transform that rescales the source image with a scale factor that varies over the image and is proportional to the inverse of the height h_object of the object 104 viewed from a camera 108 installed at a height h_camera and looking straight down. Referring back to the geometry described above, consider an object 104 at a lateral distance x from the optical axis: its bottom, at depth z_1 from the camera, projects to d_1 = f·x/z_1; its top, at depth z_2, projects to d_2 = f·x/z_2; and its centre, at depth z_0, projects to d_0 = f·x/z_0. The projected height of the object is then:

d_2 − d_1 = f(x/z_2 − x/z_1) = (z_0/z_2 − z_0/z_1)·d_0 = s_factor·d_0,

[0071] where s_factor = (z_0/z_2 − z_0/z_1).
[0072] Let F(d) define a function that scales the target image with this projected height in every pixel. That gives the inverse of the pixel transformation that is to be applied to normalize the size of the object 104 in the target image. Thus, the inverse pixel transform T⁻¹ is equal to F(d). However, the function F(d) breaks down in the centre of the source image, where the projected height would be 0. Therefore, a distance threshold value d_cutoff is introduced and the scale is set to a constant s_center for distances d below the distance threshold d_cutoff. The local scaling of a function is its derivative, which gives:

F′(d) = s_factor·d + s_1 if d > d_cutoff, and
[0073] F′(d) = s_center if d ≤ d_cutoff,
wherein s_1 is an unknown constant. Integrating this gives:
F(d) = s_factor·d²/2 + s_1·d + s_0 if d > d_cutoff and
[0074] F(d) = s_center·d + ŝ_0 if d ≤ d_cutoff;
wherein s_0 and ŝ_0 are unknown constants.
[0075] The image centre at distance d = 0 should remain at d = 0, i.e.,
[0076] F(d=0) = 0, giving that ŝ_0 = 0.
Also, the scale and position at the distance threshold d_cutoff should be the same for the two pieces of the function, i.e.,
[0077] F′(d_cutoff) = s_factor·d_cutoff + s_1 = s_center, giving that s_1 = s_center − s_factor·d_cutoff, and
[0078] F(d_cutoff) = s_factor·d_cutoff²/2 + s_1·d_cutoff + s_0 = s_center·d_cutoff, giving that s_0 = s_center·d_cutoff − s_factor·d_cutoff²/2 − s_1·d_cutoff = s_factor·d_cutoff²/2.
[0079] Thus, for embodiments relating to height scaling, the inverse pixel transform T⁻¹ as a function of the distance d_t is given by:
[0080] T⁻¹(d_t) = s_factor·d_t²/2 + s_1·d_t + s_0 if d_t > d_cutoff and
[0081] T⁻¹(d_t) = s_center·d_t if d_t ≤ d_cutoff; and wherein:
[0082] s_1 = s_center − s_factor·d_cutoff;
[0083] s_0 = s_factor·d_cutoff²/2;
[0084] s_factor is the scaling factor given by s_factor = [(z_0/z_2) − (z_0/z_1)];
[0085] z_1 is the camera mounting height above ground;
[0086] z_2 is the camera mounting height above ground minus the assumed object height;
[0087] z_0 is an assumed height above ground of a centre of the object of the specific type and is given by (z_1 + z_2)/2;
[0088] d_t is the distance in the target image from the principal point P_t to the target pixel to which the inverse pixel transform T⁻¹ is to be applied;
[0089] d_cutoff is a distance threshold value at which an object of the specific object type in the source image will have the same size as the object of the specific object type in the target image; and
[0090] s_center is a constant scaling factor used for source pixels located within the distance d_cutoff from the principal point P_t.
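The piecewise transform above is straightforward to implement. The following hedged sketch assumes distances in pixels, world quantities in meters, and s_center = 1; for example, with z_1 = 3 m and an assumed object height of 1.8 m, one gets z_2 = 1.2 m, z_0 = 2.1 m and s_factor = 2.1/1.2 − 2.1/3.0 = 1.05.

import numpy as np

def t_inv_height(d_t, z1, obj_height, d_cutoff, s_center=1.0):
    # Height-scaling inverse pixel transform from paragraphs [0079]-[0090].
    z2 = z1 - obj_height        # camera height minus assumed object height
    z0 = (z1 + z2) / 2.0
    s_factor = z0 / z2 - z0 / z1
    s1 = s_center - s_factor * d_cutoff
    s0 = s_factor * d_cutoff**2 / 2.0
    d_t = np.asarray(d_t, dtype=float)
    far = s_factor * d_t**2 / 2.0 + s1 * d_t + s0
    near = s_center * d_t
    return np.where(d_t > d_cutoff, far, near)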
[0091] In some embodiments relating to box scaling, the object 104 of the specific type is modelled as a bounding box. This may be the case when the object 104, being for example a person, is modelled using a rectangular shape. The bounding box may be referred to as an enclosing box.
[0092] The determination of the inverse pixel transform T⁻¹ for embodiments relating to box scaling will now be described. To avoid the need to specify a distance cutoff threshold, i.e., the distance threshold d_cutoff mentioned above, it is possible to consider a bounding box instead of only considering the height. The bounding box may be a full 3D bounding box. For simplicity, the bounding box is assumed to be a rotated bounding box that always faces the camera. The top of the bounding box, extending laterally from x to x + w_box at depth z_2, then projects to a size:

s_box = (x + w_box)/z_2 − x/z_2 = w_box/z_2.
The total projected height will then be s_factor·d + s_box, and this gives the pixel transformation function by letting:
F′(d) = s_1·(s_factor·d + s_box),
wherein s_1 is an unknown constant that can be defined by specifying that the scale at the center is set to be s_center:
[0093] F′(0) = s_1·s_box = s_center, giving that s_1 = s_center/s_box, which gives that:
[0094] F′(d) = (s_center/s_box)·(s_factor·d + s_box) = s_center·s_factor·d/s_box + s_center.
Integrating F′(d) gives that:
[0095] F(d) = [s_center·s_factor/(2·s_box)]·d² + s_center·d;
wherein the integration constant has been set to 0 to have F(d=0) = 0.
[0096] Thus, in embodiments relating to box scaling, and since the inverse pixel transform T⁻¹ is equal to F(d), the inverse pixel transform T⁻¹ as a function of the distance d_t is given by:
[0097] T⁻¹(d_t) = [s_center·s_factor/(2·s_box)]·d_t² + s_center·d_t; wherein:
[0098] d_t is the distance in the target image from the principal point P_t to the target pixel to which the inverse pixel transform T⁻¹ is to be applied;
[0099] s_center is a constant scaling factor used for target pixels located at the principal point P_t;
[0100] s_factor is the scaling factor given by s_factor = [(z_0/z_2) − (z_0/z_1)];
[0101] z_1 is the camera mounting height above ground;
[0102] z_2 is the camera mounting height above ground minus the assumed object height; and
[0103] s_box is the projected height of the top of the bounding box, which, if the bounding box is w_box wide, is given by s_box = w_box/z_2.
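A corresponding sketch for the box-scaling transform, under the same illustrative assumptions as above (pixel distances, world units in meters, s_center = 1); note that no d_cutoff is needed here.

def t_inv_box(d_t, z1, obj_height, w_box, s_center=1.0):
    # Box-scaling inverse pixel transform from paragraphs [0096]-[0103].
    z2 = z1 - obj_height
    z0 = (z1 + z2) / 2.0
    s_factor = z0 / z2 - z0 / z1
    s_box = w_box / z2          # projected size of the top of the box
    return (s_center * s_factor / (2.0 * s_box)) * d_t**2 + s_center * d_t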
[0104] By the respective inverse pixel transforms T⁻¹ according to the height scaling and the box scaling approaches, the size of the object is normalized in one size dimension of the depicted one or more objects of the specific object type. In the example when the object 104 is a person and the size is a length of the person, the respective inverse pixel transforms T⁻¹ according to the height scaling and the box scaling approaches will normalize the size of the object in the object's length direction.
Constant Size
[0105] In some embodiments, herein described as relating to transforming objects into a constant size, the inverse pixel transform T⁻¹ applied to each target pixel forces the target pixels to reproduce the depicted one or more objects 104 of the specific object type in the target image with a constant size in the at least one size dimension irrespective of the location of each of the depicted one or more objects of the specific object type in the target image.
[0106] The determination of the inverse pixel transform T⁻¹ for embodiments relating to a constant size approach will now be described. Letting the scale be relative to the projected object's height, as for example in the height scaling approach described above, does not necessarily mean that the height of the transformed object will be constant, i.e., the same, at different locations of the target image. The reason for this is that different parts of the object 104 will be transformed with different scales when using the inverse pixel transform T⁻¹ as in the height scaling approach. Therefore, in the constant size approach the transformed height of an object of known height is enforced to always be the same, i.e., constant. An object located at d_0 in the target image will be located at x = (d_0/f)·z_0 in the world, with its top located at d_2 = f·x/z_2 = d_0·z_0/z_2 and similarly its bottom at d_1 = d_0·z_0/z_1. Assume that the transformed height, using the pixel transformation T, of an object is c, then:
[0107] T((d_0·z_0)/z_2) − T((d_0·z_0)/z_1) = c.
Also assume that T(d) = a·log(d) + b for some constants a and b, then:
[0108] c = (a·log((d_0·z_0)/z_2) + b) − (a·log((d_0·z_0)/z_1) + b) = a·log(z_1/z_2), giving that
[0109] a = c/log(z_1/z_2).
[0110] After introducing a distance threshold d_cutoff as in the height scaling approach, this gives the inverse pixel transform:
[0111] T⁻¹(d) = exp[(d − b)/a] if d > d_cutoff and
[0112] T⁻¹(d) = d if d ≤ d_cutoff.
At the distance threshold d_cutoff the position should be the same. Thus:
[0113] T⁻¹(d_cutoff) = exp((d_cutoff − b)/a) = d_cutoff, giving that b = d_cutoff − a·log(d_cutoff), and fixing the height c of the object at the distance threshold d_cutoff, where T⁻¹(d_cutoff) = d_cutoff, gives that:
[0114] c = [(d_cutoff)(z_0/z_2)] − [(d_cutoff)(z_0/z_1)].
[0115] Thus, for embodiments relating to the constant size approach, the inverse pixel transform T⁻¹ as a function of the distance d_t may be given by:
[0116] T⁻¹(d_t) = exp[(d_t − b)/a] if d_t > d_cutoff and
[0117] T⁻¹(d_t) = d_t if d_t ≤ d_cutoff; wherein
[0118] a = c/[log(z_1/z_2)];
[0119] b = d_cutoff − a·log(d_cutoff);
[0120] c is a parameter specifying a size in the target image of an object 104 of the specific object type and is set to c = [(d_cutoff)(z_0/z_2)] − [(d_cutoff)(z_0/z_1)] to make the object at the distance d_cutoff in the source image and in the target image of equal size;
[0121] z_0 is an assumed height above ground of a centre of an object 104 of the specific type and is given by (z_1 + z_2)/2;
[0122] z_1 is the camera mounting height above ground;
[0123] z_2 is the camera mounting height above ground minus the assumed object height;
[0124] d_t is the distance in the target image from the principal point P_t to the target pixel to which the inverse pixel transform T⁻¹ is to be applied; and
[0125] d_cutoff is a distance threshold value at which distance from the principal point an object of the specific object type in the source image will have the same size as the object of the specific object type in the target image.
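A hedged sketch of the constant-size transform, again assuming pixel distances, world units in meters, and the choice of c from paragraph [0120]; with that choice the two pieces meet continuously at d_cutoff.

import numpy as np

def t_inv_constant(d_t, z1, obj_height, d_cutoff):
    # Constant-size inverse pixel transform from paragraphs [0115]-[0125].
    z2 = z1 - obj_height
    z0 = (z1 + z2) / 2.0
    c = d_cutoff * (z0 / z2) - d_cutoff * (z0 / z1)   # object size at d_cutoff
    a = c / np.log(z1 / z2)
    b = d_cutoff - a * np.log(d_cutoff)
    d_t = np.asarray(d_t, dtype=float)
    return np.where(d_t > d_cutoff, np.exp((d_t - b) / a), d_t)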
[0126] By the inverse pixel transform T⁻¹ according to the constant size approach, the size of the object is normalized in one size dimension of the depicted one or more objects of the specific object type. In the example when the object 104 is a person and the size is a length of the person, the inverse pixel transform T⁻¹ according to the constant size approach will normalize the size of the object in the object's length direction.
Sphere Scaling
[0127] In some embodiments, herein described as relating to sphere scaling, at least one part of the object 104 of the specific object type is modelled as a sphere, and a scaling factor is proportional to a projected radius r of the sphere in the source image. The determination of the inverse pixel transform T⁻¹ for embodiments relating to sphere scaling will now be described. In these embodiments, the object 104 is represented with a sphere and a scale factor is set to be proportional to the radius of the projection of that sphere. The centre of a sphere 104_sphere of radius r, viewed at a pixel distance d from the principal point and located at a depth z from the camera, lies in the world at:

a = (a_x, a_z) = (z·d/f, z).
To find the borders of the sphere 104_sphere, the vector a can be rotated 90 degrees into a vector b = (−a_z, a_x). The borders, p and q, can then be found as:

p = (p_x, p_z) = a + r·b/|b|;
q = (q_x, q_z) = a − r·b/|b|;
which can be projected back into the source image at pixels d_p = f·p_x/p_z and d_q = f·q_x/q_z. The diameter of the projection would be:

s(d) = |d_p − d_q|,

and if the scale in the center is to be 1, a pixel transformation with a local scaling of s(d)/s(0) at a distance d is to be applied. Integrating s(d) gives the inverse pixel transform T⁻¹(d) as:
[0128] Thus, in such embodiments relating to sphere scaling, the inverse pixel transform T⁻¹ as a function of the distance d_t in the target image is given by:
wherein:
[0129] z is an assumed distance between the camera 108 and the at least one part of the object 104 modelled as the sphere; and
[0130] d_t is the distance in the target image from the principal point P_t to the target pixel to which the inverse pixel transform T⁻¹ is to be applied.
[0131] If the integration above is performed numerically between 0 and 4 and a 4th-degree polynomial is fitted to the result, the inverse pixel transform T⁻¹ is, for r = 250 and z = 270 − 180 + r (in lightning_crowd: lens.py, rectst_sphere.py), given by T⁻¹(d_t) = −0.020750756·d_t⁴ + 0.21880565·d_t³ + 0.281135226·d_t² + 0.83111921·d_t + 0.01962886.
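For illustration, the fitted polynomial can be evaluated directly. A hedged sketch: the coefficient list is copied from paragraph [0131], and the evaluation range 0-4 matches the stated fitting interval; nothing else is prescribed by the disclosure.

import numpy as np

# Coefficients of T^-1(d_t) from paragraph [0131], highest degree first.
coeffs = [-0.020750756, 0.21880565, 0.281135226, 0.83111921, 0.01962886]
t_inv_sphere = np.polynomial.Polynomial(coeffs[::-1])  # ascending order

d_t = np.linspace(0.0, 4.0, 9)
print(t_inv_sphere(d_t))   # source distances for sampled target distances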
[0132] The at least one part of the object 104 modelled as a sphere may be a head or a torso when the object 104 is a person.
[0133] By the inverse pixel transform T⁻¹ according to the sphere scaling approach, the size of the object is normalized in two size dimensions of the depicted one or more objects of the specific object type. In the example when the object 104 is a person and the size is two-dimensional and given by a length and a width of the person, the inverse pixel transform T⁻¹ according to the sphere scaling approach will normalize the size of the object in both the object's length direction and width direction.
[0134] Embodiments also relate to the image processor 208 configured to provide a target image for evaluation with the object detector 209. The image processor 208 is configured to obtain a source image captured by an image capturing module 206 and depicting one or more objects 104 in a scene 102. The image processor 208 is further configured to apply an inverse pixel transform T⁻¹ to each target pixel of a target image to determine one or more source pixels located at a position in the source image corresponding to a position of each target pixel in the target image. Furthermore, the image processor 208 is configured to assign, to each target pixel, a target pixel value determined based on one or more source pixel values of the determined one or more source pixels located at the corresponding position, whereby a size c in the target image of the depicted one or more objects of the specific object type is normalized in at least one size dimension of the depicted one or more objects of the specific object type. Yet further, the image processor 208 is configured to feed the target image to an object detector for evaluation.
[0135] Embodiments also relate to a camera 108 for providing a target image that is to be evaluated by the object detector 209. The camera 108 comprises the image processor 208. The camera 108 is, e.g., by means of the image capturing module 206, configured to capture the source image depicting the one or more objects 104 in the scene 102. Further, the camera 108 is configured to provide the source image to the image processor 208. Furthermore, the camera 108 comprises the object detector 209 configured to receive the target image and to evaluate the target image by performing object detection on the target image.
[0136] Embodiments also relate to a non-transitory computer-readable medium having stored thereon computer code instructions adapted to carry out embodiments of the method described herein when executed by a device having processing capability.
[0137] As described above, the camera 108, e.g., the image processor 208 of the camera 108, may be configured to implement a method for providing the target image. For this purpose, the camera 108, e.g., the image processor 208 of the camera 108, may include circuitry which is configured to implement the various method steps described herein.
[0138] In a hardware implementation, the circuitry may be dedicated and specifically designed to implement one or more of the method steps. The circuitry may be in the form of one or more integrated circuits, such as one or more application specific integrated circuits or one or more field-programmable gate arrays. By way of example, the camera 108 may hence comprise circuitry which, when in use, obtains a source image, and applies an inverse pixel transform T⁻¹ to each target pixel of the target image to determine one or more source pixels located at a position in the source image corresponding to a position of each target pixel in the target image. The camera 108 may further comprise circuitry which, when in use, assigns, to each target pixel, a target pixel value determined based on one or more source pixel values of the determined one or more source pixels located at the corresponding position, whereby a size c in the target image of the depicted one or more objects of a specific object type is normalized in at least one size dimension of the depicted one or more objects of the specific object type. The camera 108 may further comprise circuitry which, when in use, feeds the target image to an object detector for evaluation.
[0139] In a software implementation, the circuitry may instead be in the form of a processor, such as a microprocessor, which in association with computer code instructions stored on a (non-transitory) computer-readable medium, such as a non-volatile memory, causes the camera 108, e.g., the image processor 208 of the camera 108, to carry out any method disclosed herein. Examples of non-volatile memory include read-only memory, flash memory, ferroelectric RAM, magnetic computer storage devices, optical discs, and the like. In a software case, each of the method steps described above may thus correspond to a portion of computer code instructions stored on the computer-readable medium, that, when executed by the processor, causes the camera 108 to carry out any method disclosed herein.
[0140] It is to be understood that it is also possible to have a combination of a hardware and a software implementation, meaning that some method steps are implemented in hardware and others in software.
[0141] It will be appreciated that a person skilled in the art can modify the above-described embodiments in many ways and still use the advantages of the disclosure as shown in the embodiments above. For example, the camera 108 does not need to be a single unit comprising the image capturing module 206 and the image processing and encoding module 214 at one location, but could be a virtual unit in which the image capturing module 206 and the image processing and encoding module 214 operate together while being provided at different locations. Further, the object detector 209 does not need to be arranged in the image processor 208, but could be arranged as a separate unit of the image processing and encoding module 214 and could be arranged in communication with the image processor 208, the encoder 210, the input/output interface 212, and the data storage 214. Thus, the disclosure should not be limited to the shown embodiments but should only be defined by the appended claims. Additionally, as the skilled person understands, the shown embodiments may be combined.