AN IMAGE PROCESSOR AND A METHOD THEREIN FOR PROVIDING A TARGET IMAGE
20220375025 · 2022-11-24
CPC classification
G06V10/255 · G06V20/52 · G06V10/758
International classification
G06V10/75
Abstract
An image processor and a method therein to provide a target image for evaluation with an object detector. The method comprises: obtaining a source image captured by a camera and depicting an object, and applying an inverse pixel transform to each target pixel of a target image to determine one or more source pixels located at a position in the source image corresponding to a position of each target pixel in the target image. The method further comprises assigning, to each target pixel, a target pixel value determined based on one or more source pixel values of the determined one or more source pixels located at the corresponding position, whereby a size c in the target image of the depicted object of a specific object type is normalized in at least one size dimension. Thereafter, the target image is fed to an object detector for evaluation.
Claims
1. A method to provide a target image for evaluation with an object detector, the method comprising: obtaining a source image captured by a camera and depicting one or more objects in a scene; applying an inverse pixel transform T⁻¹ to each target pixel of a target image to determine one or more source pixels located at a position in the source image corresponding to a position of each target pixel in the target image; assigning, to each target pixel, a target pixel value determined based on one or more source pixel values of the determined one or more source pixels located at the corresponding position, whereby a size c in the target image of the depicted one or more objects of a specific object type is normalized in at least one size dimension of the depicted one or more objects of the specific object type; and feeding the target image to an object detector for evaluation.
2. The method of claim 1, wherein the inverse pixel transform T⁻¹ takes as input an assumed size in the at least one size dimension of an object of the specific object type, one or more mounting parameters of the camera, and a distance d_t from a principal point P_t of the target image to the target pixel to which the inverse pixel transform T⁻¹ is to be applied.
3. The method of claim 2, wherein the assumed object size is an assumed object height and/or an assumed object width, wherein the one or more mounting parameters comprise a camera mounting height above ground, and wherein the assumed object size and the one or more mounting parameters are given in a world coordinate system.
4. The method of claim 2, wherein the inverse pixel transform T⁻¹ applied to each target pixel forces the target pixels to reproduce the depicted one or more objects of the specific object type in the target image with a constant size in the at least one size dimension irrespective of the location of each of the depicted one or more objects of the specific object type in the target image.
5. The method of claim 4, wherein the inverse pixel transform T⁻¹ as a function of the distance d_t is given by: T⁻¹(d_t) = exp[(d_t − b)/a] if d_t > d_cutoff and T⁻¹(d_t) = d_t if d_t ≤ d_cutoff; and wherein: a = c/[log(z_1/z_2)]; b = d_cutoff − a·log(d_cutoff); c is a parameter specifying a size in the target image of an object of the specific object type and is set to c = [(d_cutoff)(z_0/z_2)] − [(d_cutoff)(z_0/z_1)] to make the object at the distance d_cutoff in the source image and in the target image of equal size; z_0 is an assumed height above ground of a center of an object of the specific type and is given by (z_1 + z_2)/2; z_1 is the camera mounting height above ground; z_2 is the camera mounting height above ground minus the assumed object height; d_t is the distance in the target image from the principal point P_t to the target pixel to which the inverse pixel transform T⁻¹ is to be applied; and d_cutoff is a distance threshold value at which an object of the specific object type in the source image will have the same size as the object of the specific object type in the target image.
6. The method of claim 2, wherein the inverse pixel transform T⁻¹ is a function depending on a scaling factor s_factor, and on the distance d_t.
7. The method of claim 6, wherein the inverse pixel transform T⁻¹ as a function of the distance d_t is given by: T⁻¹(d_t) = s_factor·d_t²/2 + s_1·d_t + s_0 if d_t > d_cutoff and T⁻¹(d_t) = s_center·d_t if d_t ≤ d_cutoff; and wherein: s_1 = s_center − s_factor·d_cutoff; s_0 = s_factor·d_cutoff²/2; s_factor is the scaling factor given by s_factor = [(z_0/z_2) − (z_0/z_1)]; z_1 is the camera mounting height above ground; z_2 is the camera mounting height above ground minus the assumed object height; z_0 is an assumed height above ground of a center of the object of the specific type and is given by (z_1 + z_2)/2; d_t is the distance in the target image from the principal point P_t to the target pixel to which the inverse pixel transform T⁻¹ is to be applied; d_cutoff is a distance threshold value at which an object of the specific object type in the source image will have the same size as the object of the specific object type in the target image; and s_center is a constant scaling factor used for source pixels located within the distance d_cutoff from the principal point P_t.
8. The method of claim 6, wherein the object of the specific type is modelled as a bounding box, and wherein the inverse pixel transform T⁻¹ as a function of the distance d_t is given by: T⁻¹(d_t) = [s_center·s_factor/(2·s_box)]·d_t² + s_center·d_t; and wherein: d_t is the distance in the target image from the principal point P_t to the target pixel to which the inverse pixel transform T⁻¹ is to be applied; s_center is a constant scaling factor used for source pixels located within a distance of d_cutoff from the principal point P_t; s_factor is the scaling factor given by s_factor = [(z_0/z_2) − (z_0/z_1)]; z_1 is the camera mounting height above ground; z_2 is the camera mounting height above ground minus the assumed object height; and s_box is the projected height of the top of the bounding box, which bounding box is w_box wide and is given by s_box = w_box/z_2.
9. The method of claim 2, wherein at least one part of the object of the specific object type is modelled as a sphere, and wherein a scaling factor is proportional to a projected radius r of the sphere in the source image.
10. The method of claim 9, wherein the inverse pixel transform T⁻¹ as a function of the distance d_t is given by:
11. The method of claim 1, wherein the inverse pixel transform T⁻¹ is rotationally symmetric around a principal point P_t of the target image.
12. An image processor configured to provide a target image for evaluation with an object detector, wherein the image processor is configured to: obtain a source image captured by an image capturing module and depicting one or more objects in a scene; apply an inverse pixel transform T⁻¹ to each target pixel of a target image to determine one or more source pixels located at a position in the source image corresponding to a position of each target pixel in the target image; assign, to each target pixel, a target pixel value determined based on one or more source pixel values of the determined one or more source pixels located at the corresponding position, whereby a size c in the target image of the depicted one or more objects of a specific object type is normalized in at least one size dimension of the depicted one or more objects of the specific object type; and feed the target image to an object detector for evaluation.
13. A camera comprising the image processor of claim 12, and configured to: capture the source image depicting the one or more objects in the scene; provide the source image to the image processor; and wherein the camera further comprises the object detector configured to receive the target image and to evaluate the target image by performing object detection on the target image.
14. The camera of claim 13, wherein the camera is a monitoring camera.
15. A non-transitory computer-readable medium having stored thereon computer code instructions adapted to carry out a method when executed by a device having processing capability, the method providing a target image for evaluation with an object detector, the method comprising: obtaining a source image captured by a camera and depicting one or more objects in a scene; applying an inverse pixel transform T⁻¹ to each target pixel of a target image to determine one or more source pixels located at a position in the source image corresponding to a position of each target pixel in the target image; assigning, to each target pixel, a target pixel value determined based on one or more source pixel values of the determined one or more source pixels located at the corresponding position, whereby a size c in the target image of the depicted one or more objects of a specific object type is normalized in at least one size dimension of the depicted one or more objects of the specific object type; and feeding the target image to an object detector for evaluation.
16. The method of claim 15, wherein the inverse pixel transform T⁻¹ takes as input an assumed size in the at least one size dimension of an object of the specific object type, one or more mounting parameters of the camera, and a distance d_t from a principal point P_t of the target image to the target pixel to which the inverse pixel transform T⁻¹ is to be applied.
17. The method of claim 16, wherein the assumed object size is an assumed object height and/or an assumed object width, wherein the one or more mounting parameters comprise a camera mounting height above ground, and wherein the assumed object size and the one or more mounting parameters are given in a world coordinate system.
18. The method of claim 16, wherein the inverse pixel transform T⁻¹ applied to each target pixel forces the target pixels to reproduce the depicted one or more objects of the specific object type in the target image with a constant size in the at least one size dimension irrespective of the location of each of the depicted one or more objects of the specific object type in the target image.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0018] The above, as well as additional objects, features and advantages of the present disclosure, will be better understood through the following illustrative and non-limiting detailed description of embodiments of the present disclosure, with reference to the appended drawings, where the same reference numerals will be used for similar elements.
DETAILED DESCRIPTION
[0029] The present disclosure will be described more fully hereinafter with reference to the accompanying drawings, in which embodiments of the disclosure are shown.
[0031] The term field of view refers to the part of the scene that is visible through the camera 108 at a particular position and orientation in space of the camera 108 and at a particular zoom setting of the camera 108. The particular position is given by the installation location and the orientation is given by the pan setting and/or tilt setting of the camera. Thus, it should be understood that the field of view may depend on one or more different camera parameters. For example, the field of view may depend on the installation location of the camera 108 such as height above ground, the type of camera lens, the zoom setting of the camera 108, the pan setting of the camera 108 and/or the tilt setting of the camera.
[0032] The camera 108 may be a monitoring camera, sometimes also referred to as surveillance camera. Further, the camera may be a fixed camera, e.g., a stationary camera, or a movable camera, e.g., a pan, tilt and zoom (PTZ) camera. The camera 108 may be a visible light camera, a thermal camera, or a camera comprising both a visible light camera and a thermal camera.
[0033] As further illustrated in FIG. 1, the camera 108 is connected over a network 112 to a client 114 and to a server 116.
[0034] The client 114 may have a display where an operator can view images and/or video streams from the camera. Typically, the client 114 is also connected to the server 116, where the images and/or video streams can be stored and/or processed further. The client 114 may be used to control the camera 108, for example, by the operator issuing control commands at the client 114.
[0036] The camera 108 further comprises an object detector 209. The object detector 209 may be implemented in software, e.g., as an object detection algorithm, to determine object recognitions in a captured image of the scene 102. Alternatively, or additionally, the object detector 209 may be implemented as an object detection neural network, e.g., a convolutional neural network (CNN), performing an object detection algorithm to determine the object recognitions in the captured image of the scene 102. These object recognitions may be associated with their images by including them as metadata to the images. The association may be kept through the subsequent encoding process that is performed by the encoder 210 of the camera 108. Object detectors as such are well known to those having ordinary skill in the art and thus will not be described in any more detail in this disclosure.
[0037] The camera 108 also comprises an image processor 208, an encoder 210 and an input/output interface 212.
[0038] The image processor 208, sometimes referred to as an image processing pipeline, processes captured images by different processing algorithms.
[0039] The image processor 208 may be configured to perform a range of various other operations on images received from the image sensor 204. Such operations may include: filtering; demosaicing; colour correction; noise filtering for eliminating spatial and/or temporal noise; distortion correction for eliminating effects of, e.g., barrel distortion; global and/or local tone mapping, e.g., enabling imaging of scenes containing a wide range of intensities; transformation, e.g., rotation; flat-field correction, e.g., for removal of the effects of vignetting; and application of overlays, e.g., privacy masks or explanatory text.
[0040] Following the image processor 208, the images are forwarded to the encoder 210, in which the images are encoded according to an encoding protocol and forwarded to a receiver, e.g., the client 114 and/or the server 116, over the network 112 using the input/output interface 212.
[0041] The camera 108 may also comprise a data storage 214 for storing data relating to the captured images, to the processing of the captured images and to the object recognitions, just to give some examples. The data storage may be a non-volatile memory, such as an SD card.
[0042] There are a number of conventional video encoding protocols. Some common video encoding protocols that work with the various embodiments of the present disclosure include: High Efficiency Video Coding (HEVC), also known as H.265 and MPEG-H Part 2; Advanced Video Coding (AVC), also known as H.264 and MPEG-4 Part 10; Versatile Video Coding (VVC), also known as H.266, MPEG-I Part 3 and Future Video Coding (FVC); VP9, VP10 and AOMedia Video 1 (AV1), just to give some examples.
[0043] As mentioned, the input/output interface 212 is configured to act as a communication interface between the camera 108, the network 112 and one or more receiving devices such as the client 114 and the server 116. Thus, the camera 108 may receive instructions from the client 114 and may transmit video streams to the client 114 and/or the server 116 via the input/output interface 212.
[0044] The image processor 208, the encoder 210 and the input/output interface 212 may form an image processing and encoding module 214, which is connected to the image capturing module 206. The image capturing module 206 and the image processing and encoding module 214 may be arranged as two separate units arranged at a distance from each other and in communication with each other. In such a scenario, the camera 108 may also be referred to as a camera system. Alternatively, the image capturing module 206 and the image processing and encoding module 214 may be arranged as a single unit comprised in the camera 108. Further, the image capturing module 206 may be movable, e.g., in pan and/or tilt directions, while the image processing and encoding module 214 may be stationary.
[0051] It should be noted that in an image captured with a pinhole camera, a non-flat object, e.g., an object whose extension in the longitudinal direction is larger than its extension in the lateral direction, is depicted smaller when located close to the optical axis than when located far from the optical axis. The difference in depicted sizes of the same non-flat object at different distances from the optical axis will increase with an increasing extension of the object in the longitudinal direction. Thus, if the object is a person and the extension in the longitudinal direction is given by the person's length, the difference in depicted sizes of the person at different distances from the optical axis will increase with increasing length of the person. This means that, in the image captured with the pinhole camera, a non-flat object with a given extension in the longitudinal direction and at a smaller lateral distance from the optical axis is depicted smaller than the same object would have been depicted had it been at a larger lateral distance. For an image captured by a fisheye camera the opposite is true, i.e., the non-flat object will be depicted larger when closer to the optical axis than when farther away from it. By embodiments disclosed herein, an object 104 at the same longitudinal distance from the camera 108 will be depicted with the same or almost the same size in the target image irrespective of its lateral distance to the optical axis, i.e., irrespective of where in the camera's field of view the object occurs, and thus irrespective of being depicted with different sizes in the source image.
[0052] A method to provide a target image that is to be evaluated with an object detector, e.g., the object detector 209, will now be described with reference to the flowchart of FIG. 5.
[0053] In action 502, a source image captured by the camera 108 is obtained. The source image depicts one or more objects 104 in the scene 102. The one or more objects 104 may be of one or more object types. For example, the one or more object types may be a human being, an animal or a vehicle. Thus, the object may be a person, a dog or a car. The source image may be captured by the image capturing module 206 of the camera 108. As previously mentioned, the image capturing module 206 may comprise the lens 202, e.g., a fisheye lens, and the image sensor 204. The source image has an image format corresponding to the size of the image sensor 204 such that each source pixel in the source image corresponds to a readout from a corresponding sensor element of the image sensor 204. For example, a source image format may be 1024×1024, 1920×1080, 2560×1440, 2880×2880 or 4096×2160 pixels. The image processor 208 may obtain the captured source image directly from the image capturing module 206, from an internal storage, e.g., the data storage 214 of the camera 108, or from an external device via the input/output interface 212.
[0054] In action 504, an inverse pixel transform T⁻¹ is applied to each target pixel of a target image to determine one or more source pixels located at a position in the source image corresponding to a position of each target pixel in the target image. Thus, the inverse pixel transform T⁻¹ is applied to every pixel of the target image irrespective of where in the source image the one or more objects 104 may occur. This means that the source image is not processed in any way to detect the one or more objects 104 or their locations in the source image before the inverse pixel transform T⁻¹ is applied to the target image. Instead, by applying the inverse pixel transform T⁻¹ to the target image, the target image is provided based on the captured source image in such a way that an object of a specific object type will be depicted in the target image with a size that has been normalised in at least one dimension irrespective of where in the source image it is depicted. This will be described in more detail below. The target image is of an image format suitable for being analysed by the object detector 209. For example, a suitable target image format may be 1024×1024, 1536×1536, 1920×1080, 2048×512, 2560×1440, or 4096×2160 pixels. It should be understood that the source image format and target image format do not have to be the same.
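Since the inverse pixel transforms described herein are rotationally symmetric around the principal point (see paragraph [0058] and claim 11), the mapping from each target pixel to a source position can be computed per radius. The following is a minimal sketch of this inverse-mapping step, assuming a caller-supplied radial function t_inv that maps a distance in the target image to a distance in the source image; the function names and the use of NumPy are illustrative only and are not part of the disclosure.

import numpy as np

def inverse_map(target_shape, principal_point, t_inv):
    # For every target pixel, compute the corresponding source position by
    # rescaling its radius from the principal point with the radial transform.
    h, w = target_shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    px, py = principal_point
    dx, dy = xs - px, ys - py
    d_t = np.hypot(dx, dy)              # distance d_t to the principal point
    d_s = t_inv(d_t)                    # corresponding source distance d_s
    scale = np.divide(d_s, d_t, out=np.ones_like(d_t), where=d_t > 0)
    # The source position lies on the same ray out of the principal point.
    return px + dx * scale, py + dy * scale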
[0055] In some embodiments, the inverse pixel transform T⁻¹ takes as input an assumed size in the at least one size dimension of the object 104 of the specific object type, one or more mounting parameters of the camera 108, and a distance d_t from the principal point P_t of the target image to the target pixel to which the inverse pixel transform T⁻¹ is to be applied.
[0056] For example, the assumed object size may be an assumed object height and/or an assumed object width, and the one or more mounting parameters may comprise a camera mounting height above ground. The assumed object size and the one or more mounting parameters are given in the world coordinate system. If, for example, the object 104 of the specific type is a person, the assumed size in the at least one size dimension of the object 104 may be an assumed length of the person. For example, the assumed length of the person may be set to an average length of a human being, e.g., to a value in the range of 1-2 meters, such as 1.8 m. The assumed size may be a predetermined value.
[0057] The closer the assumed size in the at least one dimension is to the real size of the object 104 in that dimension, the better the result of the inverse pixel transform. Thus, a better normalisation of the depicted object's size in the at least one dimension in the target image will be achieved when the assumed size in the at least one dimension is the same as or close to the object's real size in that dimension, as compared to the case when the object's real size differs from the assumed size. Therefore, it may be advantageous to select the assumed size depending on the type of object to be depicted and the expected size of that object.
[0058] Different inverse pixel transforms T⁻¹ will be described in more detail below. The different inverse pixel transforms T⁻¹ all have in common that they are rotationally symmetric around the principal point P_t of the target image. This may also be expressed as: they are rotationally symmetric around the principal point P_s of the source image, and around the optical axis of the camera 108, i.e., around the z-axis. It should be understood that the inverse pixel transforms T⁻¹ may be applied also for wall-mounted cameras or other cameras not capturing images from above the object 104. In such scenarios, the source image is captured by the wall-mounted camera, and a new inverse pixel transform T_new⁻¹ is determined. The new inverse pixel transform T_new⁻¹ produces, as a virtual intermediate image, the image the source image would have been had it been captured by a camera mounted in the ceiling at the same height as the wall-mounted camera. By combining the new inverse pixel transform T_new⁻¹ with the inverse pixel transform T⁻¹, a combined inverse pixel transform T_com⁻¹ is obtained, which may be applied to each target pixel of the target image to determine one or more source pixels located at a position in the source image corresponding to a position of each target pixel in the target image.
[0059] In action 506, a respective target pixel value is assigned to each target pixel. The respective target pixel value is determined based on one or more source pixel values of the determined one or more source pixels located at the corresponding position. Thereby, a size c in the target image of the depicted one or more objects 104 of a specific object type is normalized in at least one size dimension of the depicted one or more objects of the specific object type. It should be understood that when the inverse pixel transform T⁻¹ is applied to the target pixel, it may find a location in the source image that contributes to the target pixel, i.e., to the target pixel value. This location does not have to coincide with source pixels in the source image but may be a location between two or more source pixels. In such a case, an image value for the location may be determined by performing an interpolation of two or more source pixel values of two or more source pixels surrounding the location. If more than one such location in the source image contributes to the target pixel value, the target pixel value may be determined by performing an interpolation over the more than one such location. The interpolation may be a nearest-neighbour interpolation, a bicubic interpolation, a bilinear interpolation or another interpolation suitable for interpolating image values. The interpolation may additionally or alternatively comprise a downscaling interpolation such as a Lanczos interpolation, a spline interpolation, an area interpolation or a mipmap interpolation. These interpolations are well known to a person skilled in the art and will therefore not be described in more detail.
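As a concrete illustration of action 506, the sketch below bilinearly interpolates a single-channel source image at the fractional source positions computed in action 504 (e.g., by the illustrative inverse_map above). It is a simplified example of one of the interpolations mentioned, not the only option contemplated.

import numpy as np

def sample_bilinear(source, sx, sy):
    # Assign each target pixel the bilinearly interpolated value of the
    # source image at the (fractional) position (sx, sy); single channel.
    h, w = source.shape
    x0 = np.clip(np.floor(sx).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(sy).astype(int), 0, h - 2)
    fx = np.clip(sx - x0, 0.0, 1.0)
    fy = np.clip(sy - y0, 0.0, 1.0)
    top = source[y0, x0] * (1 - fx) + source[y0, x0 + 1] * fx
    bottom = source[y0 + 1, x0] * (1 - fx) + source[y0 + 1, x0 + 1] * fx
    return top * (1 - fy) + bottom * fy

Together with inverse_map above, a target image can then be produced as sample_bilinear(source, *inverse_map(target_shape, (px, py), t_inv)).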
[0060] Actions 504 and 506 will now be described with reference to an exemplifying figure.
[0061] As shown, the inverse pixel transform T⁻¹ results in the target pixels at distances d_t,1 and d_t,2, respectively, from the principal point P_t of the target image both obtaining the source pixel value of the source pixel at a distance d_s from the principal point P_s in the source image. Thereby, an object at the distance d_s from the principal point P_s in the source image will be enlarged in the target image. It should be understood that when the source image is captured by the camera 108 being a pinhole camera, the opposite is true, i.e., in analogy with the example above, two pixels in the source image will be combined into one pixel of the target image, thereby reducing the size of the object in the target image.
[0063] In action 508, the target image is fed to the object detector 209 for evaluation. Thus, the target image is provided to the object detector 209. For example, the image processor 208 may provide the target image directly to the object detector 209 or via the data storage 214, from where the object detector 209 may retrieve the target image. When in receipt of the target image, the object detector 209 is able to perform object detection on the target image. If the object detector 209 detects one or more objects of the specific type in the target image, it will conclude that the detected one or more objects were present also in the source image. The detections may be mapped back to the source image by transforming their coordinates using the inverse pixel transform T⁻¹.
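Because the transform is radial, mapping a detection back to the source image amounts to applying T⁻¹ to the distance of each detection point from the principal point. A hedged sketch (the helper name detection_to_source is illustrative):

def detection_to_source(pt_target, principal_point, t_inv):
    # Map a detected point in the target image back to the source image by
    # rescaling its radius from the principal point with T^-1.
    px, py = principal_point
    dx, dy = pt_target[0] - px, pt_target[1] - py
    d_t = (dx * dx + dy * dy) ** 0.5
    if d_t == 0.0:
        return (px, py)
    s = t_inv(d_t) / d_t
    return (px + dx * s, py + dy * s)

Applying this to, e.g., the corners of a detected bounding box gives the corresponding region in the source image.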
Different Inverse Pixel Transforms T⁻¹
[0064] Different inverse pixel transforms T⁻¹ will now be described in more detail. In the description below, the camera 108 is mounted at a height, e.g., in the ceiling, and captures the image of the object 104 from above.
[0065] It should be noted that the equations relating to the different inverse pixel transforms T⁻¹ are given for the case when the camera 108 is a pinhole camera, i.e., a camera with a narrow-angle lens. In case the camera 108 is a fisheye camera, i.e., a camera with a fisheye lens, the inverse pixel transform T⁻¹ may be combined with a dewarping transform T_dewarp taking the difference between the fisheye lens and the narrow-angle lens into consideration. The reason for this is that the spherical view of the fisheye camera causes angular distortion of straight lines, giving objects a strange bulged and deformed appearance. Therefore, a dewarping transform T_dewarp is applied to take the output of a fisheye lens and correct the deformed image so that lines appear straight again and the objects look normal. The image may also be rotated so that all portions of the view appear right-side-up.
[0066] Further, it should be understood that the inverse pixel transform T⁻¹ is a transformation that transforms a distance d_t from the principal point in the target image to a distance d_p from the principal point in the source image produced by the pinhole camera, i.e., d_p = T⁻¹(d_t). The cameras used in the real world today are not perfect pinhole cameras. However, their images may be transformed using classical dewarping transforms to produce the same images as a pinhole camera would produce. Such a dewarping transform T_dewarp may be combined with one of the inverse pixel transforms T⁻¹ described herein in order to apply them to real-world cameras. The same approach may be used to apply them to different types of cameras, such as for example fisheye cameras, by using different dewarping parameters in the dewarping transform T_dewarp. Let T_dewarp⁻¹ be the inverse dewarping transform that transforms the distance d_p into the distance d_s from the principal point of a warped source image, i.e., d_s = T_dewarp⁻¹(d_p). The full transform going from the target image to the warped image produced by the fisheye camera then becomes d_s = T_dewarp⁻¹(T⁻¹(d_t)).
[0067] For example, the inverse dewarping transform T_dewarp⁻¹ may be a polynomial relating a viewing angle α = arctan(d_p/f) to the distance d_s as:

d_s = c_0 + c_1·α + c_2·α² + c_3·α³ + c_4·α⁴

[0068] wherein c_0, c_1, c_2, c_3 and c_4 are constants describing the lens used, e.g., describing the used narrow-angle lens or the used fisheye lens. The focal length of the lens is given by f. The viewing angle is sometimes referred to as an angle of view or a field of view. For a fisheye lens, the viewing angle α may be between 100 and 180 degrees. The constants c_0, c_1, c_2, c_3 and c_4 may for a fisheye camera be c_0=0, c_2=0.01, c_3=0.03, c_4=0.001, just to give an example. However, it should be understood that the constants may be varied to describe another type of fisheye lens or a narrow-angle lens.
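A small sketch of this composition under stated assumptions: the focal length f and the value of c_1 (which the example above does not give) are placeholders chosen here purely for illustration, and t_inv again stands for any of the radial transforms T⁻¹ described below.

import numpy as np

F = 800.0                              # focal length in pixels (assumed)
C = (0.0, 1.0, 0.01, 0.03, 0.001)      # c_0..c_4; c_1 = 1.0 is an assumption

def dewarp_inverse(d_p, f=F, coeffs=C):
    # d_s = c_0 + c_1*alpha + c_2*alpha^2 + c_3*alpha^3 + c_4*alpha^4,
    # with the viewing angle alpha = arctan(d_p / f).
    alpha = np.arctan(d_p / f)
    return sum(c_k * alpha**k for k, c_k in enumerate(coeffs))

def full_inverse(d_t, t_inv):
    # d_s = T_dewarp^-1(T^-1(d_t)): target image -> warped fisheye source image.
    return dewarp_inverse(t_inv(d_t))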
Height Scaling and Box Scaling
[0069] In some embodiments, herein described as relating to height scaling and box scaling, respectively, the inverse pixel transform T⁻¹ is a function depending on a scaling factor s_factor and on the distance d_t between the principal point P_t of the target image and the target pixel to which the inverse pixel transform T⁻¹ is to be applied.
[0070] The determination of the inverse pixel transform T⁻¹ for embodiments relating to height scaling will now be described. For these embodiments, the inverse pixel transform T⁻¹ may be seen as a transform that rescales the source image with a scale factor that varies over the image and is proportional to the inverse of the height h_object of the object 104 viewed from a camera 108 installed at a height h_camera and looking straight down. Referring back to the geometry described above, consider an object 104 at a lateral distance x from the optical axis: its bottom, at depth z_1 from the camera, projects to d_1 = f·x/z_1; its top, at depth z_2, projects to d_2 = f·x/z_2; and its centre, at depth z_0, projects to d_0 = f·x/z_0. The projected height of the object is then:

d_2 − d_1 = f(x/z_2 − x/z_1) = (z_0/z_2 − z_0/z_1)·d_0 = s_factor·d_0,

[0071] where s_factor = (z_0/z_2 − z_0/z_1).
[0072] Let F(d) define a function that scales the target image with this projected height in every pixel. That gives the inverse of the pixel transformation that is to be applied to normalize the size of the object 104 in the target image. Thus, the inverse pixel transform T⁻¹ is equal to F(d). However, the function F(d) breaks down in the centre of the source image, where the projected height would be 0. Therefore, a distance threshold value d_cutoff is introduced and the scale is set to a constant s_center for distances d below the distance threshold d_cutoff. The local scaling of a function is its derivative, which gives:

F′(d) = s_factor·d + s_1 if d > d_cutoff, and
[0073] F′(d) = s_center if d ≤ d_cutoff,
wherein s_1 is an unknown constant. Integrating this gives:
F(d) = s_factor·d²/2 + s_1·d + s_0 if d > d_cutoff and
[0074] F(d) = s_center·d + ŝ_0 if d ≤ d_cutoff;
wherein s_0 and ŝ_0 are unknown constants.
[0075] The image centre at distance d = 0 should remain at d = 0, i.e.,
[0076] F(d=0) = 0, giving that ŝ_0 = 0.
Also, the scale and position at the distance threshold d_cutoff should be the same for the two pieces of the function, i.e.,
[0077] F′(d_cutoff) = s_factor·d_cutoff + s_1 = s_center, giving that s_1 = s_center − s_factor·d_cutoff, and
[0078] F(d_cutoff) = s_factor·d_cutoff²/2 + s_1·d_cutoff + s_0 = s_center·d_cutoff, giving that s_0 = s_center·d_cutoff − s_factor·d_cutoff²/2 − s_1·d_cutoff = s_factor·d_cutoff²/2.
[0079] Thus, for embodiments relating to height scaling, the inverse pixel transform T⁻¹ as a function of the distance d_t is given by:
[0080] T⁻¹(d_t) = s_factor·d_t²/2 + s_1·d_t + s_0 if d_t > d_cutoff and
[0081] T⁻¹(d_t) = s_center·d_t if d_t ≤ d_cutoff; and wherein:
[0082] s_1 = s_center − s_factor·d_cutoff;
[0083] s_0 = s_factor·d_cutoff²/2;
[0084] s_factor is the scaling factor given by s_factor = [(z_0/z_2) − (z_0/z_1)];
[0085] z_1 is the camera mounting height above ground;
[0086] z_2 is the camera mounting height above ground minus the assumed object height;
[0087] z_0 is an assumed height above ground of a centre of the object of the specific type and is given by (z_1 + z_2)/2;
[0088] d_t is the distance in the target image from the principal point P_t to the target pixel to which the inverse pixel transform T⁻¹ is to be applied;
[0089] d_cutoff is a distance threshold value at which an object of the specific object type in the source image will have the same size as the object of the specific object type in the target image; and
[0090] s_center is a constant scaling factor used for source pixels located within the distance d_cutoff from the principal point P_t.
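The piecewise transform above is straightforward to implement. The following hedged sketch assumes distances in pixels, world quantities in meters, and s_center = 1; for example, with z_1 = 3 m and an assumed object height of 1.8 m, one gets z_2 = 1.2 m, z_0 = 2.1 m and s_factor = 2.1/1.2 − 2.1/3.0 = 1.05.

import numpy as np

def t_inv_height(d_t, z1, obj_height, d_cutoff, s_center=1.0):
    # Height-scaling inverse pixel transform from paragraphs [0079]-[0090].
    z2 = z1 - obj_height        # camera height minus assumed object height
    z0 = (z1 + z2) / 2.0
    s_factor = z0 / z2 - z0 / z1
    s1 = s_center - s_factor * d_cutoff
    s0 = s_factor * d_cutoff**2 / 2.0
    d_t = np.asarray(d_t, dtype=float)
    far = s_factor * d_t**2 / 2.0 + s1 * d_t + s0
    near = s_center * d_t
    return np.where(d_t > d_cutoff, far, near)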
[0091] In some embodiments relating to box scaling, the object 104 of the specific type is modelled as a bounding box. This may be the case when the object 104, being for example a person, is modelled using a rectangular shape. The bounding box may be referred to as an enclosing box.
[0092] The determination of the inverse pixel transform T⁻¹ for embodiments relating to box scaling will now be described. To avoid the need to specify a distance cutoff threshold, i.e., the distance threshold d_cutoff mentioned above, it is possible to consider a bounding box instead of only considering the height. The bounding box may be a full 3D bounding box. For simplicity, the bounding box is assumed to be a rotated bounding box that always faces the camera. The top of the bounding box, extending laterally from x to x + w_box at depth z_2, then projects to a size:

s_box = (x + w_box)/z_2 − x/z_2 = w_box/z_2.
The total projected height will then be s_factor·d + s_box, and this gives the pixel transformation function by letting:
F′(d) = s_1·(s_factor·d + s_box),
wherein s_1 is an unknown constant that can be defined by specifying that the scale at the center is set to be s_center:
[0093] F′(0) = s_1·s_box = s_center, giving that s_1 = s_center/s_box, which gives that:
[0094] F′(d) = (s_center/s_box)·(s_factor·d + s_box) = s_center·s_factor·d/s_box + s_center.
Integrating F′(d) gives that:
[0095] F(d) = [s_center·s_factor/(2·s_box)]·d² + s_center·d;
wherein the integration constant has been set to 0 to have F(d=0) = 0.
[0096] Thus, in embodiments relating to box scaling, and since the inverse pixel transform T⁻¹ is equal to F(d), the inverse pixel transform T⁻¹ as a function of the distance d_t is given by:
[0097] T⁻¹(d_t) = [s_center·s_factor/(2·s_box)]·d_t² + s_center·d_t; wherein:
[0098] d_t is the distance in the target image from the principal point P_t to the target pixel to which the inverse pixel transform T⁻¹ is to be applied;
[0099] s_center is a constant scaling factor used for target pixels located at the principal point P_t;
[0100] s_factor is the scaling factor given by s_factor = [(z_0/z_2) − (z_0/z_1)];
[0101] z_1 is the camera mounting height above ground;
[0102] z_2 is the camera mounting height above ground minus the assumed object height; and
[0103] s_box is the projected height of the top of the bounding box, which, if the bounding box is w_box wide, is given by s_box = w_box/z_2.
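A corresponding sketch for the box-scaling transform, under the same illustrative assumptions as above (pixel distances, world units in meters, s_center = 1); note that no d_cutoff is needed here.

def t_inv_box(d_t, z1, obj_height, w_box, s_center=1.0):
    # Box-scaling inverse pixel transform from paragraphs [0096]-[0103].
    z2 = z1 - obj_height
    z0 = (z1 + z2) / 2.0
    s_factor = z0 / z2 - z0 / z1
    s_box = w_box / z2          # projected size of the top of the box
    return (s_center * s_factor / (2.0 * s_box)) * d_t**2 + s_center * d_t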
[0104] By the respective inverse pixel transforms T⁻¹ according to the height scaling and the box scaling approaches, the size of the object is normalized in one size dimension of the depicted one or more objects of the specific object type. In the example when the object 104 is a person and the size is a length of the person, the respective inverse pixel transforms T⁻¹ according to the height scaling and the box scaling approaches will normalize the size of the object in the object's length direction.
Constant Size
[0105] In some embodiments, herein described as relating to transforming objects into a constant size, the inverse pixel transform T⁻¹ applied to each target pixel forces the target pixels to reproduce the depicted one or more objects 104 of the specific object type in the target image with a constant size in the at least one size dimension irrespective of the location of each of the depicted one or more objects of the specific object type in the target image.
[0106] The determination of the inverse pixel transform T⁻¹ for embodiments relating to a constant size approach will now be described. Letting the scale be relative to the projected object's height, as for example in the height scaling approach described above, does not necessarily mean that the height of the transformed object will be constant, i.e., the same, at different locations of the target image. The reason for this is that different parts of the object 104 will be transformed with different scales when using the inverse pixel transform T⁻¹ as in the height scaling approach. Therefore, in the constant size approach the transformed height of an object of known height is enforced to always be the same, i.e., constant. An object located at d_0 in the target image will be located at x = (d_0/f)·z_0 in the world, with its top located at d_2 = f·x/z_2 = d_0·z_0/z_2 and similarly its bottom at d_1 = d_0·z_0/z_1. Assume that the transformed height, using the pixel transformation T, of an object is c, then:
[0107] T((d_0·z_0)/z_2) − T((d_0·z_0)/z_1) = c.
Also assume that T(d) = a·log(d) + b for some constants a and b, then:
[0108] c = (a·log((d_0·z_0)/z_2) + b) − (a·log((d_0·z_0)/z_1) + b) = a·log(z_1/z_2), giving that
[0109] a = c/log(z_1/z_2).
[0110] After introducing a distance threshold d_cutoff as in the height scaling approach, this gives the inverse pixel transform:
[0111] T⁻¹(d) = exp[(d − b)/a] if d > d_cutoff and
[0112] T⁻¹(d) = d if d ≤ d_cutoff.
At the distance threshold d_cutoff the position should be the same. Thus:
[0113] T⁻¹(d_cutoff) = exp((d_cutoff − b)/a) = d_cutoff, giving that b = d_cutoff − a·log(d_cutoff), and fixing the height c of the object at the distance threshold d_cutoff, where T⁻¹(d_cutoff) = d_cutoff, gives that:
[0114] c = [(d_cutoff)(z_0/z_2)] − [(d_cutoff)(z_0/z_1)].
[0115] Thus, for embodiments relating to the constant size approach, the inverse pixel transform T⁻¹ as a function of the distance d_t may be given by:
[0116] T⁻¹(d_t) = exp[(d_t − b)/a] if d_t > d_cutoff and
[0117] T⁻¹(d_t) = d_t if d_t ≤ d_cutoff; wherein
[0118] a = c/[log(z_1/z_2)];
[0119] b = d_cutoff − a·log(d_cutoff);
[0120] c is a parameter specifying a size in the target image of an object 104 of the specific object type and is set to c = [(d_cutoff)(z_0/z_2)] − [(d_cutoff)(z_0/z_1)] to make the object at the distance d_cutoff in the source image and in the target image of equal size;
[0121] z_0 is an assumed height above ground of a centre of an object 104 of the specific type and is given by (z_1 + z_2)/2;
[0122] z_1 is the camera mounting height above ground;
[0123] z_2 is the camera mounting height above ground minus the assumed object height;
[0124] d_t is the distance in the target image from the principal point P_t to the target pixel to which the inverse pixel transform T⁻¹ is to be applied; and
[0125] d_cutoff is a distance threshold value at which distance from the principal point an object of the specific object type in the source image will have the same size as the object of the specific object type in the target image.
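A hedged sketch of the constant-size transform, again assuming pixel distances, world units in meters, and the choice of c from paragraph [0120]; with that choice the two pieces meet continuously at d_cutoff.

import numpy as np

def t_inv_constant(d_t, z1, obj_height, d_cutoff):
    # Constant-size inverse pixel transform from paragraphs [0115]-[0125].
    z2 = z1 - obj_height
    z0 = (z1 + z2) / 2.0
    c = d_cutoff * (z0 / z2) - d_cutoff * (z0 / z1)   # object size at d_cutoff
    a = c / np.log(z1 / z2)
    b = d_cutoff - a * np.log(d_cutoff)
    d_t = np.asarray(d_t, dtype=float)
    return np.where(d_t > d_cutoff, np.exp((d_t - b) / a), d_t)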
[0126] By the inverse pixel transform T⁻¹ according to the constant size approach, the size of the object is normalized in one size dimension of the depicted one or more objects of the specific object type. In the example when the object 104 is a person and the size is a length of the person, the inverse pixel transform T⁻¹ according to the constant size approach will normalize the size of the object in the object's length direction.
Sphere Scaling
[0127] In some embodiments, herein described as relating to sphere scaling, at least one part of the object 104 of the specific object type is modelled as a sphere, and a scaling factor is proportional to a projected radius r of the sphere in the source image. The determination of the inverse pixel transform T⁻¹ for embodiments relating to sphere scaling will now be described. In these embodiments, the object 104 is represented with a sphere and a scale factor is set to be proportional to the radius of the projection of that sphere. The centre of a sphere 104_sphere of radius r, viewed at a pixel distance d from the principal point and located at a depth z from the camera, lies in the world at:

a = (a_x, a_z) = (z·d/f, z).
To find the borders of the sphere 104_sphere, the vector a can be rotated 90 degrees into a vector b = (−a_z, a_x). The borders, p and q, can then be found as:

p = (p_x, p_z) = a + r·b/|b|;
q = (q_x, q_z) = a − r·b/|b|;
which can be projected back into the source image at pixels d_p = f·p_x/p_z and d_q = f·q_x/q_z. The diameter of the projection would be:

s(d) = |d_p − d_q|,

and if the scale in the center is to be 1, a pixel transformation with a local scaling of s(d)/s(0) at a distance d is to be applied. Integrating s(d) gives the inverse pixel transform T⁻¹(d) as:
[0128] Thus, in such embodiments relating to sphere scaling, the inverse pixel transform T⁻¹ as a function of the distance d_t in the target image is given by:
wherein:
[0129] z is an assumed distance between the camera 108 and the at least one part of the object 104 modelled as the sphere; and
[0130] d_t is the distance in the target image from the principal point P_t to the target pixel to which the inverse pixel transform T⁻¹ is to be applied.
[0131] If the integration above is performed numerically between 0 and 4 and a 4th-degree polynomial is fitted to the result, the inverse pixel transform T⁻¹ is, for r = 250 and z = 270 − 180 + r (in lightning_crowd: lens.py, rectst_sphere.py), given by T⁻¹(d_t) = −0.020750756·d_t⁴ + 0.21880565·d_t³ + 0.281135226·d_t² + 0.83111921·d_t + 0.01962886.
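For illustration, the fitted polynomial can be evaluated directly. A hedged sketch: the coefficient list is copied from paragraph [0131], and the evaluation range 0-4 matches the stated fitting interval; nothing else is prescribed by the disclosure.

import numpy as np

# Coefficients of T^-1(d_t) from paragraph [0131], highest degree first.
coeffs = [-0.020750756, 0.21880565, 0.281135226, 0.83111921, 0.01962886]
t_inv_sphere = np.polynomial.Polynomial(coeffs[::-1])  # ascending order

d_t = np.linspace(0.0, 4.0, 9)
print(t_inv_sphere(d_t))   # source distances for sampled target distances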
[0132] The at least one part of the object 104 modelled as a sphere may be a head or a torso when the object 104 is a person.
[0133] By the inverse pixel transform T⁻¹ according to the sphere scaling approach, the size of the object is normalized in two size dimensions of the depicted one or more objects of the specific object type. In the example when the object 104 is a person and the size is two-dimensional and given by a length and a width of the person, the inverse pixel transform T⁻¹ according to the sphere scaling approach will normalize the size of the object in both the object's length direction and width direction.
[0134] Embodiments also relate to the image processor 208 configured to provide a target image for evaluation with the object detector 209. The image processor 208 is configured to obtain a source image captured by an image capturing module 206 and depicting one or more objects 104 in a scene 102. The image processor 208 is further configured to apply an inverse pixel transform T⁻¹ to each target pixel of a target image to determine one or more source pixels located at a position in the source image corresponding to a position of each target pixel in the target image. Furthermore, the image processor 208 is configured to assign, to each target pixel, a target pixel value determined based on one or more source pixel values of the determined one or more source pixels located at the corresponding position, whereby a size c in the target image of the depicted one or more objects of the specific object type is normalized in at least one size dimension of the depicted one or more objects of the specific object type. Yet further, the image processor 208 is configured to feed the target image to an object detector for evaluation.
[0135] Embodiments also relate to a camera 108 for providing a target image that is to be evaluated by the object detector 209. The camera 108 comprises the image processor 208. The camera 108 is, e.g., by means of the image capturing module 206, configured to capture the source image depicting the one or more objects 104 in the scene 102. Further, the camera 108 is configured to provide the source image to the image processor 208. Furthermore, the camera 108 comprises the object detector 209 configured to receive the target image and to evaluate the target image by performing object detection on the target image.
[0136] Embodiments also relate to a non-transitory computer-readable medium having stored thereon computer code instructions adapted to carry out embodiments of the method described herein when executed by a device having processing capability.
[0137] As described above, the camera 108, e.g., the image processor 208 of the camera 108, may be configured to implement a method for providing the target image. For this purpose, the camera 108, e.g., the image processor 208 of the camera 108, may include circuitry which is configured to implement the various method steps described herein.
[0138] In a hardware implementation, the circuitry may be dedicated and specifically designed to implement one or more of the method steps. The circuitry may be in the form of one or more integrated circuits, such as one or more application specific integrated circuits or one or more field-programmable gate arrays. By way of example, the camera 108 may hence comprise circuitry which, when in use, obtains a source image, and applies an inverse pixel transform T⁻¹ to each target pixel of the target image to determine one or more source pixels located at a position in the source image corresponding to a position of each target pixel in the target image. The camera 108 may further comprise circuitry which, when in use, assigns, to each target pixel, a target pixel value determined based on one or more source pixel values of the determined one or more source pixels located at the corresponding position, whereby a size c in the target image of the depicted one or more objects of a specific object type is normalized in at least one size dimension of the depicted one or more objects of the specific object type. The camera 108 may further comprise circuitry which, when in use, feeds the target image to an object detector for evaluation.
[0139] In a software implementation, the circuitry may instead be in the form of a processor, such as a microprocessor, which in association with computer code instructions stored on a (non-transitory) computer-readable medium, such as a non-volatile memory, causes the camera 108, e.g., the image processor 208 of the camera 108, to carry out any method disclosed herein. Examples of non-volatile memory include read-only memory, flash memory, ferroelectric RAM, magnetic computer storage devices, optical discs, and the like. In a software case, each of the method steps described above may thus correspond to a portion of computer code instructions stored on the computer-readable medium, that, when executed by the processor, causes the camera 108 to carry out any method disclosed herein.
[0140] It is to be understood that it is also possible to have a combination of a hardware and a software implementation, meaning that some method steps are implemented in hardware and others in software.
[0141] It will be appreciated that a person skilled in the art can modify the above-described embodiments in many ways and still use the advantages of the disclosure as shown in the embodiments above. For example, the camera 108 does not need to be a single unit comprising the image capturing module 206 and the image processing and encoding module 214 at one location, but could be a virtual unit in which the image capturing module 206 and the image processing and encoding module 214 operate together while being provided at different locations. Further, the object detector 209 does not need to be arranged in the image processor 208, but could be arranged as a separate unit of the image processing and encoding module 214 and could be arranged in communication with the image processor 208, the encoder 210, the input/output interface 212, and the data storage 214. Thus, the disclosure should not be limited to the shown embodiments but should only be defined by the appended claims. Additionally, as the skilled person understands, the shown embodiments may be combined.