Apparatus and method for wide-range optical tracking during medical imaging

11308645 · 2022-04-19

Assignee

Inventors

CPC classification

International classification

Abstract

Methods to quantify motion of a human or animal subject during a magnetic resonance imaging (MRI) exam are provided. In particular, these algorithms make it possible to track head motion over an extended range by processing data obtained from multiple cameras. These methods make current motion tracking methods more applicable to a wider patient population.

Claims

1. A method of determining a position and orientation of an object in a medical imaging device, the method comprising: rigidly attaching one or more markers to the object, wherein each marker of the one or more markers comprises three or more feature points, wherein the three or more feature points of each marker of the one or more markers have known positions in a coordinate system of the corresponding marker; configuring two or more cameras to have partial or full views of at least one of the one or more markers; determining a camera calibration that provides transformation matrices T.sub.ij relating a coordinate system C.sub.i of camera i to a coordinate system C.sub.j of camera j, wherein i and j are index integers for the two or more cameras; forming two or more images of the one or more markers with the two or more cameras, wherein the known positions of the three or more feature points of each marker in the coordinate systems of the corresponding markers lead to image consistency conditions for images of the three or more feature points in the camera coordinate systems; wherein the image consistency conditions are relations that are true in images of the one or more markers because of known relative positions of the three or more feature points on each of the one or more markers; and solving the image consistency conditions to determine rigid-body transformation matrices M.sub.k relating coordinate systems MC.sub.k of each marker k to the coordinate systems of the two or more cameras, wherein k is an index integer for the one or more markers, whereby the position and orientation of the object is provided; wherein the solving the image consistency conditions to determine each rigid-body transformation matrix M.sub.k is performed with a least squares solution to an overdetermined system of linear equations; wherein the overdetermined system of linear equations for rigid-body transformation matrix M.sub.k is a set of two equations for each feature point of
marker k that is seen by each of the two or more cameras; and wherein the overdetermined system of linear equations for rigid-body transformation matrix M.sub.k has coefficients of the rigid-body transformation matrix M.sub.k as unknowns to be solved for.

2. The method of claim 1, wherein the two or more cameras are compatible with magnetic fields of a magnetic resonance imaging system.

3. The method of claim 1, wherein the one or more markers include a position self-encoded marker.

4. The method of claim 1, wherein the object is a head of a human subject.

5. The method of claim 1, wherein the camera calibration is determined prior to installing the two or more cameras in the medical imaging device.

6. The method of claim 1, wherein the camera calibration includes referencing each camera to system coordinates of the medical imaging device and enforcing consistency conditions for the camera calibration.

7. The method of claim 1, wherein all visible feature points of the one or more markers in the images are used in the solving of the image consistency conditions.

8. The method of claim 1, wherein fewer than all visible feature points of the one or more markers in the images are used in the solving of the image consistency conditions.

9. The method of claim 1, wherein a frame capture timing of the two or more cameras is offset, whereby an effective rate of tracking can be increased.

10. The method of claim 1, wherein the two or more cameras are arranged to allow a marker tracking range in a head-feet direction of a patient being imaged.

11. The method of claim 1, further comprising applying motion correction to medical imaging data based on the position and orientation of the object.

12. The method of claim 11, wherein the motion correction is applied adaptively.

13. The method of claim 12, wherein two or more of the one or more markers are attached to the object, and further comprising performing analysis of a relative position of the two or more markers as a marker consistency check.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1A shows a human subject with a marker attached rigidly to their head lying inside an MRI scanner, which includes three cameras that independently track the pose of the marker.

(2) FIG. 1B shows an example of an optical tracking marker with a self-encoding pattern.

(3) FIG. 2A shows n cameras positioned to view a marker placed on the forehead of a subject.

(4) FIG. 2B shows how the setup in FIG. 2A is robust to head motion and how the usefulness of particular cameras may increase or decrease as motion occurs.

(5) FIG. 3A shows how n cameras and n markers can be positioned to obtain tracking measurements from different locations on the surface of the head.

(6) FIG. 3B shows how the setup in FIG. 3A can be used to detect inconsistent marker motion.

(7) FIG. 4A shows how multiple cameras can be used to extend the effective field of view in the longitudinal (head-feet) direction.

(8) FIG. 4B shows how the setup in FIG. 4A allows a subject's head to be tracked at different positions in the head coil by using multiple cameras on their own or together.

(9) FIG. 5A shows how measurements obtained from video data from each camera can be combined to form a single estimate using the pose combination algorithm.

(10) FIG. 5B shows how measurements obtained from video data from each camera can be combined to form a single estimate using the augmented DLT algorithm.

(11) FIG. 6A shows how the augmented DLT algorithm finds the pose of the marker based on input from any number of cameras.

(12) FIG. 6B shows exemplary equations for solving for the pose of the marker using the augmented DLT algorithm.

(13) FIG. 7 shows how homogeneous transformations can be used to relate the coordinate frames between the MRI scanner and any number of cameras.

(14) FIG. 8 provides experimental results showing that the mean rotation error can be improved by combining data from two cameras using the pose combination algorithm.

DETAILED DESCRIPTION

A) General Principles

(15) To better appreciate the present invention, it will be helpful to briefly describe some embodiments with reference to the subsequent description. An exemplary embodiment of the invention is a method of determining a position and orientation of an object in a medical imaging device. The method includes five main steps.

(16) 1) Providing one or more markers rigidly attached to the object, where each marker includes three or more feature points, and where the feature points of each marker have known positions in a coordinate system of the corresponding marker. In other words, the feature points are marker features that can be distinguished from each other in images and which have known relative positions with respect to each other, provided they are on the same marker.

(17) 2) Providing two or more cameras configured to have partial or full views of at least one of the markers.

(18) 3) Determining a camera calibration that provides transformation matrices T.sub.ij relating a coordinate system C.sub.i of camera i to a coordinate system C.sub.j of camera j. Here i and j are index integers for the two or more cameras. See Eqs. 1 and 3 below for examples of such transformation matrices.

(19) 4) Forming two or more images of the one or more markers with the two or more cameras. Here the known positions of the feature points of each marker in the coordinate systems of the corresponding markers lead to image consistency conditions for images of the feature points in the camera coordinate systems. See Eqs. 2 and 4 below for examples of such consistency conditions. Here image consistency conditions refer to relations that are true in images of the markers because of the known relative positions of feature points on each marker. As a simple example, suppose three feature points are equally spaced in the x-direction of the marker coordinate system. That equal spacing relation will lead to corresponding relations in images including these three feature points. This kind of consistency condition is a single-image consistency condition, and is different from image to image consistency checks performed to see if a marker has moved, as described below.

(20) 5) Solving the image consistency conditions to determine transformation matrices M.sub.k relating the coordinate systems MC.sub.k of each marker k to the coordinate systems of the cameras, wherein k is an index integer for the one or more markers, whereby position and orientation of the object is provided. See FIG. 6B for an example of a system of image consistency conditions.

(21) The cameras are preferably compatible with magnetic fields of a magnetic resonance imaging system. The one or more markers can include a position self-encoded marker. The object can be a head of a human subject.

(22) The camera calibration can be performed prior to installing the cameras in the medical imaging device. The camera calibration can include referencing each camera to system coordinates of the medical imaging device and enforcing consistency conditions for the camera calibration.

(23) All or fewer than all visible feature points of the markers in the images can be used in the solution of the image consistency conditions. A frame capture timing of the two or more cameras can be offset to increase an effective rate of tracking. The cameras can be arranged to increase a marker tracking range in a head-feet direction of a patient being imaged.

(24) The position and orientation of the object can be used to apply motion correction to medical imaging data. Such motion correction can be applied adaptively. In cases where two or more markers are attached to the object, analysis of the relative position of the two or more markers can be performed as a marker consistency check. If this marker consistency check fails, the motion correction can be disabled.
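The marker consistency check described above can be sketched numerically. This is an illustrative sketch only, not the patented implementation: the function names, the 2 mm tolerance, and the translation-only comparison are assumptions for the example.

```python
import numpy as np

def relative_pose(Ma, Mb):
    """Pose of marker b expressed in marker a's coordinate frame,
    given 4x4 rigid-body transforms Ma and Mb of each marker."""
    return np.linalg.inv(Ma) @ Mb

def markers_consistent(Ma, Mb, Ma_ref, Mb_ref, tol_mm=2.0):
    """True rigid-body head motion moves both markers together, so
    their relative pose stays fixed.  A change relative to a reference
    measurement suggests skin motion or a dislodged marker, in which
    case motion correction can be disabled."""
    d = relative_pose(Ma, Mb)[:3, 3] - relative_pose(Ma_ref, Mb_ref)[:3, 3]
    return float(np.linalg.norm(d)) <= tol_mm
```

A full check would also compare the rotational component of the relative pose; only the translation is compared here for brevity.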

(25) Solving the image consistency conditions can be performed with a least squares solution to an overdetermined system of linear equations (i.e., more equations than unknowns).

B) Examples

(26) FIG. 1A shows an exemplary MRI system 100. A patient 110 is lying inside an MRI scanner 120. The patient wears an optical tracking marker 111 on their forehead. Several cameras 130 are positioned so as to have a view of the patient's forehead. Data from the cameras are transferred out of the scanner via optical fiber 131. Individual fibers can be combined together into a single fiber bundle 132 for easy handling. Alternatively, the cameras may be wireless, which has the advantage of complete flexibility in terms of the number of cameras used, since the system is highly modular and adding an extra camera does not affect the existing cameras. FIG. 1B shows an example of an optical tracking marker with a self-encoding pattern.

(27) FIG. 2A shows the head of a human subject 200 with an attached marker 201. Two or more cameras 210 are positioned so as to have a view of the marker. Whether or not the field of view from each camera overlaps is of no consequence. In this example, the field of view 211 of Camera 1 does not overlap with that of any other camera. However, the field of view 212 of Camera 2 does overlap with the field of view 213 of Camera n. This flexibility is unlike conventional stereo vision approaches, which require a field of view with as much overlap as possible between the two cameras in order to calculate the object pose.

(28) FIG. 2B shows the setup in FIG. 2A after a head rotation denoted by θ. Following this rotation, Camera 1 no longer has a robust view of the marker, so the algorithms described here decrease its contribution to the pose estimation process. Conversely, other cameras may now have a better view of the marker: their contributions to pose estimation are automatically increased.

(29) FIG. 3A shows an alternative implementation to FIGS. 2A-B, where multiple separate markers 301 are attached to the head of a human subject 300, rather than using a single marker, as shown in FIGS. 2A-B. Each marker is viewed by a separate camera 310 with non-overlapping fields of view. This implementation has advantages in the case of skin motion, which is typically a confounding non-rigid effect. Skin motion affects all markers differently, so there is an inherent averaging effect when the data are combined. FIG. 3B shows how the central marker could move differently than the left and right markers. When this happens, it is a strong indication that skin motion has occurred or that a marker has become dislodged and is no longer rigidly attached.

(30) In the implementation shown in FIGS. 3A-B, the markers shown are self-encoding markers. However, any marker can be used that has the property that a full or partial view of it is sufficient to calculate its pose (comprising three translation parameters and three rotation parameters). There are many well-known markers that have this property, including rigid 3D constellations of reflective spheres or two-dimensional markers with integrated moiré patterns.

(31) FIG. 4A shows an arrangement that extends the tracking range of the optical system in the longitudinal (head-feet) direction. The patient table 401 is equipped with a head coil 402, where the subject's head 403 is positioned. A marker 404 is attached to the head of the subject. In practice, there is considerable variation in how far into the head coil the subject's head 403, and therefore the marker 404, lies. Two cameras (405 and 406) are placed on the head coil such that their fields of view (407 and 408) only partially overlap and so that the ‘combined’ field of view from both cameras covers a greater range in the head-feet direction than a single camera alone. In this example, two cameras are used; however, this arrangement is not limited to two cameras, and any number of extra cameras can be added depending on the desired tracking range.

(32) FIG. 4B shows three modes of operation of the apparatus shown in FIG. 4A. The diagram on the left illustrates the situation where the subject's head is fully inserted into the head coil. In this case, the marker lies in the field of view of Camera 1, but not of Camera 2. No data combination is required, since tracking data from Camera 1 alone may be used. The diagram in the middle illustrates the situation when the subject's head is placed in a neutral position in the head coil. In this case, the marker lies in the field of view of both Camera 1 and Camera 2. Although data from either camera could be used alone, discarding the other, this would be sub-optimal, and data fusion as described below should instead be used. The diagram on the right illustrates the situation where the subject's head does not reach far into the head coil, which can occur in subjects with shorter necks. In this case, the marker lies in the field of view of Camera 2, but not in the field of view of Camera 1. Here data fusion is not required, since tracking data from Camera 2 alone may suffice. In our experience, subjects move sufficiently during their MRI examination that the marker can move from the field of view of one camera to that of the other. Therefore, data fusion is preferably always used, so that such patient motion is automatically handled.

(33) FIGS. 5A-B show two methods that can be used to combine pose measurements obtained from video data from each camera to form a single estimate. We refer to the two methods as (FIG. 5A) the ‘pose combination algorithm’ and (FIG. 5B) the ‘augmented DLT algorithm’, where DLT is an abbreviation for the well-known direct linear transform. The augmented DLT algorithm is our preferred method for use with the self-encoded marker design described here. To better appreciate the preferred DLT approach, it is helpful to summarize the pose combination algorithm.

(34) The pose combination algorithm (FIG. 5A) works as follows. At any point in time the latest pose is calculated from the latest frames from all cameras that observed the marker. Given n cameras, n individual pose estimates are computed and then one ‘optimal’ estimate is computed from these. For each individual pose estimate, a scalar weight, w.sub.i, is computed, which represents the reliability of the estimate for camera i and where

(35) Σ.sub.i=1.sup.n w.sub.i=1.
The estimates are then combined using a weighted sum. For the translation component of pose, the combined estimate is given by
t.sub.c=w.sub.1t.sub.1+w.sub.2t.sub.2+ . . . +w.sub.nt.sub.n
where t.sub.i is the vector translation component of the pose estimate from camera i.

(36) The combined estimate of the rotation component of each pose is computed using a similar weighting procedure. However, simply averaging rotation matrices or Euler angles is not a mathematically valid approach. Instead, rotation components derived from the individual camera views are first expressed as unit quaternions, q.sub.i. The combined estimate, q.sub.c, is then calculated using one of several known methods, such as spherical linear interpolation (slerp) or the method of Markley et al., “Averaging Quaternions”, Journal of Guidance, Control and Dynamics, Vol. 30, No. 4, 2007. In our experience, when the unit quaternions to be averaged all represent a similar rotation, a simple and computationally efficient approximation to these methods can be obtained using the following procedure:

(37) 1) Changing the sign of all unit quaternions with negative real part (q and −q represent the same rotation, but can't be easily averaged).

(38) 2) Taking the mean of all n unit quaternions by adding all components and dividing by n.

(39) 3) Renormalizing by dividing the result from (2) by its norm, so that the combined quaternion, q.sub.c, is a unit quaternion.

(40) If weighted averaging is desired, then weights can be easily included as part of Step (2).
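The weighted translation combination and the three quaternion-averaging steps above can be sketched as follows. Quaternions are taken in (w, x, y, z) order and the function names are illustrative assumptions, not the patent's own code.

```python
import numpy as np

def combine_translations(ts, weights):
    """Weighted sum of per-camera translation estimates:
    t_c = w_1*t_1 + w_2*t_2 + ... + w_n*t_n."""
    return sum(w * np.asarray(t, float) for w, t in zip(weights, ts))

def combine_rotations(quats, weights=None):
    """Approximate average of unit quaternions that all represent a
    similar rotation, following steps (1)-(3) in the text."""
    q = np.array(quats, dtype=float)
    if weights is None:
        weights = np.full(len(q), 1.0 / len(q))
    # Step 1: flip quaternions with negative real part; q and -q
    # encode the same rotation but would cancel in a naive mean.
    q[q[:, 0] < 0] *= -1.0
    # Step 2: weighted component-wise mean.
    q_mean = (np.asarray(weights)[:, None] * q).sum(axis=0)
    # Step 3: renormalize back onto the unit quaternion sphere.
    return q_mean / np.linalg.norm(q_mean)
```

As the text notes, this approximation is only valid when the input quaternions are close to one another; widely separated rotations call for slerp or the Markley method instead.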

(41) The augmented DLT algorithm (FIG. 5B) differs significantly from the pose combination algorithm and is a novel approach that we have developed to optimally combine camera data from a single self-encoded marker. Rather than computing a pose estimate for each camera and then combining poses, the feature points are combined first and then a single pose is computed. This has a number of advantages relating to data weighting, which is performed automatically by the algorithm, rather than requiring the specific calculation of weights. A common example is the situation with one marker and two cameras, where one of the cameras has a good view of the marker (>40 points), but the other camera has a poor view (<15 points). By combining the points prior to pose calculation, the camera with the best view automatically receives the higher weighting, since a greater number of points from that camera are being used to calculate the marker pose.

(42) FIG. 6A further illustrates how the augmented DLT algorithm functions. In this example, there are two cameras, C1 and C2, but the same principles apply to any number of cameras. It is important to note that this augmented DLT algorithm is completely different than stereovision. In stereovision, a point cloud is extracted from the scene, such that all points in the cloud are visible to all cameras. Additionally, the relative locations of these points in the cloud are unknown. In contrast, in our case, a marker with known geometry is tracked, i.e., the locations of the points with respect to each other are known. Additionally, the tracked points need not be in the field-of-view of all cameras: different cameras can see different parts of the object and can still fuse the data to form a single pose estimate. This scenario is depicted in FIG. 6A where the two points .sup.wX.sub.1 and .sup.wX.sub.2 are visible to Cameras 1 and 2, respectively.

(43) The augmented DLT algorithm determines the pose of the marker coordinate frame (W) with respect to a reference camera frame (arbitrarily chosen to be C.sub.1 in this example). This pose is represented by a 4-by-4 transformation matrix T.sub.WC1. Here, we are assuming that the extrinsic calibration of the camera system is already known, i.e., the transformation matrix T.sub.C1C2 linking the two coordinate frames.

(44) Cameras 1 and 2 track two points, .sup.wX.sub.1 and .sup.wX.sub.2, respectively. The left superscript w indicates that .sup.wX.sub.1 and .sup.wX.sub.2 are defined with respect to the coordinate frame W, i.e.,
.sup.C1X.sub.1=T.sub.WC1.sup.WX.sub.1
.sup.C2X.sub.1=T.sub.C1C2T.sub.WC1.sup.WX.sub.1  (1)
In practice, the coordinate frame W corresponds to the coordinate frame defined by the marker.

(45) Using the pinhole camera model, the projection of .sup.C1X.sub.1=(.sup.C1x.sub.1, .sup.C1y.sub.1, .sup.C1z.sub.1) on the first camera image plane, .sup.C1I.sub.1=(.sup.C1u.sub.1.sup.(1), .sup.C1v.sub.1.sup.(1), −f.sup.(1)), can be determined as:

(46) .sup.C1u.sub.1.sup.(1)=f.sup.(1)(.sup.C1x.sub.1/.sup.C1z.sub.1)
.sup.C1v.sub.1.sup.(1)=f.sup.(1)(.sup.C1y.sub.1/.sup.C1z.sub.1)  (2)
where f.sup.(1) is the focal length of camera 1. Note that in Eq. 2, we used the coordinates .sup.C1X.sub.1, but in fact one knows .sup.wX.sub.1. Another important point is that the coordinates u and v in Eq. 2 are still defined with respect to a physical coordinate system C1, and are represented in physical units (e.g., millimeters). However, in reality, the location of a projected point on a camera image is described in pixels. The conversion from detected camera image pixel coordinates to physical coordinates (u, v) involves other steps, such as re-centering depending on the offset between the centers of the lens and detectors, and correcting for radial and tangential lens distortions. However, pixel-to-physical conversion rules are constant for a camera and can be determined offline using well-known intrinsic camera calibration methods (e.g., Zhang Z. A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence 2000; 22:1330-1334. doi: 10.1109/34.888718). Thus, without loss of generality, it can be assumed that the (u, v) coordinates in Eq. 2 can easily be determined from the pixel coordinates on the image. In fact, we can also drop the focal length f.sup.(1) in Eq. 2 by re-defining u′ and v′ such that u′=u/f and v′=v/f.
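The pinhole projection of Eq. 2 and the focal-length normalization u′=u/f, v′=v/f can be sketched as below. The function names are illustrative, and the pixel-to-physical conversion and lens-distortion correction discussed above are assumed to have been applied already.

```python
def project_point(X_cam, f):
    """Pinhole projection (Eq. 2): map a 3D point (x, y, z) in camera
    coordinates to physical image-plane coordinates (u, v)."""
    x, y, z = X_cam
    return f * x / z, f * y / z

def normalize(u, v, f):
    """Drop the focal length as in the text: u' = u/f, v' = v/f."""
    return u / f, v / f
```

For example, a point at (10, 20, 100) mm in camera coordinates with a 5 mm focal length projects to (u, v) = (0.5, 1.0) mm, i.e. (u′, v′) = (0.1, 0.2).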

(47) The transformation matrix between the marker and Camera 1, and between Camera 1 and Camera γ, can be defined as

(48) T.sub.WC1=
[R.sub.11 R.sub.12 R.sub.13 t.sub.1
R.sub.21 R.sub.22 R.sub.23 t.sub.2
R.sub.31 R.sub.32 R.sub.33 t.sub.3
0 0 0 1]
T.sub.C1Cγ=
[R.sub.11.sup.γ R.sub.12.sup.γ R.sub.13.sup.γ t.sub.1.sup.γ
R.sub.21.sup.γ R.sub.22.sup.γ R.sub.23.sup.γ t.sub.2.sup.γ
R.sub.31.sup.γ R.sub.32.sup.γ R.sub.33.sup.γ t.sub.3.sup.γ
0 0 0 1]  (3)
where γ is the camera index. In both cases, the 3-by-3 matrix R represents the rotation and the 3-by-1 vector t represents the translation. T.sub.C1Cγ is already known through extrinsic camera calibration and T.sub.WC1 is the marker pose that is to be determined using DLT. Assuming an arbitrary point κ and camera γ, we can re-arrange Eq. 2 (dropping the focal length) to get:
.sup.Cγu.sub.κ.sup.(γ) .sup.Cγz.sub.κ−.sup.Cγx.sub.κ=0
.sup.Cγv.sub.κ.sup.(γ) .sup.Cγz.sub.κ−.sup.Cγy.sub.κ=0  (4)

(49) Combining Eqs. 1, 3, and 4 and cascading the equations for each detected point for all cameras gives the system of equations shown in FIG. 6B. In this figure a condensed notation is used: coordinate systems are indicated with a right superscript instead of a left superscript, and the explicit superscript denoting the camera is dropped, because the coordinate system in use suffices to identify the corresponding camera. FIG. 6B shows the two equations for a single feature point as seen by one camera. The expressions for the matrix elements are given on two lines to make the expression compact enough to fit on the page. Such a pair of equations exists for each feature point on the marker that is seen by each camera.

(50) More explicitly, the matrix in FIG. 6B has 12 columns and two rows per detected point, i.e., it is (2Σ.sub.γ=1.sup.n.sup.γ n.sub.k.sup.(γ))-by-12, where n.sub.γ is the total number of cameras and n.sub.k.sup.(γ) is the number of points detected by camera γ. In cases where more than one marker is employed, a system of equations as in FIG. 6B can be solved for each marker.

(52) Solution of the system of FIG. 6B and extraction of rotation and translation parameters is straightforward using singular value decomposition or iterative methods (Hartley R, Zisserman A. Multiple View Geometry in Computer Vision. 2003.).
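A minimal numerical sketch of the augmented DLT solve follows. It assembles the two Eq.-4 equations per detected point (in normalized coordinates u′, v′) for every camera, with the 12 entries of the top three rows of T.sub.WC1 as the unknowns, and solves the stacked overdetermined system by linear least squares. The function names, and the use of numpy's lstsq (which applies an SVD internally) rather than an explicit null-space computation, are illustrative assumptions.

```python
import numpy as np

def dlt_rows(Xw, uv, A):
    """Two linear equations (Eq. 4) for one marker point seen by one
    camera.  Xw: point in marker coordinates; uv: normalized image
    coordinates (u', v'); A: 4x4 extrinsic transform from the
    reference camera C1 to the observing camera."""
    Xh = np.append(Xw, 1.0)                      # homogeneous point
    u, v = uv
    # u'*z - x = 0 and v'*z - y = 0 in the observing camera's frame,
    # expanded so the unknowns are the 12 entries of T_WC1's top rows.
    row_u = np.outer(u * A[2, :3] - A[0, :3], Xh).ravel()
    row_v = np.outer(v * A[2, :3] - A[1, :3], Xh).ravel()
    return [row_u, row_v], [A[0, 3] - u * A[2, 3], A[1, 3] - v * A[2, 3]]

def solve_pose(observations):
    """Stack the equations from all points and all cameras, then solve
    the overdetermined system in the least-squares sense."""
    G, b = [], []
    for Xw, uv, A in observations:
        rows, rhs = dlt_rows(Xw, uv, A)
        G.extend(rows)
        b.extend(rhs)
    p, *_ = np.linalg.lstsq(np.asarray(G), np.asarray(b), rcond=None)
    return np.vstack([p.reshape(3, 4), [0.0, 0.0, 0.0, 1.0]])
```

Note that no orthogonality constraint is imposed on the recovered rotation block here; in a practical implementation the nearest rotation matrix would be extracted afterwards, e.g. via SVD, as in the cited Hartley and Zisserman reference.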

(53) FIG. 7 shows how the coordinate frames between the MRI scanner and two cameras, Camera 1 and Camera 2, are connected using homogeneous transformations. The knowledge of these transformations is needed for the methods described in this work. The means of obtaining the transformation between two cameras is well known to those in the field, as is the means to obtain the calibration between a single camera and the MRI scanner. However, due to the use of multiple cameras, it is possible to optimize these transformations to enforce consistency. Assuming the total number of cameras is two, then there are three relevant transformations, namely T.sub.C1S (linking Camera 1 and the scanner), T.sub.C2C1 (linking Camera 1 and Camera 2) and T.sub.SC2 (linking Camera 2 and the scanner). As seen in FIG. 7, if these transformations are correct and are applied sequentially, then an identity transform results, i.e.,
T.sub.C1ST.sub.C2C1T.sub.SC2=I  (5)

(54) Well-known iterative optimization methods can be used to modify the measured transformations, such that the above equation holds, and while satisfying constraints such as

(55) 1) Even distribution of errors between the scanner-camera cross-calibration transformations T.sub.C1S and T.sub.SC2, and/or

(56) 2) No errors in T.sub.C2C1 because camera-camera calibration can be done to far greater accuracy than scanner-camera calibration.
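The loop-closure condition of Eq. 5 is straightforward to verify numerically. A sketch (the function name and the Frobenius-norm error metric are illustrative assumptions):

```python
import numpy as np

def closure_error(T_C1S, T_C2C1, T_SC2):
    """Deviation of the calibration loop of Eq. 5 from the identity;
    zero means the three measured transformations are mutually
    consistent."""
    return float(np.linalg.norm(T_C1S @ T_C2C1 @ T_SC2 - np.eye(4)))
```

An iterative optimizer can then adjust the measured transformations to drive this error to zero, subject to constraints such as 1) or 2) above.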

(57) Given more than two cameras, it is possible to formulate the optimal solution of scanner-camera transformation in a least squared sense as follows. Arbitrarily choosing C1 as the reference frame, one can obtain:

(58) {tilde over (T)}.sub.C1S≈T.sub.C1S
{tilde over (T)}.sub.C2S≈T.sub.C1ST.sub.C2C1
. . .
{tilde over (T)}.sub.CγS≈T.sub.C1ST.sub.CγC1  (6)

(59) Here, {tilde over (T)}.sub.C1S, {tilde over (T)}.sub.C2S and {tilde over (T)}.sub.CγS are the measured camera-to-scanner transformations for cameras 1, 2 and γ. As mentioned above, the transformation between a camera and the MRI scanner can be obtained using methods well known to those in the field. In addition, the camera-to-scanner transformations for all cameras can be obtained within one experiment without additional time overhead. In Eq. 6, T.sub.CγC1 represents the transformation between camera γ and camera 1, and can be obtained outside the MRI scanner with a high degree of accuracy. T.sub.C1S in Eq. 6 is the reference-camera-to-scanner transformation that needs to be determined from the equations. Re-writing Eq. 6 as a least-squares problem:

(60) T.sub.C1S=argmin.sub.T.sub.C1S{Σ.sub.γ=1.sup.n.sup.γ∥{tilde over (T)}.sub.CγS−T.sub.C1ST.sub.CγC1∥.sup.2}  (7)
Eq. 7 represents a linear-least-squares problem with respect to the variables in T.sub.C1S, so it can be solved using any available linear equation solver. It is also possible to solve Eq. 7 using non-linear methods, such as Levenberg-Marquardt or Gauss-Newton. One can also solve Eq. 7 by separating the rotational and translational components and solving for the rotational component of the transformation matrices first.
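The linear-least-squares route to Eq. 7 can be sketched as below. Each term asks T.sub.C1S T.sub.CγC1 ≈ {tilde over (T)}.sub.CγS; transposing both sides turns the sum over cameras into one stacked linear system in the entries of T.sub.C1S. The function name is an illustrative assumption, and no rigid-body constraint is enforced on the result.

```python
import numpy as np

def fit_reference_calibration(T_tilde_CgS, T_CgC1):
    """Unconstrained least-squares solution of Eq. 7: find the single
    reference-camera-to-scanner transform T_C1S that best explains all
    measured camera-to-scanner transforms T_tilde_CgS, given accurate
    camera-to-camera transforms T_CgC1."""
    # Stack the transposed systems X_g @ T^T = B_g for all cameras g.
    X = np.vstack([T.T for T in T_CgC1])        # (4*n_gamma) x 4
    B = np.vstack([T.T for T in T_tilde_CgS])   # (4*n_gamma) x 4
    T_transposed, *_ = np.linalg.lstsq(X, B, rcond=None)
    return T_transposed.T
```

Because no orthogonality constraint is imposed, the rotation block of the result is not guaranteed to be a rotation matrix; as the text notes, the rotational and translational components can instead be solved separately, or the estimate refined with Levenberg-Marquardt or Gauss-Newton.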

(61) FIG. 8 shows experimental results obtained from an implementation of the pose combination algorithm shown in FIG. 5A. In this experiment, a rotation stage was used to provide ground truth information. A marker was moved while video data were collected using two cameras. The graph shows a comparison of errors in rotation for each camera individually and for the combined estimate (labeled ‘Weighted sum’). Of note is the spike in rotation error for Camera 1 between frames 40 and 50. This was caused by a poor view of the marker, leading to an ill-conditioned problem and noise in the pose estimate. In such an event, the weighted-sum approach substantially reduces the rotation estimate error. Similar automatic and adaptive compensation for poor views from individual cameras can be obtained from the augmented DLT method of FIG. 5B.