GPU-based system for performing 2D-3D deformable registration of a body organ using multiple 2D fluoroscopic views

11164324 · 2021-11-02

Abstract

Systems and methods for assisting a physician in a medical intervention comprise performing a 2D-3D deformable registration and, more particularly, performing a 2D-3D registration based on multiple live 2D fluoroscopic views, implemented on a multi-core processing framework such as a Graphics Processing Unit.

Claims

1. A system for deformably registering pre-acquired 3D CT image data of a patient to real time 3D data of the patient comprising: a first memory for receiving 3D CT image data of a body organ of the patient; a second memory for receiving real-time image data of the patient from an imaging unit; wherein the real-time image data comprises multiple real-time images obtained at different view angles; a processor framework operable to: (a) generate a candidate 3D deformation field based on one or more deformation parameters and the 3D CT image data; (b) create at least one digitally reconstructed radiograph (DRR) for each real-time image of the multiple real-time images based on the candidate 3D deformation field and the different view angles corresponding to the multiple real-time images; (c) determine an optimal deformation field by evaluating each DRR and multiple real-time images for a match; and if a match is not reached, adjust the one or more deformation parameters and repeat the above processor framework steps; or if a match is reached, output the most-recently generated candidate deformation field as the optimal deformation field.

2. The system of claim 1, further comprising the imaging unit, and wherein the imaging unit is one selected from the group consisting of a C-Arm, Biplanar fluoroscopy, 3D fluoroscopy, and Cone Beam CT.

3. The system of claim 1, wherein the processor framework is operable to compute a cost function based on the candidate 3D deformation field and the real-time image data.

4. The system of claim 3, wherein the processor framework is further operable to compute a gradient vector based on the cost function.

5. The system of claim 4, wherein the processor framework comprises a GPU, and the computations of the cost function and gradient vector are performed on the GPU.

6. The system of claim 5, wherein the processor framework is operable to implement a learning-based or BFGS algorithm to determine the optimal deformation field based on the cost function and the gradient vector.

7. The system of claim 5, wherein the deformation parameters comprise a plurality of control points, and wherein the processor framework is operable to compute, in parallel, affected regions of the 3D CT image data for the plurality of control points.

8. The system of claim 1, wherein the processor framework is operable to compute the deformation field based on images arising from different breathing levels using 4D CT.

9. A computer-implemented method for assisting a physician in a medical intervention on a patient comprising: receiving pre-acquired 3D image data of a body organ of the patient prior to the medical intervention; receiving real-time image data of the body organ, wherein the real-time image data comprises multiple real-time images obtained at different view angles; and performing, on a processing framework, computations to determine a deformable registration on the pre-acquired 3D image data and the real-time data from the receiving step, and wherein the processing framework outputs an optimal 3D deformation field; wherein the performing comprises: generating a candidate 3D deformation field based on one or more deformation parameters and the 3D image data; creating at least one digitally reconstructed radiograph (DRR) for each real-time image of the multiple real-time images based on the candidate deformation field and the different view angles corresponding to the multiple real-time images; determining the optimal deformation field by evaluating each DRR and the multiple real-time images for a match; and if a match is not reached, adjust the one or more deformation parameters and repeat the above performing steps; or if a match is reached, output the optimal deformation field.

10. The method of claim 9 wherein the performing step includes computing a cost function value and a gradient vector to determine the deformable registration on the pre-acquired 3D image data.

11. The method of claim 10 comprising inputting the cost function value and the gradient vector into an optimizing algorithm.

12. The method of claim 11 comprising computing, in parallel, affected regions of the pre-acquired 3D data for a plurality of control points.

13. The method of claim 12 wherein the gradient vector further includes generating a first lookup table for pixel IDs in the affected regions for each control point.

14. The method of claim 13 wherein the multiple parallel computations are carried out on a graphics processing unit (GPU).

15. The method of claim 14 wherein the optimizing algorithm is a learning or limited-memory BFGS based algorithm.

16. The method of claim 9, wherein the receiving real time 3D data of the body organ is carried out with a unit for obtaining images selected from the group consisting of a C-Arm, Biplanar fluoroscopy, 3D fluoroscopy, and Cone Beam CT.

17. The method of claim 9, wherein the performing step uses multiple fluoroscopic views at a similar breathing level.

18. The method of claim 9, wherein the performing step is performed at different breathing levels using 4D CT.

19. The system of claim 1, wherein the evaluating each DRR and multiple real-time images for a match is performed simultaneously.

20. A computer-implemented method for assisting a physician in a medical intervention on a patient comprising: receiving pre-acquired 3D image data of a body organ of the patient prior to the medical intervention; receiving real-time image data of the body organ; and performing, on a processing framework, computations to determine a deformable registration on the pre-acquired 3D image data and the real-time image data from the receiving steps, and wherein the processing framework computes a 3D deformation field based on an optimizing algorithm; wherein the performing step comprises computing a cost function value and a gradient vector to determine the deformable registration on the pre-acquired 3D image data; computing, in parallel, affected regions of the pre-acquired 3D image data for a plurality of control points; wherein the gradient vector includes generating a first lookup table for pixel IDs in the affected regions for each control point; and inputting the cost function value and the gradient vector into the optimizing algorithm.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1A illustrates an overview of a system for performing a 2D to 3D deformable registration on a patient.

(2) FIG. 1B illustrates an overview of a method for performing a 2D to 3D deformable registration. The 3D CT is manipulated by a set of control points or deformation parameters (e.g., a grid of 3D free-form control points). DRRs are generated from the deformed 3D CT for multiple views. The multiple-view DRRs are compared to the multiple fluoroscopic images. If a match is reached, the deformed 3D CT is output. Otherwise, the control points are adjusted to deform the 3D CT image, and the process is repeated.

(3) FIG. 2 illustrates a framework of GPU implementation of the 2D-3D deformable registration.

(4) FIG. 3 illustrates a Kernel system including two kernel pipelines: a cost pipeline and a gradient pipeline.

(5) FIG. 4a illustrates an affected region in the DRR after the change of one control point.

(6) FIG. 4b illustrates parallel granularity for the kernels LocalDRRPixelIDLTBBuilder and LocalDRRsGenerator.

(7) FIG. 4c illustrates parallel granularity for the kernel Gradient.

DETAILED DESCRIPTION OF THE INVENTION

(8) Before the present invention is described in detail, it is to be understood that this invention is not limited to particular variations set forth herein as various changes or modifications may be made to the invention described and equivalents may be substituted without departing from the spirit and scope of the invention. As will be apparent to those of skill in the art upon reading this disclosure, each of the individual embodiments described and illustrated herein has discrete components and features which may be readily separated from or combined with the features of any of the other several embodiments without departing from the scope or spirit of the present invention. In addition, many modifications may be made to adapt a particular situation, material, composition of matter, process, process act(s) or step(s) to the objective(s), spirit or scope of the present invention. All such modifications are intended to be within the scope of the claims made herein.

(9) Methods recited herein may be carried out in any order of the recited events which is logically possible, as well as the recited order of events. Furthermore, where a range of values is provided, it is understood that every intervening value, between the upper and lower limit of that range and any other stated or intervening value in that stated range is encompassed within the invention. Also, it is contemplated that any optional feature of the inventive variations described may be set forth and claimed independently, or in combination with any one or more of the features described herein.

(10) All existing subject matter mentioned herein (e.g., publications, patents, patent applications and hardware) is incorporated by reference herein in its entirety except insofar as the subject matter may conflict with that of the present invention (in which case what is present herein shall prevail). The referenced items are provided solely for their disclosure prior to the filing date of the present application. Nothing herein is to be construed as an admission that the present invention is not entitled to antedate such material by virtue of prior invention.

(11) Reference to a singular item, includes the possibility that there are plural of the same items present. More specifically, as used herein and in the appended claims, the singular forms “a,” “an,” “said” and “the” include plural referents unless the context clearly dictates otherwise. It is further noted that the claims may be drafted to exclude any optional element. As such, this statement is intended to serve as antecedent basis for use of such exclusive terminology as “solely,” “only” and the like in connection with the recitation of claim elements, or use of a “negative” limitation. Last, it is to be appreciated that unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

(12) Described herein are systems and methods for assisting a physician to perform a 2D-3D deformable registration and, more particularly, a 2D-3D registration using multiple 2D fluoroscopic views implemented on a Graphics Processing Unit (GPU), a similar programmable logic chip, a processor, or a parallel architecture.

(13) Overview

(14) FIG. 1A illustrates an overview of a system for performing a 2D-3D deformable registration. In particular, FIG. 1A illustrates a patient or subject 10 on an operating table 20. Although the subject shown in FIG. 1A is a human, the invention is applicable to animals other than humans and may be performed on live or dead animals or subjects as the case may be.

(15) With reference to FIG. 1A, a surgical device 30 is shown positioned and extending into a lung of the subject. The surgical device 30 has a distal working end or tip 32 which has been passed through the subject's mouth, the trachea, a bronchus or more remote airways, and into the lung. While surgical device 30 shown in FIG. 1A is intended to represent an endoscope, namely, a bronchoscope, the invention is not so limited. The surgical device may be any of a wide range of devices, instruments, implants, and markers which are visible under fluoroscopy, have a portion which is visible under fluoroscopy, or are modifiable such that they are visible under fluoroscopy. Examples, without limitation, include catheters, sheaths, needles, ablation devices, stents, valves, fiducial markers, seeds, coils, etc. In embodiments, a surgical device or tool is advanced into the airways, and then through a wall of the airway towards a region of interest.

(16) A fluoroscope 40 takes real time fluoroscopic video of the subject. Video frames of the video or images are collected or received by workstation 50 for processing. Real time images may also be displayed on a video monitor 62. Examples of the fluoroscope unit, without limitation, are C-Arm, Biplanar fluoroscopy, 3D fluoroscopy, Cone Beam CT, etc.

(17) The location and pose of the fluoroscopy camera 42 is tracked with a tracking sensor 44. In the fluoroscopy unit shown in FIG. 1A, an optically visible symbol 46 is observed by the optical tracking sensor 44 to provide to workstation 50 information about the 3D location and orientation (namely, pose) of the fluoroscopy camera 42 in real time. The optical tracking sensor 44 may be positioned on any fixed surface including the floor, wall, ceiling, or a table. Although the tracking sensor shown in this embodiment is optically based, other techniques (for example, electromagnetic tracking system) to track the camera may be employed and are in accordance with the present invention.

(18) Additionally, the system can work without a tracking mechanism. The 3D camera positions of the fluoroscope relative to each other can be given or computed separately. For example, relative camera positions can be given by the fluoroscope unit itself, or can be computed by performing a rigid 2D-3D registration for each fluoroscopic image separately, or can be computed by observing 2D location of some fixed radio-opaque markers in fluoroscopic images.

(19) Workstation 50, as will be described in more detail below, is configured to receive the fluoroscopy images from the fluoroscopy unit 40 in real-time.

(20) FIG. 1A also illustrates a display 60 showing a plurality of images. As will be described in greater detail herein, workstation 50 is configured to send to the display a number of types of images including 3D model views, 2D model fluoroscopy views, real fluoroscopy views, real endoscopic views, model endoscopic views, and a wide range of information superimposed on the views such as without limitation planning information, region of interests, virtual target markers, vessels, virtual obstacles, real devices, virtual devices, routes to a target, notes and indicia provided by the user, etc.

(21) The workstation 50 shown in FIG. 1A registers the poses shown in the fluoroscopy images from the fluoroscope unit 40 with the 3D location in a 3D model of the subject. As described herein, the information and location may be generated and displayed in a number of ways to the physician to assist tracking the surgical device in real time, and in the event planning information has been provided to the workstation, to guide the physician to a target. Various planning information may be provided or determined by the workstation as described in U.S. Patent Application No. 2008/0183073 to Higgins et al., for example.

(22) As will be discussed in more detail herein, the workstation 50 includes at least one processor 70 (e.g., without limitation, a GPU or other multi-core processing framework) operable to perform a deformable registration in real time based on, amongst other things, the real-time fluoroscopy images.

(23) Workstation 50 is also shown having a memory device 80 which holds or stores information including imaging, device, marker, and procedural data. The memory device may be a hard drive, for example.

(24) The workstation 50 shown in FIG. 1A is adapted to receive real-time images (e.g., fluoroscopy and/or endoscopy images) through various input ports or connectors (e.g., USB port, video port, etc.). A frame grabber card 72 captures individual video frames or images for processing. Real time fluoroscopy images may be obtained by the workstation 50, for example, from continuous fluoroscopy video, video clips, and/or still frames. The workstation is further adapted to send image data to a display using a video card 82. An example of a workstation is a Dell Computer Precision T5500 T5400, with Dual-core Intel Xeon 2.67 GHz processor, and a Nvidia GeForce GTX TITAN BLACK.

(25) The system 90 shown in FIG. 1A also includes a display 60 which may present reports, data, images, results and models in various formats including without limitation graphical, tabular, and pictorial form. In one embodiment of the present invention, a surgical device is superimposed on a 3D model of the organ.

(26) It is to be understood, however, that although the system in FIG. 1A is shown with a memory 80 for receiving and storing various information the invention is not so limited. In an alternative embodiment the system may be configured to merely access a memory device such as a USB stick, a CD, or other media storage device.

(27) The system 90 shown in FIG. 1A also includes a user input device 86 such as, for example, a keyboard, joystick, or mouse. The user input device allows a user such as the physician to add or input data and information as well as modify planning information and to make notes in the files and records.

(28) FIG. 1B illustrates an overview of a method 110 for 2D-3D registration of the lung using multiple fluoroscopic views 112, 114. 3D image data 115 of a patient's lung is deformed 117 to produce a candidate deformed 3D image 116 based on a grid of control points 126. The 3D image data 115 may include, for example, preoperative image data arising from CT, MRI, or other means.

(29) Multiple DRRs are created 122, 124 from the deformed 3D image data 116.

(30) Next, and as will be described in more detail herein, the DRRs 122, 124 are compared 130 to the multiple fluoroscopic images 132, 134.

(31) When comparing a virtual fluoroscopic 2D image or the DRR with its candidate live fluoroscopic image, the DRR is produced according to the recorded view angle from the fluoroscope as discussed above in connection with FIG. 1A. The view angle between the successive view and the first view may be obtained by a tracking tool 46 fixed on the C-Arms of the fluoroscope. An example of tracking and calibrating the fluoroscope is described in US Patent Publication Nos. 20120289825 and 20120288173, both to Rai et al.

(32) If the images match, the deformed CT and the DRRs are output 142. Otherwise (e.g., if the images do not match to a desired level of accuracy), the control points 126 are adjusted 144 to deform the 3D image data.
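The deform-render-compare-adjust loop of FIG. 1B can be sketched as follows. This is a minimal illustration, not the patented kernels: `deform_ct`, `render_drr`, `match_fn`, and `adjust_fn` are hypothetical callables supplied by the caller.

```python
def register(ct, fluoro_views, view_angles, params,
             deform_ct, render_drr, match_fn, adjust_fn,
             threshold=0.95, max_iters=100):
    """Iterate candidate deformations until the DRRs match the live views.

    Returns the deformed CT and the deformation parameters that produced it.
    """
    deformed = deform_ct(ct, params)
    for _ in range(max_iters):
        # Apply the candidate deformation to the 3D image data.
        deformed = deform_ct(ct, params)
        # Render one DRR per live fluoroscopic view, at that view's angle.
        drrs = [render_drr(deformed, angle) for angle in view_angles]
        # Compare all DRRs against all live views (e.g., mean NCC).
        score = match_fn(drrs, fluoro_views)
        if score >= threshold:          # match reached: output result
            break
        params = adjust_fn(params, score)  # otherwise adjust control points
    return deformed, params
```

In practice `match_fn` would be a similarity metric such as the NCC discussed below, and `adjust_fn` would be one step of an optimizer rather than a fixed rule.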

(33) Registering Multiple Fluoroscopic Images

(34) In embodiments, the step of matching or comparing the multiple fluoroscopic images to the 3D image data comprises determining the 3D deformation field U.

(35) In embodiments, the 3D deformation field U is solved as an optimization problem comprising the step of computing a cost function. The following is an example of a cost function:
J(CT,Fluo,U)=Metric(DRR(U(CT)),Fluo)  (1)

(36) in which CT represents a 3D CT image and Fluo represents one large fluoroscopic image formed by placing the fluoroscopic images one after another. U is the unknown deformation field, and U(CT) represents the deformed CT. DRR represents a large DRR formed by placing the DRRs at the different views one after another. The DRR may be produced by a ray-casting method, quite similar to CT imaging. It is also possible to add additional constraints in (1) on the deformation field U, which may be based on properties of anatomical tissues in the 3D image. Examples of such anatomical properties, without limitation, include rigidity of the spinal column, ribs, and lung tissues.

(37) For the Metric, to match DRR with Fluo, we can use the normalized cross-correlation (NCC), since it works well for same-modality registration:

(38) \frac{\sum_{i,j} (DRR(i,j) - \overline{DRR})(Fluo(i,j) - \overline{Fluo})}{\sqrt{\sum_{i,j} (DRR(i,j) - \overline{DRR})^2}\,\sqrt{\sum_{i,j} (Fluo(i,j) - \overline{Fluo})^2}}  (2)

(39) in which \overline{DRR} denotes the average intensity of the DRRs and \overline{Fluo} denotes the average intensity of the fluoroscopic images.
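Equation (2) can be computed directly. The following is a minimal, unoptimized sketch of the NCC over flattened pixel lists; the function name and call shape are illustrative, not from the patent.

```python
import math

def ncc(drr, fluoro):
    """Normalized cross-correlation of two equal-length pixel lists,
    per equation (2). Returns 1.0 for a perfect positive linear match,
    -1.0 for a perfect negative one."""
    n = len(drr)
    mean_d = sum(drr) / n
    mean_f = sum(fluoro) / n
    # Numerator: covariance-like sum of mean-centered intensities.
    num = sum((d - mean_d) * (f - mean_f) for d, f in zip(drr, fluoro))
    # Denominator: product of the two centered L2 norms.
    den = math.sqrt(sum((d - mean_d) ** 2 for d in drr)) * \
          math.sqrt(sum((f - mean_f) ** 2 for f in fluoro))
    return num / den
```

For example, `ncc([1, 2, 3], [2, 4, 6])` is 1.0, since the two signals differ only by a linear scaling.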

(40) In one preferred GPU implementation of the cost and the gradient, the standard NCC equation (2) is adapted accordingly to suit the GPU implementation.

(41) In embodiments, the deformation is modeled as a Free-Form Deformation (FFD), which is a piece-wise cubic polynomial. (Lee S, Wolberg G, Shin S Y. Scattered data interpolation with multilevel B-splines. IEEE Transactions on Visualization and Computer Graphics 1997; 3(3): 228-44.) An FFD is manipulated by a regular control grid with spacing s_x × s_y × s_z for a 3D image. FFD is computationally efficient because the deformation at any point is influenced only by its surrounding 4×4×4 control points.

(42) For a point p with coordinates (x, y, z), assume its 4×4×4 control points are p_{ijk}, i, j, k = 0 . . . 3. Denoting d_{ijk} as the displacement vector associated with control point p_{ijk}, the displacement at point p is defined as,

(43) U(x, y, z \mid d_{ijk}) = \sum_{i=0}^{3} \sum_{j=0}^{3} \sum_{k=0}^{3} B_i(u)\, B_j(v)\, B_k(w)\, d_{ijk}  (3)

(44) where

(45) u = \frac{x}{s_x} - \left\lfloor \frac{x}{s_x} \right\rfloor, \quad v = \frac{y}{s_y} - \left\lfloor \frac{y}{s_y} \right\rfloor, \quad w = \frac{z}{s_z} - \left\lfloor \frac{z}{s_z} \right\rfloor,
and B_i, i = 0, . . . , 3, is the i-th basis function of B-splines. (Lee S, Wolberg G, Shin S Y. Scattered data interpolation with multilevel B-splines. IEEE Transactions on Visualization and Computer Graphics 1997; 3(3): 228-44.)

(46) We can denote the vector U as the concatenation of all d_{ijk}.
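Equation (3) can be sketched as follows, assuming the standard uniform cubic B-spline basis functions (the patent cites Lee et al. for these but does not reproduce them). Scalar control-point displacements are used for brevity; real code would use 3-vectors.

```python
import math

def bspline_basis(i, t):
    """Uniform cubic B-spline basis B_i(t), i = 0..3, as used in
    equation (3); the four values sum to 1 for any t in [0, 1)."""
    if i == 0:
        return (1 - t) ** 3 / 6.0
    if i == 1:
        return (3 * t ** 3 - 6 * t ** 2 + 4) / 6.0
    if i == 2:
        return (-3 * t ** 3 + 3 * t ** 2 + 3 * t + 1) / 6.0
    return t ** 3 / 6.0

def ffd_displacement(x, y, z, spacing, d):
    """Displacement U(x, y, z) from the surrounding 4x4x4 control-point
    displacements d[i][j][k] (scalars here), per equations (3)-(45)."""
    sx, sy, sz = spacing
    # Fractional position of the point inside its control-grid cell.
    u = x / sx - math.floor(x / sx)
    v = y / sy - math.floor(y / sy)
    w = z / sz - math.floor(z / sz)
    return sum(bspline_basis(i, u) * bspline_basis(j, v) * bspline_basis(k, w)
               * d[i][j][k]
               for i in range(4) for j in range(4) for k in range(4))
```

Because the basis functions form a partition of unity, displacing all 64 surrounding control points by the same amount displaces the point by exactly that amount, which is a quick sanity check on an implementation.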

(47) Cost function (1) is optimized by an L-BFGS optimizer, which is believed to be more efficient than a simple gradient-descent optimizer for high-dimensional optimization problems. See, for example, Nocedal J and Wright S J. Numerical optimization. New York: Springer Verlag; 1999: 221-26.

(48) L-BFGS is a gradient-based optimizer; both the gradient and the cost are needed by the optimizer. For cost function (1), the gradient can be calculated since the image, the deformation, and the DRR projection can be represented analytically. Various methods may be used to calculate the gradient; an exemplary method is central differencing. Ames, W. F., (1977). Numerical Methods for Partial Differential Equations, Section 1.6. Academic Press, New York. ISBN 0-12-056760-1.

(49) GPU Implementation

(50) In embodiments, a GPU implementation of 2D-3D deformable registration uses OpenCL 1.1 on a GeForce GTX TITAN Black GPU (manufactured by Nvidia Corporation, Santa Clara, Calif.). OpenCL 1.1 is a parallel programming standard for GPUs and is described in, for example, John E. Stone, David Gohara, and Guochun Shi. 2010. OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems. Computing in Science & Engineering 12, 3 (May 2010), 66-73.

(51) FIG. 2 shows a framework 150 of the implementation, in which the optimizer 152 (e.g., L-BFGS optimizer) receives a cost value (shown as J(U)) and the gradient vector (shown as J′(U)) to determine a next step. As mentioned herein the calculation of the cost and the gradient is extremely computationally intensive. Use of the GPU serves to speed up the time for computation.

(52) FIG. 3 illustrates a GPU implemented framework 200 to compute the cost 202 and gradient 204. As shown the system includes a CPU 226 and GPU 228.

(53) The cost value 202 may be calculated using a series of steps (namely, the cost pipeline 212). The gradient is calculated using another series of steps (namely, the gradient pipeline 214).

(54) Cost Pipeline

(55) FIG. 3 shows an embodiment to carry out the cost pipeline 202 including the following steps: deforming the CT image (220), generating the DRR (222), calculating the partial sum (224), and calculating the final sum (225). As shown in this embodiment, many of the steps can be performed on a GPU 228, and the step of calculating the final sum can be performed on a CPU (226).

(56) In the deforming step 220, the CT image is deformed 220 according to the current deformation field U. This step 220 may be carried out by a program or kernel as described in, for example, Modat, Marc et al., Fast free-form deformation using graphics processing units, Computer Methods and Programs in Biomedicine, Volume 98, Issue 3, 278-284.

(57) In the generating step 222, the DRR is generated. A program or DRRGenerator kernel may be used to produce the DRR based on the deformed image as described in, for example, Tornai G J, Cserey G, Pappas I, Fast DRR generation for 2D to 3D registration on GPUs, MedPhys. 2012 August; 39(8):4795-9. doi: 10.1118/1.4736827.

(58) Next, the partial sum is computed by a program or kernel 224. The inputs of this program are the DRR and the Fluoro, and the output is the partial-sum vector. In embodiments, the kernel proceeds as follows: load the DRR and Fluoro into global memory; load the DRR and Fluoro into shared memory; in a loop, calculate the sum for one pair in a binary fashion; and output the result of each group to global memory.

(59) More specifically, the partial sum may be computed using an applicable normalized cross correlation. The applicable normalized cross correlation (NCC) is:

(60) \frac{\sum_{i,j} DRR(i,j)\,Fluo(i,j) - \overline{Fluo}\sum_{i,j} DRR(i,j) - \overline{DRR}\sum_{i,j} Fluo(i,j) + n\,\overline{DRR}\,\overline{Fluo}}{\sqrt{\left(\sum_{i,j} DRR(i,j)^2 - 2\overline{DRR}\sum_{i,j} DRR(i,j) + n\,\overline{DRR}^2\right)\left(\sum_{i,j} Fluo(i,j)^2 - 2\overline{Fluo}\sum_{i,j} Fluo(i,j) + n\,\overline{Fluo}^2\right)}}  (4)

(61) In equation (4), there are five different summation operations, which can be implemented using GPU reduction operations (Mark Harris, Optimizing Parallel Reduction in CUDA, NVIDIA Developer Technology, http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf). In embodiments an 8-component vector is employed. The first 5 components are used to store the summations.

(62) Preferably, the reduction on the 8-component vector is performed only once. In GPU reduction, the vector is divided into (array size)/(work-group size) parts, and each thread group is responsible for the summation of its corresponding part. For instance, if the vector size is 256×256 and the work-group size is 256, the output after reduction is a 256-element vector with each component storing a partial summation.
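The group-wise partial summation can be illustrated sequentially on the CPU. This sketch only mimics the structure of the GPU reduction (one partial sum per work-group slice, then a short final "CPU" sum); the function names and the dict of five sums are illustrative, not the patent's kernel interface.

```python
def partial_sums(values, group_size):
    """Reduce each work-group-sized slice of `values` to one partial sum,
    as a GPU reduction would; the result has len(values)/group_size entries."""
    return [sum(values[i:i + group_size])
            for i in range(0, len(values), group_size)]

def ncc_sums(drr, fluoro, group_size=256):
    """Compute the five summations of equation (4), each via a group-wise
    partial reduction followed by a final summation (the "CPU" step)."""
    terms = {
        "sum_df": [d * f for d, f in zip(drr, fluoro)],
        "sum_d":  list(drr),
        "sum_f":  list(fluoro),
        "sum_d2": [d * d for d in drr],
        "sum_f2": [f * f for f in fluoro],
    }
    # Finish each partial-sum vector with a final summation.
    return {name: sum(partial_sums(vec, group_size))
            for name, vec in terms.items()}
```

On the GPU the five element-wise products would be packed into one 8-component vector so the reduction runs once, as the paragraph above describes; here they are kept separate for readability.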

(63) Global synchronization can be performed using the kernel call. For example, after the kernel PartialINCC 224, the (array size)/(work-group size) partial results appear in GPU 228 global memory. These partial results may be fed to the CPU 226 to finish the final summation. The cost value 202 is sent to the L-BFGS optimizer 210. Some intermediate results of the NCC are sent to the gradient pipeline for incremental calculation of the cost in the gradient calculation, as described herein.

(64) Gradient Pipeline

(65) As mentioned above, one algorithm to calculate the gradient 204 is central differences. For simplicity, in the following, we only consider the derivative along the x direction:
J_x = (J(x + Δx) − J(x − Δx)) / (2Δx)

(66) In an embodiment, each GPU thread calculates the derivative for one control point. Each thread can invoke the cost pipeline twice with small disturbance along positive x and negative x directions, respectively.

(67) In that approach, each thread would hold a global-memory buffer to store the DRR, and consequently a large amount of memory is needed. Also, the computation for one thread to calculate the cost is heavy.

(68) In another embodiment, each thread neither stores the entire DRR nor calculates the cost from scratch.

(69) A program or kernel 240 may be employed to carry out the above-described steps and implemented on the GPU 228. For example, a kernel Gradient 240 may perform the following steps: loop over the local DRR region to incrementally calculate the NCC along the positive (+) x direction 241; loop over the local DRR region to incrementally calculate the NCC along the negative (−) x direction 241; use central differences to calculate the derivative; and output the gradient vector 204 to the CPU 226.

(70) FIG. 4a illustrates the affected region 302 of the DRR when one control point is changed. The white point 310 represents a control point. A small disturbance Δx of this control point affects its surrounding region, covered by 5×5×5 control points. A pixel in the DRR is affected if the ray connecting the DRR pixel and the source 314 intersects the cube. Thus, one thread does not need to store the entire DRR; it preferably stores only the affected region indicated by reference numeral 302. Along with an original DRR shared by all threads, the thread then has a complete DRR after the voxels inside the cube are changed. To store the pixels inside the affected region after DRR generation, we first need to know which pixels are inside the region.

(71) In embodiments, a lookup table may be built for each control point, in which each entry records a pixel ID of the DRR. This table does not change during the whole optimization procedure; it only needs to be calculated once. For example, the kernel LocalDRRPixelIDLTBBuilder can perform this task.

(72) In the kernel LocalDRRPixelIDLTBBuilder, the input is the FFD control points and the output is a lookup table recording the DRR pixel IDs in the affected region. The kernel proceeds as follows: find the 5×5×5 region in 3D CT space; determine whether each ray intersects the region; and use a gather operation to record all DRR pixel IDs for each 5×5×5 region.
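One way to decide whether a ray intersects an affected region is the standard slab test for axis-aligned boxes; the patent does not specify which intersection test is used, so this is an assumed sketch with illustrative names, and the sequential gather stands in for the per-ray GPU threads.

```python
def ray_intersects_box(origin, direction, box_min, box_max):
    """Slab test: does the ray origin + t*direction (t >= 0) hit the
    axis-aligned box [box_min, box_max]? Standard technique, assumed here."""
    t_near, t_far = 0.0, float("inf")
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if abs(d) < 1e-12:
            # Ray is parallel to this slab: it must start inside it.
            if o < lo or o > hi:
                return False
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True

def build_pixel_id_table(rays, regions):
    """For each affected region (a box), gather the IDs of the DRR pixels
    whose rays intersect it; computed once and reused for the whole
    optimization, like LocalDRRPixelIDLTBBuilder's lookup table."""
    return [[pid for pid, (origin, direction) in enumerate(rays)
             if ray_intersects_box(origin, direction, lo, hi)]
            for lo, hi in regions]
```
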

(73) FIG. 4b illustrates this kernel's parallel granularity at the ray level. The global thread size is equal to the number of rays, i.e., the DRR size. Each thread 316 corresponds to one ray 315 and loops over all cubes to check which cubes it intersects; for each cube, all rays intersecting it are recorded by a gather operation.

(74) Without being bound to theory, parallelizing on the ray level is desirable because the size of the DRR is much larger than the number of cubes. Parallelizing (namely, performing parallel computations) on the ray level can dramatically reduce the burden of the kernel, which is favored by multi-core processing frameworks such as, for example, a GPU. GPU-based frameworks are well suited for, amongst other things, simultaneously running numerous simple tasks rather than a few complex tasks.

(75) With reference again to FIG. 3, a kernel 232 (e.g., Kernel LocalDRRPixelIDLTBBuilder) records the affected DRR pixel Id for each control point.

(76) A kernel 234 (e.g., kernel LocalDRRsGenerator) calculates the DRR pixel values for each control point in parallel. The inputs of this kernel are the LocalDRRPixelIdLookupTable, the FFDLookupTable 236, the RaySumLookupTable, and the CT, and the output is the LocalDRRs 241, which record the updated DRR pixels for each 5×5×5 region. The steps for this kernel are as follows: get the pixel ID from the LocalDRRPixelIdLookupTable; for all samples on the ray inside the 5×5×5 region, tri-linearly interpolate the intensity based on the FFDLookupTable and the CT; update the sum of the samples on the ray inside the 5×5×5 region and then update the RaySumLookupTable; and generate the updated DRR (local DRR) for each 5×5×5 region.

(77) The kernel is launched with the same parallel granularity as the LocalDRRPixelIDLTBBuilder. Each thread redoes the ray casting, and all DRR pixels corresponding to the control point are collected by a gather operation. In this kernel, the tri-linear interpolation is performed, facilitated by a pre-computed lookup table, the FFDLookupTable, an intermediate result of the kernel CTDeformation.

(78) Since only part of the DRR is affected by the disturbance of one control point, an incremental method can be used to update the NCC instead of calculating it from scratch.

(79) To reach the incremental calculation, the original NCC equation is derived as:

(80) \frac{\sum_{i,j} (DRR - \overline{DRR} - \Delta\overline{DRR})(Fluo - \overline{Fluo})}{\sqrt{\sum_{i,j} (DRR - \overline{DRR} - \Delta\overline{DRR})^2}\,\sqrt{\sum_{i,j} (Fluo - \overline{Fluo})^2}} \rightarrow \frac{\sum_{i,j} (DRR - \overline{DRR})(Fluo - \overline{Fluo})}{\sqrt{\sum_{i,j} \left((DRR - \overline{DRR})^2 + \Delta\overline{DRR}^2\right)}\,\sqrt{\sum_{i,j} (Fluo - \overline{Fluo})^2}}  (5)

(81) in which \overline{DRR} is the original mean of the DRR before the control point is changed, and \Delta\overline{DRR} is the change of the mean after the control point is changed.

(82) In equation (5), all terms except \Delta\overline{DRR} are already available after the cost pipeline.

(83) So, in the kernel Gradient 240, we first loop over the small affected region, output by the kernel DRRGenerationForGradient, to get the sum of the changed DRR pixels; combining this with the original mean and the number of pixels, we can calculate \Delta\overline{DRR}. After \Delta\overline{DRR} is obtained, along with the other terms sent from the CPU, we can directly calculate the updated NCC after the change of the control point. Repeating the above procedure along positive x and negative x, respectively, gives the corresponding NCC values, from which central differences yield the derivative. With reference to FIGS. 4A-4C, the kernel Gradient is launched at the control-point 310 level: each thread 317 is responsible for the calculation of the derivative of one control point 310. The parallel granularity is illustrated in FIG. 4c.
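The same idea, visiting only the changed pixels, can be sketched with an exact variant: rather than the approximation of equation (5), incrementally update the five running sums of equation (4) for only the affected pixels and recompute the NCC from them. This is an assumed illustration, not the patent's kernel; all names are hypothetical.

```python
import math

def ncc_from_sums(s, n):
    """NCC recovered from the five running sums of equation (4)."""
    mean_d, mean_f = s["d"] / n, s["f"] / n
    num = s["df"] - n * mean_d * mean_f
    den = math.sqrt(s["d2"] - n * mean_d ** 2) * \
          math.sqrt(s["f2"] - n * mean_f ** 2)
    return num / den

def update_sums(s, fluoro, changed_ids, new_vals, old_vals):
    """Exact incremental update: only the changed DRR pixels are touched,
    so the cost of an NCC update scales with the affected region, not the
    full image. Fluoro pixels never change, so s["f"], s["f2"] are fixed."""
    s = dict(s)  # leave the cached sums intact
    for idx, new, old in zip(changed_ids, new_vals, old_vals):
        s["d"] += new - old
        s["d2"] += new * new - old * old
        s["df"] += (new - old) * fluoro[idx]
    return s
```

Repeating the update for the +x and −x disturbances and applying central differences to the two NCC values gives the derivative for one control point, mirroring the kernel Gradient's procedure.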

(84) The gradient-difference and pattern-intensity metrics can be more robust and accurate in 2D-3D registration than some alternative metrics. This is supported in the literature. See, for example, Penney, G., Weese, J., Little, J. A., Desmedt, P., Hill, D. L., Hawkes, D. J., 1998. A comparison of similarity measures for use in 2-D-3-D medical image registration. IEEE Transactions on Medical Imaging 17, 586-595. The Gradient kernel 240 is independent of the LocalDRRsGenerator kernel 234; therefore, only the Gradient kernel 240 needs to be modified if another metric is introduced.

(85) In another embodiment, a non-transitory computer readable medium includes a set of instructions for use with a computer to carry out any one or combination of the steps, functions, and methods described herein. In another embodiment, a GPU includes a set of instructions to carry out any one or combination of the steps, functions, and methods described herein.

(86) While preferred embodiments of this disclosure have been shown and described, modifications thereof can be made by one skilled in the art without departing from the scope or teaching herein. The embodiments described herein are exemplary only and are not intended to be limiting. Because many varying and different embodiments may be made within the scope of the present inventive concept, including equivalent structures, materials, or methods hereafter thought of, and because many modifications may be made in the embodiments herein detailed in accordance with the descriptive requirements of the law, it is to be understood that the details herein are to be interpreted as illustrative and not in a limiting sense.