Difference between revisions of "ITK Registration Optimization"
From NAMIC Wiki
Line 100: | Line 100: | ||
** Added multi-threading to GetValue function - partitions the samples (thereby partitions the computation of their transforms from fixed to moving image across threads) | ** Added multi-threading to GetValue function - partitions the samples (thereby partitions the computation of their transforms from fixed to moving image across threads) | ||
*** Impact should be distributing computation across threads. Original runtime is O(C1+C2*M) given M samples from the fixed image and new runtime is O(C1+(C2*M)/(C3*N)) given N cores. The ratio of C2/C3 varies by the type of transform used. When a BSplineTransform is used, the ratio C2/C3 approaches 1.0. | *** Impact should be distributing computation across threads. Original runtime is O(C1+C2*M) given M samples from the fixed image and new runtime is O(C1+(C2*M)/(C3*N)) given N cores. The ratio of C2/C3 varies by the type of transform used. When a BSplineTransform is used, the ratio C2/C3 approaches 1.0. | ||
− | |||
− | |||
− | |||
** Added the pre-computation of the FixedImageMarginalPDF for the sample | ** Added the pre-computation of the FixedImageMarginalPDF for the sample | ||
*** Required the concept of an AdjustedFixedImageMarginalPDF that is updated with a fixed image voxel does not map into the moving image and thereby isn't valid for the current computations. By only updating when samples are missed, mutex lock to update a cross-thread data structure is needed less often. | *** Required the concept of an AdjustedFixedImageMarginalPDF that is updated with a fixed image voxel does not map into the moving image and thereby isn't valid for the current computations. By only updating when samples are missed, mutex lock to update a cross-thread data structure is needed less often. | ||
+ | ** Each thread now has its own copy of the joinPDF. After threads complete, jointPDFs from each thread are summed. This eliminates mutex from the main loop over samples. | ||
+ | ** SUMMARY: Speedup on a dual-core system is about 40% when using linear transform and linear interpolation and about 90% when using bspline transform and bspline interpolation. | ||
+ | |||
+ | * GetDerivative | ||
+ | ** Following the same convention as used with GetValue | ||
==== Profile of standard ITK: time in functions ==== | ==== Profile of standard ITK: time in functions ==== |
Revision as of 04:17, 3 April 2007
Home < ITK Registration OptimizationContents
Goals
There are two components to this research
- Identify registration algorithms that are suitable for non-rigid registration problems that are indemic to NA-MIC
- Develop implementations of those algorithms that take advantage of multi-core and multi-processor hardware.
Algorithmic Requirements and Use Cases
- Requirements
- relatively robust, with few parameters to tweak
- runs on grey scale images
- has already been published
- relatively fast (ideally speaking a few minutes for volume to volume).
- not patented
- can be implemented in ITK and parallelized.
- Use-cases
- Intersubject mapping
- Example data set (Kilian)
- fMRI to hi-res brain morphology mapping
- Example data set (Steve Pieper)
- DTI: components of the diffusion tensor
- Example data (Sylvain)
- Intersubject mapping
Hardware Platform Requirements and Use Cases
- Requirements
- Shared memory
- Single and multi-core machines
- Single and multi-processor machines
- AMD and Intel - Windows, Linux, and SunOS
- Use-cases
- Intel Core2Duo
- Intel quad-core Xeon processors (?)
- 6 CPU Sun, Solaris 8 (SPL: vision)
- 12 CPU Sun, Solaris 8 (SPL: forest and ocean)
- 16 core Opteron (SPL: john, ringo, paul, george)
- 16 core, Sun Fire, AMDOpteron (UNC: Styner)
Data
Workplan
Establish testing and reporting infrastructure
- Identify timing tools
- Cross platform and multi-threaded
- Timing and profiling
- Status
- Instrumenting modular tests
- Extending itk's cross-platform high precision timer
- Adding thread affinity to ensure valid timings
- Adding method for increasing process priority
- Profiling complete registration solutions for use cases
- Using CacheGrind on single and multi-core linux systems
- Instrumenting modular tests
- Develop performance dashboard for collecting results
- Each test will report time and accuracy to a central server
- The performance of a test, over time, for a given platform can be viewed on one page
- The performance of a set of tests, at one point in time, for all platforms can be viewed on one page
- Status
- BatchMake database communication code being isolated
- Performance dashboard web pages being designed
Develop tests
- Develop modular tests
- Status
- Developed itkCheckerboardImageSource so no IO required
- Developed all tests listed in the "Modular Tests" section below
- Developing second generation tests that compare optimized performance with standard performance in a single executable
- Status
- Develop C-style tests
- Tests should represent the non-ITK way of doing image analysis
- Use standard C/C++ arrays and pointers to access blocks of memory as images
- Tests should represent the non-ITK way of doing image analysis
- Develop complete registration solutions for use cases
- Status
- Centralized data and provide easy access
- Identified relevant registration algorithms
- rigid, affine, bspline, multi-level bspline, and Demons'
- normalized mutual information, mean squared difference, and cross correlation
- Developing traditional ITK-style implementations
- Status
Compute performance on target platforms
- Completing the development of the reporting dashboard, the standard C tests, and the full registration applications (see above).
- Developing optimized versions of select methods in parallel with deploying the testing suite (see below).
ITK Optimization
- Target bottlenecks
- Multi-thread metric calculation
- Initial target is MattesMutualInformationImageToImageMetric
- Optimize code
- Sacrifice some memory and algorithm initialization speed to gain algorithm operation speed increases
- Call multi-threaded functions when possible
- Multi-thread metric calculation
- Integrate metrics with transforms and interpolators for tailored performance
MattesMutualInformationImageToImageMetric
Optimizations Employed
- GetValue
- Added multi-threading to GetValue function - partitions the samples (thereby partitions the computation of their transforms from fixed to moving image across threads)
- Impact should be distributing computation across threads. Original runtime is O(C1+C2*M) given M samples from the fixed image and new runtime is O(C1+(C2*M)/(C3*N)) given N cores. The ratio of C2/C3 varies by the type of transform used. When a BSplineTransform is used, the ratio C2/C3 approaches 1.0.
- Added the pre-computation of the FixedImageMarginalPDF for the sample
- Required the concept of an AdjustedFixedImageMarginalPDF that is updated with a fixed image voxel does not map into the moving image and thereby isn't valid for the current computations. By only updating when samples are missed, mutex lock to update a cross-thread data structure is needed less often.
- Each thread now has its own copy of the joinPDF. After threads complete, jointPDFs from each thread are summed. This eliminates mutex from the main loop over samples.
- SUMMARY: Speedup on a dual-core system is about 40% when using linear transform and linear interpolation and about 90% when using bspline transform and bspline interpolation.
- Added multi-threading to GetValue function - partitions the samples (thereby partitions the computation of their transforms from fixed to moving image across threads)
- GetDerivative
- Following the same convention as used with GetValue
Profile of standard ITK: time in functions
Time in self | Time in subfuncs | Function |
---|---|---|
0.00 | 86.64 | __tmainCRTStartup" |
0.00 | 48.47 | main" |
0.00 | 37.98 | itk::MattesMutualInformationImageToImageMetric<itk::Image<float,3>,itk::Image<float,3> >::GetDerivative" |
8.49 | 20.99 | itk::MattesMutualInformationImageToImageMetric<itk::Image<float,3>,itk::Image<float,3> >::GetValueAndDerivative" |
0.57 | 19.27 | itk::CentralDifferenceImageFunction<itk::Image<float,3>,double>::Evaluate" |
13.40 | 13.55 | itk::CentralDifferenceImageFunction<itk::Image<float,3>,double>::EvaluateAtIndex" |
11.70 | 11.83 | itk::BSplineKernelFunction<3>::Evaluate [1]" |
9.06 | 9.16 | itk::BSplineKernelFunction<2>::Evaluate [1]" |
3.21 | 8.40 | itk::MattesMutualInformationImageToImageMetric<itk::Image<float,3>,itk::Image<float,3> >::GetValue" |
8.11 | 8.21 | floor ?" |
0.00 | 7.25 | itk::CheckerBoardImageSource<itk::Image<float,3> >::GenerateData" |
0.00 | 4.20 | itk::ImageSource<itk::Image<float,3> >::ThreaderCallback" |
3.77 | 3.82 | itk::NearestNeighborInterpolateImageFunction<itk::Image<float,3>,double>::EvaluateAtContinuousIndex" |
3.21 | 3.24 | itk::StatisticsImageFilter<itk::Image<float,3> >::ThreadedGenerateData" |
3.02 | 3.05 | itk::ImageFunction<itk::Image<float,3>,double,double>::IsInsideBuffer" |
2.83 | 2.86 | itk::BSplineDerivativeKernelFunction<3>::Evaluate" |
2.26 | 2.29 | itk::InterpolateImageFunction<itk::Image<float,3>,double>::Evaluate" |
0.00 | 2.10 | endthreadex ?" |
2.08 | 2.10 | itk::MattesMutualInformationImageToImageMetric<itk::Image<float,3>,itk::Image<float,3> >::TransformPoint" |
2.08 | 2.10 | thunk@403355 ?" |
1.89 | 1.91 | _ftol2_pentium4" |
1.70 | 1.72 | itk::MattesMutualInformationImageToImageMetric<itk::Image<float,3>,itk::Image<float,3> >::ComputePDFDerivatives" |
1.70 | 1.72 | thunk@402f54 ?" |
1.51 | 1.53 | itk::BSplineKernelFunction<2>::Evaluate" |
1.51 | 1.53 | itk::ImageBase<3>::GetSpacing" |
1.13 | 1.53 | itk::ImageFunction<itk::Image<float,3>,double,double>::ConvertContinuousIndexToNearestIndex" |
0.94 | 1.34 | itk::CheckerBoardSpatialFunction<double,3,itk::Point<double,3> >::Evaluate" |
1.13 | 1.15 | itk::ImageFunction<itk::Image<float,3>,double,double>::IsInsideBuffer [1]" |
0.94 | 0.95 | itk::BSplineKernelFunction<3>::Evaluate" |
0.75 | 0.76 | itk::ImageBase<3>::GetBufferedRegion" |
0.75 | 0.76 | itk::Point<double,3>::operator+" |
0.75 | 0.76 | thunk@4036d4 ?" |
0.75 | 0.76 | thunk@403cec ?" |
0.19 | 0.57 | itk::ImageFunction<itk::Image<float,3>,itk::CovariantVector<double,3>,double>::ConvertContinuousIndexToNearestIndex" |
0.57 | 0.57 | itk::MattesMutualInformationImageToImageMetric<itk::Image<float,3>,itk::Image<float,3> >::ComputeImageDerivatives" |
0.57 | 0.57 | itk::Point<double,3>::operator=" |
0.57 | 0.57 | itk::ShiftScaleImageFilter<itk::Image<float,3>,itk::Image<float,3> >::ThreadedGenerateData" |
0.57 | 0.57 | itk::TranslationTransform<double,3>::TransformPoint" |
Profile of standard ITK: time in files
Time in self | Time in subfuncs | Files |
---|---|---|
16.42 | 72.33 | itkmattesmutualinformationimagetoimagemetric.txx" |
0.00 | 48.47 | mattesmutualinformationimagetoimagemetrictest.cxx" |
23.21 | 23.47 | itkbsplinekernelfunction.h" |
0.57 | 19.27 | itkcentraldifferenceimagefunction.h" |
13.40 | 13.55 | itkcentraldifferenceimagefunction.txx" |
0.00 | 7.25 | itkcheckerboardimagesource.txx" |
5.66 | 6.49 | itkimagefunction.h" |
0.00 | 4.20 | itkimagesource.txx" |
3.77 | 3.82 | itknearestneighborinterpolateimagefunction.h" |
3.21 | 3.24 | itkstatisticsimagefilter.txx" |
2.83 | 2.86 | itkbsplinederivativekernelfunction.h" |
2.83 | 2.86 | itkimagebase.h" |
2.26 | 2.29 | itkinterpolateimagefunction.h" |
0.94 | 1.34 | itkcheckerboardspatialfunction.txx" |
1.32 | 1.34 | itkpoint.txx" |
0.75 | 0.76 | itktranslationtransform.txx" |
0.57 | 0.57 | itkshiftscaleimagefilter.txx" |
0.38 | 0.38 | vnl_matrix.txx" |
0.00 | 0.19 | itkbsplinedeformabletransform.txx" |
0.19 | 0.19 | itkfixedarray.txx" |
0.19 | 0.19 | itkimageregionconstiterator.txx" |
0.19 | 0.19 | itkobject.cxx" |
0.19 | 0.19 | vnl_vector.txx" |
0.19 | 0.19 | vector" |
0.19 | 0.19 | secchk.c" |
Modular tests
All tests send two values to performance dashboards
- the time required
- an measure of the error (0 = no error; 1 = 100% error)
Tests being developed and their parameter spaces
- NearestNeighborInterpTest <numThreads> <dimSize> <factor> [<outputImage>]
- LinearInterpTest <numThreads> <dimSize> <factor> [<outputImage>]
- BSplineInterpTest <numThreads> <dimSize> <factor> <bSplineOrder> [<outputImage>]
- SincInterpTest <numThreads> <dimSize> <factor> [<outputImage>]
- Uses the Welch window function
- BSplineTransformLinearInterpTest <numThreads> <dimSize> <numNodesPerDim> <bSplineOrder> [<outputImage>]
- 3 nodes are also added outside of the image for interpolation
- MeanSquaresImageToImageMetricTest <numThreads> <dimSize> <iterations>
- CorrelationCoefficientHistogramMetricTest <numThreads> <dimSize> <iterations>
- NormalizedCorreltationImageToImageMetricTest <numThreads> <dimSize> <iterations>
- MattesMutualInformationImageToImageMetricTest <numThreads> <dimSize> <numSamples> <iterations>
- MattesMutualInformationMetric defaults to BSpline interpolator - this test overrides to use nearest neighbor interpolation
- MutualInformationImageToImageMetricTest <numThreads> <dimSize> <numSamples> <iterations>
- NormalizedMutualInformationHistrogramMetricTest <numThreads> <dimSize> <iterations>
SECOND GENERATION TEST
- Computes runtime for GetValue, GetDerivative, and GetValueAndDerivative for standard ITK implementation and for the optimized version being developed. Also computes difference (if any) between their answers and reports as an error measure
- OptMattesMutualInformationImageToImageMetricTest
Related Pages
Performance Measurement
- LTProf - simple profilter for Windows - Shareware
- Intel's VTune for Linux ($)
- TAU
- Threadmon: Thread usage/blockage
- TotalView ($)
- PerfSuite (POSIX Threads)
- GProf work-around for multi-threaded apps
- References on multi-threaded profiling and code optimization