Difference between revisions of "User:Barre/ITK Registration Optimization"
Revision as of 15:10, 3 April 2007
ITK Registration Optimization (BW NAC) Project
My (Sebastien Barre) notes so far. Once the dust settles, the relevant sections will be moved to the project pages listed below.
Project
The ultimate specific goal is B-Spline registration optimization for Linux and Windows on multi-core and multi-processor shared-memory machines. [...] Also, set up tools and a reporting mechanism so that ITK speed can be monitored and reported by us and others. BWH is the driving force behind this work.
- Non Rigid Registration at NA-MIC
- ITK Registration Optimization at NA-MIC
- Slicer3:Performance Analysis at NA-MIC
- User:Barre/ITK Registration Optimization at NA-MIC (this page)
Contacts
Quick Links
- Google Performance Tools
- kcachegrind, The KCachegrind Handbook, Running Kcachegrind on Mac OSX 10.4
- Also check my BW-NAC Google Notebook for fresher links.
Source Code
cvs -d :pserver:<login>@public.kitware.com:/cvsroot/BWHITKOptimization login
Enter your VTK <login>, and password, then:
cvs -d :pserver:<login>@public.kitware.com:/cvsroot/BWHITKOptimization co BWHITKOptimization
to check out the code.
You can also browse the repository online using ViewCVS (http://public.kitware.com/cgi-bin/viewcvs.cgi/?root=BWHITKOptimization).
Testing Data
- NAMIC: Deformable registration speed optimization (DSpace @ Insight-Journal)
Potential Issues with Timing
- Repository was updated so that it can compile on Unix.
__rdtsc()
CallMonWin includes <intrin.h> to call __rdtsc(), a header that does not exist in Microsoft compilers prior to Visual Studio 8/2005. It seems, however, that the underlying RDTSC instruction can be issued directly from inline assembly:
- GameDev.net forum, see third comment by Evil Steve (!)
- CPUID for x64 Platforms and Microsoft Visual Studio* .NET 2005
- High Resolution Elapsed Timer
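As a rough illustration of the header issue above, the sketch below hides the compiler-specific include behind a single function. It is not the CallMonWin code: `read_ticks`, `ticks_elapsed` and the `TICKS_HAVE_RDTSC` macro are names made up for this example. It assumes a compiler that provides `__rdtsc()` through `<intrin.h>` (Visual Studio 2005 and later) or `<x86intrin.h>` (GCC, Clang), and falls back to `std::chrono::steady_clock` elsewhere. It does not cover the inline-assembly route needed for older Microsoft compilers.

```cpp
#include <chrono>
#include <cstdint>

// Pick the header that declares __rdtsc() for the current compiler.
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#  include <intrin.h>      // Visual Studio 2005 and later
#  define TICKS_HAVE_RDTSC 1
#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__i386__) || defined(__x86_64__))
#  include <x86intrin.h>   // GCC and Clang on x86
#  define TICKS_HAVE_RDTSC 1
#else
#  define TICKS_HAVE_RDTSC 0
#endif

// Hypothetical helper: returns a raw tick count. The unit is CPU
// timestamp-counter ticks on x86, and steady_clock ticks elsewhere.
inline std::uint64_t read_ticks()
{
#if TICKS_HAVE_RDTSC
  return static_cast<std::uint64_t>(__rdtsc());
#else
  return static_cast<std::uint64_t>(
    std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// Hypothetical helper: number of ticks spent running 'work'.
// The result is only meaningful if the thread stays on one core.
template <typename F>
std::uint64_t ticks_elapsed(F&& work)
{
  const std::uint64_t start = read_ticks();
  work();
  const std::uint64_t stop = read_ticks();
  return stop - start;
}
```

A caller would wrap the code under test in a lambda, for example `ticks_elapsed([&]{ filter->Update(); })`, where `filter` stands for whatever object is being timed.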
A few articles advise against the use of __rdtsc(), especially in a multicore/multithreaded context:
The suggested alternative is to use Performance Counters. Hardware counters are actually not an OS feature per se, but a CPU feature that has been around for some time. They provide high-resolution timers that can be used to monitor a wide range of resources:
The remaining issue is how to access those counters in a cross-platform way:
- PAPI: "The Performance API (PAPI) project specifies a standard API for accessing hardware performance counters".
Stephen/Christian reported that Dual Core CPUs were not supported, but it seems from the release notes for PAPI 3.5 (2006-11-09) that both Intel Core2Duo and Pentium D (i.e. dual core) are indeed supported.
Process Priority
Whatever our choices, several articles also suggest raising the application's priority to real-time before testing, so that the wall-clock results are as realistic as possible. It is, however, very important to set it back to normal afterwards.
- Windows:
See last paragraph. Use GetPriorityClass, SetPriorityClass, GetThreadPriority, SetThreadPriority:
// Save the current process and thread priorities.
DWORD dwPriorityClass = GetPriorityClass(GetCurrentProcess());
int nPriority = GetThreadPriority(GetCurrentThread());
// Raise both to the highest level for the timed section.
SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
[...]
// Restore the saved priorities.
SetThreadPriority(GetCurrentThread(), nPriority);
SetPriorityClass(GetCurrentProcess(), dwPriorityClass);
- Unix:
The getpriority(), setpriority(), and nice() functions can be used to change the priority of processes. The getpriority() call returns the current nice value for a process, process group, or a user. The returned nice value is in the range of [-NZERO, NZERO-1]. NZERO is defined in /usr/include/limits.h. The default process priority always has the value 0 for UNIX. The setpriority() call sets the current nice value for a process, process group, or a user to the value of value + NZERO.
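The Unix counterpart of the save/raise/restore pattern could look like the sketch below. `current_nice` and `ScopedNice` are hypothetical names introduced for this example. It assumes a system that provides `getpriority()` and `setpriority()` in `<sys/resource.h>`. Note that lowering the nice value (that is, raising the priority) usually requires elevated privileges, so the guard records whether the change was actually applied.

```cpp
#include <cerrno>
#include <sys/resource.h>   // getpriority(), setpriority(), PRIO_PROCESS

// Reads the nice value of the calling process into 'value'.
// getpriority() can legitimately return -1, so errno is cleared
// before the call and checked afterwards to detect a real error.
inline bool current_nice(int& value)
{
  errno = 0;
  const int result = getpriority(PRIO_PROCESS, 0);
  if (result == -1 && errno != 0)
    return false;
  value = result;
  return true;
}

// Hypothetical RAII guard: sets the nice value of the calling process
// to 'target' and restores the previous value when it goes out of scope.
class ScopedNice
{
public:
  explicit ScopedNice(int target) : saved_(0), applied_(false)
  {
    if (current_nice(saved_))
      applied_ = (setpriority(PRIO_PROCESS, 0, target) == 0);
  }

  ~ScopedNice()
  {
    // Only restore if the constructor actually changed something.
    if (applied_)
      setpriority(PRIO_PROCESS, 0, saved_);
  }

  ScopedNice(const ScopedNice&) = delete;
  ScopedNice& operator=(const ScopedNice&) = delete;

  // True if the requested nice value was accepted by the system.
  bool applied() const { return applied_; }

private:
  int saved_;
  bool applied_;
};
```

Tying the restore step to a destructor means the priority is reset even if the timed code returns early or throws, which addresses the "set it back to normal" concern.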
Thread Affinity
We should consider setting the thread affinity to make sure that the starting time is recorded on the same processor core as the ending time. Whether that would constrain the rest of the program to run on a single core is a very good question.
- Windows:
Using SetThreadAffinityMask. Also check Sleep(0), reported in a few discussions, including this long one.
- Unix:
Using sched_setaffinity, but seems to be Linux-only (not POSIX).
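On Linux, pinning the calling thread could be sketched as follows. This assumes glibc's `<sched.h>` with the `CPU_*` macros (g++ enables the required GNU extensions by default). `pin_to_first_allowed_cpu` and `restore_affinity` are hypothetical helper names, not part of the project.

```cpp
#include <sched.h>   // sched_getaffinity(), sched_setaffinity(), CPU_* macros (Linux)

// Hypothetical helper: pins the calling thread to the first CPU it is
// currently allowed to run on. The previous mask is stored in 'saved'
// so that it can be restored later. Returns the CPU index, or -1 on failure.
inline int pin_to_first_allowed_cpu(cpu_set_t& saved)
{
  CPU_ZERO(&saved);
  if (sched_getaffinity(0, sizeof(saved), &saved) != 0)
    return -1;

  for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
  {
    if (CPU_ISSET(cpu, &saved))
    {
      cpu_set_t one;
      CPU_ZERO(&one);
      CPU_SET(cpu, &one);
      // A pid of 0 means "the calling thread".
      return sched_setaffinity(0, sizeof(one), &one) == 0 ? cpu : -1;
    }
  }
  return -1;
}

// Hypothetical helper: restores a mask saved by pin_to_first_allowed_cpu().
inline bool restore_affinity(const cpu_set_t& saved)
{
  return sched_setaffinity(0, sizeof(saved), &saved) == 0;
}
```

Because the call only affects the calling thread, worker threads created before the call keep their own masks. Threads created while the pin is active inherit the restricted mask, which is one way the rest of the program could end up confined to a single core.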
Test Platforms
The primary target platforms are the 8-, 16-, and 32-processor machines at BWH. However, preliminary tests have been performed on KHQ computers.
KHQ
A full software stack was compiled on several machines at Kitware. Each component was built in two flavors, shared/debug and static/release:
- Tcl/Tk 8.4
- VTK (cvs)
- ITK (cvs)
- ITK Applications (cvs)
- FLTK (1.1 svn)
- BWHItkOptimization (cvs)
All platforms are so far described in the BWHItkOptimization/Results directory:
Host | #CPU | CPU | Freq | RAM | Arch | OS | Login |
---|---|---|---|---|---|---|---|
amber2 | 2 | Pentium Xeon | 2.8 GHz | 4 GB | 64 bits | Linux 2.6 (Red Hat Enterprise 4) | kitware (ssh, vnc; cd ~/barre) |
fury | 1 | Pentium 4 (hyperthread) | 2.8 GHz | 1 GB | 32 bits | Linux 2.6 (Fedora Core 4) | barre, jjomier, aylward |
panzer | 1 | Intel Core Duo (dual core) | 1.66 GHz | 1 GB | 32 bits | Mac OS X 10.4.8 | barre, jjomier, aylward |
sanakhan | 1 | Pentium M | 1.8 GHz | 1 GB | 32 bits | Windows XP SP2 | barre |
tetsuo | 1 | Pentium D (dual core) | 3.2 GHz | 2 GB | 32 bits | Windows XP SP2 | barre |
Tests
- LinearInterp: (to describe)
BWH
Systems to be described. We will set up our performance framework so that it can be run in an automated fashion, using CTest and a dashboard. Running from the spl machine should at that point only require one of us to check out the code, build it, and run ctest on it every night.
Status
- kcachegrind and timing are being performed on amber2. Stay tuned.
- valgrind is not supported on x86_64 architecture :( Now using fury instead of amber2.
- RegTests/RunLinearInterpTest.sh.in is configured automatically to run and time LinearInterp with various combinations of threads, size and factor parameters.
- It was run on fury (release static): Results/fury.kitware.timings-rel.txt
- It was run on fury (debug): Results/fury.kitware.timings-dbg.txt
- It was run on amber2 (release static): Results/amber2.kitware.timings-rel.txt
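The kind of sweep described above (timing one run per combination of threads, size and factor) can be pictured with the following sketch. It is not the contents of RunLinearInterpTest.sh.in, which is a shell script. `Timing` and `time_combinations` are hypothetical names, the callable `run` stands in for launching LinearInterp, and the parameter values are supplied by the caller.

```cpp
#include <chrono>
#include <vector>

// One timed run: the parameters used and the wall-clock time in seconds.
struct Timing
{
  int threads;
  int size;
  int factor;
  double seconds;
};

// Hypothetical helper: calls run(threads, size, factor) once for every
// combination of the given values and records the elapsed wall-clock time.
template <typename F>
std::vector<Timing> time_combinations(const std::vector<int>& threads,
                                      const std::vector<int>& sizes,
                                      const std::vector<int>& factors,
                                      F&& run)
{
  std::vector<Timing> results;
  for (int t : threads)
    for (int s : sizes)
      for (int f : factors)
      {
        const auto start = std::chrono::steady_clock::now();
        run(t, s, f);
        const std::chrono::duration<double> elapsed =
          std::chrono::steady_clock::now() - start;
        results.push_back(Timing{t, s, f, elapsed.count()});
      }
  return results;
}
```

A single run per combination is sensitive to system noise. Repeating each combination and keeping the minimum or median would be a natural extension.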