Difference between revisions of "Slicer3:Large scale experiment control brainstorming"
Revision as of 17:33, 2 April 2007
Goal
To provide Slicer3 with a mechanism for submitting, monitoring, and summarizing large scale experiments that utilize Slicer3 modules, particularly the Command Line Modules. This page summarizes our thoughts, requirements, and experiments to date, mostly accomplished during the March 2007 Slicer3 MiniRetreat.
There are two introductory use cases that we wish to support:
- Slicer3 is used interactively to select a set of parameters for an algorithm or workflow on a single dataset. Then, these parameters are applied to N datasets non-interactively.
- Slicer3 is used interactively to select a subset of parameters for an algorithm or workflow on a single dataset. Then, the remaining parameter space is searched non-interactively. Various parameter space evaluation techniques could be employed, the simplest of which is to sample the space of (param1, param2, param3).
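The simplest evaluation technique mentioned in the second use case, sampling the full space of (param1, param2, param3), can be sketched as follows. This is only an illustration: the parameter values and the function name sample_grid are assumptions and do not come from Slicer3.

```python
from itertools import product


def sample_grid(param_values):
    """Yield one dict per point of the full Cartesian grid of parameter values."""
    names = list(param_values)
    for combo in product(*(param_values[name] for name in names)):
        yield dict(zip(names, combo))


# Hypothetical parameter names and values, for illustration only.
grid = {
    "param1": [1, 2, 3],
    "param2": [0.1, 0.5],
    "param3": ["a", "b"],
}

# Each point would correspond to one non-interactive job.
points = list(sample_grid(grid))
print(len(points), "parameter combinations")
```

Every point produced this way corresponds to one job submission, so the number of jobs grows as the product of the number of values per parameter.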
Note that with the above two use cases, we are only trying to address large scale experiment control from the standpoint of what it means to Slicer3. We are not trying to solve the general case of large scale experiment control.
Assumptions and restrictions
- Computing configuration.
- We shall support a variety of computing infrastructures which include
- single computer systems,
- clusters,
- grids (optional)
- Access to compute nodes.
- We shall have no direct access to the compute nodes. All job submissions shall go to some sort of submit node. An exception may be made when operating on a single computer system configuration.
- Staged data
- The compute nodes shall mount a filesystem outside of the node on which data is staged. We are not providing Slicer3 with the mechanisms to stage data. We assume that all data is staged outside of Slicer3.
- Staged programs
- The compute nodes shall have access to the Slicer3 processing modules. Like the case for data, the processing modules are staged outside of the Slicer3 environment.
- Experiment scheduling
- A given experiment shall result in one or more processing jobs being submitted to the computing resources.
- Job submission
- Submitting a job to the computing infrastructure shall result in a job submission token such that the job can be
- monitored for status: scheduled, running, completed
- terminated
- Experiment control
- We shall be able to monitor an experiment to see its status.
- We shall be able to interrupt an experiment. This may involve removing jobs from the queue and terminating jobs in process.
- We shall be able to resume an experiment without re-running the entire experiment. Previously terminated jobs will be resubmitted. Previously completed jobs will not be rerun.
- We shall be able to rerun an experiment, overwriting previous results.
- Job execution robustness
- Jobs terminating unsuccessfully shall be automatically resubmitted to the computing environment at the experiment designer's request. Jobs may be resubmitted zero times, K times, or until successful.
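The resubmission policy in the last item (zero times, K times, or until successful) could be expressed roughly as below. This is a sketch under assumptions: the names run_with_resubmission and make_flaky_job are hypothetical, and submit stands in for whatever call actually submits a job and reports success.

```python
def run_with_resubmission(submit, max_resubmits):
    """Run submit() and resubmit on failure.

    max_resubmits = 0    -> never resubmit
    max_resubmits = K    -> resubmit at most K times
    max_resubmits = None -> resubmit until successful
    Returns (succeeded, number_of_attempts).
    """
    attempts = 0
    while True:
        attempts += 1
        if submit():
            return True, attempts
        # The first attempt is not a resubmission, hence the strict comparison.
        if max_resubmits is not None and attempts > max_resubmits:
            return False, attempts


def make_flaky_job(failures_before_success):
    """Hypothetical stand-in for a job that fails a fixed number of times."""
    state = {"remaining": failures_before_success}

    def job():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            return False
        return True

    return job


never = run_with_resubmission(make_flaky_job(2), 0)
k_times = run_with_resubmission(make_flaky_job(2), 2)
until_ok = run_with_resubmission(make_flaky_job(5), None)
print(never, k_times, until_ok)
```

The "until successful" setting can loop forever if a job can never succeed, so a real system would probably pair it with the ability to interrupt the experiment.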
Components
To address the application scenario, we considered the following:
Completion of a Single Execution Step
Given that any job submission may fail (for example, if the cluster or a node goes down) we need to be able to distinguish the following cases when starting a job:
1. job has not been started
2. job has been started (i.e. we are reattaching to a running job)
2a. job still running
2b. job not running
3. job has completed successfully
We want to provide the following capabilities:
- the job can be started (i.e. we have a back-end for different job control systems)
- we can check the status of jobs (we save the job token)
- we can restart a job that has died
- we don't start a job if it is already running
- we don't re-run a job that already has completed successfully
- we can clean up the state so the whole job can be re-run
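One way to distinguish the cases above is to persist the job token and a completion marker on the shared filesystem. The sketch below assumes this layout; the file names token.json and done, and the callables submit and is_running (which would wrap a real job control back-end), are all hypothetical.

```python
import json
import tempfile
from enum import Enum
from pathlib import Path


class JobState(Enum):
    NOT_STARTED = "1: not started"
    RUNNING = "2a: started, still running"
    DIED = "2b: started, not running"
    COMPLETED = "3: completed successfully"


def classify(state_dir, is_running):
    """Decide which case applies from the files saved in state_dir."""
    state_dir = Path(state_dir)
    if (state_dir / "done").exists():
        return JobState.COMPLETED
    token_file = state_dir / "token.json"
    if not token_file.exists():
        return JobState.NOT_STARTED
    token = json.loads(token_file.read_text())["token"]
    return JobState.RUNNING if is_running(token) else JobState.DIED


def start_if_needed(state_dir, submit, is_running):
    """Submit only if the job was never started or has died; return the prior state."""
    state_dir = Path(state_dir)
    state = classify(state_dir, is_running)
    if state in (JobState.NOT_STARTED, JobState.DIED):
        state_dir.mkdir(parents=True, exist_ok=True)
        token = submit()
        (state_dir / "token.json").write_text(json.dumps({"token": token}))
    return state


def clean(state_dir):
    """Remove saved state so the whole job can be re-run."""
    for name in ("token.json", "done"):
        (Path(state_dir) / name).unlink(missing_ok=True)


# Demonstration with an in-memory stand-in for the job control system.
with tempfile.TemporaryDirectory() as tmp:
    running = set()

    def submit():
        running.add("job-1")
        return "job-1"

    def is_running(token):
        return token in running

    first = start_if_needed(tmp, submit, is_running)   # submits the job
    second = start_if_needed(tmp, submit, is_running)  # already running, no resubmit
    running.clear()                                     # simulate a node going down
    third = classify(tmp, is_running)
    clean(tmp)
    fourth = classify(tmp, is_running)
    print(first, second, third, fourth)
```

In this sketch the job itself would be responsible for writing the done marker when it finishes successfully, which is what lets a later invocation skip it.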
Overall Experiment Control
We would like tools to provide the following controls:
- Be able to start an experiment
- Be able to check the status of an experiment that is running
- Be able to confirm that the experiment has completed
- Be able to restart an experiment in the middle if the cluster crashed or jobs failed.
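At the experiment level, these controls reduce to aggregating per-job states. The following sketch assumes each job's state is already known as a string; the job names, the state labels, and the function names are hypothetical.

```python
from collections import Counter


def summarize(job_states):
    """Count how many jobs are in each state."""
    return Counter(job_states.values())


def is_complete(job_states):
    """An experiment is complete when every job has completed."""
    return all(state == "completed" for state in job_states.values())


def jobs_to_resubmit(job_states):
    """On restart, only jobs that never started or that died are resubmitted."""
    return sorted(
        name for name, state in job_states.items()
        if state in ("not started", "died")
    )


# Hypothetical snapshot of an experiment after a cluster crash.
snapshot = {
    "median_1_1_1": "completed",
    "median_2_2_1": "died",
    "median_3_3_1": "running",
    "median_4_4_1": "not started",
}

summary = summarize(snapshot)
done = is_complete(snapshot)
todo = jobs_to_resubmit(snapshot)
print(summary, done, todo)
```

Jobs that are still running or already completed are left alone, which is what allows an experiment to be restarted in the middle.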
Thought experiments
Below are a few thought experiments that explore how the above needs can be addressed.
Makefiles + the loopy launcher
BatchMake
BatchMake allows large scale experiments to be designed using a scripting language similar to CMake scripts. BatchMake provides a number of looping constructs which can be used to design experiments and parameter searches:
- foreach
- sequence
- randomize
- fornfold
Here is a BatchMake script to search the parameter space of a median filter:
SetApp(median @'Median Filter')
SetAppOption(median.inputVolume 'c:/projects/I2/Insight/Testing/Data/Input/cthead1.png')
Set(kernels '1,1,1' '2,2,1' '3,3,1' '4,4,1' '5,5,1')
Set(outVolumePrefix 'c:/projects/Temp/Slicer3/median')
foreach(kernel ${kernels})
  RegEx(kernelText ${kernel} ',' REPLACE '_')
  SetAppOption(median.outputVolume ${outVolumePrefix}${kernelText}.png)
  SetAppOption(median.neighborhood ${kernel})
  Run(output ${median})
endforeach(kernel)
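For readers unfamiliar with BatchMake, the loop above can be mirrored in Python to show which command lines it is expected to produce. This is only an approximation: the executable name and the argument order (the --neighborhood flag followed by the input and output volumes) are assumptions taken from the module description, not from BatchMake's actual behavior.

```python
kernels = ["1,1,1", "2,2,1", "3,3,1", "4,4,1", "5,5,1"]
input_volume = "c:/projects/I2/Insight/Testing/Data/Input/cthead1.png"
out_prefix = "c:/projects/Temp/Slicer3/median"


def build_commands(executable="MedianImageFilter"):
    """Build one argument list per kernel, mirroring the foreach loop."""
    commands = []
    for kernel in kernels:
        # Same substitution as RegEx(kernelText ${kernel} ',' REPLACE '_')
        kernel_text = kernel.replace(",", "_")
        output_volume = f"{out_prefix}{kernel_text}.png"
        # Assumed layout: flag and value first, then input, then output.
        commands.append(
            [executable, "--neighborhood", kernel, input_volume, output_volume]
        )
    return commands


commands = build_commands()
for command in commands:
    print(" ".join(command))
```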
We have extended the ModuleDescription library in Slicer3 to generate a BatchMake XML Application Wrapper from a ModuleDescription object. This allows Slicer3 Command Line Modules to be loaded into BatchMake and used as BatchMake application objects in BatchMake scripts. This code has yet to be integrated into Slicer3 permanently because there are a number of design decisions outstanding. Here is the ModuleDescription XML file that Slicer uses:
<?xml version="1.0" encoding="utf-8"?>
<executable>
  <category> Filtering.Denoising </category>
  <title> Median Filter </title>
  <description> The MedianImageFilter is commonly used as a robust approach for noise reduction. This filter is particularly efficient against "salt-and-pepper" noise. In other words, it is robust to the presence of gray-level outliers. MedianImageFilter computes the value of each output pixel as the statistical median of the neighborhood of values around the corresponding input pixel. </description>
  <version>0.1.0.$Revision: 2085 $(alpha)</version>
  <documentation-url></documentation-url>
  <license></license>
  <contributor>Bill Lorensen</contributor>
  <acknowledgements>This command module was derived from Insight/Examples/Filtering/MedianImageFilter (copyright) Insight Software Consortium</acknowledgements>
  <parameters>
    <label>Median Filter Parameters</label>
    <description>Parameters for the median filter</description>
    <integer-vector>
      <name>neighborhood</name>
      <longflag>--neighborhood</longflag>
      <description>The size of the neighborhood in each dimension</description>
      <label>Neighborhood Size</label>
      <default>1,1,1</default>
    </integer-vector>
  </parameters>
  <parameters>
    <label>IO</label>
    <description>Input/output parameters</description>
    <image>
      <name>inputVolume</name>
      <label>Input Volume</label>
      <channel>input</channel>
      <index>0</index>
      <description>Input volume to be filtered</description>
    </image>
    <image>
      <name>outputVolume</name>
      <label>Output Volume</label>
      <channel>output</channel>
      <index>1</index>
      <description>Output filtered</description>
    </image>
  </parameters>
</executable>
and here is the resulting BatchMake XML Application wrapper:
<?xml version="1.0" encoding="utf-8"?>
<BatchMakeApplicationWrapper>
  <BatchMakeApplicationWrapperVersion>1.0</BatchMakeApplicationWrapperVersion>
  <Module>
    <Name>Median Filter</Name>
    <Version>0.1.0.$Revision: 2085 $(alpha)</Version>
    <Path>c:/projects/Slicer3-clean-net2005/bin/RelWithDebInfo/../../lib/Slicer3/Plugins/RelWithDebInfo/MedianImageFilter.exe</Path>
    <Parameters>
      <Param>
        <Type>1</Type>
        <Name>neighborhood.flag</Name>
        <Value>--neighborhood</Value>
        <Parent>0</Parent>
        <External>0</External>
        <Optional>1</Optional>
      </Param>
      <Param>
        <Type>4</Type>
        <Name>neighborhood</Name>
        <Value>1,1,1</Value>
        <Parent>1</Parent>
        <External>0</External>
        <Optional>0</Optional>
      </Param>
      <Param>
        <Type>0</Type>
        <Name>inputVolume</Name>
        <Value></Value>
        <Parent>0</Parent>
        <External>1</External>
        <Optional>0</Optional>
      </Param>
      <Param>
        <Type>0</Type>
        <Name>outputVolume</Name>
        <Value></Value>
        <Parent>0</Parent>
        <External>2</External>
        <Optional>0</Optional>
      </Param>
    </Parameters>
  </Module>
</BatchMakeApplicationWrapper>
BatchMake and the Computing Infrastructure
- What is needed to make BatchMake submit to a cluster?
- To a grid?
BatchMake and Job Control
- Can BatchMake terminate a job?
- Can BatchMake resubmit a job until it completes successfully?
BatchMake and Experiment Control
- Can BatchMake interrupt, continue, and rerun an experiment?