Engineering:Project:Condor Job Submission
This Page Under Construction
Overview
Programmer's Week Project Material: Project Summary Slide
Compute Resources
A 32-node compute cluster has been made available to the NAMIC community during Programmer's Week; under Condor this provides 64 available processors. We have set up a 'submit node' where you can log in and run jobs. It is also possible to access data from the BIRN Data Grid (SRB) for computational use via proxy delegation (i.e. you delegate a user proxy to Condor, which then delegates the proxy to the selected computational resources, which can then interact directly with the Data Grid). To obtain an account, please contact either Jason Gerk or Jeff Grethe during Programmer's Week. Once you have a standard Unix account, SSH to the 'submit node' namic-srb.nbirn.net.
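For example, once your account has been created, logging in to the submit node from your own machine looks something like this (youruser is a placeholder for the account name you were given):

ssh youruser@namic-srb.nbirn.net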
Remote Access to Data Grid via GSI Proxies
Once you log in to the machine namic-srb.nbirn.net, run the command birn-proxy-retrieve.pl.
[jgerk@namic-srb jgerk]$ birn-proxy-retrieve.pl
You are about to retrieve your BIRN proxy for authentication.
Please enter your portal username.
jgerk
Please enter your portal passphrase to retrieve your proxy.
You've successfully retrieved your BIRN proxy. Displayed below is your proxy information:
-----------------------------------------------
subject  : /C=US/O=BIRN/OU=UCSD - BIRN Coordinating Center/CN=Jason Gerk/USERID=jgerk/CN=proxy
issuer   : /C=US/O=BIRN/OU=UCSD - BIRN Coordinating Center/CN=Jason Gerk/USERID=jgerk
identity : /C=US/O=BIRN/OU=UCSD - BIRN Coordinating Center/CN=Jason Gerk/USERID=jgerk
type     : full legacy globus proxy
strength : 512 bits
path     : /tmp/x509up_ujgerk
timeleft : 11:59:59
This retrieves your proxy and sets up your environment in that shell, allowing you to use Scommands in your shell scripts.
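As an optional sanity check (not part of the original workflow), you can confirm that the proxy file is in place and that the Scommands can reach the Data Grid before submitting anything; the proxy path below assumes the default /tmp/x509up_u<username> location shown above:

# verify the delegated proxy file exists
ls -l /tmp/x509up_u`whoami`
# confirm SRB access from this shell
Sls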
A Condor/SRB Example
Here is a working example. In my home directory I have a file called job_submit_file, which names the executable program (a small shell script) that I want to run when I launch the Condor job. The executable is called getInfoSRB, but you can name it anything.
[jgerk@namic-srb birn_condor]$ more job_submit_file
Executable = getInfoSRB
Universe = vanilla
Output = output.stdout
Error = output.stderr
Log = output.log
getenv = TRUE
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = /tmp/x509up_u$ENV(LOGNAME)
Below is an example of what my shell script looks like. It runs in the compute cluster environment, calls some Scommands, pulls the hello program out of SRB as hello1 into the standard Linux environment, makes it executable, and runs it.
[jgerk@namic-srb birn_condor]$ more getInfoSRB
#!/bin/sh
echo " "
echo "This is an example of a shell script that uses Scommands."
echo " "
# Run a Sls
echo "Do and 'Sls' and show NAMIC related files:"
Sls | grep -i namic
# Get hello from SRB and download it locally on condor execute dir
echo " "
echo "Get a file from SRB: 'Sget hello' "
Sget hello hello1
# Run hello on condor execution machine
echo " "
echo "Set the permissions, and execute the program: chmod +x hello1 then ./hello1 "
echo "(output below) "
chmod +x hello1
./hello1
Next, we are ready to launch the job. The job_submit_file invokes the shell script, which in turn calls other programs within the compute cluster environment. Here we go.
[jgerk@namic-srb birn_condor]$ birn_condor_submit job_submit_file
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 68.
Once complete, a successful job should write to output.stdout (or whatever you define in your submit file).
[jgerk@namic-srb birn_condor]$ more output.stdout

This is an example of a shell script that uses Scommands.

Do and 'Sls' and show NAMIC related files:
C-/home/jgerk.ucsd-bcc/namic
C-/home/jgerk.ucsd-bcc/namic_backup

Get a file from SRB: 'Sget hello'

Set the permissions, and execute the program: chmod +x hello1 then ./hello1
(output below)
hello, Condor
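If the run does not look right, the other files named in the submit file are the first place to check; for example:

more output.stderr   # error output from the shell script, if any
more output.log      # Condor's own record of the job's lifecycle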
Submitting A Condor Job
Below is a submit template file for using Condor with GSI and SRB. Replace YOUR_EXECUTABLE with the name of your executable, which will usually be a shell script containing SRB commands and the executables you really want to run. Add other input files to the end of transfer_input_files as needed, for example /tmp/x509up_u$ENV(LOGNAME), your_input1, your_input2, etc. Note that if you are using a shell script as your Condor executable, you will have to either copy over your scientific code as an input file or store it in SRB and Sget it.
Template File
Executable = YOUR_EXECUTABLE
Universe = vanilla
Output = YOUR_EXECUTABLE.stdout
Error = YOUR_EXECUTABLE.stderr
Log = YOUR_EXECUTABLE.log
getenv = TRUE
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = /tmp/x509up_u$ENV(LOGNAME)
Queue
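If you want to run the same executable over several inputs, the standard Condor $(Process) macro together with a Queue count can be used. The sketch below is only an illustrative variant of the template above; input0, input1, ... are hypothetical input files you would have to provide:

Executable = YOUR_EXECUTABLE
Universe = vanilla
Arguments = input$(Process)
Output = YOUR_EXECUTABLE.$(Process).stdout
Error = YOUR_EXECUTABLE.$(Process).stderr
Log = YOUR_EXECUTABLE.log
getenv = TRUE
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = /tmp/x509up_u$ENV(LOGNAME), input$(Process)
Queue 4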
A NAMIC Example
Sample data and an associated executable from the NA-MIC svn Sandbox can be found here.
In this example the sample data has been uploaded to the SRB directory /home/jegrethe.ucsd-bcc/TestData/NAMIC. Results are written to /home/jegrethe.ucsd-bcc/TestData/NAMIC_results.
The job submit file is as follows:
########################################################
# Submit template file for using Condor with GSI and SRB.
# Replace YOUR_EXECUTABLE with the name of your executable,
# which will usually be a shell script with SRB commands and
# the executables you really want to run. Add other input files to
# the end of transfer_input_files as needed, for example
# /tmp/x509up_u$ENV(LOGNAME), your_input1, your_input2, etc.
# Note that if you are using a shell script as your condor
# executable, you will have to either copy over your scientific
# code as an input file or store it in SRB and Sget it.
########################################################
Executable = condorSH.sh
Universe = vanilla
Output = condorSH.stdout
Error = condorSH.stderr
Log = condorSH.log
getenv = TRUE
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = /tmp/x509up_ujegrethe
The shell script executed on the condor compute node is as follows:
#!/bin/sh
/usr/local/bin/Sget -r /home/jegrethe.ucsd-bcc/TestData/NAMIC
cd NAMIC
chmod a+x ./MultiModalityRigidRegistration
./MultiModalityRigidRegistration reg.nhdr mrt.nhdr xformIN xformOUT 2 1 > reg.out
cd ..
mv NAMIC NAMIC_results
/usr/local/bin/Sput -r NAMIC_results /home/jegrethe.ucsd-bcc/TestData
/usr/local/bin/Sput condorSH.stderr /home/jegrethe.ucsd-bcc/TestData/NAMIC_results
/usr/local/bin/Sput condorSH.stdout /home/jegrethe.ucsd-bcc/TestData/NAMIC_results
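Since the Condor log does not track the SRB transfers (see the note below), it can help to make the wrapper script itself fail loudly when a transfer does not succeed. A minimal sketch along those lines, assuming the same paths as the script above:

#!/bin/sh
# abort early if the SRB download fails, so the problem shows up
# in condorSH.stderr instead of as a missing-file error later on
/usr/local/bin/Sget -r /home/jegrethe.ucsd-bcc/TestData/NAMIC || { echo "Sget failed" >&2; exit 1; }
cd NAMIC || exit 1
chmod a+x ./MultiModalityRigidRegistration
./MultiModalityRigidRegistration reg.nhdr mrt.nhdr xformIN xformOUT 2 1 > reg.out
cd ..
mv NAMIC NAMIC_results
# report failed uploads as well
/usr/local/bin/Sput -r NAMIC_results /home/jegrethe.ucsd-bcc/TestData || { echo "Sput failed" >&2; exit 1; }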
The Condor log file for the job is as follows (note: the log does not contain any information on the SRB data movement; in this case Condor only manages the movement of the proxy):
000 (069.000.000) 06/17 01:40:08 Job submitted from host: <198.202.95.80:9646>
...
001 (069.000.000) 06/17 01:40:13 Job executing on host: <198.202.95.110:9607>
...
006 (069.000.000) 06/17 01:40:21 Image size of job updated: 35248
...
006 (069.000.000) 06/17 02:00:21 Image size of job updated: 93844
...
005 (069.000.000) 06/17 03:06:38 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 01:26:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 01:26:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0     -  Run Bytes Sent By Job
        3489  -  Run Bytes Received By Job
        0     -  Total Bytes Sent By Job
        3489  -  Total Bytes Received By Job
...
Condor Quick Help and Useful Commands
To see what resources are available on the cluster: condor_status
[jgerk@namic-srb jgerk]$ condor_status

Name          OpSys      Arch   State     Activity  LoadAv Mem  ActvtyTime

vm1@birn-clus LINUX      INTEL  Claimed   Idle      0.000  1979 0+00:00:03
vm2@birn-clus LINUX      INTEL  Claimed   Idle      0.000  1979 0+00:00:03
vm1@birn-clus LINUX      INTEL  Claimed   Idle      0.000  1979 0+00:00:03
vm2@birn-clus LINUX      INTEL  Claimed   Idle      0.000  1979 0+00:00:04
vm1@birn-clus LINUX      INTEL  Claimed   Idle      0.000  1979 0+00:00:03
vm2@birn-clus LINUX      INTEL  Unclaimed Idle      0.000  1979 0+00:00:05
vm1@birn-clus LINUX      INTEL  Claimed   Busy      0.000  1979 0+00:00:04
vm2@birn-clus LINUX      INTEL  Claimed   Idle      0.000  1979 0+00:00:06
vm1@birn-clus LINUX      INTEL  Claimed   Idle      0.000  1979 0+00:00:03
vm2@birn-clus LINUX      INTEL  Claimed   Idle      0.000  1979 0+00:00:03
                     Machines Owner Claimed Unclaimed Matched Preempting

         INTEL/LINUX       62     0      61         1       0          0

               Total       62     0      61         1       0          0
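condor_status also accepts options for narrowing the listing; for example, the standard -avail option restricts the output to machines that are currently able to run jobs:

condor_status -avail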
To launch a job using a submit file: birn_condor_submit <job_submit_file>
[jgerk@namic-srb birn_condor]$ birn_condor_submit job_submit_file
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 68.
To see what is queued to be run: condor_q
[jgerk@namic-srb birn_condor]$ condor_q

-- Submitter: namic-srb.public : <132.239.132.232:9663> : namic-srb.public
 ID      OWNER         SUBMITTED     RUN_TIME ST PRI SIZE CMD
  67.0   jgerk        6/27 04:02   0+00:03:53 R  0   0.0  getInfoSRB

1 jobs; 0 idle, 1 running, 0 held
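If a job seems stuck, the standard -long option dumps the full ClassAd for a single job, which is often the fastest way to see why it is sitting idle:

condor_q -long 67.0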
To remove a job from the queue: condor_rm <job_id>
[jgerk@namic-srb birn_condor]$ condor_rm 67.0
Job 67.0 marked for removal
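A job ID of the form cluster.process removes a single job; giving just the cluster number removes every job in that cluster, and -all removes all of your queued jobs (standard condor_rm usage; use with care):

condor_rm 67      # remove all jobs in cluster 67
condor_rm -all    # remove all of your jobs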
Syntax Strawman
namic-submit [--save-submit-file <submitfile>] [--job-directory <jobdir>] <cmd> [<args>...]
where:
- submitfile is a condor job submit file; if not specified a temp file is generated
- jobdir is a working state directory for a set of submissions that you want to later observe with namic-wait or manually inspect
- cmd is a shell script or executable that can be run on the remote environment
- args are arguments to the cmd
functionality: namic-submit takes the given command and args and runs the job on a condor-determined external compute node. By default stdin, stdout, and stderr of the cmd are mapped to the corresponding files of namic-submit.
If an SRB proxy is not set up, namic-submit prompts you for your SRB user ID and password and sets up the proxy.
namic-wait [--job-directory <jobdir>] [<groupid>]
where:
- jobdir is a directory that corresponds to the state for jobs submitted by namic-submit
functionality: namic-wait blocks until all jobs submitted with the given groupid have completed running
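As a purely hypothetical usage sketch of the proposed wrappers (myscript.sh, the input files, and the job directory name are placeholders):

namic-submit --job-directory ./myjobs ./myscript.sh case01.nhdr
namic-submit --job-directory ./myjobs ./myscript.sh case02.nhdr
namic-wait --job-directory ./myjobs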