Engineering:Project:Condor Job Submission

This Page Under Construction

Overview

Programmer's Week Project Material: Project Summary Slide


Compute Resources

A 32-node Compute Cluster has been made available to the NAMIC community during the Programmer's Week; under Condor this provides 64 available processors. We have set up a 'submit node' where you can log in and run jobs from. It is also possible to access data from the BIRN Data Grid (SRB) for computational use via proxy delegation: you delegate a user proxy to Condor, which then delegates the proxy to the selected compute resources, which in turn can interact directly with the Data Grid. To obtain an account, please contact either Jason Gerk or Jeff Grethe during the Programmer's Week. Once you have a standard Unix account, SSH to the 'submit node' namic-srb.nbirn.net.
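For example, once your account has been created, you would log in to the submit node as follows (substituting your own username for jgerk):

 ssh jgerk@namic-srb.nbirn.net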

Remote Access to Data Grid via GSI Proxies

Once you log in to the machine namic-srb.nbirn.net, run the command birn-proxy-retrieve.pl:

 [jgerk@namic-srb jgerk]$ birn-proxy-retrieve.pl

 You are about to retrieve your BIRN proxy for authentication.

 Please enter your portal username.
 jgerk
 Please enter your portal passphrase to retrieve your proxy.

 You've successfully retrieved your BIRN proxy.

 Displayed below is your proxy information:
 -----------------------------------------------
 subject  : /C=US/O=BIRN/OU=UCSD - BIRN Coordinating Center/CN=Jason Gerk/USERID=jgerk/CN=proxy
 issuer   : /C=US/O=BIRN/OU=UCSD - BIRN Coordinating Center/CN=Jason Gerk/USERID=jgerk
 identity : /C=US/O=BIRN/OU=UCSD - BIRN Coordinating Center/CN=Jason Gerk/USERID=jgerk
 type     : full legacy globus proxy
 strength : 512 bits
 path     : /tmp/x509up_ujgerk
 timeleft : 11:59:59
 0

This retrieves your proxy and sets up your environment in that shell, allowing you to use Scommands in your shell scripts.
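As a quick sanity check that the proxy and environment are working, you can run an Scommand directly in that shell before writing any scripts; for example, the same Sls used in the job below (the listing will show your own SRB collections rather than these):

 [jgerk@namic-srb jgerk]$ Sls | grep -i namic
 C-/home/jgerk.ucsd-bcc/namic
 C-/home/jgerk.ucsd-bcc/namic_backup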

A Condor/SRB Example

Here is a working example. In my home directory I have a file called job_submit_file, which names the executable program (a small shell script) that I want to run when I launch the Condor job. This script is called getInfoSRB, but you can name it anything.

 [jgerk@namic-srb birn_condor]$ more job_submit_file

 Executable                                 = getInfoSRB
 Universe                                   = vanilla
 Output                                     = output.stdout
 Error                                      = output.stderr
 Log                                        = output.log
 getenv                                     = TRUE
 should_transfer_files                      = YES
 when_to_transfer_output                    = ON_EXIT
 transfer_input_files                       = /tmp/x509up_u$ENV(LOGNAME)
 Queue


Below is an example of what my shell script looks like. It runs in the Compute Cluster environment: it calls some Scommands, pulls the hello1 program out of SRB into the standard Linux execution directory, changes its permissions, and runs it.

 [jgerk@namic-srb birn_condor]$ more getInfoSRB
 #!/bin/sh

 echo " "
 echo "This is an example of a shell script that uses Scommands."
 echo " "

 #  Run a Sls
 echo "Do and 'Sls' and show NAMIC related files:"
 Sls | grep -i namic

 #Get hello from SRB and download it locally on condor execute dir
 echo " "
 echo "Get a file from SRB: 'Sget hello' "
 Sget hello hello1

 #Run hello on condor execution machine
 echo " "
 echo "Set the permissions, and execute the program: chmod +x hello1 then ./hello1 "
 echo "(output below) "
 chmod +x hello1
 ./hello1

Next, we are ready to launch the job. The job_submit_file invokes the shell script, which in turn calls other programs within the Compute Cluster environment. Here we go.

 [jgerk@namic-srb birn_condor]$ birn_condor_submit job_submit_file
 Submitting job(s).
 Logging submit event(s).
 1 job(s) submitted to cluster 68.

Once complete, a successful job writes its output to output.stdout (or whatever you define in your submit file).

 [jgerk@namic-srb birn_condor]$ more output.stdout

 This is an example of a shell script that uses Scommands.

 Do an 'Sls' and show NAMIC-related files:
 C-/home/jgerk.ucsd-bcc/namic
 C-/home/jgerk.ucsd-bcc/namic_backup

 Get a file from SRB: 'Sget hello'

 Set the permissions, and execute the program: chmod +x hello1 then ./hello1
 (output below)
 hello, Condor

Submitting A Condor Job


Below is a submit template file for using Condor with GSI and SRB. Replace YOUR_EXECUTABLE with the name of your executable, which will usually be a shell script containing SRB commands and the executables you really want to run. Add other input files to the end of transfer_input_files as needed, for example /tmp/x509up_u$ENV(LOGNAME), your_input1, your_input2, etc. (see the sample line after the template below). Note that if you are using a shell script as your Condor executable, you will have to either copy over your scientific code as an input file or store it in SRB and Sget it.

Template File

Executable                                 = YOUR_EXECUTABLE
Universe                                   = vanilla
Output                                     = YOUR_EXECUTABLE.stdout
Error                                      = YOUR_EXECUTABLE.stderr
Log                                        = YOUR_EXECUTABLE.log
getenv                                     = TRUE
should_transfer_files                      = YES
when_to_transfer_output                    = ON_EXIT
transfer_input_files                       = /tmp/x509up_u$ENV(LOGNAME)

Queue
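For instance, if your job also needed two additional local input files (the names your_input1 and your_input2 here are just placeholders), the transfer line would become:

 transfer_input_files                       = /tmp/x509up_u$ENV(LOGNAME), your_input1, your_input2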

A NAMIC Example

Sample data and an associated executable from the NA-MIC svn Sandbox can be found here.

In this example, the sample data has been uploaded to the directory /home/jegrethe.ucsd-bcc/TestData/NAMIC. Results are written to /home/jegrethe.ucsd-bcc/TestData/NAMIC_results.

The job submit file is as follows:

 ########################################################
 # Submit template file for using Condor with GSI and SRB.
 # Replace YOUR_EXECUTABLE with the name of your executable,
 # which will usually be a shell script with SRB commands and
 # the executables you really want to run.  Add other input files
 # to the end of transfer_input_files as needed, for example
 # /tmp/x509up_u$ENV(LOGNAME), your_input1, your_input2, etc.
 # Note that if you are using a shell script as your condor
 # executable, you will have to either copy over your scientific
 # code as an input file or store it in SRB and Sget it.
 ########################################################
 
 Executable                                 = condorSH.sh
 Universe                                   = vanilla
 Output                                     = condorSH.stdout
 Error                                      = condorSH.stderr
 Log                                        = condorSH.log
 getenv                                     = TRUE
 should_transfer_files                      = YES
 when_to_transfer_output                    = ON_EXIT
 transfer_input_files                       = /tmp/x509up_ujegrethe
 Queue
 
 

The shell script executed on the condor compute node is as follows:

 #!/bin/sh
 
 /usr/local/bin/Sget -r /home/jegrethe.ucsd-bcc/TestData/NAMIC
 
 cd NAMIC
 chmod a+x ./MultiModalityRigidRegistration
 ./MultiModalityRigidRegistration reg.nhdr mrt.nhdr xformIN xformOUT 2 1 >  reg.out
 
 cd ..
 mv NAMIC NAMIC_results
 /usr/local/bin/Sput -r NAMIC_results /home/jegrethe.ucsd-bcc/TestData
 
 /usr/local/bin/Sput condorSH.stderr /home/jegrethe.ucsd-bcc/TestData/NAMIC_results
 /usr/local/bin/Sput condorSH.stdout /home/jegrethe.ucsd-bcc/TestData/NAMIC_results
 

The Condor log file for the job is as follows (note: the log doesn't contain any information on the SRB data movement; Condor only manages the movement of the proxy in this case):

000 (069.000.000) 06/17 01:40:08 Job submitted from host: <198.202.95.80:9646>
...
001 (069.000.000) 06/17 01:40:13 Job executing on host: <198.202.95.110:9607>
...
006 (069.000.000) 06/17 01:40:21 Image size of job updated: 35248
...
006 (069.000.000) 06/17 02:00:21 Image size of job updated: 93844
...
005 (069.000.000) 06/17 03:06:38 Job terminated.
       (1) Normal termination (return value 0)
               Usr 0 01:26:00, Sys 0 00:00:00  -  Run Remote Usage
               Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
               Usr 0 01:26:00, Sys 0 00:00:00  -  Total Remote Usage
               Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
       0  -  Run Bytes Sent By Job
       3489  -  Run Bytes Received By Job
       0  -  Total Bytes Sent By Job
       3489  -  Total Bytes Received By Job
...

Condor Quick Help and Useful Commands

To see what resources are available on the cluster: condor_status

 [jgerk@namic-srb jgerk]$ condor_status

 Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime

 vm1@birn-clus LINUX       INTEL  Claimed    Idle       0.000  1979  0+00:00:03
 vm2@birn-clus LINUX       INTEL  Claimed    Idle       0.000  1979  0+00:00:03
 vm1@birn-clus LINUX       INTEL  Claimed    Idle       0.000  1979  0+00:00:03
 vm2@birn-clus LINUX       INTEL  Claimed    Idle       0.000  1979  0+00:00:04
 vm1@birn-clus LINUX       INTEL  Claimed    Idle       0.000  1979  0+00:00:03
 vm2@birn-clus LINUX       INTEL  Unclaimed  Idle       0.000  1979  0+00:00:05
 vm1@birn-clus LINUX       INTEL  Claimed    Busy       0.000  1979  0+00:00:04
 vm2@birn-clus LINUX       INTEL  Claimed    Idle       0.000  1979  0+00:00:06
 vm1@birn-clus LINUX       INTEL  Claimed    Idle       0.000  1979  0+00:00:03
 vm2@birn-clus LINUX       INTEL  Claimed    Idle       0.000  1979  0+00:00:03


                    Machines Owner Claimed Unclaimed Matched Preempting

        INTEL/LINUX       62     0      61         1       0          0

              Total       62     0      61         1       0          0



To launch a job using a submit file: birn_condor_submit <job_submit_file>

[jgerk@namic-srb birn_condor]$ birn_condor_submit job_submit_file
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 68.

To see what is queued to be run: condor_q

 [jgerk@namic-srb birn_condor]$ condor_q

 -- Submitter: namic-srb.public : <132.239.132.232:9663> : namic-srb.public
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
 67.0   jgerk           6/27 04:02   0+00:03:53 R  0   0.0  getInfoSRB

 1 jobs; 0 idle, 1 running, 0 held

To remove a job from the queue: condor_rm <job_id>

 [jgerk@namic-srb birn_condor]$ condor_rm 67.0
 Job 67.0 marked for removal


Syntax Strawman

namic-submit [--save-submit-file <submitfile>] [--job-directory <jobdir>] <cmd> [<args>...]

where:

  • submitfile is a condor job submit file; if not specified a temp file is generated
  • jobdir is a working state directory for a set of submissions that you want to later observe with namic-wait or manually inspect
  • cmd is a shell script or executable that can be run on the remote environment
  • args are arguments to the cmd

functionality: namic-submit takes the given command and args and runs the job on a condor-determined external compute node. By default stdin, stdout, and stderr of the cmd are mapped to the corresponding files of namic-submit.

If an SRB proxy is not set up, namic-submit prompts you for your SRB user id and password and sets up the proxy.
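As a sketch of how a submission might look (the script name and argument here are purely hypothetical, since namic-submit is still a strawman):

 namic-submit --job-directory myjobs ./myAnalysis.sh input1.nhdr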

namic-wait [--job-directory <jobdir>] [<groupid>]

where:

  • jobdir is a directory that corresponds to the state for jobs submitted by namic-submit

functionality: namic-wait blocks until all jobs submitted with the given groupid have completed running.
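A matching (again hypothetical) call to wait on the jobs submitted above would be:

 namic-wait --job-directory myjobs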