Condor is installed on the cluster and configured so that it can be run from any computer connected to the TERAPIX network, including personal computers. Here is a small cookbook on how to submit jobs to the TERAPIX cluster.
First, make sure that Condor is installed and configured on your machine (otherwise follow this link to install and configure Condor): on a Unix system it is generally installed in /opt/condor or /usr/local/condor (from now on we will refer to the Condor installation directory as $CONDOR_DIR). You can check that it is running with
% ps ax | grep condor
You should see at least a task called condor_master. If this is not the case, you might want to start the Condor system (as root if necessary) using
% $CONDOR_DIR/sbin/condor_master
If your PATH is correctly set (it should include $CONDOR_DIR/bin), typing condor_status from the shell should return something like:
Name           OpSys  Arch    State      Activity  LoadAv   Mem  ActvtyTime
vm1@efigix.ia  LINUX  X86_64  Unclaimed  Idle      0.000    975  0+00:20:29
vm2@efigix.ia  LINUX  X86_64  Unclaimed  Idle      0.410    975  0+03:25:08
vm3@efigix.ia  LINUX  X86_64  Unclaimed  Idle      0.000    975  0+15:29:11
vm4@efigix.ia  LINUX  X86_64  Unclaimed  Idle      0.000    975  0+15:28:31
vm1@mix10.iap  LINUX  X86_64  Unclaimed  Idle      0.040   4026  0+03:25:07
vm2@mix10.iap  LINUX  X86_64  Unclaimed  Idle      0.000   4026  0+15:28:29
...
The command condor_submit is used to send jobs to the cluster. You may either provide the name of a "submission file" as an argument, or pipe its content to condor_submit. The following commented submission file shows how to run the system command ls on the cluster and return the result in ls.out (note that in a submission file, comments must stand on their own lines, starting with #):
# what to run
executable = /bin/ls
# an ordinary job (not MPI, not a checkpointed "standard" job, etc.)
universe = vanilla
# command-line arguments (separated with spaces)
arguments = /
# where to write the result (from stdout)
output = ls.out
# where to write the errors (from stderr)
error = ls.error
# where to write the Condor log
log = ls.log
# don't rely on NFS or any other shared filesystem
should_transfer_files = YES
# mandatory for transferring files
when_to_transfer_output = ON_EXIT_OR_EVICT
# go!
queue
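Assuming the lines above have been saved in a file called, say, ls.submit (the name is arbitrary), the job is sent to the cluster with
% condor_submit ls.submit
condor_submit should answer with something like:
Submitting job(s).
1 job(s) submitted to cluster 39.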
Typing condor_q quickly, right after submitting the job, should show something like:
-- Submitter: kiravix.iap.fr : <194.57.221.16:41872> : kiravix.iap.fr
 ID    OWNER   SUBMITTED    RUN_TIME   ST PRI SIZE CMD
 39.0  bertin  7/4  16:12   0+00:00:00 I  0   9.8  ls /
1 jobs; 1 idle, 0 running, 0 held
You may also use condor_q -global to list all jobs currently queued for execution (not only those sent from your machine), condor_q -long to list more details, or condor_q -better-analyze to get some hints if the job is not executed as planned. If everything works well, the job should vanish from the condor_q list after a few seconds, and files called ls.out, ls.error and ls.log should appear in the current directory.
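For instance, to understand why job 39.0 above would stay idle, one could type:
% condor_q -better-analyze 39.0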
Here is a more complex "real-life" example which sends SExtractor as a "cluster of jobs". It queues 3 jobs, and transfers the executable as well as configuration files and data:
#
# Condor submission file for PGC morphology
#
executable = /usr/local/bin/sex
universe = vanilla
transfer_executable = True
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
log = sex.log
arguments = image1.fits -XML_NAME image1.xml -CATALOG_NAME image1.cat
transfer_input_files = image1.fits,default.sex,default.param,default.conv
queue
arguments = image2.fits -XML_NAME image2.xml -CATALOG_NAME image2.cat
transfer_input_files = image2.fits,default.sex,default.param,default.conv
queue
arguments = image3.fits -XML_NAME image3.xml -CATALOG_NAME image3.cat
transfer_input_files = image3.fits,default.sex,default.param,default.conv
queue
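Saved in a file called, say, pgc.submit (again, the name is arbitrary), this cluster of jobs is submitted exactly like the previous one:
% condor_submit pgc.submit
All three jobs end up in the same cluster, and appear in condor_q with IDs of the form N.0, N.1 and N.2, where N is the cluster number.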
Note the commas between the names of the files to be transferred, and the queue command separating each job submission. Files created by the executable are automatically transferred back. The submission file above was actually generated with a shell script called sex.x:
#! /bin/tcsh
echo "#"
echo "# Condor submission file for PGC morphology"
echo "#"
echo "executable = /usr/local/bin/sex"
echo "universe = vanilla"
echo "transfer_executable = True"
echo "should_transfer_files = YES"
echo "when_to_transfer_output = ON_EXIT_OR_EVICT"
echo "log = pgc.log"
foreach file ( $* )
  set rfile = $file:r:t
  echo "arguments = "$file" -XML_NAME "$rfile".xml -CATALOG_NAME "$rfile".cat"
  echo "transfer_input_files = "$file",default.sex,default.param,default.conv"
#  echo "transfer_output_files = "
#  echo "output = "$rfile".out"
#  echo "error = "$rfile".error"
  echo "queue"
end
Using pipes, we can now send SExtractor jobs for all the images in the current directory with
% ./sex.x *.fits | condor_submit
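Since sex.x simply writes the submission file to its standard output, you can inspect what would be submitted before actually sending anything by omitting the pipe:
% ./sex.x *.fits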
If a job is stuck or takes too long to complete, you might want to remove it from the queue with condor_rm. For instance
% condor_rm bertin
removes all jobs owned by user bertin. Fortunately, condor_rm can also be more selective:
% condor_rm 56.3
will remove only job #56.3.
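You may also remove a whole cluster of jobs at once by giving only the cluster number:
% condor_rm 56
removes jobs 56.0, 56.1, 56.2, etc.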