How to run jobs with PBS/Pro

QUICK LINKS

  • How to run jobs with PBS/Pro -- How to run batch jobs on Eureka and Yucca.
  • Building PBS Job Command Files -- How to build a PBS command file on Eureka and Yucca.

How to run jobs with PBS/Pro

Here is a quick step-by-step guide to getting started running MPI jobs on Eureka and Yucca.

We want to run an MPI job that uses a total of 64 processes (cores). We also want to limit the number of processes running on each node to 8 (this gives us the flexibility to control how the system allocates the compute cores, so that OpenMP threads or other special needs can be taken into account).

To compile a simple "hello world" mpi program (after logging into Eureka):

  module add intel/intel-12-impi                               # activate the Intel compiler suite
                                                               # default uses the gcc/rocks openmpi toolchain

  cp /share/apps/intel/impi/4.0.3.008/test/test.c test.c       # make a copy of the sample hello world program
  mpicc test.c -o testc                                        # compile the sample program

Create a file called testc.pbs with the following (starting in column 1):

  #!/bin/bash
  #PBS -l select=8:ncpus=8:mpiprocs=8
  
  cd $PBS_O_WORKDIR
  module add intel/intel-12-impi
  echo
  echo The following nodes will be used to run this program:
  echo
  cat $PBS_NODEFILE
  echo
  mpirun -n 64 --hostfile $PBS_NODEFILE ./testc
  exit 0

The line #PBS -l select=8:ncpus=8:mpiprocs=8 controls how the system allocates processor cores for your MPI job.

  • select=# -- allocate # separate nodes
  • ncpus=# -- on each node allocate # cpus (cores)
  • mpiprocs=# -- on each node allocate # cpus (of the ncpus allocated) to MPI

By varying these values, you can control how CPU resources are allocated. The example above allocates 64 cores, all of which are for use by MPI (8 nodes with 8 CPUs on each node).
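
For instance, the same 64 MPI processes could be spread more thinly over more nodes (an illustrative variation, assuming enough nodes with at least 4 free cores each are available):

  #PBS -l select=16:ncpus=4:mpiprocs=4      # 16 nodes x 4 cores = 64 cores, all used by MPI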

If, for example, your program is a hybrid MPI/OpenMP program that runs 8 OpenMP threads alongside 4 MPI control processes on each node, you would use something like: #PBS -l select=4:ncpus=12:mpiprocs=4.
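
A minimal job-script sketch for such a hybrid run is shown below (my_hybrid_program is a hypothetical executable and OMP_NUM_THREADS=8 is an assumed thread count; adjust both for your own code):

  #!/bin/bash
  #PBS -l select=4:ncpus=12:mpiprocs=4

  cd $PBS_O_WORKDIR
  module add intel/intel-12-impi
  export OMP_NUM_THREADS=8                                    # assumed OpenMP thread count, matching the example above
  mpirun -n 16 --hostfile $PBS_NODEFILE ./my_hybrid_program   # 4 nodes x 4 mpiprocs = 16 MPI ranks
  exit 0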

To submit the test job:

  qsub -q compute testc.pbs
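
If the scheduler accepts the job, qsub prints the job identifier it has assigned; you can use this identifier later with qstat or qdel. A hypothetical session might look like the following (the job number, and the exact form of the identifier, will differ on your system):

  [ron@yucca ~]$ qsub -q compute testc.pbs
  1234.yucca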

Commonly used PBS commands

All jobs must be submitted to the batch scheduling system. There are numerous commands that you can use to communicate with the scheduler, and all of them have manual pages available on the system (e.g., man qsub). See batch queues for a description of the queues available on the system. The most often used commands are:

  • qsub -- submit PBS job.
           qsub [-a date_time] [-A account_string] [-c interval]
                [-C directive_prefix] [-e path] [-h] [-I] [-j join] [-J range]
                [-k keep] [-l resource_list] [-m mail_events] [-M user_list]
                [-N name] [-o path] [-p priority] [-P project] [-q destination]
                [-r c] [-S path_list] [-u user_list] [-v variable_list] [-V]
                [-W additional_attributes] [-X] [-z]
                [script | -- executable [arglist for executable]]
    
           qsub --version
    

    The most commonly used options for the qsub command are:

       -q <queue_name>
       -l select=1:ncpus=<num_cores>:mem=#gb
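
    For example, to request one node with 4 cores and 8 GB of memory for a job script (here myjob.pbs is a hypothetical script name and the resource values are only illustrative):

      qsub -q compute -l select=1:ncpus=4:mem=8gb myjob.pbs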
    

    A simple qsub command is:

      qsub -q compute -N test-job -- /bin/hostname
    

    This command will schedule a job on the compute queue that uses 1 core and whose standard output will be written to a file called "test-job.o####" (where #### is the job number). Similarly, the standard error output will be written to a file called "test-job.e####".
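
    Once a job like this has finished, you can inspect the output files directly (the job number 1234 here is only illustrative):

      cat test-job.o1234      # standard output -- for this job, the hostname of the node it ran on
      cat test-job.e1234      # standard error -- empty if the job ran without problems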

  • qstat -- display status of PBS batch jobs, queues, or servers
           Displaying Job Status
           Default format:
           qstat [-p] [-J] [-t] [-x]
                 [ [job_identifier | destination] ...]
    
           Long format:
           qstat -f [-p] [-J] [-t] [-x]
                 [ [job_identifier | destination] ...]
    
           Alternate format:
           qstat [-a [-w]| -H | -i | -r ] [-G | -M] [-J] [-n [-1][-w]]
                 [-s  [-1][-w]]  [-t] [-T [-w]] [-u user_list] [ [job_identifier |
                 destination] ...]
    
           Displaying Queue Status
           Default format:
           qstat -Q [destination ...]
    
           Long format:
           qstat -Q -f [destination ...]
    
           Alternate format:
           qstat -q [-G | -M] [destination ...]
    
           Displaying Server Status
           Default format:
           qstat -B [server_name ...]
    
           Long format:
           qstat -B -f [server_name ...]
    
           Version Information
           qstat --version
    

    If you issue the qstat command without any options, it will display a single line of information for each of the pending, running, and suspended jobs active on the system.

    [ron@yucca ~]$ qstat
    Job id            Name             User              Time Use S Queue
    ----------------  ---------------- ----------------  -------- - -----
    1030.yucca        ADNIscpt.sh      alborno2          112:31:3 R smp             
    766.yucca         reconSHR.sh      saba              44:48:08 R smp             
    1135.yucca        chr4dataScript   flink             44:34:01 R smp             
    1196.yucca        chr10data_Filte  flink             18:20:29 R smp             
    1197.yucca        chr11data_Filte  flink             18:20:32 R smp             
    1198.yucca        chr12data_Filte  flink             18:20:15 R smp             
    1199.yucca        chr13data_Filte  flink             18:20:15 R smp             
    1200.yucca        chr14data_Filte  flink             18:20:16 R smp             
    1201.yucca        chr15data_Filte  flink             18:20:20 R smp             
    1202.yucca        chr16data_Filte  flink             18:20:15 R smp             
    1203.yucca        chr17data_Filte  flink             18:20:21 R smp             
    1204.yucca        chr18data_Filte  flink             18:20:26 R smp             
    1205.yucca        chr19data_Filte  flink             18:20:12 R smp             
    1206.yucca        chr1data_Filter  flink             18:20:28 R smp             
    1207.yucca        chr20data_Filte  flink             18:20:09 R smp             
    1208.yucca        chr2data_Filter  flink             18:20:16 R smp             
    1209.yucca        chr3data_Filter  flink             18:20:17 R smp             
    1210.yucca        chr4data_Filter  flink             18:20:09 R smp             
    1211.yucca        chr5data_Filter  flink             18:20:12 R smp             
    1212.yucca        chr6data_Filter  flink             18:20:20 R smp             
    1213.yucca        chr7data_Filter  flink             18:20:14 R smp             
    1214.yucca        chr8data_Filter  flink             18:20:21 R smp             
    1215.yucca        chr9data_Filter  flink             18:20:19 R smp             
    1216.yucca        chrXdata_Filter  flink             18:20:18 R smp             
    1231.yucca        schiller.sh      alborno2          07:48:48 R smp             
    1234.yucca        test-job         ron                      0 Q compute         
    1235.yucca        test-job         ron                      0 Q compute         
    [ron@yucca ~]$
    

    To see detailed information about a particular job, use the qstat -f #### command, where #### is the job number assigned by PBS. If your job is not running, look at the comment line for a hint about the possible cause of the problem.

    [ron@yucca ~]$ qstat -f 1234
    Job Id: 1234.yucca.nscee.edu
        Job_Name = test-job
        Job_Owner = ron@yucca.nscee.edu
        job_state = Q
        queue = compute
        server = yucca.nscee.edu
        Checkpoint = u
        ctime = Fri May  9 12:05:50 2014
        Error_Path = yucca.nscee.edu:/home/ron/test-job.e1234
        Hold_Types = n
        Join_Path = n
        Keep_Files = n
        Mail_Points = a
        mtime = Fri May  9 12:05:51 2014
        Output_Path = yucca.nscee.edu:/home/ron/test-job.o1234
        Priority = 0
        qtime = Fri May  9 12:05:51 2014
        Rerunable = True
        Resource_List.ncpus = 1
        Resource_List.nodect = 1
        Resource_List.place = pack
        Resource_List.select = 1:ncpus=1
        substate = 10
        Variable_List = PBS_O_SYSTEM=Linux,PBS_O_SHELL=/bin/bash,
    	PBS_O_HOME=/home/ron,PBS_O_LOGNAME=ron,PBS_O_WORKDIR=/home/ron,
    	PBS_O_LANG=en_US.UTF-8,
    	PBS_O_PATH=/opt/openmpi/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:
    	/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/pbs/default/bin:/opt/pbs/
    	default/sbin:/opt/eclipse:/opt/ganglia/bin:/opt/ganglia/sbin:/usr/java/
    	latest/bin:/opt/maven/bin:/opt/pdsh/bin:/opt/rocks/bin:/opt/rocks/sbin:
    	/home/ron/bin,PBS_O_MAIL=/var/spool/mail/ron,PBS_O_QUEUE=compute,
    	PBS_O_HOST=yucca.nscee.edu
        comment = Not Running: Not enough free nodes available
        etime = Fri May  9 12:05:51 2014
        Submit_arguments = -q compute -N test-job -- /bin/hostname
        executable = <jsdl-hpcpa:Executable>/bin/hostname</jsdl-hpcpa:Executable>
        project = _pbs_project_default
    
    [ron@yucca ~]$ 
    
  • qdel -- Deletes PBS jobs
           qdel [-x] [-Wforce | -Wsuppress_email=<N>]
                 job_identifier [job_identifier ...]
           qdel --version
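
    For example, to remove the queued test job shown above (the job number will differ on your system), you would run something like:

      qdel 1234               # or use the full identifier reported by qstat, e.g. 1234.yucca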
    