How to run jobs with PBS/Pro


Here is a quick step-by-step guide to getting started running MPI jobs on Eureka and Yucca.

We want to run an MPI job that uses a total of 64 processes (cores). We also want to limit the number of processes running on each node to 8. Controlling how the system allocates the compute cores gives us the flexibility to accommodate OpenMP threads or other special needs.

To compile a simple "hello world" MPI program (after logging into Eureka):

  module add intel/intel-12-impi                               # activate the Intel compiler suite
                                                               # default uses the gcc/rocks openmpi toolchain

  cp /share/apps/intel/impi/ test.c       # make a copy of the sample hello world program
  mpicc test.c -o testc                                        # compile the sample program

Create a file called testc.pbs with the following (starting in column 1):

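  #PBS -l select=8:ncpus=8:mpiprocs=8
  module add intel/intel-12-impi
  cd $PBS_O_WORKDIR                                            # run from the directory the job was submitted from
  echo The following nodes will be used to run this program:
  cat $PBS_NODEFILE                                            # list the nodes allocated to this job
  mpirun -n 64 --hostfile $PBS_NODEFILE ./testc
  exit 0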

The line #PBS -l select=8:ncpus=8:mpiprocs=8 controls how the system allocates processor cores for your MPI jobs.

  • select=# -- allocate # separate nodes
  • ncpus=# -- on each node allocate # cpus (cores)
  • mpiprocs=# -- on each node allocate # cpus (of the ncpus allocated) to MPI

By varying the above, you can control how CPU resources are allocated. The above example allocates 64 cores, all of which are for use by MPI (8 nodes with 8 CPUs on each node).
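
The arithmetic behind a select statement can be checked with a small shell helper. The sketch below is only an illustration: the function name select_totals is hypothetical (it is not a PBS command), and it assumes a single chunk specification of the form select=N:ncpus=C:mpiprocs=M.

```shell
#!/bin/bash
# select_totals is a hypothetical helper, not part of PBS.
# It assumes one chunk spec of the form select=N:ncpus=C:mpiprocs=M.
select_totals() {
    local spec=$1 nodes=0 ncpus=0 mpiprocs=0 field
    local IFS=:                      # split the spec on colons
    for field in $spec; do
        case $field in
            select=*)   nodes=${field#select=} ;;
            ncpus=*)    ncpus=${field#ncpus=} ;;
            mpiprocs=*) mpiprocs=${field#mpiprocs=} ;;
        esac
    done
    # Per-node values multiplied by the node count give the job totals.
    echo "nodes=$nodes cores=$((nodes * ncpus)) mpi_ranks=$((nodes * mpiprocs))"
}

select_totals "select=8:ncpus=8:mpiprocs=8"
```

For the example used on this page, this prints nodes=8 cores=64 mpi_ranks=64, which matches the value passed to mpirun -n.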

If, for example, your program is a hybrid MPI/OpenMP program that runs 4 MPI processes with 8 OpenMP threads each, one MPI process per node, you would use something like: #PBS -l select=4:ncpus=8:mpiprocs=1:ompthreads=8.

To submit the test job:

  qsub -q compute testc.pbs

Commonly used PBS commands

All jobs must be submitted to the batch scheduling system. There are numerous commands that you can use to communicate with the scheduler. All of the batch scheduler commands have manual pages available on the system (e.g. man qsub). See batch queues for a description of the queues available on the system. The most often used commands are:

  • qsub -- submit PBS job.
           qsub [-a date_time] [-A account_string] [-c interval]
                [-C directive_prefix] [-e path] [-h] [-I] [-j join] [-J range] [-k
                keep] [-l resource_list] [-m mail_events] [-M user_list] [-N name]
                [-o  path]  [-p priority] [-P project] [-q destination] [-r c] [-S
                path_list]  [-u  user_list]  [-v  variable_list]  [-V]  [-W  additional_attributes]
                [-X] [-z] [script | -- executable [arglist for executable]]
           qsub --version

    The most commonly used options for the qsub command are:

       -q <queue_name>
       -l select=1:ncpus=<num_cores>:mem=#gb

    A simple qsub command is:

      qsub -q compute -N test-job -- /bin/hostname

    This command will schedule a job on the compute queue that uses 1 core and whose standard output will be written to a file called "test-job.o####" (where #### is the job number). Similarly, the standard error output will be written to a file called "test-job.e####".
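
The naming rule described above can be expressed as a short shell sketch. The function name job_output_files is hypothetical (it is not a PBS command), and the sketch assumes the default naming, that is, no -o, -e or -j options were given to qsub. The job ID 1234.yucca is only an example of the form that qsub prints on submission.

```shell
#!/bin/bash
# job_output_files is a hypothetical helper, not a PBS command.
# Given a job name and a job ID of the form <number>.<server>,
# print the default standard output and standard error file names.
job_output_files() {
    local name=$1 jobid=$2
    local seq=${jobid%%.*}      # strip the server suffix: 1234.yucca -> 1234
    echo "$name.o$seq"
    echo "$name.e$seq"
}

job_output_files test-job 1234.yucca
```

For a job named test-job with ID 1234.yucca, this prints test-job.o1234 and test-job.e1234.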

  • qstat -- display status of PBS batch jobs, queues, or servers
           Displaying Job Status
           Default format:
           qstat [-p] [-J] [-t] [-x]
                 [ [job_identifier | destination] ...]
           Long format:
           qstat -f [-p] [-J] [-t] [-x]
                 [ [job_identifier | destination] ...]
           Alternate format:
           qstat [-a [-w]| -H | -i | -r ] [-G | -M] [-J] [-n [-1][-w]]
                 [-s  [-1][-w]]  [-t] [-T [-w]] [-u user_list] [ [job_identifier |
                 destination] ...]
           Displaying Queue Status
           Default format:
           qstat -Q [destination ...]
           Long format:
           qstat -Q -f [destination ...]
           Alternate format:
           qstat -q [-G | -M] [destination ...]
           Displaying Server Status
           Default format:
           qstat -B [server_name ...]
           Long format:
           qstat -B -f [server_name ...]
           Version Information
           qstat --version

    If you issue the qstat command without any options, it will display a single line of information about each of the pending, running and suspended jobs active on the system.

    [ron@yucca ~]$ qstat
    Job id            Name             User              Time Use S Queue
    ----------------  ---------------- ----------------  -------- - -----
    1030.yucca      alborno2          112:31:3 R smp             
    766.yucca      saba              44:48:08 R smp             
    1135.yucca        chr4dataScript   flink             44:34:01 R smp             
    1196.yucca        chr10data_Filte  flink             18:20:29 R smp             
    1197.yucca        chr11data_Filte  flink             18:20:32 R smp             
    1198.yucca        chr12data_Filte  flink             18:20:15 R smp             
    1199.yucca        chr13data_Filte  flink             18:20:15 R smp             
    1200.yucca        chr14data_Filte  flink             18:20:16 R smp             
    1201.yucca        chr15data_Filte  flink             18:20:20 R smp             
    1202.yucca        chr16data_Filte  flink             18:20:15 R smp             
    1203.yucca        chr17data_Filte  flink             18:20:21 R smp             
    1204.yucca        chr18data_Filte  flink             18:20:26 R smp             
    1205.yucca        chr19data_Filte  flink             18:20:12 R smp             
    1206.yucca        chr1data_Filter  flink             18:20:28 R smp             
    1207.yucca        chr20data_Filte  flink             18:20:09 R smp             
    1208.yucca        chr2data_Filter  flink             18:20:16 R smp             
    1209.yucca        chr3data_Filter  flink             18:20:17 R smp             
    1210.yucca        chr4data_Filter  flink             18:20:09 R smp             
    1211.yucca        chr5data_Filter  flink             18:20:12 R smp             
    1212.yucca        chr6data_Filter  flink             18:20:20 R smp             
    1213.yucca        chr7data_Filter  flink             18:20:14 R smp             
    1214.yucca        chr8data_Filter  flink             18:20:21 R smp             
    1215.yucca        chr9data_Filter  flink             18:20:19 R smp             
    1216.yucca        chrXdata_Filter  flink             18:20:18 R smp             
    1231.yucca      alborno2          07:48:48 R smp             
    1234.yucca        test-job         ron                      0 Q compute         
    1235.yucca        test-job         ron                      0 Q compute         
    [ron@yucca ~]$
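
When the listing is long, it can help to summarize it by job state. The following is a minimal sketch, assuming the default qstat format shown above (two header lines, with the state in the second-to-last column). The function name count_job_states is hypothetical and not part of PBS.

```shell
#!/bin/bash
# count_job_states is a hypothetical helper, not a PBS command.
# Reads default-format qstat output on stdin and prints "<state> <count>" lines.
count_job_states() {
    # Skip the two header lines; the state is the second-to-last field.
    awk 'NR > 2 && NF >= 3 { count[$(NF-1)]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Sample input taken from the listing above.
# On the cluster you would run: qstat | count_job_states
count_job_states <<'EOF'
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
1216.yucca        chrXdata_Filter  flink             18:20:18 R smp
1234.yucca        test-job         ron                      0 Q compute
1235.yucca        test-job         ron                      0 Q compute
EOF
```

For the three sample rows this reports two queued (Q) jobs and one running (R) job.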

    To see detailed information about a particular job, use the qstat -f #### command, where #### is the job number assigned by PBS. If your job is not running, look at the comment line for a hint to the possible cause of the problem.

    [ron@yucca ~]$ qstat -f 1234
    Job Id:
        Job_Name = test-job
        Job_Owner =
        job_state = Q
        queue = compute
        server =
        Checkpoint = u
        ctime = Fri May  9 12:05:50 2014
        Error_Path =
        Hold_Types = n
        Join_Path = n
        Keep_Files = n
        Mail_Points = a
        mtime = Fri May  9 12:05:51 2014
        Output_Path =
        Priority = 0
        qtime = Fri May  9 12:05:51 2014
        Rerunable = True
        Resource_List.ncpus = 1
        Resource_List.nodect = 1
        Resource_List.place = pack
        Resource_List.select = 1:ncpus=1
        substate = 10
        Variable_List = PBS_O_SYSTEM=Linux,PBS_O_SHELL=/bin/bash,
        comment = Not Running: Not enough free nodes available
        etime = Fri May  9 12:05:51 2014
        Submit_arguments = -q compute -N test-job -- /bin/hostname
        executable = <jsdl-hpcpa:Executable>/bin/hostname</jsdl-hpcpa:Executable>
        project = _pbs_project_default
    [ron@yucca ~]$ 
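
Since the comment line is usually the first thing to check for a job that is not running, it can be pulled out of the long listing directly. This is a rough sketch: the function name job_comment is hypothetical (not a PBS command), and it assumes the comment fits on a single line of the qstat -f output.

```shell
#!/bin/bash
# job_comment is a hypothetical helper, not a PBS command.
# Reads "qstat -f" output on stdin and prints the value of the comment attribute.
# Assumes the comment is not wrapped onto a continuation line.
job_comment() {
    sed -n 's/^[[:space:]]*comment = //p'
}

# Sample input taken from the listing above.
# On the cluster you would run: qstat -f 1234 | job_comment
printf '    substate = 10\n    comment = Not Running: Not enough free nodes available\n' | job_comment
```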

  • qdel -- delete PBS jobs
           qdel [-x] [-Wforce | -Wsuppress_email=<N>]
                 job_identifier [job_identifier ...]
           qdel --version
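
Because qdel accepts several job identifiers at once, the IDs can be collected from the default qstat listing. The sketch below is illustrative only: jobs_for_user is a hypothetical helper (not a PBS command), and it assumes the default qstat format, in which the user name is the fourth column from the right.

```shell
#!/bin/bash
# jobs_for_user is a hypothetical helper, not a PBS command.
# Reads default-format qstat output on stdin and prints the IDs of
# the jobs owned by the user given as the first argument.
jobs_for_user() {
    awk -v user="$1" 'NR > 2 && NF >= 5 && $(NF-3) == user { print $1 }'
}

# Sample input taken from the qstat listing earlier on this page.
# On the cluster, review the list first, then pass it to qdel:
#   qstat | jobs_for_user "$USER"
#   qdel $(qstat | jobs_for_user "$USER")
jobs_for_user ron <<'EOF'
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
1216.yucca        chrXdata_Filter  flink             18:20:18 R smp
1234.yucca        test-job         ron                      0 Q compute
1235.yucca        test-job         ron                      0 Q compute
EOF
```

Printing the IDs before deleting is a deliberate choice here, since qdel removes running jobs as well as queued ones.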