Wayne State University


High Performance Computing Services


HOW TO SCHEDULE JOBS ON THE WSU GRID WITH PBS PRO
THIS PAGE IS NOT A REPLACEMENT FOR THE MANUAL! THIS PAGE IS SUPPLEMENTARY MATERIAL AND SHOULD BE USED ONLY AFTER READING THE MANUAL!
PBS Pro Users Guide - Version 13.0
PBS Pro Reference Guide - Version 13.0
Location of Files
For users who maintain their own environments and shells, the PBS Pro binaries and man pages are installed in /usr/pbs/bin/ and /usr/pbs/man, respectively.
Information about WSU’s Environment
There are several different queues in the WSU Grid. Some research groups have purchased their own machines and, as a result, have their own queues. The default queue for general Grid use is named 'wsuq'; it is accessible to all users and routes Grid jobs to the other queues on the system. You will only have access to specific queues, depending on which research group you belong to. Mass queues are used to submit large numbers of jobs: ASXQ has a job submission limit of 32 and MTXQ a limit of 12, but their mass counterparts MASXQ and MMTXQ are unlimited. The queues in the WSU environment are listed below.

Queues in the Grid: WSUQ, MTXQ, ASXQ, PCCQ, VIIQ, ZFHQ, DADQ, HBXQ, GEDQ
Queue Information
QUEUE | WALL TIME | MAX_QUEUEABLE | MAX_USER_RUN | MAX_USER_RES.NCPUS | PRIORITY OVER | OPEN TO        | NODES
AMXQ  | NO        | NO            | NO           | NO                 | NO            | All Users      | AMX1-AMX4
EAMXQ | NO        | NO            | NO           | NO                 | AMXQ          | Majumder Group | AMX1-AMX4
ASXQ  | NO        | NO            | 32           | 32                 | MASXQ         | All Users      | ASX1-ASX40
MASXQ | NO        | NO            | NO           | NO                 | NO            | All Users      | ASX7-ASX40
DADQ  | NO        | NO            | NO           | NO                 | NO            | All Users      | DAD1-DAD8
GEDQ  | NO        | NO            | 24           | NO                 | NO            | All Users      | GED1-GED2
EGEDQ | NO        | NO            | 48           | NO                 | GEDQ          | Dyson Group    | GED1-GED2
HBXQ  | NO        | 384           | 192          | NO                 | NO            | Schlegel Group | HBX1-HBX6
EHBXQ | NO        | 384           | 192          | NO                 | HBXQ          | Schlegel Group | HBX1-HBX6
MTXQ  | NO        | NO            | 32           | 32                 | MMTXQ         | All Users      | MTX1-MTX56
MMTXQ | NO        | NO            | NO           | NO                 | NO            | All Users      | MTX13-MTX56
PCCQ  | NO        | NO            | 128          | NO                 | NO            | All Users      | PCC1-PCC35
EGACQ | NO        | NO            | 128          | NO                 | PCCQ          | Cisneros Group | PCC1-PCC28, PCC35
EVYCQ | NO        | NO            | 128          | NO                 | PCCQ          | Chernyak Group | PCC29-PCC34
EVIIQ | NO        | NO            | 128          | NO                 | VII nodes     | Dong Group     | VII1-VII32
WSUQ  | NO        | NO            | NO           | NO                 | NO            | All Users      | RND12-15, VII1-32, MTX1-56, ASX1-40
ZFHQ  | NO        | 13            | 13           | 8                  | NO            | All Users      | ZFH1-ZFH17
ZFLQ  | NO        | 104           | 104          | 104                | ZFHQ          | Huang Group    | ZFH1-ZFH17
EZFHQ | NO        | NO            | 80           | NO                 | ZFHQ, ZFLQ    | Huang Group    | ZFH1-ZFH17
Connecting to the Grid
To connect to the WSU Grid, users must log in to the master node (grid.wayne.edu) with their access ID and password, for example:

ssh AccessID@grid.wayne.edu

Individual compute nodes cannot be accessed directly unless a user already has a job running on that node. Compute nodes can only be reached through the PBS scheduler, with either a batch job or an interactive (shell) job.
Basic Commands (full list - beginning on page 21 of the user guide)
qmgr                           Examines the queues
qselect -u AccessID            Lists a user's jobs
qsub                           Submits a job
qstat                          Lists jobs in the queue
qme                            Lists your jobs in the queue
freenodes                      Lists the free compute nodes
qdel JobID                     Deletes a specific job
qdel $(qselect -u AccessID)    Deletes all of a user's jobs
qstat -n1                      Lists running jobs and the nodes that they are on
Resources and Options (full lists - beginning on page 35 and beginning on page 62 of the user guide)
-a date_time          Defers execution
-e path               Redirects error files
-I                    Declares that the job is an interactive-batch job
-l resources_list     Requests resources:
                        ncpus - number of CPUs (processors) required by the job
                        nodes - number and/or type of nodes needed by the job
                        walltime - maximum amount of real (wall-clock elapsed) time the job needs to execute (run)
-l mem=               Specifies total job memory required
-m MailOptions        Specifies e-mail notification
-M user_list          Sets the e-mail recipient list
-o path               Redirects output files
-q destination        Specifies the queue
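As an illustration, a job file combining several of these options might begin as below; the queue name, resource limits, and e-mail address are placeholders chosen for this sketch, not recommendations:

```shell
#!/bin/bash
# Hypothetical job-file header combining common qsub options;
# every value below is a placeholder to adjust for your own job.
#PBS -q wsuq
#PBS -l ncpus=1
#PBS -l walltime=02:00:00
#PBS -l mem=2gb
#PBS -m ea
#PBS -M at8036@wayne.edu

# The commands to run come after the directives:
echo "job commands start here"
```

Note that to bash the #PBS lines are ordinary comments, so the directives only take effect when the file is submitted with qsub.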
Environment Variables
$TMPDIR          Unique to every job; created on all nodes in $PBS_NODEFILE at the start of the job and erased at the end. Value: pbs.XXXXX.pbs (XXXXX = the job number)
$PBS_JOBID       Unique to every job. Value: XXXXX.pbs (XXXXX = the job number)
$PBS_JOBNAME     The name of the job. Value: JOBNAME (JOBNAME = the name of the job)
$PBS_NODEFILE    A list of the nodes allocated to the job, one node name (NODENAME1, NODENAME2, ...) per line
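To make the $PBS_NODEFILE format concrete, the sketch below fakes the file for a small job and counts what PBS would have allocated; the node names are made up, and inside a real job the file already exists at the path in $PBS_NODEFILE:

```shell
# Fake node file: PBS writes one line per allocated CPU, so a
# 2-node x 2-cpu job yields four lines (node names are made up)
PBS_NODEFILE=$(mktemp)
printf 'asx1\nasx1\nasx2\nasx2\n' > "$PBS_NODEFILE"

echo "CPUs allocated:  $(wc -l < "$PBS_NODEFILE")"
echo "Nodes allocated: $(uniq "$PBS_NODEFILE" | wc -l)"
```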
Basic Command Examples
/export/usr/scripts/pbsqueueaccess.pl    Lists queue availability
qstat -q                                 Lists the status of all queues
qstat -n                                 Lists jobs and the nodes they are assigned to
qdel 991.pbs                             Kills job number 991
Example:
In this example a user examines the jobs in the queue, and then deletes a running job.

-wsush-2.05b$ qstat
WSU Grid Master Node
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
55.grid StdIN ac6531 00:00:00 R workq
-wsush-2.05b$ qdel 55
-wsush-2.05b$ qstat
-wsush-2.05b$

Notice that when no jobs are running, qstat returns nothing.
Submitting Specific Types of Jobs
The examples below are samples of possible job files, which contain the script used to run a job. A job file can define a shell, options, resources, and the job to be run, and is submitted with the qsub command. Keep in mind that these are examples and can be modified to better suit your needs. When creating a job file, be sure that it contains no unwanted special characters or line breaks.

Single Threaded Applications

#!/bin/bash

#PBS -l ncpus=1
#PBS -m ea
#PBS -M at8036@wayne.edu
#PBS -o /wsu/home/at/at80/at8036/pbs/job_name/completed/output_file
#PBS -e /wsu/home/at/at80/at8036/pbs/job_name/completed/error_file

cd $TMPDIR
cp /wsu/home/at/at80/at8036/pbs/job_name/script_file $TMPDIR
cp /wsu/home/at/at80/at8036/pbs/job_name/data_file $TMPDIR

$TMPDIR/script_file

mv $TMPDIR/* /wsu/home/at/at80/at8036/pbs/job_name/completed/.
In this example the user requests one processor, asks that e-mail be sent to at8036@wayne.edu when the job completes, and directs any output or errors to the files 'output_file' and 'error_file'. The script changes to $TMPDIR, the per-job directory PBS creates under /tmp, copies the script and data files into it, and executes 'script_file'. Finally it moves everything in $TMPDIR back to the user's home directory, and the job ends.
Shared Memory Applications

#!/bin/bash

#PBS -l ncpus=2
#PBS -m ea
#PBS -M at8036@wayne.edu
#PBS -o /wsu/home/at/at80/at8036/pbs/job_name/completed/output_file
#PBS -e /wsu/home/at/at80/at8036/pbs/job_name/completed/error_file

cd $TMPDIR
cp /wsu/home/at/at80/at8036/pbs/job_name/script_file $TMPDIR
cp /wsu/home/at/at80/at8036/pbs/job_name/data_file $TMPDIR

$TMPDIR/script_file

mv $TMPDIR/* /wsu/home/at/at80/at8036/pbs/job_name/completed/.
In this example the user requests two processors. Since the 'nodes' keyword is not used, both processors will be on the same computer and PBS assumes the job uses shared memory. As before, e-mail is sent to at8036@wayne.edu when the job completes, and output and errors are written to 'output_file' and 'error_file'. The script changes to $TMPDIR, the per-job directory PBS creates under /tmp, copies the script and data files into it, executes 'script_file', and moves everything in $TMPDIR back to the user's home directory before the job ends.
MPICH Applications

#!/bin/bash

#PBS -l ncpus=8
#PBS -l nodes=2:ppn=4
#PBS -m ea
#PBS -M at8036@wayne.edu
#PBS -o /wsu/home/at/at80/at8036/pbs/job_name/completed/output_file
#PBS -e /wsu/home/at/at80/at8036/pbs/job_name/completed/error_file

/export/arch/i386/mpich-1.2.7/gcc/ch_p4/bin/mpirun -machinefile $PBS_NODEFILE -np 8 /wsu/home/at/at80/at8036/pbs/job_name/script_file

In this example the user requests eight processors spread over two nodes with four processors each. Since the 'nodes' keyword is used, the processors will be on different computers and PBS assumes a 'message passing' job. E-mail is sent to at8036@wayne.edu when the job completes, and output and errors are written to 'output_file' and 'error_file'. The script executes 'script_file' through the MPICH mpirun wrapper. Notice that the machinefile a user would typically define for an MPICH job is replaced with $PBS_NODEFILE, the dynamic machinefile created by PBS.
MPICH2 Applications

#!/bin/bash

#PBS -l ncpus=8
#PBS -l nodes=2:ppn=4
#PBS -M at8036@wayne.edu
#PBS -o /wsu/home/at/at80/at8036/pbs/job_name/completed/output_file
#PBS -e /wsu/home/at/at80/at8036/pbs/job_name/completed/error_file

cd $TMPDIR
cp /wsu/home/at/at80/at8036/pbs/job_name/script_file $TMPDIR
cp /wsu/home/at/at80/at8036/pbs/job_name/data_file $TMPDIR

cat $PBS_NODEFILE | uniq | sed -e 's/$/.grid.wayne.edu/g' > /wsu/home/at/at80/at8036/mpd.hosts

/wsu/arch/x86_64/mpich2-1.0.5p4/bin/mpdboot -n 2
/wsu/arch/x86_64/mpich2-1.0.5p4/bin/mpirun -np 8 $TMPDIR/script_file
In this example the user again requests eight processors over two nodes with four processors each, names at8036@wayne.edu as the e-mail recipient, and directs output and errors to 'output_file' and 'error_file'. The script changes to $TMPDIR, the per-job directory PBS creates under /tmp, and copies the script and data files into it. The pipeline then builds the mpd.hosts file in the home directory by collapsing the duplicate entries in $PBS_NODEFILE and appending the domain to each node name. The mpdboot command starts MPD daemons on the two nodes assigned by PBS, and 'script_file' is executed through the MPICH2 mpirun wrapper.
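The mpd.hosts pipeline is the easiest line in this job file to misread, so here is the same pipeline run against a fake node file; on the Grid you would read the real $PBS_NODEFILE and write to your home directory as the job file shows:

```shell
# Fake node file for a 2-node x 4-cpu job (one line per allocated CPU;
# node names are made up for this sketch)
PBS_NODEFILE=$(mktemp)
printf 'mtx1\nmtx1\nmtx1\nmtx1\nmtx2\nmtx2\nmtx2\nmtx2\n' > "$PBS_NODEFILE"

# uniq collapses the repeated entries to one per node, and sed appends
# the domain, giving mpdboot one full hostname per line
MPD_HOSTS=$(mktemp)
uniq "$PBS_NODEFILE" | sed -e 's/$/.grid.wayne.edu/g' > "$MPD_HOSTS"
cat "$MPD_HOSTS"
# mtx1.grid.wayne.edu
# mtx2.grid.wayne.edu
```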
Interactive Applications

qsub -I -q wsuq -l ncpus=1

cd $TMPDIR
cp /wsu/home/at/at80/at8036/pbs/job_name/script_file $TMPDIR
cp /wsu/home/at/at80/at8036/pbs/job_name/data_file $TMPDIR

./script_file

mv $TMPDIR/* /wsu/home/at/at80/at8036/pbs/job_name/completed/.
In this example the user requests one processor and that the job be run on the wsuq. This command opens a shell on a node with an available processor, which is useful for jobs that have graphical interfaces or require user interaction. Once the shell opens, the user changes to $TMPDIR, copies the script and data files into it, executes 'script_file', and moves all files in $TMPDIR back to the home directory before the job ends.
Troubleshooting

To troubleshoot a job there are a few very useful commands.

qstat -f Prints all info about all jobs
qstat -f XXXX Prints all info about job XXXX
qstat -n Displays nodes allocated to any running jobs
qstat -Q Displays status of queues

pbsnodes -l Lists nodes that are down
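When a job sits in the queue, the 'comment' field in the qstat -f output usually explains why. The snippet below filters a stand-in for that output (the job number and comment text are made up for this sketch); on the Grid you would pipe the real command instead, e.g. qstat -f 991.pbs | grep -iE 'job_state|comment':

```shell
# Stand-in for `qstat -f 991.pbs` output; real output has many more fields
sample='Job Id: 991.pbs
    job_state = Q
    comment = Not Running: waiting for resources'

# Keep only the lines that usually explain a stuck job
printf '%s\n' "$sample" | grep -iE 'job_state|comment'
```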
Computer Center Services * 5925 Woodward Avenue Detroit, Michigan 48202