SUN Grid Engine

Introduction

The SUN Grid Engine is a batch system. Users submit jobs, which are placed in queues and then executed depending on the system load, the opening hours of the queues, and the job priority.

Usage policy

It is recommended that all jobs longer than a couple of minutes be enqueued. This allows a system administrator to suspend jobs if, for some reason, another user wants to do timing experiments. In the present configuration, users can still log on to Albireo and execute code at any time. We do, however, strongly recommend that users leave Albireo idle at night and on weekends to allow enqueued jobs to take full advantage of the machine. If, for some reason, this policy does not work, other measures will be taken to give enqueued jobs exclusive use of the machine.

It is also suggested that long jobs (in terms of execution time) be given a low priority to allow short jobs to pass.
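For example, a long job could be submitted with a clearly negative priority; the script name below is just a placeholder:

albireo$ qsub -p -200 long_simulation.sh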

Queues

The system has several queues defined, but for normal usage only two are open: one for MPI jobs and one for multithreaded/serial jobs.

Job suspension

If a job is still running and the queue closes, the system will suspend the job until the queue opens again.

Tutorial

For normal usage the following commands should be sufficient. There is also a very nice graphical X-based tool for submission and monitoring called qmon; see the manual for further details.

qsub

Submits a job to the system, with the following options (edited):
Albireo$ qsub -help
CODINE 5.2
usage: qsub [options]
   [-a date_time]                           request a job start time
   [-clear]                                 skip previous definitions for job
   [-cwd]                                   use current working directory
   [-C directive_prefix]                    define command prefix for job script
   [-e path_list]                           specify standard error stream path(s)
   [-h]                                     place user hold on job
   [-help]                                  print this help
   [-j y|n]                                 merge stdout and stderr stream of job
   [-m mail_options]                        define mail notification events
   [-now y[es]|n[o]]                        start job immediately or not at all
   [-M mail_list]                           notify these e-mail addresses
   [-N name]                                specify job name
   [-o path_list]                           specify standard output stream path(s)
   [-p priority]                            define job's relative priority
   [-pe pe-name slot_range]                 request slot range for parallel jobs
   [-q destin_id_list]                      bind job to queue(s)
   [-v variable_list]                       export these environment variables
   [-V]                                     export all environment variables
   [-@ file]                                read commandline input from file
   [{script|-} [script_args]]

date_time               [[CC]YY]MMDDhhmm[.SS]
destin_id_list          queue[ queue ...]
job_id_list             job_id[,job_id,...]
mail_address            username[@host]
mail_list               mail_address[,mail_address,...]
mail_options            `e' `b' `a' `n' `s'
path_list               [host:]path[,[host:]path,...]
priority                -1023 - 1024
slot_range              [n[-m]|[-]m] - n,m > 0
variable_list           variable[=value][,variable[=value],...]
Examples:
albireo$ qsub -q weekend -V -p -23 super_code.sh

Submits the script super_code.sh to the weekend queue, exporting all environment variables (-V) and setting the job priority to -23 (the valid range is [-1023, 1024]).

albireo$ qsub -a 103004.45 -cwd -q night -pe mpi 2-12 -m e -M henrikl@tdb.uu.se -V super_mpi_code.sh

Submits the script super_mpi_code.sh to the night queue (-q night) using the parallel environment mpi with 2-12 processors, and with the following extras: start the job at 04:45 on the 30th of October (-a 103004.45), use the current working directory (-cwd), mail me (-M henrikl@tdb.uu.se) when the job ends (-m e), and export all login environment variables (-V).

These flags can also be stated in the job script file itself, where they are passed using the sentinel #$. See the example scripts below.
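A minimal sketch of such a script, here called script.sh, could look like this (the executable path is just a placeholder):

#!/bin/sh
#$ -q weekend
#$ -p -23
#$ -cwd
#$ -V
/home/user/bin/super_code

It is then submitted without any extra flags: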

albireo$ qsub script.sh

To avoid mishaps, always make sure that the full path to your executable is supplied. You can also use the flag -V.

Mail options

The mail options can be clustered (-m bes) and mean:

'e'
Mail at the end of a job
'b'
Mail at the beginning of a job
'a'
Mail when a job is aborted
'n'
Never mail
's'
Mail when a job is suspended
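For example, to get mail both when the job begins and when it ends, reusing the address and script from the examples above:

albireo$ qsub -m be -M henrikl@tdb.uu.se super_code.sh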

Sample scripts

These sample scripts should cover the basic needs. Just edit the template files below.

Example:

#!/bin/sh 
#
# (c) 2000 Sun Microsystems, Inc.
#
# All commands use the sentinel #$
#
# ---------------------------
# User needs to customize the following items 
# enclosed by <>
#
#$ -N SuperSimulation
#$ -S /bin/sh
#$ -o super.output
#$ -e super.error
#$ -M samuel@tdb.uu.se
#$ -m es
# ---------------------------
#
# Execute the job from  the  current  working  directory
#$ -cwd
#
# Parallel environment request
# ---------------------------
# User needs to customize the following items 
# enclosed by <>
#
#$ -l cre
#
# Number of CPUs requested: use N or N-M
#
# Example: 2 or 4-22, where the latter gives 4 to 22 CPUs
#          depending on the number of idle CPUs at
#          execution time
#
#$ -pe mpi 8 
# ---------------------------
#
# All resources are defined here
#
# Choose your queue
#
#$ -q albireo.cre
#
# Job priority
#
#$ -p 0
#
# ---------------------------
#
# Put compilations here
#
# ---------------------------
#
# Execution
#
# ---------------------------
#
# User needs to customize the following items 
# enclosed by <>
#
#

/opt/codine/mpi/MPRUN -np $NSLOTS -Mf $TMPDIR/machines ./supersimulation

# ---------------------------

Note:

Make sure that the scripts are executable by giving the command:
albireo$ chmod u+x script_name

qstat

Monitors the queues.

The default behavior is to list all jobs without queue status information. If you supply the flag -f, you will also see the queue status and pending jobs.
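Typical invocations (the output will of course depend on what is currently queued):

albireo$ qstat        # list all jobs, no queue status
albireo$ qstat -f     # full listing: queue status and pending jobs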

qdel

Deletes jobs from the queues.

The user must supply the job ID given at submission or shown by qstat.
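For example, if the job was reported with ID 231:

albireo$ qdel 231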

qhost

Monitors the system. Example:

albireo$ qhost
HOSTNAME             ARCH       NPROC LOAD  MEMTOT   MEMUSE   SWAPTO   SWAPUS  
--------------------------------------------------------------------------------
global               -          -     -     -        -        -        -       
albireo              solaris    30    17.04 7.5G     2.8G     4.0G     32.0M  

Basic information

The SUN Grid Engine software is a batch system where jobs (formulated as shell scripts) are put into queues and executed when the resource requirements of the job are fulfilled. Jobs are sorted in FIFO (first-in, first-out) fashion according to their priority; an ordinary user can only lower the priority of a job. Jobs not eligible for execution are placed in the pending job pool. Jobs are also subject to equal-share scheduling, which means that within each priority level jobs are interleaved among the different users. This prevents a user from "pushing" other users downwards by submitting a long series of jobs (e.g. from a shell script).

When a job has ended, the console output of the script is put into files in the user's home directory. The file names are composed of the job script file name, a dot, an "o" for the stdout file or an "e" for the stderr file, and finally the unique job ID. These files can be merged and placed in other locations by supplying the right flags to qsub, as described above. So if a user submits the job "simple.sh", the system answers: your job 231 ("simple.sh") has been submitted. When the job has been executed, the output files will be called "simple.sh.o231" and "simple.sh.e231".
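The files can then be inspected with ordinary shell commands, for example:

albireo$ cat ~/simple.sh.o231    # standard output of the job
albireo$ cat ~/simple.sh.e231    # standard error of the job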

Parallel jobs

The system has some, still limited, support for parallel programs through so-called parallel environments. A queue can be defined as a parallel queue containing a number of slots, and users can allocate slots until all of them are consumed. The current configuration allows four simultaneous users with no restrictions on the allocation of parallel slots. The number of slots (processors) allocated is passed to the job script through the NSLOTS environment variable.

You can also request a range of slots, e.g. 3-13: the job will start once at least 3 slots are free in the queue and will be given at most 13, depending on how many slots are left at that time.
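As a sketch, based on the sample script above, such a range request could look like this (the queue, resource and program names are taken from that script and may need adjusting):

#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -q albireo.cre
#$ -l cre
#$ -pe mpi 3-13
# NSLOTS holds the number of slots actually granted (between 3 and 13 here)
/opt/codine/mpi/MPRUN -np $NSLOTS -Mf $TMPDIR/machines ./supersimulation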

Manuals

General information on SUN Grid Engine (formerly CODINE 5.1 by Gridware) is available at http://www.sun.com/gridware. The Grid Engine software is aimed at controlling the computational resources in a heterogeneous environment. For our purposes, we use the software as a batch system incorporating only one host (Albireo).

The user manual is also available.

Questions

Direct specific questions to me, henrikl@tdb.uu.se. Use the same address if you would like to discuss the policies or suggest extensions and changes to the system.

HPC utvecklingskonto
Last modified: Wed Aug 23 14:27:23 MET DST 2000