TDB HPC User Guide
This is the official user's guide to the new high-performance computer systems at the Department of Information Technology. Guides for the other systems at the department can be found here.
This user guide is formatted as a single HTML file which can be printed or viewed in any HTML 4 compliant browser. Use the quick links in the left margin to jump into the guide.
Usage tip
You can access this guide online on tee using lynx:
tee$ lynx http://www.it.uu.se/datordrift/maskinpark/teedeebee
Use CTRL-A in lynx to jump to the quick links (top of page).
Introduction
TeeDeeBee is a parallel computer system at the Department of Information Technology, Uppsala University.
This system, delivered in 2001, is part of an ongoing collaborative research program with Sun Microsystems® established in 1999. The acquisition of the computer system was made possible through a grant from the Knut and Alice Wallenberg Foundation.
System configuration
The system consists of three Sun Fire 6800 servers and has a theoretical peak performance of 72 Gflop/s.
Each server is configured as follows:
- 16 UltraSPARC III CPUs at 750 MHz
- 8 MB of L2 Cache memory
- 16 GB of primary RAM
- Two 18GB drives
This gives a total of 48 CPUs and 48 GB of RAM. The system will be replaced by a more powerful Sun Fire 15k system at the beginning of 2002.
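The quoted peak figure can be checked with a short calculation. This is a sketch under one assumption that is not stated in this guide: each UltraSPARC III CPU is taken to complete two floating-point operations per clock cycle (one add and one multiply).

```latex
\[
R_{\text{peak}}
  = 3~\text{servers} \times 16~\frac{\text{CPUs}}{\text{server}}
    \times 0.75~\frac{\text{Gcycles}}{\text{s}}
    \times 2~\frac{\text{flop}}{\text{cycle}}
  = 72~\text{Gflop/s}
\]
```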
The UltraSPARC-III microprocessor
The UltraSPARC-III is a 64-bit 4-way issue superscalar pre-fetching microprocessor, featuring:
- Six execution pipelines (2 integer, 2 FP/VIS, 1 load/store, 1 addressing)
- 16 integer registers, 32 FP/VIS registers
- Latency: 3- to 4-cycle integer/FP/VIS add, subtract, logical and multiply
- Latency: 17- to 29-cycle FP divide, square root
- 14-stage, non-stalling pipeline
- Well managed 16k-entry branch prediction table
- L1-cache: 64 kB 4-way Data, 32 kB 4-way instruction, 2 kB pre-fetch, 2 kB Write
- 150 MHz bus clock frequency
- Bus bandwidths: 2.4 GB/s processor-to-memory, 4.8 GB/s bus-to-memory
- 512 entry TLB
Sun Fire system architecture
The core of the Sun Fire architecture is the Sun Fireplane system interconnect. The Sun Fireplane is a packet switched broadcast medium capable of 9.6 GB/s.
Four US-III CPUs are packaged on a CPU/memory board together with the L2 cache and primary RAM. One Sun Fire 6800 machine can have up to 6 boards whereas the Sun Fire 15k can have up to 18 boards.
Six boards form a uniform access snooping (broadcast) coherent domain, the Sun Fire 6800. Three such domains can be linked through a scalable shared memory (SSM) device and a crossbar using point-to-point directory based coherence to form a coherent non-uniform access domain, the Sun Fire 15k.
Each snooping domain (SMP) has a peak data bandwidth of 9.6 GB/s whereas the domain to domain bandwidth (non-local accesses through a crossbar) is limited to 2.4 GB/s.
Programming environment
Apart from serial code, the system supports message passing (MPI) using the Sun HPC ClusterTools 4 software as well as explicit shared memory programming using Solaris threads, POSIX threads, OpenMP, Sun or Cray style directives.
The system runs the latest version (8) of the Solaris operating system. We use the Sun Grid Engine batch system software for resource management.
There are three nodes: Tee, Dee and Bee. Dee and Bee are accessed only by the batch system; do not log on to them. Log on to Tee using a secure shell (SSH):
bash$ ssh tee.it.uu.se
Accounts are managed by Henrik Löf, henlof@tdb.uu.se
Compilers
Currently several compiler suites are installed:
- Sun Forte 6.1, /opt/SUNWspro/bin
- Sun Forte 6.2, /scr1/compilers/WS6U2/SUNWspro/bin
- GNU GCC v2.95.2, /it/sw/gnu/bin
Sun Forte 6.2 is the default compiler and supports the following languages:
- ANSI C (bash$ cc)
- C++ (bash$ CC)
- Fortran 77 (bash$ f77)
- Fortran 90/95 (bash$ f90)
To use the Sun Forte 6.1 compiler, put /opt/SUNWspro/bin first in your $PATH.
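A minimal sketch of that PATH change in Bourne shell syntax, as it might appear in a login script such as ~/.profile (the choice of file is an assumption; the directory is the one listed above):

```sh
# Put the Sun Forte 6.1 directory first so that its cc, CC, f77 and f90
# are found before the default Forte 6.2 versions.
PATH=/opt/SUNWspro/bin:$PATH
export PATH

# Print the first PATH entry to confirm the change.
echo "$PATH" | cut -d: -f1
```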
Minimal effort compiling (Sun Forte)
- 32-bit code: -fast -xtarget=ultra3 -xarch=v8plusb
- 64-bit code: -fast -xtarget=ultra3 -xarch=v9b
The -fast flag is a macro. In version 6.2 it expands to:
- Fortran: -xO5 -xpad=local -xvector=yes -xprefetch=auto,explicit -f -fsimple=2 -fns=yes -ftrap=common -xlibmil -xlibmopt -dalign -xdepend
- C: -fns -fsimple=2 -fsingle -ftrap=%none -xalias_level=basic -xbuiltin=%all -xlibmil -xmemalign=8s -xO5
- C++: -xO5 -fsimple=2 -fns=yes -ftrap=%none -xlibmil -xlibmopt -xbuiltin=%all -dalign
For maximum effect you should also link with -fast. The Sun compilers all follow the "rightmost flag wins" rule: when two flags conflict, the one furthest to the right on the command line takes effect. If you want to compile with all the options in the -fast macro but with a lower optimization level, compile with -fast -xO4.
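As an illustration, a build script for 64-bit code might look like the sketch below. The source file prog.c and the RUN dry-run switch are hypothetical and not part of the Sun tools. By default the script only prints the commands it would run.

```sh
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# Start the script with RUN set to the empty string to execute them.
RUN=${RUN-echo}

# 64-bit flags from this guide. -xO4 comes last, so it overrides
# the -xO5 that the -fast macro sets.
CFLAGS="-fast -xtarget=ultra3 -xarch=v9b -xO4"

# Compile, then link with the same flags (prog.c is a placeholder).
$RUN cc $CFLAGS -c prog.c
$RUN cc $CFLAGS -o prog prog.o
```

Keeping the flags in one variable makes it easy to pass the same set to both the compile step and the link step.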
A quick reference is obtained using the flag -flags. For a more detailed description of the different flags and their effect click here.
Useful options
- Allow loop interchange and loop optimizations: -xdepend
- True 64-bit load/store and alignment: -dalign
- Explicit in-lining: -xinline=my_func
- Interprocedural optimizations: -xipo=1
Math libraries
The Sun Forte compilers supply optimized and in-lined versions of the libm library:
- In-lined libm: -xlibmil
- Optimized libm: -xlibmopt
Through the Sun Performance Library, Sun also supplies:
- BLAS1, BLAS2, BLAS3
- LAPACK v3.0
- LINPACK v3.0
- FFTPACK
- VFFTPACK
To use the library, compile with:
-dalign -xlic_lib=sunperf
Fortran 90 users should also include the module sunperf (USE SUNPERF). The library automatically switches to a parallel version if the program being compiled is shared-memory parallelized.
A user's guide for the Sun Performance Library can be found here (docs.sun.com).
Message passing using MPI
This tutorial only shows how to use Sun MPI.
Compilation using Sun MPI and Sun Forte compilers
To simplify MPI compilation, Sun has included compiler front-ends in the ClusterTools package to set the correct paths etc. The procedure is relatively simple:
Include the MPI headers:
- Fortran: INCLUDE 'mpif.h'
- C/C++: #include <mpi.h>
Compile using the front-ends:
- Fortran 77: mpf77 <flags> -dalign -lmpi
- Fortran 90: mpf90 <flags> -dalign -lmpi
- C: mpcc <flags> -lmpi
- C++: mpCC <flags> -mt -lmpi
If your MPI code is multi-threaded you should replace -lmpi with -lmpi_mt.
Online MPI documentation
All the MPI routines are documented in man pages, see tee$ man mpi.
External documentation
Shared memory programming
There are essentially two different ways to parallelize code using threads.
- Explicit multi-threading by calling OS primitives
- Parallelization using compiler directives or pragmas
Using Solaris threads and POSIX threads
There are two packages available for explicit multi-threading in Solaris:
Native Solaris threads
- Symbols: define _REENTRANT
- Linking: -lthread
- Example: bash$ cc [flags] file... -D_REENTRANT -lthread
Portable POSIX 1003.1c threads (pthreads)
- Symbols: define _POSIX_C_SOURCE=199506L
- Linking: -lpthread
- Example: bash$ cc [flags] file... -D_POSIX_C_SOURCE=199506L [-lposix4] -lpthread
The [-lposix4] flag is for the POSIX.1b-1993 real-time extensions such as semaphores.
You can also use the macro -mt, which expands to -D_REENTRANT -lthread, when compiling native threads code. This flag is required when compiling C++ code to get the correct linking.
Using OpenMP
The Sun Forte C compiler currently supports the OpenMP 1.0 standard and the Sun Forte Fortran compiler supports the OpenMP 2.0 standard.
- Include files and modules (only present in the 2.0 standard)
  - Include file for runtime functions: INCLUDE 'omp_lib.h'
  - Alternatively, Fortran 90 module: USE omp_lib
- OpenMP sentinels
  - C: #pragma omp
  - Fortran 77: C$OMP
  - Fortran 90: !$OMP
- Compiling
  - C: -xopenmp=parallel
  - Fortran: -Xlist -openmp
Set the variable OMP_NUM_THREADS or use runtime library functions to set the number of threads.
External links
- Solaris 8 Multi-threaded Programming Guide (docs.sun.com)
- Fortran User's Guide (docs.sun.com)
- Forte C 6 update 2 / Sun WorkShop 6 update 2 C Compiler User's Guide (docs.sun.com)
- OpenMP standard (www.openmp.org)
Running code interactively
The node tee is available for development and interactive use without limitations. You can also submit batch jobs to tee, but exclusive usage cannot be guaranteed in that case. For exclusive usage, timings and the like, use the batch system.
For MPI code, you must specify that it should run on tee:
tee$ mprun -np <num> -R "name=tee" <program>
Once the program is running you can monitor the MPI job using several tools:
- mpps: Shows your currently running MPI jobs; use the flag -e to see all running MPI jobs.
- mpkill <job_id>: Kills a running MPI job using the ID given to you at run time or from mpps.
Use the above tools only for MPI jobs started interactively on tee.
Tools for monitoring system activity
- java Jmpstat -u <sec>: Gives a graphical view of the load on each individual CPU
- mpstat <interval> <count>: Gives detailed per-processor information
- top: Shows the most CPU-consuming processes on the system
Development tools
There are several tools included in the different Sun software packages.
- Sun Workshop, tee$ workshop: A complete environment for software development including build tools, source browsing, debugging, visualisation and performance profiling.
- Prism, Prism -np <num_cpus> <program>: Environment for debugging, visualisation and profiling of Sun MPI programs.
- Sun S3L, parallel math libraries: Large range of MPI parallelized solvers (ScaLAPACK..)
A more detailed description as well as tutorials will be posted here later.
Using the batch system
To allow a fair and better usage of the system we use a resource manager to coordinate user demands. We use the Sun GridEngine software.
GridEngine basics
- Jobs are submitted as scripts to a central pool of jobs on the master host (tee).
- The master senses its execution hosts (bee, dee) and schedules jobs from the pool for execution.
- Jobs are put into different queues according to their resource requirements.
- Jobs are executed from the queue in the order established by their priority.
Each host has one queue for serial/multi-threaded jobs and one for MPI jobs.
If the resource requirements cannot be fulfilled, the job will be pending (waiting) for its resources. This can also happen if you specify resources that can never be fulfilled. The job will then stay in the pending state until it is removed or changed.
Submitting jobs
- Jobs (scripts) are submitted using the command qsub.
- Job options can be passed from the command line, from inside the job script, or both.
- Each job has a unique ID and a user-definable name.
Qsub options
- Output file, stdout (default: [job_name].o[job_id]): -o filename
- Output file, stderr (default: [job_name].e[job_id]): -e filename
- Join output to stdout file: -j y
- Start job at a specific time or date: -a MMDDhhmm.ss
- Start job from current working directory (default: $HOME): -cwd
- Set job name: -N name
- Set job priority (valid numbers are -1023 to 1024): -p priority
- Export current environment variables (default: no variables): -V
- Mail user(s): -M user[@host],...
- Mail options: -m b|e|a|s|n,...
  - b - Mail at the beginning of job
  - e - Mail at the end of job
  - a - Mail at the abortion of job
  - s - Mail at the suspension of job
  - n - Never mail (default)
- Choose queue: -q queue_name,queue_name,...
Example:
tee$ qsub -N myjob -cwd -j y -o myjob.out -M henlof@tdb.uu.se -m b,e,s -q bee.q,dee.q my_job_script.sh
Submitting parallel jobs
GridEngine uses the concept of a parallel environment, which defines how a parallel job should be run. Currently the system supports two parallel environments: mpi and openmp.
- MPI qsub option: -pe mpi [num_cpus]
- OpenMP qsub option: -pe openmp [num_cpus]
The [num_cpus] parameter can be an integer or an interval. If the parameter is an interval, as in -pe mpi 4-12, the scheduler tries to allocate at least 4 CPUs and at most 12.
GridEngine uses the term slots for CPUs. Hence, each queue has as many slots available as there are CPUs in the execution host. If there are several queues on the host, the slots are shared. The number of slots allocated by the system is passed through the environment variable NSLOTS.
Important:
To allow correct execution of MPI jobs you must use the flag -l cre to place MPI jobs in the corresponding *.cre queue.
Writing batch scripts
A batch script is a text file beginning with #!/bin/sh and containing commands, one command per row.
Example:
#!/bin/sh
#
# This is a comment
#
cd $HOME/forsking/helmholtz
./helm
The most convenient way to use the batch system is to embed the qsub options in the batch script. Qsub recognizes all rows starting with #$ as options.
Example:
#!/bin/sh
#
# All commands use the sentinel #$
#
#$ -N CRE_test
#$ -j y
#$ -o CRE_test.output
#$ -cwd
#$ -M user@tdb.uu.se
#$ -m b,e,s
#
#$ -l cre
#$ -pe mpi 6-12
#
#$ -q tee.cre,dee.cre,bee.cre
#
#$ -p 0
#
/opt/codine/mpi/MPRUN -np $NSLOTS -Mf $TMPDIR/machines <put your app here>
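For comparison, a multi-threaded job could be sketched along the same lines. This is an untested sketch, not a site-provided script. The program name ./my_openmp_prog and the job name are hypothetical, and the fixed slot count of 4 is arbitrary. Passing NSLOTS on to OMP_NUM_THREADS is an assumption about how the openmp parallel environment is meant to be used.

```sh
#!/bin/sh
#
# Hypothetical OpenMP job script; option rows start with the qsub sentinel
#
#$ -N omp_test
#$ -j y
#$ -o omp_test.output
#$ -cwd
#
#$ -pe openmp 4
#$ -p 0
#
# NSLOTS is set by GridEngine; fall back to 1 thread outside the batch system.
OMP_NUM_THREADS=${NSLOTS:-1}
export OMP_NUM_THREADS
echo "Running with $OMP_NUM_THREADS thread(s)"

# Placeholder program name; replace it with your own executable.
PROG=./my_openmp_prog
if [ -x "$PROG" ]; then
    "$PROG"
else
    echo "$PROG not found or not executable" >&2
fi
```

Because the thread count is taken from NSLOTS, the script stays correct if the slot request is later changed to an interval.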
Setting the correct job priority
Job priorities shall be set according to the following table:
| Approx. exec. time | Priority |
| 0 - 1 h | 0 |
| 1 - 10 h | 1 |
| 10 - 48 h | 2 |
| 48 h+ | 3 |
If you have special needs, contact Henrik Löf.
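The table can be turned into a small helper for choosing the -p value. This is a sketch only: the function name priority_for_hours is hypothetical, it takes whole hours, and it assumes that a boundary value such as 10 h belongs to the longer class, which the table leaves open.

```sh
#!/bin/sh
# Print the job priority for an estimated run time given in whole hours.
priority_for_hours() {
    hours=$1
    if [ "$hours" -lt 1 ]; then
        echo 0
    elif [ "$hours" -lt 10 ]; then
        echo 1
    elif [ "$hours" -lt 48 ]; then
        echo 2
    else
        echo 3
    fi
}

# Example: a job expected to run for about 24 hours.
prio=`priority_for_hours 24`
echo "qsub -p $prio my_job_script.sh"
```

The last line only prints the resulting command. Remove the echo to submit the job for real.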
Sample scripts
Monitoring the batch system
There are several tools available to monitor Sun GridEngine.
- qhost: Gives you a quick status of the whole system
- qstat: Shows the status of your enqueued and running jobs; use the flag -f to see the jobs of all users.
Removing and changing jobs
There are several tools to manage already submitted jobs
- qdel <job_id>,...: Removes pending jobs from the pool
- qalter: Modifies a pending job
- qresub: Resubmits an existing job
Graphical user interface to GridEngine
There is a graphical user interface to GridEngine which is straightforward to use. Start the GUI by typing qmon at the prompt. See the GridEngine manuals for more information about qmon.
More documentation
Documentation for all commands is available through man pages. There are also other pages available online, see tee$ man codine_intro.
A complete manual as well as forums, FAQs and such are available here: http://supportforum.sun.com/gridengine