Uppsala universitet

TDB HPC User Guide

This is the official user's guide to the new high-performance computer systems at the Department of Information Technology. Guides for the other systems at the department can be found here.

This user guide is formatted as a single HTML file, which can be printed or viewed in any HTML 4 compliant browser. Use the quick links in the left margin to jump into the guide.

Usage tip

You can access this guide online on tee using lynx

tee$ lynx http://www.it.uu.se/datordrift/maskinpark/teedeebee

Use CTRL-A in lynx to jump to the quick links (top of page)

Introduction

TeeDeeBee is a parallel computer system at the Department of Information Technology, Uppsala University.

This system, delivered in 2001, is part of an ongoing collaborative research program with Sun Microsystems® established in 1999. The acquisition of the computer system was made possible by a grant from the Knut and Alice Wallenberg Foundation.

System configuration

The system consists of three Sun Fire 6800 servers and has a theoretical peak performance of 72 Gflop/s.

Each server is configured as follows:

This gives a total of 48 CPUs and 48 GB of RAM. The system will be replaced by a more powerful Sun Fire 15k system at the beginning of 2002.
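The quoted peak figure can be reproduced with a short calculation. This is only a sketch: the 750 MHz clock rate and the two floating-point operations per cycle are assumptions that are not stated in this guide; only the CPU count (48) is taken from the text.

```latex
\[
R_{\text{peak}}
  = N_{\text{CPU}} \times f_{\text{clock}} \times \frac{\text{flops}}{\text{cycle}}
  = 48 \times 0.75\,\text{GHz} \times 2
  = 72\ \text{Gflop/s}
\]
```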

The UltraSPARC-III microprocessor

The UltraSPARC-III is a 64-bit 4-way issue superscalar pre-fetching microprocessor, featuring:

Sun Fire system architecture

The core of the Sun Fire architecture is the Sun Fireplane system interconnect. The Sun Fireplane is a packet switched broadcast medium capable of 9.6 GB/s.

Four US-III CPUs are packaged on a CPU/memory board together with the L2 cache and primary RAM. One Sun Fire 6800 machine can have up to 6 boards whereas the Sun Fire 15k can have up to 18 boards.

Six boards form a uniform-access snooping (broadcast) coherent domain, the Sun Fire 6800. Three such domains can be linked through a scalable shared memory (SSM) device and a crossbar using point-to-point directory-based coherence to form a coherent non-uniform access domain, the Sun Fire 15k.

Each snooping domain (SMP) has a peak data bandwidth of 9.6 GB/s whereas the domain to domain bandwidth (non-local accesses through a crossbar) is limited to 2.4 GB/s.

Programming environment

Apart from serial code, the system supports message passing (MPI) using the Sun HPC ClusterTools 4 software as well as explicit shared memory programming using Solaris threads, POSIX threads, OpenMP, Sun or Cray style directives.

The system runs the latest version (8) of the Solaris operating system. We use the Sun Grid Engine batch system software for resource management.

There are three nodes: Tee, Dee and Bee. Dee and Bee are accessed only through the batch system; do not log on to them. Log on to Tee using a secure shell (SSH):

bash$ ssh tee.it.uu.se

Accounts are managed by Henrik Löf, henlof@tdb.uu.se

Compilers

Currently several compiler suites are installed:

Sun Forte 6.2 is the default compiler and supports the following languages:

To use the Sun Forte 6.1 compiler put /opt/SUNWspro/bin first in your $PATH.

Minimal effort compiling (Sun Forte)

32-bit code:
-fast -xtarget=ultra3 -xarch=v8plusb
64-bit code:
-fast -xtarget=ultra3 -xarch=v9b

The -fast flag is a macro and it expands to (version 6.2):

Fortran:
-xO5 -xpad=local -xvector=yes -xprefetch=auto,explicit -f -fsimple=2 -fns=yes -ftrap=common -xlibmil -xlibmopt -dalign -xdepend
C:
-fns -fsimple=2 -fsingle -ftrap=%none -xalias_level=basic -xbuiltin=%all -xlibmil -xmemalign=8s -xO5
C++:
-xO5 -fsimple=2 -fns=yes -ftrap=%none -xlibmil -xlibmopt -xbuiltin=%all -dalign

For maximum effect you should also link with -fast. The Sun compilers all follow the "rightmost flag wins" rule: if you want all the options in the -fast macro but a lower optimization level, compile with -fast -xO4.
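As a minimal sketch of the flag ordering described above, the helper below assembles the 32-bit or 64-bit flag set and appends any overrides last, so that they take effect after the -fast expansion. The function name forte_flags and the source file helm.c are hypothetical, and the command is only printed, not executed.

```shell
#!/bin/sh
# forte_flags is a hypothetical helper, not part of the Forte suite.
# Usage: forte_flags 32|64 [extra flags...]
forte_flags() {
    case "$1" in
        32) arch=v8plusb ;;
        64) arch=v9b ;;
        *)  echo "usage: forte_flags 32|64 [extra flags...]" >&2
            return 1 ;;
    esac
    shift
    flags="-fast -xtarget=ultra3 -xarch=$arch"
    # Extra flags are appended last, so they win over the -fast expansion.
    if [ $# -gt 0 ]; then
        flags="$flags $*"
    fi
    printf '%s\n' "$flags"
}

# Dry run: print the command line. helm.c is a placeholder source file.
echo cc `forte_flags 64 -xO4` -o helm helm.c
```

Remove the leading echo on tee to run the compiler for real. Because compilation and linking happen in one command here, -fast is also passed to the link step.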

A quick reference is obtained using the flag -flags. For a more detailed description of the different flags and their effect click here

Useful options

Allow loop interchange and loop optimizations
-xdepend
True 64-bit load/store and alignment
-dalign
Explicit in-lining
-xinline=my_func
Interprocedural optimizations
-xipo=1

Math libraries

The Sun Forte compilers supply optimized versions and in-lined versions of the libm library:

In-lined libm
-xlibmil
Optimized libm
-xlibmopt

Sun also supplies:

using the Sun Performance library. To use the library compile with:

-dalign -xlic_lib=sunperf

Fortran 90 users should also include the module sunperf, USE SUNPERF. The library automatically switches to a parallel version if the program being compiled is shared memory parallelized.

A user's guide for the Sun Performance library can be found here (docs.sun.com)

Message passing using MPI

This tutorial only shows how to use

Compilation using Sun MPI and Sun Forte compilers

To simplify MPI compilation, Sun has included compiler front-ends in the ClusterTools package to set the correct paths etc. The procedure is relatively simple:

  1. Include the MPI headers:
    Fortran
    INCLUDE 'mpif.h'
    C/C++
    #include <mpi.h>
  2. Compile using the front-ends:
    Fortran 77:
    mpf77 <flags> -dalign -lmpi
    Fortran 90:
    mpf90 <flags> -dalign -lmpi
    C:
    mpcc <flags> -lmpi
    C++:
    mpCC <flags> -mt -lmpi

If your MPI code is multi-threaded you should replace the -lmpi with -lmpi_mt.
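The front-end and library choices above can be summarised in a small helper. This is a sketch only: mpi_compile_cmd is a hypothetical name, the source file names are placeholders, and the function prints the command line rather than running it.

```shell
#!/bin/sh
# mpi_compile_cmd is a hypothetical helper, not part of ClusterTools.
# Usage: mpi_compile_cmd f77|f90|c|c++ yes|no [flags and files...]
# The second argument says whether the MPI code is multi-threaded.
mpi_compile_cmd() {
    lang=$1
    threaded=$2
    shift 2
    case "$lang" in
        f77) fe=mpf77; extra=-dalign ;;
        f90) fe=mpf90; extra=-dalign ;;
        c)   fe=mpcc;  extra= ;;
        c++) fe=mpCC;  extra=-mt ;;
        *)   echo "unknown language: $lang" >&2
             return 1 ;;
    esac
    if [ "$threaded" = yes ]; then
        lib=-lmpi_mt
    else
        lib=-lmpi
    fi
    # $extra is left unquoted so that it disappears when empty (plain C).
    echo $fe "$@" $extra $lib
}

# Dry run for a single-threaded Fortran 90 code (helm.f90 is a placeholder).
mpi_compile_cmd f90 no -fast helm.f90
```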

Online MPI documentation

All the MPI routines are accessible through man-pages, see tee$ man mpi.

External documentation

ClusterTools documentation

Shared memory programming

There are essentially two different ways to parallelize code using threads.

  1. Explicit multi-threading by calling OS primitives
  2. Parallelization using compiler directives or pragmas

Using Solaris threads and POSIX threads

There are two packages available for explicit multi-threading in Solaris:

  1. Native Solaris threads
    Symbols:
    Define _REENTRANT
    Linking:
    -lthread
    Example:
    bash$ cc [flags] file... -D_REENTRANT -lthread
  2. Portable POSIX 1003.1c threads (pthreads)
    Symbols:
    Define _POSIX_C_SOURCE=199506L
    Linking:
    -lpthread
    Example:
    bash$ cc [flags] file... -D_POSIX_C_SOURCE=199506L [-lposix4] -lpthread

The [-lposix4] flag is for the POSIX.1b-1993 real-time extensions such as semaphores.

You can also use the macro -mt which expands to -D_REENTRANT -lthread when compiling native threads code. This flag is required when compiling C++ code to get the correct linking.

Using OpenMP

The Sun Forte C compiler currently supports the OpenMP 1.0 standard and the Sun Forte Fortran compiler supports the OpenMP 2.0 standard.

  1. Include files and modules (only present in the 2.0 standard)
    Include file for runtime functions
    INCLUDE 'omp_lib.h'
    alt. Fortran90 module
    USE omp_lib
  2. OpenMP sentinels
    C
    #pragma omp
    Fortran 77
    C$OMP
    Fortran 90
    !$OMP
  3. Compiling
    C
    -xopenmp=parallel
    Fortran
    -Xlist -openmp

Set the environment variable OMP_NUM_THREADS, or use the runtime library functions, to set the number of threads.

External links

Solaris 8 Multi-threaded Programming Guide
@docs.sun.com
Fortran User's Guide
@docs.sun.com
Forte C 6 update 2 / Sun WorkShop 6 update 2 C Compiler User's Guide
@docs.sun.com
OpenMP standard
www.openmp.org

Running code interactively

The node tee is available for development and interactive use without limitations. Batch jobs can also be submitted to tee, but exclusive usage cannot be guaranteed in that case. For exclusive usage, timings and the like, use the batch system.

For MPI code, you must specify that it should run on tee:

tee$ mprun -np <num> -R "name=tee" <program>

Once the program is running, you can monitor the MPI job using several tools:

mpps
Shows you your current running MPI jobs, use flag -e to see all running MPI jobs.
mpkill <job_id>
Kills a running MPI job using the ID given to you at run time or from mpps.

Use the above tools only for MPI jobs started interactively on tee.

Tools for monitoring system activity

java Jmpstat -u <sec>
Gives a graphical view of the load on each individual CPU
mpstat <interval> <count>
Gives detailed per processor information
top
Shows the top most CPU consuming processes on the system

Development tools

There are several tools included in the different Sun software packages.

Sun Workshop, tee$ workshop
A complete environment for software development, including build tools, source browsing, debugging, visualisation and performance profiling.
Prism, Prism -np <num_cpus> <program>
Environment for debugging, visualisation and profiling for Sun MPI programs.
Sun S3L, parallel math libraries
Large range of MPI parallelized solvers (ScaLAPACK..)

A more detailed description as well as tutorials will be posted here later.

Using the batch system

To allow a fair and better usage of the system we use a resource manager to coordinate user demands. We use the Sun GridEngine software.

GridEngine basics

Each host has one queue for serial/multi-threaded jobs and one for MPI jobs.

If the resource requirements cannot be fulfilled, the job will be pending (waiting) for its resources. This also happens if you specify resources that can never be fulfilled; the job then stays in the pending state until it is removed or changed.

Submitting jobs

Qsub options

Output file, stdout (default: [job_name].o[job_id])
-o filename
Output file, stderr (default: [job_name].e[job_id])
-e filename
Join output to stdout file
-j y
Start job at a specific time or date
-a MMDDhhmm.ss
Start job from current working directory (default: $HOME)
-cwd
Set job name
-N name
Set job priority (valid numbers are -1023 to 1024)
-p priority
Export current environment variables (default: no variables)
-V
Mail user(s)
-M user[@host],...
Mail options
-m b|e|a|s|n,...
  • b - Mail at the beginning of job
  • e - Mail at the end of job
  • a - Mail at the abortion of job
  • s - Mail at the suspension of job
  • n - Never mail (default)
Choose queue
-q queue_name,queue_name,...
Example (note that comma-separated lists must not contain spaces):

tee$ qsub -N myjob -cwd -j y -o myjob.out -M henlof@tdb.uu.se -m b,e,s -q bee.q,dee.q my_job_script.sh
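The MMDDhhmm.ss argument to -a can be built with date(1). The sketch below delays a job until 22:00 on the current day; the time of day is an arbitrary example, and the qsub command is only printed so the snippet can be tried anywhere.

```shell
#!/bin/sh
# Build a -a argument for 22:00 today: MMDD from date(1), then hhmm.ss.
start="`date +%m%d`2200.00"

# Dry run: print the qsub command; drop "echo" to actually submit.
echo qsub -a "$start" -N myjob -cwd my_job_script.sh
```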

Submitting parallel jobs

GridEngine uses the concept of a parallel environment, which defines how a parallel job should be run. Currently the system supports two parallel environments: mpi and openmp.

MPI qsub option
-pe mpi [num_cpus]
OpenMP qsub option
-pe openmp [num_cpus]

The [num_cpus] parameter can be an integer or an interval. If the parameter is an interval, -pe mpi 4-12, the scheduler tries to allocate at least 4 CPUs and at most 12.

GridEngine uses the terminology slots for CPUs. Hence, each queue has as many slots available as there are CPUs in the execution host. If there are several queues on the host, the slots are shared. The number of slots allocated by the system is passed through the environment variable NSLOTS.
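A minimal sketch of an OpenMP batch script that picks up the allocated slot count through NSLOTS is shown below. The job name, the 4-8 slot interval and the executable helm_omp are illustrative assumptions; the fallback to one thread is only there so the script also runs outside the batch system.

```shell
#!/bin/sh
#$ -N omp_test
#$ -cwd
#$ -j y
#$ -pe openmp 4-8

# GridEngine sets NSLOTS to the number of slots it granted (4 to 8 here).
# Outside the batch system NSLOTS is unset, so fall back to one thread.
OMP_NUM_THREADS=${NSLOTS:-1}
export OMP_NUM_THREADS

echo "Running with $OMP_NUM_THREADS OpenMP threads"
# ./helm_omp    # hypothetical OpenMP executable; uncomment to run it
```

Tying OMP_NUM_THREADS to NSLOTS keeps the thread count equal to the number of CPUs the scheduler actually allocated when an interval is requested.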

Important:

To allow correct execution of MPI jobs, you must use the flag -l cre to place MPI jobs in the corresponding *.cre queue.

Writing batch scripts

A batch script is a text file that begins with #!/bin/sh and contains commands, one command per line.

Example:
#!/bin/sh
#
# This is a comment
#

cd $HOME/forsking/helmholtz

./helm
      

The most convenient way to use the batch system is to embed the qsub options in the batch script. Qsub recognizes all lines starting with #$ as options.

Example:
#!/bin/sh 
#
# All commands use the sentinel #$
#
#$ -N CRE_test
#$ -j y
#$ -o CRE_test.output
#$ -cwd
#$ -M user@tdb.uu.se
#$ -m b,e,s
#
#$ -l cre
#$ -pe mpi 6-12
#
#$ -q tee.cre,dee.cre,bee.cre
#
#$ -p 0
#

/opt/codine/mpi/MPRUN -np $NSLOTS -Mf $TMPDIR/machines <put your app here>

      

Setting the correct job priority

Job priorities shall be set according to the following table:

Approx. exec. time    Priority
0 - 1 h               0
1 - 10 h              1
10 - 48 h             2
48 h+                 3
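The table above can be encoded as a small shell function. This is a sketch: job_priority is a hypothetical helper, it takes the estimated run time in whole hours, and treating each lower bound as inclusive (so exactly 10 h gives priority 2) is an assumption, since the table does not say how the boundaries are handled.

```shell
#!/bin/sh
# job_priority is a hypothetical helper, not a GridEngine command.
# Usage: job_priority HOURS   (estimated run time in whole hours)
job_priority() {
    hours=$1
    if   [ "$hours" -lt 1 ];  then echo 0
    elif [ "$hours" -lt 10 ]; then echo 1
    elif [ "$hours" -lt 48 ]; then echo 2
    else                           echo 3
    fi
}

# Dry run: print a qsub command for a job expected to take about 12 hours.
echo qsub -p `job_priority 12` my_job_script.sh
```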

If you have special needs, contact Henrik Löf.

Sample scripts

Serial jobs

MPI jobs

OpenMP jobs

Monitoring the batch system

There are several tools available to monitor Sun GridEngine.

qhost
Gives you a quick status of the whole system
qstat
Shows the status of your enqueued and running jobs, use the flag -f to see the jobs of all users.

Removing and changing jobs

There are several tools to manage already submitted jobs

qdel <job_id>,...
Removes pending jobs from the pool
qalter
Modifies a pending job
qresub
Resubmits an existing job

Graphical user interface to GridEngine

There is a graphical user interface to GridEngine which is straightforward to use. Start the GUI by typing qmon at the prompt. See the GridEngine manuals for more information about qmon.

More documentation

Documentation for all commands is available through man-pages. Other pages are also available online, for example tee$ man codine_intro.

A complete manual as well as forums, FAQs and such are available here: http://supportforum.sun.com/gridengine