Ngorongoro User Guide
The Ngorongoro system was delivered to the Department of Information Technology, Uppsala University, in December 2001. It was acquired as part of an ongoing collaborative research program with Sun Microsystems® established in 1999. The acquisition of the computer system was made possible through a grant from the Knut and Alice Wallenberg Foundation.
This is the official user's guide to the new high performance computer systems at the Department of Information Technology. Guides for the other systems at the department can be found here.
This user guide is formatted as a single HTML file which can be printed or viewed in any HTML 4 compliant browser. Use the quick links in the left margin to jump to a section of the guide.
Usage tip
You can access this guide online on simba using lynx:
simba$ lynx http://www.it.uu.se/datordrift/maskinpark/ngorongoro
Use CTRL-A in lynx to jump to the quick links (top of page).
Latest news
Compiling 64-bit applications using Sun Forte 6.2 and MPI does not currently work. Use -xarch=v8plusb to compile 32-bit code instead. We are looking into these problems.
System configuration
The system consists of one Sun Fire 15k server with a theoretical peak performance of 86 Gflop/s.
The server is configured as follows:
- 48 UltraSPARC III+ CPUs at 900 MHz
- 8 MB of L2 Cache memory
- 48 GB of primary RAM
- 12 drives of 18 GB each
This gives a total of 48 CPUs and 48 GB of RAM.
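The peak figure quoted for the system can be checked with a little arithmetic. The sketch below assumes that each UltraSPARC III+ completes one floating-point add and one floating-point multiply per clock cycle (two flops per cycle). That per-cycle figure is an assumption and is not stated in this guide.

```python
# Back-of-the-envelope check of the theoretical peak performance.
# ASSUMPTION: 2 floating-point operations per CPU per cycle
# (one add and one multiply); this is not stated in the guide.
CPUS = 48
CLOCK_MHZ = 900
FLOPS_PER_CYCLE = 2  # assumed

peak_mflops = CPUS * CLOCK_MHZ * FLOPS_PER_CYCLE
peak_gflops = peak_mflops / 1000

print(f"Theoretical peak: {peak_gflops:.1f} Gflop/s")
```

Under that assumption the result is 86.4 Gflop/s, which agrees with the rounded figure given for the server.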
The system is divided into two separate "virtual servers" (domains), simba and duma.
- Simba
- Front end, unlimited interactive usage, 16 CPUs and 16 GB of RAM.
- Duma
- Batch machine, used only through the batch system. 32 CPUs and 32 GB of RAM.
The UltraSPARC-III microprocessor
The UltraSPARC-III is a 64-bit 4-way issue superscalar pre-fetching microprocessor, featuring:
- Six execution pipelines (2 integer, 2 FP/VIS, 1 load/store, 1 addressing)
- 16 integer registers, 32 FP/VIS registers
- Latency: 3- to 4-cycle integer/FP/VIS add, subtract, logical and multiply
- Latency: 17- to 29-cycle FP divide, square root
- 14-stage, non-stalling pipeline
- Well managed 16k-entry branch prediction table
- L1-cache: 64 kB 4-way Data, 32 kB 4-way instruction, 2 kB pre-fetch, 2 kB Write
- 150 MHz bus clock frequency
- Bus bandwidths: 2.4 GB/s processor-to-memory, 4.8 GB/s bus-to-memory
- 512 entry TLB
Sun Fire system architecture
The core of the Sun Fire architecture is the Sun Fireplane system interconnect. The Sun Fireplane is a packet switched broadcast medium capable of 9.6 GB/s.
Four US-III CPUs are packaged on a CPU/memory board together with the L2 cache and primary RAM. One Sun Fire 6800 machine can have up to 6 boards whereas the Sun Fire 15k can have up to 18 boards.
Six boards form a uniform access snooping (broadcast) coherent domain, the Sun Fire 6800. Three such domains can be linked through a scalable shared memory (SSM) device and a crossbar using point-to-point directory based coherence to form a coherent non-uniform access domain, the Sun Fire 15k.
Each snooping domain (SMP) has a peak data bandwidth of 9.6 GB/s whereas the domain to domain bandwidth (non-local accesses through a crossbar) is limited to 2.4 GB/s.
Programming environment
Apart from serial code, the system supports message passing (MPI) using the Sun HPC ClusterTools 4 software as well as explicit shared memory programming using Solaris threads, POSIX threads, OpenMP, Sun or Cray style directives.
The system runs the latest version (8) of the Solaris operating system. We use the Sun Grid Engine batch system software for resource management.
There are two nodes: Simba and Duma. Duma is accessed only through the batch system; do not log on to this machine. Log on to Simba using a secure shell (SSH):
bash$ ssh simba.it.uu.se
18 GB of temporary local storage is available by creating a
directory in /scr0/home. Global temporary storage is
available by creating a directory in /scr1/home.
Accounts are managed by Henrik Löf, henlof@tdb.uu.se
Compilers
Currently several compiler suites are installed:
- SunOne Studio 7.0, /opt/SUNWspro/bin
- Sun Forte 6.2, /scr1/compilers/WS6U2/SUNWspro/bin
- GNU GCC v3.2.2, /it/sw/gnu/bin
Sun Forte 6.2 is the default compiler and supports the following languages:
- ANSI C (bash$ cc)
- C++ (bash$ CC)
- Fortran 77 (bash$ f77)
- Fortran 90/95 (bash$ f90)
To use the SunOne Studio 7.0 compiler, put /opt/SUNWspro/bin first in your $PATH.
Minimal effort compiling (Sun Forte)
- 32-bit code: -fast -xtarget=ultra3plus -xarch=v8plusb
- 64-bit code: -fast -xtarget=ultra3plus -xarch=v9b
The -fast flag is a macro that expands to the following (version 6.2):
- Fortran: -xO5 -xpad=local -xvector=yes -xprefetch=auto,explicit -f -fsimple=2 -fns=yes -ftrap=common -xlibmil -xlibmopt -dalign -xdepend
- C: -fns -fsimple=2 -fsingle -ftrap=%none -xalias_level=basic -xbuiltin=%all -xlibmil -xmemalign=8s -xO5
- C++: -xO5 -fsimple=2 -fns=yes -ftrap=%none -xlibmil -xlibmopt -xbuiltin=%all -dalign
For maximum effect you should also link with -fast. The Sun compilers all follow the "rightmost flag wins" rule, which means that if you want to compile with all the options in the -fast macro but lower the optimization level, you should compile with -fast -xO4.
A quick reference is obtained using the flag -flags. For a more detailed description of the different flags and their effects, click here.
Useful options
- Allow loop interchange and loop optimizations: -xdepend
- True 64-bit load/store and alignment: -dalign
- Explicit in-lining: -xinline=my_func
- Interprocedural optimizations: -xipo=1
Math libraries
The Sun Forte compilers supply optimized and in-lined versions of the libm library:
- In-lined libm: -xlibmil
- Optimized libm: -xlibmopt
The Sun Performance Library also supplies:
- BLAS1, BLAS2, BLAS3
- LAPACK v3.0
- LINPACK v3.0
- FFTPACK
- VFFTPACK
To use the library, compile with:
-dalign -xlic_lib=sunperf
Fortran 90 users should also include the module sunperf (USE SUNPERF). The library automatically switches to a parallel version if the program being compiled is shared-memory parallelized.
A user's guide for the Sun Performance Library can be found here (docs.sun.com)
Message passing using MPI
This tutorial only shows how to use Sun MPI with the Sun Forte compilers.
Compilation using Sun MPI and Sun Forte compilers
To simplify MPI compilation, Sun has included compiler front-ends in the ClusterTools package to set the correct paths etc. The procedure is relatively simple:
Include the MPI headers:
- Fortran: INCLUDE 'mpif.h'
- C/C++: #include <mpi.h>
Compile using the front-ends:
- Fortran 77: mpf77 <flags> -dalign -lmpi
- Fortran 90: mpf90 <flags> -dalign -lmpi
- C: mpcc <flags> -lmpi
- C++: mpCC <flags> -mt -lmpi
If your MPI code is multi-threaded you should replace -lmpi with -lmpi_mt.
Online MPI documentation
All MPI routines are documented in man pages; see simba$ man mpi.
External documentation
Shared memory programming
There are essentially two different ways to parallelize code using threads.
- Explicit multi-threading by calling OS primitives
- Parallelization using compiler directives or pragmas
Using Solaris threads and POSIX threads
There are two packages available for explicit multi-threading in Solaris:
Native Solaris threads
- Symbols: Define _REENTRANT
- Linking: -lthread
- Example: bash$ cc [flags] file... -D_REENTRANT -lthread
Portable POSIX 1003.1c threads (pthreads)
- Symbols: Define _POSIX_C_SOURCE=199506L
- Linking: -lpthread
- Example: bash$ cc [flags] file... -D_POSIX_C_SOURCE=199506L [-lposix4] -lpthread
The [-lposix4] flag is for the POSIX.1b-1993 real-time extensions such as semaphores.
You can also use the macro -mt, which expands to -D_REENTRANT -lthread, when compiling native threads code. This flag is required when compiling C++ code to get the correct linking.
Using OpenMP
The Sun Forte C compiler currently supports the OpenMP 1.0 standard and the Sun Forte Fortran compiler supports the OpenMP 2.0 standard.
- Include files and modules (only present in the 2.0 standard)
  - Include file for runtime functions: INCLUDE 'omp_lib.h'
  - Alternatively, Fortran 90 module: USE omp_lib
- OpenMP sentinels
  - C: #pragma omp
  - Fortran 77: C$OMP
  - Fortran 90: !$OMP
- Compiling
  - C: -xopenmp=parallel
  - Fortran: -Xlist -openmp
Set the variable OMP_NUM_THREADS or use runtime library functions to set the number of threads.
External links
- Solaris 8 Multi-threaded Programming Guide
- @docs.sun.com
- Fortran User's Guide
- @docs.sun.com
- Forte C 6 update 2 / Sun WorkShop 6 update 2 C Compiler User's Guide
- @docs.sun.com
- OpenMP standard
- www.openmp.org
Running code interactively
The node simba is available for development and interactive use without limitations. You can also submit batch jobs to simba, but exclusive usage cannot be guaranteed in that case. For exclusive usage, timings and the like, use the batch system.
For MPI code, you must explicitly specify that it should run on simba:
simba$ mprun -np <num> -R "name=simba" <program>
Once the program is running, you can monitor the MPI job using several tools:
- mpps: Shows your currently running MPI jobs; use the flag -e to see all running MPI jobs.
- mpkill <job_id>: Kills a running MPI job using the ID given to you at run time or from mpps.
Use the above tools only for MPI jobs started interactively on simba.
Tools for monitoring system activity
- java Jmpstat -u <sec>: Gives a graphical view of the load on each individual CPU
- mpstat <interval> <count>: Gives detailed per-processor information
- top: Shows the processes consuming the most CPU on the system
- prstat: Similar to top
Development tools
There are several tools included in the different Sun software packages.
- Sun Workshop, simba$ workshop
- A complete environment for software development, including build tools, source browsing, debugging, visualisation and performance profiling.
- Prism, Prism -np <num_cpus> <program>
- Environment for debugging, visualisation and profiling of Sun MPI programs.
- Sun S3L, parallel math libraries
- Large range of MPI parallelized solvers (ScaLAPACK..)
A more detailed description as well as tutorials will be posted here later.
Using the batch system
To allow fair and efficient use of the system, we use a resource manager to coordinate user demands. We use the Sun GridEngine software.
GridEngine basics
- Jobs are submitted as scripts to a central pool of jobs on the master host (simba).
- The master senses its execution hosts (simba, duma) and schedules jobs from the pool for execution.
- Jobs are put into different queues according to their resource requirements.
- Jobs are executed from the queue in the order established by their priority.
Each host has one queue for serial/multi-threaded jobs and one for MPI jobs.
If the resource requirements cannot be fulfilled, the job will be pending (waiting) for its resources. This can also happen if you specify resources that can never be fulfilled. The job will then stay in the pending state indefinitely, until it is removed or changed.
Submitting jobs
- Jobs (scripts) are submitted using the command qsub.
- Job options can be passed from the command line, from inside the job script, or both.
- Each job has a unique id and a user-definable name.
Qsub options
- Output file, stdout (default: [job_name].o[job_id]): -o filename
- Output file, stderr (default: [job_name].e[job_id]): -e filename
- Join output to stdout file: -j y
- Start job at a specific time or date: -a MMDDhhmm.ss
- Start job from current working directory (default: $HOME): -cwd
- Set job name: -N name
- Set job priority (valid numbers are -1023 to 1024): -p priority
- Export current environment variables (default: no variables): -V
- Mail user(s): -M user[@host],...
- Mail options: -m b|e|a|s|n,...
  - b - Mail at the beginning of job
  - e - Mail at the end of job
  - a - Mail at the abortion of job
  - s - Mail at the suspension of job
  - n - Never mail (default)
- Choose queue: -q queue_name,queue_name,...
Example:
simba$ qsub -N myjob -cwd -j y -o myjob.out -M henlof@tdb.uu.se -m b,e,s -q simba.q,duma.q my_job_script.sh
Submitting parallel jobs
GridEngine uses the concept of a parallel environment, which defines how a parallel job should be run. Currently the system supports two parallel environments: cre and openmp.
- MPI qsub option: -pe cre [num_cpus]
- OpenMP qsub option: -pe openmp [num_cpus]
The [num_cpus] parameter can be an integer or an interval. If the parameter is an interval, for example -pe cre 4-12, the scheduler tries to allocate at least 4 CPUs and at most 12.
GridEngine uses the term slots for CPUs. Hence, each queue has as many slots available as there are CPUs in the execution host. If there are several queues on the host, the slots are shared. The number of slots allocated by the system is passed through the environment variable NSLOTS.
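Because slots correspond to CPUs, a request whose minimum exceeds the CPUs of the chosen host can never be fulfilled and will stay pending. The following is a minimal sketch of such a sanity check; the helper names are hypothetical, the CPU counts are taken from the system configuration in this guide, and jobs spanning both hosts are deliberately ignored.

```python
# Hypothetical helpers: parse a "-pe" slot request such as "8" or "4-12"
# and check it against the CPU counts given earlier in this guide.
HOST_CPUS = {"simba": 16, "duma": 32}  # from the system configuration


def parse_slot_request(spec):
    """Return (minimum, maximum) slots for a spec like '8' or '4-12'."""
    low, sep, high = spec.partition("-")
    minimum = int(low)
    maximum = int(high) if sep else minimum
    if minimum < 1 or maximum < minimum:
        raise ValueError(f"invalid slot request: {spec!r}")
    return minimum, maximum


def can_be_scheduled(spec, host):
    """True if the minimum request fits within the host's CPUs (slots)."""
    minimum, _ = parse_slot_request(spec)
    return minimum <= HOST_CPUS[host]


print(parse_slot_request("4-12"))
print(can_be_scheduled("4-12", "simba"))
print(can_be_scheduled("24", "simba"))
```

This only checks the total slot count of a single host; it does not know how many slots are currently free, so a request that passes may still have to wait.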
Important:
To allow correct execution of MPI jobs, you must use the flag -l cre to place MPI jobs in the corresponding *.cre queue.
Current GridEngine setup
There are currently four queues set up. The *.cre queues are for MPI jobs and the *.q queues are for serial and multithreaded jobs.
Use the duma queues for verified "production" runs and the simba queues for testing purposes.
It is possible to run MPI jobs across both nodes by simply adding both queues to the qsub script. The interconnect is, however, not very powerful (100 Mbit/s Ethernet).
Writing batch scripts
A batch script is a text file that begins with #!/bin/sh and contains commands, one command per line.
Example:
#!/bin/sh
#
# This is a comment
#
cd $HOME/forsking/helmholtz
./helm
The most convenient way to use the batch system is to embed the qsub options in the batch script. Qsub recognizes all lines starting with #$ as options.
Example:
#!/bin/sh
#
# All commands use the sentinel #$
#
#$ -N CRE_test
#$ -j y
#$ -o CRE_test.output
#$ -cwd
#$ -M my_user_name@tdb.uu.se
#$ -m b,e,s
#
#$ -l cre
#$ -pe cre 6-12
#
#$ -q simba.cre,duma.cre
#
#$ -p 0
#
/export/hpc/codine/mpi/MPRUN -np $NSLOTS -Mf $TMPDIR/machines <put your app here>
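If you submit many similar jobs, the embedded options can be generated rather than typed by hand. The sketch below is one hypothetical way to do this; render_job_script and ./my_app are illustrative names that are not part of GridEngine or this system, and the options mirror the MPI example script in this section.

```python
# Hypothetical generator for a GridEngine MPI job script with embedded options.
def render_job_script(name, slots, queues, priority, command, output=None):
    """Return the text of a job script using the #$ sentinel."""
    output = output or f"{name}.output"
    lines = [
        "#!/bin/sh",
        f"#$ -N {name}",
        "#$ -j y",
        f"#$ -o {output}",
        "#$ -cwd",
        "#$ -l cre",
        f"#$ -pe cre {slots}",
        f"#$ -q {','.join(queues)}",
        f"#$ -p {priority}",
        command,
    ]
    return "\n".join(lines) + "\n"


script = render_job_script(
    name="CRE_test",
    slots="6-12",
    queues=["simba.cre", "duma.cre"],
    priority=0,
    # NSLOTS and TMPDIR are expanded by the shell when the job runs.
    command="/export/hpc/codine/mpi/MPRUN -np $NSLOTS -Mf $TMPDIR/machines ./my_app",
)
print(script)
```

The printed text can be saved to a file and submitted with qsub in the usual way.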
Setting the correct job priority
Job priorities shall be set according to the following table:
| Approx. exec. time | Priority |
| 0 - 1 h | 0 |
| 1 - 10 h | 1 |
| 10 - 48 h | 2 |
| 48 h+ | 3 |
If you have special needs, contact Henrik Löf.
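The priority table can be expressed as a small lookup. This is a sketch only: the function name is hypothetical, and since the table does not say which priority applies exactly at a boundary, the code assumes each upper bound is exclusive (a job estimated at exactly 1 h gets priority 1).

```python
# Hypothetical helper mapping an estimated run time to the -p value
# from the priority table.
# ASSUMPTION: each upper bound is exclusive (exactly 1 h maps to 1).
def priority_for_runtime(hours):
    """Return the job priority for an estimated run time in hours."""
    if hours < 0:
        raise ValueError("run time must be non-negative")
    if hours < 1:
        return 0
    if hours < 10:
        return 1
    if hours < 48:
        return 2
    return 3


for hours in (0.5, 5, 24, 72):
    print(f"{hours} h -> qsub -p {priority_for_runtime(hours)}")
```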
Sample scripts
Monitoring the batch system
There are several tools available to monitor Sun GridEngine.
- qhost: Gives you a quick status of the whole system
- qstat: Shows the status of your enqueued and running jobs; use the flag -f to see the jobs of all users.
Removing and changing jobs
There are several tools to manage already submitted jobs:
- qdel <job_id>,...: Removes pending jobs from the pool
- qalter: Modifies a pending job
- qresub: Resubmits an existing job
Graphical user interface to GridEngine
There is a graphical user interface to GridEngine which is straightforward to use. Start the GUI by typing qmon at the prompt. See the GridEngine manuals for more information about qmon.
More documentation
Documentation for all commands is available through man pages. There are also other pages available online, for example simba$ man codine_intro.
A complete manual as well as forums, FAQs and such are available here: http://supportforum.sun.com/gridengine