1. (1)Formal Verification for High Performance Computing

  2. (2)XUM: An Experimental Multicore Supporting MCAPI

Ganesh Gopalakrishnan

Formal Verification of unmodified concurrent programs used in high performance message passing (MPI) requires a formal model of MPI's "happens-before (HB)" and a dynamic scheduler that runs relevant schedules. Formal verification of highly optimized GPU computing code using a combination of symbolic reasoning and testing requires efficient handling of concurrency modeling, and coverage of highly likely GPU code bugs (races, deadlocks, performance bugs such as memory bank conflicts). In the first half of my talk, I will teach you our MPI happens-before model, and show how this HB is parametric over send buffer allocation, thus helping catch resource deadlocks also. I will detail our ISP tool's dynamic scheduler. Then I will detail our Distributed Analyzer for MPI programs (DAMPI), showing how it efficiently realizes causality tracking for deterministic replay using sparse Lamport clocks over 1000 CPUs.  In the second half of my talk, I will teach you how our Prover of User GPU programs (PUG) tool achieves formal verification of CUDA kernels. I will detail its efficient encoding of concurrency for inter-thread race checking. I will then present symbolic detection of deadlocks and bank conflicts. I will conclude with new work in the space of execution plus symbolic GPU kernel verification. The talk will conclude with where the proverbial "Gretsky's puck" will likely be in the combined message passing / GPU space. In the third half of my talk (!), I will describe the eXtensible Utah Multicore project (XUM) in which we are building an FPGA-based message passing multicore processor supporting the MCAPI standard by the Multicore Association. XUM's novelty is in incorporating an ISA extension and an NOC design that allows fast MCAPI messaging and support for MCAPI semantics. Initial performance results from XUM will be presented.

The Design and Engineering of Concurrency Libraries

Doug Lea

The majority of commonly used concurrency support components are increasingly found in standardized libraries.  These include lightweight Execution frameworks, concurrent data structures, futures, locks, queues, barriers, and other synchronizers. This presentation will introduce selected algorithms, design decisions, engineering processes, and standardization issues stemming from experience with the java.util.concurrent library. It will also include preliminary presentations of planned and upcoming components, and some of the new research and development challenges that they present.

Shared Memory: An Elusive Abstraction

Francesco Zappa Nardelli

Optimisations performed by the hardware or by high-level language compilers can reorder memory accesses in various subtle ways.  These reorderings can sometimes be observed by concurrent programs, exacerbating the usual problems of concurrent programming.  Progress has been made in understanding and specifying the behaviour of some major processor architectures and the impact of optimisations performed by high-level programming languages.  In these lectures I will give a overview of relaxed memory models, both at the hardware level (covering the x86 and Power architectures) and at the high-level language level (covering the forthcoming C/C++ standard).  In between I will highlight the key role played by the compiler.

Scala for Multicore Programming

Philip Haller

In this lecture we will introduce several approaches to parallel and concurrent programming in Scala. We'll begin with an introduction to Scala's actors library which combines event-based and thread-based models of concurrency. We'll also discuss extensions for remote communication and advanced synchronization patterns using joins. Using a novel type system extension based on static capabilities, we'll show how actor isolation and race freedom is enforced. We'll also show how to provide custom type system extensions as plugins for the Scala compiler. In addition to our detailed treatment of Scala actors, we'll also cover other approaches to parallel programming in Scala such as parallel collections and parallel embedded domain-specific languages (DSLs). The talk concludes with an outlook of possible future directions of Scala for multi-core programming.

Concurrent and Parallel Programming — What's the difference?

Joe Armstrong

Q: Is there a difference between Concurrent and Parallel programming?

A: Oh boy — a massive difference.

Concurrent programming is really easy — but parallel programming is a mess. All sorts of real-world nastyness happens when you write parallel programs. You have to deal with failure, resource allocation, locality, caching, consistency, version control, and host of other nastynesses.

Joe Armstrong has been writing concurrent programs for 25 years, and since the advent of the multicore has even been writing parallel programs. He'll tell you why why writing parallel programs made his hair turn gray.

Multicore: a view from Sweden and a case for task-centric programming models

Mats Brorson

Multicore processors are rapidly evolving to manycore. One definition of a manycore processor is when the number of cores have increased to the point where we no longer have effective programming models to fill the cores with useful work (source Wikipedia). At the Multicore Center in Kista, we investigate manycore programming models, their implementations, architectures and operating systems for industrial applications. We advocate to leave thread-centric programming models as the main parallel programming model in favour of task-centric models which may be combined with other models as needed.

In a thread-centric model, programmers explicitly create compute threads (e.g. pthreads) and manually assign work to these threads and manage their execution on parallel cores. This process is error-prone, difficult to analyse from a performance perspective, and not future-proof. With the rapid increase of the number of cores, a program written with threads easily becomes outdated as it is often tied to a specific number of cores. In contrast, in a task-centric programming model, programmers concentrate on exposing the available parallelism, expressing it in concurrent tasks. A run-time system maintains a pool of worker threads that helps out in executing the tasks in parallel. The worker threads can be managed manually in bare-metal platforms or by the operating system if such exists. Good implementations use private task-queues per worker thread and work-stealing is often employed to provide for good load balance.

Task-centric models are rapidly gaining popularity since the incorporation of tasks in OpenMP 3.0. Other examples of task-centric models are Cilk++, Microsoft Task Parallel Library, Intel Threading Building Blocks and Apple Grand Central Dispatch. Current development consider expressing inter-task data dependencies for more efficient task and data scheduling and exposure of real-time and/or resource requirements.

In this presentation I will provide an overview of the trend of manycore architectures and programming models for embedded systems, and argue for how task-centric models will help developers get more out of these architectures. The presentation builds on experiences from industrial manycore projects and several EU research projects. I will also present the Swedish Multicore Initiative which is an informal network of excellence to promote research and development in multicore technology.

Analysing DMA Races in Multicore Software

Philipp Rümmer

Modern multicore processors, such as the Cell Broadband Engine, achieve high performance by equipping accelerator cores with small ”scratch-pad” memories. In contrast to the caches common in most architectures, scratch-pad memories are not coherent with each other, or with main memory, which means that the programmer must manually orchestrate data movement using direct memory access (DMA) operations.  The price for increased performance is programming complexity—programming using asynchronous DMAs is error-prone, and DMA races can lead to nondeterministic bugs which are hard to reproduce and fix. I will give an introduction to the use of DMA on Cell processors, and then present methods to automatic DMA race analysis that we recently developed, tailored to automatically detecting DMA races or proving their absence.

The school starts Monday at 10.30 and ends Thursday at 13.30.

Natually, a big part of any summer school is the discussions and activities after the lectures are over, with chances to talk to the speakers and others about your work, ask detailed questions, ask for feedback, etc.


10:30–10:45 Welcome

10:45–12:00 Concurrent and Parallel Programming — What's the difference? (Joe Armstrong)

12:00–13:15 Lunch

13:15–15:00 Scala for Multicore Programming (Philipp Haller)

15:00–15:30 Coffee

15:30–17:15 Formal Verification Methods for Message Passing (Ganesh Gopalakrishnan)


08:45–09:45 Analysing DMA Races in Multicore Software (Philipp Rümmer)

09:45–10:15 Coffee

10:15–12:00 Scala for Multicore Programming (Philipp Haller)

12:00–13:15 Lunch

13:15–15:00 XUM: An Experimental Multicore Supporting MCAPI (Ganesh Gopalakrishnan)

15:00–15:30 Coffee

15:30–17:15 The Design and Engineering of Concurrency Libraries (Doug Lea)


08:45–09:45 Multicore: a view from Sweden and a case for task-centric programming models (Mats Brorson)

09:45–10:15 Coffee

10:15–12:00 The Design and Engineering of Concurrency Libraries (Doug Lea)

12:00–13:15 Lunch

13:15–15:00 Shared Memory: An Elusive Abstraction (Francesco Zappa Nardelli)

15:00–15:30 Coffee

15:30–17:15 Erlang-style concurrency: pros, cons and implementation issues (Kostis Sagonas)

20:00–22:00 Open discussion. NB: Mandatory for students that want certificates/grading.


08:45–09:45 Erlang-style concurrency: pros, cons and implementation issues (Kostis Sagonas)

09:45–10:15 Coffee

10:15–12:00 Shared Memory: An Elusive Abstraction (Francesco Zappa Nardelli)

12:00–13:30 Lunch