Parallel Algorithms for Multicore Processors
Motivation and Background
We are looking at various examples of applications of Scientific Computing and try to understand how today's multicore computer may offer improved performance. This currently involves differential equations, N-body problems, event-based models, inverse formulations and preconditioned iterative methods:
- Solving differential equations in general and partial differential equations in particular is a classical problem in computational science. Event-based models, which for example include Markov chains, is also an important computational model in many branches of physics. With a wide range of applications in industry, research, and development, these topics remain central subjects when designing, improving, understanding, or estimating physical systems.
- Inverse formulations (find the coefficients/boundary conditions/... given measured data) typically involve multiple 'forward' evaluations and are thus often computationally very intensive. We currently look on improving algorithms that make many passes over large amounts of data for which data-parallel algorithms perform well.
- We are interested in the challenges of solving and/or evolving these models efficiently in the multicore area. Users of these results are often other researchers trying to design or comprehend complex physical systems such as air-flow around wind power plants, phase-formation in micro-complex fluidic systems, chemical reactions, or combustion in engines.
- Porting iterative methods on multi- and many-core devices is a challenging task due to the limited degree of parallelism in most of the preconditioning techniques. By providing new algorithms suitable for these platforms, we can significantly accelerate many scientific applications based on finite-difference/elements/volumes methods.
In today's multicore era such computational problems can be solved on desktop computers if implemented and understood in the right way. We intend to do so!
Long Term Goal
This project currently encompasses four separate sub-projects.
- (AFMM) Study the impact on efficiency for adaptive fast multipole methods (FMM) on modern heterogeneous multicore computers. FMMs are efficient methods for evaluating N-body interactions, a critical operation in many different fields of computational physics including molecular dynamics and all the way up to astrophysics. Adaptivity is needed in most applications as point distributions tend to be non-uniform. Parts of the FMM algorithm are highly suitable to e.g. GPU computing and we aim at understanding design principles for a balanced version which exploits the heterogeneity of the algorithm itself.
- (ParPDE) Develop an adaptive framework for solution of high-dimensional partial differential equations optimized for clusters of multicore nodes. The main focus of this project is to investigate how operations on high-dimensional grids are most efficiently implemented with respect to performance and programmer effort. Communication-avoiding Krylov subspace methods that minimize data traffic between nodes and cores are implemented and evaluated.
- (URDME) Markov Chain Monte Carlo methods are commonly used to solve stochastic formulations in physics and biology and are traditionally categorized as discrete event simulation techniques. Due to their sequential way of processing chronological events, parallelization of discrete event simulations is known to be a non-trivial task, especially when massively parallel target architectures as Graphical Processing Units (GPUs) are taken into account. We focus on fully parallel or hybrid (CPU/GPU) approaches, relying on software efficiency, high quality random numbers as well as non-heuristic error estimates. Our goal is to develop a reliable, fast and adaptive parallel solver for MCMC methods capable to run on heterogeneous platforms or completely parallel computing architectures.
- (X-ray) X-ray diffraction analysis is a rapidly evolving topic in Structural Biology. This project is joint with The Laboratory of Molecular Biophysics in BMC at Uppsala University. The project aims at improving computational methods from imaging of biomolecules via X-ray lasers. Part of the research will be involved with parallelization and improving the underlying algorithms.
- (PARALUTION) Develop fine-grained parallel preconditioners which can be used on multi-core CPUs as well as on GPUs and other accelerators (e.g. Intel MIC). In this direction we are focusing on LU-based and approximate inverse techniques but also on multi-level and multi-grid methods.
New algorithms and new implementations of old algorithms!
- FMM2D: a two-dimensional adaptive fast multipole method with a Matlab mex-interface. Currently available in a serial C99-version and a fully ported CUDA/C++-version. Download here.
- PARALUTION - a library for iterative methods on CPU and GPU is available here
- (AFMM) FMMs are known to be notoriously difficult to implement and adaptive versions are even trickier. Yet such algorithms offer a very favorable asymptotic running-time. A new type of adaptive FMM has been derived and analyzed. The adaptivity suggested simplifies implementation on multi-parallel computers such as the GPU. A hybrid CPU/GPU version has been implemented that leverages the strengths of each hardware type and allows for automatic tuning for maximum performance. Several tuning algorithms have been implemented and are being evaluated in a range of application scenarios.
- (ParPDE) When solving large scale PDEs on massively parallel computer clusters, the parallelism will be hierarchical; between nodes on the interconnect, between sockets on a motherboard, between cores on a socket, etc. We attempt to write software that addresses the trade-offs at every level in the hierarchy by using a multi-layer parallel framework (currently MPI+OpenMP).
- (URDME) We are currently developing a parameter sweep solver for URDME capable of handling each single reaction and diffusion path in a single trajectory and enabling advanced parameter sensitivity analysis. As this approach demands the computation of a much higher amount of stochastic events, parallelization is absolutely needed when considering models with complex geometries and huge dependent reaction networks. Thus our approach is to develop a hybrid solver based on small-state, long period random number generators using single- and dual-GPU setups to supply event trajectories with high quality random numbers on the fly.
- (X-ray) When dealing with a large amount of reconstructions of diffraction imaging from X-ray lasers, the algorithm should be parallelized and distributed over many cores. We are attempting to develop a software that speed up the reconstruction processing by using a multi-layer parallel framework (CUDA + MPI).
- (PARALUTION) The parallelism in the LU-type of preconditioners/smoothers is based on graph analyzing phase such as multi-coloring (MC) and maximal independent set (MIS) decomposition. In addition, we study various approximate inverse techniques which can be not only applied but also constructed in parallel. Users can try our methods with our library - PARALUTION! This is a library which enables you to utilize various sparse iterative solvers and preconditioners on multi/many-core CPU and GPU devices. Based on C++, it provides generic and flexible design which allows seamless integration with other scientific software packages.
- We are also actively looking for new challenging and interesting applications.
- Parallelization and performance in simulation of disease spread by animal transfer. BSc-thesis by Fredrik Pasanen and Magnus Söderling. This work was done joint with Stefan Widgren at the Swedish National Veterinary Institute.
internal project page
Senior: Stefan Engblom (Contact), Sverker Holmgren, Dimitar Lukarski.
Ph.D. students: Magnus Gustafsson, Marcus Holm, Pavol Bauer, Jing Liu.