Parallel Algorithms for Multicore Processors
Staff
Senior: Stefan Engblom (Contact), Sverker Holmgren.
Postdoctoral: Jonathan Bull.
Ph.D. students: Pavol Bauer, Jing Liu, Karl Ljungkvist.
Past members: Magnus Grandin, Dimitar Lukarski, Marcus Holm.
Motivation and Background
We are looking at various examples of applications of Scientific Computing where we are taking advantage of today's multicore and manycore computers. This currently involves differential equations, Nbody problems, eventbased models, and inverse formulations. Challenges in this area include balancing memory accesses and computations, understanding synchronization, finding optimal granularity, and balancing cost and accuracy. Users of these results are often other researchers trying to design or comprehend complex physical systems such as airflow around wind power plants, flow in microfluidic systems, chemical reactions, or combustion in engines.
Subprojects
 Nbody problems involve the dynamics of a population of N particles in a potential field. Every particle interacts with every other particle via the field. Notable examples are charged particles in an electrical field and massive objects in a gravitational field. The challenge lies in reducing the cost of computing the forces between N particles, which scales with N^2. In many cases the force between particles decays with the distance between them, allowing distant interactions to be neglected. The Fast Multipole Method (FMM) operates on this principle and is capable of reducing the effort to O(N log N) or O(N). We are working on a parallel adaptive implementation in three dimensions with the goal of achieving high efficiency at very large N.
 The Finite Element Method (FEM) is an important numerical method used for simulation in application fields such as structural mechanics and fluid dynamics. However, classical matrixbased FEM algorithms access too much data per arithmetic operation to fully utilize modern many and multicore processors, especially when solving problems in 3D with elements of high order. We instead use a matrixfree approach, which has a significantly smaller memoryaccess footprint. Additional advantages are that no matrix assembly is needed, and that larger problems can be solved since no matrix needs to be stored. In our research, we develop Deal.II, an opensource C++ library for finiteelement computations which is widely used by researchers in application fields such as geoscience.
 Eventbased models are used to model discrete models, that are altered by instantaneous changes in time, socalled events. Due to the possibility to incorporate stochastic dynamics as for example Markov Chains, such models have gained popularity in fields as Computational Systems Biology or Computational Epidemiology. We are studying the parallelizaiton of eventbased models for multisocket processors, that are typically equipped with 32 or more cores. Due to the mathematical properties of the underlying models, the parallelization of discrete event simulations is not trivial and requires a deep understanding of the model sychronization behaviour.
 The flash Xray single particle diffraction Imaging (FXI) is a cuttingedge technology that solves structures of small biological objects without using crystals. Other than current Xray beam limitation, recontructing a 3D structure is challenging task due to the large mount of diffraction patterns gernerated from FXI experiments and the computeintensive reconstruction algorithm. Hence, selecting proper diffraction patterns and reconstructe models efficiently and automatically are important, and our current aims.
Publications
Refereed
 Efficient interprocess synchronization for parallel discrete event simulation on multicores. In Proc. 3rd ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, pp 183194, ACM Press, New York, 2015. (DOI).Machine learning for ultrafast Xray diffraction patterns on largescale GPU clusters. In The international journal of high performance computing applications, volume 29, pp 233243, 2015. (DOI, fulltext:postprint).Stable difference methods for blockoriented adaptive grids. In Journal of Scientific Computing, volume 65, pp 486511, 2015. (DOI).Sensitivity estimation and inverse problems in spatial stochastic models of chemical kinetics. In Numerical Mathematics and Advanced Applications: ENUMATH 2013, volume 103 of Lecture Notes in Computational Science and Engineering, pp 519527, Springer, 2015. (DOI).Matrixfree finiteelement operator application on graphics processing units. In EuroPar 2014: Parallel Processing Workshops, Part II, volume 8806 of Lecture Notes in Computer Science, pp 450461, Springer, 2014. (DOI).On the stability of stochastic jump kinetics. In Applied Mathematics, volume 5, pp 32173239, 2014. (DOI).Dynamic autotuning of adaptive fast multipole methods on hybrid multicore CPU and GPU systems. In SIAM Journal on Scientific Computing, volume 36, pp C376C399, 2014. (DOI).Xray laser imaging of biomolecules using multiple GPUs. In Parallel Processing and Applied Mathematics: Part I, volume 8384 of Lecture Notes in Computer Science, pp 480489, SpringerVerlag, Berlin, 2014. (DOI).On the impact of the heterogeneous multicore and manycore platforms on iterative solution methods and preconditioning techniques. In HighPerformance Computing on Complex Environments, pp 1332, WileyBlackwell, Hoboken, NJ, 2014. (DOI).Accelerating COBAYA3 on multicore CPU and GPU systems using PARALUTION. In Proc. 2nd Joint International Conference on Supercomputing in Nuclear Applications and Monte Carlo, La Société Française d'Energie Nucléaire, Paris, France, 2013.Accurate surface embedding for higher order finite elements. In Proc. 12th ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp 187192, ACM Press, New York, 2013. (DOI).Adaptive fast multipole methods on the GPU. In Journal of Supercomputing, volume 63, pp 897918, 2013. (DOI).URDME: a modular framework for stochastic simulation of reactiontransport processes in complex geometries. In BMC Systems Biology, volume 6, pp 76:117, 2012. (DOI).Communicationefficient algorithms for numerical quantum dynamics. In Applied Parallel and Scientific Computing: Part II, volume 7134 of Lecture Notes in Computer Science, pp 368378, SpringerVerlag, Berlin, 2012. (DOI).Using hardware transactional memory for highperformance computing. In Proc. 25th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, pp 16601667, IEEE, Piscataway, NJ, 2011. (DOI).On wellseparated sets and fast multipole methods. In Applied Numerical Mathematics, volume 61, pp 10961102, 2011. (DOI).An implementation framework for solving highdimensional PDEs on massively parallel computers. In Numerical Mathematics and Advanced Applications: 2009, pp 417424, SpringerVerlag, Berlin, 2010. (DOI).
Theses
 Parallelism and efficiency in discreteevent simulation. Licentiate thesis, IT licentiate theses / Uppsala University, Department of Information Technology nr 2015004, Uppsala University, 2015. (fulltext).Techniques for finite element methods on modern processors. Licentiate thesis, IT licentiate theses / Uppsala University, Department of Information Technology nr 2015001, Uppsala University, 2015. (fulltext).Adaptive Solvers for HighDimensional PDE Problems on Clusters of Multicore Processors. Ph.D. thesis, Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology nr 1199, Acta Universitatis Upsaliensis, Uppsala, 2014. (fulltext, preview image).Scientific computing on hybrid architectures. Licentiate thesis, IT licentiate theses / Uppsala University, Department of Information Technology nr 2013002, Uppsala University, 2013. (fulltext).Towards an adaptive solver for highdimensional PDE problems on clusters of multicore processors. Licentiate thesis, IT licentiate theses / Uppsala University, Department of Information Technology nr 2012003, Uppsala University, 2012. (fulltext).
Other
 Fast Matlab compatible sparse assembly on multicore computers. In Parallel Computing, volume 56, pp 117, 2016. (DOI).Parallel data structures and algorithms for highdimensional structured adaptive mesh refinement. Technical report / Department of Information Technology, Uppsala University nr 2014020, 2014. (External link).Data structures and algorithms for highdimensional structured adaptive mesh refinement. Technical report / Department of Information Technology, Uppsala University nr 2014019, 2014. (External link).GPUParallel simulation of rigid fibers in Stokes flow. Student thesis, supervisor: Stefan Engblom, examiner: Jarmo Rantakokko, Olle Gällmo, IT nr 14 029, 2014. (fulltext).Numerical evaluation of the CommunicationAvoiding Lanczos algorithm. Technical report / Department of Information Technology, Uppsala University nr 2012001, 2012. (External link).Parallelization and performance in simulation of disease spread by animal transfer. Student thesis, supervisor: Stefan Engblom, examiner: Jarmo Rantakokko, Olle Gällmo, IT nr 12 002, 2012. (fulltext).Efficient implementation of a highdimensional PDEsolver on multicore processors. In Proc. 2nd Swedish Workshop on MultiCore Computing, pp 6466, Department of Information Technology, Uppsala University, Uppsala, Sweden, 2009.
Software

 URDME: a general software framework for modeling and simulation of stochastic reactiondiffusion processes on unstructured, tetrahedral and triangular meshes.
 SimInf: a flexible and efficient framework for stochastic disease spread simulations in animal populations.
 FMM2D: a twodimensional adaptive fast multipole method with a Matlab mexinterface. Currently available in a serial C99version and a fully ported CUDA/C++version. Download here.
 FMM3D: a threedimensional adaptive fast multipole method with a Matlab mexinterface. Currently under development.
 deal.II: An opensource C++ library for finiteelement computations.