In computational science (including fields such as astrophysics, quantum chemistry, materials science, and genetics), the application problems of scientific interest require parallel execution due to the extreme volumes of data and computation involved. Scientific progress in these fields also requires a combination of increasingly powerful computer systems, more advanced models, and more sophisticated parallel algorithms. The universal shift to multicore-based system design offers increased performance in theory, but parallel legacy codes typically run at a fraction of peak performance and scale poorly, and developing parallel software for real application problems that meets reasonable performance requirements is becoming prohibitively costly in developer hours.
The picture to the right shows the Kalkyl cluster at UPPMAX, with 348 compute nodes of 8 cores each, as an example of a multicore-based high-performance computer system.
With the multicore revolution, other types of performance-critical applications where parallelization matters are also emerging. One example is computationally heavy algorithms, such as encryption and decryption, running on battery-powered hand-held mobile devices. In this case, parallelization across the cores is needed to lower the energy consumption.
To tackle the parallel performance issue for multicores, the basic philosophy of parallel programming needs to be adapted. Previously, load balancing and minimizing communication were the dominant concerns. Now, the flow of data between the processor chip and main memory is a major bottleneck, and hence an area of opportunity for performance speedups, whereas cache-to-cache communication within a multicore is essentially free. Furthermore, a core is inexpensive compared to a full computer in a multicomputer system, which makes speculative computation more affordable. We will
- map and classify algorithms and their performance properties with respect to different hardware
- develop principles for constructing high performing algorithms
- explore alternative algorithms and models that may be sub-optimal in serial or for pre-multicore architectures, but yield high multicore performance
- consider new types of applications.
An important complement to the algorithmic development is the efficient system level support provided by the UPMARC research direction Efficiency and Predictability.
To increase programmer efficiency, we will develop models and tools to handle (complex) data dependencies, to maximize data locality, and to schedule tasks onto cores and accelerators. This is closely connected with the direction Ease of Programming, in terms of providing better ways to express parallelism. We will
- shift the focus from parallelization of software to parallel software development
- develop library software to support fast development of parallel application software
- develop frameworks to allow domain experts to express intent and leave low-level optimization to hardware experts
- demonstrate our approaches on real application problems
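To make the task-scheduling idea above concrete, the sketch below is a minimal dependency-aware scheduler written with only the Python standard library. It is an illustrative assumption, not any of the project's actual library software: tasks declare their prerequisites, and the thread pool runs any task whose prerequisites have completed.

```python
# Hedged sketch: a minimal dependency-aware task scheduler (illustrative only).
# tasks: {name: callable}; deps: {name: set of prerequisite names}.
# Cycles in the dependency graph are not detected.
import threading
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_graph(tasks, deps, workers=4):
    done, futures = set(), {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while len(done) < len(tasks):
            # Submit every not-yet-started task whose prerequisites are done.
            for name in tasks:
                if name not in done and name not in futures \
                        and deps.get(name, set()) <= done:
                    futures[name] = pool.submit(tasks[name])
            finished, _ = wait(futures.values(), return_when=FIRST_COMPLETED)
            for name in list(futures):
                if futures[name] in finished:
                    futures[name].result()  # re-raise any task exception
                    done.add(name)
                    del futures[name]
    return done


# Tiny usage example: a diamond-shaped graph a -> {b, c} -> d.
order, lock = [], threading.Lock()

def record(name):
    def task():
        with lock:
            order.append(name)
    return task

run_graph({n: record(n) for n in "abcd"},
          {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}})
```

The design point is that the programmer only states the dependency structure; the decision of which core runs which task, and in what order, is left to the runtime.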
The figure shows the dependency graph for a simple 4x4 block LU factorization.
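The structure of such a graph can be enumerated programmatically. The sketch below is a hypothetical Python illustration of a right-looking tiled LU factorization; the kernel names (GETRF, TRSM, GEMM) follow LAPACK/BLAS convention, and the code records, for each task, the set of tasks it depends on.

```python
# Hedged sketch: build the task dependency graph for a right-looking
# blocked LU factorization on an n x n grid of tiles (illustrative only).

def lu_task_graph(n):
    deps = {}  # task -> set of tasks it depends on

    def add(task, *parents):
        # Keep only parents that actually exist (drops k-1 == -1 cases).
        deps[task] = {p for p in parents if p in deps}

    for k in range(n):
        # Factor the diagonal tile; depends on its last trailing update.
        add(("GETRF", k), ("GEMM", k, k, k - 1))
        for i in range(k + 1, n):
            # Triangular solves on the panel column and row.
            add(("TRSM_COL", i, k), ("GETRF", k), ("GEMM", i, k, k - 1))
            add(("TRSM_ROW", k, i), ("GETRF", k), ("GEMM", k, i, k - 1))
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                # Trailing-matrix update of tile (i, j).
                add(("GEMM", i, j, k),
                    ("TRSM_COL", i, k), ("TRSM_ROW", k, j),
                    ("GEMM", i, j, k - 1))
    return deps


graph = lu_task_graph(4)
print(len(graph))  # 30 tasks for the 4x4 tile grid
```

Even for a 4x4 grid, the graph exposes many tasks that may run concurrently (for example, all GEMM updates at a given step k are mutually independent), which is exactly the parallelism a task scheduler can exploit.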