
Task-Parallel Implementation of the QZ Algorithm for Computing Eigenvalues

Background

LAPACK is one of the most widely used numerical libraries in the scientific computing and computational science communities. It implements important linear algebra subroutines such as the LU, Cholesky, and QR factorizations, as well as eigenvalue and singular value decompositions. New parallel versions of the library are currently being developed: PLASMA for multicore shared-memory systems, and MAGMA for heterogeneous systems combining multicore nodes with GPU accelerators.

The state-of-the-art algorithms for eigenvalue computations in LAPACK and its descendants have been developed at Umeå University by Lars Karlsson and Bo Kågström, among others. These algorithms are not yet implemented in PLASMA or MAGMA.

In the Application Performance group of UPMARC we have developed a library, D-TeiP (Dependent Tasks Executed in Parallel), for task-based parallelization of general algorithms. Its most important features are that it is flexible, uses only local scheduling decisions, and relieves the user from reasoning explicitly about dependencies.
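
As a rough illustration of this programming model (not of D-TeiP's actual API, which is not reproduced here), consider OpenMP tasks with depend clauses: the programmer only declares which data each task reads and writes, and the runtime derives a correct execution order from those declarations.

    #include <stdio.h>

    /* Illustration of runtime-managed task dependencies with OpenMP
     * depend clauses (not D-TeiP's API).  Task B reads x, which task A
     * writes, so the runtime runs A before B; the user writes no
     * explicit ordering or synchronization between the two tasks. */
    int main(void)
    {
        double x = 0.0, y = 0.0;
        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task depend(out: x)               /* task A: produces x */
            x = 1.0;
            #pragma omp task depend(in: x) depend(out: y) /* task B: consumes x */
            y = 2.0 * x;
            #pragma omp taskwait
            printf("y = %f\n", y);                        /* prints y = 2.000000 */
        }
        return 0;
    }

Compiled with, for example, gcc -fopenmp, the two tasks may run on any thread, but the depend clauses guarantee that B never starts before A has finished.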

Ideas and Goals

It is already clear that task-based programming models are competitive for straightforward algorithms such as the LU factorization. However, in this project we want to test whether they can also be successful for expressing much more complex (two-sided) algorithms such as the QZ algorithm for the generalized eigenvalue problem.
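
To make the straightforward case concrete, the sketch below expresses a tiled Cholesky factorization as tasks; Cholesky stands in for LU here because it avoids pivoting, and OpenMP depend clauses stand in for D-TeiP. The tile kernels are standard LAPACKE/CBLAS routines, and the point is that the whole one-sided factorization is a simple loop nest in which every kernel call becomes one task.

    #include <lapacke.h>
    #include <cblas.h>

    /* Tiled right-looking Cholesky factorization (A = L*L^T, lower part),
     * with every tile kernel submitted as a task.  A is n-by-n and
     * column-major, and n is assumed divisible by the tile size nb.
     * The depend clauses give the runtime enough information to build
     * and obey the task graph. */
    void tiled_cholesky(double *A, int n, int nb, int lda)
    {
        int nt = n / nb;                          /* tiles per row/column */
        #pragma omp parallel
        #pragma omp single
        for (int k = 0; k < nt; k++) {
            double *Akk = A + (size_t)k * nb * (lda + 1);
            #pragma omp task depend(inout: Akk[0])
            LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', nb, Akk, lda);

            for (int i = k + 1; i < nt; i++) {    /* panel solves */
                double *Aik = A + (size_t)i * nb + (size_t)k * nb * lda;
                #pragma omp task depend(in: Akk[0]) depend(inout: Aik[0])
                cblas_dtrsm(CblasColMajor, CblasRight, CblasLower,
                            CblasTrans, CblasNonUnit, nb, nb,
                            1.0, Akk, lda, Aik, lda);
            }
            for (int i = k + 1; i < nt; i++) {    /* trailing update */
                double *Aik = A + (size_t)i * nb + (size_t)k * nb * lda;
                double *Aii = A + (size_t)i * nb * (lda + 1);
                #pragma omp task depend(in: Aik[0]) depend(inout: Aii[0])
                cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                            nb, nb, -1.0, Aik, lda, 1.0, Aii, lda);
                for (int j = k + 1; j < i; j++) {
                    double *Ajk = A + (size_t)j * nb + (size_t)k * nb * lda;
                    double *Aij = A + (size_t)i * nb + (size_t)j * nb * lda;
                    #pragma omp task depend(in: Aik[0], Ajk[0]) depend(inout: Aij[0])
                    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                                nb, nb, nb, -1.0, Aik, lda, Ajk, lda,
                                1.0, Aij, lda);
                }
            }
        }
    }

A two-sided algorithm such as the Hessenberg reduction offers no such clean loop nest: transformations are applied from both sides and the task graph is far less regular, which is exactly what makes it a demanding test case.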

By using our library to parallelize the latest QZ algorithm(s) we hope to

  1. show that our library can handle complex and dynamic algorithms,
  2. explore how to best parallelize the new algorithms,
  3. influence how the next generation of LAPACK solves this problem.

Project

The first part of the project consists of parallelizing one part of the QZ algorithm: the reduction to upper Hessenberg form. The PLASMA team has developed a parallel two-stage algorithm [1] for this, in which the matrix is first reduced to block Hessenberg form and then, in a second stage, to Hessenberg form. Lars Karlsson and Bo Kågström have developed a more efficient second stage [2], which we are interested in implementing.
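
For reference, the conventional one-stage reduction that such two-stage algorithms compete with is available in LAPACK as dgehrd; a minimal call through the LAPACKE interface looks as follows.

    #include <stdlib.h>
    #include <lapacke.h>

    /* Baseline: LAPACK's one-stage Hessenberg reduction (dgehrd).  A is
     * n-by-n, column-major; on exit its upper Hessenberg part holds H,
     * and the entries below the first subdiagonal, together with tau,
     * encode the orthogonal factor Q.  Returns the LAPACK info code. */
    int hessenberg_reduce(double *A, int n)
    {
        double *tau = malloc((size_t)(n > 1 ? n - 1 : 1) * sizeof *tau);
        if (tau == NULL) return -1;
        /* ilo = 1, ihi = n: reduce the whole matrix. */
        int info = LAPACKE_dgehrd(LAPACK_COL_MAJOR, n, 1, n, A, n, tau);
        free(tau);
        return info;
    }

Even in its blocked form, dgehrd spends a significant fraction of its time in memory-bound matrix-vector operations; the two-stage approach of [1] moves almost all of the work into compute-bound matrix-matrix kernels, concentrating the remaining difficulty in the second stage that [2] improves.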

The second part would be to vary the implementation and explore successful strategies. We would start with a pure shared-memory parallelization, but if that proves successful we can try to extend it to distributed-memory systems. A key issue there will be scheduling the tasks across the nodes so as to ensure memory locality.
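
One standard starting point for such locality-aware scheduling is to give every tile a fixed home node via a 2D block-cyclic distribution, as in ScaLAPACK, and to schedule each task on the node that owns the tile it writes. The mapping itself is only a few lines; treating it as the scheduling policy is an illustrative assumption on our part, not an existing D-TeiP feature.

    /* 2D block-cyclic mapping of tile (i, j) onto a P-by-Q process grid,
     * as used by ScaLAPACK.  An "owner computes" scheduler would place a
     * task on tile_owner(i, j, P, Q) for the tile it writes, so that
     * output tiles never migrate between nodes. */
    static inline int tile_owner(int i, int j, int P, int Q)
    {
        return (i % P) * Q + (j % Q);    /* row-major rank within the grid */
    }

For example, on a 2-by-3 grid, tiles (0, 0) and (2, 3) both map to rank 0, so tasks writing either tile share a home node.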

Required Skills

  • Great interest in linear algebra
  • Solid programming experience

References

[1] Hatem Ltaief, Jakub Kurzak, and Jack Dongarra. Parallel Block Hessenberg Reduction using Algorithms by Tiles for Multicore Architectures Revisited. LAPACK Working Note 208, August 2009.
[2] Lars Karlsson and Bo Kågström. Efficient Reduction from Block Hessenberg Form to Hessenberg Form using Shared Memory. Report UMINF 10.12, Umeå University, 2010.
