The Chunks & Tasks framework provides a way to simplify parallelization of algorithms by using separate modules for data access and task execution. The key idea is that the programmer defines his algorithm in terms of well-defined tasks that access data via a chunk management system. This approach is particularly appealing when the problem can be solved using a dynamic hierarchic algorithm. Examples of applications where dynamic hierarchic algorithms are used include blocked sparse matrix algebra and multipole methods, both of which are important in large scale electronic structure calculations. Another example is finite difference methods using adaptive grids.
When writing a program using the Chunks & Tasks framework, all the programmer needs to know is the Chunks & Tasks interface. This interface consists of a set of functions available for the programmer that allows for creation of new tasks and access to data through chunk objects. This is a great advantage since once the algorithm has been implemented using the Chunks & Tasks interface, it can be executed using a multitude of different platforms without changing the source code. Different platforms utilizing different types of parallelism and storage devices can be used as long as a Chunks & Tasks implementation exists for the platform in question. Optimization related to specific technical issues such as for example the best way to store a chunk can be left to the implementation of the Chunks & Tasks framework. Thus, once your algorithm has been formulated using a Chunks & Tasks interface, you can benefit from future optimizations and hardware developments without rewriting your program.
Our current implementation of the Chunks & Tasks framework is written in C++ and uses the work stealing parallel programming paradigm to achieve good scalability with respect to the number of compute nodes. We have implemented a task scheduler and a chunk manager as two separate modules. Both modules use the Message Passing Interface (MPI) and communication conflicts are avoided using separate MPI communicators. Chunks are stored distributed on the compute nodes' memory and data locality is achieved by making sure that chunks created by a task running on a particular node is stored on the same node. Thus, communication is only needed when the chunk is referenced by a task executed by another compute node. To further reduce the need for communication, each compute node uses a chunk cache that keeps copies of the most recently used chunks.