Licentiate thesis 2003-008

Efficient Synchronization and Coherence for Nonuniform Communication Architectures

Zoran Radovic

September 2003

Abstract:

Nonuniformity is a common characteristic of contemporary computer systems, mainly because of physical distances in computer designs. In large multiprocessors, the access time to shared memory is often nonuniform and may vary by as much as a factor of ten for some nonuniform memory access (NUMA) architectures, depending on whether the memory is close to the requesting processor. Much research has been devoted to optimizing such systems.

This thesis identifies another important property of computer designs: nonuniform communication architecture (NUCA). High-end hardware-coherent machines built from a few large nodes or from chip multiprocessors are typical NUCA systems, in which reading recently written data from a neighbor's cache carries a lower penalty than reading it from a remote cache. The first part of the thesis identifies node affinity as an important property for scalable general-purpose locks. Several software-based hierarchical lock implementations that exploit NUCAs are presented and investigated. This type of lock is shown to be almost twice as fast as other software-based lock implementations for contended locks, without introducing significant overhead for uncontested locks.

Physical distance in very large systems also limits hardware coherence to a subsection of the system. Software implementations of distributed shared memory (DSM) are cost-effective solutions that extend the upper scalability limit of such machines by providing the "illusion" of shared memory across the entire system. This also creates NUCAs with even larger local-remote penalties, since coherence between the hardware-coherent subsections is maintained entirely in software.

The major source of inefficiency in traditional software DSM implementations is the cost of interrupt-based asynchronous protocol processing, not the actual network latency. As the raw hardware latency of internode communication decreases, this asynchronous overhead becomes increasingly dominant. This thesis introduces the DSZOOM system, which removes this type of overhead by running the entire coherence protocol in the requesting processor.
