Implementing Low Latency Distributed Software-Based Shared Memory
Zoran Radovic and Erik Hagersten
In Proceedings of the Workshop on Memory Performance Issues (WMPI 2001), held in conjunction with the 28th International Symposium on Computer Architecture (ISCA28), Göteborg, Sweden, June 2001.
Software implementations of shared memory still lag far behind hardware-based shared memory implementations (HW-DSM) in performance and are not viable options for most fine-grain shared memory applications. The major source of their inefficiency is the cost of interrupt-based asynchronous protocol processing, not the actual network latency. As the raw hardware latency of inter-node communication decreases, the asynchronous overhead in the communication becomes more dominant. We describe how all interrupt- and/or poll-based asynchronous protocol processing can be completely removed by running the entire coherence protocol in the requesting processor. This not only removes the asynchronous overhead, but also makes use of a processor that would otherwise stall. The technique is applicable to both page-based and fine-grain software-based shared memory. DSZOOM-WF, the implementation presented in this paper, is a sequentially consistent, fine-grain distributed software-based shared memory. It demonstrates a protocol-handling overhead below a microsecond for all the actions involved in a remote load operation, compared with around ten microseconds for the fastest implementation to date. The all-software protocol is implemented assuming some basic low-level primitives in the cluster interconnect and an operating system bypass functionality, similar to the emerging InfiniBand standard. DSZOOM-WF demonstrates performance consistently comparable to HW-DSM implementations.
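The central idea of the abstract, that the requesting processor runs the whole coherence protocol so the remote node never takes an interrupt or polls, can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions and is not the DSZOOM-WF protocol itself: the names Node, remote_load and remote_store are hypothetical, a threading.Lock stands in for a remote atomic operation on a directory entry, and plain list and dictionary accesses stand in for remote reads and writes over the interconnect. The sketch is meant to be run sequentially.

```python
import threading


class Node:
    """Hypothetical cluster node. It holds data and directory state,
    but it never runs protocol code on behalf of another node."""

    def __init__(self, node_id, num_lines):
        self.node_id = node_id
        self.memory = [0] * num_lines                      # home copy of each line
        self.sharers = [set() for _ in range(num_lines)]   # directory: ids of nodes caching the line
        # Assumption: one lock per directory entry, standing in for a
        # remote atomic (for example fetch-and-set) issued by the requester.
        self.locks = [threading.Lock() for _ in range(num_lines)]
        self.cache = {}                                    # (home_id, line) -> cached value


def remote_load(requester, home, line):
    """Load a line; on a miss the requester performs every protocol step."""
    key = (home.node_id, line)
    if key in requester.cache:
        return requester.cache[key]                        # local hit, no communication
    with home.locks[line]:                                 # requester locks the directory entry
        value = home.memory[line]                          # requester reads the home copy
        home.sharers[line].add(requester.node_id)          # requester records itself as a sharer
    requester.cache[key] = value
    return value


def remote_store(requester, home, line, value, nodes):
    """Store to a line; the requester invalidates other sharers itself."""
    key = (home.node_id, line)
    with home.locks[line]:
        for sharer_id in home.sharers[line]:
            if sharer_id != requester.node_id:
                nodes[sharer_id].cache.pop(key, None)      # invalidate by writing remotely
        home.sharers[line] = {requester.node_id}
        home.memory[line] = value
    requester.cache[key] = value


nodes = [Node(i, num_lines=4) for i in range(3)]
nodes[0].memory[2] = 7
first = remote_load(nodes[1], nodes[0], 2)     # node 1 misses and fetches from home node 0
remote_store(nodes[2], nodes[0], 2, 9, nodes)  # node 2 writes and invalidates node 1's copy
second = remote_load(nodes[1], nodes[0], 2)    # node 1 misses again and sees the new value
```

In this sketch no function is ever invoked "on" the home node or on a sharer: every step executes in the caller, which mirrors the claim that the asynchronous handler on the remote side can be removed entirely. A real system would need the interconnect to provide the remote read, write and atomic primitives that the simulation takes for granted.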
BibTeX file entry: Radovic:2001:jun