In Proceedings of the Workshop on Memory Performance Issues (WMPI 2001), held in conjunction with the 28th International Symposium on Computer Architecture (ISCA28), Göteborg, Sweden, June 2001.
Software implementations of shared memory still lag far behind hardware-based shared memory implementations (HW-DSM) in performance and are not viable options for most fine-grain shared memory applications. The major source of their inefficiency is the cost of interrupt-based asynchronous protocol processing, not the actual network latency. As the raw hardware latency of inter-node communication decreases, the asynchronous overhead in the communication becomes more dominant. We describe how all interrupt- and/or poll-based asynchronous protocol processing can be completely removed by running the entire coherence protocol in the requesting processor. This not only removes the asynchronous overhead, but also makes use of a processor that would otherwise stall. The technique is applicable to both page-based and fine-grain software-based shared memory. DSZOOM-WF, the implementation presented in this paper, is a sequentially consistent, fine-grain distributed software-based shared memory. It demonstrates a protocol-handling overhead below a microsecond for all the actions involved in a remote load operation, compared with around ten microseconds for the fastest implementation to date. The all-software protocol is implemented assuming some basic low-level primitives in the cluster interconnect and an operating system bypass functionality, similar to the emerging InfiniBand standard. DSZOOM-WF demonstrates performance consistently comparable to HW-DSM implementations.
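The idea of running the entire coherence protocol in the requesting processor can be illustrated with a small single-process simulation. This is a minimal sketch under stated assumptions, not the DSZOOM-WF implementation: the `Node`, `Interconnect`, and `remote_load` names, the directory layout (owner plus sharer set), and the choice of an atomic test-and-set with get/put as the "basic low-level primitives" are all hypothetical choices made for illustration. The point it shows is that the requester locks the directory entry, fetches the data, and updates the directory itself, so no interrupt handler or polling loop ever runs on the home node.

```python
import threading


class Node:
    """One cluster node: home memory, per-line directory, and a local cache."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.memory = {}     # home copies: line address -> value
        self.directory = {}  # line address -> {"lock", "owner", "sharers"}
        self.cache = {}      # local copies: line address -> value

    def add_home_line(self, addr, value):
        self.memory[addr] = value
        self.directory[addr] = {
            "lock": threading.Lock(),
            "owner": None,      # node id holding a modified copy, if any
            "sharers": set(),   # node ids holding read-only copies
        }


class Interconnect:
    """Hypothetical stand-in for the assumed primitives (atomic, get, put)."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.remote_handlers_run = 0  # never incremented: no remote CPU is involved

    def test_and_set(self, home, addr):
        # Models an atomic test-and-set on a remote lock word.
        return self.nodes[home].directory[addr]["lock"].acquire(blocking=False)

    def release(self, home, addr):
        self.nodes[home].directory[addr]["lock"].release()


def remote_load(net, requester, home, addr):
    """Run every protocol step for a load in the requesting processor."""
    me = net.nodes[requester]
    if addr in me.cache:                      # local hit, no protocol action
        return me.cache[addr]
    while not net.test_and_set(home, addr):   # spin instead of interrupting home
        pass
    try:
        entry = net.nodes[home].directory[addr]   # "get" of the directory entry
        owner = entry["owner"]
        if owner is None:
            value = net.nodes[home].memory[addr]  # "get" from home memory
        else:
            value = net.nodes[owner].cache[addr]  # "get" from the owner's copy
            net.nodes[home].memory[addr] = value  # "put" write-back to home
            entry["owner"] = None                 # owner is downgraded to sharer
            entry["sharers"].add(owner)
        entry["sharers"].add(requester)
        me.cache[addr] = value
        return value
    finally:
        net.release(home, addr)
```

In this sketch the requester spins on the lock rather than asking the home node to do work, which mirrors the abstract's observation that the requesting processor would otherwise stall while waiting for the data.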
Available as PDF (151 kB)
BibTeX file entry: Radovic:2001:jun