Advances in semiconductor technology have driven shared-memory servers toward processors with multiple cores per die and multiple threads per core. This paper presents simple hardware primitives that enable flexible, low-complexity multi-chip designs in which an efficient inter-node coherence protocol runs in software. The design relies on two node permission bits per cache line and a new way to decouple the intra-chip coherence protocol from the inter-node coherence protocol. The protocol implementation lets the system cache remote data in the local memory system with no additional hardware support.
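To make the idea of per-line node permission bits concrete, the following is a minimal illustrative sketch, not the paper's implementation. The abstract does not define what the two bits encode; the sketch assumes they are a read permission and a write permission held by the local node. All names (`LineState`, `Node`, `SoftwareDirectory`, `LINE_SIZE`) are hypothetical, and the hardware trap into software is modeled as an ordinary method call.

```python
from dataclasses import dataclass, field

LINE_SIZE = 64  # bytes per cache line (assumed value)


@dataclass
class LineState:
    # Two node permission bits per line (ASSUMED meaning: read / write).
    can_read: bool = False
    can_write: bool = False
    data: int = 0


@dataclass(eq=False)  # identity-based hashing so nodes can live in sets
class Node:
    name: str
    lines: dict = field(default_factory=dict)
    traps: int = 0  # number of times software was invoked

    def _line(self, addr):
        return self.lines.setdefault(addr // LINE_SIZE, LineState())

    def load(self, addr, directory):
        line = self._line(addr)
        if not line.can_read:  # permission check; miss traps to software
            self.traps += 1
            directory.handle_read_miss(self, addr // LINE_SIZE)
        return line.data

    def store(self, addr, value, directory):
        line = self._line(addr)
        if not line.can_write:
            self.traps += 1
            directory.handle_write_miss(self, addr // LINE_SIZE)
        line.data = value


class SoftwareDirectory:
    """Hypothetical inter-node protocol state, kept entirely in software."""

    def __init__(self):
        self.sharers = {}  # line index -> set of nodes with read permission
        self.owner = {}    # line index -> node with write permission
        self.home = {}     # line index -> value held in home memory

    def handle_read_miss(self, node, idx):
        owner = self.owner.pop(idx, None)
        if owner is not None:
            # Downgrade the writer: write back, keep read permission.
            self.home[idx] = owner.lines[idx].data
            owner.lines[idx].can_write = False
            self.sharers.setdefault(idx, set()).add(owner)
        line = node.lines[idx]
        line.data = self.home.get(idx, 0)
        line.can_read = True
        self.sharers.setdefault(idx, set()).add(node)

    def handle_write_miss(self, node, idx):
        owner = self.owner.get(idx)
        if owner is not None and owner is not node:
            # Write back and invalidate the previous writer.
            self.home[idx] = owner.lines[idx].data
            owner.lines[idx].can_read = owner.lines[idx].can_write = False
        for sharer in self.sharers.pop(idx, set()):
            if sharer is not node:
                sharer.lines[idx].can_read = False
                sharer.lines[idx].can_write = False
        line = node.lines[idx]
        line.data = self.home.get(idx, 0)
        line.can_read = line.can_write = True
        self.owner[idx] = node
```

In this toy model, an access that already holds the needed permission never enters the software protocol; only a permission miss does, which is the sense in which the inter-node protocol is kept separate from whatever coherence runs inside a node.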
Our evaluation is based on detailed full-system simulation of both commercial and HPC workloads. We compare a low-complexity system built on the proposed primitives with aggressive hardware multi-chip shared-memory systems and show that its performance is competitive, and often better, across a large design space.