Coherence in multicores introduces complexity and overhead (directory, state bits) in exchange for local caching, while being "invisible" to the memory consistency model. In this paper we show that a much simpler (directory-less/broadcast-less) multicore coherence provides almost the same performance without the complexity and overhead of a directory protocol. Motivated by recent efforts to simplify coherence for disciplined parallelism, we propose a hardware approach that does not require any application guidance. The cornerstone of our approach is a run-time, application-transparent, division of data into private and shared at a page-level granularity. This allows us to implement a dynamic write-policy (write-back for private, write-through for shared), simplifying the protocol to just two stable states. Self-invalidation of the shared data at synchronization points allows us to remove the directory (and invalidations) completely, with just a data-race-free guarantee (at the write-through granularity) from software. Allowing multiple simultaneous writers and merging their writes, relaxes the DRF guarantee to a word granularity and optimizes traffic. This leads to our main result: a virtually costless coherence that uses the same simple protocol for both shared, DRF data and private data (differentiating only in the timing of when to put data back in the last-level cache) while at the same time approaching the performance (within 3%) of a complex directory protocol.
Available as PDF (294 kB, no cover)
Download BibTeX entry.