Both software-initiated and hardware-initiated prefetching have been used to accelerate shared-memory server performance. While software-initiated prefetching requires instruction set and compiler support, hardware prefetching often requires additional hardware structures or extra memory state.
The coherence batching scheme proposed in this paper keeps the system completely binary transparent and does not rely on any additional hardware. Hence, it can be implemented in software-coherent systems without extra hardware, and it can improve performance for binaries that have already been optimized and compiled.
We have evaluated our proposals on a trap-based memory architecture in which fine-grained coherence permission checks are done in hardware but the coherence protocol runs in software on the requesting processor. Functional full-system simulation shows that our software-only coherence batching scheme reduces the number of coherence misses by up to 60 percent compared to a system without coherence batching. The average miss reduction is 37 percent, and the average bandwidth usage is also reduced.
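The idea of handling several lines per coherence trap can be illustrated with a toy model. The sketch below is not the scheme evaluated in the paper: the function name `count_misses`, the aligned-neighbor batching policy, the 64-byte line size, and the access stream are all assumptions made for illustration, and the model ignores invalidations and other nodes entirely. Its counts are properties of the toy stream only and say nothing about the reported 60 and 37 percent figures.

```python
def count_misses(addresses, line_size=64, batch=1):
    """Count coherence-miss traps for a stream of byte addresses.

    Toy model (assumed, not taken from the paper): a trap occurs when the
    accessed line has no permission. The software handler then acquires
    permission for the aligned group of `batch` consecutive lines that
    contains the missing line. Invalidations are not modeled.
    """
    permitted = set()  # lines this node currently holds permission for
    misses = 0
    for addr in addresses:
        line = addr // line_size
        if line not in permitted:
            misses += 1  # permission check fails, software handler runs
            first = (line // batch) * batch
            permitted.update(range(first, first + batch))
    return misses


# Hypothetical access stream: one access to each of 16 consecutive lines.
stream = [i * 64 for i in range(16)]

no_batching = count_misses(stream, batch=1)    # one trap per line
with_batching = count_misses(stream, batch=4)  # one trap per 4-line group
```

With `batch=1` every line in this stream traps once, while `batch=4` takes one trap per aligned group of four lines. A stream with little spatial locality would see less benefit and could fetch lines it never uses, which is why the effect on bandwidth has to be measured rather than assumed.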