Prefetching has proven useful for reducing cache misses in multiprocessors, but at the cost of increased coherence traffic. This is especially troublesome for snooping-based systems, where the available coherence bandwidth is often the scalability bottleneck.
The new bundling technique, introduced in this paper, reduces the overhead caused by prefetching in two ways: by piggybacking prefetches onto normal requests, and by requiring only one device on the "bus" to perform a snoop lookup for each prefetch transaction. This paper describes bundling implementations for three important transaction types: reads, upgrades, and downgrades.
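The piggybacking idea can be illustrated with a toy model. The sketch below is not the paper's implementation; the class and function names are hypothetical, and it simply shows a demand-miss address and its sequential prefetch addresses riding in one transaction instead of several.

```python
# Illustrative sketch only: a toy model of a bundled read transaction.
# Names (BundledRead, issue_miss) are invented for this example.

from dataclasses import dataclass, field

@dataclass
class BundledRead:
    demand_addr: int                     # the ordinary demand-miss address
    prefetch_addrs: list = field(default_factory=list)  # piggybacked prefetches

def issue_miss(addr, degree=2, block=64):
    """On a demand miss, piggyback the next `degree` sequential blocks
    onto the same bus transaction instead of issuing them separately."""
    prefetches = [addr + i * block for i in range(1, degree + 1)]
    return BundledRead(demand_addr=addr, prefetch_addrs=prefetches)

tx = issue_miss(0x1000)
# One bus transaction now carries three addresses; unbundled, the same
# work would take three separate transactions, each snooped by everyone.
```

Under this (simplified) model, the prefetch addresses consume no extra address transactions, and only the responder needs to look them up, which is the intuition behind the snoop-bandwidth savings reported below.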
While bundling could reduce the overhead of most existing prefetch schemes, the evaluation in this paper is limited to two of them: sequential prefetching and Dahlgren's adaptive sequential prefetching.
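For readers unfamiliar with the second scheme, the following is a much-simplified sketch in the spirit of Dahlgren's adaptive sequential prefetching: the prefetch degree is raised when most prefetched blocks are actually used and lowered when they are not. The thresholds, window size, and class name here are illustrative assumptions, not the paper's parameters.

```python
# Simplified sketch of adaptive sequential prefetching (after Dahlgren).
# All constants and names below are illustrative, not from the paper.

class AdaptiveDegree:
    def __init__(self, degree=1, max_degree=8):
        self.degree = degree          # how many sequential blocks to prefetch
        self.max_degree = max_degree
        self.issued = 0
        self.useful = 0

    def record(self, was_useful):
        """Call once per completed prefetch; True if the block was used."""
        self.issued += 1
        self.useful += int(was_useful)
        if self.issued == 16:         # re-evaluate every 16 prefetches
            ratio = self.useful / self.issued
            if ratio > 0.75 and self.degree < self.max_degree:
                self.degree += 1      # mostly useful: prefetch more aggressively
            elif ratio < 0.25 and self.degree > 0:
                self.degree -= 1      # mostly wasted: back off
            self.issued = self.useful = 0

ctrl = AdaptiveDegree()
for _ in range(16):
    ctrl.record(True)                 # a window of all-useful prefetches
# ctrl.degree has grown from 1 to 2
```

The point of showing this scheme is that it is deliberately simple; the result below is that even such naive prefetchers, once bundled, improve on no prefetching across all measured bandwidth metrics.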
Both schemes have their snoop bandwidth cut roughly in half for all the commercial and scientific benchmarks studied. The combined effect of bundling applied to these fairly naive prefetch schemes lowers the cache miss rate, the address bandwidth, and the snoop bandwidth compared with no prefetching for all applications, a result not previously demonstrated.
Bundling will not, however, reduce the data bandwidth introduced by a prefetch scheme. We argue that data bandwidth is more easily scaled than snoop bandwidth in snoop-based coherence systems.