Aggressive hardware prefetching is extremely beneficial for single-threaded performance but can cause significant slowdowns on multicore processors by oversubscribing off-chip bandwidth and shared cache capacity. We address this problem by adjusting prefetching on a per-application basis to improve overall system performance. Unfortunately, exhaustively searching all possible per-application prefetching combinations for a multicore workload is prohibitively expensive, even on small processors with only four cores.
In this work we develop Perf-Insight, a simple, scalable mechanism for understanding and predicting the impact of any available hardware/software prefetching choices on applications' bandwidth consumption and performance. Our model considers the overall system bandwidth, the bandwidth sensitivity of each co-running application, and how each application's bandwidth usage and performance vary with prefetching choices. This allows us to profile applications individually and efficiently predict total system bandwidth and throughput. To make this practical, we develop a low-overhead profiling approach that scales linearly, rather than exponentially, with the number of cores, and that allows us to profile applications while they run in the mix.
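The key efficiency argument is that per-application profiles can be combined to predict whole-mix behavior: with n applications and k prefetch settings each, exhaustive search measures k^n configurations, while per-application profiling needs only k*n measurements. The sketch below illustrates this idea with a toy bandwidth-contention predictor; all names, numbers, and the oversubscription penalty are illustrative assumptions, not the paper's actual model.

```python
from itertools import product

# Hypothetical per-application profiles: for each prefetch setting,
# (bandwidth demand in GB/s, standalone speedup). Illustrative values only.
profiles = {
    "app_a": {"off": (2.0, 1.00), "aggressive": (6.0, 1.30)},
    "app_b": {"off": (1.0, 1.00), "aggressive": (3.0, 1.10)},
    "app_c": {"off": (4.0, 1.00), "aggressive": (9.0, 1.25)},
    "app_d": {"off": (0.5, 1.00), "aggressive": (1.0, 1.05)},
}

SYSTEM_BW = 16.0  # assumed off-chip bandwidth budget (GB/s)


def predict(config):
    """Predict mean speedup for one per-app prefetch assignment.

    If the summed bandwidth demand exceeds the budget, scale every
    application's speedup by the oversubscription factor -- a crude
    stand-in for a real bandwidth-sensitivity model.
    """
    total_bw = sum(profiles[app][s][0] for app, s in config.items())
    penalty = min(1.0, SYSTEM_BW / total_bw)
    speedups = [profiles[app][s][1] * penalty for app, s in config.items()]
    return sum(speedups) / len(speedups)


# Searching the predictor over all k^n = 2^4 = 16 configurations is cheap,
# because building the profile table cost only k*n = 8 measurement runs.
apps = list(profiles)
best = max(
    (dict(zip(apps, choice))
     for choice in product(["off", "aggressive"], repeat=len(apps))),
    key=predict,
)
print(best, predict(best))
```

Note that with these toy numbers the predictor picks a mixed assignment (prefetching off for the most bandwidth-hungry application) rather than all-on or all-off, which is the kind of per-application trade-off the search is meant to expose.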
With Perf-Insight we achieve an average weighted speedup of 21% for 14 mixes of 4 applications on commodity hardware, with no mix experiencing a slowdown. This is significantly better than hardware prefetching alone, which achieves an average speedup of only 9%, with three mixes experiencing slowdowns. Perf-Insight delivers performance very close to that of the best possible prefetch settings (22%). Our approach is simple, low-overhead, applicable to any collection of prefetching options and any performance metric, and suitable for dynamic runtime use on commodity multicore systems.