Department of Information Technology
Uppsala Architecture Research Team

Resource Efficient Prefetching for Multicores

Modern processors typically employ hardware prefetching to help hide memory latency. Hardware prefetchers are usually very effective and can speed up some applications by more than 40% when running in isolation. However, this speedup often comes at the cost of prefetching a significant volume of useless data, which wastes shared last-level cache space and off-chip bandwidth and can hurt multi-program and parallel performance.

We demonstrate an accurate, resource-efficient prefetching scheme that improves performance when multiple applications execute concurrently by conserving shared resources. We use fast cache modeling to accurately identify memory instructions that frequently miss in the cache, and then use this information to automatically insert software prefetches into the application at compile time or at runtime. Our prefetching scheme has good accuracy and employs intelligent cache bypassing whenever possible; together, these properties reduce off-chip bandwidth consumption and last-level cache pollution. While single-thread performance remains comparable to hardware prefetching, when several cores are used and demand for shared resources grows, we see multi-application improvements of 10% on average, and up to 25% depending on the application mix.
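
As a rough illustration of what such compiler- or runtime-inserted prefetches might look like, the C sketch below adds a non-temporal software prefetch ahead of a load assumed to miss frequently. The function name, the PREFETCH_DISTANCE constant, and the streaming-sum loop are illustrative assumptions rather than output of our tool; only the general idea of prefetching with a non-temporal hint to limit shared-cache pollution reflects the technique described above.

/* Minimal sketch (not the actual tool output): the kind of software
 * prefetch a compiler pass or runtime could insert for a load that a
 * cache model flags as frequently missing. PREFETCH_DISTANCE is a
 * hypothetical look-ahead parameter. */
#include <stddef.h>
#include <xmmintrin.h>        /* _mm_prefetch, _MM_HINT_* (x86 SSE) */

#define PREFETCH_DISTANCE 16  /* illustrative look-ahead, in elements */

/* Sum a large streaming array whose loads mostly miss in the cache. */
double sum_streaming(const double *data, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Non-temporal hint: fetch the line while keeping it out of
             * the regular cache hierarchy as far as possible, reducing
             * shared last-level cache pollution for streaming data. */
            _mm_prefetch((const char *)&data[i + PREFETCH_DISTANCE],
                         _MM_HINT_NTA);
        }
        sum += data[i];  /* the "delinquent" load identified by the model */
    }
    return sum;
}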

[Figure: amd_ph2_softpref.png]

Throughput performance distribution for our resource-efficient prefetching mechanism (software prefetching + non-temporal cache bypassing) and hardware prefetching (Hardware Pref.) on an AMD Phenom II processor, across 180 mixes. Each mix contains four applications running in parallel on four cores.
