Optimizing applications for off-chip bandwidth usage has become increasingly critical as computing resources in multicore processors have grown much faster than shared resources, namely off-chip bandwidth and shared cache capacity. While improved use of shared resources can benefit single-core performance, it is crucial for systems with several active cores, where the way each core uses shared resources can directly affect the performance of sibling cores.
Although optimizing for memory bandwidth has been a priority for decades, the tools to effectively profile application memory accesses are relatively new. With such tools we can uncover memory accesses that use shared cache capacity and memory bandwidth inefficiently and trace them back to the original source code. This paper presents case studies that use memory access profiles to uncover and explain critical memory access issues in three selected workloads. These memory bottlenecks are resolved using commonly applicable software optimization techniques. We then investigate the throughput wall: the relationship between the post-optimization drop in off-chip traffic and the resulting throughput gain. Our multi-execution experiments show that the drop in off-chip traffic after optimization is reflected in the maximum throughput that the optimized workloads can achieve relative to the originals.