The introduction of chip multiprocessors (CMPs) presents new challenges and trade-offs to computer architects. Architects must now strike a balance between the number of cores per chip and the amount of on-chip cache and available pin bandwidth. Technology projections predict that the cost of pin bandwidth will increase significantly and may therefore limit the number of processor cores per CMP.
We observe a trend in many processor designs towards larger cache blocks for the highest-level on-chip cache. A large cache block size is beneficial for workloads with a high degree of spatial locality. Our study confirms previous observations that significant parts of medium-sized cache blocks brought on-chip often remain unused and therefore wastefully consume pin bandwidth, especially for the commercial workloads studied. We target this waste by proposing a method of fine-grained fetching.
In this paper we show that, due to characteristics of runahead execution, it is possible to remove the implicit assumption that programs exhibit abundant spatial locality, with a limited performance impact. We demonstrate, using execution-driven full-system simulation, that our method of fine-grained fetching can obtain significant speedups in bandwidth-constrained systems while also yielding stable performance in systems that are not bandwidth limited.