Technical Report 2016-004

Implementing Fast, Virtualized Profiling to Eliminate Cache Warming

Nikos Nikoleris, Andreas Sandberg, Erik Hagersten, and Trevor E. Carlson

March 2016


Simulation is an important part of the evaluation of next-generation computing systems. Detailed, cycle-level simulation, however, can be very slow when evaluating realistic workloads on modern microarchitectures. Sampled simulation (e.g., SMARTS and SimPoint) improves simulation performance by an order of magnitude or more by reducing large workloads to a small but representative sample. Additionally, the execution state just prior to a simulation sample can be stored in checkpoints, allowing for fast restoration and evaluation. Unfortunately, changes in software, architecture, or fundamental pieces of the microarchitecture, as occur in hardware-software co-design, require checkpoint regeneration. For co-design, this degenerates into regenerating checkpoints after every modification, the very cost that checkpointing was designed to eliminate. Therefore, a solution is needed that allows for fast and accurate simulation without the need for checkpoints.

Virtualized fast-forwarding proposals, like FSA, are an alternative to checkpoints that speed up sampled simulation by advancing the execution at near-native speed between simulation points. They rely, however, on functional simulation to warm the architectural state prior to each simulation point, a costly operation for moderately sized last-level caches (e.g., above 8 MB). Simulating future systems with DRAM caches of many gigabytes can require warming over billions of instructions, which dominates the simulation time and negates the benefit of virtualized fast-forwarding.

This paper proposes CoolSim, an efficient simulation framework that eliminates cache warming. CoolSim advances between simulation points using virtualized fast-forwarding while collecting sparse memory reuse information (MRI). The MRI is collected more than an order of magnitude faster than functional warming. At the simulation point, detailed simulation is used to evaluate the design, while a statistical cache model uses the previously acquired MRI to estimate whether each memory request hits in the cache. The MRI is an architecturally independent metric, and therefore a single profile can be used in simulations of caches of any size. We describe a prototype implementation of CoolSim based on KVM and gem5 that runs 19x faster than state-of-the-art sampled simulation while estimating the CPI of the SPEC CPU2006 benchmarks with 3.62% error on average, across a wide range of cache sizes.
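
The abstract does not describe the internals of the statistical cache model. As a rough illustration of how one sparse reuse profile can yield hit/miss estimates for a cache of any size, the sketch below assumes a StatStack-style conversion from sampled reuse distances (the number of memory accesses between two accesses to the same cache line) to expected stack distances, and assumes a fully associative LRU cache. The function name, its parameters, and the sample profile are hypothetical and are not taken from CoolSim.

```python
from bisect import bisect_right
from itertools import accumulate


def estimate_miss_ratio(reuse_distances, cold_samples, cache_lines):
    """Estimate the miss ratio of a fully associative LRU cache.

    reuse_distances: sampled reuse distances, each the number of memory
        accesses between two consecutive accesses to the same cache line.
    cold_samples: sampled accesses whose line was never reused; they are
        treated as having an infinite reuse distance and always miss.
    cache_lines: cache capacity in cache lines.
    """
    if cache_lines <= 0:
        raise ValueError("cache_lines must be positive")
    rd = sorted(reuse_distances)
    total = len(rd) + cold_samples
    if total == 0:
        raise ValueError("the profile contains no samples")

    # prefix[i] is the sum of the i smallest reuse distances.
    prefix = [0, *accumulate(rd)]

    misses = cold_samples
    for d in rd:
        # Expected number of unique lines touched inside a window of d
        # accesses: sum over k in [0, d) of P(reuse distance > k), which
        # equals the mean of min(r, d) over all samples r in the profile.
        i = bisect_right(rd, d)  # samples with r <= d
        expected_stack_distance = (prefix[i] + d * (total - i)) / total
        if expected_stack_distance >= cache_lines:
            misses += 1
    return misses / total


if __name__ == "__main__":
    # Hypothetical profile; the same samples are reused for every size.
    profile = [0, 1, 5, 20, 100, 1000]
    for lines in (4, 64, 1024):
        print(lines, estimate_miss_ratio(profile, 2, lines))
```

Because the profile records reuse distances rather than the state of a particular cache, evaluating another cache size in this sketch only changes the `cache_lines` threshold; the profile itself is not recollected.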
