Department of Information Technology

Efficient Modeling Technology

More information: Uppsala Architecture Research Team | Statistical Cache Modeling and Phase Detection.

Motivation

This work focuses on the development of efficient techniques for capturing relevant performance information at runtime as well as the development of statistical modeling methods to predict and analyze application performance based on the information captured at runtime. Understanding an application behavior is necessary to fully exploit the available hardware. The memory hierarchy, latency hiding techniques, and many other advanced mechanisms in the CPU architecture dictate how fast applications execute. Simulation has been used to study the performance impact of these features, but is typically too slow to analyze large applications or datasets. Furthermore, fast modeling techniques are needed for providing guidance for schedulers, compilers and programmers.
This work seeks to develop new statistical and analytical models to predict and capture significant aspects of application performance and power consumption based on runtime information that can be cheaply captured. It also includes methods to cheaply collect such information.
A second approach taken is to use various forms of sampling to speed up traditional cycle-accurate simulators.

Long Term Goal

Develop efficient statistical models of critical performance factors based on easily captured data. We will also develop the infrastructure to capture a wide variety of performance and execution information for real applications processing real data sets on real hardware. While the current modeling the memory hierarchy, future work will be expended to also cover other parts of computer systems.

Our Approach

Our work is based on three different core technologies, originally developed in this research group: Statistical reuse distance sampling, statistical modeling and on-line phase detection. In addition, virtualization techniques are used to fast-forward simulation between sampling points.

Reuse Distance Sampling

An application´s locality properties (e.g., its temporal and spatial locality) dictates how well it can utilize the cache hierarchy. Measuring an application´s stack-distance counts the number of UNOIQUE data pieces touch between two accesses to the same piece of data. This effectively captures the essential parts of an application´s locality properties and is used by many researchers to feed their analysis models. However, no proposal exists for how to efficiently capture stack distance information. We instead have proposed several new methods for capturing reuse distance information for an application, i.e., to count the number of other memory accesses between two accesses to the same piece of data. We have also implemented several efficient tools for capturing reuse distance information.

Statistical Modeling

Ruse distance information alone cannot tell much about the properties of a system and will have to be refined to model system properties. We have developed several mathematical models for modeling caches with LRU replacement and random replacement based on reuse distance information. We have also shown that such caches can be effectively modeled even with very sparse reuse distance data.
stat-sampling-model.gif

Phase Detection

The third core technology for efficient modeling is on-line phase detection. The behavior of an application tends to change over time and is often assembled from a handful of different behavior types, known as phases. Knowing the behavior for each such phase, as well as its fraction of the execution time, is sufficient for knowing the application´s overall behavior. Our approach involved techniques for detecting phase changes as well as detecting if the new phase has occurred before.

Simulation Fast Forwarding with Statistical Cache Warming

The forth project will partly leverage the techniques presented above to speed up detailed cycle-accurate pipeline simulation. Instead of simulating the entire execution with high accuracy (and slow simulation speed), a number of shorter representative simulation point are statistically selected. We use virtualization technology to fast-forward the simulation to the next simulation point and thus speeding up the overall simulation by 2000x. Before the detailed simulation of the simulation point can start, caches needs to be warmed by simulating many millions of cycles. To avoid also this overhead, statistical simulation techniques will be used. Finally, phase-detection can be used to select the simulation points in a more intelligent way.

Achievements

  • Achieved 2000x speedups for sampled simulation using the industry-standard gem5 full-system simulator.
  • Integrated statistical cache models for very fast cache warming in simulation.
  • Integrated statistical cache modeling with the interval performance model from Gent University. (part of an EU project)
  • Implemented a low-overhead tool for reuse distance sampling on x86 running Linux. Our previous tool only supported UltraSPARC III running Solaris.
  • StatCache modeling techniques for efficiently analyzing and predicting application cache usage for random replacement caches. The StatCache approach uses a simple statistical model to very rapidly (< 1s) produce accurate estimates of an application's cache miss ratio for arbitrary-sized caches. This approach was commercialized in the Acumem tool suite, which was acquired by Rogue Wave Software 2010.
  • Designed a new statistical cache model to model the behavior of caches with LRU replacement.
  • Implemented the first general low-overhead runtime phase detection at tool. It´s average overhead is below 2 percent.
  • Designed a phase-guided reuse distance sampler by combining the x86 sampler and the phase detection tool. It results in on average 6 times lower overhead and 50 percent better accuracy than the basic x86 sampler.
  • Implemented a sampler and model for modeling instruction caches behavior.

Expected Future Results

  • Modeling of memory level parallelism and its effect on the memory system and performance based on statistically captured runtime data (ongoing)
  • Efficient simulation/modeling of out-of-order pipelines (ongoing)
  • Parallel virtualized simulation support
  • Use interval modeling to model detailed effects in out-of-order processors based on efficiently captured architecturally independent runtime information (such as reuse distance). (part of an approved EU project)

Publications

  • Peter Vestberg. Low-Overhead Memory Access Sampler An efficient method for data-locality profiling. M.Sc. Dissertation Uppsala University January 2011, ISSN: 1401-5749, UPTEC IT 11 003.
  • Andreas Sembrant. Low Overhead Online Phase Predictor and Classifier. M.Sc. Dissertation Uppsala University January 2011, ISSN: 1401-5749, UPTEC IT 11 002.
  • Sascha Bischoff, Andreas Sandberg, Andreas Hansson, Sunwoo Dam, Ali Saidi, Matthew Horsnell, and Bashir Al-Hashimi: Flexible and High-Speed System-Level Performance Analysis using Hardware-Accelerated Simulation. In , Design, Automation & Test in Europe (DATE), Grenoble, France, 2013.
  • David Black-Schaffer: Full System Simulation at Near Native Speed: Parallelising and Accelerating Gem5 through Hardware Virtualisation. Talk at the Efficient Modelling of Parallel Computer Systems Workshop in conjunction with HiPEAC Conference, Vienna January 2014.
  • Nikos Nikoleris, David Eklöv, Erik Hagersten: Extending Statistical Cache Models to Support Detailed Pipeline Simulators. To appear in Proceedings of the 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Monterey, CA, USA, March 2014.

Internal project page

Updated  2015-03-03 08:29:38 by David Black-Schaffer.