Resource Sharing Modeling
The relatively long latency and limited bandwidth of off-chip memory make applications' performance highly dependent on how well they utilize the resources in the memory hierarchy. In modern chip multiprocessors (CMPs), cores share key memory hierarchy resources, such as on-chip caches and off-chip memory bandwidth. At the hardware level, these resources are typically shared in a free-for-all manner, so the amount of resources an application receives can vary greatly depending on which other applications it happens to be co-running with. At the software level, process schedulers and thread placement algorithms therefore have the potential to greatly improve application performance and scalability by scheduling and placing threads and applications in a resource-aware manner.
This work focuses on techniques to collect information about resource usage and models that capture how resource sharing affects performance. With this information we can understand how to optimize applications, operating system schedulers, and hardware to maximize the performance of applications sharing resources.
Long Term Goal
Our long term goal is to develop methods to predict and enhance application performance and scalability. These approaches can be used as a basis for the development of practical resource-sharing aware process scheduling algorithms, thread mappings, and application optimizations.
- Tools to efficiently measure, model and predict application performance and scalability in the presence of resource sharing. In particular, we are examining the impacts of sharing in the memory system through caches, prefetchers, and off-chip bandwidth.
- Practical resource-aware process scheduling and thread placement algorithms to leverage the performance models for better scalability.
- Application optimizations, based on the analysis of shared resource usage, that improve application performance when sharing resources and minimize its impact on other applications.
Results
- Initial work on extending statistical cache models to task-based frameworks (StatTask), including profiling of the OpenMP BOTS benchmark suite to identify data reuse properties.
- StatCC is an efficient method for modeling the performance impact of cache sharing on multi-program workloads. StatCC combines static program information (the instruction mix) with a performance/cache-miss model to predict how applications will affect each other's performance when sharing a cache.
- We have leveraged the StatStack cache modeling framework to automatically identify memory access instructions that pollute shared caches by installing data that is never reused. Once found, these instructions can then be automatically transformed to non-temporal accesses to keep them from being installed in the cache. This automatic analysis and transformation increases the amount of cache available to other applications, thereby improving multi-application throughput.
- We have developed new methods that measure how properties of commodity hardware change as the cache capacity and bandwidth available to an application change. Through Cache Pirating and the Bandwidth Bandit we steal cache capacity and bandwidth to evaluate the sensitivity of an application's performance, bandwidth, and cache behavior as a function of its cache size and bandwidth allocation.
- Using Cache Pirate and Bandwidth Bandit data, we have developed techniques to identify scaling bottlenecks.
- We can accurately model cache allocation on commodity hardware using Cache Pirate profiles. This allows us to predict performance (CPI) and bandwidth consumption of a set of co-executed applications from data captured for each application individually.
- We have used our cache allocation model to rapidly (600x faster than native execution) explore the variability in application slowdown due to varying offsets in application executions. This work has shown that two co-executing applications can have very different slowdowns depending on how they overlap in time.
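The cache allocation prediction above can be illustrated with a minimal sketch. This is hypothetical code, not the published StatCC/Cache Pirate tooling: it predicts how co-running applications divide a shared LRU cache, given each application's stand-alone miss-ratio curve (e.g., as measured by Cache Pirating) and its memory access rate, using the common approximation that an application's steady-state occupancy is proportional to its fill rate into the cache, i.e., its miss rate.

```python
import math

def predict_shared_occupancy(miss_ratio_curves, access_rates, cache_size,
                             iterations=200):
    """miss_ratio_curves: per-app functions mapping cache bytes -> miss ratio.
    access_rates: per-app memory accesses per unit time.
    Returns the predicted cache occupancy (bytes) of each application."""
    n = len(miss_ratio_curves)
    shares = [cache_size / n] * n          # start from an even split
    for _ in range(iterations):
        # Fill rate of each application into the cache at its current share.
        fills = [rate * curve(share)
                 for curve, rate, share in
                 zip(miss_ratio_curves, access_rates, shares)]
        total = sum(fills)
        # Occupancy proportional to fill rate; damp the update for
        # stable convergence to the fixed point.
        shares = [0.5 * s + 0.5 * cache_size * f / total
                  for s, f in zip(shares, fills)]
    return shares

# Example (synthetic curves): a cache-friendly application, whose miss ratio
# drops as it gets more cache, shares a 4 MB cache with a streaming
# application that is insensitive to cache size.
friendly = lambda c: 0.4 * math.exp(-c / 1e6) + 0.02
streaming = lambda c: 0.25
shares = predict_shared_occupancy([friendly, streaming], [1.0, 1.0], 4e6)
```

Under this simple model the streaming application occupies most of the shared cache despite not benefiting from it, which is the cache-pollution behavior that motivates the non-temporal access transformation described above.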
Publications
- German Ceballos, Erik Hagersten, and David Black-Schaffer. "StatTask: Reuse Distance Analysis for Task-Based Applications." RAPDIO, 2015. (Best paper award.)
- German Ceballos and David Black-Schaffer. "Shared Resource Sensitivity in Task-Based Runtime Systems." In Proceedings of the 6th Swedish Workshop on Multi-Core Computing, 2013.
- Andreas Sandberg, Andreas Sembrant, Erik Hagersten, and David Black-Schaffer. "Modeling Performance Variation Due to Cache Sharing." In Proceedings of the 19th International Symposium on High Performance Computer Architecture (HPCA), 2013.
- David Eklöv, Nikos Nikoleris, David Black-Schaffer, and Erik Hagersten. "Bandwidth Bandit: Quantitative Characterization of Memory Contention." In Proceedings of the International Symposium on Code Generation and Optimization (CGO), Shenzhen, China, 2013.
- Andreas Sandberg, David Black-Schaffer, and Erik Hagersten. "Efficient Techniques for Predicting Cache Sharing and Throughput." In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques (PACT), Minneapolis, Minnesota, 2012.
- David Eklöv, David Black-Schaffer, and Erik Hagersten. "Fast Modeling of Cache Contention in Multicore Systems." In Proceedings of the 6th International Conference on High Performance and Embedded Architecture and Compilation (HiPEAC), Heraklion, Crete, Greece, January 2011. (Best paper award.)
- Andreas Sandberg, David Eklöv, and Erik Hagersten. "Reducing Cache Pollution Through Detection and Elimination of Non-Temporal Memory Accesses." In Proceedings of Supercomputing (SC), New Orleans, LA, USA, November 2010.
- David Eklöv, David Black-Schaffer, and Erik Hagersten. "StatCC: A Statistical Cache Contention Model." In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT), Vienna, Austria, September 2010.
- David Eklöv, Nikos Nikoleris, David Black-Schaffer, and Erik Hagersten. "Cache Pirating: The Curse of the Shared Cache." Technical Report 2011-001.
- David Eklöv, Nikos Nikoleris, David Black-Schaffer, and Erik Hagersten. "Cache Pirating: Measuring the Curse of the Shared Cache." In Proceedings of the International Conference on Parallel Processing (ICPP), 2011. (Best paper award.)
This work leverages low-overhead hardware performance counters, runtime resource stealing, and fast statistical hardware models to rapidly predict the effects of resource sharing.
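The statistical cache models in this work build on reuse (stack) distance analysis. As a minimal, hypothetical sketch of the underlying idea (the actual StatStack model estimates stack distances statistically from sparse runtime samples rather than computing them exactly from a full trace): an access hits in a fully associative LRU cache of S lines if and only if fewer than S distinct cache lines were touched since the previous access to the same line.

```python
# Hypothetical sketch: compute exact LRU stack distances from a trace of
# byte addresses, then derive the miss-ratio curve: the miss ratio at a
# cache size of S lines is the fraction of accesses whose stack distance
# is at least S. (Statistical models like StatStack approximate this from
# sampled reuse distances instead of a full trace.)

def miss_ratio_curve(trace, line_size=64):
    lines = [addr // line_size for addr in trace]
    stack = []                  # LRU stack, most recently used line last
    distances = []
    for line in lines:
        if line in stack:
            pos = stack.index(line)
            # Number of distinct lines touched since the last access.
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(float('inf'))   # cold miss: no reuse
        stack.append(line)
    def mr(cache_lines):
        return sum(d >= cache_lines for d in distances) / len(distances)
    return mr

# Example: three lines accessed round-robin (addresses 0, 64, 128).
mr = miss_ratio_curve([0, 64, 128, 0, 64, 128])
```

For this trace, a two-line cache misses on every access (every reuse has stack distance 2), while a three-line cache captures all reuses and only the three cold misses remain.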