Shared-memory architectures represent a class of parallel computer systems commonly used in the commercial and technical market. While shared-memory servers typically come in a large variety of configurations and sizes, the advance in semiconductor technology have set the trend towards multiple cores per die and multiple threads per core.
Software-based distributed shared-memory proposals were given much attention in the 90s. But their promise of short time to market and low cost could not make up for their unstable performance. Hence, these systems seldom made it to the market. However, with the trend towards chip multiprocessors, multiple hardware threads per core and increased cost of connecting multiple chips together to form large-scale machines, software coherence in one form or another might be a good intra-chip coherence solution.
This thesis shows that data locality, software flexibility and minimal processor support for read and write coherence traps can offer good performance, while removing the hard limit of scalability. Our aggressive fine-grained software-only distributed shared-memory system exploits key application properties, such as locality and sharing patterns, to outperform a hardware-only machine on some benchmarks. On average, the software system is 11 percent slower than the hardware system when run on identical node and interconnect hardware. A detailed full-system simulation study of dual core CMPs, with multiple hardware threads per core and minimal processor support for coherence traps is on average one percent slower than its hardware-only counter part when some flexibility is taken into account. Finally, a functional full-system simulation study of an adaptive coherence-batching scheme shows that the number of coherence misses can be reduced with up to 60 percent and bandwidth consumption reduced with up to 22 percent for both commercial and scientific applications.
Available as PDF (1.65 MB)
Download BibTeX entry.