Performance Limits of Trace Caches
1999; Volume: 1; Language: English
ISSN: 1942-9525
Authors: Matt Postiff, Gary Tyson, Trevor Mudge
Topic(s): Interconnection Networks and Systems
Abstract: A growing number of studies have explored the use of trace caches as a mechanism to increase instruction fetch bandwidth. The trace cache is a memory structure that stores statically non-contiguous but dynamically adjacent instructions in contiguous memory locations. When coupled with an aggressive trace or multiple-branch predictor, it can fetch multiple basic blocks per cycle using a single-ported cache structure. This paper compares trace cache performance to the theoretical limit of a three-block fetch mechanism. The three-block fetch mechanism is modeled by an idealized 3-ported instruction cache with a zero-latency alignment network. Several new metrics are defined to formalize analysis of the trace cache. These include fragmentation, duplication, indexability, and efficiency metrics. We show that performance is more limited by branch mispredictions than by the ability to fetch multiple blocks per cycle. As branch prediction improves, high duplication and the resulting low efficiency are shown to be among the reasons that the trace cache does not reach its upper bound. Based on the shortcomings of the trace cache shown in this paper, we identify some potential future research areas.

Instruction supply is a key element in the performance of current superscalar processors. Because of the large number of branch instructions in the typical instruction stream and the small size of basic blocks, fetching through multiple branches per cycle is critical to high-performance processors. Traditional instruction cache designs cannot fetch past multiple branches per cycle, and in particular through multiple taken branches per cycle.

The trace cache fetch mechanism is a solution to the problem of fetching past multiple branches in a single cycle. It stores dynamically adjacent instructions in a contiguous memory block and can do so with intervening branch instructions. When it is coupled with a multiple-branch predictor, it can provide a high-bandwidth mechanism to fetch multiple basic blocks per cycle.

This paper presents a study of the limits of trace cache performance and their causes. The goal is not to compare the trace cache against other competing mechanisms or to introduce any new features, but to study where current trace cache configurations can improve. The contributions of this study are:

• an examination of the limit of trace cache performance based on an idealized 3-block fetch mechanism that is modeled by a 3-ported instruction cache with a perfect instruction alignment network;
• a definition of several metrics to aid in analysis of trace cache performance;
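To make the mechanism and the duplication metric described above concrete, the following is a minimal, illustrative Python sketch, not the paper's actual design: names such as TraceCache, MAX_BLOCKS_PER_TRACE, fill, fetch, and the exact formula used in duplication are assumptions chosen for clarity. It models a trace as up to three basic blocks stored contiguously and indexed by a starting PC plus predicted branch directions, and it shows how storing overlapping traces leads to duplicated instruction slots.

    # Minimal, illustrative trace-cache model (not the paper's implementation).
    # Assumptions: traces hold at most 3 basic blocks, are indexed by
    # (start PC, predicted branch directions), and associativity,
    # replacement, and timing are ignored.
    from collections import defaultdict

    MAX_BLOCKS_PER_TRACE = 3  # matches the 3-block fetch limit studied

    class TraceCache:
        def __init__(self):
            # key: (start_pc, branch_outcome_tuple) -> list of instruction PCs
            self.lines = {}

        def fetch(self, start_pc, predicted_outcomes):
            """Return a whole trace (several basic blocks) in one access, or None on a miss."""
            key = (start_pc, tuple(predicted_outcomes[:MAX_BLOCKS_PER_TRACE - 1]))
            return self.lines.get(key)

        def fill(self, start_pc, branch_outcomes, instructions):
            """Store dynamically adjacent instructions contiguously, even when
            their static addresses are non-contiguous."""
            key = (start_pc, tuple(branch_outcomes[:MAX_BLOCKS_PER_TRACE - 1]))
            self.lines[key] = list(instructions)

    def duplication(trace_cache):
        """Fraction of stored instruction slots that are redundant copies --
        one plausible way to express a duplication metric (the paper's exact
        definition may differ)."""
        counts = defaultdict(int)
        total_slots = 0
        for instrs in trace_cache.lines.values():
            for pc in instrs:
                counts[pc] += 1
                total_slots += 1
        if total_slots == 0:
            return 0.0
        return (total_slots - len(counts)) / total_slots

    # Usage: two traces starting at the same PC but diverging on the second
    # branch store the same leading instructions twice, raising duplication.
    tc = TraceCache()
    tc.fill(0x400, (True, False), [0x400, 0x404, 0x410, 0x414, 0x430])
    tc.fill(0x400, (True, True),  [0x400, 0x404, 0x410, 0x414, 0x420])
    print(duplication(tc))  # 0.4: four of ten stored slots are duplicates

This kind of overlap between traces that share a prefix is the intuition behind the paper's observation that high duplication lowers storage efficiency and keeps the trace cache below its fetch-bandwidth upper bound.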