Overview
When it comes to innovations in x86 designs, AMD has been a leader for many years, especially in features that provide top-flight performance: on-chip memory controllers, HyperTransport™ technology links, 64-bit extensions, and multiple cores. The Barcelona family of x86 processors, to be released by AMD in August, provides another host of innovations, and none will be more visible than AMD’s radical redesign of the cache hierarchy.
This article discusses the three levels of cache in Barcelona and the unique ways in which they interact. Only a basic understanding of processor caches is needed to follow along.
L1: The New Home, Sweet Home
The L1 (that is, level 1) cache is the one closest to the instruction-execution units in the processor core. It is segregated into two 64KB caches, one each for instructions and data. I will focus on the L1 data cache.

In traditional x86 cache designs, this data cache is fed data items from the L2 cache, which is a larger, slower cache that is fed items from memory. Under this scheme, when the processor needs a data item that is not in L1 cache, it looks in L2 cache, and, failing that, it stalls while the item is fetched from memory. Depending on which vendor’s cache system is being used, the memory item is either loaded directly to L1 or first to L2 and from there into L1.
Barcelona reverses all this. The L1 cache is the target of all loads of data, regardless of origin. All fetches of data items from memory are placed into L1 and nowhere else. To reduce the possibility of a stall, Barcelona has a built-in hardware prefetch mechanism that detects repeating patterns of data access and loads upcoming data items into, you guessed it, L1 cache.
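To make the prefetcher’s job concrete, here is a minimal sketch in C (the function, array size, and stride are my own illustration, not anything AMD specifies): the loop advances through an array at a constant stride, producing exactly the kind of predictable address sequence a stride-based hardware prefetcher can lock onto after a few accesses and use to pull upcoming lines into L1 early.

```c
#include <stddef.h>

/* Sums every `stride`-th element of the array. Each address is a
 * fixed step past the previous one, so after observing a few
 * accesses a stride prefetcher can predict (and prefetch) the rest
 * of the stream. The computation itself is incidental. */
static long strided_sum(const int *a, size_t n, size_t stride) {
    long sum = 0;
    for (size_t i = 0; i < n; i += stride)
        sum += a[i];
    return sum;
}
```

A stride of 1 (fully sequential access) is the friendliest case of all; irregular, pointer-chasing access patterns give the prefetcher nothing to predict.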
This design provides the maximum performance in loading items, but it does have one constraint: At 64KB, the L1 cache’s capacity is limited, so good management of its data is crucial. To this end, Barcelona provides a two-tier spill-over cache architecture, so that items can be evicted quickly from the L1 cache and brought back rapidly if needed again. These two tiers consist of the L2 and L3 caches.
L2 Cache: Per-Core Workhorse
The L2 cache is purely a spill-over cache for items evicted from the L1 cache. (In technical parlance, this is referred to as a “victim cache.”) Each core has its own private L2 cache, which on the first generation of Barcelona chips measures 512KB. Loads of data items from the L2 cache back to the L1 cache are fast, so if an evicted item is needed in the L1 cache again, the latency of getting it is low. As in previous generations of AMD processors, when an item is moved from the L2 cache to L1, it is deleted from the L2 cache. This policy increases the effective size of L2 by eliminating redundancy.

Having the L1 cache feed data into the L2 is unusual. The more common pattern is that the L2 cache is fed directly from memory or from an L3 cache, and the L2 cache then feeds the core’s L1 cache. In this traditional arrangement, each lower level of cache is smaller and faster. By reversing the data flow, so that evicted data goes from the smaller, faster cache to the larger outer caches, AMD is seeking to accelerate loads and stores and to use the L2 cache as an extension of the L1 cache.
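The victim-cache flow described above can be sketched in a few lines of C. This is a toy model under my own assumptions (tiny, fully associative caches with FIFO eviction; real caches are set-associative with far larger capacities), but it captures the two rules the article describes: L1 evictions spill into L2, and an L2 hit moves the line back to L1 and deletes it from L2, keeping the pair exclusive.

```c
#include <stdbool.h>
#include <string.h>

#define L1_LINES 2   /* illustrative capacities, not Barcelona's */
#define L2_LINES 4

typedef struct { int lines[L1_LINES]; int count; } L1Cache;
typedef struct { int lines[L2_LINES]; int count; } L2Cache;

static bool contains(const int *lines, int count, int addr) {
    for (int i = 0; i < count; i++)
        if (lines[i] == addr) return true;
    return false;
}

static void remove_line(int *lines, int *count, int addr) {
    for (int i = 0; i < *count; i++)
        if (lines[i] == addr) { lines[i] = lines[--(*count)]; return; }
}

/* Load addr into L1. All fills target L1; an L1 eviction spills the
 * victim into L2; an L2 hit is moved back into L1 and deleted from
 * L2, so no line ever lives in both caches at once. */
static void load(L1Cache *l1, L2Cache *l2, int addr) {
    if (contains(l1->lines, l1->count, addr)) return;   /* L1 hit */
    if (contains(l2->lines, l2->count, addr))           /* L2 hit: */
        remove_line(l2->lines, &l2->count, addr);       /* exclusive */
    /* else: fetched from memory (or L3) straight into L1 */
    if (l1->count == L1_LINES) {                        /* evict oldest */
        int victim = l1->lines[0];
        memmove(l1->lines, l1->lines + 1, (L1_LINES - 1) * sizeof(int));
        l1->count--;
        if (l2->count < L2_LINES) l2->lines[l2->count++] = victim;
    }
    l1->lines[l1->count++] = addr;
}
```

Loading a third address into the two-line L1 spills the oldest line into L2; touching that line again pulls it back and removes the L2 copy, which is the redundancy-elimination the article attributes to AMD’s design.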
Just as they are on current AMD processors, the L2 caches on Barcelona are duplicated, not shared. That is, each core has its own L1 and L2 caches, which are not shared with other cores. (The L2 caches are coherent, however. This means that if two L2 caches contain copies of the same data item, any changes made to one copy are immediately mirrored in the other cache’s copy.) AMD argues that giving each core its own dedicated L2 cache gives cores a more predictable execution profile, because the L2 cache cannot be swamped by data items from another core. This is correct, and it means that developers who want to tune code can rely on a fairly consistent cache performance profile. However, the design has one limitation: If a core needs more cache than the L2 currently offers, it has no easy way of obtaining it, other than by evicting current items. This limitation inspired AMD to add a third, shared layer of cache to the processor, so that cache availability can be expanded as needed.
This design breaks new ground for x86 processors: The Barcelona chips will offer the advantages of a directly accessed L1 cache, and both shared and duplicated higher-level caches. The L3 cache has some additional unique features.
L3 Cache: A Shared Container
In the initial Barcelona models, the new L3 cache is set at 2MB. AMD has announced that future models will quickly scale up to larger sizes. For example, the upcoming Shanghai generation of processors will sport a 6MB L3 cache.

Sharing the L3 cache between cores provides several important benefits. The first is that any core that needs more cache can generally get it. The cache-management logic expels the least recently used (LRU) cache lines (that is, blocks of data in cache) using the algorithm I describe shortly, thereby creating room for the new data. The other major advantage, and probably the more important one, is that two cores working on the same data can share a single copy in the L3 cache. This arrangement works out particularly well for multimedia applications. For example, multiple cores might each be working on decoding and rendering a single video cell. If the entire cell is loaded into L3 cache, the cores can all access the same data without reading it multiple times from RAM.
The cache-management logic for the L3 cache is unique. When an item is loaded from the L3 cache into a core’s L1 cache (the L2 cache is always bypassed), the item is sometimes removed from the L3 cache and sometimes not. The determining factor is whether other cores are still accessing the item. If so, the item is left in L3 and a copy of the data is loaded into L1. If no other cores are accessing the data item, it is removed from the L3 cache. Developers will recognize this decision-making process, as it is very much akin to the logic used by garbage collectors in Java and elsewhere: Leave the data alone if it’s still being accessed; otherwise, get rid of it.
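That retention decision can be expressed as a small function. The sharer-count bookkeeping below is purely my own illustration — AMD has not published how the hardware tracks which cores are using a line — but it shows the rule the article describes: on a load into one core’s L1, the line stays in L3 only while other cores still want it.

```c
#include <stdbool.h>

/* A cached line plus a (hypothetical) count of cores accessing it. */
typedef struct { int addr; int sharers; } L3Line;

/* Called when one core loads `line` from L3 into its L1.
 * Returns true if the line stays resident in L3 (other cores are
 * still sharing it, so they get to keep the common copy), false if
 * it is removed (this core was the sole user), freeing room for
 * more spill-over from the L2 caches. */
static bool load_from_l3(L3Line *line) {
    if (line->sharers > 1) {
        line->sharers--;   /* the requesting core now has its own copy */
        return true;       /* keep the shared copy in L3 */
    }
    return false;          /* sole user: evict from L3 */
}
```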
Another aspect that makes the L3 cache unusual is that it is not fed from memory. Rather, it serves as a spill-over cache for items evicted from the L2 cache. So when the L3 cache loads an item into the L1 cache and the cache manager can delete it from L3, room is created for more spill-over from the L2 cache. Eventually, the data item in the L1 cache will make it back to the L2 cache (when the processor is done with it), and the cycle can repeat. If the processor needs the item again right away, however, it can find it in the L2 cache, provided it has not yet been pushed back out to the L3 cache, and thereby obtain faster turnaround. AMD has not published figures on the latency of data access in the L3 cache, so it is not currently possible to know how much faster.
The Whole Cache Hierarchy
Putting these three levels of cache together results in a picture like Figure 1.
Figure 1. The three-tier data cache hierarchy in the Barcelona processor. (Courtesy: AMD)
The good news is that developers need do nothing special to derive the benefits of this new architecture. They can, however, help the architecture function optimally by keeping a few points in mind:
- If multiple cores are going to work on the same data structure, have them do so in parallel, that is, at the same time. This enables the common data to reside in the shared cache, lowering overall cache loading.
- The prefetch hardware in the L1 cache works best when it can accurately predict the next data blocks that will be needed. Make data accesses as predictable as possible.
- Work on systems with the maximum number of cores. When you have a quad-core system, for example, it is easy to optimize for quad-core and still test for dual-core. But if you develop and test on a dual-core platform, there is no real way to tell how well your code scales on quad-core systems. The same will hold in the future as core counts grow.
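To make the predictability advice above concrete, here is a small sketch (the matrix dimensions and functions are my own, not from AMD): both routines compute the same sum over a 2D array, but the row-major version walks memory sequentially, giving the L1 prefetcher a perfectly predictable stream, while the column-major version jumps a whole row between accesses.

```c
#define ROWS 4
#define COLS 4

/* Sequential addresses: ideal for the hardware prefetcher. */
static long sum_row_major(const int m[ROWS][COLS]) {
    long sum = 0;
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            sum += m[r][c];
    return sum;
}

/* Same result, but each access is COLS ints away from the last;
 * on large matrices this strided pattern is harder on the cache. */
static long sum_col_major(const int m[ROWS][COLS]) {
    long sum = 0;
    for (int c = 0; c < COLS; c++)
        for (int r = 0; r < ROWS; r++)
            sum += m[r][c];
    return sum;
}
```

The results are identical; only the access pattern differs, which is exactly the dimension along which the prefetch hardware rewards you.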
As stated previously, cache design is one of the key factors affecting overall processor performance. This was true of the single-core processors of the previous generation, and it is even more true of multicore chips, given the possible interactions between cores: in the L2 caches on chips from other vendors, and in the L3 cache on the Barcelona architecture. The new cache architecture in Barcelona was purpose-designed with these needs in mind.
Resources
AMD has not yet released much data on the L3 cache and the new cache structure. The most information is found in the new Software Optimization Guide; see Chapter 5 and Appendix A. Note: This document refers to Barcelona chips as the AMD Family 10h processors.

A thorough, technical, and very detailed description of the Barcelona chip by AnandTech, an independent authority, contains good coverage of the cache architecture:
http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2939&p=1
Anderson Bailey is a developer with a longstanding interest in the techniques for using code to exploit processor features. He can be reached at chip.coder@gmail.com.