AMD EPYC 7002 Series Zen 2 Architecture Doubles Data Center Performance And Density


AMD EPYC 7002 Processors: Architecture And Platform Details

[Image: AMD Radeon Instinct and EPYC server]

AMD has officially taken the wraps off its second generation EPYC 7002 series processors. The EPYC 7002 series builds upon the success of the original EPYC processors by leveraging the Zen 2 microarchitecture, which is also at the heart of the recently released Ryzen 3000 series desktop parts. EPYC 7002 series processors maintain platform compatibility with existing first-gen EPYC parts, but significantly increase the platform’s capabilities in aggregate and per socket, and introduce a number of new features as well.

[Image: EPYC 7002 series summary slide]

In terms of the microarchitecture, all of the benefits Zen 2 afforded the desktop carry over to EPYC 7002 series processors, codenamed Rome. AMD has made a number of enhancements with Zen 2 in an effort to improve everything from IPC (instructions per clock) and single-thread performance, to multi-thread scaling, latency, and efficiency/power. Zen 2 IPC has been improved by approximately 15% generation-over-generation (Zen vs. Zen 2), depending on the workload, thanks to improved branch prediction, higher integer throughput, doubled floating-point performance, and reduced effective latency to main memory. These gains are over and above the frequency and power benefits inherent to the processor's more advanced 7nm manufacturing process.
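To make the ~15% figure concrete: IPC is simply instructions retired divided by core cycles elapsed. A minimal sketch, using hypothetical performance-counter readings (the values below are illustrative, not measured):

```python
# Hypothetical counter readings for the same workload on Zen vs. Zen 2.
# Equal instruction counts, fewer cycles on Zen 2 -> higher IPC.
zen = {"instructions": 1_000_000_000, "cycles": 800_000_000}
zen2 = {"instructions": 1_000_000_000, "cycles": 695_000_000}

def ipc(counters):
    """Instructions per clock: retired instructions / elapsed core cycles."""
    return counters["instructions"] / counters["cycles"]

uplift = ipc(zen2) / ipc(zen) - 1
print(f"Zen IPC:   {ipc(zen):.2f}")
print(f"Zen 2 IPC: {ipc(zen2):.2f}")
print(f"Uplift:    {uplift:.0%}")  # ~15%, matching the gen-over-gen claim
```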

[Image: Zen 2 updates in EPYC 7002]

The move to 7nm also helps AMD increase density. By using a multi-die approach that pairs multiple 8-core, 7nm CPU dies with a 14nm IO die, AMD is able to squeeze nearly 1,000mm² of silicon and nearly 32 billion transistors into a single socket. EPYC 7002 series processors consist of 9 dies altogether (8 CPU dies + 1 IO die) and max out at 64 physical cores per socket (128 threads). This multi-die approach has an inherent yield advantage over a large, monolithic single-die approach, and it also allows AMD to produce a wide array of products at various price and performance levels by disabling cores, dies, CCXs, etc., without limiting the processors' feature set.
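The silicon budget above is easy to sanity-check with back-of-the-envelope arithmetic. The per-die areas below are approximate published figures, not numbers from this article, so treat them as assumptions:

```python
# Back-of-the-envelope silicon budget for a fully populated Rome package.
ccd_area_mm2 = 74      # each 7nm 8-core CPU chiplet (assumed ~74 mm^2)
io_die_area_mm2 = 416  # the 14nm IO die (assumed ~416 mm^2)
num_ccds = 8

total_area = num_ccds * ccd_area_mm2 + io_die_area_mm2
cores = num_ccds * 8     # 8 cores per CPU die
threads = cores * 2      # SMT: two threads per core

print(f"{total_area} mm^2 of silicon")  # ~1008 mm^2 -- "nearly 1,000mm^2"
print(f"{cores} cores / {threads} threads")
```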

[Image: Zen 2 core updates slide]

Although Zen 2 builds upon first-generation Zen (and Zen+), many changes have been made to improve efficiency and performance. The updated cores have a new TAGE (Tagged Geometric) branch predictor, improved instruction pre-fetching, and a re-optimized L1 cache structure with a doubled micro-op cache.

[Image: Zen 2 cache hierarchy slide]

The new TAGE branch predictor is able to make selections with increased accuracy and granularity, and can track longer branch histories for workloads where that matters. The L1 instruction cache has actually been halved to 32K (versus the original Zen's 64K), but it is now 8-way associative. The L2 cache remains 512K per core and is 8-way associative like the L1. The Zen 2 architecture also features more L1 and L2 BTB (Branch Target Buffer) entries and a larger 1K indirect target array.
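For readers less familiar with set-associative caches, the capacity and associativity figures above fully determine the cache's geometry once a line size is fixed. A small sketch, assuming the conventional 64-byte cache line:

```python
def cache_sets(size_bytes, ways, line_bytes=64):
    """Number of sets in a set-associative cache.

    sets = total capacity / (associativity * line size)
    64-byte lines are assumed here.
    """
    return size_bytes // (ways * line_bytes)

# Zen 2 per-core caches as described above
l1i_sets = cache_sets(32 * 1024, ways=8)   # 32K, 8-way L1 instruction cache
l2_sets = cache_sets(512 * 1024, ways=8)   # 512K, 8-way L2 cache

print(l1i_sets, l2_sets)  # 64 sets and 1024 sets
```

Halving L1I capacity while doubling associativity keeps the number of sets small and reduces conflict misses, which is one way the re-optimized L1 structure pays for itself.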

[Image: Zen 2 load/store slide]

Load/store bandwidth has been significantly increased this generation: the core can perform 2 loads and 1 store per cycle, backed by a 48-entry store queue (versus 44 in the original). Zen 2 also has a larger rename space with 180 registers (up from 168), and another Address Generation Unit (AGU) has been added, bringing the total to 3 AGUs. Zen 2 offers a wider, 6 micro-op dispatch and can better utilize available CPU resources for enhanced SMT (Simultaneous Multi-Threading) scaling and performance. Better SMT performance is critically important for the kinds of heavily-threaded, diverse workloads and applications EPYC 7002 series processors are designed for.
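Combined with the widened 256-bit data paths discussed below, the 2-load/1-store rate translates into concrete per-cycle bandwidth figures. A quick sketch of that arithmetic:

```python
# Per-cycle load/store bandwidth implied by Zen 2's wider data paths.
loads_per_cycle, stores_per_cycle = 2, 1
bytes_per_access = 256 // 8  # 256-bit (32-byte) load/store paths

load_bw = loads_per_cycle * bytes_per_access
store_bw = stores_per_cycle * bytes_per_access
print(load_bw, store_bw)  # 64 bytes loaded, 32 bytes stored per cycle
```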

[Image: Zen 2 floating point slide]

Zen 2's floating point capabilities have been significantly enhanced as well. Zen 2 essentially doubles FP performance and load/store bandwidth by widening the data paths from 128-bit to 256-bit. It features two 256-bit FMAC units (four pipes in total: two FADD and two FMUL) and offers single-op support for 256-bit AVX instructions. The architecture has also been optimized to reduce contention in integer execution, for more consistent performance across a wider variety of workloads.
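The two 256-bit FMAC units imply a doubling of peak floating-point throughput per core, which is easy to work out. The base clock used for the socket-level figure is an assumption for illustration, not a number from the article:

```python
# Peak double-precision throughput implied by 2 x 256-bit FMA capability.
fma_units = 2
lanes_fp64 = 256 // 64   # four 64-bit lanes per 256-bit vector
flops_per_fma = 2        # a fused multiply-add counts as two FLOPs

flops_per_cycle = fma_units * lanes_fp64 * flops_per_fma
print(flops_per_cycle)   # 16 DP FLOPs per core per cycle, double Zen's 8

# Scaled to a 64-core part at an assumed 2.25 GHz base clock:
cores, base_ghz = 64, 2.25
peak_gflops = cores * flops_per_cycle * base_ghz
print(peak_gflops)       # ~2304 GFLOP/s peak per socket
```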
