AMD Radeon RDNA 3 Architecture Overview: Efficiency Is King

Name: Radeon RX 7900 XTX
Brand: Radeon

by Zak Killian — Monday, November 14, 2022, 09:00 AM EDT

A Close Look At AMD's New And Improved RDNA 3 Architecture

Throughout the period encompassing the Radeon HD 7000 series, the Radeon Rx 200 series, Rx 300 series, Polaris, Fury, and Vega, AMD struggled with efficiency. Its GPUs were powerful, and had brute-force might to spare, but that didn't always translate into the top-shelf gaming performance the company would have liked to have offered. Ever since the release of the original RDNA architecture with the Radeon RX 5700 XT, the red team has been working to reverse those fortunes with an emphasis on efficiency.

Has AMD succeeded on that front? Well, the RX 5700 XT delivered Radeon RX Vega 64-like performance with little more than half the power budget, and the Radeon RX 6000 series and its RDNA 2 architecture delivered a 50% improvement on that in terms of performance per watt. AMD has promised another leap of that same magnitude—54%, specifically—with its upcoming RDNA 3 architecture. The first chips based on that design will be the Radeon RX 7900 XT and peculiarly-named Radeon RX 7900 XTX, arriving December 13th.

Historically, improvements in microprocessor efficiency have largely come from die shrinks, but those benefits have started to slow down as we reach smaller and ever smaller lithographies. Companies like AMD have had to take a long, hard look toward processor design as a way to improve efficiency, widening bottlenecks and relieving trouble spots for their specific architectures. That's exactly what AMD has done with RDNA 3: rearranged RDNA to eliminate legacy cruft and improve silicon utilization, which the company claims is up 20% over last generation.

Fundamentally, the changes to RDNA 3 from the previous generation come down to two main categories: refinements to the existing compute unit, and then arguably the bigger change, advanced physical design including a move to chiplets. We'll go over both of these topics in order, and talk about the impetus and the possible impact of AMD's decisions.

RDNA 3 Architecture: No Transistor Left Behind

With those first GCN GPUs in the Radeon HD 7000 series back in 2011, AMD introduced the concept of the "compute unit." That was the smallest building block of a generalized-compute-focused Radeon GPU, although that hasn't actually been the case since the advent of RDNA. The newer parts bundle two compute units into a single "Workgroup Processor" (WGP), and even the smallest Radeon GPUs—being those built into the company's Ryzen 7000 processors—don't break the architecture down further than that.

In RDNA 3, one WGP still comprises two compute units, but one of the major changes is that each compute unit now holds a pair each of ALUs and vector units. Still one WGP, still two CUs, but each CU now has twice the resources. That's how we see a move from 40 WGPs on Navi 21 to 48 WGPs in Navi 31, yet AMD claims that FP32 compute performance has increased by 2.7x. The actual shader count hasn't increased (beyond the 20% bump in WGPs), but each shader has twice as many functional units. There is twice as much L0 cache, too, now up to 32KB.

Each of these vector units now has additional parts as well. For starters, the doubled SIMD32s can handle 64-bit multi-precision operations, including single-clock Wave64 FMAs. They've also been upgraded with support for the popular bfloat16 data type, which improves compute performance in AI workloads. Speaking of AI workloads, there's a brand-new AI matrix accelerator that AMD says offers a 2.7x boost in matrix math operations. Curiously, AMD says that this matrix accelerator isn't intended for use in general compute workloads, but strictly in games.

That likely refers to AMD's upcoming FSR 3 technology. Revealed at the same time as the Radeon RX 7900 series cards, the third whole-number iteration of FidelityFX Super Resolution is a frame-generation technology similar (at least in concept) to NVIDIA's DLSS 3. AMD has shared scant few details about FSR 3, but we can reasonably assume that it will take advantage of RDNA 3's new matrix math unit in some way.

So what didn't get duplicated in RDNA 3? Well, there's still only one ray accelerator per compute unit. Don't fret, though; ray-tracing performance has been drastically improved, in part by the aforementioned upgrades to the compute units, particularly including the lower-level cache bumps. AMD emphasized something that NVIDIA has also said, and that's that ray-tracing is fundamentally a compute workload like any other. You can increase performance there by piling on compute, but the much faster way to improve RT performance is through efficiency optimizations.

Like with the rest of the design, AMD's second-generation ray-tracing builds on the lessons taught by RDNA 2 to drastically improve efficiency in "heavy RT workloads." AMD describes it as "getting the most out of each ray." This primarily comes in the form of reducing unnecessary work through early subtree culling, as well as hardware support for DXR Ray Flags, reducing the number of traversal iterations required. Along with a new two-stage scheduling algorithm to discard empty ray quads, AMD says RT performance is up by 80% from RDNA 2.

The World's First Chiplet GPUs

If the architectural changes from RDNA 2 to RDNA 3 seem relatively light—although they aren't, really—that could be because the company was preoccupied with getting the world's first-ever chiplet GPUs working. GPUs present a unique challenge when trying to move the design over to disaggregated construction because of their need for ultra-high-bandwidth interconnects and sensitivity to latency.

But how do you do a chiplet GPU? If you split the processor core, you end up with the same sort of situation we saw with multi-GPU technologies like Crossfire and competitor NVIDIA's SLI—technologies that are all-but abandoned these days due to their difficulties. AMD took a smarter tack: just break off the components that don't benefit from scaling down to denser process nodes.

The Navi 31 GPU comprises a single large processor, known as the GCD, and then six smaller chips known as MCDs. The GCD is the GPU as we conventionally think of it, simply without memory interfaces or last-level cache. Those parts reside on the MCDs, and the GCD is connected to the MCDs using ultra-high-bandwidth links known as Infinity Fanouts.

AMD talked at length about the difficulties it faced when creating the first chiplet GPUs. Doing chiplets on a CPU was relatively easy; even many-core CPUs are highly serial in comparison to GPUs, so you don't need to manage dozens of links running at extremely high bandwidth. GPU shader engines require absolutely massive amounts of connectivity, so the company had to create a custom high-throughput link enabling 5.3 TB/sec communication between the GCD and its MCDs. It accomplishes this while drawing less than 5% of the total board power, too.

If it's so difficult, why bother? The company points out that etching silicon on newer nodes is rapidly getting more expensive, while some parts of chips (like memory interfaces and caches) don't really gain anything from being on the latest process. By separating out the parts of the GPU that don't need to be on the latest process, AMD can optimize its cost-vs-benefit analysis.

AMD acknowledges that breaking those parts off of the GPU does add a little bit of extra latency, but in classical fashion, the company overcomes that concern with increased clock rates. Specifically, the 18% increase in GPU clock and a 43% increase in Infinity Fabric clock means that, apparently, the typical infinity cache hit has 10% lower latency on Navi 31 compared to Navi 21. Good stuff.

It does seem that GPU clock rates have some room to grow as well. The Radeon RX 7900 XTX and RX 7900 XT sport game clocks of 2.3 GHz and 2.0 GHz, respectively. AMD has tipped that RDNA 3 is architected to reach 3 GHz clocks which has us wondering about the future. Whether there will be a higher-clocked RX 7950 XT(X) released remains to be seen, but AMD has teased that the announced cards "absolutely" have overclocking headroom and that AIB's are "going to do something special with it." We will have to wait and see exactly how this pans out.

AMD's RDNA 3: The First Step For The Future Of Radeon

AMD says that chiplets are the future of GPU fabrication, and it's easy to see why. Intel's doing chiplets too with its Data Center Max GPUs, formerly known as Ponte Vecchio. NVIDIA has made no mention of chiplet designs, but it's difficult to imagine the House of GeForce passing on the benefits for too much longer.

This article has focused on Navi 31, the GPU behind the Radeon RX 7900 series, but there's probably a whole family of Radeon GPUs on the way based on this architecture. AMD hasn't commented on them yet, though, so we'll just talk about what the company has announced: the Radeon RX 7900 XT and Radeon RX 7900 XTX.

Based purely on their specifications, the two cards are further apart than we'd expect given their similar nomenclature. The Radeon RX 7900 XT drops 1/6th of the memory bandwidth and capacity, 1/6th of its Infinity Cache, and 14% of its compute units; we expect it to be around 15% slower than the top-end model. That's a lot of performance to drop for $100 and just losing an "X" in the name, but maybe we're off the mark on performance.

They also look really similar at a glance, but they're not identical. AMD tells us the vapor chamber on the Radeon RX 7900 XTX is 10% larger than that on its little brother, and that's clear to see as the reference cooler extends a bit above the ATX back panel plate. We wouldn't fault anyone for mistaking one for the other, though; AMD's previous-generation reference designs were also extremely similar.

Both of the new Radeon cards are destined to appear on December 13th. Keep an eye out for our review if you want to know how they stack up to the competition.