NVIDIA Pascal GPU Architecture Preview: Inside The GP100

by Marco Chiappetta — Monday, April 11, 2016, 09:24 PM EDT

NVIDIA Pascal GPU Architecture

At last week’s GPU Technology Conference, NVIDIA CEO Jen-Hsun Huang unveiled a couple of key products and technologies that target the High-Performance Computing, or HPC, space, including the Tesla P100 data center accelerator and its companion DGX-1 deep-learning system, a powerful server with up to eight Tesla P100 cards at its core. Today, though, we have more information about the underlying architecture employed in the P100, otherwise known as NVIDIA’s Pascal GPU architecture.

Pascal is the follow-up to the Maxwell architecture, which is leveraged in NVIDIA’s current generation of graphics cards and mobile GPUs. The Pascal-based GPU at the heart of the Tesla P100 is codenamed GP100, and it promises to be a very different animal.

NVIDIA Tesla P100, Featuring The GP100 GPU With HBM2

If NVIDIA’s past GPU naming convention holds true throughout the entire next generation, the GP100 will be the “big” version of Pascal, and presumably scaled-down iterations of the chip will power more mainstream consumer-class GPUs, at least initially. With Maxwell, the “big” GM200 didn’t appear in a consumer-targeted GPU until well after cards based on the GM204 and smaller Maxwell-based GPUs had already been on the market for quite some time. Let's look at previous-generation Tesla implementations for some perspective...

Specification	Tesla K40	Tesla M40	Tesla P100
GPU	GK110 (Kepler)	GM200 (Maxwell)	GP100 (Pascal)
SMs	15	24	56
TPCs	15	24	28
FP32 CUDA Cores / SM	192	128	64
FP32 CUDA Cores / GPU	2880	3072	3584
FP64 CUDA Cores / SM	64	4	32
FP64 CUDA Cores / GPU	960	96	1792
Base Clock	745 MHz	948 MHz	1328 MHz
GPU Boost Clock	810/875 MHz	1114 MHz	1480 MHz
FP64 GFLOPs	1680	213	5304
Texture Units	240	192	224
Memory Interface	384-bit GDDR5	384-bit GDDR5	4096-bit HBM2
Memory Size	Up to 12 GB	Up to 24 GB	16 GB
L2 Cache Size	1536 KB	3072 KB	4096 KB
Register File Size / SM	256 KB	256 KB	256 KB
Register File Size / GPU	3840 KB	6144 KB	14336 KB
TDP	235 Watts	250 Watts	300 Watts
Transistors	7.1 billion	8 billion	15.3 billion
GPU Die Size	551 mm²	601 mm²	610 mm²
Manufacturing Process	28-nm	28-nm	16-nm FinFET

Based on what we know so far about the GP100, it is an absolute beast of a GPU. It has roughly 3x the compute performance, 5x the GPU-to-GPU bandwidth, and 3x the memory bandwidth of NVIDIA’s previous-generation high-end products. The full complement of features and specifications revealed to date is represented in the table above.

The GP100 will be manufactured using TSMC’s 16nm FinFET process. The GPU comprises roughly 15.3 billion transistors and has a die size of 610mm². That’s about the same size as the Maxwell-based GM200, which comes in at about 601mm², but with nearly double the number of transistors: 15.3 billion vs. 8 billion. In addition to the advanced manufacturing process, NVIDIA's GP100 will also make use of HBM2 (second-generation High Bandwidth Memory) and leverage new technologies like NVLink and Unified Memory, along with a new board / connector design.
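To put the die-size comparison in numbers, here is a minimal sketch that computes average transistor density from the figures quoted above. The helper name `density_mtr_per_mm2` and the dictionary layout are our own, chosen for illustration; the result is a simple average over the whole die and says nothing about how density varies across the chip.

```python
# Transistor counts and die areas as quoted in the article.
CHIPS = {
    "GM200": (8.0e9, 601.0),    # Maxwell, 28nm
    "GP100": (15.3e9, 610.0),   # Pascal, 16nm FinFET
}


def density_mtr_per_mm2(transistors, die_mm2):
    """Average density in millions of transistors per square millimeter."""
    return transistors / die_mm2 / 1e6


densities = {name: density_mtr_per_mm2(t, a) for name, (t, a) in CHIPS.items()}
for name, value in densities.items():
    print(f"{name}: {value:.1f} million transistors per mm^2")

# How much denser GP100 is than GM200 on average.
ratio = densities["GP100"] / densities["GM200"]
print(f"GP100 / GM200 density ratio: {ratio:.2f}x")
```

By this rough measure, the move from 28nm to 16nm FinFET lets NVIDIA fit a little under twice as many transistors into each square millimeter.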

NVIDIA GP100 GPU Block Diagram

In its full implementation, the GP100 features 60 streaming multiprocessors (SMs). As configured in the Tesla P100, however, only 56 of those SMs are enabled. The base clock of the GPU is an impressive 1328MHz, with a boost clock of 1480MHz and a 300 watt TDP. Considering how young TSMC’s 16nm FinFET process is, seeing clocks this high on such a big chip bodes well for NVIDIA. As configured, and with those clocks, the GP100-powered Tesla P100 offers 5.3 teraflops (TFLOPs) of double-precision compute performance, 10.6 TFLOPs of single-precision compute, and 21.2 TFLOPs at half precision. We should also mention that double-precision atomic addition is available with Pascal, while with Maxwell it is not.
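The quoted peak throughput figures can be reproduced with simple arithmetic: core count, times two floating-point operations per clock (one fused multiply-add), times the boost clock. The sketch below is our own illustration rather than anything NVIDIA published in this form; the function name `peak_gflops` is hypothetical, and the half-precision line assumes FP16 runs at twice the FP32 rate, which is consistent with the 21.2 TFLOPs figure.

```python
def peak_gflops(cores, clock_mhz, flops_per_core_per_clock=2):
    """Theoretical peak GFLOPs: cores x FLOPs per clock x clock speed.

    The default of 2 FLOPs per clock counts a fused multiply-add as two operations.
    """
    return cores * flops_per_core_per_clock * clock_mhz / 1000.0


BOOST_MHZ = 1480  # Tesla P100 boost clock from the article

fp64 = peak_gflops(1792, BOOST_MHZ)  # 1792 FP64 cores
fp32 = peak_gflops(3584, BOOST_MHZ)  # 3584 FP32 cores
# Assumption: half precision runs at twice the single-precision rate.
fp16 = 2 * fp32

print(f"FP64: {fp64 / 1000:.1f} TFLOPs")
print(f"FP32: {fp32 / 1000:.1f} TFLOPs")
print(f"FP16: {fp16 / 1000:.1f} TFLOPs")
```

Note that these are theoretical peaks at the boost clock; sustained throughput in real workloads will be lower.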

NVIDIA Pascal SM Configuration In The GP100

Inside the GP100, those 56 active SMs house a total of 3584 FP32 cores and 1792 FP64 cores. There are 64 FP32 / 32 FP64 cores per SM, and 224 total texture units. The GPU links to its 16GB of HBM2 memory via a 4096-bit interface, which offers up to 720GB/s of peak bandwidth. There is 4MB of L2 cache on the chip, and a 256KB register file per SM, for a total of 14,336KB. Per CUDA core, that's double the registers of the previous generation (the same 256KB per SM now serves 64 FP32 cores instead of 128), along with 1.33x the shared memory capacity and double the shared memory bandwidth. In other words, this thing is massive, but let's dive in deeper...
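As a quick sanity check, the chip-wide totals fall straight out of the per-SM figures multiplied by the 56 active SMs. The sketch below also backs out the per-pin data rate implied by 720GB/s over a 4096-bit bus; that rate is our own derivation, not a figure from NVIDIA, and it assumes 1GB/s means 10^9 bytes per second. All variable names are illustrative.

```python
ACTIVE_SMS = 56

# Per-SM resources as quoted in the article.
PER_SM = {
    "fp32_cores": 64,
    "fp64_cores": 32,
    "register_file_kb": 256,
}

# Chip-wide totals for the Tesla P100 configuration.
totals = {name: ACTIVE_SMS * count for name, count in PER_SM.items()}
print(totals)

# Texture units per SM implied by the 224-unit total.
texture_units_per_sm = 224 // ACTIVE_SMS
print(f"Texture units per SM: {texture_units_per_sm}")

# Implied per-pin data rate of the HBM2 interface (derived, not quoted).
BUS_WIDTH_BITS = 4096
PEAK_BANDWIDTH_GBS = 720
bytes_per_transfer = BUS_WIDTH_BITS // 8  # the bus moves 512 bytes per transfer
implied_rate_gtps = PEAK_BANDWIDTH_GBS / bytes_per_transfer
print(f"Implied data rate: {implied_rate_gtps:.2f} GT/s per pin")
```

The very wide bus is the point of HBM2: each pin runs far slower than GDDR5, but there are more than ten times as many of them as on the 384-bit interfaces of the K40 and M40.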