CPU Startup Combines CPU+DRAM

The CPU design firm Venray Technology announced a new product design this week that it claims can deliver enormous performance benefits by combining CPU and DRAM onto a single piece of silicon. We spent some time earlier this fall discussing the new TOMI (Thread Optimized Multiprocessor) with company CTO Russell Fish, and while the idea is interesting, its presentation is marred by questionable conceptualizing and suspect analytics.

The Multicore Problem:

There are three limiting factors, or walls, that constrain the scaling of modern microprocessors. First, there's the memory wall, defined as the gap between CPU and DRAM clock speeds. Second, there's the ILP (Instruction Level Parallelism) wall, which refers to the difficulty of decoding enough instructions per clock cycle to keep a core completely busy. Finally, there's the power wall--the faster a CPU is and the more cores it has, the more power it consumes.

Attempting to compensate for one wall often risks running afoul of the other two. Adding more cache to decrease the impact of the CPU/DRAM speed discrepancy adds die complexity and draws more power, as does raising CPU clock speed. Combined, the three walls are a set of fundamental constraints--improving architectural efficiency and moving to a smaller process technology may make the room a bit bigger, but they don't remove the walls themselves.

TOMI attempts to redefine the problem by building a very different type of microprocessor. The TOMI Borealis is built using the same transistor structures as conventional DRAM; the chip trades clock speed and performance for ultra-low leakage. Its design is, by necessity, extremely simple. Not counting the cache, TOMI is a 22,000-transistor design, as compared to 30,000 transistors for the original ARM2. The company's early prototypes, built on legacy DRAM technology, ran at 500MHz on a 110nm process.



Instead of surrounding a CPU core with a substantial amount of L2 and L3 cache, Venray inserted a CPU core directly into a DRAM design. A TOMI Borealis IC connects eight TOMI cores to a 1Gbit DRAM, with a total of 16 ICs per 2GB DIMM. This works out to a total of 128 processor cores per DIMM. Because they're built using ultra-low-leakage processes and are so small, such cores cost very little to build and consume small amounts of power (Venray claims power consumption is as low as 23mW per core at 500MHz).
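For readers who want to check the per-DIMM math, here is a minimal Python sketch that multiplies out the figures quoted above. The constant and function names are our own, and the inputs are Venray's claims rather than anything we measured. The power total covers the cores only; it assumes every core draws the claimed 23mW and leaves out the DRAM arrays themselves.

```python
# Figures quoted in the article; these are Venray's claims, not measurements.
CORES_PER_IC = 8        # TOMI cores on each Borealis IC
DRAM_PER_IC_GBIT = 1    # DRAM capacity per IC, in gigabits
ICS_PER_DIMM = 16       # ICs on one DIMM
CORE_POWER_MW = 23      # claimed per-core power at 500MHz, in milliwatts


def dimm_totals(cores_per_ic=CORES_PER_IC,
                dram_per_ic_gbit=DRAM_PER_IC_GBIT,
                ics_per_dimm=ICS_PER_DIMM,
                core_power_mw=CORE_POWER_MW):
    """Return (cores, capacity in GB, core power in watts) for one DIMM."""
    cores = cores_per_ic * ics_per_dimm
    capacity_gb = dram_per_ic_gbit * ics_per_dimm / 8   # 8 bits per byte
    core_power_w = cores * core_power_mw / 1000         # mW -> W, cores only
    return cores, capacity_gb, core_power_w


cores, capacity_gb, core_power_w = dimm_totals()
print(f"{cores} cores, {capacity_gb:g}GB, about {core_power_w:.2f}W for the cores")
```

If the 23mW claim holds, all 128 cores on a DIMM would draw a little under 3W combined, which is the basis for the low-power pitch.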

It's an interesting idea.

The Bad:

When your CPU has fewer transistors than an architecture that debuted in 1986, there's a good chance that you left a few things out--like an FPU, branch prediction, pipelining, or any form of speculative execution. Venray may have created a chip with power consumption an order of magnitude lower than anything ARM builds and more memory bandwidth than Intel's highest-end Xeons, but it's an ultra-specialized, ultra-lightweight core that trades 25 years of flexibility and performance for scads of memory bandwidth.



The last few years have seen a dramatic surge in the number of low-power, many-core architectures being floated as the potential future of computing, but Venray's approach relies on the manufacturing expertise of companies that have no experience in building microprocessors and don't normally serve as foundries. This imposes fundamental restrictions on the CPU's ability to scale; DRAM is manufactured using a three-layer mask rather than the 10-12 layers Intel and AMD use for their CPUs. Venray already acknowledges that these conditions imposed substantial limitations on the original TOMI design.

Of course, there's still a chance that the TOMI uarch could be effective in certain bandwidth-hungry scenarios--but that's where the Venray Questionable Train goes flying off the track.

The Disingenuous and Questionable:



Let's start here. In a graph like this, you expect the two bars to represent the same systems being compared across three different characteristics. That's not the case. When we spoke to Russell Fish in late November, he pointed us to this publicly available document and claimed that the results came from a customer with 384 2.1GHz Xeons. There's no such thing as an S5620 Xeon, and even if we grant that he meant the E5620 CPU, that's a 2.4GHz chip.

The "Power consumption" graphs show Oracle's maximum power consumption for a system with 10x Xeon E7-8870s, 168 dedicated SQL processors, 5.3TB (yes, TB) of Flash and 15x 10,000 RPM hard drives. It's not only a worst-case figure, it's a figure utterly unrelated to the workload shown in the Performance comparison. Furthermore, given that each Xeon E7-8870 has a 130W TDP, ten of them only come out to 1.3kW--Oracle's 17.7kW figure means that the overwhelming majority of the cabinet's power consumption is driven by components other than its CPUs.
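To put a number on "overwhelming majority," the short Python sketch below works out what fraction of the cabinet's quoted maximum the CPUs could account for. The names are our own, and it leans on one assumption worth flagging: TDP is a thermal design figure, not a measured draw, so this is a rough ceiling on the CPUs' share rather than a precise breakdown.

```python
# Figures quoted in the article.
XEON_TDP_W = 130         # TDP of one Xeon E7-8870, in watts
XEON_COUNT = 10          # CPUs in the Oracle system
CABINET_MAX_W = 17_700   # Oracle's maximum power figure (17.7kW), in watts


def cpu_power_share(tdp_w, count, cabinet_w):
    """Return (total CPU watts, CPU fraction of the cabinet maximum).

    Assumption: each CPU draws exactly its TDP, which is only an estimate.
    """
    cpu_w = tdp_w * count
    return cpu_w, cpu_w / cabinet_w


cpu_w, share = cpu_power_share(XEON_TDP_W, XEON_COUNT, CABINET_MAX_W)
print(f"CPUs: {cpu_w} W, {share:.1%} of cabinet maximum")
# prints "CPUs: 1300 W, 7.3% of cabinet maximum"
```

Even treating TDP as actual draw, the ten Xeons cover only about 7 percent of the figure on the chart; everything else comes from storage, the dedicated SQL processors, and the rest of the cabinet.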

The only existing TOMI chips are prototypes built on a 110nm process. Venray's power figures are for a 42nm part--which means that neither side of the comparison is anything more than a made-up number.

In his literature, Fish makes his points about power walls by referring to unverified claims that prototype 90nm Tejas chips drew 150W at 2.8GHz back in 2004. That's like arguing that Ford can't build a decent car because the Edsel stunk.

After reading about the technology, you might think Venray was planning to market a small chip to high-end HPC niche markets... and you'd be wrong. The company expects the following to occur as a result of this revolutionary architecture (organized by least-to-most creepy):

  • Computer speech will be so common that devices will talk to other devices in the presence of their users.
  • Your cell phone camera will recognize the face of anyone it sees and scan the computer cloud for background red flags as well as six degrees of separation.
  • Common commands will be reduced to short verbal cues like clicking your tongue or sucking your lips.
  • Your personal history will be displayed for one and all to see...women will create search engines to find eligible, prosperous men. Men will create search engines to qualify women. Criminals will find their jobs much more difficult because their history will be immediately known to anyone who encounters them.
  • TOMI Technology will be built on flash memories creating the elemental unit of a learning machine... the machines will be able to self organize, build robust communicating structures, and collaborate to perform tasks.
  • A disposable diaper company will give away TOMI enabled teddy bears that teach reading and arithmetic. It will be able to identify specific children... and from time to time remind Mom to buy a product. The bear will also diagnose a raspy throat, a cough, or runny nose.

Conclusion:

Fish has spent decades in the microprocessor industry--working with Chuck H. Moore, he invented the first CPU to use a clock multiplier--but his vision of the future is, in our opinion, distorted enough to scare mad dogs and Englishmen.

His idea for a CPU architecture is interesting, even underneath the obfuscation and questionable representation, but too practically limited to ever take off. Google, an enthusiastic and dedicated proponent of energy-efficient, multi-core research, said it best in a paper titled "Brawny cores still beat wimpy cores, most of the time."

"Once a chip’s single-core performance lags by more than a factor of two or so behind the higher end of current-generation commodity processors, making a business case for switching to the wimpy system becomes increasingly difficult... So go forth and multiply your cores, but do it in moderation, or the sea of wimpy cores will stick to your programmers’ boots like clay."