There's a general view that everything gets faster and better as technology advances, but when it comes to external memory latency, that's not the case. In a recent
ARM TechCon paper, Marc Greenberg, director of product marketing at Cadence, showed why DRAM latency is increasing when measured in clock cycles and discussed ways of improving the situation.
The paper was titled "DDR4, Higher Speeds and Larger SoCs: Why External Memory Latency is Getting Worse, and What to do About it." It was presented before a standing-room-only audience Oct. 25. You can read an
article by Marc Greenberg on the same topic in the Nov. 22 ChipEstimate.com newsletter. A video of the presentation is embedded below and you can also click here to view it.
Greenberg started the ARM TechCon presentation by showing a chart, based on publicly available data, that predicts a DDR4 read latency of 22 clock cycles for the highest DDR4 data rate. The chart assumes an average latency of around 13.5 ns and is basically a plot of 13.5 ns against the clock periods of the various DRAM types. "Basically the DRAM cell array hasn't changed in the past 10 years," he explained. "At its core is a 100 MHz to 200 MHz array that has an access time of about 10 to 15 ns."
Figure: RL (read latency), tRCD (RAS-to-CAS delay), and tRP (row precharge time) of DDR3 DRAM by speed grade, with a curve-fit prediction for DDR4.
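The relationship behind the chart can be reproduced with a few lines of arithmetic. The sketch below is not taken from the paper; it assumes the roughly 13.5 ns access time quoted above, an illustrative selection of data rates, and that latency is rounded up to a whole number of clock cycles.

```python
import math

ACCESS_TIME_NS = 13.5  # average latency assumed by the chart in the talk

# Data rates in MT/s. This selection of speed grades is illustrative only.
DATA_RATES_MTS = [800, 1066, 1333, 1600, 2133, 2400, 3200]


def latency_in_cycles(data_rate_mts, access_time_ns=ACCESS_TIME_NS):
    """Express a fixed access time as a whole number of DRAM clock cycles."""
    clock_mhz = data_rate_mts / 2  # DDR moves two transfers per clock
    # Assumption: the latency is rounded up to the next full clock cycle.
    return math.ceil(access_time_ns * clock_mhz / 1000.0)


for rate in DATA_RATES_MTS:
    print(f"{rate} MT/s: {latency_in_cycles(rate)} clock cycles")
```

The access time in nanoseconds stays constant in this model, so the cycle count grows in step with the clock frequency, which is the trend the chart illustrates.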
DRAM is getting faster, Greenberg noted, because successive DRAM technology generations are increasingly parallelizing the array. With DDR3, for example, you can send transactions to 8 arrays in parallel. So even though the DRAM data rate has increased by over 10X in the past ten years, and CPU clock frequency has increased by over 10X, "the latency really hasn't changed," Greenberg said. "In fact, if you measure it in clock cycles it's been getting worse."
How Can We Improve?
In discussing ways to improve the situation, Greenberg pointed to some options that are in many cases impractical. He first warned that while reducing minimum CPU-to-DRAM latency is important, it should not be done at the expense of average latency, or at the expense of DRAM bandwidth. It is possible to make a very low latency DRAM controller that doesn't do any reordering of transactions, but that will come at the expense of DRAM bandwidth.
Other potential solutions include:
Adding more on-chip memory will reduce latency, but it's expensive.
Specialty DRAM with lower latency is available, but it comes at a high cost.
Off-chip SRAM is fast but very expensive.
Out-of-order CPU execution lets the CPU work on other instructions while waiting for data from the DRAM, but there's a practical limit to the number of outstanding transactions, and a cost in area and power.
What if we just build a simple DRAM controller with the goal of reducing latency? This won't work, Greenberg said, because "a DRAM controller requires a queue of upcoming commands to optimize the performance of the DRAM. Almost every memory controller has the ability to look ahead. Without doing look-ahead optimization, you'll waste a bunch of clock cycles."
For the most common system configurations, Greenberg noted, DDR4-3200 speeds will require 5 to 6 cache line fills in the DRAM controller at any given time to have enough look-ahead to keep the data pipe full. Okay, you might conclude, we'll just have a simple controller that can look ahead but still executes in order. That works until you issue two transactions to different rows in the same bank. Now the tRC (activate-to-activate) delay of each bank in DRAM becomes a problem. tRC is another timing parameter that is not decreasing over time; at DDR4-3200, which has a 1600 MHz clock, a tRC of 45 ns amounts to 72 clock cycles.
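The same-bank problem can be illustrated with a toy scheduling model. This is a rough sketch, not the controller described in the talk: it assumes every request targets a different row (so each one needs its own activate), that a cache line fill occupies the data bus for 4 clocks, and it ignores every timing parameter other than tRC. The `schedule` function and its `lookahead` parameter are hypothetical names chosen for illustration.

```python
import math

# 45 ns at DDR4-3200 (1600 MHz clock), as quoted in the talk: 72 cycles.
TRC_CYCLES = math.ceil(45 * 1600 / 1000)
# Assumption: one cache line fill keeps the data bus busy for 4 clocks.
BURST_CYCLES = 4


def schedule(banks, lookahead):
    """Return the cycle at which the last request finishes.

    banks:     bank number of each request, in arrival order.
    lookahead: how many queued requests the controller may choose from
               (1 means strictly in-order execution).
    """
    pending = list(banks)
    bank_ready = {}  # earliest cycle at which each bank accepts a new activate
    bus_free = 0     # earliest cycle at which the data bus is available

    while pending:
        window = pending[:lookahead]
        # Pick the request that can start earliest; ties go to the oldest.
        best = min(
            range(len(window)),
            key=lambda i: max(bus_free, bank_ready.get(window[i], 0)),
        )
        bank = pending.pop(best)
        start = max(bus_free, bank_ready.get(bank, 0))
        bank_ready[bank] = start + TRC_CYCLES
        bus_free = start + BURST_CYCLES
    return bus_free


# Pairs of requests to different rows of the same bank, across four banks.
requests = [0, 0, 1, 1, 2, 2, 3, 3]
print("in-order:  ", schedule(requests, lookahead=1), "cycles")
print("look-ahead:", schedule(requests, lookahead=8), "cycles")
```

In this model the in-order controller stalls for a full tRC on every second request, while the reordering controller overlaps the tRC of one bank with accesses to the others.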
Things get even more complex. For most system configurations, DDR4 speeds will require 14-18 cache line fills in the DRAM controller to cover the tRC time of the DRAM. But if all those transactions are done in order, latency will suffer. Further, you don't always need to hold exactly 6 cache line transactions in queue for effective look-ahead. What if a more optimal command comes along? Some degree of flexibility is needed.
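One plausible way to arrive at queue depths of this magnitude is to divide the time to be covered by the time one cache line fill occupies the data bus. The sketch below is a back-of-the-envelope estimate, not Greenberg's own derivation. It assumes a 64-byte cache line on a 64-bit bus with burst length 8 (4 clocks per fill), reuses the 13.5 ns and 45 ns figures quoted above for every speed grade, and picks DDR4-2400 and DDR4-3200 as illustrative data rates.

```python
# Assumption: 64-byte cache line, 64-bit bus, burst length 8 -> 4 clocks per fill.
BURST_CYCLES = 4


def fills_to_cover(time_ns, data_rate_mts, burst_cycles=BURST_CYCLES):
    """Number of back-to-back cache line fills that span time_ns."""
    clock_mhz = data_rate_mts / 2  # DDR moves two transfers per clock
    cycles = time_ns * clock_mhz / 1000.0
    return cycles / burst_cycles


for rate in (2400, 3200):
    read = fills_to_cover(13.5, rate)  # cover the read latency
    trc = fills_to_cover(45.0, rate)   # cover the row cycle time
    print(f"DDR4-{rate}: {read:.1f} fills for read latency, {trc:.1f} fills for tRC")
```

Under these assumptions the estimate lands in the same range as the figures quoted in the talk, though real controllers and memory parts will differ in burst length, bus width, and tRC.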
Another complication is that modern systems have three types of masters -- latency-sensitive masters that need low latency, bandwidth-sensitive masters that need a lot of data, and maximum-latency masters that care only about a latency limit. Greenberg reviewed the requirements for each. He noted that memory controllers should re-order transactions for priority, making it possible to differentiate transactions based on their latency requirements.
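A priority-aware arbiter for the three master types could look something like the sketch below. The policy shown is an assumption made for illustration, not the scheme presented in the talk: maximum-latency requests are served first once they come within a margin of their deadline, otherwise latency-sensitive requests go ahead of everything else, and ties are broken by age. The `Request` class, `pick_next`, and `urgency_margin` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional

LATENCY = "latency"          # latency-sensitive master
BANDWIDTH = "bandwidth"      # bandwidth-sensitive master
MAX_LATENCY = "max_latency"  # master that only cares about a latency limit


@dataclass
class Request:
    master_class: str
    arrival: int                    # cycle at which the request was queued
    deadline: Optional[int] = None  # only meaningful for MAX_LATENCY masters


def pick_next(queue: List[Request], now: int, urgency_margin: int = 20) -> Request:
    """Remove and return the next request to issue (hypothetical policy)."""
    # 1. Maximum-latency requests that are about to miss their limit.
    urgent = [
        r for r in queue
        if r.master_class == MAX_LATENCY
        and r.deadline is not None
        and r.deadline - now <= urgency_margin
    ]
    if urgent:
        chosen = min(urgent, key=lambda r: r.deadline)
    else:
        # 2. Latency-sensitive requests, oldest first.
        # 3. Otherwise the oldest request of any class.
        latency = [r for r in queue if r.master_class == LATENCY]
        pool = latency if latency else queue
        chosen = min(pool, key=lambda r: r.arrival)
    queue.remove(chosen)
    return chosen


queue = [
    Request(BANDWIDTH, arrival=0),
    Request(MAX_LATENCY, arrival=1, deadline=100),
    Request(LATENCY, arrival=5),
]
print(pick_next(queue, now=10).master_class)
print(pick_next(queue, now=85).master_class)
print(pick_next(queue, now=90).master_class)
```

A real controller would combine a policy like this with the bank and row state of the DRAM, since the highest-priority request is not always the one that can be issued soonest.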
Greenberg concluded that a static allocation of a fixed number of commands to the DRAM controller cannot reliably meet latency and bandwidth demands. The best approach is to allow as much flexibility as possible in command ordering, and to make decisions on command ordering as close as possible to the memory.
Note: In April 2011 Cadence announced the industry's first DDR4 IP solution. The solution includes hard and soft PHY IP, controller IP, memory models, verification IP, tools and methodologies, and signal integrity reference designs for the package and board. For more information on Cadence DDR memory controller IP and the optimizations it offers, click here. To view the video of the presentation, open the video image below or click here.
Other blog posts about ARM TechCon papers:
ARM TechCon Paper: New Methodology Eases Challenges of 32/28nm Designs
ARM TechCon Paper: "Tips and Tricks" for ARM Cortex-A15 Designs