Dynamically Assembling DRAM Bursts over a Multitude of Random Accesses


  Paolo Ienne


FPGAs implement massively parallel, application-specific compute engines. However, this approach fails when the application is memory bandwidth-bound, which is especially true for applications that perform irregular and narrow memory accesses directly to DRAM. Optimization options are expensive in design time and hard to integrate with accelerators generated by high-level synthesis. Nonblocking caches are widely used on CPUs to reduce the negative impact of misses and thus increase the performance of applications with low cache hit rates; however, they rely on associative lookup to track multiple outstanding misses, which limits their scalability, especially on FPGAs. The result is frequent stalls whenever the application has a very low hit rate.
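To see why associative lookup limits scalability, consider a minimal sketch of the miss status holding registers (MSHRs) of a conventional nonblocking cache. The class and names below are hypothetical illustrations, not part of any real design: every outstanding miss must be compared against the incoming address, which in hardware means one comparator per entry, so area and fan-out grow with the number of entries.

```python
# Hypothetical sketch (not a real cache implementation): associative,
# CAM-style MSHR lookup. Each entry needs its own comparator in hardware,
# so the structure scales poorly beyond a few dozen outstanding misses.

class AssociativeMSHRs:
    def __init__(self, num_entries):
        # Each slot holds the cache-line tag of an outstanding miss, or None.
        self.entries = [None] * num_entries

    def lookup(self, tag):
        # All entries are compared in parallel in hardware;
        # this is the part whose cost grows linearly with capacity.
        return any(e == tag for e in self.entries)

    def allocate(self, tag):
        for i, e in enumerate(self.entries):
            if e is None:
                self.entries[i] = tag
                return True
        return False  # all MSHRs busy: the cache must stall
```

With only a handful of entries, a burst of distinct misses exhausts the MSHRs and every further access stalls, which is exactly the failure mode described above for low-hit-rate workloads.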

In this project, we show that by handling thousands of outstanding misses without stalling, we can massively increase memory-level parallelism, which significantly speeds up irregular, memory-bound, latency-insensitive applications. By storing miss information in cuckoo hash tables in block RAM instead of in associative memory, we show how a nonblocking cache can be modified to support up to three orders of magnitude more outstanding misses. In addition, unlike a traditional nonblocking cache, DynaBurst issues bursts of variable length on the memory side to further increase the available bandwidth. When spatial locality allows, it lengthens bursts to exploit more of a DRAM row without being limited by the controller width; when spatial locality is insufficient, it keeps bursts short to minimize contention in the controller.
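The key idea of hashed miss tracking can be sketched in a few lines. The structure below is an illustrative approximation, not the DynaBurst RTL; all names and parameters (table count, size, displacement bound) are hypothetical. Each cuckoo table is a plain indexed memory, so a lookup costs one read per table rather than a comparison against every entry, and capacity can grow with block RAM depth instead of comparator count.

```python
# Illustrative sketch (hypothetical, not the DynaBurst implementation):
# miss information kept in cuckoo hash tables in block RAM. A lookup is
# NUM_TABLES indexed reads; capacity scales with RAM depth, not CAM size.

NUM_TABLES = 2      # number of parallel hash tables
TABLE_SIZE = 1024   # entries per table (one block RAM each, conceptually)
MAX_KICKS = 16      # bound on cuckoo displacement chains

def _hash(tag, i):
    # Per-table hash; a hardware design would use independent hash functions.
    return (tag * (0x9E3779B1 + i * 0x85EBCA77)) % TABLE_SIZE

class HashedMissTable:
    def __init__(self):
        # index -> (tag, list of pending requests for that cache line)
        self.tables = [dict() for _ in range(NUM_TABLES)]

    def lookup(self, tag):
        for i, t in enumerate(self.tables):
            e = t.get(_hash(tag, i))
            if e is not None and e[0] == tag:
                return e
        return None

    def insert(self, tag, request):
        hit = self.lookup(tag)
        if hit is not None:              # secondary miss: append the request
            hit[1].append(request)
            return True
        entry = (tag, [request])         # primary miss: allocate a new entry
        i = 0
        for _ in range(MAX_KICKS):
            idx = _hash(entry[0], i)
            if idx not in self.tables[i]:
                self.tables[i][idx] = entry
                return True
            # Slot occupied: evict its occupant and retry it in the next table.
            entry, self.tables[i][idx] = self.tables[i][idx], entry
            i = (i + 1) % NUM_TABLES
        return False  # insertion failed; a real design would apply back-pressure
```

Because only a bounded number of displacements is allowed, lookups and insertions stay cheap even with thousands of entries, which is what makes stall-free handling of thousands of outstanding misses plausible.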

Our research shows that DynaBurst provides new Pareto-optimal and Pareto-dominant design points in the area-delay space of throughput-oriented memory systems. Furthermore, burst support is required for miss-optimized memory systems to be beneficial behind external memory interfaces with multiple narrow ports, and it can further boost read throughput behind a single wide memory port.

Our memory system can be downloaded as an open-source project.

Suggested Readings:

Paper at 29th International Conference on Field Programmable Logic and Applications (FPL)