High-Level Synthesis—What Expertise Is Needed for Micro-Architecture Tradeoffs?

Filed under: SystemC, HLS, C-to-Silicon, C++, C, hardware, micro-architecture, RAM

My most recent blog post discussed how new algorithms, combined with high-level synthesis, can continue to drive innovation in hardware design by balancing power consumption against performance improvements.

A great example of this is what Fujitsu Semiconductor was able to accomplish: they took advantage of the high abstraction level of SystemC to explore different micro-architecture tradeoffs. They found a micro-architecture that delivered 35% faster throughput, a 51% power reduction, and a 35% area reduction compared to their original handwritten RTL. This almost sounds unbelievable. But when you have to describe a micro-architecture by hand in RTL, it is painful to try more than a couple of alternatives because you usually have a schedule to meet.

In contrast, the high-level approach allows you to describe only the core functionality along with the communication protocol in SystemC. The core functionality consists of typical C/C++ constructs, like functions, loops, and arrays. As a hardware designer, you compile it into an HLS tool like C-to-Silicon Compiler and begin exploring the solution space. In a typical flow, you might have the tool schedule the design as-is, explore how resources were mapped and where your critical timing/area/power issues are, and then make adjustments. What are some of the adjustments you could make that affect micro-architecture?

Functions—sharing vs. inlining

Sharing versus inlining is analogous to preserving versus flattening modules in RTL synthesis. Let's say you have a simple function:

int math_func(int a, int b, int c) {
    int r = a * b / c;
    return r;
}

If C-to-Silicon keeps the function intact when it is called, it becomes a "resource" similar to a single adder or multiplier resource, so it can be shared by the various processes that call it (those processes call it in different clock cycles, of course) in order to save area. If you specify to C-to-Silicon to inline this function, then the individual multiplier and divider are "flattened" within the calling function, so each can be optimized in the context in which it is called. This allows the smaller operations to possibly be shared locally, but it could also reduce overall sharing because of how each is optimized in context. This is where your knowledge of the design comes into play!

In order to take advantage of this capability to reduce power, one approach is to share as aggressively as possible in order to reduce area, since area correlates closely with leakage power and roughly with switching power. It's also important to keep an eye on whether the number of states must increase in order to meet timing when sharing, because that can adversely affect both performance and power. This is why it's important to have accurate timing characterization in your HLS tool!

Arrays—flatten, register file, or memory?

Arrays are very commonly used by software developers in C/C++. As a hardware designer, when you receive code that has arrays, you will want to specify how they should be implemented as hardware.

"Flattening"—or mapping to a set of registers that can be written to and read from, controlled through muxes for each register—is fine for a small array because access to these registers will be quick and there's more flexibility to optimize the logic inline. However for larger arrays, flattening may dominate the resulting area, as ITRI experienced when they first synthesized software C++ code.

C-to-Silicon also offers the ability to use a "built-in RAM," which essentially implements a register file with the muxing structures inside the module that contains the memory logic, shared across the registers. This is useful for creating a register file or for mapping to a RAM in an FPGA.

The third option is to specify that a large array will be a memory in the final hardware. This will be the most compact implementation from an area perspective. You would typically specify the read and write ports associated with the target vendor RAM.

Flattening an array delivers the best performance for small arrays, while mapping to a RAM delivers a more compact and power-efficient implementation for large arrays. Back to the tradeoffs: finding the right tradeoff point depends on your knowledge of the design and the specific implementation goals of this project.

Loops—break or unroll, fully or partially?

As in Verilog, C is full of for() loops. Typically when you receive C/C++ from the architecture team, the loops will be purely combinational, meaning there are no wait() statements in them. Take the following loop:

for (int i = 0; i < 3; i++) {
    sc_uint<2> cc_v = 0;
    cc_v[0] = pix[i] < ref_min[i];
    cc_v[1] = pix[i] > ref_max[i];
    cc[i] = cc_v;
}

Here we have data being fed into the datapath operations multiple times with no registers. We could break the loop by inserting a register, which means the operations could be shared, with different data being loaded in each state. Even counting the added muxing structures, this would yield a smaller area, yet it would require more cycles and possibly increase switching power as well. The other choice is to unroll the loop, duplicating the operations so they run in parallel in one cycle: better latency, but larger area and more leakage power.

Again, it's all about what your project's goals are and finding the right tradeoff point. Then what if we have a combinational loop where one of the operation's inputs is its current output?

int r = 1;

for (int i = 0; i < 4; i++) {
    r = r * din[i];
}

This has a dependency on the results of the previous operation, so unrolling the loop would create a serial sequence of four multipliers in a single clock cycle. That would not be good for timing!

Instead, we could do a partial unroll combined with breaking the loop. In this case we would get two multipliers shared across two clock cycles. It would increase our latency by an extra clock cycle, but it would likely enable us to meet timing while also reducing area by sharing the multipliers. It's difficult to judge from a power perspective, since we're adding a register and muxing logic while saving two multipliers, but in this case it's likely something we have to do in order to meet timing anyway.

Conclusion—hardware designers needed!

These were just some small examples to illustrate the types of micro-architecture decisions that a hardware designer would likely need to make in a SystemC HLS flow. As you can see, each decision depends heavily on the structure and implementation goals of the design. It also helps greatly to know the general effects of each decision on performance, power, and area. Only hardware designers possess this type of knowledge! Going forward, as process scaling becomes more challenging, this is where advancements in hardware will come from, as Fujitsu Semiconductor has demonstrated. There is indeed a bright future for hardware designers, as long as they adapt their skill set to take advantage of these more powerful tools.

Jack Erickson
