
Beyond RTL part 1: High-Level Synthesis

Let’s say, for a minute, that you believe it is finally time to drop RTL (maybe it was my previous post that convinced you). What can I say? I’m glad! You now have to pick among several competing technologies, each with its pros and cons, each of course claiming to be the best, and none of them compatible with the others.

 

So let’s begin with what is probably the best-known alternative to RTL: High-Level Synthesis (HLS). As you can see, Wikipedia lists many tools that implement one form of HLS or another. So what is the promise of HLS? To take a software description (code) and turn it into optimized hardware. This paper discusses the history of HLS, tracing its origin back to the 1960s; commercial products were already available in the 1990s (the oldest that I know of, NEC’s CyberWorkBench, was already around in 1988). Did HLS succeed? That is, if you want to create an H.264 decoding chip, can you take a piece of software (for instance x264, an optimized software implementation of H.264) and turn it into a chip? Well, no. Come on, did you really expect another answer? :-)

 

Think about it: a digital circuit is composed of a huge number of components, many of which are active simultaneously, whereas software is (still) mostly a huge sequence of instructions issued sequentially, which keep reading/writing memory (be it through pointers or references) to do their computation (remember the von Neumann architecture, which I talked about before). Keep in mind that automatic parallelization won’t give any satisfactory results in the presence of pointers, because Precise Flow-Insensitive May-Alias Analysis is NP-Hard. This means it is very difficult to know in advance whether two pointers may point to the same memory address (known as aliasing), even when they are used in the same procedure. Oh, and if you have dynamic memory allocation (which is present in virtually every piece of software), then you are confronted with The Undecidability of Aliasing… The conclusion is that it is not possible to extract parallelism from arbitrary sequential software. Not now, not ever.
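To make the aliasing problem concrete, here is a minimal C sketch (the function and variable names are mine, purely for illustration, and not taken from any tool or paper). Unless a compiler or HLS tool can prove that dst and src never overlap, it must assume a loop-carried dependence and keep the iterations strictly in order:

```c
/* Illustration only: a loop that is trivially parallel when dst and src
 * point to disjoint arrays, but inherently sequential if they overlap. */
void accumulate(int *dst, int *src, int n)
{
    for (int i = 0; i < n; i++) {
        /* If dst == src + 1, writing dst[i] also writes src[i + 1],
         * which the next iteration reads: a loop-carried dependence.
         * Without precise alias information, the tool must assume
         * this worst case and cannot parallelize or pipeline freely. */
        dst[i] += src[i];
    }
}
```

This is why, in practice, the guarantee usually has to come from the programmer (for instance through C99’s restrict qualifier or a tool-specific annotation) rather than from the analysis itself.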

 

You may wonder, then, what HLS can do. Well, the thing is that it is possible to extract some parallelism from a limited class of sequential software written in a certain way (i.e. with some restrictions, and sometimes with help from the user who wrote it). With that in mind, let’s examine what some of the tools available out there can do (I will detail other alternatives to RTL in a future post):

 

  • some free HLS tools: C to Verilog, GAUT, LegUp, Shang. GAUT and C to Verilog only seem to work at the function level. LegUp can translate a fair amount of C, as long as it respects the CHStone coding style, which includes inter-procedural calls and pointers (although I suspect that if you want efficient hardware, pointers must alias to a well-known memory location). However, the objective of LegUp is to make FPGAs easier to use by software engineers, and to execute a particular C program faster in hardware than in software (which it does very well), rather than to create really efficient hardware (for instance, this paper shows that their implementation of AES requires 14,000 clock cycles to encode+decode a block, whereas an efficient pipelined hardware implementation can encode/decode one entire block per clock cycle). Shang claims to outperform LegUp by “30%” (in frequency? clock cycles? throughput?).
  • most commercial HLS tools (Catapult, C-to-Silicon, CyberWorkBench, Cynthesizer, ImpulseC, Synphony, Vivado), on the other hand, focus on creating efficient hardware (their objective is to generate hardware whose performance rivals that of hand-written designs), generally from a subset of C/C++ with (proprietary) extensions for hardware and for task-level parallelism, or from a SystemC design. Almost all of the tools support SystemC, and some favor it, as it is a standard, although it is not really software anymore (even if SystemC is a C++ framework, low-level SystemC code looks more like Verilog or VHDL than C++ in my opinion). As for the subset of C/C++ that these tools accept, it varies, although it generally forbids arbitrary pointers, recursive functions, etc. What about the performance? Generally, expect great performance for filters (their favorite example is still the FIR filter; see the sketch after this list) and similar regular, easy-to-pipeline algorithms. For whole designs, you will have to try these tools by yourself :-) If you do, I would be glad to know the results you obtain – provided the tool vendor allows you to disclose them (and often they do not).
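To give an idea of the restricted coding style these tools expect, here is a minimal FIR filter sketch written the HLS-friendly way (the names, the tap count and the absence of vendor directives are my own simplifications, not taken from any particular tool): fixed-size arrays, static control flow, no arbitrary pointers, and a loop with compile-time bounds that the tool can unroll or pipeline.

```c
#define TAPS 8  /* illustrative tap count */

/* Illustrative FIR filter in a restricted, HLS-friendly style:
 * no dynamic allocation, fixed loop bounds, no pointer arithmetic.
 * A real design would add the vendor's pipelining/unrolling
 * directives, which differ from one tool to another. */
int fir(const int coeffs[TAPS], int shift_reg[TAPS], int sample)
{
    int acc = 0;
    for (int i = TAPS - 1; i > 0; i--) {
        shift_reg[i] = shift_reg[i - 1];   /* shift the delay line */
        acc += coeffs[i] * shift_reg[i];   /* multiply-accumulate */
    }
    shift_reg[0] = sample;
    acc += coeffs[0] * sample;
    return acc;
}
```

Whether such a loop actually becomes a datapath that accepts one sample per clock cycle depends entirely on the directives you give the tool, which is exactly where the proprietary extensions mentioned above come in.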

This is a guest post by Synflow, an innovative EDA company based in Europe.
