New tool enables mapping complex applications onto FPGAs
Today's powerful FPGAs are suitable for demanding applications. They can handle the complexity and performance needed for applications such as image and video processing. But there is a growing gap between what the FPGAs can deliver and what application designers can program to run on them. Complex applications contain up to tens of thousands of lines of code, a number that is likely to increase further. They are coded in high-level procedural languages such as C, C++, Java and C#. There is, however, no straightforward way of parallelizing and optimizing this code to run on FPGAs.
To overcome this constraint, we have created a design flow and a supporting tool, code-named SPRINT. This flow allows the designer to explore different partitioning alternatives for parallelization in a matter of minutes. And once the code has been partitioned, it enables the designer to tune the target FPGA for the application, by selecting the most suitable FPGA implementation for each partition.
Why automated mapping?
Current applications are complex, and they need ever-increasing computational power. For image and video processing, this is best witnessed by the growing demand for higher resolution, both in image dimensions and in the number of frames per second. In most cases the required computational power can only be achieved through the parallelism offered by a multiprocessor implementation. But using multiple processors is one thing. The application designer will also have to rework the sequential description, parallelizing it so that it can profit maximally from running on multiple processors.
Having a tool to help with the transformation from a sequential description into a parallel implementation brings many advantages. First, it reduces the time-to-market of FPGA implementations. Second, it improves design efficiency and cost, involving fewer designers and testers. And because it helps manage the complexity of parallel computation, it also avoids the errors typical of this transformation process. Last, it enables the reuse of existing code and existing implementation tools (such as Synfora, C2H, Impulse C and others).
Design flow from C to a parallel FPGA implementation
As input for a parallel FPGA implementation, we start with a single-threaded application description (for example in C). The task is then to gradually refine and parallelize this description in a controlled design flow. Every step in this flow has to be verified through carefully selected application test cases, so that the application remains functionally consistent.
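As a minimal illustration of such a starting point, consider a hypothetical two-stage pixel pipeline written as ordinary sequential C. The stage names and logic below are invented for this example, not taken from an actual SPRINT design case:

```c
/* Two processing stages of a hypothetical pixel pipeline. */
static int scale(int p) { return p * 2; }               /* stage 1: amplify  */
static int clip(int p)  { return p > 255 ? 255 : p; }   /* stage 2: saturate */

/* Single-threaded description: both stages run inside one loop,
   one pixel after the other, on one processor. */
void process(const int *in, int *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = clip(scale(in[i]));
}
```

Splitting such stages into concurrent processing nodes is exactly the kind of rework the partitioning phase addresses.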
The main phases of this flow are:
- A partitioning exploration phase. The original sequential description in C is translated into a concurrent description in parallel C. With the tool, alternatives can be explored easily.
- An FPGA implementation and optimization phase. Here the parallel processes are implemented and optimized for a selected processing style (e.g. general purpose processor, hardware implementation).
- Last, our tool offers support for HDL development and verification.
Partitioning, and exploring alternatives
The first step in the flow is partitioning the single-threaded application. The result is a concurrent application that is functionally equivalent to the original one.
The programming model for the concurrent implementation consists of a collection of processing nodes connected through a message-passing system. That system is built from a limited but complete set of communication primitives. Thanks to the message-passing approach, the concurrent implementation is self-timed.
The SPRINT tool supports the designer during this partitioning phase, automating the bulk of the conversion work. The designer first indicates a set of process boundaries in the sequential C code. SPRINT then automatically transforms the sequential model into a concurrent implementation model. It detects the necessary communication channels and maps them onto the communication primitives. SPRINT also inserts the appropriate API calls into the code of the processing nodes. That way, the designer can generate an executable parallel program without having to rewrite the application code.
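To make the result of such a transformation concrete, here is a sketch of a two-stage pipeline split into two processing nodes that communicate over a channel. The article does not list SPRINT's actual communication primitives, so `chan_put`/`chan_get` below are hypothetical stand-ins, and the two nodes are scheduled one after the other in a single thread purely to keep the example self-contained; a real deployment would run them concurrently, synchronized only by the channel (self-timed):

```c
#define CAP 16  /* channel capacity; large enough for this example */

/* Hypothetical message-passing channel (FIFO of ints). */
typedef struct { int buf[CAP]; int head, tail, count; } chan_t;

static void chan_put(chan_t *c, int v) {       /* blocking in a real runtime */
    c->buf[c->tail] = v; c->tail = (c->tail + 1) % CAP; c->count++;
}
static int chan_get(chan_t *c) {
    int v = c->buf[c->head]; c->head = (c->head + 1) % CAP; c->count--;
    return v;
}

/* Node 1: scaling stage, writes its results into the channel. */
static void node_scale(const int *in, int n, chan_t *out) {
    for (int i = 0; i < n; i++) chan_put(out, in[i] * 2);
}

/* Node 2: clipping stage, reads its inputs from the channel. */
static void node_clip(chan_t *in, int *out, int n) {
    for (int i = 0; i < n; i++) {
        int p = chan_get(in);
        out[i] = p > 255 ? 255 : p;
    }
}

/* Sequential schedule of the two nodes, for illustration only
   (requires n <= CAP with this trivial scheduling). */
void process_parallel(const int *in, int *out, int n) {
    chan_t c = {0};
    node_scale(in, n, &c);
    node_clip(&c, out, n);
}
```

Because the nodes exchange data only through the channel, the partitioned version stays functionally equivalent to the sequential one.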
It is important that the concurrent model is executable, as this allows the designer:
1. to verify that the concurrent application is functionally correct
2. to assess the concurrency of the different processing nodes and the overall system performance, by annotating the processing nodes with (estimated or profiled) performance measures
Because it largely automates the conversion from a sequential C program to a parallel C program, the SPRINT tool makes it possible to explore many partitioning alternatives in a short time, which would be impossible if the code had to be converted by hand.
Implementing and optimizing the platform
In addition to an executable simulation model of the concurrent implementation, SPRINT also generates the project files needed for an FPGA implementation. Such an FPGA implementation can be seen as one specific implementation model of the concurrent application. It consists of a network of soft cores with dedicated interconnects. Figure 2 and Figure 3 show screenshots of these architectures in the design environments of Altera (SOPC Builder) and Xilinx (EDK) respectively. From these design environments, the configuration of the FPGA can be generated.
This prototyping environment avoids the need to translate every processing node into a hardware accelerator before a first test release can be made. It enables a gradual approach in which critical tasks are systematically replaced by hardware accelerators, while the functionality of the complete system is preserved. Moreover, as SPRINT is implementation-agnostic, the general-purpose soft cores can be replaced with more advanced processors (e.g. DSPs), or with dedicated hardware to improve performance. This dedicated hardware is either generated using C-to-implementation tools (such as Synfora, C2H, Impulse C and many others), or designed manually. Figure 4 shows the decision tree for the processing nodes.
Developing and verifying the hardware acceleration
The SPRINT tool also offers support for custom hardware design and verification. It identifies the input and output communication channels for each of the processing nodes, and generates the top-level netlist of the hardware implementation and the entity description for each building block (see Figure 5). In addition, SPRINT can generate all necessary hardware testbenches.
In addition to generating the testbenches, it is possible to generate testbench stimuli and expected outputs. This is done by executing the concurrent implementation model generated in the partitioning phase, and registering the input and output values. The registered inputs are used to drive the hardware simulation of the block under development; the simulated results are compared with the registered outputs.
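A minimal sketch of this registration step, assuming a trivial clipping node as the block under development; the node logic and the file names are examples invented here, not SPRINT conventions:

```c
#include <stdio.h>

/* Software reference of the node under development (illustrative). */
static int clip_node(int p) { return p > 255 ? 255 : p; }

/* Run the reference model over the stimuli and register the channel
   traffic to two files: one to drive the HDL simulation, one holding
   the expected responses to compare against. Returns 0 on success. */
int dump_testbench_data(const int *stimuli, int n,
                        const char *in_path, const char *out_path) {
    FILE *fi = fopen(in_path, "w");
    FILE *fo = fopen(out_path, "w");
    if (!fi || !fo) {
        if (fi) fclose(fi);
        if (fo) fclose(fo);
        return -1;
    }
    for (int i = 0; i < n; i++) {
        fprintf(fi, "%d\n", stimuli[i]);             /* registered input    */
        fprintf(fo, "%d\n", clip_node(stimuli[i]));  /* expected output     */
    }
    fclose(fi);
    fclose(fo);
    return 0;
}
```

An HDL testbench can then replay `in_path` value by value and flag any mismatch with `out_path`.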
For a complete functional verification of a hardware block, a large set of test stimuli is needed. Exercising all of these in simulation would require prohibitively long run times. Therefore, SPRINT provides a path to hardware emulation or FPGA prototyping. This includes the software running on the PC that provides the test stimuli to the emulation board, as well as all hardware interfacing on the emulation platform.
Hardware emulation reduces the required verification time considerably, as the process runs in near real-time. The drawback of the emulation approach is the reduced observability of internal signals. To overcome this, it is crucial to isolate, within the complete set of test stimuli, those stimuli that trigger the errors observed on the emulation platform. This restricted set of stimuli is then rerun in simulation, giving perfect observability of all internal signals. This two-step approach, shown in Figure 6, combines the good observability of simulation with the execution speed of emulation.
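The isolation step can be sketched as a simple comparison of the emulated outputs against the registered reference outputs, collecting only the failing stimuli for re-simulation. This is a simplified stand-in for the actual flow, with invented names:

```c
/* Compare the outputs observed on the emulation platform with the
   registered reference outputs; store the indices of the failing
   stimuli in fail_idx (must hold at least n entries) and return
   how many failures were found. */
int isolate_failures(const int *emulated, const int *reference, int n,
                     int *fail_idx) {
    int k = 0;
    for (int i = 0; i < n; i++)
        if (emulated[i] != reference[i])
            fail_idx[k++] = i;
    return k;
}
```

Only the stimuli at the returned indices need to be rerun in the slow but fully observable simulation.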
With this method, replacing the original software implementation with a custom hardware block is straightforward, because the interface descriptions of all processing nodes are generated automatically. This guarantees the correspondence between the software and hardware interfaces. Additionally, carefully selecting the application testbenches and adhering to the verification strategy guarantees functionally equivalent hardware and software implementations. As a consequence, the software and hardware implementations can easily be interchanged.
In this article we have presented a design flow, supported by the SPRINT tool, that enables a concurrent implementation of an application described in a high-level language. First, the proposed approach evaluates different partitioning alternatives in a matter of minutes, enabling the designer to explore the design space. Second, we have described a route towards an automated FPGA implementation. This automated implementation path enables the designer to further tune the performance of the application by selecting the most appropriate implementation style for each processing node. These styles include software running on a general-purpose processor (such as NIOS, MicroBlaze or PowerPC), a soft-core DSP processor (when available), and dedicated hardware, either developed automatically using existing C-to-implementation tools or designed by hand. Finally, we proposed an integrated verification strategy for the development of these custom processing nodes.
For more details of our methodology, please refer to the journal papers describing this work in the related links section below.
About the Authors:
Bart Vanhoof is a senior R&D engineer at IMEC. His main focus is on validating innovative design tools (developed at IMEC and elsewhere) by applying them to relevant design cases. In this context he is currently applying the SPRINT methodology and tool to map image and video processing applications onto a dedicated network of soft cores on an FPGA. This also includes the implementation of a verification strategy for the development of hardware accelerators.
Kristof Denolf has worked in IMEC's multimedia team since 1998. He is currently a senior R&D engineer, focusing on the cost-efficient design of advanced image and video processing systems and on end-to-end quality of experience.