FPGA Central - World's 1st FPGA / CPLD Portal


FPGA comp.arch.fpga newsgroup (usenet)

#1
09-05-2009, 09:42 AM
Thomas Womack
Interfacing variable-speed functional units

What should I look at for information about designing pipelined
interfaces between multiple producers and multiple consumers in an
environment where the amount of time the consumer takes to perform an
operation is not constant?

Say I'm designing HDL to draw Mandelbrot sets; I have a small block
which takes a pixel position as input and, between 8 and 260 cycles
later, gives an output as to what colour the pixel should be. At 260
cycles per pixel, VGA resolution and a 200MHz clock, that is about 2.5
frames per second; not fast enough.
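
The estimate above can be checked with a few lines of Python. This is only a sketch: it assumes "VGA resolution" means 640x480, that every pixel takes the worst-case 260 cycles, and that extra units share the work perfectly with no stalls. The function name is made up for illustration.

```python
# Worst-case frame rate for the Mandelbrot block described in the post.
CLOCK_HZ = 200_000_000      # 200 MHz clock, from the post
WORST_CYCLES = 260          # worst-case cycles per pixel, from the post
WIDTH, HEIGHT = 640, 480    # assumption: "VGA resolution" means 640x480


def frames_per_second(units: int = 1, cycles_per_pixel: int = WORST_CYCLES) -> float:
    """Frame rate if `units` blocks split the pixels evenly and never stall."""
    cycles_per_frame = WIDTH * HEIGHT * cycles_per_pixel / units
    return CLOCK_HZ / cycles_per_frame


print(f"1 unit:   {frames_per_second(1):.2f} fps")
print(f"16 units: {frames_per_second(16):.2f} fps")
```

With one block this gives roughly 2.5 frames per second; sixteen ideal copies would give roughly 40, which is the upper bound any interconnect scheme can hope to approach.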

Obviously I could instantiate sixteen copies of the block and have
sixteen pixel-position-generator units talking to it. But then I need
a rather awkward unit between the parallel part and the frame buffer,
which can take up to sixteen [location,pixel] inputs each cycle and
queue them up to write to the single frame buffer; I don't know what
this is called or what its internal design would look like. I assume
that it's not possible to design one that can keep up under all
circumstances, so I'd need to propagate a
please-wait-the-queue-is-full signal back down the system, and this is
starting to sound nightmarishly close-coupled enough that I'm sure
clever electronics designers have a much better way to do it.
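
To make the "awkward unit" concrete, here is a behavioural Python sketch (not HDL) of a bounded queue that accepts several [location,pixel] entries per cycle, drains one per cycle towards the frame buffer, and exposes a full flag for back-pressure. The class name, the depth and the method names are assumptions made for this illustration.

```python
from collections import deque


class WriteQueue:
    """Bounded queue between the compute units and the single frame buffer."""

    def __init__(self, depth):
        self.depth = depth
        self.items = deque()

    @property
    def full(self):
        # This is the "please wait, the queue is full" signal.
        return len(self.items) >= self.depth

    def offer(self, entries):
        """Accept as many entries as fit; return the ones that must wait."""
        room = self.depth - len(self.items)
        accepted, rejected = entries[:room], entries[room:]
        self.items.extend(accepted)
        return rejected

    def drain_one(self):
        """One frame-buffer write per cycle; None if nothing is queued."""
        return self.items.popleft() if self.items else None


queue = WriteQueue(depth=4)
waiting = queue.offer([(loc, loc % 256) for loc in range(6)])
print("full:", queue.full, "still waiting:", waiting)
print("written:", queue.drain_one())
```

The producers whose entries come back in the rejected list have to hold their results and retry, which is exactly the coupling the post is worried about.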

If my screen is small enough I could have one sub-frame buffer (as a
dual-port block RAM) per Mandelbrot-set unit, and have the
frame-buffer-managing unit read out from those in rotation; but if the
sub-frames don't all fit on the chip this doesn't work.


I'm also very unsure what the right thing to do is if the producer is
a CPU. For operations that take a long time, I'd raise an interrupt
when the job is done and let the CPU pick up the answer; the
software interface is to hand over a callback function which gets
called by the interrupt handler. But I can't see how to stop the
software from getting terribly convoluted in that case: instead
of a loop to issue the requests, I have to arrange for the callback
function itself to figure out what the next request to issue should
be, and the simple 'do these N things' concept gets divided among
three functions scattered across the code. I also don't know if
there's something better to do in the case where the operation
generally takes a length of time comparable to the interrupt latency.
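
The shape of the software described above can be sketched in Python. Everything here is hypothetical: FakeAccelerator stands in for the hardware (it just squares its input), and raise_interrupts plays the role of the interrupt handler. The point is only to show how "do these N things" ends up spread over an issue function, a handler and a callback.

```python
from collections import deque


class FakeAccelerator:
    """Stand-in for the hardware unit; hypothetical, it squares its input."""

    def __init__(self):
        self.pending = deque()

    def submit(self, value, callback):
        self.pending.append((value, callback))

    def raise_interrupts(self):
        # Plays the role of the interrupt handler: deliver finished jobs.
        while self.pending:
            value, callback = self.pending.popleft()
            callback(value * value)


def run_jobs(accel, inputs):
    todo = deque(inputs)
    results = []

    def issue_next():
        if todo:
            accel.submit(todo.popleft(), on_done)

    def on_done(answer):
        results.append(answer)
        issue_next()          # the callback has to decide what to issue next

    issue_next()              # only the first request is issued "in line"
    accel.raise_interrupts()
    return results


print(run_jobs(FakeAccelerator(), [1, 2, 3]))
```

Note that there is no visible loop over the inputs anywhere: the iteration is implicit in on_done calling issue_next, which is the convolution the post complains about.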

Tom
#2
09-05-2009, 10:22 AM
Frank Buss
Re: Interfacing variable-speed functional units

Thomas Womack wrote:

> Obviously I could instantiate sixteen copies of the block and have
> sixteen pixel-position-generator units talking to it. But then I need
> a rather awkward unit between the parallel part and the frame buffer,
> which can take up to sixteen [location,pixel] inputs each cycle and
> queue them up to write to the single frame buffer; I don't know what
> this is called or what its internal design would look like. I assume
> that it's not possible to design one that can keep up under all
> circumstances,


A simple architecture would be to start all 16 calculations in parallel.
Each parallel unit signals ready. When the AND of all ready signals is 1,
loop over the 16 outputs and write them to the framebuffer. This has no
problem keeping up under all circumstances. The drawback of this simple
design: if one unit needs the maximum time, all the other units twiddle
their thumbs once they have finished.
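
A rough feel for the cost of waiting on the slowest unit can be had from a small Python model. It is a sketch under stated assumptions: the per-pixel cycle counts are made up (uniform between 8 and 260), the 16 results are written one per cycle, and the units sit idle during the write-out. The function name is hypothetical.

```python
import random


def lockstep_cycles(costs, units=16):
    """Return (total cycles, compute utilisation) for the wait-for-all scheme.

    costs[i] is the number of cycles pixel i needs.
    """
    if not costs:
        return 0, 0.0
    total = 0
    for start in range(0, len(costs), units):
        batch = costs[start:start + units]
        total += max(batch)      # every unit waits for the slowest one
        total += len(batch)      # then the outputs are written one by one
    utilisation = sum(costs) / (total * units)
    return total, utilisation


rng = random.Random(1)           # made-up workload: uniform in 8..260 cycles
line = [rng.randint(8, 260) for _ in range(640)]
cycles, util = lockstep_cycles(line)
print(f"{cycles} cycles for 640 pixels, units busy {util:.0%} of the time")
```

In the extreme case of fifteen 8-cycle pixels and one 260-cycle pixel in a batch, the batch takes 276 cycles and the units are busy less than a tenth of the time.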

> so I'd need to propagate a
> please-wait-the-queue-is-full signal back down the system, and this is
> starting to sound nightmarishly close-coupled enough that I'm sure
> clever electronics designers have a much better way to do it.


I think what you are searching for is "bus arbitration".

There are other interesting ideas for distributing many tasks to simple CPU
elements. Take a look at NVIDIA's CUDA architecture, which may give you
some ideas you can use (but I think you have to beware of patents if it
is a commercial project), or at the Cell CPU architecture.

--
Frank Buss, [email protected]
http://www.frank-buss.de, http://www.it4-systems.de
#3
09-05-2009, 03:01 PM
Jonathan Bromley
Re: Interfacing variable-speed functional units

On 05 Sep 2009 09:42:16 +0100 (BST), Thomas Womack wrote:

>Say I'm designing HDL to draw Mandelbrot sets; I have a small block
>which takes a pixel position as input and, between 8 and 260 cycles
>later, gives an output as to what colour the pixel should be. 260
>cycles * VGA resolution * 200MHz = 2.5 frames per second; not fast
>enough.
>Obviously I could instantiate sixteen copies of the block and have
>sixteen pixel-position-generator units talking to it. But then I need
>a rather awkward unit between the parallel part and the frame buffer,


It may help to think of the compute units as a farm. If you can
split up the work into pieces small enough that they can live
entirely on the FPGA (which, for Mandelbrot, is easy: a single
image line can fit comfortably in on-chip RAM) then you could
have a single job dispatcher that works its way along the line,
issuing each pixel in sequence to the first free functional
unit that it finds. Then you have sixteen (or whatever)
"done" flags, and a simple round-robin polling gadget that
checks each flag in turn and, as soon as it finds a "done"
compute unit, pulls the data from that and writes it to the
line buffer. The corresponding compute unit is then free
and will soon be given work by the dispatcher. When the
whole line is done, you can write it to external memory
and switch over the compute farm to work on a second
line buffer while the write-to-memory is in progress.
This approach scales reasonably well to increasing
numbers of compute units.
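
The farm described above can be modelled cycle by cycle in Python. This is a behavioural sketch, not HDL, and it makes some assumptions of its own: the dispatcher issues at most one pixel per cycle, the poller looks at exactly one "done" flag per cycle, and the cycle count of each pixel is stored in the line buffer as a stand-in for its colour. The function name is hypothetical.

```python
import random


def farm_line(costs, units=16):
    """Simulate one image line; costs[i] is the cycle count for pixel i."""
    line_buffer = [None] * len(costs)
    remaining = [0] * units       # cycles left on each unit's current job
    pixel = [None] * units        # pixel index held by each unit, None = free
    next_pixel = 0
    written = 0
    poll = 0
    cycle = 0
    while written < len(costs):
        cycle += 1
        # Poller: look at one unit per cycle, in rotation.
        if pixel[poll] is not None and remaining[poll] == 0:
            line_buffer[pixel[poll]] = costs[pixel[poll]]  # stand-in colour
            pixel[poll] = None    # the unit is free again
            written += 1
        poll = (poll + 1) % units
        # Dispatcher: give the next pixel to the first free unit.
        if next_pixel < len(costs):
            for u in range(units):
                if pixel[u] is None:
                    pixel[u] = next_pixel
                    remaining[u] = costs[next_pixel]
                    next_pixel += 1
                    break
        # Busy units advance by one cycle.
        for u in range(units):
            if pixel[u] is not None and remaining[u] > 0:
                remaining[u] -= 1
    return cycle, line_buffer


rng = random.Random(1)            # made-up workload: uniform in 8..260 cycles
line = [rng.randint(8, 260) for _ in range(640)]
cycles, buffer = farm_line(line)
print(f"line of {len(buffer)} pixels finished in {cycles} cycles")
```

Because results are written by pixel index, it does not matter that the units finish out of order; the line buffer ends up complete and in order before it is burst out to external memory.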

Obviously it's considerably harder when the work spans
more than one pixel, but it's still worth the effort of
breaking up the work so that you are operating on one or
more complete image lines stored on-chip. Each line can
then be written to the external frame store efficiently
as one or more bursts.
--
Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * e * Perl * Tcl/Tk * Project Services

Doulos Ltd., 22 Market Place, Ringwood, BH24 1AW, UK
[email protected]
http://www.MYCOMPANY.com

The contents of this message may contain personal views which
are not the views of Doulos Ltd., unless specifically stated.
#4
09-05-2009, 03:37 PM
Kolja
Re: Interfacing variable-speed functional units

On 5 Sep., 10:42, Thomas Womack <[email protected]> wrote:
> Say I'm designing HDL to draw Mandelbrot sets; I have a small block
> which takes a pixel position as input and, between 8 and 260 cycles
> later, gives an output as to what colour the pixel should be. 260
> cycles * VGA resolution * 200MHz = 2.5 frames per second; not fast
> enough.
>
> Obviously I could instantiate sixteen copies of the block and have
> sixteen pixel-position-generator units talking to it. But then I need
> a rather awkward unit between the parallel part and the frame buffer,
> which can take up to sixteen [location,pixel] inputs each cycle and
> queue them up to write to the single frame buffer;


No, you don't. If you have N processing units (PU) that perform
computations that require more than N clock cycles, it is obviously
sufficient to issue one task per clock cycle and retire one task per
clock cycle.

To avoid idle cycles at the units, each unit should be able to buffer
one result that waits to be retired, and it should also be able to store
the instructions for the next task while still busy with the previous
task.

There is one scheduler that selects a PU with an empty instruction
register and issues an instruction to it, and an arbiter that selects a
PU with a valid result in its output buffer and writes it wherever it
should be written to.
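
The scheduler and arbiter just described can be sketched as a cycle-level Python model. It is an illustration under assumptions, not a definitive design: the arbiter uses fixed priority (a real one would probably rotate), the scheduler prefers a completely idle PU, and the names PU and run are made up.

```python
class PU:
    def __init__(self):
        self.next_job = None   # instruction register: job waiting to start
        self.job = None        # job currently being computed
        self.left = 0          # cycles left on the current job
        self.result = None     # output buffer: finished job waiting to retire


def run(costs, n_units=16):
    """Return (cycles, retirement order); costs[i] is the cycle count of job i."""
    pus = [PU() for _ in range(n_units)]
    issued = 0
    retired = []
    cycle = 0
    while len(retired) < len(costs):
        cycle += 1
        # Arbiter: retire at most one buffered result per cycle.
        for pu in pus:
            if pu.result is not None:
                retired.append(pu.result)
                pu.result = None
                break
        # Scheduler: issue at most one job per cycle to an empty
        # instruction register, preferring a PU that is doing nothing.
        if issued < len(costs):
            free = [pu for pu in pus if pu.next_job is None]
            if free:
                target = min(free, key=lambda pu: pu.job is not None)
                target.next_job = issued
                issued += 1
        # Each PU starts its buffered job when idle and advances one cycle.
        for pu in pus:
            if pu.job is None and pu.next_job is not None:
                pu.job, pu.next_job = pu.next_job, None
                pu.left = costs[pu.job]
            if pu.job is not None:
                if pu.left > 0:
                    pu.left -= 1
                if pu.left == 0 and pu.result is None:
                    pu.result = pu.job   # stalls here if the buffer is occupied
                    pu.job = None
    return cycle, retired


cycles, order = run([20] * 64, n_units=16)
print(f"64 jobs retired in {cycles} cycles")
```

With 16 units and 20-cycle jobs the farm as a whole finishes less than one job per cycle on average, so the single issue port and single retire port are not the bottleneck.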

If you have very high clock rates or a very large number of PUs,
arbitration and scheduling might not be possible in a single cycle. You
can then use a linear array of PUs. Instructions are fed to the left end
of the array. Each unit either consumes the instruction or forwards it
to its neighbor to the right. On the outputs, each PU forwards the
result from its left neighbor, or its own result if none is presented by
the neighbor. This approach can achieve very high clock rates because
only local routing is required and the decision making is extremely
simple.
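
A behavioural Python sketch of such a linear array follows. It fills in details the description leaves open, and those are assumptions: an instruction that reaches a busy PU whose right-hand register is occupied (or that reaches the right end) simply waits, and results leave the array at the right end. The function name is hypothetical.

```python
def linear_array(costs, n_units=8):
    """Return (cycles, retirement order); costs[i] is the cycle count of job i."""
    n = n_units
    instr = [None] * n    # instruction register in front of each PU
    job = [None] * n      # job each PU is currently holding
    left = [0] * n        # cycles left on that job
    out = [None] * n      # result register behind each PU
    fed = 0
    retired = []
    cycle = 0
    while len(retired) < len(costs):
        cycle += 1
        # Result chain: the rightmost register leaves the array.
        if out[n - 1] is not None:
            retired.append(out[n - 1])
        for i in range(n - 1, -1, -1):       # right to left: one hop per cycle
            from_left = out[i - 1] if i > 0 else None
            if from_left is not None:
                out[i] = from_left           # forward the neighbour's result
            elif job[i] is not None and left[i] == 0:
                out[i] = job[i]              # nothing from the left: send our own
                job[i] = None
            else:
                out[i] = None
        # Instruction chain: consume if free, else pass to the right.
        was_empty = [slot is None for slot in instr]
        for i in range(n - 1, -1, -1):
            if instr[i] is None:
                continue
            if job[i] is None:
                job[i], left[i] = instr[i], costs[instr[i]]
                instr[i] = None
            elif i + 1 < n and was_empty[i + 1]:
                instr[i + 1], instr[i] = instr[i], None
            # else: hold the instruction here (assumed back-pressure)
        # Feed the left end when its register is empty.
        if fed < len(costs) and instr[0] is None:
            instr[0] = fed
            fed += 1
        # Busy PUs advance by one cycle.
        for i in range(n):
            if job[i] is not None and left[i] > 0:
                left[i] -= 1
    return cycle, retired


cycles, order = linear_array([5, 40, 5, 12, 8, 30, 5, 5, 9, 20], n_units=4)
print(f"retired {len(order)} jobs in {cycles} cycles, order {order}")
```

Every decision in the model uses only a PU's own state and its immediate neighbours' registers, which is what keeps the routing local; the price is that results come out in a different order from the one in which they were issued, so each result has to carry its pixel location with it.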

Kolja Sulimma
www.cronologic.de

