FPGA Central - World's 1st FPGA / CPLD Portal

FPGA Central

World's 1st FPGA Portal

 

Go Back   FPGA Groups > NewsGroup > FPGA

FPGA comp.arch.fpga newsgroup (usenet)

Reply
 
LinkBack Thread Tools Display Modes
  #1 (permalink)  
Old 02-19-2008, 11:15 AM
Nial Stewart
Guest
 
Posts: n/a
Default Efficient division algorithm?

I'm going to talk to a potential new client about using FPGAs to
accelerate part of their system.

As part of what needs done there could be a significant amount of
division(s) done.

Previously I've been able to multiply by a reciprocal then scale to
make division a double clock operation so this can be easily pipelined.
This is only achieveable if the divisor is pre-known and the
reciprocal can be pre-calculated.

With what's coming up I'm not sure that I can do this, I know that

Are there any clever techniques for streamlining divisions that
make them deterministic and don't use a big wodge of logic?


Thanks for any pointers,


Nial


Reply With Quote
  #2 (permalink)  
Old 02-19-2008, 11:27 AM
Uwe Bonnes
Guest
 
Posts: n/a
Default Re: Efficient division algorithm?

Nial Stewart <nial*REMOVE_THIS*@nialstewartdevelopments.co.uk > wrote:
> I'm going to talk to a potential new client about using FPGAs to
> accelerate part of their system.


> As part of what needs done there could be a significant amount of
> division(s) done.


> Previously I've been able to multiply by a reciprocal then scale to
> make division a double clock operation so this can be easily pipelined.
> This is only achieveable if the divisor is pre-known and the
> reciprocal can be pre-calculated.


> With what's coming up I'm not sure that I can do this, I know that


> Are there any clever techniques for streamlining divisions that
> make them deterministic and don't use a big wodge of logic?


Do you need high throughput and short latency? If latency isn't that big
issue, I have successfully used the serial devider from opencores.Otherwise
there are more divider projects at opencores.

--
Uwe Bonnes [email protected]

Institut fuer Kernphysik Schlossgartenstrasse 9 64289 Darmstadt
--------- Tel. 06151 162516 -------- Fax. 06151 164321 ----------
Reply With Quote
  #3 (permalink)  
Old 02-19-2008, 01:41 PM
KJ
Guest
 
Posts: n/a
Default Re: Efficient division algorithm?

On Feb 19, 5:15*am, "Nial Stewart"
<nial*REMOVE_TH...@nialstewartdevelopments.co.uk > wrote:
> I'm going to talk to a potential new client about using FPGAs to
> accelerate part of their system.
>
> As part of what needs done there could be a significant amount of
> division(s) done.
>
> Previously I've been able to multiply by a reciprocal then scale to
> make division a double clock operation so this can be easily pipelined.
> This is only achieveable if the divisor is pre-known and the
> reciprocal can be pre-calculated.
>
> With what's coming up I'm not sure that I can do this, I know that
>
> Are there any clever techniques for streamlining divisions that
> make them deterministic and don't use a big wodge of logic?
>
> Thanks for any pointers,
>


Check into the lpm_divide function that is part of the lpm package.
It's synchronous and accepts new inputs on every clock. It is a
standard, although people seem to rail against lpm as being mainly
'Altera'.

Kevin Jennings
Reply With Quote
  #4 (permalink)  
Old 02-19-2008, 02:02 PM
Mike Treseler
Guest
 
Posts: n/a
Default Re: Efficient division algorithm?

Nial Stewart wrote:

> Are there any clever techniques for streamlining divisions that
> make them deterministic and don't use a big wodge of logic?


That's a trade-off with speed.
A subtract loop is simple and deterministic but slow.
A binary search multiply-subtract is possible with dsp blocks.

-- Mike Treseler
Reply With Quote
  #5 (permalink)  
Old 02-19-2008, 02:04 PM
Nial Stewart
Guest
 
Posts: n/a
Default Re: Efficient division algorithm?

> Do you need high throughput and short latency?

I don't know yet :-)

> If latency isn't that big
> issue, I have successfully used the serial devider from opencores.Otherwise
> there are more divider projects at opencores.


I'll have a look at Opencores, I hadn't thought of trying there.

Thanks,

Nial.



Reply With Quote
  #6 (permalink)  
Old 02-19-2008, 02:05 PM
Nial Stewart
Guest
 
Posts: n/a
Default Re: Efficient division algorithm?

> Check into the lpm_divide function that is part of the lpm package.
> It's synchronous and accepts new inputs on every clock. It is a
> standard, although people seem to rail against lpm as being mainly
> 'Altera'.


I'll have a look at this but my impression with this is that it
uses a _lot_ of logic?



Nial


Reply With Quote
  #7 (permalink)  
Old 02-19-2008, 03:43 PM
Mike Treseler
Guest
 
Posts: n/a
Default Re: Efficient division algorithm?

Nial Stewart wrote:

> I'll have a look at this but my impression with this is that it
> uses a _lot_ of logic?


It makes a_whole_lotta_luts in one lump.
Requires a slow system clock or a multi-cycle constraint.
Quartus will infer lpm_divide from from int or signed /.

-- Mike Treseler

Reply With Quote
  #8 (permalink)  
Old 02-19-2008, 04:50 PM
KJ
Guest
 
Posts: n/a
Default Re: Efficient division algorithm?

On Feb 19, 8:05*am, "Nial Stewart"
<nial*REMOVE_TH...@nialstewartdevelopments.co.uk > wrote:
> > Check into the lpm_divide function that is part of the lpm package.
> > It's synchronous and accepts new inputs on every clock. *It is a
> > standard, although people seem to rail against lpm as being mainly
> > 'Altera'.

>
> I'll have a look at this but my impression with this is that it
> uses a _lot_ of logic?
>
> Nial


That of course depends on how much you think is *a lot*. It does let
you tradeoff latency for performance though and depending on how you
set that parameter it uses different amounts of logic...if I recall
correctly, 0 latency (i.e. a combinatorial implementation of division)
results in the largest logic usage, a latency equal to the width of
the numerator (or denominator, maybe? fuzzy brain neuron currently
being accessed) resulted in a good tradeoff in my application where I
had three dividers but needed somewhat high performance (75 MHz in a
Stratix I).

If you need to be able to process full divisions on a system clock
cycle, it will do the trick. If your system clock can run much faster
so you can take several clocks to do the divide a divider that works
on single bits at a time might be more appropriate.

Kevin Jennings
Reply With Quote
  #9 (permalink)  
Old 02-19-2008, 04:57 PM
KJ
Guest
 
Posts: n/a
Default Re: Efficient division algorithm?

On Feb 19, 9:43*am, Mike Treseler <mike_trese...@comcast.net> wrote:
> Nial Stewart wrote:
> > I'll have a look at this but my impression with this is that it
> > uses a _lot_ of logic?

>
> It makes a_whole_lotta_luts in one lump.
> Requires a slow system clock or a multi-cycle constraint.


I ran it at 75 MHz on a Stratix I without any multi-cycle
constraints. You want to use the pipelining parameter to lpm_divide
to tell it how many pipeline stages it can use. That affects how it
implements the logic and allows for tradeoffs between the size of the
lump of logic, the latency and the clock cycle but does not require
specifying of any multi-cycle timing constraints. In any case, it
shouldn't require any dramatic system clock sloooowing.

> Quartus will infer lpm_divide from from int or signed /.
>

Doing so though will result in lpm_divide being parameterized with 0
latency which will result in the largest logic lump and the slowest
clock cycle performance. Depending on the application it might be
better to instantiate the lpm_divide and not have it inferred from
"/".

Kevin Jennings
Reply With Quote
  #10 (permalink)  
Old 02-19-2008, 05:01 PM
KJ
Guest
 
Posts: n/a
Default Re: Efficient division algorithm?

On Feb 19, 8:02*am, Mike Treseler <mike_trese...@comcast.net> wrote:
> Nial Stewart wrote:
> A binary search multiply-subtract is possible with dsp blocks.
>

That's one that I've always wanted to try just to see how well it
does, but haven't had the need to do so of late. One would expect it
to get high throughput, low logic usage and probably be the best in
today's world of nearly free DSP block multipliers in FPGAs.

Any benchmark data?

Kevin Jennings
Reply With Quote
  #11 (permalink)  
Old 02-19-2008, 06:46 PM
Mike Treseler
Guest
 
Posts: n/a
Default Re: Efficient division algorithm?

KJ wrote:

>> A binary search multiply-subtract is possible with dsp blocks.
>>

> That's one that I've always wanted to try just to see how well it
> does, but haven't had the need to do so of late.


Yes, that one's on my white board too.

> One would expect it
> to get high throughput, low logic usage and probably be the best in
> today's world of nearly free DSP block multipliers in FPGAs.


It would be a good use for hardware multipliers
that often left unwired.

> Any benchmark data?


Well it couldn't do any better than log2(quotient) ticks
with one multiplier. Hmmm.. could divide that time 2**m for
m multipliers and parallel searches...

-- Mike Treseler
Reply With Quote
  #12 (permalink)  
Old 02-19-2008, 06:54 PM
Mike Treseler
Guest
 
Posts: n/a
Default Re: Efficient division algorithm?

KJ wrote:

>> Quartus will infer lpm_divide from from int or signed /.
>>

> Doing so though will result in lpm_divide being parameterized with 0
> latency which will result in the largest logic lump and the slowest
> clock cycle performance. Depending on the application it might be
> better to instantiate the lpm_divide and not have it inferred from
> "/".


Thanks for the clarifications.
I might also try using "/"
with pipeline registers
and turn on synthesis register retiming.

-- Mike Treseler
Reply With Quote
  #13 (permalink)  
Old 02-19-2008, 09:19 PM
Alain
Guest
 
Posts: n/a
Default Re: Efficient division algorithm?

Hi all,

Take a look at http://www.xilinx.com/support/docume...ides/ug073.pdf
On page 61, it describes two different algorithms. I've implement the
first one without DSP block. It can perform a division of N bits in N
cycles. It runs quite fast; on Spartan-3A-5, 100MHz for 32 bits
(200MHz for 8 bits).

Best regards.
Reply With Quote
  #14 (permalink)  
Old 02-19-2008, 11:13 PM
Tommy Thorn
Guest
 
Posts: n/a
Default Re: Efficient division algorithm?

On Feb 19, 7:57 am, KJ <kkjenni...@sbcglobal.net> wrote:
> On Feb 19, 9:43 am, Mike Treseler <mike_trese...@comcast.net> wrote:
> > Quartus will infer lpm_divide from from int or signed /.

>
> Doing so though will result in lpm_divide being parameterized with 0
> latency which will result in the largest logic lump and the slowest
> clock cycle performance. Depending on the application it might be
> better to instantiate the lpm_divide and not have it inferred from
> "/".


It doesn't have to be. If you explicitly delay the result, it will
infer pipeline stages (I haven't tried it for divide, but other lpm
works this way) . Eg.

res3 <= a / b;
res2 <= res3;
res1 <= res2;
final_result <= res1;

You get the idea.

Tommy
Reply With Quote
  #15 (permalink)  
Old 02-20-2008, 02:58 AM
Ray Andraka
Guest
 
Posts: n/a
Default Re: Efficient division algorithm?

Nial Stewart wrote:
> I'm going to talk to a potential new client about using FPGAs to
> accelerate part of their system.
>
> As part of what needs done there could be a significant amount of
> division(s) done.
>
> Previously I've been able to multiply by a reciprocal then scale to
> make division a double clock operation so this can be easily pipelined.
> This is only achieveable if the divisor is pre-known and the
> reciprocal can be pre-calculated.
>
> With what's coming up I'm not sure that I can do this, I know that
>
> Are there any clever techniques for streamlining divisions that
> make them deterministic and don't use a big wodge of logic?
>
>
> Thanks for any pointers,
>
>
> Nial
>
>


Nial,

Your reciprocal division is reasonably close to a low latency divider
I've used many times in the past. Instead of just one reciprocal, use
the upper bits of the denominator to address a BRAM containing
reciprocals, then use the lower bits of the denominator to interpolate
between the stored reciprocals to obtain a corrected reciprocal for more
precision. Use that to multiply the numerator to obtain the quotient.
It can be pipelined to run at the max clock for the multipliers, and has
a latency of a BRAM plus two multipliers plus an adder.

For example a divider with 17 bit denominator and numerator and 16 bit
quotient can use the upper 10 bits of the denominator to address a 1Kx18
reciprocal ROM to get an approximate reciprocal. Interpolate using a
second 1Kx18 ROM (holds the slope at each stored reciprocal) and a
multiplier to get a correction equal to the product of the 6 LSBs of the
denominator and the entry from the interpolation rom. That gets added to
the reciprocal ROM output to get a corrected reciprocal. Multiply that
corrected reciprocal by the numerator and round to get the quotient.
The entire 17 bit divide uses two multipliers, two BRAMs and two adders
plus registers to match delays to give a low latency divider. You can
improve the accuracy further if you can normalize the denominator and
numerator before the divide and then denormalize by the difference of
the normalizing shifts afterwards (basically floating point).
Reply With Quote
  #16 (permalink)  
Old 02-20-2008, 07:50 PM
Kevin Neilson
Guest
 
Posts: n/a
Default Re: Efficient division algorithm?

Nial Stewart wrote:
> I'm going to talk to a potential new client about using FPGAs to
> accelerate part of their system.
>
> As part of what needs done there could be a significant amount of
> division(s) done.
>
> Previously I've been able to multiply by a reciprocal then scale to
> make division a double clock operation so this can be easily pipelined.
> This is only achieveable if the divisor is pre-known and the
> reciprocal can be pre-calculated.
>
> With what's coming up I'm not sure that I can do this, I know that
>
> Are there any clever techniques for streamlining divisions that
> make them deterministic and don't use a big wodge of logic?
>
>
> Thanks for any pointers,
>
>
> Nial
>
>

I've heard that the CORDIC can be used for division. I'm not sure how
favorably this compares with other methods.
-Kevin
Reply With Quote
  #17 (permalink)  
Old 02-20-2008, 08:28 PM
Ray Andraka
Guest
 
Posts: n/a
Default Re: Efficient division algorithm?

Kevin Neilson wrote:

>>

> I've heard that the CORDIC can be used for division. I'm not sure how
> favorably this compares with other methods.
> -Kevin


Cordic division is similar to non-restoring sequential division. It
uses a lot of adder-subtractors, which means it is either quite slow or
it has a large clock latency.
Reply With Quote
  #18 (permalink)  
Old 02-21-2008, 11:42 AM
Nial Stewart
Guest
 
Posts: n/a
Default Re: Efficient division algorithm?

> Nial,
> Your reciprocal division is reasonably close to a low latency divider I've used many times in the
> past. Instead of just one reciprocal, use the upper bits of the denominator to address a BRAM
> containing reciprocals, then use the lower bits of the denominator to interpolate between the
> stored reciprocals to obtain a corrected reciprocal for more precision. Use that to multiply the
> numerator to obtain the quotient. It can be pipelined to run at the max clock for the multipliers,
> and has a latency of a BRAM plus two multipliers plus an adder.
>
> For example a divider with 17 bit denominator and numerator and 16 bit quotient can use the upper
> 10 bits of the denominator to address a 1Kx18 reciprocal ROM to get an approximate reciprocal.
> Interpolate using a second 1Kx18 ROM (holds the slope at each stored reciprocal) and a multiplier
> to get a correction equal to the product of the 6 LSBs of the denominator and the entry from the
> interpolation rom. That gets added to the reciprocal ROM output to get a corrected reciprocal.
> Multiply that corrected reciprocal by the numerator and round to get the quotient. The entire 17
> bit divide uses two multipliers, two BRAMs and two adders plus registers to match delays to give a
> low latency divider. You can improve the accuracy further if you can normalize the denominator
> and numerator before the divide and then denormalize by the difference of the normalizing shifts
> afterwards (basically floating point).



Thanks Ray,

This is exactly the sort of technique I was talking about.

This will defenitely be worth a look in I end up having to do multiple
divisions (as I think I might have to).


Nial.


Reply With Quote
  #19 (permalink)  
Old 02-21-2008, 03:10 PM
Nial Stewart
Guest
 
Posts: n/a
Default Further Thoughts...

The NiosII blurb says that it can implement a single instruction
32 bit multiply or divide.

I wonder what sort of architecture they're using?


Nial.


Reply With Quote
  #20 (permalink)  
Old 02-21-2008, 04:05 PM
[email protected]
Guest
 
Posts: n/a
Default Re: Further Thoughts...

>The NiosII blurb says that it can implement a single instruction
>32 bit multiply or divide.
>
>I wonder what sort of architecture they're using?


What's unusual about a single instruction multiply or divide? Surely
it's just a question of writing the microcode to do it.

Mike
Reply With Quote
  #21 (permalink)  
Old 02-25-2008, 11:26 AM
Nial Stewart
Guest
 
Posts: n/a
Default Re: Further Thoughts...

<[email protected]> wrote in message news:[email protected]..
> >The NiosII blurb says that it can implement a single instruction
>>32 bit multiply or divide.
>>I wonder what sort of architecture they're using?

> What's unusual about a single instruction multiply or divide? Surely
> it's just a question of writing the microcode to do it.
> Mike



Ah ha, that was me with my hardware head on (I don't have another one)
reading single instruction as single clock!

:-(


Nial


Reply With Quote
Reply

Bookmarks

Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On


Similar Threads
Thread Thread Starter Forum Replies Last Post
Filtered Back Projection Algorithm (FBP Algorithm) Bapaiah Katepalli Verilog 1 06-23-2006 06:58 PM
Efficient implementation of Address Decoding logic Srikanth BJ FPGA 10 06-09-2006 02:37 PM
Hints for efficient 32 bit multiplier Giox FPGA 4 09-23-2005 01:41 PM
How to create area-efficient comparator rootz Verilog 5 09-15-2005 05:31 PM
Efficient Voltage Regulators Spartan 3 Current Requirements Yaju N FPGA 8 02-25-2005 04:39 AM


All times are GMT +1. The time now is 02:23 AM.


Powered by vBulletin® Version 3.8.0
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.
Search Engine Friendly URLs by vBSEO 3.2.0
Copyright 2008 @ FPGA Central. All rights reserved