I'm going to talk to a potential new client about using FPGAs to
accelerate part of their system.
As part of what needs done there could be a significant amount of
division(s) done.
Previously I've been able to multiply by a reciprocal then scale to
make division a double clock operation so this can be easily pipelined.
This is only achieveable if the divisor is pre-known and the
reciprocal can be pre-calculated.
With what's coming up I'm not sure that I can do this, I know that
Are there any clever techniques for streamlining divisions that
make them deterministic and don't use a big wodge of logic?
Nial Stewart <nial*REMOVE_THIS*@nialstewartdevelopments.co.uk > wrote:
> I'm going to talk to a potential new client about using FPGAs to
> accelerate part of their system.
> As part of what needs done there could be a significant amount of
> division(s) done.
> Previously I've been able to multiply by a reciprocal then scale to
> make division a double clock operation so this can be easily pipelined.
> This is only achieveable if the divisor is pre-known and the
> reciprocal can be pre-calculated.
> With what's coming up I'm not sure that I can do this, I know that
> Are there any clever techniques for streamlining divisions that
> make them deterministic and don't use a big wodge of logic?
Do you need high throughput and short latency? If latency isn't that big
issue, I have successfully used the serial devider from opencores.Otherwise
there are more divider projects at opencores.
On Feb 19, 5:15*am, "Nial Stewart"
<nial*REMOVE_TH...@nialstewartdevelopments.co.uk > wrote:
> I'm going to talk to a potential new client about using FPGAs to
> accelerate part of their system.
>
> As part of what needs done there could be a significant amount of
> division(s) done.
>
> Previously I've been able to multiply by a reciprocal then scale to
> make division a double clock operation so this can be easily pipelined.
> This is only achieveable if the divisor is pre-known and the
> reciprocal can be pre-calculated.
>
> With what's coming up I'm not sure that I can do this, I know that
>
> Are there any clever techniques for streamlining divisions that
> make them deterministic and don't use a big wodge of logic?
>
> Thanks for any pointers,
>
Check into the lpm_divide function that is part of the lpm package.
It's synchronous and accepts new inputs on every clock. It is a
standard, although people seem to rail against lpm as being mainly
'Altera'.
> If latency isn't that big
> issue, I have successfully used the serial devider from opencores.Otherwise
> there are more divider projects at opencores.
I'll have a look at Opencores, I hadn't thought of trying there.
> Check into the lpm_divide function that is part of the lpm package.
> It's synchronous and accepts new inputs on every clock. It is a
> standard, although people seem to rail against lpm as being mainly
> 'Altera'.
I'll have a look at this but my impression with this is that it
uses a _lot_ of logic?
> I'll have a look at this but my impression with this is that it
> uses a _lot_ of logic?
It makes a_whole_lotta_luts in one lump.
Requires a slow system clock or a multi-cycle constraint.
Quartus will infer lpm_divide from from int or signed /.
On Feb 19, 8:05*am, "Nial Stewart"
<nial*REMOVE_TH...@nialstewartdevelopments.co.uk > wrote:
> > Check into the lpm_divide function that is part of the lpm package.
> > It's synchronous and accepts new inputs on every clock. *It is a
> > standard, although people seem to rail against lpm as being mainly
> > 'Altera'.
>
> I'll have a look at this but my impression with this is that it
> uses a _lot_ of logic?
>
> Nial
That of course depends on how much you think is *a lot*. It does let
you tradeoff latency for performance though and depending on how you
set that parameter it uses different amounts of logic...if I recall
correctly, 0 latency (i.e. a combinatorial implementation of division)
results in the largest logic usage, a latency equal to the width of
the numerator (or denominator, maybe? fuzzy brain neuron currently
being accessed) resulted in a good tradeoff in my application where I
had three dividers but needed somewhat high performance (75 MHz in a
Stratix I).
If you need to be able to process full divisions on a system clock
cycle, it will do the trick. If your system clock can run much faster
so you can take several clocks to do the divide a divider that works
on single bits at a time might be more appropriate.
On Feb 19, 9:43*am, Mike Treseler <mike_trese...@comcast.net> wrote:
> Nial Stewart wrote:
> > I'll have a look at this but my impression with this is that it
> > uses a _lot_ of logic?
>
> It makes a_whole_lotta_luts in one lump.
> Requires a slow system clock or a multi-cycle constraint.
I ran it at 75 MHz on a Stratix I without any multi-cycle
constraints. You want to use the pipelining parameter to lpm_divide
to tell it how many pipeline stages it can use. That affects how it
implements the logic and allows for tradeoffs between the size of the
lump of logic, the latency and the clock cycle but does not require
specifying of any multi-cycle timing constraints. In any case, it
shouldn't require any dramatic system clock sloooowing.
> Quartus will infer lpm_divide from from int or signed /.
>
Doing so though will result in lpm_divide being parameterized with 0
latency which will result in the largest logic lump and the slowest
clock cycle performance. Depending on the application it might be
better to instantiate the lpm_divide and not have it inferred from
"/".
On Feb 19, 8:02*am, Mike Treseler <mike_trese...@comcast.net> wrote:
> Nial Stewart wrote:
> A binary search multiply-subtract is possible with dsp blocks.
>
That's one that I've always wanted to try just to see how well it
does, but haven't had the need to do so of late. One would expect it
to get high throughput, low logic usage and probably be the best in
today's world of nearly free DSP block multipliers in FPGAs.
>> A binary search multiply-subtract is possible with dsp blocks.
>>
> That's one that I've always wanted to try just to see how well it
> does, but haven't had the need to do so of late.
Yes, that one's on my white board too.
> One would expect it
> to get high throughput, low logic usage and probably be the best in
> today's world of nearly free DSP block multipliers in FPGAs.
It would be a good use for hardware multipliers
that often left unwired.
> Any benchmark data?
Well it couldn't do any better than log2(quotient) ticks
with one multiplier. Hmmm.. could divide that time 2**m for
m multipliers and parallel searches...
>> Quartus will infer lpm_divide from from int or signed /.
>>
> Doing so though will result in lpm_divide being parameterized with 0
> latency which will result in the largest logic lump and the slowest
> clock cycle performance. Depending on the application it might be
> better to instantiate the lpm_divide and not have it inferred from
> "/".
Thanks for the clarifications.
I might also try using "/"
with pipeline registers
and turn on synthesis register retiming.
Take a look at http://www.xilinx.com/support/docume...ides/ug073.pdf
On page 61, it describes two different algorithms. I've implement the
first one without DSP block. It can perform a division of N bits in N
cycles. It runs quite fast; on Spartan-3A-5, 100MHz for 32 bits
(200MHz for 8 bits).
On Feb 19, 7:57 am, KJ <kkjenni...@sbcglobal.net> wrote:
> On Feb 19, 9:43 am, Mike Treseler <mike_trese...@comcast.net> wrote:
> > Quartus will infer lpm_divide from from int or signed /.
>
> Doing so though will result in lpm_divide being parameterized with 0
> latency which will result in the largest logic lump and the slowest
> clock cycle performance. Depending on the application it might be
> better to instantiate the lpm_divide and not have it inferred from
> "/".
It doesn't have to be. If you explicitly delay the result, it will
infer pipeline stages (I haven't tried it for divide, but other lpm
works this way) . Eg.
Nial Stewart wrote:
> I'm going to talk to a potential new client about using FPGAs to
> accelerate part of their system.
>
> As part of what needs done there could be a significant amount of
> division(s) done.
>
> Previously I've been able to multiply by a reciprocal then scale to
> make division a double clock operation so this can be easily pipelined.
> This is only achieveable if the divisor is pre-known and the
> reciprocal can be pre-calculated.
>
> With what's coming up I'm not sure that I can do this, I know that
>
> Are there any clever techniques for streamlining divisions that
> make them deterministic and don't use a big wodge of logic?
>
>
> Thanks for any pointers,
>
>
> Nial
>
>
Nial,
Your reciprocal division is reasonably close to a low latency divider
I've used many times in the past. Instead of just one reciprocal, use
the upper bits of the denominator to address a BRAM containing
reciprocals, then use the lower bits of the denominator to interpolate
between the stored reciprocals to obtain a corrected reciprocal for more
precision. Use that to multiply the numerator to obtain the quotient.
It can be pipelined to run at the max clock for the multipliers, and has
a latency of a BRAM plus two multipliers plus an adder.
For example a divider with 17 bit denominator and numerator and 16 bit
quotient can use the upper 10 bits of the denominator to address a 1Kx18
reciprocal ROM to get an approximate reciprocal. Interpolate using a
second 1Kx18 ROM (holds the slope at each stored reciprocal) and a
multiplier to get a correction equal to the product of the 6 LSBs of the
denominator and the entry from the interpolation rom. That gets added to
the reciprocal ROM output to get a corrected reciprocal. Multiply that
corrected reciprocal by the numerator and round to get the quotient.
The entire 17 bit divide uses two multipliers, two BRAMs and two adders
plus registers to match delays to give a low latency divider. You can
improve the accuracy further if you can normalize the denominator and
numerator before the divide and then denormalize by the difference of
the normalizing shifts afterwards (basically floating point).
Nial Stewart wrote:
> I'm going to talk to a potential new client about using FPGAs to
> accelerate part of their system.
>
> As part of what needs done there could be a significant amount of
> division(s) done.
>
> Previously I've been able to multiply by a reciprocal then scale to
> make division a double clock operation so this can be easily pipelined.
> This is only achieveable if the divisor is pre-known and the
> reciprocal can be pre-calculated.
>
> With what's coming up I'm not sure that I can do this, I know that
>
> Are there any clever techniques for streamlining divisions that
> make them deterministic and don't use a big wodge of logic?
>
>
> Thanks for any pointers,
>
>
> Nial
>
>
I've heard that the CORDIC can be used for division. I'm not sure how
favorably this compares with other methods.
-Kevin
>>
> I've heard that the CORDIC can be used for division. I'm not sure how
> favorably this compares with other methods.
> -Kevin
Cordic division is similar to non-restoring sequential division. It
uses a lot of adder-subtractors, which means it is either quite slow or
it has a large clock latency.
> Nial,
> Your reciprocal division is reasonably close to a low latency divider I've used many times in the
> past. Instead of just one reciprocal, use the upper bits of the denominator to address a BRAM
> containing reciprocals, then use the lower bits of the denominator to interpolate between the
> stored reciprocals to obtain a corrected reciprocal for more precision. Use that to multiply the
> numerator to obtain the quotient. It can be pipelined to run at the max clock for the multipliers,
> and has a latency of a BRAM plus two multipliers plus an adder.
>
> For example a divider with 17 bit denominator and numerator and 16 bit quotient can use the upper
> 10 bits of the denominator to address a 1Kx18 reciprocal ROM to get an approximate reciprocal.
> Interpolate using a second 1Kx18 ROM (holds the slope at each stored reciprocal) and a multiplier
> to get a correction equal to the product of the 6 LSBs of the denominator and the entry from the
> interpolation rom. That gets added to the reciprocal ROM output to get a corrected reciprocal.
> Multiply that corrected reciprocal by the numerator and round to get the quotient. The entire 17
> bit divide uses two multipliers, two BRAMs and two adders plus registers to match delays to give a
> low latency divider. You can improve the accuracy further if you can normalize the denominator
> and numerator before the divide and then denormalize by the difference of the normalizing shifts
> afterwards (basically floating point).
Thanks Ray,
This is exactly the sort of technique I was talking about.
This will defenitely be worth a look in I end up having to do multiple
divisions (as I think I might have to).
<[email protected]> wrote in message news:[email protected]..
> >The NiosII blurb says that it can implement a single instruction
>>32 bit multiply or divide.
>>I wonder what sort of architecture they're using?
> What's unusual about a single instruction multiply or divide? Surely
> it's just a question of writing the microcode to do it.
> Mike
Ah ha, that was me with my hardware head on (I don't have another one)
reading single instruction as single clock!