I am planning an FPGA design with four cascaded 2nd-order IIR filters.
The question/feedback/advice I am seeking is the following:

What resolution can I have on the input and output data buses of the
IIRs?
Assume there is nothing else but the IIRs in the FPGA.

P.S. The FPGA is a Spartan-3 (400k gates).

I made a rough estimate. I would need:
~800-1000 FFs (there is a total of 8k)
~14 16-bit adders (I do not know the total available)
~8 18x18 dedicated multipliers (there is a total of 16)
and a whole bunch of muxes; I estimate ~2000 4:1 muxes/demuxes

The above logic is for:
4 2nd-order IIRs
16-bit input data bus for each IIR
16-bit output data bus for each IIR
64-bit feedforward and feedback coefficients for each IIR
An input DC gain of 2^12 for each IIR
One, and only one, 96-bit adder responsible for all the sums
One, and only one, 27x64-bit multiplier responsible for all the
multiplications

The adder and the multiplier will run at a much higher frequency
than the sample rate, which lets them do all the operations
for all the IIRs.
The sample rate is 1 MHz. I am assuming that the sample clock can be
multiplied up by a factor of at least 50. A factor of 50 would give at
LEAST 1 cycle/operation: there are 20 sums and 20 multiplications to be
done per sample period.
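
The cycle budget stated above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the figures come from the post, while the function name and the two scheduling cases (strictly serial versus adder and multiplier working side by side) are assumptions made for illustration.

```python
# Figures taken from the post; nothing here is measured on hardware.
SAMPLE_RATE_HZ = 1_000_000      # 1 MHz sample rate
CLOCK_MULTIPLIER = 50           # assumed minimum clock multiplication factor
SUMS_PER_SAMPLE = 20
MULTIPLIES_PER_SAMPLE = 20


def cycles_per_operation(multiplier, operations):
    """Clock cycles available per operation within one sample period."""
    return multiplier / operations


# Worst case: adds and multiplies take turns in a single time slot.
serial = cycles_per_operation(
    CLOCK_MULTIPLIER, SUMS_PER_SAMPLE + MULTIPLIES_PER_SAMPLE
)
# Better case: the adder and the multiplier are separate units that overlap.
overlapped = cycles_per_operation(
    CLOCK_MULTIPLIER, max(SUMS_PER_SAMPLE, MULTIPLIES_PER_SAMPLE)
)
clock_hz = SAMPLE_RATE_HZ * CLOCK_MULTIPLIER

print(f"fast clock: {clock_hz / 1e6:.0f} MHz")
print(f"cycles/op, strictly serial: {serial:.2f}")
print(f"cycles/op, adder and multiplier overlapped: {overlapped:.2f}")
```

Under these assumptions the "at least 1 cycle/operation" figure holds even in the strictly serial case, but with little slack for pipeline latency.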

Hence, I came to the conclusion that such a digital filter design
will take ~25% of the space of the FPGA. Does this sound accurate?
However, I do not know how to account for routing overhead.

I would appreciate citations of previous projects and what percentage of
the FPGA they occupied.

"Roger Bourne" <[email protected]> wrote in message
<snip>

You could reduce your resource requirements significantly by implementing a
multi-channel, multi-stage mechanism that manipulates your data and
coefficients through one BlockRAM - eliminating most of the multiplexers -
and pipelines some of the operations such as the multiply to use fewer
resources overall.

For these kinds of things, a little pseudocode and a spreadsheet can help to
visualize how to break up the problem and verify the solution.

That seems reasonable in terms of HW resources, but I would throw in a
guard of at least another 50% until you have done an actual synthesis
with P/R. For most data paths, even hand-placed, I usually see that 1/3
of the resources can't be used because of placement conflicts etc. So
for N known flops used, add at least another 20% which can't be used.
Your really wide 96-bit adder and 64-bit multiplier should be pipelined,
and that adds many flops. YMMV
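
As a rough illustration of those margins, here is a small Python sketch applied to the flip-flop estimate from the original post. The helper name `with_margin` is hypothetical, and the 8k total is the figure quoted in the post, not a data-sheet value.

```python
def with_margin(estimate, margin):
    """Scale a raw resource estimate by a fractional guard band."""
    return estimate * (1.0 + margin)


FF_ESTIMATE = 1000        # upper end of the ~800-1000 FF estimate in the post
FF_AVAILABLE = 8000       # "a total of 8k", as quoted in the post
UNUSABLE_MARGIN = 0.20    # "at least another 20% which can't be used"
GUARD_MARGIN = 0.50       # "a guard of at least another 50%"

for label, margin in (("20% unusable", UNUSABLE_MARGIN),
                      ("50% guard", GUARD_MARGIN)):
    padded = with_margin(FF_ESTIMATE, margin)
    share = 100 * padded / FF_AVAILABLE
    print(f"{label}: {padded:.0f} FFs, {share:.1f}% of the device")
```

Pipeline registers for the wide adder and multiplier are not included here; they would be added to `FF_ESTIMATE` before the margin is applied.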

John_H wrote:
> <snip>
> Are you looking specifically for a tiny solution?
I am looking for a solution that fits in the FPGA. Tiny? Not really,
as long as everything fits.

>...ism that manipulates your data and coefficients through one BlockRAM

BlockRAM. Great idea! I checked the timing specs of the BlockRAM
module, and it seems pretty fast: 1 clock cycle to write and 1 clock
cycle to read, with a max frequency of ~160 MHz. No need for a complex
multiplexing network. In fact, there is no need for delay elements
(FFs) altogether!

However, I have never used RAM on an FPGA (that is the reason I did not
initially lean towards that solution). Is there some obvious, flagrant,
blatant drawback to using RAM instead of FFs? Especially since
there are 36 times more RAM bits than available FFs (288K vs 8K). And in
RAM, ALL the bits can be used!

According to the timing waveform in the specs, it only requires 1 cycle
for a read and 1 cycle for a write, so I do not think loss of cycles
between data transfers will be an issue, especially if the data rate is
~150 times slower than the fastest clock available. The module that
performs the multiplication can thus be time-multiplexed.

It sounds like working on a DSP rather than an FPGA, if one
forgoes the use of FFs... :-)
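
A quick storage estimate shows how little of one BlockRAM the filter state and coefficients would need, and where a cost can appear instead. This is a sketch under stated assumptions: each 2nd-order section is taken to be a direct-form II biquad (5 coefficients, 2 state words), state is kept at the full 96-bit adder width, and the RAM port is assumed to be configured 32 data bits wide. None of these choices is confirmed by the thread.

```python
import math

SECTIONS = 4
COEFFS_PER_SECTION = 5            # assumed: b0, b1, b2, a1, a2
STATE_WORDS_PER_SECTION = 2       # assumed: direct-form II delay elements
COEFF_BITS = 64                   # coefficient width from the original post
STATE_BITS = 96                   # assumed: state kept at the adder width
BLOCK_RAM_DATA_BITS = 16 * 1024   # one 18 Kbit block RAM, parity excluded
PORT_DATA_BITS = 32               # assumed port configuration

coeff_storage = SECTIONS * COEFFS_PER_SECTION * COEFF_BITS
state_storage = SECTIONS * STATE_WORDS_PER_SECTION * STATE_BITS
total_storage = coeff_storage + state_storage

# A word wider than the port needs several accesses (or several RAMs side by side).
accesses_per_coeff = math.ceil(COEFF_BITS / PORT_DATA_BITS)
accesses_per_state = math.ceil(STATE_BITS / PORT_DATA_BITS)

print(f"storage needed: {total_storage} bits "
      f"({100 * total_storage / BLOCK_RAM_DATA_BITS:.1f}% of one block RAM)")
print(f"port accesses per coefficient: {accesses_per_coeff}, "
      f"per state word: {accesses_per_state}")
```

Under these assumptions capacity is not the constraint. The thing to watch is access width: unlike FFs, which can all be read in the same cycle, a RAM port delivers one word per cycle, so very wide operands cost either extra cycles or extra BlockRAMs in parallel.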

Depending on speed, using an FPGA can be exactly like using a DSP.

The use of the BRAM basically means you are building a custom DSP
machine, which will 'execute' a fixed program (based on a FSM),
manipulating the BRAM contents much like a DSP would bring operands in
and out of the ALU, to and from memory.

Personally, if something this slow would work, to make your life easier,
you might consider MicroBlaze as the processor, and execute both program
and data from BRAMs. That way the program (which may already be in C
code) could remain in C code.

Or, alternatively, use a "real" DSP processor, as (let's be honest) the
FPGA may be extreme overkill for what you may be doing.

If the speed there is just not fast enough, there may be hardened FFT
filter structures that are serial, rather than parallel, which still may
be fast enough (faster than a DSP), and yet use fewer resources (than a
fully parallel one). The SRL16s are particularly good at this, as each
LUT configured as an SRL16 gives you a shift register of up to 16 stages
in the SLICEs that support SRLs/LUTRAM.

Remember that a parallel multiply may not be needed, and a serial
multiplier may be a lot less hungry (for resources, overall).
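
To make the serial-versus-parallel trade-off concrete, here is a behavioural Python model of a shift-and-add multiplier that consumes one multiplier bit per step. It is a sketch of the general technique, not any particular FPGA core; the function name and the unsigned-only operands are simplifying assumptions.

```python
def serial_multiply(a, b, bits):
    """Multiply two unsigned `bits`-wide integers one multiplier bit per step.

    Returns (product, steps). Each step stands for one clock cycle of a
    shift-and-add datapath that needs a single adder instead of a full
    parallel multiplier array.
    """
    limit = 1 << bits
    if not (0 <= a < limit and 0 <= b < limit):
        raise ValueError("operands must fit in the given width")
    accumulator = 0
    for step in range(bits):
        if (b >> step) & 1:            # examine one bit of the multiplier
            accumulator += a << step   # add the shifted multiplicand
    return accumulator, bits


product, cycles = serial_multiply(1234, 5678, 16)
print(product, cycles)
```

The cost is time: the step count equals the multiplier width, so a 64-bit coefficient would take 64 such steps, which has to be weighed against the cycles available per sample.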

Many extreme audio applications (see NAB conference) use serial
processing of many audio streams at once on a single FPGA for a superb
cost/performance point.

Finally, if the problem can be partitioned in time into more than one
piece, I have seen people calculate part 1, store results in an external
SRAM, reconfigure, and then read in last part and calculate part 2,
store results in external SRAM, etc...

"Roger Bourne" <[email protected]> wrote in message
<snip>

If you want a dedicated port to a controller to allow on-the-fly update of
coefficient values, a dual-port RAM lets you put the controller on one
port and the data I/O on the other. If you have a fixed configuration, you
can dedicate one port for read, one for write, and your data can flow at the
full 320 MHz BlockRAM rate. Dual-ports are great.

Initializing BlockRAM contents always seems a little tough with the
synthesis and simulation tools never quite making it practical to get
everything flowing just right. If you look into the help or app notes from
the various tools, you could have pre-initialized BlockRAMs for fixed
coefficients to make life simpler.

For your application, this really *is* best implemented in a DSP mindset;
you can keep your resources low (1 MAC) and maintain the values in a
register file with limited I/O in your algorithm. Since you have 100x+ the
sample rate to do your processing, the system flows beautifully. The only
question for me would be how complex the state machine or microcode would
need to be to have the system work beautifully without adding a generic
processor like the MicroBlaze or similar. This is where prototyping with
pseudo-code and an Excel spreadsheet get me to my results with a simple
implementation.

John_H wrote:
> <snip>
>
> For me, these kinds of tasks are great fun.

> This is where prototyping with
> pseudo-code and an Excel spreadsheet get me to my results with a simple
> implementation.

pseudo-code ???
What exactly do you mean by pseudo-code?

"Roger Bourne" <[email protected]> wrote in message
> pseudo-code ???
> What exactly do you mean by pseudo-code?

Just writing down what steps you'd take to implement the code in your data
path. It's helpful to "see" the data pipeline by looking at the steps and
the loops to manipulate the data.
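
As an illustration of that kind of step-by-step write-up, here is a small runnable Python model of four cascaded 2nd-order sections sharing a single multiply-accumulate unit, with the filter state held in a flat list standing in for the BlockRAM. It is a sketch under assumptions not taken from the thread: direct-form II sections, floating-point arithmetic instead of the fixed-point widths discussed above, and a hypothetical function name `run_sample`.

```python
def run_sample(ram, coeffs, x):
    """Push one input sample through all sections using a single MAC.

    ram:    flat list of state words, two per section (w1, w2)
    coeffs: list of (b0, b1, b2, a1, a2) tuples, one per section
    Returns (output sample, number of multiply-accumulate steps used).
    """
    mac_steps = 0
    for s, (b0, b1, b2, a1, a2) in enumerate(coeffs):
        w1, w2 = ram[2 * s], ram[2 * s + 1]        # read state from "RAM"

        acc = x                                    # feedback half
        for c, w in ((-a1, w1), (-a2, w2)):
            acc += c * w                           # one MAC step
            mac_steps += 1
        w0 = acc

        acc = 0.0                                  # feedforward half
        for c, w in ((b0, w0), (b1, w1), (b2, w2)):
            acc += c * w                           # one MAC step
            mac_steps += 1

        ram[2 * s], ram[2 * s + 1] = w0, w1        # write state back
        x = acc                                    # feed the next section
    return x, mac_steps


# First section has a pole at 0.5; the other three pass the signal through.
coeffs = [(1.0, 0.0, 0.0, -0.5, 0.0)] + [(1.0, 0.0, 0.0, 0.0, 0.0)] * 3
ram = [0.0] * (2 * len(coeffs))
outputs = []
for sample in (1.0, 0.0, 0.0):
    y, steps = run_sample(ram, coeffs, sample)
    outputs.append(y)
print(outputs, steps)
```

Writing the loop out this way makes the schedule visible: each section costs two RAM reads, five MAC steps and two RAM writes, which is what a state machine or microcode sequencer would have to step through once per sample.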