Hi,
We are using a Virtex-4 FPGA to prototype a DSP processor to be
implemented in an ASIC. We are using the ISE flow and everything works
fine except that we can't prototype at full speed. We are only able to
run at about 65MHz, which is far from the 150MHz target. The longest
combinational path is in the MAC, which contains a 28x28 multiplier
followed by a 56-bit adder. I created the multiplier and the adder
using Core Generator.

Is there a way to speed this up? The Virtex-4 has those XtremeDSP
slices, but I can't find a way to make good use of them, since our
datapath is so large.

> We are using a Virtex-4 FPGA to prototype a DSP processor to be
> implemented in an ASIC. We are using the ISE flow and everything works
> fine except that we can't prototype at full speed. We are only able to
> run at about 65MHz, which is far from the 150MHz target. The longest
> combinational path is in the MAC, which contains a 28x28 multiplier
> followed by a 56-bit adder. I created the multiplier and the adder
> using Core Generator.

> Is there a way to speed this up? The Virtex-4 has those XtremeDSP
> slices, but I can't find a way to make good use of them, since our
> datapath is so large.

Virtex-4 has 18x18 multiplier hardware. Your 28x28 can be built
from those, but you need to pipeline it and also add a pipeline
stage before the adder. My guess is that this gets you to 150MHz, but
you will have to try it to find out.
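
To illustrate how a 28x28 product can be assembled from smaller
multiplies, here is a rough sketch in Python (a software model, not
HDL). It assumes unsigned operands and splits each one into a low
17-bit part and a high 11-bit part, so that every partial product would
fit an 18x18 signed multiplier. The function name and the 17/11 split
are illustrative assumptions, not what Core Generator actually emits;
signed operands need extra care with the high parts.

```python
MASK17 = (1 << 17) - 1  # low 17 bits, non-negative on an 18-bit signed port


def mul28_from_partials(a: int, b: int) -> int:
    """Multiply two unsigned 28-bit values using four small partial products."""
    assert 0 <= a < (1 << 28) and 0 <= b < (1 << 28)
    a_lo, a_hi = a & MASK17, a >> 17  # 17 bits and 11 bits
    b_lo, b_hi = b & MASK17, b >> 17

    # Each of these is at most a 17x17-bit multiply.
    p_ll = a_lo * b_lo
    p_lh = a_lo * b_hi
    p_hl = a_hi * b_lo
    p_hh = a_hi * b_hi

    # Shift each partial product to its weight and sum.
    # In hardware, these additions are where pipeline registers would go.
    return p_ll + ((p_lh + p_hl) << 17) + (p_hh << 34)
```

Each partial product and each add can sit in its own pipeline stage,
which is what shortens the longest combinational path.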

gretzteam wrote:
> Hi,
> We are using a Virtex-4 FPGA to prototype a DSP processor to be
> implemented in an ASIC. We are using the ISE flow and everything works
> fine except that we can't prototype at full speed. We are only able to
> run at about 65MHz, which is far from the 150MHz target. The longest
> combinational path is in the MAC, which contains a 28x28 multiplier
> followed by a 56-bit adder. I created the multiplier and the adder
> using Core Generator.
>
> Is there a way to speed this up? The Virtex-4 has those XtremeDSP
> slices, but I can't find a way to make good use of them, since our
> datapath is so large.
>
> Thank you,
> David
>

If you use the XtremeDSP slices properly, with all of their dedicated
interconnects, you should be able to do a 34x34 multiply using 4
pipelined slices at full rate (450-500MHz, depending upon part speed).
You might need an extra two slices to do the 56-bit accumulate. Look
for the "XtremeDSP Design Considerations" guide on the Xilinx site;
it describes how to do this. I'm not sure exactly what CoreGen is
producing, but it might not be completely optimized. It might be using
CLB fabric for some of the operations.
-Kevin

Right now I'm not using anything fancy. I created a 28x28 multiplier
and a 56-bit adder with CoreGen and wired them together. I used the
multiplier component, which is supposed to use the XtremeDSP slices.
Maybe it is not smart enough to make use of the other dedicated
interconnects. I will look at this "XtremeDSP Design Considerations" guide.
Thank you,
David

I can't really use pipelining here. The MAC is all combinational; I
receive inputs at time 0, and I need an answer by time x. I don't see
how pipelining would help.
Thanks,
Dave

gretzteam wrote:
> I can't really use pipelining here. The MAC is all combinational; I
> receive inputs at time 0, and I need an answer by time x. I don't see
> how pipelining would help.

What is x?

If x is one clock cycle then you need either faster logic or
a lot more of it. I believe this can be done easily with a
three-cycle pipeline, so that you get an answer out every cycle,
with each one taking three cycles.

Hi,
I guess I don't understand something about pipelining. In my case, the
whole system runs at the master clock, which I would like to be 100MHz or
more. Right now, the whole MAC unit is combinational logic and needs
to produce an answer for each clock cycle (time x=1/100MHz). Are you
guys saying that if I ran the MAC at 3 times the master clock
(300MHz) with a three-stage pipeline, I could compute the answer fast
enough?

Thanks,
David

glen herrmannsfeldt wrote:
> gretzteam wrote:
> > I can't really use pipelining here. The MAC is all combinational; I
> > receive inputs at time 0, and I need an answer by time x. I don't see
> > how pipelining would help.
>
> What is x?
>
> If x is one clock cycle then you need either faster logic or
> a lot more of it. I believe this can be done easily with a
> three-cycle pipeline, so that you get an answer out every cycle,
> with each one taking three cycles.
>
> -- glen

David wrote:
> Hi,
> I guess I don't understand something about pipelining. In my case, the
> whole system runs at the master clock, which I would like to be 100MHz or
> more. Right now, the whole MAC unit is combinational logic and needs
> to produce an answer for each clock cycle (time x=1/100MHz). Are you
> guys saying that if I ran the MAC at 3 times the master clock
> (300MHz) with a three-stage pipeline, I could compute the answer fast
> enough?

Howdy David,

Using different terms, let's try another analogy on this Saturday:
imagine an automobile assembly line. It puts out a certain number of
cars per hour. If you add another step in the assembly process, you
can still get the same number of cars per hour out - it just takes a
little longer for it to roll off the assembly line. Circuits work the
same way.

If your main requirement is to be able to handle a certain number of
calculations per second, you can possibly break the calculations up
into smaller parts which are easier to do in series: rather than doing
a multiply and an accumulate in the same cycle, do the multiply in one
cycle, and the addition in the next cycle. While the accumulation is
occurring during this 2nd cycle, the 2nd piece of data is being
multiplied. On the 3rd cycle, the 2nd piece of data is now in the
accumulator and a 3rd piece of data enters the multiplier. You get the
same number of calculations per second out of the circuit (or perhaps
even more, since you can meet timing now!), but each result takes 20 ns
rather than 10 ns to appear. If you can't stand the extra delay, then you
may need to up the clock rate (and then you will sure enough have to
pipeline!).
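
For what it's worth, the schedule described above can be modelled in a
few lines of Python. This is only a cycle-by-cycle sketch under assumed
names (`two_stage_mac`, `product_reg`); it is not HDL and says nothing
about real timing. One register sits between the multiplier and the
adder, and each loop iteration stands for one clock edge.

```python
def two_stage_mac(pairs):
    """Model a MAC with one register between multiplier and adder.

    Returns the accumulator value visible after each clock edge.
    """
    product_reg = None  # pipeline register: product from the previous cycle
    acc = 0
    trace = []
    # One extra cycle at the end flushes the last product through the adder.
    for pair in list(pairs) + [None]:
        # Stage 2: add the product that was registered on the previous edge.
        if product_reg is not None:
            acc += product_reg
        # Stage 1: multiply the new inputs in the same cycle.
        product_reg = None if pair is None else pair[0] * pair[1]
        trace.append(acc)
    return trace


# Three input pairs, one per clock.
print(two_stage_mac([(1, 2), (3, 4), (5, 6)]))  # [0, 2, 14, 44]
```

A new pair is accepted every cycle, and each running sum shows up one
cycle after its inputs went in: same throughput, one extra clock of
latency (10 ns more at a 100MHz clock).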

Hi,
I understand what you mean. However, I don't think it works in my case
because I have a loop (it is a MAC). In order to start the next
calculation, I need an answer to the previous one. I guess the only
solution is faster logic. I thought that a Virtex-4 would be able to
give us that kind of calculation speed...

> I understand what you mean. However, I don't think it works in my case
> because I have a loop (it is a MAC). In order to start the next
> calculation, I need an answer to the previous one. I guess the only
> solution is faster logic. I thought that a Virtex-4 would be able to
> give us that kind of calculation speed...

Unless the result from the accumulator goes back as an input to the
multiplier, it should pipeline just fine. Using the built-in
multipliers, it should take two or three stages. The answers
will come out, one per clock cycle, two or three clocks later.
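
As a sanity check on that claim, here is a small Python model (a
sketch, not HDL; the function names and the `mult_stages` parameter are
made up for illustration). It compares an all-combinational MAC with
one whose multiplier has a configurable number of pipeline stages. Only
the final add sits inside the accumulator feedback loop, so the running
sums are identical and merely arrive `mult_stages` clocks later.

```python
from collections import deque


def combinational_mac(pairs):
    """Reference: multiply and accumulate in a single cycle."""
    acc, out = 0, []
    for cycle, (a, b) in enumerate(pairs):
        acc += a * b
        out.append((cycle, acc))
    return out


def pipelined_mac(pairs, mult_stages):
    """MAC whose multiplier is split into `mult_stages` register stages."""
    pipe = deque([None] * mult_stages)  # registers in the multiplier path
    acc, out = 0, []
    # Extra idle cycles at the end drain the pipeline.
    for cycle, pair in enumerate(list(pairs) + [None] * mult_stages):
        done = pipe.pop()  # product leaving the last multiplier stage
        pipe.appendleft(None if pair is None else pair[0] * pair[1])
        if done is not None:
            acc += done  # the only operation inside the feedback loop
            out.append((cycle, acc))
    return out


data = [(3, 5), (2, 7), (4, 4), (1, 9)]
ref = combinational_mac(data)
for stages in (2, 3):
    got = pipelined_mac(data, stages)
    # Same sums, each one `stages` clocks later than the reference.
    assert got == [(cycle + stages, acc) for cycle, acc in ref]
```

If the accumulator output did feed the multiplier, every product would
depend on the previous sum and this model would no longer apply.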