Jeorg's question on sci.electronics.design for an under $2 DSP chip got
me to thinking:
How are 1-cycle multipliers implemented in silicon? My understanding is
that when you go buy a DSP chip a good part of the real estate is taken
up by the multiplier, and this is a good part of the reason that DSPs
cost so much. I can't see it being a big gawdaful batch of
combinatorial logic that has the multiply rippling through 16 32-bit
adders, so I assume there's a big table look up involved, but that's as
far as my knowledge extends.
Yet the reason that you go shell out all the $$ for a DSP chip is to get
a 1-cycle MAC that you have to bury in a few (or several) tens of cycles
worth of housekeeping code to set up the pointers, counters, modes &c --
so you never get to multiply numbers in one cycle, really.
How much less silicon would you use if an n-bit multiplier were
implemented as an n-stage pipelined device? If I wanted to implement a
128-tap FIR filter and could live with 160 ticks instead of 140 would
the chip be much smaller?
Or is the space consumed by the separate data spaces and buses needed to
move all the data to and from the MAC? If you pipelined the multiplier
_and_ made it a two- or three- cycle MAC (to allow time to shove data
around) could you reduce the chip cost much? Would the amount of area
savings you get allow you to push the clock up enough to still do audio
applications for less money?
Obviously any answers will be useless unless somebody wants to run out
and start a chip company, but I'm still curious about it.
Tim Wescott wrote:
> Jeorg's question on sci.electronics.design for an under $2 DSP chip got
> me to thinking:
>
> How are 1-cycle multipliers implemented in silicon? My understanding is
> that when you go buy a DSP chip a good part of the real estate is taken
> up by the multiplier, and this is a good part of the reason that DSPs
> cost so much. I can't see it being a big gawdaful batch of
> combinatorial logic that has the multiply rippling through 16 32-bit
> adders, so I assume there's a big table look up involved, but that's as
> far as my knowledge extends.
>
There's no lookup table. It's just a BIG cascade of AND gates. This might
help:
http://www2.ele.ufes.br/~ailson/digi...er05.doc5.html
> Yet the reason that you go shell out all the $$ for a DSP chip is to get
> a 1-cycle MAC that you have to bury in a few (or several) tens of cycles
> worth of housekeeping code to set up the pointers, counters, modes &c --
> so you never get to multiply numbers in one cycle, really.
>
> How much less silicon would you use if an n-bit multiplier were
> implemented as an n-stage pipelined device? If I wanted to implement a
> 128-tap FIR filter and could live with 160 ticks instead of 140 would
> the chip be much smaller?
>
I think this would lead to lousy performance on small loops - such as
those found in JPEG encoding.
> Or is the space consumed by the separate data spaces and buses needed to
> move all the data to and from the MAC? If you pipelined the multiplier
> _and_ made it a two- or three- cycle MAC (to allow time to shove data
> around) could you reduce the chip cost much? Would the amount of area
> savings you get allow you to push the clock up enough to still do audio
> applications for less money?
Quite a lot of the chip cost depends on the design complexity and the
amount of time and money spent on R&D, not to mention the quantity of
chips the company hopes to sell, so cost is not directly proportional
to the size of the chip. If you're trying to save money,
you could try using a fast general-purpose microcontroller instead of a
DSP.
>
> Obviously any answers will be useless unless somebody wants to run out
> and start a chip company, but I'm still curious about it.
>
> --
>
> Tim Wescott
> Wescott Design Services
> http://www.wescottdesign.com
Your question got me thinking, trying to recall the discussions I had
in the microprocessor architecture classes. So here is some food for
thought:
I seem to recall that back then ('99 to '01) multipliers were
assumed to take multiple cycles; for the purposes of the class we
usually assumed three or four cycles. Sometimes the premise was that
there were multiple multipliers and other ALU units that could be used
simultaneously. If an instruction was set to execute and there
weren't resources available, this resulted in a pipeline stall, but
otherwise the apparent throughput was single cycle. I even believe we had
test problems on determining how many multipliers a processor
required versus other resource items (each with a $ value attached),
given a certain mix of instructions, to find the optimal
resource mix.
In the latter portions of the class, we got away from the CPU
architecture and spent a lot of time dealing with the concept of
maintaining single cycle execution through the use of compiler
scheduling. A lot of emphasis was placed on scheduling algorithms that
scanned for data and resource dependencies, and on how code gets
executed out of sequence to maximize resource utilization.
Another concept that was raised is the idea of sub-cycle clocking, or
micro-operations, where multiple processor cycles occur within a single
"instruction cycle" while still maintaining the apparent single-cycle
execution.
I would imagine that modern DSPs rely on techniques like these, or some
totally new ones, to maximize the throughput.
> Tim Wescott wrote:
>
>>Jeorg's question on sci.electronics.design for an under $2 DSP chip got
>>me to thinking:
>>
>>How are 1-cycle multipliers implemented in silicon? My understanding is
>>that when you go buy a DSP chip a good part of the real estate is taken
>>up by the multiplier, and this is a good part of the reason that DSPs
>>cost so much. I can't see it being a big gawdaful batch of
>>combinatorial logic that has the multiply rippling through 16 32-bit
>>adders, so I assume there's a big table look up involved, but that's as
>>far as my knowledge extends.
>>
>
>
> There's no lookup table. Its just a BIG cascade of and's. This might
> help:
>
> http://www2.ele.ufes.br/~ailson/digi...er05.doc5.html
>
Interesting. So that's what they actually do in practice, just copy a
page out of a textbook? Wouldn't the stages of adders really cause a
speed hit? To have your signal ripple through so many stages would
require you to slow your clock way down from what it could be otherwise
-- it seems an odd way to build a chip whose purpose in life is to be
really fast while doing a MAC.
>
>>Yet the reason that you go shell out all the $$ for a DSP chip is to get
>>a 1-cycle MAC that you have to bury in a few (or several) tens of cycles
>>worth of housekeeping code to set up the pointers, counters, modes &c --
>>so you never get to multiply numbers in one cycle, really.
>>
>>How much less silicon would you use if an n-bit multiplier were
>>implemented as an n-stage pipelined device? If I wanted to implement a
>>128-tap FIR filter and could live with 160 ticks instead of 140 would
>>the chip be much smaller?
>>
>
> I think this would lead to lousy performance on small loops - such as
> those found in JPEG encoding.
>
Good point. Yes it would, unless you used some fancy pipelining to keep
the throughput up (which would probably require a fancy optimizer to let
humans write fast code).
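To put rough numbers on the small-loop worry, here is a minimal sketch of the cycle count for a fully pipelined multiplier. It assumes one new multiply can enter the pipeline every cycle and that nothing stalls, so the only penalty is the fill latency; the function name and the 16-stage depth are made up for illustration and do not describe any particular chip.

```python
def mac_loop_cycles(iterations: int, pipeline_depth: int) -> int:
    """Cycles to push `iterations` independent multiplies through a pipeline.

    Assumes one new multiply can enter every cycle and nothing stalls, so the
    only penalty is the fill latency of (pipeline_depth - 1) cycles.
    A pipeline_depth of 1 models a single-cycle multiplier.
    """
    return iterations + pipeline_depth - 1


# Compare a short loop (8 iterations) with a 128-tap FIR.
for n in (8, 128):
    single = mac_loop_cycles(n, 1)
    piped = mac_loop_cycles(n, 16)  # hypothetical 16-stage pipeline
    print(f"{n:3d} iterations: {single} cycles single-cycle, "
          f"{piped} cycles pipelined")
```

Under those assumptions the 128-tap filter goes from 128 to 143 cycles, which is easy to live with, while the 8-iteration loop goes from 8 to 23, which is the lousy small-loop performance mentioned above.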
>
>>Or is the space consumed by the separate data spaces and buses needed to
>>move all the data to and from the MAC? If you pipelined the multiplier
>>_and_ made it a two- or three- cycle MAC (to allow time to shove data
>>around) could you reduce the chip cost much? Would the amount of area
>>savings you get allow you to push the clock up enough to still do audio
>>applications for less money?
>
>
> Quite a lot of the chip cost depends on the design complexity and the
> amount of time and money spent in R&D, not to mention the quantity of
> chips the company hopes to sell, so its not a direct proportional
> relation between cost and size of chip. If you're trying to save money,
> you could try using a fast general purpose microcontroller instead of a
> DSP.
>
Yet DSP chips cost tons of money, which disappoints Jeorg who designs
for high-volume customers who are _very_ price sensitive. The question
was more a hypothetical "what would Atmel do if Atmel wanted to compete
with the dsPIC" than "should I have a custom chip designed for my
10-a-year production cycle".
>
>>Obviously any answers will be useless unless somebody wants to run out
>>and start a chip company, but I'm still curious about it.
>>
>
On Wed, 19 Oct 2005 10:42:41 -0700, Tim Wescott wrote:
> Yet DSP chips cost tons of money, which disappoints Jeorg who designs
> for high-volume customers who are _very_ price sensitive.
Actually, I believe that prices have little to do with cost, particularly
in high volume, low material cost items like ICs. This is true until the
item has become a commodity, where anybody can make it. At that point,
market factors start to bring the prices down. Until that point, pricing
is more closely related to the cost of what the item replaces.
In the case of DSP chips, the replacement is a traditional microprocessor,
with its fast external memory, PC design and debug time, etc.
So, cutting down on the silicon area won't help prices; it'll just
increase the profits of the chipmakers. What helps prices is stiff, fair
competition, and lots of it. So, chip makers try to differentiate their
designs, making it hard to 'jump ship' and head off in new directions,
thus keeping a particular group of users a 'captive audience'. Once
standardization sets in, they are doomed to compete.
---
Regards,
Bob Monsen
Let us grant that the pursuit of mathematics is a divine madness of the
human spirit.
- Alfred North Whitehead
Tim Wescott wrote:
> Jeorg's question on sci.electronics.design for an under $2 DSP chip got
> me to thinking:
>
> How are 1-cycle multipliers implemented in silicon? My understanding is
> that when you go buy a DSP chip a good part of the real estate is taken
> up by the multiplier, and this is a good part of the reason that DSPs
> cost so much. I can't see it being a big gawdaful batch of
> combinatorial logic that has the multiply rippling through 16 32-bit
> adders, so I assume there's a big table look up involved, but that's as
> far as my knowledge extends.
>
Single-cycle multipliers in small microcontrollers are frequently 8x8,
which is obviously much easier. The chip mentioned, the MSP430, does
16x16, but it is not actually single-cycle (as far as I remember). The
other big difference compared to expensive DSPs is the speed: it is a
lot easier to do 16x16 in a single cycle at 8 MHz (the top speed of the
current MSP430s) than at a few hundred MHz (for expensive DSPs).
Tim Wescott wrote:
> Pramod Subramanyan wrote:
>
>> Tim Wescott wrote:
>>
>>> Jeorg's question on sci.electronics.design for an under $2 DSP chip got
>>> me to thinking:
>>>
>>> How are 1-cycle multipliers implemented in silicon? My understanding is
>>> that when you go buy a DSP chip a good part of the real estate is taken
>>> up by the multiplier, and this is a good part of the reason that DSPs
>>> cost so much. I can't see it being a big gawdaful batch of
>>> combinatorial logic that has the multiply rippling through 16 32-bit
>>> adders, so I assume there's a big table look up involved, but that's as
>>> far as my knowledge extends.
>>>
>>
>>
>> There's no lookup table. Its just a BIG cascade of and's. This might
>> help:
>>
>> http://www2.ele.ufes.br/~ailson/digi...er05.doc5.html
>>
> Interesting. So that's what they actually do in practice, just copy a
> page out of a textbook? Wouldn't the stages of adders really cause a
> speed hit? To have your signal ripple through so many stages would
> require you to slow your clock way down from what it could be otherwise
> -- it seems an odd way to build a chip who's purpose in life is to be
> really fast while doing a MAC.
It's much, much harder than just copying a page out of a textbook.
There are small optimizations that depend strongly on data distributions,
and so on. Even before the designer can begin laying out the multiplier,
which is pretty much the hardest part, they have to work out whether it
has the characteristics required.
As an example, I recently designed a 4-bit * 4-bit multiplier as a class
project. It's much harder to do than many people realise, and its
complexity grows exponentially (in most cases) with the input bit width.
Sometimes it may be as simple as laying down a standard multiplier block
(from one of the many IP libraries around); however, in most DSPs this will be
the critical timing path for single-cycle operation and so must be hand
modified to produce acceptable path delays, then assessed under all
conditions.
Certainly not a lookup table, that would indeed be simply copying from a
book, and would also require (2^(2*N))*N/4 bytes of storage. For
anything but small N this would be enormous, and not very efficient in
terms of chip real estate.
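To see how quickly that storage estimate blows up, here is a small sketch that evaluates the formula above for a few operand widths. The function name is made up for illustration; the arithmetic is just the 2^(2N) entries times 2N bits per product from the post.

```python
def lut_multiplier_bytes(n_bits: int) -> int:
    """Bytes needed for a full N x N multiply lookup table.

    There are 2^(2N) entries (one per operand pair), and each entry holds a
    2N-bit product, i.e. 2N/8 = N/4 bytes.
    """
    entries = 2 ** (2 * n_bits)
    bits_per_entry = 2 * n_bits
    return entries * bits_per_entry // 8


for n in (4, 8, 16):
    print(f"{n:2d} x {n:2d}: {lut_multiplier_bytes(n):,} bytes")
```

A 4x4 table is 256 bytes and an 8x8 table is 128 KiB, but a 16x16 table already needs 16 GiB.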
As an aside, the other members of my class implemented their multipliers
in a pipeline configuration, whilst I did mine in a completely
parallel configuration (with ripple adder as high speed wasn't a design
consideration). This means that others had 2/3/4 cycle latencies whilst
mine was a single cycle. The trade-off is that the upper frequency of
mine was more limited than theirs due to the increased path delays.
Getting single-cycle high-speed multipliers is a very challenging
prospect, and one on which much research is still ongoing.
You should have a go at making up a simple 3bit*3bit multiplier using
single transistors on a PCB sometime.. it's quite similar to the layout
flow used in IC design.
Newer FPGAs have lots of fast 18 x 18 multipliers.
The humble XC4VSX25 has, among other goodies, 128 such multipliers
running at max 500MHz single-cycle rate.
The mid-range SX35 has 192, and the top SX55 has 512 such fast 18 x 18
multipliers each with its associated 48-bit accumulator structure. We
invite you to keep that kind of arithmetic performance busy... No
wonder these FPGAs can outperform sophisticated and expensive DSP
chips.
> Pramod Subramanyan wrote:
>
snip
> >
> > http://www2.ele.ufes.br/~ailson/digi...er05.doc5.html
> >
> Interesting. So that's what they actually do in practice, just copy a
> page out of a textbook? Wouldn't the stages of adders really cause a
> speed hit? To have your signal ripple through so many stages would
> require you to slow your clock way down from what it could be otherwise
AFAIR the delay for the straightforward N*N-bit parallel multiplier is
only around double the delay of an N-bit adder, i.e. the longest path in
the multiplier is LSB to MSB plus top to bottom.
> -- it seems an odd way to build a chip who's purpose in life is to be
> really fast while doing a MAC.
I think it's more likely that they look at different options and find
the smallest that is fast enough.
> Yet DSP chips cost tons of money, which disappoints Jeorg who designs
> for high-volume customers who are _very_ price sensitive. The question
> was more a hypothetical "what would Atmel do if Atmel wanted to compete
> with the dsPIC" than "should I have a custom chip designed for my
> 10-a-year production cycle".
I'm not sure the size of the multiplier makes a big difference. My
guess is that if you look at the die you would see that most of it is memory.
What price are you looking for? How much memory? How fast?
On Wed, 19 Oct 2005 09:25:12 -0700, Tim Wescott <[email protected]>
wrote:
>Jeorg's question on sci.electronics.design for an under $2 DSP chip got
>me to thinking:
>
>How are 1-cycle multipliers implemented in silicon? My understanding is
>that when you go buy a DSP chip a good part of the real estate is taken
>up by the multiplier, and this is a good part of the reason that DSPs
>cost so much. I can't see it being a big gawdaful batch of
>combinatorial logic that has the multiply rippling through 16 32-bit
>adders, so I assume there's a big table look up involved, but that's as
>far as my knowledge extends.
>
>Yet the reason that you go shell out all the $$ for a DSP chip is to get
>a 1-cycle MAC that you have to bury in a few (or several) tens of cycles
>worth of housekeeping code to set up the pointers, counters, modes &c --
>so you never get to multiply numbers in one cycle, really.
>
>How much less silicon would you use if an n-bit multiplier were
>implemented as an n-stage pipelined device? If I wanted to implement a
>128-tap FIR filter and could live with 160 ticks instead of 140 would
>the chip be much smaller?
>
>Or is the space consumed by the separate data spaces and buses needed to
>move all the data to and from the MAC? If you pipelined the multiplier
>_and_ made it a two- or three- cycle MAC (to allow time to shove data
>around) could you reduce the chip cost much? Would the amount of area
>savings you get allow you to push the clock up enough to still do audio
>applications for less money?
>
>Obviously any answers will be useless unless somebody wants to run out
>and start a chip company, but I'm still curious about it.
A while back when I was doing such things, Wallace trees and Booth
multipliers were all the rage. Doing a search on those turned up Ray
Andraka's page (no big surprise), which has a really good discussion
of the alternatives.
Since then things have gotten even smaller and faster and, as someone
else pointed out, the FPGA companies now find it prudent to splatter
large numbers of very fast single-cycle multipliers around their parts
just because they can (and because they know people will use them).
I've no clue what they're doing there, but efficient single-cycle
multipliers have been around for a long time in various flavors. I'm
sure they're not all the same.
Eric Jacobsen
Minister of Algorithms, Intel Corp.
My opinions may not be Intel's opinions. http://www.ericjacobsen.org
"Tim Wescott" <[email protected]> wrote in message
news:[email protected]..
> Jeorg's question on sci.electronics.design for an under $2 DSP chip got me to
> thinking:
>
> How are 1-cycle multipliers implemented in silicon? My understanding is that
> when you go buy a DSP chip a good part of the real estate is taken up by the
> multiplier, and this is a good part of the reason that DSPs cost so much. I
> can't see it being a big gawdaful batch of combinatorial logic that has the
> multiply rippling through 16 32-bit adders, so I assume there's a big table
> look up involved, but that's as far as my knowledge extends.
In addition to the single-cycle MAC, there is also all the structure needed to
keep that MAC busy, i.e. dual busses with single-cycle access to memory.
> Yet the reason that you go shell out all the $$ for a DSP chip is to get a
> 1-cycle MAC that you have to bury in a few (or several) tens of cycles worth
> of housekeeping code to set up the pointers, counters, modes &c --
> so you never get to multiply numbers in one cycle, really.
True, a fast MAC is most useful when you are doing a bunch of them in a row,
like for example in a FIR filter. But a fast multiply is pretty darn useful
too, and often can be taken advantage of 1 or 2 at a time.
BTW, on a SHARC, the set-up is basically 3 cycles: set-up 2 pointers (typically
one for data, one for coefs) and initialize your loop counter. You may need to
pre-fetch the data too, so that could add another cycle, or could be built into
the main loop. Not too bad, though. It doesn't take too long of a filter,
matrix, vector op, etc. to start paying dividends.
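The shape of that setup-then-loop pattern can be sketched in plain Python. This is not SHARC code and makes no claim about actual cycle counts; `fir_output`, `delay_line` and `newest` are hypothetical names, and the circular addressing is done by hand where a DSP would do it in its address generators.

```python
def fir_output(coefs, delay_line, newest):
    """Compute one FIR output sample from a circular delay line.

    coefs      -- filter coefficients, coefs[0] applies to the newest sample
    delay_line -- circular buffer of past input samples
    newest     -- index of the most recent sample in delay_line
    """
    # The "3 cycles" of setup: two pointers and a loop counter.
    data_ptr = newest
    coef_ptr = 0
    count = len(coefs)

    acc = 0
    while count:
        acc += coefs[coef_ptr] * delay_line[data_ptr]   # the MAC itself
        coef_ptr += 1
        data_ptr = (data_ptr - 1) % len(delay_line)     # step back, wrapping
        count -= 1
    return acc


# Newest sample is 30, then 20, then 10: 1*30 + 2*20 + 3*10
print(fir_output([1, 2, 3], [10, 20, 30], newest=2))
```

The point of the sketch is only that the setup cost is fixed while the MAC line runs once per tap, so the longer the filter, the less the setup matters.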
An idiosyncratic feature of the SHARC is that for fixed point, there is a true
single-cycle MAC, whereas for floating point, you have a parallel multiply/add
instruction, and you can't use the result of the multiply in the add. At first
glance it seems quite restrictive, but in practice it just means one more stage
of pipelining before you can rip off those single-cycle FIRs. I'm guessing a
single-cycle 40-bit floating-point MAC would have been the slowest instruction
on the chip by a mile, and would have forced a much slower max clock rate.
> How much less silicon would you use if an n-bit multiplier were implemented as
> an n-stage pipelined device? If I wanted to implement a 128-tap FIR filter
> and could live with 160 ticks instead of 140 would the chip be much smaller?
I don't think it would save that much total real estate. I'm basing this on the
fact that Analog Devices started including two complete multipliers and ALUs in
all their new SHARCs. Or maybe with die shrinks, silicon area is not such a big
deal? When you can take advantage of the second unit, you can get FIRs in 1/2
cycle per tap, or parallel operation on a second data stream "for free"--nice!
> Or is the space consumed by the separate data spaces and buses needed to move
> all the data to and from the MAC? If you pipelined the multiplier _and_ made
> it a two- or three- cycle MAC (to allow time to shove data around) could you
> reduce the chip cost much? Would the amount of area savings you get allow you
> to push the clock up enough to still do audio applications for less money?
My guess is that going to a 2-stage multiplier would not get you
anywhere near twice the clock frequency. On most DSPs, it's not just the MAC
but every other instruction that is single cycle, so unless you can get them all
to run 2x, you can't double the clock rate. The MAC is probably the slowest
path, but I would guess there are a lot of other "close seconds". I've often
wanted to see data from a DSP manufacturer on the speed of different
instructions, just out of curiosity. It would also allow you to determine if
you could reliably overclock a chip based on what instructions your application
used.
> Tim Wescott skrev:
>
>
>>Pramod Subramanyan wrote:
>>
>
> snip
>
>>>http://www2.ele.ufes.br/~ailson/digi...er05.doc5.html
>>>
>>
>>Interesting. So that's what they actually do in practice, just copy a
>>page out of a textbook? Wouldn't the stages of adders really cause a
>>speed hit? To have your signal ripple through so many stages would
>>require you to slow your clock way down from what it could be otherwise
>
>
> afair the delay for the straight forward N*N bit parallel multiplier is
>
> only around double the delay of a N bit adder, i.e. the longest path in
> the multiplier is lsb to msb plus top to bottom
>
>
>>-- it seems an odd way to build a chip who's purpose in life is to be
>>really fast while doing a MAC.
>
>
> I think its more likely that they look at different options and find
> the
> smallest that is fast enough
>
> have a look at http://www.andraka.com/multipli.htm
>
>
> snip
>
>
>>Yet DSP chips cost tons of money, which disappoints Jeorg who designs
>>for high-volume customers who are _very_ price sensitive. The question
>>was more a hypothetical "what would Atmel do if Atmel wanted to compete
>>with the dsPIC" than "should I have a custom chip designed for my
>>10-a-year production cycle".
>
>
> I'm not sure the size of the multiplier makes a big difference, my
> guess
> is that if you look at the die you would see that most of it is memory
>
>
> what price are you looking for?, how much memory?, how fast?
>
> Not that I will build you one, but I'm curious
>
> -Lasse
>
The original question was for an under-$2 DSP chip capable of doing
audio frequency stuff, including FFTs. I'm not the fellow who asked; it
just sparked a tangential thought in my head about why there isn't some
intermediate step on the way to a full-speed DSP.
So I couldn't tell you exactly how much speed and memory, given I don't
know how often he was considering doing the FFTs, or how many points, etc.
Hey Jeorg! You listening? What did you need to do, exactly?
Peter Alfke wrote:
> Newer FPGAs have lots of fast 18 x 18 multipliers.
> The humble XC4VSX25 has, among other goodies, 128 such multipliers
> running at max 500MHz single-cycle rate.
> The mid-range SX35 has 192, and the top SX55 has 512 such fast 18 x 18
> multipliers each with its associated 48-bit accumulator structure. We
> invite you to keep that kind of arithmetic performance busy... No
> wonder these FPGAs can outperform sophisticated and expensive DSP
> chips.
Well, the original question was about $2 DSP's, presumably for
low cost products. So are there any FPGA's currently available
for $2 or less which include any built-in single-cycle 18 x 18
multipliers, plus enough logic cells to replace the rest of the
functionality of a single low cost DSP?
snip
> The original question was for an under-$2 DSP chip capable of doing
> audio frequency stuff, including FFTs. I'm not the fellow who asked; it
> just sparked a tangential thought in my head about why there isn't some
> intermediate step on the way to a full-speed DSP.
>
I never have to buy stuff so I don't know anything about prices, but
Philips recently announced a couple of 70 MHz ARM7TDMIs in the $2
range. They're not DSPs, but at 70 MHz and one cycle per 8 bits of
32*32->64-bit multiply they'll do some DSP.
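As a rough illustration of the "one cycle per 8 bits" remark, here is a sketch of how the number of multiplier-array passes could depend on the operand value. The early-termination rule used here (stop once the remaining upper bits are all zeros or all ones) is my recollection of how the ARM7TDMI handles signed multiplies and should be treated as an assumption; the function name is made up, and the fixed overhead cycles of each multiply instruction are deliberately left out.

```python
def multiplier_array_cycles(rs: int) -> int:
    """Estimated 8-bit multiplier-array passes for a signed 32-bit operand.

    Assumption: the multiplier consumes the operand 8 bits at a time and
    stops early once the remaining upper bits are all zeros or all ones.
    """
    rs &= 0xFFFFFFFF
    for passes, shift in ((1, 8), (2, 16), (3, 24)):
        upper = rs >> shift
        all_ones = (1 << (32 - shift)) - 1
        if upper == 0 or upper == all_ones:
            return passes
    return 4


for value in (100, 0x1234, 0x123456, 0x12345678, -128):
    print(f"{value:#x}: {multiplier_array_cycles(value)} pass(es)")
```

If that rule holds, multiplying by small coefficients is cheap and only full 32-bit operands pay for all four passes, which matters when estimating what such a part can do for audio work.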
[email protected] wrote:
> Tim Wescott skrev:
>
> snip
>
>>The original question was for an under-$2 DSP chip capable of doing
>>audio frequency stuff, including FFTs. I'm not the fellow who asked; it
>>just sparked a tangential thought in my head about why there isn't some
>>intermediate step on the way to a full-speed DSP.
>>
>
>
> I never have to buy stuff so I don't know anything about prices, but
> philips recently announced a couple of 70MHz ARM7TDMIs in the 2$
> range, it's not DSPs but at 70MHz and one cycle per 8bits of
> 32*32->64bit multiply it'll do some dsp
TI have just volume-released their 100 MHz FLASH controllers, starting at
sub $5, so not quite a $2 target, but these have FLASH (not ROM) and
include 12-bit 6 Msps ADCs, a 150 ps resolution PWM, and CAN bus.
>> Jeorg's question on sci.electronics.design for an under $2 DSP chip got
>> me to thinking:
>>
>> How are 1-cycle multipliers implemented in silicon? My understanding is
>> that when you go buy a DSP chip a good part of the real estate is taken
>> up by the multiplier, and this is a good part of the reason that DSPs
>> cost so much. I can't see it being a big gawdaful batch of
>> combinatorial logic that has the multiply rippling through 16 32-bit
>> adders, so I assume there's a big table look up involved, but that's as
>> far as my knowledge extends.
>>
>> Yet the reason that you go shell out all the $$ for a DSP chip is to get
>> a 1-cycle MAC that you have to bury in a few (or several) tens of cycles
>> worth of housekeeping code to set up the pointers, counters, modes &c --
>> so you never get to multiply numbers in one cycle, really.
>>
>> How much less silicon would you use if an n-bit multiplier were
>> implemented as an n-stage pipelined device? If I wanted to implement a
>> 128-tap FIR filter and could live with 160 ticks instead of 140 would
>> the chip be much smaller?
If you leave out FFT-based multipliers, which are only beneficial for
really large multiplicands (a few hundred bits), you need to perform
N^2 1-bit additions for a multiplication.
These N^2 1-bit adders can be arranged in different ways, called for
example array multipliers or Wallace tree multipliers, to change the
critical path and the layout. But we proved that they all can be
transformed into each other just by swapping wires around. http://eis.eit.uni-kl.de/eis/researc...ers/iccd04.pdf
The only thing you can do about area is to trade off area for delay. You
can, for example, reuse the same N adders over N clock cycles.
Pipelining does not save area, but it can increase performance a little.
You can cut the critical path in half by adding N pipeline stages.
(Beware that even without pipelining, although you perform N additions of N bits,
the length of the critical path is only O(N) for all multiplier
architectures achieved by reordering.)
The multiplier is a large part of the CPU core, probably the largest in
a DSP, but the area of most processor chips including DSPs is dominated
by caches and other memory like reordering buffers, shadow registers,
TLBs, etc.
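The area-for-delay trade mentioned above (reusing the same N adders over N clock cycles) is the classic shift-and-add multiplier. A minimal behavioural sketch, assuming unsigned operands and treating each loop iteration as one clock tick; the function name is made up for illustration:

```python
def shift_add_multiply(a: int, b: int, n_bits: int = 16) -> tuple[int, int]:
    """Multiply two unsigned n_bits-wide numbers with one adder reused per tick.

    Returns (product, cycles). Each loop iteration stands for one clock cycle
    in which the single shared adder handles one partial product.
    """
    mask = (1 << n_bits) - 1
    a &= mask
    b &= mask
    acc = 0
    cycles = 0
    for i in range(n_bits):
        if (b >> i) & 1:      # bit i of b gates this partial product
            acc += a << i     # the one shared adder does this tick's addition
        cycles += 1
    return acc, cycles


print(shift_add_multiply(1234, 5678))  # product and the 16 ticks it took
```

The hardware cost is one row of adders plus a register instead of N rows, and the price is N cycles per product instead of one.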
Bevan Weiss wrote:
> Getting single cycle high speed multipliers is a very challenging
> prospect, and one which much research is still ongoing.
Actually, if you cannot do full custom circuit optimizations
(e.g. because you do standard cell design or because you are using
LUTs in an FPGA) swapping wires is the only possible structural
optimization. All other multiplier transformations can be reduced to swaps.
An extremely nice property of swapping wires is, that it can be done
after placement. This is such a huge advantage that we were able to beat
sophisticated multiplier generators with a simple greedy algorithm when
applying it after placement: http://eis.eit.uni-kl.de/eis/researc...ers/iccd04.pdf
On Wed, 19 Oct 2005 09:25:12 -0700, Tim Wescott wrote:
> Jeorg's question on sci.electronics.design for an under $2 DSP chip got
> me to thinking:
>
> How are 1-cycle multipliers implemented in silicon?
I don't know how they do it these days, but I do know that with a
whole shitpot load of adders, you could do it in n propagation delays,
where n is the width of whichever operand you arrange to come in
sideways. I almost drew a schematic. You have a set of adders as
wide as operand "A", and its inputs are operand "A" and the "latest"
partial product - and its outputs go to another bank of adders whose
other inputs are either "A" again or 0, and so on - the other operand,
"B", would be presented down the side of the array, deciding which
partial products get added to and which don't. The LSB, of course,
gets sent out as "product", and the carry is the MSB of the next
partial product. They form a parallelogram.
I just fired up that Xilinx S/W to see what it's got in the way of
symbols, and it already has a 16-bit adder. With 16 of them, and 256
AND gates, I could build a 16 x 16 multiplier that would have an
answer in about 16 or 17 propagation delays. :-)
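The adder array described above can be checked in software before committing it to a schematic. This is a behavioural sketch only, assuming unsigned operands; each loop iteration stands for one row of the array (n_bits AND gates feeding one n_bits-wide adder), and the function name is made up for illustration.

```python
def array_multiply(a: int, b: int, n_bits: int = 16) -> int:
    """Unsigned multiply modelled as n_bits rows of AND gates plus adders."""
    mask = (1 << n_bits) - 1
    a &= mask
    b &= mask
    partial = 0   # the "latest" partial product passed down to the next row
    product = 0
    for row in range(n_bits):
        # One row of AND gates: operand A gated by bit `row` of operand B.
        addend = a if (b >> row) & 1 else 0
        # One n_bits-wide adder; the result can be n_bits + 1 wide (carry out).
        total = partial + addend
        # The LSB falls out of the array as a finished product bit...
        product |= (total & 1) << row
        # ...and the carry becomes the MSB of the next partial product.
        partial = total >> 1
    # Whatever is left after the last row is the upper half of the product.
    return product | (partial << n_bits)


print(array_multiply(1234, 5678) == 1234 * 5678)
```

With n_bits = 16 this corresponds to the 16 adders and 256 AND gates mentioned above; the model says nothing about propagation delay, only that the wiring gives the right answer.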
<Rich fires up Xilinx ISE...>
OK, it's gonna be a day or so. Please be gentle, it's my first time. :-)
Kolja Sulimma wrote:
> Bevan Weiss wrote:
>> Getting single cycle high speed multipliers is a very challenging
>> prospect, and one which much research is still ongoing.
> Actually, if you cannot do full custom circuit optimizations
> (e.g. because you do standard cell design or because you are using
> LUTs in an FPGA) swapping wires is the only possible structural
> optimization. All other multiplier transformations can be reduced to swaps.
>
> An extremely nice property of swapping wires is, that it can be done
> after placement. This is such a huge advantage that we were able to beat
> sophisticated multiplier generators with a simple greedy algorithm when
> applying it after placement:
> http://eis.eit.uni-kl.de/eis/researc...ers/iccd04.pdf
>
I was referring to custom design, not the use of standard cells or
FPGAs. It is certainly obvious that if you can't design your cells from
scratch then you're just arranging the cells that you have available.
I'm not sure if it can be reduced to swapping wires however, though
certainly in FPGAs where the entire logic design is already laid out and
the only configuration possible is via routing changes then this is the
case.
>I don't know how they do it this days, but I do know that with a
>whole shitpot load of adders, you could do it in n propagation delays,
>where n is the width of whichever operand you arrange to come in
>sideways. I almost drew a schematic. You have a set of adders as
>wide as operand "A", and its inputs are operand "A" and the "latest"
>partial product - and its outputs go to another bank of adders whose
>other inputs are either "A" again or 0, and so on - the other operand,
>"B", would be presented down the side of the array, deciding which
>partial products get added to and which don't. The LSB, of course,
>gets sent out as "product", and the carry is the MSB of the next
>partial product. They form a parallelogram.
Wouldn't it go faster (log N) if you used a tree rather than
a long skinny chain?
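A quick way to see the log N claim is to sum the partial products pairwise and count the adder levels. This sketch counts levels only, not gate delays: with ordinary ripple-carry adders each level still has its own carry chain, which is one reason real tree multipliers tend to use carry-save adders. The function names are made up for illustration.

```python
def partial_products(a: int, b: int, n_bits: int) -> list[int]:
    """One shifted copy of `a` (or zero) for every bit of `b`."""
    return [(a << i) if (b >> i) & 1 else 0 for i in range(n_bits)]


def tree_sum(terms: list[int]) -> tuple[int, int]:
    """Add terms pairwise in a balanced tree; return (sum, adder levels)."""
    levels = 0
    while len(terms) > 1:
        paired = [terms[i] + terms[i + 1] for i in range(0, len(terms) - 1, 2)]
        if len(terms) % 2:            # an odd term passes through unchanged
            paired.append(terms[-1])
        terms = paired
        levels += 1
    return terms[0], levels


total, levels = tree_sum(partial_products(1234, 5678, 16))
print(total == 1234 * 5678, levels)   # 16 terms need 4 levels; a chain needs 15
```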
"Jim Granville" <[email protected]> wrote in message
news:43583dc8$[email protected]..
> [email protected] wrote:
>> Tim Wescott skrev:
>>
>> snip
>>
>>>The original question was for an under-$2 DSP chip capable of doing
>>>audio frequency stuff, including FFTs. I'm not the fellow who asked; it
>>>just sparked a tangential thought in my head about why there isn't some
>>>intermediate step on the way to a full-speed DSP.
>>>
>>
>>
>> I never have to buy stuff so I don't know anything about prices, but
>> Philips recently announced a couple of 70MHz ARM7TDMIs in the $2
>> range. They're not DSPs, but at 70MHz and one cycle per 8 bits of
>> 32*32->64-bit multiply they'll do some DSP.
>
> TI have just volume-released their 100MHz FLASH controllers, start at sub $5,
> so not quite a $2 target, but these have FLASH(not ROM) and include 12 bit
> 6Msps ADCs, a 150ps resolution PWM, and CAN bus
>
> 150ps PWM is a challenge even for FPGA ....
>
> http://focus.ti.com/docs/pr/pressrel...prelId=sc05231
>
> The sub $2 Philips devices have quite low code sizes, but they could do some
> 'audio frequency stuff'...
Also, if you open up most any DSP-based Behringer product or many of the cheap
DSP-based stomp-boxes, you will find an obsolete 24-bit TI DSP that apparently
was never sold in the US. I don't know what they cost, but it must not be much
given that a lot of that gear retails for <$100. Unfortunately, you and I can't
obtain those parts, AFAIK.
The DSP is a TMS57002, which is not obsolete yet. It's a 24-bit
fixed-point DSP which is sold in the US and is also used by Line6, Zoom
and others. Behringer uses it in older designs, but recently I have seen
a lot of the Motorola 56364 and the very powerful SHARC processors in
their products. I have also heard that they designed their own DSP,
which is used in the current stomp boxes.
Seems reasonable to me. Use the DCM clock shifter to get a
fraction of a clock cycle. 10 ns/256 is about 39 ps.
I can't quite understand the fine print well enough to work out
a design on the fly. Maybe Peter will take it as a challenge.
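
The arithmetic behind that estimate, as a small Python sketch. It assumes
only what is said above: the shifter divides one clock period into 256
steps. The function name phase_step_ps is made up for illustration, and any
device-specific limits on the step size from the data sheet are not
modelled.

```python
def phase_step_ps(clock_hz, steps=256):
    """Size of one phase-shift step in picoseconds, assuming the clock
    period is divided evenly into `steps` increments."""
    period_ps = 1e12 / clock_hz
    return period_ps / steps


if __name__ == "__main__":
    # A 100 MHz clock has a 10 ns period.
    print(phase_step_ps(100e6), "ps per step")
```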
"Rich Grise" <[email protected]> wrote in message
news[email protected]..
> On Wed, 19 Oct 2005 09:25:12 -0700, Tim Wescott wrote:
>
>> Jeorg's question on sci.electronics.design for an under $2 DSP chip got
>> me to thinking:
>>
>> How are 1-cycle multipliers implemented in silicon?
>
> I don't know how they do it these days, but I do know that with a
> whole shitpot load of adders, you could do it in n propagation delays,
> where n is the width of whichever operand you arrange to come in
> sideways. I almost drew a schematic. You have a set of adders as
> wide as operand "A", and its inputs are operand "A" and the "latest"
> partial product - and its outputs go to another bank of adders whose
> other inputs are either "A" again or 0, and so on - the other operand,
> "B", would be presented down the side of the array, deciding which
> partial products get added to and which don't. The LSB, of course,
> gets sent out as "product", and the carry is the MSB of the next
> partial product. They form a parallelogram.
>
> I just fired up that Xilinx S/W to see what it's got in the way of
> symbols, and it already has a 16-bit adder. With 16 of them, and 256
> AND gates, I could build a 16 x 16 multiplier that would have an
> answer in about 16 or 17 propagation delays. :-)
> <Rich fires up Xilinx ISE...>
>
> OK, it's gonna be a day or so. Please be gentle, it's my first time. :-)
>
> Thanks!
> Rich
There is an app note.
There are a few FFT cores included with ISE, in LogiCORE.