FPGA Central - World's 1st FPGA / CPLD Portal

FPGA comp.arch.fpga newsgroup (usenet)
#1, 11-27-2007, 06:36 PM, Jürgen Böhm
CPU design uses too many slices


Hi,

currently I am designing (as an amateur project) a 32-bit stack-oriented
CPU with two stack pointers (Data Stack/Return Stack) and some
additional registers, partly purely auxiliary, partly dedicated to the
intended purpose of the CPU as a specialized Lisp processor.
The control is microcoded, and the greater part of the microcode is
already written and successfully tested (in simulation with Icarus).
Still missing at the moment are parts of the ALU functions and the
complete interrupt/exception logic.
Nevertheless the design (written in Verilog), when synthesized, already
occupies about 1100 slices in a Spartan 3 FPGA, which I feel is a bit
heavy for what seems to me a very simple design.

Below is the output of the Xilinx ISE WebPack synthesis tool:

Logic Utilization                                 Used   Available   Utilization
Number of Slice Flip Flops                         621       3,840           16%
Number of 4 input LUTs                           2,561       3,840           66%

Logic Distribution
Number of occupied Slices                        1,517       1,920           79%
Number of Slices containing only related logic   1,517       1,517          100%
Number of Slices containing unrelated logic          0       1,517            0%
Total Number of 4 input LUTs                     2,751       3,840           71%

(About 400 to 500 slices can be subtracted from the above figures, as
they result from accompanying structures such as the VGA driver.)

What catches my eye is how small the utilization of slice flip-flops is
compared to the utilization of slices. Can this be a sign that there is
much combinatorial logic (adders, multiplexers) and, relative to that,
few registers/state elements? Are the adders, which I used quite
generously to speed up the instructions, a particular source of slice
consumption? Or are multiplexers with many alternative inputs more
likely the culprits?

I would be very happy if someone with more experience than me (I am
just a hobbyist) could look at the Verilog source of the CPU and give
me some hints on how to lower the amount of resources needed by the
design.

Greetings,

Jürgen


--
Jürgen Böhm www.aviduratas.de
"At a time when so many scholars in the world are calculating, is it not
desirable that some, who can, dream ?" R. Thom
#2, 11-27-2007, 08:57 PM, Jon Elson
Re: CPU design uses too many slices



Jürgen Böhm wrote:
> What catches my eye is, how small the utilization of Slice Flip/Flops
> compared to the utilization of slices is: Can this be an expression of
> the fact, that there is much combinatorial logic (adders, multiplexors)
> and, relative to that, few registers/state elements?

Yes, precisely.
> Are especially
> adders, that I used quite generously to speed up the instructions, a
> source of slices consumption? Or are multiplexors with many alternative
> inputs more likely the culprits?
>

Yes, wide adders use a lot of LUTs. Multiplexers use up LUTs too. A
single 4-input LUT could form a single bit of a 2-input mux, wasting one
input. If you need more inputs, then you have to combine several LUTs
to perform one bit's worth of multiplexer.
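To put rough numbers on that (the figures below describe the usual
Spartan-3 mapping, not anything measured from this particular design):
one bit of a 4:1 mux has six inputs, four data and two select, so it
cannot fit in a single 4-input LUT and typically maps to two LUT4s
combined by the dedicated F5 mux.

```verilog
// One bit of a 4:1 multiplexer. Six inputs (4 data + 2 select) exceed a
// single 4-input LUT, so on Spartan-3 this maps to two LUT4s plus a MUXF5.
module mux4_1bit (
    input  wire [3:0] d,    // four data inputs
    input  wire [1:0] sel,  // two select lines
    output wire       y
);
    assign y = d[sel];
endmodule
```

For a 32-bit datapath that cost repeats once per bit, which is how
mux-heavy designs eat slices.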

Xilinx has pretty detailed info on the basic structure of their
chips, and you should be able to see how one would form basic logic
functions out of that. It may be that Virtex would give more resources
for this particular task than Spartan.

Jon

#3, 11-27-2007, 10:37 PM, Gabor
Re: CPU design uses too many slices

On Nov 27, 3:57 pm, Jon Elson <[email protected]> wrote:
> Jürgen Böhm wrote:
> > What catches my eye is, how small the utilization of Slice Flip/Flops
> > compared to the utilization of slices is: Can this be an expression of
> > the fact, that there is much combinatorial logic (adders, multiplexors)
> > and, relative to that, few registers/state elements? Are especially
> > adders, that I used quite generously to speed up the instructions, a
> > source of slices consumption? Or are multiplexors with many alternative
> > inputs more likely the culprits?
>
> Yes, precisely.
>
> Yes, wide adders use a lot of LUTs. Multiplexers use up LUTs too. A
> single 4-input LUT could form a single bit of a 2-input mux, wasting one
> input. If you need more inputs, then you have to combine several LUTs
> to perform one bit's worth of multiplexer.


Another point to make is that unless you change some defaults, the
mapper will not pack slices to capacity until the whole part becomes
mostly full. So the number of occupied slices does not necessarily
represent the most compact placement of your design. The statistics
for LUTs and flip-flops are more useful for determining your actual
logic usage.

However, given that your number of slices is not a whole lot more than
half the number of LUTs, I'd say that further packing of "unrelated
logic" won't make your design much smaller.

> Xilinx has pretty detailed info on what the basic structure of their
> chips are, and you should be able to see how one would form basic logic
> functions out of that. It may be that Virtex would give more resources
> for this particular task than Spartan.
>
> Jon


To benefit from changing families, you probably need to go to
Virtex 5, which has 6-input LUTs. Other Virtex families look
very similar to Spartan 3 from the viewpoint of the fabric.
#4, 11-27-2007, 10:50 PM, Jim Granville
Re: CPU design uses too many slices

Jürgen Böhm wrote:
> currently I am designing (as an amateur project) a 32bit Stack
> oriented CPU with two stack-pointers (Data Stack/Return Stack) and some
> additional registers [snip]
>
> I would be very happy, if someone with more experience than me (being
> just an hobbyist) could look at the Verilog source of the CPU and give
> me some hints how to possibly lower the amount of resources needed by
> the design.


You could download the Lattice Mico32, and reality check against that,
as that is open source.
Most FPGAs these days have multiport RAM, so it makes sense to optimise
your architecture to use that - in your case for registers, and maybe
even for microcode storage.
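A minimal sketch of the register-file-in-RAM idea (module name, widths,
and the one-write/one-read port arrangement are illustrative, not taken
from the original design):

```verilog
// Register file held in RAM instead of slice flip-flops. The synchronous
// read matches the dual-port RAM template, so the synthesizer can use a
// RAM block (one write port, one read port) rather than hundreds of
// registers plus a wide output mux.
module regfile #(
    parameter WIDTH = 32,
    parameter AW    = 4           // 2^4 = 16 registers
) (
    input  wire             clk,
    input  wire             we,
    input  wire [AW-1:0]    waddr,
    input  wire [WIDTH-1:0] wdata,
    input  wire [AW-1:0]    raddr,
    output reg  [WIDTH-1:0] rdata
);
    reg [WIDTH-1:0] mem [0:(1 << AW)-1];

    always @(posedge clk) begin
        if (we)
            mem[waddr] <= wdata;
        rdata <= mem[raddr];      // registered read -> RAM inference template
    end
endmodule
```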

-jg

#5, 11-28-2007, 03:41 AM, Jürgen Böhm
Re: CPU design uses too many slices

Jim Granville wrote:
>
> You could download the Lattice Mico32, and reality check against that,
> as that is open source.
> Most FPGAs these days have multiport RAM, so it makes sense to optimise
> your architecture to use that - in your case for registers, and maybe
> even for microcode storage.
>


Thank you for your answer.
Indeed I use a RAMB16_S36 for microcode storage; the final design will
probably need four of them, as the microcode is more than 36 bits wide.
The idea from the other posters to change to Virtex FPGAs is currently
not an option for me, as I really want to develop for the cheaper
Spartan platform, for which a lot of affordable boards are offered - if
necessary I will buy a board with the next larger Spartan 3 on it.

Greetings,

Jürgen


--
Jürgen Böhm www.aviduratas.de
"At a time when so many scholars in the world are calculating, is it not
desirable that some, who can, dream ?" R. Thom
#6, 11-28-2007, 04:04 AM, Jürgen Böhm
Re: CPU design uses too many slices

Jon Elson wrote:
>
>
> Yes, wide adders use a lot of LUTs. Multiplexers use up LUTs too. A
> single 4-input LUT could form a single bit of a 2-input mux, wasting one
> input. If you need more inputs, then you have to combine several LUTs
> to perform one bit's worth of multiplexer.
>


Currently I have predominantly three muxes of (5-bit select) x (32-bit
data), with 16 select alternatives actually used (I overdimensioned the
muxes, as before writing the microcode I did not know exactly how many
inputs would be necessary). Are these muxes realized as cascaded LUTs,
and does your remark above imply that a chain of LUTs five stages deep
(one stage for every select bit) will be used?

Greetings,

Jürgen



--
Jürgen Böhm www.aviduratas.de
"At a time when so many scholars in the world are calculating, is it not
desirable that some, who can, dream ?" R. Thom
#7, 11-28-2007, 05:23 PM, Joseph Samson
Re: CPU design uses too many slices

Jürgen Böhm wrote:
> Hi,
>
> currently I am designing (as an amateur project) a 32bit Stack
> oriented CPU with two stack-pointers (Data Stack/Return Stack) and some
> additional registers, that are partly purely auxiliary, partly dedicated
> for the intended purpose of the CPU as a specialized Lisp-Processor.
> The control is microcoded and the greater part of the microcode is
> already written and successfully tested (in simulation with Icarus).
> Missing at the moment is parts of the ALU functions and the complete
> interrupt/exception logic.
> Nevertheless the design (done in Verilog), when synthesized, occupies
> already about 1100 slices in a Spartan 3 FPGA, which I feel is a bit
> heavy for what seems to me a very simple design.


[snip]

The synthesized results are really the worst-case scenario. Before
worrying about a design, take it through mapping; that's where most of
the logic optimization and signal trimming happens. We have designs that
are over 100% utilized after synthesis that fit just fine after mapping.

---
Joe Samson
Pixel Velocity
#8, 11-28-2007, 07:56 PM, rickman
Re: CPU design uses too many slices

On Nov 27, 10:41 pm, Jürgen Böhm <[email protected]> wrote:
> Jim Granville wrote:
>
> > You could download the Lattice Mico32, and reality check against that,
> > as that is open source.
> > Most FPGAs these days have multiport RAM, so it makes sense to optimise
> > your architecture to use that - in your case for registers, and maybe
> > even for micocode storage.

>
> Thank your for your answer:
> Indeed I use RAMB16_S36 for microcode-storage, the final design will
> probably need four of them, as the microcode is more than 36 bit wide.
> The idea from the other posters to change to Virtex FPGAs is currently
> not an option for me, as I really want to develop for the cheaper
> Spartan platform, for which a lot of affordable boards are offered - if
> necessary I will buy a board with the next larger Spartan 3 on it.


If you are trying to fit a given device, then you need to use the full
map and place portions of the tools as well. Only then will you know
for sure whether your design will fit. But what part is on your board?
You are using about 75% of available resources. I can't say for sure
about your design, but ALU logic can be very light if designed
properly, so the rest of your design may fit easily in the part.

I designed my own 16 bit CPU to have minimal size and it was about 500
LUTs, IIRC. Like you, most of the logic was from muxes, so I kept
them as small as possible, even to the point of eliminating some
instructions. Having an extra, unused select line makes them twice as
large. BTW, any unused inputs will be optimized out by the tools. So
if you don't connect the select input or data inputs, that logic will
not be generated.
#9, 11-28-2007, 10:13 PM, Jon Elson
Re: CPU design uses too many slices



Jürgen Böhm wrote:
> Jon Elson wrote:
>
>>
>>Yes, wide adders use a lot of LUTs. Multiplexers use up LUTs too. A
>>single 4-input LUT could form a single bit of a 2-input mux, wasting one
>>input. If you need more inputs, then you have to combine several LUTs
>>to perform one bit's worth of multiplexer.
>>

>
>
> Currently I have predominantly three (5bit select) x (32bit data size)
> muxes with 16 alternatives select actually used (I overdimensioned the
> muxes, as I did not exactly knew before having written the microcode,
> how many inputs would be necessary). Are these muxes realized by
> cascaded LUTs, and does your above remark imply, that a 5-stages-deep
> chain of LUTs (1 stage for every select bit) will be used?

I think it probably does a little better than that. Really, it breaks
the mux down into basic boolean equations and then minimizes them. So
it may make much more efficient use than what you describe above, and
it probably gets better the more inputs you have. Three LUTs can do a
4-input mux; you can almost do it with two, but you are one input short.
If you had 5 separate select inputs (like if you were originally
designing for 5 tri-state drivers on a bus) that might be less efficient
than using a 3-bit binary address for the MUX. But, if a binary address
is decoded somewhere in your logic to the 5 select lines, that will all
fall out in the logic minimization.

Jon

#10, 11-28-2007, 10:15 PM, Jon Elson
Re: CPU design uses too many slices



Jürgen Böhm wrote:
> Indeed I use RAMB16_S36 for microcode-storage, the final design will
> probably need four of them, as the microcode is more than 36 bit wide.
> The idea from the other posters to change to Virtex FPGAs is currently
> not an option for me, as I really want to develop for the cheaper
> Spartan platform, for which a lot of affordable boards are offered - if
> necessary I will buy a board with the next larger Spartan 3 on it.

Yup, the low-cost Spartan was my choice for some designs, too, as I really
had no need for the special structures that the Virtex offers.

Jon

#11, 11-28-2007, 10:33 PM, Eric Smith
Re: CPU design uses too many slices

Jürgen Böhm wrote:
> Indeed I use RAMB16_S36 for microcode-storage, the final design will
> probably need four of them, as the microcode is more than 36 bit wide.


You can use a single BRAM as a 72-bit wide single-ported RAM, if you only
need half the "depth". For instance, normally the maximum width of a
Spartan 3 BRAM would be 512x36, but you can combine the two ports to get
256x72.
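That combination can be sketched as follows (clock and control wiring
are illustrative; the port names are those of the Spartan-3
RAMB16_S36_S36 primitive). The same 8-bit word address drives both
ports, with the ninth address bit tied opposite on each port, so port A
supplies the lower 36 bits of the word and port B the upper 36 bits:

```verilog
// One RAMB16 used as a 256 x 72 single-ported memory: both 512x36 ports
// read in lockstep, each delivering half of the 72-bit word.
wire [7:0]  addr;   // 256-deep word address
wire [71:0] dout;

RAMB16_S36_S36 wide_store (
    // Port A: lower 36 bits, stored in words 0..255
    .CLKA(clk), .ENA(1'b1), .WEA(1'b0), .SSRA(1'b0),
    .ADDRA({1'b0, addr}),
    .DIA(32'b0), .DIPA(4'b0),
    .DOA(dout[31:0]), .DOPA(dout[35:32]),
    // Port B: upper 36 bits, stored in words 256..511
    .CLKB(clk), .ENB(1'b1), .WEB(1'b0), .SSRB(1'b0),
    .ADDRB({1'b1, addr}),
    .DIB(32'b0), .DIPB(4'b0),
    .DOB(dout[67:36]), .DOPB(dout[71:68])
);
```

For a writable version, WEA and WEB would simply be driven together.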

Obviously if you need greater depth or dual-port this won't help you.

Eric
#12, 11-29-2007, 09:33 PM, Jürgen Böhm
Re: CPU design uses too many slices

rickman wrote:
>
> If you are trying to fit a given device, then you need to use the full
> map and place portions of the tools as well. Only then will you know
> for sure that your design won't fit. But what part is on your board?
> You are using about 75% of available resources. I can't say for sure
> about your design, but ALU logic can be very light if designed
> properly. So the rest of your design may fit easily in the part.
>


I use a Spartan-3 Starter Kit with an XC3S200; the utilization figures
I gave above refer to this part. I have already run map and place as
well, but that did not shrink the design significantly.

Considering the ALU, it seems that it can become quite heavy. The
utilization figures above are with an ALU that is missing some operations
I really would have liked to implement, especially an r/lshift(x,y)
operation which shifts the 32-bit word x by an amount y[4:0]. As long
as I kept this in the ALU I had nearly 90% device utilization and, what
is even worse, a maximum CPU speed of only 46 MHz.


> I designed my own 16 bit CPU to have minimal size and it was about 500
> LUTs, IIRC. Like you, most of the logic was from muxes, so I kept
> them as small as possible, even to the point of eliminating some
> instructions. Having an extra, unused select line makes them twice as
> large. BTW, any unused inputs will be optimized out by the tools. So
> if you don't connect the select input or data inputs, that logic will
> not be generated.


Here I would like to ask a question: if I write the following

wire [4:0] sel;

case (sel)
0: case0;
...
15: case15;
endcase

then obviously one specific select line (sel[4]) won't be used and,
following your argument and common-sense intuition, the size of the
multiplexer should be halved. But will this also be the case with

wire [4:0] sel;

case (sel)
3: case3;
7: case7;
8: case8;
...
m: casem;
endcase

where 3, 7, 8, ..., m form a more or less arbitrary 16-element subset of
the range 0..31?

--
Jürgen Böhm www.aviduratas.de
"At a time when so many scholars in the world are calculating, is it not
desirable that some, who can, dream ?" R. Thom
#13, 11-29-2007, 09:47 PM, Jürgen Böhm
Re: CPU design uses too many slices

Eric Smith wrote:
> Jürgen Böhm wrote:
>> Indeed I use RAMB16_S36 for microcode-storage, the final design will
>> probably need four of them, as the microcode is more than 36 bit wide.

>
> You can use a single BRAM as a 72-bit wide single-ported RAM, if you only
> need half the "depth". For instance, normally the maximum width of a
> Spartan 3 BRAM would be 512x36, but you can combine the two ports to get
> 256x72.
>
> Obviously if you need greater depth or dual-port this won't help you.
>

Right, I need the full depth, but there are two other points coming into
play here:

1. I noticed that using a dual-port RAMB instead of a single-port one
increases (slightly) the number of used slices, even if only one port of
the RAMB was used. I do not know the reason for this; maybe it is
because some external dual-port logic has to be generated and added.

2. More importantly, the access delay seems to be shorter for a
single-port BRAM - I could lift my design from 46 MHz to above the
50 MHz barrier only by replacing the dual-port BRAM with a single-port
one.

Jürgen

--
Jürgen Böhm www.aviduratas.de
"At a time when so many scholars in the world are calculating, is it not
desirable that some, who can, dream ?" R. Thom
#14, 11-29-2007, 10:33 PM, rickman
Re: CPU design uses too many slices

On Nov 29, 4:33 pm, Jürgen Böhm <[email protected]> wrote:
> rickman wrote:
>
> > If you are trying to fit a given device, then you need to use the full
> > map and place portions of the tools as well. Only then will you know
> > for sure that your design won't fit. But what part is on your board?
> > You are using about 75% of available resources. I can't say for sure
> > about your design, but ALU logic can be very light if designed
> > properly. So the rest of your design may fit easily in the part.

>
> I use a Spartan-3 starter kit with a XC3S200. The utilization figures
> I gave above refer to this component. Map and Place I already did, too,
> but it did not shrink the design significantly.
>
> Considering the ALU, it seems that it can become quite heavy. The
> utilization figure above are with an ALU that misses some operations
> which I really would have liked to implement, especially a r/lshift(x,y)
> operation which shifts the 32 bit word x by an amount of y[4:0]. As long
> as I kept this in the ALU I nearly had 90% device utilization and, what
> is even worse, only maximal 46Mhz speed for the CPU.


Yes, an n-stage barrel shifter is a very logic-intensive function. It
can easily be larger than all of the other ALU functions combined. If
you consider what is required, you essentially need to build, for every
bit, a mux with an input for each possible shift. If you are shifting in
zeros instead of rotating the other bits back in on the other end, you
can cut your mux roughly in half, but it is still huge. If you want to
be able to shift both left and right it is doubled again; if you want to
shift right either arithmetic or logical it is larger yet; and if you
want to rotate as well it is even larger.

If you check the details of the slice logic, there should be some
additional gates to allow a pair of 4LUTs to be used to make a 4 input
mux. I would expect the tools to use this automatically, but I never
trust the tools and I check. If there is any logic driving the select
inputs rather than being connected to register outputs, that logic can
get mixed in with the mux and make quite an ugly picture. I don't
know that it is any less efficient, but I can no longer verify how
good it is. I like to verify the logic my HDL is generating.

> > I designed my own 16 bit CPU to have minimal size and it was about 500
> > LUTs, IIRC. Like you, most of the logic was from muxes, so I kept
> > them as small as possible, even to the point of eliminating some
> > instructions. Having an extra, unused select line makes them twice as
> > large. BTW, any unused inputs will be optimized out by the tools. So
> > if you don't connect the select input or data inputs, that logic will
> > not be generated.

>
> Here I would like to ask question: if I write the following
>
> wire[4:0] sel;
>
> case (sel)
> 0: case0;
> ..
> 15: case15;
> endcase
>
> then obviously one specific select-line (sel[4]) won't be used and,
> following your argumentation and common-sense intuition, the size of the
> multiplexor should be halved. But will this be also the case with


Unless you specifically specify don't care for cases 16 to 31, I don't
know what the tool assumes. I expect it will add sel[4] as an
enable. But my point is that it won't use the data inputs case16
through case31 which should cut the number of mux LUTs in half.

In fact, (I am very rusty in Verilog working mostly in VHDL) but the
above logic may well generate a latch. That is what happens with
incompletely specified functions, no? So sel[4] may end up as an
enable to a latch at the output of the mux. In VHDL you can't use a
case statement without specifying all possible cases or using an
otherwise case. If the otherwise is spec'd to output a zero, then
sel[4] will be an enable. To have sel[4] ignored you would have to
spec the case from 16 to 31 to be the same output as 0 to 15
respectively.
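In Verilog, the fully specified form described here might look like this
fragment (the names follow Jürgen's example; `out` and the zero default
are illustrative): the default arm defines the output for sel = 16..31,
so no latch is inferred and sel[4] reduces to the enable just described.

```verilog
// Fully specified 16:1 selection from a 5-bit select. The default arm
// covers sel = 16..31, so the block stays purely combinational (no
// inferred latch) and sel[4] degenerates into a simple output gate.
reg [31:0] out;

always @* begin
    case (sel)
        5'd0:    out = case0;
        5'd1:    out = case1;
        // arms 5'd2 .. 5'd14 follow the same pattern
        5'd15:   out = case15;
        default: out = 32'd0;   // sel[4] set: forced zero instead of a latch
    endcase
end
```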

My original statement about the logic being automatically optimized
away would only apply if you designed the mux to have 32 data inputs
and did not drive half of them. Again, that likely is not legal, but
I don't recall what any particular compiler will do for Verilog.


> wire[4:0] sel
>
> case (sel)
> 3: case3;
> 7: case7;
> 8: case8;
> ..
> m: casem;
> endcase
>
> where 3,7,8,..,m form a more or less arbitrary 16-element set from the
> range 0..31 ?


This design will use all of the select inputs even if only half of the
data inputs are used. So again the mux logic will be reduced, but
each data input will be enabled by a full decode all of the sel
inputs. I don't think the logic will be halved in this case however,
but it depends on how it is implemented. Again without a fully spec'd
case, it may generate a latch on the output.
#15, 11-29-2007, 10:38 PM, rickman
Re: CPU design uses too many slices

On Nov 29, 4:47 pm, Jürgen Böhm <[email protected]> wrote:
> Eric Smith wrote:
> > Jürgen Böhm wrote:
> >> Indeed I use RAMB16_S36 for microcode-storage, the final design will
> >> probably need four of them, as the microcode is more than 36 bit wide.

>
> > You can use a single BRAM as a 72-bit wide single-ported RAM, if you only
> > need half the "depth". For instance, normally the maximum width of a
> > Spartan 3 BRAM would be 512x36, but you can combine the two ports to get
> > 256x72.

>
> > Obviously if you need greater depth or dual-port this won't help you.

>
> Right, I need the full depth, but there are two other points coming into
> play here:
>
> 1. I noticed that using a dual port RAMB instead of a single port
> increases (slightly) the number of used slices, even if only one port
> the RAMB was used. I do not know the reason for this, maybe it is
> because some external dual-port logic has to generated and added.
>
> 2. More importantly the access delay seems to be shorter for a single
> port BRAM - I could lift my design from 46Mhz above the 50Mhz barrier
> only by replacing dual port with single port BRAM.


This sounds odd to me, but obviously your dual port design is
different from the single port design in other ways than just the
ram. You need two address busses and control signal sets, not to
mention the two data paths. How did you connect the dual port ram
that was different from the single port ram? I am pretty sure the
block ram itself fully implements the dual port memory and does not
require any slices to be used.

#16, 11-30-2007, 12:30 AM, Jürgen Böhm
Re: CPU design uses too many slices

rickman wrote:
> On Nov 29, 4:47 pm, Jürgen Böhm <[email protected]> wrote:
>> Eric Smith wrote:
>>> Jürgen Böhm wrote:

>>
>> 1. I noticed that using a dual port RAMB instead of a single port
>> increases (slightly) the number of used slices, even if only one port
>> the RAMB was used. I do not know the reason for this, maybe it is
>> because some external dual-port logic has to generated and added.
>>
>> 2. More importantly the access delay seems to be shorter for a single
>> port BRAM - I could lift my design from 46Mhz above the 50Mhz barrier
>> only by replacing dual port with single port BRAM.

>
> This sounds odd to me, but obviously your dual port design is
> different from the single port design in other ways than just the
> ram. You need two address busses and control signal sets, not to
> mention the two data paths. How did you connect the dual port ram
> that was different from the single port ram? I am pretty sure the
> block ram itself fully implements the dual port memory and does not
> require any slices to be used.
>


Actually I just used "dummy signals" on the unused port B ADDR, DI,
DO, ... pins. That is, first I wrote (out of laziness, I only did
copy&paste) something like

RAMB16_S36_S36 micro_store ( ... ,.DOB(dummydob), ... );

and used only port A (dummydob is a signal left undeclared).

Then I wrote explicitly

RAMB16_S36 micro_store (..)

and got the results with faster timing and fewer slices used.


- Jürgen

--
Jürgen Böhm www.aviduratas.de
"At a time when so many scholars in the world are calculating, is it not
desirable that some, who can, dream ?" R. Thom
#17, 11-30-2007, 03:11 AM, rickman
Re: CPU design uses too many slices

On Nov 29, 7:30 pm, Jürgen Böhm <[email protected]> wrote:
> [snip]
>
> Actually I just used "dummy signals" at the unused port B ADDR, DI,
> DO,.. signals. That is, first I wrote (because of laziness, I did only
> copy&paste) something like
>
> RAMB16_S36_S36 micro_store ( ... ,.DOB(dummydob), ... );
>
> and used only the port A. (dummydob is a signal left undeclared).
>
> Secondly I wrote explicitly
>
> RAMB16_S36 micro_store (..)
>
> and got the results with faster timing and less slices used.


I don't know the impact of dummydob. I would expect it to use LUTs to
source fixed-value signals, since you are instantiating a fixed ram
block, but I don't really know what the tools would do with that. If it
doesn't provide signal drivers, it would have to minimize the dual-port
ram to a single-port ram, since that is all that is being used. I don't
think the ram blocks can ignore inputs, but again, I don't know for
sure. One of the Xilinx guys could tell you for sure.
#18, 11-30-2007, 01:58 PM, Brian Drummond
Re: CPU design uses too many slices

On Thu, 29 Nov 2007 22:33:56 +0100, Jürgen Böhm <[email protected]> wrote:

>rickman wrote:


> Considering the ALU, it seems that it can become quite heavy. The
>utilization figure above are with an ALU that misses some operations
>which I really would have liked to implement, especially a r/lshift(x,y)
>operation which shifts the 32 bit word x by an amount of y[4:0]. As long
>as I kept this in the ALU I nearly had 90% device utilization and, what
>is even worse, only maximal 46Mhz speed for the CPU.


One trick with the barrel shifter is to use the multiplier blocks to
implement it: for a 32*16 shifter, 2 multipliers are enough. Simply
decode the shift distance to one-of-16 - unless the multipliers are
already fully employed elsewhere.
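As a sketch of the trick (widths and names are illustrative, shown for a
16-bit operand; a 32-bit operand splits into two halves across two
multipliers, as noted above): multiplying x by the one-hot value 2^n is
the same as shifting x left by n, so the wide shift mux collapses into
the dedicated 18x18 multiplier.

```verilog
// Barrel shift via multiplication: x * 2^n equals x << n, and the
// "multiplexing" happens inside the dedicated multiplier block.
wire [15:0] x;
wire [3:0]  n;                          // shift distance 0..15
wire [16:0] one_hot = 17'd1 << n;       // one-of-16 decode, fits an 18-bit port
wire [32:0] product = x * one_hot;
wire [15:0] lshift  = product[15:0];    // x << n, zeros shifted in
wire [15:0] rshift  = product[31:16];   // same product gives x >> (16-n)
```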

- Brian
#19, 11-30-2007, 07:19 PM, Peter Alfke
Re: CPU design uses too many slices

Just make sure that you force the Enable and the Write Enable inputs of
the unused port Low. Nothing else.

BTW: Use the 18 x 18 multipliers to implement barrel shifters. It saves
many LUTs and is faster.
(Multiplying by a power of 2 generates a shift by the appropriate number
of positions.
The multiplexing is done inside the multiplier "at no cost".)
Peter Alfke, Xilinx Applications.

rickman wrote:
[snip]
> I don't know the impact of dummydob. I would expect it to use LUTs to
> source a fixed value signals since you are instantiating a fixed ram
> block, but I don't really know what the tools would do with that. If
> it doesn't provide signal drivers it would have to minimize the dual
> port rams to single port rams since that is all that is being used. I
> don't think the ram blocks can ignore inputs, but again, I don't know
> for sure. One of the Xilinx guys could tell you for sure.

#20, 11-30-2007, 09:22 PM, glen herrmannsfeldt
Re: CPU design uses too many slices

rickman wrote:
(snip)

> Yes, an n stage barrel shifter is a very logic intensive function. It
> can easily be larger than all of the other ALU functions combined. If
> you consider what is required, you in essence need to build a mux with
> an input for each possible shift on every bit. If you are shifting in
> zeros instead of rotating the other bits back on the other end, you
> can cut your mux roughly in half. But it is still huge. If you want
> to be able to shift both left and right it is doubled again and if you
> want to shift right either arithmetic or logical it is larger yet and
> if you want to rotate as well it is even larger.


The more usual barrel shifter would be log2(n) stages of 2-input muxes,
which is still a lot of CLBs, or maybe log4(n) stages of 4-input muxes.

It is barrel shifters that make floating point addition and subtraction
so expensive in FPGAs. You need one for prenormalization (shift to
align the radix point), and one for postnormalization (remove leading
zeros in the result and adjust the exponent).
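The logarithmic structure can be sketched like this (a left shifter that
shifts in zeros; names are illustrative): each stage is a row of 32
two-input muxes that conditionally shifts by a power of two, so five
stages cover every shift amount from 0 to 31.

```verilog
// Logarithmic barrel shifter: stage k shifts by 2^k when amt[k] is set,
// giving any shift of 0..31 in five rows of 2:1 muxes.
module lshift32 (
    input  wire [31:0] x,
    input  wire [4:0]  amt,
    output wire [31:0] y
);
    wire [31:0] s0 = amt[0] ? {x [30:0],  1'b0} : x;
    wire [31:0] s1 = amt[1] ? {s0[29:0],  2'b0} : s0;
    wire [31:0] s2 = amt[2] ? {s1[27:0],  4'b0} : s1;
    wire [31:0] s3 = amt[3] ? {s2[23:0],  8'b0} : s2;
    assign      y  = amt[4] ? {s3[15:0], 16'b0} : s3;
endmodule
```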

-- glen
