I am designing a simple FPU comparison unit for 32/64 bit floating
point. My RTL code is simulating OK and now I'm trying to improve the
code to make it more efficient. I'm trying to improve a 64-bit
comparison. In my unit, the comparison can be either 31 bits (single-
precision) or 63 bits (double precision), excluding sign bit. Right
now, what I do is put a 32-bit multiplexer in front of the comparator
and zero-pad the single-precision value to make it 63 bits. Is there a
way to recode this to get rid of these multiplexers and still handle both
cases? I've been able to do it pretty easily for detecting A == B == 0
or A != 0, B != 0.

My code is below. The line in question is "assign altb64 = ...."

input [63:0] srcA, srcB;   // operands A/B
input sp;                  // 1 - float32, 0 - float64
input eqop;                // if this is strictly a == b operation
output aeqb, aleb, altb;   // a == b, a <= b, a < b
output invalidexception;   // if invalid exception is thrown

/*AUTOWIRE*/
// Beginning of automatic wires (for undeclared instantiated-module outputs)
wire aNaN;     // From cmpnan of fpuNaNU.v
wire asigNaN;  // From cmpnan of fpuNaNU.v
wire bNaN;     // From cmpnan of fpuNaNU.v
wire bsigNaN;  // From cmpnan of fpuNaNU.v
// End of automatics

// if A or B is a NaN / signaling NAN.
assign aorbNaN = aNaN | bNaN;
assign aorbsigNaN = asigNaN | bsigNaN;

// The IEEE standard states that an equality comparison doesn't raise an
// invalid exception unless an operand is a signaling NaN. For <= and <,
// the exception is raised if at least one of the operands is a NaN.
assign invalidexception = eqop ? aorbsigNaN : aorbNaN;

// determine if A/B have same sign.
assign signmismatch = sp ? srcA[31] ^ srcB[31] : srcA[63] ^ srcB[63];

// determine if A == B == 0 for single/double precision.
assign aandbzero32 = ~(| (srcA[30:0] | srcB[30:0]));
assign aandbzero64 = aandbzero32 & ~(| (srcA[62:31] | srcB[62:31]));

// determine if A == B for single/double precision.
assign aeqb32 = (srcA[30:0] == srcB[30:0]);
assign aeqb64 = aeqb32 & (srcA[62:31] == srcB[62:31]);

// check for equality.
// a) if signs of A/B mismatch, A == B == 0.
// b) if signs match, A == B if all the bits are equal.
assign equal = sp ? (signmismatch & aandbzero32) | (~signmismatch & aeqb32)
                  : (signmismatch & aandbzero64) | (~signmismatch & aeqb64);

// check for less than.
// a) if signs of A/B mismatch, A < B if A.sign == 1 and A, B are not both zero.
// b) if signs match, A < B, if (sign ^ (A < B)).
assign less = sp ? (signmismatch & srcA[31] & ~aandbzero32) |
                   (~signmismatch & ~equal & (srcA[31] ^ /*(srcA[30:0] < srcB[30:0])*/ altb64))
                 : (signmismatch & srcA[63] & ~aandbzero64) |
                   (~signmismatch & ~equal & (srcA[63] ^ /*(srcA[62:0] < srcB[62:0])*/ altb64));

< I am designing a simple FPU comparison unit for 32/64 bit floating
< point. My RTL code is simulating OK and now I'm trying to improve the
< code to make it more efficient. I'm trying to improve a 64-bit
< comparison. In my unit, the comparison can be either 31 bits (single-
< precision) or 63 bits (double precision), excluding sign bit. Right
< now, what I do is put a 32-bit multiplexer in front of the comparator

Assuming it is built of LUT4s, there is one extra input on each LUT
that builds the comparator. That input could, with no more logic than
the comparator itself, indicate that the low bits should be ignored.

My guess is that the tools will figure that out from a mux in front
of the comparator, but I wouldn't say for sure without seeing the
actual logic.

If not combined, a two-input MUX is about as big as the comparator.
It might be faster to generate two comparators and select the output
as appropriate. That would work best if the select logic can be
combined with later logic.

Most of the logic optimization rules from the TTL gate days don't
apply in LUT logic.
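
The "two comparators, select the output" option above could be sketched
roughly as follows. This is only an illustration, not code from the
thread: the module name altb_two_cmp is made up, and it assumes the
operand layout of the posted code (single-precision magnitude in bits
[30:0], double-precision magnitude in bits [62:0]).

```verilog
// Hypothetical sketch: two independent magnitude comparators,
// with a single 1-bit select on the result instead of wide input muxes.
module altb_two_cmp (
  input  [63:0] srcA, srcB,  // operands, sign bits excluded below
  input         sp,          // 1 - float32, 0 - float64
  output        altb         // |A| < |B| for the selected precision
);
  wire lt32 = srcA[30:0] < srcB[30:0];  // 31-bit magnitude compare
  wire lt64 = srcA[62:0] < srcB[62:0];  // 63-bit magnitude compare

  assign altb = sp ? lt32 : lt64;       // 1-bit 2:1 mux on the output
endmodule
```

Whether this ends up smaller or faster than the mux-in-front version
depends on what the synthesis tool does with it, so it is worth checking
the mapped result rather than assuming.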

On Jun 3, 6:05 pm, glen herrmannsfeldt <[email protected]> wrote:
> (snip)

Thanks for the response. I was targeting this more for CMOS logic and an
ASIC design flow. However, I had planned to run it on an FPGA for
verification. I haven't gotten around to writing the synthesis scripts to
see what Synopsys Design Compiler would generate. Perhaps I should look
into that quickly.

pallav <[email protected]> wrote:
< I think I got this to work:
(snip on modified comparator)

< This gets rid of the 32-bit muxes in front of the
< comparator inputs and puts a 1-bit 2:1 mux at the output.

So many posts here are about FPGAs that that was what I thought of first.

The delay is presumably not so different, but much less logic.
Since you can't have 64-input gates in CMOS, how does the
equality test work? Otherwise, for the less than part you can
modify the carry logic to ignore the low half.
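
As a rough sketch of what "modify the carry logic" could mean here: the
63-bit compare can be split into two halves that share the low
comparator, with sp gating the half that single precision does not use.
This is not the comparator that was snipped above; the module name
altb_shared is hypothetical. Note that with the layout in the posted
code the single-precision operand sits in the low bits, so in this
sketch it is the upper half that gets ignored; with the opposite layout
the roles would swap.

```verilog
// Hypothetical sketch: one shared low-half comparator, with sp
// forcing the upper half to look "equal" in single precision.
module altb_shared (
  input  [63:0] srcA, srcB,
  input         sp,          // 1 - float32, 0 - float64
  output        altb         // |A| < |B| for the selected precision
);
  wire lo_lt = srcA[30:0]  <  srcB[30:0];   // low 31 bits, used by both
  wire hi_lt = srcA[62:31] <  srcB[62:31];  // upper 32 bits, float64 only
  wire hi_eq = srcA[62:31] == srcB[62:31];

  // sp = 1: altb = lo_lt
  // sp = 0: altb = hi_lt | (hi_eq & lo_lt), i.e. a full 63-bit compare
  assign altb = (~sp & hi_lt) | ((sp | hi_eq) & lo_lt);
endmodule
```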

On Jun 3, 8:28 pm, glen herrmannsfeldt <[email protected]> wrote:
> (snip)

The logic structure for a 32-bit equality comparator would just be a
bunch of XNOR (equivalence) gates that compute A[i] == B[i]. This
32-bit result can then be fed into a binary tree of AND gates
(basically detecting whether all 32 bits are 1s). So that's
1 + 5 (log2 32) = 6 stages of logic if we assume 2-input AND gates.
However, if we have 4-input AND gates, then that reduces to
1 + 3 = 4 stages. XNOR can be made fast with mirror logic in static
CMOS. Of course, I'm not counting the inverters needed for the AND
gates as a logic stage.

Many cell libraries have basic gates with fan-ins (inputs) of
up to 5 or 6 (maybe more, I think).
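
To make the stage count concrete, here is a small sketch of the
4-input version: one XNOR stage, then AND levels of 32 -> 8 -> 2 -> 1,
which gives the 1 + 3 = 4 stages mentioned above. The module name
eq32_tree and the signal names are made up for illustration, and a
synthesis tool is free to restructure this however it likes.

```verilog
// Hypothetical sketch: 32-bit equality as XNORs plus an AND tree.
module eq32_tree (
  input  [31:0] a, b,
  output        eq
);
  wire [31:0] bit_eq = a ~^ b;  // stage 1: bitwise XNOR, 1 where bits match
  wire [7:0]  l1;               // stage 2: eight 4-input ANDs
  wire [1:0]  l2;               // stage 3: two 4-input ANDs

  genvar i;
  generate
    for (i = 0; i < 8; i = i + 1) begin : g_l1
      assign l1[i] = &bit_eq[4*i +: 4];
    end
    for (i = 0; i < 2; i = i + 1) begin : g_l2
      assign l2[i] = &l1[4*i +: 4];
    end
  endgenerate

  assign eq = &l2;              // stage 4: final 2-input AND
endmodule
```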

<> The delay is presumably not so different, but much less logic.
<> Since you can't have 64 bit input gates in CMOS, how does the
<> equality test work? Otherwise, for the less than part you can
<> modify the carry logic to ignore the low half.

< The logic structure for 32 bit equality comparator would just be a
< bunch of XNOR (equivalence) gates that compute A[i] == B[i]. This
< 32-bit result can then be fed into a binary tree of AND gates
< (basically detect if all 32 bits are 1s). So that's 1 + 5
< (log2 32) = 6 stages of logic if we assume 2-input AND gates.

Last I knew, it was four inputs for the widest CMOS gates.
The reason for the question is that you might be able to force
the low half to indicate equality with minimal logic and no
additional gate delay. Then you only need to force the carry
logic in a similar way.
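
A minimal sketch of forcing one half to indicate equality, assuming the
layout of the posted code (so the half being forced is the upper one,
used only by double precision). The module name aeqb_forced is
hypothetical. The OR and AND at the end can often be mapped to a single
complex gate, though that depends on the cell library.

```verilog
// Hypothetical sketch: sp forces the unused half to read as "equal",
// so no mux is needed on the equality result.
module aeqb_forced (
  input  [63:0] srcA, srcB,
  input         sp,          // 1 - float32, 0 - float64
  output        eqmag        // magnitudes equal for the selected precision
);
  wire lo_eq = (srcA[30:0]  == srcB[30:0]);   // used by both precisions
  wire hi_eq = (srcA[62:31] == srcB[62:31]);  // float64 only

  assign eqmag = lo_eq & (hi_eq | sp);
endmodule
```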

< However, if we have 4 input AND gates, then that reduces
< to 1 + 3 = 4 stages. XNOR can be made fast with mirror logic
< in static CMOS. Of course, I'm not counting the inverters needed
< for the AND gate as a logic stage.

Maybe you aren't worried about delay. The delay model is very
different in an FPGA.