Hello,

I recently asked for advice on performing bit-matrix multiplication
(bitwise AND followed by a population count), with one matrix stored
on board an FPGA in block RAMs and the other (the first operand)
streaming in row by row.
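
To pin down what is meant by a bit-matrix dot product here, a minimal software model follows. It is only a sketch: the function names (popcount, bit_dot, bit_matvec) and the choice to pack each bit vector into a Python integer are assumptions made for illustration, not part of any actual design.

```python
from typing import List


def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def bit_dot(row: int, col: int) -> int:
    """Dot product of two bit vectors packed into integers: AND, then popcount."""
    return popcount(row & col)


def bit_matvec(row: int, columns: List[int]) -> List[int]:
    """One streamed-in row against every stored column (hypothetical layout)."""
    return [bit_dot(row, col) for col in columns]


# Small example: a 4-bit row against three 4-bit columns.
results = bit_matvec(0b1011, [0b0001, 0b1110, 0b1011])
print(results)  # [1, 2, 3]
```

In hardware each entry of the result would be produced by its own column unit in parallel; the list comprehension only models the arithmetic, not the timing.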

It was thought that one of the latest PCIe cards would be able to
provide dot-product throughput limited only by the PCIe input speed of
16 Gbps for PCIe x8. The adders would be 80 bits wide (80 bits arrive
per cycle at 200 MHz over PCIe x8), and each column of the onboard
matrix would be stored in 80 block RAMs.
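
As a sanity check on these figures, the sketch below models one column unit that receives the row 80 bits at a time and accumulates a partial popcount each cycle. The names (chunks, streamed_dot) and the least-significant-slice-first ordering are assumptions for illustration only; the real slice order and pipelining would depend on the design.

```python
BITS_PER_CYCLE = 80        # bits arriving per cycle, from the figures above
CLOCK_HZ = 200_000_000     # 200 MHz

# 80 bits per cycle at 200 MHz gives the quoted 16 Gbps input rate.
INPUT_RATE_BPS = BITS_PER_CYCLE * CLOCK_HZ


def chunks(value: int, total_bits: int, width: int = BITS_PER_CYCLE):
    """Yield successive width-bit slices of value, least significant first."""
    mask = (1 << width) - 1
    for offset in range(0, total_bits, width):
        yield (value >> offset) & mask


def streamed_dot(row: int, col: int, total_bits: int) -> int:
    """Accumulate AND + popcount one 80-bit slice per modelled cycle."""
    acc = 0
    for row_slice, col_slice in zip(chunks(row, total_bits),
                                    chunks(col, total_bits)):
        acc += bin(row_slice & col_slice).count("1")
    return acc
```

The accumulated result equals the popcount of the full-width AND, so slicing the vector into 80-bit pieces changes only the timing, not the answer.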

My question is: how much more difficult does the problem become if I
must also find the maximum dot product, and which column produced it,
for each input vector? This operation must be performed for each input
row, yielding one max and one argmax for every 1000 input bits. The
input vector is 1000 bits long and finishes arriving over PCIe after
about 13 PCIe cycles (1000 bits at 80 bits per cycle). Is there a fast
enough way to take the argmax of 1000 numbers that are each 10 bits
wide (one per column's dot product)? What would the argmax operation
cost in FPGA area compared to the column adders (which are probably
80 bits wide for each column)?
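
To make the question concrete, here is a rough software model of one possible approach, a pairwise comparator tree that carries each value together with its column index. This is only a sketch under stated assumptions: the function name tree_argmax, the tie-breaking rule (lowest index wins), and the idea of counting one comparator per pair are illustrative choices, not a claim about how the hardware should be built or what it would cost.

```python
from typing import List, Tuple


def tree_argmax(values: List[int]) -> Tuple[int, int, int, int]:
    """Return (max_value, argmax_index, tree_levels, comparator_count)."""
    # Each entry carries the value and the column index it came from.
    layer = [(v, i) for i, v in enumerate(values)]
    levels = 0
    comparators = 0
    while len(layer) > 1:
        nxt = []
        for k in range(0, len(layer) - 1, 2):
            a, b = layer[k], layer[k + 1]
            # On a tie keep the left entry, so the lowest index wins.
            nxt.append(a if a[0] >= b[0] else b)
            comparators += 1
        if len(layer) % 2:
            # An odd entry passes through this level without a comparator.
            nxt.append(layer[-1])
        layer = nxt
        levels += 1
    return layer[0][0], layer[0][1], levels, comparators


print(tree_argmax([3, 9, 2, 9, 1]))  # (9, 1, 3, 4)
```

For 1000 inputs this model needs 10 levels and 999 comparators, each comparing two 10-bit values and forwarding the winning value and its 10-bit index. If each level were registered, a new set of 1000 dot products could enter the tree every cycle, with the result appearing 10 cycles later; whether that fits the area and timing budget is exactly the open question.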

Thanks for your help. I want to make sure that the max and argmax
functions will not be a limiting factor in the design of the bit-matrix
multiplier.

Also, thanks for the many helpful comments that have gotten me to this
level of understanding of the problem.

- AndrewF