I am a Python programmer writing neural network code with binary firing
and binary weight values. My code will take many days to parse my
large data sets. I have no idea how much an FPGA could help, what the
cost would be, or how easy it would be to access from Python. The
problem is similar to competitive networks, where I must take the dot
product of many million-length bit vectors (which only change
occasionally) with one input vector. Anybody want to estimate the cost,
speedup, and value an FPGA could offer me?

Seems like this problem shouldn't be so hard, but from the little
research I've done I haven't found a good-value product that is
ready-made, so I'm looking at (multiple?) FPGAs as a coprocessor.
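For reference, the core operation here, a dot product of two binary
vectors, is just a population count of their bitwise AND. A minimal
Python sketch (not the poster's actual code) using arbitrary-precision
ints as bit vectors:

```python
import random

def bit_dot(a: int, b: int) -> int:
    # Dot product of two binary vectors stored as Python ints:
    # the number of positions where both vectors have a 1.
    return bin(a & b).count("1")

# Two random million-bit vectors (illustrative sizes only).
n = 1_000_000
a = random.getrandbits(n)
b = random.getrandbits(n)
print(bit_dot(a, b))  # on average about n/4 of the positions match
```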

Dini has a new PCIe x8 board. It should work really well for your
needs, but nobody has built the DMA engine and drivers for it yet, so
that adds to its $10k cost. FPGAs are great for accelerating neural
network projects; there are lots of papers with algorithms for this in
ACM and IEEE journals. The Python interface is not a problem: it will
come down to IOCTLs at some point. You may have to write a C API and
DLL wrapper if your Python cannot make IOCTL calls directly.
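On the IOCTL point: CPython can issue them directly through the
standard `fcntl` module, so a C wrapper is often unnecessary. A hedged
sketch only; the device path and request number below are made up,
since the real values depend entirely on the board vendor's driver:

```python
import fcntl
import os

# Hypothetical device node and ioctl request number; substitute the
# values documented by your board's driver.
FPGA_DEV = "/dev/fpga0"
FPGA_START_DMA = 0x4004F001

def run_dma(payload: bytes) -> None:
    # Write the input vector to the device, then kick off the DMA
    # transfer with an ioctl call.
    fd = os.open(FPGA_DEV, os.O_RDWR)
    try:
        os.write(fd, payload)
        fcntl.ioctl(fd, FPGA_START_DMA)
    finally:
        os.close(fd)
```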

Here's my usual plug: I just wish there were a hardware vendor who
would put some cheap FPGAs (Spartan-3E 1600s) on a cheap board with
some standard DRAM and SRAM slots (unpopulated) and a PCIe x8 (or x4)
connector, and then sell the board for under $300. Design the darn
thing for acceleration, not prototyping. They could make a killing on
a well-made board with 8 or 16 fast DMA channels and a driver that
worked really well for that.

>Hello,
>
>I am a Python programmer writing neural network code with binary firing
>and binary weight values. My code will take many days to parse my
>large data sets. I have no idea how much fpga could help, what the
>cost would be, and how easy it would be to access it from Python. The
>problem is similar to competitive networks, where I must dot product
>many million-length bit vectors (which only change occasionally) with 1
>input vector. Anybody want to estimate the cost, speedup, and value an
>fpga could offer me?
>
>Seems like this problem shouldn't be so hard, but from the little
>research I've done I haven't found a good value product that is
>ready-made, so I'm looking at (multiple?) fpga as a coprocessor.
>
>
Sounds to me like vector processing.
Cray (supercomputers) know all about it.
It can be done in an FPGA.
For neural nets, many hardware units have been designed.
Isn't Python horribly slow for this? Wouldn't C be better? Or assembly?
Anyway, perhaps you can implement your neuron in hardware (this has
been done).
There also exist vector plugin cards for the PC (I designed one once).

So how many million-length bit-vector dot products might I be able to
do per second? My 3.8 GHz P4 can do 125/sec. I would prefer building a
Beowulf cluster if the price/performance was similar (because FPGAs
are so foreign to me). Of course, if you tell me 10,000/sec I will
become an instant FPGA evangelist, hehe.

I have not found a decent bit-vector dot product plugin card. I think
they usually handle integers or floating point, but not bits in an
efficient manner.

>So how many million-length bit-vector dot products might I be able to
>do per second? My 3.8ghz P4 can do 125/sec. I would prefer building a
>beowulf cluster if the price/performance was similar (because fpga is
>so foreign to me). Of course if you tell me 10,000/sec I will become
>an instant fpga evangelist, hehe.
>
>Jan: I use a matrix library written in C.
Ah, OK.
You know, this is not a 'Saturday afternoon after shopping' question
(it is that here right now) that I can answer just like that.
My old boss used to DEMAND to see the whole project, else he would
not even want to venture a guess.
Often these things can be broken down and done in different ways.
If it were a simple multiply you could see how many n-bit multipliers
fit in the largest FPGA, but millions... And you would somehow have to
get the data in and out.
Some project.
In a Virtex-4 (Xilinx knows more) at 500 MHz you can have 512 XtremeDSP
blocks with 18x18 multipliers: 512 x 18 = 9216 bits at a time in 2 ns.
To cover a million bits you loop about 109 times: 109 x 2 ns = 218 ns
per million-bit multiply...
Sort of a wild number; you really need to talk to these guys, I have
no experience with the Virtex-4.
Over to Xilinx (or Altera).
Budget? Time? It all counts.

[email protected] schrieb:
> where I must dot product
> many million-length bit vectors (which only change occasionally) with 1
> input vector. Anybody want to estimate the cost, speedup, and value an
> fpga could offer me?

If I understand you correctly, for each vector you want to know at how
many places both that vector and the input vector are 1?
Your vectors change rarely; the single input vector changes rapidly?

I would say that in an FPGA you would not do it one vector at a time,
but N bits at a time for many vectors in parallel.

If your vectors are stored off chip, you can process them as fast as
you can read them. With the right board, that is a rate of a few
hundred gigabits per second.

>So how many million-length bit-vector dot products might I be able to
> do per second? My 3.8ghz P4 can do 125/sec.
That's 125 Mbits/s. That is very easy to beat.
You probably can get some affordable board with 64-bit 200 MHz SRAM
and a small FPGA. This will get you about a factor of ten over the P4.

On the other hand, the P4 value seems too low. According to "Hacker's
Delight", pages 65ff, counting the number of bits in a 32-bit word
takes less than 20 instructions. Adding one instruction for the
initial AND, plus the loads, index updates, and some loop control
instructions, results in about one instruction per bit. This means
that a P4 should be able to do a few gigabits per second.
And that's without using MMX instructions, which can do the dot
product after the first three reductions.
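The Hacker's Delight routine referred to is the branch-free
divide-and-conquer population count. Transliterated into Python for
illustration (masking to 32 bits, which C gets for free):

```python
def popcount32(x: int) -> int:
    # Branch-free 32-bit population count, after Hacker's Delight ch. 5:
    # sum adjacent 1-bit fields, then 2-bit fields, then 4-bit fields,
    # then gather the byte sums with a multiply.
    x = x - ((x >> 1) & 0x55555555)
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333)
    x = (x + (x >> 4)) & 0x0F0F0F0F
    return ((x * 0x01010101) & 0xFFFFFFFF) >> 24
```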

[email protected] wrote:
> So how many million-length bit-vector dot products might I be able to
> do per second? My 3.8ghz P4 can do 125/sec. I would prefer building a
> beowulf cluster if the price/performance was similar (because fpga is
> so foreign to me). Of course if you tell me 10,000/sec I will become
> an instant fpga evangelist, hehe.

If I understood correctly:

- You have many (what is many for you? 100, 1000, 1000000?)
million-bit vectors that are quite 'static' (or at least don't change
much compared to your 'input' vector). Let's say you have N of them,
and that your million-bit vector is in fact 1024*1024 bits long.
- You also have 1 input vector that changes quite often.
- You want the N dot products, which is basically the number of 1s in
the bitwise AND of each fixed vector with the input vector.

To estimate how fast it could be done, N should be known... or at
least a range, because I think the main limitation is going to be the
bandwidth between the host and the card, and not the FPGA itself.

An FPGA can do the dot product pretty easily. Imagine you feed in the
vectors 32 bits at a time: first the 32 first bits of the 'input',
then the 32 first bits of the 'references' one by one, then the 32
bits after that, and so on. So to enter all the vector data for 1
given input vector, you need (N+1)*2^15 cycles. The logic doing the
dot product is just a bitwise AND, a stage that counts the 32 bits,
then a 21-bit adder that stores the results in block RAM (given that N
is sufficiently small for the 21-bit results to fit in block RAM;
let's say N < 16384 for a small FPGA). That logic could easily be
pipelined to run at > 100 MHz even in a small, cheap Spartan-3, and
since you need 2^15 cycles per complete vector (if N >> 1), that would
be 3000/s, and that's in a small FPGA.
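As a sanity check of that datapath, here is a small Python behavioral
model of the word-serial scheme (a dataflow sketch only, not HDL):
vectors arrive a word per cycle, and each reference keeps a running
AND-count-accumulate result.

```python
def streamed_dot_products(refs, inp, width=32):
    # refs: list of N reference vectors, each pre-split into
    # `width`-bit words; inp: the input vector split the same way.
    acc = [0] * len(refs)
    for w, in_word in enumerate(inp):        # one word per "cycle"
        for i, ref in enumerate(refs):
            # AND stage, popcount stage, accumulator stage.
            acc[i] += bin(ref[w] & in_word).count("1")
    return acc

# Tiny example: two 8-bit "vectors" split into 4-bit words.
print(streamed_dot_products([[0b1010, 0b0001], [0b1111, 0b0000]],
                            [0b1011, 0b0001], width=4))  # → [3, 3]
```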

Now, use a 128-bit-wide DDR2 memory (that's 256 bits per cycle), use a
high speed grade to run the whole thing at > 250 MHz, and you get
60,000 of them per second...

But as I said, you need to get the data into the DDR2 memory and
organized so that the read is efficient. Pretty easy. A million-bit
vector is 128 KBytes, and reading 60 thousand of them per second is
7.5 GBytes of traffic per second...

Of course, you need to define N better, and these numbers are just for
the first design I can think of with the info you provided. Your
mileage may vary. I think it could be done pretty quickly if you hire
someone who already has, and has used, a memory controller and
whatever controller is needed to input/output the data. Getting data
in and out is the real challenge here...

> Hello,
>
> I am a Python programmer writing neural network code with binary firing
> and binary weight values. My code will take many days to parse my
> large data sets. I have no idea how much fpga could help, what the
> cost would be, and how easy it would be to access it from Python. The
> problem is similar to competitive networks, where I must dot product
> many million-length bit vectors (which only change occasionally) with 1
> input vector. Anybody want to estimate the cost, speedup, and value an
> fpga could offer me?

I assume you have looked for algorithmic speed-ups? (Also FPGAs have
different algorithmic speed-ups available than conventional computers.)

Algorithmic speed-ups might be available if:
a) the bit vectors are sparse (i.e. only a small fraction are ones, or
a small fraction are zeros);
b) the bit vectors are non-random (e.g. you are matching to
shift-register sequences, or to highly compressible sequences that can
be described in considerably less data than the raw bit stream);
c) the bit vectors are related (e.g. you are using the neural net to
listen to a data stream for a pattern: you don't find it, so you shift
by one bit and try again);
d) you can do pruning (e.g. if you don't find any evidence of a match
after doing 10% of the sequence, you can abandon that vector and try
the next);
e) you can match multiple input vectors instead of just one (since
most of your conventional processor time is going to be spent waiting
around for slow DRAM to fetch the next chunk of megabit matching
vectors, you may as well compare it against a few dozen inputs rather
than just one).
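To illustrate point (a): if a reference vector is sparse, storing just
the indices of its 1-bits turns the million-bit dot product into a
handful of bit tests. A hedged Python sketch (function name and
representation are this example's invention):

```python
def sparse_dot(ones_idx, inp: int) -> int:
    # ones_idx: indices where the (sparse) reference vector is 1.
    # The dot product reduces to testing just those bits of the input.
    return sum((inp >> i) & 1 for i in ones_idx)

# Reference with 1s at bits 0, 3, 5; input 0b101001 has those bits set.
print(sparse_dot([0, 3, 5], 0b101001))  # → 3
```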

As a Python programmer, you will probably find it easier to use C than
to learn VHDL/Verilog to the extent you need to implement this. If a
single order-of-magnitude speed-up will solve your problems, then
changing to a language closer to the metal may be enough and is easy
enough to try.

--
David M. Palmer [email protected] (formerly @clark.net, @ematic.com)

Wow, you guys have all been really helpful.

The conclusion seems to be that it would depend on system bandwidth,
but if there are 20 (possibly 1,000 as I scale to larger problems and
need greater capacity) "reference" million-bit vectors, then I only
need to read the input vector from memory 1/20th as often as your
figure assumes, right?

Jan wrote:
>> Budget? Time? All counts.
I'd prefer a scalable solution so I can pay more later for more speed.
I parse pieces of a sentence independently, so the problem breaks up
easily across several FPGA cards, if that is cheaper. I perform Monte
Carlo sampling, which means that I run 30 independent experiments. So
if I'm bandwidth limited I should break up the problem across 30
FPGA-equipped computers.

Kolja wrote:
>> On the other hand the P4 value seems to low. According to "Hacker's
>> Delight" pages 65ff counting the number of bits in a 32-bit word takes
>> less than 20 instruction.

I agree that I should write it in C/C++ to see what my max P4 speed
is, but http://www-db.stanford.edu/~manku/bi.../bitcount.html
indicates that 16-bit lookup tables are most efficient (the table fits
in L1 cache on an AMD64, which I don't have, using only 3 clock
cycles), which is what I already use. The next step is definitely to
implement it in C/C++ by hand.
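The 16-bit table method mentioned above, sketched in Python (in C the
table would be a static array of 65536 byte-sized entries, sized to
stay in cache):

```python
# One-time table: popcount of every possible 16-bit value.
TABLE16 = bytes(bin(i).count("1") for i in range(1 << 16))

def popcount_lut(x: int) -> int:
    # Count the 1-bits of an arbitrarily wide integer, 16 bits at a
    # time, via table lookups.
    total = 0
    while x:
        total += TABLE16[x & 0xFFFF]
        x >>= 16
    return total
```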

The consensus seems to be that it could be done well in an FPGA, and
that the price scales with the problem much better than clusters of
CPUs acting on their own. It also seems that I would have to hire
someone to build it, because it is not simple enough for a newbie to
try. I hack Xboxes, the i-Opener (net appliance), and other hardware,
so I am a DIY learning kind of guy, and it seems to me that at 27
years of age, being able to implement pieces of my algorithms in
hardware would be a great advantage for decades of my future work.

I could get $200 to $500 from my advisor for a developer's kit for
sure. Any additional suggestions on specific hardware, estimated time
to develop for a newbie, etc. are greatly appreciated.

[email protected] schrieb:
> but http://www-db.stanford.edu/~manku/bi.../bitcount.html indicates
> that 16-bit lookup tables are most efficient (fits in L1 cache on a
> AMD64, which I don't have, using only 3 clock cycles), which is what I
> already use.
That link suggests that for 32-bit words you got to about 2.4 Gbps on
a P4 in 2002. For 128-bit words in 2006 you probably get at least
twice that value, so something below 10 Gbps is likely the figure for
the P4 using SSE2.

> Wow, you guys have all been really helpful.
>
> The conclusion seems to be that it would depend on system bandwidth,
> but if there are 20 (possibly 1,000 as I scale to larger problems and
> need greater capacity) "reference" million-bit-vectors, then I only
> need to read the input vector from memory 1/20th as often as your
> figure, right?

My figures assumed that you need to read the vectors that you compare
the input against from memory, and that that dominates the bandwidth.
If you can afford to store all the reference vectors on chip, with
only the input vector streamed in, you can reach insane speeds (TM).

You can fit about 10 such vectors in a very expensive FPGA.

If you do your computation for many input vectors for the same reference
vectors, you can also do the following:

Split the on-chip memory to contain k bits each of N reference
vectors, as well as M x N intermediate counts for processing M input
vectors. Now you load part of your reference vectors and process the
corresponding parts of the input vectors over and over again.
For 100 reference vectors, 100 input vectors, and k=32, you get 51200
bits per bus cycle (you read 200 words for 10000 dot products of 32
bits each).
With a 200 MHz bus that's 1 Tbps.
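A Python model of that blocking scheme may make the reuse concrete (a
dataflow sketch only; the on-chip memory sizes and bus timing are
abstracted away): each k-bit slice read from the bus is reused against
every vector on the other side, which is where the bandwidth
multiplier comes from.

```python
def blocked_dot_products(refs, inputs, k=32):
    # refs: N reference vectors, inputs: M input vectors, each
    # pre-split into k-bit words. Returns the M x N table of dot
    # products.
    counts = [[0] * len(refs) for _ in inputs]
    for w in range(len(refs[0])):            # one k-bit slice at a time
        ref_slice = [r[w] for r in refs]     # N bus words, read once...
        in_slice = [v[w] for v in inputs]    # ...M bus words, read once,
        for m, iv in enumerate(in_slice):    # then reused M*N times.
            for n, rv in enumerate(ref_slice):
                counts[m][n] += bin(iv & rv).count("1")
    return counts
```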

That design uses about 20,000 LUTs and can be implemented on an
affordable XC3S1600E evaluation board with a fast 32-bit memory (did
anybody say PCIe?)

Kolja Sulimma

P.S.:
Masking bit vectors and doing sums of lookup table accesses can be done
really fast on graphics cards....

"David M. Palmer" <[email protected]> wrote in message
news:150420061212343276%[email protected]
> In article <[email protected] .com>,
> <[email protected]> wrote:
>
>> Hello,
>>
>> I am a Python programmer writing neural network code with binary firing
>> and binary weight values. My code will take many days to parse my
>> large data sets. I have no idea how much fpga could help, what the
>> cost would be, and how easy it would be to access it from Python. The
>> problem is similar to competitive networks, where I must dot product
>> many million-length bit vectors (which only change occasionally) with 1
>> input vector. Anybody want to estimate the cost, speedup, and value an
>> fpga could offer me?
>
> I assume you have looked for algorithmic speed-ups? (Also FPGAs have
> different algorithmic speed-ups available than conventional computers.)
>
> As a Python programmer, you will probably find it easier to use C than
> to learn VHDL/Verilog to the extent you need to implement this. If a
> single order-of-magnitude speed-up will solve your problems, then
> changing to a language closer to the metal may be enough and is easy
> enough to try.

In that vein, you should definitely look at the performance you can
get using the free (IIRC) Intel Performance Library. Bit-vector
manipulations are probably among the graphics-related operations
supported by MMX and its descendants in the P4 processor. If your
compiler isn't smart enough to use them natively (Intel sells
high-performance C compilers), then figuring out the free API and
calling it is still quite doable. We do some digital signal processing
(FFTs, filtering, reconstruction, etc.) on a mix of platforms.
Sometimes FPGAs make sense (less size, weight, and power) and
sometimes Pentiums or PowerPCs make sense (cost, ease of programming).

I see the PCI Express starter kit HW-S3PCIE-DK (at
http://www.xilinx.com/xlnx/xebiz/des...y=HW-S3PCIE-DK) will have a
Spartan-3E version due in "April", but I'm not sure it will be the
1600 version when it does come out. At $350 it is indeed a steal, and
outperforming a Pentium 4 by 100 times whets my appetite. Is this what
you're talking about? Implementing part of my algorithm in this
hardware effectively removes several factors from my run-time
complexity, woohoo!

FYI, my neural network theory revolves around several "feature
detectors" (the reference vectors) competing to recognize various
locality-sensitive hashes (where collisions imply nearness). Thus the
fundamental principle of competition between feature detectors could
become testable at the scale of the brain. The algorithm extends to
optical character recognition, speech recognition, and language
understanding, because constructs in these domains can all be
translated into the same language: locality-sensitive hashes (though
it may take a million bits, which the brain has no problem with).

Though I'll probably wait some months to use the FPGA so that I can
finish my dissertation, it will be nice to make claims about the
real-world speedups that can be expected from the massively parallel
characteristics of the brain-derived algorithm.

Thanks for your help, let me know if there is a product you think is
better suited,
AndrewF

[email protected] wrote:
> I am a Python programmer writing neural network code with binary firing
> and binary weight values. My code will take many days to parse my
> large data sets. I have no idea how much fpga could help, what the
> cost would be, and how easy it would be to access it from Python. The
> problem is similar to competitive networks, where I must dot product
> many million-length bit vectors (which only change occasionally) with 1
> input vector. Anybody want to estimate the cost, speedup, and value an
> fpga could offer me?

Probably you could get considerable speedup by changing
to a compiled language and by selecting a good algorithm.
The most likely speed limit is the rate at which you can get
data in and out of the computer.
Whether an FPGA can help, and how much, is likely to be affected by
how it's connected: USB, PCI, PCI-X, or even the front-side bus.

In article <[email protected] .com>,
<[email protected]> wrote:
>Hello,
>
>I am a Python programmer writing neural network code with binary firing
>and binary weight values. My code will take many days to parse my
>large data sets. I have no idea how much fpga could help, what the
>cost would be, and how easy it would be to access it from Python. The
>problem is similar to competitive networks, where I must dot product
>many million-length bit vectors (which only change occasionally) with 1
>input vector.

How dense are your rarely-changing bit vectors, and if you write them
down in 15-bit chunks, how many of the 32768 possible 15-bit chunks
are used? If the answer to the first is 'not very' then you can use
sparse techniques; if the answer to the second is 'few', then look-up
tables will fit nicely in L2 cache.

Bitwise AND of two 128-bit vectors ought to take under a nanosecond on
a fast P4; then you can do various popcount tricks bytewise in the
SSE2 registers, add up the subregisters with a PSADBW against a
register containing zeroes, accumulate in the two halves of an SSE
register, and add them at the end. I really think you should ask the
people in comp.lang.asm.x86 how fast they can get your code on a CPU
before contemplating FPGAs: you ought to be able to manage main-memory
speed, which is about two thousand a second, or faster if you can fit
things into L2 cache.

An FPGA will be limited by memory speed as well; whilst with an
unlimited budget you can get several channels of SRAM or DDR2 and
perhaps five times the peak memory performance of a P4, the hardware
is so expensive that you'd be better off getting a large number of
cheap P4 boxes unless you're doing this at truly large scale.

I think for the current project I will use C++, or hopefully ASM/SSE2
if someone writes it for me or I research it more. The advantage of
the FPGA is that many reference vectors (loaded from onboard memory)
can be multiplied by an incoming input vector (loaded over PCIe) at
the same time. With a Spartan-3 PCIe starter kit with DDR I could hope
for 200 to 250 Gbits per second of dot product. If you are right about
the P4 being fast enough to bottleneck on memory, then I could hope
for 25.6 Gbps memory bandwidth. That's actually pretty good.

When I switch to optical recognition I will have perhaps 1,000
reference vectors that will be much smaller, only about 1 kilobit each
or so. Since these reference vectors could fit in block RAM, I could
see 16 terabits per second on that project.

I will approach the assembly forum, thanks for your help.

> When I switch to optical recognition I will have perhaps 1,000
> reference vectors that will be much smaller, only about 1kilobit each
> or so. Since these reference vectors could fit in block ram I could
> see 16 terabit per second on that project.

Since the reference vectors are relatively stable, it may be effective
to integrate them with the logic that actually sums the bits, and keep
them in distributed RAM. See the thread on "Counting 1's" back in Oct
'03, which illustrates an approach. The input leaves of the tree start
the summing process by adding 3 bits in two LUTs. This leaves an extra
LUT input that could be used to select between two different reference
vectors embedded in the leaves. Using SRL16s would allow changing the
reference vectors. Just a thought.

> When I switch to optical recognition I will have perhaps 1,000
> reference vectors that will be much smaller, only about 1kilobit each
> or so. Since these reference vectors could fit in block ram I could
> see 16 terabit per second on that project.

If it takes a nanosecond to read a kilobit,
that is still only 1 terabit per second.
In any case, doing the ANDs will probably not
take much longer than it takes to read the data.
Counting the bits is more interesting.
Putting the FPGA on the front-side bus
would probably allow a helpful data rate.
As you said you were going to try,
assembly is a better next choice than an FPGA.

Michael Hennebry wrote:
> Putting the fpga on the front side bus
> would probably allow a helpful data rate.

Speaking of which, has anyone come out with a commercial version of
the Pilchard or TKDM type of approach yet? Simple and, I think,
elegant idea: FPGA(s) on a DIMM stick that sits in a main-memory slot.