DJ Delorie wrote:
> "JJ" <[email protected]> writes:
> > I have been suggesting a Transputer resurgence for some time by
>
> I always thought it would be neat to design a CPU cell in a QFP fpga,
> such that all the pins on each side were designed to interface to an
> adjacent cell - making the PCB routing trivial. The cells along the
> boundary would be programmed to use the free edges to talk to external
> peripherals.
>
Given the FPGA resources needed for 1 PE (1 BRAM and about 500 LUTs/FFs),
and then putting around 10 PEs on a shared MMU which requires unknown
resources at this time, one might get a combined resource figure that
is still insignificant compared to the size of the largest FPGAs that
would likely be placed in these 940 sockets.

Each MMU uses more resources than a few PEs but would also chew up a
good portion of the I/O pins, say 120 or so for 1 RLDRAM interface and
more for external links. It becomes obvious one is really I/O limited or
content limited, so an array of much smaller FPGAs makes more sense on a
TRAM carrier type board. Then every FPGA might get 4 MMU memory systems,
giving effectively 40 or so PEs running at 300MHz or 100 MIPS each. The
total of 40 x 100 MIPS still doesn't look so good compared to 1 Opteron,
but the system is very different. You end up with 160 or so threads
since each PE is a 4 way MTA. You have to keep every thread busy, and
that requires occam or HDL like parallel programming, vs possibly only
1 thread on an Opteron. The big payback is that all these threads get
to see almost no memory wall, with full random access over their local
memory banks, some additional latency for nearby MMUs, and more so for
off-FPGA nodes.

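As a rough sanity check on those numbers, here is a small Python sketch
of the per-FPGA arithmetic. The constant and function names are made up
for illustration, and the per-thread figure assumes the 4 threads of a
PE share its 100 MIPS evenly, which the figures above suggest but do not
state outright.

```python
# Back-of-envelope figures taken from the post above.
# All names here are illustrative, not part of any real design.
MMUS_PER_FPGA = 4      # MMU memory systems per FPGA
PES_PER_MMU = 10       # PEs sharing one MMU ("around 10")
THREADS_PER_PE = 4     # each PE is a 4 way MTA
MIPS_PER_PE = 100      # 300MHz clock, about 100 MIPS per PE


def fpga_totals(mmus=MMUS_PER_FPGA, pes_per_mmu=PES_PER_MMU,
                threads_per_pe=THREADS_PER_PE, mips_per_pe=MIPS_PER_PE):
    """Return PE count, thread count and aggregate MIPS for one FPGA."""
    pes = mmus * pes_per_mmu
    threads = pes * threads_per_pe
    return {
        "pes": pes,
        "threads": threads,
        "total_mips": pes * mips_per_pe,
        # Assumption: a PE's throughput is split evenly over its threads.
        "mips_per_thread": mips_per_pe / threads_per_pe,
    }


totals = fpga_totals()
for name, value in totals.items():
    print(f"{name}: {value}")
```

The aggregate only materializes if all 160 threads are kept busy, which
is the thread wall side of the trade.
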
You either have a thread wall or you have a memory wall. The thread
wall is not really a problem for occam, CSP, Transputer, parallel
people but is a huge barrier to most Opteron customers. The memory wall
though is a real problem, requiring possibly 1000 clock cycle memory
accesses for all accesses that miss the cache system, and caches can
never be big enough for the sorts of datasets some have in mind, nor
can the TLBs have enough associativity. I believe these memory walls
are most likely halving typical throughput of sequential CPUs for even
a modest miss rate. That's why I am prone to suggesting getting rid of
the Opteron and putting the CPUs right inside the algorithm with local
copros per PE or better still per MMU. One such copro could be an FP
unit, which uses the same reasoning as the MMU. If an FP unit can
deliver 1 flop per clock shared over 40 threads, each thread gets FP
slices with very little latency, on the order of a load or store op.

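To put a number on "halving throughput for a modest miss rate", here is
a minimal sketch using the usual textbook stall model, effective CPI =
base CPI + misses per instruction x miss penalty. The model, the base
CPI of 1 and the function name are my assumptions for illustration, not
measurements; only the 1000 cycle penalty comes from the text above.

```python
def relative_throughput(misses_per_instr, miss_penalty_cycles=1000,
                        base_cpi=1.0):
    """Throughput relative to a machine that never misses.

    Simple stall model (an assumption): every miss that goes all the
    way to memory stalls the single thread for the full penalty.
    """
    effective_cpi = base_cpi + misses_per_instr * miss_penalty_cycles
    return base_cpi / effective_cpi


# Misses per instruction that fall through the whole cache system.
for rate in (0.0, 0.0005, 0.001, 0.005):
    print(f"miss rate {rate:.4f} -> "
          f"{relative_throughput(rate):.2f}x of peak")
```

Under these assumptions, one full miss per 1000 instructions is already
enough to cut a sequential CPU to half its peak rate.
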
> I suppose with a BGA you could use the outer rows to talk to adjacent
> cells, and the inner rows to interface to a RAM chip on the other side
> of the board.
I haven't really worried too much about packaging, BGA vs edge connected.
I suspect that the medium size parts are big enough to hold enough PEs
and use up the I/O for RLDRAM and some for HT like links. I would
probably put each FPGA & related RLDRAM on its own module so it would
look a little like these DRC modules, or really a SFF modern TRAM. That
separates the module design from the motherboard design and then you
can get some volume on these modules.

Don't even ask why I wouldn't use regular SDRAM: about 20x less random
throughput would effectively limit me to only 1-2 or so PEs per MMU,
and that would leave the FPGA almost empty.

John Jakson
transputer guy