FPGA Central - World's 1st FPGA / CPLD Portal

FPGA comp.arch.fpga newsgroup (usenet)

  #1 (permalink)  
Old 04-28-2006, 08:17 PM
Jan Panteltje
Guest
 
Posts: n/a
Default DRC has announced its newest FPGA that drops into AMD's Socket 940

http://www.dailytech.com/article.aspx?newsid=1920

So... I do see a possibility here.
  #2 (permalink)  
Old 04-28-2006, 08:48 PM
Guest
 
Posts: n/a
Default Re: DRC has announced its newest FPGA that drops into AMD's Socket 940


Jan Panteltje wrote:
> http://www.dailytech.com/article.aspx?newsid=1920
>
> So... I do see a possibility here.


Definitely cool.

But only where an FPGA is truly handy. E.g. grid work.

I think servers [e.g. SSL work] are better served by two processors than
by one processor and the FPGA.

8x200MHz only provides 400MB/sec traffic to the CPU so really this is
useful for tasks which either totally reside on the FPGA side of the
board or have really high latency (e.g. PK work).

The FPGA would have to beat ~3500 RSA-1024/sec before it would be worth
more than an Opteron 275 (even more for the 285s) in the same socket.
I'd see use for this in animation work though where an FPGA can
raytrace a scene much faster than a CPU can and the work is high
latency.

Still cool though. Good to see people using the 940 socket for more
than just Opterons :-)

Tom

  #3 (permalink)  
Old 04-28-2006, 09:28 PM
Paul Rubin
Guest
 
Posts: n/a
Default Re: DRC has announced its newest FPGA that drops into AMD's Socket 940

[email protected] writes:
> The FPGA would have to beat ~3500 RSA-1024/sec before it would be worth
> more than an Opteron 275 (even more for the 285s) in the same socket.


What about the number of AES/sec?
  #4 (permalink)  
Old 04-28-2006, 09:39 PM
c d saunter
Guest
 
Posts: n/a
Default Re: DRC has announced its newest FPGA that drops into AMD's Socket 940

[email protected] wrote:

: Jan Panteltje wrote:
: > http://www.dailytech.com/article.aspx?newsid=1920

<snip>

: 8x200Mhz only provides 400MB/sec traffic to the CPU so really this is
: useful for tasks which either totally reside on the FPGA side of the
: board or have really high latency (e.g. PK work).

Sitting on the HT bus like that offers residence about as close as you can
get to a mainstream CPU. Given the new HT3 stuff - faster and links
possible over 1 meter - i.e. directly joining blades - I really like this
approach. Especially given the memory architecture that goes along with
HT/Opterons. It's bringing mainstream CPUs and FPGAs back into the point
to point multiple interconnect world of the TigerSHARCs and the old TI
C40s.

It feels a bit like a resurgence of the old British Transputer except with
gate arrays mixing with CPUs on an equal footing in terms of connectivity.

cds
  #5 (permalink)  
Old 04-28-2006, 09:48 PM
Guest
 
Posts: n/a
Default Re: DRC has announced its newest FPGA that drops into AMD's Socket 940


Paul Rubin wrote:
> [email protected] writes:
> > The FPGA would have to beat ~3500 RSA-1024/sec before it would be worth
> > more than an Opteron 275 (even more for the 285s) in the same socket.

>
> What about the number of AES/sec?


If it were triggered independently it'd be worth it. A 2.6GHz
processor [less than half the cost of this FPGA] can do up to 10,156,250
AES-128-ECB/sec with plain C code. That's roughly 155MiB/sec
(162.5MB/sec) of throughput.

Now, if you could have this thing trigger automatically. For instance,
have an APIC that responds to interrupts from a network controller that
would be a boost.

The typical AES core takes ~14 cycles to encrypt a block, but cores in
FPGAs normally run at a couple hundred MHz at most [usually topping out
between 100 and 200MHz]. 200MHz is 13 times slower than 2.6GHz, so those
14 cycles are equivalent to 182 cycles at 2.6GHz. This is less than the
256 cycles that the Opteron takes but only marginally so. For the cost of
it you'd be better served by dropping another Opteron in. A single 285
[both cores] could top out at 20.3M AES/sec, which is way more than the
typical FPGA can hope to achieve.
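
The cycle arithmetic in this post can be checked with a few lines. This is only a minimal sketch in Python using the figures quoted above (2.6GHz, 256 cycles per block on the Opteron, ~14 cycles per block at 200MHz on the FPGA). The function names are made up for illustration, the FPGA core is assumed to be non-pipelined, and the Opteron 285 is treated as two 2.6GHz cores.

```python
# Back-of-envelope check of the AES figures quoted above.
# All inputs come from the post; function names are illustrative only.

AES_BLOCK_BYTES = 16  # AES always works on 128-bit blocks


def blocks_per_second(clock_hz: float, cycles_per_block: float) -> float:
    """Blocks/sec for a core that finishes one block every cycles_per_block cycles."""
    return clock_hz / cycles_per_block


def equivalent_cycles(cycles: float, slow_hz: float, fast_hz: float) -> float:
    """Express a cycle count at slow_hz as the number of fast_hz cycles taking the same time."""
    return cycles * fast_hz / slow_hz


opteron_core = blocks_per_second(2.6e9, 256)        # one 2.6GHz core, plain C
opteron_285 = 2 * opteron_core                      # assumption: two such cores per 285
fpga_latency = equivalent_cycles(14, 200e6, 2.6e9)  # 14 FPGA cycles in Opteron cycles

print(f"{opteron_core:,.0f} blocks/sec per core")
print(f"{opteron_core * AES_BLOCK_BYTES / 2**20:.0f} MiB/sec per core")
print(f"{opteron_285 / 1e6:.1f}M blocks/sec for both cores")
print(f"{fpga_latency:.0f} Opteron cycles per FPGA block")
```

Note that this only compares latency per block; a pipelined FPGA core would change the throughput side of the comparison.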

Where this would fly I think is on PDU work as I described tying
directly to the network controller. You really need higher latency
work.

It should also be trivial to get ECC [especially binary field] PK much
faster and lower latency on an FPGA than the typical Opteron.

Tom

  #6 (permalink)  
Old 04-28-2006, 09:55 PM
Guest
 
Posts: n/a
Default Re: DRC has announced its newest FPGA that drops into AMD's Socket 940

c d saunter wrote:
> [email protected] wrote:
>
> : Jan Panteltje wrote:
> : > http://www.dailytech.com/article.aspx?newsid=1920
>
> <snip>
>
> : 8x200Mhz only provides 400MB/sec traffic to the CPU so really this is
> : useful for tasks which either totally reside on the FPGA side of the
> : board or have really high latency (e.g. PK work).
>
> Sitting on the HT bus like that offers residence about as close as you can
> get to a mainstream CPU. Given the new HT3 stuff - faster and links
> possible over 1 meter - i.e. directly joining blades - I really like this
> aproach. Especially given the memory architecture that goes along with
> HT/Opterons. It's bringing mainstream CPUs and FPGAs back into the point
> to point multiple interconnect world of the TigerSHARCs and the old TI
> C40s.


HT links are not solely designed for speed. Latency is the key. 16
lanes of PCIe can compete just fine with a 16x16 1GHz HT link in terms
of bandwidth.
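
The bandwidth comparison can be made concrete with a rough sketch. It assumes double-data-rate signalling on HyperTransport, first-generation PCIe (2.5GT/s per lane with 8b/10b encoding), and that "8x200MHz" means an 8-bit link clocked at 200MHz. Packet and protocol overhead are ignored, and the helper names are invented for this example.

```python
# Rough per-direction link bandwidth, ignoring packet/protocol overhead.
# Helper names are hypothetical; the PCIe figures assume first-generation PCIe.


def ht_bytes_per_sec(width_bits: int, clock_hz: float) -> float:
    """HyperTransport moves width_bits on both clock edges (DDR)."""
    return width_bits * clock_hz * 2 / 8


def pcie1_bytes_per_sec(lanes: int) -> float:
    """PCIe 1.x: 2.5 GT/s per lane, 8b/10b coding leaves 8 data bits per 10 sent."""
    return lanes * 2.5e9 * (8 / 10) / 8


narrow_ht = ht_bytes_per_sec(8, 200e6)   # the 8x200MHz case from earlier in the thread
wide_ht = ht_bytes_per_sec(16, 1e9)      # one direction of a 16x16 1GHz link
pcie_x16 = pcie1_bytes_per_sec(16)       # one direction of a x16 slot

print(f"HT 8-bit @ 200MHz : {narrow_ht / 1e6:.0f} MB/sec")
print(f"HT 16-bit @ 1GHz  : {wide_ht / 1e9:.1f} GB/sec")
print(f"PCIe 1.x x16      : {pcie_x16 / 1e9:.1f} GB/sec")
```

Under these assumptions the wide HT link and the x16 PCIe slot come out about even on raw bandwidth, so latency is what separates them.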

Oddly enough the best tasks for this are things which don't return back
to back [e.g. raytrace a scene].

What this does open the door for though is for mixed architecture
systems. E.g. synthesize a MIPS core in the FPGA and map the DDR
controller on to it.

Then you have x86 and MIPS in the same system.

That'd be cool.

Tom

  #7 (permalink)  
Old 04-28-2006, 10:02 PM
c d saunter
Guest
 
Posts: n/a
Default Re: DRC has announced its newest FPGA that drops into AMD's Socket 940

[email protected] wrote:

: HT links are not solely designed for speed. Latency is the key. 16
: lanes of PCIe can compete just fine with a 16x16 1Ghz HT link in terms
: of bandwidth.

: Oddly enough the best tasks for this are things which don't return back
: to back [e.g. raytrace a scene].

I wouldn't call that odd - a modern CPU hiding behind caches with long
pipelines is always going to struggle with low latency
back/forwards/back/forwards shared tasks with an FPGA/Clearspeed/xxx
- certainly interesting things happen with FPGA silicon and CPU
silicon coupled in a SOC or on an FPGA but the clock rates are far below a
dedicated CPU.

On the serial / parallel issue I have a leaning towards parallel for
simplicity when it comes to the FPGA code and latency, although serial has
benefits for physical complexity and routing. Also it feels like they
leap frog each other every few months in terms of bandwidth! The world is
squeezing itself down a thin pipe these days though...

: What this does open the door for though is for mixed architecture
: systems. E.g. synthesize a MIPS core in the FPGA and map the DDR
: controller on to it.

: Then you have x86 and MIPS in the same system.

: That'd be cool.

An awful lot of cool things are on their way...
  #8 (permalink)  
Old 04-28-2006, 11:08 PM
Paul Rubin
Guest
 
Posts: n/a
Default Re: DRC has announced its newest FPGA that drops into AMD's Socket 940

[email protected] writes:
> The typical AES core takes ~14 cycles to encrypt but in FPGAs normally
> run at most at a couple hundred MHz at most [usually topping out
> between 100 and 200Mhz at most]. 200Mhz is 13 times less than 2.6Ghz
> which is equivalent to 182 cycles at 2.6Ghz. This is less than the 256
> cycles that the Opteron takes but only marginally so.


I'd think if you're going to use such an expensive and exotic approach
at all, you'd pipeline it to get one AES operation per cycle, maybe
even more than one if you're doing something like EAX mode, or CTR
mode on a large block in parallel.
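
To illustrate why CTR mode suits parallel or pipelined hardware, here is a small sketch in Python. It is not AES: `toy_block_cipher` is a hypothetical stand-in built from SHA-256, used only to show that each keystream block depends on nothing but the key, nonce and counter, so the blocks can be computed in any order. The thread pool merely stands in for several hardware pipelines.

```python
# CTR mode sketch: every keystream block depends only on (key, nonce, counter),
# so blocks can be produced in any order or all at once.
# toy_block_cipher is a SHA-256 based stand-in, NOT AES; it only illustrates the data flow.
import hashlib
from concurrent.futures import ThreadPoolExecutor

BLOCK = 16  # bytes per block, as in AES


def toy_block_cipher(key: bytes, block: bytes) -> bytes:
    """Hypothetical stand-in for AES encryption of one 16-byte block."""
    return hashlib.sha256(key + block).digest()[:BLOCK]


def keystream_block(key: bytes, nonce: bytes, counter: int) -> bytes:
    """Encrypt nonce||counter; no dependence on any other block."""
    return toy_block_cipher(key, nonce + counter.to_bytes(8, "big"))


def ctr_xor(key: bytes, nonce: bytes, data: bytes, parallel: bool = False) -> bytes:
    """Encrypt or decrypt data (the same operation in CTR mode)."""
    n_blocks = (len(data) + BLOCK - 1) // BLOCK
    counters = range(n_blocks)
    if parallel:
        # Stand-in for several hardware pipelines working on different counters.
        with ThreadPoolExecutor(max_workers=4) as pool:
            stream = list(pool.map(lambda c: keystream_block(key, nonce, c), counters))
    else:
        stream = [keystream_block(key, nonce, c) for c in counters]
    keystream = b"".join(stream)
    return bytes(d ^ k for d, k in zip(data, keystream))


key, nonce = b"k" * 16, b"n" * 8
message = b"one AES operation per cycle, maybe even more than one"
ciphertext = ctr_xor(key, nonce, message, parallel=True)
print(ctr_xor(key, nonce, message) == ciphertext)   # sequential and parallel agree
print(ctr_xor(key, nonce, ciphertext) == message)   # CTR decrypts with the same function
```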
  #9 (permalink)  
Old 04-28-2006, 11:24 PM
Guest
 
Posts: n/a
Default Re: DRC has announced its newest FPGA that drops into AMD's Socket 940

Paul Rubin wrote:
> [email protected] writes:
> > The typical AES core takes ~14 cycles to encrypt but in FPGAs normally
> > run at most at a couple hundred MHz at most [usually topping out
> > between 100 and 200Mhz at most]. 200Mhz is 13 times less than 2.6Ghz
> > which is equivalent to 182 cycles at 2.6Ghz. This is less than the 256
> > cycles that the Opteron takes but only marginally so.

>
> I'd think if you're going to use such an expensive and exotic approach
> at all, you'd pipeline it to get one AES operation per cycle, maybe
> even more than one if you're doing something like EAX mode, or CTR
> mode ona large block in parallel.


Even with pipelining you're still on a fairly limited bus. At best you
top out at whatever the bus between the two actually is. Keep in mind
this is an FPGA and not an ASIC. So chances are it won't clock that high
anyways. My 200MHz quote is just a really really optimistic quote.
From what I recall from my past job you'd be lucky to get something
complicated like a PDU clocking higher than PCI freq [33MHz]. So while
you could get an AES core ~100MHz it would only be doing CTR mode at
most.

Block ciphers are not where this will shine. Especially when the other
processor is an Opteron.

The trick to making good use of something like an FPGA isn't serial
speed. Even if you designed a custom RISC ALU on the FPGA it'd clock
probably around 50MHz. Even with the best ISA you can craft for it the
Opteron could EMULATE the thing faster than you could run it. Where
the FPGA will shine is for tasks with a LOT of parallel computation.
Think like 16 FPU pipelines or a single cycle GF(2) multiplier, etc,
etc, etc.

Other tasks where this would shine would be custom DSP filters, e.g.
offload MPEG work. A FIR or IIR filter of significant delay [e.g.
accuracy] could be constructed in a pipeline to get 1 sample/cycle at
decent clock rates.
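
As a rough illustration of the "1 sample/cycle" idea, the following Python sketch models a transposed-form FIR filter one clock at a time. The transposed form is just one common way to pipeline a FIR and is an assumption here, not something stated in the post. Tap values and function names are invented for the example.

```python
# Cycle-by-cycle model of a transposed-form FIR filter: one new sample enters
# and one output leaves on every clock, regardless of the number of taps.
# The tap values and names are made up for illustration.


def fir_pipeline(samples, taps):
    """Simulate the register chain; each loop iteration is one clock cycle."""
    regs = [0] * (len(taps) - 1)  # pipeline registers between the adders
    out = []
    for x in samples:
        # In hardware all multiplies below happen in parallel in one cycle.
        products = [h * x for h in taps]
        out.append(products[0] + (regs[0] if regs else 0))
        # Each register takes the next stage's old value plus its own product.
        regs = [products[i + 1] + (regs[i + 1] if i + 1 < len(regs) else 0)
                for i in range(len(regs))]
    return out


def fir_reference(samples, taps):
    """Plain convolution sum, as a CPU would compute it sample by sample."""
    return [sum(h * samples[n - k] for k, h in enumerate(taps) if n - k >= 0)
            for n in range(len(samples))]


taps = [1, 2, 3, 4]
samples = [5, 0, -1, 2, 7, 3]
print(fir_pipeline(samples, taps))
print(fir_pipeline(samples, taps) == fir_reference(samples, taps))
```

On a CPU the work per sample grows with the number of taps; in the pipelined structure the extra taps cost area rather than time.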

Tom

  #10 (permalink)  
Old 04-28-2006, 11:55 PM
Austin Lesea
Guest
 
Posts: n/a
Default Re: DRC has announced its newest FPGA that drops into AMD's Socket 940

tomstdenis,

http://www.xilinx.com/bvdocs/ipcente...data_sheet.pdf

Shows some of the claimed clock rates for their AES encrypt/decrypt IP
core. 257 MHz (V2 Pro) to 252 MHz (V4). Throughput in b/s is ~ 2 to 3
X the clock rate (per this datasheet). Other cores run just shy of 200 MHz.

Other data from this same vendor makes claims of up to 20 Gb/s for
throughput of their 'fast' FPGA based AES encryptors and decryptors.

At one time we made a 10 Gb/s decryptor to prove that distributing full
resolution theater real time movies could be done with one FPGA in the
'projector.' This prevents piracy by decrypting the movie at the
projector itself (at no time is the full digital information available
for copying).

This was back in the Virtex II days, so the 20 Gb/s claim is perfectly
reasonable for V4 today (IMO).

There are a number of other IP vendors with encryptors and decryptors
for our FPGAs.

http://xgoogle.xilinx.com/search?out...partialfields=
or
http://tinyurl.com/hajhj

Austin

  #11 (permalink)  
Old 04-29-2006, 01:22 AM
DJ Delorie
Guest
 
Posts: n/a
Default Re: DRC has announced its newest FPGA that drops into AMD's Socket 940


[email protected] writes:
> E.g. synthesize a MIPS core in the FPGA and map the DDR controller
> on to it.


Just one? Why not a couple dozen small purpose-designed RISC cores,
running in parallel?
  #12 (permalink)  
Old 04-29-2006, 01:42 AM
JJ
Guest
 
Posts: n/a
Default Re: DRC has announced its newest FPGA that drops into AMD's Socket 940


c d saunter wrote:
> [email protected] wrote:
>
> : Jan Panteltje wrote:
> : > http://www.dailytech.com/article.aspx?newsid=1920
>
> <snip>
>
> : 8x200Mhz only provides 400MB/sec traffic to the CPU so really this is
> : useful for tasks which either totally reside on the FPGA side of the
> : board or have really high latency (e.g. PK work).
>
> Sitting on the HT bus like that offers residence about as close as you can
> get to a mainstream CPU. Given the new HT3 stuff - faster and links
> possible over 1 meter - i.e. directly joining blades - I really like this
> aproach. Especially given the memory architecture that goes along with
> HT/Opterons. It's bringing mainstream CPUs and FPGAs back into the point
> to point multiple interconnect world of the TigerSHARCs and the old TI
> C40s.
>
> It feels a bit like a resurgence to the old British Transputer except with
> gate arrays mixing with CPUs on an equal footing in terms of connectivity.
>
> cds


Um yes it does look familiar doesn't it. If you go to the origins of HT
when it was called something else at AlphaWorks IIRC, the key people
had originally come from Inmos and had worked on the PLLs for the
Transputer and maybe those links too. The fellow is now a Fellow at AMD
after they bought them out. In a previous life, same people were at
Meiko and did their own routers used to stitch up T800s then later
several other cpus ultimately leading to the Alpha platform after Meiko
went belly up.

When I first heard Xilinx was taking a HT license, which seems a long
time ago now, I wondered when this would happen.

When I first saw the early marketing for the Hammer with 1, 2, 3 or 4 of
these HT links and the memory channel too, I could only say out loud:
looks & smells like a Transputer to me with 20yrs development. But it
isn't really; it doesn't have the process scheduler or any real support
for programming concurrently per occam, just links. And when I see the
product today with a huge price premium on the number of HT links, I am
disappointed: one Opteron with 1 link is cheap enough, but add more links
and the cost goes way up as it looks more and more like a server platform.
The number of links on the Transputer was always an issue back then; 4 is
a minimum.

The socket module though looks a bit like SFF TRAM module but the multi
socket Opteron boards are not really TRAM carriers that can be
populated with general purpose computing modules on a grid. Perhaps
that will come back again but probably with more modest links.

I have been suggesting a Transputer resurgence for some time by
building an FPGA Processor Element hooked up with a specialized MMU
that shares the available memory bandwidth of RLDRAM amongst many PEs
using latency hiding Multithreading to make the PEs not appear to have
any memory wall. By distributing n PE+MMUs into the fabric, one can
then add algorithm specific extensions or coprocessors to each and copy
the node in systolic fashion over the array. Each PE uses only 1
BRam, so quite a few PEs would fit. The Transputer is really now
defined by all the good stuff that goes into the MMU rather than the
PEs. There is a paper on it at wotug.org for anyone interested.

When you build algorithms in FPGA around arrays of customizable PEs I
think some of the reasons for having an Opteron in the system may
become moot. Put the cpus into the FPGA, as many copies as you can get,
since all the real bandwidth is in/out of all the Blockrams, not the
more limited I/O pins.

I will have to look more into HT3 though.

John Jakson
transputer guy

  #13 (permalink)  
Old 04-29-2006, 02:49 AM
DJ Delorie
Guest
 
Posts: n/a
Default Re: DRC has announced its newest FPGA that drops into AMD's Socket 940


"JJ" <[email protected]> writes:
> I have been suggesting a Transputer resurgence for some time by


I always thought it would be neat to design a CPU cell in a QFP fpga,
such that all the pins on each side were designed to interface to an
adjacent cell - making the PCB routing trivial. The cells along the
boundary would be programmed to use the free edges to talk to external
peripherals.

I suppose with a BGA you could use the outer rows to talk to adjacent
cells, and the inner rows to interface to a RAM chip on the other side
of the board.
  #14 (permalink)  
Old 04-29-2006, 03:42 AM
JJ
Guest
 
Posts: n/a
Default Re: DRC has announced its newest FPGA that drops into AMD's Socket 940


DJ Delorie wrote:
> "JJ" <[email protected]> writes:
> > I have been suggesting a Transputer resurgence for some time by

>
> I always thought it would be neat to design a CPU cell in a QFP fpga,
> such that all the pins on each side were designed to interface to an
> adjacent cell - making the PCB routing trivial. The cells along the
> boundary would be programmed to use the free edges to talk to external
> peripherals.
>


Given the FPGA resources needed for 1 PE, 1 BRam & about 500 Luts/FFs
and then putting around 10 with a shared MMU which requires unknown
resources at this time, one might get a combined resource figure that
is still insignificant compared to the size of the largest FPGAs that
would likely be placed in these 940 sockets.

Each MMU uses more resources than a few PEs but also would chew up a
good portion of the I/O pins, say 120 or so for 1 RLDRAM interface and
more for external links. It becomes obvious one is really I/O limited or
content limited so an array of much smaller FPGAs makes more sense on a
TRAM carrier type board. Then every FPGA might get 4 MMU memory systems
giving effectively 40 or so PEs running at 300MHz or 100MIPS each. The
total 40x100MIPS still doesn't look so good compared to 1 Opteron, but
the system is very different. You end up with 160 or so threads since
each PE is a 4 way MTA. You have to have every thread busy and that
requires occam or HDL like parallel programming versus possibly only 1
thread on an Opteron. The big payback is that all these threads get to
see almost no memory wall with full random access over their local
memory banks with some additional latency for nearby MMUs and more so
for off FPGA nodes.

You either have a thread wall or you have a memory wall. The thread
wall is not really a problem for occam, csp, Transputer, parallel
people but is a huge barrier to most Opteron customers. The memory wall
though is a real problem requiring possibly 1000 clock cycle memory
accesses for all accesses that miss the cache system and caches can
never be big enough for the sorts of datasets some have in mind, nor
can the TLBs have enough associativity. I believe these memory walls
are most likely halving typical throughput of sequential cpus for even
a modest miss rate. That's why I am prone to suggesting getting rid of
the Opteron and putting the cpus right inside the algorithm with local
copros per PE or better still per MMU. One such copro could be a FP
unit which uses the same reasoning as the MMU. If a FP unit can deliver
1 flop per clock shared over 40 threads, each thread gets FP slices
with very little latency, on the order of a load or store op.
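
The "halving" claim can be sanity-checked with a toy model. This is a sketch only: the 1000-cycle miss penalty is taken from the post, while the base CPI of 1, the assumption that one opcode in five touches memory, and the miss rates are illustrative numbers, not measurements.

```python
# Toy model of the memory wall: average cycles per instruction when a small
# fraction of memory operations miss the caches and stall for a long time.
# The 1000-cycle penalty is from the post; the other numbers are assumptions.


def effective_cpi(base_cpi, mem_op_fraction, miss_rate, miss_penalty_cycles):
    """Base CPI plus the average stall cycles added per instruction."""
    return base_cpi + mem_op_fraction * miss_rate * miss_penalty_cycles


def relative_throughput(base_cpi, mem_op_fraction, miss_rate, miss_penalty_cycles):
    """Throughput relative to a machine that never misses (1.0 = no loss)."""
    return base_cpi / effective_cpi(base_cpi, mem_op_fraction, miss_rate, miss_penalty_cycles)


for miss_rate in (0.0, 0.001, 0.005, 0.02):
    frac = relative_throughput(base_cpi=1.0, mem_op_fraction=0.2,
                               miss_rate=miss_rate, miss_penalty_cycles=1000)
    print(f"miss rate {miss_rate:.1%}: {frac:.0%} of peak throughput")
```

With these assumed numbers a miss rate of only 0.5% is already enough to halve throughput.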


> I suppose with a BGA you could use the outer rows to talk to adjacent
> cells, and the inner rows to interface to a RAM chip on the other side
> of the board.


I haven't really worried too much about packaging BGA v edge connected,
I suspect that the medium size parts are big enough to hold enough PEs
and use up the I/O for RLDRAM and some for HT like links. I would
probably put each FPGA & related RLDRAM on its own module so it would
look a little like these DRC modules or really a SFF modern TRAM. That
separates the module design from the motherboard design and then you
can get some volume on these modules.

Don't even ask why I wouldn't use regular SDRAM, about 20x less random
throughput, would effectively limit me to only 1-2 or so PEs per MMU,
and that would leave the FPGA almost empty.

John Jakson
transputer guy

  #15 (permalink)  
Old 04-29-2006, 04:24 AM
JJ
Guest
 
Posts: n/a
Default Re: DRC has announced its newest FPGA that drops into AMD's Socket 940


For any processor with no substantial caches, one might assume every
5th opcode is a load or store, for a nice register heavy design, maybe
every 10th opcode. For a classic SDRAM interface the performance will
be very poor. The usual thing to do is to gang up lots of very
expensive Brams into I, D caches which gives up a lot of the parallel
bandwidth they each have when used separately. Even still each core now
uses lots of Bram, some cpu logic, an SDRAM controller and a good chunk
of the I/O is gone. That sort of system can be replicated maybe 4 times
depending on I/O count and none of these has any performance to write
home about. But one could put additional algorithmic content next to
each node.

Memory limits and hence I/O pads are the crux of the problem. My
Transputer design uses 1 Bram/PE hence on paper maybe 554 PEs might fit
in the biggest FPGA but that doesn't work. The Lut/Bram usage takes it
down to half that and then assume the MMUs consume the rest of the
fabric in a regular tiling. Still the memory traffic of 250 odd PEs
can't be funneled through maybe only 4 memory interfaces, even RLDRAM,
so the PE count either has to come way down and/or more of the Brams
have to be used as local caches which gives up a lot of their bandwidth
again.

One way around the I/O limit I have been thinking of is to bring the
RLDRAM inside the FPGA. Since we can't do that, instead replicate the
RLDRAM logical architecture of n concurrent slower banks using up all
remaining BRam, aggregating them into cache that can be shared with
multiple PEs at the L1 level. Only when those miss does the L2 RLDRAM
come into play, so trading down PEs for Bram caches allows more
Transputer nodes to share the few RLDRAM interfaces.

..( (n*PE + MMU + Bram cache)*k + MMU + RLDRAM interface) *4 or so.
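
To see what the formula implies for resource totals, here is a small Python sketch that expands it. The values of n and k are placeholders chosen for the example, not figures from the post, and the function name is made up.

```python
# Component count for the hierarchy sketched in the formula above:
# ((n*PE + MMU + Bram cache)*k + MMU + RLDRAM interface) * channels
# n, k and channels are free parameters; the values below are examples only.
from collections import Counter


def count_components(n: int, k: int, channels: int = 4) -> Counter:
    """Expand the nested formula into totals per component type."""
    cluster = Counter({"PE": n, "MMU": 1, "Bram cache": 1})        # n*PE + MMU + Bram cache
    channel = Counter({name: qty * k for name, qty in cluster.items()})
    channel.update({"MMU": 1, "RLDRAM interface": 1})              # + MMU + RLDRAM interface
    return Counter({name: qty * channels for name, qty in channel.items()})


totals = count_components(n=10, k=3)
for name, qty in totals.items():
    print(f"{name}: {qty}")
```

The "4 or so" outer factor is what the pin count limits, so growing n or k is the only way to add PEs without adding memory channels.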

Q:
I am curious about how many separate memory channels people have
actually put onto the largest FPGAs. I suspect on the high end for
independent RLDRAM controllers it is around 4 due to specialized use
of the clock resources needed to make the DDR interfaces work. I also
wonder if these serial interface DRAMs have come out yet that would
allow many more memory channels per FPGA.

John Jakson
transputer guy

  #16 (permalink)  
Old 04-29-2006, 07:26 AM
Allan Herriman
Guest
 
Posts: n/a
Default Re: DRC has announced its newest FPGA that drops into AMD's Socket 940

On Fri, 28 Apr 2006 15:55:32 -0700, Austin Lesea <[email protected]>
wrote:

>tomstdenis,
>
>http://www.xilinx.com/bvdocs/ipcente...data_sheet.pdf
>
>Shows some of the claimed clock rates for their AES encrypt/decrypt IP
>core. 257 MHz (V2 Pro) to 252 MHz (V4). Throughput in b/s is ~ 2 to 3
>X the clock rate (per this datasheet). Other cores run just shy of 200 MHz.
>
>Other data from this same vendor makes claims of up to 20 Gbs for
>throughput of their 'fast' FPGA based AES encryptors and decryptors.
>
>At one time we made a 10 Gbs decryptor to prove that distributing full
>resolution theater real time movies could be done with one FPGA in the
>'projector.' This prevents piracy by decrypting the movie at the
>projector itself (at no time is the full digital information available
>for copying).
>
>This was back in the Virtex II days, so the 20 Gbs claim is perfectly
>reasonable for V4 today (IMO).



20Gb/s was perfectly reasonable for V2P a few years ago.

I can't tell you how I know that

Allan


  #17 (permalink)  
Old 05-05-2006, 07:48 AM
Rob Warnock
Guest
 
Posts: n/a
Default Re: DRC has announced its newest FPGA that drops into AMD's Socket 940

<[email protected]> wrote:
+---------------
| What this does open the door for though is for mixed architecture
| systems. E.g. synthesize a MIPS core in the FPGA and map the DDR
| controller on to it.
|
| Then you have x86 and MIPS in the same system.
+---------------

But *not* necessarily running ccNUMA with each other!! See my recent
post on "comp.lang.lisp" [yeah, they were talking about the prospects
for using the same DRC FPGA for an update on the Lisp Machine]:

http://groups.google.com/group/comp....1488796602931d

especially the bits about the difference between "non-coherent HT",
used for ordinary I/O (PIOs & DMA), and the "coherent HT" used for
the inter-Opteron ccNUMA cache-coherency. I *strongly* suspect the
DRC FPGA[1] only does non-coherent HT, which, while just fine for
a DMA-style crypto co-processor, wouldn't let your FPGA-based MIPS
CPU participate in the Opteron cache-coherency protocol.


-Rob

[1] Well, the *chip* could probably do either; I'm actually referring
to whatever libraries of HT protocol support that come with it.

-----
Rob Warnock <[email protected]>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607

  #18 (permalink)  
Old 05-05-2006, 11:50 AM
Guest
 
Posts: n/a
Default Re: DRC has announced its newest FPGA that drops into AMD's Socket 940


Rob Warnock wrote:
> <[email protected]> wrote:
> +---------------
> | What this does open the door for though is for mixed architecture
> | systems. E.g. synthesize a MIPS core in the FPGA and map the DDR
> | controller on to it.
> |
> | Then you have x86 and MIPS in the same system.
> +---------------
>
> But *not* necessarily running ccNUMA with each other!! See my recent
> post on "comp.lang.lisp" [yeah, they were talking about the prospects
> for using the same DRC FPGA for an update on the Lisp Machine]:
>
> http://groups.google.com/group/comp....1488796602931d
>
> especially the bits about the difference between "non-coherent HT",
> used for ordinary I/O (PIOs & DMA), and the "coherent HT" used for
> the inter-Opteron ccNUMA cache-coherency. I *strongly* suspect the
> DRC FPGA[1] only does non-coherent HT, which, while just fine for
> a DMA-style crypto co-processor, wouldn't let your FPGA-based MIPS
> CPU participate in the Opteron cache-coherency protocol.


They would definitely be non coherent. For the device to work though
it would need a memory controller that interprets HT. There is no way
for the Opteron to talk to the other node (from a software point of
view) other than by memory read/writes.

Likely the device reserves a space for its registers then maps the
memory somewhere. The Opteron would have to be the boot processor and
it would set up the DRC's node link table and other jazz.

From the Opteron's point of view all of the memory on the DRC side of
the link would be uncacheable. That's about the only way to make it
work.

....

Still wants a x86-mips or even x86-ARM hybrid... muahahhahaaha

Tom

  #19 (permalink)  
Old 05-07-2006, 01:07 AM
Piotr Wyderski
Guest
 
Posts: n/a
Default Re: DRC has announced its newest FPGA that drops into AMD's Socket 940

Austin Lesea wrote:

> This prevents piracy by decrypting the movie at the
> projector itself (at no time is the full digital information available
> for copying).


How do you prevent the pirates from stealing your private
symmetric AES key from the FPGA? _This_ is the hard part,
not the decryption process. I can easily implement an over
1Gbit/s 128-bit AES en/decryptor even on a Cyclone, but it
is meaningless, as the key is not (and cannot be) protected.

Best regards
Piotr Wyderski

  #20 (permalink)  
Old 05-07-2006, 02:18 AM
Eric Smith
Guest
 
Posts: n/a
Default Re: DRC has announced its newest FPGA that drops into AMD's Socket 940

"Piotr Wyderski" <[email protected]> writes:
> How do you prevent the pirates from stealing your private
> symmetric AES key from the FPGA? _This_ is the hard part,
> not the decryption process. I can easily implement an over
> 1Gbit/s 128-bit AES en/decryptor even on a Cyclone, but it
> is meaningless, as the key is not (and cannot be) protected.


Use a Virtex II or Virtex 4 and it can be.

There are degrees of protection. The protection available in the
Virtex II or Virtex 4 isn't absolute, of course, but it would take
tremendous resources to extract a key embedded in one. You wouldn't
be able to read the key back out electrically due to the FPGA's own
encryption system, which is based on triple DES or AES with a key
in internal SRAM.

To extract the application symmetric AES key, you'd have to be able
to decap the FPGA without cutting power to it or shorting out any
internal nodes, then microprobe it. And you'd have to know *where*
to probe it; unless you had the original design files, just finding
where the application key was stored would be an immense task.

(Note that I'm not talking about finding out where the FPGA bitstream
decryption key is stored; that would be relatively easy since you
could use ANY decapped Virtex II/4 part to search for that. The
application's decryption key would be somewhere inside the FPGA
configuration.)
