On Tue, 21 Oct 2008 11:30:46 -0700 (PDT), "Steven G. Johnson"
<[email protected]> wrote:
>On Oct 21, 3:32 am, Eric Jacobsen <[email protected]> wrote:
>> The cas() function has identical symmetry to cos() with an offset of
>> pi/4.  For a LUT that has a power-of-two length that offset is fairly
>> trivial to work around as far as folding the LUT is concerned.
>
>Except that between the cos and sin tables, for power-of-two sizes,
>there is a lot of additional redundancy because cos(x) = sin(pi/2 -
>x).
>
>So, for example, for n=16, if you look at the cos and sin tables,
>there are three distinct constants (not counting trivial constants
>like 1 and 0, and sign changes, all of which can be optimized out).
>For the cas table, there are four distinct nontrivial constants.
>
>In general, you can show that the number of nontrivial constants is
>the same, differing by at most one in favor of FFTs, just as I
>claimed.
>
>(For odd lengths, the pi/4 shift breaks the symmetry once you
>discretize, and the cas table has no reflection symmetry at all,
>unlike cos and sin, although in that case the cos and sin tables must
>be distinct.)
My point was just that there are simple equivalences between cas() and
sin() and cos() with trivial phase offsets and a scale constant. So
whatever you do for one can be done with the others. Any differences
come down to accesses and whether one needs two LUTs or can get by
with one, which might depend on implementation constraints and perhaps
favor one over the other. That's all I was getting at, and I think
it's still true.
You disputed my claim that cas() could be folded, and you were
incorrect to do so. It can be folded the same as sin() or cos(); one
can even use a reduced sin() or cos() LUT to compute cas() with a
trivial phase offset and an output scale factor, and the scale factor
can be moved further out in the process if desired.
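Just to make that concrete, here's a minimal numpy sketch (the table
length and the names are purely illustrative) of pulling cas() out of
an ordinary full-period cosine LUT; since cas(x) = cos(x) + sin(x)
= sqrt(2)*cos(x - pi/4), for a power-of-two table the pi/4 offset is
just an index shift of N/8:

    import numpy as np

    N = 16                                  # power-of-two LUT length (illustrative)
    k = np.arange(N)
    cos_lut = np.cos(2 * np.pi * k / N)     # the ordinary cosine table

    # cas(x) = cos(x) + sin(x) = sqrt(2)*cos(x - pi/4); for a full-period
    # power-of-two table the pi/4 phase offset is an index offset of N/8.
    cas_from_lut = np.sqrt(2.0) * cos_lut[(k - N // 8) % N]

    cas_direct = np.cos(2 * np.pi * k / N) + np.sin(2 * np.pi * k / N)
    assert np.allclose(cas_from_lut, cas_direct)

The sqrt(2) scale can be deferred and absorbed somewhere else in the
process, as I said.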
>> required for cas() compared to cos() and sin()?  I don't agree,
>> despite the fact that cas() can be folded the same as sin() and cos().
>> I think the amount of storage is the same, but since FHT twiddles are
>> always real-valued the accesses to that memory can be managed
>> differently.
>
>I'm glad you agree that the storage is the same; you had earlier
>claimed that there was a "memory saving is in the twiddle factors",
>which on its face sounded like you were claiming lower storage.
>
>Regarding the twiddles being "real valued" and hence yielding some
>important difference in the access pattern, you are mistaken. If you
>actually look at the structure of FHT algorithms, they still use pairs
>of constants at a time. In fact, if you look at Bracewell's original
>FHT algorithm and most subsequent FHT algorithms, they don't use the
>cas function at all -- the twiddle factors are actually pairs of (cos,
>sin) values, exactly as for an FFT.
I've done it both ways.
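For anyone following along, here's a toy radix-2 decimation-in-time FHT
sketch in numpy (illustrative only: power-of-two lengths, recursive,
nothing optimized) written the Bracewell way, with the twiddles entering
as (cos, sin) pairs rather than as cas() values:

    import numpy as np

    def fht(x):
        # Toy radix-2 decimation-in-time FHT, power-of-two lengths only.
        # The twiddles are applied as (cos, sin) pairs, not as cas() values.
        x = np.asarray(x, dtype=float)
        N = len(x)
        if N == 1:
            return x.copy()
        E = fht(x[0::2])                  # DHT of the even-indexed samples
        O = fht(x[1::2])                  # DHT of the odd-indexed samples
        M = N // 2
        k = np.arange(M)
        c = np.cos(2 * np.pi * k / N)
        s = np.sin(2 * np.pi * k / N)
        t = c * O + s * O[(M - k) % M]    # "pairs of constants at a time"
        return np.concatenate([E + t, E - t])

    # Sanity check against the direct definition H[k] = sum_n x[n]*cas(2*pi*n*k/N)
    N = 32
    x = np.random.randn(N)
    nk = np.outer(np.arange(N), np.arange(N))
    direct = x @ (np.cos(2 * np.pi * nk / N) + np.sin(2 * np.pi * nk / N))
    assert np.allclose(fht(x), direct)

Written either way, the twiddle storage and the access pattern come out
essentially the same, which is consistent with what I've been saying.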
>> Note the operative use of past-tense, since the experience I was
>> describing was in about 1988, which was pretty contemporary with the
>> early "Look how great the FHT is!" papers, as well as Sorensen's RFFT
>
>First, you have used present and future tense many times in this
>thread when describing the FHT's supposed substantial advantages in
>performance and memory requirements.
>
>Second, the fact that you cite your own 20-year-old results in support
>of your claim, from your admittedly incomplete understanding at the
>time of the differences (or lack thereof) between FHTs and real-input
>FFTs, does not make your claims any more correct. It just means that
>you have a better excuse for being wrong.
No, it means I've addressed you continually pointing to your mistaken
interpretation of what I wrote as ammunition for your position. I
even quoted what I originally wrote so that I could clarify what I
meant, and you still insist my position is something it is not. There
is no logical defense against that so there's not much I can do there.
I've cited some personal experience with memory savings in a system
(not necessarily the transform alone), and have been happy to clarify
that context. I have not claimed and do not believe that there are
any big differences between an FHT and an RFFT in either fundamental
computation load or memory requirement. I've known this for a couple
of decades. You seem to be in violent agreement on that point.
I do, however, know from a few decades of experience that little
differences can become "substantial" in making something work or not
under certain sets of constraints. It's too bad that you've chosen
to hang so much meaning on that word, but so be it. You seem to have
used that to mistakenly interpret what I wrote as a factor of two in
one case, and it took a fair amount of effort to get back to the point
where you don't really disagree with me there at all. I think. It's
hard to tell for certain.
>> In my experience, at that time, in my application, on my Amiga, with
>> Sorensen's translated RFFT and my hand-coded FHT, I was able to get my
>> algorithms working for my thesis using less memory (and about the same
>> computation load) with the FHT than with the RFFT.
>
>Then you implemented the algorithms imperfectly.
Personally, I think all implementations are imperfect and have
compromises, but that's just me.
>Both can operate in-
>place, and both require the same size twiddle tables. Unless you are
>talking about instruction memory, not data memory, which is negligible
>here unless your dataset is very small, but if code size is the major
>issue then there are other things you can do; see below.
I'm talking about the amount of memory it took to get the particular
job done that I was doing, which wasn't just comparing transforms. I
had real-valued input, did a lot of cross-correlation in the frequency
domain, magnitude detection, and some inverse transforms for parameter
extraction. Doing all that, including the FD cross-correlations and
magnitude detection, took "substantially" less memory than doing the
same process with the RFFT solution I had at hand (which was
Sorensen's). I don't recall the details and, as I mentioned, I'm not
inclined to revive a 20+-year-old computer to get the files. I do
recall that it had to do with getting some interesting savings due to
careful management of the use of the output buffers and the
cross-correlation operations and then feeding the inverse transform.
It had something to do with an exploitation of the data arrangement at
the output of the FHT. I do not recall the details.
Doing the same function with the RFFT that I had available to me at
the time required an extra chunk of memory (IIRC something like N/2;
I don't recall exactly, it was a long time ago) somewhere in the
process that I didn't need with the FHT. That turned out to be the
only advantage, and when it was all done I thought it wasn't even
worth it. It's stuck in my mind because, while the difference was
small (but "substantial" by my judgement), it was a clear advantage
that I got with the FHT that I couldn't get with that particular RFFT
for that particular application and the particular process I was
using.
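I can't reconstruct that old code, but just to give the flavor of it,
here's a minimal numpy sketch (with an O(N^2) reference DHT, purely for
illustration) of doing a circular cross-correlation entirely in the
Hartley domain; the combining step follows from the standard DHT/DFT
relationship, and the inverse is just another forward DHT scaled by 1/N:

    import numpy as np

    def dht(x):
        # Reference DHT: H[k] = sum_n x[n] * cas(2*pi*n*k/N)  (O(N^2), for clarity)
        N = len(x)
        nk = np.outer(np.arange(N), np.arange(N))
        return x @ (np.cos(2 * np.pi * nk / N) + np.sin(2 * np.pi * nk / N))

    def circ_xcorr_dht(x, y):
        # Circular cross-correlation r[n] = sum_m x[m] * y[(m + n) % N],
        # done with forward DHTs only (the DHT is self-inverse up to 1/N).
        N = len(x)
        X, Y = dht(x), dht(y)
        Xr = np.roll(X[::-1], 1)          # X[(N - k) % N]
        Yr = np.roll(Y[::-1], 1)          # Y[(N - k) % N]
        R = 0.5 * (X * (Y - Yr) + Xr * (Y + Yr))
        return dht(R) / N

    N = 16
    x, y = np.random.randn(N), np.random.randn(N)
    brute = np.array([sum(x[m] * y[(m + n) % N] for m in range(N))
                      for n in range(N)])
    assert np.allclose(circ_xcorr_dht(x, y), brute)

This is not what I did in 1988, just an illustration that the whole
real-valued chain can stay in one transform domain.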
>> I don't think that's the case.  Look more closely at the results
>> table.  If the FFT under comparison were based on the FHT under
>> comparison, it would be at least the same size as the FHT.  It is
>> not.  It is significantly smaller.  This suggests that it was not an
>> FHT-based FFT under comparison.  Whether or not it was an RFFT is not
>> at all clear.
>
>The paper explicitly states, "FHT is taken as the basic VLSI
>architecture. It is used as architecture for obtaining all other
>transforms." Maybe they were quoting the additional gates to get an
>FFT from an FHT?
>
>In any case, I agree that the paper is very unclear -- it's not
>something you can really draw any useful conclusions from regarding
>FHT vs. RFFT.
>
>> I merely offered it as an existence proof of a recent published case
>> where somebody found an advantage of an FHT over (what I thought was a
>> real-input) FFT.  It's not clear now whether it was a real-input FFT
>> or not, so who knows.
>
>Certainly, there are people still publishing the occasional paper on
>FHT algorithms, and even making vague claims of performance benefits
>(although nowadays they just cite conceptual preferences). I have no
>disagreement on that. But for 20 years they haven't been able to back
>it up with clear comparisons, and the (rather sketchy) paper you cite
>is no different in this regard.
Do a Google search on "VLSI Hartley FHT". There are actually a fair
number of papers doing something quite similar to what this guy did,
i.e., applying the FHT as a transform engine for compressors that need
to do multiple types of transforms. I don't know whether people
still do that task that way, but I have heard rumblings in that sense
here and there. The compressors in question do operate on real-valued
inputs. I offer the aggregate of those papers as existence proof that
there are people out there who find, for their particular applications
under their particular set of constraints, that an FHT provides an
advantage. That's all I've been claiming: that those cases can exist.
>> >Of course they are different, but you still have yet to (correctly)
>> >explain a clear concrete advantage for FHTs in performance or memory
>> >in *any* circumstance.
>>
>> My Amiga twenty years ago doesn't count?  Self-inverse doesn't count?
>
>Self-inverse is not a performance advantage, nor is it a memory
>advantage except for extremely small transforms -- i.e. it's only O(1)
>savings at best. And in the latter case there are actually ways to
>get an inverse RFFT from a forward RFFT. In practice, when one is
>talking about memory advantages one is normally talking about the data
>[O(n) memory]. You yourself kept talking about twiddle factors,
>addressing, pipelining, data permutations, etcetera, which is what I
>was disagreeing with -- for you to suddenly turn around and pretend
>that you were only talking about code size from the self-inverse
>property (which I was the one to point out, BTW) is disingenuous.
I am, and have been, talking about implementation advantages. An
advantage in implementation may not involve changing the overall
memory size requirement, it may not involve (and likely won't)
changing the number of multiplies or additions performed. It may,
however, involve reducing the number of multipliers that need to be
instantiated, or the precision carried at a certain point, or the type
of memory used (which, even if the size is the same, can matter), or
the cycle count to get the job done, or the power consumed in doing
it.
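As an aside on the self-inverse point from earlier: what it buys is
structural rather than arithmetic, i.e., one routine serves as both the
forward and inverse transform up to a 1/N scale, which is exactly the
kind of small implementation convenience I'm talking about. A throwaway
numpy illustration (the O(N^2) DHT is just for demonstration):

    import numpy as np

    def dht(x):
        # Reference DHT (O(N^2), illustrative only)
        N = len(x)
        nk = np.outer(np.arange(N), np.arange(N))
        return x @ (np.cos(2 * np.pi * nk / N) + np.sin(2 * np.pi * nk / N))

    x = np.random.randn(8)
    # One routine covers both directions, up to a 1/N scale:
    assert np.allclose(dht(dht(x)) / len(x), x)
    # A real-input FFT pairs a forward routine with a distinct inverse routine:
    assert np.allclose(np.fft.irfft(np.fft.rfft(x)), x)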
As I pointed out, while the FFT in the conference paper's results was
substantially smaller in die real estate than the FHT, it had longer
latency and consumed substantially more power than the FHT
implementation. I think that the FFT could not have been based on an
FHT, or it would have occupied more die real estate than the FHT it
was compared against. That shows that there was an implementation
advantage to the FFT in this particular comparison for die real
estate, but an advantage to the FHT for latency and power consumption.
Depending on which of those factors was important to a user, one would
choose one or the other accordingly, but if latency and/or power
consumption were the deciding factors, then clearly the FHT would have
an advantage in that case.
It could be, and I wouldn't be surprised at all, if they both had the
same amount of memory and did the same number of operations, but the
implementation differences in control, layout, or wherever the
differences were, got the results shown. That's a very typical thing
to happen in these sorts of tradeoffs, so "implementation advantage"
can take many forms with real value, tied to metrics OTHER THAN total
memory, operation count, or what the trellis/butterfly diagram looks
like. That's what I've been trying
to get at even from my earliest posts in this thread.
But I do not exclude the possibility that even memory or computational
advantages might be gained in the basic transform one way or the other
depending on the circumstances, and the fact that the outputs of the
two transforms are organized differently suggests that that could be
exploited. I've thrown out some possible thoughts there, and I've had
some personal (albeit long ago) experiences, but I've not claimed
anything concrete and I've not offered examples because I don't have
any. A couple decades of experience with such things tells me that
it would be foolish to conclude that it cannot be done, and my
knowledge of the facts suggests that it likely could be under the
right circumstances. That's all I've offered.
You seem to have misinterpreted a lot of what I've said, and since
that can easily be from me not being clear, I've tried to clarify
where I thought those misinterpretations could be. If you want to
ignore the clarifications and insist I meant something other than I
did there's not much I can do about that. It seems to me you've just
latched onto interpreting certain bits of what I've said in the least
favorable light to whatever it is that's important to you about this.
I can continue to try to clarify if it's helpful.
>Your vague reference to unpublished results on your Amiga doesn't
>count because there's no indication that your real-input FFT was
>especially good -- it says more about you than about the algorithms.
I told you where the real-input FFT came from. I'm a bit surprised
that you seem to be using it for a character attack.
If you think comp.dsp should restrict itself to only discussion of the
results of published papers and exclude unpublished personal
experience I'd disagree strongly with that notion. Meanwhile I'll
continue to post my thoughts based on my experiences and people can
take them for what they're worth...that's how the internet works
pretty much.
>It's trivial to find *some* real-input FFT code that is slower than
>*some* FHT code, but that's not particularly meaningful. What is
>meaningful is that people have been trying to implement FHTs for 20
>years, and the best real-input FFTs have consistently been at least as
>good. At a more fundamental level, no one has identified a difference
>between the two algorithms that would seem to give a substantial
>advantage one way or the other in performance---arithmetic, memory
>access patterns, memory utilization, etcetera are all essentially
>identical.
The VLSI-related papers, including the one I already cited, are
existence proofs to the contrary (with the mentioned caveats). As
I've tried to point out for a while, unless the comparisons on which
you're relying included all the possible implementation constraints
that typically present themselves (and I know they haven't), they
have to be interpreted within the scope for which they apply. As I said
early on in the thread, operation counts don't tell the whole story.
Nor do memory sizes. Even some of the oft-cited comparison papers
show tradeoffs based on things like what butterfly architecture is
used. Wherever there's a difference there's a possible exploitation
in implementation.
And I have to say you wouldn't necessarily know whether anybody has
really gotten a big advantage with an FHT because LOTS of people who
work on these things don't publish (for a lot of good reasons). And
sometimes even when somebody does want to publish on a topic that got
hashed out ten years ago, it won't make it onto the editor's radar
because it's not the topic du jour. The VLSI papers (including the
one cited) seem to be getting published in places other than where you
might be looking.
So I put only so much weight on the literature's conclusions, because
it should be interpreted within the scope in which it exists, and I don't
think that scope includes all possible constraints. The number of
permutations of possible success metrics and constraints in
implementations is so huge that it couldn't possibly be covered in the
literature. Nobody is going to publish a paper because they saved a
few multipliers or a small amount of memory or a bit of power
consumption in a specific implementation with a specific library; no
editor would find it interesting. And that's appropriate because the
scope is so small. I understand this, and that's why I find it
interesting that there are these smatterings of hints that guys doing
compression in VLSI with FHTs might be onto something for that
particular application. I think it would be foolish to say that,
since the academic literature hasn't identified a general top-level
algorithmic advantage one way or the other, no one has or can find
one in implementation.
It's a bit different when you're building things in industry than it
is in academia.
>> Not necessarily smaller chunks, differently organized chunks.  If you
>> want to re-use the same memory block for the input buffer, in-place
>> processing, and output buffer, for an FHT that's not that big of a
>> deal.  For an RFFT (as I understand them), the addressing will
>> require more effort at some point or other (perhaps only at the
>> output)
>
>Your understanding of real-input FFT algorithms is incorrect.
>
>To quote Sorensen in 1987, "The resulting programs are nearly the
>same, due to the fundamental similarity between the Hartley and RFFT
>algorithms, so the indexing and other overhead costs are essentially
>the same".
>
>In the classic bit-reversal approach to in-place transforms, the FHT
>and RFFT algorithms require the same permutation on the input or
>output. As Sorensen wrote in another 1985 paper, "The permutation of
>the time indices will be exactly the same as for a Cooley-Tukey
>radix-2 FFT, because the same decomposition sequence is applied to
>both."
>
>I should also mention that there have been a few developments in FFT
>algorithms since 1965, and in particular it is perfectly possible to
>perform an in-place transform without any separate bit-reversal stage
>at all. Basically, you mix some transpositions in with the other
>computation stages in order to do both permutations and computations
>at the same time, which reduces the memory pressure compared to a
>separate bit-reversal pass. And one doesn't generally want to make
>log2(n) separate passes over the whole array either; you want to
>reorder the computations to improve locality. But since FHTs and
>RFFTs are essentially the same from the perspective of the
>computational graph and memory access patterns, the same optimizations
>apply to both (although in practice more work has been done on the
>latter).
FWIW, I wasn't talking about bit reversal. I'm aware of what you seem
to be trying to point out.
>> I merely think one should look beyond academic papers and software
>> library comparisons before concluding that all implementers everywhere
>> under all constraints will be bound to a certain outcome.  I have
>> difficulty imagining why one would think that was how things work.
>
>The burden of proof is on the person claiming a substantial
>performance/memory advantage one way or another, not on the person
>pointing out that in 20 years no one has demonstrated such an
>advantage. And whenever you have been pinned down to specifics on
>naming a difference that could lead to a substantial performance
>difference, you have been wrong.
>
>Regards,
>Steven G. Johnson
Again, I'm not making any big claims, I don't have a dog in this hunt,
I don't care what people use or don't use. I'm just sharing my
perspective and experience; it's evidently different from yours, and
more notably it comes from a pretty different place.
You, however, do seem to be claiming something or other; I'm not quite
sure what it is, other than that I'm wrong. I don't agree that because
the academic literature has come to some conclusion it must be true
in all cases, because papers generally and almost universally can't
even begin to cover detailed implementation issues. Papers have to be
taken within their context, and I've completely agreed with the
general conclusion from the beginning (for the last twenty years!)
that the differences in the basic transforms are small and subtle, and
I've repeatedly said so myself in this very thread. So I'm not
arguing with your general conclusion or that the papers are wrong, but
those small subtle differences can become significant, even huge,
under some implementation constraints. That won't be a surprise to
anyone who's worked under such constraints and there's some evidence
that others working in VLSI have actually seen advantages with the
FHT.
The big issue seems to be that I don't have a specific set of
constraints to cite where a measurable "advantage" can be
demonstrated. That's correct, I don't. I'm not inclined to go cook
one up, either, because it's a lot of work. Usually the constraints
come to you and you find what you find to solve the problem at hand
with the tools and building blocks you have. That's when you find the
real differences for that situation, and, again, I've already said
that in my experience just plugging FFT cores in has been the way to
go for ages. Until I hit a set of constraints that requires me to do
differently I am unlikely to do so.
Eric Jacobsen
Minister of Algorithms
Abineau Communications
http://www.ericjacobsen.org
Blog: http://www.dsprelated.com/blogs-1/hf/Eric_Jacobsen.php