how to optimize c code of Cordic algorithm


praveen
12-11-2003, 10:46 AM
Hello,
I have implemented CORDIC for finding the atan on the ADSP-2191, but it
takes 3714 cycles to execute. I have implemented it in C. Can
someone tell me how I can optimize the code so that it takes less
than 500 cycles?
My code is below.
LUT is the lookup table.
x and y are the two inputs whose atan is to be determined.

for(i=0;i<=25;i++)
{
    x1=x;                 /* save x before it is updated */
    if (y>0)
    {
        x=x+(y>>i);
        y=y-(x1>>i);
        ang=ang+LUT[i];
    }
    else
    {
        x=x-(y>>i);
        y=y+(x1>>i);
        ang=ang-LUT[i];
    }
}

Please suggest a technique by which I can reduce the number of cycles.

Waiting for reply
With regards
praveen

Jim Gort
12-12-2003, 02:28 AM
Praveen:

In general, C compilers for DSPs do not always fully utilize the DSP
hardware. This includes the zero-overhead looping, multi-function
instructions, etc. Thus, the general answer to your problem is to look at
the assembly generated by your compiler for your algorithm (e.g., the .lst
file, or whatever your compiler produces), read the manual on the 2191 to
understand its (powerful) instruction set and hits taken for branches, etc.,
and then hand-optimize the assembly.

Most DSP engineers start by writing in assembly when they know that the
function is time-critical. Or, you can purchase (or be given free from ADI)
libraries of optimized assembly code implementations with C callable
wrappers. Either of these two approaches is much better than the answer to
your question, which is given in the above paragraph.

Also, in my opinion, and for the group, attempts to optimize C code given an
understanding of how it will be compiled require such an understanding of
the C compiler that one is better off understanding the processor and
writing it in assembly to begin with. Comments?

Jim Gort

"praveen" <[email protected]> wrote in message
news:[email protected] om...
> Hello,
> I have implemented cordic for finding the atan in adsp 2191. But it
> takes 3714 cycles for its execution. I have implemented it in c. Can
> something tell me how can i optimize the code for that it takes less
> than 500 cycles.
> my code is
> LUT is the lookup table
> x and y are the two input whsoe atan to be determined
>
> for(i=0;i<=25;i++)
> {
> x1=x;
> if (y>0)
> {
> x=x+(y>>i);
> y=y-(x1>>i);
> ang=ang+LUT[i];
> }
> else
> {
> x=x-(y>>i);
> y=y+(x1>>i);
> ang=ang-LUT[i];
> }
>
> Please suggest me technic my which i can reduce the number of cycles
>
> Waiting for reply
> With regards
> praveen

Matt Timmermans
12-12-2003, 04:00 AM
"Jim Gort" <[email protected]> wrote in message
news:mJ9Cb.509423$Tr4.1413036@attbi_s03...
> Also, in my opinion, and for the group, attempts to optimize C code given an
> understanding of how it will be compiled requires such an understanding of
> the C compiler that one is better off understanding the processor and
> writing it in assembly to begin with. Comments?
>
> Jim Gort

It's a shame you can't do both, really. C compilers often can't effectively
determine when to use special processor features, but they are typically
better than programmers at things like register allocation. If I were
writing a development system for a DSP, it would probably be a C compiler
with a few extra DSP datatypes and a lot of intrinsic functions. The
intrinsic functions would be patterned after typical DSP processor features,
would have C prototypes, and would be equivalent to C library functions.
But the compiler for a given DSP would have a priori knowledge of these
functions and would compile calls to them into assembler that explicitly
uses the DSP processor features when present.

It probably wouldn't take too much work to make a good library of intrinsics
that have efficient implementations across a wide variety of DSPs.

Jerry Avins
12-12-2003, 04:52 AM
Jim Gort wrote:

...


> Also, in my opinion, and for the group, attempts to optimize C code given an
> understanding of how it will be compiled requires such an understanding of
> the C compiler that one is better off understanding the processor and
> writing it in assembly to begin with. Comments?
>
> Jim Gort

...

The advantage of C or any other high-level language is that the code is
portable. PPT, or Programmer's Principal Tautology: portable code is
useless if it is too slow to be useful. (Otherwise known as AD -- Avins's
Duh.) Personally I'd rather "just do it" in assembler than try to psych
out a compiler. When I'm just learning a processor, fixing the compiled
code (or trying to) makes sense. Once I know it fairly well, it doesn't.
For one thing, I'm likely to factor the code differently for assembler,
so the compiler output often isn't a good fit to my end result.

Jerry
Engineering is the art of making what you want from things you can get.

Parthasarathy
12-12-2003, 05:44 AM
[email protected] (praveen) wrote in message news:<[email protected]>...
> Hello,
> I have implemented cordic for finding the atan in adsp 2191. But it
> takes 3714 cycles for its execution. I have implemented it in c. Can
> something tell me how can i optimize the code for that it takes less
> than 500 cycles.
> my code is
>


Optimisation of the algorithm:
I guess, knowing the error (in the angle) that one can tolerate,
optimisation on the number of entries in the LUT and the Number of
iterations is possible.
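
As a rough illustration of that trade-off, the sketch below estimates how many iterations a given angular tolerance needs. It assumes exact arithmetic and the usual CORDIC bound that the residual angle after iterations i = 0 .. n-1 is at most atan(2^-(n-1)); the function name cordic_iterations_for is hypothetical. Fixed-point rounding adds further error, so treat the result as a lower bound on the work rather than a guarantee.

```c
#include <assert.h>

/* Smallest iteration count n (i = 0 .. n-1) for which the last rotation
   step, atan(2^-(n-1)), is no larger than tol_rad.
   Uses atan(t) <= t for t >= 0, so the count is slightly conservative
   for large tolerances and exact for small ones. */
int cordic_iterations_for(double tol_rad)
{
    assert(tol_rad > 0.0);

    int n = 1;
    double step = 1.0;              /* 2^-(n-1), an upper bound on the angle step */

    while (step > tol_rad) {
        step *= 0.5;
        n++;
    }
    return n;
}
```

For a tolerance of 1e-6 radian this returns 21, because 2^-20 is the first step below the tolerance.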

Bye
Partha

Ken Asbury
12-12-2003, 02:23 PM
"Matt Timmermans" <[email protected]> wrote in message news:<[email protected]>...
> "Jim Gort" <[email protected]> wrote in message
> news:mJ9Cb.509423$Tr4.1413036@attbi_s03...
> > Also, in my opinion, and for the group, attempts to optimize C code given an
> > understanding of how it will be compiled requires such an understanding of
> > the C compiler that one is better off understanding the processor and
> > writing it in assembly to begin with. Comments?
> >
> > Jim Gort
>
> It's a shame you can't do both, really. C compilers often can't effectively
> determine when to use special processor features, but they are typically
> better than programmers at things like register allocation. If I were
> writing a development system for a DSP, it would probably be a C compiler
> with a few extra DSP datatypes and a lot of intrinsic functions. The
> intrinsic functions would be patterned after typical DSP processor features,
> would have C prototypes, and would be equivalent to C library functions.
> But the compiler for a given DSP would have a priori knowledge of these
> functions and would compile calls to them into assembler that explicitly
> uses the DSP processor features when present.

Kinda' like Bittware, only free?

> It probably wouldn't take too much work to make a good library of intrinsics
> that have efficient implementations across a wide variety of DSPs.

I'd suggest that meeting your goals of "efficient implementation"
and "variety" are going to take a bit more work than you
anticipate, given that one of my favorite "tricks" is to take
working C or C++ code written by really smart people like you
guys and make it run really fast (factors of 3-10) in assembly on
the same processor or to migrate it to another processor.

Within a family's architecture you're absolutely right but
moving between vendors the internals can be significantly
different especially when moving to or from PC-based code
(modify Jerry's "AD" to apply to that latter!).

With a little thought (I won't bore you with the details unless
you're foolish enough to ask) one can make the translation
highly testable and allow direct comparison of the results
created by the two code sets. Further, you can provide for later
recovery of the C code (using compiler switches) if needed.

Granted, this is not the most efficient code possible in any
given environment because of the differing methodology one could
use starting from scratch in assembler but it seems to represent
a suitable compromise between optimization, verification and
maintenance to my customer base.

Ken

Matt Timmermans
12-13-2003, 01:21 AM
"Ken Asbury" <[email protected]> wrote in message
news:[email protected] om...
>
> Kinda' like Bittware, only free?

Not really like that.

I'm talking about wee tiny simple functions that have obvious trivial
intrinsic translations into CPU features. x86 CPUs, for example, have what
amounts to a strcpy() instruction. Back in the days when function call
overhead was considered significant, it was common for compilers to replace
calls to strcpy with that simple instruction. It's like adding an operator
to the language without changing the semantics of the language itself.

For DSPs, you could use the same technique to add language support for lots
of simple things like saturating arithmetic, fixed-point multiplication,
circular buffers, zero overhead loops, etc.

> I'd suggest that meeting your goals of "efficient implementation"
> and "variety" are going to take a bit more work than you
> anticipate, given that one of my favorite "tricks" is to take
> working C or C++ code written by really smart people like you
> guys and make it run really fast (factors of 3-10) in assembly on
> the same processor or to migrate it to another processor.

Yes, and the goal of an intrinsic library is to let you do much the same
trick in C. The intrinsic functions would have trivial and obvious
translations into assembly instructions for the features that the processor
supports. For features that the processor doesn't support, you get some
hand-optimized inline function that isn't quite as good.
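
A minimal sketch of what such portable fallbacks might look like in plain C. The names sat_add16, mul_q15 and circ_next are hypothetical and do not belong to any vendor library; a DSP compiler that knew these functions could map each one to a single instruction or addressing mode instead.

```c
#include <assert.h>
#include <stdint.h>

/* Saturating 16-bit add: clamps instead of wrapping on overflow. */
static inline int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + (int32_t)b;
    if (s > INT16_MAX) return INT16_MAX;
    if (s < INT16_MIN) return INT16_MIN;
    return (int16_t)s;
}

/* Q15 fractional multiply with truncation. Right-shifting a negative
   value is implementation-defined in C; this assumes an arithmetic
   shift, which is what common compilers provide. */
static inline int16_t mul_q15(int16_t a, int16_t b)
{
    int32_t p = ((int32_t)a * (int32_t)b) >> 15;
    if (p > INT16_MAX) p = INT16_MAX;   /* only reached for -1.0 * -1.0 */
    return (int16_t)p;
}

/* Circular-buffer index step without a modulo operation. */
static inline unsigned circ_next(unsigned idx, unsigned len)
{
    assert(len > 0 && idx < len);
    return (idx + 1 == len) ? 0u : idx + 1;
}
```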

Because the translation to assembly on any given platform is obvious, you
can hand-optimize C code for that platform in a predictable way, and your
code would remain portable to the extent that your program would have the
same semantic meaning across all platforms, even though you might want to
re-optimize it for processors that were significantly different. You also
get to let the compiler manage register allocation, stack shuffling,
instruction scheduling, type checking, and all that tedious stuff that
compilers are better at than people these days.

Dirk Bell
12-13-2003, 03:13 AM
Praveen,

Please provide the following info:

Are you using single precision (16 bits) or double precision (32 bits)
each to store 'x' and 'y'?
Are you using integer or fractional math?
Are the values of 'x' and 'y' using the entire range of the number of
bits they are stored in?
How many bits represent 'ang'?
Why does i go from 0 to 25?
Describe the contents of your LUT.

Have you verified that the result at each iteration of the loop is
what you expected? How about the final results? For what range of
input angles?

Thanks,


Dirk

Dirk A. Bell
DSP Consultant


[email protected] (praveen) wrote in message news:<[email protected]>...
> Hello,
> I have implemented cordic for finding the atan in adsp 2191. But it
> takes 3714 cycles for its execution. I have implemented it in c. Can
> something tell me how can i optimize the code for that it takes less
> than 500 cycles.
> my code is
> LUT is the lookup table
> x and y are the two input whsoe atan to be determined
>
> for(i=0;i<=25;i++)
> {
> x1=x;
> if (y>0)
> {
> x=x+(y>>i);
> y=y-(x1>>i);
> ang=ang+LUT[i];
> }
> else
> {
> x=x-(y>>i);
> y=y+(x1>>i);
> ang=ang-LUT[i];
> }
>
> Please suggest me technic my which i can reduce the number of cycles
>
> Waiting for reply
> With regards
> praveen

Randy Yates
12-13-2003, 03:35 AM
Jim Gort wrote:
> [...]
> Also, in my opinion, and for the group, attempts to optimize C code given an
> understanding of how it will be compiled requires such an understanding of
> the C compiler that one is better off understanding the processor and
> writing it in assembly to begin with. Comments?

AMEN BROTHER!!!!! Jim, you and I think alike!

--Randy


--
% Randy Yates % "...the answer lies within your soul
%% Fuquay-Varina, NC % 'cause no one knows which side
%%% 919-577-9882 % the coin will fall."
%%%% <[email protected]> % 'Big Wheels', *Out of the Blue*, ELO
http://home.earthlink.net/~yatescr

Randy Yates
12-13-2003, 03:41 AM
Matt Timmermans wrote:

>[...]
> It's a shame you can't do both, really.

You can. Write your time-critical code in assembly
and make it C-callable. Intrinsics have the same
problem as the one Jim addressed - in the time it
takes to learn and apply them, you could've written
in assembly, and the code is still less readable and
unportable - two big reasons for writing in C to begin
with.

I think folks who cling to C are in denial - you're
just gonna have to break down and code in assembly if
you want optimum performance. Learn it. Live it. Love it.
--
% Randy Yates % "...the answer lies within your soul
%% Fuquay-Varina, NC % 'cause no one knows which side
%%% 919-577-9882 % the coin will fall."
%%%% <[email protected]> % 'Big Wheels', *Out of the Blue*, ELO
http://home.earthlink.net/~yatescr

praveen
12-13-2003, 04:59 AM
Hello,

I need 25 iterations, so each entry in my lookup table is 4 bytes.
The size of x, y, and atan is 4 bytes each. The required accuracy of the
estimate is of the order of 1 microradian.

waiting for reply
with regards
praveen

Matt Timmermans
12-13-2003, 05:25 AM
"Randy Yates" <[email protected]> wrote in message
news:[email protected] et...
> You can. Write your time-critical code in assembly
> and make it C-callable. Intrinsics have the same
> problem as the one Jim addressed - in the time it
> takes to learn and apply them, you could've written
> in assembly, and the code is still less readable and
> unportable - two big reasons for writing in C to begin
> with.

No fair, Randy -- you didn't count the time it takes to learn and apply
assembly. And if you have one intrinsics library across platforms, you only
have to learn that once.
Once you've learned to program in a couple assembly languages, learning
another one is a waste of neurons. It won't teach you any great truths or
help you think differently -- it's just work that becomes worthless when you
want to switch processor families.

> I think folks who cling to C are in denial - you're
> just gonna have to break down and code in assembly if
> you want optimum performance. Learn it. Live it. Love it.

Well, yeah, that's certainly true as things stand, but that's mostly because
C sucks for DSP, and that's mostly because C was optimized for different
kinds of CPUs. Poor DSP performance is not an inescapable characteristic of
all higher level languages. There's no reason a DSP-optimized C-level
language couldn't get more than half the performance in a similar space for
less than half the effort. In the current scheme of things, that is almost
always a trade-off I'd jump at.

And an intrinsics library can be just like having a new language, except
without the nifty syntactic sugar and stricter type checking that you'd get
if you'd designed a really new language instead.

John Monro
12-13-2003, 08:08 AM
praveen wrote:

>Hello,
>I have implemented cordic for finding the atan in adsp 2191. But it
>takes 3714 cycles for its execution. I have implemented it in c. Can
>something tell me how can i optimize the code for that it takes less
>than 500 cycles.
>my code is
>LUT is the lookup table
>x and y are the two input whsoe atan to be determined
>
>for(i=0;i<=25;i++)
> {
> x1=x;
> if (y>0)
> {
> x=x+(y>>i);
> y=y-(x1>>i);
> ang=ang+LUT[i];
> }
> else
> {
> x=x-(y>>i);
> y=y+(x1>>i);
> ang=ang-LUT[i];
> }
>
>Please suggest me technic my which i can reduce the number of cycles
>
>Waiting for reply
>With regards
>praveen
>
>
Praveen,
That number of cycles seems high. The DSP chip is capable of doing a
multi-bit shift in a single cycle, using its barrel shifter. The compiler
should compile each 'C' multi-bit shift into a single assembly-language
shift. If, for some reason, the compiler is producing a number of
single-bit shifts for each multi-bit shift, then that would account for
the high cycle count.

Regards,
John

Jerry Avins
12-13-2003, 04:38 PM
Matt Timmermans wrote:

...

> Because the translation to assembly on any given platform is obvious, you
> can hand-optimize C code for that platform in a predictable way, and your
> code would remain portable to the extent that your program would have the
> same semantic meaning across all platforms, even though you might want to
> re-optimize it for processors that were significantly different. You also
> get to let the compiler manage register allocation, stack shuffling,
> instruction scheduling, type checking, and all that tedious stuff that
> compilers are better at than people these days.

More easily said than done, but that's the general idea. There's a Forth
with extensions for the 'C31, but the author isn't proud enough of it to
release it. He's accustomed to deep optimizing, but this one is better
than C (for that machine), but not close enough to assembly.

Jerry
--
Engineering is the art of making what you want from things you can get.

Jerry Avins
12-13-2003, 05:11 PM
John Monro wrote:

...

> Praveen,
> That number of cycles seems high. The DSP chip is capable of doing a
> multi-bit shift in a single cycle, using its barrel shifter.
> The compiler should compile each 'C' multi-bit shift into a single
> assembly-language shift. If, for some reason the compiler is producing
> a number of single-bit shifts for each multi-bit shift
> then that would account for the high cycle count.
>
> Regards,
> John

Here we go, psyching out a stupid compiler again. (Well, pretty smart
actually, but stupid compared to Praveen.) And when you modify the .obj
file to remove the extra instructions, all subsequent labels need
address fix-ups. (Until you slow the program down a bit by forcing the
compiler to do the fix-up, or insert no-ops where their only harm is
taking up room.)

Jerry
--
Engineering is the art of making what you want from things you can get.

Randy Yates
12-13-2003, 05:43 PM
Matt Timmermans wrote:

> "Randy Yates" <[email protected]> wrote in message
> news:[email protected] et...
>
>>You can. Write your time-critical code in assembly
>>and make it C-callable. Intrinsics have the same
>>problem as the one Jim addressed - in the time it
>>takes to learn and apply them, you could've written
>>in assembly, and the code is still less readable and
>>unportable - two big reasons for writing in C to begin
>>with.
>
>
> No fair, Randy -- you didn't count the time it takes to learn and apply
> assembly. And if you have one intrinsics library across platforms, you only
> have to learn that once.

What if different processors require different intrinsics?

> Once you've leared to program in a couple assembly languages, learning
> another one is a waste of neurons.

It may be a pain in the ass, but concluding it's a waste of neurons
is a bit presumptuous. It depends on what your situation is.

One scenario is where the optimizations gained provide
real performance improvements to the end-user and/or enable you
to cost-reduce a mass-marketed product. I did just that - if you
buy a Sony Ericsson T226, T230, or T237, you'll be buying just
such a solution. And I'm here to tell you, it feels damn good to
be able to do this for your company. (I really don't want to go
into it since it may be IP-sensitive.)

> It won't teach you any great truths or
> help you think differently -- it's just work that becomes worthless when you
> want to switch processor families.

If it saved a crapload of money, then I wouldn't call that worthless.

>>I think folks who cling to C are in denial - you're
>>just gonna have to break down and code in assembly if
>>you want optimum performance. Learn it. Live it. Love it.
>
>
> Well, yeah, that's certainly true as things stand, but that's mostly because
> C sucks for DSP, and that's mostly because C was optimized for different
> kinds of CPUs. Poor DSP performance is not an inescapable characteristic of
> all higher level languages. Theres no reason a DSP-optimized C-level
> language couldn't get more than half the performance in a similar space for
> less than half the effort. In the current scheme of things, that is almost
> always a trade-off I'd jump at.

It may be that we really agree with each other, Matt. I certainly agree that
it isn't worth spending a month optimizing some code if it's just for a test
fixture that will be used for a few weeks. But like I said above, whether the
extra time and effort are really worth it or not depends on the situation, and
some situations DEFINITELY warrant the descent into hard-core assembly.

> And an intrinsics library can be just like having a new language, except
> without the nifty syntactic sugar and stricter type checking that you'd get
> if you'd designed a really new language instead.

I dunno, it'd have to serve me breakfast before I'd agree we need YANL (yet
another new language). I really don't care for C#, and I'd even challenge
Perl and such. Just use C, man (or C++).
--
% Randy Yates % "...the answer lies within your soul
%% Fuquay-Varina, NC % 'cause no one knows which side
%%% 919-577-9882 % the coin will fall."
%%%% <[email protected]> % 'Big Wheels', *Out of the Blue*, ELO
http://home.earthlink.net/~yatescr

Jerry Avins
12-13-2003, 08:57 PM
Jerry Avins wrote:

...
> ... There's a Forth
> with extensions for the 'C31, but the author isn't proud enough of it to
> release it. He's accustomed to deep optimizing, but this one is better
> than C (for that machine), but not close enough to assembly.

But, but, but .... That guy ought to learn how to write.

Jerry
--
Engineering is the art of making what you want from things you can get.

Jerry Avins
12-13-2003, 09:05 PM
Randy Yates wrote:

> Matt Timmermans wrote:
>
>> "Randy Yates" <[email protected]> wrote in message
>> news:[email protected] et...
>>
>>> You can. Write your time-critical code in assembly
>>> and make it C-callable. Intrinsics have the same
>>> problem as the one Jim addressed - in the time it
>>> takes to learn and apply them, you could've written
>>> in assembly, and the code is still less readable and
>>> unportable - two big reasons for writing in C to begin
>>> with.
>>
>>
>>
>> No fair, Randy -- you didn't count the time it takes to learn and apply
>> assembly. And if you have one intrinsics library across platforms,
>> you only
>> have to learn that once.
>
>
> What if different processors require different instrinsics?

The point is that the compiler vendor should be writing the processor-
specific intrinsics packages. Just as compilers can optimize the same
code for a pentium or a PPC depending on switch settings.

Given the small customer base, I don't see that happening. Until it
does, people like you, who can write in assembler, will be in demand.

...

Jerry
--
Engineering is the art of making what you want from things you can get.

Randy Yates
12-13-2003, 10:03 PM
Jerry Avins wrote:
> Randy Yates wrote:
>
>> Matt Timmermans wrote:
>>
>>> "Randy Yates" <[email protected]> wrote in message
>>> news:[email protected] et...
>>>
>>>> You can. Write your time-critical code in assembly
>>>> and make it C-callable. Intrinsics have the same
>>>> problem as the one Jim addressed - in the time it
>>>> takes to learn and apply them, you could've written
>>>> in assembly, and the code is still less readable and
>>>> unportable - two big reasons for writing in C to begin
>>>> with.
>>>
>>>
>>>
>>>
>>> No fair, Randy -- you didn't count the time it takes to learn and apply
>>> assembly. And if you have one intrinsics library across platforms,
>>> you only
>>> have to learn that once.
>>
>>
>>
>> What if different processors require different instrinsics?
>
>
> The point is that the compiler vendor should be writing the processor-
> specific intrinsics packages. Just as compilers can optimize the same
> code for a pentium or a PPC depending on switch settings.

Jerry,

Here's an intrinsic right out of TI's documentation for the C54x C
compiler:

long _smac(long src, int op1, int op2);
Assembly instruction: MAC
Description: Multiplies op1 and op2, shifts the result left by 1, and
adds it to src. Produces a saturated 32-bit result. (OVM and FRCT set)

Now that's a pretty special-purpose intrinsic that is essentially
tied to the architecture of the machine. This is precisely what
I mean. It's not the implementations of intrinsics, it's their very
definitions.
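
To make the quoted description concrete, here is a behavioral sketch in portable C of what it says: multiply, shift left by 1, add, saturate to 32 bits. The name smac_model is hypothetical and this is not TI's implementation. It saturates only the final sum, whereas real hardware may also saturate the intermediate product, so results can differ in the corner case where both operands are -32768.

```c
#include <stdint.h>

/* Behavioral model of a saturating fractional multiply-accumulate:
   result = sat32(src + ((op1 * op2) << 1)).
   A 64-bit intermediate is used so the sum cannot overflow before
   it is clamped. */
int32_t smac_model(int32_t src, int16_t op1, int16_t op2)
{
    int64_t acc = (int64_t)src + 2 * ((int64_t)op1 * (int64_t)op2);

    if (acc > INT32_MAX) return INT32_MAX;
    if (acc < INT32_MIN) return INT32_MIN;
    return (int32_t)acc;
}
```

Written this way the operation is portable, but a compiler is unlikely to recognize it as a single MAC instruction, which is the gap an intrinsic is meant to close.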

At some point you just *cannot* abstract or generalize operations
since it's those very operations that give you the performance
improvement. Sort of a physical law.

Here's another thing I don't like about being tied to C. Many times
a hard-core assembly optimization requires organizing the data in
a very specific way. Now sure, you could organize it that way in
C too, but if you were just thinking in C you wouldn't think to
do the organization in the first place because you wouldn't be
doing the low-level instructions.

Like I said, assembly: learn it, live it, love it.

> Given the small customer base, I don't see that happening. Until it
> does, people like you, who can write in assembler, will be in demand.

Isn't it funny how one's hearing improves a hundred-fold when praises
are being said? Thank you, Jerry. I sure hope you're right.
--
% Randy Yates % "...the answer lies within your soul
%% Fuquay-Varina, NC % 'cause no one knows which side
%%% 919-577-9882 % the coin will fall."
%%%% <[email protected]> % 'Big Wheels', *Out of the Blue*, ELO
http://home.earthlink.net/~yatescr

Jerry Avins
12-14-2003, 03:53 AM
Randy Yates wrote:

...

> Jerry,
>
> Here's an intrinsic right out of TI's documentation for the C54x C
> compiler:
>
> long _smac(long src, int op1, int op2); MAC Multiplies
> op1 and op2, shifts the result left by 1, and adds it to src.
> Produces a saturated 32-bit result. (OVM and FRCT set)
>
> Now that's a pretty special-purpose intrinsic that is essentially
> tied to the architecture of the machine. This is precisely what
> I mean. It's not the implementations of intrinsics, it's their very
> definitions.

Right, but a mac needs special code to set up and finish. _smac() is
good for the middle only. On long filters, that's most of it.

> At some point you just *cannot* abstract or generalize operations
> since it's those very operations that give you the performance
> improvement. Sort of a physical law.

No HLL I know gives access to flags like carry and overflow. It's hard
to write efficient code without them. For example, dividing by 2^n with
rounding is best done by right_shift n, add_immediate 0 with carry. The
rounding operation is one instruction in assembler. How many in C?
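
For comparison, one way to write the same rounding divide in C without access to the carry flag is to recover the last bit shifted out explicitly. This is a sketch: the name shr_round is hypothetical, and it assumes the compiler implements right shift of negative values as an arithmetic shift, which the C standard leaves implementation-defined.

```c
#include <assert.h>
#include <stdint.h>

/* Divide x by 2^n, rounding to nearest with ties toward +infinity.
   (x >> (n - 1)) & 1 is the bit that the n-bit shift would have left
   in the carry flag; adding it performs the rounding. */
int32_t shr_round(int32_t x, int n)
{
    assert(n >= 1 && n <= 30);
    return (x >> n) + ((x >> (n - 1)) & 1);
}
```

That is two shifts, a mask and an add in the source; whether the compiler folds them back into a shift and an add-with-carry depends on the compiler.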

> Here's another thing I don't like about being tied to C. Many times
> a hard-core assembly optimization requires organizing the data in
> a very specific way. Now sure, you could organize it that way in
> C too, but if you were just thinking in C you wouldn't think to
> do the organization in the first place because you wouldn't be
> doing the low-level instructions.
>
> Like I said, assembly: learn it, live it, love it.

...

Jerry
--
Engineering is the art of making what you want from things you can get.

Dirk A. Bell
12-14-2003, 07:26 PM
Praveen,

A few comments:

1) The code as presented assumes that x>=0. Max total rotation possible in your
code is a little more than 90 degrees.
2) From your comments you are using double precision variables and math,
which is expensive computationally. Depending on your application single
precision might work adequately.
3) The accuracy you have stated as required does not need the loop to
iterate 26 times.
4) Your cordic code is short enough that you should be able to determine the
assembly code generated and present that to the group for suggestions of
what to change. The question of how many shifts the C compiler is using to
implement '>>i' would be answered by this. If the answer is 'i' shifts then
there are simple alternatives to save processing. Other potential problems
may also be apparent.
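
Regarding comment 1, a common way to lift the x>=0 restriction is to rotate the input by 90 degrees into the right half-plane before the loop and add the offset back afterwards. The sketch below shows only that pre-rotation step; the type prerot_t and function prerotate are hypothetical names, and it assumes the inputs are well inside the 32-bit range so that negating them cannot overflow.

```c
#include <stdint.h>

/* Result of the pre-rotation: a vector with x >= 0 plus the angle
   (in whole degrees) that must be added to the CORDIC result. */
typedef struct {
    int32_t x;
    int32_t y;
    int offset_deg;
} prerot_t;

prerot_t prerotate(int32_t x, int32_t y)
{
    prerot_t r = { x, y, 0 };

    if (x < 0) {
        if (y >= 0) {
            /* second quadrant: rotate by -90 degrees, (x, y) -> (y, -x) */
            r.x = y;
            r.y = -x;
            r.offset_deg = 90;
        } else {
            /* third quadrant: rotate by +90 degrees, (x, y) -> (-y, x) */
            r.x = -y;
            r.y = x;
            r.offset_deg = -90;
        }
    }
    return r;
}
```

The offset would have to be converted to whatever angle scaling the lookup table uses before it is added to the loop result.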

A few more questions:

1) The original values loaded into x and y have how many bits each?
2) Where are they placed in the 32 bits of the x and y variables prior to
starting the routine?

Dirk A. Bell
DSP Consultant


"praveen" <[email protected]> wrote in message
news:[email protected] om...
> Hello,
>
> I need 25 iteration,so my look up table is 4 byte size each,
> Size of x,y,atan is 4 byte each. My accuracy of estimation is of the
> order 1 microradian.
>
> waiting for reply
> with regards
> praveen

praveen
12-15-2003, 01:01 PM
Hello,

> Are you using single precision (16 bits) or double precision (32 bits)
> each to store 'x' and 'y'?

I am using double precision


> Are you using integer or fractional math?

integer

> Are the values of 'x' and 'y' using the entire range of the number of
> bits they are stored in?

My range of x and y is a maximum of 2 and a minimum of -2, but I am using
32 bits to represent it.


> How many bits represent 'ang'?
I am using 32 bits.

> Why does i go from 0 to 25?

Because my estimation accuracy should be of the order of 1
microradian.

> Describe the contents of your LUT.
It contains values from 45 degrees to 0, with step size of 45/26.

static long LUT[26]={23592960,13927738,7359034,3735561,1875029,938429,469329,234679,117341,58671,29335,14668,7334,3667,1833,917,458,229,115,57,29,14,7,4,2,1};
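
The entries are consistent with atan(2^-i) expressed in degrees and scaled by 2^19 (45 x 2^19 = 23592960), rather than with equal steps. Under that reading, a self-contained sketch of the routine with the iteration count as a parameter could look like the following. The function name cordic_atan, the use of int32_t in place of long, and the input scaling are assumptions for illustration; it also assumes x > 0 and inputs small enough that the CORDIC gain of about 1.65 does not overflow 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* atan(2^-i) in degrees, scaled by 2^19 (values from the post) */
static const int32_t LUT[26] = {
    23592960, 13927738, 7359034, 3735561, 1875029, 938429, 469329,
    234679, 117341, 58671, 29335, 14668, 7334, 3667, 1833, 917, 458,
    229, 115, 57, 29, 14, 7, 4, 2, 1
};

/* Vectoring-mode CORDIC: drives y toward 0 and accumulates the angle.
   Returns atan(y/x) in degrees * 2^19. Assumes x > 0. */
int32_t cordic_atan(int32_t x, int32_t y, int n_iter)
{
    assert(n_iter >= 1 && n_iter <= 26);

    int32_t ang = 0;
    for (int i = 0; i < n_iter; i++) {
        int32_t xs = x >> i;        /* shifted copies taken before the update */
        int32_t ys = y >> i;
        if (y > 0) {
            x += ys;
            y -= xs;
            ang += LUT[i];
        } else {
            x -= ys;
            y += xs;
            ang -= LUT[i];
        }
    }
    return ang;
}
```

Passing a smaller n_iter makes it easy to measure how accuracy and cycle count trade off on the target.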


>
> Have you verified that the result at each iteration of the loop is
> what you expected? How about the final results? For what range of
> input angles?

Yes, the result is fine, as expected. It's also 32 bits.

waiting for reply
with regards
praveen

Dirk A. Bell
12-15-2003, 02:38 PM
Praveen,

See my last post for more comments, questions.

Dirk

"praveen" <[email protected]> wrote in message
news:[email protected] om...
> Hello,
>
> > Are you using single precision (16 bits) or double precision (32 bits)
> > each to store 'x' and 'y'?
>
> I am using double precision
>
>
> > Are you using integer or fractional math?
>
> integer
>
> > Are the values of 'x' and 'y' using the entire range of the number of
> > bits they are stored in?
>
> my range of x and y is maximum of 2 and minimum of -2. But i am 32 bit
> to represent it.
>
>
> > How many bits represent 'ang'?
> i am using 32 bit
>
> > Why does i go from 0 to 25?
>
> because my estimation of accuracy should of the order of 1
> microradians.
>
> > Describe the contents of your LUT.
> its contains value from 45 degrees to 0. with step size of 45/26.
>
> static long LUT[26]={23592960,13927738,7359034,3735561,1875029,938429,469329,234679,117341,58671,29335,14668,7334,3667,1833,917,458,229,115,57,29,14,7,4,2,1};
>
>
> >
> > Have you verified that the result at each iteration of the loop is
> > what you expected? How about the final results? For what range of
> > input angles?
>
> Yes the result is fine as expected. Its also 32 bit.
>
> waiting for reply
> with regards
> praveen

praveen
12-25-2003, 07:35 AM
Hello,

If you write the CORDIC in C for atan, what is the usual number of cycles
required?

waiting for reply
praveen