Register
Site Search
Forums
Welcome to PocketMatrix. PocketMatrix is dedicated to providing the best online community for mobile device developers and enthusiests. What's new?

## Writing Efficient C for ARM

#### Writing Efficient C for ARM

I recently came across this document on ARM's website. While it strictly deals with optimization techniques using the ARM C compiler, some things do carry over to the Microsoft ARM C compiler provided with eVC. I haven't got around to testing some of the topics covered in the article and looking at the resulting assembly language output from the MS compiler. Perhaps one of you lads with tons of time on your hands can do this and share your findings? Anyway, it makes good bathroom reading if nothing else.<br><br>
Digby
pm Insider

Posts: 1011
Joined: Apr 29, 2001 @ 1:53pm

#### Re: Writing Efficient C for ARM

Thanks Digby! That's some really useful information. I read through it, and took some notes along the way. Here is basically a consolidated version:
In general, avoid division and modulo.

If you must use division or modulo, then:
• Use unsigned ints.
• Try to divide by powers of 2. The compiler will automatically use shifts instead of divides. Will it also use binary AND for modulus? I assume this only pertains to static const divisors?
• If you need both the remainder and dividend, place your div and mod operators as close together as possible, so the compiler can perform the operation once.
• Divisions by 10 are specially optimized and perform better than other divisions (except by powers of 2).
• Use logic instead of modulo if possible.
• If you are dividing by a number 2^n then you can just shift right by n (very fast!)
• If you are performing modulus by a number 2^n, then you can binary AND by (2^n)-1 (very fast!)

In boolean logic use arithmetic operators (not div or mod) over boolean when possible.
Code: Select all
`1 2 3 4 bool PointInRect1( Point p, Rectangle *r ) {  return ( p.x >= r->xmin && p.x < r->xmax &&    p.y >= r->ymin && p.y < r->ymax );}4 lines; 2 keywds; 0 nums; 33 ops; 0 strs; 0 coms    Syntactic Coloring v0.4 - Dan East   `

Use:
Code: Select all
`1 2 3 4 5 bool PointInRect2( Point p, Rectangle *r ) {  return ( ( unsigned ) ( p.x - r->xmin ) < r->xmax &&    ( unsigned ) ( p.y - r->ymin ) < r->ymax );}5 lines; 4 keywds; 0 nums; 33 ops; 0 strs; 0 coms    Syntactic Coloring v0.4 - Dan East   `

When dataprocessing instructions are performed (MOV, ADD, AND, MUL), the zero and negative flags are set, in addition to the actual math operation. Thus if a logical comparison to zero is performed immediately after the operation, no comparison actually needs to be performed, only a conditional branch. So compare to 0 when possible, and perform the operation in the same compound statement as the compare, so the compiler can just use the flag set by the operator.

The above pertains to loops as well. Since comparison to 0 is desirable, count down to zero when possible. This will result in a substantial saving in clock cycles.

The compiler does not automatically unroll small loops. So if you perform a loop that iterates a small number of times, you should unroll the loop into multiple steps manually. This of course results in a larger executable, but speed is almost always a higher priority.

Use switch statements with dense case values. The rule is that you should use case values that occupy at least 50% of the possible values between your min and max case values. In other words, this is bad:
Code: Select all
`1 2 3 4 5   case 0:  case 1:  case 2:  case 900:  case 2000:5 lines; 5 keywds; 5 nums; 5 ops; 0 strs; 0 coms    Syntactic Coloring v0.4 - Dan East   `

In this case there is a range of 2001 unique values, of which only 5 are used. The compiler would split this into nested case statements. Better yet, if you can simply look-up into an array indexed by your switch value, you can save additional time by omitting the case statements altogether.

Register usage:
The compiler will attempt to assign int, float and pointer vars to registers. However, this can only be done in specific conditions:

Var must be local or a function parameter. Thus you should never perform heavy looping / math on a global var. Make local copies of global vars you do a lot of operations on. Class member vars are obviously not local vars.
If you take the address of a var (& operator), you should not assign it to another var, otherwise the compiler cannot assign it to a register. If you need to pass a var as a pointer to another function (and of course you are trying to get the compiler to assign that var to a register) then you should make a copy of the var and pass a pointer to the copy to the function.

Avoid repetitive pointer chains on the same base:
Code: Select all
`1 2 3 4 a->b->c1a->b->c2a->b->c3a->b->c44 lines; 0 keywds; 0 nums; 16 ops; 0 strs; 0 coms    Syntactic Coloring v0.4 - Dan East   `

You should either create a pointer to the deepest common level:
Code: Select all
`1 2 3 4 5 b=a->b;b->c1b->c2b->c3b->c45 lines; 0 keywds; 0 nums; 12 ops; 0 strs; 0 coms    Syntactic Coloring v0.4 - Dan East   `

Or simply include the c-level object directly in the definition of its parent, instead of a pointer to the c-level object:
Code: Select all
`1 2 3 4 a->b.c1a->b.c2a->b.c3a->b.c44 lines; 0 keywds; 0 nums; 12 ops; 0 strs; 0 coms    Syntactic Coloring v0.4 - Dan East   `

Avoid functions with large numbers of vars. Register vars may be spilled (moved into memory) to make room for other vars, which is expensive. Consider breaking the function into smaller functions, or at least make the range of usage of the var ( the range in which it is "alive" ) as consolidated as possible. Also reuse vars as much as possible. The usage of "register" in the var declaration can help the compiler determine which vars are of most importance, and it will try to keep those from spilling.

Avoid use of shorts and chars whenever possible. At least copy them over to ints to do your operations. Don't forget that mod is bad, so don't copy a unsigned char to an int, do your math, then mod it by 256 when your done. You would be better off doing the math directly with the char type in the first place. Or instead of modulus you could binary AND by 255.
Apparently ARM processors based on ARM architecture version 4 or greater can perform load, store and shifts as efficiently with shorts and signed chars as with ints. I don't know if the ARM processors used by Pocket PCs are version 4+ or not. Anybody else know? If so, then try and perform only shifts directly on shorts and unsigned chars, and copy to ints for other math.

The first four parameters passed to a function are stored in registers. Thus you should try to use no more than 4 parameters per function for the time-critical code.
Other considerations for parameter passing:
Pass pointers to structs instead of the struct itself.
Instead of passing many arguments, create a struct that holds those vars and pass a pointer to that struct.
long long (64 bit int) and double (64 bit float) take 2 registers a piece.
Functions that take a variable number of parameters force all parameters on the stack (note this is not referring to C++ class members functions with default values).

Try and make all frequently called functions leaf functions (a leaf function is a "dead end" and does not call any other functions).
Assuming function a is called very often, and it calls function b (and only calls function b), then there are two ways to make function a into a leaf:

Incorporate the functionality of b directly into a. This may increase executable size, but speed will be greatly increased. This may also make it more difficult to maintain your code (since you may have multiple copies of b's implementation strewn about). However, you can #define b's implementation once and just place that macro in each of the functions that used to call b.
OR
Make function b inline.

If one function calls another, try to call that function in a return statement:
Code: Select all
`1 2 3 4 int a() {  do_a_bunch_of_stuff;  return b();}4 lines; 2 keywds; 0 nums; 8 ops; 0 strs; 0 coms    Syntactic Coloring v0.4 - Dan East   `

Code: Select all
`1 2 3 4 5 6 int a() {  do_some_things;  b();  do_some_more_things;  return something;}6 lines; 2 keywds; 0 nums; 10 ops; 0 strs; 0 coms    Syntactic Coloring v0.4 - Dan East   `

USE LOOKUP TABLES! Pre-calculate and use lookup tables when possible (including when memory constraints allow).

Floating Point (if you really have to use it)):
Floating point math is hundreds of times slower than integer math.
Division is typically twice a slow as mult, so multiply by the reciprocal instead of dividing when possible.
Use floats over doubles whenever range / precision allows.
Use lookup tables over calls to sin, exp and log.
Few if any optimizations are applied to floating point operations. Do not do redundant or inefficient operations assuming the compiler will optimize them.
The documentation recommends that when "floating performance will not reach the required level for a particular application" you should use fixed point math instead. If you're starting a project from scratch you might as well go with fixed point in the first place if performance matters in the slightest degree.

Dan East
Last edited by Dan East on Dec 3, 2004 @ 3:00pm, edited 4 times in total.

Dan East

Posts: 5264
Joined: Jan 25, 2001 @ 5:19pm
Location: Virginia, USA

#### Re: Writing Efficient C for ARM

Man, you're quite the typist there Dan. Just a reminder that those things are only known to be valid for the ARM C compiler - not the ARM compiler provided with the Microsoft Embedded Visual Tools SDK (clarm.exe).<br><br>There's also a whitepaper up on the ARM site regarding fixed-point math.  This is probably old hat for someone like yourself, but others might be interested.  Here's the link:<br>Fixed Point Arithmetic on the ARM<br><br>
Digby
pm Insider

Posts: 1011
Joined: Apr 29, 2001 @ 1:53pm

#### Re: Writing Efficient C for ARM

Please, Please don't rely on eVC to compile efficient code, because it won't!!!. apart from the natural speed of the ARM it has some very good features that the compiler just don't use , (see assem thread that i started), all instructions can be conditional (not just branches) depending on the outcome of the previous instruction e.g.<br><br>a=a-b<br>if (a<0)<br>{c=0;}<br>else<br>{c=100;}<br><br>can be represented in assembler...<br>r0=a;<br>r1=b;<br>r2=c;<br><br>subs r2,r0,r1<br>movmi r2,#0;<br>movpl r2,#100;<br><br>3 instructions!!!!!!!!<br>evc will push/pull all the vars off/on the stack, then use a CMP instruction for the compare!!!<br><br>Dan,<br>send me a complex and bottleneck function from ur q2 source code and i will show you the differnec hand compilation can make (i bet 3-5 times quicker)<br>but forget pushing ur C code about to get better performance, because eVC just don't compile efficent ARM code<br>Last modification: Mole - 01/03/02 at 10:26:49
Must go faster
Mole
pm Member

Posts: 10
Joined: Dec 11, 2001 @ 10:36am

#### Re: Writing Efficient C for ARM

EVC indeed messes up in a major fashion, but sadly, things that I typically used to turn into asm on the x86 are compiled just fine. I have this huge matrix / vector multiplication that compiles to 'optimal' code. I agree that the case you mentioned can be done much faster, but that's also code that I would rarely turn into hand-optimized asm. <br><br>One other thing that I noticed: When I compile with maximum optimizations, the resulting .asm file doesn't compile. It only works without optimizations. That could also explain the bad compiles you mentioned.
Give me some good data and
I will give you the world

Phantom
pm Insider

Posts: 913
Joined: Feb 21, 2001 @ 8:14am
Location: Houten, Netherlands

#### Re: Writing Efficient C for ARM

Mole,

I compiled the following with the MS ARM compiler and it generated code similar to what you posted. I don't see any loads to the stack of the variables used in the function.

Code: Select all
`1 2 3 4 5 6 7 8 9 10 11 12 int foo (int a, int b){int c;    a = a - b;    if (a < 0)        c = 0;    else        c = 100;    return c;}12 lines; 7 keywds; 3 nums; 17 ops; 0 strs; 0 coms    Syntactic Coloring v0.4 - Dan East   `

Here's the code it generated:

Code: Select all
`1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 |foo|      PROC; 5    : {|\$M88|; 6    : int c;; 7    :       a = a - b;      subs      r3, r0, r1; 8    :       if (a < 0); 9    :             c = 0;      mov       r0, #0; 10   :       else; 11   :             c = 100;      movpl     r0, #0x64  ; 0x64 = 100; 12   :; 13   :       return c;; 14   : }      mov       pc, lr|\$M89|      ENDP  ; |foo|29 lines; 4 keywds; 15 nums; 50 ops; 0 strs; 0 coms    Syntactic Coloring v0.4 - Dan East   `

That looks pretty good to me.
Digby
pm Insider

Posts: 1011
Joined: Apr 29, 2001 @ 1:53pm

#### Re: Writing Efficient C for ARM

Hi,<br><br>I have been following the asm topics quite closely as I would like to learn ARMasm. Where can i find out what all these instructions (mov, movpl, subs etc) mean and how I should use them?<br><br>Thanks in advance<br>Bad
Lost....... Assumed Coding!

pm Insider

Posts: 81
Joined: Sep 21, 2001 @ 6:26am
Location: Blackpool, UK

#### Re: Writing Efficient C for ARM

See the thread Assembly On iPaq<br><br>Dan East

Dan East

Posts: 5264
Joined: Jan 25, 2001 @ 5:19pm
Location: Virginia, USA

#### Re: Writing Efficient C for ARM

Thanks again Dan,<br><br>I did look at that thread but thought I would need a basic understanding before progressing to these documents.<br><br>Cheers<br>Bad<br><br>
Lost....... Assumed Coding!

pm Insider

Posts: 81
Joined: Sep 21, 2001 @ 6:26am
Location: Blackpool, UK

I was searching for something else and found this, so I fixed up formatting problems (from when PM was pre-phpBB), and bump!

If anyone has comments or corrections specific to MS's compiler then I will be happy to add them. If that information is useful (and accurate) then I'll make this sticky.

Dan East.

Dan East

Posts: 5264
Joined: Jan 25, 2001 @ 5:19pm
Location: Virginia, USA

Sticky !!

Structure
pm Member

Posts: 147
Joined: Sep 17, 2004 @ 8:28pm
Location: UK

Excellent thread, exactly what makes PocketMatrix such a valuable discussion forum. Most of the advice for writing efficient code is equally valid for other processors, not just ARM.

/vote sticky

Crayfish
pm Member

Posts: 41
Joined: Oct 1, 2004 @ 11:19am
Location: Lowestoft, UK

Allow me to add a few choice words to this discussion...

I've found that no matter how good the compiler is (and the eVC++ one is not great), the C language does not allow you to specify things which will use all of the capabilities of the target CPU. Depending on the application, good old hand-written assembly language can usually get you 2-3X the performance of C code. It certainly is a good idea to create clean C code for the compiler and I've got two good suggestions for speeding things up:

1) The ARM CPU does not do well with global (static) variables because it can only use register-indirect addressing. Keep statics in structures with a variable pointing to the structure.

2) Looping can be done a lot quicker if a loop variable is avoided. e.g.

Slow way:
s = <source pointer>
d = <destination pointer>
for (i=0; i<count; i++)
{
*d++ = *s++;
}

Fast way:
s = <source pointer>
d = <destination pointer>
pEnd = &d[count]

while (d < pEnd)
{
*d++ = *s++;
}

Enjoy,
Larry B.
bitbank
pm Member

Posts: 12
Joined: Dec 8, 2004 @ 4:13am
Location: South Florida

bitbank wrote:Slow way:
s = <source pointer>
d = <destination pointer>
for (i=0; i<count; i++)
{
*d++ = *s++;
}

Fast way:
s = <source pointer>
d = <destination pointer>
pEnd = &d[count]

while (d < pEnd)
{
*d++ = *s++;
}

Enjoy,
Larry B.

Possibly even faster way, if you can guarantee you are copying at least one value:
s = <source pointer>
d = <destination pointer>
pEnd = d+count; // Less confusing way to write &d[count]

do
{
*d++ = *s++;
}
while (d < pEnd);

fdave
pm Member

Posts: 44
Joined: Apr 12, 2004 @ 10:01pm

I have just started to look at optimizations on ARM so I might ask stupid questions now . None of the applications I have ever written for ARM have had performance requirements that required me to do more the algorithmic optimizations.

Anyhow, I figure that the ARM CPU is 32-bit. Back when I was coding pmode asm on x86 systems there were huge optimizations to gain by avoid writing to 32-bit unaligned addresses in memory.

Example: A horisontal line on screen might start at position 0x00000001 (unaligned) and end on position 0x000000A0 (aligned). If each pixel on screen were a unsigned short (16 bit), then we would gain speed if we wrote two pixels at the same time. However since the start offset is an uneven address, all memory operations in this loop would get a penalty.

If we moved only one pixel to the screen buffer in each iteration of the loop instead of two, then we would have a penalty every other pixel.

My question is, does this apply to the ARM CPU aswell? I havn't found any documentation, yet. The links in this thread is dead.

I'm going to try it out, just havn't found time to update my algorithm yet.

On x86 systems this also had to do with caching, which I dont remember exactly how it worked. Something about 4kb aligned memory would fall into different banks (dont quote me on that). Which leads me to my second and third questions. Does the ARM CPU even have cache? How do I optimize my code for it?
<a href="http://fixerfrasse.blogspot.com">blog</a>

frasse
pm Member

Posts: 12
Joined: Oct 3, 2005 @ 8:13pm

Next

### Forum Description

A discussion forum for mobile device developers on the Windows Mobile platform. Any platform specific topics are welcome.

### Moderators:

Dan East, sponge, Digby, David Horn, Kevin Gelso, RICoder

### Forum permissions

You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum