Thanks Digby! That's some really useful information. I read through it, and took some notes along the way. Here is basically a consolidated version:
In general, avoid division and modulo.
If you must use division or modulo, then:
- Use unsigned ints.
- Try to divide by powers of 2. The compiler will automatically use shifts instead of divides. Will it also use binary AND for modulus? I assume this only pertains to static const divisors?
- If you need both the remainder and dividend, place your div and mod operators as close together as possible, so the compiler can perform the operation once.
- Divisions by 10 are specially optimized and perform better than other divisions (except by powers of 2).
- Use logic instead of modulo if possible.
- If you are dividing by a number 2^n then you can just shift right by n (very fast!)
- If you are performing modulus by a number 2^n, then you can binary AND by (2^n)-1 (very fast!)
In boolean logic use arithmetic operators (not div or mod) over boolean when possible.
Instead of:
- Code: Select all
1 2 3 4
| bool PointInRect1( Point p, Rectangle *r ) { return ( p.x >= r->xmin && p.x < r->xmax && p.y >= r->ymin && p.y < r->ymax ); }
|
| 4 lines; 2 keywds; 0 nums; 33 ops; 0 strs; 0 coms Syntactic Coloring v0.4 - Dan East |
Use:
- Code: Select all
1 2 3 4 5
| bool PointInRect2( Point p, Rectangle *r ) { return ( ( unsigned ) ( p.x - r->xmin ) < r->xmax && ( unsigned ) ( p.y - r->ymin ) < r->ymax );
}
|
| 5 lines; 4 keywds; 0 nums; 33 ops; 0 strs; 0 coms Syntactic Coloring v0.4 - Dan East |
When dataprocessing instructions are performed (MOV, ADD, AND, MUL), the zero and negative flags are set, in addition to the actual math operation. Thus if a logical comparison to zero is performed immediately after the operation, no comparison actually needs to be performed, only a conditional branch. So compare to 0 when possible, and perform the operation in the same compound statement as the compare, so the compiler can just use the flag set by the operator.
The above pertains to loops as well. Since comparison to 0 is desirable, count down to zero when possible. This will result in a substantial saving in clock cycles.
The compiler does not automatically unroll small loops. So if you perform a loop that iterates a small number of times, you should unroll the loop into multiple steps manually. This of course results in a larger executable, but speed is almost always a higher priority.
Use switch statements with dense case values. The rule is that you should use case values that occupy at least 50% of the possible values between your min and max case values. In other words, this is bad:
- Code: Select all
1 2 3 4 5
| case 0: case 1: case 2: case 900: case 2000:
|
| 5 lines; 5 keywds; 5 nums; 5 ops; 0 strs; 0 coms Syntactic Coloring v0.4 - Dan East |
In this case there is a range of 2001 unique values, of which only 5 are used. The compiler would split this into nested case statements. Better yet, if you can simply look-up into an array indexed by your switch value, you can save additional time by omitting the case statements altogether.
Register usage:
The compiler will attempt to assign int, float and pointer vars to registers. However, this can only be done in specific conditions:
Var must be local or a function parameter. Thus you should never perform heavy looping / math on a global var. Make local copies of global vars you do a lot of operations on. Class member vars are obviously not local vars.
If you take the address of a var (& operator), you should not assign it to another var, otherwise the compiler cannot assign it to a register. If you need to pass a var as a pointer to another function (and of course you are trying to get the compiler to assign that var to a register) then you should make a copy of the var and pass a pointer to the copy to the function.
Avoid repetitive pointer chains on the same base:
- Code: Select all
1 2 3 4
| a->b->c1 a->b->c2 a->b->c3 a->b->c4
|
| 4 lines; 0 keywds; 0 nums; 16 ops; 0 strs; 0 coms Syntactic Coloring v0.4 - Dan East |
You should either create a pointer to the deepest common level:
- Code: Select all
1 2 3 4 5
| b=a->b; b->c1 b->c2 b->c3 b->c4
|
| 5 lines; 0 keywds; 0 nums; 12 ops; 0 strs; 0 coms Syntactic Coloring v0.4 - Dan East |
Or simply include the c-level object directly in the definition of its parent, instead of a pointer to the c-level object:
- Code: Select all
1 2 3 4
| a->b.c1 a->b.c2 a->b.c3 a->b.c4
|
| 4 lines; 0 keywds; 0 nums; 12 ops; 0 strs; 0 coms Syntactic Coloring v0.4 - Dan East |
Avoid functions with large numbers of vars. Register vars may be spilled (moved into memory) to make room for other vars, which is expensive. Consider breaking the function into smaller functions, or at least make the range of usage of the var ( the range in which it is "alive" ) as consolidated as possible. Also reuse vars as much as possible. The usage of "register" in the var declaration can help the compiler determine which vars are of most importance, and it will try to keep those from spilling.
Avoid use of shorts and chars whenever possible. At least copy them over to ints to do your operations. Don't forget that mod is bad, so don't copy a unsigned char to an int, do your math, then mod it by 256 when your done. You would be better off doing the math directly with the char type in the first place. Or instead of modulus you could binary AND by 255.
Apparently ARM processors based on ARM architecture version 4 or greater can perform load, store and shifts as efficiently with shorts and signed chars as with ints. I don't know if the ARM processors used by Pocket PCs are version 4+ or not. Anybody else know? If so, then try and perform only shifts directly on shorts and unsigned chars, and copy to ints for other math.
The first four parameters passed to a function are stored in registers. Thus you should try to use no more than 4 parameters per function for the time-critical code.
Other considerations for parameter passing:
Pass pointers to structs instead of the struct itself.
Instead of passing many arguments, create a struct that holds those vars and pass a pointer to that struct.
long long (64 bit int) and double (64 bit float) take 2 registers a piece.
Functions that take a variable number of parameters force all parameters on the stack (note this is not referring to C++ class members functions with default values).
Try and make all frequently called functions leaf functions (a leaf function is a "dead end" and does not call any other functions).
Assuming function a is called very often, and it calls function b (and only calls function b), then there are two ways to make function a into a leaf:
Incorporate the functionality of b directly into a. This may increase executable size, but speed will be greatly increased. This may also make it more difficult to maintain your code (since you may have multiple copies of b's implementation strewn about). However, you can #define b's implementation once and just place that macro in each of the functions that used to call b.
OR
Make function b inline.
If one function calls another, try to call that function in a return statement:
- Code: Select all
1 2 3 4
| int a() { do_a_bunch_of_stuff; return b(); }
|
| 4 lines; 2 keywds; 0 nums; 8 ops; 0 strs; 0 coms Syntactic Coloring v0.4 - Dan East |
instead of
- Code: Select all
1 2 3 4 5 6
| int a() { do_some_things; b(); do_some_more_things; return something; }
|
| 6 lines; 2 keywds; 0 nums; 10 ops; 0 strs; 0 coms Syntactic Coloring v0.4 - Dan East |
USE LOOKUP TABLES! Pre-calculate and use lookup tables when possible (including when memory constraints allow).
Floating Point (if you
really have to use it)):
Floating point math is hundreds of times slower than integer math.
Division is typically twice a slow as mult, so multiply by the reciprocal instead of dividing when possible.
Use floats over doubles whenever range / precision allows.
Use lookup tables over calls to sin, exp and log.
Few if any optimizations are applied to floating point operations. Do not do redundant or inefficient operations assuming the compiler will optimize them.
The documentation recommends that when "floating performance will not reach the required level for a particular application" you should use fixed point math instead. If you're starting a project from scratch you might as well go with fixed point in the first place if performance matters in the slightest degree.
Dan East