Recip/Divide routines.

I've noticed a few people stating that recip tables are faster than doing a divide. While I understand the obvious here, I would have thought that the cache misses on the table lookups would negate much of the advantage.
What size recip tables are people using? Is there a routine out there that uses a small recip table and then some code to refine the result to a higher-precision answer? We are usually looking for about 14 to 16 bits for the recip, i.e.:
long invW= ((1<<30)/w)<<2; //invW is 16:16
verts->x=imul16(verts->x*invW);
etc...
Actually, everything is done in ASM to take advantage of the 32x32->64 bit multiply (and accumulate) instructions of the ARM, but you get the idea.
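To make the "small table plus some code" idea concrete, here is a rough sketch of one possible way to do it: a 64-entry seed table followed by two Newton-Raphson steps, x = x*(2 - m*x). This is only an illustration, not something we actually run. The names (recip_init, recip16, recip_table) and the table size are made up for the example, and it assumes w is an unsigned 16:16 value greater than zero.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names and sizes, chosen only for this sketch. */
#define RECIP_TABLE_BITS 6
#define RECIP_TABLE_SIZE (1u << RECIP_TABLE_BITS)  /* 64 entries, 128 bytes */

static uint16_t recip_table[RECIP_TABLE_SIZE];

/* Fill the table once at startup. Entry i holds 1/m in 1.15 format, where m
   is the midpoint of the i-th slice of [0.5, 1). */
void recip_init(void)
{
    for (uint32_t i = 0; i < RECIP_TABLE_SIZE; i++)
        recip_table[i] = (uint16_t)(((uint32_t)RECIP_TABLE_SIZE << 17)
                                    / (2u * (RECIP_TABLE_SIZE + i) + 1u));
}

/* Portable leading-zero count; on ARM cores that have it, a single CLZ
   instruction would replace this loop. */
static unsigned count_leading_zeros(uint32_t v)
{
    unsigned n = 0;
    assert(v != 0);
    while (!(v & 0x80000000u)) { v <<= 1; n++; }
    return n;
}

/* Approximate 2^32 / w, i.e. the 16:16 reciprocal of a 16:16 value w. */
uint32_t recip16(uint32_t w)
{
    if (w <= 1)
        return UINT32_MAX;                    /* result does not fit: saturate */

    unsigned shift = count_leading_zeros(w);  /* 0..30 because w >= 2 */
    uint32_t m = w << shift;                  /* 0.32 format, m in [0.5, 1) */

    /* Seed: about 7 good bits, widened from 1.15 to 2.30 format. */
    uint32_t idx = (m >> (31 - RECIP_TABLE_BITS)) & (RECIP_TABLE_SIZE - 1);
    uint32_t x = (uint32_t)recip_table[idx] << 15;

    /* Each Newton-Raphson step roughly doubles the number of good bits. */
    for (int i = 0; i < 2; i++) {
        uint32_t t = (uint32_t)(((uint64_t)m * x) >> 32);  /* m*x in 2.30 */
        uint32_t e = 0x80000000u - t;                      /* 2 - m*x     */
        x = (uint32_t)(((uint64_t)x * e) >> 30);           /* x*(2 - m*x) */
    }

    return x >> (30 - shift);                 /* undo the normalisation */
}
```

Usage would be recip_init() once, then invW = recip16(w) in place of the divide. With this seed, one step should give roughly 14 bits and the second step takes it past 16, so the loop count is the knob to trade speed against precision. Each step is two 32x32->64 multiplies, which is where the long multiply on the ARM would come in.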
Cheers,
rcp