What size recip tables are people using? Is there a routine out there that uses a small recip table and then some code to refine it into a higher-precision answer? We are usually looking for about 14 to 16 bits of precision in the recip, i.e.:
long invW= ((1<<30)/w)<<2; //invW is 16:16
verts->x=imul16(verts->x, invW); //16:16 multiply
etc...
Actually, everything is done in ASM to take advantage of the 32x32->64 bit multiply (and accumulate) functions of the ARM, but you get the idea.
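
To make the question concrete, the kind of routine I have in mind is a small table lookup followed by a Newton-Raphson step. Below is a rough sketch in plain C rather than ARM asm. Everything in it is an assumption for illustration: the names (recip_table, recip_table_init, clz32, recip16), the 64-entry table size, and the restriction to a positive, unsigned 16:16 w. With these choices one refinement step gives roughly 14 good bits; a second step would roughly double that, up to the limits of the 32-bit arithmetic.

```c
#include <assert.h>
#include <stdint.h>

#define RECIP_TABLE_BITS 6                        /* 64 entries, 128 bytes (assumed size) */
#define RECIP_TABLE_SIZE (1u << RECIP_TABLE_BITS)

static uint16_t recip_table[RECIP_TABLE_SIZE];

/* Fill the table with 1/x sampled at the midpoint of each interval of
   x in [1, 2). Entries are 0.16 fractions in (0.5, 1). */
static void recip_table_init(void)
{
    for (uint32_t i = 0; i < RECIP_TABLE_SIZE; i++) {
        /* den = x_mid * 2 * RECIP_TABLE_SIZE, i.e. 129, 131, ..., 255 */
        uint32_t den = 2u * RECIP_TABLE_SIZE + 2u * i + 1u;
        uint32_t num = 1u << (17 + RECIP_TABLE_BITS);
        recip_table[i] = (uint16_t)((num + den / 2) / den);   /* rounded */
    }
}

/* Count leading zeros of a non-zero 32-bit value (portable loop; on ARM
   this would be a single CLZ instruction where available). */
static int clz32(uint32_t v)
{
    int n = 0;
    while ((v & 0x80000000u) == 0) {
        v <<= 1;
        n++;
    }
    return n;
}

/* Approximate 2^32 / w, i.e. the 16:16 reciprocal of a 16:16 value w > 0. */
static uint32_t recip16(uint32_t w)
{
    assert(w != 0);

    int shift = clz32(w);
    uint32_t m = w << shift;                      /* x = m / 2^31, in [1, 2) */
    uint32_t idx = (m >> (31 - RECIP_TABLE_BITS)) & (RECIP_TABLE_SIZE - 1);
    uint64_t y = (uint64_t)recip_table[idx] << 15; /* first guess at 1/x, 1.31 format */

    /* One Newton-Raphson step: y = y * (2 - x * y). */
    uint64_t p = ((uint64_t)m * y) >> 31;         /* x * y in 1.31, close to 1.0 */
    y = (y * ((UINT64_C(1) << 32) - p)) >> 31;

    /* 1/w in 16:16 equals y * 2^(shift + 1); y carries 31 fraction bits. */
    return (shift <= 30) ? (uint32_t)(y >> (30 - shift)) : (uint32_t)(y << 1);
}
```

Usage would be to call recip_table_init() once at startup, then replace the divide with invW = recip16(w). The two 64-bit products are the sort of thing the ARM long multiply handles directly, so I would expect this to map onto a handful of instructions, but I have not measured it.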

Cheers,
rcp