warmi: HAH! 270 sprites..

In plain C++ (well, template-meta C++, but almost)!! And that includes all the verification etc. I do for each coordinate, plus more... I would say the performance improvement is up to 40% (if both surfaces are aligned) compared to the previous code.
Right now I also have three subloops:
(1) both surfaces are aligned, or both are unaligned
Read source as DWORD. Check the whole DWORD, then each pixel, against the mask. Copy one pixel as a WORD, or both pixels as a DWORD, depending on the mask.
NOTE: If opacity is enabled and both pixels should be copied, a really fast 32-bit alpha blend (three multiplications for two pixels) is performed, or a quick 50/50 blend if opacity is set to 128.
(2) destination is unaligned
Read source as DWORD. Check the whole DWORD, then each pixel, against the mask. Copy each pixel as a WORD.
(3) source is unaligned
Read source as WORD. Check each pixel against the mask. Copy each pixel as WORD. Write destination as DWORD.
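
For anyone who wants to see the idea behind subloop (1) in code, here is a rough sketch, not the actual routine. The function name `keyed_copy_pairs` and the `key` parameter are made up for illustration, and it assumes 16-bit pixels where "mask" means a single transparent color-key value. It uses `std::memcpy` for the DWORD accesses so the compiler is free to emit a plain 32-bit load/store without aliasing trouble.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the original code): copy `count` 16-bit pixels
// from src to dst, skipping every pixel equal to the color key `key`.
inline void keyed_copy_pairs(std::uint16_t* dst, const std::uint16_t* src,
                             std::size_t count, std::uint16_t key)
{
    assert(dst != nullptr && src != nullptr);
    // The key replicated into both halves of a DWORD (same on any endianness).
    const std::uint32_t key2 = (std::uint32_t(key) << 16) | key;

    std::size_t i = 0;
    for (; i + 1 < count; i += 2) {
        std::uint32_t two;
        std::memcpy(&two, src + i, sizeof two);       // read source as one DWORD
        if (two == key2) continue;                    // both pixels transparent

        const std::uint16_t p0 = src[i];
        const std::uint16_t p1 = src[i + 1];
        if (p0 != key && p1 != key)
            std::memcpy(dst + i, &two, sizeof two);   // both visible: one DWORD store
        else if (p0 != key)
            dst[i] = p0;                              // only first visible: WORD store
        else
            dst[i + 1] = p1;                          // only second visible: WORD store
    }
    if (i < count && src[i] != key)                   // odd trailing pixel
        dst[i] = src[i];
}
```

The early `continue` is where most of the gain would come from on sprites with large transparent areas, since two pixels get rejected with one compare.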
The performance is around 7% slower than the optimized ASM routines posted earlier. Right now I am kind of satisfied with the performance. I have boosted the AlphaBltFast as well, and will do some benchmarks tomorrow...
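
As for the blending mentioned in the NOTE under (1), here is a sketch of the general idea, assuming RGB565 pixels (the post only says the pixels are WORDs, so the 565 layout is an assumption). `blend_half_2x565` is the usual 50/50 trick on two packed pixels. `blend565` is a commonly used packed blend for a single pixel with a 0..32 opacity scale; it is NOT the three-multiplications-for-two-pixels routine described above, just an illustration of the same family of tricks. Both function names are made up.

```cpp
#include <cassert>
#include <cstdint>

// 50/50 blend of two RGB565 pixel pairs packed in DWORDs (assumed layout).
// Clearing the low bit of every channel first keeps the shift from leaking
// a bit into the neighbouring channel or the neighbouring pixel.
inline std::uint32_t blend_half_2x565(std::uint32_t a, std::uint32_t b)
{
    const std::uint32_t m = 0xF7DEF7DEu;
    return ((a & m) >> 1) + ((b & m) >> 1);
}

// Single-pixel blend, alpha in 0..32 (32 = fully src). The pixel is spread
// out to 0x07E0F81F form (G moved to the upper half) so each channel has
// enough empty bits above it to hold the product without overflowing.
inline std::uint16_t blend565(std::uint16_t src, std::uint16_t dst,
                              std::uint32_t alpha)
{
    assert(alpha <= 32);
    const std::uint32_t m = 0x07E0F81Fu;
    const std::uint32_t s = (std::uint32_t(src) | (std::uint32_t(src) << 16)) & m;
    const std::uint32_t d = (std::uint32_t(dst) | (std::uint32_t(dst) << 16)) & m;
    const std::uint32_t r = ((s * alpha + d * (32u - alpha)) >> 5) & m;
    return static_cast<std::uint16_t>(r | (r >> 16));  // fold G back into place
}
```

An opacity of 128 on a 0..255 scale corresponds to roughly alpha = 16 here, which is the case the cheap shift-and-add version covers without any multiplication.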
