Away from email for two days and whoa... lots of new posts.
 
fzammetti: The GapiDraw BltFast operation is actually quite cool. It will try the following operations depending on how the surfaces are stored:
(1) memcpy (both surfaces have same pitch and size)
(2) memcpy each row (both surfaces have same yPitch and are wide)
(3) unrolled copy of each row
(4) aligned 32-bit copy source to destination
(5) source-aligned 32-bit read source to destination
(6) unaligned 16-bit pixel-by-pixel copy
Personally I don't like the DrawCircle function, mostly since it's impossible to get it accelerated by operating on multiple pixels (in contrast to DrawRect, DrawLine, etc). That's why it's not in GapiDraw, and most probably never will be. I'll put a feature in GapiDraw if it can be accelerated by analyzing the display orientation of the device. DrawCircle can't. But it's a trivial implementation.
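For anyone who wants a circle anyway, here is a rough sketch of the classic midpoint circle algorithm plotting one pixel at a time. Canvas, plot and drawCircle are made-up names for illustration and have nothing to do with the GapiDraw API; the canvas is assumed to be a plain 16-bit pixel array.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 16-bit canvas, not a GapiDraw surface.
struct Canvas {
    int width;
    int height;
    std::vector<uint16_t> pixels;

    Canvas(int w, int h)
        : width(w), height(h), pixels(static_cast<std::size_t>(w) * h, 0) {}

    // Writes one pixel, silently clipping anything outside the canvas.
    void plot(int x, int y, uint16_t color) {
        if (x >= 0 && x < width && y >= 0 && y < height) {
            pixels[static_cast<std::size_t>(y) * width + x] = color;
        }
    }
};

// Midpoint circle: walk one octant and mirror each point into the other seven.
void drawCircle(Canvas& cv, int cx, int cy, int r, uint16_t color) {
    int x = r;
    int y = 0;
    int err = 1 - r;  // decision variable for the midpoint test
    while (x >= y) {
        cv.plot(cx + x, cy + y, color);
        cv.plot(cx - x, cy + y, color);
        cv.plot(cx + x, cy - y, color);
        cv.plot(cx - x, cy - y, color);
        cv.plot(cx + y, cy + x, color);
        cv.plot(cx - y, cy + x, color);
        cv.plot(cx + y, cy - x, color);
        cv.plot(cx - y, cy - x, color);
        ++y;
        if (err < 0) {
            err += 2 * y + 1;
        } else {
            --x;
            err += 2 * (y - x) + 1;
        }
    }
}
```

Every plotted pixel lands at a scattered address, which is exactly why there is nothing to batch into wide writes the way a rectangle fill or a horizontal line can be.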
Pixel shaders are also not there, but as many have said, using Get/Release buffer solves that as well...
Keep those questions coming!