The tracking stuff was actually just a side goal of the main face-detection project, but now I'm finding it quite interesting. How's this for an idea:
Use face detection as a calibration step to determine which colors are "face". Then use those colors to filter the pixels picked up by the standard difference-between-frames motion detection, or just in a plain threshold operation. A colorbounds on the result should then give you a fairly good idea of where in the frame the user's face is, and from there you can use a little bit of edge detection for orientation calculation and location refinement.
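
To make the idea concrete, here is a rough sketch in plain Python. It is only an illustration under some assumptions: frames are lists of rows of (r, g, b) tuples, the face box is supplied by whatever face detector you already have, and the function names, the margin and the motion threshold are all made up for the example. The edge-detection refinement is left out.

```python
def calibrate_color_range(frame, face_box, margin=10):
    """Per-channel (min, max) of the pixels inside face_box, widened by margin."""
    # face_box = (x0, y0, x1, y1), inclusive; assumed to come from a face detector.
    x0, y0, x1, y1 = face_box
    pixels = [frame[y][x] for y in range(y0, y1 + 1) for x in range(x0, x1 + 1)]
    lo = tuple(max(0, min(p[c] for p in pixels) - margin) for c in range(3))
    hi = tuple(min(255, max(p[c] for p in pixels) + margin) for c in range(3))
    return lo, hi


def is_face_color(pixel, lo, hi):
    """True if every channel of pixel lies inside the calibrated range."""
    return all(lo[c] <= pixel[c] <= hi[c] for c in range(3))


def has_moved(before, after, threshold):
    """Frame differencing for one pixel: summed absolute channel change."""
    return sum(abs(before[c] - after[c]) for c in range(3)) > threshold


def locate_face(prev_frame, curr_frame, lo, hi, motion_threshold=30, use_motion=True):
    """Bounding box (x0, y0, x1, y1) of the face-colored pixels, or None.

    With use_motion=True only pixels that changed since prev_frame count;
    with use_motion=False this is just a color threshold on curr_frame.
    """
    xs, ys = [], []
    for y, (prev_row, curr_row) in enumerate(zip(prev_frame, curr_frame)):
        for x, (before, after) in enumerate(zip(prev_row, curr_row)):
            if use_motion and not has_moved(before, after, motion_threshold):
                continue
            if is_face_color(after, lo, hi):
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return min(xs), min(ys), max(xs), max(ys)
```

You would call calibrate_color_range once on a frame where the detector found the face, then locate_face on each new pair of frames. One thing to watch: with frame differencing, a face that holds still (or the flat interior of a moving face) produces no changed pixels, which is where the plain threshold variant (use_motion=False) may do better.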

It's a rough idea, but I think that it might be made to work.