Monday 28 July 2008

Behind the scenes of KB953311

You may think there is not a lot to investigate in a simple WaitForSingleObject (W4SO) call. With millions of lines of code using this API, you would expect a test coverage of 100 percent. But out of the blue it happens: W4SO doesn't do what is expected. And it's easy to prove.

If you read the description of KB953311, you may wonder what strange things a programmer must do to get into this trouble. If you read it again carefully, you will find out.

Almost any Windows CE driver that services an interrupt contains the described code:
· You set up an event,
· you initialize an interrupt with this event,
· you create an interrupt service thread,
· and you wait for your event.

But your event will (nearly) never be signaled, for example because it belongs to the power button or an overcurrent interrupt. You may wait with an INFINITE timeout and everything is fine. Or you set up a timeout and check the return value of W4SO. Then you meet the first condition for this problem.
The second condition is very easy to achieve: your system is simply at its limits.

The online help says W4SO returns WAIT_OBJECT_0 or WAIT_TIMEOUT and you believe it. But sometimes it returns WAIT_FAILED, which normally indicates an invalid handle. After hours of flawless operation your handle is suddenly invalid? How long do you think you will need to find this bug?
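To make the three possible outcomes concrete, here is a minimal sketch of how an interrupt service thread could classify the W4SO return value instead of treating everything that is not WAIT_OBJECT_0 as a timeout. The names IstAction, IstStats, ist_classify_wait and ist_record_wait are hypothetical and not taken from the KB article or any Microsoft sample. The fallback definitions exist only so the sketch compiles off target; their values are assumed to match the Windows headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(_WIN32) || defined(UNDER_CE)
#include <windows.h>
#else
/* Host stand-ins (assumption: same values as in the Windows headers). */
typedef uint32_t DWORD;
#define WAIT_OBJECT_0 ((DWORD)0x00000000)
#define WAIT_TIMEOUT  ((DWORD)0x00000102)
#define WAIT_FAILED   ((DWORD)0xFFFFFFFF)
#endif

/* Hypothetical: what the interrupt service thread should do next. */
typedef enum {
    IST_SERVICE_INTERRUPT, /* event signaled: handle the device */
    IST_PERIODIC_WORK,     /* timeout elapsed: housekeeping, then wait again */
    IST_WAIT_ERROR         /* WAIT_FAILED or any other unexpected value */
} IstAction;

/* Hypothetical counters that make a misbehaving wait visible. */
typedef struct {
    unsigned interrupts;
    unsigned timeouts;
    unsigned errors;
} IstStats;

/* Map a W4SO return value to an action; never fold errors into timeouts. */
static IstAction ist_classify_wait(DWORD rc)
{
    switch (rc) {
    case WAIT_OBJECT_0:
        return IST_SERVICE_INTERRUPT;
    case WAIT_TIMEOUT:
        return IST_PERIODIC_WORK;
    default:
        return IST_WAIT_ERROR;
    }
}

/* Classify the return value and update the matching counter. */
static IstAction ist_record_wait(IstStats *stats, DWORD rc)
{
    IstAction action;

    assert(stats != NULL);
    action = ist_classify_wait(rc);
    switch (action) {
    case IST_SERVICE_INTERRUPT:
        stats->interrupts++;
        break;
    case IST_PERIODIC_WORK:
        stats->timeouts++;
        break;
    default:
        stats->errors++;
        break;
    }
    return action;
}
```

In a real driver the value passed in would be the result of WaitForSingleObject on the interrupt event. A steadily climbing errors counter would then point at this problem much sooner than a silent fall-through into the timeout branch.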
Let's have a look at the interrupt processing.
If you call InterruptInitialize, your event becomes a special event that is handled in a different way. The scheduler must consider the change of the event's state after each interrupt. You have to wait on an interrupt event with W4SO; you cannot use WaitForMultipleObjects instead.

But what went wrong?
A thread with a higher priority than our interrupt service thread blocks the system for longer than your W4SO timeout. These circumstances may lead your W4SO to return WAIT_FAILED and never go back to normal WAIT_TIMEOUT operation. But if you trigger the interrupt, your interrupt service thread comes back to life. Microsoft was able to fix the problem with 6 lines of code.
With the shared source installed, we are lucky enough to be able to look into the file schedule.c (WinCE500/Private/Winceos/Coreos/Nk/Kernel).
You may try to understand how the scheduler works. To be honest, it isn't trivial.
But you may use a driver and a stress test program to see the problem.
And what do we learn? There is never 100 percent test coverage, even in such a central OS component as the scheduler.
