You can't spell "evil" without "vi".

Denys Vlasenko vda.linux at googlemail.com
Sun Oct 19 19:01:49 UTC 2008


On Sunday 19 October 2008 05:16:13 am Rob Landley wrote:
> On Saturday 18 October 2008 16:06:22 Ralf Friedl wrote:
> > Denys Vlasenko wrote:
> > > I think scheduler is at play here.
> > > Imagine a low-end system (say ~100MHz cpu) with rather poor capabilities
> > > wrt measuring time. No microsecond clock, just timer interrupts.
> 
> I don't seem to have gotten the original of this message, but I repeat: timer 
> ticks have nothing to do with it.  This is a race between two entirely 
> in-kernel events, neither of which involves the scheduler until the event is 
> already _over_.
> 
> > > If one runs Linux on such a system with HZ=100, poll() with timeout of
> > > 10ms is basically a "one timer tick timeout". I don't know how precise it
> > > can be.
> 
> Sleeps in the linux kernel can never return early unless interrupted, but they 
> can return arbitrarily late if the system is busy doing something else.  
> (This horrifies the realtime people, but that's a can of worms.)
> 
> When your sleep expires,

How can the kernel know that your sleep has expired on hardware where the only
"timer source" is a timer interrupt at 100 Hz?

Let's say a program did a poll(10) exactly at .00950000 seconds,
i.e. when 95% of the time between timer interrupts has already passed.
The kernel has no way of knowing that!
Should poll expire on the next timer interrupt (which would be early),
or on the one after it (which will very often result in 20 ms timeouts)?
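
Here is, in userspace C notation, a sketch of the arithmetic a purely
tick-driven kernel faces (HZ and the helper name are illustrative, not
actual kernel code): rounding the timeout up to whole ticks gives one
tick, and the only way to guarantee a minimum wait is to add one more,
because the kernel cannot tell how much of the current tick has already
elapsed:

#define HZ 100	/* one timer tick every 10 ms */

/* Illustrative only: convert a poll() timeout in ms to ticks.
 * Round up to whole ticks, then add one extra tick, because the
 * current tick interval may be almost over and a tick-only kernel
 * cannot know how much of it has already passed. */
static long timeout_to_ticks(long ms)
{
	long ticks = (ms * HZ + 999) / 1000;	/* 10 ms -> 1 tick */
	return ticks + 1;			/* pessimistic: 2 ticks = up to 20 ms */
}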

> your process gets taken off the wait queue it's on  
> and put back in the scheduler queue.  Your process _cannot_ be scheduled 
> before this happens, but when the scheduler actually has a spare time slice 
> to hand it is a separate issue.

I know this.

> > > To be exact, can kernel ensure that it won't be significantly 
> > > _smaller_ than 10ms?
> 
> Yes.  It can.  That's the whole POINT of sleeps.

This is possible if you have some sort of high-res timer,
not just timer ticks. Or if you have rather frequent timer ticks (1000 Hz).
Or both, which is generally true for today's desktops and servers.
There, yes, you can do lots of clever things, and you are right,
today's Linux kernel does them.
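
As an illustration (userspace notation, assuming clock_gettime() with
CLOCK_MONOTONIC is available), this is the kind of guarantee a
fine-grained clock buys you: you can actually verify that "at least
N ms" have passed instead of guessing in whole ticks:

#include <time.h>

/* Sketch only: wait at least 'ms' milliseconds. Verifying "at least"
 * is only possible because a clock finer than the tick exists. */
static void sleep_at_least(long ms)
{
	struct timespec start, now, nap = { 0, 1000000 };	/* 1 ms */

	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		nanosleep(&nap, NULL);
		clock_gettime(CLOCK_MONOTONIC, &now);
	} while ((now.tv_sec - start.tv_sec) * 1000000000LL
		 + (now.tv_nsec - start.tv_nsec) < ms * 1000000LL);
}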

But we are not yet at the point in time where 1000 Hz timer interrupts
and high-res timers are available in every device. In 2020, maybe.
In 2008, not yet.

So, lacking these hardware capabilities, the kernel has to be very pessimistic
and on every poll(10) assume the worst - that the current timer interrupt
interval is about to expire - and therefore the wakeup has to be delayed for
TWO timer interrupt intervals in order not to time out before 10 ms.
This decision will result in:

struct pollfd pfd = { .fd = 0, .events = POLLIN };	/* stdin, assumed to stay idle */

while (1) {
	poll(&pfd, 1, 10);	/* 10 ms timeout */
	write(1, ".", 1);
}

printing a dot every 20 ms, not 10. Surprise!

I am paranoid. I assume some kernels won't do that. I am presuming that some
kernels (call them broken if you wish) will wake up poll(10) on the next
timer interrupt, making it possible for poll(10) to return earlier than 10 ms.
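
This is easy to check from userspace. A minimal test program (assuming
gettimeofday() is good enough on the machine under test; it measures a
pure timeout with no fds, unlike the loop above) prints the real interval
of each poll(..., 10), so both early (<10000 us) and late (~20000 us)
wakeups become visible:

#include <poll.h>
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
	struct timeval before, after;
	int i;

	for (i = 0; i < 20; i++) {
		gettimeofday(&before, NULL);
		poll(NULL, 0, 10);	/* no fds, pure 10 ms timeout */
		gettimeofday(&after, NULL);
		printf("%ld us\n",
			(long)((after.tv_sec - before.tv_sec) * 1000000L
				+ (after.tv_usec - before.tv_usec)));
	}
	return 0;
}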

> > > What if poll() was called soon before our 10ms 
> > > timeslice was going to expire anyway?
> 
> Why do you believe time slices are 10ms?  Where did you get that from?  Even 
> the O(1) scheduler in 2.4 did timeslice aggregation, not sure about Linus's 
> original scheduler from 1992...

I believe that there is hardware with no timer source
other than the timer interrupt. On such hardware, the kernel can't make
finer-grained scheduling decisions.

> > Imagine a high-end system with HZ=1000 and poll with timeout of 1ms. Now
> > poll() happens to be called just 1µs before the end of the timeslice. If
> > the kernel wouldn't wait for *at least* the specified time, the behavior
> > would be undefined and such a timeout would be useless.
> 
> Time slices DON'T WORK THAT WAY ANYMORE.  They HAVEN'T FOR YEARS.  Jiffies are 
> a resource allocation thing, that's all.  Scheduling decisions don't wait for 
> time slices to end, it's a cheap O(1) action so it can be done on return from 
> every system call and most interrupts.

Ok, a process returns from a system call.
Without a high-res timer source (like the cycle counter on x86), how would you
know that it "ran long enough and we can evict it, giving the CPU to someone else"?
Since it got its CPU, did it run 1us? 10us? 999us? There is no way to know
until the next timer interrupt! On what data would you base
your finer-grained scheduling decisions?
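
To make the point concrete, here is a sketch of purely tick-based CPU
accounting (the structure and function names are mine, not real kernel
code): the only event that carries time information is the periodic
interrupt, so whoever happens to be running when it fires gets charged
a whole tick, and anything shorter than a tick is invisible:

/* Sketch only, not real kernel code. */
struct task {
	unsigned long ticks_used;	/* the only CPU-time counter we can keep */
	unsigned long timeslice;	/* in ticks */
};

struct task *current_task;

static void schedule(void)
{
	/* pick another runnable task (omitted) */
}

/* Fires HZ times per second. A task that ran 1 us or 9 ms since the
 * previous tick looks exactly the same: it is charged one full tick. */
void timer_interrupt(void)
{
	current_task->ticks_used++;
	if (current_task->ticks_used >= current_task->timeslice)
		schedule();	/* the only point where we know time has passed */
}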

> Processes get scheduled as soon as there's a processor to run them on that 
> isn't busy with something else.

This case is not interesting. I am thinking about the case where
num_runnable_processes >> num_cpus. In this case, processes can't
be scheduled "as soon as there's a processor to run them on".
There are no free CPUs. You have to evict someone.

--
vda


