On 21 Feb 2018, at 05:55, Kelly Dean wrote:
The article “Mixed-criticality support in seL4” at https://lwn.net/Articles/745946/ says seL4's solution to the problem of CPU budget exhaustion while in a non-reentrant shared data-structure server is timeout exceptions. The server might time out by chance, through no fault of its own.
A more robust solution is to simply disallow calling any such server that advertises a WCET longer than the client's per-period CPU budget, and when a call is made, have the kernel check whether the client's remaining CPU budget ≥ server's WCET. If it is, then proceed with the call as normal. If it isn't, then verify that the client's per-period CPU budget ≥ server's WCET. If that verification succeeds, then terminate the budget for the client's current period early, and actually enter the server at the beginning of the client's _next_ period. (If the verification fails, then the kernel tells the client that the call is disallowed.)
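A minimal sketch of the check described above, assuming the kernel can see the client's remaining budget, its per-period budget, and the server's advertised WCET. All names here are illustrative, not real seL4 kernel interfaces:

    #include <stdint.h>

    /* Sketch of the proposed admission check, run by the kernel when a
     * client invokes a non-reentrant shared server.  Illustrative only. */
    typedef enum { CALL_NOW, CALL_NEXT_PERIOD, CALL_DENIED } call_decision_t;

    call_decision_t admit_call(uint64_t client_remaining_budget,
                               uint64_t client_period_budget,
                               uint64_t server_wcet)
    {
        if (client_remaining_budget >= server_wcet) {
            /* Enough budget left in this period: proceed with the call. */
            return CALL_NOW;
        }
        if (client_period_budget >= server_wcet) {
            /* Not enough budget now, but a full period's budget suffices:
             * terminate the current budget early and enter the server at
             * the start of the client's next period. */
            return CALL_NEXT_PERIOD;
        }
        /* The server's WCET exceeds the client's per-period budget, so the
         * call is disallowed outright. */
        return CALL_DENIED;
    }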
What you’re describing is a classic example of a pessimistic approach, while the paper is advocating an optimistic one. This is analogous to optimistic vs pessimistic locking in databases, where optimistic locking achieves much better throughput. For example, the server’s average execution time may be a small fraction of its WCET (a fairly common case), but your proposal would disallow any invocation where the SC has less than the WCET left, even if that specific invocation may be short (and may be known to be short). It is also an issue of economy of mechanisms: the timeout is a simple, policy-free mechanism that can be put to many different uses. You are advocating that the kernel enforce policy (which isn’t the microkernel’s job) via a special-purpose mechanism. The seL4 philosophy is to keep things simple and policy-free.
If the server fails to return before exhausting the client's CPU budget, then the server has violated its own WCET, which is a bug. Thus, a non-buggy server will never time out, so timeout exception handlers handle only bugs, not routine system operation.
If the server's actual WCET is W, then from the client's point of view the server's apparent WCET is almost 2W. The client might make the call with remaining CPU budget X just barely less than W, in which case the kernel will terminate the client's current budget early (so the client loses X) and enter the server at the beginning of the client's next period; the server might then take W to return, making the client wait X+W altogether (plus, of course, the remainder of the period, during which other VMs run). But with seL4's current solution, the server's apparent WCET is even longer than 2W, because the server could run for X, time out, be restarted (taking time R), and then run for W, making the client wait X+R+W altogether.
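A small worked example of this arithmetic, with made-up numbers (W, X, and R as defined above; nothing here is measured, and the remainder of the period spent running other VMs is left out of both figures):

    #include <stdio.h>

    int main(void)
    {
        double W = 10.0;   /* server WCET, in ms */
        double X = 9.9;    /* remaining budget at call time, just under W */
        double R = 1.0;    /* assumed cost of restarting the server after a timeout */

        /* Proposed scheme, worst case: the client forfeits X, then the
         * server runs for up to W in the next period -- approaching 2W. */
        double deferred_call = X + W;

        /* Timeout-exception scheme, worst case: the server runs for X,
         * times out, is restarted (R), then runs for W from scratch. */
        double timeout_restart = X + R + W;

        printf("deferred call: %.1f ms, timeout+restart: %.1f ms\n",
               deferred_call, timeout_restart);
        return 0;
    }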
The server’s apparent WCET is only 2W if the client fails to ensure that it has enough budget left. That’s a client bug; the client can ensure that it invokes the server with a fresh budget. It’s not the kernel’s job to ensure that buggy clients get optimal service.

Gernot