How to avoid timeout exceptions in non-buggy server code
The article “Mixed-criticality support in seL4” at https://lwn.net/Articles/745946/ says seL4's solution to the problem of CPU budget exhaustion while in a non-reentrant shared data-structure server is timeout exceptions. The server might time out by chance, through no fault of its own. A more robust solution is to simply disallow calling any such server that advertises a WCET longer than the client's per-period CPU budget, and when a call is made, have the kernel check whether the client's remaining CPU budget ≥ server's WCET. If it is, then proceed with the call as normal. If it isn't, then verify that the client's per-period CPU budget ≥ server's WCET. If that verification succeeds, then terminate the budget for the client's current period early, and actually enter the server at the beginning of the client's _next_ period. (If the verification fails, then the kernel tells the client that the call is disallowed.) If the server fails to return before exhausting the client's CPU budget, then the server has violated its own WCET, which is a bug. Thus, a non-buggy server will never time out, so timeout exception handlers handle only bugs, not routine system operation. If the server's actual WCET is W, then from the client's point of view, the server's apparent WCET is almost 2W, because the client might make the call with remaining CPU budget X such that X is just barely less than W, in which case the kernel will terminate the client's current budget early (so the client loses X) and enter the server at the beginning of the next period, and the server might then take W to return, making the client wait X+W altogether (plus, of course, the remainder of the period, during which other VMs run). But with seL4's current solution, the server's apparent WCET is even longer than 2W, because the server could run for X, time out, be restarted (taking time R), then run for W, making the client wait X+R+W altogether.
On 21 Feb 2018, at 05:55, Kelly Dean <kelly@prtime.org> wrote:
The article “Mixed-criticality support in seL4” at https://lwn.net/Articles/745946/ says seL4's solution to the problem of CPU budget exhaustion while in a non-reentrant shared data-structure server is timeout exceptions. The server might time out by chance, through no fault of its own.
A more robust solution is to simply disallow calling any such server that advertises a WCET longer than the client's per-period CPU budget, and when a call is made, have the kernel check whether the client's remaining CPU budget ≥ server's WCET. If it is, then proceed with the call as normal. If it isn't, then verify that the client's per-period CPU budget ≥ server's WCET. If that verification succeeds, then terminate the budget for the client's current period early, and actually enter the server at the beginning of the client's _next_ period. (If the verification fails, then the kernel tells the client that the call is disallowed.)
What you’re describing is a classic example of a pessimistic approach, while the paper is advocating an optimistic one. This is analogous to optimistic vs pessimistic locking in databases, where optimistic achieves much better throughput. For example, the server’s average execution time may be a small fraction of it’s WCET (a fairly common case), but your proposal would disallow any invocations where the SC has less than the WCET left, even if the specific invocation may be short (and may be known to be short). It is also an issue of economy of mechanisms: the timeout is a simple, policy-free mechanism that can be put to many different uses. You are advocating the kernel enforce policy (which isn’t the microkernel’s job) via a special-purpose mechanism. seL4 philosophy is to keep things simple and policy-free.
If the server fails to return before exhausting the client's CPU budget, then the server has violated its own WCET, which is a bug. Thus, a non-buggy server will never time out, so timeout exception handlers handle only bugs, not routine system operation.
If the server's actual WCET is W, then from the client's point of view, the server's apparent WCET is almost 2W, because the client might make the call with remaining CPU budget X such that X is just barely less than W, in which case the kernel will terminate the client's current budget early (so the client loses X) and enter the server at the beginning of the next period, and the server might then take W to return, making the client wait X+W altogether (plus, of course, the remainder of the period, during which other VMs run). But with seL4's current solution, the server's apparent WCET is even longer than 2W, because the server could run for X, time out, be restarted (taking time R), then run for W, making the client wait X+R+W altogether.
The server’s apparent WCET is only 2W if the client fails to ensure that it’s got enough budget left. That’s a client bug, the client can ensure that it invokes the server with a fresh budget. It’s not the kernel’s job to ensure that buggy clients get optimal service. Gernot
Gernot.Heiser@data61.csiro.au writes:
The server’s apparent WCET is only 2W if the client fails to ensure that it’s got enough budget left. That’s a client bug, the client can ensure that it invokes the server with a fresh budget. It’s not the kernel’s job to ensure that buggy clients get optimal service.
I see. Though wouldn't it make more sense for the server to check, instead of the client? Because the server needs to check anyway, so that a misbehaving client can't DoS other clients by repeatedly calling the server with insufficient budget and forcing the server to repeatedly time out and recover. And because the server is in a better position to handle its data-dependent execution times (i.e. as you said, not just pessimistically bail if remaining budget < WCET). So, the client can call unconditionally, the server can return an error code whenever it discovers it won't be able to finish in time, and the client's error checking doesn't depend on timing info about the server. Anyway, that made me think of where else limited server time is a problem. At 19:13 in your presentation “Mixed-criticality support in seL4” at https://www.youtube.com/watch?v=ijTTZgQ8cB4 you gave the example of network drivers needing timely response to avoid packet loss. The solution at 30:47 was a scheduling context with high priority, short period, and budget<period to limit utilization, which allowed a deadline guarantee for a higher-criticality thread, and allowed a low-criticality thread to use the slack. But network traffic might be mixed criticality. For example, real-time sensor data mixed with bulk file transfers. Suppose the most critical thread must be allocated 90% to cover its worst case utilization, but its average is 20%. The network driver then can be allocated only 10%, but it's plenty to transceive the critical traffic. Suppose also that to saturate the network link, the bulk transfer thread (which runs as slack) would need 50% and the network driver would need 20% (twice its allocation) to transceive fast enough. In this case, total utilization will average less than 80%, and the bulk transfer will be unnecessarily throttled. One solution would be to assign two contexts to the network driver: the standard 10% high priority context and a 100% low-priority context, and run the driver whenever either context allows. Then average total utilization could rise to 90% (with 20% for the network driver) to avoid throttling the bulk transfer. But at 23:04 in your presentation, you explained that with traditional time slicing, one client can DoS another if it doesn't have to pay the time cost of running a high-priority server, and the solution at 32:03 is running the server using the client's budget, which gives me another idea: why give the network driver an independent budget at all? Even if the driver needs a short period, it still makes sense to run using the client's budget. It's pointless to avoid dropping packets if the client is too slow to consume them anyway, which means the client must be guaranteed suitable utilization; therefore, simply guarantee enough additional utilization to cover the cost of the network driver's service, and when the driver transceives data on the client's behalf, charge the client for the time. That way, there's no need for a second driver context to fully serve the slack thread, nor even a need for one complete context (with budget) to serve the most critical thread. BTW, at 40:42, you described a UAV system with an A15-based mission board and an M3-based flight control board, which you said you would have designed to be more simple if not representing a legacy system. In particular, would you eliminate the separate flight control MCU, and run everything on one CPU?
Kelly Dean writes:
But network traffic might be mixed criticality. For example, real-time sensor data mixed with bulk file transfers. Suppose the most critical thread must be allocated 90% to cover its worst case utilization
Er, I worded that poorly. Should have been explicit that all the example percentages I gave are of CPU utilization. Maybe network link capacity could be divvied in a similar way to CPU capacity, but I was only talking about the latter.
On 24 Feb 2018, at 04:25, Kelly Dean <kelly@prtime.org> wrote:
Gernot.Heiser@data61.csiro.au writes:
The server’s apparent WCET is only 2W if the client fails to ensure that it’s got enough budget left. That’s a client bug, the client can ensure that it invokes the server with a fresh budget. It’s not the kernel’s job to ensure that buggy clients get optimal service.
I see. Though wouldn't it make more sense for the server to check, instead of the client? Because the server needs to check anyway, so that a misbehaving client can't DoS other clients by repeatedly calling the server with insufficient budget and forcing the server to repeatedly time out and recover. And because the server is in a better position to handle its data-dependent execution times (i.e. as you said, not just pessimistically bail if remaining budget < WCET). So, the client can call unconditionally, the server can return an error code whenever it discovers it won't be able to finish in time, and the client's error checking doesn't depend on timing info about the server.
The server doesn’t do any checking, the kernel enforces the budget and the server uses the timeout exception to clean up. How is a system-defined policy. Eg the time-fault handler can suspend a misbehaving client if the system is so configured. The server attempting to gate-keep and ensure that there’s enough time left for the request would be quite complex and probably expensive, as it would need a full model of its own execution. It would also be very pessimistic, given the pessimism of worst-case execution-time (WCET) analysis on pipelined processors with caches. Much better to make the client responsible for its own fate and kick it out if it misbehaves.
Anyway, that made me think of where else limited server time is a problem. At 19:13 in your presentation “Mixed-criticality support in seL4” at https://www.youtube.com/watch?v=ijTTZgQ8cB4 you gave the example of network drivers needing timely response to avoid packet loss. The solution at 30:47 was a scheduling context with high priority, short period, and budget<period to limit utilization, which allowed a deadline guarantee for a higher-criticality thread, and allowed a low-criticality thread to use the slack.
But network traffic might be mixed criticality. For example, real-time sensor data mixed with bulk file transfers. Suppose the most critical thread must be allocated 90% to cover its worst case utilization, but its average is 20%. The network driver then can be allocated only 10%, but it's plenty to transceive the critical traffic. Suppose also that to saturate the network link, the bulk transfer thread (which runs as slack) would need 50% and the network driver would need 20% (twice its allocation) to transceive fast enough. In this case, total utilization will average less than 80%, and the bulk transfer will be unnecessarily throttled.
One solution would be to assign two contexts to the network driver: the standard 10% high priority context and a 100% low-priority context, and run the driver whenever either context allows. Then average total utilization could rise to 90% (with 20% for the network driver) to avoid throttling the bulk transfer.
Not sure how realistic this scenario is. You seem to assume that somehow there is a notion of packet priority on the network? In any case, there is no conceptual problem with the driver having two threads, running with different prios and SC. While a single thread could get SCs donated from different sources, this would have to come with explicit lowering of its prio when receiving the low-crit budget. Moreover, in your scenario, the driver, handling low and high traffic, would have to be trusted and assured to high criticality. All doable, but I don’t think I’d want this sort of complexity and extra concurrency control in a critical system.
But at 23:04 in your presentation, you explained that with traditional time slicing, one client can DoS another if it doesn't have to pay the time cost of running a high-priority server, and the solution at 32:03 is running the server using the client's budget, which gives me another idea: why give the network driver an independent budget at all? Even if the driver needs a short period, it still makes sense to run using the client's budget. It's pointless to avoid dropping packets if the client is too slow to consume them anyway, which means the client must be guaranteed suitable utilization; therefore, simply guarantee enough additional utilization to cover the cost of the network driver's service, and when the driver transceives data on the client's behalf, charge the client for the time.
Not that simple: the driver needs to execute on an interrupt (or are you assuming polled I/O)? The model supports passive drivers, it can wait both for clients IPCing it (with SC donation) and on a Notification (semaphore) that delivers the interrupt and also donates an SC. However, in all those scenarios you’re making the driver high-crit, and my example was all about it being low.
That way, there's no need for a second driver context to fully serve the slack thread, nor even a need for one complete context (with budget) to serve the most critical thread.
I’m not sure I’m still following your scenario, and what it has to do with your initial idea that a server should gate-keep client time.
BTW, at 40:42, you described a UAV system with an A15-based mission board and an M3-based flight control board, which you said you would have designed to be more simple if not representing a legacy system. In particular, would you eliminate the separate flight control MCU, and run everything on one CPU?
Yes, that’s the point. With MCS support, the whole setup could run on the A15 with seL4. In fact we’re building this version (but it’s a background activity and we don’t have much in-house control know-how, which slows things down). Gernot
Gernot.Heiser@data61.csiro.au writes:
The server attempting to gate-keep and ensure that there’s enough time left for the request would be quite complex and probably expensive, as it would need a full model of its own execution. It would also be very pessimistic, given the pessimism of worst-case execution-time (WCET) analysis on pipelined processors with caches. Much better to make the client responsible for its own fate and kick it out if it misbehaves.
By “misbehave”, you mean call the server with insufficient remaining budget? If it's infeasible for the server to know whether the remaining budget is sufficient, how is it feasible for the client to know?
One solution would be to assign two contexts to the network driver: the standard 10% high priority context and a 100% low-priority context, and run the driver whenever either context allows. Then average total utilization could rise to 90% (with 20% for the network driver) to avoid throttling the bulk transfer.
Not sure how realistic this scenario is. You seem to assume that somehow there is a notion of packet priority on the network?
No; the network is saturated, but not entirely by the bulk transfer. The bulk transfer can't monopolize the CPU, so it can't block other threads from sending/receiving traffic too. I assume the critical thread's traffic is small and gets through without a problem. The point of my example was just demand for network driver service exceeding what it could supply with a fixed 10% CPU utilization limit, and available CPU time being left unused instead of being used to satisfy the demand.
in your scenario, the driver, handling low and high traffic, would have to be trusted and assured to high criticality.
If the high-criticality client depends on network access, then it must trust the network driver to provide that access, but that's the case not only in my scenario, but also in yours or any other one. But the client doesn't have to trust the network driver not to monopolize the CPU in either scenario. So I'm not sure what distinction you're making.
All doable, but I don’t think I’d want this sort of complexity and extra concurrency control in a critical system.
Suppose the network and NIC hardware are fast enough, relative to the CPU, that the network driver can't saturate them even at 100% CPU utilization. How do you choose the utilization to allocate to the driver? Do you add up the worst-case data rates of the critical (i.e. non-slack) threads, figure out the CPU utilization that the driver needs to sustain that total rate, and statically allocate accordingly?
why give the network driver an independent budget at all? Even if the driver needs a short period, it still makes sense to run using the client's budget. It's pointless to avoid dropping packets if the client is too slow to consume them anyway, which means the client must be guaranteed suitable utilization; therefore, simply guarantee enough additional utilization to cover the cost of the network driver's service, and when the driver transceives data on the client's behalf, charge the client for the time.
Not that simple: the driver needs to execute on an interrupt (or are you assuming polled I/O)? The model supports passive drivers, it can wait both for clients IPCing it (with SC donation) and on a Notification (semaphore) that delivers the interrupt and also donates an SC.
I made a mistake; the network driver does need an independent budget, but only a very small one, and my main idea still works. Suppose client thread C and network driver D have periods Cp and Dp and budgets Cb and Db. C authorizes D to use some part Cb_d of Cb, and the system accordingly creates a new scheduler context C_D_sc for use by D, with period Dp and budget Cb_d*Dp/Cp (thus with the same max CPU utilization Cb_d/Cp that C delegated, and budget normalized to D's period). Whenever C_D_sc is invoked, the time used in it is then deducted from the budget for the next invocation of C's scheduler context. Thus, C just chooses a limit; it doesn't delegate a fixed amount of utilization, so it still gets to use whatever D doesn't. When packets arrive from the network, D discovers them (while running in its main scheduler context) either immediately due to an interrupt, or by polling after at most Dp. Db is just long enough for D to examine the headers (to determine the destination clients) of as many packets as can arrive during Dp. For each client, D switches to the corresponding context C_D_sc and delivers C's packets. When switching to C_D_sc, deduct from its budget (and thus ultimately from C's scheduler context) the small amount of time that D consumed in its main scheduler context on C's behalf, which is the quantity of packets being delivered times the per-packet header reading latency; this way, D effectively never independently utilizes the CPU so long as all incoming packets have willing and able recipients. This is why I originally said D needs no independent budget; I just forgot about the initial time needed to figure out which client to charge. For packets that are corrupted, or destined for dead destination clients, or for clients with expired delegated time, etc, D does independently utilize the CPU, just enough to examine and drop the packets. In contrast to data reception, data transmission has trivial accounting. C simply calls D synchronously, which then runs as an ordinary server for that invocation; if C's remaining budget is insufficient to send the packet, then the packet is dropped. Or, call asynchronously, and D can process it in C_D_sc, since that accounting mechanism is necessary anyway for data reception.
However, in all those scenarios you’re making the driver high-crit, and my example was all about it being low.
Even in your example it's high-crit in the sense that regardless of when or how often it runs, it could misbehave and drop packets that high-crit threads depend on. So I assume you're just saying that in your example it's low-crit in the sense that it can't monopolize the CPU. But in my scenario, it can't monopolize the CPU either. Maybe my mistake about the budget mislead you. Sorry for moving the goal post.
I’m not sure I’m still following your scenario, and what it has to do with your initial idea that a server should gate-keep client time.
Separate idea. The latter just brought the former to mind.
participants (2)
-
Gernot.Heiser@data61.csiro.au
-
Kelly Dean