
On 9 Mar 2025, at 09:04, Demi Marie Obenour via Devel <devel@sel4.systems> wrote:
On 3/8/25 4:49 PM, Gernot Heiser via Devel wrote:
Why so complicated?
The equivalence of putting the functionality in the kernel, where the kernel has access to all user state, is to have a trusted server which has access to all client state (especially CSpaces). The TCB is the same, but you don’t break verification and retain the option to verify the trusted server. This approach will work for all designs.
In a specific system design you’re likely have opportunities to simplify/modularise/reduce the power of the server, making verification easier. But even the model of the all-powerful server is better (more assured) than the model of hacking the kernel and completely abandoning verification.
There are a couple classes of designs I am not sure how to implement efficiently in this model:
1. Systems that run on many loosely coupled cores where statically partitioning the kernel heap requires either reserving too much memory at startup, or playing silly tricks like adding and removing page table entries in response to page faults to keep the kernel heap size down.
How does Linux do this? As I keep saying, you’ll need to look at seL4 IPC (or PPC as I prefer to call it) as the seL4 equivalent of a Linux system call: the mechanism to switch protection state. In Linux you don’t call into the kernel on a different core, you change protection state on you own core (and the kernel may chose to hand off to a worker thread somewhere else). In that model it’s clear that you invoke the server on the same core. Depending on design, this may be invoking a shared, passive server locally, which hand off to a worker. Or you invoke a per-core server (meaning there’s a server thread and endpoint per core (or group of closely coupled core). This is system-specific policy. Trying to do this all in the kernel just forces policy in the kernel, which is exactly what a microkernel should avoid.
2. Systems where the syscall overhead is too high because of speculative execution mitigations. On x86, each context switch between userspace processes requires an IBPB, but entering the kernel often doesn’t.
This is a non-sequitur. You need to take action for timing-channel prevention when you cross a security domain, Going into and out of the kernel doesn’t, and as such doesn’t require such action. The same is going into and out of a trusted server. Preventing the server from leaking through any shared state (architected or not) is not different from preventing the kernel from leaking through shared state.
I suspect there is other hardware that tags various internal data structures with the current execution privilege as well, and requires costly flushes when switching between mutually distrusting code running at the same CPU privilege level. The trusted server requires 2 such switches, whereas kernel extensions only need 1.
The typical fallacy of trying to solve securty problems “transparently” in hardare, which always forces doing things unnecessarily. Having said that, the argument that syscalls/IPC overhead limits performance doesn’t become more true by repeating it (it’s seeing a renaissance in the scientific literature too). I’m yet to see a realistic use case where seL4 IPC cost is performance limiting, except with poor user-level design. Gernot