How to (and how not to) use seL4 IPC
Folks might be interested in this blog: https://microkerneldude.wordpress.com/2019/03/07/how-to-and-how-not-to-use-s... Gernot
On 3/7/19, Gernot.Heiser@data61.csiro.au <Gernot.Heiser@data61.csiro.au> wrote:
Folks might be interested in this blog: https://microkerneldude.wordpress.com/2019/03/07/how-to-and-how-not-to-use-s...
The IPC transport layer for UX/RT will follow these rules for the most part AFAICT (although there is one possible caveat). It will use an unstructured call-return model in which every IPC transport layer call is either a read that requests a specified number of bytes with a specified offset and expects either data or an error code, or a write that sends a specified number of bytes with a specified offset and expects a status code (unlike most IPC transport layers for microkernel OSes there will be no support for arbitrary typed messages or marshalling, with both of those left to higher-level libraries). There will be a short message API using abstract message registers (which may or may not be directly mapped onto seL4 message registers depending onto the situation), a long message API using a shared buffer, and a traditional Unix-style read()/write() API that may use either, with all three being interchangeable (messages will be automatically copied when necessary by the transport layer). All IPC endpoints will be presented as named files much as in Plan 9 (UX/RT will go even further in its file orientation in that all process memory will be presented as files as well). The raw seL4 IPC API won't be exposed to user processes, since the transport layer API will map onto it pretty closely, and I want UX/RT to have only one IPC transport layer that is compatible even between processes using different APIs (and reads and writes will not be required to go through an intermediary server so there will be no performance reason to bypass the transport layer). The caveat is that seL4 may have SMP scalability issues with a general-purpose OS running a wide variety of applications on a wide variety of hardware like UX/RT will be. In a general-purpose OS, processes shouldn't have to be statically assigned to run on specific cores (although of course that should be an option), and they shouldn't have to care about which core a server is running on. Passive servers don't completely solve the problem since I want UX/RT to be compatible with code from other OSes that don't have such a concept, and it should be possible for arbitrary processes (even something like a shell script) to serve filesystems as in QNX and Plan 9. Linux in a VM is not a substitute for native Unix compatibility since my goal is to make an advanced Unix-like OS in which all processes benefit from safer code and extensible multi-server architecture, rather than forcing anything not specifically ported to run in a "penalty box" where it still suffers from most of the problems with conventional OSes. If I can't make seL4 scale well for UX/RT I may end up having to fork it. Of all the existing open source microkernels it is still the closest to what I consider ideal in terms of architecture, despite the possible SMP scalability issues.
On 10 Mar 2019, at 17:13, Andrew Warkentin <andreww591@gmail.com<mailto:andreww591@gmail.com>> wrote: The caveat is that seL4 may have SMP scalability issues with a general-purpose OS running a wide variety of applications on a wide variety of hardware like UX/RT will be. In a general-purpose OS, processes shouldn't have to be statically assigned to run on specific cores (although of course that should be an option), and they shouldn't have to care about which core a server is running on. You need to keep in mind that, as I said in my blog, seL4 is designed as a minimal wrapper, just enough to allow securely multiplexing hardware. seL4 not migrating threads on its own is a feature, not a bug, as any migration would require a policy, and that needs to be defined at user level. seL4 gives you all the mechanisms you need. Eg you can implement a Linux-like scheduler on top (in fact, a simplified version of this was done in our EuroSys’18 paper on scheduling contexts, i.e. the one describing the MCS kernel, https://ts.data61.csiro.au/publications/csiroabstracts/Lyons_MAH_18.abstract...). AFAIK, seL4 doesn’t have scalability issues if used properly (and it’s not the kernel’s job to support ill-considered uses). To understand this, think of a typical Arm multicore SoC: 4–16 or so cores, shared L2. Inter-core cache line migration latency 10–20 cycles. Even assuming a syscall needs to access about 10 write-shared lines, that’s still well below the syscall latency. A shared kernel image makes sense, as anything else would force high overheads (from explicit cross-core communication/synchronisation at user level). Now think of a high-end x86 server processor. Latency of shipping a single cache line from one core at the periphery to the opposite end costs many 100s, probably >1,000 cycles, just for a single line. If the syscall needs 10 of those, then the latency of migrating cache lines is over an OoM larger than the cost of a fast syscall. It makes no sense whatsoever to do this, you’d be forcing high overheads onto every syscall. A single, shared seL4 image only makes sense in a domain where the cache-line migration cost remains small compared to the syscall cost. Beyond that, you want separate kernel images, i.e. a multikernel. But this doesn’t limit what you can do on top. If you want to run a single Unix system across the whole thing, this can make sense (Unix syscalls are OmM more expensive than seL4 syscalls, so the trade-offs are quite different). And you can do this across a set of seL4 clusters, they basically present you with a NUMA system. Which makes sense, as the hardware will be NUMA as well. This clustered multikernel story is described in a paper that’s now a few years old: https://ts.data61.csiro.au/publications/nictaabstracts/Peters_DEH_15.abstrac... Unfortunately, we never got around to build the middleware supporting this (as people are throwing money at us to do other cool things), so you’re on your own there for now. But there’s nothing inherent in seL4 that stops you. If I can't make seL4 scale well for UX/RT I may end up having to fork it. Of all the existing open source microkernels it is still the closest to what I consider ideal in terms of architecture, despite the possible SMP scalability issues. As I said, I don’t think there is a scalability issue. And forking is generally a bad idea, in the case of seL4 it’s an especially bad one, as you’ll lose its biggest asset, the verification story (or have to do it yourself). Gernot
On 3/10/19, Gernot.Heiser@data61.csiro.au <Gernot.Heiser@data61.csiro.au> wrote:
You need to keep in mind that, as I said in my blog, seL4 is designed as a minimal wrapper, just enough to allow securely multiplexing hardware. seL4 not migrating threads on its own is a feature, not a bug, as any migration would require a policy, and that needs to be defined at user level. seL4 gives you all the mechanisms you need. Eg you can implement a Linux-like scheduler on top (in fact, a simplified version of this was done in our EuroSys’18 paper on scheduling contexts, i.e. the one describing the MCS kernel, https://ts.data61.csiro.au/publications/csiroabstracts/Lyons_MAH_18.abstract...).
Yes, it does make sense for stuff like thread migration to be left up to user code.
AFAIK, seL4 doesn’t have scalability issues if used properly (and it’s not the kernel’s job to support ill-considered uses).
To understand this, think of a typical Arm multicore SoC: 4–16 or so cores, shared L2. Inter-core cache line migration latency 10–20 cycles. Even assuming a syscall needs to access about 10 write-shared lines, that’s still well below the syscall latency. A shared kernel image makes sense, as anything else would force high overheads (from explicit cross-core communication/synchronisation at user level).
Now think of a high-end x86 server processor. Latency of shipping a single cache line from one core at the periphery to the opposite end costs many 100s, probably >1,000 cycles, just for a single line. If the syscall needs 10 of those, then the latency of migrating cache lines is over an OoM larger than the cost of a fast syscall. It makes no sense whatsoever to do this, you’d be forcing high overheads onto every syscall.
A single, shared seL4 image only makes sense in a domain where the cache-line migration cost remains small compared to the syscall cost. Beyond that, you want separate kernel images, i.e. a multikernel.
But this doesn’t limit what you can do on top. If you want to run a single Unix system across the whole thing, this can make sense (Unix syscalls are OmM more expensive than seL4 syscalls, so the trade-offs are quite different). And you can do this across a set of seL4 clusters, they basically present you with a NUMA system. Which makes sense, as the hardware will be NUMA as well.
My plan is to implement Unix-like read()/write()-family APIs as close to the microkernel as possible, with both being implemented within the low-level system library as a single kernel call in the client to send and a single kernel call in the server to reply whenever possible; the message payload would be either stuffed into the available registers (registers will be reserved for type, offset, and length) if it is short enough or placed in a shared buffer if it is longer. As I said before there will be variants of read and write that will expose the message registers or shared buffer in a more direct fashion but these will interoperate with each other and the traditional API; since the seL4 API won't be exported to normal user code at all, the Unix-like transport layer will have to be as efficient as possible. All other Unix calls besides read()/write()/seek()-family APIs will be implemented either in servers or purely in libraries. Is there anything that would prevent a Unix-like transport layer on seL4 from using a single kernel call to send and a single kernel call to reply?
This clustered multikernel story is described in a paper that’s now a few years old: https://ts.data61.csiro.au/publications/nictaabstracts/Peters_DEH_15.abstrac... Unfortunately, we never got around to build the middleware supporting this (as people are throwing money at us to do other cool things), so you’re on your own there for now. But there’s nothing inherent in seL4 that stops you.
Is the kernel-side clustering code in mainline seL4?
As I said, I don’t think there is a scalability issue. And forking is generally a bad idea, in the case of seL4 it’s an especially bad one, as you’ll lose its biggest asset, the verification story (or have to do it yourself).
I don't plan to fork seL4 unless there is some major difference that would either significantly limit scalability or cause significant overhead for a Unix-like IPC transport layer.
participants (2)
-
Andrew Warkentin
-
Gernot.Heiser@data61.csiro.au