On Mon, Apr 13, 2020, 7:42 AM Andrew Warkentin <andreww591@gmail.com> wrote:
On 4/12/20, Heiser, Gernot (Data61, Kensington NSW) <Gernot.Heiser@data61.csiro.au> wrote:
Sure, OS structure matters a lot, and I’m certainly known for consistently telling people that IPC payloads of more than a few words are an indicator of a poor design. Microkernel IPC should be considered a strong protected function call mechanism, and you shouldn’t pass more by-value data than you would to a C function (see
https://microkerneldude.wordpress.com/2019/03/07/how-to-and-how-not-to-use-s...). My
I would think that an RPC layer with run-time marshaling of arguments, as used for the IPC transport layer on most microkernel OSes, would add some overhead even if it uses the underlying IPC layer properly, since it has to iterate over the list of arguments, determine the type of each, and copy it to a buffer, with the reverse happening on the receiving end. Passing around bulk unstructured/opaque data is quite common (e.g. for disk and network transfers), and an RPC-based transport layer adds unnecessary overhead and complexity to such use cases.
I think a better transport layer design (for an L4-like kernel at least) would be one that maintains RPC-like call-return semantics, but exposes message registers and a shared memory buffer almost directly with the only extra additions being a message length, file offset, and type code (to indicate whether a message is a short register-only message, a long message in the buffer, or an error) rather than using marshaling. This is what I plan to do on UX/RT, which will have a Unix-like IPC transport layer API that provides new read()/write()-like functions that operate on message registers or the shared buffer rather than copying as the traditional versions do (the traditional versions will also still be present of course, implemented on top of the "raw" versions).
RPC with marshaling could easily still be implemented on top of such a transport layer (for complex APIs that need marshaling) with basically no overhead compared to an RPC-based transport layer.
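To make the idea a bit more concrete, here is a minimal sketch of what such a message descriptor might look like. All names (msg_desc, endpoint, msg_write_regs, msg_write_buf, MSG_NREGS) are hypothetical and chosen for illustration only; this is not the actual UX/RT API, and the "registers" are just an array standing in for kernel message registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names throughout; an illustration, not the UX/RT API. */
enum msg_type { MSG_SHORT, MSG_LONG, MSG_ERROR };

#define MSG_NREGS 4 /* assumed number of payload-carrying message registers */

/* The only metadata added on top of the raw payload. */
struct msg_desc {
    enum msg_type type; /* register-only, in shared buffer, or error */
    size_t len;         /* payload length in bytes */
    int64_t offset;     /* file offset the transfer applies to */
};

struct endpoint {
    uintptr_t regs[MSG_NREGS]; /* stand-in for kernel message registers */
    unsigned char *shared_buf; /* buffer shared between client and server */
    size_t shared_cap;         /* capacity of shared_buf in bytes */
};

/* Short message: the payload travels in the message registers. */
static struct msg_desc msg_write_regs(struct endpoint *ep, const void *data,
                                      size_t len, int64_t offset)
{
    struct msg_desc d = { MSG_ERROR, 0, offset };
    if (len > sizeof ep->regs)
        return d; /* does not fit in the registers */
    memcpy(ep->regs, data, len);
    d.type = MSG_SHORT;
    d.len = len;
    return d;
}

/* Long message: the caller has already produced the payload in place in
 * ep->shared_buf, so nothing is copied or marshaled here. */
static struct msg_desc msg_write_buf(const struct endpoint *ep, size_t len,
                                     int64_t offset)
{
    struct msg_desc d = { MSG_ERROR, 0, offset };
    assert(ep->shared_buf != NULL);
    if (len > ep->shared_cap)
        return d; /* larger than the shared buffer */
    d.type = MSG_LONG;
    d.len = len;
    return d;
}
```

The point of the sketch is that the descriptor carries only a type code, a length, and an offset; an RPC layer that needs typed arguments could marshal into the same registers or buffer on top of this.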
However, microkernel OS structure is very much determined by what security properties you want to achieve, in particular, which components to trust and which not. Less trust generally means more overhead, and the cost of crossing a protection domain boundary is the critical factor that determines this overhead for a given design.
It seems to be quite common for microkernel OSes to vertically split subsystems that really represent single protection domains. A good example is a disk stack. For the most part, all layers of a disk stack are dealing with the same data, just at different levels, so splitting them into processes just adds unnecessary overhead in most cases. Typically, one disk server process per device should be good enough as far as security goes. The only cases where there is any benefit at all to splitting a disk stack vertically are systems with multiple partitions or traditional LVMs that provide raw volumes, and sometimes also systems with disk encryption. On a system with a single partition or an integrated LVM/FS like ZFS and no disk encryption there is typically no benefit to splitting up the disk stack.
For systems where keeping partitions/LVs separated is important, it should be possible to run separate "lower" and "upper" disk servers, with the lower one containing the disk driver, partition driver, and LVM and the upper one containing the FS driver and disk encryption layer, but this should not be mandatory (this is what I plan to do on UX/RT).
A disk stack architecture like that of Genode where the disk driver, partition driver, and FS driver are completely separate programs (rather than plugins that may be run in different configurations) forces overhead on all use cases even though that overhead often provides no security or error recovery benefit.
It seems that most microkernel OSes follow the former model for some reason, and I'm not sure why.
Which OSes? I’d prefer specific data points over vague claims.
In addition to Genode, prime examples would include Minix 3 and Fuchsia.
QNX seems to be the main example of a microkernel OS that uses a minimally structured IPC transport layer (although still somewhat more structured than what UX/RT will have) and goes out of its way to avoid intermediary servers (its VFS doesn't act as an intermediary on read() or write(), and many subsystems are single processes). One paper back in the 90s benchmarked an old version as being significantly faster than contemporary System V/386 on the same hardware for most of the APIs tested (although maybe that just means System V/386 was slow; I should try to see if I can get similar results with later versions of QNX against BSD and Linux).
Speaking of UX/RT: I strongly suggest avoiding PID-based APIs in favor of handle-based APIs, and allocating file descriptors in a way that is more efficient than always picking the lowest free one. The first causes many race conditions, and the second causes a lot of synchronization overhead. io_uring is also a good API to consider.
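As a rough illustration of the second point, here is a minimal sketch of a descriptor table that hands out descriptors from a LIFO free list instead of scanning for the lowest free slot. The names (fd_table, fd_alloc, fd_free, FD_MAX) are hypothetical, the table is tiny, and there is no locking; the assumption is that O(1) allocation like this would make it easier to add per-thread caches later. Note that POSIX requires open() and dup() to return the lowest available descriptor, so a scheme like this would have to be a separate API or an opt-in mode.

```c
#include <assert.h>
#include <stddef.h>

#define FD_MAX 8 /* tiny table, for illustration only */

/* Hypothetical descriptor table with an intrusive LIFO free list. */
struct fd_table {
    int next_free[FD_MAX];        /* next free slot after this one, or -1 */
    int head;                     /* first free slot, or -1 if table is full */
    unsigned char in_use[FD_MAX]; /* 1 if the slot is currently allocated */
};

static void fd_table_init(struct fd_table *t)
{
    for (int i = 0; i < FD_MAX; i++) {
        t->next_free[i] = (i + 1 < FD_MAX) ? i + 1 : -1;
        t->in_use[i] = 0;
    }
    t->head = 0;
}

/* O(1): pop the head of the free list; no scan for the lowest free slot. */
static int fd_alloc(struct fd_table *t)
{
    int fd = t->head;
    if (fd < 0)
        return -1; /* table is full */
    t->head = t->next_free[fd];
    t->in_use[fd] = 1;
    return fd;
}

/* O(1): push the slot back onto the free list. */
static void fd_free(struct fd_table *t, int fd)
{
    assert(fd >= 0 && fd < FD_MAX && t->in_use[fd]);
    t->in_use[fd] = 0;
    t->next_free[fd] = t->head;
    t->head = fd;
}
```

With this scheme the most recently freed descriptor is reused first, so after freeing 0 and then 1 the next allocation returns 1 rather than 0.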