On Mon, Apr 13, 2020, 7:42 AM Andrew Warkentin <andreww591(a)gmail.com> wrote:
On 4/12/20, Heiser, Gernot (Data61, Kensington NSW) wrote:
> Sure, OS structure matters a lot, and I'm certainly known for telling
> people consistently that IPC payloads of more than a few words are an
> indicator of a poor design. Microkernel IPC should be considered a
> function call mechanism, and you shouldn't pass more by-value data
> than you would to a C function (see
I would think that an RPC layer with run-time marshaling of arguments
as is used as the IPC transport layer on most microkernel OSes would
add some overhead, even if it is using the underlying IPC layer
properly, since it has to iterate over the argument list, determine
each argument's type, and copy it into a buffer, with the reverse
happening on the receiving end. Passing around bulk
unstructured/opaque data is quite common (e.g. for disk and network
transfers), and an RPC-based transport layer adds unnecessary overhead
and complexity to such use cases.
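To make that per-call cost concrete, here is a minimal sketch (hypothetical types and names, not any particular OS's RPC layer) of the per-argument type dispatch and copy that run-time marshaling implies:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical tagged-argument descriptor, as used by a
 * run-time-marshaling RPC layer. */
enum arg_type { ARG_U32, ARG_U64, ARG_BYTES };

struct rpc_arg {
    enum arg_type type;
    const void *data;
    size_t len;            /* used only for ARG_BYTES */
};

/* Marshal an argument list into a flat buffer; returns bytes written,
 * or 0 on overflow.  Every argument costs a type dispatch plus a copy
 * -- the overhead described above, paid again when unmarshaling. */
size_t rpc_marshal(uint8_t *buf, size_t buf_len,
                   const struct rpc_arg *args, size_t nargs)
{
    size_t off = 0;
    for (size_t i = 0; i < nargs; i++) {
        size_t n;
        switch (args[i].type) {
        case ARG_U32:   n = 4; break;
        case ARG_U64:   n = 8; break;
        case ARG_BYTES: n = args[i].len; break;
        default:        return 0;
        }
        if (off + 1 + n > buf_len)
            return 0;
        buf[off++] = (uint8_t)args[i].type;   /* tag byte */
        memcpy(buf + off, args[i].data, n);   /* extra copy */
        off += n;
    }
    return off;
}
```

For bulk opaque data (a disk block, a network frame), all of this is pure overhead: the payload is already contiguous and needs no per-field treatment.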
I think a better transport layer design (for an L4-like kernel at
least) would be one that maintains RPC-like call-return semantics but
exposes message registers and a shared memory buffer almost directly,
with the only additions being a message length, a file offset, and a
type code (indicating whether a message is a short register-only
message, a long message in the buffer, or an error), rather than
marshaling. This is what I plan to do on UX/RT, which will have a
Unix-like IPC transport layer API providing new read()/write()-like
functions that operate on message registers or the shared buffer
rather than copying as the traditional versions do (the traditional
versions will of course still be present, implemented on top of the
"raw" versions).
RPC with marshaling could easily still be implemented on top of such a
transport layer (for complex APIs that need marshaling) with basically
no overhead compared to an RPC-based transport layer.
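As a rough illustration, a minimally structured header along these lines might look like the following (the names and the register-payload capacity are illustrative assumptions, not UX/RT's actual API):

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative header for a minimally structured transport: just a
 * type code, a length, and a file offset.  The payload lives either
 * in message registers (short) or in a shared per-channel buffer
 * (long); there is no per-argument marshaling. */
enum msg_type {
    MSG_SHORT,   /* payload entirely in message registers */
    MSG_LONG,    /* payload in the shared buffer */
    MSG_ERROR    /* error reply; length field carries the error code */
};

struct msg_hdr {
    enum msg_type type;
    size_t        length;   /* payload length in bytes */
    uint64_t      offset;   /* file offset for read()/write()-like ops */
};

#define MSG_REGS_BYTES 64   /* assumed register-payload capacity */

/* Pick the cheapest representation for a payload of a given size:
 * registers if it fits, otherwise the shared buffer. */
enum msg_type classify(size_t length)
{
    return length <= MSG_REGS_BYTES ? MSG_SHORT : MSG_LONG;
}
```

An RPC layer built on top would simply serialize into the shared buffer itself, so complex APIs pay for marshaling only when they actually need it.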
> However, microkernel OS structure is very much determined by what
> properties you want to achieve, in particular, which components to
> trust and which not. Less trust generally means more overhead, and
> the cost of crossing a protection domain boundary is the critical
> factor that determines this overhead for a given design.
It seems to be quite common for microkernel OSes to vertically split
subsystems that really represent single protection domains. A good
example is a disk stack. For the most part, all layers of a disk stack
are dealing with the same data, just at different levels, so splitting
them into processes just adds unnecessary overhead in most cases.
Typically, one disk server process per device should be good enough as
far as security goes. The only cases where there is any benefit at
all to splitting a disk stack vertically are systems with multiple
partitions or traditional LVMs that provide raw volumes, and sometimes
systems with disk encryption. On a system with a single partition or
an integrated LVM/FS like ZFS, and no disk encryption, there is
typically no benefit to splitting up the disk stack.
For systems where keeping partitions/LVs separated is important, it
should be possible to run separate "lower" and "upper" disk servers,
with the lower one containing the disk driver, partition driver, and
LVM, and the upper one containing the FS driver and disk encryption
layer, but this should not be mandatory (this is what I plan to do on
UX/RT).
A disk stack architecture like that of Genode where the disk driver,
partition driver, and FS driver are completely separate programs
(rather than plugins that may be run in different configurations)
forces overhead on all use cases even though that overhead often
provides no security or error recovery benefit.
>> It seems that most microkernel OSes follow the former model for
>> some reason, and I'm not sure why.
>
> Which OSes? I'd prefer specific data points over vague claims.
In addition to Genode, prime examples would include Minix 3 and Fuchsia.
QNX seems to be the main example of a microkernel OS that uses a
minimally structured IPC transport layer (although still somewhat more
structured than what UX/RT will have) and goes out of its way to avoid
intermediary servers (its VFS doesn't act as an intermediary on read()
or write(), and many subsystems are single processes). One paper back
in the 90s benchmarked an old version as being significantly faster
than contemporary System V/386 on the same hardware for most of the
APIs tested (although maybe that just means System V/386 was slow; I
should see whether I can reproduce similar results with later versions
of QNX against BSD and Linux).
Speaking of UX/RT: I strongly suggest avoiding PID-based APIs in favor of
handle-based APIs, and allocating file descriptors in a way that is more
efficient than always picking the lowest free one. The first causes many
race conditions, and the second causes a lot of synchronization overhead.
io_uring is also a good API to consider.
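On the descriptor-allocation point: a free-list allocator is one way to hand out descriptors in O(1) without scanning for the lowest free slot, at the cost of the POSIX lowest-numbered-fd guarantee. A minimal single-threaded sketch (illustrative only; a real kernel table would need locking or atomics):

```c
#include <stddef.h>

/* Sketch of an O(1) file-descriptor allocator: released slots form an
 * intrusive stack, so allocation never scans for the lowest free fd
 * the way POSIX open() semantics require. */
#define FD_MAX 1024

static int next_free[FD_MAX];  /* free-list links */
static int free_head = -1;     /* -1 = free list empty */
static int high_water = 0;     /* fds never yet handed out */

int fd_alloc(void)
{
    if (free_head != -1) {          /* reuse a released slot: O(1) */
        int fd = free_head;
        free_head = next_free[fd];
        return fd;
    }
    if (high_water < FD_MAX)        /* fresh slot: O(1) */
        return high_water++;
    return -1;                      /* table full */
}

void fd_free(int fd)
{
    next_free[fd] = free_head;      /* push onto the free list */
    free_head = fd;
}
```

Because no allocation ever needs to find the global minimum free index, concurrent callers contend only on the list head rather than on a shared bitmap scan.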