On Mon, Sep 12, 2022 at 10:18:48PM -0500, Eric Jacobs wrote:
> Demi Marie Obenour wrote:
>> While the basic framework is in place and performs well (it outperforms Linux without even trying too hard…) there are a number of questions that still need further research, and are unlikely to be resolved by the time of the initial release. One of them is whether drivers should be active PDs (synchronising with notifications only) or passive PDs (using PPCs). There are a bunch of tradeoffs to consider, and we need a fair amount of experimental work to settle it. The good news is that the choice has only a minimal effect on driver and client implementations (a few lines changed, possibly zero on the client side). I **strongly** recommend active drivers, for the following reasons:
>> 1. While seL4 can perform context switches with incredible speed, lots of hardware requires very expensive flushing to prevent Spectre attacks. On x86, for instance, one must issue an Indirect Branch Prediction Barrier (IBPB), which costs thousands of cycles. Other processors, such as ARM, also require expensive flushing. In the general case, it is not possible to safely avoid these flushes on current hardware.

> The issue of managing the number and timing of context switches is indeed critically important to any multi-process system. However, insofar as my original question was about comparing the mechanisms of "shared memory+IPC" and "shared memory+notifications" (which may or may not be captured by the terms "active driver" vs "passive driver", I'm not sure), I'm seeking to understand the operation of these styles of communication independently of any policy that may be built on top of them.
In seL4, a passive server is a program that never runs except when one of its IPC endpoints is called. Therefore, a passive driver would be one that is called via IPC, and an active driver would be one that uses shared memory and notifications.
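To make the distinction concrete, here is a rough sketch of the two shapes a driver's main loop could take. The cap names, the ring layout, and process_descriptor() are hypothetical, and the non-MCS seL4 API is shown for brevity (under MCS the receive calls also take a reply object):

    #include <sel4/sel4.h>

    /* Hypothetical shared submission ring, mapped into both client and driver. */
    struct descriptor { seL4_Word addr, len, flags; };
    struct ring {
        volatile seL4_Word head;              /* written by the producer (client) */
        volatile seL4_Word tail;              /* written by the consumer (driver) */
        struct descriptor desc[256];
    };

    void process_descriptor(struct descriptor *d);   /* hypothetical driver work */

    /* Active driver: has its own scheduling context, is woken only by
     * notifications, and drains the whole ring per wakeup.  Memory barriers
     * are omitted for brevity. */
    void active_driver_loop(seL4_CPtr ntfn, seL4_CPtr client_ntfn, struct ring *sq)
    {
        for (;;) {
            seL4_Word badge;
            seL4_Wait(ntfn, &badge);          /* block until a client (or IRQ) signals */
            while (sq->tail != sq->head) {
                process_descriptor(&sq->desc[sq->tail % 256]);
                sq->tail++;
            }
            seL4_Signal(client_ntfn);         /* tell the client completions are ready */
        }
    }

    /* Passive driver, by contrast: every request arrives as a PPC on an
     * endpoint, so the driver runs on its caller's core (and, under MCS,
     * on the caller's donated scheduling context). */
    void passive_driver_loop(seL4_CPtr ep)
    {
        seL4_Word badge;
        seL4_MessageInfo_t info = seL4_Recv(ep, &badge);
        for (;;) {
            (void)info;                       /* a real driver would decode the message here */
            info = seL4_ReplyRecv(ep, seL4_MessageInfo_new(0, 0, 0, 0), &badge);
        }
    }

The client-side difference is correspondingly small: ring the doorbell with seL4_Signal() in the first case, or seL4_Call() on the endpoint in the second.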
> To put it another way, the act of making another, higher-priority thread runnable is what induces a context switch, whether that action is part of an IPC, a notification, a fault, etc. If a piece of code decides to perform one of those actions, it is choosing to incur the cost of a context switch. It decides if and when to do this based on its own policy and the compromises that policy necessarily entails; and these decisions are made independently of the mechanisms provided by the kernel.
This is correct.
>> Therefore, context switches must be kept to a minimum, no matter how good the kernel is at doing them.

> Let us speak precisely here. Managing the cost of context switches, which are a limited resource, is a policy decision that must be left up to the application and not dictated by the system. What the system should provide is mechanisms that allow the correct trade-offs to be made, which is why I'm especially curious to see how we can support things like scatter-gather and segmentation offloading, which are critical on other platforms and, I expect, on this one too.
I do not see how scatter-gather and segmentation offload depend on how drivers communicate with their clients.
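For instance, a descriptor ring can carry a scatter-gather list (and a segmentation-offload flag) regardless of whether the doorbell is a notification or a PPC. A rough sketch, with entirely hypothetical field names:

    #include <stdint.h>

    #define SG_END 0xffffffffu

    /* One segment of a scatter-gather chain, living in the shared region. */
    struct sg_segment {
        uint64_t offset;     /* offset into the shared data area */
        uint32_t len;        /* length of this segment in bytes */
        uint32_t next;       /* index of the next segment, or SG_END */
    };

    /* One request slot in the ring: it points at the first segment and
     * carries per-request flags, e.g. a hypothetical REQ_TSO bit asking
     * the driver (or the hardware) to perform segmentation offload. */
    struct sg_request {
        uint32_t first_seg;
        uint32_t nr_segs;
        uint32_t flags;
        uint32_t mss;        /* segment size to use when REQ_TSO is set */
    };

How the driver learns that new slots are available is orthogonal to this layout.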
>> 2. Passive drivers only make sense if the drivers are trusted. In the case of e.g. USB or network drivers, this is a bad assumption: both are major attack vectors, and drivers ported from e.g. Linux are unlikely to be hardened against malicious devices.
> Hmm, this I am quite surprised by. Is this an expected outcome of the seL4 security model?
It is. A PPC callee is allowed to block for as long as it desires, and doing so will block the caller too. There is no way (that I am aware of) to interrupt this wait. Therefore, the callee can perform a trivial denial-of-service attack on the caller.
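As a rough illustration (hypothetical endpoint cap, non-MCS API), all the callee has to do is accept the call and never reply:

    #include <sel4/sel4.h>

    /* A malicious or buggy PPC callee: the caller is blocked inside
     * seL4_Call() from the moment this returns from seL4_Recv(), and it
     * stays blocked until a reply is sent.  The caller cannot time out or
     * cancel the call on its own. */
    void hostile_callee(seL4_CPtr ep)
    {
        seL4_Word badge;
        seL4_Recv(ep, &badge);
        for (;;) {
            /* never seL4_Reply(): trivial denial of service on the caller */
        }
    }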
> This implies that a rather large swath of kernel functionality (the IPC fastpath, cap transfer, the single-threaded event model) is simply not available to mutually suspicious PDs. I'm very concerned about the expansion of the T(rusted)CB into userspace, for both performance and assurance reasons.
As mentioned above, the IPC fastpath is only usable when the caller trusts the callee to at least not perform a denial-of-service attack. Furthermore, since seL4 does not provide any support for copying large messages between userspace processes, mutually distrusting PDs will often need to perform defensive copies on both sides. One can implement asynchronous, single-copy message and cap transfer via a trusted userspace message broker.
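Concretely, the defensive copy on the receiving side usually looks like the following: snapshot the request out of shared memory, then validate only the private copy, so the peer cannot change the fields between check and use. The request layout here is hypothetical:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    struct request {
        uint32_t opcode;
        uint32_t len;
        uint8_t  payload[256];
    };

    /* Copy out of the shared region *before* validating, so a malicious
     * peer cannot rewrite fields after they have been checked (the classic
     * time-of-check/time-of-use problem with shared memory). */
    static bool read_request(const volatile struct request *shared,
                             struct request *priv)
    {
        memcpy(priv, (const void *)shared, sizeof *priv);
        if (priv->len > sizeof priv->payload)
            return false;            /* reject based on the private copy only */
        return true;
    }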
>> 3. A passive driver must be on the same core as its caller, since cross-core PPC is deprecated. An active driver does not have this restriction. Active drivers can therefore run concurrently with the programs they serve, which is the same technique used by Linux’s io_uring with SQPOLL.

> For a high-performance driver, I would expect to have at least one TX and RX queue per CPU. Therefore, a local call should always be possible. The deprecation of cross-CPU IPC is related to the basic principle that spreading work across CPUs is generally not a good idea.
If you are referring to networking applications where the driver can already perform a denial of service (by not processing any packets), then using an IPC call to the driver would be justified. Batching is still critical for performance, though. See Vector Packet Processing for why.
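One common way to keep batching cheap with notifications (a generic technique, not necessarily what the framework does; names hypothetical) is for the producer to signal only when the ring goes from empty to non-empty, so a burst of submissions costs a single notification:

    #include <sel4/sel4.h>

    struct ring { volatile seL4_Word head, tail; /* descriptors omitted */ };

    /* Producer side.  Publish the descriptor first, then signal only if the
     * consumer might be asleep, i.e. the ring was empty before this
     * submission.  A burst of N submissions then costs one seL4_Signal()
     * rather than N.  Memory barriers are omitted for brevity. */
    void submit(struct ring *r, seL4_CPtr driver_ntfn)
    {
        seL4_Word old_head = r->head;
        r->head = old_head + 1;          /* descriptor at old_head is now visible */
        if (old_head == r->tail)         /* ring was empty: consumer may be blocked */
            seL4_Signal(driver_ntfn);
    }

Signalling when the driver is already awake is harmless, since the notification simply stays pending, so the check is purely an optimisation.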
> But yes, if the user wants to distribute the workload in that way, passing data between CPUs, obviously the IPC fastpath is off the table, and notifications seem like a pretty clear choice in that case. (However, the existence of this use case is _not_ a reason to sacrifice the performance of the same-core RTC execution model.)
If you are okay with the driver being able to perform a DoS, then IPC is fine.
>> 4. Linux’s io_uring has the application submit a bunch of requests into a ring buffer and then tell the kernel to process these requests. The kernel processes the requests asynchronously and writes completions into another ring buffer. Userspace can use poll() or epoll() to wait for a completion to arrive. This is a fantastic fit for active drivers: the submission syscall is replaced by a notification, and another notification is used for wakeup on completion.
> Agreed, it's a very attractive model. Indeed, that is basically how I got started on this line of thinking; it is quite apparent that these command ring/pipe-like structures are very flexible and could be used as the building blocks of entire systems. So the question I wanted to answer was: what are we leaving on the table if we go with this approach? Particularly given the emphasis on IPC, its optimizations, and the contention that fast IPC is the foundational element of a successful, performant microkernel system.
I think there are a few factors in play here:

1. seL4 has (so far) mostly been used in embedded systems. I suspect many of these run on simple, low-power, in-order CPUs, which are inherently not vulnerable to Spectre. As such, they very much can have fast context switches. I, on the other hand, am mostly interested in using seL4 in Qubes OS, which runs on out-of-order hardware where Spectre is a major concern. Therefore, context switch performance is limited by the need for Spectre mitigations.

2. seL4 predates speculative execution attacks by several years. Therefore, seL4’s design likely did not consider that hardware vulnerabilities would reduce IPC performance by over an order of magnitude. Monolithic kernels are also impacted by this, but recent x86 CPUs support eIBRS, which significantly reduces syscall overhead so long as no context switch is required. Except for the “IBRS everywhere” case on AMD systems mentioned above, context switches require an expensive IBPB command.

3. Even in monolithic kernels, the cost of a system call has turned out to be high enough to matter. Therefore, Linux and Windows have both adopted ring buffer-based I/O APIs, which dramatically reduce the number of system calls required. This makes the overhead of an individual system call much less important. Since a system call in a monolithic kernel typically corresponds to an IPC in a microkernel, this means that the cost of an IPC is also much less important.

-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab