On Mon, Sep 12, 2022 at 10:18:48PM -0500, Eric Jacobs wrote:
> Demi Marie Obenour wrote:
>> While the basic framework is in place and performs well (it outperforms Linux without even trying too hard…) there are a number of questions that still need further research, and are unlikely to be resolved by the time of the initial release. One of them is whether drivers should be active PDs (synchronising with notifications only) or passive PDs (using PPCs). There are a bunch of tradeoffs to consider, and we need a fair amount of experimental work to settle it. The good news is that the choice has only a minimal effect on driver and client implementations (a few lines changed, possibly zero on the client side). I **strongly** recommend active drivers, for the following reasons:
>> 1. While seL4 can perform context switches with incredible speed, lots of hardware requires very expensive flushing to prevent Spectre attacks. On x86, for instance, one must issue an Indirect Branch Prediction Barrier (IBPB), which costs thousands of cycles. Other processors, such as ARM, also require expensive flushing. In the general case, it is not possible to safely avoid these flushes on current hardware.

> The issue of managing the number and timing of context switches is indeed critically important to any multi-process system. However, insofar as my original question was about comparing the mechanisms of "shared memory+IPC" and "shared memory+notifications" (which may or may not be captured by the terms "active driver" vs "passive driver", I'm not sure), I'm seeking to understand the operation of these styles of communication independently of any policy that may be built on top of them.
In seL4, a passive server is a program that never runs except when one of its IPC endpoints is called. Therefore, a passive driver would be one that is called via IPC, and an active driver would be one that uses shared memory and notifications.
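To make the distinction concrete, here is a rough sketch of the two shapes a driver's main loop could take. The cap names, the ring layout, and process_descriptor() are hypothetical, and the non-MCS seL4 API is shown for brevity (under MCS the receive calls also take a reply object):

    #include <sel4/sel4.h>

    /* Hypothetical shared submission ring, mapped into both client and driver. */
    struct descriptor { seL4_Word addr, len, flags; };
    struct ring {
        volatile seL4_Word head;              /* written by the producer (client) */
        volatile seL4_Word tail;              /* written by the consumer (driver) */
        struct descriptor desc[256];
    };

    void process_descriptor(struct descriptor *d);   /* hypothetical driver work */

    /* Active driver: has its own scheduling context, is woken only by
     * notifications, and drains the whole ring per wakeup.  Memory barriers
     * are omitted for brevity. */
    void active_driver_loop(seL4_CPtr ntfn, seL4_CPtr client_ntfn, struct ring *sq)
    {
        for (;;) {
            seL4_Word badge;
            seL4_Wait(ntfn, &badge);          /* block until a client (or IRQ) signals */
            while (sq->tail != sq->head) {
                process_descriptor(&sq->desc[sq->tail % 256]);
                sq->tail++;
            }
            seL4_Signal(client_ntfn);         /* tell the client completions are ready */
        }
    }

    /* Passive driver, by contrast: every request arrives as a PPC on an
     * endpoint, so the driver runs on its caller's core (and, under MCS,
     * on the caller's donated scheduling context). */
    void passive_driver_loop(seL4_CPtr ep)
    {
        seL4_Word badge;
        seL4_MessageInfo_t info = seL4_Recv(ep, &badge);
        for (;;) {
            (void)info;                       /* a real driver would decode the message here */
            info = seL4_ReplyRecv(ep, seL4_MessageInfo_new(0, 0, 0, 0), &badge);
        }
    }

The client-side difference is correspondingly small: ring the doorbell with seL4_Signal() in the first case, or seL4_Call() on the endpoint in the second.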
> To put it another way, the act of making another, higher-priority thread runnable is what induces a context switch, whether that action is part of an IPC, a notification, a fault, etc. If a piece of code decides to perform one of those actions, it is choosing to incur the cost of a context switch. It decides if and when to do this based on its own policy and the compromises that policy necessarily entails; and these decisions are made independently of the mechanisms provided by the kernel.
This is correct.
>> Therefore, context switches must be kept to a minimum, no matter how good the kernel is at doing them.

> Let us speak precisely here. Managing the cost of context switches, which are a limited resource, is a policy decision that must be left up to the application and not dictated by the system. What the system should provide is mechanisms that allow the correct trade-offs to be made, which is why I'm especially curious to see how we can support things like scatter-gather and segmentation offloading, which are critical on other platforms and, I expect, on this one too.
I do not see how scatter-gather and segmentation offload depend on how drivers communicate with their clients.
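For instance, a descriptor ring can carry a scatter-gather list (and a segmentation-offload flag) regardless of whether the doorbell is a notification or a PPC. A rough sketch, with entirely hypothetical field names:

    #include <stdint.h>

    #define SG_END 0xffffffffu

    /* One segment of a scatter-gather chain, living in the shared region. */
    struct sg_segment {
        uint64_t offset;     /* offset into the shared data area */
        uint32_t len;        /* length of this segment in bytes */
        uint32_t next;       /* index of the next segment, or SG_END */
    };

    /* One request slot in the ring: it points at the first segment and
     * carries per-request flags, e.g. a hypothetical REQ_TSO bit asking
     * the driver (or the hardware) to perform segmentation offload. */
    struct sg_request {
        uint32_t first_seg;
        uint32_t nr_segs;
        uint32_t flags;
        uint32_t mss;        /* segment size to use when REQ_TSO is set */
    };

How the driver learns that new slots are available is orthogonal to this layout.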
>> 2. Passive drivers only make sense if the drivers are trusted. In the case of e.g. USB or network drivers, this is a bad assumption: both are major attack vectors, and drivers ported from e.g. Linux are unlikely to be hardened against malicious devices.
> Hmm, this I am quite surprised by. Is this an expected outcome of the seL4 security model?
It is. A PPC callee is allowed to block for as long as it desires, and doing so will block the caller too. There is no way (that I am aware of) to interrupt this wait. Therefore, the callee can perform a trivial denial-of-service attack on the caller.
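As a rough illustration (hypothetical endpoint cap, non-MCS API), all the callee has to do is accept the call and never reply:

    #include <sel4/sel4.h>

    /* A malicious or buggy PPC callee: the caller is blocked inside
     * seL4_Call() from the moment this returns from seL4_Recv(), and it
     * stays blocked until a reply is sent.  The caller cannot time out or
     * cancel the call on its own. */
    void hostile_callee(seL4_CPtr ep)
    {
        seL4_Word badge;
        seL4_Recv(ep, &badge);
        for (;;) {
            /* never seL4_Reply(): trivial denial of service on the caller */
        }
    }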
> This implies that a rather large swath of kernel functionality (the IPC fastpath, cap transfer, the single-threaded event model) is simply not available to mutually suspicious PDs. I'm very concerned about the expansion of the T(rusted)CB into userspace, for both performance and assurance reasons.
As mentioned above, the IPC fastpath is only usable when the caller trusts the callee to at least not perform a denial-of-service attack. Furthermore, since seL4 does not provide any support for copying large messages between userspace processes, mutually distrusting PDs will often need to perform defensive copies on both sides. One can implement asynchronous, single-copy message and cap transfer via a trusted userspace message broker.
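Concretely, the defensive copy on the receiving side usually looks like the following: snapshot the request out of shared memory, then validate only the private copy, so the peer cannot change the fields between check and use. The request layout here is hypothetical:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    struct request {
        uint32_t opcode;
        uint32_t len;
        uint8_t  payload[256];
    };

    /* Copy out of the shared region *before* validating, so a malicious
     * peer cannot rewrite fields after they have been checked (the classic
     * time-of-check/time-of-use problem with shared memory). */
    static bool read_request(const volatile struct request *shared,
                             struct request *priv)
    {
        memcpy(priv, (const void *)shared, sizeof *priv);
        if (priv->len > sizeof priv->payload)
            return false;            /* reject based on the private copy only */
        return true;
    }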
>> 3. A passive driver must be on the same core as its caller, since cross-core PPC is deprecated. An active driver does not have this restriction. Active drivers can therefore run concurrently with the programs they serve, which is the same technique used by Linux’s io_uring with SQPOLL.

> For a high-performance driver, I would expect to have at least one TX and RX queue per CPU. Therefore, a local call should always be possible. The deprecation of cross-CPU IPC is related to the basic principle that spreading work across CPUs is generally not a good idea.
If you are referring to networking applications where the driver can already perform a denial of service (by not processing any packets), then using an IPC call to the driver would be justified. Batching is still critical for performance, though. See Vector Packet Processing for why.
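One common way to keep batching cheap with notifications (a generic technique, not necessarily what the framework does; names hypothetical) is for the producer to signal only when the ring goes from empty to non-empty, so a burst of submissions costs a single notification:

    #include <sel4/sel4.h>

    struct ring { volatile seL4_Word head, tail; /* descriptors omitted */ };

    /* Producer side.  Publish the descriptor first, then signal only if the
     * consumer might be asleep, i.e. the ring was empty before this
     * submission.  A burst of N submissions then costs one seL4_Signal()
     * rather than N.  Memory barriers are omitted for brevity. */
    void submit(struct ring *r, seL4_CPtr driver_ntfn)
    {
        seL4_Word old_head = r->head;
        r->head = old_head + 1;          /* descriptor at old_head is now visible */
        if (old_head == r->tail)         /* ring was empty: consumer may be blocked */
            seL4_Signal(driver_ntfn);
    }

Signalling when the driver is already awake is harmless, since the notification simply stays pending, so the check is purely an optimisation.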
> But yes, if the user wants to distribute the workload in that way, passing data between CPUs, obviously the IPC fastpath is off the table, and notifications seem like a pretty clear choice in that case. (However, the existence of this use case is _not_ a reason to sacrifice the performance of the same-core RTC execution model.)
If you are okay with the driver being able to perform a DoS, then IPC is fine.
>> 4. Linux’s io_uring has the application submit a bunch of requests into a ring buffer and then tell the kernel to process these requests. The kernel processes the requests asynchronously and writes completions into another ring buffer. Userspace can use poll() or epoll() to wait for a completion to arrive. This is a fantastic fit for active drivers: the submission syscall is replaced by a notification, and another notification is used for wakeup on completion.
> Agreed, it's a very attractive model. Indeed, that is basically how I got started on this line of thinking; it is quite apparent that these command ring/pipe-like structures are very flexible and could be used as the building blocks of entire systems. So the question I wanted to answer was: what are we leaving on the table if we go with this approach? Particularly given the emphasis on IPC, its optimizations, and the contention that fast IPC is the foundational element of a successful, performant microkernel system.
I think there are a few factors in play here:

1. seL4 has (so far) mostly been used in embedded systems. I suspect many of these run on simple, low-power, in-order CPUs, which are inherently not vulnerable to Spectre. As such, they very much can have fast context switches. I, on the other hand, am mostly interested in using seL4 in Qubes OS, which runs on out-of-order hardware where Spectre is a major concern. Therefore, context switch performance is limited by the need for Spectre mitigations.

2. seL4 predates speculative execution attacks by several years. Therefore, seL4’s design likely did not consider that hardware vulnerabilities would reduce IPC performance by over an order of magnitude. Monolithic kernels are also impacted by this, but recent x86 CPUs support eIBRS, which significantly reduces syscall overhead so long as no context switch is required. Except for the “IBRS everywhere” case on AMD systems mentioned above, context switches require an expensive IBPB command.

3. Even in monolithic kernels, the cost of a system call has turned out to be high enough to matter. Therefore, Linux and Windows have both adopted ring buffer-based I/O APIs, which dramatically reduce the number of system calls required. This makes the overhead of an individual system call much less important. Since a system call in a monolithic kernel typically corresponds to an IPC in a microkernel, this means that the cost of an IPC is also much less important.

-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab