new RFC: seL4 Device Driver Framework
A new RFC has just landed: https://sel4.atlassian.net/browse/RFC-12 If you have any feedback, please comment there. Cheers, Gerwin
This looks really interesting. I don't feel qualified to offer any feedback, but I do have a question...

I'm particularly curious about the "single producer, single consumer, lockless bounded queues implemented as ring buffers" mentioned for the cross-address-space communication. Is this taken from an existing seL4 library/util (e.g. is it similar at all to shared-memory communication in CAmkES?) or is it something novel developed specifically for the driver framework?

In my research into the Pony language on seL4 I have run into the problem that all of Pony's actor communication assumes single-address-space data structures. There are some interesting lock-free SPMC and MPSC queues under the hood that only use C atomics, which I've been able to port over to the seL4 environment, but they're ultimately implemented as single-address-space linked lists, so something like this could be a useful tool for looking at communication between actors in different address spaces. (Although Pony's message-passing is also built around the assumption that the queues are unbounded... which might just not be workable at all between confined components. That is a discussion point for my thesis though :) )

Cheers, Stewart Webb
The "single producer, single consumer, lockless bounded queues implemented as ring buffers" also caught my attention.

In the past, to minimize jitter, I found it useful to have a pool of threads consuming work from a single queue. This reduces the probability that tasks will be significantly delayed behind other tasks that are taking an unusually long time, because additional threads in the pool become free and allow subsequent tasks to bypass the ones taking too long. This is well known and often implemented in post offices where there is a single queue for multiple counters. This implies MPMC (or perhaps SPMC and MPSC in pairs).

It's also useful to arrange tasks on different dedicated hardware threads so you can move tasks that are causing jitter away onto a separate hardware thread. I think you might already have sufficient thread affinity mechanisms.

Memory and threads for a pool have to be from the same NUMA node. Devices are also local to a specific NUMA node.

You can probably hide most of this for devices where a single hardware thread is sufficient to handle the device and the devices are independent, but if you are doing anything involving the CPU in conjunction with a fast modern network device then one hardware thread won't be enough, and anything on NUMA hardware that involves multiple devices (multi-pathing for example) is likely to benefit from NUMA awareness.
Harry
Hi Stewart,
The ring buffers were previously implemented as a library on CAmkES. You can find the PR here: https://github.com/seL4/projects_libs/pull/15
The move to the Core Platform was made due to performance overheads in the CAmkES framework. There is no reason the sDDF can’t be used on other platforms either, though the sample system will need porting.
Hi Harry,
The use of single producer, single consumer queues was an intentional simplification for the driver framework. This makes it much easier to reason about and verify as well as eliminating many possible concurrency bugs. The framework does not require single threaded address spaces though. We could have multiple threads acting as a single component, but they would be servicing different queues (and thus the queues would remain single producer, single consumer).
Ideally for larger systems with multiple clients, we would use a multiplexing component to service multiple queues between client applications and instead grow the stack laterally. This design aims to provide a strong separation of concerns as each job would be a separate component and the simplicity of the queues means there is little performance overhead. In multicore, each of these components would run on separate cores.
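To illustrate just how little machinery such a queue needs, here is a minimal sketch (made-up names, C11 atomics; this is not the sDDF code): a bounded single-producer/single-consumer ring over shared memory is just a head index owned by the producer, a tail index owned by the consumer, and acquire/release ordering on the index updates.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 512               /* power of two so we can mask */

/* One of these lives in memory shared between exactly one producer and
 * one consumer; entries carry offsets into a shared data region rather
 * than pointers, since the two sides have different address spaces. */
struct ring {
    _Atomic uint32_t head;          /* written only by the producer */
    _Atomic uint32_t tail;          /* written only by the consumer */
    uint64_t buf[RING_SIZE];        /* e.g. offset+length descriptors */
};

static bool ring_enqueue(struct ring *r, uint64_t desc)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SIZE)
        return false;               /* full: the ring is bounded, so back off */
    r->buf[head & (RING_SIZE - 1)] = desc;
    /* Release: the entry is written before it becomes visible via head. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

static bool ring_dequeue(struct ring *r, uint64_t *desc)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail == head)
        return false;               /* empty */
    *desc = r->buf[tail & (RING_SIZE - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

With exactly one producer and one consumer per ring there is never a contended write to the same index, which is what keeps this lock-free and easy to reason about.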
You can probably hide most of this for devices where a single hardware thread is sufficient to handle the device and the devices are independent, but if you are doing anything involving the CPU in conjunction with a fast modern network device then one hardware thread won't be enough
We propose adding a second hw thread, one to service each direction and thus we could still maintain single producer/single consumer. We still have some work to do to expand the sDDF to other device classes (and this will involve benchmarking the framework on high throughput networking systems), so more on this to come! :)
In the meantime, happy to answer any questions and welcome any feedback/ideas!
Lucy
On 16 Oct 2022, at 23:25, Harry Butterworth
The "single producer, single consumer, lockless bounded queues implemented as ring buffers" also caught my attention.
In the past, to minimize jitter, I found it useful to have a pool of threads consuming work from a single queue. This reduces the probability that tasks will be significantly delayed behind other tasks that are taking an unusually long time, because additional threads in the pool become free and allow subsequent tasks to bypass the ones taking too long. This is well known and often implemented in post offices where there is a single queue for multiple counters. This implies MPMC (or perhaps SPMC and MPSC in pairs).
Hi Harry, Stewart,

The whole point here is that we don't think we need this, and as a result can keep implementation complexity as well as overheads low. SPMC/MPSC is inherently more complex than SPSC, and, judging by the papers I recently read that use it, I'm reasonably confident we'll at least match the performance of such approaches (and I'll offer a grovelling retraction if proven wrong ;-).

In our design there's only one place where there's a 1:n mapping, that's in the multiplexer, and all it does is move pointers from a driver-side ring to a set of per-client rings (input) or from a set of per-client rings to a driver-side ring (output). On output it will apply a simple policy when deciding which client to serve (priority-based, round-robin, or bandwidth limiting). A particular multiplexer will just implement one particular policy, and you can pick the one you want. Basically we're looking at building a Lego set, where every Lego block is single-threaded and can run on a different core. This keeps all rings SPSC, and every single piece very simple (and likely verifiable), and should be able to maximise concurrency of the whole system.

We haven't implemented and evaluated the multiplexers yet, but that'll be one of the first things we'll do when Lucy returns from internship/vacation in early Jan (and I'm very much looking forward to analysing results).

Gernot
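To illustrate how little the multiplexer then has to do, here is a sketch of the transmit side only (hypothetical names, reusing the ring helpers sketched earlier in the thread; the real component is still to be implemented and evaluated): a round-robin output multiplexer is essentially one loop that moves descriptors from the per-client rings onto the single driver-side ring.

#include <stdbool.h>
#include <stdint.h>

/* struct ring, ring_enqueue() and ring_dequeue() are the hypothetical
 * SPSC helpers from the earlier sketch.  Every ring stays SPSC: each
 * client ring has one producer (the client) and one consumer (this
 * component); the driver ring has one producer (this component) and
 * one consumer (the driver). */
#define NUM_CLIENTS 4

static void mux_tx_round_robin(struct ring *client_tx[NUM_CLIENTS],
                               struct ring *driver_tx)
{
    static unsigned next = 0;          /* client where the next pass starts */
    static bool     have_pending = false;
    static uint64_t pending;           /* taken from a client but not yet forwarded */

    /* First retry anything we were holding because the driver ring was full. */
    if (have_pending) {
        if (!ring_enqueue(driver_tx, pending))
            return;                    /* driver ring still full, try again later */
        have_pending = false;
    }

    /* One descriptor per client per pass: simple round-robin fairness. */
    for (unsigned n = 0; n < NUM_CLIENTS; n++) {
        unsigned c = (next + n) % NUM_CLIENTS;
        uint64_t desc;
        if (!ring_dequeue(client_tx[c], &desc))
            continue;                  /* this client has nothing to send */
        if (!ring_enqueue(driver_tx, desc)) {
            pending = desc;            /* hold it until the driver drains */
            have_pending = true;
            next = c;                  /* resume with this client next time */
            return;
        }
    }
    next = (next + 1) % NUM_CLIENTS;
}

A priority or bandwidth-limiting multiplexer would only change the order in which the client rings are visited; the data path itself stays the same.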
On 10/16/22 13:56, Gernot Heiser wrote:
On 16 Oct 2022, at 23:25, Harry Butterworth wrote:
The "single producer, single consumer, lockless bounded queues implemented as ring buffers" also caught my attention.
In the past, to minimize jitter, I found it useful to have a pool of threads consuming work from a single queue. This reduces the probability that tasks will be significantly delayed behind other tasks that are taking an unusually long time, because additional threads in the pool become free and allow subsequent tasks to bypass the ones taking too long. This is well known and often implemented in post offices where there is a single queue for multiple counters. This implies MPMC (or perhaps SPMC and MPSC in pairs).
Hi Harry, Stewart,
The whole point here is that we don't think we need this, and as a result can keep implementation complexity as well as overheads low. SPMC/MPSC is inherently more complex than SPSC, and, judging by the papers I recently read that use it, I'm reasonably confident we'll at least match the performance of such approaches (and I'll offer a grovelling retraction if proven wrong ;-).
How do you plan to handle multi-queue devices? Modern devices often have multiple queues so that they can be used from multiple cores without any CPU-side synchronization.
In our design there's only one place where there's a 1:n mapping, that's in the multiplexer, and all it does is move pointers from a driver-side ring to a set of per-client rings (input) or from a set of per-client rings to a driver-side ring (output). On output it will apply a simple policy when deciding which client to serve (priority-based, round-robin, or bandwidth limiting). A particular multiplexer will just implement one particular policy, and you can pick the one you want. Basically we're looking at building a Lego set, where every Lego block is single-threaded and can run on a different core.
How do you plan on handling access control? Using a block device as an example, a client must not be able to perform any requests to regions of the block storage that it is not authorized to access. This could either be handled in the multiplexer itself or by having the multiplexer include an unforgeable client ID with each request sent to the driver. Also, what are the consequences of a compromised driver? Will drivers be able to escalate privileges directly, or will the multiplexer and client libraries enforce some invariants even in this case?
This keeps all rings SPSC, and every single piece very simple (and likely verifiable), and should be able to maximise concurrency of the whole system.
I agree, with the above caveat about multi-queue devices.
We haven’t implemented and evaluated the multiplexers yet, but that’ll be one of the first things we’ll do when Lucy returns from internship/vacation in early Jan (and I’m very much looking forward to analysing results).
Will it be possible for clients to pre-register buffers with the multiplexer, and for the multiplexer to in turn register them with the driver? That would allow for devices to DMA directly to client buffers while still having the IOMMU restricting what the driver can do. -- Sincerely, Demi Marie Obenour (she/her/hers)
On Sun, 16 Oct 2022 at 14:09, Lucy Parker
We propose adding a second hw thread, one to service each direction…
This is entirely plausible.
Be aware: the threads won’t be independent. For example, ACKs or timing information coming in on the receive thread will need to be communicated to the sending thread. This requires a non-blocking, lock free communication channel between the threads. Maybe just shared memory with memory barriers. It’s important to minimize the latency and minimize the jitter when forwarding information from the receive thread to the sending thread. For example, if you are trying to estimate a change in the one-way trip time for congestion control then jitter in this path will be a limiting factor for the congestion control algorithm.
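A sketch of the simplest form of that channel (hypothetical names; C11 atomics standing in for explicit barriers): if what the receive thread needs to hand over fits in one machine word, say the latest cumulative ACK number, then a single atomic that RX overwrites and TX samples is already a lock-free, non-blocking, low-latency channel; anything larger can go through another SPSC ring like the data-path ones.

#include <stdatomic.h>
#include <stdint.h>

/* Latest-value handover from the RX thread to the TX thread: RX
 * overwrites, TX samples, and neither side ever waits for the other. */
static _Atomic uint64_t latest_ack;     /* most recent cumulative ACK seen */

static void rx_thread_saw_ack(uint64_t ack_seq)
{
    /* Release: anything else RX wrote to shared memory before this
     * (e.g. an RTT sample) is visible to TX once it sees this value. */
    atomic_store_explicit(&latest_ack, ack_seq, memory_order_release);
}

static uint64_t tx_thread_sample_ack(void)
{
    return atomic_load_explicit(&latest_ack, memory_order_acquire);
}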
On Mon, 17 Oct 2022 at 11:37, Demi Marie Obenour
On 10/16/22 13:56, Gernot Heiser wrote:
We haven’t implemented and evaluated the multiplexers yet, but that’ll be one of the first things we’ll do when Lucy returns from internship/vacation in early Jan (and I’m very much looking forward to analysing results).
Will it be possible for clients to pre-register buffers with the multiplexer, and for the multiplexer to in turn register them with the driver? That would allow for devices to DMA directly to client buffers while still having the IOMMU restricting what the driver can do.
It seems that's what the framework should aim for - as in, don't do anything that precludes mapping pages directly for access by the device, where the hardware supports it.

Tackling device-initiated DMA policy would be great, especially in the case of systems with an IOMMU; however, it feels like that is a concept at a slightly different level than the existing proposal. What prevents it from being tackled in a truly generic way is that the handling of bus addresses passed to the device - ensuring that they only refer to pages supplied by the same client making the request - depends on the format of the commands on the bus, so the implementation of that will be device-specific. It really is up to the driver to ensure that commands inserted into the queue have the correct bus addresses. This is especially true for devices that don't enforce PASID separation (that is, most of them), but there's also the need to handle PASID setup in the case that the device does support them.

I guess at some point the driver wants to query the underlying attached bus and ask: where and how can I map pages for device-initiated DMA, and allocate objects that can be mapped into userspace, whether the underlying mechanism is an IOMMU, a scatterlist, or a jump buffer, according to what the bus can support.

-- William ML Leslie
"William" == William ML Leslie
writes:
On Mon, 17 Oct 2022 at 11:37, Demi Marie Obenour wrote:
Will it be possible for clients to pre-register buffers with the multiplexer, and for the multiplexer to in turn register them with the driver? That would allow for devices to DMA directly to client buffers while still having the IOMMU restricting what the driver can do.
It seems that's what the framework should aim for - as in, don't do anything that precludes mapping pages directly for access by the device, where the hardware supports it.
One of the things that needs to be evaluated is the cost of setting up and tearing down IOMMU mappings. For small transfers, copying the data may be cheaper. In any case, for things like network receive, the multiplexor is going to have to do a copy, to preserve inter-component privacy. You can't share a common DMA area for receive buffers if you want to prevent Component A seeing component B's traffic.
Peter C -- Dr Peter Chubb https://trustworthy.systems/ Trustworthy Systems Group CSE, UNSW
On 10/17/22 17:54, Peter Chubb wrote:
"William" == William ML Leslie
writes:
On Mon, 17 Oct 2022 at 11:37, Demi Marie Obenour wrote:
Will it be possible for clients to pre-register buffers with the multiplexer, and for the multiplexer to in turn register them with the driver? That would allow for devices to DMA directly to client buffers while still having the IOMMU restricting what the driver can do.
It seems that's what the framework should aim for - as in, don't do anything that precludes mapping pages directly for access by the device, where the hardware supports it.
One of the things that needs to be evaluated is the cost of setting up and tearing down IOMMU mappings. For small transfers, copying the data may be cheaper.
Setting up and tearing down IOMMU mappings is very expensive. For zero-copy to be worthwhile, the total amount of data transferred per mapping must be large. This leads to an API in which buffers are pre-registered before performing I/O and stay mapped in the IOMMU until the client explicitly requests that they be unmapped.

This also means that, if the client does not trust the device, it will typically need to perform a copy to avoid TOCTOU attacks, except in the case where the data is just being passed to another component without being looked at or processed. Of course, only the parts that are being accessed need to be copied, so this is still an advantage. For instance, a network stack might only need to copy the packet headers, since it typically does not need to inspect the packet body.

In short, zero-copy can be a useful performance win in some cases, but high-level APIs should generally avoid it unless the device is known to be trusted by the caller (not necessarily by the entire system!). With an untrusted device, there is a high risk of TOCTOU attacks, and preventing them generally requires making a copy anyway. The main cases I can think of where zero-copy is a win with an untrusted device are:

1. A large amount of data is coming in, but most of it does not need to be processed at all. Packet capture is a classic example: one often only needs to read the headers to determine that a packet should be discarded, and can avoid bringing large parts of the packet body into cache.

2. The client will be passing the data to a component that does not trust it, without any processing of its own. The other component will likely need to make its own copy of the data, so the client's copy would be useless.
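To make the header-only copy concrete, here is a sketch (hypothetical names, not taken from any real stack): snapshot just the bytes you intend to parse into private memory, and make every decision from the snapshot, never from the shared buffer.

#include <stddef.h>
#include <stdint.h>

#define MAX_HDR_LEN 128   /* enough for Ethernet + IP + TCP headers */

/* Hypothetical helpers, assumed to exist elsewhere in this sketch. */
static int  headers_look_valid(const uint8_t *hdr, size_t len);
static void deliver_to_client(const uint8_t *hdr, size_t hdr_len,
                              const volatile uint8_t *body, size_t body_len);

/* 'frame' points into a buffer that the device (or another untrusted
 * component) can still modify, so copy the headers into private memory
 * first and only ever parse the copy.  The body is handed on untouched. */
static void handle_rx_frame(const volatile uint8_t *frame, size_t len)
{
    uint8_t hdr[MAX_HDR_LEN];
    size_t hdr_len = len < MAX_HDR_LEN ? len : MAX_HDR_LEN;

    /* Snapshot: after this loop, later changes to the shared buffer can
     * no longer influence our parsing decisions (no TOCTOU on headers). */
    for (size_t i = 0; i < hdr_len; i++)
        hdr[i] = frame[i];

    if (!headers_look_valid(hdr, hdr_len))
        return;                               /* drop without touching the body */

    deliver_to_client(hdr, hdr_len, frame, len);  /* body stays zero-copy */
}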
In any case, for things like network receive, the multiplexor is going to have to do a copy, to preserve inter-component privacy. You can't share a common DMA area for receive buffers if you want to prevent Component A seeing component B's traffic.
Some NICs might support hardware flow steering, which could avoid the copy. That implies trusting the hardware, though. -- Sincerely, Demi Marie Obenour (she/her/hers)
Sorry, stupid Mac mailer messed up quoting again, re-sending
On 17 Oct 2022, at 12:33, Demi Marie Obenour
How do you plan to handle multi-queue devices? Modern devices often have multiple queues so that they can be used from multiple cores without any CPU-side synchronization.
We haven't looked at this yet, but I don't see how this would cause any inherent difficulties. Multiple device queues are logically no different from multiple devices, although they may require per-client de-multiplexing. We'll look at this later.
How do you plan on handling access control? Using a block device as an example, a client must not be able to perform any requests to regions of the block storage that it is not authorized to access. This could either be handled in the multiplexer itself or by having the multiplexer include an unforgeable client ID with each request sent to the driver. Also, what are the consequences of a compromised driver? Will drivers be able to escalate privileges directly, or will the multiplexer and client libraries enforce some invariants even in this case?
There are multiple standard approaches for access control on storage:
- trust your file system – verifying a simple SD file system should be doable
- encrypt all data client-side – that's the storage equivalent of using TLS etc.
- partition the storage medium – the multiplexer (or, for separation of concerns, a client-side filter component) will drop requests that go to the wrong partition (sketched below)
- use a more dynamic scheme with block-level authentication tokens – this is overkill for the static systems we're after, but might be a model for a dynamic system (e.g. https://trustworthy.systems/projects/TS/smos/)
It shouldn't be the driver's job in any case.
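To make the partitioning option concrete, a sketch of the check (made-up types, not sDDF code): the multiplexer, or a filter component in front of it, validates every request against the client's statically assigned block range and rebases it, so the driver never needs to know which client a request came from.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-client partition for a block-device multiplexer:
 * each client is statically assigned one contiguous range of blocks. */
struct partition {
    uint64_t first_block;
    uint64_t num_blocks;
};

struct blk_request {
    uint64_t start_block;   /* client-relative on entry */
    uint64_t num_blocks;
    /* ... buffer offset, direction, request cookie ... */
};

/* Drop any request that strays outside the issuing client's partition
 * and translate client-relative block numbers to device-absolute ones. */
static bool check_and_translate(const struct partition *p,
                                struct blk_request *req)
{
    if (req->num_blocks > p->num_blocks ||
        req->start_block > p->num_blocks - req->num_blocks)
        return false;                       /* out of range (also catches overflow) */
    req->start_block += p->first_block;     /* rebase onto the real device */
    return true;
}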
Will it be possible for clients to pre-register buffers with the multiplexer, and for the multiplexer to in turn register them with the driver? That would allow for devices to DMA directly to client buffers while still having the IOMMU restricting what the driver can do.
Zero-copy is a main driver of the design, but, keep in mind, this is (presently) for static architectures as needed for IoT/cyberphysical etc. Full zero-copy requires dynamic IOMMU remapping, which is known to be expensive on some platforms (but not necessarily all). In the first instance we're likely to use static IOMMU mappings, and use copying where required for security. This is a simple configuration option: optionally insert a copier between server and multiplexer. We're planning to do a study of IOMMU overheads on recent platforms to drive further design decisions.

Note that the results shown by Lucy at the Summit already include client-side copying (to simulate the inefficient Posix interface) – and we still beat Linux by a factor of three. So we've got some performance headroom.

Gernot
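To illustrate what the optional copier amounts to (again only a sketch, reusing the hypothetical ring helpers from earlier in the thread): it sits between two SPSC rings and performs one copy per descriptor, nothing more, so it slots into the Lego set like any other single-threaded component.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BUF_SIZE 2048u   /* fixed-size buffer slots in both regions */

/* Hypothetical RX copier: the DMA region is shared with the driver (and
 * the device), the client region only with one client.  Copying here
 * means the client never has to map DMA memory at all.  In this sketch
 * a descriptor is just a slot index; a real one would carry at least an
 * offset and a length. */
static void copier_rx(struct ring *from_mux, struct ring *to_client,
                      const uint8_t *dma_region, uint8_t *client_region)
{
    uint64_t desc;
    while (ring_dequeue(from_mux, &desc)) {
        uint32_t slot = (uint32_t)desc;
        memcpy(client_region + (uint64_t)slot * BUF_SIZE,
               dma_region    + (uint64_t)slot * BUF_SIZE,
               BUF_SIZE);
        if (!ring_enqueue(to_client, desc))
            break;   /* client ring full; a real copier would hold the descriptor */
    }
}

Dropping the copier from the configuration gives zero-copy; inserting it trades one memcpy per buffer for not exposing the DMA region to the client.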
Again re-sending due to messed-up quoting:
On 17 Oct 2022, at 22:44, Harry Butterworth
On Sun, 16 Oct 2022 at 14:09, Lucy Parker wrote:
We propose adding a second hw thread, one to service each direction…
This is entirely plausible.
Be aware: the threads won’t be independent. For example, ACKs or timing information coming in on the receive thread will need to be communicated to the sending thread.
I think you're talking about IP-level issues. These aren't the driver's business, which is solely concerned with hardware abstraction. And yes, the IP stack won't be able to keep transmit and receive streams separate. But the IP stack will be per-client and not isolation-critical (and not introducing new complications AFAICT).

Gernot
One more mis-formatted mail, resending...
On 18 Oct 2022, at 08:54, Peter Chubb
In any case, for things like network receive, the multiplexor is going to have to do a copy, to preserve inter-component privacy. You can't share a common DMA area for receive buffers if you want to prevent Component A seeing component B's traffic.
Not the multiplexer, that’s the job of a separate (optional) copying component in the lego set. Gernot
On 10/16/22 09:09, Lucy Parker via Devel wrote:
Hi Stewart, The ring buffers were previously implemented as a library on CAmkES. You can find the PR here: https://github.com/seL4/projects_libs/pull/15 The move to the Core Platform was made due to performance overheads in the CAmkES framework. There is no reason the sDDF can’t be used on other platforms either, though the sample system will need porting.
Hi Harry, The use of single producer, single consumer queues was an intentional simplification for the driver framework. This makes it much easier to reason about and verify as well as eliminating many possible concurrency bugs. The framework does not require single threaded address spaces though. We could have multiple threads acting as a single component, but they would be servicing different queues (and thus the queues would remain single producer, single consumer). Ideally for larger systems with multiple clients, we would use a multiplexing component to service multiple queues between client applications and instead grow the stack laterally. This design aims to provide a strong separation of concerns as each job would be a separate component and the simplicity of the queues means there is little performance overhead. In multicore, each of these components would run on separate cores.
You can probably hide most of this for devices where a single hardware thread is sufficient to handle the device and the devices are independent but if you are doing anything involving the CPU in conjunction with a fast modern network device then one hardware thread won’t be enough
We propose adding a second hw thread, one to service each direction and thus we could still maintain single producer/single consumer. We still have some work to do to expand the sDDF to other device classes (and this will involve benchmarking the framework on high throughput networking systems), so more on this to come! :)
For network devices, have you considered using hardware receive-side scaling to shard the workload among multiple independent cores? My understanding is that this is the best solution to this problem, as the fast paths operate with no cross-core synchronization. -- Sincerely, Demi Marie Obenour (she/her/hers)
participants (8)
- Demi Marie Obenour
- Gernot Heiser
- Gerwin Klein
- Harry Butterworth
- Lucy Parker
- Peter Chubb
- Stewart Webb
- William ML Leslie