Benchmarks of seL4
Hello,
I come wondering: the FAQ claims that seL4 is “the world's fastest microkernel”. However, I can only find benchmarks of seL4 alone, without comparison to other operating systems running on the same machine. Would you happen to have such comparative benchmarks? I'd be particularly interested in something against L4Re, as it would probably allow me to guess at the performance of something similar to L4Linux ported to seL4.
Thank you for all you do on seL4!
Leo
On 13 Apr 2020, at 09:37, Leo Gaspard
"Heiser, Gernot (Data61, Kensington NSW)"
We benchmark our own system and publish the results in an easy-to-reproduce way. No-one else does, for good reason (no-one is keen to demonstrate how much slower they are). If we published our own measurements of other systems’ performance, people might not believe them, or we might not even manage to tweak the other systems to get the best out of them. So the best thing for us to do is to be open about our own performance, make the claim of being the fastest, and leave it to others to challenge this (with reproducible numbers, of course).
However, there is a recent peer-reviewed paper that does compare the performance of seL4, Fiasco.OC (aka L4Re) and Zircon. It shows that seL4 performance is within 10–20% of the hardware limit, while Fiasco.OC is about a factor of two slower (i.e. >100% above the hardware limit), and Zircon is far slower still: https://dl.acm.org/doi/pdf/10.1145/3302424.3303946
Thank you for your reply! I must say I find this stance strange in a way: sure, people can say you cheated if you published benchmarks for other systems, but… is that worse than them saying you make claims without providing any grounds for them? Anyway, I'm not saying that these claims are unsubstantiated (the paper you linked does look like reasonable evidence for them), but if said paper were linked alongside the claim, and/or you published benchmarks of your competitors (who could then challenge your results with results of their own), it would probably avoid such assertions being made :)
All that to say, thank you for your reply!
Leo
On Thu, Apr 16, 2020 at 1:20 AM Leo Gaspard
Thank you for your reply! I must say I find this stance strange in a way: sure, people can say you cheated if you published benchmarks for other systems, but… is it worse than them saying you make claims without providing any grounds for them?
I suspect Gernot would prefer to avoid being beaten with his own stick. :-D https://www.cse.unsw.edu.au/~gernot/benchmarking-crimes.html
(sorry for the double-send, I used the wrong From email on the first attempt so the post didn't make it to the list)
Jeff Waugh
Oooh, I hadn't noticed the Gernots here were the same person! Well, then, thank you for the list & paper, they're really nice to keep at hand when benchmarking :)
That said, if the problem is actually configuring the competitors correctly, then there's something that differs from the context of the paper, I think: a paper is written once and then set in stone, while data on a website can evolve over time. In other words, explicitly saying “here are all the configuration parameters we tuned for the competitors; if you know of a way to make them better, please tell us” should hopefully be enough to alleviate any concerns that the competitors were unfairly treated… would it not?
On 4/12/20, Leo Gaspard
Kernel performance is not the only factor when it comes to performance of microkernel OSes. An OS with a heavyweight structured RPC transport layer with mandatory dynamic marshaling and lots of vertical separation of subsystems into multiple processes is likely to be slower than an OS with an unstructured stream transport layer and subsystems that are mostly single processes, even if both are built on the same high-performance kernel. It seems that most microkernel OSes follow the former model for some reason, and I'm not sure why.
On 13 Apr 2020, at 13:06, Andrew Warkentin
On 4/12/20, Heiser, Gernot (Data61, Kensington NSW)
Sure, OS structure matters a lot, and I’m certainly known for telling people consistently that IPC payloads of more than a few words are a strong indicator of a poor design. Microkernel IPC should be considered a protected function call mechanism, and you shouldn’t pass more by-value data than you would to a C function (see https://microkerneldude.wordpress.com/2019/03/07/how-to-and-how-not-to-use-s...).
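To make that concrete, here is a minimal sketch of IPC-as-protected-function-call on seL4. The seL4_* calls are the real libsel4 primitives; the endpoint, label, and two-word "add" protocol are made up for illustration.

#include <sel4/sel4.h>

#define ADD_OP 1  /* hypothetical protocol label understood by the server */

/* Call a (hypothetical) server on endpoint `ep` to add two numbers.
 * Only two machine words are passed by value, in message registers --
 * no marshaling buffer, no heap allocation, much like a C function call. */
seL4_Word add_via_ipc(seL4_CPtr ep, seL4_Word a, seL4_Word b)
{
    seL4_MessageInfo_t info = seL4_MessageInfo_new(ADD_OP, 0, 0, 2);
    seL4_SetMR(0, a);
    seL4_SetMR(1, b);

    /* seL4_Call combines send and receive in one kernel entry, giving
     * call/return semantics -- the "protected function call". */
    info = seL4_Call(ep, info);

    return seL4_GetMR(0);  /* the reply value, by (made-up) convention in MR0 */
}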
I would think that an RPC layer with run-time marshaling of arguments, as is used as the IPC transport layer on most microkernel OSes, would add some overhead even if it uses the underlying IPC layer properly, since it has to iterate over the list of arguments, determine the type of each, and copy it to a buffer, with the reverse happening on the receiving end. Passing around bulk unstructured/opaque data is quite common (e.g. for disk and network transfers), and an RPC-based transport layer adds unnecessary overhead and complexity to such use cases.
I think a better transport layer design (for an L4-like kernel at least) would be one that maintains RPC-like call-return semantics but exposes message registers and a shared memory buffer almost directly, with the only extra additions being a message length, a file offset, and a type code (to indicate whether a message is a short register-only message, a long message in the buffer, or an error) rather than using marshaling. This is what I plan to do on UX/RT, which will have a Unix-like IPC transport layer API that provides new read()/write()-like functions that operate on message registers or the shared buffer rather than copying as the traditional versions do (the traditional versions will still be present of course, implemented on top of the "raw" versions).
RPC with marshaling could still easily be implemented on top of such a transport layer (for complex APIs that need marshaling) with basically no overhead compared to an RPC-based transport layer.
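For illustration, a rough sketch of what such a minimally structured message header could look like; every name here is hypothetical, since UX/RT's API is only described in prose above.

#include <stddef.h>
#include <stdint.h>

/* Purely illustrative: no per-argument marshaling, just a type code, a
 * length, and a file offset in front of raw register or shared-buffer data. */
enum msg_type {
    MSG_SHORT,  /* payload fits in message registers */
    MSG_LONG,   /* payload lives in the shared memory buffer */
    MSG_ERROR   /* payload is an error code */
};

struct msg_header {
    uint32_t type;    /* enum msg_type */
    uint32_t length;  /* payload length in bytes */
    uint64_t offset;  /* file offset, for read()/write()-like calls */
};

/* Classify a payload by size (illustrative only). */
static struct msg_header make_header(size_t len, uint64_t off, size_t reg_bytes)
{
    struct msg_header h;
    h.type   = (len <= reg_bytes) ? MSG_SHORT : MSG_LONG;
    h.length = (uint32_t)len;
    h.offset = off;
    return h;
}

An RPC layer with marshaling can then sit on top of this for the (relatively few) APIs that need it, while bulk read()/write()-style traffic bypasses it entirely.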
However, microkernel OS structure is very much determined by what security properties you want to achieve, in particular, which components to trust and which not. Less trust generally means more overhead, and the cost of crossing a protection domain boundary is the critical factor that determines this overhead for a given design.
It seems to be quite common for microkernel OSes to vertically split subsystems that really represent single protection domains. A good example is a disk stack. For the most part, all layers of a disk stack are dealing with the same data, just at different levels, so splitting them into processes just adds unnecessary overhead in most cases. Typically, one disk server process per device should be good enough as far as security goes. The only cases where there is any benefit at all to splitting a disk stack vertically are systems with multiple partitions or traditional LVMs that provide raw volumes, and sometimes also systems with disk encryption. On a system with a single partition or an integrated LVM/FS like ZFS, and no disk encryption, there is typically no benefit to splitting up the disk stack.
For systems where keeping partitions/LVs separated is important, it should be possible to run separate "lower" and "upper" disk servers, with the lower one containing the disk driver, partition driver, and LVM, and the upper one containing the FS driver and disk encryption layer, but this should not be mandatory (this is what I plan to do on UX/RT).
A disk stack architecture like that of Genode, where the disk driver, partition driver, and FS driver are completely separate programs (rather than plugins that may be run in different configurations), forces overhead on all use cases even though that overhead often provides no security or error recovery benefit.
It seems that most microkernel OSes follow the former model for some reason, and I'm not sure why.
Which OSes? I’d prefer specific data points over vague claims.
In addition to Genode, prime examples would include Minix 3 and Fuchsia. QNX seems to be the main example of a microkernel OS that uses a minimally structured IPC transport layer (although still somewhat more structured than what UX/RT will have) and goes out of its way to avoid intermediary servers (its VFS doesn't act as an intermediary on read() or write(), and many subsystems are single processes). One paper back in the 90s benchmarked an old version as being significantly faster than contemporary System V/386 on the same hardware for most of the APIs tested (although maybe that just means System V/386 was slow; I should try to see if I can get similar results with later versions of QNX against BSD and Linux).
On Mon, Apr 13, 2020, 7:42 AM Andrew Warkentin
Speaking of UX/RT: I strongly suggest avoiding PID-based APIs in favor of handle-based APIs, and allocating file descriptors in a way that is more efficient than always picking the lowest free one. The first causes many race conditions, and the second causes a lot of synchronization overhead. io_uring is also a good API to consider.
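As a Linux-flavoured illustration of the PID-reuse race that handle-based process APIs avoid (this is not UX/RT code; it assumes Linux 5.3+ for pidfd_open):

#define _GNU_SOURCE
#include <signal.h>
#include <sys/syscall.h>
#include <unistd.h>

/* kill(pid, sig) can hit an unrelated process if the target exited and its
 * PID was reused. A pidfd is a handle to one specific process, so a signal
 * sent through it can never be misdirected. */
int signal_exactly(pid_t pid, int sig)
{
    /* Take a handle on the process now; if it later exits and the PID is
     * recycled, the pidfd still refers only to the original process. */
    int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
    if (pidfd < 0)
        return -1;

    int ret = (int)syscall(SYS_pidfd_send_signal, pidfd, sig, NULL, 0);
    close(pidfd);
    return ret;
}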
On 4/13/20, Demi Obenour
A Linux-like pidfd API will be present (although it will use procfs underneath; anonymous FDs will basically be absent from UX/RT). io_uring will probably be mostly implemented in the root system library, possibly with a few hooks in the root server.
On 4/13/20, Matthew Fernandez
This is effectively the approach CAmkES takes. Generated, specialised RPC entry points that use seL4’s IPC mechanisms. With enough information provided at compile time, you can generate type safe and performant marshaling code. With some care, you can generate code the compiler can easily see through and optimise. For passing bulk data, you can either use shared memory or unmap/remap pages during RPC.
When we did some optimisation work on CAmkES for a paper (~2014?) we were able to comfortably hit the limit of what one could have achieved with hand optimised IPC.
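For contrast, an illustrative sketch of run-time marshaling versus the kind of fixed, compiler-transparent stub such a generator could emit; neither function is actual CAmkES output.

#include <sel4/sel4.h>
#include <string.h>

/* Run-time marshaling: walk a tagged argument list and copy each argument
 * into message registers (real marshalers also dispatch on the type tag). */
struct arg { int type; const void *data; size_t size; };

static void marshal_dynamic(const struct arg *args, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        seL4_Word w = 0;
        memcpy(&w, args[i].data, args[i].size);  /* assumes size <= sizeof(w) */
        seL4_SetMR(i, w);
    }
}

/* Compile-time specialisation: for a known `int add(int, int)` method, the
 * generated stub collapses to two register writes the compiler can inline. */
static inline int add_stub(seL4_CPtr ep, int a, int b)
{
    seL4_MessageInfo_t info = seL4_MessageInfo_new(/* method label */ 0, 0, 0, 2);
    seL4_SetMR(0, (seL4_Word)a);
    seL4_SetMR(1, (seL4_Word)b);
    info = seL4_Call(ep, info);
    return (int)seL4_GetMR(0);
}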
I'm planning to use static compile-time marshaling on top of a regular file descriptor for UX/RT's root server API. Since the only stable ABI will be that of the dynamically-linked root system library, the use of static marshaling shouldn't be a serious issue for binary compatibility. Most other subsystems using RPC will probably use dynamic marshaling.
On Tue, Apr 14, 2020, 12:17 AM Andrew Warkentin
Do you plan to support multikernel configurations and/or virtualization? How do you expect performance and security to compare to Linux? Sincerely, Demi
On Mon, Apr 13, 2020, 10:57 PM Demi Obenour
Multikernel will be supported and will be relatively transparent to user programs (the underlying topology will be visible for applications that want to control affinity).
I'm probably not going to bother with Type 2 virtualization as is seen on Genode, CAmkES and L4Re, since there should be little need to run a Linux system virtualized under UX/RT on embedded systems, due to the built-in compatibility with both Linux drivers and applications. I am planning to write an seL4-based Type 1 hypervisor to go with UX/RT, mostly for desktops and servers. This will be somewhat similar to Xen, but with disaggregated driver domains (the hypervisor will only run PVH- and HVM-type VMs, not processes or PV-type VMs). UX/RT will run as a normal OS within a VM on top of the hypervisor (seeing as seL4 is quite lightweight, nested seL4 kernels should have less overhead than the nested Linux kernels commonly seen with KVM).
I can't say for sure about overall performance, but I'm almost certain UX/RT will have much better IPC performance than Linux due to the lightweight IPC model. I would be surprised if overall performance isn't at least comparable to Linux in most situations, though. It may even end up being like QNX and SysV back in the 90s, with UX/RT outperforming Linux significantly in some cases once it is well optimized, but I can't guarantee that.
I am expecting UX/RT to be significantly more secure than Linux. It will run all subsystems except for the kernel and root server as normal processes/libraries with limited privileges. Rust will be used for security-critical code wherever possible. The purely file-oriented architecture will significantly reduce the number of security-critical entry points compared to conventional OSes and will allow all access control rules (other than quotas) to be expressed in the same way. The new security model will be fully integrated rather than being an afterthought; there will be no way to revert to traditional Unix root/non-root security, nor will the setuid/setgid bits have any effect. There will be a mechanism to run particular programs with escalated privileges, but it will be more flexible, using roles instead of UIDs (it will also allow limiting the situations in which privilege escalation occurs), and it will store associations of privileges in a central store rather than in file attributes.
participants (6)
- Andrew Warkentin
- Demi Obenour
- Heiser, Gernot (Data61, Kensington NSW)
- Jeff Waugh
- Leo Gaspard
- Matthew Fernandez