Somebody in the Zoom call the other day (IIRC it was Gernot) said that seL4 IPC is considerably faster than QNX IPC, but it seems that QNX is somehow faster in a simple benchmarking experiment in which a client repeatedly sends messages to a bit-bucket server, using the TSC to determine timing.

I'm running both QNX and seL4 under identical VirtualBox configurations with a single core. I'm using a QNX 6.3.1 x86-32 image I have in my collection (x86-64 support was only added in 7.0; this is the most recent image I have with dev tools installed). The seL4 version I'm using is from February 2022 and is built for x86-64 with MCS enabled (the priority of the threads is 255, and both the timeslice and budget are the initial thread default of 5 ms).

My benchmark times a loop of 10000 RPCs; for QNX it's just using the regular read() function on /dev/null or a custom server that also acts like /dev/null but forces a copy of the message into a server-provided buffer, and for seL4 I'm directly calling Send()/Recv()/Reply(). I've also tested Linux's /dev/null, and QNX's implementation actually seems somewhat faster than it (for the system /dev/null; the custom server's best case seems only slightly worse than under Linux).

Under QNX and Linux I'm running a C implementation of the benchmark, and under seL4 I tried both C and Rust implementations (which have very similar performance). The messages on QNX and Linux are 100 bytes, and on seL4 they are 2 words, which should be hitting the fast path.

Here are the best and worst cases I got:

seL4: 0.114-0.136 s
QNX, custom null server: 0.006-0.063 s
QNX, system null server: 0.002-0.022 s
Linux: 0.004-0.005 s

Has anyone else benchmarked seL4 IPC against that of QNX and gotten better (or comparable) results for seL4, unlike what I got? Is synchronous IPC just less optimized under seL4 than QNX?
On 18 Apr 2024, at 23:55, Andrew Warkentin
Somebody in the Zoom call the other day (IIRC it was Gernot) said that seL4 IPC is considerably faster than QNX IPC, but it seems that QNX is somehow faster in a simple benchmarking experiment in which a client repeatedly sends messages to a bit-bucket server, using the TSC to determine timing.
I'm running both QNX and seL4 under identical VirtualBox configurations with a single core. I'm using a QNX 6.3.1 x86-32 image I have in my collection (x86-64 support was only added in 7.0; this is the most recent image I have with dev tools installed). The seL4 version I'm using is from February 2022 and is built for x86-64 with MCS enabled (the priority of the threads is 255, and both the timeslice and budget are the initial thread default of 5 ms).
My benchmark times a loop of 10000 RPCs; for QNX it's just using the regular read() function on /dev/null or a custom server that also acts like /dev/null but forces a copy of the message into a server-provided buffer, and for seL4 I'm directly calling Send()/Recv()/Reply(). I've also tested Linux's /dev/null, and QNX's implementation actually seems somewhat faster than it (for the system /dev/null; the custom server's best case seems only slightly worse than under Linux).
Under QNX and Linux I'm running a C implementation of the benchmark and under seL4 I tried both C and Rust implementations (which have very similar performance). The messages on QNX and Linux are 100 bytes, and on seL4 they are 2 words, which should be hitting the fast path.
Here are the best and worst cases I got:

seL4: 0.114-0.136 s
QNX, custom null server: 0.006-0.063 s
QNX, system null server: 0.002-0.022 s
Linux: 0.004-0.005 s
Has anyone else benchmarked seL4 IPC against that of QNX and gotten better (or comparable) results for seL4, unlike what I got? Is synchronous IPC just less optimized under seL4 than QNX?
I’m not completely sure what you’re actually measuring here (actually I’m totally unsure what your setup really is; I can’t make sense of the above description in terms of what the seL4 system is doing). But measuring performance of a kernel inside a virtual machine is somewhat pointless at best.

In any case it seems clear that you’re *not* measuring raw seL4 IPC. Your quoted cost of a round trip on seL4 is, say, 120 ms for 10,000 round trips, or 12 µs for one round trip. You don’t specify the clock rate, but that would be somewhere between 12,000 and 50,000 cycles, when a round-trip IPC on x86 with MCS is <900 cycles.

Note that using Send()/Recv()/Reply() is the wrong way to use seL4 IPC, see https://microkerneldude.org/2019/03/07/how-to-and-how-not-to-use-sel4-ipc/. However, this will only be about a factor of two slower than using the correct Call()/ReplyWait() incantation.

In fact, while it is in principle possible to outperform seL4 in IPC, it is impossible to do so by more than a few percent, as the seL4 IPC cost is only about 20% higher than the pure hardware cost of a kernel trap, minimal state save+restore, and switching page tables. This has been independently established in a peer-reviewed paper, see https://dl.acm.org/doi/pdf/10.1145/3302424.3303946

Everything I’ve ever seen from QNX is that their IPC cost is many thousands of cycles (probably >10,000), and at least an order of magnitude slower than seL4’s.

Gernot
On Thu, Apr 18, 2024 at 8:31 PM Gernot Heiser via Devel
I’m not completely sure what you’re actually measuring here (actually I’m totally unsure what your setup really is, I can’t make sense of the above description in terms of what the seL4 system is doing). But measuring performance of a kernel inside a virtual machine is somewhat pointless at best.
I don't think I've got any way to run both QNX and seL4 on the same hardware without virtualization. I wouldn't think it would make that much of a difference here.
In any case it seems clear that you’re *not* measuring raw seL4 IPC. Your quoted cost of a round trip on seL4 is, say, 120 ms for 10,000 round trips, or 12 µs for one round trip. You don’t specify the clock rate, but that would be somewhere between 12,000 and 50,000 cycles, when a round-trip IPC on x86 with MCS is <900 cycles.
I'm running on a Core i7-960, which is 3.2 GHz.
Note that using Send()/Recv()/Reply() is the wrong way to use seL4 IPC, see https://microkerneldude.org/2019/03/07/how-to-and-how-not-to-use-sel4-ipc/. However, this will only be about a factor of two slower than using the correct Call()/ReplyWait() incantation.
Somehow I said Send()/Recv()/Reply() when I meant Call()/Recv()/Send() (MCS of course doesn't even have a distinct Reply() function). Here's the exact code for the benchmark loop I'm using on seL4:

void c_test_client(seL4_CPtr endpoint)
{
    struct seL4_MessageInfo info = seL4_MessageInfo_new(0, 0, 0, 2);
    int j;
    for (j = 0; j < 10000; j++){
        seL4_Call(endpoint, info);
    }
}

void c_test_server(seL4_CPtr endpoint, seL4_CPtr reply)
{
    struct seL4_MessageInfo info = seL4_MessageInfo_new(0, 0, 0, 2);
    while (1){
        seL4_Recv(endpoint, (void *)0, reply);
        seL4_Send(reply, info);
    }
}

This is being called from a Rust wrapper that allocates the endpoint/reply and reads the TSC; the inner part of it looks like this:

let start = x86_64::_rdtsc();
x86_64::__cpuid(0);
c_test_client(endpoint.to_cap());
let end = x86_64::_rdtsc();
x86_64::__cpuid(0);
info!("test finished: {} {} {}", start, end, end - start);

I'm not quite sure what would be slowing it down. I would have thought this would be measuring only the cost of a round-trip IPC and nothing more. Is there possibly some kind of scheduling issue adding overhead?

The QNX/Linux equivalent is:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <x86intrin.h>

int main()
{
    char buf[100];
    int f = open("/dev/null", O_WRONLY);
    int i;
    for (i = 0; i < 10; i++) {
        uint64_t start = __rdtsc();
        uint64_t end;
        int j;
        for (j = 0; j < 10000; j++){
            if (write(f, buf, 100) != 100){
                printf("cannot write\n");
                exit(1);
            }
        }
        end = __rdtsc();
        printf("cycles: %llu\n", (unsigned long long)(end - start));
    }
}
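(For reference, a common way to serialise TSC reads on x86 so that out-of-order execution doesn't blur the measurement boundaries is sketched below. This is illustrative only, using the GCC/Clang intrinsics from <x86intrin.h>, and it does nothing about distortion introduced by the VM itself.)

#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, __rdtscp, _mm_lfence */

/* Fences before and after the starting read keep earlier instructions
   from drifting into the measured region. */
static inline uint64_t tsc_begin(void)
{
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

/* rdtscp waits until all previous instructions have executed before
   reading the counter; the trailing fence stops later code moving up. */
static inline uint64_t tsc_end(void)
{
    unsigned aux;
    uint64_t t = __rdtscp(&aux);
    _mm_lfence();
    return t;
}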
On 19 Apr 2024, at 14:00, Andrew Warkentin
Somehow I said Send()/Recv()/Reply() when I meant Call()/Recv()/Send() (MCS of course doesn't even have a distinct Reply() function).
the correct syscall to use is ReplyWait():

Recv(…);
while (1) {
    ReplyWait(…);
}

As per my blog: Send() and Recv() should only ever be used in initialisation and exception handling.
The QNX/Linux equivalent is:
int main()
{
    char buf[100];
    int f = open("/dev/null", O_WRONLY);
    int i;
    for (i = 0; i < 10; i++) {
        uint64_t start = __rdtsc();
        uint64_t end;
        int j;
        for (j = 0; j < 10000; j++){
            if (write(f, buf, 100) != 100){
                printf("cannot write\n");
                exit(1);
            }
        }
        end = __rdtsc();
        printf("cycles: %llu\n", (unsigned long long)(end - start));
    }
}
so you’re using standard I/O to /dev/null.

My Posix is a bit rusty, but this should be buffered in the library (i.e. most calls will *not* result in a system call). And, given that the output goes to /dev/null, the data may be thrown away completely.

Basically you’re measuring the cost of a function call.

Gernot
On 4/19/24 02:45, Gernot Heiser via Devel wrote:
On 19 Apr 2024, at 14:00, Andrew Warkentin wrote:

Somehow I said Send()/Recv()/Reply() when I meant Call()/Recv()/Send() (MCS of course doesn't even have a distinct Reply() function).
the correct syscall to use is ReplyWait():
Recv(…);
while (1) {
    ReplyWait(…);
}
As per my blog: Send() and Recv() should only ever be used in initialisation and exception handling.
The QNX/Linux equivalent is:
int main()
{
    char buf[100];
    int f = open("/dev/null", O_WRONLY);
    int i;
    for (i = 0; i < 10; i++) {
        uint64_t start = __rdtsc();
        uint64_t end;
        int j;
        for (j = 0; j < 10000; j++){
            if (write(f, buf, 100) != 100){
                printf("cannot write\n");
                exit(1);
            }
        }
        end = __rdtsc();
        printf("cycles: %llu\n", (unsigned long long)(end - start));
    }
}
so you’re using standard I/O to /dev/null
My Posix is a bit rusty, but this should be buffered in the library (i.e. most calls will *not* result in a system call).
Not true on any *nix I am familiar with. The buffered version is fwrite(); write() always results in a real system call.

That said, I know nothing about QNX other than it being based on a microkernel. It might be that QNX _does_ implement userspace buffering for write() and avoids the system call overhead in this benchmark.

--
Sincerely,
Demi Marie Obenour (she/her/hers)
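(A trivial sketch of the distinction: fwrite() on a FILE* goes into the stdio buffer and typically doesn't trap until fflush()/fclose(), whereas every write() on a file descriptor is its own system call.)

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[100] = {0};

    FILE *fp = fopen("/dev/null", "w");
    fwrite(buf, 1, sizeof buf, fp);   /* lands in the stdio buffer; usually no syscall yet */
    fflush(fp);                       /* the underlying write happens here, if at all */

    int fd = open("/dev/null", O_WRONLY);
    write(fd, buf, sizeof buf);       /* one system call per write() */
    return 0;
}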
On Fri, Apr 19, 2024 at 12:47 AM Gernot Heiser via Devel
the correct syscall to use is ReplyWait():
Recv(…);
while (1) {
    ReplyWait(…);
}
As per my blog: Send() and Recv() should only ever be used in initialisation and exception handling.
Rewriting the server function to:

void c_test_server(seL4_CPtr endpoint, seL4_CPtr reply)
{
    struct seL4_MessageInfo info = seL4_MessageInfo_new(0, 0, 0, 2);
    seL4_Recv(endpoint, (void *)0, reply);
    while (1){
        seL4_ReplyRecv(endpoint, info, (void *)0, reply);
    }
}

made little difference.
so you’re using standard I/O to /dev/null
My Posix is a bit rusty, but this should be buffered in the library (i.e. most calls will *not* result in a system call). And, given that the output goes to /dev/null, the data may be thrown away completely.
Basically you’re measuring the cost of a function call.
As Demi said, read()/write() aren't usually buffered. I haven't
actually checked whether QNX is buffering it, but I'd be a little bit
surprised if it were.
On Fri, Apr 19, 2024 at 3:12 AM Indan Zupancic
Hello Andrew,
I'm not quite sure what would be slowing it down. I would have thought this would be measuring only the cost of a round trip IPC and nothing more. Is there possibly some kind of scheduling issue adding overhead?
Yes, you are running it in a virtual machine, remember? Any timing is suspect.
Wouldn't that affect timing relative to bare metal more than timing of kernels relative to each other? I guess seL4 might just somehow not get along with virtualization (the timings in QEMU and VirtualBox are fairly similar), although I'd be a little surprised if it were affected that much relative to QNX.
The QNX/Linux equivalent is:
int main()
{
    char buf[100];
    int f = open("/dev/null", O_WRONLY);
    int i;
    for (i = 0; i < 10; i++) {
        uint64_t start = __rdtsc();
        uint64_t end;
        int j;
        for (j = 0; j < 10000; j++){
            if (write(f, buf, 100) != 100){
                printf("cannot write\n");
                exit(1);
            }
        }
        end = __rdtsc();
        printf("cycles: %llu\n", (unsigned long long)(end - start));
    }
}
This doesn't do any IPC and measures the time of 10k dummy write syscalls.
The POSIX equivalent would be to have two processes using pipes or fifos with the server sending a reply and the sender waiting for the reply before sending the next message.
AFAIK writing to /dev/null does some IPC under QNX (/dev/null is exported by the process server) although I suspect it's not copying anything. In any case I wrote my own null server that specifically copies the data into its own buffer (that's what the "custom null server" timings are) in order to avoid any optimizations like that, and it's slower than real /dev/null, but still way faster than what I'm getting out of seL4.
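(For comparison, a minimal, untested sketch of the pipe-based ping-pong benchmark Indan suggests: a forked server answers each 100-byte message with a 1-byte ack, and the client waits for the ack before sending the next message. Names and sizes are only illustrative.)

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <x86intrin.h>

#define ITERS 10000

int main(void)
{
    int c2s[2], s2c[2];              /* client->server and server->client pipes */
    if (pipe(c2s) < 0 || pipe(s2c) < 0) { perror("pipe"); exit(1); }

    if (fork() == 0) {               /* server: ack every message with one byte */
        char buf[100], ack = 0;
        for (;;) {
            if (read(c2s[0], buf, sizeof buf) <= 0) _exit(0);
            if (write(s2c[1], &ack, 1) != 1) _exit(1);
        }
    }

    char msg[100] = {0}, ack;
    uint64_t start = __rdtsc();
    for (int i = 0; i < ITERS; i++) {
        if (write(c2s[1], msg, sizeof msg) != sizeof msg) { perror("write"); exit(1); }
        if (read(s2c[0], &ack, 1) != 1) { perror("read"); exit(1); }
    }
    uint64_t end = __rdtsc();
    printf("cycles: %llu\n", (unsigned long long)(end - start));
    return 0;
}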
Hello Andrew,

On 2024-04-19 10:25, Andrew Warkentin wrote:
made little difference.
The difference is just the overhead of 10k system calls in seL4 (seL4_ReplyRecv saves the server one system call per iteration).
As Demi said, read()/write() aren't usually buffered. I haven't actually checked whether QNX is buffering it, but I'd be a little bit surprised if it were.
You're right that it's not buffered in user space, but it is buffered in kernel space.
On Fri, Apr 19, 2024 at 3:12 AM Indan Zupancic wrote:

Yes, you are running it in a virtual machine, remember? Any timing is suspect.
Wouldn't that affect timing relative to bare metal more than timing of kernels relative to each other? I guess seL4 might just somehow not get along with virtualization (the timings in QEMU and VirtualBox are fairly similar), although I'd be a little surprised if it were affected that much relative to QNX.
Yes. What I meant is that I wouldn't trust the numbers to measure the individual IPC timing for seL4, because of overheads and the chance of unexpected interruptions, like a different host task being scheduled. It's good enough for relative comparisons; just be mindful of the pitfalls.
This doesn't do any IPC and measures the time of 10k dummy write syscalls.
The POSIX equivalent would be to have two processes using pipes or fifos with the server sending a reply and the sender waiting for the reply before sending the next message.
AFAIK writing to /dev/null does some IPC under QNX (/dev/null is exported by the process server) although I suspect it's not copying anything. In any case I wrote my own null server that specifically copies the data into its own buffer (that's what the "custom null server" timings are) in order to avoid any optimizations like that, and it's slower than real /dev/null, but still way faster than what I'm getting out of seL4.
Because it is asynchronous: you do 10k system calls which queue some data, and QNX can do one context switch to the server to handle all 10k of them in one go. The client needs to read a reply from the server for each message sent to make it equivalent to seL4's code.

If system call overhead is what you want to measure, fine, but make the seL4 version asynchronous too, by using notifications and shared memory queues and having the same priority for server and client tasks.

IPC in seL4 should be seen as a protected procedure call into another process; it shouldn't be used for message passing in the classical sense, see:

https://microkerneldude.org/2019/03/07/how-to-and-how-not-to-use-sel4-ipc/

Greetings,

Indan
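(A rough sketch of the notification-plus-shared-memory pattern described above, for illustration only: this is not an existing seL4 library API, and the ring layout and names are assumptions. The client pushes fixed-size records into a single-producer/single-consumer ring mapped into both address spaces and signals a notification; the server waits on the notification and drains whatever has accumulated.)

#include <stdatomic.h>
#include <stdint.h>
#include <sel4/sel4.h>

#define SLOTS 256                       /* must be a power of two */

struct ring {
    _Atomic uint32_t head;              /* written only by the producer */
    _Atomic uint32_t tail;              /* written only by the consumer */
    uint64_t slot[SLOTS][2];            /* 2-word messages, as in the benchmark */
};

/* Producer (client) side: enqueue and signal. */
static int ring_push(struct ring *r, seL4_CPtr ntfn, uint64_t a, uint64_t b)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == SLOTS)
        return -1;                      /* ring is full */
    r->slot[head % SLOTS][0] = a;
    r->slot[head % SLOTS][1] = b;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    seL4_Signal(ntfn);                  /* wake the server if it is waiting */
    return 0;
}

/* Consumer (server) side: block on the notification, then drain the ring. */
static void ring_drain(struct ring *r, seL4_CPtr ntfn)
{
    for (;;) {
        seL4_Wait(ntfn, NULL);
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
        while (tail != head) {
            /* process r->slot[tail % SLOTS] here */
            tail++;
        }
        atomic_store_explicit(&r->tail, tail, memory_order_release);
    }
}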
On Fri, Apr 19, 2024 at 4:22 AM Indan Zupancic
Because it is asynchronous: You do 10k system calls which queue some data, QNX can do one context switch to the server to handle all 10k of them in one go.
The client needs to read a reply from the server for each message sent to make it equivalent to seL4's code.
QNX's read() appears to be a very simple wrapper around MsgSend(), which is synchronous and has similar RPC-like semantics to seL4_Call except that it operates on a user-provided arbitrary-length buffer instead of a fixed one. I don't think there's any combining of multiple calls into one going on here.
I should also add that its write() is the same kind of simple wrapper around MsgSend().
Hello Andrew,

On 2024-04-19 12:40, Andrew Warkentin wrote:
On Fri, Apr 19, 2024 at 4:22 AM Indan Zupancic wrote:

Because it is asynchronous: You do 10k system calls which queue some data, QNX can do one context switch to the server to handle all 10k of them in one go.
The client needs to read a reply from the server for each message sent to make it equivalent to seL4's code.
QNX's read() appears to be a very simple wrapper around MsgSend(), which is synchronous and has similar RPC-like semantics to seL4_Call except that it operates on a user-provided arbitrary-length buffer instead of a fixed one.
If what you think is true, QNX can do a context switch in 100 ns, or the whole send and reply with two context switches in 600 ns for the custom server version. I have my doubts; I thought context switch overhead was higher on x86.

Do you have access to the source code of the QNX version you are running?

Just write a low-level test using MsgSend() and MsgReply() directly and add a way to verify that all 10k calls are handled correctly. E.g. pass an integer, let the server increment it and use the reply for the next call. The final value should be 10k.

There's probably a mistake with seL4's code or kernel configuration. Even when using the slow path I wouldn't expect such high numbers, at least not on ARM (I have no experience with seL4 on x86). The 11 µs per iteration you get is extremely slow, even for a debug build (except if you get debug prints now and then).

Greetings,

Indan
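(An untested sketch of that low-level test: the client and server pass an integer through MsgSend()/MsgReceive()/MsgReply(), the server increments it, and the final value verifies that all 10k round trips really happened. Two threads in one process are used purely to keep the sketch short; the channel setup is the usual ChannelCreate()/ConnectAttach() pairing.)

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/neutrino.h>
#include <x86intrin.h>

#define ITERS 10000

static int chid;

static void *server(void *arg)
{
    int msg, reply;
    for (;;) {
        int rcvid = MsgReceive(chid, &msg, sizeof msg, NULL);
        if (rcvid == -1) { perror("MsgReceive"); exit(1); }
        reply = msg + 1;                 /* increment and send back */
        MsgReply(rcvid, 0, &reply, sizeof reply);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;
    chid = ChannelCreate(0);
    pthread_create(&tid, NULL, server, NULL);

    /* pid 0 = this process; _NTO_SIDE_CHANNEL keeps the coid clear of fds */
    int coid = ConnectAttach(0, 0, chid, _NTO_SIDE_CHANNEL, 0);

    int val = 0, reply;
    uint64_t start = __rdtsc();
    for (int i = 0; i < ITERS; i++) {
        if (MsgSend(coid, &val, sizeof val, &reply, sizeof reply) == -1) {
            perror("MsgSend"); exit(1);
        }
        val = reply;
    }
    uint64_t end = __rdtsc();
    printf("final value %d (expect %d), cycles: %llu\n",
           val, ITERS, (unsigned long long)(end - start));
    return 0;
}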
Hello Andrew,

On 2024-04-19 05:00, Andrew Warkentin wrote:
Here's the exact code for the benchmark loop I'm using on seL4:
void c_test_client(seL4_CPtr endpoint)
{
    struct seL4_MessageInfo info = seL4_MessageInfo_new(0, 0, 0, 2);
    int j;
    for (j = 0; j < 10000; j++){
        seL4_Call(endpoint, info);
    }
}

void c_test_server(seL4_CPtr endpoint, seL4_CPtr reply)
{
    struct seL4_MessageInfo info = seL4_MessageInfo_new(0, 0, 0, 2);
    while (1){
        seL4_Recv(endpoint, (void *)0, reply);
        seL4_Send(reply, info);
    }
}
This measures synchronous calls, so 20k context switches back and forth between the client and the server.
I'm not quite sure what would be slowing it down. I would have thought this would be measuring only the cost of a round trip IPC and nothing more. Is there possibly some kind of scheduling issue adding overhead?
Yes, you are running it in a virtual machine, remember? Any timing is suspect.
The QNX/Linux equivalent is:
int main()
{
    char buf[100];
    int f = open("/dev/null", O_WRONLY);
    int i;
    for (i = 0; i < 10; i++) {
        uint64_t start = __rdtsc();
        uint64_t end;
        int j;
        for (j = 0; j < 10000; j++){
            if (write(f, buf, 100) != 100){
                printf("cannot write\n");
                exit(1);
            }
        }
        end = __rdtsc();
        printf("cycles: %llu\n", (unsigned long long)(end - start));
    }
}
This doesn't do any IPC and measures the time of 10k dummy write syscalls.

The POSIX equivalent would be to have two processes using pipes or fifos, with the server sending a reply and the sender waiting for the reply before sending the next message.

Greetings,

Indan
On 4/19/24 05:12, Indan Zupancic wrote:
Hello Andrew,
On 2024-04-19 05:00, Andrew Warkentin wrote:
Here's the exact code for the benchmark loop I'm using on seL4:
void c_test_client(seL4_CPtr endpoint)
{
    struct seL4_MessageInfo info = seL4_MessageInfo_new(0, 0, 0, 2);
    int j;
    for (j = 0; j < 10000; j++){
        seL4_Call(endpoint, info);
    }
}

void c_test_server(seL4_CPtr endpoint, seL4_CPtr reply)
{
    struct seL4_MessageInfo info = seL4_MessageInfo_new(0, 0, 0, 2);
    while (1){
        seL4_Recv(endpoint, (void *)0, reply);
        seL4_Send(reply, info);
    }
}
This measures synchronous calls, so 20k context switches back and forth between the client and the server.
I'm not quite sure what would be slowing it down. I would have thought this would be measuring only the cost of a round trip IPC and nothing more. Is there possibly some kind of scheduling issue adding overhead?
Yes, you are running it in a virtual machine, remember? Any timing is suspect.
Should not be an order of magnitude off, though.

--
Sincerely,
Demi Marie Obenour (she/her/hers)
The topic of seL4 vs QNX performance came up in the developer hangout and I found the entire discussion perplexing for the following reasons:

- Not inconsequentially, Gernot, who I presume is reasonably expert, was strongly of the opinion that it was a non-issue.
- Marginal performance gains, even if possible, are not even remotely warranted at the cost of any security whatsoever.
- Some timing issues may be security-critical, and altering them could compromise the security of the kernel.
- Hardware capacity has vastly improved in speed as of 2024; by the time seL4 is fully productionized, hardware improvements will most likely erase any differential in speed.
- I do not find it hard to believe that seL4 is perfectly adequate as an RTOS within its current architecture.
- There does not seem to me to be any advantage in modifying access to the kernel, and there are plenty of disadvantages. It's all downside. The OS is open source, so people are at liberty to modify it as they please. They just can't call it seL4.
- To me, the moniker 'seL4' is a promise and a guarantee, and that promise cannot reasonably be kept by making special use-case exceptions.
- QNX minimum requirements call for a greater-than-1 GHz processor, see: https://www.qnx.com/developers/docs/6.5.0SP1.update/com.qnx.doc.momentics_in...
- Say, for the sake of argument, there is a system-call scenario whereby the kernel architecture is causing an unwarranted delay. That might be a defect and could be addressed as such.

Finally, there is an old development rule of thumb that first you make it work before you look to optimize.
I have been researching security since the late 1980s. I designed and wrote the secure network access system for a Canadian bank in the early 1990s. As it happens, I designed a security protocol twenty years ago that is in use by the Canadian mint to secure money. Since 1992, security has been an integral part of my ongoing research project into data packaging. Years ago I got permission to release code using Gutmann's cryptlib, and I spent a great deal of time reviewing and learning from that code.

I don't really hold myself out as an expert WRT security. However, I have much to say about it, and I do have some knowledge. I would not presume to question Gernot's judgment with respect to seL4, on security or otherwise. Gernot has expressed confidence in seL4, and that is good enough for me.
--
Bob Trower
--- From Gmail webmail account. ---
Participants (5): Andrew Warkentin, Bob Trower, Demi Marie Obenour, Gernot Heiser, Indan Zupancic