Request for enhancement: Notification queues
While studying seL4, I noticed that there is no mechanism to wait for one of many notifications. This proposal solves this with a new type of capability, which I call a notification queue. I don’t have the knowledge to determine how they should be implemented, so instead I will describe the general idea. A notification queue is a capability to which notification objects can be attached. When an attached endpoint is notified, a user-provided pointer is pushed onto the queue. Notification queues are edge-triggered, so subsequent notifications do not cause further entries to be added to the queue unless the notification has been read. The queue is implemented as a buffer which is read-only to userspace, but read-write to seL4. When creating a notification queue, userspace must provide sufficient untyped memory to ensure that the queue will not fill up. This ensures that sending a notification cannot fail due to resource exhaustion. Queues can be resized by the application if needed. Attempts to add too many endpoints to a queue results in an error being returned. A notification object can only be attached to one notification queue at a time. It can be detached from a queue and later attached to another. Notification queues can be used whenever a notification object is expected, except that they cannot be badged. Badging a notification queue would make no sense. When polled, a notification queue returns the number of events that have been added to it since it was last polled. Adding one notification queue to another may or may not be allowed. Allowing it would increase flexibility, but also increase implementation complexity. It is also possible to attach a TCB to a queue; the queue is notified when the TCB exits. Only seL4 can send to a notification queue. No userspace process can ever possess a send capability to such a queue. While a notification queue can be accessed from any number of tasks, performance on multi-core systems may be improved if each queue is used from the same core that its attached notifications are used on. Constructive feedback on this proposal is welcome. Sincerely, Demi
Hi Demi, Thanks for your suggestion. The issue of some select-kind of feature keeps coming up. In terms of the minimality principle [1], the question must be asked why this feature must be in the kernel. It seems to me that it would be straightforward to implement it on top of present seL4. What functionality are you after that is not possible to achieve with a user-level implementation, and what convincing use cases do you have for such functionality? Furthermore, there is already a (restricted) form of similar functionality available. Instead waiting for a set of notifications, you can wait on a single notifications where different notifiers have different badges on their caps, and the badges have different bits set, so they can be distinguished even when multiple ones are signalled. This latter version is restricted because of the limited size of badges (word size minus a few bits). I was thinking a while back whether we’d need a more general notion of notifications (that scale more) but have not yet seen a convincing use case. Gernot [1] Minimality principle: "A concept is tolerated inside the microkernel only if moving it outside the kernel, i.e., permitting competing implementations, would prevent the implementation of the system's required functionality.” [Liedtke, SOSP’95]
On 21 Jan 2020, at 14:29, Demi M. Obenour <demiobenour@gmail.com> wrote:
While studying seL4, I noticed that there is no mechanism to wait for one of many notifications. This proposal solves this with a new type of capability, which I call a notification queue.
I don’t have the knowledge to determine how they should be implemented, so instead I will describe the general idea. A notification queue is a capability to which notification objects can be attached. When an attached endpoint is notified, a user-provided pointer is pushed onto the queue. Notification queues are edge-triggered, so subsequent notifications do not cause further entries to be added to the queue unless the notification has been read.
The queue is implemented as a buffer which is read-only to userspace, but read-write to seL4. When creating a notification queue, userspace must provide sufficient untyped memory to ensure that the queue will not fill up. This ensures that sending a notification cannot fail due to resource exhaustion. Queues can be resized by the application if needed. Attempts to add too many endpoints to a queue results in an error being returned.
A notification object can only be attached to one notification queue at a time. It can be detached from a queue and later attached to another.
Notification queues can be used whenever a notification object is expected, except that they cannot be badged. Badging a notification queue would make no sense. When polled, a notification queue returns the number of events that have been added to it since it was last polled.
Adding one notification queue to another may or may not be allowed. Allowing it would increase flexibility, but also increase implementation complexity. It is also possible to attach a TCB to a queue; the queue is notified when the TCB exits.
Only seL4 can send to a notification queue. No userspace process can ever possess a send capability to such a queue.
While a notification queue can be accessed from any number of tasks, performance on multi-core systems may be improved if each queue is used from the same core that its attached notifications are used on.
Constructive feedback on this proposal is welcome.
Sincerely,
Demi
Hi Gernot, Thanks for your prompt response! I was not expecting to get a response at all, much less such a fast one, so thank you. It is indeed straightforward to implement this in user-space. However, I have not found a way to do so that is both clean and performant. The naive approach requires one task for each notification to be polled on, which obviously is both slow and wasteful. There are faster approaches that use a user-space daemon. However, that still requires several context switches, as well as substantial complexity. It also requires all messages to go through a central broker, which is ugly, slow, and does not compose well with other seL4 features. The message broker would likely find itself becoming a buggy and slow reimplementation of much of seL4 itself. In my opinion, the main justification for putting this in the kernel is performance. seL4’s motto is “Security is no excuse for poor performance!”, and I believe that the performance improvement justifies the extra complexity in this case. seL4 makes the same decision in other places, such as IPC fast paths, so there is precedent. While a userspace implementation is possible, it will be substantially slower than a kernel implementation. I suspect that the performance overhead will be at least a factor of 2, due to the doubling of context switches. I consider this overhead unacceptable for an interface that will often be used in hot paths. I suspect the main reason that this has not been needed as much in the past is that static systems are less likely to need to wait on one of thousands of events to happen. However, as seL4 starts being used in highly dynamic systems, I suspect that this will become more and more necessary. I consider performance to be part of seL4’s required functionality, and my understanding is that the seL4 developers agree with this. While seL4 is currently mostly used in static systems via CAmkES, I believe that it has potential to be used in highly dynamic systems as well. It should be possible to build a full desktop OS on seL4, and for it to be just as fast as Linux. I hope to see one someday. Sincerely, Demi On 2020-01-20 23:04, Heiser, Gernot (Data61, Kensington NSW) wrote:
Hi Demi,
Thanks for your suggestion. The issue of some select-kind of feature keeps coming up.
In terms of the minimality principle [1], the question must be asked why this feature must be in the kernel. It seems to me that it would be straightforward to implement it on top of present seL4. What functionality are you after that is not possible to achieve with a user-level implementation, and what convincing use cases do you have for such functionality?
Furthermore, there is already a (restricted) form of similar functionality available. Instead waiting for a set of notifications, you can wait on a single notifications where different notifiers have different badges on their caps, and the badges have different bits set, so they can be distinguished even when multiple ones are signalled.
This latter version is restricted because of the limited size of badges (word size minus a few bits). I was thinking a while back whether we’d need a more general notion of notifications (that scale more) but have not yet seen a convincing use case.
Gernot
[1] Minimality principle: "A concept is tolerated inside the microkernel only if moving it outside the kernel, i.e., permitting competing implementations, would prevent the implementation of the system's required functionality.” [Liedtke, SOSP’95]
On 21 Jan 2020, at 14:29, Demi M. Obenour <demiobenour@gmail.com> wrote:
While studying seL4, I noticed that there is no mechanism to wait for one of many notifications. This proposal solves this with a new type of capability, which I call a notification queue.
I don’t have the knowledge to determine how they should be implemented, so instead I will describe the general idea. A notification queue is a capability to which notification objects can be attached. When an attached endpoint is notified, a user-provided pointer is pushed onto the queue. Notification queues are edge-triggered, so subsequent notifications do not cause further entries to be added to the queue unless the notification has been read.
The queue is implemented as a buffer which is read-only to userspace, but read-write to seL4. When creating a notification queue, userspace must provide sufficient untyped memory to ensure that the queue will not fill up. This ensures that sending a notification cannot fail due to resource exhaustion. Queues can be resized by the application if needed. Attempts to add too many endpoints to a queue results in an error being returned.
A notification object can only be attached to one notification queue at a time. It can be detached from a queue and later attached to another.
Notification queues can be used whenever a notification object is expected, except that they cannot be badged. Badging a notification queue would make no sense. When polled, a notification queue returns the number of events that have been added to it since it was last polled.
Adding one notification queue to another may or may not be
allowed. Allowing it would increase flexibility, but also increase implementation complexity. It is also possible to attach a TCB to a queue; the queue is notified when the TCB exits.
Only seL4 can send to a notification queue. No userspace process can ever possess a send capability to such a queue.
While a notification queue can be accessed from any number of tasks, performance on multi-core systems may be improved if each queue is used from the same core that its attached notifications are used on.
Constructive feedback on this proposal is welcome.
Sincerely,
Demi
_______________________________________________ Devel mailing list Devel@sel4.systems https://sel4.systems/lists/listinfo/devel
On 1/20/20, Demi M. Obenour <demiobenour@gmail.com> wrote:
Hi Gernot,
Thanks for your prompt response! I was not expecting to get a response at all, much less such a fast one, so thank you.
It is indeed straightforward to implement this in user-space. However, I have not found a way to do so that is both clean and performant. The naive approach requires one task for each notification to be polled on, which obviously is both slow and wasteful. There are faster approaches that use a user-space daemon. However, that still requires several context switches, as well as substantial complexity. It also requires all messages to go through a central broker, which is ugly, slow, and does not compose well with other seL4 features. The message broker would likely find itself becoming a buggy and slow reimplementation of much of seL4 itself.
In my opinion, the main justification for putting this in the kernel is performance. seL4’s motto is “Security is no excuse for poor performance!”, and I believe that the performance improvement justifies the extra complexity in this case. seL4 makes the same decision in other places, such as IPC fast paths, so there is precedent. While a userspace implementation is possible, it will be substantially slower than a kernel implementation. I suspect that the performance overhead will be at least a factor of 2, due to the doubling of context switches. I consider this overhead unacceptable for an interface that will often be used in hot paths.
I suspect the main reason that this has not been needed as much in the past is that static systems are less likely to need to wait on one of thousands of events to happen. However, as seL4 starts being used in highly dynamic systems, I suspect that this will become more and more necessary. I consider performance to be part of seL4’s required functionality, and my understanding is that the seL4 developers agree with this. While seL4 is currently mostly used in static systems via CAmkES, I believe that it has potential to be used in highly dynamic systems as well. It should be possible to build a full desktop OS on seL4, and for it to be just as fast as Linux. I hope to see one someday.
I'm working on just such an OS <https://gitlab.com/uxrt>, and I've thought it would be nice to have some form of kernel API to wait for multiple notifications as well. Initially, I was planning to just implement select() and related APIs by creating a temporary endpoint and then spawning one thread (in the calling process, not a separate daemon) per notification and then using either badges or a specific message type to signal on that endpoint which notification is ready, although I'm not quite sure how well that will scale. I am trying to avoid intermediary servers as much as possible, unlike in most other microkernel OSes that have (mostly pointless) vertical layering everywhere (the VFS will act as an intermediary on open()/close() and all directory accesses since that's probably the best way to enforce permissions, but read()/write()-family APIs on non-directories will use kernel APIs to communicate directly with the other process). Another kernel feature I've thought might improve performance for a dynamic OS would be some form of API for transferring multiple capabilities between CNodes in a single operation, since a dynamic OS may transfer multiple page capabilities at once when mapping memory (in UX/RT I'm planning to manage all capabilities in the root server and only map copied capabilities into the VSpaces of user processes).
On 21 Jan 2020, at 15:31, Demi M. Obenour <demiobenour@gmail.com> wrote:
It is indeed straightforward to implement this in user-space. However, I have not found a way to do so that is both clean and performant.
Sorry for being slow in replying. If anything can be done at user level it should, this is a fundamental aspect of seL4, and the core enabler of both its high performance as well as verifiability. Any diversion from that principle needs a very convincing case to be made (with a realistic and clearly important use case). Please note that our promise to the community is that any changes we make to the kernel will be verified, and they cannot go into mainline until the official verified platforms have been re-verified. This means that the more invasive the proposed change, the more costly, and the stronger the justification needed. At the moment, large dynamic systems are (unfortunately) not a high priority for seL4. I hope this will change in the foreseeable future, but right now we’ve got too many things that are higher on the list of priorities. Also, I’m not convinced that even if we agreed that we wanted such functionality, your proposal would be the right way to do it. As I explained, Notifications already provide the same kind of functionality you are after, but the limited width of badges restricts it to a moderate number of notifiers. (At present that limitation is essentially given by the word size of 32-bit systems, that should definitely be changed on 64-bit systems, but even then it’ll be around 64.) I strongly suspect the more fundamental and general extension would be to remove the scalability limitation of Notifications, probably by generalising them to a hierarchy, i.e. you signal a sub-notification which then signals the higher-level notification. I’m not claiming having thought this through, though. But everything needs a really convincing argument, that’s based on a functional limitation that prevents implementation of certain important classes of systems. If it’s only a performance argument, we’ll need to be convinced that it really is a performance bottleneck in realistic systems. Gernot
On 2020-01-21 20:41, Heiser, Gernot (Data61, Kensington NSW) wrote:
On 21 Jan 2020, at 15:31, Demi M. Obenour <demiobenour@gmail.com> wrote:
It is indeed straightforward to implement this in user-space. However, I have not found a way to do so that is both clean and performant.
Sorry for being slow in replying.
Not a problem!
If anything can be done at user level it should, this is a fundamental aspect of seL4, and the core enabler of both its high performance as well as verifiability. Any diversion from that principle needs a very convincing case to be made (with a realistic and clearly important use case).
Please note that our promise to the community is that any changes we make to the kernel will be verified, and they cannot go into mainline until the official verified platforms have been re-verified. This means that the more invasive the proposed change, the more costly, and the stronger the justification needed.
At the moment, large dynamic systems are (unfortunately) not a high priority for seL4. I hope this will change in the foreseeable future, but right now we’ve got too many things that are higher on the list of priorities.
Also, I’m not convinced that even if we agreed that we wanted such functionality, your proposal would be the right way to do it. As I explained, Notifications already provide the same kind of functionality you are after, but the limited width of badges restricts it to a moderate number of notifiers. (At present that limitation is essentially given by the word size of 32-bit systems, that should definitely be changed on 64-bit systems, but even then it’ll be around 64.)
I strongly suspect the more fundamental and general extension would be to remove the scalability limitation of Notifications, probably by generalising them to a hierarchy, i.e. you signal a sub-notification which then signals the higher-level notification. I’m not claiming having thought this through, though.
But everything needs a really convincing argument, that’s based on a functional limitation that prevents implementation of certain important classes of systems. If it’s only a performance argument, we’ll need to be convinced that it really is a performance bottleneck in realistic systems.
Gernot
_______________________________________________ Devel mailing list Devel@sel4.systems https://sel4.systems/lists/listinfo/devel
On Tue, Jan 21, 2020, 8:42 PM Heiser, Gernot (Data61, Kensington NSW) <Gernot.Heiser@data61.csiro.au> wrote:
On 21 Jan 2020, at 15:31, Demi M. Obenour <demiobenour@gmail.com> wrote:
It is indeed straightforward to implement this in user-space. However, I have not found a way to do so that is both clean and performant.
Sorry for being slow in replying.
Not a problem. If anything can be done at user level it should, this is a fundamental
aspect of seL4, and the core enabler of both its high performance as well as verifiability. Any diversion from that principle needs a very convincing case to be made (with a realistic and clearly important use case).
Agreed. I consider scalable communication between userspace applications to be such a use case, but I understand if you do not.
Please note that our promise to the community is that any changes we make to the kernel will be verified, and they cannot go into mainline until the official verified platforms have been re-verified. This means that the more invasive the proposed change, the more costly, and the stronger the justification needed.
I 100% agree. At the moment, large dynamic systems are (unfortunately) not a high
priority for seL4. I hope this will change in the foreseeable future, but right now we’ve got too many things that are higher on the list of priorities.
Me too. I was attracted to seL4 mostly because of the hope that it could replace Xen in QubesOS. QubesOS certainly qualifies as a large dynamic system!
Also, I’m not convinced that even if we agreed that we wanted such functionality, your proposal would be the right way to do it.
Me neither! If it was, I would be highly surprised. The reason I proposed it is that it is the best one I could think of, and because it does not require userspace to keep side tables.
As I explained, Notifications already provide the same kind of functionality you are after, but the limited width of badges restricts it to a moderate number of notifiers. (At present that limitation is essentially given by the word size of 32-bit systems, that should definitely be changed on 64-bit systems, but even then it’ll be around 64.)
The scalability problem is what worried me. It also seems to be a problem for others, who (unlike me) are actually building stuff based on seL4.
I strongly suspect the more fundamental and general extension would be to remove the scalability limitation of Notifications, probably by generalising them to a hierarchy, i.e. you signal a sub-notification which then signals the higher-level notification. I’m not claiming having thought this through, though.
That approach seems good. It does require side tables, but it is probably significantly less intrusive. Being able to read multiple levels of notifications in a single system call would be especially helpful.
But everything needs a really convincing argument, that’s based on a functional limitation that prevents implementation of certain important classes of systems. If it’s only a performance argument, we’ll need to be convinced that it really is a performance bottleneck in realistic systems.
Do you believe it is likely to be? Highly dynamic systems generally avoid arbitrary limits (not imposed by the physical hardware) whenever possible, since someone *will* eventually hit them. That said, one could make a decent argument that a typical userspace client process will not have over 20 processes that will be sending data to it at high speed. Most userspace processes don’t simultaneously interact with 20 different subsystems of a monolithic kernel, after all. I would expect a userspace process to need to receive notifications from network, graphics, input, storage, Bluetooth, and USB subsystems. Everything else will likely be low-speed control messages, which can be handled by a userspace bus. Implementing D-Bus on top of seL4 IPC would be a nice project.
Gernot
Demi
On 1/22/20, Demi M. Obenour <demiobenour@gmail.com> wrote:
Me too. I was attracted to seL4 mostly because of the hope that it could replace Xen in QubesOS. QubesOS certainly qualifies as a large dynamic system!
I think a better approach to a secure microkernel-based Linux-compatible OS is to make a Linux-compatible microkernel-native OS rather than trying to virtualize Linux, since it's easier to run everything with least privilege if your protection domains are processes rather than VMs. You don't even have to give up Linux's hardware support since the LKL project <https://github.com/lkl/linux> should allow the use of Linux drivers within user-mode servers reasonably easily (it does mean giving up much of the vertical modularity seen in many microkernel OSes, but in general I think that vertical modularity offers relatively little benefit for most OS subsystems and isn't worth the overhead).
I strongly suspect the more fundamental and general extension would be to remove the scalability limitation of Notifications, probably by generalising them to a hierarchy, i.e. you signal a sub-notification which then signals the higher-level notification. I’m not claiming having thought this through, though.
That approach seems good. It does require side tables, but it is probably significantly less intrusive. Being able to read multiple levels of notifications in a single system call would be especially helpful.
Yeah, I'd agree that seems like a good way to handle multiple notifications as well.
Highly dynamic systems generally avoid arbitrary limits (not imposed by the physical hardware) whenever possible, since someone *will* eventually hit them. That said, one could make a decent argument that a typical userspace client process will not have over 20 processes that will be sending data to it at high speed. Most userspace processes don’t simultaneously interact with 20 different subsystems of a monolithic kernel, after all. I would expect a userspace process to need to receive notifications from network, graphics, input, storage, Bluetooth, and USB subsystems. Everything else will likely be low-speed control messages, which can be handled by a userspace bus. Implementing D-Bus on top of seL4 IPC would be a nice project.
Personally I'd say a better approach is to just go with a pure file-oriented architecture in which byte and sequential packet streams accessed through a filesystem namespace are the only form of IPC exposed to user programs, and then implement things like D-Bus and other RPC on top of that.
On 2020-01-22 23:35, Andrew Warkentin wrote:
On 1/22/20, Demi M. Obenour <demiobenour@gmail.com> wrote:
Me too. I was attracted to seL4 mostly because of the hope that it could replace Xen in QubesOS. QubesOS certainly qualifies as a large dynamic system!
I think a better approach to a secure microkernel-based Linux-compatible OS is to make a Linux-compatible microkernel-native OS rather than trying to virtualize Linux, since it's easier to run everything with least privilege if your protection domains are processes rather than VMs. You don't even have to give up Linux's hardware support since the LKL project <https://github.com/lkl/linux> should allow the use of Linux drivers within user-mode servers reasonably easily (it does mean giving up much of the vertical modularity seen in many microkernel OSes, but in general I think that vertical modularity offers relatively little benefit for most OS subsystems and isn't worth the overhead).
I strongly suspect the more fundamental and general extension would be to remove the scalability limitation of Notifications, probably by generalising them to a hierarchy, i.e. you signal a sub-notification which then signals the higher-level notification. I’m not claiming having thought this through, though.
That approach seems good. It does require side tables, but it is probably significantly less intrusive. Being able to read multiple levels of notifications in a single system call would be especially helpful.
Yeah, I'd agree that seems like a good way to handle multiple notifications as well.
Highly dynamic systems generally avoid arbitrary limits (not imposed by the physical hardware) whenever possible, since someone *will* eventually hit them. That said, one could make a decent argument that a typical userspace client process will not have over 20 processes that will be sending data to it at high speed. Most userspace processes don’t simultaneously interact with 20 different subsystems of a monolithic kernel, after all. I would expect a userspace process to need to receive notifications from network, graphics, input, storage, Bluetooth, and USB subsystems. Everything else will likely be low-speed control messages, which can be handled by a userspace bus. Implementing D-Bus on top of seL4 IPC would be a nice project.
Personally I'd say a better approach is to just go with a pure file-oriented architecture in which byte and sequential packet streams accessed through a filesystem namespace are the only form of IPC exposed to user programs, and then implement things like D-Bus and other RPC on top of that.
I think that is a good idea in some cases, but I believe that it will impose unnecessary overhead in others. In the case of graphics, for example, marshaling would be significant overhead. I believe that having the graphics driver be an seL4 native server would be a better approach. The same goes for virtualization, networking, and storage. Graphics is interesting because a GPU has its own MMU, which can provide strong, hardware-enforced isolation. This allows for secure graphics drivers to be implemented in a surprisingly small amount of code, provided that legacy device support is dropped. The Genode project was able to implement full support for OpenGL this way. For networking, it would be nice if performance could be similar to that achieved by DPDK/Snabb. That isn’t compatible with the Berkeley sockets API. Message buses, such as D-Bus, can likely be made much faster if the server can copy memory from one client’s address space to another. That’s why Android has Binder, for example, and why kdbus was almost a thing on Linux. Sincerely, Demi
On 1/23/20, Demi M. Obenour <demiobenour@gmail.com> wrote:
I think that is a good idea in some cases, but I believe that it will impose unnecessary overhead in others. In the case of graphics, for example, marshaling would be significant overhead. I believe that having the graphics driver be an seL4 native server would be a better approach. The same goes for virtualization, networking, and storage.
UX/RT's IPC transport layer will be a thin layer on top of seL4 IPC and will basically be just enough to make it look Unix-like. No intermediary server will be required for read()/write()-like calls except on directories. There will be read()/write()-like calls that expose message registers and shared buffers pretty much directly (a couple of registers will be reserved for message size and type). There will also be a new "message special" file type that always returns the same amount of data originally written when read. All of this should have pretty much no overhead, unless a message is read with a different API than was used to write it (in which case messages will be automatically copied between message registers and the shared buffer), but that will be easy enough to avoid. There should be no need to make RPC a primitive (which would add overhead of its own for transfers of bulk unstructured data) or expose the raw kernel APIs (which would create compatibility issues because seL4's API doesn't seem to be stable).
participants (3)
-
Andrew Warkentin
-
Demi M. Obenour
-
Heiser, Gernot (Data61, Kensington NSW)