Hi all,

during the proceedings of the last seL4 summit I participated in a discussion about Microkit usability. In particular we were talking about common patterns for IPC in Microkit, potential pitfalls, and how the Microkit Tutorial introduces some of them implicitly (by requiring the user to apply them). At some point Kent suggested putting the topic on the mailing list, so here we go 😊.

When initially playing through the Microkit Tutorial, I assumed that the budget/period of a protection domain provided uninterrupted time slots. I was thus slightly surprised when I found out that sending a notification to a higher-priority PD would immediately preempt my current PD. Now, this on its own is not problematic; Gernot simply commented on my mistake with "Yes, priorities are observed". However:

When playing the Microkit tutorial, I understood the way to solve Part 2 (Serial Server to Client Communication) to be as follows:

* [serial server] write received char to shared buffer.
* [serial server] notify client.
* [client] read char from shared buffer.

This is also the current approach found in the provided solution [1]. This works in some (most!) cases, but it is not reliable. Chars can get lost, due to the serial server overwriting the buffer before it was consumed by the client. Whether this works depends upon the combination of:

* Data size (not so critical with one byte, but more data means more time is required for the client to consume the buffer)
* Budget/period of the client (can the client read all data before it is preempted by exhaustion of its budget in the current period?)
* Hardware performance (same as above)
* Priorities (if the serial server has a higher priority and gets triggered by another hardware event before the client can consume the char, it gets lost again)

This is problematic: the details of the scheduling are not very explicit in the manual (roughly half a page), but the first example of IPC taught via the tutorial relies on a lot of implicit assumptions about how PDs are scheduled in order to function correctly. Now I have three suggestions to improve upon this.

1. In the tutorial, explicitly mention the possibility of using the same notification channel in the opposite direction for a read-ACK (see the sketch below). If another serial interrupt comes but no ACK was received, just log an error. The point is not to introduce ring buffers and the machinery for robust communication -- that would likely explode the scope for a simple tutorial. Rather, the point is to teach people that this IPC mechanism may silently fail (as in: chars get dropped).

2. For certain scheduling + IPC scenarios, examples with a graphical representation would be good. Here are two examples: sending data to a higher-priority PD (via notifications + shared memory) [2] and sending data to a lower-priority PD (via notification + shared memory + ACK notification) [3]. Please bear with me, my talent for intuitive graphical representations is limited; these could surely be made better looking. In the end, I'd like an engineer short on time to be able to quickly look up the current scenario (sharing data bi-/unidirectionally? other PD strictly higher/lower prio? silent loss of data tolerable or not? ...) in a table, take a look at the timing diagram, and pick the correct approach for the task at hand.
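To illustrate suggestion 1, here is a minimal sketch of the read-ACK idea. The channel ids, the symbol names and the uart_get_char() helper are assumptions for illustration, not the tutorial's actual code, and error handling is elided:

    /* serial server PD */
    #include <stdbool.h>
    #include <stdint.h>
    #include <microkit.h>

    #define UART_IRQ_CH 0              /* assumed IRQ channel id */
    #define CLIENT_CH   1              /* assumed channel id to/from client */

    uintptr_t shared_buf;              /* patched via the .system file */
    static bool awaiting_ack = false;

    extern char uart_get_char(void);   /* hypothetical driver helper */

    void notified(microkit_channel ch)
    {
        if (ch == CLIENT_CH) {         /* read-ACK: buffer may be reused */
            awaiting_ack = false;
            return;
        }
        /* otherwise: UART interrupt */
        char c = uart_get_char();
        if (awaiting_ack) {
            /* previous char not consumed yet: the loss is now observable */
            microkit_dbg_puts("serial server: dropped a character!\n");
        } else {
            *((volatile char *) shared_buf) = c;
            awaiting_ack = true;
            microkit_notify(CLIENT_CH);
        }
        microkit_irq_ack(ch);
    }

    /* client PD (a separate binary) */
    #define SERVER_CH 1                /* assumed channel id back to server */

    void notified(microkit_channel ch)
    {
        char c = *((volatile char *) shared_buf);
        /* ... consume c ... */
        (void) c;
        microkit_notify(SERVER_CH);    /* the read-ACK */
    }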
Microkit does not provide very sophisticated IPC APIs, and that is good! It keeps Microkit small, and the API IMHO provides everything needed, for synchronous and asynchronous communication, for large and small data, ... But composing the IPC correctly is not trivial, so anything we can do to take sharp edges off the experience without bloating the API should be considered.

3. I think there is a finite number of relevant IPC scenarios. Maybe one can build a tool to verify them (something something Promela?) and display them graphically (something something PlantUML Timing Diagrams?). This could either be used to fill a chapter in the Microkit Manual ("Common IPC Patterns, their Behavior and Pitfalls"?), or be exposed as a small (Web?) GUI tool. Maybe this could even become part of a student thesis?

I hope the above might spark some thought on improving accessibility of Microkit.

Kind Regards,
Wanja

[PlantUML timing diagram; see [2]]
[PlantUML timing diagram; see [3]]

[1] https://github.com/au-ts/microkit_tutorial/blob/6733b7864217831e4e77616647eb...
[2] https://www.plantuml.com/plantuml/uml/VL9DYzim4BthLppIGoxP49kwWHvADlHOIYc5FI...
[3] https://www.plantuml.com/plantuml/uml/XLDDYzim4BthLpnosKisnA4Ecq9BjlHOIkbbpr...

——————————————————————————
Deutsches Zentrum für Luft- und Raumfahrt (DLR)
Institut für Flugsystemtechnik | Sichere Systeme & Systems Engineering | Lilienthalplatz 7 | 38108 Braunschweig
Wanja Zaeske
Telefon 0531 295-1114 | Fax 0531 295-2931 | wanja.zaeske@dlr.de
DLR.de
On 26 Nov 2024, at 03:18, Wanja.Zaeske--- via Devel <devel@sel4.systems> wrote:
Hi all,
during the proceedings of the last seL4 summit I participated in a discussion about Microkit usability. In particular we were talking about common patterns for IPC in Microkit, potential pitfalls, and how the Microkit Tutorial introduces some of them implicitly (by requiring the user to apply them). At some point Kent suggested putting the topic on the mailing list, so here we go 😊.
Hi Wanja,

Constructive suggestions on improving the documentation are always welcome ;-)

I guess the Microkit tutorial needs to strike a balance between being highly approachable (which is its prime objective) and guiding people in the right direction with complete information. That's not trivial, and may require auxiliary documentation for the more advanced user.

Also, at this time the Microkit documentation clearly doesn't attempt to be independent of the kernel documentation. (This *may* be a worthwhile goal eventually for those who really just require the much simpler Microkit abstractions, but you want to avoid repeating too much, as this creates consistency problems. E.g. I don't think the Microkit docs should aim to replace the sDDF docs.)
When initially playing through the Microkit Tutorial, I assumed that the budget/period of a protection domain provided uninterrupted time slots. I was thus slightly surprised when I found out that sending a notification to a higher-priority PD would immediately preempt my current PD. Now, this on its own is not problematic; Gernot simply commented on my mistake with "Yes, priorities are observed". However:
It’s not only not problematic, it’s essential ;-)
When playing the Microkit tutorial, I understood the way to solve Part 2 (Serial Server to Client Communication) to be as follows:
* [serial server] write received char to shared buffer.
* [serial server] notify client.
* [client] read char from shared buffer.
This is also the current approach found in the provided solution [1]. This works in some (most!) cases, but it is not reliable. Chars can get lost, due to the serial server overwriting the buffer before it was consumed by the client.
Keep in mind that this only mirrors what happens in hardware: If software doesn't consume input from the serial port quickly enough, you'll lose characters.

The other key thing to keep in mind (and maybe this needs to be emphasised more in the Microkit documentation) is that the core idea behind the seL4 device driver framework (sDDF) is that the driver has one and only one job: translating a hardware-specific *device* interface to a hardware-independent *device class* interface. Meaning an sDDF driver *only* provides a minimal hardware abstraction. This is a great enabler of simplicity of sDDF drivers, making them a fraction of the size of Linux drivers, and easy to write.

Importantly, sDDF drivers are, as far as possible, policy free. A direct implication of this is that buffering isn't the driver's job, it's policy that should be done in another component (separation of concerns). Meaning the driver just reflects the fact that input characters get lost if you don't consume them quickly enough. And loss can be prevented by appropriate assignment of priorities.

The serial driver section in the sDDF document is right now mostly AWOL (as are a number of others; I've got a bunch of device class specs sitting on my desk waiting to be reviewed…). There is discussion of priorities in the general part of the sDDF doc: §4.3.1 specifies that the driver should be higher prio than the virtualiser, and §4.3.2 similarly says that the virt should be higher prio than the client. But there really needs to be more info on serial drivers.

I repeat, more/better documentation is definitely a good idea (incl diagrams!). But it's also important that this documentation is driven by (and reflective of) the principles behind the design, rather than being a collection of recipes. We need to help the developer to get the right mental model of the system. Outsiders like you can certainly help this progress. (We do the same with our students, but their "novice" status tends to vanish rather quickly ;-)

Gernot
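PS: To make the "buffering is policy in another component" point concrete, here is a rough sketch of the kind of single-producer/single-consumer ring such a component could expose over shared memory. This is illustrative only, not sDDF's actual queue layout; names and sizes are made up, and a real implementation needs memory barriers between writing the data and publishing the index on weakly ordered CPUs:

    #include <stdint.h>

    #define RING_SIZE 512              /* power of two, so indices wrap with & */

    struct char_ring {
        volatile uint32_t head;        /* written only by the producer */
        volatile uint32_t tail;        /* written only by the consumer */
        volatile char data[RING_SIZE];
    };

    /* producer side (the buffering component, fed by the driver) */
    static int ring_put(struct char_ring *r, char c)
    {
        if (r->head - r->tail == RING_SIZE) {
            return -1;                 /* full: loss is explicit, not silent */
        }
        r->data[r->head & (RING_SIZE - 1)] = c;
        r->head++;                     /* publish only after the data is written */
        return 0;
    }

    /* consumer side (the client) */
    static int ring_get(struct char_ring *r, char *c)
    {
        if (r->tail == r->head) {
            return -1;                 /* empty */
        }
        *c = r->data[r->tail & (RING_SIZE - 1)];
        r->tail++;
        return 0;
    }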
In general I will say that the Microkit manual and tutorial could be improved to give people a better mental model around how scheduling works on Microkit and the MCS version of seL4.
I think there is a finite number of relevant IPC scenarios. Maybe one can build a tool to verify them (something something Promela?) and display them graphically (something something PlantUML Timing Diagrams?). This could either be used to fill a chapter in the Microkit Manual ("Common IPC Patterns, their Behavior and Pitfalls"?), or be exposed as a small (Web?) GUI tool. Maybe this could even become part of a student thesis?
The main hurdle I've found with reasoning about how things are going to get scheduled in seL4 (although this problem is probably not unique to seL4) is that the answer is often 'it depends on the rest of the system'. So, having a bunch of properties/rules that people can refer to quickly might not actually be realistic, or it may actually be misleading.

For example, the 2 UML diagrams that you linked look fairly different depending on whether you're on single-core or multi-core.

I think the first step would be to actually understand all the properties of the scheduler. For example, the property mentioned about signalling to a higher-priority PD is not documented in the seL4 kernel manual (from my brief reading of the scheduling section; I might be wrong). This makes me think about what other properties are missing from the kernel manual.
The other key thing to keep in mind (and maybe this needs to be emphasised more in the Microkit documentation) is that the core idea behind the seL4 device driver framework (sDDF) is that the driver has one and only one job: translating a hardware-specific *device* interface to a hardware-independent *device class* interface. Meaning an sDDF driver *only* provides a minimal hardware abstraction. This is a great enabler of simplicity of sDDF drivers, making them a fraction of the size of Linux drivers, and easy to write.
The Microkit tutorial implements a very basic serial driver; it is not based on sDDF.
I hope the above might spark some thought on improving accessibility of Microkit.
I've opened an issue for this [1], but I'm not sure when I'll get around to it.

[1]: https://github.com/au-ts/microkit_tutorial/issues/18

Ivan
Not documented in the seL4 manual
It is, albeit briefly :)

"Of the threads eligible for scheduling, the highest priority thread in a runnable state is chosen."

It would be appropriate for the tutorial to note the expected change in the scheduler state, as this can be unexpected when one is first working with seL4 or a similar priority-based system (including SCHED_FIFO and friends in POSIX).

Cheers,
Anna

On Wed, Nov 27, 2024 at 10:08 AM Ivan Velickovic via Devel <devel@sel4.systems> wrote:
In general I will say that the Microkit manual and tutorial could be improved to give people a better mental model around how scheduling works on Microkit and the MCS version of seL4.
I think there is a finite number of relevant IPC scenarios. Maybe one can build a tool to verify them (something something Promela?) and display them graphically (something something PlantUML Timing Diagrams?). This could either be used to fill a chapter in the Microkit Manual ("Common IPC Patterns, their Behavior and Pitfalls"?), or be exposed as a small (Web?) GUI tool. Maybe this could even become part of a student thesis?
The main hurdle I've found with reasoning about how things are going to get scheduled in seL4 (although this problem is probably not unique to seL4) is that the answer is often 'it depends on the rest of the system'. So, having a bunch of properties/rules that people can refer to quickly might not actually be realistic, or it may actually be misleading.
For example, the 2 UML diagrams that you linked look fairly different depending on whether you’re on single-core or multi-core.
I think the first step would be to actually understand all the properties of the scheduler. For example, the property mentioned about signalling to a higher-priority PD is not documented in the seL4 kernel manual (from my brief reading of the scheduling section; I might be wrong). This makes me think about what other properties are missing from the kernel manual.
The other key thing to keep in mind (and maybe this needs to be emphasised more in the Microkit documentation) is that the core idea behind the seL4 device driver framework (sDDF) is that the driver has one and only one job: translating a hardware-specific *device* interface to a hardware-independent *device class* interface. Meaning an sDDF driver *only* provides a minimal hardware abstraction. This is a great enabler of simplicity of sDDF drivers, making them a fraction of the size of Linux drivers, and easy to write.
The Microkit tutorial implements a very basic serial driver; it is not based on sDDF.
I hope the above might spark some thought on improving accessibility of Microkit.
I've opened an issue for this [1], but I'm not sure when I'll get around to it.
[1]: https://github.com/au-ts/microkit_tutorial/issues/18
Ivan
It is, albeit briefly :) "Of the threads eligible for scheduling, the highest priority thread in a runnable state is chosen."
Ah okay, thanks for pointing that out. I guess I was looking for something more explicit, but it makes sense that the kernel manual would be talking about things more generally rather than about specific things like 'what happens if you signal'.

Would you say all the properties of the scheduler are documented in the manual currently? If so, I'll start by reading it closely and seeing what applies to Microkit systems.

Ivan
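As a concrete instance of the signalling property discussed here (a sketch; the channel id is an assumption): on a single core, signalling a higher-priority PD preempts the caller immediately, so the statement after the notify only runs once the higher-priority PD blocks again.

    #include <microkit.h>

    #define HIGH_PRIO_CH 2             /* assumed channel to a higher-prio PD */

    /* running in a lower-priority PD */
    void notified(microkit_channel ch)
    {
        microkit_notify(HIGH_PRIO_CH); /* higher-prio PD becomes runnable and,
                                        * on a single core, runs immediately */
        /* this line executes only after the higher-prio PD blocks again */
        microkit_dbg_puts("low-prio PD: resumed\n");
    }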
For example, the 2 UML diagrams that you linked look fairly different depending on whether you’re on single-core or multi-core.
Yes, and a diagram can easily be used to highlight a) the critical chain of events for a successful communication, b) which links in this chain rely on scheduling details, and c) which links rely on sequential scheduling. And it is this very distinction that I'd wish to have more (and more visual!) documentation on: whether a successful communication can take place (for example in the face of single-core/multi-core), which (implicit) assumptions on the scheduling have to be met for it to succeed, and (optionally) what step would make a failed communication observable. I know that this can be derived by hand by any engineer who understands the principles of Microkit + seL4, but that is an error-prone process.
Hi Gernot,

thank you for the prompt and elaborate answer. A few quick comments.

I fully agree with your stance on being approachable/guiding through the Microkit Tutorial. That's why I suggested not to make solving the issue (e.g. by buffering) part of the tutorial, but to mention that there might be a lurking issue in the proposed communication for some cases. Awareness is key; everything else explodes the scope of the tutorial IMO.

I do think one should be able to build and understand Microkit applications by consuming the Microkit Manual and the relevant sections of the sDDF docs. While it is acceptable to have higher-complexity topics only in the seL4 docs, this risks the promise of Microkit: if one in the end needs to consume the seL4 documentation to know about even basic behavior of one's application built on Microkit, this greatly reduces the value proposition of Microkit. Striking the right balance here is hard. In particular, I argue the Microkit Manual should be a bit more detailed on how things are scheduled, because this very information is highly relevant for building correct & robust systems based on Microkit. Scheduling assumptions _are_ foundational to many common IPC patterns in Microkit, more so than in other OSes.
It’s not only not problematic, it’s essential ;-)
While for the seL4 design having priority preemption may be essential, I see this as an OS design tradeoff. In particular, looking through the lens of safety certification, non-preemptive yet priority-driven scheduling certainly has some appeal compared to priority-preemptive scheduling. Don't get me wrong, I just want to point out that priority preemption is neither the only nor necessarily the best design, especially for safety (where having no preemption greatly simplifies reasoning, while downsides like reduced performance and increased latency may be completely acceptable). I won't try to coerce anyone into building a non-preemptive seL4, so I'll stop the topic here so as not to derail this conversation :D.
Keep in mind that this only mirrors what happens in hardware: If software doesn't consume input from the serial port quickly enough, you’ll lose characters.
Yes, absolutely! The problem is not losing characters, the problem is _silently_ losing characters. Failure is omnipresent, and for safety-critical systems a huge chunk of effort goes into reasoning about failure paths and system degradation. Silent failure is however terrible: you cannot degrade gracefully if you are not aware of any degradation. Also, many serial interfaces that I'm aware of do buffer in hardware, so technically the software abstraction does less than the HW, instead of matching the HW's behavior. Nitpick, I know.
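One cheap way to make such loss observable without introducing buffering is a sequence counter in the shared region. A rough sketch, with all names made up for illustration (a real implementation would need a seqlock-style double read of seq to rule out tearing between reading seq and c):

    #include <stdint.h>
    #include <microkit.h>

    struct seq_buf {
        volatile uint32_t seq;         /* bumped by the server on every write */
        volatile char c;
    };

    uintptr_t shared_buf;              /* mapped in both PDs */
    static uint32_t last_seen = 0;     /* client-local; server starts seq at 1 */

    /* client PD */
    void notified(microkit_channel ch)
    {
        struct seq_buf *b = (struct seq_buf *) shared_buf;
        uint32_t seq = b->seq;
        if (seq != last_seen + 1) {
            /* the server wrote more often than we got to run: chars were
             * lost, but now we know and can degrade gracefully / log it */
            microkit_dbg_puts("client: input overrun detected\n");
        }
        last_seen = seq;
        char c = b->c;
        /* ... consume c ... */
        (void) c;
    }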
The other key thing to keep in mind (and maybe this needs to be emphasised more in the Microkit documentation) is that the core idea behind the seL4 device driver framework (sDDF) is that the driver has one and only one job: translating a hardware-specific *device* interface to a hardware-independent *device class* interface. Meaning an sDDF driver *only* provides a minimal hardware abstraction. This is a great enabler of simplicity of sDDF drivers, making them a fraction of the size of Linux drivers, and easy to write.
Importantly, sDDF drivers are, as far as possible, policy free.
A direct implication of this is that buffering isn’t the driver’s job, it’s policy that should be done in another component (separation of concerns). Meaning the driver just reflects the fact that input characters get lost if you don’t consume them quickly enough. And loss can be prevented by appropriate assignment of priorities.
The serial driver section in the sDDF document is right now mostly AWOL (as are a number of others; I've got a bunch of device class specs sitting on my desk waiting to be reviewed…)
There is discussion of priorities in the general part of the doc: §4.3.1 specifies that the driver should be higher prio than the virtualiser, and §4.3.2 similarly says that the virt should be higher prio than the client. But there really needs to be more info on serial drivers.
I agree, mostly. I think I catch the drift, and for the specific examples discussed, I concur with your argument. Two things are worth noting:
And loss can be prevented by appropriate assignment of priorities.
This statement is wrong. One can create a system that implements the priorities as per the recommendations in the sDDF yet fails to consume its input fully, e.g. by setting the client PD's budget smaller than required to consume one chunk of input. Priorities (as you pointed out), but also scheduling parameters (budget/period of partitions), external factors (rate at which data is coming in), and internal factors (system load, i.e. contention on the memory bus) all come into play in whether input is fully consumed or not. For the serial port this hopefully is not a concern on any HW from this century, but there are of course other, higher-bandwidth comms that are desirable. It is also this kind of "What if?" and "Which PD can adversely affect which other PD, and how?" that is relevant for building safety-critical systems.

Secondly, being completely policy-free is a double-edged sword. In general, I support the effort. However, if to use a more sophisticated piece of hardware I would need to make dozens of non-trivial design decisions (as the upstream implementation is completely un-opinionated/policy-free), this will hurt real-world adoption (especially if there is an alternative where "the vanilla way of doing things" just works, without any decision paralysis). I think being 100% policy-free can lead to a SW landscape that is universally capable, but for which there exists no ecosystem of reusable components, because each and every component is very individually designed for its specific use case.

I think Microkit is much more opinionated than seL4, and I think that is what makes it more useful to me. Being opinionated (albeit resulting in technical compromise more easily) is the way to grow an ecosystem of reusable components, and that is precisely the currency with which one can foster greater adoption. Takeaway: policy-free (and simple to operate!) drivers are great, but from time to time one should not defer _all_ decisions to the user, but set an opinionated precedent for _some_ of them. Also, with seL4 you already offer a product heavily leaning towards "no technical compromise, but all the complexity and tightest possible coupling".
I repeat, more/better documentation is definitely a good idea (incl diagrams!).
Yes!
But it’s also important that this documentation is driven by (and reflective of) the principles behind the design, rather than being a collection of recipes.
Is that really a contradiction? I think from an engineering perspective, recipes are very useful. They help many independent engineers grow a more or less coherent body of work that, even if not intended in the first place, actually composes well together (software component reuse again...). It goes without saying that having the engineers understand the design principles behind their recipes is crucial too.

Thank you for the exchange! I certainly learned a bit more about your/the seL4 design philosophy, even if I don't unconditionally agree with every last detail.

Kind Regards,
Wanja

——————————————————————————
Deutsches Zentrum für Luft- und Raumfahrt (DLR)
Institut für Flugsystemtechnik | Sichere Systeme & Systems Engineering | Lilienthalplatz 7 | 38108 Braunschweig
Wanja Zaeske
Telefon 0531 295-1114 | Fax 0531 295-2931 | wanja.zaeske@dlr.de
DLR.de
Hello,

I just wanted to say that I agree with a lot of things that Wanja says in this mail and that I think he makes some very good points about common infrastructure requiring policy decisions.

On 2024-11-28 09:26, Wanja.Zaeske--- via Devel wrote:
While for the seL4 design having priority preemption may be essential, I see this as an OS design tradeoff. In particular, looking through the lens of safety certification, non-preemptive yet priority-driven scheduling certainly has some appeal compared to priority-preemptive scheduling. Don't get me wrong, I just want to point out that priority preemption is neither the only nor necessarily the best design, especially for safety (where having no preemption greatly simplifies reasoning, while downsides like reduced performance and increased latency may be completely acceptable). I won't try to coerce anyone into building a non-preemptive seL4, so I'll stop the topic here so as not to derail this conversation :D.
In seL4 that is a matter of configuration: just use a period equal to your watchdog timeout value and you will practically have non-preemptive scheduling. And instead of a global watchdog, with seL4 you can use timeout faults for more fine-grained control. Or give all tasks a different priority.
One can create a system that implements the priorities as per the recommendations in the sDDF yet fails to consume its input fully, e.g. by setting the client PD's budget smaller than required to consume one chunk of input. Priorities (as you pointed out), but also scheduling parameters (budget/period of partitions), external factors (rate at which data is coming in), and internal factors (system load, i.e. contention on the memory bus) all come into play in whether input is fully consumed or not. For the serial port this hopefully is not a concern on any HW from this century, but there are of course other, higher-bandwidth comms that are desirable. It is also this kind of "What if?" and "Which PD can adversely affect which other PD, and how?" that is relevant for building safety-critical systems.
Yes, system design is hard, and you need to know a lot of scheduling and other seL4 details to make safety-critical systems. But documenting all intricacies of flow control, fault detection, buffering considerations etc. is the task of neither the Microkit documentation nor the seL4 manual. sDDF should document at least some of it, though.

It would be nice if there was more general awareness of safety-critical system requirements among the people making infrastructure like Microkit and sDDF; basic things like fault recovery are currently missing. Especially in a microkernel system, faults should not propagate through the system to other components, so all infrastructure should be designed with fault containment and recovery in mind.
Secondly, being completely policy-free is a double-edged sword. In general, I support the effort. However, if to use a more sophisticated piece of hardware I would need to make dozens of non-trivial design decisions (as the upstream implementation is completely un-opinionated/policy-free), this will hurt real-world adoption (especially if there is an alternative where "the vanilla way of doing things" just works, without any decision paralysis). I think being 100% policy-free can lead to a SW landscape that is universally capable, but for which there exists no ecosystem of reusable components, because each and every component is very individually designed for its specific use case. I think Microkit is much more opinionated than seL4, and I think that is what makes it more useful to me. Being opinionated (albeit resulting in technical compromise more easily) is the way to grow an ecosystem of reusable components, and that is precisely the currency with which one can foster greater adoption. Takeaway: policy-free (and simple to operate!) drivers are great, but from time to time one should not defer _all_ decisions to the user, but set an opinionated precedent for _some_ of them. Also, with seL4 you already offer a product heavily leaning towards "no technical compromise, but all the complexity and tightest possible coupling".
Strongly agreed with this. One big problem is that a lot of infrastructure that is not built with a certain use case in mind quickly becomes useless, especially with strong safety or security requirements.

Except for CPU usage and priorities, seL4 has all the required mechanisms to create modular, stand-alone reusable components. However, to let them work together and foster re-use, a common system API/ABI is needed, and a way to tie everything together. If the infrastructure is good, people only need to replace some modules if they don't match requirements, instead of writing all infrastructure themselves.

I think it's necessary to split this into a system API and a system ABI: an API makes it possible to avoid making system design decisions too early. E.g. I'm not a fan of sDDF requiring multiple pointless PD transitions: it's fine to have separation of concerns, but if both components have the same security and safety impact, then running them in different PDs is just pointless overhead. Same for time servers: whether getting the time needs an IPC call or just a CPU instruction, with possibly getting extra info from shared memory, should not be decided beforehand. But having well-defined common infrastructure is crucial.

sDDF is a good building block for communication, not only for drivers, but for system services too. But it does not solve the API/ABI problem: that needs to be defined per driver framework and system service. The challenge is to create an API that is simple for simple systems, but has the flexibility to be expanded with more complex functionality without breaking all existing users. This is especially true for networking and file systems.

Microkit should expose all untyped memory and other resources it doesn't use itself, so that it's easy to create a more dynamic system on top of Microkit and launch stand-alone subsystems with a well-defined interface. Practically, such a subsystem would look almost the same as a PD to Microkit, but with UT and other additional resources (such a subsystem could be built using Microkit itself). If this is not supported, then using Microkit or not becomes an all-or-nothing choice, reducing adoption. If you know you can always get to the low-level seL4 mechanisms and add missing bits yourself on top of Microkit, the choice to use it becomes much safer.

To some extent Lions OS (1) and SMOS (2) are a step in the right direction, but they have a very long way to go.

Greetings,

Indan

1) https://lionsos.org
2) https://trustworthy.systems/projects/smos