[seL4] Re: On Common Infrastructure

16 Dec 2024

      Hello,

I just wanted to say that I agree with a lot of things that Wanja says
in this mail and that I think he makes some very good points about
common infrastructure requiring policy decisions.

On 2024-11-28 09:26, Wanja.Zaeske--- via Devel wrote:
...
While for the seL4 design having priority preemption may be essential, 
I see this as an OS design tradeoff. In particular looking through the 
glasses of safety certification, non-preemptive yet priority driven 
scheduling certainly has some appeal compared to priority preemptive 
scheduling. Don't get me wrong, I just want to point out that priority 
preemption is neither the only nor necessarily the best design, 
especially for safety (where having no preemption greatly simplifies 
reasoning while downsides like reduced performance and increased 
latency may be completely acceptable). I won't try to coerce anyone 
into building non-preemptive seL4, so I stop the topic here to not 
derail this conversation :D.
In seL4 that is a matter of configuration: Just use a period equal to 
your
watchdog timeout value and you will practically have non-preemtive 
scheduling.
And instead of a global watchdog, with seL4 you can use timeout faults 
for
more fine grained control. Or give all tasks a different priority.
...
One can create a system that implements the priorities as per the 
recommendations in the sDDF yet fails to consume its input fully, i.e. 
through setting the client PDs budget smaller than required to consume 
one chunk of input. Priorities (as you pointed out), but also 
scheduling parameters (budget/period of partitions), external factors 
(rate at which data is coming in), and internal factors (system load, 
i.e. contention on the memory bus) all come to play for whether input 
is fully consumed or not. For the serial-port this hopefully is not a 
concern on any HW from this century, but there is of course other, 
higher bandwidth comms that are desirable. It is also this kind of 
"What if?" and "Which PD can adversely effect which other PD, and how?" 
that are relevant for building safety critical systems.
Yes, system design is hard and you need to know a lot of scheduling and
other seL4 details to make safety critical systems. But documenting all
intricacies of flow control, fault detection, buffering considerations
etc. is neither the task of the Microkit documentation, nor the seL4
manual. sDDF should document at least some of it though.

It would be nice if there was more general awareness of safety critical
system requirements by people making infrastructure like Microkit and 
sDDF,
basic things like fault recovery are currently missing. Especially in a
microkernel system, faults should not propagate through the system to 
other
components, so all infrastructure should be designed with fault 
containment
and recovery in mind.
...
Secondly, being completely policy-free is a double-sided sword. In 
general, I support the effort. However, if to use a more sophisticated 
hardware I would need to make dozens of non-trivial design decisions 
(as the upstream implementation is completely un-opinionated/policy 
free), this will hurt real-world adoption (especially if there is an 
alternative that just has "the vanilla way of doing things" work 
without any decision paralysis). I think being 100% policy-free can 
lead to an SW landscape that is universally capable, but for which 
there exist no ecosystem of reusable components, because each and every 
component is very individually designed for its specific use-case. I 
think Microkit is much more opinionated than seL4, and I think that is 
what makes it more useful to me. Being opinionated (albeit resulting in 
technical compromise more easily) is the way to grow an ecosystem of 
reusable components, and that is precisely the currency with which one 
can foster greater adoption. Takeaway: policy-free (and simple to 
operate!) drivers are great, but from time to time one should not defer 
_all_ decisions to the user, but set an opinionated precedence for 
_some_ of them. Also, with seL4 you already offer a product heavily 
leaning towards "no technical compromise, but all the complexity and 
tightest possible coupling".
Strongly agreed with this.

One big problem is that having a lot of infrastructure that is not build
with a certain use case in mind quickly becomes useless, especially with
strong safety or security requirements.

Except for CPU usage and priorities, seL4 has all the required mechanism
to create modular, stand-alone reusable components. However, to let them
work together and foster re-use, a common system API/ABI is needed and a
way to tie everything together. If the infrastructure is good, people 
only
need to replace some modules if they don't match requirements instead of
writing all infrastructure themselves.

I think it's necessary to split this into a system API and a system ABI:
An API makes it possible to avoid making system design decisions too 
early.

e.g. I'm not a fan of sDDF requiring multiple pointless PD transitions:
It's fine to have separation of concern, but if both components have the
same security and safety impact, then running them in different PDs is
just pointless overhead.

Same for time servers: Whether getting the time needs an IPC call or 
just
a CPU instruction, with possibly getting extra info from shared memory,
should not be decided beforehand.

But having well defined common infrastructure is crucial. sDDF is a good
building block for communication, not only for drivers, but for system
services too. But it does not solve the API/ABI problem: That needs to 
be
defined per driver framework and system service. The challenge is to 
create
an API that is simple for simple systems, but has the flexibility to be
expanded with more complex functionality without breaking all existing
users. This is especially true for networking and file systems.

Microkit should expose all untyped memory and other resources it doesn't
use itself, so that it's easy to create a more dynamic system on top of
Microkit and launch stand-alone subsystems with a well-defined 
interface.
Practically such subsystem would look almost the same as a PD to 
Microkit,
but with UT and other additional resources (such subsystem could be 
build
using Microkit itself). If this is not supported, then using Microkit or
not becomes an all-or-nothing choice, reducing adoption. If you know you
can always get to the low-level seL4 mechanisms and add missing bits
yourself on top of Microkit, the choice to use it becomes much safer.

To some extent Lions OS (1) and SMOS (2) are a step in the right 
direction,
but they have a very long way to go.

Greetings,

Indan

1) https://lionsos.org
2) https://trustworthy.systems/projects/smos

[seL4] Re: On Common Infrastructure

Indan Zupancic