Dear seL4 Community,

Those who attended today’s closing session at the seL4 Summit have heard about this; for the rest, I’m pleased to announce that we have an RFC out for the seL4 Core Platform, a minimal seL4-based OS for embedded/cyberphysical/IoT applications. See https://sel4.atlassian.net/browse/RFC-5

Feedback welcome.

Gernot
On 11/18/20 10:44 PM, Gernot Heiser wrote:
Dear seL4 Community,
Those who attended today’s closing session at the seL4 Summit have heard about this, for the rest I’m pleased to announce that we have an RFC out for the seL4 Core Platform, a minimal seL4-based OS for embedded/cyberphysical/IoT applications, see https://sel4.atlassian.net/browse/RFC-5
Feedback welcome.
Gernot
I hope it is okay for me to respond here, as opposed to on JIRA, but if you prefer the latter, just let me know and I will comment there.

The design document appears to be well-written and gives a good explanation of what the seL4 Core Platform is. That said, there are two ways I believe it could be improved.

The first is an explanation for why the restrictions stated are acceptable. For instance, it is clear that the targeted systems are static. Why is dynamism not required? Is it because these systems are created with a single purpose in mind, and so do not need the level of extensibility that a dynamic system offers?

The second is the ability to forcibly reset a protection domain to its initial state. For instance, a printer might have a PostScript interpreter and a PDF rasterizer. Both of these have been major sources of vulnerabilities in the past, and are of such complexity that reimplementation may not be viable. Instead, the untrusted rendering code can run in its own protection domain. After a print job has been processed, the renderer is forcibly reset, preventing any attack from harming other jobs. This feature can also be used to recover from a crash of an individual component, without having to reboot the entire system.

Sincerely,
Demi
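To make the requested feature concrete, here is a minimal sketch of how a trusted spooler might use such a reset primitive. The pd_reset(), render_job() and spool_one_job() functions and the RENDERER_PD name are hypothetical, invented for this sketch; nothing like them is defined in the RFC.

```c
/*
 * Hypothetical sketch of the requested reset feature: pd_reset(),
 * render_job() and RENDERER_PD are made-up names used only to
 * illustrate the pattern; they are not part of the RFC.
 */
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int pd_id;

#define RENDERER_PD 1                    /* untrusted PostScript/PDF renderer */

extern bool render_job(pd_id pd, const void *job, size_t len); /* hand a job to the renderer */
extern void pd_reset(pd_id pd);          /* hypothetical: restore a PD to its initial state */

/* Trusted spooler: process one job, then wipe the renderer regardless of outcome. */
void spool_one_job(const void *job, size_t len)
{
    bool ok = render_job(RENDERER_PD, job, len);

    /*
     * Whether the renderer finished cleanly, crashed, or was compromised,
     * resetting it ensures no state leaks from this job into the next one.
     */
    pd_reset(RENDERER_PD);

    (void)ok;                            /* report success/failure to the client here */
}
```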
On 20 Nov 2020, at 03:28, Demi M. Obenour
The first is an explanation for why the restrictions stated are acceptable. For instance, it is clear that the targeted systems are static. Why is dynamism not required? Is it because these systems are created with a single purpose in mind, and so do not need the level of extensibility that a dynamic system offers?
As the design document states: It’s meant to be simple but sufficient for the target domains. More generality means more complexity. The proposed model should serve almost all present use cases.

Even Cog Systems’ “ultra-secure phone” fits this model: all the dynamism, such as run-time installable apps, happens inside a virtual machine with an Android guest. From the seL4 PoV, the phone is a simple, static architecture with a fixed number of protected components: the “user VM” containing a full Android system with virtual drivers, the VMM, a console, an I/O multiplexing component, a “driver VM” (which should eventually become individual driver components), a network encryption component and a storage encryption VM. If the Platform supports this use case, it will definitely support most others.

For the (presently very small) class of deployments that need more generality, a different framework is needed. Such more general frameworks may be standardised in the future (and may or may not be extensions of the Core Platform), but for now this is not the aim. If you need a fully general OS, Genode is likely your best choice (at the expense of losing most of the assurance).
The second is the ability to forcibly reset a protection domain to its initial state. […]
As the RFC clearly spells out, this is “a first step”; we won’t get very far trying to boil the ocean. Re-initialising components is definitely planned, but for a future release (see the last few minutes of my talk at the seL4 Summit: https://www.youtube.com/watch?v=khcjH77riGA&list=PLtoQeavghzr1Z3mF5B-f_xnFm81MPPwA-&index=4).

Gernot
On 11/21/20 8:26 PM, Heiser, Gernot (Data61, Kensington NSW) wrote:
On 20 Nov 2020, at 03:28, Demi M. Obenour wrote:
The first is an explanation for why the restrictions stated are acceptable. For instance, it is clear that the targeted systems are static. Why is dynamism not required? Is it because these systems are created with a single purpose in mind, and so do not need the level of extensibility that a dynamic system offers?
As the design document states: It’s meant to be simple but sufficient for the target domains. More generality means more complexity. The proposed model should serve almost all present use cases. Even Cog Systems’ “ultra-secure phone” fits this model: all the dynamism, such as run-time installable apps, happens inside a virtual machine with an Android guest. From the seL4 PoV, the phone is a simple, static architecture with a fixed number of protected components: the “user VM” containing a full Android system with virtual drivers, the VMM, a console, an I/O multiplexing component, a “driver VM” (which should eventually become individual driver components), a network encryption component and a storage encryption VM. If the Platform supports this use case, it will definitely support most others.
That makes sense. Would it be possible to include a “non-goals” section, for the benefit of readers who have not attended the seL4 Summit?
For the (presently very small) class of deployments that need more generality, a different framework is needed. Such more general frameworks may be standardised in the future (and may or may not be extensions of the Core Platform), but for now this is not the aim. If you need a fully general OS, Genode is likely your best choice (at the expense of losing most of the assurance).
That’s understandable. A fully general OS is a far more difficult task, even if it does not attempt to conform to POSIX or other legacy standards. It certainly does not belong in “a first step”.
The second is the ability to forcibly reset a protection domain to its initial state. […]
As the RFC clearly spells out, this is “a first step”; we won’t get very far trying to boil the ocean. Re-initialising components is definitely planned, but for a future release (see the last few minutes of my talk at the seL4 Summit: https://www.youtube.com/watch?v=khcjH77riGA&list=PLtoQeavghzr1Z3mF5B-f_xnFm81MPPwA-&index=4).
Indeed, “boiling the ocean” is not a good place to start. One last question: will cross-core communications channels be allowed?
Gernot
Sincerely, Demi
On 22 Nov 2020, at 16:33, Demi M. Obenour
One last question: will cross-core communications channels be allowed?
Fair question; the spec doesn’t spell this out explicitly (but should – we’ll fix this).

As the spec says, the PD has a scheduling context (SC) on which its init procedure and notification procedure (NP) execute. SCs are tied to a core, and thus the NP always runs on that core. Communication is via a shared buffer (the channel-attached MR), synchronised with Notifications. Notifications work perfectly fine across cores, and thus this communication is core-transparent. PPCs are always core-local (as the server runs on the client’s scheduling context), so while the NP executes on *its* core, the PP executes on the client’s core.

Note the consequence: if a channel connects PD1 running on Core1 with PD2 running on Core2, then, in the absence of a PP, all communication on this channel is cross-core. If, however, PD2 has a PP, and PD1 performs a PPC on PD2, then it’s not a priori clear on which core the PP executes. If PD1 does the PPC from its NP, then the PP will execute on Core1. If, however, PD1 also has a PP, and invokes the PPC from its own PP, that PPC will execute on the core of whoever called PD1’s PP.

I’m sure I’ve confused everyone by now ;-)

Gernot
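To illustrate which core each entry point runs on, here is a hedged sketch of a PD's entry points. The cp_channel type, the cp_notify() call and the init/notified/protected handler names are assumed placeholders for whatever the Core Platform library eventually provides; they are not API defined in the RFC.

```c
/*
 * Hedged sketch only: cp_channel, cp_notify(), and the init/notified/
 * protected entry points are hypothetical placeholders, not RFC API.
 */
#include <stdint.h>

typedef unsigned int cp_channel;

#define CH_PD1 0                         /* channel to PD1, possibly on another core */

extern volatile uint8_t *channel_mr;     /* the channel-attached shared memory region */
extern void cp_notify(cp_channel ch);    /* hypothetical notification primitive */

/* Runs once on this PD's own scheduling context, i.e. on this PD's core. */
void init(void)
{
    /* set up the shared memory region, local state, etc. */
}

/*
 * Notification entry point: always executes on this PD's core.  If CH_PD1
 * connects to a PD running on a different core, the exchange below is
 * cross-core: shared memory plus a notification in each direction.
 */
void notified(cp_channel ch)
{
    if (ch == CH_PD1) {
        /* consume whatever the peer left in the shared buffer ... */
        channel_mr[0] = 1;               /* ... leave a reply ... */
        cp_notify(CH_PD1);               /* ... and signal the peer, possibly across cores */
    }
}

/*
 * Protected procedure: runs on the *caller's* scheduling context, and hence
 * on the caller's core, which need not be this PD's core.
 */
uint64_t protected(cp_channel ch, uint64_t arg)
{
    (void)ch;
    return arg + 1;                      /* trivial core-local service */
}
```

In this sketch, a notification from a peer on another core is handled in notified() on this PD's own core, while a PPC from that peer would run protected() on whichever core the caller happens to be executing on.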
On 11/21/20, Heiser, Gernot (Data61, Kensington NSW)
As the design document states: It’s meant to be simple but sufficient for the target domains. More generality means more complexity. The proposed model should serve almost all present use cases. Even Cog Systems’ “ultra-secure phone” fits this model: all the dynamism, such as run-time installable apps, happens inside a virtual machine with an Android guest. From the seL4 PoV, the phone is a simple, static architecture with a fixed number of protected components: the “user VM” containing a full Android system with virtual drivers, the VMM, a console, an I/O multiplexing component, a “driver VM” (which should eventually become individual driver components), a network encryption component and a storage encryption VM. If the Platform supports this use case, it will definitely support most others.
Of course, in a system that runs a legacy general-purpose OS in a VM on top of a static system, only applications that are specifically written to use the secure APIs can actually be properly secured (and that's assuming that the secure static part of the system is actually capable of doing what the application needs). An seL4-native general-purpose OS could apply the same security model to everything, including unmodified applications for legacy OSes, assuming it provides a compatible environment, albeit at the very likely expense of formal verification of most OS components and subsystems above the kernel. However, even a mostly-unverified microkernel OS should still be considerably more secure than a legacy OS, assuming the design is remotely reasonable.
On 22 Nov 2020, at 18:56, Andrew Warkentin
On 11/22/20, Heiser, Gernot (Data61, Kensington NSW)
Yes. While this is a reasonable use case, it’s not one we see happening. Very few real-world systems are built from scratch; they are almost inevitably an evolution of something existing. And, compared to pulling out critical assets to run native (as in the Cog phone), the additional security gain from running *everything* native is small relative to the huge refactoring/porting effort. That’s why just about everyone opts for the HACMS-style incremental cyber retrofit, as Cog have done.
But if you think you gain enough by going fully native on an unverified microkernel OS, then Genode is what you should use. If you want high assurance for your critical assets, then the seL4 Core Platform is the way to go (at least once it supports VMs).
I'm (still) working on UX/RT, my own implementation of a QNX-like OS based on seL4, which I've posted about here in the past. It will use code from existing free Unices wherever possible. This includes using the LKL project to run multiple kernel-only Linux systems as servers to provide device drivers, disk filesystems, and a network stack. Most of the "regular" user-mode tools and libraries will be forked or ported from Linux and BSD. Many Linux-specific API functions will be natively supported (making it easier to implement a Linux binary compatibility environment, as well as to port Linux applications).

I'm certainly not attempting to build a full-featured Unix-like system from scratch by myself, because that would be a lot of reinvention of the wheel. Much of what I need is already out there; I just need to fill in the gaps. Basically, I'm intending to make a general-purpose Unix-like microkernel OS that is practical for most use cases where legacy Unices (including Linux) are currently used, while hopefully being more secure and easier to manage than any legacy Unix, and still maintaining a reasonable degree of backwards compatibility (and also following the Unix philosophy, or at least my interpretation of it, considerably better than any other modern system).
On 11/22/20 4:33 AM, Heiser, Gernot (Data61, Kensington NSW) wrote:
Yes. While this is a reasonable use case, it’s not one we see happening. Very few real-world systems are built from scratch; they are almost inevitably an evolution of something existing. And, compared to pulling out critical assets to run native (as in the Cog phone), the additional security gain from running *everything* native is small relative to the huge refactoring/porting effort. That’s why just about everyone opts for the HACMS-style incremental cyber retrofit, as Cog have done.
Are there any situations where the amount of critical code is so large that running everything native is worthwhile? Large avionics systems with millions of lines of critical flight control code are the first one that comes to mind, but I wonder if there are others. Perhaps a reasonably secure web browser? (Speaking of avionics: is it possible to scale formal verification to systems of that scale? I would love to see formally-verified avionics software.)
But if you think you gain enough by going fully native on an unverified microkernel OS, then Genode is what you should use. If you want high assurance for your critical assets, then the seL4 Core Platform is the way to go (at least once it supports VMs).
Gernot
Sincerely, Demi
On Nov 22, 2020, at 10:01, Demi M. Obenour wrote:
(Speaking of avionics: is it possible to scale formal verification to systems of that scale? I would love to see formally-verified avionics software.)
Inria and Airbus have had success doing this since at least ~2005. If you want to read about this, Sandrine Blazy and Xavier Leroy have discussed it in some prior publications. Today I believe this process is supported commercially by the company AbsInt.
On 11/22/20 1:36 PM, Matthew Fernandez wrote:
On Nov 22, 2020, at 10:01, Demi M. Obenour wrote:
(Speaking of avionics: is it possible to scale formal verification to systems of that scale? I would love to see formally-verified avionics software.)
Inria and Airbus have had success doing this since at least ~2005. If you want to read about this, Sandrine Blazy and Xavier Leroy have discussed it in some prior publications. Today I believe this process is supported commercially by the company AbsInt.
Is this full functional verification, or “merely” proving freedom from runtime errors? (Quotes because proving freedom from runtime errors is still a substantial victory.)

Sincerely,
Demi
On Nov 22, 2020, at 10:43, Demi M. Obenour wrote:
On 11/22/20 1:36 PM, Matthew Fernandez wrote:
On Nov 22, 2020, at 10:01, Demi M. Obenour wrote:
(Speaking of avionics: is it possible to scale formal verification to systems of that scale? I would love to see formally-verified avionics software.)
Inria and Airbus have had success doing this since at least ~2005. If you want to read about this, Sandrine Blazy and Xavier Leroy have discussed it in some prior publications. Today I believe this process is supported commercially by the company AbsInt.
Is this full functional verification, or “merely” proving freedom from runtime errors? (Quotes because proving freedom from runtime errors is still a substantial victory.)
That I do not know. You would have to read their papers or ask the authors.
On 23 Nov 2020, at 05:36, Matthew Fernandez
On 11/22/20 2:56 AM, Andrew Warkentin wrote:
On 11/21/20, Heiser, Gernot (Data61, Kensington NSW) wrote:
As the design document states: It’s meant to be simple but sufficient for the target domains. More generality means more complexity. The proposed model should serve almost all present use cases. Even Cog Systems’ “ultra-secure phone” fits this model: all the dynamism, such as run-time installable apps, happens inside a virtual machine with an Android guest. From the seL4 PoV, the phone is a simple, static architecture with a fixed number of protected components: the “user VM” containing a full Android system with virtual drivers, the VMM, a console, an I/O multiplexing component, a “driver VM” (which should eventually become individual driver components), a network encryption component and a storage encryption VM. If the Platform supports this use case, it will definitely support most others.
Of course, in a system that runs a legacy general-purpose OS in a VM on top of a static system, only applications that are specifically written to use the secure APIs can actually be properly secured (and that's assuming that the secure static part of the system is actually capable of doing what the application needs). An seL4-native general-purpose OS could apply the same security model to everything, including unmodified applications for legacy OSes, assuming it provides a compatible environment, albeit at the very likely expense of formal verification of most OS components and subsystems above the kernel. However, even a mostly-unverified microkernel OS should still be considerably more secure than a legacy OS, assuming the design is remotely reasonable.
Do you believe that an seL4-native general-purpose OS is practical? I certainly hope so! What do you believe such an OS would look like?

Regarding security, I mostly agree with you, but I do have one caveat to point out. There is a distinction between assurance (essentially, correctness) and security. For instance, a legacy VM can still be very secure if there is no way for untrusted data to enter it, or if its interactions with untrusted data are extremely limited. That’s how QubesOS works, and it has proven to be quite successful. In the case of QubesOS, this has allowed Linux-native desktop environments (such as XFCE and KDE) to be used in security-critical areas. This in turn allows for a much better user experience than would otherwise be possible given the project’s resources, and has been critical in QubesOS’s success in the desktop space.

Sincerely,
Demi
On 11/22/20, Demi M. Obenour
Do you believe that an seL4-native general-purpose OS is practical? I certainly hope so! What do you believe such an OS would look like?
UX/RT will look like a loose reimplementation of QNX with several concepts from Plan 9 and Linux thrown in. It won't really be a clone of any of those systems though. I think a QNX-like architecture (i.e. a natively Unix-like system with the filesystem as an IPC transport layer, as well as avoidance of most vertical modularization of subsystems) is the most practical architecture for a microkernel-based general-purpose OS.
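For readers unfamiliar with the QNX-style approach referred to above: servers attach themselves to paths in the file namespace, and clients talk to them with ordinary file I/O, so the filesystem itself becomes the IPC transport. The sketch below illustrates only that general pattern; the /dev/netstack/ctl path and the message format are invented for the example and are not part of UX/RT or QNX.

```c
/*
 * Generic illustration of the "filesystem as IPC transport" pattern:
 * a client talks to a server purely through standard file operations.
 * The path and message format are made up for this sketch.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* The server (e.g. a network-stack process) has registered this path. */
    int fd = open("/dev/netstack/ctl", O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* A request is just bytes written to the file descriptor ... */
    const char request[] = "STATUS";
    if (write(fd, request, sizeof request) < 0) {
        perror("write");
        close(fd);
        return 1;
    }

    /* ... and the reply is read back the same way. */
    char reply[128];
    ssize_t n = read(fd, reply, sizeof reply - 1);
    if (n > 0) {
        reply[n] = '\0';
        printf("server says: %s\n", reply);
    }

    close(fd);
    return 0;
}
```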
There is a distinction between assurance (essentially, correctness) and security. For instance, a legacy VM can still be very secure if there is no way for untrusted data to enter it, or if its interactions with untrusted data are extremely limited. That’s how QubesOS works, and it has proven to be quite successful. In the case of QubesOS, this has allowed Linux-native desktop environments (such as XFCE and KDE) to be used in security-critical areas. This in turn allows for a much better user experience than would otherwise be possible given the project’s resources, and has been critical in QubesOS’s success in the desktop space.
Yes, that kind of system can work and be reasonably secure, but it adds significant overhead, and can't really be implemented easily on top of a static system, since you pretty much need one VM per application.
Why not WASI (https://wasi.dev)?

It's a non-POSIX-compatible, capability-oriented system runtime that leans on WASM as an IR to communicate high-level data types across isolated execution environments (browser, TEE, lambda functions, etc.). WASM has a formal model, and they would ideally like to formalize WASI to ensure deterministic/reproducible behavior where appropriate.

It may not have all the primitives you need yet, but they are still in the prototyping phase and nothing is set in stone.

Thank you,
-Zach Lym
On Wed, Nov 18, 2020 at 7:47 PM Gernot Heiser
Dear seL4 Community,
Those who attended today’s closing session at the seL4 Summit have heard about this, for the rest I’m pleased to announce that we have an RFC out for the seL4 Core Platform, a minimal seL4-based OS for embedded/cyberphysical/IoT applications, see https://sel4.atlassian.net/browse/RFC-5
Feedback welcome.
Gernot
participants (6):
- Andrew Warkentin
- Demi M. Obenour
- Gernot Heiser
- Heiser, Gernot (Data61, Kensington NSW)
- Matthew Fernandez
- Zach Lym