On 12/2/20, Demi M. Obenour wrote:
That’s understandable. From my perspective, it appears that the only change required would be to swap out the VMM, since the same kernel capabilities would be required either way. The only difference would be that a nested instance of UX/RT would need to get untyped memory objects somehow, which seems simple. Of course, I could very well be missing something here ― if this would meaningfully increase the complexity of the system, it probably isn’t worth it.
In addition to the aforementioned issues with binary compatibility, it would also complicate booting such systems. UX/RT will have a uniform boot process in which the kernel, the root server, and all other files required for early boot are contained in a "supervisor image" in the root directory of the root container being booted, rather than in multiple images on disk. The root server will depend on having the Multiboot2 info passed through to it, so that it doesn't have to understand the filesystem format of the supervisor image. There would also have to be some kind of API for communication with alternate personalities running on the same kernel (although of course on a hypervisor there would have to be an API for communicating with other VMs as well). It would also be somewhat redundant, because UX/RT will already have extensive containerization support at the VFS level, and since literally all IPC and user memory will be file-based, that means complete control over what is accessible in containers. This would really just be pushing containerization down to the kernel level, which would be of little benefit.
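To illustrate why passing the Multiboot2 info through matters, here is a minimal sketch of how a root server could locate the supervisor image as a Multiboot2 module without knowing anything about its internal filesystem format. The tag layout follows the Multiboot2 spec; how UX/RT will actually map and hand off the boot info is my own assumption.

#include <stddef.h>
#include <stdint.h>

/* Tag layout per the Multiboot2 specification; each tag is 8-byte aligned
 * and the list is terminated by a tag of type 0. */
struct mb2_tag {
    uint32_t type;
    uint32_t size;
};

struct mb2_tag_module {           /* type 3 */
    uint32_t type;
    uint32_t size;
    uint32_t mod_start;           /* physical start of the module */
    uint32_t mod_end;             /* physical end of the module */
    char     cmdline[];           /* bootloader-supplied string */
};

#define MB2_TAG_END    0
#define MB2_TAG_MODULE 3

/* Return the first module tag (e.g. the supervisor image), or NULL. */
static const struct mb2_tag_module *find_supervisor_image(const void *mb2_info)
{
    /* The info block starts with total_size and reserved (4 bytes each). */
    const uint8_t *p = (const uint8_t *)mb2_info + 8;

    for (;;) {
        const struct mb2_tag *tag = (const struct mb2_tag *)p;
        if (tag->type == MB2_TAG_END)
            return NULL;
        if (tag->type == MB2_TAG_MODULE)
            return (const struct mb2_tag_module *)tag;
        p += (tag->size + 7) & ~(size_t)7;   /* next 8-byte-aligned tag */
    }
}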
That said, from the way you phrased your message, I thought you were referring to a type-1 hypervisor that would run below UX/RT. IMO, that is where such a tool really belongs ― it can provide strong isolation guarantees between multiple seL4-based systems, and still allow each of those systems to use hardware virtualization if they so desire. For instance, an embedded system might have high-assurance components running on the seL4 Core Platform, while UX/RT is used as a replacement for VMs running Linux. Similarly, a hypothetical seL4-based QubesOS might use this type-1 hypervisor to isolate qubes from each other.
Yes, that's exactly what I was planning to do. I want to write a Type 1 hypervisor based on seL4 or a fork of it, which will be independent of UX/RT (although it will be distributed along with it). It will have a somewhat Xen-ish architecture in that all services will be provided by backend drivers running in VMs; the VMM will run as the root server and will be the only process-like thing running on the hypervisor microkernel.
FYI, since you plan on a Linux compatibility layer, you might want to contact the illumos developers. illumos is, of course, a completely different OS design, but they do have a full Linux compatibility layer and might be able to give some implementation advice.
Their Linux compatibility layer probably wouldn't really be all that relevant to that of UX/RT. UX/RT's Linux system call handling will be purely library-based, which is rather different from a kernel-based compatibility layer.
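As a rough illustration of what "purely library-based" means here, the sketch below shows a Linux open(2) implemented as an ordinary function in the compatibility C library rather than as a trap into a kernel-side emulation layer. The uxrt_vfs_open() and translate_open_flags() helpers are invented placeholders, not real UX/RT interfaces.

#include <fcntl.h>
#include <stdarg.h>

/* Hypothetical native UX/RT calls, declared here only to make the sketch
 * self-contained; the real interfaces may look nothing like this. */
extern int uxrt_vfs_open(const char *path, int native_flags, int mode);
extern int translate_open_flags(int linux_flags);

/* In a library-based compatibility layer, the Linux "system call" is just a
 * C function linked into the process; there is no kernel-side Linux
 * personality to trap into. */
int open(const char *path, int flags, ...)
{
    int mode = 0;
    if (flags & O_CREAT) {
        va_list ap;
        va_start(ap, flags);
        mode = va_arg(ap, int);
        va_end(ap);
    }
    return uxrt_vfs_open(path, translate_open_flags(flags), mode);
}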
An officially-supported, API- and ABI-stable C library is planned, so this may not be a roadblock for much longer.
I was under the impression that it wasn't going to have a stable ABI since it is intended more for static systems.
1. The need for an emulator removes many of the assurance guarantees provided by seL4, since one must rely on the correctness of the emulator to prevent in-VM privilege escalation vulnerabilities. Such vulnerabilities are not uncommon in existing hypervisors.
There isn't necessarily a need for an emulator as such for PVH-type VMs, although there will of course be device backends. Also, since this is meant for dynamic systems, verification of the microkernel is somewhat less relevant: the TCB of any process or VM will always include highly dynamic user-mode subsystems that will be difficult or impossible to verify (both the UX/RT root server and the hypervisor's root server will be written in Rust to reduce the potential for serious vulnerabilities). For a system with a static enclave running on the same processor as a dynamic OS, a verified kernel could still be used, presumably without direct hypercall support unless that could also be verified, since there would be less need for high-throughput IPC between the two sides.
2. Nested hardware virtualization is quite difficult to implement, and has significant overhead. On the other hand, nested virtualization based on seL4 capabilities is free.
UX/RT itself probably won't do anything with hardware virtualization extensions directly. It will support hosting I/O emulators and device backends for the underlying Type 1 hypervisor, and possibly Skybridge-type acceleration, but those will only use hypercalls.
3. I doubt seL4 supports issuing any specialized hypercall instructions, so you might need to fall back to emulation.
That's why I was wondering about adding kernel-level support for direct hypercalls.

On 12/3/20, Indan Zupancic wrote:
That seems possible to implement. The challenge is to do it without increasing seL4's complexity and without slowing down all notification and IPC system calls, as it would add a couple of special checks somewhere. It would require new system calls, e.g. seL4_Notification_SetIRQHandler() and seL4_IPC_SetIRQHandler(). Or it could be a TCB/VCPU binding instead.
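For concreteness, here is a rough sketch of what such calls might look like; the parameter lists are my guesses, not an existing or proposed seL4 API. The idea is to bind a notification or endpoint to a VCPU so that pending signals or messages are delivered as a virtual interrupt without a round trip through the VMM.

#include <sel4/sel4.h>

/* Hypothetical signatures only. "virq" would be the virtual interrupt
 * injected into the bound VCPU when the object has something pending. */
seL4_Error seL4_Notification_SetIRQHandler(seL4_CPtr notification,
                                           seL4_CPtr vcpu,
                                           seL4_Word virq);

seL4_Error seL4_IPC_SetIRQHandler(seL4_CPtr endpoint,
                                  seL4_CPtr vcpu,
                                  seL4_Word virq);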
It does not solve blocking outgoing IPC calls, but I assume you don't mind those. Interrupts must be kept disabled for the VM during such calls.
It has to work on x86 as well.
The VM receive side needs special seL4 kernel support to generate interrupts when there are pending notifications or messages. This is architecture independent.
The VM send side needs a way to directly make calls into the seL4 kernel instead of going via the VMM. This needs VM kernel support, because such instructions are usually privileged. All VM <-> VMM communication happens via the seL4 kernel already, so all that is needed is for seL4 to know whether it is a VMM call or a system call. On ARM this happens by looking at whether it was a svc or a hvc call. It seems likely that x86 can do something similar.
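For readers less familiar with the ARM side, the following is an illustrative (not seL4's actual) AArch64 trap-handler fragment showing how the exception class field of ESR_EL2 distinguishes an SVC (native system call) from an HVC (a call coming out of a VM). The handler name and structure are my own.

#include <stdint.h>

/* Exception class values from the ARMv8-A architecture. */
#define ESR_EC_SHIFT 26
#define ESR_EC_MASK  0x3fu
#define EC_SVC64     0x15u   /* SVC from AArch64 state: native system call */
#define EC_HVC64     0x16u   /* HVC from AArch64 state: call out of a guest */

void handle_sync_exception(uint64_t esr_el2)
{
    switch ((esr_el2 >> ESR_EC_SHIFT) & ESR_EC_MASK) {
    case EC_SVC64:
        /* dispatch as an ordinary seL4 system call */
        break;
    case EC_HVC64:
        /* today: report a VM fault to the VMM; with the extension being
         * discussed: handle it as a direct hypercall */
        break;
    default:
        /* page faults, instruction aborts, etc. */
        break;
    }
}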
For people who are wondering what use this has: taking the VMM out of the loop for seL4 <-> VM communication not only offers possible performance gains, it also simplifies the VMM and makes it more agnostic about the system as a whole. The system design becomes simpler because you can use the normal seL4 API everywhere, including VM user space. This makes it easier to move code between VM user space and native seL4.
I'm thinking of an interface where hypercalls are made by writing to virtual registers rather than by executing explicit trapping instructions, with the ability to create multiple hypercall interfaces per VM, each with its own set of endpoint capabilities (although interrupt delivery would probably still be best centralized in a single interface per VCPU, for use only by the kernel). This would make it easy for user processes within a VM to make hypercalls, since the guest system could create a hypercall interface for each process. Hypercall support for user processes in VMs would also pretty much require moving to a semi-asynchronous model where only the hypercall interface itself gets blocked, without blocking the VCPU: attempting to do anything with a hypercall interface that already has a call in progress would fail, but the VCPU would continue to run and all other hypercall interfaces would remain available. There would presumably have to be a dummy TCB (with only a CSpace, but no VSpace or CPU state defined) for each hypercall interface, separate from the TCB of the VM. IPC for normal threads would remain synchronous.
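To make the "virtual registers" idea concrete, here is a sketch of what one per-interface register block might look like. All names and fields are my own invention for illustration; nothing here corresponds to an existing seL4 structure.

#include <stdint.h>

/* Hypothetical layout of one hypercall interface, e.g. a small region the
 * guest maps per process. The guest fills in the message and sets status to
 * PENDING; the kernel performs the IPC on behalf of the interface's dummy
 * TCB, writes the reply back, and raises a virtual interrupt when done. */
enum hcall_status {
    HCALL_IDLE    = 0,   /* interface free for a new call */
    HCALL_PENDING = 1,   /* guest has submitted a call */
    HCALL_DONE    = 2,   /* reply written back by the kernel */
    HCALL_ERROR   = 3,   /* e.g. submission while a call was in progress */
};

struct hcall_iface {
    volatile uint32_t status;     /* enum hcall_status */
    uint32_t cap_index;           /* selects one of the interface's endpoint caps */
    uint64_t msg_info;            /* label/length, analogous to seL4_MessageInfo */
    uint64_t mr[8];               /* virtual message registers (in/out) */
};

Only the interface that submitted a PENDING call would block; the VCPU keeps running and the other interfaces stay usable, matching the semi-asynchronous model described above.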