On 12/2/20, Demi M. Obenour wrote:
That’s understandable. From my perspective, it appears that the only change required would be to swap out the VMM, since the same kernel capabilities would be required either way. The only difference would be that a nested instance of UX/RT would need to get untyped memory objects somehow, which seems simple. Of course, I could very well be missing something here ― if this would meaningfully increase the complexity of the system, it probably isn’t worth it.
In addition to the aforementioned issues with binary compatibility, it would also complicate booting such systems. UX/RT will have a uniform boot process in which the kernel, the root server, and all other files required for early boot are contained in a "supervisor image" in the root directory of the root container being booted, rather than in multiple images on disk. The root server will depend on having the Multiboot2 info passed through to it, so that it doesn't have to understand the filesystem format of the supervisor image. There would also have to be some kind of API for communication with alternate personalities running on the same kernel (although of course on a hypervisor there would have to be an API for communicating with other VMs as well). It would also be somewhat redundant, because UX/RT will already have extensive containerization support at the VFS level, and since literally all IPC and user memory will be file-based, that means complete control over what is accessible in containers. This would really just be pushing containerization down to the kernel level, which would be of little benefit.
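To illustrate why passing the Multiboot2 info through matters, here is a minimal sketch of how a root server could locate the supervisor image as a Multiboot2 module without knowing anything about its internal filesystem format. The tag layout follows the Multiboot2 spec; how UX/RT will actually map and hand off the boot info is my own assumption.

#include <stddef.h>
#include <stdint.h>

/* Tag layout per the Multiboot2 specification; each tag is 8-byte aligned
 * and the list is terminated by a tag of type 0. */
struct mb2_tag {
    uint32_t type;
    uint32_t size;
};

struct mb2_tag_module {           /* type 3 */
    uint32_t type;
    uint32_t size;
    uint32_t mod_start;           /* physical start of the module */
    uint32_t mod_end;             /* physical end of the module */
    char     cmdline[];           /* bootloader-supplied string */
};

#define MB2_TAG_END    0
#define MB2_TAG_MODULE 3

/* Return the first module tag (e.g. the supervisor image), or NULL. */
static const struct mb2_tag_module *find_supervisor_image(const void *mb2_info)
{
    /* The info block starts with total_size and reserved (4 bytes each). */
    const uint8_t *p = (const uint8_t *)mb2_info + 8;

    for (;;) {
        const struct mb2_tag *tag = (const struct mb2_tag *)p;
        if (tag->type == MB2_TAG_END)
            return NULL;
        if (tag->type == MB2_TAG_MODULE)
            return (const struct mb2_tag_module *)tag;
        p += (tag->size + 7) & ~(size_t)7;   /* next 8-byte-aligned tag */
    }
}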
That said, from the way you phrased your message, I thought you were referring to a type-1 hypervisor that would run below UX/RT. IMO, that is where such a tool really belongs ― it can provide strong isolation guarantees between multiple seL4-based systems, and still allow each of those systems to use hardware virtualization if they so desire. For instance, an embedded system might have high-assurance components running on the seL4 Core Platform, while UX/RT is used as a replacement for VMs running Linux. Similarly, a hypothetical seL4-based QubesOS might use this type-1 hypervisor to isolate qubes from each other.
Yes, that's exactly what I was planning to do. I want to write a Type 1 hypervisor based on seL4 or a fork of it, which will be independent of UX/RT (although it will be distributed along with it). It will have a somewhat Xen-ish architecture in that all services will be provided by backend drivers running in VMs; the VMM will run as the root server and will be the only process-like thing running on the hypervisor microkernel.
FYI, since you plan on a Linux compatibility layer, you might want to contact the illumos developers. illumos is, of course, a completely different OS design, but they do have a full Linux compatibility layer and might be able to give some implementation advice.
Their Linux compatibility layer probably wouldn't really be all that relevant to that of UX/RT. UX/RT's Linux system call handling will be purely library-based, which is rather different from a kernel-based compatibility layer.
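As a rough illustration of what "purely library-based" means here, the sketch below shows a Linux open(2) implemented as an ordinary function in the compatibility C library rather than as a trap into a kernel-side emulation layer. The uxrt_vfs_open() and translate_open_flags() helpers are invented placeholders, not real UX/RT interfaces.

#include <fcntl.h>
#include <stdarg.h>

/* Hypothetical native UX/RT calls, declared here only to make the sketch
 * self-contained; the real interfaces may look nothing like this. */
extern int uxrt_vfs_open(const char *path, int native_flags, int mode);
extern int translate_open_flags(int linux_flags);

/* In a library-based compatibility layer, the Linux "system call" is just a
 * C function linked into the process; there is no kernel-side Linux
 * personality to trap into. */
int open(const char *path, int flags, ...)
{
    int mode = 0;
    if (flags & O_CREAT) {
        va_list ap;
        va_start(ap, flags);
        mode = va_arg(ap, int);
        va_end(ap);
    }
    return uxrt_vfs_open(path, translate_open_flags(flags), mode);
}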
An officially-supported, API- and ABI-stable C library is planned, so this may not be a roadblock for much longer.
I was under the impression that it wasn't going to have a stable ABI since it is intended more for static systems.
1. The need for an emulator removes many of the assurance guarantees provided by seL4, since one must rely on the correctness of the emulator to prevent in-VM privilege escalation vulnerabilities. Such vulnerabilities are not uncommon in existing hypervisors.
There isn't necessarily a need for an emulator as such for PVH-type VMs, although there will of course be device backends. Also, since this is meant for dynamic systems, verification of the microkernel is somewhat less relevant: the TCB of any process or VM will always include highly dynamic user-mode subsystems that will be difficult or impossible to verify (both the UX/RT root server and the hypervisor's root server will be written in Rust to reduce the potential for serious vulnerabilities). For a system with a static enclave running on the same processor as a dynamic OS, a verified kernel could still be used, presumably without direct hypercall support unless that could also be verified, since there would be less need for high-throughput IPC between the two sides.
2. Nested hardware virtualization is quite difficult to implement, and has significant overhead. On the other hand, nested virtualization based on seL4 capabilities is free.
UX/RT itself probably won't do anything with hardware virtualization extensions directly. It will support hosting I/O emulators and device backends for the underlying Type 1 hypervisor, and possibly Skybridge-type acceleration, but those will only use hypercalls.
3. I doubt seL4 supports issuing any specialized hypercall instructions, so you might need to fall back to emulation.
That's why I was wondering about adding kernel-level support for direct hypercalls.

On 12/3/20, Indan Zupancic wrote:
That seems possible to implement. The challenge is to do it without increasing seL4's complexity and without slowing down all notification and IPC system calls, as it would add a couple of special checks somewhere. It would require new system calls, e.g. seL4_Notification_SetIRQHandler() and seL4_IPC_SetIRQHandler(). Or it could be a TCB/VCPU binding instead.
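For concreteness, here is a rough sketch of what such calls might look like; the parameter lists are my guesses, not an existing or proposed seL4 API. The idea is to bind a notification or endpoint to a VCPU so that pending signals or messages are delivered as a virtual interrupt without a round trip through the VMM.

#include <sel4/sel4.h>

/* Hypothetical signatures only. "virq" would be the virtual interrupt
 * injected into the bound VCPU when the object has something pending. */
seL4_Error seL4_Notification_SetIRQHandler(seL4_CPtr notification,
                                           seL4_CPtr vcpu,
                                           seL4_Word virq);

seL4_Error seL4_IPC_SetIRQHandler(seL4_CPtr endpoint,
                                  seL4_CPtr vcpu,
                                  seL4_Word virq);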
It does not solve blocking outgoing IPC calls, but I assume you don't mind those. Interrupts must be kept disabled for the VM during such calls.
It has to work on x86 as well.
The VM receive side needs special seL4 kernel support to generate interrupts when there are pending notifications or messages. This is architecture independent.
The VM send side needs a way to directly make calls into the seL4 kernel instead of going via the VMM. This needs VM kernel support, because such instructions are usually privileged. All VM <-> VMM communication happens via the seL4 kernel already, so all that is needed is for seL4 to know whether it is a VMM call or a system call. On ARM this happens by looking at whether it was a svc or a hvc call. It seems likely that x86 can do something similar.
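For readers less familiar with the ARM side, the following is an illustrative (not seL4's actual) AArch64 trap-handler fragment showing how the exception class field of ESR_EL2 distinguishes an SVC (native system call) from an HVC (a call coming out of a VM). The handler name and structure are my own.

#include <stdint.h>

/* Exception class values from the ARMv8-A architecture. */
#define ESR_EC_SHIFT 26
#define ESR_EC_MASK  0x3fu
#define EC_SVC64     0x15u   /* SVC from AArch64 state: native system call */
#define EC_HVC64     0x16u   /* HVC from AArch64 state: call out of a guest */

void handle_sync_exception(uint64_t esr_el2)
{
    switch ((esr_el2 >> ESR_EC_SHIFT) & ESR_EC_MASK) {
    case EC_SVC64:
        /* dispatch as an ordinary seL4 system call */
        break;
    case EC_HVC64:
        /* today: report a VM fault to the VMM; with the extension being
         * discussed: handle it as a direct hypercall */
        break;
    default:
        /* page faults, instruction aborts, etc. */
        break;
    }
}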
For people who are wondering what use this has: taking the VMM out of the loop for seL4 <-> VM communication not only offers possible performance gains, it also simplifies the VMM and makes it more agnostic about the system as a whole. The system design becomes simpler because you can use the normal seL4 API everywhere, including VM user space. This makes it easier to move code between VM user space and native seL4.
I'm thinking of an interface where hypercalls are made by writing to virtual registers rather than by executing explicit trapping instructions, with the ability to create multiple hypercall interfaces per VM, each with its own set of endpoint capabilities (although interrupt delivery would probably still be best centralized in a single interface per VCPU, for use only by the kernel). This would make it easy for user processes within a VM to make hypercalls, since the guest system could create a hypercall interface for each process. Hypercall support for user processes in VMs would also pretty much require moving to a semi-asynchronous model where only the hypercall interface itself gets blocked, without blocking the VCPU: attempting to do anything with a hypercall interface that already has a call in progress would fail, but the VCPU would continue to run and all other hypercall interfaces would remain available. There would presumably have to be a dummy TCB (with only a CSpace, but no VSpace or CPU state defined) for each hypercall interface, separate from the TCB of the VM. IPC for normal threads would remain synchronous.
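To make the "virtual registers" idea concrete, here is a sketch of what one per-interface register block might look like. All names and fields are my own invention for illustration; nothing here corresponds to an existing seL4 structure.

#include <stdint.h>

/* Hypothetical layout of one hypercall interface, e.g. a small region the
 * guest maps per process. The guest fills in the message and sets status to
 * PENDING; the kernel performs the IPC on behalf of the interface's dummy
 * TCB, writes the reply back, and raises a virtual interrupt when done. */
enum hcall_status {
    HCALL_IDLE    = 0,   /* interface free for a new call */
    HCALL_PENDING = 1,   /* guest has submitted a call */
    HCALL_DONE    = 2,   /* reply written back by the kernel */
    HCALL_ERROR   = 3,   /* e.g. submission while a call was in progress */
};

struct hcall_iface {
    volatile uint32_t status;     /* enum hcall_status */
    uint32_t cap_index;           /* selects one of the interface's endpoint caps */
    uint64_t msg_info;            /* label/length, analogous to seL4_MessageInfo */
    uint64_t mr[8];               /* virtual message registers (in/out) */
};

Only the interface that submitted a PENDING call would block; the VCPU keeps running and the other interfaces stay usable, matching the semi-asynchronous model described above.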