
Hi all I have a couple of questions on how user-level proofs are constructed in the seL4 ecosystem. Just to start, can someone point me to an example of a user-level proof, just so I can take a look at how specifications are exported to user processes and how everything is stitched together. But also, I'm a bit confused on what kind of assumptions are needed to build verified user-level code in the face of revocable memory. I.e., it seems that one needs to assume (and trust) that the memory of any verified (or even rather mission critical) process is never revoked. Maybe I'm wrong and missing something and you have a way in mind for how to bypass such trust. Or maybe one can argue that such trust is reasonable, which is probably true for a "flat" system in which the mission critical domain is created right from the trusted boot process or some trusted TCB above it. Yet it seems hard to argue that such trust is reasonable when the recursion is deeper, e.g., the boot process launches Linux, and then Linux tries to start some kind of a mission critical process like a microkernel-protected software TEE or something (out of its memory hence it can later revoke it). And one wants to build a proof about this TCB yet such proof needs to deal with memory that can be revoked at any time or assume that revocation is impossible. Just wandering what is the take on this. I apologize if this all is covered somewhere in seL4 docs and I just didn't dig deep enough. Thank you! Anton

On 26 Mar 2025, at 04:51, Anton Burtsev via Devel <devel@sel4.systems> wrote: I have a couple of questions on how user-level proofs are constructed in the seL4 ecosystem. Just to start, can someone point me to an example of a user-level proof, just so I can take a look at how specifications are exported to user processes and how everything is stitched together. There aren’t any fully stitched-together user-level proofs yet. Mostly, because user-level verification requires a concurrency logic on top of the kernel. There is work underway in multiple groups to address that. One exception to needing concurrency is the user-level system initialiser (because it is the only process that runs at that time). Proofs here: https://github.com/seL4/l4v/tree/master/sys-init Other current approaches to user-level verification are to ignore the concurrency aspect and use sequential verification via SMT or other techniques for user-level programs, with a manually crafted adaptation of the seL4 spec. The UNSW folks have had some success with that, and I’ll let them comment on it. But also, I'm a bit confused on what kind of assumptions are needed to build verified user-level code in the face of revocable memory. I.e., it seems that one needs to assume (and trust) that the memory of any verified (or even rather mission critical) process is never revoked. Yes, of course. In static systems that is simple and guaranteed by correct system initialisation. In a normal CAmkES or Microkit system, nobody in the system has the authority to revoke the memory of other components after initialisation, and seL4’s integrity theorem gives you that this stays this way. In dynamic systems, that is less easy and requires a proof about the components that do have authority to revoke memory of other components. Maybe I'm wrong and missing something and you have a way in mind for how to bypass such trust. Or maybe one can argue that such trust is reasonable, which is probably true for a "flat" system in which the mission critical domain is created right from the trusted boot process or some trusted TCB above it. Those systems don’t need to flat, they could easily be hierarchical. The important part is that they are static in the sense that authority to revoke no longer exists in the system after (verified) user-level initialisation is done. In that kind of system, the problem is not a problem. Most current high-assurance systems are in that class. Dynamic systems are more interesting, but that is because resource re-use inherently has this problem. It doesn’t really come from anything related to seL4. If you can destroy a resource, the chain that leads to that action needs to be trusted. If you cannot destroy resources, but can create them, you immediately have the potential for denial of service. seL4 provides the mechanisms to do all that, but it doesn’t prescribe how you use them. Yet it seems hard to argue that such trust is reasonable when the recursion is deeper, e.g., the boot process launches Linux, and then Linux tries to start some kind of a mission critical process like a microkernel-protected software TEE or something (out of its memory hence it can later revoke it). And one wants to build a proof about this TCB yet such proof needs to deal with memory that can be revoked at any time or assume that revocation is impossible. You would never give Linux authority to directly create and revoke memory that others need to trust. In fact, something like Linux should never get much direct seL4 authority at all. It’ll usually be managed by a virtual machine monitor and even that would be constrained in authority as much as you can. If you want an untrusted Linux process to start something trusted, you’d add a separate management component that is small and can be trusted, which Linux can then signal to start something. In a dynamic system, you’d usually want to do that anyway, just for keeping dynamic seL4 resource and authority handling confined and verifiable (you could imagine such management components to be nested and hierarchical as well). I would still question the wisdom of letting a large untrusted code base like Linux trigger anything trusted. How do you know it does so when you want it to? It could just pretend to any client that it triggered the trusted component, but fake a result instead. Etc. There are ways around most of these problems, but it’d usually better to architect the system in a way that doesn’t require that trust inversion in the first place. The exception is when you build multiple system versions to migrate between architectures, then an intermediate state like that makes sense. Just wandering what is the take on this. I apologize if this all is covered somewhere in seL4 docs and I just didn't dig deep enough. I don’t think it is covered explicitly in the docs, it’s too high level for that. It would be a good idea write up some of higher-level concepts of how things can be organised to minimise trust. Cheers, Gerwin

Thank you! This makes sense. However, it still seems that in some cases revocation is more of a nuisance than help. For example, in a classical Xen split device driver model, i.e., a shared device driver VM (or a process in seL4) multiplexes a single physical device across multiple client VMs. In Xen each client connects to the driver VM and provides a set of pages to establish a region of shared memory for communication. The fact that these pages provided by the client protects the driver from a denial of service attack. In Xen it's impossible to revoke the pages -- once they are granted they belong to the driver VM (at least if I remember correctly, there might be some nuances and I didn't look at Xen recently). This scenario seems hard in seL4 as the shared pages can be revoked at any time. I guess, in seL4 world one structures the system like this: there is a protocol which requires the client VM to work with the TCB along these lines: give up N pages from it's memory allocation back to the TCB which then "moves" them into the driver. The TCB has to be "below" the client in the capability chain to make sure that the client looses the ability to revoke the pages from the driver, Effectively, in seL4, if my understanding above is correct, one structures the system in such a manner that revocation is eliminated, i.e., the client cannot revoke its pages. But then a logical question: why do we need revocation in the first place? In the end the client trusts the driver to release the pages back via the TCB when connection is teared down. This seems natural -- there is a degree of trust and cooperation between clients and the driver. In other words, what is the killer use-case for fine-grained revocation in your opinion? I.e., what kind of system is impossible without it? Or even: did you see scenarios on top of seL4 when people rely on fine-grained revocation (i.e., revoke pages, memory regions, access to devices) instead of a rather coarse-grained termination of the entire processes? Anton On Tue, Mar 25, 2025 at 09:52:51PM +0000, Gerwin Klein wrote:
On 26 Mar 2025, at 04:51, Anton Burtsev via Devel <devel@sel4.systems> wrote:
I have a couple of questions on how user-level proofs are constructed in the seL4 ecosystem.
Just to start, can someone point me to an example of a user-level proof, just so I can take a look at how specifications are exported to user processes and how everything is stitched together.
There aren???t any fully stitched-together user-level proofs yet. Mostly, because user-level verification requires a concurrency logic on top of the kernel. There is work underway in multiple groups to address that.
One exception to needing concurrency is the user-level system initialiser (because it is the only process that runs at that time). Proofs here: https://github.com/seL4/l4v/tree/master/sys-init
Other current approaches to user-level verification are to ignore the concurrency aspect and use sequential verification via SMT or other techniques for user-level programs, with a manually crafted adaptation of the seL4 spec. The UNSW folks have had some success with that, and I???ll let them comment on it.
But also, I'm a bit confused on what kind of assumptions are needed to build verified user-level code in the face of revocable memory. I.e., it seems that one needs to assume (and trust) that the memory of any verified (or even rather mission critical) process is never revoked.
Yes, of course. In static systems that is simple and guaranteed by correct system initialisation. In a normal CAmkES or Microkit system, nobody in the system has the authority to revoke the memory of other components after initialisation, and seL4???s integrity theorem gives you that this stays this way.
In dynamic systems, that is less easy and requires a proof about the components that do have authority to revoke memory of other components.
Maybe I'm wrong and missing something and you have a way in mind for how to bypass such trust. Or maybe one can argue that such trust is reasonable, which is probably true for a "flat" system in which the mission critical domain is created right from the trusted boot process or some trusted TCB above it.
Those systems don???t need to flat, they could easily be hierarchical. The important part is that they are static in the sense that authority to revoke no longer exists in the system after (verified) user-level initialisation is done. In that kind of system, the problem is not a problem. Most current high-assurance systems are in that class.
Dynamic systems are more interesting, but that is because resource re-use inherently has this problem. It doesn???t really come from anything related to seL4. If you can destroy a resource, the chain that leads to that action needs to be trusted. If you cannot destroy resources, but can create them, you immediately have the potential for denial of service. seL4 provides the mechanisms to do all that, but it doesn???t prescribe how you use them.
Yet it seems hard to argue that such trust is reasonable when the recursion is deeper, e.g., the boot process launches Linux, and then Linux tries to start some kind of a mission critical process like a microkernel-protected software TEE or something (out of its memory hence it can later revoke it). And one wants to build a proof about this TCB yet such proof needs to deal with memory that can be revoked at any time or assume that revocation is impossible.
You would never give Linux authority to directly create and revoke memory that others need to trust. In fact, something like Linux should never get much direct seL4 authority at all. It???ll usually be managed by a virtual machine monitor and even that would be constrained in authority as much as you can.
If you want an untrusted Linux process to start something trusted, you???d add a separate management component that is small and can be trusted, which Linux can then signal to start something. In a dynamic system, you???d usually want to do that anyway, just for keeping dynamic seL4 resource and authority handling confined and verifiable (you could imagine such management components to be nested and hierarchical as well).
I would still question the wisdom of letting a large untrusted code base like Linux trigger anything trusted. How do you know it does so when you want it to? It could just pretend to any client that it triggered the trusted component, but fake a result instead. Etc. There are ways around most of these problems, but it???d usually better to architect the system in a way that doesn???t require that trust inversion in the first place. The exception is when you build multiple system versions to migrate between architectures, then an intermediate state like that makes sense.
Just wandering what is the take on this. I apologize if this all is covered somewhere in seL4 docs and I just didn't dig deep enough.
I don???t think it is covered explicitly in the docs, it???s too high level for that. It would be a good idea write up some of higher-level concepts of how things can be organised to minimise trust.
Cheers, Gerwin

On 26 Mar 2025, at 15:22, Anton Burtsev via Devel <devel@sel4.systems> wrote: For example, in a classical Xen split device driver model, i.e., a shared device driver VM (or a process in seL4) multiplexes a single physical device across multiple client VMs. In Xen each client connects to the driver VM and provides a set of pages to establish a region of shared memory for communication. The fact that these pages provided by the client protects the driver from a denial of service attack. In Xen it's impossible to revoke the pages -- once they are granted they belong to the driver VM (at least if I remember correctly, there might be some nuances and I didn't look at Xen recently). This scenario seems hard in seL4 as the shared pages can be revoked at any time. Well, in seL4, unprivileged tasks don’t normally *own* their resources (in the sense of having access to the caps that control them – apps don’t normally have access to teir own Cspace) but this is left to a more privileged resource manager, which would be responsible for establishing the shared memory in which case this problem doesn’t exist. I guess, in seL4 world one structures the system like this: there is a protocol which requires the client VM to work with the TCB along these lines: give up N pages from it's memory allocation back to the TCB which then "moves" them into the driver. The TCB has to be "below" the client in the capability chain to make sure that the client looses the ability to revoke the pages from the driver, That would be one way of doing it in seL4 if the client *does* own the pages. However, a better protocol in that case would be for the client to *grant* the page to the server (driver) and the server to map them back to the client.By the same protocol the client could request a revoke, by requesting the server to grant them back. Effectively, in seL4, if my understanding above is correct, one structures the system in such a manner that revocation is eliminated, i.e., the client cannot revoke its pages. Revocation is needed for any dynamic resource management. But that doesn’t mean that an untrusted app is given that power. Eg, how do you implement demand paging? The server unmaps some of your memory, and re-maps it on demand. But then a logical question: why do we need revocation in the first place? In the end the client trusts the driver to release the pages back via the TCB when connection is teared down. This seems natural -- there is a degree of trust and cooperation between clients and the driver. That’s still revocation, isn’t it? Just not by the untrusted app. Of course, no-one stops you from building silly systems where untrusted tasks can stuff up others. Gernot

On 27 Mar 2025, at 05:22, Gernot Heiser <gernot@unsw.edu.au> wrote: But then a logical question: why do we need revocation in the first place? In the end the client trusts the driver to release the pages back via the TCB when connection is teared down. This seems natural -- there is a degree of trust and cooperation between clients and the driver. That’s still revocation, isn’t it? Just not by the untrusted app. Forgot to say: There is, in a well-designed system, no need for a server to trust clients. And mutually-distrusting relationships are easy to set up, but definitely need a trusted intermediary. Of course, this is not different from other OSes: you trust the OS. Here, the “OS” is a collection of servers, rather than a monolith, and there can be a hierarchy of trust. Gernot

On 27 Mar 2025, at 05:22, Gernot Heiser <gernot@unsw.edu.au> wrote: I guess, in seL4 world one structures the system like this: there is a protocol which requires the client VM to work with the TCB along these lines: give up N pages from it's memory allocation back to the TCB which then "moves" them into the driver. The TCB has to be "below" the client in the capability chain to make sure that the client looses the ability to revoke the pages from the driver, That would be one way of doing it in seL4 if the client *does* own the pages. Note that even if you use a protocol where the client supplies shared-memory frames directly to the server, the server can still protect itself from the client. If the client unmaps the shared page, the server will trigger a page fault on its next access. That invokes the server’s page-fault handler (normally part of the resource manager, but could be a thread inside the server). The page-fault handler finds that the fault is due to a page shared with the client and takes appropriate action. This would presumably involve revoking all further client access (i.e. revoking all of the client’s caps to server endpoints/notifications etc) – a clear case where revocation is essential to stop further requests from the mis-behaving client. The fault handler would then clean up state the server holds on behalf of the client. And it would recover the server by aborting whatever operation was faulted and re-set the server to a sane state. In most designs, this work would be split between two handlers: The page-fault handler being part of a trusted resource manager (i.e. something higher up in the trust hierarchy) which invokes a server-internal handler for cleanup/recovery. But theoretically it could be all internal to the server (which then needs to rely on owning its internal memory, i.e. can’t be revoked). If the handler is external (part of a trusted resource manager) then that could implement transparent demand paging of the server too, it would have to do enough bookkeeping to be able to distinguish between faults on client-provided memory and faults triggered by server pages being swapped out. Demand paging must be functionally transparent to the paged task. Meaning the resource manager must observe an invariant that if a page is unmapped (eg for swapping to backing store), a page fault triggered by that will only restart the paged task once that page is mapped to a frame that has the same content as the one that had been unmapped earlier. Then the functionality of the paged task is unaffected by paging. Paging is still observable in the timing, of course, but that’s not different from being preempted by a higher-priority thread. The design space is big – which is part of the challenge with seL4: it’s easy to do bad designs, and requires a lot of expertise to do good ones. Which is why we’re working on sample designs. LionsOS is a one with a static architecture, that avoids a lot of the complexity (and should ease verification). But we also have a fully dynamic one in the works, but that one is at a very early stage. Gernot

On 3/26/25 4:04 PM, Gernot Heiser via Devel wrote:
On 27 Mar 2025, at 05:22, Gernot Heiser <gernot@unsw.edu.au> wrote:
I guess, in seL4 world one structures the system like this: there is a protocol which requires the client VM to work with the TCB along these lines: give up N pages from it's memory allocation back to the TCB which then "moves" them into the driver. The TCB has to be "below" the client in the capability chain to make sure that the client looses the ability to revoke the pages from the driver,
That would be one way of doing it in seL4 if the client *does* own the pages.
Note that even if you use a protocol where the client supplies shared-memory frames directly to the server, the server can still protect itself from the client.
If the client unmaps the shared page, the server will trigger a page fault on its next access. That invokes the server’s page-fault handler (normally part of the resource manager, but could be a thread inside the server). The page-fault handler finds that the fault is due to a page shared with the client and takes appropriate action. This would presumably involve revoking all further client access (i.e. revoking all of the client’s caps to server endpoints/notifications etc) – a clear case where revocation is essential to stop further requests from the mis-behaving client. The fault handler would then clean up state the server holds on behalf of the client. And it would recover the server by aborting whatever operation was faulted and re-set the server to a sane state.
In most designs, this work would be split between two handlers: The page-fault handler being part of a trusted resource manager (i.e. something higher up in the trust hierarchy) which invokes a server-internal handler for cleanup/recovery.
But theoretically it could be all internal to the server (which then needs to rely on owning its internal memory, i.e. can’t be revoked).
If the handler is external (part of a trusted resource manager) then that could implement transparent demand paging of the server too, it would have to do enough bookkeeping to be able to distinguish between faults on client-provided memory and faults triggered by server pages being swapped out.
Demand paging must be functionally transparent to the paged task. Meaning the resource manager must observe an invariant that if a page is unmapped (eg for swapping to backing store), a page fault triggered by that will only restart the paged task once that page is mapped to a frame that has the same content as the one that had been unmapped earlier. Then the functionality of the paged task is unaffected by paging. Paging is still observable in the timing, of course, but that’s not different from being preempted by a higher-priority thread.
The design space is big – which is part of the challenge with seL4: it’s easy to do bad designs, and requires a lot of expertise to do good ones. Which is why we’re working on sample designs. LionsOS is a one with a static architecture, that avoids a lot of the complexity (and should ease verification). But we also have a fully dynamic one in the works, but that one is at a very early stage.
Will it be possible to implement a POSIX-like API on top of this? By “POSIX-like”, I mean “similar enough to POSIX that existing applications like browsers, web servers, etc can be ported fairly easily.” Having to rely on VMs for anything that isn’t written from scratch for seL4 would not be great. -- Sincerely, Demi Marie Obenour (she/her/hers)

On 27 Mar 2025, at 07:18, Demi Marie Obenour via Devel <devel@sel4.systems> wrote: Which is why we’re working on sample designs. LionsOS is a one with a static architecture, that avoids a lot of the complexity (and should ease verification). But we also have a fully dynamic one in the works, but that one is at a very early stage. Will it be possible to implement a POSIX-like API on top of this? By “POSIX-like”, I mean “similar enough to POSIX that existing applications like browsers, web servers, etc can be ported fairly easily.” Having to rely on VMs for anything that isn’t written from scratch for seL4 would not be great. We’re definitely layering an (inherently less efficient, Posix-style) synchronous interface over the asynchronous native API. In fact, that’s already used in your LionsOS-based web server that runs the sel4.systems web site. Others who are closer to the code can provide more details. Gernot

"Gernot" == Gernot Heiser via Devel <devel@sel4.systems> writes:
Something's wrong with the way Apple mailer quotes emails ... it is not easy to see what's a reply. I've tried to assign quotations correctly... Demi> Will it be possible to implement a POSIX-like API on top of Demi> this? By “POSIX-like”, I mean “similar enough to POSIX that Demi> existing applications like browsers, web servers, etc can be Demi> ported fairly easily.” Having to rely on VMs for anything Demi> that isn’t written from scratch for seL4 would not be great. Gernot> We’re definitely layering an (inherently less efficient, Gernot> Posix-style) synchronous interface over the asynchronous Gernot> native API. In fact, that’s already used in your LionsOS-based Gernot> web server that runs the sel4.systems web site. Others who Gernot> are closer to the code can provide more details. We have nothing that's completely POSIX. There's a project that might be coming up that will need more of a POSIX library, but it's unclear right now how complete it will have to be. We can at the moment provide a socket interface (courtesy of LWIP); and open/close/read/write for files, that are POSIX-ish (they don't have full POSIX semantics, especially in the error cases, but are near enough). Other features used by a production webserver like NGINX or Apache (mmap(), clone(), UNIX sockets or named pipes for WSGI, SIGPOLL for async IO (or signals in general), fork()/exec() for running separate processes like PHP) are unlikely. Browsers also have moved a long way from just needing POSIX interfaces. Both Firefox and Chromium (for example) use the namespace/container features of modern kernels to sandbox unsafe javascript, WASM etc., and generally set up multiple threads/processes using clone() or fork(). In addition, the environment in which a POSIX process runs is not currently supported. On seL4, there is not a complete standard filesystem, with /dev, /proc, /tmp etc., and even if you created device nodes in /dev you'd need to do a lot of work to hook them up to the actual device implementations. (/dev actually would be doable, but seems not useful: the native LionsOS/sDDF mechanisms are much more efficient in this context, even with a POSIX read/write mechanism on top) So, in short, we can do the basic filesystem and network operations fairly easily; anything more is too hard for now. -- Dr Peter Chubb https://trustworthy.systems/ Trustworthy Systems Group CSE, UNSW Core hours: Mon 8am-3pm; Wed: 8am-5pm; Fri 8am-12pm.

On 3/26/25 4:04 PM, Gernot Heiser via Devel wrote:
On 27 Mar 2025, at 05:22, Gernot Heiser <gernot@unsw.edu.au> wrote:
I guess, in seL4 world one structures the system like this: there is a protocol which requires the client VM to work with the TCB along these lines: give up N pages from it's memory allocation back to the TCB which then "moves" them into the driver. The TCB has to be "below" the client in the capability chain to make sure that the client looses the ability to revoke the pages from the driver,
That would be one way of doing it in seL4 if the client *does* own the
Note that even if you use a protocol where the client supplies
shared-memory frames directly to the server, the server can still protect itself from the client.
If the client unmaps the shared page, the server will trigger a page
fault on its next access.
That invokes the server’s page-fault handler (normally part of the resource manager, but could be a thread inside the server). The page-fault handler finds that the fault is due to a page shared with
This would presumably involve revoking all further client access (i.e. revoking all of the client’s caps to server endpoints/notifications etc) – a clear case where revocation is essential to stop further requests from
The fault handler would then clean up state the server holds on behalf of the client. And it would recover the server by aborting whatever operation was faulted and re-set the server to a sane state.
In most designs, this work would be split between two handlers: The
But theoretically it could be all internal to the server (which then
needs to rely on owning its internal memory, i.e. can’t be revoked).
If the handler is external (part of a trusted resource manager) then
"Having to rely on VMs for anything that isn’t written from scratch for seL4 would not be great." A stripped down Linux kernel seL4 VM boots in few seconds. Just tweak Linux to be fast and have a full flexible environment. As a QubesOS user I don't see any speed difference from running Xen guests... On Wednesday, March 26, 2025, Demi Marie Obenour via Devel <devel@sel4.systems> wrote: pages. the client and takes appropriate action. the mis-behaving client. page-fault handler being part of a trusted resource manager (i.e. something higher up in the trust hierarchy) which invokes a server-internal handler for cleanup/recovery. that could implement transparent demand paging of the server too, it would have to do enough bookkeeping to be able to distinguish between faults on client-provided memory and faults triggered by server pages being swapped out.
Demand paging must be functionally transparent to the paged task.
Meaning the resource manager must observe an invariant that if a page is unmapped (eg for swapping to backing store), a page fault triggered by that will only restart the paged task once that page is mapped to a frame that has the same content as the one that had been unmapped earlier. Then the functionality of the paged task is unaffected by paging.
Paging is still observable in the timing, of course, but that’s not different from being preempted by a higher-priority thread.
The design space is big – which is part of the challenge with seL4: it’s easy to do bad designs, and requires a lot of expertise to do good ones. Which is why we’re working on sample designs. LionsOS is a one with a static architecture, that avoids a lot of the complexity (and should ease verification). But we also have a fully dynamic one in the works, but that one is at a very early stage.
Will it be possible to implement a POSIX-like API on top of this? By “POSIX-like”, I mean “similar enough to POSIX that existing applications like browsers, web servers, etc can be ported fairly easily.” Having to rely on VMs for anything that isn’t written from scratch for seL4 would not be great. -- Sincerely, Demi Marie Obenour (she/her/hers) _______________________________________________ Devel mailing list -- devel@sel4.systems To unsubscribe send an email to devel-leave@sel4.systems

On 27 Mar 2025, at 08:16, Hugo V.C. via Devel <devel@sel4.systems> wrote:
"Having to rely on VMs for anything that isn’t written from scratch for seL4 would not be great."
A stripped down Linux kernel seL4 VM boots in few seconds. Just tweak Linux to be fast and have a full flexible environment. As a QubesOS user I don't see any speed difference from running Xen guests…
That seems long. We used to have this down to the order of 100ms (but I don’t think anyone looked into VM startup times recently). Gernot

100ms for Linux Kernel + initramfs boot? On Wednesday, March 26, 2025, Gernot Heiser <gernot@unsw.edu.au> wrote:
On 27 Mar 2025, at 08:16, Hugo V.C. via Devel <devel@sel4.systems> wrote:
"Having to rely on VMs for anything that isn’t written from scratch for seL4 would not be great."
A stripped down Linux kernel seL4 VM boots in few seconds. Just tweak
Linux
to be fast and have a full flexible environment. As a QubesOS user I don't see any speed difference from running Xen guests…
That seems long. We used to have this down to the order of 100ms (but I don’t think anyone looked into VM startup times recently).
Gernot

On Wed, 26 Mar 2025 at 03:54, Anton Burtsev via Devel <devel@sel4.systems> wrote:
Hi all
I have a couple of questions on how user-level proofs are constructed in the seL4 ecosystem.
Just to start, can someone point me to an example of a user-level proof, just so I can take a look at how specifications are exported to user processes and how everything is stitched together.
But also, I'm a bit confused on what kind of assumptions are needed to build verified user-level code in the face of revocable memory. I.e., it seems that one needs to assume (and trust) that the memory of any verified (or even rather mission critical) process is never revoked.
I hope this is helpful - some learnings from trying to support mutual suspicion in an EROS descendant that follows a very similar model to seL4. Note that we do not need to handle the case of arbitrary memory being rescinded, only memory that we've been sent by a client. The driver's own memory is owned by the driver, so the only way it can go away without the driver's consent is if the process is destroyed. I don't know if that matters to you. When you set up a driver that maps some client memory that could be revoked out from under it: - the driver should have a handler registered for memory faults. - the memory handler should be able to identify which faults are from regions that were used for mapping client pages. - if a fault happens on a user-supplied page, the handler should either adjust the instruction pointer and stack to a function that can abandon the transaction, or map an empty page in that location and make a note that the driver can check when it is done working with client pages. the driver needs to accept that it might experience a partial read/write of a client's data and decide how to deal with that. Additionally, if we are going to map the page to the driver via the iommu, we pin it, which among other things prevents re-allocation of the page. As far as I know, this isn't necessary in seL4, but maybe there's a similar mechanism to ensure devices can't read re-allocated memory without having to copy. -- William ML Leslie

Hi William Thank you for bringing this up! I've heard of this technique and that's why I wanted to explicitly clarify this use case. Intuitively it feels extremely hard to get right -- transactions that can abort on any memory access, yet you want to recover to some clean state. Imagine trying to verify something like this -- it's concurrency on steroids. I.e., I get it, maybe it's possible, especially with some careful programming primitives but it seems very very hard and impractical. Is there a link to source code? Anton On Wed, Mar 26, 2025 at 12:49:59PM +1000, William ML Leslie wrote:
On Wed, 26 Mar 2025 at 03:54, Anton Burtsev via Devel <devel@sel4.systems> wrote:
Hi all
I have a couple of questions on how user-level proofs are constructed in the seL4 ecosystem.
Just to start, can someone point me to an example of a user-level proof, just so I can take a look at how specifications are exported to user processes and how everything is stitched together.
But also, I'm a bit confused on what kind of assumptions are needed to build verified user-level code in the face of revocable memory. I.e., it seems that one needs to assume (and trust) that the memory of any verified (or even rather mission critical) process is never revoked.
I hope this is helpful - some learnings from trying to support mutual suspicion in an EROS descendant that follows a very similar model to seL4.
Note that we do not need to handle the case of arbitrary memory being rescinded, only memory that we've been sent by a client. The driver's own memory is owned by the driver, so the only way it can go away without the driver's consent is if the process is destroyed. I don't know if that matters to you.
When you set up a driver that maps some client memory that could be revoked out from under it: - the driver should have a handler registered for memory faults. - the memory handler should be able to identify which faults are from regions that were used for mapping client pages. - if a fault happens on a user-supplied page, the handler should either adjust the instruction pointer and stack to a function that can abandon the transaction, or map an empty page in that location and make a note that the driver can check when it is done working with client pages. the driver needs to accept that it might experience a partial read/write of a client's data and decide how to deal with that.
Additionally, if we are going to map the page to the driver via the iommu, we pin it, which among other things prevents re-allocation of the page. As far as I know, this isn't necessary in seL4, but maybe there's a similar mechanism to ensure devices can't read re-allocated memory without having to copy.
-- William ML Leslie

On Thu, 27 Mar 2025 at 03:07, Anton Burtsev <aburtsev@flux.utah.edu> wrote:
Hi William
Thank you for bringing this up! I've heard of this technique and that's why I wanted to explicitly clarify this use case. Intuitively it feels extremely hard to get right -- transactions that can abort on any memory access, yet you want to recover to some clean state.
That's the thing though, they can't abort on any memory access, only on access to the volatile message mapped into a (usually statically known) location. You can reduce the state space further by only reading/writing from the message from a single place in the program - doing your unmarshalling up front, and building your output right before notifying the client.
Imagine trying to verify something like this -- it's concurrency on steroids. I.e., I get it, maybe it's possible, especially with some careful programming primitives but it seems very very hard and impractical.
Is there a link to source code?
Sorry, I don't know if there are any public examples of this technique. I'd like to release some drivers I have that use this (specifically, they abort the transaction and return to open wait), but that is a bit involved right now. In practice though, it involves restoring the registers of the driver process to a previously known-safe state, which means dropping the message currently being handled on the floor. presumably the caller was fine with this. -- William ML Leslie
participants (7)
-
Anton Burtsev
-
Demi Marie Obenour
-
Gernot Heiser
-
Gerwin Klein
-
Hugo V.C.
-
Peter Chubb
-
William ML Leslie