[seL4] Re: User-level proofs and capability revocation

27 Mar 2025

      On 27 Mar 2025, at 05:22, Gernot Heiser <gernot@unsw.edu.au> wrote:

I guess, in seL4
world one structures the system like this: there is a protocol which
requires the client VM to work with the TCB along these lines:  give up
N pages from it's memory allocation back to the TCB which then "moves"
them into the driver. The TCB has to be "below" the client in the
capability chain to make sure that the client looses the ability to
revoke the pages from the driver,

That would be one way of doing it in seL4 if the client *does* own the pages.

Note that even if you use a protocol where the client supplies shared-memory frames directly to the server, the server can still protect itself from the client.

If the client unmaps the shared page, the server will trigger a page fault on its next access.
That invokes the server’s page-fault handler (normally part of the resource manager, but could be a thread inside the server).
The page-fault handler finds that the fault is due to a page shared with the client and takes appropriate action.
This would presumably involve revoking all further client access (i.e. revoking all of the client’s caps to server endpoints/notifications etc) – a clear case where revocation is essential to stop further requests from the mis-behaving client.
The fault handler would then clean up state the server holds on behalf of the client.
And it would recover the server by aborting whatever operation was faulted and re-set the server to a sane state.

In most designs, this work would be split between two handlers: The page-fault handler being part of a trusted resource manager (i.e. something higher up in the trust hierarchy) which invokes a server-internal handler for cleanup/recovery.

But theoretically it could be all internal to the server (which then needs to rely on owning its internal memory, i.e. can’t be revoked).

If the handler is external (part of a trusted resource manager) then that could implement transparent demand paging of the server too, it would have to do enough bookkeeping to be able to distinguish between faults on client-provided memory and faults triggered by server pages being swapped out.

Demand paging must be functionally transparent to the paged task. Meaning the resource manager must observe an invariant that if a page is unmapped (eg for swapping to backing store), a page fault triggered by that will only restart the paged task once that page is mapped to a frame that has the same content as the one that had been unmapped earlier. Then the functionality of the paged task is unaffected by paging.
Paging is still observable in the timing, of course, but that’s not different from being preempted by a higher-priority thread.

The design space is big – which is part of the challenge with seL4: it’s easy to do bad designs, and requires a lot of expertise to do good ones.
Which is why we’re working on sample designs.
LionsOS is a one with a static architecture, that avoids a lot of the complexity (and should ease verification).
But we also have a fully dynamic one in the works, but that one is at a very early stage.

Gernot