Re: [seL4] AMD CPUs and SMP scalability

29 Jan 2020

      On 29 Jan 2020, at 09:35, Demi M. Obenour <demiobenour@gmail.com> wrote:
...
My understanding is that seL4 plans on scaling to large multi-core
server processors by using Intel TSX.
Not sure where you heard this, but it’s not true.

True is that we have evaluated TSX as a possible implementation of concurrency control of a multicore seL4 on x86. No decision has been made, as the verification implications are poorly understood ATM, and it is unclear whether it is worth using a highly architecture/manufacturer-specific mechanism for something that is very costly to verify.

Ignoring that point, there are zero plans to make the multicore seL4 kernel scale to large number of processors, for the simple reason that it doesn’t make sense to run a single seL4 image over many cores.

A simple back-of-the-envelope calculation explains why. A fastpath IPC operation (which should be the operation dominates kernel time in a well-designed system) costs of the order of 300 cycles single-core. If you do this on a multicore system, then any shared state needs to be migrated to the core that executes the syscall. On a processor with a shared L2 cache (typical Arm multicores) then this will add about a dozen of cycles for each cache line that must be migrated. A ballpark figure is that the shared state would be about 10 cache lines (and the kernel may need some more microoptimisation to get to that point). This means the 300 cycles go up to about 400+, or about 1/3 performance degradation.

If you now look at a manycore processor, cache-line migrations across the die are of the order of 300–1,000 cycles, so the cost of the IPC blows out from 300 to 3,000–10,000 cycles. This makes no sense at all.

The bottom line is that a single-image kernel makes sense where you share an L2. If you only share an L3 between a small number of cores then it becomes highly questionable whether a shared image makes sense (and note that Intel processors that pretend to have a shared L3 typically have a distributed L3 made to appear shared by having them connected by a fast tokenring network). If you’re into manycore territory it’s obvious the model does not make sense.

The right model for manycores is the clustered multikernel, where clusters sharing an L2 run a single kernel image, and clusters coordinate through user-level mechanisms. This is consistent with seL4 not trying to provide hardware abstraction, it should only do what userlevel cannot, or can only do with high overhead. In this case it means making the NUMA architecture explicitlky visible at user level, and having the system on top manage it.

We have not yet worked on the required user-level infrastructure (beyond an exploratory student thesis a couple years ago).

Gernot

Re: [seL4] AMD CPUs and SMP scalability

Heiser, Gernot (Data61, Kensington NSW)