On 2020-01-29 05:21, Heiser, Gernot (Data61, Kensington NSW) wrote:
On 29 Jan 2020, at 09:35, Demi M. Obenour wrote:
My understanding is that seL4 plans on scaling to large multi-core server processors by using Intel TSX.
Not sure where you heard this, but it’s not true.
It was my own misinterpretation of a research paper.
What is true is that we have evaluated TSX as a possible implementation of concurrency control for a multicore seL4 on x86. No decision has been made: the verification implications are poorly understood at the moment, and it is unclear whether it is worth using a highly architecture- and manufacturer-specific mechanism for something that is very costly to verify.
That point aside, there are zero plans to make the multicore seL4 kernel scale to a large number of processors, for the simple reason that it doesn’t make sense to run a single seL4 image across many cores.
A simple back-of-the-envelope calculation explains why. A fastpath IPC operation (which should be the operation that dominates kernel time in a well-designed system) costs on the order of 300 cycles single-core. If you do this on a multicore system, then any shared state needs to be migrated to the core that executes the syscall. On a processor with a shared L2 cache (typical of Arm multicores), this adds about a dozen cycles for each cache line that must be migrated. A ballpark figure is that the shared state amounts to about 10 cache lines (and the kernel may need some more micro-optimisation to get to that point). This means the 300 cycles go up to 400-plus, i.e. roughly a 1/3 performance degradation.
How much are those times affected by processor vulnerability mitigations?
If you now look at a manycore processor, cache-line migrations across the die cost on the order of 300–1,000 cycles, so the IPC cost blows out from 300 to 3,000–10,000 cycles. This makes no sense at all.
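The back-of-the-envelope numbers above can be captured in a small cost model. This is just a sketch of the arithmetic in this thread; the parameter values (300-cycle fastpath, ~10 cache lines of shared state, per-line migration costs) are the ballpark figures quoted above, not measurements:

```python
def ipc_cost(base_cycles, lines_migrated, cycles_per_line):
    """Estimated fastpath IPC cost when shared kernel state must
    migrate to the core executing the syscall."""
    return base_cycles + lines_migrated * cycles_per_line

BASE = 300   # single-core fastpath IPC, cycles (ballpark from above)
LINES = 10   # shared kernel state, cache lines (ballpark from above)

# Shared L2 (typical Arm multicore): about a dozen cycles per line
print(ipc_cost(BASE, LINES, 12))     # 420 cycles, roughly 1/3 slowdown

# Manycore, cross-die migration: 300-1,000 cycles per line
print(ipc_cost(BASE, LINES, 300))    # 3300 cycles
print(ipc_cost(BASE, LINES, 1000))   # 10300 cycles
```

The manycore figures land an order of magnitude above the single-core baseline, which is the point being made: the per-line migration cost, not the kernel code itself, dominates.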
The bottom line is that a single-image kernel makes sense where cores share an L2. If only an L3 is shared between a small number of cores, it becomes highly questionable whether a shared image makes sense (and note that Intel processors that appear to have a shared L3 typically have a distributed L3 made to look shared by connecting the slices with a fast token-ring network). Once you are in manycore territory, it is obvious the model does not make sense.
The right model for manycores is the clustered multikernel, where each cluster of cores sharing an L2 runs a single kernel image, and clusters coordinate through user-level mechanisms. This is consistent with seL4 not trying to provide hardware abstraction: the kernel should only do what user level cannot do, or can only do with high overhead. In this case it means making the NUMA architecture explicitly visible at user level, and having the system on top manage it.
That is perfectly understandable, and your explanation was awesome. My worry is less about the scalability of a single seL4 kernel image, and more about being able to build scalable systems on top of seL4. A single seL4 image may not scale, but it should be possible to build a system on top of a multikernel that scales as well as a monolithic kernel does, and that hides much of the complexity from applications. seL4 should not provide hardware abstraction, but it should be both possible and practical to build an OS on top of it that does. Code that pushes the limits of performance will likely still need to be heavily optimized for specific hardware, but that is true today, so it isn’t a regression.
We have not yet worked on the required user-level infrastructure (beyond an exploratory student thesis a couple of years ago).
Gernot
I am very excited to see the results when it does get created! Sincerely, Demi