On 3/10/19, Gernot.Heiser(a)data61.csiro.au <Gernot.Heiser(a)data61.csiro.au> wrote:
You need to keep in mind that, as I said in my blog, seL4 is designed as a
minimal wrapper, just enough to allow securely multiplexing hardware. seL4
not migrating threads on its own is a feature, not a bug, as any migration
would require a policy, and that needs to be defined at user level. seL4
gives you all the mechanisms you need. E.g., you can implement a Linux-like
scheduler on top (in fact, a simplified version of this was done in our
EuroSys’18 paper on scheduling contexts, i.e. the one describing the MCS
kernel,
https://ts.data61.csiro.au/publications/csiroabstracts/Lyons_MAH_18.abstrac…).
Yes, it does make sense for stuff like thread migration to be left up
to user code.
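For instance, under the SMP build of the current (non-MCS) kernel, where
seL4_TCB_SetAffinity() is the migration mechanism, a trivial user-level
balancer could look like the sketch below; the load metric, the arrays and
the policy itself are hypothetical placeholders, and under MCS the same
effect would come from reconfiguring the thread's scheduling context with
the target core's SchedControl cap:

#include <sel4/sel4.h>

#define NUM_CORES 4

extern seL4_CPtr worker_tcb[];          /* TCB caps held by the balancer     */
extern unsigned  worker_core[];         /* where each worker currently runs  */
extern unsigned  core_load[NUM_CORES];  /* whatever load metric you maintain */

/* Move one thread off the busiest core onto the idlest one. The kernel only
 * provides the SetAffinity mechanism; when and what to move is entirely
 * user-level policy. */
void balance_once(unsigned victim)
{
    unsigned busiest = 0, idlest = 0;
    for (unsigned c = 1; c < NUM_CORES; c++) {
        if (core_load[c] > core_load[busiest]) busiest = c;
        if (core_load[c] < core_load[idlest])  idlest  = c;
    }
    if (busiest == idlest || worker_core[victim] != busiest)
        return;

    int err = seL4_TCB_SetAffinity(worker_tcb[victim], idlest);
    if (err == seL4_NoError)
        worker_core[victim] = idlest;
}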
AFAIK, seL4 doesn’t have scalability issues if used properly (and it’s not
the kernel’s job to support ill-considered uses).
To understand this, think of a typical Arm multicore SoC: 4–16 or so cores,
shared L2, and an inter-core cache-line migration latency of 10–20 cycles. Even
assuming a syscall needs to access about 10 write-shared lines, that's only
100–200 cycles, well below the syscall latency. A shared kernel image makes sense, as
anything else would force high overheads (from explicit cross-core
communication/synchronisation at user level).
Now think of a high-end x86 server processor. Shipping a single cache line
from a core at the periphery to the opposite end of the die costs many
hundreds, probably more than 1,000 cycles. If the syscall needs 10 of those,
the latency of migrating cache lines alone is over an order of magnitude
larger than the cost of a fast syscall. It makes no sense whatsoever to share
the kernel image here; you'd be forcing high overheads onto every syscall.
A single, shared seL4 image only makes sense in a domain where the
cache-line migration cost remains small compared to the syscall cost. Beyond
that, you want separate kernel images, i.e. a multikernel.
But this doesn’t limit what you can do on top. If you want to run a single
Unix system across the whole thing, this can make sense (Unix syscalls are
orders of magnitude more expensive than seL4 syscalls, so the trade-offs are
quite different). And you can do this across a set of seL4 clusters; they
basically present you with a NUMA system. That makes sense, as the hardware
will be NUMA as well.
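(Purely as an illustration of what user-level middleware might hand to a
Unix personality running across such clusters: a per-cluster descriptor
along the lines below, exposed as a NUMA node. Every name here is
hypothetical; nothing like this exists in seL4 or in any current library.)

#include <sel4/sel4.h>

#define MAX_CORES_PER_CLUSTER 16

/* One entry per seL4 kernel image ("cluster"); the Unix layer treats each
 * entry as a NUMA node and schedules/allocates accordingly. */
struct cluster_desc {
    unsigned   node_id;                 /* NUMA node presented to Unix code   */
    unsigned   num_cores;               /* cores sharing this kernel image    */
    seL4_CPtr  peer_ep[MAX_CORES_PER_CLUSTER];  /* endpoints/notifications
                                                   for cross-cluster messages */
    void      *shared_window;           /* memory visible to the other clusters */
    seL4_Word  shared_window_bytes;
};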
My plan is to implement Unix-like read()/write()-family APIs as close to the
microkernel as possible, with both implemented within the low-level system
library as a single kernel call in the client to send and a single kernel
call in the server to reply whenever possible; the message payload would
either be stuffed into the available message registers (with registers
reserved for type, offset, and length) if it is short enough, or placed in a
shared buffer if it is longer. As I said before, there will be variants of
read and write that expose the message registers or shared buffer more
directly, but these will interoperate with each other and with the
traditional API; since the seL4 API won't be exported to normal user code at
all, the Unix-like transport layer will have to be as efficient as possible.
All other Unix calls besides read()/write()/seek()-family APIs will be
implemented either in servers or purely in libraries.
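Concretely, the client half of that transport could look roughly like the
sketch below, assuming a badged endpoint cap per open file (file_ep), a
shared buffer (shared_buf) already mapped on both sides, and the register
convention above (MR0 = type, MR1 = offset, MR2 = length); OP_WRITE and the
error convention are placeholders, not an actual implementation:

#include <string.h>
#include <sel4/sel4.h>

#define OP_WRITE  2         /* hypothetical "type" value                    */
#define HDR_MRS   3         /* MR0 = type, MR1 = offset, MR2 = length       */

extern seL4_CPtr file_ep;    /* badged endpoint for this open file          */
extern char     *shared_buf; /* buffer shared with the server ahead of time */

long transport_write(const void *buf, seL4_Word len, seL4_Word offset)
{
    /* seL4 calls all of these "message registers", though only the first
     * few are hardware registers; only short messages take the fastpath. */
    seL4_Word inline_max = (seL4_MsgMaxLength - HDR_MRS) * sizeof(seL4_Word);
    int use_regs = len <= inline_max;

    seL4_SetMR(0, OP_WRITE);
    seL4_SetMR(1, offset);
    seL4_SetMR(2, len);

    if (use_regs) {
        /* Short payload: stuff it into the remaining message registers. */
        for (seL4_Word i = 0; i < len; i += sizeof(seL4_Word)) {
            seL4_Word w = 0;
            seL4_Word n = len - i < sizeof(seL4_Word) ? len - i
                                                      : sizeof(seL4_Word);
            memcpy(&w, (const char *)buf + i, n);
            seL4_SetMR(HDR_MRS + i / sizeof(seL4_Word), w);
        }
    } else {
        /* Long payload: place it in the shared buffer instead. */
        memcpy(shared_buf, buf, len);
    }

    seL4_Word mrs = HDR_MRS +
        (use_regs ? (len + sizeof(seL4_Word) - 1) / sizeof(seL4_Word) : 0);

    /* The single kernel call on the client side: send and block for reply. */
    seL4_Call(file_ep, seL4_MessageInfo_new(0, 0, 0, mrs));

    return (long)seL4_GetMR(0);   /* bytes written, or a negative error */
}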
Is there anything that would prevent a Unix-like transport layer on
seL4 from using a single kernel call to send and a single kernel call
to reply?
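On the server side, what I mean by a single kernel call to reply is the
usual seL4_ReplyRecv() pattern, sketched here with the non-MCS API (under
MCS the reply object is passed explicitly); handle_request() is a
placeholder, and the badge would identify the calling client/file:

#include <sel4/sel4.h>

extern seL4_CPtr io_ep;     /* same endpoint the clients Call on */
extern long handle_request(seL4_Word type, seL4_Word offset, seL4_Word len);

void serve(void)
{
    seL4_Word badge;
    seL4_Recv(io_ep, &badge);           /* block for the first request */

    for (;;) {
        long ret = handle_request(seL4_GetMR(0), seL4_GetMR(1), seL4_GetMR(2));
        seL4_SetMR(0, (seL4_Word)ret);

        /* Reply to the caller and wait for the next request: one kernel entry. */
        seL4_ReplyRecv(io_ep, seL4_MessageInfo_new(0, 0, 0, 1), &badge);
    }
}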
This clustered multikernel story is described in a paper that’s now a few
years old:
https://ts.data61.csiro.au/publications/nictaabstracts/Peters_DEH_15.abstra…
Unfortunately, we never got around to building the middleware supporting this
(as people are throwing money at us to do other cool things), so you’re on
your own there for now. But there’s nothing inherent in seL4 that stops
you.
Is the kernel-side clustering code in mainline seL4?
As I said, I don’t think there is a scalability issue. And forking is
generally a bad idea; in the case of seL4 it's an especially bad one, as
you’ll lose its biggest asset, the verification story (or have to do it
yourself).
I don't plan to fork seL4 unless there is some fundamental limitation that
would either significantly constrain scalability or cause significant
overhead for a Unix-like IPC transport layer.