On Thu, Dec 2, 2021 at 9:28 PM Gernot Heiser <gernot@unsw.edu.au> wrote:
On 2 Dec 2021, at 16:17, yadong.li <yadong.li@horizon.ai> wrote:
First, I got the data for IMX8MM_EVK_64 and TX2 from https://github.com/seL4/sel4bench/actions/runs/1469475721#artifacts (the sel4bench-results-imx8mm_evk and sel4bench-results-tx2 files). After unpacking them, I found xxxx_SMP_64.json.
Second, the test is the SMP benchmark from the sel4bench-manifest project; the source file is sel4bench/apps/smp/src/main.c. The test scenario looks like this: a ping-pong pair of threads runs on the same core; the ping thread waits for "ipc_normal_delay", then sends a 0-length IPC message to the pong thread, which returns. I think the 500 cycles is how long ipc_normal_delay actually delays.
The above scenario is run on one core or on multiple cores. If we run 4 cores, every core has a ping thread and a pong thread running as described above, and we record the sum of the ping-pong counts across all cores.
ok, but what is the metric reported? [Apologies for not being on top of the details of our benchmarking setups.]
Looking at the sel4bench smp benchmark implementation, the metric is the total number of "operations" in a single second. An operation is a round-trip, intra-address-space seL4_Call + seL4_ReplyRecv between 2 threads on the same core, with each thread delaying for the configured cycle count before performing the next operation. After 1 second of all cores performing these operations continuously and maintaining a core-local count (on a separate cache line), the per-core counts are added together and reported as the final number. So you would expect the reported metric to scale following Amdahl's law, based on the proportion of an operation that is serialized inside the kernel lock, which would potentially vary across platforms.
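For readers without the source at hand, here is a minimal sketch of the per-core ping/pong pair just described. It is not the actual sel4bench code (see sel4bench/apps/smp/src/main.c for that): thread and endpoint setup, the cycle-accurate delay, and the collection of the per-core counters after the 1-second window are omitted; CYCLE_DELAY stands in for ipc_normal_delay; and the non-MCS seL4 API signatures are assumed.

    /* Sketch only -- not the actual benchmark code in
     * sel4bench/apps/smp/src/main.c. Non-MCS seL4 API assumed. */
    #include <sel4/sel4.h>

    #define CYCLE_DELAY 500          /* ipc_normal_delay: 500cy or 1000cy */

    static volatile int run = 1;     /* cleared when the 1-second window ends */
    static volatile seL4_Word count; /* per-core counter; the real code keeps
                                        it on its own cache line */

    static void delay_cycles(unsigned long n)
    {
        /* placeholder: the real benchmark spins on the cycle counter */
        for (volatile unsigned long i = 0; i < n; i++);
    }

    /* "ping": delay, then a 0-length seL4_Call to the pong thread on the same core */
    static void ping_fn(seL4_CPtr ep)
    {
        seL4_MessageInfo_t tag = seL4_MessageInfo_new(0, 0, 0, 0);
        while (run) {
            delay_cycles(CYCLE_DELAY);
            seL4_Call(ep, tag);      /* blocks until pong replies */
            count++;                 /* one completed operation */
        }
    }

    /* "pong": answer each call; seL4_ReplyRecv replies and waits in one syscall */
    static void pong_fn(seL4_CPtr ep)
    {
        seL4_Word badge;
        seL4_Recv(ep, &badge);       /* wait for the first call */
        while (run) {
            seL4_ReplyRecv(ep, seL4_MessageInfo_new(0, 0, 0, 0), &badge);
        }
    }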
It cannot simply be throughput, else the doubled delay should be reflected in a significantly reduced throughput, but it has almost no effect.
I think this experiment is used to illustrate that, with multiple cores, the seL4 kernel's big lock does not hurt multi-core performance, am I right?
Not quite. As there’s only one big lock, only one core can execute the kernel at any time. If one core is in the IPC while another core is trying to IPC, even though both IPCs are core-local, the second will have to wait until the first gets out of the lock.
As the delay is higher than the syscall latency, you’d expect perfect scalability from one core to two (with the lock essentially synchronising the threads). For 3 or 4 cores the combined latency of two IPCs is larger than the 500cy delay and you expect lock contention, resulting in reduced scaling, while it should still scale almost perfectly with the 1000cy delay. This is exactly what you see for the i.MX8 and the TX2.
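To make that arithmetic concrete, here is a back-of-envelope sketch (not part of sel4bench). LOCK_CYCLES is an assumed per-IPC lock-holding time of 300 cycles, somewhat below the ~370-400cy single-core seL4_Call/seL4_ReplyRecv latencies quoted below, since part of those latencies is spent outside the lock; treat the output as a qualitative trend rather than a prediction.

    /* Back-of-envelope model of big-lock contention in the ping-pong benchmark.
     * LOCK_CYCLES (300cy) is an assumed per-IPC lock-holding time, a bit below
     * the measured ~370-400cy single-core IPC latencies. While one core sits in
     * its delay, each of the other cores wants the lock for roughly one IPC. */
    #include <stdio.h>

    #define LOCK_CYCLES 300u

    int main(void)
    {
        const unsigned delays[] = { 500u, 1000u };

        for (int d = 0; d < 2; d++) {
            for (unsigned cores = 2; cores <= 4; cores++) {
                unsigned demand = (cores - 1) * LOCK_CYCLES;
                printf("delay %4ucy, %u cores: lock demand ~%4ucy -> %s\n",
                       delays[d], cores, demand,
                       demand > delays[d] ? "contention expected" : "should scale");
            }
        }
        return 0;
    }

With that assumed figure the verdict flips exactly where described above: the other cores' combined lock demand stays below a 1000cy delay but exceeds a 500cy delay once 3 or 4 cores are running.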
Addition: our seL4_Call performance is about the same as on the other platforms:

                    XXXX     IMX8MM_EVK_64  TX2_64
    seL4_Call       367(0)   378(2)         492(16)   client->server, same vspace, ipc_len is 0
    seL4_ReplyRecv  396(0)   402(2)         513(16)   server->client, same vspace, ipc_len is 0
OK, so baseline performance is good. But are these measured on a single-core or SMP kernel (i.e. is locking included)?
My test results are below (ARM platforms, mean(stddev)):

    Test item             XXX            IMX8MM_EVK_64   TX2
    500 cycles, 1 core    636545(46)     625605(29)      598142(365)
    500 cycles, 2 cores   897900(2327)   1154209(44)     994298(94)
    500 cycles, 3 cores   1301679(2036)  1726043(65)     1497740(127)
    500 cycles, 4 cores   1387678(549)   2172109(12674)  1545872(109)
    1000 cycles, 1 core   636529(42)     625599(22)      597627(161)
    1000 cycles, 2 cores  899212(3384)   1134110(34)     994437(541)
    1000 cycles, 3 cores  1297322(5028)  1695385(45)     1497547(714)
    1000 cycles, 4 cores  1387149(456)   2174605(81)     1545716(614)
I notice your standard deviations for 2 and 3 cores are surprisingly high (although still small in relative terms).
Did you try running the same again? Are the numbers essentially the same or are multiple runs all over the shop?
There are some issues with our benchmarking methodology. Fixing up sel4bench is one of the projects I’d like to do if I got a student for it, or maybe someone from the community would want to help?
But just from looking at the data I’m not sure that’s the issue here.
Gernot