Re: some performance problems when testing the 4-core SMP benchmark of the seL4bench project (Reply: Devel Digest, Vol 127, Issue 1)
Hi Professor Heiser,
To understand what’s going on I’d need to know what these numbers are:
- what is being measured, and what’s the 500/100cy parameter?
- which web site are the “official” numbers from? They aren’t at https://sel4.systems/About/Performance/
First, I got the data for IMX8MM_EVK_64 and TX2 from https://github.com/seL4/sel4bench/actions/runs/1469475721#artifacts, from the sel4bench-results-imx8mm_evk and sel4bench-results-tx2 artifacts; after unpacking them, I found xxxx_SMP_64.json.
Secondly, the test is the SMP benchmark from the sel4bench-manifest project; the source file is sel4bench/apps/smp/src/main.c.
The test scenario looks like this:
A ping-pong pair of threads runs on the same core. The ping thread waits for the "ipc_normal_delay" time, sends a zero-length IPC message to the pong thread, and then returns. I think the 500 cycles is how long ipc_normal_delay actually delays.
The above scenario is run on one core or on multiple cores. If we run on 4 cores, every core has a ping thread and a pong thread running as described above, and the sum of all cores' ping-pong counts is recorded.
I think this experiment is meant to illustrate that on multiple cores our seL4 kernel's big lock will not hurt multi-core performance. Am I right?
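[Purely for illustration, a minimal sketch of how I understand one ping-pong pair works. This is not the actual sel4bench/apps/smp/src/main.c code; DELAY_CYCLES, ping_fn, pong_fn and the busy-wait loop are placeholder names of mine, and it assumes the non-MCS seL4 API.]

#include <sel4/sel4.h>

#define DELAY_CYCLES 500            /* the "ipc_normal_delay" parameter */

static volatile seL4_Word ops;      /* this core's round-trip count, summed over all cores at the end */

static void delay(seL4_Word cycles)
{
    /* crude busy-wait; the real benchmark uses a calibrated delay */
    for (volatile seL4_Word i = 0; i < cycles; i++);
}

/* ping: delay, then make a zero-length seL4_Call to the pong thread */
static void ping_fn(seL4_CPtr ep)
{
    seL4_MessageInfo_t info = seL4_MessageInfo_new(0, 0, 0, 0);   /* ipc_len == 0 */
    for (;;) {
        delay(DELAY_CYCLES);
        seL4_Call(ep, info);
        ops++;                      /* one completed round trip */
    }
}

/* pong: delay, then answer with seL4_ReplyRecv */
static void pong_fn(seL4_CPtr ep)
{
    seL4_MessageInfo_t info = seL4_MessageInfo_new(0, 0, 0, 0);
    seL4_Recv(ep, NULL);
    for (;;) {
        delay(DELAY_CYCLES);
        info = seL4_ReplyRecv(ep, info, NULL);
    }
}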
Addition:
Our seL4_Call performance is in line with the other platforms:

                  XXXX      IMX8MM_EVK_64   TX2_64
seL4_Call         367(0)    378(2)          492(16)    client->server, same vspace, ipc_len is 0
seL4_ReplyRecv    396(0)    402(2)          513(16)    server->client, same vspace, ipc_len is 0
Thank you for your help
On 2 Dec 2021, at 16:17, yadong.li wrote:
First, I got the data for IMX8MM_EVK_64 and TX2 from https://github.com/seL4/sel4bench/actions/runs/1469475721#artifacts, from the sel4bench-results-imx8mm_evk and sel4bench-results-tx2 artifacts; after unpacking them, I found xxxx_SMP_64.json. Secondly, the test is the SMP benchmark from the sel4bench-manifest project; the source file is sel4bench/apps/smp/src/main.c. The test scenario looks like this: a ping-pong pair of threads runs on the same core; the ping thread waits for the "ipc_normal_delay" time, sends a zero-length IPC message to the pong thread, and then returns. I think the 500 cycles is how long ipc_normal_delay actually delays.
The above scenario is run on one core or on multiple cores. If we run on 4 cores, every core has a ping thread and a pong thread running as described above, and the sum of all cores' ping-pong counts is recorded.
ok, but what is the metric reported? [Apologies for not being on top of the details of our benchmarking setups.] It cannot simply be throughput, else the doubled delay should be reflected in a significantly reduced throughput, but it has almost no effect.
I think this experiment is meant to illustrate that on multiple cores our seL4 kernel's big lock will not hurt multi-core performance. Am I right?
Not quite. As there’s only one big lock, only one core can execute the kernel at any time. If one core is in the IPC while another core is trying to IPC, even though both IPCs are core-local, the second will have to wait until the first gets out of the lock. As the delay is higher than the syscall latency, you’d expect perfect scalability from one core to two (with the lock essentially synchronising the threads). For 3 or 4 cores the combined latency of two IPCs is larger than the 500cy delay and you expect lock contention, resulting in reduced scaling, while it should still scale almost perfectly with the 1000cy delay. This is exactly what you see for the i.MX8 and the TX2.
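[A rough back-of-envelope form of this argument, as I understand it: let $t_s$ be the lock-held time of a single IPC syscall and $d$ the per-thread delay, so each core wants the lock about once every $t_s + d$. The lock stays uncontended as long as the kernel work of the other $N-1$ cores fits into one core's delay window:

\[
  (N-1)\,t_s \;\le\; d \quad\Rightarrow\quad \text{near-linear scaling,}
\]
\[
  (N-1)\,t_s \;>\; d \quad\Rightarrow\quad \text{lock queueing; aggregate throughput tops out near } 1/t_s .
\]

With $t_s$ in the 400--500 cycle range (per the baseline IPC numbers below), $N=2$ fits under a 500-cycle delay while $N=3,4$ do not, which is the contention pattern described above.]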
Addition: Our seL4_Call performance is in line with the other platforms:

                  XXXX      IMX8MM_EVK_64   TX2_64
seL4_Call         367(0)    378(2)          492(16)    client->server, same vspace, ipc_len is 0
seL4_ReplyRecv    396(0)    402(2)          513(16)    server->client, same vspace, ipc_len is 0
OK, so baseline performance is good. But are these measured on a single-core or SMP kernel (i.e. is locking included)?
My test results are below (ARM platforms, values are mean(stddev)):

Test item             XXX             IMX8MM_EVK_64    TX2
500 cycles, 1 core    636545(46)      625605(29)       598142(365)
500 cycles, 2 cores   897900(2327)    1154209(44)      994298(94)
500 cycles, 3 cores   1301679(2036)   1726043(65)      1497740(127)
500 cycles, 4 cores   1387678(549)    2172109(12674)   1545872(109)
1000 cycles, 1 core   636529(42)      625599(22)       597627(161)
1000 cycles, 2 cores  899212(3384)    1134110(34)      994437(541)
1000 cycles, 3 cores  1297322(5028)   1695385(45)      1497547(714)
1000 cycles, 4 cores  1387149(456)    2174605(81)      1545716(614)
I notice your standard deviations for 2 and 3 cores are surprisingly high (although still small in relative terms). Did you try running the same again? Are the numbers essentially the same, or are multiple runs all over the shop?

There are some issues with our benchmarking methodology. Fixing up sel4bench is one of the projects I’d like to do if I got a student for it, or maybe someone from the community would want to help? But just from looking at the data I’m not sure that’s the issue here.

Gernot
On Thu, Dec 2, 2021 at 9:28 PM Gernot Heiser wrote:
On 2 Dec 2021, at 16:17, yadong.li wrote:
First, I got the data for IMX8MM_EVK_64 and TX2 from https://github.com/seL4/sel4bench/actions/runs/1469475721#artifacts, from the sel4bench-results-imx8mm_evk and sel4bench-results-tx2 artifacts; after unpacking them, I found xxxx_SMP_64.json. Secondly, the test is the SMP benchmark from the sel4bench-manifest project; the source file is sel4bench/apps/smp/src/main.c. The test scenario looks like this: a ping-pong pair of threads runs on the same core; the ping thread waits for the "ipc_normal_delay" time, sends a zero-length IPC message to the pong thread, and then returns. I think the 500 cycles is how long ipc_normal_delay actually delays.
The above scenario is run on one core or on multiple cores. If we run on 4 cores, every core has a ping thread and a pong thread running as described above, and the sum of all cores' ping-pong counts is recorded.
ok, but what is the metric reported? [Apologies for not being on top of the details of our benchmarking setups.]
Looking at the sel4bench smp benchmark implementation, the metric is the total number of "operations" in a single second. An operation is a round trip intra address space seL4_Call + seL4_ReplyRecv between 2 threads on the same core with each thread delaying for the cycle count before performing the next operation. After 1 second of all cores performing these operations continuously and maintaining a core-local (on a separate cache line) count, the total number of operations is added together and reported as the final number. So you would expect that the reported metric would scale following Amdahl's law based on the proportion of an operation that is serialized inside the kernel lock which would potentially vary across platforms.
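[For illustration only, a minimal sketch of the counting scheme just described: each core keeps its own counter, padded to a cache line to avoid false sharing, and the counters are summed after the one-second window. Names like per_core_count_t and record_op are placeholders of mine, not the actual sel4bench code.]

#include <stdint.h>

#define CACHE_LINE_SIZE 64
#define MAX_CORES       4

typedef struct {
    volatile uint64_t ops;                               /* round trips completed on this core */
    uint8_t pad[CACHE_LINE_SIZE - sizeof(uint64_t)];     /* pad to a full cache line */
} __attribute__((aligned(CACHE_LINE_SIZE))) per_core_count_t;

static per_core_count_t counts[MAX_CORES];

/* called by the benchmark thread on its own core after each round trip */
static inline void record_op(int core)
{
    counts[core].ops++;
}

/* called once, after all cores have run for one second */
static uint64_t total_ops(int ncores)
{
    uint64_t sum = 0;
    for (int i = 0; i < ncores; i++) {
        sum += counts[i].ops;
    }
    return sum;   /* this total is the reported metric (operations per second) */
}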
It cannot simply be throughput, else the doubled delay should be reflected in a significantly reduced throughput, but it has almost no effect.
I think this experiment is meant to illustrate that on multiple cores our seL4 kernel's big lock will not hurt multi-core performance. Am I right?
Not quite. As there’s only one big lock, only one core can execute the kernel at any time. If one core is in the IPC while another core is trying to IPC, even though both IPCs are core-local, the second will have to wait until the first gets out of the lock.
As the delay is higher than the syscall latency, you’d expect perfect scalability from one core to two (with the lock essentially synchronising the threads). For 3 or 4 cores the combined latency of two IPCs is larger than the 500cy delay and you expect lock contention, resulting in reduced scaling, while it should still scale almost perfectly with the 1000cy delay. This is exactly what you see for the i.MX8 and the TX2.
Addition: Our seL4_Call performance is in line with the other platforms:

                  XXXX      IMX8MM_EVK_64   TX2_64
seL4_Call         367(0)    378(2)          492(16)    client->server, same vspace, ipc_len is 0
seL4_ReplyRecv    396(0)    402(2)          513(16)    server->client, same vspace, ipc_len is 0
OK, so baseline performance is good. But are these measured on a single-core or SMP kernel (i.e. is locking included)?
My test results are below (ARM platforms, values are mean(stddev)):

Test item             XXX             IMX8MM_EVK_64    TX2
500 cycles, 1 core    636545(46)      625605(29)       598142(365)
500 cycles, 2 cores   897900(2327)    1154209(44)      994298(94)
500 cycles, 3 cores   1301679(2036)   1726043(65)      1497740(127)
500 cycles, 4 cores   1387678(549)    2172109(12674)   1545872(109)
1000 cycles, 1 core   636529(42)      625599(22)       597627(161)
1000 cycles, 2 cores  899212(3384)    1134110(34)      994437(541)
1000 cycles, 3 cores  1297322(5028)   1695385(45)      1497547(714)
1000 cycles, 4 cores  1387149(456)    2174605(81)      1545716(614)
I notice your standard deviations for 2 and 3 cores are surprisingly high (although still small in relative terms).
Did you try running the same again? Are the numbers essentially the same or are multiple runs all over the shop?
There are some issues with our benchmarking methodology. Fixing up sel4bench is one of the projects I’d like to do if I got a student for it, or maybe someone from the community would want to help?
But just from looking at the data I’m not sure that’s the issue here.
Gernot
On 7 Dec 2021, at 13:43, Kent Mcleod wrote:
Looking at the sel4bench smp benchmark implementation, the metric is the total number of "operations" in a single second. An operation is a round trip intra address space seL4_Call + seL4_ReplyRecv between 2 threads on the same core with each thread delaying for the cycle count before performing the next operation. After 1 second of all cores performing these operations continuously and maintaining a core-local (on a separate cache line) count, the total number of operations is added together and reported as the final number. So you would expect that the reported metric would scale following Amdahl's law based on the proportion of an operation that is serialized inside the kernel lock which would potentially vary across platforms.
Thanks for the explanation, Kent.

Observations:
1) The metric is essentially independent of the delay. Looking at the single-core figures for the i.MX8, I get 1598.5 ns in both cases, the difference being 15 ps. Doesn’t make sense to me.
2) Assuming this processor runs at the 1.8 GHz it seems speced for, this corresponds to 2877 cycles, which is huge, even if the 1000cy delay is subtracted!
3) As I said before, intra-AS IPC is a meaningless metric we should never use (but that’s incidental to the particular thing we want to measure here).
4) Having to do these calculations to understand the numbers is a sure indication that the results are presented in an unsuitable form.

I can’t see how these figures make sense.

Gernot
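[The arithmetic behind observations 1) and 2), spelled out, assuming the metric is operations per second and a 1.8 GHz clock:

\[
  \frac{1}{625605\ \mathrm{ops/s}} \approx 1598.45\ \mathrm{ns}, \qquad
  \frac{1}{625599\ \mathrm{ops/s}} \approx 1598.47\ \mathrm{ns}, \qquad
  \Delta \approx 15\ \mathrm{ps}
\]
\[
  1598.5\ \mathrm{ns} \times 1.8\ \mathrm{GHz} \approx 2877\ \text{cycles per round trip}
\]
]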
Participants (3):
- Gernot Heiser
- Kent Mcleod
- yadong.li