Re: Some performance problems when testing the 4-core SMP benchmark of the seL4bench project (was: Devel Digest, Vol 127, Issue 1)
Hi Professor Heiser,

On 1 Dec 2021, at 21:09, Gernot Heiser <gernot@unsw.edu.au> wrote:

To understand what’s going on I’d need to know what these numbers are:
- what is being measured, and what’s the 500/1000cy parameter?
- which web site are the “official” numbers from, they aren’t at https://sel4.systems/About/Performance/

First, I got the data for IMX8MM_EVK_64 and TX2 from https://github.com/seL4/sel4bench/actions/runs/1469475721#artifacts, from the sel4bench-results-imx8mm_evk and sel4bench-results-tx2 artifacts. After unpacking them, I found the xxxx_SMP_64.json files.

Secondly, the test is the SMP benchmark from the sel4bench-manifest project; the source file is sel4bench/apps/smp/src/main.c. The test scenario looks like this: a ping and a pong thread are paired on the same core. The ping thread waits for "ipc_normal_delay" cycles, then sends a 0-length IPC message to the pong thread and returns. I think the 500-cycle parameter is how long ipc_normal_delay actually delays.

The above scenario is run on one core or on multiple cores. If we run 4 cores, every core has a ping thread and a pong thread running as described above, and the sum of all cores' ping-pong counts is recorded.

I think this experiment is meant to illustrate that on multiple cores the seL4 kernel's big lock does not hurt multi-core performance. Am I right?

Addition: our seL4_Call performance is in line with the other platforms:

Operation        XXXX     IMX8MM_EVK_64  TX2_64    Notes
seL4_Call        367(0)   378(2)         492(16)   client->server, same vspace, IPC length 0
seL4_ReplyRecv   396(0)   402(2)         513(16)   server->client, same vspace, IPC length 0

Thank you for your help.

-----Original Message-----
From: devel-request@sel4.systems
Sent: 2 December 2021 09:00
To: devel@sel4.systems
Subject: Devel Digest, Vol 127, Issue 1

Message: 2
Date: Wed, 1 Dec 2021 14:58:17 +0000
From: yadong.li <yadong.li@horizon.ai>
Subject: [seL4] some performance problem when test 4 cores SMP benchmark of seL4bench project
To: devel@sel4.systems

Hi,

I am seeing some performance problems when testing the 4-core SMP benchmark of seL4bench on our platform. Our platform is XXX; I got the test data for the IMX8MM_EVK_64 and TX2 platforms from the seL4 website, which I believe are official statistics.

My test results below (ARM platforms; values are mean(stddev)):

Test item               XXX            IMX8MM_EVK_64   TX2
500 cycles, 1 core      636545(46)     625605(29)      598142(365)
500 cycles, 2 cores     897900(2327)   1154209(44)     994298(94)
500 cycles, 3 cores     1301679(2036)  1726043(65)     1497740(127)
500 cycles, 4 cores     1387678(549)   2172109(12674)  1545872(109)
1000 cycles, 1 core     636529(42)     625599(22)      597627(161)
1000 cycles, 2 cores    899212(3384)   1134110(34)     994437(541)
1000 cycles, 3 cores    1297322(5028)  1695385(45)     1497547(714)
1000 cycles, 4 cores    1387149(456)   2174605(81)     1545716(614)

Comparing these data:
1. On one core, the performance of the platforms is similar.
2. On multiple cores, the IMX8MM_EVK_64 result is good: its 4-core result is 3.47 times its 1-core result.
3. The TX2 behaves differently: its 2-core result is 1.66 times its 1-core result, which still looks fine, but its 3-core result has almost the same ping-pong count as its 4-core result. Why does adding a core not increase the count as expected?
4. The performance of our platform is poor: its 3-core result also has almost the same ping-pong count as its 4-core result, and its 4-core result is only about 2 times its 1-core result, which I think is very bad.
5. I want to know: what are the possible causes of the poor performance on our platform XXX and on the TX2?
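For reference, the speedup factors quoted in points 2-4 follow directly from the 500-cycle rows of the table; a minimal sketch of that calculation in C (the values are copied from the table, everything else is illustrative):

#include <stdio.h>

/* Recompute the per-core scaling factors quoted in points 2-4 from the
 * 500-cycle rows of the table above (mean values only). */
int main(void)
{
    const double xxx[]  = { 636545, 897900, 1301679, 1387678 };
    const double imx8[] = { 625605, 1154209, 1726043, 2172109 };
    const double tx2[]  = { 598142, 994298, 1497740, 1545872 };

    for (int cores = 1; cores <= 4; cores++) {
        printf("%d core(s): XXX %.2fx, IMX8MM_EVK_64 %.2fx, TX2 %.2fx of the 1-core result\n",
               cores,
               xxx[cores - 1] / xxx[0],
               imx8[cores - 1] / imx8[0],
               tx2[cores - 1] / tx2[0]);
    }
    return 0;
}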
On 2 Dec 2021, at 16:17, yadong.li <yadong.li@horizon.ai> wrote:
First, I got the data for IMX8MM_EVK_64 and TX2 from https://github.com/seL4/sel4bench/actions/runs/1469475721#artifacts, from the sel4bench-results-imx8mm_evk and sel4bench-results-tx2 artifacts; after unpacking them, I found the xxxx_SMP_64.json files.

Secondly, the test is the SMP benchmark from the sel4bench-manifest project; the source file is sel4bench/apps/smp/src/main.c. The test scenario looks like this: a ping and a pong thread are paired on the same core. The ping thread waits for "ipc_normal_delay" cycles, then sends a 0-length IPC message to the pong thread and returns. I think the 500-cycle parameter is how long ipc_normal_delay actually delays.
The above scenario is run on one core or on multiple cores. If we run 4 cores, every core has a ping thread and a pong thread running as described above, and the sum of all cores' ping-pong counts is recorded.
ok, but what is the metric reported? [Apologies for not being on top of the details of our benchmarking setups.] It cannot simply be throughput, else the doubled delay should be reflected in a significantly reduced throughput, but it has almost no effect.
I think this experiment is meant to illustrate that on multiple cores the seL4 kernel's big lock does not hurt multi-core performance. Am I right?
Not quite. As there’s only one big lock, only one core can execute the kernel at any time. If one core is in the IPC while another core is trying to IPC, even though both IPCs are core-local, the second will have to wait until the first gets out of the lock. As the delay is higher than the syscall latency, you’d expect perfect scalability from one core to two (with the lock essentially synchronising the threads). For 3 or 4 cores the combined latency of two IPCs is larger than the 500cy delay and you expect lock contention, resulting in reduced scaling, while it should still scale almost perfectly with the 1000cy delay. This is exactly what you see for the i.MX8 and the TX2.
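To put rough numbers on this argument, a back-of-envelope sketch in C (illustrative only: the 1.8 GHz clock, the 400-cycle IPC cost, and the assumptions that the whole IPC path is serialized under the big lock and that both the ping and the pong thread perform the delay are assumed values, not measurements):

#include <stdio.h>

/*
 * Naive model of the expected big-lock scaling described above.
 *   t_ipc - assumed cycles for one IPC (seL4_Call or seL4_ReplyRecv),
 *           treated as fully serialized by the kernel's big lock;
 *   delay - the 500cy/1000cy user-level delay, assumed to be performed by
 *           both the ping and the pong thread, so it appears twice per round trip.
 * One round trip per core takes about 2*delay + 2*t_ipc cycles, of which
 * 2*t_ipc must execute under the lock, one core at a time.
 */
int main(void)
{
    const double clock_hz = 1.8e9;            /* assumed core clock */
    const double t_ipc    = 400.0;            /* assumed cycles per IPC */
    const double delays[] = { 500.0, 1000.0 };

    for (int d = 0; d < 2; d++) {
        double lock_cap = clock_hz / (2.0 * t_ipc);   /* max round trips/s the lock admits */
        for (int cores = 1; cores <= 4; cores++) {
            double unconstrained = cores * clock_hz / (2.0 * delays[d] + 2.0 * t_ipc);
            double expected = unconstrained < lock_cap ? unconstrained : lock_cap;
            printf("%4.0fcy delay, %d core(s): expect ~%.2fM round trips/s\n",
                   delays[d], cores, expected / 1e6);
        }
    }
    return 0;
}

With these assumed numbers the model predicts near-perfect scaling up to 2 cores at 500cy and up to 3 cores at 1000cy, then reduced scaling once the lock saturates, which is the shape described above.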
Addition: our seL4_Call performance is in line with the other platforms:

Operation        XXXX     IMX8MM_EVK_64  TX2_64    Notes
seL4_Call        367(0)   378(2)         492(16)   client->server, same vspace, IPC length 0
seL4_ReplyRecv   396(0)   402(2)         513(16)   server->client, same vspace, IPC length 0
OK, so baseline performance is good. But are these measured on a single-core or SMP kernel (i.e. is locking included)?
My test results below (ARM platforms; values are mean(stddev)):

Test item               XXX            IMX8MM_EVK_64   TX2
500 cycles, 1 core      636545(46)     625605(29)      598142(365)
500 cycles, 2 cores     897900(2327)   1154209(44)     994298(94)
500 cycles, 3 cores     1301679(2036)  1726043(65)     1497740(127)
500 cycles, 4 cores     1387678(549)   2172109(12674)  1545872(109)
1000 cycles, 1 core     636529(42)     625599(22)      597627(161)
1000 cycles, 2 cores    899212(3384)   1134110(34)     994437(541)
1000 cycles, 3 cores    1297322(5028)  1695385(45)     1497547(714)
1000 cycles, 4 cores    1387149(456)   2174605(81)     1545716(614)
I notice your standard deviations for 2 and 3 cores are surprisingly high (although still small in relative terms). Did you try running the same again? Are the numbers essentially the same or are multiple runs all over the shop? There are some issues with our benchmarking methodology. Fixing up sel4bench is one of the projects I’d like to do if I got a student for it, or maybe someone from the community would want to help? But just from looking at the data I’m not sure that’s the issue here. Gernot
On Thu, Dec 2, 2021 at 9:28 PM Gernot Heiser <gernot@unsw.edu.au> wrote:
ok, but what is the metric reported? [Apologies for not being on top of the details of our benchmarking setups.]
Looking at the sel4bench smp benchmark implementation, the metric is the total number of "operations" in a single second. An operation is a round trip intra address space seL4_Call + seL4_ReplyRecv between 2 threads on the same core with each thread delaying for the cycle count before performing the next operation. After 1 second of all cores performing these operations continuously and maintaining a core-local (on a separate cache line) count, the total number of operations is added together and reported as the final number. So you would expect that the reported metric would scale following Amdahl's law based on the proportion of an operation that is serialized inside the kernel lock which would potentially vary across platforms.
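For readers who have not looked at the benchmark, a heavily simplified sketch of one core's ping/pong pair, based on the description above (illustrative only: the identifiers, the busy-wait delay and the use of the non-MCS seL4 system-call API are assumptions, not the actual code in sel4bench/apps/smp/src/main.c):

#include <sel4/sel4.h>

/* Per-core round-trip counter; the real benchmark keeps this on its own
 * cache line and sums the counters of all cores after one second. */
volatile seL4_Word local_count;

/* Stand-in for the benchmark's cycle-counted delay (the 500cy/1000cy parameter). */
void delay(seL4_Word cycles)
{
    for (volatile seL4_Word i = 0; i < cycles; i++) {
        /* burn time */
    }
}

/* "Ping" thread: delay, then a 0-length round-trip IPC to its partner. */
void ping_fn(seL4_CPtr ep, seL4_Word delay_cycles)
{
    seL4_MessageInfo_t info = seL4_MessageInfo_new(0, 0, 0, 0);  /* length 0 */
    for (;;) {
        delay(delay_cycles);
        seL4_Call(ep, info);   /* client -> server and back */
        local_count++;         /* one completed operation */
    }
}

/* "Pong" thread on the same core: answer each call, also delaying per operation. */
void pong_fn(seL4_CPtr ep, seL4_Word delay_cycles)
{
    seL4_Word badge;
    seL4_MessageInfo_t info = seL4_Recv(ep, &badge);
    for (;;) {
        delay(delay_cycles);
        info = seL4_ReplyRecv(ep, info, &badge);  /* server -> client, wait for next call */
    }
}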
On 7 Dec 2021, at 13:43, Kent Mcleod <kent.mcleod72@gmail.com> wrote:
Looking at the sel4bench smp benchmark implementation, the metric is the total number of "operations" in a single second. [...]
Thanks for the explanation, Kent.

Observations:
1) The metric is essentially independent of the delay. Looking at the single-core figures for the i.MX8, I get 1598.5 ns in both cases, the difference being 15 ps. Doesn't make sense to me.
2) Assuming this processor runs at the 1.8 GHz it seems speced for, this corresponds to 2877 cycles, which is huge, even if the 1000cy delay is subtracted!
3) As I said before, intra-AS IPC is a meaningless metric we should never use (but that's incidental to the particular thing we want to measure here).
4) Having to do these calculations to understand the numbers is a sure indication that the results are presented in an unsuitable form.

I can't see how these figures make sense.

Gernot
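For reference, a quick sketch reproducing the arithmetic behind observations 1) and 2), using the single-core i.MX8MM figures from the table and the 1.8 GHz clock assumed above:

#include <stdio.h>

int main(void)
{
    const double ops_500  = 625605.0;  /* reported mean, 500cy delay, 1 core */
    const double ops_1000 = 625599.0;  /* reported mean, 1000cy delay, 1 core */
    const double clock_hz = 1.8e9;     /* assumed clock frequency */

    double ns_500  = 1e9 / ops_500;    /* time per operation in ns */
    double ns_1000 = 1e9 / ops_1000;

    printf("per operation: %.2f ns vs %.2f ns, difference %.1f ps\n",
           ns_500, ns_1000, (ns_1000 - ns_500) * 1e3);
    printf("at 1.8 GHz: about %.0f cycles per operation\n",
           ns_500 * 1e-9 * clock_hz);
    return 0;
}

This prints roughly 1598.45 ns vs 1598.47 ns (about 15 ps apart) and about 2877 cycles, matching the figures above.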
participants (3): Gernot Heiser, Kent Mcleod, yadong.li