Re: some performance problem when test 4 cores SMP答复: Devel Digest, Vol 127, Issue 5
7 Dec
2021
7 Dec
'21
9:22 p.m.
Hi: > So you would expect that the reported metric would scale following Amdahl's law based on the proportion of an operation that is serialized inside the kernel lock which would potentially vary across platforms. > It cannot simply be throughput, else the doubled delay should be reflected in a significantly reduced throughput, but it has almost no effect. 1. This is what I also feel confused, I think whether it is depend on the implementation of ipc_normal_delay ipc_normal_delay will call below code: delay = OVERHEAD_FIXUP(REXP(id) * current_delay_cycle, overhead); insert a command such as 'zigset(...)'for every core, The ziggurat method for generating random variables, when double delay, the random variables coefficient will reduce the delay influence? 2. " the proportion of an operation that is serialized inside the kernel lock ", CLH lock is serialized between cores, if 4 cores will have big different performance on different ARM platform? > For 3 or 4 cores the combined latency of two IPCs is larger than the 500cy delay and you expect lock contention, resulting in reduced scaling, while it should still scale almost perfectly with the 1000cy delay. > > > Addition: > > Our seL4_Call performance is same with other platform > > XXXX IMX8MM_EVK_64 TX2_64 > > seL4_Call 367(0) 378(2) 492(16) client->server, same vspace, ipc_len is 0 > > seL4_ReplyRecv 396(0) 402(2) 513(16) server->client, same vspace, ipc_len is 0 1. the seL4_Call and seL4_ReplyRecv are measured on SMP kernel 2. or 3 or 4 cores the combined latency of two IPCs is larger than the 500cy delay and you expect lock contention ---- that's what I thought at first, But It can't be explained very well on IMX8MM_EVK_64, base on the test seL4_Call data, when run on 3 cores and 4cores, IMX8MM_EVK_64 still have a good scalability, so I think not all seL4_Call and ReplyRecv flow which cost maybe 300 or 400 is locked(like el trap), TX2 should have a good scalability as expected when run on 3 or 4 cores, but it did not show good, from data below:1497740(3core, 500cycles),1545872(4core) > > My test results below: > > ARM platform > > Test item XXX IMX8MM_EVK_64 TX2 > > mean(Stddev) > > 500 cycles, 1 core 636545(46) 625605(29) 598142(365) > > 500 cycles, 2 cores 897900(2327) 1154209(44) 994298(94) > > 500 cycles, 3 cores 1301679(2036) 1726043(65) 1497740(127) > > 500 cycles, 4 cores 1387678(549) 2172109(12674) 1545872(109) > > 1000 cycles, 1 core 636529(42) 625599(22) 597627(161) > > 1000 cycles, 2 cores 899212(3384) 1134110(34) 994437(541) > > 1000 cycles, 3 cores 1297322(5028) 1695385(45) 1497547(714) > > 1000 cycles, 4 cores 1387149(456) 2174605(81) 1545716(614) > I notice your standard deviations for 2 and 3 cores are surprisingly high (although still small in relative terms). > > Did you try running the same again? Are the numbers essentially the same or are multiple runs all over the shop? 1. Yes,I test several times,I always get the same results, have little different 2. I think the stddev is small relative to our test results, IMX8MM_EVK_64 also have a big stddev(500 cycles, 3core) from the data. -----邮件原件----- 发件人: devel-request@sel4.systems [mailto:devel-request@sel4.systems] 发送时间: 2021年12月7日 11:45 收件人: devel@sel4.systems 主题: Devel Digest, Vol 127, Issue 5 Send Devel mailing list submissions to devel@sel4.systems To subscribe or unsubscribe via email, send a message with subject or body 'help' to devel-request@sel4.systems You can reach the person managing the list at devel-owner@sel4.systems When replying, please edit your Subject line so it is more specific than "Re: Contents of Devel digest..." Today's Topics: 1. Re: Use TimeServer by Group Components Questions (Kent Mcleod) 2. Re: Incorporating seL4 into your project, risc-v architecture (Kent Mcleod) 3. Re: some performance problem when test 4 cores SMP benchmark of seL4bench project 答复: Devel Digest, Vol 127, Issue 1 (Kent Mcleod) 4. Re: Use TimeServer by Group Components Questions (15852538526@139.com) 5. Re: Use TimeServer by Group Components Questions (Kent Mcleod) 6. Re: some performance problem when test 4 cores SMP benchmark of seL4bench project 答复: Devel Digest, Vol 127, Issue 1 (Gernot Heiser) ---------------------------------------------------------------------- Message: 1 Date: Tue, 7 Dec 2021 13:03:00 +1100 From: Kent Mcleod <kent.mcleod72@gmail.com> Subject: [seL4] Re: Use TimeServer by Group Components Questions To: 15852538526@139.com Cc: devel <devel@sel4.systems> Message-ID: <CA+-ozWeJgZm3B902EKBU6dUGxy6wU3Cp92jEUYwRKNFcVS=7DA@mail.gmail.com> Content-Type: text/plain; charset="UTF-8" On Fri, Dec 3, 2021 at 2:51 PM <15852538526@139.com> wrote: > > I got an problem when I use TimeServer with Group Components. I wish more than one component in the same Group Components, that can use TimeServer. While the cdl only have one ep for all components in the Group to communicate to TimeServer with an badge value that not equal to zero. And TimeServer must to distinguish requester by the badge, so only one component can receive TimeServer's response. How can I resolve this problem? The TimeServer component cannot be used as part of a Group component. The Group component mechanism isn't designed to transparently work for any component and the Connection types that the TimeServer uses are not compatible with the Group mechanism. > _______________________________________________ > Devel mailing list -- devel@sel4.systems To unsubscribe send an email > to devel-leave@sel4.systems ------------------------------ Message: 2 Date: Tue, 7 Dec 2021 13:14:30 +1100 From: Kent Mcleod <kent.mcleod72@gmail.com> Subject: [seL4] Re: Incorporating seL4 into your project, risc-v architecture To: so_s@gmx.de Cc: devel <devel@sel4.systems> Message-ID: <CA+-ozWeU7Bi-byXhsvo9jnQ_tcn90U7QRa8_hh39p2rVgUcKzA@mail.gmail.com> Content-Type: text/plain; charset="UTF-8" On Fri, Dec 3, 2021 at 8:14 AM <so_s@gmx.de> wrote: > > Dear seL4 devs and community. > > I try to create a simple project using the seL4 micro kernel. For setting up my environment I followed this (https://github.com/manu88/SeL4_101) and this (https://docs.sel4.systems/projects/buildsystem/incorporating.html) article. As kernel I'm using the last release of SeL4 (https://github.com/seL4/seL4/releases/tag/12.1.0). I created some kind of user application, like Manu88 did with a simple Hello World C-Program. While building with the delivered init-build.sh script I got a lot of repeating errors like this: > CMake Error at tools/cmake-tool/helpers/simulation.cmake:179 (add_custom_command): > Error evaluating generator expression: > > $<TARGET_PROPERTY:rootserver_image,KERNEL_IMAGE_NAME> > > Target "rootserver_image" not found. > Call Stack (most recent call first): > CMakeLists.txt:22 (GenerateSimulateScript) While going through > Manu88s guide I saw that he also got this errors. But after adding a CMakeList.txt file inside his user code, and declaring it as rootserver inside the CMakeList.txt file (I just copied the contents from here: https://github.com/manu88/SeL4_101/blob/master/projects/Hello/CMakeLists.txt) he got rid of the errors. I still get them and searching through the internet could not find a solution for me. Bellow I added the folder structure of my project. > Project/ > ├── kernel/ > ├── projects/ > │ ├── Hello/ > │ ├── musllibc/ > │ ├── utils_libs/ > │ ├── seL4_libs/ > ├── build/ > ├── tools/ > │ └── cmake-tool/ > │ └── elfloader-tool/ > ├── init-build.sh -> tools/cmake-tool/init-build.sh > ├── CMakeLists.txt > ├── application_settings.cmake -> > tools/cmake-tool/helpers/application_settings.cmake > ├── settings.cmake > > So I cannot tell where the issue lies. Any pointers would be appreciated. If you are using https://github.com/manu88/SeL4_101/blob/master/projects/Hello/CMakeLists.txt inside projects/Hello/ then you need to make sure that the top level CMakeLists inside Project/ is either a symlink to tools/cmake-tool/default-CMakeLists.txt or is similar to https://github.com/manu88/SeL4_101/blob/master/CMakeLists.txt. With this file layout, ../init-build.sh will end up calling the following CMake invocation: `cmake -DCMAKE_TOOLCHAIN_FILE=../kernel/gcc.cmake -G Ninja -DSEL4_CACHE_DIR=../.sel4_cache -C "../settings.cmake" "../"`. What is the contents of your settings.cmake file? > > Kind regards, > Sophia > _______________________________________________ > Devel mailing list -- devel@sel4.systems To unsubscribe send an email > to devel-leave@sel4.systems ------------------------------ Message: 3 Date: Tue, 7 Dec 2021 13:43:48 +1100 From: Kent Mcleod <kent.mcleod72@gmail.com> Subject: [seL4] Re: some performance problem when test 4 cores SMP benchmark of seL4bench project 答复: Devel Digest, Vol 127, Issue 1 To: Gernot Heiser <gernot@unsw.edu.au> Cc: "devel@sel4.systems" <devel@sel4.systems> Message-ID: <CA+-ozWcihN6my4WOyPbpCcfp3=rksYSnnEgZE2kc6c-BV4Kymw@mail.gmail.com> Content-Type: text/plain; charset="UTF-8" On Thu, Dec 2, 2021 at 9:28 PM Gernot Heiser <gernot@unsw.edu.au> wrote: > > > > > On 2 Dec 2021, at 16:17, yadong.li <yadong.li@horizon.ai> wrote: > > > > First, I got the data of IMX8MM_EVK_64 and TX2 from > > https://github.com/seL4/sel4bench/actions/runs/1469475721#artifacts, > > the sel4bench-results-imx8mm_evk file and sel4bench-results-tx2 file, unpack the file out, I find xxxx_SMP_64.json Secondly, the test is the smp benchmark form sel4bench-manifest project, the source file is sel4bench/apps/smp/src/main.c The test scenario look like below: > > A pair thread of ping-pong on the same core, the ping thread will > > wait for "ipc_normal_delay" time then send 0 len ipc message to pong > > thread, then return. I think the 500 cycles mean how long > > ipc_normal_delay will really delay > > > The above scenario will test on one core, or mutil core. If we run 4 cores, every core will have a ping thread and a pong thread run like above description, then record the sum of all cores ping-pong counts. > > ok, but what is the metric reported? [Apologies for not being on top > of the details of our benchmarking setups.] Looking at the sel4bench smp benchmark implementation, the metric is the total number of "operations" in a single second. An operation is a round trip intra address space seL4_Call + seL4_ReplyRecv between 2 threads on the same core with each thread delaying for the cycle count before performing the next operation. After 1 second of all cores performing these operations continuously and maintaining a core-local (on a separate cache line) count, the total number of operations is added together and reported as the final number. So you would expect that the reported metric would scale following Amdahl's law based on the proportion of an operation that is serialized inside the kernel lock which would potentially vary across platforms. > > It cannot simply be throughput, else the doubled delay should be reflected in a significantly reduced throughput, but it has almost no effect. > > > I think this experiment is used to illustrate in multi core, our seL4 kernel big lock will not affect mutli-core performance, am I right ? > > Not quite. As there’s only one big lock, only one core can execute the kernel at any time. If one core is in the IPC while another core is trying to IPC, even though both IPCs are core-local, the second will have to wait until the first gets out of the lock. > > As the delay is higher than the syscall latency, you’d expect perfect scalability from one core to two (with the lock essentially synchronising the threads). For 3 or 4 cores the combined latency of two IPCs is larger than the 500cy delay and you expect lock contention, resulting in reduced scaling, while it should still scale almost perfectly with the 1000cy delay. This is exactly what you see for the i.MX8 and the TX2. > > > Addition: > > Our seL4_Call performance is same with other platform > > XXXX IMX8MM_EVK_64 TX2_64 > > seL4_Call 367(0) 378(2) 492(16) client->server, same vspace, ipc_len is 0 > > seL4_ReplyRecv 396(0) 402(2) 513(16) server->client, same vspace, ipc_len is 0 > > OK, so baseline performance is good. But are these measured on a single-core or SMP kernel (i.e. is locking included)? > > > My test results below: > > ARM platform > > Test item XXX IMX8MM_EVK_64 TX2 > > mean(Stddev) > > 500 cycles, 1 core 636545(46) 625605(29) 598142(365) > > 500 cycles, 2 cores 897900(2327) 1154209(44) 994298(94) > > 500 cycles, 3 cores 1301679(2036) 1726043(65) 1497740(127) > > 500 cycles, 4 cores 1387678(549) 2172109(12674) 1545872(109) > > 1000 cycles, 1 core 636529(42) 625599(22) 597627(161) > > 1000 cycles, 2 cores 899212(3384) 1134110(34) 994437(541) > > 1000 cycles, 3 cores 1297322(5028) 1695385(45) 1497547(714) > > 1000 cycles, 4 cores 1387149(456) 2174605(81) 1545716(614) > > I notice your standard deviations for 2 and 3 cores are surprisingly high (although still small in relative terms). > > Did you try running the same again? Are the numbers essentially the same or are multiple runs all over the shop? > > There are some issues with our benchmarking methodology. Fixing up sel4bench is one of the projects I’d like to do if I got a student for it, or maybe someone from the community would want to help? > > But just from looking at the data I’m not sure that’s the issue here. > > Gernot > _______________________________________________ > Devel mailing list -- devel@sel4.systems To unsubscribe send an email > to devel-leave@sel4.systems ------------------------------ Message: 4 Date: Tue, 07 Dec 2021 03:06:38 -0000 From: 15852538526@139.com Subject: [seL4] Re: Use TimeServer by Group Components Questions To: devel@sel4.systems Message-ID: <163884639883.1070957.2394639262660252936@mattermost.seL4.systems> Content-Type: text/plain; charset="utf-8" Thank you very much for answering my question. I still have a doubt about the TimeServer. I can manually modify the cdl file and then I get the the expected source file(capdl_spec.c) from using capdl-parse. I also need to change the return value of timeserver <IF>_notification API. I have try and I can run the program as I expected. Is there any way to create adapted cdl file automatic by modifying the ADL parse project? I am looking forward to getting your answer again. Thank you very much. my ADL file: import <std_connector.camkes>; import <global-connectors.camkes>; import <TimeServer/TimeServer.camkes>; component Client { control; uses Timer timeout; } assembly { composition { group grp { component Client c1; component Client c2; } component TimeServer time_server; connection seL4TimeServer ts1( from grp.c1.timeout, to time_server.the_timer); connection seL4TimeServer ts2( from grp.c2.timeout, to time_server.the_timer); } configuration { time_server.timers_per_client = 1; } } ------------------------------ Message: 5 Date: Tue, 7 Dec 2021 14:23:54 +1100 From: Kent Mcleod <kent.mcleod72@gmail.com> Subject: [seL4] Re: Use TimeServer by Group Components Questions To: 15852538526@139.com Cc: devel <devel@sel4.systems> Message-ID: <CA+-ozWestyqAM4wy+x359CLZLf13PzdCRBNFSfPY-3nyo9Y6wA@mail.gmail.com> Content-Type: text/plain; charset="UTF-8" On Tue, Dec 7, 2021 at 2:09 PM <15852538526@139.com> wrote: > > Thank you very much for answering my question. I still have a doubt about the TimeServer. I can manually modify the cdl file and then I get the the expected source file(capdl_spec.c) from using capdl-parse. I also need to change the return value of timeserver <IF>_notification API. I have try and I can run the program as I expected. Is there any way to create adapted cdl file automatic by modifying the ADL parse project? There isn't a supported way to modify the generated cdl file as CAmkES expects to generate a full system specification. What is your motivation for trying to make the timeserver a group component instead of a separate component? As part of a group component it will still have its own kernel objects for its threads, code, data and heap memories so you still have the costs of isolation but without the benefits because it still shares the virtual address space and capability space with other components in the same group. > I am looking forward to getting your answer again. Thank you very much. > > my ADL file: > import <std_connector.camkes>; > import <global-connectors.camkes>; > import <TimeServer/TimeServer.camkes>; > > component Client { > control; > uses Timer timeout; > } > > assembly { > composition { > group grp { > component Client c1; > component Client c2; > } > component TimeServer time_server; > > connection seL4TimeServer ts1( > from grp.c1.timeout, to time_server.the_timer); > connection seL4TimeServer ts2( > from grp.c2.timeout, to time_server.the_timer); > } > > configuration { > time_server.timers_per_client = 1; > } > } > _______________________________________________ > Devel mailing list -- devel@sel4.systems To unsubscribe send an email > to devel-leave@sel4.systems ------------------------------ Message: 6 Date: Tue, 7 Dec 2021 03:45:09 +0000 From: Gernot Heiser <gernot@unsw.edu.au> Subject: [seL4] Re: some performance problem when test 4 cores SMP benchmark of seL4bench project 答复: Devel Digest, Vol 127, Issue 1 To: "devel@sel4.systems" <devel@sel4.systems> Message-ID: <17FC6BC4-3407-44AE-B7AA-4623EFF76988@unsw.edu.au> Content-Type: text/plain; charset="utf-8" On 7 Dec 2021, at 13:43, Kent Mcleod <kent.mcleod72@gmail.com> wrote: > > Looking at the sel4bench smp benchmark implementation, the metric is > the total number of "operations" in a single second. An operation is > a round trip intra address space seL4_Call + seL4_ReplyRecv between 2 > threads on the same core with each thread delaying for the cycle count > before performing the next operation. After 1 second of all cores > performing these operations continuously and maintaining a core-local > (on a separate cache line) count, the total number of operations is > added together and reported as the final number. So you would expect > that the reported metric would scale following Amdahl's law based on > the proportion of an operation that is serialized inside the kernel > lock which would potentially vary across platforms. Thanks for the explanation, Kent. Observations: 1) The metric is essentially independent of the delay. Looking at the single-core figures for the i/MX8, I get 1598.5 ns in both cases, the difference being 15ps. Doesn’t make sense to me. 2) Assuming this processor runs at the 1.8GHz it seems speced for, this corresponds to 2877 cycles, which is huge, even if the 1000cy delay is subtracted! 3) As I said before, intra-AS IPC is a meaningless metric we should never use (but that’s incidental to the particular thing we want to measure here). 4) Having to do these calculations to understand the numbers is a sure indication that the results are presented in an unsuitable form. I can’t see how these figures make sense. Gernot ------------------------------ Subject: Digest Footer _______________________________________________ Devel mailing list -- devel@sel4.systems To unsubscribe send an email to devel-leave@sel4.systems ------------------------------ End of Devel Digest, Vol 127, Issue 5 *************************************
1139
Age (days ago)
1139
Last active (days ago)
0 comments
1 participants
participants (1)
-
yadong.li