seL4 multicore boot failed on arm64
Hello,

I run my seL4 application on a 4-core ARMv8 platform. Recently I developed some more application code (I mention it because the file system size has increased, and I believe that may change the boot time), and from time to time I have observed that the boot process fails: it gets stuck after the following output:

    calling schedule
    calling activate thread
    kernel boot done.. CPU=1
    calling schedule
    calling activate thread
    kernel boot done.. CPU=2
    calling schedule
    calling activate thread
    kernel boot done.. CPU=3
    Booting all finished, dropped to user space
    calling schedule
    calling activate thread
    kernel boot done.. CPU=0

Sometimes I see that only 2 CPUs have booted (such as CPU1 and CPU3, or CPU2 and CPU3). The last print is the last one in the init_kernel() function.

Thank you,
Leonid

Hello Leonid,

On 2024-04-25 21:37, Leonid Meyerovich wrote:
> I run my seL4 application on 4 core ARMv8.

Still the new platform port, I assume?

> Recently I have developed some more application code (I mention it
> because file system size is increased and I believe it may change boot
> time) and from time to time I observed that boot process failed (stuck
> after the following print)

Does this only happen with a bigger application? And is it always stuck for a specific binary, or does the same binary sometimes work?

The application isn't started until after the kernel is done booting, so its size should not affect whether cores come up. If it does, it may be a bug in Elfloader. If that is the case, it should either always succeed or always fail.

> Sometimes I see that only 2 CPU have been booted (like CPU1 and CPU3,
> or CPU2 and CPU3) Last print is the last one in init_kernel() function

Core 0 will only finish if all other cores have done their init too.

I recommend adding some debugging to try_init_kernel_secondary_core() and init_cpu() to see where it gets stuck.

Also be aware that debug printing may disappear if multiple cores print at exactly the same time, depending on the UART driver.

Greetings,
Indan
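
One way to follow the debugging advice above without relying on the UART is to have each core record its progress in its own memory slot and let a single core print the result afterwards. The following is only a minimal hosted sketch of that idea, not seL4 code: the stage names, boot_stage_of, mark_stage(), read_stage() and dump_stages() are all hypothetical, and NUM_CORES is assumed to be 4 to match the board in this thread. In the kernel you would replace printf() and assert() with whatever the kernel build provides.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

#define NUM_CORES 4 /* assumption: the 4-core board from this thread */

/* Hypothetical checkpoints; choose whichever places you want to observe. */
enum boot_stage {
    STAGE_NOT_STARTED = 0,
    STAGE_ENTERED,       /* e.g. top of try_init_kernel_secondary_core() */
    STAGE_LOCK_SEEN,     /* e.g. after the wait on node_boot_lock */
    STAGE_CPU_INIT_DONE, /* e.g. after init_cpu() returns */
    STAGE_DONE
};

/* One slot per core, so concurrent writers never touch the same slot. */
static _Atomic int boot_stage_of[NUM_CORES];

/* Record that `core` reached `stage`. */
void mark_stage(int core, enum boot_stage stage)
{
    assert(core >= 0 && core < NUM_CORES);
    atomic_store_explicit(&boot_stage_of[core], (int)stage,
                          memory_order_release);
}

/* Read the last stage recorded for `core`. */
int read_stage(int core)
{
    assert(core >= 0 && core < NUM_CORES);
    return atomic_load_explicit(&boot_stage_of[core], memory_order_acquire);
}

/*
 * Print every core's last recorded stage. Call this from one core only,
 * so the output cannot interleave. Returns how many cores reached
 * STAGE_DONE.
 */
int dump_stages(void)
{
    int done = 0;
    for (int core = 0; core < NUM_CORES; core++) {
        int stage = read_stage(core);
        printf("core %d: stage %d\n", core, stage);
        if (stage == STAGE_DONE) {
            done++;
        }
    }
    return done;
}
```

A core that stays at STAGE_ENTERED while the others reach STAGE_DONE would point at the wait on the boot lock rather than at init_cpu().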

Hi Indan,
-BOOT_BSS static volatile int node_boot_lock;
+static volatile int node_boot_lock;
In the seL4 kernel function try_init_kernel_secondary_core(), every
secondary core waits on node_boot_lock before it starts initialization.
As the diff above shows, node_boot_lock was declared with BOOT_BSS, which
places it in a boot-specific part of the .bss section. Apparently, the
Morello processor has some unique features and behaviors that may affect
how cache coherence is handled for memory regions such as the BOOT_BSS
section in a multi-core environment. My debugging shows that the cores
sometimes do not see the change to node_boot_lock and so never start
initialization. I have removed BOOT_BSS and the system now boots properly.
Thanks,
Leonid
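
The handshake described above (secondary cores waiting on a lock variable until the primary core releases it) can be modelled on a host with threads. This is only an illustrative sketch under stated assumptions, not the kernel's implementation: threads stand in for cores, boot_lock stands in for node_boot_lock, and boot_data, seen_data, secondary_core() and run_boot_handshake() are hypothetical names. It uses C11 release/acquire atomics, which is what guarantees that a waiter that sees the lock change also sees everything written before it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define NUM_SECONDARY 3 /* assumption: 4 cores, one of them primary */

static atomic_int boot_lock;   /* stand-in for node_boot_lock */
static atomic_int cores_ready; /* secondaries that finished their init */
static int boot_data;          /* state the primary sets up before release */
static int seen_data[NUM_SECONDARY];

/* Body of one simulated secondary core. */
void *secondary_core(void *arg)
{
    int id = *(int *)arg;

    /* Spin until the primary releases the lock. The acquire load pairs
     * with the primary's release store below. */
    while (atomic_load_explicit(&boot_lock, memory_order_acquire) == 0) {
    }

    /* Safe to read: written by the primary before the release store. */
    seen_data[id] = boot_data;
    atomic_fetch_add_explicit(&cores_ready, 1, memory_order_release);
    return NULL;
}

/* Runs the handshake once; returns how many secondaries completed. */
int run_boot_handshake(void)
{
    pthread_t threads[NUM_SECONDARY];
    int ids[NUM_SECONDARY];
    int created = 0;

    atomic_store(&boot_lock, 0);
    atomic_store(&cores_ready, 0);
    boot_data = 0;
    for (int i = 0; i < NUM_SECONDARY; i++) {
        seen_data[i] = -1;
    }

    for (int i = 0; i < NUM_SECONDARY; i++) {
        ids[i] = i;
        if (pthread_create(&threads[i], NULL, secondary_core, &ids[i]) != 0) {
            break;
        }
        created++;
    }

    /* Primary: finish setup first, then publish it by releasing the lock. */
    boot_data = 42;
    atomic_store_explicit(&boot_lock, 1, memory_order_release);

    for (int i = 0; i < created; i++) {
        pthread_join(threads[i], NULL);
    }
    return atomic_load(&cores_ready);
}
```

On a host this always completes. The symptom reported in this thread corresponds to a waiter that never observes the store to the lock, which this model cannot reproduce; it only shows the intended protocol.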

On Fri, May 3, 2024 at 8:13 AM Indan Zupancic wrote:
> Hello Leonid,
>
> On 2024-04-25 21:37, Leonid Meyerovich wrote:
>> I run my seL4 application on 4 core ARMv8.
>
> Still the new platform port, I assume?
>
>> Recently I have developed some more application code (I mention it
>> because file system size is increased and I believe it may change boot
>> time) and from time to time I observed that boot process failed (stuck
>> after the following print)
>
> Does this only happen with a bigger application? And is it always stuck
> for a specific binary, or does the same binary sometimes work?
>
> The application isn't started until after the kernel is done booting, so
> its size should not affect whether cores come up. If it does, it may be
> a bug in Elfloader. If that is the case, it should either always succeed
> or always fail.
>
>> Sometimes I see that only 2 CPU have been booted (like CPU1 and CPU3,
>> or CPU2 and CPU3) Last print is the last one in init_kernel() function
>
> Core 0 will only finish if all other cores have done their init too.
>
> I recommend adding some debugging to try_init_kernel_secondary_core()
> and init_cpu() to see where it gets stuck.
>
> Also be aware that debug printing may disappear if multiple cores print
> at exactly the same time, depending on the UART driver.
>
> Greetings,
> Indan

Hello Leonid,

This is very surprising. As far as I can tell, BOOT_BSS is section .boot.bss and that just goes to the normal BSS region, as src/arch/arm/common_arm.lds only has:

    .boot . : AT(ADDR(.boot) - KERNEL_OFFSET)
    {
        *(.boot.text)
        *(.boot.rodata)
        *(.boot.data)
        . = ALIGN(64K);
    }

    ki_boot_end = .;

Even if it would go in a different section, the kernel itself doesn't treat it differently other than re-using the memory when done booting.

I think that it works by chance if you remove the BOOT_BSS, except if I missed something somewhere.

My guess is that Elfloader doesn't properly synchronise data across cores before jumping to the kernel. Perhaps it makes an assumption which is not true for Neoverse.

Does adding extra synchronisation before starting the other cores solve your problem? Add it to smp_boot(), before the init_cpus() call and move the "non_boot_lock = 1;" to before the DSB or whatever sync you add:

https://github.com/seL4/seL4_tools/blob/master/elfloader-tool/src/arch-arm/s...

Greetings,
Indan

On 2024-05-09 15:29, Leonid Meyerovich wrote:
> -BOOT_BSS static volatile int node_boot_lock;
> +static volatile int node_boot_lock;
>
> In the seL4 kernel function try_init_kernel_secondary_core(), every
> secondary core waits on node_boot_lock before it starts initialization.
> As the diff above shows, node_boot_lock was declared with BOOT_BSS, which
> places it in a boot-specific part of the .bss section. Apparently, the
> Morello processor has some unique features and behaviors that may affect
> how cache coherence is handled for memory regions such as the BOOT_BSS
> section in a multi-core environment. My debugging shows that the cores
> sometimes do not see the change to node_boot_lock and so never start
> initialization. I have removed BOOT_BSS and the system now boots properly.
>
> Thanks,
> Leonid
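
The ordering Indan suggests for smp_boot() (set non_boot_lock first, then issue a barrier, and only then start the other cores) can be sketched as below. This is a hedged illustration and not the Elfloader source: full_barrier(), start_secondary_cores_stub(), smp_boot_sketch(), cores_started and lock_at_start are hypothetical names, the stub merely stands in for the init_cpus() call mentioned in the thread, and whether a DSB alone is sufficient on the platform in question is exactly what the suggested experiment would have to show.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Full barrier: DSB SY on AArch64. On other hosts a sequentially
 * consistent fence is substituted so the sketch still builds.
 */
static inline void full_barrier(void)
{
#if defined(__aarch64__)
    __asm__ volatile("dsb sy" ::: "memory");
#else
    atomic_thread_fence(memory_order_seq_cst);
#endif
}

static volatile int non_boot_lock; /* name taken from the thread */
static int cores_started;          /* hypothetical bookkeeping */
static int lock_at_start;          /* lock value seen when cores start */

/* Hypothetical stand-in for the init_cpus() call. */
static void start_secondary_cores_stub(void)
{
    lock_at_start = non_boot_lock;
    cores_started = 1;
}

/* Returns the lock value that was in place when the cores were started. */
int smp_boot_sketch(void)
{
    non_boot_lock = 1;            /* 1. set the flag first */
    full_barrier();               /* 2. wait for the store to complete */
    start_secondary_cores_stub(); /* 3. only now start the other cores */
    return lock_at_start;
}
```

The point of the ordering is that no secondary core can begin running before the store to the flag has been made to complete by the barrier.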