issue with elfloader ret instruction while enabling the MMU on armv8-a
Hello seL4 team, I would like to share an issue found with the port of seL4 and VMs to the STM32MP2 SoC, which features Armv8-A dual Cortex-A35 cores: When activating the secondary core, The RET instruction crashes while executing the arm_enable_mmu(). The exception stack and ESR register show no useful information. Simply replacing the RET instruction with BR X30 resolves the issue. Furthermore, before RET, a check with the AT S1E1R, X30 shows correct memory translation in the PAR_EL1 register, confirming proper MMU setting, which is also supported by the successful BR X30 workaround. Just to be sure I tried to replace the current 1Gb translation table with a finer 4Bb granularity and P:EX U:NX and RO access flags for the Non Secure code execution. Same results. I also set the ASID bits, same result. So back to the current implementation and BR X30 workaround: According to the Arm Architecture Reference Manual [1], the RET instruction behaves identically to BR but provides an additional hint that this is a return from a subroutine. AFAIU, such hints can be related to the branch predictor or the return address stack used for ROP protection, (if the feature is available). In “Requirements for branch predictor maintenance” [2], branch prediction maintenance operations must be performed after enabling or disabling the MMU. However, these operations translate as a NOP if the branch predictor hardware is not architecturally visible. Therefore, the ‘BL/RET’ behavior from different translation regime seems to to be implementation-dependent or dependent on optional architectural features Thus, replacing MMU enable sequence: enable_mmu sctlr_el1 x8 ldp x29,x30,[sp],#16 ret with a trampoline: adr x30,mmu_on ldp x29,x9,[sp],#16 ret mmu_on: enable_mmu sctlr_el1,x8 br x9 is enough to start seL4 with SMP=on NUM_NODES=2 and run a dual core Linux on the CAmkES VM. This is semantically equivalent to the BL X30 workaround, but this is more explicit regarding the MMU translation regime around BL/RET. NB: arm_enable_hyp_mmu() needs the same change as the issue happens equivalently in EL1 or EL2. Thank you very much for your thoughts on this issue. Best Regards Christian References: [1] Arm Architecture Reference Manual (DTI 0487) [2] Arm Architecture Reference Manual ARMv7-A and ARMv7-R edition
Hello Christian, On 2026-01-13 16:16, Christian Bruel via Devel wrote:
I would like to share an issue found with the port of seL4 and VMs to the STM32MP2 SoC, which features Armv8-A dual Cortex-A35 cores: When activating the secondary core, The RET instruction crashes while executing the arm_enable_mmu(). The exception stack and ESR register show no useful information.
Thanks for sharing this. We have a similar issue in 32-bit mode on the zynqmp, but there it only fails sometimes. Does it consistently fail for you or just sometimes? I suspect the latter, as Cortex-A35 is mostly in-order (with some out-of-order memory accesses). We didn't run into any of these issues with the tqma board, which also has a Cortex-A35, but is an imx8qxp SoC.
Simply replacing the RET instruction with BR X30 resolves the issue.
Interesting.
In “Requirements for branch predictor maintenance” [2], branch prediction maintenance operations must be performed after enabling or disabling the MMU. However, these operations translate as a NOP if the branch predictor hardware is not architecturally visible.
That's a link to an ARMv7 document. The 32-bit ARM Elfoader code does a BPIALL before enabling the MMU, but the 64-bit one doesn't. However, other than having unnecessarily wrong branch predictions, I don't see how clearing branch predictor state would be architecturally required. It seems more likely that the correct branch prediction causes issues that don't show up when mispredicting the branch. Do you have an aarch64 reference?
Therefore, the ‘BL/RET’ behavior from different translation regime seems to to be implementation-dependent or dependent on optional architectural features
That seems unlikely: The instructions are well-defined. I suspect we are seeing the effect of a missing synchronisation or other subtle problem, not the cause of the problem here.
is enough to start seL4 with SMP=on NUM_NODES=2 and run a dual core Linux on the CAmkES VM.
Do you see the same problem when booting with one core, with unmodified code? It would help to limit this to unicore to rule out any SMP problems. The hyp and non-hyp boot code is noticeably different, so it is a bit surprising that both fail the same way.
This is semantically equivalent to the BL X30 workaround, but this is more explicit regarding the MMU translation regime around BL/RET.
But like the BR work-around, it's a work-around and doesn't fix the source of the problem. Does adding BPIALL make it boot too? I guess it will because it has the same effect. It seems strange that any of this makes a difference though, considering that there is both a DSB as well as an ISB before the RET, at least in src/arch-arm/armv/armv8-a/64/mmu-hyp.S: enable_mmu sctlr_el2, x8 ic ialluis dsb ish isb tlbi alle2is dsb ish isb ldp x29, x30, [sp], #16 ret END_FUNC(arm_enable_hyp_mmu) So to double check, do your elfloader files match these: https://github.com/seL4/seL4_tools/blob/master/elfloader-tool/src/arch-arm/a... https://github.com/seL4/seL4_tools/blob/master/elfloader-tool/src/arch-arm/a... Or are you running some old version? Greetings, Indan
Hi Indan, Thank for your insight and sharing the experience on zynqmp.
We have a similar issue in 32-bit mode on the zynqmp, but there it only fails sometimes. Does it consistently fail for you or just sometimes? I suspect the latter, as Cortex-A35 is mostly in-order (with some out-of-order memory accesses).
If fails consistently. With the workaround it succeeds consistently. This is interesting that you had to use BPIALL before ret from enable_mmu() Unfortunately I didn't find the equivalent support for the cortex-A35 AArch64, but this is not general to all ARMv8-A processors; for instance, the Cortex-A72 can control the branch target buffer with the CPUACTLR_EL1 register, which is reserved on the Cortex-A35. I am curious to see this w/a impact on zynqmp if you can test, does it still fail sometime ?
However, other than having unnecessarily wrong branch predictions, I don't see how clearing branch predictor state would be architecturally required. It seems more likely that the correct branch prediction causes issues that don't show up when mispredicting the branch. Do you have an aarch64 reference?
The only (incomplete) information I found is from Armv8-A Instruction Set Architecture Issue 1.1, Chapter 10: Function Calls, which explicitly states that all Cortex-A processors support branch prediction affecting the RET instruction. But no details on the different micro-architecture implementations.
Do you see the same problem when booting with one core, with unmodified code? It would help to limit this to unicore to rule out any SMP problems.
I have never seen the issue when booting without SMP. When booting with SMP I see this issue sometime in core0, sometime in core1.
Does adding BPIALL make it boot too? I guess it will because it has the same effect.
BPIALL is an AArch32 instruction, not supported on AArch64.
Or are you running some old version
I am re-based on top of the master branch, so we are aligned. For the port, if I don't find another explanation to the issue, a proposal to implement the w/a without impacting other platforms is to to specialize the arm_enable_mmu() epilogue function with a platform macro. Lets see... Best Regards Christian Best Regards Christian
Greetings,
Indan
Hi Christian, On 2026-01-15 10:06, Christian Bruel wrote:
This is interesting that you had to use BPIALL before ret from enable_mmu()
It's not my code, I don't know why arm_enable_mmu() does BPIALL, while arm_enable_hyp_mmu() does not.
Unfortunately I didn't find the equivalent support for the cortex-A35 AArch64, but this is not general to all ARMv8-A processors; for instance, the Cortex-A72 can control the branch target buffer with the CPUACTLR_EL1 register, which is reserved on the Cortex-A35.
Some misunderstanding here: From what you said I understood there was an equivalent instruction in AArch64. And when I looked at the TRM, I found the BPMaint bit in the 64-bit section, but missed that it's part of ID_MMFR3_EL1, "AArch32 Memory Model Feature Register 3".
I am curious to see this w/a impact on zynqmp if you can test, does it still fail sometime ?
I can't reproduce the problem when I build it myself, it seems to depend on the compiler version or CI does something else special I can't reproduce easily. CI was unstable for different reasons lately, so I'm waiting for it to settle down a bit before testing different potential solutions. The zynqmp problem occurs both with and without HYP, so the BPIALL shouldn't make a difference. It's probably totally unrelated, but similar in symptoms. I'll try your work-around, who knows.
Do you see the same problem when booting with one core, with unmodified code? It would help to limit this to unicore to rule out any SMP problems.
I have never seen the issue when booting without SMP. When booting with SMP I see this issue sometime in core0, sometime in core1.
Okay, this changes everything. Is "Boot cpu id = " message always the same, or does it sometimes boot on core 0 and sometimes on core 1? Did Elfloader relocate? That is, do you get this: "ELF loader relocated, continuing boot" Have you tried flushing all instruction and data caches at the beginning of smp_boot()? Do you get the message: "Jumping to kernel-image entry point..." Or is it skipped? If you're booting with U-boot, try adding dcache flush; icache flush; before booting seL4, see if that makes a difference.
For the port, if I don't find another explanation to the issue, a proposal to implement the w/a without impacting other platforms is to to specialize the arm_enable_mmu() epilogue function with a platform macro. Lets see...
More likely this is an SMP boot problem that should be fixed for everyone, otherwise other people will run into the same issue in the future for other ports. Have you tried Microkit or seL4-Rust? They have their own loaders, would be interesting to know whether they run into the same issue. Greetings, Indan
participants (2)
-
Christian Bruel -
Indan Zupancic