Overlapping pointer returned by memalign
Hi, I’m currently working on a project porting drivers from U-Boot to seL4 and have run into an unexpected problem seemingly triggered by use of the memalign within the U-Boot drivers. What I am seeing is that calls to memalign from within my CAmkES component can return pointers to regions which overlap with those previously returned by malloc. Obviously this leads to the two allocated regions trampling over each other and resulting corruption of data. I’m at a complete loss to explain this behaviour and would be very grateful to receive any suggestions or pointers. Thanks for your help, Stephen This message contains information that may be privileged or confidential and is the property of the Capgemini Group. It is intended only for the person to whom it is addressed. If you are not the intended recipient, you are not authorized to read, print, retain, copy, disseminate, distribute, or use this message or any part thereof. If you receive this message in error, please notify the sender immediately and delete all copies of this message.
On Thu, Mar 17, 2022 at 8:26 AM WILLIAMS Stephen via Devel
Hi,
I’m currently working on a project porting drivers from U-Boot to seL4 and have run into an unexpected problem seemingly triggered by use of the memalign within the U-Boot drivers.
What I am seeing is that calls to memalign from within my CAmkES component can return pointers to regions which overlap with those previously returned by malloc. Obviously this leads to the two allocated regions trampling over each other and resulting corruption of data.
I’m at a complete loss to explain this behaviour and would be very grateful to receive any suggestions or pointers.
Both malloc and memalign in camkes are provided by our fork of libmuslc (https://github.com/sel4/musllibc/). Internally, memalign calls malloc and so it seems like your issue can be reduced to multiple calls to malloc are returning overlapping regions. This could be for a couple reasons: - Within the default camkes runtime, muslc functions such as malloc aren't thread safe and so must be called from critical sections guarded by a lock to avoid races. Many camkes components use a global lock when performing operations that mutate state: https://github.com/seL4/global-components/blob/master/components/TimeServer/..., or they don't use dynamic memory allocation after initialization (as initialization is single threaded). This lack of thread safety is a bit nasty and the runtime should do more to protect developers from this, but currently I don't think it does. - You have memory corruption somewhere else that's causing malloc's bookkeeping structures to be corrupted.
Thanks for your help, Stephen This message contains information that may be privileged or confidential and is the property of the Capgemini Group. It is intended only for the person to whom it is addressed. If you are not the intended recipient, you are not authorized to read, print, retain, copy, disseminate, distribute, or use this message or any part thereof. If you receive this message in error, please notify the sender immediately and delete all copies of this message. _______________________________________________ Devel mailing list -- devel@sel4.systems To unsubscribe send an email to devel-leave@sel4.systems
Thanks for the information. The issue is being observed within a single threaded camkes component, therefore I’m assuming that thread safety of the muslc functions it not relevant here. In an attempt to investigate memory corruption I have run the software under valgrind on my host machine, which does not highlight any issues. Obviously the interaction with the actual devices was needed to be simulated to perform this run on the host however I believe that all of the same memory allocation and access routines should have been performed. This suggests that memory corruption may not be the cause. One potentially interesting observation is that if I replace all calls to memalign with the equivalent calls to malloc (noting that the alignment appears to be for performance rather than functional correctness reasons) I no longer have overlapping memory regions allocated and the code functions as expected. Thanks, Stephen
On 16 Mar 2022, at 23:23, Kent Mcleod
wrote: ***This mail has been sent by an external source***
On Thu, Mar 17, 2022 at 8:26 AM WILLIAMS Stephen via Devel
wrote: Hi,
I’m currently working on a project porting drivers from U-Boot to seL4 and have run into an unexpected problem seemingly triggered by use of the memalign within the U-Boot drivers.
What I am seeing is that calls to memalign from within my CAmkES component can return pointers to regions which overlap with those previously returned by malloc. Obviously this leads to the two allocated regions trampling over each other and resulting corruption of data.
I’m at a complete loss to explain this behaviour and would be very grateful to receive any suggestions or pointers.
Both malloc and memalign in camkes are provided by our fork of libmuslc (https://github.com/sel4/musllibc/). Internally, memalign calls malloc and so it seems like your issue can be reduced to multiple calls to malloc are returning overlapping regions. This could be for a couple reasons: - Within the default camkes runtime, muslc functions such as malloc aren't thread safe and so must be called from critical sections guarded by a lock to avoid races. Many camkes components use a global lock when performing operations that mutate state: https://github.com/seL4/global-components/blob/master/components/TimeServer/..., or they don't use dynamic memory allocation after initialization (as initialization is single threaded). This lack of thread safety is a bit nasty and the runtime should do more to protect developers from this, but currently I don't think it does. - You have memory corruption somewhere else that's causing malloc's bookkeeping structures to be corrupted.
Thanks for your help, Stephen This message contains information that may be privileged or confidential and is the property of the Capgemini Group. It is intended only for the person to whom it is addressed. If you are not the intended recipient, you are not authorized to read, print, retain, copy, disseminate, distribute, or use this message or any part thereof. If you receive this message in error, please notify the sender immediately and delete all copies of this message. _______________________________________________ Devel mailing list -- devel@sel4.systems To unsubscribe send an email to devel-leave@sel4.systems
Following further investigation this does not obviously appear to be a traditional case of memory corruption. With the following code in a single threaded component the problem is being exhibited:
#include
On Tue, Mar 22, 2022 at 9:20 AM WILLIAMS Stephen
Following further investigation this does not obviously appear to be a traditional case of memory corruption. With the following code in a single threaded component the problem is being exhibited:
#include
#include #include void* debug_memalign(size_t align, size_t size) { void* addr = memalign(align, size); printf("memalign: align = 0x%x, size = 0x%x. Returned address = %p\n",align, size, addr); return addr; }
int run_main(ps_io_ops_t *io_ops) { debug_memalign(0x40, 0x1000); debug_memalign(0x40, 0x1000);
// Loop forever while (1); } CAMKES_POST_INIT_MODULE_DEFINE(run_main_, run_main);
This code, performing two calls to memalign as the first processing within the component, results in the following output:
ELF-loader started on CPU: ARM Ltd. Cortex-A53 r0p4 paddr=[408b3000..40c21117] No DTB passed in from boot loader. Looking for DTB in CPIO archive...found at 409ff090. Loaded DTB from 409ff090. paddr=[4023f000..40248fff] ELF-loading image 'kernel' to 40000000 paddr=[40000000..4023efff] vaddr=[ffffff8040000000..ffffff804023efff] virt_entry=ffffff8040000000 ELF-loading image 'capdl-loader' to 40249000 paddr=[40249000..404c1fff] vaddr=[400000..678fff] virt_entry=408e78 Enabling MMU and paging Jumping to kernel-image entry point...
Warning: gpt_cntfrq 8333333, expected 8000000 Bootstrapping kernel available phys memory regions: 1 [40000000..c0000000] reserved virt address space regions: 3 [ffffff8040000000..ffffff804023f000] [ffffff804023f000..ffffff8040249000] [ffffff8040249000..ffffff80404c2000] Booting all finished, dropped to user space memalign: align = 0x40, size = 0x1000. Returned address = 0x57d1c0 memalign: align = 0x40, size = 0x1000. Returned address = 0x57d1c0
As can be seen the same address is returned for both calls and there is no earlier processing in this case to corrupt the bookkeeping of the memalign / malloc routines. I have a few observations that may be of interest:
1. As previously noted, if the two calls to ‘memalign’ are replaced with calls to ‘malloc’ then no problems are seen; two non-overlapping memory regions are allocated.
2. The behaviour appears to depend on the other code / data linked into the executable. If the calls to ‘memalign’ are the only functionality performed by the component then two non-overlapping memory regions are allocated as expected, however if the calls to ‘memalign’ are followed by calls to the U-Boot code we are porting to seL4 then the unexpected behaviour is observed.
The memalign definition provided by musl is a weak symbol and would be overridden by any stronger definition encountered in sources such as from U-Boot. Alternatively a header from U-Boot could introduce a macro that redefines memalign (but it doesn't look like you have any U-Boot includes). Are you able to confirm that the code being for memalign in the final binary is in fact the definition provided by the musl library? Are you able to provide a minimal reproducible example? Are you also only using the malloc provided by the musl fork? If memalign is calling a different malloc and they have different internal bookkeeping, then that would cause data corruption.
3. The behaviour is being observed on a board for which we are in the process of adding support (the Avnet MaaXBoard, and i.MX8 based board). I can confirm however that seL4test passes on the board as well tests of an ethernet driver (using the Ethdriver and PicoServer global components) and tests of a MMC/SDHC driver. As such there is a degree of confidence in the board’s level of support.
Any insights or suggestions on how to proceed with determining the cause of this issue would be greatly appreciated.
Thanks, Stephen
On 22/03/2022 04:50, Kent Mcleod wrote:
The memalign definition provided by musl is a weak symbol and would be overridden by any stronger definition encountered in sources such as from U-Boot.
Good grief! Is there a linker warning that complains about that sort of overriding? - Rod
On Tue, Mar 22, 2022 at 6:17 PM Roderick Chapman
On 22/03/2022 04:50, Kent Mcleod wrote:
The memalign definition provided by musl is a weak symbol and would be overridden by any stronger definition encountered in sources such as from U-Boot.
Good grief!
Is there a linker warning that complains about that sort of overriding?
I'm not aware of one. If you add `-Wl,-M` to the CMake `target_link_libraries` command then the linker will print out a mapping file which contains which object file each symbol comes from. This may be enough to confirm whether symbols are getting replaced unexpectedly.
- Rod
_______________________________________________ Devel mailing list -- devel@sel4.systems To unsubscribe send an email to devel-leave@sel4.systems
Following a lot of further investigation I believe I have determined the root cause of this issue. Surprisingly this appears to be a bug in the memalign implementation within seL4's fork of the musl library. I note that the memory allocation system within the mainline musl was completely reworked / replaced a number of years ago so there does not appear to be any scope to raise a bug report on the musl project. Fault summary: The correctness of the memory allocation bookkeeping system relies upon the constraint that the minimum size of a memory 'chunk' is 4xsizeof(size_t). If this constraint is broken then bad things happen and the bookkeeping system becomes corrupted, specifically: 1. Arithmetic wrap-around of x occurs in the routines bin_index and bin_index_up within malloc.c. This can lead to chunks below the minimum size limit to be considered to be large unallocated chunks of memory. Subsequent allocation of these unallocated chunks (considered to be large but in reality tiny) allows previously allocated chunks to be re-used / overwritten. 2. The 'next' and 'prev' pointers held in an unallocated chunk (used to maintained a doubly linked list of unallocated chunks) that is below the minimum size limit may be overlayed with the bookkeeping of the following chunk. The malloc routine enforces this minimum chunk size limit, however the code of the __memalign routine within memalign.c can break this minimum size constraint and therefore lead to corruption of the bookkeeping. The __memalign routine works by malloc'ing sufficient memory to ensure the requested amount of memory is available, at the requested alignment, somewhere within the malloc'ed region. This means that there may be some unused memory allocated before the start of the aligned memory area. This is handled by splitting the chunk allocated by malloc into two chunks, a chunk of memory prior to the start of the aligned memory followed by a chunk that starts at the requested alignment. __memalign then calls 'free' on the first chunk which wasn't required. So far so good, however __memalign fails to enforce the minimum chunk size constraint on either of the two split chunks. In our case the first chunk (the one to be freed) was only 16 bytes whilst 4xsizeof(size_t) is 32 bytes, thereby breaking the minimum chunk size constraint and so led to the detected corruption. Stephen
On 23 Mar 2022, at 00:30, Kent Mcleod
wrote: ***This mail has been sent from an external source***
On Tue, Mar 22, 2022 at 6:17 PM Roderick Chapman
wrote: On 22/03/2022 04:50, Kent Mcleod wrote:
The memalign definition provided by musl is a weak symbol and would be overridden by any stronger definition encountered in sources such as from U-Boot.
Good grief!
Is there a linker warning that complains about that sort of overriding?
I'm not aware of one. If you add `-Wl,-M` to the CMake `target_link_libraries` command then the linker will print out a mapping file which contains which object file each symbol comes from. This may be enough to confirm whether symbols are getting replaced unexpectedly.
- Rod
_______________________________________________ Devel mailing list -- devel@sel4.systems To unsubscribe send an email to devel-leave@sel4.systems
Devel mailing list -- devel@sel4.systems To unsubscribe send an email to devel-leave@sel4.systems
This message contains information that may be privileged or confidential and is the property of the Capgemini Group. It is intended only for the person to whom it is addressed. If you are not the intended recipient, you are not authorized to read, print, retain, copy, disseminate, distribute, or use this message or any part thereof. If you receive this message in error, please notify the sender immediately and delete all copies of this message.
Following on from my previous email I have tested the following patch to memalign which I believe resolves the identified weakness:
diff --git a/src/malloc/memalign.c b/src/malloc/memalign.c
index 006bd21c..c98963f0 100644
--- a/src/malloc/memalign.c
+++ b/src/malloc/memalign.c
@@ -21,6 +21,9 @@ void *__memalign(size_t align, size_t len)
errno = ENOMEM;
return NULL;
}
+ else if (len < 4*sizeof(size_t))
+ /* Ensure the length of returned chunk meets the minimum limit. */
+ len = 4*sizeof(size_t);
if (align <= 4*sizeof(size_t)) {
if (!(mem = malloc(len)))
@@ -50,7 +53,29 @@ void *__memalign(size_t align, size_t len)
((size_t *)new)[-1] = header&7 | end-new;
((size_t *)end)[-2] = footer&7 | end-new;
- free(mem);
+ if (new-mem >= 4*sizeof(size_t))
+ free(mem);
+ else {
+ /* The size of the region before 'new' is too small to be handled as a chunk
+ * so we cannot call 'free' on it. Instead we either discard the memory or
+ * transfer ownership of it to the previous chunk */
+ if (!(((size_t *)mem)[-2] & -8)) {
+ /* mem->psize has no length, i.e. 'mem' is the first chunk in this mapped
+ * region. In this case we simply discard the memory prior to 'new' by
+ * making 'new' the first chunk of the mapped region. To do this we set
+ * the length of new->psize to 0 with the 'in use' flag set, which equates
+ * to simply setting its value to 1. */
+ ((size_t *)new)[-2] = 1;
+ }
+ else {
+ /* There is a previous chunk, assign ownership of the region before 'new'
+ * to this previous chunk by increasing it's length. */
+ unsigned char *pre = mem - (((size_t *)mem)[-2] & -8);
+ ((size_t *)pre)[-1] += new-mem;
+ ((size_t *)new)[-2] = ((size_t *)pre)[-1];
+ }
+ }
+
return new;
}
On 24 Mar 2022, at 13:59, WILLIAMS Stephen via Devel
I have raised an issue on GitHub for the identified issue (https://github.com/seL4/musllibc/issues/17#issue-1188160949). Please let me know if there is anything else I need to do to formally report the fault to can do to assist with the resolution.
Stephen
On 24 Mar 2022, at 14:17, WILLIAMS Stephen via Devel
participants (3)
-
Kent Mcleod
-
Roderick Chapman
-
WILLIAMS Stephen