Hi Alex,

On 29.07.2015 at 03:19, Alexander Kroh wrote:
> Hi Robert,
>
> Did you end up getting to the bottom of this issue?

No, unfortunately I got somewhat overwhelmed with other stuff :-(

> We still have not fixed it... But the good news is that we have finally been able to reproduce it! We find the same symptoms on the SabreLite board with a different version of U-Boot.

That would support my theory that U-Boot somehow causes the error. Do you know if there is a way to just clear any stale pending data abort conditions during startup, prior to dropping the mask?

Robert

> - Alex
________________________________________
From: Robert Kaiser [robert.kaiser@hs-rm.de]
Sent: Monday, 23 March 2015 08:38
To: Alexander Kroh; devel@sel4.systems
Subject: Re: [seL4] Wandboard Port

Hi Alex,

On 22.03.2015 at 07:08, Alexander Kroh wrote:
> Hi Robert,
>
> Thanks for pointing that out, we will fix those hard coded values ASAP.

Attached is a patch that should do this. Hope it's OK.

> According to section B3.13.4 of the ARMv7 manual, in the case of an async abort, the DFAR (fault address) can't be trusted either. It seems like the only way to get to the bottom of this is to use "isb" instructions to force any pending async abort to occur before execution can continue to an arbitrary point. Unless there is a way to turn off the load/store buffers?

It seems to me like the async abort signal is completely bogus in my case: it hits immediately, as soon as the mask is dropped in the CPSR, no matter what the CPU is doing. The idle thread that crashed in the last test is simply an endless loop that does not load or store any data. All it does is opcode fetches. And if the async abort is kept masked, the entire testsuite completes successfully. I doubt this would be the case if loads or stores really failed (which, after all, is what the async abort is supposed to indicate, as far as I understand).

My current hypothesis is that this is due to some weird misconfiguration by U-Boot.

> I will be very interested to hear what the underlying cause was when you work this one out :)

I hope I will ....

Cheers

Robert

> - Alex
________________________________________
From: Robert Kaiser [robert.kaiser@hs-rm.de]
Sent: Sunday, 22 March 2015 00:40
To: Alexander Kroh; devel@sel4.systems
Subject: Re: [seL4] Wandboard Port

Hi,

On 21.03.2015 at 04:52, Alexander Kroh wrote:
> You have only disabled async aborts. It sounds like the kernel is following a bad pointer which leads to a translation failure. The good news is that the fault address provided can actually be trusted. Note that when you mask async aborts, writes to the invalid physical address are ignored and reads are always 0. Is the kernel following a NULL pointer?

Here are the final words of a test run:
---------------------
</system-out>
</testcase>
<testcase classname="sel4test" name="TEST_DOMAINS0004">
INFO :sel4utils_elf_load_record_regions:270: * Loading segment 00008000-->0004$
INFO :sel4utils_elf_load_record_regions:270: * Loading segment 0004d170-->0016$
Running test DOMAINS0004 (Run threads in domains())

KERNEL DATA ABORT!
Faulting instruction: 0xe00107d8
FAR: 0x1f11c2e0 DFSR: 0x1c06
halting...
---------------------
The fault address (FAR) is never NULL. Even worse, its value varies from test to test. Values I have seen are: 0x1f11c2e0 (most of the time), 0x9f11c2e0 (often), 0x1f11c3e0 (rare). Interestingly, each of the other two values differs from the most frequent one in a single bit only (bit 31 and bit 8, respectively).
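The single-bit observation about the FAR values can be checked with a few lines of C. This is only an illustrative sketch: the helper names bit_distance and differing_bit are made up for this note and are not part of seL4.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bit positions in which a and b differ (Hamming distance). */
unsigned bit_distance(uint32_t a, uint32_t b)
{
    uint32_t diff = a ^ b;
    unsigned count = 0;
    while (diff != 0) {
        diff &= diff - 1; /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* Index of the differing bit. Only valid if exactly one bit differs. */
unsigned differing_bit(uint32_t a, uint32_t b)
{
    uint32_t diff = a ^ b;
    assert(diff != 0 && (diff & (diff - 1)) == 0);
    unsigned index = 0;
    while ((diff >>= 1) != 0) {
        index++;
    }
    return index;
}
```

Applied to the three addresses quoted above, 0x9f11c2e0 and 0x1f11c3e0 each differ from 0x1f11c2e0 in one bit (bit 31 and bit 8), while they differ from each other in two bits.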
The address of the faulting instruction, however, is always the same, and it is the entry point of the idle thread: Apparently, this is the first time during the testsuite where the idle thread gets scheduled. It turns out that the idle thread's CPSR value is hard coded here:
https://github.com/seL4/seL4/blob/master/src/arch/arm/kernel/thread.c#L29
and its value does not disable async faults. So this is the first time code is being executed with async fault exceptions enabled and promptly, the Wandboard crashes here.
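For readers less familiar with the CPSR layout: on ARMv7-A the asynchronous abort mask is the A bit (bit 8), next to the I (bit 7) and F (bit 6) interrupt masks, with the processor mode in bits 4..0. The sketch below only illustrates that bit arithmetic. All macro and function names are hypothetical and do not mirror the constants used in thread.c, and the use of System mode in the usage note is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ARMv7-A CPSR fields (names are illustrative, not seL4's). */
#define CPSR_MODE_MASK   0x1fu
#define CPSR_MODE_USER   0x10u
#define CPSR_MODE_SYSTEM 0x1fu
#define CPSR_F_BIT       (1u << 6) /* FIQ mask */
#define CPSR_I_BIT       (1u << 7) /* IRQ mask */
#define CPSR_A_BIT       (1u << 8) /* asynchronous abort mask */

/* Build an initial CPSR value for a thread running in the given mode. */
uint32_t cpsr_for_mode(uint32_t mode, bool mask_async_aborts)
{
    assert((mode & ~CPSR_MODE_MASK) == 0);
    uint32_t cpsr = mode;
    if (mask_async_aborts) {
        cpsr |= CPSR_A_BIT;
    }
    return cpsr;
}

/* True if asynchronous aborts are masked in this CPSR value. */
bool cpsr_async_aborts_masked(uint32_t cpsr)
{
    return (cpsr & CPSR_A_BIT) != 0;
}
```

With these definitions, cpsr_for_mode(CPSR_MODE_SYSTEM, false) leaves the A bit clear, so a pending async abort would be taken as soon as such a thread runs, whereas cpsr_for_mode(CPSR_MODE_SYSTEM, true) keeps it masked.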
After disabling async faults for the idle thread as well, the testsuite finally completes all the way to the famous "All is well in the universe" :-)
Since the observed fault addresses are varying between tests, I guess I may have some flaky hardware here. Maybe U-Boot (or at least my version thereof) has misconfigured DDR timing in some way. I'll have to look into this and -unless someone wants them- I'll keep my patches to myself until I find a clean solution.
But in the meantime, I have a kernel that actually runs on the Wandboard. And I think I have learned quite a bit from this exercise :-).
Thanks a lot for your help.
Robert
________________________________________
From: Devel [devel-bounces@sel4.systems] on behalf of Robert Kaiser [robert.kaiser@hs-rm.de]
Sent: Saturday, 21 March 2015 06:20
To: devel@sel4.systems
Subject: Re: [seL4] Wandboard Port

Hi,

On 19.03.2015 at 09:34, Robert Kaiser wrote:
> Hi Alex,
>
> On 18.03.2015 at 23:20, Alexander Kroh wrote:
>> Hi Robert,
>>
>> Yes, the async abort is caused by access to a physical address which is not backed by memory or registers, regardless of virtual address translation.
>
> OK, so: iff the page table contains a mapping for user space address 0x13294, but (due to a bug in the page table initialization) that page is mapped to a page frame which is not backed by RAM (or ROM), then an attempt to execute user code at that address would cause an async abort. Is that correct?
>
> If so, it would be great if someone could point me to the code that sets up the page table entries for the first user space thread. (I already did an unsuccessful search for this in the board specific initialization code, but I cannot say that I fully understand that code, so I may well have overlooked something.)
>
>> You could try masking IRQs to further isolate the interrupt as the trigger.

I tried this. Result: no interrupt before the start of user code, but the async fault still occurs in the same way as before. I guess this shows that the interrupt has nothing to do with it.

>> Another option is to mask the async abort. You might find additional symptoms which will help to identify the issue.

Now, that was interesting: after disabling the async abort in user mode (it is always disabled in kernel mode), the board starts executing the test suite! It runs a few tests successfully, but then crashes with a *kernel* data abort when running test "Run threads in domains()". There goes my theory about a memory mapping issue, I guess. But how can it have a kernel mode data abort when it is disabled?

Any ideas?

Cheers

Robert

>> - Alex
________________________________________
From: Robert Kaiser [robert.kaiser@hs-rm.de]
Sent: Wednesday, 18 March 2015 19:27
To: Alexander Kroh
Cc: devel@sel4.systems
Subject: Re: [seL4] Wandboard Port

Hi Alex,

On 16.03.2015 at 02:52, Alexander Kroh wrote:
> On Sun, 2015-03-15 at 15:33 +0100, Robert Kaiser wrote:
>> On 15.03.2015 at 11:23, Alexander Kroh wrote:
>>> Hi Robert,
>>>
>>> The FSR value of 0x1c06 represents an asynchronous abort. In this case, the address reported cannot be trusted!
>> [...]
>>> The abort occurs when a physical address is accessed that has no valid backing RAM or device register.
>> So, could it also happen when accessing a virtual address that is mapped to an invalid physical address (that might explain what I'm seeing)?
>
> The virtual to physical address translation has been completed successfully, else you would get a synchronous abort. The key here is that there was a problem with the underlying physical address.

That's what I meant to suggest: if the virtual address is correctly translated to a physical address by the MMU, but that physical address is not backed by memory or registers, could that also generate this kind of exception?
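As a side note on the status value discussed here, 0x1c06 can be decoded by hand. The following is a minimal sketch, assuming the ARMv7-A short-descriptor DFSR format (LPAE bit 9 clear), in which the 5-bit fault status is DFSR[10] followed by DFSR[3:0] and the encoding 0b10110 denotes an asynchronous external abort. The type and function names are invented for this example and are not taken from the seL4 sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Selected DFSR fields, short-descriptor format (names are illustrative). */
typedef struct {
    uint8_t fs;       /* fault status: DFSR[10] : DFSR[3:0] */
    bool    is_write; /* WnR, DFSR[11]: access was a write */
    bool    ext;      /* ExT, DFSR[12]: implementation defined external abort type */
} dfsr_fields_t;

#define DFSR_LPAE_BIT                (1u << 9)
#define DFSR_FS_ASYNC_EXTERNAL_ABORT 0x16u /* 0b10110 */

dfsr_fields_t dfsr_decode(uint32_t dfsr)
{
    /* This sketch only handles the short-descriptor format. */
    assert((dfsr & DFSR_LPAE_BIT) == 0);

    dfsr_fields_t f;
    f.fs = (uint8_t)((((dfsr >> 10) & 1u) << 4) | (dfsr & 0xfu));
    f.is_write = ((dfsr >> 11) & 1u) != 0;
    f.ext = ((dfsr >> 12) & 1u) != 0;
    return f;
}

bool dfsr_is_async_external_abort(uint32_t dfsr)
{
    return dfsr_decode(dfsr).fs == DFSR_FS_ASYNC_EXTERNAL_ABORT;
}
```

For 0x1c06 this yields fs = 0b10110 with the WnR bit set, which matches the reading of the value as an asynchronous abort.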
>>> We have had lots of fun with this feature on the SabreLite. Common causes are:
>>> * Accessing device registers that do not exist (some devices have voids in the middle of their address map).
>>> * If you (for some reason) map a device with the cacheable attribute, all addresses which would be used to fill the cache line must be valid (again, watch out for voids).
>>> * Some UART registers are unavailable when the appropriate enable bits are not set.
>>>
>>> My advice to you is to check that you are using the correct physical address for your device mappings (including the kernel IRQ controller and timer).
>>>
>>> Also, the first printf at userspace may trigger the initialisation of the default UART (which will be incorrect in your case).
>>> https://github.com/seL4/libplatsupport/blob/master/plat_include/imx6/platsup...
>> Thanks for this hint! That would have been the next thing for me to stumble over. However, quickly fixing it had no effect on my current problem.
>>
>>> There may also be slight differences in the availability of device registers between the 2 SoCs.
>> Is that really a possibility, given that U-Boot reports the same chip revision on both boards?
>
> It is unlikely, but it is still a possibility. Is it only the ARM chip revisions that match or also the i.MX6 chip revisions?

Hmm, I'm sure I saw exactly the same outputs from both boards at some point. However, in the meantime I have re-flashed U-Boot on both of them. The situation now is that on the Sabre, U-Boot reports

"CPU: Freescale i.MX6 family TO1.2 at 792 MHz"

while on the Wand it says:

"CPU: Freescale i.MX6Q rev1.2 at 792 MHz"

No idea whether that "1.2" refers to the core or the SoC.
>> [...]
>> Wish I had a JTAG-debugger....
>>
>> What I am still uncertain about is whether a fault upon entering user code is to be expected, i.e. do those pages get mapped in by a page fault handler or are they pre-mapped before the code is invoked?
>
> The fault is unexpected. The pages are pre-mapped by the kernel, but again, this is not a virtual memory mapping issue. However, one thing that is typical is the occurrence of an IRQ exception as soon as the mode switch to user space occurs.

Indeed, that happens! I'm consistently seeing a timer interrupt at this point. Probably it has been pending for a while and fires as soon as the interrupt mask is dropped. Apart from its housekeeping work, this timer ISR does a few hardware accesses to the "private timer" and the interrupt controller (both components, as I understand, are part of the A9 core).

I tried putting isb, dmb and dsb instructions right after these hardware accesses, hoping this might change the behaviour in some way, thus indicating which of them triggered the async fault. Alas, no effect at all :-(.

> One thing to try is to insert an "isb" instruction just before switching to user space. This will ensure that all memory accesses are completed before continuing and it will force the asynchronous abort to occur at this instruction rather than some future instruction, when the load/store buffer finally drains. You should also add an isb here in case you are returning from an IRQ:
> https://github.com/seL4/seL4/blob/master/src/arch/arm/traps.S#L49

I also tried this. And I tried sequences of dmb, dsb and isb instructions. All of this had no visible effect. The behaviour stays the same all the time: upon leaving privileged mode, the interrupt fires, gets serviced, then the async fault happens. I know the fault address cannot be trusted, but it never changed during these experiments. No matter where in the ISR or elsewhere I placed those isb instructions, it always pointed to the entry point of the user code.

Any suggestions how to further systematically pinpoint this problem?

Thanks in advance for any help.

Robert

> - Alex
>> Again, thanks for any help
>>
>> Cheers
>>
>> Robert
>>
>>> - Alex
>>>
>>> ________________________________________
>>> From: Devel [devel-bounces@sel4.systems] on behalf of Robert Kaiser [robert.kaiser@hs-rm.de]
>>> Sent: Sunday, 15 March 2015 19:03
>>> To: devel@sel4.systems
>>> Subject: [seL4] Wandboard Port
>>>
>>> Hello,
>>>
>>> in an attempt to familiarize myself with the seL4 code, I am trying to "port" it to the Wandboard (see www.wandboard.org). This should be an easy task for a beginner (thought I) since the board is very similar to the SabreLite, and seL4 is already running well on that board. I have access to a SabreLite and a Wandboard Quad, both (according to U-Boot) have the same revision of the iMX6 SoC installed.
>>>
>>> Differences between the Sabre and the Wand I have noticed so far are:
>>>
>>> - 2GB of RAM (from 0x10000000 to 0x90000000) on the Wand (SabreLite has 1GB)
>>> - Wand uses UART1 for debug output, SabreLite: UART2
>>>
>>> I compiled an sel4test project where I adapted the UART port in kernel/include/plat/imx6/plat/machine/devices.h and elfloader/src/arch-arm/plat-imx6/platform.h and the RAM size in kernel src/plat/imx6/machine/hardware.c. When I boot this system, I get:
>>>
>>> Jumping to kernel-image entry point...
>>> Bootstrapping kernel
>>> Caught cap fault in send phase at address 0x0
>>> while trying to handle:
>>> vm fault on data at address 0x9f11c2e0 with status 0x1c06
>>> in thread 0xffdfad00 at address 0x13294
>>>
>>> (Needless to say, "all is well in the universe" on the SabreLite... )
>>> What is not shown here are a ton of other debug messages which I have added to convince myself that kernel initialization completes as expected. The crash seems to happen upon entry into user code. The address 0x13294 is the virtual address of the entry point:
>>>
>>> $ nm build/arm/imx6/sel4test-driver/sel4test-driver.bin | grep 13294
>>> 00013294 T _sel4_start
>>>
>>> I suspect that this fault happens on opcode fetch, because the user code is not properly mapped when invoked. Does "status 0x1c06" confirm this?
>>>
>>> If so, *should* the code be mapped at this point or are these mappings expected to be installed "on demand", i.e. through page fault handling?
>>>
>>> Thanks for any help...
>>>
>>> Robert
>>>
>>> --
>>> Robert Kaiser
>>> Computer Engineering
>>> RheinMain University of Applied Sciences
>>>
>>> _______________________________________________
>>> Devel mailing list
>>> Devel@sel4.systems
>>> https://sel4.systems/lists/listinfo/devel
>>>
>>> ________________________________
>>>
>>> The information in this e-mail may be confidential and subject to legal professional privilege and/or copyright. National ICT Australia Limited accepts no liability for any damage caused by this email or its attachments.

--
Prof. Dr. Robert Kaiser
Technische Informatik Hochschule RheinMain Wiesbaden Rüsselsheim
Computer Engineering RheinMain University of Applied Sciences
robert.kaiser@hs-rm.de http://www.cs.hs-rm.de/~kaiser
tel:(+49)611-9495-1292 fax:(+49)611-9495-1210
Postanschrift/Postal Address: Robert Kaiser, Hochschule RheinMain, FB DCSM/Informatik Unter den Eichen 5, 65195 Wiesbaden, Germany