Hi Alex,

I have tracked this down to the kernel failing to unmap pages in some cases when deleting frame caps, specifically it would only unmap if deleting the last capability to a frame, which is clearly wrong. As you say, this was only able to manifest as a problem because your user space was already miss-behaving and attempting to access memory that it shouldn't, because it should've already been unmapped and just caused a user fault.

This bug would have been caught eventually by verification, they just have not gotten to that point of the C refinement for x86-64 yet. Note that this bug is *not* present on ARM.

Fix should be out on master shortly.

Thanks again for reporting this.

Adrian

On Fri 25-Aug-2017 6:49 AM, Alexander Boettcher wrote:
Hello,

we're are in progress of extending our Genode/seL4 support, namely
adding x86_64 and ARM support.

During this work I encountered on x86_64 some sporadically crashes of
the seL4 kernel. It happened during the debugging of our roottask on
seL4 in form of a page fault in the kernel ("KERNEL EXCEPTION") or a
static assertion in the kernel (see below).

From my understanding, this shouldn't be possible based on your
verification work. Nevertheless, it happens reliable.

(To be fair, the roottask was in this scenario already a bit confused
and did things wrong - which however should never be enough to force the
kernel in an exception or assertion - I presume.)

I saved and stripped down the scenario which causes the kernel exception
with high reliably and hope you can re-produce it and find the root
cause. I stripped down the scenario to just the Genode image binary and
the steps we did to create the seL4 kernel from the vanilla sources.

Steps to reproduce:

 qemu-system-x86_64 -no-kvm -serial stdio -m 512 -kernel sel4 -initrd
bomb_image.elf

The binaries of "bomb_image.elf" can be found at [1] and the kernel
"sel4" at [2].

The kernel is based on seL4 6.0 and a patch to pc99/autoconf.h (see
[0]). Beside that it is the vanilla 6.0 kernel:

 https://github.com/seL4/seL4.git

 git commit 8564ace4dfb622ec69e0f7d762ebfbc8552ec918
 "update VERSION file to 6.0.0"

 patch -p1 <~autoconfig.patch

 BOARD=x86_64 ARCH=x86 SEL4_ARCH=x86_64 PLAT=pc99 DEBUG=1 make

 objcopy -O elf32-i386 kernel.elf.strip sel4
 (Qemu complains otherwise about 64bit elf for boot)

The final "sel4" binary can be used with the "bomb_image.elf" and qemu
as pointed out above.

I tried Qemu 2.5.0 as shipped with Ubuntu 16.04 LTS and Qemu 2.8.0
(self-compiled). On real hardware the scenario typically leads after a
while (kernel booted, roottask booted, scenario starts to run) to a
reboot (without the messages or assertion as in the Qemu case, see below).

The tool-chain for Genode is based on GNU GCC 6.3.0 (see [3] and [4]).

The kernel was either compiled with Genode's toolchain (6.3.0) or with
the standard GCC as provided by Ubuntu 16.04 LTS (GCC 5.4.0).

On x86_32 I could not reproduce the issue. (On ARM I did not try.)

I hope the information are helpful - if you need more just ask.


Cheers,

Alex.



========== KERNEL EXCEPTION ==========
Vector:  0xe
ErrCode: 0x0
IP:      0xffffffff80720e60
SP:      0xffffffff80740e08
FLAGS:   0x6
CR0:     0x8000003b
CR2:     0x3007065 (page-fault address)
CR3:     0x1ffb000 (page-directory physical address)
CR4:     0x220

Stack Dump:
*0xffffffff80740e08 == 0xffffff8001a99b00
*0xffffffff80740e10 == 0xffffffff80729b1f
*0xffffffff80740e18 == 0x1
*0xffffffff80740e20 == 0xffffffff80721c59
*0xffffffff80740e28 == 0xffffffff
*0xffffffff80740e30 == 0x0
*0xffffffff80740e38 == 0xffffff801f02b740
*0xffffffff80740e40 == 0xffffff801f02b740
*0xffffffff80740e48 == 0x1
*0xffffffff80740e50 == 0xffffff8001a99000
*0xffffffff80740e58 == 0x7
*0xffffffff80740e60 == 0x59
*0xffffffff80740e68 == 0xffffff8001a99b00
*0xffffffff80740e70 == 0xffffffff80729cf3
*0xffffffff80740e78 == 0x9000000000000007
*0xffffffff80740e80 == 0xffffff8001a99059
*0xffffffff80740e88 == 0xffffff80ffffffff
*0xffffffff80740e90 == 0xffffffff80740f18
*0xffffffff80740e98 == 0xffffff800013c200
*0xffffffff80740ea0 == 0x10000

Halting...
halting...
Kernel entry via Syscall, number: 1, Call
Cap type: 10, Invocation tag: 16


or



seL4 failed assertion
'(cte_t*)mdb_node_get_mdbNext(destSlot->cteMDBNode) == NULL &&
(cte_t*)mdb_node_get_mdbPrev(destSlot->cteMDBNode) == NULL' at
src/object/cnode.c:457 in function cteInsert
halting...
Kernel entry via Syscall, number: 1, Call
Cap type: 10, Invocation tag: 18



or a fault by the roottask (and not in the kernel). So - this is no
issue for the kernel:


Caught cap fault in send phase at address 0x0
while trying to handle:
vm fault on data at address 0x18 with status 0x4
in thread 0xffffff800013c200 "child of: 'rootserver'" at address 0x20419cf



[0]
https://github.com/alex-ab/genode/blob/crash_sel4_kernel_vanilla/autoconfig.patch
[1]
https://github.com/alex-ab/genode/raw/crash_sel4_kernel_vanilla/image_bomb.elf
[2] https://github.com/alex-ab/genode/raw/crash_sel4_kernel_vanilla/sel4

[3] https://genode.org/download/tool-chain
[4] https://sourceforge.net/projects/genode/files/genode-toolchain/17.05/