Hello, we're are in progress of extending our Genode/seL4 support, namely adding x86_64 and ARM support. During this work I encountered on x86_64 some sporadically crashes of the seL4 kernel. It happened during the debugging of our roottask on seL4 in form of a page fault in the kernel ("KERNEL EXCEPTION") or a static assertion in the kernel (see below).
From my understanding, this shouldn't be possible based on your verification work. Nevertheless, it happens reliable.
(To be fair, the roottask was in this scenario already a bit confused and did things wrong - which however should never be enough to force the kernel in an exception or assertion - I presume.) I saved and stripped down the scenario which causes the kernel exception with high reliably and hope you can re-produce it and find the root cause. I stripped down the scenario to just the Genode image binary and the steps we did to create the seL4 kernel from the vanilla sources. Steps to reproduce: qemu-system-x86_64 -no-kvm -serial stdio -m 512 -kernel sel4 -initrd bomb_image.elf The binaries of "bomb_image.elf" can be found at [1] and the kernel "sel4" at [2]. The kernel is based on seL4 6.0 and a patch to pc99/autoconf.h (see [0]). Beside that it is the vanilla 6.0 kernel: https://github.com/seL4/seL4.git git commit 8564ace4dfb622ec69e0f7d762ebfbc8552ec918 "update VERSION file to 6.0.0" patch -p1 <~autoconfig.patch BOARD=x86_64 ARCH=x86 SEL4_ARCH=x86_64 PLAT=pc99 DEBUG=1 make objcopy -O elf32-i386 kernel.elf.strip sel4 (Qemu complains otherwise about 64bit elf for boot) The final "sel4" binary can be used with the "bomb_image.elf" and qemu as pointed out above. I tried Qemu 2.5.0 as shipped with Ubuntu 16.04 LTS and Qemu 2.8.0 (self-compiled). On real hardware the scenario typically leads after a while (kernel booted, roottask booted, scenario starts to run) to a reboot (without the messages or assertion as in the Qemu case, see below). The tool-chain for Genode is based on GNU GCC 6.3.0 (see [3] and [4]). The kernel was either compiled with Genode's toolchain (6.3.0) or with the standard GCC as provided by Ubuntu 16.04 LTS (GCC 5.4.0). On x86_32 I could not reproduce the issue. (On ARM I did not try.) I hope the information are helpful - if you need more just ask. Cheers, Alex. ========== KERNEL EXCEPTION ========== Vector: 0xe ErrCode: 0x0 IP: 0xffffffff80720e60 SP: 0xffffffff80740e08 FLAGS: 0x6 CR0: 0x8000003b CR2: 0x3007065 (page-fault address) CR3: 0x1ffb000 (page-directory physical address) CR4: 0x220 Stack Dump: *0xffffffff80740e08 == 0xffffff8001a99b00 *0xffffffff80740e10 == 0xffffffff80729b1f *0xffffffff80740e18 == 0x1 *0xffffffff80740e20 == 0xffffffff80721c59 *0xffffffff80740e28 == 0xffffffff *0xffffffff80740e30 == 0x0 *0xffffffff80740e38 == 0xffffff801f02b740 *0xffffffff80740e40 == 0xffffff801f02b740 *0xffffffff80740e48 == 0x1 *0xffffffff80740e50 == 0xffffff8001a99000 *0xffffffff80740e58 == 0x7 *0xffffffff80740e60 == 0x59 *0xffffffff80740e68 == 0xffffff8001a99b00 *0xffffffff80740e70 == 0xffffffff80729cf3 *0xffffffff80740e78 == 0x9000000000000007 *0xffffffff80740e80 == 0xffffff8001a99059 *0xffffffff80740e88 == 0xffffff80ffffffff *0xffffffff80740e90 == 0xffffffff80740f18 *0xffffffff80740e98 == 0xffffff800013c200 *0xffffffff80740ea0 == 0x10000 Halting... halting... Kernel entry via Syscall, number: 1, Call Cap type: 10, Invocation tag: 16 or seL4 failed assertion '(cte_t*)mdb_node_get_mdbNext(destSlot->cteMDBNode) == NULL && (cte_t*)mdb_node_get_mdbPrev(destSlot->cteMDBNode) == NULL' at src/object/cnode.c:457 in function cteInsert halting... Kernel entry via Syscall, number: 1, Call Cap type: 10, Invocation tag: 18 or a fault by the roottask (and not in the kernel). So - this is no issue for the kernel: Caught cap fault in send phase at address 0x0 while trying to handle: vm fault on data at address 0x18 with status 0x4 in thread 0xffffff800013c200 "child of: 'rootserver'" at address 0x20419cf [0] https://github.com/alex-ab/genode/blob/crash_sel4_kernel_vanilla/autoconfig.... [1] https://github.com/alex-ab/genode/raw/crash_sel4_kernel_vanilla/image_bomb.e... [2] https://github.com/alex-ab/genode/raw/crash_sel4_kernel_vanilla/sel4 [3] https://genode.org/download/tool-chain [4] https://sourceforge.net/projects/genode/files/genode-toolchain/17.05/ -- Alexander Boettcher Genode Labs http://www.genode-labs.com - http://www.genode.org Genode Labs GmbH - Amtsgericht Dresden - HRB 28424 - Sitz Dresden Geschäftsführer: Dr.-Ing. Norman Feske, Christian Helmuth
Hi Alex, Many thanks for the detailed report. I should point out that the x86-64 version of the kernel is presently undergoing verification and so is not yet up to the same guarantees as the verified ARM platforms. I will look into this right away and let you know how I go. Adrian On Fri 25-Aug-2017 6:49 AM, Alexander Boettcher wrote: Hello, we're are in progress of extending our Genode/seL4 support, namely adding x86_64 and ARM support. During this work I encountered on x86_64 some sporadically crashes of the seL4 kernel. It happened during the debugging of our roottask on seL4 in form of a page fault in the kernel ("KERNEL EXCEPTION") or a static assertion in the kernel (see below). From my understanding, this shouldn't be possible based on your verification work. Nevertheless, it happens reliable. (To be fair, the roottask was in this scenario already a bit confused and did things wrong - which however should never be enough to force the kernel in an exception or assertion - I presume.) I saved and stripped down the scenario which causes the kernel exception with high reliably and hope you can re-produce it and find the root cause. I stripped down the scenario to just the Genode image binary and the steps we did to create the seL4 kernel from the vanilla sources. Steps to reproduce: qemu-system-x86_64 -no-kvm -serial stdio -m 512 -kernel sel4 -initrd bomb_image.elf The binaries of "bomb_image.elf" can be found at [1] and the kernel "sel4" at [2]. The kernel is based on seL4 6.0 and a patch to pc99/autoconf.h (see [0]). Beside that it is the vanilla 6.0 kernel: https://github.com/seL4/seL4.git git commit 8564ace4dfb622ec69e0f7d762ebfbc8552ec918 "update VERSION file to 6.0.0" patch -p1 <~autoconfig.patch BOARD=x86_64 ARCH=x86 SEL4_ARCH=x86_64 PLAT=pc99 DEBUG=1 make objcopy -O elf32-i386 kernel.elf.strip sel4 (Qemu complains otherwise about 64bit elf for boot) The final "sel4" binary can be used with the "bomb_image.elf" and qemu as pointed out above. I tried Qemu 2.5.0 as shipped with Ubuntu 16.04 LTS and Qemu 2.8.0 (self-compiled). On real hardware the scenario typically leads after a while (kernel booted, roottask booted, scenario starts to run) to a reboot (without the messages or assertion as in the Qemu case, see below). The tool-chain for Genode is based on GNU GCC 6.3.0 (see [3] and [4]). The kernel was either compiled with Genode's toolchain (6.3.0) or with the standard GCC as provided by Ubuntu 16.04 LTS (GCC 5.4.0). On x86_32 I could not reproduce the issue. (On ARM I did not try.) I hope the information are helpful - if you need more just ask. Cheers, Alex. ========== KERNEL EXCEPTION ========== Vector: 0xe ErrCode: 0x0 IP: 0xffffffff80720e60 SP: 0xffffffff80740e08 FLAGS: 0x6 CR0: 0x8000003b CR2: 0x3007065 (page-fault address) CR3: 0x1ffb000 (page-directory physical address) CR4: 0x220 Stack Dump: *0xffffffff80740e08 == 0xffffff8001a99b00 *0xffffffff80740e10 == 0xffffffff80729b1f *0xffffffff80740e18 == 0x1 *0xffffffff80740e20 == 0xffffffff80721c59 *0xffffffff80740e28 == 0xffffffff *0xffffffff80740e30 == 0x0 *0xffffffff80740e38 == 0xffffff801f02b740 *0xffffffff80740e40 == 0xffffff801f02b740 *0xffffffff80740e48 == 0x1 *0xffffffff80740e50 == 0xffffff8001a99000 *0xffffffff80740e58 == 0x7 *0xffffffff80740e60 == 0x59 *0xffffffff80740e68 == 0xffffff8001a99b00 *0xffffffff80740e70 == 0xffffffff80729cf3 *0xffffffff80740e78 == 0x9000000000000007 *0xffffffff80740e80 == 0xffffff8001a99059 *0xffffffff80740e88 == 0xffffff80ffffffff *0xffffffff80740e90 == 0xffffffff80740f18 *0xffffffff80740e98 == 0xffffff800013c200 *0xffffffff80740ea0 == 0x10000 Halting... halting... Kernel entry via Syscall, number: 1, Call Cap type: 10, Invocation tag: 16 or seL4 failed assertion '(cte_t*)mdb_node_get_mdbNext(destSlot->cteMDBNode) == NULL && (cte_t*)mdb_node_get_mdbPrev(destSlot->cteMDBNode) == NULL' at src/object/cnode.c:457 in function cteInsert halting... Kernel entry via Syscall, number: 1, Call Cap type: 10, Invocation tag: 18 or a fault by the roottask (and not in the kernel). So - this is no issue for the kernel: Caught cap fault in send phase at address 0x0 while trying to handle: vm fault on data at address 0x18 with status 0x4 in thread 0xffffff800013c200 "child of: 'rootserver'" at address 0x20419cf [0] https://github.com/alex-ab/genode/blob/crash_sel4_kernel_vanilla/autoconfig.... [1] https://github.com/alex-ab/genode/raw/crash_sel4_kernel_vanilla/image_bomb.e... [2] https://github.com/alex-ab/genode/raw/crash_sel4_kernel_vanilla/sel4 [3] https://genode.org/download/tool-chain [4] https://sourceforge.net/projects/genode/files/genode-toolchain/17.05/
Hi Alex, I have tracked this down to the kernel failing to unmap pages in some cases when deleting frame caps, specifically it would only unmap if deleting the last capability to a frame, which is clearly wrong. As you say, this was only able to manifest as a problem because your user space was already miss-behaving and attempting to access memory that it shouldn't, because it should've already been unmapped and just caused a user fault. This bug would have been caught eventually by verification, they just have not gotten to that point of the C refinement for x86-64 yet. Note that this bug is *not* present on ARM. Fix should be out on master shortly. Thanks again for reporting this. Adrian On Fri 25-Aug-2017 6:49 AM, Alexander Boettcher wrote: Hello, we're are in progress of extending our Genode/seL4 support, namely adding x86_64 and ARM support. During this work I encountered on x86_64 some sporadically crashes of the seL4 kernel. It happened during the debugging of our roottask on seL4 in form of a page fault in the kernel ("KERNEL EXCEPTION") or a static assertion in the kernel (see below). From my understanding, this shouldn't be possible based on your verification work. Nevertheless, it happens reliable. (To be fair, the roottask was in this scenario already a bit confused and did things wrong - which however should never be enough to force the kernel in an exception or assertion - I presume.) I saved and stripped down the scenario which causes the kernel exception with high reliably and hope you can re-produce it and find the root cause. I stripped down the scenario to just the Genode image binary and the steps we did to create the seL4 kernel from the vanilla sources. Steps to reproduce: qemu-system-x86_64 -no-kvm -serial stdio -m 512 -kernel sel4 -initrd bomb_image.elf The binaries of "bomb_image.elf" can be found at [1] and the kernel "sel4" at [2]. The kernel is based on seL4 6.0 and a patch to pc99/autoconf.h (see [0]). Beside that it is the vanilla 6.0 kernel: https://github.com/seL4/seL4.git git commit 8564ace4dfb622ec69e0f7d762ebfbc8552ec918 "update VERSION file to 6.0.0" patch -p1 <~autoconfig.patch BOARD=x86_64 ARCH=x86 SEL4_ARCH=x86_64 PLAT=pc99 DEBUG=1 make objcopy -O elf32-i386 kernel.elf.strip sel4 (Qemu complains otherwise about 64bit elf for boot) The final "sel4" binary can be used with the "bomb_image.elf" and qemu as pointed out above. I tried Qemu 2.5.0 as shipped with Ubuntu 16.04 LTS and Qemu 2.8.0 (self-compiled). On real hardware the scenario typically leads after a while (kernel booted, roottask booted, scenario starts to run) to a reboot (without the messages or assertion as in the Qemu case, see below). The tool-chain for Genode is based on GNU GCC 6.3.0 (see [3] and [4]). The kernel was either compiled with Genode's toolchain (6.3.0) or with the standard GCC as provided by Ubuntu 16.04 LTS (GCC 5.4.0). On x86_32 I could not reproduce the issue. (On ARM I did not try.) I hope the information are helpful - if you need more just ask. Cheers, Alex. ========== KERNEL EXCEPTION ========== Vector: 0xe ErrCode: 0x0 IP: 0xffffffff80720e60 SP: 0xffffffff80740e08 FLAGS: 0x6 CR0: 0x8000003b CR2: 0x3007065 (page-fault address) CR3: 0x1ffb000 (page-directory physical address) CR4: 0x220 Stack Dump: *0xffffffff80740e08 == 0xffffff8001a99b00 *0xffffffff80740e10 == 0xffffffff80729b1f *0xffffffff80740e18 == 0x1 *0xffffffff80740e20 == 0xffffffff80721c59 *0xffffffff80740e28 == 0xffffffff *0xffffffff80740e30 == 0x0 *0xffffffff80740e38 == 0xffffff801f02b740 *0xffffffff80740e40 == 0xffffff801f02b740 *0xffffffff80740e48 == 0x1 *0xffffffff80740e50 == 0xffffff8001a99000 *0xffffffff80740e58 == 0x7 *0xffffffff80740e60 == 0x59 *0xffffffff80740e68 == 0xffffff8001a99b00 *0xffffffff80740e70 == 0xffffffff80729cf3 *0xffffffff80740e78 == 0x9000000000000007 *0xffffffff80740e80 == 0xffffff8001a99059 *0xffffffff80740e88 == 0xffffff80ffffffff *0xffffffff80740e90 == 0xffffffff80740f18 *0xffffffff80740e98 == 0xffffff800013c200 *0xffffffff80740ea0 == 0x10000 Halting... halting... Kernel entry via Syscall, number: 1, Call Cap type: 10, Invocation tag: 16 or seL4 failed assertion '(cte_t*)mdb_node_get_mdbNext(destSlot->cteMDBNode) == NULL && (cte_t*)mdb_node_get_mdbPrev(destSlot->cteMDBNode) == NULL' at src/object/cnode.c:457 in function cteInsert halting... Kernel entry via Syscall, number: 1, Call Cap type: 10, Invocation tag: 18 or a fault by the roottask (and not in the kernel). So - this is no issue for the kernel: Caught cap fault in send phase at address 0x0 while trying to handle: vm fault on data at address 0x18 with status 0x4 in thread 0xffffff800013c200 "child of: 'rootserver'" at address 0x20419cf [0] https://github.com/alex-ab/genode/blob/crash_sel4_kernel_vanilla/autoconfig.... [1] https://github.com/alex-ab/genode/raw/crash_sel4_kernel_vanilla/image_bomb.e... [2] https://github.com/alex-ab/genode/raw/crash_sel4_kernel_vanilla/sel4 [3] https://genode.org/download/tool-chain [4] https://sourceforge.net/projects/genode/files/genode-toolchain/17.05/
Hi Adrian, On 30.08.2017 09:58, Adrian.Danis@data61.csiro.au wrote:
I have tracked this down to the kernel failing to unmap pages in some cases when deleting frame caps, specifically it would only unmap if deleting the last capability to a frame, which is clearly wrong. As you say, this was only able to manifest as a problem because your user space was already miss-behaving and attempting to access memory that it shouldn't, because it should've already been unmapped and just caused a user fault.
This bug would have been caught eventually by verification, they just have not gotten to that point of the C refinement for x86-64 yet. Note that this bug is *not* present on ARM.
Fix should be out on master shortly.
Great ! I'm looking forward to the updated kernel. Thanks, Alex.
Thanks again for reporting this.
Adrian
On Fri 25-Aug-2017 6:49 AM, Alexander Boettcher wrote:
Hello,
we're are in progress of extending our Genode/seL4 support, namely adding x86_64 and ARM support.
During this work I encountered on x86_64 some sporadically crashes of the seL4 kernel. It happened during the debugging of our roottask on seL4 in form of a page fault in the kernel ("KERNEL EXCEPTION") or a static assertion in the kernel (see below).
From my understanding, this shouldn't be possible based on your verification work. Nevertheless, it happens reliable.
(To be fair, the roottask was in this scenario already a bit confused and did things wrong - which however should never be enough to force the kernel in an exception or assertion - I presume.)
I saved and stripped down the scenario which causes the kernel exception with high reliably and hope you can re-produce it and find the root cause. I stripped down the scenario to just the Genode image binary and the steps we did to create the seL4 kernel from the vanilla sources.
Steps to reproduce:
qemu-system-x86_64 -no-kvm -serial stdio -m 512 -kernel sel4 -initrd bomb_image.elf
The binaries of "bomb_image.elf" can be found at [1] and the kernel "sel4" at [2].
The kernel is based on seL4 6.0 and a patch to pc99/autoconf.h (see [0]). Beside that it is the vanilla 6.0 kernel:
https://github.com/seL4/seL4.git
git commit 8564ace4dfb622ec69e0f7d762ebfbc8552ec918 "update VERSION file to 6.0.0"
patch -p1 <~autoconfig.patch
BOARD=x86_64 ARCH=x86 SEL4_ARCH=x86_64 PLAT=pc99 DEBUG=1 make
objcopy -O elf32-i386 kernel.elf.strip sel4 (Qemu complains otherwise about 64bit elf for boot)
The final "sel4" binary can be used with the "bomb_image.elf" and qemu as pointed out above.
I tried Qemu 2.5.0 as shipped with Ubuntu 16.04 LTS and Qemu 2.8.0 (self-compiled). On real hardware the scenario typically leads after a while (kernel booted, roottask booted, scenario starts to run) to a reboot (without the messages or assertion as in the Qemu case, see below).
The tool-chain for Genode is based on GNU GCC 6.3.0 (see [3] and [4]).
The kernel was either compiled with Genode's toolchain (6.3.0) or with the standard GCC as provided by Ubuntu 16.04 LTS (GCC 5.4.0).
On x86_32 I could not reproduce the issue. (On ARM I did not try.)
I hope the information are helpful - if you need more just ask.
Cheers,
Alex.
========== KERNEL EXCEPTION ========== Vector: 0xe ErrCode: 0x0 IP: 0xffffffff80720e60 SP: 0xffffffff80740e08 FLAGS: 0x6 CR0: 0x8000003b CR2: 0x3007065 (page-fault address) CR3: 0x1ffb000 (page-directory physical address) CR4: 0x220
Stack Dump: *0xffffffff80740e08 == 0xffffff8001a99b00 *0xffffffff80740e10 == 0xffffffff80729b1f *0xffffffff80740e18 == 0x1 *0xffffffff80740e20 == 0xffffffff80721c59 *0xffffffff80740e28 == 0xffffffff *0xffffffff80740e30 == 0x0 *0xffffffff80740e38 == 0xffffff801f02b740 *0xffffffff80740e40 == 0xffffff801f02b740 *0xffffffff80740e48 == 0x1 *0xffffffff80740e50 == 0xffffff8001a99000 *0xffffffff80740e58 == 0x7 *0xffffffff80740e60 == 0x59 *0xffffffff80740e68 == 0xffffff8001a99b00 *0xffffffff80740e70 == 0xffffffff80729cf3 *0xffffffff80740e78 == 0x9000000000000007 *0xffffffff80740e80 == 0xffffff8001a99059 *0xffffffff80740e88 == 0xffffff80ffffffff *0xffffffff80740e90 == 0xffffffff80740f18 *0xffffffff80740e98 == 0xffffff800013c200 *0xffffffff80740ea0 == 0x10000
Halting... halting... Kernel entry via Syscall, number: 1, Call Cap type: 10, Invocation tag: 16
or
seL4 failed assertion '(cte_t*)mdb_node_get_mdbNext(destSlot->cteMDBNode) == NULL && (cte_t*)mdb_node_get_mdbPrev(destSlot->cteMDBNode) == NULL' at src/object/cnode.c:457 in function cteInsert halting... Kernel entry via Syscall, number: 1, Call Cap type: 10, Invocation tag: 18
or a fault by the roottask (and not in the kernel). So - this is no issue for the kernel:
Caught cap fault in send phase at address 0x0 while trying to handle: vm fault on data at address 0x18 with status 0x4 in thread 0xffffff800013c200 "child of: 'rootserver'" at address 0x20419cf
[0] https://github.com/alex-ab/genode/blob/crash_sel4_kernel_vanilla/autoconfig.... [1] https://github.com/alex-ab/genode/raw/crash_sel4_kernel_vanilla/image_bomb.e... [2] https://github.com/alex-ab/genode/raw/crash_sel4_kernel_vanilla/sel4
[3] https://genode.org/download/tool-chain [4] https://sourceforge.net/projects/genode/files/genode-toolchain/17.05/
-- Alexander Boettcher Genode Labs http://www.genode-labs.com - http://www.genode.org Genode Labs GmbH - Amtsgericht Dresden - HRB 28424 - Sitz Dresden Geschäftsführer: Dr.-Ing. Norman Feske, Christian Helmuth
participants (2)
-
Adrian.Danis@data61.csiro.au
-
Alexander Boettcher