Hello.

I have a very complicated problem. I understand that it is hard to give advice without looking at the sources, but I am stuck and need some ideas, in 'brainstorm' format, as input.

So, here is the problem. Many tests from the sel4test suite are working. This is what it looks like: http://pastebin.com/vvnDaUe9

Unfortunately, I found several test configurations that crash with stable but different symptoms. One of them looks like this:

    11 #include <sel4test/test.h>
    12 #include "../test.h"
    13 #include "../helpers.h"
    14
    15 #define MIN_EXPECTED_ALLOCATIONS 100
    16
    17
    18
    19
    20
    21
    22
    23 int test_allocator(env_t env)
    24 {
    25     /* Perform a bunch of allocations and frees */
    26     vka_object_t endpoint;
    27     int error;
    28
    29     for (int i = 0; i < MIN_EXPECTED_ALLOCATIONS; i++) {
    30         error = vka_alloc_endpoint(&env->vka, &endpoint);
    31         test_assert(error == 0);
    32         test_assert(endpoint.cptr != 0);
    33         vka_free_object(&env->vka, &endpoint);
    34     }
    35
    36     return sel4test_get_result();
    37 }
    38 DEFINE_TEST(TRIVIAL0001, "Ensure the allocator works", test_allocator)
    39 DEFINE_TEST(TRIVIAL0002, "Ensure the allocator works more than once", test_allocator)

As you can see, this is a slightly modified version of the trivial.c tests. Usually, when I run all tests, there is no problem with this test. But when I run this file alone, I have a problem. You can also see that there are several empty lines, numbered 16 to 22. This is not accidental; it is the source of the error. If test_allocator(env_t env) is located on line 23, the test has no problem. But if I add one more empty line, I get an error like this:

    Starting test suite sel4test
    Starting test 0: TEST_TRIVIAL0001
    TEST_TRIVIAL0001 726:trap!, address error on store!
    $0 : 0 fffffff8 7 10089000
    $4 : 6a6ffc 3f 0 ffffffff
    $8 : 1 7 1 20
    $12 : 70f 0 2 6ae
    $16 : 6a6ffc 5a6a50 ffffff30 6a6ffc
    $20 : 3f 5a6e58 5a6a50 6a6fdc
    $24 : 20 41bab0 0 0
    $28 : 0 100848d8 fffffffe 41faf0
    Hi : 20
    Lo : 0
    epc : 41f818
    fi : 41f814
    ra : 41faf0
    Status: 20000002
    Cause : 414
    Config: 80008482
    BadVaddr: f

This means that the test driver (I know the ASID) tries to store something to address 0x0000000f with the instruction located at 0x41f818 (EPC). Also, it fails before the trivial.c test even starts! This is what I see at that address:

    0041f7b4 <unbin>:
    41f7b4: 8c83000c lw v1,12(a0)
    41f7b8: 27bdffd8 addiu sp,sp,-40
    41f7bc: 8c820008 lw v0,8(a0)
    41f7c0: afb0001c sw s0,28(sp)
    41f7c4: 00808021 move s0,a0
    41f7c8: afbf0024 sw ra,36(sp)
    41f7cc: 1462000e bne v1,v0,41f808 <unbin+0x54>
    41f7d0: afb10020 sw s1,32(sp)
    41f7d4: 00a03021 move a2,a1
    41f7d8: 00002021 move a0,zero
    41f7dc: 0c107afa jal 41ebe8 <.pic.__ashldi3>
    41f7e0: 24050001 li a1,1
    41f7e4: 3c04005a lui a0,0x5a
    41f7e8: 00022827 nor a1,zero,v0
    41f7ec: 24846a50 addiu a0,a0,27216
    41f7f0: 0c107dac jal 41f6b0 <a_and>
    41f7f4: 00038827 nor s1,zero,v1
    41f7f8: 3c04005a lui a0,0x5a
    41f7fc: 02202821 move a1,s1
    41f800: 0c107dac jal 41f6b0 <a_and>
    41f804: 24846a54 addiu a0,a0,27220
    41f808: 8e030008 lw v1,8(s0)
    41f80c: 8e02000c lw v0,12(s0)
    41f810: 8fbf0024 lw ra,36(sp)
    41f814: 8fb10020 lw s1,32(sp)
    41f818: ac430008 sw v1,8(v0)

I am quite sure that the problem is not in the unbin function itself, but somewhere in context handling, alignment, or something else. I also map this area read-only, so I am sure that the user-space regions are not corrupted. I also see a correlation between the size of the image and the faults:

    1349892 ./sel4test-tests.bin_23
    1349900 ./sel4test-tests.bin_24

The border is 1349900: if the image size is below this value, there is no problem. Unfortunately, 1349900 is not a 'round' value that relates to TLB sizes or anything else I know of.
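The faulting store can be cross-checked against the register dump by hand. The sketch below decodes the instruction word at EPC (0xac430008) as a MIPS32 I-type instruction and computes its effective address from the dumped registers. It assumes the dump prints four registers per row in ascending order, so that $2 (v0) is 7; the helper names are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a MIPS32 I-type instruction: opcode | rs | rt | imm16. */
typedef struct {
    unsigned opcode; /* bits 31..26 */
    unsigned rs;     /* bits 25..21, base register of a load/store */
    unsigned rt;     /* bits 20..16, register being stored */
    int16_t  imm;    /* bits 15..0, signed byte offset */
} itype_t;

static itype_t decode_itype(uint32_t insn)
{
    itype_t d;
    d.opcode = (insn >> 26) & 0x3f;
    d.rs     = (insn >> 21) & 0x1f;
    d.rt     = (insn >> 16) & 0x1f;
    d.imm    = (int16_t)(insn & 0xffff);
    return d;
}

/* Effective address of a `sw rt, imm(rs)` given a register file snapshot. */
static uint32_t store_address(uint32_t insn, const uint32_t regs[32])
{
    itype_t d = decode_itype(insn);
    assert(d.opcode == 0x2b); /* 0x2b is the SW opcode */
    return regs[d.rs] + (uint32_t)(int32_t)d.imm;
}

/* A word store to an address that is not 4-byte aligned raises an
 * address error on store. */
static int is_word_aligned(uint32_t addr)
{
    return (addr & 3u) == 0;
}
```

With v0 = 7 the store targets 7 + 8 = 0xf, which matches BadVaddr and is not word aligned. So the value loaded from 12(s0) just before the store is already bad; the store instruction itself only exposes it.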
Also, I should mention that the 'key' value that changes when I add a new line is an argument of _sel4test_failure:

    mips-mti-linux-gnu-objdump -d sel4test-tests.bin_23 > bin23.asm
    mips-mti-linux-gnu-objdump -d sel4test-tests.bin_24 > bin24.asm
    diff -ua bin23.asm bin24.asm

    Disassembly of section .init:
    @@ -3143,7 +3143,7 @@
     4030f0: 3c050043 lui a1,0x43
     4030f4: 2484c248 addiu a0,a0,-15800
     4030f8: 24a5c964 addiu a1,a1,-13980
    - 4030fc: 2406001f li a2,31
    + 4030fc: 24060020 li a2,32
     403100: 0c107b51 jal 41ed44 <_sel4test_failure>
     403104: 0000f021 move s8,zero
     403108: 8fbf0074 lw ra,116(sp)
    @@ -3383,7 +3383,7 @@
     4034b0: 2484c950 addiu a0,a0,-14000
     4034b4: 24a5c964 addiu a1,a1,-13980
     4034b8: 0c107b51 jal 41ed44 <_sel4test_failure>
    - 4034bc: 24060020 li a2,32
    + 4034bc: 24060021 li a2,33
     4034c0: 03c01021 move v0,s8
     4034c4: 8fbf0074 lw ra,116(sp)
     4034c8: 8fbe0070 lw s8,112(sp)

That is all I have for now. Any ideas? At the moment I see no other option than to gradually reduce the amount of code while keeping this 'border' situation, until I am left with a very small set of system functions.

Thank you

-- Vasily A. Sartakov sartakov@ksyslabs.org
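The two changed immediates in the diff above (31 to 32, and 32 to 33) coincide with the source lines of the two test_assert calls in the listing. A plausible reading, assuming test_assert passes __LINE__ as the third argument of _sel4test_failure, is that the extra empty line changes only these constants. The sketch below shows the mechanism with a hypothetical CHECK macro and report_failure function; it is not the sel4test code.

```c
#include <assert.h>
#include <stdio.h>

static int last_reported_line = 0;

/* Hypothetical stand-in for a failure reporter that receives the
 * source line of the failed check as its third argument. */
static void report_failure(const char *expr, const char *file, int line)
{
    assert(expr != NULL && file != NULL);
    last_reported_line = line;
    printf("%s:%d: check failed: %s\n", file, line, expr);
}

/* __LINE__ is expanded where CHECK is used, so the compiler emits the
 * line number as a constant argument at every call site. */
#define CHECK(e) \
    do { \
        if (!(e)) report_failure(#e, __FILE__, __LINE__); \
    } while (0)

static int first_check_line(void)
{
    CHECK(0); /* deliberately failing check */
    return last_reported_line;
}

static int second_check_line(void)
{
    CHECK(0); /* same check, a few source lines further down */
    return last_reported_line;
}
```

Inserting an empty line above either function shifts the constant it reports by one, so the generated code differs even though no statement changed. This accounts for the changed immediates, but not by itself for the fault.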
On Tue 01-Nov-2016 7:37 AM, Vasily A. Sartakov wrote:
> as you might see, this is little bit modified version of trivial.c tests. Usual, when I am testing all tests, there is no problem with this test. But when I am running this file alone, I have a problem. Also, you might see, that there are several free lines with numbers from 16 to 22. I made this not accidental, it is a source of error. If the test_allocator(env_t env) is lockated in the 23rd line, this tests has no problem. But if I add one more free line, I have an error like this:

If adding whitespace gives you a different compilation result then that is one bizarre compiler you have. I would check and make sure that this is really what is going on, because it seems fairly improbable to me. Maybe do some multiple runs/builds, 'make clean' between each build etc.

> Also, I see, that there is a correlation between size of the image and faults:
> 1349892 ./sel4test-tests.bin_23
> 1349900 ./sel4test-tests.bin_24
> The border line is 1349900 if I have a size of the image below the value -- there is no problem. Unfortunately, 1349900 is not a 'round' value, somehow related to TLB sizes of something else what I know.

If virtual address layout changes seem to coincide with faults then I would be checking things like
* Context switching code / address space management
* TLB/cache/ASID maintenance
* Branch predictor / any other hardware state that tracks virtual addresses

Note that I'm saying this as someone who knows basically nothing about MIPS, hence the broad suggestions.

Adrian
I think I found the reason, or at least I do not have errors now and I have reasoning behind the solution. Please correct me if I am wrong.

First, I should come back to my email about syscall.c (11 Oct. 16). That conversation started with my message:

… For example, sometimes it uses only S0-S7 with some T0-T9 registers …

and ended with:

------
The point of the syscall.c tests is to check that registers are not being corrupted by our syscalls (i.e that the kernel ABI + stubs follows the calling convention of the architecture).
…and corruption of registers can happen only if syscalls modify stack, since these values are popped from it after the end of a syscall routine, right?

-----

So, I was trying to say that this test, in my case, tests nothing, because variables like these:

    register int a00 = 0xdead0000; \
    register int a01 = 0xdead0001; \
    register int a02 = 0xdead0002; \
    register int a03 = 0xdead0003; \
    register int a04 = 0xdead0004; \

can be located anywhere. Since I (we/you/they) do not specify the exact register name, the compiler can do anything with these variables. And this is what I saw in my tests: my compiler uses different registers and saves their values on the stack before the syscall. After the syscall, the compiler loads the previous values, and the tests pass without any problem.

Now I have specified the register name:

    #define TEST_REGISTERS(code) \
        do { \
            register int a00 asm("v0") = 0xdead00aa; \
            __asm__ __volatile__ ("" \
                    : "+r"(a00)); \
            code ; \
            __asm__ __volatile__ ("" \
                    : "+r"(a00)); \
            test_assert(a00 == 0xdead00aa); \
        } while (0)

and used only the Yield syscall as the test.
Btw, this is the implementation of Yield:

    static inline void seL4_Yield(void)
    {
        register seL4_Word scno asm("v0") = seL4_SysYield;
        __asm__ __volatile__ ("nop;syscall" : : "r"(scno));
    }

And this is what I see when I disassemble this test:

    00403080 <test_seL4_Yield>:
    403080: 3c04dead lui a0,0xdead          <==== a0 = 0xdead0000
    403084: 27bdffe0 addiu sp,sp,-32
    403088: 248400aa addiu a0,a0,170        <==== a0 = 0xdead00aa
    40308c: 2403000a li v1,10
    403090: afbf001c sw ra,28(sp)
    403094: 00802821 move a1,a0
    403098: 00801021 move v0,a0             <==== v0 = a0 = 0xdead00aa
    40309c: 2402fff9 li v0,-7               <==== v0 = -7
    4030a0: 00000000 nop
    4030a4: 0000000c syscall
    4030a8: 14450008 bne v0,a1,4030cc <test_seL4_Yield+0x4c>   <==== compare v0 with a1 (a copy of a0)
    4030ac: 2463ffff addiu v1,v1,-1
    4030b0: 1460fffa bnez v1,40309c <test_seL4_Yield+0x1c>
    4030b4: 00801021 move v0,a0
    4030b8: 0c107f4d jal 41fd34 <sel4test_get_result>
    4030bc: 00000000 nop
    4030c0: 8fbf001c lw ra,28(sp)
    4030c4: 03e00008 jr ra
    4030c8: 27bd0020 addiu sp,sp,32
    4030cc: 3c040043 lui a0,0x43
    4030d0: 3c050043 lui a1,0x43
    4030d4: 2484d8a0 addiu a0,a0,-10080
    4030d8: 24a5d8b4 addiu a1,a1,-10060
    4030dc: 0c107f25 jal 41fc94 <_sel4test_failure>
    4030e0: 2406009f li a2,159
    4030e4: 00001021 move v0,zero
    4030e8: 8fbf001c lw ra,28(sp)
    4030ec: 03e00008 jr ra
    4030f0: 27bd0020 addiu sp,sp,32

So, as one can see, we load 0xdead00aa into a0, then copy it to v0, then overwrite v0 with scno, and after the syscall we compare the values. And of course, this test fails.

So my mistake was the assumption that these registers are saved across a syscall. I have already changed my message registers to S0-S3 (callee-saved registers), and I do not see the original issue anymore. But, of course, callee-saved registers add overhead and the size of my binary changes. Thus, maybe I still have the error, but I cannot trigger it with the current tests.
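The register conflict described above can be reproduced in a toy model that runs on any host. This is only a sketch: the register file is a plain array, toy_yield is a made-up stand-in for the seL4_Yield stub, and the kernel is assumed to change no register at all. The only thing modelled is that the stub itself writes the syscall number into v0.

```c
#include <assert.h>
#include <stdint.h>

/* MIPS o32 register numbers used in this sketch. */
enum { REG_V0 = 2, REG_A0 = 4, REG_S0 = 16, NUM_REGS = 32 };

#define SENTINEL  0xdead00aau
#define SYS_YIELD ((uint32_t)-7) /* matches `li v0,-7` in the disassembly */

typedef struct {
    uint32_t r[NUM_REGS];
} regfile_t;

/* Toy stand-in for the Yield stub: the syscall number goes into v0.
 * The (hypothetical) kernel then returns with every register intact. */
static void toy_yield(regfile_t *rf)
{
    rf->r[REG_V0] = SYS_YIELD;
    /* `syscall` would trap here; nothing else changes in this model. */
}

/* Returns 1 if a sentinel placed in `reg` is still there after the stub. */
static int sentinel_survives(int reg)
{
    regfile_t rf = { {0} };

    assert(reg >= 0 && reg < NUM_REGS);
    rf.r[reg] = SENTINEL;
    toy_yield(&rf);
    return rf.r[reg] == SENTINEL;
}
```

In this model a sentinel in v0 never survives, whatever the kernel does, while a sentinel in a0 or s0 does. A test that pins its sentinel to the register that carries the syscall number therefore fails by construction and says nothing about whether the kernel preserves registers.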
-- Vasily A. Sartakov sartakov@ksyslabs.org
participants (2)
- Adrian.Danis@data61.csiro.au
- Vasily A. Sartakov