Hello, the time spend by a vCPU TCB (tcbTimeSlice) seems to be not accounted/updated by the seL4 kernel on a timer ticks, which leads to starvation of other TCBs running on the same or lower priority level. In one of our test setups, if once a vCPU runs and doesn't give up its timeslice willful (e.g. guest goes in while loop and thereby not causing a VM exit to be handled by the VMM), all other TCBs (normal threads and vCPUs) on the same timeslice are not considered to be scheduled again, since tcbTimeSlice of the vCPU stays full forever. Following patch seems to fix the issue. Can you please have a look and verify that it is correct ? Or should it be solved on another place in the kernel differently. Just out of curiosity, why is a separate thread state in the scheduler for x86 (CONFIG_VFX) required that is not needed for scheduling of a vCPU on ARM ? I didn't expect the scheduler thread states to be different between hardware architectures in seL4. Cheers, -- Alexander Boettcher Genode Labs https://www.genode-labs.com - https://www.genode.org Genode Labs GmbH - Amtsgericht Dresden - HRB 28424 - Sitz Dresden Geschäftsführer: Dr.-Ing. Norman Feske, Christian Helmuth
On 08.11.18 09:30, Alexander Boettcher wrote:
vCPUs) on the same timeslice are not considered to be scheduled again,
Sorry, "on the same timeslice" must be replaced by "on the same priority level" Cheers, -- Alexander Boettcher Genode Labs https://www.genode-labs.com - https://www.genode.org Genode Labs GmbH - Amtsgericht Dresden - HRB 28424 - Sitz Dresden Geschäftsführer: Dr.-Ing. Norman Feske, Christian Helmuth
Hi Alexander,
Following patch seems to fix the issue. Can you please have a look and verify that it is correct ? Or should it be solved on another place in the kernel differently.
Yes, the patch is correct and definitely fixes a bug. Thanks! Would you like me to apply this directly or will you raise a PR? In order to not cause unneccessary headaches for verification, I would undo the convert to switch and leave the statement as an if.
Just out of curiosity, why is a separate thread state in the scheduler for x86 (CONFIG_VFX) required that is not needed for scheduling of a vCPU on ARM ? I didn't expect the scheduler thread states to be different between hardware architectures in seL4.
seL4's x86 and ARM virtualisation architectures are indeed different. On x86, we use one kernel thread for both guest and host, and use the thread state to differentiate what is currently running. On ARM, guest and host run in different kernel threads so this isn't required. Cheers, Anna.
On Fri, Nov 9, 2018 at 6:10 AM <Anna.Lyons@data61.csiro.au> wrote:
seL4's x86 and ARM virtualisation architectures are indeed different. On x86, we use one kernel thread for both guest and host, and use the thread state to differentiate what is currently running. On ARM, guest and host run in different kernel threads so this isn't required.
Out of interest, what's the reason for the different approaches?
On 9 Nov 2018, at 08:42, Jeff Waugh <jdub@bethesignal.org<mailto:jdub@bethesignal.org>> wrote: Out of interest, what's the reason for the different approaches? Historic. One was designed without planning for the other (time pressure), not our proudest moment(s). Things are being reworked… Gernot
As the original author of the x86 model I can provide the trip down memory lane. The original x86 implementation had the model that ARM has now. Unfortunately due to having two different threads there was always a very difficult to resolve race in performing interrupt injection into the guest. There is a race since injecting an interrupt requires 1. Read guest state 2. Validate guest is interrutible 3. Modify guest state or request notification when guest is interruptible If the guest state changes between 1 and 3 then there is a problem, so there needs to be a way to synchronize the guest. No such method exists, since in the thread APIs back then (the MCS kernel has a fix) you could only suspend a thread, which had the effect of cancelling all its IPCs, including any faults. So if suspended a VCPU thread that was faulted its fault message would get thrown away, which is a problem. Running the VCPU and VMM at different priorities has its own problems and leads to a priority management nightmare that can still have races where some other event causes us to want to inject an interrupt at the same time as the VCPU now has a pending fault. Since thread states aren't queryable the VMM, whilst injecting the interrupt, does not know that the VCPU is actually sitting on a pending fault. All this lead to the very deliberate decision to use a single thread for both vmx and native execution scheduling as it allows the VMM to *know* that when it is running its native thread that the guest will not run and it knows precisely whether it is faulted or not. The other option discussed at the time was a "suspend thread if running and return its current state". This would have removed the problems of discarding a fault by blinding calling 'suspend', and also allows for knowing there is a fault to handle it prior to the interrupt. Ultimately this option was not chosen as it was deemed more invasive with unknown proof consequences and required more system calls, thus hurting performance. A third option would be to have the kernel inject interrupts, but this was never really considered as moving something to the kernel just to avoid giving the user appropriate tools is bit nasty and would have put policy in the kernel on how interrupt injection should happen. ARM does not have this problem since interrupt injection goes via the vGIC, and since the kernel cannot give the user direct access to the vGIC registers there was no choice but to move it into the kernel. The ARM virtualization model of extra EL levels naturally lends itself to just letting threads run at higher privelege levels (with the VCPU just being the 'extra register storage space') and the notion of a 'guest' and 'vmm' become a user concept. In fact for the kernel aside from extra register saving and restoring there really is no difference between a thread with and without a VCPU, they can both a cspaces, make systems calls, and generally do whatever they want. In fact an EL1 thread with a VCPU could act as the fault handler (i.e. VMM) for another thread. Is one of the models inherently more desirable? Perhaps. Does it matter if they are different? Perhaps not. I shall withhold any such opinions as I have no say in these matters now, I'm just here to provide the stories. Adrian On Fri, Nov 9, 2018 at 8:43 AM Jeff Waugh <jdub@bethesignal.org> wrote:
On Fri, Nov 9, 2018 at 6:10 AM <Anna.Lyons@data61.csiro.au> wrote:
seL4's x86 and ARM virtualisation architectures are indeed different. On x86, we use one kernel thread for both guest and host, and use the thread state to differentiate what is currently running. On ARM, guest and host run in different kernel threads so this isn't required.
Out of interest, what's the reason for the different approaches? _______________________________________________ Devel mailing list Devel@sel4.systems https://sel4.systems/lists/listinfo/devel
Thanks for the details. Alex. On 09.11.18 00:47, Adrian Danis wrote:
As the original author of the x86 model I can provide the trip down memory lane.
The original x86 implementation had the model that ARM has now. Unfortunately due to having two different threads there was always a very difficult to resolve race in performing interrupt injection into the guest. There is a race since injecting an interrupt requires 1. Read guest state 2. Validate guest is interrutible 3. Modify guest state or request notification when guest is interruptible If the guest state changes between 1 and 3 then there is a problem, so there needs to be a way to synchronize the guest. No such method exists, since in the thread APIs back then (the MCS kernel has a fix) you could only suspend a thread, which had the effect of cancelling all its IPCs, including any faults. So if suspended a VCPU thread that was faulted its fault message would get thrown away, which is a problem. Running the VCPU and VMM at different priorities has its own problems and leads to a priority management nightmare that can still have races where some other event causes us to want to inject an interrupt at the same time as the VCPU now has a pending fault. Since thread states aren't queryable the VMM, whilst injecting the interrupt, does not know that the VCPU is actually sitting on a pending fault.
All this lead to the very deliberate decision to use a single thread for both vmx and native execution scheduling as it allows the VMM to *know* that when it is running its native thread that the guest will not run and it knows precisely whether it is faulted or not. The other option discussed at the time was a "suspend thread if running and return its current state". This would have removed the problems of discarding a fault by blinding calling 'suspend', and also allows for knowing there is a fault to handle it prior to the interrupt. Ultimately this option was not chosen as it was deemed more invasive with unknown proof consequences and required more system calls, thus hurting performance. A third option would be to have the kernel inject interrupts, but this was never really considered as moving something to the kernel just to avoid giving the user appropriate tools is bit nasty and would have put policy in the kernel on how interrupt injection should happen.
ARM does not have this problem since interrupt injection goes via the vGIC, and since the kernel cannot give the user direct access to the vGIC registers there was no choice but to move it into the kernel. The ARM virtualization model of extra EL levels naturally lends itself to just letting threads run at higher privelege levels (with the VCPU just being the 'extra register storage space') and the notion of a 'guest' and 'vmm' become a user concept. In fact for the kernel aside from extra register saving and restoring there really is no difference between a thread with and without a VCPU, they can both a cspaces, make systems calls, and generally do whatever they want. In fact an EL1 thread with a VCPU could act as the fault handler (i.e. VMM) for another thread.
Is one of the models inherently more desirable? Perhaps. Does it matter if they are different? Perhaps not. I shall withhold any such opinions as I have no say in these matters now, I'm just here to provide the stories.
Adrian
On Fri, Nov 9, 2018 at 8:43 AM Jeff Waugh <jdub@bethesignal.org> wrote:
On Fri, Nov 9, 2018 at 6:10 AM <Anna.Lyons@data61.csiro.au> wrote:
seL4's x86 and ARM virtualisation architectures are indeed different. On x86, we use one kernel thread for both guest and host, and use the thread state to differentiate what is currently running. On ARM, guest and host run in different kernel threads so this isn't required.
Out of interest, what's the reason for the different approaches? _______________________________________________ Devel mailing list Devel@sel4.systems https://sel4.systems/lists/listinfo/devel
_______________________________________________ Devel mailing list Devel@sel4.systems https://sel4.systems/lists/listinfo/devel
-- Alexander Boettcher Genode Labs https://www.genode-labs.com - https://www.genode.org Genode Labs GmbH - Amtsgericht Dresden - HRB 28424 - Sitz Dresden Geschäftsführer: Dr.-Ing. Norman Feske, Christian Helmuth
On 08.11.18 20:08, Anna.Lyons@data61.csiro.au wrote:
Following patch seems to fix the issue. Can you please have a look and verify that it is correct ? Or should it be solved on another place in the kernel differently.
Yes, the patch is correct and definitely fixes a bug. Thanks! Would you like me to apply this directly or will you raise a PR? In order to not cause unneccessary headaches for verification, I would undo the convert to switch and leave the statement as an if.
You may apply directly. Regarding the switch change, I didn't see an obvious way to keep the original if statement and add a || and a ifdef CONFIG_VTX ... without changing the original code. Since the type of switch style and the ifdef was used already in the same file, I supposed it such be fine. But let's see how you change it ;-) Thanks, -- Alexander Boettcher Genode Labs https://www.genode-labs.com - https://www.genode.org Genode Labs GmbH - Amtsgericht Dresden - HRB 28424 - Sitz Dresden Geschäftsführer: Dr.-Ing. Norman Feske, Christian Helmuth
participants (5)
-
Adrian Danis
-
Alexander Boettcher
-
Anna.Lyons@data61.csiro.au
-
Gernot.Heiser@data61.csiro.au
-
Jeff Waugh