CPU design, Re: Confidentiality and realtime requirements
Date: Sun, 13 Aug 2023 16:18:08 +0000 From: Gernot Heiser <gernot@unsw.edu.au> Subject: [seL4] Re: Confidentiality and realtime requirements To: devel <devel@sel4.systems>
On 12 Aug 2023, at 02:07, Demi Marie Obenour <demiobenour@gmail.com> wrote:
If you want to relax this, you’ll need a very careful analysis of your security requirements, and deploy TP where needed and omit where it’s not needed. But you can't prevent µarch timing channels without TP. Everybody knows about D-cache channels these days and how easy they are to exploit, and similar with LLC channels. And if you want to prevent at least those, then full TP has no extra cost.
Does time protection require temporal isolation?
Nope. Time protection is the mechanism that allows providing temporal isolation.
Developing such a framework is on my agenda (it’s a core part of what I said I’m waiting for a good PhD student to work on). Basically I want to be able to make certain security guarantees where components have overt communication channels. An example is tab isolation in a browser: We want to have a fully functional web browser that allows processing of sensitive information in one tab, but guarantee that it cannot be leaked to a different tab.
Another good example is different apps on a mobile device. Android allows any two applications on the same profile to communicate with mutual consent, and yet also allows one application to keep information secret from another.
Yes. The smartphone space is really interesting. While built on top of the classical Unix security model (but implementable on a much simpler model) it demonstrates that stricter security policies (compared to desktops) can be made acceptable to users.
One critical requirement is that such a framework must allow an efficient implementation, both in terms of performance and power consumption. If it isn’t efficient enough to be deployed, then it won’t do people any good.
Energy is (to first order) proportional to computation, so just looking at “performance” is good enough.
And performance has always been our driver – see “security is no excuse for bad performance!” ;-)
As I argued earlier, the performance cost of time protection is not higher than what you need to eliminate just the most easily exploitable cache channels.
Gernot
This makes me wonder if it’d make sense to design a new CPU and/or instruction set with the goal of eliminating timing and other side channels. Like, we had the Lisp machines with hardware support for Lisp, and now most CPUs are optimized for C and similar languages, and in turn those languages are optimized for those CPUs (x86, Arm, etc.). But I can’t say I know enough about computer science at this time to even tinker with the idea. And also, on the other hand maybe it doesn’t make sense to create hardware and accompanying software that are nearly impossible to non-destructively subvert, given the concerning trend of computer manufacturers and tech companies of restricting what code can run on their products. I don’t want to make something that could be set up to prevent an end user from modifying or running their own code. Isaac
On 14 Aug 2023, at 08:25, Isaac Beckett <isaactbeckett@gmail.com> wrote:
This makes me wonder if it’d make sense to design a new CPU and/or instruction set with the goal of eliminating timing and other side channels. Like, we had the Lisp machines with hardware support for Lisp, and now most CPUs are optimized for C and similar languages, and in turn those languages are optimized for those CPUs (x86, Arm, etc.).
It’s easy to design a processor that’s free of timing channels: just don’t do any caching. It’ll be extremely slow, of course, and that has nothing to do with languages. Processor performance inherently depends on many levels of caching. The more realistic approach is to enhance the HW-SW contract to provide the OS with suitable mechanisms for preventing the effects of caching spilling across security domains. That’s exactly what time protection and the work on the appropriate HW support (i.e. fence.t) is all about. Gernot
On 8/13/23 18:25, Isaac Beckett wrote:
Date: Sun, 13 Aug 2023 16:18:08 +0000 From: Gernot Heiser <gernot@unsw.edu.au> Subject: [seL4] Re: Confidentiality and realtime requirements To: devel <devel@sel4.systems>
On 12 Aug 2023, at 02:07, Demi Marie Obenour <demiobenour@gmail.com> wrote:
If you want to relax this, you’ll need a very careful analysis of your security requirements, and deploy TP where needed and omit where it’s not needed. But you can't prevent µarch timing channels without TP. Everybody knows about D-cache channels these days and how easy they are to exploit, and similar with LLC channels. And if you want to prevent at least those, then full TP has no extra cost.
Does time protection require temporal isolation?
Nope. Time protection is the mechanism that allows providing temporal isolation.
Developing such a framework is on my agenda (it’s a core part of what I said I’m waiting for a good PhD student to work on). Basically I want to be able to make certain security guarantees where components have overt communication channels. An example is tab isolation in a browser: We want to have a fully functional web browser that allows processing of sensitive information in one tab, but guarantee that it cannot be leaked to a different tab.
Another good example is different apps on a mobile device. Android allows any two applications on the same profile to communicate with mutual consent, and yet also allows one application to keep information secret from another.
Yes. The smartphone space is really interesting. While built on top of the classical Unix security model (but implementable on a much simpler model) it demonstrates that stricter security policies (compared to desktops) can be made acceptable to users.
One critical requirement is that such a framework must allow an efficient implementation, both in terms of performance and power consumption. If it isn’t efficient enough to be deployed, then it won’t do people any good.
Energy is (to first order) proportional to computation, so just looking at “performance” is good enough.
And performance has always been our driver – see “security is no excuse for bad performance!” ;-)
As I argued earlier, the performance cost of time protection is not higher than what you need to eliminate just the most easily exploitable cache channels.
Gernot
This makes me wonder if it’d make sense to design a new CPU and/or instruction set with the goal of eliminating timing and other side channels. Like, we had the Lisp machines with hardware support for Lisp, and now most CPUs are optimized for C and similar languages, and in turn those languages are optimized for those CPUs (x86, Arm, etc.).
Fully-associative caches with random eviction help a lot. Speculative taint tracking helps close some of the remaining holes. It’s not perfect, but it is the best one can do without cache partitioning or heavyweight flushing, and I suspect both of those are often impractical from a performance perspective. There are designs for fully-associative caches that are practical, at least for levels other than L1. -- Sincerely, Demi Marie Obenour (she/her/hers)
On 15 Aug 2023, at 00:34, Demi Marie Obenour <demiobenour@gmail.com> wrote:
This makes me wonder if it’d make sense to design a new CPU and/or instruction set with the goal of eliminating timing and other side channels. Like, we had the Lisp machines with hardware support for Lisp, and now most CPUs are optimized for C and similar languages, and in turn those languages are optimized for those CPUs (x86, Arm, etc.).
Fully-associative caches with random eviction help a lot.
I’d claim they don’t help at all. They might if “random” was truly random, which in hardware is rarely the case. “Random” cache replacement is more properly described as “unspecified”. In reality it’s usually using some existing state (eg some counter) and as such is actually fully deterministic, you only have to reverse-engineer the mechanism. Besides that, fully-associative caches (unless very small) are inherently slow (long critical path) and energy-hungry many gates to traverse).
Speculative taint tracking helps close some of the remaining holes.
I’m highly sceptical too. Taint-tracking has to be highly pessimistic to be feasible, and adds a lot of complexity to the hardware (which gets back to my earlier argument: make it more complex and you’re making it more likely that something goes wrong). Gernot
On 15 Aug 2023, at 01:59, Gernot Heiser <gernot@unsw.edu.au> wrote:
Speculative taint tracking helps close some of the remaining holes.
I’m highly sceptical too. Taint-tracking has to be highly pessimistic to be feasible, and adds a lot of complexity to the hardware (which gets back to my earlier argument: make it more complex and you’re making it more likely that something goes wrong).
Forgot to add:
It’s not perfect, but it is the best one can do without cache partitioning or heavyweight flushing, and I suspect both of those are often impractical from a performance perspective.
Cache partitioning (which really is static, software-controlled vs dynamic partitioning by hardware) produces, in average, at best a mild performance degradation. There are plenty of cases where it actually improves performance (loads with a cache working set that exceeds the cache size) and it is used for performance isolation and sometimes for improving average-case performance. Similar, cache-flushing (of the L1) has little performance impact, except where context-switching rates are very high. Otherwise, the L1 is too small to hold much hot data after a full round of context switches, so the cache is effectively flushed implicitly. On OoO processors explicit flushes have a cost, as the cost of the implicit replacement can be at least partially hidden by OoO execution. But I doubt there are many use realistic cases where the impact is significant. (Yes, we did evaluate this, although it’s been studied in more detail by others.) Gernot
With e.g. HBM, there might be an opportunity to state save the entire cache on a context switch and restore it on resume. I guess some of the overhead of flushing is due to random accesses. State save could just do a single contiguous write. There’s likely to be some complexity around cache coherency and shared memory but maybe not insurmountable. In the worst case, shared memory could have a separate cache which would be flushed rather than state saved. On Mon, 14 Aug 2023 at 17:31, Gernot Heiser <gernot@unsw.edu.au> wrote:
On 15 Aug 2023, at 01:59, Gernot Heiser <gernot@unsw.edu.au> wrote:
Speculative taint tracking helps close some of the remaining holes.
I’m highly sceptical too. Taint-tracking has to be highly pessimistic to
be feasible, and adds a lot of complexity to the hardware (which gets back to my earlier argument: make it more complex and you’re making it more likely that something goes wrong).
Forgot to add:
It’s not perfect, but it is the best one can do without cache partitioning or heavyweight flushing, and I suspect both of those are often impractical from a performance perspective.
Cache partitioning (which really is static, software-controlled vs dynamic partitioning by hardware) produces, in average, at best a mild performance degradation. There are plenty of cases where it actually improves performance (loads with a cache working set that exceeds the cache size) and it is used for performance isolation and sometimes for improving average-case performance.
Similar, cache-flushing (of the L1) has little performance impact, except where context-switching rates are very high. Otherwise, the L1 is too small to hold much hot data after a full round of context switches, so the cache is effectively flushed implicitly.
On OoO processors explicit flushes have a cost, as the cost of the implicit replacement can be at least partially hidden by OoO execution. But I doubt there are many use realistic cases where the impact is significant. (Yes, we did evaluate this, although it’s been studied in more detail by others.)
Gernot _______________________________________________ Devel mailing list -- devel@sel4.systems To unsubscribe send an email to devel-leave@sel4.systems
On 8/14/23 12:25, Gernot Heiser wrote:
On 15 Aug 2023, at 01:59, Gernot Heiser <gernot@unsw.edu.au> wrote:
Speculative taint tracking helps close some of the remaining holes.
I’m highly sceptical too. Taint-tracking has to be highly pessimistic to be feasible, and adds a lot of complexity to the hardware (which gets back to my earlier argument: make it more complex and you’re making it more likely that something goes wrong).
Forgot to add:
It’s not perfect, but it is the best one can do without cache partitioning or heavyweight flushing, and I suspect both of those are often impractical from a performance perspective.
Cache partitioning (which really is static, software-controlled vs dynamic partitioning by hardware) produces, in average, at best a mild performance degradation. There are plenty of cases where it actually improves performance (loads with a cache working set that exceeds the cache size) and it is used for performance isolation and sometimes for improving average-case performance.
In a fully dynamic system, nothing is known statically. The OS needs to be able to adapt to workloads that did not even exist when the OS was created, and it needs to do so as these workloads are executing. And virtually all of these workloads will have no idea that cache partitioning even exists. Can an OS perform the needed analysis at runtime and allocate cache as needed?
Similar, cache-flushing (of the L1) has little performance impact, except where context-switching rates are very high. Otherwise, the L1 is too small to hold much hot data after a full round of context switches, so the cache is effectively flushed implicitly.
I’m not surprised by this. Nevertheless, it is yet another reason to deemphasize protected procedure calls in favor of notifications and shared memory.
On OoO processors explicit flushes have a cost, as the cost of the implicit replacement can be at least partially hidden by OoO execution. But I doubt there are many use realistic cases where the impact is significant. (Yes, we did evaluate this, although it’s been studied in more detail by others.)
Are there any links you recommend? -- Sincerely, Demi Marie Obenour (she/her/hers)
On 8/14/23 11:59, Gernot Heiser wrote:
On 15 Aug 2023, at 00:34, Demi Marie Obenour <demiobenour@gmail.com> wrote:
This makes me wonder if it’d make sense to design a new CPU and/or instruction set with the goal of eliminating timing and other side channels. Like, we had the Lisp machines with hardware support for Lisp, and now most CPUs are optimized for C and similar languages, and in turn those languages are optimized for those CPUs (x86, Arm, etc.).
Fully-associative caches with random eviction help a lot.
I’d claim they don’t help at all. They might if “random” was truly random, which in hardware is rarely the case. “Random” cache replacement is more properly described as “unspecified”. In reality it’s usually using some existing state (eg some counter) and as such is actually fully deterministic, you only have to reverse-engineer the mechanism.
Besides that, fully-associative caches (unless very small) are inherently slow (long critical path) and energy-hungry many gates to traverse).
What about Chameleon Cache (https://arxiv.org/pdf/2209.14673.pdf)?
Speculative taint tracking helps close some of the remaining holes.
I’m highly sceptical too. Taint-tracking has to be highly pessimistic to be feasible, and adds a lot of complexity to the hardware (which gets back to my earlier argument: make it more complex and you’re making it more likely that something goes wrong).
You’re not wrong. I just don’t know if a better solution for fully dynamic workloads exists. -- Sincerely, Demi Marie Obenour (she/her/hers)
On 16 Aug 2023, at 00:22, Demi Marie Obenour <demiobenour@gmail.com> wrote: I’d claim they don’t help at all. They might if “random” was truly random, which in hardware is rarely the case. “Random” cache replacement is more properly described as “unspecified”. In reality it’s usually using some existing state (eg some counter) and as such is actually fully deterministic, you only have to reverse-engineer the mechanism. Besides that, fully-associative caches (unless very small) are inherently slow (long critical path) and energy-hungry many gates to traverse). What about Chameleon Cache (https://arxiv.org/pdf/2209.14673.pdf)? I had a look, and the paper pretty much confirms what I’ve said. (Besides being only a proposal…) For one, they don’t actually do an FA cache, nor true randomness, they approximate both in some fashion so that the effect is basically the same (so they claim, I didn’t read it carefully enough to confirm that, note that the paper is published in a third-tier venue I hadn’t even heard of before – IEEE International Symposium on Secure and Private Execution Environment Design). It does all this by making the hardware more complicated (and thus more likely to be buggy/attackable). Specifically, they take the IMHO wrong approach of trying to solve the whole problem in hardware, discarding alternative approaches as “involve[ing] software to manage the partitions”. As I keep saying, security is inherently a HW-SW-codesign problem, you won’t get good solutions by trying to do it all in HW or in SW. The right approach is policy-mechanism separation: HW provides minimal (and ideally orthogonal) mechanisms, and SW uses them according to its security policy. The HW-only approach means that the HW must take a pessimistic approach and prevent information flow across anything that looks like it might be a security boundary, whether or not it actually is (which only the OS can know). This has a (generally measurable) cost. Here this cost is that the cache will likely be slower (in terms of access), likely more energy-hungry and have reduced hit rates compared to, say, a 4-way LRU cache. This will be a constant overhead, whether it’s needed or not. In contrast, flushing the L1 is a cost that is only needed when an actual security-domain switch happens. It is small (fraction of a µs), negligible unless domain-switch rates are unusually high. There is a trade-off between such per-domain-switch costs vs constant overheads resulting from a complex design, but the paper doesn’t attempt to evaluate this. In fact, their evaluation commits the benchmarking crime of improper baseline (D1 of van der Kouwe et al [2020], see https://trustworthy.systems/publications/full_text/vanderKouwe_HABG_20.pdf or also my list https://gernot-heiser.org/benchmarking-crimes.html#base). They only compare against other comples, HW-only schemes. Gernot
On 8/26/23 09:20, Gernot Heiser via Devel wrote:
On 16 Aug 2023, at 00:22, Demi Marie Obenour <demiobenour@gmail.com> wrote:
I’d claim they don’t help at all. They might if “random” was truly random, which in hardware is rarely the case. “Random” cache replacement is more properly described as “unspecified”. In reality it’s usually using some existing state (eg some counter) and as such is actually fully deterministic, you only have to reverse-engineer the mechanism.
Besides that, fully-associative caches (unless very small) are inherently slow (long critical path) and energy-hungry many gates to traverse).
What about Chameleon Cache (https://arxiv.org/pdf/2209.14673.pdf)?
I had a look, and the paper pretty much confirms what I’ve said. (Besides being only a proposal…)
For one, they don’t actually do an FA cache, nor true randomness, they approximate both in some fashion so that the effect is basically the same (so they claim, I didn’t read it carefully enough to confirm that, note that the paper is published in a third-tier venue I hadn’t even heard of before – IEEE International Symposium on Secure and Private Execution Environment Design).
It does all this by making the hardware more complicated (and thus more likely to be buggy/attackable). Specifically, they take the IMHO wrong approach of trying to solve the whole problem in hardware, discarding alternative approaches as “involve[ing] software to manage the partitions”. As I keep saying, security is inherently a HW-SW-codesign problem, you won’t get good solutions by trying to do it all in HW or in SW. The right approach is policy-mechanism separation: HW provides minimal (and ideally orthogonal) mechanisms, and SW uses them according to its security policy.
The HW-only approach means that the HW must take a pessimistic approach and prevent information flow across anything that looks like it might be a security boundary, whether or not it actually is (which only the OS can know). This has a (generally measurable) cost.
Here this cost is that the cache will likely be slower (in terms of access), likely more energy-hungry and have reduced hit rates compared to, say, a 4-way LRU cache. This will be a constant overhead, whether it’s needed or not. In contrast, flushing the L1 is a cost that is only needed when an actual security-domain switch happens. It is small (fraction of a µs), negligible unless domain-switch rates are unusually high.
For an L1 cache, flushing on each domain switch might be the best choice, but what about L2 and L3? Flushing L2 and L3 is going to be very slow, since it requires going all the way out to main memory. Do you have an alternative suggestion that can scale to hundreds of security domains? That’s not unrealistic for e.g. a web browser, where each origin is a separate security domain. The impression I got from the academic work I have seen so far is that it works well when there are a small number of security domains and crossings between them are infrequent. Real-world systems, though, have lots of security domains and tend to cross them quite frequently, most notably at every single system call. Yes, part of the problem is that existing APIs were designed around low syscall costs and that existing kernels keep too many pages mapped at once, but still. -- Sincerely, Demi Marie Obenour (she/her/hers)
participants (4)
-
Demi Marie Obenour
-
Gernot Heiser
-
Harry Butterworth
-
Isaac Beckett