On 8/26/23 09:20, Gernot Heiser via Devel wrote:
> On 16 Aug 2023, at 00:22, Demi Marie Obenour wrote:
>> […]
> I’d claim they don’t help at all. They might if “random” was truly random, which in hardware is rarely the case. “Random” cache replacement is more properly described as “unspecified”. In reality it usually draws on some existing state (e.g. a counter), and as such is actually fully deterministic; you only have to reverse-engineer the mechanism.
> Besides that, fully-associative caches (unless very small) are inherently slow (long critical path) and energy-hungry (many gates to traverse).
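(To make concrete why such a mechanism is reverse-engineerable: a minimal C sketch of a counter/LFSR-driven “random” victim picker. The 16-bit Galois LFSR and its seed are assumptions for illustration, not any particular vendor’s logic; the point is that the whole “random” sequence is fixed by a small piece of hidden state.)

#include <stdint.h>

#define WAYS 4 /* assumed associativity, for illustration */

/* Hidden state that fully determines every "random" choice. */
static uint16_t lfsr = 0xACE1; /* assumed reset value */

/* One step of a 16-bit Galois LFSR -- a typical cheap HW
 * "randomness" source. Same seed, same sequence, every time. */
static uint16_t lfsr_next(void)
{
    lfsr = (lfsr >> 1) ^ ((uint16_t)(-(lfsr & 1)) & 0xB400);
    return lfsr;
}

/* "Random" replacement: pick the victim way. An attacker who
 * recovers the LFSR state can predict every future eviction. */
unsigned pick_victim_way(void)
{
    return lfsr_next() % WAYS;
}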
>> What about Chameleon Cache (https://arxiv.org/pdf/2209.14673.pdf)?
> I had a look, and the paper pretty much confirms what I’ve said. (Besides being only a proposal…)
> For one, they don’t actually implement an FA cache, nor true randomness; they approximate both in some fashion so that the effect is basically the same. (So they claim; I didn’t read it carefully enough to confirm that. Note also that the paper is published in a third-tier venue I hadn’t even heard of before: the IEEE International Symposium on Secure and Private Execution Environment Design.)
> It does all this by making the hardware more complicated (and thus more likely to be buggy/attackable). Specifically, they take the (IMHO wrong) approach of trying to solve the whole problem in hardware, discarding alternative approaches as “involv[ing] software to manage the partitions”. As I keep saying, security is inherently a HW-SW co-design problem; you won’t get good solutions by trying to do it all in HW or all in SW. The right approach is policy-mechanism separation: HW provides minimal (and ideally orthogonal) mechanisms, and SW uses them according to its security policy.
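(As an illustration of that split: a minimal sketch, assuming a hypothetical way-mask register loosely in the spirit of way-partitioning mechanisms like Intel CAT or Arm MPAM. The names and layout are made up; the point is the division of labour between HW mechanism and OS policy.)

#include <stdint.h>

#define MAX_DOMAINS 16 /* assumed, for illustration */

typedef unsigned domain_id_t;

/* Hypothetical minimal HW mechanism: a register holding a bitmask
 * of the cache ways the current context may allocate into. */
extern void hw_set_way_mask(uint16_t way_mask);

/* SW policy: which ways each security domain may use. Only the OS
 * knows where the actual security boundaries are, so only the OS
 * can fill this in sensibly (e.g. disjoint masks for mutually
 * distrusting domains, shared masks where flows are permitted). */
static uint16_t domain_way_mask[MAX_DOMAINS];

/* On a security-domain switch, the kernel applies its policy
 * through the minimal mechanism the HW provides. */
void switch_to_domain(domain_id_t next)
{
    hw_set_way_mask(domain_way_mask[next]);
}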
> The HW-only approach means that the HW must take a pessimistic approach and prevent information flow across anything that looks like it might be a security boundary, whether or not it actually is one (something only the OS can know). This has a (generally measurable) cost.
> Here this cost is that the cache will likely be slower (in terms of access latency), likely be more energy-hungry, and have reduced hit rates compared to, say, a 4-way LRU cache. This is a constant overhead, paid whether it’s needed or not. In contrast, flushing the L1 is a cost that is only incurred when an actual security-domain switch happens. It is small (a fraction of a µs) and negligible unless domain-switch rates are unusually high.
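(A back-of-envelope check of that claim; the flush cost and switch rate below are assumed round numbers, not measurements.)

#include <stdio.h>

int main(void)
{
    double flush_cost_s = 0.5e-6; /* assumed L1 flush cost: 0.5 µs */
    double switch_rate  = 1000.0; /* assumed domain switches per second */

    /* Fraction of CPU time spent flushing. */
    double overhead = flush_cost_s * switch_rate;
    printf("L1-flush overhead: %.3f%%\n", overhead * 100.0);

    /* 0.5 µs * 1000/s = 0.05%: negligible. At 100,000 switches/s
     * it would be 5%, which is why unusually high domain-switch
     * rates change the picture. */
    return 0;
}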
For an L1 cache, flushing on each domain switch might be the best choice, but what about L2 and L3? Flushing L2 and L3 is going to be very slow, since it requires going all the way out to main memory: writing back, say, a largely dirty 8 MiB L3 at DRAM bandwidths on the order of tens of GB/s takes hundreds of µs, before counting the cold misses that follow. Do you have an alternative suggestion that can scale to hundreds of security domains? That’s not unrealistic for e.g. a web browser, where each origin is a separate security domain.

The impression I got from the academic work I have seen so far is that it works well when security domains are few and crossings between them are infrequent. Real-world systems, though, have lots of security domains and tend to cross them quite frequently, most notably at every single system call. Yes, part of the problem is that existing APIs were designed around low syscall costs and that existing kernels keep too many pages mapped at once, but still.

--
Sincerely,
Demi Marie Obenour (she/her/hers)