tl;dr
Early in the afternoon of Tuesday, Aug 8, a hardware vulnerability affecting Intel Xeon CPUs was announced, along with proof-of-concept code. We had a fleet-wide update in place within about 2 hours of the announcement.
As with the “Zenbleed” vulnerability announced 2 weeks prior, there was a window of exposure (after the vulnerability was announced and before we patched it) during which a malicious app scheduled on the same Intel hardware as your own app could potentially have seen snippets of your app’s memory.
With “Zenbleed”, we told you that the likelihood of real exposure is relatively low, but that you could reasonably be concerned. You can reasonably be concerned about “Downfall”, too, but the likelihood that you were impacted is much smaller.
This was a bug deep in the operation of one of the most commonly used functions of Intel CPUs. We don’t know if it was actively exploited (we doubt it). We have no evidence that it was. This is not a notice that your application was compromised.
The Vulnerability
Like “Zenbleed”, “Downfall” leaks memory using the x64 vector extensions. In this case, the vulnerability exploits the “gather” family of instructions, which are single instructions that perform batch reads of non-contiguous memory — think of these instructions like x86 “macros” that issue a series of instructions to read a bunch of different things all at once.
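To make the “gather” idea concrete, here is a minimal scalar model of what one such instruction does logically. It is an illustration only: the function name `gather` and the toy `memory` list are assumptions made for this sketch, not real intrinsics, and it models nothing about the leak itself.

```python
def gather(memory, base, indices):
    """Scalar model of a vector gather: one logical operation, many scattered loads.

    Hypothetical helper for illustration; real gathers are single CPU
    instructions that fill a vector register, one lane per index.
    """
    # Each lane independently reads memory[base + index].
    return [memory[base + i] for i in indices]


# Toy "memory" where address n holds the value 100 + n.
memory = list(range(100, 164))

# One gather call stands in for four separate, non-contiguous reads.
lanes = gather(memory, base=8, indices=[0, 17, 3, 40])
print(lanes)  # [108, 125, 111, 148]
```

The point of the sketch is only that a single gather touches several unrelated addresses at once, which is what makes it behave like a “macro” over ordinary loads.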
“Downfall” is more situational than “Zenbleed”, which is a way of saying that attackers need more stars to align to do anything useful with it. Some variants of “Downfall” rely on gadgets in the target code: patterns of known instructions at particular offsets in the kernel or a target process.
Additionally, the proof-of-concept code released alongside “Downfall” is much less useful to an attacker than the “Zenbleed” exploit was. The “Zenbleed” exploit really was an exploit; you could just run it on a vulnerable machine and start seeing secret data. “Downfall”'s code relies on kernel extensions to set up experimental conditions. That doesn’t mean the real attack needs that! But it does mean that random people reading the announcement were limited in what they could do with the code.
Finally, the attack required the victim and the attacker to share a physical core. Owing to the way our system is structured, the code you were most likely to ever share a core with was our own infrastructure.
“Zenbleed” impacted AMD Zen2 hosts. “Downfall” impacts Intel Xeon hosts. We have some Xeon hosts, but it’s a very small fraction of our total fleet.
The Fix
Intel provided a microcode update. We had it applied to all of our impacted hosts within 2 hours of the announcement.
While coordinating the microcode update, we halted deploys on our Xeon hosts. If you didn’t already have code running on a Xeon host before the “Downfall” announcement, you would not have been able to install any new code on one during that time. Now that those hosts are patched, deploys are again enabled on them.
Microcode updates need to be re-applied every time a machine boots, so we’re also working on verifying updated boots and setting up alerting for this particular bug.
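As a rough sketch of what a post-boot check could look like (this is not our actual tooling), the snippet below parses the `microcode` field that Linux reports in `/proc/cpuinfo` on x86 and compares it against a minimum revision. The function names, the sample text, and the revision numbers are all hypothetical; the real minimum revision depends on the CPU model and comes from Intel’s guidance.

```python
import re


def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions reported across all logical CPUs."""
    found = re.findall(
        r"^microcode\s*:\s*(0x[0-9a-fA-F]+)", cpuinfo_text, re.MULTILINE
    )
    return {int(rev, 16) for rev in found}


def host_is_patched(cpuinfo_text, min_revision):
    """True only if every logical CPU reports at least `min_revision`.

    A host that reports no microcode field at all is treated as unverified.
    """
    revisions = microcode_revisions(cpuinfo_text)
    return bool(revisions) and all(rev >= min_revision for rev in revisions)


# Hypothetical sample in the shape of /proc/cpuinfo; values are made up.
sample = (
    "processor\t: 0\n"
    "microcode\t: 0xd0003a5\n"
    "processor\t: 1\n"
    "microcode\t: 0xd0003a5\n"
)

print(host_is_patched(sample, min_revision=0xD0003A5))
```

On a real host you would pass in the contents of `/proc/cpuinfo` and alert whenever the check returns `False` after boot.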
The Risk
We think the risk here was pretty low. As a reminder, with speculative execution leak vulnerabilities, the following statements are all generally true:
- You could not have used it to simply dump all the memory from a particular VM (data is only leaked when it’s actually used; more specifically, when it’s used in a way that loads it into the XMM registers; e.g., a vectorized memcpy).
- You could not have used it to gain code execution on a VM (“Downfall” leaks memory, but doesn’t allow attackers to write it).
- You could not have used it to dump all the environment variables from a running image, unless you were actively using those variables (again, data had to be in use to wind up in the XMM registers and then be leaked by “Downfall”).
- You could not generally have used any vulnerability like this to dump all Fly Secrets. We store secrets in Vault, on separate hardware, and give our API servers only the ability to write secrets, not read them. People sometimes express irritation that we make it hard to read raw secrets using our API; this is why we do that.
Further, unlike “Zenbleed”, the available exploit code didn’t work on our platform. That doesn’t mean the vulnerability didn’t work, but it does mean attackers would have had only a 2-hour window in which to build a new, reliable exploit to target users here, which seems unlikely.
As it happens, most of the memory you’d have seen running this exploit would have been our own host infrastructure (Prometheus metrics, etc.), presumably because our own host infrastructure is broadly scheduled across cores on machines. There’s no straightforward way to use the exploit to target a particular app.
We’ve been watching our audit logs and haven’t seen any concerning patterns of apps being deployed (for instance, somebody trying to schedule a machine on every one of our Intel hosts).
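For a sense of what “concerning patterns” might mean in practice, here is a minimal sketch that flags any organization whose deploys landed on most of the Intel hosts. Everything in it is an assumption made for illustration: the `(org, host)` event shape, the host names, and the 80% threshold are hypothetical and do not describe our real audit log format.

```python
from collections import defaultdict


def broad_schedulers(deploy_events, intel_hosts, threshold=0.8):
    """Flag orgs whose deploys touched at least `threshold` of the Intel hosts.

    `deploy_events` is an iterable of (org, host) pairs (hypothetical format).
    """
    if not intel_hosts:
        return []
    hosts_by_org = defaultdict(set)
    for org, host in deploy_events:
        if host in intel_hosts:  # ignore deploys to non-Intel hosts
            hosts_by_org[org].add(host)
    return sorted(
        org
        for org, hosts in hosts_by_org.items()
        if len(hosts) / len(intel_hosts) >= threshold
    )


# Made-up data: org-b deploys to every Intel host, org-a to just one.
intel_hosts = {"xeon-1", "xeon-2", "xeon-3", "xeon-4"}
events = [
    ("org-a", "xeon-1"),
    ("org-a", "amd-1"),
    ("org-b", "xeon-1"),
    ("org-b", "xeon-2"),
    ("org-b", "xeon-3"),
    ("org-b", "xeon-4"),
]

print(broad_schedulers(events, intel_hosts))  # ['org-b']
```

A fractional threshold rather than “every host” is a design choice in this sketch: it would also catch someone who covered nearly all of the Intel hosts.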
We think it’s unlikely that any of our users were impacted by this vulnerability.