tl;dr
Late in the morning on Monday, Jul 24, a hardware vulnerability affecting AMD CPUs was announced, along with exploit code. We had a fleet-wide workaround in place within about 2 hours of the announcement.
During the window of exposure, a malicious app scheduled on the same AMD hardware as your own app could potentially have seen snippets of your app’s memory. We think the likelihood of real exposure is relatively low, but you could reasonably be concerned; see “What Next” at the end of this post for suggested next steps.
This was a bug deep in the operation of one of the most commonly used functions of AMD CPUs. We don’t know if it was actively exploited (there’s no way to tell). We have no evidence that it was. This is not a notice that your application was compromised.
The rest of this post is information you might not really need (feel free to skim), but that we think might be helpful to understand the way we’re thinking about risk here.
The Vulnerability
So, this was, as a vulnerability researcher might put it, a hell of a finding. Zenbleed is best understood as a hardware use-after-free vulnerability; it had the effect of making the CPU itself not memory-safe.
Even more remarkably, Zenbleed hits a pattern of CPU instructions that is at the very core of running ordinary workloads. Modern compilers vectorize string and buffer processing code. Vector extensions use a special set of CPU registers (the XMM registers) to process multiple values simultaneously. The Zenbleed bug tricks the CPU into revealing the contents of those registers.
The net effect is that any memory used with a string or buffer copy, search, or scan is potentially leaked to other processes on the same CPU core.
This bug exclusively impacts AMD Zen2 architecture hosts. We have a bunch of Intel Xeon hardware and a bunch of Zen3 (“Milan”) hardware, but also a lot of Zen2.
The Fix
There are two, and we’ve applied both of them.
First, you can disable one of the processor features (floating-point mov-elimination) the bug depends on. That’s done by setting a bit in a feature flag MSR on the CPU (CPU designers call this a “chicken bit”). There’s a potential performance impact to chickening out, but we did that as quickly as we could and haven’t noticed anything concerning since.
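For the curious, the chicken-bit workaround is a couple of commands with the `msr-tools` package. The MSR address and bit here are the ones published in the Zenbleed advisory for Zen 2; double-check them against AMD's guidance for your exact part before poking MSRs on hardware you care about.

```shell
# Requires root and msr-tools (rdmsr/wrmsr).
modprobe msr

# DE_CFG (MSR 0xC0011029): setting bit 9 disables floating-point
# mov-elimination on Zen 2 -- the Zenbleed chicken bit.
# -a applies it to every CPU.
wrmsr -a 0xc0011029 $(($(rdmsr -c 0xc0011029) | (1 << 9)))

# Spot-check that the bit stuck (expect 1):
rdmsr -f 9:9 0xc0011029
```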
The second is an AMD microcode update that was percolating just as the Zenbleed bug was announced. Ordinarily, a microcode update would demand a reboot of the host we applied it on, and thus a rolling reboot of every machine in our fleet. Thankfully, this update was late-loading, and we were able to apply it without bouncing anything.
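The late-load mechanism is the standard Linux one (paths below assume a typical distro layout; your packaging may differ):

```shell
# Install the updated AMD microcode blobs -- usually via the distro's
# amd64-microcode / linux-firmware package -- so they land under
# /lib/firmware/amd-ucode/. Then ask the kernel to late-load them,
# no reboot required:
echo 1 > /sys/devices/system/cpu/microcode/reload

# Confirm the new revision took effect:
grep microcode /proc/cpuinfo | sort -u
dmesg | grep -i microcode | tail
```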
After each of these steps, we checked, both on the host and inside a VM on the host, that the Zenbleed POC was rendered nonfunctional. Microcode updates and MSR changes need to be re-applied every time a machine boots, so we’re also working on verifying that the fixes are in place after every boot, and on setting up alerting for this particular bug.
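A boot-time verification check of the sort we're describing might look like this. This is a hypothetical sketch, not our actual alerting code; it assumes `msr-tools` and the DE_CFG chicken bit AMD documented for Zen 2, and you'd wire the failure path into whatever alerting you run.

```shell
#!/bin/sh
# Alert if any CPU is missing the Zenbleed workaround bit after boot.
set -eu
modprobe msr 2>/dev/null || true

for cpudir in /dev/cpu/[0-9]*; do
    n=${cpudir#/dev/cpu/}
    bit=$(rdmsr -p "$n" -f 9:9 0xc0011029)
    if [ "$bit" != "1" ]; then
        echo "ALERT: cpu $n missing DE_CFG bit 9 (Zenbleed workaround)" >&2
        exit 1
    fi
done
echo "ok: chicken bit set on all CPUs"
```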
The Risk
Zenbleed was released with exploit code that was easy to run, but tricky to target. Precisely which memory you’re going to get to see is situational: it depends on lucky timing, on which CPU core your app is scheduled on and which other apps get scheduled there, on the hardware you’re running on, and on the manner in which your app uses any given bit of sensitive information.
So, some things you COULD NOT easily have done with Zenbleed:
- You could not have used it to simply dump all the memory from a particular VM (data is only leaked when it’s actually used; more specifically, when it’s used in a way that loads it into the XMM registers; e.g., a vectorized memcpy).
- You could not have used it to gain code execution on a VM (Zenbleed leaks memory, but doesn’t allow attackers to write it).
- You could not have used it to dump all the environment variables from a running image, unless you were actively using those variables (again, data had to be in use to wind up in the XMM registers and then be leaked by Zenbleed).
- You could not generally have used Zenbleed to dump all Fly Secrets. We store secrets in Vault, on separate hardware, and give our API servers only the ability to write secrets, not read them. People sometimes express irritation that we make it hard to read raw secrets using our API; this is why we do that.
There was, as these things go, a relatively short window of time attackers had to exploit Zenbleed on our Zen2 AMD hosts. The Zenbleed exploit itself is somewhat tricky. Maybe a gifted exploit developer could have fully grokked it and extended it to spot particular keys, but it’s probably more likely that your mental model of an attacker should be someone running Tavis Ormandy’s POC exploit verbatim and relying on the luck of the draw for what memory they got to see.
As it happens, most of the memory you’d have seen running this exploit would have been our own host infrastructure (Prometheus metrics, etc.), presumably because our own host infrastructure is broadly scheduled across cores on machines. There’s no straightforward way to use the exploit to target a particular app.
(We are in the process of rotating our own infrastructure secrets. You shouldn’t notice, but we thought you might want to know.)
We’ve been watching our audit logs and haven’t seen any concerning patterns of apps being deployed (for instance, somebody trying to schedule a machine on every one of our AMD hosts).
So: we’d rate the likelihood that anything from your app was compromised as pretty low. Not zero, but zero was never on the table (there is with almost 100% certainty a Linux KVM bug nobody knows about yet lurking somewhere in our stack; welcome to Blue Team cortisol levels, we have stickers).
You can reasonably rate risks differently than we do, and we’d encourage you to consider your own tolerances carefully.
What Next
Out of an abundance of caution, we’d recommend you rotate your API keys and deployment tokens. You can do this from our UI, on the “Account” menu (upper right of the screen), under “Access Tokens”. API tokens are powerful. If you have to ask whether you should revoke and reissue them, we’d say just do it.
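If you'd rather do this from the command line, flyctl has token management built in. The command names below are from a mid-2023 flyctl and the app name is a placeholder; check `fly tokens --help` on your version.

```shell
# See what tokens exist, revoke the old ones, and mint a
# narrowly-scoped replacement:
fly tokens list
fly tokens revoke <token-id>          # repeat for each old token
fly tokens create deploy -a your-app-name
```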
It is possible that a Fly Secret loaded in your app that your app was actually using could have been exposed using Zenbleed. Secrets are just strings, and if you were handling secret strings during the window of exposure, and a malicious app was scheduled on the same core as you, and they got lucky with the timing, your secret could have shown up in the output of their Zenbleed exploit. If you had especially sensitive secrets, you should consider rolling them as well, and the process for doing that depends on how your app works.
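Rolling a Fly Secret is a one-liner per secret; on a running app, setting a new value triggers a deploy so the new value takes effect. The secret name and app name here are placeholders:

```shell
# Generate a fresh value and set it; on a running app this kicks off
# a deploy that restarts your instances with the new value.
fly secrets set SESSION_SECRET="$(openssl rand -hex 32)" -a your-app-name

# Confirm the digest/version changed:
fly secrets list -a your-app-name
```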
This was a pretty remarkable situation. AMD apparently didn’t even have an advisory ready when the bug was published; Tavis Ormandy, who found and published it, had expected it to remain embargoed for another 2 weeks. We’d like there never to be a window in which a published exploit works on Fly.io, but “literal zero-day hardware exploit” is a tough risk to mitigate. We were (and are) all-hands-on-deck during this event, and will do our best to keep you all in the loop.
The Fly Team