The Infrastructure Log tends to fly under people’s radars, which is unfortunate, since it’s a good way to learn more about the details and root causes of past incidents—as well as to hear about many upcoming changes in advance.
https://fly.io/infra-log/2025-01-18/
The narrow but intense box there on Thursday is another global(?) failure of the Machines API, this time due to a CI/CD glitch.
There were also regional failures of blue-green deployments, early on the previous day.
Definitely read the link at the top for official details.
Behind the scenes, efforts included the following, outlined here non-exhaustively…
Measures being taken to prevent future problems
- Decentralization...
  - Of Corrosion, the global metadata database. *
  - Of health checks, moving away from Consul to a newer, custom (and presumably regionalized) setup. *
- Elimination of Corrosion slowness stemming from I/O contention with customers(!).
- Preemptive—as opposed to reactive—trawls throughout the infrastructure...
- Comprehensive review of recurring errors on physical host machines and their respective individual management daemons...
- "[F]ound a bunch of Fly Machines in inconsistent states".
- Swap glitches.
- Metrics for dead Machines.
- Synthetic alerts (proactive probes of functionality, rather than passive). *
- Deliberate reboots to avoid an AMD firmware bug.
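To illustrate what "synthetic" means in the list above: instead of waiting for errors to show up in logs, a probe actively exercises a piece of functionality on a schedule and reports the outcome. The sketch below is only a generic illustration, not a description of Fly.io's actual setup; `run_probe` and its parameters are names I made up.

```python
import time
from typing import Callable


def run_probe(check: Callable[[], bool], attempts: int = 3, delay_s: float = 0.0) -> bool:
    """Actively exercise `check` up to `attempts` times (hypothetical helper).

    Returns True as soon as one attempt succeeds; an exception raised by
    `check` is treated as a failed attempt rather than crashing the prober.
    """
    for attempt in range(attempts):
        try:
            if check():
                return True
        except Exception:
            pass  # a probe that blows up counts as a failure
        if attempt < attempts - 1:
            time.sleep(delay_s)  # brief pause before retrying
    return False


if __name__ == "__main__":
    # Stand-in check; a real probe would, e.g., create and destroy a test resource.
    healthy = run_probe(lambda: 1 + 1 == 2)
    print("probe passed" if healthy else "probe failed: raise an alert")
```

The retry count is there so that a single transient blip doesn't page anyone; how many retries are sensible depends on the thing being probed.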
Efficiency and capacity improvements
- Periodic garbage collection on physical host machines' local databases.
- Silencing metrics for dead Machines (above) also improves Prometheus capacity (IIUC).
New features
- 6PN address stability after migrations. As the Log points out, this sounds like an oxymoron, so it’ll be interesting to hear more,
...
- CPU throttling on the low-budget shared class, which was originally intended for things like HTTP servers rather than background workers, media encoders, etc. *
No word yet
- Intermediate flex class of vCPUs for people who don't need a full 100% yet are feeling pinched now by the 6% of the shared class.‡
- More convenient, floating, app-level egress IPs.‡
- Automated display of system-wide metrics and probe outcomes on the status page.‡
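To put the vCPU percentages above in concrete terms, here is a rough back-of-the-envelope sketch of how much CPU time a sustained quota works out to per wall-clock second. The 6% and 100% figures come from the list above; the 25% value for a "flex" class is purely hypothetical (no number has been announced), and the function name is my own.

```python
def cpu_ms_per_second(quota_fraction: float) -> float:
    """CPU milliseconds available per wall-clock second at a sustained quota."""
    if not 0.0 < quota_fraction <= 1.0:
        raise ValueError("quota_fraction must be in (0, 1]")
    return quota_fraction * 1000.0


quotas = {
    "shared": 0.06,               # roughly 6%, per the list above
    "flex (hypothetical)": 0.25,  # made-up value, for illustration only
    "full": 1.00,                 # a full 100% vCPU
}

if __name__ == "__main__":
    for name, fraction in quotas.items():
        print(f"{name}: {cpu_ms_per_second(fraction):.0f} ms of CPU per second")
```

In other words, a workload that wants to run flat out, such as a media encoder, gets only a small slice of each second on the shared class once it is throttled to its sustained quota.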
*Work continuing from earlier.
‡Fly.io gave encouraging responses to these earlier, but they aren’t necessarily in progress.
Caveat: The above are just my own interpretations and paraphrases, as a fellow user.