Like all platforms, Fly.io has its sunnier days and its rainier ones, and their recently revived Infra Log is the place to read the retrospective details of, and underlying reasons for, that latter group. Entries are posted around 7 days after the original incident, which leaves time to collate everyone’s views, examine internal logs and monitoring feeds, and think extensively about who would have been affected and how…
Past experience has shown that many people overlook this corner of Fly.io, so I thought I might maintain a rolling thread here in the community forum that mirrors at least the titles and corresponding links as they arrive.
| incid. date | pub. date | description |
|---|---|---|
| 03/02 | 03/09 | Our certs vault went down and took some proxies with it |
| 03/02 | 03/11 | Petsem got overwhelmed |
| 03/03 | 03/10 | Cost Explorer errors from internal timeouts |
| 03/03 | 03/11 | GraphQL mutations failing |
| 03/05 | 03/12 | BGP route leak sent North America traffic to Singapore |
| 03/06 | 03/13 | App-scoped IPs deleted by mistake |
| 03/07 | 03/14 | Sydney WireGuard outages from upstream UDP filtering |
(“Incid.” = “incident”, and all dates are in MM/DD format.)
If the opportunity to read postmortems like this is beneficial to you, please do mention that over in the (re-)announcement thread; uncertainty was (very surprisingly) expressed there as to whether people were finding such information useful.
And thanks at this juncture to all the current and past mysterious authors of the Infra Log, for all the insights that they’ve passed on and hard work that they’ve done.
Aside: The following aren’t strictly Log entries, but similarly feature Fly.io employees detailing past problems, fixes that are currently underway, etc.
Feel free to chime in if you notice any other gems that I missed…