PSA: Postmortem(s) rollup

Like all platforms, Fly.io has its sunnier days and its rainier ones, :umbrella_with_rain_drops:, and their recently revived Infra Log is the place to read the retrospective details of, and underlying reasons for, that latter group. Entries are posted around 7 days after the original incident, allowing time to collate everyone’s views, examine internal logs and monitoring feeds, and think through who was affected and how…

Past experience has shown that many people overlook this corner of Fly.io, so I thought I might maintain a rolling thread here in the community forum that mirrors at least the titles and corresponding links as they arrive.

(“Incid.” = “incident”, and all dates are in MM/DD format.)

If the opportunity to read postmortems like this is beneficial to you, please do mention that over in the (re-)announcement thread; uncertainty was (very surprisingly) expressed there as to whether people were finding such information useful.


And thanks at this juncture to all the current and past mysterious authors of the Infra Log, for all the insights that they’ve passed on and hard work that they’ve done, :black_cat:.


Aside: The following aren’t strictly Log entries, but similarly feature Fly.io employees detailing past problems, fixes that are currently underway, etc.

Feel free to chime in if you notice any other gems that I missed…


Here’s the first of the week’s new entries…

| incid. date | pub. date | description |
| --- | --- | --- |
| 03/11 | 03/18 | GraphQL overloaded |

This was probably closely related to the following forum thread:


Pi Day was rung in with an occurrence of a classic nerd bug…

| incid. date | pub. date | description |
| --- | --- | --- |
| 03/14 | 03/23 | Sprites API didn’t like numbers (leading digit in i.d.) |
| 03/16 | 03/23 | Tight capacity in ORD and SIN |

The first one caused an unfortunate, fairly lengthy outage for Sprites under organizations whose names began with a digit.


Aside: There is a new official doc about upgrading Ubuntu within your Sprite.


Really appreciate this


More word of the effects of regional capacity crunches…

| incid. date | pub. date | description |
| --- | --- | --- |
| 03/17 | 03/24 | A bunch of wedged Sprites (502/503 responses) |

The following forum threads seem at least partly related (although some may have had other contributing factors, as the time spans don’t match up 100%):


Capacity squalls subsequently spread to nearby DFW, :cloud_with_rain:

| incid. date | pub. date | description |
| --- | --- | --- |
| 03/18 | 03/25 | DFW capacity errors |
| 03/18 | 03/25 | SJC disappeared (briefly) |

(Things like that first item were probably contributing to the many failed Depot builders, too, but it’s hard to gauge, since people don’t always mention their region in their forum reports.)

[Note: that example is actually from a bit later than the above incident date.]


Storage then filled up within the managed metrics database, in this atypically eventful week…

| incid. date | pub. date | description |
| --- | --- | --- |
| 03/19 | 03/26 | Metrics outage |

One hour of metrics was permanently lost.


Storm clouds returned to DFW, :cloud_with_lightning_and_rain:, as the work week drew to a close…

| incid. date | pub. date | description |
| --- | --- | --- |
| 03/20 | 03/27 | DFW capacity, again |

According to the corresponding real-time status entry, this new capacity shortfall persisted for three days.


A milder start to this new row of the calendar table…

| incid. date | pub. date | description |
| --- | --- | --- |
| 03/23 | 03/30 | Errors viewing logs in Grafana (cert. expired) |

(The Grafana logs are still considered beta, last I heard.)

The following forum thread was probably related:

And, of course, expired mTLS certificates are a classic from Season One of the Infra Log. Back then, one caused a “Big Red Box day” (a global, 7-hour disruption of fundamental platform functions), and not just a small blip like this season’s…
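
This class of problem is also one of the easier ones to check for out of band. Here’s a minimal Go sketch, purely my own illustration (not anything Fly.io actually runs), that warns when a PEM-encoded certificate is within 30 days of expiry; the file path argument and the 30-day threshold are made up for the example:

```go
// Minimal sketch: warn when a PEM-encoded certificate is close to expiry.
// (Illustration only; not Fly.io's actual tooling.)
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"log"
	"os"
	"time"
)

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: certcheck <cert.pem>")
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	block, _ := pem.Decode(data)
	if block == nil {
		log.Fatal("no PEM block found")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		log.Fatal(err)
	}
	remaining := time.Until(cert.NotAfter)
	if remaining < 30*24*time.Hour {
		fmt.Printf("WARNING: certificate expires in %v (%v)\n",
			remaining.Round(time.Hour), cert.NotAfter)
	} else {
		fmt.Printf("OK: certificate valid until %v\n", cert.NotAfter)
	}
}
```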


Aside: The DFW congestion from Friday was marked fully resolved on this day (March 23).


GraphQL then returned to the scene with a glitch in IAD…

| incid. date | pub. date | description |
| --- | --- | --- |
| 03/24 | 03/31 | GraphQL timeouts |

(GraphQL is the lesser API these days, getting gradually subsumed by the Machines API.)


Aside: Changes to reduce Sprites’ initial allocation of memory were briefly mentioned in their release notes, in the March 25 entry.

I’ll note here that the “GraphQL” Rails app handles some Machines API endpoints, specifically POST /apps, as well as the /apps/:name/ip_assignments and /apps/:name/certificates paths; it’s not entirely clear to me how we should make this distinction on the status page, as the rest of the Machines API was unaffected.
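
For concreteness, here’s a rough Go sketch of that distinction: both calls below go to the same public Machines API host, but per the Log entry only the first (app creation) is routed through the GraphQL-era Rails app. The api.machines.dev host, the FLY_API_TOKEN environment variable, and the example app/org names are my own assumptions for illustration, not details taken from the Log:

```go
// Rough sketch (not official client code): two Machines API calls, only the
// first of which is reportedly served by the Rails app.
package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
	"os"
)

func post(path, body string) {
	req, err := http.NewRequest("POST",
		"https://api.machines.dev/v1"+path, bytes.NewBufferString(body))
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("FLY_API_TOKEN"))
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println(path, "->", resp.Status)
}

func main() {
	// App creation: handled by the Rails app, so affected by this incident.
	post("/apps", `{"app_name": "example-app", "org_slug": "personal"}`)
	// Machine creation: served by the Machines API proper, so unaffected.
	post("/apps/example-app/machines", `{"config": {"image": "nginx"}}`)
}
```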


A bit of a fluke a couple days later, with two independent physical network links to the same destination simultaneously failing…

| incid. date | pub. date | description |
| --- | --- | --- |
| 03/26 | 04/02 | FRA region outage |

People always notice when Frankfurt goes down:

This particular failure mode isn’t expected to recur, however.


And in a late-breaking Log entry describing an unrelated incident, there was an overloaded database back across the Atlantic:

| incid. date | pub. date | description |
| --- | --- | --- |
| 03/26 | 04/03 | ORD machine creates bogged down (408 responses) |

(The database in this case was Bolt, a local database used by the physical host machine.)
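
For those unfamiliar, Bolt (nowadays maintained as go.etcd.io/bbolt) is an embedded, single-file key/value store for Go, with one writer at a time. The sketch below is just a generic usage illustration; the bucket and key names are invented, and nothing here is meant to depict what Fly.io’s hosts actually store:

```go
// Generic bbolt illustration: a single-file, embedded key/value store.
// Bucket/key names are invented for the example.
package main

import (
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	db, err := bolt.Open("host-state.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Each Update is a serialized, fsync'd write transaction; Bolt allows
	// only one writer at a time, so a flood of writes queues up behind it.
	err = db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("machines"))
		if err != nil {
			return err
		}
		return b.Put([]byte("machine-id"), []byte("created"))
	})
	if err != nil {
		log.Fatal(err)
	}
}
```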

The capacity squalls in North America moved east to the coast, :cloud_with_rain:, at this point…

| incid. date | pub. date | description |
| --- | --- | --- |
| 03/27 | 04/03 | IAD CPU crunch |

The IAD region is architecturally a single point of failure in the Fly.io platform, last I heard, so bad weather here is especially noteworthy. A complete failure of this particular region can actually cause a global API outage (which happened a year ago).

This may have also contributed to the reports in the forum of 12+ hour disruptions of builders, although obviously the recorded time spans don’t align entirely:

The corresponding real-time status entry did mention builders, in general, and “deploys failing to complete (even for apps outside of the IAD region)”.


In an apparently separate incident, the Sprites innovations continued to explore previously unrecognized corner cases in the underlying Machines platform:

| incid. date | pub. date | description |
| --- | --- | --- |
| 03/27 | 04/03 | Sprites API errors (per-org databases) |

The databases themselves can go to sleep, it seems, and auto-wake was going awry.

Forum reports also put these disruptions in the 12+ hours range, for at least some users:

Here, too, there may have been other contributing factors.

Sprite glitches returned a couple days later, in an unusual Sunday incident…

| incid. date | pub. date | description |
| --- | --- | --- |
| 03/29 | 04/07 | Sprite creation errors in SJC and AMS |

These creation errors were different from the earlier trend of Machine shortages and were rather a Sprites-specific anomaly.


Aside: It doesn’t look like that day’s real-time status entry, mentioning capacity problems in AMS and SIN (with Sprites and builders effects), was directly related.

Here’s a graphical summary of the previous month’s entries, for readier overview scanning…

March 2026
1 4 7
▪ ½▪ ▪ ½▪ ½─ │ ½▪ secrets, BGP, egress IPs, SYD
▪ │ │ GraphQL, Sprites
½│ ½│ ½│ ½│ ▪ ▪ ½│ ½│ ORD, SIN, Sprites, DFW×2, metrics
½│ ½│ ▪ ½▪ ▪ ½▪ │ ½▪ DFW, logs, FRA, ORD, IAD, Sprites, builders
▪ ½│ ½│ AMS, SJC, Sprites, SIN

The format is a little different from Season One of the official Log, in that width is now a gauge of how many users/customers or features were affected, and intensity (dark vs. light colors) indicates whether it was a full outage as opposed to simply flakey, slower, etc. Height is still the duration of the incident, however. In most cases, these are my own estimates/guesses as an outside observer, so take them with a grain of salt.

(For those reading only the alt texts: “:black_small_square:” is a narrow short box, “─” is a wide short box, and “│” is a tall thin one. Fractions indicate intensity.)

Each row of the table covers a week, beginning at the upper left on March 1st and then proceeding through seven columns, one for each day. (Unfortunately, there isn’t much flexibility in inking column edges, or the like.)

The worst of this month’s incidents were all either tall but narrow or wide but short. The Fly.io platform has come a long way from the “Big Red Box days” of the olden times.


Aside: In most cases, you can click on a day’s cell in the table to see the most severe Log entry of that day. Occasionally, however, there are multiple candidates or similar ambiguity, and it instead leads to a post within the present forum thread.

Incidents can straddle midnight (00:00 UTC), so, conversely, it’s possible to get the same link out of multiple cells.

Aside2: There were also many reports in the forum over this time period of failed individual Sprites and flakey or unavailable builders. The Depot builders in particular seem a little sensitive to region congestion.

Aside3: Dates and times are in UTC.

Onward to the first incidents of April…

| incid. date | pub. date | description |
| --- | --- | --- |
| 04/05 | 04/14 | Private networking outage in Sydney (again) |
| 04/06 | 04/14 | Web Sidekiq backlog from stuck usage jobs |

That second one affected GraphQL and the dashboard (including billing display, if I’m interpreting the Log correctly).


Appreciate the work @mayailurus, are you at Fly now?


No, it’s just that the mods were nice enough to move the thread into the Fresh Produce category, to make it easier to find, etc.

(Glad to hear people are finding it useful!)

More downpour in Australia, :globe_showing_asia_australia:, a couple days later…

| incid. date | pub. date | description |
| --- | --- | --- |
| 04/08 | 04/16 | SYD host I/O saturation |

Judging from the real-time status page’s archives, the tail of this (or perhaps an after-effect of it) persisted into the early morning of the next day (April 9).


Aside: When Fly says “host” as above, what they’re referring to is the underlying physical machine (i.e., the hardware itself, not a Fly Machine).

Aside2: The one entry that the real-time status page does list for April 8 (rather than April 9) seems probably unrelated.

Aside3: A brief ORD incident appeared on April 9 alongside SYD, mentioned here for ease of reference in the next summary grid.

Aside4: The Sydney Log entry contains the first mention in quite a while of a new wrinkle in the suspend mechanism.

The next day, also in the APAC part of the world…

| incid. date | pub. date | description |
| --- | --- | --- |
| 04/10 | 04/17 | NRT Machines API errors thrashed Managed Postgres |

Kubernetes (upon which MPG is built) was unhappy with the Machines API timing out.