PSA: Postmortem(s) rollup

Like all platforms, Fly.io has its sunnier days and its rainier ones, :umbrella_with_rain_drops:, and their recently revived Infra Log is the place to read the retrospective details of the latter, along with the underlying reasons for them. Entries are posted around 7 days after the original incident, which leaves time to collate everyone’s views, examine internal logs and monitoring feeds, and think extensively about who would have been affected and how…

Past experience has shown that many people overlook this corner of Fly.io, so I thought I might maintain a rolling thread here in the community forum that mirrors at least the titles and corresponding links as they arrive.

(“Incid.” = “incident”, and all dates are in MM/DD format.)

If the opportunity to read postmortems like this is beneficial to you, please do mention that over in the (re-)announcement thread; uncertainty was (very surprisingly) expressed there as to whether people were finding such information useful.


And thanks at this juncture to all the current and past mysterious authors of the Infra Log, for all the insights that they’ve passed on and hard work that they’ve done, :black_cat:.


Aside: The following aren’t strictly Log entries, but similarly feature Fly.io employees detailing past problems, fixes that are currently underway, etc.

Feel free to chime in if you notice any other gems that I missed…


Here’s the first of the week’s new entries…

| incid. date | pub. date | description |
|---|---|---|
| 03/11 | 03/18 | GraphQL overloaded |

This was probably closely related to the following forum thread:


Pi Day was rung in with an occurrence of a classic nerd bug…

| incid. date | pub. date | description |
|---|---|---|
| 03/14 | 03/23 | Sprites API didn’t like numbers (leading digit in i.d.) |
| 03/16 | 03/23 | Tight capacity in ORD and SIN |

The first one caused an unfortunate, fairly lengthy outage for Sprites belonging to organizations whose names began with a digit.


Aside: There is a new official doc about upgrading Ubuntu within your Sprite.


Really appreciate this


More word on the effects of regional capacity crunches…

| incid. date | pub. date | description |
|---|---|---|
| 03/17 | 03/24 | A bunch of wedged Sprites (502/503 responses) |

The following forum threads seem at least partly related (although some may have had other contributing factors, as the time spans don’t match up 100%):

Capacity squalls subsequently spread to nearby DFW, :cloud_with_rain:

| incid. date | pub. date | description |
|---|---|---|
| 03/18 | 03/25 | DFW capacity errors |
| 03/18 | 03/25 | SJC disappeared (briefly) |

(Probably things like that first item were/are contributing to the many failed Depot builders, too, but it’s hard to gauge, since people don’t always mention their region in their forum reports.)

[Note: that example is actually from a bit later than the above incident date.]

Storage then filled up within the managed metrics database, in this atypically eventful week…

| incid. date | pub. date | description |
|---|---|---|
| 03/19 | 03/26 | Metrics outage |

One hour of metrics was permanently lost.
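
For anyone curious how closely the “around 7 days” publication delay holds for the entries listed in this thread so far, here is a small Python sketch that computes the gap between incident date and publication date. The year is an assumption on my part (the rows only give MM/DD), and the names `ROWS`, `parse_mmdd`, and `lag_days` are just made up for illustration; since every date here falls in March, the chosen year doesn’t change the result.

```python
from datetime import date

# Assumption: the thread only gives MM/DD, so the year is a placeholder.
ASSUMED_YEAR = 2025

# (incident date, publication date, description), copied from the rows above.
ROWS = [
    ("03/11", "03/18", "GraphQL overloaded"),
    ("03/14", "03/23", "Sprites API didn't like numbers"),
    ("03/16", "03/23", "Tight capacity in ORD and SIN"),
    ("03/17", "03/24", "A bunch of wedged Sprites"),
    ("03/18", "03/25", "DFW capacity errors"),
    ("03/18", "03/25", "SJC disappeared (briefly)"),
    ("03/19", "03/26", "Metrics outage"),
]


def parse_mmdd(text, year=ASSUMED_YEAR):
    """Turn an 'MM/DD' string into a date in the assumed year."""
    month, day = (int(part) for part in text.split("/"))
    return date(year, month, day)


def lag_days(incident, published, year=ASSUMED_YEAR):
    """Number of days between the incident and its Infra Log entry."""
    return (parse_mmdd(published, year) - parse_mmdd(incident, year)).days


lags = [lag_days(incident, published) for incident, published, _ in ROWS]

for (incident, published, description), lag in zip(ROWS, lags):
    print(f"{incident} -> {published}  {lag:>2} days  {description}")
```

With the rows above, six of the seven entries come out at exactly 7 days, and the Sprites API one at 9.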