PSA: Postmortem(s) rollup

The capacity squalls in North America moved east to the coast, :cloud_with_rain:, at this point…

| incid. date | pub. date | description |
|---|---|---|
| 03/27 | 04/03 | IAD CPU crunch |

The iad region is architecturally a single point of failure in the Fly.io platform, last I heard, so bad weather here is especially noteworthy. A complete failure of this particular region can actually cause a global API outage (which happened a year ago).

This may also have contributed to the forum reports of 12+ hour builder disruptions, although the recorded time spans don’t align entirely:

The corresponding real-time status entry did mention builders, in general, and “deploys failing to complete (even for apps outside of the IAD region)”.


In an apparently separate incident, the Sprites innovations continued to explore previously unrecognized corner cases in the underlying Machines platform:

| incid. date | pub. date | description |
|---|---|---|
| 03/27 | 04/03 | Sprites API errors (per-org databases) |

The databases themselves can go to sleep, it seems, and auto-wake was going awry.

Forum reports also put these disruptions in the 12+ hour range, for at least some users:

Here, too, there may have been other contributing factors.