PSA: Postmortem(s) rollup

mayailurus · April 21, 2026, 2:55am

A locking mystery on the following Sunday…

incid. date	pub. date	description
04/12	04/20	High edge CPU usage resulting in high latency in ORD

This wasn’t a reprise of the classic 0xffffffffffffffff but maybe something from the depths of SQLite instead.

mayailurus · April 23, 2026, 7:46am

A bug in the Linux kernel, which is an unusual conclusion for an Infra Log entry…

incid. date	pub. date	description
04/14	04/21	WireGuard wg0 one-way host connectivity

It was affecting egress IPs and the Fly Proxy (e.g., its load balancing), among other things.

mayailurus · April 25, 2026, 4:48am

The anticipated write-up of the widely noticed certificates incident on the subsequent Friday…

incid. date	pub. date	description
04/17	04/24	Vault outage broke TLS certificate lookups

A move to a different storage arrangement was confirmed as the plan, albeit in the “longer term”.

mayailurus · April 28, 2026, 6:19am

Further WireGuard wobbles, to start off the new week…

incid. date	pub. date	description
04/20	04/27	Duplicate Wireguard Mesh IPs Wreaking Havoc

This time, however, it was a userland bug, not a kernel one.

Aside: ~~There was also a status-page-only~~ The status-page incident in Singapore on that day had the same underlying cause. (See @PeterCxy’s comment below for more details.)

PeterCxy · April 30, 2026, 3:52pm

Small side note: this was actually the same incident as the one in infra-log. The increased latency was caused by… duplicate wg addresses trashing one of our edges in sin, rendering it mostly useless for a while.

mayailurus · May 1, 2026, 4:50am

Ah… That does make sense. (And sin was specifically used as an example in the Log entry, too.)

Thanks for the correction!

mayailurus · May 1, 2026, 5:42am

500s on the dashboard and with GraphQL, as another Thursday rolled around…

incid. date	pub. date	description
04/23	04/30	Extension provider polling overloaded Postgres

It doesn’t sound like it was MPG that was overloaded, but rather an internal database of Fly.io’s own.

Addendum: The second incident on that day (April 23) was written up a bit later, as can be seen below.

mayailurus · May 6, 2026, 2:43am

On the following Monday, the Postgres storm clouds did move over MPG…

incid. date	pub. date	description
04/27	05/05	MPG provisioning failures from revoked org token
04/23	05/05	GitHub integration management callbacks returned 500s on Fly.io secondary nodes

That first one apparently caused 6-hour outages for certain operations.

mayailurus · May 9, 2026, 2:10pm

As the revived Infra Log’s second month drew to a close, several users reported odd breakage in deploys…

incid. date	pub. date	description
04/28	05/07	Machines API bug caused `fly deploy` to create duplicate Machines

Not only were extra Machines created, but existing ones weren’t updated to the new image.

This persisted slightly, for half an hour or so, into the following day (April 29).

mayailurus · May 10, 2026, 12:31pm

A small graphical overview of the previous month, now that it’s complete in the Log…

April 2026

week	annotations
04/01-04/04	(none)
04/05-04/11	SYD×2, GraphQL, dashboard, metrics, ORD, NRT
04/12-04/18	ORD, SYD, WireGuard, certs
04/19-04/25	WireGuard, SIN, dashboard, GraphQL, IAD
04/26-04/30	deploys, MPG

See the earlier March grid for a description of the annotations.

The first four days of April (corresponding to the top row) were clear of incidents, which was certainly a nice way to start things off…

The wide red mark on April 17 was the Vault certificates store (again); this is one of the few remaining services from the era of using Raft-based clusters for global metadata/configuration (as I understand it). In the longer term, there are plans for replacing it, and a note in the companion forum thread mentioned the decentralized PetSem as the probable substitute.

The wide red stroke on April 28, eleven days later, was a global failure of deploys, due to the Machines API erroneously returning an empty list when asked about existing Machines. This event slightly straddled midnight (00:00 UTC), which is why there are two bars, two outgoing links, etc.
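
To see why an empty listing could produce both symptoms at once (extra Machines created, existing ones left on the old image), here is a minimal sketch. It assumes a simplified reconcile step; `Machine` and `plan_deploy` are hypothetical names for illustration, not the actual flyctl or Machines API code.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    id: str
    image: str


def plan_deploy(listed: list[Machine], desired_count: int, new_image: str) -> dict:
    """Plan a deploy from whatever the listing call returned.

    Hypothetical logic: update each listed Machine that is on another
    image, then create new ones until desired_count is reached.
    """
    updates = [m.id for m in listed if m.image != new_image]
    creates = max(0, desired_count - len(listed))
    return {"update": updates, "create": creates}


# What actually exists (hypothetical app with two Machines on the old image).
existing = [Machine("m1", "app:v1"), Machine("m2", "app:v1")]

# Healthy listing: both Machines are updated in place, none are created.
healthy = plan_deploy(existing, desired_count=2, new_image="app:v2")

# Faulty listing: an empty list looks like a fresh app, so two new
# Machines are planned and the existing two are never touched.
faulty = plan_deploy([], desired_count=2, new_image="app:v2")

print(healthy)
print(faulty)
```

In this toy model, the only difference between the two runs is the list the planner was handed, which is consistent with a listing bug surfacing as a deploy bug.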

Aside: Four incidents didn’t make it into the Infra Log, per se. (Possibly just because there was no further commentary that could be added.) In those spots, the cell in the table links either to the real-time status page’s archives or to a post in the present forum thread, depending on what else was in the air that day.

mayailurus · May 15, 2026, 12:04am

A new month begins in the Infra Log…

incid. date	pub. date	description
05/05	05/12	Petsem primary host lost networking (IAD)
05/06	05/14	Machines API hitting failed hosts in SIN

That first one briefly affected attempts to mutate secrets, but did not stop reads (which are distributed).

Addenda: There was also a forum-only incident with FRA networking on the bottom row’s day (May 6). The recent Fresh Produce on NATing outgoing IPv6 may be the de facto postmortem for that one.

In a similar vein, the real-time status page for the following day (May 7) reported relatively brief incidents in BOM and SJC, noted here for ease of reference in the next summary grid.

mayailurus · May 21, 2026, 9:23pm

The next week, an intriguing interplay between heavy log traffic, hypervisor variants, and Linux kernel upgrades:

incid. date	pub. date	description
05/11	05/18	Kernel upgrade caused machine `stdout` to become wedged by Cloud Hypervisor (Redis)
05/12	05/19	Usage ingestion blocked by stuck Oban jobs

Most people don’t have Cloud Hypervisor underlying their own Machines (on Fly.io); that’s only needed for GPUs and Upstash’s backstage servers. Still, the popularity of the Upstash Redis extension resulted in considerable notice in the forum…

Aside: The real-time status page also mentioned a glitch in the Grafana logs on the top row’s day (May 11) as well as a recurrence of the Redis problem on May 12.

Aside²: The Oban incident may have extended several hours into the following day (May 13).
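
Whether an incident counts as spilling into the next day depends on which clock draws the day boundary. The sketch below uses made-up timestamps (not taken from the Log) and a hypothetical helper, `days_touched`, to show the same interval touching two dates in UTC but only one at a fixed UTC-7 offset.

```python
from datetime import datetime, timedelta, timezone


def days_touched(start: datetime, end: datetime, tz: timezone) -> list[str]:
    """Return the calendar dates (in tz) that the interval [start, end] touches."""
    first = start.astimezone(tz).date()
    last = end.astimezone(tz).date()
    return [
        (first + timedelta(days=n)).isoformat()
        for n in range((last - first).days + 1)
    ]


# Made-up timestamps, for illustration only.
start = datetime(2026, 5, 12, 20, 0, tzinfo=timezone.utc)
end = datetime(2026, 5, 13, 3, 0, tzinfo=timezone.utc)

utc = timezone.utc
utc_minus_7 = timezone(timedelta(hours=-7))  # fixed offset, no DST handling

print(days_touched(start, end, utc))
print(days_touched(start, end, utc_minus_7))
```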

jfent · May 21, 2026, 11:33pm

Depends what time zone you’re in