PSA: Postmortem(s) rollup

A locking mystery on the following Sunday…

incid. date | pub. date | description
04/12 | 04/20 | High edge CPU usage resulting in high latency in ORD

This wasn’t a reprise of the classic 0xffffffffffffffff but maybe something from the depths of SQLite instead.

A bug in the Linux kernel, which is an unusual conclusion for an Infra Log entry…

incid. date | pub. date | description
04/14 | 04/21 | WireGuard wg0 one-way host connectivity

It affected egress IPs and the Fly Proxy (e.g., its load balancing), among other things.

The anticipated write-up of the widely noticed certificates incident on the subsequent Friday…

incid. date | pub. date | description
04/17 | 04/24 | Vault outage broke TLS certificate lookups

Moving to a different storage arrangement was confirmed as the plan, albeit only in the “longer term”.

Further WireGuard wobbles, to start off the new week…

incid. date | pub. date | description
04/20 | 04/27 | Duplicate Wireguard Mesh IPs Wreaking Havoc

This time, however, it was a userland bug, not a kernel one.
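Purely as an illustration of what "duplicate mesh IPs" means in practice, here is a minimal sketch of a duplicate-address check. It is not Fly.io's tooling: the peer names, the addresses, and the `(name, address)` input shape are all made up for the example. The only real API used is Python's standard `ipaddress` module, which normalizes different spellings of the same address.

```python
import ipaddress
from collections import defaultdict


def find_duplicate_addresses(peers):
    """Return {address: [peer names]} for addresses claimed by more than one peer.

    `peers` is an iterable of (peer_name, address_string) pairs. This shape is
    an assumption for the sketch, not Fly.io's actual data model.
    """
    owners = defaultdict(list)
    for name, address in peers:
        # Normalize so that different textual forms of one address compare equal.
        owners[ipaddress.ip_address(address)].append(name)
    return {addr: names for addr, names in owners.items() if len(names) > 1}


# Hypothetical peers; the second and fourth entries spell the same address differently.
peers = [
    ("worker-a", "fdaa:0:1::3"),
    ("edge-sin-1", "fdaa:0:1::2"),
    ("worker-b", "fdaa:0:1::4"),
    ("worker-c", "fdaa:0:1:0:0:0:0:2"),
]
duplicates = find_duplicate_addresses(peers)
for addr, names in duplicates.items():
    print(f"{addr} is claimed by: {', '.join(names)}")
```

A check like this only detects the collision; it says nothing about how the duplicates arose, which is what the Log entry itself covers.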


Aside: The status-page incident in Singapore on that day had the same underlying cause. (See @PeterCxy’s comment below for more details.)

Small side note: this was actually the same incident as the one in infra-log. The increased latency was caused by… duplicate wg addresses trashing one of our edges in sin rendering it mostly useless for a while :grimacing:

Ah… That does make sense. (And sin was specifically used as an example in the Log entry, too.)

Thanks for the correction!

500s on the dashboard and with GraphQL, as another Thursday rolled around…

incid. date | pub. date | description
04/23 | 04/30 | Extension provider polling overloaded Postgres

It doesn’t sound like it was MPG that was overloaded, but rather an internal database of Fly.io’s own.


Addendum: The second incident on that day (April 23) was written up a bit later, as can be seen below.

On the following Monday, the Postgres storm clouds did move over MPG…

That first one apparently caused 6-hour outages for certain operations.

As the revived Infra Log’s second month drew to a close, several users reported odd breakage in deploys…

incid. date | pub. date | description
04/28 | 05/07 | Machines API bug caused fly deploy to create duplicate Machines

Not only were extra Machines created, but existing ones weren’t updated to the new image.

This persisted for half an hour or so into the following day (April 29).
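To see why an empty "existing Machines" answer produces both symptoms at once, here is a toy model of a deploy planner. This is only a sketch under stated assumptions: `plan_deploy`, the dict shape, and the image names are hypothetical, and the real `fly deploy` logic is certainly more involved.

```python
def plan_deploy(existing, desired_count, new_image):
    """Plan a deploy from the Machines the API reports as existing.

    `existing` is a list of dicts like {"id": "m1", "image": "app:v1"}
    (a hypothetical shape for this sketch).
    Returns (ids_to_update, number_to_create).
    """
    # Machines still on an old image get updated in place.
    to_update = [m["id"] for m in existing if m["image"] != new_image]
    # Only the shortfall relative to the desired count gets created.
    to_create = max(desired_count - len(existing), 0)
    return to_update, to_create


# What is really running (hypothetical).
actual = [{"id": "m1", "image": "app:v1"}, {"id": "m2", "image": "app:v1"}]

# Healthy API: both Machines are updated, nothing new is created.
healthy_plan = plan_deploy(actual, desired_count=2, new_image="app:v2")

# Buggy API answers with an empty list: nothing is updated, and two
# brand-new Machines are created alongside the two old ones.
buggy_plan = plan_deploy([], desired_count=2, new_image="app:v2")

print(healthy_plan, buggy_plan)
```

In this model the planner itself is behaving correctly; it is simply acting on a wrong answer about what already exists.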

A small graphical overview of the previous month, now that it’s complete in the Log…

April 2026
½▪ ½▪ ▪ ½▪ ▪ ½│ ½▪ SYD×2, GraphQL, dashboard, metrics, ORD, NRT
½▪ │ ▪ ─ ORD, SYD, WireGuard, certs
▪ ½▪ ½▪ ½▪ WireGuard, SIN, dashboard, GraphQL, IAD
│ ─ ─ deploys, MPG

See the earlier March grid for a description of the annotations.

The first four days of April (corresponding to the top row) were clear of incidents, which was certainly a nice way to start things off…

The wide red mark on April 17 was the Vault certificates store (again); this is one of the few remaining services from the era of using Raft-based clusters for global metadata/configuration (as I understand it). In the longer term, there are plans for replacing it, and a note in the companion forum thread mentioned the decentralized PetSem as the probable substitute.

The wide red stroke on April 28, eleven days later, was a global failure of deploys, due to the Machines API erroneously returning an empty list when asked about existing Machines. This event slightly straddled midnight (00:00 UTC), which is why there are two bars, two outgoing links, etc.
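For anyone curious how one incident turns into two bars, here is a minimal sketch of splitting an interval at 00:00 UTC boundaries. The function name is my own, and the start time in the example is a placeholder (only the roughly half-hour spill into April 29 comes from the Log).

```python
from datetime import datetime, time, timedelta, timezone


def split_by_utc_day(start, end):
    """Split the interval [start, end) into per-day pieces at 00:00 UTC.

    Both arguments must be timezone-aware datetimes.
    """
    start = start.astimezone(timezone.utc)
    end = end.astimezone(timezone.utc)
    segments = []
    while start < end:
        # Midnight UTC at the start of the next calendar day.
        next_midnight = datetime.combine(
            start.date() + timedelta(days=1), time.min, tzinfo=timezone.utc
        )
        segment_end = min(end, next_midnight)
        segments.append((start, segment_end))
        start = segment_end
    return segments


# Placeholder start time; the end reflects "half an hour or so" past midnight.
incident_start = datetime(2026, 4, 28, 22, 0, tzinfo=timezone.utc)
incident_end = datetime(2026, 4, 29, 0, 30, tzinfo=timezone.utc)

segments = split_by_utc_day(incident_start, incident_end)
for seg_start, seg_end in segments:
    print(seg_start.date(), seg_end - seg_start)
```

Each segment lands in its own calendar cell, hence the two bars and two links for a single event.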


Aside: Four incidents didn’t make it into the Infra Log, per se. (Possibly just because there was no further commentary that could be added.) In those spots, the cell in the table links either to the real-time status page’s archives or to a post in the present forum thread, depending on what else was in the air that day.

A new month begins in the Infra Log…

incid. date | pub. date | description
05/05 | 05/12 | Petsem primary host lost networking (IAD)
05/06 | 05/14 | Machines API hitting failed hosts in SIN

That first one briefly affected attempts to mutate secrets, but did not stop reads (which are distributed).


Addenda: There was also a forum-only incident with FRA networking on the bottom row’s day (May 6). The recent Fresh Produce on NATing outgoing IPv6 may be the de facto postmortem for that one.

In a similar vein, the following date’s (May 7) real-time status page reported relatively brief incidents in BOM and SJC, compiled here for ease of reference in the next summary grid.

The next week, an intriguing interplay between heavy log traffic, hypervisor variants, and Linux kernel upgrades:

Most people don’t have Cloud Hypervisor underlying their own Machines (on Fly.io); that’s only needed for GPUs and Upstash’s backstage servers. Still, the popularity of the Upstash Redis extension resulted in considerable notice in the forum…


Aside: The real-time status page also mentioned a glitch in the Grafana logs on the top row’s day (May 11) as well as a recurrence of the Redis trouble on May 12.

Aside2: The Oban incident may have extended several hours into the following day (May 13).

Depends what time zone you’re in :wink: