Does this affect me (besides Fly Postgres)?
If you have to ask, very unlikely. The people this would affect are doing complicated custom private network routing. If you just use the .internal
DNS for your private networking, you’re fine – the record TTL will expire and a new DNS lookup will return the new Machine IP. If you look up FLY_PRIVATE_IP
and save the result into some custom cluster configuration though, you might want to read on.
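For example, here’s a quick way to see the difference from inside one of your Machines (a sketch; <APP_NAME> is a placeholder):

# A fresh AAAA lookup of the .internal name returns the Machine's
# *current* 6PN address, even after a migration
dig +short aaaa <APP_NAME>.internal
# This value, by contrast, was set when the Machine booted; anything
# that persists it can go stale once the Machine migrates
echo $FLY_PRIVATE_IP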
Why weren’t we notified about the migrations? / Can we get notifications about migrations in the future?
Migrations are too fundamental to our operations to announce in advance. You don’t want us migrating infrequently with great fanfare; you want us migrating all the time: to rebalance workloads across hosts and get better performance, to shift workloads off of hosts at even a hint of hardware trouble, to make it easier to upgrade the fleet and provide you with new features, and so on. Migrations are something where we simply have to stick the landing 100% of the time. And we did here; this Fly Postgres incident was a problem with Fly Postgres, not with the migrations.
How exactly are you going to fix the Fly Postgres issue?
We are working on an eBPF program that will run inside each Machine, with knowledge of that Machine’s old and new 6PN addresses. It will transparently rewrite traffic destined for a Machine’s old address so that it reaches the same Machine at its new address.
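Once that lands, connections to a stale address should simply work again. If you want to verify it from your side, a hypothetical smoke test (placeholders throughout, and it assumes nc is available in your image):

# From a Machine on the same private network, try the pre-migration
# 6PN address; with the rewrite in place the connection should succeed
fly ssh console -a <FRONTEND_APP_NAME> -C "nc -zv <OLD_6PN_ADDRESS> 5432"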
I think this affects me in my own app, not just Fly Postgres. Now I have to re-architect my app on short notice?
No. The fix we are building for Fly Postgres will be deployed in all Machines, and it will fix this problem for all Apps. However, we are not committing to old-address rewriting as a permanent part of the platform. This is a stop-gap measure; you should design your application under the assumption that 6PN addresses are not static. Once we see that applications no longer depend on static 6PN addresses, we may turn the stop-gap off.
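In practice that means storing names, not addresses. A minimal sketch, assuming a Postgres connection string (all placeholders):

# Point config at the .internal DNS name so the address is resolved
# at connect time, rather than persisting the value of $FLY_PRIVATE_IP
export DATABASE_URL="postgres://<USER>:<PASSWORD>@<DB_APP_NAME>.internal:5432/<DB_NAME>"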
My Postgres cluster is still broken!
If your Postgres cluster cannot elect a leader despite a quorum of nodes being online, we recommend restoring its data to a new Fly Postgres app. You can do that as follows, replacing everything within <> with the values for your situation:
# Create a new Postgres app from one of your existing app's volumes
# (Do this once)
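# (you can find the volume IDs with: fly volumes list -a <OLD_DB_APP_NAME>)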
fly pg create --initial-cluster-size 3 --fork-from <OLD_DB_APP_NAME>:<OLDEST_VOL_ID> -n <NEW_DB_APP_NAME>
# Repeat from here for every front-end app that connects to the database
# Remove the old database config from the app
fly secrets unset -a <FRONTEND_APP_NAME> --stage DATABASE_URL
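# (--stage records the change without restarting anything; it takes
# effect at the fly secrets deploy step below)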
# Add the config for the new database
fly pg attach -a <FRONTEND_APP_NAME> --database-user <NEW_DB_USER> --database-name <OLD_DB_NAME> <NEW_DB_APP_NAME>
fly secrets deploy -a <FRONTEND_APP_NAME>
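Once the secrets are deployed, it’s worth sanity-checking the new cluster (optional; same placeholders as above):

# Confirm the new cluster's Machines are up and passing health checks
fly status -a <NEW_DB_APP_NAME>
# Open a psql session against the new database to confirm connectivity
fly pg connect -a <NEW_DB_APP_NAME>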