tldr: Your Fly Machine’s private 6PN address is not guaranteed to stay the same across its lifetime and your app should be designed around this. Q&A below.
One of the core features of the Fly Platform is that we have a simple private networking setup called 6PN that is secure by default. You can go back and reread the introduction by Thomas Ptacek — he’s got a great sense of style if we say so ourselves — and you’ll notice that part of the core design of the of setup is that your Machine’s host server identifier is encoded in part of the 6PN address. Put a pin in this; we’ll come back to it.
Last year, we began a transition from using Hashi Nomad as our platform’s orchestrator to using our home-grown one called flyd. This bought us a number of features, but it also made us let go of a few. In particular, Nomad let us migrate workloads across different host servers, and we didn’t have that in flyd at the start. But workload migration is something that we’ve wanted, and we’ve been working on it for months.
We also wanted to use the new flyd migration to provide a better experience than Nomad. With Nomad, a migrated workload was in a new Nomad alloc. So allocs would be created and destroyed without your knowledge. But with flyd, we wanted a relocated Fly Machine to be “the same” Fly Machine, with the same Machine ID. We think that will make for a better product, so that’s how we designed it.
Now, unpin the bit about 6PN — the bit where the host server’s ID is encoded in the 6PN address. That’s the one aspect of the Machine environment that can’t be the same after a migration. We were okay with this, we’d just take the effort to announce that 6PN IPs would not be static over the lifetime of a Fly Machine; they will change from time to time.
Well, here’s the announcement, and it comes with an exclamation mark. As we were getting ready last week to announce the migration functionality, we spotted a few dozen Fly Postgres clusters where an error had allocated multiple cluster nodes on the same host server. Wanting to reduce this risk ASAP, we used our new migration functionality to rebalance those clusters.
It turns out that revealed a bug within Fly Postgres. The migration function worked great; it did exactly what it’s supposed to. But Fly Postgres had saved the original pairing of Machine ID and 6PN address within its config, and then expected to contact the node with the given Machine ID at the saved 6PN address. When this didn’t happen, it locked up. Maybe you saw Fly PG come up in the forums. Maybe it happened to you. This is why.
After working with users to get those clusters back online, we reviewed our designs for 6PN and Machine IDs. 6PN is solid as is; we don’t want to touch that. And we still think that stable Machine IDs will lead to a better product. So we’re not changing the design. And migrations are essential to the long-term health of the platform. So Fly Machines will start seeing their 6PN IPs change.
So here we are. 6PN addresses aren’t static. In the immediate term, we’re not going to be running any more migrations. We’re working on a fix that will forward an old 6PN address to the new one, which will fix both Fly Postgres and anyone else who has the same bug built into their apps. But once that’s up, migrations will resume. We’ve updated the docs and we’re making this announcement. Users that we detect are using old 6PN addresses will also get emailed. 6PN IPs are not static. We’ll help you be ready.