6PN IPs are not static (re: Fly Postgres last Thurs)

tldr: Your Fly Machine’s private 6PN address is not guaranteed to stay the same across its lifetime and your app should be designed around this. Q&A below.

One of the core features of the Fly Platform is that we have a simple private networking setup called 6PN that is secure by default. You can go back and reread the introduction by Thomas Ptacek — he’s got a great sense of style if we say so ourselves — and you’ll notice that part of the core design of the setup is that your Machine’s host server identifier is encoded in part of the 6PN address. Put a pin in this; we’ll come back to it.

Last year, we began a transition from using Hashi Nomad as our platform’s orchestrator to using our home-grown one called flyd. This bought us a number of features, but it also made us let go of a few. In particular, Nomad let us migrate workloads across different host servers, and we didn’t have that in flyd at the start. But workload migration is something that we’ve wanted, and we’ve been working on it for months.

We also wanted to use the new flyd migration to provide a better experience than Nomad. With Nomad, a migrated workload was in a new Nomad alloc. So allocs would be created and destroyed without your knowledge. But with flyd, we wanted a relocated Fly Machine to be “the same” Fly Machine, with the same Machine ID. We think that will make for a better product, so that’s how we designed it.

Now, unpin the bit about 6PN — the bit where the host server’s ID is encoded in the 6PN address. That’s the one aspect of the Machine environment that can’t be the same after a migration. We were okay with this; we’d just make the effort to announce that 6PN IPs are not static over the lifetime of a Fly Machine and will change from time to time.

Well, here’s the announcement, and it comes with an exclamation mark. As we were getting ready last week to announce the migration functionality, we spotted a few dozen Fly Postgres clusters where an error had allocated multiple cluster nodes on the same host server. Wanting to reduce this risk ASAP, we used our new migration functionality to rebalance those clusters.

It turns out that revealed a bug within Fly Postgres. The migration function worked great; it did exactly what it’s supposed to do. But Fly Postgres had saved the original pairing of Machine ID and 6PN address within its config, and then expected to contact the node with the given Machine ID at the saved 6PN address. When that contact failed, it locked up. Maybe you saw Fly PG come up in the forums. Maybe it happened to you. This is why.

After working with users to get those clusters back online, we reviewed our designs for 6PN and Machine IDs. 6PN is solid as is; we don’t want to touch that. And we still think that stable Machine IDs will lead to a better product. So we’re not changing the design. And migrations are essential to the long-term health of the platform. So Fly Machines will start seeing their 6PN IPs change.

So here we are. 6PN addresses aren’t static. In the immediate term, we’re not going to be running any more migrations. We’re working on a fix that will forward an old 6PN address to the new one, which will cover both Fly Postgres and anyone else who has the same bug built into their apps. Once that’s in place, migrations will resume. We’ve updated the docs and we’re making this announcement. We’ll also email users we detect are still using old 6PN addresses. 6PN IPs are not static. We’ll help you be ready.


Does this affect me (besides Fly Postgres)?

If you have to ask, very unlikely. The people this would affect are doing complicated custom private network routing. If you just use the .internal DNS for your private networking, you’re fine – the record TTL will expire and a new DNS lookup will return the new Machine IP. If you look up FLY_PRIVATE_IP and save the result into some custom cluster configuration though, you might want to read on.
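
To make the distinction concrete, here’s a minimal sketch (in C, with a hypothetical app name “my-db”) of the pattern we’re describing: resolve the .internal name every time you dial instead of caching FLY_PRIVATE_IP at boot, so a peer whose 6PN address has changed is still reachable.

/* Minimal sketch (hypothetical app name "my-db"): resolve the .internal
 * name at connect time instead of caching FLY_PRIVATE_IP at boot, so a
 * peer whose 6PN address changed since startup is still reachable. */
#include <arpa/inet.h>
#include <netdb.h>
#include <stdio.h>
#include <string.h>

/* Look up the current 6PN address for a peer; returns 0 on success. */
static int resolve_peer(const char *host, char *out, size_t outlen)
{
    struct addrinfo hints, *res;
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET6;      /* 6PN addresses are private IPv6 */
    hints.ai_socktype = SOCK_STREAM;

    /* Fresh lookup on every dial; the .internal record already points
     * at the Machine's current address. */
    if (getaddrinfo(host, NULL, &hints, &res) != 0)
        return -1;

    struct sockaddr_in6 *sa = (struct sockaddr_in6 *)res->ai_addr;
    inet_ntop(AF_INET6, &sa->sin6_addr, out, outlen);
    freeaddrinfo(res);
    return 0;
}

int main(void)
{
    char addr[INET6_ADDRSTRLEN];
    if (resolve_peer("my-db.internal", addr, sizeof addr) == 0)
        printf("current 6PN address: %s\n", addr);
    return 0;
}

The exact lookup mechanism doesn’t matter; the point is simply not to persist the resolved address anywhere.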

Why weren’t we notified about the migrations? / Can we get notifications about migrations in the future?

Migrations are too fundamental to our operations to have to announce them first. You don’t want us migrating infrequently with great fanfare; you want us migrating all the time: to rebalance workloads across hosts and get better performance, to shift workloads off of hosts with even a hint of hardware trouble, to make it easier to upgrade the fleet and provide you with new features, etc. etc. Migrations are something where we simply have to stick the landing 100% of the time. And we did here; this Fly Postgres incident was a problem with Fly Postgres, not with the migrations.

How exactly are you going to fix the Fly Postgres issue?

We are working on an eBPF program that will run inside each Machine with knowledge of old and new 6PN address pairs. It will rewrite requests sent to a Machine’s old address so they use that same Machine’s new address.
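
For the curious, here’s a rough sketch of one way this kind of rewrite can work; it’s an illustration, not necessarily how our actual fix is built. It uses a cgroup/connect6 hook and a hypothetical addr_map of old-to-new 6PN pairs that userspace would populate:

/* Rough sketch only -- not the actual platform implementation. A
 * cgroup/connect6 program that redirects connect() calls aimed at an old
 * 6PN address to the same Machine's new address. The addr_map contents
 * would be loaded from userspace with the known old->new pairs. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct in6_key {
    __u32 addr[4];                     /* one IPv6 address, 16 bytes */
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, struct in6_key);       /* old 6PN address */
    __type(value, struct in6_key);     /* new 6PN address */
} addr_map SEC(".maps");

SEC("cgroup/connect6")
int rewrite_6pn(struct bpf_sock_addr *ctx)
{
    struct in6_key key = {
        .addr = { ctx->user_ip6[0], ctx->user_ip6[1],
                  ctx->user_ip6[2], ctx->user_ip6[3] },
    };

    struct in6_key *new_addr = bpf_map_lookup_elem(&addr_map, &key);
    if (new_addr) {
        /* Point the connection at the Machine's current address. */
        ctx->user_ip6[0] = new_addr->addr[0];
        ctx->user_ip6[1] = new_addr->addr[1];
        ctx->user_ip6[2] = new_addr->addr[2];
        ctx->user_ip6[3] = new_addr->addr[3];
    }
    return 1;                          /* always allow the connection */
}

char LICENSE[] SEC("license") = "GPL";

However we end up shipping it, the effect is the same: the rewrite happens transparently inside the Machine, so applications that hold a stale address keep working without changes.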

I think this affects me in my own app, not just Fly Postgres. Now I have to re-architect my app on short notice?

No, the fix we are coming up with for Fly Postgres will be deployed in all Machines and will fix this problem for all Apps. However, we are not committing to old-address rewriting as a permanent part of the platform. It is a stop-gap measure; you should design your application with the assumption that 6PN addresses are not static. Once we see that applications are no longer assuming static addresses, we may turn the stop-gap off.

My Postgres cluster is still broken!

We recommend that anyone whose Postgres cluster cannot elect a leader, despite a quorum of nodes being online, restore their data to a new Fly Postgres app. You can do that as follows, replacing everything within <> with values relevant to your situation:

# Create a new Postgres app from one of your existing app's volumes
# (Do this once)
fly pg create --initial-cluster-size 3 --fork-from <OLD_DB_APP_NAME>:<OLDEST_VOL_ID> -n <NEW_DB_APP_NAME>
# Repeat from here for every front-end app that connects to the database
# Remove the old database config from the app
fly secrets unset -a <FRONTEND_APP_NAME> --stage DATABASE_URL
# Add the config for the new database
fly pg attach -a <FRONTEND_APP_NAME> --database-user <NEW_DB_USER> --database-name <OLD_DB_NAME> <NEW_DB_APP_NAME>
fly secrets deploy -a <FRONTEND_APP_NAME>

Not sure I agree with this statement. I’d need more information to be happy with it.

Are the migrations live migrations (the machine keeps running during the migration and the app is none the wiser that it occurred)?

If so, what’s the impact on performance, especially for machines with volumes?

If not, then when are machines rebalanced? It would be a pretty terrible experience to have a machine that’s running a stateful service shut down simply because you wanted to satisfy some metric.

For apps with volumes, in a hardware problem scenario, how will you ensure the volume is migrated free from corruption?

For stateful services, if the machine needs to be turned off then there 100% needs to be a notification sent out. It is not acceptable to have known downtime without notice.

Has this now been announced (aside from this thread)? I suspect most replies to the thread may be related to the concept of workload migrations (is this explicitly mentioned anywhere else?), rather than the topic itself.

It may be a case for some: came for the ‘6PN IPs are not static’, stayed for the ‘migration functionality’.


Yes. Machine migration hasn’t been announced yet. We had something similar in the Nomad days; we lost it after moving away from Nomad, and now we’d like to introduce it again.

Machine migration is not “live”. We stop a machine, remote-fork its volume, and start a new machine on a different host. Stopping a machine follows our existing configuration options such as kill_signal, so your application can run its own cleanup procedure before the VM shuts down.
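
As a small illustration (assuming the default kill_signal, SIGTERM), an app can catch the signal and finish its own cleanup within the configured kill_timeout before the VM is stopped:

/* Minimal sketch, assuming the default kill_signal (SIGTERM): catch the
 * signal and wind down cleanly -- drain connections, flush state, leave
 * the cluster -- before the Machine is stopped for the migration. */
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t stopping = 0;

static void on_term(int sig)
{
    (void)sig;
    stopping = 1;                /* tell the main loop to wind down */
}

int main(void)
{
    struct sigaction sa = {0};
    sa.sa_handler = on_term;
    sigaction(SIGTERM, &sa, NULL);

    while (!stopping) {
        /* ... serve requests ... */
        sleep(1);
    }

    /* Cleanup has to finish within kill_timeout from fly.toml; after
     * that the Machine is stopped regardless. */
    printf("draining and syncing state before shutdown\n");
    return 0;
}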

The hardware problem scenario is tricky. If the original volume is already corrupted due to hardware failures, we’d copy the volume along with the corruption. It may make sense to run e2fsck or other repair tools from the platform side, but that hasn’t been implemented yet.

For stateful services, what would you like to do after getting a notification from the platform? Being stateful and being highly available are two different properties. We want to provide platform primitives that could keep stateful services highly available.


For our needs, we run a CockroachDB cluster. This cluster needs to maintain quorum (more than 50% of nodes alive) in order to survive. We run a 5-node setup with 5 replicas, meaning we can sustain up to 2 nodes down while the cluster survives (albeit with reduced capacity).

This means that migrations should ideally never take more than 1 node down at a time, and should never take more than 2 down at a time; otherwise the cluster fails. During deployments, there’s an option to only update 1 node at a time in a rolling strategy, so this isn’t a problem.

Currently fly does not have the ability to tell the proxy when a node is ready to serve requests separately from determining when the node is healthy (like Kubernetes liveness vs readiness). CockroachDB nodes can be alive and participating in the cluster (healthy) but not ready to serve requests. Because of this, the healthcheck is configured for liveness only; otherwise deploys could fail. This creates another issue for migrations, especially if they happen frequently: the proxy will try to send requests to freshly migrated servers that aren’t ready to serve them. Also, due to delays in propagating node liveness to the proxy edges, requests are sometimes still sent to servers that are shutting down.

Currently fly metrics can’t scrape CockroachDB servers, because metrics over HTTPS aren’t supported. This would make any possible metrics-based approach to fixing these issues a non-starter.

If there’s potential for volume corruption, then there should never be an automatic migration where a new machine is created and started using a corrupted volume. I can’t think of any scenario where someone would want an automatically created server joining their fleet with corrupted data. In regards to CockroachDB, I believe this would likely cause the self-checks to fail during startup and the machine to crash, throwing it into a reboot loop until the reboot limit is reached.

Many of the potential issues introduced by automatic migration are currently mitigated simply by having manual control, which is not available during automatic migration.


Cluster app configurations are generally front-of-mind for us, and that’s no less true for the design of migrations. The Machine migration functionality will take care not to migrate Machines in a way that takes down the cluster running on top of them.

As for volume corruption, we should first clarify that the question is already rather hypothetical. Our host servers have their disks in a RAID config and use ECC RAM, so the odds of hardware failures causing volume corruption are quite low. Now, should the stars align and a series of hardware glitches cause volume corruption, it’s true that the migration process will not catch it and fix it, so the corruption might cause a problem for the Machine on the new host. But that’s equally true for the Machine if it stayed on its old host – the (quite low) risk of volume corruption is simply present; it’s not that migration is creating this risk.