Postgres Flex - Important Stability Updates

Hey everyone,

There are two important updates that I’d like to share.

Internal Migration Stability Fixes

We’ve recently released v0.0.63, which includes an important update to ensure your setup remains stable during internal migrations. These changes allow us to safely move replicas to new hosts using volume forking, creating a more seamless experience with minimal disruption—especially for those running HA setups.

Context

Before v0.0.63, our replication configuration referenced private IPs, which change when a volume is moved to a new host. While we do have tooling to handle these changes, the process can be a bit tricky and error-prone. This release converts those private IP entries to a value that’s stable across migrations, which turns host moves into a non-issue.
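
If you’re curious what your cluster has registered today, one way to peek (just a sketch, assuming the standard repmgr setup that ships with the postgres-flex image) is to query repmgr’s node table; the conninfo column is where those private IPs live:

# Connect to the cluster, then inspect repmgr's registered nodes.
fly pg connect --app <pg-app-name>

-- inside psql:
\c repmgr
SELECT node_id, node_name, conninfo FROM repmgr.nodes;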

Important Note About the Update Process

The update process targets replicas first and the primary last. During the upgrade, you may see warnings about the primary not being able to communicate with the replicas. This is simply because the primary hasn’t been updated yet and doesn’t know how to interpret the new configuration. Rest assured, these warnings will clear once the primary is updated.
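
If you’d like to watch this happen, streaming the app’s logs from another terminal is enough to see the warnings show up and then clear (a quick sketch; substitute your own app name):

# Tail the Postgres app's logs while the rolling update runs.
fly logs --app <pg-app-name>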

Testing

If you’re updating a production database, it’s always a good idea to test the process on a staging copy first. You can quickly create one by forking your production Postgres app:

fly pg create --fork-from <prod-pg-app>
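
From there, run the same update against the fork and confirm it lands cleanly before touching production. A rough sketch, assuming the fork was named <staging-pg-app>:

# Apply the update to the staging fork first.
fly image update --app <staging-pg-app>

# Confirm the fork is now on the new image.
fly image show --app <staging-pg-app>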

Updating

To upgrade, run:

fly image update --app <pg-app-name>
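
Once the rollout finishes, it’s worth confirming that every machine reports the new version and that health checks have settled. For example:

# All machines should report v0.0.63 after the update.
fly image show --app <pg-app-name>

# Health checks should return to passing once the primary is updated.
fly checks list --app <pg-app-name>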

Upgrade Path Updates

This is long overdue, but v0.0.40 introduced a change that could lead to collation mismatch issues. For most users this is fairly easy to address, but for others it can be a real challenge.

To prevent further headaches, the upgrade path for users on versions older than v0.0.40 is now capped at v0.0.40. Meanwhile, users running v0.0.41 and above can upgrade to the latest version without issue.

If you’re on an older setup, you can rejoin the primary upgrade chain by provisioning a new Postgres app and using fly pg import, or by performing a manual pg_dump/pg_restore.
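
As a rough sketch of what that looks like (app names and connection strings below are placeholders, and the dump/restore variant assumes you can reach both databases from wherever you run it, e.g. over fly proxy):

# Option 1: provision a fresh Postgres app and import over the wire.
fly pg create --name <new-pg-app>
fly pg import --app <new-pg-app> "$DATABASE_URL"

# Option 2: manual dump and restore (placeholder connection URLs).
pg_dump --format=custom --file=dump.pgdata "$OLD_DATABASE_URL"
pg_restore --no-owner --dbname="$NEW_DATABASE_URL" dump.pgdata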

Questions?

If you have any questions or need assistance, don’t hesitate to reach out!

11 Likes

Awesome news! Here’s some data on how the update went for me in case that might be useful.

Context

The app running on my servers is Elixir + Phoenix, and the health check endpoint is implemented as a plug to limit performance impact.

The app knows nothing about replicas in production and connects to the DB through the flycast address.

Here’s the command I’m using for monitoring updates/failovers:

while true; do echo "$(date +"%T"): $(curl -s https://$HOSTNAME/api/health_check)"; sleep 0.5; done

Staging

  • DB config: 1 x shared-cpu-1x@768MB
  • “absolute” downtime: 10 seconds

Logs:

19:38:04: ok
19:38:05: error
19:38:05: error
19:38:06: error
19:38:08: error
19:38:10: error
19:38:12: error
19:38:14: ok

Production

  • DB config: 3 x shared-cpu-4x@1024MB (HA cluster)
  • “intermittent” downtime: 1 minute 09 seconds
  • “absolute” downtime: 53 seconds

Note: while testing PG failover with the same command on the same cluster a month ago, the “absolute” observed downtime was only 25 seconds, and no “intermittent” downtime was observed.

Logs:

19:40:22: ok
19:40:23: error
19:40:23: error
19:40:24: ok
19:40:25: ok
19:40:25: ok
19:40:26: ok
19:40:27: ok
19:40:27: ok
19:40:28: ok
19:40:28: ok
19:40:29: ok
19:40:30: error
19:40:30: ok
19:40:31: ok
19:40:32: ok
19:40:32: ok
19:40:33: ok
19:40:34: ok
19:40:34: error
19:40:35: error
19:40:36: ok
19:40:37: error
19:41:29: error    # request on hold for 52s before failing
19:41:30: ok
19:41:30: ok
19:41:31: ok
19:41:32: error
19:41:32: ok

P.S. Editing as it looks like this is my first post here: I’ve been loving Fly.io so much for more than a year, keep up the good work team! :heart:

4 Likes

Maybe this is obvious but… how do we know which version we’re using? :thinking:

Edit:

Sorry. Just saw that in the email:

fly image show --app <pg-app-name>

When do we have to get this update done by?

It’s advisable to complete the update as soon as possible to minimize risk.

There are a number of reasons why we may need to perform an internal migration, and unfortunately we won’t always have control over the timing.

You would notify us before doing such internal migrations, right?

You should receive notification upfront for any planned volume migrations. That said, there are emergency situations where upfront notification may not be possible.

1 Like

Got a cluster on v0.0.46 and the update command timed out. Only one of the 3 nodes made it to 63. Not sure what to do now.

Edit - I tried the update out with a --fork-from and it worked. The only difference is that the --fork-from is a single-node database while the production db is a 3-node cluster. Any ideas?

fly image show --app frdm-db-production  
Updates available:

Machine "784e1dea4e19e8" flyio/postgres-flex:15.3 (v0.0.46) -> flyio/postgres-flex:15.8 (v0.0.63)
Machine "91852306f36658" flyio/postgres-flex:15.3 (v0.0.46) -> flyio/postgres-flex:15.8 (v0.0.63)

Run `flyctl image update` to migrate to the latest image version.
Image Details
MACHINE ID      REGISTRY                  REPOSITORY           TAG   VERSION  DIGEST                                                                   LABELS
d890de5c299248  docker-hub-mirror.fly.io  flyio/postgres-flex  15.8  v0.0.63  sha256:5d3ee230d343d9682b8b0aec9f7a71c51fe04f18819f2ceb9ed1d33dc032ebb5  fly.pg-manager=repmgr, fly.pg-version=15.8, fly.version=v0.0.63, fly.app_role=postgres_cluster
784e1dea4e19e8  docker-hub-mirror.fly.io  flyio/postgres-flex  15.3  v0.0.46  sha256:44b698752cf113110f2fa72443d7fe452b48228aafbb0d93045ef1e3282360a6  fly.pg-version=15.3-1.pgdg120+1, fly.version=v0.0.46, fly.app_role=postgres_cluster, fly.pg-manager=repmgr
91852306f36658  docker-hub-mirror.fly.io  flyio/postgres-flex  15.3  v0.0.46  sha256:44b698752cf113110f2fa72443d7fe452b48228aafbb0d93045ef1e3282360a6  fly.app_role=postgres_cluster, fly.pg-manager=repmgr, fly.pg-version=15.3-1.pgdg120+1, fly.version=v0.0.46

fly checks list shows that the machine is “gone”:

-------*----------*----------------*-------------------*------------------------------------------------------------------------------
  pg   | critical | d890de5c299248 | 6m38s ago         | gone                                                                         
-------*----------*----------------*-------------------*------------------------------------------------------------------------------
  role | critical | d890de5c299248 | 6m38s ago         | gone                                                                         
-------*----------*----------------*-------------------*------------------------------------------------------------------------------
  vm   | passing  | d890de5c299248 | 6m25s ago         | [✓] checkDisk: 20.61 GB (52.8%) free space on /data/ (44.19µs)               
       |          |                |                   | [✓] checkLoad: load averages: 0.00 0.00 0.00 (121.36µs)                      
       |          |                |                   | [✓] memory: system spent 0s of the last 60s waiting on memory (35.86µs)      
       |          |                |                   | [✓] cpu: system spent 450ms of the last 60s waiting on cpu (25.67µs)         
       |          |                |                   | [✓] io: system spent 192ms of the last 60s waiting on io (21.96µs)           
-------*----------*----------------*-------------------*------------------------------------------------------------------------------

If you used --fork-from, you can scale it to a 3-node cluster by cloning your primary with:

fly machines clone <primary-machine-id>
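
For example, running the clone twice takes a single-node fork to a 3-node cluster (a sketch; pass --app if you’re outside the app’s directory):

# Each clone adds a replica alongside the existing primary.
fly machines clone <primary-machine-id> --app <pg-app-name>
fly machines clone <primary-machine-id> --app <pg-app-name>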

Got a cluster on v0.0.46 and the update command timed out. Only one of the 3 nodes made it to 63. Not sure what to do now.

That’s strange. Which region are you in? Also, you can confirm whether it’s the VM or the health check system by doing the following:

# SSH into machine d890de5c299248
fly ssh console -s --app <your-app-name>

# Use curl to manually check the health checks.
curl http://localhost:5500/flycheck/pg

If the health checks look good there, then something got wedged within the health check process.

This only affects machines using the flyio/postgres-flex repository, right, not older machines using flyio/postgres?

Just looked and saw that my main organization has three apps using v0.0.41 on flyio/postgres, and two using v0.0.51 on flyio/postgres-flex.

I did this for an app that was on 0.36 or something. It felt quite scary.

I tried to follow Backup, Restores, & Snapshots · Fly Docs, which mentions a fly info command that no longer exists and suggests restoring from a snapshot, which also didn’t work.

I first tried to create a new Postgres app from a snapshot of the existing one by doing

fly volumes list --app old-db
fly volumes snapshots list vol_something
fly postgres create --snapshot-id vs_something --name new-db

but the snapshot seemed to have no effect, since when I did fly pg connect --app new-db and \l I didn’t see my database listed.

I then did

fly postgres import --app new-db "$DATABASE_URL" 

which was mentioned in the email, using as DATABASE_URL the secret that’s currently in use by the app.

And then, as mentioned at Backup, Restores, & Snapshots · Fly Docs, I tried to do the switchover with detach & attach.

fly postgres detach old-db

failed with

Error: error running user-delete: 500: ERROR: role "myapp" cannot be dropped because some objects depend on it (SQLSTATE 2BP01)

I tried to fly machine stop or fly scale count 0 the app as suggested in some forum posts, but to no avail. Some posts suggested deleting the user, which seems very destructive.
Then I saw How to detach postgres without deleting db user? - #3 by Elder, which suggests just reassigning ownership. Apparently that’s easily reversible. So I did

fly postgres connect --app old-db
…
REASSIGN OWNED BY myapp TO postgres;
\c myapp
REASSIGN OWNED BY myapp TO postgres;

(note: reassigning in the default database alone was not enough; I had to switch to the myapp db and reassign there too)

and then I was able to run

fly postgres detach --app myapp old-db
fly postgres attach --app myapp new-db

This told me that Secret “DATABASE_URL” was scheduled to be removed from app myapp, that Postgres cluster new-db is now attached to myapp, and that the following secret was added to myapp: DATABASE_URL=***.

HOWEVER: on trying to load the page, the app was still trying to connect to the old db, which was now missing the user.

I tried fly machine restart, but it still used the old connection string. I then manually did

fly secrets set --app myapp 'DATABASE_URL=***'

which told me “Updating existing machines in 'myapp' with rolling strategy” and seems to have both set and applied(?) the secret.

Maybe there’s a separate command for that which fly attach neglected to run.

Hope this helps someone (or inspires fixing some papercuts in flyctl).

3 Likes

That is correct.

@shaun The apps are in the SYD region.

After the command failed yesterday, I just cloned a new machine into the cluster so that it had 3 healthy nodes again and deleted the failing machine. I don’t remember which health check was failing. I’ll try the process again later today. If it fails again, I’ll create a new fork, update it, and change the ENVs accordingly (I don’t use pg attach/detach; I manage database URLs myself).

EDIT: I tried the command again today and it worked fine on the prod cluster; all 3 machines updated successfully. Seems like yesterday’s failure was specific to that single machine, which I have since deleted.

1 Like