Postgres Flex - Important Stability Updates

Hey everyone,

There are two important updates that I'd like to share.

Internal Migration Stability Fixes

We’ve recently released v0.0.63, which includes an important update to ensure your setup remains stable during internal migrations. These changes allow us to safely move replicas to new hosts using volume forking, creating a more seamless experience with minimal disruption—especially for those running HA setups.

Context

Before v0.0.63, our replication configuration referenced private IPs, which change when a volume is moved to a new host. While we do have tooling to handle these changes, the process can be a bit tricky and prone to errors. This release converts these private IP entries to values that remain stable across migrations, which turns this into a non-issue.

Important Note About the Update Process

The update process targets replicas first and the primary last. During the upgrade, you may see warnings about the primary not being able to communicate with the replicas. This is simply because the primary hasn’t been updated yet and doesn’t know how to interpret the new configuration. Rest assured, these warnings will clear once the primary is updated.

Testing

If you’re updating a production-level database, it’s always a good idea to test the process first using a staging db. You can quickly create a staging db by forking your production Postgres app:

fly pg create --fork-from <prod-pg-app>

Updating

To upgrade, run:

fly image update --app <pg-app-name>
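If you want to confirm what you're running before and after, the update can be sandwiched between version checks; a rough sketch (`<pg-app-name>` is a placeholder, and the exact tags shown will depend on your app):

```shell
# Note the current image tag before touching anything.
fly image show --app <pg-app-name>

# Run the update: replicas are updated first, the primary last,
# so warnings about primary/replica communication are expected mid-flight.
fly image update --app <pg-app-name>

# Confirm the new tag (v0.0.63 or later) is now running on all machines.
fly image show --app <pg-app-name>
```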

Upgrade Path Updates

This is long overdue: v0.0.40 introduced a change that can lead to collation mismatch issues. For most users these are fairly easy to address, but for others they can be a real challenge.

To prevent further headaches, the upgrade path for users on versions older than v0.0.40 is now capped at v0.0.40. Meanwhile, users running v0.0.41 and above can upgrade to the latest version without issue.

If you’re on an older setup, you can rejoin the primary upgrade chain by provisioning a new Postgres app and using fly pg import, or by performing a manual pg_dump/pg_restore.
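The rejoin path might look roughly like the following; app names and connection URIs are placeholders, and the exact `fly pg import` invocation may vary with your flyctl version:

```shell
# Option 1: provision a fresh Postgres app and import from the old one.
fly pg create --name <new-pg-app>
fly pg import --app <new-pg-app> <old-pg-connection-uri>

# Option 2: manual dump and restore with standard Postgres tooling.
pg_dump -Fc "$OLD_DATABASE_URL" -f backup.dump    # custom-format dump
pg_restore --no-owner -d "$NEW_DATABASE_URL" backup.dump
```

Either way you end up on a freshly provisioned app running a recent image, back on the primary upgrade chain.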

Questions?

If you have any questions or need assistance, don’t hesitate to reach out!


Awesome news! Here’s some data on how the update went for me in case that might be useful.

Context

The app running on my servers is Elixir + Phoenix, and the health check endpoint is implemented as a plug to limit performance impact.

The app knows nothing about replicas in production and connects to the DB through the flycast address.

Here’s the command I’m using for monitoring updates/failovers:

while true; do echo "$(date +"%T"): $(curl -s https://$HOSTNAME/api/health_check)"; sleep 0.5; done
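The downtime figures below can be recovered from such a log mechanically; here's a small helper of my own (not from the original post), which defines an outage as the gap from the last `ok` before an error run to the first `ok` after it:

```shell
# Reduce a health-check log of "HH:MM:SS: ok" / "HH:MM:SS: error" lines
# to the longest single outage, in seconds. Assumes timestamps are
# monotonic and the log doesn't cross midnight.
downtime() {
  awk -F': ' '
    function secs(t, a) { split(t, a, ":"); return a[1] * 3600 + a[2] * 60 + a[3] }
    $2 ~ /^ok/ {
      # An error run just ended: measure last-ok-before to first-ok-after.
      if (inerr) { d = secs($1) - secs(lastok); if (d > max) max = d; inerr = 0 }
      lastok = $1
      next
    }
    { if (lastok != "") inerr = 1 }  # error line: open or extend an outage window
    END { print max + 0 }
  '
}
# Usage: downtime < health.log
```

Note the two figures in the post may use slightly different definitions ("intermittent" counts the whole flapping window), so treat this as one reasonable interpretation.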

Staging

  • DB config: 1 x shared-cpu-1x@768MB
  • “absolute” downtime: 10 seconds

Logs:

19:38:04: ok
19:38:05: error
19:38:05: error
19:38:06: error
19:38:08: error
19:38:10: error
19:38:12: error
19:38:14: ok

Production

  • DB config: 3 x shared-cpu-4x@1024MB (HA cluster)
  • “intermittent” downtime: 1 minute 09 seconds
  • “absolute” downtime: 53 seconds

Note: while testing PG failover with the same command on the same cluster a month ago, the “absolute” observed downtime was only 25 seconds, and no “intermittent” downtime was observed.

Logs:

19:40:22: ok
19:40:23: error
19:40:23: error
19:40:24: ok
19:40:25: ok
19:40:25: ok
19:40:26: ok
19:40:27: ok
19:40:27: ok
19:40:28: ok
19:40:28: ok
19:40:29: ok
19:40:30: error
19:40:30: ok
19:40:31: ok
19:40:32: ok
19:40:32: ok
19:40:33: ok
19:40:34: ok
19:40:34: error
19:40:35: error
19:40:36: ok
19:40:37: error
19:41:29: error    # request on hold for 52s before failing
19:41:30: ok
19:41:30: ok
19:41:31: ok
19:41:32: error
19:41:32: ok

P.S. editing as it looks like this is my first post here, been loving Fly.io so much for more than a year, keep up the good work team! :heart:
