What is the correct process to change the postgres leader region?

I wanted to move my postgres leader region from SYD to LAX.
I created a replica in LAX, but couldn’t find a way to trigger the failover.

I tried restarting the SYD postgres VM but it just restarted as leader.
I tried deleting the SYD pg_data volume, which removed the leader, but now
the replica seems stuck as a replica with no leader:

2022-04-22T05:55:59Z app[ba4a2446] lax [info]sentinel | 2022-04-22T05:55:59.401Z	INFO	cmd/sentinel.go:1006	trying to find a new master to replace failed master
2022-04-22T05:55:59Z app[ba4a2446] lax [info]sentinel | 2022-04-22T05:55:59.401Z	INFO	cmd/sentinel.go:741	ignoring keeper since it cannot be master (--can-be-master=false)	{"db": "4bfd05f3", "keeper": "7d180d2f82"}
2022-04-22T05:55:59Z app[ba4a2446] lax [info]sentinel | 2022-04-22T05:55:59.401Z	ERROR	cmd/sentinel.go:1009	no eligible masters
2022-04-22T05:56:01Z app[ba4a2446] lax [info]keeper   | 2022-04-22T05:56:01.738Z	INFO	cmd/keeper.go:1556	our db requested role is standby	{"followedDB": "ebc4d408"}
2022-04-22T05:56:06Z app[ba4a2446] lax [info]sentinel | 2022-04-22T05:56:06.146Z	WARN	cmd/sentinel.go:276	no keeper info available	{"db": "ebc4d408", "keeper": "2983094dd2"}
2022-04-22T05:56:06Z app[ba4a2446] lax [info]sentinel | 2022-04-22T05:56:06.150Z	INFO	cmd/sentinel.go:995	master db is failed	{"db": "ebc4d408", "keeper": "2983094dd2"}
2022-04-22T05:56:06Z app[ba4a2446] lax [info]sentinel | 2022-04-22T05:56:06.151Z	INFO	cmd/sentinel.go:1006	trying to find a new master to replace failed master
2022-04-22T05:56:06Z app[ba4a2446] lax [info]sentinel | 2022-04-22T05:56:06.151Z	INFO	cmd/sentinel.go:741	ignoring keeper since it cannot be master (--can-be-master=false)	{"db": "4bfd05f3", "keeper": "7d180d2f82"}

So I have two questions:

  1. What is the recommended process to change the postgres leader region?
  2. Why didn’t failover work in this instance? (I expected the LAX replica would be promoted to leader)

Hey there,

So it’s a bit of a process, but I’ll walk you through it.

Step 1: Adjust your app’s primary region.

Pull down your fly.toml file if you haven’t already.

fly config save --app <app-name>

Modify the PRIMARY_REGION value inside your fly.toml file:

[env]
 PRIMARY_REGION = "lax"

Deploy your app.

WARNING: Your app will not accept writes until you issue the failover in Step 2. The HAProxy routes connections to the primary and leverages the PRIMARY_REGION env var. This also requires an immediate deploy, which means the new image is deployed to all members at the same time, so there will be a brief period of downtime. I would recommend testing this process in a staging environment if this app is critical.

# Run this inside the same directory as your fly.toml
fly deploy . --image flyio/postgres:<major-version> --strategy=immediate

If you don’t know which image you’re running, you can view it by running:
fly image show

So for example, if your Tag indicates 14.2, the image reference should look like flyio/postgres:14.
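
Once that deploy completes, it’s worth sanity-checking that the new value actually landed on the running instances before moving on to Step 2. A rough sketch, assuming the app env is visible in the SSH shell:

fly ssh console --app <app-name>

# Should print "lax"
echo $PRIMARY_REGION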

Step 2: Orchestrating a failover

Verify your version:

fly image show

Registry   = registry-1.docker.io
Repository = flyio/postgres
Tag        = 14.2
Version    = v0.0.21
Digest     = sha256:4e4a7bfef439b5e02fa3803c4b8225b57c297fa114f995855d5d7807828d9008

If you’re running PG13/14 with Version v0.0.13+:

fly ssh console --app <app-name>

pg-failover
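
Once pg-failover finishes, the instance in lax should show up as running (leader). A quick way to confirm, just reusing fly status as shown later in this thread:

fly status --app <app-name>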

If you’re running PG12 or an earlier Version:

fly ssh console --app <app-name>

bash
# Export Stolon specific env vars.
export $(cat /data/.env | xargs) 

# Identify the master keeper id
stolonctl status  

# Fail the master keeper to trigger the failover.
stolonctl failkeeper <master-keeper-id>

# Verify the state of the world.
stolonctl status  # Verify that the master has indeed changed.

Let me know if you have any questions on anything.


Exactly what I needed. Thank you.


Just ran through this process to switch primary region from lax to iad and now I see two leaders…

❯ fly status -a web6-db
Update available 0.0.420 -> v0.0.426.
Run "fly version update" to upgrade.
App
  Name     = web6-db
  Version  = 10
  Status   = running
  Platform = nomad

Instances
ID              PROCESS VERSION REGION  DESIRED STATUS                  HEALTH CHECKS           RESTARTS        CREATED
73023056        app     10      iad     run     running (leader)        3 total, 3 passing      0               1m43s ago
eb323e5a        app     10      lax     run     running (leader)        3 total, 3 passing      0               4m39s ago

If I connect to the DB that’s in iad, it isn’t the primary, and writes to it fail.

Update: I deleted the volume in the lax region, scaled the app down to 1, and it looks like it’s working as expected now, but I’m seeing this logged quite frequently:

sentinel | 2022-11-01T19:57:36.241Z	WARN	cmd/sentinel.go:276	no keeper info available	{"db": "0952c761", "keeper": "7d182220c2"}
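
For reference, the rough shape of what I ran to drop the old lax member (volume ID is a placeholder):

fly volumes list -a web6-db
fly volumes delete <lax-volume-id>
fly scale count 1 -a web6-db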

There’s a quirk in Consul’s health checks that means a passing check won’t change its text status for a few minutes. That iad leader just needed a few minutes to get the updated value; it was secondary the whole time.

The keeper info warnings are somewhat normal; the sentinel thinks the previous node might come back. You can manually remove it with something like this:

fly ssh console -a <postgres app>
export $(cat /data/.env | xargs)
stolonctl removekeeper <keeper id>
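
It’s worth running stolonctl status again afterwards to confirm the old keeper is no longer listed:

stolonctl status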

For v0.0.46, there’s no pg-failover binary present. Not sure how to proceed there.

@Haarolean I would recommend updating flyctl.

It’s not a flyctl issue, as pg-failover is a binary on the remote machine.
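
If anyone else hits this: the stolonctl commands from earlier in the thread may still work as a fallback, assuming your image is still Stolon-based (i.e. /data/.env and stolonctl exist on the VM). I haven’t confirmed this on v0.0.46, so treat it as a sketch:

fly ssh console --app <app-name>

# Export Stolon specific env vars.
export $(cat /data/.env | xargs)

# Identify the master keeper id, then fail it to trigger the failover.
stolonctl status
stolonctl failkeeper <master-keeper-id>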