Hello!
I recently tried to change my Postgres cluster leader (via PRIMARY_REGION) from sea to sjc, but ran into errors after following all the steps outlined here: how to change the postgres leader region? - #2 by kurt
After saving the generated fly.toml and editing it to set the desired primary region, I ran fly deploy -i flyio/postgres:14.4, which unfortunately failed with the following:
2022-10-05T00:21:15Z [info]proxy | [WARNING] 277/002115 (545) : parsing [/fly/haproxy.cfg:38]: Missing LF on last line, file might have been truncated at position 96. This will become a hard error in HAProxy 2.3.
2022-10-05T00:21:15Z [info]proxy | [NOTICE] 277/002115 (545) : New worker #1 (570) forked
2022-10-05T00:21:16Z [info]checking stolon status
2022-10-05T00:21:16Z [info]exporter | ERRO[0001] Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[fdaa:0:7a94:a7b:23c5:2:3283:2]:5433/postgres?sslmode=disable): dial tcp [fdaa:0:7a94:a7b:23c5:2:3283:2]:5433: connect: connection refused source="postgres_exporter.go:1658"
2022-10-05T00:21:16Z [info]proxy | [WARNING] 277/002116 (570) : bk_db/pg1 changed its IP from (none) to fdaa:0:7a94:a7b:ad1:2:32c0:2 by flydns/dns1.
2022-10-05T00:21:16Z [info]proxy | [WARNING] 277/002116 (570) : Server bk_db/pg1 ('sjc.hnhired-db.internal') is UP/READY (resolves again).
2022-10-05T00:21:16Z [info]proxy | [WARNING] 277/002116 (570) : Server bk_db/pg1 administratively READY thanks to valid DNS answer.
2022-10-05T00:21:17Z [info]keeper is healthy, db is healthy, role: standby
2022-10-05T00:21:17Z [info]keeper | 2022-10-05T00:21:17.824Z ERROR cmd/keeper.go:719 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2022-10-05T00:21:18Z [info]proxy | [WARNING] 277/002118 (570) : Server bk_db/pg1 is DOWN, reason: Layer7 invalid response, info: "HTTP content check did not match", check duration: 286ms. 0 active and 1 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
2022-10-05T00:21:20Z [info]keeper | 2022-10-05T00:21:20.325Z ERROR cmd/keeper.go:719 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2022-10-05T00:21:22Z [info]proxy | [WARNING] 277/002122 (570) : Backup Server bk_db/pg is DOWN, reason: Layer7 timeout, check duration: 5001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2022-10-05T00:21:22Z [info]proxy | [NOTICE] 277/002122 (570) : haproxy version is 2.2.9-2+deb11u3
2022-10-05T00:21:22Z [info]proxy | [NOTICE] 277/002122 (570) : path to executable is /usr/sbin/haproxy
2022-10-05T00:21:22Z [info]proxy | [ALERT] 277/002122 (570) : backend 'bk_db' has no server available!
2022-10-05T00:21:22Z [info]keeper | 2022-10-05T00:21:22.827Z ERROR cmd/keeper.go:719 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2022-10-05T00:21:23Z [info]keeper | 2022-10-05 00:21:23.091 UTC [592] LOG: starting PostgreSQL 14.4 (Debian 14.4-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2022-10-05T00:21:23Z [info]keeper | 2022-10-05 00:21:23.091 UTC [592] LOG: listening on IPv6 address "fdaa:0:7a94:a7b:23c5:2:3283:2", port 5433
2022-10-05T00:21:23Z [info]keeper | 2022-10-05 00:21:23.092 UTC [592] LOG: listening on Unix socket "/tmp/.s.PGSQL.5433"
2022-10-05T00:21:23Z [info]keeper | 2022-10-05 00:21:23.095 UTC [593] LOG: database system was shut down in recovery at 2022-10-05 00:21:07 UTC
2022-10-05T00:21:23Z [info]keeper | 2022-10-05 00:21:23.095 UTC [593] LOG: entering standby mode
2022-10-05T00:21:23Z [info]keeper | 2022-10-05 00:21:23.098 UTC [593] LOG: redo starts at 0/C060188
2022-10-05T00:21:23Z [info]keeper | 2022-10-05 00:21:23.099 UTC [593] LOG: consistent recovery state reached at 0/C060A98
2022-10-05T00:21:23Z [info]keeper | 2022-10-05 00:21:23.099 UTC [593] LOG: invalid record length at 0/C060A98: wanted 24, got 0
2022-10-05T00:21:23Z [info]keeper | 2022-10-05 00:21:23.100 UTC [592] LOG: database system is ready to accept read-only connections
2022-10-05T00:21:24Z [info]keeper | 2022-10-05 00:21:24.094 UTC [597] LOG: started streaming WAL from primary at 0/C000000 on timeline 1
2022-10-05T00:21:46Z [info]exporter | INFO[0030] Established new database connection to "fdaa:0:7a94:a7b:23c5:2:3283:2:5433". source="postgres_exporter.go:970"
2022-10-05T00:21:46Z [info]exporter | INFO[0030] Semantic Version Changed on "fdaa:0:7a94:a7b:23c5:2:3283:2:5433": 0.0.0 -> 14.4.0 source="postgres_exporter.go:1539"
2022-10-05T00:21:46Z [info]exporter | INFO[0030] Established new database connection to "fdaa:0:7a94:a7b:23c5:2:3283:2:5433". source="postgres_exporter.go:970"
2022-10-05T00:21:46Z [info]exporter | INFO[0030] Semantic Version Changed on "fdaa:0:7a94:a7b:23c5:2:3283:2:5433": 0.0.0 -> 14.4.0 source="postgres_exporter.go:1539"
--> v6 failed - Failed due to unhealthy allocations and deploying as v7
--> Troubleshooting guide at https://fly.io/docs/getting-started/troubleshooting/
Error abort
In case it helps: the cluster scale count and configured regions were not changed before attempting the leader change (scale count of four, with one leader in sea and three replicas in other regions).
[Update] I was able to partially revert this by changing the primary region in the edited fly.toml from ‘sjc’ (new) back to its original value, ‘sea’. However, the status output now doesn’t show a leader or replica designation for any instance.
After the first attempt to change region:
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
7ada4e5d app 6 ⇡ ams run running (HTTP GET htt) 3 total, 2 passing, 1 critical 0 7m57s ago
f0397d51 app 6 ⇡ ewr run running (HTTP GET htt) 3 total, 2 passing, 1 critical 0 9m12s ago
ed8a3f18 app 6 ⇡ sea run running (HTTP GET htt) 3 total, 3 passing 0 9m12s ago
a958be3a app 5 sjc run running (replica) 3 total, 3 passing 0 6h14m ago
After reverting the change:
Deployment Status
ID = 2edd9f9f-42bc-9a10-499e-b3fdd3fd0c81
Version = v10
Status = successful
Description = Deployment completed successfully
Instances = 4 desired, 4 placed, 4 healthy, 0 unhealthy
Instances
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
8d7ebfa5 app 10 sjc run running (HTTP GET htt) 3 total, 3 passing 0 3m12s ago
83289372 app 10 ams run running (HTTP GET htt) 3 total, 3 passing 0 3m16s ago
58d835e8 app 10 ewr run running (HTTP GET htt) 3 total, 3 passing 0 4m24s ago
85d26132 app 10 sea run running (HTTP GET htt) 3 total, 3 passing 0 4m24s ago
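To make the missing designations easier to spot, here is a minimal sketch that scans rows pasted from the status output above and reports which regions carry a leader or replica label. The row layout is assumed purely from the output shown in this post, and the function name roles_by_region is made up for illustration; it is not part of flyctl.

```python
import re

# Captures the parenthesised label after "running", e.g. "(replica)" or "(HTTP GET htt)".
LABEL = re.compile(r"running \(([^)]*)\)")


def roles_by_region(status_text):
    """Map region -> 'leader', 'replica', or None for each instance row.

    Assumes the row layout seen in this post:
    ID PROCESS VERSION [marker] REGION DESIRED STATUS (LABEL) ...
    """
    roles = {}
    for line in status_text.splitlines():
        # Drop the "⇡" marker that some rows carry after the version number.
        fields = [f for f in line.split() if f != "⇡"]
        match = LABEL.search(line)
        if len(fields) < 4 or fields[1] != "app" or match is None:
            continue  # skip headers and anything that is not an instance row
        label = match.group(1)
        roles[fields[3]] = label if label in ("leader", "replica") else None
    return roles


# Rows copied from the two status outputs in this post.
FIRST_ATTEMPT = """\
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
7ada4e5d app 6 ⇡ ams run running (HTTP GET htt) 3 total, 2 passing, 1 critical 0 7m57s ago
f0397d51 app 6 ⇡ ewr run running (HTTP GET htt) 3 total, 2 passing, 1 critical 0 9m12s ago
ed8a3f18 app 6 ⇡ sea run running (HTTP GET htt) 3 total, 3 passing 0 9m12s ago
a958be3a app 5 sjc run running (replica) 3 total, 3 passing 0 6h14m ago
"""

AFTER_REVERT = """\
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
8d7ebfa5 app 10 sjc run running (HTTP GET htt) 3 total, 3 passing 0 3m12s ago
83289372 app 10 ams run running (HTTP GET htt) 3 total, 3 passing 0 3m16s ago
58d835e8 app 10 ewr run running (HTTP GET htt) 3 total, 3 passing 0 4m24s ago
85d26132 app 10 sea run running (HTTP GET htt) 3 total, 3 passing 0 4m24s ago
"""

if __name__ == "__main__":
    print("after first attempt:", roles_by_region(FIRST_ATTEMPT))
    print("after revert:", roles_by_region(AFTER_REVERT))
```

Run against the two outputs above, only sjc shows a role (replica) after the first attempt, and no region shows a leader or replica label after the revert.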
I’d appreciate any help or insights. Thanks!