PRODUCTION FAILED: Scaling to dedicated server failing.

After upgrading from a shared VM to a dedicated one, our PostgreSQL cluster is failing.

Below is the health check output:

  NAME | STATUS   | ALLOCATION | REGION | TYPE | LAST UPDATED | OUTPUT
-------*----------*------------*--------*------*--------------*----------------------------------------------------------------------
  vm   | passing  | 29b684c6   | ams    | HTTP | 1s ago       | HTTP GET http://*****5500/flycheck/vm: 200 OK Output:
       |          |            |        |      |              | "[✓] checkDisk: 8.19 GB (83.7%) free space on /data/ (30.94µs)
       |          |            |        |      |              | [✓] checkLoad: load averages: 0.00 0.00 0.00 (51.64µs)
       |          |            |        |      |              | [✓] memory: system spent 0s of the last 60s waiting on memory (25.73µs)
       |          |            |        |      |              | [✓] cpu: system spent 252ms of the last 60s waiting on cpu (14.89µs)
       |          |            |        |      |              | [✓] io: system spent 0s of the last 60s waiting on io (15.06µs)"
  role | passing  | 29b684c6   | ams    | HTTP | 8m30s ago    | replica
  pg   | critical | 29b684c6   | ams    | HTTP | 8m35s ago    | HTTP GET http://****:5500/flycheck/pg: 500 Internal Server Error
       |          |            |        |      |              | Output: "failed to connect to proxy: context deadline exceeded"
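
For reference, the table above is the flyctl health check listing; the command below is roughly what we ran to get it (exact flags may differ between flyctl versions):

  fly checks list -a iftrue-pg-prod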

And we see the following messages in the logs:

ams [info] sentinel | 2022-03-30T21:14:01.966Z ERROR cmd/sentinel.go:1009 no eligible masters


2022-03-30T21:17:43Z app[29b684c6] ams [info]sentinel | 2022-03-30T21:17:43.245Z	ERROR	cmd/sentinel.go:1009	no eligible masters
2022-03-30T21:17:47Z app[29b684c6] ams [info]keeper   | 2022-03-30T21:17:47.326Z	INFO	cmd/keeper.go:1557	our db requested role is standby	{"followedDB": "c892ddc1"}
2022-03-30T21:17:47Z app[29b684c6] ams [info]keeper   | 2022-03-30T21:17:47.326Z	INFO	cmd/keeper.go:1576	already standby
2022-03-30T21:17:47Z app[29b684c6] ams [info]keeper   | 2022-03-30T21:17:47.341Z	INFO	cmd/keeper.go:1676	postgres parameters not changed
2022-03-30T21:17:47Z app[29b684c6] ams [info]keeper   | 2022-03-30T21:17:47.341Z	INFO	cmd/keeper.go:1703	postgres hba entries not changed
2022-03-30T21:17:48Z app[29b684c6] ams [info]sentinel | 2022-03-30T21:17:48.337Z	WARN	cmd/sentinel.go:276	no keeper info available	{"db": "c78da426", "keeper": "23c20596c2"}
2022-03-30T21:17:48Z app[29b684c6] ams [info]sentinel | 2022-03-30T21:17:48.337Z	WARN	cmd/sentinel.go:276	no keeper info available	{"db": "c892ddc1", "keeper": "23c80596d2"}
2022-03-30T21:17:48Z app[29b684c6] ams [info]sentinel | 2022-03-30T21:17:48.339Z	INFO	cmd/sentinel.go:995	master db is failed	{"db": "c892ddc1", "keeper": "23c80596d2"}
2022-03-30T21:17:48Z app[29b684c6] ams [info]sentinel | 2022-03-30T21:17:48.339Z	INFO	cmd/sentinel.go:1006	trying to find a new master to replace failed master
2022-03-30T21:17:48Z app[29b684c6] ams [info]sentinel | 2022-03-30T21:17:48.339Z	INFO	cmd/sentinel.go:451	ignoring keeper since its behind that maximum xlog position	{"db": "07ae4362", "dbXLogPos": 1493759680, "masterXLogPos": 2632389752}
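
The lines above are from the app's log stream, captured with something like:

  fly logs -a iftrue-pg-prod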

This should be fixed now.

Your Postgres DB had three available volumes, but was set to run only one instance. When the instance booted at the new size, it came up attached to a different volume.
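
If you want to see this yourself, you can compare the volumes against the configured scale; these are standard flyctl commands, and the annotations are just what to look for:

  fly volumes list -a iftrue-pg-prod   # lists all three volumes and which instance, if any, each is attached to
  fly scale show -a iftrue-pg-prod     # shows the VM size and instance count the app is configured for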

First: will you please ensure the data is correct? This shouldn’t have lost data, but I would like you to check anyway.
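
A quick way to sanity-check, just as a sketch (the table names are yours, so substitute your own):

  fly postgres connect -a iftrue-pg-prod
  # then, inside psql, for example:
  #   SELECT pg_is_in_recovery();          -- should return f on the current primary
  #   \dt                                  -- confirm the expected tables are present
  #   SELECT count(*) FROM <your_table>;   -- spot-check row counts against what you expect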

Second: do you know how the cluster got into this state? Do you remember adding volumes and scaling down by chance?

We basically ran the following command: fly scale vm dedicated-cpu-1x -a iftrue-pg-prod, because we were on a small VM. Previously we were running a cluster, but we got rid of it; it seems that somehow scaling the VM caused it to use an invalid volume.
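
For anyone else hitting this: next time we will verify the volume attachment right after scaling, something along these lines (a sketch, not exactly what we ran):

  fly scale vm dedicated-cpu-1x -a iftrue-pg-prod
  fly volumes list -a iftrue-pg-prod   # confirm the new VM attached to the volume that actually holds the data
  fly checks list -a iftrue-pg-prod    # wait for the pg check to return to passing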