We have a database cluster set up, and after deploying our app about an hour ago our database was reset to the state it was in over 48 hours ago. Any idea how this could have happened?
Edit: just to be clear, I understand that your Postgres support is in beta.
Can you give us more details on what you’re seeing? The only way I can imagine data reverting on a Postgres setup like this is if the primary node broke somehow, replication to the secondary wasn’t caught up, and a failover then promoted the stale secondary to primary.
You should check the status of your postgres VMs:
fly status -a <db-name> --all
Do you see any that failed? You can run fly vm status <id> -a <db-name> to possibly see why.
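If the leader is still reachable, you can also check replication directly from Postgres. Something like this (assuming you can connect with psql using your connection string; the column names are for Postgres 10+) shows how far behind each replica is:
psql <connection-string> -c "SELECT client_addr, state, sent_lsn, replay_lsn, pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes FROM pg_stat_replication;"
A large replay_lag_bytes on a replica (or no rows at all) would mean replication wasn’t keeping up.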
You can also restore snapshots to compare both versions:
fly volumes list -a <db-name>
fly volumes snapshots list <volume-id>
fly pg create --snapshot-id <snapshot-id>
It would be worth comparing snapshots from both your volumes to see if they have the same data.
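For example, once a snapshot is restored into a throwaway app, a quick way to compare is to check the newest row in a table you know gets frequent writes (the table name is a placeholder, and this assumes your tables have an updated_at column):
psql <restored-db-url> -c "SELECT max(updated_at) FROM <table>;"
If the two snapshots give very different answers, one node was serving stale data.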
$ fly status -a production-db --all
App
Name = production-db
Owner = enaia
Version = 5
Status = running
Hostname = production-db.fly.dev
Instances
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
3eda45a3 app 5 iad run running (replica) 3 total, 3 passing 0 1h14m ago
c11ee22a app 5 iad run running (leader) 3 total, 3 passing 0 1h21m ago
5e662043 app 5 iad stop failed 3 total, 1 passing, 2 critical 4 1h38m ago
df175262 app 5 iad stop failed 3 total, 1 passing, 2 critical 4 2h3m ago
7059760b app 5 iad stop failed 3 total, 1 passing, 2 critical 4 2h27m ago
7d437e7d app 5 iad stop failed 3 total, 1 passing, 2 critical 4 2h51m ago
09ea9763 app 5 iad stop failed 3 total, 1 passing, 2 critical 4 3h16m ago
a8e27476 app 5 iad stop failed 3 total, 1 passing, 2 critical 4 3h40m ago
fc3caa1e app 5 iad stop failed 3 total, 1 passing, 2 critical 4 4h5m ago
53a6f1eb app 5 iad stop failed 3 total, 1 passing, 2 critical 4 4h29m ago
fc86da43 app 5 iad stop failed 3 total, 1 passing, 2 critical 4 4h54m ago
[...]
$ fly vm status 5e662043 -a production-db
Update available 0.0.256 -> v0.0.260.
Run "fly version update" to upgrade.
Instance
ID = 5e662043
Process =
Version = 5
Region = iad
Desired = stop
Status = failed
Health Checks = 3 total, 1 passing, 2 critical
Restarts = 4
Created = 1h44m ago
Recent Events
TIMESTAMP TYPE MESSAGE
2021-12-02T19:32:12Z Received Task received by client
2021-12-02T19:32:12Z Task Setup Building Task Directory
2021-12-02T19:32:13Z Started Task started by client
2021-12-02T19:34:00Z Restart Signaled healthcheck: check "pg" unhealthy
2021-12-02T19:34:01Z Terminated Exit Code: 0
2021-12-02T19:34:02Z Restarting Task restarting in 1.24606516s
2021-12-02T19:34:08Z Started Task started by client
2021-12-02T19:35:53Z Restart Signaled healthcheck: check "role" unhealthy
2021-12-02T19:35:56Z Terminated Exit Code: 0
2021-12-02T19:35:56Z Restarting Task restarting in 1.006636038s
2021-12-02T19:36:02Z Started Task started by client
2021-12-02T19:37:50Z Restart Signaled healthcheck: check "role" unhealthy
2021-12-02T19:37:52Z Terminated Exit Code: 0
2021-12-02T19:37:52Z Restarting Task restarting in 1.066671546s
2021-12-02T19:37:57Z Started Task started by client
2021-12-02T19:39:44Z Restart Signaled healthcheck: check "pg" unhealthy
2021-12-02T19:39:45Z Terminated Exit Code: 0
2021-12-02T19:39:46Z Restarting Task restarting in 1.1424448s
2021-12-02T19:39:51Z Started Task started by client
2021-12-02T19:41:37Z Restart Signaled healthcheck: check "pg" unhealthy
2021-12-02T19:41:39Z Terminated Exit Code: 0
2021-12-02T19:41:40Z Not Restarting Exceeded allowed attempts 2 in interval 5m0s and mode is "fail"
2021-12-02T19:41:40Z Killing Sent interrupt. Waiting 5m0s before force killing
Checks
ID SERVICE STATE OUTPUT
vm app passing HTTP GET http://172.19.2.146:5500/flycheck/vm: 200 OK Output: "[✓] checkDisk: 9.15 GB (93.6%) free space on /data/ (67.78µs)\n[✓] checkLoad: load averages: 0.00 0.00 0.00 (71.46µs)\n[✓] memory: system spent 0s of the last 60s waiting on memory (42.04µs)\n[✓] cpu: system spent 72ms of the last 60s waiting on cpu (27.69µs)\n[✓] io: system spent 168ms of the last 60s waiting on io (23.3µs)"
role app critical failed to connect to local node: context deadline exceeded
pg app critical HTTP GET http://172.19.2.146:5500/flycheck/pg: 500 Internal Server Error Output: "failed to connect with local node: context deadline exceeded"
Recent Logs
That VM is crashing quite frequently, apparently because the health checks can’t connect to the database, and it looks like this has been happening for much longer than a few hours.
Will you please give us more information about what you’re seeing that indicates the data is 48 hours old? And try recovering those snapshots to see if they have more recent data.
I’m also not in a position to restore from the snapshots right now. I’m waiting for a demo to finish, and I have a more recent manual backup that I plan to restore once I’m able to.
Which table is that? We’ll have a look at snapshots. Can you look at your manual backup and give us a table + ID to look for that includes data you’re missing?
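For example, something like this run against your manual backup would give us a concrete row to search for (pick whichever table you know is missing recent data):
psql <backup-db-url> -c "SELECT id, updated_at FROM <table> ORDER BY updated_at DESC LIMIT 5;"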
The most recent snapshots on your DB are 23 hours old, so if one of the nodes was lagging it should show up in those.
OK, it does look like one of the machines hadn’t been replicating properly for quite a while. The most recent updated_at in the good snapshot is 2021-12-01 21:23:34; the most recent updated_at in the other is 2021-11-30 17:32:57. So it seems the node that was primary failed sometime today, and the out-of-date replica took over.
We’ve changed your database to not restart when health checks fail. This might mean it stops responding, but it should make it obvious what’s wrecking the VMs.
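In the meantime you can keep an eye on what the VM is doing with:
fly logs -a production-db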