Just a few hours ago my database suddenly stopped working:
2024-08-15T00:10:06.502 app[9080e966a61278] hkg [info] monitor | Voting member(s): 1, Active: 1, Inactive: 0, Conflicts: 0
2024-08-15T00:10:06.502 app[9080e966a61278] hkg [info] proxy | [WARNING] (399) : Backup Server bk_db/pg is DOWN, reason: Layer7 timeout, check duration: 5051ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2024-08-15T00:10:06.502 app[9080e966a61278] hkg [info] repmgrd | [2024-08-14 23:45:45] [WARNING] unable to ping "host=fdaa:1:cdfa:a7b:a8:27e9:ffbf:2 port=5433 user=repmgr dbname=repmgr connect_timeout=5"
2024-08-15T00:10:06.502 app[9080e966a61278] hkg [info] repmgrd | [2024-08-14 23:45:45] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
I checked the machine and got this:
ID NAME STATE CHECKS REGION ROLE IMAGE IP ADDRESS VOLUME CREATED LAST UPDATED PROCESS GROUP SIZE
9080e966a61278* damp-grass-4225 started hkg
* These Machines' hosts could not be reached.
Volumes:
❯ fly volumes list -a mediarumu-pg
ID STATE NAME SIZE REGION ZONE ENCRYPTED ATTACHED VM CREATED AT
vol_g67340k8lzmvydxw* created pg_data 2GB hkg be08 true 1 year ago
* These volumes' hosts could not be reached.
And when I tried to list all the snapshots, it returned a 504:
❯ fly volumes snapshots list vol_g67340k8lzmvydxw
Error: failed retrieving snapshots: failed to get volume vol_g67340k8lzmvydxw snapshots: request returned non-2xx status, 504 (Request ID: 01J59XYGC7JX8EF8G67VCAY75X-sin)
Since my snapshots are not accessible, I can't recreate the database.
How do I recover my snapshot?
Where does Fly.io store the snapshots?
Is it related to this: Fly.io Status - Support Platform Migration?
UPDATE
I just got the notification, hours after the first incident. This could be better. I asked email support but have had no reply yet.
I got lucky: my problematic DB came back up for a couple of minutes, and I managed to dump the data to my local machine before it went down again. I have now created another DB cluster in another region with Tigris backups enabled.
You should not rely on snapshots alone. I learnt that the hard way today.
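For anyone in the same situation, the "dump to local while it is briefly up" step can be sketched roughly like this. This is only a sketch of one possible approach, not necessarily the exact commands I ran: it assumes flyctl and pg_dump are installed locally, and the function names, the default user "postgres", the local port 15432 and the database name are all placeholders you would need to adjust. pg_dump takes the password from PGPASSWORD or ~/.pgpass.

```shell
#!/usr/bin/env bash
# Sketch only: function names, default user and local port are assumptions.

# Build a timestamped file name for a dump, e.g. mydb-20240815T001006Z.dump
dump_filename() {
  local db="$1"
  local stamp="${2:-$(date -u +%Y%m%dT%H%M%SZ)}"
  printf '%s-%s.dump\n' "$db" "$stamp"
}

# Forward a local port to the Postgres app, dump one database, stop the proxy.
dump_via_proxy() {
  local app="$1"
  local db="$2"
  local user="${3:-postgres}"   # assumed default user
  local port="${4:-15432}"      # assumed free local port
  local out
  out="$(dump_filename "$db")"

  # Open a tunnel from localhost:$port to port 5432 of the app.
  fly proxy "${port}:5432" -a "$app" &
  local proxy_pid=$!
  sleep 3   # crude wait for the tunnel to come up

  # Custom-format dump, restorable later with pg_restore.
  pg_dump --host=localhost --port="$port" --username="$user" \
          --format=custom --file="$out" "$db"
  local status=$?

  kill "$proxy_pid" 2>/dev/null
  return "$status"
}
```

Usage would be something like `dump_via_proxy mediarumu-pg mydb`, where `mydb` is a placeholder for the real database name. Keeping the dump in custom format means it can be restored into a new cluster in another region with pg_restore.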