Postgres down, can't access machines, volumes and snapshots

Just a few hours ago my database suddenly stopped working:

2024-08-15T00:10:06.502 app[9080e966a61278] hkg [info] monitor | Voting member(s): 1, Active: 1, Inactive: 0, Conflicts: 0

2024-08-15T00:10:06.502 app[9080e966a61278] hkg [info] proxy | [WARNING] (399) : Backup Server bk_db/pg is DOWN, reason: Layer7 timeout, check duration: 5051ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.

2024-08-15T00:10:06.502 app[9080e966a61278] hkg [info] repmgrd | [2024-08-14 23:45:45] [WARNING] unable to ping "host=fdaa:1:cdfa:a7b:a8:27e9:ffbf:2 port=5433 user=repmgr dbname=repmgr connect_timeout=5"

2024-08-15T00:10:06.502 app[9080e966a61278] hkg [info] repmgrd | [2024-08-14 23:45:45] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"

I checked the machine and got this:

ID             	NAME             	STATE  	CHECKS	REGION	ROLE	IMAGE                                	IP ADDRESS                      	VOLUME	CREATED             	LAST UPDATED        	PROCESS GROUP	SIZE
9080e966a61278*	damp-grass-4225  	started	      	hkg   	    	                                     	                                	      	                    	                    	             	                   	

* These Machines' hosts could not be reached.

And the volumes:

❯ fly volumes list -a mediarumu-pg
ID                   	STATE  	NAME   	SIZE	REGION	ZONE	ENCRYPTED	ATTACHED VM	CREATED AT
vol_g67340k8lzmvydxw*	created	pg_data	2GB 	hkg   	be08	true     	           	1 year ago	

* These volumes' hosts could not be reached.

And when I tried to list the snapshots, it returned a 504:

❯ fly volumes snapshots list vol_g67340k8lzmvydxw

Error: failed retrieving snapshots: failed to get volume vol_g67340k8lzmvydxw snapshots: request returned non-2xx status, 504 (Request ID: 01J59XYGC7JX8EF8G67VCAY75X-sin)

Since my snapshots are not accessible, I can't recreate the database.

How do I recover my snapshots?
Where does Fly.io store the snapshots?
Is this related to this Fly.io Status - Support Platform Migration?

UPDATE

I only got the notification hours after the first incident; this could be better. I asked email support but have had no reply yet :man_shrugging:
I got lucky: my problematic db came back up for a couple of minutes, and I managed to dump the data to my local machine before it went down again. I have now created another db cluster in another region with Tigris backup enabled.
You should not rely on snapshots alone. I learnt that the hard way today :frowning:
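
For anyone in the same spot, here is a rough sketch of one way to take a local dump while the database is reachable. It is not necessarily what was done in this thread and it is not an official Fly procedure. The assumptions: `fly` and `pg_dump` are installed locally, the default `postgres` user and a database named `postgres` are used, the password is supplied through `PGPASSWORD`, and local port 15432 is free. The `dump_name` and `backup` helpers are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: dump a Fly Postgres database to a local file through `fly proxy`.
set -euo pipefail

APP="${APP:-mediarumu-pg}"          # Postgres app name taken from the post
DB="${DB:-postgres}"                # assumption: database to dump
DB_USER="${DB_USER:-postgres}"      # assumption: default superuser
LOCAL_PORT="${LOCAL_PORT:-15432}"   # assumption: any free local port

# Build a dump file name from the app name and a timestamp.
dump_name() {
  printf '%s-%s.dump' "$1" "$2"
}

backup() {
  # Forward a local port to port 5432 of the Postgres app.
  fly proxy "${LOCAL_PORT}:5432" -a "$APP" &
  proxy_pid=$!
  # Stop the proxy when the script exits, whether the dump succeeded or not.
  trap 'kill "$proxy_pid" 2>/dev/null || true' EXIT
  sleep 3   # crude wait for the proxy to start listening

  # Custom-format dump, restorable later with pg_restore.
  # pg_dump reads the password from PGPASSWORD if it is set.
  pg_dump --host=localhost --port="$LOCAL_PORT" \
    --username="$DB_USER" --dbname="$DB" \
    --format=custom \
    --file="$(dump_name "$APP" "$(date -u +%Y%m%dT%H%M%SZ)")"
}

# Only touch the network when explicitly asked to.
if [ "${1:-}" = "--run" ]; then
  backup
fi
```

Saved as, say, `pg-backup.sh`, it would be run as `PGPASSWORD=... ./pg-backup.sh --run`. Keeping a copy like this outside the platform means a host outage that takes the volume and its snapshots offline does not also take the only backup with it.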

same issue
