Postgres app stuck on the "un-mounting volume" stage

I’ve got a problem with one of my instances: I was trying to scale it up and the process got stuck.

2022-10-27T07:45:18.113 app[b73809bf] fra [info] keeper | 2022-10-27T07:45:18.111Z ERROR cmd/keeper.go:1041 error retrieving cluster data {"error": "Unexpected response code: 500"}
2022-10-27T07:46:05.477 app[b73809bf] fra [info] sentinel | 2022-10-27T07:46:05.477Z WARN cmd/sentinel.go:276 no keeper info available {"db": "3db81f17", "keeper": "23c31f35c2"}
2022-10-27T08:50:48.104 app[b73809bf] fra [info] sentinel | 2022-10-27T08:50:48.103Z WARN cmd/sentinel.go:276 no keeper info available {"db": "3db81f17", "keeper": "23c31f35c2"}
2022-10-27T08:53:16.114 app[b73809bf] fra [info] sentinel | 2022-10-27T08:53:16.113Z ERROR cmd/sentinel.go:1880 cannot update sentinel info {"error": "Unexpected response code: 500 (rpc error making call: leadership lost while committing log)"}
2022-10-27T08:57:36.084 app[b73809bf] fra [info] sentinel | 2022-10-27T08:57:36.083Z WARN cmd/sentinel.go:276 no keeper info available {"db": "3db81f17", "keeper": "23c31f35c2"}
2022-10-27T10:08:46.019 app[b73809bf] fra [info] sentinel | 2022-10-27T10:08:46.018Z WARN cmd/sentinel.go:276 no keeper info available {"db": "3db81f17", "keeper": "23c31f35c2"}
2022-10-27T10:35:17.319 app[b73809bf] fra [info] sentinel | 2022-10-27T10:35:17.318Z WARN cmd/sentinel.go:276 no keeper info available {"db": "3db81f17", "keeper": "23c31f35c2"}
2022-10-27T10:39:44.126 runner[b73809bf] fra [info] Shutting down virtual machine
2022-10-27T10:39:44.274 app[b73809bf] fra [info] Sending signal SIGTERM to main child process w/ PID 529
2022-10-27T10:39:44.312 app[b73809bf] fra [info] Got terminated, stopping
2022-10-27T10:39:44.312 app[b73809bf] fra [info] supervisor stopping
2022-10-27T10:39:44.312 app[b73809bf] fra [info] exporter | Stopping interrupt…
2022-10-27T10:39:44.312 app[b73809bf] fra [info] keeper | Stopping interrupt…
2022-10-27T10:39:44.312 app[b73809bf] fra [info] sentinel | Stopping interrupt…
2022-10-27T10:39:44.312 app[b73809bf] fra [info] proxy | Stopping interrupt…
2022-10-27T10:39:44.312 app[b73809bf] fra [info] exporter | signal: interrupt
2022-10-27T10:39:44.312 app[b73809bf] fra [info] sentinel | Process exited 0
2022-10-27T10:39:44.312 app[b73809bf] fra [info] keeper | 2022-10-27 10:39:44.310 UTC [600] LOG: received fast shutdown request
2022-10-27T10:39:44.315 app[b73809bf] fra [info] proxy | [NOTICE] 299/103944 (548) : haproxy version is 2.2.9-2+deb11u3
2022-10-27T10:39:44.315 app[b73809bf] fra [info] proxy | [NOTICE] 299/103944 (548) : path to executable is /usr/sbin/haproxy
2022-10-27T10:39:44.315 app[b73809bf] fra [info] proxy | [ALERT] 299/103944 (548) : Current worker #1 (570) exited with code 130 (Interrupt)
2022-10-27T10:39:44.315 app[b73809bf] fra [info] proxy | [WARNING] 299/103944 (548) : All workers exited. Exiting… (130)
2022-10-27T10:39:44.318 app[b73809bf] fra [info] keeper | waiting for server to shut down…2022-10-27 10:39:44.317 UTC [600] LOG: aborting any active transactions
2022-10-27T10:39:44.324 app[b73809bf] fra [info] keeper | 2022-10-27 10:39:44.324 UTC [600] LOG: background worker “logical replication launcher” (PID 608) exited with exit code 1
2022-10-27T10:39:44.328 app[b73809bf] fra [info] keeper | 2022-10-27 10:39:44.324 UTC [602] LOG: shutting down
2022-10-27T10:39:44.330 app[b73809bf] fra [info] proxy | exit status 130
2022-10-27T10:39:44.380 app[b73809bf] fra [info] keeper | 2022-10-27 10:39:44.380 UTC [600] LOG: database system is shut down
2022-10-27T10:39:44.415 app[b73809bf] fra [info] keeper | done
2022-10-27T10:39:44.415 app[b73809bf] fra [info] keeper | server stopped
2022-10-27T10:39:44.424 app[b73809bf] fra [info] keeper | Process exited 0
2022-10-27T10:39:45.279 app[b73809bf] fra [info] Starting clean up.
2022-10-27T10:39:45.294 app[b73809bf] fra [info] Umounting /dev/vdc from /data

As for the e-mail response from support@fly.io, I find it completely disappointing and frustrating to have no support for platform bugs. I’m relying on the free/hobby plan to decide whether or not to spend >$3k/y (my current bill grows each month), and this doesn’t help at all.

“This is an unmonitored support mailbox. We offer two kinds of technical support:”

Hello @Bohdan, it looks like you accidentally deleted your volume, and since this was a single-instance cluster it could not fail over. Fortunately, you can restore the volume from snapshots, but you’ll have to create another Postgres app and pass one of your most recent snapshots as the source.

  1. Get a list of the most recent snapshots of your deleted volume:
    flyctl volumes snapshots list <deleted_volume_id>
  2. Choose the most recent snapshot and use it to create a new Postgres cluster (see the example after the notes below):
    fly pg create --snapshot-id <snapshot_id>
  3. I recommend creating a highly available cluster with at least 2 nodes for safety.

Note:

  • There will be a window of data loss, because we take volume snapshots automatically every 24 hours.
  • The latest flyctl version launches Postgres on the new Machines platform, and I recommend you use that; but if you want to keep things as close as possible to how they used to be, pass the --nomad flag to fly pg create.
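
For reference, with hypothetical IDs filled in, the two commands from the steps above would look roughly like this (replace the IDs and the app name with your own):

    # List snapshots for the deleted volume (hypothetical volume ID)
    fly volumes snapshots list vol_0000000000000000

    # Create a new Postgres app in the same region from the newest snapshot
    # (snapshot ID and app name here are placeholders)
    fly pg create --snapshot-id vs_0000000000000000 --name my-pg-restored --region fra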

Hi Rugwiro,

Thanks a lot for an answer.

Unfortunately, the only thing I did was try to scale the instance. I wasn’t interacting with flyctl at all, so there is zero chance I did something wrong (especially to the volumes). This is why I call it a “platform bug”.

I tried to resize the instance using Web UI, nothing more.

Also, both the web UI and the CLI show that the volume still exists.

└──── $ fly volumes list -a ess-dev-pg
ID                    STATE    NAME     SIZE  REGION  ZONE  ENCRYPTED  ATTACHED VM  CREATED AT
vol_g2yxp4my571463qd  created  pg_data  5GB   fra     d1d3  false                   1 month ago

Regards,

There was an issue with your app’s host, but it seems to have been resolved while I was mid-sentence into my follow-up response. Your Postgres should be back now.

Yeah, it seems to be up now. Wonderful magic. Are there any action items that could help Fly prevent this from happening in the future? I don’t want my apps to experience unexpected outages because of this.

Thank you for helping me. I’ll definitely take note of your snapshot-related snippets when I get to setting up backups and recovery.

I believe you are running a single node Postgres. This is fine for development, but you should expect downtime (hardware fails, networks fail, etc). For maximum reliability, you need to configure additional Postgres nodes. If you’d had 2+ Postgres nodes in this cluster, you would not have experienced downtime.
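
As a rough sketch, assuming the app is still on the Nomad-based Postgres, adding a replica comes down to creating a second volume with the same name and scaling the instance count (the app name, region, and size below are taken from your volume listing above):

    # Create a second pg_data volume in the same region so a replica has storage to attach to
    fly volumes create pg_data --region fra --size 5 -a ess-dev-pg

    # Scale the cluster to two instances; the new one picks up the new volume and joins as a replica
    fly scale count 2 -a ess-dev-pg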

Yeah, that’s a totally valid point. We’ll definitely launch a fail-safe setup for production use.

Thank you. Have a great day 🙂