Postgres database unavailable in two apps

I have two Postgres database apps that have stopped working at an unknown time.

The two apps are “sparkling-voice-5768” and “nexbibli-db”. They are attached to “billowing-bush-8831” and “nexbibli” respectively.

Both databases haven’t been touched in months. I noticed the issue after a deploy to “billowing-bush-8831”, but “nexbibli” hasn’t been touched either for 2 months and was definitely working a few weeks back.

Here are the logs from “sparkling-voice-5768” after a restart:

Restart logs

2022-05-28T14:35:10Z runner[acea18ef] yyz [info]Starting instance
2022-05-28T14:35:10Z runner[acea18ef] yyz [info]Configuring virtual machine
2022-05-28T14:35:10Z runner[acea18ef] yyz [info]Pulling container image
2022-05-28T14:35:11Z runner[acea18ef] yyz [info]Unpacking image
2022-05-28T14:35:11Z runner[acea18ef] yyz [info]Preparing kernel init
2022-05-28T14:35:11Z runner[acea18ef] yyz [info]Setting up volume ‘pg_data’
2022-05-28T14:35:14Z runner[acea18ef] yyz [info]Configuring firecracker
2022-05-28T14:35:14Z runner[acea18ef] yyz [info]Starting virtual machine
2022-05-28T14:35:14Z app[acea18ef] yyz [info]Starting init (commit: aa54f7d)…
2022-05-28T14:35:14Z app[acea18ef] yyz [info]Mounting /dev/vdc at /data w/ uid: 0, gid: 0 and chmod 0755
2022-05-28T14:35:14Z app[acea18ef] yyz [info]Preparing to run: docker-entrypoint.sh start as root
2022-05-28T14:35:14Z app[acea18ef] yyz [info]2022/05/28 14:35:14 listening on [fdaa:0:316c:a7b:aa4:0:3428:2]:22 (DNS: [fdaa::3]:53)
2022-05-28T14:35:14Z app[acea18ef] yyz [info]cluster spec filename /fly/cluster-spec.json
2022-05-28T14:35:14Z app[acea18ef] yyz [info]system memory: 256mb vcpu count: 1
2022-05-28T14:35:14Z app[acea18ef] yyz [info]{
2022-05-28T14:35:14Z app[acea18ef] yyz [info] “initMode”: “new”,
2022-05-28T14:35:14Z app[acea18ef] yyz [info] “pgParameters”: {
2022-05-28T14:35:14Z app[acea18ef] yyz [info] “effective_cache_size”: “192MB”,
2022-05-28T14:35:14Z app[acea18ef] yyz [info] “effective_io_concurrency”: “200”,
2022-05-28T14:35:14Z app[acea18ef] yyz [info] “maintenance_work_mem”: “64MB”,
2022-05-28T14:35:14Z app[acea18ef] yyz [info] “max_connections”: “300”,
2022-05-28T14:35:14Z app[acea18ef] yyz [info] “max_parallel_workers”: “8”,
2022-05-28T14:35:14Z app[acea18ef] yyz [info] “max_parallel_workers_per_gather”: “2”,
2022-05-28T14:35:14Z app[acea18ef] yyz [info] “max_worker_processes”: “8”,
2022-05-28T14:35:14Z app[acea18ef] yyz [info] “random_page_cost”: “1.1”,
2022-05-28T14:35:14Z app[acea18ef] yyz [info] “shared_buffers”: “64MB”,
2022-05-28T14:35:14Z app[acea18ef] yyz [info] “work_mem”: “4MB”
2022-05-28T14:35:14Z app[acea18ef] yyz [info] },
2022-05-28T14:35:14Z app[acea18ef] yyz [info] “maxStandbysPerSender”: 50
2022-05-28T14:35:14Z app[acea18ef] yyz [info]}
2022-05-28T14:35:14Z app[acea18ef] yyz [info]generated new config
2022-05-28T14:35:14Z app[acea18ef] yyz [info]Listening on port 5500keeper | Running…
2022-05-28T14:35:14Z app[acea18ef] yyz [info]sentinel | Running…
2022-05-28T14:35:14Z app[acea18ef] yyz [info]exporter | Running…
2022-05-28T14:35:14Z app[acea18ef] yyz [info]proxy | Running…
2022-05-28T14:35:14Z app[acea18ef] yyz [info]exporter | INFO[0000] Starting Server: :9187 source=“postgres_exporter.go:1837”
2022-05-28T14:35:15Z app[acea18ef] yyz [info]sentinel | 2022-05-28T14:35:15.215Z INFO cmd/sentinel.go:2000 sentinel uid {“uid”: “64588e24”}
2022-05-28T14:35:15Z app[acea18ef] yyz [info]keeper | 2022-05-28T14:35:15.217Z INFO cmd/keeper.go:2091 exclusive lock on data dir taken
2022-05-28T14:35:15Z app[acea18ef] yyz [info]proxy | 2022-05-28T14:35:15.217Z INFO cmd/proxy.go:419 proxy uid {“uid”: “b568e9ae”}
2022-05-28T14:35:15Z app[acea18ef] yyz [info]checking stolon status
2022-05-28T14:35:25Z runner[61b69677] yyz [info]Starting instance
2022-05-28T14:35:25Z runner[61b69677] yyz [info]Configuring virtual machine
2022-05-28T14:35:25Z runner[61b69677] yyz [info]Pulling container image
2022-05-28T14:35:26Z runner[61b69677] yyz [info]Unpacking image
2022-05-28T14:35:26Z runner[61b69677] yyz [info]Preparing kernel init
2022-05-28T14:35:26Z runner[61b69677] yyz [info]Setting up volume ‘pg_data’
2022-05-28T14:35:26Z app[acea18ef] yyz [info]proxy | {“level”:“warn”,“ts”:“2022-05-28T14:35:26.782Z”,“logger”:“etcd-client”,“caller”:“v3@v3.5.0/retry_interceptor.go:62”,“msg”:“retrying of unary invoker failed”,“target”:“etcd-endpoints://0xc0002bfdc0/#initially=[etcd-na.fly-shared.net:443]”,“attempt”:0,“error”:“rpc error: code = Unavailable desc = etcdserver: request timed out”}
2022-05-28T14:35:26Z app[acea18ef] yyz [info]sentinel | {“level”:“warn”,“ts”:“2022-05-28T14:35:26.780Z”,“logger”:“etcd-client”,“caller”:“v3@v3.5.0/retry_interceptor.go:62”,“msg”:“retrying of unary invoker failed”,“target”:“etcd-endpoints://0xc0002bfdc0/#initially=[etcd-na.fly-shared.net:443]”,“attempt”:0,“error”:“rpc error: code = Unavailable desc = etcdserver: request timed out”}
2022-05-28T14:35:26Z app[acea18ef] yyz [info]sentinel | 2022-05-28T14:35:26.784Z FATAL cmd/sentinel.go:2021 cannot create sentinel: cannot create store: cannot create kv store: etcdserver: request timed out
2022-05-28T14:35:26Z app[acea18ef] yyz [info]proxy | 2022-05-28T14:35:26.784Z FATAL cmd/proxy.go:433 cannot create cluster checker: cannot create store: cannot create kv store: etcdserver: request timed out
2022-05-28T14:35:26Z app[acea18ef] yyz [info]panic: error checking stolon status: {“level”:“warn”,“ts”:“2022-05-28T14:35:26.781Z”,“logger”:“etcd-client”,“caller”:“v3@v3.5.0/retry_interceptor.go:62”,“msg”:“retrying of unary invoker failed”,“target”:“etcd-endpoints://0xc0004281c0/#initially=[etcd-na.fly-shared.net:443]”,“attempt”:0,“error”:“rpc error: code = Unavailable desc = etcdserver: request timed out”}
2022-05-28T14:35:26Z app[acea18ef] yyz [info]cannot create kv store: etcdserver: request timed out
2022-05-28T14:35:26Z app[acea18ef] yyz [info]: exit status 1
2022-05-28T14:35:26Z app[acea18ef] yyz [info]main.main.func1nning]:
2022-05-28T14:35:26Z app[acea18ef] yyz [info] ain.main.func1(0xc0000d0000)
2022-05-28T14:35:26Z app[acea18ef] yyz [info] /go/src/github.com/fly-examples/postgres-ha/cmd/start/main.go:60 +0x5fa
2022-05-28T14:35:26Z app[acea18ef] yyz [info]created by main.main
2022-05-28T14:35:26Z app[acea18ef] yyz [info] /go/src/github.com/fly-examples/postgres-ha/cmd/start/main.go:51 +0x297
2022-05-28T14:35:27Z app[acea18ef] yyz [info]Main child exited normally with code: 2
2022-05-28T14:35:27Z app[acea18ef] yyz [info]Reaped child process with pid: 536 and signal: SIGHUP, core dumped? false
2022-05-28T14:35:27Z app[acea18ef] yyz [info]Reaped child process with pid: 541 and signal: SIGHUP, core dumped? false
2022-05-28T14:35:27Z app[acea18ef] yyz [info]Reaped child process with pid: 544, exit code: 1
2022-05-28T14:35:27Z app[acea18ef] yyz [info]Reaped child process with pid: 534, exit code: 1
2022-05-28T14:35:27Z app[acea18ef] yyz [info]Reaped child process with pid: 532 and signal: SIGKILL, core dumped? false
2022-05-28T14:35:27Z app[acea18ef] yyz [info]Reaped child process with pid: 540 and signal: SIGKILL, core dumped? false
2022-05-28T14:35:27Z app[acea18ef] yyz [info]Starting clean up.
2022-05-28T14:35:27Z app[acea18ef] yyz [info]Umounting /dev/vdc from /data
2022-05-28T14:35:29Z runner[61b69677] yyz [info]Configuring firecracker
2022-05-28T14:35:29Z runner[61b69677] yyz [info]Starting virtual machine
2022-05-28T14:35:29Z app[61b69677] yyz [info]Starting init (commit: aa54f7d)…
2022-05-28T14:35:29Z app[61b69677] yyz [info]Mounting /dev/vdc at /data w/ uid: 0, gid: 0 and chmod 0755
2022-05-28T14:35:29Z app[61b69677] yyz [info]Preparing to run: docker-entrypoint.sh start as root
2022-05-28T14:35:29Z app[61b69677] yyz [info]2022/05/28 14:35:29 listening on [fdaa:0:316c:a7b:aa2:0:3427:2]:22 (DNS: [fdaa::3]:53)
2022-05-28T14:35:29Z app[61b69677] yyz [info]cluster spec filename /fly/cluster-spec.json
2022-05-28T14:35:29Z app[61b69677] yyz [info]system memory: 256mb vcpu count: 1
2022-05-28T14:35:29Z app[61b69677] yyz [info]{
2022-05-28T14:35:29Z app[61b69677] yyz [info] “initMode”: “new”,
2022-05-28T14:35:29Z app[61b69677] yyz [info] “pgParameters”: {
2022-05-28T14:35:29Z app[61b69677] yyz [info] “effective_cache_size”: “192MB”,
2022-05-28T14:35:29Z app[61b69677] yyz [info] “effective_io_concurrency”: “200”,
2022-05-28T14:35:29Z app[61b69677] yyz [info] “maintenance_work_mem”: “64MB”,
2022-05-28T14:35:29Z app[61b69677] yyz [info] “max_connections”: “300”,
2022-05-28T14:35:29Z app[61b69677] yyz [info] “max_parallel_workers”: “8”,
2022-05-28T14:35:29Z app[61b69677] yyz [info] “max_parallel_workers_per_gather”: “2”,
2022-05-28T14:35:29Z app[61b69677] yyz [info] “max_worker_processes”: “8”,
2022-05-28T14:35:29Z app[61b69677] yyz [info] “random_page_cost”: “1.1”,
2022-05-28T14:35:29Z app[61b69677] yyz [info] “shared_buffers”: “64MB”,
2022-05-28T14:35:29Z app[61b69677] yyz [info] “work_mem”: “4MB”
2022-05-28T14:35:29Z app[61b69677] yyz [info] },
2022-05-28T14:35:29Z app[61b69677] yyz [info] “maxStandbysPerSender”: 50
2022-05-28T14:35:29Z app[61b69677] yyz [info]}
2022-05-28T14:35:29Z app[61b69677] yyz [info]generated new config
2022-05-28T14:35:29Z app[61b69677] yyz [info]Listening on port 5500keeper | Running…
2022-05-28T14:35:29Z app[61b69677] yyz [info]sentinel | Running…
2022-05-28T14:35:29Z app[61b69677] yyz [info]proxy | Running…
2022-05-28T14:35:29Z app[61b69677] yyz [info]exporter | Running…
2022-05-28T14:35:29Z app[61b69677] yyz [info]exporter | INFO[0000] Starting Server: :9187 source=“postgres_exporter.go:1837”
2022-05-28T14:35:30Z app[61b69677] yyz [info]sentinel | 2022-05-28T14:35:30.187Z INFO cmd/sentinel.go:2000 sentinel uid {“uid”: “97281fb8”}
2022-05-28T14:35:30Z app[61b69677] yyz [info]proxy | 2022-05-28T14:35:30.187Z INFO cmd/proxy.go:419 proxy uid {“uid”: “539b9cc3”}
2022-05-28T14:35:30Z app[61b69677] yyz [info]keeper | 2022-05-28T14:35:30.189Z INFO cmd/keeper.go:2091 exclusive lock on data dir taken
2022-05-28T14:35:30Z app[61b69677] yyz [info]checking stolon status
2022-05-28T14:35:40Z app[61b69677] yyz [info]sentinel | {“level”:“warn”,“ts”:“2022-05-28T14:35:40.792Z”,“logger”:“etcd-client”,“caller”:“v3@v3.5.0/retry_interceptor.go:62”,“msg”:“retrying of unary invoker failed”,“target”:“etcd-endpoints://0xc0002bfdc0/#initially=[etcd-na.fly-shared.net:443]”,“attempt”:0,“error”:“rpc error: code = Unavailable desc = etcdserver: request timed out”}
2022-05-28T14:35:40Z app[61b69677] yyz [info]proxy | {“level”:“warn”,“ts”:“2022-05-28T14:35:40.793Z”,“logger”:“etcd-client”,“caller”:“v3@v3.5.0/retry_interceptor.go:62”,“msg”:“retrying of unary invoker failed”,“target”:“etcd-endpoints://0xc0002bddc0/#initially=[etcd-na.fly-shared.net:443]”,“attempt”:0,“error”:“rpc error: code = Unavailable desc = etcdserver: request timed out”}
2022-05-28T14:35:41Z app[61b69677] yyz [info]Main child exited normally with code: 2
2022-05-28T14:35:41Z app[61b69677] yyz [info]Reaped child process with pid: 547 and signal: SIGKILL, core dumped? false
2022-05-28T14:35:41Z app[61b69677] yyz [info]Reaped child process with pid: 549 and signal: SIGKILL, core dumped? false
2022-05-28T14:35:41Z app[61b69677] yyz [info]Reaped child process with pid: 537 and signal: SIGKILL, core dumped? false
2022-05-28T14:35:41Z app[61b69677] yyz [info]Reaped child process with pid: 530 and signal: SIGKILL, core dumped? false
2022-05-28T14:35:41Z app[61b69677] yyz [info]Reaped child process with pid: 531 and signal: SIGHUP, core dumped? false
2022-05-28T14:35:41Z app[61b69677] yyz [info]Reaped child process with pid: 553 and signal: SIGHUP, core dumped? false
2022-05-28T14:35:41Z app[61b69677] yyz [info]Reaped child process with pid: 548, exit code: 1
2022-05-28T14:35:41Z app[61b69677] yyz [info]Reaped child process with pid: 540, exit code: 1
2022-05-28T14:35:41Z app[61b69677] yyz [info]Starting clean up.
2022-05-28T14:35:41Z app[61b69677] yyz [info]Umounting /dev/vdc from /data

I’ve tried restaring the databases and scaling them down to 0 and then back up to 2.

“fly postgres connect” results in “Error Command is not compatible with this image.”
“fly ssh connect” results in the following after about 2 minutes (with a different address for each app):

Connecting to top1.nearest.of.nexbibli-db.internal… complete
Error error connecting to SSH server: connect tcp [fdaa:0:3149:a7b:aa4:0:335c:2]:22: operation timed out

1 Like

Not sure if this has since been resolved given the time, however if not, when you try fly postgres connect are you also passing the pg app name to it? Like:

fly pg connect --app <pg-app-name>

If you are, and it’s still not happy, it may be that command is newer than your postgres app, especially since you say it hasn’t been touched in months.

Assuming you have a backup you could try upgrading it. This is a Fly app/repo for doing that:

This doesn’t explain why they stopped working. That’s one for Fly, alas. However it may get you going again.

What flyctl version are you running? You may need to update.

flyctl version

It looks like your apps initially went down due to an issue with our Etcd backend store. We had to transition a number of apps to our Consul backend store as a result of that outage.

Let me know if you still have issues connecting after you verify/update flyctl.

Well, I just checked and the apps are back up!

when you try fly postgres connect are you also passing the pg app name to it

Yeah, should’ve specified that but I omitted app names from commands because I ran them on both apps and was trying to avoid confusion between them. Guess that wasn’t the greatest idea :sweat_smile:

What flyctl version are you running?

0.0.330, though now that the apps are back up connecting both via postgres and ssh are working fine.

It looks like your apps initially went down due to an issue with our Etcd backend store. We had to transition a number of apps to our Consul backend store as a result of that outage.

I’m guessing this migration you’re talking about is the two releases that happened yesterday by Shaun Davis (I’m assuming that’s @shaun ) and Fly.io, or is that unrelated debugging?

1 Like

I’m guessing this migration you’re talking about is the two releases that happened yesterday by Shaun Davis (I’m assuming that’s @shaun ) and Fly.io, or is that unrelated debugging?

Yep! The first release was to move you to a compatible flyio/postgres image version and the second release was for the Etcd → Consul transition to get your app back up. Super sorry for the trouble!

All good, thanks yor your help!