I have two Postgres database apps that have stopped working at an unknown time.
The two apps are “sparkling-voice-5768” and “nexbibli-db”. They are attached to “billowing-bush-8831” and “nexbibli” respectively.
Both databases haven’t been touched in months. I noticed the issue after a deploy to “billowing-bush-8831”, but “nexbibli” hasn’t been touched either for 2 months and was definitely working a few weeks back.
Here are the logs from “sparkling-voice-5768” after a restart:
Restart logs
2022-05-28T14:35:10Z runner[acea18ef] yyz [info]Starting instance
2022-05-28T14:35:10Z runner[acea18ef] yyz [info]Configuring virtual machine
2022-05-28T14:35:10Z runner[acea18ef] yyz [info]Pulling container image
2022-05-28T14:35:11Z runner[acea18ef] yyz [info]Unpacking image
2022-05-28T14:35:11Z runner[acea18ef] yyz [info]Preparing kernel init
2022-05-28T14:35:11Z runner[acea18ef] yyz [info]Setting up volume ‘pg_data’
2022-05-28T14:35:14Z runner[acea18ef] yyz [info]Configuring firecracker
2022-05-28T14:35:14Z runner[acea18ef] yyz [info]Starting virtual machine
2022-05-28T14:35:14Z app[acea18ef] yyz [info]Starting init (commit: aa54f7d)…
2022-05-28T14:35:14Z app[acea18ef] yyz [info]Mounting /dev/vdc at /data w/ uid: 0, gid: 0 and chmod 0755
2022-05-28T14:35:14Z app[acea18ef] yyz [info]Preparing to run: docker-entrypoint.sh start
as root
2022-05-28T14:35:14Z app[acea18ef] yyz [info]2022/05/28 14:35:14 listening on [fdaa:0:316c:a7b:aa4:0:3428:2]:22 (DNS: [fdaa::3]:53)
2022-05-28T14:35:14Z app[acea18ef] yyz [info]cluster spec filename /fly/cluster-spec.json
2022-05-28T14:35:14Z app[acea18ef] yyz [info]system memory: 256mb vcpu count: 1
2022-05-28T14:35:14Z app[acea18ef] yyz [info]{
2022-05-28T14:35:14Z app[acea18ef] yyz [info] “initMode”: “new”,
2022-05-28T14:35:14Z app[acea18ef] yyz [info] “pgParameters”: {
2022-05-28T14:35:14Z app[acea18ef] yyz [info] “effective_cache_size”: “192MB”,
2022-05-28T14:35:14Z app[acea18ef] yyz [info] “effective_io_concurrency”: “200”,
2022-05-28T14:35:14Z app[acea18ef] yyz [info] “maintenance_work_mem”: “64MB”,
2022-05-28T14:35:14Z app[acea18ef] yyz [info] “max_connections”: “300”,
2022-05-28T14:35:14Z app[acea18ef] yyz [info] “max_parallel_workers”: “8”,
2022-05-28T14:35:14Z app[acea18ef] yyz [info] “max_parallel_workers_per_gather”: “2”,
2022-05-28T14:35:14Z app[acea18ef] yyz [info] “max_worker_processes”: “8”,
2022-05-28T14:35:14Z app[acea18ef] yyz [info] “random_page_cost”: “1.1”,
2022-05-28T14:35:14Z app[acea18ef] yyz [info] “shared_buffers”: “64MB”,
2022-05-28T14:35:14Z app[acea18ef] yyz [info] “work_mem”: “4MB”
2022-05-28T14:35:14Z app[acea18ef] yyz [info] },
2022-05-28T14:35:14Z app[acea18ef] yyz [info] “maxStandbysPerSender”: 50
2022-05-28T14:35:14Z app[acea18ef] yyz [info]}
2022-05-28T14:35:14Z app[acea18ef] yyz [info]generated new config
2022-05-28T14:35:14Z app[acea18ef] yyz [info]Listening on port 5500keeper | Running…
2022-05-28T14:35:14Z app[acea18ef] yyz [info]sentinel | Running…
2022-05-28T14:35:14Z app[acea18ef] yyz [info]exporter | Running…
2022-05-28T14:35:14Z app[acea18ef] yyz [info]proxy | Running…
2022-05-28T14:35:14Z app[acea18ef] yyz [info]exporter | INFO[0000] Starting Server: :9187 source=“postgres_exporter.go:1837”
2022-05-28T14:35:15Z app[acea18ef] yyz [info]sentinel | 2022-05-28T14:35:15.215Z INFO cmd/sentinel.go:2000 sentinel uid {“uid”: “64588e24”}
2022-05-28T14:35:15Z app[acea18ef] yyz [info]keeper | 2022-05-28T14:35:15.217Z INFO cmd/keeper.go:2091 exclusive lock on data dir taken
2022-05-28T14:35:15Z app[acea18ef] yyz [info]proxy | 2022-05-28T14:35:15.217Z INFO cmd/proxy.go:419 proxy uid {“uid”: “b568e9ae”}
2022-05-28T14:35:15Z app[acea18ef] yyz [info]checking stolon status
2022-05-28T14:35:25Z runner[61b69677] yyz [info]Starting instance
2022-05-28T14:35:25Z runner[61b69677] yyz [info]Configuring virtual machine
2022-05-28T14:35:25Z runner[61b69677] yyz [info]Pulling container image
2022-05-28T14:35:26Z runner[61b69677] yyz [info]Unpacking image
2022-05-28T14:35:26Z runner[61b69677] yyz [info]Preparing kernel init
2022-05-28T14:35:26Z runner[61b69677] yyz [info]Setting up volume ‘pg_data’
2022-05-28T14:35:26Z app[acea18ef] yyz [info]proxy | {“level”:“warn”,“ts”:“2022-05-28T14:35:26.782Z”,“logger”:“etcd-client”,“caller”:“v3@v3.5.0/retry_interceptor.go:62”,“msg”:“retrying of unary invoker failed”,“target”:“etcd-endpoints://0xc0002bfdc0/#initially=[etcd-na.fly-shared.net:443]”,“attempt”:0,“error”:“rpc error: code = Unavailable desc = etcdserver: request timed out”}
2022-05-28T14:35:26Z app[acea18ef] yyz [info]sentinel | {“level”:“warn”,“ts”:“2022-05-28T14:35:26.780Z”,“logger”:“etcd-client”,“caller”:“v3@v3.5.0/retry_interceptor.go:62”,“msg”:“retrying of unary invoker failed”,“target”:“etcd-endpoints://0xc0002bfdc0/#initially=[etcd-na.fly-shared.net:443]”,“attempt”:0,“error”:“rpc error: code = Unavailable desc = etcdserver: request timed out”}
2022-05-28T14:35:26Z app[acea18ef] yyz [info]sentinel | 2022-05-28T14:35:26.784Z FATAL cmd/sentinel.go:2021 cannot create sentinel: cannot create store: cannot create kv store: etcdserver: request timed out
2022-05-28T14:35:26Z app[acea18ef] yyz [info]proxy | 2022-05-28T14:35:26.784Z FATAL cmd/proxy.go:433 cannot create cluster checker: cannot create store: cannot create kv store: etcdserver: request timed out
2022-05-28T14:35:26Z app[acea18ef] yyz [info]panic: error checking stolon status: {“level”:“warn”,“ts”:“2022-05-28T14:35:26.781Z”,“logger”:“etcd-client”,“caller”:“v3@v3.5.0/retry_interceptor.go:62”,“msg”:“retrying of unary invoker failed”,“target”:“etcd-endpoints://0xc0004281c0/#initially=[etcd-na.fly-shared.net:443]”,“attempt”:0,“error”:“rpc error: code = Unavailable desc = etcdserver: request timed out”}
2022-05-28T14:35:26Z app[acea18ef] yyz [info]cannot create kv store: etcdserver: request timed out
2022-05-28T14:35:26Z app[acea18ef] yyz [info]: exit status 1
2022-05-28T14:35:26Z app[acea18ef] yyz [info]main.main.func1nning]:
2022-05-28T14:35:26Z app[acea18ef] yyz [info] ain.main.func1(0xc0000d0000)
2022-05-28T14:35:26Z app[acea18ef] yyz [info] /go/src/github.com/fly-examples/postgres-ha/cmd/start/main.go:60 +0x5fa
2022-05-28T14:35:26Z app[acea18ef] yyz [info]created by main.main
2022-05-28T14:35:26Z app[acea18ef] yyz [info] /go/src/github.com/fly-examples/postgres-ha/cmd/start/main.go:51 +0x297
2022-05-28T14:35:27Z app[acea18ef] yyz [info]Main child exited normally with code: 2
2022-05-28T14:35:27Z app[acea18ef] yyz [info]Reaped child process with pid: 536 and signal: SIGHUP, core dumped? false
2022-05-28T14:35:27Z app[acea18ef] yyz [info]Reaped child process with pid: 541 and signal: SIGHUP, core dumped? false
2022-05-28T14:35:27Z app[acea18ef] yyz [info]Reaped child process with pid: 544, exit code: 1
2022-05-28T14:35:27Z app[acea18ef] yyz [info]Reaped child process with pid: 534, exit code: 1
2022-05-28T14:35:27Z app[acea18ef] yyz [info]Reaped child process with pid: 532 and signal: SIGKILL, core dumped? false
2022-05-28T14:35:27Z app[acea18ef] yyz [info]Reaped child process with pid: 540 and signal: SIGKILL, core dumped? false
2022-05-28T14:35:27Z app[acea18ef] yyz [info]Starting clean up.
2022-05-28T14:35:27Z app[acea18ef] yyz [info]Umounting /dev/vdc from /data
2022-05-28T14:35:29Z runner[61b69677] yyz [info]Configuring firecracker
2022-05-28T14:35:29Z runner[61b69677] yyz [info]Starting virtual machine
2022-05-28T14:35:29Z app[61b69677] yyz [info]Starting init (commit: aa54f7d)…
2022-05-28T14:35:29Z app[61b69677] yyz [info]Mounting /dev/vdc at /data w/ uid: 0, gid: 0 and chmod 0755
2022-05-28T14:35:29Z app[61b69677] yyz [info]Preparing to run: docker-entrypoint.sh start
as root
2022-05-28T14:35:29Z app[61b69677] yyz [info]2022/05/28 14:35:29 listening on [fdaa:0:316c:a7b:aa2:0:3427:2]:22 (DNS: [fdaa::3]:53)
2022-05-28T14:35:29Z app[61b69677] yyz [info]cluster spec filename /fly/cluster-spec.json
2022-05-28T14:35:29Z app[61b69677] yyz [info]system memory: 256mb vcpu count: 1
2022-05-28T14:35:29Z app[61b69677] yyz [info]{
2022-05-28T14:35:29Z app[61b69677] yyz [info] “initMode”: “new”,
2022-05-28T14:35:29Z app[61b69677] yyz [info] “pgParameters”: {
2022-05-28T14:35:29Z app[61b69677] yyz [info] “effective_cache_size”: “192MB”,
2022-05-28T14:35:29Z app[61b69677] yyz [info] “effective_io_concurrency”: “200”,
2022-05-28T14:35:29Z app[61b69677] yyz [info] “maintenance_work_mem”: “64MB”,
2022-05-28T14:35:29Z app[61b69677] yyz [info] “max_connections”: “300”,
2022-05-28T14:35:29Z app[61b69677] yyz [info] “max_parallel_workers”: “8”,
2022-05-28T14:35:29Z app[61b69677] yyz [info] “max_parallel_workers_per_gather”: “2”,
2022-05-28T14:35:29Z app[61b69677] yyz [info] “max_worker_processes”: “8”,
2022-05-28T14:35:29Z app[61b69677] yyz [info] “random_page_cost”: “1.1”,
2022-05-28T14:35:29Z app[61b69677] yyz [info] “shared_buffers”: “64MB”,
2022-05-28T14:35:29Z app[61b69677] yyz [info] “work_mem”: “4MB”
2022-05-28T14:35:29Z app[61b69677] yyz [info] },
2022-05-28T14:35:29Z app[61b69677] yyz [info] “maxStandbysPerSender”: 50
2022-05-28T14:35:29Z app[61b69677] yyz [info]}
2022-05-28T14:35:29Z app[61b69677] yyz [info]generated new config
2022-05-28T14:35:29Z app[61b69677] yyz [info]Listening on port 5500keeper | Running…
2022-05-28T14:35:29Z app[61b69677] yyz [info]sentinel | Running…
2022-05-28T14:35:29Z app[61b69677] yyz [info]proxy | Running…
2022-05-28T14:35:29Z app[61b69677] yyz [info]exporter | Running…
2022-05-28T14:35:29Z app[61b69677] yyz [info]exporter | INFO[0000] Starting Server: :9187 source=“postgres_exporter.go:1837”
2022-05-28T14:35:30Z app[61b69677] yyz [info]sentinel | 2022-05-28T14:35:30.187Z INFO cmd/sentinel.go:2000 sentinel uid {“uid”: “97281fb8”}
2022-05-28T14:35:30Z app[61b69677] yyz [info]proxy | 2022-05-28T14:35:30.187Z INFO cmd/proxy.go:419 proxy uid {“uid”: “539b9cc3”}
2022-05-28T14:35:30Z app[61b69677] yyz [info]keeper | 2022-05-28T14:35:30.189Z INFO cmd/keeper.go:2091 exclusive lock on data dir taken
2022-05-28T14:35:30Z app[61b69677] yyz [info]checking stolon status
2022-05-28T14:35:40Z app[61b69677] yyz [info]sentinel | {“level”:“warn”,“ts”:“2022-05-28T14:35:40.792Z”,“logger”:“etcd-client”,“caller”:“v3@v3.5.0/retry_interceptor.go:62”,“msg”:“retrying of unary invoker failed”,“target”:“etcd-endpoints://0xc0002bfdc0/#initially=[etcd-na.fly-shared.net:443]”,“attempt”:0,“error”:“rpc error: code = Unavailable desc = etcdserver: request timed out”}
2022-05-28T14:35:40Z app[61b69677] yyz [info]proxy | {“level”:“warn”,“ts”:“2022-05-28T14:35:40.793Z”,“logger”:“etcd-client”,“caller”:“v3@v3.5.0/retry_interceptor.go:62”,“msg”:“retrying of unary invoker failed”,“target”:“etcd-endpoints://0xc0002bddc0/#initially=[etcd-na.fly-shared.net:443]”,“attempt”:0,“error”:“rpc error: code = Unavailable desc = etcdserver: request timed out”}
2022-05-28T14:35:41Z app[61b69677] yyz [info]Main child exited normally with code: 2
2022-05-28T14:35:41Z app[61b69677] yyz [info]Reaped child process with pid: 547 and signal: SIGKILL, core dumped? false
2022-05-28T14:35:41Z app[61b69677] yyz [info]Reaped child process with pid: 549 and signal: SIGKILL, core dumped? false
2022-05-28T14:35:41Z app[61b69677] yyz [info]Reaped child process with pid: 537 and signal: SIGKILL, core dumped? false
2022-05-28T14:35:41Z app[61b69677] yyz [info]Reaped child process with pid: 530 and signal: SIGKILL, core dumped? false
2022-05-28T14:35:41Z app[61b69677] yyz [info]Reaped child process with pid: 531 and signal: SIGHUP, core dumped? false
2022-05-28T14:35:41Z app[61b69677] yyz [info]Reaped child process with pid: 553 and signal: SIGHUP, core dumped? false
2022-05-28T14:35:41Z app[61b69677] yyz [info]Reaped child process with pid: 548, exit code: 1
2022-05-28T14:35:41Z app[61b69677] yyz [info]Reaped child process with pid: 540, exit code: 1
2022-05-28T14:35:41Z app[61b69677] yyz [info]Starting clean up.
2022-05-28T14:35:41Z app[61b69677] yyz [info]Umounting /dev/vdc from /data
I’ve tried restaring the databases and scaling them down to 0 and then back up to 2.
“fly postgres connect” results in “Error Command is not compatible with this image.”
“fly ssh connect” results in the following after about 2 minutes (with a different address for each app):
Connecting to top1.nearest.of.nexbibli-db.internal… complete
Error error connecting to SSH server: connect tcp [fdaa:0:3149:a7b:aa4:0:335c:2]:22: operation timed out