Heard from internal users that our web app was down last night. Checking the UI for my app in the dashboard, you wouldn't be able to tell that anything was wrong. But running flyctl status for the postgres app definitely showed some issues:
Instances
ID       PROCESS VERSION REGION DESIRED STATUS             HEALTH CHECKS                  RESTARTS CREATED
60f68b63 app     5       lax    run     running (replica)  3 total, 2 passing, 1 critical 0        2022-10-09T20:06:36Z
b24f5fb3 app     5       lhr    run     running (replica)  3 total, 2 passing, 1 critical 0        2022-08-17T23:44:23Z
a484c349 app     5       sin    run     running (readonly) 3 total, 1 passing, 2 critical 0        2022-08-17T23:39:06Z
okay, so, (1) what's the issue and (2) how do i fix it? looking at (1), i try flyctl postgres connect with no luck:
Connecting to XXX.internal... complete
psql: error: connection to server at "XXX.internal" (fdaa:0:835e:a7b:28df:1:828c:2), port 5432 failed: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
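(in hindsight, a quicker first check would have been to probe each member directly and see which ones answer at all. a rough sketch, assuming i'm on the org's wireguard network, have the postgres client tools installed locally, and that the per-region hostnames follow the same pattern as the sin.XXX.internal name that shows up later:)
# pg_isready exits 0 and prints "accepting connections" if anything answers on 5432
pg_isready -h lax.XXX.internal -p 5432
pg_isready -h lhr.XXX.internal -p 5432
pg_isready -h sin.XXX.internal -p 5432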
okay, it claims to be readonly but it looks like it's totally down. groping around the documentation, i discover flyctl checks list. this is a little more helpful, but it's a lot:
Health Checks for XXX
NAME | STATUS   | ALLOCATION | REGION | TYPE | LAST UPDATED         | OUTPUT
-----*----------*------------*--------*------*----------------------*--------------------------------------------------------------------------
vm   | passing  | 60f68b63   | lax    | HTTP | 57s ago              | HTTP GET http://172.19.2.250:5500/flycheck/vm: 200 OK Output:
     |          |            |        |      |                      | [✓] checkDisk: 9.14 GB (93.5%) free space on /data/ (53.51µs)
     |          |            |        |      |                      | [✓] checkLoad: load averages: 0.05 0.04 0.07 (78.45µs)
     |          |            |        |      |                      | [✓] memory: system spent 0s of the last 60s waiting on memory (50.38µs)
     |          |            |        |      |                      | [✓] cpu: system spent 906ms of the last 60s waiting on cpu (28.93µs)
     |          |            |        |      |                      | [✓] io: system spent 0s of the last 60s waiting on io (25.57µs)
pg   | critical | 60f68b63   | lax    | HTTP | 5h39m ago            | HTTP GET http://172.19.2.250:5500/flycheck/pg: 500 Internal Server Error
     |          |            |        |      |                      | Output: "failed to connect to proxy: context deadline exceeded"
role | passing  | 60f68b63   | lax    | HTTP | 2022-10-09T20:08:48Z | replica
vm   | passing  | b24f5fb3   | lhr    | HTTP | 5m38s ago            | HTTP GET http://172.19.23.58:5500/flycheck/vm: 200 OK Output:
     |          |            |        |      |                      | [✓] checkDisk: 9.11 GB (93.2%) free space on /data/ (44.97µs)
     |          |            |        |      |                      | [✓] checkLoad: load averages: 0.00 0.04 0.01 (59.52µs)
     |          |            |        |      |                      | [✓] memory: system spent 0s of the last 60s waiting on memory (32.47µs)
     |          |            |        |      |                      | [✓] cpu: system spent 786ms of the last 60s waiting on cpu (23.88µs)
     |          |            |        |      |                      | [✓] io: system spent 0s of the last 60s waiting on io (25.22µs)
pg   | critical | b24f5fb3   | lhr    | HTTP | 5h39m ago            | HTTP GET http://172.19.23.58:5500/flycheck/pg: 500 Internal Server Error
     |          |            |        |      |                      | Output: "failed to connect to proxy: context deadline exceeded"
role | passing  | b24f5fb3   | lhr    | HTTP | 11h0m ago            | replica
vm   | critical | a484c349   | sin    | HTTP | 2m30s ago            | HTTP GET http://172.19.8.90:5500/flycheck/vm: 500 Internal Server Error Output:
     |          |            |        |      |                      | [✗] checkDisk: 999.27 MB (10.0%) free space on /data/ (43.66µs)
     |          |            |        |      |                      | [✓] checkLoad: load averages: 0.08 0.08 0.11 (50.68µs)
     |          |            |        |      |                      | [✓] memory: system spent 0s of the last 60s waiting on memory (33.68µs)
     |          |            |        |      |                      | [✓] cpu: system spent 1.4s of the last 60s waiting on cpu (16.81µs)
     |          |            |        |      |                      | [✓] io: system spent 0s of the last 60s waiting on io (13.5µs)
role | passing  | a484c349   | sin    | HTTP | 5h35m ago            | readonly
pg   | critical | a484c349   | sin    | HTTP | 5h39m ago            | HTTP GET http://172.19.8.90:5500/flycheck/pg: 500 Internal Server Error
     |          |            |        |      |                      | Output: "failed to connect to proxy: context deadline exceeded"
the only thing obvious is the [✗] by checkDisk – is postgres configured to … crash when there's less than 10% disk space? i locate my grafana dashboard (you click a little graph icon next to one of the allocation ids on the monitoring page), and i can tell that indeed the volume associated with the primary seems to be filling up quickly.
i realize that i might still have ssh access. i flyctl ssh console --select and pick the primary instance in sin:
? Select instance: sin.XXX.internal
Connecting to [fdaa:0:835e:a7b:f0f:1:8289:2]... complete
# ps
PID TTY TIME CMD
26853 pts/4 00:00:00 sh
26907 pts/4 00:00:00 ps
# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 108M 0 108M 0% /dev
/dev/vdb 7.9G 842M 6.6G 12% /
shm 111M 76K 111M 1% /dev/shm
tmpfs 111M 0 111M 0% /sys/fs/cgroup
/dev/vdc 9.8G 8.3G 1000M 90% /data
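(nothing fly-specific here, but for anyone else poking around: a quick du is the fastest way to see what's actually eating the volume. this is just standard coreutils run from the same ssh session:)
# biggest directories under the data dir, sorted smallest to largest
du -sh /data/postgres/* | sort -h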
looking around in /data/postgres, i see that the pg_wal directory is using all of the space. shouldn't postgres be automatically cleaning up the WAL? i think something might be wrong. in any case, after some googling, i do:
# pg_controldata --pgdata=/data/postgres
discover that my latest checkpoint's REDO WAL file is Y, then run:
root@a484c349:/data/postgres/pg_wal# du -hs
8.3G .
root@a484c349:/data/postgres/pg_wal# pg_archivecleanup /data/postgres/pg_wal/ Y
root@a484c349:/data/postgres/pg_wal# du -hs
33M
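(condensing that for anyone who hits the same thing, and with the dry run i wish i'd done first. Y is still the placeholder for the actual WAL file name from pg_controldata, and this assumes no replica or archiver still needs the older segments:)
# the last checkpoint's REDO WAL file; everything older than it is a candidate for cleanup
pg_controldata --pgdata=/data/postgres | grep "REDO WAL file"
# dry run: -n only prints the segments pg_archivecleanup would remove
pg_archivecleanup -n /data/postgres/pg_wal Y
# the real thing
pg_archivecleanup /data/postgres/pg_wal Y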
okay, now, should postgres be running? i have no idea what’s happening inside this vm. is there a process supervisor?
root@a484c349:/data/postgres/pg_wal# /etc/init.d/postgresql restart
No PostgreSQL clusters exist; see "man pg_createcluster" ... (warning).
root@a484c349:/data/postgres/pg_wal# ps
PID TTY TIME CMD
2703 pts/4 00:00:00 ps
26853 pts/4 00:00:00 sh
27118 pts/4 00:00:00 bash
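(one thing i worked out later: bare ps only shows processes attached to the current user and terminal, which is why the vm looks so empty. to see what's actually running, and what pid 1 is, plain procps and /proc are enough:)
# every process, not just the ones on this tty
ps -ef
# what is pid 1, i.e. whatever is supervising everything else?
cat /proc/1/comm
tr '\0' ' ' < /proc/1/cmdline; echo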
umm, maybe i can try flyctl postgres restart? i failed to save the console output of this, but it began trying to do a replica failover, which i don't think was going to work with the primary already offline. it was stuck for a little while before i ctrl-c'ed it.
as a final try, i grab the instance id of the primary via flyctl status and pass it to flyctl vm restart. a few minutes later, everything is back up!
remaining questions:
- why did my WAL begin filling up? how do i prevent it going forward? (the checks i'm planning to run are sketched after this list.)
- why did the primary crash? is it configured to do this? where?
- is there a better way to get notified about the DB being down without my users telling me on slack?
- shouldn't the cluster have, idk, failed over to another region? how do i make it do that if the primary dies again in the future?
- what's the process supervisor inside the vms? is there a way to start the service without restarting the vm?
- where are the logs inside the VM? i saw nothing inside /var/log, and the fly logs UI is not super-helpful.
- (related) what is with the cmd/sentinel.go:276 no keeper info available errors filling up my postgres logs?
- is there documentation somewhere that shows me how i should've recovered from this? i feel like i was mostly just flying by the seat of my pants here.
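on the first question, the usual culprits for pg_wal growing without bound (general postgres lore, nothing fly-specific) are a replication slot pinned by a replica that fell behind, or an archive_command that keeps failing. now that the cluster is back, these are the checks i plan to run at the psql prompt that flyctl postgres connect drops me into:
-- an inactive slot pins WAL at its restart_lsn, so old segments can never be recycled
SELECT slot_name, slot_type, active, restart_lsn FROM pg_replication_slots;
-- a failing archive_command also keeps segments around; failed_count > 0 is a red flag
SELECT archived_count, failed_count, last_failed_wal FROM pg_stat_archiver;
-- how much WAL is deliberately retained (wal_keep_size exists on postgres 13+)
SHOW wal_keep_size;
SHOW archive_mode;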
in general, my experience here is much closer to “you’re just running some postgres instances on someone else’s computers” than “you’re using a SaaS-provided DB”. there’s no proactive alerting or recovery, it just breaks and you’re done. i feel like it could be better?