anatomy of a postgres outage

i heard from internal users that our web app was down last night. checking the UI for my app in the dashboard, you couldn’t tell that anything was wrong. but running flyctl status for the postgres app definitely showed some issues:

Instances
ID              PROCESS VERSION REGION  DESIRED STATUS                  HEALTH CHECKS                 RESTARTS CREATED
60f68b63        app     5       lax     run     running (replica)       3 total, 2 passing, 1 critical0        2022-10-09T20:06:36Z
b24f5fb3        app     5       lhr     run     running (replica)       3 total, 2 passing, 1 critical0        2022-08-17T23:44:23Z
a484c349        app     5       sin     run     running (readonly)      3 total, 1 passing, 2 critical0        2022-08-17T23:39:06Z

okay, so, (1) what’s the issue and (2) how do i fix it? starting with (1), i try flyctl postgres connect, with no luck:

Connecting to XXX.internal... complete
psql: error: connection to server at "XXX.internal" (fdaa:0:835e:a7b:28df:1:828c:2), port 5432 failed: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

okay, it claims to be readonly, but it looks like it’s totally down. groping around the documentation, i discover flyctl checks list. this is a little more helpful, but it’s a lot:

Health Checks for XXX
  NAME | STATUS   | ALLOCATION | REGION | TYPE | LAST UPDATED         | OUTPUT
-------*----------*------------*--------*------*----------------------*--------------------------------------------------------------------------------
  vm   | passing  | 60f68b63   | lax    | HTTP | 57s ago              | HTTP GET http://172.19.2.250:5500/flycheck/vm: 200 OK Output: "[✓]
       |          |            |        |      |                      | checkDisk: 9.14 GB (93.5%) free space on /data/ (53.51µs)\n[✓]
       |          |            |        |      |                      | checkLoad: load averages: 0.05 0.04 0.07 (78.45µs)\n[✓]
       |          |            |        |      |                      | memory: system spent 0s of the last 60s waiting on memory (50.38µs)\n[✓]
       |          |            |        |      |                      | cpu: system spent 906ms of the last 60s waiting on cpu (28.93µs)\n[✓]
       |          |            |        |      |                      | io: system spent 0s of the last 60s waiting on io (25.57µs)"
  pg   | critical | 60f68b63   | lax    | HTTP | 5h39m ago            | HTTP GET http://172.19.2.250:5500/flycheck/pg: 500 Internal Server Error Output: "failed to connect to proxy: context deadline exceeded"
  role | passing  | 60f68b63   | lax    | HTTP | 2022-10-09T20:08:48Z | replica
  vm   | passing  | b24f5fb3   | lhr    | HTTP | 5m38s ago            | HTTP GET http://172.19.23.58:5500/flycheck/vm: 200 OK Output: "[✓]
       |          |            |        |      |                      | checkDisk: 9.11 GB (93.2%) free space on /data/ (44.97µs)\n[✓]
       |          |            |        |      |                      | checkLoad: load averages: 0.00 0.04 0.01 (59.52µs)\n[✓]
       |          |            |        |      |                      | memory: system spent 0s of the last 60s waiting on memory (32.47µs)\n[✓]
       |          |            |        |      |                      | cpu: system spent 786ms of the last 60s waiting on cpu (23.88µs)\n[✓]
       |          |            |        |      |                      | io: system spent 0s of the last 60s waiting on io (25.22µs)"
  pg   | critical | b24f5fb3   | lhr    | HTTP | 5h39m ago            | HTTP GET http://172.19.23.58:5500/flycheck/pg: 500 Internal Server Error Output: "failed to connect to proxy: context deadline exceeded"
  role | passing  | b24f5fb3   | lhr    | HTTP | 11h0m ago            | replica
  vm   | critical | a484c349   | sin    | HTTP | 2m30s ago            | HTTP GET http://172.19.8.90:5500/flycheck/vm: 500 Internal Server Error Output: "[✗]checkDisk: 999.27 MB (10.0%!)(MISSING) free space on /data/ (43.66µs)\n[✓]
       |          |            |        |      |                      | checkLoad: load averages: 0.08 0.08 0.11 (50.68µs)\n[✓]
       |          |            |        |      |                      | memory: system spent 0s of the last 60s waiting on memory (33.68µs)\n[✓]
       |          |            |        |      |                      | cpu: system spent 1.4s of the last 60s waiting on cpu (16.81µs)\n[✓]
       |          |            |        |      |                      | io: system spent 0s of the last 60s waiting on io (13.5µs)"
  role | passing  | a484c349   | sin    | HTTP | 5h35m ago            | readonly
  pg   | critical | a484c349   | sin    | HTTP | 5h39m ago            | HTTP GET http://172.19.8.90:5500/flycheck/pg: 500 Internal Server Error Output: "failed to connect to proxy: context deadline exceeded"

the only obvious thing is the [✗] next to checkDisk – is postgres configured to … crash when there’s less than 10% disk space? i locate my grafana dashboard (you click a little graph icon next to one of the allocation ids on the monitoring page), and i can tell that indeed the volume associated with the primary seems to be filling up quickly.

i realize that i might still have ssh access. i flyctl ssh console --select and pick the primary instance in sin:

? Select instance: sin.XXX.internal
Connecting to [fdaa:0:835e:a7b:f0f:1:8289:2]... complete
# ps
  PID TTY          TIME CMD
26853 pts/4    00:00:00 sh
26907 pts/4    00:00:00 ps
# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        108M     0  108M   0% /dev
/dev/vdb        7.9G  842M  6.6G  12% /
shm             111M   76K  111M   1% /dev/shm
tmpfs           111M     0  111M   0% /sys/fs/cgroup
/dev/vdc        9.8G  8.3G 1000M  90% /data

looking around in /data/postgres, i see that the pg_wal directory is using all of the space. shouldn’t postgres be automatically cleaning up the WAL? i think something might be wrong. in any case, after some googling, i do:

# pg_controldata --pgdata=/data/postgres

i discover that my latest checkpoint's REDO WAL file is Y, then run:

root@a484c349:/data/postgres/pg_wal# du -hs
8.3G    .
root@a484c349:/data/postgres/pg_wal# pg_archivecleanup /data/postgres/pg_wal/ Y
root@a484c349:/data/postgres/pg_wal# du -hs
33M
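aside, in case anyone else ends up here: the WAL file name can be parsed out of pg_controldata instead of eyeballed. a sketch – the sample line below is what i assume the relevant pg_controldata output looks like; on the vm you’d pipe the real command output instead:

```shell
# parse the "Latest checkpoint's REDO WAL file" value out of pg_controldata output;
# on the vm: pg_controldata --pgdata=/data/postgres | sed -n '...'
sample="Latest checkpoint's REDO WAL file:    000000010000000000000002"
wal=$(printf '%s\n' "$sample" \
  | sed -n "s/^Latest checkpoint's REDO WAL file:[[:space:]]*//p")
echo "$wal"
# then: pg_archivecleanup -n /data/postgres/pg_wal "$wal"   # -n = dry run first
```

pg_archivecleanup’s -n flag only prints what would be removed, which seems prudent before deleting WAL on a sick cluster.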

okay, now, should postgres be running? i have no idea what’s happening inside this vm. is there a process supervisor?

root@a484c349:/data/postgres/pg_wal# /etc/init.d/postgresql restart
No PostgreSQL clusters exist; see "man pg_createcluster" ... (warning).
root@a484c349:/data/postgres/pg_wal# ps
  PID TTY          TIME CMD
 2703 pts/4    00:00:00 ps
26853 pts/4    00:00:00 sh
27118 pts/4    00:00:00 bash

umm, maybe i can try flyctl postgres restart ? i failed to save the console output of this, but it began trying to do a replica failover, which i don’t think was going to work with the primary already offline. it was stuck for a little while before i ctrl-c’d it.

as a final try, i grab the instance id of the primary via flyctl status and pass it to flyctl vm restart. a few minutes later, everything is back up!

remaining questions:

  • why did my WAL begin filling up? how do i prevent it going forward?
  • why did the primary crash? is it configured to do this? where?
  • is there a better way to get notified about the DB being down without my users telling me on slack?
  • shouldn’t the cluster have, idk, failed over to another region? how do i make it do that if the primary dies again in the future?
  • what’s the process supervisor inside the vms? is there a way to start the service without restarting the vm?
  • where are the logs inside the VM? i saw nothing inside /var/log, and the fly logs UI is not super-helpful.
  • (related) what is with the cmd/sentinel.go:276 no keeper info available errors filling up my postgres logs?
  • is there documentation somewhere that shows me how i should’ve recovered from this? i feel like i was mostly just flying by the seat of my pants here.

in general, my experience here is much closer to “you’re just running some postgres instances on someone else’s computers” than “you’re using a SaaS-provided DB”. there’s no proactive alerting or recovery, it just breaks and you’re done. i feel like it could be better?


Our Postgres is very much just “automated Postgres on Fly”. It’s not a managed DB like you’d get from CrunchyData or anything. Which is also why a 3 node config like that costs ~$10/mo to run. We have a post-deploy doc we link from the CLI that outlines what this is doing: You've deployed Fly Postgres. Now what? · Fly Docs

It’s normal for the WAL to grow, and it’s based entirely on DB activity. If you add and remove replicas, it can continue to grow while it tries to keep enough buffer for a replica to come back someday. We’ve released some updates to our postgres app template that might help with replicas that have been removed. You can run fly image show and fly image update to get the latest pg app version.
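One concrete thing worth ruling out: a replication slot left behind by a removed replica will pin WAL indefinitely. A sketch of how to check (the connection details are placeholders, and on Fly Postgres the slots are managed by stolon, so a dropped slot may be recreated):

```shell
# list replication slots and how much WAL each one is holding back
# (assumes local psql access as the postgres superuser; postgres 10+)
psql -U postgres -h localhost -c "
  SELECT slot_name, active,
         pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
  FROM pg_replication_slots;"
```

An inactive slot with a large retained_wal value is the usual smoking gun.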

We put postgres in readonly mode when it gets close to filling the disk. This prevents data corruption and gives you time to troubleshoot. Fixing a Postgres that is completely out of disk is no fun (and arguably dangerous).

It didn’t actually crash; it stayed up but couldn’t accept write transactions. Since it wasn’t accepting write transactions, the built-in haproxy didn’t have anywhere to route connections on port 5432.

Postgres does not fail over between regions. It takes a lot of app work to make that possible, and you have to get clever about synchronous replication (or risk losing data). The best way to run HA postgres is to run two nodes in your primary region. If one fails, the other will take over. That would not prevent a readonly issue when you run out of disk space, however.

You should have recovered by running fly volumes extend. This enlarges the disk and the DB comes back after a quick restart. In general, you should over provision postgres disks when you care about app reliability. We’ve tried to make volumes cheap so you can do that.
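For reference, the shape of that recovery looks like this (a sketch; the app name and volume id are placeholders you’d read off the list command):

```shell
# find the full volume, then grow it; the instance restarts with the larger disk
flyctl volumes list -a my-postgres-app
flyctl volumes extend vol_xxxxxxxxxxxx -s 20   # new size in GB
```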

We don’t have much in the way of alerting to tell you your app is failing. updown.io is pretty good, so it might be worth trying that out!


thanks, this is all great info. i wanted to add a replica, but the documentation (High Availability & Global Replication · Fly Docs) says the best way to add a replica is to run fly machine clone. alas, fly machine list for my cluster lists no machines.

currently, i have 3 replicas – one in each of three regions. i know that if i use flyctl scale count 4, i will get another replica, but the region is unpredictable. there’s also max-per-region, but i’m not sure if i can set it to “1 in these regions, but 2 in this other region”.

i’ll proceed for now with setting max-per-region to 2 and scaling to 6, which gives me 3 extra replicas. is there an alternate approach?
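for the record, here’s what i ran (a sketch of my understanding; the app name is a placeholder, and i’m not certain this flag combination is the intended approach):

```shell
# allow up to 2 instances per region, then scale to 6 total
flyctl scale count 6 --max-per-region 2 -a my-postgres-app
```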

after creating a new replica in my primary region, i tried failover, and got:

flyctl postgres failover
Update available 0.0.420 -> v0.0.429.
Run "flyctl version update" to upgrade.
Error failerover is only supported for machines apps

so, i’m not really sure how to initiate the failover, either. my goal is to scale the disk for the original primary, but this will cause the vm to restart. i was hoping to do a failover first, to avoid downtime.

update: i tried flyctl vm stop with the primary vm, and this did not result in a failover to the replica in the same region. instead, a new VM came up, and took over as primary. it is using the same volume as the previous primary.

update: i tried flyctl image update and it restarted a bunch of replicas at once, including the primary, with no failovers or anything like that, just app downtime while the image updated.

maybe my lack of failover has something to do with these error messages i get every few seconds:

2022-11-08T01:51:32Z app[XXX] sin [info]sentinel | 2022-11-08T01:51:32.587Z        WARN    cmd/sentinel.go:276     no keeper info available     {"db": "XXX", "keeper": "XXX"}

i tried to read the fly-apps/postgres-ha repo on GitHub (“Postgres + Stolon for HA clusters as Fly apps”), but honestly, it’s quite a lot. i don’t even know golang, so it’s taking a while to parse the start script that is the container entrypoint.

i’m continuing to have trouble understanding who this postgres offering is targeted at. i feel like i’m a fairly typical user who just wants to spin up an app on fly and focus more on my application logic than on understanding the ins and outs of how fly works. should i be trying to use postgres on fly, or am i not the target demographic? am i having an unreasonable amount of trouble here? am i just too ignorant to be using this? what should i be reading, or what should my background be, to successfully administer this DB?

Hi @igor47, we’ve let you down here with our docs! I see there’s no indication on the doc you linked that it only applies to Postgres clusters created with newer flyctl versions. How that happened is easy to see from the inside, but the result is diabolical.

For older clusters, Fly Postgres on Apps V1 · Fly Docs is the relevant reference, and for adding a replica, here’s the section.

To add a replica, you need to provision another volume on your Postgres app. Scale up your count by one, and the new instance will go where the volume is. If you provision a new instance in the primary region and the old leader is unhealthy, I think the new instance will be elected leader as soon as it’s up.
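Concretely, that looks something like this (a sketch; the app name, volume name, and size are placeholders, and the new volume’s size should match your existing volumes):

```shell
# provision a volume in the primary region, then add an instance to attach to it
flyctl volumes create pg_data --region sin --size 10 -a my-postgres-app
flyctl scale count 4 -a my-postgres-app
```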

You are having an unreasonable amount of trouble here, because we hid the docs you needed and substituted ones that almost make sense but won’t work for your cluster.

Only you can answer whether a managed database solution would suit you better, but you’ve been trying to manage your Fly Postgres in Hard Mode today, and I’m sorry about that!