Restarting the postgres instance worked yesterday, but today it isn’t working. Here’s a sample of the recent logs:
2022-02-03T13:41:26.807 app[b585f97d] ord [info] checking stolon status
2022-02-03T13:41:27.806 app[b585f97d] ord [info] checking stolon status
2022-02-03T13:41:27.813 app[b585f97d] ord [info] keeper | 2022-02-03T13:41:27.813Z ERROR cmd/keeper.go:720 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2022-02-03T13:41:28.807 app[b585f97d] ord [info] checking stolon status
2022-02-03T13:41:30.739 app[b585f97d] ord [info] keeper | 2022-02-03T13:41:30.736Z INFO cmd/keeper.go:1505 our db requested role is master
2022-02-03T13:41:30.760 app[b585f97d] ord [info] keeper | 2022-02-03T13:41:30.760Z INFO postgresql/postgresql.go:319 starting database
2022-02-03T13:41:30.794 app[b585f97d] ord [info] keeper | 2022-02-03 13:41:30.792 UTC [2719] LOG: starting PostgreSQL 13.4 (Debian 13.4-4.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2022-02-03T13:41:30.795 app[b585f97d] ord [info] keeper | 2022-02-03 13:41:30.794 UTC [2719] LOG: listening on IPv6 address "fdaa:0:3747:a7b:2052:0:4cb7:2", port 5433
2022-02-03T13:41:30.796 app[b585f97d] ord [info] keeper | 2022-02-03 13:41:30.795 UTC [2719] LOG: listening on Unix socket "/tmp/.s.PGSQL.5433"
2022-02-03T13:41:30.800 app[b585f97d] ord [info] keeper | 2022-02-03 13:41:30.799 UTC [2720] LOG: database system was shut down at 2022-02-03 10:16:39 UTC
2022-02-03T13:41:30.807 app[b585f97d] ord [info] keeper | 2022-02-03 13:41:30.805 UTC [2720] FATAL: could not write to file "pg_wal/xlogtemp.2720": No space left on device
As far as I can see, the logs only recently started to include that final “No space left on device” message. Now I’m looking at how to increase the volume size.
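For anyone following along, this is the rough shape of what I was considering. As far as I can tell, Fly volumes can’t be resized in place right now, so the idea is to add a larger volume, let a new instance replicate onto it, and then retire the small one. This is only a sketch; the volume name pg_data, the 40 GB size, and the exact sequence are my assumptions, not instructions from Fly.

```
# Sketch only: add a bigger volume alongside the full one, then retire the old one.
# Volume name "pg_data" and the 40 GB size are assumptions.
fly volumes list -a lysmore-postgres
fly volumes create pg_data --region ord --size 40 -a lysmore-postgres

# Add an instance so it lands on the new volume and replicates.
fly scale count 3 -a lysmore-postgres

# Once the new replica has caught up, drop back down and delete the old 10 GB volume.
fly scale count 2 -a lysmore-postgres
fly volumes delete <old-volume-id>
```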
Edit: The database name is lysmore-postgres. Here is the current status after I attempted to scale to a dedicated-cpu-2x (the command I used is sketched after the status output below):
App
Name = lysmore-postgres
Owner = personal
Version = 1
Status = running
Hostname = lysmore-postgres.fly.dev
Deployment Status
ID = 30ed794a-18a5-1ebb-2707-b40c285e4806
Version = v1
Status = failed
Description = Failed due to unhealthy allocations
Instances = 2 desired, 1 placed, 0 healthy, 1 unhealthy
Instances
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
b585f97d app 1 ⇡ ord run running (failed to co) 3 total, 3 critical 0 12m6s ago
bb55718d app 0 ord run running (replica) 3 total, 2 passing, 1 critical 3 14h26m ago
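For reference, the scaling attempt above was just the standard flyctl command, roughly this (a sketch from memory):

```
# Scale the postgres app's VMs up, then check the result.
fly scale vm dedicated-cpu-2x -a lysmore-postgres
fly status -a lysmore-postgres
```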
Creating a new volume and swapping them around (since I already had two production volumes) hasn’t helped. There isn’t a master now:
Instances
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
06cd9582 app 3 ord run running (replica) 3 total, 2 passing, 1 critical 1 8m56s ago
13440ead app 3 ord run running (failed to co) 3 total, 1 passing, 2 critical 1 14m33s ago
2022-02-03T14:10:32.044 app[13440ead] ord [info] exporter | ERRO[0157] Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[fdaa:0:3747:a7b:20dc:0:88a5:2]:5433/postgres?sslmode=disable): dial tcp [fdaa:0:3747:a7b:20dc:0:88a5:2]:5433: connect: connection refused source="postgres_exporter.go:1658"
2022-02-03T14:10:34.062 app[13440ead] ord [info] sentinel | 2022-02-03T14:10:34.057Z WARN cmd/sentinel.go:276 no keeper info available {"db": "12f406d9", "keeper": "205204cb72"}
2022-02-03T14:10:34.062 app[13440ead] ord [info] sentinel | 2022-02-03T14:10:34.060Z INFO cmd/sentinel.go:995 master db is failed {"db": "12f406d9", "keeper": "205204cb72"}
2022-02-03T14:10:34.062 app[13440ead] ord [info] sentinel | 2022-02-03T14:10:34.060Z INFO cmd/sentinel.go:1001 db not converged {"db": "12f406d9", "keeper": "205204cb72"}
2022-02-03T14:10:34.062 app[13440ead] ord [info] sentinel | 2022-02-03T14:10:34.060Z INFO cmd/sentinel.go:1006 trying to find a new master to replace failed master
2022-02-03T14:10:34.062 app[13440ead] ord [info] sentinel | 2022-02-03T14:10:34.060Z INFO cmd/sentinel.go:451 ignoring keeper since its behind that maximum xlog position {"db": "aced3b82", "dbXLogPos": 1612441449800, "masterXLogPos": 1620293188840}
2022-02-03T14:10:34.062 app[13440ead] ord [info] sentinel | 2022-02-03T14:10:34.061Z ERROR cmd/sentinel.go:1009 no eligible masters
2022-02-03T14:10:39.300 app[13440ead] ord [info] sentinel | 2022-02-03T14:10:39.299Z WARN cmd/sentinel.go:276 no keeper info available {"db": "12f406d9", "keeper": "205204cb72"}
2022-02-03T14:10:39.301 app[13440ead] ord [info] sentinel | 2022-02-03T14:10:39.301Z INFO cmd/sentinel.go:995 master db is failed {"db": "12f406d9", "keeper": "205204cb72"}
2022-02-03T14:10:39.301 app[13440ead] ord [info] sentinel | 2022-02-03T14:10:39.301Z INFO cmd/sentinel.go:1001 db not converged {"db": "12f406d9", "keeper": "205204cb72"}
2022-02-03T14:10:39.301 app[13440ead] ord [info] sentinel | 2022-02-03T14:10:39.301Z INFO cmd/sentinel.go:1006 trying to find a new master to replace failed master
2022-02-03T14:10:39.302 app[13440ead] ord [info] sentinel | 2022-02-03T14:10:39.301Z INFO cmd/sentinel.go:451 ignoring keeper since its behind that maximum xlog position {"db": "aced3b82", "dbXLogPos": 1612441449800, "masterXLogPos": 1620293188840}
2022-02-03T14:10:39.302 app[13440ead] ord [info] sentinel | 2022-02-03T14:10:39.301Z ERROR cmd/sentinel.go:1009 no eligible masters
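If I’m reading those sentinel lines right, the xlog positions are byte offsets, so the surviving replica (aced3b82) is 1620293188840 − 1612441449800 ≈ 7.9 GB behind the failed master, which is why the sentinel refuses to promote it. For what it’s worth, you can see where a replica actually sits in the WAL stream with a query like the one below; the host and password are placeholders based on the exporter’s connection string above, not real values.

```
# Where is this replica in the WAL stream? (Last WAL received vs. last WAL replayed.)
# Connection details are placeholders taken from the exporter's connection string format.
psql "postgresql://flypgadmin:<password>@[<replica-ipv6>]:5433/postgres" -c \
  "SELECT pg_last_wal_receive_lsn() AS received, pg_last_wal_replay_lsn() AS replayed;"
```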
I was terrified of that happening but unable to get it to change any other way. Here’s the volume id that I deleted: vol_w0enxv307ymv8okp
And this was the list of snapshots from before I deleted it:
ID SIZE (BYTES) CREATED AT
vs_JDOOPx3Rgyq5xfva 2472286281 8 hours ago
vs_OaDMM3z14ZGp4fzl 2076121727 1 day ago
vs_17qzQkVzPZwbNcZV 2088840192 2 days ago
vs_oOb8vjKYO4AvLc44 2070411093 3 days ago
vs_p3Q7a8o6ZjP8qU6 2102723171 4 days ago
vs_klqLZYPGK196lhNB 2118267439 5 days ago
vs_8zVvm99OxPvj9Uom 2133000291 6 days ago
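In case it helps anyone else: I grabbed that list because snapshots can be restored into a fresh volume. I believe flyctl supports this via --snapshot-id on fly volumes create; a sketch, with the volume name and the 40 GB size as my assumptions:

```
# Restore the most recent snapshot of the deleted volume into a new, larger volume.
# "pg_data" and the 40 GB size are assumptions; the snapshot ID is from the list above.
fly volumes create pg_data \
  --snapshot-id vs_JDOOPx3Rgyq5xfva \
  --region ord --size 40 \
  -a lysmore-postgres
```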
We’re going to remove the non-primary volumes temporarily, free up space on the existing primary, then re-add your replicas. So don’t be surprised if you see things disappear and reappear.
You should be up and running now. The disk filled up with mostly write-ahead log files, so we cleared those out to get it going again. I’m adding 40 GB disks and removing the old 10 GB disk to give it more headroom. That’s a lot of WAL space for a relatively small amount of data!
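If you want to keep an eye on this yourself, you can see how much space WAL is using straight from Postgres. Roughly something like this; the connection string is a placeholder and the database name postgres is just an example, so swap in your own details:

```
# Compare WAL usage to the size of a database, without needing the on-disk data directory path.
# Connection string and database name are placeholders.
psql "postgresql://flypgadmin:<password>@[<instance-ipv6>]:5433/postgres" -c "
  SELECT pg_size_pretty(sum(size))                     AS wal_size,
         pg_size_pretty(pg_database_size('postgres'))  AS db_size
  FROM pg_ls_waldir();"
```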
I think scaling unstuck it for you. Apps that crash repeatedly take a little longer to start back up, but they’ll eventually come back by themselves. Scaling hurries that up some.
I think your postgres is still on a very large VM size. Feel free to scale that back down; I believe you’ll be fine for a while at the smaller size.
What seems to have happened here is a WAL overflow. The write-ahead logs only went back about 12 hours, but they were eating up 8 GB of disk space, which means the disk was too small for your workload.
This is technically something you could have prevented had you known to watch for it, but it’s also poor UX on our part. When we ship volumes that auto-expand, that will fix this, but we’re also thinking about other ways to help people prevent this problem in the future.
I think your DB was prone to this because it has a high ratio of writes to total data size. You should probably expect to keep your disks 10-20x bigger than your DB size if that ratio sticks.
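To put rough numbers on that, using the figures from this thread: the data itself is around 2 GB (going by the snapshot sizes), but WAL alone was using about 8 GB, so the original 10 GB disk had essentially no headroom. At a 10-20x ratio, a ~2 GB database wants something like a 20-40 GB disk, which is roughly where the new 40 GB volumes land.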
The writes are almost entirely from the public demo, which has written at a consistent rate for the life of the app. So the ratio isn’t changing any time soon, and I’ll keep the volumes scaled up.