Postgres instance crashing and unresponsive

Yesterday, and then again overnight, my Postgres instance crashed and wouldn't come back up on its own. Here's a chart from metrics showing the point where it dropped off:

Restarting the Postgres instance worked yesterday, but today that isn't working. Here's a sample of the recent logs:

2022-02-03T13:41:26.807 app[b585f97d] ord [info] checking stolon status
2022-02-03T13:41:27.806 app[b585f97d] ord [info] checking stolon status
2022-02-03T13:41:27.813 app[b585f97d] ord [info] keeper | 2022-02-03T13:41:27.813Z ERROR cmd/keeper.go:720 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2022-02-03T13:41:28.807 app[b585f97d] ord [info] checking stolon status
2022-02-03T13:41:30.739 app[b585f97d] ord [info] keeper | 2022-02-03T13:41:30.736Z INFO cmd/keeper.go:1505 our db requested role is master
2022-02-03T13:41:30.760 app[b585f97d] ord [info] keeper | 2022-02-03T13:41:30.760Z INFO postgresql/postgresql.go:319 starting database
2022-02-03T13:41:30.794 app[b585f97d] ord [info] keeper | 2022-02-03 13:41:30.792 UTC [2719] LOG: starting PostgreSQL 13.4 (Debian 13.4-4.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2022-02-03T13:41:30.795 app[b585f97d] ord [info] keeper | 2022-02-03 13:41:30.794 UTC [2719] LOG: listening on IPv6 address "fdaa:0:3747:a7b:2052:0:4cb7:2", port 5433
2022-02-03T13:41:30.796 app[b585f97d] ord [info] keeper | 2022-02-03 13:41:30.795 UTC [2719] LOG: listening on Unix socket "/tmp/.s.PGSQL.5433"
2022-02-03T13:41:30.800 app[b585f97d] ord [info] keeper | 2022-02-03 13:41:30.799 UTC [2720] LOG: database system was shut down at 2022-02-03 10:16:39 UTC
2022-02-03T13:41:30.807 app[b585f97d] ord [info] keeper | 2022-02-03 13:41:30.805 UTC [2720] FATAL: could not write to file "pg_wal/xlogtemp.2720": No space left on device

The logs only recently started to include that final "No space left on device" message, as far as I can see. Now I'm looking at how to increase the volume size.
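Before resizing, it's worth confirming what is actually filling the disk. A minimal check from inside the VM (a sketch; it assumes the stock Fly Postgres image, which mounts the data volume at /data):

fly ssh console -a lysmore-postgres   # shell into the instance
df -h /data                           # how full the volume is
du -sh /data/*                        # which directory is using the space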


Edit: The database app name is lysmore-postgres. Here is the current status after I attempted to scale to a dedicated-cpu-2x:

App
  Name     = lysmore-postgres
  Owner    = personal
  Version  = 1
  Status   = running
  Hostname = lysmore-postgres.fly.dev

Deployment Status
  ID          = 30ed794a-18a5-1ebb-2707-b40c285e4806
  Version     = v1
  Status      = failed
  Description = Failed due to unhealthy allocations
  Instances   = 2 desired, 1 placed, 0 healthy, 1 unhealthy

Instances
ID              PROCESS VERSION REGION  DESIRED STATUS                  HEALTH CHECKS                   RESTARTS        CREATED
b585f97d        app     1 ⇡     ord     run     running (failed to co)  3 total, 3 critical             0               12m6s ago
bb55718d        app     0       ord     run     running (replica)       3 total, 2 passing, 1 critical  3               14h26m ago

Hey there!

Gah, that’s annoying.

You should be able to “resize” your volumes by following the procedure here (but with bigger volumes): How to convert your not-free Postgres to free Postgres
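In rough terms, that swap procedure is: add a bigger volume, let a new replica sync onto it, then retire one of the small ones. A sketch with flyctl (the pg_data volume name is the default that fly postgres create uses; the 40 GB size and the placeholder volume ID are just examples):

fly volumes list -a lysmore-postgres            # note the existing volume IDs and sizes
fly volumes create pg_data --region ord --size 40 -a lysmore-postgres
fly scale count 3 -a lysmore-postgres           # a new instance comes up on the larger volume and syncs
fly volumes destroy <old-replica-volume-id>     # placeholder ID; remove only a volume backing a replica
fly scale count 2 -a lysmore-postgres

One caution: make sure the volume you remove is hosting a replica, not the current master.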

Thanks @jerome. I’m attempting to resize now.

Creating a new volume and swapping them around (since I already had two production volumes) hasn’t helped. There isn’t a master now:

Instances
ID              PROCESS VERSION REGION  DESIRED STATUS                  HEALTH CHECKS                   RESTARTS        CREATED
06cd9582        app     3       ord     run     running (replica)       3 total, 2 passing, 1 critical  1               8m56s ago
13440ead        app     3       ord     run     running (failed to co)  3 total, 1 passing, 2 critical  1               14m33s ago
2022-02-03T14:10:32.044 app[13440ead] ord [info] exporter | ERRO[0157] Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[fdaa:0:3747:a7b:20dc:0:88a5:2]:5433/postgres?sslmode=disable): dial tcp [fdaa:0:3747:a7b:20dc:0:88a5:2]:5433: connect: connection refused source="postgres_exporter.go:1658"
2022-02-03T14:10:34.062 app[13440ead] ord [info] sentinel | 2022-02-03T14:10:34.057Z WARN cmd/sentinel.go:276 no keeper info available {"db": "12f406d9", "keeper": "205204cb72"}
2022-02-03T14:10:34.062 app[13440ead] ord [info] sentinel | 2022-02-03T14:10:34.060Z INFO cmd/sentinel.go:995 master db is failed {"db": "12f406d9", "keeper": "205204cb72"}
2022-02-03T14:10:34.062 app[13440ead] ord [info] sentinel | 2022-02-03T14:10:34.060Z INFO cmd/sentinel.go:1001 db not converged {"db": "12f406d9", "keeper": "205204cb72"}
2022-02-03T14:10:34.062 app[13440ead] ord [info] sentinel | 2022-02-03T14:10:34.060Z INFO cmd/sentinel.go:1006 trying to find a new master to replace failed master
2022-02-03T14:10:34.062 app[13440ead] ord [info] sentinel | 2022-02-03T14:10:34.060Z INFO cmd/sentinel.go:451 ignoring keeper since its behind that maximum xlog position {"db": "aced3b82", "dbXLogPos": 1612441449800, "masterXLogPos": 1620293188840}
2022-02-03T14:10:34.062 app[13440ead] ord [info] sentinel | 2022-02-03T14:10:34.061Z ERROR cmd/sentinel.go:1009 no eligible masters
2022-02-03T14:10:39.300 app[13440ead] ord [info] sentinel | 2022-02-03T14:10:39.299Z WARN cmd/sentinel.go:276 no keeper info available {"db": "12f406d9", "keeper": "205204cb72"}
2022-02-03T14:10:39.301 app[13440ead] ord [info] sentinel | 2022-02-03T14:10:39.301Z INFO cmd/sentinel.go:995 master db is failed {"db": "12f406d9", "keeper": "205204cb72"}
2022-02-03T14:10:39.301 app[13440ead] ord [info] sentinel | 2022-02-03T14:10:39.301Z INFO cmd/sentinel.go:1001 db not converged {"db": "12f406d9", "keeper": "205204cb72"}
2022-02-03T14:10:39.301 app[13440ead] ord [info] sentinel | 2022-02-03T14:10:39.301Z INFO cmd/sentinel.go:1006 trying to find a new master to replace failed master
2022-02-03T14:10:39.302 app[13440ead] ord [info] sentinel | 2022-02-03T14:10:39.301Z INFO cmd/sentinel.go:451 ignoring keeper since its behind that maximum xlog position {"db": "aced3b82", "dbXLogPos": 1612441449800, "masterXLogPos": 1620293188840}
2022-02-03T14:10:39.302 app[13440ead] ord [info] sentinel | 2022-02-03T14:10:39.301Z ERROR cmd/sentinel.go:1009 no eligible masters
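Those dbXLogPos/masterXLogPos values appear to be byte positions in the WAL stream: the surviving replica is at 1612441449800 while the failed master was last seen at 1620293188840, so the replica is roughly 7.8 GB of WAL behind and stolon refuses to promote it. If the replica is still reachable, you can check its position directly (a sketch, via fly postgres connect):

fly postgres connect -a lysmore-postgres
-- at the psql prompt, on the replica:
SELECT pg_is_in_recovery(), pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();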

I think you might have destroyed the volume that was the master, and the previous replica is behind. Give me a few minutes to look at it.

I was terrified of that happening, but I couldn't get it to change any other way. Here's the volume ID that I deleted: vol_w0enxv307ymv8okp

And this was the list of snapshots from before I deleted it:

ID                      SIZE            CREATED AT
vs_JDOOPx3Rgyq5xfva     2472286281      8 hours ago
vs_OaDMM3z14ZGp4fzl     2076121727      1 day ago
vs_17qzQkVzPZwbNcZV     2088840192      2 days ago
vs_oOb8vjKYO4AvLc44     2070411093      3 days ago
vs_p3Q7a8o6ZjP8qU6      2102723171      4 days ago
vs_klqLZYPGK196lhNB     2118267439      5 days ago
vs_8zVvm99OxPvj9Uom     2133000291      6 days ago

It’s fine; we soft-delete volumes for this reason. It’s back, and we’re working on getting the cluster resynced now. Give us a bit.

Ok. Thank you!

We’re going to remove the not-primary volumes temporarily, free up space on the existing primary, then re-add your replicas. So don’t be surprised if you see things disappear and reappear.

Confirmed! The database is back online (and I grabbed a backup immediately). The app itself says pending and won’t start/scale back up.

You should be up and running now. The disk had filled up with mostly write-ahead log files; we cleared those out to get it going again. I’m adding 40 GB disks and removing the old 10 GB disk to give it more headroom. That’s a lot of WAL space for a relatively small amount of data!
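For anyone who wants to watch for this themselves, Postgres can report how big pg_wal has grown relative to the data. Something like this at the psql prompt (pg_ls_waldir needs superuser or pg_monitor membership):

fly postgres connect -a lysmore-postgres
-- total size of everything under pg_wal:
SELECT pg_size_pretty(sum(size)) AS wal_size FROM pg_ls_waldir();
-- size of the database itself, for comparison:
SELECT pg_size_pretty(pg_database_size(current_database())) AS db_size;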

The app is still pending and won’t scale. Is there something I need to do to “unstick” it?

Ignore that last message. We’re back up and running now!

I think scaling unstuck it for you. Apps that crash repeatedly take a little longer to start back up, but they’ll eventually come back by themselves. Scaling hurries that up some.

I think your Postgres is still on a very large VM size. Feel free to scale that back down; I believe you’ll be good for a while at the smaller size.

Thanks for pointing that out. I’ve scaled back down to a dedicated 1x.

I really appreciate the help :cold_sweat:

No problem!

What seems to have happened here is a WAL overflow. The write-ahead logs only went back about 12 hours, but they were eating up 8 GB of disk space, which means the disk was too small for your workload.

This is technically something you could have prevented had you known to watch for it, but it’s also poor UX on our part. When we ship volumes that auto-expand, that will fix this. But we’re also thinking about other ways to help people prevent this problem in the future.

I think your DB was prone to this because it has a high ratio of writes to total data size. You should probably expect to keep your disks 10-20x bigger than your DB size if that ratio sticks.
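To put a number on that ratio, one rough approach is to sample the WAL insert position twice and diff the two values; the hour-long gap and the placeholder LSNs below are just examples:

-- at the psql prompt:
SELECT pg_current_wal_lsn();   -- note the value
-- ...wait an hour or so, then sample again...
SELECT pg_current_wal_lsn();
-- bytes of WAL generated between the two samples:
SELECT pg_size_pretty(pg_wal_lsn_diff('<second_lsn>', '<first_lsn>'));

Multiply that hourly rate by however long WAL is being retained (about 12 hours here) to get a rough floor for the headroom the disk needs on top of the data itself.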

The writes are almost entirely from the public demo, which writes at a consistent rate (and has for the life of the app). So the ratio isn’t changing any time soon, and I’ll keep the volumes scaled up :+1:
