Postgres database needed to be manually restarted

My Postgres instance started mysteriously failing all connections earlier today, and continued to be unresponsive until I manually restarted it 12 hours later.

I couldn’t find much in the logs, just this:

2022-07-31T23:06:43Z app[387d83f3] iad [info]exporter | INFO[1059715] Established new database connection to "fdaa:0:bff:a7b:ab8:0:65c0:2:5432".  source="postgres_exporter.go:970"
2022-07-31T23:06:44Z app[387d83f3] iad [info]exporter | ERRO[1059716] Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[fdaa:0:bff:a7b:ab8:0:65c0:2]:5432/postgres?sslmode=disable): dial tcp [fdaa:0:bff:a7b:ab8:0:65c0:2]:5432: connect: connection refused  source="postgres_exporter.go:1658"

Restarting the Postgres instance fixed the issue right away, but I’m a bit puzzled about how to avoid this happening again. Exactly the same thing happened on June 3, 2022. The app’s name is mess-with-dns-pg.

Is there a way to add a healthcheck to my Postgres database so that it can automatically restart itself if it gets into a bad state?
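
For illustration, a check of this kind could be as simple as a TCP probe of port 5432, which is the level at which the exporter's "connection refused" error above occurs. This is only a minimal sketch, not a Fly.io feature: the function name and the localhost address are hypothetical, and it only detects the failure. Restarting the instance would still have to be triggered separately.

```python
import socket


def postgres_port_is_open(host: str, port: int = 5432, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Hypothetical address: substitute the real Postgres host.
    host = "localhost"
    print("up" if postgres_port_is_open(host) else "down")
```

A successful TCP connect only shows that something is listening. It does not prove Postgres can answer queries, so a real check would probably also run a trivial query.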

It looks like the DB process OOMed several times and then we gave up trying to restart it. We should have cycled the VM when this happened, but I think you may be on an old Fly Postgres build that doesn’t handle this as well. Let me find out if that’s upgradeable.

I think upgrading to 1GB of RAM will prevent this.

Thanks so much for looking into it! It looks like it might have automatically upgraded when I restarted it.

This happened again today – it upgraded from v8 to v9 when I restarted it. Just to make sure – which Postgres build version fixes this issue? (Is it v9?)

Oh, that’s actually the job version. If you run fly image show, it’ll tell you the Postgres image version you’re running. This is the latest:

Image Details
  Registry   = registry-1.docker.io
  Repository = flyio/postgres
  Tag        = 12.10
  Version    = v0.0.25

Did you get an email alert about this? We enabled out-of-memory crash notifications last week.

This is what I’m on. I’m using postgres-standalone instead of postgres:

Image Details
  Registry   = registry-1.docker.io
  Repository = flyio/postgres-standalone
  Tag        = 14.1
  Version    = v0.0.7
  Digest     = sha256:ca27c53b81cae713e67d7ced87a4289961db4a81e382b09aaf42ea53032791eb

I did get an email alert, but I’ve been ignoring them because it seems to run out of memory only about once a day, and it seems to only take about 15 seconds to restart. So that feels like an acceptable amount of downtime.
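
As a rough back-of-the-envelope check on that trade-off, the sketch below converts daily downtime into an availability figure. It assumes exactly one 15-second restart per day (the numbers mentioned above) and nothing else going wrong; the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def availability(downtime_seconds_per_day: float) -> float:
    """Fraction of the day the database is reachable."""
    return 1.0 - downtime_seconds_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    # One 15-second restart per day.
    print(f"{availability(15):.4%}")  # prints 99.9826%
```

The figure only holds while the automatic restarts keep succeeding. An outage like the 12-hour one that started this thread is not covered by it.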

Oooh right. The standalone PG image doesn’t have the most recent fixes. Let me see if we can get that swapped to the newer image.

OK, that didn’t work at all. Give us a few minutes to see what went wrong. I’m sorry about this!

We are still working on this. The upgrade attempt got your DB into a bad state that we haven’t seen before.

We restored a backup of your database to mess-with-dns-pg-bak. You can update your app to use it, if you’d like, or just hold tight until we get the original running again.

OK, everything is up and running. Can you verify that your main PG is acting the way you want? If it’s good, you can delete the backup with fly apps destroy mess-with-dns-pg-bak.

Everything looks good to me, thanks so much!