My Postgres instance mysteriously started failing all connections earlier today and stayed unresponsive until I manually restarted it 12 hours later.
I couldn’t find much in the logs, just this:
2022-07-31T23:06:43Z app[387d83f3] iad [info]exporter | INFO[1059715] Established new database connection to "fdaa:0:bff:a7b:ab8:0:65c0:2:5432". source="postgres_exporter.go:970"
2022-07-31T23:06:44Z app[387d83f3] iad [info]exporter | ERRO[1059716] Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[fdaa:0:bff:a7b:ab8:0:65c0:2]:5432/postgres?sslmode=disable): dial tcp [fdaa:0:bff:a7b:ab8:0:65c0:2]:5432: connect: connection refused source="postgres_exporter.go:1658"
Restarting the Postgres instance fixed the issue right away, but I’m a bit puzzled about how to avoid this happening again. Exactly the same thing happened on June 3, 2022. The app’s name is mess-with-dns-pg.
Is there a way to add a healthcheck to my Postgres database so that it can automatically restart itself if it gets into a bad state?
It looks like the DB process OOMed several times and then we gave up trying to restart it. We should have cycled the VM when this happened, but I think you may be on an old Fly Postgres build that doesn’t handle this as well. Let me find out if that’s upgradeable.
I think upgrading to 1GB of RAM will prevent this.
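As a rough illustration of the kind of check asked about above, here is a minimal sketch of an external watchdog in Python. It only tests whether the Postgres port accepts TCP connections, which is the failure shown in the "connection refused" log line. The function names, the 3-failure threshold, and the timeout are assumptions for illustration, not part of Fly Postgres or its tooling.

```python
import socket


def postgres_port_open(host: str, port: int = 5432, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout.

    This only shows that something is listening on the port. It does not
    prove that Postgres can actually answer queries.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and name resolution errors.
        return False


def should_restart(results: list[bool], threshold: int = 3) -> bool:
    """Return True when the most recent `threshold` checks all failed.

    Requiring several consecutive failures (an assumed policy) avoids
    reacting to a single blip, such as a brief restart after an OOM.
    """
    recent = results[-threshold:]
    return len(recent) == threshold and not any(recent)
```

A scheduled job could call postgres_port_open with the database's address, append each result to a list, and trigger a restart by whatever mechanism you already use once should_restart returns True. Note that this would only shorten the outage; it would not address the underlying out-of-memory condition.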
This happened again today. It upgraded from v8 to v9 when I restarted it. Just to make sure: which Postgres build version fixes this issue? Is it v9?
This is what I’m on. I’m using postgres-standalone instead of postgres:
Image Details
Registry = registry-1.docker.io
Repository = flyio/postgres-standalone
Tag = 14.1
Version = v0.0.7
Digest = sha256:ca27c53b81cae713e67d7ced87a4289961db4a81e382b09aaf42ea53032791eb
I did get an email alert, but I’ve been ignoring them because it seems to run out of memory only about once a day, and it seems to only take about 15 seconds to restart. So that feels like an acceptable amount of downtime.
We restored a backup of your database to mess-with-dns-pg-bak. You can update your app to use it, if you’d like, or just hold tight until we get the original running again.
OK, everything is up and running. Can you verify that your main PG is acting the way you want? If it’s good, you can delete the backup with fly apps destroy mess-with-dns-pg-bak.