how to automatically restart a managed postgres when it crashes?

My postgres standalone instance (mess-with-dns-pg) crashed today and started printing out these errors repeatedly. Restarting the instance manually fixed the issue, but it didn’t restart automatically.

Is there a way I can set up a healthcheck so that the postgres instance automatically restarts itself if there’s a problem?

2022-08-21T17:33:30Z app[3461a67a] iad [info]keeper   | 2022-08-21 17:33:30.956 GMT [17951] FATAL:  pre-existing shared memory block (key 131073, ID 7) is still in use
2022-08-21T17:33:30Z app[3461a67a] iad [info]keeper   | 2022-08-21 17:33:30.956 GMT [17951] HINT:  Terminate any old server processes associated with data directory "/data/postgres".
2022-08-21T17:33:31Z app[3461a67a] iad [info]keeper   | 2022-08-21T17:33:31.130Z	ERROR	cmd/keeper.go:1526	failed to start postgres	{"error": "postgres exited unexpectedly"}
2022-08-21T17:33:31Z app[3461a67a] iad [info]sentinel | 2022-08-21T17:33:31.469Z	WARN	cmd/sentinel.go:276	no keeper info available	{"db": "0f41122e", "keeper": "ab805b922"}
2022-08-21T17:33:31Z app[3461a67a] iad [info]sentinel | 2022-08-21T17:33:31.475Z	ERROR	cmd/sentinel.go:1009	no eligible masters

So this is a thing we’re not sure how to handle. That Postgres crashed because it OOMed, which can corrupt data. We do pretty aggressive cleanup when a new VM boots against that disk to make it work, but I’m not fully comfortable doing that automatically on crash (it is better for someone to go “this is, in fact, ok”).

I think you can configure the built in healthcheck to restart for you if you run:

fly config save -a mess-with-dns-pg

# edit fly.toml under []
# restart_limit = 3 (instead of restart_limit = 0)

fly deploy -i flyio/postgres:14 -a mess-with-dns-pg

This should make our health checker do the (possibly destructive) restart for you after 3 consecutive Postgres healthcheck failures.
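To make the "restart after 3 failures in a row" idea concrete, here is a rough shell sketch of that counting logic. It is not Fly's health checker and does not call any Fly API. The function name should_restart and the pass/fail inputs are made up for illustration, and treating a limit of 0 as "never restart" is an assumption based on the default mentioned above.

```shell
#!/usr/bin/env bash
# Sketch only: models "restart after N consecutive failed checks".
# Hypothetical helper, not part of flyctl or the Postgres app.

# should_restart LIMIT RESULT...
# Prints the 1-based index of the check at which a restart would fire,
# or "never" if the streak of failures never reaches LIMIT.
# Assumption: a LIMIT of 0 disables automatic restarts.
should_restart() {
  local limit=$1; shift
  local failures=0 i=0 result
  for result in "$@"; do
    i=$((i + 1))
    if [ "$result" = "fail" ]; then
      failures=$((failures + 1))
    else
      failures=0   # any passing check resets the streak
    fi
    if [ "$limit" -gt 0 ] && [ "$failures" -ge "$limit" ]; then
      echo "$i"
      return 0
    fi
  done
  echo "never"
}

# Two failures, a pass, then three failures: fires on the 7th check.
should_restart 3 pass fail fail pass fail fail fail
# With a limit of 0, nothing ever triggers a restart.
should_restart 0 fail fail fail fail
```

The point of resetting the counter on a passing check is that a single flaky check does not trigger the cleanup; only a sustained run of failures does.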

Also, you should know these Postgreses aren’t “managed” exactly, they’re automated Fly apps. Normally people think “managed” means “a human responds when something goes wrong”, but that’s not something we do. We’ve worked to make this more obvious, but I’m guessing we set the wrong expectation when you created yours.

thanks so much! I can definitely stop asking questions about my database problems here if it’s not helpful – I can’t always tell if a problem I’m having is unique to me or not.

Oh we love the questions. I was just concerned that you were expecting something else from “managed”. Asking questions in the forum is exactly what we want. And we’re happy to look at DBs when people ask forum questions.

nope, I’m definitely only expecting “I can ask questions in the forums and hopefully get answers eventually” :)

By “managed” I just meant that I don’t have control over the docker image so I don’t know how it works (like it’s not an image I built myself)

Ok cool. That fits! phew.

The actual source code for the Postgres app is here, btw: fly-apps/postgres-ha on GitHub (Postgres + Stolon for HA clusters as Fly apps).

You can do all kinds of fun stuff with that. fly deploy -i flyio/postgres:14 just deploys our public build of that exact app. If you fork it and fiddle around, you can deploy your changes over the DB app we created for you.
