Postgres database not responding at all

I got an email from my monitoring solution that my site is down. Cloudflare confirmed the same.

I log in and see the following errors in the app logs:

 2022-02-21T11:27:54.929 app[b0a5866f] ord [info] 2022/02/21 11:27:54 ping db: failed to connect to `host=redacted.internal user=redacted database=redacted`: failed to receive message (unexpected EOF) 
 2022-02-21T11:27:55.829 app[b0a5866f] ord [info] Main child exited normally with code: 1

2022-02-21T11:27:55.830 app[b0a5866f] ord [info] Starting clean up. 

So I check the database and I see that fly is reporting it to be happy and normal.
The logs say nothing either:

 2022-02-21T11:37:45.992 app[9db2cae4] ord [info] keeper | 2022-02-21T11:37:45.992Z INFO cmd/keeper.go:1557 our db requested role is standby {"followedDB": "909c9634"}

2022-02-21T11:37:45.993 app[9db2cae4] ord [info] keeper | 2022-02-21T11:37:45.992Z INFO cmd/keeper.go:1576 already standby

2022-02-21T11:37:46.015 app[9db2cae4] ord [info] keeper | 2022-02-21T11:37:46.014Z INFO cmd/keeper.go:1676 postgres parameters not changed

2022-02-21T11:37:46.015 app[9db2cae4] ord [info] keeper | 2022-02-21T11:37:46.015Z INFO cmd/keeper.go:1703 postgres hba entries not changed

2022-02-21T11:37:49.214 app[3bd4ed4d] ord [info] keeper | 2022-02-21T11:37:49.214Z INFO cmd/keeper.go:1505 our db requested role is master

2022-02-21T11:37:49.215 app[3bd4ed4d] ord [info] keeper | 2022-02-21T11:37:49.215Z INFO cmd/keeper.go:1543 already master

2022-02-21T11:37:49.243 app[3bd4ed4d] ord [info] keeper | 2022-02-21T11:37:49.243Z INFO cmd/keeper.go:1676 postgres parameters not changed

2022-02-21T11:37:49.244 app[3bd4ed4d] ord [info] keeper | 2022-02-21T11:37:49.243Z INFO cmd/keeper.go:1703 postgres hba entries not changed 

I deployed the app just over a month ago and it was running fine until about 15 hours ago.

I see others facing a similar issue. Anything else that can be done here?

We’re investigating what happened, but I think my restart of one of your postgres instances fixed it. Somehow, one of your postgres instances was unreachable: it closed every connection right away.

I just rescheduled your app so it would come back up (it did).

This could be related to the fact that your postgres cluster seems to be crashing every day due to out-of-memory kills. Once it came back up, it seems to have been in a “bad” state (even though logs showed no issues).

Still, this shouldn’t happen and we’ll keep looking into it.

You might want to try and scale up the memory limit for your postgres cluster. It’s currently at 256MB. flyctl scale memory 1024 should go a long way.
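In case it helps, here is a rough sketch of that scaling step. The app name in PG_APP is a made-up placeholder; use the name of your postgres app, not your web app. The --app flag is my assumption about how you would target it from outside the app's directory.

```shell
# Hypothetical name: replace with the name of your postgres app.
PG_APP="my-postgres-app"

# Show the current VM size and memory limit (256MB in this case).
flyctl scale show --app "$PG_APP"

# Raise the memory limit to 1024MB, as suggested above.
flyctl scale memory 1024 --app "$PG_APP"

# Check that the instances came back up after the change.
flyctl status --app "$PG_APP"
```

Changing the memory limit will most likely restart the postgres VMs, so expect a brief interruption while they come back.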

When I looked at the logs and the status of the database, nothing stood out that would point me towards this. I did suspect it when I saw that the memory usage of the VM was close to the limit, but I ruled it out because I didn’t find anything complaining about it.

Sure, I’ll do that for now, but how much is enough? My app is pretty light on the DB: only a couple of queries a minute.