Production database randomly going down and not recovering

RESOLVED: fixed by redeploying. I found this thread of someone running into the same issue. status.fly.io doesn't show anything wrong; is there any way for us to set up alerts on Fly when things like this happen?

Hi,

Approximately 3 hours ago, our Postgres database suddenly became unavailable through fly.io's internal DNS.

Our production application is throwing:

2022-07-10T07:36:21Z app[afc1892f] fra [info] Error  Can't reach database server at `top2.nearest.of.[db-appname].internal`:`5432`

fly logs -a [db-appname] shows:

2022-07-10T07:40:36Z app[0614d4ac] fra [info]keeper   | 2022-07-10T07:40:36.803Z        INFO    cmd/keeper.go:1504      our db requested role is master
2022-07-10T07:40:36Z app[0614d4ac] fra [info]keeper   | 2022-07-10T07:40:36.804Z        INFO    cmd/keeper.go:1542      already master
2022-07-10T07:40:36Z app[0614d4ac] fra [info]keeper   | 2022-07-10T07:40:36.836Z        INFO    cmd/keeper.go:1675      postgres parameters not changed
2022-07-10T07:40:36Z app[0614d4ac] fra [info]keeper   | 2022-07-10T07:40:36.836Z        INFO    cmd/keeper.go:1702      postgres hba entries not changed
2022-07-10T07:40:36Z app[0614d4ac] fra [info]keeper is healthy, db is healthy, role: master
2022-07-10T07:40:37Z app[0614d4ac] fra [info]configuring operator
2022-07-10T07:40:37Z app[0614d4ac] fra [info]error configuring operator user: can't scan into dest[3]: cannot scan null into *string
2022-07-10T07:40:37Z app[0614d4ac] fra [info]checking stolon status
2022-07-10T07:40:37Z app[0614d4ac] fra [info]keeper is healthy, db is healthy, role: master
2022-07-10T07:40:37Z app[0614d4ac] fra [info]configuring operator
2022-07-10T07:40:37Z app[0614d4ac] fra [info]error configuring operator user: can't scan into dest[3]: cannot scan null into *string
2022-07-10T07:40:38Z app[0614d4ac] fra [info]checking stolon status
2022-07-10T07:40:38Z app[0614d4ac] fra [info]keeper is healthy, db is healthy, role: master

fly ssh console -a [db-appname] shows:

WARN app flag '[db-appname]' does not match app name in config file '[app-appname]'
? Continue using '[db-appname]' Yes
Error host unavailable: host was not found in DNS

fly dig aaaa [db-appname].internal gives me an empty response:

;; opcode: QUERY, status: NXDOMAIN, id: 25511
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;[db-appname].internal.       IN       AAAA

fly dig resolves all of our other apps just fine; only the database returns an empty response.
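
For comparison (app names are placeholders; the fdaa:… address stands in for whatever private IPv6 a healthy app returns):

$ fly dig aaaa [app-appname].internal   # our web app: resolves to its private fdaa:… address
$ fly dig aaaa [db-appname].internal    # the database: NXDOMAIN, empty answer section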

I'm not sure if this is something on our end, but we hadn't changed anything when things broke. (It was 6 am!)

This isn't the result of an outage; it's some kind of bug we haven't tracked down yet. Postgres can get into a state (usually when the process crashes) where it gets removed from internal DNS and then doesn't recover until you restart it, like you did.

Is this a 2+ node Postgres or single node? If it gets into this state again, try running fly status -a <name> to see whether an instance is failing health checks, then run fly vm status <id> to see if it has any weird exit codes.
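
Roughly this sequence (placeholder names):

$ fly status -a <db-appname>                    # look for an instance failing health checks
$ fly vm status <instance-id> -a <db-appname>   # check Recent Events for odd exit codes
$ fly logs -a <db-appname>                      # watch the keeper/stolon output around the crash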

Not to sound harsh at all (I really appreciate you following up with me on this), but this is a scary bug. We didn't touch Postgres at all; it just went down by itself. Could we somehow make sure we get notified, or that Postgres gets restarted, when it goes down? I'd really appreciate even the most hacky of workarounds, like the untested sketch below :smile:
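
Something like this is what I have in mind; it's a totally untested watchdog sketch, assumes flyctl is authenticated on the box running it from cron, [db-appname] is a placeholder, and the echo is where real alerting would go:

#!/usr/bin/env bash
# Hacky watchdog sketch (untested): run from cron every minute.
set -euo pipefail

# Same failure signature as above: the app vanishes from internal DNS.
if fly dig aaaa "[db-appname].internal" | grep -q NXDOMAIN; then
  echo "[db-appname] dropped out of internal DNS, restarting" >&2  # swap in real alerting here
  fly apps restart "[db-appname]"
fi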

It was a single-node Postgres. One weird thing I noticed is that it had multiple restarts (3 or 5) but was listed in a "healthy" state. It didn't say anything about being in a failed state or needing to recover. (Maybe that's where the issue lies? Fly.io assumes the wrong state and doesn't restart the crashed process as a result.)

Is there any way I can get at the past logs of the PG instance? I'd like to see what caused those restarts; maybe that could help you track down the issue. If I had to guess, it's very likely…

I’m unsure if this is the same issue, but it’s happened again:

At 6:12 AM UTC, our production database went down.

Checking the status of the VM that failed to restart:

$ fly vm status <failed vm id> -a <db-appname>

Instance
  ID            = <failed vm id>
  Process       =
  Version       = 4
  Region        = <failed vm region>
  Desired       = stop
  Status        = complete
  Health Checks = 3 total, 2 passing, 1 critical
  Restarts      = 7
  Created       = 2022-07-10T08:02:28Z

Recent Events
TIMESTAMP            TYPE       MESSAGE
2022-07-10T08:02:25Z Received   Task received by client
2022-07-10T08:02:42Z Task Setup Building Task Directory
2022-07-10T08:02:49Z Started    Task started by client
2022-07-11T13:45:35Z Terminated Exit Code: 2
2022-07-11T13:45:35Z Restarting Task restarting in 1.049689186s
2022-07-11T13:45:46Z Started    Task started by client
2022-07-16T21:01:50Z Terminated Exit Code: 2
2022-07-16T21:01:50Z Restarting Task restarting in 1.22792229s
2022-07-16T21:02:01Z Started    Task started by client
2022-07-17T08:16:53Z Terminated Exit Code: 2
2022-07-17T08:16:54Z Restarting Task restarting in 1.143081483s
2022-07-17T08:17:05Z Started    Task started by client
2022-07-17T16:13:45Z Terminated Exit Code: 2
2022-07-17T16:13:47Z Restarting Task restarting in 1.178063081s
2022-07-17T16:13:58Z Started    Task started by client
2022-07-17T16:58:48Z Terminated Exit Code: 2
2022-07-17T16:58:48Z Restarting Task restarting in 1.199760029s
2022-07-17T16:58:59Z Started    Task started by client
2022-07-17T19:40:55Z Terminated Exit Code: 2
2022-07-17T19:40:56Z Restarting Task restarting in 1.181164063s
2022-07-17T19:41:07Z Started    Task started by client
2022-07-18T03:50:17Z Terminated Exit Code: 2
2022-07-18T03:50:18Z Restarting Task restarting in 1.23443399s
2022-07-18T03:50:29Z Started    Task started by client
2022-07-18T08:02:58Z Killing    Sent interrupt. Waiting 5m0s before force killing
2022-07-18T08:03:15Z Terminated Exit Code: 0
2022-07-18T08:03:16Z Killed     Task successfully killed
2022-07-18T08:03:16Z Killing    Sent interrupt. Waiting 5m0s before force killing

Checks
ID   SERVICE STATE    OUTPUT
vm   app     passing  HTTP GET http://172.19.10.106:5500/flycheck/vm: 200 OK Output: "[✓] checkDisk: 9.13 GB (93.4%) free space on /data/ (31.64µs)\n[✓] checkLoad: load averages: 0.22 0.21 0.17 (49.8µs)\n[✓] memory: system spent 0s of the last 60s waiting on memory (30.12µs)\n[✓] cpu: system spent 1.97s of the last 60s waiting on cpu (16µs)\n[✓] io: system spent 0s of the last 60s waiting on io (14.31µs)"
pg   app     critical HTTP GET http://172.19.10.106:5500/flycheck/pg: 500 Internal Server Error Output: "failed to connect to proxy: context deadline exceeded"
role app     passing  leader
Recent Logs

There are no logs, because I ran this command after redeploying the DB, but they looked just like the ones in the original post!

This is not meant to sound harsh at all, but how do we solve this issue once and for all? We really can’t have our prod DB going down just like that!

Try running fly image update. The latest Postgres release includes (1) less chatty logs and (2) some improvements that could make it more resilient to this issue.

I'm hoping the cleaner logs will give you an idea of why the database is crashing. Exit Code: 2 means something with Postgres failed.
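
Roughly (placeholder app name):

$ fly image show -a <db-appname>     # check which Postgres image you're currently running
$ fly image update -a <db-appname>   # roll onto the latest release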

Thank you! I just set up fly-log-shipper today, and it worked like a charm. The two go hand in hand!
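
For anyone finding this later, my setup was roughly the following; I'm going from memory, so double-check the superfly/fly-log-shipper README for the exact image and secret names:

$ fly launch --image ghcr.io/superfly/fly-log-shipper:latest     # deploy the shipper as its own app
$ fly secrets set ORG=<org-slug> ACCESS_TOKEN=$(fly auth token)  # let it subscribe to the org's log stream
$ fly secrets set LOGTAIL_TOKEN=<token>                          # plus one secret per log sink you use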

Much appreciated :slight_smile: