Unannounced Maintenance?

Got a report from a user that they couldn’t access my site. I checked the database and found that the last log line was “Server bk_db/pg1 is going DOWN for maintenance (unspecified DNS error). 0 active and 1 backup servers left. Running on backup. 1 sessions active, 0 requeued, 0 remaining in queue.”

Some googling seems to indicate that this was a hardware failure of the underlying machine?

It wasn’t maintenance, but one of our hosts in CDG was down for ~15 minutes and had to be rebooted. Might be related!

How did you manage to restart it? My pg app is stuck. It’s been several hours, too.

We run bare-metal hosts, so we had to restart it via our provider’s console. We couldn’t reach the host any other way.

For your app, can you try restarting the affected instance?

Fly documentation is not great. I tried fly pg restart several times and got nowhere; it just kept saying “no active leader found”. Did some more googling, found fly machines restart, and that did the trick. Thanks for your help!


Ah yes. That pg command requires a working cluster (apparently). The fly machine commands don’t care.
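For anyone landing here with the same “no active leader found” error, here is a rough sketch of the machine-level restart described above. The app name is the one from this thread; the machine ID is a placeholder you would need to look up yourself, and the FLY variable and the two helper functions are made up for illustration. Check fly machine --help for the exact options in your flyctl version.

```shell
# Sketch only: the machine ID you pass in is a placeholder, not output from this thread.
APP="fencing-database-db"
FLY="${FLY:-fly}"   # hypothetical override: set FLY=echo to preview commands without running them

# List the app's machines to find the ID of the stuck Postgres machine.
list_machines() {
  "$FLY" machine list -a "$APP"
}

# Restart one machine directly by ID. Per the reply above, the machine
# commands do not need a working pg cluster.
restart_machine() {
  "$FLY" machine restart "$1" -a "$APP"
}
```

Run list_machines first, then restart_machine with the ID it shows for the stuck instance.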

We are working on a better pg solution that’s not so adversely affected by down nodes.

On the off chance you or someone reading this works for fly, the full sequence was:
fly restart -a fencing-database-db
fly restart fencing-database-db
fly apps restart fencing-database-db
fly pg restart -a fencing-database-db

I ran each of those in order. I found fly restart in the documentation somewhere, and each one in turn told me to run the next, instead of sending me directly to the last. Very very bad UX. And, of course, nowhere was fly machines restart mentioned.


For future reference, the fly machine restart command is documented here!
