app is currently down for maintenance

eddieco · June 15, 2023, 4:58pm

we’re running an app in ewr and have this info message in our dashboard. both our development and production apps are completely unreachable. just wondering if there’s any word on this or when it may be resolved?
our logs say: "could not find a good candidate within 90 attempts at load balancing. last error: unreachable worker host. the host may be unhealthy. this is a Fly issue."

allison · June 15, 2023, 5:31pm

If I’m reading correctly, it looks like ewr was unreachable for a brief moment around half an hour ago. It should be better now. Please try again, and if there are still issues deploying, I’ll escalate it.

eddieco · June 15, 2023, 5:34pm

still unreachable

ewr-flier · June 15, 2023, 5:41pm

We’re seeing it as well. Postgres on EWR has been down for the past 3 hours.

allison · June 15, 2023, 5:43pm

I was not reading correctly, sorry!

A fix is being worked on; I’ll update when this is resolved.

ewr-flier · June 15, 2023, 5:43pm

Appreciate the response. Is there an ETA?

ewr-flier · June 15, 2023, 5:44pm

Also, https://status.flyio.net/ does not seem to show the outage; it sounds like there’s an issue with that reporting page as well.

Davesp · June 15, 2023, 5:58pm

We are also experiencing the same on an app with a pg cluster. App has been totally out of action for over 4 hours now.

phantop · June 15, 2023, 6:42pm

Can confirm, also hosting something personal on ewr and it’s been down since this morning.

allison · June 15, 2023, 6:45pm

Ops team is looking into it. I sadly don’t have an ETA to share right now.
We typically report issues that affect the service as a whole on the statuspage, and issues that affect only existing apps on the app’s issues tab. Today’s issue is an unusual one because it’s tied to a single host, but it’s having an outsized effect compared to typical single-host issues. There’s an internal conversation going on right now about how we could better communicate situations like this that don’t neatly fit in our incident categories.

If at all possible, please try to move your apps to other regions temporarily. If not (if, for example, you use volumes), we’re really trying to get this back up in a timely manner. I’ll update again when there’s more to share.

allison · June 15, 2023, 6:56pm

The issue should be cleared. Please let us know if there are any more issues!

Davesp · June 15, 2023, 7:12pm

It has unfortunately left one of our Postgres clusters in a broken state.

ewr-flier · June 15, 2023, 7:29pm

Same here. Postgres is in a broken state.

ewr-flier · June 15, 2023, 7:34pm

Manually restarted the app with flyctl machine start and it seemed to improve.
@allison The postgres log is full of "error configuring operator user: can't scan into dest[3]: cannot scan null into *string", which was not the case prior to this outage.