Is EWR down?

I have two apps in EWR that are both suddenly stuck in pending status after working fine just an hour ago.

To be clear: one of them was up and running, with no deployments or changes, and now Instances lists no apps and the status is pending. Trying to deploy to either app gets stuck on “Running release task (pending)…”

Same here. Contacted the support email but was directed back here. I have 3 unrelated apps all stuck in pending now. There needs to be something other than a forum for issues like this.

I understand that the forum helps Fly handle support at scale, but an outage/service issue is not something the community can help with.

Yep, 2 of mine have been down since 1:46 PM EST.

Mine has been down since about 1:30 PM EST.

Looks like the status page just updated:

Everything should be resolved now, please let us know if you experience any other issues.

We encountered some disk-capacity issues in EWR, and the work we did to resolve them triggered a few unexpected surprises in our Nomad-based instance scheduler. Some instances were interrupted and remained in a pending state for a while (particularly volume-attached or single-region instances that couldn’t be placed elsewhere). Sorry for the interruptions! We’ll be investigating the surprises we encountered to prevent this kind of issue from occurring again in the future.

Thanks for the details and the update.

Were you all aware of the issues before customers reported them? I checked the status page around 1:30 ET, and all systems were listed as operational. It would be great if you could update that page more readily, or provide an alert. I have monitoring in the app that gave me a heads-up, but I spent a bunch of time trying to fix it on my end since the cause was unclear.

In addition to in-house monitoring, is there a way to report an outage to you all?

Yes, we were aware of the issue soon after it began around 1:30 ET, though its ongoing impact on some applications wasn’t fully clear until the first customer reports arrived. Your initial report (around 2:20 ET) helped us confirm the impact was more severe than we initially thought based on our metrics, and we updated the status page 12 minutes later.
We’re working on fixing the bug that caused this unexpected issue, as well as adding more thorough monitoring to more quickly gauge severity for this particular type of incident. Customer reports will always be helpful though, and this forum is usually the quickest way to get our attention.

Could this have caused performance issues in DFW as well?

Hi @stephenb, do you have apps deployed in DFW and are you seeing similar issues currently in DFW?

Can you please post the error messages you’re seeing?

Well, I’m not totally clear on the impact experienced in EWR… but while this issue was happening yesterday, I was trying to debug unusually slow performance in a PG instance in DFW. The timing just makes me curious.

Ah, I understand, but capacity issues in EWR shouldn’t affect DFW, so the two are not related.