Our web service is down
When will it be back to normal?
Ours is down as well
Our IAD servers are out as well, along with the machine API.
fly.io itself, our machines in several different regions, the cli.
this isn’t good.
We had managed to make it through the current incident unscathed until a few minutes ago but it appears to be worsening. Many (but not all!) of our apps in iad are reporting no machines.
Major infrastructure issue. Our API has been down for over 2 hours.
Yeah they say “degraded API performance” but I can’t get any of my API calls to work:
| Error: server returned a non-200 status code: 504
And cannot even load fly.io anymore.
Has there been any other communication other than the status page updates?
Not that I’m aware, and I don’t think we should expect it for a hot minute—lots of signal to indicate this has ballooned (see what I did there) into a widespread, likely even global, outage.
At least the status page and Discourse sites are up
#hugops
our apps appear to be up again, cli is working, fly.io is accessible.
Things seem to be getting better, though I still have no CLI access at all which is making it really hard to restore our services
#hugops for sure, this musta gotten way bigger than they expected
Yep, we’re in the same boat. Deploys are still 504-ing (not Depot), and all attempts to roll existing instances are also 504-ing. Not out of the woods yet.
Down for me as well as of 7:59 MST. 504’ing after waiting for the depot. Sucks as I had just gotten a solution ready to test apparently right as it went down. Always how it goes lol
@bobbyhiddn
Have similar incidents happened before? In my view, an outage lasting several hours is a very serious incident. If such things happen frequently, we may need to seriously consider migrating away from Fly.io. I really like the convenience that Fly.io brings, but stability is always the highest priority as we operate in the financial payment industry.
So far, no. I’ve been using it for a few months now and the convenience has been a huge value add as it lets me black box most of the deployment stream while testing. This has been my first major incident with the platform. So far, none of my products are making money, just some development ideas, so it’s not a huge deal for me, but if they were, I would be concerned.
I am surprised they haven’t bothered to comment here, though.
We’ve been here for just about a year and a half. For sure not the first major outage. This is however one of the longest-lasting ones that I’ve personally seen. I think I’ve experienced about 4-5 other large outages with Fly, most lasting less than an hour with only one that I can remember lasting more than 2. This is by far the worst one I’ve experienced and is causing a lot of issues on our end. Definitely feel for the team, scaling server infra from scratch like this is a massive undertaking.
Take it as a sign that they’re all-in on trying to fix it. I’m sure they will reply once they aren’t all hands on deck
We are 30 minutes away from our school project presentation, and we are very flustered.