Do they not have any staff in the GMT/CET timezone anymore?
It’s been hours, we’ve sent emails, and we’re still having the same issues: constantly trying to spin up new machines, waiting until they slowly die, spinning up new ones…
Same situation here: one app that’s down because I deployed during the incident, and one app that’s working because I left it alone during the incident. I’m probably going to hold off on attempting any deploys to that working app for a while!
Going to host on a new platform in the meantime, as this is pretty unreliable and there isn’t even an ETA for a fix here.
It shouldn’t be too hard to keep the status page properly updated when you’re serving this many customers.
We’ve just deployed to an app that was working, and it’s now broken (it’s running, but the health check fails). I think the problem is its connection to Upstash Redis.
I’m seeing something like this as well: downtime for about 15 minutes, and then it comes back to “sort-of” normal.
Yeah, I’m still having issues.
Error: release command failed - aborting deployment. error running release_command machine: error updating release_command machine: failed to update VM 3d8d503b97e568: invalid_argument: unable to update machine configured for auto destroy (Request ID: 01JDM8DT8X7SRAWAV0VS515Q8M-iad) (Trace ID: 9792fa700168d7617cfe71c310005e60)
For anyone seeing something similar: I found a machine in the dashboard that was not in a started state, destroyed it, and my deploy went through.
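If you’d rather script that than click through the dashboard, something roughly like this should do the same thing via the Fly Machines API (a sketch only: the app name and token env var are placeholders, and I’m going from memory on the list/delete endpoints, so double-check them against the Machines API docs before relying on it):

```ts
// Find machines that aren't "started" and destroy them so the deploy can proceed.
// This just mirrors what I did by hand – a stopped machine isn't necessarily
// broken, so don't run this as a general cleanup.
const API = 'https://api.machines.dev/v1';
const APP_NAME = 'my-app'; // placeholder
const headers = { Authorization: `Bearer ${process.env.FLY_API_TOKEN}` };

async function destroyStuckMachines(): Promise<void> {
  const res = await fetch(`${API}/apps/${APP_NAME}/machines`, { headers });
  const machines: { id: string; state: string }[] = await res.json();

  for (const m of machines) {
    if (m.state === 'started') continue;
    console.log(`destroying ${m.id} (state: ${m.state})`);
    await fetch(`${API}/apps/${APP_NAME}/machines/${m.id}?force=true`, {
      method: 'DELETE',
      headers,
    });
  }
}

destroyStuckMachines().catch(console.error);
```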
We’ve just seen our two instances recover after about 1 hour and 15 minutes. The two other instances I’d tried creating some time after (one in an existing region, one in a new one) didn’t.
Confirmed that this worked for me! Thanks.
I’m now thinking that the problems we’ve experienced aren’t related to the downtime, but rather that CPU quotas have been turned on:
We have an HTTP cache stored in a volume, and run a cleanup process on startup to prune old entries (cacache’s verify). This was known to be a bit intensive, but it only lasted a minute or two.
Hidden in that thread is:
So I think that on deploy the CPU quota balance is reset to 0; the intensive process started and was immediately throttled (which caused the HTTP server running on the instance to slow to a crawl). Once the throttled task eventually completed, the throttling was lifted, allowing the HTTP server to run as expected.
I haven’t been able to reproduce this with a machine restart, as the balance is kept (the restart consumes some of it, but doesn’t get near 0). I’ll have to confirm by turning off the intensive task (or at least delaying its start) and seeing how that deploy goes…
Edit: testing with a 10 minute delay
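For anyone curious, the delay is roughly this (a trimmed-down sketch of our Node startup; startServer and the cache path are stand-ins for our actual code):

```ts
import * as cacache from 'cacache';

const CACHE_PATH = '/data/http-cache'; // volume mount (example path)
const PRUNE_DELAY_MS = 10 * 60 * 1000; // hold the prune back for 10 minutes

// Bring the HTTP server up first so a fresh machine isn't throttled
// while it's trying to serve its first requests.
startServer();

setTimeout(() => {
  // cacache.verify() is the intensive prune that was eating the quota balance.
  cacache
    .verify(CACHE_PATH)
    .then((stats) => console.log('cache prune done', stats))
    .catch((err) => console.error('cache prune failed', err));
}, PRUNE_DELAY_MS);

function startServer(): void {
  // ...existing HTTP server bootstrap...
}
```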
Edit 2: the deploy happened and the CPU quota balance dropped, but the HTTP server handled requests fine. The balance is now replenishing:
Edit 3: the intensive task ran successfully with enough balance in place, though there was some ‘stealing’:
My app’s now back online. Haven’t done anything myself so it seems like either they’ve done something in the past few hours or whatever they did during the night took a while to fully take effect.
I think it is a mixture of both. I do see the quota was reset back to 75% earlier today, or at least that is what the graphs show, so maybe there is something to it there.
Looks like you’re right, the baseline jumped back up from ~6% to ~30% during one of my restarts.
Yes, it has been confirmed.
This is pretty unreal. This happened on the day we were running a major migration of our services to Fly, and it left us dead in the water the entire day trying to get our services up. On top of that, we needed to deploy a fix for a pretty severe issue today, and it looks like as soon as it’s ready to deploy, we’re running into deployment issues again. I really thought the leap of faith would be worth it to support a smaller company doing interesting work, but this is getting pretty excessive.
Yeah, it was back online for us for a while, and now deployments seem to be down again.
Incidents happen. But the PR response to this has been a joke. Marking it as ‘degraded performance’ when it’s clearly a pretty significant incident that has caused a lot of downtime is super scummy. If you’re going to cheat to be able to claim 100% uptime, just don’t bother with an uptime metric.
Looks like the degraded API Performance Incident has been re-opened on the status page.
Honest question for folks on this thread (since I’m new to Fly as of a few weeks ago): what’s the general stability like on Fly, especially for production deployments? I would consider this a P0 outage that should be nearly impossible for a cloud provider. How often do these kinds of incidents happen?
I’ve been a customer for about 1 year and 8 months, and I’ve never had problems that lasted almost 24 hours; this one lasted more than 8 hours. I believe there has been a very large increase in new customers and they are scaling to support the demand, or we are watching a large intrusion and denial-of-service attack live.
I use regions GRU and IAD.
It’s really tough to give a quantitative response to a generalized “is Fly stable?” question.
And the folks in this thread (myself included) aren’t going to give you a representative sampling because we’re all here due to being impacted by the current instability.
My qualitative (and genuinely saddening) answer, though, is: probably not stable enough for anything that you’re on the hook for providing SLAs for. At least that’s where we’re landing in the fallout of the last 24 hours.
—
edit: for context, we run on several US-only geos with concurrency in each one. all services use blue-green deployments. all upstreams inside the Fly edge communicate over 6PN and use fly proxy internal addressing for regional connectivity. we front our Fly edge with CloudFront.
edit 2: customer since March
edit 3: current regions we deploy to: DEN, DFW, SEA, SJC