Do they not have any staff in the GMT/CET timezone anymore?
It’s been hours, we’ve sent emails, and we’re still having the same issues: constantly trying to spin up new machines, waiting until they slowly die, spinning up new ones…
Same situation here: one app that’s down because I deployed during the incident, and one app that’s working because I left it alone during the incident. I’m probably going to hold off on attempting any deploys to that working app for a while!
Going to host on a new platform in the meantime, as this is pretty unreliable and there isn’t even an ETA for a fix here.
It shouldn’t be too hard to keep the status page properly updated when you’re serving this many customers.
We’ve just deployed to an app that was working, and it’s now broken (it’s running, but the health check fails). I think the problem is its connection to Upstash Redis.
I’m seeing something like this as well: downtime for about 15 minutes, and then it comes back to “sort-of” normal.
Yeah, I’m still having issues.
Error: release command failed - aborting deployment. error running release_command machine: error updating release_command machine: failed to update VM 3d8d503b97e568: invalid_argument: unable to update machine configured for auto destroy (Request ID: 01JDM8DT8X7SRAWAV0VS515Q8M-iad) (Trace ID: 9792fa700168d7617cfe71c310005e60)
For anyone seeing something similar: I found a machine in the dashboard that was not in a started state, destroyed it, and my deploy went through.
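If you’d rather script that than click through the dashboard, something roughly like this should do the same thing via the Fly Machines API (a sketch only: the app name and token env var are placeholders, and I’m going from memory on the list/delete endpoints, so double-check them against the Machines API docs before relying on it):

```ts
// Find machines that aren't "started" and destroy them so the deploy can proceed.
// This just mirrors what I did by hand – a stopped machine isn't necessarily
// broken, so don't run this as a general cleanup.
const API = 'https://api.machines.dev/v1';
const APP_NAME = 'my-app'; // placeholder
const headers = { Authorization: `Bearer ${process.env.FLY_API_TOKEN}` };

async function destroyStuckMachines(): Promise<void> {
  const res = await fetch(`${API}/apps/${APP_NAME}/machines`, { headers });
  const machines: { id: string; state: string }[] = await res.json();

  for (const m of machines) {
    if (m.state === 'started') continue;
    console.log(`destroying ${m.id} (state: ${m.state})`);
    await fetch(`${API}/apps/${APP_NAME}/machines/${m.id}?force=true`, {
      method: 'DELETE',
      headers,
    });
  }
}

destroyStuckMachines().catch(console.error);
```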
We’ve just seen our two instances recover after about 1 hour and 15 minutes. The two other instances I’d tried creating some time after (one in an existing region, one in a new one) didn’t.
Confirmed that this worked for me! Thanks.
I’m now thinking that the problems we’ve experienced aren’t related to the downtime, but rather that CPU quotas have been turned on:
We have an HTTP cache stored in a volume, and run a cleanup process on startup to prune old entries (cacache’s verify). This was known to be a bit intensive, but it only lasted a minute or two.
Hidden in that thread is:
So I think that on deploy the CPU quota balance is reset to 0; the intensive process started and was immediately throttled (which caused the HTTP server running on the instance to slow to a crawl). Once the throttled task eventually completed, the throttling was lifted, allowing the HTTP server to run as expected.
I haven’t been able to reproduce this with a machine restart, as the balance is kept (the restart consumes some of it, but doesn’t get near 0). I’ll have to confirm by turning off the intensive task (or at least delaying its start) and seeing how that deploy goes…
Edit: testing with a 10 minute delay
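For anyone curious, the delay is roughly this (a trimmed-down sketch of our Node startup; startServer and the cache path are stand-ins for our actual code):

```ts
import * as cacache from 'cacache';

const CACHE_PATH = '/data/http-cache'; // volume mount (example path)
const PRUNE_DELAY_MS = 10 * 60 * 1000; // hold the prune back for 10 minutes

// Bring the HTTP server up first so a fresh machine isn't throttled
// while it's trying to serve its first requests.
startServer();

setTimeout(() => {
  // cacache.verify() is the intensive prune that was eating the quota balance.
  cacache
    .verify(CACHE_PATH)
    .then((stats) => console.log('cache prune done', stats))
    .catch((err) => console.error('cache prune failed', err));
}, PRUNE_DELAY_MS);

function startServer(): void {
  // ...existing HTTP server bootstrap...
}
```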
Edit 2: the deploy happened and the CPU quota balance dropped, but the HTTP server handled requests fine. The balance is now replenishing:
Edit 3: the intensive task ran successfully with enough balance in place, though there was some ‘stealing’:
My app’s now back online. Haven’t done anything myself so it seems like either they’ve done something in the past few hours or whatever they did during the night took a while to fully take effect.
I think it is a mixture of both. I do see the quota was reset back to 75% earlier today, or at least that is what the graphs show, so maybe there is something to it there.
Looks like you’re right, the baseline jumped back up from ~6% to ~30% during one of my restarts.
Yes, it has been confirmed.
This is pretty unreal. This happened on the day we were running a major migration of our services to Fly, and it left us dead in the water the entire day trying to get our services up. On top of that, we needed to deploy a fix for a pretty severe issue today, and it looks like as soon as it’s ready to deploy, we’re running into deployment issues again. I really thought the leap of faith would be worth it to support a smaller company doing interesting work, but this is getting pretty excessive.
Yeah, it was back online for us for a while, and now deployments seem to be down again.
Incidents happen. But the PR response to this has been a joke. Marking it as ‘degraded performance’ when it’s clearly a pretty significant incident that has caused a lot of downtime is super scummy. If you’re going to cheat to be able to claim 100% uptime, just don’t bother with an uptime metric.
Looks like the degraded API Performance Incident has been re-opened on the status page.
Honest question for folks on this thread (since I’m new to Fly as of a few weeks ago): what’s the general stability like on Fly, especially for production deployments? I would consider this a P0 outage that should be nearly impossible for a cloud provider. How often do these kinds of incidents happen?
I’ve been a customer for about 1 year and 8 months, and I’ve never had problems that lasted almost 24 hours; this one lasted more than 8 hours. I believe there has been a very large increase in new customers and they are scaling to support the demand, or we are watching a large intrusion and denial-of-service attack live.
I use regions GRU and IAD.
It’s really tough to give a quantitative response to a generalized “is Fly stable?” question.
And the folks in this thread (myself included) aren’t going to give you a representative sampling because we’re all here due to being impacted by the current instability.
My qualitative (and genuinely saddening) answer, though, is: probably not stable enough for anything that you’re on the hook for providing SLAs for. At least that’s where we’re landing in the fallout of the last 24 hours.
—
edit: for context, we run on several US-only geos with concurrency in each one. all services use blue-green deployments. all upstreams inside the Fly edge communicate over 6PN and use fly proxy internal addressing for regional connectivity. we front our Fly edge with CloudFront.
edit 2: customer since March
edit 3: current regions we deploy to: DEN, DFW, SEA, SJC