The issue has been marked as resolved on the status page, but I’m still struggling to use flyctl proxy and fly deploy due to long waiting times and timeouts.
Anyone else still having this issue?
We’re also having issues trying to set new secrets. It fails/times out in the Fly UI and when using flyctl secrets set:
Error: failed to list VMs even after retries: context deadline exceeded (Request ID: [REDACTED])
Also hitting this from the terminal and from GitHub Actions.
My apps are also failing to deploy again, among other issues: flyctl status returns 503, and database queries are getting killed even though they work fine locally and worked fine when hosted before.
It says 503, which I’d normally associate with authentication, but since it’s showing up for more than just my case, I assume it’s something broader. Most likely related to yesterday’s outage.
Still love the product, guys; hoping we can get to stability soon, though.
The ease of implementing and using Fly.io comes at a price. I use AWS, GCP, and Azure, and I’ve never had a problem with my infrastructure for this long. When something does happen, it gets solved, and we even have contracts with penalties if the servers are unavailable. I like using Fly.io for smaller projects and developer infrastructure, but this has become a learning experience that is unlikely to convince me to leave the large data centers. We’re talking about roughly 8 hours of interruption; with that kind of downtime, we can only treat the environment as development, not production.
Error: failed to list VMs even after retries: context deadline exceeded (Request ID: 01JDN27WENVWAN8KNGJ2CDSFQC-gig)
Stacktrace:
goroutine 1 [running]:
runtime/debug.Stack()
	/home/runner/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.9.linux-amd64/src/runtime/debug/stack.go:24 +0x5e
github.com/superfly/flyctl/internal/cli.printError(0xc0007ee000, 0xc0006afc1e, 0xc000867208, {0x2cfc300, 0xc000165aa0})
	/home/runner/work/flyctl/flyctl/internal/cli/cli.go:184 +0x59e
github.com/superfly/flyctl/internal/cli.Run({0x2d22ef8?, 0xc0005df5c0?}, 0xc0007ee000, {0xc00016c010, 0x7, 0x7})
	/home/runner/work/flyctl/flyctl/internal/cli/cli.go:117 +0x9d0
main.run()
	/home/runner/work/flyctl/flyctl/main.go:47 +0x156
main.main()
	/home/runner/work/flyctl/flyctl/main.go:26 +0x18
It looks like there is a new issue (perhaps related to the prior, now-solved one)? Unclear.
If the API is slow, services that use it will also be slow. I assume that’s why flyctl is slow right now.
I think you hit all the points I was going to make. Everyone is going to have outages at some point, but I don’t recall a single one lasting over 8 hours on AWS or Azure, which are the ones I have production loads on. Even on DO and Vultr the outages have not been as severe. Reading all the posts here, I don’t think a single one of us has an issue with the Fly offerings. We all agree that their features make it much easier to deploy, autoscale, and run HA across regions than most if not all of the big cloud providers, but their reliability is simply not there. I vouched for them when I moved a customer’s infra from DO to here three weeks ago, but this outage severely impacted their business, and I just don’t have any justification to give them for keeping their infra here.
While you may not have been impacted, there are large AWS outages every year, the latest being in July 2024, which lasted for almost 7 hours: Summary of the AWS Service Event in the Northern Virginia (US-EAST-1) Region
You are right in pointing that out, and that is why I said that everyone is going to have outages at some point.
But let me rephrase that: I don’t recall a single global one lasting over 8 hours on AWS or Azure, which are the ones I have production loads on. The one you listed was in one region, and I was affected, but I was able to get the services running in us-east-2 and us-west-1 within minutes.
If that had been an option here, it would have been a completely different experience. We just couldn’t do anything anywhere.
Don’t get me wrong, I am not dunking on Fly. I do believe they have the best offering, at least for the use cases I have, and I wouldn’t hesitate to move infrastructure here if they didn’t have these severe global outages.
I do agree on this aspect. The current underlying setup seems to have regions too tightly coupled with each other so when an issue occurs with critical infrastructure it impacts all regions.
They mentioned that one of the reasons it was taking so long to fix was their Corrosion cluster taking so long to recover. My understanding is that they have one Corrosion cluster trying to handle all regions, so when an outage happens and it needs to recover, the large volume of updates that has to trickle out to every node in the system takes a long time.
I wonder if it would be better to have one Corrosion cluster per region instead of a single global one, so that the number of nodes in each cluster is much smaller. Each cluster would only have X hosts for its region, plus Y hosts for each proxy edge globally. That way, if only one region is impacted, only one smaller cluster needs to be recovered, which should be faster, while the other regions continue operating.
But then again I don’t have detailed insights into how their infrastructure works so I might be heading down the wrong path.
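To put rough numbers on that idea, here is a back-of-the-envelope sketch in Go. Every figure in it (regions, hostsPerRegion, edgeProxies) is a made-up assumption, and it says nothing about Corrosion’s actual protocol or Fly’s real topology; it only illustrates how much smaller the set of nodes that needs to recover would be if the single global cluster were split per region.

package main

import "fmt"

// All numbers below are made-up assumptions for illustration only;
// they do not reflect Fly.io's real topology or how Corrosion works.
const (
	regions        = 30 // hypothetical number of regions
	hostsPerRegion = 40 // hypothetical worker hosts per region
	edgeProxies    = 60 // hypothetical edge-proxy nodes worldwide
)

func main() {
	// Today (as I understand it): one cluster spanning every host and every
	// edge node, so a recovery has to re-sync state across all of them.
	globalCluster := regions*hostsPerRegion + edgeProxies

	// The idea above: one cluster per region, holding only that region's
	// hosts plus the edge proxies that need its routing state.
	perRegionCluster := hostsPerRegion + edgeProxies

	fmt.Printf("single global cluster: %d nodes to re-sync after an incident\n", globalCluster)
	fmt.Printf("per-region cluster:    %d nodes to re-sync; other regions keep running\n", perRegionCluster)
}

With those made-up numbers, an incident in one region would mean re-syncing around 100 nodes instead of around 1,260, and the other regions’ clusters would never be touched at all.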
Yes, if the team is able to regionalize the incidents, I might consider them for my production workloads. But all the incidents I’ve been experiencing affected all regions, and I couldn’t simply fall back to another region like I can when one of the big cloud providers has an incident in a specific region.
It’s really a shame because I really enjoy working with Fly, but these long, frequent and global incidents are just a big no for any production app…
This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.