flyctl timeouts

The issue has been marked as resolved on the status page, but I’m still struggling to use flyctl proxy and fly deploy because of long waits and timeouts.

Anyone else still having this issue?

[Screenshot from 2024-11-26 at 20:13 showing the timeout]

2 Likes

We’re also having issues trying to set new secrets. It fails/times out both in the Fly UI and when using flyctl secrets set:

Error: failed to list VMs even after retries: context deadline exceeded (Request ID: [REDACTED])
1 Like

Also seeing this from the terminal and from GitHub Actions.
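In case it helps anyone as a stopgap, this is roughly the retry-with-backoff wrapper we put around fly deploy in our CI job while the API is flaky. It’s in Go only because that’s what our repo uses; the attempt count, backoff, and --remote-only flag are our own choices rather than anything flyctl recommends, and it assumes FLY_API_TOKEN is already set as an Actions secret.

```go
// retry_deploy.go: a rough stopgap, not an official workaround.
// Re-runs `fly deploy` a few times with exponential backoff when the
// API is flaky. Assumes FLY_API_TOKEN is already in the environment
// (e.g. a GitHub Actions secret); attempts and delays are arbitrary.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"time"
)

func main() {
	const attempts = 5
	delay := 10 * time.Second

	for i := 1; i <= attempts; i++ {
		cmd := exec.Command("fly", "deploy", "--remote-only")
		cmd.Stdout = os.Stdout
		cmd.Stderr = os.Stderr

		if err := cmd.Run(); err == nil {
			return // deploy succeeded
		}

		fmt.Fprintf(os.Stderr, "fly deploy failed (attempt %d/%d), retrying in %s\n", i, attempts, delay)
		time.Sleep(delay)
		delay *= 2 // back off: 10s, 20s, 40s, ...
	}

	fmt.Fprintln(os.Stderr, "fly deploy still failing after all attempts")
	os.Exit(1)
}
```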

1 Like

My apps are also failing to deploy again, among other issues: flyctl status returns a 503, and database queries are being killed that work fine locally and worked fine here before.

2 Likes

It says 503, which I’d normally associate with an authentication problem on my end, but since it’s showing up for more than just my case, I suspect it’s something broader. Most likely related to yesterday’s outage.

Still love the product, guys. Hoping we can get to stability soon, though.

1 Like

The ease of implementing and using Fly.io comes at a price. I also use AWS, GCP, and Azure, and I’ve never had a problem with my infrastructure last this long; when something does happen, it gets solved, and we even have contracts with penalties if the servers are unavailable. I like using Fly.io for smaller projects and developer infrastructure, but this has been a learning experience that is unlikely to convince me to leave the large data centers. We’re talking about roughly 8 hours of interruption; with that kind of downtime, we can only treat this environment as development, not production.

Error: failed to list VMs even after retries: context deadline exceeded (Request ID: 01JDN27WENVWAN8KNGJ2CDSFQC-gig)
Stacktrace:
goroutine 1 [running]:
runtime/debug.Stack()
        /home/runner/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.9.linux-amd64/src/runtime/debug/stack.go:24 +0x5e
github.com/superfly/flyctl/internal/cli.printError(0xc0007ee000, 0xc0006afc1e, 0xc000867208, {0x2cfc300, 0xc000165aa0})
        /home/runner/work/flyctl/flyctl/internal/cli/cli.go:184 +0x59e
github.com/superfly/flyctl/internal/cli.Run({0x2d22ef8?, 0xc0005df5c0?}, 0xc0007ee000, {0xc00016c010, 0x7, 0x7})
        /home/runner/work/flyctl/flyctl/internal/cli/cli.go:117 +0x9d0
main.run()
        /home/runner/work/flyctl/flyctl/main.go:47 +0x156
main.main()
        /home/runner/work/flyctl/flyctl/main.go:26 +0x18

It looks like there is a new issue (perhaps related to the prior, now-resolved one)? Unclear.

If the API is slow, services that use it will be slow too. I assume that’s why flyctl is struggling right now.
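For what it’s worth, the "context deadline exceeded" in the errors above is just Go’s way of saying a client-side timeout expired before the API answered. Here’s a minimal sketch of the mechanism using plain net/http, nothing flyctl-specific; the URL and the 5-second budget are made up:

```go
// Minimal illustration of where "context deadline exceeded" comes from:
// the client gives a request a fixed time budget, and if the upstream
// API is slower than that budget, the call fails with exactly that
// error text. The URL and the 5-second budget below are invented.
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Give the request a fixed time budget (flyctl's real budgets may differ).
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet,
		"https://api.example.com/v1/machines", nil)
	if err != nil {
		panic(err)
	}

	// If the API takes longer than the budget, Do returns an error that
	// wraps context.DeadlineExceeded, i.e. "context deadline exceeded".
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```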

1 Like

I think you hit all the points I was going to make. Everyone is going to have outages at some point, but I don’t recall a single one being over 8 hours on AWS or Azure, which are the ones I have production loads on. Even on DO and Vultr the outages have not been as severe. Reading all the posts here, I don’t think a single one of us has an issue with the Fly offering itself. We all agree that its features make it much easier to deploy, autoscale, and run HA across regions than most, if not all, of the big cloud providers, but the reliability is simply not there. I vouched for them when I moved a customer’s infra from DO to here 3 weeks ago, but this outage severely impacted their business, and I just don’t have any justification to give them for keeping their infra here.

While you may not have been impacted, there are large AWS outages every year; the latest, in July 2024, lasted almost 7 hours: Summary of the AWS Service Event in the Northern Virginia (US-EAST-1) Region

2 Likes

You are right in pointing that out, and that is why I said that everyone is going to have outages at some point.

But let me rephrase: I don’t recall a single global outage being over 8 hours on AWS or Azure, which are the ones I have production loads on. The one you listed was confined to one region; I was affected, but I was able to get the services running in us-east-2 and us-west-1 within minutes.

If that had been an option here, it would have been a completely different experience. We just couldn’t do anything anywhere.

Don’t get me wrong, I am not dunking on Fly. I do believe they have the best offering, at least for my use cases, and I wouldn’t hesitate to move infrastructure here if they did not have these severe global outages.

1 Like

I do agree on this point. The current setup seems to have regions too tightly coupled with each other, so when an issue occurs with critical infrastructure, it impacts all regions.

They mentioned that one of the reasons the fix took so long was that their Corrosion cluster took a long time to recover. My understanding is that they have one Corrosion cluster trying to handle all regions, so when an outage happens and the cluster needs to recover, the large number of updates that have to trickle out to every node in the system takes a long time.

I wonder if it would be better to have one Corrosion cluster per region, so that the number of nodes in each cluster is much smaller. Each cluster would only have X hosts for the region, plus Y hosts for each proxy edge globally. That way, if only one region is impacted, only one smaller cluster needs to be recovered, which should be faster, while the other regions continue operating (rough sketch below).

But then again, I don’t have detailed insight into how their infrastructure works, so I might be heading down the wrong path.
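To make the idea a bit more concrete, here’s a purely hypothetical sketch of the partitioning I have in mind. The types and functions are invented for illustration and have nothing to do with Corrosion’s actual internals:

```go
// Hypothetical sketch of "one cluster per region": every region gets its
// own independently recoverable cluster, and the global edge proxies join
// each of them so routing state is still available everywhere.
// None of this reflects Corrosion's real design.
package main

import "fmt"

// Node is a host that participates in state propagation.
type Node struct {
	Name   string
	Region string // e.g. "iad", "ams"
	Edge   bool   // true for proxy edges that need every region's state
}

// Cluster is one independently recoverable replication group.
type Cluster struct {
	Region  string
	Members []Node
}

// partitionByRegion builds one cluster per region: each cluster holds that
// region's hosts plus every global edge proxy, so a recovery only has to
// replay one region's updates while the other regions keep operating.
func partitionByRegion(nodes []Node) map[string]*Cluster {
	clusters := make(map[string]*Cluster)
	var edges []Node

	for _, n := range nodes {
		if n.Edge {
			edges = append(edges, n)
			continue
		}
		c, ok := clusters[n.Region]
		if !ok {
			c = &Cluster{Region: n.Region}
			clusters[n.Region] = c
		}
		c.Members = append(c.Members, n)
	}

	// Every edge proxy joins every regional cluster.
	for _, c := range clusters {
		c.Members = append(c.Members, edges...)
	}
	return clusters
}

func main() {
	nodes := []Node{
		{Name: "iad-host-1", Region: "iad"},
		{Name: "iad-host-2", Region: "iad"},
		{Name: "ams-host-1", Region: "ams"},
		{Name: "edge-syd-1", Region: "syd", Edge: true},
	}
	for region, c := range partitionByRegion(nodes) {
		fmt.Printf("%s cluster: %d members\n", region, len(c.Members))
	}
}
```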

4 Likes

Yes, if the team were able to regionalize incidents, I might consider them for my production workloads. But all the incidents I’ve experienced affected all regions, and I couldn’t simply fall back to another region the way I can when one of the big cloud providers has an incident in a specific region.

It’s a real shame, because I really enjoy working with Fly, but these long, frequent, global incidents are just a big no for any production app…

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.