Stability issues

I migrated from Heroku to fly two months ago, and while I was super happy with how fly worked for the first few weeks, I’ve now had enough.

If you go through the forums, you’ll find many reports of proxy issues, connectivity issues, hung deployments, etc.
Two or three weeks ago, there was an issue with deployments that lasted for a few hours without any indication on the status page or feedback from the fly team.
In the past week, I’ve had a lot of issues with networking and my services being unavailable. And now it’s happening once again, to the point that I decided to write this.

flyctl status -a workingreen-db
178157ea901478	       	unknown	      	                  	:                            	1970-01-01T00:00:00Z	0001-01-01T00:00:00Z	
32871e9c123085	started	leader 	ams   	3 total, 3 passing	flyio/postgres:14.4 (v0.0.32)	2022-11-18T12:29:14Z	2022-11-18T12:29:31Z

As you can see, there’s an issue with my database machine.

And don’t even get me started on Upstash Redis: I get dozens of connectivity issues every week.

My application is down; I cannot ssh into it and am basically unable to run my business.
I don’t know whether the recent issues are caused by the small team behind the platform, increased usage, or whether they “just happened”, but I won’t recommend fly for running production applications for some time.

I do like this platform; it works really well when it’s working, but the number of issues I have to deal with is simply too much.

1 Like

Agreed. When we started using fly.io, we swiftly migrated a significant portion of our infrastructure over. The simplicity and painlessness sold us. But that’s fading now, considering all the stability issues we’ve seen over the past months.

We are going to migrate our infrastructure back to AWS or GCP (undecided yet, but have started the internal discussion already). It’s a huge bummer. I’ve been super bullish on fly.io and recommended it to friends & colleagues in the past. I can’t do that anymore.

1 Like

Same here, I haven’t decided where to migrate yet, but I definitely have to. Maybe I’ll come back in the future.

The biggest issue I see here is the lack of communication from fly’s side, and that there’s no way to declare an incident from the customer’s perspective. I know covering all timezones with a small team is non-trivial, but I can’t imagine paying hundreds of dollars for infrastructure and then simply having no way to get things fixed ASAP.

Good observability and an on-call system for infra people are a must when you offer a platform to other people…

1 Like

True. There’s nothing on https://status.flyio.net/ even though this has been going on for 2+ hours. There’s also been zero activity from fly.io staff on the forums here in that time. Yes, covering all timezones with a small team is difficult, but then (and that’s the harsh truth) it’s also simply not production ready. We might come back in the future too. I like the concepts in general. But it’s simply not ready yet to run production workloads for real businesses.

1 Like

I would like to address the broader points separately, but I don’t have time right now.

A host had to be rebooted. Unfortunately it looks like some postgres clusters did not come back up gracefully and we’re looking into it. This should not affect apps that don’t need to connect to a postgres cluster with a node on that host.

2 Likes

Thanks @jerome. I’m looking forward to hearing your take on this & what fly.io has in store to be more reliable and responsive going forward. I’m genuinely bummed out that we have to migrate off of fly.io for the time being and I hope we can return in the future. As friends & colleagues can attest, I’ve been very vocal & spreading love for fly.io. We love it when it’s working reliably… But the recent downtimes are too much for my business and, most importantly, my team, who’ve had to deal with these incidents. No matter how good the product is (and it is), if it breaks left and right, I can’t run my production infrastructure on it.

1 Like

I too have been affected by the PG issues lately. My production environment is running in AMS.

I have a single user casually trying out the product, and unfortunately he has been affected by 500 errors multiple times when the server couldn’t connect to PG. We’re not even in alpha, so luckily this hasn’t had any major repercussions for us, but it’s certainly worrying.

I agree with @fubhy that the status page should be showing these issues.

From the metrics, it would seem PG never really died, since some queries were actually being processed?

I guess the issue was with the routing layer.

As far as I can tell, my apps have been running fine on Fly (other than PG). I guess the best action I can take for now is migrating PG to Supabase or ElephantSQL.

Don’t get me wrong, I’m an early adopter and a Fly fan. Have recommended it to everyone I know IRL and very often on Twitter. Nothing would make me happier than being able to run all my infra on Fly.

1 Like

I will acknowledge things have been bumpy lately. The experience of each user on our platform depends a lot on what region they’re deployed to and what products they use.

As a biased example: my closest region is YUL (Montreal) and I’ve been running a few apps there + a postgres cluster, and I’ve only had a few rare, user-fixable issues in the last few months.

Some parts of Europe (FRA, in particular) are extremely busy regions, meaning our resources there are pushed to the brink. More capacity is not always available, and rebalancing hosts is not an easy task without user coordination (this is only a problem for apps with volumes, because volumes live on a specific host).

An app in FRA, AMS, CDG or LHR might have more problems, especially if it’s a single-node app (no redundancy) or if it uses a Postgres cluster, which have been problematic in their current form.

I can answer the specific criticisms:

Responding to these issues is an incredible amount of work. Up until Monday this week, crucial information was missing from logs that would’ve helped users diagnose issues. Essentially all proxy error logs (“Could not proxy…”) were missing the actual reason why. This led to a lot of confusion for our users. It was even worse because only streamed logs were missing that information; the initial batch of logs pulled (via fly logs) would display it correctly.

We’re trying to give users more information so they can better troubleshoot their apps. Even if I’ll take the blame for a lot of proxy issues over the years, it has gotten way more stable / fast in the past few years, and most error logs present in app logs are the result of:

  • A host struggling, but still available
  • The app’s instance(s) being unhealthy in a way that prevents connections (timing out or outright connection refused).
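
For reference, you can pull those proxy error lines out of your app’s logs with something like this (the app name is a placeholder):

# Stream the app's logs and filter for proxy errors
fly logs -a my-app | grep -i "could not proxy"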

We’ve pushed our Nomad cluster to the limit, and it hasn’t been working the way we’ve wanted it to for the past 2 years. That’s why we’re moving towards machines with a different scheduler entirely. There are still rough edges there, but we’re improving things daily (we can, because we control everything about this project; it was custom-built for our own use case).

To successfully deploy an app with our Nomad infrastructure (current default, aka “Apps V1”), these things have to work:

  1. Start a remote builder (unless you’re using Docker locally)
  2. Push the image to our registry (this goes through the proxy, and the registry itself is just another Fly app we host ourselves; we keep backups deployed elsewhere because we could otherwise get into a chicken-and-egg situation here)
  3. Create a new release via our API
  4. Schedule with Nomad (it needs to find capacity given the constraints of your app like region, volumes, etc.)
  5. We then launch a background job to monitor the deployment
  6. flyctl polls an API to get updates on the deployment

Things can sometimes go wrong between steps 4 and 6 (inclusive). Nomad can take a long time, especially if we’re close to capacity in the app’s regions.
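
From the outside, following steps 5 and 6 looks roughly like this (a sketch; the app name is a placeholder):

# Deploy with a remote builder and follow the release
fly deploy --remote-only -a my-app

# Watch instance status while the deployment rolls out
fly status --watch -a my-app

# List past releases, including failed ones
fly releases -a my-app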

Machines work more like docker run, and they’re easier to work with if you’re familiar with Docker.
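
To make that concrete, the machines workflow is roughly this (a sketch; the app name, image and machine ID are placeholders):

# Boot a VM straight from an image, much like docker run
fly machine run nginx --app my-machine-app --region ams

# Inspect and manage it like you would a container
fly machine list --app my-machine-app
fly machine stop <machine-id> --app my-machine-app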

This is not to excuse the issues, but there’s a lot going on in the deployment process.

This depends a lot on which regions your app is targeting.

This weekend we had a CDG host go down and this morning an AMS host. Europe had a bad week for sure.

Today’s host failure might’ve been prevented with better internal process (we’re still investigating, but it looks like there was a mis-deploy there). We are improving this particular part of our process given the info we collected.

A quick search shows some issues that don’t immediately seem related to our infrastructure. Most of the time it’s related to the libraries being used, libraries not supporting IPv6, etc.

Would you mind creating a thread with your particular issues? Did they happen at the same time as our host troubles?

If you can’t fly ssh console into your app, I’m assuming it is either because:

  • If you had an instance on the downed host, it might still be in the DNS records and that’s what flyctl uses to determine where to connect to. It might’ve picked this bad instance.
    • We are working on fixing this, so records are not returned from a down host.
  • You only have one instance and it’s on the downed host.

If it’s for any other reason, I’d need to see some debug logs so we can troubleshoot it.
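
In the meantime, a workaround for the DNS issue is to pick the instance yourself, and debug logs can be captured like this (a sketch; flags as I recall them, and the app name is a placeholder):

# Pick a specific instance instead of letting flyctl choose one via DNS
fly ssh console --select -a my-app

# If it still fails, capture debug output we can troubleshoot with
LOG_LEVEL=debug fly ssh console -a my-app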

Until today, we had not been updating the status page for a single host going down. This might change; we’re discussing it internally.

Our stance was that our users are expected to launch highly-available applications (more than 1 instance). This includes Postgres clusters, which were intended to be highly available when launched in a 3-node configuration.

… however, this turned out not to be the case. We’ve already made our latest solution to this the default when deploying Postgres clusters. Old clusters can be migrated via a dump + restore procedure.
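
A rough sketch of that dump + restore path, assuming you proxy both clusters locally (app names, ports, database name and credentials are placeholders; run each proxy in its own shell):

# Expose the old and new clusters on local ports (one shell each)
fly proxy 15432:5432 -a old-pg-app
fly proxy 25432:5432 -a new-pg-app

# Pipe a dump of the old database straight into the new one
pg_dump "postgres://postgres:<password>@localhost:15432/my_db" \
  | psql "postgres://postgres:<password>@localhost:25432/my_db"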

Given the expectation of high availability, a single host going down shouldn’t bring anything “serious” down.

That hasn’t been the case, due to both our DNS server’s shortcomings and what we’re now realizing was the wrong approach to high-availability Postgres clusters. Both of these problems are actively being addressed.

We do have an on-call rotation, multiple levels of escalation, etc. Most “alerts” are automatically resolved. Some alerts require manual intervention.

We have metrics for nearly everything, tracing (sampled) for many other things and oh so many logs.

That’s how we’re retracing what happened and we will be implementing new alerts / mitigations as well as fixes to prevent this from happening again.

We’re also working on preventing future host issues by tweaking resource usage. There are / were some bugs that made it possible to oversubscribe hosts under certain conditions.

4 Likes

So increasing flyctl scale count and/or adding more regions?

What is a 3-node configuration? 3 VMs and 3 volumes?

Does the new PG implementation deploy this?

Yes, that should always be good. Unless you scale to regions too far from your database, in which case you might be in trouble.
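
Concretely, that’s something like this (the app name and region are placeholders):

# Run at least two instances so a single downed host doesn't take the app out
fly scale count 2 -a my-app

# Add a nearby region for redundancy, keeping it close to your database
fly regions add lhr -a my-app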

If this isn’t helping, then it is on us.

Yes, for postgres that means 3 VMs, each with its own volume.

It is supported, yes. Any number of nodes is supported (I think), but the ideal number is an odd number > 1.
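
For a new cluster, that looks roughly like this (the name and region are placeholders; flag as I recall it):

# Create an HA cluster: 3 VMs, each with its own volume
fly postgres create --name my-new-db --region ams --initial-cluster-size 3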

1 Like

Thanks a lot, @jerome, for such a thorough response!
I’d like to clarify that I don’t blame fly for all of this; I know how messy systems and platforms can get, especially when there are issues in data centers, networking is down, etc.
I migrated from Heroku for a simple reason: fly is way more pleasant to work with, and this community is an excellent source of answers when something is down. It’s also developer-friendly, with its CLI-first approach.
The only reason I let myself vent here today was the accumulation of issues in recent weeks, some of which may not be your fault at all (like the Upstash Redis ones).
I never intended to diminish your ability to fix things, and I now understand the European regions may be under higher load than the rest. I was simply expressing my observations, though maybe I could have used slightly different language.

TL;DR: I hope things will get better soon, as I don’t see many competitors to fly that suit me as well at this price point.

4 Likes

A wise man once said: