Stability issues

I'll acknowledge that things have been bumpy lately. Each user's experience on our platform depends a lot on which region they're deployed to and which products they use.

As a biased example: my closest region is YUL (Montreal). I've been running a few apps there plus a Postgres cluster, and I've only had a few rare, user-fixable issues in the last few months.

Some parts of Europe (FRA in particular) are extremely busy regions, meaning our resources there are pushed to the brink. More capacity is not always available, and rebalancing hosts is not an easy task without user coordination (rebalancing is only a problem for apps with volumes, because volumes live on a specific host).

An app in FRA, AMS, CDG or LHR might have more problems, especially if it's a single-node app (no redundancy) or if it uses one of our Postgres clusters, which have been problematic in their current form.

I can answer the specific criticisms:

Responding to these issues is an incredible amount of work. Up until Monday this week, crucial information was missing from logs that would've helped users diagnose issues: essentially all proxy error logs (“Could not proxy…”) were missing the actual reason why. This led to a lot of confusion for our users. It's even worse because only streamed logs were missing that information; the initial logs pulled (via fly logs) would have it displayed correctly.

We’re trying to give users more information so they can better troubleshoot their apps. Even if I’ll take the blame for a lot of proxy issues over the years, the proxy has gotten way more stable and faster in the past few years, and most error logs present in app logs are the result of:

  • A host struggling, but still available
  • The app’s instance(s) being unhealthy in a way that prevents connections (timing out or outright refusing them).
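
If you want to look for these in your own app, they show up in the app's log stream. A minimal sketch, assuming a placeholder app name of my-app and the flags I recall flyctl supporting:

    # Stream logs for the app (a recent backlog is printed first)
    fly logs -a my-app

    # Narrow to a region or a single instance when chasing a host-specific issue
    fly logs -a my-app -r fra
    fly logs -a my-app -i <instance-id>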

We’ve pushed our Nomad cluster to the limit and it hasn’t been working the way we’ve wanted it to for the past 2 years. That’s why we’re moving towards Machines, with a different scheduler entirely. There are still rough edges there, but we’re improving things daily (because we can: we control everything about this project, as it was custom-built for our own use case).

To successfully deploy an app with our Nomad infrastructure (current default, aka “Apps V1”), these things have to work:

  1. Start a remote builder (unless you’re using Docker locally)
  2. Push the image to our registry (this goes through the proxy, and the registry itself is just another Fly app we host ourselves; we have backups deployed elsewhere because we could get into a chicken-and-egg situation here)
  3. Create a new release via our API
  4. Schedule with Nomad (it needs to find capacity given your app’s constraints: region, volumes, etc.)
  5. We then launch a background job to monitor the deployment
  6. flyctl polls an API to get updates on the deployment

Things can sometimes go wrong between steps 4 and 6 (inclusive). Nomad scheduling can take a long time, especially if we’re close to capacity in the app’s regions.
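
When a deploy looks stuck in those later steps, more verbose client output usually narrows down which one. A rough sketch (not official guidance), assuming you're in the app's directory and want to force the remote builder:

    # Deploy with debug-level client logging to see which step is slow or failing
    LOG_LEVEL=debug fly deploy --remote-only

    # Check how the release and its instances are progressing
    fly status --all
    fly releases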

Machines work more like docker run and they’re easier to work with if you’re familiar with Docker.
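
As a rough illustration (the app name, image, and region below are placeholders, not recommendations):

    # Roughly analogous to `docker run`, but on a Fly host in the region you pick
    fly machine run nginx --app my-machines-app --region ams

    # List and inspect the resulting machine
    fly machine list --app my-machines-app
    fly machine status <machine-id> --app my-machines-app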

This is not to excuse the issues, but there’s a lot going on in the deployment process.

This depends a lot on which regions your app is targeting.

This weekend we had a CDG host go down and this morning an AMS host. Europe had a bad week for sure.

Today’s host failure might’ve been prevented with better internal process (we’re still investigating, but it looks like there was a mis-deploy there). We are improving this particular part of our process given the info we collected.

A quick search shows some issues that don’t immediately seem related to our infrastructure. Most of the time they’re related to the libraries being used, libraries not supporting IPv6, etc.
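
A common example of the IPv6 case: our private networking (.internal addresses) is IPv6-only, so a client library that can't talk IPv6 will never reach, say, a database over it. One way to see what an .internal name resolves to (my-app and my-db are placeholder names):

    # Resolve another app's private (6PN) address; the answer is an IPv6 address,
    # so whatever client connects to it needs IPv6 support
    fly dig aaaa my-db.internal -a my-app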

Would you mind creating a thread with your particular issues? Did they happen at the same time as our host troubles?

If you can’t fly ssh console into your app, I’m assuming it is either because:

  • If you had an instance on the downed host, it might still be in the DNS records, which are what flyctl uses to determine where to connect. It might’ve picked this bad instance.
    • We are working on fixing this, so that records for instances on a down host are not returned.
  • You only have one instance and it’s on the downed host.

If it’s for any other reason, I’d need to see some debug logs so we can troubleshoot it.
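
The simplest way to capture those (app name is a placeholder):

    # Re-run the command with debug-level client logging and paste the output in your thread
    LOG_LEVEL=debug fly ssh console -a my-app

    # fly doctor can also surface common local problems (WireGuard/agent connectivity, etc.)
    fly doctor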

Until today, we had not been updating the status page when a single host goes down. This might change; we’re discussing it internally.

Our stance was that users are expected to launch highly available applications (more than one instance). This includes Postgres clusters, which were intended to be highly available when launched in a 3-node configuration.

… however, this turned out not to be the case. We’ve already made our latest solution to this the default when deploying Postgres clusters. Old clusters can be migrated via a dump-and-restore procedure.

Given the expectation of high availability, a single host going down shouldn’t bring anything “serious” down.

That hasn’t been the case, due to both our DNS server’s shortcomings and what we’re now realizing was the wrong approach to high-availability Postgres clusters. Both of these problems are actively being addressed.
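
In the meantime, if your app is a single instance on Apps V1, the minimal step towards that expectation is running at least two (the region below is a placeholder):

    # Run at least two instances so a single host failure doesn't take the app down
    fly scale count 2 -a my-app

    # Optionally spread them across an additional region
    fly regions add ams -a my-app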

We do have an on-call rotation, multiple levels of escalation, etc. Most “alerts” are automatically resolved. Some alerts require manual intervention.

We have metrics for nearly everything, tracing (sampled) for many other things and oh so many logs.

That’s how we’re retracing what happened, and we will be implementing new alerts and mitigations, as well as fixes, to prevent this from happening again.

We’re also working on preventing future host issues by tweaking resource usage. There are / were some bugs that made it possible to oversubscribe hosts under certain conditions.
