Reliability: It's Not Great - Mitigation strategies

Thanks for the transparency. Now it would be really nice to figure out how to minimize downtime on the platform, given the current state of affairs.

Current app state

I've been lucky so far: per my monitoring, I've had only 18 minutes of downtime in the past 6 months.
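
For context, that downtime figure comes from an external probe along these lines. This is a minimal sketch rather than my actual setup; the URL, interval, and timeout are placeholders:

```python
# External uptime probe (sketch): poll a public HTTP endpoint and tally
# downtime. The URL, interval, and timeout below are hypothetical.
import time
import urllib.request

URL = "https://example-app.fly.dev/healthz"  # placeholder endpoint
INTERVAL_S = 30

down_seconds = 0
while True:
    started = time.monotonic()
    try:
        with urllib.request.urlopen(URL, timeout=10) as resp:
            ok = 200 <= resp.status < 300
    except Exception:
        ok = False  # timeouts and HTTP errors both count as down
    if not ok:
        down_seconds += INTERVAL_S
        print(f"DOWN (cumulative: {down_seconds / 60:.1f} min)")
    time.sleep(max(0.0, INTERVAL_S - (time.monotonic() - started)))
```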

  • The app runs in 4 regions, each overprovisioned so that no autoscaling is required.
  • There is a Postgres cluster using Stolon, and from the monitoring logs I can see that:
    • both hosts have been re-deployed in the past month
    • one has had some restarts
    • there are copious leader-election error messages.
  • Deploys happen every 10-20 days.
  • The app is on the Launch plan.

Mitigation strategies

Edge proxy

It seems the edge proxy issues are completely unavoidable on my end: they've been happening because Consul/Nomad are stretched past their limits. The replacement service, Corrosion, is, well, new and will likely have hiccups.

So there's no mitigation strategy here other than not deploying too frequently, since the failures usually seem to show up post-deploy.

Deployment

There seem to be issues with Vault, Docker timeouts, and service discovery.
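
If those really are transient timeouts, one blunt mitigation is to retry the deploy a couple of times before involving a human. A minimal sketch, assuming flyctl exits nonzero on a failed deploy; the retry count and backoff are arbitrary:

```python
# Retry wrapper around `flyctl deploy` for transient failures (sketch).
# Assumes a nonzero exit code on failure; retry count and backoff are arbitrary.
import subprocess
import sys
import time

MAX_ATTEMPTS = 3

for attempt in range(1, MAX_ATTEMPTS + 1):
    result = subprocess.run(["flyctl", "deploy", "--remote-only"])
    if result.returncode == 0:
        sys.exit(0)
    print(f"deploy attempt {attempt} failed (exit {result.returncode})")
    if attempt < MAX_ATTEMPTS:
        time.sleep(60 * attempt)  # back off a bit more each attempt
sys.exit(1)
```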

Questions
From the various discussions, it appears that using the Machines API (which is in beta) could potentially be more reliable. Is there any data that can be shared regarding:

  • successful deploys / total deploys for v1 vs v2
  • total app count on v1 vs v2

Are there any mitigation strategies other than migrating to v2 or deploying less often?

Postgres

Questions
Is migrating to a repmgr / Machines-managed Postgres going to be better than the current Stolon/Nomad setup? The guidance so far seems a bit non-committal.

Capacity issues

Deploying to many regions and overprovisioning to handle load without having to autoscale seems like a reasonable workaround.
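
For reference, this is roughly the shape of that setup on a v1 (Nomad) app. A sketch only: the regions and counts are placeholders, and flyctl flags have been moving between versions, so verify against `flyctl help` first:

```python
# Pin a region pool and overprovision a fixed instance count (sketch, v1 apps).
# Regions and counts are placeholders; verify flags against your flyctl version.
import subprocess

REGIONS = ["ewr", "lhr", "nrt", "syd"]  # hypothetical 4-region layout

subprocess.run(["flyctl", "regions", "set", *REGIONS], check=True)
# Two instances per region, fixed, so autoscaling is never needed.
subprocess.run(
    ["flyctl", "scale", "count", str(len(REGIONS) * 2), "--max-per-region", "2"],
    check=True,
)
```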

Questions
Can anything else be done?

v2 reliability

Questions

  • Is the proxy routing only to healthy instances yet? There was a post indicating that it may not be.
  • Are there health checks that allow failed instances to restart? (The app-side endpoint I'd expose is sketched after this list.)
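
In case it's useful to others: on the app side, I at least expose a cheap endpoint for whatever checks do run. A minimal sketch, assuming a check only needs a fast 200; the /healthz path and port are hypothetical:

```python
# Minimal app-side health endpoint (sketch). Path and port are hypothetical;
# point the platform's HTTP check at this path if/when checks apply.
from http.server import BaseHTTPRequestHandler, HTTPServer

class Health(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # keep check traffic out of the app logs
        pass

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), Health).serve_forever()
```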

Hopefully there is something that can be done other than sitting and hoping for the best. Also, I'm super curious whether most of the longer downtime was on single-node apps, in which case I'll simply ignore the entire reliability thread. Thanks. :slight_smile:

You're right that the proxy issues are unavoidable. Even so, there are two kinds of outages:

  1. Interruptions to already running applications
  2. Deployment infrastructure downtime

Most of our big outages are #2. You will feel them if you try to deploy (we disabled deploys for a total of about 20 hours this week), but an app that's already running won't necessarily have issues.

The gray area is when apps crash and need to be rescheduled. People who had downtime without deploying yesterday were suffering from a failure to reschedule broken VMs.

The most reliable setup on Fly is:

  • A machines app with 2+ machines (rough sketch after this list)
  • New, repmgr-based Postgres
  • Running in 2+ regions
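
Roughly, spinning up the machines half of that looks like this. A sketch only: the image and regions are placeholders, and `fly machine run` flags vary by flyctl version, so check `flyctl machine run --help`:

```python
# Sketch: one machine in each of two regions. Image ref and regions are
# placeholders; verify `fly machine run` flags against your flyctl version.
import subprocess

IMAGE = "registry.fly.io/example-app:latest"  # hypothetical image

for region in ["ewr", "lhr"]:
    subprocess.run(
        ["flyctl", "machine", "run", IMAGE, "--region", region],
        check=True,
    )
```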

Your Stolon-based Postgres will have issues if our shared Consuls fail. We've managed to mitigate this, mostly, so it's been pretty rare recently. It's still a brittle setup, but not as dire as it used to be.

We’re working on ways to migrate Postgres clusters from Stolon → repmgr. It’s a pain to move right now. Possible if you want to, but painful.

Most smaller issues affect single-instance apps.

Those reboots on the Postgres instances are concerning. If you haven’t already asked support to look at your DB cluster, that would be worthwhile. We’re happy to look at what’s happening and see if there’s a reasonable fix.
