Reliability: It's Not Great - Mitigation strategies

tj1 · March 16, 2023, 10:19am

Thanks for the transparency. Now it would be real nice to figure out how to minimize any downtime on the platform given the current state of affairs.

Current app state

I’ve been lucky so far: my monitoring shows only 18 minutes of downtime in the past 6 months.
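For scale, that downtime figure can be turned into an availability percentage. This is a minimal sketch, assuming "6 months" is roughly 182 days; the function name is made up for illustration and is not part of any Fly tooling.

```python
def availability(downtime_minutes: float, period_days: float) -> float:
    """Return availability over the period as a percentage."""
    total_minutes = period_days * 24 * 60
    return 100.0 * (1.0 - downtime_minutes / total_minutes)


# Assumption: 6 months is approximated as 182 days.
pct = availability(18, 182)
print(f"{pct:.3f}%")  # prints "99.993%"
```

So the app has been running at a bit better than "four nines" over that window, which is worth keeping in mind when weighing how much effort to put into mitigation.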

There are 4 regions and they have been overprovisioned to not require any autoscaling.
There is a postgres cluster using stolon and from the monitoring logs, I can see that:
- both hosts have been re-deployed in the past month
- one has had some restarts
- there are copious error messages for leader election.
Deploys are every 10-20 days.
The app is on the Launch plan.

Mitigation strategies

Edge proxy

It seems the edge proxy issues are completely unavoidable, and they’ve been happening because Consul/Nomad are being stretched past their limits. The new service, Corrosion, is, well, new and will likely have hiccups.

So there is no mitigation strategy here other than not deploying too frequently, since the failures usually seem to happen post-deploy.

Deployment

There seem to be issues with Vault, Docker timeouts, and service discovery.

Questions
From the various discussions, it appears that using the Machines API, which is in beta, could potentially be more reliable. Is there any data that can be shared regarding:

- successful deploys / total deploys for v1 vs v2
- total app count on v1 vs v2

Are there any other mitigation strategies besides migrating to v2 or deploying less often?

Postgres

Questions
Is migration to a repmgr / machine managed postgres going to be better than the current stolon/nomad setup? It seems a bit non-committal.

Capacity issues

Deploying to many regions and overprovisioning to handle load without having to autoscale seems like a reasonable workaround.

Questions
Can anything else be done?

v2 reliability

Questions

Is the proxy routing to only healthy instances yet? There was a post indicating that it may not.
Are there health checks to allow failed instances to restart?

Hopefully, there is something that can be done other than sit and hope for the best. Also, I’m super-curious whether most of the longer downtime was on single-node apps. If so, I’m simply going to ignore the entire thread on reliability. Thanks.

kurt · March 16, 2023, 2:04pm

You’re right about proxy issues being unavoidable. Even still, there are two kinds of outages:

1. Interruptions to already running applications
2. Deployment infrastructure downtime

Most of our big outages are #2. You will feel them if you try and deploy (we disabled deploys for a total of about 20 hours this week), but an app that’s already running won’t necessarily have issues.

The gray area is when apps crash and need to be rescheduled. People who had downtime without deploying yesterday were suffering from a failure to reschedule broken VMs.

The most reliable setup on Fly is:

A machines app with 2+ machines
New, repmgr-based Postgres
Running in 2+ regions
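That checklist can be expressed as a quick self-audit. The sketch below is illustrative only: the `Machine` type, the `redundancy_gaps` function, and the `postgres_manager` string are hypothetical names, not part of any Fly API, and the machine list would have to be filled in by hand or from whatever inventory you already keep.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    """Hypothetical record for one running machine of an app."""
    id: str
    region: str


def redundancy_gaps(machines: list[Machine], postgres_manager: str) -> list[str]:
    """List which of the three recommendations an app does not meet."""
    gaps = []
    if len(machines) < 2:
        gaps.append("fewer than 2 machines")
    if len({m.region for m in machines}) < 2:
        gaps.append("fewer than 2 regions")
    if postgres_manager != "repmgr":
        gaps.append("Postgres is not repmgr-based")
    return gaps


# Example: two machines in one region, still on Stolon-based Postgres.
example = [Machine("m1", "fra"), Machine("m2", "fra")]
print(redundancy_gaps(example, "stolon"))
```

An empty list means the app matches all three points. Anything else shows where a single failure could still take the app down.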

Your Stolon-based Postgres will have issues if our shared Consuls fail. We’ve managed to mitigate this, mostly, so it’s been pretty rare recently. It’s still a brittle setup, but not as dire as it used to be.

We’re working on ways to migrate Postgres clusters from Stolon → repmgr. It’s a pain to move right now. Possible if you want to, but painful.

Most smaller issues affect single instance apps.

Those reboots on the Postgres instances are concerning. If you haven’t already asked support to look at your DB cluster, that would be worthwhile. We’re happy to look at what’s happening and see if there’s a reasonable fix.

