Thanks for the transparency. Now it would be really nice to figure out how to minimize downtime on the platform given the current state of affairs.
Current app state
I’ve been lucky so far: my monitoring shows only 18 minutes of downtime in the past 6 months.
- There are 4 regions and they have been overprovisioned to not require any autoscaling.
- There is a postgres cluster using stolon and from the monitoring logs, I can see that:
  - both hosts have been re-deployed in the past month
  - one has had some restarts
  - there are copious error messages for leader election.
- Deploys are every 10-20 days.
- The app is on the launch plan.
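To put the 18 minutes in perspective, here is a quick back-of-the-envelope calculation. It assumes "6 months" means roughly 182 days (my assumption), and the function name is just illustrative.

```python
def availability_pct(downtime_minutes: float, window_days: float) -> float:
    """Return uptime as a percentage of the observation window."""
    window_minutes = window_days * 24 * 60
    return 100.0 * (1.0 - downtime_minutes / window_minutes)


# Assumption: "6 months" is approximated as 182 days.
pct = availability_pct(downtime_minutes=18, window_days=182)
print(f"{pct:.3f}% availability")
```

Under that assumption this comes out to roughly 99.993%, which is why I say I've been lucky so far.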
Mitigation strategies
Edge proxy
It seems the edge proxy issues are completely unavoidable and have been happening because consul/nomad are being stretched past their limits. The new service, corrosion, is, well, new and will likely have hiccups.
So there is no mitigation strategy here other than not deploying too frequently, since the failures seem to happen mostly post-deploy.
Deployment
There seem to be issues with vault, docker timeouts, and service discovery.
Questions
From the various discussions, it appears that using the machines API, which is in beta, could potentially be more reliable. Is there any data that can be shared regarding:
- successful deploys / total deploys for v1 vs v2
- total app count on v1 vs v2
Are there any other mitigation strategies besides migrating to v2 or deploying less often?
Postgres
Questions
Is migration to a repmgr / machine-managed postgres going to be better than the current stolon/nomad setup? The guidance so far seems a bit non-committal.
Capacity issues
Deploying to many regions and overprovisioning to handle load without having to autoscale seems like a reasonable workaround.
Questions
Can anything else be done?
v2 reliability
Questions
- Is the proxy routing to only healthy instances yet? There was a post indicating that it may not be.
- Are there health checks to allow failed instances to restart?
Hopefully, there is something that can be done other than sitting and hoping for the best. Also, I’m super curious whether most of the longer downtime was on single-node apps, in which case I’m simply going to ignore the entire thread on reliability. Thanks.