Hey everyone,
A little while back, I talked about improving deploy recovery in flyctl. To summarize that post, we worked on improving flyctl's ability to handle intermittent failures during deploy orchestration. This post is more of an update on how that’s going.
The numbers, Mason!
Thanks to the amazing work from the folks on the deployments team, we have pretty good insight into how successful deploys are (and why they fail).
This is the “platform failure rate” from the past two weeks. These are cases where deploys fail because some part of the platform failed. As you can see, the error rate dropped from around 4.1% two weeks ago (before the recovery changes were merged in) to around 2.8% as of writing! The recovery changes were merged about a week and a half ago, but at first only 5% of users had recovery enabled by default. As you can see on the graph, we slowly brought that number up to 100%, and the results speak for themselves!
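For a sense of scale, here's a quick back-of-the-envelope calculation of what that drop means in relative terms. It only uses the two approximate figures quoted above (4.1% and 2.8%). The helper name `relative_reduction` is just something made up for this sketch, not anything from flyctl or our metrics tooling.

```python
def relative_reduction(before_pct: float, after_pct: float) -> float:
    """Return the relative drop from before_pct to after_pct, as a percentage."""
    return (before_pct - after_pct) / before_pct * 100


# Approximate platform failure rates quoted in this post.
before, after = 4.1, 2.8

print(f"absolute drop: {before - after:.1f} percentage points")
print(f"relative drop: {relative_reduction(before, after):.1f}%")
```

So a drop of about 1.3 percentage points works out to roughly a third fewer platform-caused deploy failures, give or take the rounding in the figures above.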
Where are we going from here?
There’s still a good amount of work to be done to get this number even lower. Remote builder issues are now a large part of why deploys fail. We’re doing things like introducing support for Depot to help mitigate these issues, but improving our current remote builder support is still a priority. There are also platform errors that we could recover from but currently don’t. We’re actively looking into those now, and we should see further improvements in the future.
Why am I saying all this?
Being transparent about this kind of work is important! Saying that we’re working on improving the reliability of the platform is important, but so is showing it!