Some of you have encountered capacity errors when deploying your app, such as “no capacity available in fra” or “insufficient memory available to fulfill request”. They look something like this:
A few months ago we shared why these errors were happening and how to work around them in the short term. Since then we’ve made improvements to significantly reduce them.
We Got More (And Better) Hardware
We’ve purchased more hardware. Like a lot more. It’s coming online now across all our regions. More hosts == more capacity == fewer errors.
We’ve also had to tell some hardware that it’s time to go (RIP). As we’ve grown our platform, we’ve tried out several different hardware configurations, including different CPU models, memory sizes, and disk layouts. We’ve learned that some perform worse than others - for example, a few hosts in our fleet were configured with a particular RAID10 disk setup that made volume operations slower than acceptable. We disabled these hosts and are migrating everything to better-behaved hosts. We’re proud to say the problem children are on their way out (incidentally, if anyone is looking for some RAID10 hosts, check out our eBay marketplace).
We Can Migrate Machines Now
When a machine is created, it is placed on a single host. Until recently, we weren’t able to move it anywhere else without destroying it and creating a new machine from scratch. This meant that when Fly Proxy tried to autostart a machine on a full host for a new incoming request, it would fail. And when hosts became full, we didn’t have many tools to recover capacity without a lot of manual surgery.
To fix this, we’ve recently developed the magical ability to migrate machines from one host to another. Now when we autostart a machine, we can automatically move it to a different host in the same region if the host it’s on is full. We’re still testing this out, but you should start seeing improvements as we roll it out fleet-wide.
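To make the autostart fallback concrete, here’s a minimal sketch of the idea, not our actual implementation. The `Host`, `Machine`, and `autostart` names, the slot-based capacity model, and the "pick the first host with room" rule are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    region: str
    capacity: int  # hypothetical unit: number of machine slots
    running: set = field(default_factory=set)

    def has_room(self):
        return len(self.running) < self.capacity


@dataclass
class Machine:
    id: str
    host: Host


class NoCapacity(Exception):
    """Raised when no host in the region can take the machine."""


def autostart(machine, hosts):
    """Start a stopped machine, moving it first if its current host is full."""
    if not machine.host.has_room():
        # Look for another host in the same region with spare room.
        candidates = [
            h for h in hosts
            if h.region == machine.host.region
            and h is not machine.host
            and h.has_room()
        ]
        if not candidates:
            raise NoCapacity(f"no capacity available in {machine.host.region}")
        # Stand-in for the real migration step, which is far more involved.
        machine.host = candidates[0]
    machine.host.running.add(machine.id)
    return machine.host


full = Host("host-1", "fra", capacity=1, running={"other-machine"})
spare = Host("host-2", "fra", capacity=1)
machine = Machine("m1", full)
print(autostart(machine, [full, spare]).name)
```

In this toy model the machine ends up on the host with room instead of failing. A capacity error is only raised when every host in the region is full.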
We Fixed Like A Bunch of Bugs
We spent some quality time chasing down and investigating bugs that led to capacity errors. Here’s a sampling of a few we thought were interesting.
- Here’s a good one: when you create a new machine on Fly.io, we reach out to our scheduler, flyd, on every host in the desired region to query how much free CPU and memory it has. Each host is then scored using a weighted average of how available each resource is. Resources that are fully available score a 10, resources that are 75% free score a 7.5, resources that are fully used up score a 0 - you get the idea. We found that we were ignoring scores of 0, which meant some new machines got placed on busy hosts. Fixing this bug made a large number of capacity errors disappear.
- We made flyctl retry certain commands automatically when they hit transient errors, instead of making you retry by hand. We’ve also updated the error messages to better explain what your options are if you do encounter an error.
- Some deleted or updated machines were holding on to their CPU and memory, which made capacity errors worse. Cleaning up those invalid machines significantly freed up capacity on our hosts - up to 50% in some cases!
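To illustrate the host-scoring bug from the first item above, here’s a small sketch of one way "ignoring scores of 0" could send machines to busy hosts. This is not flyd’s code: the equal weights, the `Host` type, and the exact form of the bug (dropping zero scores before averaging) are assumptions for the sake of the example.

```python
from dataclasses import dataclass

# Hypothetical weights; the real weighting of each resource may differ.
WEIGHTS = {"cpu": 0.5, "memory": 0.5}


@dataclass
class Host:
    name: str
    free_fraction: dict  # resource name -> fraction free, 0.0 to 1.0


def resource_scores(host):
    """Map each resource to a 0-10 score: fully free is 10, fully used is 0."""
    return {res: 10.0 * host.free_fraction[res] for res in WEIGHTS}


def score_buggy(host):
    """Weighted average that skips scores of 0 (an assumed form of the bug)."""
    scores = {r: s for r, s in resource_scores(host).items() if s > 0}
    total_weight = sum(WEIGHTS[r] for r in scores)
    if total_weight == 0:
        return 0.0
    return sum(WEIGHTS[r] * s for r, s in scores.items()) / total_weight


def score_fixed(host):
    """Weighted average over every resource, zeros included."""
    scores = resource_scores(host)
    return sum(WEIGHTS[r] * s for r, s in scores.items()) / sum(WEIGHTS.values())


def pick_host(hosts, score):
    """Choose the highest-scoring host."""
    return max(hosts, key=score)


hosts = [
    Host("host-a", {"cpu": 0.0, "memory": 1.0}),    # CPU fully used
    Host("host-b", {"cpu": 0.75, "memory": 0.75}),  # 75% free on both
]

print(pick_host(hosts, score_buggy).name)  # host-a: the zero CPU score is dropped
print(pick_host(hosts, score_fixed).name)  # host-b: the zero drags host-a down to 5
```

With the zero dropped, host-a is scored on memory alone and looks like a perfect 10, even though it has no CPU left. Counting the zero brings it down to 5, below host-b’s 7.5.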
We haven’t eliminated every source of capacity issues - and sometimes a particular region or host is just out of capacity and behaving correctly! - but we hope you’re feeling the impact of the changes we’ve made. Let us know your thoughts below.