I just noticed a service was down, so I went to take a look, and it is listed as “not deployed” in my dashboard. Why would this happen, and how do we prevent it from ever happening again?
We’ll take a look. Could you post or DM your app’s name?
better-cart-grafana
You should be good now. I’m sorry about that. Some fraud apps were crushing the CPU on the host your app was deployed to. We’re still investigating though.
I am worried about some of the more critical systems we are planning on moving over to Fly in the next few months. Why would this affect our VMs? This seems like a pretty bad solution to fraudulent apps.
I was also only able to get this back up and running by deploying a new version; fly restart did absolutely nothing.
First off, fraud shouldn’t impact you. That’s our challenge, not yours. I only mentioned it because transparency is good. However, the same thing would have happened if the host had a network or hardware failure.
If your app is critical, you need at least 2 VMs running so volumes are spread across “availability zones”. This is how we achieve high availability with our Postgres apps.
Today the host didn’t actually fail. It was just under high load and needed to evict some apps to keep others running. Priority is calculated by VM size, volumes, count, etc., so “larger” production apps have a higher priority than “smaller” hobby apps. Apps with lower priority are evicted only when necessary, and as far as I can tell, your app had a lower priority and was the only unlucky one to be evicted. Once CPU returned to normal, the host had room to launch it again.
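To make the eviction behavior described above a bit more concrete, here is a rough toy model in Python. It is not the actual scheduler: the `App` fields, the weights in `priority`, and the load numbers are all made up for illustration. The thread only says that VM size, volumes, and count feed into priority.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    vm_count: int     # number of VMs the app runs in total
    memory_mb: int    # per-VM memory size
    volumes: int      # number of attached volumes
    cpu_share: float  # fraction of this host's CPU the app is using


def priority(app: App) -> float:
    # Hypothetical weights: "larger" apps score higher.
    # The real formula is not given in this thread.
    return app.memory_mb / 256 + 2 * app.volumes + 3 * app.vm_count


def evict_until_healthy(apps, host_load, target_load):
    """Evict the lowest-priority apps first, stopping as soon as the
    host load is at or below target_load."""
    evicted = []
    for app in sorted(apps, key=priority):
        if host_load <= target_load:
            break
        host_load -= app.cpu_share
        evicted.append(app.name)
    return evicted, host_load


apps = [
    App("prod-api", vm_count=3, memory_mb=2048, volumes=1, cpu_share=0.3),
    App("hobby-grafana", vm_count=1, memory_mb=256, volumes=1, cpu_share=0.1),
]
# The host is slightly over capacity, so only the smallest app is evicted.
evicted, load = evict_until_healthy(apps, host_load=1.05, target_load=1.0)
```

In this sketch a single small app is enough to bring the host back under its target, which matches the "only unlucky one" outcome described above. Raising the app's VM count raises its score, so it would be evicted later.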
The system worked as designed, however annoying it might be. On our end, we’re investigating why we didn’t respond to the high CPU issue earlier and will fix as needed. And as usual, we’ll continue working on capacity planning and fraud prevention. And on your end, deploy critical apps with multiple VMs to withstand hardware failures and increase the priority.
Again, I’m sorry this happened and I hope that explains it a bit.
I can understand the concept of priority, but an app should never be evicted without a new home being created for it where resources are available. That just seems odd, and if that is how the system is designed, it seems to be a pretty big flaw that apps would just be evicted and not boot somewhere else in the meantime while their original host is experiencing load issues.
As for high availability, we will make sure there are always VMs running in multiple regions.
Thanks,
Dan
The problem is that your app only had one volume so it had nowhere else to go. We’re not happy with this though. It’s one of the many limitations with our current scheduler, but we’re working on a replacement. Eventually volumes will be able to migrate to another host so an app can launch there instead of dying.
Multiple VMs in a single region are still balanced across hardware. You just need more than 1.
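As a rough illustration of why a second VM helps, here is a small Python sketch. It assumes, as described above, that a VM with a volume can only run on the host holding that volume, because volumes cannot migrate yet. The function name and the host names are hypothetical.

```python
def surviving_vms(vm_hosts, evicting_host):
    """Return the VMs that keep running when one host evicts the app.

    vm_hosts maps a VM id to the host that holds its volume. A VM on
    the evicting host cannot restart elsewhere in this model, because
    its volume stays behind on that host.
    """
    return [vm for vm, host in vm_hosts.items() if host != evicting_host]


single_vm = {"vm-1": "host-a"}
two_vms = {"vm-1": "host-a", "vm-2": "host-b"}

down = surviving_vms(single_vm, "host-a")    # nothing left running
still_up = surviving_vms(two_vms, "host-a")  # vm-2 keeps serving
```

With one VM the app goes fully down until its host has room again. With two VMs on different hardware, one of them stays up.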