First off, fraud shouldn’t impact you. That’s our challenge, not yours. I only mentioned it because transparency is good. However, the same thing would have happened if the host had a network or hardware failure.
If your app is critical, you need at least 2 VMs running so volumes are spread across “availability zones”. This is how we achieve high availability with our Postgres apps.
Today the host didn’t actually fail; it was just under high load and needed to evict some apps to keep others running. Priority is calculated from VM size, volumes, count, etc., so “larger” production apps have a higher priority than “smaller” hobby apps. Apps with lower priority are evicted only when necessary, and as far as I can tell, your app had a lower priority and was the only unlucky one to be evicted. Once CPU returned to normal, the host had room to launch it again.
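To make the eviction idea concrete, here is a toy sketch in Python. It is not the actual scheduler: the `App` fields, the weights in `priority`, and the 0.9 load threshold are all made up for illustration. The only part taken from the explanation above is the general shape: lower-priority apps go first, and eviction stops as soon as the host is back under its limit.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    vm_cpus: int      # hypothetical stand-in for "VM size"
    vm_count: int     # hypothetical stand-in for "count"
    volumes: int      # number of attached volumes
    cpu_load: float   # fraction of host CPU this app uses on this host


def priority(app: App) -> int:
    # Invented weights: bigger, more replicated apps score higher.
    return app.vm_cpus * 10 + app.vm_count * 5 + app.volumes * 3


def evict_until_healthy(apps, host_load, threshold=0.9):
    """Evict lowest-priority apps first, only while the host is overloaded."""
    evicted = []
    for app in sorted(apps, key=priority):
        if host_load <= threshold:
            break  # host has recovered, so nothing else is evicted
        host_load -= app.cpu_load
        evicted.append(app.name)
    return evicted, host_load


apps = [
    App("hobby", vm_cpus=1, vm_count=1, volumes=0, cpu_load=0.15),
    App("prod", vm_cpus=4, vm_count=2, volumes=2, cpu_load=0.50),
    App("mid", vm_cpus=2, vm_count=1, volumes=1, cpu_load=0.30),
]
evicted, load = evict_until_healthy(apps, host_load=0.95)
print(evicted)
```

With these made-up numbers, only the smallest app is evicted, because removing it is enough to bring the host back under the threshold.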
The system worked as designed, however annoying that might be. On our end, we’re investigating why we didn’t respond to the high CPU issue earlier and will fix that as needed. As usual, we’ll also continue working on capacity planning and fraud prevention. On your end, deploy critical apps with multiple VMs so they can withstand hardware failures and get a higher priority.
Again, I’m sorry this happened and I hope that explains it a bit.