This is a weird one! First, we’ll always refund charges you didn’t mean to incur. Just email billing@fly.io (or use your support email if you have paid support).
Our knee-jerk best guess at the moment is that something within the VM is consuming all the CPU at realtime priority and preventing anything else from running. I can force something like this with a fork bomb. The setTimeout never fires because the event loop is waiting for CPU.
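To make the timer symptom concrete, here is a minimal sketch that measures how late a setTimeout callback fires when the CPU is not available to the event loop. It is only an illustration: it starves the loop with a synchronous busy loop inside the same process, which is a different cause from the one guessed at above (something else in the VM hogging the CPU), and the names measureTimerLag, busyWait and demo are made up for this example.

```typescript
// Resolve with how many milliseconds late a timer fired, compared with
// the delay that was requested.
function measureTimerLag(delayMs: number): Promise<number> {
  const scheduled = Date.now();
  return new Promise((resolve) => {
    setTimeout(() => resolve(Date.now() - scheduled - delayMs), delayMs);
  });
}

// Hog the CPU synchronously for roughly `ms` milliseconds.
function busyWait(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // spin: nothing else on this thread can run, including timer callbacks
  }
}

async function demo(): Promise<number> {
  const lag = measureTimerLag(10); // timer is due in 10 ms
  busyWait(200);                   // but the thread is busy for about 200 ms
  return lag;                      // settles only once the loop gets CPU again
}
```

The point is that a timeout living inside the starved process cannot be relied on to fire on time, which is why the stop has to be driven from somewhere else.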
Your Machines also seem to be doing a huge amount of write IO, around 20GB/s in aggregate. I think this could be related.
I don’t think we have any tooling that will help here. If the stuff in the Machine keeps running, we basically “trust” that it should be.
What I’d probably do is register an external watchdog. We have an example coordinator in a demo Bash functions-as-a-service project that manages stops from outside the Machines, which is actually necessary if you can’t trust the code: superfly/bfaas (Bash functions-as-a-service) on GitHub.
A simple way to handle this might be to put a proxy between the user and the Machine that does its own time-based cancellation, then sends a stop request through the API. If the stop doesn’t happen gracefully, we’ll kill it much more dramatically after the timeout.
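A rough sketch of that proxy-side logic, assuming it runs outside the Machine it is watching. The helper names (runWithDeadline, makeFlyStop) are hypothetical, and the stop URL is my assumption about the shape of the Machines API stop endpoint, so check it against the API docs before relying on it.

```typescript
type StopFn = () => Promise<void>;

// Sentinel used to tell "the deadline won the race" apart from a real result.
const TIMED_OUT = Symbol("timed out");

// Wait for `work`, but give up after `timeoutMs` and ask for a stop.
async function runWithDeadline<T>(
  work: Promise<T>,
  timeoutMs: number,
  stop: StopFn,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<typeof TIMED_OUT>((resolve) => {
    timer = setTimeout(() => resolve(TIMED_OUT), timeoutMs);
  });
  try {
    const winner = await Promise.race([work, deadline]);
    if (winner === TIMED_OUT) {
      await stop(); // request the stop from outside the Machine
      throw new Error(`no response within ${timeoutMs} ms; stop requested`);
    }
    return winner as T;
  } finally {
    if (timer !== undefined) clearTimeout(timer); // don't leave the timer pending
  }
}

// Build a StopFn that POSTs a stop request for one Machine.
// ASSUMPTION: base URL and path are illustrative, not confirmed by the post.
function makeFlyStop(app: string, machineId: string, token: string): StopFn {
  return async () => {
    const url = `https://api.machines.dev/v1/apps/${app}/machines/${machineId}/stop`;
    const res = await fetch(url, {
      method: "POST",
      headers: { Authorization: `Bearer ${token}` },
    });
    if (!res.ok) throw new Error(`stop request failed: HTTP ${res.status}`);
  };
}
```

In a proxy you would wrap the upstream request in something like runWithDeadline(forwardToMachine(req), 30_000, makeFlyStop(app, id, token)), where forwardToMachine stands in for whatever your proxy uses to call the Machine. Because the timer lives in the proxy process, it still fires when everything inside the Machine is starved of CPU.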