Sometimes machines hang and don't shut themselves down (and I get billed for it)

kurt · January 9, 2025, 10:21pm

This is a weird one! First, we’ll always refund stuff you didn’t mean to spend. Just email billing@fly.io (or use your support email if you have paid support).

Our knee-jerk best guess at the moment is that something within the VM is consuming all the CPU with realtime priority and preventing anything else from running. I can force something kind of like this with a fork bomb. The setTimeout never fires because the event loop is still waiting for CPU.

Your Machines seem to do a huge amount of write IO, like 20GB/s in aggregate. I think this could be related.

I don’t think we have any tooling that will help here. If the stuff in the Machine keeps running, we basically “trust” that it should be.

What I’d probably do is set up an external watchdog. We have an example coordinator in bfaas, a demo Bash functions-as-a-service project, that manages stops from outside the Machines, which is actually necessary if you can’t trust the code: superfly/bfaas on GitHub.
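To make the watchdog idea concrete, here is a rough sketch, not the bfaas coordinator itself. It assumes the watchdog keeps its own record of when each Machine started (the `started_at` dict is hypothetical bookkeeping), that the Machines API is reachable at `https://api.machines.dev/v1`, and that a POST to the Machine's `/stop` path with a bearer token requests a stop. Check those details against the current API docs before relying on them.

```python
import json
import time
import urllib.request

# Assumed base URL for the Machines API; adjust if yours differs.
API_BASE = "https://api.machines.dev/v1"


def build_stop_request(app: str, machine_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the Machine's stop endpoint."""
    url = f"{API_BASE}/apps/{app}/machines/{machine_id}/stop"
    return urllib.request.Request(
        url,
        data=json.dumps({}).encode(),
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        method="POST",
    )


def overdue(started_at: dict, max_runtime_s: float, now: float) -> list:
    """Return the IDs of Machines that have run longer than max_runtime_s."""
    return sorted(mid for mid, t0 in started_at.items() if now - t0 > max_runtime_s)


def _send(request: urllib.request.Request) -> int:
    """Default sender: perform the HTTP call and return the status code."""
    with urllib.request.urlopen(request, timeout=10) as resp:
        return resp.status


def sweep(app, token, started_at, max_runtime_s, now=None, send=_send):
    """Request a stop for every overdue Machine; return the IDs that were asked to stop."""
    now = time.time() if now is None else now
    stopped = []
    for mid in overdue(started_at, max_runtime_s, now):
        send(build_stop_request(app, mid, token))
        stopped.append(mid)
    return stopped
```

The important part is that `sweep` runs somewhere other than the Machine it is watching, for example on a timer in a small always-on process, so a starved event loop inside the VM can't prevent the stop from being requested.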

A simple way to handle this might be to put a proxy in between the user and the Machine that does its own time-based cancellation and then sends a stop request through the API. If the stop doesn’t happen gracefully, we’ll kill it much more dramatically after the timeout.
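As a sketch of what the proxy's time-based cancellation could look like, assuming the proxy forwards each request through some callable and has another callable that asks the API to stop the Machine. Both `forward` and `stop_machine` are hypothetical names, and `slow_backend` is only a stand-in for a hung request.

```python
import concurrent.futures as cf
import time


def call_with_deadline(forward, stop_machine, deadline_s):
    """Run forward() in a worker thread; if it overruns, request a stop.

    Returns (finished, result). The worker thread is not killed on timeout;
    stopping the Machine is what actually ends the work on the other side.
    """
    pool = cf.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(forward)
    try:
        return True, future.result(timeout=deadline_s)
    except cf.TimeoutError:
        stop_machine()  # e.g. send the stop request through the Machines API
        return False, None
    finally:
        # Don't block the proxy waiting for the overrunning call to finish.
        pool.shutdown(wait=False)


def slow_backend():
    """Stand-in for a proxied request that hangs (hypothetical)."""
    time.sleep(0.5)
    return "late"
```

The deadline lives in the proxy, so it still fires even if nothing inside the Machine can get scheduled.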
