Machines take 3-8s to start when hitting them during shutdown

I’m currently doing some experiments with fly machines, volumes and SQLite.
Machines run perfectly with a volume attached and SQLite on it: super smooth, and performance is fantastic.

Startup time is also super. When the machine is online, latency is around 10-15ms.
When the machine is shut down completely, the latency to wake up and respond from the SQLite DB is around 400-600ms, which I think is amazing!

However, there’s one glitch. After my application exits due to inactivity, there’s a window of a few seconds during which waking the machine takes 3-8s.

Here’s an example log:

2022-09-09T06:43:44.117 app[9080177b165987] fra [info] No requests for 10 seconds, shutting down...
2022-09-09T06:43:49.202 runner[9080177b165987] fra [info] machine exited with exit code 0, not restarting

The first message is from my code, after that I’m exiting with code 0.
It then takes roughly 5 seconds until the runner confirms the machine is exited.
If you send a request to the machine in the window between the process exiting and the runner reporting the exit, it takes 3-8s to get the machine up and running again.

Is there a way you (or I) can reduce the latency to around 500ms, even when the process exited because of inactivity? Random 7s latency is not ideal.

To reproduce, create a machine with a volume attached. Make the machine exit 10s after the last HTTP request. Make a request right after the machine logs that it’s exiting.
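A minimal sketch of the idle-exit behaviour from the repro steps, in Python purely for illustration (the actual application code isn't shown in this thread). The names `IdleTracker`, `make_handler` and `serve_until_idle`, the port, and the polling interval are all hypothetical; only the 10-second timeout and the log message come from the log above.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

IDLE_TIMEOUT_SECONDS = 10.0  # the 10 s inactivity window from the log above


class IdleTracker:
    """Remembers when the last request arrived so a watchdog can detect inactivity."""

    def __init__(self, timeout, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock
        self._lock = threading.Lock()
        self._last = clock()

    def touch(self):
        # Called on every request to reset the idle window.
        with self._lock:
            self._last = self._clock()

    def is_idle(self):
        with self._lock:
            return self._clock() - self._last >= self._timeout


def make_handler(tracker):
    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            tracker.touch()
            body = b"ok\n"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return Handler


def serve_until_idle(port=8080, timeout=IDLE_TIMEOUT_SECONDS, poll=0.5):
    """Serve HTTP until no request has arrived for `timeout` seconds, then return."""
    tracker = IdleTracker(timeout)
    server = HTTPServer(("0.0.0.0", port), make_handler(tracker))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    while not tracker.is_idle():
        time.sleep(poll)
    print(f"No requests for {timeout:g} seconds, shutting down...")
    server.shutdown()      # stop the serve_forever loop
    server.server_close()  # release the listening socket
```

Calling `serve_until_idle()` as the machine's main process and letting the program end afterwards gives an exit code of 0, which is the point at which the delayed wake-up can be triggered by sending a request.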

Hi @Jens, we took a look at the logs to see what could be causing this delay, because it’s not typical for the machine to take 5s to exit. Our init sees the process exit 0 and then begins a cleanup process. For machines with a volume, one of those cleanup steps is to unmount any volumes in use. Based on what you’ve described and the machine config, it looks like even after the main process exits (as you noted above with the shutting down... log line), something is still writing to the /data directory, which prevents our init from cleanly unmounting it in a timely manner.
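One way an application can avoid leaving writes in flight on the volume is to close its SQLite handles explicitly before exiting. The following is only a sketch under the assumption that the process holds a SQLite connection to a database on /data; the thread does not say what was actually writing there, and `close_cleanly` is a hypothetical helper name.

```python
import sqlite3


def close_cleanly(conn):
    """Flush pending work and release the database file before the process exits."""
    conn.commit()
    # In WAL mode this folds the write-ahead log back into the main database
    # file and truncates it; in other journal modes it does nothing harmful.
    conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
    conn.close()
```

Calling this just before the process exits means the database file is no longer open when the volume is unmounted.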

I’ve updated the code to make sure that the main process waits until the sub-process (something) is killed before exiting with 0. This seems to have decreased the shutdown time to ~3s. So, if you request the service within the shutdown window, latency is now between 2-4s. Is this the optimum? Am I doing something wrong, or could I improve my code further? Or is this simply something we have to accept when using machines (with a volume)?
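For anyone following along, the "wait for the sub-process before exiting" change can be sketched roughly like this. It is an illustration in Python, not the code actually used here; `stop_child`, `shutdown` and the 2-second grace period are assumptions.

```python
import subprocess
import sys


def stop_child(proc, grace_seconds=2.0):
    """Ask the child to stop, wait for it, and force-kill it if it lingers."""
    if proc.poll() is None:
        proc.terminate()  # SIGTERM on POSIX
        try:
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            proc.kill()   # SIGKILL on POSIX
            proc.wait()   # reap the child so nothing is left running
    return proc.returncode


def shutdown(proc):
    """Exit with code 0 only after the child is confirmed gone."""
    stop_child(proc)
    sys.exit(0)
```

The main process calls `shutdown(proc)` once the inactivity timeout fires, so exit code 0 is only reported after the child can no longer touch the volume.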

It’s hard to say what optimum is tbh, and right now most (or all) of our optimizations are in the hot path for starting a machine. One thing we do today for all machines is reset the VM when it is stopped, which requires us to perform a containerd snapshot. Depending on the load and the size of the image, that can sometimes take 100-200ms. Eventually, we want to support pause/resume of machines, where we skip most of what is done today. That should hopefully get stop/starts to be sub-1s, but it’s not something we’re actively working on atm.

Ok cool.
Just a hint on what I’m doing:

I’d love to collab in the future to take the story of SQLite on fly even further…