Hm… Perhaps you could say a little more about how you’re defining/measuring “fork time”? My understanding is that the volume should be usable while it’s still in the hydrating state, for example.
If that’s volume capacity and not the amount actually used inside, then the lack of difference doesn’t really seem surprising. I’d expect the hydration time to depend primarily on the number of dirty blocks, since those are what have to be transferred…
Generally, the Fly.io platform is geared toward doing things like this in advance. In other words, having a certain number of these volumes and Machines already created, before there’s demand from your users.
(Your performance question is still valid, either way, of course.)
@cam I don’t see a mention in the docs of forks being instant because of COW; in fact, it says they are not instant. Could you point me to exactly where the docs say that?
fwiw, I think you’re seeing a 1s fork time in the unmounted state because (afaik) we don’t actually fork volumes that have never been used; we just set up a new one.
For the 18s fork time, is that until it’s mountable (hydrating) or until it’s fully created? As @mayailurus mentioned, the volume is usable in the hydrating state, and that’s what the “immediate” note in the docs refers to.
I have a running machine with an attached volume. I want to fork the volume, so I can create another machine from the same volume point / “snapshot” (assuming it literally is a snapshot under the hood). Fork time is time from running the volume fork command to the time when the volume is hydrating / usable.
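For reference, here’s roughly how I’m measuring it: a small Python sketch that times how long it takes for a condition to become true. The `volume_is_usable` helper is my own assumption about the flyctl JSON output (it shells out to `fly volumes list --json` and checks the state field), so treat the field names as unverified.

```python
import json
import subprocess
import time

def time_until(predicate, poll_interval=0.5, timeout=120):
    """Poll predicate() until it returns True; return elapsed seconds."""
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if predicate():
            return time.monotonic() - start
        time.sleep(poll_interval)
    raise TimeoutError("condition not reached within timeout")

def volume_is_usable(app, vol_id):
    """Assumed check: ask flyctl for the volume state
    (hydrating counts as usable, per the docs)."""
    out = subprocess.run(
        ["fly", "volumes", "list", "--app", app, "--json"],
        capture_output=True, text=True, check=True,
    ).stdout
    for vol in json.loads(out):
        if vol["id"] == vol_id:
            return vol["state"] in ("hydrating", "created")
    return False

# Usage (assumes flyctl is installed and authenticated):
# subprocess.run(["fly", "volumes", "fork", SOURCE_VOL, "--app", APP], check=True)
# elapsed = time_until(lambda: volume_is_usable(APP, NEW_VOL))
```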
Yes, my use case is definitely abnormal. I’m running CLI-based coding agents like Claude Code, and I want to be able to “branch out” trajectories or tasks to do best-of-n. For minimal latency, I was hoping to let users define sandbox templates so tasks can start up very quickly. So a user might have many volume templates and would then want to create n machines, each with a fork of a volume template.
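To make the branch-out step concrete, here’s a sketch under my assumptions: fork the template volume n times, then create one Machine per fork via the Machines API. The payload shape follows the Machines API create-machine docs, but names like `APP` and `TOKEN` are placeholders and the loop is illustrative, not tested against the real API.

```python
def machine_config(image, volume_id, mount_path="/data"):
    """Build a Machines API create-machine payload that mounts a forked
    volume (POST /v1/apps/{app}/machines, per the Machines API docs)."""
    return {
        "config": {
            "image": image,
            "mounts": [{"volume": volume_id, "path": mount_path}],
        }
    }

# Hypothetical branch-out loop; fork ids would come from `fly volumes fork`:
# for vol_id in forked_volume_ids:
#     requests.post(
#         f"https://api.machines.dev/v1/apps/{APP}/machines",
#         headers={"Authorization": f"Bearer {TOKEN}"},
#         json=machine_config("registry.fly.io/my-agent:latest", vol_id),
#     )
```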
I know there are lots of sandbox providers, but I was interested in Fly because most sandbox providers assume a sandbox lifecycle of under ~10 minutes, whereas Fly Machines feel more durable and offer more control through the Machines API.
Sorry, I should have been more specific. I was referencing the “Forked volumes are usable immediately” quote from the Volume Forking blueprint (here: Using Fly Volume forks for faster startup times · Fly Docs ). In my tests, getting to the hydrating state, even for a volume with < ~100MB used, was taking between 14 and 32s, with an average of ~18s. That seemed far from “usable immediately,” so I just wanted to check whether I was doing something wrong or my expectations were too high. More clarity on timing, maybe some simple benchmarks, would be really helpful when evaluating providers for agentic coding use cases.
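For what it’s worth, those numbers are just min/mean/max over repeated fork runs, along these lines (the sample values here are illustrative, not my actual data):

```python
import statistics

def summarize(durations):
    """Summarize fork-time samples (seconds) as min/mean/max."""
    return {
        "min": min(durations),
        "mean": statistics.mean(durations),
        "max": max(durations),
    }

# e.g. summarize(samples) over however many fork runs you collect
```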
Curious whether you might have any guidance for using Fly in that context. I know Anthropic recommends Fly for hosting Claude Code / the Claude Agent SDK, which was one of the primary reasons we’re testing it out.
Interesting… I think Fly.io’s new Sprites are intended to be their main answer in that niche, although they’re fairly early in development. (Forking is still on the to-do list, for example.)
The docs for them have timing expectations for the existing snapshot and rollback operations, and those at least sound more in line with what you were looking for:
Checkpoints are live and fast. Creating one takes about 300ms
Restores are fast—typically completing in under a second
(Although, I would have expected the traditional platform’s volume forking to be much quicker in reaching the hydrating state, to be honest. Maybe an early onset of the current management plane outage was slowing that down. Sometimes problems gradually build for a while before the actual meltdown…)