Unrecoverable sprites after checkpoint restore, "failed to start overlay service tree"

Sometimes after one of our checkpoint restores, a sprite will become unrecoverable. The error often says “failed to start overlay service tree,” although the specifics of that error have changed a bit over time. The exact error counts and timelines below were gathered from our logging services with AI, but this first section (and the last section with questions) was written by me, Cliff, a human :slight_smile:

3 error messages:

(verbatim from Sprites API)

Primary (>90% of occurrences):
Failed to restore checkpoint: failed to start overlay service tree: failed to start container: process exited before initialization

Rare variant:
Failed to restore checkpoint: failed to start overlay service tree: failed to start container: cannot start process: shutdown in progress

Historical variant (March 2026 only, outside our current 30-day retention):
Failed to restore checkpoint: failed to start overlay service tree: failed to start overlay: failed to mount overlay: mount recovery failed: fresh mount failed: failed to format device: failed to format device: exit status 1, output: mke2fs 1.47.2 (1-Jan-2025)


Error occurrence counts:

(within our 30-day retention window)

| Error subtype | Events | Distinct sprites | Date range |
| --- | --- | --- | --- |
| process exited before initialization | 14 | 12 | Apr 23 – Apr 26 |
| shutdown in progress | 1 | 1 | May 15 |
| mke2fs / format device | 0 | – | Outside retention |
| Total | 15 | 13 | |
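
For anyone wanting to produce the same kind of breakdown from their own logs, here is a minimal sketch of bucketing the error messages into the subtypes above and counting events and distinct sprites. The `RestoreFailure` record shape and the function names are assumptions for illustration, not part of the Sprites API or of how the counts above were actually gathered.

```typescript
// Hypothetical log record shape; the field names are assumptions.
interface RestoreFailure {
  sprite: string;
  message: string;
}

type Subtype =
  | "process exited before initialization"
  | "shutdown in progress"
  | "mke2fs / format device"
  | "other";

// Map a verbatim error message to one of the subtypes listed above.
function classify(message: string): Subtype {
  if (message.includes("process exited before initialization")) {
    return "process exited before initialization";
  }
  if (message.includes("shutdown in progress")) {
    return "shutdown in progress";
  }
  if (message.includes("failed to format device")) {
    return "mke2fs / format device";
  }
  return "other";
}

// Count events and distinct sprites per subtype.
function tally(
  records: RestoreFailure[],
): Map<Subtype, { events: number; sprites: Set<string> }> {
  const out = new Map<Subtype, { events: number; sprites: Set<string> }>();
  for (const r of records) {
    const key = classify(r.message);
    const row = out.get(key) ?? { events: 0, sprites: new Set<string>() };
    row.events += 1;
    row.sprites.add(r.sprite);
    out.set(key, row);
  }
  return out;
}
```

Matching on the innermost part of the message keeps the buckets stable even if the outer wrapping of the error changes again.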

Example 1:

nori-crisp-tofu-16f9 — “process exited before initialization”

12 successful restores over ~23 hours, then fails non-deterministically on the 13th. Using @fly/sprites JS SDK, restoring from checkpoint v1 each time.

| Time (UTC) | Event | Result (duration) |
| --- | --- | --- |
| 2026-04-25T17:56:27Z | Base bootstrap started | – |
| 2026-04-25T17:57:52Z | Checkpoint v1 created | – |
| 2026-04-25T20:00:27Z | Restore from v1 | OK (13s) |
| 2026-04-25T22:04:27Z | Restore from v1 | OK (13s) |
| 2026-04-26T00:08:27Z | Restore from v1 | OK (19s) |
| 2026-04-26T02:18:28Z | Restore from v1 | OK (22s) |
| 2026-04-26T04:21:28Z | Restore from v1 | OK (24s) |
| 2026-04-26T06:25:32Z | Restore from v1 | OK (47s) |
| 2026-04-26T08:29:28Z | Restore from v1 | OK (19s) |
| 2026-04-26T10:32:28Z | Restore from v1 | OK (16s) |
| 2026-04-26T12:35:28Z | Restore from v1 | OK (17s) |
| 2026-04-26T14:38:28Z | Restore from v1 | OK (24s) |
| 2026-04-26T16:41:28Z | Restore from v1 | OK (19s) |
| 2026-04-26T18:45:28Z | Restore from v1 | OK (22s), last success |
| 2026-04-26T20:48:29Z | Restore from v1 | FAILED (62s to error) |

Error at 20:49:31Z:
Failed to restore checkpoint: failed to start overlay service tree: failed to start container: process exited before initialization

No further log activity for this sprite.


Example 2:

nori-savory-kamaboko-5a8d — “shutdown in progress”

Shows the cascade when a restore fails during active use. 4 successful restores, then the failure during an org script retry triggers the “shutdown in progress” variant, followed by the sprite becoming completely unreachable (404).

| Time (UTC) | Event |
| --- | --- |
| 2026-05-15T17:07:42Z | Base bootstrap started |
| 2026-05-15T17:10:06Z | Checkpoint v1 created |
| 2026-05-15T17:28:15Z | Restore from v1: OK (20s) |
| 2026-05-15T17:52:19Z | Restore from v1: OK (25s) |
| 2026-05-15T18:18:36Z | Restore from v1: OK (25s), session claimed |
| 2026-05-15T18:20:00Z | Restore from v1: OK (16s), org script retry path |
| 2026-05-15T18:20:16Z | Second restore initiated immediately afterward |
| 2026-05-15T18:20:21Z | Readiness probe: exec timeout (attempt 1/15) |
| 2026-05-15T18:20:28Z | Readiness probe: exec timeout (attempt 2/15) |
| 2026-05-15T18:20:30Z | FAILED: …cannot start process: shutdown in progress |
| 2026-05-15T18:20:30Z | Sprite returns 404: {"error":"sprite not found"} |
| 2026-05-15T18:20:30Z – 18:20:56Z | Readiness probes 3–15: all WebSocket error |
| 2026-05-15T18:20:56Z | Readiness exhausted after 15 attempts |
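
In the timeline above, probes 3 through 15 kept running for about 26 seconds after the sprite had already started returning 404. A rough sketch of a probe-loop decision that stops early in that case might look like the following. The `ProbeResult` shape, the `nextAction` name, and the idea that the probe can observe the 404 directly are all assumptions; the limit of 15 attempts is taken from the log.

```typescript
// Hypothetical outcome of a single readiness probe; not a Sprites SDK type.
type ProbeResult =
  | { kind: "ready" }
  | { kind: "timeout" }
  | { kind: "http_error"; status: number };

type Action = "done" | "retry" | "abort" | "exhausted";

// Decide what the probe loop should do after attempt number `attempt` (1-based).
function nextAction(
  result: ProbeResult,
  attempt: number,
  maxAttempts = 15,
): Action {
  if (result.kind === "ready") return "done";
  // A 404 means the sprite no longer exists, so further probes cannot succeed.
  if (result.kind === "http_error" && result.status === 404) return "abort";
  return attempt >= maxAttempts ? "exhausted" : "retry";
}
```

Keeping the decision in a pure function makes it easy to test separately from the networking code that runs the probes.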

Questions (back to human text):

  1. Just for my own curiosity, what is the “overlay service tree”? Inside the VM, or supporting infra outside of it? Any way to get more diagnostics around what is happening when this fails?
  2. We haven’t been able to recover one of these instances before. Is there any recovery path here?
  3. The historical mke2fs variant suggests a storage-layer root cause. Is this a known issue overall, and are all the variants related?

Both the examples above are from Fly.io org tilework-tech if anyone from the sprite / Fly team could investigate those sprite histories directly.


(I’m still interested in an answer to this, if any Sprites user sees the same issue or any Fly developer sees the thread!)

A Sprite’s filesystem is managed by a series of processes which depend on one another (this is the overlay service tree - it means a “process tree”). If one of these fails to start after a checkpoint restore, you get that error.

Do you have any sprites currently in that state so I can inspect one and get more details? The one you mentioned has been destroyed it seems.

Unfortunately not right now, this only happens ~2 times a month. I’ll see about adding an edge case to not clean these guys up when they fail with this message!
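
A minimal sketch of what that edge case could look like, assuming the app has its own teardown step after a failed restore. `shouldPreserveForDebugging`, `cleanupAfterFailedRestore`, and the `destroy` callback are hypothetical names standing in for whatever the application actually uses; nothing here is a Sprites SDK call.

```typescript
// Substring shared by all three error variants quoted earlier in the thread.
const OVERLAY_TREE_ERROR = "failed to start overlay service tree";

// True when a failed sprite should be kept for inspection instead of destroyed.
function shouldPreserveForDebugging(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err);
  return message.includes(OVERLAY_TREE_ERROR);
}

// Hypothetical cleanup hook: `destroy` is whatever teardown call the app uses.
async function cleanupAfterFailedRestore(
  spriteName: string,
  err: unknown,
  destroy: (name: string) => Promise<void>,
): Promise<"preserved" | "destroyed"> {
  if (shouldPreserveForDebugging(err)) {
    console.warn(`Keeping ${spriteName} for inspection: ${String(err)}`);
    return "preserved";
  }
  await destroy(spriteName);
  return "destroyed";
}
```

Matching on the shared substring rather than the full message means the guard would still catch all three variants seen so far.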