Sometimes after our checkpoint restores, a sprite will become unrecoverable. The error often says “failed to start overlay service tree,” although the specifics of that error have changed a bit over time. The exact error counts / timelines below were gathered from our logging services with AI, but this first section (and the last section with questions) is written by me, Cliff, a human.
3 error messages:
(verbatim from Sprites API)
Primary (>90% of occurrences):
Failed to restore checkpoint: failed to start overlay service tree: failed to start container: process exited before initialization
Rare variant:
Failed to restore checkpoint: failed to start overlay service tree: failed to start container: cannot start process: shutdown in progress
Historical variant (March 2026 only, outside our current 30-day retention):
Failed to restore checkpoint: failed to start overlay service tree: failed to start overlay: failed to mount overlay: mount recovery failed: fresh mount failed: failed to format device: failed to format device: exit status 1, output: mke2fs 1.47.2 (1-Jan-2025)
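For anyone bucketing these in their own logs, here is a minimal sketch of how the three messages above could be told apart by substring. The function name and the subtype labels are made up for illustration; they are not part of the Sprites API or the @fly/sprites SDK, and the matching relies only on the message text quoted above.

```typescript
// Hypothetical labels for the three variants quoted above (not an official taxonomy).
type RestoreFailure =
  | "process-exited-before-init"
  | "shutdown-in-progress"
  | "format-device"
  | "other-overlay-failure"
  | "unrelated";

// Classify an error message by the substrings seen in the verbatim errors.
function classifyRestoreError(message: string): RestoreFailure {
  // All three variants share this prefix segment.
  if (!message.includes("failed to start overlay service tree")) {
    return "unrelated";
  }
  if (message.includes("process exited before initialization")) {
    return "process-exited-before-init";
  }
  if (message.includes("shutdown in progress")) {
    return "shutdown-in-progress";
  }
  if (message.includes("failed to format device")) {
    return "format-device";
  }
  // Overlay service tree failure with wording not seen so far.
  return "other-overlay-failure";
}
```

Matching on the shared "overlay service tree" segment first means a new wording of the same failure still lands in a visible bucket instead of being dropped.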
Error occurrence counts:
(within our 30-day retention window)
| Error subtype | Events | Distinct sprites | Date range |
|---|---|---|---|
| process exited before initialization | 14 | 12 | Apr 23 – Apr 26 |
| shutdown in progress | 1 | 1 | May 15 |
| mke2fs / format device | 0 | – | Outside retention |
| Total | 15 | 13 | – |
Example 1:
nori-crisp-tofu-16f9 — “process exited before initialization”
12 successful restores over ~23 hours, then fails non-deterministically on the 13th. Using @fly/sprites JS SDK, restoring from checkpoint v1 each time.
| Time (UTC) | Event | Duration |
|---|---|---|
| 2026-04-25T17:56:27Z | Base bootstrap started | – |
| 2026-04-25T17:57:52Z | Checkpoint v1 created | – |
| 2026-04-25T20:00:27Z | Restore from v1 | OK (13s) |
| 2026-04-25T22:04:27Z | Restore from v1 | OK (13s) |
| 2026-04-26T00:08:27Z | Restore from v1 | OK (19s) |
| 2026-04-26T02:18:28Z | Restore from v1 | OK (22s) |
| 2026-04-26T04:21:28Z | Restore from v1 | OK (24s) |
| 2026-04-26T06:25:32Z | Restore from v1 | OK (47s) |
| 2026-04-26T08:29:28Z | Restore from v1 | OK (19s) |
| 2026-04-26T10:32:28Z | Restore from v1 | OK (16s) |
| 2026-04-26T12:35:28Z | Restore from v1 | OK (17s) |
| 2026-04-26T14:38:28Z | Restore from v1 | OK (24s) |
| 2026-04-26T16:41:28Z | Restore from v1 | OK (19s) |
| 2026-04-26T18:45:28Z | Restore from v1 | OK (22s) — last success |
| 2026-04-26T20:48:29Z | Restore from v1 | FAILED (62s to error) |
Error at 20:49:31Z:
Failed to restore checkpoint: failed to start overlay service tree: failed to start container: process exited before initialization
No further log activity for this sprite.
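As a cross-check, the "62s to error" figure is the gap between the failed restore's start time in the table and the error timestamp (taking the error to be on 2026-04-26, the same day as the restore). A small sketch of that arithmetic, with a helper name chosen purely for illustration:

```typescript
// Seconds elapsed between two ISO 8601 UTC timestamps.
function elapsedSeconds(startIso: string, endIso: string): number {
  const start = Date.parse(startIso);
  const end = Date.parse(endIso);
  if (Number.isNaN(start) || Number.isNaN(end)) {
    throw new Error(`unparseable timestamp: ${startIso} / ${endIso}`);
  }
  return Math.round((end - start) / 1000);
}

// Restore started 20:48:29Z, error logged 20:49:31Z.
const secondsToError = elapsedSeconds(
  "2026-04-26T20:48:29Z",
  "2026-04-26T20:49:31Z",
);
console.log(secondsToError); // 62
```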
Example 2:
nori-savory-kamaboko-5a8d — “shutdown in progress”
Shows the cascade when a restore fails during active use. 4 successful restores, then the failure during an org script retry triggers the “shutdown in progress” variant, followed by the sprite becoming completely unreachable (404).
| Time (UTC) | Event |
|---|---|
| 2026-05-15T17:07:42Z | Base bootstrap started |
| 2026-05-15T17:10:06Z | Checkpoint v1 created |
| 2026-05-15T17:28:15Z | Restore from v1 — OK (20s) |
| 2026-05-15T17:52:19Z | Restore from v1 — OK (25s) |
| 2026-05-15T18:18:36Z | Restore from v1 — OK (25s), session claimed |
| 2026-05-15T18:20:00Z | Restore from v1 — OK (16s), org script retry path |
| 2026-05-15T18:20:16Z | Second restore initiated immediately afterward |
| 2026-05-15T18:20:21Z | Readiness probe: exec timeout (attempt 1/15) |
| 2026-05-15T18:20:28Z | Readiness probe: exec timeout (attempt 2/15) |
| 2026-05-15T18:20:30Z | FAILED: …cannot start process: shutdown in progress |
| 2026-05-15T18:20:30Z | Sprite returns 404: {"error":"sprite not found"} |
| 2026-05-15T18:20:30Z – 18:20:56Z | Readiness probes 3–15: all WebSocket error |
| 2026-05-15T18:20:56Z | Readiness exhausted after 15 attempts |
Questions (back to human text):
- Just for my own curiosity, what is the “overlay service tree”? Inside the VM, or supporting infra outside of it? Any way to get more diagnostics around what is happening when this fails?
- We haven’t been able to recover one of these instances before. Is there any recovery path here?
- The historical mke2fs variant seems to suggest a storage-layer root cause. Is this a known issue overall, and are all the variants related?
Both examples above are from the Fly.io org tilework-tech, in case anyone from the Sprites / Fly team can investigate those sprite histories directly.