Cannot (auto-)start machine: Image not found

Hey, this looks like another incarnation of Replay header routing 'machine not found', but the cause seems to be a different one. It looks like a bug in accessing the image storage, and I’d be surprised if I’m the only one it hits.

Summary

My app runs reliably (after the fix in the referenced issue) with multiple machines and the auto-start feature. All machines naturally use the same image (as seen in flyctl machines list). Since 2025-06-21, one machine does not come up after the proxy triggers its start, and the logs show “failed to stat image path: no such file or directory”. The error also occurs when I run flyctl machines start machine_id. However, the other machines running the same image are not affected and can be started as before.

Debugging details

Request ID and time (with debug header)

< HTTP/2 502 
< server: Fly/b5d4f7e6 (2025-06-19)
< via: 2 fly.io
< fly-request-id: 01JYCQ8CJ6517XFA7XXDV5NGZE-fra
< flyio-debug: {"n":"edge-cf-fra2-8893","nr":"fra","ra":"2a02:908:1396:2540:9159:b18:9636:6b9e","rf":"Verbatim","sr":null,"sdc":null,"sid":null,"st":null,"nrtt":null,"bn":null,"mhn":"edge-cf-ams1-5f8b","mrtt":7}
< date: Sun, 22 Jun 2025 21:06:42 GMT

Machines

The target machine ID is xxxxxxxxxx2258 and the router machine ID is xxxxxxxxxx2358.

App logs

2025-06-22T21:06:42Z proxy[redacted] ams [info]Starting machine
2025-06-22T21:06:42Z proxy[redacted] ams [error][PR04] could not find a good candidate within 10 attempts at load balancing
2025-06-22T21:06:42Z proxy[redacted] ams [error][PM01] machines API returned an error: "failed to stat image path: no such file or directory"
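
For anyone trying to spot the same failure in their own logs, here is a minimal Python sketch that splits lines of the shape shown above into fields and keeps only the image path errors. The regular expression is inferred from these three lines alone, not from any documented log format, and the names `parse_line` and `image_errors` are made up for illustration.

```python
import re

# Line layout inferred from the log excerpt above (an assumption, not a spec):
# <timestamp> <source>[<instance>] <region> [<level>][<code>] <message>
LOG_LINE = re.compile(
    r"^(?P<ts>\S+)\s+"
    r"(?P<source>\w+)\[(?P<instance>[^\]]*)\]\s+"
    r"(?P<region>\w+)\s+"
    r"\[(?P<level>\w+)\]"
    r"(?:\[(?P<code>[A-Z]{2}\d{2})\])?\s*"  # error code such as PM01 is optional
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def image_errors(lines):
    """Keep only error lines that mention the failed image path lookup."""
    parsed = (parse_line(line) for line in lines)
    return [
        entry for entry in parsed
        if entry
        and entry["level"] == "error"
        and "failed to stat image path" in entry["message"]
    ]


logs = [
    '2025-06-22T21:06:42Z proxy[redacted] ams [info]Starting machine',
    '2025-06-22T21:06:42Z proxy[redacted] ams [error][PR04] could not find a good candidate within 10 attempts at load balancing',
    '2025-06-22T21:06:42Z proxy[redacted] ams [error][PM01] machines API returned an error: "failed to stat image path: no such file or directory"',
]

for entry in image_errors(logs):
    print(entry["ts"], entry["region"], entry["code"], entry["message"])
```

Filtering on the message text rather than on the PM01 code is deliberate: I don't know whether PM01 is specific to this failure or covers other machines API errors too.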

@PeterCxy do you think there could be a bug in the image storage subsystem, or the retrieval routine? There hasn’t been any redeploy since, and the lookup now only fails for one of the app machines while the other continues to work correctly.

I wanted to give fly.io staff a chance to investigate, but I need a working deployment now. So I removed the faulty machine and recreated it using flyctl scale count n. It’s now working again, and I assume and hope that this was just an effect of changes in the infrastructure. Fortunately, the affected app was still in testing mode. If such an error hit a production system, the effect could be worse.
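
In case it helps someone else, the workaround can be sketched as a small Python dry run that prints the two commands before anything is executed. This is only a sketch under assumptions: I did not record how I removed the machine, so the `flyctl machines destroy ... --force` step and the `--yes` and `--app` flags are assumptions to check against `flyctl --help`. The app name `my-app` and the helper names are hypothetical.

```python
import shlex
import subprocess


def recreate_commands(app, machine_id, count):
    """Build the commands for the workaround: remove one machine, then scale back up.

    The destroy step and the --force/--yes/--app flags are assumptions;
    only `flyctl scale count n` is taken from the report above.
    """
    return [
        ["flyctl", "machines", "destroy", machine_id, "--force", "--app", app],
        ["flyctl", "scale", "count", str(count), "--yes", "--app", app],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Hypothetical values: "my-app" is a placeholder app name, and the machine ID
# is the redacted one from this report.
commands = recreate_commands("my-app", "xxxxxxxxxx2258", 2)
run(commands, dry_run=True)
```

Keeping `dry_run=True` by default means nothing is destroyed until the printed commands have been reviewed.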

I’m leaving this report as a reference for the incident.
