Machine restarts fail after successful initial start (an unhandled IO error occurred: File exists (os error 17))

Hi there! I have a Fly Machine image that was previously working great. Recently an issue has arisen with Machines configured with restart: always: a new Machine builds and starts fine, restarts fine once, and then subsequent restarts fail, so the Machine goes down and doesn’t come back up.

Error logs aren’t particularly insightful (to me, at least!)

Initial start is fine:

2024-06-11T10:58:56.177 runner[185e76df4246e8] lhr [info] Machine created and started in 3.707s
....

First restart succeeds:

2024-06-11T10:59:16.196 app[185e76df4246e8] lhr [info] [ 20.460491] reboot: Restarting system
2024-06-11T10:59:16.615 app[185e76df4246e8] lhr [info] 2024-06-11T10:59:16.615569983 [01J03F6RVSVNNBEX4FSGKPNDA1:main] Running Firecracker v1.7.0
2024-06-11T10:59:16.747 app[185e76df4246e8] lhr [info] [ 0.048541] PCI: Fatal: No config space access function found
2024-06-11T10:59:17.089 app[185e76df4246e8] lhr [info] INFO Starting init (commit: dec752a2)...
2024-06-11T10:59:17.130 app[185e76df4246e8] lhr [info] INFO Preparing to run: `docker-entrypoint.sh node ../../modules/cli-scripts/bin/set-env.js -c APP_ENV -- node dist/worker.js` as root
2024-06-11T10:59:17.137 app[185e76df4246e8] lhr [info] INFO [fly api proxy] listening at /.fly/api
2024-06-11T10:59:17.141 app[185e76df4246e8] lhr [info] 2024/06/11 10:59:17 INFO SSH listening listen_address=[fdaa:2:967a:a7b:15d:6ba8:fd40:2]:22 dns_server=[fdaa::3]:53
2024-06-11T10:59:17.161 runner[185e76df4246e8] lhr [info] Machine started in 625ms
...

Subsequent restarts fail, after which the machine can’t be restarted successfully:

2024-06-11T10:59:33.156 app[185e76df4246e8] lhr [info] INFO Main child exited normally with code: 0
2024-06-11T10:59:33.169 app[185e76df4246e8] lhr [info] INFO Starting clean up.
2024-06-11T10:59:33.170 app[185e76df4246e8] lhr [info] WARN could not unmount /rootfs: EINVAL: Invalid argument
2024-06-11T10:59:33.171 app[185e76df4246e8] lhr [info] [ 16.471415] reboot: Restarting system
2024-06-11T10:59:33.568 app[185e76df4246e8] lhr [info] 2024-06-11T10:59:33.568044282 [01J03F6RVSVNNBEX4FSGKPNDA1:main] Running Firecracker v1.7.0
2024-06-11T10:59:33.693 app[185e76df4246e8] lhr [info] [ 0.047461] PCI: Fatal: No config space access function found
2024-06-11T10:59:34.018 app[185e76df4246e8] lhr [info] INFO Starting init (commit: dec752a2)...
2024-06-11T10:59:34.061 app[185e76df4246e8] lhr [info] ERROR Error: an unhandled IO error occurred: File exists (os error 17)
2024-06-11T10:59:34.062 app[185e76df4246e8] lhr [info] [ 0.415046] reboot: Restarting system
2024-06-11T10:59:34.144 app[185e76df4246e8] lhr [warn] Virtual machine exited abruptly
2024-06-11T10:59:34.553 app[185e76df4246e8] lhr [info] 2024-06-11T10:59:34.553241506 [01J03F6RVSVNNBEX4FSGKPNDA1:main] Running Firecracker v1.7.0
2024-06-11T10:59:34.689 app[185e76df4246e8] lhr [info] [ 0.052915] PCI: Fatal: No config space access function found
2024-06-11T10:59:35.015 app[185e76df4246e8] lhr [info] INFO Starting init (commit: dec752a2)...
2024-06-11T10:59:35.056 app[185e76df4246e8] lhr [info] ERROR Error: an unhandled IO error occurred: File exists (os error 17)
2024-06-11T10:59:35.057 app[185e76df4246e8] lhr [info] [ 0.420334] reboot: Restarting system
2024-06-11T10:59:35.160 app[185e76df4246e8] lhr [warn] Virtual machine exited abruptly
...

…and so on, ad infinitum.

The relevant error appears to be ERROR Error: an unhandled IO error occurred: File exists (os error 17)
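
For context: on Linux, os error 17 is EEXIST, which the kernel returns when a creation call (mkdir, symlink, an exclusive file create, and so on) targets a path that already exists. The logs don't show which path or call inside init hit it, so the following is only a generic sketch of that failure class, not the init code. It assumes a setup step that creates a directory and some state that survives between runs; `setup_once` and `setup_idempotent` are hypothetical names.

```python
import errno
import os
import tempfile


def setup_once(path):
    """Non-idempotent setup: fails if `path` is left over from an earlier run."""
    os.mkdir(path)


def setup_idempotent(path):
    """Idempotent setup: tolerates a directory left over from an earlier run."""
    os.makedirs(path, exist_ok=True)


with tempfile.TemporaryDirectory() as base:
    target = os.path.join(base, "state")

    setup_once(target)  # first run: nothing exists yet, so this succeeds

    second_run_errno = None
    try:
        setup_once(target)  # later run: the directory is still there
    except FileExistsError as exc:
        second_run_errno = exc.errno  # EEXIST

    setup_idempotent(target)  # safe to repeat any number of times

print(second_run_errno == errno.EEXIST)
```

A step like this works on a fresh start and only fails once something from a previous boot is still in place, which would be consistent with a pattern where the initial start is fine and a later restart is not.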

Any insight into what may be causing this?
Thanks!

I have the same issue; it was working great until today.

I am also facing this issue. I deployed a simple Node.js server app, but the deploy failed each time and the app is now suspended.

Hi @andyy @alexandergv @shekhar-227—thank you for the reports! There was a subtle bug in our init process causing it to crash on the second and subsequent restarts in some cases. I just deployed a fix. You should receive the new init version once you update your Machines/redeploy your apps (you should then see commit: f7402432 in the logs).

If this doesn’t resolve the problem, then please let me know!
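
To check which init version a Machine is actually running, one option is to scan its logs for the `Starting init (commit: ...)` line. This is a minimal sketch, assuming the logs are available as plain text lines in the same format as the excerpts earlier in this thread; `init_commits` is a hypothetical helper, not part of any Fly tooling.

```python
import re

# Matches the init banner as it appears in the logs quoted in this thread.
INIT_RE = re.compile(r"Starting init \(commit: ([0-9a-f]+)\)")


def init_commits(log_lines):
    """Return the init commit hashes in the order they appear in the logs."""
    commits = []
    for line in log_lines:
        match = INIT_RE.search(line)
        if match:
            commits.append(match.group(1))
    return commits


sample = [
    "2024-06-11T10:59:17.089 app[185e76df4246e8] lhr [info] INFO Starting init (commit: dec752a2)...",
    "2024-06-11T10:59:17.161 runner[185e76df4246e8] lhr [info] Machine started in 625ms",
]
print(init_commits(sample))  # ['dec752a2']
```

If the most recent entry is still dec752a2 rather than f7402432, the Machine has not yet been updated or redeployed.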

Thanks @MatthewIngwersen, it appears to be working again now!

(Aside: oh - that’s what commit: xxxxx in the logs refers to - your commit, not one of mine :grin: )

It worked! Thanks
