Docker image works locally, but not on Fly.io; getting `command not found`

Summary

Hello!

I have a Docker image that works on my local machine using Podman. However, when it’s on a Fly machine, the server reports command not found for any command my server script tries to call.

Attempts to Fix

My script uploads the Docker image to registry.fly.io using Skopeo, and then deploys a Fly app from that image.

"$DOCKER_IMAGE_STREAM" | gzip --fast | skopeo --insecure-policy copy --dest-creds="x:$FLY_API_TOKEN" "docker-archive:/dev/stdin" "docker://registry.fly.io/$FLY_APP_NAME:$DOCKER_IMAGE_TAG"
flyctl deploy -c "$FLY_CONFIG" -i "registry.fly.io/$FLY_APP_NAME:$DOCKER_IMAGE_TAG"

Fly seems to load the image successfully, but the Fly machine won’t stay on.
The flyctl logs show command not found errors coming from inside the server script.

2025-01-08T23:26:14Z app[185241dc107068] den [info]2025-01-08T23:26:14.358821070 [01JH43PM0KBA5TPPEJVYP1RY3T:main] Running Firecracker v1.7.0
2025-01-08T23:26:14Z app[185241dc107068] den [info] INFO Starting init (commit: 1df1d0a0)...
2025-01-08T23:26:15Z app[185241dc107068] den [info] INFO Preparing to run: `/nix/store/9jnifcq70bqmmsa99iw4mhpc2vla6m30-fly-vpn-server-0.0.0/bin/fly-vpn-server-0.0.0` as root
2025-01-08T23:26:15Z app[185241dc107068] den [info] INFO [fly api proxy] listening at /.fly/api
2025-01-08T23:26:15Z app[185241dc107068] den [info]/nix/store/9jnifcq70bqmmsa99iw4mhpc2vla6m30-fly-vpn-server-0.0.0/bin/fly-vpn-server-0.0.0: line 30: id: command not found
2025-01-08T23:26:15Z app[185241dc107068] den [info]/nix/store/9jnifcq70bqmmsa99iw4mhpc2vla6m30-fly-vpn-server-0.0.0/bin/fly-vpn-server-0.0.0: line 146: sudo: command not found
2025-01-08T23:26:15Z runner[185241dc107068] den [info]Machine started in 809ms
2025-01-08T23:26:15Z app[185241dc107068] den [info]2025/01/08 23:26:15 INFO SSH listening listen_address=[fdaa:2:439f:a7b:9:5554:5b3d:2]:22 dns_server=[fdaa::3]:53
2025-01-08T23:26:16Z app[185241dc107068] den [info] INFO Main child exited normally with code: 127
2025-01-08T23:26:16Z app[185241dc107068] den [info] INFO Starting clean up.
2025-01-08T23:26:16Z app[185241dc107068] den [info] WARN could not unmount /rootfs: EINVAL: Invalid argument
2025-01-08T23:26:16Z app[185241dc107068] den [info][    1.661012] reboot: Restarting system
2025-01-08T23:26:16Z runner[185241dc107068] den [info]machine has reached its max restart count of 10
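
The exit code 127 reported for the main child in the log above is the conventional shell status for a command that could not be found. A minimal sketch that reproduces the status locally (the command name `no-such-command-xyz` is a placeholder chosen for illustration):

```shell
# Run a command that does not exist in a child shell and capture its exit status.
# `no-such-command-xyz` is a placeholder name, assumed not to exist on this system.
status=0
sh -c 'no-such-command-xyz' 2>/dev/null || status=$?
echo "exit status: $status"
```

So the machine restarts are consistent with the last command of the script failing its lookup, rather than with the server itself crashing.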

So I try to use Podman to upload the image instead, thinking there could be a bug with Skopeo.

podman login -u x -p "$FLY_API_TOKEN" -v registry.fly.io
"$DOCKER_IMAGE_STREAM" | podman load
podman push --format v2s2 "localhost/$DOCKER_IMAGE_NAME:$DOCKER_IMAGE_TAG" "docker://registry.fly.io/$FLY_APP_NAME:$DOCKER_IMAGE_TAG"
flyctl deploy -c "$FLY_CONFIG" -i "registry.fly.io/$FLY_APP_NAME:$DOCKER_IMAGE_TAG"

The behavior persists.

So I try to download and run the image from registry.fly.io to see if it is copied correctly.
Podman successfully loads the image from registry.fly.io, and the image runs flawlessly.

> podman login -u x -p "$FLY_API_TOKEN" -v registry.fly.io
Used:  /run/user/1000/containers/auth.json
Login Succeeded!
> podman pull registry.fly.io/flyvpn:0.0.0     
Trying to pull registry.fly.io/flyvpn:0.0.0...
Getting image source signatures
Copying blob 507ec11f5a1f skipped: already exists  
Copying blob 3375c174bfc7 skipped: already exists  
Copying blob 10e45524ca1c skipped: already exists  
Copying blob 10e45524ca1c skipped: already exists  
Copying blob 10e45524ca1c skipped: already exists  
Copying blob 10e45524ca1c skipped: already exists  
Copying blob 10e45524ca1c skipped: already exists  
Copying blob 3375c174bfc7 skipped: already exists  
Copying blob 507ec11f5a1f skipped: already exists  
Copying blob cb52c0b7d468 skipped: already exists  
Copying blob 87197915c18d skipped: already exists  
Copying blob e07f43e582c2 skipped: already exists  
Copying blob 95b18c5e2fb6 skipped: already exists  
Copying blob 52a004ba8547 skipped: already exists  
Copying blob 2666773469b4 skipped: already exists  
Copying blob 1196b72764b9 skipped: already exists  
Copying blob b145809b4cc3 done   | 
Copying blob faa67288214d done   | 
Copying blob beb7e054375b skipped: already exists  
Copying config 954759158d done   | 
Writing manifest to image destination
954759158d565e2ffde2b38ff1989b893d86197f09086ec9c7b1e3ddec38d19c
> podman container run --tty --interactive registry.fly.io/flyvpn:0.0.0 
{ ... SERVER OUTPUT ... }

So the copy of the image is correctly loaded into registry.fly.io.

Next I thought that perhaps the issue is specifically with the commands sudo or id.
So I comment out the part of my server script that uses these commands, then re-deploy the image.

2025-01-08T23:59:29Z app[185241dc107068] den [info]2025-01-08T23:59:29.372307289 [01JH45J70AFX4KZJ597EF02DF1:main] Running Firecracker v1.7.0
2025-01-08T23:59:29Z app[185241dc107068] den [info] INFO Starting init (commit: 1df1d0a0)...
2025-01-08T23:59:30Z app[185241dc107068] den [info] INFO Preparing to run: `/nix/store/2cpwn105kaf4zxfl3vhm24m41lk6bavx-fly-vpn-server-0.0.0/bin/fly-vpn-server-0.0.0` as root
2025-01-08T23:59:30Z app[185241dc107068] den [info] INFO [fly api proxy] listening at /.fly/api
2025-01-08T23:59:30Z app[185241dc107068] den [info]/nix/store/2cpwn105kaf4zxfl3vhm24m41lk6bavx-fly-vpn-server-0.0.0/bin/fly-vpn-server-0.0.0: line 37: mkdir: command not found
2025-01-08T23:59:30Z app[185241dc107068] den [info]/nix/store/d7p2bql11nbs4jbmslzl3c52knlj8bwm-softether-4.41-9782-beta/bin/vpnserver: line 2: /var/lib/softether/vpnserver/vpnserver: No such file or directory
2025-01-08T23:59:30Z runner[185241dc107068] den [info]Machine started in 775ms
2025-01-08T23:59:30Z app[185241dc107068] den [info]2025/01/08 23:59:30 INFO SSH listening listen_address=[fdaa:2:439f:a7b:9:5554:5b3d:2]:22 dns_server=[fdaa::3]:53
2025-01-08T23:59:31Z app[185241dc107068] den [info] INFO Main child exited normally with code: 127
2025-01-08T23:59:31Z app[185241dc107068] den [info] INFO Starting clean up.
2025-01-08T23:59:31Z app[185241dc107068] den [info] WARN could not unmount /rootfs: EINVAL: Invalid argument
2025-01-08T23:59:31Z app[185241dc107068] den [info][    1.647670] reboot: Restarting system
2025-01-08T23:59:31Z runner[185241dc107068] den [info]machine has reached its max restart count of 10

With sudo and id removed, the server also can’t find mkdir, nor the direct path to the bin directory containing my server.
So not only do commands looked up through PATH fail to resolve, even direct paths to executable files in the image cannot be resolved.
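
One way to narrow this kind of failure down is to log PATH and check each required command before the script does any real work. A minimal sketch, assuming a Bash-compatible shell; `report_missing` is a hypothetical helper name and is not part of the original server script:

```shell
# Report which of the given commands cannot be found on the current PATH.
# `report_missing` is a hypothetical helper, not part of the original script.
report_missing() {
  local cmd missing=0
  # Log the PATH the script actually sees; this goes to stderr so it shows up in logs.
  echo "PATH=${PATH:-<unset>}" >&2
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Calling something like `report_missing id sudo mkdir || exit 127` near the top of the script would put the effective PATH into the flyctl logs alongside the list of unresolved commands.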

Again, the image works perfectly fine when loaded by Podman.

I’m really out of ideas for what’s causing this, so I’m out of things to test… I think I need help from Fly staff, because I don’t see any way I can identify the cause of this problem from my end.

Additional Context

My image is built using the Nix function pkgs.dockerTools.streamLayeredImage provided by nixpkgs instead of a Dockerfile. In fact, the entire project is made using Nix. Nix stacks my server on top of Busybox:

{
  pkgs,
  name,
  version,
  server ? pkgs.callPackage ./server.nix { inherit name version; }
}: let
  _name = "${name}-docker-image";
  tag = version;
  # update base image using variables from:
  #   xdg-open https://hub.docker.com/_/busybox/tags
  #   nix-shell -p nix-prefetch-docker
  #   nix-prefetch-docker --quiet --image-name busybox --image-tag stable --image-digest sha256:_
  baseImage = pkgs.dockerTools.pullImage {
    imageName = "busybox";
    imageDigest = "sha256:7c3c3cea5d4d6133d6a694d23382f6a7b32652f23855abdba3eb039ca5995447";
    sha256 = "0k9ypllg4lmwd1a370z8n3awf5fpvlwwq355hmrfjwlmvqarjmjr";
    finalImageName = "busybox";
    finalImageTag = "stable";
    os = "linux";
    arch = "amd64";
  };
in {
  name = _name;
  inherit version tag;
  stream = pkgs.dockerTools.streamLayeredImage {
    name = _name;
    inherit tag;
    fromImage = baseImage;
    contents = [
      server
    ];
    config = {
      Entrypoint = [ "${pkgs.lib.getExe server}" ];
      Cmd = [];
      ExposedPorts = {
        "5555/tcp" = {};
        "992/tcp" = {};
        "443/tcp" = {};
      };
    };
  };
}

I know this should be supported by Fly because I’ve used this tool before, and I’ve seen evidence on these forums of others doing the same.

You can recreate my issue by following the README at this public GitHub repo to create your own copy of my app: github.com/mboyea/fly-vpn (an SSTP VPN hosted by Fly.io).
Or, just inspect parts of the code to try and identify why my image isn’t working with Fly.

Could you try adding #!/usr/bin/env bash to the top of the server.sh script?

Sure. But the kernel will never see that directive, because the shell script produced by writeShellApplication from nixpkgs already injects a shebang and other Bash code above the script text.

So with the above change, I’m getting the same errors.

Gotcha, thanks for trying. I know next to nothing about Nix, much less building Docker images with it. Hopefully someone else on the forum can help.

Well, here is the output file from the change you asked for. Running this file should be the same as running any other Bash script.

As you can see, the shebang at the top requires bash from a specific directory provided by Nix. That seems to be resolved correctly. It is just that other required directories and PATH don’t seem to be resolved… But ONLY when running on the fly.io server. They resolve just fine as a container on my local machine.

That’s why I’m thinking that this has something to do with Fly’s interpretation of the image, rather than the image itself.
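
The interpreter named on the generated script's first line can be checked directly. A minimal sketch; `shebang_of` is a hypothetical helper name, and the script path passed to it would be the store path of the generated server script:

```shell
# Print the interpreter named on a script's shebang line, or nothing if there is none.
# `shebang_of` is a hypothetical helper name used only for this illustration.
shebang_of() {
  head -n 1 "$1" | sed -n 's/^#![[:space:]]*//p'
}
```

If the printed interpreter is an absolute /nix/store path, the shebang itself does not depend on PATH, which fits the observation that Bash starts while other commands fail.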

I checked the resolved image config we pass to our init. Here’s what it looks like:

  "ImageConfig": {
    "ExposedPorts": {
      "443/tcp":{},
      "5555/tcp":{},
      "992/tcp":{}
    },
    "Entrypoint":[
      "/nix/store/2cpwn105kaf4zxfl3vhm24m41lk6bavx-fly-vpn-server-0.0.0/bin/fly-vpn-server-0.0.0"
    ]
  }

That looks right to me. Is there any way we can inspect the image contents as interpreted by Firecracker itself?

You should be able to use existing Docker/OCI tooling to download and inspect the image contents. Our CLI has a helper command for this: fly auth docker.

Well yes, I did this in my initial post! :grin:
I downloaded the image from registry.fly.io, and inspecting it on my machine showed no issues. The image sourced from registry.fly.io ran correctly on my machine. It’s just that the image does not run correctly on Fly.io machines.

Ah, sorry I missed that bit.

Unfortunately, I’m not sure what to suggest at this point, as we don’t do anything out of the ordinary and rely on a standard containerd client to download and unpack each layer.

Okay, no problem! Thank you for the information.

First I will try to create a minimum reproducible example from a simpler Docker image.
Then I will see if I can reproduce the issue with containerd on my own machine.

I will post back here with the results of my investigation.

So I created a minimum reproducible example at github.com/mboyea/fly-nix-test-23387.
All the server does is:

echo "Hello, World!"
sleep infinity

It is packaged into the Busybox image, where sleep should be available.


Again, the image works as expected on my local machine, but not on Fly.io. The machine reports sleep: command not found. However, echo does work, so at least some commands are resolved correctly.
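
A likely reason echo succeeds while sleep fails is that echo is a Bash builtin and needs no PATH lookup, whereas sleep is normally an external program. A minimal sketch that mimics this with a PATH containing no usable directories (the directory name is an arbitrary placeholder, and `check_lookup` is a hypothetical helper name):

```shell
# Run the checks in a subshell so the PATH change does not leak into the caller.
# /nonexistent-dir-for-demo is an arbitrary placeholder directory.
check_lookup() (
  PATH=/nonexistent-dir-for-demo
  type echo                     # builtin: found without consulting PATH
  if command -v sleep >/dev/null 2>&1; then
    echo "sleep: found"
  else
    echo "sleep: not found"
  fi
)
check_lookup
```

This suggests the working echo says nothing about whether PATH is set correctly on the Fly machine.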

Here you can see the repository provides easy instructions in the README to show how to install the project and make a deployment. This is a fully reproducible example on new machines. This will be useful for the bugfix if I can determine that this is a bug with Fly.

So in the past (with github.com/mboyea/nixflymc), for deployment I would upload my images to docker.io rather than to registry.fly.io, then tell Fly to fetch it. That was working, even with Nix-generated docker images. So now I will fork this minimum reproducible example and see if it works if I upload the image through docker.io. If that works, then it means there’s an issue with the registry.fly.io pipeline.

(edit:) I did test deploying to docker.io. The same behavior is present, which means this is definitely not an issue with registry.fly.io, and it is almost certainly an issue loading the image that’s generated by Nix.

If I can’t get this to work, my last resort is to make a containerd environment locally to try and identify the issue. Perhaps there is a bug with that software, but I don’t know anything about how it works at this time, so I’m avoiding that possibility for as long as possible.

Another thing to try if you have an image:

fly console --image X -C bash

Try it first with something like debian for X. An ephemeral machine will be created with that image and you can run commands and see results. Once you exit, the machine will be destroyed.

Now try it with your image.

So this is very helpful @rubys, thank you! Now I can poke around the image on a fly machine. My tests gave very interesting results.

> 15:32 fly-nix-test git:(main) podman login -u x -p "$FLY_API_TOKEN" -v registry.fly.io
Used:  /run/user/1000/containers/auth.json
Login Succeeded!
 > 15:32 fly-nix-test git:(main) podman pull registry.fly.io/fly-nix-test:0.0.0
Trying to pull registry.fly.io/fly-nix-test:0.0.0...
Getting image source signatures
Copying blob e07f43e582c2 skipped: already exists  
Copying blob 87197915c18d skipped: already exists  
Copying blob 3375c174bfc7 skipped: already exists  
Copying blob 10e45524ca1c skipped: already exists  
Copying blob cb52c0b7d468 skipped: already exists  
Copying blob 1280ab61288a skipped: already exists  
Copying blob aa1508391eab skipped: already exists  
Copying blob beb7e054375b skipped: already exists  
Copying config 6c27eae71e done   | 
Writing manifest to image destination
6c27eae71e27e390aba3befa760076dd2bea7d8c68fc58a806d4062b4a257218
 > 15:32 fly-nix-test git:(main) podman container run --tty --interactive registry.fly.io/fly-nix-test:0.0.0
Hello, World!
^C%
 > 15:33 fly-nix-test git:(main) flyctl console --image registry.fly.io/fly-nix-test:0.0.0
Searching for image 'registry.fly.io/fly-nix-test:0.0.0' remotely...
image found: img_rj5yv1jrnmx7vdwq
Image: registry.fly.io/fly-nix-test:0.0.0
Image size: 15 MB

Created an ephemeral machine 185e712b3ed968 to run the console.
WARN The running flyctl agent (v0.3.49) is older than the current flyctl (v0.3.56).
WARN The out-of-date agent will be shut down along with existing wireguard connections. The new agent will start automatically as needed.
Connecting to fdaa:2:439f:a7b:69:4e9d:4886:2... complete
/ # ls
bin    dev    etc    home   lib    lib64  nix    proc   root   run    sys    tmp    usr    var
/ # ls nix/store
2d5spnl8j5r4n1s4bj1zmra7mwx0f1n8-xgcc-13.3.0-libgcc         85m2k33jlw6s5qk654ikb46ygym5fa36-fly-nix-test-server-0.0.0  qwjjm4j652ck9izaid7bz63s4hd5bnha-libidn2-2.3.7
6pqgj71r0850b0cd95yxx0d52zax016i-libunistring-1.2           p6k7xp1lsfmbdd731mlglrdj2d66mr82-bash-5.2p37                wn7v2vhyyyi6clcyn0s9ixvl7d4d87ic-glibc-2.40-36
/ # exit
Waiting for ephemeral machine 185e712b3ed968 to be destroyed ... done.
 > 15:34 fly-nix-test git:(main) flyctl console --image registry.fly.io/fly-nix-test:0.0.0 -C bash
Searching for image 'registry.fly.io/fly-nix-test:0.0.0' remotely...
image found: img_rj5yv1jrnmx7vdwq
Image: registry.fly.io/fly-nix-test:0.0.0
Image size: 15 MB

Created an ephemeral machine 080e74ea3d5228 to run the console.
Connecting to fdaa:2:439f:a7b:161:312e:4012:2... complete
exec: "bash": executable file not found in $PATH
Waiting for ephemeral machine 080e74ea3d5228 to be destroyed ... done.
Error: ssh shell: wait: remote command exited without exit status or exit signal
 > 15:34 fly-nix-test git:(main) flyctl console --image registry.fly.io/fly-nix-test:0.0.0 -C /nix/store/p6k7xp1lsfmbdd731mlglrdj2d66mr82-bash-5.2p37/bin/bash
Searching for image 'registry.fly.io/fly-nix-test:0.0.0' remotely...
image found: img_rj5yv1jrnmx7vdwq
Image: registry.fly.io/fly-nix-test:0.0.0
Image size: 15 MB

Created an ephemeral machine d8dd957f7d5448 to run the console.
Connecting to fdaa:2:439f:a7b:69:d745:145:2... complete
bash-5.2# sleep 1
bash: sleep: command not found
bash-5.2# echo $PATH
/no-such-path
bash-5.2# exit
exit
Waiting for ephemeral machine d8dd957f7d5448 to be destroyed ... done.
 > 15:38 fly-nix-test git:(main) podman container run --tty --interactive --entrypoint /nix/store/p6k7xp1lsfmbdd731mlglrdj2d66mr82-bash-5.2p37/bin/bash registry.fly.io/fly-nix-test:0.0.0
bash-5.2# sleep 1
bash-5.2# exit
exit
 > 15:39 fly-nix-test git:(main) podman container run --tty --interactive --entrypoint bash registry.fly.io/fly-nix-test:0.0.0 
Error: crun: executable file `bash` not found in $PATH: No such file or directory: OCI runtime attempted to invoke a command that was not found
 > 15:42 fly-nix-test git:(main) podman container run --tty --interactive --entrypoint /nix/store/p6k7xp1lsfmbdd731mlglrdj2d66mr82-bash-5.2p37/bin/bash registry.fly.io/fly-nix-test:0.0.0
bash-5.2# echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
bash-5.2# exit
exit

So the interesting things to note here:

  • @ 15:33 & 15:34 I was able to make a Fly machine load into Bash, where I could test things and poke around the image. Most notably, $PATH is set to /no-such-path and sleep is not found, which matches how the server behaves when deployed by my application.
  • @ 15:38 & 15:42 I ran the same image with the same bash executable under Podman, and there $PATH and sleep resolve as expected.

Something must be wrong with the way Fly interprets my image, possibly to do with containerd? I will see what I can figure out tomorrow. Thank you both for your help today.
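
One possible explanation, offered as an assumption rather than something confirmed in this thread: when a process starts with no PATH variable at all, Bash falls back to a default compiled into the binary, and the Bash built by nixpkgs is understood to use /no-such-path as that default. The image config shown earlier has no Env entry, and Podman appears to supply a standard PATH in that case, which would account for the difference. A minimal sketch for observing the compiled-in default of whichever Bash is installed locally:

```shell
# Start Bash with an empty environment so PATH is unset, then print what Bash falls back to.
# The value printed depends on how the local Bash was built.
bash_bin="$(command -v bash)"
default_path="$(env -i "$bash_bin" --norc --noprofile -c 'echo "$PATH"')"
echo "compiled-in default PATH: $default_path"
```

Running the same command against the Bash inside the image (using its full /nix/store path) would show whether /no-such-path comes from Bash itself rather than from Fly.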

Try adding to your fly.toml:

[env]
  PATH = "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

This workaround does work, though it’s not ideal: hard-coding PATH separately for Fly doesn’t address the underlying problem of PATH ending up as /no-such-path on the Fly machine.

For anyone who missed it, the workaround is to:

  • Run a local copy of your image to get <value_of_PATH> using podman run or docker run to execute echo $PATH natively.
  • Add to fly.toml:
    [env]
      PATH = "<value_of_PATH>"
    
  • Deploy normally.

An example of me taking these actions is provided in an above comment.
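
The steps above can be sketched as a small helper that formats a PATH value as the fly.toml fragment. This is a sketch under stated assumptions: `fly_env_fragment`, `IMAGE`, and `BASH_IN_IMAGE` are hypothetical names, and the Podman call is left commented out because it needs the image to be available locally.

```shell
# Format a PATH value as a fly.toml [env] fragment.
# `fly_env_fragment` is a hypothetical helper name.
fly_env_fragment() {
  printf '[env]\n  PATH = "%s"\n' "$1"
}

# Usage (not run here; IMAGE and BASH_IN_IMAGE are placeholders you would set yourself):
#   image_path="$(podman run --rm --entrypoint "$BASH_IN_IMAGE" "$IMAGE" -c 'echo "$PATH"')"
#   fly_env_fragment "$image_path"
```

The output can then be pasted into fly.toml before deploying.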

In the interest of deploying my software, I’ll accept this and move on with my life. But I think the Fly team should consider investigating this issue more closely, as PATH is clearly not resolved from the image in a way that is consistent with Docker. It is odd that $PATH resolves to /no-such-path. I’ll leave github.com/mboyea/fly-nix-test-23387 archived for your team’s reference.

Thank you both again for your assistance!
