"cannot get bootfile": was there a change to the Machine's /etc/hosts recently?

We had an issue today where all deploys were failing with:

Runtime terminating during boot ({'cannot get bootfile', ...

After hours of tracking it down, it appeared to be an issue with how we populated RELEASE_NODE based on the old instructions:

# See https://fly.io/docs/elixir/the-basics/naming-your-elixir-node/:
ip=$(grep fly-local-6pn /etc/hosts | cut -f 1)
export RELEASE_NODE=$FLY_APP_NAME@$ip

Specifically, grep fly-local-6pn /etc/hosts | cut -f 1 was returning two strings separated by a space. That caused problems when starting our release and eventually resulted in beam.smp being started with slightly incorrect arguments (--boot vs -boot, --boot-var vs -boot_var).

We weren’t able to confirm that the format of /etc/hosts was the cause, because the hosts file now works fine with the command above (i.e., if there was an /etc/hosts change, it has since been reverted). Could someone on the Fly team confirm whether there was a recent change to the format of the Machine's /etc/hosts file?

For the record, using the new instructions, we changed our release env code to:

export RELEASE_NODE="${FLY_APP_NAME}-${FLY_IMAGE_REF##*-}@${FLY_PRIVATE_IP}"

which looks to be more robust.
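For anyone unfamiliar with the ${FLY_IMAGE_REF##*-} part, here is a minimal sketch of how that expansion behaves. The three values below are hypothetical placeholders I made up for illustration (on a real Machine they come from the environment), so treat the exact shape of FLY_IMAGE_REF as an assumption.

```shell
# Hypothetical values for illustration only; on a real Machine these are already set.
FLY_APP_NAME="my-app"
FLY_IMAGE_REF="registry.fly.io/my-app:deployment-01HEXAMPLE"
FLY_PRIVATE_IP="fdaa:0:1::2"

# ${VAR##*-} strips the longest prefix that ends in "-",
# leaving only the text after the last hyphen.
image_suffix="${FLY_IMAGE_REF##*-}"

RELEASE_NODE="${FLY_APP_NAME}-${image_suffix}@${FLY_PRIVATE_IP}"
echo "$RELEASE_NODE"
# prints: my-app-01HEXAMPLE@fdaa:0:1::2
```

Since nothing here reads /etc/hosts, the result does not depend on how many fly-local-6pn lines that file contains.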


Hi Ryan,

Newer versions of rel/env.sh.eex use this to generate RELEASE_NODE instead:

export RELEASE_NODE="${FLY_APP_NAME}-${FLY_IMAGE_REF##*-}@${FLY_PRIVATE_IP}"

Under some “abrupt machine stop” circumstances, /etc/hosts gets duplicated (i.e., a copy of itself is appended at the end of the file), and the duplicated records are what confuse the old RELEASE_NODE determination code. When you cleanly restart the machine, the hosts file gets fixed, which explains why things worked after a reboot.

Specifically, ip=$(grep fly-local-6pn /etc/hosts | cut -f 1) returns one address per matching line, so it yields more than one value if more than one line mentions fly-local-6pn. You could always add | head -n 1, but the new way of generating RELEASE_NODE is more reliable overall.
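To make the failure mode concrete, here is a rough sketch that simulates a duplicated hosts file in a temporary file rather than touching the real /etc/hosts. The address and the tab-separated layout are assumptions for illustration, as are the variable names.

```shell
# Hypothetical stand-in for /etc/hosts after an abrupt stop:
# the fly-local-6pn record appears twice (address and tab layout are assumed).
hosts_file=$(mktemp)
printf 'fdaa:0:1::2\tfly-local-6pn\n' > "$hosts_file"
printf 'fdaa:0:1::2\tfly-local-6pn\n' >> "$hosts_file"

# Quick diagnostic: how many lines mention fly-local-6pn?
match_count=$(grep -c fly-local-6pn "$hosts_file")

# Old approach: one address per matching line, so two lines here.
old_ip=$(grep fly-local-6pn "$hosts_file" | cut -f 1)
old_count=$(printf '%s\n' "$old_ip" | wc -l | tr -d ' ')

# Stopgap: keep only the first match.
fixed_ip=$(grep fly-local-6pn "$hosts_file" | cut -f 1 | head -n 1)

echo "matches: $match_count, lines from old command: $old_count, first match: $fixed_ip"
rm -f "$hosts_file"
```

Running grep -c fly-local-6pn /etc/hosts on an affected Machine is one way to check whether the file is currently in the duplicated state.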

I would suggest updating your env.sh.eex as seen above and that should work reliably even when a machine stops abruptly.

Regards,

Daniel
