Flaky deployment fails

Every few deployments or so, the remote deployment (--remote-only) fails with the following error:

Error failed to fetch an image or build from source: error connecting to docker: failed building options: Validation failed: Name has already been taken

Deleting the remote worker VM and forcing Fly to create a new one solves this problem, but this behavior impairs a smooth CI/CD workflow.

We are in the fra region.

Hmm, this should not be related to running VMs. It looks more likely to be an issue with WireGuard peers, which are generated anew each time you deploy from a CI service. Which one are you using?

We’re using SemaphoreCI, because it runs on quite powerful machines.

But the deployment command is just something like:

fly deploy . --config services/nlp/fly.toml --remote-only --build-arg TURBO_TEAM="$TURBO_TEAM" --build-arg TURBO_TOKEN="$TURBO_TOKEN"

I found out that it fails if one or more previous deployments failed and you retry without manually deleting the remote worker beforehand.

Destroying the builder app before every deployment is a general workaround that prevents flaky deployments caused by this error or by the “volume-out-of-space” issues.

On CI this can be done with a script:

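# Collect the lines of `fly apps list` that mention a builder app.
# `|| true` keeps the script alive under `set -e` when grep finds nothing.
FLY_BUILDERS=$(fly apps list | grep -i 'fly-builder-' || true)

if [ -z "$FLY_BUILDERS" ]; then
  echo "No Fly builders found to destroy. Exiting."
  exit 0
fi

while IFS= read -r line; do
  # The app name is the first whitespace-separated field of each line.
  BUILDER=$(echo "$line" | awk '{ print $1 }')
  echo "Destroying \"$BUILDER\"..."
  fly apps destroy "$BUILDER" --yes
done <<<"$FLY_BUILDERS"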
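
A possible variant of the script above, as a rough sketch only: since the error seems to show up when retrying after a failed deployment, the builders could be destroyed only after a failure, followed by a single retry. The function names (list_builders, destroy_builders, deploy_with_retry) are made up for illustration. The sketch assumes that the app name is the first column of `fly apps list` and that builder apps start with `fly-builder-`, and it has not been run against a real Fly account.

```shell
# Print the names of all builder apps, one per line.
# Assumption: the name is the first column and starts with "fly-builder-".
list_builders() {
  fly apps list | awk '$1 ~ /^fly-builder-/ { print $1 }'
}

# Destroy every builder app so that a fresh one is created on the next deploy.
destroy_builders() {
  list_builders | while IFS= read -r builder; do
    echo "Destroying \"$builder\"..."
    # Redirect stdin so the command cannot swallow the remaining list.
    fly apps destroy "$builder" --yes < /dev/null
  done
}

# Run the given deploy command; if it fails, destroy the builders and retry once.
deploy_with_retry() {
  if "$@"; then
    return 0
  fi
  echo "Deploy failed; destroying builders and retrying once." >&2
  destroy_builders
  "$@"
}
```

On CI it would be called with the usual deploy command as arguments, for example: deploy_with_retry fly deploy . --config services/nlp/fly.toml --remote-only. The builder survives between successful deployments this way, so its cache is only thrown away when something actually went wrong.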

If there’s a more elegant solution, please tell me. :smiley: