Random and intermittent deployment error 500 (Fly Rails infra bug?)

Starting today, my deployments intermittently fail with the error message below. The problem seems to resolve itself if I wait a few minutes, then reappears an hour or two later when I attempt another deployment.

 Updating existing machines in 'my-app1' with rolling strategy
 > Acquiring lease for xxxxxxxxxxxx
 > Acquired lease for xxxxxxxxxxxx
 > Updating machine config for xxxxxxxxxxxx
 > Updating xxxxxxxxxxxx [app]
 > Updated machine config for xxxxxxxxxxxx
 ✔ Machine xxxxxxxxxxxx is now in a good state
 > Clearing lease for xxxxxxxxxxxx
 ✔ Cleared lease for xxxxxxxxxxxx
 ==> Verifying app config
 --> Verified app config
     background-color: #F7F7F7;
     border: 1px solid #CCC;
     border-right-color: #999;
     border-left-color: #999;
     border-bottom-color: #999;
     border-bottom-left-radius: 4px;
     border-bottom-right-radius: 4px;
     border-top-color: #DADADA;
     color: #666;
     box-shadow: 0 3px 8px rgba(50, 50, 50, 0.17);
   }
   </style>
 </head>
 <body>
   <!-- This file lives in public/500.html -->
   <div class="dialog">
     <div>
       <h1>We're sorry, but something went wrong.</h1>
     </div>
     <p>If you are the application owner check the logs for more information.</p>
   </div>
 </body>
 </html>

AI tells me that public/500.html is an indicator of a Ruby on Rails application. Are you running RoR?

Hi Halfer - no RoR in our app. It occurs intermittently when running fly deploy, with no apparent pattern to when it works and when it doesn’t.

Righto. Fly does use RoR, so I wonder if something is going wrong on their side; this looks like a “should never happen”. Perhaps they can look at their logs.

I suspected this was something on the Fly end. Any thoughts on how I can flag this to Fly? I don’t currently pay for support, so I don’t have a support email.

All of us do look at the community forum, we just don’t guarantee support from here.

Anyway, back to your problem, it’s very odd. Are you able to share the app name? Or something else identifying that might help us find a trace or sentry exception?

edit: I’ve found a trace for one of your requests, having a closer look now!

It looks like this is an error coming from the registry. I’ve opened an internal discussion to see if we can figure out what’s causing the registry issue, and in the meantime I’m going to put together a small change for flyctl so that the output on error isn’t just HTML-direct-to-console 🙂

Hi jfent - any further info on this or anything I can do from my end?

How do you deploy, and in what region? Can we see your fly.toml file? The forum would be on fire if deployments were intermittently working for everyone.

How big is your image?

Our running assumption here is that this is related to some work we did fairly recently to create regional registry mirrors. It seems as though the mirror has not received all of the blobs of your image when you first deploy, and so spits out a 500 when it receives a request for the first blob it hasn’t got yet. We think it’s happening to you and seemingly no one else because your image might be abnormally large.

I think that’s why you’re seeing it “self-resolve” after a bit - that’s enough time for the whole image to have been loaded into the mirror.

If that’s right, anything you’re able to do to reduce image size might help.
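If the failure really is transient, as the "self-resolve" behaviour suggests, one possible stopgap is to retry the deploy after a pause. The following is a minimal sketch under that assumption: the `retry` helper is hypothetical (it is not part of flyctl), and the attempt count and delay in the usage comment are arbitrary values, not recommendations.

```shell
# retry <max_attempts> <delay_seconds> <command> [args...]
# Hypothetical helper: re-runs the command until it succeeds or the
# attempt budget is used up, sleeping between attempts.
retry() {
  local max_attempts="$1"
  local delay="$2"
  shift 2
  local attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempt(s): $*" >&2
      return 1
    fi
    echo "attempt $attempt/$max_attempts failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  return 0
}

# Usage sketch (values are placeholders):
# retry 5 60 flyctl deploy --config /work/rendered/postgres-unified.fly.toml \
#   --app "$POSTGRES_APP_NAME" --ha=false --detach
```

This only papers over the symptom; it would not help if the 500 turns out to have a different cause.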

Hi halfer and jfent -

For my deployment, we run a mix of open-source images such as postgres, etcd, and mailslurper, plus our own custom apps. Of the custom images that are pushed to the Fly registry, none is particularly large:

image 1: 959.49 MB
image 2: 64.17 MB
image 3: 572.25 MB
image 4: 429.55 MB

The region we deploy to is lhr.

Are you seeing this problem with all of the images mentioned or just a subset?

My deployment runs in a single GH action and just runs through the list of all apps to be deployed.

if [[ -n "${POSTGRES_PASSWORD:-}" ]]; then
  flyctl secrets set POSTGRES_PASSWORD="$POSTGRES_PASSWORD" --app "$POSTGRES_APP_NAME" --stage
  flyctl secrets set POSTGRES_PASSWORD="$POSTGRES_PASSWORD" --app "$OPENBAO_INIT_APP_NAME" --stage
fi

flyctl secrets set FLY_API_TOKEN="$FLY_API_TOKEN" --app "$OPENBAO_INIT_APP_NAME" --stage

# Deploy postgres 
flyctl deploy --config /work/rendered/postgres-unified.fly.toml --app "$POSTGRES_APP_NAME" --ha=false --detach
wait_for_healthy "$POSTGRES_APP_NAME" 40 5

flyctl deploy --config /work/rendered/etcd.fly.toml --app "$ETCD_APP_NAME" --ha=false --detach
wait_for_healthy "$ETCD_APP_NAME" 40 5

# Check if OpenBao machine exists and is healthy 
OPENBAO_HEALTHY=false
if flyctl machines list --app "$OPENBAO_APP_NAME" --json 2>/dev/null \
  | jq -e 'map(select(.state == "started") | select(((.checks // []) | length == 0) or (((.checks // []) | map(.status == "passing") | all)))) | length > 0' >/dev/null 2>&1; then
  echo "✓ OpenBao machine already running and healthy - skipping deployment"
  OPENBAO_HEALTHY=true
else
  echo "Deploying OpenBao (machine doesn't exist or unhealthy)..."
  flyctl deploy --config /work/rendered/openbao.fly.toml --app "$OPENBAO_APP_NAME" --image "$OPENBAO_IMAGE" --ha=false --detach
  wait_for_healthy "$OPENBAO_APP_NAME" 40 5
fi

Where it fails varies. The above is what my deployment looks like, and these are the first apps to be deployed. Most frequently it fails deploying postgres, but occasionally it will fail at etcd or openbao.

etcd and postgres are open-source images, and openbao is an image we build ourselves (429.55 MB). Once an app fails with the above error, the GH action fails. The openbao image is rarely rebuilt, as it doesn’t change often. Our main app, whose image we do update regularly (959.49 MB), deploys much further down the list.
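The `wait_for_healthy` helper called in the script is not shown in the thread. For readers following along, here is a rough sketch of what such a helper could look like. It is an assumption throughout: the argument order (app name, number of checks, seconds between checks) is inferred from the calls `wait_for_healthy "$POSTGRES_APP_NAME" 40 5`, the `is_healthy` function is a hypothetical name, and the health test simply reuses the `jq` filter from the OpenBao check above.

```shell
# is_healthy <app>: hypothetical helper. Succeeds if at least one machine is
# started and has either no checks or only passing checks (same jq filter as
# the OpenBao check in the script above). Requires flyctl and jq.
is_healthy() {
  flyctl machines list --app "$1" --json 2>/dev/null \
    | jq -e 'map(select(.state == "started") | select(((.checks // []) | length == 0) or (((.checks // []) | map(.status == "passing") | all)))) | length > 0' >/dev/null 2>&1
}

# wait_for_healthy <app> <max_checks> <interval_seconds>
# Assumed semantics: poll is_healthy up to max_checks times.
wait_for_healthy() {
  local app="$1"
  local max_checks="$2"
  local interval="$3"
  local check=1
  while [ "$check" -le "$max_checks" ]; do
    if is_healthy "$app"; then
      echo "$app healthy after $check check(s)"
      return 0
    fi
    sleep "$interval"
    check=$((check + 1))
  done
  echo "$app not healthy after $max_checks check(s)" >&2
  return 1
}
```

The actual helper in the GH action may behave differently, for example in how it treats machines without checks.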

Hi - I think I might have shipped a fix for this to flyctl in the last ~day or so, but I hadn’t realized it might help with exactly this conversation.

Does your action here always pull the latest flyctl? Could you try it one more time with the latest flyctl?

Could you also let us know the flyctl version on your latest runs?

I might have a few ideas from my latest fix if pulling latest doesn’t help you just yet.
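To make the version easy to report, the action could log it before deploying. A small sketch: `flyctl version` is the real command, but the `extract_flyctl_version` helper is hypothetical, and it assumes the output contains a token of the form `vX.Y.Z` (as in the `v0.3.231` style used in this thread) rather than relying on the exact output layout.

```shell
# extract_flyctl_version: hypothetical helper. Reads text on stdin and prints
# the first token that looks like vX.Y.Z, or nothing if none is found.
extract_flyctl_version() {
  grep -oE 'v[0-9]+\.[0-9]+\.[0-9]+' | head -n 1
}

# Usage sketch in the GH action, before the first deploy:
# echo "Deploying with flyctl $(flyctl version | extract_flyctl_version)"
```

Printing this at the top of each run means a failing run's log already records which flyctl it used.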

I’m having the exact same issue, except mine just won’t work at all: it returns 500 every time flyctl deploy is run. I’ve had this app for over a year without problems and deploy at least twice a week.

I’ve got flyctl v0.3.231.

```

Error: failed to create release (status 500)

<html>
<head>
  <title>We're sorry, but something went wrong (500)</title>
  <meta name="viewport" content="width=device-width,initial-scale=1">
  <style>