Deploy Stuck

Not sure if this is related to the existing outage, but we’re seeing deployments stuck with the following:

Deployment is running.
1 desired, 0 placed, 0 healthy, 0 unhealthy

We’re on 0.0.456 of the agent. Anyone else unable to deploy?

Updated to the latest agent and killed the builder. Still unable to deploy.

This usually means the application can’t be placed for some reason, either because the config can’t be resolved, or because there are capacity issues on a host with a volume the app needs.

If you’re running an app with a single volume, try adding a second and see if you get unstuck.
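
A minimal sketch of that (volume name, region, and app name here are placeholders; match your own setup):

    fly volumes create data --region ord --size 10 -a my-app

The new volume should use the same name as your existing one so the scheduler can place an instance on either host.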

Thanks, @kurt. Where might I see these errors? Is there a --verbose for deploy?

Just found LOG_LEVEL=debug – trying that now. Fingers crossed.
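
For anyone else searching, it’s an environment variable set on the CLI invocation:

    LOG_LEVEL=debug fly deploy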

They don’t really manifest as errors, because Nomad is opaque in this way. When you try and schedule something that can’t yet work, Nomad just hangs out to see if it can push it through at a later time. It will happily sit there for days and then finish your deploy if it can.

Apps on Machines are much more transparent about this kind of issue, once we get existing Nomad apps migrated over: Fly Apps on machine prerelease

Very much looking forward to more transparency, which I’m hoping means a more deterministic experience. These deploy issues are on our staging instance (unpaid), and, frankly, I’m just terrified right now to touch our production instance (paid).

df -h on our volume reveals:

Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        1.5G     0  1.5G   0% /dev
/dev/vdb        7.9G  1.3G  6.2G  17% /
shm             1.5G     0  1.5G   0% /dev/shm
tmpfs           1.5G     0  1.5G   0% /sys/fs/cgroup
/dev/vdc         98G   66G   28G  70%

Do I need a paid account for our staging server to further diagnose this? I’m really at a loss here at this point.

Also, this is quite frustrating.

Are you running with more than one volume? If you’re hitting capacity issues, it’s not because of the data on the volume; it’s because the VMs you’re deploying need more RAM and CPU than the host that owns your volume can provide during a deploy.

Our infrastructure is not designed to be reliable for single volume applications. This may not be obvious; we’ve tried to document it pretty thoroughly, but it’s weird relative to other cloud providers.

I’m thoroughly confused.

Primary app: 388 MB/3 GB @ dedicated-cpu-1x
Fly Postgres: 370 MB/1 GB
Secondary App (Log shipper): 151 MB/232 MB @ shared-cpu-1x

I’ve just upgraded the staging instance to the “Launch” plan in hopes that paid email support means we’ll once again be able to deploy to staging.

Again, thanks for the help.

I’m having this exact same issue.

I don’t immediately see anything about this “no single volumes” restriction on Volumes · Fly Docs

When you talk about multiple volumes, what is meant? If the single app is using multiple volumes in different locations, is the data duplicated on each volume, or distributed?

Could you provide an example of an app with a single database table using multiple volumes?

Thanks.

“Our infrastructure is not designed to be reliable for single volume applications” - @kurt, any chance you could elaborate on this, please? Clearly, we want a reliable experience on Fly. I’d like to know more.

I’ve finally received:

1 desired, 1 placed, 0 healthy, 1 unhealthy
--> v464 failed - Failed due to unhealthy allocations - rolling back to job version 460 and deploying as v465

That’s welcome compared to a silently stuck deploy, but still (unfortunately) not of much use.

FWIW, we’re simply attempting to push some JavaScript changes. Again, just outright puzzled.

Volumes are pinned to individual host hardware. This means any given volume is not very reliable. If a host fails, your app goes down. If a host can’t allocate more RAM, you can’t create a new VM to use the volume. fly deploy creates a new VM to try and use a volume.

In fact, single instance apps aren’t very reliable on our platform. We designed it to run everything in twos. I know this is opaque and surprising.

Volumes don’t replicate; they’re just raw (and fast) block devices. If you need maximum reliability, you should either run something that handles replication between VMs/disks (like our Postgres) or not use volumes. S3 or similar are way better places to put things than a single disk.
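
You can see where each of your volumes lives with fly volumes list (app name is a placeholder):

    fly volumes list -a my-app

That shows each volume’s ID, region, zone, and attached VM, which tells you which host your data is pinned to.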

Your current choices are:

  1. Add a second volume to your application; it will land on a host with capacity and your deploy will continue. This only works if the data on your volume is disposable or your app knows how to replicate it.
  2. Remove the volume and deploy without the [mounts] section of fly.toml (see the sketch after this list).
  3. If your data is relatively static, restore the most recent volume snapshot to a new volume. You’ll get stale data, but if you haven’t been writing to the persistent disk much, that’ll be fine. (Also sketched below.)
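
A rough sketch of options 2 and 3 (volume and snapshot IDs are placeholders; list yours first):

    # Option 2: delete or comment out the [mounts] section in fly.toml, then redeploy
    fly deploy

    # Option 3: find the latest snapshot of the stuck volume, then restore it to a new volume
    fly volumes snapshots list vol_xxxxxxxxxxxx
    fly volumes create data --snapshot-id vs_xxxxxxxxxxxx -a my-app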

We got behind on docs and need to make a much bigger deal of this design. We’re working on that this month; the influx of Heroku users has been a bit overwhelming.

Problem:

The host containing our volume was silently failing, causing our deployments to seize up. @kurt explained this above.

Solution:

Restore the affected volume from a snapshot, which creates a new volume on a healthy host. This allows the deployment to continue.

Bonus: always specify --config if you have more than a single toml file, as we do. -a is not a reliable argument.
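
For example (file name illustrative):

    fly deploy --config fly.staging.toml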

Fallout:

As a small dev shop, our clients rely on us to recommend everything from designers to hosting platforms. We’ve offered fly.io exclusively for the past year or so as our PaaS of choice. Until this incident, we’ve been pleased.

However, our confidence (and trust) took a hit on this one. With zero visibility into the host-level deployment failures, we’re left waiting patiently for forum responses and email support. This is very painful to communicate to clients.

Furthermore, we’d been deploying to our staging/production environments using -a <app> for nearly 11 months with no issue. Through support emails, we learned that -a will pull the most recently used toml rather than the one on the file system. This was a significant source of confusion (and frustration) for us.

Ultimately, we remain fans of fly, but are definitely hoping the transition from Nomad to Machines provides more actionable error messages.

My deployment is stuck as well, with no logs explaining why it’s failing - very frustrating!
I’m still in development and was hoping to use Fly.io as my provider for my close-to-ready production app, but the system is very unreliable.

Have you deployed successfully previously? Are you using volumes?

Yes, using volumes. My staging volume just hangs; if I deploy to the production volume, I get the error posted here: Prisma + SQLite causes an out of memory error on deploy
Either way it’s failing.

And now my app is completely FUBARed; it didn’t even roll back to a working version. What a disaster.

I was able to fix mine by restoring to a new volume from snapshot. Does that work?