Unable to deploy Elixir app, no feedback in output or logs

I’m seeing the following during my deploy:

==> Release command
Command: /app/prod/rel/runner/bin/runner eval Runner.Release.migrate
         Starting instance
         Configuring virtual machine
         Pulling container image
         Unpacking image
         Preparing kernel init
Error Release command failed, deployment aborted

I can’t move past this point, and I’m not seeing anything in the logs after Preparing kernel init, so I’m struggling to work out what the issue could be.

Locally I’m able to run the migration script without error, and the Docker container builds and starts as expected. There have been no changes to the Dockerfile since the last successful release. Using fly deploy --verbose doesn’t add any useful information to the build output.
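
For reference, the release module behind that eval call follows the standard pattern from the Phoenix deployment guide; a minimal sketch of that pattern, assuming the usual :ecto_repos config:

defmodule Runner.Release do
  @app :runner

  def migrate do
    # Load the app (without starting it) so the :ecto_repos config is available
    Application.load(@app)

    # Run all pending migrations for each configured repo
    for repo <- Application.fetch_env!(@app, :ecto_repos) do
      {:ok, _, _} = Ecto.Migrator.with_repo(repo, &Ecto.Migrator.run(&1, :up, all: true))
    end
  end
end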

What size instance is it? Could it be getting killed off by OOM?

I don’t think so. The container image itself is only 420MB, and the instance is set to 1024MB.

Interestingly, I’m now unable to see any metrics in Fly’s dashboard for the instance/app. Nothing for the last 2 days. But the app is up and running, and in the logs it’s doing all the background processing I expect.

Fun times!

I think we might be running the release commands in small VMs. We should run them at the same size as the app’s VM. I’ll check with the team.

Well, I just looked, and the certificate we use to communicate with our metrics cluster has expired. Whoops! We’ll renew it, of course, and I’m going to add a continuous test for this. Metrics should be back momentarily!

Edit: Metrics are back.
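
For anyone who wants the same safety net, the check can be as simple as a scheduled openssl probe; a sketch, with a placeholder host:

$ echo | openssl s_client -connect metrics.example.internal:443 2>/dev/null \
    | openssl x509 -noout -enddate    # prints notAfter=<expiry date>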


It appears your release command isn’t logging anything and is exiting with a non-zero exit code. Your best bet is to comment out the release command, deploy, then flyctl ssh console in and run it by hand to debug.
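
Concretely, that loop looks something like this (the command string is copied from the deploy output above, and I’m assuming the usual [deploy] section in fly.toml):

# fly.toml: temporarily disable the release command
[deploy]
  # release_command = "/app/prod/rel/runner/bin/runner eval Runner.Release.migrate"

$ fly deploy
$ flyctl ssh console
# then, inside the VM, run it by hand and check the exit code:
/app/prod/rel/runner/bin/runner eval "Runner.Release.migrate"
echo $?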

> It appears your release command isn’t logging anything and is exiting with a non-zero exit code.

That’s odd. I would expect, if there’s a non-zero exit, that at least some form of error output would occur. It’s an Elixir app which executes a migration as the release command.

> Your best bet is to comment out the release command, deploy, then flyctl ssh console in and run it by hand to debug.

I’ve copied down the DB structure locally and run the migrations against it (standard process for testing migrations before deploy) and everything’s been working fine. But I’ll see if I can do as you suggest…

This is one of those situations where a “maintenance mode” toggle would be helpful.

Unfortunately this approach isn’t working either:

$ fly deploy --verbose
Deploying XXXXX
...
Monitoring Deployment

1 desired, 1 placed, 0 healthy, 1 unhealthy
v30 failed - Failed due to unhealthy allocations - rolling back to job version 29

1 desired, 1 placed, 1 healthy, 0 unhealthy
--> v31 deployed successfully
***v30 failed - Failed due to unhealthy allocations - rolling back to job version 29 and deploying as v31

Again, I’m not seeing anything in the logs so it’s pretty much impossible to debug:

2021-06-08T04:28:19.179001557Z runner[987d3f35] lhr [info] Starting instance
2021-06-08T04:28:19.216552583Z runner[987d3f35] lhr [info] Configuring virtual machine
2021-06-08T04:28:19.219245431Z runner[987d3f35] lhr [info] Pulling container image
2021-06-08T04:28:19.927760882Z runner[987d3f35] lhr [info] Unpacking image
2021-06-08T04:28:19.934354945Z runner[987d3f35] lhr [info] Preparing kernel init

It would be great if the verbose flag streamed the booting container’s output into flyctl’s output until the user cancels.

I am also seeing this on my app today. The release command should just run and exit, as I have no pending migrations.

 ==> Release command
Command: bundle exec rake db:migrate
	 Starting instance
	 Configuring virtual machine
	 Pulling container image
	 Unpacking image
	 Preparing kernel init
Error Release command failed, deployment aborted
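
For what it’s worth, with nothing pending db:migrate should be a no-op that exits 0; the migration state can be double-checked with:

$ bundle exec rake db:migrate:status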

I commented out the release command in fly.toml, but I’m now seeing a general deployment error. Nothing useful is showing up in the logs for this one.

2 desired, 2 placed, 2 healthy, 0 unhealthy
--> v120 deployed successfully
***v119 failed - Failed due to unhealthy allocations - rolling back to job version 118 and deploying as v120

UPDATE: logs are now showing:

2021-06-08T04:57:02.212571128Z app[8392e09a] ams [info] D, [2021-06-08T04:57:02.207951 #528] DEBUG -- : Flush {05d58e2c-76bb-44e9-8425-3c0244661393 [10 lines]} failed. Timeout error occurred. execution expired. Retrying

Aw man, now the app failed to deploy and roll back, so it’s down hard.

2021-06-08T04:37:57.248004103Z runner[3e6b8af6] lhr [info] Starting instance
2021-06-08T04:37:57.285869780Z runner[3e6b8af6] lhr [info] Configuring virtual machine
2021-06-08T04:37:57.288486070Z runner[3e6b8af6] lhr [info] Pulling container image
2021-06-08T04:37:57.882144548Z runner[3e6b8af6] lhr [info] Unpacking image
2021-06-08T04:37:57.888087993Z runner[3e6b8af6] lhr [info] Preparing kernel init
2021-06-08T04:56:05.016583838Z runner[d13a8cfd] lhr [info] Starting instance
2021-06-08T04:56:05.053480636Z runner[d13a8cfd] lhr [info] Configuring virtual machine
2021-06-08T04:56:05.056093116Z runner[d13a8cfd] lhr [info] Pulling container image
2021-06-08T04:58:02.754016022Z runner[d13a8cfd] lhr [info] Unpacking image
2021-06-08T04:58:06.493893292Z runner[d13a8cfd] lhr [info] Preparing kernel init
2021-06-08T04:59:53.257153124Z runner[934f630b] lhr [info] Starting instance
2021-06-08T04:59:53.294196659Z runner[934f630b] lhr [info] Configuring virtual machine
2021-06-08T04:59:53.296926857Z runner[934f630b] lhr [info] Pulling container image
2021-06-08T04:59:53.599600176Z runner[934f630b] lhr [info] Unpacking image
2021-06-08T04:59:53.606376415Z runner[934f630b] lhr [info] Preparing kernel init

@kurt @jerome Is there a manual way to roll back? The docs say this should be automatic, but it obviously hasn’t happened properly this time.

Is there an escalation route for situations like this?

Trying to deploy the previous working commit also doesn’t work. I don’t think this is “me”.

I suspected this could be related to the default canary deployment strategy, so I tried this, which worked:

fly deploy --strategy immediate --remote-only

UPDATE: Running this appeared to succeed, but my app still seems to be on a previous release.
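
To confirm which release is actually live, the standard flyctl commands help:

$ fly status     # shows the running version and instance states
$ fly releases   # lists recent releases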

Thanks for the suggestion. Trying now…

My app is now fully down as well. 🚑

"No suitable (healthy) instance found to handle request"

Trying to bring stuff up in different regions, but no luck so far.

Yep, this approach also failed for me.

Oh, interesting. The fly deploy --strategy immediate --remote-only attempt created a completely new app fly-builder-restless-surf-2468 in my account. I’ve not seen this happen before. It is also “dead”, along with my downed app.

I’m really annoyed that I have no recovery options here. Any one of the following would mitigate this downtime:

  • A toggle for “maintenance mode” to display a static HTML file while the toggle is active. I could switch that on and provide a notice to our users.
  • A “failover” option when an app is “dead” or non-responsive, to either show a static HTML page or proxy to another site. Again, I could provide a notice to our users.
  • A manual rollback option. Something like flyctl deploy --rollback v28, or being able to click a Rollback to here button in the “Activity” dashboard screen next to a “green” build. Being able to revert to a previously running state without a deploy would really help when deploys aren’t working as expected.

So: the site is down hard, the option of “deploying something to fix it” isn’t available, I have no other tools or channels available to me, and I’m starting to get emails about it. My only option is to repoint the DNS somewhere and hope people don’t see the error for too long.

FYI, a suspend/resume cycle doesn’t help either; the resume operation just hangs with no result.

Scaling the VM also has no effect.

Thankfully I’ve managed to kludge together a Cloudflare “down for maintenance” response so users aren’t seeing ugly errors. It’s not ideal.

I got a new error while trying to deploy to a different region and organization:

2021-06-08T09:22:24Z Driver Failure  rpc error: code = Unknown desc = unable to create microvm: can't populate statics: no static cache root

Note: neither adding new regions nor scaling helps.

Further update: I’ve been able to deploy a completely new app, under a new account, based on the same Dockerfile and Elixir/Phoenix “root” in use across all the apps I’m hosting on Fly.io.

I also started a brand new app and was able to deploy there. The old one still cannot deploy.