Unable to deploy Elixir Phoenix app: "Failed due to unhealthy allocations"

Daryl_Spitzer · January 24, 2022, 1:05am

I tried deploying a Phoenix (LiveView) app and ran into errors, I think because the migration failed: the tables had been created with post-migration columns, and thus the columns referred to in the migrations didn’t exist. I was able to get past this by connecting to Postgres from outside Fly and restoring the database from a backup (and deleting the migration files). But then it failed “due to unhealthy allocations”.

I thought perhaps there was still a problem with the Dockerfile so I destroyed the app without destroying the database, and then learned how to set the DATABASE_URL environment variable (using fly secrets set DATABASE_URL=...). But now I’m back to where I started. Here’s the log:

2022-01-23T23:49:47.026 runner[10a7cfb5] sjc [info]Starting instance
2022-01-23T23:49:47.057 runner[10a7cfb5] sjc [info]Configuring virtual machine
2022-01-23T23:49:47.058 runner[10a7cfb5] sjc [info]Pulling container image
2022-01-23T23:49:47.370 runner[10a7cfb5] sjc [info]Unpacking image
2022-01-23T23:49:47.376 runner[10a7cfb5] sjc [info]Preparing kernel init
2022-01-23T23:49:47.778 runner[10a7cfb5] sjc [info]Configuring firecracker
2022-01-23T23:49:47.779 runner[10a7cfb5] sjc [info]Starting virtual machine
2022-01-23T23:49:47.944 app[10a7cfb5] sjc [info]Starting init (commit: 0c50bff)...
2022-01-23T23:49:47.965 app[10a7cfb5] sjc [info]Preparing to run: `/app/bin/migrate` as nobody
2022-01-23T23:49:47.979 app[10a7cfb5] sjc [info]2022/01/23 23:49:47 listening on [fdaa:0:46ae:a7b:2295:10a7:cfb5:2]:22 (DNS: [fdaa::3]:53)
2022-01-23T23:49:49.986 app[10a7cfb5] sjc [info]23:49:49.983 [info] Migrations already up
2022-01-23T23:49:50.974 app[10a7cfb5] sjc [info]Main child exited normally with code: 0
2022-01-23T23:49:50.974 app[10a7cfb5] sjc [info]Reaped child process with pid: 561 and signal: SIGUSR1, core dumped? false
2022-01-23T23:49:50.975 app[10a7cfb5] sjc [info]Starting clean up.
2022-01-23T23:49:58.157 runner[2a2d797b] sjc [info]Starting instance
2022-01-23T23:49:58.187 runner[2a2d797b] sjc [info]Configuring virtual machine
2022-01-23T23:49:58.188 runner[2a2d797b] sjc [info]Pulling container image
2022-01-23T23:49:58.489 runner[2a2d797b] sjc [info]Unpacking image
2022-01-23T23:49:58.494 runner[2a2d797b] sjc [info]Preparing kernel init
2022-01-23T23:49:58.916 runner[2a2d797b] sjc [info]Configuring firecracker
2022-01-23T23:49:59.031 runner[2a2d797b] sjc [info]Starting virtual machine
2022-01-23T23:49:59.167 app[2a2d797b] sjc [info]Starting init (commit: 0c50bff)...
2022-01-23T23:49:59.180 app[2a2d797b] sjc [info]Preparing to run: `/app/bin/server` as nobody
2022-01-23T23:49:59.195 app[2a2d797b] sjc [info]2022/01/23 23:49:59 listening on [fdaa:0:46ae:a7b:2295:2a2d:797b:2]:22 (DNS: [fdaa::3]:53)
2022-01-23T23:50:00.186 app[2a2d797b] sjc [info]Reaped child process with pid: 546, exit code: 0
2022-01-23T23:50:02.189 app[2a2d797b] sjc [info]Reaped child process with pid: 567 and signal: SIGUSR1, core dumped? false
2022-01-23T23:50:20.853 proxy[2a2d797b] sjc [error]Health check status changed 'passing' => 'critical'

And here’s what I see in the terminal where I ran fly launch:

...
Monitoring Deployment

1 desired, 1 placed, 0 healthy, 1 unhealthy [health checks: 1 total, 1 critical]
v4 failed - Failed due to unhealthy allocations - no stable job version to auto revert to
Failed Instances

==> Failure #1

Instance
  ID            = 2a2d797b
  Process       =
  Version       = 4
  Region        = sjc
  Desired       = run
  Status        = running
  Health Checks = 1 total, 1 critical
  Restarts      = 0
  Created       = 4m53s ago

Recent Events
TIMESTAMP            TYPE       MESSAGE
2022-01-23T23:49:56Z Received   Task received by client
2022-01-23T23:49:56Z Task Setup Building Task Directory
2022-01-23T23:49:59Z Started    Task started by client

Recent Logs
2022-01-23T23:49:59.000 [info] Starting init (commit: 0c50bff)...
2022-01-23T23:49:59.000 [info] Preparing to run: `/app/bin/server` as nobody
2022-01-23T23:49:59.000 [info] 2022/01/23 23:49:59 listening on [fdaa:0:46ae:a7b:2295:2a2d:797b:2]:22 (DNS: [fdaa::3]:53)
2022-01-23T23:50:00.000 [info] Reaped child process with pid: 546, exit code: 0
2022-01-23T23:50:02.000 [info] Reaped child process with pid: 567 and signal: SIGUSR1, core dumped? false
2022-01-23T23:50:20.000 [error] Health check status changed 'passing' => 'critical'
***v4 failed - Failed due to unhealthy allocations - no stable job version to auto revert to and deploying as v5

Troubleshooting guide at https://fly.io/docs/getting-started/troubleshooting/
Error abort

I found a topic describing a similar problem but it doesn’t look like either of @rushsteve12’s “two-fold” issues apply to me.

I did try @OldhamMade’s suggestion to run fly scale memory 1024 and @kurt’s to change grace_period in my fly.toml to “10s”:

...
[[services.tcp_checks]]
    grace_period = "10s"
    interval = "15s"
    restart_limit = 0
    timeout = "2s"

But I still got “Failed due to unhealthy allocations”.

How do I troubleshoot this? Is there a way to get more detail on what those allocations are?

ExGecko · February 1, 2022, 9:11am

Have the same issue, did you find a solution or what was wrong?

I was able to reproduce this on a new repo/phx-app as well. But no solution yet; more log output would be amazing. How can we get more insight?

FrequentFlyer · February 1, 2022, 9:50am

@ExGecko and @Daryl_Spitzer, can you try any of these 3 things:

1. The thread you linked to says to use http instead of https for the checks. Could you try that?
2. Bump up the grace period even further, to 60s, just to rule it out.
3. If you set up & ship metrics from your app (configurable in fly.toml), you can see if it’s a resource constraint of some sort. I believe dashboards are available in the Sign In · Fly page.

Oh, and flyctl vm status <vm-id> should show VM events. There’s a chance of finding something there as well.

kurt · February 1, 2022, 2:16pm

When a deployment fails, the first step is to look at a failed VM and see what you can figure out. RAM increases are only useful if the VM had an out of memory error (which you might see in the logs). The health check grace period is only helpful if health checks took too long to pass.

To see the specific VM status, run fly status --all to get a list of VMs. Find one with status failed, then run fly vm status <id>. This will give you a lot more information. Make sure you check the exit code: if it’s 0, the failure came from health checks; if it’s nonzero, something is crashing the process.
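
A minimal sketch of that triage, assuming the VM's event list contains a line of the form "Exit Code: N". classify_vm_exit is a made-up helper name for illustration, not a flyctl command:

```shell
# Hypothetical helper: reads `fly vm status <id>` output on stdin and applies
# the rule of thumb above to the last "Exit Code: N" line it finds.
classify_vm_exit() {
  code=$(sed -n 's/.*Exit Code: \([0-9][0-9]*\).*/\1/p' | tail -n 1)
  case "$code" in
    "") echo "no exit code found" ;;
    0)  echo "exit 0: look at health checks" ;;
    *)  echo "exit $code: the process crashed" ;;
  esac
}
```

It would be used as: fly vm status <id> | classify_vm_exit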

Daryl_Spitzer · February 2, 2022, 4:31am

@FrequentFlyer:

I don’t have any checks specified in my fly.toml file. I’m using the file unmodified after it was generated when I first ran fly launch.

[[services]]
  http_checks = []

Done.
I’ll look into how metrics work.

@kurt & @FrequentFlyer:

$ fly status --all
App
  Name     = ssauction
  Owner    = personal
  Version  = 9
  Status   = running
  Hostname = ssauction.fly.dev

Deployment Status
  ID          = a43a4cae-e858-e340-906a-f99a03bfb6ca
  Version     = v9
  Status      = failed
  Description = Failed due to unhealthy allocations - no stable job version to auto revert to
  Instances   = 1 desired, 1 placed, 0 healthy, 1 unhealthy

Instances
ID      	PROCESS	VERSION	REGION	DESIRED	STATUS  	HEALTH CHECKS      	RESTARTS	CREATED
3576c1df	app    	9 ⇡    	sea(B)	stop   	complete	1 total, 1 critical	0       	6m1s ago
63e9f7df	app    	8      	sjc   	run    	running 	1 total, 1 critical	0       	2022-01-24T00:14:03Z
$ fly vm status 3576c1df
Instance
  ID            = 3576c1df
  Process       =
  Version       = 9
  Region        = sea
  Desired       = stop
  Status        = complete
  Health Checks = 1 total, 1 critical
  Restarts      = 0
  Created       = 7m12s ago

Recent Events
TIMESTAMP            TYPE            MESSAGE
2022-02-02T04:18:57Z Received        Task received by client
2022-02-02T04:18:57Z Task Setup      Building Task Directory
2022-02-02T04:19:04Z Started         Task started by client
2022-02-02T04:23:57Z Alloc Unhealthy Task not running for min_healthy_time of 10s by deadline
2022-02-02T04:23:58Z Killing         Sent interrupt. Waiting 5s before force killing
2022-02-02T04:24:20Z Terminated      Exit Code: 0
2022-02-02T04:24:20Z Killed          Task successfully killed

Checks
ID                               SERVICE  STATE    OUTPUT
d6dd6a7392c47a522d5161aff2bffadd tcp-8080 critical dial tcp 172.19.6.2:8080: connect: connection refused

Recent Logs

Daryl_Spitzer · February 2, 2022, 4:48am

I realized the one config change I made to my app (in config/dev.exs) is to enable Tailwind CSS. IIRC I followed these instructions to do so: Adding Tailwind CSS to Phoenix 1.6. I read them again and found section 8: “Building CSS in Production”. So I made the changes directed there and got this error while building the Docker image:

 => ERROR [builder 12/17] RUN mix assets.deploy                                                                                         0.6s
------
 > [builder 12/17] RUN mix assets.deploy:
#19 0.571 sh: 1: npm: not found
#19 0.579 ** (exit) 127
#19 0.579     (mix 1.12.2) lib/mix/tasks/cmd.ex:64: Mix.Tasks.Cmd.run/1
#19 0.579     (mix 1.12.2) lib/mix/task.ex:394: anonymous fn/3 in Mix.Task.run_task/3
#19 0.579     (mix 1.12.2) lib/mix/task.ex:452: Mix.Task.run_alias/5
#19 0.579     (mix 1.12.2) lib/mix/cli.ex:84: Mix.CLI.run_task/2
------
Error error building: executor failed running [/bin/sh -c mix assets.deploy]: exit code: 1

I guess those instructions don’t apply to Fly.io. Perhaps I need to find the equivalent using esbuild instead of npm. I’ll see what I can find.

Daryl_Spitzer · February 2, 2022, 5:13am

I found Tailwind Standalone for Phoenix · Fly and (after backing out the “8. Building CSS in Production” changes I describe above) I made the suggested changes and confirmed it works in dev. But fly launch continues to fail with:

...
Recent Logs
2022-02-02T05:03:37.000 [info] Unpacking image
2022-02-02T05:03:37.000 [info] Preparing kernel init
2022-02-02T05:03:38.000 [info] Configuring firecracker
2022-02-02T05:03:38.000 [info] Starting virtual machine
2022-02-02T05:03:38.000 [info] Starting init (commit: 0c50bff)...
2022-02-02T05:03:38.000 [info] Preparing to run: `/app/bin/server` as nobody
2022-02-02T05:03:38.000 [info] 2022/02/02 05:03:38 listening on [fdaa:0:46ae:a7b:ac2:82c2:4078:2]:22 (DNS: [fdaa::3]:53)
2022-02-02T05:03:39.000 [info] Reaped child process with pid: 546, exit code: 0
2022-02-02T05:03:41.000 [info] Reaped child process with pid: 567 and signal: SIGUSR1, core dumped? false
***v10 failed - Failed due to unhealthy allocations - no stable job version to auto revert to and deploying as v11

Daryl_Spitzer · February 3, 2022, 3:18am

I just read Deploying on Fly.io — Phoenix v1.7.10 and it reads:

Make our project ready for Fly

For this guide, we’ll use a Dockerfile and build a release for our Fly deployment. Internally, Fly’s networking uses IPv6, so there is a little config we can do to our application to make it a smooth experience.

…and then has subsections titled:

“Use releases” - configure the app to deploy using Releases including the section on containers
“Runtime configuration” - update config/runtime.exs to configure it for Fly
“Generate release config files” - use the mix release.init command

I didn’t do any of that. I followed the instructions in Getting Started · Fly Docs and ran fly launch. Does fly launch do all of the above? Or should I try deleting my app and start over following the above instructions?

FrequentFlyer · February 3, 2022, 5:59am

Hi Daryl,

About the health check, Fly does a basic TCP connection check when there’s a [[services]] block present in your fly.toml. So that’s what’s failing here:
tcp-8080 critical dial tcp 172.19.6.2:8080: connect: connection refused
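
For a rough idea of what a basic TCP connection check amounts to, here is a sketch that simply tries to open a socket. It assumes bash is installed (it relies on bash's /dev/tcp redirection), and tcp_check is a hypothetical helper, not Fly's actual check implementation:

```shell
# Hypothetical helper: print "passing" if a TCP connection to host:port can be
# opened, "critical" otherwise. No timeout is applied, unlike a real check.
tcp_check() {
  # $1 = host, $2 = port; the socket closes when the inner bash exits
  if bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null; then
    echo passing
  else
    echo critical
  fi
}
```

Run from somewhere that can reach the app, e.g. tcp_check 127.0.0.1 8080. A "connection refused" like the one above shows up as critical, meaning nothing accepted a connection on that address and port.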

Considering the above and other setup work you’ve attempted so far, IMO, I’d suggest starting afresh with just the Fly docs & guides. Of course, Fly folks may be able to “check in the back” and suggest things based on internal knowledge to sort you out.

I’ve seen some posts here saying they’ve had to remove things suggested from the hexdocs.pm guides. At the very least, you could use it for cross-reference.

And finally, sorry I don’t know how many of the guides & links you’ve been through already; just posting here for thoroughness or whatever.
Fly sample apps are available at both github.com/fly-apps and github.com/superfly.
There’s also fly.io/phoenix-files if you want to keep up with Phoenix and LiveView stuff at Fly.
Note that phoenix-files is separate from fly.io/blog…

EDIT: Missed a link to the latest Phoenix LiveView example app - LiveBeats: Building a social music app with Phoenix LiveView · Fly

Daryl_Spitzer · February 4, 2022, 2:50am

Thanks @FrequentFlyer. You wrote:

About the health check, Fly does a basic TCP connection check when there’s a [[services]] block present in your fly.toml . So that’s what’s failing here…

But since inside my [[services]] block I have http_checks = [], I’m already using http instead of https for the checks, right?

And all the TCP connection check failure is telling me is that the deployed Docker container is not responding to HTTP requests, right? Can I run the Docker image locally to give me more troubleshooting info?

Of course, Fly folks may be able to “check in the back” and suggest things based on internal knowledge to sort you out.

I sure wish they would.

Considering the above and other setup work you’ve attempted so far, IMO, I’d suggest starting afresh with just the Fly docs & guides.

I feel like I’ve already used only the Fly docs. But since I don’t know what else to try I’ll start over yet again unless I get more advice.

FrequentFlyer · February 4, 2022, 3:47am

This is of course based only on my understanding & inference; could be totally wrong.

But since inside my [[services]] block I have http_checks = [] , I’m already using http instead of https for the checks, right?

Though there’s http_checks = [], since it’s an empty block, it must be taking it as no checks; therefore defaulting to TCP conn check (as seen in the HC failure).

And all the TCP connection check failure is telling me is that the deployed Docker container is not responding to HTTP requests, right?

Afraid I don’t know the exact check function that’s used, except for what I’ve seen on the forum (basic TCP connection check).

Can I run the Docker image locally to give me more troubleshooting info?

fly launch may have generated a Dockerfile you can use for local testing.
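
A minimal sketch of running that Dockerfile locally, under several assumptions: the image tag ssauction-local and port 8080 are placeholders, the release reads PORT, SECRET_KEY_BASE and DATABASE_URL from the environment, and DATABASE_URL points at a database reachable from inside the container (a Fly-internal hostname will not resolve locally):

```shell
# Hypothetical helper: build the image from the Dockerfile in the current
# directory and run it with the environment the release is assumed to need.
run_image_locally() {
  image=${1:-ssauction-local}   # placeholder tag
  port=${2:-8080}               # assumed to match internal_port in fly.toml
  docker build -t "$image" . &&
  docker run --rm -p "$port:$port" \
    -e PORT="$port" \
    -e SECRET_KEY_BASE="${SECRET_KEY_BASE:?set SECRET_KEY_BASE first}" \
    -e DATABASE_URL="${DATABASE_URL:?set DATABASE_URL first}" \
    "$image"
}
```

With the container in the foreground, startup errors from the release print straight to the terminal, which is the extra detail the deploy log does not show.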

Yet another link, this one looks comprehensive, I hope it helps

Sorry I couldn’t be of more help…
I’ve only tried to fish out info from others who have had success with this from the forum.
There’s a lot of good bits here and there that can surely be put into gold standard guides, covering all common use cases seen so far.

Daryl_Spitzer · February 5, 2022, 9:04pm

I decided to start from scratch with a new app generated using the instructions in Deploy an Elixir Phoenix Application.

The first time I ran mix phx.new ssauction_live_fly and then fly launch I saw:

...
We recommend upgrading to Phoenix 1.6.3 which includes a release configuration for Docker-based deployment.
...

I did that and fly launch failed. I neglected to record why.

So I rm -rfed the whole directory and started again, but upgraded to Phoenix 1.6.6 after running mix phx.new ssauction_live_fly but before running fly launch. This time I got:

	 20:17:39.853 [error] Postgrex.Protocol (#PID<0.136.0>) failed to connect: ** (DBConnection.ConnectionError) tcp connect (ssauction-db.internal:5432): non-existing domain - :nxdomain

I found Failed to connect to database cluster (non-existing domain) - #2 by kurt and made the changes recommended and ran fly launch again. I got:

--> Building image done
==> Pushing image to fly
The push refers to repository [registry.fly.io/ssauction]
29f06f2baaee: Pushed
4c686833369d: Layer already exists
f75686d47dae: Layer already exists
d3cce7faa027: Layer already exists
6129aa9d37ee: Layer already exists
ba5a5fe43301: Layer already exists
deployment-1644093446: digest: sha256:abc7146f666cbb07e18d4e9824579a75740f17e9de140f974eea89b227a84fd0 size: 1575
--> Pushing image done
Image: registry.fly.io/ssauction:deployment-1644093446
Image size: 117 MB
==> Creating release
Release v2 created
Release command detected: this new release will not be available until the command succeeds.

You can detach the terminal anytime without stopping the deployment
==> Release command
Command: /app/bin/migrate
	 Starting instance
	 Configuring virtual machine
	 Pulling container image
	 Unpacking image
	 Preparing kernel init
	 Starting virtual machine
	 Starting init (commit: 0c50bff)...
	 2022/02/05 20:38:02 listening on [fdaa:0:46ae:a7b:2295:baae:94ff:2]:22 (DNS: [fdaa::3]:53)
	 20:38:04.497 [info] Migrations already up
	 Main child exited normally with code: 0
	 Reaped child process with pid: 559 and signal: SIGUSR1, core dumped? false
	 Reaped child process with pid: 561 and signal: SIGUSR1, core dumped? false
	 Starting clean up.
Monitoring Deployment

1 desired, 1 placed, 0 healthy, 1 unhealthy [health checks: 1 total, 1 critical]
v0 failed - Failed due to unhealthy allocations - no stable job version to auto revert to
Failed Instances

==> Failure #1

Instance
  ID            = 1ba57533
  Process       =
  Version       = 0
  Region        = sjc
  Desired       = run
  Status        = running
  Health Checks = 1 total, 1 critical
  Restarts      = 0
  Created       = 4m57s ago

Recent Events
TIMESTAMP            TYPE       MESSAGE
2022-02-05T20:38:15Z Received   Task received by client
2022-02-05T20:38:15Z Task Setup Building Task Directory
2022-02-05T20:38:18Z Started    Task started by client

Recent Logs
2022-02-05T20:38:19.000 [info] Reaped child process with pid: 546, exit code: 0
2022-02-05T20:38:21.000 [info] Reaped child process with pid: 567 and signal: SIGUSR1, core dumped? false
2022-02-05T20:38:49.000 [error] Health check status changed 'passing' => 'critical'
***v0 failed - Failed due to unhealthy allocations - no stable job version to auto revert to and deploying as v1

Troubleshooting guide at https://fly.io/docs/getting-started/troubleshooting/
Error abort

So even on a newly generated Phoenix app I’m still getting “Failed due to unhealthy allocations”. I don’t know what to do at this point (except throw up my hands and give up).

kurt · February 5, 2022, 10:09pm

The Phoenix upgrade process prints a whole bunch of instructions for adjusting your config files. I missed these when I upgraded a Phoenix app.

That error probably means there’s some config missing, particularly the endpoint config.

If you generate a brand new project with Phoenix 1.6.6, the runtime.exs file has everything you need. It should include an endpoint block like this:

  secret_key_base =
    System.get_env("SECRET_KEY_BASE") ||
      raise """
      environment variable SECRET_KEY_BASE is missing.
      You can generate one by calling: mix phx.gen.secret
      """

  host = System.get_env("PHX_HOST") || "example.com"
  port = String.to_integer(System.get_env("PORT") || "4000")

  config :fizz, FizzWeb.Endpoint,
    url: [host: host, port: 443],
    check_origin: :conn,
    http: [
      # Enable IPv6 and bind on all interfaces.
      # Set it to  {0, 0, 0, 0, 0, 0, 0, 1} for local network only access.
      # See the documentation on https://hexdocs.pm/plug_cowboy/Plug.Cowboy.html
      # for details about using IPv6 vs IPv4 and loopback vs public addresses.
      ip: {0, 0, 0, 0, 0, 0, 0, 0},
      port: port
    ],
    secret_key_base: secret_key_base

Does yours have that?

That health check error means it can’t connect to your app server, which is most likely because it’s not listening on the right IP / port combo.
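
One quick way to look for a port mismatch, sketched under the assumption that the check dials the service's internal_port from fly.toml while the app listens on whatever PORT resolves to (4000 by default in the config above). Both function names are made up for illustration:

```shell
# Hypothetical helper: print every internal_port value found in a fly.toml.
fly_internal_ports() {
  sed -n 's/^[[:space:]]*internal_port[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "${1:-fly.toml}"
}

# Succeeds only if the port the app listens on is listed as an internal_port.
port_matches_fly_toml() {
  app_port=$1
  fly_internal_ports "${2:-fly.toml}" | grep -qx "$app_port"
}
```

For example, port_matches_fly_toml 4000 || echo "app port is not an internal_port in fly.toml". This only compares numbers; it says nothing about which IP the endpoint binds to.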

Daryl_Spitzer · February 5, 2022, 10:44pm

Success! I upgraded Phoenix using mix archive.install hex phx_new 1.6.6 before running mix phx.new and fly launch worked!

Vizzy · September 25, 2022, 7:53pm

Hi, I’m having a similar error. My LiveView app runs locally (port 4000), and I can get through the Postgres stuff (so the inet changes I needed to make work), but I think I’m lost among the port mods. In prod.exs I have

config :demo, DemoWeb.Endpoint,
  url: [host: "autumn-sea-2660.fly.dev", port: 4000],
  cache_static_manifest: "priv/static/cache_manifest.json"

my toml

[env]
  PHX_HOST = "autumn-sea-2660.fly.dev"
  PORT = "4000"

[experimental]
  allowed_public_ports = []
  auto_rollback = true

[[services]]
  http_checks = []
  internal_port = 4000
  processes = ["app"]
  protocol = "tcp"
  script_checks = []
  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.ports]]
    force_https = true
    handlers = ["http"]
    port = 4000

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.tcp_checks]]
    grace_period = "1s"
    interval = "15s"
    restart_limit = 0
    timeout = "2s"

and I added

EXPOSE 4000

to the Dockerfile

Vizzy · September 25, 2022, 7:53pm

Recent Logs
***v18 failed - Failed due to unhealthy allocations - no stable job version to auto revert to and deploying as v19

Troubleshooting guide at Troubleshooting your Deployment · Fly Docs
Error abort

PinOcean · September 29, 2022, 2:32am

@Vizzy I’m having the same issue as you. Details here: "Failed due to unhealthy allocations" on Phoenix Deployment