Failed due to unhealthy allocations - no stable job version to auto revert to

I've looked through the related issues, but the solutions didn't appear to apply. I get the message:
“Failed due to unhealthy allocations - no stable job version to auto revert to”

Despite this, the app appears to be running… but I'd like to figure this out.

Here is my fly.toml:

app = "hellotrava"
kill_signal = "SIGINT"
kill_timeout = 5
processes = [ ]

[env]
PORT = "8080"

[deploy]
release_command = "npx prisma migrate deploy"

[experimental]
allowed_public_ports = [ ]
auto_rollback = true

[[services]]
internal_port = 8080
processes = [ "app" ]
protocol = "tcp"
script_checks = [ ]

  [services.concurrency]
  hard_limit = 25
  soft_limit = 20
  type = "connections"

  [[services.ports]]
  handlers = [ "http" ]
  port = "80"
  force_https = true

  [[services.ports]]
  handlers = [ "tls", "http" ]
  port = "443"

  [[services.tcp_checks]]
  grace_period = "1s"
  interval = "15s"
  restart_limit = 0
  timeout = "2s"

  [[services.http_checks]]
  interval = 10_000
  grace_period = "5s"
  method = "get"
  path = "/healthcheck"
  protocol = "http"
  timeout = 2_000
  tls_skip_verify = false
  headers = { }

And here are the relevant logs:

2022-04-07T14:38:41.7053633Z 	 Starting instance
2022-04-07T14:38:41.7054698Z 	 Configuring virtual machine
2022-04-07T14:38:41.7055890Z 	 Pulling container image
2022-04-07T14:38:41.7056460Z 	 Unpacking image
2022-04-07T14:38:41.7056950Z 	 Preparing kernel init
2022-04-07T14:38:41.7057437Z 	 Configuring firecracker
2022-04-07T14:38:41.7057714Z 	 Starting virtual machine
2022-04-07T14:38:41.7057988Z 	 Starting init (commit: 6f9865f)...
2022-04-07T14:38:41.7058780Z 	 Preparing to run: `docker-entrypoint.sh npx prisma migrate deploy` as root
2022-04-07T14:38:41.7078978Z 	 2022/04/07 14:38:34 listening on [fdaa:0:5938:a7b:a9e:a07c:b56c:2]:22 (DNS: [fdaa::3]:53)
2022-04-07T14:38:41.7079455Z 	 Prisma schema loaded from prisma/schema.prisma
2022-04-07T14:38:41.7080313Z 	 Datasource "db": PostgreSQL database "hellotrava", schema "public" at "top2.nearest.of.hellotrava-db.internal:5432"
2022-04-07T14:38:41.7080789Z 	 2 migrations found in prisma/migrations
2022-04-07T14:38:41.7081113Z 	 No pending migrations to apply.
2022-04-07T14:38:41.7081386Z 	 npm notice
2022-04-07T14:38:41.7081824Z 	 npm notice New minor version of npm available! 8.5.0 -> 8.6.0
2022-04-07T14:38:41.7082248Z 	 npm notice Changelog: <https://github.com/npm/cli/releases/tag/v8.6.0>
2022-04-07T14:38:41.7082752Z 	 npm notice Run `npm install -g npm@8.6.0` to update!
2022-04-07T14:38:41.7083050Z 	 npm notice
2022-04-07T14:38:41.7083327Z 	 Main child exited normally with code: 0
2022-04-07T14:38:41.7083956Z 	 Starting clean up.
2022-04-07T14:38:41.8373047Z ==> Monitoring deployment
2022-04-07T14:38:42.0006103Z 
2022-04-07T14:38:42.0006583Z v3 is being deployed
2022-04-07T14:38:50.1227130Z 376f2ec9: vin running healthy
2022-04-07T14:38:51.0162950Z 376f2ec9: vin running unhealthy [health checks: 2 total, 1 critical]
2022-04-07T14:39:01.2375537Z 376f2ec9: vin running unhealthy [health checks: 2 total, 1 passing, 1 critical]
2022-04-07T14:43:42.6283822Z Failed Instances
2022-04-07T14:43:42.8652108Z 
2022-04-07T14:43:42.8652901Z Instance
2022-04-07T14:43:42.8654695Z Failure #1
2022-04-07T14:43:42.8655347Z ID      	PROCESS	VERSION	REGION	DESIRED	STATUS 	HEALTH CHECKS                 	RESTARTS	CREATED   
2022-04-07T14:43:42.8660315Z 
2022-04-07T14:43:42.8660797Z 376f2ec9	       	3      	vin   	run    	running	2 total, 1 passing, 1 critical	0       	4m53s ago	
2022-04-07T14:43:42.8690375Z --> v3 failed - Failed due to unhealthy allocations - no stable job version to auto revert to and deploying as v4

When you say “the app appears to be running” … do you mean you are able to access /healthcheck in a browser, and get a successful response (200)?

Only asking because when I've had a deploy say [1 passing, 1 critical], generally it's the TCP health check that passes and the HTTP health check that fails.

You can check on that by running fly logs. Do you see the Fly system attempt to call /healthcheck? What response code is shown? If you see a non-200 code (like a 500), the health check is failing and hence the deploy is not completing. Often your app will log a message saying why, such as an exception (assuming you have some kind of logging).

The other thing to double-check is whether using e.g. 10_000 in the fly.toml is valid. I assume it is (given there's no error saying it isn't), but the documented example uses plain integers like 10000.
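A sketch of that example, reconstructed from the http_checks settings quoted elsewhere in this thread (the detail that matters is the plain-integer interval and timeout values):

  [[services.http_checks]]
    interval = 10000
    grace_period = "5s"
    method = "get"
    path = "/"
    protocol = "http"
    timeout = 2000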

No, I am unable to access /healthcheck in the browser. Also, when I run fly logs I see a bunch of 404s on the /healthcheck GET request.

By “appears to be running”, I meant that I could access the home page of the app in the browser.

In a previous attempt, I removed the underscores, and still received the same errors.

Appreciate your help.

Ah, well that would explain the error on deploy then. The Fly system would also get a 404 when it tries to access /healthcheck. Since that's a non-200 response, the check would fail.

So … you could either edit the path in the fly.toml so the health check is done on /. Since you say you can access that, the checker could too, the check would pass, and the deploy would complete.

Or you could leave the fly.toml as-is and add a /healthcheck route to your app, so that, again, it (and you) would get a successful response from a request to it, and the deploy would complete.
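For the first option, that would be a one-line change to the check you already have (a sketch based on the fly.toml you posted):

  [[services.http_checks]]
  interval = 10_000
  grace_period = "5s"
  method = "get"
  path = "/"            # changed from "/healthcheck"
  protocol = "http"
  timeout = 2_000
  tls_skip_verify = false
  headers = { }

The second option depends on your framework: whatever serves your home page would also need a route that returns a 200 for GET /healthcheck.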

I am at my wits' end trying to deploy my app. I have read the troubleshooting guide and it is of no help.

My fly.toml:

[experimental]
  allowed_public_ports = []
  auto_rollback = true
  cmd = ["gunicorn", "app:app"]
  entrypoint = []
  exec = []
  private_network = true

[processes]
  app = "gunicorn app:app"

[build]
  dockerfile = "Dockerfile.publish"

[[services]]
  internal_port = 8080
  processes = ["app"]
  protocol = "tcp"
  script_checks = []
  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.ports]]
    force_https = true
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.tcp_checks]]
    grace_period = "1s"
    interval = "15s"
    restart_limit = 0
    timeout = "2s"

  [[services.http_checks]]
    interval = 10000
    grace_period = "5s"
    method = "get"
    path = "/healthcheck"
    protocol = "https"
    restart_limit = 0
    timeout = 2000
    tls_skip_verify = false

I keep getting the unhealthy-checks error on my GitHub Actions build.

The VM status:

Instance
  ID            = 4caa034a
  Process       = app
  Version       = 92
  Region        = ord
  Desired       = stop
  Status        = complete
  Health Checks = 2 total, 2 critical
  Restarts      = 0
  Created       = 5m50s ago

Events
TIMESTAMP               TYPE            MESSAGE
2022-10-12T13:24:56Z    Received        Task received by client
2022-10-12T13:24:56Z    Task Setup      Building Task Directory
2022-10-12T13:24:59Z    Started         Task started by client
2022-10-12T13:29:56Z    Alloc Unhealthy Task not running for min_healthy_time of 10s by deadline
2022-10-12T13:29:57Z    Killing         Sent interrupt. Waiting 5s before force killing
2022-10-12T13:30:15Z    Terminated      Exit Code: 0
2022-10-12T13:30:15Z    Killed          Task successfully killed

Checks
ID                                      SERVICE         STATE           OUTPUT
3df2415693844068640885b45074b954        tcp-8080        critical        dial tcp <ip>:8080: connect: connection refused
03833b6def760b24d9962af66e7ec077        tcp-8080        critical        Get "<ip>:8080/healthcheck": dial tcp 172.19.1.50:8080: connect: connection refused

Recent Logs
  2022-10-12T13:30:12Z   [info]Shutting down virtual machine
  2022-10-12T13:30:12Z   [info][2022-10-12 13:30:12 +0000] [520] [INFO] Handling signal: int
  2022-10-12T13:30:12Z   [info]Sending signal SIGINT to main child process w/ PID 520
  2022-10-12T13:30:12Z   [info][2022-10-12 13:30:12 +0000] [525] [INFO] Worker exiting (pid: 525)
  2022-10-12T13:30:12Z   [info][2022-10-12 13:30:12 +0000] [520] [INFO] Shutting down: Master
  2022-10-12T13:30:13Z   [info]Starting clean up.

And the Dockerfile:

FROM python:latest

WORKDIR /app

EXPOSE 8080

COPY ./requirements.txt /app

RUN pip install -r requirements.txt

COPY . /app

CMD ["gunicorn","-b","127.0.0.1:8080","app:app"]

Did you write your fly.toml by hand, or is this output from fly config?

  1. I don’t think you need a top-level [processes] section at all. It’s experimental and you don’t appear to be trying to run a multi-process app
  2. I don’t think you need a cmd entry in [experimental]
  3. You might be missing a top-level app entry with your fly application’s name.
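Put together, the top of the fly.toml would look roughly like this (the app name below is a placeholder for your Fly application's actual name; the rest is what you posted, minus the [processes] section and the cmd entry):

app = "your-app-name"   # placeholder: use your Fly application's name

[experimental]
  allowed_public_ports = []
  auto_rollback = true
  entrypoint = []
  exec = []
  private_network = true

[build]
  dockerfile = "Dockerfile.publish"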

In your Dockerfile, try removing the EXPOSE instruction and changing the bind address in CMD to "0.0.0.0:8080":
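With that change, the final line of the Dockerfile becomes:

CMD ["gunicorn", "-b", "0.0.0.0:8080", "app:app"]

Binding to 127.0.0.1 means gunicorn only accepts connections from inside the VM itself, so the Fly proxy's TCP and HTTP checks against port 8080 get refused, which matches the "connection refused" output in the checks you posted.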

So the previous suggestion resolved my issue, but then I encountered an HTTPS issue with my Flask app. I used the werkzeug proxy workaround, but now my app is not deploying, giving me the same unhealthy-allocations error.
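The workaround wraps the WSGI app in werkzeug's ProxyFix so Flask trusts the X-Forwarded-Proto header set by the Fly proxy; roughly:

from flask import Flask
from werkzeug.middleware.proxy_fix import ProxyFix

app = Flask(__name__)
# Typical ProxyFix setup: trust one hop of X-Forwarded-For / -Proto / -Host
# from the proxy. The header counts here are illustrative, not exact.
app.wsgi_app = ProxyFix(app.wsgi_app, x_for=1, x_proto=1, x_host=1)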

v143 is being deployed
232b495f: ord pending
232b495f: ord pending
232b495f: ord running unhealthy [health checks: 2 total, 1 passing]
232b495f: ord running unhealthy [health checks: 2 total, 1 passing, 1 critical]
Failed Instances

Instance
Failure #1
ID      	PROCESS	VERSION	REGION	DESIRED	STATUS 	HEALTH CHECKS                 	RESTARTS	CREATED   

232b495f	app    	143    	ord   	run    	running	2 total, 1 passing, 1 critical	0       	4m56s ago	

Recent Events
TIMESTAMP           	TYPE      	MESSAGE                 
--> v143 failed - Failed due to unhealthy allocations - not rolling back to stable job version 143 as current job has same specification and deploying as v144 

I tried editing the healthcheck section of my fly.toml.

 [[services.http_checks]]
    interval = 10000
    grace_period = "5s"
    method = "get"
    path = "/healthcheck"
    protocol = "https"
    restart_limit = 0
    timeout = 2000
    tls_skip_verify = false

Same deployment errors.