Strange behavior preventing app startup

Here we are again, only this time I do have restart_limit = 0 in my fly.toml.

Everything was working fine yesterday; I did multiple deploys to my beta and prod sites (which are now being served by Fly) and it all went smoothly, even with restart_limit = 2 (coupled with large health-check timeouts). But today I went to test an update to my app on my internal sandbox server (sm-sandbox), and not only is that update not deploying, but I can’t get the app to deploy anything at all, which makes me nervous about updating the live sites…
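
For reference, restart_limit lives inside the check block in fly.toml. A rough sketch of the relevant bit (other fields omitted; my full config is further down-thread):

  [[services.http_checks]]
    path = "/healthz"
    restart_limit = 0   # as I understand it, 0 disables check-triggered restarts entirely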

Here are the logs from a single deploy attempt…

2021-01-29T19:22:29.490Z 868c742e sjc [info] Starting Meteor server on 868c742e
2021-01-29T19:22:30.734Z 868c742e sjc [info] Monti APM: completed instrumenting the app
2021-01-29T19:22:31.527Z 868c742e sjc [info] Strum Machine v333.0 now serving sandbox.strummachine.com on 868c742e
2021-01-29T19:22:32.125Z 868c742e sjc [info] Shutting down virtual machine
2021-01-29T19:22:32.203Z 868c742e sjc [info] Sending signal SIGTERM to main child process w/ PID 503
2021-01-29T19:22:33.122Z 75c6e89d sjc [info] Starting instance
2021-01-29T19:22:33.142Z 75c6e89d sjc [info] Configuring virtual machine
2021-01-29T19:22:33.143Z 75c6e89d sjc [info] Pulling container image
2021-01-29T19:22:33.205Z 868c742e sjc [info] Reaped child process with pid: 503 and signal: SIGTERM
2021-01-29T19:22:33.525Z 75c6e89d sjc [info] Unpacking image
2021-01-29T19:22:33.528Z 75c6e89d sjc [info] Preparing kernel init
2021-01-29T19:22:33.892Z 75c6e89d sjc [info] Configuring firecracker
2021-01-29T19:22:33.970Z 75c6e89d sjc [info] Starting virtual machine
2021-01-29T19:22:34.080Z 75c6e89d sjc [info] Starting init (commit: a91af2e)...
2021-01-29T19:22:34.093Z 75c6e89d sjc [info] Running: `docker-entrypoint.sh node main.js` as root
2021-01-29T19:22:36.837Z 75c6e89d sjc [info] Monti APM: Successfully connected
2021-01-29T19:22:43.388Z 75c6e89d sjc [info] Starting Meteor server on 75c6e89d
2021-01-29T19:22:44.595Z 75c6e89d sjc [info] Monti APM: completed instrumenting the app
2021-01-29T19:22:45.054Z 75c6e89d sjc [info] Health check status changed to 'warning'
2021-01-29T19:22:45.386Z 75c6e89d sjc [info] Strum Machine v333.0 now serving sandbox.strummachine.com on 75c6e89d
2021-01-29T19:23:29.553Z 75c6e89d sjc [info] Health check status changed to 'critical'
2021-01-29T19:24:35.762Z 75c6e89d sjc [info] Shutting down virtual machine
2021-01-29T19:24:35.840Z 75c6e89d sjc [info] Sending signal SIGTERM to main child process w/ PID 503
2021-01-29T19:24:36.841Z 75c6e89d sjc [info] Reaped child process with pid: 503 and signal: SIGTERM
2021-01-29T19:24:39.210Z 75c6e89d sjc [info] Starting instance
2021-01-29T19:24:39.229Z 75c6e89d sjc [info] Configuring virtual machine
2021-01-29T19:24:39.230Z 75c6e89d sjc [info] Pulling container image
2021-01-29T19:24:39.570Z 75c6e89d sjc [info] Health check status changed to 'passing'
2021-01-29T19:24:39.719Z 75c6e89d sjc [info] Unpacking image
2021-01-29T19:24:39.724Z 75c6e89d sjc [info] Preparing kernel init
2021-01-29T19:24:40.073Z 75c6e89d sjc [info] Configuring firecracker
2021-01-29T19:24:40.149Z 75c6e89d sjc [info] Starting virtual machine
2021-01-29T19:24:40.270Z 75c6e89d sjc [info] Starting init (commit: a91af2e)...
2021-01-29T19:24:40.283Z 75c6e89d sjc [info] Running: `docker-entrypoint.sh node main.js` as root
2021-01-29T19:24:43.002Z 75c6e89d sjc [info] Monti APM: Successfully connected
2021-01-29T19:24:46.155Z 75c6e89d sjc [info] Health check status changed to 'warning'
2021-01-29T19:24:49.849Z 75c6e89d sjc [info] Starting Meteor server on 75c6e89d
2021-01-29T19:24:51.080Z 75c6e89d sjc [info] Monti APM: completed instrumenting the app
2021-01-29T19:24:51.884Z 75c6e89d sjc [info] Strum Machine v333.0 now serving sandbox.strummachine.com on 75c6e89d
2021-01-29T19:24:59.345Z 75c6e89d sjc [info] Health check status changed to 'critical'
2021-01-29T19:26:38.047Z 75c6e89d sjc [info] Shutting down virtual machine
2021-01-29T19:26:38.126Z 75c6e89d sjc [info] Sending signal SIGTERM to main child process w/ PID 503
2021-01-29T19:26:39.085Z fada5313 sjc [info] Starting instance
2021-01-29T19:26:39.106Z fada5313 sjc [info] Configuring virtual machine
2021-01-29T19:26:39.107Z fada5313 sjc [info] Pulling container image
2021-01-29T19:26:39.127Z 75c6e89d sjc [info] Reaped child process with pid: 503 and signal: SIGTERM
2021-01-29T19:26:39.616Z fada5313 sjc [info] Unpacking image
2021-01-29T19:26:39.619Z fada5313 sjc [info] Preparing kernel init
2021-01-29T19:26:39.959Z fada5313 sjc [info] Configuring firecracker
2021-01-29T19:26:40.035Z fada5313 sjc [info] Starting virtual machine
2021-01-29T19:26:40.156Z fada5313 sjc [info] Starting init (commit: a91af2e)...
2021-01-29T19:26:40.169Z fada5313 sjc [info] Running: `docker-entrypoint.sh node main.js` as root
2021-01-29T19:26:42.354Z 75c6e89d sjc [info] Health check status changed to 'passing'
2021-01-29T19:26:42.967Z fada5313 sjc [info] Monti APM: Successfully connected
2021-01-29T19:26:49.909Z fada5313 sjc [info] Starting Meteor server on fada5313
2021-01-29T19:26:51.128Z fada5313 sjc [info] Monti APM: completed instrumenting the app
2021-01-29T19:26:51.930Z fada5313 sjc [info] Strum Machine v333.0 now serving sandbox.strummachine.com on fada5313
2021-01-29T19:27:25.551Z fada5313 sjc [info] Health check status changed to 'critical'
2021-01-29T19:28:41.483Z fada5313 sjc [info] Shutting down virtual machine
2021-01-29T19:28:41.563Z fada5313 sjc [info] Sending signal SIGTERM to main child process w/ PID 503
2021-01-29T19:28:42.565Z fada5313 sjc [info] Reaped child process with pid: 503 and signal: SIGTERM
2021-01-29T19:28:45.392Z fada5313 sjc [info] Starting instance
2021-01-29T19:28:45.410Z fada5313 sjc [info] Pulling container image
2021-01-29T19:28:45.410Z fada5313 sjc [info] Configuring virtual machine
2021-01-29T19:28:45.462Z fada5313 sjc [info] Health check status changed to 'passing'
2021-01-29T19:28:45.959Z fada5313 sjc [info] Unpacking image
2021-01-29T19:28:45.964Z fada5313 sjc [info] Preparing kernel init
2021-01-29T19:28:46.281Z fada5313 sjc [info] Configuring firecracker
2021-01-29T19:28:46.392Z fada5313 sjc [info] Starting virtual machine
2021-01-29T19:28:46.493Z fada5313 sjc [info] Starting init (commit: a91af2e)...
2021-01-29T19:28:46.505Z fada5313 sjc [info] Running: `docker-entrypoint.sh node main.js` as root
2021-01-29T19:28:49.179Z fada5313 sjc [info] Monti APM: Successfully connected
2021-01-29T19:28:51.869Z fada5313 sjc [info] Health check status changed to 'warning'
2021-01-29T19:28:56.583Z fada5313 sjc [info] Starting Meteor server on fada5313
2021-01-29T19:28:57.795Z fada5313 sjc [info] Monti APM: completed instrumenting the app
2021-01-29T19:28:58.690Z fada5313 sjc [info] Strum Machine v333.0 now serving sandbox.strummachine.com on fada5313
2021-01-29T19:29:47.852Z fada5313 sjc [info] Health check status changed to 'critical'
2021-01-29T19:30:47.493Z fada5313 sjc [info] Shutting down virtual machine
2021-01-29T19:30:47.576Z fada5313 sjc [info] Sending signal SIGTERM to main child process w/ PID 503
2021-01-29T19:30:48.577Z fada5313 sjc [info] Reaped child process with pid: 503 and signal: SIGTERM
2021-01-29T19:30:51.630Z fada5313 sjc [info] Starting instance
2021-01-29T19:30:51.650Z fada5313 sjc [info] Configuring virtual machine
2021-01-29T19:30:51.651Z fada5313 sjc [info] Pulling container image
2021-01-29T19:30:52.209Z fada5313 sjc [info] Unpacking image
2021-01-29T19:30:52.212Z fada5313 sjc [info] Preparing kernel init
2021-01-29T19:30:52.460Z fada5313 sjc [info] Health check status changed to 'passing'
2021-01-29T19:30:52.512Z fada5313 sjc [info] Configuring firecracker
2021-01-29T19:30:52.591Z fada5313 sjc [info] Starting virtual machine
2021-01-29T19:30:52.687Z fada5313 sjc [info] Starting init (commit: a91af2e)...
2021-01-29T19:30:52.695Z fada5313 sjc [info] Running: `docker-entrypoint.sh node main.js` as root
2021-01-29T19:30:54.931Z fada5313 sjc [info] Monti APM: Successfully connected
2021-01-29T19:30:59.043Z fada5313 sjc [info] Health check status changed to 'warning'
2021-01-29T19:31:03.065Z fada5313 sjc [info] Starting Meteor server on fada5313
2021-01-29T19:31:04.279Z fada5313 sjc [info] Monti APM: completed instrumenting the app
2021-01-29T19:31:05.089Z fada5313 sjc [info] Strum Machine v333.0 now serving sandbox.strummachine.com on fada5313
2021-01-29T19:31:38.371Z fada5313 sjc [info] Shutting down virtual machine
2021-01-29T19:31:38.449Z fada5313 sjc [info] Sending signal SIGTERM to main child process w/ PID 503
2021-01-29T19:31:39.450Z fada5313 sjc [info] Reaped child process with pid: 503 and signal: SIGTERM
2021-01-29T19:31:41.262Z fada5313 sjc [info] Health check status changed to 'critical'

Since the last line was Health check status changed to 'critical', you’d think flyctl status would show that the deployment failed, but no, it says all is peachy:

App
  Name     = sm-sandbox
  Owner    = strum-machine
  Version  = 23
  Status   = running
  Hostname = sm-sandbox.fly.dev

Deployment Status
ID = ee97ee0b-01e3-2e73-42ee-db4775576569
Version = v23
Status = running
Description = Deployment is running
Instances = 1 desired, 1 placed, 0 healthy, 0 unhealthy

Instances
ID VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
c7cacca6 23 sjc run running 1 total, 1 passing 0 2021-01-28T19:39:23Z

Also, when I retry flyctl deploy, even half an hour after my last attempt, it says the previous deployment is being cancelled due to a newer version of the job; in other words, the deployment never completes…

Any ideas?

@luke could you try deploying again when you have a moment?

Yup, deploying in 3… 2… 1…

This is the part that makes no sense to me (horizontally condensed for clarity):

20:18:58 99b35024 sjc Health check status changed to 'critical'
20:20:17 99b35024 sjc Shutting down virtual machine
20:20:17 99b35024 sjc Sending signal SIGTERM to main child process w/ PID 503
20:20:18 99b35024 sjc Reaped child process with pid: 503 and signal: SIGTERM
20:20:21 99b35024 sjc Health check status changed to 'passing'
20:20:21 99b35024 sjc Starting instance

It goes critical (after the app server successfully boots up, mind you), the VM gets shut down, and only then does the health check status flip to “passing”…

Also, I saw these logs output by flyctl deploy:

Recent Events
TIMESTAMP            TYPE             MESSAGE
2021-01-29T20:18:13Z Received         Task received by client
2021-01-29T20:18:13Z Task Setup       Building Task Directory
2021-01-29T20:18:16Z Started          Task started by client
2021-01-29T20:20:17Z Restart Signaled healthcheck: check "0b4733c6715f665999544c99ce4a5d9c" unhealthy
2021-01-29T20:20:20Z Terminated       Exit Code: -1
2021-01-29T20:20:20Z Restarting       Task restarting in 1.116502702s
2021-01-29T20:20:22Z Started          Task started by client
2021-01-29T20:22:23Z Restart Signaled healthcheck: check "0b4733c6715f665999544c99ce4a5d9c" unhealthy
2021-01-29T20:22:26Z Terminated       Exit Code: -1
2021-01-29T20:22:26Z Restarting       Task restarting in 1.114479924s
2021-01-29T20:22:28Z Started          Task started by client
2021-01-29T20:23:13Z Alloc Unhealthy  Task not running for min_healthy_time of 10s by deadline
2021-01-29T20:23:14Z Killing          Sent interrupt. Waiting 3s before force killing

I wonder what min_healthy_time means…

Yeah, something is funky with these checks. They report a critical state, which causes the VM to restart, and after a few restarts the deployment hangs – an unrelated issue we’re still investigating.

I think we need to tweak your checks. Kurt added a grace_period option that you can use to wait for your app to boot before running checks, but I need to do one thing before you can use it. Give me a few minutes for that… I also noticed that your app is using ~100% of its 256MB of RAM. I don’t know what kind of app it is, but low memory might be causing an issue too.

It means the VM never stayed healthy for at least 10 consecutive seconds before the deployment deadline was reached, which I think is a rolling 3 minutes.

It was working with 256MB before – it uses close to 200MB normally – but I bumped it up to 1GB to see if that helped. Same result.

Removed the health check and the app booted up just fine in under 10 seconds. That made me think that maybe my health-check timeout (30s) was too long? Anyway, it does seem like this was a health-check-related thing. That grace_period option will be great to have, thanks!

@Luke grace_period is available now. Try something like this in your checks to delay them long enough for your app to start and see how a deploy does.

  [[services.tcp_checks]]
    interval = 10000      # time between checks, in milliseconds
    timeout = 2000        # per-check timeout, in milliseconds
    grace_period = "1m"   # failing checks during the first minute after boot are ignored

Wonderful! I appreciate that health checks start right away but failures are ignored during the grace period. That’s even better than my previous workaround, which was to delay the start of health checks until I could be sure the app was ready – needlessly slowing rollouts.

:+1::+1:

Well, that’s what it seemed like was happening. But maybe it just wasn’t respecting values of "30s" or 30000, because when I changed it to "1m", it now seems to wait a full minute before shutting down the server:

2021-01-29T21:43:44.778Z b8622719 sjc [info] Starting instance
2021-01-29T21:43:44.798Z b8622719 sjc [info] Configuring virtual machine
2021-01-29T21:43:44.799Z b8622719 sjc [info] Pulling container image
2021-01-29T21:43:45.104Z b8622719 sjc [info] Unpacking image
2021-01-29T21:43:45.107Z b8622719 sjc [info] Preparing kernel init
2021-01-29T21:43:45.424Z b8622719 sjc [info] Configuring firecracker
2021-01-29T21:43:45.501Z b8622719 sjc [info] Starting virtual machine
2021-01-29T21:43:45.617Z b8622719 sjc [info] Starting init (commit: 8bf1b31)...
2021-01-29T21:43:45.626Z b8622719 sjc [info] Running: `docker-entrypoint.sh node main.js` as root
2021-01-29T21:43:47.789Z b8622719 sjc [info] Monti APM: Successfully connected
2021-01-29T21:43:52.287Z b8622719 sjc [info] Starting Meteor server on b8622719
2021-01-29T21:43:53.508Z b8622719 sjc [info] Monti APM: completed instrumenting the app
2021-01-29T21:43:54.286Z b8622719 sjc [info] Strum Machine v332.3 now serving sandbox.strummachine.com on b8622719
(Note that one minute passes...)
2021-01-29T21:44:51.456Z b8622719 sjc [info] Shutting down virtual machine
2021-01-29T21:44:51.539Z b8622719 sjc [info] Sending signal SIGTERM to main child process w/ PID 504
2021-01-29T21:44:52.541Z b8622719 sjc [info] Reaped child process with pid: 504 and signal: SIGTERM
2021-01-29T21:44:54.828Z b8622719 sjc [info] Starting instance
(And the cycle repeats...)

So, a) grace_period would work better if it allowed for early successes, i.e. if a passing check ended the grace period early (though that’s opinionated and I could be wrong on that), and b) it’s still failing health checks.

Here’s my health check config (I’m only using HTTP, not TCP):

  [[services.http_checks]]
    interval = 5000
    method = "get"
    path = "/healthz"
    protocol = "http"
    timeout = 2000
    tls_skip_verify = false
    restart_limit = 2
    grace_period = "1m"

And yes, /healthz works – here it is on the live server. And setting tls_skip_verify to true had no effect.

Here’s the output we got from that health check on VM b8622719:

"Get \"http://172.19.4.18:80/healthz\": dial tcp 172.19.4.18:80: connect: connection refused"

Seems like we can’t actually connect to the app. What does the rest of the services section in your config look like?

Huh, interesting! So, first of all, I tweaked my health-check settings to this:

  [[services.http_checks]]
    interval = 20000
    method = "get"
    path = "/healthz"
    protocol = "http"
    timeout = 15000
    tls_skip_verify = true
    restart_limit = 2
    grace_period = "15s"

And then did several successful deployments in a row. So I’m not sure why that worked when your manual check didn’t?

Here’s my full fly.toml:
app = "sm-sandbox"

kill_signal = "SIGTERM"
kill_timeout = 2

[[services]]
  internal_port = 80
  protocol = "tcp"

  [services.concurrency]
    hard_limit = 500
    soft_limit = 100

  [[services.ports]]
    handlers = ["http"]
    port = "80"

  [[services.ports]]
    handlers = ["tls", "http"]
    port = "443"

  [[services.http_checks]]
    interval = 20000
    method = "get"
    path = "/healthz"
    protocol = "http"
    timeout = 15000
    tls_skip_verify = true
    restart_limit = 2
    grace_period = "15s"

It may have been that the VM closed before you got a chance to get a response. There’s currently an active VM, 27bc369f, and I imagine it would succeed with that one.

Looks like I have a config that’s working now. Thanks for all your help and thanks again to @kurt for implementing grace_period!

You can see the check status yourself with flyctl checks list, or a bunch more VM info with flyctl status instance <id>.
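
For example, something like this (using the app name and a VM ID from this thread):

  flyctl checks list -a sm-sandbox
  flyctl status instance 27bc369f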

I’m very happy you got it working. This was a doozie :smiley: