Dead app: Data Transfer In has stopped

We had an app that runs only in ORD go down over the last hour, with no internal app errors being thrown. Both health checks were passing the whole time, yet the app was not accessible from the outside world.

I wish I had more information to report, but everything seemed to be in good standing; clients simply could no longer connect. The only change I could see at the time of the downtime was that the app's metrics stopped reporting data transfer into the app.

This is a very simple app that does not connect to any DBs. It is just a single endpoint for our i18n translations that, as you could imagine, responds with a cached payload 99.9% of the time.

After a simple fly restart we were back up and running. We have since scaled the app up to multiple regions, as it doesn’t need a connection to our DB.

App
  Name     = better-cart-theater-of-magic-prod
  Owner    = better-cart
  Version  = 24
  Status   = running
  Hostname = better-cart-theater-of-magic-prod.fly.dev

Services
PROTOCOL PORTS
TCP      80 => 3000 [HTTP]
         443 => 3000 [TLS, HTTP]

IP Adresses
TYPE ADDRESS          REGION CREATED AT
v4   213.188.208.37          2021-10-30T22:10:36Z
v6   2a09:8280:1::ef0        2021-10-30T22:10:36Z

Thanks!

What does fly status --all show?

App
  Name     = better-cart-theater-of-magic-prod
  Owner    = better-cart
  Version  = 24
  Status   = running
  Hostname = better-cart-theater-of-magic-prod.fly.dev

Deployment Status
  ID          = 9bc06492-4014-b3aa-ae7a-82d76c1112b4
  Version     = v24
  Status      = successful
  Description = Deployment completed successfully
  Instances   = 3 desired, 3 placed, 3 healthy, 0 unhealthy

Instances
ID              PROCESS VERSION REGION  DESIRED STATUS          HEALTH CHECKS           RESTARTS        CREATED
7084e779        app     24 ⇡    ord     run     running         2 total, 2 passing      0               16m44s ago
e7c57e90        app     24 ⇡    iad     run     running         2 total, 2 passing      0               17m48s ago
fe421d9d        app     24 ⇡    lax     run     running         2 total, 2 passing      0               18m23s ago
1cc2fba8        app     23      ord     stop    complete        2 total, 2 passing      0               18m37s ago
83dc15aa        app     23      ord     stop    complete        2 total, 2 passing      0               19m32s ago
e24c4ccc        app     22      ord     stop    complete        2 total, 2 passing      1               2022-01-05T18:15:22Z

Have you deployed since the restart? It looks like e24c4ccc might have crashed at least once. You can run this to see more information about what happened:

fly vm status e24c4ccc

Yup, we scaled up the service to multiple regions to avoid any issues in the future.

As for e24c4ccc this was the original single VM running that I manually restarted to get the service back up.

Instance
  ID            = e24c4ccc
  Process       =
  Version       = 22
  Region        = ord
  Desired       = stop
  Status        = complete
  Health Checks = 2 total, 2 passing
  Restarts      = 1
  Created       = 2022-01-05T18:15:22Z

Recent Events
TIMESTAMP            TYPE             MESSAGE
2022-01-05T18:15:17Z Received         Task received by client
2022-01-05T18:15:29Z Task Setup       Building Task Directory
2022-01-05T18:15:51Z Started          Task started by client
2022-03-11T19:58:09Z Restart Signaled User requested restart
2022-03-11T19:58:59Z Terminated       Exit Code: 130
2022-03-11T19:58:59Z Restarting       Task restarting in 0s
2022-03-11T19:59:32Z Started          Task started by client
2022-03-11T20:04:05Z Killing          Sent interrupt. Waiting 5s before force killing
2022-03-11T20:04:22Z Terminated       Exit Code: 130
2022-03-11T20:04:22Z Killed           Task successfully killed

Checks
ID                               SERVICE  STATE   OUTPUT
a2098605836ef8e97413353200e3880b tcp-3000 passing HTTP GET http://172.19.2.50:3000/: 200 OK Output: OK
2c049ace173aa212ee0332a9b0a966d5 tcp-3000 passing TCP connect 172.19.2.50:3000: Success

Oh so the VM was still there and running, but not responsive? Are you using HTTP health checks or TCP health checks?

If you enable an HTTP health check it might catch this next time. Then you can add something like this to your check to make it restart automatically when the check fails:

restart_limit = 6

Correct, like I said, it really did seem like everything was okay and in good working order, but clients were not able to access the app from the outside world.

Can you explain the restart_limit configuration further? I can’t seem to find it in the docs here: App Configuration (fly.toml)

Thanks!

  [[services.tcp_checks]]
    grace_period = "1s"
    interval = "15s"
    restart_limit = 0
    timeout = "2s"

  [[services.http_checks]]
    interval = 10000
    grace_period = "5s"
    method = "get"
    path = "/"
    protocol = "http"
    timeout = 2000
    tls_skip_verify = false

The restart_limit setting restarts the process after that number of check failures.

Gotcha! So just add this under the [[services.http_checks]] then?
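
For illustration, here is a minimal sketch of what that could look like. It assumes restart_limit is accepted under [[services.http_checks]] the same way it already appears in the [[services.tcp_checks]] block above. The value 6 is just the number suggested earlier in the thread, and every other key is copied unchanged from the posted config. The fragment belongs inside the existing services section of fly.toml.

```toml
  [[services.http_checks]]
    interval = 10000
    grace_period = "5s"
    method = "get"
    path = "/"
    protocol = "http"
    # Assumption: restart the VM once this many checks have failed,
    # mirroring the key already used in the tcp_checks block.
    restart_limit = 6
    timeout = 2000
    tls_skip_verify = false
```

After redeploying, fly status --all should still report 2 checks per instance. Whether a failing HTTP check actually triggers a restart is worth verifying against the current fly.toml reference before relying on it.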