Dead app: Data Transfer In has stopped

danwetherald · March 11, 2022, 8:13pm

We had an app that was running only in ORD go down over the last hour, with no internal app errors being thrown. Both health checks were passing the whole time, but the app was not accessible from the outside world.

I wish I had more information to report, but everything seemed to be in good standing; clients just stopped being able to connect. The only change I could see at the time of the downtime was that the app's metrics stopped reporting data transfer in.

This is a very simple app that does not connect to any DBs, just a simple endpoint for our i18n translations that, as you can imagine, responds with a cached payload 99.9% of the time.

After a simple fly restart we were back up and running - we have since scaled the app up to multiple regions as it doesn’t need a connection to our DB.

App
  Name     = better-cart-theater-of-magic-prod
  Owner    = better-cart
  Version  = 24
  Status   = running
  Hostname = better-cart-theater-of-magic-prod.fly.dev

Services
PROTOCOL PORTS
TCP      80 => 3000 [HTTP]
         443 => 3000 [TLS, HTTP]

IP Adresses
TYPE ADDRESS          REGION CREATED AT
v4   213.188.208.37          2021-10-30T22:10:36Z
v6   2a09:8280:1::ef0        2021-10-30T22:10:36Z

Thanks!

kurt · March 11, 2022, 8:18pm

What does fly status --all show?

danwetherald · March 11, 2022, 8:23pm

App
  Name     = better-cart-theater-of-magic-prod
  Owner    = better-cart
  Version  = 24
  Status   = running
  Hostname = better-cart-theater-of-magic-prod.fly.dev

Deployment Status
  ID          = 9bc06492-4014-b3aa-ae7a-82d76c1112b4
  Version     = v24
  Status      = successful
  Description = Deployment completed successfully
  Instances   = 3 desired, 3 placed, 3 healthy, 0 unhealthy

Instances
ID              PROCESS VERSION REGION  DESIRED STATUS          HEALTH CHECKS           RESTARTS        CREATED
7084e779        app     24 ⇡    ord     run     running         2 total, 2 passing      0               16m44s ago
e7c57e90        app     24 ⇡    iad     run     running         2 total, 2 passing      0               17m48s ago
fe421d9d        app     24 ⇡    lax     run     running         2 total, 2 passing      0               18m23s ago
1cc2fba8        app     23      ord     stop    complete        2 total, 2 passing      0               18m37s ago
83dc15aa        app     23      ord     stop    complete        2 total, 2 passing      0               19m32s ago
e24c4ccc        app     22      ord     stop    complete        2 total, 2 passing      1               2022-01-05T18:15:22Z

kurt · March 11, 2022, 8:26pm

Have you deployed since the restart? It looks like e24c4ccc might have crashed at least once. You can run this to see more information about what happened:

fly vm status e24c4ccc

danwetherald · March 11, 2022, 8:28pm

Yup, we scaled up the service to multiple regions to avoid any issues in the future.

As for e24c4ccc, this was the original single VM, which I manually restarted to get the service back up.

Instance
  ID            = e24c4ccc
  Process       =
  Version       = 22
  Region        = ord
  Desired       = stop
  Status        = complete
  Health Checks = 2 total, 2 passing
  Restarts      = 1
  Created       = 2022-01-05T18:15:22Z

Recent Events
TIMESTAMP            TYPE             MESSAGE
2022-01-05T18:15:17Z Received         Task received by client
2022-01-05T18:15:29Z Task Setup       Building Task Directory
2022-01-05T18:15:51Z Started          Task started by client
2022-03-11T19:58:09Z Restart Signaled User requested restart
2022-03-11T19:58:59Z Terminated       Exit Code: 130
2022-03-11T19:58:59Z Restarting       Task restarting in 0s
2022-03-11T19:59:32Z Started          Task started by client
2022-03-11T20:04:05Z Killing          Sent interrupt. Waiting 5s before force killing
2022-03-11T20:04:22Z Terminated       Exit Code: 130
2022-03-11T20:04:22Z Killed           Task successfully killed

Checks
ID                               SERVICE  STATE   OUTPUT
a2098605836ef8e97413353200e3880b tcp-3000 passing HTTP GET http://172.19.2.50:3000/: 200 OK Output: OK
2c049ace173aa212ee0332a9b0a966d5 tcp-3000 passing TCP connect 172.19.2.50:3000: Success

kurt · March 11, 2022, 8:30pm

Oh so the VM was still there and running, but not responsive? Are you using HTTP health checks or TCP health checks?

If you enable an HTTP health check it might catch this next time. Then you can add something like this to your check to make it restart automatically when the check fails:

restart_limit = 6

danwetherald · March 11, 2022, 8:33pm

Correct, like I said, it really did seem like everything was okay and in good working order, but clients were not able to access the app from the outside world.

Can you explain the restart_limit configuration further? I can’t seem to find it in the docs here: App Configuration (fly.toml)

Thanks!

  [[services.tcp_checks]]
    grace_period = "1s"
    interval = "15s"
    restart_limit = 0
    timeout = "2s"

  [[services.http_checks]]
    interval = 10000
    grace_period = "5s"
    method = "get"
    path = "/"
    protocol = "http"
    timeout = 2000
    tls_skip_verify = false

kurt · March 11, 2022, 8:35pm

That’ll restart the process after X number of failures.

danwetherald · March 11, 2022, 8:37pm

Gotcha! So just add this under the [[services.http_checks]] then?
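
For reference, here is a minimal sketch of the HTTP check from the config above with the suggested setting added. It assumes restart_limit is accepted under [[services.http_checks]] in the same way it already appears under [[services.tcp_checks]] in this config. The value 6 is the one kurt suggested, and every other value is copied unchanged from the posted config.

```toml
  [[services.http_checks]]
    interval = 10000
    grace_period = "5s"
    method = "get"
    path = "/"
    protocol = "http"
    # Assumption: per kurt's description, restart the process after 6 failed checks
    restart_limit = 6
    timeout = 2000
    tls_skip_verify = false
```

If this works as described, a VM that stops answering GET / would be restarted automatically instead of needing a manual fly restart. Whether it would have caught this particular outage is unclear, since both checks were still reported as passing while the app was unreachable from outside.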
