Within the last hour, an app of ours that was running only in ORD went down, with no internal app errors being thrown. Both health checks were passing, but the app was not accessible from the outside world.
I wish I had more information to report, but everything seemed to be in good standing; clients just stopped being able to connect. The only change I noticed at the time of the downtime was that the app's metrics stopped reporting data transfer into the app.
This is a very simple app that does not connect to any DBs. It is just a simple endpoint for our i18n translations that, as you can imagine, responds with a cached payload 99.9% of the time.
After a simple fly restart we were back up and running. We have since scaled the app up to multiple regions, as it doesn't need a connection to our DB.
App
Name = better-cart-theater-of-magic-prod
Owner = better-cart
Version = 24
Status = running
Hostname = better-cart-theater-of-magic-prod.fly.dev
Services
PROTOCOL PORTS
TCP      80 => 3000 [HTTP]
         443 => 3000 [TLS, HTTP]
IP Adresses
TYPE ADDRESS          REGION CREATED AT
v4   213.188.208.37          2021-10-30T22:10:36Z
v6   2a09:8280:1::ef0        2021-10-30T22:10:36Z
App
Name = better-cart-theater-of-magic-prod
Owner = better-cart
Version = 24
Status = running
Hostname = better-cart-theater-of-magic-prod.fly.dev
Deployment Status
ID = 9bc06492-4014-b3aa-ae7a-82d76c1112b4
Version = v24
Status = successful
Description = Deployment completed successfully
Instances = 3 desired, 3 placed, 3 healthy, 0 unhealthy
Instances
ID       PROCESS VERSION REGION DESIRED STATUS   HEALTH CHECKS      RESTARTS CREATED
7084e779 app     24 ⇡    ord    run     running  2 total, 2 passing 0        16m44s ago
e7c57e90 app     24 ⇡    iad    run     running  2 total, 2 passing 0        17m48s ago
fe421d9d app     24 ⇡    lax    run     running  2 total, 2 passing 0        18m23s ago
1cc2fba8 app     23      ord    stop    complete 2 total, 2 passing 0        18m37s ago
83dc15aa app     23      ord    stop    complete 2 total, 2 passing 0        19m32s ago
e24c4ccc app     22      ord    stop    complete 2 total, 2 passing 1        2022-01-05T18:15:22Z
Have you deployed since the restart? It looks like e24c4ccc might have crashed at least once. You can run this to see more information about what happened:
Yup, we scaled up the service to multiple regions to avoid any issues in the future.
As for e24c4ccc, this was the original single VM, which I manually restarted to get the service back up.
Instance
ID = e24c4ccc
Process =
Version = 22
Region = ord
Desired = stop
Status = complete
Health Checks = 2 total, 2 passing
Restarts = 1
Created = 2022-01-05T18:15:22Z
Recent Events
TIMESTAMP            TYPE             MESSAGE
2022-01-05T18:15:17Z Received         Task received by client
2022-01-05T18:15:29Z Task Setup       Building Task Directory
2022-01-05T18:15:51Z Started          Task started by client
2022-03-11T19:58:09Z Restart Signaled User requested restart
2022-03-11T19:58:59Z Terminated       Exit Code: 130
2022-03-11T19:58:59Z Restarting       Task restarting in 0s
2022-03-11T19:59:32Z Started          Task started by client
2022-03-11T20:04:05Z Killing          Sent interrupt. Waiting 5s before force killing
2022-03-11T20:04:22Z Terminated       Exit Code: 130
2022-03-11T20:04:22Z Killed           Task successfully killed
Checks
ID                               SERVICE  STATE   OUTPUT
a2098605836ef8e97413353200e3880b tcp-3000 passing HTTP GET http://172.19.2.50:3000/: 200 OK Output: OK
2c049ace173aa212ee0332a9b0a966d5 tcp-3000 passing TCP connect 172.19.2.50:3000: Success
Oh so the VM was still there and running, but not responsive? Are you using HTTP health checks or TCP health checks?
If you enable an HTTP health check it might catch this next time. Then you can add something like this to your check to make it restart automatically when the check fails:
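A minimal sketch of what such a check could look like in fly.toml, assuming the app serves HTTP on internal port 3000 at the path `/` (as the check output above suggests). The interval, timeout, grace period, and restart_limit values are illustrative assumptions, not recommendations, and the restart_limit semantics in the comment are as I understand them, so verify them against the current fly.toml reference:

```toml
# Hypothetical fragment of fly.toml; all values below are illustrative assumptions.
[[services]]
  internal_port = 3000
  protocol = "tcp"

  [[services.http_checks]]
    interval = 10000      # milliseconds between checks
    timeout = 2000        # milliseconds before a check attempt counts as failed
    grace_period = "5s"   # time allowed for the app to boot before checks begin
    method = "get"
    path = "/"
    protocol = "http"
    restart_limit = 3     # consecutive failures allowed before the VM is restarted; 0 disables restarts
```

Note that a check like this only runs against the VM's internal address, so it can restart an unresponsive process but may not detect a problem between the edge and the VM.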
Correct, like I said, it really did seem like everything was okay and in good working order, but clients were not able to access the app from the outside world.
Can you explain the restart_limit configuration further? I can't seem to find it in the docs here: App Configuration (fly.toml)