I have a few apps. The ones only in LHR don’t seem to be responding.
I have one in LHR and IAD (2 instances). All requests seem to be handled by IAD.
I have one in LHR and CDG (2 instances). All requests seem to be handled by CDG.
e.g. looking at the logs, this stands out:
2022-01-05T22:07:44.838 proxy[13e0beb0] lhr [error] Error 2: Internal problem
… and since then, the app (a one-instance app, in LHR) seems dead. Logs stopped about 90 minutes ago, and requests now get a 525 error (Cloudflare is in front and can't contact the app, so it returns a 525 - the SSL part of that error is misleading).
No volumes are attached to any of the apps, and there are no databases. They are simple Node.js apps.
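For context, each one is roughly this shape - a simplified sketch rather than the real code (the handler and logging here are just stand-ins):

// Simplified sketch of one of these apps - not the real code.
// Each app is a small HTTP server listening on 0.0.0.0:8080, the
// internal port the Fly proxy routes to (and that its TCP health
// check connects to).
const http = require("http");

const server = http.createServer((req, res) => {
  // Trivial handler: return a small JSON body for any route, e.g. POST /TEST.
  const body = JSON.stringify({ ok: true, path: req.url });
  res.writeHead(200, { "Content-Type": "application/json" });
  res.end(body);

  // Minimal request logging, roughly the shape of the lines in fly logs.
  console.log(`${req.method} ${req.url} 200 ${Buffer.byteLength(body)}`);
});

server.listen(8080, "0.0.0.0", () => {
  console.log("listening on 0.0.0.0:8080");
});

Nothing exotic: no background workers, no disk, just an HTTP listener behind the Fly proxy (with Cloudflare in front of that).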
I looked at the metrics and they all look fine: minimal load, no spikes.
I looked at the activity tab, and every app seems to return a 500 error there. But that happens even with the apps that are working, so I assume it's not connected.
Yeah, LHR appears to be the common factor in the errors. I haven't deployed for days and there are no code changes, so nothing on my end should have broken them.
fly --app NAME status
The two-instance app:
Instances
ID        PROCESS  VERSION  REGION  DESIRED  STATUS   HEALTH CHECKS        RESTARTS  CREATED
2e15c859  app      218      lhr     run      pending  1 total, 1 critical  0         2022-01-01T21:20:23Z
5328b10e  app      218      cdg(B)  run      running  1 total, 1 passing   0         2022-01-01T21:20:23Z
Its sibling one-instance app:
Instances
ID        PROCESS  VERSION  REGION  DESIRED  STATUS   HEALTH CHECKS        RESTARTS  CREATED
13e0beb0  app      395      lhr     run      running  1 total, 1 critical  0         2022-01-01T21:06:38Z
It's not clear why the health check is failing. The app just serves a simple HTTP response, and I've not changed anything. Maybe it's the host it's running on? Weird.
e.g. I saw this line, and I haven't changed anything for days at my end:
2022-01-05T22:09:26.515 proxy[13e0beb0] lhr [error]Health check status changed 'passing' => 'critical'
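(For reference, I believe the check in question is the standard TCP check from the services section of fly.toml - mine is close to the generated default, reproduced here from memory, so treat the exact values as approximate:)

[[services]]
  internal_port = 8080
  protocol = "tcp"

  [[services.tcp_checks]]
    grace_period = "1s"
    interval = "15s"
    restart_limit = 0
    timeout = "2s"

So the check only needs a successful TCP connect to port 8080; there's no HTTP request involved.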
So I restarted it manually to see what would happen and, sure enough, a new log line appeared; it resumed and is now working again.
...
2022-01-05T22:10:00.989 app[13e0beb0] lhr [info]POST /TEST 200 96 - 23.033 ms
2022-01-05T23:41:58.328 runner[13e0beb0] lhr [info]Shutting down virtual machine
...
So it was down for 90 minutes or so, based on that.
Based on that failure, would it have been moved to another region if I'd had a backup region set? Or what would trigger that? (I had removed the backup region.)
We’re doing emergency maintenance on a couple of hosts in London. Are you using volumes on these apps? If you’re not using volumes, your VMs should get rescheduled. The health check failures probably aren’t caused by this, though.
If you run fly vm status 13e0beb0 on the one-instance app, what does it say about health checks?
The output for that command is interesting. It does point to an issue: there's a Failed Restoring event ('Task failed to restore task') at that time.
I manually did a restart 90 minutes later and that worked fine, without changing any code. The health checks passed and are fine now, which suggests it was the host.
Instance
ID = 13e0beb0
Process =
Version = 395
Region = lhr
Desired = run
Status = running
Health Checks = 1 total, 1 passing
Restarts = 1
Created = 2022-01-01T21:06:38Z
Recent Events
TIMESTAMP             TYPE              MESSAGE
2022-01-01T21:06:31Z  Received          Task received by client
2022-01-01T21:06:31Z  Task Setup        Building Task Directory
2022-01-01T21:06:35Z  Started           Task started by client
2022-01-05T22:07:38Z  Failed Restoring  Task failed to restore task; will not run until server is contacted
2022-01-05T22:08:20Z  Started           Task started by client
2022-01-05T23:41:58Z  Restart Signaled  User requested restart
2022-01-05T23:41:59Z  Terminated        Exit Code: 0
2022-01-05T23:41:59Z  Restarting        Task restarting in 0s
2022-01-05T23:42:07Z  Started           Task started by client
Checks
ID                                SERVICE   STATE    OUTPUT
50f1daa51d97c995477d21c1927c1a24  tcp-8080  passing  TCP connect 172.19.0.34:8080: Success
No problem! Thank you for letting us know; I'm not sure we would have noticed this otherwise. It's a pretty quick fix on our end: we just need to ensure that VMs are rescheduled before we take host hardware down.