This may be related to the 525 errors (Cloudflare 525 error randomly occurs - #3 by greg), if kurt or someone on the team is working on the proxy. A few moments ago I got a notification that another app was failing, a different one from the app in that thread, this time with 502 errors.
And when I look at the app status, all is well. Has one instance, running in ams. I haven’t touched it for ages, no deploys or changes.
But when I look at the logs, with flyctl logs, I see pages of errors where it looks like the request arriving in lhr, from me or a healthcheck, is failing:
2021-06-30T00:22:34.487955794Z proxy lhr [error] error.code=1002 error.message="No suitable (healthy) instance found to handle request" request.method="HEAD" request.url="/admin/signin" request.id="01F9D4NTQM2PJX6XF2SW08A786" response.status=502
That’s from a healthcheck test URL which should respond and work. I can’t connect to it either; requests time out for me in the browser too.
Hmm … not good. I can try making two instances as a temporary fix? Or give it a restart? Shouldn’t need to but wondered if you were changing anything currently?
I didn’t touch anything. Not sure if you did anything at your end? Still showing the same one instance in the status, in ams(B). Doesn’t appear to have been replaced.
I had a load of errors in the log, e.g.:
2021-06-30T00:22:04.825421114Z proxy lhr [error] error.code=1002 error.message="No suitable (healthy) instance found to handle request" request.method="GET" request.url="/favicon.ico" request.id="01F9D4MXT37EHASSBNSMTM1QWC" response.status=502
And indeed it did not work.
… but now I’m seeing 200s again. And can connect to the app in browser too.
Don’t know if those request IDs shed any light at your end, whether this is connected to the proxy/525/restart, or just a coincidence?
But it’s working again now. Down for maybe 7 minutes.
And yep, got a 502 Cloudflare error. So that 525 does seem SSL related, and so not related to the app itself.
Ah, if it helps, it does appear to be proxy-related. Scrolling through the logs, I see e.g.:
2021-06-30T00:22:01.955175043Z app[8b2d659a] ams [info] GET /healthcheck 200 16 - 0.632 ms
2021-06-30T00:22:04.825421114Z proxy lhr [error] error.code=1002 error.message="No suitable (healthy) instance found to handle request" request.method="GET" request.url="/favicon.ico" request.id="01F9D4MXT37EHASSBNSMTM1QWC" response.status=502
2021-06-30T00:22:06.961117721Z app[8b2d659a] ams [info] GET /healthcheck 200 16 - 2.582 ms
… and that /healthcheck is what I have in fly.toml. So the app itself was working the whole time and reporting 200 to that internal check, which would explain why your system did not auto-replace it (I assume that would happen on a healthcheck failure?).
But the outside world could not connect to it as the proxy was reporting no instance was found to serve the request. Which is a problem.
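For anyone wanting to check the same thing in their own logs, here is a rough Python sketch that separates proxy errors from app healthcheck responses. It assumes the line layout seen in the samples pasted in this thread (timestamp, source, region, level, then the message), which is not a documented format; `parse_line` and `summarize` are names made up for illustration.

```python
import re
import shlex

# Sample lines copied from this thread.
LOG_LINES = [
    '2021-06-30T00:22:01.955175043Z app[8b2d659a] ams [info] GET /healthcheck 200 16 - 0.632 ms',
    '2021-06-30T00:22:04.825421114Z proxy lhr [error] error.code=1002 '
    'error.message="No suitable (healthy) instance found to handle request" '
    'request.method="GET" request.url="/favicon.ico" '
    'request.id="01F9D4MXT37EHASSBNSMTM1QWC" response.status=502',
    '2021-06-30T00:22:06.961117721Z app[8b2d659a] ams [info] GET /healthcheck 200 16 - 2.582 ms',
]

# Assumed layout: "<timestamp> <source> <region> [<level>] <body>"
LINE_RE = re.compile(
    r'^(?P<ts>\S+) (?P<source>\S+) (?P<region>\S+) \[(?P<level>\w+)\] (?P<body>.*)$'
)


def parse_line(line):
    """Split one log line into its parts; return None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["fields"] = {}
    if record["source"] == "proxy":
        # Proxy bodies look like key=value pairs, with quoted values.
        for token in shlex.split(record["body"]):
            key, sep, value = token.partition("=")
            if sep:
                record["fields"][key] = value
    return record


def summarize(lines):
    """Collect proxy errors and count app healthchecks that returned 200."""
    proxy_errors = []
    app_health_ok = 0
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue
        if record["source"] == "proxy" and record["level"] == "error":
            fields = record["fields"]
            proxy_errors.append(
                (record["region"], fields.get("response.status"), fields.get("request.url"))
            )
        elif record["source"].startswith("app[") and " /healthcheck 200 " in record["body"]:
            app_health_ok += 1
    return proxy_errors, app_health_ok


if __name__ == "__main__":
    errors, healthy = summarize(LOG_LINES)
    print("proxy errors:", errors)
    print("app healthchecks returning 200:", healthy)
```

If proxy errors and 200 healthchecks show up in the same time window, as they do in the three lines above, that points at the routing layer rather than the app itself.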
Like I say, I didn’t intervene. Was wondering whether to, but didn’t seem like anything I could do given the app instance was saying it was healthy. And I haven’t changed it.
It seems like the wobble was at your proxy, which connects it to the outside world? I noticed errors for lhr and sea, but there may have been others. The proxy was reporting no instance to connect to, even though there was one.
Yeah, this was on our end. Still trying to figure out what happened, but right now it seems a network issue caused Consul to lose a leader for a few minutes, which then caused bad state in several other services, including our proxy.
I wasn’t sure whether to report the issue but didn’t know how long it would last, naturally.
I guess in this case even having more VMs in different regions wouldn’t have helped, which was my other thought, since the proxy controls access to them. So it’s all dependent on that.