I have been getting a lot of in-app request timeout errors (when my app makes an outgoing http request to another URL) and have been trying to debug whether it is caused by Cloudflare/Fly/DNS etc. I added some experiments, tried to deploy them in an updated app … and have hit another error. Now the deploy itself is not working. Hmm.
So I tried destroying the remote builder to see if it was a fault with the prior one, but no, I get the same error with the new one too. It appears to be an issue with how the remote builder authenticates with Docker, which would be out of my control:
==> Creating build context
--> Creating build context done
==> Building image with Docker
--> docker host: 20.10.12 linux x86_64
Sending build context to Docker daemon 546kB
[+] Building 10.6s (3/3) FINISHED
=> [internal] load remote build context 0.0s
=> copy /context / 0.1s
=> ERROR [internal] load metadata for docker.io/library/node:16.13-slim 10.4s
> [internal] load metadata for docker.io/library/node:16.13-slim:
Error failed to fetch an image or build from source: error building: failed to solve with frontend dockerfile.v0: failed to create LLB definition: failed to authorize: failed to fetch anonymous token: Get "https://auth.docker.io/token?scope=repository%3Alibrary%2Fnode%3Apull&service=registry.docker.io": net/http: TLS handshake timeout
It’s interesting that this is a timeout too … since I’ve also been getting timeouts from my Fly-hosted apps, I wonder if the issues are related. Maybe some networking/DNS issue resolving external domains from a Fly app?
Ah. Interestingly I just did another deploy (without changing anything) and that one worked. So the timeout was not an issue then. That deploy completed.
That fits with what I’ve been finding with my own app.
I don’t touch anything, and external requests (same domain, headers, etc - all identical) randomly stop responding within the timeout, even when it’s set high, like 20s. The incoming request to the app works fine (like a healthcheck). It’s when the app makes an outgoing request, which it/Fly has to resolve, that it times out. But … not consistently.
Then, X minutes later, without my touching anything, the same requests stop timing out and work fine. Weird.
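For reference, the failing pattern can be reproduced with a small sketch like the one below; it logs how long an outgoing request took before succeeding or aborting, which helps tell a hard hang from a merely slow response. The URL and the timeout budget are placeholders, not the actual app’s code, and it assumes Node 18+ (where `fetch` and `AbortController` are built in; the thread’s image is Node 16, where you’d need a fetch polyfill such as node-fetch):

```javascript
// Sketch of an outgoing sub-request with an explicit timeout, as described
// above. The URL is a placeholder; the real app fetched its own upstream.
// Requires Node 18+ for the global fetch.
async function fetchWithTimeout(url, timeoutMs) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  const started = Date.now();
  try {
    const res = await fetch(url, { signal: controller.signal });
    console.log(`${url}: ${res.status} after ${Date.now() - started}ms`);
    return res;
  } catch (err) {
    // An AbortError here is the "didn't return within 20s" case from the thread.
    console.error(`${url}: ${err.name} after ${Date.now() - started}ms`);
    throw err;
  } finally {
    clearTimeout(timer);
  }
}
```

When the timeout fires, the fetch rejects with an `AbortError` instead of hanging indefinitely, so the elapsed time in the log shows whether the stall ran out the full budget or failed earlier.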
Do you know what region your builder was in? And what region the uptime checks are coming from?
Not sure since the builder I have now has reverted to ‘pending’ and so has no instances listed.
TYPE ADDRESS REGION CREATED AT
The uptime checks are coming from the US, I assume (UptimeRobot). But their checks are correct: the requests were indeed not responding. I tried too (from UK).
Incoming requests were working: for example, a request to /healthcheck, which just hits the VM and returns, to check the VM itself works. All good. So World → Cloudflare → Fly works fine.
It was sub-requests outgoing from the app that were failing. So, for example, if it had to fetch data from https://example.com in order to handle a request, that outgoing call would time out after whatever timeout I set for the fetch. I tried 20s. It didn’t return.
Given the builder was also timing out, and that’s also a Fly app, it didn’t seem a coincidence. However, it’s working again now: sub-requests are responding within milliseconds, so I can’t recreate it. Ah well.
Do your sub requests go to AWS by chance? AWS has had weird networking issues in Virginia for the past week. We’ve been seeing packet loss deep within their network for RDS connections, connections to Heroku DBs, possibly S3, etc.
This would affect our registry too. We actually disabled the registry in Virginia for now, so registry requests get routed to other regions before they try to talk to AWS.
Interesting. These particular ones (that I happened to investigate) were going to Fly, not AWS.
Now, I was using the outside-world hostname (the one proxied by Cloudflare, in front of Fly), which usually works fine. It works fine right now. Just like the builder’s requests (in its case, to that registry): they work fine too. They just weren’t responding reliably all day yesterday.
My debugging plan was (and I may do this anyway) to swap out the Cloudflare-proxied hostname for its app-name.internal one, so app→app requests would stay within the private network. That would remove the Cloudflare/DNS variable from sub-requests, to see if it was at fault. However, now that requests are back to working normally, even proxied by Cloudflare, there is no way to know whether that would have helped. But, given the builder’s requests were timing out too, it seems unlikely Cloudflare was the issue yesterday.
Oh the sub requests didn’t hit any outside URLs?
It probably wasn’t cloudflare yesterday. Our registry timeout was almost definitely AWS.
My sub-requests did hit outside URLs, because of going via Cloudflare.
e.g. To fetch some data, the app does … and so that request is handled by … my-app.com (proxied by Cloudflare via orange-cloud DNS) → Fly.
And so that sub-request to my-app.com does venture into the outside world, and my-app.com would need to be resolved using DNS, for example. Was that the cause of the 20s+ delay? Not sure. That was my thought. And so I wondered whether to swap in my-app.internal in the fetch, so the request would not venture outside. There would still be DNS, but it would be in-house. The only reason I didn’t do that was self-signed certificate errors (it wanted an actual domain), so I hadn’t got around to sorting that. Still might, though. But https://my-app.com is working again now. Weird.
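For illustration, the planned swap might look like the sketch below. The app name and internal port are hypothetical (not taken from the thread), and a `.internal` hostname only resolves from inside Fly’s private network. Using plain `http` on the private route sidesteps the self-signed-certificate problem mentioned above, since the traffic never leaves the private network:

```javascript
// Hypothetical sketch of the planned swap: build the sub-request URL from the
// app's .internal hostname instead of the public Cloudflare-proxied domain.
// "my-app" and port 8080 are assumptions for illustration only.
function internalUrl(appName, port, path) {
  return `http://${appName}.internal:${port}${path}`;
}

// Public route (World -> Cloudflare -> Fly): https://my-app.com/some-data
// Private route (stays inside Fly):          via internalUrl below
const url = internalUrl('my-app', 8080, '/some-data');
console.log(url); // http://my-app.internal:8080/some-data
```

The fetch call itself stays the same; only the base URL changes, which is what makes this a clean way to isolate the Cloudflare/public-DNS variable.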
Huh, that is strange. Once we have a DNS entry cached, the .internal lookup and the external lookup are almost identical.
If this comes up again, will you ssh into your Fly vm and run some tests against the slow URL?
mtr is a nice tool for this: mtr -nt <hostname> could identify some issues.
Ok, I’ll try and remember if it happens again. So far it’s been back to normal all day today. UptimeRobot is a happy bot.
I’m getting this today; new user, first deploy failing:
Get "https://registry.fly.io/v2/": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
When I hit it in a web browser, it takes over a minute to respond.
Hi @squareborg we’re looking into a registry incident right now.