HTTP 525 errors via Cloudflare Worker

Hello,

We’re using a Cloudflare Worker to send requests to a fly.dev URL. During periods of higher QPS (50-500), I am getting random, intermittent 525s (as reported by response.status in the Cloudflare Worker) on up to ~1% of requests.

Removing force_https from fly.toml and hitting http://x.fly.dev seems to resolve the issue (all requests return 200s), but is not ideal. Perhaps we could handle TLS termination in-app, although I’d prefer not to do that either…
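For context, the relevant part of fly.toml looks roughly like this (a sketch assuming the [[services]] syntax; the ports and layout are illustrative, not our exact config):

[[services]]
  internal_port = 8080
  protocol = "tcp"

  [[services.ports]]
    handlers = ["http"]
    port = 80
    force_https = true   # removing this line (or setting it to false) is the workaround described above

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443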

This is related to “Cloudflare 525 error randomly occurs” and “525 errors are back”, but I’d highlight a couple of things:

  • This is not a health check or an “orange cloud” DNS proxy. It is a fetch request from a Cloudflare Worker.
  • I am using the fly.dev URL directly.
  • It happens intermittently, not all the time, and only under relatively high QPS.

Any ideas for how I can further troubleshoot this issue? Or any potential fixes/workarounds?

Hmm. It certainly appears to be TLS-related, given that removing HTTPS fixes it, and the 525 code (an SSL/TLS handshake failure between Cloudflare and the origin) points directly at that, unlike e.g. a 500 or 504.

My first thought, given those numbers, would be that it could be related to this:

Since I know Workers reuse a single instance across many requests, unlike e.g. Lambda, where only one instance handles each request, it would follow that the requests are all coming from the same IP. I’m not quite sure how to avoid that, other than handling TLS in-app to sidestep any Fly-proxy restriction.

That would mean every fetch call from a worker does a separate TLS handshake? There’s no connection re-use / H2 multiplexing going on?

I’ll try to think of something we can do (that isn’t just raising the global limit).

I would have thought Cloudflare would re-use connections too. According to this they do:

“Because Cloudflare reuses connections, the TLS handshake time is mostly mitigated, so there’s very little performance advantage to using HTTP rather than HTTPS to origin.”

https://community.cloudflare.com/t/2019-9-19-workers-runtime-release-notes-concurrent-subrequest-limit/115546/29

So perhaps the 525s here are unrelated to that limit. It just seemed like a good place to start looking.

Would you happen to have some sample worker code I can set up myself to reproduce the issue? It might help speed up finding the root of the issue.

mkdir ./fly-workers
cd ./fly-workers
# install npm (via nvm, if you prefer) if required
# install the wrangler CLI
npm i wrangler
npx wrangler init
# open ./src/index.js and replace its contents with:
// log who is calling, then proxy the request to an HTTPS origin
// (https://fly.io/docs stands in for your fly.dev app here)
async function origin(r) {
  console.debug("serving", r.url, "by", r.headers.get("CF-Connecting-IP"));
  return fetch(`https://fly.io/docs`);
}

export default {
  async fetch(request, env, ctx) {
    return origin(request);
  }
}
# back in the shell: generate an auth token
npx wrangler login
# ensure `workers_dev = true` exists in wrangler.toml, or add it; then:
npx wrangler publish

See the Wrangler docs and examples for more detail.
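To actually trigger the intermittent failures you’d also want to push some concurrency at the deployed Worker. A minimal sketch using Node 18+ (the Worker URL and the numbers are placeholders, not from the original report):

// load.mjs — fire batches of concurrent requests at the Worker and tally status codes
const WORKER_URL = "https://fly-workers.example.workers.dev"; // placeholder
const BATCHES = 50;       // sequential batches
const CONCURRENCY = 100;  // requests in flight per batch

const counts = {};
for (let b = 0; b < BATCHES; b++) {
  const statuses = await Promise.all(
    Array.from({ length: CONCURRENCY }, () =>
      fetch(WORKER_URL).then((r) => r.status).catch(() => "network-error")
    )
  );
  for (const s of statuses) counts[s] = (counts[s] ?? 0) + 1;
}
console.log(counts); // tally of status codes seen (how many 200s vs 525s)

Run it with `node load.mjs` and raise CONCURRENCY until non-200 statuses start showing up.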

@ian1 - Sorry for resurrecting this thread, but this sounds exactly like the situation we are seeing as well. We too are making calls to fly.dev instances from a Cloudflare Worker and seeing 525s at random.

Did you ever find a solution for this?

Unfortunately not. We were forced to switch to HTTP instead of HTTPS, which is not ideal.
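One other thing that might be worth trying (a sketch only; we never verified whether it helps with this particular issue) is retrying 525 responses inside the Worker, since the failures are intermittent:

// Worker sketch: retry the subrequest a couple of times when the origin handshake fails with 525
// (the fly.dev hostname below is a placeholder)
async function fetchWithRetry(url, attempts = 3) {
  let resp;
  for (let i = 0; i < attempts; i++) {
    resp = await fetch(url);
    if (resp.status !== 525) return resp; // anything other than a TLS-handshake failure passes through
  }
  return resp; // still 525 after all attempts
}

export default {
  async fetch(request, env, ctx) {
    return fetchWithRetry("https://your-app.fly.dev/");
  }
};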