Cloudflare 525 error randomly occurs

Hello,

I have a Cloudflare proxy (an “orange cloud” DNS record) enabled for my fly app’s DNS. So instead of a simple DNS A record to 1.2.3.4, the Cloudflare proxy sits in front of that.

That generally works well.

However I randomly get 525 errors. Apparently it’s caused by an error with the SSL handshake between Cloudflare’s server and fly’s. As far as I can see there is nothing I can do since both are out of my control.

When you Google Cloudflare 525 loads of other people get it and there is no definitive solution.

Today, for two apps, I got a 525 error. Then they started working again, no 525. Interestingly the 525 happened on a round number, at 10pm UTC, so I wondered whether it could be to do with a certificate being replaced by fly? Is there a way to see when exactly a certificate was replaced? I looked using certs in the CLI (show/list), but it just shows when it was issued.

Or that may be a coincidence.

Otherwise, could someone look at numbers 5 and 6 on this list of possible issues suggested by Cloudflare, and offer any thoughts on them? Those seem like things you’ll know, like the one about the cipher.

I’m assuming if it happens for me it’s happening for other people too. It makes it look like something is broken, when it isn’t (app wise). I know to reload the page, retry an API call etc, but it looks bad for other people as it looks like downtime, error etc.
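For the retry part, this is roughly what I mean: a minimal sketch in Node.js, assuming the 525 clears after a short wait. `fetchWithRetry`, its options, and the injected `doFetch` function are names I made up for illustration; they are not part of any Cloudflare or Fly API.

```javascript
// Retry a request when the response status is 525 (Cloudflare: SSL handshake failed).
// `doFetch` is any function returning a Promise of an object with a numeric `status`,
// for example the global fetch in Node 18+. All names here are illustrative assumptions.
async function fetchWithRetry(doFetch, url, { retries = 3, delayMs = 200 } = {}) {
  let response;
  for (let attempt = 0; attempt <= retries; attempt++) {
    response = await doFetch(url);
    if (response.status !== 525) {
      return response; // success, or an error that a retry is unlikely to fix
    }
    if (attempt < retries) {
      // Linear backoff: wait a little longer after each failed attempt.
      await new Promise((resolve) => setTimeout(resolve, delayMs * (attempt + 1)));
    }
  }
  return response; // still 525 after all attempts; let the caller decide
}
```

That only helps for API calls I control. It does nothing for a visitor who sees the Cloudflare error page in a browser.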

Thanks.

This can happen for a number of reasons. What app did you see it on? We can check some metrics. And do you know what region Cloudflare served the 525 from?

We monitor TLS handshakes pretty aggressively from a bunch of spots all over the world. It’s probably not an issue with our TLS stack. It could be anything from a network routing issue, to Cloudflare handling stale connections in a special way, to your app processes just being slow to respond. It might be hard to troubleshoot; running proxies on proxies like this is vastly more complex.

Ok, great.

I’ll send you the Cloudflare error page as text form as a message (that’s how I got it, as an error report, like HTML->text). In between all the \n etc it shows the domain, exact time, ID, region etc.

Interestingly the app is in LHR but the Cloudflare error lists Amsterdam. Weird.

That’s all I have as far as I know since I did a flyctl logs but the tail didn’t go back far, and Cloudflare don’t show logs. Hmm.

Yep, it would be easier to debug if it never worked! I know random network errors happen. The app should respond pretty much instantly. For example, one app that returned a 525 earlier today just redirects. No database, logic, anything. Lightweight VM with Node.js.

Hi,

I’m guessing nothing stood out in the logs about why the TCP connection was reset during the handshake (which is Cloudflare’s guess as to why the 525s occur intermittently)?

Really the main reason for using them is to get the client’s country code, which they handily add for free in a cf-ipcountry header.
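For context, this is about all I do with it: a minimal sketch, assuming a Node.js request object whose header names are lower-cased (as Node's `http` module does). `countryFromHeaders` is a hypothetical helper name.

```javascript
// Read the visitor's country from Cloudflare's `cf-ipcountry` request header.
// Node's http module lower-cases incoming header names, so a lower-case key is used.
// `countryFromHeaders` is a hypothetical helper name.
function countryFromHeaders(headers) {
  const value = headers['cf-ipcountry'];
  // Header absent, e.g. the request did not pass through Cloudflare's proxy.
  if (typeof value !== 'string') return null;
  const code = value.trim().toUpperCase();
  // Cloudflare uses "XX" when it has no country data and "T1" for Tor traffic.
  if (code === 'XX' || code === 'T1') return null;
  return /^[A-Z]{2}$/.test(code) ? code : null;
}
```

In a request handler that would be called as `countryFromHeaders(req.headers)`.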

Any plans to add a similar header? I assume you must have some idea where a client is for your own geo system to work, so if that extends to knowing their country code could you add it as a fly-X header?

That would bypass this 525 issue.

It’s true! We didn’t have anything logged for a connection failure, so it’s hard to say what actually happened. We log TLS handshake failures and I wasn’t able to find any of those for your app, for example.

We probably won’t add geo headers until we’re a much larger company. Companies who sell geo IP databases don’t want us to relay that information to customers without a (very expensive) reseller plan. At some point we’ll be big enough to generate our own.

Depending on what you need, you might be able to use Maxmind’s free geo IP database to look people up: https://dev.maxmind.com/geoip/geolite2-free-geolocation-data?lang=en

Hmm, just got another one of these 525 errors. It seemed to resolve itself and the app now works again.

Did you have any thoughts about 5 and 6 on their list of quick fix ideas?

I checked the logs and didn’t see any errors but I did see some lines about the proxy warning … which is interesting. Could that be linked to the 525 if Cloudflare can’t connect to the proxy? Or just a coincidence? Hmm. Not sure what prompted the proxy to move between passing and warning, as I haven’t touched anything.

This app only uses 443 as a port so without SSL working, it’s not going to work.

These were the lines from the log around the same time it happened, LHR region:

2021-08-12T00:41:46.287149835Z app[03570dbb] lhr [info] GET /healthcheck 200 16 - 0.412 ms
2021-08-12T00:41:48.267714211Z proxy[03570dbb] lhr [warn] Health check status changed 'passing' => 'warning'
2021-08-12T00:41:51.381597622Z app[03570dbb] lhr [info] GET /healthcheck 200 16 - 0.256 ms
2021-08-12T00:41:53.765488678Z proxy[03570dbb] lhr [info] Health check status changed 'warning' => 'passing'
2021-08-12T00:41:56.426375203Z app[03570dbb] lhr [info] GET /healthcheck 200 16 - 0.205 ms

Those proxy warnings indicate that the health check might’ve “flapped”. We don’t have much else, really. We’ve debated calling Cloudflare in front of Fly unsupported, for what it’s worth, just because layered CDN/Anycast infrastructure does weird stuff.
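If you want to line up flaps against the times Cloudflare reports a 525, here is a minimal sketch in Node.js that pulls the status changes out of a saved log capture. It assumes the exact line format quoted above; `findHealthCheckChanges` is a hypothetical helper, not something that ships with flyctl.

```javascript
// Extract health check transitions from Fly log lines in the format quoted above.
// `findHealthCheckChanges` is a hypothetical helper; the regex assumes that exact format.
const CHANGE = /^(\S+) proxy\[(\w+)\] (\w+) \[\w+\] Health check status changed '(\w+)' => '(\w+)'/;

function findHealthCheckChanges(logText) {
  const changes = [];
  for (const line of logText.split('\n')) {
    const match = CHANGE.exec(line.trim());
    if (match) {
      const [, timestamp, instance, region, from, to] = match;
      changes.push({ timestamp, instance, region, from, to });
    }
  }
  return changes;
}
```

Each result carries the timestamp, instance ID, region, and the old and new status, so it can be compared by hand with the time shown on the Cloudflare error page.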

When I started with Fly I had some 525 issues too. The issues are related to SSL/TLS; more info here: Troubleshooting SSL errors – Cloudflare Help Center

To resolve the issues I made two changes:

  1. Run Cloudflare SSL/TLS in full mode (*)
  2. Set up Fly to not terminate TLS at the edge

(*) I generated an origin certificate in Cloudflare for the host. It is used for all connections that come from Cloudflare.

I did notice that sometimes during a deploy Cloudflare shows a 525. The 525 happens when Cloudflare is trying to make requests to the app while it’s being restarted. The 525 doesn’t go away immediately when the instance comes back online; I think Cloudflare uses a delay and requires multiple successful health checks on their end before it clears.

When running 2 instances of the app using the default canary deploy I have not experienced this issue.

Hope this helps.

@kurt Declaring Cloudflare persona non grata would be a bummer. We make heavy use of Cloudflare Workers for just about anything and everything.

Thanks @johan.

Yes, it’s possible that using the same approach, where I control the certificate (like you say, getting the one from Cloudflare, and having the app manage TLS) would solve the 525. Since I don’t see any error in the logs the issue appears to be happening between Cloudflare and the proxy, as that handles the TLS for me. Which is impossible for me to debug.

But ideally I’d prefer not to have to do that as it’s another thing to manage :frowning: To compare, I ran Cloudflare in front of an AWS ELB (which also handles the certificate for you) for 5 years and didn’t get a single 525. So there is something about the proxy that randomly causes it. Don’t know how to recreate it though.

It makes sense a 525 could happen during a deploy because of the swap. But perhaps I should have said this app has at least two instances … and I haven’t touched them. So that’s not the cause here.

And yep, I want to keep some kind of Cloudflare support too just because I currently need their geo header.

True, you would need to manage this; I guess it depends a bit on how you are running your app. It’s pretty easy though: you can generate certificates that are valid for up to 15 years.

Another approach is to use SSL/TLS in flexible mode. Then you let Cloudflare terminate the TLS and you forward HTTP to Fly, which doesn’t require any extra work on your end.

I tested this and it works fine too. I personally prefer full mode as it lets me use the origin certs to set up self-signed HTTPS locally too, which gives me better parity between local and production.

The big difference between the ELB and Fly is anycast. CDNs control the routing to the origins, so if Cloudflare makes a change that alters routes, connections could break and packets might end up at a different Fly proxy. It’s impossible for us to debug, too, because it happens between Cloudflare and our proxy instance.

Doing TLS in your app actually mitigates that because it’s only one instance terminating TLS.

Thanks @kurt @johan that helps.

Yep, I’d thought about having Cloudflare terminate the TLS and forward HTTP, avoiding this issue entirely. But keeping everything encrypted is better.

So it sounds like the solution is to have the TLS done in my app. Since errors never happen at a convenient time of the day.

Side-note: that’s interesting about anycast. That suggests I would also get a 525 with a Google Cloud load balancer as that provides a static IP using anycast. I haven’t tried using one with Cloudflare before. Did wonder about it. Ok, thanks.

I was getting 525 errors too and seem to have resolved them but…

If I am not using Cloudflare Workers, is there any benefit of putting Cloudflare in front of my Fly instance? I am using Cloudflare for 2 reasons: certificate management and DDoS protection. For certificate management, on Cloudflare “Full mode” I still need to generate self-signed certs on the server and Fly can do that…so not much benefit from Cloudflare there. For DDoS, I have read that Fly has protections in place at the network level and if anything wild happens, you all won’t charge me for the attack traffic. And of course I will have auto-scaling limits set at something (fairly) reasonable.

Also, if this question doesn’t make much sense, pardon my ignorance. I am a self-taught one-man show. Fly.io seems awesome by the way and I hope you all are growing. I look forward to using Fly for years to come.

Makes perfect sense.

There are arguments for and against putting Cloudflare in front (aka “orange cloud”, using its proxy).

For example if you want to know which country your user is in, Cloudflare provides that in a header. Fly does not. So that would be an example where using Cloudflare’s proxy does make sense.

Using Cloudflare’s proxy naturally will always add another network “hop” as a request has to go via its server, and then on to Fly. So you might think that is an argument to not use it, as it increases latency. Which may be the case. But Cloudflare has more edge locations and so a user may (depending on where they are in the world, another variable) get to one of their servers faster. And so even with the extra hop they add, the overall request may end up faster.

Also, Cloudflare supports the very latest networking tricks like HTTP/3. Not sure where Fly is at with that. So again, may be faster … if those make a difference to what your app does.

Do you need a WAF? Cloudflare does do more than DDoS protection and filters out other kinds of attacks. So, again, depends if you need that.

Arguments against? Added complexity, random 525s (like you found), more networks means more to go wrong … and harder to debug when it does.

There isn’t a clear answer. Sadly “it depends” :slight_smile:

I deployed an app to Fly and am getting these 525 errors when trying to make requests from Cloudflare Workers to the Fly app. :confused:

So the only solution is managing TLS in my app?

I didn’t get the 525 errors when using Google Cloud Run. I imagine there is some sort of balancer since you can have multiple instances currently processing requests.

Hmm. Yes, I had the same.

The only way I fixed it was by doing the TLS in-app. Using one of these https://developers.cloudflare.com/ssl/origin-configuration/origin-ca

Is that the only solution? Not sure, but it’s the only thing that worked for me.

Thanks for the link @greg.

So I download the certs from Cloudflare, and then use them in my application.

Then @johan mentioned:

Set up Fly to not terminate TLS at the edge

How is this configured?

You can set up an empty handlers = [] section on your [[services.ports]], which basically tells Fly that you want raw connections with no transformations or terminations.

That lets you handle everything in-app.

I got it running, thanks everyone for your help.

These are the exact steps I took:

1 Add your domain to Cloudflare

CF will not give an origin cert for a domain that they do not control. You won’t be able to do this for your Fly dev domain (eg: your-app.fly.dev). Add the domain to CF, change the NS in your registrar’s dashboard, and wait until the NS have propagated.

2 Configure the DNS to point to your Fly app

You will need to add A records so that your domain or subdomain point to your Fly app.

To get the IP of your Fly app do:

fly info

In the DNS management section of CF’s dashboard add the A record. Typically when doing this, you’d want to avoid CF being a proxy, but in this case you must enable the option (the orange cloud) because the CF origin cert is only valid between your app and CF. Browsers will not accept the origin cert as valid.

3 Get the origin certs from Cloudflare

Go to the SSL/origin server section of the dashboard. Create and download the cert and the key. I used PEM format, but YMMV.

4 Add the certs to your application

Obviously this will change for every app. I’m using Node.js and Fastify and injecting the cert and key via env vars that come from two Fly secrets.

To add a cert file to a Fly secret use this:

fly secrets set SSL_KEY=- < key.pem
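On the app side, a minimal sketch of reading those secrets in Node.js. It assumes the key is in `SSL_KEY` as in the command above and the certificate is in a second secret called `SSL_CERT` (that name is my assumption for illustration; only `SSL_KEY` is shown above). It uses Node's built-in `https` module rather than Fastify to stay dependency-free, and `tlsOptionsFromEnv` and `startServer` are hypothetical helper names.

```javascript
const https = require('https');

// Build TLS options from environment variables populated by Fly secrets.
// SSL_KEY matches the command above; SSL_CERT is an assumed name for the second secret.
function tlsOptionsFromEnv(env) {
  const { SSL_KEY: key, SSL_CERT: cert } = env;
  if (!key || !cert) {
    throw new Error('SSL_KEY and SSL_CERT must both be set');
  }
  return { key, cert };
}

// Start an HTTPS server that terminates TLS in the app itself.
// The port must match the internal port your Fly service forwards to.
function startServer(env, port) {
  const server = https.createServer(tlsOptionsFromEnv(env), (req, res) => {
    res.writeHead(200, { 'content-type': 'text/plain' });
    res.end('ok\n');
  });
  return server.listen(port);
}
```

Calling `startServer(process.env, 8443)` would start it, where 8443 is an arbitrary example port. Failing fast when a secret is missing makes a bad deploy obvious in the logs instead of surfacing later as a handshake error. As far as I know, Fastify takes the same key and cert pair through its `https` option.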

5 Configure fly to let you handle SSL in your app

In the fly.toml file remove the http and tls handlers for the 443 port.

  [[services.ports]]
    handlers = []
    port = 443

Honestly, I don’t know if this needs to be done for port 80 too. I think CF acting as a proxy won’t let you access the Fly app via HTTP anyway.

And that’s it. Deploy your app and no more 525 errors!

Took me a couple of hours at first, but now that I understand the whole process and its pitfalls it should be super easy to replicate.

Nice.

Yes it’s a pain to set up but it seems to solve the problem. And their certificate does not expire (or at least not for ages as I recall) so you can just forget about it once done.