SSL issue

Hey, I just had about 15 minutes of downtime that looked like an SSL issue, and I’m trying to understand the root cause, if you’re able to help (I haven’t changed anything SSL-related during that time).

Had this in the app logs:

2021-08-19T15:57:02.298950303Z app[50663299] lhr [info] 15:57:02.295 [warn] Description: 'Authenticity is not established by certificate path validation'
2021-08-19T15:57:02.299743078Z app[50663299] lhr [info]      Reason: 'Option {verify, verify_peer} and cacertfile/cacerts is missing'

The certificate on fly.io is showing the following - I assume the red indicator means it’s not set up correctly (although issue went away already)

And here’s what I have in Cloudflare:

Any ideas? Not super urgent given the problem solved itself but trying to demo the app now to folks so it’s a bit stressful!

Huh, those logs look like they were generated by your app, and not any of our certificate infrastructure. Is there an external service your app connects to with SSL?

Yes - I probably caused a small issue causing outgoing connections to fail - but didn’t think that might cause incoming connections to fail…

It’s probably my fault somehow but thought I’d check if you see anything obvious. Should I worry about the red indicator on the certificate page?

Oh I missed that cloudflare screenshot. You’ll want to turn off proxying for the _acme-challenge record at the very least. Click that little cloud to control it.

Those certs are actually good, but we may not be able to renew them when they expire if the record is proxied like that.

What kind of errors did you see? Cloudflare to Fly has seemed pretty brittle and hard to debug lately.

Ah, Cloudflare and its SSL @Tomasz . Yep, I’m dealing with this too.

As @kurt says you can solve your current problem by turning off the proxy for the acme-challenge record (aka “grey cloud”).

And that will let the SSL certificate be renewed.

However that’s the same setup I have now too. Cloudflare → Fly. And while it works, be aware I get totally random 525 errors. Kurt and I have discussed it in a prior thread and alas it’s not Fly’s fault so there isn’t anything they can debug/fix.

One solution I’m exploring is using Cloudflare’s own CA certificate. This: https://developers.cloudflare.com/ssl/origin-configuration/origin-ca So that may be helpful for you if you ever get a 525 too.

Ha, well I went with Cloudflare because it’s the top recommended choice here: superfly/dns-help on GitHub (“Instructions for pointing a domain at your Fly edge application”).

I’ll turn off that proxy thing, thanks for the tip both!

And it just started again - it’s 525s indeed. Is the only real workaround to set up the certificate via CF? I’m a bit unsure on how to upload it to fly…

Actually looking at it, the app has crashed but unsure why…

Oh well that’s a legit 525 then.

Try running fly status --all and then fly vm status <id> of anything in a failed state, it might give you an idea of what’s up.
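If there are several failed instances, the two commands can be chained. This is only a rough sketch: it assumes the fly status --all table has an 8-character hex ID in the first column and the word "failed" in the STATUS column, and failed_ids is a made-up helper name, not part of flyctl.

```shell
# failed_ids: read `fly status --all` output on stdin and print the ID of
# every instance whose row shows a failed status.
# Assumes the table layout described above; not an official flyctl feature.
failed_ids() {
  awk 'length($1) == 8 && $1 ~ /^[0-9a-f]+$/ && / failed / { print $1 }'
}

# Usage (needs flyctl installed and an app selected):
#   fly status --all | failed_ids | while read -r id; do fly vm status "$id"; done
```

Matching on the word "failed" keeps the helper independent of exact column positions, at the cost of breaking if the table format changes.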

Hm okay

Instances
ID       VERSION REGION DESIRED STATUS   HEALTH CHECKS       RESTARTS CREATED
3010b94e 74 ⇡    lhr    run     failed   1 total, 1 critical 2        1m17s ago
a4693146 74 ⇡    lhr    stop    failed   1 total, 1 critical 2        10m19s ago
550e5b28 74 ⇡    lhr    stop    failed   1 total, 1 critical 2        15m20s ago
26662f6b 74 ⇡    lhr    stop    failed   1 total, 1 critical 2        17m47s ago
6be617b9 74 ⇡    lhr    stop    failed   1 total             2        19m20s ago
766e8575 74 ⇡    lhr    stop    failed   1 total             2        20m20s ago

All of the failed ones have this output

Recent Events
TIMESTAMP            TYPE           MESSAGE
2021-08-19T20:06:15Z Received       Task received by client
2021-08-19T20:06:15Z Task Setup     Building Task Directory
2021-08-19T20:06:18Z Started        Task started by client
2021-08-19T20:06:36Z Terminated     Exit Code: 1
2021-08-19T20:06:36Z Restarting     Task restarting in 1.150028707s
2021-08-19T20:06:38Z Started        Task started by client
2021-08-19T20:06:56Z Terminated     Exit Code: 1
2021-08-19T20:06:56Z Restarting     Task restarting in 1.169078711s
2021-08-19T20:06:58Z Started        Task started by client
2021-08-19T20:07:16Z Terminated     Exit Code: 1
2021-08-19T20:07:16Z Not Restarting Exceeded allowed attempts 2 in interval 5m0s and mode is "fail"

Checks
ID                               SERVICE  STATE    OUTPUT
0a33a4f86fa24b92a038e7a2786f7e82 tcp-4000 critical dial tcp 172.19.1.186:4000: connect: connection refused

App logs look like there’s an issue connecting to Postgres - I switched to the direct IP as per the other thread yesterday

2021-08-19T20:07:09.827772794Z app[a4693146] lhr [info] 20:07:09.822 [error] Postgrex.Protocol (#PID<0.2541.0>) failed to connect: ** (DBConnection.ConnectionError) tcp connect (fdaa:0:3049:a7b:a98:0:31c2:2:5433): host is unreachable - :ehostunreach
2021-08-19T20:07:09.832608146Z app[a4693146] lhr [info] 20:07:09.823 [error] Postgrex.Protocol (#PID<0.2539.0>) failed to connect: ** (DBConnection.ConnectionError) tcp connect (fdaa:0:3049:a7b:a98:0:31c2:2:5433): host is unreachable - :ehostunreach
2021-08-19T20:07:09.835098390Z app[a4693146] lhr [info] 20:07:09.828 [error] Postgrex.Protocol (#PID<0.2538.0>) failed to connect: ** (DBConnection.ConnectionError) tcp connect (fdaa:0:3049:a7b:a98:0:31c2:2:5433): host is unreachable - :ehostunreach

It seems like the Elixir lib isn’t handling multiple IPs very well. It should be trying both IPs, but it appears to be hitting just the one. I haven’t dug very deep into the Elixir Postgres library, but you may need to rebuild your connection string.

I think the issue is actually the DB, not the app, and the SSL warning was a red herring (sorry)

App
  Name     = premade-db
  Owner    = premade
  Version  = 0
  Status   = running
  Hostname = premade-db.fly.dev

Instances
ID       VERSION REGION DESIRED STATUS                 HEALTH CHECKS       RESTARTS CREATED
50714c3c 0       lhr    run     running (context dead) 3 total, 3 critical 2        2021-08-18T19:53:55Z
df9c3004 0       lhr    run     running (context dead) 3 total, 3 critical 2        2021-08-18T19:51:32Z

Health Checks for premade-db
NAME STATUS   ALLOCATION REGION TYPE   LAST UPDATED OUTPUT
pg   critical 50714c3c   lhr    SCRIPT 21m34s ago   context deadline exceeded
vm   critical 50714c3c   lhr    SCRIPT 28m13s ago   context deadline exceeded
role critical 50714c3c   lhr    SCRIPT 28m28s ago   context deadline exceeded
vm   critical df9c3004   lhr    SCRIPT 21m3s ago    context deadline exceeded
pg   critical df9c3004   lhr    SCRIPT 22m2s ago    context deadline exceeded
role critical df9c3004   lhr    SCRIPT 22m30s ago   context deadline exceeded

I’ll stop the VMs again I guess to see if that helps

Took a bit longer this time, ~6m each, but they’re back up and seem fine. I’ve kept the IP connection string for now, as I guess it’s not related?

Do you know what the context deadline error refers to here?

Apologies for jumpstarting the thread with the SSL issue, noticed the symptom and not the cause.

We’ve been fighting Postgres-related issues all day. More here: “Postgres fails constantly every 1-2 days for 2-3 min”

@kurt thank you and the rest of the gang for fighting this - hope my frequent posts don’t add to the stress! I think I read in one of the threads that some of those issues might be isolated to the LHR region, is that correct?

If I were to move over to another region, am I naive in thinking that I could spin up 2 more instances there and then stop the LHR ones?

The past few days have seen problems in most regions; London isn’t specifically an issue.

Moving a postgres to a new region means:

  1. Adding new volumes
  2. fly scale count <num>
  3. Update the PRIMARY_REGION environment variable for the new region
  4. Remove old volumes
  5. fly scale count 2

The simplest way to do #3 is to run fly secrets set PRIMARY_REGION=<val>
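Put together, the five steps might look roughly like the following. It’s a dry-run sketch that only prints commands. The helper name plan_pg_move, the volume name pg_data, the volume size, the example region and the <old-volume-id> placeholder are all assumptions, so check them against fly volumes list for your own app before running anything.

```shell
# plan_pg_move APP REGION COUNT SIZE_GB
# Prints (does not run) the commands for the five steps above.
# Assumptions: volumes are named pg_data, and two instances end up in the
# new region, so two volumes are created there.
plan_pg_move() {
  app=$1; region=$2; count=$3; size=$4
  # 1. Add new volumes in the target region (one per instance there).
  echo "fly volumes create pg_data --region $region --size $size -a $app"
  echo "fly volumes create pg_data --region $region --size $size -a $app"
  # 2. Scale up so instances can start on the new volumes.
  echo "fly scale count $count -a $app"
  # 3. Point PRIMARY_REGION at the new region.
  echo "fly secrets set PRIMARY_REGION=$region -a $app"
  # 4. Remove each old volume; look up the IDs with: fly volumes list -a APP
  echo "fly volumes delete <old-volume-id>"
  # 5. Scale back down to two instances.
  echo "fly scale count 2 -a $app"
}

# Example with hypothetical values: plan_pg_move premade-db ams 4 10
```

Printing the plan first makes it easy to review the order and fill in the real volume IDs before touching a live database.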