Hey, just had 15 minutes of downtime as a result of an SSL issue and I’m trying to understand what the root cause was, if you might be able to help (as I haven’t changed anything SSL-related during that time)
Had this in the app logs:
2021-08-19T15:57:02.298950303Z app[50663299] lhr [info] 15:57:02.295 [warn] Description: 'Authenticity is not established by certificate path validation'
2021-08-19T15:57:02.299743078Z app[50663299] lhr [info] Reason: 'Option {verify, verify_peer} and cacertfile/cacerts is missing'
The certificate on fly.io is showing the following - I assume the red indicator means it’s not set up correctly (although the issue went away already)
Huh, those logs look like they were generated by your app, and not any of our certificate infrastructure. Is there an external service your app connects to with SSL?
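For reference, that warning comes from Erlang’s :ssl application when an outbound TLS connection is opened without peer verification configured - it’s the client side deciding not to check the server’s certificate chain, not anything in our cert handling. A rough sketch of the options that make it go away, assuming the connection in question is the Ecto/Postgrex one (the app/repo names and the CA bundle path are just placeholders; other clients take the same keys under a different option name, e.g. hackney’s ssl_options):

# config/runtime.exs - illustrative only
config :my_app, MyApp.Repo,
  ssl: true,
  ssl_opts: [
    verify: :verify_peer,
    # example path; point this at whichever CA bundle your image ships
    cacertfile: "/etc/ssl/certs/ca-certificates.crt"
  ]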
Oh, I missed that Cloudflare screenshot. You’ll want to turn off proxying for the _acme-challenge record at the very least. Click that little cloud to control it.
Those certs are actually good, but we may not be able to renew them when they expire if the record is proxied like that.
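So on the Cloudflare side the records want to end up looking roughly like this - hostnames are placeholders, and the validation target is whatever your Fly certificates page shows; the important bit is that the _acme-challenge record is grey-cloud / DNS only, so the ACME lookup resolves through to us rather than to Cloudflare:

app.example.com                  CNAME  myapp.fly.dev                                        (orange cloud is fine if you want Cloudflare in front)
_acme-challenge.app.example.com  CNAME  <validation target from the Fly certificates page>   (grey cloud / DNS only)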
What kind of errors did you see? CloudFlare to Fly seems pretty brittle and hard to debug, lately.
Ah, Cloudflare and its SSL @Tomasz . Yep, I’m dealing with this too.
As @kurt says you can solve your current problem by turning off the proxy for the acme-challenge record (aka “grey cloud”).
And that will let the SSL certificate be renewed.
However, that’s the same setup I have now too: Cloudflare → Fly. And while it works, be aware that I get totally random 525 errors. Kurt and I discussed it in a prior thread, and alas it’s not Fly’s fault, so there isn’t anything they can debug/fix.
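For what it’s worth, a 525 is Cloudflare saying the TLS handshake between its edge and the origin failed, so when it strikes it helps to test the Fly side directly and cut Cloudflare out of the loop - something like this, with the hostnames as placeholders:

# does Fly's own certificate handshake cleanly?
curl -sv https://myapp.fly.dev -o /dev/null
# same hostname, but pointed straight at Fly rather than through Cloudflare
curl -sv --resolve app.example.com:443:<your app's Fly IPv4> https://app.example.com -o /dev/null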
And it just started again - it’s 525s indeed. Is the only real workaround to set up the certificate via CF? I’m a bit unsure how to upload it to Fly…
Instances
ID VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
3010b94e 74 ⇡ lhr run failed 1 total, 1 critical 2 1m17s ago
a4693146 74 ⇡ lhr stop failed 1 total, 1 critical 2 10m19s ago
550e5b28 74 ⇡ lhr stop failed 1 total, 1 critical 2 15m20s ago
26662f6b 74 ⇡ lhr stop failed 1 total, 1 critical 2 17m47s ago
6be617b9 74 ⇡ lhr stop failed 1 total 2 19m20s ago
766e8575 74 ⇡ lhr stop failed 1 total 2 20m20s ago
All of the failed ones have this output:
Recent Events
TIMESTAMP TYPE MESSAGE
2021-08-19T20:06:15Z Received Task received by client
2021-08-19T20:06:15Z Task Setup Building Task Directory
2021-08-19T20:06:18Z Started Task started by client
2021-08-19T20:06:36Z Terminated Exit Code: 1
2021-08-19T20:06:36Z Restarting Task restarting in 1.150028707s
2021-08-19T20:06:38Z Started Task started by client
2021-08-19T20:06:56Z Terminated Exit Code: 1
2021-08-19T20:06:56Z Restarting Task restarting in 1.169078711s
2021-08-19T20:06:58Z Started Task started by client
2021-08-19T20:07:16Z Terminated Exit Code: 1
2021-08-19T20:07:16Z Not Restarting Exceeded allowed attempts 2 in interval 5m0s and mode is "fail"
Checks
ID SERVICE STATE OUTPUT
0a33a4f86fa24b92a038e7a2786f7e82 tcp-4000 critical dial tcp 172.19.1.186:4000: connect: connection refused
The app logs look like there’s an issue connecting to Postgres - I switched to the direct IP as per the other thread yesterday
It seems like the Elixir lib isn’t handling multiple IPs very well; it should be trying both IPs, but it appears to just be hitting one. I haven’t dug very deep into the Elixir Postgres library, but you may need to rebuild your connection string.
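If it helps, the pattern I’d try in an Elixir app here is to hand Ecto the address explicitly and force IPv6 sockets, rather than relying on the hostname resolving to the right instance. This is a sketch only, assuming Postgrex/Ecto, with the repo name, credentials and database name as placeholders (either the .internal name or the direct fdaa: address should work as the hostname):

# config/runtime.exs - sketch, not your actual config
config :my_app, MyApp.Repo,
  hostname: System.get_env("DATABASE_HOST") || "premade-db.internal",
  username: "postgres",
  password: System.fetch_env!("DATABASE_PASSWORD"),
  database: "my_app_prod",
  socket_options: [:inet6],   # Fly private networking is IPv6
  pool_size: 10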
I think the issue is actually the DB, not the app or the SSL red herring (sorry)
App
Name = premade-db
Owner = premade
Version = 0
Status = running
Hostname = premade-db.fly.dev
Instances
ID VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
50714c3c 0 lhr run running (context dead) 3 total, 3 critical 2 2021-08-18T19:53:55Z
df9c3004 0 lhr run running (context dead) 3 total, 3 critical 2 2021-08-18T19:51:32Z
Health Checks for premade-db
NAME STATUS ALLOCATION REGION TYPE LAST UPDATED OUTPUT
pg critical 50714c3c lhr SCRIPT 21m34s ago context deadline exceeded
vm critical 50714c3c lhr SCRIPT 28m13s ago context deadline exceeded
role critical 50714c3c lhr SCRIPT 28m28s ago context deadline exceeded
vm critical df9c3004 lhr SCRIPT 21m3s ago context deadline exceeded
pg critical df9c3004 lhr SCRIPT 22m2s ago context deadline exceeded
role critical df9c3004 lhr SCRIPT 22m30s ago context deadline exceeded
I’ll stop the VMs again, I guess, to see if that helps
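Roughly this, using the instance IDs from the status output above, assuming fly vm stop is still the right incantation for it:

fly vm stop 50714c3c -a premade-db
fly vm stop df9c3004 -a premade-db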
@kurt thank you and the rest of the gang for fighting this - hope my frequent posts don’t add to the stress! I think I read in one of the threads that some of those issues might be isolated to the LHR region - is that correct?
If I were to move over to another region, am I naive in thinking that I could spin up two more instances there and then stop the LHR ones?
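Something like this is what I had in mind, if that’s even the right shape - the region code is just an example, and I appreciate the Postgres app’s volumes are pinned to a region so it may not be this simple for the DB:

fly regions add cdg      # cdg is only an example
fly scale count 4        # run extra instances so the new region picks some up
# ...wait for the new instances to pass their health checks, then:
fly regions remove lhr
fly scale count 2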