Hey folks, one of my core services hosted on fly started failing my external health checks and appears to be down, so I decided to redeploy it. Unfortunately, deployments don’t appear to be working (or are at least erroring), so I’m stuck.
Initially went down about 11:37am.
v240 never seemed to finish, so I had to Ctrl-C it.
v241 “deployed successfully” but then showed an abort
Please disregard / hold off on sinking time here. This may be an issue on my end with my database that’s preventing my service from working. Will circle back with an update.
Thanks Jerome. Latest theory is that there’s a DNS issue when connecting to that database. I’m able to resolve / reach it from outside of fly.io as of right now and can query it, etc. Any recent changes to fly DNS that may explain this? Thanks!
There was a DNS change about 2 hours ago. It was seamless though.
I just checked across all our hosts and they can all resolve the DNS entry for your database as far as I can tell (and all in the same way). This should mean everything is working fine and your app should also be working. I see it’s still a problem from your logs.
We had to switch because we started getting rate limited by Google DNS. This should help DNS resolution in general.
I’ll dig some more, but I can’t yet explain this issue.
Update: I’m re-deploying all my fly.io services to see if that does anything.
Kurt: Yes, I’m able to resolve from here in SF at home and (it appears) my other production servers on other hosts don’t have a problem. It’s possible those haven’t retried recently, but I think they likely would have.
I was just using my desktop mysql client (Sequel Ace). But I can try something else if I can come up with something I could trigger on my fly instances. I don’t have SSH set up on my instances, so I’m not clear on how to do that.
Hey folks, maybe there’s a different resolver at play? I couldn’t get dig to run on the fly machine, but when I tried to get curl to resolve the host there, it failed immediately. On other Linux machines I tried (that aren’t at fly), it times out instead, because the port isn’t open.
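For anyone who wants to separate the two failure modes described above without SSH or dig, here is a minimal Python sketch. It assumes the check can be run from inside the app (for example at startup, with the result written to the logs). The function name `classify_connect`, the host `db.example.com`, and port 3306 are placeholders, not anything from this thread.

```python
import socket


def classify_connect(host, port, timeout=3.0, resolve=socket.getaddrinfo):
    """Return (status, detail) describing where a TCP connection attempt fails.

    status is one of: 'dns_failure', 'connect_timeout', 'refused',
    'other_error', 'connected'.
    """
    # Step 1: name resolution. A failure here is immediate and means the
    # resolver could not answer, independent of whether the port is open.
    try:
        infos = resolve(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return "dns_failure", str(exc)

    # Step 2: TCP connect to the first resolved address.
    family, socktype, proto, _canonname, sockaddr = infos[0]
    sock = socket.socket(family, socktype, proto)
    sock.settimeout(timeout)
    try:
        sock.connect(sockaddr)
    except (socket.timeout, TimeoutError):
        # The name resolved, but nothing answered on that port in time.
        return "connect_timeout", str(sockaddr[0])
    except ConnectionRefusedError:
        return "refused", str(sockaddr[0])
    except OSError as exc:
        return "other_error", str(exc)
    finally:
        sock.close()
    return "connected", str(sockaddr[0])


# Example call (placeholder host and port):
# print(classify_connect("db.example.com", 3306))
```

A `dns_failure` result would match the immediate curl failure seen on the fly machine, while `connect_timeout` would match what the other Linux machines show.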
We’re as curious as you are. The change we just made (moments ago, not the one from earlier today Jerome was talking about) minimizes DNS responses so they include only the answer records and not all the authority records (“go to these nameservers for further answers about this name”) — in our new DNS configuration, we were (briefly) including those additional records, and now we’re not, which is also what 8.8.8.8 does. Neither option should break anything!
But if assiduously replicating the behavior of 8.8.8.8 keeps apps happy, that’s what we’ll do.
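To make the difference between a minimized and a non-minimized response concrete, here is a rough Python sketch that sends a plain A-record query over UDP and reads the per-section record counts from the DNS response header (RFC 1035 layout). The helper names and the host `db.example.com` are made up for illustration, and ASCII-only hostnames are assumed. A minimized response to a successful lookup would be expected to show an authority count of 0.

```python
import socket
import struct


def build_query(name, qtype=1, query_id=0x1234):
    """Build a minimal DNS query packet for `name` (qtype 1 = A record)."""
    # Header: ID, flags (recursion desired), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)  # QCLASS 1 = IN
    return header + question


def section_counts(packet):
    """Read the record count of each section from a DNS packet header."""
    if len(packet) < 12:
        raise ValueError("DNS packet shorter than the 12-byte header")
    _id, _flags, qd, an, ns, ar = struct.unpack("!HHHHHH", packet[:12])
    return {"question": qd, "answer": an, "authority": ns, "additional": ar}


def query_counts(name, resolver="8.8.8.8", timeout=3.0):
    """Send one UDP query to `resolver` and return its section counts."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (resolver, 53))
        response, _addr = sock.recvfrom(4096)
    return section_counts(response)


# Example call (placeholder host; compare resolvers by changing `resolver`):
# print(query_counts("db.example.com"))
```

Running `query_counts` against two resolvers for the same name gives a quick way to compare whether either one includes authority records.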
I see, thanks for letting me know. Btw, it’s not clear to me that the issue is totally resolved or propagated. I’m seeing some services bounce back up briefly and then go down. Maybe that’s just a caching issue that we now need to push through?
We’re still pushing the change through. We targeted the servers where the app you mentioned was running, but it looks like other apps of yours may also be affected?