Issue with service

davidhodge · January 11, 2022, 9:33pm

hey folks, one of my core services hosted on fly started failing my external health-checks and appears to be down, so I decided to re-deploy it. Unfortunately deployments appear to be not working (or at least erroring), so I’m stuck.

Initially went down about 11:37am.
v240 never seemed to finish, so I had to Ctrl-C
v241 “deployed successfully” but then showed an abort

any ideas?

Thank you

davidhodge · January 11, 2022, 9:41pm

please disregard / hold off sinking time here. This may be an issue on my end with my database that’s preventing my service from working. Will circle back with an update

jerome · January 11, 2022, 9:46pm

Well I looked into it nonetheless and it does appear there’s an issue with your database, or at least reaching your database.

I see the latest version deployed is v242.

davidhodge · January 11, 2022, 9:50pm

Thanks Jerome. Latest theory is it appears there’s a DNS issue when connecting to that database. I’m able to resolve / reach it from outside of fly.io as of right now and can query it, etc. Any recent changes to fly DNS that may explain this? Thanks!

jerome · January 11, 2022, 9:56pm

There was a DNS change about 2 hours ago. It was seamless though.

I just checked across all our hosts and they can all resolve the DNS entry for your database as far as I can tell (and all in the same way). This should mean everything is working fine and your app should also be working. I see it’s still a problem from your logs.

We had to switch because we started getting rate limited by Google DNS. This should help DNS resolution in general.

I’ll dig some more, but I can’t yet explain this issue.

jerome · January 11, 2022, 10:04pm

I ran dig from within one of your VMs directly and was able to get the correct response. I’m not sure why your app can’t resolve it.

I’ve also confirmed in a python REPL within your VM that it can resolve that hostname. It can also connect to it just fine.

Very odd.

davidhodge · January 11, 2022, 10:05pm

Update: I’m seeing this issue resolving the DB across all my fly.io services, not just this one, but not seeing it on my other non-fly hosts.

Maybe there’s some caching going on somewhere at the application layer? Am trying a new deploy now.

kurt · January 11, 2022, 10:09pm

This may not be an issue resolving, despite what the error says. Are you able to connect to mysql with this same hostname from other places?

davidhodge · January 11, 2022, 10:09pm

Update: I’m re-deploying all my fly.io services to see if that does anything.

Kurt: Yes, I’m able to resolve from here in SF at home and (it appears) my other production servers hosted on other hosts don’t have a problem. It’s possible those haven’t retried recently, but I think they likely would have.

kurt · January 11, 2022, 10:11pm

How are you testing resolving? Can you ssh to your Fly app instance and test the same way?

davidhodge · January 11, 2022, 10:13pm

I was just using my desktop mysql client (Sequel Ace). But I can try something else if I can come up with something I could trigger on my fly instances. I don’t have SSH set up on my instances, so I’m not clear on how to do that.

kurt · January 11, 2022, 10:15pm

fly ssh console --app <name> will get you in. This is the python we’re testing with:

import socket
addr = socket.gethostbyname('<rds-hostname>')
print(addr)

You can also run:

dig <rds-hostname>

I do not think this is a DNS problem, despite the message.
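
To tell a resolution failure apart from a connection failure, the two steps can be checked separately. This is only a minimal sketch: `check_host` is a hypothetical helper name, and port 3306 is an assumption (the usual MySQL default), so adjust it to whatever your database actually listens on.

```python
import socket

def check_host(hostname, port=3306, timeout=5.0):
    """Report which stage fails: 'resolve', 'connect', or 'ok'.

    port=3306 is an assumed default (MySQL); it is not taken from the thread.
    """
    try:
        # Step 1: name resolution only, no connection is attempted yet.
        infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return ("resolve", str(exc))

    # Use the first address the resolver returned.
    addr = infos[0][4]
    try:
        # Step 2: TCP connect to the resolved address.
        with socket.create_connection(addr[:2], timeout=timeout):
            pass
    except OSError as exc:
        return ("connect", f"{addr[0]}: {exc}")
    return ("ok", addr[0])
```

Calling `check_host('<rds-hostname>')` from inside the VM and again from another machine should show whether the name lookup or the TCP connection is the step that differs.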

davidhodge · January 11, 2022, 10:17pm

Okay, will try now. I just ran this docker container locally, and also another one, and was able to connect.

davidhodge · January 11, 2022, 10:22pm

trying dig now. Where is dig on these machines? I’ve loaded up bash, but I guess it’s not pulling in all the paths, etc.

davidhodge · January 11, 2022, 10:28pm

hey folks, maybe there’s a different resolver at play? I couldn’t get dig to run, but on the machine I tried to get curl to resolve the host and it immediately failed. Whereas on other linux machines I tried it on (that aren’t at fly), it times out because the port isn’t open.

root@52886617:/# curl us-west-1-teslamonitor-prod-cluster.cluster-ro-cqoe3fww0gbl.us-west-1.rds.amazonaws.com

curl: (6) Could not resolve host: us-west-1-teslamonitor-prod-cluster.cluster-ro-cqoe3fww0gbl.us-west-1.rds.amazonaws.com

jerome · January 11, 2022, 10:28pm

We found the issue and are rolling out a fix. DNS responses with extra fields seem to break your app in particular.

davidhodge · January 11, 2022, 10:29pm

Thank you so much! If you think there’s anything I should do differently to not be vulnerable to this, please let me know.

thomas · January 11, 2022, 10:31pm

We’re as curious as you are. The change we just made (moments ago, not the one from earlier today Jerome was talking about) minimizes DNS responses so they include only the answer records and not all the authority records (“go to these nameservers for further answers about this name”) — in our new DNS configuration, we were (briefly) including those additional records, and now we’re not, which is also what 8.8.8.8 does. Neither option should break anything!

But if assiduously replicating the behavior of 8.8.8.8 keeps apps happy, that’s what we’ll do.
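
One way to see the difference described above is to look at the record counts in a raw DNS reply header: a minimized response has answer records but an authority count of zero. The following is a rough sketch using only the standard library and the RFC 1035 wire format. The function names are hypothetical, and `query_counts` assumes an IPv4 resolver reachable over UDP port 53, which may not match the resolver your VM actually uses.

```python
import socket
import struct

def build_query(hostname, query_id=0x1234):
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: id, flags (recursion desired), QDCOUNT=1, then three zero counts.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question

def section_counts(message):
    """Return the per-section record counts from a DNS message header."""
    _id, _flags, qd, an, ns, ar = struct.unpack("!HHHHHH", message[:12])
    return {"question": qd, "answer": an, "authority": ns, "additional": ar}

def query_counts(hostname, resolver_ip, timeout=3.0):
    """Send one query over UDP and summarize the reply (IPv4 resolver assumed)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(hostname), (resolver_ip, 53))
        reply, _ = sock.recvfrom(4096)
    return section_counts(reply)
```

For example, `query_counts('<rds-hostname>', '8.8.8.8')` returns a dict of counts; comparing its `authority` value with the value from the resolver inside the VM would show whether the two resolvers shape their replies differently.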

davidhodge · January 11, 2022, 10:37pm

I see, thanks for letting me know. Btw, it’s not clear to me that the issue is totally resolved or propagated. I’m seeing some services bounce back up briefly and then go down again. Maybe that’s just a caching issue that we now need to push through?

jerome · January 11, 2022, 10:40pm

We’re still pushing the change through. We targeted the servers where the app you mentioned was running, but it looks like other apps of yours may be affected too?

Rolling out might take a little bit.
