Issue with service

Hey folks, one of my core services hosted on Fly started failing my external health checks and appears to be down, so I decided to redeploy it. Unfortunately, deployments don’t appear to be working (or are at least erroring), so I’m stuck.

It initially went down at about 11:37am.
v240 never seemed to finish, so I had to Ctrl-C.
v241 “deployed successfully” but then showed an abort.

Any ideas?

Thank you

Please disregard / hold off on sinking time into this. This may be an issue on my end with my database that’s preventing my service from working. I’ll circle back with an update.

Well I looked into it nonetheless and it does appear there’s an issue with your database, or at least reaching your database.

I see the latest version deployed is v242.

Thanks Jerome. Latest theory is it appears there’s a DNS issue when connecting to that database. I’m able to resolve / reach it from outside of fly.io as of right now and can query it, etc. Any recent changes to fly DNS that may explain this? Thanks!

There was a DNS change about 2 hours ago. It was seamless though.

I just checked across all our hosts and they can all resolve the DNS entry for your database as far as I can tell (and all in the same way). This should mean everything is working fine and your app should also be working. I see it’s still a problem from your logs.

We had to switch because we started getting rate limited by Google DNS. This should help DNS resolution in general.

I’ll dig some more, but I can’t yet explain this issue.

I ran dig from within one of your VMs directly and was able to get the correct response. I’m not sure why your app can’t resolve it.

I’ve also confirmed in a python REPL within your VM that it can resolve that hostname. It can also connect to it just fine.

Very odd.

Update: I’m seeing this issue resolving the DB across all my fly.io services, not just this one, but not seeing it on my other non-fly hosts.

Maybe there’s some caching going on somewhere at the application layer? Am trying a new deploy now.

This may not be an issue resolving, despite what the error says. Are you able to connect to mysql with this same hostname from other places?

Update: I’m re-deploying all my fly.io services to see if that does anything.

Kurt: Yes, I’m able to resolve from here in SF at home, and (it appears) my other production servers on other hosts don’t have a problem. It’s possible those haven’t retried recently, but I think they likely would have.

How are you testing resolving? Can you ssh to your Fly app instance and test the same way?

I was just using my desktop mysql client (Sequel Ace). But I can try something else if I can come up with something I could trigger on my fly instances. I don’t have SSH set up on my instances, so I’m not clear on how to do that.

fly ssh console --app <name> will get you in. This is the python we’re testing with:

import socket
addr = socket.gethostbyname('<rds-hostname>')
print(addr)

You can also run:

dig <rds-hostname>

I do not think this is a DNS problem, despite the message.
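One way to tell a resolution failure apart from a connection failure is to test the two steps separately. Here is a minimal sketch, assuming the database listens on MySQL's default port 3306. `check_host` is a hypothetical helper name, not something Fly provides.

```python
import socket


def check_host(hostname, port=3306, timeout=5.0):
    """Return 'dns-failure', 'connect-failure', or 'ok' for hostname:port.

    The port default (3306) is an assumption; adjust it for your database.
    """
    try:
        # Step 1: name resolution only; no connection is attempted here.
        infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns-failure"

    # Step 2: try a TCP connection to each resolved address in turn.
    for family, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return "ok"
        except OSError:
            # Refused, timed out, unreachable, etc.: try the next address.
            continue
    return "connect-failure"
```

Running `check_host('<rds-hostname>')` in a Python REPL inside the VM and again on a machine outside Fly should show whether the two environments fail at the same step.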

Okay, will try now. I just ran this Docker container locally, and also another one, and was able to connect from both.

trying dig now. Where is dig on these machines? I’ve loaded up bash, but I guess it’s not pulling in all the paths, etc.

Hey folks, maybe there’s a different resolver at play? I couldn’t get dig to run, but when I had curl try to resolve the host on the Fly machine, it failed immediately. On the other Linux machines I tried (not at Fly), curl instead times out because the port isn’t open.

root@52886617:/# curl us-west-1-teslamonitor-prod-cluster.cluster-ro-cqoe3fww0gbl.us-west-1.rds.amazonaws.com

curl: (6) Could not resolve host: us-west-1-teslamonitor-prod-cluster.cluster-ro-cqoe3fww0gbl.us-west-1.rds.amazonaws.com

We found the issue and are rolling out a fix. DNS responses with extra fields seem to break your app in particular.

Thank you so much! If you think there’s anything I should do differently to not be vulnerable to this, please let me know.

We’re as curious as you are. The change we just made (moments ago, not the one from earlier today that Jerome was talking about) minimizes DNS responses so they include only the answer records and not all the authority records (“go to these nameservers for further answers about this name”). In our new DNS configuration we were (briefly) including those additional records, and now we’re not, which is also what 8.8.8.8 does. Neither option should break anything!
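For anyone wanting to see the difference on the wire: a DNS message header (RFC 1035) carries separate counts for the question, answer, authority, and additional sections. A minimal sketch that reads those counts from raw bytes follows. `section_counts` is a hypothetical name, and the two sample headers are hand-built for illustration, not captured from Fly's resolvers or from 8.8.8.8.

```python
import struct


def section_counts(message: bytes) -> dict:
    """Read the four section counts from a raw DNS message header."""
    if len(message) < 12:
        raise ValueError("a DNS header is 12 bytes")
    # Header layout (RFC 1035): ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT,
    # each an unsigned 16-bit big-endian integer.
    _id, _flags, qd, an, ns, ar = struct.unpack("!6H", message[:12])
    return {"question": qd, "answer": an, "authority": ns, "additional": ar}


# Hand-built headers for illustration only (assumed counts, not real traffic).
# 0x8180 = standard response, recursion desired and available, no error.
full = struct.pack("!6H", 0x1234, 0x8180, 1, 1, 4, 4)     # with extra records
minimal = struct.pack("!6H", 0x1234, 0x8180, 1, 1, 0, 0)  # answer only

print(section_counts(full))
print(section_counts(minimal))
```

A client that handles only the answer section correctly could plausibly trip on the first shape and not the second, though that is a guess about the failure mode, not a confirmed cause.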

But if assiduously replicating the behavior of 8.8.8.8 keeps apps happy, that’s what we’ll do. :slight_smile:

I see, thanks for letting me know. Btw, it’s not clear to me that the fix has fully propagated. I’m seeing some services bounce back up briefly and then go down again. Maybe that’s just a caching issue that we now need to push through?

We’re still pushing through the change :slight_smile: We targeted the servers where the app you mentioned was running, but it looks like other apps of yours may be affected too?

Rolling out might take a little bit.