Help with "could not translate host name" error

danieledelgiudice · September 20, 2024, 4:29pm

Hello Fly.io community,

We’re encountering an issue with a new app setup, and we’re hoping someone can help us figure out what’s going wrong.

We have two services running: an application and a PostgreSQL database.
We’ve used the fly postgres attach command to attach the database to the app service.

Everything was fine for the first few minutes, but after some deploys we’ve begun receiving the following error:

could not translate host name "<snip>.flycast" to address: Name or service not known

Some more details:

The omitted name is the name of the DB service
It works intermittently so the configuration should be mostly ok
I am NOT trying to connect from my machine, I get the error from the app service
From the dashboard everything looks ok (both services are green)
We’re able to connect manually using the Fly CLI even when it’s failing on the app service
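
To tell an intermittent resolver failure apart from a wrong hostname, one option is to retry name resolution from inside the app machine and count the failures. This is only a minimal sketch, assuming Python is available in the app image; `resolve_with_retry` and its parameters are hypothetical names, and the hostname in the demo is a placeholder rather than the real `.flycast` name:

```python
import socket
import time


def resolve_with_retry(host, port=5432, attempts=5, delay=1.0,
                       resolver=socket.getaddrinfo, sleep=time.sleep):
    """Try to resolve `host` up to `attempts` times.

    Returns (addresses, failures): the sorted unique addresses from the
    first successful lookup (an empty list if every lookup failed) and
    the number of failed lookups.
    """
    failures = 0
    for attempt in range(attempts):
        try:
            infos = resolver(host, port)
        except socket.gaierror:
            # Same class of failure that libpq reports as
            # "could not translate host name ... to address".
            failures += 1
            if attempt < attempts - 1:
                sleep(delay)
            continue
        # Each entry is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] is the IP address as a string.
        return sorted({info[4][0] for info in infos}), failures
    return [], failures


if __name__ == "__main__":
    # Placeholder: substitute the database app's .flycast hostname.
    addrs, failed = resolve_with_retry("localhost")
    print(f"resolved={addrs} failed_lookups={failed}")
```

If the failure count is non-zero on one machine and always zero on the other, that points at the machine rather than at the connection string.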

Also, don’t know if it’s related but we’re having issues connecting via SSH using fly ssh console.
We can connect if we use the -s flag to manually choose a specific machine, but the other one just won’t work. We don’t have any kind of VPN setup. This is the error we get:

Error: error connecting to SSH server: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain

Any ideas on what might be wrong or what we should check?
Or is it just a DNS outage?

Thanks in advance for any guidance!

mayailurus · September 20, 2024, 4:47pm

Hi… Try fly logs and then repeat the SSH attempt (in a separate terminal). This may be a return of the mysterious _orgcert.internal glitch…

(That would be fly logs -a <snip> (database name) if it’s the Postgres machine that you’re trying to SSH into.)

danieledelgiudice · September 20, 2024, 5:08pm

Yeah I think that’s the issue.

Our application service has two machines attached (let’s call them A and B), and one of them gives issues while connecting via SSH (let’s say B).

I tried forcing each specific machine to serve the response by stopping the other: when only A was active the application could connect to the database, and when only B was active it could not.

This is the error from the logs when I try to connect:

2024-09-20T16:57:43Z app[<snip>] cdg [info]2024/09/20 16:57:43 ERROR unexpected error fetching cert error="transient SSH server error: can't resolve _orgcert.internal"
2024-09-20T16:57:43Z app[<snip>] cdg [info]2024/09/20 16:57:43 ERROR unexpected error error="[ssh: no auth passed yet, transient SSH server error: can't resolve _orgcert.internal]"
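
For anyone wanting to watch for this signature over a longer stretch, here is a rough Python sketch that counts matching lines in a saved log file. The function name is hypothetical, and the matched substring is simply taken from the two log lines above; it is an assumption, not a documented error identifier:

```python
import sys

# Substring copied from the log lines quoted above (assumption: the
# wording stays stable across occurrences of this glitch).
SIGNATURE = "can't resolve _orgcert.internal"


def count_orgcert_errors(lines):
    """Return how many of the given log lines contain the signature."""
    return sum(1 for line in lines if SIGNATURE in line)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_orgcert.py saved-logs.txt
    with open(sys.argv[1], encoding="utf-8") as fh:
        print(count_orgcert_errors(fh))
```

The file would be whatever was captured from `fly logs` while repeating the SSH attempt.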

I tried re-creating the machine but the new one has the same issue. I could disable auto-scaling and keep the one machine working (traffic won’t be high) but I fear it could break anytime and go offline.

Any tips on how to solve this aside from re-creating the organization like the other poster did?

mayailurus · September 20, 2024, 8:06pm

That does look pretty conclusive…

Right… It’s generally important to avoid single-machine deployments on Fly.

Unfortunately it’s unclear what fixes these—or even what the underlying cause is.

(Older posts suggest that it’s a metadata synchronization lag within the infrastructure, but those internals may have changed a lot in the interim.)

The Fly.io platform as a whole seems under increased strain this week, so perhaps simply waiting a little and then retrying machine re-creation, during off-peak hours, might shake things loose. (I would keep B listed but stopped—and then fly m clone machine A. This minimizes the odds of landing in the exact same (glitch-prone?) spot as before.) It might be that API calls are silently faulting during the machine’s setup phase, or that its particular physical host is having load-related network problems.

Failing that, it might suffice to create a new application, rather than an entire new organization.

system · September 22, 2024, 8:06pm

This topic was automatically closed 2 days after the last reply. New replies are no longer allowed.
