Help with "could not translate host name" error

Hello Fly.io community,

We’re encountering an issue with a new app setup, and we’re hoping someone can help us figure out what’s going wrong.

We have two services running: an application and a PostgreSQL database.
We’ve used the fly postgres attach command to attach the database to the app service.

Everything was fine for the first few minutes, but after some deploys we began receiving the following error:

could not translate host name "<snip>.flycast" to address: Name or service not known

Some more details:

  • The omitted name is the name of the DB service
  • It works intermittently so the configuration should be mostly ok
  • I am NOT trying to connect from my machine, I get the error from the app service
  • From the dashboard everything looks ok (both services are green)
  • We’re able to connect manually using the Fly CLI even when it’s failing on the app service
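Since the failure is intermittent, one way to narrow it down is to probe name resolution repeatedly from inside each app machine. The following is only a minimal sketch: probe_resolution is a hypothetical helper, port 5432 is assumed as the usual Postgres port, and the host name you pass in would be your own database's .flycast name, not anything taken from this thread.

```python
import socket
import time


def probe_resolution(host, attempts=5, delay=1.0, resolver=socket.getaddrinfo):
    """Resolve `host` several times and return a list of (ok, detail) tuples."""
    results = []
    for i in range(attempts):
        try:
            infos = resolver(host, 5432)  # 5432: assumed Postgres port
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the resolved address.
            addrs = sorted({info[4][0] for info in infos})
            results.append((True, ", ".join(addrs)))
        except socket.gaierror as exc:
            # Resolver failures surface here; this is the same class of error
            # that shows up as "could not translate host name".
            results.append((False, str(exc)))
        if i < attempts - 1:
            time.sleep(delay)
    return results
```

Running probe_resolution with the database's .flycast name on each machine and comparing the results should show whether the lookups fail on only one machine or on all of them.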

Also, don’t know if it’s related but we’re having issues connecting via SSH using fly ssh console.
We can connect if we use the -s flag to manually choose a specific machine, but the other one just won’t work. We don’t have any kind of VPN setup. This is the error we get:

Error: error connecting to SSH server: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain

Any ideas on what might be wrong or what we should check?
Or is it just a DNS outage?

Thanks in advance for any guidance!

Hi… Try fly logs and then repeat the SSH attempt (in a separate terminal). This may be a return of the mysterious _orgcert.internal glitch…

(That would be fly logs -a <snip> (database name) if it’s the Postgres machine that you’re trying to SSH into.)


Yeah I think that’s the issue.

Our application service has two machines attached (let’s call them A and B), and one of them (let’s say B) fails when we try to connect via SSH.

I tried forcing each specific machine to serve the response by stopping the other: when only A was active the application could connect to the database, and when only B was active it could not.

This is the error from the logs when I try to connect:

2024-09-20T16:57:43Z app[<snip>] cdg [info]2024/09/20 16:57:43 ERROR unexpected error fetching cert error="transient SSH server error: can't resolve _orgcert.internal"
2024-09-20T16:57:43Z app[<snip>] cdg [info]2024/09/20 16:57:43 ERROR unexpected error error="[ssh: no auth passed yet, transient SSH server error: can't resolve _orgcert.internal]"
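For anyone checking their own logs for the same signature, a small filter can pick out the relevant lines. This is only a sketch: find_orgcert_errors is a hypothetical helper, and the pattern is simply the error text quoted in the two log lines above, so it may miss differently worded variants.

```python
import re

# Failure signature taken from the log lines quoted above.
ORGCERT_PATTERN = re.compile(r"can't resolve _orgcert\.internal")


def find_orgcert_errors(lines):
    """Return the log lines that mention the _orgcert.internal resolution failure."""
    return [line.rstrip("\n") for line in lines if ORGCERT_PATTERN.search(line)]
```

The function accepts any iterable of strings, so it can be fed saved log output or a stream such as sys.stdin while the SSH attempt is repeated in another terminal.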

I tried re-creating the machine but the new one has the same issue. I could disable auto-scaling and keep the one machine working (traffic won’t be high) but I fear it could break anytime and go offline.

Any tips on how to solve this aside from re-creating the organization like the other poster did?


That does look pretty conclusive…

Right… It’s generally important to avoid single-machine deployments on Fly, :dragon:.

Unfortunately it’s unclear what fixes these—or even what the underlying cause is.

(Older posts suggest that it’s a metadata synchronization lag within the infrastructure, :snowflake:, but those internals may have changed a lot in the interim.)

The Fly.io platform as a whole seems to be under increased strain this week, so perhaps simply waiting a little and then retrying machine re-creation, during off-peak hours, might shake things loose. (I would keep B listed but stopped, and then run fly m clone on machine A. Keeping B in place minimizes the odds of the new machine landing in the exact same (glitch-prone?) spot as before.) It might be that API calls are silently faulting during the machine’s setup phase, or that its particular physical host is having load-related network problems.

Failing that, it might suffice to create a new application, rather than an entire new organization, :thought_balloon:.
