We’re encountering an issue with a new app setup, and we’re hoping someone can help us figure out what’s going wrong.
We have two services running: an application and a PostgreSQL database.
We’ve used the fly postgres attach command to attach the database to the app service.
Everything was fine for the first few minutes, but after some deploys we’ve begun receiving the following error:
could not translate host name "<snip>.flycast" to address: Name or service not known
Some more details:
The omitted name is the name of the DB service
It works intermittently so the configuration should be mostly ok
I am NOT trying to connect from my machine, I get the error from the app service
From the dashboard everything looks ok (both services are green)
We’re able to connect manually using the Fly CLI even when it’s failing on the app service
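In case it helps anyone trying to reproduce the intermittent part, here is a minimal sketch of a probe that could be run from inside an app machine to count how often the name resolves. It assumes the connection string lives in a DATABASE_URL environment variable; the helper names host_from_url and probe_dns are made up for this example, and the port 5432 is just the usual Postgres default.

```python
import os
import socket
import time
from urllib.parse import urlparse


def host_from_url(database_url):
    """Extract the hostname from a postgres:// style connection URL."""
    return urlparse(database_url).hostname


def probe_dns(host, attempts=10, delay=1.0, resolver=socket.getaddrinfo):
    """Resolve `host` several times and return (successes, failure_messages).

    `resolver` is injectable so the function can be exercised without a network.
    """
    successes = 0
    failures = []
    for i in range(attempts):
        try:
            resolver(host, 5432)  # 5432 is assumed; only name resolution matters here
            successes += 1
        except socket.gaierror as exc:
            # This is the error class behind "Name or service not known"
            failures.append(str(exc))
        if delay and i < attempts - 1:
            time.sleep(delay)
    return successes, failures


if __name__ == "__main__":
    url = os.environ.get("DATABASE_URL")  # assumption: set on the app machine
    if not url:
        print("DATABASE_URL is not set; nothing to probe")
    else:
        host = host_from_url(url)
        ok, failed = probe_dns(host)
        print(f"{host}: {ok} resolved, {len(failed)} failed")
        for message in failed:
            print("  ", message)
```

Running it on each machine separately should show whether the failures are tied to one machine or spread across both.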
Also, don’t know if it’s related but we’re having issues connecting via SSH using fly ssh console.
We can connect if we use the -s flag to manually choose a specific machine, but the other one just won’t work. We don’t have any kind of VPN setup. This is the error we get:
Error: error connecting to SSH server: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain
Any ideas on what might be wrong or what we should check?
Or is it just a DNS outage?
Our application service has two machines attached (let’s call them A and B), and one of them gives issues while connecting via SSH (let’s say B).
I tried forcing each specific machine to serve the response by stopping the other: when only A was active the application could connect to the database, and when only B was active it could not.
This is the error from the logs when I try to connect:
I tried re-creating the machine but the new one has the same issue. I could disable auto-scaling and keep the one machine working (traffic won’t be high) but I fear it could break anytime and go offline.
Any tips on how to solve this aside from re-creating the organization like the other poster did?
Right… It’s generally important to avoid single-machine deployments on Fly.
Unfortunately it’s unclear what fixes these—or even what the underlying cause is.
(Older posts suggest that it’s a metadata synchronization lag within the infrastructure, but those internals may have changed a lot in the interim.)
The Fly.io platform as a whole seems under increased strain this week, so perhaps simply waiting a little and then retrying machine re-creation, during off-peak hours, might shake things loose. (I would keep B listed but stopped—and then fly m clone machine A. This minimizes the odds of landing in the exact same (glitch-prone?) spot as before.) It might be that API calls are silently faulting during the machine’s setup phase, or that its particular physical host is having load-related network problems.
Failing that, it might suffice to create a new application, rather than an entire new organization.