Fly-replay doesn't work for new machines

itayelgazar · May 26, 2025, 8:57pm

Hey - I have a gateway that redirects requests based on a subdomain.

In the past 2-3 hours, the proxy hasn't been able to redirect users' requests to newly created machines.

I get:

54 [PR03] could not find a good candidate within 1 attempts at load balancing. last error: [PA04] app 'superdev-api-gateway' used 'fly-replay' response header to target app 'mysuperdev', which we cannot find

2025-05-26 22:40:50.154 [PR03] could not find a good candidate within 1 attempts at load balancing. last error: [PA04] app 'superdev-api-gateway' used 'fly-replay' response header to target app 'mysuperdev', which we cannot find

I'd appreciate some help, as it's a major outage for us at the moment.

lillian · May 26, 2025, 9:32pm

I can’t find an app named mysuperdev on the platform? are you sure the app name is correct?

since you mentioned this is a major outage: please keep in mind this is a community forum and Fly.io staff don’t consistently read it; for production apps I would recommend purchasing Premium Support to talk to our support engineering staff about the issues you are having. https://fly.io/docs/about/support/

itayelgazar · May 26, 2025, 9:46pm

Doing it right now - I wasn’t aware.

It looks like an outage with the machines in Sweden, as when I added different machines it’s back to normal again.

Regarding the app name - that's what's odd. The app name we pass to fly-replay is derived from the subdomain, and the domain itself is mysuperdev.app, e.g. someappname.mysuperdev.app should resolve to the app "someappname". Anyway, scaling machines in a different location solved the issue. Is there anything happening in the Sweden region?

lillian · May 26, 2025, 9:55pm

I’m not aware of any issues in arn (I actually run quite a few of my workloads there and haven’t had any trouble).
either way, if it’s working now, your issue isn’t related to the logs you provided; I tried accessing mysuperdev.mysuperdev.app and saw the same messages in your app logs again, which are expected. you should check the user’s project actually exists before replaying to it!
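
For readers building a similar gateway, here is a minimal sketch of the check suggested above: derive the target app from the subdomain, and only answer with a `fly-replay: app=...` header when that app is known to exist. The registry of provisioned apps, the function names, and the 404 fallback are assumptions for illustration; the thread does not show how the actual gateway is written.

```typescript
// Hypothetical gateway helper (not the poster's actual code).
const BASE_DOMAIN = "mysuperdev.app"; // domain mentioned in the thread

// Assumption: the gateway keeps its own list of provisioned apps,
// for example loaded from a database when a user project is created.
type AppRegistry = ReadonlySet<string>;

/** Returns the single-label subdomain of BASE_DOMAIN, or null if there is none. */
function appNameFromHost(host: string): string | null {
  const hostname = host.toLowerCase().replace(/:\d+$/, ""); // drop any port
  const suffix = "." + BASE_DOMAIN;
  if (!hostname.endsWith(suffix)) return null; // apex or unrelated domain
  const label = hostname.slice(0, -suffix.length);
  // Accept exactly one DNS label made of letters, digits, and hyphens.
  return /^[a-z0-9]([a-z0-9-]*[a-z0-9])?$/.test(label) ? label : null;
}

type RoutingDecision =
  | { kind: "replay"; headers: Record<string, string> }
  | { kind: "not_found"; status: 404 };

/** Decides whether to replay the request to another app or reject it locally. */
function routeRequest(host: string, registry: AppRegistry): RoutingDecision {
  const app = appNameFromHost(host);
  if (app === null || !registry.has(app)) {
    // Unknown project: answer here instead of replaying to a missing app.
    return { kind: "not_found", status: 404 };
  }
  return { kind: "replay", headers: { "fly-replay": `app=${app}` } };
}
```

With a check like this, a request for mysuperdev.mysuperdev.app would get a plain 404 from the gateway rather than a replay to an app the proxy cannot find.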

itayelgazar · May 27, 2025, 6:22am

Issue is back:

Regarding the “mysuperdev.mysuperdev.app”

That's the thing, I've never tried to access such an app - I tried accessing an actual app that's up and running (e.g. https://hq63z22dhqj9ns40rgoei.mysuperdev.app)

When I try to access the machine directly through fly.dev - it works.

That’s what I get in the proxy logs:

2025-05-27T06:18:14.945 proxy[2876d16f02d008] ewr [info] machine became reachable in 2.179091511s

2025-05-27T06:18:14.951 proxy[2876d16f02d008] ewr [error] [PR03] could not find a good candidate within 1 attempts at load balancing. last error: [PM11] machine was recently stopped and is unavailable to service request

2025-05-27T06:18:15.427 proxy[286e175fe46ed8] ewr [error] [PM11] machine was recently stopped and is unavailable to service request

2025-05-27T06:18:15.451 proxy[2876d16f02d008] ewr [error] [PR03] could not find a good candidate within 1 attempts at load balancing. last error: [PM11] machine was recently stopped and is unavailable to service request

[PR04] could not find a good candidate within 20 attempts at load balancing
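
When comparing incidents like this one, it can help to tally which bracketed codes show up per log line. The sketch below parses lines in the shape pasted above (timestamp, `proxy[machine]`, region, level, message) and counts the codes. The line format is inferred only from the lines in this thread and the function names are made up for illustration, so treat it as a rough helper rather than a parser for every proxy log format.

```typescript
interface ProxyLogEntry {
  timestamp: string;
  machineId: string;
  region: string;
  level: string;
  codes: string[]; // bracketed codes such as PR03 or PM11, in order of appearance
  message: string;
}

// Assumed shape: "<timestamp> proxy[<machine id>] <region> [<level>] <message>"
const LINE_PATTERN = /^(\S+)\s+proxy\[([0-9a-f]+)\]\s+(\w+)\s+\[(\w+)\]\s*(.*)$/;

/** Collects codes like [PR03] or [PM11] from a message. */
function extractCodes(message: string): string[] {
  const codes: string[] = [];
  const pattern = /\[(P[A-Z]\d{2})\]/g;
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(message)) !== null) {
    codes.push(m[1] ?? "");
  }
  return codes;
}

/** Parses one proxy log line; returns null for lines in any other shape. */
function parseProxyLine(line: string): ProxyLogEntry | null {
  const match = LINE_PATTERN.exec(line.trim());
  if (match === null) return null;
  const message = match[5] ?? "";
  return {
    timestamp: match[1] ?? "",
    machineId: match[2] ?? "",
    region: match[3] ?? "",
    level: match[4] ?? "",
    codes: extractCodes(message),
    message,
  };
}

/** Counts how often each code appears across all parseable lines. */
function tallyCodes(lines: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of lines) {
    const entry = parseProxyLine(line);
    if (entry === null) continue;
    for (const code of entry.codes) {
      counts.set(code, (counts.get(code) ?? 0) + 1);
    }
  }
  return counts;
}
```

In the lines above, the code after "last error:" (PM11 here) names the underlying cause, while PR03 and PR04 report that load balancing gave up.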

itayelgazar · May 27, 2025, 6:49am

I live in AMS. I scaled another machine in AMS for the proxy app, and now it's back to normal again.

Opened a support ticket

Jerrick · May 27, 2025, 7:22am

I've been trying a variety of things for the last 2 hours. In my head, I was like, I swear this was working for the past month with no changes that should affect our fly.io setup. After seeing this post I think we're facing the same issue. My requests are not hitting the target VM but hitting the "replay/router/proxy" app. I double-checked and tried changing target ports and scaled the router app, but nothing fixed it for me.

I'm getting the same issue you mentioned:
2025-05-27T07:18:17Z app[e784049ef36783] iad [info]Routing subdomain “7815e6ec5995e8” to app
2025-05-27T07:18:17Z proxy[e784049ef36783] atl [error][PR04] could not find a good candidate within 20 attempts at load balancing

You created a new proxy machine in the same region as the previous one and that worked for you?

itayelgazar · May 27, 2025, 7:34am

At least I’m not alone

I did not - I created a proxy machine in a different region and that worked. I opened a support request, will keep you posted.

Keep me posted too!

PeterCxy · May 27, 2025, 5:28pm

Hi @Jerrick

Your case is a bit different in that the replay failures seem to be related to the fact that machines in your replay target app have different sets of public port definitions from each other. Some of them overlap (for example, there are two different machines defining port 443 in slightly different ways), and in this case the proxy may not be able to decide which definition it should pick, causing intermittent replay errors depending on exactly which one gets picked.

The proxy expects that all machines in an app servicing a certain port have more or less the exact same service definitions. So, for example, an external port should map to the same internal port and vice-versa, with the same set of handlers. A mismatch is allowed temporarily during a deployment, but should not be something permanent. If your machines are expected to expose different sets of ports, and are expected to belong to different customers, we recommend using an app-per-customer architecture (see "Per-User Dev Environments with Fly Machines" in the Fly Docs) both to improve isolation (machines in one app can still access each other!), and to make sure the proxy behaves predictably when routing requests / replays.
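
A rough way to spot this kind of mismatch before the proxy does is to compare service definitions across the machines of one app. The sketch below groups machines by protocol and external port and reports any listener that maps to more than one internal port or handler set. The `MachineDef` shape is a simplified, hypothetical stand-in for real Machine configs, and treating handler order as irrelevant is an assumption.

```typescript
// Simplified, hypothetical shapes; real Machine configs carry more fields.
interface PortDef { port: number; handlers: string[]; }
interface ServiceDef { protocol: string; internalPort: number; ports: PortDef[]; }
interface MachineDef { id: string; services: ServiceDef[]; }

interface PortConflict {
  listener: string; // "<protocol>/<external port>", e.g. "tcp/443"
  variants: { signature: string; machineIds: string[] }[];
}

/** Reports listeners whose definition differs between machines of the same app. */
function findServiceConflicts(machines: MachineDef[]): PortConflict[] {
  // listener -> signature -> ids of machines using that signature
  const byListener = new Map<string, Map<string, string[]>>();
  for (const machine of machines) {
    for (const service of machine.services) {
      for (const portDef of service.ports) {
        const listener = `${service.protocol}/${portDef.port}`;
        // Assumption: handler order does not matter, so sort a copy before comparing.
        const handlers = portDef.handlers.slice().sort().join("+");
        const signature = `${service.internalPort}:${handlers}`;
        const variants = byListener.get(listener) ?? new Map<string, string[]>();
        const ids = variants.get(signature) ?? [];
        ids.push(machine.id);
        variants.set(signature, ids);
        byListener.set(listener, variants);
      }
    }
  }
  const conflicts: PortConflict[] = [];
  byListener.forEach((variants, listener) => {
    if (variants.size > 1) {
      const list: { signature: string; machineIds: string[] }[] = [];
      variants.forEach((machineIds, signature) => list.push({ signature, machineIds }));
      conflicts.push({ listener, variants: list });
    }
  });
  return conflicts;
}
```

For example, two machines that both expose tcp/443 but forward it to internal ports 8080 and 80 respectively would be reported as one conflict with two variants.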

Jerrick · May 27, 2025, 5:55pm

Ah! Thank you very much. You're right; my replay app was on 8080 and the target was listening on 80. Changing it to 8080 and making the Node.js server run on 8080 fixed it! Thank you for your response, as it pointed me in the right direction!

itayelgazar · May 27, 2025, 6:08pm

Hey @PeterCxy, any insights about my case?

system · June 3, 2025, 6:09pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.
