fly-replay to specific instance failing despite Machine being "Started" and responsive

Hi everyone,

I’ve been facing this issue for quite a while now. I’m building a FastAPI app on Fly.io and I’m having trouble getting fly-replay to target a specific machine reliably.

The Setup: When machine_A receives a request and has too many (long-running) background tasks open, it checks whether the request can be handed off to an alternative machine_B. To do so, I follow this exact sequence via the Machines API:

Discovery: I fetch the list of machines in the app.

Start: I call /v1/apps/{app_name}/machines/{machine_id}/start for the target machine.

Wait: I call /v1/apps/{app_name}/machines/{machine_id}/wait?state=started&instance={instance_id} to ensure it’s up.

Health Check: I even perform an internal request to the worker’s /health endpoint to confirm the FastAPI server is actually listening.

Replay: Once confirmed, I return a 200 OK with the header: fly-replay: instance=<machine_id>.
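For clarity, the sequence above can be sketched like this (the helper names are mine; only the URL builders and the header are shown, and token handling plus the actual HTTP calls are omitted):

```python
# Sketch of the offload sequence described above. The base URL is the public
# Machines API endpoint; the function names are hypothetical.

API = "https://api.machines.dev/v1"

def start_machine_url(app_name: str, machine_id: str) -> str:
    # Step 2: start the target machine
    return f"{API}/apps/{app_name}/machines/{machine_id}/start"

def wait_started_url(app_name: str, machine_id: str, instance_id: str) -> str:
    # Step 3: block until the machine reports the "started" state
    return f"{API}/apps/{app_name}/machines/{machine_id}/wait?state=started&instance={instance_id}"

def replay_headers(machine_id: str) -> dict:
    # Step 5: the only thing the proxy needs on the 200 OK response
    return {"fly-replay": f"instance={machine_id}"}
```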

The Problem: The request never arrives at the worker machine_B.

If I use fly-replay: elsewhere=true, it works perfectly (the request is moved to another machine).

If I use instance=<machine_id>, it fails silently or the client hangs/gets an error.

I have confirmed that machine_id is the short alphanumeric ID (e.g., d896525a615e48). I’ve also tried adding a 2-second sleep before the replay to account for proxy propagation, but no luck.

Questions:

Does anyone see what I am missing? Do I need to provide the instance_id rather than the machine_id? And if so, is the instance_id ephemeral, i.e. does it change between machine restarts?

Is there a way to view the Fly Proxy logs specifically? I can see my app logs in Grafana, but I can’t find the logs that show why the proxy is discarding or failing a replay hand-off.

Any help or insight would be much appreciated. Thanks, and have a great day!

Hi… No, the Machine ID is what you want: the first column of `fly m list`.

I just tried over here on a small test app of my own, and fly-replay: instance=<machine-id> does work. (It also conveniently auto-started the other Machine, since I had auto_start_machines = true in fly.toml.)

Are you perhaps crossing an app boundary in your case? In that situation, you need an explicit app= knob, last I checked:

https://community.fly.io/t/fly-replay-header-does-not-work-with-instance-id/23852

Are you by any chance sending a somewhat large body with your requests (i.e. POSTs, PUTs)? If so, fly-replay may not work that well, since we don’t buffer bodies indefinitely, and by the time your target machine starts, the buffer might have already expired (hence we can’t reuse it to replay the request elsewhere).

Could you maybe try setting `auto_start_machines = true` and returning the fly-replay as early as possible, without invoking the Machines API yourself?


Hey, you guys are amazing, thanks for the quick answers.

@PeterCxy Thank you for mentioning the potential buffering issue. I don’t think the body is very large (around 10–20 key-value pairs, each a single word). It also seems fine, because the `elsewhere=true` replay option works as expected, and the buffering strategy shouldn’t differ between the two, right?

@mayailurus, thanks for bringing up the app boundary idea. Unfortunately, specifying app={app_name} in addition to instance={machine_id} made no difference. Also, thank you for the hint about `auto_start_machines = true`; I do have that set already.

So there is no easy way to access the load balancer’s logs, so that I could investigate what it receives and what it does with the replay?

And is it true that the body of the response doesn’t really matter, as long as the headers contain the “fly-replay” key?

If the machine start / confirmation over the Machines API took longer, then that could make a difference, but if the only change you made was elsewhere vs. instance=, then it’s probably not that.

Could you share the app name so that I could look into this from our end?

Hi @PeterCxy, yes, indeed the only change I made was from `{"fly-replay": f"instance={machine_id}"}` to `{"fly-replay": f"instance={machine_id};elsewhere=true"}`.

BUT since you mentioned buffering, I ran some (impractical) tests to check that. When a request enters the /run endpoint on our machine, it does several things, mainly rate limiting and checking on Supabase whether the user is authenticated. I removed these two steps and returned the “fly-replay” immediately. And… it does work :star_struck:. However, I can’t really deploy the backend without the authentication check and the rate limiter, so I will try to make that overhead as fast as possible. It is still strange, though, that `elsewhere` at the same place, after the same amount of work, did send the request to the other machine.

The app name is `test-confidential-fraudcrawler-backend` from the organization `veanu`

Could these be done on the replay target machine instead, if the point is to reduce load on the source machine?

Another option, if you also control the client, is to send a special rejection response from the machine telling the client to resend the request with fly-force-instance-id. That way you are not limited by the body lifecycle during a fly-replay.
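A minimal sketch of that client-driven flow (the rejection shape and helper names are made up; `fly-force-instance-id` is the request header the proxy honors):

```python
# Server side: instead of replaying, reject with a hint naming the target
# machine. The 409 status and "retry_on" field are arbitrary choices here.
def busy_rejection(target_machine: str) -> dict:
    return {"status": 409, "body": {"retry_on": target_machine}}

# Client side: resend the original request pinned to that machine, so the
# full body travels again and no proxy-side buffering is involved.
def retry_headers(rejection: dict) -> dict:
    return {"fly-force-instance-id": rejection["body"]["retry_on"]}
```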

That is actually a very good idea; those checks can indeed easily be done AFTER deciding whether to replay elsewhere or not. :grinning_face:

Also, the second option of having the client take care of the replay is nice, although I prefer the backend to fully handle this if possible, because it keeps a clear separation between backend and frontend.

Thank you once again, I will try that out and come back as soon as there are some news :smiley:

It actually does work, thank you so much @PeterCxy. I still do not understand exactly why elsewhere=true worked but instance={machine_id} did not.

But I am very happy to have a solution.

Just one minor follow-up question: is it possible to tell the proxy to resend the same request after a given delay? Or, after what you explained about buffering, I guess the backend should just send a 503 error and let the client handle it, right?

Once again, thank you so much for your help, very much appreciated.:star_struck:


This is on our roadmap though I don’t have an exact ETA – we’ve been meaning to add something like “send the request back to origin / another machine / app if it failed” for a while, but we’re still hashing out the exact semantics of such a feature.

Cool, I will keep myself updated and in the meantime ask the client to handle this. Have a great day!
