Hi everyone,
I have been facing an issue for quite a while now. I’m building a FastAPI app on Fly.io and I’m having trouble getting fly-replay to target a specific machine reliably.
The Setup: When machine_A receives a request and already has too many long-running background tasks open, it checks whether the request can be sent to an alternative machine, machine_B. To do so, I follow this exact sequence via the Machines API:
1. Discovery: I fetch the list of machines in the app.
2. Start: I call /v1/apps/{app_name}/machines/{machine_id}/start for the target machine.
3. Wait: I call /v1/apps/{app_name}/machines/{machine_id}/wait?state=started&instance={instance_id} to ensure it’s up.
4. Health Check: I even perform an internal request to the worker’s /health endpoint to confirm the FastAPI server is actually listening.
5. Replay: Once confirmed, I return a 200 OK with the header: fly-replay: instance=<machine_id>.
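To make the sequence concrete, here is a minimal sketch of the Start, Wait, and Replay steps. The helper names (start_path, wait_path, replay_headers) are made up for this post, and the HTTP calls themselves are left out. Only the paths, the query parameters, and the header value are copied from the steps above.

```python
from urllib.parse import urlencode


def start_path(app_name: str, machine_id: str) -> str:
    """Path called in the Start step (hypothetical helper)."""
    return f"/v1/apps/{app_name}/machines/{machine_id}/start"


def wait_path(app_name: str, machine_id: str, instance_id: str) -> str:
    """Path called in the Wait step, with the query string used in this post."""
    query = urlencode({"state": "started", "instance": instance_id})
    return f"/v1/apps/{app_name}/machines/{machine_id}/wait?{query}"


def replay_headers(machine_id: str) -> dict:
    """Response header returned in the Replay step."""
    return {"fly-replay": f"instance={machine_id}"}


if __name__ == "__main__":
    # "my-app" and "01ABC" are placeholder values, not real identifiers.
    print(start_path("my-app", "d896525a615e48"))
    print(wait_path("my-app", "d896525a615e48", "01ABC"))
    print(replay_headers("d896525a615e48"))
```

In the FastAPI handler, the replay is then roughly `Response(status_code=200, headers=replay_headers(machine_id))`.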
The Problem: The request never arrives at the worker machine_B.
If I use fly-replay: elsewhere=true, it works perfectly (the request is moved to another machine).
If I use instance=<machine_id>, it fails silently or the client hangs/gets an error.
I have confirmed that machine_id is the short alphanumeric ID (e.g., d896525a615e48). I’ve also tried adding a 2-second sleep before the replay to account for proxy propagation, but no luck.
Questions:
Does anyone see what I am missing? Do I need to provide the instance_id rather than the machine_id? If so, is the instance_id ephemeral, i.e. does it change between machine restarts?
Is there a way to view the Fly Proxy logs specifically? I can see my app logs in Grafana, but I can’t find the logs that show why the proxy is discarding or failing a replay hand-off.
Any help or insight would be much appreciated. Thanks and have a great day!