Return to sender: Fallbacks for fly-replay

Our fly-replay feature has two new fields you can set to handle cases where a replay cannot succeed. This might happen if, say, the app you’re replaying to is unhealthy or unavailable.

By default, like most requests, a replay will try for a while before throwing a generic 5xx error back to the client. This isn’t always the best course of action, so we now have a couple of new replay fields you can use to tweak this behavior: replay timeout and fallback.

Replay Timeout

The new timeout field sets how long the proxy should try to reach the replay target before giving up. It accepts duration strings like 10s or 800ms:

fly-replay: app=my-worker;timeout=2s

On its own, setting a timeout just makes the replay fail faster.
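If you’re assembling this header in app code, the value is just semicolon-separated key=value pairs. Here’s a minimal sketch in Python (the `replay_header` helper is illustrative, not part of any Fly SDK):

```python
def replay_header(app, timeout=None, fallback=None):
    """Build a fly-replay header value like "app=my-worker;timeout=2s"."""
    parts = [f"app={app}"]
    if timeout is not None:
        parts.append(f"timeout={timeout}")
    if fallback is not None:
        parts.append(f"fallback={fallback}")
    return ";".join(parts)

# e.g., in a response from your router app:
# response.headers["fly-replay"] = replay_header("my-worker", timeout="2s")
```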

Replay Fallback

The new fallback field tells our proxy to route the request back to the original Machine that issued the replay, instead of erroring. There are two modes:

  • force_self — route back to the exact Machine that issued the replay, otherwise error.
  • prefer_self — try the original Machine first, but fall back to any Machine in the original app.

fly-replay: app=my-worker;timeout=5s;fallback=force_self

When a fallback occurs, your app gets the request again with a fly-replay-failed header containing metadata about what went wrong. You can use this to better track the error in your app, and you can also respond with a first-class error to your client, rather than the curt 5xx from our proxy.
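On the receiving side, that might look like the following sketch (framework-agnostic, taking headers as a plain dict; the 503 body is an assumption, and the post only guarantees that the fly-replay-failed header is present, not what its value looks like):

```python
def handle_request(headers):
    """Return (status, body); detect a request that came back via fallback."""
    failed = headers.get("fly-replay-failed")
    if failed is not None:
        # The replay never reached the target app. Log the metadata and
        # serve a first-class error instead of the proxy's generic 5xx.
        return 503, "Worker unavailable, please retry shortly."
    return 200, "ok"

status, body = handle_request({"fly-replay-failed": "(metadata)"})
```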

Extra notes

  • Once a replayed request has hit its fallback, it can’t be replayed again.
  • “Failed” here means that the Fly Proxy didn’t manage to make any connection with one of your machines. As soon as a connection to your app is opened, the replay is a success, and your app returning an error will not trigger a fallback.
  • The defined timeout is a best-effort minimum, as we will not cancel requests mid-flight.
  • These fields can be defined in the replay header, or in the JSON replay format.

Nice! I can get rid of my hacky workaround.


As luck would have it, there have been some issues at DFW today, so I had an opportunity to observe the behavior. What I got was a series of redirects. @bglw, any chance you can look into 01KM6M38DQ1V0AGNA3FTZJ4FXF-dfw? It doesn’t look like my code detected a fly-replay-failed header.

No rush, meanwhile I’ve partially restored my previous changes.

I had a look at the time, but without the flyio-debug: doit header there’s nothing useful for this request.

Hard to say exactly beyond that; these fallbacks have been tested across a range of topologies, and I haven’t seen any issues with them.

The one thing I note in your original linked PR is that your replays are to "region": target + ",any", or "prefer_instance": machineID. So the preferences on these replays are fighting the fallback mechanism somewhat.
Both of these cases will drop their preference before resorting to a fallback — the former will resolve to the ,any region and succeed, and the latter will drop the preferred instance and succeed (assuming in both cases the app has other available machines). The fallback only triggers if the request would otherwise error, so "region": target or "instance": machineID would hit the fallback path if the destinations were unavailable. The prefer_instance and ,any replays would only ever reasonably hit a fallback if no machines were available for that app anywhere globally.
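To make that distinction concrete, here are the two styles side by side as header values (a sketch; my-worker, the ord region, and the machine ID are placeholder names, not from the thread):

```python
# Hard target: if region ord has no reachable Machine, the replay
# would otherwise error, so the fallback path actually fires.
hard = "app=my-worker;region=ord;timeout=5s;fallback=prefer_self"

# Soft target: the proxy drops the instance preference on its own and
# retries elsewhere, so the fallback only fires if no Machine in the
# app is available anywhere.
soft = "app=my-worker;prefer_instance=PLACEHOLDER_MACHINE_ID;timeout=5s;fallback=prefer_self"
```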


Ah, makes perfect sense!

changed to hard targets

It does mean that if a Machine is temporarily down (for example, during a deployment), the user will see an “installing updates” page rather than waiting, but that page is set to automatically refresh.