Replay with file stream

Hi, I’m experimenting with an application which receives data via file upload like

curl -T - ${URL} < some_file

I’m using Fly.io’s replay header feature to route the authenticated request to the correct machine. I observed that often, but not always, the data stream gets scrambled. There is a workaround which delays sending the data as a stream:

{ sleep 1 && cat some_file; } | curl -T - ${URL}

However, that involves adjusting the upload method. I don’t know the internals of how the routing proxy handles and forwards the request, but I’m looking for a way to achieve this without having to introduce a delay. Would it be possible to somehow control the buffering of the request for replay (actually, I only need the headers for authentication)? Any ideas?

Can you explain more about what ‘scrambled’ looks like?

One thing to note is that replays may not work for request bodies over 10MB in size. 10MB is the replay/retry buffer limit. When you hit the limit, you should see a PA01 error in your logs.

Hi @joshua-fly, by scrambled I mean that when I diff the data, usually a large block of bytes is missing at the beginning, and sometimes there are also missing bytes in the middle. In fact, I have seen the PA01 error in some (more infrequent) cases, but this corruption also happens without anything showing up in the logs, I believe. It is easy to reproduce, but not really deterministic.

I found out that for a running gateway instance, a delay of 1 second is enough for the app to re-route the request before the data is swallowed by the proxy. If the gateway needs to spin up, I need to use something like 6 seconds.

So from a user’s black-box perspective, I understand the issue and can somehow handle it using the workaround. But I’d like to have a better understanding of the internals and of how the buffering could be controlled (for instance, letting the proxy only read a certain amount of data from the client, or telling it which part of the data to buffer).

Also, I even saw this with data blocks of size 512 kB, so this should definitely fit into the buffer.

I ran a test to produce an example with an app which relays a byte stream. The error could be in the relay app or in the proxy/routing, but I cannot reproduce it either with a direct connection to the app (without replay) or when running the app locally, and it does not seem to occur with a delay of 1 second or more in the workaround above.

The blocks on the left side visualize the parts of the data which are missing.

The forum doesn’t allow me to post or attach the full text data.

Hey @fungs

I was able to reproduce this. I’m trying to find what might be the problem. Will post here once I understand what’s going on.


You may want to replace the - with a . in the curl command so that curl reads stdin in non-blocking (async) mode.

Just a small update. I found a bug and deployed the fix to a few edges in ams region. If all goes well, I’m gonna be rolling it out everywhere soon.

Great! I could also do some stress testing with my application in the AMS region for you, to see if the issue is fixed. Please keep us updated.

Now the real question: even now, the delay workaround seems to be necessary for data streams of more than 10 MiB, right? The problem is that the delay is not really predictable, because it depends on latency and load. Is there any way to prevent the proxy from consuming data that it will not buffer and will therefore discard? Otherwise, client applications might need custom tweaking for fly.io replay routing.

Lastly, wouldn’t a public bug tracker be a good thing to have? It feels really unsatisfying to report critical bugs as support requests to a community forum which, as you say, is not guaranteed to be read by a Fly.io employee at all. I see a general trend among IT companies, e.g. Databricks, to route customer bug reports exclusively through customer support (often not available on basic plans). I can say that this has led to me not reporting such bugs in many cases, and in the end to me switching to alternative providers and solutions. That said, the community support here worked great in this case!


Sorry for the delay. The fix has been rolled out everywhere.

Yeah. Buffering is the default behavior (as the proxy needs the data in case it needs to retry the request, with fly-replay being a special case of this). But it will never buffer more than 10MB per request. It’s currently not possible to tell the proxy to wait for a fly-replay response and only consume the body after that. If you need to upload large files, I think it’s better not to rely on fly-replay in this case. Some possible ways to achieve this:

  • route the request through your app that currently does fly-replay and stream the upload from it to a specific instance (via a .internal address)
  • replay with the target instance ID from your app and do the upload with a fly-force-instance-id: <instance id> request header (see the sketch below). Note that in this case, if the target instance is unavailable (for example, the host is down), the request is eventually gonna fail after all retry attempts are exhausted.
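
A rough sketch of the second option (the URL is a placeholder and the instance ID has to come from your app’s own replay logic):

# sketch only: upload directly to a known instance, bypassing fly-replay for the large body
curl --fail -T large_file -H "fly-force-instance-id: <instance id>" ${URL}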

Thanks for the suggestions!

For the first one, streaming data through the primary app instance (which I would call the gateway app) unfortunately defeats the whole purpose of using the fly-replay mechanism to load balance large data streams directly to different instances. That would be like running a secondary reverse proxy.

The second one looks to me like a client-side implementation of the replay mechanism, thus requiring custom client-side code for fly.io routing. It doesn’t really work for load balancing dynamic requests without an additional round of negotiation using a custom protocol.

@pavel, do I understand correctly that a public connection request can circumvent the gateway application by adding the undocumented fly-force-instance-id header to the request? That would be a real issue with simple first-level authentication proxies and very important to know. I was assuming that the first request would always go to the gateway app!

No. This only works if the app already has a public IP address. Are you replaying (via the fly-replay response) to an app without a public IP? In that case this won’t work.

Another small update on the issue - I had to roll back the change for now. Looks like it caused issues with our registry that I don’t fully understand yet.


@fungs

The now-correct fix for this has been rolled out, so all should be good.

Now, about replaying large file streams. I’ve checked again what we do with large bodies and here is what happens:

  • a proxy on an edge accepts a request and starts buffering the body (up to 10MB)
  • immediately, it looks for an instance to handle the request and sends the request to it
  • once 10MB are buffered, the proxy on the edge stops reading the body from the client until it either gets a response from the instance (not fly-replay, but a proper response) or 3 seconds pass (currently hardcoded). Once either of the two happens, the proxy continues reading the body from the client and the request can no longer be retried or replayed.

With fly-replay it means that your app has 3 seconds to redirect the request somewhere else. There could definitely be situations when this is not gonna be enough, like networking problems.
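
To put it concretely (file name and URL are placeholders): a streamed upload like the one below can only be replayed while the body read so far still fits in the buffer, i.e. as long as your app answers the initial request with fly-replay within that 3-second window.

dd if=/dev/urandom of=large_file bs=1M count=20   # 20MB test payload, larger than the 10MB buffer
curl --fail -T - ${URL} < large_file              # streamed, no Content-Length; must be replayed within ~3 s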

There is an alternative solution, I think, that still lets you benefit from proxy load balancing. Have you considered allocating a flycast address for the app (the one you are replaying to) and sending requests to it? This way your proxy app can auth the request, buffer it in memory or on disk if you want to be sure that you can retry it, and send it to the final app via the flycast address.
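
A rough sketch of that setup, assuming the backend app is called backend-app (a placeholder):

fly ips allocate-v6 --private --app backend-app   # allocate a flycast (private) address for the backend
# the gateway app authenticates (and, if desired, buffers) the upload, then forwards it over the
# private network to the backend, e.g. http://backend-app.flycast/, where the proxy load-balances it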

@pavel it’s great to hear that you were finally able to roll out the fix! I also like your dissection of the routing process. It is really helping me (and possibly others) to understand what happens under the hood.

As you say, a threshold of 3 seconds leaves quite some room for nondeterministic failures when using standard requests with large bodies and fly.io replay headers, in particular if you add the spin-up time for stopped machines.

My current approach to handle this is to wait for a response from the final routing destination before sending large data. That, however, means I cannot use an out-of-the-box HTTP client against the remote side of the application I host! Of course, it would be great if those 3 seconds could be more flexible, or ideally if the proxy’s request processing could be fully transparent to those clients.
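
(Roughly, what my client does is comparable to the following two-step flow, approximated here with two separate curl calls and a hypothetical warm-up endpoint:)

curl --fail ${URL}/ping                 # hypothetical small request; gets replayed first and spins up the target
curl --fail -T - ${URL} < some_file     # the large stream is only sent once that response has come back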

IMO the real issue for the backend application is that the data it gets may or may not be corrupted (and there might be no way to tell whether that is the case, e.g. with encrypted data). This just violates the principle of TCP/IP, which does all the work to ensure that the data is correct. So instead of replaying incomplete data, the proxy should return an error to the client indicating that the request failed, if it is unable to guarantee the correctness of relayed requests.

I’m not sure if I understand your alternative solution. I assume you mean a private flycast address to let the gateway app make a request to the backend app. Wouldn’t that mean that all the incoming traffic would go through the gateway app? The aim would be for large data streams to take the shortest route possible from the edge router to the backend app. The gateway app should be lightweight and only used for path routing and authentication for different backends.

That’s what it’s doing. The corruption (partial request) was a bug and is now fixed.
If the proxy can’t retry/replay the request anymore because the body is (partially) consumed, it will simply fail the request, as there is nothing else it can do. This shouldn’t happen for requests smaller than 10MB, but it may happen for requests larger than 10MB if it takes more than 3 seconds to process the request.

Yes. This gives you flexibility to buffer the body if needed, but the downside is that everything will have to go via your proxy app.


Hi @pavel, I still get this error (AMS region) with a 1 MiB request and automatic instance startup. Is this due to the timeout?

[error][PA01] cannot retry/replay request because request body has been read past what we're willing to buffer

I’m observing this error reproducibly in the specific situation when the gateway app needs to spin up and I use a stream:

curl --fail -T - ${URL} < 1_mib_file

which returns a 502 status error, as opposed to an upload with a predetermined file size:

curl --fail -T 1_mib_file ${URL}

which works as expected.

Does this always work when the gateway app needs to spin up as well?

Yes, unless the size is larger than 10 MiB

Interesting. They behave exactly the same for me. The only difference from the proxy’s point of view is that the buffer capacity in the second case will be min(10MB, content-length), while it is always 10MB in the first case since there is no Content-Length header.

I actually might have misled you earlier with this statement:

This shouldn’t happen for requests smaller than 10MB, but it may happen for requests larger than 10MB if it takes more than 3 seconds to process the request.

I checked the code again and we stop buffering after 3 seconds completely (even if there is still some space in the buffer). So if it takes more than 3 seconds to fly-replay, the request may fail.

Yes, when I tested, this was reproducible, and I re-checked just now. The combination of a 1 MiB file stream and auto spin-up always results in the 502 status error and the message cannot retry/replay request because request body has been read past what we're willing to buffer. It took about 7 s overall to return to the client with that error (request.id="01HYDHVTA00J89XQE29R9QFVHY-ams").

Does it just mean that the proxy stops consuming data, or does it directly return a 502 status error without waiting for the gateway app to respond? I assume the latter from what I observe.

I’m also trying to find an explanation. I hypothesize that the gateway app’s API framework handles file stream upload requests differently, which could mean that it fails to respond to the request within 3 seconds for file streams. I checked the timestamps in the logs and indeed, it takes about 5 s in the error case with a stream and only 3 s with a fixed file upload. I will check if I can handle that somehow differently.

I’m wondering whether the 3-second restriction for replay is usable for production services if I cannot control many of the timing parameters. The app itself takes roughly 1.5 s to spin up and may depend on external services for authentication, which also adds to the overall time.

Maybe you could start by adjusting the error message in such cases to something like ‘timeout waiting for response’ for better debugging and understanding.