The default behavior in the Ruby Fly adapter is to replay all POST requests as early as possible in the middleware stack. But even there, the Rails default web server (Puma) will buffer large requests to disk before passing them to Rails.
I think we can get replay to account for this (it’s the point of the buffer, after all). It’s more complex, though, and it’s going to take some time to work out all the wrinkles.
I’ve improved things by adding an Express middleware, placed before any body parsing, that replays everything that’s not a GET request to the primary region. I also have read-only and write-only clients for my Postgres database, so even if I do write in a GET request, it still works without issue (useful for session refreshes, for example).
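A minimal sketch of the kind of middleware described above, not the poster's actual code. It assumes the current and primary regions are passed in by the caller (on Fly they would typically come from the `FLY_REGION` and `PRIMARY_REGION` environment variables), and the names `replayTarget` and `replayWrites` are hypothetical. The 409 status is an assumption as well; the part that matters to the proxy is the `fly-replay` response header.

```typescript
// Structural types so the sketch does not depend on Express typings.
// Express request/response objects satisfy these shapes.
interface Req { method: string }
interface Res {
  statusCode: number;
  setHeader(name: string, value: string): void;
  end(): void;
}
type Next = () => void;

// Returns the fly-replay header value, or null when the request
// can be handled in the current region.
function replayTarget(
  method: string,
  currentRegion: string,
  primaryRegion: string,
): string | null {
  // GET requests are served locally; so is anything already in the primary region.
  if (method === "GET" || currentRegion === primaryRegion) return null;
  return `region=${primaryRegion}`;
}

// Builds a middleware that asks the Fly proxy to replay non-GET requests
// in the primary region. It never reads the request body.
function replayWrites(currentRegion: string, primaryRegion: string) {
  return (req: Req, res: Res, next: Next): void => {
    const target = replayTarget(req.method, currentRegion, primaryRegion);
    if (target === null) {
      next();
      return;
    }
    res.setHeader("fly-replay", target);
    res.statusCode = 409; // assumed status; the header drives the replay
    res.end();
  };
}
```

In an Express app this could be registered with something like `app.use(replayWrites(process.env.FLY_REGION ?? "", process.env.PRIMARY_REGION ?? ""))` ahead of any body parser, so the response goes out before the body is consumed. With both variables unset (local development, for instance), the regions compare equal and nothing is replayed.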
I think this is a pretty solid solution to this problem. Thanks for all your help on this @kurt and @jsierles
@jerome actually found that weird body parsing bug. Definitely let us know if there’s another 502 with an Undocumented error, though. This should be pretty stable.
The “Undocumented” error is very unfortunate. I’ll be working on improving that a bit more.
The request seems to have hit a gru edge server, then went to scl for your app. Your app then replied with the fly-replay header. Our edge then forwarded this to dfw.
Something is odd in the logs I’m seeing. I’m going to set a much higher log level for your app specifically (in our proxy logs) to see if anything interesting happens.
I’ll check for errors myself, but if you see one and I haven’t replied yet, please let me know.
We can improve the GET errors you saw. These were “connection reset by peer” errors from your app, meaning they should fall in the “app connection” errors category.
I didn’t get the logs I wish I did for the POST requests. I’m pushing another update momentarily that might help.
If they do try again, it’s easier to sort through the logs if they only try once.
Is there any way I can trigger these myself? It would be much easier to debug.
I disabled the replay again because it was impacting my users. Unfortunately I’m not sure how to reproduce. Seems like it happens to some users hitting non-primary region servers when they are submitting a recording at https://kent.dev/calls/record/new and I have replay enabled.
Do you know how big the POST bodies are? These errors match what we see in tests when a Rails app parses a whole body and the request is quite large. I am wondering if the way those recordings are encoded balloons the request body somehow.
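One way to answer the size question without reading the body is to log the declared `Content-Length` of each POST before the replay decision. The helper below is a rough sketch under that assumption; `declaredBodyBytes` is a hypothetical name, and it takes a Node-style headers object, where header names are already lower-cased. It returns `null` when the header is missing or malformed, which is what a chunked upload would look like.

```typescript
// Shape of Node's req.headers: lower-cased names, string or string[] values.
type Headers = Record<string, string | string[] | undefined>;

// Returns the declared request body size in bytes, or null when the
// Content-Length header is absent or not a plain non-negative integer.
function declaredBodyBytes(headers: Headers): number | null {
  const raw = headers["content-length"];
  const value = Array.isArray(raw) ? raw[0] : raw;
  if (value === undefined) return null;
  const trimmed = value.trim();
  if (!/^\d+$/.test(trimmed)) return null;
  return Number(trimmed);
}
```

Calling it as `declaredBodyBytes(req.headers)` and logging the result next to the request path should show whether the failing submissions are unusually large.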
You’ll notice that I don’t even use the body parser middleware at all. All I do before the replay code is the redirect code and disabling the x-powered-by header.
We’re digging through Express; it’s quite possible it reads the whole body even before middleware runs. We’re going to launch a second instance of your app (assuming you’re cool with it) to see if we can replicate these things.
@kurt Did anything come of this? I feel like this is tied to my problem where my app hangs after trying to read the body of the request (also an Express app, also Remix).
We haven’t been able to replicate with our own setup, sadly. We’ve fixed several body reading edge cases in the last few weeks though. Do you have a way to replicate the body read hangs?