Interestingly, I’ve been unable to reproduce this. It appears to happen only when using the fly-replay header functionality, and even then not for all users (I use fly-prefer-region to simulate a fly-replay situation, and the POST replays and completes successfully for me). One of the users who experienced this also used fly-prefer-region to set their region to the primary region, and the POST request finished successfully as well.
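For reference, here’s roughly how I’ve been simulating it (a sketch only; the region and the payload are just example values):

```js
// fly-prefer-region asks Fly's proxy to route the request to a specific
// region; the app there responds with fly-replay, which exercises the
// replay path. "maa" and the form body are just example values.
const res = await fetch('https://kent.dev/calls/record/new', {
  method: 'POST',
  headers: {
    'fly-prefer-region': 'maa',
    'content-type': 'application/x-www-form-urlencoded',
  },
  body: new URLSearchParams({ title: 'test' }).toString(),
})
console.log(res.status)
```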
So here’s what I know:
- It happens on certain POST requests, but not all of them
- It happens on requests that need to be replayed
- It happens after the request has been replayed (my server successfully sends the response with the fly-replay header, and then the error happens)
- It only happens to certain users
@jsierles told me this looks like a router error. In some ways it looks similar to 502 Errors for POST requests, except there’s a different error message.
For an abundance of information, here’s a more complete log of the user experience that triggers this issue:
Ahhh I’m glad you posted this! We have much better error logs for these kinds of errors, so we can see exactly what caused it.
This error was the result of a failure to read the body from the request. We’re going to try to reproduce the problem. It could be a bug on our end, or it might be something specific about how a browser is interacting with our proxy. It’s probably not your app.
Do you think it’s possible the user that hit this has a browser extension running that could modify requests, or is potentially running through a proxy server of some kind?
The “true” error we see internally is, some of the time, an error reading the request body between our servers. We were missing some instrumentation (I’ve just added it), and the next time this happens I should be able to tell what’s wrong and probably fix it with a few tweaks.
It does appear to be related to latency. MAA in particular is pretty far away from your primary (I assume).
How big was the request body? We buffer up to 1 MiB. If the body is less than or equal to that, it should be safe to replay under any circumstances (if it isn’t, that’s likely a bug on our end).
If the body is bigger than 1 MiB, then before starting to read the body we’ll wait up to 3 seconds, or until we get response headers from either our own server or your instance. This means some “uploads” may take a while to start, but it ensures the request can be replayed.
Now that I think about it, we’re not resetting that timer on replays. I’ll work on that this morning (in addition to monitoring body errors for your app).
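To make those rules concrete, here’s a rough sketch of the decision (not our actual proxy code; the names are made up for illustration):

```js
// Sketch of the replay rules described above. Not Fly's real
// implementation; names and shapes here are illustrative only.
const MAX_BUFFERED_BODY = 1024 * 1024 // bodies up to 1 MiB are buffered
const BODY_WAIT_MS = 3000 // larger bodies wait up to 3 seconds

function replayIsSafe({ bodyBytes, bodyReadStarted, elapsedMs }) {
  // Bodies at or under 1 MiB are fully buffered, so they can always
  // be replayed.
  if (bodyBytes <= MAX_BUFFERED_BODY) return true
  // Bigger bodies: reading is deferred until response headers arrive or
  // the timer expires. Once reading starts, a replay can no longer be
  // honored, because the body has already been (partially) consumed.
  return !bodyReadStarted && elapsedMs < BODY_WAIT_MS
}
```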
My friend tried it again and said it took longer to fail this time, but it still failed. Hopefully you can see what you need now. Here’s what I have in my logs:
2021-09-15T14:11:32.856187689Z proxy ams [error] error.code=1002 error.message="No suitable (healthy) instance found to handle request" request.method="POST" request.url="https://kent.dev/calls/record/new?_data=routes%2Fcalls.record%2Fnew" request.id="01FFMWQ263QWPBVHQ6ET2Y2Y3N" response.status=502
Based on the error message I wondered whether there was an instance health issue, but my fly status gives me this:
~/code/remix-kentcdodds (main) 👨🔬
$ fly status
App
Name = withered-frost-3196
Owner = personal
Version = 554
Status = running
Hostname = withered-frost-3196.fly.dev
Instances
ID TASK VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
ec2b50ab app 554 maa run running 1 total, 1 passing 0 6h12m ago
ac202b5f app 554 ams run running 1 total, 1 passing 0 21h23m ago
ef8e02d4 app 554 maa run running 1 total, 1 passing 0 21h23m ago
fc169d58 app 554 syd run running 1 total, 1 passing 0 21h23m ago
0bd9d4c6 app 554 hkg run running 1 total, 1 passing 0 21h23m ago
43ff6e3a app 554 scl run running 1 total, 1 passing 0 21h23m ago
~/code/remix-kentcdodds (main) 👨🔬
$ fly status -a kcd-postgres
WARN app flag 'kcd-postgres' does not match app name in config file 'withered-frost-3196'
? Continue using 'kcd-postgres' Yes
App
Name = kcd-postgres
Owner = personal
Version = 22
Status = running
Hostname = kcd-postgres.fly.dev
Instances
ID TASK VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
bf35b1ba app 22 maa run running (replica) 3 total, 3 passing 0 53m30s ago
7b00ecb0 app 22 dfw run running (leader) 3 total, 3 passing 0 6h7m ago
ee9797cf app 22 dfw run running (replica) 3 total, 3 passing 0 6h8m ago
3fe4347a app 22 scl run running (replica) 3 total, 3 passing 2 2021-09-12T06:02:46Z
9968616e app 22 hkg run running (replica) 3 total, 3 passing 0 2021-09-10T22:37:47Z
9ec72ecf app 22 ams run running (replica) 3 total, 3 passing 1 2021-09-02T18:33:42Z
8445923f app 22 syd run running (replica) 3 total, 3 passing 1 2021-08-21T17:23:40Z
~/code/remix-kentcdodds (main) 👨🔬
$ fly status -a kcd-redis
WARN app flag 'kcd-redis' does not match app name in config file 'withered-frost-3196'
? Continue using 'kcd-redis' Yes
App
Name = kcd-redis
Owner = personal
Version = 10
Status = running
Hostname = kcd-redis.fly.dev
Instances
ID TASK VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
1a9f347b app 10 dfw run running 1 total, 1 passing 0 6h8m ago
854108fa app 10 scl run running 1 total, 1 passing 0 2021-09-12T06:02:45Z
0393a9c9 app 10 hkg run running 1 total, 1 passing 0 2021-09-10T21:13:21Z
c6c55a6b app 10 maa run running 1 total, 1 passing 0 2021-09-08T12:06:54Z
2956469e app 10 ams run running 1 total, 1 passing 0 2021-08-24T17:09:21Z
c4dbe6d1 app 10 syd run running 1 total, 1 passing 0 2021-08-06T23:43:25Z
I see the error. Your app asked for a replay in DFW, but you have no instance there. So it kept retrying (that’s why it took so long) and eventually gave up because it couldn’t find a “suitable” instance to send the request to.
Edit: I’m aware this is a different issue than before. Pretty sure you had an instance in DFW before this.
Not sure how that happened, but @kurt suggested I run fly scale count 6 --max-per-region=1, and now my status is:
~/code/remix-kentcdodds (main) 👨🔬
$ fly status
App
Name = withered-frost-3196
Owner = personal
Version = 555
Status = running
Hostname = withered-frost-3196.fly.dev
Instances
ID TASK VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
b7274be3 app 555 maa run running 1 total, 1 passing 0 9m37s ago
8c01af67 app 555 dfw run running 1 total, 1 passing 0 9m37s ago
ac202b5f app 555 ams run running 1 total, 1 passing 0 21h46m ago
fc169d58 app 555 syd run running 1 total, 1 passing 0 21h46m ago
43ff6e3a app 555 scl run running 1 total, 1 passing 0 21h46m ago
0bd9d4c6 app 555 hkg run running 1 total, 1 passing 0 21h46m ago
We should get better logs for the next error, assuming there is one.
Without that --max-per-region set, the scheduler does its best to spread VMs out but will occasionally put two per region. This is fine if you’re not trying to replay, really bad if you are.
We have a better setup you can use for primary/read regions but the --max-per-region should be fine for now.
@kentcdodds I’ve noticed more errors for your app. I just deployed a branch that should help (in MAA, AMS, and FRA); I found a few issues with the way we read request bodies (sometimes prematurely).
If your friend is still around, can you have them try it from Amsterdam? If it’s possible to reproduce with a curl call, I’d like that.
Just had someone in France have the same issue. Got this in the logs:
2021-09-15T18:37:38.626047107Z proxy[995decfc] fra [error] error.code=1 error.message="Undocumented" request.method="POST" request.url="https://kent.dev/calls/record/new?_data=routes%2Fcalls.record%2Fnew" request.id="01FFNC01ATJTVC37G60P84E3J9" response.status=502
I don’t know if it’s helpful, but I don’t have an instance in the fra region. @kurt said that’s just where Fly received the request, not where the server that handled it was running.
Ok, we’ve tracked down the cause of this error. We can likely build a workaround, but it happens in some instances when:
1. Your app receives an HTTP request in one region
   a. Usually with a file upload or an otherwise large (probably >64 KB) body
2. It reads the whole request body
3. Then it responds with a replay
We think this is exacerbated by connection speed: it works fine for some browsers and connections, and errors on others.
This is problematic because many HTTP frameworks eagerly read a request body (which they really shouldn’t, but that’s not your fault).
To prevent these errors, for now, you can disable request body reading. In an Express app, you might be able to do that by returning a replay with middleware that runs before this:
app.use(express.json())
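For example, here’s a minimal sketch of that kind of middleware (assuming a PRIMARY_REGION env var you set yourself alongside Fly’s built-in FLY_REGION; the 409 status and the method check are placeholders to adapt to your app):

```js
const express = require('express')
const app = express()

// Decide about the replay BEFORE any body parsing runs, so the request
// body is never consumed in the wrong region and the proxy can still
// replay it elsewhere.
app.use((req, res, next) => {
  const { PRIMARY_REGION, FLY_REGION } = process.env
  const isWrite = !['GET', 'HEAD', 'OPTIONS'].includes(req.method)
  if (isWrite && PRIMARY_REGION && FLY_REGION !== PRIMARY_REGION) {
    // A response carrying the fly-replay header tells the proxy to re-run
    // the request in the named region.
    res.set('fly-replay', `region=${PRIMARY_REGION}`)
    return res.sendStatus(409)
  }
  next()
})

// Body parsing now only runs for requests this instance will handle itself.
app.use(express.json())
```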
We should be able to engineer around this for bodies up to a certain size, but in general things will behave better if the app doesn’t read the body: it’ll respond faster, and then the replay will work fine with even very large bodies.