Interestingly, I’ve been unable to reproduce this. It appears to happen only when using the fly-replay header functionality, and even then not for all users (I use fly-prefer-region to simulate a fly-replay situation, and the POST replays and completes successfully for me). One of the users who experienced this also used fly-prefer-region to set their region to the primary region, and the POST request finished successfully as well.
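For reference, here’s roughly how I’ve been simulating it (a sketch only; the region and the payload are just example values):

```js
// fly-prefer-region asks Fly's proxy to route the request to a specific
// region; the app there responds with fly-replay, which exercises the
// replay path. "maa" and the form body are just example values.
const res = await fetch('https://kent.dev/calls/record/new', {
  method: 'POST',
  headers: {
    'fly-prefer-region': 'maa',
    'content-type': 'application/x-www-form-urlencoded',
  },
  body: new URLSearchParams({ title: 'test' }).toString(),
})
console.log(res.status)
```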
So here’s what I know:
- It happens on certain POST requests, but not all of them
- It happens on requests that need to be replayed
- It happens after the request has been replayed (my server successfully sends the response with the fly-replay header, and then the error happens)
- It only happens to certain users
@jsierles told me this looks like a router error. In some ways it looks similar to 502 Errors for POST requests, except there’s a different error message.
For an abundance of information, here’s a more complete log of the user experience that triggers this issue:
Ahhh I’m glad you posted this! We have much better error logs for these kinds of errors, so we can see exactly what caused it.
This error was the result of a failure to read the body from the request. We’re going to try to reproduce the problem. It could be a bug on our end, or it might be something specific about how a browser is interacting with our proxy. It’s probably not your app.
Do you think it’s possible the user that hit this has a browser extension running that could modify requests, or is potentially running through a proxy server of some kind?
The “true” error we see internally is, some of the time, an error reading the request body between our servers. We were missing some instrumentation (I’ve just added it), and the next time this happens I should be able to tell what’s wrong and probably fix it with a few tweaks.
It does appear to be related to latency. MAA in particular is pretty far away from your primary (I assume).
How big was the request body? We buffer up to 1 MiB. If the body is less than or equal to that, it should be safe to replay under any circumstances (if it isn’t, that’s likely a bug on our end).
If the body is bigger than 1 MiB, then before starting to read the body we’ll wait up to 3 seconds, or until we get response headers from either our own server or your instance. This means some “uploads” may take a while to start, but it ensures the request can be replayed.
Now that I think about it, we’re not resetting that timer on replays. I’ll work on that this morning (in addition to monitoring body errors for your app).
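To make those rules concrete, here’s a rough sketch of the decision (not our actual proxy code; the names are made up for illustration):

```js
// Sketch of the replay rules described above. Not Fly's real
// implementation; names and shapes here are illustrative only.
const MAX_BUFFERED_BODY = 1024 * 1024 // bodies up to 1 MiB are buffered
const BODY_WAIT_MS = 3000 // larger bodies wait up to 3 seconds

function replayIsSafe({ bodyBytes, bodyReadStarted, elapsedMs }) {
  // Bodies at or under 1 MiB are fully buffered, so they can always
  // be replayed.
  if (bodyBytes <= MAX_BUFFERED_BODY) return true
  // Bigger bodies: reading is deferred until response headers arrive or
  // the timer expires. Once reading starts, a replay can no longer be
  // honored, because the body has already been (partially) consumed.
  return !bodyReadStarted && elapsedMs < BODY_WAIT_MS
}
```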
My friend tried it again and said it took longer to fail this time, but it still failed. Hopefully you can see what you need now. Here’s what I have in my logs:
2021-09-15T14:11:32.856187689Z proxy ams [error] error.code=1002 error.message="No suitable (healthy) instance found to handle request" request.method="POST" request.url="https://kent.dev/calls/record/new?_data=routes%2Fcalls.record%2Fnew" request.id="01FFMWQ263QWPBVHQ6ET2Y2Y3N" response.status=502
Based on the error message I wondered whether there was an instance health issue, but my fly status gives me this:
~/code/remix-kentcdodds (main) 👨🔬
$ fly status
App
Name = withered-frost-3196
Owner = personal
Version = 554
Status = running
Hostname = withered-frost-3196.fly.dev
Instances
ID TASK VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
ec2b50ab app 554 maa run running 1 total, 1 passing 0 6h12m ago
ac202b5f app 554 ams run running 1 total, 1 passing 0 21h23m ago
ef8e02d4 app 554 maa run running 1 total, 1 passing 0 21h23m ago
fc169d58 app 554 syd run running 1 total, 1 passing 0 21h23m ago
0bd9d4c6 app 554 hkg run running 1 total, 1 passing 0 21h23m ago
43ff6e3a app 554 scl run running 1 total, 1 passing 0 21h23m ago
~/code/remix-kentcdodds (main) 👨🔬
$ fly status -a kcd-postgres
WARN app flag 'kcd-postgres' does not match app name in config file 'withered-frost-3196'
? Continue using 'kcd-postgres' Yes
App
Name = kcd-postgres
Owner = personal
Version = 22
Status = running
Hostname = kcd-postgres.fly.dev
Instances
ID TASK VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
bf35b1ba app 22 maa run running (replica) 3 total, 3 passing 0 53m30s ago
7b00ecb0 app 22 dfw run running (leader) 3 total, 3 passing 0 6h7m ago
ee9797cf app 22 dfw run running (replica) 3 total, 3 passing 0 6h8m ago
3fe4347a app 22 scl run running (replica) 3 total, 3 passing 2 2021-09-12T06:02:46Z
9968616e app 22 hkg run running (replica) 3 total, 3 passing 0 2021-09-10T22:37:47Z
9ec72ecf app 22 ams run running (replica) 3 total, 3 passing 1 2021-09-02T18:33:42Z
8445923f app 22 syd run running (replica) 3 total, 3 passing 1 2021-08-21T17:23:40Z
~/code/remix-kentcdodds (main) 👨🔬
$ fly status -a kcd-redis
WARN app flag 'kcd-redis' does not match app name in config file 'withered-frost-3196'
? Continue using 'kcd-redis' Yes
App
Name = kcd-redis
Owner = personal
Version = 10
Status = running
Hostname = kcd-redis.fly.dev
Instances
ID TASK VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
1a9f347b app 10 dfw run running 1 total, 1 passing 0 6h8m ago
854108fa app 10 scl run running 1 total, 1 passing 0 2021-09-12T06:02:45Z
0393a9c9 app 10 hkg run running 1 total, 1 passing 0 2021-09-10T21:13:21Z
c6c55a6b app 10 maa run running 1 total, 1 passing 0 2021-09-08T12:06:54Z
2956469e app 10 ams run running 1 total, 1 passing 0 2021-08-24T17:09:21Z
c4dbe6d1 app 10 syd run running 1 total, 1 passing 0 2021-08-06T23:43:25Z
I see the error. Your app asked for a replay in DFW, but you have no instance there. So it kept retrying (that’s why it took so long) and eventually gave up because it couldn’t find a “suitable” instance to send the request to.
Edit: I’m aware this is a different issue than before. Pretty sure you had an instance in DFW before this.
Not sure how that happened, but @kurt suggested I run fly scale count 6 --max-per-region=1, and now my status is:
~/code/remix-kentcdodds (main) 👨🔬
$ fly status
App
Name = withered-frost-3196
Owner = personal
Version = 555
Status = running
Hostname = withered-frost-3196.fly.dev
Instances
ID TASK VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
b7274be3 app 555 maa run running 1 total, 1 passing 0 9m37s ago
8c01af67 app 555 dfw run running 1 total, 1 passing 0 9m37s ago
ac202b5f app 555 ams run running 1 total, 1 passing 0 21h46m ago
fc169d58 app 555 syd run running 1 total, 1 passing 0 21h46m ago
43ff6e3a app 555 scl run running 1 total, 1 passing 0 21h46m ago
0bd9d4c6 app 555 hkg run running 1 total, 1 passing 0 21h46m ago
We should get better logs for the next error, assuming there is one.
Without that --max-per-region set, the scheduler does its best to spread VMs out but will occasionally put two per region. This is fine if you’re not trying to replay, really bad if you are.
We have a better setup you can use for primary/read regions but the --max-per-region should be fine for now.
@kentcdodds I’ve noticed more errors for your app. I just deployed a branch that should help (in MAA, AMS, and FRA); I found a few issues with the way we read request bodies (sometimes prematurely).
If your friend is still around, can you have them try it from Amsterdam? If it’s possible to reproduce with a curl call, I’d like that.
Just had someone in France have the same issue. Got this in the logs:
2021-09-15T18:37:38.626047107Z proxy[995decfc] fra [error] error.code=1 error.message="Undocumented" request.method="POST" request.url="https://kent.dev/calls/record/new?_data=routes%2Fcalls.record%2Fnew" request.id="01FFNC01ATJTVC37G60P84E3J9" response.status=502
I don’t know if it’s helpful, but I don’t have an instance in the fra region. @kurt said that’s just where Fly received the request, not where the server that handled it was running.
Ok, we’ve tracked down the cause of this error. We can likely build a workaround, but it happens in some instances when:
1. Your app receives an HTTP request in one region
   a. Usually with a file upload or an otherwise large (probably >64 KB) body
2. It reads the whole request body
3. Then it responds with a replay
We think this is exacerbated by connection speed: it works fine for some browsers and connections, and errors on others.
This is problematic because many HTTP frameworks eagerly read a request body (which they really shouldn’t, but that’s not your fault).
To prevent these errors, for now, you can disable request body reading. In an Express app, you might be able to do that by returning a replay with middleware that runs before this:
app.use(express.json())
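For example, here’s a minimal sketch of that kind of middleware (assuming a PRIMARY_REGION env var you set yourself alongside Fly’s built-in FLY_REGION; the 409 status and the method check are placeholders to adapt to your app):

```js
const express = require('express')
const app = express()

// Decide about the replay BEFORE any body parsing runs, so the request
// body is never consumed in the wrong region and the proxy can still
// replay it elsewhere.
app.use((req, res, next) => {
  const { PRIMARY_REGION, FLY_REGION } = process.env
  const isWrite = !['GET', 'HEAD', 'OPTIONS'].includes(req.method)
  if (isWrite && PRIMARY_REGION && FLY_REGION !== PRIMARY_REGION) {
    // A response carrying the fly-replay header tells the proxy to re-run
    // the request in the named region.
    res.set('fly-replay', `region=${PRIMARY_REGION}`)
    return res.sendStatus(409)
  }
  next()
})

// Body parsing now only runs for requests this instance will handle itself.
app.use(express.json())
```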
We should be able to engineer around this for bodies up to a certain size, but in general things will behave better if the app doesn’t read the body: it’ll respond faster, and then the replay will work fine with even very large bodies.