Router errors 502 of POST requests "Undocumented"

In some cases when a user POSTs, I get the following error in my fly logs:

2021-09-14T19:49:41.050178391Z proxy[f29bb9a4] ams [error] error.code=1 error.message="Undocumented" request.method="POST" request.url="https://kent.dev/calls/record/new?_data=routes%2Fcalls.record%2Fnew" request.id="01FFJXQ4T4JQMZXC9QQGH19RZA" response.status=502 

Interestingly, I’ve been unable to reproduce this. It appears to only happen when using the fly-replay header functionality, but not to all users (I use fly-prefer-region to simulate a fly-replay situation and the POST replays and completes successfully for me). One of the users who experienced this also used fly-prefer-region to set their region to the primary region and the POST request finished successfully as well.

So here’s what I know:

  • It happens on certain POST requests, but not all of them
  • It happens on requests that need to be replayed
  • It happens after the request has been replayed (my server successfully sends the response with the fly-replay header, and then the error happens)
  • It only happens to certain users

@jsierles told me this looks like a router error. In some ways this looks similar to 502 Errors for POST requests except there’s a different error message.

For an abundance of information, here’s a more complete log of the user experience that triggers this issue:

2021-09-14T19:38:20.807110464Z app[ac202b5f] ams [info] GET /chats?_data=routes%2Fchats%2Findex 204 - - 270.176 ms
2021-09-14T19:38:21.232474711Z app[ac202b5f] ams [info] GET /calls?_data=routes%2Fcalls 200 - - 580.799 ms
2021-09-14T19:38:21.249278923Z app[ac202b5f] ams [info] GET /chats?_data=routes%2Fchats 200 - - 703.539 ms
2021-09-14T19:38:54.210915294Z app[ac202b5f] ams [info] GET /calls/record?_data=routes%2Fcalls.record 200 - - 229.447 ms
2021-09-14T19:38:55.612048460Z app[ac202b5f] ams [info] GET /calls/record?_data=routes%2Fcalls.record%2Findex 200 - - 127.107 ms
2021-09-14T19:38:58.163404953Z app[ac202b5f] ams [info] GET /calls/record/new?_data=routes%2Fcalls.record%2Fnew 200 - - 210.336 ms
2021-09-14T19:39:27.365114140Z app[ac202b5f] ams [info] Replaying: {
2021-09-14T19:39:27.365640414Z app[ac202b5f] ams [info]   pathname: '/calls/record/new',
2021-09-14T19:39:27.366043035Z app[ac202b5f] ams [info]   method: 'POST',
2021-09-14T19:39:27.366598188Z app[ac202b5f] ams [info]   PRIMARY_REGION: 'dfw',
2021-09-14T19:39:27.367003019Z app[ac202b5f] ams [info]   FLY_REGION: 'ams'
2021-09-14T19:39:27.367137831Z app[ac202b5f] ams [info] }
2021-09-14T19:39:27.580189902Z app[ac202b5f] ams [info] POST /calls/record/new?_data=routes%2Fcalls.record%2Fnew 204 - - 214.995 ms
2021-09-14T19:39:30.695127097Z proxy[f29bb9a4] ams [error] error.code=1 error.message="Undocumented" request.method="POST" request.url="https://kent.dev/calls/record/new?_data=routes%2Fcalls.record%2Fnew" request.id="01FFJX4GQSJFB7ZAVE0AKGQEEF" response.status=502 
2021-09-14T19:39:30.759682548Z app[ac202b5f] ams [info] {
2021-09-14T19:39:30.759861235Z app[ac202b5f] ams [info]   timings: {
2021-09-14T19:39:30.760239692Z app[ac202b5f] ams [info]     postgres_read: [
2021-09-14T19:39:30.760422089Z app[ac202b5f] ams [info]       {
2021-09-14T19:39:30.761214617Z app[ac202b5f] ams [info]         name: 'getUser in root loader',
2021-09-14T19:39:30.761866907Z app[ac202b5f] ams [info]         type: 'postgres_read',
2021-09-14T19:39:30.762446593Z app[ac202b5f] ams [info]         time: 5.234527999535203
2021-09-14T19:39:30.762632229Z app[ac202b5f] ams [info]       }
2021-09-14T19:39:30.762848327Z app[ac202b5f] ams [info]     ],
2021-09-14T19:39:30.763388213Z app[ac202b5f] ams [info]     cache_read: [
2021-09-14T19:39:30.763575679Z app[ac202b5f] ams [info]       {
2021-09-14T19:39:30.764462301Z app[ac202b5f] ams [info]         name: 'cache.get(convertkit:448263086)',
2021-09-14T19:39:30.765028937Z app[ac202b5f] ams [info]         type: 'cache_read',
2021-09-14T19:39:30.765627376Z app[ac202b5f] ams [info]         time: 1.0343770012259483
2021-09-14T19:39:30.765813248Z app[ac202b5f] ams [info]       }
2021-09-14T19:39:30.765998060Z app[ac202b5f] ams [info]     ],
2021-09-14T19:39:30.766543400Z app[ac202b5f] ams [info]     convertkit_and_discord_read: [
2021-09-14T19:39:30.766725155Z app[ac202b5f] ams [info]       {
2021-09-14T19:39:30.767400985Z app[ac202b5f] ams [info]         name: 'getUserInfo in root loader',
2021-09-14T19:39:30.768080859Z app[ac202b5f] ams [info]         type: 'convertkit_and_discord_read',
2021-09-14T19:39:30.768612511Z app[ac202b5f] ams [info]         time: 2.0900500006973743
2021-09-14T19:39:30.768837270Z app[ac202b5f] ams [info]       }
2021-09-14T19:39:30.768963405Z app[ac202b5f] ams [info]     ]
2021-09-14T19:39:30.769094213Z app[ac202b5f] ams [info]   }
2021-09-14T19:39:30.769295465Z app[ac202b5f] ams [info] }
2021-09-14T19:39:30.979118367Z app[ac202b5f] ams [info] GET /calls/record/new?_data=root 200 - - 226.777 ms
2021-09-14T19:39:30.988703576Z app[ac202b5f] ams [info] GET /calls/record/new?_data=routes%2Fcalls.record%2Fnew 200 - - 206.982 ms
2021-09-14T19:39:31.003902408Z app[ac202b5f] ams [info] GET /calls/record/new?_data=routes%2Fcalls.record 200 - - 216.706 ms
2021-09-14T19:39:32.760331365Z app[f29bb9a4] dfw [info] POST /calls/record/new?_data=routes%2Fcalls.record%2Fnew - - - - ms

What more information do you need from me to figure out what’s going on? Thanks!

I saw my thread mentioned - it’s worth mentioning that I believe both @kentcdodds and my own apps are using Remix!

2 Likes

Ahhh I’m glad you posted this! We have much better error logs for these kinds of errors, so we can see exactly what caused it.

This error was the result of a failure to read the body from the request. We’re going to try to reproduce the problem. It could be a bug on our end, it might be something specific about how a browser is interacting with our proxy. It’s probably not your app.

Do you think it’s possible the user that hit this has a browser extension running that could modify requests, or is potentially running through a proxy server of some kind?

I think it’s possible, but unlikely. I could ask them to try in incognito mode though.

If they can try in icognito with extensions disabled, that’ll help. We haven’t been able to make this happen yet.

He’s in Amsterdam and told me he’ll try tomorrow.

1 Like

He tried with all extensions disabled in incognito mode and it didn’t work still.

I’m digging into this.

The “true” error we see internally is an error reading the request body between our servers, sometimes. We were missing some instrumentation (I’ve just added it) and the next time this happens I should be able to tell what’s wrong and probably fix it with a few tweaks.

It does appear to be related to latency. MAA in particular is pretty far away from your primary (I assume).

How big was the request body? We buffer up to 1 MiB. If the body is less or equal to that, it should be safe to replay under any circumstances (or else it could be a bug).

If the body is bigger than 1 MiB, then we’ll wait up to 3 seconds before starting to read the body OR until we get response headers from either our own server or your instance. This means some “uploads” may take some time to start, but it ensures it can be replayed.

Now that I think about it, we’re not resetting that timer on replays. I’ll work on that this morning (in addition to monitoring body errors for your app).

Yeah, the body can be several MBs for this particular request.

My friend did it again and it took longer to fail he said, but it still failed. Hopefully you see what you need to see now. Here’s what I have in my logs:

2021-09-15T14:11:32.856187689Z proxy ams [error] error.code=1002 error.message="No suitable (healthy) instance found to handle request" request.method="POST" request.url="https://kent.dev/calls/record/new?_data=routes%2Fcalls.record%2Fnew" request.id="01FFMWQ263QWPBVHQ6ET2Y2Y3N" response.status=502 

Based on the error message I wondered whether there was an instance health issue, but my fly status gives me this:

~/code/remix-kentcdodds (main) 👨‍🔬
$ fly status
App
  Name     = withered-frost-3196          
  Owner    = personal                     
  Version  = 554                          
  Status   = running                      
  Hostname = withered-frost-3196.fly.dev  

Instances
ID       TASK VERSION REGION DESIRED STATUS  HEALTH CHECKS      RESTARTS CREATED    
ec2b50ab app  554     maa    run     running 1 total, 1 passing 0        6h12m ago  
ac202b5f app  554     ams    run     running 1 total, 1 passing 0        21h23m ago 
ef8e02d4 app  554     maa    run     running 1 total, 1 passing 0        21h23m ago 
fc169d58 app  554     syd    run     running 1 total, 1 passing 0        21h23m ago 
0bd9d4c6 app  554     hkg    run     running 1 total, 1 passing 0        21h23m ago 
43ff6e3a app  554     scl    run     running 1 total, 1 passing 0        21h23m ago 

~/code/remix-kentcdodds (main) 👨‍🔬
$ fly status -a kcd-postgres
WARN app flag 'kcd-postgres' does not match app name in config file 'withered-frost-3196'
? Continue using 'kcd-postgres' Yes
App
  Name     = kcd-postgres          
  Owner    = personal              
  Version  = 22                    
  Status   = running               
  Hostname = kcd-postgres.fly.dev  

Instances
ID       TASK VERSION REGION DESIRED STATUS            HEALTH CHECKS      RESTARTS CREATED              
bf35b1ba app  22      maa    run     running (replica) 3 total, 3 passing 0        53m30s ago           
7b00ecb0 app  22      dfw    run     running (leader)  3 total, 3 passing 0        6h7m ago             
ee9797cf app  22      dfw    run     running (replica) 3 total, 3 passing 0        6h8m ago             
3fe4347a app  22      scl    run     running (replica) 3 total, 3 passing 2        2021-09-12T06:02:46Z 
9968616e app  22      hkg    run     running (replica) 3 total, 3 passing 0        2021-09-10T22:37:47Z 
9ec72ecf app  22      ams    run     running (replica) 3 total, 3 passing 1        2021-09-02T18:33:42Z 
8445923f app  22      syd    run     running (replica) 3 total, 3 passing 1        2021-08-21T17:23:40Z 

~/code/remix-kentcdodds (main) 👨‍🔬
$ fly status -a kcd-redis   
WARN app flag 'kcd-redis' does not match app name in config file 'withered-frost-3196'
? Continue using 'kcd-redis' Yes
App
  Name     = kcd-redis          
  Owner    = personal           
  Version  = 10                 
  Status   = running            
  Hostname = kcd-redis.fly.dev  

Instances
ID       TASK VERSION REGION DESIRED STATUS  HEALTH CHECKS      RESTARTS CREATED              
1a9f347b app  10      dfw    run     running 1 total, 1 passing 0        6h8m ago             
854108fa app  10      scl    run     running 1 total, 1 passing 0        2021-09-12T06:02:45Z 
0393a9c9 app  10      hkg    run     running 1 total, 1 passing 0        2021-09-10T21:13:21Z 
c6c55a6b app  10      maa    run     running 1 total, 1 passing 0        2021-09-08T12:06:54Z 
2956469e app  10      ams    run     running 1 total, 1 passing 0        2021-08-24T17:09:21Z 
c4dbe6d1 app  10      syd    run     running 1 total, 1 passing 0        2021-08-06T23:43:25Z 

I see the error. Your app asked for a replay in DFW, but you have no instance there. So it kept retrying (that’s why it was long) and eventually gave up because it couldn’t find a “suitable” instance to send the request to.

Edit: I’m aware this is a different issue than before. Pretty sure you had an instance in DFW before this.

@kentcdodds see if it works now? I looked and we didn’t log any 502s since before the 10th.

Not sure how that happened, but @kurt suggested that I run: fly scale count 6 --max-per-region=1 and now my status is:

~/code/remix-kentcdodds (main) 👨‍🔬
$ fly status
App
  Name     = withered-frost-3196          
  Owner    = personal                     
  Version  = 555                          
  Status   = running                      
  Hostname = withered-frost-3196.fly.dev  

Instances
ID       TASK VERSION REGION DESIRED STATUS  HEALTH CHECKS      RESTARTS CREATED    
b7274be3 app  555     maa    run     running 1 total, 1 passing 0        9m37s ago  
8c01af67 app  555     dfw    run     running 1 total, 1 passing 0        9m37s ago  
ac202b5f app  555     ams    run     running 1 total, 1 passing 0        21h46m ago 
fc169d58 app  555     syd    run     running 1 total, 1 passing 0        21h46m ago 
43ff6e3a app  555     scl    run     running 1 total, 1 passing 0        21h46m ago 
0bd9d4c6 app  555     hkg    run     running 1 total, 1 passing 0        21h46m ago 

I’ll ask my friend to try again.

We should get better logs for the next error, assuming there is one.

Without that --max-per-region set, the scheduler does its best to spread VMs out but will occasionally put two per region. This is fine if you’re not trying to replay, really bad if you are.

We have a better setup you can use for primary/read regions but the --max-per-region should be fine for now.

1 Like

@kentcdodds I’ve noticed more errors for your app. I just deployed a branch that should help (in MAA, AMS and FRA), I found a few issues with the way we read the bodies (sometimes prematurely).

If your friend is still around, can you have them try it from amsterdam? If it’s possible to reproduce with a curl call, I’d like that.

My Amsterdam friend said it worked for him :slight_smile:

1 Like

Just had someone in France have the same issue. Got this in the logs:

2021-09-15T18:37:38.626047107Z proxy[995decfc] fra [error] error.code=1 error.message="Undocumented" request.method="POST" request.url="https://kent.dev/calls/record/new?_data=routes%2Fcalls.record%2Fnew" request.id="01FFNC01ATJTVC37G60P84E3J9" response.status=502 

Don’t know if it’s helpful, but I don’t have an instance in the fra region. @kurt said that’s just where fly received the request, not where my server that handled it was running :man_shrugging:

I got another one of these Undocumented issues:

2021-09-18T16:47:06.708519083Z proxy[2ce4f113] cdg [error] error.code=1 error.message="Undocumented" request.method="POST" request.url="https://kent.dev/calls/record/new?_data=routes%2Fcalls.record%2Fnew" request.id="01FFWWVK3ASSNNEP8MM2ZNWN17" response.status=502 

Ok we’ve tracked down the cause of this error. We can likely build a workaround, but it happens in some instances when:

  1. Your app receives an HTTP request in one region
    a. Usually with a file upload or otherwise large (>64KB probably) body
  2. It reads the whole request body
  3. Then responds with a replay

We think this is exacerbated by connection speed, it works fine for some browsers / connections, errors on others.

This is problematic because many HTTP frameworks eagerly read a request body (which they really shouldn’t but that’s not your fault).

To prevent these errors, for now, you can disable request body reading. In an Express app, you might be able to do that by returning a replay with middleware that runs before this:

app.use(express.json())

We should be able to engineer around this for bodies up to a certain size, but in general things will behave better if the app doesn’t read the body. If you can prevent the app from reading the body, it’ll respond faster, and then replay will work fine with even very large bodies.

I may be able to just replay all non-GET requests before handling them at all :thinking: