Timeout on new connections

I’m not sure what happened, but I observed this very concerning behaviour on new connections from 11:10-11:20 PST on traffic that should have been running out of Seattle.

  1. Existing websocket connections worked.
  2. Any new connections would time out. ping showed 10 ms; curl would hang and never open the page.
  3. I checked from three different machines and two distinct IP addresses.
  4. We are nowhere close to the soft limit or hard limit.

The following log entries show that the timed-out requests actually made it to the backend. I have not included the existing websocket connections, which worked correctly.

2022-08-03T12:33:20.790 app[fbcb5e54] sea [info] {"app":"APPNAME","duration":334.173,"guid":"eba7d3d0-25b4-4e2e-a6ea-587d726103c6","ip":"REDACTED1","level":"info","method":"GET","mfa":"StructuredLogger.Plugs.Logger.call/2","params":{"id":"3"},"path":"REDACTED","pid":"#PID<0.3222.0>","request_id":"FwfVvisnviuvtScAAA_x","status":200,"time":"2022-08-03T12:33:20.788743Z","user_id":8}
2022-08-03T18:15:29.811 app[fbcb5e54] sea [info] {"app":"APPNAME","duration":1301.2,"guid":"eba7d3d0-25b4-4e2e-a6ea-587d726103c6","ip":"REDACTED1","level":"info","method":"GET","mfa":"StructuredLogger.Plugs.Logger.call/2","params":{},"path":"/","pid":"#PID<0.4796.0>","request_id":"FwfoabqFpi3NacYAABKh","status":200,"time":"2022-08-03T18:15:29.809085Z","user_id":8}
2022-08-03T18:15:53.484 app[fbcb5e54] sea [info] {"app":"APPNAME","duration":1298.474,"guid":"eba7d3d0-25b4-4e2e-a6ea-587d726103c6","ip":"REDACTED1","level":"info","method":"GET","mfa":"StructuredLogger.Plugs.Logger.call/2","params":{},"path":"/","pid":"#PID<0.4800.0>","request_id":"Fwfobz3E2vNkc8QAABMx","status":200,"time":"2022-08-03T18:15:53.483158Z","user_id":8}
2022-08-03T18:15:57.050 app[fbcb5e54] sea [info] {"app":"APPNAME","duration":22.594,"guid":"3840806c-ad65-45e7-96e7-ea7f42ff1a28","ip":"REDACTED1","level":"info","method":"GET","mfa":"StructuredLogger.Plugs.Logger.call/2","params":{},"path":"/","pid":"#PID<0.4802.0>","request_id":"FwfocF5aQNHxEqwAABOR","status":200,"time":"2022-08-03T18:15:57.048901Z"}
2022-08-03T18:16:47.332 app[fbcb5e54] sea [info] {"app":"APPNAME","duration":24.982,"guid":"5d2833f6-f232-486b-a303-b4fcd33eb614","ip":"REDACTED1","level":"info","method":"GET","mfa":"StructuredLogger.Plugs.Logger.call/2","params":{},"path":"/","pid":"#PID<0.4807.0>","request_id":"FwfofBM27hzHTLUAABPh","status":200,"time":"2022-08-03T18:16:47.330283Z"}
2022-08-03T18:16:53.294 app[fbcb5e54] sea [info] {"app":"APPNAME","duration":1161.323,"guid":"eba7d3d0-25b4-4e2e-a6ea-587d726103c6","ip":"REDACTED1","level":"info","method":"GET","mfa":"StructuredLogger.Plugs.Logger.call/2","params":{},"path":"/","pid":"#PID<0.4811.0>","request_id":"FwfofTLjXWYAAv4AABQx","status":200,"time":"2022-08-03T18:16:53.292997Z","user_id":8}
2022-08-03T18:17:44.122 app[fbcb5e54] sea [info] {"app":"APPNAME","duration":1166.828,"guid":"eba7d3d0-25b4-4e2e-a6ea-587d726103c6","ip":"REDACTED1","level":"info","method":"GET","mfa":"StructuredLogger.Plugs.Logger.call/2","params":{},"path":"/","pid":"#PID<0.4818.0>","request_id":"FwfoiQgkw2pbgPYAABVR","status":200,"time":"2022-08-03T18:17:44.120995Z","user_id":8}
2022-08-03T18:18:01.360 app[fbcb5e54] sea [info] {"app":"APPNAME","duration":1436.48,"guid":"eba7d3d0-25b4-4e2e-a6ea-587d726103c6","ip":"REDACTED1","level":"info","method":"GET","mfa":"StructuredLogger.Plugs.Logger.call/2","params":{},"path":"/","pid":"#PID<0.4821.0>","request_id":"FwfojPtz1a3vDLgAABWx","status":200,"time":"2022-08-03T18:18:01.357572Z","user_id":8}
2022-08-03T18:18:18.699 app[fbcb5e54] sea [info] {"app":"APPNAME","duration":1352.14,"guid":"eba7d3d0-25b4-4e2e-a6ea-587d726103c6","ip":"REDACTED1","level":"info","method":"GET","mfa":"StructuredLogger.Plugs.Logger.call/2","params":{},"path":"REDACTED","pid":"#PID<0.4824.0>","request_id":"FwfokQoR0Ju3fJkAABbR","status":200,"time":"2022-08-03T18:18:18.699022Z","user_id":8}
2022-08-03T18:18:30.653 app[fbcb5e54] sea [info] {"app":"APPNAME","duration":20.785,"guid":"3840806c-ad65-45e7-96e7-ea7f42ff1a28","ip":"REDACTED1","level":"info","method":"GET","mfa":"StructuredLogger.Plugs.Logger.call/2","params":{},"path":"/","pid":"#PID<0.4827.0>","request_id":"FwfolCHidM3rKa8AABbx","status":200,"time":"2022-08-03T18:18:30.651424Z"}
2022-08-03T18:19:18.132 app[fbcb5e54] sea [info] {"app":"APPNAME","duration":763.878,"guid":"eba7d3d0-25b4-4e2e-a6ea-587d726103c6","ip":"REDACTED1","level":"info","method":"GET","mfa":"StructuredLogger.Plugs.Logger.call/2","params":{},"path":"/","pid":"#PID<0.4834.0>","request_id":"FwfonwOJJBNs0_oAABdx","status":200,"time":"2022-08-03T18:19:18.130693Z","user_id":8}
2022-08-03T18:19:18.725 app[fbcb5e54] sea [info] {"app":"APPNAME","duration":1357.279,"guid":"eba7d3d0-25b4-4e2e-a6ea-587d726103c6","ip":"REDACTED1","level":"info","method":"GET","mfa":"StructuredLogger.Plugs.Logger.call/2","params":{},"path":"REDACTED","pid":"#PID<0.4835.0>","request_id":"FwfonwOT_GgTd6MAABeB","status":200,"time":"2022-08-03T18:19:18.724786Z","user_id":8}
2022-08-03T18:19:51.407 app[fbcb5e54] sea [info] {"app":"APPNAME","duration":22.718,"guid":"b81eec14-3a09-4011-892e-74a77ede659f","ip":"REDACTED1","level":"info","method":"GET","mfa":"StructuredLogger.Plugs.Logger.call/2","params":{},"path":"/","pid":"#PID<0.4840.0>","request_id":"Fwfopu8jK292Z0YAABgh","status":200,"time":"2022-08-03T18:19:51.406342Z"}
2022-08-03T18:21:09.349 app[fbcb5e54] sea [info] {"app":"APPNAME","duration":24.753,"guid":"268789a9-9153-4e38-b7eb-125888a801b7","ip":"REDACTED2","level":"info","method":"GET","mfa":"StructuredLogger.Plugs.Logger.call/2","params":{},"path":"/","pid":"#PID<0.4870.0>","request_id":"FwfouRSstR3Czs8AABnR","status":200,"time":"2022-08-03T18:21:09.347567Z"}
2022-08-03T18:21:31.961 app[fbcb5e54] sea [info] {"app":"APPNAME","duration":895.593,"guid":"eba7d3d0-25b4-4e2e-a6ea-587d726103c6","ip":"REDACTED1","level":"info","method":"GET","mfa":"StructuredLogger.Plugs.Logger.call/2","params":{},"path":"/","pid":"#PID<0.4874.0>","request_id":"FwfoviSI46gJJCYAABqR","status":200,"time":"2022-08-03T18:21:31.959326Z","user_id":8}
2022-08-03T18:21:38.043 app[fbcb5e54] sea [info] {"app":"APPNAME","duration":23.013,"guid":"3840806c-ad65-45e7-96e7-ea7f42ff1a28","ip":"REDACTED1","level":"info","method":"GET","mfa":"StructuredLogger.Plugs.Logger.call/2","params":{},"path":"/","pid":"#PID<0.4882.0>","request_id":"Fwfov8MhnAjba0EAABsx","status":200,"time":"2022-08-03T18:21:38.042525Z"}
2022-08-03T18:23:00.714 app[fbcb5e54] sea [info] {"app":"APPNAME","duration":83.895,"guid":"eba7d3d0-25b4-4e2e-a6ea-587d726103c6","ip":"REDACTED1","level":"info","method":"GET","mfa":"StructuredLogger.Plugs.Logger.call/2","params":{},"path":"/","pid":"#PID<0.4909.0>","request_id":"Fwfo0v8Dt9vPA8MAABwh","status":200,"time":"2022-08-03T18:23:00.712459Z","user_id":8}
2022-08-03T18:28:49.487 app[fbcb5e54] sea [info] {"app":"APPNAME","duration":761.722,"guid":"eba7d3d0-25b4-4e2e-a6ea-587d726103c6","ip":"REDACTED1","level":"info","method":"GET","mfa":"StructuredLogger.Plugs.Logger.call/2","params":{},"path":"/","pid":"#PID<0.4961.0>","request_id":"FwfpJAsXJX3qsQcAAB-h","status":200,"time":"2022-08-03T18:28:49.485930Z","user_id":8}
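
For anyone reading along: each entry above is a log prefix followed by a structured JSON payload. A small sketch for pulling out the fields that matter here (the format is assumed from these lines, nothing more):

```python
import json

def parse_entry(line):
    """Split off the log prefix and decode the structured JSON payload."""
    payload = line[line.index("{"):]  # the JSON starts at the first brace
    entry = json.loads(payload)
    return {k: entry.get(k)
            for k in ("time", "path", "status", "duration", "guid", "user_id")}

sample = '2022-08-03T18:15:57.050 app[fbcb5e54] sea [info] {"app":"APPNAME","duration":22.594,"guid":"3840806c-ad65-45e7-96e7-ea7f42ff1a28","ip":"REDACTED1","level":"info","method":"GET","mfa":"StructuredLogger.Plugs.Logger.call/2","params":{},"path":"/","pid":"#PID<0.4802.0>","request_id":"FwfocF5aQNHxEqwAABOR","status":200,"time":"2022-08-03T18:15:57.048901Z"}'
print(parse_entry(sample))
```

Note that every entry is status 200 with a normal duration, which is the whole point: the backend thought these requests went fine.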

Thank you for letting us know; happy to help you look into this further. I have some initial questions that might help us narrow things down on our end.

  • First, just to make sure I’ve got the timeline right: you observed this issue from 11:10-11:20 PST, but haven’t seen it since? Did it recover on its own, or did you fix it while troubleshooting?
  • As you’re probably aware, app[fbcb5e54] in those log lines indicates that those logs are from a particular instance (fbcb5e54). Did you see these timeouts on other instances for your app?

I was in the middle of debugging from different locations and then it recovered. I have not encountered it since.

Yes, I only had two regions set (sea and maa), and all the traffic was routing to sea from the locations I tested from before the self-recovery.

Thanks for confirming all of that. Taking a closer look, we didn’t find anything that would indicate an issue with traffic on its way to your app’s instances. That said, we did notice that the app instance in ams went unhealthy a few times and was restarted/replaced, which might help explain why you only saw traffic to Seattle while the timeouts were happening.

While you’ve probably done some of this already, there are a few things you can check that could help pinpoint this further. First, fly status --all will show you all instances and restarts from the past 7 days. You can then check out an individual instance’s logs with fly vm status. To investigate further, you might also want to check out the fly metrics dashboard for the app, where you can observe time-series data for a number of metrics that could come in handy here (fly dashboard metrics -a $app-name will even take you straight to the browser UI).

I hope some of this is useful! Let us know if there’s anything else you’d like us to weigh in on :slightly_smiling_face:

Hey Eli, I agree: there was no issue with traffic going to the instances. In fact, all traffic should have been routed to sea, since I was only testing from the US. The maa instance and its restarts are a red herring.

The issue is that no traffic was being returned to the client unless a websocket connection was already established.

To summarize, there were new connection requests to which the app responded. However, none of those responses made it back to the client for 10 minutes. As far as I can tell, this is a really bad thing that just happened. If it had been 1 minute and then recovery, it would all be good. But 10 minutes is something that is tough to brush off. And if it just happened to me, you can bet it is happening elsewhere and is not being reported.

I would suggest continuing to debug by looking at the traffic going out from the proxy to the client. We already know that existing websocket connections were working, so we can rule out some types of networking issues. That leaves two spots as far as I can tell:

  1. Did the response from the app make it back to the proxy?
  2. Did the proxy send a response back to the client?
  3. A myriad of other networking issues that I would rather not contemplate.

Update: editing for additional clarity

@tj1 we see no indication of issues in our proxies. Can you explain what those logs mean? I’m having a hard time gleaning any meaning from them.

Hmmm…ok. I should do a better job of communicating.

The logs from 18:15 - 18:22 are all new connection requests made to URI “/” that were received and processed correctly by fbcb5e54. However, none of these responses made it back to the downstream client (Firefox, Chrome, curl). The downstream clients were a laptop and remote VPSes, i.e. I tested from multiple locations.

Existing websocket connections to fbcb5e54 were working correctly.

For the logs listed, if you are seeing those requests being processed correctly by the proxy, that would leave a myriad of networking weirdness which I’m not even sure can be monitored.

As far as I can tell, the request went:

  1. Client
  2. Established connection to proxy
  3. Proxy routed to fbcb5e54
  4. fbcb5e54 responded.

The next steps I would expect are:

  5. Proxy receives response from fbcb5e54.
  6. Proxy returns response to client.
  7. Client receives response and closes connection.

Can you or someone confirm that those requests in the logs were seen and returned correctly by the proxy (steps 5 and 6)? If the problem is at 7, that is yuck.

We don’t log requests through the proxy (there are way too many).

We do log proxy errors, timeouts, etc. We don’t see any proxy generated errors for the app, except for that one failing vm in Amsterdam.

What happened on the client while you were waiting for a response? Did you get an error, a timeout, or something else?

I glanced at your metrics and it looks like the connection counts shot up on those VMs around that time. Are you sure you weren’t hitting soft limits? Soft limits in Seattle will actually send connections to other regions.

Gah. Well, I really hope this is the tail of a tail of a tail. I don’t envy the job of building a platform, since any tiny issue has an inordinate blast radius. Not sure how you’d find out if you don’t log some percentage of the queries. Otherwise, all you’ve got is random user reports, which may be of suspect reliability. If there is even a report at all.

Semi-retired now, but I think the last time I quit tech I was contemplating opening a small tea shop instead. I’m a big fan of fly, so may you have the blessing of more patience.

Hanging response. I was trying from different machines, so I don’t recall if any of the requests officially timed out or not. I canceled and retried from most locations as I was trying to figure out what was happening.

Nah, I redeployed a new environment last night in preparation for a cutover. When this happened I had autoscale off, a count of 2, maybe 5-10 open connections, and the soft limit set to 45. After recovery, I looked through all the logs and made the post.

I think soft limits/hard limits are also a red herring. At the end of the day, the client sent a request, it was received by the app server, and the response was never received by the client. I'm not sure what role the limits would play. I cancelled most of the requests, so there should be some connection resets in the proxy metrics, if those are being measured.

Afterwards, I saw that @wjordan had deployed a change to how connections were being counted for websockets, so I figured I should re-test if autoscaling still works as I expected it to (yes, it does).