Routing layer visibility

One issue with platforms like Heroku is the opaque routing layer. You get a few general metrics from it, but lack visibility into or control over things like:

  • chosen routing algorithm (random afaik)
  • websocket connections (only logged when a connection completes)
  • headers (not loggable)
  • time series metrics (none available generally)

It would be interesting to know if we could get insight into any or all of these in the future. One header that does get passed on a few platforms is X-Request-Start, which helps you understand whether requests are queued before the VM layer. Anyway, just thought I’d raise this topic and see what your thoughts are on routing and visibility.

We’ve been thinking about logging HTTP information per-request. We don’t even do that right now to save on logging.

We mostly have 2 algorithms, though they’re not chosen dynamically. We do “power of 2 random choices” out of healthy, nearby instances.
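For anyone unfamiliar with the term, here is a minimal sketch of "power of 2 random choices": pick two candidates at random and send the request to the less loaded one. This is only an illustration, not the proxy's actual code. The `Instance` type, its `load` field (imagined as in-flight connections), and `pick_instance` are all hypothetical names, and the "nearby" filter is left out.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    healthy: bool
    load: int  # hypothetical load signal, e.g. in-flight connections


def pick_instance(instances, rng=random):
    """Pick an instance using "power of 2 random choices".

    Sample two healthy instances at random and return the less loaded one.
    Returns None when no healthy instance exists.
    """
    candidates = [i for i in instances if i.healthy]
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]
    a, b = rng.sample(candidates, 2)
    return a if a.load <= b.load else b
```

Compared to a purely random pick, comparing just two candidates tends to avoid the most loaded instances without needing a global view of load.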

I’d be happy to add logging for this. What are you looking for specifically?

If this were to happen, would you like to specify which ones to log, or do you want all of them? For some apps, that would mean tens of thousands of logs per second.

We do offer Prometheus time series for our proxy: Metrics on Fly.io · Fly Docs

X-Request-Start is interesting. I’d be happy to add that. This might be encoded in our request IDs too :thinking: however I’m not sure if it’s monotonic or since epoch. I’ll have to check.

I love posts like this! There are a lot of quick things we can add that can make your life easier.

One thing I was going to add soon is: logging error statuses produced by our proxy. Mostly 502s, but we also produce some 503s. We have reasons for most of these and should expose that to our users, via logs.

Fair enough. I think ‘off by default’ is reasonable, but it should be possible to enable.

I’d want to be able to specify this, but it’s not that important. The use case I’ve had in the past is debugging requests that never make it to the VMs (for whatever reason) or are rejected by app-level rate limiting.

Nice! Will check that out.

Cool - that can work. Using a common default would make things ‘just work’ with many APM services like ScoutAPM out of the box: Scout APM Documentation

Good to know - will make some more from my PaaS wish/complaint list :laughing:

This makes sense. One thing that I think Heroku did well was assign an error code to these different conditions so they can be tracked independently in log analyzers.

A few things I missed. For logging websockets, it would be helpful to see both start and end, since these requests can live for a long time.

Also it would be interesting to see open websocket count in the time series metrics, if possible.

Websocket info in the time series metrics is a good idea. Right now you end up seeing weird 95th percentile metrics on HTTP requests from apps that use websockets. Splitting those out could be really nice.
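To illustrate the skew, here is a small sketch with made-up numbers (not real measurements): 90 ordinary HTTP requests at 50 ms mixed with 10 websocket connections that each stayed open for 10 minutes. The `percentile` helper is a simple nearest-rank implementation written for this example.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Illustrative durations in milliseconds (assumed, not measured).
http_ms = [50] * 90
websocket_ms = [600_000] * 10

p95_mixed = percentile(http_ms + websocket_ms, 95)
p95_http_only = percentile(http_ms, 95)

print(p95_mixed, p95_http_only)
```

With these numbers the mixed p95 lands on a websocket duration, while the HTTP-only p95 stays at 50 ms, which is why splitting the two series apart makes the request latency metric meaningful again.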

@jsierles I’ve deployed a change that, once you upgrade flyctl, shows proxy errors (502s and some 503s) and their reasons (not all documented yet) in app logs.

Hopefully this helps everybody understand a bit more when errors occur. Usually it’s because the connection was abruptly severed between us and the app.

I plan on adding X-Request-Start soon. That’s a much smaller feature, and it sounds useful too.

I’ve added X-Request-Start: t=<microseconds since epoch>.
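If you want to use the header without an APM agent, a rough sketch of turning it into a queue time might look like this. It assumes the `t=<microseconds since epoch>` format above and that the app's clock is reasonably in sync with the proxy's. `queue_time_ms` is a hypothetical helper name, not part of any library.

```python
import time


def queue_time_ms(header_value, now_us=None):
    """Return milliseconds elapsed since the proxy stamped the request.

    Expects a value like "t=1612345678901234" (microseconds since epoch).
    Negative results caused by clock skew are clamped to 0.
    """
    if not header_value.startswith("t="):
        raise ValueError(f"unexpected X-Request-Start value: {header_value!r}")
    start_us = int(header_value[2:])
    if now_us is None:
        now_us = time.time_ns() // 1000  # current time in microseconds
    return max(0.0, (now_us - start_us) / 1000.0)
```

In an app you would call this at the very start of request handling, for example `queue_time_ms(request.headers["X-Request-Start"])`, where `request.headers` stands in for whatever your framework exposes.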

Nice! I’ll test it out with ScoutAPM.