Routing layer visibility

One issue with platforms like Heroku is the opaque routing layer. You get a few general metrics from it, but lack visibility into or control over things like:

  • chosen routing algorithm (random afaik)
  • websocket connections (only logged when a connection completes)
  • headers (not loggable)
  • time series metrics (none available generally)

It would be interesting to know if we could get insight into any or all of these in the future. One header that does get passed on a few platforms is X-Request-Start, which helps you understand whether requests are being queued before they reach the VM layer. Anyway, just thought I’d raise this topic and see what your thoughts are on routing and visibility.

We’ve been thinking about logging HTTP information per-request. We don’t do that at all right now, to save on log volume.

We mostly have 2 algorithms, though they’re not chosen dynamically. We do “power of 2 random choices” out of healthy, nearby instances.
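
For anyone unfamiliar with it, “power of 2 random choices” can be sketched roughly as follows. This is only an illustration of the general technique, not the proxy’s actual code: the Instance type, its load field (imagined here as in-flight requests), and pick_instance are hypothetical names, and the “nearby” filter is left out for brevity.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    healthy: bool
    load: int  # hypothetical load measure, e.g. in-flight requests


def pick_instance(instances, rng=random):
    """Sample two distinct healthy instances and return the less loaded one."""
    candidates = [i for i in instances if i.healthy]
    if not candidates:
        raise LookupError("no healthy instances available")
    if len(candidates) == 1:
        return candidates[0]
    # Two random picks instead of scanning every instance for the minimum.
    a, b = rng.sample(candidates, 2)
    return a if a.load <= b.load else b
```

Comparing only two random candidates avoids tracking a global minimum, yet with three or more healthy instances the most loaded one can never win a comparison.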

I’d be happy to add logging for this. What are you looking for specifically?

If this were to happen, would you like to specify which ones to log, or do you want all of them? For some apps, that would mean tens of thousands of logs per second.

We do offer Prometheus time series for our proxy: Metrics on Fly · Fly

X-Request-Start is interesting. I’d be happy to add that. This might be encoded in our request IDs too :thinking: however I’m not sure if it’s monotonic or since epoch. I’ll have to check.

I love posts like this! There’s a lot of quick things we can add that can make your life easier.

One thing I was going to add soon is: logging error statuses produced by our proxy. Mostly 502s, but we also produce some 503s. We have reasons for most of these and should expose that to our users, via logs.

Fair enough. I think ‘off by default’ is reasonable but should be possible to enable.

I’d want to be able to specify this, but it’s not that important. The use case I’ve had in the past is debugging requests that never make it to the VMs (for whatever reason) or are rejected by app-level rate limiting.

Nice! Will check that out.

Cool - that can work. Using a common default would make things ‘just work’ with many APM services like ScoutAPM out of the box: Help Docs ~ Scout APM

Good to know - will make some more from my PaaS wish/complaint list :laughing:

This makes sense. One thing that I think Heroku did well was assign an error code to these different conditions so they can be tracked independently in log analyzers.

A few things I missed. For logging websockets, it would be helpful to see both start and end, since these requests can live for a long time.

Also it would be interesting to see open websocket count in the time series metrics, if possible.

Websocket info in the time series metrics is a good idea. Right now you end up seeing weird 95th percentile metrics on HTTP requests from apps that use websockets. Splitting those out could be really nice.

@joshua I’ve deployed a change, if you upgrade flyctl, showing proxy errors (502s and some 503s) and their reasons (not all documented yet) in app logs.

Hopefully this helps everybody understand a bit more when errors occur. Usually it’s because the connection was abruptly severed between us and the app.

I plan on adding X-Request-Start soon. That’s a much smaller feature, and it sounds useful too.

I’ve added X-Request-Start: t=<microseconds since epoch>.
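
For anyone who wants to read the header directly in their app rather than through an APM agent, here is a minimal sketch of turning it into a queue-time estimate. It assumes the t=<microseconds since epoch> format described above; queue_time_ms is a hypothetical helper name, and the result is only as accurate as the clock agreement between the proxy and the VM.

```python
import time


def queue_time_ms(header_value, now_s=None):
    """Estimate ms between the proxy stamping the request and the app seeing it.

    Returns None when the header is missing or not in the t=<microseconds> form.
    """
    if not header_value or not header_value.startswith("t="):
        return None
    try:
        start_us = int(header_value[2:])
    except ValueError:
        return None
    if now_s is None:
        now_s = time.time()
    # Clamp at zero so small clock skew never yields a negative queue time.
    return max(0.0, now_s * 1000.0 - start_us / 1000.0)
```

In a request handler you would pass in the raw header value and record the result as a metric or log field.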

Nice! I’ll test it out with ScoutAPM.