One issue with platforms like Heroku is the opaque routing layer. You get a few general metrics from it, but lack visibility into or control over things like:
chosen routing algorithm (random afaik)
websocket connections (only logged when a connection completes)
headers (not loggable)
time series metrics (none available generally)
It would be interesting to know if we could get insight into any or all of these in the future. One header that does get passed on a few platforms is X-Request-Start, which helps show whether requests are queued before the VM layer. Anyway, just thought I'd raise this topic and see what your thoughts are on routing and visibility.
We've been thinking about logging HTTP information per-request. We don't even do that right now to save on logging.
We mostly have 2 algorithms, they're not chosen dynamically though. We do "power of 2 random choices" out of healthy, nearby, instances.
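For anyone unfamiliar with the term, here is a rough sketch of the general "power of 2 random choices" technique, not the actual proxy code. The `Instance` fields, the `load` metric (think in-flight connections), and the `pick_instance` name are all assumptions for illustration, and the list passed in is assumed to be already narrowed down to nearby instances.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    healthy: bool
    load: int  # assumed load metric, e.g. in-flight connections


def pick_instance(instances, rng=random):
    """Pick two healthy instances at random and return the less loaded one."""
    candidates = [i for i in instances if i.healthy]
    if not candidates:
        return None  # nothing healthy to route to
    if len(candidates) == 1:
        return candidates[0]
    # Sample two distinct candidates, then compare only those two.
    a, b = rng.sample(candidates, 2)
    return a if a.load <= b.load else b


if __name__ == "__main__":
    pool = [
        Instance("a", True, 5),
        Instance("b", True, 1),
        Instance("c", False, 0),  # unhealthy, never chosen
    ]
    print(pick_instance(pool).name)
```

Comparing just two random candidates avoids sending every request to whichever instance currently looks least loaded, while still steering traffic away from busy ones.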
I'd be happy to add logging for this. What are you looking for specifically?
If this were to happen, would you want to specify which ones to log, or do you want all of them? For some apps, that would mean tens of thousands of logs per second.
X-Request-Start is interesting. I'd be happy to add that. This might be encoded in our request IDs too, however I'm not sure if it's monotonic or since epoch. I'll have to check.
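To make the use case concrete, this is roughly how an app or APM agent could turn X-Request-Start into a queue time. It is only a sketch under assumptions: the header is taken to be a Unix timestamp since epoch (not a monotonic clock), optionally prefixed with `t=`, and the unit (seconds, milliseconds, or microseconds) is guessed from the magnitude. The `queue_time_ms` name is made up for this example.

```python
import time


def queue_time_ms(header_value, now=None):
    """Estimate ms spent between the router and the app, or None if unparseable."""
    if now is None:
        now = time.time()  # seconds since epoch
    raw = header_value.strip()
    if raw.startswith("t="):  # some proxies prefix the value with "t="
        raw = raw[2:]
    try:
        ts = float(raw)
    except ValueError:
        return None
    # Heuristic unit detection (an assumption, not a standard):
    # epoch seconds are ~1e9, milliseconds ~1e12, microseconds ~1e15.
    if ts > 1e14:
        ts /= 1e6
    elif ts > 1e11:
        ts /= 1e3
    # Clamp at zero so small clock skew does not produce negative queue times.
    return max(0.0, (now - ts) * 1000.0)


if __name__ == "__main__":
    started = str(int((time.time() - 0.05) * 1000))  # pretend the router stamped 50 ms ago
    print(queue_time_ms(started))
```

This only works if the router's clock and the VM's clock are reasonably in sync, which is why a since-epoch value matters here.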
I love posts like this! There's a lot of quick things we can add that can make your life easier.
One thing I was going to add soon is: logging error statuses produced by our proxy. Mostly 502s, but we also produce some 503s. We have reasons for most of these and should expose that to our users, via logs.
Fair enough. I think "off by default" is reasonable but should be possible to enable.
I'd want to be able to specify this, but it's not that important. The use case I've had in the past is debugging requests that never make it to the VMs (for whatever reason) or are rejected by app-level rate limiting.
Nice! Will check that out.
Cool - that can work. Using a common default would make things "just work" with many APM services like ScoutAPM out of the box: Scout APM Documentation
Good to know - will post some more from my PaaS wish/complaint list.
This makes sense. One thing that I think Heroku did well was assign an error code to these different conditions so they can be tracked independently in log analyzers.
Websocket info in the time series metrics is a good idea. Right now you end up seeing weird 95th percentile metrics on HTTP requests from apps that use websockets. Splitting those out could be really nice.
@jsierles I've deployed a change, if you upgrade flyctl, showing proxy errors (502s and some 503s) and their reasons (not all documented yet) in app logs.
Hopefully this helps everybody understand a bit more when errors occur. Usually it's because the connection was abruptly severed between us and the app.
I plan on adding X-Request-Start soon. That's a much smaller feature, sounds useful too.