Infrastructure dev notes

These are development notes from engineers at Fly, published as an experiment in transparency.

Proxy

Features

  • Reject requests when backhauling for an instance that’s reached its hard limit, allowing our edge proxy to retry with a different, less loaded instance.
    • This appears to have improved performance by spreading the load better
  • Now using adaptive http/2 window resizing
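One common way to make an HTTP/2 window adaptive (an assumption about the approach, not a description of our actual code) is to size it to the connection’s bandwidth-delay product, so a high-latency client can keep the pipe full without over-buffering. A minimal Rust sketch, where `h2_window` and its parameters are illustrative names:

```rust
/// Size the HTTP/2 flow-control window to the bandwidth-delay product,
/// clamped between the protocol's 65,535-byte default and a cap.
fn h2_window(bytes_per_sec: u64, rtt_ms: u64, max: u32) -> u32 {
    let bdp = bytes_per_sec.saturating_mul(rtt_ms) / 1000;
    bdp.min(max as u64).max(65_535) as u32
}
```

A 10 MB/s link at 100ms RTT would get a 1MB window; a slow link falls back to the default.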

Fixes

  • Enabled retrying of canceled requests (canceled at connection time)
  • Added a timeout where one was missing (e.g. creating backhaul connections)
    • This could lead to needless waiting on bad connections
  • Allow a much longer timeout for connections to end when restarting / deploying our proxy
    • This resulted in fewer broken connections on reload
  • Enabled expose-fd listeners on our haproxy stats endpoint for hitless reloads
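The backhaul-connection timeout fix can be illustrated with std’s blocking API (the real proxy is async on tokio; `dial_backhaul` is a hypothetical name):

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Bound how long we'll wait when creating a backhaul connection
/// instead of blocking indefinitely on a bad peer.
fn dial_backhaul(addr: SocketAddr, budget: Duration) -> std::io::Result<TcpStream> {
    TcpStream::connect_timeout(&addr, budget)
}
```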

Internal improvements

  • Upgraded multiple dependencies
  • Reduced the amount of reloading of our proxy configuration
  • Started using Honeycomb for tracing slow requests
    • We built our own Honeycomb client for Rust, it’s far from perfect, but it works


(This isn’t a particularly slow request, but we also submit traces for requests where there was an error.)

Virtual machines

Features

  • Upgraded the default kernel to 4.19.146
    • This fixes the “firmware bug” kernel messages on some of our newer AMD Epyc hosts

This is very neat, and could be a whole “how to build a global load balancer” article. The architectural problem we have here is that the farther an instance of our load balancer is from a VM, the harder it is to accurately know how loaded that VM is. Load fluctuates by the millisecond, and even in the best cases we have to wait a few hundred milliseconds to “see” any changes from some instances.

What ends up happening is VMs reach their hard concurrency limit (set in fly.toml) and we still send them traffic. Prior to this change, we’d queue new connections/requests locally and let the VM work through them as it could. This worked decently, but VMs still ended up with queues that wouldn’t quit.

The retry change prevents queueing in many cases. When a loaded up VM gets traffic it can’t handle now, it sends a message back to the load balancer saying “full, come back later”. The load balancer then reissues the request to another VM.
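The “full, come back later” handshake can be sketched roughly like this (names and structure are illustrative, not our actual code):

```rust
struct Instance {
    id: &'static str,
    in_flight: u32,
    hard_limit: u32, // the hard concurrency limit from fly.toml
}

impl Instance {
    /// Worker side: refuse new work once the hard limit is hit,
    /// instead of letting a queue build up behind a saturated VM.
    fn accept(&mut self) -> bool {
        if self.in_flight >= self.hard_limit {
            return false; // "full, come back later"
        }
        self.in_flight += 1;
        true
    }
}

/// Edge-proxy side: try instances in preference order (e.g. nearest
/// first) and fall through to the next candidate on a rejection.
fn dispatch(instances: &mut [Instance]) -> Option<&'static str> {
    for inst in instances.iter_mut() {
        if inst.accept() {
            return Some(inst.id);
        }
    }
    None // every candidate is at its hard limit
}
```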

This is the effect on a CPU-bound app:

What’s cool is how fast the retries can happen (at least, compared to queues backing up). Some apps are hitting their limits, then retrying hundreds of times per second.

Big week of performance improvements!

Proxy

Features

  • Added a way to target a specific app instance via a header. I’m not telling which header because this is still very much in flux.

Improvements

  • Use more realistic latency measurements between our servers when choosing where to forward a request
  • Unclogged the tokio event loop in many ways (some operations were discovered to be blocking and were moved off the main event loop).

Internal improvements

  • Ensure distributed load data is invalidated on restart (we’re using consul’s session TTL)
  • Upgraded dependencies
  • Switched from the mimalloc allocator to jemalloc

This is my favorite from last week. We abuse Consul in many ways. The most heinous thing we do is run it globally as if it’s on a single local network. That usually works fine for us, but sometimes it gets weird.

Consul includes an rtt feature that’s useful for finding the nearest node. When a request comes in for an app, we send it to the closest available VM. Consul rtt was very convenient for this … and also wrong in surprising ways.

We noticed that some requests coming in to Tokyo were being routed to Hong Kong VMs instead of Tokyo VMs. Hong Kong is 2900km from Tokyo (about 1800 miles), adding 50ms+ of dumb latency.

The culprit was Consul’s RTT metric. It was reporting that nodes in Tokyo were 100ms+ away from each other. Our routing logic naturally thought “50ms is better than 100ms so let’s go to Hong Kong”.

We ended up doing a ping tracking project we’ve been putting off for a while to improve this. Every node now pings every other node and we keep track of our own RTT, packet loss, etc. The net result is that requests end up at the lowest latency VMs they can be serviced from.
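The node selection this enables looks roughly like the following (field names and the loss penalty are assumptions, not our actual scoring):

```rust
struct NodeStats {
    name: &'static str,
    rtt_ms: f64,       // self-measured, not Consul's coordinate estimate
    packet_loss: f64,  // fraction, 0.0..=1.0
}

/// Penalize lossy links so a "fast" but flaky path doesn't win,
/// then take the node with the lowest effective RTT.
fn nearest(nodes: &[NodeStats]) -> Option<&'static str> {
    nodes
        .iter()
        .map(|n| (n.name, n.rtt_ms / (1.0 - n.packet_loss).max(0.01)))
        .min_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .map(|(name, _)| name)
}
```

With real ping data, Tokyo nodes a couple of milliseconds apart beat a 50ms hop to Hong Kong every time.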

Incidentally, it’s odd that American companies tend to classify Tokyo/Hong Kong/Singapore/Sydney as “Asia Pacific”. Those cities are nowhere close to each other!

Proxy

Features

  • Started recording request and response body timings. Coming soon to our prometheus API!

Fixes

  • HTTP/1.1 requests now use HTTP/1.1 for backhauling instead of being converted to HTTP/2, which appeared to be causing issues. HTTP/2 requests still go through the HTTP/2 backhaul for best performance.
  • Stricter parsing of the SNI extension in the TLS ClientHello message.

Improvements

  • Switched up the logic for caching SSL certificates. This should improve performance.

“Switched up the logic for caching SSL certificates. This should improve performance.”

I was noticing that SSL negotiation could take upwards of 100ms. Is that what this should help with?

It surely didn’t help.

100ms is not too bad (but not good) since that includes latency.

We have more optimisations coming before the next dev notes update.

Is the 100ms handshake from an automated test? When we first issue certificates, each edge server has to retrieve them from cache. Which means if you’re running tests on apps without activity, you need to run them multiple times to get things warmed up.

We track these pretty closely, here’s what we see from updown.io for most apps:

I’m unable to reproduce it, @kurt, but it was on a webpagetest.com test, and they do sometimes have shaky networks when measuring TTFB.

Thanks for following up with the stats from updown.io

Virtual machines

Features

  • Now detecting OOM kills from within a virtual machine (usually causing a restart)
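Detection can be as simple as watching the guest kernel log for the OOM killer’s signature line. This heuristic sketch assumes the common message prefixes, which vary by kernel version:

```rust
/// Returns true if a kernel log line looks like an OOM kill.
/// Older kernels log "Out of memory: Kill process ...", newer ones
/// "Out of memory: Killed process ..." and/or an "oom-kill:" record.
fn is_oom_kill(kmsg_line: &str) -> bool {
    kmsg_line.contains("Out of memory: Kill") || kmsg_line.contains("oom-kill:")
}
```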

Fixes

  • Private IP allocation was racy. Fixed by using a mutex and allocating IPs only one at a time
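A minimal sketch of the fix, assuming the allocator hands out sequential addresses from a private range (the real allocator’s logic and names differ):

```rust
use std::collections::HashSet;
use std::net::Ipv4Addr;
use std::sync::Mutex;

/// Serializes private-IP allocation behind a mutex so two concurrent
/// requests can never be handed the same address.
struct IpAllocator {
    state: Mutex<AllocState>,
}

struct AllocState {
    next: u32,
    in_use: HashSet<Ipv4Addr>,
}

impl IpAllocator {
    fn new(start: Ipv4Addr) -> Self {
        IpAllocator {
            state: Mutex::new(AllocState { next: u32::from(start), in_use: HashSet::new() }),
        }
    }

    /// Holding the lock for the whole check-and-insert closes the race:
    /// allocation happens one IP at a time.
    fn allocate(&self) -> Ipv4Addr {
        let mut st = self.state.lock().unwrap();
        loop {
            let ip = Ipv4Addr::from(st.next);
            st.next += 1;
            if st.in_use.insert(ip) {
                return ip;
            }
        }
    }
}
```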

Internal improvements

  • Upgraded dependencies for our kernel init program

Proxy

Internal improvements

  • Updated dependencies
  • Reduced metrics cardinality for smoother metrics ingestion / querying
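One typical way to cut metrics cardinality (an illustration, not necessarily the labels we trimmed) is to bucket high-cardinality label values, e.g. collapsing HTTP status codes into classes so a metric carries five series instead of dozens:

```rust
/// Label by status class ("2xx") rather than the exact code.
fn status_class(code: u16) -> &'static str {
    match code / 100 {
        1 => "1xx",
        2 => "2xx",
        3 => "3xx",
        4 => "4xx",
        5 => "5xx",
        _ => "other",
    }
}
```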

Virtual machines

Features

  • Allow attaching persistent volumes to VMs (more on this soon)

Fixes

  • Set localhost 127.0.0.1 in /etc/hosts by default

Improvements

  • Unmount attached volumes (persistent and ephemeral) when shutting down to prevent file system corruption
  • Attaching and executing commands inside VMs (from outside) is now smoother. We will be making this feature available soon.

Proxy

Features

  • Configurable load balancing strategies (not exposed to users yet)
    • closest (default): pick the closest, least loaded instance
    • leastload: better suited for applications with a large number of instances, this picks the least loaded instances first and then chooses the closest one out of a random selection
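The two strategies can be sketched like so (illustrative only; the real implementation includes the random sampling step, which this omits for brevity):

```rust
#[derive(Clone, Copy)]
struct Candidate {
    id: &'static str,
    rtt_ms: u32,
    load: u32,
}

/// "closest": rank by proximity first, break ties on load.
fn closest(mut pool: Vec<Candidate>) -> Option<&'static str> {
    pool.sort_by_key(|c| (c.rtt_ms, c.load));
    pool.first().map(|c| c.id)
}

/// "leastload": keep only the least-loaded instances, then take the
/// closest of that subset rather than ranking the whole fleet.
fn leastload(pool: Vec<Candidate>) -> Option<&'static str> {
    let min_load = pool.iter().map(|c| c.load).min()?;
    pool.into_iter()
        .filter(|c| c.load == min_load)
        .min_by_key(|c| c.rtt_ms)
        .map(|c| c.id)
}
```

Note how the same pool can route differently: closest favors the nearby-but-busy instance, leastload the idle-but-distant one.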

Fixes

  • Fixed a bug where TLS certificates were poisoned in cache