These are development notes from engineers at Fly. It’s an experiment in transparency.
Proxy
Features
Reject requests when backhauling for an instance that’s reached its hard limit, allowing our edge proxy to retry with a different, less loaded, instance.
This appears to have improved performance by spreading load more evenly
Now using adaptive HTTP/2 window resizing
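For context on the window resizing: adaptive flow control generally means growing each HTTP/2 window toward the connection’s bandwidth-delay product instead of using a fixed size. Here is a minimal sketch of that calculation in Go; the bounds and names are made up for illustration and are not lifted from our proxy:

```go
package main

import "fmt"

// Rough bandwidth-delay-product sizing of the kind adaptive HTTP/2 flow
// control is built on: grow each window toward what the path can keep in
// flight, clamped to protocol limits. The bounds here are illustrative.
const (
	minWindow = 64 << 10      // 64 KiB floor
	maxWindow = (1 << 31) - 1 // HTTP/2's maximum window size
)

// adaptiveWindow returns a window size close to the estimated
// bandwidth-delay product of the connection.
func adaptiveWindow(bytesPerSecond, rttSeconds float64) int64 {
	bdp := int64(bytesPerSecond * rttSeconds)
	if bdp < minWindow {
		return minWindow
	}
	if bdp > maxWindow {
		return maxWindow
	}
	return bdp
}

func main() {
	// A 100 Mbit/s path with 80 ms of RTT wants roughly 1 MB in flight.
	fmt.Println(adaptiveWindow(100e6/8, 0.080))
}
```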
Fixes
Enabled retrying canceled requests (canceled at connection time)
Added a timeout where one was missing (e.g., creating backhaul connections)
This could lead to needless waiting on bad connections
Allowed a much longer timeout for connections to end when restarting / deploying our proxy
This resulted in less connection breakage upon reload
Enabled expose-fd listeners on our haproxy stats endpoint for hitless reloads
This is very neat, and could be a whole “how to build a global load balancer” article. The architectural problem we have here is that the farther an instance of our load balancer is from a VM, the harder it is to accurately know how loaded that VM is. Load fluctuates by the millisecond, and even in the best cases we have to wait a few hundred milliseconds to “see” any changes from some instances.
What ends up happening is VMs reach their hard concurrency limit (set in fly.toml) and we still send them traffic. Prior to this change, we’d queue new connections/requests locally and let the VM work through them as it could. This worked decently, but VMs still ended up with queues that wouldn’t quit.
The retry change prevents queueing in many cases. Now, when a VM at its limit gets traffic it can’t handle, it sends a message back to the load balancer saying “full, come back later”. The load balancer then reissues the request to another VM.
What’s cool is how fast the retries can happen (at least, compared to queues backing up). Some apps are hitting their limits, then retrying hundreds of times per second.
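Hand-waving the details, the retry path boils down to something like this Go sketch. The types and the “full” signal are made up for illustration; the real proxy is not structured this way:

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Instance is a hypothetical view of one app VM from an edge proxy.
type Instance struct {
	ID    string
	RTTms float64 // measured round-trip time from this edge to the VM
	Full  bool    // stands in for "this VM is at its hard limit"
}

// errFull stands in for the "full, come back later" reply a loaded VM sends
// back over the backhaul instead of accepting the request.
var errFull = errors.New("instance at hard limit")

// send simulates forwarding one request to one instance.
func send(inst Instance) error {
	if inst.Full {
		return errFull
	}
	return nil
}

// forwardWithRetry tries instances in order of proximity and immediately
// retries the next one when a VM reports it is full, instead of queueing
// the request locally and waiting for the VM to work through its backlog.
func forwardWithRetry(instances []Instance) (string, error) {
	sort.Slice(instances, func(i, j int) bool { return instances[i].RTTms < instances[j].RTTms })
	for _, inst := range instances {
		switch err := send(inst); {
		case err == nil:
			return inst.ID, nil
		case errors.Is(err, errFull):
			continue // no local queue: just ask the next-closest VM
		default:
			return "", err
		}
	}
	return "", errors.New("all instances at hard limit")
}

func main() {
	served, err := forwardWithRetry([]Instance{
		{ID: "nrt-1", RTTms: 2, Full: true}, // closest, but at its limit
		{ID: "nrt-2", RTTms: 3},             // next closest, has capacity
		{ID: "hkg-1", RTTms: 52},
	})
	fmt.Println(served, err)
}
```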
This is my favorite from last week. We abuse Consul in many ways. The most heinous thing we do is run it globally as if it’s on a single local network. That usually works fine for us, but sometimes it gets weird.
Consul includes an rtt feature that’s useful for finding the nearest node. When a request comes in for an app, we send it to the closest available VM. Consul rtt was very convenient for this … and also wrong in surprising ways.
We noticed that some requests coming in to Tokyo were being routed to Hong Kong VMs instead of Tokyo VMs. Hong Kong is 2900km from Tokyo (about 1800 miles), adding 50ms+ of dumb latency.
The culprit was Consul’s RTT metric. It was reporting that nodes in Tokyo were 100ms+ away from each other. Our routing logic naturally thought “50ms is better than 100ms so let’s go to Hong Kong”.
We ended up doing a ping tracking project we’ve been putting off for a while to improve this. Every node now pings every other node and we keep track of our own RTT, packet loss, etc. The net result is that requests end up at the lowest latency VMs they can be serviced from.
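The selection side of that is simple once every node has its own measurements. A rough Go sketch, with a made-up loss penalty (the real scoring is more involved than this):

```go
package main

import (
	"fmt"
	"math"
)

// PeerStats holds RTT and loss as a hypothetical ping-tracking loop would
// record them for each remote node.
type PeerStats struct {
	Node     string
	RTTms    float64
	LossRate float64 // 0.0 .. 1.0
}

// score ranks peers: lower is better. Loss is penalized heavily so a lossy
// "close" node does not beat a clean, slightly farther one. The weight is
// arbitrary here.
func score(p PeerStats) float64 {
	if p.LossRate >= 1 {
		return math.Inf(1) // effectively unreachable
	}
	return p.RTTms * (1 + 10*p.LossRate)
}

// nearest picks the peer with the best score from our own measurements,
// instead of trusting a network-coordinate estimate like Consul's rtt.
func nearest(peers []PeerStats) (PeerStats, bool) {
	best, found := PeerStats{}, false
	bestScore := math.Inf(1)
	for _, p := range peers {
		if s := score(p); s < bestScore {
			best, bestScore, found = p, s, true
		}
	}
	return best, found
}

func main() {
	best, _ := nearest([]PeerStats{
		{Node: "nrt-edge-2", RTTms: 1.2, LossRate: 0},
		{Node: "hkg-edge-1", RTTms: 51.0, LossRate: 0},
	})
	fmt.Println("route to:", best.Node)
}
```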
Incidentally, it’s odd that American companies tend to classify Tokyo/Hong Kong/Singapore/Sydney as “Asia Pacific”. Those cities are nowhere close to each other!
Started recording request and response body timings. Coming soon to our Prometheus API!
Fixes
HTTP/1.1 requests will now use HTTP/1.1 for backhauling instead of being converted to HTTP/2, which appeared to be causing issues. HTTP/2 requests will still go through the HTTP/2 backhaul for best performance.
Stricter parsing of the SNI extension in the TLS ClientHello message.
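To give a flavor of what “stricter” means here, this Go sketch rejects handshakes whose extracted SNI does not look like a real hostname. It leans on crypto/tls to pull the name out of the ClientHello; the specific checks are illustrative and are not our actual parser:

```go
package main

import (
	"crypto/tls"
	"errors"
	"fmt"
	"net"
	"strings"
)

// validateSNI illustrates the kind of checks a stricter parser enforces on
// the server name from a ClientHello: present, a plausible DNS hostname,
// not an IP literal, no empty or oversized labels.
func validateSNI(serverName string) error {
	if serverName == "" {
		return errors.New("missing SNI")
	}
	if len(serverName) > 253 {
		return errors.New("SNI too long")
	}
	if net.ParseIP(serverName) != nil {
		return errors.New("SNI must be a hostname, not an IP literal")
	}
	for _, label := range strings.Split(serverName, ".") {
		if label == "" || len(label) > 63 {
			return errors.New("malformed SNI label")
		}
	}
	return nil
}

func main() {
	cfg := &tls.Config{
		// GetCertificate runs per handshake; returning an error here aborts
		// handshakes whose SNI fails validation before any cert lookup.
		GetCertificate: func(hello *tls.ClientHelloInfo) (*tls.Certificate, error) {
			if err := validateSNI(hello.ServerName); err != nil {
				return nil, err
			}
			// ... look up the app's certificate for hello.ServerName here ...
			return nil, errors.New("certificate lookup not implemented in this sketch")
		},
	}
	_ = cfg
	fmt.Println(validateSNI("example.fly.dev"), validateSNI("bad..name"))
}
```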
Improvements
Switched up the logic for caching SSL certificates. This should improve performance.
Is the 100ms handshake from an automated test? When we first issue certificates, each edge server has to retrieve them from the cache, which means if you’re running tests against apps without activity, you need to run them multiple times to get things warmed up.
We track these pretty closely; here’s what we see from updown.io for most apps:
Allow attaching persistent volumes to VMs (more on this soon)
Fixes
Set a localhost entry (127.0.0.1) in /etc/hosts by default
Improvements
Unmount attached volumes (persistent and ephemeral) when shutting down, to prevent file system corruption (see the sketch after this list)
Attaching and executing commands inside VMs (from outside) is now smoother. We will be making this feature available soon.
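The unmount-on-shutdown change, as a rough Go sketch of what an init process can do: sync, then unmount each attached volume, falling back to a lazy unmount. The mount points and ordering here are hypothetical:

```go
//go:build linux

package main

// Sketch of a shutdown path in an init process: flush dirty pages and
// unmount every attached volume before the VM powers off, so filesystems
// are left clean. This is illustrative, not our actual init.

import (
	"log"
	"syscall"
)

func unmountVolumes(mountPoints []string) {
	// Flush pending writes first so unmount has less work to do.
	syscall.Sync()
	for _, mp := range mountPoints {
		if err := syscall.Unmount(mp, 0); err != nil {
			// Fall back to a detached (lazy) unmount rather than leaving it mounted.
			if err := syscall.Unmount(mp, syscall.MNT_DETACH); err != nil {
				log.Printf("failed to unmount %s: %v", mp, err)
			}
		}
	}
}

func main() {
	// In a real init this runs after all app processes have exited.
	unmountVolumes([]string{"/data", "/mnt/ephemeral"})
}
```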
Proxy
Features
Configurable load balancing strategies (not exposed to users yet)
closest (default): pick the closest, least loaded instance
leastload: better suited for applications with a large number of instances; this picks the least loaded instances and then chooses the closest one out of a random selection
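A rough Go sketch of the two strategies, with made-up load and RTT fields and an arbitrary cutoff and sample size (the real implementation’s tuning differs):

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// Instance is a hypothetical candidate VM with the two signals both
// strategies care about: proximity and current load.
type Instance struct {
	ID    string
	RTTms float64
	Load  float64 // e.g. current concurrency / hard limit
}

// closest prefers proximity first and uses load as the tie-breaker.
func closest(in []Instance) Instance {
	c := append([]Instance(nil), in...)
	sort.Slice(c, func(i, j int) bool {
		if c[i].RTTms != c[j].RTTms {
			return c[i].RTTms < c[j].RTTms
		}
		return c[i].Load < c[j].Load
	})
	return c[0]
}

// leastLoad keeps the least loaded instances, samples a few of them at
// random, and returns the closest of the sample. The cutoff and sample size
// here are arbitrary.
func leastLoad(in []Instance) Instance {
	c := append([]Instance(nil), in...)
	sort.Slice(c, func(i, j int) bool { return c[i].Load < c[j].Load })
	pool := c[:(len(c)+1)/2] // keep the less-loaded half

	k := 3
	if len(pool) < k {
		k = len(pool)
	}
	sample := make([]Instance, 0, k)
	for _, i := range rand.Perm(len(pool))[:k] {
		sample = append(sample, pool[i])
	}
	return closest(sample)
}

func main() {
	instances := []Instance{
		{ID: "nrt-1", RTTms: 2, Load: 0.9},
		{ID: "nrt-2", RTTms: 3, Load: 0.4},
		{ID: "ord-1", RTTms: 140, Load: 0.2},
		{ID: "fra-1", RTTms: 210, Load: 0.1},
	}
	fmt.Println("closest:", closest(instances).ID)
	fmt.Println("leastload:", leastLoad(instances).ID)
}
```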
Fixes
Fixed a bug where TLS certificates were poisoned in cache
Slow downloads could sometimes cause our idle timeout to trigger. This has been fixed by also checking kernel buffers (see the sketch after this list).
A few connection-handling paths were not as well isolated to each app’s context as they could be; this has been resolved.
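For the kernel-buffer check mentioned above: on Linux, the TIOCOUTQ ioctl reports how many bytes are still queued in a socket’s send buffer, so a connection feeding a slow download is not mistaken for an idle one. A sketch in Go, not our proxy’s actual code:

```go
//go:build linux

package main

import (
	"fmt"
	"net"
	"syscall"
	"unsafe"
)

// unsentBytes asks the kernel (TIOCOUTQ) how many bytes are still sitting in
// the socket's send queue. A connection feeding a slow download keeps bytes
// queued here long after the application last wrote anything.
func unsentBytes(c *net.TCPConn) (int, error) {
	raw, err := c.SyscallConn()
	if err != nil {
		return 0, err
	}
	var n int32
	var ioctlErr error
	err = raw.Control(func(fd uintptr) {
		_, _, errno := syscall.Syscall(syscall.SYS_IOCTL, fd, syscall.TIOCOUTQ, uintptr(unsafe.Pointer(&n)))
		if errno != 0 {
			ioctlErr = errno
		}
	})
	if err != nil {
		return 0, err
	}
	return int(n), ioctlErr
}

// isIdle only declares a connection idle when the application saw no recent
// activity AND the kernel send buffer has drained.
func isIdle(c *net.TCPConn, appSawRecentActivity bool) bool {
	if appSawRecentActivity {
		return false
	}
	n, err := unsentBytes(c)
	return err == nil && n == 0
}

func main() {
	// Sketch only: wire isIdle into an idle-timeout check instead of relying
	// on application-level reads and writes alone.
	fmt.Println("see unsentBytes / isIdle")
	_ = isIdle
}
```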
Improvements
Some connections / actions were still not being counted correctly, which caused restarts to end those tasks prematurely.
Internal improvements
Various dependency upgrades
Stopped collecting a bunch of superfluous metrics
Virtual Machines
Fixes
We now support non-numeric user and group values in Dockerfiles. There was previously a bug with the user:group format (uid:gid, uid, and user alone worked fine); see the sketch after this list.
Disabled brotli encoding for now; it wasn’t stable and was sometimes causing crashes (and therefore closed connections).
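For the USER fix above, here is a rough Go sketch of resolving a Dockerfile USER value into numeric IDs. The passwd/group lookups are stubbed out and the default-group handling is simplified:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// lookupUID / lookupGID stand in for reading the image's /etc/passwd and
// /etc/group; they are hypothetical and hard-coded here.
func lookupUID(name string) (int, bool) { return map[string]int{"app": 1000}[name], name == "app" }
func lookupGID(name string) (int, bool) { return map[string]int{"app": 1000}[name], name == "app" }

// resolveUser turns a Dockerfile USER value ("uid", "user", "uid:gid" or
// "user:group") into numeric IDs, resolving names when they aren't numeric.
func resolveUser(spec string) (uid, gid int, err error) {
	userPart, groupPart, hasGroup := strings.Cut(spec, ":")

	resolve := func(part string, lookup func(string) (int, bool)) (int, error) {
		if n, err := strconv.Atoi(part); err == nil {
			return n, nil // already numeric
		}
		if id, ok := lookup(part); ok {
			return id, nil
		}
		return 0, fmt.Errorf("unknown name %q", part)
	}

	if uid, err = resolve(userPart, lookupUID); err != nil {
		return 0, 0, err
	}
	gid = uid // simplification; a real resolver uses the user's primary group
	if hasGroup {
		if gid, err = resolve(groupPart, lookupGID); err != nil {
			return 0, 0, err
		}
	}
	return uid, gid, nil
}

func main() {
	for _, spec := range []string{"1000", "app", "1000:1000", "app:app"} {
		uid, gid, err := resolveUser(spec)
		fmt.Println(spec, "->", uid, gid, err)
	}
}
```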
Virtual machines
Improvements
Switched to containerd for pulling images. This allowed us to remove the slow “building rootfs” step and reuse more layers, resulting in faster subsequent boots for larger images.
Started logging “lifecycle” events for virtual machines (configuring, starting, etc.).
Lazily initialise and encrypt volumes
Removed timestamps from our init program’s logs
Fixes
FLY_PUBLIC_IP was set to the wrong IP (private network IP). This is now fixed.
There were still cases where figuring out the UID and GID to run your docker command wasn’t quite working right. Not anymore.