Infrastructure dev notes

These are development notes from engineers at Fly. It's an experiment in transparency.

Proxy

Features

  • Reject requests when backhauling for an instance that’s reached its hard limit, allowing our edge proxy to retry with a different, less loaded instance.
    • This appears to have improved performance by spreading the load more evenly
  • Now using adaptive http/2 window resizing

Fixes

  • Enable retrying canceled requests (canceled at connection time)
  • Added a timeout where it was missing (e.g. creating backhaul connections)
    • This could lead to needless waiting on bad connections
  • Allow a much longer timeout for connections to end when restarting / deploying our proxy
    • This resulted in less connection breakage upon reload
  • Enabled expose-fd listeners on our haproxy stats endpoint for hitless reloads

Internal improvements

  • Upgraded multiple dependencies
  • Reduced the amount of reloading of our proxy configuration
  • Started using Honeycomb for tracing slow requests
    • We built our own Honeycomb client for Rust, it’s far from perfect, but it works


(this isn’t a particularly slow request, but we also submit requests where there was an error)

Virtual machines

Features

  • Upgraded the default kernel to 4.19.146
    • This fixes the “firmware bug” kernel messages on some of our newer AMD Epyc hosts

This is very neat, and could be a whole “how to build a global load balancer” article. The architectural problem we have here is that the farther an instance of our load balancer is from a VM, the harder it is to accurately know how loaded that VM is. Load fluctuates by the millisecond, and even in the best cases we have to wait a few hundred milliseconds to “see” any changes from some instances.

What ends up happening is VMs reach their hard concurrency limit (set in fly.toml) and we still send them traffic. Prior to this change, we’d queue new connections/requests locally and let the VM work through them as it could. This worked decently, but VMs still ended up with queues that wouldn’t quit.

The retry change prevents queueing in many cases. Now, when a fully loaded VM gets traffic it can’t handle, it sends a message back to the load balancer saying “full, come back later”. The load balancer then reissues the request to another VM.
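
Roughly, the flow looks like the sketch below — a hypothetical, simplified Rust version, not the actual proxy code. The Instance type, region names, and numbers are made up, and the real “full” signal travels over the backhaul protocol.

```rust
// Hypothetical sketch of the retry-on-full flow; not the actual proxy code.

struct Instance {
    id: &'static str,
    current_load: u32,
    hard_limit: u32, // the hard concurrency limit from fly.toml
}

enum BackhaulReply {
    Handled,
    Full, // "full, come back later"
}

// Stand-in for sending the request over a backhaul connection.
fn send_to_instance(instance: &Instance) -> BackhaulReply {
    if instance.current_load >= instance.hard_limit {
        BackhaulReply::Full
    } else {
        BackhaulReply::Handled
    }
}

/// Try instances from closest to farthest, skipping any that report "full"
/// instead of queueing behind them.
fn route_request(candidates: Vec<Instance>) -> Option<&'static str> {
    for instance in candidates {
        match send_to_instance(&instance) {
            BackhaulReply::Handled => return Some(instance.id),
            BackhaulReply::Full => continue, // reissue to the next instance
        }
    }
    None // every candidate was at its hard limit
}

fn main() {
    let candidates = vec![
        Instance { id: "nrt-1", current_load: 25, hard_limit: 25 }, // saturated
        Instance { id: "nrt-2", current_load: 3, hard_limit: 25 },
    ];
    println!("routed to {:?}", route_request(candidates)); // Some("nrt-2")
}
```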

This is the effect on a CPU bound app:

What’s cool is how fast the retries can happen (at least, compared to queues backing up). Some apps are hitting their limits, then retrying hundreds of times per second.

Big week of performance improvements!

Proxy

Features

  • Added a way to target a specific app instance via a header. I’m not telling which header because this is still very much in flux.

Improvements

  • Use more realistic latency between our servers to choose where to forward a request
  • Unclogged the tokio event loop in many ways (some operations were discovered to be blocking and were moved off the main event loop; see the sketch below).
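
For reference, this is the general tokio pattern for moving blocking work off the event loop — a generic illustration, not the specific operations we moved; the file read here just stands in for whatever synchronous call was found blocking.

```rust
use std::time::Duration;

// Synchronous calls on a tokio worker thread stall every task sharing that
// thread. spawn_blocking hands the work to tokio's dedicated blocking pool.
async fn handle_request() -> std::io::Result<u64> {
    // BAD: calling std::fs::metadata directly here would block the event loop.
    let len = tokio::task::spawn_blocking(|| {
        std::fs::metadata("/etc/hosts").map(|m| m.len())
    })
    .await
    .expect("blocking task panicked")?;
    Ok(len)
}

#[tokio::main]
async fn main() {
    // While the blocking call runs on the blocking pool, timers and other
    // tasks on the runtime keep making progress.
    let (len, _) = tokio::join!(
        handle_request(),
        tokio::time::sleep(Duration::from_millis(10)),
    );
    println!("{len:?}");
}
```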

Internal improvements

  • Ensure distributed load data is invalidated on restart (we’re using Consul’s session TTL; see the sketch after this list)
  • Upgraded dependencies
  • Switched from the mimalloc allocator to jemalloc
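
A minimal sketch of the session-TTL approach against a local Consul agent (assuming 127.0.0.1:8500, reqwest with the blocking and json features, and serde_json; the key names and TTL are invented). When the session’s TTL lapses without renewal — say, because the process restarted — Consul deletes the keys that session holds, so stale load data can’t outlive its writer.

```rust
use serde_json::json;
use std::collections::HashMap;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();

    // 1. Create a session with a TTL and "delete" behavior: when the TTL
    //    expires without renewal, Consul deletes every key the session holds.
    let session: HashMap<String, String> = client
        .put("http://127.0.0.1:8500/v1/session/create")
        .json(&json!({ "Name": "proxy-load-data", "TTL": "15s", "Behavior": "delete" }))
        .send()?
        .json()?;
    let session_id = &session["ID"];

    // 2. Write load data under that session. A restarted process never renews
    //    the old session, so the stale value is invalidated automatically.
    client
        .put(format!(
            "http://127.0.0.1:8500/v1/kv/proxy/load/edge-1?acquire={session_id}"
        ))
        .json(&json!({ "connections": 42 }))
        .send()?;

    // 3. A healthy process keeps the data alive by periodically calling
    //    PUT /v1/session/renew/<session_id>.
    Ok(())
}
```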

This is my favorite from last week. We abuse Consul in many ways. The most heinous thing we do is run it globally as if it’s on a single local network. That usually works fine for us, but sometimes it gets weird.

Consul includes an rtt feature that’s useful for finding the nearest node. When a request comes in for an app, we send it to the closest available VM. Consul rtt was very convenient for this … and also wrong in surprising ways.

We noticed that some requests coming in to Tokyo were being routed to Hong Kong VMs instead of Tokyo VMs. Hong Kong is 2900km from Tokyo (about 1800 miles), adding 50ms+ of dumb latency.

The culprit was Consul’s RTT metric. It was reporting that nodes in Tokyo were 100ms+ away from each other. Our routing logic naturally thought “50ms is better than 100ms so let’s go to Hong Kong”.

We ended up doing a ping tracking project we’ve been putting off for a while to improve this. Every node now pings every other node and we keep track of our own RTT, packet loss, etc. The net result is that requests end up at the lowest latency VMs they can be serviced from.
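
The selection step then becomes something like the sketch below — a toy illustration; the PeerStats shape, the loss cutoff, and the numbers are invented.

```rust
// Pick a destination from self-measured ping data instead of trusting
// Consul's network coordinates.

#[derive(Debug)]
struct PeerStats {
    region: &'static str,
    rtt_ms: f64,
    packet_loss: f64, // 0.0..=1.0
}

/// Prefer the lowest measured RTT, skipping peers with heavy packet loss.
fn pick_peer(peers: &[PeerStats]) -> Option<&PeerStats> {
    peers
        .iter()
        .filter(|p| p.packet_loss < 0.5)
        .min_by(|a, b| a.rtt_ms.total_cmp(&b.rtt_ms))
}

fn main() {
    // With our own pings, Tokyo-to-Tokyo wins; a bogus 100ms+ Consul
    // coordinate can no longer send the request to Hong Kong.
    let peers = [
        PeerStats { region: "nrt", rtt_ms: 2.1, packet_loss: 0.0 },
        PeerStats { region: "hkg", rtt_ms: 51.0, packet_loss: 0.0 },
    ];
    println!("{:?}", pick_peer(&peers));
}
```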

Incidentally, it’s odd that American companies tend to classify Tokyo/Hong Kong/Singapore/Sydney as “Asia Pacific”. Those cities are nowhere close to each other!

Proxy

Features

  • Started recording request and response body timings. Coming soon to our Prometheus API! (See the sketch below.)
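
For the curious, recording timings as histograms with the prometheus crate looks roughly like this sketch; the metric names here are invented, not the ones that will show up in our API.

```rust
use prometheus::{Encoder, Histogram, HistogramOpts, Registry, TextEncoder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let registry = Registry::new();
    let request_body = Histogram::with_opts(HistogramOpts::new(
        "proxy_request_body_seconds",
        "Time spent reading the request body",
    ))?;
    let response_body = Histogram::with_opts(HistogramOpts::new(
        "proxy_response_body_seconds",
        "Time spent writing the response body",
    ))?;
    registry.register(Box::new(request_body.clone()))?;
    registry.register(Box::new(response_body.clone()))?;

    // In a proxy these would be observed around the actual body transfers.
    request_body.observe(0.012);
    response_body.observe(0.480);

    // Render the text exposition format a Prometheus server scrapes.
    let mut buf = Vec::new();
    TextEncoder::new().encode(&registry.gather(), &mut buf)?;
    println!("{}", String::from_utf8(buf)?);
    Ok(())
}
```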

Fixes

  • HTTP/1.1 requests will now use HTTP/1.1 for backhauling instead of being converted to HTTP/2, which appeared to be causing issues. HTTP/2 requests will still go through the HTTP/2 backhaul for best performance.
  • Stricter parsing of the SNI extension in the TLS ClientHello message.

Improvements

  • Switched up the logic for caching SSL certificates. This should improve performance.

Switched up the logic for caching SSL certificates. This should improve performance.

I was noticing that SSL negotiation could take upwards of 100ms. Is that what this should help with?

It surely didn’t help.

100ms is not too bad (but not good) since that includes latency.

We have more optimisations coming before the next dev notes update.

Is the 100ms handshake from an automated test? When we first issue certificates, each edge server has to retrieve them from cache, which means if you’re running tests on apps without activity, you need to run them multiple times to get things warmed up.

We track these pretty closely; here’s what we see from updown.io for most apps:


I’m unable to reproduce it, @kurt, but it was on a webpagetest.com test, and they do have some shaky networks sometimes when measuring TTFB.

Thanks for following up with the stats from updown.io

Virtual machines

Features

  • Now detecting OOM kills from within a virtual machine (usually causing a restart); see the sketch below
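
One way to do this — an assumption about the mechanism, not a description of our init — is to watch the kernel log from inside the guest for the OOM killer’s report:

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> std::io::Result<()> {
    // /dev/kmsg streams kernel log records; requires root inside the VM.
    let kmsg = BufReader::new(File::open("/dev/kmsg")?);
    for record in kmsg.lines() {
        let record = record?;
        // The OOM killer logs e.g. "Out of memory: Killed process 123 (app)".
        if record.contains("Killed process") {
            eprintln!("detected OOM kill: {record}");
            // ...tell the supervisor to restart the affected workload...
        }
    }
    Ok(())
}
```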

Fixes

  • Private IP allocation was racy. Fixed by using a mutex and allocating IPs one at a time (see the sketch below)
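
The shape of the fix, sketched with std::sync::Mutex — the pool type and address range are made up:

```rust
use std::collections::HashSet;
use std::net::Ipv4Addr;
use std::sync::Mutex;

struct IpPool {
    allocated: Mutex<HashSet<Ipv4Addr>>,
}

impl IpPool {
    fn allocate(&self) -> Option<Ipv4Addr> {
        // Holding the lock across the whole check-then-insert removes the race
        // where two concurrent callers both see the same address as free.
        let mut allocated = self.allocated.lock().unwrap();
        for last in 2..255u8 {
            let candidate = Ipv4Addr::new(172, 19, 0, last);
            if allocated.insert(candidate) {
                return Some(candidate);
            }
        }
        None // pool exhausted
    }
}

fn main() {
    let pool = IpPool { allocated: Mutex::new(HashSet::new()) };
    println!("{:?}", pool.allocate()); // Some(172.19.0.2)
    println!("{:?}", pool.allocate()); // Some(172.19.0.3)
}
```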

Internal improvements

  • Upgraded dependencies for our kernel init program

Proxy

Internal improvements

  • Updated dependencies
  • Reduced metrics cardinality for smoother metrics ingestion / querying

Virtual machines

Features

  • Allow attaching persistent volumes to VMs (more on this soon)

Fixes

  • Set localhost to 127.0.0.1 in /etc/hosts by default

Improvements

  • Unmount attached volumes (persistent and ephemeral) when shutting down to prevent file system corruption (see the sketch after this list)
  • Attaching and executing commands inside VMs (from outside) is now smoother. We will be making this feature available soon.
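
A bare-bones sketch of the unmount-on-shutdown step using libc’s umount(2) wrapper; the mount point is made up and this isn’t our actual init code.

```rust
use std::ffi::CString;

fn unmount(target: &str) -> std::io::Result<()> {
    let c_target = CString::new(target).expect("path contains no interior NUL");
    // SAFETY: c_target is a valid NUL-terminated path for the duration of the call.
    let rc = unsafe { libc::umount(c_target.as_ptr()) };
    if rc == -1 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(())
}

fn main() {
    // Run at shutdown, after processes using the volume have exited, so dirty
    // pages get flushed and the file system is left consistent.
    if let Err(err) = unmount("/data") {
        eprintln!("failed to unmount /data: {err}");
    }
}
```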

Proxy

Features

  • Configurable load balancing strategies (not exposed to users yet; see the sketch after this list)
    • closest (default): pick the closest, least loaded instance
    • leastload: better suited for applications with a large number of instances; this picks the least loaded instances and then picks the closest one out of a random selection
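
Sketched in Rust, the two strategies could look roughly like this (using the rand crate for the random selection; the Candidate type, scoring, tie-breaking, and sample sizes are assumptions, not the proxy’s exact logic):

```rust
use rand::seq::SliceRandom;

#[derive(Debug, Clone)]
struct Candidate {
    id: &'static str,
    rtt_ms: f64,
    load: f64, // e.g. current concurrency / hard limit
}

/// "closest": pick the closest instance, using load as a tie-breaker.
fn closest(candidates: &[Candidate]) -> Option<Candidate> {
    candidates
        .iter()
        .min_by(|a, b| (a.rtt_ms, a.load).partial_cmp(&(b.rtt_ms, b.load)).unwrap())
        .cloned()
}

/// "leastload": keep the least loaded instances, then pick the closest one
/// out of a random selection of them (scales better with many instances).
fn leastload(candidates: &[Candidate], sample: usize) -> Option<Candidate> {
    let mut by_load: Vec<Candidate> = candidates.to_vec();
    by_load.sort_by(|a, b| a.load.partial_cmp(&b.load).unwrap());
    by_load.truncate((by_load.len() + 1) / 2); // keep the less loaded half
    by_load
        .choose_multiple(&mut rand::thread_rng(), sample)
        .min_by(|a, b| a.rtt_ms.partial_cmp(&b.rtt_ms).unwrap())
        .cloned()
}

fn main() {
    let candidates = [
        Candidate { id: "a", rtt_ms: 2.0, load: 0.9 },
        Candidate { id: "b", rtt_ms: 3.0, load: 0.1 },
        Candidate { id: "c", rtt_ms: 40.0, load: 0.2 },
    ];
    println!("closest:   {:?}", closest(&candidates));
    println!("leastload: {:?}", leastload(&candidates, 2));
}
```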

Fixes

  • Fixed a bug where TLS certificates were poisoned in cache

Virtual machines

Fixes

  • Fixed a bug where volume names could not contain a _
  • Fixed an unsoundness bug in how we cleaned up failed mounts

Internal improvements

  • Preparing for a way to exec into running instances

API

Features

  • Added “script” type checks (going to document this soon)

Internal improvements

  • Paving the way for worker-type instances

Proxy

Features

  • Isolate apps so they don’t affect each other (as much) when there is a surge of traffic to a single one of them (malicious or not).
  • Record resource usage per app within the scope of our proxy; this could be made available to our users.

Internal improvements

  • Refactor handlers to use tower services

Proxy

Fixes

  • Might have fixed a bug with serving stale cached certificates instead of brand new ones.
  • ACME (Let’s Encrypt) TLS ALPN challenges were checking for wildcards in some scenarios. They don’t anymore.

Improvements

  • More graceful shutdown of connections on restarts

Virtual Machines

Improvements

  • Upgraded to firecracker v0.23

Internal improvements

  • Fixed a bug making Nomad (our VM scheduler) restart very slowly, causing all instances to restart

(Apparently, I forgot to post last week)

Proxy

Fixes

  • Slow downloads could sometimes cause our idle timeout to trigger. This has been fixed by also checking kernel buffers (see the sketch after this list).
  • A few places were not as well isolated per app as they could be when dealing with connections; this has been resolved.
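
The kernel-buffer check amounts to asking the socket how many bytes it still hasn’t pushed out. On Linux that’s the SIOCOUTQ ioctl (same value as TIOCOUTQ), sketched here with the libc crate; whether the proxy does it exactly this way is an assumption.

```rust
use std::net::TcpStream;
use std::os::unix::io::AsRawFd;

/// Bytes queued in the kernel's send buffer but not yet sent to the peer.
fn unsent_bytes(stream: &TcpStream) -> std::io::Result<libc::c_int> {
    let mut pending: libc::c_int = 0;
    // SAFETY: valid fd and a properly sized out-parameter.
    // On Linux, SIOCOUTQ and TIOCOUTQ are the same ioctl number.
    let rc = unsafe { libc::ioctl(stream.as_raw_fd(), libc::TIOCOUTQ, &mut pending) };
    if rc == -1 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(pending)
}

/// Only treat a connection as idle if nothing happened *and* the kernel has
/// nothing left to drain to a slow client.
fn should_idle_timeout(stream: &TcpStream, no_activity_observed: bool) -> bool {
    no_activity_observed && unsent_bytes(stream).unwrap_or(0) == 0
}

fn main() -> std::io::Result<()> {
    let stream = TcpStream::connect("example.com:80")?;
    println!("unsent bytes: {}", unsent_bytes(&stream)?);
    println!("would time out: {}", should_idle_timeout(&stream, true));
    Ok(())
}
```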

Improvements

  • Some connections / actions were still not counted correctly, causing restarts to end these tasks prematurely.

Internal improvements

  • Various dependency upgrades
  • Stopped collecting a bunch of superfluous metrics

Virtual Machines

Fixes

  • We now support non-numeric user and group values in Dockerfiles. There was previously a bug with the user:group format (but uid:gid, uid, and user worked fine); see the sketch below.
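
The gist of handling a name-based user:group is resolving each half against the image’s /etc/passwd and /etc/group. The sketch below shows that resolution in isolation — illustrative parsing, not our init’s actual code, and the sample files are invented.

```rust
fn lookup(db: &str, name: &str, id_field: usize) -> Option<u32> {
    // Numeric values pass straight through; names are matched on the first field.
    if let Ok(id) = name.parse() {
        return Some(id);
    }
    db.lines()
        .map(|line| line.split(':').collect::<Vec<_>>())
        .find(|fields| fields.first() == Some(&name))
        .and_then(|fields| fields.get(id_field)?.parse().ok())
}

/// Accepts "user", "uid", "user:group", "uid:gid", or any mix of the two.
fn resolve_user(spec: &str, passwd: &str, group: &str) -> Option<(u32, Option<u32>)> {
    let (user_part, group_part) = match spec.split_once(':') {
        Some((u, g)) => (u, Some(g)),
        None => (spec, None),
    };
    let uid = lookup(passwd, user_part, 2)?; // /etc/passwd: name:x:uid:gid:...
    let gid = match group_part {
        Some(g) => Some(lookup(group, g, 2)?), // /etc/group: name:x:gid:members
        None => None,
    };
    Some((uid, gid))
}

fn main() {
    let passwd = "root:x:0:0:root:/root:/bin/sh\napp:x:1000:1000::/app:/bin/sh\n";
    let group = "root:x:0:\napp:x:1000:\n";
    assert_eq!(resolve_user("app:app", passwd, group), Some((1000, Some(1000))));
    assert_eq!(resolve_user("1000", passwd, group), Some((1000, None)));
    println!("ok");
}
```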

Internal improvements

  • Laying the foundation for faster booting VMs

Proxy

Fixes

  • Disabled brotli encoding for now; it wasn’t stable and was sometimes causing crashes (and therefore closed connections).

Virtual machines

Improvements

  • Switched to containerd for pulling images. This allowed us to remove the slow “building rootfs” step and reuse more layers, resulting in faster subsequent boots for larger images.
  • Started logging “lifecycle” events (configuring, starting, etc. virtual machines).
  • Lazily initialise and encrypt volumes
  • Removed timestamps from our init program’s logs

Fixes

  • FLY_PUBLIC_IP was set to the wrong IP (private network IP). This is now fixed.
  • There were still cases where figuring out the UID and GID to run your docker command wasn’t quite working right. Not anymore.

Proxy

Features

  • Laid the groundwork to respond with self-signed certificates. This is not yet generally available.

Improvements

  • State is now replicated faster and should reduce/eliminate errors when scaling up or down too fast.

Virtual machines

Improvements

  • More reliable zombie process reaping (see the sketch after this list)
  • Solidified vsock interface between us and VMs
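
Zombie reaping is the classic waitpid loop an init (PID 1) runs so exited children don’t linger; the sketch below uses the libc crate and is illustrative, not our init’s actual code.

```rust
fn reap_zombies() {
    loop {
        let mut status: libc::c_int = 0;
        // -1: any child; WNOHANG: return immediately instead of blocking.
        let pid = unsafe { libc::waitpid(-1, &mut status, libc::WNOHANG) };
        match pid {
            0 => break,  // children exist, but none have exited yet
            -1 => break, // no children left (ECHILD) or an error
            pid => eprintln!("reaped child {pid} (status {status})"),
        }
    }
}

fn main() {
    // An init would typically call this from a SIGCHLD handler or on a timer.
    reap_zombies();
}
```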

Can we have more of this please, @jerome :face_with_peeking_eye:
