Infrastructure dev notes

These are development notes from engineers at Fly, published as an experiment in transparency.

Proxy

Features

  • Reject requests when backhauling for an instance that’s reached its hard limit, allowing our edge proxy to retry with a different, less loaded instance.
    • This appears to have improved performance by spreading the load better
  • Now using adaptive http/2 window resizing
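One common way to make an HTTP/2 window adaptive (an assumption about the approach, not a description of our actual code) is to size it to the connection’s bandwidth-delay product, so a high-latency client can keep the pipe full without over-buffering. A minimal Rust sketch, where `h2_window` and its parameters are illustrative names:

```rust
/// Size the HTTP/2 flow-control window to the bandwidth-delay product,
/// clamped between the protocol's 65,535-byte default and a cap.
fn h2_window(bytes_per_sec: u64, rtt_ms: u64, max: u32) -> u32 {
    let bdp = bytes_per_sec.saturating_mul(rtt_ms) / 1000;
    bdp.min(max as u64).max(65_535) as u32
}
```

A 10 MB/s link at 100ms RTT would get a 1MB window; a slow link falls back to the default.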

Fixes

  • Enabled retrying of canceled requests (canceled at connection time)
  • Added a timeout where one was missing (e.g. creating backhaul connections)
    • This could lead to needless waiting on bad connections
  • Allow a much longer timeout for connections to end when restarting / deploying our proxy
    • This resulted in fewer broken connections on reload
  • Enabled expose-fd listeners on our haproxy stats endpoint for hitless reloads
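The backhaul-connection timeout fix can be illustrated with std’s blocking API (the real proxy is async on tokio; `dial_backhaul` is a hypothetical name):

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Bound how long we'll wait when creating a backhaul connection
/// instead of blocking indefinitely on a bad peer.
fn dial_backhaul(addr: SocketAddr, budget: Duration) -> std::io::Result<TcpStream> {
    TcpStream::connect_timeout(&addr, budget)
}
```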

Internal improvements

  • Upgraded multiple dependencies
  • Reduced the amount of reloading of our proxy configuration
  • Started using Honeycomb for tracing slow requests
    • We built our own Honeycomb client for Rust, it’s far from perfect, but it works


(This isn’t a particularly slow request, but we also submit traces for requests where there was an error.)

Virtual machines

Features

  • Upgraded the default kernel to 4.19.146
    • This fixes the “firmware bug” kernel messages on some of our newer AMD Epyc hosts

This is very neat, and could be a whole “how to build a global load balancer” article. The architectural problem we have here is that the farther an instance of our load balancer is from a VM, the harder it is to accurately know how loaded that VM is. Load fluctuates by the millisecond, and even in the best cases we have to wait a few hundred milliseconds to “see” any changes from some instances.

What ends up happening is VMs reach their hard concurrency limit (set in fly.toml) and we still send them traffic. Prior to this change, we’d queue new connections/requests locally and let the VM work through them as it could. This worked decently, but VMs still ended up with queues that wouldn’t quit.

The retry change prevents queueing in many cases. When a loaded up VM gets traffic it can’t handle now, it sends a message back to the load balancer saying “full, come back later”. The load balancer then reissues the request to another VM.
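The “full, come back later” handshake can be sketched roughly like this (names and structure are illustrative, not our actual code):

```rust
struct Instance {
    id: &'static str,
    in_flight: u32,
    hard_limit: u32, // the hard concurrency limit from fly.toml
}

impl Instance {
    /// Worker side: refuse new work once the hard limit is hit,
    /// instead of letting a queue build up behind a saturated VM.
    fn accept(&mut self) -> bool {
        if self.in_flight >= self.hard_limit {
            return false; // "full, come back later"
        }
        self.in_flight += 1;
        true
    }
}

/// Edge-proxy side: try instances in preference order (e.g. nearest
/// first) and fall through to the next candidate on a rejection.
fn dispatch(instances: &mut [Instance]) -> Option<&'static str> {
    for inst in instances.iter_mut() {
        if inst.accept() {
            return Some(inst.id);
        }
    }
    None // every candidate is at its hard limit
}
```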

This is the effect on a CPU-bound app:

What’s cool is how fast the retries can happen (at least, compared to queues backing up). Some apps are hitting their limits, then retrying hundreds of times per second.

Big week of performance improvements!

Proxy

Features

  • Added a way to target a specific app instance via a header. I’m not telling which header because this is still very much in flux.

Improvements

  • Use more realistic latency measurements between our servers when choosing where to forward a request
  • Unclogged the tokio event loop in many ways (some operations were discovered to be blocking and were moved off the main event loop).

Internal improvements

  • Ensure distributed load data is invalidated on restart (we’re using consul’s session TTL)
  • Upgraded dependencies
  • Switched from the mimalloc allocator to jemalloc

This is my favorite from last week. We abuse Consul in many ways. The most heinous thing we do is run it globally as if it’s on a single local network. That usually works fine for us, but sometimes it gets weird.

Consul includes an rtt feature that’s useful for finding the nearest node. When a request comes in for an app, we send it to the closest available VM. Consul rtt was very convenient for this … and also wrong in surprising ways.

We noticed that some requests coming in to Tokyo were being routed to Hong Kong VMs instead of Tokyo VMs. Hong Kong is 2900km from Tokyo (about 1800 miles), adding 50ms+ of dumb latency.

The culprit was Consul’s RTT metric. It was reporting that nodes in Tokyo were 100ms+ away from each other. Our routing logic naturally thought “50ms is better than 100ms so let’s go to Hong Kong”.

We ended up doing a ping tracking project we’ve been putting off for a while to improve this. Every node now pings every other node and we keep track of our own RTT, packet loss, etc. The net result is that requests end up at the lowest latency VMs they can be serviced from.
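The node selection this enables looks roughly like the following (field names and the loss penalty are assumptions, not our actual scoring):

```rust
struct NodeStats {
    name: &'static str,
    rtt_ms: f64,       // self-measured, not Consul's coordinate estimate
    packet_loss: f64,  // fraction, 0.0..=1.0
}

/// Penalize lossy links so a "fast" but flaky path doesn't win,
/// then take the node with the lowest effective RTT.
fn nearest(nodes: &[NodeStats]) -> Option<&'static str> {
    nodes
        .iter()
        .map(|n| (n.name, n.rtt_ms / (1.0 - n.packet_loss).max(0.01)))
        .min_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .map(|(name, _)| name)
}
```

With real ping data, Tokyo nodes a couple of milliseconds apart beat a 50ms hop to Hong Kong every time.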

Incidentally, it’s odd that American companies tend to classify Tokyo/Hong Kong/Singapore/Sydney as “Asia Pacific”. Those cities are nowhere close to each other!

Proxy

Features

  • Started recording request and response body timings. Coming soon to our prometheus API!

Fixes

  • HTTP/1.1 requests now use HTTP/1.1 for backhauling instead of being converted to HTTP/2, which appeared to be causing issues. HTTP/2 requests still go through the HTTP/2 backhaul for best performance.
  • Stricter parsing of the SNI extension in the TLS ClientHello message.

Improvements

  • Switched up the logic for caching SSL certificates. This should improve performance.

“Switched up the logic for caching SSL certificates. This should improve performance.”

I was noticing that SSL negotiation could take upwards of 100ms. Is that what this should help with?

It surely didn’t help.

100ms is not too bad (but not good) since that includes latency.

We have more optimisations coming before the next dev notes update.

Is the 100ms handshake from an automated test? When we first issue certificates, each edge server has to retrieve them from cache. Which means if you’re running tests on apps without activity, you need to run them multiple times to get things warmed up.

We track these pretty closely, here’s what we see from updown.io for most apps:

I’m unable to reproduce it, @kurt, but it was on a webpagetest.com test, and they do sometimes have shaky networks when measuring TTFB.

Thanks for following up with the stats from updown.io

Virtual machines

Features

  • Now detecting OOM kills from within a virtual machine (usually causing a restart)
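Detection can be as simple as watching the guest kernel log for the OOM killer’s signature line. This heuristic sketch assumes the common message prefixes, which vary by kernel version:

```rust
/// Returns true if a kernel log line looks like an OOM kill.
/// Older kernels log "Out of memory: Kill process ...", newer ones
/// "Out of memory: Killed process ..." and/or an "oom-kill:" record.
fn is_oom_kill(kmsg_line: &str) -> bool {
    kmsg_line.contains("Out of memory: Kill") || kmsg_line.contains("oom-kill:")
}
```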

Fixes

  • Private IP allocation was racy. Fixed by using a mutex and allocating IPs only one at a time
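A minimal sketch of the fix, assuming the allocator hands out sequential addresses from a private range (the real allocator’s logic and names differ):

```rust
use std::collections::HashSet;
use std::net::Ipv4Addr;
use std::sync::Mutex;

/// Serializes private-IP allocation behind a mutex so two concurrent
/// requests can never be handed the same address.
struct IpAllocator {
    state: Mutex<AllocState>,
}

struct AllocState {
    next: u32,
    in_use: HashSet<Ipv4Addr>,
}

impl IpAllocator {
    fn new(start: Ipv4Addr) -> Self {
        IpAllocator {
            state: Mutex::new(AllocState { next: u32::from(start), in_use: HashSet::new() }),
        }
    }

    /// Holding the lock for the whole check-and-insert closes the race:
    /// allocation happens one IP at a time.
    fn allocate(&self) -> Ipv4Addr {
        let mut st = self.state.lock().unwrap();
        loop {
            let ip = Ipv4Addr::from(st.next);
            st.next += 1;
            if st.in_use.insert(ip) {
                return ip;
            }
        }
    }
}
```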

Internal improvements

  • Upgraded dependencies for our kernel init program

Proxy

Internal improvements

  • Updated dependencies
  • Reduced metrics cardinality for smoother metrics ingestion / querying
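One typical way to cut metrics cardinality (an illustration, not necessarily the labels we trimmed) is to bucket high-cardinality label values, e.g. collapsing HTTP status codes into classes so a metric carries five series instead of dozens:

```rust
/// Label by status class ("2xx") rather than the exact code.
fn status_class(code: u16) -> &'static str {
    match code / 100 {
        1 => "1xx",
        2 => "2xx",
        3 => "3xx",
        4 => "4xx",
        5 => "5xx",
        _ => "other",
    }
}
```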

Virtual machines

Features

  • Allow attaching persistent volumes to VMs (more on this soon)

Fixes

  • Set localhost 127.0.0.1 in /etc/hosts by default

Improvements

  • Unmount attached volumes (persistent and ephemeral) when shutting down to prevent file system corruption
  • Attaching and executing commands inside VMs (from outside) is now smoother. We will be making this feature available soon.

Proxy

Features

  • Configurable load balancing strategies (not exposed to users yet)
    • closest (default): pick the closest, least loaded instance
    • leastload: better suited for applications with a large number of instances, this picks the least loaded instances first and then chooses the closest one out of a random selection
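The two strategies can be sketched like so (illustrative only; the real implementation includes the random sampling step, which this omits for brevity):

```rust
#[derive(Clone, Copy)]
struct Candidate {
    id: &'static str,
    rtt_ms: u32,
    load: u32,
}

/// "closest": rank by proximity first, break ties on load.
fn closest(mut pool: Vec<Candidate>) -> Option<&'static str> {
    pool.sort_by_key(|c| (c.rtt_ms, c.load));
    pool.first().map(|c| c.id)
}

/// "leastload": keep only the least-loaded instances, then take the
/// closest of that subset rather than ranking the whole fleet.
fn leastload(pool: Vec<Candidate>) -> Option<&'static str> {
    let min_load = pool.iter().map(|c| c.load).min()?;
    pool.into_iter()
        .filter(|c| c.load == min_load)
        .min_by_key(|c| c.rtt_ms)
        .map(|c| c.id)
}
```

Note how the same pool can route differently: closest favors the nearby-but-busy instance, leastload the idle-but-distant one.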

Fixes

  • Fixed a bug where TLS certificates were poisoned in cache