We made the Machines API faster by adding Machines

The Machines API is our REST API for full control over fast-launching Machines. We had tail latency issues that caused some requests to take a long time to verify authentication. We added more Machines and it got faster, meaning you can launch your Machines even faster!

Read on for the technical details.

Machines API Architecture

A quick tour of the Machines API architecture:

You connect to https://api.machines.dev when you launch a Machine with flyctl or directly. api.machines.dev is pointed at our Machine API Proxy app, which is a regular Fly App just like the ones you use. fly-proxy receives the HTTP requests and routes them to the nearest Machine in the Machine API Proxy app.

The Machine API Proxy app runs Envoy and forwards requests to flaps at http://[fdaa::3]:4280. flaps is the internal Machines API. You can use flaps directly at http://[fdaa::3]:4280 when connected to the Fly WireGuard network. flyctl previously used the internal Machines API over user-mode WireGuard before we built the Machine API Proxy.

We provide this Envoy proxy so you can access the API from anywhere. Envoy makes it easy for us to sprinkle in generous rate limiting and other features to keep the service available for you and everyone else.

flaps then communicates with flyd on workers across our network to fulfill the Machines API request. More details on flaps and flyd are on our blog: Carving The Scheduler Out Of Our Orchestrator (The Fly Blog)
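To make the two entry points concrete, here is a minimal sketch that builds (but does not send) a request against either base URL. The `/v1/apps/{app}/machines` path and the bearer-token header are assumptions drawn from the public Machines API docs, not from this post, and `list_machines_request` is a hypothetical helper name.

```python
import urllib.request

# Public endpoint, routed through the Machine API Proxy app.
PUBLIC_BASE = "https://api.machines.dev"
# flaps directly; only reachable when connected to the Fly WireGuard network.
INTERNAL_BASE = "http://[fdaa::3]:4280"


def list_machines_request(app_name, token, base_url=PUBLIC_BASE):
    """Build a GET request that lists the Machines in an app.

    Assumption: the path and Authorization scheme follow the public
    Machines API docs; adjust them if your setup differs.
    """
    url = f"{base_url}/v1/apps/{app_name}/machines"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )
```

Passing the result to `urllib.request.urlopen` would send it. Swapping `base_url` for `INTERNAL_BASE` targets flaps directly, which only works from inside the private network.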

The Problem

We observed that some of your requests were taking 500–1000ms to verify authentication. Ouch! We made a perf test that measures this latency. Here are the response times from that perf test over a series of runs:

p(95)=757ms
p(90)=643ms 
med=190ms
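For readers unfamiliar with these summary statistics, the sketch below shows one common way to derive them from raw latency samples. It uses the nearest-rank percentile definition, which is an assumption; the post does not say which tool or method the perf test uses, and `percentile` and `summarize` are hypothetical names.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(latencies_ms):
    """Return the same three summary figures the perf test reports."""
    return {
        "med": statistics.median(latencies_ms),
        "p(90)": percentile(latencies_ms, 90),
        "p(95)": percentile(latencies_ms, 95),
    }
```

A p(95) of 757ms means roughly one request in twenty took at least that long, even though the median request finished in 190ms. That gap is what "tail latency" refers to.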

What We Found

We discovered a few performance issues while investigating this tail latency.

  1. Machines API Proxy had one shared-cpu-1x Machine deployed in every region, which meant all the requests for that region landed on the same worker. That caused contention in a couple of places, and it happens reliably for those of you deploying apps with lots of Machines.

  2. Legacy OAuth tokens—like the ones flyctl uses—are slower to verify than macaroons, especially when you connect to regions far away from iad.

  3. flaps trace data shows 500–1000ms latency for a function that makes a query to a local SQLite database, yet the same trace data shows the SQL query itself consistently completing in under 1ms. We don’t know what’s going on there yet.

The Fix

We fixed the first issue, the contention, by expanding the capacity of the Machines API Proxy. Now we run multiple shared-cpu-2x Machines in every region. fly-proxy will distribute the incoming requests between all the Machines API Proxy Machines in that region, giving us more capacity before we hit other bottlenecks.
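As an intuition for why spreading requests over more Machines helps the tail in particular, here is a toy queueing simulation. It is not a model of how fly-proxy actually balances load; the arrival rate, service times, server counts, and the `simulate_p95` name are all made up for illustration.

```python
import heapq
import random


def simulate_p95(servers, requests=5000, arrival_rate=0.8,
                 mean_service=1.0, seed=1):
    """Toy first-come-first-served queue: return the 95th percentile
    latency (wait plus service) with `servers` identical workers."""
    rng = random.Random(seed)          # fixed seed keeps runs repeatable
    free_at = [0.0] * servers          # min-heap: when each server is next idle
    heapq.heapify(free_at)
    now = 0.0
    latencies = []
    for _ in range(requests):
        now += rng.expovariate(arrival_rate)         # next arrival time
        service = rng.expovariate(1.0 / mean_service)
        earliest = heapq.heappop(free_at)            # soonest-free server
        finish = max(now, earliest) + service        # queue if it is still busy
        heapq.heappush(free_at, finish)
        latencies.append(finish - now)
    latencies.sort()
    return latencies[int(0.95 * len(latencies))]
```

Comparing `simulate_p95(1)` with `simulate_p95(4)` under the same load shows the single-server tail dominated by queueing delay, while the multi-server tail stays close to the service time itself.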

The tail latency for authentication verification improved significantly with that change! This is how those same perf tests from above performed after the change:

        Before   After
p(95)   757ms    294ms
p(90)   643ms    152ms
med     190ms    81ms

All requests got faster, with the slowest requests seeing a big improvement. That means everyone gets the same fast authentication verification on each request, so you can get on with building your apps!
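To put the change in relative terms, this small sketch computes the percentage reduction for each figure reported above. The numbers come straight from the perf test results in this post; `reduction_percent` is a hypothetical helper name.

```python
# Perf test results reported in this post, in milliseconds.
before = {"p(95)": 757, "p(90)": 643, "med": 190}
after = {"p(95)": 294, "p(90)": 152, "med": 81}


def reduction_percent(before_ms, after_ms):
    """Percentage by which latency dropped, rounded to one decimal place."""
    return round((before_ms - after_ms) / before_ms * 100, 1)


# Map each metric to its relative improvement.
reductions = {
    name: reduction_percent(before[name], after[name]) for name in before
}
```

That works out to roughly a 61% reduction at p(95), 76% at p(90), and 57% at the median.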

The other two issues have not been addressed. Yet…

What’s Next

We’re working on the other two issues! Watch the Fresh Produce category for future updates.

In addition, we’re going to add autoscaling to the Machines API Proxy app. fly-proxy’s autoscaling features should make this simple. Stay tuned for updates!

Let us know if you see any issues or performance improvements!
