We made the Machines API faster by adding Machines

tvdfly · April 5, 2024, 8:38pm

The Machines API is our REST API for full control over fast-launching Machines. Tail latency issues were causing some requests to take a long time to verify authentication. We added more Machines to the Machines API Proxy and authentication got faster, which means you can launch your Machines even faster!

Read on for the technical details.

Machines API Architecture

A quick tour of the Machines API architecture:

You connect to https://api.machines.dev when you launch a Machine with flyctl or call the API directly. api.machines.dev points at our Machines API Proxy app, which is a regular Fly App just like the ones you use. fly-proxy receives the HTTP requests and routes them to the nearest Machine in the Machines API Proxy app.

The Machines API Proxy app runs envoy and forwards requests to flaps at http://[fdaa::3]:4280. flaps is the internal Machines API. You can use flaps directly at http://[fdaa::3]:4280 when connected to the Fly wireguard network. Before we built the Machines API Proxy, flyctl used the internal Machines API over user-mode wireguard.
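As a rough illustration of the two entry points described above, here’s a minimal Python sketch that builds (but does not send) a request to list an app’s Machines. The `/v1/apps/{app}/machines` path and the bearer-token header follow the public Machines API docs; the helper name `build_list_machines_request`, the app name, and the token value are placeholders made up for this example.

```python
import urllib.request

PUBLIC_BASE = "https://api.machines.dev"   # served by the Machines API Proxy app
INTERNAL_BASE = "http://[fdaa::3]:4280"    # flaps, reachable only over the Fly wireguard network


def build_list_machines_request(app_name, token, base_url=PUBLIC_BASE):
    """Build a GET request that lists the Machines in an app (hypothetical helper)."""
    url = f"{base_url}/v1/apps/{app_name}/machines"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


# Placeholder values; substitute your own app name and API token.
req = build_list_machines_request("my-app", "example-token")
print(req.get_method(), req.full_url)
```

Passing `base_url=INTERNAL_BASE` targets flaps directly, which only works from inside the wireguard network. To actually send the request you would hand it to `urllib.request.urlopen`.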

We provide this envoy proxy so you can access the API from anywhere. envoy makes it easy for us to sprinkle in generous rate limiting and other features to keep the service available for you and everyone else.

flaps then communicates with flyd on workers across our network to fulfill the Machines API request. More details on flaps and flyd are on our blog: Carving The Scheduler Out Of Our Orchestrator · The Fly Blog

The Problem

We observed that some of your requests were taking 500–1000ms to verify authentication. Ouch! We made a perf test that measures this latency. Here are the response times from that perf test over a series of runs:

p(95)=757ms
p(90)=643ms 
med=190ms
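For anyone who wants to summarize their own latency samples the same way, here’s a minimal Python sketch. It assumes the nearest-rank definition of a percentile, which may not be what our perf test tool uses, and the sample values are made up rather than taken from the test runs.

```python
import math
import statistics


def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up latencies in milliseconds, not data from the perf test.
samples = [80, 95, 120, 150, 190, 210, 300, 450, 640, 760]
print("p(95) =", percentile(samples, 95))
print("p(90) =", percentile(samples, 90))
print("med   =", statistics.median(samples))
```

The median describes a typical request, while p(90) and p(95) describe the slow tail, which is where this problem showed up.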

What We Found

We discovered a few performance issues while investigating this tail latency.

1. Machines API Proxy had one shared-cpu-1x Machine deployed in every region, which meant all the requests for that region landed on the same worker. That caused contention in a couple of places. The contention happens reliably for those of you deploying apps with lots of Machines.
2. Legacy OAuth tokens (like the ones flyctl uses) are slower to verify than macaroons, especially when you connect to regions far away from iad.
3. flaps trace data shows 500–1000ms latency for a function that makes a query to a local SQLite database, yet the same trace data shows the SQL query itself consistently finishing in under 1ms. We don’t know what’s going on there yet.
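One way to dig into a gap like the one in the last item is to time the whole function and the query inside it separately, so the difference points at whatever surrounds the query. Here’s a minimal Python sketch of that idea. It is not flaps code and not how flaps is instrumented: the `tokens` table, its columns, and `timed_lookup` are hypothetical.

```python
import sqlite3
import time


def timed_lookup(conn, token_id):
    """Return (row, total_seconds, query_seconds) for one lookup."""
    total_start = time.perf_counter()
    # Anything a real function does around the query (waiting on a lock,
    # checking out a connection) would sit here and count toward the total only.
    query_start = time.perf_counter()
    row = conn.execute(
        "SELECT app FROM tokens WHERE id = ?", (token_id,)
    ).fetchone()
    query_seconds = time.perf_counter() - query_start
    total_seconds = time.perf_counter() - total_start
    return row, total_seconds, query_seconds


# Hypothetical schema and data, in memory, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tokens (id INTEGER PRIMARY KEY, app TEXT)")
conn.execute("INSERT INTO tokens (id, app) VALUES (1, 'my-app')")

row, total_s, query_s = timed_lookup(conn, 1)
print(f"total={total_s * 1000:.3f}ms query={query_s * 1000:.3f}ms")
```

If the total is far larger than the query time, the slowness lives outside the SQL itself, which matches what the traces suggest.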

The Fix

We fixed the first contention issue by expanding the capacity of the Machines API Proxy. Now we run multiple shared-cpu-2x machines in every region. fly-proxy will distribute the incoming requests between all the Machines API Proxy machines in that region, giving us more capacity before we hit other bottlenecks.

The tail latency for authentication verification improved significantly with that change! This is how those same perf tests from above performed after the change:

Metric	Before	After
p(95)	757ms	294ms
p(90)	643ms	152ms
med	190ms	81ms

All requests got faster, with the slowest requests seeing a big improvement. That means everyone gets the same fast authentication verification on each request, so you can get on with building your apps!
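To put numbers on the improvement, here’s a small Python sketch that computes the relative reduction for each row of the table. The figures are the ones reported above; the `reduction_pct` helper is just for illustration.

```python
# Response times in milliseconds, copied from the before/after table.
before_ms = {"p(95)": 757, "p(90)": 643, "med": 190}
after_ms = {"p(95)": 294, "p(90)": 152, "med": 81}


def reduction_pct(before, after):
    """Percentage drop from before to after."""
    return (before - after) / before * 100


for name in before_ms:
    drop = reduction_pct(before_ms[name], after_ms[name])
    print(f"{name}: {before_ms[name]}ms -> {after_ms[name]}ms ({drop:.1f}% lower)")
```

That works out to roughly 61% lower at p(95), 76% lower at p(90), and 57% lower at the median.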

The other two issues have not been addressed. Yet…

What’s Next

We’re working on the other two issues! Watch the Fresh Produce category for future updates.

In addition, we’re going to add auto scaling to the Machines API Proxy app. fly-proxy’s auto scaling features should make this simple. Stay tuned for updates!

Let us know if you see any issues or performance improvements!
