For reasons we don’t yet fully understand, we got a flood of new users over the last few weeks. Welcome to Fly.io! We’re happy to see you!
Well, most of us are. One thing that wasn’t happy to see everyone was our WireGuard gateways, which started buckling under the load of new peers. As always, installed WireGuard peers kept chugging along just fine, but creating new peers started taking untenably long.
This really murdered people using CI systems like GitHub Actions. Without going through contortions, the natural way to build to Fly.io creates a new WireGuard peer on every build. Slow peer creation times weren’t just slowing builds down; they were blowing timeouts and breaking builds.
Over the past week we’ve rewritten our WireGuard gateway code. On the new gateways, I’m seeing sub-second peer creation regardless of how many peers are already on the host. And, as of today, all our gateways are “new” gateways (the IAD gateway, which soaks up most of our CI load, has been “new” for about a day already).
For those playing the home game of “build you a Fly.io”, here’s the rough chronology of our WireGuard gateways:
They were originally a consul-template; our API wrote peers to Consul, and the template expanded out to a WireGuard config that got wg syncconf’d on updates. Note: don’t do this.
We pulled out consul-templaterb and replaced it with attache, our SQLite Consul mirror. consul-templaterb was itself a big chunk of our latency, so this bought us time. At this point, we’re still writing wg.conf’s, just smarter (we had dumb reasons for wanting a true record of peers in a real wg conf file).
We got rid of the Consul part of WireGuard altogether, and switched to NATS transactions: the API directly reserves peers on gateways, and doesn’t confirm the peer to the API client until the gateway has installed it. Much, much faster.
Will found a bash bug that was adding multiple seconds to every update by (if I remember right) being dumb about adding routes as the number of peers scaled up. Thanks, Will!
As we got to mid-five-digit numbers of peers, peer creation slowed down again. The culprit now was wg syncconf. Jerome fixed attache to directly add new peers rather than shelling out. Note that we were still writing wg.conf, and also that peer deletion still did a syncconf, which had become terribly slow.
Even with Jerome’s change, after the mysterious new user flood, peer creation got real, real slow. In attache, WireGuard competes with a lot of additional Consul data (records for every service running everywhere in the fleet), and SQLite only gives us a single writer thread. We split the attache database to get a dedicated WireGuard writer thread; no luck; we quadrupled the size of a bunch of gateways; still no luck! The culprit turned out to be the wg.conf write: I was using Go’s text/template to blit it out (note: don’t do this).
We stopped writing wg.conf, since we have a better source of truth for peers now. But deletes were still slow, and before I rolled the change out, I got angry and rewrote everything into a new gateway service. Now creates and deletes are fast, and also we can garbage collect old peers, and we have better telemetry.
I have reasons to believe that the current design is going to last awhile (it’s easy to understand, and it’s ultra fast even on gateways we stopped creating new peers on because peer creation had gotten too slow on them).
WireGuard is just one aspect of doing a remote build on Fly.io and there’s a bunch of other things we’re working on, but this was a particularly painful problem, so I wanted to bring everyone up to speed on it.