The last four months have been rough. We’ve had more issues than we’re OK with.
I’ve hesitated to share this because, well, I’m fighting a debilitating feeling of failure. Fear, too. If we don’t improve, our company ceases to exist, and I really like working on this company.
One interesting problem we have is that we’ve exploded in popularity. It sounds like a good problem to have! But we’ve pushed the platform past what it was originally built to do. We’ve put a lot of work and resources into growing the platform and maturing our engineering organization. But that work has lagged growth.
This sucks for you all individually. You don’t really care about our popularity. I mean, a lot of you do. But, really, you just want to confidently ship your apps.
That’s what we want, too. It’s a grind, though, and I think we’re not as forward about our struggles as we should be. Y’all are devs, like us, and we should have been trusting you with the grimy details. So, here some of them are.
Our platform is a bunch of moving pieces that all need to work together so you can deploy an app, deploy it again, walk away, and then come back 24 months later and find out it’s still working. Here’s what goes into making that work:
- A centralized API that does auth and CRUD stuff against a database,
- The WireGuard gateways flyctl uses to connect to your organization’s private network,
- Remote Docker builder VMs flyctl uses to build your app into a Docker image,
- A global Docker image registry to hold those Docker images,
- A secret storage Vault,
- A scheduler that launches Docker images in VMs (that’s Nomad for most apps today),
- Service discovery that propagates information about all the VMs running in our infrastructure,
- The proxy that routes traffic to your app instances, and
- Networking infrastructure to link apps up with each other.
These have all failed in unique and surprising ways. Often, when this happens, we get lucky, and you don’t notice the hiccups. But sometimes we get unlucky.
In no particular order, here are some major incidents from the last 4 months:
- Service discovery & Corrosion
- Centralized secret storage
- Postgres
- Capacity issues
- Volumes pinned to host hardware
- Status paging
Service discovery & Corrosion
We propagate app instance and health information across all our regions. That’s how our proxies know where to route requests, and how our DNS servers know what names to give out.
We started out using HashiCorp Consul for this. But we were shoehorning Consul, which has a centralized server model designed for individual data center deployments, into a global service discovery role it wasn’t suited for. The result: continuously stale data, a proxy that would route to old, expired instances, and private DNS that would routinely have stale entries.
All of this was a consequence of round-tripping every state update we had (every VM start and stop) through a central cluster of servers, often transcontinentally.
In response, we’ve shipped a project called Corrosion. Corrosion is a gossip-based service discovery system. When a VM comes up, that host gossips the instance information. Corrosion’s goal is to propagate changes in under one second, globally (and to get as close to instant as possible).
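If you haven’t worked with gossip protocols before, here’s roughly the shape of the idea. This is a minimal sketch, not Corrosion’s actual code; the types, the version-number conflict rule, and the fanout of 2 are all invented for illustration:

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
)

// InstanceUpdate is a hypothetical record a host might gossip when a VM
// starts or stops. The fields are illustrative, not Corrosion's schema.
type InstanceUpdate struct {
	InstanceID string
	Address    string
	Healthy    bool
	Version    uint64 // monotonically increasing per instance; last write wins
}

// Node is one host's local view of the world.
type Node struct {
	mu    sync.Mutex
	state map[string]InstanceUpdate
	peers []*Node
}

func NewNode() *Node {
	return &Node{state: make(map[string]InstanceUpdate)}
}

// Apply merges an update into local state and, if it was news to us,
// re-gossips it to a small random subset of peers so it keeps spreading.
func (n *Node) Apply(u InstanceUpdate) {
	n.mu.Lock()
	cur, seen := n.state[u.InstanceID]
	if seen && cur.Version >= u.Version {
		n.mu.Unlock()
		return // stale or duplicate; the spread stops here
	}
	n.state[u.InstanceID] = u
	peers := n.peers
	n.mu.Unlock()

	const fanout = 2
	for count, idx := range rand.Perm(len(peers)) {
		if count >= fanout {
			break
		}
		peers[idx].Apply(u)
	}
}

func main() {
	// Three in-memory "hosts", fully meshed for the demo.
	a, b, c := NewNode(), NewNode(), NewNode()
	a.peers = []*Node{b, c}
	b.peers = []*Node{a, c}
	c.peers = []*Node{a, b}

	// Host a boots a VM and gossips it; b and c converge on the same view.
	a.Apply(InstanceUpdate{InstanceID: "app-1", Address: "fdaa::3", Healthy: true, Version: 1})
	fmt.Println(b.state["app-1"], c.state["app-1"])
}
```

The appealing property is that there’s no central round trip: each host only talks to a handful of peers, and an update still reaches every region within a few hops. The flip side is that bad data spreads exactly as fast as good data.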
The problem with Corrosion is that it’s new, and gossip-based consistency is a difficult problem.
We got Corrosion out the door quickly because Consul was causing problems for users. It’s new software, and it’s caused a pair of issues. Both manifested as corrupted global service discovery state. The first issue happened when one of our processes spammed Corrosion with updates, essentially turning it into an internal DDoS. The second occurred during a routine update that unexpectedly messed up a database.
The effect of both of these issues was to break applications during deploys. As VMs came and went, our proxy and DNS servers would find themselves stuck working off stale data.
Corrosion needs to be more resilient to failure. We’re doing incremental things to improve it (rate limits, for instance, mitigate the “internal DDoS” risk). But we’re working on architectural changes, too. Gossip is hard: issues aren’t easy to trace back to a specific broken node, and bad data propagates quickly, which is exactly what you don’t want when there’s a problem.
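To give a sense of what the rate-limiting mitigation looks like, here’s a minimal sketch using Go’s golang.org/x/time/rate. The types and numbers are invented; it illustrates the idea of capping a single writer, not Corrosion’s actual implementation:

```go
package main

import (
	"fmt"

	"golang.org/x/time/rate"
)

// Update stands in for whatever a process publishes into service discovery
// when an instance changes state. The type and the numbers below are invented.
type Update struct {
	InstanceID string
	Healthy    bool
}

// limitedIngest applies updates from a single writer until that writer blows
// through its budget, then sheds the excess so one misbehaving process can't
// flood the whole system.
func limitedIngest(updates <-chan Update, apply func(Update)) (dropped int) {
	limiter := rate.NewLimiter(rate.Limit(100), 200) // 100/sec sustained, bursts of 200
	for u := range updates {
		if !limiter.Allow() {
			// A real system would count and alert on this rather than
			// shedding silently, but the effect is the same: the flood
			// stops at the front door.
			dropped++
			continue
		}
		apply(u)
	}
	return dropped
}

func main() {
	updates := make(chan Update)
	go func() {
		// Simulate a runaway process spamming updates as fast as it can.
		for i := 0; i < 1000; i++ {
			updates <- Update{InstanceID: "app-1", Healthy: true}
		}
		close(updates)
	}()

	applied := 0
	dropped := limitedIngest(updates, func(Update) { applied++ })
	fmt.Printf("applied %d updates, shed %d\n", applied, dropped)
}
```

Shedding is the simple choice here; the important part is that one runaway writer can’t take out service discovery for everyone else.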
Moving off Nomad will also help mitigate Corrosion issues. Because Nomad creates entirely new instances for each deploy, there’s a lot of service discovery churn; many, many event updates per second. Fly Machine-based apps are less frantic – when we update an app running on Machines, we do it in place.
Finally, and this is sort of a general thing not just about service discovery: we deploy a lot of changes to our platform during the week. Sometimes, our changes have collided with yours; an ill-timed app deploy can leave that app in a wonky state. We’re updating our tooling so that app deploys are paused at these times, and when that happens, we’ll make it as obvious as possible why.
Centralized secret storage
We store application secrets in HashiCorp Vault. HashiCorp Vault works a lot like Consul does, with a central cluster of servers.
The problems we have with Vault are less severe than the ones we had with Consul, but they rhyme with them. Every time a new VM boots, the worker running it has to pull secrets from Vault. There are two basic problems with this:
- Vault is in the US, and internet connectivity between distant regions (like MAA) and the US can cause secret lookups to fail.
- There are failure scenarios that will make Vault inaccessible. For instance, we had a hardware failure on one of our Vault servers that caused widespread VM creation failures.
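Both failure modes show up the same way on a worker: a secret read that hangs or errors right as a VM is trying to boot. As a rough sketch of what a defensive lookup could look like, here’s a read with a hard timeout and a last-known-good cache, using the official Go Vault client (github.com/hashicorp/vault/api). The path, the two-second budget, and the cache are all invented for illustration; this isn’t how our workers are actually wired up:

```go
package main

import (
	"fmt"
	"log"
	"sync"
	"time"

	vault "github.com/hashicorp/vault/api"
)

// secretCache is a hypothetical last-known-good cache, so a slow or
// unreachable Vault doesn't block every VM boot on a worker.
var secretCache = struct {
	sync.Mutex
	data map[string]map[string]interface{}
}{data: make(map[string]map[string]interface{})}

// fetchSecrets tries Vault with a hard timeout and falls back to the last
// values this worker saw if Vault is slow or down.
func fetchSecrets(client *vault.Client, path string) (map[string]interface{}, error) {
	type result struct {
		secret *vault.Secret
		err    error
	}
	ch := make(chan result, 1)
	go func() {
		s, err := client.Logical().Read(path)
		ch <- result{s, err}
	}()

	select {
	case r := <-ch:
		if r.err == nil && r.secret != nil {
			secretCache.Lock()
			secretCache.data[path] = r.secret.Data
			secretCache.Unlock()
			return r.secret.Data, nil
		}
		if r.err == nil {
			r.err = fmt.Errorf("no secret at %s", path)
		}
		return cached(path, r.err)
	case <-time.After(2 * time.Second):
		// A transcontinental round trip that takes too long shouldn't
		// hold the boot hostage.
		return cached(path, fmt.Errorf("vault read timed out"))
	}
}

func cached(path string, cause error) (map[string]interface{}, error) {
	secretCache.Lock()
	defer secretCache.Unlock()
	if data, ok := secretCache.data[path]; ok {
		return data, nil
	}
	return nil, fmt.Errorf("vault unavailable and nothing cached: %w", cause)
}

func main() {
	cfg := vault.DefaultConfig() // picks up VAULT_ADDR and friends from the environment
	client, err := vault.NewClient(cfg)
	if err != nil {
		log.Fatal(err)
	}
	secrets, err := fetchSecrets(client, "secret/data/my-app")
	fmt.Println(secrets, err)
}
```

A cache like this only helps for secrets a worker has already fetched, though; a brand-new app’s secrets still need a live round trip to Vault.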
As with service discovery, these problems are exacerbated by Nomad and mitigated by Fly Machines. But new Fly Machine creation will also fail if Vault is in a bad state.
This is a theme: most off-the-shelf open source infrastructure isn’t designed for global deployment. So when we make the choice to “buy” existing infrastructure software, we’re often paying in part with global resilience.
Postgres
Our Postgres clusters have had two major problems: (1) our reliance on Stolon and live connections to Consul clusters, and (2) the expectations we’ve set with “unmanaged Postgres”.
The first is an architectural problem. The Consul clusters Postgres depends on are different than the ones we use for service discovery, but they can still “fail” in strange ways. Stolon, the Postgres cluster software we built the first iteration of Fly Postgres on, doesn’t handle Consul connection issues well.
New Postgres clusters don’t use Stolon, and instead come up with repmgr. repmgr handles leader election within the cluster, without needing a second database. These new Postgres clusters still use Consul to share configuration information, but if Consul melts down, the cluster keeps going.
We are working on getting previously provisioned Postgres DBs upgraded to the new repmgr setup. There are complications, but we’ll keep posting about this.
The second problem we have with Postgres was a poor choice on my part. We decided to ship “unmanaged Postgres” to buy ourselves time to wait for the right managed Postgres provider to show up. The problem is, fly pg create implies that people are getting a managed Postgres cluster. That’s because every other provider with a “get an easy Postgres” feature gives you a managed stack to go with it.
This makes sense now, but was a surprising lesson for me. We ended up presenting a UX that promised a lot, then not following through. We’re not the type of company that writes value statements, but if we were, we’d write something like “don’t create nasty surprises by violating developer expectations”.
We’re going to solve managed Postgres. It’s going to take a while to get there, but it’s a core component of the infrastructure stack and we can’t afford to pretend otherwise.
Capacity issues
An influx of new users ran us out of server capacity in multiple regions, sometimes more than once (hello, Frankfurt).
This was a failure on two levels: we didn’t buy servers fast enough, and we didn’t have good tools for taking pressure off specific regions.
Last year, I assumed that if we hit capacity issues, we could prevent new users from launching in specific regions. This didn’t pan out.
The Heroku exodus broke our assumptions. Pre-Heroku, most of the apps we were running were spread across regions. And: we were growing about 15% per month. But post-Heroku, we got a huge influx of apps in just a few hot spots — and at 30% per month.
In hindsight, I should have started acting like we were doing srsbzns much earlier, basically as soon as we had investor cash to spend.
We’re getting better with capacity planning and logistics. I was doing capacity planning as a side hustle to the rest of my job. The company needed to scale beyond my spreadsheet. We’ve hired here, and reorganized a bit; things are a little better now, and they will rapidly improve.
Volumes are pinned to host hardware
The fly volumes command creates a block device on specific host hardware. When we first shipped this, we had a lot of content explaining the limitations of this approach. We designed our volumes to run in sets of 2+.
This means that if the host your volume is on goes down, your app goes down. If the host doesn’t have enough memory or CPU available to run your app VM, you may not be able to deploy.
Those details got lost as our docs improved, however, and that’s led to some nasty surprises. It’s also counterintuitive. People are used to AWS EBS magic. But our volumes aren’t EBS (I shipped the initial version of volumes myself!).
This is another case of the UX creating the wrong expectations.
Status paging
We’re taking a lot of legitimate flak for vague-posting on our status page. Or not posting on our status page. All while we’re being shamelessly rah-rah about our tech stack in blog posts. Issues happen, and we have not communicated aggressively when they do. That makes us look out to lunch.
This is hard. Even this post is hard. Our egos are all wrapped up in this work. We want you to know everything that’s going on, but it’s easy to slip when we’re tired.
Some of the challenges I’ve written about here are Hard, in the CS sense of the word. But this problem isn’t. There’s no way to excuse it. We’re just going to be better at communicating immediately.
We’ve hired a really great person to build up our Infra/Ops organization. In addition to beefing up that team so that it’s no longer spread so thin it’s translucent, they’re also standardizing our incident responses. When the shit hits the fan, we want as few decisions to make as possible, so we can get information out quicker.
We’re also shipping a personalized status page. As our fleet grows, and we rack more and more servers, the chance of us experiencing a hardware failure at any given moment increases. This has made it tricky for us to keep a totally honest status page. The personalized status page will make it easier for us to tell specific customers impacted by hardware failures “hey, a drive died in this region, we’re working on it”.
This is going in the community forum specifically so you all can reply to it. You may throw tomatoes, if that’s your thing. Or ask questions. We’re in an awkward phase where the company isn’t quite mature enough to support the infrastructure we need to deliver a good developer UX, and we’re going to take the bad with the good until that changes.