First off, I really like what Fly.io is doing. The flexibility of the platform, especially with the Machines API and internal networking, is awesome for what I’m building. My app heavily depends on creating and destroying machines for background jobs and using the <machine_id>.vm.<appname>.internal addresses for internal communication between those machines.
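To give a concrete picture of what "heavily depends" means here, this is roughly the shape of it (a simplified sketch, not my production code; the app name, image, and port are placeholders): a job runner creates a worker Machine through the Machines API, then other machines reach it over its internal address.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

func main() {
	appName := "my-worker-app" // placeholder app name

	// Ask the Machines API to create a worker machine for a background job.
	body, _ := json.Marshal(map[string]any{
		"config": map[string]any{
			"image": "registry.fly.io/my-worker-app:latest", // placeholder image
			"guest": map[string]any{"cpu_kind": "shared", "cpus": 1, "memory_mb": 512},
		},
	})
	req, _ := http.NewRequest("POST",
		"https://api.machines.dev/v1/apps/"+appName+"/machines",
		bytes.NewReader(body))
	req.Header.Set("Authorization", "Bearer "+os.Getenv("FLY_API_TOKEN"))
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var machine struct {
		ID string `json:"id"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&machine); err != nil {
		panic(err)
	}

	// Other machines in the same private network talk to the new worker directly
	// over its internal address (port 8080 is just whatever the worker listens on).
	fmt.Printf("http://%s.vm.%s.internal:8080/\n", machine.ID, appName)
}
```

If either the Machines API or that internal networking path is degraded, the whole job pipeline stalls.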
That said, I’ve noticed a pattern with incidents across regions that specifically impact the Machines API and internal networking. During these times, my app essentially becomes unavailable because it relies so much on those features. What’s interesting is that the public-facing part of the app (like a typical web service) seems to stay up just fine. So, if your app is just a standard public web service talking to an external database, these incidents probably don’t feel like a big deal. I’d guess this covers most apps (maybe 95% of them?), but for use cases like mine that depend on Machines and internal networking, it’s a different story.
I’m honestly a bit nervous about launching on Fly.io because of this. I need these core features to be rock solid in production. So, I wanted to ask:
Are there ongoing efforts to make the Machines API and internal networking more reliable?
What’s being done to reduce the frequency and impact of these incidents?
Is there any advice for building apps on Fly.io in a way that minimizes these risks?
I really want Fly.io to work for me because it’s such a great fit for my app’s architecture. But these incidents are making me hesitant. Any updates or transparency on this would go a long way.
Thanks for all the work you’re putting into the platform!
Hi… This doesn’t answer all of your questions, obviously, but, as an outside observer, I strongly recommend the (excellent but much overlooked) Infrastructure Log for the transparency part, if you haven’t already seen it…
Once again, an incident affecting the Machines API. If these incidents were at least region-specific, one could always fall back to another region. But that's not the case.
The main selling point of Fly for me, and the only reason I'm building on it, is the Machines API: the ability to create and start machines within seconds, without managing a Kubernetes cluster or dealing with complicated container orchestration, all with a great developer experience. But if that comes at the cost of frequent incidents that result in complete app outages, I might prefer dealing with Kubernetes clusters on one of the big cloud providers.
You always suggest running our apps in multiple regions to avoid issues, but the issues you have always affect the platform globally. It's not that I can't create a machine in the iad region; it's that I can't create a machine anywhere.
Do you have any plans to improve this and regionalize incidents, so we can at least fall back to another region during an incident?
Yet another global incident impacting the Machines API. Over the past 40 days of running my app’s background jobs on Fly, I’ve experienced at least five incidents where creating and starting machines via the API was disrupted. I’ve even had to implement a feature flag to fall back to running jobs on AWS Lambda—an alternative that’s significantly more expensive—just to keep things running.
I’ll ask again: what’s being done to ensure Machines API incidents are regionalized? My app already includes logic to retry machine creation in a different region if it fails, but during these incidents, it doesn’t matter which region I try—they all fail.
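To be concrete, the retry/fallback logic looks roughly like this (a simplified sketch; createMachine, runOnLambda, and the LAMBDA_FALLBACK flag are stand-ins for my real code):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"os"
)

// Stand-ins for my real job-runner code; the names here are made up.
type Job struct{ ID string }

func createMachine(ctx context.Context, region string, job Job) error {
	// The real version POSTs to the Machines API with the target region set.
	return errors.New("not implemented in this sketch")
}

func runOnLambda(ctx context.Context, job Job) error {
	// The real version invokes the (much more expensive) Lambda-based runner.
	return errors.New("not implemented in this sketch")
}

func lambdaFallbackEnabled() bool {
	return os.Getenv("LAMBDA_FALLBACK") == "1" // the feature flag
}

func runJob(ctx context.Context, job Job) error {
	regions := []string{"iad", "ord", "ams", "syd"}
	for _, region := range regions {
		if err := createMachine(ctx, region, job); err != nil {
			log.Printf("machine create failed in %s: %v", region, err)
			continue // try the next region
		}
		return nil
	}
	// During a global Machines API incident every region fails, so the
	// per-region retry buys nothing and the only option is the fallback.
	if lambdaFallbackEnabled() {
		return runOnLambda(ctx, job)
	}
	return fmt.Errorf("machine create failed in all regions for job %s", job.ID)
}

func main() {
	if err := runJob(context.Background(), Job{ID: "example"}); err != nil {
		log.Fatal(err)
	}
}
```

The region loop only helps when a failure is regional; during these global incidents it just burns time before hitting the fallback.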
I’d really like to see a focus on improving the stability of core features like this rather than releasing new capabilities like Docker-in-Docker. It feels like there’s a new incident every week affecting global machine creation, and it’s making it hard to trust this critical part of the platform.
Hey @empz, I missed your OP and happened to see it today after working through the Machines API incident.
I'll start here. Over the last month we've been working on some very specific changes to make outages more regional. It's a big change, so we're being as careful and thoughtful as possible to make sure we don't end up with less reliable operations while we make it.
I can speak to one specific item because it's what I've been focused on for the last 3-4 months (and it was also the cause of today's brief API incident). Before the Machines API was a product, all customer workloads ran on Nomad (a central system) and flyctl relied on our GraphQL API (also central) for deploys. Once we decided to make Machines a product and expose it via an API for flyctl and customers to use directly, the goal has always been for API operations to be regional: if you want to create a machine in syd, we want every aspect of the request to stay in syd. The Machines API is deployed to every worker and gateway in every region and can handle any request. However, we couldn't immediately stop relying on some of the data backing our central GraphQL API, because it is the origin of our platform/product. When you create (or update) a Machine, the process running across our fleet has to make a call to the GraphQL API to convert parts of the Machine Config you provide into our internal representation.
What this means in reliability terms is that Machine create/update operations are only as reliable as the GraphQL API. We know this has never been ideal; we've been working to remove the dependency, and we also work constantly to keep the GraphQL API reliable, but it's a different system entirely. Earlier today I deployed a change so the GraphQL API would stop converting another part of the Machine Config (a change that had been tested in a staging environment), but the circular dependency between the two systems caused validation errors during the production deploy. We had alerts going off within a few minutes and quickly worked to resolve it.
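Very roughly (and this is not our actual code, just a sketch of the shape of the coupling), the create path today looks like this: the request is handled regionally, but one step still has to cross over to the central GraphQL API, so that step inherits the central API's availability.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

type MachineConfig struct{ Image string }
type internalConfig struct{ imageRef string }

// convertViaGraphQL stands in for the cross-system call to the central API.
func convertViaGraphQL(ctx context.Context, cfg MachineConfig) (internalConfig, error) {
	// If the central API is down or mid-deploy, this fails for requests in
	// every region at once, regardless of which worker handles them.
	return internalConfig{}, errors.New("central GraphQL API unavailable")
}

func createMachineInRegion(ctx context.Context, region string, cfg MachineConfig) error {
	internal, err := convertViaGraphQL(ctx, cfg)
	if err != nil {
		return fmt.Errorf("create in %s blocked by central dependency: %w", region, err)
	}
	_ = internal // the rest of the create stays regional
	return nil
}

func main() {
	fmt.Println(createMachineInRegion(context.Background(), "syd", MachineConfig{Image: "example"}))
}
```

Removing that one remaining central call is what makes create/update fully regional.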
I assume you're referencing my recent Fresh Produce? If so, I can tell you the split of my time between various reliability work and the work on our new init has been maybe 80% reliability, 20% pilot over the last ~3 months. The pilot project started almost a year ago, and we have a lot more details to share about it soon. From a company standpoint, the focus is reliability in all forms (engineering, support, operations), and that's not new. The features mean nothing if you can't trust the platform to work when you need it.
Glad to hear there’s been a lot of work going on to decouple from that single GraphQL API. Do you believe this effort will be completed sometime during 2025?
Aside from the Machines API’s reliance on this single point of failure—the GraphQL API—are there any other components of Fly that strongly depend on non-regionalized services? Ideally, any incident should be manageable by switching to a healthy region.
There are still some dependencies on HashiCorp Vault. For the most part, customer secrets now run through PetSem, our in-house secret store, which is regionalized (and also much simpler to reason about; to serve live requests, it's just a web API and a SQLite service), but there are still some legacy connections to Vault that we're working on rooting out.
That’s the one that most immediately jumps to mind.
It’s not a perfect example of what we’re talking about on this thread, but the work to regionalize Corrosion — a large-scale distributed system, but one with a global state space for the whole platform — is another example of the direction we’re going here.
Any project that credibly removes a SPOF in our architecture is staffed right now; there’s some new-feature work, but overwhelmingly the engineering team is focused on reliability, scale, and capacity management issues, and the company is mostly the engineering team.