Hey @empz, I missed your OP and happened to see it today after working through the Machine API incident.
I’ll start here. We’ve been working over the last month on some very specific changes to make outages more regional. It’s a big change, so we’re being as careful and thoughtful as possible to ensure we don’t end up with less reliable operations as we make the changes.
I can speak to one specific item because it’s what I’ve been focused on for the last 3-4 months (and it’s also the cause of today’s brief API incident). Before the Machines API was a product, all customer workloads ran on Nomad (a central system) and we relied on our GraphQL API (also a central system) for flyctl to use for deploys. Once we decided to make Machines a product and expose it via an API for flyctl and customers to use directly, the goal has always been for API operations to be regional. If you want to create a machine in syd, we want every aspect of the request to stay in syd. The Machines API is deployed to every worker and gateway in our regions and can handle any request. However, we couldn’t immediately stop relying on some of the data backing our central GraphQL API because it is the origin of our platform/product. When you create (or update) a Machine, the process running across our fleet has to make a call to the GraphQL API to convert parts of the Machine Config you provide into our internal representation.
What this means in reliability terms is that Machine Create/Update operations are only going to be as reliable as the GraphQL API. We know this has never been ideal. We have been working to remove this dependency, and we constantly work to make the GraphQL API reliable, but it’s a different system entirely. Earlier today, I deployed a change to have the GraphQL API stop converting another part of the Machine Config. The change had been tested in a staging environment, but the circular dependency between the two systems caused validation errors during the production deploy. We had alerts going off within a few minutes and quickly worked to resolve it.
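As a rough illustration of why that dependency matters, here is a minimal sketch of the arithmetic. The availability figures and the `serial_availability` helper are made up for the example (they are not measurements of our systems), and it assumes the two systems fail independently:

```python
def serial_availability(*components: float) -> float:
    """Availability of a request that needs every listed component to succeed.

    Assumes the components fail independently of each other.
    """
    result = 1.0
    for availability in components:
        result *= availability
    return result


# Hypothetical figures, purely for illustration.
regional_machines_api = 0.9999
central_graphql_api = 0.999

# Today: a create/update needs both the regional API and the central API.
create_with_dependency = serial_availability(regional_machines_api, central_graphql_api)

# Goal: a create/update only needs the regional API.
create_regional_only = serial_availability(regional_machines_api)

print(f"with central dependency: {create_with_dependency:.4%}")
print(f"regional only:           {create_regional_only:.4%}")
```

With a serial dependency the combined figure can never be better than the weaker of the two systems, which is why removing the call to the central API is the goal rather than only hardening it.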
I assume you’re referencing my recent Fresh Produce? If so, I can tell you that my time over the last ~3 months has been split roughly 80% reliability work, 20% work on our new init (pilot). The pilot project started almost a year ago and we have a lot more details to share about it soon. From a company standpoint, the focus is reliability in all forms (engineering, support, operations) and that’s not new. The features mean nothing if you can’t trust the platform to work when you need it.