More Reliable Volume Operations

Over the past week, we’ve worked to change how our central API communicates with the fleet of workers when performing volume-related actions (e.g. create, destroy, extend). We’ve relied on nats.io for quite a while, and it continues to be a good choice for certain operations in our system, but at times we also hit routing issues that are difficult to diagnose and recover from without manual intervention. The end result was that you would experience random errors when working with volumes even though the system was in a stable, working condition.

What Changed

All of our infrastructure is connected over a WireGuard mesh, which allows us to create direct, point-to-point communication between any two hosts. nats.io has been very easy to work with, and thanks to our topic naming conventions we didn’t have to worry about where a message would be sent. However, all of the messages exchanged were synchronous request/reply and the caller already knew which host it needed to send the message to, so we reached a point where we really weren’t taking advantage of the flexibility and features of nats.io.

Over the last several months, we experimented with swapping out nats.io as the communication mechanism between flyd and the global machines API process (a.k.a. flaps) in favor of Connect (https://connect.build/). Once we saw an improvement in the reliability of API calls, we turned our attention to figuring out whether it would be a good fit for our central API. The machines API and flyd are both written in Go, so we were able to make use of the connect-go module when building the integration. Because our central API is Ruby/Rails and the Connect protocol is built on top of normal HTTP semantics, we were able to easily swap out our nats.io requests for HTTP without any major refactoring.
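
To make "normal HTTP semantics" concrete: in the Connect protocol, a unary RPC is a plain HTTP POST to a path of the form /<package>.<Service>/<Method>, with a JSON (or protobuf) body. That means any HTTP client can make the call, with no generated code required. The Go sketch below shows the shape of such a call using only the standard library. Everything specific in it is an assumption made up for illustration: the service path, the message fields, and the in-process fake worker are hypothetical and are not the actual flyd API.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// Hypothetical message shapes, invented for this sketch.
type extendVolumeRequest struct {
	VolumeID string `json:"volumeId"`
	SizeGB   int    `json:"sizeGb"`
}

type extendVolumeResponse struct {
	VolumeID string `json:"volumeId"`
	SizeGB   int    `json:"sizeGb"`
}

// Hypothetical procedure path in Connect's /<package>.<Service>/<Method> form.
const extendVolumeProcedure = "/volumes.v1.VolumeService/ExtendVolume"

// callUnary performs a Connect-style unary RPC with JSON encoding:
// one POST, one JSON request body, one JSON response body.
func callUnary(client *http.Client, baseURL, procedure string, req, resp any) error {
	body, err := json.Marshal(req)
	if err != nil {
		return fmt.Errorf("encode request: %w", err)
	}
	httpReq, err := http.NewRequest(http.MethodPost, baseURL+procedure, bytes.NewReader(body))
	if err != nil {
		return fmt.Errorf("build request: %w", err)
	}
	httpReq.Header.Set("Content-Type", "application/json")
	httpReq.Header.Set("Connect-Protocol-Version", "1")

	httpResp, err := client.Do(httpReq)
	if err != nil {
		return fmt.Errorf("send request: %w", err)
	}
	defer httpResp.Body.Close()

	payload, err := io.ReadAll(httpResp.Body)
	if err != nil {
		return fmt.Errorf("read response: %w", err)
	}
	// Connect reports unary errors with a non-200 status and a JSON error body.
	if httpResp.StatusCode != http.StatusOK {
		return fmt.Errorf("rpc failed: HTTP %d: %s", httpResp.StatusCode, payload)
	}
	return json.Unmarshal(payload, resp)
}

// newFakeWorker starts an in-process stand-in for a worker so the sketch
// runs without any real infrastructure. It simply echoes the request back.
func newFakeWorker() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc(extendVolumeProcedure, func(w http.ResponseWriter, r *http.Request) {
		var in extendVolumeRequest
		if err := json.NewDecoder(r.Body).Decode(&in); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(extendVolumeResponse{VolumeID: in.VolumeID, SizeGB: in.SizeGB})
	})
	return httptest.NewServer(mux)
}

func main() {
	worker := newFakeWorker()
	defer worker.Close()

	var out extendVolumeResponse
	in := extendVolumeRequest{VolumeID: "vol_example", SizeGB: 20}
	if err := callUnary(worker.Client(), worker.URL, extendVolumeProcedure, in, &out); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("extended %s to %d GB\n", out.VolumeID, out.SizeGB)
}
```

Because the caller already knows which host to talk to, the base URL can point straight at that host over the mesh. The same POST is just as easy to issue from a Ruby HTTP client, which is why no major refactoring was needed on the Rails side.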

What’s Next

Now that flyd exposes the same type of API for Volumes internally, we hope to expand the Machines API to support more resource types, giving you more options for how you manage the systems you build on top of our platform.
