More Reliable Volume Operations

JP_Phillips · March 21, 2023, 9:58pm

Over the past week, we’ve worked to change how our central API communicates with the fleet of workers when performing volume-related actions (e.g. create, destroy, extend). We’ve relied on nats.io for quite a while, and it continues to be a good choice for certain operations in our system, but at times we also hit routing issues that are difficult to diagnose and recover from without manual intervention. The end result was that you would see random errors when working with volumes even though the system was in a stable, working condition.

What Changed

All of our infrastructure is connected over a WireGuard mesh, which allows us to create direct, point-to-point communication between any two hosts. nats.io has been very easy to work with, and because of our topic naming conventions we didn’t have to worry about where a message would be sent. However, all of the messages exchanged were synchronous request/reply and the caller already knew which host it needed to send the message to, so we reached a point where we really weren’t taking advantage of the flexibility and features of nats.io.

Over the last several months, we experimented with replacing nats.io as the communication mechanism between flyd and the global machines API process (a.k.a. flaps) with Connect (https://connect.build/). Once we saw an improvement in the reliability of API calls, we turned our attention to figuring out whether it would be a good fit for our central API. The machines API and flyd are both written in Go, so we were able to use the connect-go module when building the integration. Because our central API is Ruby/Rails and the Connect protocol is built on top of normal HTTP semantics, we were able to easily swap out our nats.io requests for HTTP requests without any major refactoring.

What’s Next

Now that flyd exposes the same type of API for Volumes internally, we hope to expand the Machines API to support more resource types, giving you more options for how you manage the systems you build on top of our platform.

