At the heart of our platform runs Corrosion, our distributed service discovery system. It's used by most platform components, from the Machines API to fly-proxy and the UDP router. We used to run one big cluster that spanned the entire platform and contained all the data needed to make routing decisions. That made it easy for the proxy to route requests, but proved problematic whenever Corrosion misbehaved: an issue with Corrosion could render the whole platform unstable.
For the last few months we've been busy redesigning this system. We now run dedicated Corrosion clusters in each region that contain high-fidelity data for machines in that region, like machine status (started/stopped/suspended) and health check status (passing/critical). We still run the global cluster for data that needs to be global and doesn't change frequently, like the IP addresses assigned to an app.
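To make the split concrete, here's a minimal Go sketch of the two kinds of data. The type and field names are made up for this example (this is not Corrosion's actual schema), and the `routable` predicate is a hypothetical stand-in for the kind of check a proxy might run against the regional cluster:

```go
package main

import "fmt"

// Regional clusters hold high-fidelity, fast-changing per-machine state.
type machineRow struct {
	id     string
	status string // "started", "stopped", or "suspended"
	health string // "passing" or "critical"
}

// The global cluster holds slow-changing, platform-wide data.
type appRow struct {
	app  string
	addr string // IP address assigned to the app
}

// routable is a hypothetical predicate evaluated against the regional
// cluster: only started machines with passing checks receive traffic.
func routable(m machineRow) bool {
	return m.status == "started" && m.health == "passing"
}

func main() {
	regional := []machineRow{
		{id: "m1", status: "started", health: "passing"},
		{id: "m2", status: "stopped", health: "critical"},
	}
	global := appRow{app: "my-app", addr: "192.0.2.1"} // illustrative address

	for _, m := range regional {
		fmt.Println(m.id, routable(m))
	}
	fmt.Println(global.app, "->", global.addr)
}
```

The point of the split is visible in the types: the regional rows churn every time a machine starts, stops, or fails a health check, while the global row barely changes.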
fly-proxy has been using the new regional Corrosion clusters to make routing decisions for a while, and now our UDP router uses them too. This means that edges in region A don't know which specific machines in region B could handle a UDP packet received on a given IP:port; they just know that some machines in region B that could handle it exist, and they forward the packet to a worker in that region to let it deal with the packet. In most cases the new routing scheme won't add any extra hops, since we keep enough low-fidelity information in the global Corrosion cluster to "guess" the correct worker host in another region right away. But if the global Corrosion cluster is lagging or misbehaving and no longer has fresh enough information, a UDP packet may hop a few times between hosts until it lands on a machine that can handle it.