Routing internal requests during outages

I have a question about internal request routing.

Say I have 3 openresty instances which proxy to 3 varnish instances, spread across 3 regions. Varnish proxies to a Rails app.

Under normal circumstances, openresty should route to its region-local varnish pair. That can be done using the current_region.varnish.internal address. What happens if the current region’s varnish is unhealthy?

If the answer is that no IPs would be returned, would it be smart to get all healthy records from varnish.internal and ensure the regional one is weighted first?

Naturally, my next question: could this logic be handled by service discovery directly? If not, I suppose it could be generalized into a script that returns the correctly ordered IPs.
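To make the "script that returns the correctly ordered IPs" idea concrete, here's a minimal sketch. The records are passed in as data rather than resolved live, since .internal names only resolve inside a Fly private network; the region names and IPv6 addresses are invented for illustration.

```python
# Order healthy Varnish instances so the current region's addresses
# come first, with other regions kept as fallbacks.

def order_by_region(records, current_region):
    """records: list of (region, ip) tuples for healthy instances."""
    local = [ip for region, ip in records if region == current_region]
    remote = [ip for region, ip in records if region != current_region]
    return local + remote  # only fall back to remote when local is gone

# Hypothetical healthy records pulled from varnish.internal:
healthy = [
    ("ord", "fdaa::3"),
    ("ams", "fdaa::5"),
    ("ord", "fdaa::7"),
]

print(order_by_region(healthy, "ord"))
# → ['fdaa::3', 'fdaa::7', 'fdaa::5']
```

If the local region's Varnish is unhealthy, its records simply don't appear in the input, and the remote addresses are what's left.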

One way to handle this weighting that would be compatible with Varnish vmod_dynamic is adding support for DNS SRV records. Is this something Fly would consider? The weighting could be done by placing the current region first, then the next nearest, etc.

This would be great for services like Varnish which do not require any additional configuration to refresh backends dynamically. Nginx also appears to support this in their upstream module.
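For reference, SRV selection (per RFC 2782) works in two steps: clients try the lowest priority value first, and within a priority tier they pick a target at random with probability proportional to its weight. A rough sketch of that selection logic, with invented priorities and hostnames:

```python
import random

def pick_target(srv_records):
    """srv_records: list of (priority, weight, target) tuples."""
    lowest = min(p for p, _, _ in srv_records)
    tier = [(w, t) for p, w, t in srv_records if p == lowest]
    total = sum(w for w, _ in tier)
    if total == 0:
        return random.choice(tier)[1]  # all-zero weights: pick any
    r = random.uniform(0, total)
    for w, t in tier:
        r -= w
        if r <= 0:
            return t
    return tier[-1][1]

# Current region gets the preferred (lowest) priority; the next
# nearest region only gets traffic if the first tier disappears.
records = [
    (10, 60, "varnish-local.internal"),
    (20, 60, "varnish-nearby.internal"),
]
print(pick_target(records))  # → varnish-local.internal
```

So the region-distance ordering would map naturally onto the priority field, with weight left for balancing within a region.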

There’s been interest internally to add support for SRV records. Looks like you just gave us a good reason to do it.

Nginx only does upstream DNS refreshes with a commercial license. But it’s not hard to reload nginx when DNS changes. :smiley:
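One way to wire up that reload, sketched as a tiny poller: re-resolve the upstream name, and when the set of addresses differs from last time, run nginx -s reload. The hostname and interval are placeholders; a real version would respect the DNS record's TTL instead of a fixed sleep.

```python
import socket
import subprocess
import time

def resolve(name):
    """Return the current set of addresses for a hostname."""
    return {info[4][0] for info in socket.getaddrinfo(name, None)}

def watch(name, interval=5):
    """Reload nginx whenever the resolved address set changes."""
    last = resolve(name)
    while True:
        time.sleep(interval)
        current = resolve(name)
        if current != last:
            subprocess.run(["nginx", "-s", "reload"], check=True)
            last = current

# Usage (would run forever):
# watch("varnish.internal")
```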

Doing an SRV record that prioritizes instances by region distance would be neat. Seems like setting a weight to 0 effectively makes them backups?

I wasn’t sure about Nginx and dynamic DNS - typical! In any case, this brings up a few questions:

  • What restart strategy will Fly use if we force a restart with fly restart? Can this be controlled like the deployment strategy?
  • What about adding an event stream somewhere to inform about cluster activity instead of having to build something to poll DNS?

0 weight would be like a backup, yes. At least it looks like it would be for Varnish vmod_dynamic.

fly restart just cycles through each VM and restarts the process. It’s not very smart. If you want better control you can fly vm stop <id> one by one, or do a deploy.

An event stream would make a ton of sense; it should be pretty quick to add to the NATS endpoint we’re using for logs. I don’t know when we’ll get to it but we’re doing a lot of related replacing-of-parts.