Routing internal requests during outages

I have a question about internal request routing.

Say I have 3 openresty instances which proxy to 3 varnish instances, spread across 3 regions. Varnish proxies to a Rails app.

Under normal circumstances, openresty should route to its region-local varnish pair. That can be done using the current_region.varnish.internal address. What happens if the current region’s varnish is unhealthy?

If the answer is that no IPs would be returned, would it be smart to get all healthy records from varnish.internal and ensure the regional one is weighted first?

Naturally, my next question: could this logic be handled by service discovery directly? If not, I suppose it could be generalized into a script that returns the correctly ordered IPs.
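To make the "script that returns the correctly ordered IPs" idea concrete, here's a minimal sketch. The records are passed in as data rather than resolved live, since .internal names only resolve inside a Fly private network; the region names and IPv6 addresses are invented for illustration.

```python
# Order healthy Varnish instances so the current region's addresses
# come first, with other regions kept as fallbacks.

def order_by_region(records, current_region):
    """records: list of (region, ip) tuples for healthy instances."""
    local = [ip for region, ip in records if region == current_region]
    remote = [ip for region, ip in records if region != current_region]
    return local + remote  # only fall back to remote when local is gone

# Hypothetical healthy records pulled from varnish.internal:
healthy = [
    ("ord", "fdaa::3"),
    ("ams", "fdaa::5"),
    ("ord", "fdaa::7"),
]

print(order_by_region(healthy, "ord"))
# → ['fdaa::3', 'fdaa::7', 'fdaa::5']
```

If the local region's Varnish is unhealthy, its records simply don't appear in the input, and the remote addresses are what's left.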

One way to handle this weighting that would be compatible with Varnish vmod_dynamic is adding support for DNS SRV records. Is this something Fly would consider? The weighting could be done by placing the current region first, then the next nearest, etc.

This would be great for services like Varnish which do not require any additional configuration to refresh backends dynamically. Nginx also appears to support this in their upstream module.
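For reference, SRV selection (per RFC 2782) works in two steps: clients try the lowest priority value first, and within a priority tier they pick a target at random with probability proportional to its weight. A rough sketch of that selection logic, with invented priorities and hostnames:

```python
import random

def pick_target(srv_records):
    """srv_records: list of (priority, weight, target) tuples."""
    lowest = min(p for p, _, _ in srv_records)
    tier = [(w, t) for p, w, t in srv_records if p == lowest]
    total = sum(w for w, _ in tier)
    if total == 0:
        return random.choice(tier)[1]  # all-zero weights: pick any
    r = random.uniform(0, total)
    for w, t in tier:
        r -= w
        if r <= 0:
            return t
    return tier[-1][1]

# Current region gets the preferred (lowest) priority; the next
# nearest region only gets traffic if the first tier disappears.
records = [
    (10, 60, "varnish-local.internal"),
    (20, 60, "varnish-nearby.internal"),
]
print(pick_target(records))  # → varnish-local.internal
```

So the region-distance ordering would map naturally onto the priority field, with weight left for balancing within a region.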

There’s been interest internally to add support for SRV records. Looks like you just gave us a good reason to do it.

Nginx only does upstream DNS refreshes with a commercial license. But it’s not hard to reload nginx when DNS changes. :smiley:
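One way to wire up that reload, sketched as a tiny poller: re-resolve the upstream name, and when the set of addresses differs from last time, run nginx -s reload. The hostname and interval are placeholders; a real version would respect the DNS record's TTL instead of a fixed sleep.

```python
import socket
import subprocess
import time

def resolve(name):
    """Return the current set of addresses for a hostname."""
    return {info[4][0] for info in socket.getaddrinfo(name, None)}

def watch(name, interval=5):
    """Reload nginx whenever the resolved address set changes."""
    last = resolve(name)
    while True:
        time.sleep(interval)
        current = resolve(name)
        if current != last:
            subprocess.run(["nginx", "-s", "reload"], check=True)
            last = current

# Usage (would run forever):
# watch("varnish.internal")
```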

Doing an SRV record that prioritizes instances by region distance would be neat. Seems like setting a weight to 0 effectively makes them backups?

I wasn’t sure about Nginx and dynamic DNS - typical! In any case, this brings up a few questions:

  • What restart strategy will Fly use if we force a restart with fly restart? Can this be controlled like the deployment strategy?
  • What about adding an event stream somewhere to inform about cluster activity instead of having to build something to poll DNS?

0 weight would be like a backup, yes. At least it looks like it would be for Varnish vmod_dynamic.

fly restart just cycles through each VM and restarts the process. It’s not very smart. If you want better control you can fly vm stop <id> one by one, or do a deploy.

An event stream would make a ton of sense; it should be pretty quick to add to the NATS endpoint we’re using for logs. I don’t know when we’ll get to it but we’re doing a lot of related replacing-of-parts.