We have a passive health check system: each edge avoids sending connections to hosts it has observed as unhealthy from its own perspective. This lets us route around nodes temporarily during network failures or any other condition causing connection failures.
If you only have one instance, and it happens to be on a node experiencing difficulties, there's nowhere better for our edge to route.
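To make the behavior above concrete, here's a minimal sketch of how an edge might skip hosts it recently saw fail, falling back to the only candidate when nothing healthy remains. All names, the cooldown value, and the structure are hypothetical, not our proxy's actual implementation:

```python
import time

# Hypothetical sketch of edge-side passive health checks: after a
# connection failure we mark the host unhealthy for a cooldown window
# and route around it, unless it is the only option left.

COOLDOWN_SECONDS = 30.0  # illustrative value, not our real setting

class EdgeRouter:
    def __init__(self):
        # host -> timestamp of the last observed connection failure
        self._last_failure = {}

    def record_failure(self, host, now=None):
        self._last_failure[host] = now if now is not None else time.monotonic()

    def _is_healthy(self, host, now):
        failed_at = self._last_failure.get(host)
        return failed_at is None or (now - failed_at) >= COOLDOWN_SECONDS

    def pick_host(self, candidates, now=None):
        """Prefer hosts with no recent failures; if every candidate is
        marked unhealthy (e.g. a single-instance app), fall back to the
        first one, since there is nowhere better to route."""
        now = now if now is not None else time.monotonic()
        healthy = [h for h in candidates if self._is_healthy(h, now)]
        return healthy[0] if healthy else candidates[0]
```

The fallback in `pick_host` is why a single-instance app on a bad node still receives traffic: skipping the host entirely would mean dropping the request.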
These problems shouldn't persist without us creating a status page incident and investigating, though. This could be a bug in our proxy where it holds on to a host's unhealthy status longer than it should. I've now deployed the latest version of our proxy to all our edges and hope this clears it up.
That said, I don’t think all our issues are the same!
@davidfro and @jsphc: we've detected very high latency between the two Denver nodes hosting your app and your DB. Looking into that now.
@containerops: this might be the issue I outlined at the start of my message. Can you tell from your synthetic monitoring whether your metrics have improved?