SSL errors on a long-running service

davidhodge · October 29, 2020, 11:44pm

I had some very simple nginx services with max 2 servers each (maybe just 1) that suddenly started giving me SSL errors even though I hadn't changed anything about them. Would love general tips on how to prevent this sort of thing from happening in the future. I “fixed” it by redeploying, but it felt a bit like turning it off and then on again instead of understanding the cause. Thank you!

kurt · October 30, 2020, 1:22am

These happen when an app isn’t accepting requests from us. Normally this should fail health checks and trigger a replacement VM. But we’ve seen apps where the health checks keep passing even though we can’t get the app to return responses to us, and those won’t get rescheduled.

This doesn’t necessarily require a change in the app. The VMs we run can move at any time. If someone launches a large VM, it could evict smaller VMs from a given set of hardware.

The best thing to do here is to run a minimum of 2 instances for failover purposes. That’ll give you a buffer if an instance dies and isn’t coming back up properly, and it covers periods where we aren’t rescheduling the VM reliably.

Sometimes turning it off and on again is OK, especially with disposable VMs like these. But redundancy is good either way.
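
For the case above where health checks keep passing while real requests fail, one option is to run your own end-to-end probe from outside the platform. The following is only a rough sketch using the Python standard library, not something the platform provides: the `probe` function and `ProbeResult` type are hypothetical names, and the URL you pass in is whatever public address your service answers on. It makes one real request (including the TLS handshake for an https URL) and reports whether it failed at the TLS layer, the connection layer, or with an HTTP error status.

```python
import http.client
import ssl
import urllib.error
import urllib.request
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Outcome of one probe (hypothetical helper type for this sketch)."""
    ok: bool
    detail: str


def probe(url: str, timeout: float = 5.0) -> ProbeResult:
    """Make one real request to `url` and report whether it succeeded."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return ProbeResult(True, f"HTTP {resp.status}")
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        # HTTPError subclasses URLError, so it must be caught first.
        return ProbeResult(False, f"HTTP {exc.code}")
    except urllib.error.URLError as exc:
        # urllib wraps handshake and connection failures in URLError.
        if isinstance(exc.reason, ssl.SSLError):
            return ProbeResult(False, f"TLS error: {exc.reason}")
        return ProbeResult(False, f"connection error: {exc.reason}")
    except (OSError, http.client.HTTPException) as exc:
        # Timeouts or disconnects while reading the response.
        return ProbeResult(False, f"connection error: {exc}")
```

Calling something like `probe("https://your-app.example/")` on a schedule and alerting when `ok` is false would surface SSL errors like these even when the platform's own health checks still look fine. It only tells you that a request failed, not why the VM was unavailable.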

kurt · October 30, 2020, 6:38pm

FYI, I dug into your app specifically. It looks like what happened was:

1. Hosts in one region experienced a network disruption.
2. We migrated the VM to another region.
3. The host the VM was scheduled on fell behind on launching VMs.
4. Some time later, the VM launched successfully on the new host.

This is a definite bug in our scheduling. In most cases the VM would only have been down for a minute or two. I don’t think this will happen frequently, but it definitely can happen, and it’s a good reason to run extra VMs for redundancy.

davidhodge · October 30, 2020, 7:36pm

Thanks Kurt! I really appreciate it!