I had some very simple nginx services with at most 2 servers each (maybe just 1) that suddenly started giving me SSL errors, despite my not having changed anything about them. I'd love general tips on how to prevent this sort of thing from happening in the future. I “fixed” it by redeploying, but it felt a bit like turning it off and on again instead of understanding the cause. Thank you!
These happen when an app isn’t accepting requests from us. Normally that should fail health checks and trigger a replacement VM. But we’ve seen apps whose health checks keep passing even though we can’t get them to return responses, and those won’t get rescheduled.
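One way this happens is a check that only verifies a TCP connect while the app fails real requests. An HTTP check against an actual app endpoint catches that. A sketch in fly.toml form (this thread reads like a Fly.io app; the section and field names follow Fly’s services config, but the path and timings are illustrative assumptions you’d tune for your app):

```toml
# Illustrative only: an HTTP check that exercises the real request path,
# so a VM that stops returning responses fails the check and gets replaced.
# Path and timings are assumptions, not values from this thread.
[[services.http_checks]]
  interval = "15s"
  timeout  = "2s"
  method   = "get"
  path     = "/healthz"  # should go through the same code path real requests use
```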
This doesn’t necessarily require a change in the app. The VMs we run can move at any time. If someone launches a large VM, it could evict smaller VMs from a given set of hardware.
The best thing to do here is to run a minimum of 2 instances for failover. That gives you a buffer if an instance dies and isn’t coming back up properly, and it covers periods where we aren’t rescheduling the VM reliably.
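To put a rough number on that, here’s a back-of-the-envelope sketch (the 1% per-instance unavailability is a made-up figure, purely for illustration):

```python
# Back-of-the-envelope availability math for N redundant instances.
# Assumes instance failures are independent, which is optimistic if
# the instances share a host or region.

def downtime_fraction(p_down: float, instances: int) -> float:
    """Fraction of time ALL instances are down at once."""
    return p_down ** instances

p = 0.01  # made-up assumption: each instance is unavailable 1% of the time
single = downtime_fraction(p, 1)  # hard down whenever the one VM is down
double = downtime_fraction(p, 2)  # both must fail at the same time
print(f"1 instance: {single:.4%} downtime, 2 instances: {double:.4%}")
```

The independence assumption is the weak point: two VMs on the same hardware or in the same region can fail together, which is why spreading instances out matters as much as the count.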
Sometimes off-and-on-again is ok, especially with disposable VMs like these. But redundancy is good either way.
FYI, I dug into your app specifically. It looks like what happened was:
- Hosts in one region experienced a network disruption
- We migrated the VM to another region
- The host the VM was scheduled on got behind on launching VMs
- Some time later, the VM successfully launched on the new host
This is a definite bug in our scheduling; in most cases the VM would only have been down for a minute or two. I don’t think this will happen frequently, but it definitely can, and it’s a good reason to run extra VMs for redundancy.
Thanks Kurt! I really appreciate it!