Internal Monitoring Work

After investigating some recent outages, we discovered that our alerts were missing certain categories of 500 errors in our Machines API. To address this, we did a thorough audit of those alerts. The audit helped us identify and fix an issue, shaving 500ms off machine creation time (you can read more about it here), and it also highlighted some hosts that needed resource rebalancing to prevent capacity errors on machine start. We paired the alert audit with changes to our escalation policies so that machine issues are routed directly to our team. We’re hopeful this will help us fix your machine issues faster.
