Check machine creation’s success rate on status.flyio.net

We’ve added the machine creation success rate to status.flyio.net, our public status page. We plan to add more granular metrics to the status page and eventually retire the “API Success Rate” metric.

Why are we doing this? So that you have more specific information about what’s actually happening with your machines! Combining all of our Machine APIs into one “API Success Rate” made that one metric pretty muddy and less actionable. (We’re aiming for “less muddy” and very actionable.)

You may notice that machine creation’s success rate is currently worse than the “API Success Rate.” Don’t be alarmed! Nothing’s changed. The existing “API Success Rate” is helpful, but it doesn’t include information that you probably want to know! For instance:

  • It doesn’t include capacity issues. Now, don’t get us wrong; most of our regions are not at capacity. When you run into capacity issues, they’re mostly caused by overcrowded hosts (and we’re actively working on platform features that will reduce this).
  • It does include lots of other APIs. Different APIs have different success rates, since some operations are more complex than others (and thus more likely to fail). For example, creating a new machine is much more complex than getting machine information. By separating the machine creation metric from the APIs that handle other, less complex operations, we’re giving you a window into which pieces you need to pay attention to when things go sideways.
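To make the “muddy” point concrete, here is a minimal sketch of how a combined rate can hide a struggling operation. The request counts and the API names (`get_machine`, `create_machine`) are made up for illustration; they are not real Fly.io traffic numbers or the actual way the status page computes its metrics.

```python
# Hypothetical request counts, purely illustrative (not real Fly.io data).
calls = {
    "get_machine": {"total": 9_000, "ok": 8_995},    # simple, high-volume reads
    "create_machine": {"total": 1_000, "ok": 900},   # complex, lower-volume creates
}


def success_rate(ok: int, total: int) -> float:
    """Return the success rate as a percentage."""
    return 100.0 * ok / total


# Success rate of each API on its own.
per_api = {name: success_rate(c["ok"], c["total"]) for name, c in calls.items()}

# Success rate when every call is pooled into a single metric.
combined = success_rate(
    sum(c["ok"] for c in calls.values()),
    sum(c["total"] for c in calls.values()),
)

for name, rate in per_api.items():
    print(f"{name}: {rate:.2f}%")
print(f"combined: {combined:.2f}%")
```

With these made-up numbers, the pooled figure sits just under 99% while machine creation alone is at 90%, because the high-volume, simple calls dominate the average. That gap is what a separate machine creation metric is meant to expose.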

Let us know if you have any comments or suggestions!

What’s the target value for this metric?

I see values as low as 86% at some points, which seems alarmingly low if the metric applies to machines created across all regions.

The reason I ask is that we’ve been seeing intermittent 500s when using the machines API. Issues like this have happened 4-5 times in the past, and almost every time I’ve had to write in to support with details before a status update was posted. In the most recent instance, we’ve seen ~50 500s in 24 hours on the machines API across multiple apps. The “API Success Rate” says 100%, but the % machines created graph does show multiple blips today.

(Appreciate the effort to add more information to the status page!)

The metric applies to machines created across all regions. Our target is 100%, but we are not there yet.

Let me check your support tickets. In addition to improving the success rate, I want to make our error messages clear and actionable.
