(false alarm) Rate limit issues on machines API - our app is crashing all of a sudden

Our fly.io app was running great for 4 months. 20 minutes ago it just started crashing. 4 machines across 2 zones. No changes have been pushed in recent months.

22:08:31
[PR03] could not find a good candidate within 1 attempts at load balancing. last error: [PM01] machines API returned an error: "rate limit exceeded"
22:08:31
[PR03] could not find a good candidate within 1 attempts at load balancing. last error: [PM01] machines API returned an error: "rate limit exceeded"
22:08:31
[PR03] could not find a good candidate within 1 attempts at load balancing. last error: [PM01] machines API returned an error: "rate limit exceeded"
22:08:31
Starting machine
22:08:31
[PM01] machines API returned an error: "rate limit exceeded"

The status monitor reports no general issues with fly.io.

4 machines across 2 zones

What are the two regions?

20 minutes ago it just started crashing

Have you got some machine logs (e.g. from Grafana) to see why this might be?

Ah, if this is the Machines API, are you creating a new machine via REST? What is the spec of the machine you’re creating? Do you have a set of region fallbacks, so that if your first preference is not available, you can try the second, and so on?
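
To illustrate what I mean by region fallbacks, here is a rough sketch in Python. It assumes the public Machines API host (`api.machines.dev`) and bearer-token auth; the helper names `create_machine` and `create_with_fallback` are made up for this example, and `config` is whatever machine spec you already send. Treat it as a starting point, not a drop-in implementation.

```python
import json
import urllib.error
import urllib.request

API_BASE = "https://api.machines.dev/v1"  # assumed: public Machines API host


def create_machine(app, region, config, token):
    """POST one create-machine request for a single region; return parsed JSON."""
    req = urllib.request.Request(
        f"{API_BASE}/apps/{app}/machines",
        data=json.dumps({"region": region, "config": config}).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def create_with_fallback(regions, create_fn):
    """Try each region in order; return (region, result) for the first success."""
    errors = {}
    for region in regions:
        try:
            return region, create_fn(region)
        except urllib.error.HTTPError as exc:
            # Auth failures will not be fixed by another region, so surface them.
            if exc.code in (401, 403):
                raise
            errors[region] = exc
        except OSError as exc:  # URLError, timeouts, connection errors
            errors[region] = exc
    raise RuntimeError(f"all regions failed: {errors}")
```

You would call it with something like `create_with_fallback(["ams", "fra"], lambda r: create_machine(app, r, config, token))`. Passing the create call in as a function keeps the fallback logic easy to test without touching the network.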

Hey @halfer, thanks for getting back! Actually, these logs were masking the root cause: an expired secret.

The zones were AMS and FRA. Admittedly, fly.io has been running perfectly fine and it's our own fault.

Noice! Maybe a candidate for additional root-cause monitoring on that machine creation code…

Absolutely, I’m going to introduce some proper monitoring & alerts, but most importantly an automated secret rotation system
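
As a first step, the alerting side could be as simple as the sketch below. It assumes we track each secret's expiry date ourselves; the function name, the secret names, and the 14-day window are placeholders for illustration, not anything fly.io provides.

```python
from datetime import datetime, timedelta, timezone


def secrets_needing_rotation(expiries, now=None, warn_within=timedelta(days=14)):
    """Return names of secrets that are expired or expire within warn_within.

    expiries maps a secret name to its timezone-aware expiry datetime.
    Results are ordered soonest-expiring first.
    """
    now = now or datetime.now(timezone.utc)
    due = [(exp, name) for name, exp in expiries.items() if exp - now <= warn_within]
    return [name for _, name in sorted(due)]


# Hypothetical inventory of secrets and their expiry dates.
expiries = {
    "UPSTREAM_API_TOKEN": datetime(2025, 1, 10, tzinfo=timezone.utc),
    "DB_PASSWORD": datetime(2025, 6, 1, tzinfo=timezone.utc),
}
check_time = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(secrets_needing_rotation(expiries, now=check_time))
```

Run on a schedule, the returned list can feed an alert well before a secret actually expires, and the same list can later drive the automated rotation.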