(false alarm) Rate limit issues on machines API - our app is crashing all of a sudden

bkniffler · January 19, 2026, 10:11pm

Our fly.io app had been running great for 4 months. 20 minutes ago it just started crashing. 4 machines across 2 zones. No changes have been pushed over the last few months.

22:08:31
[PR03] could not find a good candidate within 1 attempts at load balancing. last error: [PM01] machines API returned an error: "rate limit exceeded"
22:08:31
[PR03] could not find a good candidate within 1 attempts at load balancing. last error: [PM01] machines API returned an error: "rate limit exceeded"
22:08:31
[PR03] could not find a good candidate within 1 attempts at load balancing. last error: [PM01] machines API returned an error: "rate limit exceeded"
22:08:31
Starting machine
22:08:31
[PM01] machines API returned an error: "rate limit exceeded"

The status monitor reports no general issues with fly.io.

halfer · January 19, 2026, 10:18pm

4 machines across 2 zones

What are the two regions?

20 minutes ago it just started crashing

Have you got some machine logs (e.g. from Grafana) to see why this might be?

halfer · January 19, 2026, 10:20pm

Ah, if this is the Machines API, are you creating a new machine via REST? What is the spec of the machine you’re creating? Do you have a set of region fallbacks, so that if your first preference is not available, you can try the second, etc?
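
To illustrate what I mean by region fallbacks, here is a minimal sketch, not a definitive implementation. It assumes you already have some function that creates a machine in a given region; `create_machine` and `RegionUnavailable` are hypothetical names for this example and are not part of any Fly.io client library.

```python
from typing import Callable, Sequence, Tuple


class RegionUnavailable(Exception):
    """Hypothetical error: the given region cannot host the machine right now."""


def create_with_fallback(
    create_machine: Callable[[str], str],
    regions: Sequence[str],
) -> Tuple[str, str]:
    """Try each region in preference order and return (region, machine_id).

    `create_machine` is an assumed caller-supplied function that creates a
    machine in one region and raises RegionUnavailable if it cannot.
    """
    errors = {}
    for region in regions:
        try:
            return region, create_machine(region)
        except RegionUnavailable as exc:
            # Remember why this region failed, then move on to the next one.
            errors[region] = str(exc)
    # Every preference failed: surface all the reasons instead of only the last.
    raise RuntimeError(f"no region could host the machine: {errors}")
```

Collecting the per-region errors matters here: if every region fails for the same reason, the real cause is probably not regional capacity.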

bkniffler · January 19, 2026, 10:21pm

Hey @halfer, thanks for getting back! Actually, these logs were covering up the root cause: an expired secret.

Zones were AMS and FRA. Admittedly, fly.io has been running perfectly fine and it's our own fault.

halfer · January 19, 2026, 10:22pm

Noice! Maybe a candidate for additional root-cause monitoring on that machine creation code…

bkniffler · January 19, 2026, 10:54pm

Absolutely, I’m going to introduce some proper monitoring & alerts, but most importantly an automated secret rotation system.
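
As a rough sketch of the alerting part, assuming we keep our own record of when each secret expires (the secret names and the 14-day window below are made up for illustration, and nothing here reads from fly.io):

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def secrets_needing_rotation(
    expiries: Dict[str, datetime],
    now: datetime,
    warn_within: timedelta = timedelta(days=14),
) -> List[str]:
    """Return names of secrets that have expired or expire within `warn_within`.

    `expiries` is an assumed mapping of secret name to expiry time that the
    app maintains itself.
    """
    return sorted(
        name
        for name, expires_at in expiries.items()
        if expires_at - now <= warn_within
    )


# Example with hypothetical secrets, relative to a fixed point in time.
now = datetime(2026, 1, 19, tzinfo=timezone.utc)
expiries = {
    "API_TOKEN": now + timedelta(days=3),     # expiring soon
    "DB_PASSWORD": now + timedelta(days=90),  # fine for now
    "OLD_KEY": now - timedelta(days=1),       # already expired
}
due = secrets_needing_rotation(expiries, now)
```

Running a check like this on a schedule and alerting when the list is non-empty would have flagged the expiry before the machines started failing.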