Got it, thanks!
Funnily enough, the reason we’re trying to use this is to improve reliability. Dumping some context in case it’s useful:
We’re testing Fly.io with an application that spins up Fly apps/machines based on user requests. We have 10% of our traffic pointed at Fly while we test it out. Our API usage volume currently looks like this:
- ~50 requests/day to create apps/machines
- ~200 requests/day to start stopped machines
- ~50 requests/day to delete apps
The application that triggers these calls isn’t hosted on Fly, so it needs to reach the Machines API from outside Fly.
Even with volume this low, we’ve been seeing a 5-10% error rate on API calls. Sometimes it’s a read timeout, sometimes an open timeout, and occasionally a straight-up 500, though those aren’t very common. We do retry errors a few times (the 5-10% figure is in spite of our retry logic). We aren’t re-using HTTP connections between requests.
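For concreteness, here’s a rough sketch of the shape of these calls with the two things I suspect we should be doing anyway: a single shared HTTP client (so connections actually get reused) and retries with a bit of backoff. The base URL, token env var, app name, and retry counts below are placeholders, not our real config:

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

// One shared client per process; Go's default transport keeps idle
// connections alive, so requests can reuse TCP/TLS connections.
var apiClient = &http.Client{Timeout: 30 * time.Second}

// startMachine hits the Machines API start endpoint and retries
// transient failures (transport errors and 5xx) with linear backoff.
// baseURL points at whatever fronts the API (a proxy or api.machines.dev).
func startMachine(baseURL, token, app, machineID string) error {
	url := fmt.Sprintf("%s/v1/apps/%s/machines/%s/start", baseURL, app, machineID)

	var lastErr error
	for attempt := 1; attempt <= 3; attempt++ {
		req, err := http.NewRequest(http.MethodPost, url, nil)
		if err != nil {
			return err
		}
		req.Header.Set("Authorization", "Bearer "+token)

		resp, err := apiClient.Do(req)
		if err != nil {
			// Read/open timeouts and other transport errors land here.
			lastErr = err
		} else {
			resp.Body.Close()
			if resp.StatusCode < 500 {
				return nil // success, or a 4xx that retrying won't fix
			}
			lastErr = fmt.Errorf("server error: %s", resp.Status)
		}
		time.Sleep(time.Duration(attempt) * time.Second)
	}
	return lastErr
}

func main() {
	// Placeholders: swap in the real proxy address, app name, and machine ID.
	err := startMachine("http://localhost:4280", os.Getenv("FLY_API_TOKEN"), "example-app", "machine-id")
	if err != nil {
		fmt.Println("start failed:", err)
	}
}
```

Sharing one client per process should at least rule out per-request TCP/TLS handshakes as a contributor to the timeouts.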
Initially, I thought the reliability issues were because we were using the proxy offered by flyctl, which isn’t intended for production use. So I switched to hosting a simple Caddy instance inside our organization instead. That doesn’t seem to have made much of a dent. I also tried upgrading the Caddy machine from the shared-CPU tier to a dedicated one, but that didn’t help either.
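For completeness, the Caddy setup is nothing fancy; it’s essentially a one-line reverse proxy running inside our Fly organization, something like this (assuming the internal Machines API endpoint is still _api.internal:4280):

```text
# Caddyfile sketch: front the org-internal Machines API for our external app.
# Assumes Caddy runs inside the Fly private network.
:4280 {
	reverse_proxy _api.internal:4280
}
```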
I’m pretty confident the HTTP reliability problems aren’t related to the application making the requests, because 90% of our traffic still goes to self-hosted Nomad instances (with roughly the same number of outgoing API calls), and we’ve consistently seen a ~0% error rate there.
I found one related post that mentions timeouts with the Machines API: Fly.io Machines Proxy Timeout - #5 by ignoramous.
I was hoping that switching to api.machines.dev would help with these problems. I’ll give it a whirl and report back on what I find!
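For anyone following along, the “whirl” will basically be a smoke test along these lines, then flipping the base URL in our client. The app name and token env var here are placeholders:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// Placeholder app name and token; lists machines via the public endpoint.
	req, err := http.NewRequest("GET", "https://api.machines.dev/v1/apps/example-app/machines", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("FLY_API_TOKEN"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}
```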