This issue mentions that api.machines.dev is publicly accessible (and I’ve confirmed it works), but the official docs still say “The Machines API endpoint requires a connection to your Fly private network”.
Is api.machines.dev stable and safe to use?
This probably refers to the Machines GraphQL endpoint: Public API for launching containers? - #2 by lubien
Overall, Machines (aka Fly Apps v2) themselves are in-preview for the most part, but yes, I’d expect this API to be “stable” as in “open for business”.
Yes it is. We haven’t decided how we’re going to document the difference yet, but the endpoint is just fine to use.
Note that it’s a proxy service between you and the internal machines API endpoint. So there’s one more moving part. It should be low maintenance, but the internal endpoint is best for maximum reliability.
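For anyone following along, the two endpoints being compared here can be targeted like so. This is a minimal sketch, not official usage: `my-app` is a placeholder, the token handling is illustrative, and `_api.internal:4280` is the internal Machines API address that is only reachable from inside your Fly private network (e.g. over WireGuard).

```python
import os
import urllib.request

# Public proxy endpoint, reachable from anywhere with a Fly API token.
PUBLIC_BASE = "https://api.machines.dev/v1"
# Internal endpoint: fewer moving parts, but only resolvable from
# inside your organization's private network.
INTERNAL_BASE = "http://_api.internal:4280/v1"

def list_machines(app_name: str, base: str = PUBLIC_BASE) -> urllib.request.Request:
    """Build (but don't send) a request listing machines for an app."""
    token = os.environ.get("FLY_API_TOKEN", "")
    return urllib.request.Request(
        f"{base}/apps/{app_name}/machines",
        headers={"Authorization": f"Bearer {token}"},
    )

req = list_machines("my-app")
print(req.full_url)  # https://api.machines.dev/v1/apps/my-app/machines
```

Both endpoints speak the same REST API, so switching between them is just a change of base URL.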
Got it, thanks!
Funnily enough, the reason we’re trying to use this is to improve reliability. Dumping some context in case it’s useful:
We’re testing fly.io with an application that spins up fly apps/machines based on user requests. We have 10% of our traffic pointed to fly while we test it out. Our API usage volume currently looks like this:
- ~50 requests/day to create apps/machines
- ~200 requests/day to start stopped machines
- ~50 requests/day to delete apps
The application that triggers these calls isn’t hosted on Fly, so it needs to reach the machines API from outside Fly.
Even with volume this low, we’ve been seeing a 5-10% error rate on API calls. Sometimes it’s a read timeout, sometimes an open timeout, and occasionally even a straight-up 500, though those aren’t very common. We do retry on errors a few times (the 5-10% figure is in spite of our retry logic). We aren’t reusing HTTP connections.
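For what it’s worth, our retry logic is roughly this shape. This is a simplified sketch, not our exact code: the delays, attempt count, and the `flaky` function are illustrative.

```python
import random
import time

def with_retries(fn, attempts=3, base_delay=0.5):
    """Call fn(), retrying on exception with jittered exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the error
            # back off: 0.5x-1.5x of base_delay * 2^attempt
            time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random()))

# Simulate a flaky API call: fails twice with a timeout, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("read timeout")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # ok
```

Even with this in place we still land at the error rates above, which is why we suspect the path to the API rather than our client.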
Initially, I thought the reliability issues were because we were using the proxy offered by flyctl, which isn’t intended for production use. So I switched to hosting a simple Caddy instance inside our organization instead. That doesn’t seem to have made much of a dent. I also tried upgrading the Caddy instance from the shared CPU tier to dedicated, but that didn’t help either.
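In case it helps anyone reproduce the setup: the Caddy instance was essentially just a reverse proxy to the internal Machines API endpoint, something like this Caddyfile (a sketch; the listen port is arbitrary, and auth/TLS handling is omitted):

```
:8080 {
    # Forward everything to the internal Machines API endpoint,
    # resolvable only from inside the Fly private network.
    reverse_proxy _api.internal:4280
}
```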
I’m pretty confident the HTTP reliability problems aren’t related to the application making the requests, because 90% of our traffic still uses self-hosted Nomad instances (roughly the same number of outgoing API calls), and we’ve consistently seen a ~0% error rate there.
I found one related post that mentions timeouts with the Machines API: Fly.io Machines Proxy Timeout - #5 by ignoramous.
I was hoping that switching to api.machines.dev would help with these problems. I’ll give it a whirl and report back on what I find!
The api.machines.dev endpoint is effectively the same as your Caddy instance.
Can you share a 500 you got from the API endpoint directly? Also more details on API timeouts would be great. These should not be happening so the more you can tell us, the more we can investigate.
We’ll need to know which specific API calls you are making and which regions you’re doing it from, too.
The 500s are few and far between; here are some examples:
01GGF9Z5HJP3QWZ2HV1A1H08HS-iad: Oct 28, 2022 1:44:49 PM BST
01GGFAY1NW1GM3NP8WV21G8MYF-iad: Oct 28, 2022 2:01:41 PM BST
- … more
Actually, looks like all the 500s coincide with the downtime incident on Oct 28, so that explains it.
I’ll gather more data on the timeouts and post it soon.