I am building a product on top of Fly.io machines. We have a service that needs to access the Fly.io Machine API to launch, list, and delete machines. We deploy that service in a Kubernetes cluster. I have been running into a somewhat odd issue where requests to the Fly.io API time out. We are running flyctl machines api-proxy in a sidecar container and routing requests to it. This works for a time, but quickly runs into issues. I suspected that there was some form of rate limiting with the WireGuard proxy, so I created a separate pod just for the proxy. The same issue continued.
Since we are in a K8s cluster, it would be extremely inconvenient to set up WireGuard in a different form. Has anyone else run into similar problems? Is there any way we can debug this issue further? Machines are perfect for our use case, but this is a big deal-breaker for us.
I might be wrong, but you really don’t want to rely on flyctl anything in prod.
If you want to reach your machine endpoint, a prudent thing would be to allocate it a Public (Anycast) IP address (IPv6 addresses are free!). To reach a specific machine instance behind that Anycast IP, you’d have to do a bit of request-gymnastics.
# one ip free per app?
flyctl ips allocate-v4 -a <machine-app-name>
# any number of them!
flyctl ips allocate-v6 -a <machine-app-name>
If you want public access to the machine locked down, consider HTTP basic auth if you’re using HTTP; if not, use some other form of authentication.
It seems like this advice is mainly for connecting to a machine itself, which I don’t have a problem with. The issue is that the management API is locked behind the same WireGuard proxy as normal machines. This design choice forces you to either run a full WireGuard proxy inside your management cluster, which I am not going to do, because getting AWS’s CNI plugins to play nice with WireGuard sounds like some Lovecraftian nightmare, or use flyctl (or a similar user-space WireGuard proxy) as a key part of the infrastructure.
In any case, the issue doesn’t really appear to come entirely from the flyctl proxy. If you look at the error I sent above, it seems like there is some issue with resolving the Fly.io machine API itself. I’m not sure what causes this.
True. It is up to the Fly engineers to work out what causes it, given how frequently you report running into it. I guess you can treat the API as unstable and address it with retries, graceful degradation, monitoring and alerts, workarounds, etc. That said, flyctl-proxy is running WireGuard (like you pointed out), and I don’t think it is intended for production use (I may be wrong, of course).
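A minimal sketch of the retry part, in case it helps. This is not anything Fly-specific: call_with_retries and its parameters are names I made up, and it assumes your client raises TimeoutError or ConnectionError on a failed request (pass your own exception types via retry_on if it raises something else). It retries with capped exponential backoff plus jitter and re-raises once it runs out of attempts, so monitoring still sees the failure.

```python
import random
import time


def call_with_retries(request_fn, max_attempts=5, base_delay=0.5, max_delay=8.0,
                      retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call request_fn(), retrying transient failures with exponential backoff.

    All names and defaults here are illustrative assumptions, not part of
    any Fly.io tooling.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return request_fn()
        except retry_on:
            if attempt == max_attempts:
                # Out of attempts: let the error propagate so alerts can fire.
                raise
            # Backoff ceiling doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random time between 0 and the ceiling.
            sleep(random.uniform(0, ceiling))
```

You would wrap whatever function issues the request through the proxy, e.g. call_with_retries(list_machines), where list_machines is a hypothetical function of your own. The jitter is there so that several pods hitting the same failure don't all retry at the same instant.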
Fwiw, Machines themselves have sharp edges. Just today a Machine VM was stuck in the replacing state and blocked all subsequent deploys for me. I had to get rid of it with -f: fly m remove <vm-id> -a <app> -f