I am building a product on top of Fly.io machines. We have a service that needs to access the Fly.io Machines API to launch, list, and delete machines. We deploy that service in a Kubernetes cluster. I have been running into a somewhat odd issue where requests to the Fly.io API time out. We are running flyctl machines api-proxy in a sidecar container and routing requests to it. This works for a while, but quickly runs into issues. I suspected some form of rate limiting in the WireGuard proxy, so I created a separate pod just for the proxy. The same issue continued.
Since we are in a K8s cluster, it would be extremely inconvenient to set up WireGuard in a different form. Has anyone else run into similar problems? Is there any way we can debug this issue further? Machines are perfect for our use case, but this is a big deal-breaker for us.
I enabled debug logging and got the following errors:
flyio-proxy-0 flyctl-proxy DEBUG failed to connect to target: dial: lookup _api.internal. on fdaa:0:8241::3: read udp [fdaa:0:8241:a7b:177d:0:b:1e00]:23881: i/o timeout
flyio-proxy-0 flyctl-proxy DEBUG accepted new connection from: 127.0.0.1:39648
I also noticed that the timeouts seem to start when a connection is kept alive, which our Rust-based service does by default.
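One way to test the keep-alive hypothesis is to send each request over a fresh connection and see whether the timeouts go away. The sketch below is only an illustration in Python, not our actual Rust client. It assumes the proxy listens on 127.0.0.1:4280 and that the machines list endpoint is /v1/apps/<app>/machines; both are assumptions to adjust for your setup, and list_machines and build_headers are hypothetical helper names.

```python
import http.client
import json


def build_headers(token):
    """Headers for a one-shot request; 'Connection: close' opts out of keep-alive."""
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "application/json",
        "Connection": "close",
    }


def list_machines(app_name, token, host="127.0.0.1", port=4280, timeout=10.0):
    """List machines over a fresh TCP connection that is closed afterwards.

    host, port, and the URL path are assumptions about the local proxy setup.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request(
            "GET",
            f"/v1/apps/{app_name}/machines",
            headers=build_headers(token),
        )
        response = conn.getresponse()
        body = response.read()
        if response.status != 200:
            raise RuntimeError(f"unexpected status {response.status}: {body[:200]!r}")
        return json.loads(body)
    finally:
        # Always drop the connection so nothing is reused between calls.
        conn.close()
```

If requests made this way stay healthy while pooled connections still time out, that would point at idle connections through the proxy rather than at the API itself.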
I might be wrong, but you really don’t want to rely on flyctl anything in prod.
If you want to reach your machine endpoint, a prudent thing would be to allocate it a Public (Anycast) IP address (IPv6 addresses are free!). To reach a specific machine instance behind that Anycast IP, you’d have to do a bit of request-gymnastics.
# one ip free per app?
flyctl ips allocate-v4 -a <machine-app-name>
# any number of them!
flyctl ips allocate-v6 -a <machine-app-name>
If you want public access to the machine locked down, consider HTTP basic auth if you’re using HTTP; if not, use some other form of authentication.
It seems like this advice is mainly for connecting to a machine itself, which I don’t have a problem with. The issue is that the management API is locked behind the same WireGuard proxy as normal machines. This design choice forces you to do one of two things. You can run a full WireGuard proxy inside your management cluster, which I am not going to do, because getting AWS’s CNI plugins to play nice with WireGuard sounds like some Lovecraftian nightmare. Or you can use flyctl (or a similar user-space WireGuard proxy) as a key part of the infrastructure.
In any case, the issue doesn’t really appear to come from flyctl proxy. If you look at the error I posted above, the DNS lookup for _api.internal is what times out, so the problem seems to be with resolving the Fly.io Machines API itself. I’m not sure what causes this.
Btw, you can straight up hit the GraphQL (?) / REST (?) Machine APIs from within Machine VMs via _api.internal: Fly Machines Manager - #2 by kurt
True. It is up to the Fly engineers to figure out what causes these failures, given how frequently you report encountering them. I guess you can treat the API as unstable and address it with retries, graceful degradation, monitoring and alerts, workarounds, etc. That said, flyctl-proxy is running WireGuard (like you pointed out), and I don’t think it is intended for production use (I may be wrong, of course).
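To make the retry suggestion concrete, here is a minimal sketch of retrying a flaky call with exponential backoff and jitter. It is an illustration under assumptions, not anything Fly ships: call_with_retries is a hypothetical helper, and which exceptions count as transient depends on your HTTP client.

```python
import random
import time


def call_with_retries(operation, attempts=5, base_delay=0.5, max_delay=8.0,
                      retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Run operation(), retrying transient errors with exponential backoff.

    retry_on lists the exception types treated as transient (an assumption;
    adjust it to whatever your client raises on timeouts).
    """
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Double the delay on each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add jitter so many clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

You would wrap each API call, for example call_with_retries(lambda: list_machines_somehow()), where list_machines_somehow stands in for your own request function. Retries only paper over the timeouts, so pairing them with monitoring and alerts still matters.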
Fwiw, Machines themselves have sharp edges. Just today a Machine VM was stuck in the replacing state and blocked all subsequent deploys for me. I had to get rid of it with:
fly m remove <vm-id> -a <app> -f