Fly.io Machines Proxy Timeout

I am building a product on top of Fly.io Machines. We have a service that needs to access the Fly.io Machines API to launch, list, and delete machines, and we deploy that service in a Kubernetes cluster. I have been running into a somewhat odd issue where requests to the Fly.io API time out. We are running flyctl machines api-proxy in a sidecar container and routing requests to it. This works for a while, but quickly runs into issues. I suspected some form of rate limiting in the WireGuard proxy, so I created a separate pod just for the proxy. The same issue continued.

Since we are in a K8s cluster, it would be extremely inconvenient to set up WireGuard in a different form. Has anyone else run into similar problems? Is there any way we can debug this issue further? Machines are perfect for our use case, but this is a big deal-breaker for us.

I enabled debug logging and got the following errors:

flyio-proxy-0 flyctl-proxy DEBUG failed to connect to target: dial: lookup _api.internal. on fdaa:0:8241::3: read udp [fdaa:0:8241:a7b:177d:0:b:1e00]:23881: i/o timeout
flyio-proxy-0 flyctl-proxy DEBUG accepted new connection from: 127.0.0.1:39648

I also noticed that the timeouts seem to start when a connection is kept alive, which our Rust-based service does by default.

I might be wrong, but you really don’t want to rely on flyctl anything in prod.

If you want to reach your machine endpoint, a prudent thing would be to allocate it a Public (Anycast) IP address (IPv6 addresses are free!). To reach a specific machine instance behind that Anycast IP, you’d have to do a bit of request-gymnastics.

# one ip free per app?
flyctl ips allocate-v4 -a <machine-app-name>
# any number of them!
flyctl ips allocate-v6 -a <machine-app-name>

If you want public access to the machine locked down, consider HTTP basic auth if you’re using HTTP; if not, use some other form of authentication.

It seems like this advice is mainly for connecting to a machine itself, which I don’t have a problem with. The issue is that the management API is locked behind the same WireGuard proxy as normal machines. This design choice forces you to either run a full WireGuard proxy inside your management cluster, which I am not going to do because getting AWS’s CNI plugins to play nice with WireGuard sounds like some Lovecraftian nightmare, or use flyctl (or a similar user-space WireGuard proxy) as a key part of the infrastructure.

In any case, the issue doesn’t really appear to come from flyctl proxy itself. If you look at the error I sent above, it seems like there is some issue with resolving the Fly.io Machines API address (the lookup of _api.internal times out). I’m not sure what causes this.

Gotcha.

Btw, you can straight up hit the GraphQL (?) / REST (?) Machine APIs from within Machine VMs via _api.internal: Fly Machines Manager - #2 by kurt

True. It is up to Fly engineers to work out what causes it, given how frequently you report hitting it. I guess you can treat the API as unstable and address it with retries, graceful degradation, monitoring and alerts, workarounds, etc. That said, flyctl-proxy is running WireGuard (like you pointed out), and I don’t think it is intended for production use (I may be wrong, of course).
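
For what it’s worth, the retry part could look roughly like this in Rust, since your service is Rust based. This is only a sketch under assumptions: retry_with_backoff is a made-up helper (not part of flyctl or any Fly.io SDK), and the closure stands in for whatever HTTP call you make through the proxy. It uses only the standard library and doubles the wait after each failed attempt.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical helper: call `op` until it succeeds or `max_attempts` is reached.
/// `op` receives the 1-based attempt number. At least one attempt is always made.
fn retry_with_backoff<T, E, F>(
    max_attempts: u32,
    base_delay: Duration,
    mut op: F,
) -> Result<T, E>
where
    F: FnMut(u32) -> Result<T, E>,
{
    let mut attempt: u32 = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: hand the last error back to the caller.
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                // Wait 1x, 2x, 4x, ... the base delay before trying again.
                let factor = 2u32.saturating_pow(attempt - 1);
                sleep(base_delay.saturating_mul(factor));
                attempt += 1;
            }
        }
    }
}
```

You would wrap each Machines API request in something like retry_with_backoff(5, Duration::from_millis(200), |_| send_request()), where send_request is your own function. Retrying blindly is only safe for idempotent calls such as listing machines; for launch or delete you would probably want to check the machine state before retrying.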

Fwiw, Machines themselves have sharp edges. Just today a Machine VM was stuck in the replacing state and blocked all subsequent deploys for me. I had to get rid of it with -f: fly m remove <vm-id> -a <app> -f :person_shrugging: