Requesting recommendations for making Fly less flaky for CI

We currently use Fly for CI, creating a new app per PR, but we’re running into quite a lot of CI flakiness.

Specifically, over 50% of our CI runs in the last 24 hours have needed a manual re-run due to one of the following errors:

  • WARN Remote builder did not start in time. Check remote builder logs
  • Error Post "https://api.fly.io/graphql": http2: server sent GOAWAY and closed the connection
  • flyctl deploy just gets stuck for a very long time (>20 minutes).

Is there an official way to wrap flyctl for more consistent CI, ideally with some combination of retries and timeouts?
I’m also open to suggestions from anyone else using flyctl in CI on how best to wrap it.


Some of these remote builder issues should have been fixed, but you may want to consider building images locally in the meantime:

flyctl deploy <args> --local-only

You could also build images elsewhere, push them to Fly’s registry, and then deploy the pushed image: Deploying infrastructure - #2 by kurt | Deployment in CI issue - #3 by ignoramous
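
For reference, here’s a minimal sketch of that flow, assuming an app named my-app, a Docker daemon available in CI, and a GIT_SHA variable for tagging (the app name and tag variable are placeholders; the flyctl and docker commands themselves are standard):

# Authenticate the Docker client against Fly's registry
flyctl auth docker

# Build and push the image to registry.fly.io
docker build -t registry.fly.io/my-app:$GIT_SHA .
docker push registry.fly.io/my-app:$GIT_SHA

# Deploy the already-pushed image, skipping the remote builder entirely
flyctl deploy --app my-app --image registry.fly.io/my-app:$GIT_SHA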

I don’t think this approach is super helpful:

  • Building locally is a bit of a pain, and we’d like to use remote builds since Fly provides them.
  • Remote builder issues are transient (i.e. immediately retrying often works), and are actually only a small portion of our total failures.

I think this approach would take significant effort to figure out where else we’d build our image, and would still not solve the majority of the flakiness.

I think I might default to something like for i in {1..3}; do timeout 600 flyctl deploy <args> && break || sleep 15; done, but I’m interested in any other proposals in that genre: ones that don’t try to work around a single source of flakiness, but rather provide some limited blanket resilience against current and future transient errors from Fly.
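
For anyone copying this later, here’s that loop fleshed out slightly as a wrapper script. It’s just a sketch: the 3 attempts, 600-second timeout, and 15-second sleep are the same arbitrary numbers as above, and it assumes GNU coreutils’ timeout is available in the CI image.

#!/usr/bin/env bash
# Blanket retry wrapper around flyctl deploy: each attempt is capped at
# 10 minutes, with up to 3 attempts and a short pause between retries.
set -u

attempts=3
for i in $(seq 1 "$attempts"); do
  if timeout 600 flyctl deploy "$@"; then
    exit 0
  fi
  echo "flyctl deploy attempt $i failed or timed out; retrying in 15s" >&2
  sleep 15
done

echo "flyctl deploy failed after $attempts attempts" >&2
exit 1

You’d invoke it with whatever flyctl deploy arguments you already use, e.g. ./deploy-with-retries.sh --remote-only --app my-pr-app (the script and app names are made up; the flags are standard flyctl ones).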
