Could Fly maintain a “deploys are/are not completely succeeding” health check on the status page?
I’m asking with some (current) frustration, but this has also been a repeated experience. There’s currently an issue on the status page called “Degraded API Performance”, but what it actually seems to mean is that we cannot deploy. A deploy requires numerous calls to this API, and apparently some of them are returning 5xx errors.
Random examples from retries in our deploy action:
Run superfly/flyctl-actions/setup-flyctl@master
Error: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
Run flyctl deploy --image-label git-3ce0ed3 --build-arg CI_RELEASE=prod-v168-git-3ce0ed3 --config fly.prod.toml --strategy rolling --auto-confirm --remote-only --verbose
==> Verifying app config
--> Verified app config
Validating fly.prod.toml
✓ Configuration is valid
Error: server returned a non-200 status code: 504
Error: failed retrieving current user: server returned a non-200 status code: 504
Error: Process completed with exit code 1.
I have been on call throughout my career, and I’m sympathetic to the balance on-call responders must strike between just stating the facts and spending time editorializing about what the impact could be. I don’t think Fly is intentionally downplaying the impact of the issue.
But our experience is that many dryly named and narrated outages at Fly, like “degraded API performance”, have very blunt and immediate impact, which we only discover when we try to ship. There’s something in this update about “Corrosion”; while it’s cool to see technical details, in the moment I don’t really care what the name of the global state store is. I mostly just care (a) are my servers responsive, and (b) are deploys working.
So the concrete suggestion is: Run deploys continuously, from a variety of regions, and show whether they’re succeeding or not. I think it might tell a more complete story about what’s going on, and complement the work of oncall responders when there is an outage. Thanks!
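To make the suggestion concrete, here is a rough sketch of the kind of probe I have in mind. It is only an illustration under my own assumptions, not a description of how Fly works internally: `run_probe`, `summarize`, `ProbeResult`, and the region labels are names I made up, and the command to run is passed in by the caller (for example, a `flyctl deploy` of a small canary app, started from a runner in each region).

```python
import subprocess
import time
from dataclasses import dataclass


@dataclass
class ProbeResult:
    region: str     # label for where the probe ran (hypothetical naming)
    ok: bool        # True only if the command exited with status 0
    seconds: float  # wall-clock duration of the attempt
    detail: str     # last line of output, e.g. the error message


def run_probe(region, command, timeout=600):
    """Run one canary deploy command and record whether it exited cleanly."""
    start = time.monotonic()
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
        ok = proc.returncode == 0
        # Prefer stderr, since that is where the errors in our logs showed up.
        lines = (proc.stderr or proc.stdout).strip().splitlines()
        detail = lines[-1] if lines else ""
    except subprocess.TimeoutExpired:
        ok, detail = False, f"timed out after {timeout}s"
    except OSError as exc:  # e.g. the binary is not installed on the runner
        ok, detail = False, str(exc)
    return ProbeResult(region, ok, time.monotonic() - start, detail)


def summarize(results):
    """Collapse per-region results into one status-page style line."""
    if not results:
        return "unknown"
    failed = [r for r in results if not r.ok]
    if not failed:
        return "deploys succeeding"
    if len(failed) == len(results):
        return "deploys failing"
    return "deploys partially failing"
```

Each runner would call `run_probe` on a schedule and report its result to wherever the status page reads from; `summarize` is the one-line answer I would want to see during an incident.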