Request: Automated "deploys are / are not healthy" check on the Fly status page

mikey · November 25, 2024, 11:37pm

Could Fly maintain a “deploys are/are not completely succeeding” health check on the status page?

I’m asking with some (current) frustration, but this has also been a repeated experience. There’s currently an issue on the status page called Degraded API Performance, but what it actually seems to mean is that we cannot deploy. A deploy requires numerous calls to this API, and apparently some of them are returning 5xx errors.

Random examples from retries in our deploy action:

Run superfly/flyctl-actions/setup-flyctl@master
  
Error: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.

Run flyctl deploy --image-label git-3ce0ed3 --build-arg CI_RELEASE=prod-v168-git-3ce0ed3 --config fly.prod.toml --strategy rolling --auto-confirm --remote-only --verbose 
==> Verifying app config
--> Verified app config
Validating fly.prod.toml
✓ Configuration is valid
Error: server returned a non-200 status code: 504

Error: failed retrieving current user: server returned a non-200 status code: 504
Error: Process completed with exit code 1.

I have been oncall throughout my career, and I’m sympathetic to the balance oncall responders must strike between just stating the facts and wasting time editorializing about what the impact could be. I don’t think Fly is intentionally downplaying the impact of the issue.

But our experience is that many dryly named and narrated outages at Fly, like “degraded API performance”, often have very blunt and immediate impact, which we only discover when we try to ship. There’s something in this update about “Corrosion”; while it’s cool to see technical details, in the moment I don’t really care what the name of the global state store is. I mostly just care (a) are my servers responsive, and (b) are deploys working.

So the concrete suggestion is: Run deploys continuously, from a variety of regions, and show whether they’re succeeding or not. I think it might tell a more complete story about what’s going on, and complement the work of oncall responders when there is an outage. Thanks!
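To make the suggestion a bit more concrete, here is a minimal sketch of what such a probe could look like. Everything in it is an assumption for illustration: the names `ProbeResult`, `run_probe`, and `summarize` are hypothetical, the region labels are whatever you choose to call your probe locations, and the command passed in is whatever deploy invocation you already use. It says nothing about how Fly's own monitoring works.

```python
import subprocess
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ProbeResult:
    region: str   # label for where the probe ran (hypothetical naming)
    ok: bool      # True if the command exited with status 0
    detail: str   # short human-readable reason


def run_probe(region: str, command: Sequence[str], timeout_s: float = 300.0) -> ProbeResult:
    """Run one probe command and report whether it exited cleanly."""
    try:
        proc = subprocess.run(
            list(command), capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return ProbeResult(region, False, "timed out")
    except OSError as exc:
        # The command could not be started at all (e.g. binary missing).
        return ProbeResult(region, False, f"could not start: {exc}")
    if proc.returncode == 0:
        return ProbeResult(region, True, "ok")
    # Keep only the last stderr line, which is usually the actual error.
    lines = proc.stderr.strip().splitlines()
    detail = lines[-1] if lines else f"exit code {proc.returncode}"
    return ProbeResult(region, False, detail)


def summarize(results: List[ProbeResult]) -> str:
    """Collapse per-region results into one status-page style line."""
    if not results:
        return "no data"
    failed = sorted(r.region for r in results if not r.ok)
    if not failed:
        return "deploys healthy"
    if len(failed) == len(results):
        return "deploys failing"
    return "deploys degraded: " + ", ".join(failed)
```

A scheduler in each location would call `run_probe` with its own label and deploy command against a throwaway app, then feed the collected results to `summarize`. The point of the design is that the status line is derived from real deploy attempts rather than from someone's interpretation of API metrics.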

sevenseacat · November 26, 2024, 7:05am

It’s more than just automated deploys though - it’s SSH connections, it’s some people’s whole apps, it’s everything that uses the API. You can’t even list apps using the CLI right now.

mikey · November 26, 2024, 5:08pm

I think you could generalize my idea to other types of probes, and also expose those on the status page.
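As a rough sketch of that generalization, assuming each probe is just a function that returns true or false: the names `run_probes` and `render_status`, the probe kinds, and the region labels below are all hypothetical, and a real probe body would wrap whatever check you care about (a deploy, listing apps, an SSH connection).

```python
from typing import Callable, Dict, List, Tuple

# A probe is any zero-argument callable that returns True on success.
Probe = Callable[[], bool]
Key = Tuple[str, str]  # (probe kind, region label)


def run_probes(probes: Dict[Key, Probe]) -> Dict[Key, bool]:
    """Run every probe; an exception counts as a failure."""
    results: Dict[Key, bool] = {}
    for key, probe in probes.items():
        try:
            results[key] = bool(probe())
        except Exception:
            results[key] = False
    return results


def render_status(results: Dict[Key, bool]) -> List[str]:
    """Produce one status line per probe kind, naming failing regions."""
    lines = []
    for kind in sorted({k for k, _ in results}):
        regions = [r for (k, r) in results if k == kind]
        failing = sorted(r for r in regions if not results[(kind, r)])
        if not failing:
            state = "operational"
        elif len(failing) == len(regions):
            state = "failing"
        else:
            state = "degraded in " + ", ".join(failing)
        lines.append(f"{kind}: {state}")
    return lines
```

Each line answers one user-facing question ("can I deploy?", "can I list apps?") instead of naming an internal component.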

mayailurus · November 26, 2024, 9:24pm

That really would be nice… As I understand it, the probes themselves do already exist to some extent, under the name “synthetic alerts”:

Andres shipped a first cut of a new synthetic monitoring system (“synthetics” is the cool-kid way of saying “actually making requests and seeing if they complete”, as opposed to watching metrics). We had some synthetic monitoring, but now we have substantially more, broken out into regions, particularly for the APIs reachable from flyctl, our CLI.

(Along with several other mentions over time, in the Infrastructure Log.)

And Fly.io seemed very open to eventually and gradually exposing such things in an automated way, back in a postmortem comment in the thread for the September major outage:

