Infrastructure Log for Jan 12–18 - blue-green failures, health checks, 6PN address stability, decentralization

mayailurus · January 23, 2025, 11:28pm

The Infrastructure Log tends to fly under people’s radars, which is unfortunate, since it’s a good way to learn more about the details and root causes of past incidents—as well as to hear about many upcoming changes in advance.

https://fly.io/infra-log/2025-01-18/

The narrow but intense box there on Thursday is another global(?) failure of the Machines API, this time due to a CI/CD glitch.

There were also regional failures of blue-green deployments, early on the previous day.

Definitely read the link at the top for official details.

Behind the scenes, efforts included the following, outlined here non-exhaustively…

Measures being taken to prevent future problems

- Decentralization...
  - Of Corrosion, the global metadata database. *
  - Of health checks, moving away from Consul to a newer, custom (and presumably regionalized) setup. *
- Elimination of Corrosion slowness stemming from I/O contention with customers(!).
- Preemptive—as opposed to reactive—trawls throughout the infrastructure...
  - Comprehensive review of recurring errors on physical host machines and their respective individual management daemons...
    - "[F]ound a bunch of Fly Machines in inconsistent states".
    - Swap glitches.
    - Metrics for dead Machines.
  - Synthetic alerts (proactive probes of functionality, rather than passive). *
  - Deliberate reboots to avoid an AMD firmware bug.
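
To make the "synthetic alerts" item concrete, here is a minimal sketch of the general idea: a probe actively exercises some functionality on a schedule, and an alert fires only after several consecutive failures. This is my own illustration, not Fly.io's implementation; the names `run_probe` and `should_alert` and the threshold of 3 are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ProbeResult:
    ok: bool
    detail: str


def run_probe(action: Callable[[], None]) -> ProbeResult:
    """Actively exercise one piece of functionality and record the outcome."""
    try:
        action()  # hypothetical action, e.g. an end-to-end API round trip
        return ProbeResult(ok=True, detail="ok")
    except Exception as exc:  # any exception counts as a failed probe
        return ProbeResult(ok=False, detail=f"{type(exc).__name__}: {exc}")


def should_alert(history: List[ProbeResult], threshold: int = 3) -> bool:
    """Alert only when the last `threshold` probes all failed (assumed policy)."""
    if len(history) < threshold:
        return False
    return all(not result.ok for result in history[-threshold:])
```

The difference from passive monitoring is that the probe generates its own traffic, so a failure can be noticed even when no customer happens to be exercising that code path.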

Efficiency and capacity improvements

- Periodic garbage collection on physical host machines' local databases.
- Silencing metrics for dead Machines (above) also improves Prometheus capacity (IIUC).

New features

- 6PN address stability after migrations. As the Log points out, this sounds like an oxymoron, so it’ll be interesting to hear more...
- CPU throttling on the low-budget shared class, which was originally intended for things like HTTP servers, as opposed to background workers, media encoders, etc. *
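
For anyone wondering what throttling on a shared class means in practice, below is a toy credit-based model. It is only a sketch of the general technique, not Fly.io's actual scheduler: the baseline of 1/16 of a core (6.25%, in line with the "6%" figure for the shared class), the `max_balance` cap, and the one-second tick are all assumptions of mine.

```python
from typing import List


def simulate(demand: List[float],
             baseline: float = 1 / 16,   # assumed baseline share of one core
             max_balance: float = 5.0,   # assumed cap on banked CPU-seconds
             balance: float = 0.0) -> List[float]:
    """Return the CPU fraction granted in each one-second tick.

    `demand` holds the fraction of a core the workload wants per tick.
    Credit accrues at `baseline` per tick; usage is limited to the credit
    currently banked, so an always-busy workload settles at the baseline.
    """
    granted = []
    for want in demand:
        balance = min(max_balance, balance + baseline)  # accrue this tick's credit
        use = min(want, 1.0, balance)                   # cannot exceed a core or the bank
        balance -= use
        granted.append(use)
    return granted
```

In this model, `simulate([1.0] * 3)` grants 0.0625 of a core in every tick, while eight idle ticks followed by two busy ones yield a single burst of 0.5625 and then a drop back to 0.0625. That matches the intuition behind the feature: bursty HTTP servers bank credit while idle, whereas continuously busy background workers and encoders get held to the baseline.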

No word yet

- Intermediate flex class of vCPUs for people who don't need a full 100% yet are feeling pinched now by the 6% of the shared class.‡
- More convenient, floating, app-level egress IPs.‡
- Automated display of system-wide metrics and probe outcomes on the status page.‡

*Work ongoing from previously.
‡Fly.io gave encouraging responses to these earlier, but they aren’t necessarily in progress.

Caveat: The above are just my own interpretations and paraphrases, as a fellow user.
