Infrastructure Log for Jan 12–18 - blue-green failures, health checks, 6PN address stability, decentralization

The Infrastructure Log tends to fly under people’s radars, which is unfortunate, since it’s a good way to learn more about the details and root causes of past incidents—as well as to hear about many upcoming changes in advance.

https://fly.io/infra-log/2025-01-18/

The narrow but intense box there on Thursday is another global(?) failure of the Machines API, this time due to a CI/CD glitch.

There were also regional failures of blue-green deployments, early on the previous day.

Definitely read the link at the top for official details.


Behind the scenes, efforts included the following, outlined here non-exhaustively…

Measures being taken to prevent future problems

  • Decentralization...
    • Of Corrosion, the global metadata database. *
    • Of health checks, moving away from Consul to a newer, custom (and presumably regionalized) setup. *
  • Elimination of Corrosion slowness stemming from I/O contention with customer workloads(!).

  • Preemptive—as opposed to reactive—trawls throughout the infrastructure...
    • Comprehensive review of recurring errors on physical host machines and their respective individual management daemons...
      • "[F]ound a bunch of Fly Machines in inconsistent states".
      • Swap glitches.
      • Metrics for dead Machines.
    • Synthetic alerts (proactive probes of functionality, rather than passive). *
    • Deliberate reboots to avoid an AMD firmware bug.
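
For anyone unfamiliar with the term, here is a rough sketch of what a synthetic alert amounts to: something actively exercises a feature on a schedule and raises an alarm after repeated failures, instead of waiting for passive metrics to look wrong. This is purely my own illustration, not Fly.io's implementation; `run_probe`, `should_alert`, the latency limit, and the three-failure threshold are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ProbeResult:
    name: str
    ok: bool
    latency_s: float
    error: Optional[str] = None


def run_probe(name: str, action: Callable[[], None],
              max_latency_s: float = 5.0) -> ProbeResult:
    """Actively exercise one piece of functionality and time it."""
    start = time.monotonic()
    try:
        # Hypothetical action: e.g. create, start, and destroy a test Machine.
        action()
    except Exception as exc:
        return ProbeResult(name, False, time.monotonic() - start, repr(exc))
    elapsed = time.monotonic() - start
    if elapsed > max_latency_s:
        # Succeeding too slowly is also treated as a failure.
        return ProbeResult(name, False, elapsed, "too slow")
    return ProbeResult(name, True, elapsed)


def should_alert(history: List[ProbeResult], threshold: int = 3) -> bool:
    """Alert only after `threshold` consecutive failures, to avoid flapping."""
    recent = history[-threshold:]
    return len(recent) == threshold and all(not r.ok for r in recent)
```

The point of the design is that the probe fails loudly even when nobody is using the feature at that moment, which passive error-rate metrics cannot do.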

Efficiency and capacity improvements

  • Periodic garbage collection on physical host machines' local databases.
  • Silencing metrics for dead Machines (above) also improves Prometheus capacity (IIUC).

New features

  • 6PN address stability after migrations. As the Log points out, this sounds like an oxymoron, so it’ll be interesting to hear more, 🐈‍⬛...
  • CPU throttling on the low-budget shared class, which was originally intended for things like HTTP servers rather than background workers, media encoders, etc. *

No word yet

*Work ongoing from earlier.
‡Fly.io gave encouraging responses to these earlier, but they aren’t necessarily in progress.


Caveat: The above are just my own interpretations and paraphrases, as a fellow user.
