Infrastructure Log for Dec 8–14 - global proxy deadlock, "burst" providers, decentralization

The Infrastructure Log is an excellent auxiliary resource that tends to fall off people’s radars, :leaves:, and since this latest one is more eventful than might be expected, I thought I’d draw a little attention during the forum’s off hours…

https://fly.io/infra-log/2024-12-14/

The narrow but intense box there on Thursday is a reprise of the “if let” global proxy deadlock, a (fortunately) briefer recurrence of the problem that caused September's major outage. This month, “watchdog safeguards” auto-rebooted the proxy, greatly reducing the impact.

Definitely read the link above for official details.
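
For anyone wondering how an "if let" can deadlock at all, here is a minimal sketch of the general Rust hazard. It is not Fly.io's code: the `Routes` type, its fields, and both method names are made up for illustration, and it uses the standard library's `RwLock` with `try_write` so that it reports the conflict instead of actually hanging.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical shared table; every name here is illustrative only.
pub struct Routes {
    table: RwLock<HashMap<String, u32>>,
}

impl Routes {
    pub fn new() -> Self {
        Routes { table: RwLock::new(HashMap::new()) }
    }

    /// Hazardous shape. In editions before Rust 2024, the read guard created
    /// in the `if let` condition is a temporary that lives until the end of
    /// the whole if/else, so the `else` branch still holds the read lock.
    /// A blocking `write()` there would wait on itself; `try_write()` is used
    /// here so the function just reports whether a write would be possible.
    pub fn else_branch_can_write(&self, key: &str) -> bool {
        if let Some(_port) = self.table.read().unwrap().get(key).copied() {
            true
        } else {
            self.table.try_write().is_ok()
        }
    }

    /// Safer shape: finish the read in its own statement, so the guard is
    /// dropped at the semicolon, before any write lock is requested.
    pub fn get_or_insert(&self, key: &str, default: u32) -> u32 {
        let found = self.table.read().unwrap().get(key).copied();
        match found {
            Some(port) => port,
            None => {
                let mut guard = self.table.write().unwrap();
                *guard.entry(key.to_string()).or_insert(default)
            }
        }
    }
}
```

Calling `else_branch_can_write` with a missing key gives an edition-dependent answer: the Rust 2024 edition drops the condition's temporaries before the `else` block runs, while earlier editions keep the read guard alive through it. `get_or_insert` behaves the same either way, which is the point of splitting the read into its own statement.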


And here’s a partial outline of what else can be found there, on the off chance that anyone might need a little extra nudging…

Measures being taken to prevent future problems

  • Automated, structural (as opposed to just grepping for strings) source-code sweeps, to catch that "if let" pattern.*
  • Decentralization...
    • Of Corrosion, the global metadata database,*
      • including its interactions with the proxy, which are not simple, given Fly-Replay and the like.
    • "Any project that credibly removes a SPOF [single point of failure] in our architecture is staffed right now" (hidden away in a thread elsewhere).
  • Audit of all certificates and their expirations (which were the cause of the October major outage).*

Efficiency and capacity improvements

  • "Burst" providers - third-party hosting† that Fly.io's own customers' loads transparently run on until Fly can purchase and install more hardware (?). [I don't recall hearing this mentioned before—although burst "capacity" (without involving other companies) was described in August.]
  • "CI/CD process for BGP4 announcement changes".
  • Better use of RootFS storage, on physical host machines.
  • IO performance, particularly during volume snapshots.

New features

  • Containers (e.g., for sidecars, composable volume-related modules, nicer cron, …), :black_cat:.

*Ongoing efforts.
†Generally, these would have to be bare-metal servers, due to Firecracker.


Caveat: These are just the interpretations of a fellow user. (And are admittedly at least partly for writing practice, so please bear with me.)
