The Infrastructure Log is an excellent auxiliary resource that tends to fall off people’s radars, and since this latest one is more eventful than might be expected, I thought I’d draw a little attention during the forum’s off hours…
https://fly.io/infra-log/2024-12-14/
The narrow but intense box there on Thursday is a reprise of the `if let` global proxy deadlock, a (fortunately) briefer recurrence of the problem that caused the September major outage. This month, “watchdog safeguards” auto-rebooted the proxy, greatly reducing the impact.
Definitely read the link above for official details.
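For anyone who hasn't run into the `if let` pattern before, here is a minimal sketch of the general Rust behaviour as I understand it. This is not Fly.io's proxy code: it uses `std::sync::RwLock`, and the names `cache`, `read_lock_held_in_else`, and `get_or_insert` are made up for illustration. The point is that a lock guard created in the `if let` condition stays alive through the `else` branch (in editions before 2024; Rust 2024 changed this scoping), so asking for the write lock there can block forever. The sketch uses `try_write` so that it only reports the problem instead of hanging.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Reports whether the read guard created in the `if let` condition is still
/// held inside the `else` branch. `try_write` is used instead of `write` so
/// this demo never blocks. (Hypothetical helper, for illustration only.)
fn read_lock_held_in_else(cache: &RwLock<HashMap<String, u32>>, key: &str) -> bool {
    let held = if let Some(_value) = cache.read().unwrap().get(key) {
        false
    } else {
        // In editions before 2024 the temporary read guard is still alive
        // here, so a plain `cache.write()` at this point would wait on itself.
        cache.try_write().is_err()
    };
    held
}

/// One common way to avoid the problem: finish with the read guard in its own
/// statement, then take the write lock separately.
fn get_or_insert(cache: &RwLock<HashMap<String, u32>>, key: &str, default: u32) -> u32 {
    // The read guard is a temporary of this `let` statement, so it is dropped
    // at the semicolon, before any write lock is requested.
    let cached = cache.read().unwrap().get(key).copied();
    match cached {
        Some(value) => value,
        // `entry` re-checks under the write lock, in case another thread
        // inserted the key between the read and the write.
        None => *cache.write().unwrap().entry(key.to_string()).or_insert(default),
    }
}

fn main() {
    let cache = RwLock::new(HashMap::new());
    println!("read lock held in else: {}", read_lock_held_in_else(&cache, "app"));
    println!("value: {}", get_or_insert(&cache, "app", 7));
}
```

A plain string search for `if let` would flag every harmless use of the construct too, which is presumably why the log talks about structural sweeps rather than `grep`.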
And here’s a partial outline of what else can be found there, on the off chance that anyone might need a little extra nudging…
Measures being taken to prevent future problems
- Automated, structural (as opposed to just `grep`ing for strings) source-code sweeps, to catch that `if let` pattern.*
- Decentralization...
  - Of Corrosion, the global metadata database,*
    - including its interactions with the proxy, those being not simple, given `Fly-Replay` and the like.
  - "Any project that credibly removes a SPOF [single point of failure] in our architecture is staffed right now" (hidden away in a thread elsewhere).
- Audit of all certificates and their expirations (which were the cause of the October major outage).*
Efficiency and capacity improvements
- "Burst" providers - third-party hosting† that Fly.io's own customers' loads transparently run on until Fly can purchase and install more hardware (?). [I don't recall hearing this mentioned before—although burst "capacity" (without involving other companies) was described in August.]
- "CI/CD process for BGP4 announcement changes".
- Better use of `RootFS` storage, on physical host machines.
- IO performance, particularly during volume snapshots.
New features
- Containers (e.g., for sidecars, composable volume-related modules, nicer `cron`, …).
*Ongoing efforts.
†Generally, these would have to be bare-metal servers, due to Firecracker.
Caveat: These are just the interpretations of a fellow user. (And are admittedly at least partly for writing practice, so please bear with me.)