Overview
We began getting reports of 500 errors during deployment from this thread on the community board. On investigation, the sin, yul, and jnb regions were having intermittent connectivity issues with iad, and LiteFS was unable to connect to the primary.
While deploying a LiteFS Proxy fix later in the day to address this, a checksum error occurred on the candidate nodes in the iad region, and we had to perform a manual recovery of the primary node in the cluster.
Timeline
The following is an overview of the incident’s activity (all times in CST):
- 07:16: Users begin reporting 500 errors during deployment; we begin investigating the root cause.
- 08:46: Connectivity improves and error rate also improves.
- 10:18: Bad gateway errors begin happening from sin and yul instances of Fly registry.
- 11:02: LiteFS begins showing errors for writes against replicas, indicating that they have lost connection with the primary.
- 11:11: Problematic replica nodes have been restarted and errors drop.
- 13:56: Implemented a LiteFS Proxy fix to prevent replicas from accepting writes.
- 17:32: Deployed new version of LiteFS to registry but LiteFS is not starting up registry service. Beginning to see "database checksum mismatch" errors in candidate node logs.
- 17:54: Manually cleared the data directory on one candidate, restarted, and re-imported the database file. Successfully came up as primary again. Manually cleared out the other two candidate nodes in iad.
- 18:15: Begin manually restarting replica nodes, stopped after 5 or so as other replicas were recovering on their own.
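To make the 13:56 proxy change concrete, here is a minimal sketch of the idea of refusing writes on a replica that has lost its primary. It is not the LiteFS Proxy implementation: the `NodeState` type, the `route` function, the set of read methods, and the status codes are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumption for this sketch: these methods are reads, everything else writes.
READ_METHODS = {"GET", "HEAD", "OPTIONS"}


@dataclass
class NodeState:
    """Hypothetical view of a node's role in the cluster."""
    is_primary: bool
    primary_id: Optional[str]  # None when the connection to the primary is lost


def route(method: str, node: NodeState) -> Tuple[int, str]:
    """Return an illustrative (status, action) pair for an incoming request."""
    if method.upper() in READ_METHODS:
        # Reads can be answered from the local copy of the database.
        return 200, "serve-locally"
    if node.is_primary:
        # Only the primary applies writes.
        return 200, "serve-locally"
    if node.primary_id is None:
        # Replica cannot see the primary: fail the write instead of applying it.
        return 503, "reject-write"
    # Replica knows the primary: hand the write over to it.
    return 307, "forward-to:" + node.primary_id
```

For example, `route("POST", NodeState(is_primary=False, primary_id=None))` yields the reject decision, which is the situation the replicas were in at 11:02.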
Learnings
We learned that LiteFS should better handle failure modes where it is unable to connect to the primary. We also found a bug in the primary handoff which caused a checksum mismatch on startup. Our first responder steps can also be improved so we can more quickly respond to similar situations in the future.
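As a rough illustration of what a checksum mismatch on startup means, the sketch below computes a per-page checksum, combines the pages with XOR, and refuses to start when the result differs from a recorded value. This is an assumption-laden toy, not the LiteFS algorithm: the page size, the use of CRC-32, and the function names are all hypothetical.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size for this sketch


def page_checksum(page_no: int, data: bytes) -> int:
    """Checksum one page, mixing in its page number so swapped pages differ."""
    return zlib.crc32(page_no.to_bytes(4, "big") + data)


def database_checksum(db: bytes) -> int:
    """XOR the per-page checksums so one page can be updated incrementally."""
    total = 0
    for offset in range(0, len(db), PAGE_SIZE):
        page_no = offset // PAGE_SIZE + 1
        total ^= page_checksum(page_no, db[offset:offset + PAGE_SIZE])
    return total


def verify_on_startup(db: bytes, recorded: int) -> None:
    """Raise if the on-disk database does not match the recorded checksum."""
    actual = database_checksum(db)
    if actual != recorded:
        raise ValueError(
            f"database checksum mismatch: recorded={recorded:08x} actual={actual:08x}"
        )
```

In a scheme like this, a node whose recorded checksum was left stale by an incomplete handoff would fail `verify_on_startup` even though the database file itself is intact, which is one way a handoff bug could surface as a startup error.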
Action Items
- Fix elevated 500 error alerting.
- Improve LiteFS proxy to handle primary disconnection.
- Set up a representative cluster to reproduce the checksum bug. The current theory is that the larger database sizes and frequent deploys exposed a rare bug that had not surfaced before.
- Add trace logging to registry nodes.
- Deploy registry service sequentially in the primary region.
- Add a health check to the LiteFS proxy to pull nodes out of the fly-proxy when they lag.
- Update the registry ops guide with the errors and behaviors observed during this incident.
- Improve deployment speed of the registry service through flyctl.
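The lag-based health check in the action items could look something like the following sketch. It assumes the node tracks the time of its last successful sync with the primary; the threshold, the function name, and the status codes are hypothetical, and the real check may be built quite differently.

```python
import time
from typing import Optional

MAX_LAG_SECONDS = 10.0  # hypothetical threshold, to be tuned per service


def health_status(
    last_sync: Optional[float],
    is_primary: bool = False,
    now: Optional[float] = None,
) -> int:
    """Return 200 when the node should receive traffic, 503 otherwise."""
    if is_primary:
        # The primary is the source of truth and cannot lag behind itself.
        return 200
    if last_sync is None:
        # Replica has never synced: keep it out of rotation.
        return 503
    current = time.time() if now is None else now
    lag = current - last_sync
    return 200 if lag <= MAX_LAG_SECONDS else 503
```

A load balancer polling this result would stop routing to a replica once it falls more than the threshold behind and would add it back automatically once it catches up, which avoids the manual restarts seen at 11:11 and 18:15.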