We began getting reports of 500 errors during deployment from this thread on the community board. On investigation, the jnb regions were having intermittent connectivity issues with iad, and LiteFS was unable to connect to the primary.
While deploying a LiteFS Proxy fix later in the day to address this, a checksum error occurred on the candidate nodes in the IAD region and we needed to perform a manual recovery of the primary node in the cluster.
The following is an overview of the incident’s activity (all times in CST):
- 07:16: Users begin reporting 500 errors during deployment; we begin investigating the root cause.
- 08:46: Connectivity improves and the error rate drops.
- 10:18: Bad gateway errors begin happening from yul instances of Fly registry.
- 11:02: LiteFS begins showing errors for writes against replicas indicating that they have lost connection with the primary.
- 11:11: Problematic replica nodes have been restarted and errors drop.
- 13:56: Implemented a LiteFS Proxy fix to prevent replicas from accepting writes.
- 17:32: Deployed new version of LiteFS to registry but LiteFS is not starting up registry service. Beginning to see database checksum mismatch errors in candidate node logs.
- 17:54: Manually cleared the data directory on one candidate, restarted, and re-imported the database file. Successfully came up as primary again. Manually cleared out the other two candidate nodes in
- 18:15: Began manually restarting replica nodes; stopped after 5 or so because the other replicas were recovering on their own.
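
The 13:56 proxy change can be pictured with a small sketch. It is an illustration only, not the LiteFS source: `NodeState`, `route`, and the action strings are hypothetical names, and treating GET, HEAD, and OPTIONS as the only read methods is an assumption. The idea is that a replica which has lost the primary refuses writes instead of applying them locally.

```python
from dataclasses import dataclass
from typing import Optional

# HTTP methods treated as read-only in this sketch (an assumption).
READ_METHODS = {"GET", "HEAD", "OPTIONS"}


@dataclass
class NodeState:
    """Hypothetical view of a node's role; not a LiteFS type."""
    is_primary: bool
    primary_addr: Optional[str]  # None when the replica has lost the primary


def route(method: str, node: NodeState) -> str:
    """Decide what to do with a request that arrives at this node."""
    if method.upper() in READ_METHODS:
        return "serve-local"
    if node.is_primary:
        return "serve-local"
    if node.primary_addr is not None:
        # Writes on a connected replica are handed to the primary.
        return "forward:" + node.primary_addr
    # Replica with no known primary: refuse rather than write locally.
    return "reject"
```

Reads are still served locally in this sketch, so a disconnected replica may return stale data; whether that is acceptable depends on the service.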
We learned that LiteFS should better handle failure modes where it is unable to connect to the primary. We also found a bug in the primary handoff which caused a checksum mismatch on startup. Our first responder steps can also be improved so we can more quickly respond to similar situations in the future.
- Fix elevated 500 error alerting.
- Improve LiteFS proxy to handle primary disconnection.
- Set up a representative cluster to reproduce the checksum bug. The current theory is that the larger database sizes and frequent deploys exposed a rare bug that had not surfaced before.
- Add trace logging to registry nodes.
- Deploy registry service sequentially in the primary region.
- Add a health check to the LiteFS proxy to pull nodes out of the fly-proxy when they lag.
- Update the registry ops guide with some of the errors/behaviors
- Improve deployment speed of the registry service through flyctl.
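
The lag health check in the list above could look roughly like the following. This is a sketch under assumptions: it supposes each node can report a monotonically increasing transaction ID for itself and for the primary, and `health_status`, `max_lag`, and the threshold value are hypothetical rather than LiteFS or fly-proxy settings. A 503 response is what would signal the load balancer to stop routing to the node.

```python
from typing import Optional


def health_status(replica_txid: int,
                  primary_txid: Optional[int],
                  max_lag: int = 100) -> int:
    """Return the HTTP status a health endpoint would answer with."""
    if primary_txid is None:
        # Primary position unknown: treat the node as unhealthy.
        return 503
    # Clamp at zero in case the replica's view is briefly ahead.
    lag = max(0, primary_txid - replica_txid)
    return 200 if lag <= max_lag else 503
```

For example, `health_status(899, 1000)` reports the node as unhealthy because it is 101 transactions behind, while `health_status(900, 1000)` still passes at the default threshold.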