Postmortem: Fly Registry (2023-08-08)

Overview

We began receiving reports of 500 errors during deployment through a thread on the community board. On investigation, we found that the sin, yul, and jnb regions had intermittent connectivity issues with iad, which left LiteFS unable to connect to the primary.

While we were deploying a LiteFS proxy fix later in the day to address this, a checksum error occurred on the candidate nodes in the iad region, and we had to manually recover the primary node in the cluster.

Timeline

The following is an overview of the incident’s activity (all times in CST):

  • 07:16: Users begin reporting 500 errors during deployment; we begin investigating the root cause.
  • 08:46: Connectivity improves and the error rate drops.
  • 10:18: Bad gateway errors begin occurring on the sin and yul instances of the Fly registry.
  • 11:02: LiteFS begins showing errors for writes against replicas, indicating that they have lost their connection to the primary.
  • 11:11: The problematic replica nodes are restarted and errors drop.
  • 13:56: We implement a LiteFS proxy fix to prevent replicas from accepting writes.
  • 17:32: We deploy a new version of LiteFS to the registry, but LiteFS does not start the registry service. Database checksum mismatch errors begin appearing in the candidate node logs.
  • 17:54: We manually clear the data directory on one candidate, restart it, and re-import the database file. The node successfully comes up as primary again. We then manually clear out the other two candidate nodes in iad.
  • 18:15: We begin manually restarting replica nodes but stop after about five, because the other replicas are recovering on their own.

Learnings

We learned that LiteFS should better handle failure modes in which it is unable to connect to the primary. We also found a bug in the primary handoff that caused a checksum mismatch on startup. Our first-responder steps can also be improved so that we can respond more quickly to similar situations in the future.
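
To make the first learning concrete, here is a minimal sketch of how a proxy in front of a replica might decide what to do with a request when the primary may be unreachable. It is not LiteFS's actual implementation: the function name, the action strings, and the choice to classify reads by HTTP method are all assumptions made for illustration.

```python
# Hypothetical sketch: names and return values are illustrative only.
# Assumption: requests using these HTTP methods never write to the database.
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}


def route_request(method: str, is_primary: bool, primary_reachable: bool) -> str:
    """Decide how a node should handle a request.

    Returns one of:
      "serve_locally"      - handle the request on this node
      "forward_to_primary" - this node is a replica; send the write to the primary
      "reject_503"         - a write arrived on a replica that cannot reach the primary
    """
    if method.upper() in SAFE_METHODS:
        # Reads can be served from the local copy, even if it is slightly stale.
        return "serve_locally"
    if is_primary:
        # Only the primary is allowed to accept writes.
        return "serve_locally"
    if primary_reachable:
        return "forward_to_primary"
    # Fail fast rather than accept a write that cannot be replicated.
    return "reject_503"
```

The point of the last branch is that a replica that has lost its primary refuses the write outright rather than accepting it locally.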

Action Items

  • Fix elevated 500 error alerting.
  • Improve LiteFS proxy to handle primary disconnection.
  • Set up a representative cluster to reproduce the checksum bug. The current theory is that larger database sizes and frequent deploys exposed a rare bug that had not surfaced before.
  • Add trace logging to registry nodes.
  • Deploy registry service sequentially in the primary region.
  • Add a health check to the LiteFS proxy to pull nodes out of the fly-proxy when they lag.
  • Update the registry ops guide with the errors and behaviors observed during this incident.
  • Improve deployment speed of the registry service through flyctl.
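
As an illustration of the lag health check item above, the following sketch shows the kind of decision such a check could make. It is only a sketch under stated assumptions: the function and field names are hypothetical, the 10-second threshold is an arbitrary placeholder, and how the lag value is actually measured is left out because it depends on LiteFS internals.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HealthResult:
    healthy: bool
    status_code: int  # HTTP status the health endpoint would return
    reason: str


def check_replica_health(
    is_primary: bool,
    lag_seconds: Optional[float],
    max_lag_seconds: float = 10.0,  # placeholder threshold, not a recommended value
) -> HealthResult:
    """Report a node as unhealthy when its replication lag is unknown or too high."""
    if is_primary:
        # The primary is the source of truth, so it cannot lag behind itself.
        return HealthResult(True, 200, "primary")
    if lag_seconds is None:
        # No lag measurement usually means the replica is disconnected.
        return HealthResult(False, 503, "replication lag unknown")
    if lag_seconds > max_lag_seconds:
        return HealthResult(
            False, 503, f"lag {lag_seconds:.1f}s exceeds {max_lag_seconds:.1f}s"
        )
    return HealthResult(True, 200, "in sync")
```

A load balancer polling an endpoint backed by a check like this would stop routing traffic to a replica once it returns 503, and resume once the replica catches up.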