Postmortem: Fly Registry (2023-08-08)

benbjohnson · August 10, 2023, 5:57pm

Overview

We began receiving reports of 500 errors during deployment in a thread on the community board. Our investigation found that the sin, yul, & jnb regions were having intermittent connectivity issues with iad and that LiteFS was unable to connect to the primary.

While we were deploying a LiteFS proxy fix later in the day to address this, a checksum error occurred on the candidate nodes in the iad region, and we needed to perform a manual recovery of the primary node in the cluster.

Timeline

The following is an overview of the incident’s activity (all times in CST):

07:16: Users begin reporting 500 errors during deployment; we begin investigating the root cause.
08:46: Connectivity improves and the error rate drops.
10:18: Bad gateway errors begin appearing from the sin & yul instances of the Fly registry.
11:02: LiteFS begins showing errors for writes against replicas, indicating that they have lost their connection to the primary.
11:11: The problematic replica nodes are restarted and errors drop.
13:56: We implement a LiteFS proxy fix to prevent replicas from accepting writes.
17:32: We deploy a new version of LiteFS to the registry, but LiteFS does not start the registry service. Database checksum mismatch errors begin to appear in the candidate node logs.
17:54: We manually clear the data directory on one candidate, restart it, and re-import the database file. It successfully comes up as primary again. We then manually clear out the other two candidate nodes in iad.
18:15: We begin manually restarting replica nodes, stopping after 5 or so because the other replicas are recovering on their own.

Learnings

We learned that LiteFS should better handle failure modes where it is unable to connect to the primary. We also found a bug in the primary handoff which caused a checksum mismatch on startup. Our first responder steps can also be improved so we can more quickly respond to similar situations in the future.
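
To make the first learning concrete, here is a minimal sketch of how a proxy in front of a replica could decide what to do with a request when the primary may be unreachable. It is not LiteFS's actual implementation: the function name `route_request`, its parameters, and the returned action strings are all hypothetical, and the set of write methods is an assumption.

```python
from typing import Optional

# Assumption: these HTTP methods are treated as writes.
WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


def route_request(method: str, is_primary: bool, primary_host: Optional[str]) -> str:
    """Return a hypothetical routing action for one incoming request.

    "serve"          -> handle the request on this node
    "forward:<host>" -> send the write to the known primary
    "reject"         -> fail fast, since no primary is known
    """
    is_write = method.upper() in WRITE_METHODS

    # Reads can be served anywhere; writes are only safe on the primary.
    if not is_write or is_primary:
        return "serve"

    # A replica that has lost track of the primary must not accept the write.
    if primary_host is None:
        return "reject"

    return "forward:" + primary_host
```

The point of the sketch is the third branch: a replica that cannot see the primary returns an explicit failure for writes instead of attempting them locally.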

Action Items

Fix elevated 500 error alerting.
Improve LiteFS proxy to handle primary disconnection.
Set up a representative cluster to reproduce the checksum bug. The current theory is that the larger database sizes and frequent deploys exposed a rare bug that had not surfaced before.
Add trace logging to registry nodes.
Deploy registry service sequentially in the primary region.
Add a health check to the LiteFS proxy to pull nodes out of the fly-proxy when they lag.
Update the registry ops guide with some of the errors and behaviors seen during this incident.
Improve deployment speed of the registry service through flyctl.
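
As an illustration of the lag health check item above, the following sketch shows one way such a check could be expressed. It is an assumption-laden outline rather than the planned implementation: `ReplicaStatus`, `health_check`, and the 10-second default threshold are hypothetical, and how a node would actually measure its lag is left out.

```python
from dataclasses import dataclass


@dataclass
class ReplicaStatus:
    # Hypothetical fields; a real node would derive these from its own state.
    connected_to_primary: bool
    seconds_since_last_sync: float


def health_check(status: ReplicaStatus, max_lag_seconds: float = 10.0):
    """Return an (HTTP status code, reason) pair for a load balancer probe.

    A non-200 result is meant to signal that the node should be taken
    out of rotation until it catches up.
    """
    if not status.connected_to_primary:
        return 503, "disconnected from primary"
    if status.seconds_since_last_sync > max_lag_seconds:
        return 503, "replica lagging"
    return 200, "ok"
```

A probe that receives 503 from this check would stop routing traffic to the node, which is the behavior the action item describes for lagging replicas.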

