LiteFS replica disconnection issues

I’ve recently deployed an app using LiteFS in two regions: the primary in Toronto (yyz) and a replica in Sydney (syd). The initial deployment works great, with replication behaving as expected. However, after some amount of time (hours, maybe a day), replication begins to fail with the following errors:

 2022-11-10T20:19:21.529 app[7831561d] syd [info] replica disconnected, retrying: connect to primary: invalid response: code=503 ('http://d968ff0c.vm.cool-stuff-api.internal:20202')

2022-11-10T20:19:22.742 app[7831561d] syd [info] existing primary found (d968ff0c), connecting as replica

2022-11-10T20:19:22.845 app[d968ff0c] yyz [info] http: error: context canceled

2022-11-10T20:19:22.948 app[7831561d] syd [info] replica disconnected, retrying: connect to primary: invalid response: code=503 ('http://d968ff0c.vm.cool-stuff-api.internal:20202')

2022-11-10T20:19:24.159 app[7831561d] syd [info] existing primary found (d968ff0c), connecting as replica

2022-11-10T20:19:24.262 app[d968ff0c] yyz [info] http: error: context canceled

2022-11-10T20:19:24.366 app[7831561d] syd [info] replica disconnected, retrying: connect to primary: invalid response: code=503 ('http://d968ff0c.vm.cool-stuff-api.internal:20202')

2022-11-10T20:19:25.596 app[7831561d] syd [info] existing primary found (d968ff0c), connecting as replica

2022-11-10T20:19:25.702 app[d968ff0c] yyz [info] http: error: context canceled

2022-11-10T20:19:25.806 app[7831561d] syd [info] replica disconnected, retrying: connect to primary: invalid response: code=503 ('http://d968ff0c.vm.cool-stuff-api.internal:20202')

2022-11-10T20:19:27.015 app[7831561d] syd [info] existing primary found (d968ff0c), connecting as replica

2022-11-10T20:19:27.118 app[d968ff0c] yyz [info] http: error: context canceled

2022-11-10T20:19:27.222 app[7831561d] syd [info] replica disconnected, retrying: connect to primary: invalid response: code=503 ('http://d968ff0c.vm.cool-stuff-api.internal:20202')

2022-11-10T20:19:28.435 app[7831561d] syd [info] existing primary found (d968ff0c), connecting as replica

2022-11-10T20:19:28.539 app[d968ff0c] yyz [info] http: error: context canceled 

It doesn’t seem to recover from the issue on its own, and it’s only resolved when I scale down to 0 and then back up to 2.
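
For reference, the scale cycle I use as a workaround looks roughly like this; exact flyctl flags may differ by version:

# take everything down, then bring back one instance per region
fly scale count 0 --app cool-stuff-api
fly scale count 2 --max-per-region 1 --app cool-stuff-api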

Here’s my configuration:

litefs.yml:

# The path to where the SQLite database will be accessed.
mount-dir: "/data"

# The path to where the underlying volume mount is.
data-dir: "/mnt/data"

# These environment variables will be available in your Fly.io application.
# You must set "experimental.enable_consul" in fly.toml for FLY_CONSUL_URL to be available.
consul:
  url: "${FLY_CONSUL_URL}"
  advertise-url: "http://${HOSTNAME}.vm.${FLY_APP_NAME}.internal:20202"

Dockerfile (abbreviated):

# Fetch the LiteFS binary using a multi-stage build.
FROM flyio/litefs:0.3.0-beta2 AS litefs

...

# Copy litefs
COPY --from=litefs /usr/local/bin/litefs /usr/local/bin/litefs

# Copy litefs config
COPY ./litefs.yml /etc/litefs.yml

RUN mkdir -p /data /mnt/data

...

# entrypoint.sh runs migrations and then runs the command via `exec "$@"`
ENTRYPOINT ["litefs", "--", "/code/entrypoint.sh"]
CMD ["uvicorn", "coolstuff.shell.api:app", "--port", "80", "--host", "0.0.0.0"]

My app opens the SQLite database at /data/app_db and is not currently using WAL mode.
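
For reference, the journal mode can be checked with the sqlite3 CLI against the LiteFS mount (assuming the CLI is installed in the container):

# prints "delete" (rollback journal) when WAL is not enabled
sqlite3 /data/app_db 'PRAGMA journal_mode;'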

Any help would be appreciated!

Thanks for reporting the issue, @bencdavis. It looks like the only spot in LiteFS that returns a 503 is here:

So it looks like either the server’s context is canceled or the context from the request is canceled. What log messages do you see before this occurs?

Also, can you try hitting the `GET /debug/pprof/goroutine?debug=2` endpoint on port 20202 of both instances when this occurs? It’ll dump out a bunch of stack traces, but that might help narrow down the cause of the context cancellation.
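
For example, from anything on your app’s private network (hostnames taken from your logs), something like:

# replica (syd)
curl "http://7831561d.vm.cool-stuff-api.internal:20202/debug/pprof/goroutine?debug=2"
# primary (yyz)
curl "http://d968ff0c.vm.cool-stuff-api.internal:20202/debug/pprof/goroutine?debug=2"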

Thanks @benbjohnson2 for the quick response!

Unfortunately the logs are truncated and don’t go far enough back to show what happened before the disconnection. One thing I did notice is that I have a bug in my app that occasionally attempts to write to the local database on the replica, which raises `(sqlite3.OperationalError) attempt to write a readonly database` from the driver. Presumably that doesn’t affect replication, though.
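
(Side note: one way I’m planning to guard against those stray writes, if my LiteFS version supports it, is the `.primary` file that LiteFS exposes in the mount directory on replica nodes. A rough sketch:)

# .primary only exists on replicas and contains the primary's hostname
if [ -f /data/.primary ]; then
  echo "replica (primary is $(cat /data/.primary)); skipping local writes"
else
  echo "this node is the primary; writes are allowed"
fi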

I was only able to get the output of /debug/pprof/goroutine on the replica:

Sydney (replica):

goroutine 34116 [running]:
runtime/pprof.writeGoroutineStacks({0xb66740, 0xc0001840e0})
        /usr/local/go/src/runtime/pprof/pprof.go:692 +0x70
runtime/pprof.writeGoroutine({0xb66740?, 0xc0001840e0?}, 0x0?)
        /usr/local/go/src/runtime/pprof/pprof.go:681 +0x2b
runtime/pprof.(*Profile).WriteTo(0xa2abc0?, {0xb66740?, 0xc0001840e0?}, 0xc?)
        /usr/local/go/src/runtime/pprof/pprof.go:330 +0x14b
net/http/pprof.handler.ServeHTTP({0xc00002e911, 0x9}, {0xb6a690, 0xc0001840e0}, 0xc00009806f?)
        /usr/local/go/src/net/http/pprof/pprof.go:253 +0x4a5
net/http/pprof.Index({0xb6a690?, 0xc0001840e0}, 0xc000130600)
        /usr/local/go/src/net/http/pprof/pprof.go:371 +0x13e
github.com/superfly/litefs/http.(*Server).serveHTTP(0xc000162e60?, {0xb6a690, 0xc0001840e0}, 0xc000130600)
        /src/litefs/http/server.go:128 +0x186
net/http.HandlerFunc.ServeHTTP(0xc0002009b8?, {0xb6a690?, 0xc0001840e0?}, 0xc000200990?)
        /usr/local/go/src/net/http/server.go:2109 +0x2f
golang.org/x/net/http2/h2c.h2cHandler.ServeHTTP({{0xb67280?, 0xc0001509a0?}, 0xc00008b980?}, {0xb6a690, 0xc0001840e0}, 0xc000130600)
        /go/pkg/mod/golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/h2c/h2c.go:113 +0x49f
net/http.serverHandler.ServeHTTP({0xc000492600?}, {0xb6a690, 0xc0001840e0}, 0xc000130600)
        /usr/local/go/src/net/http/server.go:2947 +0x30c
net/http.(*conn).serve(0xc0001628c0, {0xb6ae18, 0xc0001d1290})
        /usr/local/go/src/net/http/server.go:1991 +0x607
created by net/http.(*Server).Serve
        /usr/local/go/src/net/http/server.go:3102 +0x4db

goroutine 1 [select, 1336 minutes]:
main.main()
        /src/litefs/cmd/litefs/main.go:79 +0x557

goroutine 23 [syscall, 1336 minutes]:
os/signal.signal_recv()
        /usr/local/go/src/runtime/sigqueue.go:152 +0x2f
os/signal.loop()
        /usr/local/go/src/os/signal/signal_unix.go:23 +0x19
created by os/signal.Notify.func1.1
        /usr/local/go/src/os/signal/signal.go:151 +0x2a

goroutine 38 [sleep]:
time.Sleep(0x3b9aca00)
        /usr/local/go/src/runtime/time.go:195 +0x135
github.com/superfly/litefs.(*Store).monitorLease(0xc000130500, {0xb6ad70, 0xc00008b8c0})
        /src/litefs/store.go:462 +0x2c8
github.com/superfly/litefs.(*Store).Open.func1()
        /src/litefs/store.go:134 +0x25
golang.org/x/sync/errgroup.(*Group).Go.func1()
        /go/pkg/mod/golang.org/x/sync@v0.0.0-20220601150217-0de741cfad7f/errgroup/errgroup.go:75 +0x64
created by golang.org/x/sync/errgroup.(*Group).Go
        /go/pkg/mod/golang.org/x/sync@v0.0.0-20220601150217-0de741cfad7f/errgroup/errgroup.go:72 +0xa5

goroutine 39 [select, 1 minutes]:
github.com/superfly/litefs.(*Store).monitorRetention(0xc000130500, {0xb6ad70, 0xc00008b8c0})
        /src/litefs/store.go:607 +0xe7
github.com/superfly/litefs.(*Store).Open.func2()
        /src/litefs/store.go:138 +0x25
golang.org/x/sync/errgroup.(*Group).Go.func1()
        /go/pkg/mod/golang.org/x/sync@v0.0.0-20220601150217-0de741cfad7f/errgroup/errgroup.go:75 +0x64
created by golang.org/x/sync/errgroup.(*Group).Go
        /go/pkg/mod/golang.org/x/sync@v0.0.0-20220601150217-0de741cfad7f/errgroup/errgroup.go:72 +0xa5

goroutine 34117 [runnable]:
net/http.(*connReader).startBackgroundRead.func2()
        /usr/local/go/src/net/http/server.go:674
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1594 +0x1
created by net/http.(*connReader).startBackgroundRead
        /usr/local/go/src/net/http/server.go:674 +0xca

goroutine 44 [syscall, 21 minutes]:
syscall.Syscall(0x4780ee?, 0xf03940?, 0x40dc07?, 0x9f5880?)
        /usr/local/go/src/syscall/syscall_linux.go:68 +0x27
syscall.read(0xf03940?, {0xc000494000?, 0x11?, 0x73586c6f67666e00?})
        /usr/local/go/src/syscall/zsyscall_linux_amd64.go:696 +0x45
syscall.Read(...)
        /usr/local/go/src/syscall/syscall_unix.go:183
bazil.org/fuse.(*Conn).ReadRequest(0xc0001f2840)
        /go/pkg/mod/bazil.org/fuse@v0.0.0-20200524192727-fb710f7dfd05/fuse.go:580 +0xc6
bazil.org/fuse/fs.(*Server).Serve(0xc000185dc0, {0xb65d80?, 0xc0001f25a0})
        /go/pkg/mod/bazil.org/fuse@v0.0.0-20200524192727-fb710f7dfd05/fs/serve.go:501 +0x3ef
github.com/superfly/litefs/fuse.(*FileSystem).Mount.func2()
        /src/litefs/fuse/file_system.go:93 +0x28
created by github.com/superfly/litefs/fuse.(*FileSystem).Mount
        /src/litefs/fuse/file_system.go:92 +0x31c

goroutine 45 [IO wait]:
internal/poll.runtime_pollWait(0x7efbed903038, 0x72)
        /usr/local/go/src/runtime/netpoll.go:305 +0x89
internal/poll.(*pollDesc).wait(0xc00016a180?, 0x6?, 0x0)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc00016a180)
        /usr/local/go/src/internal/poll/fd_unix.go:614 +0x234
net.(*netFD).accept(0xc00016a180)
        /usr/local/go/src/net/fd_unix.go:172 +0x35
net.(*TCPListener).accept(0xc000011608)
        /usr/local/go/src/net/tcpsock_posix.go:142 +0x28
net.(*TCPListener).Accept(0xc000011608)
        /usr/local/go/src/net/tcpsock.go:288 +0x3d
net/http.(*Server).Serve(0xc0000fe1e0, {0xb6a480, 0xc000011608})
        /usr/local/go/src/net/http/server.go:3070 +0x385
github.com/superfly/litefs/http.(*Server).Serve.func1()
        /src/litefs/http/server.go:74 +0x2d
golang.org/x/sync/errgroup.(*Group).Go.func1()
        /go/pkg/mod/golang.org/x/sync@v0.0.0-20220601150217-0de741cfad7f/errgroup/errgroup.go:75 +0x64
created by golang.org/x/sync/errgroup.(*Group).Go
        /go/pkg/mod/golang.org/x/sync@v0.0.0-20220601150217-0de741cfad7f/errgroup/errgroup.go:72 +0xa5

goroutine 50 [IO wait]:
internal/poll.runtime_pollWait(0x7efbed902c78, 0x72)
        /usr/local/go/src/runtime/netpoll.go:305 +0x89
internal/poll.(*pollDesc).wait(0xc00016b080?, 0xc00024b000?, 0x0)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc00016b080, {0xc00024b000, 0x1000, 0x1000})
        /usr/local/go/src/internal/poll/fd_unix.go:167 +0x25a
net.(*netFD).Read(0xc00016b080, {0xc00024b000?, 0x0?, 0x400?})
        /usr/local/go/src/net/fd_posix.go:55 +0x29
net.(*conn).Read(0xc0000123b0, {0xc00024b000?, 0x1000000406730?, 0x4ed000801?})
        /usr/local/go/src/net/net.go:183 +0x45
bufio.(*Reader).Read(0xc0001f2ea0, {0xc000185ee0, 0x9, 0x8c6fa5?})
        /usr/local/go/src/bufio/bufio.go:237 +0x1bb
io.ReadAtLeast({0xb65920, 0xc0001f2ea0}, {0xc000185ee0, 0x9, 0x9}, 0x9)
        /usr/local/go/src/io/io.go:332 +0x9a
io.ReadFull(...)
        /usr/local/go/src/io/io.go:351
golang.org/x/net/http2.readFrameHeader({0xc000185ee0?, 0x9?, 0x0?}, {0xb65920?, 0xc0001f2ea0?})
        /go/pkg/mod/golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/frame.go:237 +0x6e
golang.org/x/net/http2.(*Framer).ReadFrame(0xc000185ea0)
        /go/pkg/mod/golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/frame.go:498 +0x95
golang.org/x/net/http2.(*clientConnReadLoop).run(0xc0001bdf98)
        /go/pkg/mod/golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/transport.go:2134 +0x130
golang.org/x/net/http2.(*ClientConn).readLoop(0xc000025080)
        /go/pkg/mod/golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/transport.go:2030 +0x6f
created by golang.org/x/net/http2.(*Transport).newClientConn
        /go/pkg/mod/golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/transport.go:737 +0xaaa

goroutine 53 [syscall, 1336 minutes]:
syscall.Syscall6(0x55666d4937632b34?, 0x704b45626271542b?, 0x647355635a497271?, 0x64464a7536304d36?, 0x714678544e4a7268?, 0x3166557a46705937?, 0x5a73426752374f47?)
        /usr/local/go/src/syscall/syscall_linux.go:90 +0x36
os.(*Process).blockUntilWaitable(0xc000290450)
        /usr/local/go/src/os/wait_waitid.go:32 +0x87
os.(*Process).wait(0xc000290450)
        /usr/local/go/src/os/exec_unix.go:22 +0x28
os.(*Process).Wait(...)
        /usr/local/go/src/os/exec.go:132
os/exec.(*Cmd).Wait(0xc000195b80)
        /usr/local/go/src/os/exec/exec.go:599 +0x4b
main.(*Main).execCmd.func1()
        /src/litefs/cmd/litefs/main.go:402 +0x25
created by main.(*Main).execCmd
        /src/litefs/cmd/litefs/main.go:402 +0x311

goroutine 52 [select, 1336 minutes]:
os/exec.(*Cmd).watchCtx.func1()
        /usr/local/go/src/os/exec/exec.go:648 +0x96
created by os/exec.(*Cmd).watchCtx
        /usr/local/go/src/os/exec/exec.go:647 +0x8e

goroutine 26564 [IO wait]:
internal/poll.runtime_pollWait(0x7efbed902f48, 0x72)
        /usr/local/go/src/runtime/netpoll.go:305 +0x89
internal/poll.(*pollDesc).wait(0xc00016aa80?, 0xc0002ae000?, 0x0)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc00016aa80, {0xc0002ae000, 0x1300, 0x1300})
        /usr/local/go/src/internal/poll/fd_unix.go:167 +0x25a
net.(*netFD).Read(0xc00016aa80, {0xc0002ae000?, 0xc000068560?, 0xc0002ae1b9?})
        /usr/local/go/src/net/fd_posix.go:55 +0x29
net.(*conn).Read(0xc0000121b8, {0xc0002ae000?, 0x1299?, 0xc000068560?})
        /usr/local/go/src/net/net.go:183 +0x45
crypto/tls.(*atLeastReader).Read(0xc0001fa090, {0xc0002ae000?, 0x0?, 0xc000202860?})
        /usr/local/go/src/crypto/tls/conn.go:787 +0x3d
bytes.(*Buffer).ReadFrom(0xc0000a05f8, {0xb65aa0, 0xc0001fa090})
        /usr/local/go/src/bytes/buffer.go:202 +0x98
crypto/tls.(*Conn).readFromUntil(0xc0000a0380, {0xb66480?, 0xc0000121b8}, 0x114c?)
        /usr/local/go/src/crypto/tls/conn.go:809 +0xe5
crypto/tls.(*Conn).readRecordOrCCS(0xc0000a0380, 0x0)
        /usr/local/go/src/crypto/tls/conn.go:616 +0x116
crypto/tls.(*Conn).readRecord(...)
        /usr/local/go/src/crypto/tls/conn.go:582
crypto/tls.(*Conn).Read(0xc0000a0380, {0xc0002c8000, 0x1000, 0x6a0b60?})
        /usr/local/go/src/crypto/tls/conn.go:1287 +0x16f
bufio.(*Reader).Read(0xc0002c7080, {0xc0001842e0, 0x9, 0x6bda65?})
        /usr/local/go/src/bufio/bufio.go:237 +0x1bb
io.ReadAtLeast({0xb65920, 0xc0002c7080}, {0xc0001842e0, 0x9, 0x9}, 0x9)
        /usr/local/go/src/io/io.go:332 +0x9a
io.ReadFull(...)
        /usr/local/go/src/io/io.go:351
net/http.http2readFrameHeader({0xc0001842e0?, 0x9?, 0xc000492300?}, {0xb65920?, 0xc0002c7080?})
        /usr/local/go/src/net/http/h2_bundle.go:1565 +0x6e
net/http.(*http2Framer).ReadFrame(0xc0001842a0)
        /usr/local/go/src/net/http/h2_bundle.go:1829 +0x95
net/http.(*http2clientConnReadLoop).run(0xc000202f98)
        /usr/local/go/src/net/http/h2_bundle.go:8875 +0x130
net/http.(*http2ClientConn).readLoop(0xc000024900)
        /usr/local/go/src/net/http/h2_bundle.go:8771 +0x6f
created by net/http.(*http2Transport).newClientConn
        /usr/local/go/src/net/http/h2_bundle.go:7478 +0xaaa

On the primary, the connection to port 20202 is refused, even though the instances appear to be running and passing health checks:

App
  Name     = cool-stuff-api          
  Owner    = personal                
  Version  = 10                      
  Status   = running                 
  Hostname = cool-stuff-api.fly.dev  
  Platform = nomad                   

Instances
ID              PROCESS VERSION REGION  DESIRED STATUS  HEALTH CHECKS           RESTARTS        CREATED    
d968ff0c        app     10      yyz     run     running 1 total, 1 passing      0               22h20m ago
7831561d        app     10      syd     run     running 1 total, 1 passing      0               22h20m ago

On the primary instance, the litefs process is still alive:

# ps aux | grep litefs
root       528  0.0  3.6 724280  8224 ?        Sl   Nov09   0:14 litefs -- /code/entrypoint.sh uvicorn coolstuff.shell.api:app --port 80 --host 0.0.0.0

Just tried scaling down to 0 and back up to 2 (1 per region), and now it’s stuck in a restart loop due to a checksum verification failure:

 2022-11-10T21:41:23.127 runner[74279dd3] syd [info] Starting instance

2022-11-10T21:41:23.544 runner[74279dd3] syd [info] Configuring virtual machine

2022-11-10T21:41:23.546 runner[74279dd3] syd [info] Pulling container image

2022-11-10T21:41:24.507 runner[74279dd3] syd [info] Unpacking image

2022-11-10T21:41:24.517 runner[74279dd3] syd [info] Preparing kernel init

2022-11-10T21:41:24.656 runner[74279dd3] syd [info] Setting up volume 'api'

2022-11-10T21:41:24.659 runner[74279dd3] syd [info] Opening encrypted volume

2022-11-10T21:41:24.803 runner[74279dd3] syd [info] Configuring firecracker

2022-11-10T21:41:25.048 runner[74279dd3] syd [info] Starting virtual machine

2022-11-10T21:41:25.225 app[74279dd3] syd [info] Starting init (commit: 81d5330)...

2022-11-10T21:41:25.244 app[74279dd3] syd [info] Mounting /dev/vdc at /mnt/data w/ uid: 0, gid: 0 and chmod 0755

2022-11-10T21:41:25.248 app[74279dd3] syd [info] Preparing to run: `litefs -- /code/entrypoint.sh uvicorn coolstuff.shell.api:app --port 80 --host 0.0.0.0` as root

2022-11-10T21:41:25.264 app[74279dd3] syd [info] 2022/11/10 21:41:25 listening on [fdaa:0:b45d:a7b:2983:2:d03f:2]:22 (DNS: [fdaa::3]:53)

2022-11-10T21:41:25.283 app[74279dd3] syd [info] config file read from /etc/litefs.yml

2022-11-10T21:41:25.283 app[74279dd3] syd [info] LiteFS v0.3.0-beta1, commit=f40cf36930e1d0b259bbbdb72c3fd3a508ff1936

2022-11-10T21:41:25.283 app[74279dd3] syd [info] Using Consul to determine primary

2022-11-10T21:41:25.581 runner[77061cb0] yyz [info] Starting instance

2022-11-10T21:41:26.174 runner[77061cb0] yyz [info] Configuring virtual machine

2022-11-10T21:41:26.181 runner[77061cb0] yyz [info] Pulling container image

2022-11-10T21:41:26.460 runner[77061cb0] yyz [info] Unpacking image

2022-11-10T21:41:26.472 runner[77061cb0] yyz [info] Preparing kernel init

2022-11-10T21:41:26.693 runner[77061cb0] yyz [info] Setting up volume 'api'

2022-11-10T21:41:26.698 runner[77061cb0] yyz [info] Opening encrypted volume

2022-11-10T21:41:26.776 app[74279dd3] syd [info] initializing consul: key= url=https://:4ee340cf-0c2e-f4e1-be1b-d5acd18d0add@consul-iad.fly-shared.net/cool-stuff-api-jg589z352ggqkze7/ hostname=74279dd3 advertise-url=http://74279dd3.vm.cool-stuff-api.internal:20202

2022-11-10T21:41:26.783 app[74279dd3] syd [info] ERROR: cannot open store: open databases: open database("coolstuff"): verify database file: database checksum (eec31726781b11cc) does not match latest LTX checksum (9b5d67d38caaa1cd)

2022-11-10T21:41:26.894 runner[77061cb0] yyz [info] Configuring firecracker

2022-11-10T21:41:26.932 runner[77061cb0] yyz [info] Starting virtual machine

2022-11-10T21:41:27.117 app[77061cb0] yyz [info] Starting init (commit: 81d5330)...

2022-11-10T21:41:27.142 app[77061cb0] yyz [info] Mounting /dev/vdc at /mnt/data w/ uid: 0, gid: 0 and chmod 0755

2022-11-10T21:41:27.149 app[77061cb0] yyz [info] Preparing to run: `litefs -- /code/entrypoint.sh uvicorn coolstuff.shell.api:app --port 80 --host 0.0.0.0` as root

2022-11-10T21:41:27.174 app[77061cb0] yyz [info] 2022/11/10 21:41:27 listening on [fdaa:0:b45d:a7b:8a0f:2:d03d:2]:22 (DNS: [fdaa::3]:53)

2022-11-10T21:41:27.207 app[77061cb0] yyz [info] LiteFS v0.3.0-beta1, commit=f40cf36930e1d0b259bbbdb72c3fd3a508ff1936

2022-11-10T21:41:27.207 app[77061cb0] yyz [info] config file read from /etc/litefs.yml

2022-11-10T21:41:27.207 app[77061cb0] yyz [info] Using Consul to determine primary

2022-11-10T21:41:27.258 app[74279dd3] syd [info] Starting clean up.

2022-11-10T21:41:27.258 app[74279dd3] syd [info] Umounting /dev/vdc from /mnt/data

2022-11-10T21:41:29.330 app[77061cb0] yyz [info] initializing consul: key= url=https://:4ee340cf-0c2e-f4e1-be1b-d5acd18d0add@consul-iad.fly-shared.net/cool-stuff-api-jg589z352ggqkze7/ hostname=77061cb0 advertise-url=http://77061cb0.vm.cool-stuff-api.internal:20202

2022-11-10T21:41:29.340 app[77061cb0] yyz [info] ERROR: cannot open store: open databases: open database("coolstuff"): verify database file: database checksum (eec31726781b11cc) does not match latest LTX checksum (9b5d67d38caaa1cd)

2022-11-10T21:41:30.161 app[77061cb0] yyz [info] Starting clean up.

2022-11-10T21:41:30.162 app[77061cb0] yyz [info] Umounting /dev/vdc from /mnt/data 

Yeah, that’s the expected behavior. It’s just LiteFS denying writes against the replica.

Thanks for grabbing this on the replica. I didn’t see anything out of the ordinary with it so perhaps the issue is on the primary.

Hmm, it’s odd that you can’t hit the HTTP server on the primary. Another option is to send a SIGQUIT to the litefs process there; that will kill the node, but it will also dump out the stack traces.
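
For example, on the primary VM, using the PID from your `ps` output above:

# SIGQUIT makes the Go runtime dump stack traces for all goroutines before exiting
kill -QUIT 528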

Hmm, I thought this issue was fixed but someone else just reported the issue too. I’ll get a fix in to gracefully recover from it tomorrow.

One more question: is this a Python app that you’re running? I saw uvicorn and assumed so. Nothing wrong with Python, but I’m trying to narrow down whether there are issues with any specific clients.

Thanks!

> Hmm, I thought this issue was fixed but someone else just reported the issue too. I’ll get a fix in to gracefully recover from it tomorrow.

Sounds good. Does that mean it’ll be able to recover the data? One thing I’m a little confused about is how to recover the database file in the event that LiteFS fails to start. I assumed I would be able to mount the volume and find the file, but that doesn’t seem to be the case. I tried restarting the app without LiteFS, but when my app accessed the file it was empty, so it ran the initial migrations as if it were a clean file.

> One more question: is this a Python app that you’re running? I saw uvicorn and assumed so. Nothing wrong with Python, but I’m trying to narrow down whether there are issues with any specific clients.

Yep! FastAPI with SQLAlchemy and the pysqlite3 driver (not the async driver).

> Does that mean it’ll be able to recover the data? One thing I’m a little confused about is how to recover the database file in the event that LiteFS fails to start. I assumed I would be able to mount the volume and find the file, but that doesn’t seem to be the case. I tried restarting the app without LiteFS, but when my app accessed the file it was empty, so it ran the initial migrations as if it were a clean file.

Ignore this! I realized I was looking in the mount directory provided by LiteFS (/data), not the volume mount itself (/mnt/data). I see the database file in there now.
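
(In case it helps anyone else, the distinction in my config is:)

# /data is the LiteFS FUSE mount (mount-dir); databases only appear here while litefs is running
ls /data
# /mnt/data is the underlying Fly volume (data-dir), where LiteFS keeps its internal state,
# including the database copy and the LTX transaction files
ls /mnt/data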

@bencdavis Someone else had a similar issue and I was able to get some fixes in for it. Can you give the new 0.3.0-beta3 release a try?

Or, from Docker it’s: flyio/litefs:0.3.0-beta3

@benbjohnson2 Just tried starting a single instance with beta3 and it’s still failing to start with the checksum verification error:

2022-11-11T22:51:54Z   [info]Starting init (commit: 81d5330)...
2022-11-11T22:51:54Z   [info]Mounting /dev/vdc at /mnt/data w/ uid: 0, gid: 0 and chmod 0755
2022-11-11T22:51:54Z   [info]Preparing to run: `litefs -- /code/entrypoint.sh uvicorn coolstuff.shell.api:app --port 80 --host 0.0.0.0` as root
2022-11-11T22:51:54Z   [info]2022/11/11 22:51:54 listening on [fdaa:0:b45d:a7b:8a0f:2:d03d:2]:22 (DNS: [fdaa::3]:53)
2022-11-11T22:51:54Z   [info]LiteFS v0.3.0-beta3, commit=fabf62dacbc590ed94d79493a3691deb19b153bd
2022-11-11T22:51:54Z   [info]config file read from /etc/litefs.yml
2022-11-11T22:51:54Z   [info]Using Consul to determine primary
2022-11-11T22:51:55Z   [info]initializing consul: key= url=https://:4ee340cf-0c2e-f4e1-be1b-d5acd18d0add@consul-iad.fly-shared.net/cool-stuff-api-jg589z352ggqkze7/ hostname=dad099c9 advertise-url=http://dad099c9.vm.cool-stuff-api.internal:20202
2022-11-11T22:51:56Z   [info]ERROR: cannot open store: open databases: open database("coolstuff"): verify database file: database checksum (eec31726781b11cc) does not match latest LTX checksum (9b5d67d38caaa1cd)
2022-11-11T22:51:56Z   [info]Starting clean up.
2022-11-11T22:51:56Z   [info]Umounting /dev/vdc from /mnt/data

Is it expected that this would be fixed, or is it now in a fundamentally inconsistent state due to the disconnection issues? I’m assuming I could manually edit the LTX file so that it matches the current state of the DB. My project is still pre-release, so it’s not crucial to maintain 100% consistency.

@bencdavis If you have a persistent volume, you’ll need to drop and recreate it. The checksum was persisted to the data directory so it’ll reappear on startup until you clear out the data.
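
If it helps, with flyctl that’s roughly the following, per region (volume name taken from your logs; exact subcommand names and flags may vary by flyctl version):

fly volumes list --app cool-stuff-api
fly volumes destroy <volume-id> --app cool-stuff-api
# recreate with the same name and region; adjust --size to match the original volume
fly volumes create api --region syd --size 1 --app cool-stuff-api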

You could manually edit the checksum in the LTX file; however, the LTX file itself has a checksum for its contents, so you’d need to update that as well. 🙂

We could recover by restarting the LTX files from a snapshot; however, these issues generally shouldn’t happen, so it’s better to figure out the root cause than to automatically recover and ignore them. Thanks for digging into this and trying it again.