LiteFS replica disconnection issues

I’ve recently deployed an app using LiteFS in two regions: the primary in Toronto and a replica in Sydney. The initial deployment works great, with replication behaving as expected. However, after some amount of time (hours, or maybe a day), replication begins to fail with the following errors:

 2022-11-10T20:19:21.529 app[7831561d] syd [info] replica disconnected, retrying: connect to primary: invalid response: code=503 ('http://d968ff0c.vm.cool-stuff-api.internal:20202')

2022-11-10T20:19:22.742 app[7831561d] syd [info] existing primary found (d968ff0c), connecting as replica

2022-11-10T20:19:22.845 app[d968ff0c] yyz [info] http: error: context canceled

2022-11-10T20:19:22.948 app[7831561d] syd [info] replica disconnected, retrying: connect to primary: invalid response: code=503 ('http://d968ff0c.vm.cool-stuff-api.internal:20202')

2022-11-10T20:19:24.159 app[7831561d] syd [info] existing primary found (d968ff0c), connecting as replica

2022-11-10T20:19:24.262 app[d968ff0c] yyz [info] http: error: context canceled

2022-11-10T20:19:24.366 app[7831561d] syd [info] replica disconnected, retrying: connect to primary: invalid response: code=503 ('http://d968ff0c.vm.cool-stuff-api.internal:20202')

2022-11-10T20:19:25.596 app[7831561d] syd [info] existing primary found (d968ff0c), connecting as replica

2022-11-10T20:19:25.702 app[d968ff0c] yyz [info] http: error: context canceled

2022-11-10T20:19:25.806 app[7831561d] syd [info] replica disconnected, retrying: connect to primary: invalid response: code=503 ('http://d968ff0c.vm.cool-stuff-api.internal:20202')

2022-11-10T20:19:27.015 app[7831561d] syd [info] existing primary found (d968ff0c), connecting as replica

2022-11-10T20:19:27.118 app[d968ff0c] yyz [info] http: error: context canceled

2022-11-10T20:19:27.222 app[7831561d] syd [info] replica disconnected, retrying: connect to primary: invalid response: code=503 ('http://d968ff0c.vm.cool-stuff-api.internal:20202')

2022-11-10T20:19:28.435 app[7831561d] syd [info] existing primary found (d968ff0c), connecting as replica

2022-11-10T20:19:28.539 app[d968ff0c] yyz [info] http: error: context canceled 

It doesn’t seem to recover from the issue on its own; it’s only resolved when I scale down to 0 and then back up to 2.

Here’s my configuration.

litefs.yml:

# The path to where the SQLite database will be accessed.
mount-dir: "/data"

# The path to where the underlying volume mount is.
data-dir: "/mnt/data"

# These environment variables will be available in your Fly.io application.
# You must specify "experimental.enable_consul" for FLY_CONSUL_URL to be available.
consul:
  url: "${FLY_CONSUL_URL}"
  advertise-url: "http://${HOSTNAME}.vm.${FLY_APP_NAME}.internal:20202"

Dockerfile (abbreviated)

# Fetch the LiteFS binary using a multi-stage build.
FROM flyio/litefs:0.3.0-beta2 AS litefs

...

# Copy litefs
COPY --from=litefs /usr/local/bin/litefs /usr/local/bin/litefs

# Copy litefs config
COPY ./litefs.yml /etc/litefs.yml

RUN mkdir -p /data /mnt/data

...

# entrypoint.sh runs migrations and then runs the command via `exec "$@"`
ENTRYPOINT ["litefs", "--", "/code/entrypoint.sh"]
CMD ["uvicorn", "coolstuff.shell.api:app", "--port", "80", "--host", "0.0.0.0"]

My app opens the SQLite database at /data/app_db and is currently not using WAL mode.

Any help would be appreciated!

Thanks for reporting the issue, @bencdavis. It looks like the only spot in LiteFS that returns a 503 is here:

So it looks like either the server’s context is canceled or the context from the request is canceled. What log messages do you see before this occurs?

Also, can you try hitting the `GET /debug/pprof/goroutine?debug=2` endpoint on both instances’ port 20202 when this occurs? It’ll dump out a bunch of stack traces, but that might help narrow down the cause of the context cancellation.
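For example, from a shell on the VMs (e.g. via fly ssh console), something along these lines should work against each instance’s internal hostname (the primary’s URL is already in your logs):

curl "http://d968ff0c.vm.cool-stuff-api.internal:20202/debug/pprof/goroutine?debug=2"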

Thanks @benbjohnson for the quick response!

Unfortunately the logs are truncated and don’t go far enough back to show what happened before the disconnection. One thing I did notice is that I have a bug in my app that causes it to occasionally attempt to write to the local database on the replica, which throws a (sqlite3.OperationalError) attempt to write a readonly database error from the driver. Presumably that doesn’t affect replication though.

I was only able to get the output of /debug/pprof/goroutine on the replica:

Sydney (replica):

goroutine 34116 [running]:
runtime/pprof.writeGoroutineStacks({0xb66740, 0xc0001840e0})
        /usr/local/go/src/runtime/pprof/pprof.go:692 +0x70
runtime/pprof.writeGoroutine({0xb66740?, 0xc0001840e0?}, 0x0?)
        /usr/local/go/src/runtime/pprof/pprof.go:681 +0x2b
runtime/pprof.(*Profile).WriteTo(0xa2abc0?, {0xb66740?, 0xc0001840e0?}, 0xc?)
        /usr/local/go/src/runtime/pprof/pprof.go:330 +0x14b
net/http/pprof.handler.ServeHTTP({0xc00002e911, 0x9}, {0xb6a690, 0xc0001840e0}, 0xc00009806f?)
        /usr/local/go/src/net/http/pprof/pprof.go:253 +0x4a5
net/http/pprof.Index({0xb6a690?, 0xc0001840e0}, 0xc000130600)
        /usr/local/go/src/net/http/pprof/pprof.go:371 +0x13e
github.com/superfly/litefs/http.(*Server).serveHTTP(0xc000162e60?, {0xb6a690, 0xc0001840e0}, 0xc000130600)
        /src/litefs/http/server.go:128 +0x186
net/http.HandlerFunc.ServeHTTP(0xc0002009b8?, {0xb6a690?, 0xc0001840e0?}, 0xc000200990?)
        /usr/local/go/src/net/http/server.go:2109 +0x2f
golang.org/x/net/http2/h2c.h2cHandler.ServeHTTP({{0xb67280?, 0xc0001509a0?}, 0xc00008b980?}, {0xb6a690, 0xc0001840e0}, 0xc000130600)
        /go/pkg/mod/golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/h2c/h2c.go:113 +0x49f
net/http.serverHandler.ServeHTTP({0xc000492600?}, {0xb6a690, 0xc0001840e0}, 0xc000130600)
        /usr/local/go/src/net/http/server.go:2947 +0x30c
net/http.(*conn).serve(0xc0001628c0, {0xb6ae18, 0xc0001d1290})
        /usr/local/go/src/net/http/server.go:1991 +0x607
created by net/http.(*Server).Serve
        /usr/local/go/src/net/http/server.go:3102 +0x4db

goroutine 1 [select, 1336 minutes]:
main.main()
        /src/litefs/cmd/litefs/main.go:79 +0x557

goroutine 23 [syscall, 1336 minutes]:
os/signal.signal_recv()
        /usr/local/go/src/runtime/sigqueue.go:152 +0x2f
os/signal.loop()
        /usr/local/go/src/os/signal/signal_unix.go:23 +0x19
created by os/signal.Notify.func1.1
        /usr/local/go/src/os/signal/signal.go:151 +0x2a

goroutine 38 [sleep]:
time.Sleep(0x3b9aca00)
        /usr/local/go/src/runtime/time.go:195 +0x135
github.com/superfly/litefs.(*Store).monitorLease(0xc000130500, {0xb6ad70, 0xc00008b8c0})
        /src/litefs/store.go:462 +0x2c8
github.com/superfly/litefs.(*Store).Open.func1()
        /src/litefs/store.go:134 +0x25
golang.org/x/sync/errgroup.(*Group).Go.func1()
        /go/pkg/mod/golang.org/x/sync@v0.0.0-20220601150217-0de741cfad7f/errgroup/errgroup.go:75 +0x64
created by golang.org/x/sync/errgroup.(*Group).Go
        /go/pkg/mod/golang.org/x/sync@v0.0.0-20220601150217-0de741cfad7f/errgroup/errgroup.go:72 +0xa5

goroutine 39 [select, 1 minutes]:
github.com/superfly/litefs.(*Store).monitorRetention(0xc000130500, {0xb6ad70, 0xc00008b8c0})
        /src/litefs/store.go:607 +0xe7
github.com/superfly/litefs.(*Store).Open.func2()
        /src/litefs/store.go:138 +0x25
golang.org/x/sync/errgroup.(*Group).Go.func1()
        /go/pkg/mod/golang.org/x/sync@v0.0.0-20220601150217-0de741cfad7f/errgroup/errgroup.go:75 +0x64
created by golang.org/x/sync/errgroup.(*Group).Go
        /go/pkg/mod/golang.org/x/sync@v0.0.0-20220601150217-0de741cfad7f/errgroup/errgroup.go:72 +0xa5

goroutine 34117 [runnable]:
net/http.(*connReader).startBackgroundRead.func2()
        /usr/local/go/src/net/http/server.go:674
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1594 +0x1
created by net/http.(*connReader).startBackgroundRead
        /usr/local/go/src/net/http/server.go:674 +0xca

goroutine 44 [syscall, 21 minutes]:
syscall.Syscall(0x4780ee?, 0xf03940?, 0x40dc07?, 0x9f5880?)
        /usr/local/go/src/syscall/syscall_linux.go:68 +0x27
syscall.read(0xf03940?, {0xc000494000?, 0x11?, 0x73586c6f67666e00?})
        /usr/local/go/src/syscall/zsyscall_linux_amd64.go:696 +0x45
syscall.Read(...)
        /usr/local/go/src/syscall/syscall_unix.go:183
bazil.org/fuse.(*Conn).ReadRequest(0xc0001f2840)
        /go/pkg/mod/bazil.org/fuse@v0.0.0-20200524192727-fb710f7dfd05/fuse.go:580 +0xc6
bazil.org/fuse/fs.(*Server).Serve(0xc000185dc0, {0xb65d80?, 0xc0001f25a0})
        /go/pkg/mod/bazil.org/fuse@v0.0.0-20200524192727-fb710f7dfd05/fs/serve.go:501 +0x3ef
github.com/superfly/litefs/fuse.(*FileSystem).Mount.func2()
        /src/litefs/fuse/file_system.go:93 +0x28
created by github.com/superfly/litefs/fuse.(*FileSystem).Mount
        /src/litefs/fuse/file_system.go:92 +0x31c

goroutine 45 [IO wait]:
internal/poll.runtime_pollWait(0x7efbed903038, 0x72)
        /usr/local/go/src/runtime/netpoll.go:305 +0x89
internal/poll.(*pollDesc).wait(0xc00016a180?, 0x6?, 0x0)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc00016a180)
        /usr/local/go/src/internal/poll/fd_unix.go:614 +0x234
net.(*netFD).accept(0xc00016a180)
        /usr/local/go/src/net/fd_unix.go:172 +0x35
net.(*TCPListener).accept(0xc000011608)
        /usr/local/go/src/net/tcpsock_posix.go:142 +0x28
net.(*TCPListener).Accept(0xc000011608)
        /usr/local/go/src/net/tcpsock.go:288 +0x3d
net/http.(*Server).Serve(0xc0000fe1e0, {0xb6a480, 0xc000011608})
        /usr/local/go/src/net/http/server.go:3070 +0x385
github.com/superfly/litefs/http.(*Server).Serve.func1()
        /src/litefs/http/server.go:74 +0x2d
golang.org/x/sync/errgroup.(*Group).Go.func1()
        /go/pkg/mod/golang.org/x/sync@v0.0.0-20220601150217-0de741cfad7f/errgroup/errgroup.go:75 +0x64
created by golang.org/x/sync/errgroup.(*Group).Go
        /go/pkg/mod/golang.org/x/sync@v0.0.0-20220601150217-0de741cfad7f/errgroup/errgroup.go:72 +0xa5

goroutine 50 [IO wait]:
internal/poll.runtime_pollWait(0x7efbed902c78, 0x72)
        /usr/local/go/src/runtime/netpoll.go:305 +0x89
internal/poll.(*pollDesc).wait(0xc00016b080?, 0xc00024b000?, 0x0)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc00016b080, {0xc00024b000, 0x1000, 0x1000})
        /usr/local/go/src/internal/poll/fd_unix.go:167 +0x25a
net.(*netFD).Read(0xc00016b080, {0xc00024b000?, 0x0?, 0x400?})
        /usr/local/go/src/net/fd_posix.go:55 +0x29
net.(*conn).Read(0xc0000123b0, {0xc00024b000?, 0x1000000406730?, 0x4ed000801?})
        /usr/local/go/src/net/net.go:183 +0x45
bufio.(*Reader).Read(0xc0001f2ea0, {0xc000185ee0, 0x9, 0x8c6fa5?})
        /usr/local/go/src/bufio/bufio.go:237 +0x1bb
io.ReadAtLeast({0xb65920, 0xc0001f2ea0}, {0xc000185ee0, 0x9, 0x9}, 0x9)
        /usr/local/go/src/io/io.go:332 +0x9a
io.ReadFull(...)
        /usr/local/go/src/io/io.go:351
golang.org/x/net/http2.readFrameHeader({0xc000185ee0?, 0x9?, 0x0?}, {0xb65920?, 0xc0001f2ea0?})
        /go/pkg/mod/golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/frame.go:237 +0x6e
golang.org/x/net/http2.(*Framer).ReadFrame(0xc000185ea0)
        /go/pkg/mod/golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/frame.go:498 +0x95
golang.org/x/net/http2.(*clientConnReadLoop).run(0xc0001bdf98)
        /go/pkg/mod/golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/transport.go:2134 +0x130
golang.org/x/net/http2.(*ClientConn).readLoop(0xc000025080)
        /go/pkg/mod/golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/transport.go:2030 +0x6f
created by golang.org/x/net/http2.(*Transport).newClientConn
        /go/pkg/mod/golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/transport.go:737 +0xaaa

goroutine 53 [syscall, 1336 minutes]:
syscall.Syscall6(0x55666d4937632b34?, 0x704b45626271542b?, 0x647355635a497271?, 0x64464a7536304d36?, 0x714678544e4a7268?, 0x3166557a46705937?, 0x5a73426752374f47?)
        /usr/local/go/src/syscall/syscall_linux.go:90 +0x36
os.(*Process).blockUntilWaitable(0xc000290450)
        /usr/local/go/src/os/wait_waitid.go:32 +0x87
os.(*Process).wait(0xc000290450)
        /usr/local/go/src/os/exec_unix.go:22 +0x28
os.(*Process).Wait(...)
        /usr/local/go/src/os/exec.go:132
os/exec.(*Cmd).Wait(0xc000195b80)
        /usr/local/go/src/os/exec/exec.go:599 +0x4b
main.(*Main).execCmd.func1()
        /src/litefs/cmd/litefs/main.go:402 +0x25
created by main.(*Main).execCmd
        /src/litefs/cmd/litefs/main.go:402 +0x311

goroutine 52 [select, 1336 minutes]:
os/exec.(*Cmd).watchCtx.func1()
        /usr/local/go/src/os/exec/exec.go:648 +0x96
created by os/exec.(*Cmd).watchCtx
        /usr/local/go/src/os/exec/exec.go:647 +0x8e

goroutine 26564 [IO wait]:
internal/poll.runtime_pollWait(0x7efbed902f48, 0x72)
        /usr/local/go/src/runtime/netpoll.go:305 +0x89
internal/poll.(*pollDesc).wait(0xc00016aa80?, 0xc0002ae000?, 0x0)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc00016aa80, {0xc0002ae000, 0x1300, 0x1300})
        /usr/local/go/src/internal/poll/fd_unix.go:167 +0x25a
net.(*netFD).Read(0xc00016aa80, {0xc0002ae000?, 0xc000068560?, 0xc0002ae1b9?})
        /usr/local/go/src/net/fd_posix.go:55 +0x29
net.(*conn).Read(0xc0000121b8, {0xc0002ae000?, 0x1299?, 0xc000068560?})
        /usr/local/go/src/net/net.go:183 +0x45
crypto/tls.(*atLeastReader).Read(0xc0001fa090, {0xc0002ae000?, 0x0?, 0xc000202860?})
        /usr/local/go/src/crypto/tls/conn.go:787 +0x3d
bytes.(*Buffer).ReadFrom(0xc0000a05f8, {0xb65aa0, 0xc0001fa090})
        /usr/local/go/src/bytes/buffer.go:202 +0x98
crypto/tls.(*Conn).readFromUntil(0xc0000a0380, {0xb66480?, 0xc0000121b8}, 0x114c?)
        /usr/local/go/src/crypto/tls/conn.go:809 +0xe5
crypto/tls.(*Conn).readRecordOrCCS(0xc0000a0380, 0x0)
        /usr/local/go/src/crypto/tls/conn.go:616 +0x116
crypto/tls.(*Conn).readRecord(...)
        /usr/local/go/src/crypto/tls/conn.go:582
crypto/tls.(*Conn).Read(0xc0000a0380, {0xc0002c8000, 0x1000, 0x6a0b60?})
        /usr/local/go/src/crypto/tls/conn.go:1287 +0x16f
bufio.(*Reader).Read(0xc0002c7080, {0xc0001842e0, 0x9, 0x6bda65?})
        /usr/local/go/src/bufio/bufio.go:237 +0x1bb
io.ReadAtLeast({0xb65920, 0xc0002c7080}, {0xc0001842e0, 0x9, 0x9}, 0x9)
        /usr/local/go/src/io/io.go:332 +0x9a
io.ReadFull(...)
        /usr/local/go/src/io/io.go:351
net/http.http2readFrameHeader({0xc0001842e0?, 0x9?, 0xc000492300?}, {0xb65920?, 0xc0002c7080?})
        /usr/local/go/src/net/http/h2_bundle.go:1565 +0x6e
net/http.(*http2Framer).ReadFrame(0xc0001842a0)
        /usr/local/go/src/net/http/h2_bundle.go:1829 +0x95
net/http.(*http2clientConnReadLoop).run(0xc000202f98)
        /usr/local/go/src/net/http/h2_bundle.go:8875 +0x130
net/http.(*http2ClientConn).readLoop(0xc000024900)
        /usr/local/go/src/net/http/h2_bundle.go:8771 +0x6f
created by net/http.(*http2Transport).newClientConn
        /usr/local/go/src/net/http/h2_bundle.go:7478 +0xaaa

On the primary, the connection is refused, even though the instances seem to be running and passing health checks:

App
  Name     = cool-stuff-api          
  Owner    = personal                
  Version  = 10                      
  Status   = running                 
  Hostname = cool-stuff-api.fly.dev  
  Platform = nomad                   

Instances
ID              PROCESS VERSION REGION  DESIRED STATUS  HEALTH CHECKS           RESTARTS        CREATED    
d968ff0c        app     10      yyz     run     running 1 total, 1 passing      0               22h20m ago
7831561d        app     10      syd     run     running 1 total, 1 passing      0               22h20m ago

On the primary instance, the litefs process is still alive:

# ps aux | grep litefs
root       528  0.0  3.6 724280  8224 ?        Sl   Nov09   0:14 litefs -- /code/entrypoint.sh uvicorn coolstuff.shell.api:app --port 80 --host 0.0.0.0

Just tried scaling down to 0 and then back up to 2 (1 per region), and it’s now stuck in a restart loop due to a checksum verification failure:

 2022-11-10T21:41:23.127 runner[74279dd3] syd [info] Starting instance

2022-11-10T21:41:23.544 runner[74279dd3] syd [info] Configuring virtual machine

2022-11-10T21:41:23.546 runner[74279dd3] syd [info] Pulling container image

2022-11-10T21:41:24.507 runner[74279dd3] syd [info] Unpacking image

2022-11-10T21:41:24.517 runner[74279dd3] syd [info] Preparing kernel init

2022-11-10T21:41:24.656 runner[74279dd3] syd [info] Setting up volume 'api'

2022-11-10T21:41:24.659 runner[74279dd3] syd [info] Opening encrypted volume

2022-11-10T21:41:24.803 runner[74279dd3] syd [info] Configuring firecracker

2022-11-10T21:41:25.048 runner[74279dd3] syd [info] Starting virtual machine

2022-11-10T21:41:25.225 app[74279dd3] syd [info] Starting init (commit: 81d5330)...

2022-11-10T21:41:25.244 app[74279dd3] syd [info] Mounting /dev/vdc at /mnt/data w/ uid: 0, gid: 0 and chmod 0755

2022-11-10T21:41:25.248 app[74279dd3] syd [info] Preparing to run: `litefs -- /code/entrypoint.sh uvicorn coolstuff.shell.api:app --port 80 --host 0.0.0.0` as root

2022-11-10T21:41:25.264 app[74279dd3] syd [info] 2022/11/10 21:41:25 listening on [fdaa:0:b45d:a7b:2983:2:d03f:2]:22 (DNS: [fdaa::3]:53)

2022-11-10T21:41:25.283 app[74279dd3] syd [info] config file read from /etc/litefs.yml

2022-11-10T21:41:25.283 app[74279dd3] syd [info] LiteFS v0.3.0-beta1, commit=f40cf36930e1d0b259bbbdb72c3fd3a508ff1936

2022-11-10T21:41:25.283 app[74279dd3] syd [info] Using Consul to determine primary

2022-11-10T21:41:25.581 runner[77061cb0] yyz [info] Starting instance

2022-11-10T21:41:26.174 runner[77061cb0] yyz [info] Configuring virtual machine

2022-11-10T21:41:26.181 runner[77061cb0] yyz [info] Pulling container image

2022-11-10T21:41:26.460 runner[77061cb0] yyz [info] Unpacking image

2022-11-10T21:41:26.472 runner[77061cb0] yyz [info] Preparing kernel init

2022-11-10T21:41:26.693 runner[77061cb0] yyz [info] Setting up volume 'api'

2022-11-10T21:41:26.698 runner[77061cb0] yyz [info] Opening encrypted volume

2022-11-10T21:41:26.776 app[74279dd3] syd [info] initializing consul: key= url=https://:4ee340cf-0c2e-f4e1-be1b-d5acd18d0add@consul-iad.fly-shared.net/cool-stuff-api-jg589z352ggqkze7/ hostname=74279dd3 advertise-url=http://74279dd3.vm.cool-stuff-api.internal:20202

2022-11-10T21:41:26.783 app[74279dd3] syd [info] ERROR: cannot open store: open databases: open database("coolstuff"): verify database file: database checksum (eec31726781b11cc) does not match latest LTX checksum (9b5d67d38caaa1cd)

2022-11-10T21:41:26.894 runner[77061cb0] yyz [info] Configuring firecracker

2022-11-10T21:41:26.932 runner[77061cb0] yyz [info] Starting virtual machine

2022-11-10T21:41:27.117 app[77061cb0] yyz [info] Starting init (commit: 81d5330)...

2022-11-10T21:41:27.142 app[77061cb0] yyz [info] Mounting /dev/vdc at /mnt/data w/ uid: 0, gid: 0 and chmod 0755

2022-11-10T21:41:27.149 app[77061cb0] yyz [info] Preparing to run: `litefs -- /code/entrypoint.sh uvicorn coolstuff.shell.api:app --port 80 --host 0.0.0.0` as root

2022-11-10T21:41:27.174 app[77061cb0] yyz [info] 2022/11/10 21:41:27 listening on [fdaa:0:b45d:a7b:8a0f:2:d03d:2]:22 (DNS: [fdaa::3]:53)

2022-11-10T21:41:27.207 app[77061cb0] yyz [info] LiteFS v0.3.0-beta1, commit=f40cf36930e1d0b259bbbdb72c3fd3a508ff1936

2022-11-10T21:41:27.207 app[77061cb0] yyz [info] config file read from /etc/litefs.yml

2022-11-10T21:41:27.207 app[77061cb0] yyz [info] Using Consul to determine primary

2022-11-10T21:41:27.258 app[74279dd3] syd [info] Starting clean up.

2022-11-10T21:41:27.258 app[74279dd3] syd [info] Umounting /dev/vdc from /mnt/data

2022-11-10T21:41:29.330 app[77061cb0] yyz [info] initializing consul: key= url=https://:4ee340cf-0c2e-f4e1-be1b-d5acd18d0add@consul-iad.fly-shared.net/cool-stuff-api-jg589z352ggqkze7/ hostname=77061cb0 advertise-url=http://77061cb0.vm.cool-stuff-api.internal:20202

2022-11-10T21:41:29.340 app[77061cb0] yyz [info] ERROR: cannot open store: open databases: open database("coolstuff"): verify database file: database checksum (eec31726781b11cc) does not match latest LTX checksum (9b5d67d38caaa1cd)

2022-11-10T21:41:30.161 app[77061cb0] yyz [info] Starting clean up.

2022-11-10T21:41:30.162 app[77061cb0] yyz [info] Umounting /dev/vdc from /mnt/data 

Yeah, that readonly-database error is the expected behavior. It’s just LiteFS denying writes against the replica.

Thanks for grabbing this on the replica. I didn’t see anything out of the ordinary with it so perhaps the issue is on the primary.

Hmm, that’s odd that you can’t hit the HTTP server on the primary. Another option is to send a SIGQUIT to the litefs process there; that will kill the node, but it will also dump out a stack trace.
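For example, using the litefs PID from the ps output above (528 in your case):

kill -QUIT 528

The stack dump should show up in the instance’s log output as the process exits.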

Hmm, I thought this issue was fixed but someone else just reported the issue too. I’ll get a fix in to gracefully recover from it tomorrow.

One more question: is this a Python app that you’re running? I saw uvicorn and assumed so. Nothing wrong with Python, but I’m trying to narrow down whether there are issues with any specific clients.

Thanks!

Hmm, I thought this issue was fixed but someone else just reported the issue too. I’ll get a fix in to gracefully recover from it tomorrow.

Sounds good. Does that mean it’ll be able to recover data? One thing I’m a little confused on is how to recover the database file in the event that litefs fails to start. I assumed that I would be able to mount the volume and find the file, but that doesn’t seem to be the case. I tried restarting the app without litefs, but when my app accessed the file, it was empty and so it ran the initial migrations as if it was a clean file.

One more question: is this a Python app that you’re running? I saw uvicorn and assumed so. Nothing wrong with Python, but I’m trying to narrow down whether there are issues with any specific clients.

Yep! FastAPI with SQLAlchemy and the pysqlite3 driver (not the async driver).

Does that mean it’ll be able to recover data? One thing I’m a little confused on is how to recover the database file in the event that litefs fails to start. I assumed that I would be able to mount the volume and find the file, but that doesn’t seem to be the case. I tried restarting the app without litefs, but when my app accessed the file, it was empty and so it ran the initial migrations as if it was a clean file.

Ignore this! I realized I was looking in the data directory provided by litefs, not the volume mount itself. I see the database file in there now.

@bencdavis Someone else had a similar issue and I was able to get some fixes in for it. Can you give the new 0.3.0-beta3 release a try?

Or, from Docker it’s: flyio/litefs:0.3.0-beta3

@benbjohnson Just tried starting a single instance with beta3 and it’s still failing to start with the checksum verification error:

2022-11-11T22:51:54Z   [info]Starting init (commit: 81d5330)...
2022-11-11T22:51:54Z   [info]Mounting /dev/vdc at /mnt/data w/ uid: 0, gid: 0 and chmod 0755
2022-11-11T22:51:54Z   [info]Preparing to run: `litefs -- /code/entrypoint.sh uvicorn coolstuff.shell.api:app --port 80 --host 0.0.0.0` as root
2022-11-11T22:51:54Z   [info]2022/11/11 22:51:54 listening on [fdaa:0:b45d:a7b:8a0f:2:d03d:2]:22 (DNS: [fdaa::3]:53)
2022-11-11T22:51:54Z   [info]LiteFS v0.3.0-beta3, commit=fabf62dacbc590ed94d79493a3691deb19b153bd
2022-11-11T22:51:54Z   [info]config file read from /etc/litefs.yml
2022-11-11T22:51:54Z   [info]Using Consul to determine primary
2022-11-11T22:51:55Z   [info]initializing consul: key= url=https://:4ee340cf-0c2e-f4e1-be1b-d5acd18d0add@consul-iad.fly-shared.net/cool-stuff-api-jg589z352ggqkze7/ hostname=dad099c9 advertise-url=http://dad099c9.vm.cool-stuff-api.internal:20202
2022-11-11T22:51:56Z   [info]ERROR: cannot open store: open databases: open database("coolstuff"): verify database file: database checksum (eec31726781b11cc) does not match latest LTX checksum (9b5d67d38caaa1cd)
2022-11-11T22:51:56Z   [info]Starting clean up.
2022-11-11T22:51:56Z   [info]Umounting /dev/vdc from /mnt/data

Is it expected that this would be fixed, or is it now in a fundamentally inconsistent state due to the disconnection issues? I’m assuming I could manually edit the LTX file so that it matches the current state of the DB. My project is still pre-release, so it’s not crucial to maintain 100% consistency.

@bencdavis If you have a persistent volume, you’ll need to drop and recreate it. The checksum was persisted to the data directory so it’ll reappear on startup until you clear out the data.
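For reference, recreating it with flyctl looks roughly like this (the volume is named 'api' per your logs; the volume ID and size will differ):

fly volumes list -a cool-stuff-api
fly volumes destroy <volume-id>
fly volumes create api --region yyz --size 1 -a cool-stuff-api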

You could manually edit the LTX file’s checksum; however, the LTX file itself has a checksum for its content, so you’d need to update that as well. :slight_smile:

We could recover by restarting the LTX files based on a snapshot; however, these issues generally shouldn’t happen, so it’s better to figure out the root cause rather than automatically recovering and ignoring them. Thanks for digging into this and trying it again.

Hey @benbjohnson! I had to focus on other things for the past couple of weeks, so I wasn’t able to give this any time, but I’m back on it now.

I just upgraded to beta7, recreated the volume, used litefs import to import the existing SQLite file (thanks for that command, btw!), and brought up the app with a single instance in one region.
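For reference, the import was something along these lines, run on the node while LiteFS was mounted (the backup path is just a placeholder, and exact flags may differ between LiteFS versions):

litefs import -name coolstuff /path/to/backup.db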

I was then making some changes to my app code and deployed a couple of times. The first redeploy after initializing litefs worked fine, but the second hit the same LTX checksum failure.

Here’s the entire log, including the first run that created the database in litefs, the EOF (which I’m guessing was a result of me running litefs import), a bunch of requests I made to my app (including an error that was my fault), and then a bunch of failed app starts due to the checksum issue:

 2022-12-16T21:32:30.529 runner[a9faea67] yyz [info] Starting instance

2022-12-16T21:32:31.682 runner[a9faea67] yyz [info] Configuring virtual machine

2022-12-16T21:32:31.683 runner[a9faea67] yyz [info] Pulling container image

2022-12-16T21:32:32.035 runner[a9faea67] yyz [info] Unpacking image

2022-12-16T21:32:32.045 runner[a9faea67] yyz [info] Preparing kernel init

2022-12-16T21:32:32.194 runner[a9faea67] yyz [info] Setting up volume 'api'

2022-12-16T21:32:32.198 runner[a9faea67] yyz [info] Opening encrypted volume

2022-12-16T21:32:32.368 runner[a9faea67] yyz [info] Configuring firecracker

2022-12-16T21:32:32.396 runner[a9faea67] yyz [info] Starting virtual machine

2022-12-16T21:32:32.624 app[a9faea67] yyz [info] Starting init (commit: f447594)...

2022-12-16T21:32:32.643 app[a9faea67] yyz [info] Mounting /dev/vdc at /mnt/data w/ uid: 0, gid: 0 and chmod 0755

2022-12-16T21:32:32.648 app[a9faea67] yyz [info] Preparing to run: `litefs mount -- /code/entrypoint.sh uvicorn coolstuff.shell.api:app --port 80 --host 0.0.0.0` as root

2022-12-16T21:32:32.672 app[a9faea67] yyz [info] 2022/12/16 21:32:32 listening on [fdaa:0:b45d:a7b:88dc:3:8570:2]:22 (DNS: [fdaa::3]:53)

2022-12-16T21:32:32.701 app[a9faea67] yyz [info] config file read from /etc/litefs.yml

2022-12-16T21:32:32.701 app[a9faea67] yyz [info] LiteFS v0.3.0-beta7, commit=7e5287a655c985ca5122f1b082d63ef96e546cd5

2022-12-16T21:32:32.701 app[a9faea67] yyz [info] Using Consul to determine primary

2022-12-16T21:32:34.897 app[a9faea67] yyz [info] initializing consul: key= url=https://:4ee340cf-0c2e-f4e1-be1b-d5acd18d0add@consul-iad.fly-shared.net/cool-stuff-api-jg589z352ggqkze7/ hostname=a9faea67 advertise-url=http://a9faea67.vm.cool-stuff-api.internal:20202

2022-12-16T21:32:34.919 app[a9faea67] yyz [info] LiteFS mounted to: /data

2022-12-16T21:32:34.919 app[a9faea67] yyz [info] http server listening on: http://localhost:20202

2022-12-16T21:32:34.919 app[a9faea67] yyz [info] waiting to connect to cluster

2022-12-16T21:32:35.035 app[a9faea67] yyz [info] primary lease acquired, advertising as http://a9faea67.vm.cool-stuff-api.internal:20202

2022-12-16T21:32:35.035 app[a9faea67] yyz [info] connected to cluster, ready

2022-12-16T21:32:35.035 app[a9faea67] yyz [info] starting subprocess: /code/entrypoint.sh [uvicorn coolstuff.shell.api:app --port 80 --host 0.0.0.0]

2022-12-16T21:32:35.037 app[a9faea67] yyz [info] waiting for signal or subprocess to exit

2022-12-16T21:32:36.053 app[a9faea67] yyz [info] INFO [alembic.runtime.migration] Context impl SQLiteImpl.

2022-12-16T21:32:36.053 app[a9faea67] yyz [info] INFO [alembic.runtime.migration] Will assume non-transactional DDL.

2022-12-16T21:32:36.067 app[a9faea67] yyz [info] INFO [alembic.runtime.migration] Running upgrade -> 48cf3a42ba7b, empty message

2022-12-16T21:32:36.128 app[a9faea67] yyz [info] INFO [alembic.runtime.migration] Running upgrade 48cf3a42ba7b -> ddb106113e7b, empty message

2022-12-16T21:32:37.842 app[a9faea67] yyz [info] INFO: Started server process [543]

2022-12-16T21:32:37.843 app[a9faea67] yyz [info] INFO: Waiting for application startup.

2022-12-16T21:32:37.943 app[a9faea67] yyz [info] INFO: Application startup complete.

2022-12-16T21:32:37.944 app[a9faea67] yyz [info] INFO: Uvicorn running on http://0.0.0.0:80 (Press CTRL+C to quit)

2022-12-16T21:34:00.421 app[a9faea67] yyz [info] 2022/12/16 21:34:00 unexpected error: EOF

2022-12-16T21:35:33.060 app[a9faea67] yyz [info] INFO: 172.19.8.249:58124 - "OPTIONS /links?include=link_data HTTP/1.1" 200 OK

2022-12-16T21:35:33.344 app[a9faea67] yyz [info] INFO: 172.19.8.249:58180 - "OPTIONS /searchs?filter%5Bpinned%5D=true HTTP/1.1" 200 OK

2022-12-16T21:35:33.345 app[a9faea67] yyz [info] INFO: 172.19.8.249:58182 - "OPTIONS /searchs?filter%5Bpinned%5D=false HTTP/1.1" 200 OK

2022-12-16T21:35:33.345 app[a9faea67] yyz [info] INFO: 172.19.8.249:58184 - "OPTIONS /links?include=link_data HTTP/1.1" 200 OK

2022-12-16T21:35:33.525 app[a9faea67] yyz [info] INFO: 172.19.8.249:58188 - "GET /searchs?filter%5Bpinned%5D=true HTTP/1.1" 200 OK

2022-12-16T21:35:33.526 app[a9faea67] yyz [info] INFO: 172.19.8.249:58186 - "GET /links?include=link_data HTTP/1.1" 200 OK

2022-12-16T21:35:33.526 app[a9faea67] yyz [info] INFO: 172.19.8.249:58190 - "GET /searchs?filter%5Bpinned%5D=false HTTP/1.1" 200 OK

2022-12-16T21:35:48.002 app[a9faea67] yyz [info] INFO: 172.19.8.249:63684 - "OPTIONS /links/1763c1ca-e9eb-4409-98df-bd358bd65fc2 HTTP/1.1" 200 OK

2022-12-16T21:35:48.155 app[a9faea67] yyz [info] INFO: 172.19.8.249:63714 - "PATCH /links/1763c1ca-e9eb-4409-98df-bd358bd65fc2 HTTP/1.1" 500 Internal Server Error

2022-12-16T21:35:48.179 app[a9faea67] yyz [info] ERROR: Exception in ASGI application

2022-12-16T21:35:48.179 app[a9faea67] yyz [info] Traceback (most recent call last):

2022-12-16T21:35:48.179 app[a9faea67] yyz [info] File "/opt/venv/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 404, in run_asgi

2022-12-16T21:35:48.179 app[a9faea67] yyz [info] result = await app( # type: ignore[func-returns-value]

2022-12-16T21:35:48.179 app[a9faea67] yyz [info] File "/opt/venv/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__

2022-12-16T21:35:48.179 app[a9faea67] yyz [info] return await self.app(scope, receive, send)

2022-12-16T21:35:48.179 app[a9faea67] yyz [info] File "/opt/venv/lib/python3.10/site-packages/fastapi/applications.py", line 270, in __call__

2022-12-16T21:35:48.179 app[a9faea67] yyz [info] await super().__call__(scope, receive, send)

2022-12-16T21:35:48.179 app[a9faea67] yyz [info] File "/opt/venv/lib/python3.10/site-packages/sentry_sdk/integrations/starlette.py", line 325, in _sentry_patched_asgi_app

2022-12-16T21:35:48.179 app[a9faea67] yyz [info] return await middleware(scope, receive, send)

2022-12-16T21:35:48.179 app[a9faea67] yyz [info] File "/opt/venv/lib/python3.10/site-packages/sentry_sdk/integrations/asgi.py", line 139, in _run_asgi3

2022-12-16T21:35:48.179 app[a9faea67] yyz [info] return await self._run_app(scope, lambda: self.app(scope, receive, send))

2022-12-16T21:35:48.179 app[a9faea67] yyz [info] File "/opt/venv/lib/python3.10/site-packages/sentry_sdk/integrations/asgi.py", line 188, in _run_app

2022-12-16T21:35:48.179 app[a9faea67] yyz [info] raise exc from None

2022-12-16T21:35:48.179 app[a9faea67] yyz [info] File "/opt/venv/lib/python3.10/site-packages/sentry_sdk/integrations/asgi.py", line 183, in _run_app

2022-12-16T21:35:48.179 app[a9faea67] yyz [info] return await callback()

2022-12-16T21:35:48.179 app[a9faea67] yyz [info] File "/opt/venv/lib/python3.10/site-packages/starlette/applications.py", line 124, in __call__

2022-12-16T21:35:48.179 app[a9faea67] yyz [info] await self.middleware_stack(scope, receive, send)

2022-12-16T21:35:48.179 app[a9faea67] yyz [info] File "/opt/venv/lib/python3.10/site-packages/sentry_sdk/integrations/starlette.py", line 130, in _create_span_call

2022-12-16T21:35:48.179 app[a9faea67] yyz [info] return await old_call(app, scope, new_receive, new_send, **kwargs)

2022-12-16T21:35:48.179 app[a9faea67] yyz [info] File "/opt/venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__

2022-12-16T21:35:48.179 app[a9faea67] yyz [info] raise exc

2022-12-16T21:35:48.179 app[a9faea67] yyz [info] File "/opt/venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__

2022-12-16T21:35:48.179 app[a9faea67] yyz [info] await self.app(scope, receive, _send)

2022-12-16T21:35:48.179 app[a9faea67] yyz [info] File "/opt/venv/lib/python3.10/site-packages/sentry_sdk/integrations/starlette.py", line 130, in _create_span_call

2022-12-16T21:35:48.179 app[a9faea67] yyz [info] return await old_call(app, scope, new_receive, new_send, **kwargs)

2022-12-16T21:35:48.179 app[a9faea67] yyz [info] File "/opt/venv/lib/python3.10/site-packages/starlette/middleware/base.py", line 72, in __call__

2022-12-16T21:35:48.179 app[a9faea67] yyz [info] response = await self.dispatch_func(request, call_next)

2022-12-16T21:35:48.179 app[a9faea67] yyz [info] File "/code/./coolstuff/shell/middleware/litefs_write_replay.py", line 50, in litefs_write_replace

2022-12-16T21:35:48.179 app[a9faea67] yyz [info] response.headers["litefs-txid"] = txid

2022-12-16T21:35:48.179 app[a9faea67] yyz [info] File "/opt/venv/lib/python3.10/site-packages/starlette/datastructures.py", line 600, in __setitem__

2022-12-16T21:35:48.179 app[a9faea67] yyz [info] set_value = value.encode("latin-1")

2022-12-16T21:35:48.179 app[a9faea67] yyz [info] AttributeError: 'int' object has no attribute 'encode'

2022-12-16T21:37:12.493 runner[a9faea67] yyz [info] Shutting down virtual machine

2022-12-16T21:37:12.524 app[a9faea67] yyz [info] Sending signal SIGINT to main child process w/ PID 528

2022-12-16T21:37:12.524 app[a9faea67] yyz [info] sending signal to exec process

2022-12-16T21:37:12.525 app[a9faea67] yyz [info] waiting for exec process to close

2022-12-16T21:37:12.564 app[a9faea67] yyz [info] INFO: Shutting down

2022-12-16T21:37:12.665 app[a9faea67] yyz [info] INFO: Waiting for application shutdown.

2022-12-16T21:37:12.672 app[a9faea67] yyz [info] INFO: Application shutdown complete.

2022-12-16T21:37:12.673 app[a9faea67] yyz [info] INFO: Finished server process [543]

2022-12-16T21:37:13.038 app[a9faea67] yyz [info] signal received, litefs shutting down

2022-12-16T21:37:13.039 app[a9faea67] yyz [info] exiting primary, destroying lease

2022-12-16T21:37:13.064 app[a9faea67] yyz [info] litefs shut down complete

2022-12-16T21:37:14.008 app[a9faea67] yyz [info] Starting clean up.

2022-12-16T21:37:14.008 app[a9faea67] yyz [info] Umounting /dev/vdc from /mnt/data

2022-12-16T21:37:15.401 runner[de196946] yyz [info] Starting instance

2022-12-16T21:37:16.428 runner[de196946] yyz [info] Configuring virtual machine

2022-12-16T21:37:16.429 runner[de196946] yyz [info] Pulling container image

2022-12-16T21:37:18.324 runner[de196946] yyz [info] Unpacking image

2022-12-16T21:37:18.803 runner[de196946] yyz [info] Preparing kernel init

2022-12-16T21:37:18.967 runner[de196946] yyz [info] Setting up volume 'api'

2022-12-16T21:37:18.971 runner[de196946] yyz [info] Opening encrypted volume

2022-12-16T21:37:19.148 runner[de196946] yyz [info] Configuring firecracker

2022-12-16T21:37:19.178 runner[de196946] yyz [info] Starting virtual machine

2022-12-16T21:37:19.414 app[de196946] yyz [info] Starting init (commit: f447594)...

2022-12-16T21:37:19.438 app[de196946] yyz [info] Mounting /dev/vdc at /mnt/data w/ uid: 0, gid: 0 and chmod 0755

2022-12-16T21:37:19.442 app[de196946] yyz [info] Preparing to run: `litefs mount -- /code/entrypoint.sh uvicorn coolstuff.shell.api:app --port 80 --host 0.0.0.0` as root

2022-12-16T21:37:19.469 app[de196946] yyz [info] 2022/12/16 21:37:19 listening on [fdaa:0:b45d:a7b:88dc:3:8570:2]:22 (DNS: [fdaa::3]:53)

2022-12-16T21:37:19.497 app[de196946] yyz [info] config file read from /etc/litefs.yml

2022-12-16T21:37:19.497 app[de196946] yyz [info] LiteFS v0.3.0-beta7, commit=7e5287a655c985ca5122f1b082d63ef96e546cd5

2022-12-16T21:37:19.497 app[de196946] yyz [info] Using Consul to determine primary

2022-12-16T21:37:20.649 app[de196946] yyz [info] initializing consul: key= url=https://:4ee340cf-0c2e-f4e1-be1b-d5acd18d0add@consul-iad.fly-shared.net/cool-stuff-api-jg589z352ggqkze7/ hostname=de196946 advertise-url=http://de196946.vm.cool-stuff-api.internal:20202

2022-12-16T21:37:20.678 app[de196946] yyz [info] LiteFS mounted to: /data

2022-12-16T21:37:20.678 app[de196946] yyz [info] http server listening on: http://localhost:20202

2022-12-16T21:37:20.678 app[de196946] yyz [info] waiting to connect to cluster

2022-12-16T21:37:21.075 app[de196946] yyz [info] primary lease acquired, advertising as http://de196946.vm.cool-stuff-api.internal:20202

2022-12-16T21:37:21.075 app[de196946] yyz [info] connected to cluster, ready

2022-12-16T21:37:21.075 app[de196946] yyz [info] starting subprocess: /code/entrypoint.sh [uvicorn coolstuff.shell.api:app --port 80 --host 0.0.0.0]

2022-12-16T21:37:21.077 app[de196946] yyz [info] waiting for signal or subprocess to exit

2022-12-16T21:37:22.211 app[de196946] yyz [info] INFO [alembic.runtime.migration] Context impl SQLiteImpl.

2022-12-16T21:37:22.211 app[de196946] yyz [info] INFO [alembic.runtime.migration] Will assume non-transact

2022-12-16T21:37:22.211 app[de196946] yyz [info] onal DDL.

2022-12-16T21:37:23.994 app[de196946] yyz [info] INFO: Started server process [544]

2022-12-16T21:37:23.995 app[de196946] yyz [info] INFO: Waiting for application startup.

2022-12-16T21:37:24.107 app[de196946] yyz [info] INFO: Application startup complete.

2022-12-16T21:37:24.107 app[de196946] yyz [info] INFO: Uvicorn running on http://0.0.0.0:80 (Press CTRL+C to quit)

2022-12-16T21:37:58.957 app[de196946] yyz [info] INFO: 172.19.8.249:41342 - "GET /links?include=link_data HTTP/1.1" 200 OK

2022-12-16T21:38:06.937 app[de196946] yyz [info] INFO: 172.19.8.249:44674 - "GET /links?include=link_data HTTP/1.1" 200 OK

2022-12-16T21:38:06.939 app[de196946] yyz [info] INFO: 172.19.8.249:44672 - "GET /searchs?filter%5Bpinned%5D=true HTTP/1.1" 200 OK

2022-12-16T21:38:06.939 app[de196946] yyz [info] INFO: 172.19.8.249:44676 - "GET /searchs?filter%5Bpinned%5D=false HTTP/1.1" 200 OK

2022-12-16T21:38:08.869 app[de196946] yyz [info] INFO: 172.19.8.249:45068 - "OPTIONS /links/1763c1ca-e9eb-4409-98df-bd358bd65fc2 HTTP/1.1" 200 OK

2022-12-16T21:38:08.912 app[de196946] yyz [info] INFO: 172.19.8.249:45072 - "PATCH /links/1763c1ca-e9eb-4409-98df-bd358bd65fc2 HTTP/1.1" 200 OK

2022-12-16T21:38:40.790 app[de196946] yyz [info] INFO: 172.19.8.249:55796 - "GET /links?include=link_data HTTP/1.1" 200 OK

2022-12-16T21:38:41.799 app[de196946] yyz [info] INFO: 172.19.8.249:56002 - "GET /searchs?filter%5Bpinned%5D=true HTTP/1.1" 200 OK

2022-12-16T21:38:42.012 app[de196946] yyz [info] INFO: 172.19.8.249:56006 - "GET /links?include=link_data HTTP/1.1" 200 OK

2022-12-16T21:38:42.012 app[de196946] yyz [info] INFO: 172.19.8.249:56004 - "GET /searchs?filter%5Bpinned%5D=false HTTP/1.1" 200 OK

2022-12-16T21:38:47.760 app[de196946] yyz [info] INFO: 172.19.8.249:59218 - "GET /links?include=link_data HTTP/1.1" 200 OK

2022-12-16T21:38:47.761 app[de196946] yyz [info] INFO: 172.19.8.249:59216 - "GET /searchs?filter%5Bpinned%5D=false HTTP/1.1" 200 OK

2022-12-16T21:38:47.762 app[de196946] yyz [info] INFO: 172.19.8.249:59214 - "GET /searchs?filter%5Bpinned%5D=true HTTP/1.1" 200 OK

2022-12-16T21:38:53.242 app[de196946] yyz [info] INFO: 172.19.8.249:60220 - "PATCH /links/1763c1ca-e9eb-4409-98df-bd358bd65fc2 HTTP/1.1" 200 OK

2022-12-16T21:38:55.599 app[de196946] yyz [info] INFO: 172.19.8.249:60674 - "PATCH /links/1763c1ca-e9eb-4409-98df-bd358bd65fc2 HTTP/1.1" 200 OK

2022-12-16T21:38:56.564 app[de196946] yyz [info] INFO: 172.19.8.249:60872 - "PATCH /links/1763c1ca-e9eb-4409-98df-bd358bd65fc2 HTTP/1.1" 200 OK

2022-12-16T21:43:41.609 app[de196946] yyz [info] INFO: 172.19.8.249:59096 - "GET /links?include=link_data HTTP/1.1" 200 OK

2022-12-16T21:45:03.768 app[de196946] yyz [info] INFO: 172.19.8.249:54924 - "OPTIONS /links?include=link_data HTTP/1.1" 200 OK

2022-12-16T21:45:03.768 app[de196946] yyz [info] INFO: 172.19.8.249:54926 - "OPTIONS /searchs?filter%5Bpinned%5D=true HTTP/1.1" 200 OK

2022-12-16T21:45:03.769 app[de196946] yyz [info] INFO: 172.19.8.249:54928 - "OPTIONS /searchs?filter%5Bpinned%5D=false HTTP/1.1" 200 OK

2022-12-16T21:45:03.988 app[de196946] yyz [info] INFO: 172.19.8.249:54938 - "GET /searchs?filter%5Bpinned%5D=false HTTP/1.1" 200 OK

2022-12-16T21:45:03.989 app[de196946] yyz [info] INFO: 172.19.8.249:54936 - "GET /searchs?filter%5Bpinned%5D=true HTTP/1.1" 200 OK

2022-12-16T21:45:03.989 app[de196946] yyz [info] INFO: 172.19.8.249:54940 - "GET /links?include=link_data HTTP/1.1" 200 OK

2022-12-16T21:45:10.568 app[de196946] yyz [info] INFO: 172.19.8.249:56764 - "OPTIONS /links/1763c1ca-e9eb-4409-98df-bd358bd65fc2 HTTP/1.1" 200 OK

2022-12-16T21:45:10.654 app[de196946] yyz [info] INFO: 172.19.8.249:56778 - "PATCH /links/1763c1ca-e9eb-4409-98df-bd358bd65fc2 HTTP/1.1" 200 OK

2022-12-16T21:45:18.391 app[de196946] yyz [info] INFO: 172.19.8.249:60204 - "PATCH /links/1763c1ca-e9eb-4409-98df-bd358bd65fc2 HTTP/1.1" 200 OK

2022-12-16T21:45:28.821 app[de196946] yyz [info] INFO: 172.19.8.249:62166 - "PATCH /links/1763c1ca-e9eb-4409-98df-bd358bd65fc2 HTTP/1.1" 200 OK

2022-12-16T21:45:30.852 app[de196946] yyz [info] INFO: 172.19.8.249:64216 - "PATCH /links/1763c1ca-e9eb-4409-98df-bd358bd65fc2 HTTP/1.1" 200 OK

2022-12-16T21:45:50.113 app[de196946] yyz [info] INFO: 172.19.8.249:38034 - "GET /links?include=link_data HTTP/1.1" 200 OK

2022-12-16T21:46:13.164 app[de196946] yyz [info] INFO: 172.19.8.249:45048 - "GET /searchs?filter%5Bpinned%5D=true HTTP/1.1" 200 OK

2022-12-16T21:46:13.165 app[de196946] yyz [info] INFO: 172.19.8.249:45046 - "GET /links?include=link_data HTTP/1.1" 200 OK

2022-12-16T21:46:13.165 app[de196946] yyz [info] INFO: 172.19.8.249:45050 - "GET /searchs?filter%5Bpinned%5D=false HTTP/1.1" 200 OK

2022-12-16T21:46:19.001 app[de196946] yyz [info] INFO: 172.19.8.249:47830 - "GET /searchs?filter%5Bpinned%5D=true HTTP/1.1" 200 OK

2022-12-16T21:46:19.001 app[de196946] yyz [info] INFO: 172.19.8.249:47826 - "GET /links?include=link_data HTTP/1.1" 200 OK

2022-12-16T21:46:19.002 app[de196946] yyz [info] INFO: 172.19.8.249:47828 - "GET /searchs?filter%5Bpinned%5D=false HTTP/1.1" 200 OK

2022-12-16T21:46:19.144 app[de196946] yyz [info] INFO: 172.19.8.249:47864 - "GET /links?include=link_data HTTP/1.1" 200 OK

2022-12-16T21:46:22.593 app[de196946] yyz [info] INFO: 172.19.8.249:48438 - "PATCH /links/1763c1ca-e9eb-4409-98df-bd358bd65fc2 HTTP/1.1" 200 OK

2022-12-16T21:46:26.416 app[de196946] yyz [info] INFO: 172.19.8.249:49200 - "PATCH /links/1763c1ca-e9eb-4409-98df-bd358bd65fc2 HTTP/1.1" 200 OK

2022-12-16T21:46:42.585 app[de196946] yyz [info] INFO: 172.19.8.249:54702 - "GET /links?include=link_data HTTP/1.1" 200 OK

2022-12-16T21:46:42.587 app[de196946] yyz [info] INFO: 172.19.8.249:54704 - "GET /searchs?filter%5Bpinned%5D=true HTTP/1.1" 200 OK

2022-12-16T21:46:42.588 app[de196946] yyz [info] INFO: 172.19.8.249:54706 - "GET /searchs?filter%5Bpinned%5D=false HTTP/1.1" 200 OK

2022-12-16T21:46:46.506 app[de196946] yyz [info] INFO: 172.19.8.249:57250 - "PATCH /links/1763c1ca-e9eb-4409-98df-bd358bd65fc2 HTTP/1.1" 200 OK

2022-12-16T21:46:47.897 app[de196946] yyz [info] INFO: 172.19.8.249:57510 - "PATCH /links/1763c1ca-e9eb-4409-98df-bd358bd65fc2 HTTP/1.1" 200 OK

2022-12-16T21:47:40.913 app[de196946] yyz [info] INFO: 172.19.8.249:41940 - "GET /links?include=link_data HTTP/1.1" 200 OK

2022-12-16T21:47:50.827 app[de196946] yyz [info] INFO: 172.19.8.249:45620 - "GET /links?include=link_data HTTP/1.1" 200 OK

2022-12-16T21:47:54.763 app[de196946] yyz [info] INFO: 172.19.8.249:46368 - "GET /links?include=link_data HTTP/1.1" 200 OK

2022-12-16T21:47:59.486 app[de196946] yyz [info] INFO: 172.19.8.249:47354 - "GET /links?include=link_data HTTP/1.1" 200 OK

2022-12-16T21:48:04.710 runner[de196946] yyz [info] Shutting down virtual machine

2022-12-16T21:48:04.936 app[de196946] yyz [info] Sending signal SIGINT to main child process w/ PID 529

2022-12-16T21:48:04.937 app[de196946] yyz [info] sending signal to exec process

2022-12-16T21:48:04.937 app[de196946] yyz [info] waiting for exec process to close

2022-12-16T21:48:05.003 app[de196946] yyz [info] INFO: Shutting down

2022-12-16T21:48:05.104 app[de196946] yyz [info] INFO: Waiting for application shutdown.

2022-12-16T21:48:05.107 app[de196946] yyz [info] INFO: Application shutdown complete.

2022-12-16T21:48:05.107 app[de196946] yyz [info] INFO: Finished server process [544]

2022-12-16T21:48:05.433 app[de196946] yyz [info] signal received, litefs shutting down

2022-12-16T21:48:05.435 app[de196946] yyz [info] exiting primary, destroying lease

2022-12-16T21:48:05.605 app[de196946] yyz [info] litefs shut down complete

2022-12-16T21:48:06.254 app[de196946] yyz [info] Starting clean up.

2022-12-16T21:48:06.255 app[de196946] yyz [info] Umounting /dev/vdc from /mnt/data

2022-12-16T21:48:08.143 runner[1fef9d1c] yyz [info] Starting instance

2022-12-16T21:48:09.043 runner[1fef9d1c] yyz [info] Configuring virtual machine

2022-12-16T21:48:09.044 runner[1fef9d1c] yyz [info] Pulling container image

2022-12-16T21:48:09.469 runner[1fef9d1c] yyz [info] Unpacking image

2022-12-16T21:48:10.036 runner[1fef9d1c] yyz [info] Preparing kernel init

2022-12-16T21:48:10.193 runner[1fef9d1c] yyz [info] Setting up volume 'api'

2022-12-16T21:48:10.231 runner[1fef9d1c] yyz [info] Opening encrypted volume

2022-12-16T21:48:10.396 runner[1fef9d1c] yyz [info] Configuring firecracker

2022-12-16T21:48:10.426 runner[1fef9d1c] yyz [info] Starting virtual machine

2022-12-16T21:48:10.680 app[1fef9d1c] yyz [info] Starting init (commit: f447594)...

2022-12-16T21:48:10.700 app[1fef9d1c] yyz [info] Mounting /dev/vdc at /mnt/data w/ uid: 0, gid: 0 and chmod 0755

2022-12-16T21:48:10.705 app[1fef9d1c] yyz [info] Preparing to run: `litefs mount -- /code/entrypoint.sh uvicorn coolstuff.shell.api:app --port 80 --host 0.0.0.0` as root

2022-12-16T21:48:10.728 app[1fef9d1c] yyz [info] 2022/12/16 21:48:10 listening on [fdaa:0:b45d:a7b:88dc:3:8570:2]:22 (DNS: [fdaa::3]:53)

2022-12-16T21:48:10.756 app[1fef9d1c] yyz [info] Using Consul to determine primary

2022-12-16T21:48:10.756 app[1fef9d1c] yyz [info] config file read from /etc/litefs.yml

2022-12-16T21:48:10.756 app[1fef9d1c] yyz [info] LiteFS v0.3.0-beta7, commit=7e5287a655c985ca5122f1b082d63ef96e546cd5

2022-12-16T21:48:12.191 app[1fef9d1c] yyz [info] initializing consul: key= url=https://:4ee340cf-0c2e-f4e1-be1b-d5acd18d0add@consul-iad.fly-shared.net/cool-stuff-api-jg589z352ggqkze7/ hostname=1fef9d1c advertise-url=http://1fef9d1c.vm.cool-stuff-api.internal:20202

2022-12-16T21:48:12.202 app[1fef9d1c] yyz [info] ERROR: cannot open store: open databases: open database("coolstuff"): verify database file: database checksum (fbb9e42a3f652ae2) does not match latest LTX checksum (cdc027fdeda3eae4)

2022-12-16T21:48:12.717 app[1fef9d1c] yyz [info] Starting clean up.

2022-12-16T21:48:12.717 app[1fef9d1c] yyz [info] Umounting /dev/vdc from /mnt/data

2022-12-16T21:48:19.847 runner[1fef9d1c] yyz [info] Starting instance

2022-12-16T21:48:20.732 runner[1fef9d1c] yyz [info] Configuring virtual machine

2022-12-16T21:48:20.733 runner[1fef9d1c] yyz [info] Pulling container image

2022-12-16T21:48:20.924 runner[1fef9d1c] yyz [info] Unpacking image

2022-12-16T21:48:20.936 runner[1fef9d1c] yyz [info] Preparing kernel init

2022-12-16T21:48:21.080 runner[1fef9d1c] yyz [info] Setting up volume 'api'

2022-12-16T21:48:21.084 runner[1fef9d1c] yyz [info] Opening encrypted volume

2022-12-16T21:48:21.253 runner[1fef9d1c] yyz [info] Configuring firecracker

2022-12-16T21:48:21.283 runner[1fef9d1c] yyz [info] Starting virtual machine

2022-12-16T21:48:21.488 app[1fef9d1c] yyz [info] Starting init (commit: f447594)...

2022-12-16T21:48:21.510 app[1fef9d1c] yyz [info] Mounting /dev/vdc at /mnt/data w/ uid: 0, gid: 0 and chmod 0755

2022-12-16T21:48:21.515 app[1fef9d1c] yyz [info] Preparing to run: `litefs mount -- /code/entrypoint.sh uvicorn coolstuff.shell.api:app --port 80 --host 0.0.0.0` as root

2022-12-16T21:48:21.539 app[1fef9d1c] yyz [info] 2022/12/16 21:48:21 listening on [fdaa:0:b45d:a7b:88dc:3:8570:2]:22 (DNS: [fdaa::3]:53)

2022-12-16T21:48:21.570 app[1fef9d1c] yyz [info] config file read from /etc/litefs.yml

2022-12-16T21:48:21.570 app[1fef9d1c] yyz [info] LiteFS v0.3.0-beta7, commit=7e5287a655c985ca5122f1b082d63ef96e546cd5

2022-12-16T21:48:21.570 app[1fef9d1c] yyz [info] Using Consul to determine primary

2022-12-16T21:48:23.782 app[1fef9d1c] yyz [info] initializing consul: key= url=https://:4ee340cf-0c2e-f4e1-be1b-d5acd18d0add@consul-iad.fly-shared.net/cool-stuff-api-jg589z352ggqkze7/ hostname=1fef9d1c advertise-url=http://1fef9d1c.vm.cool-stuff-api.internal:20202

2022-12-16T21:48:23.793 app[1fef9d1c] yyz [info] ERROR: cannot open store: open databases: open database("coolstuff"): verify database file: database checksum (fbb9e42a3f652ae2) does not match latest LTX checksum (cdc027fdeda3eae4)

2022-12-16T21:48:24.529 app[1fef9d1c] yyz [info] Starting clean up.

2022-12-16T21:48:24.530 app[1fef9d1c] yyz [info] Umounting /dev/vdc from /mnt/data

2022-12-16T21:48:30.562 runner[1fef9d1c] yyz [info] Starting instance

2022-12-16T21:48:31.555 runner[1fef9d1c] yyz [info] Configuring virtual machine

2022-12-16T21:48:31.556 runner[1fef9d1c] yyz [info] Pulling container image

2022-12-16T21:48:31.706 runner[1fef9d1c] yyz [info] Unpacking image

2022-12-16T21:48:31.713 runner[1fef9d1c] yyz [info] Preparing kernel init

2022-12-16T21:48:31.873 runner[1fef9d1c] yyz [info] Setting up volume 'api'

2022-12-16T21:48:31.878 runner[1fef9d1c] yyz [info] Opening encrypted volume

2022-12-16T21:48:32.042 runner[1fef9d1c] yyz [info] Configuring firecracker

2022-12-16T21:48:32.070 runner[1fef9d1c] yyz [info] Starting virtual machine

2022-12-16T21:48:32.277 app[1fef9d1c] yyz [info] Starting init (commit: f447594)...

2022-12-16T21:48:32.302 app[1fef9d1c] yyz [info] Mounting /dev/vdc at /mnt/data w/ uid: 0, gid: 0 and chmod 0755

2022-12-16T21:48:32.308 app[1fef9d1c] yyz [info] Preparing to run: `litefs mount -- /code/entrypoint.sh uvicorn coolstuff.shell.api:app --port 80 --host 0.0.0.0` as root

2022-12-16T21:48:32.331 app[1fef9d1c] yyz [info] 2022/12/16 21:48:32 listening on [fdaa:0:b45d:a7b:88dc:3:8570:2]:22 (DNS: [fdaa::3]:53)

2022-12-16T21:48:32.360 app[1fef9d1c] yyz [info] Using Consul to determine primary

2022-12-16T21:48:32.360 app[1fef9d1c] yyz [info] config file read from /etc/litefs.yml

2022-12-16T21:48:32.360 app[1fef9d1c] yyz [info] LiteFS v0.3.0-beta7, commit=7e5287a655c985ca5122f1b082d63ef96e546cd5

2022-12-16T21:48:33.795 app[1fef9d1c] yyz [info] initializing consul: key= url=https://:4ee340cf-0c2e-f4e1-be1b-d5acd18d0add@consul-iad.fly-shared.net/cool-stuff-api-jg589z352ggqkze7/ hostname=1fef9d1c advertise-url=http://1fef9d1c.vm.cool-stuff-api.internal:20202

2022-12-16T21:48:33.805 app[1fef9d1c] yyz [info] ERROR: cannot open store: open databases: open database("coolstuff"): verify database file: database checksum (fbb9e42a3f652ae2) does not match latest LTX checksum (cdc027fdeda3eae4)

2022-12-16T21:48:34.318 app[1fef9d1c] yyz [info] Starting clean up.

2022-12-16T21:48:34.318 app[1fef9d1c] yyz [info] Umounting /dev/vdc from /mnt/data 

So my initial thought is: is this somehow related to me not having multiple instances running? Is there a presumption that there is always a primary? And since the primary is being shut down during the deployment with no replica to take over, is that somehow causing the LTX to be out of date?

hey @bencdavis, thanks for giving it another try. I’ve reworked the whole checksumming system since beta7 because it was too easy to end up with inconsistencies. Overall it’s been working a lot better. It was a fairly substantial change, so I’ve been doing testing since then and working out some kinks. I’m hoping to have a beta8 early next week. Hopefully that’s the last v0.3.0 beta! :crossed_fingers: :crossed_fingers: :crossed_fingers:

LiteFS should run fine as a single-node deployment, although checksum issues could prevent this. I’ll do some additional testing around takeovers before I ship beta8.

hey @bencdavis, thanks for giving it another try. I’ve reworked the whole checksumming system since beta7 because it was too easy to end up with inconsistencies. Overall it’s been working a lot better. It was a fairly substantial change, so I’ve been doing testing since then and working out some kinks. I’m hoping to have a beta8 early next week. Hopefully that’s the last v0.3.0 beta! :crossed_fingers: :crossed_fingers: :crossed_fingers:

Sounds great! I’ll keep an eye on the GH releases and try it out when it’s ready. It’s pretty easy for me to switch back and forth, so I’m very happy to test out these beta releases.

Hey @benbjohnson! Just updated to the newly released v0.3 and all the issues I was experiencing before seem to be completely resolved. It worked flawlessly with a single instance in a single region, and equally well with two regions, with two instances running in the region I’ve constrained the primary to. The logic to replay mutating requests to the primary and the logic to wait for a replica to catch up to the last written transaction ID were surprisingly simple to implement, and they work great so far. Based on my initial tests, the replication is so fast that no requests have had to wait at all.

Thanks for all the back and forth on this. Couldn’t be more pleased with the end result!

I have a simple question that doesn’t seem worth creating another topic for: how do I enable WAL mode? I assumed I’d just run PRAGMA journal_mode=WAL; on the database file (which, as I understand it, is a permanent change, not one scoped to the connection), but it doesn’t seem to work. There’s no WAL file alongside the database file (as I write this I’m wondering if I’m running it against the wrong file? I assumed I’d run it against the virtual file in the FUSE mount?), and on startup I see this: wal-sync: no wal file exists on "coolstuff", skipping sync with ltx.

That’s awesome, @bencdavis! I’m glad to hear it’s working well for you. Let me know if you run into any issues.

You should just need to run PRAGMA journal_mode = wal on the primary node. The wal-sync message you posted is kinda confusing. SQLite will automatically clean up the WAL file and merge it into the database file when it closes cleanly. The wal-sync message occurs on startup because LiteFS ensures the transaction files it writes and the WAL file are in sync but it skips the check if there’s no WAL file.

You can verify if your database is in WAL mode by running PRAGMA journal_mode from the sqlite3 CLI and it should return "wal".
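Concretely, from a shell on the primary that would be something like the following, pointed at the database file inside the FUSE mount (I’m using the "coolstuff" database name from your logs; adjust the path to wherever your database actually lives under /data):

sqlite3 /data/coolstuff "PRAGMA journal_mode=wal;"
sqlite3 /data/coolstuff "PRAGMA journal_mode;"

The second command should print wal once the change has taken effect.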

Hey Ben, I just ran into this. I tried creating another volume so that the app doesn’t go down when I deploy, but in doing so I seem to have gotten myself into a similar issue.

I don’t see PRAGMA journal_mode = wal anywhere in the codebase of the epicweb-dev/epic-stack repo on GitHub. Is this something that must be run? If so, when? And how would you go about doing it?

Happy to share more info if something comes to mind, but right now I’m unable to access my deployment due to this issue.

2023-05-22T08:09:13.505 app[1516edbb] lax [info] 54357AF3DBF0453B: existing primary found (6386fb99), connecting as replica

2023-05-22T08:09:13.506 app[1516edbb] lax [info] 54357AF3DBF0453B: disconnected from primary with error, retrying: connect to primary: Post "http://6386fb99-df23-f121-4e86-5be9ee078b69.vm.infinite-threads-production.internal:20202/stream": dial tcp: lookup 6386fb99-df23-f121-4e86-5be9ee078b69.vm.infinite-threads-production.internal on [fdaa::3]:53: no such host ('http://6386fb99-df23-f121-4e86-5be9ee078b69.vm.infinite-threads-production.internal:20202')

2023-05-22T08:09:14.564 app[1516edbb] lax [info] 54357AF3DBF0453B: existing primary found (6386fb99), connecting as replica

2023-05-22T08:09:14.565 app[1516edbb] lax [info] 54357AF3DBF0453B: disconnected from primary with error, retrying: connect to primary: Post "http://6386fb99-df23-f121-4e86-5be9ee078b69.vm.infinite-threads-production.internal:20202/stream": dial tcp: lookup 6386fb99-df23-f121-4e86-5be9ee078b69.vm.infinite-threads-production.internal on [fdaa::3]:53: no such host ('http://6386fb99-df23-f121-4e86-5be9ee078b69.vm.infinite-threads-production.internal:20202')

2023-05-22T08:09:15.624 app[1516edbb] lax [info] 54357AF3DBF0453B: existing primary found (6386fb99), connecting as replica

@lifeiscontent Can you share your litefs.yml file? And can you run fly status to see what "platform" you’re running? It’s either “nomad” or “machines”.

You don’t need to run in WAL mode. LiteFS supports both the rollback journal & WAL.

Nomad. I already sent the litefs.yml in our email thread :slight_smile: