KeyDB constantly "failing"

I’ve been using KeyDB for a low-traffic site for about a year, IIRC. Since a week or two ago (can’t be exactly sure), the keydb servers have been failing. There’s nothing in the log output to show why, and they’re not hitting any memory limits or anything, according to the metrics I can see in the dashboard.

I’ve just executed fly vm status --app {my_app} {instance_id} and see this:

Instance
  ID            = e28bXXXX
  Process       =
  Version       = 1
  Region        = ams
  Desired       = run
  Status        = running
  Health Checks = 2 total, 2 critical
  Restarts      = 1
  Created       = 12h34m ago

Recent Events
TIMESTAMP            TYPE             MESSAGE
2021-10-11T18:30:00Z Received         Task received by client
2021-10-11T18:30:00Z Task Setup       Building Task Directory
2021-10-11T18:30:08Z Started          Task started by client
2021-10-12T05:20:40Z Restart Signaled User requested restart
2021-10-12T05:20:44Z Terminated       Exit Code: 0
2021-10-12T05:20:44Z Restarting       Task restarting in 0s
2021-10-12T05:20:49Z Started          Task started by client

Checks
ID                               SERVICE  STATE    OUTPUT
27913efa8328a926c127fdb9XXXXXXXX tcp-6379 critical rpc error: code = Unknown desc = Post "http://unix/v1/exec": EOF
8d0f2f31129f763f17b124b7XXXXXXXX tcp-6379 critical rpc error: code = Unknown desc = Post "http://unix/v1/exec": EOF

I’ve tried recreating the cluster using the launcher, but the problem persists.

I’ve also tried creating a Redis instance following the GitHub example, but that also seems to crash as soon as a client connects.

I don’t think I’m doing anything out of the ordinary. The cache (either Redis or KeyDB) either crashes on connection or on the first command, such as MEMORY STATS. I’ve tried different versions of Redis clients; none of them help.
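For reference, the failing interaction is roughly this, sketched with Redix (the client I mostly use); the hostname here is just a placeholder:

# Rough sketch of what the client does; hostname is a placeholder.
# The connection either drops as soon as it opens, or dies on the
# first command sent over it.
{:ok, conn} = Redix.start_link(host: "my-keydb.internal", port: 6379)

# The first command is usually where it falls over:
Redix.command(conn, ["MEMORY", "STATS"])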

I need KeyDB (or Redis) as an LRU cache that persists between app restarts/updates, so moving the cache into my app would be counterproductive.

I’m seeing this while tailing the logs:

2021-10-12T05:20:55.133 app[e28bb399] ams [info] keydb         | 543:583:S 12 Oct 2021 05:20:55.132 * Connecting to MASTER fdaa:0:28bd:a7b:23c2:0:4759:2:6379
2021-10-12T05:20:55.133 app[e28bb399] ams [info] keydb         | 543:583:S 12 Oct 2021 05:20:55.132 * MASTER <-> REPLICA sync started
2021-10-12T05:20:55.133 app[e28bb399] ams [info] keydb         | 543:583:S 12 Oct 2021 05:20:55.132 * Non blocking connect for SYNC fired the event.
2021-10-12T05:20:55.133 app[e28bb399] ams [info] keydb         | 543:583:S 12 Oct 2021 05:20:55.132 * Master replied to PING, replication can continue...
2021-10-12T05:20:55.133 app[e28bb399] ams [info] keydb         | 543:583:S 12 Oct 2021 05:20:55.132 * Partial resynchronization not possible (no cached master)
2021-10-12T05:20:55.133 app[e28bb399] ams [info] keydb         | 543:583:S 12 Oct 2021 05:20:55.132 * Replica [fdaa:0:28bd:a7b:23c2:0:4759:2]:6379 asks for synchronization
2021-10-12T05:20:55.133 app[e28bb399] ams [info] keydb         | 543:583:S 12 Oct 2021 05:20:55.132 * Full resync requested by replica [fdaa:0:28bd:a7b:23c2:0:4759:2]:6379
2021-10-12T05:20:55.133 app[e28bb399] ams [info] keydb         | 543:583:S 12 Oct 2021 05:20:55.132 * Starting BGSAVE for SYNC with target: disk
2021-10-12T05:20:55.133 app[e28bb399] ams [info] keydb         | 543:583:S 12 Oct 2021 05:20:55.133 * Background saving started by pid 620
2021-10-12T05:20:55.133 app[e28bb399] ams [info] keydb         | 543:583:S 12 Oct 2021 05:20:55.133 * Full resync from master: f9ce118f46e97ee34ecc8f272c2622f736205c6a:0
2021-10-12T05:20:55.133 app[e28bb399] ams [info] keydb         | 543:583:S 12 Oct 2021 05:20:55.133 * Discarding previously cached master state.
2021-10-12T05:20:55.165 app[e28bb399] ams [info] keydb         | 620:583:C 12 Oct 2021 05:20:55.165 * DB saved on disk
2021-10-12T05:20:55.166 app[e28bb399] ams [info] keydb         | 620:583:C 12 Oct 2021 05:20:55.166 * RDB: 1 MB of memory used by copy-on-write
2021-10-12T05:20:55.232 app[e28bb399] ams [info] keydb         | 543:583:S 12 Oct 2021 05:20:55.232 * Background saving terminated with success
2021-10-12T05:20:55.232 app[e28bb399] ams [info] keydb         | 543:583:S 12 Oct 2021 05:20:55.232 * MASTER <-> REPLICA sync: receiving 4065011 bytes from master to disk
2021-10-12T05:20:55.238 app[e28bb399] ams [info] keydb         | 543:583:S 12 Oct 2021 05:20:55.237 * Synchronization with replica [fdaa:0:28bd:a7b:23c2:0:4759:2]:6379 succeeded
2021-10-12T05:20:55.238 app[e28bb399] ams [info] keydb         | 543:583:S 12 Oct 2021 05:20:55.238 * MASTER <-> REPLICA sync: Loading DB in memory
2021-10-12T05:20:55.238 app[e28bb399] ams [info] keydb         | 543:583:S 12 Oct 2021 05:20:55.238 * Loading RDB produced by version 6.0.16
2021-10-12T05:20:55.238 app[e28bb399] ams [info] keydb         | 543:583:S 12 Oct 2021 05:20:55.238 * RDB age 0 seconds
2021-10-12T05:20:55.238 app[e28bb399] ams [info] keydb         | 543:583:S 12 Oct 2021 05:20:55.238 * RDB memory usage when created 14.60 Mb
2021-10-12T05:20:55.250 app[e28bb399] ams [info] keydb         | 543:583:S 12 Oct 2021 05:20:55.250 * MASTER <-> REPLICA sync: Finished with success
2021-10-12T05:20:59.669 proxy[65894576] ams [info] Health check status changed 'critical' => 'passing'
2021-10-12T05:21:07.370 proxy[e28bb399] ams [info] Health check status changed 'critical' => 'passing'
2021-10-12T05:21:43.150 proxy[e28bb399] ams [error] Health check status changed 'passing' => 'critical'
2021-10-12T05:22:16.538 proxy[65894576] ams [error] Health check status changed 'passing' => 'critical'

I’ve also tried following the newer Redis cluster example, but no joy. It’s as if .internal routing for my app isn’t working. I’m using the lhr and ams regions, if that helps.

Can you post your fly.toml for the Keydb cluster?

@jsierles I don’t have one locally; I’m using the launcher.

Ah, that would explain it! The launcher uses our old ‘script’ health checks, which run scripts on the VM to check its health. We’ve deprecated that behavior, so the best way to get things working would be to use the updated GitHub example: GitHub - fly-apps/keydb: KeyDB server on Fly

You could grab that and change the values in fly.toml to match your app. The important part is making sure the volume names are the same. You can check those with fly volumes list -a keydb-app-name.

I’ll give this a try, thanks.

Any idea if/when the launcher will be fixed? It’s super useful.

We’re working on this from a few different angles.

We’d like to allow flyctl to run launchers like this for services that aren’t as fully formed as fly postgres. This would likely pull config from our up-to-date example repo instead of hiding the behavior behind a UI.

That said, we want to keep the UI and have it work in a similar way to flyctl. We’re working on an Elixir/Phoenix launcher now which will serve as a template for this. Stay tuned!


Running from flyctl would be useful indeed.

Sorry to say, this also hasn’t worked. Same as the redis issue: the .internal URL doesn’t seem to route successfully. The Redis client can’t connect, and the keydb server doesn’t show a client connection.

What Redis client are you using? Can you double-check that it (a) can connect over IPv6 and (b) will do an AAAA lookup on a hostname? These things didn’t work by default with libcluster or Ecto, so it wouldn’t surprise me if a Redis client on Elixir has the same issue.
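If it helps, here’s a rough way to check both from an IEx shell running inside your app’s VM; the app name below is a placeholder, and keep in mind that .internal names only publish AAAA (IPv6) records:

# (b) Does the name resolve to an IPv6 address?
:inet.getaddr('keydb-app-name.internal', :inet6)
#=> {:ok, address_tuple} on success, {:error, :nxdomain} if resolution fails

# (a) Can we open a raw TCP connection over IPv6 to the KeyDB port?
:gen_tcp.connect('keydb-app-name.internal', 6379, [:inet6], 5_000)
#=> {:ok, socket} if the route works, {:error, reason} if it doesn’t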

Redix.

For now, I’ve tweaked the deployment of keydb to open it up on an external port, and that (thankfully) is working.

It would be worth posting an issue on the Redix repo to ask about IPv6 support. I would bet money (a small amount) that it requires a configuration option to work properly over IPv6.
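For example, if Redix passes socket options through to the underlying :gen_tcp connection (I believe its :socket_opts option does exactly that), something along these lines might be all that’s needed — treat it as a sketch, with a placeholder hostname:

# Sketch: force the transport onto IPv6 so the .internal name is resolved
# via AAAA. Hostname is a placeholder; :socket_opts is handed to :gen_tcp.
{:ok, conn} =
  Redix.start_link(
    host: "keydb-app-name.internal",
    port: 6379,
    socket_opts: [:inet6]
  )

Redix.command(conn, ["PING"])
#=> {:ok, "PONG"} if the connection is healthy

If PING works over the .internal name with that option set, the client was the missing piece rather than the routing.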


Will do, though I did check the docs, the issues, and DuckDuckGo and couldn’t find a single mention of IPv6. I’ll also do some more testing locally against the public IPv6 of my working cache to confirm whether it’s an issue with Redix.