KeyDB instance crashed, unable to start

lewis · October 1, 2021, 2:21am

Hi there, I cloned the fly-apps/keydb repo (KeyDB server on Fly) from GitHub a few weeks ago to deploy for my app, and it’s been running fine for a while.

However, this morning I found that it has apparently crashed. I haven’t been able to identify the reason using fly logs, as all it shows are messages like these:

2021-09-30T02:34:25.528104644Z runner[904dff81] sjc [info] Starting instance
2021-09-30T03:34:26.825582422Z runner[383ea0a7] sjc [info] Starting instance
2021-09-30T04:34:28.549429999Z runner[dfabc32b] sjc [info] Starting instance
2021-09-30T20:01:21.293326236Z runner[19d0c2cc] sjc [info] Starting instance
2021-09-30T20:01:22.474958346Z runner[e4142838] sjc [info] Starting instance
2021-09-30T20:06:31.414894682Z runner[e7a53242] sjc [info] Starting instance
2021-09-30T20:09:52.128372513Z runner[04b2a16a] sjc [info] Starting instance
2021-09-30T20:09:53.662655660Z runner[04a290e8] sjc [info] Starting instance
2021-09-30T21:00:20.940894632Z runner[2c2d7adf] sjc [info] Starting instance
2021-09-30T21:00:51.900525849Z runner[d2bd7b26] sjc [info] Starting instance
2021-09-30T21:01:52.754711933Z runner[8c5bac60] sjc [info] Starting instance
2021-09-30T21:05:26.779474629Z runner[6f2d77fd] sjc [info] Starting instance
2021-09-30T21:09:28.068863796Z runner[38127825] sjc [info] Starting instance
2021-10-01T01:42:56.804428288Z runner[5763df09] sjc [info] Starting instance
2021-10-01T01:42:57.970893814Z runner[2da7e605] sjc [info] Starting instance

I also tried to redeploy but no luck, failing at the initial healthcheck:

1 desired, 1 placed, 0 healthy, 1 unhealthy
v14 failed - Failed due to unhealthy allocations - rolling back to job version 13

(rollback doesn’t seem to be working either)

Would really appreciate some help with recovering the instance, and also ideally with identifying the cause of the original crash. Thanks!

kurt · October 1, 2021, 2:27am

That’s unpleasant!

Try running fly status --all to get a list of failed VMs, then run fly vm status <id> to see what actually failed. If you can share the errors we may be able to point you in the right direction.

lewis · October 1, 2021, 2:35am

Here’s the output from fly vm status <id>:

Instance
  ID            = 557e78bd   
  Task          =
  Version       = 20
  Region        = sjc
  Desired       = run        
  Status        = failed
  Health Checks =
  Restarts      = 0
  Created       = 12m0s ago

Recent Events
TIMESTAMP            TYPE           MESSAGE
2021-10-01T02:18:39Z Received       Task received by client
2021-10-01T02:18:39Z Task Setup     Building Task Directory
2021-10-01T02:18:40Z Driver Failure rpc error: code = Unknown desc = IP allocation error: ip collision: fdaa:0:3220:a7b:ad0:0:3cca:1
2021-10-01T02:18:40Z Not Restarting Error was unrecoverable
2021-10-01T02:18:40Z Killing        Sent interrupt. Waiting 5s before force killing

Checks
ID SERVICE STATE OUTPUT

Recent Logs
Done in 0.78s.

Hmm… maybe the crashed instance didn’t unallocate its IP? Any ideas on how to move forward?
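
In case it helps anyone reading along with similar output: below is a minimal Python sketch for picking the failure events out of pasted fly vm status text. The sample lines are copied from the output above. The function name failure_events and the list of event types treated as failures are my own assumptions for illustration, not anything defined by flyctl.

```python
import re

# Event lines copied from the `fly vm status` output in this thread.
SAMPLE = """\
2021-10-01T02:18:39Z Received       Task received by client
2021-10-01T02:18:39Z Task Setup     Building Task Directory
2021-10-01T02:18:40Z Driver Failure rpc error: code = Unknown desc = IP allocation error: ip collision: fdaa:0:3220:a7b:ad0:0:3cca:1
2021-10-01T02:18:40Z Not Restarting Error was unrecoverable
2021-10-01T02:18:40Z Killing        Sent interrupt. Waiting 5s before force killing
"""

# Assumption: these event types are the ones worth flagging.
# This list is illustrative, not an official flyctl classification.
FAILURE_TYPES = ("Driver Failure", "Not Restarting")

# A leading UTC timestamp followed by the rest of the event line.
EVENT_RE = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z)\s+(.*)")


def failure_events(text):
    """Return (timestamp, type, message) tuples for failure-type events."""
    events = []
    for line in text.splitlines():
        match = EVENT_RE.match(line.strip())
        if not match:
            continue  # skip headers, blank lines, and anything else
        timestamp, rest = match.groups()
        for event_type in FAILURE_TYPES:
            if rest.startswith(event_type):
                message = rest[len(event_type):].strip()
                events.append((timestamp, event_type, message))
                break
    return events


if __name__ == "__main__":
    for timestamp, event_type, message in failure_events(SAMPLE):
        print(f"{timestamp} [{event_type}] {message}")
```

Run on the sample, this flags the Driver Failure line carrying the ip collision message and the Not Restarting line that follows it.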

kurt · October 1, 2021, 2:36am

Oh, thanks for these logs. We’ll have a look; I don’t know exactly what happened there.

lewis · October 1, 2021, 11:29am

Strangely enough, the service seems to be back up now (like it did the last time I posted an issue here), so this is no longer a blocker. I am still quite curious about what exactly caused the crash and subsequent IP collision errors though.

kurt · October 1, 2021, 12:57pm

We cleaned up the IP collision last night (and forgot to tell you). It’s not clear why it crashed, though. Our logs only go back about 3 days.
