KeyDB instance crashed, unable to start

Hi there, I cloned the GitHub - fly-apps/keydb: KeyDB server on Fly repo a few weeks ago to deploy for my app and it’s been running fine for a while.

However, this morning, I found out that it has apparently crashed, for reasons I haven’t been able to identify using fly logs as all it shows are messages like these:

2021-09-30T02:34:25.528104644Z runner[904dff81] sjc [info] Starting instance
2021-09-30T03:34:26.825582422Z runner[383ea0a7] sjc [info] Starting instance
2021-09-30T04:34:28.549429999Z runner[dfabc32b] sjc [info] Starting instance
2021-09-30T20:01:21.293326236Z runner[19d0c2cc] sjc [info] Starting instance
2021-09-30T20:01:22.474958346Z runner[e4142838] sjc [info] Starting instance
2021-09-30T20:06:31.414894682Z runner[e7a53242] sjc [info] Starting instance
2021-09-30T20:09:52.128372513Z runner[04b2a16a] sjc [info] Starting instance
2021-09-30T20:09:53.662655660Z runner[04a290e8] sjc [info] Starting instance
2021-09-30T21:00:20.940894632Z runner[2c2d7adf] sjc [info] Starting instance
2021-09-30T21:00:51.900525849Z runner[d2bd7b26] sjc [info] Starting instance
2021-09-30T21:01:52.754711933Z runner[8c5bac60] sjc [info] Starting instance
2021-09-30T21:05:26.779474629Z runner[6f2d77fd] sjc [info] Starting instance
2021-09-30T21:09:28.068863796Z runner[38127825] sjc [info] Starting instance
2021-10-01T01:42:56.804428288Z runner[5763df09] sjc [info] Starting instance
2021-10-01T01:42:57.970893814Z runner[2da7e605] sjc [info] Starting instance

I also tried to redeploy but no luck, failing at the initial healthcheck:

1 desired, 1 placed, 0 healthy, 1 unhealthy
v14 failed - Failed due to unhealthy allocations - rolling back to job version 13

(rollback doesn’t seem to be working either)

Would really appreciate some help with recovering the instance, and also ideally with identifying the cause of the original crash. Thanks!

That’s unpleasant!

Try running fly status --all to get a list of failed VMs, then run fly vm status <id> to see what actually failed. If you can share the errors we may be able to point you in the right direction.

Here’s the output from fly vm status <id>:

Instance
  ID            = 557e78bd   
  Task          =
  Version       = 20
  Region        = sjc
  Desired       = run        
  Status        = failed
  Health Checks =
  Restarts      = 0
  Created       = 12m0s ago

Recent Events
TIMESTAMP            TYPE           MESSAGE                                                

2021-10-01T02:18:39Z Received       Task received by client                                

2021-10-01T02:18:39Z Task Setup     Building Task Directory                                

2021-10-01T02:18:40Z Driver Failure rpc error: code = Unknown desc = IP allocation error: ip collision: fdaa:0:3220:a7b:ad0:0:3cca:1
2021-10-01T02:18:40Z Not Restarting Error was unrecoverable                                

2021-10-01T02:18:40Z Killing        Sent interrupt. Waiting 5s before force killing        


Checks
ID SERVICE STATE OUTPUT

Recent Logs
Done in 0.78s.

Hmm… maybe the crashed instance didn’t unallocate its IP? Any ideas on how to move forward?

Oh thanks for these logs, we’ll have a look, I don’t know exactly what happened there.

Strangely enough, the service seems to be back up now (like it did the last time I posted an issue here :sweat_smile:), so this is no longer a blocker. I am still quite curious about what exactly caused the crash and subsequent IP collision errors though.

1 Like

We cleaned up the IP collision last night (and forgot to tell you). It’s not clear why it crashed, though. Our logs only go back about 3 days.