IP allocation error: ip collision - killed our app

Instance
  ID            = ccad3c8e
  Process       =
  Version       = xxx
  Region        = xxx
  Desired       = run
  Status        = failed
  Health Checks =
  Restarts      = 0
  Created       = 8m41s ago

Recent Events
TIMESTAMP            TYPE            MESSAGE
2022-04-07T11:28:06Z Received        Task received by client
2022-04-07T11:28:06Z Task Setup      Building Task Directory
2022-04-07T11:28:08Z Driver Failure  rpc error: code = Unknown desc = IP allocation error: ip collision: yyy:xxx
2022-04-07T11:28:08Z Not Restarting  Error was unrecoverable
2022-04-07T11:28:08Z Alloc Unhealthy Unhealthy because of failed task
2022-04-07T11:28:10Z Killing         Sent interrupt. Waiting 5s before force killing
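
For anyone who wants to flag this state automatically rather than by eyeballing the status output, here is a minimal Python sketch. It assumes you have already captured the status text shown above as a string; how you capture it is up to you. `find_ip_collisions` is a hypothetical helper name, not part of any Fly.io tooling, and it only does a plain substring match on the event lines.

```python
def find_ip_collisions(status_text):
    """Return the timestamps of 'Driver Failure' events that mention an IP collision.

    Assumption: each event row starts with a timestamp followed by whitespace,
    as in the 'Recent Events' table above.
    """
    hits = []
    for line in status_text.splitlines():
        # Only rows that are driver failures AND mention the collision count.
        if "Driver Failure" in line and "ip collision" in line.lower():
            timestamp = line.split(maxsplit=1)[0]
            hits.append(timestamp)
    return hits


SAMPLE = """\
2022-04-07T11:28:06Z Received        Task received by client
2022-04-07T11:28:08Z Driver Failure  rpc error: code = Unknown desc = IP allocation error: ip collision: yyy:xxx
2022-04-07T11:28:08Z Not Restarting  Error was unrecoverable
"""

if __name__ == "__main__":
    for ts in find_ip_collisions(SAMPLE):
        print(f"IP collision reported at {ts}")
```

A non-empty result only tells you the instance hit this error. As the rest of the thread shows, the fix itself had to happen on the host side.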

This seems similar to "KeyDB instance crashed, unable to start", and it is somewhat frustrating that we can't fix it on our end.

This happened when scaling the app via the UI (increased memory).

Hi, is anyone available to help here? This production Redis instance is still dead.

The only solution was to bring up a new one.

Hey there, I wanted to circle back to this to see how things were going.

It sounds like you were on the right track with your research on this error. I wanted to follow up to let you know that we took a look at that instance’s (ccad3c8e) host and cleared up an issue that looked like it might have led to the problem. You might try restarting the instance now (fly vm restart ccad3c8e) to see if that clears things up!

The only way we could get around this was to stand up another Redis instance. Do you have more information on what may have contributed to this issue and how we could avoid it happening again?

Thanks,
David

Of course! Happy to share what I can and help you better understand your apps’ environment. As you might have surmised from reading our documentation and platform blog, our platform’s implementation is a bit of a moving target, and this is useful to keep in mind.

In this case, there wouldn’t have been any way for you to predict or circumvent the failure you encountered after your application crashed. App instances that have a volume are assigned IP addresses and placed on hosts a little differently than a standalone app instance.

We recently changed the way this works, releasing the current, more performant version and eliminating the old one. A subset of hosts didn’t get the new logic, so when your app crashed it was blocked from restarting until we could fix it. I can imagine that you might want the instance-level logic to be more resilient, and we agree. It’s something we think a lot about, and we’re always experimenting with ways to improve and with ways to share this info with our users.

I hope this additional context proves useful! Let us know if there’s anything else we might be able to help with.