Instance
ID = ccad3c8e
Process =
Version = xxx
Region = xxx
Desired = run
Status = failed
Health Checks =
Restarts = 0
Created = 8m41s ago
Recent Events
TIMESTAMP TYPE MESSAGE
2022-04-07T11:28:06Z Received Task received by client
2022-04-07T11:28:06Z Task Setup Building Task Directory
2022-04-07T11:28:08Z Driver Failure rpc error: code = Unknown desc = IP allocation error: ip collision: yyy:xxx
2022-04-07T11:28:08Z Not Restarting Error was unrecoverable
2022-04-07T11:28:08Z Alloc Unhealthy Unhealthy because of failed task
2022-04-07T11:28:10Z Killing Sent interrupt. Waiting 5s before force killing
Hey there-- I wanted to circle back to this to see how things were going.
It sounds like you were on the right track with your research on this error. I wanted to follow up to let you know that we took a look at that instance’s (ccad3c8e) host, and cleared up an issue that looked like it might have led to the problem. You might try restarting the instance now (fly vm restart ccad3c8e) to see if that clears things up!
The only way we could get around this was to stand up another Redis server. Do you have more information on what may have contributed to this issue and how we could avoid it happening again?
Of course! Happy to share what I can and help you better understand your apps’ environment. As you might have surmised from reading our documentation and platform blog, our platform’s implementation is a bit of a moving target, and this is useful to keep in mind.
In this case, there wouldn’t have been any way for you to predict or circumvent the failure you encountered after your application crashed. App instances that have a volume are assigned IP addresses and placed on hosts a little differently than a standalone app instance.
We recently changed the way this worked, releasing the current, more performant version and eliminating the old way. A subset of hosts didn’t get the new logic, so when your app crashed it was blocked from restarting until we could fix it. I can imagine that you might want the instance-level logic to be more resilient, and we agree – it’s something we think a lot about, and we’re always experimenting with ways to improve it, and with ways to share this info with our users.
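To make the failure mode a little more concrete: this is purely an illustrative sketch, not our actual implementation (the names Allocator and IPCollisionError are hypothetical). It shows how a host holding a stale allocation record can produce an "ip collision" error like the one in your event log when a crashed instance tries to come back:

```python
# Toy model of an IP allocator. Hypothetical names; not Fly.io internals.

class IPCollisionError(Exception):
    pass


class Allocator:
    def __init__(self):
        self.assigned = {}  # ip -> owning instance id

    def allocate(self, ip, instance_id):
        owner = self.assigned.get(ip)
        if owner is not None and owner != instance_id:
            # A host still running the old allocation logic never released
            # the crashed instance's IP, so the restart collides with it.
            raise IPCollisionError(f"ip collision: {ip} held by {owner}")
        self.assigned[ip] = instance_id

    def release(self, ip):
        self.assigned.pop(ip, None)


alloc = Allocator()
alloc.allocate("10.0.0.1", "ccad3c8e")

# If the crash path fails to call release(), the replacement instance's
# allocation attempt fails instead of the app simply restarting:
try:
    alloc.allocate("10.0.0.1", "d1e2f3a4")
except IPCollisionError as err:
    print(err)
```

Once the stale record is cleared (the equivalent of release() here), allocation succeeds again, which is why restarting the instance worked after we fixed the host.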
I hope this additional context proves useful! Let us know if there’s anything else we might be able to help with.