Outage caused apps to die, but fly never recovered them?

We just noticed that a few apps died because of connection errors to a Redis app that is also deployed on Fly. They are still dead, and nothing has brought them back.

Another app that is now dead and won't restart was last seen logging:

Pulling image failed

This happened a few times with no prior errors, and then the app died and won't recover.

How do we avoid this happening in the future?

fly restart is completely unresponsive. How do we get these apps back up?

The only way I have found in the past to unbrick dead apps is running fly deploy, but with a complicated CI/CD setup that makes it hard to get apps back up ASAP.

Were these apps + Redis instances, by chance? We are pretty good at rescheduling apps, but when Redis needs to boot first, rescheduling may not work properly. Fixing this is currently one of our biggest ongoing projects.

Running fly secrets set is a simpler way to trigger the same process as fly deploy. fly restart just restarts VMs in place, so if there are none scheduled it won't do anything.
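For anyone who wants to script that workaround, here is a minimal sketch that builds a fly secrets set command and runs it only when asked to. The secret name REDEPLOY_MARKER and the app name my-app are hypothetical placeholders, and the timestamp value is an assumption intended to make the secret differ from its previous value each time. This is an illustration of the idea, not an official recovery procedure.

```python
import subprocess
import time
from typing import List


def redeploy_command(app_name: str, marker: str) -> List[str]:
    """Build a `fly secrets set` command for the given app.

    REDEPLOY_MARKER is a hypothetical throwaway secret; it exists only
    so that setting it triggers a new release of the app.
    """
    return ["fly", "secrets", "set", f"REDEPLOY_MARKER={marker}", "--app", app_name]


def force_redeploy(app_name: str, dry_run: bool = True) -> List[str]:
    """Return the command, and run it unless dry_run is True."""
    # Use the current Unix time so the value changes on every call.
    cmd = redeploy_command(app_name, str(int(time.time())))
    if not dry_run:
        # Raises CalledProcessError if the fly CLI exits non-zero.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Dry run: print the command instead of executing it.
    print(" ".join(force_redeploy("my-app")))
```

Calling force_redeploy("my-app", dry_run=False) would actually invoke the CLI, so it needs fly installed and authenticated on the machine running it.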

There’s not much you can do about this right this second. It will improve, and last night’s outage was somewhat unique, so there’s a low chance of the same problem occurring again.

Gotcha. This was 2 apps: one was connected to Redis, started to get errors, and then died.

The other died with its last logs showing Pulling image failed several times, and it never recovered.