Health checks failing: error waiting for vsock readiness


Our Rails app is failing deploy health checks. I deployed with fly deploy and the CLI successfully created a release, but while monitoring the deployment I got a message like this:

 6 desired, 5 placed, 3 healthy, 1 unhealthy [health checks: 1 total, 1 passing]
--> v151 failed - Failed due to unhealthy allocations - rolling back to job version 150 and deploying as v152 

When I ran fly vm status on one of the failed instances, I saw this:

TIMESTAMP               TYPE            MESSAGE                                                                              
2023-01-06T19:07:40Z    Received        Task received by client                                                             
2023-01-06T19:08:15Z    Task Setup      Building Task Directory                                                             
2023-01-06T19:13:15Z    Alloc Unhealthy Task not running by deadline                                                        
2023-01-06T19:13:51Z    Killing         Sent interrupt. Waiting 5s before force killing                                     
2023-01-06T19:22:55Z    Driver Failure  rpc error: code = Unknown desc = error waiting for vsock readiness: context canceled
2023-01-06T19:22:55Z    Not Restarting  Error was unrecoverable                                                             
2023-01-06T19:22:59Z    Killing         Sent interrupt. Waiting 5s before force killing   
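
For anyone reading a table like this, a quick way to see where the time went is to compute the gap between consecutive events. The sketch below is only an illustration: the `event_gaps` helper is a hypothetical name, and it assumes each row starts with a UTC timestamp in the format shown above. On these rows the largest gaps are the 300 seconds between "Task Setup" and "Alloc Unhealthy", which suggests the task never started before a roughly five-minute deadline, and the wait before the driver failure.

```python
from datetime import datetime

# First five event rows, copied from the fly vm status output above.
EVENTS = """\
2023-01-06T19:07:40Z    Received        Task received by client
2023-01-06T19:08:15Z    Task Setup      Building Task Directory
2023-01-06T19:13:15Z    Alloc Unhealthy Task not running by deadline
2023-01-06T19:13:51Z    Killing         Sent interrupt. Waiting 5s before force killing
2023-01-06T19:22:55Z    Driver Failure  rpc error: code = Unknown desc = error waiting for vsock readiness: context canceled
"""


def event_gaps(text):
    """Return (seconds_since_previous_event, rest_of_row) for each event row.

    Hypothetical helper: assumes every non-empty row begins with a
    timestamp like 2023-01-06T19:07:40Z followed by whitespace.
    """
    rows = []
    previous = None
    for line in text.splitlines():
        if not line.strip():
            continue
        stamp, rest = line.split(None, 1)
        when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
        gap = 0 if previous is None else int((when - previous).total_seconds())
        rows.append((gap, rest.strip()))
        previous = when
    return rows


for gap, rest in event_gaps(EVENTS):
    print(f"+{gap:>4}s  {rest}")
```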

The rollback also failed with similar messages (even though the original deployment to that version succeeded earlier today).

Our app seems to be running fine with no exceptions, and we haven’t made any recent configuration changes.

Any troubleshooting tips?

I didn’t change anything, but deploys seem to be working again. Not sure what was going on.

This was most likely the result of a VM landing on a server under some load. It’ll clear itself up, but it is disruptive to deploys. It’s a known issue and should be fixed in the next few months.

Thanks, @kurt! Do you know if there’s anything we can do to avoid this happening again in the meantime? It seems like it could be bad if we had an incident and were unable to deploy.

There’s no workaround for fly deploy yet. Machines don’t suffer from this in the same way, so if you want to take over the deploy logic and run your app on Machines, it will bypass some of the complexity that causes these kinds of problems. That’s not the easiest route, though; there’s a lot of magic in fly deploy.
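
Since the thread notes that the problem tends to clear itself up, one option in the meantime is to wrap the deploy in a simple retry loop instead of re-running it by hand. This is not a workaround for the underlying issue, only a rough sketch: `deploy_with_retries` is a hypothetical helper, the attempt count and wait time are arbitrary, and it assumes `fly deploy` exits with a non-zero status when a deployment fails.

```python
import subprocess
import time


def deploy_with_retries(run=None, attempts=3, wait_seconds=300, sleep=time.sleep):
    """Run a deploy until it succeeds or the attempts run out.

    `run` is any callable that returns an exit code. By default it shells
    out to `fly deploy` (assumption: a failed deploy exits non-zero).
    Returns the 1-based attempt number that succeeded, or None.
    """
    if run is None:
        run = lambda: subprocess.run(["fly", "deploy"]).returncode
    for attempt in range(1, attempts + 1):
        if run() == 0:
            return attempt
        if attempt < attempts:
            # Give the loaded host some time to recover before retrying.
            sleep(wait_seconds)
    return None


# Example (not executed here): retry up to 3 times, 5 minutes apart.
# deploy_with_retries()
```

Passing `run` and `sleep` as parameters keeps the loop easy to try out without touching a real app.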