I’ve had several deploys fail over the last few days but the error seems to be somewhat intermittent.
I’m running
flyctl deploy --remote-only --image registry.hub.docker.com/shieldsio/shields:next
and sometimes the deploy job will fail with “Failed due to unhealthy allocations”.
If I inspect the instance with a failed health check using
flyctl vm status <instance-id>
the output will look something like
Recent Events
TIMESTAMP             TYPE             MESSAGE
2022-05-03T12:49:20Z  Received         Task received by client
2022-05-03T12:49:20Z  Task Setup       Building Task Directory
2022-05-03T12:49:39Z  Driver Failure   rpc error: code = Unknown desc = could not set bigger stdout pipe: cannot allocate memory
2022-05-03T12:49:39Z  Not Restarting   Error was unrecoverable
2022-05-03T12:49:39Z  Alloc Unhealthy  Unhealthy because of failed task
2022-05-03T12:49:39Z  Killing          Sent interrupt. Waiting 5s before force killing
2022-05-03T12:49:40Z  Killing          Sent interrupt. Waiting 5s before force killing
and show that the cause of the failure was “rpc error: code = Unknown desc = could not set bigger stdout pipe: cannot allocate memory”.
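To check whether every unhealthy instance fails for the same reason, the "Driver Failure" events can be filtered out of the status output. This is only a rough sketch: `driver_failures` is a hypothetical helper name, and it assumes the event lines look like the ones above, with the literal text "Driver Failure" in the TYPE column.

```shell
# Hypothetical helper: print only the "Driver Failure" event lines read from stdin.
# Assumes the "Recent Events" layout shown above (TIMESTAMP, TYPE, MESSAGE).
driver_failures() {
  grep -F 'Driver Failure'
}

# Intended usage (replace <instance-id> with a real ID before running):
#   flyctl vm status <instance-id> | driver_failures
```

Running this against each failed instance in turn should show whether the "cannot allocate memory" message is the only cause or one of several.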
There are two patterns I have noticed here, but they could be red herrings:
- We have two apps in our organisation: staging and production. Staging runs one VM instance. Production runs lots of VM instances (the exact number varies but the minimum is 14). I’ve only ever seen this failure when deploying to production, not staging. This makes me think it could be some kind of concurrency-related issue, but it may just be because the sample size is larger: there are many more instances that could possibly fail when deploying to production.
- We usually kick off deploys using a GitHub workflow_dispatch action which uses superfly/flyctl-actions/setup-flyctl@master to install flyctl and then runs flyctl deploy. I’ve only ever seen this error happen when kicking off the deploy via GitHub Actions. I’ve never seen it happen when running the deploy locally. I can’t see any obvious reason for this difference given we are using remote builders. Might be coincidence. Might not.
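One thing that might help compare the GitHub Actions runs with local runs is logging the flyctl version immediately before each deploy, since the two environments may not have the same release installed. The wrapper below is a sketch under that assumption: `deploy_with_version` and the `FLYCTL` override variable are hypothetical names, and whether a version difference has anything to do with this failure is unconfirmed.

```shell
# Hypothetical wrapper: log the flyctl version, then run the same deploy command as above.
# FLYCTL may be set to use a different binary; it defaults to "flyctl".
deploy_with_version() {
  flyctl_bin="${FLYCTL:-flyctl}"
  # Record which flyctl release is in use so CI and local logs can be compared.
  echo "flyctl version: $("$flyctl_bin" version)"
  "$flyctl_bin" deploy --remote-only --image registry.hub.docker.com/shieldsio/shields:next
}
```

Calling `deploy_with_version` in both places would put the version string next to each deploy result in the logs.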
Is there any other information I can provide to help track down the cause of this?
Thanks.