rpc error: code = Unknown desc = could not set bigger stdout pipe: cannot allocate memory

I’ve had several deploys fail over the last few days but the error seems to be somewhat intermittent.

I’m running

flyctl deploy --remote-only --image registry.hub.docker.com/shieldsio/shields:next

and sometimes the deploy job will fail with Failed due to unhealthy allocations.

If I inspect the instance with a failed health check using

flyctl vm status <instance-id>

the output will look something like

Recent Events
TIMESTAMP            TYPE            MESSAGE                                                                                   
2022-05-03T12:49:20Z Received        Task received by client                                                                   
2022-05-03T12:49:20Z Task Setup      Building Task Directory                                                                   
2022-05-03T12:49:39Z Driver Failure  rpc error: code = Unknown desc = could not set bigger stdout pipe: cannot allocate memory 
2022-05-03T12:49:39Z Not Restarting  Error was unrecoverable                                                                   
2022-05-03T12:49:39Z Alloc Unhealthy Unhealthy because of failed task                                                          
2022-05-03T12:49:39Z Killing         Sent interrupt. Waiting 5s before force killing                                           
2022-05-03T12:49:40Z Killing         Sent interrupt. Waiting 5s before force killing

and show that the cause of the failure was rpc error: code = Unknown desc = could not set bigger stdout pipe: cannot allocate memory.

There are two patterns I have noticed here, but they could be red herrings:

  1. We have two apps in our organisation: staging and production. Staging runs one VM instance. Production runs many VM instances (the exact number varies, but the minimum is 14). I’ve only ever seen this failure when deploying to production, never staging. That makes me think it could be some kind of concurrency-related issue, but it may just be down to sample size: a production deploy has many more instances that could possibly fail.
  2. We usually kick off deploys using a GitHub workflow_dispatch action, which uses superfly/flyctl-actions/setup-flyctl@master to install flyctl and then runs flyctl deploy (a rough sketch of the workflow follows this list). I’ve only ever seen this error when kicking off the deploy via GitHub Actions, never when running the deploy locally. I can’t see any obvious reason for the difference given we are using remote builders. Might be coincidence. Might not.
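
For reference, the workflow looks roughly like this (a minimal sketch; the workflow, job, and secret names are placeholders rather than our exact file):

# minimal sketch of the deploy workflow; names are illustrative
name: deploy
on: workflow_dispatch

jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
    steps:
      - uses: actions/checkout@v2
      - uses: superfly/flyctl-actions/setup-flyctl@master
      - run: flyctl deploy --remote-only --image registry.hub.docker.com/shieldsio/shields:next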

Is there any other information I can provide to help track down the cause of this?
Thanks.

Does this seem to be hitting one region more than the others? It’s quite possible this is something cropping up in one of the regions your production app is running in.
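
If you want a quick way to check on your side, something like this should tally failed instances by region (a rough sketch; it assumes flyctl status --all lists failed allocations with a REGION column, as in the vm status output above):

# count failed allocations per region; the awk column index is an
# assumption and may need adjusting for your output's columns
flyctl status --all | awk '/failed/ {print $4}' | sort | uniq -c | sort -rn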

Hi @chris48s

This was the error we were getting; it turned out to be capacity issues in Sydney. That should be resolved now.

I’m seeing this again tonight. I’ve also now seen it fail when initiating the deploy locally, which rules out GitHub Actions as a factor.

Here’s the status of a VM that just failed. Our region is EWR:

$ flyctl vm status 7d42e865
Instance
  ID            = 7d42e865   
  Process       =            
  Version       = 139        
  Region        = ewr        
  Desired       = stop       
  Status        = failed     
  Health Checks =            
  Restarts      = 0          
  Created       = 1m15s ago  

Recent Events
TIMESTAMP            TYPE            MESSAGE                                                                                   
2022-05-12T17:47:12Z Received        Task received by client                                                                   
2022-05-12T17:47:12Z Task Setup      Building Task Directory                                                                   
2022-05-12T17:47:15Z Driver Failure  rpc error: code = Unknown desc = could not set bigger stdout pipe: cannot allocate memory 
2022-05-12T17:47:15Z Not Restarting  Error was unrecoverable                                                                   
2022-05-12T17:47:15Z Alloc Unhealthy Unhealthy because of failed task                                                          
2022-05-12T17:47:15Z Killing         Sent interrupt. Waiting 5s before force killing                                           
2022-05-12T17:47:15Z Killing         Sent interrupt. Waiting 5s before force killing                                           

Checks
ID SERVICE STATE OUTPUT 

Recent Logs
  2022-05-12T17:47:12.000 [info] Starting instance

Thank you for letting us know! We’ll take a close look at the ewr region to try to clear this up.

I’ve done 3 deploys in the last 2 weeks and this hasn’t recurred. Is it safe to assume this has been resolved?