Failed due to unhealthy allocations in syd


I’ve been getting this error for the last couple of hours while trying to deploy to the syd region.

Status report below:

❯ fly status --all
  Name     = vex          
  Owner    = alembic      
  Version  = 47           
  Status   = running      
  Hostname =  

Deployment Status
  ID          = a952f6ff-6b08-0d99-ea99-514cbd61c2b9                                                                                   
  Version     = v47                                                                                                                    
  Status      = failed                                                                                                                 
  Description = Failed due to unhealthy allocations - not rolling back to stable job version 47 as current job has same specification  
  Instances   = 1 desired, 1 placed, 0 healthy, 1 unhealthy                                                                            

3b1cdf0c	app    	47 ⇡   	syd   	stop   	complete	1 total, 1 critical	0       	1h11m ago           
2e6b93be	app    	46     	syd   	stop   	complete	1 total, 1 critical	0       	1h16m ago           
2418566a	app    	45     	syd   	stop   	complete	1 total, 1 critical	0       	1h33m ago           
ec087973	app    	44     	syd   	stop   	complete	1 total, 1 critical	0       	1h42m ago           
e6353a48	app    	43     	syd   	stop   	complete	1 total, 1 critical	0       	1h47m ago           
2c016110	app    	42     	syd   	run    	running 	1 total, 1 passing 	0       	1h58m ago           
c9fb5318	app    	41     	syd   	stop   	complete	1 total, 1 critical	0       	1h54m ago           
718491a9	app    	39     	syd   	stop   	complete	1 total, 1 passing 	0       	2h33m ago           
061faecc	app    	38     	syd   	stop   	complete	1 total, 1 passing 	0       	2h36m ago           
43f7c8cf	app    	37     	syd   	stop   	complete	1 total, 1 passing 	0       	3h24m ago           
d7aa258b	app    	35     	syd   	stop   	complete	1 total, 1 passing 	0       	18h9m ago           
33377ef2	app    	31     	syd   	stop   	failed  	                   	0       	18h38m ago          
dfad04e0	app    	27     	syd   	stop   	failed  	                   	0       	2022-05-09T04:07:46Z

App name is vex_liveview_prototype

I couldn’t find vex_liveview_prototype, but found vex which appears to be running. Are you still having issues?

Hi Joshua,

It deployed OK once or twice after I posted, but it still seems flaky. This is from 5 minutes ago:

 ==> Monitoring deployment

v53 is being deployed
3870b91a: syd pending
3870b91a: syd pending
3870b91a: syd running unhealthy [health checks: 1 total, 1 critical]
Failed Instances

Failure #1

--> v53 failed - Failed due to unhealthy allocations - not rolling back to stable job version 53 as current job has same specification and deploying as v54 
3870b91a	       	53     	syd   	run    	running	1 total, 1 critical	0       	4m58s ago	

--> Troubleshooting guide at
Error abort

And here’s the status:

❯ fly status --all
  Name     = vex          
  Owner    = alembic      
  Version  = 53           
  Status   = running      
  Hostname =  

Deployment Status
  ID          = 3e8c668a-d6c4-98c2-d678-ed118110679a                                                                                   
  Version     = v53                                                                                                                    
  Status      = failed                                                                                                                 
  Description = Failed due to unhealthy allocations - not rolling back to stable job version 53 as current job has same specification  
  Instances   = 1 desired, 1 placed, 0 healthy, 1 unhealthy                                                                            

3870b91a	app    	53 ⇡   	syd   	stop   	complete	1 total, 1 critical	0       	8m55s ago           
d54aedd0	app    	52     	syd   	run    	running 	1 total, 1 passing 	0       	8h52m ago           
d9891f44	app    	51     	syd   	stop   	complete	1 total, 1 critical	0       	8h39m ago           
67b06272	app    	48     	syd   	stop   	failed  	                   	0       	8h57m ago           
3b1cdf0c	app    	47     	syd   	stop   	complete	1 total, 1 critical	0       	10h28m ago          
2e6b93be	app    	46     	syd   	stop   	complete	1 total, 1 critical	0       	10h33m ago          
2418566a	app    	45     	syd   	stop   	complete	1 total, 1 critical	0       	10h50m ago          
ec087973	app    	44     	syd   	stop   	complete	1 total, 1 critical	0       	11h0m ago           
e6353a48	app    	43     	syd   	stop   	complete	1 total, 1 critical	0       	11h5m ago           
2c016110	app    	42     	syd   	stop   	complete	1 total, 1 passing 	0       	11h15m ago          
c9fb5318	app    	41     	syd   	stop   	complete	1 total, 1 critical	0       	11h11m ago          
718491a9	app    	39     	syd   	stop   	complete	1 total, 1 passing 	0       	11h50m ago          
061faecc	app    	38     	syd   	stop   	complete	1 total, 1 passing 	0       	11h53m ago          
43f7c8cf	app    	37     	syd   	stop   	complete	1 total, 1 passing 	0       	12h41m ago          
33377ef2	app    	31     	syd   	stop   	failed  	                   	0       	2022-05-10T06:53:14Z

Can you also post the results of fly checks list?

We have some capacity problems in Sydney right now, but are preparing new servers to take on the load. Meanwhile, you could deploy in another region or temporarily reduce your scaling count.

Health Checks for vex
  NAME                             | STATUS  | ALLOCATION | REGION | TYPE | LAST UPDATED | OUTPUT                                     
  3df2415693844068640885b45074b954 | passing | d54aedd0   | syd    | TCP  | 9h18m ago    | TCP connect Success[✓]  
                                   |         |            |        |      |              |                                            

Our scaling count is currently 1.

❯ flyctl scale show
VM Resources for vex
        VM Size: shared-cpu-1x
      VM Memory: 1 GB
          Count: 1
 Max Per Region: Not set

We had some issues recently where we were deployed in a backup region and performance was really bad due to the db being in syd and the app elsewhere.

What’s the timeframe for getting the new servers provisioned, Joshua?

We’re hoping to get it going today. We’ve run into some snags there with incorrectly configured networking.

Cool, happy to wait until tomorrow to try again.

I noticed a networking issue in SYD today as well, probably the one you’re already resolving: some apps in the same organization weren’t able to communicate with each other. I narrowed it down to missing DNS entries by integrating the code from fly-apps/privatenet on GitHub (examples around querying 6PN private networking on Fly) and seeing that only some of the apps had DNS entries. When I deleted and recreated everything, I had the same issue, plus a DNS entry for an app whose name was just a lowercase L, which might have been the first letter truncated from my application name. I will try again tomorrow too, as it’s quite late here.
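A quick way to see which apps are missing DNS entries, as a hedged Python sketch. It assumes the code runs inside a Fly VM (or over a WireGuard connection to the organization's private network), where apps are expected to resolve as `<app>.internal`. The helper `missing_internal_dns` and the app names in the usage note are hypothetical.

```python
import socket


def missing_internal_dns(app_names, resolve=socket.getaddrinfo):
    """Return the app names whose <app>.internal name does not resolve.

    `resolve` defaults to socket.getaddrinfo and can be swapped out for testing.
    """
    missing = []
    for name in app_names:
        try:
            # 6PN addresses are IPv6, so ask for IPv6 results only.
            resolve(f"{name}.internal", None, socket.AF_INET6)
        except socket.gaierror:
            # The resolver could not find an entry for this app.
            missing.append(name)
    return missing
```

Calling `missing_internal_dns(["web", "worker"])` from inside the private network would list whichever of those (made-up) app names currently has no entry.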

I am not sure whether this is related, but I am also having issues building and deploying apps to Sydney today.

Hi @jsierles,

Tried redeploying to syd this morning, but still getting the same failure. Did you manage to get the extra machines provisioned?

 --> Pushing image done
image size: 157 MB
==> Creating release
--> release v59 created

--> You can detach the terminal anytime without stopping the deployment
==> Release command detected: /app/bin/vex_liveview_prototype eval VexLiveviewPrototype.Release.migrate

--> This release will not be available until the release command succeeds.
	 Starting instance
	 Configuring virtual machine
	 Pulling container image
	 Unpacking image
	 Preparing kernel init
	 Configuring firecracker
	 Starting virtual machine
	 Starting init (commit: 252b7bd)...
	 Preparing to run: `/app/bin/vex_liveview_prototype eval VexLiveviewPrototype.Release.migrate` as nobody
	 2022/05/11 23:24:35 listening on [fdaa:0:59b1:a7b:66:5ea7:eb11:2]:22 (DNS: [fdaa::3]:53)
	 23:24:41.283 [info] Migrations already up
	 Main child exited normally with code: 0
	 Reaped child process with pid: 569 and signal: SIGUSR1, core dumped? false
	 Starting clean up.
==> Monitoring deployment

v59 is being deployed
--> v59 failed - Failed due to unhealthy allocations - rolling back to job version 58 and deploying as v60 

--> Troubleshooting guide at
Error abort

@martin1 this is probably not a capacity issue in Sydney. Can you run fly status --all, find the ID of a VM that failed, and then run fly vm status <id>?

It looks like maybe your app isn’t passing health checks in time.
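For anyone following along, picking the failed VM IDs out of a long status listing can be scripted. This is a minimal Python sketch, assuming the rows are tab-separated exactly as pasted in this thread (real terminal output may be space-aligned instead). `failed_allocations` and `SAMPLE` are hypothetical names, not part of flyctl.

```python
import re

# Sample rows copied from the `fly status --all` output in this thread.
SAMPLE = (
    "494f94e2\tapp    \t56 ⇡   \tsyd   \tstop   \tcomplete\t1 total, 1 critical\t0       \t11m28s ago\n"
    "d54aedd0\tapp    \t52     \tsyd   \trun    \trunning \t1 total, 1 passing \t0       \t21h51m ago\n"
    "67b06272\tapp    \t48     \tsyd   \tstop   \tfailed  \t                   \t0       \t21h55m ago\n"
)


def failed_allocations(status_text):
    """Return IDs of rows whose status is 'failed' or whose checks are critical."""
    ids = []
    for line in status_text.splitlines():
        cols = [c.strip() for c in line.split("\t")]
        # Instance rows start with an 8-character hex ID; skip everything else.
        if len(cols) < 7 or not re.fullmatch(r"[0-9a-f]{8}", cols[0]):
            continue
        status, checks = cols[5], cols[6]
        if status == "failed" or "critical" in checks:
            ids.append(cols[0])
    return ids


if __name__ == "__main__":
    # Print the follow-up command suggested above for each failed VM.
    for alloc_id in failed_allocations(SAMPLE):
        print(f"fly vm status {alloc_id}")
```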

Sure @kurt :

❯ fly status --all
  Name     = vex          
  Owner    = alembic      
  Version  = 56           
  Status   = running      
  Hostname =  

Deployment Status
  ID          = 81eed7b3-af6d-3e20-37fd-8fa68ebe0539                                                                                   
  Version     = v56                                                                                                                    
  Status      = failed                                                                                                                 
  Description = Failed due to unhealthy allocations - not rolling back to stable job version 56 as current job has same specification  
  Instances   = 1 desired, 1 placed, 0 healthy, 1 unhealthy                                                                            

494f94e2	app    	56 ⇡   	syd   	stop   	complete	1 total, 1 critical	0       	11m28s ago          	
3871f86a	app    	55     	syd   	stop   	complete	1 total, 1 critical	0       	59m52s ago          	
642ad002	app    	55     	syd   	stop   	complete	1 total, 1 critical	0       	11h36m ago          	
c859c994	app    	54     	syd   	stop   	complete	1 total, 1 critical	0       	11h41m ago          	
f0921193	app    	53     	syd   	stop   	complete	1 total, 1 critical	0       	11h48m ago          	
d54aedd0	app    	52     	syd   	run    	running 	1 total, 1 passing 	0       	21h51m ago          	
67b06272	app    	48     	syd   	stop   	failed  	                   	0       	21h55m ago          	
33377ef2	app    	31     	syd   	stop   	failed  	                   	0       	2022-05-10T06:53:14Z	

❯ fly vm status 494f94e2
  ID            = 494f94e2             
  Process       =                      
  Version       = 56                   
  Region        = syd                  
  Desired       = stop                 
  Status        = complete             
  Health Checks = 1 total, 1 critical  
  Restarts      = 0                    
  Created       = 11m49s ago           

Recent Events
TIMESTAMP            TYPE            MESSAGE                                                  
2022-05-11T23:35:14Z Received        Task received by client                                  
2022-05-11T23:35:14Z Task Setup      Building Task Directory                                  
2022-05-11T23:35:20Z Started         Task started by client                                   
2022-05-11T23:40:14Z Alloc Unhealthy Task not running for min_healthy_time of 10s by deadline 
2022-05-11T23:40:16Z Killing         Sent interrupt. Waiting 5s before force killing          
2022-05-11T23:40:34Z Terminated      Exit Code: 0                                             
2022-05-11T23:40:34Z Killed          Task successfully killed                                 

ID                               SERVICE  STATE    OUTPUT                                                 
3df2415693844068640885b45074b954 tcp-8080 critical dial tcp connect: connection refused 

Recent Logs

Can you check fly logs for anything suspicious? This is reporting that your app is not listening on port 8080, so failing health checks. This could happen, for example, if the VM runs out of memory.
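For reference, a TCP check like `tcp-8080` amounts to attempting a connection to the port, and "connection refused" means nothing accepted it. Here is a rough Python sketch of that idea, assuming it is run from somewhere that can reach the app (for example inside the VM, against localhost). `tcp_check` is a hypothetical helper, not Fly's actual checker.

```python
import socket


def tcp_check(host, port, timeout=2.0):
    """Try one TCP connection and return (ok, detail)."""
    try:
        # create_connection completes the TCP handshake or raises.
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connect success"
    except ConnectionRefusedError:
        # The host is reachable but nothing is listening on this port.
        return False, "connection refused"
    except socket.timeout:
        # No answer at all within the timeout.
        return False, "timed out"
    except OSError as exc:
        # Any other network-level failure (unreachable host, bad address, ...).
        return False, f"error: {exc}"
```

Running `tcp_check("127.0.0.1", 8080)` inside the VM would show whether the app ever binds the port. If it reports "connection refused" long after boot, the app either crashed or is listening on a different port or address.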

Hi Folks,

I think I’m having similar issues.

❯ fly status --all
  Name     = valuable-api
  Owner    = ringfence-industrial
  Version  = 85
  Status   = running
  Hostname =

ea0e588b	app    	85     	syd   	run    	running	1 total, 1 passing	0       	2022-05-05T23:37:07Z

When I run fly logs, though, the command hangs and I see no log output at all.

When I run fly checks list, I see:

Health Checks for valuable-api
  NAME                             | STATUS  | ALLOCATION | REGION | TYPE | LAST UPDATED         | OUTPUT
  3aa2b6b5b997fe9add768527b3fcd5c3 | passing | ea0e588b   | syd    | HTTP | 2022-05-05T23:38:02Z | HTTP GET 200  Output: {"status":"UP"}[✓]
                                   |         |            |        |      |                      |
                                   |         |            |        |      |                      |

This looks okay from Fly’s perspective, but my web app times out on API requests, both by DNS name and by the IP listed in the health check.

I’ve also tried restarting my app and this behaviour occurs both before and after the restart.

It feels like maybe the app is up and healthy but there is a networking issue between the app and me?

Not quite sure what happened here, but my API is accessible again. It was inaccessible for about 20 minutes but recovered without me doing anything. A ghost in the machine, perhaps 🙂

You can also see it in the firecracker load average:

That IP address is private, so you can’t hit it externally. Did you happen to try hitting directly? We didn’t have any outages, but it’s possible there was a DNS issue. UptimeRobot might tell you what the actual error was.

For what it’s worth, those load numbers are so low that it’s effectively zero.