Failed due to unhealthy allocations in syd

Health Checks for vex
  NAME                             | STATUS  | ALLOCATION | REGION | TYPE | LAST UPDATED | OUTPUT                                     
-----------------------------------*---------*------------*--------*------*--------------*--------------------------------------------
  3df2415693844068640885b45074b954 | passing | d54aedd0   | syd    | TCP  | 9h18m ago    | TCP connect 172.19.34.26:8080: Success[✓]  
                                   |         |            |        |      |              |                                            

Our scaling count is currently 1.

❯ flyctl scale show
VM Resources for vex
        VM Size: shared-cpu-1x
      VM Memory: 1 GB
          Count: 1
 Max Per Region: Not set

We had some issues recently where we were deployed into a backup region and performance was really bad because the database was in syd and the app was elsewhere.
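For reference, region placement on the Nomad platform is controlled with the fly regions commands; a rough sketch of pinning an app to syd so instances can’t land in a backup region (commands from memory, double-check against fly regions --help):

fly regions set syd       # syd becomes the only primary region
fly regions backup syd    # set the backup pool to syd as well, so nothing is scheduled elsewhere
fly regions list          # confirm the region pool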

What’s the timeframe for the new servers to get provisioned, Joshua?

We’re hoping to get it going today. We’ve run into some snags there with incorrectly configured networking.

Cool, happy to wait until tomorrow to try again.

I noticed a networking issue in SYD today as well, probably the one you’re already resolving: some apps in the same organization weren’t able to communicate with each other. I narrowed it down to missing DNS entries by running the code from GitHub - fly-apps/privatenet: Examples around querying 6PN private networking on Fly, and seeing that only some of the apps had DNS entries. When I deleted and recreated everything I had the same issue, plus a DNS entry for an app whose name was just a lowercase “l”, which might have been the first letter truncated from my application name. I will try again tomorrow too, as it’s quite late here.
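For reference, the sort of spot-check I was doing, roughly (run from inside any VM in the organization, e.g. over fly ssh console; the app name below is a placeholder):

dig +short aaaa myapp.internal @fdaa::3    # the app’s 6PN addresses; empty output means the DNS entry is missing
dig +short txt _apps.internal @fdaa::3     # comma-separated list of app names visible on the private network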

I’m not sure if this is related to this issue, but I’m also having issues building/deploying apps to Sydney today.

Hi @jsierles,

Tried redeploying to syd this morning, but still getting the same failure. Did you manage to get the extra machines provisioned?

 --> Pushing image done
image: registry.fly.io/vex-staging:deployment-1652311439
image size: 157 MB
==> Creating release
--> release v59 created

--> You can detach the terminal anytime without stopping the deployment
==> Release command detected: /app/bin/vex_liveview_prototype eval VexLiveviewPrototype.Release.migrate

--> This release will not be available until the release command succeeds.
	 Starting instance
	 Configuring virtual machine
	 Pulling container image
	 Unpacking image
	 Preparing kernel init
	 Configuring firecracker
	 Starting virtual machine
	 Starting init (commit: 252b7bd)...
	 Preparing to run: `/app/bin/vex_liveview_prototype eval VexLiveviewPrototype.Release.migrate` as nobody
	 2022/05/11 23:24:35 listening on [fdaa:0:59b1:a7b:66:5ea7:eb11:2]:22 (DNS: [fdaa::3]:53)
	 23:24:41.283 [info] Migrations already up
	 Main child exited normally with code: 0
	 Reaped child process with pid: 569 and signal: SIGUSR1, core dumped? false
	 Starting clean up.
==> Monitoring deployment

v59 is being deployed
--> v59 failed - Failed due to unhealthy allocations - rolling back to job version 58 and deploying as v60 

--> Troubleshooting guide at https://fly.io/docs/getting-started/troubleshooting/
Error abort

@martin1 this is probably not a capacity issue in Sydney. Can you run fly status --all, find the ID of a VM that failed, and then run fly vm status <id>?

It looks like maybe your app isn’t passing health checks in time.

Sure @kurt :

❯ fly status --all
App
  Name     = vex          
  Owner    = alembic      
  Version  = 56           
  Status   = running      
  Hostname = vex.fly.dev  

Deployment Status
  ID          = 81eed7b3-af6d-3e20-37fd-8fa68ebe0539                                                                                   
  Version     = v56                                                                                                                    
  Status      = failed                                                                                                                 
  Description = Failed due to unhealthy allocations - not rolling back to stable job version 56 as current job has same specification  
  Instances   = 1 desired, 1 placed, 0 healthy, 1 unhealthy                                                                            

Instances
ID      	PROCESS	VERSION	REGION	DESIRED	STATUS  	HEALTH CHECKS      	RESTARTS	CREATED              
494f94e2	app    	56 ⇡   	syd   	stop   	complete	1 total, 1 critical	0       	11m28s ago          	
3871f86a	app    	55     	syd   	stop   	complete	1 total, 1 critical	0       	59m52s ago          	
642ad002	app    	55     	syd   	stop   	complete	1 total, 1 critical	0       	11h36m ago          	
c859c994	app    	54     	syd   	stop   	complete	1 total, 1 critical	0       	11h41m ago          	
f0921193	app    	53     	syd   	stop   	complete	1 total, 1 critical	0       	11h48m ago          	
d54aedd0	app    	52     	syd   	run    	running 	1 total, 1 passing 	0       	21h51m ago          	
67b06272	app    	48     	syd   	stop   	failed  	                   	0       	21h55m ago          	
33377ef2	app    	31     	syd   	stop   	failed  	                   	0       	2022-05-10T06:53:14Z	

❯ fly vm status 494f94e2
Instance
  ID            = 494f94e2             
  Process       =                      
  Version       = 56                   
  Region        = syd                  
  Desired       = stop                 
  Status        = complete             
  Health Checks = 1 total, 1 critical  
  Restarts      = 0                    
  Created       = 11m49s ago           

Recent Events
TIMESTAMP            TYPE            MESSAGE                                                  
2022-05-11T23:35:14Z Received        Task received by client                                  
2022-05-11T23:35:14Z Task Setup      Building Task Directory                                  
2022-05-11T23:35:20Z Started         Task started by client                                   
2022-05-11T23:40:14Z Alloc Unhealthy Task not running for min_healthy_time of 10s by deadline 
2022-05-11T23:40:16Z Killing         Sent interrupt. Waiting 5s before force killing          
2022-05-11T23:40:34Z Terminated      Exit Code: 0                                             
2022-05-11T23:40:34Z Killed          Task successfully killed                                 

Checks
ID                               SERVICE  STATE    OUTPUT                                                 
3df2415693844068640885b45074b954 tcp-8080 critical dial tcp 172.19.0.90:8080: connect: connection refused 

Recent Logs

Can you check fly logs for anything suspicious? This is reporting that your app is not listening on port 8080, so failing health checks. This could happen, for example, if the VM runs out of memory.
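For anyone following along: the failing check is the TCP check defined in fly.toml. A typical definition looks roughly like this (values are illustrative, not taken from this app); a short grace_period plus a slow boot produces exactly the “not running for min_healthy_time … by deadline” event shown above:

[[services]]
  internal_port = 8080
  protocol = "tcp"

  [[services.tcp_checks]]
    grace_period = "30s"   # time the VM gets before a failing check counts against it
    interval = "15s"
    timeout = "2s"
    restart_limit = 0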

Hi Folks,

I think I’m having similar issues.

❯ fly status --all
App
  Name     = valuable-api
  Owner    = ringfence-industrial
  Version  = 85
  Status   = running
  Hostname = valuable-api.fly.dev

Instances
ID      	PROCESS	VERSION	REGION	DESIRED	STATUS 	HEALTH CHECKS     	RESTARTS	CREATED
ea0e588b	app    	85     	syd   	run    	running	1 total, 1 passing	0       	2022-05-05T23:37:07Z

When I run fly logs though, the command hangs. I see no log output at all.

When I run fly checks list I see:

Health Checks for valuable-api
  NAME                             | STATUS  | ALLOCATION | REGION | TYPE | LAST UPDATED         | OUTPUT
-----------------------------------*---------*------------*--------*------*----------------------*------------------------------------------------------------------------------------
  3aa2b6b5b997fe9add768527b3fcd5c3 | passing | ea0e588b   | syd    | HTTP | 2022-05-05T23:38:02Z | HTTP GET http://172.19.3.74:8080/actuator/health: 200  Output: {"status":"UP"}[✓]
                                   |         |            |        |      |                      |
                                   |         |            |        |      |                      |

Which looks okay from Fly’s perspective but my web app times out on API requests both by DNS name and by the IP listed in the health check.

I’ve also tried restarting my app and this behaviour occurs both before and after the restart.

It feels like maybe the app is up and healthy but there is a networking issue between the app and me?

Not quite sure what happened here, but my API is accessible again. It was inaccessible for about 20m but recovered without me doing anything. A ghost in the machine perhaps :slight_smile:

You can also see it in the firecracker load average:

That IP address is private, so you can’t hit it externally. Did you happen to try hitting https://valuable-api.fly.dev directly? We didn’t have any outages, but it’s possible there was a DNS issue. UptimeRobot might tell you what the actual error was.
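A quick way to compare the two paths, using the hostname and health-check endpoint from this thread:

curl -sv https://valuable-api.fly.dev/actuator/health    # public, via Fly’s edge proxy
curl -s http://172.19.3.74:8080/actuator/health          # private allocation address; only reachable from inside the organization’s network (another VM or a WireGuard peer)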

For what it’s worth, those load numbers are so low that it’s effectively zero.

Bit of a hobby project and very little traffic so low numbers :slight_smile:

UptimeRobot says timeout. Here’s the traceroute:

Tracing route to 109.105.216.45
hop no  -  node ip - ms
1 → 216.245.214.73 (1 ms)
2 → 63.143.63.145 (1 ms)
3 → 208.115.252.21 (1 ms)
4 → 172.22.1.17 (2 ms)
5 → 184.105.11.129 (39 ms)
6 → 184.105.81.209 (32 ms)
7 → 109.105.216.45 (15001 ms) Request timed out
8 → 109.105.216.45 (15000 ms) Request timed out
9 → 109.105.216.45 (15002 ms) Request timed out

I didn’t think to try the .fly.dev address! Maybe I’ll make another uptime robot check for that one as well.

There was a brief bit of temporary unavailability on a single syd host after we restarted an internal system service. This update should be seamless, and we’re actively investigating why it wasn’t, so these kinds of blips won’t happen again. Sorry for the inconvenience!

It’s probably worth noting that I’m now mainly seeing these failures when deploying via GH Actions. Deployment from local seems to be working more reliably now, but still fails sometimes.

Here’s the fly logs while the GH Action is running:

2022-05-13T00:22:04Z runner[868efff6] syd [info]Starting instance
2022-05-13T00:22:04Z runner[868efff6] syd [info]Configuring virtual machine
2022-05-13T00:22:04Z runner[868efff6] syd [info]Pulling container image
2022-05-13T00:22:12Z runner[868efff6] syd [info]Unpacking image
2022-05-13T00:22:14Z runner[868efff6] syd [info]Preparing kernel init
2022-05-13T00:22:15Z runner[868efff6] syd [info]Configuring firecracker
2022-05-13T00:22:15Z runner[868efff6] syd [info]Starting virtual machine
2022-05-13T00:22:15Z app[868efff6] syd [info]Starting init (commit: 252b7bd)...
2022-05-13T00:22:15Z app[868efff6] syd [info]Preparing to run: `/app/bin/vex_liveview_prototype eval VexLiveviewPrototype.Release.migrate` as nobody
2022-05-13T00:22:15Z app[868efff6] syd [info]2022/05/13 00:22:15 listening on [fdaa:0:59b1:a7b:9c3e:868e:fff6:2]:22 (DNS: [fdaa::3]:53)
2022-05-13T00:22:18Z app[868efff6] syd [info]00:22:18.744 [info] Migrations already up
2022-05-13T00:22:19Z app[868efff6] syd [info]Main child exited normally with code: 0
2022-05-13T00:22:19Z app[868efff6] syd [info]Reaped child process with pid: 569 and signal: SIGUSR1, core dumped? false
2022-05-13T00:22:19Z app[868efff6] syd [info]Starting clean up.
2022-05-13T00:22:29Z runner[dee7fb29] syd [info]Starting instance
2022-05-13T00:22:30Z runner[dee7fb29] syd [info]Configuring virtual machine
2022-05-13T00:22:30Z runner[dee7fb29] syd [info]Pulling container image
2022-05-13T00:22:31Z runner[dee7fb29] syd [info]Unpacking image
2022-05-13T00:22:31Z runner[dee7fb29] syd [info]Preparing kernel init
2022-05-13T00:22:32Z runner[dee7fb29] syd [info]Configuring firecracker
2022-05-13T00:22:33Z runner[dee7fb29] syd [info]Starting virtual machine
2022-05-13T00:22:33Z app[dee7fb29] syd [info]Starting init (commit: 252b7bd)...
2022-05-13T00:22:33Z app[dee7fb29] syd [info]Preparing to run: `/app/bin/server` as nobody
2022-05-13T00:22:33Z app[dee7fb29] syd [info]2022/05/13 00:22:33 listening on [fdaa:0:59b1:a7b:9c3e:dee7:fb29:2]:22 (DNS: [fdaa::3]:53)
2022-05-13T00:22:34Z app[dee7fb29] syd [info]Reaped child process with pid: 555, exit code: 0
2022-05-13T00:22:37Z app[dee7fb29] syd [info]00:22:37.253 [info] Running VexLiveviewPrototypeWeb.Endpoint with cowboy 2.9.0 at :::4000 (http)
2022-05-13T00:22:37Z app[dee7fb29] syd [info]00:22:37.256 [info] Access VexLiveviewPrototypeWeb.Endpoint at http://vex.fly.dev
2022-05-13T00:22:37Z app[dee7fb29] syd [info]Reaped child process with pid: 576 and signal: SIGUSR1, core dumped? false
2022-05-13T00:22:41Z app[dee7fb29] syd [info]00:22:41.297 [info] tzdata release in place is from a file last modified Fri, 22 Oct 2021 02:20:47 GMT. Release file on server was last modified Wed, 16 Mar 2022 13:36:02 GMT.
2022-05-13T00:22:43Z app[dee7fb29] syd [info]00:22:43.211 [info] Tzdata has updated the release from 2021e to 2022a

#####################    App just sits here for several minutes

2022-05-13T00:27:44Z runner[dee7fb29] syd [info]Shutting down virtual machine
2022-05-13T00:27:44Z app[dee7fb29] syd [info]00:27:44.260 [notice] SIGTERM received - shutting down
2022-05-13T00:27:44Z app[dee7fb29] syd [info]Sending signal SIGTERM to main child process w/ PID 516
2022-05-13T00:27:45Z app[dee7fb29] syd [info]Reaped child process with pid: 579 and signal: SIGUSR1, core dumped? false
2022-05-13T00:27:46Z app[dee7fb29] syd [info]Main child exited normally with code: 0
2022-05-13T00:27:46Z app[dee7fb29] syd [info]Reaped child process with pid: 560, exit code: 0
2022-05-13T00:27:46Z app[dee7fb29] syd [info]Starting clean up.


Anything else we can try here @jsierles? Our deploys are working fine from local, but every time we try to deploy via a GH Action it fails and ends up rolling back or down. Do you have an example of a working GH Action to deploy a Phoenix app to Fly?

Thanks!

Hi @martin1

Have you tried clearing your local environment and going through the GH Action steps locally? There might be something in your local environment that’s missing from the GH Action environment.
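Something along these lines can approximate that stripped-down CI environment locally, as a rough sketch (the token value is a placeholder):

env -i HOME="$HOME" PATH="$PATH" FLY_API_TOKEN="<deploy token>" flyctl deploy --remote-only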

Hi @charsleysa,

Looks like there was an issue trying to get the Git revision as an env var, and removing that has fixed the issue. Thanks for the tip. Now to figure out how to do that without breaking the build…
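In case it helps anyone later, one way to do it, sketched under the assumption that the revision only needs to reach the Dockerfile as a build arg (the GIT_REV name and the workflow fragment below are my own, not from this thread): in the Action, pass the commit SHA GitHub already exposes instead of shelling out to git in a shallow/detached checkout.

      - uses: actions/checkout@v2
      - uses: superfly/flyctl-actions/setup-flyctl@master
      - run: flyctl deploy --remote-only --build-arg GIT_REV=${{ github.sha }}
        env:
          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}

Locally, the equivalent would be flyctl deploy --build-arg GIT_REV=$(git rev-parse HEAD).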