I've been getting this error for the last couple of hours trying to deploy to the syd region.
Status report below:
❯ fly status --all
App
Name = vex
Owner = alembic
Version = 47
Status = running
Hostname = vex.fly.dev
Deployment Status
ID = a952f6ff-6b08-0d99-ea99-514cbd61c2b9
Version = v47
Status = failed
Description = Failed due to unhealthy allocations - not rolling back to stable job version 47 as current job has same specification
Instances = 1 desired, 1 placed, 0 healthy, 1 unhealthy
Instances
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
3b1cdf0c app 47 ⇡ syd stop complete 1 total, 1 critical 0 1h11m ago
2e6b93be app 46 syd stop complete 1 total, 1 critical 0 1h16m ago
2418566a app 45 syd stop complete 1 total, 1 critical 0 1h33m ago
ec087973 app 44 syd stop complete 1 total, 1 critical 0 1h42m ago
e6353a48 app 43 syd stop complete 1 total, 1 critical 0 1h47m ago
2c016110 app 42 syd run running 1 total, 1 passing 0 1h58m ago
c9fb5318 app 41 syd stop complete 1 total, 1 critical 0 1h54m ago
718491a9 app 39 syd stop complete 1 total, 1 passing 0 2h33m ago
061faecc app 38 syd stop complete 1 total, 1 passing 0 2h36m ago
43f7c8cf app 37 syd stop complete 1 total, 1 passing 0 3h24m ago
d7aa258b app 35 syd stop complete 1 total, 1 passing 0 18h9m ago
33377ef2 app 31 syd stop failed 0 18h38m ago
dfad04e0 app 27 syd stop failed 0 2022-05-09T04:07:46Z
It deployed OK once or twice after I posted, but it still seems flaky. This is from 5 mins ago:
==> Monitoring deployment
v53 is being deployed
3870b91a: syd pending
3870b91a: syd pending
3870b91a: syd running unhealthy [health checks: 1 total, 1 critical]
Failed Instances
Failure #1
Instance
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
3870b91a 53 syd run running 1 total, 1 critical 0 4m58s ago
--> v53 failed - Failed due to unhealthy allocations - not rolling back to stable job version 53 as current job has same specification and deploying as v54
--> Troubleshooting guide at https://fly.io/docs/getting-started/troubleshooting/
Error abort
❯ fly status --all
App
Name = vex
Owner = alembic
Version = 53
Status = running
Hostname = vex.fly.dev
Deployment Status
ID = 3e8c668a-d6c4-98c2-d678-ed118110679a
Version = v53
Status = failed
Description = Failed due to unhealthy allocations - not rolling back to stable job version 53 as current job has same specification
Instances = 1 desired, 1 placed, 0 healthy, 1 unhealthy
Instances
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
3870b91a app 53 ⇡ syd stop complete 1 total, 1 critical 0 8m55s ago
d54aedd0 app 52 syd run running 1 total, 1 passing 0 8h52m ago
d9891f44 app 51 syd stop complete 1 total, 1 critical 0 8h39m ago
67b06272 app 48 syd stop failed 0 8h57m ago
3b1cdf0c app 47 syd stop complete 1 total, 1 critical 0 10h28m ago
2e6b93be app 46 syd stop complete 1 total, 1 critical 0 10h33m ago
2418566a app 45 syd stop complete 1 total, 1 critical 0 10h50m ago
ec087973 app 44 syd stop complete 1 total, 1 critical 0 11h0m ago
e6353a48 app 43 syd stop complete 1 total, 1 critical 0 11h5m ago
2c016110 app 42 syd stop complete 1 total, 1 passing 0 11h15m ago
c9fb5318 app 41 syd stop complete 1 total, 1 critical 0 11h11m ago
718491a9 app 39 syd stop complete 1 total, 1 passing 0 11h50m ago
061faecc app 38 syd stop complete 1 total, 1 passing 0 11h53m ago
43f7c8cf app 37 syd stop complete 1 total, 1 passing 0 12h41m ago
33377ef2 app 31 syd stop failed 0 2022-05-10T06:53:14Z
We have some capacity problems in Sydney right now, but are preparing new servers to take on the load. Meanwhile, you could deploy in another region or temporarily reduce your scaling count.
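In case it's useful, here is a rough sketch of both workarounds with flyctl (nrt is only an example region; pick whichever suits your users):

# add another region so the deploy can land outside syd (nrt is only an example)
fly regions add nrt
fly regions list

# or temporarily run fewer instances while syd capacity is tight
fly scale count 1
fly scale show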
Health Checks for vex
NAME | STATUS | ALLOCATION | REGION | TYPE | LAST UPDATED | OUTPUT
-----------------------------------*---------*------------*--------*------*--------------*--------------------------------------------
3df2415693844068640885b45074b954 | passing | d54aedd0 | syd | TCP | 9h18m ago | TCP connect 172.19.34.26:8080: Success[✓]
| | | | | |
I noticed a networking issue in SYD today as well, probably the one you’re already resolving: some apps in the same organization weren’t able to communicate with each other. I narrowed it down to missing DNS entries by pulling in the code from GitHub - fly-apps/privatenet (examples around querying 6PN private networking on Fly) and seeing that only some of the apps had DNS entries. When I deleted and recreated everything I had the same issue, plus a DNS entry for an app whose name was just a lowercase “l”, which might have been the first letter truncated from my application name. I’ll try again tomorrow as it’s quite late here.
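For reference, the manual version of that DNS check, run from inside one of the VMs (via fly ssh console) and assuming dig is available in the image, looks roughly like this:

# apps the 6PN DNS server knows about in this organization
dig +short txt _apps.internal @fdaa::3

# private (6PN) addresses registered for a specific app, e.g. vex
dig +short aaaa vex.internal @fdaa::3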
I tried redeploying to syd this morning, but I'm still getting the same failure. Did you manage to get the extra machines provisioned?
--> Pushing image done
image: registry.fly.io/vex-staging:deployment-1652311439
image size: 157 MB
==> Creating release
--> release v59 created
--> You can detach the terminal anytime without stopping the deployment
==> Release command detected: /app/bin/vex_liveview_prototype eval VexLiveviewPrototype.Release.migrate
--> This release will not be available until the release command succeeds.
Starting instance
Configuring virtual machine
Pulling container image
Unpacking image
Preparing kernel init
Configuring firecracker
Starting virtual machine
Starting init (commit: 252b7bd)...
Preparing to run: `/app/bin/vex_liveview_prototype eval VexLiveviewPrototype.Release.migrate` as nobody
2022/05/11 23:24:35 listening on [fdaa:0:59b1:a7b:66:5ea7:eb11:2]:22 (DNS: [fdaa::3]:53)
23:24:41.283 [info] Migrations already up
Main child exited normally with code: 0
Reaped child process with pid: 569 and signal: SIGUSR1, core dumped? false
Starting clean up.
==> Monitoring deployment
v59 is being deployed
--> v59 failed - Failed due to unhealthy allocations - rolling back to job version 58 and deploying as v60
--> Troubleshooting guide at https://fly.io/docs/getting-started/troubleshooting/
Error abort
@martin1 this is probably not a capacity issue in Sydney. Can you run fly status --all, find the ID of a VM that failed, and then run fly vm status <id>?
It looks like maybe your app isn’t passing health checks in time.
❯ fly status --all
App
Name = vex
Owner = alembic
Version = 56
Status = running
Hostname = vex.fly.dev
Deployment Status
ID = 81eed7b3-af6d-3e20-37fd-8fa68ebe0539
Version = v56
Status = failed
Description = Failed due to unhealthy allocations - not rolling back to stable job version 56 as current job has same specification
Instances = 1 desired, 1 placed, 0 healthy, 1 unhealthy
Instances
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
494f94e2 app 56 ⇡ syd stop complete 1 total, 1 critical 0 11m28s ago
3871f86a app 55 syd stop complete 1 total, 1 critical 0 59m52s ago
642ad002 app 55 syd stop complete 1 total, 1 critical 0 11h36m ago
c859c994 app 54 syd stop complete 1 total, 1 critical 0 11h41m ago
f0921193 app 53 syd stop complete 1 total, 1 critical 0 11h48m ago
d54aedd0 app 52 syd run running 1 total, 1 passing 0 21h51m ago
67b06272 app 48 syd stop failed 0 21h55m ago
33377ef2 app 31 syd stop failed 0 2022-05-10T06:53:14Z
❯ fly vm status 494f94e2
Instance
ID = 494f94e2
Process =
Version = 56
Region = syd
Desired = stop
Status = complete
Health Checks = 1 total, 1 critical
Restarts = 0
Created = 11m49s ago
Recent Events
TIMESTAMP TYPE MESSAGE
2022-05-11T23:35:14Z Received Task received by client
2022-05-11T23:35:14Z Task Setup Building Task Directory
2022-05-11T23:35:20Z Started Task started by client
2022-05-11T23:40:14Z Alloc Unhealthy Task not running for min_healthy_time of 10s by deadline
2022-05-11T23:40:16Z Killing Sent interrupt. Waiting 5s before force killing
2022-05-11T23:40:34Z Terminated Exit Code: 0
2022-05-11T23:40:34Z Killed Task successfully killed
Checks
ID SERVICE STATE OUTPUT
3df2415693844068640885b45074b954 tcp-8080 critical dial tcp 172.19.0.90:8080: connect: connection refused
Recent Logs
Can you check fly logs for anything suspicious? This is reporting that your app is not listening on port 8080, so it's failing health checks. This could happen, for example, if the VM runs out of memory.
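A sketch of that next step with flyctl, using the failed allocation ID from above (the 512 MB value is only an example, in case memory does turn out to be the problem):

# logs for the specific failed allocation
fly logs -i 494f94e2

# current VM size and memory
fly scale show

# if it looks like an out-of-memory kill, try a larger VM (example value)
fly scale memory 512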
❯ fly status --all
App
Name = valuable-api
Owner = ringfence-industrial
Version = 85
Status = running
Hostname = valuable-api.fly.dev
Instances
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
ea0e588b app 85 syd run running 1 total, 1 passing 0 2022-05-05T23:37:07Z
When I run fly logs, though, the command hangs. I see no log output at all.
When I run fly checks list I see:
Health Checks for valuable-api
NAME | STATUS | ALLOCATION | REGION | TYPE | LAST UPDATED | OUTPUT
-----------------------------------*---------*------------*--------*------*----------------------*------------------------------------------------------------------------------------
3aa2b6b5b997fe9add768527b3fcd5c3 | passing | ea0e588b | syd | HTTP | 2022-05-05T23:38:02Z | HTTP GET http://172.19.3.74:8080/actuator/health: 200 Output: {"status":"UP"}[✓]
| | | | | |
| | | | | |
That looks okay from Fly’s perspective, but my web app times out on API requests both by DNS name and by the IP listed in the health check.
I’ve also tried restarting my app, and this behaviour occurs both before and after the restart.
It feels like maybe the app is up and healthy but there is a networking issue between the app and me?
Not quite sure what happened here, but my API is accessible again. It was inaccessible for about 20 minutes but recovered without me doing anything. A ghost in the machine, perhaps.
That IP address is private, so you can’t hit it externally. Did you happen to try hitting https://valuable-api.fly.dev directly? We didn’t have any outages, but it’s possible there was a DNS issue. UptimeRobot might tell you what the actual error was.
For what it’s worth, those load numbers are so low that they’re effectively zero.
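If it happens again, a quick way to narrow down where the request dies, using the same /actuator/health path your health check already hits (verbose output shows whether it's DNS, TLS, or the response that stalls):

# hit the public hostname end to end, with a hard timeout
curl -sv --max-time 10 https://valuable-api.fly.dev/actuator/health

# check what the public hostname currently resolves to
dig +short valuable-api.fly.dev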