Something strange has happened with my app: over the past half-hour or so it gradually dropped connections with its clients, and then it stopped accepting connections entirely.
I noticed there was a host issue with my machines:
We are performing emergency maintenance on a host some of your apps instances are running on in SJC. Machines on this host may be unavailable until the maintenance is completed.
So I deployed to a new region to mitigate this:
fly scale count 2 --region ewr
But this failed. Health checks are failing with “connection refused”, which sounds like an internal networking issue.
As a trick to trigger a re-deploy, I like to use fly secrets deploy, but that failed as well:
Verifying if app can be safely deployed
Creating green machines
Created machine 8731d7a06792d8 [app]
Created machine 781337db990178 [app]
Created machine 148e10e5f06078 [app]
Created machine 32879220b51578 [app]
Waiting for all green machines to start
Machine 148e10e5f06078 [app] - started
Machine 32879220b51578 [app] - started
Machine 781337db990178 [app] - started
Machine 8731d7a06792d8 [app] - created
WARN error refreshing lease for machine 9080007eb94e98: failed to get lease on VM 9080007eb94e98: unauthorized
WARN error refreshing lease for machine 148e1179f9d018: failed to get lease on VM 148e1179f9d018: unauthorized
WARN error refreshing lease for machine 568365edae1dd8: failed to get lease on VM 568365edae1dd8: unauthorized
WARN error refreshing lease for machine e286d924b0e4e8: failed to get lease on VM e286d924b0e4e8: unauthorized
WARN error refreshing lease for machine 9080007eb94e98: failed to get lease on VM 9080007eb94e98: unauthorized
WARN error refreshing lease for machine 148e1179f9d018: failed to get lease on VM 148e1179f9d018: unauthorized
WARN error refreshing lease for machine 568365edae1dd8: failed to get lease on VM 568365edae1dd8: unauthorized
WARN error refreshing lease for machine e286d924b0e4e8: failed to get lease on VM e286d924b0e4e8: unauthorized
At which point I just ^C'd.
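One thing that stands out in that output is that the machine IDs in the lease warnings don't look like the IDs of the newly created green machines. A rough Python sketch to confirm that, assuming the deploy output is pasted in as a string (the names `DEPLOY_OUTPUT` and `summarize` are just illustrative, not anything from flyctl):

```python
import re

# Relevant lines copied from the deploy output above (repeated warnings trimmed).
DEPLOY_OUTPUT = """\
Created machine 8731d7a06792d8 [app]
Created machine 781337db990178 [app]
Created machine 148e10e5f06078 [app]
Created machine 32879220b51578 [app]
WARN error refreshing lease for machine 9080007eb94e98: failed to get lease on VM 9080007eb94e98: unauthorized
WARN error refreshing lease for machine 148e1179f9d018: failed to get lease on VM 148e1179f9d018: unauthorized
WARN error refreshing lease for machine 568365edae1dd8: failed to get lease on VM 568365edae1dd8: unauthorized
WARN error refreshing lease for machine e286d924b0e4e8: failed to get lease on VM e286d924b0e4e8: unauthorized
"""


def summarize(output: str) -> tuple[set[str], set[str]]:
    """Return (IDs of created machines, IDs of machines with lease warnings)."""
    created = set(re.findall(r"^Created machine (\w+)", output, flags=re.M))
    lease_errors = set(
        re.findall(r"^WARN error refreshing lease for machine (\w+):", output, flags=re.M)
    )
    return created, lease_errors


created, lease_errors = summarize(DEPLOY_OUTPUT)
print("created:     ", sorted(created))
print("lease errors:", sorted(lease_errors))
print("overlap:     ", sorted(created & lease_errors))
```

The two sets don't overlap, so the "unauthorized" lease warnings are about four other machines, presumably the ones that existed before this deploy, though that part is a guess on my side.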
The new machines ostensibly start successfully; I can see my app’s initialization in the logs, and it appears to be listening on the correct port. But subsequently I see these two messages repeated over and over:
15:28:56 [PM05] failed to connect to machine: gave up after 15 attempts (in 8.109285926s)
15:28:56 [PR03] could not find a good candidate within 1 attempts at load balancing. last error: [PM05] failed to connect to machine: gave up after 15 attempts (in 8.085415579s)
Something is hella borked.


