Bug / inconsistent state: Errors from dead app

Our app is showing the following logs, corresponding to a vm that is supposedly dead:

2023-01-22T22:22:05Z proxy[a18bee76] ord [error]timed out while connecting to instance
2023-01-22T22:22:18Z proxy[a18bee76] ord [error]timed out while connecting to instance
2023-01-22T22:22:31Z proxy[a18bee76] ord [error]timed out while connecting to instance

Here is fly vm status claiming this instance is dead:

fly vm status a18bee76
Instance
  ID            = a18bee76
  Process       = app
  Version       = 9
  Region        = ord
  Desired       = stop
  Status        = complete
  Health Checks = 1 total, 1 passing
  Restarts      = 0
  Created       = 2022-08-20T00:49:59Z

Events
TIMESTAMP           	TYPE      	MESSAGE
2022-08-20T00:49:54Z	Received  	Task received by client
2022-08-20T00:50:18Z	Task Setup	Building Task Directory
2022-08-20T00:50:20Z	Started   	Task started by client
2023-01-22T21:46:11Z	Killing   	Sent interrupt. Waiting 5s before force killing
2023-01-22T21:46:36Z	Terminated	Exit Code: 0
2023-01-22T21:46:36Z	Killed    	Task successfully killed

Checks
ID                              	SERVICE 	STATE  	OUTPUT
23cccf88ca5d2baabe79459ed82bee94	tcp-8000	passing	TCP connect 172.19.14.34:8000: Success

Recent Logs
  2023-01-22T22:18:57Z   [error]timed out while connecting to instance
  2023-01-22T22:19:14Z   [error]timed out while connecting to instance
[...]

Prior to this, the app was deployed in ord. We moved it to iad, because it couldn’t connect to upstash redis. This are now working in iad, aside from these weird errors. As the app hadn’t been redeployed in a while, I wonder if something is broken in ord.

Actually there are lots of different proxy instance ids now showing this error, all in ord (where nothing should be running).

Its as if old instances of our app are still registered to e.g. a fly-shared load balancer in ord?

error.message="timed out while connecting to instance" 2023-01-22T22:29:28Z proxy[dac387c9] ord [error]request.method="GET" request.id="01GQDSKGV24H43YAQADS2HE0YD-sjc"
error.message="timed out while connecting to instance" 2023-01-22T22:29:29Z proxy[03d1855b] ord [error]request.method="GET" request.id="01GQDSKGV24H43YAQADS2HE0YD-sjc"
error.message="timed out while connecting to instance" 2023-01-22T22:29:31Z proxy[313ecb15] ord [error]request.method="GET" request.id="01GQDSKNPX404XPB0MVFHM3F0X-sjc"
error.message="timed out while connecting to instance" 2023-01-22T22:29:32Z proxy[03d1855b] ord [error]request.method="GET" request.id="01GQDSKNPX404XPB0MVFHM3F0X-sjc"
error.message="timed out while connecting to instance" 2023-01-22T22:29:33Z proxy[a18bee76] ord [error]request.method="GET" request.id="01GQDSKNPX404XPB0MVFHM3F0X-sjc"
error.message="timed out while connecting to instance" 2023-01-22T22:29:34Z proxy[dac387c9] ord [error]request.method="GET" request.id="01GQDSKNPX404XPB0MVFHM3F0X-sjc"

Potentially seeing the same thing on our end in iad. These instance IDs are all dead:

2023-01-22T22:33:00Z proxy[9cd07cb8] iad [error]timed out while connecting to instance
2023-01-22T22:33:00Z proxy[9e6437ff] iad [error]timed out while connecting to instance
2023-01-22T22:33:00Z proxy[35044dfc] iad [error]timed out while connecting to instance

Here’s the output of fly vm status

Instance
  ID            = 35044dfc
  Process       = app
  Version       = 27
  Region        = iad
  Desired       = stop
  Status        = complete
  Health Checks = 2 total, 2 passing
  Restarts      = 0
  Created       = 34m0s ago

Events
TIMESTAMP               TYPE            MESSAGE
2023-01-22T22:02:42Z    Received        Task received by client
2023-01-22T22:03:02Z    Task Setup      Building Task Directory
2023-01-22T22:03:07Z    Started         Task started by client
2023-01-22T22:19:21Z    Killing         Sent interrupt. Waiting 5s before force killing
2023-01-22T22:19:37Z    Terminated      Exit Code: 143
2023-01-22T22:19:37Z    Killed          Task successfully killed
2023-01-22T22:19:37Z    Killing         Sent interrupt. Waiting 5s before force killing

Checks
ID                                      SERVICE         STATE   OUTPUT
83010af023fdad91452a29a04d3df276        tcp-3000        passing HTTP GET http://172.19.67.202:3000/api/health: 200 OK Output: {"status":"ok"}
922782e87ff4b0d0c109fe795d94fb5b        tcp-3000        passing TCP connect 172.19.67.202:3000: Success

Taking a look. Possibly related to Fly.io Status - Creating or listing checks for Postgres clusters return errors (even if these aren’t postgres clusters).

2 Likes

I think I am seeing similar issues of errors corresponding to dead VMs. I pushed some changes today, so I’m not sure if it is my own doing, but avg HTTP request times have shot up for me.

023-01-22T23:02:24Z proxy[61c9b867] lax [error]timed out while connecting to instance
2023-01-22T23:02:25Z proxy[cd4d8e9d] lax [error]timed out while connecting to instance
2023-01-22T23:02:26Z proxy[81571c5e] lax [error]timed out while connecting to instance
2023-01-22T23:02:31Z proxy[9a7a594a] lax [error]timed out while connecting to instance
2023-01-22T23:02:33Z proxy[81571c5e] lax [error]timed out while connecting to instance
Instances
ID          PROCESS VERSION REGION  DESIRED STATUS  HEALTH CHECKS       RESTARTS    CREATED
586fcfee    app     443     ams     run     running 1 total, 1 passing  0           38m46s ago  
c7b4b09a    app     443     lax     run     running 1 total, 1 passing  0           39m33s ago  

Yes, I created a different incident, still investigating, but I believe the situation is improving right now.

1 Like

Things should be a lot better right now.

Looks good, thanks Jerome!