Error: failed to connect to fly machine: Supposedly started, and not stopped

Since (~10:30 UTC) 4 hours ago (of what I know), all Machines in app, udns, have failed to respond to any TCP or HTTP request.

There have been no deployments since 15 hours or so back.

Here are the only logs from the service, right now:

proxy[0e286074a97686] syd [error] Error: failed to connect to fly machine: Supposedly started, and not stopped, but: connection refused by remote server

logs:

2022-09-20T11:55:28.712 proxy[0e286074a97686] syd [error] Error: failed to connect to fly machine: Supposedly started, and not stopped, but: connection refused by remote server

2022-09-20T11:55:28.744 proxy[9e78424ebd5783] nrt [error] Error: failed to connect to fly machine: Supposedly started, and not stopped, but: connection refused by remote server

2022-09-20T11:55:28.763 proxy[9e78424ebd5783] nrt [error] Error: failed to connect to fly machine: Supposedly started, and not stopped, but: connection refused by remote server

2022-09-20T11:55:28.814 proxy[6e82559a296187] iad [error] Error: failed to connect to fly machine: Supposedly started, and not stopped, but: connection refused by remote server

2022-09-20T11:55:29.119 proxy[73287114c44085] gru [error] Error: failed to connect to fly machine: Supposedly started, and not stopped, but: connection refused by remote server

2022-09-20T11:55:29.314 proxy[6e82559a296187] iad [error] Error: failed to connect to fly machine: Supposedly started, and not stopped, but: connection refused by remote server

2022-09-20T11:55:29.496 proxy[73287114c44085] gru [error] Error: failed to connect to fly machine: Supposedly started, and not stopped, but: connection refused by remote server

2022-09-20T11:55:29.575 proxy[6e82559a296187] iad [error] Error: failed to connect to fly machine: Supposedly started, and not stopped, but: connection refused by remote server

2022-09-20T11:55:29.613 proxy[5683991c16e28e] cdg [error] Error: failed to connect to fly machine: Supposedly started, and not stopped, but: connection refused by remote server

2022-09-20T11:55:30.040 proxy[6e82975b296287] fra [error] Error: failed to connect to fly machine: Supposedly started, and not stopped, but: connection refused by remote server

2022-09-20T11:55:30.721 proxy[6e82559a296187] iad [error] Error: failed to connect to fly machine: Supposedly started, and not stopped, but: connection refused by remote server

2022-09-20T11:55:31.226 proxy[6e82975b296287] fra [error] Error: failed to connect to fly machine: Supposedly started, and not stopped, but: connection refused by remote server

2022-09-20T11:55:33.718 proxy[e784375b066683] mad [error] Error: failed to connect to fly machine: Supposedly started, and not stopped, but: connection refused by remote server

2022-09-20T11:55:33.770 proxy[73287114c44085] gru [error] Error: failed to connect to fly machine: Supposedly started, and not stopped, but: connection refused by remote server

2022-09-20T11:55:34.569 proxy[6e82975b296287] fra [error] Error: failed to connect to fly machine: Supposedly started, and not stopped, but: connection refused by remote server

2022-09-20T11:55:34.599 proxy[73287114c44085] gru [error] Error: failed to connect to fly machine: Supposedly started, and not stopped, but: connection refused by remote server

2022-09-20T11:55:34.780 proxy[6e82975b296287] fra [error] Error: failed to connect to fly machine: Supposedly started, and not stopped, but: connection refused by remote server

2022-09-20T11:55:35.058 proxy[e784375b066683] mad [error] Error: failed to connect to fly machine: Supposedly started, and not stopped, but: connection refused by remote server

2022-09-20T11:55:35.173 proxy[6e82975b296287] fra [error] Error: failed to connect to fly machine: Supposedly started, and not stopped, but: connection refused by remote server

2022-09-20T11:55:35.294 proxy[6e82559a296187] iad [error] Error: failed to connect to fly machine: Supposedly started, and not stopped, but: connection refused by remote server

2022-09-20T11:55:35.420 proxy[0e286074a97686] syd [error] Error: failed to connect to fly machine: Supposedly started, and not stopped, but: connection refused by remote server

2022-09-20T11:55:35.698 proxy[5683991c16e28e] cdg [error] Error: failed to connect to fly machine: Supposedly started, and not stopped, but: connection refused by remote server

2022-09-20T11:55:36.098 proxy[73287114c44085] gru [error] Error: failed to connect to fly machine: Supposedly started, and not stopped, but: connection refused by remote server

2022-09-20T11:55:36.682 proxy[e784375b066683] mad [error] Error: failed to connect to fly machine: Supposedly started, and not stopped, but: connection refused by remote server 

I couldn’t flyctl ssh into any of the Machine VMs, too.

LOG_LEVEL=DEBUG fly ssh console -a udns out:
DEBUG Retrieving app info for udns
DEBUG --> POST https://api.fly.io/graphql

{
  "query": "query ($appName: String!) { appcompact:app(name: $appName) { id name hostname deployed status appUrl platformVersion organization { id slug } postgresAppRole: role { name } imageDetails { repository version } } }",
  "variables": {
    "appName": "udns"
  }
}

DEBUG {}
DEBUG querying for release resulted to v0.0.393
DEBUG <-- 200 https://api.fly.io/graphql (430.68ms)

{
  "data": {
    "appcompact": {
      "id": "udns",
      "name": "udns",
      "hostname": "udns.fly.dev",
      "deployed": true,
      "status": "deployed",
      "appUrl": "https://137.66.7.89",
      "platformVersion": "machines",
      "organization": {
        "id": "ekQlpPKZYvQjGH4o4ge0Y82e0Oh8m5",
        "slug": "celzero"
      },
      "postgresAppRole": null,
      "imageDetails": {
        "repository": "unknown",
        "version": "unknown"
      }
    }
  }
}
DEBUG --> POST https://api.fly.io/graphql

{
  "query": "mutation($input: ValidateWireGuardPeersInput!) { validateWireGuardPeers(input: $input) { invalidPeerIps } }",
  "variables": {
    "input": {
      "peerIps": [
        "fdaa:0:35f3:a7b:1bfe:0:a:702"
      ]
    }
  }
}

DEBUG {}
DEBUG <-- 200 https://api.fly.io/graphql (2.03s)

{
  "data": {
    "validateWireGuardPeers": {
      "invalidPeerIps": []
    }
  }
}
DEBUG --> POST https://api.fly.io/graphql

{
  "query": "mutation($input: ValidateWireGuardPeersInput!) { validateWireGuardPeers(input: $input) { invalidPeerIps } }",
  "variables": {
    "input": {
      "peerIps": [
        "fdaa:0:35f3:a7b:1bfe:0:a:702"
      ]
    }
  }
}

DEBUG {}
DEBUG <-- 200 https://api.fly.io/graphql (270.6ms)

{
  "data": {
    "validateWireGuardPeers": {
      "invalidPeerIps": []
    }
  }
}
DEBUG --> GET http://[fdaa:0:35f3::3]:4280/v1/apps/udns/machines

DEBUG <-- 200 http://[fdaa:0:35f3::3]:4280/v1/apps/udns/machines (1m0.03s)

[
  {
    "id": "23908066ec6875",
    "name": "udns-vin",
    "state": "started",
    "region": "",
    "instance_id": "",
    "private_ip": "",
    "config": null,
    "created_at": "1970-01-01T00:00:00Z",
    "updated_at": "0001-01-01T00:00:00Z"
  },
  {
    "id": "73287340a36285",
    "name": "udns-hkg",
    "state": "started",
    "region": "hkg",
    "instance_id": "01GDBTRJ2T2NEMBTRZNVPCH5S4",
    "private_ip": "fdaa:0:35f3:a7b:7f07:f613:60ba:2",
    "config": {
      "env": {
        "CLOUD_PLATFORM": "fly",
        "LOG_LEVEL": "info",
        "MACHINES_TIMEOUT_SEC": "180",
        "NODE_ENV": "production"
      },
      "init": {
        "exec": null,
        "entrypoint": null,
        "cmd": null,
        "tty": false
      },
      "image": "registry.fly.io/udns:deployment-01GDBJYPWZQT6N34AZ437M6C26",
      "metadata": {
        "process_group": "app"
      },
      "restart": {
        "policy": ""
      },
      "services": [
        {
          "protocol": "tcp",
          "internal_port": 8080,
          "ports": [
            {
              "port": 443
            },
            {
              "port": 8080
            }
          ],
          "concurrency": {
            "type": "connections",
            "hard_limit": 75,
            "soft_limit": 60
          }
        },
        {
          "protocol": "tcp",
          "internal_port": 10000,
          "ports": [
            {
              "port": 853
            },
            {
              "port": 10000
            }
          ],
          "concurrency": {
            "type": "connections",
            "hard_limit": 75,
            "soft_limit": 60
          }
        }
      ],
      "guest": {
        "cpu_kind": "shared",
        "cpus": 1,
        "memory_mb": 256
      },
      "metrics": null
    },
    "image_ref": {
      "registry": "registry.fly.io",
      "repository": "udns",
      "tag": "deployment-01GDBJYPWZQT6N34AZ437M6C26",
      "digest": "sha256:baba8a10af4d87038621bee9cd5eec98e0884c23c69fdeb57c4a302d0d4e52ac",
      "labels": {}
    },
    "created_at": "2022-09-18T00:43:11Z",
    "updated_at": "2022-09-20T03:06:29Z",
    "events": [
      {
        "type": "start",
        "status": "started",
        "source": "flyd",
        "timestamp": 1663643189335
      },
      {
        "type": "start",
        "status": "starting",
        "source": "user",
        "timestamp": 1663643188841
      },
      {
        "type": "exit",
        "status": "stopped",
        "request": {
          "exit_event": {
            "exit_code": 0,
            "exited_at": 1663643169004,
            "guest_exit_code": 0,
            "guest_signal": -1,
            "oom_killed": false,
            "requested_stop": false,
            "restarting": false,
            "signal": -1
          },
          "restart_count": 0
        },
        "source": "flyd",
        "timestamp": 1663643169328
      },
      {
        "type": "start",
        "status": "started",
        "source": "flyd",
        "timestamp": 1663642036547
      },
      {
        "type": "start",
        "status": "starting",
        "source": "user",
        "timestamp": 1663642036183
      },
      {
        "type": "exit",
        "status": "stopped",
        "request": {
          "exit_event": {
            "exit_code": 0,
            "exited_at": 1663642031471,
            "guest_exit_code": 0,
            "guest_signal": -1,
            "oom_killed": false,
            "requested_stop": false,
            "restarting": false,
            "signal": -1
          },
          "restart_count": 0
        },
        "source": "flyd",
        "timestamp": 1663642032047
      }
    ]
  },
  ....
    ]
  }
]

Oops, something went wrong! Could you try that again?

I am not sure what caused this behaviour as the Machines were working for 10 hours before the event, just fine. The only odd thing I notice is tcp_checks are at their default (but apparently that’s an expected limitation of the current flyctl deploy -c ... for Machines apps), and seem very aggressive (for udns):

➜ fly config display -a udns
            "tcp_checks": [
                {
                    "grace_period": "1s",
                    "interval": "15s",
                    "restart_limit": 0,
                    "timeout": "2s"
                }
            ]
fly m list -a udns machines:
23 machines have been retrieved.
udns
ID            	NAME    	STATE  	REGION	IMAGE                                     	IP ADDRESS                      	VOLUME	CREATED             	LAST UPDATED         
23908066ec6875	udns-vin	started	      	:                                         	                                	      	1970-01-01T00:00:00Z	0001-01-01T00:00:00Z	
73287340a36285	udns-hkg	started	hkg   	udns:deployment-01GDBJYPWZQT6N34AZ437M6C26	fdaa:0:35f3:a7b:7f07:f613:60ba:2	      	2022-09-18T00:43:11Z	2022-09-20T03:06:29Z	
3d8dd1efe40989	udns-ams	started	ams   	udns:deployment-01GDBJYPWZQT6N34AZ437M6C26	fdaa:0:35f3:a7b:23c5:eb5b:a06b:2	      	2022-09-19T19:25:21Z	2022-09-20T08:43:44Z	
9e78424ebd5783	udns-nrt	started	nrt   	udns:deployment-01GDBJYPWZQT6N34AZ437M6C26	fdaa:0:35f3:a7b:aaa:206:68e0:2  	      	2022-09-18T00:39:18Z	2022-09-20T05:26:40Z	
e784375b066683	udns-mad	started	mad   	udns:deployment-01GDBJYPWZQT6N34AZ437M6C26	fdaa:0:35f3:a7b:2656:8569:420:2 	      	2022-09-18T00:44:26Z	2022-09-20T03:51:24Z	
1781994c9d9089	udns-ord	started	ord   	udns:deployment-01GDBJYPWZQT6N34AZ437M6C26	fdaa:0:35f3:a7b:b0bb:7e80:f1ad:2	      	2022-09-18T00:46:05Z	2022-09-20T06:20:00Z	
06e82557bd2987	udns-dfw	started	dfw   	udns:deployment-01GDBJYPWZQT6N34AZ437M6C26	fdaa:0:35f3:a7b:2203:b916:9f80:2	      	2022-09-18T00:44:50Z	2022-09-20T03:00:31Z	
6e82975b296287	udns-fra	started	fra   	udns:deployment-01GDBJYPWZQT6N34AZ437M6C26	fdaa:0:35f3:a7b:b9b8:a0ab:ad59:2	      	2022-09-18T00:43:51Z	2022-09-20T02:32:01Z	
5683991c16e28e	udns-cdg	started	cdg   	udns:deployment-01GDBJYPWZQT6N34AZ437M6C26	fdaa:0:35f3:a7b:aebf:737e:d205:2	      	2022-09-18T00:38:00Z	2022-09-20T02:49:07Z	
3d8d991fe14289	udns-lax	started	lax   	udns:deployment-01GDBJYPWZQT6N34AZ437M6C26	fdaa:0:35f3:a7b:85:148c:bf5e:2  	      	2022-09-18T00:42:46Z	2022-09-20T04:30:36Z	
6e82559a296187	udns-iad	started	iad   	udns:deployment-01GDBJYPWZQT6N34AZ437M6C26	fdaa:0:35f3:a7b:21e0:9463:9892:2	      	2022-09-18T01:20:05Z	2022-09-20T03:33:17Z	
73287119f67285	udns-sin	started	sin   	udns:deployment-01GDBJYPWZQT6N34AZ437M6C26	fdaa:0:35f3:a7b:232f:561:a9c6:2 	      	2022-09-18T00:41:28Z	2022-09-20T02:03:24Z	
39080554b5ed87	udns-sjc	started	sjc   	udns:deployment-01GDBJYPWZQT6N34AZ437M6C26	fdaa:0:35f3:a7b:b2e2:1878:ee0b:2	      	2022-09-18T00:40:34Z	2022-09-20T04:34:37Z	
7328726b001085	udns-ewr	started	ewr   	udns:deployment-01GDBJYPWZQT6N34AZ437M6C26	fdaa:0:35f3:a7b:ab3:6adc:89ee:2 	      	2022-09-17T23:30:21Z	2022-09-20T04:23:02Z	
73287171b5d385	udns-maa	started	maa   	udns:deployment-01GDBJYPWZQT6N34AZ437M6C26	fdaa:0:35f3:a7b:14bf:c355:dc01:2	      	2022-09-18T01:20:24Z	2022-09-19T21:58:57Z	
d5683061b4158e	udns-mia	started	mia   	udns:deployment-01GDBJYPWZQT6N34AZ437M6C26	fdaa:0:35f3:a7b:2cc3:a2b6:8dce:2	      	2022-09-18T00:49:49Z	2022-09-20T02:42:59Z	
3d8d340be44089	udns-lhr	started	lhr   	udns:deployment-01GDBJYPWZQT6N34AZ437M6C26	fdaa:0:35f3:a7b:aa9c:bc2a:6b28:2	      	2022-09-18T00:42:09Z	2022-09-20T01:16:19Z	
73287114c44085	udns-gru	started	gru   	udns:deployment-01GDBJYPWZQT6N34AZ437M6C26	fdaa:0:35f3:a7b:1f60:f6e5:9160:2	      	2022-09-18T00:40:59Z	2022-09-20T08:40:37Z	
0e286074a97686	udns-syd	started	syd   	udns:deployment-01GDBJYPWZQT6N34AZ437M6C26	fdaa:0:35f3:a7b:9e6b:360b:afc8:2	      	2022-09-18T00:40:00Z	2022-09-20T01:35:26Z	
3d8dd59fe15689	udns-sea	started	sea   	udns:deployment-01GDBJYPWZQT6N34AZ437M6C26	fdaa:0:35f3:a7b:2dbb:d65b:d958:2	      	2022-09-17T23:29:33Z	2022-09-20T01:09:30Z	
d568373eb628e9	udns-scl	started	scl   	udns:deployment-01GDBJYPWZQT6N34AZ437M6C26	fdaa:0:35f3:a7b:d33b:925:71f8:2 	      	2022-09-18T00:52:55Z	2022-09-20T01:50:11Z	
0e286035f43867	udns-yul	started	yul   	udns:deployment-01GDBJYPWZQT6N34AZ437M6C26	fdaa:0:35f3:a7b:b729:2dea:5e07:2	      	2022-09-18T00:56:11Z	2022-09-19T23:16:26Z	
24d891dec4d687	udns-yyz	started	yyz   	udns:deployment-01GDBJYPWZQT6N34AZ437M6C26	fdaa:0:35f3:a7b:aa2:e626:2efe:2 	      	2022-09-18T00:47:38Z	2022-09-20T02:21:56Z	

Also notice (from the logs) that a Machine deployed to vin has gone.

The two Machines (in maa73287171b5d385, yyz24d891dec4d687) I fly m updated have since transitioned to the stopped state (whereas, others are have been in the started state for a long time now). But: I still can’t fly ssh into them (or any Machine, for that matter).

Ok, so I can confirm that this should be resolved. A fix went out a few days ago, alongside our ongoing infra improvements for machines.

The tl;dr for this one is that we recently were working on fly-proxy’s load-balancing performance.

This worked great for normal apps, but flapped quite a bit for machines. So we reverted the changes to fly-proxy’s load balancing. We’re currently working on other ways to get the same performance benefits :slightly_smiling_face:

1 Like

Thanks a lot for getting back! I want to move 25% of our production traffic over to Fly by the year end. With that in mind:

  1. What are the chances of this happening again (as in, the bug that went in, was it rare enough of an edge-case? I don’t see many other such reports, just my app that went down… what was specific to my app that put a target on its head?).
  2. Can Fly auto-alert (like emails that go out on high bandwidth use) when all Machines in an app go down?
  3. I’d want to set an alert for when it happens, but it isn’t clear how to do so, other than passively monitoring via an external service (which I’d rather avoid having to code up and maintain).

I have left these 20+ odd non-responsive Machines-- which are in started state (and likely racking up those billing hours)-- intact just for debugging. I can tear them down if it wouldn’t be of use, anymore?

I have left these 20+ odd non-responsive Machines-- which are in started state (and likely racking up those billing hours)-- intact just for debugging. I can tear them down if it wouldn’t be of use, anymore?

Yes, feel free to tear them down whenever you’d like!

  1. What are the chances of this happening again (as in, the bug that went in, was it rare enough of an edge-case? I don’t see many other such reports, just our app that went down… what was specific to my app that put a target on its head?).

I’d guess that your app was more susceptible to this problem because of the number of machines in use and the way that they’re distributed across regions.

I don’t think it’s likely that you’ll run into this issue with your production traffic. But if you are planning on a big migration, our launch plan might help ease any hiccups you hit throughout the process.

  1. Can Fly auto-alert (like emails that go out on high bandwidth use) when all Machines in an app go down?

I’m not sure, but it’s a neat idea-- I’ll find out.

  1. I’d want to set an alert for when it happens, but it isn’t clear how to do so, other than passively monitoring via an external service (which I’d rather avoid having to code up and maintain).

You’ve probably already considered this, so it probably won’t meet your needs, but fly status --all should display the current state of all machines on an app.

1 Like

Thank you.

I do plan to take it up once we build some confidence in how we’d operate the service with Fly Machines.

Hm, could this issue-- I recently deployed udns, a Machine app, with flyctl to all available Fly regions, and found that, one VM is spun-up per connection instead of the connection being sent to an already alive VM (link)-- that we encountered be related?

1 Like

I am unable to recover udns app (fly deploy) because of a zombie Machine in vin which I can neither remove nor update:

23908066ec6875	udns-vin	started	 :     1970-01-01T00:00:00Z	0001-01-01T00:00:00Z	
# update
➜ fly m update 23908066ec6875 -a udns
Error failed to get VM 23908066ec6875: nats: no responders available for request
# remove
➜ fly m remove 23908066ec6875 -f -a udns
Error could not retrieve machine 23908066ec6875

Is there a way for you folks to get rid of it?

Unfortunately, this has started happening again (with different errors):

2022-09-24T14:30:36Z proxy[d568373eb628e9] scl [error]
Error: failed to connect to fly machine: Supposedly started, and not stopped, but: connection refused by remote server

2022-09-24T14:40:47Z proxy[d5683061b4158e] mia [error]
Error: while TCP-proxying data to/from app: failed to copy (direction=client->server, error=Broken pipe (os error 32))

2022-09-24T14:47:23Z proxy[73287119f67285] sin [error]
Error: while TCP-proxying data to/from app: failed to copy (direction=client->server, error=Transport endpoint is not connected (os error 107))

2022-09-24T14:50:09Z proxy[9e78424ebd5783] nrt [error]
Error: while TCP-proxying data to/from app: failed to copy (direction=server->client, error=Connection reset by peer (os error 104))

2022-09-24T14:47:12Z proxy[73287171b5d385] maa [error]
Error: an error occured while trying to wake up a fly machine

Some of the VMs are up and responsive, but some have entered this state (so far, ewr7328726b001085, nrt9e78424ebd5783, sin73287119f67285, miad5683061b4158e, hkg73287340a36285, cdg5683991c16e28e, fra6e82975b296287, maa73287171b5d385 are down and out, while vin23908066ec6875 lingers around as a zombie machine).

I hope it isn’t our code but it is open source, in case there’s enough bandwidth and will to want to deploy and test it out.

To answer my own Qs:

Error: failed to connect to fly machine: Supposedly started, and not stopped, but: connection refused by remote server

Happens when the proxy is trying to connect to a machine VM that has stopped listening on service ports but is yet to exit. In case of our app, this was a result of Node talking its own merry time in closing out other resources long after it had stopped listening for incoming connections.

Ideally, once health-checks are up and working for Machines, this scenario should auto-recover as the Machine should be auto-restarted once health-checks start failing (which they will if process isn’t listening on service ports): How do I change the restart policy for Machines? - #8 by ignoramous

Error: while TCP-proxying data to/from app: failed to copy

Most probably the client / server half-closed the connection early.

Error: an error occured while trying to wake up a fly machine

Not sure, but this auto-recovers (as in, the Machine wakes up the next time, just fine).

udns-vin is a zombie, unreachable, unresponsive machine, but is in started state per fly status. While the zombie vm has broken some flyctl m commands, my real worry is getting billed for it.

How do we get rid of it (I ask since none of the flyctl m commands work)?

23908066ec6875 vin started

Sorry for taking so long to get this resolved. We had some networking issues in that region and have since destroyed the zombie machine and removed the region from being able to deploy machines.

1 Like