fly proxy hanging (and other problems)

It said it couldn’t connect to the builder.

So I destroyed the builder.

Ran deploy again, and it’s doing the exact same thing.

WARN Remote builder did not start in time. Check remote builder logs with flyctl logs -a fly-builder-damp-fire-2944

WARN Failed to start remote builder heartbeat: remote builder app unavailable

Error: failed to fetch an image or build from source: error connecting to docker: server returned a non-200 status code: 504

I check the logs, and all it’s doing is waiting for activity. No errors. Nothing.

Can you rerun this with LOG_LEVEL=debug? That should have a request ID I can look into.

I can’t see any request IDs yet, but I am getting quite a bit of this so far:

DEBUG failed to connect metrics websocket: websocket.Dial wss://flyctl-metrics.fly.dev/socket: bad status

I don’t think that would cause the issues you’re seeing. Can you share the entire output?

Sure:

DEBUG Loaded flyctl config from /Users/?/.fly/config.yml
DEBUG determined hostname: "mike.local"
DEBUG determined working directory: "/Users/?/code/bot"
DEBUG determined user home directory: "/Users/?"
DEBUG determined config directory: "/Users/?/.fly"
DEBUG ensured config directory exists.
DEBUG ensured config directory perms.
DEBUG cache loaded.
DEBUG config initialized.
DEBUG skipped querying for new release
DEBUG client initialized.
DEBUG app config loaded from /Users/?/code/bot/fly.toml
DEBUG --> POST https://api.fly.io/graphql

DEBUG {
  "query": "query ($appName: String!) { appbasic:app(name: $appName) { id name platformVersion organization { id slug paidPlan } } }",
  "variables": {
    "appName": "?"
  }
}


DEBUG {}
DEBUG <-- 200 https://api.fly.io/graphql (534.81ms)

DEBUG {
  "data": {
    "appbasic": {
      "id": "?",
      "name": "?",
      "platformVersion": "machines",
      "organization": {
        "id": "yw2NK09lG62eytyo75jV2OwzbVT3O05qR",
        "slug": "{org name}",
        "paidPlan": false
      }
    }
  }
}

==> Verifying app config
DEBUG Starting task manager
DEBUG Config has metrics token

Validating /Users/?/code/bot/fly.toml
Platform: machines
✓ Configuration is valid
--> Verified app config
DEBUG --> POST https://api.fly.io/graphql

DEBUG {
  "query": "query ($appName: String!) { appcompact:app(name: $appName) { id name hostname deployed status appUrl platformVersion organization { id slug paidPlan } postgresAppRole: role { name } imageDetails { repository version } } }",
  "variables": {
    "appName": "?"
  }
}


DEBUG {}
DEBUG <-- 200 https://api.fly.io/graphql (594.12ms)

DEBUG {
  "data": {
    "appcompact": {
      "id": "?",
      "name": "?",
      "hostname": "?.fly.dev",
      "deployed": true,
      "status": "deployed",
      "appUrl": null,
      "platformVersion": "machines",
      "organization": {
        "id": "yw2NK09lG62eytyo75jV2OwzbVT3O05qR",
        "slug": "{org name}",
        "paidPlan": false
      },
      "postgresAppRole": null,
      "imageDetails": {
        "repository": "?",
        "version": null
      }
    }
  }
}

==> Building image
DEBUG trying remote docker daemon
DEBUG --> POST https://api.fly.io/graphql

DEBUG {
  "query": "mutation($input: EnsureMachineRemoteBuilderInput!) { ensureMachineRemoteBuilder(input: $input) { machine { id state ips { nodes { family kind ip } } }, app { name organization { id slug } } } }",
  "variables": {
    "input": {
      "appName": "?",
      "organizationId": null
    }
  }
}


DEBUG {}
DEBUG failed to connect metrics websocket: websocket.Dial wss://flyctl-metrics.fly.dev/socket: bad status

DEBUG <-- 200 https://api.fly.io/graphql (1m1.73s)

DEBUG {
  "data": {
    "ensureMachineRemoteBuilder": {
      "machine": {
        "id": "91850eea2e6e83",
        "state": "started",
        "ips": {
          "nodes": [
            {
              "family": "v6",
              "kind": "public",
              "ip": "2605:4c40:216:7da5:0:b71b:7c18:1"
            },
            {
              "family": "v4",
              "kind": "private",
              "ip": "172.19.129.202"
            },
            {
              "family": "v6",
              "kind": "privatenet",
              "ip": "fdaa:2:83f9:a7b:d6e9:b71b:7c18:2"
            }
          ]
        }
      },
      "app": {
        "name": "fly-builder-damp-fire-2944",
        "organization": {
          "id": "yw2NK09lG62eytyo75jV2OwzbVT3O05qR",
          "slug": "{org name}"
        }
      }
    }
  }
}

DEBUG checking ip &{Family:v6 Kind:public IP:2605:4c40:216:7da5:0:b71b:7c18:1 MaskSize:0}

DEBUG checking ip &{Family:v4 Kind:private IP:172.19.129.202 MaskSize:0}

DEBUG checking ip &{Family:v6 Kind:privatenet IP:fdaa:2:83f9:a7b:d6e9:b71b:7c18:2 MaskSize:0}

Waiting for remote builder fly-builder-damp-fire-2944... 🌍DEBUG --> POST https://api.fly.io/graphql

DEBUG {
  "query": "query ($appName: String!) { appbasic:app(name: $appName) { id name platformVersion organization { id slug paidPlan } } }",
  "variables": {
    "appName": "?"
  }
}


DEBUG {}
Waiting for remote builder fly-builder-damp-fire-2944... 🌎DEBUG <-- 200 https://api.fly.io/graphql (308.89ms)

DEBUG {
  "data": {
    "appbasic": {
      "id": "?",
      "name": "?",
      "platformVersion": "machines",
      "organization": {
        "id": "yw2NK09lG62eytyo75jV2OwzbVT3O05qR",
        "slug": "{org name}",
        "paidPlan": false
      }
    }
  }
}

DEBUG --> POST https://api.fly.io/graphql

DEBUG {
  "query": "mutation($input: ValidateWireGuardPeersInput!) { validateWireGuardPeers(input: $input) { invalidPeerIps } }",
  "variables": {
    "input": {
      "peerIps": [
        "fdaa:2:83f9:a7b:1bfe:0:a:2",
        "fdaa:1:33c4:a7b:1bfe:0:a:602",
        "fdaa:2:83fc:a7b:1bfe:0:a:2"
      ]
    }
  }
}


DEBUG {}
Waiting for remote builder fly-builder-damp-fire-2944... 🌏DEBUG <-- 200 https://api.fly.io/graphql (273.48ms)

DEBUG {
  "data": {
    "validateWireGuardPeers": {
      "invalidPeerIps": []
    }
  }
}

WARN Failed to start remote builder heartbeat: failed building options: failed probing "{org name}": context deadline exceeded

DEBUG Config has metrics token

DEBUG --> POST https://api.fly.io/graphql

DEBUG {
  "query": "\n# @genqlient\nmutation ResolverCreateBuild ($input: CreateBuildInput!) {\n\tcreateBuild(input: $input) {\n\t\tid\n\t\tstatus\n\t}\n}\n",
  "variables": {
    "input": {
      "appName": "?",
      "builderType": "remote",
      "clientMutationId": "",
      "imageOpts": {
        "buildArgs": {
          "NODE_ENV": "production"
        },
        "buildPacks": null,
        "builder": "",
        "builtIn": "",
        "builtInSettings": null,
        "dockerfilePath": "",
        "extraBuildArgs": null,
        "imageLabel": "",
        "imageRef": "",
        "noCache": false,
        "publish": true,
        "tag": "registry.fly.io/?:deployment-01H7B689RMC6FAW3W5P9AZP98X",
        "target": ""
      },
      "machineId": "",
      "strategiesAvailable": [
        "Buildpacks",
        "Dockerfile",
        "Builtin"
      ]
    }
  },
  "operationName": "ResolverCreateBuild"
}

DEBUG {0x14000aca570}
DEBUG <-- 200 https://api.fly.io/graphql (560.68ms)

DEBUG {
  "data": {
    "createBuild": {
      "id": "2954682",
      "status": "started"
    }
  }
}

DEBUG Trying 'Buildpacks' strategy

DEBUG no buildpack builder configured, skipping
DEBUG result image:<nil> error:<nil>

DEBUG Trying 'Dockerfile' strategy

DEBUG --> POST https://api.fly.io/graphql

DEBUG {
  "query": "mutation($input: EnsureMachineRemoteBuilderInput!) { ensureMachineRemoteBuilder(input: $input) { machine { id state ips { nodes { family kind ip } } }, app { name organization { id slug } } } }",
  "variables": {
    "input": {
      "appName": "?",
      "organizationId": null
    }
  }
}


DEBUG {}
DEBUG failed to connect metrics websocket: websocket.Dial wss://flyctl-metrics.fly.dev/socket: bad status

DEBUG Config has metrics token

DEBUG failed to connect metrics websocket: websocket.Dial wss://flyctl-metrics.fly.dev/socket: bad status

DEBUG <-- 504 https://api.fly.io/graphql (1m0.25s)

DEBUG <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>

DEBUG result image:<nil> error:error connecting to docker: server returned a non-200 status code: 504

DEBUG --> POST https://api.fly.io/graphql

DEBUG Config has metrics token

DEBUG {
  "query": "\n# @genqlient\nmutation ResolverFinishBuild ($input: FinishBuildInput!) {\n\tfinishBuild(input: $input) {\n\t\tid\n\t\tstatus\n\t\twallclockTimeMs\n\t}\n}\n",
  "variables": {
    "input": {
      "appName": "?",
      "buildId": "2954682",
      "builderMeta": {
        "builderType": "",
        "buildkitEnabled": false,
        "dockerVersion": "",
        "platform": "",
        "remoteAppName": "",
        "remoteMachineId": ""
      },
      "clientMutationId": "",
      "finalImage": {
        "id": "",
        "sizeBytes": 0,
        "tag": ""
      },
      "logs": "error connecting to docker: server returned a non-200 status code: 504",
      "machineId": "",
      "status": "failed",
      "strategiesAttempted": [
        {
          "error": "",
          "note": "no buildpack builder configured, skipping",
          "result": "failed",
          "strategy": "Buildpacks"
        },
        {
          "error": "error connecting to docker: server returned a non-200 status code: 504",
          "note": "",
          "result": "failed",
          "strategy": "Dockerfile"
        }
      ],
      "timings": {
        "buildAndPushMs": 60255,
        "buildMs": 60255,
        "builderInitMs": 60255,
        "contextBuildMs": -1,
        "imageBuildMs": -1,
        "pushMs": -1
      }
    }
  },
  "operationName": "ResolverFinishBuild"
}

DEBUG {0x14000d228d0}
DEBUG <-- 200 https://api.fly.io/graphql (554.63ms)

DEBUG {
  "data": {
    "finishBuild": {
      "id": "2954682",
      "status": "failed",
      "wallclockTimeMs": 60827
    }
  }
}

DEBUG Task manager done
DEBUG failed to connect metrics websocket: websocket.Dial wss://flyctl-metrics.fly.dev/socket: bad status

DEBUG Config has metrics token

DEBUG Shutdown timed out, exiting
Error: failed to fetch an image or build from source: error connecting to docker: server returned a non-200 status code: 504

(redacted a touch, just names)

Could this be related to the JNB connectivity issues?

Finally managed to deploy. First build got stuck at one of the yarn steps. So I cancelled and destroyed the builder. Ran it again and all went smooth.

Going to chalk this up to network/gateway issues, which sounds plausible given that my fibre line was also hiccuping throughout the day.

I managed to proxy as well, but then restarted the db instance as it was failing cpu health-checks quite regularly. Now I cannot proxy.

Error: tunnel unavailable for organization {org name}: failed probing "{org name}": context deadline exceeded

What does “context deadline exceeded” mean? It sounds like it has something to do with a timeout of sorts, but it doesn’t wait very long to show the error.

@ben-io – do you have any updates? All of these issues are persisting.

I’m unable to do anything at this point. Can’t clear wireguard. Can’t use websockets. Uninstalling and re-installing does not help.

When running the agent in the foreground, I see notes about dropped connections, unavailable upstream services, and context deadline exceeded.

No idea what else to try.

Please help.

WARN Failed to start remote builder heartbeat: failed building options: failed probing "personal": context deadline exceeded

Error: failed to fetch an image or build from source: failed building options: failed probing "personal": read tcp [fdaa:1:33c4:a7b:1bfe:0:a:600]:19746->[fdaa:1:33c4::3]:53: i/o timeout

DEBUG {
  "data": {
    "finishBuild": {
      "id": "2989109",
      "status": "failed",
      "wallclockTimeMs": 8442
    }
  }
}

Enabled websockets, disabled and restarted the agent twice, and it could then start the build. Everything was going smoothly, until it was loading the build context, which went very slowly, and eventually timed out with the following:

------
 > [internal] load build context:
------
Error: failed to fetch an image or build from source: error building: failed to solve: rpc error: code = Canceled desc = grpc: the client connection is closing

I cannot do a local build as docker times out at random points, usually during an apt command (either getting a package or unpacking another – it’s completely random).

Hi Mike, if you’re in South Africa or working on apps/builders hosted in JNB, there appear to have been ongoing network congestion issues across many Internet Service Providers in the area due to a pair of undersea cable breaks that occurred on Sunday. We worked with our datacenter provider to apply routing changes to mitigate the network issues as best we can, but this is probably the source of the ongoing issues you’re experiencing.

So I think this might have something to do with local network issues, which I find quite odd as most things are working fine, with the exception of docker. I landed up deploying via CI, and all went smoothly. I do notice that the app is a touch slow (hosted in the jnb region).

@wjordan – thanks, I thought the issue was resolved (per the status page). I guess this explains the slowness of the app itself, but I don’t think it explains the deployment issues I’ve been having, at least I don’t have a reason to think so, especially considering it deploys from CI without issue. I wonder if my fibre provider is also being impacted by the undersea cable breaks – routings on this end could be causing issues, perhaps? The line has gone down several times in the past 48 hours, but has been up and running for the most part.

Also just noting that there could perhaps be other issues at play here. I had similar issues as experienced in this thread (intermittently, though):

On our network, we haven’t seen significant ping loss to jnb (above ~10%) since the incident was resolved, and no ping loss between jnb and iad (where the fly.io dashboard and parts of our deployment-monitoring control plane are hosted).

If some apt operations are timing out on your local docker build, that suggests your local network ISP is having trouble connecting to other remote locations.

It’s possible that your current wireguard peer is located in a region that can’t connect to your local network due to ISP issues. You might have some success creating a new wireguard peer in a different region. (We don’t currently have a wireguard gateway in jnb, but perhaps you might have better luck some other region if it routes differently.)

Thanks for the details – it’s starting to make some sense. ISP issues seem strange, as the line is mostly fine – it must just be connections over specific routes.

I tried resetting wireguard (I believe it is using maa), but couldn’t do that either. My intention was to wipe out all the wireguards on all orgs and start afresh. It would be nice if there was a way to do a full reset, removing everything (unless that’s what the reset command does – I can’t tell though, because it doesn’t work).

Given that I can deploy from CI, I’m going to reduce my stress levels over this and use that for now. :sweat_smile:

I’ll do a deploy from my mac from time to time to see what happens – hopefully things improve in the coming days/weeks.

I checked some more metrics and it does look like maa is one of the 5-10% of regions where we’ve been seeing some ongoing intermittent connectivity loss to jnb since the cable cut. I wouldn’t be surprised if your own local ISP was seeing similar connection issues to the maa gateway, so switching your wireguard peer to a different region could help.

How do I go about this? The wg commands are returning Error: upstream service is unavailable.

fly wireguard reset is the command that connects your local agent to a new wireguard peer, but it defaults to the nearest region. I just learned that you can override the region by restarting the agent with the (undocumented) FLYCTL_WG_REGION environment variable set when you reset the wireguard peer:

fly agent stop; FLYCTL_WG_REGION=fra fly wireguard reset [org]