App stuck in pending mode, no logs

Hi :wave:,

Several weeks ago I successfully deployed an app (an Envoy proxy) on Fly.
Running flyctl apps list now shows the app in pending mode:

NAME                            OWNER    STATUS    LATEST DEPLOY
envoy                           personal pending   5m3s ago
fly-builder-twilight-river-4862 personal running   6m15s ago

I tried restarting the app, and even redeploying, but I’m not getting any output.
It’s stuck at:

deployment-1615830648: digest: sha256:47861d6ac855a4da512efb7da80f732d226a9ef332b2bfa5d66b09b57540369e size: 3034
--> Done Pushing Image
==> Creating Release
Release v124 created
Deploying to : envoy.fly.dev

Monitoring Deployment
You can detach the terminal anytime without stopping the deployment

A quick flyctl status gives:

f status -a envoy
App
  Name     = envoy
  Owner    = personal
  Version  = 124
  Status   = pending
  Hostname = envoy.fly.dev

Instances
ID VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED

I also checked the regions, and they look fine:

Region Pool:
sjc
Backup Region:
lax
sea

Last logs from when I issued a restart:

2021-03-15T17:23:54.427Z 0c611037 sjc [info] [runner] exiting
2021-03-15T17:23:54.440Z 0c611037 sjc [info] Main child exited normally with code: 0
2021-03-15T17:23:54.441Z 0c611037 sjc [info] Reaped child process with pid: 516 and signal: SIGKILL, core dumped? false

Any idea on how to troubleshoot?

It looks like it may just be taking a hair too long to respond to health checks. If you run flyctl status --all you’ll see a failed VM from 10 min or so ago. Let me see if I can get it running.

2021-03-15T18:26:52.045Z 3a11e19b sjc [info] Reaped child process with pid: 516 and signal: SIGKILL, core dumped? false
[1h later]
2021-03-15T19:26:54.106Z 5b83dabe sjc [info] Starting instance
2021-03-15T19:26:54.129Z 5b83dabe sjc [info] Configuring virtual machine

It’s exactly 1h after the previous VM failed. Did the failing health checks trigger an exponential backoff?

Yep, it’ll keep retrying for a while.

Am I missing something or does this app have no health checks?

Oh there are two services. Found it. :wink:

Have a look now? I increased the check grace period to 30s. You can set this in your fly.toml by adding grace_period = "30s" under your health check definition.
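
A sketch of what that looks like in fly.toml — the internal port, protocol, and check cadence below are placeholders, not taken from this app:

```toml
[[services]]
  internal_port = 8080   # placeholder; use your service's port
  protocol = "tcp"

  [[services.tcp_checks]]
    grace_period = "30s" # give the app 30s to boot before checks count against it
    interval = "15s"
    timeout = "2s"
```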

Thanks @kurt. I’ve now added grace_period = "30s", but I believe there is an underlying issue.

I’m going to disable health checking for now to investigate why Envoy fails to connect to the DigitalOcean droplet running the configuration server.

Thanks to the new SSH console :sparkles:, I can see why the health checks failed:
Envoy isn’t able to connect to my droplet:

traceroute to 192.241.212.157 (192.241.212.157), 30 hops max, 46 byte packets
 1  172.19.2.137 (172.19.2.137)  0.092 ms  0.144 ms  0.110 ms
 2  169.254.6.1 (169.254.6.1)  0.217 ms  169.254.6.0 (169.254.6.0)  0.197 ms  0.148 ms
 3  10.253.32.38 (10.253.32.38)  0.163 ms  0.124 ms  10.253.32.34 (10.253.32.34)  0.284 ms
 4  10.253.32.2 (10.253.32.2)  0.649 ms  0.678 ms  10.253.32.6 (10.253.32.6)  0.682 ms
 5  0.et-0-0-7.bsr1.sv5.packet.net (198.16.4.102)  2.259 ms  3.093 ms  0.et-0-0-7.bsr2.sv5.packet.net (198.16.4.104)  1.279 ms
 6  eqix-sv1.digitalocean.com (206.223.117.65)  1.806 ms  1.554 ms  as14061.sfmix.org (206.197.187.10)  3.331 ms
 7  138.197.244.236 (138.197.244.236)  3.117 ms  *  3.585 ms
 8  138.197.248.207 (138.197.248.207)  3.029 ms  *  *
 9  *  *  *
10  *  *  *

I’ve verified from multiple ISPs that the droplet is indeed reachable, and there are no network firewalls configured.

Is this something you have visibility into on your end?

                                    My traceroute  [v0.93]
a74c2a16 (172.19.2.138)                                               2021-03-15T20:34:35+0000
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                                      Packets               Pings
 Host                                               Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. 172.19.2.137                                     0.0%    19    0.1   0.1   0.1   0.2   0.0
 2. 169.254.6.0                                      0.0%    19    0.2   0.2   0.2   0.4   0.0
 3. 10.253.32.36                                     0.0%    19    0.3   0.2   0.2   0.4   0.1
 4. 10.253.32.4                                      0.0%    19    0.9   3.2   0.7  18.1   5.4
 5. 0.et-0-0-7.bsr1.sv5.packet.net                   0.0%    19    2.3   3.7   2.0  27.2   5.8
 6. eqix-sv1.digitalocean.com                        0.0%    19    2.3   2.4   2.0   5.3   0.8
 7. (waiting for reply)

From fly:

# ./grpcurl -insecure 192.241.212.157:8443 list
Failed to dial target host "192.241.212.157:8443": context deadline exceeded

From scaleway (PAR):

./grpcurl -insecure 192.241.212.157:8443 list
Failed to list services: server does not support the reflection API

Port 8443 is open to all IPv4 and IPv6 addresses on my DigitalOcean droplet :frowning_face:

This looks like a network problem on DigitalOcean’s end, since the traffic does reach their routers in the facility. If that’s the case, they’ll likely have to fix it (although we are checking for a workaround).

One quick thing to try is running in a different region. Traffic from LAX, for example, is likely to go through a different peer.
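
Something like this should do it — a sketch that assumes the app name from this thread and that a redeploy moves the instance:

```shell
# Swap the primary region pool to LAX, then redeploy to move the VM
flyctl regions set lax -a envoy
flyctl deploy -a envoy

# ...and back to SJC afterwards
flyctl regions set sjc -a envoy
```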

I tried LAX and ORD with no luck :frowning:
I’ll keep digging, as usual thanks for the help :slightly_smiling_face:

I can actually curl that IP just fine from the hosts, so there might be something else wrong. Mind if I pop into one of your VMs?

$ curl http://192.241.212.157
<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx/1.10.3 (Ubuntu)</center>
</body>
</html>

Yup, go for it. The port I’m trying to connect to is 8443, if that makes any difference.

Wow, it seems like IPv4 within your VMs just isn’t working, while IPv6 addresses work fine. That’s bizarre! We’re looking into it.

OK, this is not a problem with the whole IP; it’s a problem with port 8443 specifically. Other ports on that IP work just fine. This is likely a firewall issue on our end; we’ll get it fixed.
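
For anyone following along, the per-port behavior can be reproduced from inside the VM with a quick TCP probe. This is a sketch: check_port is an ad-hoc helper written for this post (not a Fly tool), and the IP and ports are the ones from this thread.

```shell
# Minimal per-port TCP probe using bash's /dev/tcp redirection;
# `nc -z -w 5 host port` does the same job where netcat is installed.
check_port() {
  host=$1; port=$2
  if timeout 5 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "$host:$port open"
  else
    echo "$host:$port unreachable"
  fi
}
```

From `flyctl ssh console`, at the time of this thread, port 80 on 192.241.212.157 (the nginx that answered curl) would report open while 8443 (the firewalled port) would report unreachable.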

I see. Envoy will also attempt to connect to port 9443, if that helps with debugging.

No rush in fixing this; I won’t have time to tinker with it tonight anyway.

Fixed! The logs look happy on your app. See how it goes when you do have time to tinker.
