App stuck in pending mode, no logs

Hi :wave:,

Several weeks ago I successfully deployed an app (an Envoy proxy) on Fly.
Running flyctl apps list now shows the app in pending mode:

NAME                            OWNER    STATUS    LATEST DEPLOY
envoy                           personal pending   5m3s ago
fly-builder-twilight-river-4862 personal running   6m15s ago

I tried restarting the app, and even redeploying, but I’m not getting any output.
It’s stuck at:

deployment-1615830648: digest: sha256:47861d6ac855a4da512efb7da80f732d226a9ef332b2bfa5d66b09b57540369e size: 3034
--> Done Pushing Image
==> Creating Release
Release v124 created
Deploying to : envoy.fly.dev

Monitoring Deployment
You can detach the terminal anytime without stopping the deployment

A quick flyctl status gives:

f status -a envoy
App
  Name     = envoy
  Owner    = personal
  Version  = 124
  Status   = pending
  Hostname = envoy.fly.dev

Instances
ID VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED

I also checked the regions, and they look fine:

Region Pool:
sjc
Backup Region:
lax
sea

Last logs from when I issued a restart:

2021-03-15T17:23:54.427Z 0c611037 sjc [info] [runner] exiting
2021-03-15T17:23:54.440Z 0c611037 sjc [info] Main child exited normally with code: 0
2021-03-15T17:23:54.441Z 0c611037 sjc [info] Reaped child process with pid: 516 and signal: SIGKILL, core dumped? false

Any idea on how to troubleshoot?

It looks like it may just be taking a hair too long to respond to health checks. If you run flyctl status --all you’ll see a failed VM from 10 min or so ago. Let me see if I can get it running.

2021-03-15T18:26:52.045Z 3a11e19b sjc [info] Reaped child process with pid: 516 and signal: SIGKILL, core dumped? false
[1h later]
2021-03-15T19:26:54.106Z 5b83dabe sjc [info] Starting instance
2021-03-15T19:26:54.129Z 5b83dabe sjc [info] Configuring virtual machine

It’s exactly 1h after the previous VM failed. Did the failing health checks trigger an exponential backoff?

Yep, it’ll keep retrying for a while.

Am I missing something or does this app have no health checks?

Oh there are two services. Found it. :wink:

Have a look now? I increased the check grace period to 30s. You can set this in your fly.toml by adding grace_period = "30s" under your health check definition.
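
A sketch of what that looks like in fly.toml — the internal port, protocol, and check cadence below are placeholders, not taken from this app:

```toml
[[services]]
  internal_port = 8080   # placeholder; use your service's port
  protocol = "tcp"

  [[services.tcp_checks]]
    grace_period = "30s" # give the app 30s to boot before checks count against it
    interval = "15s"
    timeout = "2s"
```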

Thanks @kurt. I’ve now added grace_period = "30s", but I believe there is an underlying issue.

I’m going to disable health checking for now to investigate why Envoy fails to connect to the DigitalOcean droplet running the configuration server.

Thanks to the new SSH console :sparkles:, I can see why the health checks failed:
Envoy isn’t able to connect to my droplet:

traceroute to 192.241.212.157 (192.241.212.157), 30 hops max, 46 byte packets
 1  172.19.2.137 (172.19.2.137)  0.092 ms  0.144 ms  0.110 ms
 2  169.254.6.1 (169.254.6.1)  0.217 ms  169.254.6.0 (169.254.6.0)  0.197 ms  0.148 ms
 3  10.253.32.38 (10.253.32.38)  0.163 ms  0.124 ms  10.253.32.34 (10.253.32.34)  0.284 ms
 4  10.253.32.2 (10.253.32.2)  0.649 ms  0.678 ms  10.253.32.6 (10.253.32.6)  0.682 ms
 5  0.et-0-0-7.bsr1.sv5.packet.net (198.16.4.102)  2.259 ms  3.093 ms  0.et-0-0-7.bsr2.sv5.packet.net (198.16.4.104)  1.279 ms
 6  eqix-sv1.digitalocean.com (206.223.117.65)  1.806 ms  1.554 ms  as14061.sfmix.org (206.197.187.10)  3.331 ms
 7  138.197.244.236 (138.197.244.236)  3.117 ms  *  3.585 ms
 8  138.197.248.207 (138.197.248.207)  3.029 ms  *  *
 9  *  *  *
10  *  *  *

I’ve verified from multiple ISPs that the droplet is indeed reachable, and there are no network firewalls configured.

Is this something you have visibility into on your end?

                                    My traceroute  [v0.93]
a74c2a16 (172.19.2.138)                                               2021-03-15T20:34:35+0000
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                                      Packets               Pings
 Host                                               Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. 172.19.2.137                                     0.0%    19    0.1   0.1   0.1   0.2   0.0
 2. 169.254.6.0                                      0.0%    19    0.2   0.2   0.2   0.4   0.0
 3. 10.253.32.36                                     0.0%    19    0.3   0.2   0.2   0.4   0.1
 4. 10.253.32.4                                      0.0%    19    0.9   3.2   0.7  18.1   5.4
 5. 0.et-0-0-7.bsr1.sv5.packet.net                   0.0%    19    2.3   3.7   2.0  27.2   5.8
 6. eqix-sv1.digitalocean.com                        0.0%    19    2.3   2.4   2.0   5.3   0.8
 7. (waiting for reply)

From fly:

# ./grpcurl -insecure 192.241.212.157:8443 list
Failed to dial target host "192.241.212.157:8443": context deadline exceeded

From scaleway (PAR):

./grpcurl -insecure 192.241.212.157:8443 list
Failed to list services: server does not support the reflection API

Port 8443 is open to all IPv4 and IPv6 addresses on my DigitalOcean droplet :frowning_face:

This looks like a network problem on DigitalOcean’s end, since the traffic does reach their routers in the facility. If that’s the case, they’ll likely have to fix it (although we are checking for a workaround).

One quick thing to try is running in a different region. Traffic from LAX, for example, is likely to go through a different peer.
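
Something like this should do it — a sketch that assumes the app name from this thread and that a redeploy moves the instance:

```shell
# Swap the primary region pool to LAX, then redeploy to move the VM
flyctl regions set lax -a envoy
flyctl deploy -a envoy

# ...and back to SJC afterwards
flyctl regions set sjc -a envoy
```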

I tried LAX and ORD with no luck :frowning:
I’ll keep digging, as usual thanks for the help :slightly_smiling_face:

I can actually curl that IP just fine from the hosts, so there might be something else wrong. Mind if I pop into one of your VMs?

$ curl http://192.241.212.157
<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx/1.10.3 (Ubuntu)</center>
</body>
</html>

Yup, go for it. The port I’m trying to connect to is 8443, if that makes any difference.

Wow, it seems like IPv4 within your VMs just isn’t working, while IPv6 addresses work fine. That’s bizarre! We’re looking into it.

OK, this is not a problem with the whole IP; it’s a problem with port 8443 specifically. Other ports on that IP work just fine. This is likely a firewall issue on our end; we’ll get it fixed.
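
For anyone following along, the per-port behavior can be reproduced from inside the VM with a quick TCP probe. This is a sketch: check_port is an ad-hoc helper written for this post (not a Fly tool), and the IP and ports are the ones from this thread.

```shell
# Minimal per-port TCP probe using bash's /dev/tcp redirection;
# `nc -z -w 5 host port` does the same job where netcat is installed.
check_port() {
  host=$1; port=$2
  if timeout 5 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "$host:$port open"
  else
    echo "$host:$port unreachable"
  fi
}
```

From `flyctl ssh console`, at the time of this thread, port 80 on 192.241.212.157 (the nginx that answered curl) would report open while 8443 (the firewalled port) would report unreachable.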

I see. Envoy will also attempt to connect to port 9443, if that helps with debugging.

No rush in fixing this; I won’t have time to tinker with it tonight anyway.

Fixed! The logs look happy on your app. See how it goes when you do have time to tinker.
