[GRU] Instability with deploy and logs

When I start a new deployment of my app, flyctl completes the build and release steps, but then the app goes down and I can’t see any logs in the console. The deploy command either keeps running, or it reports the app as active when it isn’t. I also can’t see any logs for the GRU region.

I’m using Phoenix.

I’ve tried many commands, but none solved my problem, and there is no indication of a problem on status.fly.io.

The only thing that helped was adding one more region, but I don’t know whether it is working, because for some reason traffic is still going to GRU.

App
  Name     = salao365          
  Owner    = personal          
  Version  = 117               
  Status   = running           
  Hostname = salao365.fly.dev  
  Platform = nomad             

Instances
ID              PROCESS VERSION REGION  DESIRED STATUS  HEALTH CHECKS           RESTARTS        CREATED    
e5bf35c2        app     117     dfw     run     running 1 total, 1 passing      0               12h41m ago
b827c3b1        app     117     gru     run     running 1 total, 1 passing      0               12h42m ago

$ curl -v api.salao365.com
*   Trying 2a09:8280:1::3:393a...
* TCP_NODELAY set
* Connected to api.salao365.com (2a09:8280:1::3:393a) port 80 (#0)
> GET / HTTP/1.1
> Host: api.salao365.com
> User-Agent: curl/7.64.1
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< location: https://api.salao365.com/
< server: Fly/51c45b355 (2022-10-19)
< via: 1.1 fly.io
< fly-request-id: 01GG2BTFA8F4GRR5K3VJAQXSY1-gru
< content-length: 0
< date: Sun, 23 Oct 2022 12:07:01 GMT
<
* Connection #0 to host api.salao365.com left intact
* Closing connection 0

I’m experiencing something similar on the GRU region as well.

Yesterday one app would not show logs on the monitoring page, and fly logs would get stuck. Deploys worked sometimes (even without any logs) and the app served traffic. This was a test app; after I deleted it and created another one with the same name, the problem seems to be resolved.


Today I have seen other issues with other apps related to serving content, again in the GRU region.

One app stopped serving content or responded very slowly, with no new deploys or changes. Timings (the times below are in America/Sao_Paulo):

❯ curl --trace-time -v https://APP_DOMAIN
12:38:50.458725 *   Trying APP_IP:443...
12:38:50.466943 * Connected to APP_DOMAIN (APP_IP) port 443 (#0)
12:38:50.469024 * ALPN, offering h2
12:38:50.469044 * ALPN, offering http/1.1
12:38:50.476464 * successfully set certificate verify locations:
12:38:50.476485 *  CAfile: /etc/ssl/cert.pem
12:38:50.476498 *  CApath: none
12:38:50.477730 * (304) (OUT), TLS handshake, Client hello (1):
12:38:50.486743 * (304) (IN), TLS handshake, Server hello (2):
12:38:50.487570 * (304) (IN), TLS handshake, Unknown (8):
12:38:50.487617 * (304) (IN), TLS handshake, Certificate (11):
12:38:50.489504 * (304) (IN), TLS handshake, CERT verify (15):
12:38:50.489728 * (304) (IN), TLS handshake, Finished (20):
12:38:50.489786 * (304) (OUT), TLS handshake, Finished (20):
12:38:50.489808 * SSL connection using TLSv1.3 / AEAD-AES256-GCM-SHA384
12:38:50.489822 * ALPN, server accepted to use h2
12:38:50.489837 * Server certificate:
12:38:50.489853 *  subject: CN=APP_DOMAIN
12:38:50.489897 *  start date: Aug 28 13:02:58 2022 GMT
12:38:50.489911 *  expire date: Nov 26 13:02:57 2022 GMT
12:38:50.489929 *  subjectAltName: host "APP_DOMAIN" matched cert's "APP_DOMAIN"
12:38:50.489946 *  issuer: C=US; O=Let's Encrypt; CN=R3
12:38:50.489960 *  SSL certificate verify ok.
12:38:50.489992 * Using HTTP2, server supports multiplexing
12:38:50.490006 * Connection state changed (HTTP/2 confirmed)
12:38:50.490020 * Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
12:38:50.490098 * Using Stream ID: 1 (easy handle 0x7f895f810a00)
12:38:50.490127 > GET / HTTP/2
12:38:50.490127 > Host: APP_DOMAIN
12:38:50.490127 > user-agent: curl/7.79.1
12:38:50.490127 > accept: */*
12:38:50.490127 >
12:38:50.498867 * Connection state changed (MAX_CONCURRENT_STREAMS == 32)!
12:39:07.345726 < HTTP/2 200
12:39:07.345767 < accept-ranges: bytes
12:39:07.345787 < content-length: 1331
12:39:07.345807 < content-type: text/html; charset=utf-8
12:39:07.345826 < request-id: cdam06s45ebs315fcj20
12:39:07.345845 < date: Sun, 23 Oct 2022 15:39:07 GMT
12:39:07.345866 < server: Fly/51c45b355 (2022-10-19)
12:39:07.345887 < via: 2 fly.io
12:39:07.345908 < fly-request-id: 01GG2QYAFSSNVQV5ZK2J2WSZ7C-gru
12:39:07.345923 {CONTENT}
12:39:07.346378 * Connection #0 to host APP_DOMAIN left intact
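
For anyone reading a trace like this, the interesting part is the pause between the request being sent and the first response line. A minimal sketch of a helper that finds the largest pause between consecutive timestamped lines is below. The function name `largest_gap` is made up for illustration, and it assumes the trace does not cross midnight.

```shell
# largest_gap: read `curl --trace-time -v` output on stdin and print the
# largest pause between two consecutive timestamped lines.
# Hypothetical helper; assumes all timestamps fall on the same day.
largest_gap() {
  awk '
    # Only look at lines whose first field is HH:MM:SS.fraction
    $1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]\.[0-9]+$/ {
      split($1, t, ":")
      now = t[1] * 3600 + t[2] * 60 + t[3]   # seconds since midnight
      if (seen && now - prev > max) { max = now - prev; at = $1 }
      prev = now
      seen = 1
    }
    END { printf "largest gap: %.3fs, ending at %s\n", max, at }
  '
}
```

Since curl writes the verbose trace to stderr, it could be used as `curl --trace-time -v -o /dev/null https://APP_DOMAIN 2>&1 | largest_gap`. On the trace above, the largest pause is the roughly 17 seconds between 12:38:50.498867 and 12:39:07.345726, i.e. the wait for the first response byte.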

Timing using this format:

❯ curl -w "@curl-format.txt" -o /dev/null -s "https://APP_DOMAIN/api/healthz"
     time_namelookup:  0.125110s
        time_connect:  0.133622s
     time_appconnect:  0.148274s
    time_pretransfer:  0.148331s
       time_redirect:  0.000000s
  time_starttransfer:  214.727177s
                     ----------
          time_total:  214.727239s
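
The `curl-format.txt` file itself is not shown in the post. A sketch of a file that would produce output in this shape, assuming the commonly used layout built from curl's `--write-out` variables, could look like this:

```shell
# Write a curl --write-out format file (assumed layout; the original
# poster's actual file is not shown in the thread).
cat > curl-format.txt <<'EOF'
     time_namelookup:  %{time_namelookup}s\n
        time_connect:  %{time_connect}s\n
     time_appconnect:  %{time_appconnect}s\n
    time_pretransfer:  %{time_pretransfer}s\n
       time_redirect:  %{time_redirect}s\n
  time_starttransfer:  %{time_starttransfer}s\n
                     ----------\n
          time_total:  %{time_total}s\n
EOF
```

It is then passed to curl as in the command above, with `-w "@curl-format.txt"`. In the numbers above, the TLS handshake finishes after about 0.15s (`time_appconnect`), while the first response byte only arrives after about 214.7s (`time_starttransfer`), so nearly all of the time is spent waiting for the app to respond.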

The app showed this behavior for some time and then suddenly restarted.


I don’t know the reason, but it seems to have solved the problem for now (for this app).


At 17:27 (UTC) another app stopped serving content or started showing very slow response times.

I tried to restart it using fly restart, but the app doesn’t restart; the new instance is stuck in the pending state and doesn’t show any logs.

Using fly scale count 0 -a APP_NAME doesn’t seem to work either.
Everything seems to be stuck:

Using fly scale count 1 -a APP_NAME seems to change things a little: a new instance starts, and it’s serving traffic.

Metrics for the stuck instance (v6) seem odd as well (times in UTC-3):


I have yet another app starting to show problems: random slow response times, and it seems it tried to restart but got stuck as well.


Other apps seem to work without problems (all in the GRU region). The apps have different images and do different things. I tried to explain this as best I could, but this time I’m having trouble understanding it myself.

We’ve just deployed something that might help in gru as far as response times go.

It appears we had 2 overloaded hosts in gru as well. This would cause issues deploying and performance slowdowns.

There’s still 1 host struggling, I’ve created a status page incident: Fly.io Status - GRU host eb5d resource exhaustion

Ok, right now my database instance is down. Is there any way to temporarily migrate to another region to get the app running again, @jerome?

My app is down too (GRU). I tried to scale it like crossworth, but the new instance failed to run.

@klucass my users have already left some bad reviews in the Play Store because of this issue. :frowning:

@klucass try scaling to the original number of instances + 1 and check whether at least one new instance is created.
In my case I think I executed:
fly scale count 2 -a APP_NAME
and then
fly scale count 1 -a APP_NAME
for an app with a single instance.
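
The same scale-up-then-down sequence can be wrapped in a small function for apps with a different instance count. This is only a sketch: `bounce_instance` and the `FLY` override variable are names made up here, and it only runs the two `fly scale count` commands shown above, in order.

```shell
# bounce_instance APP_NAME ORIGINAL_COUNT
# Scale to ORIGINAL_COUNT + 1, then back down to ORIGINAL_COUNT.
# FLY selects the CLI binary; set FLY=echo for a dry run that only
# prints the arguments (hypothetical convenience, not a flyctl feature).
bounce_instance() {
  app=$1
  count=$2
  fly_cmd=${FLY:-fly}
  # Ask for one extra instance so a fresh one gets scheduled.
  "$fly_cmd" scale count "$((count + 1))" -a "$app" || return 1
  # Before scaling back, it is worth confirming that the new instance
  # is actually running and passing health checks.
  "$fly_cmd" scale count "$count" -a "$app"
}
```

For a single-instance app this would be called as `bounce_instance APP_NAME 1`, and `FLY=echo bounce_instance APP_NAME 1` prints the two commands' arguments without touching the app.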

Maybe you can create a new volume in the scl region, restore the last database snapshot to it, and create the database in the scl region as well.

But I’m not sure whether that would work, or whether the snapshot would have all the data you expect.

If you can somehow create a new app in GRU, maybe you could mount the database volume on this new app (if your database volume is not mounted on other apps).

It worked! Thank you!

I’ll try this approach. The other option seems more dangerous.

The situation in gru should be resolved now. We’re still monitoring it.

I don’t know yet whether your suggestion solved my problem or whether it was the fix from the fly.io team, but this worked for me. Thanks @crossworth. My app is back to life.

Thanks @jerome