[GRU] Instability with deploy and logs

rodolfosilva · October 23, 2022, 12:09pm

When I start a new deployment of my app, after flyctl completes the build and delivery task, the app goes down and I can’t see the logs in the console. The deployment task either keeps running in the console or sometimes says the app is active when it’s not, and I can’t see the GRU region logs.

I’m using Phoenix.

I’ve tried many commands, but none solved my problem, and there is no indication of a problem on status.fly.io.

The only thing that helped was adding one more region, but I don’t know whether this is working because, for some reason, the traffic is still going to GRU.

App
  Name     = salao365          
  Owner    = personal          
  Version  = 117               
  Status   = running           
  Hostname = salao365.fly.dev  
  Platform = nomad             

Instances
ID              PROCESS VERSION REGION  DESIRED STATUS  HEALTH CHECKS           RESTARTS        CREATED    
e5bf35c2        app     117     dfw     run     running 1 total, 1 passing      0               12h41m ago
b827c3b1        app     117     gru     run     running 1 total, 1 passing      0               12h42m ago

$ curl -v api.salao365.com
*   Trying 2a09:8280:1::3:393a...
* TCP_NODELAY set
* Connected to api.salao365.com (2a09:8280:1::3:393a) port 80 (#0)
> GET / HTTP/1.1
> Host: api.salao365.com
> User-Agent: curl/7.64.1
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< location: https://api.salao365.com/
< server: Fly/51c45b355 (2022-10-19)
< via: 1.1 fly.io
< fly-request-id: 01GG2BTFA8F4GRR5K3VJAQXSY1-gru
< content-length: 0
< date: Sun, 23 Oct 2022 12:07:01 GMT
<
* Connection #0 to host api.salao365.com left intact
* Closing connection 0
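
The `fly-request-id` header in the output above ends in `-gru`, which is what suggests the request was still handled in GRU. A minimal sketch for pulling that suffix out of a header value follows; `region_from_request_id` is a hypothetical helper name, and reading the suffix as a region code is an assumption based on this output, not something confirmed in the thread.

```shell
# Hypothetical helper: print the part of a fly-request-id value after its last hyphen.
# Assumption: that suffix names the region that handled the request.
region_from_request_id() {
  printf '%s\n' "${1##*-}"
}

# Value copied from the curl output above.
region_from_request_id "01GG2BTFA8F4GRR5K3VJAQXSY1-gru"

# To fetch a fresh header value to feed into the helper (needs network access):
#   curl -sI https://api.salao365.com | grep -i '^fly-request-id'
```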

crossworth · October 23, 2022, 6:00pm

I’m experiencing something similar on the GRU region as well.

Yesterday one app would not show logs on the monitoring page, and fly logs would get stuck. The deploy would work sometimes (even without any logs) and serve traffic. This was a test app, and after deleting it and creating another one with the same name, the problem seems to be resolved.

Today I’ve seen other issues with other apps related to serving content, again in the GRU region.

One app stopped serving content, with no new deploys or changes, and showed very slow response times. Timings (the times below are in America/Sao_Paulo):

❯ curl --trace-time -v https://APP_DOMAIN
12:38:50.458725 *   Trying APP_IP:443...
12:38:50.466943 * Connected to APP_DOMAIN (APP_IP) port 443 (#0)
12:38:50.469024 * ALPN, offering h2
12:38:50.469044 * ALPN, offering http/1.1
12:38:50.476464 * successfully set certificate verify locations:
12:38:50.476485 *  CAfile: /etc/ssl/cert.pem
12:38:50.476498 *  CApath: none
12:38:50.477730 * (304) (OUT), TLS handshake, Client hello (1):
12:38:50.486743 * (304) (IN), TLS handshake, Server hello (2):
12:38:50.487570 * (304) (IN), TLS handshake, Unknown (8):
12:38:50.487617 * (304) (IN), TLS handshake, Certificate (11):
12:38:50.489504 * (304) (IN), TLS handshake, CERT verify (15):
12:38:50.489728 * (304) (IN), TLS handshake, Finished (20):
12:38:50.489786 * (304) (OUT), TLS handshake, Finished (20):
12:38:50.489808 * SSL connection using TLSv1.3 / AEAD-AES256-GCM-SHA384
12:38:50.489822 * ALPN, server accepted to use h2
12:38:50.489837 * Server certificate:
12:38:50.489853 *  subject: CN=APP_DOMAIN
12:38:50.489897 *  start date: Aug 28 13:02:58 2022 GMT
12:38:50.489911 *  expire date: Nov 26 13:02:57 2022 GMT
12:38:50.489929 *  subjectAltName: host "APP_DOMAIN" matched cert's "APP_DOMAIN"
12:38:50.489946 *  issuer: C=US; O=Let's Encrypt; CN=R3
12:38:50.489960 *  SSL certificate verify ok.
12:38:50.489992 * Using HTTP2, server supports multiplexing
12:38:50.490006 * Connection state changed (HTTP/2 confirmed)
12:38:50.490020 * Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
12:38:50.490098 * Using Stream ID: 1 (easy handle 0x7f895f810a00)
12:38:50.490127 > GET / HTTP/2
12:38:50.490127 > Host: APP_DOMAIN
12:38:50.490127 > user-agent: curl/7.79.1
12:38:50.490127 > accept: */*
12:38:50.490127 >
12:38:50.498867 * Connection state changed (MAX_CONCURRENT_STREAMS == 32)!
12:39:07.345726 < HTTP/2 200
12:39:07.345767 < accept-ranges: bytes
12:39:07.345787 < content-length: 1331
12:39:07.345807 < content-type: text/html; charset=utf-8
12:39:07.345826 < request-id: cdam06s45ebs315fcj20
12:39:07.345845 < date: Sun, 23 Oct 2022 15:39:07 GMT
12:39:07.345866 < server: Fly/51c45b355 (2022-10-19)
12:39:07.345887 < via: 2 fly.io
12:39:07.345908 < fly-request-id: 01GG2QYAFSSNVQV5ZK2J2WSZ7C-gru
12:39:07.345923 {CONTENT}
12:39:07.346378 * Connection #0 to host APP_DOMAIN left intact
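
To put a number on the delay in the trace above, the gap between the request being sent (`12:38:50.490127`) and the first response line (`12:39:07.345726`) can be computed from the two `--trace-time` timestamps. This is only a sketch: `to_seconds` and `elapsed` are hypothetical helper names, and it assumes both timestamps fall on the same day.

```shell
# Convert an HH:MM:SS.ffffff timestamp to seconds since midnight.
to_seconds() {
  echo "$1" | awk -F: '{ printf "%.6f\n", $1 * 3600 + $2 * 60 + $3 }'
}

# Print the difference, in seconds, between two timestamps on the same day.
elapsed() {
  awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
    'BEGIN { printf "%.6f\n", b - a }'
}

# Request sent vs. first response header, both taken from the trace above.
elapsed 12:38:50.490127 12:39:07.345726
```

For this trace the gap works out to roughly 16.9 seconds spent waiting for the first byte.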

Timing using this format:

❯ curl -w "@curl-format.txt" -o /dev/null -s "https://APP_DOMAIN/api/healthz"
     time_namelookup:  0.125110s
        time_connect:  0.133622s
     time_appconnect:  0.148274s
    time_pretransfer:  0.148331s
       time_redirect:  0.000000s
  time_starttransfer:  214.727177s
                     ----------
          time_total:  214.727239s
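
The contents of `curl-format.txt` are not shown in the post. A plausible reconstruction, assuming the file simply maps each label in the output above to curl's `--write-out` variable of the same name, could be created like this:

```shell
# Write a curl --write-out format file matching the labels in the output above.
# The quoted 'EOF' keeps %{...} and \n literal so that curl, not the shell, expands them.
cat > curl-format.txt <<'EOF'
     time_namelookup:  %{time_namelookup}s\n
        time_connect:  %{time_connect}s\n
     time_appconnect:  %{time_appconnect}s\n
    time_pretransfer:  %{time_pretransfer}s\n
       time_redirect:  %{time_redirect}s\n
  time_starttransfer:  %{time_starttransfer}s\n
                     ----------\n
          time_total:  %{time_total}s\n
EOF
```

With that file in the current directory, the command from the post (`curl -w "@curl-format.txt" -o /dev/null -s "https://APP_DOMAIN/api/healthz"`) prints the timing breakdown. `time_starttransfer` is the time until the first response byte, which is where almost all of the 214 seconds above were spent.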

The app showed this behavior for some time and then suddenly restarted.

I don’t know the reason, but it seems to have solved the problem for now (for this app).

At 17:27 (UTC) another app stopped serving content or started showing very slow response times.

I tried to restart it using fly restart, but the app doesn’t restart; the new instance is stuck in the pending state and doesn’t show any logs.

Using fly scale count 0 -a APP_NAME doesn’t seem to work either.
Everything seems to be stuck.

Using fly scale count 1 -a APP_NAME seems to change things a little: it starts a new instance, and that one is serving traffic.

Metrics for the stuck instance (v6) seem odd as well (time in UTC-3).

I have yet another app starting to show problems: random slow response times, and it seems it tried to restart but got stuck as well.

Other apps seem to work without problems (all in the GRU region). The apps have different images and do different things. I tried to explain as best I could, but this time I’m having trouble understanding it myself.

jerome · October 23, 2022, 6:40pm

We’ve just deployed something that might help in gru as far as response times go.

It appears we had 2 overloaded hosts in gru as well. This would cause issues deploying and performance slowdowns.

There’s still 1 host struggling, so I’ve created a status page incident: Fly.io Status - GRU host eb5d resource exhaustion

rodolfosilva · October 23, 2022, 7:08pm

OK, right now my database instance is down. Is there any way to migrate to another region temporarily to get the app running again, @jerome?

klucass · October 23, 2022, 7:54pm

My app is down too (GRU). I tried to scale it like crossworth, but the new instance failed to run.

rodolfosilva · October 23, 2022, 8:02pm

@klucass my users have already left some bad reviews in the Play Store because of this issue.

crossworth · October 23, 2022, 8:02pm

@klucass try scaling to the original number of instances + 1 and check whether at least one new instance is created.
In my case I think I executed:
fly scale count 2 -a APP_NAME
and then
fly scale count 1 -a APP_NAME
for an app with a single instance.
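
The same scale-up-then-down sequence can be wrapped in two small shell functions. This is only a sketch of the workaround described above, not an official procedure: `scale_up_one`, `scale_back`, and the `FLY` override variable are hypothetical names, and the only flyctl commands used are `fly scale count` and `fly status`.

```shell
# FLY can be overridden (for example FLY=echo) to print the commands instead of running them.
FLY="${FLY:-fly}"

# Step 1: ask for one instance more than the app normally runs, then show status.
scale_up_one() {
  app="$1"; original="$2"
  "$FLY" scale count "$((original + 1))" -a "$app" && "$FLY" status -a "$app"
}

# Step 2: return to the original count. Run this only after the status output
# shows that a new instance has actually started.
scale_back() {
  app="$1"; original="$2"
  "$FLY" scale count "$original" -a "$app"
}

# Usage for an app that normally runs a single instance:
#   scale_up_one APP_NAME 1
#   scale_back APP_NAME 1
```

Keeping the two steps separate leaves room to check by hand that the new instance is running before scaling back down.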

crossworth · October 23, 2022, 8:14pm

Maybe you can create a new volume in the scl region, restore the last database snapshot to it, and create the database in the scl region as well.

But I’m not sure if that would work or if the snapshot would have all the data you expected.

If you can somehow create a new app in GRU, maybe you could mount the database volume on this new app (if your database volume is not mounted on other apps).

klucass · October 23, 2022, 8:18pm

It worked! Thank you!

rodolfosilva · October 23, 2022, 8:45pm

I’ll try this approach. The other option seems more dangerous.

jerome · October 23, 2022, 8:54pm

The situation in gru should be resolved now. We’re still monitoring it.

rodolfosilva · October 23, 2022, 8:55pm

I don’t know yet whether your suggestion solved my problem or whether it is related to the fix deployed by the fly.io team. But this worked for me, thanks @crossworth. My app is back to life.

rodolfosilva · October 23, 2022, 8:55pm

Thanks @jerome
