App unreachable unless scaled to >2

I have a phoenix app:

❯ fly info         
App
  Name     = xxx          
  Owner    = xxx              
  Version  = 39                 
  Status   = running            
  Hostname = xxx.fly.dev  

Services
PROTOCOL PORTS                   
TCP      80 => 4000 [HTTP]       
         443 => 4000 [TLS, HTTP] 

IP Adresses
TYPE ADDRESS             REGION CREATED AT           
v4   x.x.x.x             2022-04-18T21:17:18Z 
v6   x:x:x::x:x          2022-04-18T21:17:18Z 

Running with a single instance:

Instances
ID              PROCESS VERSION REGION  DESIRED STATUS  HEALTH CHECKS           RESTARTS        CREATED   
4baf3b2d        app     39      iad     run     running 1 total, 1 passing      0               3m19s ago

The app started just fine, and everything looks good in the logs.

But I can’t reach it:

❯ curl -v https://xxx.fly.dev
*   Trying xxx.xxx.xxx.xxx:443...
* Connected to xxx.fly.dev (xxx.xxx.xxx.xxx) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*  CAfile: /etc/ssl/cert.pem
*  CApath: none
* (304) (OUT), TLS handshake, Client hello (1):
* error:02FFF036:system library:func(4095):Connection reset by peer
* Closing connection 0
curl: (35) error:02FFF036:system library:func(4095):Connection reset by peer

From the inside everything looks good:

fly ssh console
Connecting to top1.nearest.of.xxx.internal... complete
/ # curl -i localhost:4000
HTTP/1.1 302 Found # 302 expected
...

The weirdest part is that once I add another instance via fly scale count 2 it magically starts to work just fine:

Instances
ID              PROCESS VERSION REGION  DESIRED STATUS  HEALTH CHECKS           RESTARTS        CREATED   
cd7794a2        app     40      iad     run     running 1 total, 1 passing      0               51s ago  
4baf3b2d        app     40      iad     run     running 1 total, 1 passing      0               8m48s ago

❯ curl -i https://xxx.fly.dev
HTTP/2 302 
cache-control: max-age=0, private, must-revalidate
content-length: 83
content-type: text/html; charset=utf-8
cross-origin-window-policy: deny
date: Tue, 19 Apr 2022 21:42:02 GMT
location: https://xxx
server: Fly/affaeede (2022-04-19)
x-content-type-options: nosniff
x-download-options: noopen
x-frame-options: SAMEORIGIN
x-permitted-cross-domain-policies: none
x-request-id: FudqK1RyQ6-sBZwAAAPh
x-xss-protection: 1; mode=block
via: 2 fly.io
fly-request-id: 01G11WCYQSBFPAF84X9DHT08Y2-ams

I’ve tried restarting and removing instances and nothing helps - there must be at least 2 instances to get any response.

Any ideas?

I’m seeing the same thing here; it just started.

This may be related to some capacity issues in IAD yesterday. Can you try this again?

Seems to be working fine now with only 1 instance :ok_hand:

Well, not anymore. After a few deployments the single instance stopped responding. Now even 2 is not enough; I need at least 3 to get any response.
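When the number of instances needed keeps changing like this, it can help to repeat the request a few times and count how many attempts fail at the connection level, rather than judging from a single curl. Below is a minimal sketch of such a probe. The function name `probe` and the `CURL` override variable are made up for illustration; only standard curl options are used.

```shell
# Hypothetical helper: request a URL several times and count failures.
# CURL can be overridden (e.g. CURL=false for a dry run); it defaults to curl.
probe() {
  url=$1
  attempts=${2:-10}
  failed=0
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -s: quiet, -o /dev/null: discard the body, --max-time: do not hang.
    # Any HTTP response (including a 302) counts as success here; a
    # connection reset makes curl exit non-zero and counts as a failure.
    if ! "${CURL:-curl}" -s -o /dev/null --max-time 10 "$url"; then
      failed=$((failed + 1))
    fi
    i=$((i + 1))
  done
  echo "$failed/$attempts requests failed"
}
```

Running `probe https://xxx.fly.dev 20` after each `fly scale count` change should show whether the resets are total or only intermittent.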

FYI I’ve moved the app to sea and it now works just fine with 1 instance.

Thanks. We’ll take a look to see if something’s up in IAD.

I’ve found an issue in our proxy that could explain symptoms like these - rolling out a fix now. It’s hard to confirm whether you were affected without knowing the app ID you were testing. Let us know if it happens again!

The app id is “next-core”.

P.S. Please note that I’m not very comfortable sharing that publicly - is there any way to securely pass sensitive details that should only be accessible to Fly employees? This is going to be much more important once we move to production, and AFAIK there is no form of support other than this very public forum.

I’m seeing similar trouble with iad too. Observed well after the supposed fix happened.

curl --trace-ascii - http://hostname-hidden.fly.dev/
== Info:   Trying 188.93.150.109:80...
== Info: Connected to hostname-hidden.fly.dev (188.93.150.109) port 80 (#0)
=> Send header, 87 bytes (0x57)
0000: GET / HTTP/1.1
0010: Host: hostname-hidden.fly.dev
002f: User-Agent: curl/7.79.1
0048: Accept: */*
0055:
== Info: Recv failure: Connection reset by peer
== Info: Closing connection 0
curl: (56) Recv failure: Connection reset by peer

As with others, I’m not comfortable sharing names of internal services of my clients that I build as a contractor.

Moving the service to a different data center helped:

$ flyctl regions add sjc
Region Pool:
iad
sjc
Backup Region:
$ flyctl regions remove iad

I want to move back to iad though because that’s where my database is.
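Moving back later would presumably be the same two steps in reverse. Here is a rough sketch, assuming the `flyctl regions add` and `flyctl regions remove` subcommands shown above are still available in your flyctl version and that it runs from the app directory. The `move_region` name and the `FLYCTL` override variable are hypothetical, added only so the steps can be dry-run.

```shell
# Hypothetical helper: move an app's region pool from one region to another.
# Assumes the `flyctl regions add/remove` subcommands used earlier in this
# thread. FLYCTL can be overridden, e.g. FLYCTL=echo to print the commands.
move_region() {
  from=$1
  to=$2
  # Add the target region first so the pool is never empty, and only
  # remove the old region if adding the new one succeeded.
  "${FLYCTL:-flyctl}" regions add "$to" &&
    "${FLYCTL:-flyctl}" regions remove "$from"
}
```

With `FLYCTL=echo`, calling `move_region sjc iad` only prints the two commands, which is a cheap way to check the order before touching the real app.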

I would also like to point out that https://status.flyio.net/ is currently a lie. It shows all green even though staff have known for hours that a proxy service in iad has issues! You guys have a great “style”; please stick to it by being very honest on your status page.

We’re still a small team compared to the amount of infra we manage. We try to update the status page fast, but we don’t always manage to.

I’ve created an update and have hopefully resolved the situation.