Connections to app started hanging

Nik · July 12, 2022, 10:43am

After a recent deploy, I was no longer able to connect to my app, either through a custom domain or through the *.fly.dev domain. Connections just hang until they time out.

I’m able to ssh onto the box and wget localhost successfully, but connecting from the internet has stopped working.

I’ve redeployed my latest change, still nothing. Rolled back to a working version, still nothing.

I’ve double-checked my DNS, and dig returns the correct IP address for the app.

I’m sort of at a loss as to how to continue. Is there an issue with fly? Or is there something I could debug on my end?
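For anyone debugging the same symptom, the two checks described above (does the name resolve, and does a connection from outside hang or fail) can be scripted. This is only a minimal sketch using the Python standard library; the function name `check_reachability`, the default port 443, and the 5-second timeout are assumptions for illustration, not anything Fly-specific.

```python
import socket


def check_reachability(host, port=443, timeout=5.0):
    """Resolve `host`, then attempt a TCP connection to each resolved address.

    Returns a dict with a DNS error (if any) and a per-address result string.
    The port and timeout defaults are illustrative assumptions.
    """
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # The name did not resolve at all.
        return {"dns_error": str(exc), "results": {}}

    results = {}
    for _family, _type, _proto, _canon, sockaddr in infos:
        addr = sockaddr[0]
        try:
            # A hanging connection surfaces as a timeout instead of blocking forever.
            with socket.create_connection((addr, port), timeout=timeout):
                results[addr] = "connected"
        except socket.timeout:
            results[addr] = "timed out"
        except OSError as exc:
            results[addr] = f"failed: {exc}"
    return {"dns_error": None, "results": results}
```

Calling `check_reachability("MYAPP.fly.dev")` from a machine outside the platform separates a DNS problem from a routing problem: a name that resolves but reports "timed out" for every address matches the hanging behaviour described in this post.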

Nik · July 12, 2022, 10:45am

Update: for a while fly ssh console was returning the following error:

Error host unavailable: host was not found in DNS

A restart of the app fixed that problem, but connecting via the internet still does not work.

callum · July 12, 2022, 11:00am

Similar problem here. My app is down after I scaled it. The scale seemed to work (dashboard shows it as successful), but https://MYAPP.fly.dev is not responding. Tried scaling it back, but again the dashboard says it worked but it’s not responding over the web.

fly ssh console seems to work fine.

callum · July 12, 2022, 11:20am

Not sure what to do here. Our production app is completely down. I’ve tried restarting a few times and it still doesn’t work. My dashboard says everything is OK, and status.flyio.net shows no problems, but nothing.

Is anyone looking into it?

Nik · July 12, 2022, 11:34am

This is pretty worrying. My plan is to release this app publicly shortly but this doesn’t fill me with confidence.

Any official word on this?

jerome · July 12, 2022, 11:35am

This is possibly related to a state propagation issue. We’re looking into it now. Will post a status page update when we have a bit more information.

jerome · July 12, 2022, 12:00pm

@callum your app should be back up.

We provisioned a new server in London and it appears to be misconfigured. I marked it as such and drained all instances from it. Your instance got rescheduled to another server and now your app is available again.

Briefly: if an app instance landed on that server, our proxy didn’t know how to route to it because of this particular misconfiguration.

For a production app, we recommend running more than one instance, ideally in a different, nearby region. Even though we work hard to make sure none of our servers go down, it sometimes happens.

@Nik does this also fix the issue for you?

SebastianSzturo · July 12, 2022, 12:02pm

I ran into the same issue here in LAX.

Nik · July 12, 2022, 12:05pm

@jerome Nope. Still not able to connect, even after a restart. My app is in Frankfurt though.

Not sure if this helps or not, but I found this output interesting from fly curl:

$ fly curl <my app's healthcheck endpoint>
REGION	STATUS	DNS  	CONNECT	TLS  	TTFB 	TOTAL
ams   	301   	2.9ms	3.4ms  	3.5ms	4.4ms	4.7ms
cdg   	301   	0.5ms	0.7ms  	0.8ms	1ms  	1ms
dfw   	301   	0.6ms	0.8ms  	0.9ms	1.3ms	1.3ms
ewr   	301   	0.5ms	0.8ms  	0.8ms	1.2ms	1.2ms
hkg   	301   	1.1ms	1.7ms  	1.8ms	2.6ms	2.7ms
iad   	301   	1ms  	1.4ms  	1.4ms	1.8ms	1.8ms
lax   	301   	0.5ms	0.7ms  	0.8ms	1.1ms	1.1ms
lhr   	301   	0.7ms	1ms    	1.1ms	1.5ms	1.5ms
nrt   	301   	1ms  	2.9ms  	3ms  	3.7ms	3.8ms
sea   	301   	0.5ms	0.7ms  	0.8ms	1.1ms	1.1ms
sin   	301   	0.4ms	0.6ms  	0.7ms	0.9ms	1ms
sjc   	301   	0.4ms	0.6ms  	0.6ms	0.9ms	1ms
syd   	301   	0.8ms	1.2ms  	1.3ms	1.9ms	2ms
yyz   	301   	0.8ms	1.1ms  	1.2ms	1.6ms	1.7ms

Failures
REGION	ERROR
fra   	Region not available: fra
gru   	Region not available: gru
maa   	Region not available: maa
mad   	Region not available: mad
mia   	Region not available: mia
ord   	Region not available: ord
phx   	Region not available: phx
scl   	Region not available: scl
yul   	Region not available: yul

jerome · July 12, 2022, 12:18pm

That was a different issue, on a different new node.

I can reach your app now.

Nik · July 12, 2022, 12:21pm

Can confirm that it works now.

Would this outage have been avoided by running multiple instances in different regions, as you mention?

jerome · July 12, 2022, 12:24pm

Yes.

Our proxy detects unavailable nodes / instances and will route requests / connections elsewhere.
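To illustrate why a second instance helps, here is a rough sketch of health-aware routing in general. It is not Fly's proxy code; the `Instance` record, its fields, and `pick_instance` are hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    # Hypothetical record; the field names are illustrative only.
    id: str
    region: str
    healthy: bool


def pick_instance(instances: List[Instance], preferred_region: str) -> Optional[Instance]:
    """Return a healthy instance, preferring `preferred_region`.

    Returns None when no instance is healthy, which is the single-instance
    outage case: there is nowhere else to send the request.
    """
    healthy = [inst for inst in instances if inst.healthy]
    if not healthy:
        return None
    for inst in healthy:
        if inst.region == preferred_region:
            return inst
    # Fall back to any healthy instance in another region.
    return healthy[0]
```

With a single instance on an unreachable node the function has nothing to return, while adding a second instance in another region (say `ams` alongside `lhr`) gives it a fallback.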
