Connections to app started hanging

After a recent deploy, I was no longer able to connect to my app, neither through a custom domain nor through the *.fly.dev domain. Connections just hang until they time out.

I’m able to ssh onto the box and wget localhost successfully, but connecting from the internet has stopped working.
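For anyone debugging the same symptom, here is a minimal sketch of a probe that tells a hanging connection (timeout) apart from a refused one or a DNS failure. It uses only the Python standard library. The function name `probe` and the result labels are my own, not part of any Fly tooling.

```python
import socket


def probe(host: str, port: int, timeout: float = 5.0) -> str:
    """Classify a single TCP connection attempt.

    Returns "ok", "dns_error", "timeout", "refused", or "error: <details>".
    The labels are illustrative, not taken from any Fly tool.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.gaierror:
        # The name could not be resolved at all.
        return "dns_error"
    except socket.timeout:
        # No answer within the timeout: the "hang" described above.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # Anything else (unreachable network, reset, and so on).
        return f"error: {exc}"
```

Calling `probe("MYAPP.fly.dev", 443)` from outside and `probe("127.0.0.1", <internal port>)` from inside the box would show whether the two paths really behave differently.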

I’ve redeployed my latest change, still nothing. Rolled back to a working version, still nothing.

I’ve double-checked my DNS, and dig returns the correct IP address for the app.
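The same DNS check can be scripted. This is a rough sketch using the standard library resolver. The helper names `resolved_ips` and `dns_matches` are hypothetical, and the expected addresses are whatever your dashboard lists for the app.

```python
import socket
from typing import Iterable, Set


def resolved_ips(host: str) -> Set[str]:
    """Return every IPv4/IPv6 address the system resolver gives for host."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return {info[4][0] for info in infos}


def dns_matches(host: str, expected: Iterable[str]) -> bool:
    """True if at least one resolved address is in the expected set."""
    return bool(resolved_ips(host) & set(expected))
```

Note that this goes through the local resolver and its cache, so it can disagree with dig pointed at a specific name server.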

I’m sort of at a loss as to how to continue. Is there an issue with fly? Or is there something I could debug on my end?

Update: for a while fly ssh console was returning the following error:

Error host unavailable: host was not found in DNS

A restart of the app fixed that problem, but connecting via the internet still does not work.

Similar problem here. My app is down after I scaled it. The scale seemed to work (the dashboard shows it as successful), but https://MYAPP.fly.dev is not responding. I tried scaling it back; again the dashboard says it worked, but the app is still not responding over the web.

fly ssh console seems to work fine.

Not sure what to do here. Our production app is completely down. I’ve tried restarting a few times and it still doesn’t work. My dashboard says everything is OK, and status.flyio.net shows no problems, but nothing.

Is anyone looking into it?

This is pretty worrying. My plan is to release this app publicly shortly but this doesn’t fill me with confidence.

Any official word on this?

This is possibly related to a state propagation issue. We’re looking into it now. Will post a status page update when we have a bit more information.

@callum your app should be back up.

We provisioned a new server in London and it appears to be misconfigured. I marked it as such and drained all instances from it. Your instance got rescheduled to another server and now your app is available again.

Briefly: because of this particular kind of misconfiguration, our proxy didn’t know how to route to any app instance that landed on that server.

For a production app, it’s recommended to run more than one instance, ideally with the extra instance in a different but nearby region. Even though we work hard to make sure none of our servers go down, it happens sometimes.

@Nik does this also fix the issue for you?

I ran into the same issue here in LAX.

@jerome Nope. Still not able to connect, even after a restart. My app is in Frankfurt though.

Not sure if this helps or not, but I found this output interesting from fly curl:

$ fly curl <my app's healthcheck endpoint>
REGION	STATUS	DNS  	CONNECT	TLS  	TTFB 	TOTAL
ams   	301   	2.9ms	3.4ms  	3.5ms	4.4ms	4.7ms
cdg   	301   	0.5ms	0.7ms  	0.8ms	1ms  	1ms
dfw   	301   	0.6ms	0.8ms  	0.9ms	1.3ms	1.3ms
ewr   	301   	0.5ms	0.8ms  	0.8ms	1.2ms	1.2ms
hkg   	301   	1.1ms	1.7ms  	1.8ms	2.6ms	2.7ms
iad   	301   	1ms  	1.4ms  	1.4ms	1.8ms	1.8ms
lax   	301   	0.5ms	0.7ms  	0.8ms	1.1ms	1.1ms
lhr   	301   	0.7ms	1ms    	1.1ms	1.5ms	1.5ms
nrt   	301   	1ms  	2.9ms  	3ms  	3.7ms	3.8ms
sea   	301   	0.5ms	0.7ms  	0.8ms	1.1ms	1.1ms
sin   	301   	0.4ms	0.6ms  	0.7ms	0.9ms	1ms
sjc   	301   	0.4ms	0.6ms  	0.6ms	0.9ms	1ms
syd   	301   	0.8ms	1.2ms  	1.3ms	1.9ms	2ms
yyz   	301   	0.8ms	1.1ms  	1.2ms	1.6ms	1.7ms

Failures
REGION	ERROR
fra   	Region not available: fra
gru   	Region not available: gru
maa   	Region not available: maa
mad   	Region not available: mad
mia   	Region not available: mia
ord   	Region not available: ord
phx   	Region not available: phx
scl   	Region not available: scl
yul   	Region not available: yul

Different issue, but again on a new node (a different one).

I can reach your app now.

Can confirm that it works now.

Would this outage have been avoided with having multiple instances running, as you mention, in different regions?

Yes.

Our proxy detects unavailable nodes / instances and will route requests / connections elsewhere.
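To illustrate why a second instance helps, here is a toy sketch of the selection idea. It is not our proxy's actual code. The `Instance` type, the `reachable` flag, and `pick_instance` are made up for this example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Instance:
    region: str      # region code, e.g. "lhr" or "ams"
    reachable: bool  # hypothetical flag: can the proxy route to it?


def pick_instance(
    instances: Sequence[Instance], preferred_region: str
) -> Optional[Instance]:
    """Pick a reachable instance, preferring the given region.

    Returns None when no instance is reachable, which is the
    single-instance outage described in this thread.
    """
    reachable = [inst for inst in instances if inst.reachable]
    if not reachable:
        return None
    for inst in reachable:
        if inst.region == preferred_region:
            return inst
    # Fall back to any reachable instance elsewhere.
    return reachable[0]
```

With a single instance on a bad node, `pick_instance` has nothing to return. With a second instance in a nearby region, requests still have somewhere to go.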