App is down, monitoring says it's ok, how to troubleshoot?

ruslandoga · June 15, 2022, 12:06pm

Not sure if it’s OK to post here about app downtime, but one of the apps I’ve deployed, ruslan-now, seems to be down right now: HTTP requests time out, and fly ssh console times out as well. What is the usual way to troubleshoot this? The monitoring page says the app is running, the last logs are from 2022-06-15T11:01:43.793, and the logs seem normal.

wjordan · June 15, 2022, 12:10pm

We just saw two hosts in AMS go offline, and that’s the region where your app is deployed. We’re looking into it!

ruslandoga · June 15, 2022, 12:11pm

Ah, thank you for a quick response!

wjordan · June 15, 2022, 12:17pm

Looks like the servers are back online. It must have been a small, temporary network issue at the datacenter.

cooperx86 · June 15, 2022, 1:24pm

Just to confirm, I noticed this as well at AMS. Load balancer was up and started the HTTP/2 response, but I guess a backend was offline as I couldn’t ssh in either. Resolved after about 10 minutes.

kurt · June 15, 2022, 1:27pm

Yep! It was likely a switch failure. Two hosts in the same rack lost network connectivity for a few minutes. The edge was fine, and all the other hosts in AMS stayed up.

Fingers crossed you got your full quota of “Fly.io-caused-outages” filled over the last two days. It’s not normally like this, I promise.

cooperx86 · June 15, 2022, 9:02pm

Haha, hopefully. At the same time, I’ve spent a whole 12 cents so far, so I haven’t got much room to complain yet. The communication is great though - it’s a lot better to know everything is in hand.

pier · June 15, 2022, 10:03pm

I’m trying to deploy to AMS and getting this error:

Adding layer 'heroku/nodejs-engine:dist'
ERROR: failed to export: caching layer (sha256:794677c885fbe7334c92f9cedd9ca78256657998776725af428ab730b7e0aa49): write /launch-cache/staging/sha256:794677c885fbe7334c92f9cedd9ca78256657998776725af428ab730b7e0aa49.tar: copy_file_range: no space left on device
Error failed to fetch an image or build from source: executing lifecycle: failed with status code: 62

Is this related to the same issue?
