Random periods of downtime

blanst · March 28, 2022, 8:21am

I use uptimerobot to monitor my website hosted on fly and I’ve noticed it goes down randomly for a few minutes at a time. This has happened maybe six times or so in the last six months.

I haven’t been able to find anything in my logs. It’s a simple static website served from a lwan webserver container. Any ideas?

greg · March 28, 2022, 4:38pm

Strange.

Since there’s nothing in the logs, next place to check would be the metrics. The simplest place to see them is in the Fly dashboard, in the metrics panel. Look for things like load average and memory usage. If the load average spikes, or the memory usage hits its limit, that would result in the app becoming unresponsive. I’ve had that happen.

You could also try comparing to a monitor on https://debug.fly.dev/ which is Fly’s own app. If that stops responding at the same time, it would point at an infrastructure issue. However, if their app is running but yours is not, that would point to an issue with your app.
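The comparison described above can be sketched in a few lines of Python. This is only an illustration, not something Fly provides: `APP_URL` is a hypothetical placeholder, the 10-second timeout is an arbitrary choice, and the wording of the verdicts is mine. Note that if both probes fail, the problem could also be the network of the machine running the probe.

```python
import http.client
import urllib.error
import urllib.request

REFERENCE_URL = "https://debug.fly.dev/"  # Fly's own app, from the post above
APP_URL = "https://example.com/"          # hypothetical: replace with your site


def probe(url, timeout=10.0):
    """Return True if the server answers with any HTTP status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # A 4xx/5xx reply still proves the server was reachable.
        return True
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # DNS failure, refused connection, timeout, broken response, etc.
        return False


def classify(app_ok, reference_ok):
    """Turn the two probe results into the rough verdict described above."""
    if app_ok:
        return "app reachable"
    if reference_ok:
        return "likely app-specific"
    return "likely infrastructure or network path"


def main():
    # Probe both targets back to back so the results are comparable.
    print(classify(probe(APP_URL), probe(REFERENCE_URL)))
```

Calling `main()` from cron or a loop every minute or so would give a finer-grained picture than a five-minute external check.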

blanst · March 28, 2022, 4:49pm

I’ve got Grafana hooked up so I should be able to see metrics, but it’s super duper unlikely to be a resource availability issue. The “app” is a container running lwan (a very efficient web server) that serves static files. It even uses Fly’s static handler, so most requests won’t even hit the app. Also, there are at most 10 to 20 requests per hour.

I’ll try and catch an event while it’s happening. This is tricky because there’s no discernible pattern to when it happens. It’s not an issue for this app, but I do want to find out why this happens.

jerome · March 28, 2022, 5:49pm

If you have historical times of these downtimes, we can look into logs and metrics on our end.

blanst · March 29, 2022, 7:52am

Incident	Started at	Duration
Connection Timeout	2022-03-28 13:40:32	8 m
Connection Timeout	2022-02-24 22:29:58	9 m
Connection Timeout	2022-01-21 20:24:27	4 m
Connection Timeout	2022-01-20 21:23:44	4 m

The uptime check happens once every five minutes.

jerome · March 29, 2022, 11:44am

A connection timeout error from UptimeRobot would mean it couldn’t connect to our servers within 45 seconds (according to their FAQ).
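To see which kind of failure is actually occurring, a plain TCP connect can be timed and classified. The sketch below assumes port 443 and reuses the 45-second figure quoted above as the default timeout; the outcome labels are made up for illustration and are not UptimeRobot's terminology.

```python
import socket
import time


def timed_connect(host, port=443, timeout=45.0):
    """Try one TCP connection and return (outcome, seconds_elapsed).

    The timeout applies to each address the host resolves to, so the
    total can exceed it when several addresses are tried in turn.
    """
    start = time.monotonic()
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.gaierror:
        outcome = "dns_failure"          # the name did not resolve
    except (socket.timeout, TimeoutError):
        outcome = "connect_timeout"      # no answer to the connection attempt
    except ConnectionRefusedError:
        outcome = "refused"              # host reachable, nothing listening
    except OSError:
        outcome = "other_network_error"  # e.g. network unreachable
    else:
        sock.close()
        outcome = "connected"
    return outcome, time.monotonic() - start
```

A `connect_timeout` after the full timeout suggests packets are being dropped somewhere on the path, whereas a fast `refused` or `dns_failure` points elsewhere.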

Those are hard to troubleshoot, but as far as we can tell, they’re not happening. We have continuous monitoring from multiple different regions and we get notified when a connection timeout occurs (with a shorter timeout setting) or any other proxy-related issues.

According to UptimeRobot’s “Locations and IPs” page, they make requests primarily from Dallas, and if those fail they try from other locations.

Do you have more details about each failing check? Like from which remote location it might’ve failed?

I hope this doesn’t sound too dismissive. We take these issues seriously, but we need more information.

blanst · March 29, 2022, 11:57am

I don’t have any more info, save for the fact that every time they reported my site as down, I couldn’t connect to it either. What info can I capture for you the next time this happens and I’m in front of my PC?

Thanks for looking at this and I understand you can only do so much with the limited info. I’d like to get to the bottom of this too because these ghost in the shell type bugs irk me.
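One possible way to capture details for the next incident is to write a timestamped record for every check, so that failures can later be lined up against server-side logs. This is a sketch of my own, not something requested in the thread: the field names and the JSON-lines format are assumptions, and the outcome string and headers are whatever your own probe produces.

```python
import json
from datetime import datetime, timezone


def make_record(url, outcome, elapsed_s, headers=None, when=None):
    """Build one JSON line describing a single check, timestamped in UTC."""
    when = when or datetime.now(timezone.utc)
    return json.dumps(
        {
            "time_utc": when.isoformat(timespec="seconds"),
            "url": url,
            "outcome": outcome,              # e.g. "connected" or "connect_timeout"
            "elapsed_s": round(elapsed_s, 3),
            "headers": dict(headers or {}),  # response headers, if any were received
        },
        sort_keys=True,
    )


def append_record(path, line):
    """Append one record to a log file, one JSON object per line."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")
```

Recording times in UTC avoids ambiguity when comparing against someone else's logs.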

jerome · March 29, 2022, 12:23pm

Do you know if you’re connecting over IPv4 or IPv6?

I’ll investigate a bit more.
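A quick way to answer the IPv4 versus IPv6 question from the client side is to resolve the hostname separately for each address family and try a TCP connection to every address returned. This is a minimal sketch under my own assumptions (port 443, a 10-second timeout, hypothetical function names).

```python
import socket

FAMILIES = {"IPv4": socket.AF_INET, "IPv6": socket.AF_INET6}


def resolve(host, port, family):
    """Return the distinct addresses the host resolves to for one family."""
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return []  # no records for this family, or resolution failed
    return sorted({info[4][0] for info in infos})


def can_connect(address, port, family, timeout=10.0):
    """Return True if a TCP connection to the address succeeds in time."""
    try:
        with socket.socket(family, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            sock.connect((address, port))
        return True
    except OSError:
        # Covers timeouts, refusals, and hosts without IPv6 support.
        return False


def check_families(host, port=443):
    """Map each family name to {address: reachable} for the given host."""
    return {
        name: {addr: can_connect(addr, port, family)
               for addr in resolve(host, port, family)}
        for name, family in FAMILIES.items()
    }
```

If one family connects while the other fails during an outage, that narrows down where to look.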

blanst · March 29, 2022, 3:01pm

IPv4 primarily. UptimeRobot also reports IPv4 client addresses, so I’m guessing they’re doing the same. Each check reports three remote locations from the pool of New York, Tokyo, Central Canada, Dublin, & Amsterdam.

Would their remote client IPs help?

kurt · March 29, 2022, 3:03pm

Can you set one up against debug.fly.dev as well?

Which app are you testing?

philipbrown · March 30, 2022, 5:18am

I’m seeing the same problem. I’ve got monitors with UptimeRobot and Cronitor and they both seem to be triggered at the same time fairly regularly.

It happened last night and from Sentry I can see that it was because the app was unable to connect to the database.

This seems to happen quite a lot to my app.

Any ideas @kurt?

blanst · March 30, 2022, 5:47am

I’ll set up a monitor for debug.fly.dev. I’ve got two apps exhibiting this issue on and off: kcmag-ghost and shakthi-palace-web.
