Random periods of downtime

I use UptimeRobot to monitor my website hosted on Fly, and I’ve noticed it goes down randomly for a few minutes at a time. This has happened about six times in the last six months.

I haven’t been able to find anything in my logs. It’s a simple static website served from a lwan webserver container. Any ideas?

Strange.

Since there’s nothing in the logs, the next place to check would be the metrics. The simplest place to see them is the metrics panel in the Fly dashboard. Look for things like load average and memory usage. If the load average spikes, or the memory usage hits its limit, the app can become unresponsive. I’ve had that happen.

You could also try comparing to a monitor on https://debug.fly.dev/ which is Fly’s own app. If that stops responding at the same time, it would point at an infrastructure issue. However, if their app is running but your app is not, that would point towards an issue with your app.
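If you want to run that comparison yourself between monitor checks, here is a minimal Python sketch. It assumes a placeholder APP_URL (hypothetical, swap in your own app) and treats any HTTP response, even an error status, as "reachable". The function names are my own, not anything Fly or UptimeRobot provides.

```python
import urllib.error
import urllib.request
from datetime import datetime, timezone

APP_URL = "https://example-app.fly.dev/"  # hypothetical placeholder: your app
REF_URL = "https://debug.fly.dev/"        # Fly's own app, as suggested above


def probe(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL gives any HTTP response within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered with an error status, so it is still reachable.
        return True
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, TLS error, and so on.
        return False


def verdict(app_ok: bool, ref_ok: bool) -> str:
    """Turn the two probe results into a rough hint about where to look."""
    if app_ok and ref_ok:
        return "both up"
    if not app_ok and not ref_ok:
        return "both down: suspect infrastructure or the local network"
    if not app_ok:
        return "only the app is down: suspect the app"
    return "only the reference is down"


def check_once() -> str:
    """Probe both URLs once and return a timestamped one-line summary."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return f"{stamp} {verdict(probe(APP_URL), probe(REF_URL))}"
```

Calling print(check_once()) from a cron job or a loop with a sleep gives you a timestamped line per check that you can line up against the monitor's incident times.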

I’ve got Grafana hooked up so I should be able to see metrics, but it’s super duper unlikely to be a resource availability issue. The “app” is a container running lwan (a very efficient web server) that serves static files. It even uses Fly’s static handler, so most requests won’t even hit the app. Also, there are at most 10 to 20 requests per hour.

I’ll try and catch an event while it’s happening. This is tricky because there’s no discernible pattern to when it happens. It’s not an issue for this app, but I do want to find out why this happens.

If you have historical times of these downtimes, we can look into logs and metrics on our end.

Incident            Started at           Duration
Connection Timeout  2022-03-28 13:40:32  8 m
Connection Timeout  2022-02-24 22:29:58  9 m
Connection Timeout  2022-01-21 20:24:27  4 m
Connection Timeout  2022-01-20 21:23:44  4 m

The uptime check happens once every five minutes.
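Since the checks are five minutes apart, the outage could have begun any time after the last successful check, so it may help to widen each incident by one interval when searching logs and metrics. A rough sketch of that arithmetic follows. The time zone of the timestamps is not stated in the list, so the values are kept naive here, and search_window is just an illustrative helper name.

```python
from datetime import datetime, timedelta

# From the thread: one uptime check every five minutes.
CHECK_INTERVAL = timedelta(minutes=5)

# (started at, duration in minutes), copied from the incident list above.
# Assumption: time zone unknown, so these stay naive datetimes.
INCIDENTS = [
    ("2022-03-28 13:40:32", 8),
    ("2022-02-24 22:29:58", 9),
    ("2022-01-21 20:24:27", 4),
    ("2022-01-20 21:23:44", 4),
]


def search_window(started_at: str, minutes: int):
    """Return (earliest possible start, reported end) for one incident."""
    start = datetime.strptime(started_at, "%Y-%m-%d %H:%M:%S")
    return start - CHECK_INTERVAL, start + timedelta(minutes=minutes)


WINDOWS = [search_window(started, minutes) for started, minutes in INCIDENTS]

for begin, end in WINDOWS:
    print(f"search logs from {begin} to {end}")
```

For the first incident this gives a window from 13:35:32 to 13:48:32 on 2022-03-28.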

A connection timeout error from UptimeRobot would mean it couldn’t connect to our servers within 45 seconds (according to their FAQ).
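To tell a genuine connection timeout apart from a refused connection or a DNS failure during an outage, something like the following could be run from your own machine. It is only a sketch: classify_connect is a made-up name, the 45-second default mirrors the figure quoted above, and it tests the TCP connection alone, not TLS or HTTP.

```python
import socket
import time


def classify_connect(host, port=443, timeout=45.0):
    """Try one TCP connection and return (outcome, seconds elapsed)."""
    started = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            outcome = "connected"
    except socket.gaierror:
        # The hostname could not be resolved at all.
        outcome = "dns failure"
    except (socket.timeout, TimeoutError):
        # No answer to the connection attempt within the timeout.
        outcome = "timeout"
    except ConnectionRefusedError:
        # The host answered but nothing is listening on that port.
        outcome = "refused"
    except OSError as exc:
        # Anything else, for example "network unreachable".
        outcome = f"error: {exc}"
    return outcome, time.monotonic() - started
```

For example, classify_connect("debug.fly.dev") returns the outcome together with how long the attempt took, which is worth noting down next to the time of the incident.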

Those are hard to troubleshoot, but as far as we can tell, they’re not happening. We have continuous monitoring from multiple different regions and we get notified when a connection timeout occurs (with a shorter timeout setting) or any other proxy-related issues.

According to UptimeRobot’s “Locations and IPs” page, they make requests primarily from Dallas and, if those fail, they try from other locations.

Do you have more details about each failing check? Like from which remote location it might’ve failed?

I hope this doesn’t sound too dismissive. We take these issues seriously, but we need more information.

I don’t have any more info, save for the fact that every time they reported my site as down, I couldn’t connect to it either. What info can I capture for you the next time this happens and I’m in front of my PC?

Thanks for looking at this and I understand you can only do so much with the limited info. I’d like to get to the bottom of this too because these ghost in the shell type bugs irk me.

Do you know if you’re connecting over IPv4 or IPv6?

I’ll investigate a bit more.
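One way to answer the IPv4 versus IPv6 question from the client side is to resolve the hostname and try each returned address separately. This is a sketch under my own assumptions: connect_per_family is a hypothetical helper, it only tests the TCP connection, and a DNS failure will surface as a socket.gaierror exception rather than a result.

```python
import socket

FAMILY_NAMES = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}


def connect_per_family(host, port=443, timeout=10.0):
    """Return {(family name, address): outcome} for every resolved address."""
    results = {}
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    for family, socktype, proto, _canonname, sockaddr in infos:
        name = FAMILY_NAMES.get(family)
        if name is None:
            continue  # ignore anything that is neither IPv4 nor IPv6
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            results[(name, sockaddr[0])] = "connected"
        except OSError as exc:
            # Covers timeouts, refused connections and unreachable networks.
            results[(name, sockaddr[0])] = f"failed: {exc}"
    return results
```

If the IPv6 entries fail while the IPv4 ones connect (or the other way round) during an outage, that narrows things down considerably.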

IPv4 primarily. UptimeRobot also reports IPv4 client addresses, so I’m guessing they’re doing the same. Each check reports three remote locations from the pool of New York, Tokyo, Central Canada, Dublin, and Amsterdam.

Would their remote client IPs help?

Can you set one up against debug.fly.dev as well?

Which app are you testing?

I’m seeing the same problem. I’ve got monitors with UptimeRobot and Cronitor and they both seem to be triggered at the same time fairly regularly.

It happened last night and from Sentry I can see that it was because the app was unable to connect to the database.

This seems to happen quite a lot to my app.

Any ideas @kurt?

I’ll set up a monitor for debug.fly.dev. I’ve got two apps exhibiting this issue on and off: kcmag-ghost and shakthi-palace-web.