Occasionally, when the secondary machine starts after being stopped for a while, Uptime Robot reports a 503 error. The error seems to land right in the window between the secondary machine starting and its app finishing startup.
I’m guessing that in these instances it’s the Uptime Robot ping itself that triggers the secondary machine to start, based on whatever load balancing Fly is doing.
Is there a way to prevent this behavior, so that as calls to the running machine get close to the threshold for starting another machine, they’re still routed to the running machine until the next machine is fully ready?
It’s likely your app does not define an HTTP health check. Without one, the proxy will start forwarding connections to your app as soon as the machine has booted - if the app takes a bit of time to initialize, it won’t actually respond to the request, and that can result in the 503 error from the Fly proxy.
If you define a health check, though, it will only pass once your app is fully up and ready to serve traffic. The proxy takes that into account and won’t start sending requests to the machine until the check passes.
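For example, a minimal check in fly.toml might look something like this (the /healthz path and the timings are just placeholders; point it at whatever endpoint your app only answers once it’s actually ready):

```toml
[[http_service.checks]]
  # Give the app some time to boot before failures count against the machine
  grace_period = "10s"
  # How often to run the check, and how long to wait for a response
  interval = "15s"
  timeout = "5s"
  method = "GET"
  # Placeholder path: use an endpoint that responds only when the app is fully up
  path = "/healthz"
```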
Thank you so much for this. That sounds very likely.
I added an [[http_service.checks]] block and will see if UptimeRobot stops complaining.
To be a little more thorough about checking whether this worked, I’d like to stop the secondary machine, then run fly machine start <id> && curl <url for app health check> to see whether there is any gap in the health-check timing that lets a 503 through.
I’d need to specifically curl the secondary machine’s app instance (by machine id or by region). Is there any way to do that?
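For reference, here’s roughly the test I have in mind (the machine id and app URL are placeholders; the URL part is exactly what I still need a way to point at the secondary machine specifically):

```sh
#!/usr/bin/env sh
# Rough timing test: start the stopped machine, then poll the health-check URL
# and watch for any 503s while the app finishes booting.
MACHINE_ID="<machine-id>"                   # placeholder
APP_URL="https://example.fly.dev/healthz"   # placeholder health-check URL

fly machine start "$MACHINE_ID"

# Poll once a second for a minute, printing the HTTP status each time.
for i in $(seq 1 60); do
  code=$(curl -s -o /dev/null -w '%{http_code}' "$APP_URL")
  echo "$(date +%T) -> $code"
  sleep 1
done
```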