I’ve recently moved my personal website to Fly. I have uptime tracking set up for the website, and since the move I’m seeing response times regularly spike quite high. Before the move I had fairly constant response times of 100-600ms; now I regularly see spikes of up to 10s, interspersed with periods where response times are as expected.
As a result, if a machine is not running when a request arrives, there is a delay while the machine starts. 10s seems long, but it’s possibly caused by that. I’d start by disabling auto stop and seeing if the problem goes away. If not, you can continue to debug. If it does, that would be the cost trade-off.
This is using Elixir and Phoenix. The uptime check runs every 5 minutes, so given the documented “Fly Proxy should take when the app is idle for several minutes” it might or might not run into this, depending on what “several” means.
If the uptime check is pinging your app every 5 minutes, that should keep the instance awake. Do you see “excess capacity” anywhere in your logs? That means it autostops due to inactivity.
Try setting auto_stop_machines = 'suspend' and redeploy to see if you still get those big spikes. If not, then your app stack’s initialization is somehow slow.
Yeah, I do see such messages from ams (which is my closest region and the primary one), so it could indeed be scaling. For now I’ve set min_machines_running = 1 and will continue to monitor.
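For anyone finding this later, both settings mentioned in this thread live in fly.toml. This is a minimal sketch, assuming the app uses an `[http_service]` section; other keys such as the internal port are left out, and the values are illustrative rather than copied from my actual config. Either option can be used on its own.

```toml
[http_service]
  # Let Fly Proxy start a machine when a request arrives.
  auto_start_machines = true

  # Option A: suspend idle machines instead of fully stopping them,
  # which should make waking up faster than a cold start.
  auto_stop_machines = 'suspend'

  # Option B: always keep at least one machine running.
  # This setting applies to the primary region (ams in my case).
  min_machines_running = 1
```

The change takes effect after the next `fly deploy`.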
Edit: comparing the logs with my spikes does certainly suggest correlation.
The spikes have been gone for the last two days, so this was indeed the auto stop behaviour. While I was aware of it, I wasn’t expecting it to regularly trigger within the 5 minute timeframe of the uptime check.