We’ve been live-testing a full-stack Django web app deployment on Fly. It’s been deployed on and off for roughly two weeks now and the response times are inconsistent, ranging from very fast to barely usable. Sometimes we get an hour of near-instant response times (<50ms) and then it starts lagging to 10s or more. Our MySQL database, S3 bucket, and Fly machines are all deployed in the same region (SIN).
We only have at most two test users on at one time so it doesn’t appear to be a load issue. Based on the metrics, it looks like we are well below the RAM/CPU threshold as well. DB queries are showing <10ms so it’s not that either.
Logs do not show any warnings or abnormalities except this (which I’ve researched and seems like a non-fatal unrelated warning): PCI: Fatal: No config space access function found
We’re running out of threads to chase and cannot seem to find the source of this random slowness. We’re hoping to ship soon and this is the only concern remaining. Any guidance would be greatly appreciated.
It sounds like you may have already tried this during your debugging, but if not it may be worth adding some external monitoring to see if that reveals what part is taking up the time (assuming the DB query is ruled out).
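If it helps, here is a rough sketch of the kind of external check I mean: a small Python probe that times each phase of a single request (DNS lookup, TCP connect, TLS handshake, time to first byte) so you can see whether the 10s is spent before the request even reaches the app. The function name `time_phases` and its parameters are my own invention for illustration, it uses only the standard library, and it is not a substitute for a proper monitoring tool.

```python
import socket
import ssl
import time


def time_phases(host, port=443, path="/", use_tls=True, timeout=30.0):
    """Return the seconds spent in each phase of one HTTP/1.1 GET request."""
    phases = {}

    # Phase 1: DNS resolution.
    t0 = time.perf_counter()
    family, socktype, proto, _, sockaddr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]
    phases["dns"] = time.perf_counter() - t0

    # Phase 2: TCP connect.
    t1 = time.perf_counter()
    sock = socket.socket(family, socktype, proto)
    sock.settimeout(timeout)
    try:
        sock.connect(sockaddr)
        phases["connect"] = time.perf_counter() - t1

        # Phase 3: TLS handshake (skipped for plain HTTP).
        if use_tls:
            t2 = time.perf_counter()
            context = ssl.create_default_context()
            sock = context.wrap_socket(sock, server_hostname=host)
            phases["tls"] = time.perf_counter() - t2

        # Phase 4: send the request and wait for the first response byte.
        request = (
            f"GET {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            "Connection: close\r\n\r\n"
        )
        t3 = time.perf_counter()
        sock.sendall(request.encode("ascii"))
        sock.recv(1)  # blocks until the server starts responding
        phases["first_byte"] = time.perf_counter() - t3

        # Drain the rest of the response so "total" covers the full download.
        while sock.recv(65536):
            pass
        phases["total"] = time.perf_counter() - t0
    finally:
        sock.close()
    return phases


# Example usage (replace the placeholder host with your own app's hostname):
#   print(time_phases("example.com"))
```

Running this every minute or so from a machine outside Fly and logging the result should show which phase grows during the slow periods. A large `connect` or `tls` value would point toward the network or proxy, while a large `first_byte` value would point toward the app or something it waits on.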
Thanks for your response. I’ve extensively profiled functions on the app level. DB queries and app function profiling are normal. Will check out the tools you recommended.
I spun up UptimeKuma yesterday to monitor the latency and look for any patterns. The inconsistent latency is between 16:47 and 18:05 on the SIN machines. I decided to see if the issue was isolated to SIN and replaced those with NRT machines. Looks like it’s stable in NRT but not SIN. NRT response times are slower (since the DB is located in SIN) but at least they’re stable. Is it safe to say this is an issue with the SIN servers and outside of my control?
Interesting. Since the only variable you changed was the region, it would certainly suggest that is the cause. It would take someone at Fly, with access to more technical/network/traceroute data, to debug it.
It seems slightly odd that the response time is either, e.g., 11 seconds or 1 second, rather than being random-ish. It’s almost like the request is being served from a cache (app, DB, etc.) vs. not. Or something is happening during that time period, like a cron job. Strange.
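One way to test the "something happens during that period" idea is to export the monitor's samples and check whether the slow responses cluster into a few windows or are scattered throughout the day. Below is a minimal Python sketch of that grouping. The function name `slow_windows`, the 5-second threshold, and the 5-minute gap are all assumptions of mine, and it expects the samples as `(timestamp, latency in seconds)` pairs already sorted by time.

```python
from datetime import datetime, timedelta


def slow_windows(samples, threshold_s=5.0, max_gap=timedelta(minutes=5)):
    """Group slow samples into windows of nearby slow responses.

    samples: iterable of (datetime, latency_seconds), sorted by time.
    Returns a list of (window_start, window_end, slow_sample_count).
    """
    windows = []
    for timestamp, latency in samples:
        if latency < threshold_s:
            continue  # fast response, not part of any slow window
        if windows and timestamp - windows[-1][1] <= max_gap:
            # Close enough to the previous slow sample: extend that window.
            start, _, count = windows[-1]
            windows[-1] = (start, timestamp, count + 1)
        else:
            # Too far from the previous slow sample: start a new window.
            windows.append((timestamp, timestamp, 1))
    return windows
```

A handful of long windows that start at similar times each day would hint at something scheduled (a cron job, a backup, a noisy neighbour on the host), whereas many single-sample windows spread evenly would look more like a per-request cache hit vs. miss.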
Hopefully someone from Fly can check on the SIN region/host for you.
The spin-up time when starting a stopped machine could cause such a delay if the application is slow to start.
I’m not 100% convinced that’s the issue (it sounds like perhaps the ~10s delays happen in the middle of usage instead of after some idle period?), but it’s something to check out! Feel free to set auto_stop_machines = false and redeploy to check.
That’s a good point from @fideloper-fly. I’d assumed that min_machines_running = 1 would apply, but if there are concurrent requests at that time … perhaps Fly is starting another machine. It might be worth experimenting with that value too, to ensure there is always a machine available.
Thanks for your input @fideloper-fly @greg. I’ve tested the suggested config (auto_stop_machines = false) without any improvement. The 10s delays happen in the middle of usage, and we’re not seeing the machines abnormally stopping/starting in the logs.
For the sake of time, we’ve decided to deploy in Japan for now. Fly’s platform has been a pleasure to work with so far, so we’ll stick it out and look forward to improvements in the future. Once we scale up to the Launch plan, we hope email support can give us some more insight into this. Ideally we’d launch as close to our users as possible (in Southeast Asia).