Hi! I built and host www.trackmypilot.com on Fly.io. I have 2 worker Machines, 4 web Machines across 2 regions (dfw, sjc), and an unmanaged Postgres cluster with 3 nodes in dfw and 1 backup in sjc. Here are the results of the status commands:
fly status
App
Name = track-my-pilot
Owner = personal
Hostname = track-my-pilot.fly.dev
Image = track-my-pilot:deployment-01KJN7DHSN3KFS4VV6TW7035WZ
Machines
PROCESS ID VERSION REGION STATE ROLE CHECKS LAST UPDATED
web 287d674a003d08 237 sjc started 2026-03-01T17:42:57Z
web 48e6393b7e0508 237 sjc stopped 2026-03-01T17:35:06Z
web 9185507db91183 237 dfw stopped 2026-03-01T17:35:04Z
web e2861366ad0686 237 dfw started 2026-03-01T17:35:08Z
worker 17810766b3de89 237 dfw started 2026-03-01T17:35:07Z
worker† 6e823745cd0987 237 dfw stopped 2026-03-01T17:35:04Z
Notes:
† Standby machine (it will take over only in case of host hardware failure)
fly status -a track-my-pilot
App
Name = track-my-pilot
Owner = personal
Hostname = track-my-pilot.fly.dev
Image = track-my-pilot:deployment-01KJN7DHSN3KFS4VV6TW7035WZ
Machines
PROCESS ID VERSION REGION STATE ROLE CHECKS LAST UPDATED
web 287d674a003d08 237 sjc started 2026-03-01T17:42:57Z
web 48e6393b7e0508 237 sjc stopped 2026-03-01T17:35:06Z
web 9185507db91183 237 dfw stopped 2026-03-01T17:35:04Z
web e2861366ad0686 237 dfw started 2026-03-01T17:35:08Z
worker 17810766b3de89 237 dfw started 2026-03-01T17:35:07Z
worker† 6e823745cd0987 237 dfw stopped 2026-03-01T17:35:04Z
Notes:
† Standby machine (it will take over only in case of host hardware failure)
For the last 3 days, performance has been incredibly slow, but only sometimes! Some page loads are great; others take 10 seconds. The logs for the app and the database show nothing. Local testing and a look at the database queries show nothing abnormal (2-3 queries per page load). Everything is built in Python / Django.
Where should I even start looking for the issue, and how can I monitor the site better?
I would look at the e28613* Machine’s metrics in Grafana, etc. Sometimes an individual physical host machine can have network problems that don’t quite rise to the level of triggering an alert over in Fly.io central, …
(You could even cordon that one Machine for a while, to see if the other dfw instances fare any better.)
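To make it easier to tell which Machine served a slow page, here is a minimal sketch of a Django middleware that times each request inside the app and tags the response with the Machine that handled it. It assumes the `FLY_MACHINE_ID` and `FLY_REGION` environment variables that Fly.io sets at runtime; the header names, the logger name, and the 1-second threshold are hypothetical choices for illustration, not anything from your app.

```python
import logging
import os
import time

logger = logging.getLogger("request_timing")  # hypothetical logger name

# Hypothetical threshold; tune to whatever "slow" means for your pages.
SLOW_REQUEST_SECONDS = 1.0


class RequestTimingMiddleware:
    """Time each request and tag the response with the Machine that served it."""

    def __init__(self, get_response):
        self.get_response = get_response
        # Set by Fly.io inside a Machine; fall back to "local" for local runs.
        self.machine_id = os.environ.get("FLY_MACHINE_ID", "local")
        self.region = os.environ.get("FLY_REGION", "local")

    def __call__(self, request):
        start = time.perf_counter()
        response = self.get_response(request)
        elapsed = time.perf_counter() - start

        # Response headers let you see from curl which Machine answered
        # and how long the app itself spent on the request.
        response["X-Served-By"] = f"{self.region}/{self.machine_id}"
        response["X-App-Seconds"] = f"{elapsed:.3f}"

        if elapsed >= SLOW_REQUEST_SECONDS:
            logger.warning(
                "slow request: %s %s took %.3fs on %s/%s",
                request.method,
                request.path,
                elapsed,
                self.region,
                self.machine_id,
            )
        return response
```

You would add the class to `MIDDLEWARE` in your settings, near the top so it wraps as much of the request as possible. If a page takes 10 seconds in the browser but `X-App-Seconds` stays small, the delay is probably happening before the request reaches Django (proxy, network, or queueing in front of the app server), since this timer only covers the time spent inside the middleware chain.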
Thanks! I’ve tried cordoning the Machine, and it seems somewhat faster but not great. Curl’ing sjc is definitely faster than dfw. Here’s a view of the HTTP response times. You can see that it got crazy on 2/27 (30 minutes!!), then calmed down, but it is still hovering around 30 seconds for some regions.
Is it worth setting up an Apache image in the same region that serves just static HTML? I wonder if there is an issue with your app, and a static page would help tell the two apart. Check CPU performance too. I appreciate there may well be latency issues in the Fly network, but it’s worth ruling out other issues too.
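If you do set up a static page for comparison, a small probe script can put numbers on "sometimes fast, sometimes slow". This is only a sketch using the Python standard library: it fetches a URL repeatedly and reports the median, 95th percentile, and worst time. The function names and defaults are my own, and the `fetch` and `sleep` parameters exist only so the logic can be exercised without touching the network.

```python
import math
import statistics
import time
import urllib.request


def fetch_seconds(url, timeout=60.0):
    """Return wall-clock seconds taken to download the full response body."""
    start = time.perf_counter()
    # urlopen raises HTTPError on 4xx/5xx, so failures are not silently timed.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
    return time.perf_counter() - start


def probe(url, attempts=20, pause=1.0, fetch=fetch_seconds, sleep=time.sleep):
    """Fetch url `attempts` times, pausing between requests; return all timings."""
    samples = []
    for i in range(attempts):
        samples.append(fetch(url))
        if i < attempts - 1:
            sleep(pause)
    return samples


def summarize(samples):
    """Summarize timings; intermittent slowness shows up in p95 and max."""
    ordered = sorted(samples)
    # Nearest-rank 95th percentile.
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return {
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }
```

For example, `summarize(probe("https://www.trackmypilot.com/"))` next to the same call against the static page would show whether the long tail appears on both. If both have a bad p95, the app is less likely to be the cause; if only the Django pages do, look at the app and its database connections first.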
With a properly tuned soft_limit, high load (like a DoS) would have auto-started both the e286* and the 9185* Machines in dfw, but in your fly status output only one of those two was running.
If I recall correctly, you can look at how request/connection counts compare between regions under App Concurrency.
Aside: We in the community forum generally can’t see your metrics, or follow links to your Grafana dashboard, etc., so we only know what you yourself post in the form of screenshots, output excerpts, and so on.