Horrible latency last 3 days (DFW, SJC)

LAKE_Development · March 1, 2026, 5:51pm

Hi! I’ve built www.trackmypilot.com and host it on Fly.io. I have 2 worker machines and 4 web machines across 2 regions (dfw, sjc), plus a non-managed Postgres cluster: 3 in dfw and 1 backup in sjc. Here is the output of the status commands:

fly status          
App
  Name     = track-my-pilot
  Owner    = personal
  Hostname = track-my-pilot.fly.dev
  Image    = track-my-pilot:deployment-01KJN7DHSN3KFS4VV6TW7035WZ  

Machines
PROCESS ID              VERSION REGION  STATE   ROLE    CHECKS  LAST UPDATED
web     287d674a003d08  237     sjc     started                 2026-03-01T17:42:57Z
web     48e6393b7e0508  237     sjc     stopped                 2026-03-01T17:35:06Z
web     9185507db91183  237     dfw     stopped                 2026-03-01T17:35:04Z
web     e2861366ad0686  237     dfw     started                 2026-03-01T17:35:08Z
worker  17810766b3de89  237     dfw     started                 2026-03-01T17:35:07Z
worker† 6e823745cd0987  237     dfw     stopped                 2026-03-01T17:35:04Z

Notes:
  † Standby machine (it will take over only in case of host hardware failure)

fly status -a track-my-pilot
App
  Name     = track-my-pilot
  Owner    = personal
  Hostname = track-my-pilot.fly.dev
  Image    = track-my-pilot:deployment-01KJN7DHSN3KFS4VV6TW7035WZ

Machines
PROCESS ID              VERSION REGION  STATE   ROLE    CHECKS  LAST UPDATED
web     287d674a003d08  237     sjc     started                 2026-03-01T17:42:57Z
web     48e6393b7e0508  237     sjc     stopped                 2026-03-01T17:35:06Z
web     9185507db91183  237     dfw     stopped                 2026-03-01T17:35:04Z
web     e2861366ad0686  237     dfw     started                 2026-03-01T17:35:08Z
worker  17810766b3de89  237     dfw     started                 2026-03-01T17:35:07Z
worker† 6e823745cd0987  237     dfw     stopped                 2026-03-01T17:35:04Z

Notes:
  † Standby machine (it will take over only in case of host hardware failure)

For the last 3 days, performance has been intermittently very slow. Sometimes it’s great; other times it takes 10 seconds to load a page. Logs for the app and database show nothing. Local testing and a look at the database queries show nothing abnormal (2-3 queries per page load). Everything is built in Python / Django.

Where should I even start to look for the issue, or how can I attempt to monitor the site better?

mayailurus · March 1, 2026, 6:51pm

Hm… Your dfw Machine is consistently slow from here (10–20 seconds to load the root page), whereas sjc has been fast every time I’ve tried:

$ curl -i -H 'fly-prefer-region: sjc' -H 'flyio-debug: doit' \
  'https://track-my-pilot.fly.dev/'
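
To rule out one-off flukes, the same per-region comparison can be scripted and repeated a few times. Here is a minimal Python sketch, assuming only the fly-prefer-region header from the curl command above. The function names (time_request, summarize, probe), the attempt count, and the pause are made up for illustration.

```python
import statistics
import time
import urllib.request

# Hostname taken from the fly status output earlier in the thread.
APP_URL = "https://track-my-pilot.fly.dev/"


def time_request(url, region, timeout=60.0):
    """Return the seconds taken to fetch url while asking Fly's proxy to prefer `region`."""
    req = urllib.request.Request(url, headers={"fly-prefer-region": region})
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        resp.read()  # include the body transfer in the timing
    return time.perf_counter() - start


def summarize(samples):
    """Reduce a list of durations (in seconds) to count / min / median / max."""
    return {
        "n": len(samples),
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


def probe(url=APP_URL, regions=("dfw", "sjc"), attempts=10, pause=1.0):
    """Time `attempts` requests per region and return one summary per region."""
    results = {}
    for region in regions:
        samples = []
        for _ in range(attempts):
            samples.append(time_request(url, region))
            time.sleep(pause)  # be gentle; this is a probe, not a load test
        results[region] = summarize(samples)
    return results


# Example (performs real HTTP requests):
#   for region, stats in probe().items():
#       print(region, stats)
```

Note that urlopen raises an exception on HTTP error statuses and on timeouts, so a failing region will show up as a traceback rather than as a slow sample.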

I would look at the e28613* Machine’s metrics in Grafana, etc. Sometimes an individual physical host machine can have network problems that don’t quite rise to the level of triggering an alert over in Fly.io central, …

(You could even cordon that one Machine for a while, to see if the other dfw instances fare any better.)

Hope this helps a little!

LAKE_Development · March 2, 2026, 12:29pm

Thanks! I’ve tried cordoning the machine and it seems somewhat faster but not great. Curl’ing sjc is definitely faster than dfw. Here’s a view of the http response times. You can see that it got crazy 2/27 (30 minutes!!) then calmed down but is still hovering around 30 seconds for some regions.

[Screenshot: HTTP Dashboard]

Can I be sure this isn’t a DoS attack? Also, I realized I listed the same set of machines twice earlier. Here’s the database status:

fly status -a track-my-pilot-db
ID              STATE   ROLE    REGION  CHECKS                  IMAGE                                   CREATED                 UPDATED
d8d135dc291008  started replica dfw     3 total, 3 passing      flyio/postgres-flex:17.2 (v0.1.0)       2026-02-28T04:12:03Z    2026-02-28T04:12:07Z
d8924d0a6e0148  started replica dfw     3 total, 3 passing      flyio/postgres-flex:17.2 (v0.1.0)       2026-02-28T04:11:30Z    2026-02-28T04:11:45Z
2861e54f794ee8  started replica sjc     3 total, 3 passing      flyio/postgres-flex:17.2 (v0.1.0)       2026-02-28T04:11:15Z    2026-02-28T04:11:19Z
18577d9b667698  started primary dfw     3 total, 3 passing      flyio/postgres-flex:17.2 (v0.1.0)       2025-12-21T17:43:13Z    2025-12-24T21:44:18Z

Should I just destroy the dfw servers and start over with those ones?

halfer · March 2, 2026, 1:01pm

Is it worth setting up an Apache image in the same region serving just HTML? I wonder if there is an issue with your app. Check CPU performance too. I appreciate there may well be latency issues in the Fly network, but it’s worth ruling out other issues too.
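
One cheap way to separate app time from network time is to have the app report how long it spent on each request. Here is a rough sketch of a function-style Django middleware, assuming the app runs on Fly Machines where the FLY_REGION and FLY_MACHINE_ID environment variables are set. The logger name, the SLOW_SECONDS threshold, and the "app" metric label are placeholders of mine, not anything from the original app.

```python
import logging
import os
import time

logger = logging.getLogger("request_timing")  # hypothetical logger name

# Environment variables that Fly.io sets inside each Machine.
REGION = os.environ.get("FLY_REGION", "unknown")
MACHINE_ID = os.environ.get("FLY_MACHINE_ID", "unknown")

SLOW_SECONDS = 1.0  # hypothetical threshold; tune to taste


def timing_middleware(get_response):
    """Function-style Django middleware that times each request inside the app."""

    def middleware(request):
        start = time.perf_counter()
        response = get_response(request)
        elapsed = time.perf_counter() - start

        # Expose the app-side time to the client so it can be compared
        # with the total time that curl or a browser observes.
        response["Server-Timing"] = f"app;dur={elapsed * 1000:.1f}"

        level = logging.WARNING if elapsed >= SLOW_SECONDS else logging.INFO
        logger.log(
            level,
            "path=%s region=%s machine=%s app_seconds=%.3f",
            getattr(request, "path", "?"),
            REGION,
            MACHINE_ID,
            elapsed,
        )
        return response

    return middleware
```

To try it, add the function's dotted path to MIDDLEWARE in settings.py, near the top so it wraps as much of the stack as possible. If curl reports 10 seconds while Server-Timing reports a few hundred milliseconds, the delay is happening outside the Django process (proxy, network, or queueing in front of the app server) rather than in the views or queries.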

mayailurus · March 2, 2026, 5:46pm

I would look more at the concurrency settings before doing so, but it might be worth moving your web instances from dfw → ord eventually, yeah.

https://fly.io/docs/blueprints/setting-concurrency-limits/

With properly tuned soft_limit values, high load (like a DoS) would have auto-started both the e286* and the 9185* Machines, but in your fly status output only one of those two was running.

If I recall correctly, you can look at how request/connection counts compare between regions under App Concurrency.
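
As a back-of-the-envelope check on those counts, Little's law (average in-flight requests = arrival rate × average response time) gives a rough idea of how many concurrent requests each Machine is holding. This is only a sketch: the function names and the numbers in the example are illustrative, and it assumes load is spread evenly across the running Machines, which the proxy does not guarantee.

```python
def in_flight_per_machine(requests_per_second, avg_response_seconds, running_machines):
    """Little's law: average in-flight requests = arrival rate x average response time.

    Divides the total evenly across Machines, which is a simplifying assumption.
    """
    if running_machines < 1:
        raise ValueError("need at least one running machine")
    return requests_per_second * avg_response_seconds / running_machines


def exceeds_soft_limit(requests_per_second, avg_response_seconds, running_machines, soft_limit):
    """True if the estimated per-Machine concurrency is above the given soft_limit."""
    estimate = in_flight_per_machine(
        requests_per_second, avg_response_seconds, running_machines
    )
    return estimate > soft_limit


# Illustrative numbers only (not taken from the thread's metrics):
# 4 requests/s at 0.5 s each on 2 Machines is about 1 in-flight request per Machine,
# while the same traffic at 10 s each on 1 Machine is about 40.
```

The second case is the interesting one: slow responses alone inflate concurrency, so a Machine can sit above its soft limit even when the request rate is modest.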

Aside: We in the community forum generally can’t see your metrics, or follow links to your Grafana dashboard, etc., so we only know what you yourself post in the form of screenshots, output excerpts, and so on.