Despite using a dedicated-1x CPU instance, it feels like there’s some sort of contention/throttling going on. (Added later: In the LHR region.)
My test load is a 120MB SQLite database against which I run a query (from within the ssh console, using sqlite3's .timer feature) that does a full table scan. This completes in ~80ms under ideal conditions on my local machine, a third-party VPS, and on Fly.io.
If I repeat the query 10-20 times, the times are consistent everywhere except Fly.io, where they're all over the place. On my Fly instance, perhaps 10% land in the 80-100ms range, 60% in the 200-400ms range, and 30% at 700ms+. This is with an identical SQL query, repeated back to back, with no other load or traffic. An example of a run just now:
Run Time: real 0.085 user 0.041548 sys 0.026738
Run Time: real 0.225 user 0.047028 sys 0.024278
Run Time: real 0.126 user 0.045533 sys 0.024659
Run Time: real 0.740 user 0.039903 sys 0.033189
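(For anyone wanting to reproduce this, the session looks roughly like the below; the database file, table, and column names are placeholders rather than my real schema.)

    # inside the instance, e.g. via `fly ssh console`
    sqlite3 test.db
    .timer on
    -- a leading-wildcard LIKE defeats any index, forcing a full table scan
    SELECT count(*) FROM items WHERE notes LIKE '%needle%';
    -- .timer prints the "Run Time: real ... user ... sys ..." lines above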
If I monitor with top, steal sometimes climbs past 70% during the query (but sits at 0 when nothing is happening). I expected that on the shared CPU instances, but perhaps I'm misunderstanding what dedicated means in the Fly.io context? (I appreciate it's a core rather than a whole CPU, and still subject to the whims of virtualization.)
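(Steal is visible outside of top too; here's a rough way to watch it while the query runs, for anyone following along:)

    # sample CPU stats every second; the last column (st) is steal,
    # i.e. time the hypervisor withheld from this vCPU
    vmstat 1
    # or take a one-shot reading of top's summary line in batch mode:
    top -bn1 | grep '%Cpu'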
Am I missing anything and is this sort of extremely spiky CPU performance generally to be expected? Thanks!
Update: I decided to go all out and deploy an identical test case in a different region to see if I could narrow down the problem and the performance is consistent and good in AMS even on the smallest shared CPU instance.
On LHR, I changed the instance size numerous times to different levels (including dedicated-2x) and it was spiky and inconsistent on them all. So I'm guessing I was running into an issue with a particular server? That app also had a volume, which presumably pinned it to the same physical host each time.
Equivalent runs in AMS:
Run Time: real 0.055 user 0.038510 sys 0.015553
Run Time: real 0.057 user 0.036355 sys 0.020173
Run Time: real 0.059 user 0.037325 sys 0.020488
Run Time: real 0.059 user 0.049303 sys 0.008383
We’re actually troubleshooting CPU performance issues on two physical hosts in London right now. It’s abnormal; you should see more consistent results.
Thanks for the heads up, I imagine it’s almost certainly related to that given things are solid in another region. I guess this is one downside of using volumes: apps can’t easily get migrated elsewhere. Good luck with the troubleshooting!
Can you tell how much disk IO is involved? It could technically be volume performance if your table scans are hitting disk. I’d think it would mostly be cached, but it would be good to know!
No (or trivial) disk I/O. In working down to the smallest case, I cloned the database into a purely in-memory one and got the same variations in wallclock times. That, coupled with the oddly large steal times (and 0% I/O wait) in top, led me to think mostly about the CPU or virtualization.
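(One way to do the in-memory clone from the sqlite3 shell, in case it's useful; not necessarily the only way, and the file name is a placeholder:)

    # open a purely in-memory database (shell), then from the sqlite3 prompt:
    sqlite3 :memory:
    .restore main test.db
    .timer on
    -- queries now touch no disk at all, so any remaining variance is CPU/virtualization
    SELECT count(*) FROM items WHERE notes LIKE '%needle%';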
Running the same test on the smallest shared CPU instance in AMS straight off of disk runs fantastically, so I wouldn’t be surprised if it’s particular to the server in LHR.
So just an update to close the thread (from my perspective, at least). Since the app only uses a volume to store an SQLite database, I created a volume in AMS, scaled up, used magic-wormhole to transfer the database, then deleted the LHR volume and scaled down. Works a treat and response times are consistently good (and low). So even when there is a problem, fly.io keeps things agile enough to be able to shuffle things around without it being a huge headache - love it!
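(Roughly the commands involved, for anyone doing the same move; names and volume IDs are illustrative, and exact flyctl syntax may differ by version:)

    # create a replacement volume in AMS (name/size illustrative)
    fly volumes create data --region ams --size 1
    # scale up so a new instance starts attached to the AMS volume
    fly scale count 2
    # on the LHR instance: send the database (prints a one-time code)
    wormhole send /data/app.db
    # on the AMS instance: receive it using that code
    wormhole receive
    # once verified, drop the LHR volume and scale back down
    fly volumes destroy <lhr-volume-id>
    fly scale count 1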
Nice! We identified the issue on two London hosts. VMs there were running an old version of our init: a fair number of them got wedged, and the init itself consumed way more CPU time than is reasonable. We had actually patched the init (probably around April), so once we discovered what was up, it was a quick fix.