We have several applications, deployed in different Fly organizations but all in IAD, that show fairly regular latency spikes on requests to different external hosts. The most curious thing is that many of the spikes start at exactly :20 past the hour and fall within the 15-25 seconds that follow (:20:00 → :20:25).
As mentioned, we’ve noticed this across different applications, in different orgs, reaching different external hosts. In one of our apps that runs on 3 VMs, we see the latency spikes on only 2 of the 3 Fly machines. Our current hunch is therefore an issue with Fly.io networking. The other potential suspect is IAD → AWS us-east-1 networking, since both external hosts are located in AWS us-east-1 (us-east.connect.psdb.cloud and s3.us-east-1.amazonaws.com). We don’t have enough data from other endpoints to know whether it’s limited to us-east-1, though.
In our application that makes regular DB queries to PlanetScale, this is our p95 chart:
Our average p95 is around 100ms, but every 20 mins we see p95 jump to about 1 second. APM traces for this application show that DB queries jump from 5-8ms to 100ms during this period. With about 10 queries per request, that puts total request latency at around 1 second. Edit: Nothing was shipped or changed in this application during the 2-3 hour gap where we didn’t see spikes; they simply stopped occurring. We see similar gaps if we zoom out.
In our application that makes requests to S3, max S3 request time usually stays under 200ms, but we see regular spikes at the :20 mark that jump to over 1 second.
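For anyone who wants to check their own data for the same pattern, here is a minimal sketch of how latency samples could be bucketed by minute-of-hour to see whether spikes cluster at :20. The function names, the `(timestamp, latency_ms)` sample shape, and the 5x-median threshold are all assumptions made for illustration; this is not what our APM tooling does internally.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from statistics import median


def max_latency_by_minute(samples):
    """Group (timestamp, latency_ms) samples by minute-of-hour.

    Returns a dict mapping minute (0-59) to the worst latency seen in it.
    """
    buckets = defaultdict(list)
    for ts, latency_ms in samples:
        buckets[ts.minute].append(latency_ms)
    return {minute: max(values) for minute, values in buckets.items()}


def spike_minutes(samples, factor=5.0):
    """Return the minutes-of-hour whose worst latency exceeds
    `factor` times the overall median latency (hypothetical threshold)."""
    baseline = median(latency_ms for _, latency_ms in samples)
    per_minute = max_latency_by_minute(samples)
    return sorted(m for m, worst in per_minute.items() if worst > factor * baseline)


if __name__ == "__main__":
    # Synthetic data: 3 hours of one sample per minute at 150 ms,
    # with a 1200 ms outlier at :20 of every hour.
    start = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
    samples = []
    for i in range(180):
        ts = start + timedelta(minutes=i)
        samples.append((ts, 1200.0 if ts.minute == 20 else 150.0))
    print(spike_minutes(samples))
```

Feeding this real request timings exported from logs or traces, one list per machine, would also show whether the :20 cluster appears on every VM or only some of them.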
We’ve tried different VM types, but we still see the same spikes. No other charts seem to indicate problems with machine capacity. CPU is <2% on all applications. These are the VM types we’ve tested:
I’m posting this now to see if anyone else is experiencing similar issues before we commit to more thorough debugging. This isn’t a stop-the-world performance issue for us at the moment, but we still want to understand where the problem lies. So far the data seems to point to something within Fly regularly causing severe degradation of network performance at :20 past the hour. Is it possible that some large, network-intensive job is scheduled to run then?