I was hoping to move our stack away from AWS, as I love what Fly is doing and think the tooling is awesome. But after a lot more testing, we’re seeing some pretty large performance regressions that will prevent us from making the change. I’ve upgraded my Fly server and DB to the largest VMs available, but the gap in performance metrics is still too big.
Also, can you tell me a little bit about what that URL is doing, and where you’re calling it from? If you have a public URL you can share, I’m happy to do some testing to see what’s up.
App
Name = prd-gql
Owner = arena
Version = 51
Status = running
Hostname = prd-gql.fly.dev
Instances
ID        PROCESS  VERSION  REGION  DESIRED  STATUS   HEALTH CHECKS       RESTARTS  CREATED
d79f2ca2  app      51       iad     run      running  1 total, 1 passing  0         3h26m ago
Here is the status for the DB
App
Name = prd-gql-db
Owner = arena
Version = 14
Status = running
Hostname = prd-gql-db.fly.dev
Instances
ID        PROCESS  VERSION  REGION  DESIRED  STATUS            HEALTH CHECKS       RESTARTS  CREATED
bca41af3  app      14       iad     run      running (leader)  3 total, 3 passing  0         2h40m ago
Also, I double-checked: we recently upgraded our spot instances on AWS to include c5.4xlarge instances, which have 16 vCPUs and 32 GiB of memory, so they are significantly bigger than the biggest VM instances on Fly. But I was surprised to see ~33% faster response times just by switching from a Fly Postgres DB (in the same region, iad) to our old DB (a db.m5.4xlarge in Ohio).
Here is a request for testing that is averaging ~1.2s on Fly where AWS is ~500ms; perhaps this is what you’d expect given the difference in VM sizes outlined above.
I looked into this a little bit and here’s what I found:
The database instance isn’t being stressed at all
I made the request directly to your VM and it responded in the same time (ruling out our proxy)
There’s no load on your app instance either (the one making queries to the DB app)
Pinging your database from your app instance was fast and reliable (0.5ms)
My current theory is differing PostgreSQL settings. Would it be possible to retrieve your PG settings on RDS? We could then compare them with the ones for your DB hosted with us.
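One way to do that comparison, as a rough sketch: export the settings from each database to CSV (for example with psql's `\copy (SELECT name, setting FROM pg_settings ORDER BY name) TO 'settings.csv' CSV HEADER`) and diff the two files. The helper names below are hypothetical, and the example assumes the export has exactly the `name` and `setting` columns.

```python
import csv
import io


def load_settings(csv_text):
    """Parse CSV text with 'name' and 'setting' columns into a dict."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["name"]: row["setting"] for row in reader}


def diff_settings(a, b):
    """Return {name: (value_in_a, value_in_b)} for every setting that
    differs between the two dicts or exists on only one side (None)."""
    names = sorted(set(a) | set(b))
    return {n: (a.get(n), b.get(n)) for n in names if a.get(n) != b.get(n)}
```

Usage would look like `diff_settings(load_settings(rds_csv), load_settings(fly_csv))`, where the two arguments are the contents of the exported files. Anything that shows up in the result is a candidate explanation for the latency gap.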
A db.m5.4xlarge DB instance has 2x the vCPUs (16) and 4x the memory (64GB) of the Fly.io DB instance you’ve currently configured (8 vCPU, 16GB memory). For a more direct comparison, you’ll want to run a db.m5.2xlarge RDS instance and scale up the Fly.io database to 32GB memory.
Larger vCPU counts typically don’t make much of a difference on a single request (they’re more useful in handling multiple requests concurrently), though they might make a difference if your query plans end up using the parallel query feature.
More memory lets PG fit more pages into its buffer cache, which can speed up read queries if it doesn’t need to fetch as much data from disk.
Durability settings can make a big difference in write-transaction overhead, so make sure that these parameters in particular are similar between deployments (and that they’re appropriate for your workload).
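A minimal sketch of checking just those parameters, assuming you already have each deployment's settings as a name-to-value dict (for example from `SELECT name, setting FROM pg_settings`). The parameter list is my own selection of commonly cited durability-related PostgreSQL settings, not an exhaustive one, and the function name is hypothetical.

```python
# Durability-related PostgreSQL parameters (an illustrative selection).
DURABILITY_PARAMS = (
    "fsync",
    "synchronous_commit",
    "full_page_writes",
    "wal_sync_method",
)


def durability_mismatches(a, b):
    """Return {name: (value_in_a, value_in_b)} for durability parameters
    whose values differ between the two settings dicts."""
    return {
        name: (a.get(name), b.get(name))
        for name in DURABILITY_PARAMS
        if a.get(name) != b.get(name)
    }
```

An empty result means the two deployments agree on these parameters, so any write-latency difference is more likely to come from somewhere else.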
It looks like your app is executing some big database queries that are responsible for most of the request latency. Running EXPLAIN ANALYZE [query] and posting the results would help narrow down the database performance differences on your particular query.
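If you want to run the same statement against both databases from a script, a small helper like this could wrap each query. It is only a sketch: the function name is hypothetical, and adding the BUFFERS option is my suggestion (it reports how much of the data came from the buffer cache). Note that EXPLAIN ANALYZE actually executes the query, so wrap anything that modifies data in a transaction and roll it back.

```python
def explain_analyze(query, buffers=True):
    """Wrap a SQL query in an EXPLAIN (ANALYZE ...) statement.

    The resulting statement runs the query for real when executed,
    so be careful with INSERT/UPDATE/DELETE.
    """
    query = query.strip().rstrip(";")
    options = ["ANALYZE"]
    if buffers:
        options.append("BUFFERS")
    return f"EXPLAIN ({', '.join(options)}) {query};"
```

The returned string can be pasted into psql or passed to whatever database driver the app already uses.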