Pretty Big Performance Downgrade from AWS

I was hoping to move our stack away from AWS, as I love what Fly is doing and think the tooling is awesome. But after doing a lot more testing, we’re seeing some pretty big performance downgrades that will prevent us from making the change. I’ve scaled my Fly server and DB up to the largest VMs available, but there’s still too big a gap between the performance metrics.

Current Request Timing (AWS)

Request with Fly Servers and Fly Postgres in the Same Region

Request with Fly Servers and AWS DB in a Different Region

It would be great to have more options for scaling Postgres DBs, as we’re unable to use it with the current configuration options.

That is surprisingly poor. There’s no real reason you should see a difference in performance between us and AWS.

Will you run fly status -a <app> and fly status -a <db-name> and share the output?

Also can you share the AWS instance specs you’re using? Equivalent VM sizes should perform about the same on Fly.io.

Also can you tell me a little bit about what that URL is doing? And where you’re calling it from? If you have a public URL you can share I’m happy to do some testing to see what’s up.

Of course!

Here is the status for the server:

App
  Name     = prd-gql          
  Owner    = arena            
  Version  = 51               
  Status   = running          
  Hostname = prd-gql.fly.dev  

Instances
ID              PROCESS VERSION REGION  DESIRED STATUS  HEALTH CHECKS           RESTARTS        CREATED   
d79f2ca2        app     51      iad     run     running 1 total, 1 passing      0               3h26m ago

Here is the status for the DB

App
  Name     = prd-gql-db          
  Owner    = arena               
  Version  = 14                  
  Status   = running             
  Hostname = prd-gql-db.fly.dev  

Instances
ID              PROCESS VERSION REGION  DESIRED STATUS                  HEALTH CHECKS           RESTARTS        CREATED   
bca41af3        app     14      iad     run     running (leader)        3 total, 3 passing      0               2h40m ago

Also, I double-checked: we had recently upgraded our spot instances on AWS to include c5.4xlarge instances, which are 16 vCPU and 32 GiB memory, so they are significantly bigger than the biggest VM instances on Fly. But I was surprised to see ~33% faster response times just by switching from a Fly Postgres DB (in the same region, iad) to our old DB (a db.m5.4xlarge in Ohio).

Here is a request for testing that is averaging ~1.2s on Fly where AWS is ~500ms; perhaps that’s what you’d expect given the difference in VM sizes outlined above.

https://gql.arena.gl/?operationName=bracket_cell&variables=%7B%22id%22%3A%22ckvbbj8et775711zzuvl3pejn5%22%7D&extensions=%7B%22persistedQuery%22%3A%7B%22version%22%3A1%2C%22sha256Hash%22%3A%223c23124c88f23e3abc0faabc8f34c0494af7916703b73d9683e4e7868d62c713%22%7D%7D
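
In case it helps reproduce the numbers, here’s roughly how the request can be timed end to end (a sketch; REQUEST_URL stands in for the GraphQL URL above, and the -w fields are curl’s built-in timing variables):

  # Time the request a few times and print per-phase latency.
  # REQUEST_URL is a placeholder for the GraphQL URL above.
  REQUEST_URL='<the GraphQL URL above>'
  for i in 1 2 3 4 5; do
    curl -s -o /dev/null \
      -w "dns=%{time_namelookup}s connect=%{time_connect}s tls=%{time_appconnect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n" \
      "$REQUEST_URL"
  done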

I looked into this a little bit and here’s what I found:

  • The database instance isn’t being stressed at all
  • I made the request directly to your VM and it responded in the same time (ruling out our proxy)
  • There’s also no load on your app instance while it’s making queries to the DB app
  • Pinging your database from your app instance was fast and reliable (0.5ms)
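
If you want to double-check that last point from your side, something like this works (a sketch: it assumes flyctl is installed, uses the app and DB names from your status output, and relies on Fly’s .internal private DNS):

  # Open a shell on the app VM
  fly ssh console -a prd-gql

  # Then, from inside the VM, measure the round-trip time to the database
  # over the private network
  ping -c 5 prd-gql-db.internal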

My current theory is differing PostgreSQL settings. Would it be possible to retrieve your PG settings on RDS and compare them with the ones for your DB hosted on us?
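
Something along these lines should pull the most relevant settings on both sides for comparison (a sketch; the RDS connection details are placeholders and the parameter list is only a starting point):

  # On RDS (replace the placeholders with your endpoint, user, and database):
  psql "host=<rds-endpoint> dbname=<db> user=<user> sslmode=require" -c "
    SELECT name, setting, unit
    FROM pg_settings
    WHERE name IN ('shared_buffers', 'effective_cache_size', 'work_mem',
                   'random_page_cost', 'max_parallel_workers_per_gather');"

  # On Fly.io this opens a psql session against the leader; run the same SELECT there:
  fly postgres connect -a prd-gql-db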


Here are a few more items for you to consider:

  • A db.m5.4xlarge DB instance has 2x the vCPUs (16) and 4x the memory (64GB) of the Fly.io DB instance you’ve currently configured (8 vCPU, 16GB memory). For a more direct comparison, you’ll want to run a db.m5.2xlarge RDS instance and scale up the Fly.io database to 32GB memory.
    • Larger vCPU counts typically don’t make much of a difference on a single request (they’re more useful in handling multiple requests concurrently), though they might make a difference if your query plans end up using the parallel query feature.
    • More memory lets PG fit more pages into its buffer cache, which can speed up read queries if it doesn’t need to fetch as much data from disk.
  • Durability settings (synchronous_commit, fsync, full_page_writes, wal_sync_method) can make a big difference in write-transaction overhead, so make sure that these parameters in particular are similar between deployments (and that they’re appropriate for your workload); see the sketch after this list.
  • It looks like your app is executing some big database queries that are responsible for most of the request latency; running EXPLAIN ANALYZE [query] and posting the results would help narrow down the database performance differences on your particular query (see the sketch below).
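
For the last two points, here’s a rough sketch of both checks ($DATABASE_URL is a placeholder for each database’s connection string, and [query] stands for the slow statement behind the GraphQL request):

  # 1) Dump the write-durability settings on each database and diff the output:
  psql "$DATABASE_URL" -c "
    SELECT name, setting
    FROM pg_settings
    WHERE name IN ('synchronous_commit', 'fsync', 'full_page_writes', 'wal_sync_method');"

  # 2) Profile the slow statement on the Fly.io database; BUFFERS shows how much of
  #    the work was served from the buffer cache vs. read from disk:
  psql "$DATABASE_URL" -c "EXPLAIN (ANALYZE, BUFFERS) [query];"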