I have what I think are network issues for our Postges instances that’s causing our web application to become sluggish.
We’re running a few Postgres instances for our application. We have two “core” instances - one for PR environments and one for production, both in the apps v1 style.
Over the past few weeks, I’ve noticed (but not measured empirically, sadly) the web application in the staging PR environments and production environment being quite slow. Some pages are taking over 18 seconds to load.
While our application has some complexities for write tasks, the read aspect is uninteresting, meaning unless we missed some indexes or have some scary nested loops, I would assume we’d be able to query the database promptly. Locally, it is indeed prompt.
I do the normal steps of checking individual queries with EXPLAIN ANALYZE, checking pgHero, checking our client side code, checking our server side code with bottlenecks and pepper tracing liberally. I find minor areas of improvement but nothing damning.
Progressing from there, I tried executing the SQL statements against the staging Postgres instance using pgAdmin and
fly proxy -a <v1-staging-postgres-appname>, bypassing the web application entirely. There I find the queries are perplexingly slow.
Fairly simple queries in the shape of
SELECT "SchemaName"."TableName"."id", FROM "SchemaName"."TableName" WHERE "SchemaName"."TableName"."organizationId" = ($1) ORDER BY "SchemaName"."TableName"."id" ASC OFFSET $2
Which take 0ms locally takes 260ms (!) remotely. Others taking 14ms locally could be upwards of 4s. These numbers matched what application tracing outputs. This told me the problem is at the database, rather than the server / client layers.
Eventually, out of frustration. I made a new Postgres instance using the v2 app style, and that database was perfectly performant! Essentially local speeds but with what one might expect for UK<>US latency thrown in. That 18s page load is now 1.7s.
As I was typing out a plea for help for how to migrate data between a v1 and v2 style Postgres instance, I searched through a few more threads on this board and found a suggestion that it might be network related, and that it’s worth running the queries on the database machine itself rather than just proxying to it.
fly postgres connect -a <v1-staging-postgres-appname> and run my performance testing queries there. They’re much much better on the box itself, comparable to the v2 instance.
Somewhat relieved that Postgres isn’t magically slow at its core, I am still wondering why it’s slow to go through the v1 app externally, but not v2. Both the v1 and the v2 Postgres instance are in the same region (ord).
I cannot figure out at all how to solve this issue, any guidance would be greatly appreciated.
Thanks in advance and sorry for the rather long post!