Multi-region Postgres latency issues

We recently scaled our Express + Postgres API to a multi-region setup. This was trivial using the fly-replay header - a really amazing feature. At first we saw a dramatic reduction in latency for our end users, but lately we've been seeing odd, sporadic latency issues that seem related to the connection between our Express app and the Postgres database. In the Express app this shows up as a significant increase in HTTP response times and a jump in VM service concurrency.

We haven't pinned down exactly what triggers it - everything will be fine for a day and then all of a sudden response time and concurrency will jump up and stay up. The most recent occurrence seemed to coincide with deploying our API and database to a new region (lhr). The first time it happened, a restart of the database cleared it up immediately, but now restarting the database seems to have no effect.

Attached are some screenshots of the odd behavior as shown by our Express app's metrics, as well as some database logs that seem out of the ordinary. Any ideas what could cause this behavior?

Express app HTTP response times

Scale is from 0 to 10 seconds

Screen Shot 2022-01-07 at 2.33.33 PM

Express app VM concurrency

Scale is from 0 to 10 instances

Screen Shot 2022-01-07 at 2.33.39 PM

Postgres database logs

This line in particular is concerning (our leader is in sea): 2022-01-07T22:34:53.993 app[a2c17040] sea [info] keeper | 2022-01-07 22:34:53.992 UTC [5388] LOG: could not receive data from client: Connection reset by peer

2022-01-07T22:32:38.485 app[ee84bebc] iad [info] sentinel | 2022-01-07T22:32:38.484Z	WARN	cmd/sentinel.go:276	no keeper info available	{"db": "6bca1964", "keeper": "ac203d332"}
2022-01-07T22:32:38.703 app[2cbe489e] dfw [info] keeper   | 2022-01-07T22:32:38.703Z	INFO	cmd/keeper.go:1557	our db requested role is standby	{"followedDB": "60882471"}
2022-01-07T22:32:38.703 app[2cbe489e] dfw [info] keeper   | 2022-01-07T22:32:38.703Z	INFO	cmd/keeper.go:1576	already standby
2022-01-07T22:32:38.728 app[2cbe489e] dfw [info] keeper   | 2022-01-07T22:32:38.727Z	INFO	cmd/keeper.go:1676	postgres parameters not changed
2022-01-07T22:32:38.728 app[2cbe489e] dfw [info] keeper   | 2022-01-07T22:32:38.728Z	INFO	cmd/keeper.go:1703	postgres hba entries not changed
2022-01-07T22:32:38.794 app[f17acfc0] lhr [info] keeper   | 2022-01-07T22:32:38.793Z	INFO	cmd/keeper.go:1557	our db requested role is standby	{"followedDB": "60882471"}
2022-01-07T22:32:38.794 app[f17acfc0] lhr [info] keeper   | 2022-01-07T22:32:38.793Z	INFO	cmd/keeper.go:1576	already standby
2022-01-07T22:32:38.808 app[f17acfc0] lhr [info] keeper   | 2022-01-07T22:32:38.808Z	INFO	cmd/keeper.go:1676	postgres parameters not changed
2022-01-07T22:32:38.808 app[f17acfc0] lhr [info] keeper   | 2022-01-07T22:32:38.808Z	INFO	cmd/keeper.go:1703	postgres hba entries not changed

and

2022-01-07T22:34:53.838 app[ee84bebc] iad [info] keeper   | 2022-01-07T22:34:53.838Z	INFO	cmd/keeper.go:1557	our db requested role is standby	{"followedDB": "60882471"}
2022-01-07T22:34:53.838 app[ee84bebc] iad [info] keeper   | 2022-01-07T22:34:53.838Z	INFO	cmd/keeper.go:1576	already standby
2022-01-07T22:34:53.856 app[ee84bebc] iad [info] keeper   | 2022-01-07T22:34:53.856Z	INFO	cmd/keeper.go:1676	postgres parameters not changed
2022-01-07T22:34:53.856 app[ee84bebc] iad [info] keeper   | 2022-01-07T22:34:53.856Z	INFO	cmd/keeper.go:1703	postgres hba entries not changed
2022-01-07T22:34:53.993 app[a2c17040] sea [info] keeper   | 2022-01-07 22:34:53.992 UTC [5388] LOG:  could not receive data from client: Connection reset by peer
2022-01-07T22:34:54.126 app[a2c17040] sea [info] keeper   | 2022-01-07 22:34:54.123 UTC [5389] LOG:  PID 5388 in cancel request did not match any process

Will you have a look at your DATABASE_URL and see what hostname it’s using?

<db-name>.internal will give you a database instance in a random location. If that's what's in the URL, and you're using it as is, it might be the source of the problem.

top1.nearest.of.<db-name>.internal will resolve to the IP of the nearest instance.
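For illustration, pinning the hostname to a specific region can be sketched like this (a minimal sketch using Node's built-in `URL`; the `my-db` name and credentials are hypothetical):

```typescript
// Rewrite <db-name>.internal to <region>.<db-name>.internal so DNS
// resolves to the instance in that specific region rather than a
// random one.
const toRegionalUrl = (databaseUrl: string, region: string): string => {
  const url = new URL(databaseUrl)
  url.hostname = `${region}.${url.hostname}`
  return url.toString()
}

// e.g. toRegionalUrl("postgres://app:secret@my-db.internal:5432/app", "lhr")
```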

How are you handling database connections for regions? Are you building a per-region connection string for the read regions?


We use Prisma, here’s the code I’m using to instantiate the client (adapted from this very helpful guide):

import { PrismaClient } from "@prisma/client"
import UrlParse from "url-parse"
// `config` is our app-level environment config (FLY_REGION, DATABASE_URL, etc.)
import { config } from "./config"

let client: PrismaClient | undefined

/**
 * Return true if the API is running in a different Fly region than the primary
 * database
 */
export const notInPrimaryRegion = (): boolean => {
  return config.FLY_REGION && config.DATABASE_PRIMARY_REGION
    ? config.FLY_REGION !== config.DATABASE_PRIMARY_REGION
    : false
}

const getClient = (): PrismaClient => {
  if (!client) {
    // If we're not in the primary region we need to connect explicitly to the
    // regional database instance to avoid all database calls going
    // cross-region.
    let databaseUrlOverride: string | null = null
    if (notInPrimaryRegion()) {
      const databaseUrl = new UrlParse(config.DATABASE_URL)
      databaseUrl.set(
        "hostname",
        config.FLY_REGION + "." + databaseUrl.hostname,
      )
      databaseUrl.set("port", "5433")
      databaseUrlOverride = databaseUrl.toString()
    }

    // Create the Prisma client
    client = new PrismaClient(
      databaseUrlOverride
        ? { datasources: { db: { url: databaseUrlOverride } } }
        : undefined,
    )
  }
  return client
}

I'll double-check that the env variables are set correctly on all our instances, but since this code was working correctly, and we saw the dramatic shift in latency without any deployments, I'd be surprised if this code were the culprit.

We're also not seeing even a blip of SQL read-only errors in our logs, so things seem to be working, just very slowly. I know the error reporting is working correctly because I got this wrong the first time around :wink:

I think what's happening is that in the primary region your app is using the plain hostname, which resolves to a random instance - possibly outside the primary region. What you want in the primary region is something like:

postgres://<region>.<db-name>.internal:5432

This is the same logic as for the replica regions, but with port 5432 instead of port 5433. Something like:

const databaseUrl = new UrlParse(config.DATABASE_URL)
databaseUrl.set(
  "hostname",
  config.FLY_REGION + "." + databaseUrl.hostname,
)
// Replicas listen on 5433; the primary keeps the default 5432.
if (notInPrimaryRegion()) {
  databaseUrl.set("port", "5433")
}
const databaseUrlOverride = databaseUrl.toString()
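Put together as a single helper (a sketch using Node's built-in `URL` instead of `url-parse`; all names are illustrative), the hostname and port selection might look like:

```typescript
// Build the per-region connection string: every region connects to the
// region-qualified hostname; the primary uses port 5432, replicas 5433.
const regionalDatabaseUrl = (
  baseUrl: string,
  flyRegion: string,
  primaryRegion: string,
): string => {
  const url = new URL(baseUrl)
  url.hostname = `${flyRegion}.${url.hostname}`
  url.port = flyRegion === primaryRegion ? "5432" : "5433"
  return url.toString()
}
```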

We need to update our docs!

Beautiful, thank you!

Screen Shot 2022-01-07 at 4.42.08 PM
