We recently scaled our Express + Postgres API to a multi-region setup. Setting this up was trivial using the fly-replay header - really amazing feature. At first we saw a dramatic reduction in latency for our end users, but lately we've been seeing some odd, sporadic latency issues that seem to be related to the connection between our Express app and the Postgres database. In the Express app this shows up as a significant increase in HTTP response times and a jump in VM service concurrency.

We haven't pinned down exactly what triggers it - everything will be fine for a day, and then all of a sudden response times and concurrency will jump up and stay up. The most recent occurrence seemed to coincide with deploying our API and database to a new region (lhr). The first time it happened, a restart of the database cleared it up immediately, but now restarting the database has no effect.

Attached are some screenshots of the odd behavior as shown by our Express app's metrics, as well as some database logs that seem out of the ordinary. Any ideas what could cause this behavior?
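For context, the write-forwarding in our Express app looks roughly like this - a simplified sketch, not the exact code; PRIMARY_REGION is an env var we set ourselves, and the middleware shape is approximate:

```js
const express = require("express");
const app = express();

// FLY_REGION is set by Fly on each VM; PRIMARY_REGION is set by us (e.g. "sea").
const PRIMARY_REGION = process.env.PRIMARY_REGION;
const CURRENT_REGION = process.env.FLY_REGION;

app.use((req, res, next) => {
  const isWrite = !["GET", "HEAD", "OPTIONS"].includes(req.method);

  // If this is a write and we're running in a replica region,
  // ask the Fly proxy to replay the request in the primary region.
  if (isWrite && PRIMARY_REGION && CURRENT_REGION !== PRIMARY_REGION) {
    res.set("fly-replay", `region=${PRIMARY_REGION}`);
    return res.sendStatus(409);
  }

  next();
});
```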
Express app HTTP response times (scale is from 0 to 10 seconds):
Express app VM service concurrency (scale is from 0 to 10 instances):
Postgres database logs
This line in particular is concerning (our leader is in sea):
2022-01-07T22:34:53.993 app[a2c17040] sea [info] keeper | 2022-01-07 22:34:53.992 UTC [5388] LOG: could not receive data from client: Connection reset by peer
2022-01-07T22:32:38.485 app[ee84bebc] iad [info] sentinel | 2022-01-07T22:32:38.484Z WARN cmd/sentinel.go:276 no keeper info available {"db": "6bca1964", "keeper": "ac203d332"}
2022-01-07T22:32:38.703 app[2cbe489e] dfw [info] keeper | 2022-01-07T22:32:38.703Z INFO cmd/keeper.go:1557 our db requested role is standby {"followedDB": "60882471"}
2022-01-07T22:32:38.703 app[2cbe489e] dfw [info] keeper | 2022-01-07T22:32:38.703Z INFO cmd/keeper.go:1576 already standby
2022-01-07T22:32:38.728 app[2cbe489e] dfw [info] keeper | 2022-01-07T22:32:38.727Z INFO cmd/keeper.go:1676 postgres parameters not changed
2022-01-07T22:32:38.728 app[2cbe489e] dfw [info] keeper | 2022-01-07T22:32:38.728Z INFO cmd/keeper.go:1703 postgres hba entries not changed
2022-01-07T22:32:38.794 app[f17acfc0] lhr [info] keeper | 2022-01-07T22:32:38.793Z INFO cmd/keeper.go:1557 our db requested role is standby {"followedDB": "60882471"}
2022-01-07T22:32:38.794 app[f17acfc0] lhr [info] keeper | 2022-01-07T22:32:38.793Z INFO cmd/keeper.go:1576 already standby
2022-01-07T22:32:38.808 app[f17acfc0] lhr [info] keeper | 2022-01-07T22:32:38.808Z INFO cmd/keeper.go:1676 postgres parameters not changed
2022-01-07T22:32:38.808 app[f17acfc0] lhr [info] keeper | 2022-01-07T22:32:38.808Z INFO cmd/keeper.go:1703 postgres hba entries not changed
and
2022-01-07T22:34:53.838 app[ee84bebc] iad [info] keeper | 2022-01-07T22:34:53.838Z INFO cmd/keeper.go:1557 our db requested role is standby {"followedDB": "60882471"}
2022-01-07T22:34:53.838 app[ee84bebc] iad [info] keeper | 2022-01-07T22:34:53.838Z INFO cmd/keeper.go:1576 already standby
2022-01-07T22:34:53.856 app[ee84bebc] iad [info] keeper | 2022-01-07T22:34:53.856Z INFO cmd/keeper.go:1676 postgres parameters not changed
2022-01-07T22:34:53.856 app[ee84bebc] iad [info] keeper | 2022-01-07T22:34:53.856Z INFO cmd/keeper.go:1703 postgres hba entries not changed
2022-01-07T22:34:53.993 app[a2c17040] sea [info] keeper | 2022-01-07 22:34:53.992 UTC [5388] LOG: could not receive data from client: Connection reset by peer
2022-01-07T22:34:54.126 app[a2c17040] sea [info] keeper | 2022-01-07 22:34:54.123 UTC [5389] LOG: PID 5388 in cancel request did not match any process
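In case it's relevant, here's roughly how the app connects to Postgres - a trimmed-down sketch using node-postgres; the pool size and timeout values shown are illustrative rather than our exact settings:

```js
const { Pool } = require("pg");

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 10,                        // connections per VM (illustrative)
  idleTimeoutMillis: 30000,       // drop idle clients after 30s
  connectionTimeoutMillis: 5000,  // fail fast instead of queueing forever
  keepAlive: true,                // TCP keepalive on pooled connections
});

// Surface background errors on idle clients (e.g. connection resets)
// instead of letting them crash the process.
pool.on("error", (err) => {
  console.error("idle postgres client error", err);
});

module.exports = pool;
```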