At this point it would be really helpful to start profiling your application to understand where the request time is being spent. The only possible network issue from your description of the code would be app to database, but given they are in the same region, it is unlikely to be causing such a large backup.
Can you define “many change events at once”? Are we talking dozens coming in at the same time? Hundreds? Thousands? If the database pool is backing up because you have a thundering herd problem, that would be helpful to pinpoint, but profiling Ecto would tell us where the time is being spent. If all you’re really doing is a couple of reads and a conditional write, then the rest of the Phoenix stack wouldn’t be blocking on anything.

Having said that, please rule out any other plugs that could be talking to an external service or hitting a single-process bottleneck. Again, Ecto profiling would help us rule that out if we see Ecto is where the time is being spent. Monitoring the VM processes ordered by message queue size would also rule out a single-process bottleneck in your system. The LiveDashboard shows this, but only if you happen to catch it (all data is ephemeral).
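For example, something along these lines (just a rough sketch; run it in a remote IEx shell on the instance while the requests are backing up) would show the busiest mailboxes:

```elixir
# Rough sketch: top 5 processes by message queue length.
Process.list()
|> Enum.map(fn pid ->
  case Process.info(pid, [:message_queue_len, :registered_name]) do
    nil -> {pid, 0, []}
    info -> {pid, info[:message_queue_len], info[:registered_name]}
  end
end)
|> Enum.sort_by(fn {_pid, queue_len, _name} -> queue_len end, :desc)
|> Enum.take(5)
```

If a single process shows a consistently large queue, that is your bottleneck.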
You can see the telemetry events that Ecto dispatches here:
https://hexdocs.pm/ecto/Ecto.Repo.html#module-adapter-specific-events
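For instance, a minimal handler along these lines would log slow queries (the event name assumes the default `:telemetry_prefix` for a repo called `MyApp.Repo`, and the 100ms threshold is arbitrary):

```elixir
defmodule MyApp.EctoQueryLogger do
  require Logger

  # Call MyApp.EctoQueryLogger.attach() once at boot, e.g. in Application.start/2.
  def attach do
    :telemetry.attach(
      "log-slow-queries",
      [:my_app, :repo, :query],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event(_event, measurements, metadata, _config) do
    total_ms = System.convert_time_unit(measurements.total_time, :native, :millisecond)

    if total_ms > 100 do
      Logger.warning("slow query (#{total_ms}ms): #{metadata.query}")
    end
  end
end
```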
You could also wrap whatever your main operations are in spans and log the times:
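Something roughly like this (module and event names are made up) wraps an operation in `:telemetry.span/3` and logs the duration reported by the `:stop` event:

```elixir
defmodule MyApp.ChangeHandler do
  def handle_change(event) do
    # Emits [:my_app, :handle_change, :start] and [..., :stop] events,
    # the latter with a :duration measurement in native time units.
    :telemetry.span([:my_app, :handle_change], %{}, fn ->
      result = do_handle_change(event)
      {result, %{}}
    end)
  end

  defp do_handle_change(_event), do: :ok
end

defmodule MyApp.SpanLogger do
  require Logger

  def attach do
    :telemetry.attach(
      "log-handle-change",
      [:my_app, :handle_change, :stop],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event(_event, %{duration: duration}, _metadata, _config) do
    ms = System.convert_time_unit(duration, :native, :millisecond)
    Logger.info("handle_change took #{ms}ms")
  end
end
```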
More information will help narrow down what is happening.
Thank you @chrismccord.
Actually, I just started running ab against a URL which does not even touch the database (and just redirects, as there is no such route): ab -n 100 -c 10 http://127.0.0.1:4000/myhandler
Locally it’s not a problem at all:
Concurrency Level: 10
Time taken for tests: 3.192 seconds
Complete requests: 100
Failed requests: 0
Non-2xx responses: 100
Total transferred: 62100 bytes
HTML transferred: 15600 bytes
Requests per second: 31.33 [#/sec] (mean)
Time per request: 319.229 [ms] (mean)
Time per request: 31.923 [ms] (mean, across all concurrent requests)
Transfer rate: 19.00 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0    0.2      0      1
Processing:   170  299   52.2    287    442
Waiting:      168  298   52.2    287    442
Total:        170  299   52.1    288    442
But on Fly with three instances:
Concurrency Level: 10
Time taken for tests: 17.385 seconds
Complete requests: 100
Failed requests: 0
Non-2xx responses: 100
Total transferred: 60200 bytes
HTML transferred: 7800 bytes
Requests per second: 5.75 [#/sec] (mean)
Time per request: 1738.490 [ms] (mean)
Time per request: 173.849 [ms] (mean, across all concurrent requests)
Transfer rate: 3.38 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       68  122   57.4     95    289
Processing:    23 1561 1542.3   1750   4398
Waiting:       22 1560 1542.5   1749   4398
Total:         94 1683 1532.3   1859   4516
I’ll check what I can find out on the machine itself and in the LiveDashboard, and, as you suggested, which other plugs are involved.
Found it. It was the log backend for LogDNA (ex_logdna). It looks like every log line was being sent synchronously.
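A quick way to see this kind of thing (rough check, and whether a single call blocks depends on the backend and on Logger’s sync/async mode) is to time one log call from a remote IEx shell; with a blocking backend it takes milliseconds rather than microseconds:

```elixir
require Logger

{micros, :ok} = :timer.tc(fn -> Logger.info("latency check") end)
IO.puts("Logger.info took #{micros}µs")
```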
So yeah, it looks like I made some wrong assumptions while debugging this. Thanks a lot for your help!