Trying to scale Phoenix Channels above 16,000 connections

:wave:

I’ve been trying for a while to load test websockets (Phoenix Channels) to see how much load my code can handle. The problem is that I stop being able to connect after a certain number of websockets (around 16,000). There are no errors on the server, I just start getting nxdomain errors locally. I’ve tried a number of things that I’ll describe below, but I just wanted to check if there is some sort of hard-coded limit on the fly.io machines / load balancer.

I’ve set up the server locally to run with MIX_ENV=prod and was able to get above 22k connections before I shut it down.

My setup:

  • fly.io app which has a Phoenix channel (shared-cpu-2x, 4096 MB memory)
  • local app which connects using the slipstream library (which uses mint) to create client channel connections. I first make an HTTP request (using HTTPoison) to create a “driver” record, and then I set up the channel for that driver ID (a rough sketch of this client is below the list). Interestingly, these HTTP requests seem to be what always return the nxdomain errors.
  • Connecting to the server through WireGuard (the HTTP server isn’t exposed to the public internet)
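For context, each simulated driver looks roughly like this. This is a minimal sketch rather than my exact code: the module name, URLs, payload, and topic are made up for illustration, and it assumes Jason for JSON.

defmodule LoadTest.DriverClient do
  @moduledoc "One simulated driver: create the record over HTTP, then hold a channel open."
  use Slipstream

  # Placeholder URLs; the real requests go over the WireGuard address, not the public internet.
  @api_url "http://my-app.internal:4000/api/drivers"
  @socket_url "ws://my-app.internal:4000/socket/websocket"

  def start_link(opts) do
    Slipstream.start_link(__MODULE__, opts)
  end

  @impl Slipstream
  def init(_opts) do
    # Step 1: plain HTTP request to create the "driver" record.
    # These are the requests that start returning nxdomain around 16k connections.
    {:ok, %HTTPoison.Response{status_code: 201, body: body}} =
      HTTPoison.post(@api_url, Jason.encode!(%{}), [{"content-type", "application/json"}])

    %{"id" => driver_id} = Jason.decode!(body)

    # Step 2: open the websocket; the channel is joined once the socket connects.
    socket = connect!(uri: @socket_url)
    {:ok, assign(socket, :driver_id, driver_id)}
  end

  @impl Slipstream
  def handle_connect(socket) do
    {:ok, join(socket, "driver:#{socket.assigns.driver_id}")}
  end
end

So every simulated driver holds one websocket plus one short-lived HTTP request.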

For all of the things below, keep in mind that this is an app for testing a concept and I’m not planning on running it in real production, so I think it should be OK to have these ridiculous limits in this case.

I’ve set high ulimit values (I got some feedback in this thread on how to set them for my user). This is my server start script, which runs ulimit -aH / -aS so I can verify that the values are set:

#!/bin/bash
ulimit -n 900000
ulimit -i 500000
ulimit -u 500000
ulimit -s 16384
ulimit -aH
ulimit -aS
cd -P -- "$(dirname -- "$0")"
PHX_SERVER=true exec ./my_app start
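Separately from the shell ulimits, the BEAM has its own process and port limits, which can be sanity-checked from the release’s remote console (./my_app remote). Just a quick sketch; the +P/+Q emulator flags in vm.args are what would change these:

# Run inside the release's remote console:
:erlang.system_info(:process_limit)  # max Erlang processes (the +P emulator flag)
:erlang.system_info(:port_limit)     # max ports, i.e. sockets/files (the +Q emulator flag)
:erlang.system_info(:process_count)  # processes currently alive
:erlang.system_info(:port_count)     # ports currently open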

I have the following in my fly.toml:

[services.concurrency]
  hard_limit = 100000
  soft_limit = 100000

[http_service.concurrency]
  hard_limit = 100000
  soft_limit = 100000

I’ve also set high ulimit values locally for the application that is connecting with slipstream:

(base) ➜  my_app git:(main) ✗ ulimit -aH
-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          unlimited
-s: stack size (kbytes)             65520
-c: core file size (blocks)         unlimited
-v: address space (kbytes)          unlimited
-l: locked-in-memory size (kbytes)  unlimited
-u: processes                       5333
-n: file descriptors                unlimited

(base) ➜  my_app git:(main) ✗ ulimit -aS
-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          unlimited
-s: stack size (kbytes)             8176
-c: core file size (blocks)         0
-v: address space (kbytes)          unlimited
-l: locked-in-memory size (kbytes)  unlimited
-u: processes                       5333
-n: file descriptors                1000000

I’ve tried both cowboy and bandit (currently using bandit) and both fail around 16k connections (interestingly, cowboy seems to fail a few hundred below 16k and bandit a couple of hundred above 16k).

I’ve looked at configuration options for cowboy, bandit, slipstream, mint, and Phoenix and tried various things that looked like they might work, but no luck.
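To give an idea of what I mean, the sort of adapter options I was experimenting with looked roughly like this. This is only a sketch: the option names come from the Plug.Cowboy/Ranch and Bandit/Thousand Island docs, the app/endpoint names and values are placeholders, and I’m not claiming it’s a correct or complete config.

import Config

# Cowboy adapter: connection limits live in Ranch's transport options.
config :my_app, MyAppWeb.Endpoint,
  http: [
    port: 4000,
    transport_options: [num_acceptors: 100, max_connections: 100_000]
  ]

# Bandit adapter: the equivalent knobs are Thousand Island options.
# (Only one adapter would actually be configured at a time, of course.)
config :my_app, MyAppWeb.Endpoint,
  adapter: Bandit.PhoenixAdapter,
  http: [
    port: 4000,
    thousand_island_options: [num_acceptors: 100, num_connections: 100_000]
  ]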

In LiveDashboard I don’t see any processes with long message queues…
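(The same check from a remote console, in case anyone wants to reproduce it without LiveDashboard — just a rough snippet:)

# Top 10 processes by message queue length:
Process.list()
|> Enum.map(&{&1, Process.info(&1, :message_queue_len)})
|> Enum.flat_map(fn
  {pid, {:message_queue_len, len}} -> [{pid, len}]
  {_pid, nil} -> []
end)
|> Enum.sort_by(fn {_pid, len} -> len end, :desc)
|> Enum.take(10)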

Before things fall over, the server generally gets up to around 2.4 GB of the 4 GB of memory I’ve allocated.

Would love any help :smile:
