EDIT: I've posted a minimal reproduction repository and logs, which further suggest this is an issue with the proxy itself: Traffic (still) routed to instances not passing health check - #7 by sheerlox
Related post from 9 months ago: Traffic routed to instances not passing health check.
During startup, my Elixir API loads and processes a significant chunk of data, which takes a little more than 2 seconds on 4x shared CPU machines.
The health check function verifies that this processing has completed, which should prevent the five errors at the beginning of the logs below.
I’ve added a log line in my backend when the health check passes, to rule out an issue on my end (it can be seen on the last line).
[d8dd963b929298] Running UnillWeb.Endpoint with Bandit 1.6.7 at :::8080 (http)
[d8dd963b929298] ** (ArgumentError) errors were found at the given arguments: * 1st argument: the table identifier does not refer to an existing ETS table (stdlib 6.2) :ets.lookup(nil, " a")
[d8dd963b929298] ** (ArgumentError) errors were found at the given arguments: * 1st argument: the table identifier does not refer to an existing ETS table (stdlib 6.2) :ets.lookup(nil, " n")
[d8dd963b929298] ** (ArgumentError) errors were found at the given arguments: * 1st argument: the table identifier does not refer to an existing ETS table (stdlib 6.2) :ets.lookup(nil, " x")
[d8dd963b929298] ** (ArgumentError) errors were found at the given arguments: * 1st argument: the table identifier does not refer to an existing ETS table (stdlib 6.2) :ets.lookup(nil, " o")
[d8dd963b929298] ** (ArgumentError) errors were found at the given arguments: * 1st argument: the table identifier does not refer to an existing ETS table (stdlib 6.2) :ets.lookup(nil, " p")
[d8dd963b929298] Loaded 17769 entities from database: - 6730 skills - 6730 translations - 4309 variations
[d8dd963b929298] Built and persisted entity map for 17769 entities.
[d8dd963b929298] Healthcheck failed: {:ets_table_not_found, :inverted_index}
[d8dd963b929298] Healthcheck failed: {:ets_table_not_found, :inverted_index}
[d8dd963b929298] Built and persisted inverted index for 5611 ngrams.
[d8dd963b929298] Healthcheck passed
I would expect the first requests to be health checks, and no request to be forwarded to the instance while they aren’t passing, as mentioned in Healthcheck-based Routing:
- the instance is still busy starting up and is not ready to serve requests
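To make that expectation concrete, here is a minimal sketch of how it could be checked against the logs. It assumes each line has the `[machine_id] message` shape shown above, that failed requests surface as lines starting with `** (`, and that the first `Healthcheck passed` line marks readiness. The function name is hypothetical and not part of my app or of any Fly tooling.

```python
import re

# Assumed log shape: "[<machine id>] <message>", as in the excerpt above.
LINE_RE = re.compile(r"^\[(?P<machine>[0-9a-f]+)\]\s+(?P<msg>.*)$")


def errors_before_first_pass(lines):
    """Return, per machine, the error lines logged before its first passing health check."""
    passed = set()  # machines that have already logged "Healthcheck passed"
    early = {}      # machine id -> error messages seen before that point
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not follow the assumed shape
        machine, msg = match["machine"], match["msg"]
        if msg.startswith("Healthcheck passed"):
            passed.add(machine)
        elif machine not in passed and msg.startswith("** ("):
            early.setdefault(machine, []).append(msg)
    return early
```

If routing honored the health check, I would expect this to return an empty dict for every newly started machine. Run on the log excerpt above, it would instead collect the `ArgumentError` lines under `d8dd963b929298`.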
Here’s the relevant part of my config:
[http_service]
[[http_service.checks]]
interval = '5s'
timeout = '1s'
grace_period = '10s'
method = 'GET'
path = '/api/health_check'
Have there been any advances on this issue since then? Or maybe something I’ve missed in the configuration?
Thanks in advance!
Some more context
This is happening during an autoscaling operation while load testing my deployment. I only use a soft_limit and no hard_limit:
[http_service.concurrency]
type = 'requests'
soft_limit = 1
[[vm]]
memory = '1024mb'
cpu_kind = 'shared'
cpus = 4
The soft_limit is so low because the API calls are quite CPU-intensive and I’m trying to stay within the CPU baseline even under load, preserving the CPU balance to handle requests while the new machines start. This makes autoscaling kick in at about 8-10 req/sec.
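As a rough sanity check on that threshold, here is a back-of-the-envelope sketch based on Little's law (average concurrency ≈ arrival rate × mean latency). It assumes traffic is spread evenly across machines and that scaling is triggered once every machine is at its soft_limit. The function name and the latency values are hypothetical and not measured from my deployment.

```python
def rate_at_soft_limit(soft_limit: int, machines: int, mean_latency_s: float) -> float:
    """Estimate the request rate (req/sec) at which every machine sits at its soft limit.

    Little's law: concurrency = rate * latency, so rate = concurrency / latency.
    """
    if mean_latency_s <= 0:
        raise ValueError("mean_latency_s must be positive")
    total_concurrency = soft_limit * machines
    return total_concurrency / mean_latency_s


# Hypothetical numbers: one machine, soft_limit = 1, 100 ms mean latency.
print(rate_at_soft_limit(soft_limit=1, machines=1, mean_latency_s=0.1))
```

Under these assumptions, a threshold of 8-10 req/sec on a single machine would correspond to a mean request latency of roughly 100-125 ms.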