When does the timer for a health check grace_period start? How does that interact with the interval?

stale-snow-white-sco · June 20, 2022, 10:48pm

First off, because I know this is probably a support headache for staff: I have a working health check. If I set the grace period to 100s, it works like magic. What I’d like to do is tune that 100s down a bit. I searched the docs and didn’t find the definitive answer I’m looking for.

I set my grace_period to 10s because by my count the app usually boots in 5s.

In the snippet below, the container boots at 22:34:21.734 and the app begins listening at 22:34:26.631, about 5 seconds later. It gets reaped roughly 2.4s after that.

Am I misunderstanding when the clock starts? I had assumed it was at “Starting virtual machine”, but is it possible that it starts at “kernel init” or even “Starting instance”? The docs simply say “The time to wait after a VM starts before checking its health.”

I don’t think it matters, but here’s a log snippet:

2022-06-20T22:33:23.633 runner[bc202e43] sea [info] Starting instance
2022-06-20T22:33:25.253 runner[bc202e43] sea [info] Configuring virtual machine
2022-06-20T22:33:25.254 runner[bc202e43] sea [info] Pulling container image
2022-06-20T22:33:41.143 runner[bc202e43] sea [info] Unpacking image
2022-06-20T22:34:19.907 runner[bc202e43] sea [info] Preparing kernel init
2022-06-20T22:34:21.657 runner[bc202e43] sea [info] Configuring firecracker
2022-06-20T22:34:21.734 runner[bc202e43] sea [info] Starting virtual machine
2022-06-20T22:34:21.960 app[bc202e43] sea [info] Starting init (commit: e21acb3)...
2022-06-20T22:34:21.999 app[bc202e43] sea [info] Preparing to run: `./entrypoint.sh` as root
2022-06-20T22:34:22.032 app[bc202e43] sea [info] 2022/06/20 22:34:22 listening on [....]:22 (DNS: [....]:53)
2022-06-20T22:34:22.543 app[bc202e43] sea [info] yarn run v1.22.19
2022-06-20T22:34:22.576 app[bc202e43] sea [info] $ run-migrate-script
2022-06-20T22:34:25.267 app[bc202e43] sea [info] No pending migrations to apply.
2022-06-20T22:34:25.336 app[bc202e43] sea [info] Done in 2.80s.
2022-06-20T22:34:26.589 app[bc202e43] sea [info] [App] 515 - 06/20/2022, 10:34:26 PM LOG [NestFactory] Starting App...
2022-06-20T22:34:26.631 app[bc202e43] sea [info] Listening on port 8080
2022-06-20T22:34:29.020 app[bc202e43] sea [info] Reaped child process with pid: 639, exit code: 0

kurt · June 20, 2022, 11:59pm

The clock actually starts at “Starting instance”. This is a quirk of Nomad. In your log, pulling and unpacking the image takes close to a minute (22:33:25 to 22:34:19), and all of that counts against the grace period.

The docs are not explicit enough about this; it’s very confusing.
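
To put numbers on that, here is a rough sketch that measures the gaps using timestamps copied from the log in the first post. The `LOG_EVENTS` dictionary and the `seconds_between` helper are just illustrative names, not anything the platform provides.

```python
from datetime import datetime

# Timestamps copied from the log snippet in the first post.
LOG_EVENTS = {
    "Starting instance": "2022-06-20T22:33:23.633",
    "Starting virtual machine": "2022-06-20T22:34:21.734",
    "Listening on port 8080": "2022-06-20T22:34:26.631",
}


def seconds_between(start_label, end_label, events=LOG_EVENTS):
    """Return the elapsed seconds between two labelled log events."""
    start = datetime.fromisoformat(events[start_label])
    end = datetime.fromisoformat(events[end_label])
    return (end - start).total_seconds()


# Time spent before the VM even starts (image pull, unpack, VM setup).
platform_overhead = seconds_between("Starting instance", "Starting virtual machine")
# Time the app itself needs once the VM is running.
app_boot = seconds_between("Starting virtual machine", "Listening on port 8080")
# What the grace period has to cover if the clock starts at "Starting instance".
total = seconds_between("Starting instance", "Listening on port 8080")

print(f"before VM start: {platform_overhead:.1f}s")
print(f"app boot: {app_boot:.1f}s")
print(f"total from 'Starting instance': {total:.1f}s")
```

For this deploy that works out to about 58s before the VM starts, about 5s of app boot, and about 63s in total, so a 10s grace period had run out long before the app was listening.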

stale-snow-white-sco · June 21, 2022, 1:10am

Thank you Kurt. That makes it difficult to know how to predict what a sane grace period is, but it’s good to know regardless. There seems to be a lot of variance in the part of the clock I don’t have any control over.

If I set this high, I gather that it won’t allow a health check to pass early and therefore put the instance into service early?

kurt · June 21, 2022, 1:49am

The grace period actually specifies how long to wait before running the first check. Setting it high will keep the VM from getting restarted too early.
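
As a rough illustration of how that plays out with the interval, here is a small sketch. It assumes the first check runs at the grace period (measured from “Starting instance”) and later checks follow every interval; that scheduling model and the 15s interval are assumptions for the example, not something confirmed in this thread. The 63s ready time is approximated from the log above.

```python
def checks_before_ready(grace_period_s, interval_s, ready_at_s):
    """List the offsets (seconds after 'Starting instance') of checks that
    would run before the app is listening.

    Assumed model: first check at grace_period_s, then one every interval_s.
    """
    times = []
    t = grace_period_s
    while t < ready_at_s:
        times.append(t)
        t += interval_s
    return times


READY_AT_S = 63.0               # approx. 'Starting instance' -> 'Listening on port 8080'
HYPOTHETICAL_INTERVAL_S = 15.0  # made-up interval, purely for illustration

# Grace period of 10s: several checks land before the app is up.
print(checks_before_ready(10.0, HYPOTHETICAL_INTERVAL_S, READY_AT_S))
# Grace period of 100s: no check runs until well after the app is up.
print(checks_before_ready(100.0, HYPOTHETICAL_INTERVAL_S, READY_AT_S))
```

Under those assumptions a 10s grace period gives four checks that hit a port nobody is listening on yet, while 100s gives none.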

How big does the image say it is when you deploy? Optimizing the image size will help, ~1 min to pull makes it sound quite large.

stale-snow-white-sco · June 21, 2022, 2:44pm

It is, yes. There’s some bug with the Alpine distribution at the moment so I’m shipping an entire Ubuntu. Something around a gig.