App instances repeatedly going critical without obvious reasons why

Howdy;
I’m running a variation of the nginx CDN server Kurt wrote about several months ago. It had been running well until the last few weeks (as near as I can figure), but now I have instances going from passing to critical over a relatively short span (hours or days) until all of them are down and the service is unresponsive.

I’d love some help diagnosing to see if it’s me or if something has changed in the system.

The app is ephemera, and I’ve done 3 restarts since Thu/Fri of last week. It needs another restart, I believe, but I’m leaving it as-is for diagnostics.

Thanks for any help or suggestions!

  • andrew

This might be because of an init bug related to script checks. Will you try commenting out your script check and then redeploying to see if that helps?

Thanks Kurt;
I’m re-deploying with the script_check commented out.

Did I read a thread that said script_checks are going away entirely? If so, what would you recommend as a clean/simple way to achieve the equivalent functionality (in this case, regularly updating the list of other app instances for the sake of upstream shards)?

That is a good question. Script checks aren’t going away anytime soon, but that bug makes them way less functional.

For what you’re doing in particular, it’s possible to make nginx just do the 6pn address resolution itself. The way I did it was contorted because I didn’t know I could do it any other way. Then @julia figured it out: “Day 53: a little nginx, IPv6, and wireguard”
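
The gist: point nginx at Fly’s internal DNS with a resolver directive and use a variable in proxy_pass, so the .internal name gets re-resolved at request time instead of only once at startup. A rough sketch, not drop-in config; it assumes the internal resolver at [fdaa::3] and a hypothetical app name nginx-cache:

    resolver [fdaa::3] valid=5s ipv6=on;

    server {
        listen [::]:8080;

        location / {
            # a variable in proxy_pass forces per-request DNS resolution
            set $cache_backend http://nginx-cache.internal:8080;
            proxy_pass $cache_backend;
        }
    }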

Hey @kurt;
I think I follow this, and translating @julia’s very cool adaptation to my nginx config and getting rid of an extraneous bash script looks like it would do the trick.

But… I’m struggling a bit.

If this is my original setup:

    upstream nginx-nodes {
        hash "$scheme://$host$request_uri" consistent;
        keepalive 100;

        # upon pain of death -- DO NOT REMOVE the following blank line and comment
        # they're a placeholder for the other cache servers and
        # get rewritten by scripts/check-nodes.sh

        # shard-upstream-placeholder
    }

    # balance across cache servers
    server {
        listen [::]:8080;
        listen 8080 default_server;

        # routes to other nodes in the `nginx-nodes` pool
        # based on the consistent hash derived from the URL
        location / {
            access_log /dev/stdout main_ext;
            proxy_pass http://nginx-nodes/;
            proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
            proxy_next_upstream_timeout 1;
            proxy_connect_timeout 1s;
            proxy_http_version 1.1;
            proxy_set_header Connection "";
            add_header X-Frontend $hostname;
            add_header X-Git-SHA {GIT_SHA};
        }

        location /favicon.ico {
            access_log off;
            return 200 "ok";
        }

    }

What would the new one look like to skip the bash re-writes and still maintain a consistent hash sharding?

So after some further digging, I’m still not sure how to keep the consistent hash sharding mechanism while adopting the built-in nginx DNS resolution @julia identified. They seem like unrelated solutions.

The sharding depends on setting up the upstream nodes as a static list inside an upstream block, and that’s the only place the hash directive is allowed. I don’t see how to map that onto the location-based approach that specifies a resolver.
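
To make the mismatch concrete (a sketch, with made-up names and addresses): the hash … consistent directive is only valid inside an upstream block, which wants statically listed servers, while the resolver trick only kicks in when proxy_pass gets a variable in a plain location block. As far as I can tell, dynamic re-resolution of server entries inside upstream (the resolve parameter) is an nginx Plus feature, which is why the two don’t compose in open-source nginx.

    # consistent hashing lives here, and only takes static entries:
    upstream nginx-nodes {
        hash "$scheme://$host$request_uri" consistent;
        server [fdaa:0:1:a7b::2]:8080;    # hypothetical 6pn address
    }

    # dynamic DNS resolution lives here, with no upstream block involved:
    location / {
        resolver [fdaa::3] valid=5s;
        set $cache_backend http://nginx-cache.internal:8080;
        proxy_pass $cache_backend;        # nowhere to hang "hash ... consistent"
    }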

I think I solved this in the worst way possible.

I added supervisor & cron to the Docker image, and set up supervisor to run cron and the nginx startup script. The cron job runs the Fly node discovery / nginx.conf rewrite, and the startup script launches nginx in the first place.
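
For reference, the supervisor side is roughly this (a sketch; the program names and paths are made up):

    [supervisord]
    nodaemon=true

    [program:cron]
    command=cron -f
    autorestart=true

    [program:nginx]
    ; hypothetical startup script that ends by running nginx in the foreground
    command=/app/start-nginx.sh
    autorestart=true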

There’s some fun with missing env vars once cron enters the picture, but it seems to be starting up ok and re-deploys cleanly…
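
The usual workaround for that (roughly what I mean by the env var fun; the FLY_ prefix and paths here are illustrative, not necessarily exactly what I used) is to dump the environment to a file at container start and have the crontab source it, since cron jobs start with an almost empty environment:

    # in the container entrypoint, before supervisord starts:
    printenv | grep '^FLY_' | sed 's/^/export /' > /etc/fly-env.sh

    # crontab entry:
    * * * * * . /etc/fly-env.sh && /app/scripts/check-nodes.sh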

Do not do what I did. Figure out the mysterious way of the resolver and how to do upstream shards without a script… (I still don’t know if that’s possible, so if it is I’m all ears.)