Running a Global Object Cache Cluster using NGINX

NGINX is a very fast proxy server that has a built-in caching system — useful if you want a caching CDN for your users, or a way to cache objects inside Fly for fast and repeated access from your applications.

The first and simplest configuration is to use a set of ephemeral instances based on GitHub - fly-apps/nginx: A fly app nginx config. This loads each instance with NGINX, and the nginx.conf has a line to set the origin URL:

set $origin_url https://example.org;

We can change this to the URL we want to proxy, create the app with fly apps create, update the app name in fly.toml, and we’re off! The repo’s nginx.conf also sets up the following cache:

proxy_cache_path /tmp/nginx-cache levels=1:2 keys_zone=static:64m max_size=512m use_temp_path=off;

We’ve given it a few options as well:

  • /tmp/nginx-cache tells NGINX where on disk to hold the cached data.
  • levels tells it how many directories deep to organise the data; keys_zone tells it how much memory to use for key & metadata storage.
  • max_size tells it how big the cache can grow on the disk. Removing max_size is also an option — the cache will simply grow to fill the disk and be managed by NGINX after that.
  • use_temp_path=off tells NGINX to put its temporary files in the same folder as the cache, instead of using a separate folder.
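To make it concrete, here’s a minimal sketch (not the repo’s exact config; the listen port and origin are placeholders) of how a server block references the static zone defined above:

server {
        listen 8080;                          # placeholder port

        location / {
            proxy_pass https://example.org;   # the origin we set earlier
            proxy_cache static;               # use the keys_zone named in proxy_cache_path
        }
}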

The first problem with this setup is that /tmp/nginx-cache is on the instance’s ephemeral storage — Fly provides each instance with 5GB of storage on the root disk that is not persisted, not even over restarts. This is going to lead to a lot of repeated origin fetches, so let’s fix it by adding a volume:

fly volumes create nginx_data --region maa

We can then update our fly.toml to mount the volume on /data:

[mounts]
source = "nginx_data"
destination = "/data"

And update the nginx.conf to use that as the cache directory:

proxy_cache_path /data/nginx-cache levels=1:2 keys_zone=static:128m max_size=10g use_temp_path=off;

Now we can expect a much better cache hit ratio thanks to the larger disk. You can also control how long items stay in the cache using proxy_cache_valid, and tell NGINX to revalidate expired items with conditional requests (which skip downloading the object body when it hasn’t changed) using proxy_cache_revalidate.
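For example, a sketch with illustrative values (the times here are placeholders, not taken from the repo’s config):

location / {
    proxy_pass https://example.org;
    proxy_cache static;
    proxy_cache_valid 200 301 302 10m;   # keep successful responses for 10 minutes
    proxy_cache_valid any 1m;            # cache everything else briefly
    proxy_cache_revalidate on;           # refresh expired entries with conditional requests
}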

This is going great so far, but what if we want to distribute the cache over multiple servers, or use more caching space than a single volume can hold? To do that, we could keep a list of all the servers on each server, route all requests for any particular URL to the same server, and distribute URLs across the list of servers we have.

We’d do this by changing the origin in our nginx.conf to

proxy_pass http://nginx-nodes/;

and then define nginx-nodes as an upstream:

upstream nginx-nodes {
    hash "$scheme://$host$request_uri" consistent;

    server [fdaa:0:33b5:a7b:14bf:0:641d:2]:8081 max_fails=5 fail_timeout=1s;
    server [fdaa:0:33b5:a7b:14bf:0:641d:3]:8081 max_fails=5 fail_timeout=1s;
    ...
}

The hash "$scheme://$host$request_uri" consistent; directive tells NGINX to route each request to one of the servers on the list based on a hash of the $scheme://$host$request_uri pattern, which is essentially the full request URL. The consistent keyword tells NGINX to use consistent hashing, which reduces how many URLs get remapped to different servers when the list of servers changes.

We need to define a new server block listening on port 8081 that fetches the actual data, and move our proxy_cache directive there, out of the main server:

server {
        listen       [::]:8081;
        listen 8081;
        ...
        location / {
            proxy_pass https://example.com/;
            proxy_cache static;
        }
}

This setup is what the GitHub - fly-apps/nginx-cluster: A horizontally scalable NGINX caching cluster repo does, along with scripts that automatically update the list of servers every few seconds by querying Fly’s internal DNS.

A caveat is that NGINX will only cache responses that include a Cache-Control or an Expires header, so you need to make sure your origin returns one or both of them.
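A handy way to check what is actually being cached (not part of the configs above) is to expose NGINX’s $upstream_cache_status variable as a response header and watch for HIT, MISS or EXPIRED:

location / {
    proxy_pass https://example.org;
    proxy_cache static;
    add_header X-Cache-Status $upstream_cache_status;   # HIT, MISS, EXPIRED, BYPASS, ...
}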

S3 & NGINX Caching

Setting up an NGINX cache is a great way to add a layer in front of S3 that reduces costs and improves performance. A couple of extra measures need to be taken to make it work, though.

Dealing with no headers

First, assuming we’re using public download URLs on a bucket whose objects are publicly accessible, S3 does not return Cache-Control or Expires headers, which means NGINX will not cache the responses. One way to get around this is to add another proxy server inside NGINX that does nothing but proxy S3 requests and add a Cache-Control header. We can do this by pointing the proxy_pass directive to another server running on port 8082, like this:

proxy_pass http://127.0.0.1:8082/;

with the server defined as

server {
        listen       [::]:8082;
        listen 8082;

        location / {            
            proxy_pass https://example.s3.amazonaws.com/;            
            add_header Cache-Control "public, max-age=315360000";
        }
}

Dealing with signed URLs

If the URLs we’re trying to cache are signed, which will happen if the bucket and its objects are not publicly accessible, we’ll need to run another tool like GitHub - awslabs/aws-sigv4-proxy: This project signs and proxies HTTP requests with Sigv4 alongside our NGINX cluster. Running that signing proxy lets us set proxy_pass to localhost:nnnn if we run it locally as another process, or to http://top1.nearest.of.signing-proxy-app.internal to send the request via the nearest instance of a separate Fly app. The proxy_set_header Host $host; directive is going to come in handy in this case.
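As a rough sketch, assuming the signing proxy listens on a hypothetical local port 8083 (the port is a placeholder, and the Host value depends on how the proxy expects to route and sign requests), the S3-facing server from earlier would look something like this:

server {
        listen 8082;

        location / {
            # Hand the request to the sigv4 signing proxy instead of going straight to S3.
            # 127.0.0.1:8083 is a placeholder; use http://top1.nearest.of.signing-proxy-app.internal
            # if the proxy runs as a separate Fly app.
            proxy_pass http://127.0.0.1:8083/;
            proxy_set_header Host $host;      # as noted above; adjust to what the signing proxy expects
            add_header Cache-Control "public, max-age=315360000";
        }
}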


The repo seems to be archived now; is there anything in particular that’s known to be bad about it?
I was thinking of doing something similar


Nothing bad! We’ve just archived the ones that we don’t consider ‘official’, since the instructions there may drift out of sync with platform changes over time.