Did I trigger spam/DoS protection with a load test?

I’ve been testing out distributed Caddy reverse proxy clusters and I wanted to see what kinds of resources they’d use under load. So I ran a k6 load test against them, with 450 connections for a minute, which seemed to go okay.
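For reference, the test was roughly this shape. This is a hypothetical reconstruction (the actual k6 script isn’t in the post, and the TARGET URL is a placeholder); it just writes the script to disk so the options are visible:

```shell
# Hypothetical reconstruction of the load test: 450 concurrent VUs for 1 minute.
# The real script and target URL aren't shown in the post.
cat > loadtest.js <<'EOF'
import http from 'k6/http';

export const options = {
  vus: 450,        // 450 concurrent virtual users ("connections")
  duration: '1m',  // run for one minute
};

export default function () {
  http.get(__ENV.TARGET || 'https://example.com/');
}
EOF

# Then run it with something like:
#   k6 run -e TARGET="https://your-app.fly.dev/" loadtest.js
echo "wrote loadtest.js"
```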

Shortly afterwards, though, I’ve been unable to load anything I’m reverse proxying, and I’m wondering whether I’ve triggered some hidden DoS protection in Fly.

  • Checking the /health route I added on the app’s autogenerated fly.dev subdomain works fine
  • Going directly to the upstream domain that the reverse proxy points at works fine
  • Going to the domain being reverse proxied does not work
  • Going to the domain being reverse proxied via a service that tests from multiple locations also fails (which I thought might succeed if this were a DoS issue)
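For anyone repeating these checks later, they can be scripted with curl. The APP name here is a hypothetical placeholder; the two domains are the ones from this post:

```shell
# Probe each layer of the setup and print the HTTP status for each URL.
# "my-caddy-proxy" is a hypothetical app name, not the poster's real one.
APP="my-caddy-proxy"

for url in \
  "https://$APP.fly.dev/health" \
  "https://nextjs-docs-ecru-tau.vercel.app/" \
  "https://apx1621708518816112.test.approximated.app/"; do
  # -m 10 bounds each probe; -w prints just the status code (000 on failure)
  code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$url")
  echo "$url -> ${code:-unknown}"
done
```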

Here’s an example domain:
https://apx1621708518816112.test.approximated.app/
That should load up this:
https://nextjs-docs-ecru-tau.vercel.app/

This reverse proxy worked before I ran the load test. I’ve restarted the app a few times.

Here’s what curl shows when I try a reverse-proxied domain:

curl https://apx1621708518816120.test.approximated.app/ -v
*   Trying 213.188.209.244:443...
* TCP_NODELAY set
* Connected to apx1621708518816120.test.approximated.app (213.188.209.244) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=apx1621708518816120.test.approximated.app
*  start date: May 22 17:42:26 2021 GMT
*  expire date: Aug 20 17:42:26 2021 GMT
*  subjectAltName: host "apx1621708518816120.test.approximated.app" matched cert's "apx1621708518816120.test.approximated.app"
*  issuer: C=US; O=Let's Encrypt; CN=R3
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x557095f79820)
> GET / HTTP/2
> Host: apx1621708518816120.test.approximated.app
> user-agent: curl/7.68.0
> accept: */*
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* Connection state changed (MAX_CONCURRENT_STREAMS == 250)!
< HTTP/2 200
< server: Caddy
< content-length: 0
< date: Sun, 23 May 2021 18:22:41 GMT
<
* Connection #0 to host apx1621708518816120.test.approximated.app left intact

And here’s a Caddy log entry debugging a request:

{
  "level": "debug",
  "ts": 1621794026.0371342,
  "logger": "http.handlers.reverse_proxy",
  "msg": "upstream roundtrip",
  "upstream": "nextjs-docs-ecru-tau.vercel.app:443",
  "request": {
    "remote_addr": "147.75.94.145:36184",
    "proto": "HTTP/1.1",
    "method": "GET",
    "host": "nextjs-docs-ecru-tau.vercel.app",
    "uri": "/",
    "headers": {
      "X-Forwarded-For": [
        "147.75.94.145"
      ],
      "User-Agent": [
        "hackney/1.16.0"
      ],
      "X-Forwarded-Proto": [
        "https"
      ]
    }
  },
  "duration": 4.887757117,
  "error": "context canceled"
}
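That "context canceled" error from Caddy usually means the request’s context was torn down (the client gave up, the server was shutting down, or a timeout fired) before the upstream answered, which fits the ~4.9 s roundtrip duration rather than an immediate refusal. For context, a minimal Caddyfile for this kind of setup would look something like this (domains taken from the post; the timeout values are assumptions, not the actual config):

```
apx1621708518816112.test.approximated.app {
	reverse_proxy https://nextjs-docs-ecru-tau.vercel.app {
		# Rewrite Host so Vercel routes to the right site, matching the
		# "host" field in the debug log above
		header_up Host {upstream_hostport}

		transport http {
			dial_timeout 5s               # assumed value
			response_header_timeout 10s   # assumed value
		}
	}
}
```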

Alright, so I had scaled count down to 1 to test out autoscaling. Changing the scale back to 3 and disabling autoscaling seems to have fixed things and it’s working again. Not really sure why, but it seems like adding the new instances (in other regions) worked.
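In case it helps anyone, the scaling change was along these lines. The app name is a placeholder, and the subcommands match the flyctl of the time (check `fly help scale` on current versions):

```shell
# Hypothetical sketch: pin the app at 3 instances and turn autoscaling off.
# "my-caddy-proxy" is a placeholder app name.
APP="my-caddy-proxy"

if command -v fly >/dev/null 2>&1; then
  fly scale count 3 --app "$APP" || echo "fly scale failed (not logged in?)"
  fly autoscale disable --app "$APP" || echo "fly autoscale failed (not logged in?)"
else
  echo "fly CLI not installed; commands shown for reference"
fi
```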

Or maybe it was because, right after I did that, the original instance flipped a health check to critical and shut down? Not sure why that would fix it, though; I’d run the restart command a bunch of times earlier and it didn’t help.

I do seem to hit the occasional issue like that, where an instance stops working and I can’t tell why.

It wouldn’t surprise me if you triggered rate limiting / abuse protection on Vercel’s end. That Caddy error message looks like it just timed out waiting for a response from Vercel.

When you scale, it’ll put VMs on different hosts with different IPs. So if Vercel is rate limiting an IP, that’s one way to get around it.

Just in case someone runs into a similar issue later: I’m pretty sure this was some kind of request-limit protection on Vercel’s side. I’d cranked up the concurrency in my Fly config and Fly wasn’t blocking me, but requests were being cancelled at Vercel’s end. What’s interesting is that Vercel cancelled ALL requests to that particular site, no matter where they came from, for an hour or so. I’m guessing it went over some internal Vercel limit for free sites and just stopped responding.

Oof. It’s possible they just had a mini outage, too. It’s probably worth asking them about it to see if they’ve documented limits!