DDoS

What would happen if a Mirai-style botnet were pointed at a typical application, e.g. as in the graph pictured?

Will the result be any different depending on whether my service is of the TLS/HTTP/TCP/Proxy type?

Will autoscaling kick in at any point?

What about 100k rps or 1M rps, but sustained over a longer period?

1 Like

I’ve had success running Fly behind Cloudflare itself. Not sure it makes sense for Fly to build another Cloudflare inside it.

That said it probably makes sense to have some kind of billing cutoff or at least a DDoS policy.

Our upstream infrastructure provider has DDoS protection. It’s kicked in many times over the past few months and it appears to work well.

The RPS number itself isn’t as relevant as packets per second on their end. Since all apps have anycast IP addresses, the load is at least distributed between our edge servers. Currently, our proxy will shed load once certain connection concurrency and/or connection rate thresholds are reached for an app. There’s also a global limit.

We’re tweaking these limits from time to time and will probably change strategy sooner rather than later. Our proxy can handle a lot of connections, but other resources on each of our edge servers are more problematic.

So the answer is: our upstream infrastructure provider offers us some DDoS protection. It’s not ideal, but it’s been good up until now. As we grow we’ll come up with better protection on this front.

3 Likes

Yes, it is an option, although I’d prefer not to go behind Cloudflare if I can avoid it.

In addition to its benefits, the Cloudflare proxy also adds some inflexibility that requires a paid plan to make flexible again :wink: Also, my use-case needs a trivial feature that happens to require their Enterprise plan :man_shrugging:

Either case also won’t necessarily help prevent billing overruns, as you say. But I’ve read that Fly will waive billing overages related to attacks, at least for their own resources :smiley:

I’m starting to implement a billing cutoff solution for my app that I might also make available as a SaaS, if a good solution doesn’t already exist for that.

Thanks Jerome :slight_smile:

I guessed that provider would have layer 3 and 4 protections. If AS numbers can be dropped with minimal collateral damage then I’m all for that, but otherwise I’m keen to receive as many (potentially) valid connections as possible and drop the bad traffic at my app instances.

The concurrency limit makes me a little worried that potentially valid traffic could be blotted out, and while that could be solved with a DDoS service placed in front, as I mentioned to Sudhir I happen to use a Cloudflare feature (wildcard CNAME) which would put me straight onto their Enterprise plan :grimacing: Perhaps another DDoS provider has an equivalent solution for cheaper though.

If you’re comfortable sharing the effective connection limits for instances and for an app as a whole then it could help me ensure the app can scale out fast to the right width and that I can configure each instance to be aggressive enough in getting rid of useless connections.

Finally, if TCP and Proxy modes are able to have higher concurrency limits due to the absence of buffering, then that would be helpful to know about :slight_smile:

(Disclosure - my app is a blogging platform so I expect that sooner or later a blogger’s going to write something which another person doesn’t want to exist on the internet!)

Not sure which feature that is, but you can also check out AWS CloudFront and BunnyCDN. CloudFront has a lot of flexibility, and one could argue unlimited flexibility with its edge functions (little bits of JS that run at the 200-city edge for every request), but high per-request costs. They handle DDoS protection.

BunnyCDN has some amount of overrides and a rule engine system that lets you make changes at the edge as well.

check out AWS CloudFront

I think that’s a good idea - AWS CloudFront has “AWS Shield Standard” which is likewise layer 3/4, though presumably AWS has a lot of resources to throw at the problem.

But CloudFront has the problem that made me want to use Fly in the first place - ridiculous latency. Like 50ms being typical overhead for a cache hit. If sub-100ms is a goal then CloudFront will make it a challenge. Requests which go to the origin have further latency on top of that.

The limit is quite high. We have plans to make it scale with your instance’s size and own limits (or something like that).

This is currently set at 32K concurrent connections per app per host. Concurrency is throttled at 400 connections per second per app per host. Right now we don’t publish the number of hosts we have as edges, but it’s generally >= 2 per region depending on the popularity of the region. That means you can go up to 64K conns per region for your app.

Assuming you have users all over the world, this shouldn’t be a problem for a while. We’ll grow with you; if we see that limit get too close for comfort, we can provision more edge servers pretty fast.

TCP is subject to the same limits, unfortunately. We’re mostly preserving the number of open file descriptors. It’s not accurate, it’s mostly an estimate for now. 32K open sockets doesn’t seem high for a modern server, but we also generally have to open as many connections outbound (in some cases we can reuse connections, but not all cases). Since this is per-app, it could get out of hand if too many apps become popular at the same time, in the same regions.

The trick will be to make that limit adaptive, possibly having it work more like a budget we have to balance. Some apps don’t need more than 1 concurrent connection per region, others may need over 32K.

We’ve started doing some dynamic isolation of applications recently, for our TLS handshake logic. Too many concurrent handshakes can really stress a server and affect every other app. Popular apps get their own dedicated isolation, while less popular apps use a “shared” isolation structure (the same thing, but different limits).
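For illustration only, here’s a toy sketch of that dedicated-vs-shared idea (not Fly’s actual code; the app names and limits are made up): handshake work for a “popular” app goes through its own concurrency cap, while everything else shares one, so a handshake storm against one app can’t starve the rest.

```python
import asyncio

async def handle_handshake(app_name: str, dedicated, shared) -> None:
    # Popular apps get their own cap; everyone else shares one.
    sem = dedicated.get(app_name, shared)
    async with sem:
        await asyncio.sleep(0.01)  # stand-in for the actual TLS handshake work

async def main() -> None:
    shared = asyncio.Semaphore(64)                       # cap shared by most apps (made-up limit)
    dedicated = {"popular-app": asyncio.Semaphore(512)}  # dedicated caps for busy apps (made-up limit)

    # A handshake storm against the busy app plus a trickle for a quiet one:
    # the quiet app only contends on the shared semaphore, not on the storm.
    tasks = [handle_handshake("popular-app", dedicated, shared) for _ in range(1000)]
    tasks += [handle_handshake("quiet-app", dedicated, shared) for _ in range(5)]
    await asyncio.gather(*tasks)

asyncio.run(main())
```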

Just to add, I’ve experimented with using both BunnyCDN and Cloudflare in front of Fly apps. In my case I wanted the geo header they both handily provide without charging more per request.

I don’t know if there is anything specific about my app that would cause it, but my findings were:

BunnyCDN: I would get occasional, random 502 errors. Fly would not see the request hit their proxy, so they were unable to debug it, and the errors remain unexplained.

Cloudflare: I would get occasional, random 525 errors. The connection would work but would fail at the TLS handler. Fly’s investigations pointed to how their anycast routing may result in a lost connection between Cloudflare and them. I solved those by using in-app TLS: removing Fly’s TLS handler, using TCP instead, and terminating TLS in the app with Cloudflare’s origin certificate: https://developers.cloudflare.com/ssl/origin-configuration/origin-ca
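For reference, a minimal sketch of what in-app TLS termination can look like, assuming a Python app and made-up certificate paths pointing at the Cloudflare origin certificate (any language or framework works the same way); the platform then only passes raw TCP through:

```python
# Sketch only: the app terminates TLS itself with a Cloudflare origin cert.
# The certificate/key paths are assumptions for illustration.
import ssl
from http.server import HTTPServer, SimpleHTTPRequestHandler

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.load_cert_chain(certfile="/certs/cf-origin.pem", keyfile="/certs/cf-origin.key")

# Any request handler works here; SimpleHTTPRequestHandler is just a stand-in.
server = HTTPServer(("0.0.0.0", 8443), SimpleHTTPRequestHandler)
server.socket = ctx.wrap_socket(server.socket, server_side=True)
server.serve_forever()
```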

So if you go down the route of BunnyCDN or Cloudflare, that may be of some help :slight_smile:

3 Likes

Thanks Jerome! I’m not using websockets atm, so 32K * 2 concurrently open connections would be more than sufficient for me in normal conditions, and good to know it’s not insurmountable even then.

Concurrency is throttled at 400 connections per second per app per host.

I’m having trouble parsing this, but is that to say that if my instance takes 1s on average to send a response then it’s allowed up to ~400 rps, and if it takes 0.2s to respond on average then it’s allowed up to ~2000 rps?

No, this is just measuring the rate of connections established, not the whole life of a connection. In other words: how many clients connected to your app’s IP per second.

how many clients connected to your app’s IP per second.

Ok gotcha, so it’s more like if you have long persistent connections then you could take on 400/s for 160 seconds, and at that point you have 64k concurrent connections which is the other limit.

A 400/s (or 800/s if it’s 2 app hosts) limit is about the same level as AWS API Gateway’s default account limits, which seems a sensible default as a safety limiting feature.

My take-away from this is that Cloudflare or equivalent would be a good idea for any app facing a DDoS, at least until the current DDoS protections become fully robust.

I’d be quite keen to try serving / fending off DDoS traffic, if you ever decide to let apps opt in for 1M connections per second per app per host, but also understand if you don’t think it wise :slight_smile:

We don’t agree with this, in general. Cloudflare adds an extra proxy layer in front of your app, and sometimes that’s not insignificant latency.

It’s hard to tell how well our current DDoS protection will hold up against a bigger attack. It has been working well up to now. Whenever an attack happens, we get a notification and then we either don’t notice any side effects or we work with our provider’s network team directly to figure out how to further mitigate.

It’s worth noting the article linked in the original post represents a pretty rare event.

If you’re very concerned, you could use Cloudflare’s DNS-only service (proxy icon grayed) and turn on the proxy service only if it becomes a problem.

1 Like

YMMV, but it may be possible to achieve the same functionality using iptables alone. I’m not a network engineer, but a sketch could look like:

Add a --state NEW rule for every public client IP that sends that client’s connections to a client-specific chain.
Use connlimit and/or log to implement limiting and autoscaling.
Use DNAT to forward packets while maintaining the source IP.
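Concretely, and purely as an untested sketch driven from Python (the client IP, backend address, chain name, and limits below are all invented placeholders), those rules might look something like this; the LOG rule is the hook an external autoscaling watcher could consume:

```python
# Untested sketch of the iptables idea above; requires root, IP forwarding,
# and real testing. All addresses, names, and limits are placeholders.
import subprocess

def ipt(*args: str) -> None:
    subprocess.run(["iptables", *args], check=True)

CLIENT_IP = "203.0.113.7"                        # hypothetical public client
BACKEND_IP, BACKEND_PORT = "10.0.0.10", "8080"   # hypothetical app instance
CHAIN = "CLIENT_203_0_113_7"                     # client-specific chain

# DNAT rewrites only the destination, so the app still sees the real source IP.
ipt("-t", "nat", "-A", "PREROUTING", "-p", "tcp", "--dport", "443",
    "-j", "DNAT", "--to-destination", f"{BACKEND_IP}:{BACKEND_PORT}")

# Send new connections from this client through its own chain (matched in
# FORWARD against the backend port, since DNAT has already rewritten it).
ipt("-N", CHAIN)
ipt("-A", "FORWARD", "-s", CLIENT_IP, "-d", BACKEND_IP, "-p", "tcp",
    "--dport", BACKEND_PORT, "-m", "state", "--state", "NEW", "-j", CHAIN)

# In the chain: log, then drop, once the client holds too many connections.
ipt("-A", CHAIN, "-m", "connlimit", "--connlimit-above", "20",
    "-j", "LOG", "--log-prefix", "connlimit ")
ipt("-A", CHAIN, "-m", "connlimit", "--connlimit-above", "20", "-j", "DROP")
ipt("-A", CHAIN, "-j", "ACCEPT")
```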

To answer my own thought bubble - I think that a VM with a public IP would be better suited for this specific use-case than proxied TCP.

Giving it some thought, I think a DoS that runs into this limit worries me more than a DDoS.

On AWS API Gateway I can request a rate increase from support, and on Linode I don’t have a limit at all. For my mostly static use-case I’d be much happier just serving through a small DoS of 1k-100k rps on many nodes than running into the app limit.

The 400 connections per second per host isn’t an app limit; it’s a lower-level limit on each of our edge nodes. We have a lot of edge nodes. And to be clear, connections are not the same as requests. A connection is a full new TCP connection with a TLS handshake, and each connection serves multiple HTTP requests (for real users, not for attacks).

The limit per app VM comes from the concurrency settings in your fly.toml. These are not per-second limits, they are concurrency limits. If your VM can handle 1,000+ simultaneous HTTP requests, set the limit to 1,000. The actual throughput will vary based on how fast it responds.
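As a back-of-the-envelope illustration (Little’s law, with made-up numbers), the sustained request rate a VM can serve is roughly its concurrency limit divided by its average response time:

```python
# Rough rule of thumb only: sustained rps ≈ concurrent requests / avg response time.
concurrency_limit = 1000      # hypothetical concurrency setting for the VM
avg_response_seconds = 0.2    # hypothetical average response time
print(concurrency_limit / avg_response_seconds)  # ~5000 requests/second
```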

Thank you, that puts my mind at rest for the cases where I can proxy through Cloudflare. Wildcards can’t be proxied on Cloudflare at a sane cost, so I’d need to address that with programming and budgeting later.

The DoS case I’m considering is a random person spinning up a few small nodes on any number of hosting providers and reaching 1k rps, possibly via Tor or AWS Lambda or some VPN.

Or hiring out a botnet and cycling through 1000 hosts per second.

What we see most is botnets; no one puts in the actual work to do a DoS these days.

And, for what it’s worth, only ~0.08% of our customers ever get DoSed. If you are running the kind of app that attracts DoS attacks, this is worth spending time on. If you’re just running a normal SaaS, though, it might just be something to think about and then act on if it ever comes up.

1 Like

If I am allowed to suggest, give shuffle-sharding a look for managing the noisy-neighbour problems inherent in large multi-tenant systems such as load balancers: Workload isolation using shuffle-sharding
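For illustration only, here’s a tiny sketch of the shuffle-sharding idea with made-up host names and shard sizes: each tenant is deterministically assigned a small pseudo-random subset of the worker pool, so a noisy tenant can only exhaust its own shard, and most other tenants share at most part of it.

```python
# Toy shuffle-sharding assignment (not any particular load balancer's code).
import hashlib
import random

WORKERS = [f"edge-{i}" for i in range(8)]  # hypothetical pool of edge hosts
SHARD_SIZE = 2                             # each tenant gets 2 of them

def shard_for(tenant: str) -> list:
    # Seed a shuffle with a hash of the tenant ID so the assignment is stable.
    seed = int.from_bytes(hashlib.sha256(tenant.encode()).digest()[:8], "big")
    return random.Random(seed).sample(WORKERS, SHARD_SIZE)

print(shard_for("app-a"))  # e.g. ['edge-3', 'edge-6']
print(shard_for("app-b"))  # usually a different, only partly overlapping pair
```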

2 Likes