Created on November 12, 2023 at 11:40 am

In a comment on my entry on splitting our local DNS resolvers apart to serve different audiences, David Magda asked if using per-IP ratelimiting was a potential solution. My feeling is that it would be difficult for us to do this today with any confidence, and it’s not clear to me that reasonable per-IP ratelimits would stop all the problems.

We use Unbound as our DNS resolver, partly because that’s what OpenBSD seems to like for this (our local DNS resolvers are OpenBSD machines). Unbound’s per-IP ratelimiting is currently considered experimental, but we’ve had good luck with ‘experimental’ Unbound features before (we were using general ratelimiting when that was marked as experimental). However, this still leaves us with two or perhaps three problems.

The first is trying to determine what per-IP ratelimit we should set. You can certainly pick ‘reasonable’ numbers, but that’s just guessing; what you really need is something like a histogram of how many IPs hit what peak QPS rates how often. That would let you pick a limit with some confidence that even unusual systems wouldn’t hit it in legitimate operation. We’ve started to gather some information based on OpenBSD pf state counts on our firewalls, and it turns out that the numbers are a bit surprising.
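To make the histogram idea concrete, here is a minimal sketch in Python. It assumes you already have a per-query log of (timestamp, client IP) pairs, which is not what we actually have (pf state counts are a much coarser signal). The function names, the bucket edges, and the sample data are all made up for illustration. It also only captures each IP’s single worst second, not how often that IP gets close to its peak.

```python
from collections import Counter, defaultdict


def peak_qps_per_ip(events):
    """Map each client IP to its busiest one-second bucket.

    events is an iterable of (unix_timestamp, client_ip) pairs; this
    input format is an assumption, not something Unbound or pf emits.
    """
    per_second = Counter()
    for ts, ip in events:
        per_second[(ip, int(ts))] += 1

    peaks = defaultdict(int)
    for (ip, _second), count in per_second.items():
        peaks[ip] = max(peaks[ip], count)
    return dict(peaks)


def peak_histogram(peaks, bucket_edges=(1, 5, 10, 50, 100)):
    """Count how many IPs have a peak QPS in each bucket.

    The bucket edges are arbitrary illustrative values.
    """
    hist = Counter()
    for peak in peaks.values():
        label = f">{bucket_edges[-1]}"
        for edge in bucket_edges:
            if peak <= edge:
                label = f"<={edge}"
                break
        hist[label] += 1
    return dict(hist)


# Made-up sample data: one chattier client and one quiet one.
sample_events = [
    (100.1, "10.0.0.1"), (100.5, "10.0.0.1"), (100.9, "10.0.0.1"),
    (101.2, "10.0.0.1"),
    (100.3, "10.0.0.2"),
    (105.0, "10.0.0.2"),
]
sample_peaks = peak_qps_per_ip(sample_events)
sample_hist = peak_histogram(sample_peaks)
print(sample_peaks)
print(sample_hist)
```

With real data, the interesting part would be the tail of the histogram, since a per-IP limit has to sit above whatever the busiest legitimate clients do.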

(A similar issue applies for general ratelimiting. We don’t actually know our general queries per second distribution for ratelimit; our current setting is a guess, and might be either too high or too low.)

The second issue is that the problem may not be with single IPs that flood us with a high query volume (or may not just be that). In today’s environment, it might be that we’re seeing issues where certain sorts of devices all get into a bad state at the same time and start sending a bunch more queries than usual, but not so many that they would be unreasonable for any single IP by itself. This kind of bad behavior might be hard to trigger and hard to see (if, for example, it only happens when there’s the right sort of network glitch). There’s a lot of software monoculture these days, and that provides plenty of opportunities for problems to be amplified.

(Getting insight into collective behavior needs fairly detailed statistics or monitoring, which is not feasible for us for our DNS.)
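The arithmetic behind this worry can be sketched with entirely hypothetical numbers. Suppose 200 similar devices normally make 2 queries a second each, and a glitch pushes them all to 20 a second at once. With an assumed per-IP limit of 50 QPS, nobody is over the limit, but the total load on the resolver has gone up by a factor of ten. The function name and every figure here are illustrative assumptions, not measurements from our network.

```python
def aggregate_vs_per_ip(per_ip_qps, per_ip_limit):
    """Return (total QPS, list of IPs over the per-IP limit)."""
    total = sum(per_ip_qps.values())
    over_limit = [ip for ip, qps in per_ip_qps.items() if qps > per_ip_limit]
    return total, over_limit


# Hypothetical fleet of 200 similar devices.
normal_load = {f"10.0.1.{i}": 2 for i in range(1, 201)}
# The same devices all misbehaving at once, at ten times their usual rate.
glitch_load = {ip: 20 for ip in normal_load}

normal_total, normal_over = aggregate_vs_per_ip(normal_load, per_ip_limit=50)
glitch_total, glitch_over = aggregate_vs_per_ip(glitch_load, per_ip_limit=50)
print(normal_total, normal_over)
print(glitch_total, glitch_over)
```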

The third potential issue is that currently Unbound’s IP ratelimiting is a global setting. There’s no support for giving some IPs one ratelimit and different IPs another ratelimit (or no ratelimit). With no ability to set different ratelimits for different IPs, we’d have to set very conservative (that is, high) ratelimits to ensure that our critical machines would never be locked out from doing DNS queries even under some unpredictable situation of high (DNS) load.

(Unbound may change this in the future.)

My overall feeling is that per-IP ratelimiting for local DNS clients is currently quite hard to get right if you aren’t willing to either do a lot of complex monitoring and crunching of statistics in advance, or set somewhat arbitrary limits and cut clients off if they hit those limits. The latter is certainly an option in some environments, but is not ideal in a setting where you’re trying to be friendly and helpful (as we strive to be).

(One thing that could help this is a dry-run mode for ratelimits, where you could set your DNS resolver to simply log if a client would have hit rate limits but not actually limit them. Then you could experiment to see how often a particular ratelimit would act if it was real, and on how many clients.)
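Lacking such a mode in the resolver, one rough approximation would be to replay a query log offline against a candidate limit. The sketch below assumes a log of (timestamp, client IP) pairs and uses simple fixed one-second buckets. That is an assumption made for illustration and is not how Unbound accounts for its ratelimits, so the results would only be a ballpark. The function name and sample data are hypothetical.

```python
from collections import Counter


def dry_run_ratelimit(events, limit_qps):
    """Count, per client IP, the queries that would have exceeded limit_qps.

    events is an iterable of (unix_timestamp, client_ip) pairs (an assumed
    log format). Nothing is blocked; this only reports what a limit based
    on fixed one-second buckets would have affected.
    """
    seen = Counter()
    would_limit = Counter()
    for ts, ip in events:
        key = (ip, int(ts))
        seen[key] += 1
        if seen[key] > limit_qps:
            would_limit[ip] += 1
    return dict(would_limit)


# Made-up data: one client sends 8 queries within a single second,
# another sends just one.
replay_events = [(i * 0.1, "10.0.0.1") for i in range(8)] + [(0.5, "10.0.0.2")]

affected = dry_run_ratelimit(replay_events, limit_qps=5)
print(affected)
print(len(affected), "client(s) would have been limited")
```

Running this over the same log with several candidate limits would show how many clients each limit would touch, and how hard.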
