SplittingDNSResolvers

By admin
We operate a collection of our own servers, and also a bunch of internal networks for other people’s machines. As part of this we operate our own DNS infrastructure, including local DNS recursive resolvers (which are necessary to handle things like our split horizon DNS setup). Historically we have used one set of local DNS resolvers to handle everything; both DNS lookups from our own servers and DNS lookups from other people’s internal machines go to the same DNS resolvers. After all, why not? It’s simpler that way. Well, until things go wrong, which they have now done more than once.

It’s a sad reality of modern life that you cannot count on arbitrary machines (or pieces of software) being sensible DNS clients. Every so often you’re going to have a machine or a piece of software that freaks out or has something go wrong such that it sends your DNS resolvers a flood of queries, sometimes for DNS names that don’t exist or don’t currently resolve; this can happen if, for example, there’s a program that rapidly retries failed DNS lookups (or DNS lookups that merely didn’t get answered fast enough). Therefore, DNS resolvers that handle traffic from arbitrary clients are very likely to get hammered every so often, and if you’re unlucky they’ll be sufficiently badly affected that other clients start having their queries fail.

We’ve realized this sad truth recently, and the corollary that this makes it a bit problematic to use the same local DNS resolvers for third-party machines not under our control and our own servers, or at least important and carefully controlled ones like our fileservers or our Prometheus-based monitoring (where parts of it need regular DNS lookups). Machines such as our SLURM-based compute servers are rather more like third-party machines, because they run arbitrary user programs and those programs can have arbitrary DNS behavior (especially in a Computer Science research environment; a corporate Unix environment is probably less unpredictable).

What we’ve decided to do is to make some of our machines use another internal DNS resolver as their normal default resolver (currently by listing it first in /etc/resolv.conf; our other DNS resolvers remain listed, acting as fallbacks). This new resolver is exactly the same as our current DNS resolvers; it’s just not used by other people’s machines. We hope that this new ‘private’ resolver will be less likely to have surprise problems, so critical core systems will be more likely to keep working during DNS problems.
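
As an illustration of the ordering described above, here is a minimal sketch of generating such a resolv.conf, with the private resolver listed first and the shared resolvers kept as fallbacks. The IP addresses (taken from the 192.0.2.0/24 documentation range) and the function name render_resolv_conf are made up for this example and are not our actual setup. The cap of three entries is there because glibc's stub resolver traditionally only uses the first three nameserver lines.

```python
# Hypothetical addresses from the documentation range (192.0.2.0/24);
# substitute the real resolver IPs.
PRIVATE_RESOLVER = "192.0.2.53"
SHARED_RESOLVERS = ["192.0.2.1", "192.0.2.2"]

# glibc's stub resolver traditionally honors at most three nameserver lines.
MAX_NAMESERVERS = 3


def render_resolv_conf(primary, fallbacks, search=None):
    """Return resolv.conf text with `primary` listed first, then fallbacks."""
    ordered = []
    for addr in [primary, *fallbacks]:
        if addr not in ordered:  # drop duplicates, keeping the first occurrence
            ordered.append(addr)
    lines = []
    if search:
        lines.append("search " + " ".join(search))
    # Anything past the cap would be ignored by the resolver, so leave it out.
    lines.extend(f"nameserver {addr}" for addr in ordered[:MAX_NAMESERVERS])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_resolv_conf(PRIVATE_RESOLVER, SHARED_RESOLVERS), end="")
```

The stub resolver queries nameservers in the order they are listed and only moves on to the next one when a query times out, which is why putting the private resolver first makes it the normal default while the shared resolvers act purely as fallbacks.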

(DNS problems are still problems, because we need to provide working DNS to people. But there’s a difference between having problems and having our fileservers start refusing NFS access because they can’t map IPs to names. If NFS breaks, everything breaks.)