Anatomy of a Linux DNS Lookup – Part V – Two Debug Nightmares

 

In part V of this series of posts I take a slight detour to show you a couple of ‘debug nightmares’ that DNS threw at me.

Previous posts were:

Anatomy of a Linux DNS Lookup – Part I

Anatomy of a Linux DNS Lookup – Part II

Anatomy of a Linux DNS Lookup – Part III

Anatomy of a Linux DNS Lookup – Part IV

Both bugs threw up surprises not seen in previous posts…

 

Landrush and VMs

I’m a heavy Vagrant user; I use it mostly for testing Kubernetes clusters of various kinds.

It’s key for these setups that DNS lookup between the various VMs works smoothly. By default, each VM is addressable only by IP address.

To solve this, I use a Vagrant plugin called landrush. This works by creating a tiny Ruby DNS server on the host that runs the VMs. This DNS server runs on port 10053, and keeps and returns records for the VMs that are running. So, for example, you might have two VMs running (vm1 and vm2), and landrush will ensure that your DNS lookups for these hosts (eg vm1.vagrant.test and vm2.vagrant.test) point to the right local IP address for each VM.

It does this by creating IPTables rules on the VM host and the VMs themselves. These IPTables rules divert DNS requests to the DNS server running on port 10053, and if there’s no match, it will re-route the request to the original DNS server specified in that context.
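To give a flavour of the kind of rule involved, here is a simplified sketch (these are not Landrush’s exact rules, and the 10.0.2.2 host address is made up) that would divert a VM’s outbound DNS traffic to port 10053 on the VM host:

# Sketch only: divert the VM's outbound DNS to the Landrush server on the VM host.
# 10.0.2.2 is just an example address for the VM host.
iptables -t nat -A OUTPUT -p udp --dport 53 -j DNAT --to-destination 10.0.2.2:10053
iptables -t nat -A OUTPUT -p tcp --dport 53 -j DNAT --to-destination 10.0.2.2:10053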

Here’s a diagram that might help visualise this:

[Diagram: vagrant-landrush DNS server]

Above is a diagram that represents how Landrush DNS works with Vagrant. The box represents a host that’s running two Vagrant VMs (vm1 and vm2). These have the IP addresses 1.2.3.4 and 1.2.3.5 respectively.

A DNS request on either VM is redirected from that VM’s resolver (in this case systemd-resolved) to the Landrush DNS server on the VM host. This is achieved using an IPTables rule on the VM.

The Landrush DNS server keeps a small database of the hostname-to-IP mappings given out by Vagrant and responds to any requests for vm1.vagrant.test or vm2.vagrant.test with the appropriate local IP address. If the request is for any other address, it is forwarded on to the host’s configured DNS server (in this case DNSMasq).

Host lookups use the same IPTables mechanism to send DNS requests to the Landrush DNS server.

The Problem

Usually I use Ubuntu 16.04 machines for this, but when I tried 18.04 machines, networking on them was broken:

$ curl google.com
curl: (6) Could not resolve host: google.com

At first I assumed the VM images themselves were faulty, but taking Landrush out of the equation restored networking fully.

Trying another tack, I found that deleting the IPTables rule on the VM also restored networking. So, mysteriously, the IPTables rule was not working. I tried stracing the two curl calls (working and not-working) to see what the difference was. There was a difference, but I had no idea why it might be happening.

As a next step I tried to take systemd-resolved and Landrush out of the equation (since that was new between 16.04 and 18.04). I did this by using different IPTables rules:

  • Direct requests to Google’s 8.8.8.8 DNS server rather than the Landrush server (FAILED)
    • Showed that Landrush wasn’t the problem
  • Direct /etc/resolv.conf to a different address (changed 127.0.0.53 to 9.8.7.6), and wire IPTables to Google’s DNS server (WORKED)
  • Direct /etc/resolv.conf to a different address (changed 127.0.0.53 to 127.0.0.54), and wire IPTables to Google’s DNS server (FAILED)
    • Showed systemd-resolved not necessarily the problem

The fact that using 9.8.7.6 instead of 127.0.0.53 as the DNS server IP address worked led me to think that the problem might be that /etc/resolv.conf was pointed at a localhost address (ie one in the 127.0.0.* range).

A quick google led me here, which suggested that the problem was a sysctl setting:

sysctl -w net.ipv4.conf.all.route_localnet=1

And all was fixed.
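For reference, a minimal sketch of applying that setting and one common way of persisting it across reboots (the file name under /etc/sysctl.d/ is arbitrary):

# apply immediately
sudo sysctl -w net.ipv4.conf.all.route_localnet=1

# persist across reboots
echo 'net.ipv4.conf.all.route_localnet=1' | sudo tee /etc/sysctl.d/99-route-localnet.conf
sudo sysctl --system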

The gory detail of the debugging is here.

Takeaway

sysctl settings are yet another thing that can affect and break DNS lookup!

These and more such settings are listed here.

 

DNSMasq, UDP=>TCP and Large DNS Responses

The second bug threw up another surprise.

We had an issue in production where DNS lookups were taking a very long time within an OpenShift Kubernetes cluster.

Strangely, it only affected some lookups and not others. Also, the time taken to do the lookup was consistent. This suggested that there was some kind of timeout on the first DNS server requested, after which it fell back to a ‘working’ one.

We did manual requests using dig to the local DNSMasq server on one of the hosts that was ‘failing’. The DNS request returned instantly, so we were scratching our heads. Then a colleague pointed out that the DNS response was rather longer than normal, which rang a bell.

Soon enough, he came back with this RFC (RFC 5966), which states:

   In the absence of EDNS0 (Extension Mechanisms for DNS 0) (see below),
   the normal behaviour of any DNS server needing to send a UDP response
   that would exceed the 512-byte limit is for the server to truncate
   the response so that it fits within that limit and then set the TC
   flag in the response header.  When the client receives such a
   response, it takes the TC flag as an indication that it should retry
   over TCP instead.

which, to summarise, means that if the DNS response is over 512 bytes, then the DNS server will send back a truncated response with the TC flag set, and the client should retry the request over TCP rather than UDP.
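You can see this behaviour for yourself with dig. A rough sketch (the name here is a placeholder – you need a record whose response exceeds 512 bytes):

# Plain UDP query with EDNS0 disabled; don't auto-retry over TCP.
# If the response is too big, the header shows the 'tc' (truncated) flag.
dig +noedns +ignore @127.0.0.1 some.large.record.example ANY

# Retry the same query over TCP to get the full response.
dig +tcp @127.0.0.1 some.large.record.example ANY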

We never fixed the root cause here, but suspected that DNSMasq was not correctly relaying the response over TCP to the requesting client. We found a setting that specified which interface DNSMasq would run against. By limiting this to one interface, requests worked again.

From this, we reasoned that there was a bug in DNSMasq where, if it was listening on more than one interface and the upstream DNS request resulted in a response bigger than 512 bytes, the response never reached the original requester.

Takeaway

Another DNS surprise – DNS can stop working if the DNS response is over 512 bytes and the client software doesn’t handle the truncated response correctly.


Summary

DNS in Linux has even more surprises in store and things to check when things don’t go your way.

Here we saw how sysctl settings and plain old-fashioned bugs in seemingly battle-hardened code can affect your setup.

And we haven’t covered caching yet…

 



Anatomy of a Linux DNS Lookup – Part IV

In Anatomy of a Linux DNS Lookup – Part I, Part II, and Part III I covered:

  • nsswitch
  • /etc/hosts
  • /etc/resolv.conf
  • ping vs host style lookups
  • systemd and its networking service
  • ifup and ifdown
  • dhclient
  • resolvconf
  • NetworkManager
  • dnsmasq

In Part IV I’ll cover how containers do DNS. Yes, that’s not simple either…


1) Docker and DNS

In part III we looked at DNSMasq, and learned that it works by directing DNS queries to the localhost address 127.0.0.1, and a process listening on port 53 there will accept the request.

So when you run up a Docker container, on a host set up like this, what do you expect to see in its /etc/resolv.conf?

Have a think, and try and guess what it will be.

Here’s the default output if you run a default Docker setup:

$  docker run  ubuntu cat /etc/resolv.conf
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
# 127.0.0.53 is the systemd-resolved stub resolver.
# run "systemd-resolve --status" to see details about the actual nameservers.

search home
nameserver 8.8.8.8
nameserver 8.8.4.4

Hmmm.

Where did the addresses 8.8.8.8 and 8.8.4.4 come from?

When I pondered this question, my first thought was that the container would inherit the /etc/resolv.conf settings from the host. But a little thought shows that that won’t always work.

If you have DNSmasq set up on the host, the /etc/resolv.conf file will be pointed at the 127.0.0.1 loopback address. If this were passed through to the container, the container would look up DNS addresses from within its own networking context, and there’s no DNS server available within the container context, so the DNS lookups would fail.

‘A-ha!’ you might think: we can always use the host’s DNS server by using the host’s IP address, available from within the container as the default route:

root@79a95170e679:/# ip route                                                               
default via 172.17.0.1 dev eth0                                               
172.17.0.0/16 dev eth0 proto kernel scope link src 172.17.0.2     

 

Use the host?

From that we can work out that the ‘host’ is on the IP address 172.17.0.1, so we could try manually pointing DNS at that using dig (you could also update /etc/resolv.conf and then run ping – this just seems like a good time to introduce dig and its @ flag, which points the request at the IP address you specify):

root@79a95170e679:/# dig @172.17.0.1 google.com | grep -A1 ANSWER.SECTION
;; ANSWER SECTION:
google.com.             112     IN      A       172.217.23.14

However: that might work if you use DNSMasq, but if you don’t it won’t, as there’s no DNS server on the host to query.

So Docker’s solution to this quandary is to bypass all that complexity and point your DNS lookups to Google’s DNS servers at 8.8.8.8 and 8.8.4.4, ignoring whatever the host context is.

Anecdote: This was the source of my first problem with Docker back in 2013. Our corporate network blocked access to those IP addresses, so my containers couldn’t resolve URLs.
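If those defaults don’t work for you (as in that corporate network), you can override them – a quick sketch, with a made-up nameserver address:

# Per-container override
docker run --dns 10.0.0.2 ubuntu cat /etc/resolv.conf

# Or a daemon-wide default (overwrites any existing daemon.json; requires a daemon restart)
echo '{ "dns": ["10.0.0.2"] }' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker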

So that’s Docker containers, but container orchestrators such as Kubernetes can do different things again…

 

2) Kubernetes and DNS

The unit of container deployment in Kubernetes is a Pod. A pod is a set of co-located containers that (among other things) share the same IP address.

An extra challenge with Kubernetes is that requests for Kubernetes services (eg myservice.kubernetes.io) need to be forwarded to the right resolver, which maps them to the private network allocated to those service addresses. These addresses are said to be on the ‘cluster domain’. This cluster domain is configurable by the administrator, so it might be cluster.local or myorg.badger depending on the configuration you set up.

In Kubernetes you have four options for configuring how DNS lookup works within your pod.

  • Default

This (misleadingly-named) option takes the same DNS resolution path as the host the pod runs on, as in the ‘naive’ DNS lookup described earlier. It’s misleadingly named because it’s not the default! ClusterFirst is.

If you want to override the /etc/resolv.conf entries, you can do so in your kubelet config.

  • ClusterFirst

ClusterFirst does selective forwarding on the DNS request. This is achieved in one of two ways based on the configuration.

In the first, older and simpler setup, a rule was followed where if the cluster domain was not found in the request, then it was forwarded to the host.

In the second, newer approach, you can configure selective forwarding on an internal DNS server.

Here’s what the config looks like and a diagram lifted from the Kubernetes docs which shows the flow:

apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns
  namespace: kube-system
data:
  stubDomains: |
    {"acme.local": ["1.2.3.4"]}
  upstreamNameservers: |
    ["8.8.8.8", "8.8.4.4"]

The stubDomains entry defines specific DNS servers to use for specific domains. The upstream servers are the servers we defer to when nothing else has picked up the DNS request.

This is achieved with our old friend DNSMasq running in a pod.

[Diagram: kube-dns stub domains and upstream nameservers, from the Kubernetes docs]

The other two options are more niche:

  • ClusterFirstWithHostNet

This applies if you use host network for your pods, ie you bypass the Docker networking setup to use the same network as you would directly on the host the pod is running on.

  • None

None does nothing to DNS but forces you to specify the DNS settings in the dnsConfig field in the pod specification.
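As a minimal sketch (the pod name, nameserver and search domain below are all made up), a pod using None might look like this:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: dns-example            # hypothetical pod name
spec:
  containers:
  - name: app
    image: ubuntu
    command: ["sleep", "3600"]
  dnsPolicy: "None"            # ignore the cluster's DNS setup entirely...
  dnsConfig:                   # ...and use exactly what's specified here
    nameservers:
    - 1.2.3.4
    searches:
    - my.dns.search.suffix
EOF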

CoreDNS Coming

And if that wasn’t enough, this is set to change again as CoreDNS comes to Kubernetes, replacing kube-dns. CoreDNS will offer a few benefits over kube-dns, being more configurable and more efficient.

Find out more here.

If you’re interested in OpenShift networking, I wrote a post on that here. But that was for 3.6 so is likely out of date now.

End of Part IV

That’s part IV done. In it we covered:

  • Docker DNS lookups
  • Kubernetes DNS lookups
  • Selective forwarding (stub domains)
  • kube-dns

 



Anatomy of a Linux DNS Lookup – Part III

In Anatomy of a Linux DNS Lookup – Part I I covered:

  • nsswitch
  • /etc/hosts
  • /etc/resolv.conf
  • ping vs host style lookups

and in Anatomy of a Linux DNS Lookup – Part II I covered:

  • systemd and its networking service
  • ifup and ifdown
  • dhclient
  • resolvconf

and ended up here:


[Diagram: a (roughly) accurate map of what’s going on so far]

Unfortunately, that’s not the end of the story. There’s still more things that can get involved. In Part III, I’m going to cover NetworkManager and dnsmasq and briefly show how they play a part.


1) NetworkManager

As mentioned in Part II, we are now well away from POSIX standards and into Linux distribution-specific areas of DNS resolution management.

In my preferred distribution (Ubuntu), there is a service called NetworkManager that’s often installed for me as a dependency of some other package. It’s actually a service developed by Red Hat in 2004 to help manage network interfaces for you.

What does this have to do with DNS? Install it to find out:

$ apt-get install -y network-manager

In my distribution, I get a config file.

$ cat /etc/NetworkManager/NetworkManager.conf
[main]
plugins=ifupdown,keyfile,ofono
dns=dnsmasq

[ifupdown]
managed=false

See that dns=dnsmasq there? That means that NetworkManager will use dnsmasq to manage DNS on the host.


2) dnsmasq

The dnsmasq program is that now-familiar thing: yet another level of indirection for /etc/resolv.conf.

Technically, dnsmasq can do a few things, but primarily it acts as a DNS server that can cache requests to other DNS servers. It runs on port 53 (the standard DNS port), on all local network interfaces.

So where is dnsmasq running? NetworkManager is running:

$ ps -ef | grep NetworkManager
root     15048     1  0 16:39 ?        00:00:00 /usr/sbin/NetworkManager --no-daemon

But no dnsmasq process exists:

$ ps -ef | grep dnsmasq
$

Although it’s configured to be used, confusingly it’s not actually installed! So you’re going to install it.

Before you install it though, let’s check the state of /etc/resolv.conf.

$ cat /etc/resolv.conf
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 10.0.2.2
search home

It’s not been changed by NetworkManager.

Now let’s install dnsmasq:

$ apt-get install -y dnsmasq

Then dnsmasq is up and running:

$ ps -ef | grep dnsmasq
dnsmasq  15286     1  0 16:54 ?        00:00:00 /usr/sbin/dnsmasq -x /var/run/dnsmasq/dnsmasq.pid -u dnsmasq -r /var/run/dnsmasq/resolv.conf -7 /etc/dnsmasq.d,.dpkg-dist,.dpkg-old,.dpkg-new --local-service --trust-anchor=.,19036,8,2,49AAC11D7B6F6446702E54A1607371607A1A41855200FD2CE1CDDE32F24E8FB5

And /etc/resolv.conf has changed again!

root@linuxdns1:~# cat /etc/resolv.conf 
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 127.0.0.1
search home

And netstat shows dnsmasq is serving on all interfaces at port 53:

$ netstat -nlp4
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address     Foreign Address State   PID/Program name
tcp        0      0 127.0.0.1:53      0.0.0.0:*       LISTEN  15286/dnsmasq 
tcp        0      0 10.0.2.15:53      0.0.0.0:*       LISTEN  15286/dnsmasq
tcp        0      0 172.28.128.11:53  0.0.0.0:*       LISTEN  15286/dnsmasq
tcp        0      0 0.0.0.0:22        0.0.0.0:*       LISTEN  1237/sshd
udp        0      0 127.0.0.1:53      0.0.0.0:*               15286/dnsmasq
udp        0      0 10.0.2.15:53      0.0.0.0:*               15286/dnsmasq  
udp        0      0 172.28.128.11:53  0.0.0.0:*               15286/dnsmasq  
udp        0      0 0.0.0.0:68        0.0.0.0:*               10758/dhclient
udp        0      0 0.0.0.0:68        0.0.0.0:*               10530/dhclient
udp        0      0 0.0.0.0:68        0.0.0.0:*               10185/dhclient

3) Unpicking dnsmasq

Now we are in a situation where all DNS queries are going to 127.0.0.1:53 – so what happens from there?

We can get a clue from looking again at the /var/run folder. The resolv.conf in the resolvconf folder has been changed to point to where dnsmasq is serving:

$ cat /var/run/resolvconf/resolv.conf 
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 127.0.0.1
search home

while there’s a new dnsmasq folder with its own resolv.conf.

$ cat /run/dnsmasq/resolv.conf 
nameserver 10.0.2.2

which has the nameserver given to us by DHCP.

We can reason about this without looking too deeply, but what if we really want to know what’s going on?


4) Debugging Dnsmasq

Frequently I’ve found myself wondering what dnsmasq’s state is. Fortunately, you can get a good amount of information out of it if you change this line in /etc/dnsmasq.conf:

#log-queries

to:

log-queries

and restart dnsmasq.
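One way to do that (assuming a systemd-based host and the stock config file path):

sudo sed -i 's/^#log-queries/log-queries/' /etc/dnsmasq.conf
sudo systemctl restart dnsmasq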

Now, if you do a simple:

$ ping -c1 bbc.co.uk

you will see something like this in /var/log/syslog (the [...] indicates that the line’s start is the same as the previous one):

Jul  3 19:56:07 ubuntu-xenial dnsmasq[15372]: query[A] bbc.co.uk from 127.0.0.1
[...] forwarded bbc.co.uk to 10.0.2.2
[...] reply bbc.co.uk is 151.101.192.81
[...] reply bbc.co.uk is 151.101.0.81
[...] reply bbc.co.uk is 151.101.64.81
[...] reply bbc.co.uk is 151.101.128.81
[...] query[PTR] 81.192.101.151.in-addr.arpa from 127.0.0.1
[...] forwarded 81.192.101.151.in-addr.arpa to 10.0.2.2
[...] reply 151.101.192.81 is NXDOMAIN

which shows what dnsmasq received, where the query was forwarded to, and what reply was received.

If the query is returned from the cache (or, more exactly, the local ‘time-to-live’ for the query has not expired), then it looks like this in the logs:

[...] query[A] bbc.co.uk from 127.0.0.1
[...] cached bbc.co.uk is 151.101.64.81
[...] cached bbc.co.uk is 151.101.128.81
[...] cached bbc.co.uk is 151.101.192.81
[...] cached bbc.co.uk is 151.101.0.81
[...] query[PTR] 81.64.101.151.in-addr.arpa from 127.0.0.1

and if you ever want to know what’s in your cache, you can provoke dnsmasq into sending it to the same log file by sending the USR1 signal to the dnsmasq process id:

$ kill -SIGUSR1 $(cat /run/dnsmasq/dnsmasq.pid)

and the output of the dump looks like this:

Jul  3 15:08:08 ubuntu-xenial dnsmasq[15697]: time 1530630488                                                                                                                                 
[...] cache size 150, 0/5 cache insertions re-used unexpired cache entries.                                                                           
[...] queries forwarded 2, queries answered locally 0                                                                                                 
[...] queries for authoritative zones 0                                                                                                               
[...] server 10.0.2.2#53: queries sent 2, retried or failed 0                                                                                         
[...] Host             Address         Flags      Expires                             
[...] linuxdns1        172.28.128.8    4FRI   H                                                                
[...] ip6-localhost    ::1             6FRI   H                                                                
[...] ip6-allhosts     ff02::3         6FRI   H                                                                
[...] ip6-localnet     fe00::          6FRI   H                                                                
[...] ip6-mcastprefix  ff00::          6FRI   H                                                                
[...] ip6-loopback     :               6F I   H                                                                
[...] ip6-allnodes     ff02:           6FRI   H                                                                
[...] bbc.co.uk        151.101.64.81   4F         Tue Jul  3 15:11:41 2018                                     
[...] bbc.co.uk        151.101.192.81  4F         Tue Jul  3 15:11:41 2018                                     
[...] bbc.co.uk        151.101.0.81    4F         Tue Jul  3 15:11:41 2018                                     
[...] bbc.co.uk        151.101.128.81  4F         Tue Jul  3 15:11:41 2018                                     
[...]                  151.101.64.81   4 R  NX    Tue Jul  3 15:34:17 2018                                     
[...] localhost        127.0.0.1       4FRI   H                                                                
[...] <Root>           19036   8   2   SF I                                                                    
[...] ip6-allrouters   ff02::2         6FRI   H        

In the above output, I believe (but don’t know, and ‘?’ indicates a relatively wild guess on my part) that:

  • ‘4’ means IPv4
  • ‘6’ means IPv6
  • ‘H’ means address was read from an /etc/hosts file
  • ‘I’ ? ‘Immortal’ DNS value? (ie no time-to-live value?)
  • ‘F’ ?
  • ‘R’ ?
  • ‘S’?
  • ‘N’?
  • ‘X’

 

Alternatives to dnsmasq

dnsmasq is not the only option that can be passed to dns in NetworkManager. There’s none, which does nothing to /etc/resolv.conf; default, which claims to ‘update resolv.conf to reflect currently active connections’; and unbound, which communicates with the unbound service and dnssec-triggerd, and is concerned with DNS security (not covered here).


 

End of Part III

That’s the end of Part III, where we covered the NetworkManager service, and its dns=dnsmasq setting.

Let’s briefly list some of the things we’ve come across so far:

  • nsswitch
  • /etc/hosts
  • /etc/resolv.conf
  • /run/resolvconf/resolv.conf
  • systemd and its networking service
  • ifup and ifdown
  • dhclient
  • resolvconf
  • NetworkManager
  • dnsmasq

 


 

 

 

 

 

 


Anatomy of a Linux DNS Lookup – Part II

 

In Anatomy of a Linux DNS Lookup – Part I I covered:

  • nsswitch
  • /etc/hosts
  • /etc/resolv.conf
  • ping vs host style lookups

and determined that most programs reference /etc/resolv.conf along the way to figuring out which DNS server to look up.

That stuff was more general Linux behaviour (*), but here we move firmly into distribution-specific territory. I use Ubuntu, but a lot of this will overlap with Debian and even CentOS-based distributions, and will also differ from earlier or later Ubuntu versions.

(*) in fact, it’s subject to a POSIX standard, so
is not limited to Linux (I learned this from
a fantastic comment on the previous post)

In other words: your host is more likely to differ in its behaviour in specifics from here.

In Part II I’ll cover how resolv.conf can get updated, what happens when systemctl restart networking is run, and how dhclient gets involved.


1) Updating /etc/resolv.conf by hand

We know that /etc/resolv.conf is (highly likely to be) referenced, so surely you can just add a nameserver to that file, and then your host will use that nameserver in addition to the others, right?

If you try that:

$ echo nameserver 10.10.10.10 >> /etc/resolv.conf

it all looks good:

# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 10.0.2.3
search home
nameserver 10.10.10.10

until the network is restarted:

$ systemctl restart networking
$ cat /etc/resolv.conf
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 10.0.2.3
search home

our 10.10.10.10 nameserver has gone!

This is where those comments we ignored in Part I come in…


2) resolvconf

You see the phrase generated by resolvconf in the /etc/resolv.conf file above? This is our clue.

If you dig into what systemctl restart networking does, among many other things, it ends up calling a script: /etc/network/if-up.d/000resolvconf. Within this script is a call to resolvconf:

/sbin/resolvconf -a "${IFACE}.${ADDRFAM}"

A little digging through the man pages reveals that the -a flag allows us to:

Add or overwrite the record IFACE.PROG then run the update scripts
if updating is enabled.

So maybe we can call this directly to add a nameserver:

echo 'nameserver 10.10.10.10' | /sbin/resolvconf -a enp0s8.inet

Turns out we can!

$ cat /etc/resolv.conf  | grep nameserver
nameserver 10.0.2.3
nameserver 10.10.10.10

So we’re done now, right? This is how /etc/resolv.conf gets updated? Calling resolvconf adds the entry to a database somewhere, and then updates (if configured, whatever that means) the resolv.conf file.

No.

systemctl restart networking
root@linuxdns1:/etc# cat /etc/resolv.conf 
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 10.0.2.3
search home

Argh! It’s gone again.

So systemctl restart networking does more than just run resolvconf. It must be getting the nameserver information from somewhere else. Where?


3) ifup/ifdown

Digging further into what systemctl restart networking does tells us a couple of things:

cat /lib/systemd/system/networking.service
[...]
[Service]
Type=oneshot
EnvironmentFile=-/etc/default/networking
ExecStartPre=-/bin/sh -c '[ "$CONFIGURE_INTERFACES" != "no" ] && [ -n "$(ifquery --read-environment --list --exclude=lo)" ] && udevadm settle'
ExecStart=/sbin/ifup -a --read-environment
ExecStop=/sbin/ifdown -a --read-environment --exclude=lo
[...]

First, the networking ‘service’ restart is actually a ‘oneshot’ script that runs these commands:

/sbin/ifdown -a --read-environment --exclude=lo
/bin/sh -c '[ "$CONFIGURE_INTERFACES" != "no" ] && [ -n "$(ifquery --read-environment --list --exclude=lo)" ] && udevadm settle'
/sbin/ifup -a --read-environment

The first line with ifdown brings down all the network interfaces (but excludes the local interface). (*)

(*) I’m unclear why this doesn’t boot me out of my
vagrant session in my example code (anyone know?).

The second line makes sure the system has done all it needs to do regarding the bringing of network interfaces down before going ahead and bringing them all back up with ifup in the third line. So the second thing we learn is that ifup and ifdown are what the networking service ‘actually’ runs.

The --read-environment flag is undocumented, and is there so that systemctl can play nice with it. A lot of people hate systemctl for this kind of thing.

Great. So what does ifup (and its twin, ifdown) do? To cut another long story short, it runs all the scripts in /etc/network/if-pre-up.d/ and /etc/network/if-up.d/. These in turn might run other scripts, and so on.

One of the things it does (and I’m still not quite sure how – maybe udev is involved?) is ensure that dhclient gets run.


4) dhclient

dhclient is a program that interacts with DHCP servers to negotiate the details of what IP address the network interface specified should use. It also can receive a DNS nameserver to use, which then gets placed in the /etc/resolv.conf.

Let’s cut to the chase and simulate what it does, but just on the enp0s3 interface on my example VM, having first removed the nameserver from the /etc/resolv.conf file:

$ sed -i '/nameserver.*/d' /run/resolvconf/resolv.conf
$ cat /etc/resolv.conf | grep nameserver
$ dhclient -r enp0s3 && dhclient -v enp0s3
Killed old client process
Internet Systems Consortium DHCP Client 4.3.3
Copyright 2004-2015 Internet Systems Consortium.
All rights reserved.
For info, please visit https://www.isc.org/software/dhcp/
Listening on LPF/enp0s8/08:00:27:1c:85:19
Sending on   LPF/enp0s8/08:00:27:1c:85:19
Sending on   Socket/fallback
DHCPDISCOVER on enp0s8 to 255.255.255.255 port 67 interval 3 (xid=0xf2f2513e)
DHCPREQUEST of 172.28.128.3 on enp0s8 to 255.255.255.255 port 67 (xid=0x3e51f2f2)
DHCPOFFER of 172.28.128.3 from 172.28.128.2
DHCPACK of 172.28.128.3 from 172.28.128.2
bound to 172.28.128.3 -- renewal in 519 seconds.

$ cat /etc/resolv.conf | grep nameserver
nameserver 10.0.2.3

So that’s where the nameserver comes from…

But hang on a sec – what’s that /run/resolvconf/resolv.conf doing there, when it should be /etc/resolv.conf?

Well, it turns out that /etc/resolv.conf isn’t always ‘just’ a file.

On my VM, it’s a symlink to the ‘real’ file stored in /run/resolvconf. This is a clue that the file is constructed at run time, and one of the reasons we’re told not to edit the file directly.
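You can check this easily enough on your own machine:

$ ls -l /etc/resolv.conf
$ readlink -f /etc/resolv.conf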

If the sed command above were to be run on the /etc/resolv.conf file directly then the behaviour above would be different and a warning thrown about /etc/resolv.conf not being a symlink (sed -i doesn’t handle symlinks cleverly – it just creates a fresh file).

dhclient offers the capability to override the DNS server given to you by DHCP if you dig a bit deeper into the supersede setting in /etc/dhcp/dhclient.conf.
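For example, a line like this in /etc/dhcp/dhclient.conf (the nameserver address here is just an example) forces that server to be used regardless of what DHCP hands out:

# force a specific nameserver, ignoring whatever DHCP offers
supersede domain-name-servers 8.8.8.8;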


[Diagram: a (roughly) accurate map of what’s going on]


 

End of Part II

That’s the end of Part II. Believe it or not that was a somewhat simplified version of what goes on, but I tried to keep it to the important and ‘useful to know’ stuff so you wouldn’t fall asleep. Most of that detail is around the twists and turns of the scripts that actually get run.

And we’re still not done yet. Part III will look at even more layers on top of these.

Let’s briefly list some of the things we’ve come across so far:

  • nsswitch
  • /etc/hosts
  • /etc/resolv.conf
  • /run/resolvconf/resolv.conf
  • systemd and its networking service
  • ifup and ifdown
  • dhclient
  • resolvconf

 


 

 

 

 

 

 


Anatomy of a Linux DNS Lookup – Part I

Since I work a lot with clustered VMs, I’ve ended up spending a lot of time trying to figure out how DNS lookups work. I applied ‘fixes’ to my problems from StackOverflow without really understanding why they work (or don’t work) for some time.

Eventually I got fed up with this and decided to figure out how it all hangs together. I couldn’t find a complete guide for this anywhere online, and when I talked to colleagues they didn’t know of any (or really what happens in detail).

So I’m writing the guide myself.

If you’re looking for Part II, click here

Turns out there’s quite a bit in the phrase ‘Linux does a DNS lookup’…


[Image: “How hard can it be?”]


These posts are intended to break down how a program decides how it gets an IP address on a Linux host, and the components that can get involved. Without understanding how these pieces fit together, debugging and fixing problems with (for example) dnsmasq, vagrant landrush, or resolvconf can be utterly bewildering.

It’s also a valuable illustration of how something so simple can get so very complex over time. I’ve looked at over a dozen different technologies and their archaeologies so far while trying to grok what’s going on.

I even wrote some automation code to allow me to experiment in a VM. Contributions/corrections are welcome.

Note that this is not a post on ‘how DNS works’. This is about everything up to the call to the actual DNS server that’s configured on a Linux host (assuming it even calls a DNS server – as you’ll see, it need not), and how it might find out which one to go to, or how it gets the IP some other way.


1) There is no such thing as a ‘DNS Lookup’ call


[Diagram: this is NOT how it works]


The first thing to grasp is that there is no single method of getting a DNS lookup done on Linux. It’s not a core system call with a clean interface.

There is, however, a standard C library call that many programs use: getaddrinfo. But not all applications use this!

Let’s just take two simple standard programs: ping and host:

root@linuxdns1:~# ping -c1 bbc.co.uk | head -1
PING bbc.co.uk (151.101.192.81) 56(84) bytes of data.
root@linuxdns1:~# host bbc.co.uk | head -1
bbc.co.uk has address 151.101.192.81

They both get the same result, so they must be doing the same thing, right?

Wrong.

Here’s the files that ping looks at on my host that are relevant to DNS:

root@linuxdns1:~# strace -e trace=open -f ping -c1 google.com
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
open("/lib/x86_64-linux-gnu/libcap.so.2", O_RDONLY|O_CLOEXEC) = 3
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
open("/etc/resolv.conf", O_RDONLY|O_CLOEXEC) = 4
open("/etc/resolv.conf", O_RDONLY|O_CLOEXEC) = 4
open("/etc/nsswitch.conf", O_RDONLY|O_CLOEXEC) = 4
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 4
open("/lib/x86_64-linux-gnu/libnss_files.so.2", O_RDONLY|O_CLOEXEC) = 4
open("/etc/host.conf", O_RDONLY|O_CLOEXEC) = 4
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 4
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 4
open("/lib/x86_64-linux-gnu/libnss_dns.so.2", O_RDONLY|O_CLOEXEC) = 4
open("/lib/x86_64-linux-gnu/libresolv.so.2", O_RDONLY|O_CLOEXEC) = 4
PING google.com (216.58.204.46) 56(84) bytes of data.
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 4
64 bytes from lhr25s12-in-f14.1e100.net (216.58.204.46): icmp_seq=1 ttl=63 time=13.0 ms
[...]

and the same for host:

$ strace -e trace=open -f host google.com
[...]
[pid  9869] open("/usr/share/locale/en_US.UTF-8/LC_MESSAGES/libdst.cat", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid  9869] open("/usr/share/locale/en/libdst.cat", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid  9869] open("/usr/share/locale/en/LC_MESSAGES/libdst.cat", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid  9869] open("/usr/lib/ssl/openssl.cnf", O_RDONLY) = 6
[pid  9869] open("/usr/lib/x86_64-linux-gnu/openssl-1.0.0/engines/libgost.so", O_RDONLY|O_CLOEXEC) = 6[pid  9869] open("/etc/resolv.conf", O_RDONLY) = 6
google.com has address 216.58.204.46
[...]

You can see that while my ping looks at nsswitch.conf, host does not. And they both look at /etc/resolv.conf.

We’re going to take these two .conf files in turn.


2) NSSwitch, and /etc/nsswitch.conf

We’ve established that applications can do what they like when they decide which DNS server to go to. Many apps (like ping) above can refer (depending on the implementation (*)) to NSSwitch via its config file /etc/nsswitch.conf.

(*) There’s a surprising degree of variation in
ping implementations. That’s a rabbit-hole I
didn’t want to get lost in.

NSSwitch is not just for DNS lookups. It’s also used for passwords and user lookup information (for example).

NSSwitch was originally created as part of the Solaris OS to allow applications to not have to hard-code which file or service they look these things up on, but defer them to this other configurable centralised place they didn’t have to worry about.
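A handy way to poke NSSwitch directly is getent, which looks entries up through whichever sources nsswitch.conf specifies for the given database (output omitted here):

$ getent hosts bbc.co.uk
$ getent passwd root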

Here’s my nsswitch.conf:

passwd:         compat
group:          compat
shadow:         compat
gshadow:        files
hosts: files dns myhostname
networks:       files
protocols:      db files
services:       db files
ethers:         db files
rpc:            db files
netgroup:       nis

The ‘hosts’ line is the one we’re interested in. We’ve shown that ping cares about nsswitch.conf so let’s fiddle with it and see how we can mess with ping.

  • Set nsswitch.conf to only look at ‘files’

If you set the hosts line in nsswitch.conf to be ‘just’ files:

hosts: files

Then a ping to google.com will now fail:

$ ping -c1 google.com
ping: unknown host google.com

but localhost still works:

$ ping -c1 localhost
PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.039 ms

and using host still works fine:

host google.com
google.com has address 216.58.206.110

since, as we saw, it doesn’t care about nsswitch.conf

  • Set nsswitch.conf to only look at ‘dns’

If you set the hosts line in nsswitch.conf to be ‘just’ dns:

hosts: dns

Then a ping to google.com will now succeed again:

$ ping -c1 google.com
PING google.com (216.58.198.174) 56(84) bytes of data.
64 bytes from lhr25s10-in-f174.1e100.net (216.58.198.174): icmp_seq=1 ttl=63 time=8.01 ms

But localhost is not found this time:

$ ping -c1 localhost
ping: unknown host localhost

Here’s a diagram of what’s going on with NSSwitch by default wrt hosts lookup:


[Diagram: my default ‘hosts:’ configuration in nsswitch.conf]

 


3) /etc/resolv.conf

We’ve seen now that host and ping both look at this /etc/resolv.conf file.

Here’s what my /etc/resolv.conf looks like:

# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 10.0.2.3

Ignore the first two lines – we’ll come back to those (they are significant, but you’re not ready for that ball of wool yet).

The nameserver lines specify the DNS servers to look up the host for.

If you hash out that line:

#nameserver 10.0.2.3

and run:

$ ping -c1 google.com
ping: unknown host google.com

it fails, because there’s no nameserver to go to (*).

* Another rabbit hole: host appears to fall back to
127.0.0.1:53 if there’s no nameserver specified.

This file takes other options too. For example, if you add this line to the resolv.conf file:

search com

and then ping google (sic)

$ ping google
PING google.com (216.58.204.14) 56(84) bytes of data.

it will try the .com domain automatically for you.
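resolv.conf supports a few other options worth knowing about (see resolv.conf(5)); the values below are purely illustrative:

# suffixes to try, in order, plus per-query timeout, retry count, and nameserver round-robin
nameserver 10.0.2.3
search home internal.example
options timeout:1 attempts:2 rotate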

End of Part I

That’s the end of Part I. The next part will start by looking at how that resolv.conf gets created and updated.

Here’s what we covered above:

  • There’s no ‘DNS lookup’ call in the OS
  • Different programs figure out the IP of an address in different ways
    • For example, ping uses nsswitch, which in turn uses (or can use) /etc/hosts, /etc/resolv.conf and its own hostname to get the result
  • /etc/resolv.conf helps decide:
    • which addresses get called
    • which DNS server to look up

If you thought that was complicated, buckle up…


 

 

 

 

 

 


A Docker Image in Less Than 1000 Bytes

Here it is (base64-encoded):

H4sICIa2A1sCA2IA7Vrrbts2FFYL7M9+7QUGGNyfDYhtkuJFFLAhWZOhBYJmaLMOWBAEFC+xVlkyJLpYEBjdY+0l+k6jfGvqtkEWp2qD8TMg8vAqnsNzDg9lQhhmEjHDhY4zgWJBBUQJ5ZnCGAubMUQMyhJqoRRMJxYbo7Q2CedYxlQO/myqMroeEEHICIngApspxohEKI4h5DHmGEUQQw7jqAejDjBtnKz9q2w7zubi7gkugazVKHdGuWltQArkWDMCdoCqSpufg/QSPK4aV8pxW+nL96uxzMu39G+NqRe5PeekGj13Oi9BamXRmCtl1dS9X2jqel147C7W+aOJKd8dZ04dlcqsSw7KVyA9Ab/uHT/+cTht6mFRKVkMmywv0yv0mnxbMc8sSP8Apzvg0ViDtJwWxQ54Mpbny5W9qIrp2DSrmt+r+mVenu/ny+UelK6+mFR56VYtjsqfp3mxHupQZqZYdp/NGeo850x99r9j7QloyWEz8kvpK//47vuymvzQ29vf79m8MKnIaIa8bUmwRdByw6TKREIoIzE3xBrjrY7MGDUilomQ3GrNrFaIKqSZ4lkvL3tD12sn/IQCrI10xtcC7C1kH9I+xseQpYilRAwoZ5AI9IcfWFfqpRfzK1M3eeUZDRAfQDGAfc/jHTDKG1fVXiInlzcfctnwLPP9Vszs9VXvUzFy5jlZV5WzTbtN3cWkZWkhL/yS2gXm1p7lumkl24wkpv51FbYcU0EZy7SV0ucEZowkiCjvLbAVikCaGUqhyjT0c0Lj/YrElmmSWANOZ7MooHPwRCiLRaJEzBXKFGTCy49lUHNKjEigVdD6H4uTzPj9wzDCSawU0TQT2ujhjVwjgZzSj/n/eX7D/xPm/T8N/v/Ll/+Lg2fPnxw93eL85xFvyB9Rn4TzXwdAAxiMYLD/t9f/7eM/xDja1P+YBf3vKP7L2+PnttsA/IfjcQiE7nkgdH18Ey4O7pjdH7ygmX0p9n8eFA5aG3pb+0/eP/9jzFmw/13AdTBHK3/OPx7/Ic4X8qecQ9K244QG/98JXh8c/vLwwYM1/TD6KWqpv6LdOb37gT67URKterTpVxu1V9PXq3lW1d8skn++9Y83f4cDeEBAQMBnwliWuTWNu8l33G38/3X3fzGk79wFQ4S4Lwr+vwOcXIJHy4ANkLv4L4APcJ6ZSXUsz+efh1xaSOf3VxstHS6+H/nSu4s6wOns9OugxrdG7WXV5K6qc9NEn0n/ESab+s9o0P+O7v9ce1WzVNI7uAiczYI6BgQEBNwD/AvqV/+XACoAAA==
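If you want to try it out, a rough sketch of loading it (assuming you’ve pasted the base64 into a file called tiny.b64 – docker load accepts the gzipped archive directly):

base64 -d tiny.b64 | docker load
docker images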

How I Got There

A colleague of mine showed me a Docker image he was using to test Kubernetes clusters. It did nothing: it just started up as a pod and sat there until you killed it.

‘Look, it’s only 700kb! Really quick to download!’

This got me wondering what the smallest Docker image I could create was.

I wanted one I could base64 encode and send ‘anywhere’ with a cut and paste.

Since a Docker image is just a tar file, and a tar file is ‘just’ a file, this should be quite possible.

A Tiny Binary

The first thing I needed was a tiny Linux binary that does nothing.

There’s some prior art here, a couple of fantastic and instructive articles on creating small executables, which are well worth reading:

Smallest x86 ELF Hello World

A Whirlwind Tutorial on Creating Really Teensy ELF Executables for Linux

I didn’t want a ‘Hello World’, but a program that just slept and that worked on x86_64.

I started with an example from the first article above:

	SECTION .data
msg:	db "Hi World",10
len:	equ $-msg

	SECTION .text

        global _start
_start:
	mov	edx,len
	mov	ecx,msg
	mov	ebx,1
	mov	eax,4
	int	0x80

	mov	ebx,0
	mov	eax,1
	int	0x80

Running:

nasm -f elf64 hw.asm -o hw.o
ld hw.o -o hw
strip -s hw

Produces a binary of 504 bytes.

But I don’t want a ‘hello world’.

First, I figured I didn’t need the .data or .text sections, nor did I need to load up the data. I figured the top half of the _start section was doing the printing so tried:

 

global _start
_start:
 mov ebx,0
 mov eax,1
 int 0x80

Which compiled at 352 bytes.

But that’s no good, because it just exits. I need it to sleep. So a little further digging and I worked out that the mov eax command loads up the CPU register with the relevant Linux syscall number, and int 0x80 makes the syscall itself. More info on this here.

I found a list of these here. Syscall 1 is ‘exit’, so what I wanted was syscall 29: pause.

This made the program:

global _start
_start:
 mov eax, 29
 int 0x80

Which shaved 8 bytes off to compile at 344 bytes, and creates a binary that just sits there waiting for a signal, which is exactly what I want.
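A quick way to convince yourself the binary really does just pause (the file names here are mine, not from the original build):

nasm -f elf64 pause.asm -o pause.o && ld pause.o -o pause && strip -s pause
./pause &
ps -o pid,stat,comm -p $!    # STAT 'S' means it's sitting in an interruptible sleep
kill $!                      # it exits when signalled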

Hexering

At this point I took out the chainsaw and started hacking away at the binary. To do this I used hexer which is essentially a vim you can use on binary files to edit the hex directly. After a lot of trial and error I got from this:

[Screenshot: hexer’s view of the original binary]

to this:

[Screenshot: hexer’s view of the hacked-down binary]

Which appeared to do the same thing. Notice how the strings are gone, as well as a lot of whitespace. Along the way I referenced this doc, but mostly it was trial and error.

That got me down to 136 bytes.

Sub-100 Bytes?

I wanted to see if I could get any smaller. Reading this suggested I could get down to 45 bytes, but alas, no. That worked for a 32-bit executable, but pulling the same stunts on a 64-bit one didn’t seem to fly at all.

The best I could do was lift a 64-bit version of the program in the above blog and sub in my syscall:

BITS 64
 org 0x400000
 
ehdr: ; Elf64_Ehdr
 db 0x7f, "ELF", 2, 1, 1, 0 ; e_ident
 times 8 db 0
 dw 2 ; e_type
 dw 0x3e ; e_machine
 dd 1 ; e_version
 dq _start ; e_entry
 dq phdr - $$ ; e_phoff
 dq 0 ; e_shoff
 dd 0 ; e_flags
 dw ehdrsize ; e_ehsize
 dw phdrsize ; e_phentsize
 dw 1 ; e_phnum
 dw 0 ; e_shentsize
 dw 0 ; e_shnum
 dw 0 ; e_shstrndx
 ehdrsize equ $ - ehdr
 
phdr: ; Elf64_Phdr
 dd 1 ; p_type
 dd 5 ; p_flags
 dq 0 ; p_offset
 dq $$ ; p_vaddr
 dq $$ ; p_paddr
 dq filesize ; p_filesz
 dq filesize ; p_memsz
 dq 0x1000 ; p_align
 phdrsize equ $ - phdr
 
_start:
 mov eax, 29
 int 0x80
 
filesize equ $ - $$

which gave me a binary of 127 bytes.

I gave up reducing at this point, and am open to suggestions.

A Teensy Docker Image

Now I have my ‘sleep’ executable, I needed to put this in a Docker image.

To try and squeeze every byte possible, I created a binary with a filename one byte long called ‘t‘ and put it in a Dockerfile from scratch, a virtual 0-byte image:

FROM scratch
ADD t /t

Note there’s no CMD, as that increases the size of the Docker image. A command needs to be passed to the docker run command for this to run.
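Since there’s no CMD baked in, you pass the path of the binary when you run it – a quick sketch, assuming the image has been built and tagged t as below:

docker run -d t /t               # /t is the one-byte-named binary; it just pauses
docker ps --filter ancestor=t    # the container should show as running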

Using docker save to create a tar file, and then using maximum compression with gzip I got to a portable Docker image file that was less than 1000 bytes:

$ docker build -t t .
$ docker save t | gzip -9 - | wc -c
976

I tried to reduce the size of the tar file further by fiddling with the Docker manifest file, but my efforts were in vain – due to the nature of the tar file format and the gzip compression algorithm, these attempts actually made the final gzip bigger!

I also tried other compression algorithms, but gzip did best on this small file.

Can You Get This Lower?

Keen to hear from you if you can…

Code

is here.



Autotrace – Debug on Steroids

Autotrace is a tool that allows you to debug a process and:

  • View the output of multiple debug commands at once
  • Record this output for others to review
  • Replay the output

All of this is done in a single window.

You can think of it as a realtime sosreport or just a cool way to learn more about what’s going on when you run something.

Why?

Have you ever done something like this?

$ some_command
^Z
[1]+  Stopped                 some_command
$ ps -ef | grep some_command
imiell   17440 22741  0 10:47 pts/3    00:00:00 grep some_command
$ strace -p 17440 > strace.log 2>&1 &
$ fg
[some_command continues]

That is, you:

  • Ran a process
  • Realised you want to run some tracing on it
  • Suspended it
  • Found the pid
  • Ran your trace command(s), outputting to logfiles
  • Continued the process

Tedious, right?

LMATFY

autotrace can automate that. By default it finds the latest backgrounded pid, attaches to it, runs some default tracing commands on it, records the output, allows you to pause it, replay it elsewhere, and more.

Here’s what it looks like doing something similar to the above scenario:

If you remember you have autotrace before you run the command, you can specify all those commands automatically:
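For example, something along these lines (mirroring the examples further down; PID is substituted with the traced process’s pid):

sudo autotrace 'some_command' 'strace -p PID' 'vmstat 1'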

Pause

You can pause the session and scroll back and forth, then continue tracking. It suspends the processes while paused.

Other Features

Record and Replay

It also automatically records those sessions and can replay them – perfect for debugging or sharing debug information in real time.

Here’s the above example being tarred up and replayed somewhere else:

Zoom In and Out

You can zoom in and out by hitting the session number:

Move Windows in and Out

Got more than four commands you want to track output for? No problem, you can supply as many commands as you like.

This command runs five commands, for example:

autotrace \
  'ping google.com' \
  'iostat 1' \
  "bash -c 'while true; do cat /proc/PID/status; sleep 2; done'" \
  'vmstat 1' \
  'bash -c "while true; do echo I was in a hidden window; sleep 1; done"'

The last is hidden at the start, but by hitting ‘m‘ you can move the panes around (the ‘main’ ping one is protected). Like this:

Examples

These examples work on Linux. To work on Mac, you need to find replacements for strace/pstree/whatever.

What does nmap get up to?

sudo autotrace 'nmap meirionconsulting.com' 'strace -p PID' 'tcpdump -XXs 20000'

What about find?

sudo autotrace 'find /' 'strace -p PID'

A monster debug session on nmap involving lsof, pstree, and a tracking of /proc/interrupts (Linux only):

sudo autotrace \
    'nmap localhost' \
    'strace -p PID' \
    'tcpdump -XXs 20000' \
    'bash -c "while true;do free;sleep 5;done"' \
    'bash -c "while true;do lsof -p PID | tail -5;sleep 5;done"' \
    'bash -c "while true;do pstree -p PID | tail -5;sleep 5;done"' \
    'bash -c "while true;do cat /proc/interrupts; sleep 1;done"'

 

Install

pip install autotrace

 

Code

The code is available here.

ACK

It relies heavily on Thomas Ballinger’s wonderful ‘Terminal Whispering’ talk and his curtsies Python library.