Failover with ISC DHCP
Paul Heinlein
First published on November 6, 2005
Last updated on April 11, 2008
Small- and medium-sized networks often have a single DHCP server, which can become a single point of failure for a large number of hosts on the network. When the DHCP server goes off-line, DHCP client hosts lose their addresses and ability to communicate with the rest of the network. Since most desktop computers, and even some servers, get their networking configuration via DHCP, such an outage can result in a lot of downtime.
If the network has a Unix infrastructure, there’s a good chance that it’s using the Internet Systems Consortium (ISC) DHCP server, which is widely available on Linux and BSD systems.
Starting with version 3.0, the ISC DHCP server offered failover capabilities that allow network administrators to offer a more robust DHCP service. A failover setup requires a little care, but it’s fairly straightforward to implement.
A simple starting point
Before getting to the failover setup, let’s establish a simple baseline DHCP configuration with no frills.
#
# /etc/dhcpd.conf for simple network
#
authoritative;
ddns-update-style none;
subnet 192.168.200.0 netmask 255.255.255.0 {
option subnet-mask 255.255.255.0;
option broadcast-address 192.168.200.255;
option routers 192.168.200.1;
option domain-name-servers 192.168.200.1;
pool {
max-lease-time 1800; # 30 minutes
range 192.168.200.100 192.168.200.254;
}
}
With this configuration, our server will act as the authoritative DHCP server on the 192.168.200.0 subnet, handing out addresses from 192.168.200.100 to 192.168.200.254 to any host that asks for one.
The problem
Our configuration will work fine until the DHCP server goes off-line. The cause of its demise might be a hardware failure, a power outage, or even an OS upgrade; it doesn’t matter. Once it’s gone, all DHCP client hosts will lose their network configurations within 30 minutes (our maximum lease time).
We could just bring another DHCP server online in its place, but the information about leases will be lost, possibly forcing clients to acquire new addresses. In that situation, clients would have to break any existing network connections. In some cases, local X sessions would also break. (If you’re bored sometime, try changing the hostname of your machine when running a live X desktop. The recovery process can be amusing.)
Alternatively, we could plan for a downtime by increasing lease times from 30 minutes to the better part of a day. That would reduce—but not completely remove—the risk of any given client having its lease expire while the server is off-line, but any newly arriving client won’t get an address.
Configuring the primary
Once you identify the machine that will act as the secondary DHCP server
(or the new primary, if you’re going to demote the old server), you’ll
want to make sure the clocks on the two machines are in sync. Timestamps
are very important to dhcpd
! After that, it’s time to configure it.
Using the simple configuration above, we’ll add the bits necessary to
upgrade it to serve as the primary in a failover situation:
-
The
failover peer
section that identifies the primary and secondary servers; in the example below, it’s called “dhcp-failover,” but it can be any string that works for you. The example identifies the two DHCP servers by address, but you can use DNS names as well. In the past couple years, TCP ports 647 (primary) and 847 (peer) have emerged as the standard bindings for DHCP failover. It’s worth noting that as recently as 2005, the dhcpd.conf(5) man page used ports 519 and 520 in its failover example, but 647 and 847 look like good choices as of 2008.The dhcpd.conf(5) man page says that the primary
port
and thepeer port
may be the same number. That’s the configuration I deploy, using the port 647 for both the primary and the peer. -
The
pool
sections for which the failover pair is active; in our example, we have only onepool
section, so we add a reference to our failover peer set.
#
# /etc/dhcpd.conf for primary DHCP server
#
authoritative;
ddns-update-style none;
failover peer "dhcp-failover" {
primary; # declare this to be the primary server
address 192.168.200.2;
port 647;
peer address 192.168.200.3;
peer port 647;
max-response-delay 30;
max-unacked-updates 10;
load balance max seconds 3;
mclt 1800;
split 128;
}
subnet 192.168.200.0 netmask 255.255.255.0 {
option subnet-mask 255.255.255.0;
option broadcast-address 192.168.200.255;
option routers 192.168.200.1;
option domain-name-servers 192.168.200.1;
pool {
failover peer "dhcp-failover";
max-lease-time 1800; # 30 minutes
range 192.168.200.100 192.168.200.254;
}
}
Configuring the secondary
For our simple network, the configuration for the secondary is quite
similar to that of the primary. The only significant differences are in
the failover peer
definition: there’s a secondary
declaration, the
mclt
and split
declarations are omitted, and the local and peer
addresses are switched.
#
# /etc/dhcpd.conf for secondary DHCP server
#
authoritative;
ddns-update-style none;
failover peer "dhcp-failover" {
secondary; # declare this to be the secondary server
address 192.168.200.3;
port 647;
peer address 192.168.200.2;
peer port 647;
max-response-delay 30;
max-unacked-updates 10;
load balance max seconds 3;
}
subnet 192.168.200.0 netmask 255.255.255.0 {
option subnet-mask 255.255.255.0;
option broadcast-address 192.168.200.255;
option routers 192.168.200.1;
option domain-name-servers 192.168.200.1;
pool {
failover peer "dhcp-failover";
max-lease-time 1800; # 30 minutes
range 192.168.200.100 192.168.200.254;
}
}
The folks at ISC note that the DHCP failover protocol is still under development, which makes it sort of a moving target. As a result, they strongly suggest that the primary and secondary servers both be running the same version of dhcpd.
SELinux notes
As noted, running dhcpd in failover mode involves opening a TCP port for communication with the peer server. The SELinux policy distributed with CentOS 4 and 5 allows dhcpd to send packets over ports 647 and 847, but you’ll need to tweak the policy if you want to use different ports.
The instructions below apply specifically to CentOS 4 (and, by extension, to Red Hat Enterprise Linux 4), though I suspect that they would also work on Fedora Core 3 and 4.
Install the selinux-policy-targeted-sources rpm, if it’s not already on your system.
Open a command prompt in the /etc/selinux/targeted/src/policy
directory.
Create or edit domains/misc/local.te
using your editor of choice. Add
a single line:
allow dhcpd_t port_t:tcp_socket name_bind;
Run make reload
to install your modified policy.
What the logs will show
Once both servers are configured and working, the system logs will show when one goes offline. Here’s what shows up when the primary goes down:
Nov 6 19:50:51 secondary dhcpd: failover peer dhcp-failover: I move
from normal to communications-interrupted
When the primary comes back, the log will say (among other things)
Nov 6 19:51:37 secondary dhcpd: failover peer dhcp-failover: I move
from communications-interrupted to normal
The other main difference in the logs will be the presence of pool
reports. In failover mode, dhcpd will try to ensure that the primary and
secondary servers each have a similar number of free dynamic leases for
each pool
declared in the configuration file. As the servers work to
keep that balance, they’ll occasionally log their status.
Nov 6 20:27:09 secondary dhcpd: pool 98e82b8 192.168.200.0/24
total 155 free 38 backup 37 lts 0
In this case, 75 of the 155 of the addresses we declared eligible for
dynamic assignment are still available. The primary holds 38 in reserve,
the secondary 37. As long as the values for free
and backup
differ
by no more than one, things are good. Should they vary by two or more
(with a resulting non-zero lts
), the pool addresses will be juggled
until balance is restored.
Now, the single point of failure is gone. So go hog wild: install those security patches on your DHCP server that you’d put off because you didn’t want to lose leases!