Ceph Hostname Sensitivity
We recently expanded our Ceph cluster at work with six new RHEL 8 storage servers. All the Ceph services run in containers managed by cephadm. I ran into an interesting problem the first time I had to patch and reboot those servers after we had integrated them into the cluster.
The machines I was patching have two 25G interfaces configured in an 802.3ad bond. The bonded interface carries traffic for two VLANs: one is customer-facing, the other is internal to the cluster.
We used Red Hat’s kickstart mechanism to install the operating system on these machines, but because our network configuration is fairly particular, we didn’t configure the networking in the kickstart file. Instead, we waited until the machines had rebooted after installation and then ran a script that creates the bond and configures the interfaces.
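The script itself isn’t reproduced here, but as a rough sketch of the idea under NetworkManager on RHEL 8 it might look something like the following; the interface names (eno1, eno2), VLAN IDs (100, 200), and addresses are invented for illustration.

```
# Create the 802.3ad bond and enslave the two 25G interfaces
nmcli con add type bond con-name bond0 ifname bond0 \
    bond.options "mode=802.3ad,miimon=100"
nmcli con add type ethernet con-name bond0-port1 ifname eno1 master bond0
nmcli con add type ethernet con-name bond0-port2 ifname eno2 master bond0

# One VLAN for customer-facing traffic, one for cluster-internal traffic
nmcli con add type vlan con-name bond0.100 dev bond0 id 100 \
    ipv4.method manual ipv4.addresses 192.0.2.11/24
nmcli con add type vlan con-name bond0.200 dev bond0 id 200 \
    ipv4.method manual ipv4.addresses 198.51.100.11/24

nmcli con up bond0
```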
Since we specified no hostname in the kickstart file, the installer defaulted to Red Hat’s standard “localhost.localdomain” hostname. Once the machine has booted and has an IPv4 address, the transient hostname gets set based on the reverse DNS record for that address.
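hostnamectl makes the distinction visible: the static hostname is whatever was pinned at install time, while the transient hostname is what the system picks up at runtime. The output below is illustrative rather than captured from the real machines.

```
hostnamectl
#    Static hostname: localhost.localdomain    <- never set permanently
# Transient hostname: storage-01.example.com   <- filled in later from reverse DNS
```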
I had six of these servers to patch, so I started with the machine I’ll call storage-01 here. When it rebooted after patching, there was a delay, which I suspect was related to bringing up the bonded interfaces, before the hostname changed from localhost to storage-01.
The problem was that all of the Ceph object-storage daemons (OSDs) on the machine came up under the “localhost” name, presumably because they started before the hostname changed, and that completely threw the Ceph cluster for a loop. Ceph began remapping the data on those OSDs because “localhost” wasn’t a host it knew about in the CRUSH map, so it had no idea which datacenter, room, or rack housed “localhost.”
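For illustration (this is mocked-up output, not a capture from our cluster), the symptom looks roughly like this in ceph osd tree: the OSDs re-register under a brand-new “localhost” host bucket hanging directly off the default root, while the host bucket under the proper rack empties out, and CRUSH recalculates placement accordingly.

```
ceph osd tree
# ID   CLASS  WEIGHT   TYPE NAME                 STATUS
#  -1         654.55   root default
# -31          87.27       host localhost        <- OSDs re-registered here
#  -9         327.27       rack r01
# -10           0              host storage-01   <- emptied out in the CRUSH map
# ...
```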
I permanently set the hostname on storage-01 and restarted the Ceph services, which calmed things down pretty quickly. I then set the hostnames on the other servers in that group to avoid a similar problem down the line. So things are fine, but there was some excitement for a while.
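Roughly, the fix on each host amounts to pinning the hostname and restarting the cephadm-managed units; the <fsid> and <id> below are placeholders for the cluster fsid and OSD id.

```
# Pin the hostname so it is correct before any containers start on future boots
hostnamectl set-hostname storage-01

# Restart a single cephadm-managed OSD...
systemctl restart ceph-<fsid>@osd.<id>.service
# ...or every Ceph daemon on the host at once
systemctl restart ceph-<fsid>.target
```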