Ceph cannot connect to any monitors if one of them has a DNS resolution problem
My Ceph cluster is configured with:
mon host = mon1,mon2,mon3
If I remove the DNS entry for mon2 and then request the cluster status from mon1, the command fails with an error:
$ ceph -s
server name not found: mon2 (Name or service not known)
unable to parse addrs in 'mon1,mon2,mon3'
2019-01-11 11:31:10.269 7f9ec32dc700 -1 monclient: get_monmap_and_config cannot identify monitors to contact
[errno 22] error connecting to the cluster
According to the docs:
the mon host configuration option only needs to be sufficiently up to date such that a client can reach one monitor that is currently online.
The above configuration matches that requirement, since both mon1 and mon3 can still be resolved.
An additional detail: if I replace that config line with the actual IP addresses and check again, it connects to a monitor and returns a status:
mon host = 172.21.0.3,172.21.0.5,172.21.0.7
$ ceph -s
  cluster:
    id:     7060741a-8aad-5f55-b64e-c3f527e322f8
    health: HEALTH_WARN
            1/3 mons down, quorum mon1,mon3

  services:
    mon: 3 daemons, quorum mon1,mon3, out of quorum: mon2
    mgr: mon3(active), standbys: mon1
    osd: 4 osds: 0 up, 0 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:
So, as a workaround, I'm going to put the raw IP addresses there. But I still think this is a bug: connecting to the cluster should fail only when all of the DNS entries fail to resolve, not when just one does, the same way it fails only when none of the IP addresses can be contacted, not when a single one is down. That's the whole point of resilience, isn't it?
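To make the expectation concrete, here is a minimal Python sketch of the behaviour I would expect. It is not Ceph's actual code, just an illustration: resolve each name in the list, skip the ones that fail, and give up only if none resolve. The function names `resolve_mon_hosts` and `fake_resolver` are made up for this example, and the name-to-IP mapping in `fake_resolver` is an assumption based on the order of the two lists above.

```python
import socket


def resolve_mon_hosts(names, resolver=socket.gethostbyname):
    """Resolve each monitor name, skipping the ones that fail.

    Returns (addresses, failures). Raises only if no name resolves at all.
    """
    addresses, failures = [], []
    for name in names:
        try:
            addresses.append(resolver(name))
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            failures.append((name, str(exc)))
    if not addresses:
        raise RuntimeError(
            "none of the monitor names could be resolved: " + ", ".join(names)
        )
    return addresses, failures


def fake_resolver(name):
    """Hypothetical stand-in for DNS in which mon2 has no entry."""
    # Assumed mapping; the real one may differ.
    table = {"mon1": "172.21.0.3", "mon3": "172.21.0.7"}
    if name not in table:
        raise socket.gaierror(socket.EAI_NONAME, "Name or service not known")
    return table[name]


addresses, failures = resolve_mon_hosts(
    ["mon1", "mon2", "mon3"], resolver=fake_resolver
)
print("mon host = " + ",".join(addresses))
for name, reason in failures:
    print("skipped %s: %s" % (name, reason))
```

Without the `resolver` argument the function uses the system resolver, so the same sketch could also generate the IP-based `mon host` line for the workaround while the DNS entries are still in place.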