Bug #37871

open

Ceph cannot connect to any monitors if one of them has a DNS resolution problem

Added by Jairo Llopis over 5 years ago. Updated over 5 years ago.

Status: New
Priority: Normal
Assignee: -
Category: Administration/Usability
Target version:
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 4 - irritation
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): MonClient
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

My ceph cluster is configured with this:

mon host = mon1,mon2,mon3

If I remove the DNS entry for mon2 and then request the cluster status from mon1, the command fails with an error:

$ ceph -s
server name not found: mon2 (Name or service not known)
unable to parse addrs in 'mon1,mon2,mon3'
2019-01-11 11:31:10.269 7f9ec32dc700 -1 monclient: get_monmap_and_config cannot identify monitors to contact
[errno 22] error connecting to the cluster

According to the docs:

the mon host configuration option only needs to be sufficiently up to date such that a client can reach one monitor that is currently online.

The above configuration matches that requirement, since both mon1 and mon3 can still be resolved.

In addition, if I replace that config line with the actual IP addresses and check again, the client properly connects to a monitor and returns a status:

mon host = 172.21.0.3,172.21.0.5,172.21.0.7
$ ceph -s
  cluster:
    id:     7060741a-8aad-5f55-b64e-c3f527e322f8
    health: HEALTH_WARN
            1/3 mons down, quorum mon1,mon3

  services:
    mon: 3 daemons, quorum mon1,mon3, out of quorum: mon2
    mgr: mon3(active), standbys: mon1
    osd: 4 osds: 0 up, 0 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0  objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     

So, of course, as a workaround, I'm going to write the raw IP address list there. But I still think this is a bug: the client should fail to contact the cluster only when all DNS entries fail to resolve, not when just one does, the same way it fails only when all IP addresses are unreachable, not when just one is. That's the whole point of resilience, isn't it?
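
To make the expected behavior concrete, here is a minimal Python sketch of a tolerant resolution step. It is not Ceph's MonClient code (which is C++), and the names resolve_mon_hosts, resolver and fake_resolver are hypothetical, introduced only for illustration. The idea: resolve each entry of the mon host list independently, skip the ones that fail, and raise an error only when none of them resolve. The hostname-to-IP pairing in the fake resolver is an assumption.

```python
import socket
from typing import Callable, List


def resolve_mon_hosts(
    mon_host: str,
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> List[str]:
    """Resolve a comma-separated 'mon host' value, tolerating partial failures.

    Hypothetical helper: returns the addresses of every name that resolved,
    and raises only if no name could be resolved at all.
    """
    resolved = []
    failed = []
    for name in (part.strip() for part in mon_host.split(",")):
        if not name:
            continue
        try:
            resolved.append(resolver(name))
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            failed.append(f"{name} ({exc})")
    if not resolved:
        raise RuntimeError("no monitor name could be resolved: " + "; ".join(failed))
    return resolved


def fake_resolver(name: str) -> str:
    """Stand-in for DNS where mon2 has no entry (assumed name-to-IP mapping)."""
    table = {"mon1": "172.21.0.3", "mon3": "172.21.0.7"}
    if name not in table:
        raise socket.gaierror("Name or service not known")
    return table[name]


if __name__ == "__main__":
    # mon2 fails to resolve, but the two remaining monitors are still returned.
    print(resolve_mon_hosts("mon1,mon2,mon3", resolver=fake_resolver))
```

With this approach a single missing DNS entry would degrade the candidate list instead of aborting the whole connection attempt, which mirrors how the client already behaves when one of several IP addresses is unreachable.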
