Bug #35947

mon_status doesn't populate outside_quorum when some mons are down

Added by Stefan Kooman over 5 years ago. Updated over 5 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I noticed that the "mon_outside_quorum" metric always returns "0", regardless of whether any mons are outside quorum:

ceph_cluster_stats,fsid=fsid-here,type_instance=mon_outside_quorum value=0i 1525173805588164096

I asked Wido den Hollander about it. He debugged the Ceph "telegraf" mgr module and told me that he receives an empty list. This indicates a bug in the mgr code itself and/or the monitor code.

History

#1 Updated by John Spray over 5 years ago

The structure in question is the mon_status output, so it would be useful if you could look at the output of the mon_status command on a mon.

#2 Updated by Stefan Kooman over 5 years ago

ceph mon_status -f json | jq '.outside_quorum'
[]

^^ HEALTH_OK

ceph mon_status -f json | jq '.outside_quorum'
[]

^^ [WRN] overall HEALTH_WARN 1/3 mons down

So, the problem is that "outside_quorum" never gets set.

This is on a Ceph 12.2.8 (test) setup.

@John: your turn ;-)

#3 Updated by John Spray over 5 years ago

  • Project changed from mgr to RADOS
  • Subject changed from mon_outside_quorum always returns "0", even when mons are outside quorum to mon_status doesn't populate outside_quorum when some mons are down
  • Category deleted (ceph-mgr)

#4 Updated by Joao Eduardo Luis over 5 years ago

`outside_quorum` does not pertain to down monitors. We may change that if people think it's more obvious, but the main purpose of this structure is to help understand which monitors have been responding to probes for election purposes. After a successful election, `outside_quorum` is cleared.

#5 Updated by Stefan Kooman over 5 years ago

Let me see if I get this right.

"After a successful election, `outside_quorum` is cleared."

^^ Do I understand correctly that "outside_quorum" gets cleared, even if one or more monitors are down, once the remaining monitors have held a successful election?

If that's true, then yes, at least to me that is not obvious. I use that metric in a Ceph dashboard to indicate a mon outside quorum. If there were another metric, e.g. mon_down, that would also be OK.

#6 Updated by Joao Eduardo Luis over 5 years ago

Yes, `outside_quorum` is solely used to track which monitors are outside of the quorum during an election; once the election is over, that information is no longer relevant (for the purpose for which it was originally tracked).

We also have no other means of tracking which monitors are down: if they are not in quorum, they are presumed "down" (although we really have no idea, at the moment, whether they are dead, or unreachable, or not participating in the election).

A follow-up conversation on IRC has shown that exposing more information, or (at least) changing the semantics of what is exposed in `outside_quorum`, would be useful.
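
As a rough illustration of the "not in quorum means presumed down" rule described above, a client could derive the missing monitors itself by comparing the monmap against the quorum. This is only a sketch, not the mgr or monitor code: the helper name `mons_not_in_quorum` is hypothetical, the sample document is a hand-trimmed stand-in for real `ceph mon_status -f json` output, and it assumes that output carries a `quorum` list of ranks plus `monmap.mons` entries with `rank` and `name`.

```python
import json


def mons_not_in_quorum(mon_status):
    """Return names of monitors that are in the monmap but not in the quorum.

    Hypothetical helper: assumes 'quorum' is a list of monitor ranks and
    'monmap.mons' is a list of dicts with 'rank' and 'name' keys.
    """
    quorum_ranks = set(mon_status.get("quorum", []))
    mons = mon_status.get("monmap", {}).get("mons", [])
    return sorted(m["name"] for m in mons if m["rank"] not in quorum_ranks)


# Hand-written, trimmed sample (not real cluster output): three mons in the
# monmap, mon "c" is down, and 'outside_quorum' is empty after the election.
sample = json.loads("""
{
  "quorum": [0, 1],
  "outside_quorum": [],
  "monmap": {
    "mons": [
      {"rank": 0, "name": "a"},
      {"rank": 1, "name": "b"},
      {"rank": 2, "name": "c"}
    ]
  }
}
""")

print(mons_not_in_quorum(sample))  # prints ['c']
```

Note that this says nothing about why a monitor is absent; as mentioned above, it may be dead, unreachable, or simply not participating in the election.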

#7 Updated by Sage Weil over 5 years ago

We could add a new field for monitors that are... not part of the quorum, but I'm not sure what I'd call it if not "outside_quorum".

#8 Updated by Stefan Kooman over 5 years ago

@sage: indeed :-).

Maybe rename the original use for "outside_quorum" to "outside_election" or something similar to indicate mons that do not play well with the rest of the monitors? And make "outside_quorum" a list of mon(s) that are not part of the quorum (but according to cluster config should be in there).
