Bug #21311

closed

ceph perf dump should report standby MDSes

Added by David Galloway over 6 years ago. Updated over 6 years ago.

Status:
Rejected
Priority:
High
Category:
Administration/Usability
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
luminous
Regression:
No
Severity:
3 - minor
Component(FS):
MDS

Description

This was discovered when observing the cephmetrics dashboard monitoring the Sepia cluster.

        "num_mds_up": 1,
        "num_mds_in": 1,
        "num_mds_failed": 0,

But `ceph status` shows: mds: cephfs-1/1/1 up {0=mira049=up:active}, 2 up:standby

I think it'd be beneficial to report standbys in perf output.

Actions #1

Updated by Patrick Donnelly over 6 years ago

  • Project changed from Ceph to CephFS
  • Category set to Administration/Usability
  • Assignee set to Douglas Fuller
  • Priority changed from Normal to High
  • Source set to Development
  • Backport set to luminous
  • Component(FS) MDS added

Doug, please take this one.

Actions #2

Updated by John Spray over 6 years ago

This is a collectd thing, which isn't to say that we shouldn't care, but... I'm not sure bugs against collectd really should be filed against cephfs?

Actions #3

Updated by David Galloway over 6 years ago

John Spray wrote:

This is a collectd thing, which isn't to say that we shouldn't care, but... I'm not sure bugs against collectd really should be filed against cephfs?

collectd uses perf dump to gather data. My understanding is that Ceph has no metric indicating the number of standby MDSes. Is that incorrect, and if so, what should we be running to collect that metric?

Actions #4

Updated by John Spray over 6 years ago

So on closer inspection I see that, as you say, the existing metrics do indeed come from perf counters, but it doesn't follow that we should add perf counters that duplicate what's already available from the MDS map.

To put it another way: Ceph already exposes this information (in `ceph fs dump`), just not as a perf counter. collectd is capable of looking at things other than perf counters -- it already has calls for e.g. `df`, `osd pool stats` (looking at https://github.com/ceph/cephmetrics/blob/master/collectors/mon.py).
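As a minimal sketch of that alternative, a cephmetrics-style collector could take the standby count from the `ceph fs dump` JSON rather than a perf counter. This assumes the FSMap dump carries a top-level `standbys` array, as Luminous-era releases do; the sample payload and daemon names here are illustrative, not taken from the Sepia cluster.

```python
import json

def count_standby_mds(fs_dump_json):
    """Count standby MDS daemons from `ceph fs dump --format json` output.

    Assumes the dump has a top-level "standbys" array; field names may
    vary across Ceph versions, so treat this as a sketch.
    """
    fsmap = json.loads(fs_dump_json)
    return len(fsmap.get("standbys", []))

# In a real collector the JSON would come from the mon, e.g.:
#   subprocess.check_output(["ceph", "fs", "dump", "--format", "json"])
# Here we use a trimmed, hypothetical payload instead:
sample = json.dumps({
    "epoch": 100,
    "standbys": [{"name": "mds-a"}, {"name": "mds-b"}],
    "filesystems": [{"mdsmap": {"up": {"mds_0": 4107}}}],
})

print(count_standby_mds(sample))
```

Because the standby list is cluster-wide state, not per-daemon state, polling the mon for the FSMap fits John's point below about keeping perf counters strictly per-daemon.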

I'd actually be inclined to rip out some of those perf counters: they are summing over all filesystems, so not terribly informative. We should generally only be using perf counters for things that are really per-daemon, rather than squashing cluster information into them.

Actions #5

Updated by Douglas Fuller over 6 years ago

  • Assignee changed from Douglas Fuller to John Spray

John, if you have strong opinions about ripping out perf counters, I'll send this one over to you. Feel free to send it back if you'd rather I look over them.

Actions #6

Updated by John Spray over 6 years ago

  • Status changed from New to Rejected

OK, so I'm going to take the opinionated position that this is a WONTFIX as we have an existing interface that provides the information, and I've opened a PR (https://github.com/ceph/ceph/pull/17681) to remove the other perf counters in Mimic and beyond to avoid confusion.

Tickets can be re-opened as well as closed, so this does not preclude further discussion.
