Bug #21311
closed
ceph perf dump should report standby MDSes
Added by David Galloway over 6 years ago.
Updated over 6 years ago.
Category: Administration/Usability
Description
This was discovered while watching the cephmetrics dashboard that monitors the Sepia cluster. The perf dump output reports:
"num_mds_up": 1,
"num_mds_in": 1,
"num_mds_failed": 0,
But the cluster status shows two standbys:
mds: cephfs-1/1/1 up {0=mira049=up:active}, 2 up:standby
I think it'd be beneficial to report standbys in perf output.
- Project changed from Ceph to CephFS
- Category set to Administration/Usability
- Assignee set to Douglas Fuller
- Priority changed from Normal to High
- Source set to Development
- Backport set to luminous
- Component(FS) MDS added
Doug, please take this one.
This is a collectd thing, which isn't to say that we shouldn't care, but... I'm not sure bugs against collectd really should be filed against cephfs?
John Spray wrote:
This is a collectd thing, which isn't to say that we shouldn't care, but... I'm not sure bugs against collectd really should be filed against cephfs?
collectd uses perf dump to gather data. My understanding is that Ceph has no metric indicating the number of standby MDSes. Is that incorrect, and if so, what should we be running to collect that metric?
So on closer inspection I see that as you say, for the existing stuff it is indeed using perf counters, but it doesn't follow that we should add perf counters that duplicate what's already available from the MDS map.
To put it another way: Ceph already exposes this information (in `ceph fs dump`), just not as a perf counter. collectd is capable of looking at things other than perf counters -- it already has calls for e.g. `df` and `osd pool stats` (see https://github.com/ceph/cephmetrics/blob/master/collectors/mon.py).
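For anyone wanting to pull the standby count from a collector, a minimal sketch might look like the following. It assumes `ceph fs dump --format json` returns an object with a top-level `standbys` array holding one entry per standby daemon; the function names are hypothetical and are not part of cephmetrics or collectd.

```python
import json
import subprocess


def count_standbys(fs_dump: dict) -> int:
    """Count standby MDS daemons in a parsed `ceph fs dump` document.

    Assumption: standbys are listed in a top-level "standbys" array.
    """
    return len(fs_dump.get("standbys", []))


def fetch_fs_dump() -> dict:
    """Run `ceph fs dump` and parse its JSON output (needs a reachable cluster)."""
    out = subprocess.check_output(["ceph", "fs", "dump", "--format", "json"])
    return json.loads(out)


# Hand-written sample shaped like the assumed output, not real cluster data.
sample = {"standbys": [{"name": "a"}, {"name": "b"}], "filesystems": []}
print(count_standbys(sample))  # prints 2
```

On a live cluster the collector would call `count_standbys(fetch_fs_dump())` instead of using the sample.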
I'd actually be inclined to rip out some of those perf counters: they are summing over all filesystems, so not terribly informative. We should generally only be using perf counters for things that are really per-daemon, rather than squashing cluster information into them.
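To illustrate why a per-filesystem view is more informative than a cluster-wide sum, here is a rough sketch that breaks the MDS counts out by filesystem from a parsed `ceph fs dump --format json` document. It assumes each entry under `filesystems` carries an `mdsmap` object with `fs_name`, `up`, `in` and `failed` fields; the helper name `per_fs_mds_counts` is hypothetical.

```python
def per_fs_mds_counts(fs_dump: dict) -> dict:
    """Return {fs_name: {"up": n, "in": n, "failed": n}} for each filesystem.

    Assumption: each filesystem entry has an "mdsmap" with "up" (a mapping),
    "in" and "failed" (lists), and "fs_name" (a string).
    """
    counts = {}
    for fs in fs_dump.get("filesystems", []):
        mdsmap = fs.get("mdsmap", {})
        # Fall back to the filesystem id if no name is present.
        name = mdsmap.get("fs_name", str(fs.get("id")))
        counts[name] = {
            "up": len(mdsmap.get("up", {})),
            "in": len(mdsmap.get("in", [])),
            "failed": len(mdsmap.get("failed", [])),
        }
    return counts


# Hand-written sample with two filesystems, not real cluster data.
sample_dump = {
    "filesystems": [
        {"id": 1, "mdsmap": {"fs_name": "cephfs", "up": {"mds_0": 4100},
                             "in": [0], "failed": []}},
        {"id": 2, "mdsmap": {"fs_name": "scratch", "up": {},
                             "in": [0], "failed": [0]}},
    ]
}
for fs_name, c in per_fs_mds_counts(sample_dump).items():
    print(fs_name, c)
```

A summed counter would report one MDS up and one failed for this sample without saying which filesystem is degraded.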
- Assignee changed from Douglas Fuller to John Spray
John, if you have strong opinions about ripping out perf counters, I'll send this one over to you. Feel free to send it back if you'd rather I look over them.
- Status changed from New to Rejected
OK, so I'm going to take the opinionated position that this is a WONTFIX as we have an existing interface that provides the information, and I've opened a PR (https://github.com/ceph/ceph/pull/17681) to remove the other perf counters in Mimic and beyond to avoid confusion.
Tickets can be re-opened as well as closed, so this does not preclude further discussion.