Bug #22511

Dashboard showing stale health data

Added by Dan van der Ster about 5 years ago. Updated about 5 years ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Category:
ceph-mgr
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

In 12.2.2 with a HEALTH_WARN cluster, the dashboard is showing stale health data.

The dashboard shows:

Overall status: HEALTH_WARN
OBJECT_MISPLACED: 395167/541150152 objects misplaced (0.073%)
PG_DEGRADED: Degraded data redundancy: 198/541150152 objects degraded (0.000%), 56 pgs unclean

But ceph status shows:

# ceph status
  cluster:
    id:     eecca9ab-161c-474c-9521-0e5118612dbb
    health: HEALTH_WARN
            1281/541046538 objects misplaced (0.000%)
            Degraded data redundancy: 1 pg unclean
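To make the gap between the two views concrete, here is a small Python sketch that parses the "objects misplaced" line from each output quoted above and recomputes the percentage. The helper name `misplaced_ratio` is made up for illustration; it is not part of Ceph or the dashboard.

```python
import re


def misplaced_ratio(line):
    """Parse 'N/M objects misplaced' and return (N, M, percent).

    Hypothetical helper for illustration only; not a Ceph API.
    """
    match = re.search(r"(\d+)/(\d+) objects misplaced", line)
    if match is None:
        raise ValueError("no 'objects misplaced' counter in: %r" % line)
    misplaced, total = int(match.group(1)), int(match.group(2))
    return misplaced, total, 100.0 * misplaced / total


# Lines copied from the report above.
dashboard_line = "OBJECT_MISPLACED: 395167/541150152 objects misplaced (0.073%)"
status_line = "1281/541046538 objects misplaced (0.000%)"

dash_n, dash_total, dash_pct = misplaced_ratio(dashboard_line)
cli_n, cli_total, cli_pct = misplaced_ratio(status_line)

print("dashboard:   %d/%d = %.3f%%" % (dash_n, dash_total, dash_pct))
print("ceph status: %d/%d = %.3f%%" % (cli_n, cli_total, cli_pct))
```

Both percentages match what each tool printed, so each view is internally consistent. The dashboard is simply reporting an older, much larger misplaced count than ceph status.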

Related issues

Duplicates RADOS - Bug #22142: mon doesn't send health status after paxos service is inactive temporarily Resolved 11/16/2017

History

#1 Updated by John Spray about 5 years ago

Hmm, I've seen a couple of things vaguely similar to this. Can you run "ceph tell mgr.<id> config set debug_mgr 20" and gather the log?

It usually seems to get back up to date the next time a mgr restarts, but let's gather some evidence if we can.

#2 Updated by John Spray about 5 years ago

  • Category set to ceph-mgr

#3 Updated by Dan van der Ster about 5 years ago

Sure, see ceph-post-file: 217cba9a-5ae9-42b4-8e7a-76ba016397e0

At this moment, the dashboard displays:

Health
Overall status: HEALTH_WARN
OBJECT_MISPLACED: 395167/541150152 objects misplaced (0.073%)
PG_DEGRADED: Degraded data redundancy: 198/541150152 objects degraded (0.000%), 56 pgs unclean

#4 Updated by John Spray about 5 years ago

Hmm, so the mon is showing you the same health status that the mgr is sending in DaemonServer::send_report, which is presumably the correct and up-to-date one.

There are also no handle_mgr_digest messages in the log, so something is going wrong with the transmission of the MMgrDigest (which contains the full health structure) from the mon to the mgr.

The mgr side is using the standard MonClient bits to subscribe, so my hunch would be that something is wrong in MgrMonitor. I'm a bit suspicious of the part in ::send_digests where it drops out if is_active()==false (from https://github.com/ceph/ceph/pull/15109).

I wonder if this is an edge case where the MonClient has a valid subscription to one of the peon monitors but not to the leader?
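As a rough illustration of how an early return on an inactive check could leave a subscriber stale, here is a toy Python model. Everything in it is an assumption made for the sketch: the class `ToyDigestSender`, the `rearm_when_inactive` flag, and the idea that the periodic timer is not re-armed on the early-return path. It is not MgrMonitor code and has not been checked against the Ceph source.

```python
class ToyDigestSender:
    """Hypothetical stand-in for a periodic digest sender; not Ceph code."""

    def __init__(self, rearm_when_inactive):
        self.rearm_when_inactive = rearm_when_inactive
        self.active = True        # stands in for is_active()
        self.timer_armed = True   # whether the next periodic tick will fire
        self.health = "initial"
        self.sent = []            # digests the subscriber has received

    def tick(self):
        if not self.timer_armed:
            return                # nothing scheduled, so nothing happens
        self.timer_armed = False  # the timer has fired
        if not self.active:
            # Early return while inactive. In this toy model, the only thing
            # that decides whether digests ever resume is whether the timer
            # is re-armed here (an assumption of the sketch).
            if self.rearm_when_inactive:
                self.timer_armed = True
            return
        self.sent.append(self.health)
        self.timer_armed = True   # schedule the next digest


def last_digest_seen(rearm_when_inactive):
    sender = ToyDigestSender(rearm_when_inactive)
    sender.health = "56 pgs unclean"
    sender.tick()                 # digest delivered while active
    sender.active = False
    sender.tick()                 # service temporarily inactive
    sender.active = True
    sender.health = "1 pg unclean"
    sender.tick()                 # service active again
    return sender.sent[-1]


print(last_digest_seen(rearm_when_inactive=False))
print(last_digest_seen(rearm_when_inactive=True))
```

In the first run the subscriber is left holding the old "56 pgs unclean" digest even though the service is active again, which resembles the symptom reported here. In the second run the newer health state gets through.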

#5 Updated by John Spray about 5 years ago

  • Duplicates Bug #22142: mon doesn't send health status after paxos service is inactive temporarily added

#6 Updated by John Spray about 5 years ago

  • Status changed from New to Duplicate

Ah, that suspect piece of code was already updated in master for http://tracker.ceph.com/issues/22142, which is currently pending backport for luminous. It seems highly likely that this is a duplicate of that.
