Bug #23565

Inactive PGs don't seem to cause HEALTH_ERR

Added by Greg Farnum about 6 years ago. Updated almost 3 years ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
Tags:
Backport:
octopus,pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

While looking at https://tracker.ceph.com/issues/23562, I saw that there were inactive PGs starting at

2018-04-04 16:57:43.702801 mon.reesi001 mon.0 10.8.130.101:6789/0 113 : cluster [WRN] Health check failed: Reduced data availability: 81 pgs inactive, 91 pgs peering (PG_AVAILABILITY)

immediately after the VDO OSD was turned on. It settled down at

2018-04-04 16:59:38.517021 mon.reesi001 mon.0 10.8.130.101:6789/0 163 : cluster [WRN] overall HEALTH_WARN 463121/13876873 objects misplaced (3.337%); Reduced data availability: 61 pgs inactive; Degraded data redundancy: 14/13876873 objects degraded (0.000%), 152 pgs unclean, 6 pgs degraded; too many PGs per OSD (240 > max 200); clock skew detected on mon.reesi002, mon.reesi003

And then it stayed pretty much that way. It eventually transitioned to HEALTH_ERR a couple of hours later:

2018-04-04 18:27:38.532992 mon.reesi001 mon.0 10.8.130.101:6789/0 1476 : cluster [WRN] overall HEALTH_WARN 405697/13877083 objects misplaced (2.924%); Reduced data availability: 61 pgs inactive; Degraded data redundancy: 13/13877083 objects degraded (0.000%), 139 pgs unclean, 2 pgs degraded; 1 slow requests are blocked > 32 sec; too many PGs per OSD (240 > max 200); clock skew detected on mon.reesi002, mon.reesi003
2018-04-04 18:28:38.533153 mon.reesi001 mon.0 10.8.130.101:6789/0 1494 : cluster [ERR] overall HEALTH_ERR 405508/13877089 objects misplaced (2.922%); Reduced data availability: 61 pgs inactive; Degraded data redundancy: 13/13877089 objects degraded (0.000%), 139 pgs unclean, 2 pgs degraded; 1 stuck requests are blocked > 4096 sec; too many PGs per OSD (240 > max 200); clock skew detected on mon.reesi002, mon.reesi003

But that seems to have been caused by the creation of stuck requests, not the inactive PGs.

I'm not quite sure what's going on here. Perhaps we only transition to HEALTH_ERR when PGs get stuck, but the primary for these inactive PGs was still sending in PGStats messages so that never happened?
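
As an illustration of the distinction being drawn here, below is a minimal sketch of an external check that treats long-stuck inactive PGs as critical on its own, independent of whether the monitor escalates to HEALTH_ERR. It assumes the ceph CLI is on PATH and that "ceph pg dump_stuck inactive <seconds> -f json" is available; the JSON field names and the 300s threshold are illustrative, not taken from this cluster.

    #!/usr/bin/env python3
    # Sketch: escalate on PGs stuck inactive longer than a threshold, even if
    # the cluster itself only reports HEALTH_WARN. Not the monitor's logic.
    import json
    import subprocess
    import sys

    THRESHOLD_SEC = 300  # illustrative threshold, not a Ceph default


    def stuck_inactive_pgs(threshold=THRESHOLD_SEC):
        # dump_stuck only reports PGs stuck longer than the given interval.
        out = subprocess.check_output(
            ["ceph", "pg", "dump_stuck", "inactive", str(threshold), "-f", "json"])
        data = json.loads(out or b"[]")
        # Some releases return a bare list, others wrap it (e.g. "stuck_pg_stats").
        return data.get("stuck_pg_stats", []) if isinstance(data, dict) else data


    if __name__ == "__main__":
        pgs = stuck_inactive_pgs()
        if pgs:
            print(f"CRITICAL: {len(pgs)} PGs stuck inactive > {THRESHOLD_SEC}s")
            sys.exit(2)  # Nagios-style CRITICAL exit code
        print("OK: no PGs stuck inactive")
        sys.exit(0)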


Related issues (1 open, 0 closed)

Related to RADOS - Bug #23049: ceph Status shows only WARN when traffic to cluster fails (New, 02/20/2018)

#1

Updated by Greg Farnum about 6 years ago

  • Project changed from Ceph to RADOS
#2

Updated by Josh Durgin about 6 years ago

  • Assignee set to Brad Hubbard

Brad, can you take a look at this? I think it can be handled by the stuck PG code, which IIRC already warns about PGs stuck unclean for some time.
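
A rough sketch of that idea, reusing a stuck-time threshold to escalate PG_AVAILABILITY from warning to error (mon_pg_stuck_threshold is a real option, but the data structures and function here are invented for illustration and are not the monitor/mgr implementation):

    # Keep the existing warning for freshly inactive/peering PGs, but raise
    # HEALTH_ERR once a PG has been inactive longer than the stuck threshold.
    import time

    MON_PG_STUCK_THRESHOLD = 60  # seconds; the real default varies by release


    def pg_availability_severity(pgs, now=None):
        """pgs: list of dicts with 'state' (str) and 'last_active' (epoch secs)."""
        now = now if now is not None else time.time()
        inactive = [pg for pg in pgs if "active" not in pg["state"]]
        if not inactive:
            return "HEALTH_OK"
        stuck = [pg for pg in inactive
                 if now - pg["last_active"] > MON_PG_STUCK_THRESHOLD]
        # PGs that are briefly peering are a warning; PGs unavailable for a
        # long time are an error.
        return "HEALTH_ERR" if stuck else "HEALTH_WARN"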

#3

Updated by Greg Farnum over 4 years ago

  • Related to Bug #23049: ceph Status shows only WARN when traffic to cluster fails added
#4

Updated by Greg Farnum over 4 years ago

  • Assignee deleted (Brad Hubbard)
  • Priority changed from High to Normal
#5

Updated by Dan van der Ster almost 3 years ago

  • Pull request ID set to 42192
#6

Updated by Dan van der Ster almost 3 years ago

  • Status changed from New to Fix Under Review
#7

Updated by Dan van der Ster almost 3 years ago

  • Backport set to octopus,pacific