Bug #11134

PGs stuck much longer than needed in Peering or Inactive

Added by Alexandre Oliva about 5 years ago. Updated about 5 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

I went overboard on PG count for some pools, so now I have OSDs holding thousands of PGs. I know this is not recommended, and the present issue might be the reason why, but I can't go back, so I figured I'd report it anyway, in case it's not known and could be easily fixed.

The problem I have is that in some cases PGs get stuck in Peering or other Inactive states for a very long time, up to hours, until one of the OSDs involved in their recovery completes some major task (emptying some recovery queue or some such). At that point the PGs start advancing to Inactive, and then, when another major task completes, they advance to active states.

I found out that if I make osdmap changes, such as changing the max target sizes of tiered pools, then as the new osdmap propagates, many PGs that were just waiting for something (presumably the event alluded to above) can advance right away to active states instead of remaining needlessly stuck.
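For anyone wanting to try the same workaround, here is a minimal sketch of how it might be scripted. It assumes that "max target sizes" refers to the `target_max_bytes` pool setting changed through `ceph osd pool set`; the pool name `cache-pool`, the byte value, and the helper names are hypothetical. The sketch defaults to a dry run and only returns the command it would execute.

```python
import subprocess


def build_nudge_command(pool, target_max_bytes):
    """Build the argv for a pool setting change (assumed workaround).

    Changing a pool setting is expected to publish a new osdmap epoch,
    which is what appeared to unstick the waiting PGs in this report.
    """
    if target_max_bytes < 0:
        raise ValueError("target_max_bytes must be non-negative")
    return [
        "ceph", "osd", "pool", "set",
        pool, "target_max_bytes", str(target_max_bytes),
    ]


def nudge_osdmap(pool, target_max_bytes, dry_run=True):
    """Return the command; run it only when dry_run is False."""
    cmd = build_nudge_command(pool, target_max_bytes)
    if not dry_run:
        # Requires the ceph CLI and admin credentials on this host.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Hypothetical pool name and size; prints the command without running it.
    print(" ".join(nudge_osdmap("cache-pool", 1099511627776)))
```

This only automates the external intervention described above; it does not address why the PGs wait in the first place.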

It would be desirable if OSDs could avoid this needless delay in bringing PGs to an active state on their own, instead of requiring external intervention.

This is with the initial Giant release; I haven't upgraded to the patch release yet.


Related issues

Related to Ceph - Bug #10431: PG can not finish peering due to mismatch between OSD peering queue and PG peering queue (Resolved, 12/24/2014)

History

#1 Updated by Samuel Just about 5 years ago

  • Status changed from New to Duplicate

Fairly sure this is fixed.