Bug #11134: PGs stuck much longer than needed in Peering or Inactive - Ceph - Ceph

Bug #11134

closed

PGs stuck much longer than needed in Peering or Inactive

Added by Alexandre Oliva about 9 years ago. Updated about 9 years ago.

Status: Duplicate
Priority: Normal
Assignee:
Category:
Target version:
% Done:
Source: other
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I went overboard on the PG count for some pools, so I now have OSDs holding thousands of PGs. I know this is not recommended, and the present issue might be the reason why, but I can't go back, so I figured I'd report it anyway, in case it's not known and could be easily fixed.

The problem is that in some cases PGs get stuck in Peering or other Inactive states for a very long time, up to hours, until one of the OSDs involved in their recovery completes some major task (emptying some recovery queue or some such). At that point the PGs start advancing to Inactive, and when another major task completes, they advance to active states.

I found that if I make osdmap changes, such as changing the max target sizes of tiered pools, then as the new osdmap propagates, many PGs that were just waiting for something (presumably the event alluded to above) advance to active states right away instead of remaining needlessly stuck.
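For anyone wanting to reproduce the workaround, here is a rough sketch of how the osdmap change could be scripted. It assumes the "max target size" in question is the pool's target_max_bytes setting, changed with `ceph osd pool set`; the helper names, the pool name and the byte value are made up for illustration, and whether a given change actually produces a new osdmap epoch on your cluster is something to verify rather than take from this sketch.

```python
import subprocess
from typing import List


def build_nudge_command(pool: str, target_max_bytes: int) -> List[str]:
    """Build the argv for changing target_max_bytes on a cache-tier pool.

    Hypothetical helper: it only assembles the command line and does not
    touch the cluster.
    """
    if not pool:
        raise ValueError("pool name must be non-empty")
    if target_max_bytes < 0:
        raise ValueError("target_max_bytes must be non-negative")
    return [
        "ceph", "osd", "pool", "set",
        pool, "target_max_bytes", str(target_max_bytes),
    ]


def nudge_osdmap(pool: str, target_max_bytes: int, dry_run: bool = True) -> List[str]:
    """Return the command, and run it only when dry_run is False.

    Assumption: changing the pool setting causes a new osdmap to be
    published, which is what seemed to unstick the waiting PGs.
    """
    cmd = build_nudge_command(pool, target_max_bytes)
    if not dry_run:
        # Raises CalledProcessError if the ceph CLI reports a failure.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # "cachepool" and the size are placeholders, not values from this report.
    print(" ".join(nudge_osdmap("cachepool", 100 * 1024**3)))
```

The default is a dry run that only prints the command, since this is a manual nudge and not something to leave running unattended.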

It would be desirable for OSDs to avoid this needless delay in bringing PGs to an active state on their own, rather than requiring external intervention.

This is with the initial Giant release; I haven't upgraded to the patch release yet.

Related issues 1 (0 open — 1 closed)