Bug #3747

closed

PGs stuck in active+remapped

Added by Faidon Liambotis over 11 years ago. Updated about 5 years ago.

Status:
Closed
Priority:
High
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Community (user)
Regression:
No

Description

About a week ago I doubled the number of OSDs in my cluster from 24 to 48 and, on the same day, adjusted CRUSH's default data rule to say "step chooseleaf firstn 0 type rack" instead of "step choose firstn 0 type osd", since the new OSDs were in boxes in different racks. The vast majority of pgs and data are in a single pool (.rgw.buckets), which has a replica count of 2.
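To be concrete, the rule change was roughly the following (the rule name, ruleset number and min/max_size are from the stock default data rule, so take them as illustrative rather than verbatim from my crushmap):

    # old default data rule: replicas placed on individual OSDs
    rule data {
            ruleset 0
            type replicated
            min_size 1
            max_size 10
            step take default
            step choose firstn 0 type osd
            step emit
    }

    # new rule: one replica chosen per rack
    rule data {
            ruleset 0
            type replicated
            min_size 1
            max_size 10
            step take default
            step chooseleaf firstn 0 type rack
            step emit
    }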

After about 5 days of resyncing, I ended up with 95 pgs stuck in active+remapped, while all the rest are active+clean. These are spread across almost all of the OSDs, so there is no discernible pattern here.

I tried restarting one of the OSDs that hosted some of these pgs, and the count dropped to 61. They have been stuck there for almost three days now.

This is on Ceph 0.56, running with the ceph.com stock packages on an Ubuntu 12.04 LTS system.

I asked on IRC and had a bit of initial debugging/Q&A with sjust. Per his instructions, I've uploaded the following files: wmf-pg-dump, wmf-osd-dump, wmf-osdmap.
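(These are just the standard dumps, produced with something along the lines of:)

    ceph pg dump > wmf-pg-dump
    ceph osd dump > wmf-osd-dump
    ceph osd getmap -o wmf-osdmap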

I'd be happy to provide more information, although I'm afraid I'll have to work around the issue by in/out'ing the OSDs.

Actions #1

Updated by Faidon Liambotis over 11 years ago

I did a "ceph osd out 0; sleep 30; ceph osd in 0" and out of those 61 active+remapped pgs, 5 went into active+remapped+backfilling and slowly moved into active+clean. So there definitely seems to be some underlying bug on why pgs are getting stuck in that state.

Actions #2

Updated by Sage Weil over 11 years ago

  • Status changed from New to Resolved
Actions #3

Updated by Sage Weil over 11 years ago

  • Status changed from Resolved to 12

Sage Weil wrote:

f83fcf63a928fdb8ab4d604bdce596c0c4afd854

oops, wrong bug!

Actions #4

Updated by Sage Weil over 11 years ago

  • Priority changed from Normal to High
Actions #5

Updated by Samuel Just over 11 years ago

  • Assignee set to Samuel Just
Actions #6

Updated by Samuel Just over 11 years ago

  • Status changed from 12 to Need More Info

Faidon: did you also change the replication level of pool 3 (.rgw.buckets)?
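(Something along these lines would confirm it, assuming .rgw.buckets is still pool 3 in your osdmap:)

    # replica count for the pool
    ceph osd pool get .rgw.buckets size

    # or check the pool line in the osd dump
    ceph osd dump | grep '^pool 3'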

Actions #7

Updated by Faidon Liambotis over 11 years ago

No, I didn't; just the CRUSH rule.

Actions #8

Updated by Sage Weil about 11 years ago

  • Status changed from Need More Info to Closed

I think this was probably related to the lagging pg peering workqueue. Is there anything to suggest that isn't the case?

I'm inclined to close this.

Actions #9

Updated by Марк Коренберг about 5 years ago

Please reopen. This happens with my cluster on Mimic.

Actions #10

Updated by Samuel Just about 5 years ago

You'll want to open a new bug; there's not much chance that what you're experiencing is related.
