Bug #3747
PGs stuck in active+remapped (status: closed)
Description
About a week ago I doubled the number of OSDs in my cluster from 24 to 48 and, on the same day, adjusted the default CRUSH data rule to say "step chooseleaf firstn 0 type rack" instead of "step choose firstn 0 type osd", as the new OSDs were in boxes in different racks. The vast majority of PGs and data are in a single pool (.rgw.buckets), which has a replica count of 2.
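For context, the changed rule would look roughly like the sketch below in a decompiled CRUSH map. Only the chooseleaf step comes from the report; the rule name, ruleset number, and size bounds are illustrative placeholders.

```
# Hypothetical CRUSH rule after the change described above.
# Only "step chooseleaf firstn 0 type rack" is from the report;
# the other fields are typical defaults and purely illustrative.
rule data {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type rack
    step emit
}
```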
After about 5 days of resyncing, I ended up with 95 PGs stuck in active+remapped, while all the rest are active+clean. These seem to be spread across almost all of the OSDs, so there is no discernible pattern here.
I tried restarting one of the OSDs that hosted some of these PGs and the count dropped to 61. These have been stuck there for almost three days now.
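To check whether the stuck PGs cluster on particular OSDs, one can group them by acting set from the JSON output of `ceph pg dump --format=json`. This is a minimal sketch; the field names (`pg_stats`, `pgid`, `state`, `acting`) are assumptions about the dump layout and should be verified against your Ceph version, and the sample data below is fabricated for illustration.

```python
import json
from collections import Counter

def stuck_remapped_by_osd(pg_dump_json):
    """Count active+remapped PGs per acting OSD.

    Assumes the `ceph pg dump --format=json` layout: a top-level
    "pg_stats" list whose entries carry "state" and "acting" fields.
    """
    stats = json.loads(pg_dump_json).get("pg_stats", [])
    counts = Counter()
    for pg in stats:
        if pg["state"] == "active+remapped":
            for osd in pg["acting"]:
                counts[osd] += 1
    return counts

# Fabricated two-PG dump for illustration:
sample = json.dumps({"pg_stats": [
    {"pgid": "11.2a", "state": "active+remapped", "acting": [3, 17]},
    {"pgid": "11.2b", "state": "active+clean",    "acting": [3, 40]},
]})
```

If the resulting counts were concentrated on a few OSDs, that would point at a CRUSH placement problem for those hosts rather than a cluster-wide issue.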
This is on Ceph 0.56, running with the ceph.com stock packages on an Ubuntu 12.04 LTS system.
I asked on IRC and had a bit of initial debugging/Q&A with sjust. Per his instructions, I've uploaded the following files to cephdrop@ceph.com: wmf-pg-dump, wmf-osd-dump, wmf-osdmap.
I'd be happy to provide more information, although I'm afraid I'll have to work around the issue by in/out'ing the OSDs.
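The workaround mentioned above, marking an OSD out and back in to force the affected PGs to re-peer, would look roughly like this against a live cluster (the OSD id is illustrative; pick one that hosts stuck PGs):

```
# Briefly mark an OSD out so its PGs remap, then bring it back in.
# OSD id 3 is a placeholder, not from the report.
ceph osd out 3
# wait for peering/backfill to begin, then reverse:
ceph osd in 3
```

Note this forces data movement, so it is worth doing one OSD at a time on a pool with only two replicas.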