Remapped PGs are sometimes not deleted from previous OSDs
I noticed on several clusters (all Nautilus 14.2.6) that, on occasion, some OSDs still hold data for some PGs long after those PGs have been remapped elsewhere and become active+clean. I spotted this because a few nodes had slightly higher disk usage than others, despite the upmap balancer having achieved a perfect distribution of PGs. I cannot really tell what triggers the condition - PGs are regularly remapped as bad disks are taken out and the balancer does its thing - but what I can see is:
- Those OSDs report more PGs in `ceph --admin-daemon /var/run/ceph/CLUSTER-osd.OSD.asok dump_pgstate_history` than they should have according to `ceph osd df`
- Those extra PGs all have "Start" as last state
- If I do a pg query for an affected PG, the old OSD does not appear as "up" nor "acting", but it does usually appear in "avail_no_missing" and "object_location_counts" sections (for every peer_info)
- An effective workaround is to bounce the primary OSD for that PG - after re-peering, the deletion process immediately begins on the old OSD (state Started/ToDelete/Deleting)
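The `avail_no_missing` check above can be scripted. The following is a minimal sketch, not a tested tool: it assumes the JSON from `ceph pg PGID query` carries top-level `up` and `acting` lists, and that `info.stats` and each `peer_info[].stats` carry an `avail_no_missing` list of shard strings such as `991(9)` (or a bare `991` on replicated pools). The helper names and the sample data are hypothetical and trimmed, not the attached output.

```python
def osd_id(shard: str) -> int:
    """Return the OSD id from a shard string such as "991(9)" or "991"."""
    return int(shard.split("(", 1)[0])


def stray_osds(pg_query: dict) -> set:
    """OSDs listed in any avail_no_missing section but in neither up nor acting.

    Assumes the field layout described in the lead-in; adjust the keys if
    your release prints the pg query differently.
    """
    current = set(pg_query.get("up", [])) | set(pg_query.get("acting", []))
    stats_blocks = [pg_query.get("info", {}).get("stats", {})]
    stats_blocks += [p.get("stats", {}) for p in pg_query.get("peer_info", [])]
    seen = set()
    for stats in stats_blocks:
        for shard in stats.get("avail_no_missing", []):
            seen.add(osd_id(shard))
    return seen - current


# Hypothetical, heavily trimmed pg query output (two shards only).
sample = {
    "up": [447, 944],
    "acting": [447, 944],
    "info": {"stats": {"avail_no_missing": ["447(0)", "944(9)", "991(9)"]}},
    "peer_info": [
        {"peer": "944(9)",
         "stats": {"avail_no_missing": ["447(0)", "944(9)", "991(9)"]}},
    ],
}

print(sorted(stray_osds(sample)))
```

Feeding it `json.loads()` of the real query output for each PG on a suspect OSD should list the OSDs that still hold a stale copy, if the layout assumption holds.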
Attaching debug info for one such case: osd.991 is holding data for PG 7.2bf (10th shard), which was remapped to osd.944 several weeks earlier. I included the dump_pgstate_history and pg query information both before and after the primary osd.447 was restarted.
#2 Updated by Dan van der Ster over 3 years ago
- Affected Versions v14.2.10, v14.2.11, v14.2.7, v14.2.8, v14.2.9 added
I can report the same in 14.2.11.
We set some OSDs to crush weight 0, and they were draining. But due to #47044 some of those OSDs flapped, and when they came back up, the draining finished.
But after all PGs were active+clean on other osds, there are still ~20 PGs on the crush weight = 0.0 OSDs.
Restarting the crush weight 0 OSDs does not trigger any further deletion.
I checked the relevant PGs -- indeed the "old" OSD appears in `avail_no_missing`, and the workaround of re-peering the primary OSD for that PG works to trigger deletion. `ceph osd down $PRIMARY` does this without needing to restart the primary.
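If many PGs are affected, the `ceph osd down $PRIMARY` workaround can be batched. This is only a sketch under assumptions: the list of primary OSD ids is gathered by hand (or from pg query output), the helper names are hypothetical, and it defaults to a dry run that only prints the commands. Marking OSDs down triggers re-peering, so it is probably wise to pace the calls rather than fire them all at once.

```python
import subprocess


def repeer_commands(primaries):
    """Build one `ceph osd down <id>` argv per distinct primary OSD id."""
    return [["ceph", "osd", "down", str(osd)] for osd in sorted(set(primaries))]


def run(commands, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if the ceph CLI fails.
            subprocess.run(cmd, check=True)


# Dry run with a hypothetical list of primaries (duplicates are collapsed).
run(repeer_commands([447, 447]))
```

Calling `run(..., dry_run=False)` on a node with an admin keyring would issue the commands for real.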
#3 Updated by Gabriel Tzagkarakis about 3 years ago
I have the exact same problem with version 15.2.8 as described above and verified it following Eric's steps.
After restarting the primary, the deletion starts again.
In our case this situation started because we replaced most disks in a period of 2 months and had plenty of rebalances.
For example, a 3TB OSD that should hold around 400GiB of raw data now has 2.1TB, while other OSDs reached a nearfull condition.
You may want to update the "affected versions"