Bug #38931
osd does not proactively remove leftover PGs
Status: Closed
Description
(Context: cephfs cluster running v12.2.11)
We had an OSD go nearfull this weekend. I reweighted it to move some PGs off, but looking at it today, it is still holding much more data than it should.
The OSD currently has 34 PGs mapped to it:
ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL   %USE  VAR  PGS
74   hdd 5.45609  1.00000 5.46TiB 3.86TiB 1.60TiB 70.77 1.37  34
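For reference, the reweighting was done with the usual CLI; the 0.90 value below is illustrative, not the exact value used:

$ ceph osd reweight 74 0.90      # temporary override weight to drain PGs off osd.74
$ ceph osd df | awk '$1 == 74'   # re-check usage and the mapped PG count afterwards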
But the OSD itself reports 20 more:
{ "whoami": 74, "state": "active", "oldest_map": 46992, "newest_map": 47738, "num_pgs": 54 }
When I restart the OSD, it reloads those 20 PGs. For example, here is one it loads even though it is mapped to [22,129,14], a set that does not include osd.74. That PG is currently active+clean:
2019-03-25 11:09:26.655955 7fe3fdb23d80 10 osd.74 47719 load_pgs loaded pg[2.6d( v 47685'27177090 (47637'27175587,47685'27177090] lb MIN (bitwise) local-lis/les=47680/47681 n=0 ec=371/371 lis/c 47683/47497 les/c/f 47684/47498/0 47686/47688/43553) [22,129,14] r=-1 lpr=47689 pi=[47497,47688)/1 crt=47685'27177090 lcod 0'0 unknown NOTIFY mbc={}] log((47637'27175587,47685'27177090], crt=47685'27177090)
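The mapping can be cross-checked from the monitor side with standard commands (the expectations noted in the comments are mine):

$ ceph pg map 2.6d     # should show up/acting [22,129,14], i.e. no osd.74
$ ceph pg 2.6d query   # detailed peering state, including past intervals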
I found a way to remove those leftover PGs (without using ceph-objectstore-tool): if a PG re-peers, osd.74 notices that it is not in the up/acting set and starts deleting the PG.
So at the moment I'm restarting those former peers to trim this OSD.
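As a sketch of that workaround (the systemd unit name is assumed; osd.22 is the current primary of pg 2.6d above):

$ systemctl restart ceph-osd@22   # restarting the primary forces the PG to re-peer
# Marking the primary down in the osdmap may also trigger re-peering without a
# full daemon restart, though I have not verified that here:
$ ceph osd down 22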
Is this all expected behaviour?
Shouldn't the OSD start removing leftover PGs at boot time?