Bug #38931

osd does not proactively remove leftover PGs

Added by Dan van der Ster over 2 years ago. Updated 27 days ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: pacific,octopus,nautilus
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

(Context: cephfs cluster running v12.2.11)

We had an osd go nearfull this weekend. I reweighted it to move out some PGs, but when looking today it's still holding much more data than it should.

The osd currently has 34 PGs mapped to it:

  74   hdd 5.45609  1.00000 5.46TiB 3.86TiB 1.60TiB 70.77 1.37  34

But the OSD itself reports 20 more:

{
    "whoami": 74,
    "state": "active",
    "oldest_map": 46992,
    "newest_map": 47738,
    "num_pgs": 54
}
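
The gap between the two counts can be checked mechanically. The following is only a minimal sketch, assuming the OSD status JSON above is available as text and the mapped PG count (34) is read by hand from the usage line; the function name `leftover_pg_count` is hypothetical and not part of any Ceph tool.

```python
import json


def leftover_pg_count(osd_status_json: str, mapped_pgs: int) -> int:
    """Return how many PGs the OSD holds beyond those mapped to it.

    Hypothetical helper: it only subtracts the mapped count from the
    "num_pgs" field of the status JSON quoted in this report.
    """
    status = json.loads(osd_status_json)
    return status["num_pgs"] - mapped_pgs


# Status JSON copied from the report above.
status_text = """
{
    "whoami": 74,
    "state": "active",
    "oldest_map": 46992,
    "newest_map": 47738,
    "num_pgs": 54
}
"""

print(leftover_pg_count(status_text, mapped_pgs=34))  # 54 - 34 = 20
```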

When I restart the OSD, it reloads those 20. For example, here is a PG that it loads even though that PG is mapped to [22,129,14]. That PG is currently active+clean.

2019-03-25 11:09:26.655955 7fe3fdb23d80 10 osd.74 47719 load_pgs loaded pg[2.6d( v 47685'27177090 (47637'27175587,47685'27177090] lb MIN (bitwise) local-lis/les=47680/47681 n=0 ec=371/371 lis/c 47683/47497 les/c/f 47684/47498/0 47686/47688/43553) [22,129,14] r=-1 lpr=47689 pi=[47497,47688)/1 crt=47685'27177090 lcod 0'0 unknown NOTIFY mbc={}] log((47637'27175587,47685'27177090], crt=47685'27177090)
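
To list every such PG after a restart, lines like the one above could be filtered from the OSD log. This is a rough sketch, not an official tool: the regular expression is fitted only to the single log line quoted here (other Ceph versions or PG states may print a different layout), and the names `LOAD_PGS_RE` and `parse_loaded_pg` are made up for illustration.

```python
import re

# Pattern fitted to the one "load_pgs loaded" line quoted in this report.
LOAD_PGS_RE = re.compile(
    r"osd\.(?P<osd>\d+) \d+ load_pgs loaded "
    r"pg\[(?P<pgid>[0-9a-f]+\.[0-9a-f]+)\("
    r".*?\) \[(?P<mapped>[\d,]+)\] r=(?P<role>-?\d+)"
)


def parse_loaded_pg(line):
    """Extract the OSD id, PG id, mapped OSD set and r= value from a log line.

    Returns None when the line does not match the assumed layout.
    """
    m = LOAD_PGS_RE.search(line)
    if m is None:
        return None
    osd = int(m.group("osd"))
    mapped = [int(x) for x in m.group("mapped").split(",")]
    return {
        "osd": osd,
        "pgid": m.group("pgid"),
        "mapped": mapped,
        "role": int(m.group("role")),
        # Leftover: the OSD loaded a PG that is not mapped to it.
        "leftover": osd not in mapped,
    }


# The log line from the report, truncated after the lpr= field.
sample = (
    "2019-03-25 11:09:26.655955 7fe3fdb23d80 10 osd.74 47719 load_pgs "
    "loaded pg[2.6d( v 47685'27177090 (47637'27175587,47685'27177090] "
    "lb MIN (bitwise) local-lis/les=47680/47681 n=0 ec=371/371 "
    "lis/c 47683/47497 les/c/f 47684/47498/0 47686/47688/43553) "
    "[22,129,14] r=-1 lpr=47689"
)

print(parse_loaded_pg(sample))
```

Running the parser over a whole log and keeping only entries with `leftover` set would give the candidate PGs to compare against the current PG mapping.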

I found a way to remove those leftover PGs (without using ceph-objectstore-tool): if the PG re-peers, osd.74 notices that it is not in the up/acting set and starts deleting the PG.
So at the moment I'm restarting those former peers to trim this OSD.

Is this all expected behaviour?
Shouldn't the OSD start removing leftover PGs at boot time?


Related issues

Copied to RADOS - Backport #51582: octopus: osd does not proactively remove leftover PGs Resolved
Copied to RADOS - Backport #51583: nautilus: osd does not proactively remove leftover PGs Resolved
Copied to RADOS - Backport #51584: pacific: osd does not proactively remove leftover PGs Resolved

History

#2 Updated by Greg Farnum over 2 years ago

So should we backport part of that PR, Neha?

To answer your question more directly, Dan: OSDs don't delete PGs themselves because they don't know if the data is still needed; they wait for the primary to tell them to remove it. Based on the linked commit, there is apparently a bug where, in some circumstances, the primary erroneously marks the stray OSD as having already deleted the PG, and you seem to have fallen victim to that.

#3 Updated by Neha Ojha over 2 years ago

Greg Farnum wrote:

So should we backport part of that PR, Neha?

I think so. I wasn't sure, since Xie Xingguo used "Related-to" instead of "Fixes:" in that commit.
As a matter of fact, the parent PR https://github.com/ceph/ceph/pull/27205 is also pending backport.

#4 Updated by Mykola Golub 4 months ago

  • Status changed from New to Fix Under Review
  • Backport set to pacific,octopus,nautilus
  • Pull request ID set to 42141

Our customer reported a similar case and provided an easy way to reproduce the issue: if the osd is marked down while it is purging a pg (to reproduce, one can use the `ceph osd down` command; in reality it may happen when the pg purge is too heavy and overloads the osd), the purge is interrupted and is not restarted until the pg re-peers. The purging osd keeps sending notifications to the primary asking to purge, but the primary ignores them because the osd is in the peer_purged list. So this is exactly the problem that [1] tried to fix, but I think the fix was wrong: adding peer_purged.erase() to the peer_info loop had no effect, because in purge_strays(), when inserting an osd into peer_purged, we simultaneously remove it from peer_info.

See PR [2] for my approach to fix it.

[1] https://github.com/ceph/ceph/pull/27205/commits/f7c5b01e181630bb15e8b923b0334eb6adfdf50a
[2] https://github.com/ceph/ceph/pull/42141
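
The bookkeeping problem described in this comment can be illustrated with a toy model. This is only a sketch in Python, not Ceph code: `peer_info`, `peer_purged` and `purge_strays` are taken from the comment, while the class `PrimaryModel` and the methods `handle_notify` and `erase_purged_seen_in_peer_info` are hypothetical stand-ins. It does not show how PR [2] fixes the issue.

```python
class PrimaryModel:
    """Toy model of the primary's view of stray OSDs for one PG."""

    def __init__(self):
        self.peer_info = {}       # osd id -> info reported by that osd
        self.peer_purged = set()  # osds the primary believes have purged

    def handle_notify(self, osd):
        """Stray asks to be purged; ignored if already in peer_purged."""
        if osd in self.peer_purged:
            return False
        self.peer_info[osd] = "info"
        return True

    def purge_strays(self, strays):
        """Tell strays to remove the PG; returns the osds that were told."""
        told = []
        for osd in strays:
            if osd in self.peer_info:
                told.append(osd)
                # Insert into peer_purged and drop from peer_info together.
                self.peer_purged.add(osd)
                del self.peer_info[osd]
        return told

    def erase_purged_seen_in_peer_info(self):
        """Models the erase inside a loop over peer_info (hypothetical)."""
        for osd in list(self.peer_info):
            self.peer_purged.discard(osd)


primary = PrimaryModel()
print(primary.handle_notify(74))   # True: notification accepted
print(primary.purge_strays([74]))  # [74]: purge requested
# The osd is marked down mid-purge, comes back, and notifies again.
primary.erase_purged_seen_in_peer_info()  # 74 is not in peer_info: no-op
print(primary.handle_notify(74))   # False: still in peer_purged, ignored
```

In this model the erase loop can never reach osd 74, because the only place that adds an osd to `peer_purged` also removes it from `peer_info`, which matches the reasoning given in the comment.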

#5 Updated by Kefu Chai 3 months ago

  • Status changed from Fix Under Review to Pending Backport

#6 Updated by Backport Bot 3 months ago

  • Copied to Backport #51582: octopus: osd does not proactively remove leftover PGs added

#7 Updated by Backport Bot 3 months ago

  • Copied to Backport #51583: nautilus: osd does not proactively remove leftover PGs added

#8 Updated by Backport Bot 3 months ago

  • Copied to Backport #51584: pacific: osd does not proactively remove leftover PGs added

#9 Updated by Loïc Dachary 27 days ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
