Bug #38931: osd does not proactively remove leftover PGs - RADOS - Ceph

Bug #38931

closed

osd does not proactively remove leftover PGs

Added by Dan van der Ster about 5 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Backport: pacific,octopus,nautilus
Severity: 2 - major
Affected Versions: Ceph - v12.2.11
Pull request ID: 42141

Description

(Context: cephfs cluster running v12.2.11)

We had an osd go nearfull this weekend. I reweighted it to move out some PGs, but when looking today it's still holding much more data than it should.

The osd currently has 34 PGs mapped to it:

  74   hdd 5.45609  1.00000 5.46TiB 3.86TiB 1.60TiB 70.77 1.37  34

But the OSD itself reports 20 more:

{
    "whoami": 74,
    "state": "active",
    "oldest_map": 46992,
    "newest_map": 47738,
    "num_pgs": 54
}
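
The size of the discrepancy can be checked mechanically. The following is a minimal sketch, assuming the usage line above comes from something like `ceph osd df` (with the mapped PG count in the last column) and that the JSON is the OSD's own status output; the function name `leftover_pg_count` is made up for illustration.

```python
import json

# Status JSON reported by the OSD, copied from the report above.
status_json = """
{
    "whoami": 74,
    "state": "active",
    "oldest_map": 46992,
    "newest_map": 47738,
    "num_pgs": 54
}
"""

# Usage line copied from the report above. Assumption: the last column
# is the number of PGs currently mapped to the OSD.
osd_df_line = "74   hdd 5.45609  1.00000 5.46TiB 3.86TiB 1.60TiB 70.77 1.37  34"


def leftover_pg_count(status_text, df_line):
    """Return how many more PGs the OSD holds than are mapped to it."""
    held = json.loads(status_text)["num_pgs"]
    mapped = int(df_line.split()[-1])
    return held - mapped


print(leftover_pg_count(status_json, osd_df_line))  # prints 20
```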

When I restart the OSD, it reloads those 20, e.g. here is a PG it loads but which is mapped to [22,129,14]. That PG is currently active+clean.

2019-03-25 11:09:26.655955 7fe3fdb23d80 10 osd.74 47719 load_pgs loaded pg[2.6d( v 47685'27177090 (47637'27175587,47685'27177090] lb MIN (bitwise) local-lis/les=47680/47681 n=0 ec=371/371 lis/c 47683/47497 les/c/f 47684/47498/0 47686/47688/43553) [22,129,14] r=-1 lpr=47689 pi=[47497,47688)/1 crt=47685'27177090 lcod 0'0 unknown NOTIFY mbc={}] log((47637'27175587,47685'27177090], crt=47685'27177090)
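
To list every leftover PG rather than spot-checking one, the `load_pgs` lines can be scanned for PGs whose mapped set does not contain the loading OSD. This is only a sketch: the regular expression is fitted to the single debug line quoted above (debug level 10, replicated pool), and the helper name `parse_loaded_pg` is hypothetical. Other Ceph versions or erasure-coded pools may format the line differently.

```python
import re

# The log line quoted above, split across string literals for readability.
LOG_LINE = (
    "2019-03-25 11:09:26.655955 7fe3fdb23d80 10 osd.74 47719 load_pgs loaded "
    "pg[2.6d( v 47685'27177090 (47637'27175587,47685'27177090] lb MIN (bitwise) "
    "local-lis/les=47680/47681 n=0 ec=371/371 lis/c 47683/47497 "
    "les/c/f 47684/47498/0 47686/47688/43553) [22,129,14] r=-1 lpr=47689 "
    "pi=[47497,47688)/1 crt=47685'27177090 lcod 0'0 unknown NOTIFY mbc={}]"
)

# Captures the loading OSD id, the PG id, the bracketed OSD set that
# follows the closing parenthesis, and the role (r=...).
PATTERN = re.compile(
    r"osd\.(?P<osd>\d+) \d+ load_pgs loaded pg\[(?P<pgid>\d+\.[0-9a-f]+)\("
    r".*?\) \[(?P<mapped>[\d,]+)\] r=(?P<role>-?\d+)"
)


def parse_loaded_pg(line):
    """Return a dict describing the loaded PG, or None if the line does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    osd = int(m.group("osd"))
    mapped = [int(x) for x in m.group("mapped").split(",")]
    return {
        "osd": osd,
        "pgid": m.group("pgid"),
        "mapped": mapped,
        "role": int(m.group("role")),
        # A PG is treated as leftover when the loading OSD is not in its mapped set.
        "leftover": osd not in mapped,
    }


info = parse_loaded_pg(LOG_LINE)
print(info["pgid"], info["mapped"], info["leftover"])
```

For the line above this reports PG 2.6d, mapped to [22,129,14], as a leftover on osd.74.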

I found a way to remove those leftover PGs (without using ceph-objectstore-tool): if the PG re-peers, osd.74 notices that it is not in the up/acting set and starts deleting the PG.
So at the moment I'm restarting those former peers to trim this OSD.

Is this all expected behaviour?
Shouldn't the OSD start removing leftover PGs at boot time?

Related issues 3 (0 open — 3 closed)

Updated by Neha Ojha about 5 years ago

https://github.com/ceph/ceph/pull/27205/commits/f7c5b01e181630bb15e8b923b0334eb6adfdf50a

Updated by Greg Farnum about 5 years ago

So should we backport part of that PR, Neha?

To answer your question more directly, Dan: OSDs don't delete PGs themselves because they don't know if the data is still needed; they wait for the primary to tell them to remove it. Based on the linked commit, there is apparently a bug where, in some circumstances, the primary erroneously marks the stray OSD as having already deleted the PG, and you seem to have fallen victim to that.

Updated by Neha Ojha about 5 years ago

Greg Farnum wrote:

So should we backport part of that PR, Neha?

I think so. I wasn't sure, since Xie Xingguo used "Related-to" instead of "Fixes:" in that commit.
As a matter of fact, the parent PR https://github.com/ceph/ceph/pull/27205 is also pending backport.

Updated by Mykola Golub almost 3 years ago

Status changed from New to Fix Under Review
Backport set to pacific,octopus,nautilus
Pull request ID set to 42141

Our customer reported a similar case, which provides an easy way to reproduce the issue: if the OSD is marked down while it is purging a PG (to reproduce, one can use the `ceph osd down` command; in practice it may happen when the PG purge is too heavy and overloads the OSD), the purge is interrupted and is not restarted until the PG is re-peered. The purging OSD keeps sending notifications to the primary asking to purge, but the primary ignores them because the OSD is in the peer_purged list. So this is exactly the problem that [1] tried to fix, but I think the fix was wrong: adding peer_purged.erase() to the peer_info loop had no effect, because in purge_strays(), when an OSD is inserted into peer_purged, it is simultaneously removed from peer_info.

See PR [2] for my approach to fix it.

[1] https://github.com/ceph/ceph/pull/27205/commits/f7c5b01e181630bb15e8b923b0334eb6adfdf50a
[2] https://github.com/ceph/ceph/pull/42141
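
The bookkeeping described above can be illustrated with a toy model. This is a sketch under stated assumptions, not Ceph code: `ToyPrimary`, its methods, and the `forget_purged_on_notify` switch are invented for illustration, and the third variant is only one conceivable way to let a repeated notify through; it is not claimed to be what [2] does. Only the names peer_info, peer_purged and purge_strays come from the comment above.

```python
class ToyPrimary:
    """Hypothetical, heavily simplified model of the primary's stray bookkeeping."""

    def __init__(self, forget_purged_on_notify=False):
        self.peer_info = {}       # stray OSD id -> placeholder info
        self.peer_purged = set()  # OSDs the primary believes have purged the PG
        self.forget_purged_on_notify = forget_purged_on_notify
        self.removes_sent = []    # OSDs that were told to remove the PG

    def handle_notify(self, osd):
        if osd in self.peer_purged:
            if not self.forget_purged_on_notify:
                return  # notify ignored, as described in the comment above
            self.peer_purged.discard(osd)  # assumption: illustrative variant only
        self.peer_info[osd] = "stray"
        self.purge_strays()

    def purge_strays(self):
        # Inserting into peer_purged and removing from peer_info happen together.
        for osd in list(self.peer_info):
            self.removes_sent.append(osd)
            del self.peer_info[osd]
            self.peer_purged.add(osd)

    def erase_purged_for_known_peers(self):
        # Models an erase placed in a loop over peer_info: a purged OSD is
        # no longer in peer_info, so the loop never reaches it.
        for osd in list(self.peer_info):
            self.peer_purged.discard(osd)


def simulate(erase_in_peer_info_loop=False, forget_purged_on_notify=False):
    primary = ToyPrimary(forget_purged_on_notify)
    primary.handle_notify(74)  # primary tells osd.74 to purge
    # osd.74 is marked down mid-purge, comes back, and notifies again.
    if erase_in_peer_info_loop:
        primary.erase_purged_for_known_peers()
    primary.handle_notify(74)
    return primary.removes_sent


print(simulate())                              # [74]
print(simulate(erase_in_peer_info_loop=True))  # [74]
print(simulate(forget_purged_on_notify=True))  # [74, 74]
```

In the first two runs the second notify is dropped, so the interrupted purge is never restarted; only when the primary forgets the purged state on notify does the stray get a second remove request.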

Updated by Kefu Chai almost 3 years ago

Status changed from Fix Under Review to Pending Backport

Updated by Backport Bot almost 3 years ago

Copied to Backport #51582: octopus: osd does not proactively remove leftover PGs added

Updated by Backport Bot almost 3 years ago

Copied to Backport #51583: nautilus: osd does not proactively remove leftover PGs added

Updated by Backport Bot almost 3 years ago

Copied to Backport #51584: pacific: osd does not proactively remove leftover PGs added

Updated by Loïc Dachary over 2 years ago

Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
