Bug #48298

open

hitting mon_max_pg_per_osd right after creating OSD, then decreases slowly

Added by Jonas Jelten over 3 years ago. Updated over 2 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I just added OSDs to my cluster running 14.2.13.

mon_max_pg_per_osd = 300
osd_max_pg_per_osd_hard_ratio = 3
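
(For reference, with those settings the effective limits work out as follows, as I understand the two options: the soft limit only triggers a health warning, the hard limit is where an OSD stops accepting new PGs.)

soft limit:  mon_max_pg_per_osd                                  = 300 PGs per OSD
hard limit:  mon_max_pg_per_osd * osd_max_pg_per_osd_hard_ratio  = 300 * 3 = 900 PGs per OSD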

OSDs of comparable size have maybe 200 PGs on them.
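
(One way to watch the per-OSD PG counts is the PGS column of ceph osd df; the awk filter for a single OSD is just my own shorthand, not part of the bug:)

ceph osd df                      # PGS column shows the current PG count per OSD
ceph osd df | awk '$1 == 422'    # only the row for osd.422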

One of the newly added OSDs, osd.422, now somehow has 907 PGs, which is more than 300 * 3 = 900:

ceph daemon osd.422 status
{
    "cluster_fsid": "xxx",
    "osd_fsid": "yyy",
    "whoami": 422,
    "state": "booting",
    "oldest_map": 454592,
    "newest_map": 455185,
    "num_pgs": 907
}

As a result, PGs become stuck in activating+remapped and large parts of the cluster die.

The interesting thing is this: after I increased the limit, the OSD does of course boot and its PGs become active.
num_pgs even increased further, to 969, but then it started to decrease until the OSD had the expected number of PGs!
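
(For reference, raising the two limits at runtime can be done with something along these lines via the centralized config store; the values 400 and 5 are just placeholders, not a recommendation:)

ceph config set global mon_max_pg_per_osd 400
ceph config set osd osd_max_pg_per_osd_hard_ratio 5

# check what an individual OSD actually sees (I assume the change is picked up
# without a restart, but I haven't verified that):
ceph config get osd.422 osd_max_pg_per_osd_hard_ratio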

Another problem: there's absolutely no hint that the osd_max_pg_per_osd_hard_ratio limit has been hit. You only get a warning when you exceed the soft limit.
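
(If I'm reading this right, the soft-limit warning in question is the TOO_MANY_PGS health check, which shows up in the general health output:)

ceph health detail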

tl;dr:

  • More PGs are allocated on an OSD than it will actually hold once the remapping is done.
  • There's no cluster error or warning when an OSD hits the hard limit.

Files

2020-11-20-133944_1906x1477_scrot.png (393 KB): graph of decreasing num_pgs (Jonas Jelten, 11/20/2020 12:44 PM)

Related issues (1 open, 0 closed)

Related to RADOS - Bug #23117: PGs stuck in "activating" after osd_max_pg_per_osd_hard_ratio has been exceeded once (status: Fix Under Review, assignee: Prashant D)
