Bug #48298
Hitting mon_max_pg_per_osd right after creating an OSD, then the PG count decreases slowly
Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
I just added OSDs to my cluster running 14.2.13.
mon_max_pg_per_osd = 300
osd_max_pg_per_osd_hard_ratio = 3
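(For reference, a sketch of how these two options can be set on a 14.2.x cluster through the centralized config database; the values are the ones quoted above:)

# soft limit: exceeding this raises a health warning
ceph config set global mon_max_pg_per_osd 300
# hard limit multiplier: an OSD refuses to accept new PGs beyond
# mon_max_pg_per_osd * osd_max_pg_per_osd_hard_ratio = 300 * 3 = 900
ceph config set osd osd_max_pg_per_osd_hard_ratio 3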
OSDs of comparable size hold roughly 200 PGs.
The newly added osd.422, however, somehow has 907 PGs, which is more than the hard limit of 300 * 3 = 900:
ceph daemon osd.422 status
{
    "cluster_fsid": "xxx",
    "osd_fsid": "yyy",
    "whoami": 422,
    "state": "booting",
    "oldest_map": 454592,
    "newest_map": 455185,
    "num_pgs": 907
}
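(For anyone reproducing this: ceph daemon talks to the admin socket, so it has to run on the host carrying the OSD. A quick sketch for pulling just the PG count, assuming jq is installed:)

# extract the PG count from the admin socket output (run on the OSD's host)
ceph daemon osd.422 status | jq .num_pgs
# cluster-wide view: ceph osd df prints a PGS column per OSD
ceph osd df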
Thus PGs become stuck in activating+remapped
and large parts of the cluster go down.
The interesting thing is this: after I increased the limit, the OSD did of course boot and the PGs became active. num_pgs first climbed further, to 969, but then started to decrease until the device held the expected number of PGs!
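(A sketch of such a workaround; the exact option and value used are not stated above, so these numbers are assumptions. Raising the soft limit lifts the derived hard limit with it:)

# ASSUMED example: raise the soft limit so that soft * ratio clears the peak
ceph config set global mon_max_pg_per_osd 400
# hard limit becomes 400 * 3 = 1200, above the observed 969 PGs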
Another problem: there's absolutely no hint that osd_max_pg_per_osd_hard_ratio
has been hit. You only get a health warning when an OSD is over the soft limit.
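(Until such a warning exists, an external check along these lines could flag affected OSDs; a sketch that assumes jq is installed and uses 900 = 300 * 3 from the settings above:)

# flag OSDs whose placement-group count exceeds the hard limit of 900
ceph osd df -f json | jq -r '.nodes[] | select(.pgs > 900) | "osd.\(.id): \(.pgs) PGs (over hard limit)"'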
tl;dr:
- During remapping, more PGs are allocated on an OSD than it will actually hold once the remapping is done.
- There's no cluster error when an OSD hits the hard limit.