Feature #22973
closed
log lines when hitting "pg overdose protection"
Description
After upgrading to Luminous we ran into a situation where 10% of our PGs remained unavailable, stuck in the "activating" state.
https://ceph.com/community/new-luminous-pg-overdose-protection/
That blog post says:
"If any individual OSD is ever asked to create more PGs than it should it will simply refuse and ignore the request."
The only non-debug direct evidence was this WARNING in ceph status:
'too many PGs per OSD (221 > max 200)'
(We are aware that we need to fix this situation in our cluster)
Many PGs were stuck in the "activating" state, which is not documented in the PG state table:
http://docs.ceph.com/docs/master/rados/operations/pg-states/
The feature idea is that the OSD should log a message at the default log level when it refuses to create a PG because it has hit the osd_max_pg_per_osd_hard_ratio limit.
We saw lots of "stuck" in all of the management command outputs but not the underlying reason.
I would also ask whether this situation should raise an ERROR rather than a WARNING, since the cluster becomes "partially unavailable".
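To make the request concrete, here is a rough sketch of the kind of check and log line meant above. It is not Ceph's actual code: the function names, the logger name, the message wording and the exact comparison are all assumptions made for illustration. The value 200 is taken from the warning quoted above; the ratio of 2.0 is only an example, so check the settings on your own cluster.

```python
import logging

# Hypothetical logger name; a real OSD has its own logging subsystem.
logger = logging.getLogger("osd_sketch")


def pg_hard_limit(mon_max_pg_per_osd: int, hard_ratio: float) -> float:
    """Per-OSD hard limit: the soft maximum scaled by the hard ratio."""
    return mon_max_pg_per_osd * hard_ratio


def check_pg_create(osd_id: int, current_pgs: int,
                    mon_max_pg_per_osd: int, hard_ratio: float) -> bool:
    """Return True if the OSD may accept one more PG.

    Assumption: creation is refused once the OSD already holds at least
    `limit` PGs. On refusal, a message is emitted at WARNING level so the
    reason is visible without raising debug levels.
    """
    limit = pg_hard_limit(mon_max_pg_per_osd, hard_ratio)
    if current_pgs >= limit:
        logger.warning(
            "osd.%d refusing to create pg: %d pgs >= hard limit %.0f "
            "(mon_max_pg_per_osd=%d * osd_max_pg_per_osd_hard_ratio=%.1f)",
            osd_id, current_pgs, limit, mon_max_pg_per_osd, hard_ratio,
        )
        return False
    return True
```

With example values of 200 and 2.0, an OSD holding 399 PGs would still accept a new one, while an OSD holding 400 would refuse and log the reason.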
Updated by Greg Farnum about 6 years ago
- Status changed from New to Duplicate
You're right that it's bad! This will be fixed in the next luminous release after a belated backport finally happened. :)
Updated by Greg Farnum about 6 years ago
- Is duplicate of Bug #22440: New pgs per osd hard limit can cause peering issues on existing clusters added