Feature #22973

log lines when hitting "pg overdose protection"

Added by Dan Stoner almost 5 years ago. Updated almost 5 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Reviewed:
Affected Versions:
Component(RADOS): OSD
Pull request ID:

Description

After upgrading to Luminous we ran into a situation where 10% of our pgs remained unavailable, stuck in the "activating" state.

https://ceph.com/community/new-luminous-pg-overdose-protection/

That blog post says:

"If any individual OSD is ever asked to create more PGs than it should it will simply refuse and ignore the request."

The only non-debug direct evidence was this WARNING in ceph status:

'too many PGs per OSD (221 > max 200)'

(We are aware that we need to fix this situation in our cluster)

Many pgs were stuck in the "activating" state, which is not documented in the pg state table:

http://docs.ceph.com/docs/master/rados/operations/pg-states/

Feature idea: the OSD should write a message at the standard log level whenever it refuses to create a pg because it has hit the osd_max_pg_per_osd_hard_ratio limit.
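To illustrate the kind of log line being requested, here is a minimal Python sketch. It is not Ceph's actual code: the function names, the message wording, and the ">=" comparison are all hypothetical. It assumes the hard limit is the per-OSD maximum (the 200 from the warning above) multiplied by osd_max_pg_per_osd_hard_ratio; the ratio of 2.0 used in the usage lines is only an illustrative value.

```python
import logging

logger = logging.getLogger("osd_sketch")  # hypothetical logger name


def hard_pg_limit(max_pg_per_osd: int, hard_ratio: float) -> int:
    """Hard cap on PGs per OSD (assumed to be max * ratio)."""
    return int(max_pg_per_osd * hard_ratio)


def may_create_pg(osd_id: int, current_pgs: int,
                  max_pg_per_osd: int, hard_ratio: float) -> bool:
    """Return False, and log at the default level, when the cap is reached.

    The ">=" comparison is an assumption made for this sketch.
    """
    limit = hard_pg_limit(max_pg_per_osd, hard_ratio)
    if current_pgs >= limit:
        # This is the message the feature request asks for: visible
        # without raising the debug level.
        logger.warning(
            "osd.%d refusing to create pg: %d pgs >= hard limit %d "
            "(max %d x hard ratio %.1f)",
            osd_id, current_pgs, limit, max_pg_per_osd, hard_ratio,
        )
        return False
    return True


# Usage with illustrative numbers: max 200 per OSD, hard ratio 2.0.
allowed = may_create_pg(3, 399, 200, 2.0)   # below the cap
refused = may_create_pg(3, 400, 200, 2.0)   # at the cap, logs a warning
```

With a message like this in the default OSD log, an operator seeing pgs stuck in "activating" could find the refusing OSD and the limit it hit without enabling debug logging.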

We saw lots of "stuck" in all of the management command outputs but not the underlying reason.

I would also ask whether this situation should raise an ERROR rather than a WARNING, since the cluster becomes "partially unavailable".


Related issues

Duplicates RADOS - Bug #22440: New pgs per osd hard limit can cause peering issues on existing clusters Resolved 12/14/2017

History

#1 Updated by Greg Farnum almost 5 years ago

  • Status changed from New to Duplicate

You're right that it's bad! This will be fixed in the next luminous release after a belated backport finally happened. :)

#2 Updated by Greg Farnum almost 5 years ago

  • Duplicates Bug #22440: New pgs per osd hard limit can cause peering issues on existing clusters added
