Feature #22973

closed

log lines when hitting "pg overdose protection"

Added by Dan Stoner about 6 years ago. Updated about 6 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Reviewed:
Affected Versions:
Component(RADOS): OSD
Pull request ID:

Description

After upgrading to Luminous we ran into a situation where 10% of our PGs remained unavailable, stuck in the "activating" state.

https://ceph.com/community/new-luminous-pg-overdose-protection/

That blog post says:

"If any individual OSD is ever asked to create more PGs than it should it will simply refuse and ignore the request."

The only non-debug direct evidence was this WARNING in ceph status:

'too many PGs per OSD (221 > max 200)'

(We are aware that we need to fix this situation in our cluster)

Many PGs were stuck in the "activating" state, which is not documented in the PG state table:

http://docs.ceph.com/docs/master/rados/operations/pg-states/

Feature idea: the OSD should write a message at the standard log level when it refuses to create a PG, i.e. when it hits the osd_max_pg_per_osd_hard_ratio limit.
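
To illustrate the two thresholds involved, here is a minimal Python sketch of the arithmetic, not Ceph's actual implementation. The soft limit of 200 comes from the warning quoted above; the hard ratio of 2.0 is an assumed value for illustration only, and the function name `classify_pg_count` and the strict `>` comparison against the hard limit are hypothetical. Check the real values in your own cluster's configuration.

```python
def classify_pg_count(pgs_on_osd, soft_max, hard_ratio):
    """Classify a PG count against a soft limit and a derived hard limit.

    soft_max   -- PG-per-OSD limit that triggers the health warning
    hard_ratio -- multiplier applied to soft_max to get the hard limit
                  (assumed to behave like osd_max_pg_per_osd_hard_ratio)
    """
    hard_max = soft_max * hard_ratio
    if pgs_on_osd > hard_max:
        # Hypothetical: the point where an OSD would refuse new PGs.
        return "refuse"
    if pgs_on_osd > soft_max:
        # Matches the 'too many PGs per OSD (221 > max 200)' warning.
        return "warn"
    return "ok"


# 221 and 200 are from the report; 2.0 is an assumed ratio.
print(classify_pg_count(221, 200, 2.0))
```

The count reported in the warning may not reflect the load on any individual OSD, so a single OSD can be past the hard limit while the cluster-wide message only shows the soft limit being exceeded. That gap is why a log line at the moment of refusal would help.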

We saw lots of "stuck" in all of the management command outputs but not the underlying reason.

I would also ask whether this situation should issue an ERROR rather than a WARNING, since the cluster becomes "partially unavailable".


Related issues 1 (0 open, 1 closed)

Is duplicate of RADOS - Bug #22440: New pgs per osd hard limit can cause peering issues on existing clusters (Resolved, Kefu Chai, 12/14/2017)

Actions #1

Updated by Greg Farnum about 6 years ago

  • Status changed from New to Duplicate

You're right that it's bad! This will be fixed in the next luminous release after a belated backport finally happened. :)

Actions #2

Updated by Greg Farnum about 6 years ago

  • Is duplicate of Bug #22440: New pgs per osd hard limit can cause peering issues on existing clusters added