Bug #22440

New pgs per osd hard limit can cause peering issues on existing clusters

Added by Nick Fisk almost 5 years ago. Updated almost 5 years ago.



Community (user)
3 - minor


While upgrading the OSDs in my cluster from Filestore to Bluestore, the CRUSH layout changed. This left a number of PGs stuck in an activating+remapped state, and they were blocking IO. It wasn't initially obvious what was preventing the PGs from completing the peering process.

Finally I discovered in one of the OSD logs the following:
osd.68 106454 maybe_wait_for_max_pg withhold creation of pg 0.1cf: 403 >= 400
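
Since this message is easy to miss, here is a minimal Python sketch for pulling it out of an OSD log. The pattern is modelled only on the single line quoted above, so other Ceph versions may word it differently. Reading the second number as the OSD map epoch is an assumption, and `parse_withhold` and `Withheld` are hypothetical names, not part of Ceph.

```python
import re
from typing import NamedTuple, Optional

# Pattern modelled on the one log line quoted in this report (assumption:
# other Ceph releases may format the message differently).
WITHHOLD_RE = re.compile(
    r"osd\.(?P<osd>\d+) (?P<epoch>\d+) maybe_wait_for_max_pg "
    r"withhold creation of pg (?P<pgid>\S+): (?P<count>\d+) >= (?P<limit>\d+)"
)


class Withheld(NamedTuple):
    osd: int       # OSD id that refused the PG
    epoch: int     # assumed to be the OSD map epoch from the log prefix
    pgid: str      # PG whose creation was withheld
    pg_count: int  # PG count the OSD compared against the limit
    limit: int     # hard limit in effect on that OSD


def parse_withhold(line: str) -> Optional[Withheld]:
    """Return the parsed fields, or None if the line is not a withhold message."""
    m = WITHHOLD_RE.search(line)
    if m is None:
        return None
    return Withheld(
        int(m["osd"]), int(m["epoch"]), m["pgid"], int(m["count"]), int(m["limit"])
    )
```

Running each line of an OSD log through `parse_withhold` and keeping the non-None results would list the PGs that are being held back and the OSDs holding them.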

Although the average number of PGs per OSD across the whole cluster was only just over 200, this new node had much larger disks and so took a much greater share of the PGs. With other OSDs being taken out of service to be upgraded to Bluestore, this pushed the count on this OSD over 400 PGs and caused a huge number of blocked requests.
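
To illustrate the effect, here is a rough sketch that estimates per-OSD PG counts under the simplifying assumption that PG replicas are spread in proportion to CRUSH weight. Real CRUSH placement is pseudo-random and depends on the rules and hierarchy, so this is only an approximation. `expected_pgs` is a hypothetical helper and the weights are made-up numbers, not my cluster's.

```python
from typing import Dict


def expected_pgs(weights: Dict[int, float], total_pg_replicas: int) -> Dict[int, float]:
    """Estimate PG replicas per OSD, assuming placement proportional to weight.

    weights: OSD id -> CRUSH weight for the OSDs currently "in" (hypothetical input).
    total_pg_replicas: sum over pools of pg_num * replica count.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("need at least one OSD with a positive weight")
    return {osd: total_pg_replicas * w / total_weight for osd, w in weights.items()}


# Made-up example: two small OSDs and one with twice the weight.
before = expected_pgs({0: 1.0, 1: 1.0, 2: 2.0}, 400)
# Take osd.1 out (e.g. for conversion); its share lands on the remaining OSDs.
after = expected_pgs({0: 1.0, 2: 2.0}, 400)
print(before[2], after[2])
```

The heavier OSD starts well above the cluster average, and its count climbs further as soon as another OSD is taken out, which is the pattern described above.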

Should this new hard PG limit apply to existing PGs which are only being "created" on an OSD because they are moving after a CRUSH change? Or should it only apply when PGs are being created for the first time, i.e. when creating a new pool or increasing the size of an existing one?

I can see several scenarios where, if a user is near the limit and a host/rack failure occurs, a large number of PGs could be blocked from peering because of this new limit, reducing the availability of the cluster.

Please close if this is believed to be normal behavior.

Related issues

Duplicated by RADOS - Feature #22973: log lines when hitting "pg overdose protection" Duplicate 02/09/2018


#1 Updated by Greg Farnum almost 5 years ago

  • Project changed from Ceph to RADOS
  • Priority changed from Normal to Urgent

I'm inclined to think we just need to surface this better (perhaps as a new state?) rather than try and let it peer in certain cases. (The whole point of this is that allowing too many PGs to go active on a single OSD can make it unrecoverable later on.)

But we definitely need to make it really obvious to administrators when they run into that block!

#2 Updated by Nick Fisk almost 5 years ago

Sure, that makes sense.

If not a new state, how about something that would show up in pg query? I queried the PG within seconds of the problem occurring, but there wasn't anything obvious in there. Something in the "blocked by" section or similar would probably suffice. Maybe there should also be a "ceph -s" warning for the actual number of PGs per OSD, rather than just the current cluster-wide average, e.g.:

OSD.68 has 390 PGs, this is more than X

Where X is somewhere between 200 (warn) and 400 (hard limit).
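
A minimal sketch of the kind of per-OSD check I mean, in Python. The 200 and 400 defaults are just the warn and hard-limit figures mentioned in this thread. `pg_count_warnings` is a hypothetical helper, not an existing Ceph function, and the per-OSD counts are assumed to have been gathered elsewhere.

```python
from typing import Dict, List


def pg_count_warnings(
    pgs_per_osd: Dict[int, int], warn: int = 200, hard: int = 400
) -> List[str]:
    """Return one message per OSD whose PG count exceeds the warn threshold.

    pgs_per_osd: OSD id -> current PG count (assumed to be collected elsewhere).
    warn / hard: thresholds; defaults are the figures quoted in this thread.
    """
    msgs = []
    for osd, count in sorted(pgs_per_osd.items()):
        if count >= hard:
            # At or past the point where the OSD withholds new PGs.
            msgs.append(
                f"osd.{osd} has {count} PGs, at or above the hard limit of {hard}"
            )
        elif count > warn:
            msgs.append(f"osd.{osd} has {count} PGs, this is more than {warn}")
    return msgs


for msg in pg_count_warnings({68: 390, 1: 150, 2: 403}):
    print(msg)
```

The point is that the check runs per OSD rather than on the cluster-wide average, so one overloaded OSD shows up even when the average looks healthy.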

#3 Updated by Brad Hubbard almost 5 years ago

We could definitely add a health warning for when we hit that condition in maybe_wait_for_max_pg(). That should show up in ceph status and in health detail.

#4 Updated by Dan van der Ster almost 5 years ago

First, perhaps this will help to make these issues more visible:

Second, is there any possibility to use these limits in the CRUSH calculation, so that CRUSH doesn't send too many PGs in the first place?

#5 Updated by Kefu Chai almost 5 years ago

  • Assignee set to Kefu Chai

Will backport to luminous. It helps to make this status more visible to users.

#8 Updated by Kefu Chai almost 5 years ago

  • Status changed from New to Resolved
  • Affected Versions v12.2.2 added

@Nick, if you think this issue deserves a different fix, please feel free to reopen this ticket

#9 Updated by Greg Farnum almost 5 years ago

  • Duplicated by Feature #22973: log lines when hitting "pg overdose protection" added
