Bug #20502

closed

crush: Jewel upgrade misbehaving with custom roots/rulesets

Added by Xuehan Xu almost 7 years ago. Updated almost 7 years ago.

Status:
Won't Fix
Priority:
Normal
Assignee:
-
Category:
-
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Recently, we upgraded one of our clusters from Hammer to Jewel, after which we found that some of our pgs were stuck in the stale state.

After a few checks, we found that all of these pgs belong to pools that use a non-default ruleset. Furthermore, the "host" bucket name in Hammer is the machine's full hostname, while in Jewel it is just the first part of the hostname, and it seems that after the upgrade the "host" bucket names in the non-default rulesets are still the full hostnames, which contain no OSDs.
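The naming mismatch can be illustrated with a small sketch (the FQDN `ceph-node1.example.com` is hypothetical, not taken from the cluster in this report): Hammer named the host bucket after the full `hostname`, while Jewel uses only the first label, as `hostname -s` would print it.

```shell
# Hypothetical FQDN; not from the actual cluster in this report.
full="ceph-node1.example.com"

# Hammer-era bucket name: the full output of `hostname`.
echo "$full"

# Jewel-era bucket name: the first label only, as `hostname -s` prints it.
short="${full%%.*}"
echo "$short"
```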

After we moved the new "host" buckets into those rulesets, the pgs formerly stuck in stale became active+clean.
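A sketch of that repair, assuming a custom root named `custom-root` and a Jewel-named host bucket `ceph-node1` (both names hypothetical); these commands need a live cluster, and the same result can be achieved by decompiling and editing the crush map with `crushtool`:

```shell
# Move the Jewel-named host bucket under the custom root.
ceph osd crush move ceph-node1 root=custom-root

# Alternatively, edit the map by hand:
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# ... move the host items under the custom root in crushmap.txt ...
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new
```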

Actions #1

Updated by Xuehan Xu almost 7 years ago

Sorry, it's a non-default root, not a non-default ruleset.

Actions #2

Updated by Greg Farnum almost 7 years ago

  • Subject changed from Jewel upgrade not considering non-default ruleset to crush: Jewel upgrade misbehaving with custom roots/rulesets

So it sounds like your OSDs updated their host bucket names, and the non-default root/ruleset is referring to buckets which no longer exist.

Can you provide a dump of the current crush map, and a description of what you expect it to look like? How did you set up this separate ruleset?

Actions #3

Updated by Xuehan Xu almost 7 years ago

Yes, it is just as you guessed. In our Hammer version's ceph-crush-location, the host bucket name is `hostname`, while for the Jewel version it is `hostname -s`.

We set up our rulesets as follows:

Since we don't know how large a single ruleset can grow, we separated the whole cluster into a set of small rulesets; some OSDs belong to one ruleset while others belong to other rulesets. There is almost no intersection between rulesets, and different pools run on different rulesets. So, when we upgraded the cluster, the non-default roots/rulesets still referred to the old host buckets, which no longer exist.
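That layout might look something like this in the decompiled crush map (all names, ids, and weights are hypothetical). The rule's `step take` pins each pool to its own root, which is why a stale full-hostname bucket under that root leaves the pool's pgs with no OSDs to map to:

```
root custom-root-a {
        id -10
        alg straw
        hash 0  # rjenkins1
        item ceph-node1.example.com weight 1.000   # Hammer-era full-hostname bucket
}

rule custom-rule-a {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take custom-root-a
        step chooseleaf firstn 0 type host
        step emit
}
```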

I think this is our fault...

Actions #4

Updated by Sage Weil almost 7 years ago

  • Status changed from New to Won't Fix

Sorry this bit you! I think the new `hostname -s` behavior is what we want.
