Bug #20502

closed

crush: Jewel upgrade misbehaving with custom roots/rulesets

Added by Xuehan Xu almost 7 years ago. Updated almost 7 years ago.

Status:
Won't Fix
Priority:
Normal
Assignee:
-
Category:
-
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Recently, we upgraded one of our clusters from Hammer to Jewel, after which we found that some of our pgs were stuck in the stale state.

After a few checks, we found that all of these pgs belong to pools that use a non-default ruleset. Furthermore, the "host" bucket name in Hammer is the machine's full hostname, while in Jewel it is just the first part of the hostname, and it seems that after the upgrade the "host" bucket names in the non-default rulesets are still the full hostnames, which contain no OSDs.
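The naming mismatch can be illustrated with a small sketch (the FQDN `ceph-node1.example.com` is hypothetical, not taken from the cluster in this report): Hammer named the host bucket after the full `hostname`, while Jewel uses only the first label, as `hostname -s` would print it.

```shell
# Hypothetical FQDN; not from the actual cluster in this report.
full="ceph-node1.example.com"

# Hammer-era bucket name: the full output of `hostname`.
echo "$full"

# Jewel-era bucket name: the first label only, as `hostname -s` prints it.
short="${full%%.*}"
echo "$short"
```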

After we moved the new "host" buckets into those rulesets, the pgs formerly stuck in stale became active+clean.
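A sketch of that repair, assuming a custom root named `custom-root` and a Jewel-named host bucket `ceph-node1` (both names hypothetical); these commands need a live cluster, and the same result can be achieved by decompiling and editing the crush map with `crushtool`:

```shell
# Move the Jewel-named host bucket under the custom root.
ceph osd crush move ceph-node1 root=custom-root

# Alternatively, edit the map by hand:
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# ... move the host items under the custom root in crushmap.txt ...
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new
```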

Actions #1

Updated by Xuehan Xu almost 7 years ago

Sorry, it's a non-default root, not a non-default ruleset.

Actions #2

Updated by Greg Farnum almost 7 years ago

  • Subject changed from Jewel upgrade not considering non-default ruleset to crush: Jewel upgrade misbehaving with custom roots/rulesets

So it sounds like your OSDs updated their host bucket names, and the non-default root/ruleset is referring to buckets which no longer exist.

Can you provide a dump of the current crush map, and a description of what you expect it to look like? How did you set up this separate ruleset?

Actions #3

Updated by Xuehan Xu almost 7 years ago

Yes, it is just as you guessed. In our Hammer version's ceph-crush-location, the host bucket name is `hostname`, while for the Jewel version it is `hostname -s`.

We set up our rulesets as follows:

Since we don't know how large a single ruleset can grow, we separated the whole cluster into a set of small rulesets; some OSDs belong to one ruleset while others belong to other rulesets. There is almost no intersection between rulesets, and different pools run on different rulesets. So, when we upgraded the cluster, the non-default roots/rulesets still referred to the old host buckets, which no longer exist.
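That layout might look something like this in the decompiled crush map (all names, ids, and weights are hypothetical). The rule's `step take` pins each pool to its own root, which is why a stale full-hostname bucket under that root leaves the pool's pgs with no OSDs to map to:

```
root custom-root-a {
        id -10
        alg straw
        hash 0  # rjenkins1
        item ceph-node1.example.com weight 1.000   # Hammer-era full-hostname bucket
}

rule custom-rule-a {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take custom-root-a
        step chooseleaf firstn 0 type host
        step emit
}
```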

I think this is our fault...

Actions #4

Updated by Sage Weil almost 7 years ago

  • Status changed from New to Won't Fix

Sorry this bit you! I think the new `hostname -s` behavior is what we want.
